I’ve written some more CUDA demonstration code: the Tiny Encryption Algorithm (TEA) implemented in CUDA.
The code demonstrates 100% occupancy, 100% coalesced 128-bit memory transactions and the use of page-locked memory. It ran at around 380 MB/s on a GTX 260. Compare that to 40 MB/s on a dual-core 2.5 GHz Core 2 Duo (without using SSE).
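To make the 128-bit transactions and the page-locked staging concrete, here is a minimal sketch, not the code from the download. It assumes the standard 32-round TEA and lets each thread handle one `uint4` (two 64-bit TEA blocks), so neighbouring threads touch neighbouring 16-byte words. The names `tea_encrypt_block`, `tea_decrypt_block`, `tea_encrypt_kernel` and `tea_encrypt_on_gpu`, and the block size of 256, are made up for illustration.

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Standard TEA: 32 rounds on one 64-bit block (v0, v1) with a 128-bit key.
__host__ __device__ inline void tea_encrypt_block(unsigned int &v0, unsigned int &v1, const uint4 k)
{
    const unsigned int delta = 0x9E3779B9u;
    unsigned int sum = 0;
    for (int i = 0; i < 32; ++i) {
        sum += delta;
        v0 += ((v1 << 4) + k.x) ^ (v1 + sum) ^ ((v1 >> 5) + k.y);
        v1 += ((v0 << 4) + k.z) ^ (v0 + sum) ^ ((v0 >> 5) + k.w);
    }
}

// Inverse of tea_encrypt_block; 0xC6EF3720 is delta * 32 modulo 2^32.
__host__ __device__ inline void tea_decrypt_block(unsigned int &v0, unsigned int &v1, const uint4 k)
{
    const unsigned int delta = 0x9E3779B9u;
    unsigned int sum = 0xC6EF3720u;
    for (int i = 0; i < 32; ++i) {
        v1 -= ((v0 << 4) + k.z) ^ (v0 + sum) ^ ((v0 >> 5) + k.w);
        v0 -= ((v1 << 4) + k.x) ^ (v1 + sum) ^ ((v1 >> 5) + k.y);
        sum -= delta;
    }
}

// Each thread reads and writes exactly one 16-byte uint4 (two TEA blocks).
__global__ void tea_encrypt_kernel(uint4 *data, size_t n, uint4 key)
{
    const size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i >= n) return;
    uint4 v = data[i];                 // one 128-bit load
    tea_encrypt_block(v.x, v.y, key);
    tea_encrypt_block(v.z, v.w, key);
    data[i] = v;                       // one 128-bit store
}

// Hypothetical host wrapper: copies src into a page-locked buffer,
// encrypts n uint4 words on the GPU and copies the result into dst.
cudaError_t tea_encrypt_on_gpu(const uint4 *src, uint4 *dst, size_t n, uint4 key)
{
    if (n == 0) return cudaSuccess;
    const size_t bytes = n * sizeof(uint4);
    uint4 *pinned = nullptr, *dev = nullptr;

    cudaError_t err = cudaMallocHost((void **)&pinned, bytes);  // page-locked host memory
    if (err == cudaSuccess) err = cudaMalloc((void **)&dev, bytes);
    if (err == cudaSuccess) {
        std::memcpy(pinned, src, bytes);
        err = cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);
    }
    if (err == cudaSuccess) {
        const unsigned threads = 256;  // assumed block size, not tuned
        const unsigned blocks = (unsigned)((n + threads - 1) / threads);
        tea_encrypt_kernel<<<blocks, threads>>>(dev, n, key);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(pinned, dev, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess) std::memcpy(dst, pinned, bytes);

    if (dev) cudaFree(dev);
    if (pinned) cudaFreeHost(pinned);
    return err;
}
```

The extra `memcpy` into the pinned buffer is only there to keep the wrapper self-contained; in real use you would read the input straight into page-locked memory. This sketch makes no claim about register count or occupancy.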
Beware of some pitfalls when playing with the execution parameters. Especially beware of those implicit memory/thread-block alignment requirements from hell!
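One way to catch this kind of mistake early is a host-side sanity check before launching. The sketch below is an assumption about what such a check could look like, not something taken from the download: it supposes each thread handles one 16-byte word and that the kernel has no bounds check, so the buffer must be 16-byte aligned and its size a whole number of thread blocks. `launch_config_ok` and `padded_bytes` are hypothetical helper names.

```cuda
#include <cstddef>
#include <cstdint>

// Assumption: one 16-byte (128-bit) word per thread.
static const size_t kBytesPerThread = 16;

// True if buf is 16-byte aligned and bytes is an exact multiple of
// the data covered by one thread block.
inline bool launch_config_ok(const void *buf, size_t bytes, unsigned threadsPerBlock)
{
    if (threadsPerBlock == 0) return false;
    if (reinterpret_cast<uintptr_t>(buf) % kBytesPerThread != 0) return false;
    return bytes % (kBytesPerThread * threadsPerBlock) == 0;
}

// Rounds bytes up to the next whole thread block's worth of data.
// Precondition: threadsPerBlock > 0.
inline size_t padded_bytes(size_t bytes, unsigned threadsPerBlock)
{
    const size_t chunk = kBytesPerThread * (size_t)threadsPerBlock;
    return (bytes + chunk - 1) / chunk * chunk;
}
```

With 256 threads per block, for example, the buffer would have to be a multiple of 4096 bytes; `padded_bytes` tells you how much to allocate, and the padding has to be dealt with when the data is decrypted again.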
Get it here and compile with 'nvcc -Xptxas "-v" -maxrregcount=10 tea_cuda.cu'