Optimizing Matrix Multiplication with CUDA
The slides used in the presentation can be found here.
Most of the material used in this presentation was from the nVidia CUDA Programming Guide and the nVidia CUDA SDK example software.
The code for measuring the runtimes can be found here.
The Makefile for the GPU part is still in CUDA SDK format, requiring that the directory be placed in SDK_ROOT/C/src/ and that the required libraries be built (by running make in SDK_ROOT/C/). The executable will be saved in SDK_ROOT/C/bin/linux/release/.
The CUDA code was taken from the programming guide and the SDK example programs matrixMul and simpleCUBLAS.
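For context, the core optimization demonstrated by matrixMul is a shared-memory tiled kernel. The sketch below is my own minimal version of that idea, not the SDK code itself: the kernel name, the TILE size of 16, and the assumption that the matrix dimension n is a multiple of TILE are simplifications for illustration.

// Minimal sketch of a shared-memory tiled matrix multiply (C = A * B, row-major).
// Assumes n is a multiple of TILE; no bounds checks for brevity.
#include <cuda_runtime.h>

#define TILE 16

__global__ void matMulTiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    // Walk over the tiles of A and B needed for one element of C.
    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of the current A and B tiles.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Multiply the two tiles out of fast shared memory.
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = sum;
}

// Launch with one thread per output element:
//   dim3 block(TILE, TILE);
//   dim3 grid(n / TILE, n / TILE);
//   matMulTiled<<<grid, block>>>(dA, dB, dC, n);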
About CUBLAS performance
It seems that the CUBLAS sample code I used wrote to device memory using the vector-specific calls cublasSetVector() and cublasGetVector(). I also tried running it with cublasSetMatrix() and cublasGetMatrix(), but still got the interesting faster-if-divisible-by-16 phenomenon. As a clarification, the CUBLAS library requires one to first allocate device memory with cublasAlloc(), at which point CUBLAS knows nothing about the nature of the data that will be stored there.
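To make the call sequence concrete, here is a minimal sketch using the legacy CUBLAS API (the one the simpleCUBLAS example uses). The function name gemm_with_cublas, the square n-by-n sizes, and the omitted error handling are my simplifications; hA, hB, hC are assumed to be host arrays of n*n floats.

/* Allocate device memory, transfer data, run SGEMM, read the result back. */
#include <cublas.h>

void gemm_with_cublas(const float *hA, const float *hB, float *hC, int n)
{
    float *dA, *dB, *dC;
    cublasInit();

    /* cublasAlloc() only sees an element count and element size; it knows
     * nothing about whether the data will be used as a vector or a matrix. */
    cublasAlloc(n * n, sizeof(float), (void **)&dA);
    cublasAlloc(n * n, sizeof(float), (void **)&dB);
    cublasAlloc(n * n, sizeof(float), (void **)&dC);

    /* Matrix-shaped transfer (rows, cols, leading dimensions)... */
    cublasSetMatrix(n, n, sizeof(float), hA, n, dA, n);
    cublasSetMatrix(n, n, sizeof(float), hB, n, dB, n);
    /* ...versus the vector-shaped transfer used by the sample code:
     * cublasSetVector(n * n, sizeof(float), hA, 1, dA, 1); */

    /* C = 1.0 * A * B + 0.0 * C, column-major convention. */
    cublasSgemm('n', 'n', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);

    cublasGetMatrix(n, n, sizeof(float), dC, n, hC, n);

    cublasFree(dA);
    cublasFree(dB);
    cublasFree(dC);
    cublasShutdown();
}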
The data obtained from the above code can be found here.
GPU times are in CSV format, and CPU times are in CSV with a single space as the delimiter (sorry for the inconsistency; just lazy hacking). If you are wondering about the times.plot file, it is the file format for Plot, a nice little plotting app for OS X.