
Introduction to CUDA and OpenCL

The presentation is here. A few questions were left open during the introductory presentation.


Why is CUDA's warp size 32?
To the best of my knowledge, this is an implementation detail that is subject to change. It remains a bit of a mystery why, for example, the 8800 GTX uses 32 threads per warp when each streaming multiprocessor (SM) consists of only 8 stream processors (SPs, the execution units). My guess is that it is related to ALU operations having a latency of 4 cycles: 8 SPs times 4 cycles gives 32, so with 32 threads per warp the scheduler can keep the "pipeline" better utilized.
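
Since the warp size is an implementation detail, host code can read it at run time instead of hard-coding 32. Below is a minimal sketch using `cudaGetDeviceProperties` from the CUDA runtime; the helper name `query_warp_size` and the use of device 0 are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the warp size of the given device,
// or -1 if the device properties cannot be read.
int query_warp_size(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    return prop.warpSize;
}

int main()
{
    int warp = query_warp_size(0);  // assumes device 0 is the GPU of interest
    if (warp < 0) {
        std::fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }
    std::printf("warp size: %d\n", warp);
    return 0;
}
```

Inside a kernel the same value is available through the built-in variable `warpSize`.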

How do the threads and blocks actually work?
The CUDA kernel invocation syntax is

dim3 dimBlock(X, Y, Z);
dim3 dimGrid(X, Y, ignored);
kernel<<<dimGrid, dimBlock>>>(arguments);

where dimGrid and dimBlock are defined with CUDA's built-in dim3 type. dimBlock defines how the threads of a block are organized in a 1D, 2D or 3D layout; inside the kernel, each thread reads its position from the three-component built-in variable threadIdx.{x,y,z}. The maximum number of threads per block is 512, so X*Y*Z may not exceed that, and the 1D-3D dimBlock syntax is just a way to organize those threads so that you can avoid some divisions when indexing multidimensional data.
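
To illustrate the point about avoided divisions, here is a small sketch that fills one tile twice: once with a 2D thread layout and once with a 1D layout of the same size. The tile dimensions `W` and `H`, the kernel names and the helper `layouts_agree` are all made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define W 16  // tile width, an assumed value for illustration
#define H 8   // tile height, an assumed value; W*H = 128 threads per block

// 2D layout: row and column come straight from threadIdx.
__global__ void fill2d(int *tile)
{
    int col = threadIdx.x;
    int row = threadIdx.y;
    tile[row * W + col] = 100 * row + col;
}

// 1D layout: the same coordinates need a division and a modulo.
__global__ void fill1d(int *tile)
{
    int i   = threadIdx.x;
    int row = i / W;
    int col = i % W;
    tile[i] = 100 * row + col;
}

// Hypothetical host helper: true if both layouts write identical data.
bool layouts_agree()
{
    int h2[W * H], h1[W * H];
    int *d2 = 0, *d1 = 0;
    bool ok = cudaMalloc((void **)&d2, sizeof(h2)) == cudaSuccess
           && cudaMalloc((void **)&d1, sizeof(h1)) == cudaSuccess;
    if (ok) {
        dim3 dimBlock(W, H, 1);        // one block of W x H threads
        fill2d<<<1, dimBlock>>>(d2);
        fill1d<<<1, W * H>>>(d1);      // one block of W*H threads in a row
        ok = cudaMemcpy(h2, d2, sizeof(h2), cudaMemcpyDeviceToHost) == cudaSuccess
          && cudaMemcpy(h1, d1, sizeof(h1), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d2);
    cudaFree(d1);
    for (int i = 0; ok && i < W * H; ++i)
        ok = (h2[i] == h1[i]) && (h2[i] == 100 * (i / W) + i % W);
    return ok;
}

int main()
{
    std::printf("%s\n", layouts_agree() ? "layouts agree" : "mismatch or CUDA error");
    return 0;
}
```

Both kernels launch the same number of threads; the only difference is whether the hardware or the kernel code supplies the row and column.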

The dimGrid definition ignores the last parameter, so it can define a 1D or 2D grid of thread blocks. In the kernel, the block's position in the grid is available via blockIdx.{x,y}. I believe the 2D grid definition exists for the same reason as the 3D thread layout: to avoid some unnecessary index calculations when working with multidimensional data.
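
As a sketch of how grid and block indices combine, the following adds two row-major matrices using a 2D grid of 2D blocks. The kernel `mat_add`, the wrapper `run_mat_add`, the 16x16 block shape and the 100x37 test size are assumptions chosen for illustration; the built-in `blockDim` holds the block dimensions passed at launch.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Adds two width x height matrices stored row-major, one element per thread.
__global__ void mat_add(const float *a, const float *b, float *c,
                        int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)   // the grid may overshoot the matrix edge
        c[row * width + col] = a[row * width + col] + b[row * width + col];
}

// Hypothetical host wrapper: true if the GPU result matches the CPU sum.
bool run_mat_add(int width, int height)
{
    size_t n = (size_t)width * height, bytes = n * sizeof(float);
    std::vector<float> ha(n), hb(n), hc(n);
    for (size_t i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    float *da = 0, *db = 0, *dc = 0;
    bool ok = cudaMalloc((void **)&da, bytes) == cudaSuccess
           && cudaMalloc((void **)&db, bytes) == cudaSuccess
           && cudaMalloc((void **)&dc, bytes) == cudaSuccess
           && cudaMemcpy(da, ha.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(db, hb.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 dimBlock(16, 16, 1);      // 256 threads per block
        // Round up so that partial tiles at the edges are still covered.
        dim3 dimGrid((width + dimBlock.x - 1) / dimBlock.x,
                     (height + dimBlock.y - 1) / dimBlock.y, 1);
        mat_add<<<dimGrid, dimBlock>>>(da, db, dc, width, height);
        ok = cudaMemcpy(hc.data(), dc, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    for (size_t i = 0; ok && i < n; ++i)
        ok = (hc[i] == ha[i] + hb[i]);
    return ok;
}

int main()
{
    std::printf("%s\n", run_mat_add(100, 37) ? "result matches" : "mismatch or CUDA error");
    return 0;
}
```

The bounds check in the kernel matters because the matrix size is not a multiple of the block size here, so some threads in the edge blocks have nothing to do.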

Sample code

You can download the example vector addition for CUDA and OpenCL from here. Alternatively, on miranda you can copy and compile it by issuing

cp /usr/local/gpgpu/example.tar.gz .
tar zxvf example.tar.gz
cd example
make -C cuda
make -C opencl

Once these commands have finished, you can run the binaries cuda/vec and opencl/vec.
