
CUDA and OpenCL API comparison

The presentation slides of the talk: Cuda and OpenCL API comparison_presented.pdf
Kernel code handout: CUDA and OpenCL kernels handout.pdf

See the last presentation slide for references used.

Answers to open questions

Here are answers to some open questions and discussion items brought up during the presentation.

- Does IBM's OpenCL have support for the Cell processor?
It seems so, although strangely it is not mentioned on the kit's overview page. You can find hints of Cell support on the FAQ page; see items 8 and 9.

- OpenCL has an experimental C++ interface under development at Khronos
True, and not mentioned in my presentation. Here's a link to more information about it: http://www.khronos.org/registry/cl/

- Does OpenCL require that work-items in a work-group execute the same instruction at the same time for optimum performance?
No, quoting from the OpenCL specification, chapter 3.2 Execution Model:

Each work-item executes the same code but the specific execution pathway through the code and the data operated upon can vary per work-item.

Obviously, on Nvidia's CUDA-based implementation of OpenCL you'd want to avoid divergent work-item execution to achieve full performance, as the sketch below illustrates.
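
To make the divergence point concrete, here is a minimal OpenCL C sketch (my own illustration, not taken from the kernel handout; the kernel name and arguments are made up) in which the execution pathway varies per work-item:

// Each work-item follows its own pathway depending on the data it
// reads. This is perfectly valid OpenCL, but on Nvidia hardware
// work-items of the same warp taking different branches are
// serialized, which costs performance.
__kernel void clamp_negative(__global const float *in,
                             __global float *out)
{
    size_t i = get_global_id(0);
    if (in[i] < 0.0f)      // pathway A: some work-items take this branch
        out[i] = 0.0f;
    else                   // pathway B: others take this one
        out[i] = in[i];
}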

- Does OpenCL support load-balancing of work-items?
I am not completely sure I remember or understand the question correctly, but the quotes from the OpenCL specification below hint at the possibility of work-items of a work-group migrating from one processing element to another.

2. Glossary

Processing Element: A virtual scalar processor. A work-item may execute on one or more processing elements.

Work-item: One of a collection of parallel executions of a kernel invoked on a device by a command. A work-item is executed by one or more processing elements as part of a work-group executing on a compute unit.

It seems that there is no migration between compute units, though:
3.2 Execution Model

The work-items in a given work-group execute concurrently on the processing elements of a single compute unit.

Individual work-items migrating from one compute unit to another would break the work-group memory locality and synchronization requirements, which are identical to CUDA thread block requirements (see the sketch after the glossary quote below):
2. Glossary

Work-group: A collection of related work-items that execute on a single compute unit. The work-items in the group execute the same kernel and share local memory and work-group barriers.
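
As an illustration of why a work-group has to stay on one compute unit, here is a minimal OpenCL C sketch (again my own hypothetical example, not from the handout) using the local memory and work-group barriers the glossary refers to. Both the __local buffer and the barrier are tied to the compute unit the group runs on:

// Work-items of a work-group share a __local scratch buffer and
// synchronize with work-group barriers. A work-item migrating to
// another compute unit would lose access to both, which is why the
// whole group must execute on a single compute unit.
// Assumes the work-group size is a power of two.
__kernel void partial_sums(__global const float *in,
                           __global float *out,
                           __local float *scratch)
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    size_t lsz = get_local_size(0);

    scratch[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);   // all work-items of this group reach here

    // tree reduction within the work-group's local memory
    for (size_t s = lsz / 2; s > 0; s /= 2) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}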

- You work at Digia. Does Digia use GPGPU computing for something?
As I said, I am not aware of any ongoing projects in the area. I work in the mobile computing area, with smartphones and their operating systems, and definitely not with any High Performance Computing applications. GPGPU can also be interesting in the mobile/embedded area; see for example this presentation, and especially take a look at the energy efficiency gains for this particular application.

- Why did Apple push OpenCL standard?
Apple has for a long time been migrating the OS X UI to be rendered fully on the GPU. For a comprehensive look into this, see ArsTechnica's OS X Tiger review starting at this page.

My guess is that Apple pushed an open standard in the area because they wanted to avoid vendor lock-in, backing a standard instead of just using CUDA. E.g. current Apple Macs sport GPUs from both Nvidia and AMD, and the higher-end models are currently actually equipped with AMD GPUs. See also this analysis (page 2) in ArsTechnica's OS X Snow Leopard review of Apple's motivations behind OpenCL.

If I forgot your question, please post it as a comment and I'll try to answer it.
