Tuesday, October 23, 2007

What you should know before diving into GPGPU study

Before choosing a GPGPU topic for your graduation, Master's or PhD thesis, you should be aware that:

1- NVidia has created a new technology called CUDA, so you no longer need a graphics API to do general-purpose programming on GPUs. Besides being outdated, confusing and unproductive for this purpose, graphics API code lacks several features available in CUDA, especially the ability to scatter (write to arbitrary memory addresses). Unfortunately, programs written for CUDA will only run on the GeForce 8 series (and later), the Quadro FX 5600 and 4600, and the Tesla C870, D870 and S870. Other frameworks, such as Brook and RapidMind, are worth checking as well.

2- You need to understand that a GPU, as you'll use it, is a set of SIMD (single instruction, multiple data) processors grouped into multiprocessors. It means the processors in a multiprocessor WILL ALWAYS EXECUTE THE SAME INSTRUCTION during a clock cycle, but on different data. So you should not think of them as being able to run independent threads the way a dual-core Intel CPU would, or even the Cell processor. Make sure your algorithm can be expressed this way.

3- Memory transfers can become bottlenecks for your application. Remember that, when programming GPU code, the data you access needs to be located in the GPU's global memory space. To get it there, you first need to copy the data from host memory to the GPU. As a rule of thumb, it takes a GeForce 8600M GT around 6ms to receive a 1024x1024 pixel image (around 4MB). And YES, this might be the single detail that prevents your application from running faster on GPUs.

4- There's some initialization time for the GPU (around 100ms). If you only need to run one iteration of your algorithm, and it is supposed to take less than 100ms, the CPU will probably be the better choice.

5- GPUs have very little shared memory. The CUDA architecture makes 16KB available per multiprocessor. This means that if you are running 256 threads on a single multiprocessor (which is what you should generally be doing), you have 64 bytes per thread, which is about 16 integers. If you can solve your problem with such a small amount of memory, your application will be extremely fast, because shared memory access is almost as fast as register access. Otherwise, you'll need to use global memory, which holds around 256MB depending on your GPU, and it takes around 300 cycles to retrieve a value from it. If you cannot hide this latency by processing already retrieved data while other requests are in flight, your application will stall at this point. A good strategy is to let some threads process data while others wait for their memory requests. An application dominated by memory requests will suffer dramatically because of this.

6- Only single precision floating point registers are available. That means a 24-bit mantissa, or around 7 significant decimal digits. Of course, future GPU generations will include double precision support.

7- Linear algebra problems will most probably run very fast on GPUs.

8- Problems that work on large arrays, such as N-body simulations, are also a good fit for GPUs.
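
The figures in points 3, 4 and 5 can be turned into quick back-of-envelope checks. The sketch below is plain Python, not GPU code. It uses only the numbers quoted above, plus a deliberately crude cost model (one-time initialization, then one transfer and one kernel run per iteration) that is an assumption made for illustration. The function names and the per-iteration timings in the usage lines are hypothetical.

```python
# Figures quoted in the points above (2007-era hardware).
IMAGE_BYTES = 1024 * 1024 * 4      # 1024x1024 image, assuming 4 bytes per pixel (~4MB)
TRANSFER_SECONDS = 0.006           # ~6ms to copy that image to a GeForce 8600M GT
INIT_SECONDS = 0.100               # ~100ms GPU initialization time
SHARED_BYTES_PER_MP = 16 * 1024    # 16KB of shared memory per multiprocessor
THREADS_PER_MP = 256               # threads running on one multiprocessor
INT_BYTES = 4                      # size of a 32-bit integer


def transfer_bandwidth(n_bytes, seconds):
    """Effective host-to-GPU bandwidth in bytes per second."""
    return n_bytes / seconds


def shared_bytes_per_thread(shared_bytes, threads):
    """Shared memory available to each thread if it is split evenly."""
    return shared_bytes // threads


def gpu_is_faster(cpu_seconds_per_iter, gpu_seconds_per_iter, iterations,
                  transfer_seconds=TRANSFER_SECONDS, init_seconds=INIT_SECONDS):
    """Hypothetical cost model: pay initialization once, then one transfer
    plus one kernel run per iteration. Real applications may overlap or
    avoid transfers, so treat the answer as a rough hint only."""
    cpu_total = cpu_seconds_per_iter * iterations
    gpu_total = init_seconds + iterations * (transfer_seconds + gpu_seconds_per_iter)
    return gpu_total < cpu_total


if __name__ == "__main__":
    per_thread = shared_bytes_per_thread(SHARED_BYTES_PER_MP, THREADS_PER_MP)
    print("shared bytes per thread:", per_thread)
    print("integers per thread:", per_thread // INT_BYTES)
    print("effective bandwidth (MB/s):",
          round(transfer_bandwidth(IMAGE_BYTES, TRANSFER_SECONDS) / 1e6))
    # Made-up timings: 50ms per iteration on the CPU, 5ms on the GPU.
    print("GPU wins for 1 iteration:", gpu_is_faster(0.050, 0.005, 1))
    print("GPU wins for 10 iterations:", gpu_is_faster(0.050, 0.005, 10))
```

With the made-up timings in the last two lines, a single iteration is lost to initialization and transfer overhead, while ten iterations are enough for the GPU to come out ahead. That is the situation point 4 warns about.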

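To see what the single precision limit in point 6 means in practice, here is a small sketch in plain Python (again, not GPU code). It rounds values to IEEE 754 single precision with the standard `struct` module. The helper names `to_float32` and `sum_float32` are made up for this example, and it only imitates the rounding a GPU would do, not its exact arithmetic.

```python
import struct


def to_float32(x):
    """Round a Python float (double precision) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_float32(values):
    """Accumulate values one by one, rounding to single precision after every add."""
    total = 0.0
    for v in values:
        total = to_float32(total + to_float32(v))
    return total


if __name__ == "__main__":
    # 2**24 = 16777216 is where a 24-bit mantissa stops representing every integer.
    print(to_float32(16777216.0))
    print(to_float32(16777217.0))   # not representable, gets rounded
    # Adding 1.0 to an accumulator that is already at 2**24 has no effect.
    print(sum_float32([16777216.0] + [1.0] * 10))
```

If your algorithm accumulates many small values into a large one, this kind of silent loss is worth checking on the CPU before you commit to a single precision GPU implementation.
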
Some benchmarks on CPU, Cell and GPUs are included in this paper. Note, though, that the article was not accepted for the conference, mainly because the application does not really require much processing from the Cell or the GPUs, as it spends most of its time on memory transfers. For more info, there is a great article about the subject and a great site, www.gpgpu.org.

*Source: NVidia CUDA Programming Guide and personal experience with GPGPU
