Step into Parallel Computing: Part 2: CUDA programming.
The programming paradigm for performing computations on GPGPUs.
Let’s start with the first question: why did I write this now?
Let me answer that. Take a look at this piece of recent news: NVIDIA recently joined the $1 trillion valuation club.
https://www.reuters.com/technology/nvidia-sets-eye-1-trillion-market-value-2023-05-30/
This implies that NVIDIA has undoubtedly become one of the winners of the AI boom.
So why do we need to know CUDA?
CUDA is the programming model that leverages the parallel compute engine in NVIDIA GPUs. If you have been through my previous article on parallel programming, you will find it easier to follow these parallel computing paradigms.
Prerequisites for this article: I would highly recommend reading my first article here and being familiar with parallel computing with OpenMP, which will enable you to grasp the concepts of parallel computing covered below.
What is CUDA?
CUDA is the general-purpose parallel computing platform and programming model that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems in a more efficient way than on a CPU.
I hope it is now clear that this article is a follow-up to my first article on parallel computing, Step into Parallel Computing, which explained the importance of parallel computing.
This article focuses on CUDA programming, as promised at the end of the first article. Use it as a starting point for learning about CUDA programming. Let’s dive into this paradigm of parallel computing.
Importance of CUDA programming.
At the time of writing this article, the world is talking about the emergence of AI. Parallel computing plays a very important role in the age of AI: for AI/ML to thrive, it needs compute, and as AI models get larger and larger, the amount of computation that has to be done grows with them. As a result, we can no longer rely solely on a single processor. Therefore, in my opinion, the development of AI is closely tied to processing capability, and we definitely need parallel computing paradigms to perform AI computations quickly and efficiently.
A basic introduction to GPGPU Programming with CUDA:
CUDA allows developers to write parallel code that can be executed on NVIDIA GPUs. By harnessing the massive parallelism offered by GPUs, CUDA enables significant speedups for various computationally intensive tasks. It provides a flexible programming model and a rich set of libraries and tools, making it easier for developers to leverage the power of parallel computing using GPUs.
GPGPU programming with CUDA is particularly beneficial for applications such as scientific simulations, data analytics, machine learning, and deep learning, where parallelism plays a crucial role in accelerating computations.
Comparison of CUDA parallel computations vs Sequential approach
Just like in my first article, which explained OpenMP, one paradigm for parallel computing, let’s get to know CUDA with an example.
In this section, I will demonstrate how CUDA programming performs parallel execution and compare it with parallel computing using OpenMP. To make a meaningful comparison, we need a sizeable calculation, so I have chosen the following example, which calculates the sum of the elements of an array.
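Here is a minimal sketch of the comparison, assuming an atomicAdd-based CUDA kernel and an OpenMP reduction loop on the CPU; the file name parallel_sum.cu, the 256-thread block size, and the command-line size argument are illustrative choices of this sketch rather than fixed requirements.

```c
// parallel_sum.cu -- minimal sketch: sum an array with OpenMP on the CPU and CUDA on the GPU.
// Build (see the compilation steps later in the article): nvcc -o parallel_sum parallel_sum.cu -Xcompiler -fopenmp
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256

// CUDA kernel: each thread adds one element into a single accumulator using atomicAdd.
__global__ void sumKernel(const float *data, float *result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(result, data[i]);
    }
}

int main(int argc, char **argv) {
    int n = (argc > 1) ? atoi(argv[1]) : (1 << 20);   // array size from the command line
    float *h_data = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) h_data[i] = 1.0f;     // simple data so the expected sum is just n

    /* ---- OpenMP parallel sum on the CPU ---- */
    double t0 = omp_get_wtime();
    float ompSum = 0.0f;
    #pragma omp parallel for reduction(+:ompSum)
    for (int i = 0; i < n; i++) ompSum += h_data[i];
    double ompTime = omp_get_wtime() - t0;

    /* ---- CUDA parallel sum on the GPU ---- */
    float *d_data, *d_result, gpuSum = 0.0f;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, sizeof(float));

    double t1 = omp_get_wtime();
    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    sumKernel<<<blocks, THREADS_PER_BLOCK>>>(d_data, d_result, n);
    cudaDeviceSynchronize();
    double cudaTime = omp_get_wtime() - t1;           // kernel time only; host-device transfers are extra

    cudaMemcpy(&gpuSum, d_result, sizeof(float), cudaMemcpyDeviceToHost);

    printf("n = %d\n", n);
    printf("OpenMP sum = %.1f  (%f s)\n", ompSum, ompTime);
    printf("CUDA   sum = %.1f  (%f s)\n", gpuSum, cudaTime);

    cudaFree(d_data); cudaFree(d_result);
    free(h_data);
    return 0;
}
```

The atomicAdd approach is simply the easiest way to express the reduction; it serialises the final additions, and a tuned benchmark would normally use a shared-memory tree reduction like the one sketched further down.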
If you come from a C language background, the above code example involves some basic concepts of CUDA programming, which can be summarised as below. Furthermore, you can refer to the official NVIDIA docs to learn more.
- CUDA Threads and Blocks: In CUDA programming, you work with thousands of lightweight threads that can execute instructions concurrently. These threads are organized into blocks, and multiple blocks form a grid. Understanding how to launch and manage threads and blocks is essential for leveraging parallelism effectively.
- Kernel Functions: In CUDA, you define kernel functions that are executed by multiple threads in parallel. These functions are prefixed with the __global__ specifier and are invoked on the GPU. Kernel functions operate on chunks of data, with each thread processing a separate element or subset of the data.
- Memory Hierarchy: GPUs have different memory types with varying access speeds and scopes. Understanding the memory hierarchy, which includes global memory, shared memory, and local memory, is crucial for optimizing data access and minimizing memory transfers between the host (CPU) and the device (GPU).
- Thread Synchronization: When multiple threads are executing in parallel, synchronization is necessary to coordinate their actions. CUDA provides synchronization mechanisms such as barriers, atomic operations, and thread cooperation to ensure correct and orderly execution of parallel code (the short sketch after this list shows a barrier, __syncthreads(), in action).
- Memory Coalescing: Efficient memory access patterns can significantly impact performance. Coalesced memory access refers to accessing adjacent memory locations by adjacent threads, which can improve memory bandwidth. Optimizing memory access patterns can enhance performance in CUDA programs.
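To make the last few concepts concrete, here is a small, self-contained sketch of a block-level tree reduction. The kernel name blockSumKernel and the 256-thread block size are illustrative choices, not code from the benchmark above: adjacent threads read adjacent elements of global memory (coalescing), each block accumulates a partial sum in shared memory, and __syncthreads() barriers keep the threads of a block in step.

```c
// block_reduce.cu -- illustrative sketch of shared memory, coalescing, and synchronization.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256

// Each block reduces BLOCK_SIZE consecutive elements into one partial sum.
__global__ void blockSumKernel(const float *in, float *partialSums, int n) {
    __shared__ float buf[BLOCK_SIZE];           // fast on-chip shared memory, one slot per thread

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;      // adjacent threads read adjacent elements (coalesced)
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                            // wait until every thread has written its slot

    // Tree reduction in shared memory: halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();                        // barrier before the next step reads the updated values
    }

    if (tid == 0) partialSums[blockIdx.x] = buf[0];   // one partial result per block
}

int main(void) {
    const int n = 1 << 20;
    const int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;

    float *h_in = (float *)malloc(n * sizeof(float));
    float *h_partial = (float *)malloc(blocks * sizeof(float));
    for (int i = 0; i < n; i++) h_in[i] = 1.0f;

    float *d_in, *d_partial;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSumKernel<<<blocks, BLOCK_SIZE>>>(d_in, d_partial, n);
    cudaMemcpy(h_partial, d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    double total = 0.0;
    for (int b = 0; b < blocks; b++) total += h_partial[b];   // finish the reduction on the host
    printf("sum = %.1f (expected %d)\n", total, n);

    cudaFree(d_in); cudaFree(d_partial);
    free(h_in); free(h_partial);
    return 0;
}
```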
In order to perform and test the sequential vs parallel executions on your local machine, you will need the following prerequisites.
- NVIDIA GPU: The code utilizes CUDA programming, which requires an NVIDIA GPU with CUDA support. Ensure that you have a compatible NVIDIA GPU installed in your system.
- CUDA Toolkit: Install the CUDA Toolkit, which includes the necessary libraries, compilers, and tools for CUDA development. You can download the CUDA Toolkit from the NVIDIA website (https://developer.nvidia.com/cuda-downloads).
- C Compiler: You need a C compiler to compile the code. Most systems come with a default C compiler installed. For example, on Linux, you can use GCC (GNU Compiler Collection), while on Windows, you can use MinGW or Microsoft Visual Studio.
Once you have met these requirements, follow these steps to compile and run the code:
- Save the code in a file with a “.cu” extension, for example, “parallel_sum.cu”. CUDA source files use “.cu” rather than “.c” so that nvcc, the CUDA compiler driver, picks up the device code.
- Open a terminal or command prompt and navigate to the directory where the code file is saved.
- Compile the code using nvcc from the CUDA Toolkit, passing the OpenMP flag through to the host compiler. Run the following command:
nvcc -o parallel_sum parallel_sum.cu -Xcompiler -fopenmp
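Then run the resulting executable. For example, with the sketch above, which takes the array size as an optional command-line argument (a convention of that sketch, not a standard flag):
./parallel_sum 10000000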
Results and comparisons.
I executed the above code on a GPU-enabled workstation with different array sizes and recorded the execution time for both OpenMP and CUDA. I also wrote a small C program that performs the same calculation sequentially for the same array sizes. We can then calculate the speedup of OpenMP and CUDA relative to the sequential execution as follows.
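Here speedup has its usual meaning: Speedup = T_sequential / T_parallel, i.e., the sequential execution time divided by the parallel execution time, so a speedup of 4 means the parallel version finished four times faster than the sequential one.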
Based on these results, I generated the following graph of speedup vs array size, which compares the speedup achieved with OpenMP and with CUDA.
Conclusion
From the above results, we can see that for smaller input sizes, OpenMP parallelism performs better, because in those cases the cost of transferring data between the CPU and the GPU outweighs the computation time. CUDA also cannot work efficiently with very small sizes (fewer elements than the number of allocated threads). But when the input size grows very large, CUDA performs better than OpenMP parallelism. The sequential and OpenMP versions also have a limit on the number of elements they can handle, whereas the GPU version can work with a larger number of elements thanks to its large global memory space. So for larger computations, the CUDA GPU version performs better, while for comparatively smaller computations, OpenMP parallelism performs well.
That concludes part 2 of parallel computing, which introduced you to the basics of CUDA programming. You can refer to the official NVIDIA docs to continue learning further. I am happy to answer anything via the comments. If you are interested, keep following for more.
Happy Coding!