Wednesday, September 10, 2008

Cuda Hello World: part 2

In Part 1 we set up a Visual Studio project to run our first CUDA program. In this part we will look deeper into the template and explain what it does.

A CUDA program is structured as follows:
1) Define a Kernel
2) Copy system memory to GPU memory
3) Execute the Kernel
4) Copy results from GPU memory back to system memory
5) Print our results, and cleanup

We'll look at each part.

Define a Kernel

A kernel is a function that executes on the GPU. For our example we'll follow the sample template from the Visual Studio plugin and create a HelloCUDA function.



__global__ static void HelloCUDA(char* result, int num, clock_t* time)
{
    int i = 0;
    char p_HelloCUDA[] = "Hello CUDA!";
    clock_t start = clock();          // clock() is also available in device code
    for(i = 0; i < num; i++) {
        result[i] = p_HelloCUDA[i];
    }
    *time = clock() - start;          // elapsed GPU clock ticks
}

The __global__ declaration specifier indicates that the function is a kernel entry point. Our function takes three parameters:
an array of characters
the size of the array
a pointer to a clock_t value

The kernel doesn't do much: it copies the string "Hello CUDA!" into our character array. As you can probably infer from the assignment to time, we are planning on calling this function only once.

Copy system memory to GPU memory

Now that our kernel is defined, we need to prep our data structures so that we can execute it. For this example we'll need to allocate a block of GPU memory to hold the "Hello CUDA!" string, and a clock_t value for the elapsed time.

To allocate memory on the GPU we use cudaMalloc.

char *device_result = 0;
clock_t *time = 0;
cudaMalloc((void**) &device_result, sizeof(char) * 11);  // 11 chars, no terminator; the host buffer adds it
cudaMalloc((void**) &time, sizeof(clock_t));
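Note that this example never actually uploads any data: the kernel generates the string on the device, so allocation is all we need here. In a program that did need to push input to the GPU, the counterpart call is cudaMemcpy with the cudaMemcpyHostToDevice direction, along these lines (an illustrative sketch; input and device_input are made-up names, not part of the template):

```cuda
// Sketch: uploading host data to the GPU (hypothetical names).
char input[12] = "Hello CUDA!";
char *device_input = 0;
cudaMalloc((void**) &device_input, sizeof(input));
cudaMemcpy(device_input, input, sizeof(input), cudaMemcpyHostToDevice);
```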


The memory has now been set aside on the GPU and we're ready to execute the kernel.
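As an aside, cudaMalloc (like most CUDA runtime calls) returns a cudaError_t, which the template ignores. A minimal check might look like this (a sketch, not part of the original template):

```cuda
// Sketch: checking an allocation instead of ignoring the result.
cudaError_t err = cudaMalloc((void**) &device_result, sizeof(char) * 11);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;                         // bail out of main
}
```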

Execute the Kernel

The following code executes our kernel, passing in the arguments we just created. The <<<1, 1, 0>>> execution configuration launches one block containing one thread, with no extra shared memory.

HelloCUDA<<<1, 1, 0>>>(device_result, 11, time);


At this stage the GPU will execute our program and produce our results; we need to copy those results back to the system so we can use them.
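One detail worth knowing: kernel launches are asynchronous, so the call returns immediately while the GPU works in the background (the cudaMemcpy in the next step implicitly waits for it). If you want to catch launch errors or block until the kernel finishes explicitly, something like this works (a sketch; cudaThreadSynchronize was the call in 2008-era toolkits, newer ones use cudaDeviceSynchronize):

```cuda
// Optional sketch: check the launch and wait for completion.
HelloCUDA<<<1, 1, 0>>>(device_result, 11, time);
cudaError_t err = cudaGetLastError();    // did the launch itself fail?
if (err != cudaSuccess) {
    fprintf(stderr, "Launch failed: %s\n", cudaGetErrorString(err));
}
cudaThreadSynchronize();                 // block until the kernel is done
```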

Copy results from GPU memory back to system memory

To get our results off of the GPU we use cudaMemcpy to copy the data back into our own storage. To do that we define a char array and a clock_t, then cudaMemcpy the results back.


char host_result[12] = {0};
clock_t time_used = 0;
cudaMemcpy(host_result, device_result, sizeof(char) * 11, cudaMemcpyDeviceToHost);
cudaMemcpy(&time_used, time, sizeof(clock_t), cudaMemcpyDeviceToHost);

Print Results and Clean up

That's it: the GPU has run our program, created the string "Hello CUDA!", and copied it into the char array we hold a device pointer to, device_result. It has also filled in a clock_t value in GPU memory, which we have a pointer to called time.

We've copied the GPU memory into local variables called host_result and time_used, and can now use them like any C program would. So let's print them out.


printf("%s,%d\n", host_result, (int)time_used);


The last thing we need to do is free the memory on the device; for that we use cudaFree.


cudaFree(device_result);
cudaFree(time);
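Putting the snippets together, the whole program looks roughly like this. It is a sketch reconstructed from the pieces above, so the full listing linked below may differ in details (for instance, the "CUDA initialized." line in the output comes from a device-initialization check that isn't covered in this post):

```cuda
// Reconstructed sketch of the complete program (assembled from the
// snippets in this post; details may differ from the original listing).
#include <stdio.h>
#include <time.h>

__global__ static void HelloCUDA(char* result, int num, clock_t* time)
{
    char p_HelloCUDA[] = "Hello CUDA!";
    clock_t start = clock();             // device-side clock()
    for (int i = 0; i < num; i++) {
        result[i] = p_HelloCUDA[i];
    }
    *time = clock() - start;
}

int main()
{
    // Allocate GPU memory for the string and the timing value.
    char *device_result = 0;
    clock_t *time = 0;
    cudaMalloc((void**) &device_result, sizeof(char) * 11);
    cudaMalloc((void**) &time, sizeof(clock_t));

    // Execute the kernel: one block, one thread, no shared memory.
    HelloCUDA<<<1, 1, 0>>>(device_result, 11, time);

    // Copy the results back to system memory.
    char host_result[12] = {0};
    clock_t time_used = 0;
    cudaMemcpy(host_result, device_result, sizeof(char) * 11, cudaMemcpyDeviceToHost);
    cudaMemcpy(&time_used, time, sizeof(clock_t), cudaMemcpyDeviceToHost);

    // Print results and clean up.
    printf("%s,%d\n", host_result, (int)time_used);
    cudaFree(device_result);
    cudaFree(time);
    return 0;
}
```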


That's it! If you run the program you should get output similar to the following:

CUDA initialized.
Hello CUDA!,0
Press any key to continue . . .


Click here for the entire code listing
