Tuesday, April 21, 2009
GPU programming using CUDA: Study #1
Study material source
CUDA U [Education] Exercise
Methods: see the CUDA Reference Manual for details. A minimal host-side sketch that puts these calls together follows the list.
1. Allocate host memory and device memory
cudaMalloc(void **devPtr, size_t size): allocate memory on the GPU
cudaMallocHost(void **hostPtr, size_t size): allocate page-locked memory on the host
2. Copy memory host to device, device to device, and device to host
cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind): copies data between GPU and host
cudaMemcpyKind: cudaMemcpyHostToDevice, cudaMemcpyDeviceToDevice, cudaMemcpyDeviceToHost
3. Free host and device memory
cudaFree(void *devPtr): frees memory on the GPU
cudaFreeHost(void *hostPtr): frees page-locked memory on the host
4. Block until all threads in the block have written their data to shared memory
__syncthreads(); // called on the GPU
5. Block until the device has completed
cudaThreadSynchronize(); // called on the host
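Putting the calls above together, a minimal host-side sketch might look like the following (this is my own illustration, not part of the exercise; N, h_data, and d_data are placeholder names, and the kernel launch itself is elided):

#include <cuda_runtime.h>
#include <stdio.h>

#define N 1024

int main(void)
{
    float *h_data, *d_data;
    size_t bytes = N * sizeof(float);

    cudaMallocHost((void **)&h_data, bytes);   /* page-locked host memory */
    cudaMalloc((void **)&d_data, bytes);       /* device memory */

    for (int i = 0; i < N; ++i)
        h_data[i] = (float)i;

    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    /* ... launch a kernel that works on d_data here ... */
    cudaThreadSynchronize();                   /* block until the device has completed */

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    printf("h_data[0] = %f\n", h_data[0]);

    cudaFree(d_data);                          /* free device memory */
    cudaFreeHost(h_data);                      /* free page-locked host memory */
    return 0;
}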
Kernel configuration and launch
dim3 dimGrid(1024);   // 1024 blocks in the grid (x dimension)
dim3 dimBlock(256);   // 256 threads per block (x dimension)
kernel_name<<<dimGrid, dimBlock>>>(kernel_arguments);
1D Indexing
single block:
int idx = threadIdx.x;
int reversed_idx = blockDim.x - 1 - threadIdx.x;
multiple blocks (a full kernel sketch using these indices follows below):
int offset = blockIdx.x * blockDim.x;
int idx = offset + threadIdx.x;
int reversed_offset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
int reversed_threadIdx = blockDim.x - 1 - threadIdx.x;
int reversed_idx = reversed_offset + reversed_threadIdx;
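For example, the multi-block indices above can be used to reverse an array in global memory. This is only a sketch in the spirit of the CUDA U reverse-array exercise; the kernel name reverse_global is my own:

__global__ void reverse_global(int *d_out, const int *d_in)
{
    int offset = blockIdx.x * blockDim.x;
    int idx = offset + threadIdx.x;

    int reversed_offset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int reversed_idx = reversed_offset + (blockDim.x - 1 - threadIdx.x);

    d_out[reversed_idx] = d_in[idx];   /* element idx lands at N - 1 - idx */
}

/* host-side launch, using dimGrid and dimBlock as configured above:   */
/*   reverse_global<<<dimGrid, dimBlock>>>(d_out, d_in);               */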
Using shared memory
Declaration:
extern __shared__ int s_data[];
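A sketch of how this declaration is typically used, again with the reversal task above (the name reverse_shared is mine): because the size of s_data is not known at compile time, the number of shared-memory bytes per block is passed as a third launch parameter.

__global__ void reverse_shared(int *d_out, const int *d_in)
{
    extern __shared__ int s_data[];

    int in_idx  = blockIdx.x * blockDim.x + threadIdx.x;
    int out_idx = blockDim.x * (gridDim.x - 1 - blockIdx.x) + threadIdx.x;

    /* stage the block's elements in shared memory, reversed within the block */
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in_idx];

    __syncthreads();   /* wait until every thread in the block has written s_data */

    d_out[out_idx] = s_data[threadIdx.x];
}

/* host-side launch; the third argument is the shared-memory bytes per block:   */
/*   reverse_shared<<<dimGrid, dimBlock, dimBlock.x * sizeof(int)>>>(d_out, d_in); */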
Terminology
1. Host: CPU
2. Device: GPU
3. Kernel: a function that runs on the GPU, executed by an array of threads
4. Grid: a set of blocks (dimension of grid == # blocks in a grid)
5. Block: a set of threads (dimension of block == # threads in a block)
6. Thread: each thread runs one instance of the kernel
7. Shared memory: fast on-chip memory shared by all threads in a block
Declaration qualifiers (a small usage sketch follows the list)
__host__: runs on the host, callable from the host only
__global__: the interface between host and device; runs on the device, launched from the host
__device__: runs on the device, callable from the device only
__shared__: variable resides in the block's shared memory
__local__: variable resides in per-thread local memory
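A small sketch of how the function qualifiers combine in one .cu file (illustrative names, not from the study material):

__device__ int square_dev(int x)             /* device only: callable from device code */
{
    return x * x;
}

__host__ int square_host(int x)              /* host only: an ordinary CPU function */
{
    return x * x;
}

__global__ void square_kernel(int *d_data)   /* runs on the device, launched from the host */
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_data[idx] = square_dev(d_data[idx]);
}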