Dana Vrajitoru
B424 Parallel and Distributed Programming
Parallel Libraries
Major Libraries / APIs
OpenMP
- An API for shared-memory parallel programming.
- Name stands for Open Multi-Processing.
- Higher level than pthreads. You can sometimes mark a block for parallel execution without specifying what should be done by each thread (implicit parallelization).
- May not be compatible with all C/C++ compilers.
- Designed to allow an incremental parallelization of a sequential program.
- Versions exist for embedded / mobile systems.
Pragma
- Preprocessor directives - actually many of them are “linker”
directives.
#pragma comment(lib, "kernel32")
basically tells the compiler to leave a comment in the object
file. The linker can read it and add it as a dependency in the
build/link process.
- It’s equivalent to -lkernel32 for gcc or adding kernel32.dll as a
dependency in MS VS.
#pragma once
causes the file to be included in the compilation only once. Not
universally portable - supported by VS and gcc.
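- For illustration, a minimal header sketch (the file and function names here are made up):
/* my_utils.h - hypothetical header file name */
#pragma once        /* contents below are compiled only once per
                       translation unit, even if included repeatedly */
int add_numbers(int a, int b);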
Directive-Based OMP
- Using preprocessor commands of type pragma.
#pragma omp parallel num_threads(count)
where count is defined beforehand - either input from the user
directly, taken from argv[1], or set by other means. The directive
applies to the single structured block that follows.
- The parallel keyword in the directive specifies that the code that
follows should be executed in parallel.
- If the num_threads clause is missing, and we’re not already inside
another parallel region, the library typically creates one thread per core.
- The number of threads can be set by an environment variable
OMP_NUM_THREADS directly in bash.
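- A minimal sketch of setting count from argv[1] before the directive (the default of 4 threads is an assumption):
#include <omp.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char *argv[]) {
  int count = (argc > 1) ? atoi(argv[1]) : 4;  // assumed default of 4
  #pragma omp parallel num_threads(count)
  {
    printf("Thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}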
OpenMP Directives
Spawning a parallel region
Dividing blocks of code among threads
Distributing loop iterations between threads
Serializing sections of code
Synchronization of work among threads
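A brief sketch of how a few of these look in code (illustrative only; the printed messages are arbitrary):
#include <omp.h>
#include <cstdio>

int main(void) {
  #pragma omp parallel            // spawn a parallel region
  {
    #pragma omp sections          // divide blocks of code among threads
    {
      #pragma omp section
      printf("Section A, thread %d\n", omp_get_thread_num());
      #pragma omp section
      printf("Section B, thread %d\n", omp_get_thread_num());
    }
    #pragma omp single            // serialize: only one thread runs this
    printf("Only one thread prints this\n");
    #pragma omp barrier           // synchronize the whole team
  }
  return 0;
}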
Shared / Not Shared Memory
- The directive creating the threads allows for clauses specifying
which variables should be shared and which should be local (private)
to each thread.
#pragma omp parallel private(var1,...)\
shared(var2,...)
- All the variables declared as private are local. The shared ones
are common.
- \ allows one directive to continue on the next line.
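- A small sketch of the difference (variable names are made up): each thread writes its own private copy of tid, while all of them read the single shared copy of data.
#include <omp.h>
#include <cstdio>

int main(void) {
  int data = 42;       // shared: one copy, visible to all threads
  int tid;             // private below: each thread gets its own copy
  #pragma omp parallel private(tid) \
                       shared(data)
  {
    tid = omp_get_thread_num();      // writes only this thread's copy
    printf("Thread %d sees shared data = %d\n", tid, data);
  }
  return 0;
}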
OpenMP Routines / Functions
Setting and querying the number of threads
Querying a thread's unique identifier (thread ID), a thread's ancestor's identifier, the thread team size
Setting and querying the dynamic threads feature
Querying if in a parallel region, and at what level
Setting and querying nested parallelism
Setting, initializing and terminating locks and nested locks
Querying wall clock time and resolution.
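A short sketch touching several of these routines (illustrative, not exhaustive):
#include <omp.h>
#include <cstdio>

int main(void) {
  omp_set_num_threads(4);                    // request 4 threads
  double start = omp_get_wtime();            // wall-clock time
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();          // this thread's id
    int team = omp_get_num_threads();        // size of the team
    if (omp_in_parallel() && tid == 0)
      printf("In a parallel region with %d threads\n", team);
  }
  printf("Elapsed: %f s (resolution %g s)\n",
         omp_get_wtime() - start, omp_get_wtick());
  return 0;
}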
OpenMP Environment Variables
- Setting the number of threads
- Specifying how loop iterations are divided
- Binding threads to processors
- Enabling/disabling nested parallelism; setting the maximum levels
of nested parallelism
- Enabling/disabling dynamic threads
- Setting thread stack size
- Setting thread wait policy
Example
#include <omp.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char *argv[]) {
  int nthreads, tid;
  /* Fork a team of threads giving them their own
     copies of the variables */
  #pragma omp parallel private(nthreads, tid)
  {
    /* Obtain thread number */
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);
    /* Only master thread does this */
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  } /* All threads join master thread and disband */
  return 0;
}
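- With gcc/g++ such a program is built with the -fopenmp flag, e.g. g++ -fopenmp hello_omp.cpp (the file name here is just an example).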
Critical Section
- Also achieved with a directive
#pragma omp critical
block of critical code
- The compiler will allow only one thread at a time in the block or
instruction that follows.
#pragma omp parallel shared(a, min) \
                     private(mymin)
{ // each thread executes this block
  // first, last computed based on the thread id
  mymin = Find_min(a, first, last);
  #pragma omp critical
  if (mymin < min)
    min = mymin;
}
Loop Parallelization
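- The for work-sharing directive splits the iterations of the loop that follows among the threads of the team; a reduction clause combines per-thread partial results. A minimal sketch (the array contents and size N are assumptions):
#include <omp.h>
#include <cstdio>
#define N 1000

int main(void) {
  double a[N], sum = 0.0;
  for (int i = 0; i < N; i++) a[i] = i;        // fill with sample data

  #pragma omp parallel for reduction(+:sum)    // iterations divided among threads
  for (int i = 0; i < N; i++)
    sum += a[i];

  printf("sum = %f\n", sum);
  return 0;
}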
Synchronization
- Atomic operations:
#pragma omp atomic
single assignment
- Explicit locks: omp_lock_t type.
omp_set_lock(omp_lock_t *lock); // lock
omp_unset_lock(omp_lock_t *lock); // unlock
- Barrier:
#pragma omp barrier
placed inside a parallel block: each thread waits at the barrier
until all threads of the team have reached it.
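A sketch combining these tools (the variable names are mine, illustrative only):
#include <omp.h>
#include <cstdio>

int main(void) {
  omp_lock_t lock;
  omp_init_lock(&lock);            // initialize before use
  int sum = 0, last_tid = -1;

  #pragma omp parallel
  {
    #pragma omp atomic
    sum += 1;                      // single assignment, done atomically

    omp_set_lock(&lock);           // only one thread past this point
    last_tid = omp_get_thread_num();
    omp_unset_lock(&lock);

    #pragma omp barrier            // wait for every thread in the team
  }
  omp_destroy_lock(&lock);         // release the lock's resources
  printf("sum = %d, last thread to hold the lock = %d\n", sum, last_tid);
  return 0;
}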
OpenCL
- The name stands for Open Computing Language.
- Multi-platform, including mobile.
- Uses multi-core CPUs, GPUs, and DSPs.
- Native on Intel processors, AMD, Apple, others.
- An API for coordinating parallel computations across heterogeneous
processors.
- Supports both data and functional parallelization. Works well
with OpenGL.
Main Idea
- Get information about the platform.
- Get information about the device.
- Divide the device into subdevices based on existing computing
units.
- Create a queue of tasks to be sent to each unit.
- Have a context manager keep everything together, something that
knows which device has access to which task queue or memory object.
- Store the data in buffers handled by the context managers.
Specific Examples
- clGetPlatformIDs (...) -> returns a list of ids of available
platforms.
- clGetPlatformInfo(id, ...) -> returns properties of the platform
with the specified id. It provides a profile such as FULL_PROFILE or
EMBEDDED_PROFILE, version, vendor, etc.
- clGetDeviceIDs(platformId, ...) -> returns (in a parameter) a
list of available devices on that platform, such as
CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU, CL_DEVICE_TYPE_ACCELERATOR,
etc.
- clCreateSubDevices(deviceId,...) -> a parameter specifies how the
partition happens, like CL_DEVICE_PARTITION_EQUALLY,
CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN.
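A rough sketch of the discovery step with these calls (error checking omitted; on Apple platforms the header is <OpenCL/opencl.h> instead):
#include <CL/cl.h>
#include <stdio.h>

int main(void) {
  cl_platform_id platform;
  cl_uint num_platforms;
  clGetPlatformIDs(1, &platform, &num_platforms);    // first available platform

  char name[256];
  clGetPlatformInfo(platform, CL_PLATFORM_NAME, sizeof(name), name, NULL);
  printf("Platform: %s\n", name);

  cl_device_id device;
  cl_uint num_devices;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU,       // or CL_DEVICE_TYPE_CPU, ...
                 1, &device, &num_devices);
  return 0;
}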
More Examples
- clCreateContext(..., devices,...userData,...) -> creates a context.
- clCreateCommandQueue(context, device, ...)-> creates a queue of
commands for this device in this context.
- clCreateBuffer(context, ..., flags, ...) -> the flags specify
things like read only or read-write.
- clEnqueueReadBuffer(queue, buffer, ..., events)
- clEnqueueWriteBuffer(queue, buffer, ...,events)
- clCreateImage(context,...)
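Continuing the discovery sketch above (the function name demo_buffers is made up; device is assumed to come from clGetDeviceIDs, and error handling is omitted):
#include <CL/cl.h>

void demo_buffers(cl_device_id device) {
  cl_int err;
  cl_context context = clCreateContext(NULL, 1, &device,
                                       NULL, NULL, &err);      // no callback, no user data
  cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

  float host_data[1024] = {0};
  cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                 sizeof(host_data), NULL, &err);
  clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0,               // blocking write
                       sizeof(host_data), host_data, 0, NULL, NULL);
  /* ... enqueue kernels that use the buffer here ... */
  clEnqueueReadBuffer(queue, buffer, CL_TRUE, 0,                // blocking read
                      sizeof(host_data), host_data, 0, NULL, NULL);
}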
CUDA
- Parallel library/API/Toolkit created by Nvidia.
- GPU-oriented, part of the General Purpose GPU effort.
- Toolkit: compiler for C/C++, math libraries, optimization tools.
GPU Structure
Ideas
- Host - CPU and its memory
- Device - GPU and its memory
- By default the code runs on the host.
- To run it on the device, you add the qualifier __global__
to the function (prototype).
- The function can then be called from the host code.
- Everything running on the device is compiled by the Nvidia
compiler. The host code is compiled with the usual C compiler.
Hello World
#include <cstdio>

__global__ void kernel(void) {
}

int main(void) {
  kernel<<<1, 1>>>();   // launch the (empty) kernel on the device
  printf("Hello, World!\n");
  return 0;
}
// Angle brackets:
// <<<#blocks, #threads>>>
Memory Management
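- Data used by a kernel must live in device memory, so it is allocated and copied explicitly. A minimal sketch of the allocate / copy / free cycle with cudaMalloc, cudaMemcpy, and cudaFree (the size N and the kernel launch are placeholders):
#include <cstdio>
#define N 512

int main(void) {
  int a[N], result[N];
  for (int i = 0; i < N; i++) a[i] = i;
  int *dev_a;

  cudaMalloc((void**)&dev_a, N * sizeof(int));                   // allocate on the device
  cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice); // host -> device
  /* ... launch a kernel that works on dev_a here ... */
  cudaMemcpy(result, dev_a, N * sizeof(int),
             cudaMemcpyDeviceToHost);                            // device -> host
  cudaFree(dev_a);                                               // release device memory
  printf("result[1] = %d\n", result[1]);
  return 0;
}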
Vector Operations
__global__ void add(int *a, int *b, int *c) {
  c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
add<<< N, 1 >>>(dev_a, dev_b, dev_c);
where blockIdx.x identifies the current block. OR
__global__ void add(int *a, int *b, int *c) {
  c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
add<<< 1, N >>>(dev_a, dev_b, dev_c);
where threadIdx.x identifies the current thread.
Threads / Blocks Properties
- The threads in the same block can share memory:
__shared__ int temp[N];
- Threads within a block can synchronize at some places in the code with
__syncthreads(); // a barrier
- Atomic operations available: atomicAdd, atomicInc, atomicExch.
atomicAdd(&a, b); is an uninterrupted a += b; (the first argument is an address).
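A sketch of a kernel combining the three features (the kernel name block_sum, and the assumption that the block size equals N, are mine):
#define N 256

__global__ void block_sum(int *input, int *total) {
  __shared__ int temp[N];               // shared by the threads of this block
  temp[threadIdx.x] = input[threadIdx.x];
  __syncthreads();                      // barrier: wait until all loads are done

  if (threadIdx.x == 0) {               // one thread sums the block's data
    int sum = 0;
    for (int i = 0; i < N; i++)
      sum += temp[i];
    atomicAdd(total, sum);              // uninterrupted *total += sum
  }
}
// launched, e.g., as: block_sum<<< numBlocks, N >>>(dev_in, dev_total);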
Links