Dana Vrajitoru
B424 Parallel and Distributed Programming
Parallel Libraries
Major Libraries / APIs
OpenMP
- An API for shared-memory parallel programming.
- Name stands for Open Multi-Processing.
- Higher level than pthreads. You can sometimes mark a block for parallel execution without specifying what should be done by each thread (implicit parallelization).
- May not be compatible with all C/C++ compilers.
- Designed to allow an incremental parallelization of a sequential program.
- Versions exist for embedded / mobile systems.
Pragma
- Preprocessor directives - actually many of them are “linker”
directives.
#pragma comment(lib, "kernel32")
basically tells the compiler to leave a comment in the object
file. The linker can read it and add it as a dependency in the
build/link process.
- It’s equivalent to -lkernel32 for gcc or adding kernel32.dll as a
dependency in MS VS.
#pragma once
causes the file to be included in the compilation only once. Not
universally portable - supported by VS and gcc.
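- For illustration, a minimal header sketch (the file and function names here are made up):
/* my_utils.h - hypothetical header file name */
#pragma once        /* contents below are compiled only once per
                       translation unit, even if included repeatedly */
int add_numbers(int a, int b);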
Directive-Based OMP
- Using preprocessor commands of type pragma.
#pragma omp parallel num_threads(count)
where count is defined beforehand - either input from the user
directly, taken from argv[1], or set by other means. The directive
applies to the single structured block that follows.
- The parallel keyword in the directive specifies that the code that
follows should be executed in parallel.
- If the num_threads clause is missing, and we’re not already inside
another parallel region, the library typically creates one thread per core.
- The number of threads can be set by an environment variable
OMP_NUM_THREADS directly in bash.
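- A minimal sketch of setting count from argv[1] before the directive (the default of 4 threads is an assumption):
#include <omp.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char *argv[]) {
  int count = (argc > 1) ? atoi(argv[1]) : 4;  // assumed default of 4
  #pragma omp parallel num_threads(count)
  {
    printf("Thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}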
OpenMP Directives
Spawning a parallel region
Dividing blocks of code among threads
Distributing loop iterations between threads
Serializing sections of code
Synchronization of work among threads
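A brief sketch of how a few of these look in code (illustrative only; the printed messages are arbitrary):
#include <omp.h>
#include <cstdio>

int main(void) {
  #pragma omp parallel            // spawn a parallel region
  {
    #pragma omp sections          // divide blocks of code among threads
    {
      #pragma omp section
      printf("Section A, thread %d\n", omp_get_thread_num());
      #pragma omp section
      printf("Section B, thread %d\n", omp_get_thread_num());
    }
    #pragma omp single            // serialize: only one thread runs this
    printf("Only one thread prints this\n");
    #pragma omp barrier           // synchronize the whole team
  }
  return 0;
}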
Shared / Not Shared Memory
- The directive creating the threads allows for clauses specifying
which variables should be shared and which should be local (private)
to each thread.
#pragma omp parallel private(var1,...)\
shared(var2,...)
- All the variables declared as private are local. The shared ones
are common.
- \ allows one directive to continue on the next line.
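- A small sketch of the difference (variable names are made up): each thread writes its own private copy of tid, while all of them read the single shared copy of data.
#include <omp.h>
#include <cstdio>

int main(void) {
  int data = 42;       // shared: one copy, visible to all threads
  int tid;             // private below: each thread gets its own copy
  #pragma omp parallel private(tid) \
                       shared(data)
  {
    tid = omp_get_thread_num();      // writes only this thread's copy
    printf("Thread %d sees shared data = %d\n", tid, data);
  }
  return 0;
}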
OpenMP Routines / Functions
Setting and querying the number of threads
Querying a thread's unique identifier (thread ID), a thread's ancestor's identifier, the thread team size
Setting and querying the dynamic threads feature
Querying if in a parallel region, and at what level
Setting and querying nested parallelism
Setting, initializing and terminating locks and nested locks
Querying wall clock time and resolution.
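A short sketch touching several of these routines (illustrative, not exhaustive):
#include <omp.h>
#include <cstdio>

int main(void) {
  omp_set_num_threads(4);                    // request 4 threads
  double start = omp_get_wtime();            // wall-clock time
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();          // this thread's id
    int team = omp_get_num_threads();        // size of the team
    if (omp_in_parallel() && tid == 0)
      printf("In a parallel region with %d threads\n", team);
  }
  printf("Elapsed: %f s (resolution %g s)\n",
         omp_get_wtime() - start, omp_get_wtick());
  return 0;
}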
OpenMP Environment Variables
- Setting the number of threads
- Specifying how loop iterations are divided
- Binding threads to processors
- Enabling/disabling nested parallelism; setting the maximum levels
of nested parallelism
- Enabling/disabling dynamic threads
- Setting thread stack size
- Setting thread wait policy
Example
#include <omp.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char *argv[]) {
  int nthreads, tid;
  /* Fork a team of threads giving them their own
     copies of the variables */
  #pragma omp parallel private(nthreads, tid)
  {
    /* Obtain thread number */
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);
    /* Only master thread does this */
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  } /* All threads join master thread and disband */
  return 0;
}
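- With gcc/g++ such a program is built with the -fopenmp flag, e.g. g++ -fopenmp hello_omp.cpp (the file name here is just an example).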
Critical Section
- Also achieved with a directive
#pragma omp critical
block of critical code
- The compiler will allow only one thread at a time in the block or
instruction that follows.
#pragma omp parallel shared(a, min) \
                     private(mymin)
{ // each thread executes this block
  // first, last computed based on the thread id
  mymin = Find_min(a, first, last);
  #pragma omp critical
  if (mymin < min)
    min = mymin;
}
Loop Parallelization
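- The for work-sharing directive splits the iterations of the loop that follows among the threads of the team; a reduction clause combines per-thread partial results. A minimal sketch (the array contents and size N are assumptions):
#include <omp.h>
#include <cstdio>
#define N 1000

int main(void) {
  double a[N], sum = 0.0;
  for (int i = 0; i < N; i++) a[i] = i;        // fill with sample data

  #pragma omp parallel for reduction(+:sum)    // iterations divided among threads
  for (int i = 0; i < N; i++)
    sum += a[i];

  printf("sum = %f\n", sum);
  return 0;
}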
Synchronization
- Atomic operations:
#pragma omp atomic
single assignment
- Explicit locks: omp_lock_t type.
omp_set_lock(omp_lock_t *lock); // lock
omp_unset_lock(omp_lock_t *lock); // unlock
- Barrier:
#pragma omp barrier
placed inside a parallel block: each thread waits at the barrier
until all threads of the team have reached it.
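A sketch combining these tools (the variable names are mine, illustrative only):
#include <omp.h>
#include <cstdio>

int main(void) {
  omp_lock_t lock;
  omp_init_lock(&lock);            // initialize before use
  int sum = 0, last_tid = -1;

  #pragma omp parallel
  {
    #pragma omp atomic
    sum += 1;                      // single assignment, done atomically

    omp_set_lock(&lock);           // only one thread past this point
    last_tid = omp_get_thread_num();
    omp_unset_lock(&lock);

    #pragma omp barrier            // wait for every thread in the team
  }
  omp_destroy_lock(&lock);         // release the lock's resources
  printf("sum = %d, last thread to hold the lock = %d\n", sum, last_tid);
  return 0;
}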
OpenCL
- The name stands for Open Computing Language.
- Multi-platform, including mobile.
- Uses multi-core CPUs, GPUs, and DSPs.
- Native on Intel processors, AMD, Apple, others.
- An API for coordinating parallel computations across heterogeneous
processors.
- Supports both data and functional parallelization. Works well
with OpenGL.
Main Idea
- Get information about the platform.
- Get information about the device.
- Divide the device into subdevices based on existing computing
units.
- Create a queue of tasks to be sent to each unit.
- Have a context manager keep everything together, something that
knows which device has access to which task queue or memory object.
- Store the data in buffers handled by the context managers.
Specific Examples
- clGetPlatformIDs (...) -> returns a list of ids of available
platforms.
- clGetPlatformInfo(id, ...) -> returns properties of the platform
with the specified id. It provides a profile such as FULL_PROFILE or
EMBEDDED_PROFILE, version, vendor, etc.
- clGetDeviceIDs(platformId, ...) -> returns (in a parameter) a
list of available devices on that platform, such as
CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU, CL_DEVICE_TYPE_ACCELERATOR,
etc.
- clCreateSubDevices(deviceId,...) -> a parameter specifies how the
partition happens, like CL_DEVICE_PARTITION_EQUALLY,
CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN.
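A rough sketch of the discovery step with these calls (error checking omitted; on Apple platforms the header is <OpenCL/opencl.h> instead):
#include <CL/cl.h>
#include <stdio.h>

int main(void) {
  cl_platform_id platform;
  cl_uint num_platforms;
  clGetPlatformIDs(1, &platform, &num_platforms);    // first available platform

  char name[256];
  clGetPlatformInfo(platform, CL_PLATFORM_NAME, sizeof(name), name, NULL);
  printf("Platform: %s\n", name);

  cl_device_id device;
  cl_uint num_devices;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU,       // or CL_DEVICE_TYPE_CPU, ...
                 1, &device, &num_devices);
  return 0;
}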
More Examples
- clCreateContext(..., devices,...userData,...) -> creates a context.
- clCreateCommandQueue(context, device, ...)-> creates a queue of
commands for this device in this context.
- clCreateBuffer(context, ..., flags, ...) -> the flags specify
things like read only or read-write.
- clEnqueueReadBuffer(queue, buffer, ..., events)
- clEnqueueWriteBuffer(queue, buffer, ...,events)
- clCreateImage(context,...)
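Continuing the discovery sketch above (the function name demo_buffers is made up; device is assumed to come from clGetDeviceIDs, and error handling is omitted):
#include <CL/cl.h>

void demo_buffers(cl_device_id device) {
  cl_int err;
  cl_context context = clCreateContext(NULL, 1, &device,
                                       NULL, NULL, &err);      // no callback, no user data
  cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

  float host_data[1024] = {0};
  cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                 sizeof(host_data), NULL, &err);
  clEnqueueWriteBuffer(queue, buffer, CL_TRUE, 0,               // blocking write
                       sizeof(host_data), host_data, 0, NULL, NULL);
  /* ... enqueue kernels that use the buffer here ... */
  clEnqueueReadBuffer(queue, buffer, CL_TRUE, 0,                // blocking read
                      sizeof(host_data), host_data, 0, NULL, NULL);
}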
CUDA
- Parallel library/API/Toolkit created by Nvidia.
- GPU-oriented, part of the General Purpose GPU effort.
- Toolkit: compiler for C/C++, math libraries, optimization tools.
GPU Structure
Ideas
- Host - CPU and its memory
- Device - GPU and its memory
- By default the code runs on the host.
- To run it on the device, you add the qualifier __global__
to the function (prototype).
- The function can then be called from the host code.
- Everything running on the device is compiled by the Nvidia
compiler. The host code is compiled with the usual C compiler.
Hello World
#include <cstdio>

__global__ void kernel(void) {
}

int main(void) {
  kernel<<<1, 1>>>();   // launch the (empty) kernel on the device
  printf("Hello, World!\n");
  return 0;
}
// Angle brackets:
// <<<#blocks, #threads>>>
Memory Management
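- Data used by a kernel must live in device memory, so it is allocated and copied explicitly. A minimal sketch of the allocate / copy / free cycle with cudaMalloc, cudaMemcpy, and cudaFree (the size N and the kernel launch are placeholders):
#include <cstdio>
#define N 512

int main(void) {
  int a[N], result[N];
  for (int i = 0; i < N; i++) a[i] = i;
  int *dev_a;

  cudaMalloc((void**)&dev_a, N * sizeof(int));                   // allocate on the device
  cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice); // host -> device
  /* ... launch a kernel that works on dev_a here ... */
  cudaMemcpy(result, dev_a, N * sizeof(int),
             cudaMemcpyDeviceToHost);                            // device -> host
  cudaFree(dev_a);                                               // release device memory
  printf("result[1] = %d\n", result[1]);
  return 0;
}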
Vector Operations
__global__ void add(int *a, int *b, int *c) {
  c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
add<<< N, 1 >>>(dev_a, dev_b, dev_c);
where blockIdx.x identifies the current block. OR
__global__ void add(int *a, int *b, int *c) {
  c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
add<<< 1, N >>>(dev_a, dev_b, dev_c);
where threadIdx.x identifies the current thread.
Threads / Blocks Properties
- The threads in the same block can share memory:
__shared__ int temp[N];
- Threads within a block can synchronize at some places in the code with
__syncthreads(); // a barrier
- Atomic operations available: atomicAdd, atomicInc, atomicExch.
atomicAdd(&a, b); is an uninterrupted a += b; (the first argument is an address).
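A sketch of a kernel combining the three features (the kernel name block_sum, and the assumption that the block size equals N, are mine):
#define N 256

__global__ void block_sum(int *input, int *total) {
  __shared__ int temp[N];               // shared by the threads of this block
  temp[threadIdx.x] = input[threadIdx.x];
  __syncthreads();                      // barrier: wait until all loads are done

  if (threadIdx.x == 0) {               // one thread sums the block's data
    int sum = 0;
    for (int i = 0; i < N; i++)
      sum += temp[i];
    atomicAdd(total, sum);              // uninterrupted *total += sum
  }
}
// launched, e.g., as: block_sum<<< numBlocks, N >>>(dev_in, dev_total);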
Links