
T-106.5450 (year 2015)
A short OpenMP tutorial

Vesa Hirvisalo
ESG/CSE/Aalto
2015-03-04

The purpose of this tutorial is to give the information on OpenMP that is needed for doing the course assignments and understanding the compilation methods handled by the course.

Other sources of information

There is a rather complete description of OpenMP available at

For the T-106.5450 course you need not read it, but (of course) you are encouraged to study it.

Note that OpenMP has a long development history and there exist several versions of OpenMP. Read about them, e.g., if you are interested

For the T-106.5450 course basic parallel directives and for loops are essential. We will handle the related compilation mechanisms. However, compilation techniques related to many other OpenMP features (such as task parallelism) will not be handled during the course.

Various compilers have different levels of support for OpenMP. The support in LLVM is not yet very mature, see

In this tutorial, gcc has been used (as its support is rather mature).

Introduction

OpenMP is a set of compiler directives and library calls that is designed for programmers of parallel applications. Basically, OpenMP encapsulates the typical way of writing parallel scientific code (HPC code) on a POSIX-like threading system and provides a rather clean interface to the programmer.

  • OpenMP is an API for parallel programming, i.e., it is meant for writing code that maximizes the performance of a single computational task. For such tasks, there are no timing requirements for the sub-tasks.
  • OpenMP is not for, e.g., concurrent programming. A typical example of concurrent programming is a system with a user interface and physical control: the threads handling the user interaction and the real-time control sub-tasks have inherently different timing. Thus, explicit support for concurrency and synchronization is needed.

OpenMP is also a programming model. An interface is available for several languages, Fortran, C and C++ being the most supported.

OpenMP is designed to be run on a shared memory computer with an operating system providing threading. Basically, it is for SMP systems and currently often used with multi-core CPUs.

The OpenMP stack has several components. For parallel operation, the OpenMP runtime is essential. On top of this, the compiler must implement the OpenMP directives and the OpenMP library routines.

OpenMP supports structured parallelism. At least one thread (the master thread) is always present. At the opening of a parallel region, the master spawns worker threads. At the closing of the parallel region, all threads of the region are synchronized by a barrier, and after the barrier only the master thread continues.

Inside parallel regions, each of the worker threads can act as a local master, i.e., spawn a parallel sub-region of its own. Thus, block structured nested parallelism is possible.
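
As a minimal sketch of nested parallelism, using the directives and library calls introduced in the following sections (the thread counts are arbitrary choices for illustration, and nested parallelism is often disabled by default):

  omp_set_nested(1);                 /* allow nested parallel regions */
  #pragma omp parallel num_threads(2)
  {
    /* each of the two outer threads acts as a local master ... */
    #pragma omp parallel num_threads(2)
    {
      /* ... and spawns an inner team of two threads of its own */
    }
  }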

A simple OpenMP program

(Note that in the following and the rest of this tutorial, we use C and omit other languages with OpenMP support)

Basic operation of OpenMP is controlled by directives

#pragma omp construct [clause [clause]...]

For example

#pragma omp parallel

sets the next statement to be a parallel region. The statement can be a compound structure, but it must have a single entry and exit.

The OpenMP library is used like any other C library

#include <omp.h>

For example

#include <stdio.h>
#include <omp.h>
int main(void) {
  int x;                           /* shared by all the threads of the parallel region */
  printf("start\n");
  #pragma omp parallel
  {
    int id = omp_get_thread_num(); /* id is private, as it is declared inside the region */
    x = id;
    printf("hello world (%d)\n", id);
  }
  printf("finish (%d)\n", x);
  return 0;
}

will print as many "hellos" as there are threads available on the system used.

  • The call "omp_get_thread_num()" is a library call returning the id of the thread
  • There is a call "omp_set_num_threads(int)" for setting the number of threads (up to the number of hardware threads is reasonable)
  • There is a call "int omp_get_num_threads()" getting the number of threads

You must use a suitable compilation command to compile the code. E.g., with gcc and the program stored in file "hello.c":

sh> gcc -fopenmp -o hello hello.c

The parallelism is not obvious on all systems, but it will become visible if we add suitable delays between the "printf"s, e.g.,

  {
    int id = omp_get_thread_num();
    printf("hello(%d) ", id);
    fflush(stdout);
    x = id;
    printf("world(%d)\n", id);
    fflush(stdout);
  }

(Flushing the I/O buffers will cause unpredictable delays on most systems - you should get a randomized ordering and interleaving on a typical multicore CPU.) A typical execution yields something like:

sh> ./hello
start
hello(0) world(0)
hello(1) hello(3) hello(2) world(3)
world(1)
world(2)
finish (2)
sh>

Note that at first there is a single thread (the master thread) executing, and that thread prints "start". After that the parallel region opens and several threads print in parallel. Finally, the threads stop at the implicit barrier (at the "}" after the last "fflush"). From that point on, there is only one thread.

There are several possible execution orders given the constraints:

  • All executions begin with "start"
  • All "hellos" are before their "world" pairs
  • All executions end with "finish"

But otherwise the order of execution (interleaving of the threads) is free.

Note that if we use assignments to variables (like "x = id"), the outcome depends on the execution order.

Such situations (i.e., outcome depends on execution order) are called race conditions.

Threads can co-operate by using shared values, but unlike in the code above, proper synchronization must then be used. Otherwise, the outcome of a program can be uncontrolled.
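
As a minimal sketch (a preview of the synchronization directives described later in this tutorial), the shared write above could be protected as follows; note that which id ends up in "x" still depends on the thread ordering:

  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    #pragma omp critical
    x = id;                  /* only one thread at a time performs the write */
  }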

Parallel loops

The basic programming model in OpenMP is SPMD (Single Program Multiple Data). As you open up a parallel region by

#pragma omp parallel

you get multiple threads executing the same code. However, in typical situations there is more work to be done than there are threads available on a multicore system (e.g., think about a matrix multiplication on a typical multicore: you can have hundreds of rows and columns but only a handful of cores). Thus, you often partition the work items among the worker threads. The code

/* example A */
#pragma omp parallel private(id, mine, count)
{
  count = omp_get_num_threads();   /* must be queried inside the parallel region */
  id = omp_get_thread_num();
  for (mine = id; mine < LIMIT; mine += count) {
    ... do the work for item "mine" ...
  }
}

assigns different values of "mine" to different threads that together cover all the points in the work range 0...LIMIT-1. The above basically utilizes the task-farm paradigm of parallel programming and does a static cyclic schedule for the tasking. (The assignment of the work to the threads could also be dynamic, especially if the individual work items are of different size. Such an implementation would basically be a work-pool.)

For many such computations, the individual work items are (essentially) independent. Such computation can be described as a parallel loop iterating over the work range. In OpenMP, we can write

/* example B */
#pragma omp parallel
{
  #pragma omp for
  for (i = 0; i < LIMIT; i++) {
    ... do the work for item "i" ...
  }
}

and let the system do the mapping of the iterations to the threads. The iteration count is "LIMIT" and we have no explicit thread assignment. In "example A", each thread does about "LIMIT/count" loop iterations and we have "count" explicitly assigned threads doing the loop iterations. Further, in "example A" the threads run independently, but each thread goes through its own iterations in sequential order. In "example B" (importantly!), the "for loop" is not guaranteed to be run in sequential order. The system can run the "for loop" in any order - even backwards! Actually, instead of a loop we have a parallel iterator, but OpenMP abstracts the structure as a parallel for loop.
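
The mapping of the iterations to the threads can be influenced with the "schedule" clause of the for construct. A minimal sketch, assuming "LIMIT" as above and a hypothetical per-item function "work()":

  #pragma omp parallel
  {
    /* schedule(static, 1) would give the static cyclic assignment of example A;  */
    /* schedule(dynamic) hands the iterations out from a shared pool at run time. */
    #pragma omp for schedule(dynamic)
    for (i = 0; i < LIMIT; i++) {
      work(i);
    }
  }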

From the programmer's point of view, such a structure behaves as a sequential for loop if the loop iterations are independent (as a parallel programming abstraction, this worksharing concept is called a do-all loop). To make the code shorter, the two directives can be combined into one, i.e., the above can also be written:

  #pragma omp parallel for
  for (i = 0; i < LIMIT; i++) {
    ... do the work for item "i" ...
  }

As OpenMP programs in general, such a program operates in shared memory. For efficient operation, the concept of sharing data is essential. In OpenMP, a shared variable must be protected against races (the explicit synchronization primitives for this are described in the next section of the tutorial).

In parallel for loops, some variables are iteration specific and need no such protection (as they are not used for communication across the iterations of the loop). The loop counter "i" in the previous for loop is an example of such. Such variables can be made private by using a directive:

  #pragma omp parallel for private(i)
  for (i = 0; i < LIMIT; i++) {
    ... do the work for item "i" ...
  }

The directive "private" means that each of the worker threads have a local copy of the variable that is updated according to the iterations assigned to that thread. Thus, the loop runs faster as unnecessary synchronization is omitted.

Some loops have iterations that are not fully independent, but the dependency is simple. For example

  double factors[TERMSIZE], term = 1.0;
  #pragma omp parallel for private(i) reduction(*:term)
  for (i = 0; i < TERMSIZE; i++) {
    term *= factors[i];
  }

iterates over an array containing the factors of a term whose value is to be computed. Each of the iterations is a multiplication of the product "term" by the current factor. Such a dependency is called a reduction. OpenMP can do the computation of "term" for you (i.e., using the "reduction" clause). Above, a local copy of "term" is made for each thread; the compiler finds the standard reduction expressions containing "*" and uses them to update the local copy. Finally, the local copies are reduced into a single value and combined with the original global value. In the C version of OpenMP, the compiler understands reductions for the operators +, *, -, &, |, ^, &&, and ||.
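
Roughly, the "reduction(*:term)" clause corresponds to hand-written code like the following (a sketch of the idea only, not the exact code a compiler generates):

  double factors[TERMSIZE], term = 1.0;
  #pragma omp parallel
  {
    double local = 1.0;              /* private copy, initialized to the identity of "*" */
    int i;                           /* private, as it is declared inside the region */
    #pragma omp for
    for (i = 0; i < TERMSIZE; i++) {
      local *= factors[i];           /* each thread reduces its own iterations */
    }
    #pragma omp critical
    term *= local;                   /* combine the local copies with the global value */
  }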

Synchronization

The programmer must explicitly do the protection against races in complex sharing situations. Consider for example:

#pragma omp parallel shared (A, B, C) private(id)
{
  id = omp_get_thread_num();
  A[id] = initialization(id);
  #pragma omp barrier          /* all threads stop here to wait for each other */
  #pragma omp for
  for (i = 0;  i < N; i++) {
     B[i] = compute(A, i);
  }                             /* By default, there is an implicit barrier here */
  #pragma omp for nowait
  for (i = 0;  i < N; i++) {
     C[i] = compute(A, i);
}                             /* The "nowait" clause removes the implicit barrier, so */
  A[id] += finalization(id); /* threads can leave the for construct independently. */
}                            /* Leaving a parallel region, an implicit barrier here */                 

The programmer can do synchronization between threads with such directives. (Note that synchronization is expensive in parallel programming, i.e., performance is traded for correctness. If possible, the data representation should instead be changed to avoid the need for synchronization.)

The progress of individual threads can be controlled. Consider for example:

#pragma omp parallel shared(data) private(id, n, i)
{
  id = omp_get_thread_num();
  n = omp_get_num_threads();
  for (i = id; i < N; i += n) {
    if (p(i)) {
      read(data);
    } else {
      #pragma omp critical
      write(data);             /* the critical section is protected */
    }
  }
}

Above, only one thread at a time can write to the shared data structure, but reading can go on simultaneously in several threads (also during a write). The "critical" directive implements basically a mutex (mutual exclusion) that stops the threads from racing with each other. Only one thread at a time is allowed to be executing inside the statement (which can be a compound statement, called a critical section in concurrent programming).

Sometimes it is enough to control the reading of the updated values instead of stopping threads (as a whole). Consider for example:

#pragma omp parallel private(tmp)
{
  tmp = compute(B);          /* tmp is private; only the update of X needs protection */
  #pragma omp atomic
  X += tmp;
}

The directive "atomic" protects the reads and updates of X, i.e., its is a mutual exclusion for only the reads and writes inside the statement. The "atomic" directive is valid only for a single update statement using +=, *=,  -=, /=, ^=, |=, <<=, >>=, ++, or --. Further, the updated variable must be a scalar and left hand side expression may not reference the variable to be updated.

Tasks

In addition to thread parallelism, OpenMP also implements task parallelism. In task parallel computation, parallelism is applied by performing distinct computations (i.e., tasks) at the same time. A task consists of

  • code to execute
  • a data environment (shared, private, reduction, etc.)

Task parallelism is often used to implement processing pipelines, but simple work division can also be done with tasks.

#pragma omp parallel
{
  #pragma omp single           /* one thread creates the tasks ... */
  {
    #pragma omp task
    task1();                   /* ... and any thread of the team may execute them */
    #pragma omp task
    task2();
  }
}

Since OpenMP 4.0, tasks can have dependencies and user-defined reductions (only supported by the most recent implementations). With such features, tasks can be used at a high abstraction level to handle unstructured parallelism. Compared to thread parallelism, task parallelism is more flexible in handling errors, exceptions, etc. With earlier versions of OpenMP (version 3.0), similar things can be done with tasks, but the programmer must use low-level primitives (this removes many of the main benefits of the task parallel programming model).
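
As a sketch of such task dependencies (OpenMP 4.0 "depend" clauses; the functions "produce()" and "consume()" and the variable "buf" are only for illustration):

  int buf;
  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task depend(out: buf)
    buf = produce();                 /* must complete before the consuming task may start */
    #pragma omp task depend(in: buf)
    consume(buf);                    /* the runtime starts this only after the producer */
  }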

As such, task parallelism is not scalable (e.g., the number of steps in a pipeline is fixed), but scaling is achieved as you submit large numbers of tasks for processing. Compared to synchronized threads, tasks can yield lower overheads as the compiler and the runtime system can resolve the task dependencies. Guided by the compiler, the runtime system can do the thread assignment by using worker threads (mapped onto the hardware by the runtime system) to execute the code and use the data of the tasks given by the programmer. Tasks also allow more efficient load balancing between processing elements as thread migration is not needed (only some advanced OpenMP implementations actually do proper balancing by using, e.g., work stealing).

 

A remark: Actually since its version 3.0, OpenMP implements a memory consistency model called weak ordering as described in S. V. Adve and K. Gharachorloo, “Shared Memory Consistency Models: A Tutorial”, IEEE Computer, 29(12), pp.66-76, December 1996. However, for the T-106.5450 individual assignment 3/2014 such details are not needed.
