What is OpenMP

Course Handout (last update 23 March 2009)


These notes may be found at http://www.dartmouth.edu/~rc/classes/intro_matlab. The online version has many links to additional information and may be more up to date than the printed notes.

What is OpenMP?


[diagram of serial and parallel threads]

! Fortran example
call omp_set_num_threads(nthread)  ! requests "nthread" threads
!$OMP PARALLEL DO
DO i=1,N
   DO j=1,M
      .
      .
      .
   END DO
END DO
!$OMP END PARALLEL DO


/* C example */
omp_set_num_threads(nthread);  /* requests nthread threads */
#pragma omp parallel for
for (i=0; i<n; i++) {
   for (j=0; j<m; j++) {
      .
      .
      .
   }
}






Table of Contents

1. Memory Architectures
2. Pros and Cons of OpenMP
3. Approaches to Parallelism Using OpenMP
4. Example OpenMP Hello World
5. Loop Level Parallelization
6. Shared vs. Private Variables
7. Example of Parallelizing a Loop
8. Basic OpenMP Functions
9. How to Compile and Run an OpenMP Program
10. OpenMP Clauses
11. Reduction Clause
12. Thread Control
13. Thread Control (continued)
14. Scheduling of Parallel Loops
15. Parallel Regions Example
16. Parallel Regions Example Improved
17. Amdahl's Law
18. Example of Profiling Your Code
19. Strategies for Improving Performance
20. Hybrid OpenMP/MPI Programming
21. OpenMP v3.0
22. OpenMP Resources

(1)

Memory Architectures and Parallel Programming



Distributed Memory


[diagram of distributed memory layout]


Shared Memory

[diagram of shared memory architecture]

(2)

Pros and Cons Of OpenMP


Pros

Cons

Examples of Applications and Libraries That Use OpenMP

Applications
Libraries

(3)

Approaches to Parallelism Using OpenMP


Two main approaches, contrasted in the sketch below:

Loop-Level Parallelism

Parallel Regions Parallelism
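
The following sections develop each approach in full; as a preview, here is a minimal, illustrative sketch contrasting the two on the same array update (the array a and the loop bounds are invented for this example):

#include <omp.h>
#define N 1000

int main(void) {
   double a[N];
   int i;

   /* Loop-level: a single directive; the iterations are divided
      among the threads automatically */
   #pragma omp parallel for
   for (i = 0; i < N; i++)
      a[i] = 2.0 * i;

   /* Parallel regions: every thread executes the block and carves
      out its own iteration range from its rank */
   #pragma omp parallel
   {
      int tid = omp_get_thread_num();
      int nth = omp_get_num_threads();
      int istart =  tid      * N / nth;
      int iend   = (tid + 1) * N / nth;
      for (int j = istart; j < iend; j++)
         a[j] = a[j] + 1.0;
   }
   return 0;
}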

 



(4)

Example OpenMP Hello World

Example C Program
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {

   int nthreads, tid;

   /* Fork a team of threads giving them their own copies of variables */
   #pragma omp parallel private(nthreads, tid)
   {

      /* Obtain thread number */
      tid = omp_get_thread_num();
      printf("Hello World from thread = %d\n", tid);

      /* Only master thread does this */
      if (tid == 0)
      {
         nthreads = omp_get_num_threads();
         printf("Number of threads = %d\n", nthreads);
      }

   }  /* All threads join master thread and disband */

}

Example Fortran Program

      PROGRAM HELLO

      INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
     +        OMP_GET_THREAD_NUM

C     Fork a team of threads giving them their own copies of variables
!$OMP PARALLEL PRIVATE(NTHREADS, TID)

C     Obtain thread number
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Hello World from thread = ', TID

C     Only master thread does this
      IF (TID .EQ. 0) THEN
         NTHREADS = OMP_GET_NUM_THREADS()
         PRINT *, 'Number of threads = ', NTHREADS
      END IF

C     All threads join master thread and disband
!$OMP END PARALLEL

      END







(5)

Loop Level Parallelization

Requirements for Loop Parallelization

The key requirement is that the iterations be independent: no iteration may read a value that another iteration writes. The number of iterations must also be known on entry to the loop.

Example of Code with No Data Dependencies
! Fortran example

!$omp parallel do
do i = 1, n
   a(i) = b(i) + c(i)
enddo



/* C/C++ Example */

#pragma omp parallel for
for (i=1; i<=n; i++)
   a[i] = b[i] + c[i];


Example of Code With Data Dependencies

Here each iteration reads a(i-1), which the previous iteration wrote, so the iterations cannot safely run in parallel as written:

! Fortran example

do i = 2, 5
   a(i) = a(i) + a(i-1)
enddo

/* C/C++ example */

for (i=2; i<=5; i++)
   a[i] = a[i] + a[i-1];




(6)

Shared vs. Private Variables


Example of the private and shared Clauses

Here temp is written in every iteration, so each thread needs its own copy; n, a, b, and c can be shared.

! Fortran example

!$omp parallel do private(temp) shared(n,a,b,c)
do i = 1, n
   temp = 2.0*a(i)
   a(i) = temp
   b(i) = c(i)/temp
enddo


/* C/C++ Example */

#pragma omp parallel for private(temp) shared(n,a,b,c)
for (i=1; i<=n; i++) {
   temp = 2.0*a[i];
   a[i] = temp;
   b[i] = c[i]/temp;
}







(7)

Example of Parallelizing A Loop

Example C Program
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define N 100

int main (int argc, char *argv[]) {

   int nthreads, tid, i;
   float a[N], b[N], c[N];

   /* Some initializations */
   for (i=0; i < N; i++)
      a[i] = b[i] = i;

   #pragma omp parallel shared(a,b,c,nthreads) private(i,tid)
   {
      tid = omp_get_thread_num();
      if (tid == 0)
      {
         nthreads = omp_get_num_threads();
         printf("Number of threads = %d\n", nthreads);
      }

      printf("Thread %d starting...\n",tid);

      #pragma omp for
      for (i=0; i<N; i++)
      {
         c[i] = a[i] + b[i];
         printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
      }

   }  /* end of parallel section */

}

Example Fortran Program

      PROGRAM WORKSHARE1

      INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
     +        OMP_GET_THREAD_NUM, N, I
      PARAMETER (N=100)
      REAL A(N), B(N), C(N)

!     Some initializations
      DO I = 1, N
         A(I) = I
         B(I) = A(I)
      ENDDO

!$OMP PARALLEL SHARED(A,B,C) PRIVATE(I,TID)

      TID = OMP_GET_THREAD_NUM()
      IF (TID .EQ. 0) THEN
         NTHREADS = OMP_GET_NUM_THREADS()
         PRINT *, 'Number of threads =', NTHREADS
      END IF
      PRINT *, 'Thread',TID,' starting...'

!$OMP DO
      DO I = 1, N
         C(I) = A(I) + B(I)
         WRITE(*,100) TID,I,C(I)
 100     FORMAT(' Thread',I2,': C(',I3,')=',F8.2)
      ENDDO
!$OMP END DO NOWAIT

      PRINT *, 'Thread',TID,' done.'

!$OMP END PARALLEL

      END







(8)

Basic OpenMP Functions


omp_get_thread_num() - get the thread rank within the parallel region (0 to omp_get_num_threads()-1)
omp_set_num_threads(nthreads) - set the number of threads used in a parallel region
omp_get_num_threads() - get the number of threads used in a parallel region


C/C++
#include <omp.h>
int omp_get_thread_num();

#pragma omp parallel
{
   printf("Thread rank: %d\n", omp_get_thread_num());
}

Fortran

INTEGER FUNCTION OMP_GET_THREAD_NUM()
!$OMP PARALLEL
write(*,*)'Thread rank: ', OMP_GET_THREAD_NUM()
!$OMP END PARALLEL


Note that in general the ranks are not printed in order:
Thread rank: 2
Thread rank: 0
Thread rank: 3
Thread rank: 1




(9)

How to Compile and Run an OpenMP Program


Compiler                                    Compiler Option   Default # of threads (OMP_NUM_THREADS not set)
GNU (gcc, g++, gfortran)                    -fopenmp          as many threads as available cores
Intel (icc, ifort)                          -openmp           as many threads as available cores
Portland Group (pgcc, pgCC, pgf77, pgf90)   -mp               one thread


The environment variable OMP_NUM_THREADS sets the number of threads.


GNU Compiler Example
$ gcc -o omp_helloc -fopenmp omp_hello.c
$ export OMP_NUM_THREADS=2
$ ./omp_helloc
Hello World from thread = 0
Hello World from thread = 1
Number of threads = 2
$
$ gfortran -o omp_hellof -fopenmp omp_hello.f
$ export OMP_NUM_THREADS=4
$ ./omp_hellof
Hello World from thread = 2
Hello World from thread = 1
Hello World from thread = 3
Hello World from thread = 0

Number of threads = 4

Intel Compiler Example
$ icc -o omp_helloc -openmp omp_hello.c
omp_hello.c(22): (col. 1) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
$ export OMP_NUM_THREADS=3
$ ./omp_helloc
Hello World from thread = 0
Hello World from thread = 2
Hello World from thread = 1
Number of threads = 3
$
$ ifort -o omp_hellof -openmp omp_hello.f
omp_hello.f(20): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
$ export OMP_NUM_THREADS=2
$ ./omp_hellof
Hello World from thread = 0
Number of threads = 2
Hello World from thread = 1


Portland Group Example
[sas@discovery intro_openmp]$ pgcc -o omp_helloc -mp omp_hello.c
[sas@discovery intro_openmp]$ export OMP_NUM_THREADS=2
[sas@discovery intro_openmp]$ ./omp_helloc
Hello World from thread = 0
Number of threads = 2
Hello World from thread = 1
$
$ pgf90 -o omp_hellof -mp omp_hello.f
$ export OMP_NUM_THREADS=2
$ ./omp_hellof
Hello World from thread = 0
Number of threads = 2
Hello World from thread = 1






(10)

Other OpenMP Clauses

firstprivate


Example of the firstprivate clause

! Fortran example

j = jstart
!$omp parallel do firstprivate(j)
do i = 1, n
   if (i == 1 .or. i == n) then
      j = j + 1
   endif
   a(i) = a(i) + j
enddo


/* C/C++ Example */

j = jstart;
#pragma omp parallel for firstprivate(j)
for (i=1; i<=n; i++) {
   if (i == 1 || i == n)
      j = j + 1;
   a[i] = a[i] + j;
}


lastprivate

Example of the lastprivate clause

! Fortran example

!$omp parallel do lastprivate(x)
do i = 1, n
   x = sin(pi*dx*real(i))
   a(i) = exp(x)
enddo
lastx = x

/* C/C++ example */

#pragma omp parallel for lastprivate(x)
for (i=1; i<=n; i++) {
   x = sin( pi * dx * (float)i );
   a[i] = exp(x);
}
lastx = x;


ordered
! Fortran example

!$omp parallel do private(myval) ordered
do i = 1, n
   myval = do_lots_of_work(i)
!$omp ordered
   print*, i, myval
!$omp end ordered
enddo

/* C/C++ example */

#pragma omp parallel for private(myval) ordered
for (i=1; i<=n; i++) {
   myval = do_lots_of_work(i);
   #pragma omp ordered
   {
      printf("%d %d\n", i, myval);
   }
}





(11)

Reduction Operations


An example of a reduction operation is a summation:

! Fortran example

do i = 1, n
   sum = sum + a(i)
enddo


/* C/C++ Example */

for (i=1; i<=n; i++) {
   sum = sum + a[i];
}

How reduction works: each thread gets a private copy of the reduction variable, initialized to the identity value for the operator (0 for +); when the loop ends, the private copies are combined into the shared variable.

Example of the reduction clause

! Fortran example

!$omp parallel do reduction(+:sum)
do i = 1, n
   sum = sum + a(i)
enddo

/* C/C++ example */

#pragma omp parallel for reduction(+:sum)
for (i=1; i<=n; i++) {
   sum = sum + a[i];
}

C/C++ Reduction Operands

Operator   Initial value
+          0
*          1
-          0
&          ~0
|          0
^          0
&&         1
||         0
Fortran Reduction Operands

Operator   Initial Value
+          0
*          1
-          0
.AND.      .true.
.OR.       .false.
.IEOR.     0
.IOR.      0
.IAND.     all bits on
.EQV.      .true.
MIN        largest positive #
MAX        most negative #



(12)

Thread Control

Barrier

Each thread waits at the barrier until all threads have reached it.

! Fortran example

!$omp parallel private(myid,istart,iend)
call myrange(myid, nthreads, istart, iend)
do i = istart, iend
   a(i) = a(i) - b(i)
enddo
!$omp barrier
call dowork(a)
!$omp end parallel


/* C/C++ Example */

#pragma omp parallel private(myid, istart, iend)
{
   myrange(myid, nthreads, &istart, &iend);
   for (i=istart; i<=iend; i++) {
      a[i] = a[i] - b[i];
   }
   #pragma omp barrier
   dowork(a);
}

Master

A section of code that runs only on the master thread (the thread with rank 0).

! Fortran example

!$omp parallel private(myid, istart, iend)
call myrange(myid, nthreads, global_start, global_end, istart, iend)
do i = istart, iend
   a(i) = b(i)
enddo
!$omp barrier
!$omp master
write(21) a
!$omp end master
call do_work(istart, iend)
!$omp end parallel

/* C/C++ example */

#pragma omp parallel private(myid, istart, iend)
{
   myrange(myid, nthreads, global_start, global_end, &istart, &iend);
   for (i=istart; i<=iend; i++) {
      a[i] = b[i];
   }
   #pragma omp barrier
   #pragma omp master
   {
      n = global_end - global_start + 1;
      write_size = fwrite(a, 1, n, file_pointer);
   }
   do_work(istart, iend);
}

Single

Similar to master, except the block is executed by whichever thread reaches it first (not necessarily the master); the other threads wait at an implied barrier at the end of the block.
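
A minimal sketch of the single construct (the messages printed are invented for this illustration):

#include <stdio.h>
#include <omp.h>

int main(void) {
   #pragma omp parallel
   {
      /* Executed by exactly one thread, whichever arrives first;
         the rest wait at the implied barrier at the end of the block */
      #pragma omp single
      printf("Setup done by thread %d\n", omp_get_thread_num());

      /* All threads continue from here */
      printf("Thread %d past the single block\n", omp_get_thread_num());
   }
   return 0;
}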





(13)

Thread Control (continued) 

Critical


! Fortran example

the_max = 0.0
!$omp parallel private(myid, istart, iend)
call myrange(myid, nthreads, global_start, global_end, istart, iend)
call compute_a(a(istart:iend))
!$omp critical
the_max = max( maxval(a(istart:iend)), the_max )
!$omp end critical
call more_work_on_a(a)
!$omp end parallel


/* C/C++ Example */

the_max = 0.0;
#pragma omp parallel private(myid, istart, iend)
{
   myrange(myid, nthreads, global_start, global_end, &istart, &iend);
   nvals = iend-istart+1;
   compute_a(&a[istart], nvals);
   /* In C the critical section is the block that follows the pragma;
      there is no "end critical" directive */
   #pragma omp critical
   {
      the_max = max( maxval(&a[istart], nvals), the_max );
   }
   more_work_on_a(a);
}

Sections/Section


! Fortran example

!$omp parallel
!$omp sections

!$omp section
call init_field(field)

!$omp section
call check_grid(grid)

!$omp end sections
!$omp end parallel

/* C/C++ example */

#pragma omp parallel
{
   #pragma omp sections
   {
      #pragma omp section
      init_field(field);

      #pragma omp section
      check_grid(grid);
   }
}






(14)

Scheduling of Parallel Loops


Schedule:

N = number of iterations of the parallel loop
P = number of threads
C = user-specified chunk size

Name            Type     Chunk     Chunk Size           # of Chunks      Static or Dynamic   Compute Overhead
Simple Static   simple   no        N/P                  P                static              lowest
Interleaved     simple   yes       C                    N/C              static              low
Simple Dynamic  dynamic  optional  C                    N/C              dynamic             medium
Guided          guided   optional  decreasing from N/P  fewer than N/C   dynamic             high
Runtime         runtime  no        varies               varies           varies              varies


/* C/C++ Example */

#pragma omp parallel for schedule(dynamic) private(tmp, i, j, k)
for (i=0; i<Ndim; i++){
   for (j=0; j<Mdim; j++){
      tmp = 0.0;
      for (k=0; k<Pdim; k++){
         tmp += *(A+(i*Ndim+k)) * *(B+(k*Pdim+j));
      }
      *(C+(i*Ndim+j)) = tmp;
   }
}



(15)

Parallel Regions Example 



/* C/C++ Example */
/*

NAME: PI SPMD ... a simple version.

This program will numerically compute the integral of

4/(1+x*x)

from 0 to 1. The value of this integral is pi -- which
is great since it gives us an easy way to check the answer.

The program was parallelized using OpenMP and an SPMD
algorithm.

Note that this program will show low performance due to
false sharing. In particular, sum[id] is unique to each
thread, but adjacent values of this array share a cache line
causing cache thrashing as the program runs.

History: Written by Tim Mattson, 11/99.

*/

#include <stdio.h>
#include <omp.h>

#define MAX_THREADS 4

static long num_steps = 100000000;
double step;

int main ()
{
   int i,j;
   double pi, full_sum = 0.0;
   double start_time, run_time;
   double sum[MAX_THREADS];

   step = 1.0/(double) num_steps;

   for (j=1; j<=MAX_THREADS; j++) {

      omp_set_num_threads(j);
      full_sum = 0.0;
      start_time = omp_get_wtime();

      #pragma omp parallel
      {
         int i;
         int id = omp_get_thread_num();
         int numthreads = omp_get_num_threads();
         double x;

         sum[id] = 0.0;

         if (id == 0)
            printf(" num_threads = %d",numthreads);

         for (i=id; i<num_steps; i+=numthreads) {
            x = (i+0.5)*step;
            sum[id] = sum[id] + 4.0/(1.0+x*x);
         }
      }

      for (full_sum = 0.0, i=0; i<j; i++)
         full_sum += sum[i];

      pi = step * full_sum;
      run_time = omp_get_wtime() - start_time;
      printf("\n pi is %f in %f seconds %d threads \n",pi,run_time,j);
   }
}



(16)

Parallel Regions Example Improved



/* C/C++ Example */

/* NAME: PI SPMD final version without false sharing

This program will numerically compute the integral of

4/(1+x*x)

from 0 to 1. The value of this integral is pi -- which
is great since it gives us an easy way to check the answer.

History: Written by Tim Mattson, 11/99.
*/

#include <stdio.h>
#include <omp.h>

#define MAX_THREADS 4

static long num_steps = 100000000;
double step;

int main ()
{
   int i,j;
   double pi, full_sum = 0.0;
   double start_time, run_time;

   step = 1.0/(double) num_steps;

   for (j=1; j<=MAX_THREADS; j++) {
      omp_set_num_threads(j);
      full_sum = 0.0;
      start_time = omp_get_wtime();

      #pragma omp parallel private(i)
      {
         int id = omp_get_thread_num();
         int numthreads = omp_get_num_threads();
         double x;

         /* partial_sum is a scalar local to each thread, so there is
            no array of per-thread sums to cause false sharing */
         double partial_sum = 0;

         #pragma omp single
         printf(" num_threads = %d",numthreads);

         for (i=id; i<num_steps; i+=numthreads) {
            x = (i+0.5)*step;
            partial_sum += 4.0/(1.0+x*x);
         }

         #pragma omp critical
         full_sum += partial_sum;
      }

      pi = step * full_sum;
      run_time = omp_get_wtime() - start_time;
      printf("\n pi is %f in %f seconds %d threads \n",pi,run_time,j);
   }
}


(17)


How to Know When and Where to Add OpenMP Directives


1. Profile your code.
2. Determine the sections where the most time is spent, and see whether they can be parallelized.
3. Use Amdahl's Law to estimate the potential speedup.
4. Add directives to the code and measure run times for serial and parallel runs.



Amdahl's Law

Speedup(N) = 1 / ((1-p) + p/N)

where p = the portion of the code that can be made parallel and
N = the number of processors. For example:

p    N         Speedup
.5   2         1.3
.5   4         1.6
.5   infinite  2
.9   2         1.81
.9   4         3.07
.9   infinite  10
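
The table can be reproduced by evaluating the formula directly; a minimal sketch (a large N stands in for "infinite"):

#include <stdio.h>

/* Amdahl's law: speedup(N) = 1 / ((1-p) + p/N) */
double speedup(double p, double n) {
   return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
   double ps[] = {0.5, 0.9};
   double ns[] = {2.0, 4.0, 1.0e9};   /* 1e9 approximates "infinite" */
   for (int i = 0; i < 2; i++)
      for (int j = 0; j < 3; j++)
         printf("p=%.1f  N=%.0f  speedup=%.2f\n",
                ps[i], ns[j], speedup(ps[i], ns[j]));
   return 0;
}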






(18)

Profiling Your Code

Steps for Profiling

1. Compile with profiling enabled (for the Intel compiler shown below, -g -pg).
2. Run the program; this produces a gmon.out file.
3. Run gprof on the executable to see where the time goes.



[sas@discovery]$ icc -g -pg -openmp-stubs -o matmul_par matmul_par.c
matmul_par.c(60): warning #161: unrecognized #pragma
#pragma omp parallel for private(tmp, i, j, k)
^
[sas@discovery]$ ./matmul_par
Order 1000 multiplication in 13.498900 seconds
1 threads
Hey, it worked
all done

[sas@discovery] ls -ls gmon.out
4 -rw-rw-r-- 1 sas sas 1317 Feb 17 11:27 gmon.out

[sas@discovery] gprof -l matmul_par >gprof.out

[sas@discovery] more gprof.out

Flat profile:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
94.39 18.06 18.06 main (matmul_par.c:68 @ 400aea)
5.72 19.15 1.09 main (matmul_par.c:66 @ 400b22)
0.05 19.16 0.01 main (matmul_par.c:50 @ 4009f8)
0.05 19.17 0.01 main (matmul_par.c:54 @ 400a57)
0.05 19.18 0.01 main (matmul_par.c:70 @ 400b30)
0.05 19.19 0.01 main (matmul_par.c:85 @ 400c39)
.
.





Portion of matmul_par.c where the most time is spent

60 #pragma omp parallel for private(tmp, i, j, k)
61 for (i=0; i<Ndim; i++){
62 for (j=0; j<Mdim; j++){
63
64 tmp = 0.0;
65
66 for(k=0;k<Pdim;k++){
67 /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
68 tmp += *(A+(i*Ndim+k)) * *(B+(k*Pdim+j));
69 }
70 *(C+(i*Ndim+j)) = tmp;
71 }
72 }
 
Results of Parallelizing the Code With OpenMP

# of threads   Time (secs.)   Speedup
1              16.28          1.00
2               8.326         1.95
4               4.268         3.81
8               2.137         7.61





(19)

Basic Strategies to Improve Performance


Things to check when parallelized loops do not perform well:

(20)

Hybrid MPI/OpenMP Programs




Example of a hybrid program
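
A minimal sketch of the usual hybrid pattern, with MPI ranks across nodes and an OpenMP thread team inside each rank (the printed message is illustrative; compile with an MPI compiler wrapper plus the OpenMP flag, e.g. mpicc -fopenmp):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[]) {
   int rank, nprocs;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

   /* Each MPI process forks its own team of OpenMP threads */
   #pragma omp parallel
   {
      printf("MPI rank %d of %d, thread %d of %d\n",
             rank, nprocs,
             omp_get_thread_num(), omp_get_num_threads());
   }

   MPI_Finalize();
   return 0;
}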

Strategies for Hybrid Parallelization



(21)

OpenMP v3.0


History of OpenMP




OpenMP Tasks:





/* C example */

#pragma omp parallel
{
   #pragma omp single
   {
      p = listhead;
      while (p) {
         /* create a task for each element of the list */
         #pragma omp task
         process(p);   /* process list element p */
         p = next(p);
      }
   }
}






(22)

Resources


OpenMP Tutorial, SC08, taught by Tim Mattson and Larry Meadows, Intel Corporation

"OpenMP Course" , Cyberinfrastructure Tutor at NCSA <http://ci-tutor.ncsa.uiuc.edu/login.php>

Parallel Programming With OpenMP Tutorial at OSC <http://www.osc.edu/supercomputing/training/openmp/big/fsld.002.html>

OpenMP Tutorial at LLNL <https://computing.llnl.gov/tutorials/openMP/exercise.html>




What is OpenMP: Course Handout (last update 23 March 2009) © Dartmouth College http://www.dartmouth.edu/~rc/classes/intro_matlab