What is OpenMP
Course Handout: (last update 23 March 2009)
These notes may be found at http://www.dartmouth.edu/~rc/classes/intro_matlab. The online version has many links to additional information and may be more up to date than the printed notes
What is OpenMP?
- Used for multi-threaded parallel processing
- Used on shared-memory multi-processor (multi-core) computers
- Part of the program runs as a single thread and part is multi-threaded
- Has 3 components
- directives that can be put into C/C++ or Fortran programs
- a runtime library for setting and querying parallel parameters (e.g. # of threads)
- environment variables that define runtime parallel parameters (e.g. # of threads)
! Fortran example
call omp_set_num_threads(nthread)   ! requests "nthread" threads
!$OMP PARALLEL DO
DO i=1,N
   DO j=1,M
      ...
   END DO
END DO
!$OMP END PARALLEL DO

/* C/C++ example */
omp_set_num_threads(nthread);   /* requests nthread threads */
#pragma omp parallel for
for (i=0; i<n; i++) {
   for (j=0; j<m; j++) {
      ...
   }
}
(1)
Memory Architectures and Parallel Programming
Distributed Memory
- each processor has its own memory
- parallel programming by message passing (MPI)
Shared Memory
- processors share memory
- two parallel programming approaches
- message passing (MPI)
- directives-based interface - OpenMP
(2)
Pros and Cons Of OpenMP
Pros
- Prevalence of multi-core computers
- Requires less code modification than using MPI
- OpenMP directives can be treated as comments if OpenMP is not available
- Directives can be added incrementally
Cons
- OpenMP codes cannot be run on distributed-memory computers (the exception is Intel's Cluster OpenMP)
- Requires a compiler that supports OpenMP (most do)
- limited by the number of processors available on a single computer
- often have lower parallel efficiency than equivalent MPI codes
- rely more on parallelizable loops
- tend to have a higher % of serial code
- Amdahl's Law - if 50% of the code is serial, the wall clock time can at best be halved, no matter how many processors are used
Examples of Applications and Libraries That Use OpenMP
Applications
Libraries
- Intel Math Kernel Library (MKL)
- AMD Core Math Library (ACML)
- GNU Scientific Library (GSL)
(3)
Approaches to Parallelism Using OpenMP
Two main approaches:
- loop-level
- parallel regions
Loop-Level Parallelism
- sometimes called fine-grained parallelism
- individual loops are parallelized
- each thread is assigned a unique range of the loop index
- execution starts on a single serial thread
- multiple threads are spawned inside a parallel loop
- after the parallel loop, execution is serial again
- relatively easy to implement
Parallel Regions Parallelism
- sometimes called coarse-grained parallelism
- any section of code can be parallelized (not just loops)
- the thread identifier is used to distribute the work (see the sketch below)
- execution starts on a single serial thread
- multiple threads are started for parallel regions (not necessarily at a loop)
- execution ends on a single serial thread
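As a rough illustration (a sketch added here, not part of the original handout), each thread can compute its own index range from its rank and the total thread count:

/* Sketch: distribute a loop over threads using the thread identifier */
#include <omp.h>
#include <stdio.h>

int main(void) {
    int n = 1000;
    #pragma omp parallel
    {
        int tid      = omp_get_thread_num();   /* this thread's rank        */
        int nthreads = omp_get_num_threads();  /* total threads in the team */
        int istart   = (tid * n) / nthreads;   /* block distribution        */
        int iend     = ((tid + 1) * n) / nthreads;
        printf("thread %d handles indices %d to %d\n", tid, istart, iend - 1);
    }
    return 0;
}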
(4)
Example OpenMP Hello World
Example C Program

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
  int nthreads, tid;

  /* Fork a team of threads giving them their own copies of variables */
  #pragma omp parallel private(nthreads, tid)
  {
    /* Obtain thread number */
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);

    /* Only master thread does this */
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
  }  /* All threads join master thread and disband */
}

Example Fortran Program

      PROGRAM HELLO
      INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
     +        OMP_GET_THREAD_NUM

C     Fork a team of threads giving them their own copies of variables
!$OMP PARALLEL PRIVATE(NTHREADS, TID)

C     Obtain thread number
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Hello World from thread = ', TID

C     Only master thread does this
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Number of threads = ', NTHREADS
      END IF

C     All threads join master thread and disband
!$OMP END PARALLEL
      END
(5)
Loop-Level Parallelization
Requirements for Loop Parallelization
- no dependencies between loop iterations
- an element of an array is assigned to by at most one iteration
- no loop iteration reads array elements modified by any other iteration
- because of the overhead of parallelization, use it only on loops where individual iterations take a long time
Example of Code with No Data Dependencies
! Fortran example
!$omp parallel do
do i = 1, n
a(i) = b(i) + c(i)
enddo
/* C/C++ Example */
#pragma omp parallel for
for(i=1; i<=n; i++)
a[i] = b[i] + c[i];
Example of Code With Data Dependencies
! Fortran example
do i = 2, 5
a(i) = a(i) + a(i-1)
enddo
/* C/C++ example */
for(i=2; i<=5; i++)
a[i] = a[i] + a[i-1];
(6)
Shared vs. Private Variables
- By default all variables in a loop share the same address space
- All threads can modify and access all variables (except the loop index)
- Can result in incorrect results
- Can use shared and private clauses with parallel for or parallel do
Example of Code with No Data Dependencies
!Fortran example
!$omp parallel do private(temp) shared(n,a,b,c)
do i = 1, n
temp = 2.0*a(i)
a(i) = temp
b(i) = c(i)/temp
enddo
/* C/C++ Example */
#pragma omp parallel for private(temp) shared(n,a,b,c)
for(i=1; i<=n; i++){
   temp = 2.0*a[i];
   a[i] = temp;
   b[i] = c[i]/temp;
}
(7)
Example of Parallelizing A Loop
Example C Program

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define N 100

int main (int argc, char *argv[]) {
  int nthreads, tid, i;
  float a[N], b[N], c[N];

  /* Some initializations */
  for (i=0; i < N; i++)
    a[i] = b[i] = i;

  #pragma omp parallel shared(a,b,c,nthreads) private(i,tid)
  {
    tid = omp_get_thread_num();
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
    }
    printf("Thread %d starting...\n", tid);

    #pragma omp for
    for (i=0; i<N; i++) {
      c[i] = a[i] + b[i];
      printf("Thread %d: c[%d]= %f\n", tid, i, c[i]);
    }
  }  /* end of parallel section */
}

Example Fortran Program

      PROGRAM WORKSHARE1
      INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
     +        OMP_GET_THREAD_NUM, N, I
      PARAMETER (N=100)
      REAL A(N), B(N), C(N)

!     Some initializations
      DO I = 1, N
        A(I) = I
        B(I) = A(I)
      ENDDO

!$OMP PARALLEL SHARED(A,B,C) PRIVATE(I,TID)
      TID = OMP_GET_THREAD_NUM()
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Number of threads =', NTHREADS
      END IF
      PRINT *, 'Thread',TID,' starting...'

!$OMP DO
      DO I = 1, N
        C(I) = A(I) + B(I)
        WRITE(*,100) TID,I,C(I)
 100    FORMAT(' Thread',I2,': C(',I3,')=',F8.2)
      ENDDO
!$OMP END DO NOWAIT

      PRINT *, 'Thread',TID,' done.'
!$OMP END PARALLEL
      END
(8)
Basic OpenMP Functions
omp_get_thread_num() - get the thread rank within a parallel region (0 to omp_get_num_threads()-1)
omp_set_num_threads(nthreads) - set the number of threads used in a parallel region
omp_get_num_threads() - get the number of threads used in a parallel region
A combined example of setting and querying the thread count appears after the output below.
C/C++
#include <omp.h>
int omp_get_thread_num();
#pragma omp parallel
{
printf("Thread rank: %d\n", omp_get_thread_num());
}
Fortran
INTEGER FUNCTION OMP_GET_THREAD_NUM()
!$OMP PARALLEL
write(*,*)'Thread rank: ', OMP_GET_THREAD_NUM()
!$OMP END PARALLEL
Note that in general the ranks are not printed in order.
Thread rank: 2
Thread rank: 0
Thread rank: 3
Thread rank: 1
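The calls listed above can be combined; the short sketch below (added here, not from the original handout) requests a thread count and then reports it from inside a parallel region:

/* Sketch: set and then query the number of threads */
#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_num_threads(4);                 /* request 4 threads    */
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)      /* only thread 0 prints */
            printf("Number of threads = %d\n", omp_get_num_threads());
    }
    return 0;
}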
(9)
How to Compile and Run an OpenMP Program
Compiler | Compiler Option | Default # of threads (OMP_NUM_THREADS not set)
GNU (gcc, g++, gfortran) | -fopenmp | as many threads as available cores
Intel (icc, ifort) | -openmp | as many threads as available cores
Portland Group (pgcc, pgCC, pgf77, pgf90) | -mp | one thread
Environment Variable
OMP_NUM_THREADS sets the number of threads
GNU Compiler Example
$ gcc -o omp_helloc -fopenmp omp_hello.c
$ export OMP_NUM_THREADS=2
$ ./omp_helloc
Hello World from thread = 0
Hello World from thread = 1
Number of threads = 2
$
$ gfortran -o omp_hellof -fopenmp omp_hello.f
$ export OMP_NUM_THREADS=4
$ ./omp_hellof
Hello World from thread = 2
Hello World from thread = 1
Hello World from thread = 3
Hello World from thread = 0
Number of threads = 4
Intel Compiler Example
$ icc -o omp_helloc -openmp omp_hello.c
omp_hello.c(22): (col. 1) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
$ export OMP_NUM_THREADS=3
$ ./omp_helloc
Hello World from thread = 0
Hello World from thread = 2
Hello World from thread = 1
Number of threads = 3
$
$ ifort -o omp_hellof -openmp omp_hello.f
omp_hello.f(20): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
$ export OMP_NUM_THREADS=2
$ ./omp_hellof
Hello World from thread = 0
Number of threads = 2
Hello World from thread = 1
Portland Group Example
[sas@discovery intro_openmp]$ pgcc -o omp_helloc -mp omp_hello.c
[sas@discovery intro_openmp]$ export OMP_NUM_THREADS=2
[sas@discovery intro_openmp]$ ./omp_helloc
Hello World from thread = 0
Number of threads = 2
Hello World from thread = 1
$
$ pgf90 -o omp_hellof -mp omp_hello.f
$ export OMP_NUM_THREADS=2
$ ./omp_hellof
Hello World from thread = 0
Number of threads = 2
Hello World from thread = 1
(10)
Other OpenMP Clauses
- firstprivate
- lastprivate
- ordered
firstprivate
- initializes a private variable with its value from the serial part of the code
- the private clause does not initialize the variable
Example of firstprivate clause
! Fortran example
j = jstart
!$omp parallel do firstprivate(j)
do i = 1, n
if(i == 1 .or. i == n) then
j = j + 1
endif
a(i) = a(i) + j
enddo
/* C/C++ Example */
j = jstart;
#pragma omp parallel for firstprivate(j)
for(i=1; i<=n; i++){
   if(i == 1 || i == n)
      j = j + 1;
   a[i] = a[i] + j;
}
lastprivate
- the thread that executes the final loop iteration copies its value back to the master (serial) thread
- this gives the same result as serial execution
Example of lastprivate clause
! Fortran example
!$omp parallel do lastprivate(x)
do i = 1, n
x = sin(pi*dx*real(i))
a(i) = exp(x)
enddo
lastx = x
/* C/C++ example */
#pragma omp parallel for lastprivate(x)
for(i=1; i<=n; i++){
   x = sin( pi * dx * (float)i );
   a[i] = exp(x);
}
lastx = x;
ordered
- used when part of the loop must execute in serial order
- requires the ordered clause plus an ordered directive
! Fortran example
!$omp parallel do private(myval) ordered
do i = 1, n
myval = do_lots_of_work(i)
!$omp ordered
print*, i, myval
!$omp end ordered
enddo
/* C/C++ example */
#pragma omp parallel for private(myval) ordered
for(i=1; i<=n; i++){
   myval = do_lots_of_work(i);
   #pragma omp ordered
   {
      printf("%d %d\n", i, myval);
   }
}
(11)
Reduction Operations
An example of a reduction operation is a summation:
! Fortran example
do i = 1, n
sum = sum + a(i)
enddo
/* C/C++ Example */
for(i=1; i<=n; i++){
sum = sum + a[i];
}
How reduction works:
- sum is the reduction variable
- it cannot be declared shared
- threads would overwrite the value of sum
- it cannot be declared private
- private variables do not persist outside of the parallel region
- the specified reduction operation is performed on the individual values from each thread
Example of reduction clause
! Fortran example
!$omp parallel do reduction(+:sum)
do i = 1, n
sum = sum + a(i)
enddo
/* C/C++ example */
#pragma omp parallel for reduction(+:sum)
for(i=1; i<=n; i++){
   sum = sum + a[i];
}
C/C++ Reduction Operands
Operator | Initial value
+ | 0
* | 1
- | 0
& | ~0
| | 0
^ | 0
&& | 1
|| | 0
Fortran Reduction Operands
Operator | Initial value
+ | 0
* | 1
- | 0
.AND. | .true.
.OR. | .false.
IEOR | 0
IOR | 0
IAND | all bits on
.EQV. | .true.
MIN | largest positive number
MAX | most negative number
(12)
Thread Control
Barrier
Each thread waits at the barrier until all threads reach the barrier.
! Fortran example
!$omp parallel private(myid,istart,iend)
call myrange(myid, nthreads, istart, iend)
do i = istart, iend
a(i) = a(i) - b(i)
enddo
!$omp barrier
call dowork(a)
!$omp end parallel
/* C/C++ Example */
#pragma omp parallel private(myid, istart, iend)
{
myrange(myid, nthreads, &istart, &iend);
for(i=istart; i<=iend; i++){
a[i] = a[i] - b[i];
}
#pragma omp barrier
dowork(a);
}
Master
A section of code that runs only on the master thread (the thread with rank=0).
! Fortran example
!$omp parallel private(myid, istart, iend)
call myrange(myid, nthreads, global_start, global_end, istart, iend)
do i = istart, iend
a(i) = b(i)
enddo
!$omp barrier
!$omp master
write(21) a
!$omp end master
call do_work(istart, iend)
!$omp end parallel
/* C/C++ example */
#pragma omp parallel private(myid, istart, iend)
{
myrange(myid, nthreads, global_start, global_end, &istart, &iend);
for(i=istart; i<=iend; i++){
a[i] = b[i];
}
#pragma omp barrier
#pragma omp master
{
n = global_end - global_start + 1;
write_size = fwrite(a, 1, n, file_pointer);
}
do_work(istart, iend);
}
Single
Similar to master, except that the block is executed by whichever thread reaches it first (the other threads wait at the implicit barrier at the end of the block); a minimal sketch follows.
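A minimal, self-contained sketch (added here, not from the handout) of the single construct:

/* Sketch: the single construct */
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        #pragma omp single
        printf("set-up done once by thread %d\n", omp_get_thread_num());

        /* all threads continue here after the single block completes */
        printf("thread %d doing its share of the work\n", omp_get_thread_num());
    }
    return 0;
}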
(13)
Thread Control (continued)
Critical - Only one thread executes a specified section of the code at a time
- Threads can execute in any order
- Similar to the ORDERED directive, except that ordered specifies that the threads execute the section in numerical (loop-iteration) order
! Fortran example
the_max = 0.0
!$omp parallel private(myid, istart, iend)
call myrange(myid, nthreads, global_start, global_end, istart, iend)
call compute_a(a(istart:iend))
!$omp critical
the_max = max( maxval(a(istart:iend)), the_max )
!$omp end critical
call more_work_on_a(a)
!$omp end parallel
/* C/C++ Example */
the_max = 0.0;
#pragma omp parallel private(myid, istart, iend, nvals)
{
   myrange(myid, nthreads, global_start, global_end, &istart, &iend);
   nvals = iend-istart+1;
   compute_a(&a[istart], nvals);
   #pragma omp critical
   {
      the_max = max( maxval(&a[istart], nvals), the_max );
   }
   more_work_on_a(a);
}
Sections/Section
- each section of code is run by only one thread
- different sections are executed in parallel
! Fortran example
!$omp parallel
!$omp sections
!$omp section
call init_field(field)
!$omp section
call check_grid(grid)
!$omp end sections
!$omp end parallel
/* C/C++ example */
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
init_field(field);
#pragma omp section
check_grid(grid);
}
}
(14)
Scheduling of Parallel Loops
Schedule:
- the way in which parallel loop iterations are distributed among the threads
- the best choice depends upon whether each loop iteration yields the same amount of work
- schedule clause: schedule(type[, chunk_size])
- Static schedule - each thread is assigned a fixed number of chunks (the default)
- chunks are groups of iterations (in contiguous order) that are assigned to threads
- Dynamic schedule - each thread grabs a chunk of iterations until all iterations are done
- faster threads are assigned more iterations
- Interleaved - chunks are assigned to the processors in a round-robin manner
- Guided - a variant of dynamic scheduling where successive chunks get smaller
- Runtime - the schedule is determined at run time from an environment variable, e.g.
  export OMP_SCHEDULE="dynamic,100"
N = number of iterations of the parallel loop
P = number of threads
C = user-specified chunk size
Name | Type | Chunk | Chunk Size | # of Chunks | Static or Dynamic | Compute Overhead
Simple Static | simple | no | N/P | P | static | lowest
Interleaved | simple | yes | C | N/C | static | low
Simple Dynamic | dynamic | optional | C | N/C | dynamic | medium
Guided | guided | optional | decreasing from N/P | fewer than N/C | dynamic | high
Runtime | runtime | no | varies | varies | varies | varies
/* C/C++ Example */
#pragma omp parallel for schedule(dynamic) private(tmp, i, j, k)
for (i=0; i<Ndim; i++){
for (j=0; j<Mdim; j++){
tmp = 0.0;
for(k=0;k<Pdim;k++){
tmp += *(A+(i*Ndim+k)) * *(B+(k*Pdim+j));
}
*(C+(i*Ndim+j)) = tmp;
}
}
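As a complement to the fixed schedule above, the sketch below (added here, not from the handout) uses schedule(runtime) so the schedule can be chosen at run time with OMP_SCHEDULE, e.g. export OMP_SCHEDULE="guided,100":

/* Sketch: let OMP_SCHEDULE pick the loop schedule at run time */
#include <omp.h>
#include <stdio.h>
#define NVALS 100000

int main(void) {
    static double a[NVALS];
    int i;
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < NVALS; i++) {
        a[i] = 0.5 * (double)i;     /* trivially parallel work */
    }
    printf("a[NVALS-1] = %f\n", a[NVALS - 1]);
    return 0;
}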
(15)
Parallel Regions Example
- Coarse-grained parallelism (not loop parallelism)
- uses the parallel pragma
- calls runtime library functions to get and set the number of threads
- cyclic distribution of the loop iterations among the threads
/* C/C++ Example */
/*
NAME: PI SPMD ... a simple version.
This program will numerically compute the integral of
4/(1+x*x)
from 0 to 1. The value of this integral is pi -- which
is great since it gives us an easy way to check the answer.
The program was parallelized using OpenMP and an SPMD
algorithm.
Note that this program will show low performance due to
false sharing. In particular, sum[id] is unique to each
thread, but adjacent values of this array share a cache line
causing cache thrashing as the program runs.
History: Written by Tim Mattson, 11/99.
*/
#include <stdio.h>
#include <omp.h>
#define MAX_THREADS 4
static long num_steps = 100000000;
double step;
int main ()
{
int i,j;
double pi, full_sum = 0.0;
double start_time, run_time;
double sum[MAX_THREADS];
step = 1.0/(double) num_steps;
for (j=1;j<=MAX_THREADS ;j++) {
omp_set_num_threads(j);
full_sum=0.0;
start_time = omp_get_wtime();
#pragma omp parallel
{
int i;
int id = omp_get_thread_num();
int numthreads = omp_get_num_threads();
double x;
sum[id] = 0.0;
if (id == 0)
printf(" num_threads = %d",numthreads);
for (i=id;i< num_steps; i+=numthreads){
x = (i+0.5)*step;
sum[id] = sum[id] + 4.0/(1.0+x*x);
}
}
for(full_sum = 0.0, i=0;i<j;i++)
full_sum += sum[i];
pi = step * full_sum;
run_time = omp_get_wtime() - start_time;
printf("\n pi is %f in %f seconds %d threads \n",pi,run_time,j);
}
}
(16)
Parallel Regions Example Improved
- improved version of the previous example
- false sharing is avoided by accumulating into a private scalar (partial_sum)
- barrier construct to make sure all threads are finished
- single construct so that only one thread prints the number of threads
- critical construct so that only one thread at a time adds its local sum into the single global sum
/* C/C++ Example */
/* NAME: PI SPMD final version without false sharing
This program will numerically compute the integral of
4/(1+x*x)
from 0 to 1. The value of this integral is pi -- which
is great since it gives us an easy way to check the answer.
History: Written by Tim Mattson, 11/99.
*/
#include <stdio.h>
#include <omp.h>
#define MAX_THREADS 4
static long num_steps = 100000000;
double step;
int main ()
{
int i,j;
double pi, full_sum = 0.0;
double start_time, run_time;
double sum[MAX_THREADS];
step = 1.0/(double) num_steps;
for(j=1;j<=MAX_THREADS ;j++){
omp_set_num_threads(j);
full_sum = 0.0;
start_time = omp_get_wtime();
#pragma omp parallel private(i)
{
int id = omp_get_thread_num();
int numthreads = omp_get_num_threads();
double x;
double partial_sum = 0;
#pragma omp single
printf(" num_threads = %d",numthreads);
for (i=id;i< num_steps; i+=numthreads){
x = (i+0.5)*step;
partial_sum += 4.0/(1.0+x*x);
}
#pragma omp critical
full_sum += partial_sum;
}
pi = step * full_sum;
run_time = omp_get_wtime() - start_time;
printf("\n pi is %f in %f seconds %d threds \n ",pi,run_time,j);
}
}
(17)
How to Know When and Where to Add OpenMP Directives
1. profile your code
2. determine the sections where the most time is spent and see if they can be parallelized
3. use Amdahl's Law to estimate the potential speedup
4. put directives in the code and measure run times for serial and parallel runs
Amdahl's Law
- speedup is a function of the fraction of the code that can be parallelized and the number of processors
- non-parallel sections do not gain performance

Speedup(N) = 1 / ((1-p) + p/N)

where p = portion of the code that can be made parallel and N = number of processors. For example:
p | N | Speedup
.5 | 2 | 1.3
.5 | 4 | 1.6
.5 | infinite | 2
.9 | 2 | 1.81
.9 | 4 | 3.07
.9 | infinite | 10
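As a worked check of one table entry: with p = 0.9 and N = 4, Speedup = 1/((1-0.9) + 0.9/4) = 1/(0.1 + 0.225) = 1/0.325 ≈ 3.07, matching the table above.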
(18)
Profiling Your Code
Steps for Profiling
- compile with the "-g -pg" options (no optimization)
- run the program in serial mode to produce the file gmon.out
- use gprof to display the profile data and the lines where the program spends the most time
[sas@discovery]$ icc -g -pg -openmp-stubs -o matmul_par matmul_par.c
matmul_par.c(60): warning #161: unrecognized #pragma
#pragma omp parallel for private(tmp, i, j, k)
^
[sas@discovery]$ ./matmul_par
Order 1000 multiplication in 13.498900 seconds
1 threads
Hey, it worked
all done
[sas@discovery] ls -ls gmon.out
4 -rw-rw-r-- 1 sas sas 1317 Feb 17 11:27 gmon.out
[sas@discovery] gprof -l matmul_par >gprof.out
[sas@discovery] more gprof.out
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
94.39 18.06 18.06 main (matmul_par.c:68 @ 400aea)
5.72 19.15 1.09 main (matmul_par.c:66 @ 400b22)
0.05 19.16 0.01 main (matmul_par.c:50 @ 4009f8)
0.05 19.17 0.01 main (matmul_par.c:54 @ 400a57)
0.05 19.18 0.01 main (matmul_par.c:70 @ 400b30)
0.05 19.19 0.01 main (matmul_par.c:85 @ 400c39)
.
.
Portion of matmul_par.c where the most time is spent
60 #pragma omp parallel for private(tmp, i, j, k)
61 for (i=0; i<Ndim; i++){
62 for (j=0; j<Mdim; j++){
63
64 tmp = 0.0;
65
66 for(k=0;k<Pdim;k++){
67 /* C(i,j) = sum(over k) A(i,k) * B(k,j) */
68 tmp += *(A+(i*Ndim+k)) * *(B+(k*Pdim+j));
69 }
70 *(C+(i*Ndim+j)) = tmp;
71 }
72 }
Results of Parallelizing the Code With OpenMP
# of threads | Time (secs.) | Speedup
1 | 16.28 | 1.00
2 | 8.326 | 1.95
4 | 4.268 | 3.81
8 | 2.137 | 7.61
(19)
Basic Strategies to Improve Performance
Things to check for when parallelized loops do not perform well:
- parallel startup costs (incurred at the parallel directive) - avoid parallelizing small loops
- load imbalance (threads have different amounts of work)
- unnecessary synchronization (for example, critical or ordered directives)
- non-cache-friendly programs
- memory contention - one thread repeatedly updates a cache line that other threads use for input
- false sharing - two or more threads repeatedly update the same cache line
- use private variables where possible
- change the size and arrangement of arrays to avoid cache misses
(20)
Hybrid MPI/OpenMP Programs
- the MPI/OpenMP paradigm can work well for clusters of SMP computers
- use MPI across nodes and OpenMP within nodes
- avoids the extra communication overhead of using MPI within the same node
- works well for problems with two-level parallelism
- OpenMP works well for fine-grained parallelism
- MPI works well for coarse-grained parallelism
Strategies for Hybrid Parallelization
- starting from serial code, decompose first with MPI and then add OpenMP
- the simplest and least error-prone method is to use MPI outside parallel regions and allow only the master thread to communicate between MPI tasks (a minimal skeleton is sketched below)
- MPI can also be used within a parallel region if the MPI implementation is thread-safe
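A minimal hybrid skeleton (a sketch added here, not from the handout, assuming an MPI library is installed): each MPI task parallelizes its local work with OpenMP, and all MPI calls stay outside the OpenMP parallel region:

/* Sketch: hybrid MPI + OpenMP, MPI calls outside the parallel region */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, i;
    int n = 1000000;
    double local_sum = 0.0, global_sum = 0.0;

    MPI_Init(&argc, &argv);                   /* typically one task per node */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* OpenMP threads share the work assigned to this MPI task */
    #pragma omp parallel for reduction(+:local_sum)
    for (i = rank; i < n; i += size) {        /* cyclic split across tasks   */
        local_sum += 1.0 / (double)(i + 1);
    }

    /* serial part of the task: only the master thread communicates */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}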
(21)
OpenMP v3.0
History of OpenMP
- OpenMP v2.0 for C: 2002
- OpenMP v2.0 for Fortran: 2000
- OpenMP v2.5: 2005
- OpenMP v3.0: 2008 - tasks are the most important addition
OpenMP Tasks:
- allow you to parallelize irregular problems
- unbounded loops (e.g. while loops)
- recursive algorithms
- unstructured parallelism
- are dynamically generated units of work
- a task can be executed by any thread in the team, in parallel with other tasks
- execution can be immediate or deferred until later
- execution might be suspended and continued later by the same or a different thread
- parallel tasks can be nested within parallel loops or sections
/* C example */
#pragma omp parallel
{
#pragma omp single
{
p=listhead;
while (p) {
/* create a task for each element of the list */
#pragma omp task
process(p); /* process list element p */
p = next(p);
}
}
}
(22)
Resources
OpenMP Tutorial, SC08, taught by Tim Mattson and Larry Meadows, Intel Corporation
"OpenMP Course" , Cyberinfrastructure Tutor at NCSA
<http://ci-tutor.ncsa.uiuc.edu/login.php>
Parallel Programming With OpenMP Tutorial at OSC
<http://www.osc.edu/supercomputing/training/openmp/big/fsld.002.html>
OpenMP Tutorial at LLNL
<https://computing.llnl.gov/tutorials/openMP/exercise.html>
What is OpenMP: Course Handout
(last update 23 March 2009) ©Dartmouth College