SOLUTION MANUAL
Multicore & GPU Programming: An Integrated Approach, 2e
Gerassimos Barlas
Contents
1 Introduction
2 Multicore and Parallel Program Design
3 Threads and Concurrency in standard C++
4 Parallel data structures
5 Distributed memory programming
6 GPU Programming
7 GPU and Accelerator Programming: OpenCL
8 Shared-memory programming: OpenMP
9 The Thrust Template Library
10 High-level multi-threaded programming with the Qt library
11 Load Balancing
Chapter 1
Introduction
Exercises
1. Study one of the top 10 most powerful supercomputers in the world. Discover:
What kind of operating system does it run?
How many CPUs/GPUs is it made of?
What is its total memory capacity?
What kind of software tools can be used to program it?
Answer
Students should research the answer by visiting the Top 500 site and, if available, the site of one of the reported systems.
2. How many cores are inside the top GPU offerings from NVidia and AMD?
What is the GFlop rating of these chips?
Answer N/A.
3. The performance of the most powerful supercomputers in the world is usually reported as two numbers, Rpeak and Rmax, both in TFlops (tera floating-point operations per second). Why is this done? What are the factors reducing performance from Rpeak to Rmax? Would it be possible to ever achieve Rpeak?
Answer
This is done because the peak performance is unattainable. Sustained, measured performance on specific benchmarks is a better indicator of the true machine potential.
The main factor reducing Rmax below Rpeak is communication overhead.
Rpeak and Rmax could never be equal. Extremely compute-heavy applications that have no inter-node communication could asymptotically approach Rpeak if they were to run for a very long time; a very long execution time is required to diminish the influence of the start-up costs.
4. A sequential application with a 20% part that must be executed sequentially is required to be accelerated five-fold. How many CPUs are required for this task?
Answer
This requires the application of Amdahl’s law. The part that can be parallelized is $\alpha = 1 - 20\% = 80\%$. The speedup predicted by Amdahl’s law is
\[ speedup = \frac{1}{1 - \alpha + \frac{\alpha}{N}} \]
For comparison, achieving a three-fold speedup requires that:
\[ \frac{1}{1 - \alpha + \frac{\alpha}{N}} = 3 \Rightarrow \frac{1}{0.2 + \frac{0.8}{N}} = 3 \Rightarrow \frac{0.8}{N} = \frac{1}{3} - 0.2 \Rightarrow N = \frac{0.8}{\frac{1}{3} - 0.2} = 6 \quad (1.1) \]
Achieving a five-fold speedup requires that:
\[ \frac{1}{0.2 + \frac{0.8}{N}} = 5 \Rightarrow N = \frac{0.8}{\frac{1}{5} - 0.2} = \frac{0.8}{0} = \infty \quad (1.2) \]
So, it is impossible to achieve a five-fold speedup, according to Amdahl’s law.
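The same calculation can be checked programmatically. The following is a minimal C++ sketch, not taken from the book; the helper name coresForSpeedup and the round-off guard are illustrative assumptions. It solves Amdahl’s law for the smallest N that reaches a target speedup and reports when the target exceeds the asymptotic limit 1/(1 − α).

#include <cmath>
#include <iostream>
#include <optional>

// Amdahl's law: speedup(N) = 1 / ((1 - a) + a / N); it is bounded above by 1 / (1 - a).
// Returns the smallest N reaching `target`, or no value if the target is unattainable.
std::optional<int> coresForSpeedup(double a, double target)
{
    double denom = 1.0 / target - (1.0 - a);   // equals a / N when speedup == target
    if (denom <= 1e-12)                        // target at or beyond the 1/(1-a) limit
        return std::nullopt;
    return static_cast<int>(std::ceil(a / denom - 1e-9));   // guard against round-off
}

int main()
{
    const double a = 0.8;   // 80% of the application can be parallelized
    if (auto n = coresForSpeedup(a, 3.0))
        std::cout << "Three-fold speedup needs N = " << *n << " CPUs\n";   // prints 6
    if (!coresForSpeedup(a, 5.0))
        std::cout << "Five-fold speedup is unattainable (limit is 1/(1-a) = 5)\n";
    return 0;
}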
5. A parallel application running on 5 identical machines has a 10% sequential part. What is the speedup relative to a sequential execution on one of the machines? If we would like to double that speedup, how many CPUs would be required?
Answer
This requires the application of Gustafson-Barsis’ law, as the information relates to a parallel application. The parallel part is $\alpha = 1 - 10\% = 90\%$. The speedup over a single machine is
\[ speedup = (1 - \alpha) + N \cdot \alpha = 0.1 + 5 \cdot 0.9 = 4.6 \]
Doubling the speedup would require
\[ 0.1 + N \cdot 0.9 = 9.2 \Rightarrow N = \frac{9.1}{0.9} \approx 10.1 \]
machines. As N has to be an integer, we have to round up to the closest integer, i.e. $N = 11$.
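A short C++ sketch of the same arithmetic follows; it is not part of the book’s code, and the helper names gustafsonSpeedup and machinesFor are illustrative. It evaluates the Gustafson-Barsis scaled speedup and rounds up the machine count needed to double it.

#include <cmath>
#include <iostream>

// Gustafson-Barsis' law: speedup(N) = (1 - a) + N * a, with a the parallel fraction
// measured on the parallel run.
double gustafsonSpeedup(double a, int N)
{
    return (1.0 - a) + N * a;
}

// Machines needed to reach a target scaled speedup, rounded up to an integer count.
int machinesFor(double a, double target)
{
    return static_cast<int>(std::ceil((target - (1.0 - a)) / a - 1e-9));
}

int main()
{
    const double a = 0.9;                          // 90% parallel part
    double s5 = gustafsonSpeedup(a, 5);            // 0.1 + 5 * 0.9 = 4.6
    std::cout << "Speedup on 5 machines: " << s5 << '\n';
    std::cout << "Machines needed to double it: " << machinesFor(a, 2 * s5) << '\n';   // 11
    return 0;
}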
6. An application with a 5% non-parallelizable part is to be modified for parallel execution. Currently on the market there are two parallel machines available: machine X with 4 CPUs, each CPU capable of executing the application in 1hr on its own, and machine Y with 16 CPUs, each CPU capable of executing the application in 2hr on its own. Which machine should you buy if the minimum execution time is required?
Answer
As the information provided relates to a sequential application, we have to apply Amdahl’s law. The execution time on machine X is:
\[ t_X = (1 - \alpha)T + \frac{\alpha T}{N} = 0.05 \cdot 1hr + \frac{0.95 \cdot 1hr}{4} = 0.2875hr \quad (1.3) \]
The execution time on machine Y is:
\[ t_Y = (1 - \alpha)T + \frac{\alpha T}{N} = 0.05 \cdot 2hr + \frac{0.95 \cdot 2hr}{16} = 0.21875hr \quad (1.4) \]
Since $t_Y < t_X$, machine Y is the one to buy.
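The comparison can also be scripted. Below is a minimal C++ sketch, not from the book, with the illustrative helper parallelTime; it evaluates the Amdahl execution-time model t(N) = (1 − α)T + αT/N for both machines and picks the faster one.

#include <iostream>

// Parallel execution time under Amdahl's assumptions:
// t(N) = (1 - a) * T + a * T / N, where T is the single-CPU time on that machine.
double parallelTime(double a, double T, int N)
{
    return (1.0 - a) * T + a * T / N;
}

int main()
{
    const double a = 0.95;                      // 95% of the application parallelizes
    double tX = parallelTime(a, 1.0, 4);        // machine X: T = 1 hr, 4 CPUs
    double tY = parallelTime(a, 2.0, 16);       // machine Y: T = 2 hr, 16 CPUs
    std::cout << "tX = " << tX << " hr, tY = " << tY << " hr\n";   // 0.2875 vs 0.21875
    std::cout << (tY < tX ? "Machine Y is the better buy\n" : "Machine X is the better buy\n");
    return 0;
}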