Commit 888b2f4

Merge pull request #3 from ocaml-multicore/updates
updates from OCaml Workshop 2020 talk
2 parents 8643508 + 7c01f39 commit 888b2f4

1 file changed

+114
-35
lines changed

README.md

Lines changed: 114 additions & 35 deletions
@@ -5,20 +5,21 @@ Multicore OCaml. All the code examples along with their corresponding dune file
55
are available in the `code/` directory. The tutorial is organised into the
66
following sections:
77

8-
* [Introduction](#introduction)
9-
* [Installation](#installation)
10-
* [Domains](#domains)
11-
* [Domainslib](#domainslib)
12-
* [Task pool](#task-pool)
13-
* [Parallel for](#parallel-for)
14-
* [Async-Await](#async-await)
15-
* [Channels](#channels)
16-
* [Bounded Channels](#bounded-channels)
17-
* [Task Distribution using Channels](#task-distribution-using-channels)
18-
* [Profiling your code](#profiling-your-code)
19-
* [Perf](#perf)
20-
* [Eventlog](#eventlog)
21-
8+
- [Introduction](#introduction)
9+
* [Installation](#installation)
10+
* [Compatibility with existing code](#compatibility-with-existing-code)
11+
- [Domains](#domains)
12+
- [Domainslib](#domainslib)
13+
* [Task pool](#task-pool)
14+
* [Parallel for](#parallel-for)
15+
* [Async-Await](#async-await)
16+
+ [Fibonacci numbers in parallel](#fibonacci-numbers-in-parallel)
17+
- [Channels](#channels)
18+
* [Bounded Channels](#bounded-channels)
19+
* [Task Distribution using Channels](#task-distribution-using-channels)
20+
- [Profiling your code](#profiling-your-code)
21+
* [Perf](#perf)
22+
* [Eventlog](#eventlog)
2223

2324
# Introduction
2425

@@ -27,6 +28,13 @@ Parallelism through `Domains` and Concurrency through `Algebraic effects`. It is
2728
slowly, but steadily being merged to trunk OCaml. Domains-only multicore is
2829
expected to land first followed by Algebraic effects.
2930

31+
**Concurrency** is how we partition multiple computations such that they can
32+
run in overlapping time periods rather than strictly sequentially.
33+
**Parallelism** is the act of running multiple computations simultaneously,
34+
primarily by using multiple cores on a multicore machine. The multicore wiki
35+
has [comprehensive notes](https://github.com/ocaml-multicore/ocaml-multicore/wiki/Concurrency-and-parallelism-design-notes) on the design decisions and
36+
current status of concurrency and parallelism in Multicore OCaml.
37+
3038
The Multicore OCaml compiler comes with two variants of Garbage Collector,
3139
namely a concurrent minor collector (ConcMinor) and a stop-the-world parallel
3240
minor collector (ParMinor). Our experiments have shown us that ParMinor
@@ -35,7 +43,8 @@ need any changes in the C API of the compiler, unlike ConcMinor which breaks
3543
the C API. So, the consensus is to go forward with ParMinor during up-
3644
streaming of the Domains-only Multicore. ConcMinor is at OCaml version `4.06.1`
3745
and ParMinor has been promoted to `4.10.0` and `4.11.0`. More details on the GC
38-
design and evaluation are available in [this paper](https://arxiv.org/abs/2004.11663).
46+
design and evaluation are available in
47+
[this ICFP 2020 paper](https://dl.acm.org/doi/10.1145/3408995).
3948

4049
The Multicore ecosystem also has the following libraries to complement the
4150
compiler.
@@ -51,19 +60,27 @@ Multicore OCaml
5160
swap library
5261

5362
This tutorial takes you through ways in which one can profitably write parallel
54-
programs in Multicore OCaml. The effect handlers story is not touched upon
63+
programs in Multicore OCaml. The reader is assumed to be familiar with OCaml; if
64+
not, they are encouraged to read [Real World OCaml](https://dev.realworldocaml.org/toc.html). The effect handlers story is not touched upon
5565
here; anyone interested can check out this [tutorial](https://github.com/ocamllabs/ocaml-effects-tutorial) and [examples](https://github.com/ocaml-multicore/effects-examples).
5666

5767
## Installation
5868

59-
While up-streaming of the multicore bits to trunk is still work in progress, one
60-
can start using Multicore OCaml with the help of [multicore-opam](https://github.com/ocaml-multicore/multicore-opam). Installation instructions for
69+
Up-streaming of the multicore bits to trunk OCaml is in progress, with [some PRs already merged to trunk](https://github.com/ocaml/ocaml/pulls?q=is%3Apr+label%3Amulticore-prerequisite+).
70+
One can start using Multicore OCaml with the help of [multicore-opam](https://github.com/ocaml-multicore/multicore-opam). Installation instructions for
6171
Multicore OCaml 4.10.0 compiler and domainslib can be found [here](https://github.com/ocaml-multicore/multicore-opam#install-multicore-ocaml).
6272
Other available compiler variants are [here](https://github.com/ocaml-multicore/multicore-opam/tree/master/packages/ocaml-variants).
6373

6474
It will also be useful to install `utop` on your Multicore switch.
6575
`opam install utop` should work out of the box.
6676

77+
## Compatibility with existing code
78+
79+
Multicore OCaml is compatible with existing OCaml code. It has support for the
80+
C API along with some tricky parts of the language like ephemerons and
81+
finalisers. To maintain compatibility with `ppx` there is a `no-effect-syntax`
82+
compiler variant in multicore-opam that removes some syntax extensions.
83+
6784
# Domains
6885

6986
Domains are the basic unit of parallelism in Multicore OCaml.
@@ -129,8 +146,8 @@ Error: Unbound module Atomic
129146
Error: Library "domainslib" not found.
130147
```
131148

132-
These errors usually mean that the switch used to compile the code is not a
133-
multicore switch. Using a multicore switch should resolve them.
149+
These errors usually mean that the compiler switch used to compile the code is
150+
not a multicore switch. Using a multicore compiler variant should resolve them.
134151

135152
# Domainslib
136153

@@ -184,19 +201,20 @@ after all tasks are done.
184201

185202
## Parallel for
186203

187-
`parallel_for` is a powerful primitive in the Task API that can scale well with
188-
very little change in sequential code.
204+
`parallel_for` is a powerful primitive in the Task API which can be used to
205+
parallelise computations that use for loops. It can scale well with very little
206+
change to the sequential code.
189207

190208
Let us consider the example of matrix multiplication.
191209

192-
First, let us write the sequential version of a function which performs matrix
193-
multiplication of two matrices and returns the result.
210+
First, let us write down the sequential version of a function which performs
211+
matrix multiplication of two matrices and returns the result.
194212

195213
```ocaml
196-
let multiply_matrix a b =
214+
let matrix_multiply a b =
197215
  let i_n = Array.length a in
198216
  let j_n = Array.length b.(0) in
199-
  let k_n = Array.length b
217+
  let k_n = Array.length b in
200218
  let res = Array.make_matrix i_n j_n 0 in
201219
  for i = 0 to i_n - 1 do
202220
    for j = 0 to j_n - 1 do
@@ -208,11 +226,30 @@ let multiply_matrix a b =
208226
res
209227
```
210228

211-
Arrays offer better efficiency compared with lists in the context of Multicore
212-
OCaml. Although they are not generally favoured in functional programming, using
213-
arrays for the sake of efficiency is a reasonable trade-off.
229+
To make this function run in parallel, one might be inclined to spawn a new
230+
domain for every iteration in the loop, which would look like:
214231

215-
We shall parallelise matrix multiplication with the help of a `parallel_for`.
232+
```ocaml
233+
let domains = Array.init i_n (fun i ->
234+
  Domain.spawn (fun _ ->
235+
    for j = 0 to j_n - 1 do
236+
      for k = 0 to k_n - 1 do
237+
        res.(i).(j) <- res.(i).(j) + a.(i).(k) * b.(k).(j)
238+
      done
239+
    done)) in
240+
Array.iter Domain.join domains
241+
```
242+
This will be *disastrous* in terms of performance, mainly because spawning a
243+
new domain is an expensive operation. A task pool instead offers a finite set
244+
of available domains, which can be used to run your computations in
245+
parallel.
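
To make the cost argument concrete, here is a hand-rolled sketch that spawns a fixed number of domains and gives each one a contiguous band of rows, rather than one domain per row. It is only an illustration and not how Domainslib's task pool is implemented: the function name `matrix_multiply_fixed_domains` and the `num_domains` parameter are hypothetical, and the code assumes a compiler that provides the `Domain` module (a Multicore OCaml switch or OCaml 5).

```ocaml
(* Hand-rolled sketch, not Domainslib's implementation: split the rows of the
   result among a fixed number of domains instead of one domain per row.
   [num_domains] is a hypothetical parameter chosen by the caller. *)
let matrix_multiply_fixed_domains ~num_domains a b =
  let i_n = Array.length a in
  let j_n = Array.length b.(0) in
  let k_n = Array.length b in
  let res = Array.make_matrix i_n j_n 0 in
  (* Rows per domain, rounded up so that every row is covered. *)
  let rows_per_domain = (i_n + num_domains - 1) / num_domains in
  let work d () =
    let first = d * rows_per_domain in
    let last = min i_n (first + rows_per_domain) - 1 in
    (* Each domain writes only to its own rows, so there are no data races. *)
    for i = first to last do
      for j = 0 to j_n - 1 do
        for k = 0 to k_n - 1 do
          res.(i).(j) <- res.(i).(j) + a.(i).(k) * b.(k).(j)
        done
      done
    done
  in
  let domains = Array.init num_domains (fun d -> Domain.spawn (work d)) in
  Array.iter Domain.join domains;
  res
```

The number of spawns is now bounded by `num_domains` instead of by the matrix size, but the domains are still created afresh on every call. A task pool goes one step further by keeping its domains alive and reusing them across computations.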
246+
247+
Arrays are usually more efficient compared with lists in the context of
248+
Multicore OCaml. Although they are not generally favoured in functional
249+
programming, using arrays for the sake of efficiency is a reasonable trade-off.
250+
251+
A better way to parallelise matrix multiplication is with the help of a
252+
`parallel_for`.
216253

217254
```ocaml
218255
let parallel_matrix_multiply pool a b =
@@ -243,13 +280,16 @@ is recommended to use it as a function parameter.
243280
We shall examine the parameters of `parallel_for`. It takes in `pool` as
244281
discussed earlier, `start` and `finish`, as the names suggest, are the starting
245282
and ending values of the loop iterations, `body` contains the actual loop body
246-
to be executed. One parameter that doesn't exist in the sequential version is
247-
the `chunk_size`. Chunk size determines the granularity of tasks when executing on multiple cores. The ideal `chunk_size` depends on a combination
283+
to be executed.
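
One way to read these parameters is through a sequential reference model with the same labelled arguments. The sketch below is not part of Domainslib: `sequential_for` is a hypothetical name, and it assumes that both `start` and `finish` are inclusive, as in an ordinary OCaml `for` loop.

```ocaml
(* Sequential reference model; [sequential_for] is a hypothetical helper,
   not a Domainslib function. Both bounds are assumed to be inclusive. *)
let sequential_for ~start ~finish ~body =
  for i = start to finish do
    body i
  done

let () =
  let squares = Array.make 5 0 in
  (* The body receives the loop index, just like the body of a for loop. *)
  sequential_for ~start:0 ~finish:4 ~body:(fun i -> squares.(i) <- i * i);
  assert (squares = [| 0; 1; 4; 9; 16 |])
```

A parallel version is free to run the iterations in any order and on different domains, so `body` should not rely on iteration `i - 1` having finished before iteration `i` starts.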
284+
285+
One parameter that doesn't exist in the sequential version is
286+
the `chunk_size`. Chunk size determines the granularity of tasks when executing
287+
on multiple cores. The ideal `chunk_size` depends on a combination
248288
of factors:
249289

250290
* **Nature of the loop:** There are two things to consider pertaining to the
251-
loop while deciding on a `chunk_size` to use, the number of iterations in the
252-
loop and amount of time each iteration takes. If the amount of time taken by
291+
loop while deciding on a `chunk_size` to use, the *number of iterations* in the
292+
loop and *amount of time* each iteration takes. If the amount of time taken by
253293
every iteration is roughly equal, then the `chunk_size` could be the number of
254294
iterations divided by the number of cores. On the other hand, if the amount of
255295
time taken is different for every iteration, the chunks should be smaller. If
@@ -598,6 +638,45 @@ debugging. Let's do that with the help of an example.
598638

599639
## Perf
600640

641+
Linux perf is a tool that has proved to be very useful to profile Multicore
642+
OCaml code.
643+
644+
**Profiling serial code**
645+
646+
Profiling serial code can help us identify parts of code which can potentially
647+
benefit from parallelising. Let's do it for the sequential version of matrix
648+
multiplication.
649+
650+
```
651+
perf record --call-graph dwarf ./matrix_multiplication.exe 1024
652+
```
653+
654+
We get a profile that tells us how much time is spent in the `matrix_multiply`
655+
function which we wanted to parallelise. We also need to keep in mind that if
656+
a lot more time is spent outside the function we'd like to parallelise,
657+
the maximum speedup we could achieve would be lower.
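
This bound is commonly stated as Amdahl's law, which the text above does not name. If a fraction `p` of the serial running time is parallelised over `n` cores, the speedup is at most `1 / ((1 - p) + p / n)`. The helper below is a small sketch for estimating that bound from a profile; `max_speedup` is a hypothetical name, and `p` would be read off the perf report as the share of time spent in the function being parallelised.

```ocaml
(* Amdahl's law: upper bound on the speedup when a fraction [p] (between 0.0
   and 1.0) of the serial running time is spread over [n] cores.
   [max_speedup] is a hypothetical helper for illustration. *)
let max_speedup ~p ~n = 1.0 /. ((1.0 -. p) +. (p /. float_of_int n))

let () =
  (* Almost all time in the parallelised function: close to linear speedup. *)
  Printf.printf "p = 0.99, 4 cores: at most %.2fx\n" (max_speedup ~p:0.99 ~n:4);
  (* Half the time spent elsewhere: the bound drops sharply. *)
  Printf.printf "p = 0.50, 4 cores: at most %.2fx\n" (max_speedup ~p:0.50 ~n:4)
```

The bound ignores scheduling and memory overheads, so measured speedups are usually lower still.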
658+
659+
Profiling serial code can help us discover the hotspots where we might want to
660+
introduce parallelism.
661+
662+
```
663+
Samples: 51K of event 'cycles:u', Event count (approx.): 28590830181
664+
Children Self Command Shared Object Symbol
665+
+ 99.84% 0.00% matmul.exe matmul.exe [.] caml_start_program
666+
+ 99.84% 0.00% matmul.exe matmul.exe [.] caml_program
667+
+ 99.84% 0.00% matmul.exe matmul.exe [.] camlDune__exe__Matmul__entry
668+
+ 99.32% 99.31% matmul.exe matmul.exe [.] camlDune__exe__Matmul__matrix_multiply_211
669+
+ 0.57% 0.04% matmul.exe matmul.exe [.] camlStdlib__array__init_104
670+
0.47% 0.37% matmul.exe matmul.exe [.] camlStdlib__random__intaux_278
671+
```
672+
673+
674+
675+
### Overheads in parallel code
676+
677+
Perf can be helpful in identifying overheads in your parallel code. We'll see
678+
one such example here where we improve the performance by removing overheads.
679+
601680
**Parallel initialisation of a float array with random numbers**
602681

603682
Array initialisation using the standard library's `Array.init` is sequential.
@@ -700,7 +779,7 @@ Random states are all allocated by the same domain in an array with small
700779
number of elements, possibly located close to each other in physical memory.
701780
When multiple domains try to access them, they might possibly share cache
702781
lines, a problem termed `false sharing`. We can confirm our suspicion with the
703-
help of `perf c2c`.
782+
help of `perf c2c` on Intel machines.
704783

705784
```
706785
$ perf c2c record _build/default/float_init_par2.exe 4 100_000_000
