mlkit-bench/par-benchmarks at master · melsman/mlkit-bench

Name	Name	Last commit message	Last commit date
parent directory ..
fib	fib
mandelbrot	mandelbrot
nqueens	nqueens
pmsort	pmsort
primes	primes
ray-orig	ray-orig
ray	ray
seam-carve	seam-carve
soac	soac
sobol	sobol
util	util
vpmsort	vpmsort
.gitignore	.gitignore
Makefile	Makefile
README.md	README.md

Parallel Benchmarks

Notice: See below for prerequisites to running make press...

Here is the output from running make press (headers slightly modified) on a MacBook Pro (15-inch, 2016) with a 2,7 GHz Quad-Core Intel Core i7 CPU:

 |             | MLton 20201023 | MLKitSeq      | MLKitPar - 1 Core | MLKitPar - 4 Cores | Prog |
 | Program     | real time (s)  | real time (s) | real time (s)     | real time (s)      | len  |
 +-------------+----------------+---------------+-------------------+--------------------+------+
 | fib         | 0.55 ±2.6%     | 0.51 ±2.0%    | 0.51 ±1.7%        | 0.12 ±4.3%         | 132  |
 | mandelbrot  | 1.34 ±0.4%     | 1.38 ±6.4%    | 1.29 ±1.6%        | 0.36 ±5.2%         | 319  |
 | nqueens     | 1.38 ±0.4%     | 1.76 ±2.8%    | 2.03 ±2.1%        | 0.89 ±1.1%         | 212  |
 | pmsort      | 0.51 ±3.1%     | 0.19 ±0.0%    | 0.45 ±1.7%        | 0.16 ±2.8%         | 178  |
 | primes      | 0.97 ±0.5%     | 2.20 ±2.8%    | 2.47 ±2.6%        | 0.50 ±1.4%         | 584  |
 | ray-orig    | 4.20 ±0.4%     | 8.76 ±1.1%    | 18.28 ±1.1%       | 4.77 ±1.9%         | 547  |
 | ray         | 4.28 ±0.7%     | 2.56 ±2.9%    | 2.48 ±1.5%        | 0.62 ±9.1%         | 614  |
 | filter      | 0.33 ±4.3%     | 0.61 ±7.6%    | 0.56 ±0.8%        | 0.22 ±5.6%         | 563  |
 | scan        | 0.47 ±2.5%     | 0.74 ±5.6%    | 0.72 ±1.2%        | 0.22 ±4.1%         | 566  |
 | sgm_scan    | 0.13 ±4.2%     | 0.23 ±2.2%    | 0.25 ±2.7%        | 0.56 ±4.5%         | 563  |
 | soboloption | 0.31 ±2.3%     | 0.57 ±1.7%    | 0.94 ±1.3%        | 0.22 ±4.7%         | 889  |
 | sobolpi     | 0.54 ±1.6%     | 0.99 ±5.9%    | 1.29 ±1.2%        | 0.31 ±2.5%         | 854  |
 | vpmsort     | 0.27 ±5.2%     | 0.32 ±1.6%    | 0.39 ±1.1%        | 0.17 ±3.2%         | 198  |

We need to investigate the overhead of the parallel runtime and the parallel generated code for the pmsort benchmark and the ray-orig benchmark. These benchmarks allocate many small values in infinite regions (pairs, etc) and it turns out that the more involved allocation instruction sequence slows down these benchmarks. If we could somehow identify that a region is allocated into only by a single thread, we could choose the more efficient allocation instruction sequences for these regions. The pmsort benchmark, for instance, has the property that no region is allocated into by more than one thread. Here is a comparison of using the unsafe allocation strategy for the pmsort benchmark compared to the safe version:

 |             | MLton 20201023 | MLKitSeq      | MLKitPar - 1 Core | MLKitPar - 4 Cores | MLKitPar0 - 1 Core | MLKitPar0 - 4 Cores |
 | Program     | real time (s)  | real time (s) | real time (s)     | real time (s)      | real time (s)      | real time (s)       |
 +-------------+----------------+---------------+-------------------+--------------------+--------------------+---------------------+
 | pmsort 1M   | 1.0288         | 0.8965        | 0.9380            | 0.2850             | 0.5223             | 0.1847              |

Here are the executions:

$ cd pmsort
$ make clean pmsort-mlton.exe pmsort-seq.exe pmsort-par.exe pmsort-par0.exe

$ ./pmsort-mlton.exe -N 1000000 --t         # MLTON
[Sorting: Finished in 1.0288s]

$ ./pmsort-seq.exe -N 1000000 --t           # SEQ
[Sorting: Finished in 0.9280s]

$ ./pmsort-par.exe -N 1000000 --t -P 0      # PAR 1-core
[Sorting: Finished in 0.9380s]

$ ./pmsort-par.exe -N 1000000 --t           # PAR 4-cores
[Sorting: Finished in 0.2850s]

$ ./pmsort-par0.exe -N 1000000 --t -P 0     # PAR0 1-core
[Sorting: Finished in 0.5223s]

$ ./pmsort-par0.exe -N 1000000 --t          # PAR0 4-core
[Sorting: Finished in 0.1847s]

There seems to be something wrong with the SEQ version...

Please notice that some of the benchmarks (especially the ray benchmark) have been hand-optimised for running well with MLKit (ray-orig is Troels Henriksen's original version). In particular, for the ray benchmark, special blocks of unboxed reals are used for avoiding the construction of tuples of boxed reals. We see that MLton's whole-program optimisation techniques do not result in different execution times for the modified and the original version of the ray benchmark.

The scan, filter, and sgm_scan benchmarks make use of the util/soac.sml library, which implements a series of data-parallel combinators on pull-arrays and ordinary arrays.

Prerequisites

Running these benchmarks (using make press) assumes the following:

MLton is installed and available as a command mlton.
The smlpkg tool is available as a command smlpkg.
A new version of MLKit is installed. You can set the root of the installation using the MLKIT_ROOT environment variable (the default location is ~/gits/mlkit.
The mlkit-bench tool is compiled and available as ../src/mlkit-bench (write make all in ..).
The mlkit-bench-press tool is compiled and available as ../src/press/mlkit-bench-press (write smlpkg sync in ../src/charting and make all in ../src/press/).

Geee - sorry - I need to automate the above stuff further...

Simple compilation

Some of the folders have Makefiles that allow for easy compilation. For instance, you may do as follows:

$ cd pmsort
$ make pmsort-par.exe
$ ./pmsort-par.exe --t
...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

par-benchmarks

par-benchmarks

README.md

Parallel Benchmarks

Prerequisites

Simple compilation

Files

par-benchmarks

Directory actions

More options

Directory actions

More options

Latest commit

History

par-benchmarks

Folders and files

parent directory

README.md

Parallel Benchmarks

Prerequisites

Simple compilation