Implement threaded broadcast array type to make time integration multithreaded #722
Currently, time integration does not use multithreading. On CPUs with many threads, we can see that the time integration therefore doesn't scale and takes up a significant percentage of the total runtime.
It also makes the first multithreaded loops that touch the arrays after the time integration much slower, most likely due to cache misses caused by the inconsistent access pattern.
OrdinaryDiffEq.jl provides multithreading for some (not all) methods, including Carpenter-Kennedy and the automatic time stepping methods that we are using, by passing `thread = OrdinaryDiffEq.True()` to the method. However, this doesn't work with `DynamicalODEProblem`s where `u` is an `ArrayPartition` due to YingboMa/FastBroadcast.jl#71.

This PR doesn't just provide a workaround for this; it also enables multithreading for schemes that don't have this option, notably the symplectic schemes.
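For illustration, here is a minimal sketch of what such a threaded broadcast array type can look like. The name `ThreadedBroadcastArray` and all details below are illustrative and not necessarily identical to what this PR implements: an `AbstractArray` wrapper that forwards indexing to the parent array and only replaces the serial element-wise loop behind broadcast assignments with a `Threads.@threads` loop.

```julia
# Minimal sketch of a threaded broadcast array type (illustrative, not necessarily
# the exact implementation in this PR).
struct ThreadedBroadcastArray{T, N, A <: AbstractArray{T, N}} <: AbstractArray{T, N}
    array::A
end

Base.size(x::ThreadedBroadcastArray) = size(x.array)
Base.IndexStyle(::Type{<:ThreadedBroadcastArray{T, N, A}}) where {T, N, A} = IndexStyle(A)

Base.@propagate_inbounds Base.getindex(x::ThreadedBroadcastArray, i...) = x.array[i...]
Base.@propagate_inbounds function Base.setindex!(x::ThreadedBroadcastArray, value, i...)
    x.array[i...] = value
    return x
end

# Preserve the wrapper when the time integrator allocates its caches with `similar`.
function Base.similar(x::ThreadedBroadcastArray, ::Type{T}, dims::Dims) where {T}
    return ThreadedBroadcastArray(similar(x.array, T, dims))
end

# In-place broadcasts like `u .= uprev .+ dt .* k` eventually reach
# `copyto!(dest, ::Broadcasted{Nothing})`, which Base runs as a serial loop.
# Overriding it for the wrapper runs the same element-wise loop multithreaded.
function Base.copyto!(dest::ThreadedBroadcastArray, bc::Broadcast.Broadcasted{Nothing})
    Threads.@threads for i in eachindex(bc)
        @inbounds dest.array[i] = bc[i]
    end
    return dest
end
```

The idea is that wrapping the arrays that end up in the integrator's `u` this way makes the update loops of all schemes multithreaded, including those that don't expose a `thread` option.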
The following benchmarks have been conducted on an AMD Threadripper 3990X using 64 threads:
2D benchmark with 778k particles
On main:
This PR:
Three important changes: `reset ∂v/∂t` (being the first access after the time integration), the fluid-fluid interaction, and the NHS update are all faster. In total, the kick went from 19.8s to 17.3s. I assume this is because the access pattern is more consistent now (both the time integration and our code now use multithreaded access), optimizing cache hits. `% measured` goes up from 61.6% to 88.3%.

2D benchmark with 7k particles
On main:
This PR:
We can observe the same three changes, but much less pronounced. `% measured` goes up from 79.8% to 87.1%.

3D benchmark with 4.5M particles
On main:
This PR:
Here, the difference is not as extreme as in 2D, probably because 3D SPH is more expensive, so the time integration already takes a smaller percentage of the total time.
The most notable difference is the much faster `drift!` and `reset ∂v/∂t` (because that is the first multithreaded loop that touches the arrays after the time integration).

It should be noted that we might get an even larger speedup by making the access pattern even more consistent. With this PR, there is one multithreaded loop over all of `u_ode` or `v_ode`, whereas in TrixiParticles, we first split these vectors per system and then run a multithreaded loop. However, I doubt the difference will be very large, as boundaries are still causing inconsistent access patterns (this is the reason why my multi-GPU experiments didn't work out very nicely).
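Purely as an illustration of the two access patterns meant here (both functions and names are hypothetical, not code from this PR):

```julia
# Hypothetical sketch of the two threading patterns discussed above.

# After this PR, the time integration effectively runs one threaded loop
# over the full vector:
function integration_update!(u_ode, du_ode, dt)
    Threads.@threads for i in eachindex(u_ode)
        @inbounds u_ode[i] += dt * du_ode[i]
    end
    return u_ode
end

# TrixiParticles itself first splits the vector per system (here represented by a
# list of index ranges) and then runs a threaded loop per system:
function per_system_update!(f, u_ode, system_ranges)
    for range in system_ranges
        Threads.@threads for i in range
            @inbounds f(u_ode, i)
        end
    end
    return u_ode
end
```

Since `Threads.@threads` partitions each loop range independently, the chunk of `u_ode` that a given thread touches differs between the two patterns, which is what makes the access pattern inconsistent.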