
[Proof of Concept] Implement data type to define threaded broadcasting #2284

Draft · wants to merge 1 commit into base: main
Conversation

@efaulhaber (Member) commented Feb 13, 2025

As discussed in #2283.

The new data type ThreadedBroadcastArray wraps any other array but redefines broadcasting and functions like copyto! and fill! to use @threaded from Trixi.jl.
Using this data type for u_ode makes the broadcasting in perform_step! of the time integration multithreaded.
This has the same effect as setting thread = True() for time integration schemes that support this option, but it also works for other schemes, and it would automatically follow a potential future redefinition of @threaded away from Polyester.jl.
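To make the idea concrete, here is a minimal sketch of how such a wrapper type can hook into Julia's broadcasting machinery. This is not the PR's exact implementation: it uses Base.Threads.@threads in place of Trixi.jl's @threaded, forwards only the pieces needed for the example, and iterates with linear indices (sufficient for a vector like u_ode):

```julia
using Base.Threads: @threads

# Wrapper around any array that makes broadcast assignment multithreaded.
struct ThreadedBroadcastArray{T, N, A <: AbstractArray{T, N}} <: AbstractArray{T, N}
    array::A
end
function ThreadedBroadcastArray(A::AbstractArray{T, N}) where {T, N}
    ThreadedBroadcastArray{T, N, typeof(A)}(A)
end

Base.parent(m::ThreadedBroadcastArray) = m.array
Base.size(m::ThreadedBroadcastArray) = size(parent(m))
Base.IndexStyle(::Type{<:ThreadedBroadcastArray{T, N, A}}) where {T, N, A} = IndexStyle(A)
Base.getindex(m::ThreadedBroadcastArray, i::Int...) = getindex(parent(m), i...)
Base.setindex!(m::ThreadedBroadcastArray, v, i::Int...) = setindex!(parent(m), v, i...)

# `copyto!(dest, bc::Broadcasted{Nothing})` is the documented entry point for
# customizing in-place broadcast assignment (`dest .= ...`).
function Base.copyto!(dest::ThreadedBroadcastArray,
                      bc::Base.Broadcast.Broadcasted{Nothing})
    bc′ = Base.Broadcast.preprocess(parent(dest), bc)
    @threads for i in eachindex(bc′)
        @inbounds parent(dest)[i] = bc′[i]
    end
    return dest
end

# `fill!` similarly gets a threaded loop.
function Base.fill!(dest::ThreadedBroadcastArray, value)
    @threads for i in eachindex(parent(dest))
        @inbounds parent(dest)[i] = value
    end
    return dest
end
```

Any broadcast assignment to the wrapper then runs through the threaded copyto!, e.g. `u = ThreadedBroadcastArray(zeros(4)); u .= 2 .* [1.0, 2.0, 3.0, 4.0]` fills the parent array with `[2.0, 4.0, 6.0, 8.0]` in parallel.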

Without threaded time integration:

julia> trixi_include("examples/tree_2d_dgsem/elixir_euler_source_terms.jl", initial_refinement_level=6, save_solution=nothing)

───────────────────────────────────────────────────────────────────────────────────
            Trixi.jl                      Time                    Allocations      
                                 ───────────────────────   ────────────────────────
        Tot / % measured:             4.76s /  94.0%           12.6MiB /  82.4%    

Section                  ncalls     time    %tot     avg     alloc    %tot      avg
───────────────────────────────────────────────────────────────────────────────────
perform step                914    4.29s   96.0%  4.70ms   7.61MiB   73.4%  8.52KiB
  rhs!                    4.57k    1.58s   35.4%   346μs   7.61MiB   73.4%  1.70KiB
    reset ∂u/∂t           4.57k    535ms   12.0%   117μs     0.00B    0.0%    0.00B
    volume integral       4.57k    454ms   10.1%  99.3μs   1.32MiB   12.8%     304B
    source terms          4.57k    189ms    4.2%  41.3μs   1.26MiB   12.1%     288B
    interface flux        4.57k    123ms    2.8%  27.0μs   1.46MiB   14.1%     336B
    prolong2interfaces    4.57k   97.5ms    2.2%  21.3μs   1.32MiB   12.8%     304B
    surface integral      4.57k   74.6ms    1.7%  16.3μs   1.19MiB   11.4%     272B
    ~rhs!~                4.57k   73.8ms    1.7%  16.2μs   4.78KiB    0.0%    1.07B
    Jacobian              4.57k   32.5ms    0.7%  7.11μs   1.05MiB   10.1%     240B
    prolong2mortars       4.57k    350μs    0.0%  76.6ns     0.00B    0.0%    0.00B
    prolong2boundaries    4.57k    339μs    0.0%  74.3ns     0.00B    0.0%    0.00B
    mortar flux           4.57k    291μs    0.0%  63.7ns     0.00B    0.0%    0.00B
    boundary flux         4.57k    183μs    0.0%  40.0ns     0.00B    0.0%    0.00B
  broadcast5              3.66k    1.37s   30.7%   375μs     0.00B    0.0%    0.00B
  broadcast4              3.66k    655ms   14.7%   179μs     0.00B    0.0%    0.00B
  broadcast2                914    305ms    6.8%   334μs     0.00B    0.0%    0.00B
  broadcast3              3.66k    175ms    3.9%  47.9μs     0.00B    0.0%    0.00B
  broadcast1                914    173ms    3.9%   190μs     0.00B    0.0%    0.00B
  ~perform step~            914   31.9ms    0.7%  34.9μs   2.14KiB    0.0%    2.40B
calculate dt                915    103ms    2.3%   113μs    257KiB    2.4%     288B
analyze solution             11   73.0ms    1.6%  6.64ms   2.50MiB   24.1%   233KiB

With thread = True():

───────────────────────────────────────────────────────────────────────────────────
            Trixi.jl                      Time                    Allocations      
                                 ───────────────────────   ────────────────────────
        Tot / % measured:             1.26s /  75.8%           12.6MiB /  82.3%    

Section                  ncalls     time    %tot     avg     alloc    %tot      avg
───────────────────────────────────────────────────────────────────────────────────
perform step                914    897ms   94.2%   981μs   7.61MiB   73.4%  8.52KiB
  rhs!                    4.57k    690ms   72.5%   151μs   7.61MiB   73.4%  1.70KiB
    volume integral       4.57k    185ms   19.4%  40.5μs   1.32MiB   12.8%     304B
    source terms          4.57k    147ms   15.4%  32.2μs   1.26MiB   12.1%     288B
    interface flux        4.57k    103ms   10.8%  22.6μs   1.46MiB   14.1%     336B
    prolong2interfaces    4.57k   77.8ms    8.2%  17.0μs   1.32MiB   12.8%     304B
    ~rhs!~                4.57k   71.2ms    7.5%  15.6μs   4.78KiB    0.0%    1.07B
    surface integral      4.57k   63.3ms    6.6%  13.9μs   1.19MiB   11.4%     272B
    Jacobian              4.57k   21.2ms    2.2%  4.64μs   1.05MiB   10.1%     240B
    reset ∂u/∂t           4.57k   20.4ms    2.1%  4.46μs     0.00B    0.0%    0.00B
    prolong2mortars       4.57k    255μs    0.0%  55.8ns     0.00B    0.0%    0.00B
    mortar flux           4.57k    246μs    0.0%  53.8ns     0.00B    0.0%    0.00B
    prolong2boundaries    4.57k    238μs    0.0%  52.0ns     0.00B    0.0%    0.00B
    boundary flux         4.57k    216μs    0.0%  47.2ns     0.00B    0.0%    0.00B
  broadcast5              3.66k   80.7ms    8.5%  22.1μs     0.00B    0.0%    0.00B
  broadcast2                914   50.8ms    5.3%  55.6μs     0.00B    0.0%    0.00B
  ~perform step~            914   29.8ms    3.1%  32.6μs   2.14KiB    0.0%    2.40B
  broadcast4              3.66k   21.0ms    2.2%  5.74μs     0.00B    0.0%    0.00B
  broadcast3              3.66k   19.9ms    2.1%  5.45μs     0.00B    0.0%    0.00B
  broadcast1                914   4.37ms    0.5%  4.78μs     0.00B    0.0%    0.00B
analyze solution             11   46.8ms    4.9%  4.26ms   2.50MiB   24.1%   233KiB
calculate dt                915   8.61ms    0.9%  9.41μs    257KiB    2.4%     288B

With this PR:

───────────────────────────────────────────────────────────────────────────────────
            Trixi.jl                      Time                    Allocations      
                                 ───────────────────────   ────────────────────────
        Tot / % measured:             1.20s /  80.4%           13.2MiB /  83.0%    

Section                  ncalls     time    %tot     avg     alloc    %tot      avg
───────────────────────────────────────────────────────────────────────────────────
perform step                914    884ms   91.7%   967μs   8.19MiB   74.8%  9.18KiB
  rhs!                    4.57k    652ms   67.7%   143μs   7.61MiB   69.4%  1.70KiB
    volume integral       4.57k    174ms   18.0%  38.0μs   1.32MiB   12.1%     304B
    source terms          4.57k    136ms   14.1%  29.8μs   1.26MiB   11.5%     288B
    interface flux        4.57k   97.9ms   10.2%  21.4μs   1.46MiB   13.4%     336B
    prolong2interfaces    4.57k   76.9ms    8.0%  16.8μs   1.32MiB   12.1%     304B
    ~rhs!~                4.57k   70.6ms    7.3%  15.5μs   4.78KiB    0.0%    1.07B
    surface integral      4.57k   56.7ms    5.9%  12.4μs   1.19MiB   10.8%     272B
    Jacobian              4.57k   20.9ms    2.2%  4.58μs   1.05MiB    9.5%     240B
    reset ∂u/∂t           4.57k   18.7ms    1.9%  4.10μs     0.00B    0.0%    0.00B
    mortar flux           4.57k    230μs    0.0%  50.4ns     0.00B    0.0%    0.00B
    prolong2mortars       4.57k    220μs    0.0%  48.2ns     0.00B    0.0%    0.00B
    prolong2boundaries    4.57k    217μs    0.0%  47.5ns     0.00B    0.0%    0.00B
    boundary flux         4.57k    161μs    0.0%  35.2ns     0.00B    0.0%    0.00B
  broadcast5              3.66k   86.3ms    9.0%  23.6μs    171KiB    1.5%    48.0B
  broadcast2                914   52.7ms    5.5%  57.6μs   42.8KiB    0.4%    48.0B
  broadcast3              3.66k   30.0ms    3.1%  8.22μs    171KiB    1.5%    48.0B
  broadcast4              3.66k   27.9ms    2.9%  7.62μs    171KiB    1.5%    48.0B
  ~perform step~            914   25.1ms    2.6%  27.4μs   2.14KiB    0.0%    2.40B
  broadcast1                914   9.12ms    0.9%  10.0μs   42.8KiB    0.4%    48.0B
analyze solution             11   71.4ms    7.4%  6.49ms   2.50MiB   22.8%   233KiB
calculate dt                915   7.97ms    0.8%  8.71μs    257KiB    2.3%     288B

Some notes:

  • All benchmarks on 64 threads of an AMD Threadripper 3990X.
  • The difference between thread = True() and this PR is within measurement error.
  • I added timers to perform_step! of the Carpenter-Kennedy scheme.
  • It can clearly be seen that the broadcasts inside the time integration are significantly faster in the multithreaded cases.
  • Our @batch approach always allocates a little. Interestingly, the thread = True() version, which uses FastBroadcast.jl (which in turn is based on Polyester.jl), does not allocate for the broadcastX timers. Not sure why. This is the code they're using:
    https://github.com/YingboMa/FastBroadcast.jl/blob/ad586d83ffcac15c92969b93dd5cf0c8fd025af9/src/FastBroadcast.jl#L326-L334
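The "allocates a little" observation can be reproduced with a toy example. Here Base.Threads.@threads stands in for Polyester's @batch (the two allocate differently, but the mechanism is the same in spirit): the threaded loop pays a small per-call allocation for task scheduling, while the equivalent serial loop allocates nothing.

```julia
using Base.Threads: @threads

# Serial reference: plain loop, no allocations after compilation.
serial_fill!(u, v) = (for i in eachindex(u); @inbounds u[i] = v; end; u)

# Threaded version: spawns tasks per call, which allocates a few bytes.
function threaded_fill!(u, v)
    @threads for i in eachindex(u)
        @inbounds u[i] = v
    end
    return u
end

u = zeros(2^16)
serial_fill!(u, 1.0); threaded_fill!(u, 2.0)  # compile first

println(@allocated serial_fill!(u, 1.0))    # typically 0
println(@allocated threaded_fill!(u, 2.0))  # small but nonzero (task objects)
```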

@efaulhaber efaulhaber added enhancement New feature or request performance We are greedy parallelization Related to MPI, threading, tasks etc. labels Feb 13, 2025

Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

Purpose and scope

  • The PR has a single goal that is clear from the PR title and/or description.
  • All code changes represent a single set of modifications that logically belong together.
  • No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.

Code quality

  • The code can be understood easily.
  • Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
  • There are no redundancies that can be removed by simple modularization/refactoring.
  • There are no leftover debug statements or commented code sections.
  • The code adheres to our conventions and style guide, and to the Julia guidelines.

Documentation

  • New functions and types are documented with a docstring or top-level comment.
  • Relevant publications are referenced in docstrings (see example for formatting).
  • Inline comments are used to document longer or unusual code sections.
  • Comments describe intent ("why?") and not just functionality ("what?").
  • If the PR introduces a significant change or new feature, it is documented in NEWS.md with its PR number.

Testing

  • The PR passes all tests.
  • New or modified lines of code are covered by tests.
  • New or modified tests run in less than 10 seconds.

Performance

  • There are no type instabilities or memory allocations in performance-critical parts.
  • If the PR intent is to improve performance, before/after time measurements are posted in the PR.

Verification

  • The correctness of the code was verified using appropriate tests.
  • If new equations/methods are added, a convergence test has been run and the results are posted in the PR.

Created with ❤️ by the Trixi.jl community.


codecov bot commented Feb 13, 2025

Codecov Report

Attention: Patch coverage is 81.57895% with 7 lines in your changes missing coverage. Please review.

Project coverage is 79.09%. Comparing base (1e1f643) to head (4a31315).
Report is 11 commits behind head on main.

Files with missing lines                      | Patch %  | Lines
src/semidiscretization/semidiscretization.jl  | 81.58%   | 7 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2284       +/-   ##
===========================================
- Coverage   89.49%   79.09%   -10.40%     
===========================================
  Files         490      490               
  Lines       39507    39536       +29     
===========================================
- Hits        35355    31270     -4085     
- Misses       4152     8266     +4114     
Flag Coverage Δ
unittests 79.09% <81.58%> (-10.40%) ⬇️


@ranocha (Member) left a comment


Looks interesting. We need to check how this interacts with the GPU plans. It would also be great to get a review from Valentin (who self-requested his review already).

@@ -101,8 +102,78 @@ function semidiscretize(semi::AbstractSemidiscretization, tspan;
return ODEProblem{iip, specialize}(rhs!, u0_ode, tspan, semi)
end

struct ThreadedBroadcastArray{T, N, A} <: AbstractArray{T, N}

Suggested change
struct ThreadedBroadcastArray{T, N, A} <: AbstractArray{T, N}
struct ThreadedBroadcastArray{T, N, A <: AbstractArray{T, N}} <: AbstractArray{T, N}

Comment on lines +117 to +118
Base.parent(m::ThreadedBroadcastArray) = m.array
Base.size(m::ThreadedBroadcastArray) = size(m.array)

Suggested change
Base.parent(m::ThreadedBroadcastArray) = m.array
Base.size(m::ThreadedBroadcastArray) = size(m.array)
Base.parent(m::ThreadedBroadcastArray) = m.array
Base.pointer(m::ThreadedBroadcastArray) = pointer(parent(m))
Base.size(m::ThreadedBroadcastArray) = size(parent(m))

@sloede (Member) commented Feb 14, 2025

> Looks interesting. We need to check how this interacts with the GPU plans. It would also be great to get a review from Valentin (who self-requested his review already).

It's also on the agenda for Tuesday.

@efaulhaber (Member, Author)
Also see trixi-framework/TrixiParticles.jl#722 for the TrixiParticles version of this.
