Implement threaded broadcast array type to make time integration multithreaded #722
Currently, time integration does not use multithreading. On CPUs with many threads, we can see that the time integration therefore doesn't scale and takes up a significant percentage of the total runtime.
It also makes the first multithreaded loops that touch the arrays after the time integration much slower, most likely due to cache misses caused by the inconsistent access pattern.
OrdinaryDiffEq.jl provides multithreading for some (not all) methods, including Carpenter-Kennedy and the automatic time stepping methods that we are using, by passing `thread = OrdinaryDiffEq.True()` to the method. However, this doesn't work with `DynamicalODEProblem`s where `u` is an `ArrayPartition` due to YingboMa/FastBroadcast.jl#71.

This PR doesn't just provide a workaround for this; it also enables multithreading for schemes that don't have this option, notably the symplectic schemes.
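For illustration, here is a minimal sketch of what such a threaded broadcast array type can look like. The name `ThreadedBroadcastArray` and all details below are illustrative and not necessarily identical to what this PR implements: an `AbstractArray` wrapper that forwards indexing to the parent array and only replaces the serial element-wise loop behind broadcast assignments with a `Threads.@threads` loop.

```julia
# Minimal sketch of a threaded broadcast array type (illustrative, not necessarily
# the exact implementation in this PR).
struct ThreadedBroadcastArray{T, N, A <: AbstractArray{T, N}} <: AbstractArray{T, N}
    array::A
end

Base.size(x::ThreadedBroadcastArray) = size(x.array)
Base.IndexStyle(::Type{<:ThreadedBroadcastArray{T, N, A}}) where {T, N, A} = IndexStyle(A)

Base.@propagate_inbounds Base.getindex(x::ThreadedBroadcastArray, i...) = x.array[i...]
Base.@propagate_inbounds function Base.setindex!(x::ThreadedBroadcastArray, value, i...)
    x.array[i...] = value
    return x
end

# Preserve the wrapper when the time integrator allocates its caches with `similar`.
function Base.similar(x::ThreadedBroadcastArray, ::Type{T}, dims::Dims) where {T}
    return ThreadedBroadcastArray(similar(x.array, T, dims))
end

# In-place broadcasts like `u .= uprev .+ dt .* k` eventually reach
# `copyto!(dest, ::Broadcasted{Nothing})`, which Base runs as a serial loop.
# Overriding it for the wrapper runs the same element-wise loop multithreaded.
function Base.copyto!(dest::ThreadedBroadcastArray, bc::Broadcast.Broadcasted{Nothing})
    Threads.@threads for i in eachindex(bc)
        @inbounds dest.array[i] = bc[i]
    end
    return dest
end
```

The idea is that wrapping the arrays that end up in the integrator's `u` this way makes the update loops of all schemes multithreaded, including those that don't expose a `thread` option.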
The following benchmarks have been conducted on an AMD Threadripper 3990X using 64 threads:
2D benchmark with 778k particles
On main:
This PR:
Three important changes: `reset ∂v/∂t` (being the first access after the time integration), the fluid-fluid interaction, and the NHS update are all faster. In total, the kick went from 19.8s to 17.3s. I assume this is because the access pattern is more consistent now (both the time integration and our code now use multithreaded access), optimizing cache hits. `% measured` goes up from 61.6% to 88.3%.

2D benchmark with 7k particles
On main:
This PR:
We can observe the same three changes, but much less pronounced. `% measured` goes up from 79.8% to 87.1%.

3D benchmark with 4.5M particles
On main:
This PR:
Here, the difference is not as extreme as in 2D, probably because 3D SPH is more expensive, so the time integration already takes a smaller percentage of the total time.
The most notable difference is the much faster `drift!` and `reset ∂v/∂t` (because that is the first multithreaded loop that touches the arrays after the time integration).

It should be noted that we might get an even larger speedup by making the access pattern even more consistent. With this PR, there is one multithreaded loop over all of `u_ode` or `v_ode`, whereas in TrixiParticles, we first split these vectors per system and then run a multithreaded loop. However, I doubt the difference will be very large, as boundaries are still causing inconsistent access patterns (this is the reason why my multi-GPU experiments didn't work out very nicely).
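Purely as an illustration of the two access patterns meant here (both functions and names are hypothetical, not code from this PR):

```julia
# Hypothetical sketch of the two threading patterns discussed above.

# After this PR, the time integration effectively runs one threaded loop
# over the full vector:
function integration_update!(u_ode, du_ode, dt)
    Threads.@threads for i in eachindex(u_ode)
        @inbounds u_ode[i] += dt * du_ode[i]
    end
    return u_ode
end

# TrixiParticles itself first splits the vector per system (here represented by a
# list of index ranges) and then runs a threaded loop per system:
function per_system_update!(f, u_ode, system_ranges)
    for range in system_ranges
        Threads.@threads for i in range
            @inbounds f(u_ode, i)
        end
    end
    return u_ode
end
```

Since `Threads.@threads` partitions each loop range independently, the chunk of `u_ode` that a given thread touches differs between the two patterns, which is what makes the access pattern inconsistent.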