add static size information for hvcat and hvncat #58422
Conversation
@nanosoldier
I'm still not very happy with the performance for large matrix creation when the element types are different:

```julia
using BenchmarkTools
f(x, y) = [
    x y y y y y
    y x y y y y
    y y x y y y
    y y y x y y
    y y y y x y
    y y y y y x
]
@btime f(x, y) setup=(x = rand(); y = rand(1:10))
```

It takes 7 μs on my machine (and I'm expecting ~70 ns). But this looks more like a Julia compiler issue related to tuple splatting.
The package evaluation job you requested has completed - possible new issues were detected.

Report summary:
- ❗ Packages that crashed: 1 package crashed only on the current version; 4 packages crashed on the previous version too.
- ✖ Packages that failed: 532 packages failed only on the current version; 1459 packages failed on the previous version too.
- ✔ Packages that passed tests: 17 packages passed tests only on the current version; 4910 packages passed tests on the previous version too.
- ~ Packages that at least loaded: 3 packages successfully loaded only on the current version; 2715 packages successfully loaded on the previous version too.
- ➖ Packages that were skipped altogether: 1 package was skipped only on the current version; 922 packages were skipped on the previous version too.
This commit removes the generated function's dependency on IdSet, because IdSet is not yet defined during the bootstrap stage. By doing this, we avoid the world-age issues and can use @generated functions in abstractarray.jl.
(force-pushed from 86d863c to 9ed559f)
This significantly improves the performance of multi-dimensional array creation from scalar numbers via the methods behind [a b; c d], [a b;; c d] and the typed forms T[a b; c d], T[a b;; c d]. For small numeric array creation (length <= 16), manual loop unrolling is used to further minimize the overhead; it now has zero overhead and is as fast as the array initialization method.
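To make the "manual loop unroll" idea concrete, here is a minimal sketch of the technique for a fixed 2×2 case; it is illustrative only (the name `unrolled_hvcat_2x2` is made up, and this is not the PR's actual implementation):

```julia
# Sketch: emit one store per element instead of looping/splatting.
# Inside a @generated function the argument names refer to their *types*,
# so promote_type can pick the element type at compile time.
@generated function unrolled_hvcat_2x2(x11, x12, x21, x22)
    T = promote_type(x11, x12, x21, x22)
    quote
        A = Matrix{$T}(undef, 2, 2)
        @inbounds begin
            A[1, 1] = x11; A[1, 2] = x12
            A[2, 1] = x21; A[2, 2] = x22
        end
        A
    end
end

unrolled_hvcat_2x2(1, 2.0, 3, 4.0)  # 2×2 Matrix{Float64}
```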
(force-pushed from 9ed559f to 299ce1e)
```julia
@inline function hvcat_static(::Val{rows}, x::T, xs::Vararg{T}) where {rows, T<:Number}
    typed_hvcat_static(T, Val{rows}(), x, xs...)
end
```
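For context, under this PR's lowering (see the `Meta.@lower` output quoted later in the thread) a literal like `[2 2; 3 3]` becomes a call of roughly this shape, which the wrapper above forwards to `typed_hvcat_static` with `T` inferred from the arguments:

```julia
# What the surface syntax lowers to on a build of this PR
# (cf. the Meta.@lower output below):
Base.hvcat_static(Val((2, 2)), 2, 2, 3, 3)
```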
The only regression happens here:

```julia
using BenchmarkTools
f(x, y) = [
    x y y y y y
    y x y y y y
    y y x y y y
    y y y x y y
    y y y y x y
    y y y y y x
]
@btime f(x, y) setup=(x = rand(); y = rand())
```

It takes 984.615 ns on my local laptop. In comparison, it takes only 128.225 ns for Julia 1.12.
But this can be fixed if we duplicate the generated function definition:

```julia
@generated function hvcat_static(::Val{rows}, xs::T...) where {T<:Number, rows}
    # just copied everything from the typed_hvcat_static generated function's implementation
end
```

By applying this fix, we get 24.995 ns.
The only problems with this fix are:

- It introduces an ambiguity issue caused by `xs::T...`. That forces us to write `hvcat_static(::Val{rows}, x::T, xs::Vararg{T})` rather than `hvcat_static(::Val{rows}, xs::Vararg{T})` (see the toy example below).
- It duplicates the code, and I have almost no idea why the duplication works.
- It doesn't fix the performance for mixed element types, i.e., `@btime f(x, y) setup=(x = rand(); y = rand(1:10))` (add static size information for hvcat and hvncat #58422 (comment)).

Thus, I don't think this "fix" is elegant enough, so I did not commit it for this review version. If we find it necessary, I can add it later.
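To illustrate the kind of Vararg-driven ambiguity alluded to above, here is a generic toy example (the function `k` is made up; these are not the PR's actual signatures):

```julia
# Two Vararg methods where neither is strictly more specific:
k(x::Int, xs::Number...) = 1
k(x::Number, xs::Int...) = 2

k(1)  # MethodError: k(::Int64) is ambiguous; both methods match
```

Anchoring the first element explicitly, as in `(x::T, xs::Vararg{T})`, gives dispatch a concrete position to rank methods by and sidesteps this class of ambiguity.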
@nanosoldier
The package evaluation job you requested has completed - possible new issues were detected.

Report summary:
- ✖ Packages that failed: 9 packages failed only on the current version; 56 packages failed on the previous version too.
- ✔ Packages that passed tests: 6 packages passed tests only on the current version; 273 packages passed tests on the previous version too.
- ~ Packages that at least loaded: 2 packages successfully loaded only on the current version; 185 packages successfully loaded on the previous version too.
- ➖ Packages that were skipped altogether: 1 package was skipped on the previous version too.
PkgEval and CI tests look good; I'll keep this PR as it is until feedback comes.
This kinda scares me with respect to TTFX, over-specialization, and static compilability, especially for such a fundamental syntax. I don't think I can meaningfully review it beyond spreading this FUD; maybe someone closer to static compilation could evaluate its impacts?
TTFX is a good point; here's the small experiment I've done for 2×2 hvcat.

Based on the data, we get:

[figure: TTFX comparison for 2×2 hvcat]

The above experiment is about 2×2 hvcat, where the manual unroll is applied. The following is about 6×6 hvcat, where we call the generic path:

[figure: TTFX comparison for 6×6 hvcat]
```diff
@@ -2276,11 +2276,13 @@
 (if (any (lambda (row) (any vararg? row)) rows)
     `(call ,@hvcat_rows ,@(map (lambda (x) `(tuple ,@x)) rows))
     `(call ,@hvcat
-      (tuple ,@(map length rows))
+      (new (curly (top Val) (call (core tuple) ,@(map length rows))))
```
I get some inconsistency checking the result of @which:

```julia
julia> @which [2 2; 3 3]
hvcat(rows::Tuple{Vararg{Int64}}, xs::T...) where T<:Number
     @ Base abstractarray.jl:2203

julia> Meta.@lower [2 2; 3 3]
:($(Expr(:thunk, CodeInfo(
1 ─ %1 = builtin Core.tuple(2, 2)
│   %2 = builtin Core.apply_type(Base.Val, %1)
│   %3 = %new(%2)
│   %4 = dynamic Base.hvcat_static(%3, 2, 2, 3, 3)
└── return %4
))))
```

Is it because I need to update JuliaSyntax as well?
Is this related to / does this address #43844?
Can you explain a bit where this matters? Do you have some examples of performance-sensitive code which allocates small matrices from numbers? In my head at least, … That said, I am surprised how widely the benchmarks above vary between versions.
One place where this would have been nice is SciML/OrdinaryDiffEq.jl#2712. Constructing a matrix of literal values seems like the sort of thing that should be basically free, so it's somewhat unfortunate when it's expensive.
The following is a typical pattern when one wants to solve a problem iteratively (generated from ChatGPT; see the code below).

We identified issue #58397 when switching from Julia 1.9 to Julia 1.10 for our internal project, written by people who only know MATLAB and barely know Julia. In our case, we fixed it by teaching them how to use StaticArrays, but it's often the case that only power users of Julia know about StaticArrays and when to use them... And I won't be surprised if there are cases in the wild where array sizes are not deterministic like this one.

```julia
using LinearAlgebra

function lorenz_matrix(u)
    a, b = u[1,1], u[1,2]
    c, d = u[2,1], u[2,2]
    da = 10.0*(b - a)
    db = a*(28.0 - d) - b - 0.1*b*d
    dc = c*(b - c) - 0.1*c^2
    dd = a*c - (8/3)*d - 0.1*d^2
    return [da db; dc dd]
end

function euler_integrate(f, u0, tspan; dt=0.01, max_dt=0.1, min_dt=1e-6)
    t0, tf = tspan
    t = t0
    u = copy(u0)
    history = [(t, copy(u))]
    while t < tf
        # Compute derivative
        du = f(u)
        # Adaptive time stepping
        scale = norm(du) + 1e-10  # Prevent division by zero
        curr_dt = min(max_dt, max(min_dt, 0.1/scale))
        curr_dt = min(curr_dt, tf - t)  # Don't overshoot final time
        # Euler step
        u += curr_dt * du
        t += curr_dt
        push!(history, (t, copy(u)))
    end
    return history
end

# Initial condition
u0 = [1.0 0.01;
      0.01 0.0]

# Integrate
tspan = (0.0, 50.0)
history = @time euler_integrate(lorenz_matrix, u0, tspan)
```
@adienes Kind of, except there isn't much room for optimizing the 1D case. This PR intends to fix the performance gap for 2D and higher-dimensional cases, and applies to …
I don't have a strong opinion either way, but two questions that may guide decisions:

1. How does the time compare with GC? If you're at the point where small arrays are a major bottleneck, is eliminating the repeated array allocation usually better than the gain from this?
2. What are the trade-offs vs. a macro that would just expand the concat syntax into an elementwise fill? (A sketch of this idea follows.)
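For concreteness, here is a minimal sketch of what "expand the concat syntax into an elementwise fill" could mean; the macro itself is not written out, and `manual_fill` is a hypothetical name showing the code such an expansion might produce for `[x y; y x]`:

```julia
# What a hypothetical @fill_hvcat [x y; y x] could expand to:
function manual_fill(x, y)
    T = promote_type(typeof(x), typeof(y))
    A = Matrix{T}(undef, 2, 2)
    A[1, 1] = x; A[1, 2] = y   # first literal row
    A[2, 1] = y; A[2, 2] = x   # second literal row
    return A
end

manual_fill(1.0, 2)  # same result as [1.0 2; 2 1.0]
```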
@BioTurboNick I'm very much aware that there are many workarounds to this performance issue: reuse the allocated memory, rewrite manually or via macros, or adopt StaticArrays. They are all doable in a real project. The real reason I chose to provide a fix here at the language level is that I find it extremely awkward to explain this as a performance tip.

Considering this is a numerical computing language, such a performance tip is rather an excuse for something that we should fix but didn't. So if I were to sort the priorities, ensuring our users have a reasonable and consistent performance expectation of …
My question comes from my expectation that ever allocating an array on the heap in Julia is expensive and should be avoided in any performance-critical code.
After a quick chat with @timholy on Slack, let's summarize the PR: …

Here's what I plan to do next: …

MWE that breaks Zygote AD:

```julia
using Zygote

f(x, y) = [
    (x + y) (x - y);
    (x * y) (x / y)
]
g(x, y) = hvcat((2, 2), x + y, x - y, x * y, x / y)
@assert f(0.5, 1.5) == g(0.5, 1.5)

df(x, y) = gradient((_x, _y) -> sum(f(_x, _y)), x, y)
dg(x, y) = gradient((_x, _y) -> sum(g(_x, _y)), x, y)

df(0.5, 1.5)  # errors with mutation
```
@BioTurboNick This is very true, and we could even replace Array with tuples (which is what StaticArrays does under the hood; see the toy sketch below). We just need to do the fix with less impact, in a less controversial way.
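As a toy illustration of "replace Array with tuples" (the type `Mat2x2` is hypothetical, not from this PR or StaticArrays):

```julia
# A 2×2 matrix backed by a tuple: no heap allocation, size known statically.
struct Mat2x2{T}
    data::NTuple{4,T}  # column-major, matching Base array layout
end
Base.getindex(m::Mat2x2, i::Int, j::Int) = m.data[(j - 1) * 2 + i]

m = Mat2x2((1.0, 3.0, 2.0, 4.0))  # represents [1.0 2.0; 3.0 4.0]
m[1, 2]  # 2.0
```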
When creating an array from scalar numbers, the array initialization method is currently the fastest way. That is, constructing the array and filling it element by element is faster than the concatenation method `[x x; y y]`. This PR tries to close that performance gap (a benchmark sketch comparing the two styles follows).

Optimization overview: …

Key results: the performance of `[x x; y y]`-style concatenation is now comparable to that of direct matrix initialization. This resolves a previously significant performance gap.

Future work: …
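For reference, a minimal benchmark sketch of the two construction styles compared above (assuming BenchmarkTools; `init` and `concat` are throwaway names):

```julia
using BenchmarkTools

# Initialization method: allocate, then fill each element.
init(x, y) = (A = Matrix{Float64}(undef, 2, 2);
              A[1, 1] = x; A[1, 2] = x;
              A[2, 1] = y; A[2, 2] = y;
              A)

# Concatenation method: the syntax this PR optimizes.
concat(x, y) = [x x; y y]

@btime init(x, y)   setup=(x = rand(); y = rand())
@btime concat(x, y) setup=(x = rand(); y = rand())
```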
Benchmark performance

The very first version of this commit was implemented on the 1.10.9 release branch, so I've also included the 1.10-dev performance. Notice the performance differences between the initialization method and the concatenation method.

`[x x; y y]` — 2D Matrix Creation
- Same Data Type: 2×2 Matrix, 6×6 Matrix
- Mixed Data Types: 2×2 Matrix, 6×6 Matrix

`[x x;;; y y]` — Higher-Dimensional Array Creation
- Same Data Type: 1×2×2 Array, 3×3×3 Array
- Mixed Data Types: 1×2×2 Array, 3×3×3 Array

[benchmark tables omitted in this capture]
Benchmark test script: …
fixes #58397