BatchSolve.jl aims to provide various functionalities for solving batch-able, vectorized residual functions (for root finding) and objective functions (for minimizing). As long as your function is vectorized, then BatchSolve.jl will take care of the rest!
In summary, this package features:
- Solvers for (GPU) vectorized residual/merit functions
- The
AutoBatchautomatic-differentiation (AD) type for computing batches of derivatives of CPU/GPU vectorized/parallel functions, and is also fully compatible withDifferentiationInterface.jland any of its supported AD backends - Specialized bindings for
FiniteDiff.jlwithAutoBatchto compute CPU/GPU vectorized finite differences - Solvers that directly plug into CUDA's cuBLAS library for batched linear solving
What do we mean by "vectorized"? Say you want to find the root x of this function given some parameters p:
f(x, p) = (x-1)^2 - p^2You can do this easily with a Newton or Brent method, of course, probably available in some other root-finding package. But what if you want to solve this for 1 million different p? Well, if we let p be a vector of those million different parameters, then we can evaluate this function for a million different x using the broadcast operator "." in Julia:
f_vectorized(x, p) = f.(x, p)The result of f_vectorized will be a vector where each element is the corresponding f(x,p), and the evaluation will (hopefully) take advantage of SIMD if your CPU supports it. Better yet, if x and p are GPU arrays (e.g. CuArray for NVIDIA CUDA), then the evaluation will be GPU-parallelized!
With BatchSolve.jl, we can do root finding on this function in a vectorized way too. All we need to do is specify the batchdim - the dimension of the input array along which each independent element in a batch lies, where 1 corresponds to the rows and 2 to the columns. This allows for batched solving of many n-dimensional dimensional systems. For our 1D example here we would just choose 1. E.g. for a batch with p=1:10000,
sol = newton(f_vectorized, zeros(10000), Constant(1:10000), batchdim=1)where we specified that p is a Constant - not mutated throughout function evaluation. Arguments could also be specified as Cache if they are mutated. These constructs are re-exported from DifferentiationInterface.jl, and more details can be found here. For all AD purposes, BatchSolve uses DifferentiationInterface. Therefore it is easy to use a different AD backend, e.g. Enzyme:
using Enzyme
sol = newton(f_vectorized, zeros(10000), Constant(1:10000), batchdim=1, autodiff=AutoEnzyme())The returned object is a NamedTuple with the fields:
u: an array with size equal to the input array, containing the inputs at the rootsf: an array with size equal to the output array, containing the outputs at the roots (should be zero for converged inputs)jac: a (sparse) matrix storing the Jacobian for the entire batched-system at the last iterationretcode: an array of return codes for each element in the batch, where0x0is success,0x1is failure, and0x2means the maximum number of iterations was reachediters: an array storing the number of iterations until convergence for each element in the batch
To observe the newton iteration while it is running, set the verbose keyword argument to true.
Of course, BatchSolve.jl is not limited to only 1D functions. Consider this example:
g(x, p) = x .+ pThis function takes in a vector x and adds a vector p to it. Suppose we want to do a Newton search on this system now for vectors of length 2, for many different p, say 10000. In this simple case we can evaluate g in a vectorized way for all systems simply by initializing x and p as matrices, either 2 x 10000 or 10000 x 2. For batchdim=1, this would be 10000 x 2:
x0 = zeros(10000, 2)
p = rand(10000, 2)
sol = newton(g, x0, Constant(p), batchdim=1)sol.u will be a 10000 x 2 matrix with the roots. For batchdim=2:
x0 = zeros(2, 10000)
p = rand(2, 10000)
sol = newton(g, x0, Constant(p), batchdim=2)Of course, you could just treat this batch all together as one vector of size 20000 (number of elements in the "batch" equal to one). This can be done by not setting batchdim, which basically gives a regular non-batched Newton solver:
x0 = zeros(20000)
p = rand(20000)
sol = newton(g, x0, Constant(p))The problem with this approach is that now instead of solving many small 2x2 systems, the linear algebra solver needs to solve a giant 20000 x 20000 matrix equation. This will generally be much slower than just solving many small systems. In fact, we tested this approach with CUDA's cuDSS solver, and unfortunately we found performance to scale poorly beyond 60000 x 60000 systems.
For CUDA arrays, BatchSolve.jl plugs directly into CUDA's cuBLAS library which provides GPU-accelerated batched linear system solving.
Implemented batched solvers include:
- Newton-Raphson
newton(uses derivatives)
- 1D Brent's method
brent - COMING SOON: Differential evolution
de - COMING SOON: Genetic algorithm
ga
While the example shown above uses the broadcast operators with GPU arrays, you can of course use this package with any vectorized function. We generally recommend writing kernels explicitly with KernelAbstractions.jl, and laying ouy memory in a SIMD-able, structure-of-arrays format. This would correspond to batchdim=1 in Julia as a column-major language. The only thing to note is that for CUDA arrays with batchdim=1, the sparse Jacobian matrix must be permutedims-ed in order to be inputted into CUBLAS's batched linear system solver, and then the solution permutedims-ed back to the solution array. Depending on the cost of your function, it is possible that using batchdim=2 (for array-of-structures) is faster, because this permutedims is then not necessary. Therefore we recommended testing both for your particular use case.
This package also features special bindings for FiniteDiff.jl with AutoBatch to accelerate finite differences calculations for vectorized functions: if a function is vectorized, then instead of evaluating the same function many times for each tangent, we can just evaluate it a single time for all tangents. This can offer massive speedups in the case where your function is extremely expensive to evaluate.
This package is geared towards lightweight but high performance batched solvers, and requires the user to have some knowledge of the appropriate solver to choose for their problem in order to get reasonable results. This is quite different from the polyalgorithm approaches in other packages, e.g. BlackBoxOptim.jl, which is generally better for cases where you just want to "press go" and get a solution, even for hard problems. Furthermore, in cases where a solution doesn't converge or blows up, all BatchSolve.jl will tell you is that it failed (and give you the last Jacobian if derivatives are used). This package also lacks many features in comparison to e.g. NonlinearSolve.jl. It is very barebones in comparison, but will get the job done fast and correctly if your problem is well-paired with the selected solver.
BatchSolve.jl makes very heavy use of the powerful DifferentiationInterface.jl package and its sparse automatic-differentiation features via SparseMatrixColorings.jl. It goes without saying that BatchSolve.jl would not be possible without the massive amount of careful work put into both of these packages. We are extremely thankful for this work.
