Skip to content

Deformable image registration on the GPU#3292

Open
daljit46 wants to merge 99 commits intodevfrom
non_linear_reg
Open

Deformable image registration on the GPU#3292
daljit46 wants to merge 99 commits intodevfrom
non_linear_reg

Conversation

@daljit46
Copy link
Copy Markdown
Member

@daljit46 daljit46 commented Mar 31, 2026

This work is a from scratch implementation of non-linear scalar image registration on the GPU. The two main objectives are to:

  • Demonstrate the first genuinely complex application of the GPU API recently introduced in MRtrix3; extending and superseding the work started in Affine registration on the GPU #3258.
  • Implement a registration algorithm that improves the current approach in mrregister, both in terms of the quality of the warps generated and runtime performance. The code is able to perform registration of two large images (~256^3) in well under 20 seconds on Apple M2 Max (which is at least an order of magnitude faster compared to mrregister and ANTs). It also outperforms mrregister both in terms of increased (DICE) overlap and also better behaved Jacobian statistics in my tests.

My original intent was to introduce this PR in multiple stages to make it easier for review, but due to time constraints I'm posting this as it is.
The implementation is usable and has been tested on various machines, but it may need refinements in terms of code quality (and possibly architectural decisions) in order to be merged.
My hope is that this will be useful for someone in the future to improve upon or at least take inspiration from to use the GPU API we have in place.

The registration algorithm is an implementation of Log-Demons style approach as originally introduced by Vercauteren et al. in their Symmetric Log-Domain Diffeomorphic Registration: A Demons-based Approach paper. It supports two different metrics: SSD (sum of square differences) and LNCC (local normalised cross-correlation). I have primarily focused on tuning the code and parameters to work with LNCC as, generally speaking, one is able to achieve higher quality results than the naive SSD metric. The implementation also ensures that the generated warps are diffeomorphic and symmetric.

For reference (see the original paper for more details), here is a brief description of the algorithm:

  • The algorithm uses a symmetric log-domain diffeomorphic registration scheme based on a stationary velocity field.
  • At each iteration, the velocity field is exponentiated by scaling and squaring to obtain both forward and backward deformation fields.
  • These deformations are used to evaluate the chosen similarity metric, currently either SSD or LNCC, and to compute local forward and backward Demons-style updates.
  • The backward update is resampled onto the fixed-image lattice and combined with the forward update to form a symmetric update field.
  • This update field is Gaussian-smoothed (fluid-like regularisation).
  • When using LNCC, the update is additionally rescaled so that its maximum effective step size remains close to a prescribed voxel-scale target, which helps avoid overly aggressive updates.
  • The update is then composed into the current stationary velocity field using a truncated Baker-Campbell-Hausdorff approximation.
  • The updated velocity field is Gaussian-smoothed again (diffusion-like regularisation).
  • After optimisation, the final velocity field is exponentiated to produce the forward and backward warps in scanner space.

Below are some thoughts about the implementation and its possible future extensions.

LNCC update formula

Originally, the LNCC update step was adapted from LCC-Demons: A robust and accurate symmetric diffeomorphic registration algorithm. That paper extends the original Log-Demons formulation of Vercauteren et al. by replacing SSD with a local cross-correlation metric. However, its update rule is derived in a midway-space formulation rather than the Lie-algebra averaging framework used here. In the initial version of my implementation, I therefore adapted the update expression from Appendix B of the paper to the asymmetric setting so that it could still be combined using Lie-algebra averaging.
However, despite a significant amount of tuning (and rechecking of the derivation math), I wasn't able to push the algorithm to provide the same deformation quality as the LNCC implementation in ANTs (measured via DICE overlap and Jacobian statistics of the warps using the NIREP and MindBoggle-101 datasets). I had various (untested) theories on why that would be the case, one of them being that an update mechanism based on a stationary velocity field may lack the necessary degree of freedoms of a more general framework like SyN.
However, reading Symmetric diffeomorphic image registration with cross-correlation: Evaluating automated labeling of elderly and neurodegenerative brain p.34, I came across this statement:

The Demons force, based on optical flow, is extremely aggressive compared to our analytical cross-correlation derivative.

This inspired me to replace the second-order approximated LCC-Demons force with a first-order analytic gradient force. The implementation in local_lncc_update.slang computes the local LNCC moments over a window, writing the sums of $R$, $M$, $R^2$, $M^2$, and $RM$ ($R$ for reference intensity and $M$ for moving intensity), from which the local means, covariance, and variances are recovered. Writing the local correlation as

$$\rho = \frac{A}{\sqrt{BC}},$$

with $A$ the local covariance and $B$, $C$ the local variances of the fixed and moving intensities, the update is obtained by differentiating $\rho$ with respect to the moving intensity at the centre voxel and then using the first-order relation

$$dM \approx \nabla M^T \, du.$$

This gives a local ascent direction proportional to

$$\frac{1}{N\sqrt{BC}}\left(R_c - \frac{A}{C} M_c\right)\nabla M,$$

where $R_c$ and $M_c$ are the centred fixed and moving intensities. In the code, this becomes a scalar coefficient multiplying the moving image spatial gradient in scanner coordinates. The full derivation is a bit long, but I'd happy to write it here if needed.

Surprisingly, this replacement (plus a small renormalisation step) proved to yield much better registrations, arguably similar in quality to those produced by ANTs in terms of overlap improvement and Jacobian statistics.
The old update scheme is still reachable in the git history of the changes here.

Regularisation

As mentioned above, the original log-demons approach uses two forms of regularisation: a fluid-like regularisation of the update field and a diffusion-like regularisation of the velocity field. The former acts mainly as a preconditioning step, whereas the latter can be viewed as the primary regularisation mechanism. In both cases, regularisation is implemented through Gaussian smoothing.

The recent MMORF paper shows the importance of biologically plausible warp regularisation for achieving high registration accuracy without excessive shape or volume distortion.
While Gaussian smoothing regularises the deformation field, it does so only indirectly: it encourages spatial smoothness, but does not deliberately optimise for minimal deformation or for maximal metric improvement at a given deformation cost.

I have very briefly researched an alternative regularisation scheme applied to the update step in the current implementation. In contrast to MMORF, where regularisation is defined directly on the displacement field and can therefore penalise local volume and shape changes more explicitly, doing the same in an stationary velocity field (SVF) based framework is less straightforward. In an SVF parameterisation, the optimised quantity is the velocity field rather than the final displacement, so quantities such as local volume change and shear are only related to the transformation indirectly through the exponential map.

One possible approximation in the SVF setting is to regularise first-order derivatives of the velocity update rather than Gaussian-smoothing the field directly. In particular, one can form the infinitesimal strain-rate tensor,

$$\varepsilon(u) = \frac{1}{2}\left(\nabla u + \nabla u^T\right),$$

which captures the local symmetric part of the deformation induced by the update field $u$. Its trace, $\nabla \cdot u$, corresponds to infinitesimal volume change, while its deviatoric part,

$$\mathrm{dev}\,\varepsilon(u) = \varepsilon(u) - \frac{1}{3}(\nabla \cdot u)I,$$

captures shear-like distortion. A regulariser that penalises $(\nabla \cdot u)^2$ and $|\mathrm{dev},\varepsilon(u)|^2$ would therefore discourage local expansion/contraction and shear, respectively. Although this is only an indirect proxy in an SVF framework, it should provide a more targeted alternative to Gaussian smoothing.

A relevant article would be Constrained H^1-regularization schemes for diffeomorphic image registration by Mang and Biros.

FOD reorientation

I haven't really given too much thought on how FOD reorientation could be implemented in the current framework, but I think it's worth mentioning that the current algorithm is completely reliant on GPU 3D textures (to take advantage of hardware linear interpolation). This will obviously be a limitation when it comes to dealing with FOD images, where each voxel contains a set of spherical harmonic coefficients rather than a single scalar value. For that purpose, one would need to devise a way to upload the images as regular buffer objects on the GPU and then wire the appropriate sampling mechanisms on the shader side. For optimal performance, it might be worth considering the use of space-filling curves.

Other

  • There's probably bugs that need to be squashed out. The code is also missing targeted unit tests.

  • As mentioned earlier, the code has mainly tuned to work with LNCC as the metric. Changes might be needed to make it work best with SSD.

  • Multicontrast registration for nonlinear registration is not supported.

  • The affine code path and the non-linear code were practically developed independently. Not much thought has been given to ensure consistency in terms of naming and code structure.

  • Shader code is currently resident in cpp/core/gpu/shaders/registration and is symlinked in the build directory. A proper installation strategy is required, see Handle the installation of shader files for GPU commands #3289.

  • For multi-modal settings, a metric like mutual information may be required. A possible approach is described in Diffeomorphic Demons using Normalised Mutual Information by Modat et al. However, given that mutual information is a global measure and non-linear registration involves local deformations, an approach such as MIND may be considered (I believe there is an implementation of this approach in niftyreg).

- Add support for 3-4 channels vector texture sampling
- Add new function for sampling with zero outside of bounds
- Rename nonlinear_metric -> nl_metric
- Disallow ncc (will restore in future)
- Implement -warp and -transformed options
daljit46 added 28 commits March 13, 2026 02:48
Avoid threaded_copy overhead when dropping alpha by forcing 2 inner loop
axes for the Extract1D RGB copy path.
It doesn't work well and introduces too much shear in deformations
We compute an initial SVF from the affine transform and then feed into
the nonlinear registration pipeline at the coarsest level.
The previous nonlinear LNCC update used a closed-form LCC-demons style solve.
It was also seemingly too aggressive and produced deformations to with
less uniform Jacobian values.
Switch the LNCC update to a first-order force obtained by differentiating the
local correlation directly and mapping that intensity derivative through the
moving-image spatial gradient. This gives a simpler update step and
produces deformations that produce same increase in overlap, but with a
smaller volume change (tested on nirep and mindboggle-101 data set).

We also perform an update renormalisation based on a target maximum step
in voxel units, so update magnitudes remain tunable after moving to the 
gradient force.
- The search now evaluates a real grid of candidates instead of a thin
 diagonal through axis-angle space.

- Previously, the search only replaced the rotation parameters and left
translation unchanged. That is fine only if the search pivot happens to
be the true rotation centre. The new code keeps an anchor point fixed
while rotating.
@daljit46 daljit46 changed the title Non linear reg Deformable image registration on the GPU Mar 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant