Deformable image registration on the GPU by daljit46 · Pull Request #3292 · MRtrix3/mrtrix3

daljit46 · 2026-03-31T11:46:53Z

This work is a from scratch implementation of non-linear scalar image registration on the GPU. The two main objectives are to:

Demonstrate the first genuinely complex application of the GPU API recently introduced in MRtrix3; extending and superseding the work started in Affine registration on the GPU #3258.
Implement a registration algorithm that improves the current approach in mrregister, both in terms of the quality of the warps generated and runtime performance. The code is able to perform registration of two large images (~256^3) in well under 20 seconds on Apple M2 Max (which is at least an order of magnitude faster compared to mrregister and ANTs). It also outperforms mrregister both in terms of increased (DICE) overlap and also better behaved Jacobian statistics in my tests.

My original intent was to introduce this PR in multiple stages to make it easier for review, but due to time constraints I'm posting this as it is.
The implementation is usable and has been tested on various machines, but it may need refinements in terms of code quality (and possibly architectural decisions) in order to be merged.
My hope is that this will be useful for someone in the future to improve upon or at least take inspiration from to use the GPU API we have in place.

The registration algorithm is an implementation of Log-Demons style approach as originally introduced by Vercauteren et al. in their Symmetric Log-Domain Diffeomorphic Registration: A Demons-based Approach paper. It supports two different metrics: SSD (sum of square differences) and LNCC (local normalised cross-correlation). I have primarily focused on tuning the code and parameters to work with LNCC as, generally speaking, one is able to achieve higher quality results than the naive SSD metric. The implementation also ensures that the generated warps are diffeomorphic and symmetric.

For reference (see the original paper for more details), here is a brief description of the algorithm:

The algorithm uses a symmetric log-domain diffeomorphic registration scheme based on a stationary velocity field.
At each iteration, the velocity field is exponentiated by scaling and squaring to obtain both forward and backward deformation fields.
These deformations are used to evaluate the chosen similarity metric, currently either SSD or LNCC, and to compute local forward and backward Demons-style updates.
The backward update is resampled onto the fixed-image lattice and combined with the forward update to form a symmetric update field.
This update field is Gaussian-smoothed (fluid-like regularisation).
When using LNCC, the update is additionally rescaled so that its maximum effective step size remains close to a prescribed voxel-scale target, which helps avoid overly aggressive updates.
The update is then composed into the current stationary velocity field using a truncated Baker-Campbell-Hausdorff approximation.
The updated velocity field is Gaussian-smoothed again (diffusion-like regularisation).
After optimisation, the final velocity field is exponentiated to produce the forward and backward warps in scanner space.

Below are some thoughts about the implementation and its possible future extensions.

LNCC update formula

Originally, the LNCC update step was adapted from LCC-Demons: A robust and accurate symmetric diffeomorphic registration algorithm. That paper extends the original Log-Demons formulation of Vercauteren et al. by replacing SSD with a local cross-correlation metric. However, its update rule is derived in a midway-space formulation rather than the Lie-algebra averaging framework used here. In the initial version of my implementation, I therefore adapted the update expression from Appendix B of the paper to the asymmetric setting so that it could still be combined using Lie-algebra averaging.
However, despite a significant amount of tuning (and rechecking of the derivation math), I wasn't able to push the algorithm to provide the same deformation quality as the LNCC implementation in ANTs (measured via DICE overlap and Jacobian statistics of the warps using the NIREP and MindBoggle-101 datasets). I had various (untested) theories on why that would be the case, one of them being that an update mechanism based on a stationary velocity field may lack the necessary degree of freedoms of a more general framework like SyN.
However, reading Symmetric diffeomorphic image registration with cross-correlation: Evaluating automated labeling of elderly and neurodegenerative brain p.34, I came across this statement:

The Demons force, based on optical flow, is extremely aggressive compared to our analytical cross-correlation derivative.

This inspired me to replace the second-order approximated LCC-Demons force with a first-order analytic gradient force. The implementation in local_lncc_update.slang computes the local LNCC moments over a window, writing the sums of $R$, $M$, $R^2$, $M^2$, and $RM$ ($R$ for reference intensity and $M$ for moving intensity), from which the local means, covariance, and variances are recovered. Writing the local correlation as

$$\rho = \frac{A}{\sqrt{BC}},$$

with $A$ the local covariance and $B$, $C$ the local variances of the fixed and moving intensities, the update is obtained by differentiating $\rho$ with respect to the moving intensity at the centre voxel and then using the first-order relation

$$dM \approx \nabla M^T \, du.$$

This gives a local ascent direction proportional to

$$\frac{1}{N\sqrt{BC}}\left(R_c - \frac{A}{C} M_c\right)\nabla M,$$

where $R_c$ and $M_c$ are the centred fixed and moving intensities. In the code, this becomes a scalar coefficient multiplying the moving image spatial gradient in scanner coordinates. The full derivation is a bit long, but I'd happy to write it here if needed.

Surprisingly, this replacement (plus a small renormalisation step) proved to yield much better registrations, arguably similar in quality to those produced by ANTs in terms of overlap improvement and Jacobian statistics.
The old update scheme is still reachable in the git history of the changes here.

Regularisation

As mentioned above, the original log-demons approach uses two forms of regularisation: a fluid-like regularisation of the update field and a diffusion-like regularisation of the velocity field. The former acts mainly as a preconditioning step, whereas the latter can be viewed as the primary regularisation mechanism. In both cases, regularisation is implemented through Gaussian smoothing.

The recent MMORF paper shows the importance of biologically plausible warp regularisation for achieving high registration accuracy without excessive shape or volume distortion.
While Gaussian smoothing regularises the deformation field, it does so only indirectly: it encourages spatial smoothness, but does not deliberately optimise for minimal deformation or for maximal metric improvement at a given deformation cost.

I have very briefly researched an alternative regularisation scheme applied to the update step in the current implementation. In contrast to MMORF, where regularisation is defined directly on the displacement field and can therefore penalise local volume and shape changes more explicitly, doing the same in an stationary velocity field (SVF) based framework is less straightforward. In an SVF parameterisation, the optimised quantity is the velocity field rather than the final displacement, so quantities such as local volume change and shear are only related to the transformation indirectly through the exponential map.

One possible approximation in the SVF setting is to regularise first-order derivatives of the velocity update rather than Gaussian-smoothing the field directly. In particular, one can form the infinitesimal strain-rate tensor,

$$\varepsilon(u) = \frac{1}{2}\left(\nabla u + \nabla u^T\right),$$

which captures the local symmetric part of the deformation induced by the update field $u$. Its trace, $\nabla \cdot u$, corresponds to infinitesimal volume change, while its deviatoric part,

$$\mathrm{dev}\,\varepsilon(u) = \varepsilon(u) - \frac{1}{3}(\nabla \cdot u)I,$$

captures shear-like distortion. A regulariser that penalises $(\nabla \cdot u)^2$ and $|\mathrm{dev},\varepsilon(u)|^2$ would therefore discourage local expansion/contraction and shear, respectively. Although this is only an indirect proxy in an SVF framework, it should provide a more targeted alternative to Gaussian smoothing.

A relevant article would be Constrained H^1-regularization schemes for diffeomorphic image registration by Mang and Biros.

FOD reorientation

I haven't really given too much thought on how FOD reorientation could be implemented in the current framework, but I think it's worth mentioning that the current algorithm is completely reliant on GPU 3D textures (to take advantage of hardware linear interpolation). This will obviously be a limitation when it comes to dealing with FOD images, where each voxel contains a set of spherical harmonic coefficients rather than a single scalar value. For that purpose, one would need to devise a way to upload the images as regular buffer objects on the GPU and then wire the appropriate sampling mechanisms on the shader side. For optimal performance, it might be worth considering the use of space-filling curves.

Other

There's probably bugs that need to be squashed out. The code is also missing targeted unit tests.
As mentioned earlier, the code has mainly tuned to work with LNCC as the metric. Changes might be needed to make it work best with SSD.
Multicontrast registration for nonlinear registration is not supported.
The affine code path and the non-linear code were practically developed independently. Not much thought has been given to ensure consistency in terms of naming and code structure.
Shader code is currently resident in cpp/core/gpu/shaders/registration and is symlinked in the build directory. A proper installation strategy is required, see Handle the installation of shader files for GPU commands #3289.
For multi-modal settings, a metric like mutual information may be required. A possible approach is described in Diffeomorphic Demons using Normalised Mutual Information by Modat et al. However, given that mutual information is a global measure and non-linear registration involves local deformations, an approach such as MIND may be considered (I believe there is an implementation of this approach in niftyreg).

- Add support for 3-4 channels vector texture sampling - Add new function for sampling with zero outside of bounds

- Rename nonlinear_metric -> nl_metric - Disallow ncc (will restore in future) - Implement -warp and -transformed options

Avoid threaded_copy overhead when dropping alpha by forcing 2 inner loop axes for the Extract1D RGB copy path.

It doesn't work well and introduces too much shear in deformations

We compute an initial SVF from the affine transform and then feed into the nonlinear registration pipeline at the coarsest level.

The previous nonlinear LNCC update used a closed-form LCC-demons style solve. It was also seemingly too aggressive and produced deformations to with less uniform Jacobian values. Switch the LNCC update to a first-order force obtained by differentiating the local correlation directly and mapping that intensity derivative through the moving-image spatial gradient. This gives a simpler update step and produces deformations that produce same increase in overlap, but with a smaller volume change (tested on nirep and mindboggle-101 data set). We also perform an update renormalisation based on a target maximum step in voxel units, so update magnitudes remain tunable after moving to the gradient force.

- The search now evaluates a real grid of candidates instead of a thin diagonal through axis-angle space. - Previously, the search only replaced the rotation parameters and left translation unchanged. That is fine only if the search pivot happens to be the true rotation centre. The new code keeps an anchor point fixed while rotating.

daljit46 added 30 commits January 30, 2026 11:45

Add magic_enum dependency

0102320

First draft of affine registration on GPU

73e1189

Symlink shader code + hack to find registration shaders

022149f

Use exec path for finding shaders

9384096

Export boolean link-time constants

0f76ded

Update registration uniforms for new GPU API

8866f5f

Add write_object_to_buffer helper

cff396b

Add test for write_object_to_buffer

983f1f1

Enforce 4-byte size for GPU object uploads

b412e03

Rename Metric class -> GlobalMetric

3bb7533

Rename RegistrationConfig -> GlobalRegistrationConfig

0a6b4e2

Add missing include header in registrationtypes.h

cfb2dca

Add new shaders for log-demons nonlinear registration

a51ba42

Refactor VolumeSamplingField and add VectorVolumeSamplingField

7a62d9c

- Add support for 3-4 channels vector texture sampling - Add new function for sampling with zero outside of bounds

Start refactoring for addition of non-linear registration

4b9bb15

Add shader for composing a displacement field

813e5ab

Update comment with definition of Lie Bracket

1e624c0

First sketch of log-demons nonlinear registration

cdf3920

Add new function to download textures as an MR::Image

e6d5795

Produce warp for non linear registration

2ecdb02

Add new function for Gaussian blur of 3D textures

60ba4f4

Apply gaussian blur smoothing to velocity update field

7acc56c

Apply gaussian smoothing to velocity field itself

eaddc3a

Add missing includes in imageoperations.cpp

471053d

Update blur shader for gaussian smoothing MERGE PREV

5b80047

Tweak CLI options for non linear case

aa0a2fd

- Rename nonlinear_metric -> nl_metric - Disallow ncc (will restore in future) - Implement -warp and -transformed options

Add ssd cost evaluation + logging

5893c29

Add missing includes

caf1d20

Update comment

4f5568f

Add comment for instructions for MacOS profiling

802bed0

daljit46 added 28 commits March 13, 2026 02:48

Use single workgroup reduction using custom op

b7d88fb

Use 2 inner axes for RGB extract copy in GPU texture download

29877cc

Avoid threaded_copy overhead when dropping alpha by forcing 2 inner loop axes for the Extract1D RGB copy path.

Add support for masks in non linear registration

48b1163

Remove inertia code path

f6ec914

It doesn't work well and introduces too much shear in deformations

Add/update copyright headers for Slang shaders

1d5b514

Apply clang-format + update copyright cpp files

d52e5e3

Add initial affine transform for nonlinear registration

d10f3cc

We compute an initial SVF from the affine transform and then feed into the nonlinear registration pipeline at the coarsest level.

Expose Gaussian regularisation to CLI

74c01dc

Don't converge with worsening objective

8fd0e24

Fix row/column multiplication bug

767ec8f

Don't allow flat local windows to contribute for lncc

1c8570b

Remove unused struct in texture_utils.slang

94780a3

Remove unused file

d316dea

Update synopsis and references

0c9baf5

Remove midway outputs

4818130

Rename mrreggpu -> mrgpureg

4e3c6ec

Implement moment based initialisation

982ce63

Add some comments in reduction_utils.slang

bb67efb

Remove memory reduction request

7722bc6

Merge remote-tracking branch 'origin/dev' into non_linear_reg

075b263

Make rotation search default init strategy

185e62f

Allow to rigid/affine model

9dfbb1e

Fix bug that was causing NaNs in CoM computation

73dfab4

Add multistage support

4c05ec7

Minor CLI tweaks

36f746a

Remove duplicate lines

8c78512

daljit46 added the enhancement label Mar 31, 2026

daljit46 changed the title ~~Non linear reg~~ Deformable image registration on the GPU Mar 31, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deformable image registration on the GPU#3292

Deformable image registration on the GPU#3292
daljit46 wants to merge 99 commits intodevfrom
non_linear_reg

daljit46 commented Mar 31, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

daljit46 commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

LNCC update formula

Regularisation

FOD reorientation

Other

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

daljit46 commented Mar 31, 2026 •

edited

Loading