0.5.0: Fused ops, FP8 compute, 2:4 sparsity, autograd expansion, production hardening

## Overview

0.5.0 is a major feature and hardening release: 131 commits, 875 files changed, +85k/-28k lines. It delivers fused GPU kernels, FP8/quantized compute, structured sparsity, and significantly expanded autograd coverage, alongside extensive refactoring and deduplication.

Downstream integration is validated: solvr and boostr were tested against numr 0.5.0, and this release unblocks publishing updated versions of both.

Closes via #6.

---

## Completed

### Fused Operations
- [x] Fused GEMM epilogue: matmul+bias+activation in a single kernel (forward + backward)
- [x] Fused activation-mul for gated architectures (SwiGLU, SiLU-mul)
- [x] Fused add-norm: residual add + normalization in one pass (forward + backward)
- [x] Fused elementwise operation chains across all backends
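
For readers unfamiliar with the fused add-norm item above, here is a minimal CPU sketch of the idea. It assumes RMS normalization (the checklist does not say which normalization is fused), and the function name `fused_add_rms_norm` and its signature are hypothetical, not numr's API. The residual add and the sum-of-squares reduction share one loop, so the summed tensor is never re-read from memory before the reduction.

```rust
/// Hypothetical sketch: residual add fused with RMS normalization.
/// Returns (normalized output, residual sum); the sum is what a backward
/// pass or the next layer would typically reuse.
pub fn fused_add_rms_norm(
    x: &[f32],
    residual: &[f32],
    weight: &[f32],
    eps: f32,
) -> (Vec<f32>, Vec<f32>) {
    assert_eq!(x.len(), residual.len());
    assert_eq!(x.len(), weight.len());
    assert!(!x.is_empty());

    let mut summed = Vec::with_capacity(x.len());
    let mut sum_sq = 0.0f32;
    // Residual add and sum-of-squares accumulation in the same loop.
    for (&a, &b) in x.iter().zip(residual) {
        let s = a + b;
        sum_sq += s * s;
        summed.push(s);
    }

    // Scale by the reciprocal RMS and the learned per-element weight.
    let inv_rms = 1.0 / (sum_sq / x.len() as f32 + eps).sqrt();
    let out = summed
        .iter()
        .zip(weight)
        .map(|(&s, &w)| s * inv_rms * w)
        .collect();
    (out, summed)
}
```

An unfused version would write `x + residual` to memory, then launch a separate normalization that reads it back; the fusion removes that round trip.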

### FP8 & Quantized Compute
- [x] FP8 (E4M3/E5M2) matmul across all backends
- [x] FP8 kernel support across CUDA compute paths
- [x] i8×i8→i32 quantized matrix multiplication (CPU)
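
As an illustration of the i8×i8→i32 item, the following is a naive reference sketch, not the optimized numr kernel. The name `matmul_i8_i32` and the row-major slice layout are assumptions made for the example. The point it shows is why the accumulator is widened: a single product of two i8 values can already exceed the i8 range, and sums over `k` exceed i16 quickly.

```rust
/// Hypothetical reference: C (m x n, i32) = A (m x k, i8) * B (k x n, i8),
/// all matrices stored row-major in flat slices.
pub fn matmul_i8_i32(a: &[i8], b: &[i8], m: usize, k: usize, n: usize) -> Vec<i32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0i32; m * n];
    for i in 0..m {
        for p in 0..k {
            // Widen before multiplying so the product cannot overflow.
            let a_ip = a[i * k + p] as i32;
            for j in 0..n {
                c[i * n + j] += a_ip * b[p * n + j] as i32;
            }
        }
    }
    c
}
```

The i-p-j loop order keeps the inner loop walking contiguous rows of `b` and `c`, which is the usual starting point before tiling or SIMD.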

### Sparse
- [x] 2:4 structured sparsity with multi-backend support
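
To make the 2:4 pattern concrete: in every group of four consecutive values, at most two are nonzero, so a row can be stored as half as many values plus a small position index per kept value. The sketch below prunes by magnitude and round-trips through that compressed form. The function names and the `(values, indices)` representation are assumptions for illustration; numr's actual storage layout is not described in this issue.

```rust
/// Hypothetical sketch: compress a dense row to 2:4 form by keeping the two
/// largest-magnitude values in each group of four, with their positions (0..4).
pub fn compress_2_4(dense: &[f32]) -> (Vec<f32>, Vec<u8>) {
    assert!(dense.len() % 4 == 0);
    let mut values = Vec::with_capacity(dense.len() / 2);
    let mut indices = Vec::with_capacity(dense.len() / 2);
    for group in dense.chunks_exact(4) {
        let mut order = [0usize, 1, 2, 3];
        // Stable sort by descending magnitude; ties keep the lower position first.
        order.sort_by(|&x, &y| group[y].abs().total_cmp(&group[x].abs()));
        let mut kept = [order[0], order[1]];
        kept.sort(); // store kept positions in ascending order
        for &pos in &kept {
            values.push(group[pos]);
            indices.push(pos as u8);
        }
    }
    (values, indices)
}

/// Expand the compressed form back to a dense row with zeros in pruned slots.
pub fn decompress_2_4(values: &[f32], indices: &[u8]) -> Vec<f32> {
    assert_eq!(values.len(), indices.len());
    assert!(values.len() % 2 == 0);
    let mut dense = vec![0.0f32; values.len() * 2];
    for (i, (&v, &pos)) in values.iter().zip(indices).enumerate() {
        let group = i / 2; // two kept values per group of four
        dense[group * 4 + pos as usize] = v;
    }
    dense
}
```

Each position fits in 2 bits, so a packed implementation would store four indices per byte; the sketch uses one `u8` per index for readability.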

### Autograd Expansion
- [x] Differentiable conv1d, conv2d, softmax, rms_norm, layer_norm, SiLU, softplus, SwiGLU, dropout, fused GEMM epilogue, fused add-norm, dtype cast, narrow, cat, gather
- [x] Activation checkpointing
- [x] Backward hooks for distributed gradient sync

### Performance
- [x] CUDA caching allocator (replaces stream-ordered alloc)
- [x] CUDA pipelined D2H copy for concurrent execution
- [x] GEMV-BT fast paths across CPU/CUDA/WebGPU
- [x] Online softmax in SIMD kernels
- [x] Welford algorithm for numerically stable variance
- [x] AVX2 transcendental/special function SIMD kernels
- [x] Tiled GEMM with dual-accumulator FMA microkernels (AVX2/AVX-512/NEON)
- [x] Half-precision GEMV-BT acceleration (f16/bf16)
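
Two of the items above name well-known single-pass algorithms. The scalar sketch below shows both in their textbook form; it is not the SIMD code shipped in this release, and the function names are hypothetical. Welford's update tracks a running mean and a sum of squared deviations, avoiding the cancellation of the naive `E[x^2] - E[x]^2` formula. Online softmax tracks a running maximum and rescales the running sum whenever the maximum changes, so the max and the normalizer come out of one pass. Returning all zeros for a row with no finite maximum is a convention chosen for this sketch only.

```rust
/// Welford's algorithm: returns (mean, population variance) in one pass.
pub fn welford_mean_var(xs: &[f64]) -> (f64, f64) {
    let mut mean = 0.0;
    let mut m2 = 0.0; // running sum of squared deviations from the mean
    for (i, &x) in xs.iter().enumerate() {
        let delta = x - mean;
        mean += delta / (i as f64 + 1.0);
        m2 += delta * (x - mean); // uses the updated mean
    }
    let var = if xs.is_empty() { 0.0 } else { m2 / xs.len() as f64 };
    (mean, var)
}

/// Online softmax: running max and rescaled running sum in a single pass,
/// followed by one pass to write the normalized outputs.
pub fn online_softmax(xs: &[f32]) -> Vec<f32> {
    let mut max = f32::NEG_INFINITY;
    let mut sum = 0.0f32;
    for &x in xs {
        let new_max = max.max(x);
        if new_max == f32::NEG_INFINITY {
            continue; // only -inf seen so far; (-inf) - (-inf) would be NaN
        }
        // Rescale the old sum to the new max, then add the new term.
        sum = sum * (max - new_max).exp() + (x - new_max).exp();
        max = new_max;
    }
    if sum == 0.0 {
        return vec![0.0; xs.len()]; // empty or all -inf (sketch convention)
    }
    xs.iter().map(|&x| (x - max).exp() / sum).collect()
}
```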

### Runtime & Infrastructure
- [x] CUDA graph capture support
- [x] NCCL communicator for multi-GPU collectives
- [x] Nexar inter-node communicator
- [x] Seeded deterministic RNG across all backends
- [x] Internal RNG (removed external rand/rand_distr dependency)
- [x] Slice assign operation across all backends
- [x] Streaming sync ops for compute-communication overlap

### Architecture
- [x] Runtime::DType associated type
- [x] CPU backend made unconditional
- [x] Backward pass accumulation in precision-appropriate float type
- [x] Static WGSL shaders replacing runtime generation

### Code Organization (completed splits)
- [x] Autograd reduce ops — split by operation
- [x] CPU AVX2 math kernels — split by function category
- [x] CUDA sparse merge kernels — split by strategy
- [x] CUDA index kernel launchers — split into modules

### Downstream Integration
- [x] Test solvr against numr 0.5.0
- [x] Test boostr against numr 0.5.0
- [x] Resolve build failures in downstream crates

### Fixes
- [x] aarch64 NEON: replaced non-existent `vmvnq_u64` with correct bitwise NOT
- [x] Softmax NaN prevention for -inf inputs
- [x] Contiguity check for size-1 dim strides
- [x] CUDA graph capture allocator freeze/unfreeze
- [x] Batched matmul broadcasting across all backends
- [x] F16/BF16 backward pass numerical stability (f32 accumulation)
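
The size-1 stride fix is easy to get wrong, so here is a sketch of the rule as it is commonly stated: a dimension of size 1 is never stepped along, so its stride carries no information and should not be compared against the expected row-major stride. The function name `is_contiguous` and the `isize` stride type are assumptions for the example, not necessarily numr's definitions.

```rust
/// Hypothetical sketch: row-major contiguity check that ignores the strides
/// of size-1 dimensions. Strides are in elements, not bytes.
pub fn is_contiguous(shape: &[usize], strides: &[isize]) -> bool {
    assert_eq!(shape.len(), strides.len());
    if shape.contains(&0) {
        return true; // an empty tensor has no elements to misplace
    }
    let mut expected: isize = 1;
    // Walk from the innermost dimension outward.
    for (&dim, &stride) in shape.iter().zip(strides).rev() {
        if dim == 1 {
            continue; // stride is never used for indexing, so any value is fine
        }
        if stride != expected {
            return false;
        }
        expected *= dim as isize;
    }
    true
}
```

Under this rule a `[2, 1, 3]` view with strides `[3, 99, 1]` counts as contiguous, while a transposed `[2, 3]` view with strides `[1, 2]` does not.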

---

## Deferred to 0.6.0

Tracked separately:

- Error handling cleanup (~1,400 unwraps)
- Remaining oversized file splits (22 files)
- Migration guide (ndarray/PyTorch)
- API stability audit
- Second-order derivative fragility fix
- Remaining autograd ops (complex, scatter, index_select)
- CI hardening (cargo audit, coverage metrics)
