0.5.0: Fused ops, FP8 compute, 2:4 sparsity, autograd expansion, production hardening #5

Description

@farhan-syah

Overview

0.5.0 is a major feature and hardening release: 131 commits, 875 files changed, +85k/-28k lines. It delivers fused GPU kernels, FP8/quantized compute, structured sparsity, and significantly expanded autograd coverage, alongside extensive refactoring and deduplication.

Downstream integration validated: solvr and boostr were both tested against numr 0.5.0, so this release unblocks publishing updated versions of both.

Closes via #6.


Completed

Fused Operations

  • Fused GEMM epilogue: matmul+bias+activation in a single kernel (forward + backward)
  • Fused activation-mul for gated architectures (SwiGLU, SiLU-mul)
  • Fused add-norm: residual add + normalization in one pass (forward + backward)
  • Fused elementwise operation chains across all backends
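For readers unfamiliar with the fused add-norm item, here is a minimal CPU sketch of the idea, assuming RMS normalization over a single row. The function name `fused_add_rms_norm` and its signature are hypothetical and are not numr's API; the actual kernels (and their backward passes) may be structured differently.

```rust
/// Hypothetical sketch: residual add + RMS norm for one row.
/// The add and the sum-of-squares reduction share one loop, so the
/// intermediate `residual + x` is never written to a separate buffer.
pub fn fused_add_rms_norm(
    residual: &mut [f32], // updated in place to residual + x
    x: &[f32],
    weight: &[f32],
    eps: f32,
    out: &mut [f32], // receives the normalized, weighted result
) {
    assert_eq!(residual.len(), x.len());
    assert_eq!(residual.len(), weight.len());
    assert_eq!(residual.len(), out.len());
    let n = residual.len();
    if n == 0 {
        return;
    }

    // Loop 1: residual add fused with the sum-of-squares reduction.
    let mut sum_sq = 0.0f32;
    for (r, &xi) in residual.iter_mut().zip(x) {
        *r += xi;
        sum_sq += *r * *r;
    }
    let inv_rms = 1.0 / (sum_sq / n as f32 + eps).sqrt();

    // Loop 2: scale by the inverse RMS and the learned weight.
    for ((o, &r), &w) in out.iter_mut().zip(residual.iter()).zip(weight) {
        *o = r * inv_rms * w;
    }
}
```

An unfused version would write `residual + x` to a temporary tensor and read it back for the normalization; the saving here is that memory round trip.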

FP8 & Quantized Compute

  • FP8 (E4M3/E5M2) matmul across all backends
  • FP8 kernel support across CUDA compute paths
  • i8×i8→i32 quantized matrix multiplication (CPU)
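The i8×i8→i32 contract can be illustrated with a naive reference loop. This is only a sketch under assumed row-major layouts; `matmul_i8_i32` is a hypothetical name, and the real CPU kernel is presumably tiled and vectorized.

```rust
/// Hypothetical reference: C (m×n, i32) = A (m×k, i8) × B (k×n, i8),
/// all row-major. Products are widened to i32 before accumulation,
/// which is the point of the i8×i8→i32 contract.
pub fn matmul_i8_i32(a: &[i8], b: &[i8], m: usize, k: usize, n: usize) -> Vec<i32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0i32; m * n];
    for i in 0..m {
        for p in 0..k {
            // Widen once per (i, p) and reuse across the output row.
            let a_ip = a[i * k + p] as i32;
            for j in 0..n {
                c[i * n + j] += a_ip * b[p * n + j] as i32;
            }
        }
    }
    c
}
```

Each product has magnitude at most 128 × 128 = 16384, so an i32 accumulator cannot overflow until k reaches roughly 131,000 terms, whereas an i16 accumulator can overflow after two.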

Sparse

  • 2:4 structured sparsity with multi-backend support
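As a reminder of what 2:4 means: in every group of four consecutive values, at most two are non-zero. The sketch below prunes by magnitude and stores the kept values with their in-group positions. The names `prune_2_4` and `expand_2_4` and the storage layout (one `u8` index per kept value) are assumptions for illustration, not numr's format.

```rust
/// Hypothetical sketch: keep the two largest-magnitude values in each
/// group of four. Returns (kept values, their positions 0..=3 in the group).
pub fn prune_2_4(dense: &[f32]) -> (Vec<f32>, Vec<u8>) {
    assert!(dense.len() % 4 == 0, "length must be a multiple of 4");
    let mut values = Vec::with_capacity(dense.len() / 2);
    let mut indices = Vec::with_capacity(dense.len() / 2);
    for group in dense.chunks_exact(4) {
        let mut order = [0u8, 1, 2, 3];
        // Stable sort by descending magnitude; ties keep the lower index.
        order.sort_by(|&a, &b| {
            group[b as usize].abs().total_cmp(&group[a as usize].abs())
        });
        let mut kept = [order[0], order[1]];
        kept.sort_unstable(); // store kept positions in ascending order
        for &i in &kept {
            values.push(group[i as usize]);
            indices.push(i);
        }
    }
    (values, indices)
}

/// Inverse of `prune_2_4`: scatter kept values back into a dense buffer.
pub fn expand_2_4(values: &[f32], indices: &[u8]) -> Vec<f32> {
    assert_eq!(values.len(), indices.len());
    assert!(values.len() % 2 == 0);
    let mut dense = vec![0.0f32; values.len() * 2];
    for (slot, (&v, &i)) in values.iter().zip(indices).enumerate() {
        let group = slot / 2; // two kept values per group of four
        dense[group * 4 + i as usize] = v;
    }
    dense
}
```

The compressed form holds half the values plus a small amount of index metadata, which is what lets a sparse matmul skip half the multiplications.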

Autograd Expansion

  • Differentiable conv1d, conv2d, softmax, rms_norm, layer_norm, SiLU, softplus, SwiGLU, dropout, fused GEMM epilogue, fused add-norm, dtype cast, narrow, cat, gather
  • Activation checkpointing
  • Backward hooks for distributed gradient sync

Performance

  • CUDA caching allocator (replaces stream-ordered alloc)
  • CUDA pipelined D2H copy for concurrent execution
  • GEMV-BT fast paths across CPU/CUDA/WebGPU
  • Online softmax in SIMD kernels
  • Welford algorithm for numerically stable variance
  • AVX2 transcendental/special function SIMD kernels
  • Tiled GEMM with dual-accumulator FMA microkernels (AVX2/AVX-512/NEON)
  • Half-precision GEMV-BT acceleration (f16/bf16)
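For context on the Welford item: the algorithm updates a running mean and a running sum of squared deviations in a single pass, avoiding the cancellation that the naive E[x²] − E[x]² formula suffers when the mean is large relative to the spread. A minimal scalar sketch follows; `welford_mean_var` is a hypothetical name, and it returns the population variance, which may differ from what numr's kernels compute.

```rust
/// Hypothetical sketch of Welford's single-pass algorithm.
/// Returns (mean, population variance), or None for empty input.
pub fn welford_mean_var(xs: &[f32]) -> Option<(f32, f32)> {
    if xs.is_empty() {
        return None;
    }
    let mut count = 0.0f32;
    let mut mean = 0.0f32;
    let mut m2 = 0.0f32; // running sum of squared deviations from the mean
    for &x in xs {
        count += 1.0;
        let delta = x - mean; // deviation from the old mean
        mean += delta / count;
        m2 += delta * (x - mean); // old deviation times new deviation
    }
    Some((mean, m2 / count))
}
```

For the input `[10000.0, 10001.0, 10002.0]` the deviations stay small (at most 1.5), so f32 keeps full precision, while the naive formula would subtract two values near 1e8.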

Runtime & Infrastructure

  • CUDA graph capture support
  • NCCL communicator for multi-GPU collectives
  • Nexar inter-node communicator
  • Seeded deterministic RNG across all backends
  • Internal RNG (removed external rand/rand_distr dependency)
  • Slice assign operation across all backends
  • Streaming sync ops for compute-communication overlap
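To show what "seeded deterministic RNG" without external crates can look like, here is a small sketch using the public-domain SplitMix64 generator. This issue does not say which algorithm numr uses, so SplitMix64 is a stand-in chosen for brevity, and `SplitMix64` and `fill_uniform` are hypothetical names.

```rust
/// Stand-in generator (SplitMix64); not necessarily numr's algorithm.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f32 in [0, 1), built from the top 24 bits.
    pub fn next_f32(&mut self) -> f32 {
        (self.next_u64() >> 40) as f32 / (1u32 << 24) as f32
    }
}

/// Fills `out` with uniform samples; the same seed gives the same values.
pub fn fill_uniform(seed: u64, out: &mut [f32]) {
    let mut rng = SplitMix64::new(seed);
    for v in out.iter_mut() {
        *v = rng.next_f32();
    }
}
```

Because the output depends only on the seed and the draw order, two runs with the same seed produce identical tensors; making that hold across backends additionally requires each backend to consume the stream in the same order.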

Architecture

  • Runtime::DType associated type
  • CPU backend made unconditional
  • Backward pass accumulation in precision-appropriate float type
  • Static WGSL shaders replacing runtime generation
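The accumulation item is easiest to motivate with a toy example. The sketch below emulates bf16 by rounding an f32 to its upper 16 bits (round-to-nearest-even, NaN not handled), then sums gradients with a bf16 accumulator versus an f32 accumulator. All names are hypothetical, and the emulation is an assumption used to keep the example dependency-free.

```rust
/// Emulates bf16 storage: round an f32 to its upper 16 bits
/// (round-to-nearest-even). NaN inputs are not handled in this sketch.
pub fn round_to_bf16(x: f32) -> f32 {
    let bits = x.to_bits();
    let bias = 0x7FFF + ((bits >> 16) & 1);
    f32::from_bits(bits.wrapping_add(bias) & 0xFFFF_0000)
}

/// Accumulates in bf16: every partial sum is rounded back to bf16.
pub fn sum_bf16_accum(grads: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for &g in grads {
        acc = round_to_bf16(acc + round_to_bf16(g));
    }
    acc
}

/// Accumulates in f32 and rounds to bf16 once at the end.
pub fn sum_f32_accum(grads: &[f32]) -> f32 {
    let acc: f32 = grads.iter().map(|&g| round_to_bf16(g)).sum();
    round_to_bf16(acc)
}
```

With 1000 gradients of 1.0, the bf16 accumulator stalls at 256.0, because 257 is not representable with 8 significant bits and rounds back to 256, while the f32 accumulator returns 1000.0.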

Code Organization (completed splits)

  • Autograd reduce ops — split by operation
  • CPU AVX2 math kernels — split by function category
  • CUDA sparse merge kernels — split by strategy
  • CUDA index kernel launchers — split into modules

Downstream Integration

  • Test solvr against numr 0.5.0
  • Test boostr against numr 0.5.0
  • Resolve build failures in downstream crates

Fixes

  • aarch64 NEON: replaced non-existent vmvnq_u64 with correct bitwise NOT
  • Softmax NaN prevention for -inf inputs
  • Contiguity check for size-1 dim strides
  • CUDA graph capture allocator freeze/unfreeze
  • Batched matmul broadcasting across all backends
  • F16/BF16 backward pass numerical stability (f32 accumulation)
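To illustrate the softmax NaN fix together with the online softmax item under Performance, here is a scalar sketch. Online softmax tracks the running max and a rescaled running sum in one pass; the NaN hazard is that `-inf - (-inf)` is NaN when a row is fully masked. Returning all zeros for such a row is an assumption made here, not necessarily numr's behavior, and `online_softmax` is a hypothetical name. Inputs are assumed to be finite or `-inf`.

```rust
/// Hypothetical sketch: online softmax with a guard for -inf inputs.
/// Assumes every input is finite or -inf (no +inf, no NaN).
pub fn online_softmax(xs: &[f32]) -> Vec<f32> {
    let mut max = f32::NEG_INFINITY;
    let mut sum = 0.0f32;
    for &x in xs {
        if x == f32::NEG_INFINITY {
            continue; // contributes exp(-inf) = 0; skipping avoids -inf - -inf
        }
        if x > max {
            // Rescale the running sum to the new max. On the first finite
            // element this multiplies 0 by exp(-inf) = 0, which is still 0.
            sum *= (max - x).exp();
            max = x;
        }
        sum += (x - max).exp();
    }
    if sum == 0.0 {
        // No finite element was seen (fully masked or empty row).
        return vec![0.0; xs.len()];
    }
    xs.iter().map(|&x| (x - max).exp() / sum).collect()
}
```

A naive implementation computes `exp(x - max)` with `max == -inf` for a fully masked row and produces NaN in every position, which then propagates through the rest of the model.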

Deferred to 0.6.0

Tracked separately:

  • Error handling cleanup (~1,400 unwraps)
  • Remaining oversized file splits (22 files)
  • Migration guide (ndarray/PyTorch)
  • API stability audit
  • Second-order derivative fragility fix
  • Remaining autograd ops (complex, scatter, index_select)
  • CI hardening (cargo audit, coverage metrics)
