- Background
- High-Level View
- Empirical Studies
- Silent Errors
- Distributed Training
- Diagnosis
- Code Bug Testing
- Monitoring
- Model Behavior Testing
- Fault Injection Tools
- Industry Post Mortems
- MLSys: The New Frontier of Machine Learning Systems — Position paper outlining the co-design challenges/opportunities across ML, systems, and hardware.
- AI Engineering Quick Start — Practical guide to end-to-end AI engineering workflows and best practices.
- Machine Learning Testing: Survey, Landscapes and Horizons, TSE 2020 — Comprehensive survey of ML testing techniques, tools, and open challenges.
- Hidden Technical Debt in Machine Learning Systems, NeurIPS 2015 — Seminal discussion of non-obvious maintenance costs and systemic risks in ML systems.
- A First Look at Bugs in LLM Inference Engines, arXiv 2025 [Inference] — Early taxonomy and root causes of bugs in LLM inference engines across open-source stacks.
- Towards Understanding Bugs in Distributed Training and Inference Frameworks for Large Language Models, arXiv 2025 [Training] [Inference] — Characterizes bug patterns in distributed LLM frameworks and offers mitigation guidance.
- Characterization of Large Language Model Development in the Datacenter, NSDI 2024 [Training] — Cluster-scale study of LLM development workloads and bottlenecks at Shanghai AI Lab. Review
- An Empirical Study on Low GPU Utilization of Deep Learning Jobs, ICSE 2024 [Training] — Identifies causes of low GPU utilization in DL jobs and practical optimizations.
- Toward Understanding Deep Learning Framework Bugs, TOSEM 2023 [Kernels] — Analyzes bug types, triggers, and impact across major DL frameworks.
- Are Machine Learning Cloud APIs Used Correctly?, ICSE 2021 [Inference] — Studies real-world misuse patterns of ML cloud APIs and their consequences.
- A Comprehensive Empirical Study on Bug Characteristics of Deep Learning Frameworks, IST 2021 [Kernels] — Large-scale analysis of DL framework bug reports to extract categories and trends.
- An Empirical Study on Program Failures of Deep Learning Jobs, ICSE 2020 [Training] — Characterizes failure modes in production DL jobs at Microsoft, focusing on exception-throwing failures.
- An Empirical Study of Common Challenges in Developing Deep Learning Applications, IEEE Software 2020 [Training] [Inference] — Surveys and categorizes practical challenges in building correct and accurate DL apps.
- Taxonomy of Real Faults in Deep Learning Systems, ICSE 2019 [Training] [Inference] — Builds a taxonomy of real-world DL faults with testing implications.
- Understanding Silent Data Corruption in LLM Training, arXiv 2025 [Training] — Amazon's study of SDC causes, manifestations, and detection challenges in LLM training.
- Silent Errors in Large-scale LLM Training: Challenges and Lessons Learned, 2025 [Training] — NVIDIA experience report on prevalence, sources, and mitigations for silent training errors.
- Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks, OSDI 2025 [Training] — TrainCheck automatically infers training invariants and proactively flags silent correctness errors.
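The TrainCheck entry above infers training invariants automatically; as a point of reference, the sketch below hand-codes three invariants that silent training errors commonly violate (finite loss, finite gradients, parameters that actually move after an update). It is illustrative only and uses an assumed toy model, not TrainCheck's API or inference machinery.

```python
# Hand-written invariant checks for a toy training loop.
# Illustrative only: not TrainCheck's API, just the kind of property it infers.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    x = torch.randn(32, 16)
    y = torch.randint(0, 4, (32,))
    opt.zero_grad()
    loss = loss_fn(model(x), y)

    # Invariant 1: the loss stays finite.
    assert torch.isfinite(loss), f"non-finite loss at step {step}"

    before = [p.detach().clone() for p in model.parameters()]
    loss.backward()

    # Invariant 2: gradients exist and are finite.
    for p in model.parameters():
        assert p.grad is not None and torch.isfinite(p.grad).all(), \
            f"bad gradient at step {step}"

    opt.step()

    # Invariant 3: every parameter actually moves after the update;
    # a frozen tensor often signals a silently broken update path.
    for prev, p in zip(before, model.parameters()):
        assert not torch.equal(prev, p.detach()), f"stale parameter at step {step}"

print("all invariants held for 10 steps")
```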
- XPUTIMER: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale, arXiv 2025 [Training] — Real-time anomaly diagnostics tailored for large-scale distributed LLM training.
- TrainVerify: Equivalence-Based Verification for Distributed LLM Training, SOSP 2025 [Training] — Verifies distributed training by checking equivalence against a reference to catch silent errors.
- TTrace: Lightweight Error Checking and Diagnosis for Distributed Training, arXiv 2025 [Training] — Low-overhead tracing to detect and localize errors in distributed training.
- Defeating Nondeterminism in LLM Inference, Blog 2025 [Inference] [Kernels] — Thinking Machines blog on making inference kernels "batch-invariant" so that per-request results do not depend on batch composition; the property is illustrated in the sketch below.
- PerfTracker: Online Performance Troubleshooting for Large-scale Model Training in Production, arXiv 2025 [Training] — Alibaba Cloud system for online localization of training performance bottlenecks and regressions.
- Emerging Platforms Meet Emerging LLMs: A Year-Long Journey of Top-Down Development, arXiv 2024 [Training] — Case study on co-designing software and hardware platforms to support rapidly evolving LLMs.
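The batch invariance discussed in the Defeating Nondeterminism entry above can be checked directly: the result for one example should not depend on what else was in the batch. The sketch below is a toy PyTorch check on an assumed small MLP, not the blog's kernel-level fix; real serving stacks compare bitwise against the production kernels.

```python
# Toy batch-invariance check: per-example outputs should not depend on batching.
# Illustrative only; the blog's fix is at the kernel level, not shown here.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 8)).eval()
x = torch.randn(16, 64)

with torch.no_grad():
    batched = model(x)                                            # full batch at once
    singles = torch.cat([model(x[i:i + 1]) for i in range(16)])   # one example at a time

# Strict batch invariance means bitwise-identical results; on many backends only
# a tolerance-based comparison passes, which is exactly the nondeterminism at issue.
print("bitwise identical:", torch.equal(batched, singles))
print("max abs difference:", (batched - singles).abs().max().item())
```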
- Debugging Machine Learning Pipelines, DEEM 2019 [Training] — Uses decision trees over historical runs to localize ML pipeline performance anomalies. Review
- CRADLE: Cross-Backend Validation to Detect and Localize Bugs in Deep Learning Libraries, ICSE 2019 [Kernels] — Differential testing across DL backends to expose and localize inconsistencies; the core check is sketched below.
- A Static Analyzer for Detecting Tensor Shape Errors in Deep Neural Network Training Code, ICSE 2022 [Training] — Abstract-interpretation-based static analysis to detect tensor shape errors. Review
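CRADLE's core idea, running the same model on multiple backends and flagging disagreements, is easy to picture in miniature. The sketch below is not CRADLE itself (which targets Keras backends); it compares CPU and CUDA outputs of an assumed toy PyTorch model as a stand-in for two backends.

```python
# Differential test of one model across two backends (CPU vs CUDA).
# Illustrative only: CRADLE compares Keras backends and additionally localizes
# the first layer at which hidden states diverge.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 32 * 32, 10),
).eval()
x = torch.randn(4, 3, 32, 32)

with torch.no_grad():
    ref = model(x)  # backend A: CPU
    if torch.cuda.is_available():
        out = model.to("cuda")(x.to("cuda")).cpu()  # backend B: CUDA
        # Small float drift is expected; large disagreement points at a backend bug.
        print("max abs diff CPU vs CUDA:", (ref - out).abs().max().item())
        assert torch.allclose(ref, out, atol=1e-4, rtol=1e-4), "backends diverge"
    else:
        print("CUDA unavailable; skipping the differential check")
```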
- AutoTrainer: An Automatic DNN Training Problem Detection and Repair System, ICSE 2021 [Training] — Detects training issues and applies automated repairs to improve convergence.
- Reliability Assurance for Deep Neural Network Architectures against Numerical Defects, ICSE 2023 [Kernels] — Identifies and mitigates numerical instability (e.g., NaNs/overflow) in DNN computation.
- NeuRI: Diversifying DNN Generation via Inductive Rule Inference, FSE 2023 [Kernels] — Generates diverse DNNs via learned transformation rules to boost test coverage.
- NNSmith: Generating Diverse and Valid Test Cases for Deep Learning Compilers, ASPLOS 2023 [Kernels] — Synthesizes semantically valid models to stress and validate DL compilers.
- Fuzzing Automatic Differentiation in Deep-Learning Libraries, ICSE 2023 [Kernels] — Fuzzes autodiff implementations to reveal gradient calculation bugs; a finite-difference version of this oracle is sketched below.
- Fuzzing Deep-Learning Libraries via Automated Relational API Inference, FSE 2022 [Kernels] — Infers API relations to generate relational checks that uncover defects.
- Free Lunch for Testing: Fuzzing Deep-Learning Libraries from Open Source, ICSE 2022 [Kernels] — Leverages OSS artifacts to derive oracles and fuzz tests for DL libraries.
- Automated Testing of Software that Uses Machine Learning APIs, ICSE 2022 [Inference] — Techniques for testing applications that integrate ML APIs and handling API misuses.
- Keeper: Automated Testing and Fixing of Machine Learning Software, TOSEM 2024 [Training] — End-to-end system to automatically generate tests and propose fixes for ML code.
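The fuzzing papers above all need an oracle that says when a library result is wrong. For the autodiff work, one such oracle is agreement between analytical and finite-difference gradients on random inputs. The sketch below shows that check on `torch.sinh` as an assumed example target; it is not the papers' tooling.

```python
# Fuzz-style oracle for automatic differentiation: on random inputs, the
# framework's analytical gradient should match a finite-difference estimate.
# Illustrative only; torch.sinh is just an assumed target operator.
import torch

def finite_diff(f, x, eps=1e-3):
    # Central difference for an elementwise function.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for trial in range(100):
    x = torch.randn(8, dtype=torch.double, requires_grad=True)
    (analytical,) = torch.autograd.grad(torch.sinh(x).sum(), x)
    numerical = finite_diff(torch.sinh, x.detach())
    # Disagreement beyond tolerance would flag a suspicious gradient rule.
    assert torch.allclose(analytical, numerical, atol=1e-4, rtol=1e-4), \
        f"gradient mismatch on trial {trial}"

print("autodiff agrees with finite differences on 100 random inputs")
```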
- Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks, OSDI 2025 [Training] — Proactive runtime monitoring via inferred invariants to detect silent training errors.
- Self-Checking Deep Neural Networks in Deployment, ICSE 2021 [Inference] — Embeds runtime checks to detect anomalies and trigger self-tests during inference.
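Both monitoring entries above come down to attaching cheap checks to a model that is already running. The sketch below shows only the mechanics, forward hooks that flag non-finite or implausibly large activations on an assumed toy model; the learned per-layer checks used in the self-checking paper are not reproduced here.

```python
# Runtime self-checks attached to a deployed model via forward hooks.
# Illustrative only: flags non-finite or out-of-range activations; the paper's
# learned checks over internal features are not reproduced.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
alerts = []

def make_check(name, max_abs=1e4):
    def hook(module, inputs, output):
        if not torch.isfinite(output).all():
            alerts.append(f"{name}: non-finite activation")
        elif output.abs().max().item() > max_abs:
            alerts.append(f"{name}: activation magnitude {output.abs().max().item():.1e}")
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_forward_hook(make_check(name))

with torch.no_grad():
    model(torch.randn(4, 32))

print("alerts:", alerts if alerts else "none")
```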
- DeepXplore: Automated Whitebox Testing of Deep Learning Systems, SOSP 2017 [Inference] — Introduces neuron coverage and differential testing to generate inputs and expose discrepancies; the coverage metric is sketched below.
- Oracle Issues in Machine Learning and Where to Find Them, ICSEW 2020 [Data] — Identifies and detects issues in ML oracles/labels using entropy and semantic analysis. Review
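Neuron coverage, introduced in the DeepXplore entry above, is the fraction of neurons whose scaled activation exceeds a threshold on at least one test input. The sketch below computes a simplified version for an assumed toy MLP using forward hooks; DeepXplore additionally optimizes inputs jointly to raise coverage and expose cross-model disagreements, which is omitted here.

```python
# Simplified neuron coverage on a toy MLP: fraction of neurons whose scaled
# activation exceeds a threshold on at least one input.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 50), nn.ReLU(),
                      nn.Linear(50, 30), nn.ReLU(),
                      nn.Linear(30, 10)).eval()
threshold = 0.25
covered, layer_sizes = set(), {}

def track(layer_idx):
    def hook(module, inputs, output):
        flat = output.flatten(1)
        layer_sizes[layer_idx] = flat.shape[1]
        # Scale activations to [0, 1] per sample before thresholding.
        lo = flat.min(dim=1, keepdim=True).values
        hi = flat.max(dim=1, keepdim=True).values
        scaled = (flat - lo) / (hi - lo + 1e-8)
        for j in torch.nonzero((scaled > threshold).any(dim=0)).flatten().tolist():
            covered.add((layer_idx, j))
    return hook

for idx, module in enumerate(model):
    if isinstance(module, nn.ReLU):
        module.register_forward_hook(track(idx))

with torch.no_grad():
    model(torch.randn(128, 20))

total = sum(layer_sizes.values())
print(f"neuron coverage: {len(covered)}/{total} = {len(covered) / total:.2%}")
```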
- NVBitFI: Dynamic Fault Injection for GPUs, DSN 2021 [Kernels] — Injects faults into GPU binaries to evaluate resilience and error propagation in DL workloads.
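NVBitFI instruments GPU binaries at the SASS level, which cannot be reproduced in a few lines. The sketch below only illustrates the fault model such injectors emulate: flip a single bit in one weight of an assumed toy layer and measure how far the output silently drifts.

```python
# Software-level stand-in for a fault injector: flip one bit of one weight and
# measure the silent output deviation. Illustrative only; NVBitFI instead
# instruments GPU binaries and injects faults during kernel execution.
import struct
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4).eval()
x = torch.randn(1, 16)

with torch.no_grad():
    clean = model(x).clone()

    # Flip bit 23 (lowest exponent bit of an IEEE-754 float32) in weight [0, 0].
    raw = struct.unpack("<I", struct.pack("<f", model.weight[0, 0].item()))[0]
    model.weight[0, 0] = struct.unpack("<f", struct.pack("<I", raw ^ (1 << 23)))[0]

    corrupted = model(x)

print("max abs output deviation after bit flip:",
      (clean - corrupted).abs().max().item())
```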
- Anthropic: A Postmortem of Three Recent Issues — Anthropic's account of three overlapping infrastructure bugs that degraded Claude's responses in August and early September 2025, with the detection gaps and mitigations that followed.