- Background
- High-Level View
- Empirical Studies
- Silent Errors
- Distributed Training
- Diagnosis
- Code Bug Testing
- Monitoring
- Model Behavior Testing
- Fault Injection Tools
- Industry Post Mortems
- MLSys: The New Frontier of Machine Learning Systems — Position paper outlining the co-design challenges/opportunities across ML, systems, and hardware.
- AI Engineering Quick Start — Practical guide to end-to-end AI engineering workflows and best practices.
- Machine Learning Testing: Survey, Landscapes and Horizons, TSE 2020 — Comprehensive survey of ML testing techniques, tools, and open challenges.
- Hidden Technical Debt in Machine Learning Systems, NeurIPS 2015 — Seminal discussion of non-obvious maintenance costs and systemic risks in ML systems.
- A First Look at Bugs in LLM Inference Engines, arXiv 2025 [Inference] — Early taxonomy and root causes of bugs in LLM inference engines across open-source stacks.
- Towards Understanding Bugs in Distributed Training and Inference Frameworks for Large Language Models, arXiv 2025 [Training] [Inference] — Characterizes bug patterns in distributed LLM frameworks and offers mitigation guidance.
- Characterization of Large Language Model Development in the Datacenter, NSDI 2024 [Training] — Cluster-scale study of LLM development workloads and bottlenecks at Shanghai AI Lab. Review
- An Empirical Study on Low GPU Utilization of Deep Learning Jobs, ICSE 2024 [Training] — Identifies causes of low GPU utilization in DL jobs and practical optimizations.
- Toward Understanding Deep Learning Framework Bugs, TOSEM 2023 [Kernels] — Analyzes bug types, triggers, and impact across major DL frameworks.
- Are Machine Learning Cloud APIs Used Correctly?, ICSE 2021 [Inference] — Studies real-world misuse patterns of ML cloud APIs and their consequences.
- A Comprehensive Empirical Study on Bug Characteristics of Deep Learning Frameworks, IST 2021 [Kernels] — Large-scale analysis of DL framework bug reports to extract categories and trends.
- An Empirical Study on Program Failures of Deep Learning Jobs, ICSE 2020 [Training] — Characterizes failure modes in production DL jobs at Microsoft, focusing on exception-throwing failures.
- An Empirical Study of Common Challenges in Developing Deep Learning Applications, IEEE Software 2020 [Training] [Inference] — Surveys and categorizes practical challenges in building correct and accurate DL apps.
- Taxonomy of Real Faults in Deep Learning Systems, ICSE 2019 [Training] [Inference] — Builds a taxonomy of real-world DL faults with testing implications.
- Understanding Silent Data Corruption in LLM Training, arXiv 2025 [Training] — Amazon's study of SDC causes, manifestations, and detection challenges in LLM training.
- Silent Errors in Large-scale LLM Training: Challenges and Lessons Learned, 2025 [Training] — NVIDIA experience report on prevalence, sources, and mitigations for silent training errors.
- Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks, OSDI 2025 [Training] — TrainCheck automatically infers training invariants and proactively flags silent correctness errors.
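The TrainCheck entry above infers training invariants automatically; as a point of reference, the sketch below hand-codes three invariants that silent training errors commonly violate (finite loss, finite gradients, parameters that actually move after an update). It is illustrative only and uses an assumed toy model, not TrainCheck's API or inference machinery.

```python
# Hand-written invariant checks for a toy training loop.
# Illustrative only: not TrainCheck's API, just the kind of property it infers.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    x = torch.randn(32, 16)
    y = torch.randint(0, 4, (32,))
    opt.zero_grad()
    loss = loss_fn(model(x), y)

    # Invariant 1: the loss stays finite.
    assert torch.isfinite(loss), f"non-finite loss at step {step}"

    before = [p.detach().clone() for p in model.parameters()]
    loss.backward()

    # Invariant 2: gradients exist and are finite.
    for p in model.parameters():
        assert p.grad is not None and torch.isfinite(p.grad).all(), \
            f"bad gradient at step {step}"

    opt.step()

    # Invariant 3: every parameter actually moves after the update;
    # a frozen tensor often signals a silently broken update path.
    for prev, p in zip(before, model.parameters()):
        assert not torch.equal(prev, p.detach()), f"stale parameter at step {step}"

print("all invariants held for 10 steps")
```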
- XPUTIMER: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale, arXiv 2025 [Training] — Real-time anomaly diagnostics tailored for large-scale distributed LLM training.
- TrainVerify: Equivalence-Based Verification for Distributed LLM Training, SOSP 2025 [Training] — Verifies distributed training by checking equivalence against a reference to catch silent errors.
- TTrace: Lightweight Error Checking and Diagnosis for Distributed Training, arXiv 2025 [Training] — Low-overhead tracing to detect and localize errors in distributed training.
- Defeating Nondeterminism in LLM Inference, Blog 2025 [Inference] [Kernels] — Thinking Machines blog on making inference kernels "batch-invariant" so that per-request results do not depend on batch composition; the property is illustrated in the sketch below.
- PerfTracker: Online Performance Troubleshooting for Large-scale Model Training in Production, arXiv 2025 [Training] — Alibaba Cloud system for online localization of training performance bottlenecks and regressions.
- Emerging Platforms Meet Emerging LLMs: A Year-Long Journey of Top-Down Development, arXiv 2024 [Training] — Case study on co-designing software and hardware platforms to support rapidly evolving LLMs.
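The batch invariance discussed in the Defeating Nondeterminism entry above can be checked directly: the result for one example should not depend on what else was in the batch. The sketch below is a toy PyTorch check on an assumed small MLP, not the blog's kernel-level fix; real serving stacks compare bitwise against the production kernels.

```python
# Toy batch-invariance check: per-example outputs should not depend on batching.
# Illustrative only; the blog's fix is at the kernel level, not shown here.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 8)).eval()
x = torch.randn(16, 64)

with torch.no_grad():
    batched = model(x)                                            # full batch at once
    singles = torch.cat([model(x[i:i + 1]) for i in range(16)])   # one example at a time

# Strict batch invariance means bitwise-identical results; on many backends only
# a tolerance-based comparison passes, which is exactly the nondeterminism at issue.
print("bitwise identical:", torch.equal(batched, singles))
print("max abs difference:", (batched - singles).abs().max().item())
```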
- Debugging Machine Learning Pipelines, DEEM 2019 [Training] — Uses decision trees over historical runs to localize ML pipeline performance anomalies. Review
- CRADLE: Cross-Backend Validation to Detect and Localize Bugs in Deep Learning Libraries, ICSE 2019 [Kernels] — Differential testing across DL backends to expose and localize inconsistencies; the core check is sketched below.
- A Static Analyzer for Detecting Tensor Shape Errors in Deep Neural Network Training Code, ICSE 2022 [Training] — Abstract-interpretation-based static analysis to detect tensor shape errors. Review
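CRADLE's core idea, running the same model on multiple backends and flagging disagreements, is easy to picture in miniature. The sketch below is not CRADLE itself (which targets Keras backends); it compares CPU and CUDA outputs of an assumed toy PyTorch model as a stand-in for two backends.

```python
# Differential test of one model across two backends (CPU vs CUDA).
# Illustrative only: CRADLE compares Keras backends and additionally localizes
# the first layer at which hidden states diverge.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 32 * 32, 10),
).eval()
x = torch.randn(4, 3, 32, 32)

with torch.no_grad():
    ref = model(x)  # backend A: CPU
    if torch.cuda.is_available():
        out = model.to("cuda")(x.to("cuda")).cpu()  # backend B: CUDA
        # Small float drift is expected; large disagreement points at a backend bug.
        print("max abs diff CPU vs CUDA:", (ref - out).abs().max().item())
        assert torch.allclose(ref, out, atol=1e-4, rtol=1e-4), "backends diverge"
    else:
        print("CUDA unavailable; skipping the differential check")
```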
- AutoTrainer: An Automatic DNN Training Problem Detection and Repair System, ICSE 2021 [Training] — Detects training issues and applies automated repairs to improve convergence.
- Reliability Assurance for Deep Neural Network Architectures against Numerical Defects, ICSE 2023 [Kernels] — Identifies and mitigates numerical instability (e.g., NaNs/overflow) in DNN computation.
- NeuRI: Diversifying DNN Generation via Inductive Rule Inference, FSE 2023 [Kernels] — Generates diverse DNNs via learned transformation rules to boost test coverage.
- NNSmith: Generating Diverse and Valid Test Cases for Deep Learning Compilers, ASPLOS 2023 [Kernels] — Synthesizes semantically valid models to stress and validate DL compilers.
- Fuzzing Automatic Differentiation in Deep-Learning Libraries, ICSE 2023 [Kernels] — Fuzzes autodiff implementations to reveal gradient calculation bugs; a finite-difference version of this oracle is sketched below.
- Fuzzing Deep-Learning Libraries via Automated Relational API Inference, FSE 2022 [Kernels] — Infers API relations to generate relational checks that uncover defects.
- Free Lunch for Testing: Fuzzing Deep-Learning Libraries from Open Source, ICSE 2022 [Kernels] — Leverages OSS artifacts to derive oracles and fuzz tests for DL libraries.
- Automated Testing of Software that Uses Machine Learning APIs, ICSE 2022 [Inference] — Techniques for testing applications that integrate ML APIs and handling API misuses.
- Keeper: Automated Testing and Fixing of Machine Learning Software, TOSEM 2024 [Training] — End-to-end system to automatically generate tests and propose fixes for ML code.
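The fuzzing papers above all need an oracle that says when a library result is wrong. For the autodiff work, one such oracle is agreement between analytical and finite-difference gradients on random inputs. The sketch below shows that check on `torch.sinh` as an assumed example target; it is not the papers' tooling.

```python
# Fuzz-style oracle for automatic differentiation: on random inputs, the
# framework's analytical gradient should match a finite-difference estimate.
# Illustrative only; torch.sinh is just an assumed target operator.
import torch

def finite_diff(f, x, eps=1e-3):
    # Central difference for an elementwise function.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for trial in range(100):
    x = torch.randn(8, dtype=torch.double, requires_grad=True)
    (analytical,) = torch.autograd.grad(torch.sinh(x).sum(), x)
    numerical = finite_diff(torch.sinh, x.detach())
    # Disagreement beyond tolerance would flag a suspicious gradient rule.
    assert torch.allclose(analytical, numerical, atol=1e-4, rtol=1e-4), \
        f"gradient mismatch on trial {trial}"

print("autodiff agrees with finite differences on 100 random inputs")
```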
- Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks, OSDI 2025 [Training] — Proactive runtime monitoring via inferred invariants to detect silent training errors.
- Self-Checking Deep Neural Networks in Deployment, ICSE 2021 [Inference] — Embeds runtime checks to detect anomalies and trigger self-tests during inference.
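Both monitoring entries above come down to attaching cheap checks to a model that is already running. The sketch below shows only the mechanics, forward hooks that flag non-finite or implausibly large activations on an assumed toy model; the learned per-layer checks used in the self-checking paper are not reproduced here.

```python
# Runtime self-checks attached to a deployed model via forward hooks.
# Illustrative only: flags non-finite or out-of-range activations; the paper's
# learned checks over internal features are not reproduced.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
alerts = []

def make_check(name, max_abs=1e4):
    def hook(module, inputs, output):
        if not torch.isfinite(output).all():
            alerts.append(f"{name}: non-finite activation")
        elif output.abs().max().item() > max_abs:
            alerts.append(f"{name}: activation magnitude {output.abs().max().item():.1e}")
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_forward_hook(make_check(name))

with torch.no_grad():
    model(torch.randn(4, 32))

print("alerts:", alerts if alerts else "none")
```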
- DeepXplore: Automated Whitebox Testing of Deep Learning Systems, SOSP 2017 [Inference] — Introduces neuron coverage and differential testing to generate inputs and expose discrepancies; the coverage metric is sketched below.
- Oracle Issues in Machine Learning and Where to Find Them, ICSEW 2020 [Data] — Identifies and detects issues in ML oracles/labels using entropy and semantic analysis. Review
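Neuron coverage, introduced in the DeepXplore entry above, is the fraction of neurons whose scaled activation exceeds a threshold on at least one test input. The sketch below computes a simplified version for an assumed toy MLP using forward hooks; DeepXplore additionally optimizes inputs jointly to raise coverage and expose cross-model disagreements, which is omitted here.

```python
# Simplified neuron coverage on a toy MLP: fraction of neurons whose scaled
# activation exceeds a threshold on at least one input.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 50), nn.ReLU(),
                      nn.Linear(50, 30), nn.ReLU(),
                      nn.Linear(30, 10)).eval()
threshold = 0.25
covered, layer_sizes = set(), {}

def track(layer_idx):
    def hook(module, inputs, output):
        flat = output.flatten(1)
        layer_sizes[layer_idx] = flat.shape[1]
        # Scale activations to [0, 1] per sample before thresholding.
        lo = flat.min(dim=1, keepdim=True).values
        hi = flat.max(dim=1, keepdim=True).values
        scaled = (flat - lo) / (hi - lo + 1e-8)
        for j in torch.nonzero((scaled > threshold).any(dim=0)).flatten().tolist():
            covered.add((layer_idx, j))
    return hook

for idx, module in enumerate(model):
    if isinstance(module, nn.ReLU):
        module.register_forward_hook(track(idx))

with torch.no_grad():
    model(torch.randn(128, 20))

total = sum(layer_sizes.values())
print(f"neuron coverage: {len(covered)}/{total} = {len(covered) / total:.2%}")
```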
- NVBitFI: Dynamic Fault Injection for GPUs, DSN 2021 [Kernels] — Injects faults into GPU binaries to evaluate resilience and error propagation in DL workloads.
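NVBitFI instruments GPU binaries at the SASS level, which cannot be reproduced in a few lines. The sketch below only illustrates the fault model such injectors emulate: flip a single bit in one weight of an assumed toy layer and measure how far the output silently drifts.

```python
# Software-level stand-in for a fault injector: flip one bit of one weight and
# measure the silent output deviation. Illustrative only; NVBitFI instead
# instruments GPU binaries and injects faults during kernel execution.
import struct
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4).eval()
x = torch.randn(1, 16)

with torch.no_grad():
    clean = model(x).clone()

    # Flip bit 23 (lowest exponent bit of an IEEE-754 float32) in weight [0, 0].
    raw = struct.unpack("<I", struct.pack("<f", model.weight[0, 0].item()))[0]
    model.weight[0, 0] = struct.unpack("<f", struct.pack("<I", raw ^ (1 << 23)))[0]

    corrupted = model(x)

print("max abs output deviation after bit flip:",
      (clean - corrupted).abs().max().item())
```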
- Anthropic: A Postmortem of Three Recent Issues — Anthropic's account of three overlapping infrastructure bugs that degraded Claude's responses in August and early September 2025, with the detection gaps and mitigations that followed.