sderosiaux/performance-engineering-handbook

Performance Engineering Handbook

Performance engineering, debugging, and tuning across the full stack: the Linux kernel and CPU microarchitecture, profiling and tracing, language runtimes, storage engines, databases, distributed systems, and the data structures that decide whether any of it is fast.

It started as a Linux performance toolkit. It now spans ~50 chapters and guides, from perf and eBPF up to vector indexes, LSM compaction, and HFT C++.

The Linux chapters target kernel 6.6+ (EEVDF scheduler, modern eBPF). The data-structure, database, and architecture chapters are OS-agnostic.

Last Updated: 2026-06

How to Use This Handbook

  1. Start with the 60-second checklist (see Linux Quick Start) for quick triage
  2. Use the Navigation below, grouped into six parts, to find a topic
  3. Check the cheatsheets for copy-paste commands
  4. Refer to curated sources for deep dives on specific areas

For investigation workflow:

Symptom → 60-Second Analysis → Identify resource bottleneck → Deep dive chapter
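
One way to sketch the "identify resource bottleneck" step is to rank CPU, memory, and I/O by pressure stall information (PSI). This assumes a kernel built with PSI support (4.20+, exposed under /proc/pressure). `psi_rank` is a hypothetical helper name, and the `avg10` value is only one signal, not a verdict.

```shell
#!/usr/bin/env bash
# Sketch: rank resources by the PSI "some" avg10 value, i.e. the percentage of
# the last 10 seconds in which at least one task was stalled on that resource.
# psi_rank is a hypothetical helper; the directory argument exists so the
# function can be pointed at a copy of the files.
psi_rank() {
  local dir="${1:-/proc/pressure}" res
  for res in cpu memory io; do
    [ -r "$dir/$res" ] || continue
    # Line format: some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    awk -v res="$res" '$1 == "some" { sub(/^avg10=/, "", $2); print $2, res }' "$dir/$res"
  done | LC_ALL=C sort -rn
}
```

Calling `psi_rank` with no argument prints one line per resource, highest pressure first. A resource near 0 is unlikely to be the bottleneck, which narrows down the deep-dive chapter to open next.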

Before reaching for any tool, read 00b - Observability Boundaries. It encodes the single most important principle in this handbook: every layer's metrics lie by omission about the layer below it. A 99% PostgreSQL buffer-hit ratio can coexist with a saturated NVMe; a JVM with a 2 GB heap can be OOM-killed at 8 GB RSS; a container at 80% memory can be 50% reclaimable cache. If you skip this chapter, you will trust one layer and miss the truth in the next.
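
To make the container example concrete, here is a minimal sketch that splits a cgroup v2 memory charge into anonymous memory and file cache. It assumes cgroup v2 with `memory.current` and `memory.stat` readable in the given directory (the root cgroup on a host has no `memory.current`). `cgroup_mem_breakdown` is a hypothetical helper name. Treating all `file` bytes as reclaimable is an approximation, since dirty or actively mapped pages are not free to drop.

```shell
#!/usr/bin/env bash
# Sketch: how much of a cgroup v2 memory charge is page cache?
# Assumption: $dir contains memory.current and memory.stat (cgroup v2 layout).
# cgroup_mem_breakdown is a hypothetical helper name, not a standard tool.
cgroup_mem_breakdown() {
  local dir="${1:-/sys/fs/cgroup}" current
  [ -r "$dir/memory.current" ] && [ -r "$dir/memory.stat" ] || {
    echo "no cgroup v2 memory files in $dir" >&2
    return 1
  }
  current=$(cat "$dir/memory.current")
  # memory.stat holds "key value" pairs in bytes; "anon" and "file" are the
  # two fields compared here.
  awk -v cur="$current" '
    $1 == "anon" { anon = $2 }
    $1 == "file" { file = $2 }
    END {
      if (cur > 0)
        printf "current=%.0f anon=%.0f file=%.0f file_pct=%.0f\n", cur, anon, file, int(file * 100 / cur)
    }' "$dir/memory.stat"
}
```

Inside a container, `cgroup_mem_breakdown /sys/fs/cgroup` is usually the right call. A high `file_pct` suggests the "80% memory" figure overstates real pressure, which is the kind of omission the chapter warns about.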

Navigation

Part I — Triage & Methodology

Where to start, and how to not fool yourself.

| Chapter | Description |
|---|---|
| 00 - Troubleshooting Framework | USE/RED methods, 60-second checklist, decision trees |
| 00b - Observability Boundaries | Read first. Layer stack, "stats lie" anti-patterns, triangulation pattern, tool-by-layer cheat sheet |
| Bryan Cantrill Debugging Methodology | Questions-first, state preservation, systematic elimination |

Part II — Linux Systems Performance

Kernel, CPU, memory, disk, network: the original core.

| Chapter | Description |
|---|---|
| 01 - Modern CLI Replacements | Rust/Go tools replacing classic Unix utils |
| 02 - System Monitoring | CPU, memory, process monitoring |
| 03 - Network Analysis | Traffic analysis, DNS, HTTP testing |
| 04 - Disk & Storage | I/O benchmarking, filesystem tools |
| 05 - Performance Profiling | perf, flame graphs, profilers |
| 06 - eBPF & Tracing | BCC tools, bpftrace, ftrace, sched_ext |
| 07 - Containers & K8s | Docker, Kubernetes debugging |
| 08 - Kernel Tuning | sysctl, EEVDF scheduler, memory, I/O |
| 09 - Network Tuning | TCP, BBR, io_uring, NUMA networking |
| 11 - GPU & HPC | GPU profiling, MIG, HPC tracing |
| 12 - Observability & Metrics | Prometheus, Grafana, OpenTelemetry |
| 15 - Memory Subsystem | NUMA, huge pages, memory profiling |
| 16 - Scheduler & Interrupts | CPU scheduling, context switching |
| 17 - Ftrace Production | Function tracing, trace-cmd |
| 18 - VDSO & Clock Source | Time syscalls, TSC, cloud VM performance |
| 18 - Off-CPU Analysis | Wall-clock profiling, blocking detection, load-scaling bottlenecks |
| eBPF Performance Overhead | Hook overhead, map types, production deployment |
| Container Debugging Patterns | cAdvisor, GOMAXPROCS, PSI, cgroup v2 debugging |
| Scheduler Debugging Deep Dive | CFS bugs, run-queue attribution, Perfetto, noisy-neighbor detection |
| TCP Edge Cases & Load Balancers | SYN retry, LB buffering, timeout hierarchies |

Part III — Runtimes & Latency

JVM behavior, and how tail latency hides from you.

| Chapter | Description |
|---|---|
| 10 - Java/JVM | JVM profiling and tuning (ZGC, async-profiler, GC analysis) |
| 13 - Latency Analysis | Tail latency, coordinated omission, P99 |
| Coordinated Omission Guide | Load-testing correctness, wrk2, timestamp injection |

Part IV — Databases & Storage Engines

Query plans, storage internals, and what breaks in production.

| Chapter | Description |
|---|---|
| 14 - Database Profiling | PostgreSQL, MySQL, query optimization, cross-layer PG profiling |
| Database Production Debugging | Hot partitions, cache pollution, admission control, lock analysis |
| 19 - Storage Engine Patterns | LMDB, RocksDB, LSM, columnar, vectorized engines |
| 23 - Database Scaling | Sharded ID generation, zero-downtime reshard, connection pools, Trino |
| 35 - LSM Compaction Strategies | Leveled/STCS/Universal/FIFO/TWCS, Dostoevsky, Monkey, tombstones, RUM conjecture |

Part V — Distributed Systems & Data Architectures

Messaging, streaming, caching, and keeping state consistent across nodes.

| Chapter | Description |
|---|---|
| 20 - Resilience Patterns | Circuit breakers, bulkheads, retry/backoff tuning, timeout budgets |
| 21 - Caching Patterns | Redis memory encoding, stampede protection, Redis→MySQL/Kafka migrations |
| 22 - Kafka & Messaging | Partitioning, replication, producer/broker config, scaling at volume |
| 24 - Real-time Analytics Architectures | Uber AresDB/AthenaX, streaming SQL, RisingWave, distributed time lineage |
| 25 - Big Data & ML Platforms | Lakehouse evolution, distributed training, inference throughput, billion-scale vector search |
| CRDTs: Lock-Free Distributed State | G-Counter, PN-Counter, OR-Set, LWW-Register with delta sync |

Part VI — Low-Latency Engineering & Performance Data Structures

The structures and tricks behind fast databases, search, and trading systems.

| Chapter | Description |
|---|---|
| 26 - C++ HFT Optimization Patterns | SwissTable/F14, branchless, hugepages, FastQueue, kernel bypass, 30 LLM-actionable levers |
| 27 - Compact Integer Sets | Roaring bitmaps, Elias-Fano/PEF, Bloom/Cuckoo/XOR/Binary Fuse, ART; decision matrix + heuristics |
| 28 - Probabilistic Sketches | HLL/UltraLogLog, Count-Min, t-digest/DDSketch/KLL, Theta, MinHash/LSH; cardinality, frequency, quantile |
| 29 - Vector ANN Indexes | HNSW, IVF-PQ, DiskANN, SCANN, filtered ANN, hybrid search; recall vs latency, pgvector/FAISS/Qdrant |
| 30 - Hash Tables at Scale | SwissTable/F14/hashbrown, Robin Hood, Cuckoo, MPHF/PTHash/RecSplit, hash DoS, concurrent maps |
| 31 - Columnar Encoding Cookbook | Dict, RLE, FOR/PFOR, Gorilla/Chimp/ALP, FSST, Zstd dict; Parquet/ORC/Arrow/ClickHouse |
| 32 - Ordered/Range/Spatial Structures | Skip list, Bw-tree, Fenwick, segment tree, BKD, R-tree, H3/S2, space-filling curves |
| 33 - Compressed Strings & Tries | FST, FSST, Patricia, DAWG, front-coding, FM-index, wavelet trees, LOUDS; Lucene/Tantivy/DuckDB |
| 34 - Learned Indexes | RMI, PGM-index, ALEX, learned Bloom; honest 2026 verdict vs B-tree/ART |

Cheatsheets

| Cheatsheet | Description |
|---|---|
| One-Liners | Quick diagnostic commands by problem type |
| Sysctl Reference | Key kernel parameters |
| VDSO/Clock Troubleshooting | Quick detection and fixes for time-related performance |

Linux Quick Start

60-Second Analysis

From Brendan Gregg's Linux Performance Analysis in 60,000 Milliseconds:

uptime                           # load averages
dmesg | tail                     # kernel errors
vmstat 1                         # system-wide stats
mpstat -P ALL 1                  # CPU balance
pidstat 1                        # process CPU
iostat -xz 1                     # disk I/O
free -m                          # memory usage
sar -n DEV 1                     # network I/O
sar -n TCP,ETCP 1                # TCP stats
top                              # overview
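
Several of these commands run until interrupted, so a scripted pass needs bounded sample counts. The following is a minimal sketch under that assumption. `run_step` and `sixty_second_checklist` are hypothetical helper names, and the sample count of 5 is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch: run the checklist non-interactively with bounded samples.
# run_step and sixty_second_checklist are hypothetical helper names.

# Run one command with a header; skip it cleanly if the tool is missing.
run_step() {
  local tool="$1"
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "SKIP: $tool not installed"
    return 0
  fi
  echo "=== $* ==="
  "$@" 2>&1 || echo "WARN: $tool exited with status $?"
}

sixty_second_checklist() {
  run_step uptime
  run_step sh -c 'dmesg | tail'          # may need root if dmesg is restricted
  run_step vmstat 1 5                    # 5 one-second samples, then exit
  run_step mpstat -P ALL 1 5
  run_step pidstat 1 5
  run_step iostat -xz 1 5
  run_step free -m
  run_step sar -n DEV 1 5
  run_step sar -n TCP,ETCP 1 5
  run_step sh -c 'top -b -n 1 | head -n 20'   # batch mode, single iteration
}
```

Running `sixty_second_checklist > triage.txt` captures a snapshot that can be attached to an incident before anything is restarted.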

Classic -> Modern Replacements

| Classic | Modern | Why |
|---|---|---|
| ls | eza | Git status, icons, tree view |
| cat | bat | Syntax highlighting |
| find | fd | 5x faster, simpler syntax |
| grep | ripgrep | 10x faster, .gitignore aware |
| du | dust | Visual bars |
| df | duf | Clean tables |
| top | btop | Dashboard UI |
| dig | dog/doggo | DoH/DoT, colors |
| curl | xh | Human-friendly HTTP |
| sed | sd | Sane regex |
| cd | zoxide | Frecency-based jump |

Performance Stack

Application  ->  async-profiler (Java), py-spy (Python), rbspy (Ruby)
     |
Userspace    ->  perf, valgrind, heaptrack
     |
Syscalls     ->  strace, ltrace
     |
Kernel       ->  eBPF/BCC, bpftrace, ftrace
     |
Hardware     ->  perf stat, turbostat, intel_gpu_top

Network Stack

L7 (HTTP)    ->  httpie, xh, hey, wrk, k6, vegeta
     |
L4 (TCP)     ->  ss, netstat, tcpdump, termshark
     |
L3 (IP)      ->  mtr, traceroute, ping, gping
     |
L2 (Link)    ->  ethtool, ip link

Quick Install (Debian/Ubuntu)

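# Modern CLI tools (some packages, such as eza, exist only in recent releases)
# Note: Debian/Ubuntu install the bat binary as `batcat` and fd as `fdfind`
sudo apt install ripgrep fd-find bat eza fzf btop git-delta zoxide duf gping

# Performance tools (linux-tools-* are Ubuntu package names; on Debian, perf ships in linux-perf)
sudo apt install linux-tools-common "linux-tools-$(uname -r)" bpfcc-tools bpftrace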

# Network tools
sudo apt install mtr-tiny tcpdump nmap iperf3 netcat-openbsd

# Monitoring
sudo apt install sysstat htop iotop

Version Requirements

| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| Linux Kernel | 5.15 | 6.6+ | EEVDF scheduler, modern eBPF |
| bpftrace | 0.16 | 0.20+ | BTF support, newer features |
| bcc-tools | 0.25 | 0.28+ | Latest BPF features |
| perf | matches kernel | - | Install linux-tools-$(uname -r) |
| iproute2 | 6.0 | 6.7+ | netkit, newer tc features |

Key kernel features by version:

| Version | Feature |
|---|---|
| 5.15+ | io_uring maturity, BTF by default |
| 6.1+ | MGLRU, improved memory management |
| 6.4+ | Per-VMA locks, reduced mmap contention |
| 6.6+ | EEVDF scheduler (replaces CFS) |
| 6.7+ | Netkit stable |
| 6.9+ | BPF Arena, new kfuncs |
| 6.12+ | sched_ext merged, PREEMPT_RT mainline |
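
The version table can be checked against a running host with a short script. This is a sketch, not a definitive test: it compares only the major.minor part of `uname -r`, and distribution kernels may backport features or compile them out, so a "yes" means "new enough upstream", not "enabled". `kernel_at_least` and `report_kernel_features` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: map a kernel release string onto the feature table above.
# kernel_at_least and report_kernel_features are hypothetical helper names.

# kernel_at_least RELEASE MAJOR MINOR -> exit 0 if RELEASE >= MAJOR.MINOR
kernel_at_least() {
  local release="$1" want_major="$2" want_minor="$3" major minor
  major="${release%%.*}"        # "6.8.0-45-generic" -> "6"
  minor="${release#*.}"         # -> "8.0-45-generic"
  minor="${minor%%[!0-9]*}"     # -> "8"
  case "$major" in ''|*[!0-9]*) return 2 ;; esac
  case "$minor" in ''|*[!0-9]*) return 2 ;; esac
  [ "$major" -gt "$want_major" ] ||
    { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

# Print yes/no per table row for the given release (default: running kernel).
report_kernel_features() {
  local release="${1:-$(uname -r)}" ver feature
  while IFS='|' read -r ver feature; do
    if kernel_at_least "$release" "${ver%%.*}" "${ver#*.}"; then
      echo "yes  $ver+  $feature"
    else
      echo "no   $ver+  $feature"
    fi
  done <<'EOF'
5.15|io_uring maturity, BTF by default
6.1|MGLRU, improved memory management
6.4|Per-VMA locks, reduced mmap contention
6.6|EEVDF scheduler (replaces CFS)
6.7|Netkit stable
6.9|BPF Arena, new kfuncs
6.12|sched_ext merged, PREEMPT_RT mainline
EOF
}
```

Calling `report_kernel_features` with no argument inspects the running kernel; passing a release string such as `6.6.0` checks a host you are not logged in to.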

Curated Sources

Essential Reading

| Source | Focus | Link |
|---|---|---|
| Brendan Gregg | Performance methodology, eBPF | brendangregg.com |
| Julia Evans | Debugging, Linux internals | jvns.ca |
| Netflix Tech Blog | Production performance | netflixtechblog.com |
| Cloudflare Blog | Network performance, eBPF | blog.cloudflare.com |
| Meta Engineering | eBPF at scale, kernel | engineering.fb.com |
| Dan Luu | Systems analysis, measurement | danluu.com |

In This Repository

Curated extracts from these sources with actionable insights:

Tools & References

Resources
