Performance is not a number!
Performance-oriented library that combines the power of: c++23, linux/perf, llvm/mca, intel/pt, gnuplot, terminal/sixel, ...
CPU Benchmarking, Profiling, Tracing, Analysis
Optimal (recommended)
- clang-19+, gcc-13+ / c++23+
- llvm-dev-19+ - apt-get install llvm-dev
- linux-6.x+ (perf_event_open) - apt-get install linux-tools-common
- intel-12th+ with tpebs, ipt support
- intel_pt - apt-get install libipt-dev
- terminal with sixel support
- gnuplot - apt-get install gnuplot
Optional
Profiling
- linux-perf - apt-get install linux-tools-common
- llvm-xray - apt-get install llvm
- gperftools - apt-get install google-perftools
- intel-vtune - apt-get install intel-oneapi-vtune
- callgrind - apt-get install valgrind
Data analysis
- jupyter notebook - apt-get install jupyter
- perfetto - https://perfetto.dev
Sharing analysis - https://gist.github.com
- gh - apt-get install gh
Supports timing, profiling, tracing, analyzing, plotting, testing
- time (tsc, steady_clock, cpu, thread, real, monotonic)
- stat (perf_event_open)
- record (perf_event_open)
- trace (perf_event_open/intel_pt)
- mc/mca (llvm: disassemble, llvm-mca: resource_pressure, timeline, bottleneck)
- prof (linux-perf, llvm-xray, gperftools, intel-vtune, callgrind)
- plot (gnuplot: hist, bar, box, ecdf, line)
Strives for simplicity, flexibility, accuracy
Single Header/Module - https://github.com/qlibs/perf/blob/main/perf https://github.com/qlibs/perf/blob/main/perf.cppm
API/bench

    import perf;

    int main() {
      perf::runner bench{perf::bench::latency{}}; // what and how

      auto fizz_buzz = [](int n) {
        if (n % 15 == 0) {
          return "FizzBuzz";
        } else if (n % 3 == 0) {
          return "Fizz";
        } else if (n % 5 == 0) {
          return "Buzz";
        } else {
          return "Unknown";
        }
      };

      bench(fizz_buzz, 1);
      bench(fizz_buzz, 3);
      bench(fizz_buzz, perf::data::repeat<int>{{3,5,15}});
      bench(fizz_buzz, perf::data::uniform<int>{.min = 0, .max = 15});

      perf::report(bench[perf::time::steady_clock]); // per benchmark
      perf::annotate(bench[perf::mc::assembly]);     // per instruction
      perf::plot::bar(bench[perf::stat::cycles]);    // hist, bar, box, ecdf, line
    }

See "How does it work?" for more details.
Enables dev, research workflows
- dev: edit -> compile -> run -> analyze (c++ only)
- research: edit -> compile -> run -> save(json) -> notebook/perfetto -> analyze (c++/python/browser)
Performs self-verification upon compilation
- compile-time tests are executed upon include/import (enabled by default)
- run-time/sanity check tests can be executed by `int main() { perf::test::run({.verbose = true}); }`
- compile-time/run-time tests can be disabled with -DNTEST (not recommended)
Try it!

    docker build -t perf .
    docker run --rm --privileged -v ${PWD}:${PWD} -w ${PWD} \
      clang++-20 -std=c++23 -O3 --precompile perf.cppm
    gh gist create --public --web hello_world.cpp
Build & Test

local env

build dependencies (optional)

    apt-get install linux-tools-common
    apt-get install llvm-dev
    apt-get install libipt-dev
    apt-get install gnuplot
    pip3 install pyperf

enable linux tracing (optional)

    .github/scripts/setup.sh --perf

tune machine for benchmarking (recommended)

    .github/scripts/tune.sh

import perf

    cat <<EOF > perf.cpp
    import perf;
    int main() {
      perf::test::run({.verbose = true});
    }
    EOF

build

    # perf tests run at compile-time upon inclusion/import and at run-time on start-up
    clang++-20 \
      -std=c++23 -O3 \
      -I. \
      -I/usr/lib/llvm-20/include \
      -lLLVM-20 \
      -lipt \
      -o perf \
      perf.cpp && \
    ./perf # runs run-time tests and sanity checks

build perf image

    wget https://raw.githubusercontent.com/qlibs/perf/refs/heads/main/.github/workflows/Dockerfile
    docker build -t perf .

import perf

    cat <<EOF > perf.cpp
    import perf;
    int main() {
      perf::test::run({.verbose = true});
    }
    EOF

build

    # perf tests run at compile-time upon inclusion/import and at run-time on start-up
    docker run \
      --rm \
      --privileged \
      -v ${PWD}:${PWD} \
      -w ${PWD} \
      perf \
      clang++-20 \
      -std=c++23 -O3 \
      -I. \
      -I/usr/lib/llvm-20/include \
      -lLLVM-20 \
      -lipt \
      -o perf \
      perf.cpp && \
    ./perf
Setup
Profiling
/Tracing
/Analysis
Retiring
Bad speculation
Frontend bound
Backend bound
Config
Macros
    /**
     * PERF version # https://semver.org
     */
    #define PERF (MAJOR, MINOR, PATCH) // ex. (1, 0, 0)

    /**
     * GNU # default: deduced based on `__GNUC__`
     * - 0 not compatible
     * - 1 compatible
     */
    #define PERF_GNU 0/1

    /**
     * Linux # default: deduced based on `__linux__` and `perf_event_open.h`
     * - 0 not supported
     * - 1 supported
     */
    #define PERF_LINUX 0/1

    /**
     * UEFI # default: 0
     * - 0 not supported
     * - 1 supported
     */
    #define PERF_UEFI 0/1

    /**
     * LLVM # default: deduced based on `llvm-dev` headers
     * - 0 not supported
     * - 1 supported
     */
    #define PERF_LLVM 0/1

    /**
     * Intel Processor Trace # default: deduced based on `<intel_pt.h>` header
     * - 0 not supported
     * - 1 supported
     */
    #define PERF_INTEL 0/1

    /**
     * tests # default: not-defined
     * - defined: disables all compile-time, run-time tests
     * - not-defined: compile-time tests executed, run-time tests via `test::run`
     */
    #define NTEST

Environment variables

    /**
     * gnuplot terminal # see `gnuplot -> set terminal` # default: 'sixel'
     * - 'sixel'                  # console image # https://www.arewesixelyet.com
     * - 'wxt'                    # popup window
     * - 'dumb size 150,25 ansi'  # console with colors
     * - 'dumb size 80,25'        # console
     */
    ENV:PERF_PLOT_TERM

    /**
     * style # default: dark
     * - light
     * - dark
     */
    ENV:PERF_PLOT_STYLE
Usage
How does it work?
unroll - show run
views - gnuplot via sixel
function - invoke
sampler, counter - perf_event_open
mca - llvm-dev.mca
Backward/type-safe deduction on what to bench based on usage
    bench(fn); // `time::cpu, stat::instructions (ipc), stat::cycles` will be benchmarked
    report(bench[stat::cycles, time::cpu]);
    annotate(bench[record::cycles]);
    plot(bench[stat::ipc]);

How to share results?
with https://gist.github.com

    apt-get install gh
    gh gist create perf.html --public | awk '{print "https://htmlpreview.github.io/?perf.html/raw"}'
Sharing options
- markdown (.md) - easiest to edit, but github doesn't render embedded images
- notebook (.ipynb) - still markdown based, but github renders embedded images
- html (.html) - best looking but the hardest to edit; github doesn't render html but https://htmlpreview.github.io does

How to use ascii based charts?
    # - 'dumb size 150,25 ansi' # console with colors
    # - 'dumb size 80,25' # console
    PERF_PLOT_TERM='dumb size 150,25 ansi' ./perf

How to use popup based charts?

    # - 'wxt' # popup window
    PERF_PLOT_TERM='wxt' ./perf

How to change plots style?

    PERF_PLOT_STYLE='light' ./perf
    PERF_PLOT_STYLE='dark' ./perf # default

What is prevent_elision and when is it needed?
An optimizing compiler may elide your code completely if it's not required (doesn't have side effects). prevent_elision will prevent that.

    verify(perf::compiler::is_elided([] { }));
    verify(perf::compiler::is_elided([] { auto value = 4 + 2; }));
    verify(perf::compiler::is_elided([] { int i{}; i++; }));
    verify(not perf::compiler::is_elided([&] { i++; }));
    verify(not perf::compiler::is_elided([] { static int i; i++; }));
    verify(not perf::compiler::is_elided([=] { int i{}; perf::compiler::prevent_elision(i++); }));

How to change assembly syntax from intel to at&t?

    perf::llvm llvm{.syntax = perf::arch::syntax::att};

    perf::bench::runner bench{
      {.syntax = perf::arch::syntax::att},
      perf::bench::latency,
    };
How to disassemble for different platform?
    perf::llvm llvm{.triple = "x86_64-pc-linux-gnu"}; // see `llvm-llc` for details

How to integrate with a testing framework?
- perf can be integrated with any unit testing framework - https://github.com/qlibs/ut

    import perf;
    import ut;

    int main() {
      "benchmark"_test = [] {
        // ...
      };
    }

How to save images?

    gnuplot save{{.term = "svg"}};
    save.send("set output 'output.svg'");
    bar(save, results);

How to use perf as a C++20 module?

    clang++-20 -std=c++23 -O3 -I. --precompile perf.cppm
    clang++-20 -std=c++23 -O3 -fprebuilt-module-path=. perf.pcm perf.cpp -lLLVM-18 -lipt

What is required to display images on the terminal?
- terminal with sixel support - https://www.arewesixelyet.com
note: Visual Studio Code supports images on terminal (requires enabling Terminal -> Enable images option)
How to plot on the server without sixel?
    PERF_PLOT_TERM='dumb size 80,25' ./a.out # see gnuplot - set terminal

How to write gnuplot charts?
sudo apt install gnuplot
- gnuplot documentation - http://www.gnuplot.info/documentation.html
- gnuplot demos - http://www.gnuplot.info/demo
- online gnuplot - https://gnuplot.io
How to write custom profiler?

    struct my_profiler {
      constexpr auto start() { }
      constexpr auto stop() { }
      [[nodiscard]] constexpr auto *operator() { }
    };

    static_assert(perf::prof_like<my_profiler>);
How to pollute cache, heap when benchmarking?
perf::memory::pollute_heap
perf::memory::flush_cacheline
Instrumentation with llvm-xray
    [[clang::xray_always_instrument]]
    void always_profile();

    [[clang::xray_always_instrument, clang::xray_log_args(1)]]
    void always_profile_and_log_i(int i);

    [[clang::xray_never_instrument]]
    void never_profile();

    # profiling threshold
    -fxray-instruction-threshold=1 # default 200 instructions

    # instrumentation info
    llvm-xray extract ./a.out --symbolize

Conditional profiling with callgrind

    prof::callgrind profiler{"example"};

    while (true) {
      profiler.start(); // resets profile
      if (should_trigger()) {
        trigger();
        profiler.stop();
        profiler.flush(); // dumps `example` profile
      }
    }

    kcachegrind callgrind.* # opens all profiles combined

How do perf tests work?

    #ifndef NTEST
    "demo"_suite = [] {
      "run-time and compile-time"_test = [] constexpr {
        expect(3 == sum({1, 2}));
      };

      "run-time only"_test = [] mutable {
        expect(std::rand() >= 0);
      };

      "compile-time only"_test = [] consteval {
        expect(sizeof(int) == sizeof(std::int32_t));
      };
    };
    #endif

How does perf compare to google.benchmark, nanobench, celero?
First, google.benchmark, nanobench, and celero are great, established libraries.
perf's philosophy is that performance is not a number, which leads to the following:
- statistical, data-driven inputs (to avoid overfitting to the branch predictor)
- instruction-level and function-level measurement
- latency and throughput
- analysis, profiling, and plotting
Performance
Common pitfalls?
- Understand
- what (latency, throughput),
- state (cold, warm, data distribution)
- how (timing, counters, profiling) to measure
- hardware (what env, os, setup, ...)
- ensure proper analysis (avoid premature conclusions)
- assert your expectations
- verify in the production-like use case/env
- Ensure the machine is tuned for micro-benchmarking (see #tuning)
- Use structured and diligent methods (see #top_down_analysis)
- Use realistic scenarios (see #setup)
- Use statistical methods for analysis (#stat)
- Analyze generated assembly (see #annotate/#mca)
- Visualize results (see #gnuplot)
- Document / share analysis (see #html)
- Automate expectations (see #testing)
- Enhance your understanding (see #performance)
- Verify changes in production like system (see #prof) - active-benchmarking
- Measure / verify consistently (see #json)
Latency and Throughput?
- latency is the time it takes for a single operation to complete (ns)
- throughput is the total number of operations or tasks completed in a given amount of time (op/s)
note: In single-threaded execution, if the algorithm can't be parallelized (where throughput would matter), you likely care about latency.
Performance compilation flags?
    -O1                      # optimizations (level1) [0]
    -O2                      # optimizations (level1 + level2) [0]
    -O3                      # [unsafe] optimizations (level1 + level2 + level3) [0]
    -DNDEBUG                 # disable asserts, etc.
    -march=native            # specifies architecture [1]
    -ffast-math              # [unsafe] faster but non-conforming math [2]
    -g                       # debug symbols
    -fno-omit-frame-pointer  # keep the frame pointer in a register for functions that don't need one
    -fcf-protection=none     # [unsafe] stops emitting `endbr64` # control-flow enforcement technology (cet)
Performance attributes?
[[gnu::target("avx2")]]
[[gnu::optimize("O3")]]
[[gnu::optimize("ffast-math")]]
note: gcc support is much more comprehensive than clang
When to finish micro-benchmarking?
- when the IPC you have reached won't get any better.
How to handle cases when there are no obvious bottlenecks?
- The most likely scenario is cache utilization.
How to profile in production?
- see prof API

What are different micro-benchmarking workflows?

[cpp] measure and analyze from cpp
pros: self-contained, easy iterations, supports unit testing, supports running on the server (PERF_PLOT_TERM="dumb")
cons: requires multiple runs for different compilers/flags
note: can be shared with export.sh and gh with https://gist.github

[cpp->json->notebook] measure with cpp and export json, analyze results in notebook
pros: can be unified across different compilers/flags and production runs, power of notebooks, easy to share, can be done offline
cons: multiple places, longer runs, harder to change the code, harder to run on remote servers, harder to unit test expectations, requires GUI

[cpp->json->CI] measure with cpp and use json for CI
- used for continuous benchmarking
What is top-down microarchitecture analysis method?
What is jupyter notebook and how to use it?
jupyter-notebook is great for analysis and for leaving a trace of it. However, C++ doesn't integrate well with jupyter-notebook (cling is often not an option), it doesn't work well with servers (requires copying data), and it's an additional file. perf's approach is to leave a trace of the benchmarking, automatically tested directly from C++, so that it can be revisited in the future and explains what was measured and why.

    # apt install jupyter
    jupyter notebook --ip 0.0.0.0 --no-browser notebook.ipynb
Workflows
dev
pros
- easy editing and integration with existing tooling
- output to the console (charts with sixel)
- easy to share via gh gist create # github can render markdown
- can assert and verify expectations
- can be executed in headless mode on the server without ui
cons
- harder analysis than in python
- single compiler workflow
research (run from jupyter notebook (produce json), analyze using python)
pros
- infinite ways of data analysis with great matplotlib support
- can run different compilers/options
- easy to share via gh gist create # github can render ipynb
- can be used on kaggle (avx512) or google colab
cons
- editing C++ is not that well supported by notebooks, but running is fine
- not that great for analyzing assembly output
- usually requires running through a browser (there are extensions for vim/emacs/vscode)
- might be slow with a lot of data
Troubleshooting
How to setup docker?
docker build -t perf .
    docker run \
      -it \
      --privileged \
      --network=host \
      -e DISPLAY=${DISPLAY} \
      -v ${PWD}:${PWD} \
      -w ${PWD} \
      perf

How to setup linux performance counters?

    sudo mount -o remount,mode=755 /sys/kernel/debug
    sudo mount -o remount,mode=755 /sys/kernel/debug/tracing
    sudo chown `whoami` /sys/kernel/debug/tracing/uprobe_events
    sudo chmod a+rw /sys/kernel/debug/tracing/uprobe_events
    echo 0 | sudo tee /proc/sys/kernel/kptr_restrict
    echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
    echo 1000 | sudo tee /proc/sys/kernel/perf_event_max_sample_rate

How to setup rdpmc?

    echo 2 | sudo tee /sys/devices/cpu_core/rdpmc

How to find out which performance events are supported by the cpu?

    perf list # https://perfmon-events.intel.com

How to reduce noise when benchmarking?

Linux

pyperf (https://pyperf.readthedocs.io/en/latest/system.html)

    sudo pyperf system tune
    sudo pyperf system show
    sudo pyperf system reset

    # Set Process CPU Affinity (apt install util-linux)
    # note: `perf::sys::thread::affinity` can be used instead
    taskset -c 0 ./a.out

    # Set Process Scheduling Priority (apt install coreutils)
    # note: `perf::sys::thread::priority` can be used instead
    nice -n -20 taskset -c 0 ./a.out # -20..19 (most..less favorable to the process)

    # Disable CPU Frequency Scaling (apt install cpufrequtils)
    sudo cpupower frequency-set --governor performance
    # cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

    # Disable Address Space Randomization
    echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

    # Disable Processor Boosting
    echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost

    # Disable Turbo Mode
    echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

    # Disable Hyperthreading/SMT
    echo off | sudo tee /sys/devices/system/cpu/smt/control

    # Restrict memory to a single socket
    numactl -m 0 -N 0 ./a.out

    # Enable Huge Pages
    sudo numactl --cpunodebind=1 --membind=1 hugeadm \
      --obey-mempolicy --pool-pages-min=1G:64
    sudo hugeadm --create-mounts
boot / grub
    # Enable Kernel Mode Task-Isolation (https://lwn.net/Articles/816298)
    isolcpus=<cpu number>,...,<cpu number>
    # cat /sys/devices/system/cpu/isolated

    # Disable P-states and C-states
    idle=poll intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=1
    # cat /sys/devices/system/cpu/intel_pstate/status

    # Disable NMI watchdog (cat /proc/sys/kernel/nmi_watchdog)
    nmi_watchdog=0

uefi

    int efi_main(void*, uefi::efi_system_table* system_table) {
      ctrl::cpu::perf::init();

      const auto b1 = ctrl::cpu::perf::branch_instructions();
      const auto c1 = ctrl::cpu::perf::cache_misses();

      u64 checksum{};
      for (auto i = 0u; i < (1u << 16u); ++i) {
        checksum ^= i;
      }

      const auto b2 = ctrl::cpu::perf::branch_instructions();
      const auto c2 = ctrl::cpu::perf::cache_misses();

      uefi::println(system_table->out, L"results");
      uefi::println(system_table->out, b2 - b1);
      uefi::println(system_table->out, c2 - c1);

      return uefi::run();
    }

    clang++-18 -std=c++20 -I. -target x86_64-pc-win32-coff -fno-stack-protector -fshort-wchar -mno-red-zone -c uefi.cpp -o uefi.o
    lld-link-18 -filealign:16 -subsystem:efi_application -nodefaultlib -dll -entry:efi_main uefi.o -out:BOOTX64.EFI
    mkdir -p efi/boot
    cp BOOTX64.EFI /usr/share/ovmf/OVMF.fd efi/boot
    qemu-system-x86_64 \
      -drive if=pflash,format=raw,file=efi/boot/OVMF.fd \
      -drive format=raw,file=fat:rw:. \
      -net none

- Requires reboot
- Gives more control such as disabling cache, ...
Specs
Manuals
- Intel - https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
- AMD - https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/40332.pdf
- ARM - https://developer.arm.com/documentation/ddi0487/latest
- Apple - https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms, https://github.com/mikeroyal/Apple-Silicon-Guide
Books
- Optimizing Software in C++: An Optimization Guide for Windows, Linux and Mac platforms - https://www.agner.org/optimize/optimizing_cpp.pdf
- Optimizing Subroutines in Assembly Language: An Optimization Guide for x86 platforms - https://www.agner.org/optimize/optimizing_assembly.pdf
- The Microarchitecture of Intel, AMD and VIA CPUs: An Optimization Guide for Assembly programmers and compiler makers - https://www.agner.org/optimize/microarchitecture.pdf
- Instruction Tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs - https://www.agner.org/optimize/instruction_tables.pdf
- Calling Conventions for different C++ compilers and operating systems - https://www.agner.org/optimize/calling_conventions.pdf
- What Every Programmer Should Know About Memory - https://www.akkadia.org/drepper/cpumemory.pdf
- Performance Analysis and Tuning on Modern CPUs - https://github.com/dendibakh/perf-book/releases
- Algorithms for Modern Hardware - https://en.algorithmica.org/hpc
- The Art of Writing Efficient Programs - https://www.packtpub.com/product/the-art-of-writing-efficient-programs
- Computer Architecture - https://dl.acm.org/doi/book/10.5555/1999263
- Is Parallel Programming Hard, And, If So, What Can You Do About It? - https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html
- SIMD for C++ Developers - http://const.me/articles/simd/simd.pdf
- Hackers Delight - https://doc.lagout.org/security/Hackers%20Delight.pdf
- Data-Oriented Design - https://www.dataorienteddesign.com/dodbook
- Top-Down Microarchitecture Analysis Method - https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-0/top-down-microarchitecture-analysis-method.html
- A Top-Down method for performance analysis and counters architecture - https://www.researchgate.net/publication/269302126_A_Top-Down_method_for_performance_analysis_and_counters_architecture
- Top-Down Metrics - https://github.com/intel/perfmon/blob/main/TMA_Metrics-full.xlsx
- Measuring Workloads With TopLev - https://github.com/andikleen/pmu-tools/wiki/toplev-manual
- The Art of Assembly Language - https://www.plantation-productions.com/Webster/www.artofasm.com/Linux/HTML/AoATOC.html
- Bits Of Architecture - https://github.com/CoffeeBeforeArch/bits_of_architecture
- Bit Twiddling Hacks - https://graphics.stanford.edu/~seander/bithacks.html
- Memory Models - https://research.swtch.com/mm
- nanoBench: A Low-Overhead Tool for Running Microbenchmarks on x86 Systems - https://arxiv.org/abs/1911.03282
- The Linux Scheduler: a Decade of Wasted Cores - https://people.ece.ubc.ca/sasha/papers/eurosys16-final29.pdf
- The Tail At Scale - https://www.barroso.org/publications/TheTailAtScale.pdf
- Producing wrong data without doing anything obviously wrong! - https://dl.acm.org/doi/10.1145/1508284.1508275
- Robust benchmarking in noisy environments - https://arxiv.org/abs/1608.04295
- Can Seqlocks Get Along With Programming Language Memory Models - https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf
- Cache-Oblivious Algorithms and Data Structures - https://erikdemaine.org/papers/BRICS2002/
- Performance Engineering of Software Systems - https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018
Cheatsheets
Operation Costs in CPU Clock Cycles - http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles
Intel Intrinsics Guide - https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
x86 Intrinsics Cheatsheet - https://db.in.tum.de/~finis/x86-intrin-cheatsheet-v2.1.pdf
A machine-readable CPUID data repository - https://x86-cpuid.org
Instruction Matrix - https://github.com/google/highway/blob/master/g3doc/instruction_matrix.pdf
Microarchitecture Cheatsheet - https://docs.google.com/spreadsheets/d/18ln8SKIGRK5_6NymgdB9oLbTJCFwx0iFI-vUs6WFyuE
Performance Monitoring Events - https://perfmon-events.intel.com
Core to Core Latency - https://github.com/nviennot/core-to-core-latency
Clock Time Analysis - https://gitlab.com/chriscox/CppPerformanceBenchmarks/-/wikis/ClockTimeAnalysis
    Speed of light ............................ ~1 foot/ns
    L1 cache reference ......................... 0.5 ns
    Branch mispredict ............................ 5 ns
    L2 cache reference ........................... 7 ns
    Mutex lock/unlock ........................... 25 ns
    Main memory reference ...................... 100 ns
    Send 2K bytes over 1 Gbps network ....... 20,000 ns  =  20 µs
    SSD random read ........................ 150,000 ns  = 150 µs
    Read 1 MB sequentially from memory ..... 250,000 ns  = 250 µs
    Round trip within same datacenter ...... 500,000 ns  = 0.5 ms
    Read 1 MB sequentially from SSD ..... 1,000,000 ns   =   1 ms
    Read 1 MB sequentially from disk .... 20,000,000 ns  =  20 ms
    Send packet CA->UK->CA .... 150,000,000 ns           = 150 ms

    # System topology (apt install hwloc)
    lstopo # lstopo-no-graphics

    # CPU info (apt install util-linux)
    lscpu | grep -E "^CPU|^Model|^Core|^Socket|^Thread"

    # Cache info
    lscpu | grep cache
    getconf -a | grep CACHE_LINESIZE

    # Numa nodes
    lscpu | grep -E "^NUMA"

    # Huge pages
    cat /proc/meminfo | grep -i huge

Guides
- The Linux Kernel Documentation - https://www.kernel.org/doc/html/latest/index.html
- RHEL Performance Guide - https://myllynen.github.io/rhel-performance-guide/
- Monitoring and Managing System Status and Performance - https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html/monitoring_and_managing_system_status_and_performance
- Active Benchmarking - https://www.brendangregg.com/activebenchmarking.html
- A CPU research kernel with minimal noise for cycle-by-cycle micro-architectural introspection - https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll.html
- X3 Low Latency Quickstart - https://docs.amd.com/r/en-US/ug1586-onload-user/X3-Low-Latency-Quickstart
- Modular platform for computer-system architecture research - https://www.gem5.org
- Reverse debugging at scale - https://engineering.fb.com/2021/04/27/developer-tools/reverse-debugging
- Asynchronous Programming Under Linux - https://unixism.net/loti/async_intro.html
- Phoronix Test Suite - https://www.phoronix-test-suite.com
- CPU benchmark - https://www.cpubenchmark.net
- Modern Microprocessors A 90-Minute Guide! - https://www.lighterra.com/papers/modernmicroprocessors
- 7-Zip LZMA Benchmark - https://www.7-cpu.com
- CPU Benchmarks - https://curiouscoding.nl/posts/cpu-benchmarks
- JVM Anatomy Quarks - https://shipilev.net/jvm/anatomy-quarks
linux-perf Source Code - https://github.com/torvalds/linux/blob/master/tools/perf, https://elixir.bootlin.com/linux/v6.13-rc1/source/include/uapi/linux/perf_event.h
gcc optimization - https://wiki.gentoo.org/wiki/GCC_optimization
gcc assembler syntax - https://www.felixcloutier.com/documents/gcc-asm
llvm Scheduling Models - https://github.com/llvm/llvm-project/tree/main/llvm/lib/Target
llvm Optimization Passes - https://llvm.org/docs/Passes.html
llvm Vectorizers - https://llvm.org/docs/Vectorizers.html
Tutorials
- Performance Ninja Class - https://github.com/dendibakh/perf-ninja
- Hardware Effects - https://github.com/Kobzol/hardware-effects
- Performance Tuning - https://github.com/NAThompson/performance_tuning_tutorial
- Mastering C++ with Google Benchmark - https://ashvardanian.com/posts/google-benchmark
- Learning to Write Less Slow C, C++, and Assembly Code - https://github.com/ashvardanian/less_slow.cpp
Feeds
News
- Linux News - https://lwn.net
- Chips and Cheese - https://chipsandcheese.com
- WikiChip - https://wikichip.org
- CPUID - https://www.cpuid.com/news.html
- Real World Tech - https://www.realworldtech.com
- Tom`s Hardware - https://www.tomshardware.com
- Phoronix - https://www.phoronix.com
- comp.lang.asm.x86 - https://groups.google.com/g/comp.lang.asm.x86
Blogs
- Agner Fog`s Blog - https://www.agner.org
- Denis Bakhvalov`s Blog - https://easyperf.net/blog
- Daniel Lemire`s Blog - https://lemire.me/blog
- Wojciech Mula`s Blog - http://0x80.pl/articles/index.html
- Erik Rigtorp`s Blog - https://rigtorp.se
- Johnny`s Software Blog - https://johnnysswlab.com
- JabPerf`s Blog - https://jabperf.com/blog
- Brendan Gregg`s Blog - https://brendangregg.com/blog
- Geoff Langdale`s Blog - https://branchfree.org
- Ragnar Groot Koerkamp`s Blog - https://curiouscoding.nl/posts/cpu-benchmarks
- Travis Downs`s Blog - https://travisdowns.github.io
- Tristan`s Blog - https://thume.ca/archive.html
- Stefanos Baziotis`s Blog - https://sbaziotis.com/#blog
- Gamozo Labs Blog - https://gamozolabs.github.io
- Mechanical Sympathy Blog - https://mechanical-sympathy.blogspot.com
- Performance Engineering Blog - https://pramodkumbhar.com
- Dmitry Vyukov Blog - https://www.1024cores.net
- John Farrier`s Blog - https://johnfarrier.com
- Performance Tricks Blog - https://www.performetriks.com/blog
- Coding Confessions Blog - https://blog.codingconfessions.com
- The Netflix Tech Blog - https://netflixtechblog.com
- Cloudflare Blog - https://blog.cloudflare.com
Lists
- C++ Links - https://github.com/MattPD/cpplinks
- Awesome Performance C++ - https://github.com/fenbf/AwesomePerfCpp
- Awesome Lock Free - https://github.com/rigtorp/awesome-lockfree
- Awesome SIMD - https://github.com/awesome-simd/awesome-simd
- Computer, Enhance! - https://www.computerenhance.com
- Low Latency Trading Insights - https://lucisqr.substack.com
Miscellaneous
- Conferences - https://www.p99conf.io, https://supercomputing.org, https://hotchips.org, https://microarch.org
- Podcasts - https://signals-threads.simplecast.com, https://microarch.club, https://tlbh.it, https://twoscomplement.org
- C++ Low Latency Group (SG14) - https://github.com/WG21-SG14/SG14
Videos
Channels
- Computer Architecture - Onur Mutlu - https://www.youtube.com/@OnurMutluLectures
- Computer, Enhance - Casey Muratori - https://www.youtube.com/@MollyRocket
- Assembly / Creel - https://www.youtube.com/c/WhatsACreel
- EasyPerf / Spaces - https://www.youtube.com/@easyperf3992
- SIMD algorithms - Denis Yaroshevskiy - https://www.youtube.com/playlist?list=PLYCMvilhmuPEM8DUvY6Wg_jaSFHpmlSBD
Benchmarking
- Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My! - Chandler Carruth - https://www.youtube.com/watch?v=nXaxk27zwlk
- Counting Nanoseconds Microbenchmarking C++ Code - David Gross - https://www.youtube.com/watch?v=Czr5dBfs72U
- Benchmarking C++ Code - Bryce Adelstein-Lelbach - https://www.youtube.com/watch?v=zWxSZcpeS8Q
- Benchmarking C++, From video games to algorithmic trading - Alexander Radchenko - https://www.youtube.com/watch?v=7YVMC5v4qCA
- Going Nowhere Faster - Chandler Carruth - https://www.youtube.com/watch?v=2EWejmkKlxs
- Measurement and Timing - Performance Engineering of Software Systems - https://www.youtube.com/watch?v=LvX3g45ynu8
- How NOT to Measure Latency - Gil Tene - https://www.youtube.com/watch?v=lJ8ydIuPFeU
Analyzing
- Understanding the Performance of code using LLVM-MCA - A. Biagio & M. Davis - https://www.youtube.com/watch?v=Ku2D8bjEGXk
- LLVM Optimization Remarks - Ofek Shilon - https://www.youtube.com/watch?v=qmEsx4MbKoc
Profiling
- From Top-down Microarchitecture Analysis to Structured Performance Optimizations - https://cassyni.com/events/YKbqoE4axHCgvQ9vuQq7Cy
- Coz: finding code that counts with causal profiling - ACM - https://www.youtube.com/watch?v=jE0V-p1odPg
- Take Advantage for Intel Instrumentation and Tracing Technology for Performance Analysis - https://www.youtube.com/watch?v=1zdVFLajewM&list=PLg-UKERBljNw3_6Q598CS3DE7KqDXjP-d
- LIKWID Performance Tools - https://www.youtube.com/playlist?list=PLxVedhmuwLq2CqJpAABDMbZG8Whi7pKsk
- Introduction to the Tracy Profiler - Bartosz Taudul - https://youtu.be/fB5B46lbapc
- Performance Matters - Emery Berger - https://www.youtube.com/watch?v=r-TLSBdHe1A
Optimizing
- Understanding Compiler Optimization - Chandler Carruth - https://www.youtube.com/watch?v=haQ2cijhvhE
- Efficiency with Algorithms, Performance with Data Structures - Chandler Carruth - https://www.youtube.com/watch?v=fHNmRkzxHWs
- Design for Performance - Fedor Pikus - https://www.youtube.com/watch?v=m25p3EtBua4
- Unlocking Modern CPU Power - Next-Gen C++ Optimization Techniques - https://www.youtube.com/watch?v=wGSSUSeaLgA
- Branchless Programming in C++ - Fedor Pikus - https://www.youtube.com/watch?v=g-WPhYREFjk
- CPU design effects - Jakub Beranek - https://www.youtube.com/watch?v=ICKIMHCw--Y
- Fastware - Andrei Alexandrescu - https://www.youtube.com/watch?v=o4-CwDo2zpg
- Performance Tuning - Matt Godbolt - https://www.youtube.com/watch?v=fV6qYho-XVs
- Memory & Caches - Matt Godbolt - https://www.youtube.com/watch?v=4_smHyqgDTU
- What Every Programmer Should Know about How CPUs Work - Matt Godbolt - https://www.youtube.com/watch?v=-HNpim5x-IE
- There Are No Zero-cost Abstractions - Chandler Carruth - https://www.youtube.com/watch?v=rHIkrotSwcc&
- Understanding Optimizers: Helping the Compiler Help You - Nir Friedman - https://www.youtube.com/watch?v=8nyq8SNUTSc
- C++ Algorithmic Complexity, Data Locality, Parallelism, Compiler Optimizations, & Some Concurrency - Avi Lachmish - https://www.youtube.com/watch?v=0iXRRCnurvo
- Software Optimizations Become Simple with Top-Down Analysis on Intel Skylake - Ahmad Yasin - https://www.youtube.com/watch?v=kjufVhyuV_A
- Being Friendly to Your Computer Hardware in Software Development - Ignas Bagdonas - https://www.youtube.com/watch?v=eceFgsiPPmk
- Want fast C++? Know your hardware - Timur Doumler - https://www.youtube.com/watch?v=BP6NxVxDQIs
- What is Low Latency C++ - Timur Doumler - https://www.youtube.com/watch?v=EzmNeAhWqVs, https://www.youtube.com/watch?v=5uIsadq-nyk
- Where Have All the Cycles Gone? - Sean Parent - https://www.youtube.com/watch?v=B-aDBB34o6Y
- Understanding CPU Microarchitecture to Increase Performance - https://www.youtube.com/watch?v=rglmJ6Xyj1c
- Performance Analysis & Tuning on Modern CPU - Denis Bakhvalov - https://www.youtube.com/watch?v=Ho3bCIJcMcc
- Comparison of C++ Performance Optimization Techniques for C++ Programmers - Eduardo Madrid - https://www.youtube.com/watch?v=4DQqcRwFXOI
- Simple Code, High Performance - Molly Rocket - https://www.youtube.com/watch?v=Ge3aKEmZcqY
- Assembly, System Calls, and Hardware in C++ - David Sankel - https://www.youtube.com/watch?v=7xwjjolDnwg
- Optimizing Binary Search - Sergey Slotin - https://www.youtube.com/watch?v=1RIPMQQRBWk
- A Deep Dive Into Dispatching Techniques in C++ - Jonathan Muller - https://www.youtube.com/watch?v=vUwsfmVkKtY
- C++ Memory Model: from C++11 to C++23 - Alex Dathskovsky - https://www.youtube.com/watch?v=SVEYNEWZLo4
- Abusing Your Memory Model for Fun and Profit - Samy Al Bahra, Paul Khuong - https://www.youtube.com/watch?v=N07tM7xWF1U&t=1s
- The speed of concurrency (is lock-free faster?) - Fedor Pikus - https://www.youtube.com/watch?v=9hJkWwHDDxs
- Read, Copy, Update, then what? RCU for non-kernel programmers - Fedor Pikus - https://www.youtube.com/watch?v=rxQ5K9lo034
- Single Producer Single Consumer Lock-free FIFO From the Ground Up - Charles Frasch - https://www.youtube.com/watch?v=K3P_Lmq6pw0
- Introduction to Hardware Efficiency in Cpp - Ivica Bogosavljevic - https://www.youtube.com/watch?v=Fs_T070H9C8
- The Performance Price of Dynamic Memory in C++ - Ivica Bogosavljevic - https://www.youtube.com/watch?v=LC4jOs6z-ZI
- The Hidden Performance Price of C++ Virtual Functions - Ivica Bogosavljevic - https://www.youtube.com/watch?v=n6PvvE_tEPk
- Why do Programs Get Slower with Time? - Ivica Bogosavljevic - https://www.youtube.com/watch?v=nS5vjnPKX0I
- CPU Cache Effects - Sergey Slotin - https://www.youtube.com/watch?v=mQWuX_KgH00
- Cpu Caches and Why You Care - Scott Meyers - https://www.youtube.com/watch?v=WDIkqP4JbkE
- CPU vs FPGA - https://www.youtube.com/watch?v=BML1YHZpx2o
- Designing for Efficient Cache Usage - Scott McMillan - https://www.youtube.com/watch?v=3-ityWN-FdE
- Cache consistency and the C++ memory model - Yossi Moalem - https://www.youtube.com/watch?v=Sa08x_NMZIg
- std::simd: How to Express Inherent Parallelism Efficiently Via Data-parallel Types - Matthias Kretz - https://www.youtube.com/watch?v=LAJ_hywLtMA
- The Art of SIMD Programming - Sergey Slotin - https://www.youtube.com/watch?v=vIRjSdTCIEU
- Advanced SIMD Algorithms in Pictures - Denis Yaroshevskiy - https://www.youtube.com/watch?v=vGcH40rkLdA
- Performance Optimization, SIMD and Cache - Sergiy Migdalskiy - https://www.youtube.com/watch?v=Nsf2_Au6KxU
- Data-Oriented Design and C++ - Mike Acton - https://www.youtube.com/watch?v=rX0ItVEVjHc
- Practical Data Oriented Design (DoD) - Andrew Kelley - https://www.youtube.com/watch?v=IroPQ150F6c
- Data Orientation For The Win - Eduardo Madrid - https://www.youtube.com/watch?v=QbffGSgsCcQ
- You Can Do Better than std::unordered_map - Malte Skarupke - https://www.youtube.com/watch?v=M2fKMP47slQ
- Faster than Rust and C++: the PERFECT hash table - https://www.youtube.com/watch?v=DMQ_HcNSOAI
- Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step - Matt Kulukundis - https://www.youtube.com/watch?v=ncHmEUmJZf4
- C++ Run-Time Optimizations for Compile-Time Reflection - Kris Jusiak - https://www.youtube.com/watch?v=kCATOctR0BA
High Frequency Trading
- When Nanoseconds Matter: Ultrafast Trading Systems in C++ - David Gross - https://www.youtube.com/watch?v=sX2nF1fW7kI
- When a Microsecond Is an Eternity: High Performance Trading Systems in C++ - Carl Cook - https://www.youtube.com/watch?v=NH1Tta7purM
- The Speed Game: Automated Trading Systems in C++ - Carl Cook - https://www.youtube.com/watch?v=ulOLGX3HNCI
- Low-Latency Trading Systems in C++ - Jason McGuiness - https://www.youtube.com/watch?v=FnMfhWiSweo
- High Frequency Trading and Ultra Low Latency development techniques - Nimrod Sapir - https://www.youtube.com/watch?v=_0aU8S-hFQI
- Trading at light speed: designing low latency systems in C++ - David Gross - https://www.youtube.com/watch?v=8uAW5FQtcvE&list=PLSkBiuVO9yj1MvDkYJ5WOnPeKsoRi3eiW&index=2
- Optimizing Trading Strategies for FPGAs in C/C++ - https://www.youtube.com/watch?v=4Wklh0XS5i0
- C++ Electronic Trading for Cpp Programmers - Mathias Gaunard - https://www.youtube.com/watch?v=ltT2fDqBCEo
- Achieving performance in financial data processing through compile time introspection - Eduardo Madrid - https://www.youtube.com/watch?v=z6fo90R8q5U
- How to Simulate a Low Latency Exchange in C++ - Benjamin Catterall - https://www.youtube.com/watch?v=QQrTE4YLkSE
- Building Low Latency Trading Systems - https://www.youtube.com/watch?v=yBNpSqOOoRk
- Cache Warming: Warm Up The Code - Jonathan Keinan - https://www.youtube.com/watch?v=XzRxikGgaHI
- How Linux Took Over the World of Finance - Christoph H Lameter - https://www.youtube.com/watch?v=UUOM4KdaHkY
Tools
Online
- Compiler Explorer - https://compiler-explorer.com
- Latency, Throughput, and Port Usage Information - https://uops.info / https://uica.uops.info
- Latency, Memory Latency and CPUID dumps - http://instlatx64.atw.hu
- Instruction Reference - https://www.felixcloutier.com/x86
- x86 Processor Information - https://sandpile.org
- Memory Latency Data - https://chipsandcheese.com/memory-latency-data
- Instruction Discovery And Analysis on x86-64 - https://explore.liblisa.nl
- Quick C++ Benchmark - https://quick-bench.com
Benchmarking
- google-benchmark - https://github.com/google/benchmark
- nanobench - https://github.com/martinus/nanobench
- celero - https://github.com/DigitalInBlue/Celero
- nanoBench - https://github.com/andreas-abel/nanoBench
- uarch-bench - https://github.com/travisdowns/uarch-bench
- llvm-exegesis - https://llvm.org/docs/CommandGuide/llvm-exegesis.html
Profiling
- linux-perf - https://perf.wiki.kernel.org
- intel-vtune - https://www.intel.com/content/www/us/en/docs/vtune-profiler
- intel-advisor - https://www.intel.com/content/www/us/en/developer/tools/oneapi/advisor.html
- intel-sde - https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html
- intel-pin - https://www.intel.com/content/www/us/en/developer/articles/tool/pin-a-dynamic-binary-instrumentation-tool.html
- amd-uprof - https://www.amd.com/en/developer/uprof.html
- callgrind - https://valgrind.org/docs/manual/cl-manual.html
- pmu-tools - https://github.com/andikleen/pmu-tools
- perf-tools - https://github.com/brendangregg/perf-tools
- yperf - https://github.com/aayasin/perf-tools
- ebpf - https://ebpf.io
- dtrace - https://www.oracle.com/linux/downloads/linux-dtrace.html
- ftrace - https://www.kernel.org/doc/html/latest/trace/ftrace.html
- utrace - https://github.com/Gui774ume/utrace
- strace - https://strace.io
- magictrace - https://github.com/janestreet/magic-trace
- omnitrace - https://github.com/ROCm/omnitrace
- tracy - https://github.com/wolfpld/tracy
- optick - https://github.com/bombomby/optick
- rad_telemetry - https://www.radgametools.com/telemetry.html
- wachy - https://rubrikinc.github.io/wachy
- easy_profiler - https://github.com/yse/easy_profiler
- gprof - https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_mono/gprof.html
- gperftools - https://github.com/gperftools/gperftools
- oprofile - https://oprofile.sourceforge.io
- optview2 - https://github.com/OfekShilon/optview2
- llvm-xray - https://llvm.org/docs/XRay.html
- likwid - https://github.com/RRZE-HPC/likwid
- lttng - https://lttng.org
- sysprof - https://www.sysprof.com
- coz - https://github.com/plasma-umass/coz
- bcc - https://github.com/iovisor/bcc
Analysis
- llvm-mca - https://llvm.org/docs/CommandGuide/llvm-mca.html
- osaca - https://github.com/RRZE-HPC/OSACA
- uica - https://uica.uops.info
Profile-Guided Optimization (PGO)
- clang-pgo - https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
- gcc-pgo - https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
- llvm-bolt - https://github.com/llvm/llvm-project/blob/main/bolt/README.md
- llvm-propeller - https://github.com/google/llvm-propeller
- autofdo - https://github.com/google/autofdo
Utilities
- pyperf - https://github.com/psf/pyperf
- kcachegrind - https://kcachegrind.sourceforge.net/html/Home.html
- hyperfine - https://github.com/sharkdp/hyperfine
- hotspot - https://github.com/KDAB/hotspot
- numatop - https://github.com/intel/numatop
- bpftop - https://github.com/Netflix/bpftop
- pahole - https://github.com/acmel/dwarves
- perfetto - https://perfetto.dev
- speedscope - https://github.com/jlfwong/speedscope
- retsnoop - https://github.com/anakryiko/retsnoop
- core-to-core-latency - https://github.com/nviennot/core-to-core-latency
- llvm-opt-report - https://llvm.org/docs/CommandGuide/llvm-opt-report.html
- jupyter notebook - https://jupyter.org
Libraries
- perf_event_open - https://man7.org/linux/man-pages/man2/perf_event_open.2.html
- perfmon2 - https://perfmon2.sourceforge.net
- papi - https://github.com/icl-utk-edu/papi
- libpfc - https://github.com/obilaniu/libpfc
- intel_pt - https://github.com/intel/libipt
- intel_pcm - https://github.com/intel/pcm
MIT / Apache2:LLVM*