// Overview / Tutorial / Studies / API / FAQ / Resources

`perf`: C++23 Performance library

Performance is not a number!

Overview

Performance-oriented library that combines the power of:
c++23, linux/perf, llvm/mca, intel/pt, gnuplot, terminal/sixel, ...

Use cases

CPU Benchmarking, Profiling, Tracing, Analysis

Requirements

Minimal

clang-16+, gcc-12+ / c++23+

Optimal (recommended)

clang-19+, gcc-13+ / c++23+

llvm-dev-19+ - apt-get install llvm-dev

linux-6.x+

perf_event_open - apt-get install linux-tools-common

intel-12th+ with tpebs, ipt support

intel_pt - apt-get install libipt-dev

terminal with sixel support

gnuplot - apt-get install gnuplot

Optional

Profiling

linux-perf - apt-get install linux-tools-common

llvm-xray - apt-get install llvm

gperftools - apt-get install google-perftools

intel-vtune - apt-get install intel-oneapi-vtune

callgrind - apt-get install valgrind

Data analysis

jupyter notebook - apt-get install jupyter

perfetto - https://perfetto.dev

Sharing analysis - https://gist.github.com

gh - apt-get install gh

Features

Supports timing, profiling, tracing, analyzing, plotting, testing

time (tsc, steady_clock, cpu, thread, real, monotonic)

stat (perf_event_open)

record (perf_event_open)

trace (perf_event_open/intel_pt)

mc/mca (llvm: disassemble, llvm-mca: resource_pressure, timeline, bottleneck)

prof (linux-perf, llvm-xray, gperftools, intel-vtune, callgrind)

plot (gnuplot: hist, bar, box, ecdf, line)

Strives for simplicity, flexibility, accuracy

Single Header/Module - https://github.com/qlibs/perf/blob/main/perf https://github.com/qlibs/perf/blob/main/perf.cppm

API / bench

import perf;

int main() {
  perf::runner bench{perf::bench::latency{}};     // what and how

  auto fizz_buzz = [](int n) {
    if (n % 15 == 0) {
      return "FizzBuzz";
    } else if (n % 3 == 0) {
      return "Fizz";
    } else if (n % 5 == 0) {
      return "Buzz";
    } else {
      return "Unknown";
    }
  };

  bench(fizz_buzz, 1);
  bench(fizz_buzz, 3);
  bench(fizz_buzz, perf::data::repeat<int>{{3,5,15}});
  bench(fizz_buzz, perf::data::uniform<int>{.min = 0, .max = 15});

  perf::report(bench[perf::time::steady_clock]);  // per benchmark
  perf::annotate(bench[perf::mc::assembly]);      // per instruction
  perf::plot::bar(bench[perf::stat::cycles]);     // hist, bar, box, ecdf, line
}

See How does it work? for more details

Enables dev, research workflows

dev: edit -> compile -> run -> analyze (c++ only)

research: edit -> compile -> run -> save(json) -> notebook/perfetto -> analyze (c++/python/browser)

Performs self-verification upon compilation
compile-time tests are executed upon include/import (enabled by default)
run-time/sanity check tests can be executed by
int main() {
  perf::test::run({.verbose = true});
}
compile-time/run-time tests can be disabled with -DNTEST (not recommended)

Setup

Try it!

docker build -t perf .
docker run --rm --privileged -v ${PWD}:${PWD} -w ${PWD} perf \
  clang++-20 -std=c++23 -O3 --precompile perf.cppm
gh gist create --public --web hello_world.cpp

Build & Test

local env

build dependencies (optional)

apt-get install linux-tools-common
apt-get install llvm-dev
apt-get install libipt-dev
apt-get install gnuplot
pip3 install pyperf

enable linux tracing (optional)

.github/scripts/setup.sh --perf

tune machine for benchmarking (recommended)

.github/scripts/tune.sh

import perf

cat <<EOF > perf.cpp
  import perf;
  int main() {
    perf::test::run({.verbose = true});
  }
EOF

build # perf tests are running at compile-time upon inclusion/import and run-time on start-up

clang++-20 \
  -std=c++23 \
  -O3 \
  -I. \
  -I/usr/lib/llvm-20/include \
  -lLLVM-20 \
  -lipt \
  -o perf \
  perf.cpp && \
./perf # runs run-time tests and sanity checks

docker

build perf image

wget https://raw.githubusercontent.com/qlibs/perf/refs/heads/main/.github/workflows/Dockerfile
docker build -t perf .

import perf

cat <<EOF > perf.cpp
  import perf;
  int main() {
    perf::test::run({.verbose = true});
  }
EOF

build # perf tests are running at compile-time upon inclusion/import and run-time on start-up

docker run \
  --rm \
  --privileged \
  -v ${PWD}:${PWD} \
  -w ${PWD} \
  perf \
  clang++-20 \
    -std=c++23 \
    -O3 \
    -I. \
    -I/usr/lib/llvm-20/include \
    -lLLVM-20 \
    -lipt \
    -o perf \
    perf.cpp && \
  ./perf

Tutorial

Setup

tune / info / core

Benchmarking

what & how

latency vs throughput

report / annotate / plot

debug & test

Profiling / Tracing / Analysis

disassemble vs trace

profile vs analyze

Miscellaneous

export + markdown / notebook / html

share + github

workflow + notebook / perfetto

Studies

Retiring

Bad speculation

Frontend bound

Backend bound

max_ipc

API

Config

Macros

/**
 * PERF version # https://semver.org
 */
#define PERF (MAJOR, MINOR, PATCH) // ex. (1, 0, 0)

/**
 * GNU # default: deduced based on `__GNUC__`
 * - 0 not compatible
 * - 1 compatible
 */
#define PERF_GNU 0/1

/**
 * Linux # default: deduced based on `__linux__` and `perf_event_open.h`
 * - 0 not supported
 * - 1 supported
 */
#define PERF_LINUX 0/1

/**
 * UEFI # default: 0
 * - 0 not supported
 * - 1 supported
 */
#define PERF_UEFI 0/1

/**
 * LLVM # default: deduced based on `llvm-dev` headers
 * - 0 not supported
 * - 1 supported
 */
#define PERF_LLVM 0/1

/**
 * Intel Processor Trace # default: deduced based on `<intel_pt.h>` header
 * - 0 not supported
 * - 1 supported
 */
#define PERF_INTEL 0/1

/**
 * tests # default: not-defined
 * - defined:     disables all compile-time, run-time tests
 * - not-defined: compile-time tests executed, run-time tests via `test::run`
 */
#define NTEST

Environment variables

/**
 * gnuplot terminal # see `gnuplot -> set terminal` # default: 'sixel'
 * - 'sixel'                  # console image # https://www.arewesixelyet.com
 * - 'wxt'                    # popup window
 * - 'dumb size 150,25 ansi'  # console with colors
 * - 'dumb size 80,25'        # console
 */
ENV:PERF_PLOT_TERM

/**
 * style # default: dark
 * - light
 * - dark
 */
ENV:PERF_PLOT_STYLE

Synopsis

FAQ

Usage
How does it work?
unroll - show run

views - gnuplot via sixel

function - invoke

sampler, counter - perf_event_open

mca - llvm-dev.mca
Backward/type-safe deduction on what to bench based on usage
bench(fn); // `time::cpu, stat::instructions (ipc), stat::cycles` will be benchmarked
report(bench[stat::cycles, time::cpu]);
annotate(bench[record::cycles]);
plot(bench[stat::ipc]);
How to share results?
export.sh
with https://gist.github.com
apt-get install gh
gh gist create perf.html --public | awk '{print "https://htmlpreview.github.io/?perf.html/raw"}'
Sharing options

markdown (.md) - easiest to edit, but github doesn't render embedded images

notebook (.ipynb) - still markdown based, but github renders embedded images

html (.html) - best looking but hardest to edit; github doesn't render html, but https://htmlpreview.github.io does
How to use ascii based charts?
// - 'dumb size 150,25 ansi'  # console with colors
// - 'dumb size 80,25'        # console
PERF_PLOT_TERM='dumb size 150,25 ansi' ./perf
How to use popup based charts?
// - 'wxt'                    # popup window
PERF_PLOT_TERM='wxt' ./perf
How to change plots style?
PERF_PLOT_STYLE='light' ./perf
PERF_PLOT_STYLE='dark' ./perf # default
What is prevent_elision and when is it needed?
An optimizing compiler may elide your code completely if it's not required (doesn't have observable side effects). prevent_elision prevents that.
verify(perf::compiler::is_elided([] { }));
verify(perf::compiler::is_elided([] { auto value = 4 + 2; }));
verify(perf::compiler::is_elided([] { int i{}; i++; }));
verify(not perf::compiler::is_elided([&] { i++; }));
verify(not perf::compiler::is_elided([] { static int i; i++; }));
verify(not perf::compiler::is_elided([=] {
  int i{};
  perf::compiler::prevent_elision(i++);
}));
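One common way to build such a barrier on GCC and Clang is an empty inline-assembly statement that claims to read the value and clobber memory. The sketch below illustrates that general technique only; it is not perf's implementation, and `keep_alive` and `sum_to` are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical helper (not part of perf): tells the optimizer that `value`
// is observed by opaque code, so the computation producing it must stay.
// Relies on GCC/Clang extended inline assembly.
template <class T>
inline void keep_alive(const T& value) {
  asm volatile("" : : "r,m"(value) : "memory");
}

// Sums 0..n-1. Without keep_alive an optimizer may replace the loop with a
// closed form, or drop it entirely when the result is unused.
inline std::uint64_t sum_to(std::uint64_t n) {
  std::uint64_t total = 0;
  for (std::uint64_t i = 0; i < n; ++i) {
    total += i;
    keep_alive(total);
  }
  return total;
}
```

The barrier emits no instructions itself; it only constrains what the optimizer may assume, so the measured loop body stays close to what was written.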
How to change assembly syntax from intel to at&t?
perf::llvm llvm{.syntax = perf::arch::syntax::att };
perf::bench::runner bench{
  {.syntax = perf::arch::syntax::att},
  perf::bench::latency,
};
How to disassemble for different platform?
perf::llvm llvm{.triple = "x86_64-pc-linux-gnu" }; // see `llvm-llc` for details
How to integrate with a testing framework?
perf can be integrated with any unit testing framework - https://github.com/qlibs/ut
import perf;
import ut;

int main() {
  "benchmark"_test = [] {
    // ...
  };
}
How to save images?
gnuplot save{{.term = "svg"}};
save.send("set output 'output.svg'");
bar(save, results);
How to use perf as a C++20 module?
clang++-20 -std=c++23 -O3 -I. --precompile perf.cppm
clang++-20 -std=c++23 -O3 -fprebuilt-module-path=. perf.pcm perf.cpp -lLLVM-18 -lipt
What is required to display images on the terminal?

terminal with sixel support - https://www.arewesixelyet.com
note: Visual Studio Code supports images on terminal (requires enabling Terminal -> Enable images option)

How to plot on the server without sixel?
PERF_PLOT_TERM='dumb size 80,25' ./a.out # see gnuplot - set terminal

How to write gnuplot charts?
sudo apt install gnuplot

gnuplot documentation - http://www.gnuplot.info/documentation.html

gnuplot demos - http://www.gnuplot.info/demo

online gnuplot - https://gnuplot.io
How to write a custom profiler?
struct my_profiler {
  constexpr auto start() { }
  constexpr auto stop() { }
  [[nodiscard]] constexpr auto *operator() { }
};
static_assert(perf::prof_like<my_profiler>);
How to pollute cache, heap when benchmarking?

perf::memory::pollute_heap

perf::memory::flush_cacheline
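For intuition, cache pollution can be approximated in plain C++ by writing to one byte per cache line of a buffer larger than the last-level cache. This is a sketch under assumptions (64-byte lines, caller picks the buffer size), not how `perf::memory::pollute_heap` or `perf::memory::flush_cacheline` are implemented; `touch_cache_lines` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not perf's implementation): walks a freshly allocated
// buffer one cache line at a time so previously cached data is likely evicted.
// Assumes 64-byte cache lines by default; choose `bytes` larger than the
// last-level cache. Returns the number of cache lines written.
inline std::size_t touch_cache_lines(std::size_t bytes,
                                     std::size_t line_size = 64) {
  assert(line_size > 0);
  std::vector<unsigned char> buffer(bytes);
  volatile unsigned char* data = buffer.data();  // volatile: keep every store
  std::size_t touched = 0;
  for (std::size_t offset = 0; offset < bytes; offset += line_size) {
    data[offset] = static_cast<unsigned char>(offset);
    ++touched;
  }
  return touched;
}
```

Eviction is only likely, not guaranteed: replacement policy and hardware prefetchers influence what actually remains cached.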
Instrumentation with llvm-xray
[[clang::xray_always_instrument]]
void always_profile();

[[clang::xray_always_instrument, clang::xray_log_args(1)]]
void always_profile_and_log_i(int i);

[[clang::xray_never_instrument]]
void never_profile();
# profiling threshold
-fxray-instruction-threshold=1 # default 200 instructions
# instrumentation info
llvm-xray extract ./a.out --symbolize
https://godbolt.org/z/WhsEYf9cc
Conditional profiling with callgrind
prof::callgrind profiler{"example"};

while (true) {
  profiler.start(); // resets profile

  if (should_trigger()) {
    trigger();
    profiler.stop();
    profiler.flush(); // dumps `example` profile
  }
}
kcachegrind callgrind.* # opens all profiles combined
How do perf tests work?
#ifndef NTEST
  "demo"_suite = [] {
    "run-time and compile-time"_test = [] constexpr {
      expect(3 == sum({1, 2}));
    };

    "run-time only"_test = [] mutable {
      expect(std::rand() >= 0);
    };

    "compile-time only"_test = [] consteval {
      expect(sizeof(int) == sizeof(std::int32_t));
    };
  };
#endif
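The same "compile-time and run-time" idea can be approximated in plain C++ without a test framework: a `constexpr` function is checked with `static_assert` during compilation and again through an ordinary call at run time. This is an illustrative sketch; `sum` here is a hypothetical function mirroring the one used in the snippet above, and `sum_works_at_run_time` is a made-up name.

```cpp
#include <initializer_list>

// Hypothetical function under test.
constexpr int sum(std::initializer_list<int> values) {
  int total = 0;
  for (int value : values) {
    total += value;
  }
  return total;
}

// Compile-time checks: a failure here stops the build.
static_assert(sum({1, 2}) == 3);
static_assert(sum({}) == 0);

// Run-time check of the same property, e.g. called from main() or a runner.
inline bool sum_works_at_run_time() {
  volatile int one = 1;  // volatile forces the value to be read at run time
  return sum({one, 2}) == 3;
}
```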
How perf compares to google.benchmark, nanobench, celero?

Firstly, google.benchmark, nanobench, celero are great and established libraries.

perf's philosophy is that performance is not a number, which leads to the following:

data driven inputs (to avoid branch prediction overfitting)

statistical, data driven inputs

instruction level and function level

statistical, don`t avoid branch prediction

latency and throughput

analysis, profiling and plotting
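To illustrate why data-driven inputs matter: benchmarking a branchy function with one constant argument lets the branch predictor learn a single path, while inputs drawn from a distribution exercise all of them. The sketch below uses only the standard library; `uniform_inputs` and `run_all` are hypothetical helpers, not perf's `perf::data::uniform`.

```cpp
#include <cstddef>
#include <random>
#include <string_view>
#include <vector>

constexpr std::string_view fizz_buzz(int n) {
  if (n % 15 == 0) return "FizzBuzz";
  if (n % 3 == 0) return "Fizz";
  if (n % 5 == 0) return "Buzz";
  return "Unknown";
}

// Hypothetical helper: draws `count` inputs uniformly from [min, max].
// A fixed seed keeps runs reproducible.
inline std::vector<int> uniform_inputs(std::size_t count, int min, int max,
                                       unsigned seed = 42) {
  std::mt19937 generator{seed};
  std::uniform_int_distribution<int> distribution{min, max};
  std::vector<int> inputs(count);
  for (int& input : inputs) {
    input = distribution(generator);
  }
  return inputs;
}

// Calls the function for every input and accumulates the result lengths,
// so the results are actually used.
inline std::size_t run_all(const std::vector<int>& inputs) {
  std::size_t total = 0;
  for (int input : inputs) {
    total += fizz_buzz(input).size();
  }
  return total;
}
```

Generating the inputs before the timed region keeps the cost of the random number generator out of the measurement.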

Performance
Common pitfalls?

Understand

what (latency, throughput),

state (cold, warm, data distribution)

how (timing, counters, profiling) to measure

hardware (what env, os, setup, ...)

ensure proper analysis (avoid premature conclusions)

assert your expectations

verify in the production-like use case/env

Ensure the machine is tuned for micro-benchmarking (see #tuning)

Use structured and diligent methods (see #top_down_analysis)

Use realistic scenarios (see #setup)

Use statistical methods for analysis (#stat)

Analyze generated assembly (see #annotate/#mca)

Visualize results (see #gnuplot)

Document / share analysis (see #html)

Automate expectations (see #testing)

Enhance your understanding (see #performance)

Verify changes in production like system (see #prof) - active-benchmarking

Measure / verify consistently (see #json)
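As a small example of the "statistical methods" point, a set of timing samples is usually summarized by percentiles rather than a single mean. The sketch below computes nearest-rank percentiles with the standard library; `percentile` and `median` are hypothetical helpers, not part of perf.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Nearest-rank percentile of a non-empty sample set, p in 0..100.
// Takes the samples by value because it sorts them.
inline double percentile(std::vector<double> samples, unsigned p) {
  assert(!samples.empty());
  std::sort(samples.begin(), samples.end());
  const std::size_t n = samples.size();
  // Integer form of ceil(p / 100 * n), clamped to a valid 1-based rank.
  std::size_t rank = (static_cast<std::size_t>(p) * n + 99) / 100;
  rank = std::clamp<std::size_t>(rank, 1, n);
  return samples[rank - 1];
}

// With nearest-rank, an even-sized set yields the lower of the two middle values.
inline double median(std::vector<double> samples) {
  return percentile(std::move(samples), 50);
}
```

Reporting, for example, p50 and p99 together shows both the typical case and the tail, which a mean alone hides.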

Latency and Throughput?

latency is the time it takes for a single operation to complete (ns)

throughput is the total number of operations or tasks completed in a given amount of time (op/s)

note: In single-threaded execution, if the algorithm can't be parallelized (where throughput would be important), you likely care about latency.
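The two definitions can be made concrete with a small standard-library sketch: time a batch of operations with `std::chrono::steady_clock`, then derive the mean time per operation (ns) and operations per second (op/s). The names `measurement`, `time_it`, `latency_ns` and `throughput_ops` are hypothetical. The mean time per operation only approximates single-operation latency when the operations run strictly one after another.

```cpp
#include <chrono>
#include <cstdint>

struct measurement {
  std::uint64_t operations;
  std::chrono::nanoseconds elapsed;
};

// Mean time per operation in nanoseconds.
constexpr double latency_ns(measurement m) {
  return static_cast<double>(m.elapsed.count()) /
         static_cast<double>(m.operations);
}

// Operations completed per second.
constexpr double throughput_ops(measurement m) {
  return static_cast<double>(m.operations) * 1e9 /
         static_cast<double>(m.elapsed.count());
}

// Runs `fn(i)` for i in [0, operations) and records the wall-clock time.
template <class Fn>
measurement time_it(Fn&& fn, std::uint64_t operations) {
  const auto start = std::chrono::steady_clock::now();
  for (std::uint64_t i = 0; i < operations; ++i) {
    fn(i);
  }
  const auto stop = std::chrono::steady_clock::now();
  return {operations,
          std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)};
}
```

For example, 1000 operations in 2000 ns gives 2 ns per operation and 5e8 op/s.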
Performance compilation flags?
-O1                     # optimizations (level1) [0]
-O2                     # optimizations (level1 + level2) [0]
-O3                     # [unsafe] optimizations (level1 + level2 + level3) [0]
-DNDEBUG                # disable asserts, etc.
-march=native           # specifies architecture [1]
-ffast-math             # [unsafe] faster math but non-conforming math [2]
-g                      # debug symbols
-fno-omit-frame-pointer # keep the frame pointer in a register for functions that don`t need one
-fcf-protection=none    # [unsafe] stops emitting `endbr64` # control-flow enforcement technology (cet)
Performance attributes?
[[gnu::target("avx2")]]
[[gnu::optimize("O3")]]
[[gnu::optimize("ffast-math")]]
note: gcc support is much more comprehensive than clang
When to finish micro-benchmarking?

When the IPC you have reached won't get any better.

How to handle cases when there are no obvious bottlenecks?

The most likely scenario is cache utilization.

How to profile in production?

see prof API

What are different micro-benchmarking workflows?

[cpp] measure and analyze from cpp
pros: self-contained, easy iterations, supports unit testing, supports running on the server (PERF_PLOT_TERM="dumb")
cons: requires multiple runs for different compilers/flags
note: can be shared with export.sh and gh via https://gist.github.com

[cpp->json->notebook] measure with cpp and export json, analyze results in a notebook
pros: can be unified across different compilers/flags and production runs, power of notebooks, easy to share, can be done offline
cons: multiple places, longer runs, harder to change the code, harder to run on remote servers, harder to unit test expectations, requires GUI

[cpp->json->CI] measure with cpp and use json for CI

used for continuous benchmarking
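For the CI workflow, the core of a continuous-benchmarking gate is a comparison of the current measurement against a stored baseline with some tolerance for noise. The sketch below shows only that comparison; reading the numbers from the exported json is left out, and `regressed` plus the 10% tolerance are assumptions for illustration.

```cpp
// Hypothetical CI gate: reports a regression when the current measurement is
// more than `tolerance` (a fraction, e.g. 0.10 for 10%) slower than baseline.
constexpr bool regressed(double baseline_ns, double current_ns,
                         double tolerance) {
  return current_ns > baseline_ns * (1.0 + tolerance);
}

static_assert(!regressed(100.0, 105.0, 0.10));  // within noise budget
static_assert(regressed(100.0, 111.0, 0.10));   // slower than allowed
```

The tolerance should reflect the measured run-to-run noise of the CI machine; too tight a value produces false alarms.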

What is top-down microarchitecture analysis method?

https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-0/top-down-microarchitecture-analysis-method.html

https://github.com/andikleen/pmu-tools/wiki/toplev-manual
What is jupyter notebook and how to use it?
jupyter notebook is great for analysis and for leaving a trace of it. However, C++ doesn't integrate well with jupyter notebook (cling is often not an option), it doesn't work well with servers (requires copying data), and it's an additional file. The perf approach is to leave a trace of benchmarking by automatically testing it directly from C++, so that it can be revisited in the future and explains what and why.
# apt install jupyter
jupyter notebook --ip 0.0.0.0 --no-browser notebook.ipynb
Workflows

dev

pros

easy editing and integration with existing tooling

output to the console (charts with sixel)

easy to share via gh gist create # github can render markdown

can assert and verify expectation

can be executed in headless mode on the server without ui

cons

harder analysis than in python

single compiler workflow

research (run from jupyter notebook (produce json), analyze using python)

pros

infinite ways of data analysis with great matplotlib support

can run different compilers/options

easy to share via gh gist create # github can render ipynb

can be used on kaggle (avx512) or google colab

cons

editing C++ is not that well supported by notebooks but running is fine

not that great to analyze assembly output

usually would require running through a browser (there are extensions for vim/emacs/vscode)

might be slow with a lot of data

Troubleshooting

How to setup docker?

Dockerfile

docker build -t perf .

docker run \
  -it \
  --privileged \
  --network=host \
  -e DISPLAY=${DISPLAY} \
  -v ${PWD}:${PWD} \
  -w ${PWD} \
  perf

How to setup linux performance counters?

setup.sh

sudo mount -o remount,mode=755 /sys/kernel/debug
sudo mount -o remount,mode=755 /sys/kernel/debug/tracing
sudo chown `whoami` /sys/kernel/debug/tracing/uprobe_events
sudo chmod a+rw /sys/kernel/debug/tracing/uprobe_events
echo 0 | sudo tee /proc/sys/kernel/kptr_restrict
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
echo 1000 | sudo tee /proc/sys/kernel/perf_event_max_sample_rate

How to setup rdpmc?

rdpmc

echo 2 | sudo tee /sys/devices/cpu_core/rdpmc

How to find out which performance events are supported by the cpu?

perf list # https://perfmon-events.intel.com

How to reduce noise when benchmarking?

Linux

pyperf (https://pyperf.readthedocs.io/en/latest/system.html)

sudo pyperf system tune
sudo pyperf system show
sudo pyperf system reset

tune.sh

# Set Process CPU Affinity (apt install util-linux)
# note: `perf::sys::thread::affinity` can be used instead
taskset -c 0 ./a.out

# Set Process Scheduling Priority (apt install coreutils)
# note: `perf::sys::thread::priority` can be used instead
nice -n -20 taskset -c 0 ./a.out # -20..19 (most..less favorable to the process)

# Disable CPU Frequency Scaling (apt install cpufrequtils)
sudo cpupower frequency-set --governor performance
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Disable Address Space Randomization
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

# Disable Processor Boosting
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost

# Disable Turbo Mode
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# Disable Hyperthreading/SMT
echo off | sudo tee /sys/devices/system/cpu/smt/control

# Restrict memory to a single socket
numactl -m 0 -N 0 ./a.out

# Enable Huge Pages
sudo numactl --cpunodebind=1 --membind=1 hugeadm \
  --obey-mempolicy --pool-pages-min=1G:64
sudo hugeadm --create-mounts
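The tune.sh notes mention that affinity can also be set from code. On Linux, one way to do this without any library is `sched_setaffinity`; the sketch below pins the calling thread to a single CPU. `pin_current_thread` is a hypothetical helper and is not how `perf::sys::thread::affinity` is necessarily implemented.

```cpp
#include <sched.h>  // sched_setaffinity, cpu_set_t (Linux, glibc)

// Hypothetical helper: pins the calling thread to one CPU.
// Returns false for an out-of-range CPU index or when the kernel
// rejects the request (e.g. the CPU is offline or not permitted).
inline bool pin_current_thread(int cpu) {
  if (cpu < 0 || cpu >= CPU_SETSIZE) {
    return false;
  }
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  return sched_setaffinity(0, sizeof(set), &set) == 0;  // 0 = calling thread
}
```

Pinning from inside the benchmark has the same intent as `taskset -c 0 ./a.out`, but takes effect only from the point of the call onward.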

boot / grub

# Enable Kernel Mode Task-Isolation (https://lwn.net/Articles/816298)
isolcpus=<cpu number>,...,<cpu number>
# cat /sys/devices/system/cpu/isolated

# Disable P-states and C-states
idle=poll intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=1
# cat /sys/devices/system/cpu/intel_pstate/status

# Disable NMI watchdog (cat /proc/sys/kernel/nmi_watchdog)
nmi_watchdog=0

UEFI

int efi_main(void*, uefi::efi_system_table* system_table) {
  ctrl::cpu::perf::init();

  const auto b1 = ctrl::cpu::perf::branch_instructions();
  const auto c1 = ctrl::cpu::perf::cache_misses();
  u64 checksum{};
  for (auto i = 0u ;i < (1u << 16u); ++i) {
    checksum ^= i;
  }
  const auto b2 = ctrl::cpu::perf::branch_instructions();
  const auto c2 = ctrl::cpu::perf::cache_misses();

  uefi::println(system_table->out, L"results");
  uefi::println(system_table->out, b2 - b1);
  uefi::println(system_table->out, c2 - c1);

  return uefi::run();
}

clang++-18 -std=c++20 -I. -target x86_64-pc-win32-coff -fno-stack-protector -fshort-wchar -mno-red-zone -c uefi.cpp -o uefi.o
lld-link-18 -filealign:16 -subsystem:efi_application -nodefaultlib -dll -entry:efi_main uefi.o -out:BOOTX64.EFI
mkdir -p efi/boot
cp BOOTX64.EFI /usr/share/ovmf/OVMF.fd efi/boot
qemu-system-x86_64 \
  -drive if=pflash,format=raw,file=efi/boot/OVMF.fd \
  -drive format=raw,file=fat:rw:. \
  -net none

Requires reboot
Gives more control, such as disabling cache, ...

Resources

Specs
Manuals

Intel - https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html

AMD - https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/40332.pdf

ARM - https://developer.arm.com/documentation/ddi0487/latest

Apple - https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms, https://github.com/mikeroyal/Apple-Silicon-Guide

Books

Optimizing Software in C++: An Optimization Guide for Windows, Linux and Mac platforms - https://www.agner.org/optimize/optimizing_cpp.pdf

Optimizing Subroutines in Assembly Language: An Optimization Guide for x86 platforms - https://www.agner.org/optimize/optimizing_assembly.pdf

The Microarchitecture of Intel, AMD and VIA CPUs: An Optimization Guide for Assembly programmers and compiler makers - https://www.agner.org/optimize/microarchitecture.pdf

Instruction Tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs - https://www.agner.org/optimize/instruction_tables.pdf

Calling Conventions for different C++ compilers and operating systems - https://www.agner.org/optimize/calling_conventions.pdf

What Every Programmer Should Know About Memory - https://www.akkadia.org/drepper/cpumemory.pdf

Performance Analysis and Tuning on Modern CPUs - https://github.com/dendibakh/perf-book/releases

Algorithms for Modern Hardware - https://en.algorithmica.org/hpc

The Art of Writing Efficient Programs - https://www.packtpub.com/product/the-art-of-writing-efficient-programs

Computer Architecture - https://dl.acm.org/doi/book/10.5555/1999263

Is Parallel Programming Hard, And, If So, What Can You Do About It? - https://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html

SIMD for C++ Developers - http://const.me/articles/simd/simd.pdf

Hackers Delight - https://doc.lagout.org/security/Hackers%20Delight.pdf

Data-Oriented Design - https://www.dataorienteddesign.com/dodbook

Top-Down Microarchitecture Analysis Method - https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-0/top-down-microarchitecture-analysis-method.html

A Top-Down method for performance analysis and counters architecture - https://www.researchgate.net/publication/269302126_A_Top-Down_method_for_performance_analysis_and_counters_architecture

Top-Down Metrics - https://github.com/intel/perfmon/blob/main/TMA_Metrics-full.xlsx

Measuring Workloads With TopLev - https://github.com/andikleen/pmu-tools/wiki/toplev-manual

The Art of Assembly Language - https://www.plantation-productions.com/Webster/www.artofasm.com/Linux/HTML/AoATOC.html

Bits Of Architecture - https://github.com/CoffeeBeforeArch/bits_of_architecture

Bit Twiddling Hacks - https://graphics.stanford.edu/~seander/bithacks.html

Memory Models - https://research.swtch.com/mm

nanoBench: A Low-Overhead Tool for Running Microbenchmarks on x86 Systems - https://arxiv.org/abs/1911.03282

The Linux Scheduler: a Decade of Wasted Cores - https://people.ece.ubc.ca/sasha/papers/eurosys16-final29.pdf

The Tail At Scale - https://www.barroso.org/publications/TheTailAtScale.pdf

Producing wrong data without doing anything obviously wrong! - https://dl.acm.org/doi/10.1145/1508284.1508275

Robust benchmarking in noisy environments - https://arxiv.org/abs/1608.04295

Can Seqlocks Get Along With Programming Language Memory Models - https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf

Cache-Oblivious Algorithms and Data Structures - https://erikdemaine.org/papers/BRICS2002/

Performance Engineering of Software Systems - https://ocw.mit.edu/courses/6-172-performance-engineering-of-software-systems-fall-2018
Cheatsheets
Operation Costs in CPU Clock Cycles - http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles

Intel Intrinsics Guide - https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html

x86 Intrinsics Cheatsheet - https://db.in.tum.de/~finis/x86-intrin-cheatsheet-v2.1.pdf

A machine-readable CPUID data repository - https://x86-cpuid.org

Instruction Matrix - https://github.com/google/highway/blob/master/g3doc/instruction_matrix.pdf

Microarchitecture Cheatsheet - https://docs.google.com/spreadsheets/d/18ln8SKIGRK5_6NymgdB9oLbTJCFwx0iFI-vUs6WFyuE

Performance Monitoring Events - https://perfmon-events.intel.com

Core to Core Latency - https://github.com/nviennot/core-to-core-latency
Clock Time Analysis - https://gitlab.com/chriscox/CppPerformanceBenchmarks/-/wikis/ClockTimeAnalysis
Speed of light ............................ ~1 foot/ns
L1 cache reference ......................... 0.5 ns
Branch mispredict ............................ 5 ns
L2 cache reference ........................... 7 ns
Mutex lock/unlock ........................... 25 ns
Main memory reference ...................... 100 ns
Send 2K bytes over 1 Gbps network ....... 20,000 ns  =  20 µs
SSD random read ........................ 150,000 ns  = 150 µs
Read 1 MB sequentially from memory ..... 250,000 ns  = 250 µs
Round trip within same datacenter ...... 500,000 ns  = 0.5 ms
Read 1 MB sequentially from SSD .....  1,000,000 ns  =   1 ms
Read 1 MB sequentially from disk .... 20,000,000 ns  =  20 ms
Send packet CA->UK->CA ....          150,000,000 ns  = 150 ms
# System topology (apt install hwloc)
lstopo # lstopo-no-graphics

# CPU info (apt install util-linux)
lscpu | grep -E '^CPU|^Model|^Core|^Socket|^Thread'

# Cache info
lscpu | grep cache
getconf -a | grep CACHE_LINESIZE

# Numa nodes
lscpu | grep -E ^NUMA

# Huge pages
cat /proc/meminfo | grep -i huge
Guides

The Linux Kernel Documentation - https://www.kernel.org/doc/html/latest/index.html

RHEL Performance Guide - https://myllynen.github.io/rhel-performance-guide/

Monitoring and Managing System Status and Performance - https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html/monitoring_and_managing_system_status_and_performance

Active Benchmarking - https://www.brendangregg.com/activebenchmarking.html

A CPU research kernel with minimal noise for cycle-by-cycle micro-architectural introspection - https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll.html

X3 Low Latency Quickstart - https://docs.amd.com/r/en-US/ug1586-onload-user/X3-Low-Latency-Quickstart

Modular platform for computer-system architecture research - https://www.gem5.org

Reverse debugging at scale - https://engineering.fb.com/2021/04/27/developer-tools/reverse-debugging

Asynchronous Programming Under Linux - https://unixism.net/loti/async_intro.html

Phoronix Test Suite - https://www.phoronix-test-suite.com

CPU benchmark - https://www.cpubenchmark.net

Modern Microprocessors A 90-Minute Guide! - https://www.lighterra.com/papers/modernmicroprocessors

7-Zip LZMA Benchmark - https://www.7-cpu.com

CPU Benchmarks - https://curiouscoding.nl/posts/cpu-benchmarks

JVM Anatomy Quarks - https://shipilev.net/jvm/anatomy-quarks

linux-perf Source Code - https://github.com/torvalds/linux/blob/master/tools/perf, https://elixir.bootlin.com/linux/v6.13-rc1/source/include/uapi/linux/perf_event.h

gcc optimization - https://wiki.gentoo.org/wiki/GCC_optimization

gcc assembler syntax - https://www.felixcloutier.com/documents/gcc-asm

llvm Scheduling Models - https://github.com/llvm/llvm-project/tree/main/llvm/lib/Target

llvm Optimization Passes - https://llvm.org/docs/Passes.html

llvm Vectorizers - https://llvm.org/docs/Vectorizers.html

Tutorials

Performance Ninja Class - https://github.com/dendibakh/perf-ninja

Hardware Effects - https://github.com/Kobzol/hardware-effects

Performance Tuning - https://github.com/NAThompson/performance_tuning_tutorial

Mastering C++ with Google Benchmark - https://ashvardanian.com/posts/google-benchmark

Learning to Write Less Slow C, C++, and Assembly Code - https://github.com/ashvardanian/less_slow.cpp

Feeds

News

Linux News - https://lwn.net

Chips and Cheese - https://chipsandcheese.com

WikiChip - https://wikichip.org

CPUID - https://www.cpuid.com/news.html

Real World Tech - https://www.realworldtech.com

Tom`s Hardware - https://www.tomshardware.com

Phoronix - https://www.phoronix.com

comp.lang.asm.x86 - https://groups.google.com/g/comp.lang.asm.x86

Blogs

Agner Fog`s Blog - https://www.agner.org

Denis Bakhvalov`s Blog - https://easyperf.net/blog

Daniel Lemire`s Blog - https://lemire.me/blog

Wojciech Mula`s Blog - http://0x80.pl/articles/index.html

Erik Rigtorp`s Blog - https://rigtorp.se

Johnny`s Software Blog - https://johnnysswlab.com

JabPerf`s Blog - https://jabperf.com/blog

Brendan Gregg`s Blog - https://brendangregg.com/blog

Geoff Langdale`s Blog - https://branchfree.org

Ragnar Groot Koerkamp`s Blog - https://curiouscoding.nl/posts/cpu-benchmarks

Travis Downs`s Blog - https://travisdowns.github.io

Tristan`s Blog - https://thume.ca/archive.html

Stefanos Baziotis`s Blog - https://sbaziotis.com/#blog

Gamozo Labs Blog - https://gamozolabs.github.io

Mechanical Sympathy Blog - https://mechanical-sympathy.blogspot.com

Performance Engineering Blog - https://pramodkumbhar.com

Dmitry Vyukov Blog - https://www.1024cores.net

John Farrier`s Blog - https://johnfarrier.com

Performance Tricks Blog - https://www.performetriks.com/blog

Coding Confessions Blog - https://blog.codingconfessions.com

The Netflix Tech Blog - https://netflixtechblog.com

Cloudflare Blog - https://blog.cloudflare.com

Lists

C++ Links - https://github.com/MattPD/cpplinks

Awesome Performance C++ - https://github.com/fenbf/AwesomePerfCpp

Awesome Lock Free - https://github.com/rigtorp/awesome-lockfree

Awesome SIMD - https://github.com/awesome-simd/awesome-simd

Computer, Enhance! - https://www.computerenhance.com

Low Latency Trading Insights - https://lucisqr.substack.com

Miscellaneous

Conferences - https://www.p99conf.io, https://supercomputing.org, https://hotchips.org, https://microarch.org

Podcasts - https://signals-threads.simplecast.com, https://microarch.club, https://tlbh.it, https://twoscomplement.org

C++ Low Latency Group (SG14) - https://github.com/WG21-SG14/SG14

Videos

Channels

Computer Architecture - Onur Mutlu - https://www.youtube.com/@OnurMutluLectures

Computer, Enhance - Casey Muratori - https://www.youtube.com/@MollyRocket

Assembly / Creel - https://www.youtube.com/c/WhatsACreel

EasyPerf / Spaces - https://www.youtube.com/@easyperf3992

SIMD algorithms - Denis Yaroshevskiy - https://www.youtube.com/playlist?list=PLYCMvilhmuPEM8DUvY6Wg_jaSFHpmlSBD

Benchmarking

Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My! - Chandler Carruth - https://www.youtube.com/watch?v=nXaxk27zwlk

Counting Nanoseconds Microbenchmarking C++ Code - David Gross - https://www.youtube.com/watch?v=Czr5dBfs72U

Benchmarking C++ Code - Bryce Adelstein-Lelbach - https://www.youtube.com/watch?v=zWxSZcpeS8Q

Benchmarking C++, From video games to algorithmic trading - Alexander Radchenko - https://www.youtube.com/watch?v=7YVMC5v4qCA

Going Nowhere Faster - Chandler Carruth - https://www.youtube.com/watch?v=2EWejmkKlxs

Measurement and Timing - Performance Engineering of Software Systems - https://www.youtube.com/watch?v=LvX3g45ynu8

How NOT to Measure Latency - Gil Tene - https://www.youtube.com/watch?v=lJ8ydIuPFeU

Analyzing

Understanding the Performance of code using LLVM-MCA - A. Biagio & M. Davis - https://www.youtube.com/watch?v=Ku2D8bjEGXk

LLVM Optimization Remarks - Ofek Shilon - https://www.youtube.com/watch?v=qmEsx4MbKoc

Profiling

From Top-down Microarchitecture Analysis to Structured Performance Optimizations - https://cassyni.com/events/YKbqoE4axHCgvQ9vuQq7Cy

Coz: finding code that counts with causal profiling - ACM - https://www.youtube.com/watch?v=jE0V-p1odPg

Take Advantage for Intel Instrumentation and Tracing Technology for Performance Analysis - https://www.youtube.com/watch?v=1zdVFLajewM&list=PLg-UKERBljNw3_6Q598CS3DE7KqDXjP-d

LIKWID Performance Tools - https://www.youtube.com/playlist?list=PLxVedhmuwLq2CqJpAABDMbZG8Whi7pKsk

Introduction to the Tracy Profiler - Bartosz Taudul - https://youtu.be/fB5B46lbapc

Performance Matters - Emery Berger - https://www.youtube.com/watch?v=r-TLSBdHe1A

Optimizing

Understanding Compiler Optimization - Chandler Carruth - https://www.youtube.com/watch?v=haQ2cijhvhE

Efficiency with Algorithms, Performance with Data Structures - Chandler Carruth - https://www.youtube.com/watch?v=fHNmRkzxHWs

Design for Performance - Fedor Pikus - https://www.youtube.com/watch?v=m25p3EtBua4

Unlocking Modern CPU Power - Next-Gen C++ Optimization Techniques - https://www.youtube.com/watch?v=wGSSUSeaLgA

Branchless Programming in C++ - Fedor Pikus - https://www.youtube.com/watch?v=g-WPhYREFjk

CPU design effects - Jakub Beranek - https://www.youtube.com/watch?v=ICKIMHCw--Y

Fastware - Andrei Alexandrescu - https://www.youtube.com/watch?v=o4-CwDo2zpg

Performance Tuning - Matt Godbolt - https://www.youtube.com/watch?v=fV6qYho-XVs

Memory & Caches - Matt Godbolt - https://www.youtube.com/watch?v=4_smHyqgDTU

What Every Programmer Should Know about How CPUs Work - Matt Godbolt - https://www.youtube.com/watch?v=-HNpim5x-IE

There Are No Zero-cost Abstractions - Chandler Carruth - https://www.youtube.com/watch?v=rHIkrotSwcc

Understanding Optimizers: Helping the Compiler Help You - Nir Friedman - https://www.youtube.com/watch?v=8nyq8SNUTSc

C++ Algorithmic Complexity, Data Locality, Parallelism, Compiler Optimizations, & Some Concurrency - Avi Lachmish - https://www.youtube.com/watch?v=0iXRRCnurvo

Software Optimizations Become Simple with Top-Down Analysis on Intel Skylake - Ahmad Yasin - https://www.youtube.com/watch?v=kjufVhyuV_A

Being Friendly to Your Computer Hardware in Software Development - Ignas Bagdonas - https://www.youtube.com/watch?v=eceFgsiPPmk

Want fast C++? Know your hardware - Timur Doumler - https://www.youtube.com/watch?v=BP6NxVxDQIs

What is Low Latency C++ - Timur Doumler - https://www.youtube.com/watch?v=EzmNeAhWqVs, https://www.youtube.com/watch?v=5uIsadq-nyk

Where Have All the Cycles Gone? - Sean Parent - https://www.youtube.com/watch?v=B-aDBB34o6Y

Understanding CPU Microarchitecture to Increase Performance - https://www.youtube.com/watch?v=rglmJ6Xyj1c

Performance Analysis & Tuning on Modern CPU - Denis Bakhvalov - https://www.youtube.com/watch?v=Ho3bCIJcMcc

Comparison of C++ Performance Optimization Techniques for C++ Programmers - Eduardo Madrid - https://www.youtube.com/watch?v=4DQqcRwFXOI

Simple Code, High Performance - Molly Rocket - https://www.youtube.com/watch?v=Ge3aKEmZcqY

Assembly, System Calls, and Hardware in C++ - David Sankel - https://www.youtube.com/watch?v=7xwjjolDnwg

Optimizing Binary Search - Sergey Slotin - https://www.youtube.com/watch?v=1RIPMQQRBWk

A Deep Dive Into Dispatching Techniques in C++ - Jonathan Muller - https://www.youtube.com/watch?v=vUwsfmVkKtY

C++ Memory Model: from C++11 to C++23 - Alex Dathskovsky - https://www.youtube.com/watch?v=SVEYNEWZLo4

Abusing Your Memory Model for Fun and Profit - Samy Al Bahra, Paul Khuong - https://www.youtube.com/watch?v=N07tM7xWF1U&t=1s

The speed of concurrency (is lock-free faster?) - Fedor Pikus - https://www.youtube.com/watch?v=9hJkWwHDDxs

Read, Copy, Update, then what? RCU for non-kernel programmers - Fedor Pikus - https://www.youtube.com/watch?v=rxQ5K9lo034

Single Producer Single Consumer Lock-free FIFO From the Ground Up - Charles Frasch - https://www.youtube.com/watch?v=K3P_Lmq6pw0

Introduction to Hardware Efficiency in Cpp - Ivica Bogosavljevic - https://www.youtube.com/watch?v=Fs_T070H9C8

The Performance Price of Dynamic Memory in C++ - Ivica Bogosavljevic - https://www.youtube.com/watch?v=LC4jOs6z-ZI

The Hidden Performance Price of C++ Virtual Functions - Ivica Bogosavljevic - https://www.youtube.com/watch?v=n6PvvE_tEPk

Why do Programs Get Slower with Time? - Ivica Bogosavljevic - https://www.youtube.com/watch?v=nS5vjnPKX0I

CPU Cache Effects - Sergey Slotin - https://www.youtube.com/watch?v=mQWuX_KgH00

Cpu Caches and Why You Care - Scott Meyers - https://www.youtube.com/watch?v=WDIkqP4JbkE

CPU vs FPGA - https://www.youtube.com/watch?v=BML1YHZpx2o

Designing for Efficient Cache Usage - Scott McMillan - https://www.youtube.com/watch?v=3-ityWN-FdE

Cache consistency and the C++ memory model - Yossi Moalem - https://www.youtube.com/watch?v=Sa08x_NMZIg

std::simd: How to Express Inherent Parallelism Efficiently Via Data-parallel Types - Matthias Kretz - https://www.youtube.com/watch?v=LAJ_hywLtMA

The Art of SIMD Programming - Sergey Slotin - https://www.youtube.com/watch?v=vIRjSdTCIEU

Advanced SIMD Algorithms in Pictures - Denis Yaroshevskiy - https://www.youtube.com/watch?v=vGcH40rkLdA

Performance Optimization, SIMD and Cache - Sergiy Migdalskiy - https://www.youtube.com/watch?v=Nsf2_Au6KxU

Data-Oriented Design and C++ - Mike Acton - https://www.youtube.com/watch?v=rX0ItVEVjHc

Practical Data Oriented Design (DoD) - Andrew Kelley - https://www.youtube.com/watch?v=IroPQ150F6c

Data Orientation For The Win - Eduardo Madrid - https://www.youtube.com/watch?v=QbffGSgsCcQ

You Can Do Better than std::unordered_map - Malte Skarupke - https://www.youtube.com/watch?v=M2fKMP47slQ

Faster than Rust and C++: the PERFECT hash table - https://www.youtube.com/watch?v=DMQ_HcNSOAI

Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step - Matt Kulukundis - https://www.youtube.com/watch?v=ncHmEUmJZf4

C++ Run-Time Optimizations for Compile-Time Reflection - Kris Jusiak - https://www.youtube.com/watch?v=kCATOctR0BA

High Frequency Trading

When Nanoseconds Matter: Ultrafast Trading Systems in C++ - David Gross - https://www.youtube.com/watch?v=sX2nF1fW7kI

When a Microsecond Is an Eternity: High Performance Trading Systems in C++ - Carl Cook - https://www.youtube.com/watch?v=NH1Tta7purM

The Speed Game: Automated Trading Systems in C++ - Carl Cook - https://www.youtube.com/watch?v=ulOLGX3HNCI

Low-Latency Trading Systems in C++ - Jason McGuiness - https://www.youtube.com/watch?v=FnMfhWiSweo

High Frequency Trading and Ultra Low Latency development techniques - Nimrod Sapir - https://www.youtube.com/watch?v=_0aU8S-hFQI

Trading at light speed: designing low latency systems in C++ - David Gross - https://www.youtube.com/watch?v=8uAW5FQtcvE&list=PLSkBiuVO9yj1MvDkYJ5WOnPeKsoRi3eiW&index=2

Optimizing Trading Strategies for FPGAs in C/C++ - https://www.youtube.com/watch?v=4Wklh0XS5i0

C++ Electronic Trading for Cpp Programmers - Mathias Gaunard - https://www.youtube.com/watch?v=ltT2fDqBCEo

Achieving performance in financial data processing through compile time introspection - Eduardo Madrid - https://www.youtube.com/watch?v=z6fo90R8q5U

How to Simulate a Low Latency Exchange in C++ - Benjamin Catterall - https://www.youtube.com/watch?v=QQrTE4YLkSE

Building Low Latency Trading Systems - https://www.youtube.com/watch?v=yBNpSqOOoRk

Cache Warming: Warm Up The Code - Jonathan Keinan - https://www.youtube.com/watch?v=XzRxikGgaHI

How Linux Took Over the World of Finance - Christoph H Lameter - https://www.youtube.com/watch?v=UUOM4KdaHkY

Tools

Online

Compiler Explorer - https://compiler-explorer.com

Latency, Throughput, and Port Usage Information - https://uops.info / https://uica.uops.info

Latency, Memory Latency and CPUID dumps - http://instlatx64.atw.hu

Instruction Reference - https://www.felixcloutier.com/x86

x86 Processor Information - https://sandpile.org

Memory Latency Data - https://chipsandcheese.com/memory-latency-data

Instruction Discovery And Analysis on x86-64 - https://explore.liblisa.nl

Quick C++ Benchmark - https://quick-bench.com

Benchmarking

google-benchmark - https://github.com/google/benchmark

nanobench - https://github.com/martinus/nanobench

celero - https://github.com/DigitalInBlue/Celero

nanoBench - https://github.com/andreas-abel/nanoBench

uarch-bench - https://github.com/travisdowns/uarch-bench

llvm-exegesis - https://llvm.org/docs/CommandGuide/llvm-exegesis.html

Profiling

linux-perf - https://perf.wiki.kernel.org

intel-vtune - https://www.intel.com/content/www/us/en/docs/vtune-profiler

intel-advisor - https://www.intel.com/content/www/us/en/developer/tools/oneapi/advisor.html

intel-sde - https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html

intel-pin - https://www.intel.com/content/www/us/en/developer/articles/tool/pin-a-dynamic-binary-instrumentation-tool.html

amd-uprof - https://www.amd.com/en/developer/uprof.html

callgrind - https://valgrind.org/docs/manual/cl-manual.html

pmu-tools - https://github.com/andikleen/pmu-tools

perf-tools - https://github.com/brendangregg/perf-tools

yperf - https://github.com/aayasin/perf-tools

ebpf - https://ebpf.io

dtrace - https://www.oracle.com/linux/downloads/linux-dtrace.html

ftrace - https://www.kernel.org/doc/html/latest/trace/ftrace.html

utrace - https://github.com/Gui774ume/utrace

strace - https://strace.io

magictrace - https://github.com/janestreet/magic-trace

omnitrace - https://github.com/ROCm/omnitrace

tracy - https://github.com/wolfpld/tracy

optick - https://github.com/bombomby/optick

rad_telemetry - https://www.radgametools.com/telemetry.html

wachy - https://rubrikinc.github.io/wachy

easy_profiler - https://github.com/yse/easy_profiler

gprof - https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_mono/gprof.html

gperftools - https://github.com/gperftools/gperftools

oprofile - https://oprofile.sourceforge.io

optview2 - https://github.com/OfekShilon/optview2

llvm-xray - https://llvm.org/docs/XRay.html

likwid - https://github.com/RRZE-HPC/likwid

lttng - https://lttng.org

sysprof - https://www.sysprof.com

coz - https://github.com/plasma-umass/coz

bcc - https://github.com/iovisor/bcc

Analysis

llvm-mca - https://llvm.org/docs/CommandGuide/llvm-mca.html

osaca - https://github.com/RRZE-HPC/OSACA

uica - https://uica.uops.info

Profile-Guided Optimization (PGO)

clang-pgo - https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization

gcc-pgo - https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

llvm-bolt - https://github.com/llvm/llvm-project/blob/main/bolt/README.md

llvm-propeller - https://github.com/google/llvm-propeller

autofdo - https://github.com/google/autofdo

Utilities

pyperf - https://github.com/psf/pyperf

kcachegrind - https://kcachegrind.sourceforge.net/html/Home.html

hyperfine - https://github.com/sharkdp/hyperfine

hotspot - https://github.com/KDAB/hotspot

numatop - https://github.com/intel/numatop

bpftop - https://github.com/Netflix/bpftop

pahole - https://github.com/acmel/dwarves

perfetto - https://perfetto.dev

speedscope - https://github.com/jlfwong/speedscope

retsnoop - https://github.com/anakryiko/retsnoop

core-to-core-latency - https://github.com/nviennot/core-to-core-latency

llvm-opt-report - https://llvm.org/docs/CommandGuide/llvm-opt-report.html

jupyter notebook - https://jupyter.org

Libraries

perf_event_open - https://man7.org/linux/man-pages/man2/perf_event_open.2.html

perfmon2 - https://perfmon2.sourceforge.net

papi - https://github.com/icl-utk-edu/papi

libpfc - https://github.com/obilaniu/libpfc

intel_pt - https://github.com/intel/libipt

intel_pcm - https://github.com/intel/pcm

License

MIT / Apache2:LLVM*
