content: Information on kexec and bpftrace
Overviews on how to use kexec and bpftrace for kernel development. These
are very lightweight posts with room for more detail.

Signed-off-by: Rahul Rameshbabu <[email protected]>
Binary-Eater committed Dec 11, 2023
1 parent d1c09a0 commit 41ab1cb
Showing 2 changed files with 286 additions and 0 deletions.
170 changes: 170 additions & 0 deletions content/posts/bpftrace.org
@@ -0,0 +1,170 @@
---
title: "bpftrace"
date: 2023-12-10T21:32:59-08:00
draft: false
---

* What is bpftrace?

~bpftrace~ is both a CLI tool and a tracing language that compiles down to Linux
extended Berkeley Packet Filter (eBPF) instructions. The BPF VM subsystem in the
Linux kernel is immensely powerful. Covering it would require a separate
article. For now, think of eBPF as a mechanism that lets userspace load
sandboxed, verified programs into privileged kernel contexts. ~bpftrace~ makes
use of eBPF to inject dynamic tracing instruments. The ~bpftrace~ language can
work with traditional static tracing probes as well.

* Different types of probes available

The different types of probes are demonstrated in the bpftrace [[https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#probes][reference guide]]
in the source tree. As I learn more about ~bpftrace~ and find details about a
probe type that are worth expanding on, I will add them to this post. For now,
I will cover which kernel build configuration options are required to support
these probes.
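
Before the configuration details, here is a rough sketch of what attaching to
each probe type can look like. The ~probe_demo~ wrapper and the ~BPFTRACE~
override variable are hypothetical names used only for illustration. The probe
targets (~do_nanosleep~, ~vfs_read~, ~/bin/bash:readline~, and
~syscalls:sys_enter_openat~) are common examples that are assumed to exist on
your system; the ~/bin/bash~ path in particular may differ between
distributions, and newer ~bpftrace~ releases may prefer ~args.filename~ over
~args->filename~.

#+BEGIN_SRC sh
# probe_demo TYPE: run a small bpftrace one-liner for the given probe type.
# BPFTRACE may point at a locally built binary (hypothetical override);
# it defaults to "bpftrace" from PATH. Attaching probes normally needs root.
probe_demo() {
    case $1 in
        kprobe)
            # Fires on entry to the kernel function do_nanosleep().
            prog='kprobe:do_nanosleep { printf("PID %d sleeping\n", pid); }' ;;
        kretprobe)
            # Fires when vfs_read() returns; retval holds its return value.
            prog='kretprobe:vfs_read { @bytes = hist(retval); }' ;;
        kfunc)
            # BTF-based function entry probe; counts calls per process name.
            prog='kfunc:vfs_read { @[comm] = count(); }' ;;
        uretprobe)
            # Fires when readline() in the bash binary returns a line.
            prog='uretprobe:/bin/bash:readline { printf("%s\n", str(retval)); }' ;;
        tracepoint)
            # Static tracepoint on entry to the openat syscall.
            prog='tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }' ;;
        *)
            echo "unknown probe type: $1" >&2
            return 1 ;;
    esac
    "${BPFTRACE:-bpftrace}" -e "$prog"
}
#+END_SRC

Running ~probe_demo tracepoint~ as root should print the name of each process
calling ~openat~ together with the file it opens, until interrupted with
Ctrl-C.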

** kprobe / kretprobe

Minimal configuration required.

#+BEGIN_SRC
CONFIG_BPF_EVENTS=y
CONFIG_KPROBES=y
#+END_SRC

Additional helpful configuration options.

#+BEGIN_SRC
CONFIG_BPF_KPROBE_OVERRIDE=y # Allows BPF programs to override the execution of a probed function and set a different return value (error injection)
CONFIG_KPROBES_ON_FTRACE=y # Lets kprobes at function entry points use the ftrace call sites already compiled into the kernel
CONFIG_KPROBE_EVENTS=y # Support dynamically inserting tracing events using kprobes
#+END_SRC

** kfunc / kretfunc

Minimal configuration required.

#+BEGIN_SRC
CONFIG_DEBUG_INFO_BTF=y
#+END_SRC

Additional helpful configuration options.

#+BEGIN_SRC
CONFIG_MODULE_ALLOW_BTF_MISMATCH=y # Allows loading modules whose BTF information does not match the running kernel
#+END_SRC

** uprobe / uretprobe

Minimal configuration required.

#+BEGIN_SRC
CONFIG_UPROBES=y
#+END_SRC

Additional helpful configuration options.

#+BEGIN_SRC
CONFIG_UPROBE_EVENTS=y # Support dynamically setting uprobes/uretprobes using memory offsets of userspace programs
# Link: https://docs.kernel.org/trace/uprobetracer.html
#+END_SRC

** tracepoint

Minimal configuration required.

#+BEGIN_SRC
CONFIG_TRACEPOINTS=y
#+END_SRC

Additional helpful configuration options.

#+BEGIN_SRC
CONFIG_TRACEPOINT_BENCHMARK=y # For benchmarking the tracepoint feature in the kernel using a kernel tracepoint
#+END_SRC

Here is [[https://docs.kernel.org/trace/tracepoints.html][documentation for how to implement tracepoints in kernel code]]. Even
without ~bpftrace~, there are mechanisms such as [[https://docs.kernel.org/trace/events.html][event tracing]] that can be used
to handle tracepoint activity.

* Knowing what's supported on your kernel

Running ~bpftrace --info~ provides information on what is and is not supported.

#+BEGIN_EXAMPLE
bpftrace --info
System
OS: Linux 6.1.31 #1-NixOS SMP PREEMPT_DYNAMIC Tue May 30 13:03:33 UTC 2023
Arch: x86_64

Build
version: v0.18.0
LLVM: 14.0.6
unsafe probe: no
bfd: yes
libdw (DWARF support): yes

Kernel helpers
probe_read: yes
probe_read_str: yes
probe_read_user: yes
probe_read_user_str: yes
probe_read_kernel: yes
probe_read_kernel_str: yes
get_current_cgroup_id: yes
send_signal: yes
override_return: no
get_boot_ns: yes
dpath: yes
skboutput: yes

Kernel features
Instruction limit: 1000000
Loop support: yes
btf: yes
module btf: yes
map batch: yes
uprobe refcount (depends on Build:bcc bpf_attach_uprobe refcount): yes

Map types
hash: yes
percpu hash: yes
array: yes
percpu array: yes
stack_trace: yes
perf_event_array: yes

Probe types
kprobe: yes
tracepoint: yes
perf_event: yes
kfunc: yes
iter:task: yes
iter:task_file: yes
iter:task_vma: yes
kprobe_multi: no
raw_tp_special: yes
#+END_EXAMPLE

Many of the kernel-dependent features require certain configuration options to
be selected. The output shown is for the default build configuration that NixOS
uses for the ~linuxPackages_latest~ kernel, which is convenient for
demonstrations. However, I need to compile the kernel myself for development
purposes, which is why the sections above list the configuration options needed
for each type of probe.
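
One way to compare a kernel against the options listed above is to search its
build configuration. This is a minimal sketch: ~check_kconfig~ is a
hypothetical helper, and the location of the configuration file is an
assumption that varies by distribution (~/boot/config-$(uname -r)~ on some,
~/proc/config.gz~ when ~CONFIG_IKCONFIG_PROC~ is enabled).

#+BEGIN_SRC sh
# check_kconfig CONFIG_FILE OPTION...
# Prints "<OPTION>=<value>" for each option that is set in CONFIG_FILE,
# and "<OPTION> is not set" for each option that is missing or disabled.
check_kconfig() {
    config_file=$1
    shift
    for opt in "$@"; do
        line=$(grep "^${opt}=" "$config_file")
        if [ -n "$line" ]; then
            printf '%s\n' "$line"
        else
            printf '%s is not set\n' "$opt"
        fi
    done
}

# Example (the path is an assumption; adjust it for your system):
#   check_kconfig "/boot/config-$(uname -r)" CONFIG_BPF_EVENTS CONFIG_KPROBES
#   zcat /proc/config.gz > /tmp/kconfig && check_kconfig /tmp/kconfig CONFIG_DEBUG_INFO_BTF
#+END_SRC

The same helper works on the ~.config~ file of a kernel source tree before
building it.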

* Useful resources for learning more about bpftrace

Honestly, I am pretty new to both ~bpftrace~ and eBPF myself. I plan on updating
this page as I continue to learn more. One of my goals is learning how to use
the [[https://github.com/brendangregg/FlameGraph/blob/master/stackcollapse-bpftrace.pl][stackcollapse-bpftrace.pl]] script for generating flamegraphs. Right now, I
use ~perf~ for generating flamegraphs. I am also collecting useful ~bpftrace~
snippets that I write along my journey as a kernel developer and systems
enthusiast. These snippets can be found in my GitHub repository,
[[https://github.com/Binary-Eater/bpftrace-scripts][Binary-Eater/bpftrace-scripts]].

In general, the [[https://github.com/iovisor/bpftrace][iovisor/bpftrace]] GitHub repository has a nice [[https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md][reference]] and
[[https://github.com/iovisor/bpftrace/blob/master/docs/tutorial_one_liners.md][one-liner tutorial]] for new users to follow along. The [[https://github.com/iovisor/bpftrace/blob/master/man/adoc/bpftrace.adoc][manpage]] is an even more
thorough resource.

The ~tools/~ directory of the ~bpftrace~ repository also serves as a great
reference.

[[https://www.brendangregg.com/index.html][Brendan Gregg's blog]] has a number of additional examples as well.
116 changes: 116 additions & 0 deletions content/posts/kexec.org
@@ -0,0 +1,116 @@
---
title: "kexec"
date: 2023-12-10T21:32:56-08:00
draft: false
---

* What is kexec?

~kexec~ is short for kernel execute. At a high level, it is analogous to the
~exec~ syscall. ~kexec~ replaces the currently running kernel image in memory
with a newly loaded one and begins executing it. tl;dr it enables users to run
new kernels without needing to do a full power cycle of a system.

* What is the value of kexec?

+ Enables a "boot once" flow for testing a potentially problematic kernel.
+ No bootloader configuration required.
+ Does not require a complete system reboot (BIOS boot) to swap between kernel
images for testing.
+ Lowers the difficulty of setting up ad-hoc kernel development environments
using the latest kernel source tree.

* What is required to use kexec?

** Kconfig

The kernel that will have its image rewritten (the currently booted kernel) will
need the following kernel configuration options.

#+BEGIN_SRC
CONFIG_KEXEC=y # Enables support for general kexec syscall functionality.
CONFIG_KEXEC_FILE=y # Enables kexec_file_load syscall. See the manual for kexec_load(2) for more details.

# Optional security options
CONFIG_KEXEC_BZIMAGE_VERIFY_SIG=n # Verify the signature of the bzImage used in kexec
CONFIG_KEXEC_SIG=n # If the kernel image has a signature, make sure the signature is valid when using the kexec_file_load syscall
CONFIG_KEXEC_SIG_FORCE=n # Enforce that the kernel image used in the kexec_file_load syscall has a valid signature
#+END_SRC

** Userspace

You will need ~kexec-tools~ in userspace to be able to configure a kernel image
for ~kexec~.

* Using kexec

Here is an example of setting up a new kernel image to execute in place of the
currently running image while reusing the kernel commandline of the previous
image.

#+BEGIN_SRC sh
kexec -l /path/to/vmlinuz --initrd=/path/to/initramfs.img --reuse-cmdline
kexec -e
#+END_SRC

The ~--initrd~ flag is optional. It is used when an initramfs image is needed to
properly bring up a system. ~kexec -e~ then executes the loaded kernel
immediately, without gracefully shutting down any running system services.

~systemd~ supports a more graceful ~kexec~ flow that stops services cleanly and
can even run ad-hoc clean-up services, declared with ~WantedBy=kexec.target~ in
the ~Install~ section of a systemd service definition. In this flow,
~systemctl kexec~ replaces ~kexec -e~.
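
Put together, the graceful flow might look like the following sketch. The
~kexec_boot~ function and the ~DRY_RUN~ variable are hypothetical names used
for illustration; the ~kexec~ and ~systemctl~ invocations are the ones
described above. Since a mistake here reboots the machine, the sketch can print
the commands instead of running them.

#+BEGIN_SRC sh
# kexec_boot KERNEL [INITRD]
# Stage KERNEL (and optionally INITRD) with the current kernel command line,
# then ask systemd to stop services and execute the staged kernel.
# Setting DRY_RUN to any non-empty value prints the commands instead.
kexec_boot() {
    kernel=$1
    initrd=${2:-}
    # Expands to "echo" when DRY_RUN is non-empty, and to nothing otherwise.
    run=${DRY_RUN:+echo}
    if [ -n "$initrd" ]; then
        $run kexec -l "$kernel" --initrd="$initrd" --reuse-cmdline || return 1
    else
        $run kexec -l "$kernel" --reuse-cmdline || return 1
    fi
    $run systemctl kexec
}
#+END_SRC

For example, ~DRY_RUN=1 kexec_boot /path/to/vmlinuz /path/to/initramfs.img~
shows what would be executed without touching the running system.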

If you want to unload the target kernel that was previously loaded by ~kexec~,
~kexec -u~ will unload the currently loaded ~kexec~ target kernel.
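
Before executing or unloading, it can help to check whether a target kernel is
staged at all. The kernel exposes this through ~/sys/kernel/kexec_loaded~,
which reads ~1~ when an image is loaded and ~0~ otherwise. The ~kexec_status~
helper below is a hypothetical wrapper around that file, shown as a sketch.

#+BEGIN_SRC sh
# kexec_status [FILE]
# Reports whether a kexec target kernel is currently loaded.
# FILE defaults to /sys/kernel/kexec_loaded; the argument exists so the
# helper can be pointed at another file when experimenting.
kexec_status() {
    state_file=${1:-/sys/kernel/kexec_loaded}
    if [ ! -r "$state_file" ]; then
        echo "cannot read $state_file (is kexec support enabled?)" >&2
        return 2
    fi
    if [ "$(cat "$state_file")" = "1" ]; then
        echo "a kexec target kernel is loaded"
    else
        echo "no kexec target kernel is loaded"
    fi
}
#+END_SRC

After a successful ~kexec -l~ the helper should report a loaded kernel, and
after ~kexec -u~ it should report none.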

There are a number of options for the ~kexec~ commandline tool that are
documented in the manual for ~kexec(8)~.

Usage details for ~kexec~ are also well documented on the Arch Linux wiki and
Gentoo wiki.

SUSE also has a more thorough [[https://documentation.suse.com/de-de/sles/15-GA/html/SLES-all/cha-tuning-kexec.html][write-up]] on ~kexec~ and ~kdump~.

NOTE: It seems Gentoo uses a patched version of ~reboot~ to offer a graceful
~-k~ flag. Gentoo probably does this since it offers an alternative to
~systemd~, OpenRC.

* Why is kexec not enabled everywhere?

As hinted at earlier in this article, ~kexec~ is not an infallible process.
Unlike a userspace program whose instructions are overwritten by an ~exec~
syscall, the kernel image being overwritten is in charge of managing devices at
a very low level. Device teardown and initialization may not occur in a way
that leaves an "already-running" system in a stable state.

A related idea, applying kernel changes to a running system without rebooting
or compromising it, is called kernel livepatching. Kernel livepatching is a hot
area of research with multiple entities taking their own approaches on the
matter.

+ [[https://ubuntu.com/blog/an-overview-of-live-kernel-patching][Canonical's solution without kexec by using ftrace hooking]]
+ [[https://documentation.suse.com/sles/12-SP4/html/SLES-kgraft/index.html][kGraft by SUSE that similarly uses ftrace hooking]]
+ [[https://www.redhat.com/en/topics/linux/what-is-linux-kernel-live-patching#the-two-spaces-of-linux-system-operations][Red Hat's ftrace hooking kpatch livepatch solution]]
+ [[https://en.wikipedia.org/wiki/Ksplice#Design][Ksplice using its own injector and hooking mechanism]]

*NOTE:* Ksplice was initially developed by MIT students, which illustrates that
kernel livepatching is a topic worth academic exploration.

*DISCLAIMER:* I have not studied any of the above approaches in detail. I just
felt I should draw attention to them for curious readers interested in ways to
potentially make ~kexec~ more robust.

* Alternatives to kexec

For "boot once" testing, using bootloader configurations to do a "boot once" is
an option. However, many bootloaders have an involved process for achieving
this.

I have not found decent documentation on how to do this for GRUB 2 (not GRUB
Legacy). In reality, I might work with a number of systems using different
bootloaders such as ~systemd-boot~ and learning how to do this with every
bootloader implementation out there seems like an adventure.
