content: Information on kexec and bpftrace
Overviews on how to use kexec and bpftrace for kernel development. These are very lightweight posts with room for more detail. Signed-off-by: Rahul Rameshbabu <[email protected]>
1 parent d1c09a0, commit 41ab1cb. Showing 2 changed files with 286 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,170 @@ | ||
--- | ||
title: "bpftrace" | ||
date: 2023-12-10T21:32:59-08:00 | ||
draft: false | ||
--- | ||
|
||
* What is bpftrace? | ||
|
||
~bpftrace~ is both a CLI tool and a tracing language that compiles down to Linux
extended Berkeley Packet Filter (eBPF) instructions. The BPF VM subsystem in the
Linux kernel is immensely powerful. Covering it would require a separate
article. For now, think of eBPF as a mechanism that lets userspace load
sandboxed, verified logic into privileged kernel contexts.
~bpftrace~ makes use of eBPF to inject dynamic tracing instruments. The
~bpftrace~ language can work with traditional static tracing probes as well.
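
As a minimal sketch of what the language looks like in practice, the snippet
below writes a tiny ~bpftrace~ program to a file. The file name ~opens.bt~ and
the choice of the ~sys_enter_openat~ tracepoint are assumptions picked for
illustration, and the tracepoint must exist on your kernel.

#+BEGIN_SRC sh
# Write a small bpftrace program that counts openat() syscalls per process name.
# The output path is an arbitrary choice for this sketch.
prog="${TMPDIR:-/tmp}/opens.bt"
cat > "$prog" <<'EOF'
tracepoint:syscalls:sys_enter_openat
{
    @opens[comm] = count();
}
EOF
echo "Run with: sudo bpftrace $prog"
#+END_SRC

Running the program requires root privileges. Pressing Ctrl-C stops tracing,
and ~bpftrace~ prints the ~@opens~ map on exit.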
|
||
* Different types of probes available | ||
|
||
The different types of probes are demonstrated in the bpftrace [[https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#probes][reference guide]]
in the source tree. As I learn more about ~bpftrace~ and find probe types that
deserve more detail, I will add that information to this post. For now, I will
cover which kernel build configuration options are required to support these
probes.
|
||
** kprobe / kretprobe | ||
|
||
Minimal configuration required. | ||
|
||
#+BEGIN_SRC | ||
CONFIG_BPF_EVENTS=y | ||
CONFIG_KPROBES=y | ||
#+END_SRC | ||
|
||
Additional helpful configuration options. | ||
|
||
#+BEGIN_SRC | ||
CONFIG_BPF_KPROBE_OVERRIDE=y # Enables bpf_override_return, letting a BPF program override the return value of a probed function (error injection)
CONFIG_KPROBES_ON_FTRACE=y # Lets kprobes placed on function entry build on ftrace when the function tracer is enabled
CONFIG_KPROBE_EVENTS=y # Support dynamically inserting tracing events using kprobes
#+END_SRC | ||
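
To illustrate how a ~kprobe~ and ~kretprobe~ pair can work together once these
options are enabled, here is a sketch in the style of the upstream one-liner
tutorial. It assumes ~vfs_read~ is a probeable function on your kernel, and the
file name ~vfsread.bt~ is arbitrary.

#+BEGIN_SRC sh
# Write a bpftrace program that measures vfs_read() latency per process name.
prog="${TMPDIR:-/tmp}/vfsread.bt"
cat > "$prog" <<'EOF'
// Record the entry timestamp for the current thread.
kprobe:vfs_read
{
    @start[tid] = nsecs;
}

// On return, add the elapsed nanoseconds to a per-process histogram.
kretprobe:vfs_read
/@start[tid]/
{
    @ns[comm] = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}
EOF
echo "Run with: sudo bpftrace $prog"
#+END_SRC

The ~/@start[tid]/~ filter skips returns whose entry was not observed, for
example calls that were already in flight when tracing started.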
|
||
** kfunc / kretfunc | ||
|
||
Minimal configuration required. | ||
|
||
#+BEGIN_SRC | ||
CONFIG_DEBUG_INFO_BTF=y | ||
#+END_SRC | ||
|
||
Additional helpful configuration options. | ||
|
||
#+BEGIN_SRC | ||
CONFIG_MODULE_ALLOW_BTF_MISMATCH=y # Allows modules with mismatching BTF information against running kernel to be loaded | ||
#+END_SRC | ||
|
||
** uprobe / uretprobe | ||
|
||
Minimal configuration required.

#+BEGIN_SRC
CONFIG_UPROBES=y
#+END_SRC
|
||
Additional helpful configuration options. | ||
|
||
#+BEGIN_SRC | ||
CONFIG_UPROBE_EVENTS=y # Support dynamically setting uprobes/uretprobes using memory offsets of userspace programs | ||
# Link: https://docs.kernel.org/trace/uprobetracer.html | ||
#+END_SRC | ||
|
||
** tracepoint | ||
|
||
Minimal configuration required. | ||
|
||
#+BEGIN_SRC | ||
CONFIG_TRACEPOINTS=y | ||
#+END_SRC | ||
|
||
Additional helpful configuration options. | ||
|
||
#+BEGIN_SRC | ||
CONFIG_TRACEPOINT_BENCHMARK=y # For benchmarking the tracepoint feature in the kernel using a kernel tracepoint | ||
#+END_SRC | ||
|
||
Here is [[https://docs.kernel.org/trace/tracepoints.html][documentation for how to implement tracepoints in kernel code]]. Even | ||
without ~bpftrace~, there are mechanisms such as [[https://docs.kernel.org/trace/events.html][event tracing]] that can be used | ||
to handle tracepoint activity. | ||
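
Before attaching to a tracepoint, it helps to see which ones the running kernel
exposes. The sketch below filters the tracefs ~available_events~ file by
subsystem. The mount point ~/sys/kernel/tracing~ is an assumption (older setups
may use ~/sys/kernel/debug/tracing~), and the helper name ~list_events~ is made
up for this example.

#+BEGIN_SRC sh
# list_events SUBSYSTEM [EVENTS_FILE]
# Print the events of one subsystem, one "subsystem:event" entry per line.
list_events() {
    subsystem=$1
    events_file=${2:-/sys/kernel/tracing/available_events}
    grep "^${subsystem}:" "$events_file"
}
#+END_SRC

For example, ~list_events sched~ prints the scheduler tracepoints. Reading
tracefs usually requires root. ~bpftrace -l 'tracepoint:sched:*'~ gives a
similar listing through ~bpftrace~ itself.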
|
||
* Knowing what's supported on your kernel | ||
|
||
Running ~bpftrace --info~ provides information on what is and is not supported. | ||
|
||
#+BEGIN_EXAMPLE | ||
bpftrace --info | ||
System | ||
OS: Linux 6.1.31 #1-NixOS SMP PREEMPT_DYNAMIC Tue May 30 13:03:33 UTC 2023 | ||
Arch: x86_64 | ||
|
||
Build | ||
version: v0.18.0 | ||
LLVM: 14.0.6 | ||
unsafe probe: no | ||
bfd: yes | ||
libdw (DWARF support): yes | ||
|
||
Kernel helpers | ||
probe_read: yes | ||
probe_read_str: yes | ||
probe_read_user: yes | ||
probe_read_user_str: yes | ||
probe_read_kernel: yes | ||
probe_read_kernel_str: yes | ||
get_current_cgroup_id: yes | ||
send_signal: yes | ||
override_return: no | ||
get_boot_ns: yes | ||
dpath: yes | ||
skboutput: yes | ||
|
||
Kernel features | ||
Instruction limit: 1000000 | ||
Loop support: yes | ||
btf: yes | ||
module btf: yes | ||
map batch: yes | ||
uprobe refcount (depends on Build:bcc bpf_attach_uprobe refcount): yes | ||
|
||
Map types | ||
hash: yes | ||
percpu hash: yes | ||
array: yes | ||
percpu array: yes | ||
stack_trace: yes | ||
perf_event_array: yes | ||
|
||
Probe types | ||
kprobe: yes | ||
tracepoint: yes | ||
perf_event: yes | ||
kfunc: yes | ||
iter:task: yes | ||
iter:task_file: yes | ||
iter:task_vma: yes | ||
kprobe_multi: no | ||
raw_tp_special: yes | ||
#+END_EXAMPLE | ||
|
||
Many of the kernel-dependent features require certain configuration options to
be selected. The output shared above reflects the default build configuration
used in NixOS for the ~linuxPackages_latest~ kernel, which is convenient for me
for demonstrations. However, I need to compile the kernel myself for
development purposes, which is why this post lists the needed configuration
options for each type of probe.
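
When building a kernel yourself, a quick way to confirm the options listed in
this post is to scan the build configuration for them. The helper below is a
minimal sketch. The name ~check_kconfig~ is hypothetical, and it only
recognizes options set to ~y~ or ~m~.

#+BEGIN_SRC sh
# check_kconfig CONFIG_FILE OPTION...
# Print the state of each OPTION; return non-zero if any option is not enabled.
check_kconfig() {
    config_file=$1
    shift
    missing=0
    for opt in "$@"; do
        # Match lines such as CONFIG_KPROBES=y or CONFIG_KPROBES=m exactly.
        value=$(sed -n "s/^${opt}=\([ym]\)\$/\1/p" "$config_file" | head -n 1)
        if [ -n "$value" ]; then
            printf '%s=%s\n' "$opt" "$value"
        else
            printf '%s is not set\n' "$opt"
            missing=1
        fi
    done
    return "$missing"
}
#+END_SRC

A typical call from a kernel source tree would be
~check_kconfig .config CONFIG_BPF_EVENTS CONFIG_KPROBES CONFIG_UPROBES~. For the
running kernel, the configuration can be extracted with ~zcat /proc/config.gz~,
assuming the kernel was built with ~CONFIG_IKCONFIG_PROC~.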
|
||
* Useful resources for learning more about bpftrace | ||
|
||
Honestly, I am pretty new to both ~bpftrace~ and eBPF myself. I plan on updating
this page as I continue to learn more. One of my goals is learning how to use
the [[https://github.com/brendangregg/FlameGraph/blob/master/stackcollapse-bpftrace.pl][stackcollapse-bpftrace.pl]] script for generating flamegraphs. Right now, I
use ~perf~ for generating flamegraphs. I am also collecting useful ~bpftrace~
snippets that I build along my journey as a kernel developer and systems
enthusiast. These snippets can be found on my GitHub repository,
[[https://github.com/Binary-Eater/bpftrace-scripts][Binary-Eater/bpftrace-scripts]].
|
||
In general, the [[https://github.com/iovisor/bpftrace][iovisor/bpftrace]] GitHub repository has a nice [[https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md][reference]] and | ||
[[https://github.com/iovisor/bpftrace/blob/master/docs/tutorial_one_liners.md][one-liner tutorial]] for new users to follow along. The [[https://github.com/iovisor/bpftrace/blob/master/man/adoc/bpftrace.adoc][manpage]] is an even more | ||
thorough resource. | ||
|
||
The ~tools/~ directory of the ~bpftrace~ repository also serves as a great | ||
reference. | ||
|
||
[[https://www.brendangregg.com/index.html][Brendan Gregg's blog]] has a number of additional examples as well. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,116 @@ | ||
--- | ||
title: "kexec" | ||
date: 2023-12-10T21:32:56-08:00 | ||
draft: false | ||
--- | ||
|
||
* What is kexec? | ||
|
||
~kexec~ is short for kernel execute. At a high level, it is analogous to the
~exec~ syscall. ~kexec~ replaces the currently running kernel image in memory
with a newly loaded kernel image and begins executing it. In short, it enables
users to run new kernels without needing to do a full power cycle of a system.
|
||
* What is the value of kexec? | ||
|
||
+ Enables a "boot once" flow for testing a potentially problematic kernel. | ||
+ No bootloader configuration required. | ||
+ Does not require a complete system reboot (BIOS boot) to swap between kernel | ||
images for testing. | ||
+ Lowers the difficulty for setting up ad-hoc kernel development environments | ||
using the latest kernel source tree. | ||
|
||
* What is required to use kexec? | ||
|
||
** Kconfig | ||
|
||
The kernel that will have its image rewritten (the currently booted kernel) will | ||
need the following kernel configuration options. | ||
|
||
#+BEGIN_SRC | ||
CONFIG_KEXEC=y # Enables support for general kexec syscall functionality. | ||
CONFIG_KEXEC_FILE=y # Enables kexec_file_load syscall. See the manual for kexec_load(2) for more details. | ||
|
||
# Optional security options | ||
CONFIG_KEXEC_BZIMAGE_VERIFY_SIG=n # Verify the signature of the bzImage used in kexec
CONFIG_KEXEC_SIG=n # If the kernel image has a signature, make sure the signature is valid when using the kexec_file_load syscall | ||
CONFIG_KEXEC_SIG_FORCE=n # Enforce that the kernel image used in the kexec_file_load syscall has a valid signature | ||
#+END_SRC | ||
|
||
** Userspace | ||
|
||
You will need ~kexec-tools~ in userspace to be able to configure a kernel image | ||
for ~kexec~. | ||
|
||
* Using kexec | ||
|
||
Here is an example of setting up a new kernel image to execute in place of the | ||
currently running image while reusing the kernel commandline of the previous | ||
image. | ||
|
||
#+BEGIN_SRC sh | ||
kexec -l /path/to/vmlinuz --initrd=/path/to/initramfs.img --reuse-cmdline | ||
kexec -e | ||
#+END_SRC | ||
|
||
The ~--initrd~ flag is optional. It is used when an initramfs image is needed to
properly bring up a system. ~kexec -e~ then immediately executes the loaded
kernel without gracefully shutting down any system services that might be
impacted.
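
Because ~kexec -e~ is abrupt, it can be worth validating the inputs before
staging a kernel. The wrapper below is a sketch under that assumption. The
function name ~load_kernel~ and the ~DRY_RUN~ variable are hypothetical, and
only the ~kexec~ flags shown above are used.

#+BEGIN_SRC sh
# load_kernel KERNEL [INITRD]
# Stage KERNEL for kexec, reusing the running kernel's command line.
# With DRY_RUN=1 the kexec command is printed instead of executed.
load_kernel() {
    kernel=$1
    initrd=${2:-}
    if [ ! -r "$kernel" ]; then
        echo "kernel image not readable: $kernel" >&2
        return 1
    fi
    # Build the command in the function's positional parameters.
    set -- kexec -l "$kernel" --reuse-cmdline
    if [ -n "$initrd" ]; then
        if [ ! -r "$initrd" ]; then
            echo "initramfs not readable: $initrd" >&2
            return 1
        fi
        set -- "$@" "--initrd=$initrd"
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}
#+END_SRC

Calling ~DRY_RUN=1 load_kernel /path/to/vmlinuz /path/to/initramfs.img~ prints
the command that would be run. Without ~DRY_RUN~, the function stages the
kernel, which requires root, and leaves the actual switch to a later step.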
|
||
~systemd~ can assist with a more graceful ~kexec~ flow for taking down services
and even supporting ad-hoc clean-up services using ~WantedBy=kexec.target~ in
the ~Install~ section of a systemd service definition. In that flow, run
~systemctl kexec~ instead of ~kexec -e~.
|
||
If you want to discard a target kernel that was previously loaded with
~kexec -l~, ~kexec -u~ will unload the currently loaded ~kexec~ target kernel.
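
To find out whether a target kernel is currently staged, the kernel exposes a
flag in sysfs. The helper below is a small sketch that reads
~/sys/kernel/kexec_loaded~. The function name ~kexec_loaded~ and its optional
path argument are made up for this example.

#+BEGIN_SRC sh
# kexec_loaded [STATE_FILE]
# Succeed when a kexec target kernel is currently staged.
kexec_loaded() {
    state_file=${1:-/sys/kernel/kexec_loaded}
    [ "$(cat "$state_file" 2>/dev/null)" = 1 ]
}
#+END_SRC

This can guard an unload, for example ~if kexec_loaded; then kexec -u; fi~, run
as root.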
|
||
There are a number of options for the ~kexec~ commandline tool that are | ||
documented in the manual for ~kexec(8)~. | ||
|
||
Usage details for ~kexec~ are also well documented on the Arch Linux wiki and | ||
Gentoo wiki. | ||
|
||
SUSE also has a more thorough [[https://documentation.suse.com/de-de/sles/15-GA/html/SLES-all/cha-tuning-kexec.html][write-up]] on ~kexec~ and ~kdump~.
|
||
NOTE: It seems Gentoo uses a patched version of ~reboot~ to offer a graceful | ||
~-k~ flag. Gentoo probably does this since it offers an alternative to | ||
~systemd~, OpenRC. | ||
|
||
* Why is kexec not enabled everywhere? | ||
|
||
As lightly mentioned earlier in this article, ~kexec~ is not an infallible
process. Unlike a userspace program that has its application instructions
overwritten by an ~exec~ syscall, the kernel image being overwritten is in
charge of managing devices at a very low level. Device teardown and
initialization may not occur in a way that leaves an "already-running" system
in a stable state.
|
||
A related idea, applying changes to a running kernel without rebooting or
compromising the system, is called kernel livepatching. Kernel livepatching is
a hot area of research with multiple entities taking their own approaches on
the matter.
|
||
+ [[https://ubuntu.com/blog/an-overview-of-live-kernel-patching][Canonical's solution without kexec by using ftrace hooking]]
+ [[https://documentation.suse.com/sles/12-SP4/html/SLES-kgraft/index.html][kGraft by SUSE that similarly uses ftrace hooking]]
+ [[https://www.redhat.com/en/topics/linux/what-is-linux-kernel-live-patching#the-two-spaces-of-linux-system-operations][Red Hat's ftrace hooking kpatch livepatch solution]]
+ [[https://en.wikipedia.org/wiki/Ksplice#Design][Ksplice using its own injector and hooking mechanism]]
|
||
*NOTE:* Ksplice was initially developed by MIT students. I mention it as a
reference illustrating that kernel livepatching is a topic worth academic
exploration.
|
||
*DISCLAIMER:* I have not read any of the above approaches in detail. I just felt | ||
I should draw attention to them for curious readers interested in ways to | ||
potentially make ~kexec~ "more-robust". | ||
|
||
* Alternatives to kexec | ||
|
||
For "boot once" testing, using bootloader configurations to do a "boot once" is | ||
an option. However, many bootloaders have an involved process for achieving | ||
this. | ||
|
||
I have not found decent documentation on how to do this for GRUB 2 (not GRUB | ||
Legacy). In reality, I might work with a number of systems using different | ||
bootloaders such as ~systemd-boot~ and learning how to do this with every | ||
bootloader implementation out there seems like an adventure. |