Using the BPF ring buffer

Overview

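This article explains how to use the BPF ring buffer. The original article is here.

While learning how to use the ring buffer I went through a lot of documentation. After filtering it down, this article stood out as very good learning material: even though it was last updated in March 2021, its content is still reliable.

The original text follows.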


Update March 30th, 2021: This article is still relevant if you are looking for a practical example of how to use the BPF ring buffer. If you want a deep explanation of how it works, I suggest visiting the blog of Andrii, the main author of this feature, here. Enjoy the learning! :)

Introduction

The 5.8 release of the Linux kernel came out with lots of interesting elements. Yes, as always.

A couple of weeks ago, while still processing all the news in there, I came across a patch proposing a new BPF map type called BPF_MAP_TYPE_RINGBUF. With this new map type we finally have an MPSC (multi-producer, single-consumer) data structure optimized for data buffering and streaming.

Some exciting things about it:

  • Unlike BPF_MAP_TYPE_PERF_EVENT_ARRAY, this type of map is not tied to a single CPU when dealing with the output. This is very important for me and I’m already experimenting with it in the Falco BPF driver.
  • It’s very flexible in letting the user decide which memory allocation model to use, by reserving space beforehand or not.
  • It is observable: bpf_ringbuf_query answers various queries about the buffer's state. This could be useful to feed a Prometheus exporter that monitors the health of the buffer.
  • Producers do not block each other, even on different CPUs.
  • A spinlock is used internally to serialize reservations, which are also ordered, while commits are completely lock-free. This is very cool, because locking comes for free: there is no need to wrap calls in bpf_spin_lock or to manage a lock yourself.
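As a rough illustration of the monitoring idea, the Python sketch below turns the kind of numbers bpf_ringbuf_query reports into a fill ratio that an exporter could publish. The flag names and values mirror the kernel's UAPI enum. The fill_ratio function, the dict used as input and the sample numbers are hypothetical, and nothing here talks to a real kernel: how the values reach user space is left open.

```python
# Query flags accepted by bpf_ringbuf_query (values as in the kernel UAPI enum).
BPF_RB_AVAIL_DATA = 0  # bytes submitted but not yet consumed
BPF_RB_RING_SIZE = 1   # size of the ring's data area, in bytes
BPF_RB_CONS_POS = 2    # consumer position
BPF_RB_PROD_POS = 3    # producer position


def fill_ratio(stats):
    """Return the fraction of the ring currently holding unconsumed data.

    `stats` is a hypothetical dict mapping a query flag to the value the
    helper returned for it.
    """
    ring_size = stats[BPF_RB_RING_SIZE]
    if ring_size <= 0:
        raise ValueError("ring size must be positive")
    return stats[BPF_RB_AVAIL_DATA] / ring_size


# Made-up sample: a 256 KiB ring with 64 KiB waiting to be consumed.
sample = {BPF_RB_AVAIL_DATA: 64 * 1024, BPF_RB_RING_SIZE: 256 * 1024}
print(fill_ratio(sample))
```

A ratio that keeps creeping towards 1.0 would suggest the consumer is not keeping up with the producers.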

The patch author did a very good job of explaining all the reasons why the change was needed, so I will not go that way in this post. Instead, I want to write about how to actually make use of this new feature.

Motivation for writing this post

Finding good resources on new BPF features is very hard. The subsystem maintainers are doing a ginormous amount of work, and documenting every single bit is very difficult.

Moreover, this new feature is just another map interface, so essentially it can be used like the other maps. However, I felt that others could benefit from my research on this new feature, so I put together this write-up while I was experimenting with it.

A note on helpers

For every piece of functionality it offers, the BPF subsystem exposes a helper.

A helper lets you interact with the specific part of the subsystem that implements the feature you are invoking.

It is not the Linux kernel's job to give you the helper definitions or a library, so your system will normally not ship with a header that you can include to get the function definitions for the helpers.

The idea is that you write the definition yourself when you want to use a specific helper, e.g.:

static void *(*bpf_ringbuf_reserve)(void *ringbuf, __u64 size, __u64 flags) =
  (void *)BPF_FUNC_ringbuf_reserve;

The patch adds five new BPF helpers:

int bpf_ringbuf_output(void *ringbuf, void *data, u64 size, u64 flags);
void *bpf_ringbuf_reserve(void *ringbuf, u64 size, u64 flags);
void bpf_ringbuf_submit(void *data, u64 flags);
void bpf_ringbuf_discard(void *data, u64 flags);
u64 bpf_ringbuf_query(void *ringbuf, u64 flags);

You can look at a complete list of all the BPF helpers in bpf-helpers(7).

With these premises, and to keep things simple, I decided to show two different usage examples of the new feature, one using libbpf and one using BCC.

It would be impractical for me to show how to use the functionality in a raw way, writing out by hand every helper definition needed by the BPF features we use.

A very good explanation of BPF helpers can be found at ebpf.io.

Using libbpf

Fortunately, the kernel provides a complete API that does all the work of exporting the helpers for us.

If you look around for libbpf, it has two homes:

  • The original copy, which resides in the Linux kernel tree under tools/lib/bpf.
  • The out-of-tree mirror at github.com/libbpf/libbpf.

To follow the example here, first go to the libbpf repository and follow the instructions to install it. Ring buffer support was added in v0.0.9. Also, make sure you have a kernel >= 5.8.

Here is how the BPF program works; the full listing is further down.

The program itself is very simple: we attach to the tracepoint that gets hit every time an execve syscall is made.

The interesting part here for BPF_MAP_TYPE_RINGBUF is the initialization of the map with bpf_map_def. This type of map does not take key and value sizes (the .key_size and .value_size fields stay unset), and for the .max_entries value, which is the buffer size in bytes, the patch says it wants a power of two. That is not the whole story: the value also needs to be page aligned. In the current asm_generic/page.h the page size is defined as 1 << 12, so any power of two that is also a multiple of 4096 will be OK.
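The two constraints on .max_entries described above (a power of two, and page aligned) can be checked before the map is ever loaded. Here is a minimal Python sketch, assuming both constraints are enforced together and a 4096-byte page. The function name is_valid_ringbuf_size is made up for illustration and is not part of libbpf or BCC.

```python
PAGE_SIZE = 1 << 12  # 4096 bytes, the page size assumed in the text above


def is_valid_ringbuf_size(max_entries, page_size=PAGE_SIZE):
    """Check a candidate ring buffer size (in bytes) against both constraints."""
    # A power of two has exactly one bit set.
    is_power_of_two = max_entries > 0 and (max_entries & (max_entries - 1)) == 0
    # Page aligned means a whole number of pages.
    is_page_aligned = max_entries % page_size == 0
    return is_power_of_two and is_page_aligned


for size in (4096 * 64, 4096 * 3, 2048, 4096):
    print(size, is_valid_ringbuf_size(size))
```

The value used in this article, 4096 * 64, passes both checks. 4096 * 3 is page aligned but not a power of two, and 2048 is a power of two but smaller than a page.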

Once the map is initialized, look at what we do in our tracepoint. There are two ringbuf-specific calls:

  • bpf_ringbuf_reserve reserves memory in the buffer; this is the only point where locking happens.
  • bpf_ringbuf_submit does the actual write to the map, making the record visible to the consumer; this is lock-free.
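To get a feel for the reserve-then-submit flow, here is a toy user-space model in Python. It is only a sketch under simplifying assumptions: the ToyRingBuffer class and its methods are invented for illustration, there is no locking and no shared memory, and it is not the kernel's algorithm. It only mirrors the visible behaviour: a reservation can fail when the buffer is full, records come out in reservation order, and the consumer cannot move past a record that has been reserved but not yet submitted or discarded.

```python
class ToyRingBuffer:
    """User-space toy model of the reserve/submit flow (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity  # total bytes available for records
        self.used = 0             # bytes reserved and not yet consumed
        self.records = []         # kept in reservation order

    def reserve(self, size):
        """Return a writable record, or None when there is no room left."""
        if self.used + size > self.capacity:
            return None
        record = {"data": bytearray(size), "state": "reserved"}
        self.used += size
        self.records.append(record)
        return record

    def submit(self, record):
        """Mark a reserved record as ready for the consumer."""
        record["state"] = "submitted"

    def discard(self, record):
        """Give up on a reserved record; the consumer will skip it."""
        record["state"] = "discarded"

    def consume(self):
        """Return submitted records in order, stopping at a pending one."""
        out = []
        while self.records and self.records[0]["state"] != "reserved":
            record = self.records.pop(0)
            self.used -= len(record["data"])
            if record["state"] == "submitted":
                out.append(bytes(record["data"]))
        return out


rb = ToyRingBuffer(capacity=64)
first = rb.reserve(20)
second = rb.reserve(20)
rb.submit(second)
print(len(rb.consume()))  # `first` is still pending, so nothing is readable
rb.submit(first)
print(len(rb.consume()))  # both records are now handed out, in order
print(rb.reserve(100))    # larger than the whole buffer
```

This is also why the BPF program has to check the pointer returned by bpf_ringbuf_reserve before writing through it.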

Here is the full BPF program:

#include <linux/types.h>

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Record sent to user space for every execve call. */
struct event {
  __u32 pid;
  char filename[16];
};

/* Legacy-style map definition, as accepted by libbpf v0.0.9.
 * A ring buffer takes no key or value size; max_entries is its size in bytes. */
struct bpf_map_def SEC("maps") buffer = {
    .type = BPF_MAP_TYPE_RINGBUF,
    .max_entries = 4096 * 64,
};

/* Layout of the tracepoint context, written out by hand. */
struct trace_entry {
  short unsigned int type;
  unsigned char flags;
  unsigned char preempt_count;
  int pid;
};

struct trace_event_raw_sys_enter {
  struct trace_entry ent;
  long int id;
  long unsigned int args[6];
  char __data[0];
};

SEC("tracepoint/syscalls/sys_enter_execve")
int sys_enter_execve(struct trace_event_raw_sys_enter *ctx) {
  /* The upper 32 bits hold the tgid, i.e. the PID as seen from user space. */
  __u32 pid = bpf_get_current_pid_tgid() >> 32;

  /* Reserve room for one event; this returns NULL when the buffer is full. */
  struct event *event = bpf_ringbuf_reserve(&buffer, sizeof(struct event), 0);
  if (!event) {
    return 1;
  }
  event->pid = pid;
  /* args[0] is the user-space pointer to the path passed to execve. */
  bpf_probe_read_user_str(event->filename, sizeof(event->filename),
                          (const char *)ctx->args[0]);

  /* Commit the record so the consumer can read it. */
  bpf_ringbuf_submit(event, 0);

  return 0;
}

char _license[] SEC("license") = "GPL";

Now save this source in a file called program.c if you want to try it later.

Loading the program would be impossible without a loader.

Besides all the boilerplate needed to load the program and attach the tracepoint, the loader has some interesting things for the ringbuf use case too:

  • The buf_process_sample callback gets called every time a new element is read from the ring buffer.
  • The ring buffer is read using ring_buffer__consume.

The loader looks like this:

#include <bpf/libbpf.h>
#include <linux/types.h>
#include <stdio.h>
#include <unistd.h>

/* Must match the struct event layout used by the BPF program. */
struct event {
  __u32 pid;
  char filename[16];
};

/* Called once for every record read from the ring buffer. */
static int buf_process_sample(void *ctx, void *data, size_t len) {
  struct event *evt = (struct event *)data;
  printf("%u %s\n", evt->pid, evt->filename);

  return 0;
}

int main(int argc, char *argv[]) {
  const char *file = "program.o";
  struct bpf_object *obj;
  int prog_fd = -1;
  int buffer_map_fd = -1;
  struct bpf_program *prog;
  struct bpf_link *link;
  struct ring_buffer *ring_buffer;

  /* Load the compiled BPF object into the kernel. */
  if (bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj, &prog_fd)) {
    fprintf(stderr, "failed to load %s\n", file);
    return 1;
  }

  buffer_map_fd = bpf_object__find_map_fd_by_name(obj, "buffer");
  if (buffer_map_fd < 0) {
    fprintf(stderr, "failed to find the ring buffer map\n");
    return 1;
  }

  /* Register the callback that will receive every sample. */
  ring_buffer = ring_buffer__new(buffer_map_fd, buf_process_sample, NULL, NULL);
  if (!ring_buffer) {
    fprintf(stderr, "failed to create ring buffer\n");
    return 1;
  }

  prog = bpf_object__find_program_by_title(obj, "tracepoint/syscalls/sys_enter_execve");
  if (!prog) {
    fprintf(stderr, "failed to find tracepoint\n");
    return 1;
  }

  link = bpf_program__attach_tracepoint(prog, "syscalls", "sys_enter_execve");
  if (libbpf_get_error(link)) {
    fprintf(stderr, "failed to attach tracepoint\n");
    return 1;
  }

  /* Drain whatever is available once per second, until interrupted. */
  while (1) {
    ring_buffer__consume(ring_buffer);
    sleep(1);
  }

  return 0;
}

Now save this source in a file called loader.c if you want to try it later.

It took quite a bit of code just to showcase the ringbuf-related functions. Sorry for the big wall of code!

Now we can proceed to compile and run it.

In the folder where you saved program.c and loader.c, compile the program:

clang -O2 -target bpf -g -c program.c # -g is to generate BTF

Compile the loader:

gcc -g loader.c -lbpf

You can now run it via:

sudo ./a.out

It will produce something similar to this:

393811 /bin/zsh
393812 /usr/bin/env
393812 /usr/local/bin/
393812 /usr/local/sbin
393812 /usr/bin/zsh
393816 /usr/bin/ls
393818 /usr/bin/git
393819 /usr/bin/awk
393824 /usr/bin/git
393825 /usr/bin/git
393826 /usr/bin/git
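Each line of that output comes from one raw sample handed to the callback. To make the byte layout concrete, here is a small Python sketch that decodes such a sample by hand. It assumes struct event is laid out as a 32-bit unsigned pid followed by a 16-byte filename, little-endian and without padding; the decode_event function and the hand-built sample are illustrative only and do not read from a real ring buffer.

```python
import struct

# Assumed layout of `struct event`: a 32-bit unsigned pid followed by a
# 16-byte filename buffer, little-endian, no padding (20 bytes in total).
EVENT_FORMAT = "<I16s"


def decode_event(sample):
    """Turn one raw ring buffer sample into a (pid, filename) tuple."""
    pid, raw_name = struct.unpack(EVENT_FORMAT, sample)
    # The filename is NUL-terminated inside its fixed-size buffer.
    filename = raw_name.split(b"\x00", 1)[0].decode("utf-8", errors="replace")
    return pid, filename


# Build a sample by hand, mimicking the first line of the output above.
sample = struct.pack(EVENT_FORMAT, 393811, b"/bin/zsh")
print(struct.calcsize(EVENT_FORMAT), decode_event(sample))
```

The C loader gets the same result for free by casting the data pointer to struct event, which is why both sides must agree on the struct definition.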

If you followed my suggestion and kept the -g flag in the clang command while compiling the program, congrats, you just produced a BPF CO-RE (Compile Once, Run Everywhere) program.

Yes, you can move it to another machine with kernel 5.8 and it will work. The next step is to compile the loader statically so that it can be moved together with the program. This is left to the reader :)

Using BCC

This section does the same thing we did with libbpf, but with BCC.

BCC added support for the BPF ring buffer almost immediately, by adding the helper definitions and implementing the Python API support.

To make this work you will need a kernel >= 5.8 and at least BCC 0.16.0. If you need to learn how to install BCC, they have a very good resource here.

Here’s the Python code, comments below:

#!/usr/bin/python3

import sys
import time

from bcc import BPF

src = r"""
// Ring buffer named "buffer"; the second argument is its size in pages.
BPF_RINGBUF_OUTPUT(buffer, 1 << 4);

struct event {
    u32 pid;
    char filename[16];
};

TRACEPOINT_PROBE(syscalls, sys_enter_execve) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    // Reserve room for one event; NULL means the buffer is full.
    struct event *event = buffer.ringbuf_reserve(sizeof(struct event));
    if (!event) {
        return 1;
    }
    event->pid = pid;
    bpf_probe_read_user_str(event->filename, sizeof(event->filename), args->filename);

    // Commit the record so user space can read it.
    buffer.ringbuf_submit(event, 0);

    return 0;
}
"""

b = BPF(text=src)

# Called for every record read from the ring buffer.
def callback(ctx, data, size):
    event = b['buffer'].event(data)
    print("%-8s %-16s" % (event.pid, event.filename.decode('utf-8')))


my_rb = b['buffer']
my_rb.open_ring_buffer(callback)

print("%-8s %-16s" % ("PID", "FILENAME"))

try:
    while 1:
        # Run the callback on whatever records are available.
        b.ring_buffer_poll()
        time.sleep(0.5)
except KeyboardInterrupt:
    sys.exit()

As you can see, we use the BCC helper BPF_RINGBUF_OUTPUT to create a ring buffer named buffer. On the BPF side we call ringbuf_reserve and ringbuf_submit on it to write, and on the Python side we call ring_buffer_poll to read.

If you want to try it, copy the program to a program.py file. You will need to execute it with root permissions:

sudo python3 program.py

The output should be something like:

PID      FILENAME
43674    /bin/zsh
43675    /usr/bin/env
43675    /usr/local/bin/
43675    /usr/local/sbin
43675    /usr/bin/zsh
43678    /usr/bin/dircol
43679    /usr/bin/ls
43681    /usr/bin/git
43682    /usr/bin/awk
43687    /usr/bin/git
43688    /usr/bin/git
43689    /usr/bin/git
43701    /usr/bin/sh
43701    /usr/bin/git
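Some paths in the output appear cut off, for example /usr/bin/dircol. That is a consequence of the 16-byte filename field: bpf_probe_read_user_str copies at most size - 1 bytes and then writes a terminating NUL, so only 15 characters survive. The Python sketch below mimics that truncation rule. The function name is hypothetical, and /usr/bin/dircolors is only my guess at what the full path was.

```python
FILENAME_SIZE = 16  # sizeof(event->filename) in the programs above


def read_str_like_helper(src, size):
    """Mimic a bounded, NUL-terminated string copy.

    At most size - 1 bytes of `src` are kept, so that one byte is always
    left for the terminating NUL.
    """
    if size <= 0:
        return b""
    return src[: size - 1] + b"\x00"


for path in (b"/usr/bin/ls", b"/usr/bin/dircolors"):
    copied = read_str_like_helper(path, FILENAME_SIZE)
    print(copied.rstrip(b"\x00").decode())
```

If you need full paths, the straightforward fix is a larger filename field in struct event, at the cost of bigger records in the ring buffer.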

Conclusion

Once again, as with every release, the BPF subsystem is becoming more and more feature complete. This specific feature addresses a strongly felt need for those (like me) who move a lot of data around using maps.

Thanks to the maintainers and the many contributors for their hard work!