2025-01-27
BPF_MAP_TYPE_RINGBUFs are pretty cool! They are a performant way of sending data from an eBPF program (kernel space) to user space. Using as little data as possible to communicate to and from user space keeps an eBPF program performant and happy.
I reviewed a PR recently where 16 bytes were removed from a struct we enqueue in a BPF_MAP_TYPE_RINGBUF (ringbuf). The struct went from 128 bytes to 112 bytes. I wondered: are 16 bytes actually saved here? Since I am eBPF-naive, I thought this would be a good opportunity to dive into the Linux kernel source code and unearth any horrors (er, answers). This is my short journey into the depths.
There are two overall questions I want to answer: how is an element actually laid out inside a BPF_MAP_TYPE_RINGBUF, and did removing those 16 bytes actually save 16 bytes per element? I’m aware of two general classes of ring buffers or queues: ones that store pointers to their elements, and ones that copy the elements directly into the buffer’s own contiguous memory.
Yes, RTFM would have answered my questions nearly instantly. But I wanted to go to the source to find the answers.
The eBPF code I reviewed calls bpf_ringbuf_output, so let’s look for that. Googling “bpf_ringbuf_output elixir” brings us to ringbuf.c. elixir.bootlin.com is a wonderful viewer of the kernel source.
How does bpf_ringbuf_output work?

It lets you write some data into the ringbuf. You call the function with a pointer to the ringbuf map, a pointer to your data, the size of your data, and some flags. E.g. writing an 8-byte unsigned int:

__u64 val = 42;
bpf_ringbuf_output(&my_map, &val, sizeof(val), 0);
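For a fuller picture, here is a minimal sketch of how such a call typically sits inside a BPF program, assuming libbpf-style map definitions. The map events, struct event, and the tracepoint below are hypothetical names for illustration, not taken from the PR I reviewed.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Hypothetical 16-byte payload, just for illustration. */
struct event {
        __u64 ts;
        __u32 pid;
        __u32 cpu;
};

struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024); /* ring size in bytes: a page-aligned power of two */
} events SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int handle_execve(void *ctx)
{
        struct event e = {
                .ts = bpf_ktime_get_ns(),
                .pid = bpf_get_current_pid_tgid() >> 32,
                .cpu = bpf_get_smp_processor_id(),
        };

        /* copies sizeof(e) bytes off our stack and into the ring buffer */
        bpf_ringbuf_output(&events, &e, sizeof(e), 0);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";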
Over in the kernel, bpf_ringbuf_output itself is a short 17 lines:
BPF_CALL_4(bpf_ringbuf_output, struct bpf_map *, map, void *, data, u64, size,
           u64, flags)
{
        struct bpf_ringbuf_map *rb_map;
        void *rec;

        if (unlikely(flags & ~(BPF_RB_NO_WAKEUP | BPF_RB_FORCE_WAKEUP)))
                return -EINVAL;

        rb_map = container_of(map, struct bpf_ringbuf_map, map);
        rec = __bpf_ringbuf_reserve(rb_map->rb, size);
        if (!rec)
                return -EAGAIN;

        memcpy(rec, data, size);
        bpf_ringbuf_commit(rec, flags, false /* discard */);
        return 0;
}
Looks like we extract the map (container_of), get a pointer to some memory (__bpf_ringbuf_reserve), and memcpy the input data into that pointer. Our answer lies in __bpf_ringbuf_reserve. What kind of pointer does it give us? An allocation? An offset into the ringbuf’s data?
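Quick aside on container_of, since that is how the map gets extracted above: it steps back from a pointer to an embedded member to the structure that contains it. Here is a toy user-space sketch of the idea; the struct bodies are stand-ins, and only the shape of bpf_ringbuf_map mirrors what kernel/bpf/ringbuf.c actually declares:

#include <stddef.h>
#include <stdio.h>

struct bpf_map { int dummy; };
struct bpf_ringbuf { int dummy; };

/* Mirrors the kernel layout: the generic bpf_map is embedded first. */
struct bpf_ringbuf_map {
        struct bpf_map map;
        struct bpf_ringbuf *rb;
};

/* Simplified container_of: subtract the member's offset to recover
 * the address of the structure that contains it. */
#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

int main(void)
{
        struct bpf_ringbuf_map rb_map = { 0 };
        struct bpf_map *map = &rb_map.map; /* what the helper receives */
        struct bpf_ringbuf_map *back =
                container_of(map, struct bpf_ringbuf_map, map);

        printf("%d\n", back == &rb_map); /* prints 1 */
        return 0;
}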
rec = __bpf_ringbuf_reserve
Looking at __bpf_ringbuf_reserve:
static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
{
        unsigned long cons_pos, prod_pos, new_prod_pos, pend_pos, flags;
        struct bpf_ringbuf_hdr *hdr;
        u32 len, pg_off, tmp_size, hdr_len;

        // </snip>
        len = round_up(size + BPF_RINGBUF_HDR_SZ, 8);
        // </snip>

        cons_pos = smp_load_acquire(&rb->consumer_pos);

        // </snip> lock the map

        pend_pos = rb->pending_pos;
        prod_pos = rb->producer_pos;
        new_prod_pos = prod_pos + len;

        // </snip> length checks / seeing if the consumer is slow

        hdr = (void *)rb->data + (prod_pos & rb->mask);
        pg_off = bpf_ringbuf_rec_pg_off(rb, hdr); // get the page offset for the rb
        hdr->len = size | BPF_RINGBUF_BUSY_BIT;
        hdr->pg_off = pg_off;

        // </snip> store consumer and unlock

        return (void *)hdr + BPF_RINGBUF_HDR_SZ;
}
A few things are going on in this function: some </snip>ed error checks, what appears to be a loop determining if the consumer is slow (?), and something strange on line #8! We take the requested size, add BPF_RINGBUF_HDR_SZ to it, and round the result up to a multiple of 8. It appears an element gets some kind of associated metadata, a bpf_ringbuf_hdr!
The hdr created on line #22 points inside the ringbuf’s data, rb->data. We then write the size of our reservation into it, along with the page offset of the ringbuf. The page offset “allows [us] to restore struct bpf_ringbuf * from record pointer”1.
struct bpf_ringbuf_hdr is pretty simple:

/* 8-byte ring buffer record header structure */
struct bpf_ringbuf_hdr {
        u32 len;
        u32 pg_off;
};
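For the curious: BPF_RINGBUF_HDR_SZ is 8, the size of this header. As far as I can tell, it is defined next to the busy bit we saw above, in include/uapi/linux/bpf.h:

/* BPF ring buffer constants */
enum {
        BPF_RINGBUF_BUSY_BIT    = (1U << 31),
        BPF_RINGBUF_DISCARD_BIT = (1U << 30),
        BPF_RINGBUF_HDR_SZ      = 8,
};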
Finally, on line #28, __bpf_ringbuf_reserve returns (void *)hdr + BPF_RINGBUF_HDR_SZ. So the hdr pointer + 8 bytes.
Elements must be directly written into the ringbuf. We don’t see any kernel allocation in the code of bpf_ringbuf_output. Rather, we see that an offset into the rb->data array is calculated and the given element is memcpy’d directly into it. So that means ringbufs are not implemented as an array of pointers. Interestingly, we learned that elements take up 8 more bytes than expected. Not just 8 more bytes: the record is padded up to the next 8-byte boundary after adding those 8 header bytes!
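Putting it together, here is my rough sketch (from reading the code above, not an official diagram) of a single record inside rb->data:

/*
 *  hdr = rb->data + (prod_pos & rb->mask)
 *  |
 *  |          returned pointer (hdr + BPF_RINGBUF_HDR_SZ)
 *  |          |
 *  v          v
 *  +----------+---------------------------+----------------+
 *  | len      | your data (size bytes)    | pad up to the  |
 *  | pg_off   |                           | next 8 bytes   |
 *  +----------+---------------------------+----------------+
 *  <------ len = round_up(size + BPF_RINGBUF_HDR_SZ, 8) ---->
 */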
So the in-ringbuf size of our 112-byte struct is:

ceil((112 + 8) / 8) * 8
= ceil(120 / 8) * 8
= ceil(15) * 8
= 120
Whew! Our new struct plus its 8-byte header takes up 120 bytes, and since 120 is already a multiple of 8 there is no extra padding. The old 128-byte struct occupied 128 + 8 = 136 bytes per record, so we really did save the full 16 bytes. For contrast, if we had gone from 39 bytes to 38 bytes (saving 1 byte), we would still use 48 bytes of ring space either way, since 39 and 38 land on the same next-higher multiple of 8 after adding the 8-byte header:

ceil((39 + 8) / 8) * 8 = ceil((38 + 8) / 8) * 8 = 48
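If you want to sanity-check the arithmetic, here is a small user-space C sketch that mirrors the kernel’s len = round_up(size + BPF_RINGBUF_HDR_SZ, 8). Nothing in it is kernel code, and the helper name record_len is mine:

#include <stdint.h>
#include <stdio.h>

#define BPF_RINGBUF_HDR_SZ 8

/* Bytes a record occupies in the ring for a given payload size:
 * payload + 8-byte header, rounded up to a multiple of 8. */
static uint64_t record_len(uint64_t size)
{
        return (size + BPF_RINGBUF_HDR_SZ + 7) & ~UINT64_C(7);
}

int main(void)
{
        printf("128 -> %3lu bytes\n", (unsigned long)record_len(128)); /* 136 */
        printf("112 -> %3lu bytes\n", (unsigned long)record_len(112)); /* 120 */
        printf(" 39 -> %3lu bytes\n", (unsigned long)record_len(39));  /*  48 */
        printf(" 38 -> %3lu bytes\n", (unsigned long)record_len(38));  /*  48 */
        return 0;
}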
Reading the docs would have given me this answer pretty much instantly, but looking at the source was a fun exercise.