
A short dive into BPF_MAP_TYPE_RINGBUF

2025-01-27

BPF_MAP_TYPE_RINGBUFs are pretty cool! They are a performant way of sending data from an eBPF program (kernel space) to be received in user space. Using as little data as possible to communicate to and from user space keeps an eBPF program performant and happy.
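
For context, declaring one of these maps in a libbpf-style eBPF program usually looks something like the sketch below. The map name my_map and the 256 KiB size are placeholders of mine, not anything from the code discussed here:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* A ring buffer map. For BPF_MAP_TYPE_RINGBUF, max_entries is the size of
 * the buffer in bytes and must be a power-of-2 multiple of the page size. */
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} my_map SEC(".maps");

User space then typically consumes the data with libbpf's ring_buffer__new() and ring_buffer__poll().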

I reviewed a PR recently where 16 bytes were removed from a struct we enqueue in a BPF_MAP_TYPE_RINGBUF (ringbuf). The struct went from 128 bytes to 112 bytes. I wondered, are 16 bytes actually saved here? Since I am eBPF-naive, I thought this would be a good opportunity to dive into the Linux kernel source code and unearth some answers (and maybe some horrors). This is my short journey into the depths.

There are two overall questions I want to answer:

  1. What kind of ring buffer is a BPF_MAP_TYPE_RINGBUF? I’m aware of two general classes of ring buffers or queues: rings that store pointers to elements allocated elsewhere, and rings that store the elements themselves, copied directly into the buffer (a rough sketch of both follows this list).
  2. What sort of bookkeeping is required to maintain the queue?
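
Here is a rough sketch of those two classes (my own illustration, not kernel code):

/* Class 1: a ring of pointers. Each slot points at an element allocated
 * elsewhere; the ring itself only stores fixed-size pointers. */
struct pointer_ring {
    void *slots[1024];
    unsigned long head, tail;
};

/* Class 2: a ring of bytes. Variable-sized elements are copied directly into
 * one contiguous buffer; positions are byte offsets into that buffer. */
struct byte_ring {
    char data[256 * 1024];
    unsigned long producer_pos, consumer_pos;
};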

Yes, RTFM would have answered my questions nearly instantly. But I wanted to go to the source to find the answers.

The eBPF code I reviewed calls bpf_ringbuf_output, so let’s look for that. Googling bpf_ringbuf_output elixir brings us to ringbuf.c. elixir.bootlin.com is a wonderful viewer of the kernel source.

How does bpf_ringbuf_output work?

It lets you write some data into the ringbuf. You call the function with a pointer to the ringbuf, a pointer to your data, the size of your data, and some flags. For example, writing an 8-byte unsigned integer:

__u64 value = 42;
bpf_ringbuf_output(&my_map, &value, sizeof(value), 0);

It’s a short 17 lines:

BPF_CALL_4(bpf_ringbuf_output, struct bpf_map *, map, void *, data, u64, size,
       u64, flags)
{
    struct bpf_ringbuf_map *rb_map;
    void *rec;

    if (unlikely(flags & ~(BPF_RB_NO_WAKEUP | BPF_RB_FORCE_WAKEUP)))
        return -EINVAL;

    rb_map = container_of(map, struct bpf_ringbuf_map, map);
    rec = __bpf_ringbuf_reserve(rb_map->rb, size);
    if (!rec)
        return -EAGAIN;

    memcpy(rec, data, size);
    bpf_ringbuf_commit(rec, flags, false /* discard */);
    return 0;
}

Looks like we extract the map (container_of), get a pointer to some memory (__bpf_ringbuf_reserve), and memcpy the input data into that pointer. Our answer lies in __bpf_ringbuf_reserve. What kind of pointer does it give us? An allocation? An offset into the ringbuf’s data?
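
(A quick aside: container_of is the standard kernel idiom for recovering an enclosing struct from a pointer to one of its members. Conceptually it is just pointer arithmetic; a simplified sketch, ignoring the type checking the real macro does:)

#include <stddef.h>

/* Simplified sketch of container_of: given a pointer to `member` inside an
 * instance of `type`, step back by the member's offset within `type` to get
 * a pointer to the enclosing struct. */
#define container_of_sketch(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))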

rec = __bpf_ringbuf_reserve

Looking at __bpf_ringbuf_reserve:

static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
{
    unsigned long cons_pos, prod_pos, new_prod_pos, pend_pos, flags;
    struct bpf_ringbuf_hdr *hdr;
    u32 len, pg_off, tmp_size, hdr_len;
    // </snip>

    len = round_up(size + BPF_RINGBUF_HDR_SZ, 8);

    // </snip>

    cons_pos = smp_load_acquire(&rb->consumer_pos);

    // </snip> lock the map

    pend_pos = rb->pending_pos;
    prod_pos = rb->producer_pos;
    new_prod_pos = prod_pos + len;

    // </snip> length checks / seeing if the consumer is slow

    hdr = (void *)rb->data + (prod_pos & rb->mask);
    pg_off = bpf_ringbuf_rec_pg_off(rb, hdr); // get the page offset for the rb
    hdr->len = size | BPF_RINGBUF_BUSY_BIT;
    hdr->pg_off = pg_off;

    // </snip> store the new producer position and unlock
    return (void *)hdr + BPF_RINGBUF_HDR_SZ;
}

A few things are going on in this function: some </snip>ed error checks, what appears to be a loop determining whether the consumer is lagging behind, and something strange on line #8! We round the given size plus BPF_RINGBUF_HDR_SZ up to the next multiple of 8. It appears an element gets some kind of associated metadata, a bpf_ringbuf_hdr!

The hdr created on line #22 points inside the ringbuf’s data, rb->data. We then write the size of our reservation and a page offset into it. The page offset “allows [us] to restore struct bpf_ringbuf * from record pointer”1.

struct bpf_ringbuf_hdr is pretty simple:

/* 8-byte ring buffer record header structure */
struct bpf_ringbuf_hdr {
	u32 len;
	u32 pg_off;
};
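
Note that len does double duty: the low bits hold the record size while the top bits are flags, which is why we saw hdr->len = size | BPF_RINGBUF_BUSY_BIT above. If I’m reading the UAPI header right, the relevant constants are:

/* include/uapi/linux/bpf.h */
enum {
	BPF_RINGBUF_BUSY_BIT		= (1U << 31),
	BPF_RINGBUF_DISCARD_BIT		= (1U << 30),
	BPF_RINGBUF_HDR_SZ		= 8,
};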

Finally, on line #28, __bpf_ringbuf_reserve returns (void *)hdr + BPF_RINGBUF_HDR_SZ. So the hdr pointer + 8 bytes.
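
Putting the pieces together, my mental picture of a record inside rb->data is (my own diagram, based on the code above):

 ... | hdr (8 bytes: len|BUSY_BIT, pg_off) | payload, padded to a multiple of 8 | next hdr | ...
       ^                                     ^
       hdr = rb->data + (prod_pos & mask)    hdr + BPF_RINGBUF_HDR_SZ, returned to the caller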

Our Answer!

Elements must be directly written into the ringbuf. We don’t see any kernel allocation in the code of bpf_ringbuf_output. Rather, we see that an offset into the rb->data array is calculated and the given element is memcpyed directly into it. So that means ringbufs are not implemented as arrays of pointers. Interestingly, we learned that elements take up 8 more bytes than expected. Not just 8 more bytes: the payload plus the 8-byte header is rounded up to the next 8-byte boundary!

So the size of our 112 byte struct is:

ceil((112 + 8) / 8) * 8 =
      ceil(120 / 8) * 8 =
           ceil(15) * 8 =
                  15 * 8 = 120

Whew! Our new struct plus its 8-byte header takes up 120 bytes, while the old 128-byte struct plus its header took up 136 bytes, so we did actually save all 16 bytes. That isn’t always the case: if we went from 39 bytes to 38 bytes (saving 1 byte), we would still use 48 bytes of memory, since 39 and 38 round up to the same multiple of 8 after adding the 8-byte header:

ceil((39 + 8) / 8) * 8 = ceil((38 + 8) / 8) * 8
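
To double-check that arithmetic, here is a tiny user-space C snippet of mine (not kernel code) that computes the per-record footprint for a few payload sizes:

#include <stdio.h>

/* Per-record footprint: payload + 8-byte header, rounded up to a multiple of
 * 8. This mirrors the round_up(size + BPF_RINGBUF_HDR_SZ, 8) we saw in
 * __bpf_ringbuf_reserve. */
static unsigned long record_footprint(unsigned long size)
{
    return ((size + 8) + 7) & ~7UL;
}

int main(void)
{
    unsigned long sizes[] = { 128, 112, 39, 38 };

    /* Prints 136, 120, 48, 48. */
    for (int i = 0; i < 4; i++)
        printf("payload %3lu -> %3lu bytes in the ringbuf\n",
               sizes[i], record_footprint(sizes[i]));
    return 0;
}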

Reading the docs would have given me this answer pretty much instantly, but looking at the source was a fun exercise.

Footnotes

  1. https://elixir.bootlin.com/linux/v6.12.6/source/kernel/bpf/ringbuf.c#L390 ↩