-
Notifications
You must be signed in to change notification settings - Fork 721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ringbuf: fix corrupt samples on arm64 #1375
Conversation
20f2f28
to
5e944c6
Compare
5e944c6
to
f8ed272
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, good catch. Looking at the code on the kernel side, it looks like we indeed need to be careful to at least read things in the right order.
In the interest of catching any future regressions, could you share a bit more about where you've seen this occur? Has your fix been rolled out through the datadog agent, or was this seen on internal systems only? Any chance of finding a reproducer?
Basically the pattern we have is that we send events that look like
Recently we added some metrics on the event type and saw that in some rare cases the event type was completely unexpected, which led us to start the investigation leading to this PR. I have not been able to create a reproducer yet, the only way I have currently to reproduce this is to deploy to a lot of machines and wait multiple hours. |
f8ed272
to
7496426
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, very interesting! Do you understand how user space can end up with the busy bit missing? From my perspective:
- BPF reserves space in the ringbuf. This sets the BPF_RINGBUF_BUSY_BIT. This set is not atomic!
- The kernel does an an atomic store to prod_pos.
- User space does an atomic load of prod_pos.
- User space copies len (non atomically). BPF_RINGBUF_BUSY_BIT is not present.
The only way I can see this happening is if:
- The assignment of
len = size | BUSY_BIT
is decomposed intolen = size; len |= BUSY
in assembly. - The store to prod_pos is reordered to happen after
len = size
and beforelen |= BUSY
.
The problem with this theory is that I don't understand why reading len atomically would fix this problem, since there is no atomic write to len on the kernel side? (At least until we hit the xchg
in bpf_ringbuf_commit
)
Follow up question: does happen on arm64 and amd64? |
Here is |
7496426
to
b013649
Compare
I just checked and I only see this corruption on arm64 hosts, should have checked this before this is good intel |
Signed-off-by: Paul Cacheux <[email protected]>
b013649
to
82ab9a9
Compare
Reading the len field of the event header needs to be atomic to match the release memory barrier from https://github.com/torvalds/linux/blob/eb26cbb1a754ccde5d4d74527dad5ba051808fad/kernel/bpf/ringbuf.c#L487 We also don't care about the PgOff field which is only used by ringbuf internals. [ Additional context by Lorenz ] Based on a suggestion by Daniel Borkmann I wrote a litmus test against the Linux Kernel Memory Model: C ringbuf {} P0(int *hdr_len, int *prod_pos, int *data) // Producer { // __bpf_ringbuf_reserve *hdr_len = 1; // BUSY smp_store_release(prod_pos, 1); // store some data. *data = 1; // bpf_ringbuf_commit int r0 = xchg(hdr_len, 2); // COMMIT } P1(int *hdr_len, int *prod_pos, int *data) // Consumer { int p = smp_load_acquire(prod_pos); if (p > 0) { int l = *hdr_len; if (l == 2) { int d = *data; } } } exists (prod_pos=1 /\ 1:l=2 /\ 1:d=0) Note that hdr_len is read via a plain read. Executing the litmus test shows that we can indeed observe a COMMIT (l=2) before data is visible: $ herd7 -conf linux-kernel.cfg litmus-tests/ringbuf.litmus Test ringbuf Allowed States 4 1:d=0; 1:l=0; [prod_pos]=1; 1:d=0; 1:l=1; [prod_pos]=1; 1:d=0; 1:l=2; [prod_pos]=1; 1:d=1; 1:l=2; [prod_pos]=1; Ok Witnesses Positive: 1 Negative: 3 Flag data-race Condition exists ([prod_pos]=1 /\ 1:l=2 /\ 1:d=0) Observation ringbuf Sometimes 1 3 Time ringbuf 0.01 Hash=7aed947102aa203443a5b9b5884f956f This is fixed by using an smp_load_acquire or equivalent to read hdr_len. Signed-off-by: Paul Cacheux <[email protected]> Co-developed-by: Lorenz Bauer <[email protected]>
82ab9a9
to
02732d9
Compare
Fixed up the second commit to handle |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason this happens isn't because something weird happens in __bpf_ringbuf_reserve
. The problem is that user space (aka we) don't enfore an ordering between hdr->len being visible and data being written into the ring, at least as far as I can understand it. This is a problem on arm64 since it has weaker coherency (?) guarantees by default.
I have been trying to build a small reproducer all day without much success.. thanks a lot ! |
I have been investigating some very rare event corruption in the output sample data coming from the eBPF ringbuffer. Looking at libbpf, and aya (rust equivalent) it seems we should be reading the len field in the header atomically. Matching this kernel operation
I have been able to confirm that this fixes my issue.
My current understanding is that for some reason, probably a mix of optimizer optimization and CPU trickery, we end up reading the correct len in the header but the busy bit is missing, meaning that we end up returning a sample with not yet commited data. I will expand on that once I improve my understanding of the issue.