Protobuf marshalling error when processing traffic over gRPC stream #3260

Open
inliquid opened this issue Dec 26, 2024 · 3 comments
Labels
kind/bug Something isn't working

Comments

@inliquid
Contributor

inliquid commented Dec 26, 2024

What happened?

This issue affects our prod systems and appears consistently during load tests.

It was initially discovered with our own gRPC agent, which consumes events from Tetragon directly, but it can be easily reproduced using tetra.

In a container that is being monitored, run:

while true; do cat /etc/pam.conf > /dev/null  && awk 'BEGIN {system("whoami")}' > /dev/null && sleep 0.25 || break; done

In the tetragon container, run:

tetra getevents --pods test-pod -o compact

This fails after some time (~5-60 min) with the following error:

<...>
🚀 process default/test-pod-debian /usr/bin/whoami
💥 exit    default/test-pod-debian /usr/bin/whoami  0
💥 exit    default/test-pod-debian /bin/sh -c whoami 0
💥 exit    default/test-pod-debian /usr/bin/awk  "BEGIN {system("whoami")}" 0
🚀 process default/test-pod-debian /usr/bin/sleep 0.25
time="2024-12-26T14:17:58Z" level=fatal msg="Failed to receive events" error="rpc error: code = Internal desc = grpc: error while marshaling: marshaling tetragon.GetEventsResponse: size mismatch (see https://github.com/golang/protobuf/issues/1609): calculated=0, measured=134"

This reproduces even without any Tracing Policy.

Tetragon Version

v1.1.2

Kernel Version

5.14.0-284.30.1.el9_2.x86_64

Kubernetes Version

v1.27.6

@inliquid inliquid added the kind/bug Something isn't working label Dec 26, 2024
@inliquid
Contributor Author

I've seen issue #2875, but I'm not sure whether it is the same issue or something specific to a particular test.

@mtardy
Member

mtardy commented Jan 2, 2025

Thanks for the report and the reproduction steps. Indeed, it's an issue we have bumped into regularly. I think @will-isovalent investigated it a while ago and fixed parts of it; he might have more context on it.

@kkourt
Contributor

kkourt commented Jan 6, 2025

Thanks @inliquid! Can you reproduce it without awk or is awk needed for the issue to happen?

My guess is that there is some race happening when filling the .Process section when we get out-of-order exec events. So if two very close exec events (via awk) are needed to reproduce this, that might indicate that my suspicion is correct.
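
For illustration only, here is a toy Go model of how such a race could produce a "calculated=0, measured=N" mismatch. It assumes the encoder works in two passes (compute the size, then write the bytes) and that another goroutine fills in a field between them. This is not Tetragon or protobuf-go code: `event`, `size`, `encode`, `marshal`, and the `between` hook are all hypothetical names, and the hook stands in for a concurrent writer so that the outcome is deterministic.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a message whose fields
// may be filled in late (for example, process info after an exec).
type event struct {
	Binary string
	Pod    string
}

// size is the first pass: the expected encoded length.
func size(e *event) int { return len(e.Binary) + len(e.Pod) }

// encode is the second pass: the bytes actually written.
func encode(e *event) []byte { return append([]byte(e.Binary), e.Pod...) }

// marshal mimics a two-pass encoder. The between hook runs after the
// size pass and simulates a concurrent writer touching the message.
func marshal(e *event, between func()) ([]byte, error) {
	calculated := size(e)
	if between != nil {
		between()
	}
	out := encode(e)
	if len(out) != calculated {
		return nil, fmt.Errorf("size mismatch: calculated=%d, measured=%d", calculated, len(out))
	}
	return out, nil
}

func main() {
	// Shared message mutated mid-marshal: the two passes disagree.
	shared := &event{}
	_, err := marshal(shared, func() { shared.Binary = "/usr/bin/whoami" })
	fmt.Println(err) // size mismatch: calculated=0, measured=15

	// Marshaling a private copy: later writes to shared cannot interfere.
	snapshot := *shared
	out, err := marshal(&snapshot, func() { shared.Pod = "test-pod" })
	fmt.Println(len(out), err) // 15 <nil>
}
```

If the real cause is similar, the usual remedies would be to copy the message before handing it to the stream, or to serialize the writer and the marshaling path with a lock. Whether either applies here depends on where the .Process section is actually filled in.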
