[filter_multiline] [engine] Segmentation fault (SIGSEGV) and/or deadlock in threaded mode #9835
This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
@drbugfinder-work would you mind helping me validate this issue? I tried to reproduce it and fixed something that may or may not be the root of this issue with PR #10237, but I'd like to be able to follow your exact repro steps. As for @nokute78's PR, I think it might be outdated (at some point there was a bug where certain coroutines were scheduled in the wrong thread), because unless I'm forgetting something critical, there should never be more than one thread operating on a multiline context.
@leonardo-albertovich I am replying for @drbugfinder-work here: we are still observing intermittent crashes on Fluent Bit 4.0.1, this time with slightly different stack traces:
This does not easily reproduce on demand in an isolated test, only very intermittently in a complex environment. We will try to work on a simpler reproducer.
Thank you @shenningz; from your log I can see that the multiline code is running on the main thread.
Actually, going back to the original report, I can see that the multiline filter is running as a processor directly attached to a tail input. I'm a bit confused because your traces look different, @shenningz. I've found two defects in multiline so far:
I'll modify my tests to:
Even then, it'd be really helpful if you could get in touch with me and share more context (configuration, logs, etc.), so feel free to do it here or contact me through the Fluent Slack if you feel more comfortable that way.
It's a nasty threading issue caused by this line. The reason is that, regardless of whether threading is enabled, processor stacks are initialized from the main thread, which causes those processors (or filters) that create timers to erroneously schedule them on the main pipeline thread instead of the plugin's thread. That is why @nokute78's patch worked, and as I previously mentioned, the fact that there was code running in the wrong thread was the root cause. I'm happy that we didn't merge the PR, because it would have hidden the problem and made it much more difficult to fix in the future. There is another issue with how the event loop is picked when a collector is created, which causes the emitter to fail once the previously mentioned issue is fixed. This one is tricky because of how it decides which event loop to use. As a proof of concept I patched it to pick the event loop from the TLS instead of taking it from the config structure, but I'm not happy with it, so I'll take a few more hours to think about it and make a proper PR that fixes all of these issues.
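To make that idea concrete, here is a minimal, hypothetical sketch of the thread-local-storage approach described above; the types and function names (`evloop_t`, `thread_set_event_loop`, `current_event_loop`) are illustrative assumptions, not Fluent Bit's actual API:

```c
/* Illustrative sketch only: bind timers to the *calling* thread's event
 * loop via thread-local storage instead of a loop stored in a shared
 * config structure. These names are hypothetical, not Fluent Bit APIs. */
#include <pthread.h>
#include <stddef.h>

typedef struct evloop evloop_t;            /* opaque per-thread event loop */

static pthread_key_t  tls_evl_key;
static pthread_once_t tls_evl_once = PTHREAD_ONCE_INIT;

static void tls_evl_init(void)
{
    pthread_key_create(&tls_evl_key, NULL);
}

/* Each pipeline thread registers its own event loop once at startup. */
void thread_set_event_loop(evloop_t *evl)
{
    pthread_once(&tls_evl_once, tls_evl_init);
    pthread_setspecific(tls_evl_key, evl);
}

/* Timer/collector creation asks for the loop of the calling thread, so a
 * processor initialized while running on its plugin thread schedules its
 * timers there rather than on the main pipeline loop. */
evloop_t *current_event_loop(void)
{
    pthread_once(&tls_evl_once, tls_evl_init);
    return (evloop_t *) pthread_getspecific(tls_evl_key);
}
```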
Hi @leonardo-albertovich
config.yaml
parsers_multiline.conf
dummy.sh
Great, I was able to reproduce the crash immediately with this, and so far it seems to validate my approach because the patched version doesn't crash.
I have a new version of the log generator (this time in Python, for better multithreaded log creation); the old version in Bash was not fast enough:
I've just confirmed our initial observations regarding mixed-up multiline messages. The output record looks like this:
You can use this python script to check your own output files:
This does not happen very often, but it does happen. I suspect it is caused by the same issue.
Thank you for the update. The original bash script worked reliably enough for me, but I'll check this new script to see if I can reproduce the mix-up issue and ensure that my PR covers both issues. I'm a bit behind schedule for this PR, but I'll try to wrap it up today. If I share a branch with my changes, would you be able to run some tests on your side to help me validate the patch?
@leonardo-albertovich Absolutely! Just let me know |
Sorry about the delay, @drbugfinder-work; these are the PRs for 4.0 and 3.2. Please let me know how it goes. These PRs fix the incorrect scheduling issue; they do not fix the leftover collector timer issue, but that shouldn't be the culprit here and I wanted to keep things properly compartmentalized.
@leonardo-albertovich Great! I'll test it! |
Bug Report
Describe the bug
When using threaded mode in filter_multiline, segmentation faults or deadlocks occur randomly (especially under high load).
I assume this is caused by a missing thread-safe implementation within the `flb_log_event_encoder` functions. There is also an auto-closed issue #6728, together with an open but outdated PR from @nokute78 (#6765), which describe a similar issue that is obviously still not fixed.
Example deadlock stacktraces:
- `flb_log_event_encoder_commit_record`
- `flb_log_event_encoder_dynamic_field_reset`

and similar stacktraces for other `flb_log_event_encoder` functions.
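For illustration, a minimal sketch of the kind of serialization the report suggests is missing: a shared encoder guarded by a mutex. The `encoder_t` type and `encoder_append_record()` are hypothetical stand-ins for the `flb_log_event_encoder_*` calls listed above, not real Fluent Bit code:

```c
/* Hypothetical sketch (not Fluent Bit code): if several threads append
 * records through one shared encoder, every append sequence would need
 * to be serialized, e.g. with a mutex held across the whole operation. */
#include <pthread.h>

typedef struct {
    int records;                       /* stand-in for shared encoder state */
} encoder_t;

static pthread_mutex_t enc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for a begin/append/commit sequence on the shared encoder. */
static int encoder_append_record(encoder_t *enc)
{
    enc->records++;
    return 0;
}

int append_record_locked(encoder_t *enc)
{
    int ret;

    pthread_mutex_lock(&enc_lock);     /* only one thread mutates the encoder */
    ret = encoder_append_record(enc);
    pthread_mutex_unlock(&enc_lock);

    return ret;
}
```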
Example stacktrace for segmentation fault crash:
@nokute78 (cc @edsiper) Was there a reason for #6765 not being merged (and updated to the current code base)?
To Reproduce
gdb -p <pid> --batch -ex "thread apply all bt" -ex "detach" -ex "quit"
Your Environment
Maybe related:
As I read in the announcement of v2.0.2, the memory ring buffer `mem_buf_limit` should be no less than 20M in size. As far as I understand the code, the `in_emitter` is used with `memrb` in the case of a threaded multiline filter. However, as I've already mentioned in #8473, there is this strange (and most probably wrong) assignment:
fluent-bit/plugins/in_emitter/emitter.c, line 245 (commit 9652b0d)
The default value for the flush frequency is 2000, so I assume this would set the ring buffer size to only 2k. Can you please verify this, @nokute78 @edsiper @leonardo-albertovich @pwhelan?
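To make the suspicion explicit, here is a tiny self-contained sketch of the arithmetic, with hypothetical variable names that do not come from the actual emitter.c code: if the ring buffer capacity were taken from the default flush frequency rather than from a dedicated size option, it would sit roughly four orders of magnitude below the ~20M guidance:

```c
/* Hypothetical illustration of the reporter's reading of the assignment;
 * these names are not taken from in_emitter/emitter.c. */
#include <stdio.h>

int main(void)
{
    long flush_frequency     = 2000;              /* default flush frequency   */
    long recommended_minimum = 20 * 1024 * 1024;  /* ~20M from the v2.0.2 note */

    /* Suspected effect: capacity derived from the flush frequency. */
    long suspected_capacity  = flush_frequency;

    printf("suspected ring buffer size: %ld\n", suspected_capacity);
    printf("recommended minimum:        %ld\n", recommended_minimum);
    return 0;
}
```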