feat(crashtracking): capture non signal based crashes#5321
feat(crashtracking): capture non signal based crashes#5321
Conversation
|
Thank you for updating Change log entry section 👏 Visited at: 2026-02-06 01:14:59 UTC |
c5d3fce to
e4b1623
Compare
|
✅ Tests 🎉 All green!❄️ No new flaky tests detected 🎯 Code Coverage 🔗 Commit SHA: 808b3f6 | Docs | Datadog PR Page | Was this helpful? Give us feedback! |
BenchmarksBenchmark execution time: 2026-02-06 22:46:19 Comparing candidate commit 808b3f6 in PR branch Found 2 performance improvements and 0 performance regressions! Performance is the same for 42 metrics, 2 unstable metrics. scenario:tracing - Propagation - Datadog
scenario:tracing - Tracing.log_correlation
|
Fmt fmt
Remove noisy log Update symbol name Check result, build message in ruby unit test and test cleanup Inline + no order dependency + cleanup Number of frames logic on ruby side frame processing in helper Restore accidentally deleted comment Update tags on fork Fmt Fix potential mem leak move to core clean Extract into helper Fix more potential leaks Fmt
6f5fc9b to
25077d0
Compare
Removed comment about Ruby exception crash reporting tests.
p-datadog
left a comment
There was a problem hiding this comment.
I read the C code and while nothing jumped out at me I also don't know if everything there is correct.
I left comments for the Ruby code.
In general, since we do have a crash tracker for crashes, I would like to see "unhandled exceptions" (and more precisely, "unhandled exceptions on main thread") NOT be referred to as "crashes" in Ruby code or documentation. I understand that eventually the libdatadog data structures will be created that have "crash" in their name, but I would prefer to see everything upstream of that use correct terminology and refer to "unhandled exceptions".
| rescue => e | ||
| # Don't let crash reporting itself crash the exit process |
There was a problem hiding this comment.
This only rescues StandardError-derived exceptions. If you want to be fully thorough in not permitting crash tracking issues affecting the application, you should rescue everything (with provisions for NoMemoryError, Interrupt and SystemExit again) as is for example done in https://github.com/DataDog/dd-trace-rb/blob/master/lib/datadog/di/instrumenter.rb#L191-L196.
There was a problem hiding this comment.
Yeah I understand but wouldn't we want to reraise if we get NoMemoryError, Interrupt, or SystemExit.
At that point the program is actually just done?
There was a problem hiding this comment.
Like we don't want to swallow SystemExit, ignore SIGTERM, or block process shutdown? Plz correct me if I am wrong
There was a problem hiding this comment.
You are correct on those three exception classes.
The question is whether you intend to rescue StandardError-derived exceptions only or all (except for those 3).
I asked claude which exceptions do not derive from StandardError and here is what it said:
Exceptions that don't inherit from StandardError:
- NoMemoryError - Out of memory condition
- ScriptError and its subclasses:
- LoadError - Failed to load a file
- NotImplementedError - Method not implemented on this platform
- SyntaxError - Syntax error in code - SecurityError - Security violation
- SignalException and its subclass:
- Interrupt - Raised when Ctrl-C is pressed (SIGINT) - SystemExit - Raised by exit or abort
- SystemStackError - Stack overflow
- fatal - Unrecoverable error (cannot be rescued)
Of these, LoadError (and NotImplementedError) are quite common.
Our existing rescues are mostly for StandardError (this is what gets rescued if no class is explicitly specified) therefore, if you just want to do what the library is already doing elsewhere, the current code in the diff is fine, but I think maybe this should be revisited library-wide (and making this change library-wide would be out of scope for this PR).
Flip negation
87fad47 to
808b3f6
Compare
| end | ||
| sleep 0.1 | ||
|
|
||
| raise StandardError, 'Test Ruby crash' |
There was a problem hiding this comment.
| raise StandardError, 'Test Ruby crash' | |
| raise StandardError, 'Test Ruby unhandled exception on main thread' |
| end | ||
| end | ||
|
|
||
| it 'reports Ruby exceptions via http when app crashes' do |
There was a problem hiding this comment.
| it 'reports Ruby exceptions via http when app crashes' do | |
| it 'reports Ruby unhandled exceptions via http' do |
| end | ||
| end | ||
|
|
||
| context 'Ruby exception crash reporting' do |
There was a problem hiding this comment.
| context 'Ruby exception crash reporting' do | |
| context 'Ruby unhandled exception reporting' do |
| logger.debug('Crashtracker failed to report unhandled exception to crash tracker') unless success | ||
| rescue => e | ||
| # don't let crash reporting itself raise an error | ||
| logger.debug("Crashtracker failed to report Ruby exception crash: #{e.message}") |
There was a problem hiding this comment.
| logger.debug("Crashtracker failed to report Ruby exception crash: #{e.message}") | |
| logger.debug("Crashtracker failed to report Ruby unhandled exception: #{e.class}: #{e.message}") |
What does this PR do?
This PR adds support for crash report collection and emission for non-signal based crashes. We do this by hooking into
at_exitand accessing the exception stack. We send the exception stack over from the Ruby side to the native code side, and use it to build a crash report. We also send a crash ping, mainly for parity.Native stack collection planned to be implemented but is out of scope for this stage.
Motivation:
Nice to see non-signal based crashes (not captured by regular errortracking) and was a feature request from SSI team.
Ticket: PROF-13673
Change log entry
Non-signal based crashes are caught and reported
Additional Notes:
How to test the change?
Unit tests
Run a test ruby program instrumented with the crashtracker and look at the report being sent.