feat(crashtracking): capture non signal based crashes #5321

Open
gyuheon0h wants to merge 7 commits into master from gyuheon0h/capture-non-signal-crash

Conversation

Contributor

@gyuheon0h gyuheon0h commented Feb 5, 2026

What does this PR do?
This PR adds support for collecting and emitting crash reports for non-signal based crashes. We do this by hooking into at_exit and reading the stack of the exception that is terminating the process. We pass that stack from the Ruby side to the native code, which uses it to build a crash report. We also send a crash ping, mainly for parity.

Native stack collection is planned but is out of scope for this stage.
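
A minimal sketch of the at_exit approach described above, not the code from this PR. It relies on `$!` holding the exception that is terminating the process when at_exit handlers run (nil on a normal exit). `reportable_exit_exception` and `REPORTER` are hypothetical names, the lambda only stands in for the call into native code, and skipping `SystemExit` is an assumption about what should count as reportable.

```ruby
# Hypothetical helper: decide whether the exception that is ending the
# process should be reported. Returns nil when there is nothing to report.
def reportable_exit_exception(exception)
  return nil if exception.nil?               # normal exit, no exception
  return nil if exception.is_a?(SystemExit)  # assumption: explicit exit/abort is not reported

  exception
end

# Hypothetical reporter: a stand-in for handing the data to native code.
REPORTER = lambda do |message, backtrace|
  warn("would report: #{message} (#{backtrace.length} frames)")
end

at_exit do
  # Inside at_exit, $! is the exception terminating the process, or nil.
  exception = reportable_exit_exception($!)
  next unless exception

  begin
    REPORTER.call(
      "Unhandled #{exception.class}: #{exception.message}",
      exception.backtrace || []
    )
  rescue => e
    # Reporting must not raise during process exit.
    warn("failed to report unhandled exception: #{e.class}: #{e.message}")
  end
end
```

The message format mirrors the `"message"` field in the sample report in this description.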

Motivation:
It is useful to see non-signal based crashes, which regular error tracking does not capture, and the SSI team requested this feature.

Ticket: PROF-13673
Change log entry
Non-signal based crashes are caught and reported

Additional Notes:

How to test the change?
Unit tests

Run a test ruby program instrumented with the crashtracker and look at the report being sent.

{
  "data_schema_version": "1.4",
  "error": {
    "is_crash": true,
    "kind": "UnhandledException",
    "message": "Unhandled ArgumentError: Test argument crash",
    "source_type": "Crashtracking",
    "stack": {
      "format": "Datadog Crashtracker 1.0",
      "frames": [
        {
          "file": "/home/bits/go/src/github.com/DataDog/dd-trace-rb/spec/datadog/core/crashtracking/component_spec.rb",
          "function": "block (4 levels) in <top (required)>",
          "line": 161
        },
        {
          "file": "/home/bits/go/src/github.com/DataDog/dd-trace-rb/spec/datadog/core/crashtracking/component_spec.rb",
          "function": "block (6 levels) in <top (required)>",
          "line": 168
        },
        ...
        {
          "file": "/var/lib/gems/3.0.0/gems/rspec-core-3.13.6/lib/rspec/core/runner.rb",
          "function": "invoke",
          "line": 45
        },
        {
          "file": "/var/lib/gems/3.0.0/gems/rspec-core-3.13.6/exe/rspec",
          "function": "<top (required)>",
          "line": 4
        },
        {
          "file": "/usr/local/bin/rspec",
          "function": "load",
          "line": 25
        },
        {
          "file": "/usr/local/bin/rspec",
          "function": "<main>",
          "line": 25
        }
      ],
      "incomplete": false
    }
  },
  "incomplete": false,
  "metadata": {
    "library_name": "dd-trace-rb",
    "library_version": "2.29.0",
    "family": "ruby",
    "tags": [
      "tag1:value1",
      "tag2:value2",
      "language:ruby-testing-123",
      "service:ruby-testing-123"
    ]
  },
  "os_info": {
    "architecture": "x86_64",
    "bitness": "64-bit",
    "os_type": "Ubuntu",
    "version": "22.4.0"
  },
  "proc_info": {
    "pid": 220117
  },
  "timestamp": "2026-02-06 00:25:31.590807434 UTC",
  "uuid": "9082567b-686a-4897-95cb-e596c929ba78"
}
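
The `file`, `function` and `line` fields in the frames above line up with what Ruby exposes through `Exception#backtrace_locations`. A rough sketch of that mapping follows, as an illustration rather than the PR's implementation. `frames_for` and the `max_frames` cap (including its default) are hypothetical.

```ruby
# Hypothetical helper: convert an exception's backtrace into hashes shaped
# like the "frames" entries of the report.
def frames_for(exception, max_frames: 400)
  # backtrace_locations is nil for an exception that was never raised.
  locations = exception.backtrace_locations || []

  locations.first(max_frames).map do |location|
    {
      file: location.path,       # path of the source file
      function: location.label,  # e.g. "<main>" or "block in <top (required)>"
      line: location.lineno      # line number within the file
    }
  end
end

begin
  raise ArgumentError, 'Test argument crash'
rescue ArgumentError => e
  frames_for(e).each { |frame| puts frame.inspect }
end
```

Capping the number of frames on the Ruby side keeps the native side simple, at the cost of truncating very deep stacks.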

@gyuheon0h gyuheon0h marked this pull request as ready for review February 5, 2026 20:52
@gyuheon0h gyuheon0h requested review from a team as code owners February 5, 2026 20:52
@gyuheon0h gyuheon0h marked this pull request as draft February 5, 2026 20:52

github-actions bot commented Feb 5, 2026

Thank you for updating Change log entry section 👏

Visited at: 2026-02-06 01:14:59 UTC

@github-actions github-actions bot added the core Involves Datadog core libraries label Feb 5, 2026
@gyuheon0h gyuheon0h force-pushed the gyuheon0h/capture-non-signal-crash branch 2 times, most recently from c5d3fce to e4b1623 Compare February 5, 2026 21:35

datadog-official bot commented Feb 6, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage
Patch Coverage: 86.59%
Overall Coverage: 95.17%

🔗 Commit SHA: 808b3f6


pr-commenter bot commented Feb 6, 2026

Benchmarks

Benchmark execution time: 2026-02-06 22:46:19

Comparing candidate commit 808b3f6 in PR branch gyuheon0h/capture-non-signal-crash with baseline commit 7631952 in branch master.

Found 2 performance improvements and 0 performance regressions! Performance is the same for 42 metrics, 2 unstable metrics.

scenario:tracing - Propagation - Datadog

  • 🟩 throughput [+3204.958op/s; +3279.084op/s] or [+11.190%; +11.449%]

scenario:tracing - Tracing.log_correlation

  • 🟩 throughput [+6094.284op/s; +6354.678op/s] or [+6.068%; +6.328%]

@gyuheon0h gyuheon0h marked this pull request as ready for review February 6, 2026 01:05
Member

@ivoanjo ivoanjo left a comment

I've given it a pass!

@gyuheon0h gyuheon0h requested a review from ivoanjo February 6, 2026 19:16
Revert "Gitignore weird files that keep popping up (will pop this commit later)"

This reverts commit aeb3017.

Revert "Remove VS Code config files from tracking"

This reverts commit 2b30b86.

Use locations array

Clean

Lazy logging

Fix memory leak
Fmt

fmt
Remove noisy log

Update symbol name

Check result, build message in ruby

unit test and test cleanup

Inline + no order dependency + cleanup

Number of frames logic on ruby side

frame processing in helper

Restore accidentally deleted comment

Update tags on fork

Fmt

Fix potential mem leak

move to core

clean

Extract into helper

Fix more potential leaks

Fmt
@gyuheon0h gyuheon0h force-pushed the gyuheon0h/capture-non-signal-crash branch from 6f5fc9b to 25077d0 Compare February 6, 2026 19:50
Removed comment about Ruby exception crash reporting tests.
Member

@p-datadog p-datadog left a comment

I read the C code and, while nothing jumped out at me, I also don't know if everything there is correct.

I left comments for the Ruby code.

In general, since we do have a crash tracker for crashes, I would like to see "unhandled exceptions" (and more precisely, "unhandled exceptions on main thread") NOT be referred to as "crashes" in Ruby code or documentation. I understand that libdatadog data structures with "crash" in their name will eventually be created, but I would prefer to see everything upstream of that use correct terminology and refer to "unhandled exceptions".

Comment on lines +41 to +42
rescue => e
# Don't let crash reporting itself crash the exit process
Member

This only rescues StandardError-derived exceptions. If you want to be fully thorough in preventing crash tracking issues from affecting the application, you should rescue everything (with provisions for NoMemoryError, Interrupt and SystemExit again), as is done for example in https://github.com/DataDog/dd-trace-rb/blob/master/lib/datadog/di/instrumenter.rb#L191-L196.

Contributor Author

Yeah, I understand, but wouldn't we want to re-raise if we get NoMemoryError, Interrupt, or SystemExit?

At that point, isn't the program actually just done?

Contributor Author

That is, we don't want to swallow SystemExit, ignore SIGTERM, or block process shutdown, right? Please correct me if I am wrong.

Member

You are correct on those three exception classes.

The question is whether you intend to rescue StandardError-derived exceptions only or all (except for those 3).

I asked Claude which exceptions do not derive from StandardError, and here is what it said:

Exceptions that don't inherit from StandardError:

  1. NoMemoryError - Out of memory condition
  2. ScriptError and its subclasses:
    - LoadError - Failed to load a file
    - NotImplementedError - Method not implemented on this platform
    - SyntaxError - Syntax error in code
  3. SecurityError - Security violation
  4. SignalException and its subclass:
    - Interrupt - Raised when Ctrl-C is pressed (SIGINT)
  5. SystemExit - Raised by exit or abort
  6. SystemStackError - Stack overflow
  7. fatal - Unrecoverable error (cannot be rescued)

Of these, LoadError (and NotImplementedError) are quite common.

Our existing rescues are mostly for StandardError (this is what gets rescued if no class is explicitly specified). Therefore, if you just want to do what the library is already doing elsewhere, the current code in the diff is fine. I think this should perhaps be revisited library-wide, but making that change library-wide would be out of scope for this PR.
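
To make the two options in this thread concrete, here is a small sketch, under stated assumptions, of the difference between a bare `rescue` and a broad rescue that re-raises the three classes discussed above. `bare_rescue_catches?` and `safely_report` are hypothetical names, and this is not the code at the linked instrumenter.rb lines.

```ruby
# Returns true when a bare `rescue` (StandardError only) catches the class,
# false when the exception escapes it.
def bare_rescue_catches?(error_class)
  begin
    raise error_class, 'demo'
  rescue => _e
    true
  end
rescue Exception # rubocop:disable Lint/RescueException
  false
end

# Hypothetical wrapper: run the reporting block without letting its own
# failures escape, except for exceptions that must keep propagating.
def safely_report
  yield
  true
rescue NoMemoryError, Interrupt, SystemExit
  raise # never swallow these; let the process shut down as requested
rescue Exception => e # rubocop:disable Lint/RescueException
  warn("failed to report unhandled exception: #{e.class}: #{e.message}")
  false
end

puts bare_rescue_catches?(ArgumentError)        # StandardError subclass
puts bare_rescue_catches?(NotImplementedError)  # ScriptError subclass
puts safely_report { raise LoadError, 'demo' }
```

One caveat with this sketch: Ruby's default SIGTERM handling raises a `SignalException` that is not an `Interrupt`, so the second clause would swallow it. Re-raising `SignalException` instead of `Interrupt` would cover that case.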

@gyuheon0h gyuheon0h force-pushed the gyuheon0h/capture-non-signal-crash branch from 87fad47 to 808b3f6 Compare February 6, 2026 22:16
end
sleep 0.1

raise StandardError, 'Test Ruby crash'
Member

Suggested change
raise StandardError, 'Test Ruby crash'
raise StandardError, 'Test Ruby unhandled exception on main thread'

end
end

it 'reports Ruby exceptions via http when app crashes' do
Member

Suggested change
it 'reports Ruby exceptions via http when app crashes' do
it 'reports Ruby unhandled exceptions via http' do

end
end

context 'Ruby exception crash reporting' do
Member

Suggested change
context 'Ruby exception crash reporting' do
context 'Ruby unhandled exception reporting' do

logger.debug('Crashtracker failed to report unhandled exception to crash tracker') unless success
rescue => e
# don't let crash reporting itself raise an error
logger.debug("Crashtracker failed to report Ruby exception crash: #{e.message}")
Member

Suggested change
logger.debug("Crashtracker failed to report Ruby exception crash: #{e.message}")
logger.debug("Crashtracker failed to report Ruby unhandled exception: #{e.class}: #{e.message}")

Labels

core Involves Datadog core libraries

4 participants