Skip to content

DEBUG-3573 send AppClientConfigurationChange telemetry event when subsequent telemetry instances start #4678

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 47 commits into
base: master
Choose a base branch
from

Conversation

p-datadog
Copy link
Member

@p-datadog p-datadog commented May 21, 2025

What does this PR do?

When tracer is reconfigured (via Datadog.configure), and telemetry is enabled in both previous and new configurations, telemetry will now emit AppClientConfigurationChange from the second instance when previously no event was emitted.

Motivation:

Previously, only AppStarted event was emitted for programmatic and environment configuration, and only the first time the tracer was configured. This caused configuration state to not be sent to the backend when Datadog.configure was used to enable a product (for example, dynamic instrumentation).

Change log entry
Yes: improve dynamic instrumentation UI reporting of application and host status

Additional Notes:

AppClientConfigurationChange currently sends the complete configuration (same as what AppStarted would have sent). Ideally it should only send changed parameters.

This PR does not change behavior of tracer when telemetry is turned on or off. For example, turning telemetry off would NOT presently produce AppClosing event (assuming tracer is reconfigured rather than shut down).

Besides AppClientConfigurationChange, there is also AppIntegrationsChange. This event is not changed by this PR, it is still not being sent on second and subsequent telemetry initializations. It probably should be sent in theory but there is no immediate demand for it. (And it should also send changed state, not all state.)

This PR also makes two fixes to telemetry:

  1. Previously, submitting events to telemetry would discard them if the initial event (used to be AppStarted, now it's either AppStarted or AppClientConfigurationChange) hasn't been sent yet. Now the submitted events are added to the queue and will be sent out after the initial event. The queue is (and was) already bounded. Since the initial event can fail and will be retried, there is a window of what looks to be 50 seconds where events could have been getting discarded).

  2. The initial event is sent when the worker is performing sending of some other event. Previously, If initial event fails to send, the other event would be dropped. Now the dequeued event is put back into the queue. (Currently the dequeued event is added to the end of the queue which is not the correct place for it, to avoid shifting all other events down one.)

How to test the change?
Integration tests have been added that verify the AppClientConfigurationChange is being sent when dynamic instrumentation is enabled programmatically along with several other smaller scope tests.

The change is also tested manually.

Copy link

github-actions bot commented May 21, 2025

Thank you for updating Change log entry section 👏

Visited at: 2025-05-27 21:11:12 UTC

@github-actions github-actions bot added the core Involves Datadog core libraries label May 21, 2025
@datadog-datadog-prod-us1
Copy link
Contributor

datadog-datadog-prod-us1 bot commented May 21, 2025

Datadog Report

Branch report: telemetry-lifecycle
Commit report: 0023b94
Test service: dd-trace-rb

✅ 0 Failed, 21706 Passed, 1299 Skipped, 4m 28.29s Total Time

@p-datadog p-datadog force-pushed the telemetry-lifecycle branch from 8757579 to a616bbd Compare May 23, 2025 20:52
@pr-commenter
Copy link

pr-commenter bot commented May 23, 2025

Benchmarks

Benchmark execution time: 2025-05-27 22:03:46

Comparing candidate commit 0023b94 in PR branch telemetry-lifecycle with baseline commit ff8ae5b in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 42 metrics, 3 unstable metrics.

@p-datadog p-datadog force-pushed the telemetry-lifecycle branch from a616bbd to 0af0a62 Compare May 26, 2025 21:13
@codecov-commenter
Copy link

codecov-commenter commented May 27, 2025

Codecov Report

Attention: Patch coverage is 99.63899% with 1 line in your changes missing coverage. Please review.

Project coverage is 97.64%. Comparing base (ff8ae5b) to head (0023b94).

Files with missing lines Patch % Lines
spec/datadog/core/telemetry/worker_spec.rb 94.44% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4678      +/-   ##
==========================================
- Coverage   97.64%   97.64%   -0.01%     
==========================================
  Files        1470     1474       +4     
  Lines       87685    87829     +144     
  Branches     4544     4552       +8     
==========================================
+ Hits        85624    85760     +136     
- Misses       2061     2069       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@p-datadog p-datadog force-pushed the telemetry-lifecycle branch from 721be22 to 8196384 Compare May 27, 2025 20:15
@p-datadog p-datadog changed the title telemetry lifecycle DEBUG-3573 send AppClientConfigurationChange telemetry event when subsequent telemetry instances start May 27, 2025
@p-datadog p-datadog added this to the 2.17.0 milestone May 27, 2025
@p-datadog p-datadog marked this pull request as ready for review May 27, 2025 23:27
@p-datadog p-datadog requested a review from a team as a code owner May 27, 2025 23:27
Copy link
Member

@Strech Strech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change is a very good one, but I think we should consider moving it into the components tree instead of introducing new abstraction somewhere else.

class ComponentsState
def initialize(components)
@telemetry_enabled = components.telemetry.enabled
@remote_started = components.remote&.started?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to convert it into the boolean, otherwise we may return nil

irb(main):001> nil&.started?
=> nil
Suggested change
@remote_started = components.remote&.started?
@remote_started = !!components.remote&.started?
``

module Core
module Configuration
# Stores the state of component tree when replacing the tree.
class ComponentsState
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question about design decision here.

I see the ease of use of ComponentsState as a separate abstraction which is able to analyze some details about the components tree, but why it's not a part of the components tree itself?

A good example of the current design flaw is this part

          telemetry.emit_closing! unless replacement&.telemetry&.enabled

when inside the components itself we have to use directly some internals of the other components tree, instead of asking the components tree itself (violation of encapsulation), like this:

          telemetry.emit_closing! unless replacement.telemetry_enabled?

IMHO we should incorporate the state into components tree and ask questions, instead of introducing the missing API somewhere else.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Involves Datadog core libraries
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants