[EXPORTER] Support handling retry-able errors for OTLP/gRPC #3219

Merged: 20 commits, Jan 21, 2025

Contributor

@chusitoo chusitoo commented Dec 22, 2024

Fixes #2049

Changes

This change introduces a retry policy in the OTLP/gRPC exporter for select failures, using the gRPC service config mechanism.

  • Add support for setting retry values via environment variables.
  • Retries are enabled by default, using the same configuration values as OTel Java and .NET.
  • Users can opt out of retries by setting any (or all) of the retry settings to zero.
  • The service config JSON is set when creating the gRPC channel only if all retry parameters are non-zero.
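
To illustrate the last point, here is a minimal sketch of how such a service config JSON could be assembled. It is not the code from this PR: the `RetryPolicy` struct and the `MakeRetryServiceConfig` function are hypothetical names, and the list of retryable status codes is an assumption based on the OTLP specification.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical holder for the four retry settings (names are illustrative).
struct RetryPolicy
{
  std::uint32_t max_attempts;
  float initial_backoff_s;  // seconds
  float max_backoff_s;      // seconds
  float backoff_multiplier;
};

// Builds a gRPC service config JSON string with a retry policy.
// Returns an empty string when any setting is zero, meaning retries are off.
std::string MakeRetryServiceConfig(const RetryPolicy &p)
{
  if (p.max_attempts == 0 || p.initial_backoff_s == 0.0f || p.max_backoff_s == 0.0f ||
      p.backoff_multiplier == 0.0f)
  {
    return {};
  }

  char buf[512];
  // Durations are formatted with one decimal place, e.g. "1.5s".
  // The status code list below is an assumption, not taken from the PR.
  std::snprintf(buf, sizeof(buf),
                R"({"methodConfig":[{"name":[{}],"retryPolicy":{)"
                R"("maxAttempts":%u,"initialBackoff":"%.1fs","maxBackoff":"%.1fs",)"
                R"("backoffMultiplier":%.1f,)"
                R"("retryableStatusCodes":["CANCELLED","DEADLINE_EXCEEDED","ABORTED",)"
                R"("OUT_OF_RANGE","DATA_LOSS","UNAVAILABLE"]}}]})",
                static_cast<unsigned>(p.max_attempts), p.initial_backoff_s, p.max_backoff_s,
                p.backoff_multiplier);
  return buf;
}
```

Calling `MakeRetryServiceConfig(RetryPolicy{5, 1.0f, 5.0f, 1.5f})` (sample values, assumed here for illustration) yields a non-empty config, while zeroing any field yields an empty string. In gRPC C++, a string like this can be passed to the channel through `grpc::ChannelArguments::SetServiceConfigJSON`.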

The changes to support retries for the OTLP/HTTP exporter are addressed in #3223.

For significant contributions please make sure you have completed the following items:

  • CHANGELOG.md updated for non-trivial changes
  • Unit tests have been added
  • Changes in public API reviewed

netlify bot commented Dec 22, 2024

Deploy Preview for opentelemetry-cpp-api-docs canceled.

Name Link
🔨 Latest commit b537a49
🔍 Latest deploy log https://app.netlify.com/sites/opentelemetry-cpp-api-docs/deploys/678fcb67991c1f00082baf5c

codecov bot commented Dec 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.77%. Comparing base (d2ff95a) to head (b537a49).
Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3219      +/-   ##
==========================================
- Coverage   87.78%   87.77%   -0.01%     
==========================================
  Files         198      198              
  Lines        6308     6308              
==========================================
- Hits         5537     5536       -1     
- Misses        771      772       +1     

see 1 file with indirect coverage changes

@chusitoo chusitoo changed the title Support handling Retryable error for OTLP/gRPC exporter Support handling retry-able errors for OTLP/gRPC exporter Dec 22, 2024
{
TestTraceService(std::vector<grpc::StatusCode> status_codes) : status_codes_(status_codes) {}

inline grpc::Status Export(
Contributor Author

It does not seem possible, at least not faithfully, to test this by mocking Export() on the client side.
Instead, the test uses the real exporter and observes the client's retry behavior directly on the server side.

@@ -357,6 +363,205 @@ TEST_F(OtlpGrpcExporterTestPeer, ConfigUnknownInsecureFromEnv)
}
# endif

# ifndef NO_GETENV
TEST_F(OtlpGrpcExporterTestPeer, ConfigRetryDefaultValues)
Contributor Author

@chusitoo chusitoo Dec 23, 2024

The defaults match what is currently implemented in other flavors of OTel (verified against .NET, Java, and JS). I don't believe the Rust implementation has a retry policy in place, though.

@chusitoo chusitoo changed the title Support handling retry-able errors for OTLP/gRPC exporter [EXPORTER] Support handling retry-able errors for OTLP/gRPC Dec 30, 2024
@chusitoo chusitoo marked this pull request as ready for review January 1, 2025 23:40
@chusitoo chusitoo requested a review from a team as a code owner January 1, 2025 23:40
std::uint32_t retry_policy_max_attempts{};

/** The initial backoff delay between retry attempts, random between (0, initial_backoff). */
float retry_policy_initial_backoff{};
Member

Why not use std::chrono::duration<> here?

Contributor Author

@chusitoo chusitoo Jan 3, 2025

I had that as a chrono duration initially, but it was not really of any use for OTLP/gRPC, since the value is just passed down to the service config. The chrono type was therefore moved to OTLP/HTTP, where it is needed for the backoff computations.

FYI, the implementation in a previous commit looked like this: cb14857

Member

As I understand it, exporting over either OTLP/HTTP or OTLP/gRPC costs far more CPU than the type conversion here. I think it is more important to make clear what this parameter means (neither the name nor the comment tells us the meaning or the unit of this variable). A float also has rounding error (epsilon) and is less precise.
What do you think?

Contributor Author

IMHO, the precision concern is fairly subjective: examples of normal use cases are limited to a single decimal place (which is also how the value is formatted here before the config settings are passed to the gRPC library). That seems logical, given that measuring backoff in tens of milliseconds or less is probably a very niche requirement.

I agree that a chrono duration makes the type more descriptive. Part of the reason I went back to float was that I could not find a common place to alias it to a more descriptive name without repeating the alias in at least one more header file (for instance, otlp_environment.h and http_client.h).

For now, I will revert/update this in #3223 and wait until that PR is approved and merged, to avoid duplicating these work-in-progress changes to the common code.
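
For illustration only, the trade-off discussed in this thread can be sketched as follows: a `std::chrono::duration<float>` keeps the unit (seconds) in the type, and it can still be formatted with a single decimal place for the service config. The `BackoffSeconds` alias and the `FormatBackoff` function are hypothetical names, not identifiers from this PR.

```cpp
#include <chrono>
#include <cstdio>
#include <string>

// Hypothetical alias: fractional seconds, so the unit is part of the type.
using BackoffSeconds = std::chrono::duration<float>;

// Formats a backoff as a duration string with one decimal place, e.g. "1.5s".
std::string FormatBackoff(BackoffSeconds backoff)
{
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%.1fs", backoff.count());
  return buf;
}
```

Because the representation is floating point, integer durations such as `std::chrono::milliseconds{500}` convert implicitly, so callers can pass whichever unit is convenient.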

Member

@marcalff marcalff left a comment

Great work, thanks.

More comments to follow.

exporters/otlp/test/otlp_grpc_exporter_test.cc (outdated, resolved)
@chusitoo chusitoo force-pushed the RetryableErrorGrpc branch 2 times, most recently from 64b875c to 0a8bc82 Compare January 7, 2025 15:08
@chusitoo chusitoo force-pushed the RetryableErrorGrpc branch 2 times, most recently from 6a92aa0 to 7f3d420 Compare January 7, 2025 15:31
@chusitoo chusitoo force-pushed the RetryableErrorGrpc branch from 7f3d420 to 8677f89 Compare January 7, 2025 16:11
@marcalff
Member

Please merge from main to resolve conflicts.
Now that the OTLP HTTP PR is merged, resuming review for OTLP GRPC.

Member

@marcalff marcalff left a comment

LGTM, thanks for the feature.

@marcalff
Member

I would like to merge #3248 first, so that the maintainers' tests cover gRPC in the functional tests.

This is to ensure that the changes from this PR work properly in the functional tests.

@marcalff marcalff added the pr:please-review This PR is ready for review label Jan 17, 2025
@marcalff marcalff merged commit 031307b into open-telemetry:main Jan 21, 2025
57 checks passed
malkia added a commit to malkia/opentelemetry-cpp that referenced this pull request Jan 21, 2025
[EXPORTER] Support handling retry-able errors for OTLP/gRPC (open-telemetry#3219)
Successfully merging this pull request may close these issues.

Support handling Retryable error for OTLP exporter (OTLP/gRPC and OTLP/HTTP)