Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EXPORTER] Support handling retry-able errors for OTLP/HTTP #3223

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

chusitoo
Copy link
Contributor

@chusitoo chusitoo commented Dec 30, 2024

Fixes #2049

Changes

This change introduces a retry mechanism for OTLP/HTTP for select failures, mimicking the same exponential backoff approach used in OTLP/gRPC.

  • Add support to set retry values via environment variables.
  • Enabled by default, using the same configuration values as in OTel dotnet, java and js.
  • Users can opt-out of the retry capabilities by zeroing out any (or all) of the retry settings.
  • Similar to OTLP/gRPC, retries are transparent to the user so only the last attempt is "bubbled up" as the actual response.

The changes to support retries for OTLP/gRPC exporter are addressed in #3219

For significant contributions please make sure you have completed the following items:

  • CHANGELOG.md updated for non-trivial changes
  • Unit tests have been added
  • Changes in public API reviewed

Copy link

netlify bot commented Dec 30, 2024

Deploy Preview for opentelemetry-cpp-api-docs canceled.

Name Link
🔨 Latest commit 082597a
🔍 Latest deploy log https://app.netlify.com/sites/opentelemetry-cpp-api-docs/deploys/6772c02f4c1c9800081accd4

Copy link

netlify bot commented Dec 30, 2024

Deploy Preview for opentelemetry-cpp-api-docs canceled.

Name Link
🔨 Latest commit 20b347f
🔍 Latest deploy log https://app.netlify.com/sites/opentelemetry-cpp-api-docs/deploys/677dfaa5f4227e0008acaa97

@chusitoo chusitoo changed the title Support handling retry-able errors for OTLP/HTTP exporter [EXPORTER] Support handling retry-able errors for OTLP/HTTP Dec 30, 2024
Copy link

codecov bot commented Dec 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.21%. Comparing base (bb68f49) to head (20b347f).
Report is 1 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3223      +/-   ##
==========================================
+ Coverage   88.16%   88.21%   +0.06%     
==========================================
  Files         198      198              
  Lines        6224     6259      +35     
==========================================
+ Hits         5487     5521      +34     
- Misses        737      738       +1     
Files with missing lines Coverage Δ
sdk/src/common/env_variables.cc 99.03% <100.00%> (+0.50%) ⬆️

... and 1 file with indirect coverage changes

(retry_attempts_ < retry_policy_.max_attempts);
}

std::chrono::system_clock::time_point HttpOperation::NextRetryTime()
Copy link
Contributor Author

@chusitoo chusitoo Dec 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic in this function is modeled after the exponential backoff from the gRPC client retry policy so that both, OTLP gRPC and HTTP, behave more or less consistently.


if (operation->IsRetryable())
{
self->pending_to_retry_sessions_.push_front(session);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is safe being lock-free because it is only managed by the background thread and not public. In case the session is removed by doAbortSessions() or doRemoveSessions(), the pointer would be ignored and removed when processed by doRetrySessions()

@chusitoo chusitoo marked this pull request as ready for review January 1, 2025 23:53
@chusitoo chusitoo requested a review from a team as a code owner January 1, 2025 23:53
Copy link
Member

@marcalff marcalff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work.

This is a first pass of comments, more to follow,
see some changes on environment variables


if (GetStringDualEnvVar(signal_env.data(), generic_env.data(), value))
{
return static_cast<std::uint32_t>(std::strtoul(value.c_str(), nullptr, 10));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, any garbage in the environment variable that does not fit into a ul string will be silently ignored.

Please implement GetUIntEnvironmentVariable in sdk/common instead, and log warnings when invalid strings are found.

See existing code for Bool and Duration.


if (GetStringDualEnvVar(signal_env.data(), generic_env.data(), value))
{
return std::strtof(value.c_str(), nullptr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, implement GetFloatEnvironmentVariable in sdk/common.

return false;
}

if (!ParseNumber(raw_value.c_str(), value))
Copy link
Member

@marcalff marcalff Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of implementing ParseNumber(), how about:

const char* start = value.c_str();
const char* end = start + value.length();
const char *actual_end = nullptr;

value = std::strtoul(start, &actual_end, 10)

if (actual_end != end)
{
... complain about garbage and fail ...
}

Whether std::strtoul() strips whitespace or not is not the issue, the original concern was to make sure that:

ENV_VAR="not even a number"
ENV_VAR="42 and some change"

is correctly rejected, because the whole raw string is not consumed.

In my understanding (not tested), this should take care of the negative sign as well.

To put it differently, as long as std::strtoul() accepts the whole raw string, the string is deemed valid, we don't want to reimplement strtoul here.

Copy link
Contributor Author

@chusitoo chusitoo Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, I misunderstood the previous comment in that strtoul and strtof were not good enough to report any errors. I can revert to something less elaborate.

@chusitoo chusitoo force-pushed the RetryableErrorHttp branch from 266e6e5 to 25b332c Compare January 7, 2025 16:52
Copy link
Member

@marcalff marcalff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing review,

environment variables not in the spec should use the OTEL_CPP_xxx namespace.

Comment on lines 1176 to 1177
constexpr char kSignalEnv[] = "OTEL_EXPORTER_OTLP_METRICS_RETRY_MAX_ATTEMPTS";
constexpr char kGenericEnv[] = "OTEL_EXPORTER_OTLP_RETRY_MAX_ATTEMPTS";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new environment variables are not in the spec (yet).

Please use the OTEL_CPP_xxx namespace, as in OTEL_CPP_EXPORTER_OTLP_METRICS_RETRY_MAX_ATTEMPTS, and likewise for all friends.

Once the spec is extended, this will be revisited to use the official names.

}
else
{
++retry_it;
Copy link
Member

@owent owent Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we break and returns true here?Or the background thread may exit if all retry sessions are pending.
It's a FIFO list, so I think when the first session is pending, the rest are all pending.

Copy link
Contributor Author

@chusitoo chusitoo Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My impression would be that we want to "reschedule" as many operation that are ready as possible.
This is because curl_multi_info_read on L465 runs in a tight loop and should be able to process several handles "simultaneously".
Otherwise, we'd need to break out of that loop and look here again to see who is next.

A tradeoff to short-circuiting this list and exit at the first handle that is still scheduled sometime "in the future" is that it assumes that everything else until the head of the list would be newer and therefore not ready. I don't believe that it is possible in practice to have handles that are out of order, but I also did not spend enough time trying to exclude that possibility.

@chusitoo chusitoo force-pushed the RetryableErrorHttp branch from 3c6e758 to 20b347f Compare January 8, 2025 04:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support handling Retryable error for OTLP exporter (OTLP/gRPC and OTLP/HTTP)
3 participants