[EXPORTER] Support handling retry-able errors for OTLP/HTTP #3223
✅ Deploy Preview for opentelemetry-cpp-api-docs canceled.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3223      +/-   ##
==========================================
+ Coverage   88.16%   88.21%   +0.06%
==========================================
  Files         198      198
  Lines        6224     6259      +35
==========================================
+ Hits         5487     5521      +34
- Misses        737      738       +1
c4d037c to 2e5d7d8
2e5d7d8 to 48402d9
(retry_attempts_ < retry_policy_.max_attempts);
}

std::chrono::system_clock::time_point HttpOperation::NextRetryTime()
The logic in this function is modeled after the exponential backoff in the gRPC client retry policy, so that OTLP gRPC and OTLP HTTP behave more or less consistently.
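For readers unfamiliar with that policy, here is a minimal sketch of the backoff computation, not the code from this PR. The `RetryPolicy` struct, the `BackoffCeiling` helper, the free-function form of `NextRetryTime`, and the 0.8 to 1.2 jitter range are all assumptions made for illustration; only the general shape (initial backoff, multiplier, cap) is taken from the gRPC retry policy.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <random>

// Hypothetical policy type; the field names mirror the gRPC retry policy.
struct RetryPolicy
{
  std::uint32_t max_attempts;
  std::chrono::duration<float> initial_backoff;
  std::chrono::duration<float> max_backoff;
  float backoff_multiplier;
};

// Delay ceiling before retry number `attempt` (1-based):
// min(initial_backoff * multiplier^(attempt - 1), max_backoff).
inline std::chrono::duration<float> BackoffCeiling(const RetryPolicy &policy,
                                                   std::uint32_t attempt)
{
  const std::uint32_t n = (attempt == 0) ? 1 : attempt;  // treat 0 as the first retry
  const float factor    = std::pow(policy.backoff_multiplier, static_cast<float>(n - 1));
  return std::min(policy.initial_backoff * factor, policy.max_backoff);
}

// Applies random jitter to the ceiling and converts it to an absolute time.
// The jitter range used here is an assumption, not taken from this PR.
template <typename Rng>
std::chrono::system_clock::time_point NextRetryTime(const RetryPolicy &policy,
                                                    std::uint32_t attempt,
                                                    std::chrono::system_clock::time_point now,
                                                    Rng &rng)
{
  std::uniform_real_distribution<float> jitter(0.8f, 1.2f);
  const auto delay = BackoffCeiling(policy, attempt) * jitter(rng);
  return now + std::chrono::duration_cast<std::chrono::system_clock::duration>(delay);
}
```

With a 1 s initial backoff, a multiplier of 2, and a 5 s cap, the ceilings for successive retries would be 1 s, 2 s, 4 s, then 5 s from the fourth retry onward.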
if (operation->IsRetryable())
{
  self->pending_to_retry_sessions_.push_front(session);
I believe this is safe being lock-free because it is only managed by the background thread and not public. In case the session is removed by doAbortSessions() or doRemoveSessions(), the pointer would be ignored and removed when processed by doRetrySessions().
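The "ignored and removed" behavior can be illustrated with a small single-threaded model. This is a sketch under assumptions, not the PR's implementation: `Session`, `RetryQueue`, and all member names below are hypothetical stand-ins for the real session bookkeeping.

```cpp
#include <cstddef>
#include <deque>
#include <memory>
#include <unordered_map>
#include <vector>

struct Session  // hypothetical stand-in for an HTTP session
{
  int id;
};

// All members are touched by one thread only, so no lock is taken.
class RetryQueue
{
public:
  void Add(int id) { sessions_[id] = std::make_shared<Session>(Session{id}); }

  // Models removal by an abort/remove path: the retry queue is not touched here.
  void Remove(int id) { sessions_.erase(id); }

  void ScheduleRetry(int id)
  {
    auto it = sessions_.find(id);
    if (it != sessions_.end())
    {
      pending_to_retry_.push_front(it->second);
    }
  }

  // Drains the queue oldest-first. Entries whose session was removed in the
  // meantime are dropped silently; the rest are reported as resubmitted.
  std::vector<int> ProcessRetries()
  {
    std::vector<int> resubmitted;
    while (!pending_to_retry_.empty())
    {
      std::shared_ptr<Session> session = pending_to_retry_.back();
      pending_to_retry_.pop_back();
      auto it = sessions_.find(session->id);
      if (it != sessions_.end() && it->second == session)
      {
        resubmitted.push_back(session->id);
      }
    }
    return resubmitted;
  }

  std::size_t PendingCount() const { return pending_to_retry_.size(); }

private:
  std::unordered_map<int, std::shared_ptr<Session>> sessions_;
  std::deque<std::shared_ptr<Session>> pending_to_retry_;
};
```

The stale entry costs one lookup when the queue is drained, which is the price of not synchronizing the removal paths with the retry queue.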
d7be7c0 to 9d0aa22
9d0aa22 to 4de2f81
Excellent work.
This is a first pass of comments; more to follow.
See the requested changes on environment variables.
if (GetStringDualEnvVar(signal_env.data(), generic_env.data(), value))
{
  return static_cast<std::uint32_t>(std::strtoul(value.c_str(), nullptr, 10));
Here, any garbage in the environment variable that does not parse as an unsigned long will be silently ignored.
Please implement GetUIntEnvironmentVariable in sdk/common instead, and log warnings when invalid strings are found.
See existing code for Bool and Duration.
if (GetStringDualEnvVar(signal_env.data(), generic_env.data(), value))
{
  return std::strtof(value.c_str(), nullptr);
Likewise, implement GetFloatEnvironmentVariable in sdk/common.
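As a rough idea of what strict float parsing could look like, here is a sketch built on `std::strtof`. The name `ParseFloatStrict` is hypothetical and is not the helper requested above; rejecting non-finite values such as "inf" and "nan" is a design assumption, and the warning log is left to the caller.

```cpp
#include <cerrno>
#include <cmath>
#include <cstdlib>
#include <string>

// Returns true and sets `value` only if the whole string is a finite float.
inline bool ParseFloatStrict(const std::string &raw, float &value)
{
  if (raw.empty())
  {
    return false;
  }
  const char *start = raw.c_str();
  char *actual_end  = nullptr;
  errno             = 0;
  const float parsed = std::strtof(start, &actual_end);
  // Nothing consumed, or trailing garbage left over: reject.
  if (actual_end == start || actual_end != start + raw.length())
  {
    return false;
  }
  // Out-of-range input, or "inf"/"nan" (assumed to be unwanted here): reject.
  if (errno == ERANGE || !std::isfinite(parsed))
  {
    return false;
  }
  value = parsed;
  return true;
}
```

A caller would log a warning and fall back to the default whenever this returns false.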
sdk/src/common/env_variables.cc
Outdated
  return false;
}

if (!ParseNumber(raw_value.c_str(), value))
Instead of implementing ParseNumber(), how about:
const char *start = raw_value.c_str();
const char *end = start + raw_value.length();
char *actual_end = nullptr;  // std::strtoul() takes a char **, not a const char **
value = std::strtoul(start, &actual_end, 10);
if (actual_end != end)
{
  ... complain about garbage and fail ...
}
Whether std::strtoul() strips whitespace or not is not the issue; the original concern was to make sure that:
ENV_VAR="not even a number"
ENV_VAR="42 and some change"
are correctly rejected, because the whole raw string is not consumed.
In my understanding (not tested), this should take care of the negative sign as well.
To put it differently, as long as std::strtoul() accepts the whole raw string, the string is deemed valid; we don't want to reimplement strtoul here.
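A self-contained sketch of that whole-string check follows. The function name `ParseUInt32Strict` is hypothetical. One detail worth verifying: `std::strtoul` accepts a leading minus sign and returns the negated value converted to unsigned, so the end-pointer check alone would let "-1" through. The sketch therefore rejects a minus sign explicitly and also checks the range.

```cpp
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <string>

// Returns true and sets `value` only if strtoul consumes the whole string
// and the result fits in 32 bits.
inline bool ParseUInt32Strict(const std::string &raw, std::uint32_t &value)
{
  if (raw.empty())
  {
    return false;
  }
  // strtoul would accept "-1" and wrap it around, so reject any minus sign.
  if (raw.find('-') != std::string::npos)
  {
    return false;
  }
  const char *start = raw.c_str();
  char *actual_end  = nullptr;
  errno             = 0;
  const unsigned long parsed = std::strtoul(start, &actual_end, 10);
  // Nothing consumed ("not even a number") or leftovers ("42 and some change").
  if (actual_end == start || actual_end != start + raw.length())
  {
    return false;
  }
  if (errno == ERANGE || parsed > std::numeric_limits<std::uint32_t>::max())
  {
    return false;
  }
  value = static_cast<std::uint32_t>(parsed);
  return true;
}
```

Leading whitespace is still skipped by `std::strtoul`, so " 7" is accepted, which is consistent with the point that whitespace handling is not the concern.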
No problem. I misread the previous comment as saying that strtoul and strtof were not good enough to report any errors. I can revert to something less elaborate.
266e6e5 to 25b332c
Continuing review,
environment variables not in the spec should use the OTEL_CPP_xxx namespace.
constexpr char kSignalEnv[] = "OTEL_EXPORTER_OTLP_METRICS_RETRY_MAX_ATTEMPTS";
constexpr char kGenericEnv[] = "OTEL_EXPORTER_OTLP_RETRY_MAX_ATTEMPTS";
These new environment variables are not in the spec (yet).
Please use the OTEL_CPP_xxx namespace, as in OTEL_CPP_EXPORTER_OTLP_METRICS_RETRY_MAX_ATTEMPTS, and likewise for all friends.
Once the spec is extended, this will be revisited to use the official names.
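To make the signal-specific versus generic lookup concrete, here is a sketch using the renamed variables. The helper `LookupDualEnv` is hypothetical and only imitates what a dual lookup might do; the generic name `OTEL_CPP_EXPORTER_OTLP_RETRY_MAX_ATTEMPTS` is inferred from "likewise for all friends" and is an assumption.

```cpp
#include <cstdlib>
#include <string>

constexpr char kSignalEnv[]  = "OTEL_CPP_EXPORTER_OTLP_METRICS_RETRY_MAX_ATTEMPTS";
constexpr char kGenericEnv[] = "OTEL_CPP_EXPORTER_OTLP_RETRY_MAX_ATTEMPTS";  // inferred name

// Reads the signal-specific variable first, then falls back to the generic one.
// An empty value is treated the same as an unset variable.
inline bool LookupDualEnv(const char *signal_name, const char *generic_name, std::string &value)
{
  const char *names[] = {signal_name, generic_name};
  for (const char *name : names)
  {
    const char *raw = std::getenv(name);
    if (raw != nullptr && raw[0] != '\0')
    {
      value = raw;
      return true;
    }
  }
  return false;
}
```

Calling `LookupDualEnv(kSignalEnv, kGenericEnv, value)` would then let a metrics-specific setting override the exporter-wide one.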
}
else
{
  ++retry_it;
Should we break and return true here? Otherwise the background thread may exit if all retry sessions are pending.
It's a FIFO list, so I think when the first session is pending, the rest are all pending.
My impression is that we want to "reschedule" as many ready operations as possible.
This is because curl_multi_info_read on L465 runs in a tight loop and should be able to process several handles "simultaneously".
Otherwise, we'd need to break out of that loop and look here again to see who is next.
The tradeoff of short-circuiting this list and exiting at the first handle that is still scheduled sometime "in the future" is that it assumes that everything else up to the head of the list is newer and therefore not ready. I don't believe it is possible in practice to have handles that are out of order, but I also did not spend enough time trying to exclude that possibility.
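The full-scan variant discussed here can be sketched as follows. This is an illustration, not the PR's code: `PendingRetry` and `RescheduleReady` are hypothetical names, and the return value (true while anything was resubmitted or is still waiting) is one possible way to keep a background thread alive while retries are pending.

```cpp
#include <chrono>
#include <deque>
#include <vector>

using Clock = std::chrono::system_clock;

struct PendingRetry  // hypothetical entry in the retry list
{
  int id;
  Clock::time_point next_retry_time;
};

// Scans the whole list instead of stopping at the first entry that is not
// ready, so out-of-order entries are still picked up. Ready entries are moved
// to `resubmitted`; the others stay queued.
inline bool RescheduleReady(std::deque<PendingRetry> &pending,
                            Clock::time_point now,
                            std::vector<int> &resubmitted)
{
  bool any_resubmitted = false;
  for (auto it = pending.begin(); it != pending.end();)
  {
    if (it->next_retry_time <= now)
    {
      resubmitted.push_back(it->id);
      any_resubmitted = true;
      it              = pending.erase(it);
    }
    else
    {
      ++it;
    }
  }
  // True while there is still work: something resubmitted or still waiting.
  return any_resubmitted || !pending.empty();
}
```

The scan is linear in the number of pending retries, which is the cost of not relying on the list being strictly ordered by retry time.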
3c6e758 to 20b347f
Fixes #2049
Changes
This change introduces a retry mechanism for OTLP/HTTP for select failures, mimicking the same exponential backoff approach used in OTLP/gRPC.
The changes to support retries for OTLP/gRPC exporter are addressed in #3219
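The description does not enumerate the "select failures". As a point of reference, the OTLP specification lists HTTP 429, 502, 503, and 504 as retryable response codes. The sketch below encodes only that list; the function name `IsRetryableHttpStatus` is hypothetical, and whether this PR also retries on transport-level errors is not shown here.

```cpp
// Returns true for the HTTP status codes the OTLP specification marks as retryable.
inline bool IsRetryableHttpStatus(long status_code)
{
  switch (status_code)
  {
    case 429:  // Too Many Requests
    case 502:  // Bad Gateway
    case 503:  // Service Unavailable
    case 504:  // Gateway Timeout
      return true;
    default:
      return false;
  }
}
```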
For significant contributions please make sure you have completed the following items:
CHANGELOG.md updated for non-trivial changes