-
Notifications
You must be signed in to change notification settings - Fork 1.7k
engine: expose internal logging call counts as internal metrics #10326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
engine: expose internal logging call counts as internal metrics #10326
Conversation
return NULL; | ||
} | ||
|
||
ret_ctx->u = flb_upstream_create(ret_ctx->config, "127.0.0.1", 2020, 0, NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any pattern or prior art for picking random free ports in tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's not but we can talk about it if you're interested in making that improvement, I'd appreciate it.
d83dbeb
to
68ede1c
Compare
68ede1c
to
e61c2c0
Compare
include/fluent-bit/flb_config.h
Outdated
@@ -164,6 +164,7 @@ struct flb_config { | |||
/* Logging */ | |||
char *log_file; | |||
struct flb_log *log; | |||
struct flb_log_metrics *log_metrics_ctx; /* Global metrics for logging calls */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this metrics context supposed to be limited to the logger instance? If that's the case then it should be declared inside if flb_log
instead of flb_config
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved the metrics to flb_log
and the lifecycle of them is now managed by flb_log_create
and flb_log_destroy
.
include/fluent-bit/flb_log.h
Outdated
@@ -232,6 +242,8 @@ static inline int flb_log_suppress_check(int log_suppress_interval, const char * | |||
int flb_log_worker_init(struct flb_worker *worker); | |||
int flb_log_worker_destroy(struct flb_worker *worker); | |||
int flb_errno_print(int errnum, const char *file, int line); | |||
struct flb_log_metrics *flb_log_metrics_create(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two shouldn't be public as they should be only invoked by flb_log_create
and flb_log_destroy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed from the header file.
src/flb_config.c
Outdated
@@ -391,6 +391,14 @@ struct flb_config *flb_config_init() | |||
flb_regex_init(); | |||
#endif | |||
|
|||
/* Create internal logger metrics */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be moved to flb_log_create
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/flb_log.c
Outdated
@@ -573,39 +597,32 @@ int flb_log_construct(struct log_message *msg, int *ret_len, | |||
int total; | |||
time_t now; | |||
const char *header_color = NULL; | |||
const char *header_title = NULL; | |||
const char *header_title = flb_log_message_type_str(type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't make function calls in the middle of the declarations, initialize it to NULL and move the function call to line 606.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
src/flb_log.c
Outdated
@@ -564,6 +566,28 @@ struct flb_log *flb_log_create(struct flb_config *config, int type, | |||
return log; | |||
} | |||
|
|||
static inline char *flb_log_message_type_str(int type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change the return type to const char *
to match the current usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/flb_log.c
Outdated
return lm; | ||
|
||
error: | ||
if (lm && lm->logs_total_counter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this code was to stay this would be a much cleaner structure which I'd ask you to follow:
if (lm != NULL) {
if (lm->logs_total_counter != NULL) {
cmt_counter_destroy(lm->logs_total_counter);
}
if (lm->cmt != NULL) {
cmt_destroy(lm->cmt);
}
flb_free(lm);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted to not using goto. The pattern of using goto to jump to a cleanup section that effectively unwinds the allocations was something Phil and I had talked a bit about ahead of the PR, but I can bring that up as a discussion separately.
src/flb_log.c
Outdated
break; | ||
} | ||
|
||
if (cmt_counter_set(lm->logs_total_counter, ts, 0, 1, (char *[]) {message_type_str}) == -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't put function calls inside of the conditions, it's ok to use simpler functions such as strlen
sometimes there but this is not the case, declare a variable, store the result value there and then compare it.
The compiler will optimize it anyway and the result code will remain in a register in 99% of the cases, especially in the modern fastcall-ish conventions (amd64 & arm).
src/flb_log.c
Outdated
* Initialize counters for all log message types to 0. | ||
* This assumes types are contiguous starting at 1 (FLB_LOG_ERROR). | ||
*/ | ||
i = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i
is already initialized by the for
, remove this.
return NULL; | ||
} | ||
|
||
ret_ctx->u = flb_upstream_create(ret_ctx->config, "127.0.0.1", 2020, 0, NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's not but we can talk about it if you're interested in making that improvement, I'd appreciate it.
8023366
to
98b4e24
Compare
src/flb_config.c
Outdated
@@ -345,6 +345,11 @@ struct flb_config *flb_config_init() | |||
mk_list_init(&config->cmetrics); | |||
mk_list_init(&config->cf_parsers_list); | |||
|
|||
/* Initialize multiline-parser list. We need this here, because from now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's isolate this in another PR, I've already approved yours as it was and don't want to mix things up so please revert this change and open a new one with the appropriate context so your reasoning is easy to follow and validate.
98b4e24
to
48b63d6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks generally ok -- I have some nitpicks.
tests/runtime/core_internal_logger.c
Outdated
|
||
/* If the assertion fails, retry in a sleep loop since the fluent-bit's HTTP server | ||
* may not be ready yet */ | ||
for (i = 1; i <= attempt_count; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (i = 0; i < attempt_count; i++) {
if (assert_internal_log_metrics(http_ctx, i == attempt_count - 1)) {
break;
}
flb_time_msleep(100);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
src/flb_log.c
Outdated
if (config != NULL && config->log != NULL) { | ||
msg_type_str = flb_log_message_type_str(type); | ||
ts = cfl_time_now(); | ||
cmt_counter_inc(config->log->metrics->logs_total_counter, ts, 1, (char *[]) {msg_type_str}); // Ignoring inc error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should silently ignore this error. At least we should put a debug statement to track in case we are chasing a bug.
int ret;
ret = cmt_counter_inc(config->log->metrics->logs_total_counter, ts, 1,
(char *[]) {msg_type_str});
if (ret != 0) {
flb_debug("[log] failed to increment log total counter for message type '%s' (error=%d)",
msg_type_str ? msg_type_str : "unknown", ret);
}```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated to log the failure, though I didn't call flb_debug
since that recurses in this same function and would likely hit the same log increment error again. Instead, opted to log the increment error using fprintf(stderr, ...)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't use inline conditionals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
tests/runtime/core_internal_logger.c
Outdated
/* The process startup should have logged at least 1 info log */ | ||
regex = flb_regex_create( | ||
"fluentbit_logger_logs_total\\{message_type=\"info\"\\} [1-9]+[0-9]*"); | ||
TEST_CHECK(regex != NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can early return here.
if (!TEST_CHECK(regex != NULL)) {
TEST_MSG("Failed to create regex for info log count check");
flb_http_client_destroy(http_client);
return -1;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
48b63d6
to
0e0e134
Compare
@niedbalski thanks for taking a look. Pushed a temporary commit addressing your feedback. |
Internal logger calls increment a new v2 metric exposed by the HTTP server Prometheus scrape endpoint. There is one time series per log message type. Signed-off-by: Alec Holmes <[email protected]>
Signed-off-by: Alec Holmes <[email protected]>
Signed-off-by: Alec Holmes <[email protected]>
Signed-off-by: Alec Holmes <[email protected]>
0e0e134
to
a48e7ab
Compare
if (ctx->log != NULL) { | ||
ret = cmt_cat(cmt, ctx->log->metrics->cmt); | ||
if (ret == -1) { | ||
flb_error("[metrics exporter] could not append global log_metrics_ctx"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change this so the message reflect what it is rather than the context name.
} | ||
|
||
metrics->cmt = cmt_create(); | ||
if (!metrics->cmt) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please compare explicitly against NULL
here and in any other places where it's not done in that way.
* This assumes types are contiguous starting at 1 (FLB_LOG_ERROR). | ||
*/ | ||
ts = cfl_time_now(); | ||
for (i = 1; ; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change this loop so it explicitly iterates from FLB_LOG_ERROR
up to FLB_LOG_TRACE
.
Your mechanism is smart but not obvious at first glance and in order to minimize the chance of someone misunderstanding the code and making a mistake we need to make things painfully obvious when possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, rename i
to a proper name, we don't use single letter variables because it's a slippery slope that can lead to sub par and hard to follow code which can (and has) cause(d) issues.
0, | ||
1, (char *[]) {message_type_str}); | ||
if (ret == -1) { | ||
cmt_counter_destroy(metrics->logs_total_counter); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace this cleanup block (and the one before) with a call to flb_log_metrics_destroy
/* Frees the metrics instance and its associated resources. */ | ||
void flb_log_metrics_destroy(struct flb_log_metrics *metrics) | ||
{ | ||
cmt_counter_destroy(metrics->logs_total_counter); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this, cmt_destroy
iterates the metrics in the context and destroys them so you don't need to. Also, if this function is called by flb_log_metrics_create
when cmt_counter_create
fails we'd need to add a conditional to avoid calling cmt_counter_destry
which makes no sense since it's automatic.
fprintf(stderr, "[log] could not create log metrics\n"); | ||
mk_event_loop_destroy(log->evl); | ||
flb_free(log); | ||
config->log = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While looking at this I noticed that config->log
is set immediately after allocating it.
I've checked the client code and verified that there are no threading issues, however, I wanted to use this as an example of bad logic for the new contributors.
The problem is that the uninitialized log context is set to config
which means it could be prematurely accessed by other threads. The right thing would be to only set config->log
at the end of this function when we know for sure that the context is fully initialized and usable.
You don't need to change anything here, just look at it and if you have any questions ask them.
ts, | ||
1, (char *[]) {msg_type_str}); | ||
if (ret == -1) { | ||
// Not using flb_log_debug to avoid recursing into this same function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace this with a C style comment /**/
@@ -299,6 +299,15 @@ struct cmt *flb_me_get_cmetrics(struct flb_config *ctx) | |||
} | |||
} | |||
|
|||
if (ctx->log != NULL) { | |||
ret = cmt_cat(cmt, ctx->log->metrics->cmt); | |||
if (ret == -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Invert this conditional son instead of matching -1
it matches anything different than 0
.
The idea is that while at the moment this function returns -1
in every error return branch this is not great if we care about diagnosing issues and we'd like to leave the door open to improve it in the future.
This PR adds a new v2 runtime metric that exposes the number of logger calls by message type. A fluent-bit process consistently logging errors can be indicative of significant configuration or infrastructure problems. A common pattern for observing failures across many instances of software is to expose failures as metric counters that can then be observed and alerted on.
The implementation piggybacks on the src/flb_log.c logging library already extracting a worker context from the current thread.
Here is the example output of curling a fluent-bit with a service http_server enabled:
Enter
[N/A]
in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-test
label to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.