Kafka output causes high CPU usage when ACL is misconfigured #42343

Closed
belimawr opened this issue Jan 17, 2025 · 1 comment · Fixed by #42401

@belimawr

When Filebeat uses the Kafka output, a misconfigured ACL can cause writes to fail. The output then retries publishing the event over and over without any back-off, so Filebeat consumes a large amount of CPU until the ACL is fixed.

The short-term workaround is to stop Filebeat, fix the ACL, ensure the credentials are set correctly, and then restart Filebeat.

The problem seems to come from the error handling code:

	case errors.Is(err, sarama.ErrInvalidMessage):
		r.client.log.Errorf("Kafka (topic=%v): dropping invalid message", msg.topic)
		r.client.observer.PermanentErrors(1)
	case errors.Is(err, sarama.ErrMessageSizeTooLarge) || errors.Is(err, sarama.ErrInvalidMessageSize):
		r.client.log.Errorf("Kafka (topic=%v): dropping too large message of size %v.",
			msg.topic,
			len(msg.key)+len(msg.value))
		r.client.observer.PermanentErrors(1)
	case errors.Is(err, breaker.ErrBreakerOpen):
		// Add this message to the failed list, but don't overwrite r.err since
		// all the breaker error means is "there were a lot of other errors".
		r.failed = append(r.failed, msg.data)
	default:
		r.failed = append(r.failed, msg.data)
		if r.err == nil {
			// Don't overwrite an existing error. This way at the end of the batch
			// we report the first error that we saw, rather than the last one.
			r.err = err
		}
	}

That code does not handle specific errors such as sarama.ErrTopicAuthorizationFailed, so they fall through to the default case and the message is added to the failed list.

Unfortunately, I have not managed to reproduce this specific situation. However, after looking at the code and talking with @faec and @mauri870, this seems to be the most likely cause.

In the report I saw of this issue, the user handles authentication with TLS certificates and sets up ACLs. Once the ACL is correct, Filebeat works as expected. However, if the ACL is misconfigured or missing, Filebeat goes into this high CPU usage state.

Because the issue seems to come from the output code, other Beats could be affected by the same problem.

@belimawr belimawr added bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels Jan 17, 2025
@elasticmachine

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
