thanos s3 tier 2 request usage is very high #7916

Open
doctorpangloss opened this issue Nov 17, 2024 · 7 comments

Comments

@doctorpangloss

There is a mind-numbing amount of arcana involved in using Thanos with S3 without incurring insane request costs:

#2643

#2389

What is the correct configuration for Thanos so that, when it is used as a storage backend for Prometheus (which is 99% of people) with S3 (which is probably 95% of people), there isn't a huge request load?


dosubot bot commented Nov 17, 2024

Hello @doctorpangloss! I'm here to help you with your Thanos S3 tier 2 request usage issue. Let's figure this out together.

There are a few discussions and issues related to high S3 request usage with Thanos that might be helpful:

  1. S3 and high usage of GetObject: This discussion suggests that increasing or adjusting in-memory caching can help reduce the issue. It might be beneficial to set in-memory cache by default to a percentage value of the assigned memory [1].

  2. Thanos Store generates many requests to S3 storage: The solution here involved reducing requests by setting --sync-block-duration=30m, which was suitable for their case [2].

  3. Large amount of S3 calls of compactor in small Thanos instance: This open issue suggests adjusting configuration flags to increase the intervals at which certain operations are performed, such as setting --wait-interval=24h and --compact.cleanup-interval=0s to reduce the number of S3 requests [3].

These solutions involve tuning the configuration to optimize the request load on S3. You might want to consider these adjustments based on your specific use case and requirements.
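As a rough sketch of how the flags quoted above would be passed, the snippet below assembles command lines for `thanos store` and `thanos compact`. The flag values are the ones reported in the linked issues, not recommendations. The `bucket.yml` path and the `build_command` helper are hypothetical, and `--wait` is included on the assumption that `--wait-interval` only takes effect when the compactor runs continuously.

```python
import shlex

# Values quoted in the linked issues; treat them as starting points to
# evaluate against your own workload, not as recommended defaults.
STORE_FLAGS = ["--sync-block-duration=30m"]
COMPACT_FLAGS = [
    "--wait",  # assumption: --wait-interval is only honored together with --wait
    "--wait-interval=24h",
    "--compact.cleanup-interval=0s",
]


def build_command(component, flags, objstore_config="bucket.yml"):
    """Return the argv list for `thanos <component>`.

    `objstore_config` is a hypothetical path to the object storage config.
    """
    return ["thanos", component, f"--objstore.config-file={objstore_config}", *flags]


if __name__ == "__main__":
    print(shlex.join(build_command("store", STORE_FLAGS)))
    print(shlex.join(build_command("compact", COMPACT_FLAGS)))
```

Changing one flag at a time and watching the request counters afterwards makes it easier to tell which setting actually moved the numbers.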


@yeya24
Contributor

yeya24 commented Nov 17, 2024

Can you please share more details about the number of requests to S3 you have?
What types of requests are they: GET, EXISTS, or something else?
The Thanos objstore metrics should tell you which kinds of object store requests are most frequent. Do you have any data about which component they come from: Compactor or Store Gateway?

Have you tried the metadata caches?

@doctorpangloss
Author

thanos is indeed the cause

@doctorpangloss
Author

> Can you please share more details about the number of requests to S3 you have? What types of request those are, GET, EXISTS or something else? Thanos objstore metrics should tell you what kind of requests to object store you have the most. Do you have any data about which component it comes from? Compactor or Store Gateway?
>
> Have you tried the metadata caches?

Honestly I am overwhelmed by how difficult it is to answer any of these questions. Do you have any advice for investigating this? For now I am simply setting the number of replicas of thanos to zero, because the usage has absolutely exploded.

@TheReal1604

Hi @doctorpangloss,

To get a quick overview, you can scrape the metrics that thanos-store exposes on its HTTP endpoint and import the official thanos-store example dashboard into your Grafana instance.

https://github.com/thanos-io/thanos/blob/main/examples/dashboards/store.json

The dashboard has a row called "Bucket-Operations". Alternatively, you can scrape the metrics endpoint yourself.
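For the do-it-yourself route, here is a minimal sketch that reads the Prometheus text format from a metrics endpoint and totals bucket operations by type. It assumes the counter is named `thanos_objstore_bucket_operations_total` and carries an `operation` label; check the output of your own endpoint, since names can differ between versions. The helper names are hypothetical.

```python
import re
import urllib.request
from collections import Counter

# Assumed metric name; verify it against your own /metrics output.
METRIC = "thanos_objstore_bucket_operations_total"

# One labelled sample line: name{labels} value [timestamp]
SAMPLE_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*\{(?P<labels>.*)\}\s+(?P<value>\S+)")
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def operations_by_type(metrics_text):
    """Sum the bucket operation counter per `operation` label value."""
    totals = Counter()
    for line in metrics_text.splitlines():
        # Skip comments and every other metric.
        if not line.startswith(METRIC + "{"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        totals[labels.get("operation", "unknown")] += float(match.group("value"))
    return totals


def fetch_metrics(url):
    """Download the raw text exposition from a metrics endpoint."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling `operations_by_type(fetch_metrics(url)).most_common()` with the URL of your store gateway's metrics endpoint lists the operation types from most to least frequent. Running it against the compactor's endpoint as well shows which component is responsible. The counters are cumulative since process start, so compare two readings taken some time apart to get a rate.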

image

@julienlau

I saw Thanos rolled out on a bare-metal data lake backed by a shared Minio instance, and it immediately became the number one requester in terms of queries per day.
Thanos is the only workload that is rate limited by our 1000 qps cap on S3.
That is definitely something to improve before 1.0.0.

In addition, a sizing guide for S3 resource consumption would be nice.
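Until such a guide exists, a back-of-the-envelope calculation gives a sense of scale. The sketch below converts a sustained request rate into requests per day and, for a billed object store, into a daily cost. The price argument is a placeholder to replace with your provider's actual rate; it does not apply to a self-hosted Minio.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_day(requests_per_second):
    """Requests issued in one day at a sustained rate."""
    return requests_per_second * SECONDS_PER_DAY


def daily_request_cost(requests_per_second, price_per_1000_requests):
    """Daily request cost, given a price per 1,000 requests.

    The price is an input, not a known constant: look it up for your
    provider, region, and request type.
    """
    return requests_per_day(requests_per_second) / 1000 * price_per_1000_requests


if __name__ == "__main__":
    # A workload pinned at the 1000 qps cap mentioned above.
    print(requests_per_day(1000))  # → 86400000
    # 0.0004 is a placeholder price per 1,000 requests, not a quoted rate.
    print(f"{daily_request_cost(1000, 0.0004):.2f}")
```

Feeding in the per-operation rates measured from the objstore metrics, rather than the cap, gives a more realistic estimate.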

@julienlau

julienlau commented Jan 8, 2025

Please find attached metrics from a Thanos instance on an on-premises test setup.
Note that this is after applying rate limiting in front of our S3 cluster.

Thanos makes way too many getHeadObject and getObject requests.
thanos-1 - anon
thanos-2 - anon
