Rate Limit and Retry for Models #1734
base: main
Conversation
@grll Nice work! Looks like we still have some failing tests, let me know if you'd like some guidance there. As for the code, can we please reduce the duplication a bit by moving the …
+1 this is nicely done.
@grll @DouweM what do you think about having a GCRA implementation, as provided by something like https://github.com/ZhuoZhuoCrayon/throttled-py?
It's a more efficient variant of the leaky bucket algorithm without its downsides (i.e., it doesn't need a background "drip" process). See https://brandur.org/rate-limiting
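For context, here is a rough Python sketch of the GCRA idea, illustrative only and not throttled-py's implementation: the entire state is a single timestamp, the "theoretical arrival time" (TAT), which is why no background drip process is needed.

```python
import time


class GCRA:
    """Allow one request per `period` seconds, with a burst allowance.
    State is a single float (the theoretical arrival time), so there is
    no queue to drain and no background process."""

    def __init__(self, period: float, burst: int = 1):
        self.period = period  # emission interval between requests
        self.burst = burst    # how many requests may arrive back-to-back
        self.tat = 0.0        # theoretical arrival time of the next request

    def allow(self) -> bool:
        now = time.monotonic()
        tat = max(self.tat, now)
        # Earliest instant at which this request still fits the burst window.
        allow_at = tat - (self.burst - 1) * self.period
        if now < allow_at:
            return False
        self.tat = tat + self.period  # advance the TAT by one interval
        return True
```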
Hey, thanks for the review. I will have a look; I was indeed struggling a bit with the tests in CI while they were working fine locally, so I need to investigate a bit more. I can also have a look at the refactor you suggested!
Hey, thanks for the suggestion. The initial goal with using aiolimiter was to keep things as simple as possible, but we could instead integrate with throttled-py to let users choose from various algorithms / Redis backends as well.
@grll I am the developer of throttled-py. throttled-py provides flexible rate-limiting strategies and storage backend configurations. I am very willing to participate in the discussion and implementation of this PR.
Hey @ZhuoZhuoCrayon, thanks for jumping in, and great work on throttled-py. One concern I have is that asyncio support is limited in the current implementation of throttled-py. What is the current status around asyncio?
@grll Features are in the development branch, and a stable version will be released this weekend.
throttled-py asynchronous support has been officially released in v2.1.0. The core API is the same for synchronous and asynchronous code; just replace `from throttled import ...` with `from throttled.asyncio import ...`:

```python
import asyncio

from throttled.asyncio import RateLimiterType, Throttled, rate_limiter, store, utils

throttle = Throttled(
    using=RateLimiterType.GCRA.value,
    quota=rate_limiter.per_sec(1_000, burst=1_000),
    store=store.MemoryStore(),
)


async def call_api() -> bool:
    result = await throttle.limit("/ping", cost=1)
    return result.limited


async def main():
    benchmark: utils.Benchmark = utils.Benchmark()
    denied_num: int = sum(await benchmark.async_serial(call_api, 100_000))
    print(f"❌ Denied: {denied_num} requests")


if __name__ == "__main__":
    asyncio.run(main())
```

I think GCRA has lower performance overhead and smoother throttling. At the same time, making the storage backend optional can increase subsequent scalability. Anyway, adding throttling and retries to the model is very cool, thank you!
@ZhuoZhuoCrayon Thanks a ton, throttled looks great! @grll Are you up for changing this PR to use throttled instead?
Sure, happy to!
Adds a quite popular feature request: rate limit and retry support for models.

It is implemented as a wrapper, like InstrumentedModel. It leverages aiolimiter for a simple implementation of the leaky bucket algorithm for rate limiting, while retries leverage the tenacity library.

Usage:
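A minimal sketch of the pattern, assuming hypothetical names (`RateLimitedModel`, its constructor parameters, and the `request` method on the wrapped model) rather than the PR's actual API:

```python
import asyncio

from aiolimiter import AsyncLimiter
from tenacity import AsyncRetrying, stop_after_attempt, wait_exponential


class RateLimitedModel:
    """Hypothetical wrapper sketch, not this PR's actual class: delegates
    to an inner model, gated by a leaky bucket (aiolimiter) and retried
    on transient failures (tenacity)."""

    def __init__(self, wrapped, max_rate: float = 10.0, time_period: float = 1.0, attempts: int = 3):
        self.wrapped = wrapped
        # Allow at most `max_rate` requests per `time_period` seconds.
        self.limiter = AsyncLimiter(max_rate=max_rate, time_period=time_period)
        self.attempts = attempts

    async def request(self, *args, **kwargs):
        # Retry with exponential backoff; wait for a rate-limit slot
        # before every attempt so retries are also throttled.
        async for attempt in AsyncRetrying(
            stop=stop_after_attempt(self.attempts),
            wait=wait_exponential(multiplier=1, max=30),
        ):
            with attempt:
                async with self.limiter:
                    return await self.wrapped.request(*args, **kwargs)
```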