Store per model params about bulk inference and batching in a JSON #388

Open
RyanMarten opened this issue Jan 21, 2025 · 1 comment
Labels: task

RyanMarten commented Jan 21, 2025

I'm thinking that, similar to litellm, we can store this in a JSON file (easily readable and editable by the community). They store fields relevant to sending a single request:

{
  "max_tokens": 32768,
  "max_input_tokens": 32768,
  "max_output_tokens": 0,
  "input_cost_per_token": 0,
  "output_cost_per_token": 0
}

We should store fields relevant to sending many requests:

{
  "batch_available": true,
  "batch_discount_factor": 0.5,
  "max_requests_per_batch": Z,
  "max_batch_file_size": W,
  "rate_limit_strategy": "separate",  # could also be concurrent tokens, etc.
  "max_input_tokens_per_minute": Y,
  "max_output_tokens_per_minute": X,
  "max_requests_per_minute": V
}
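
For concreteness, a minimal sketch of one entry in such a file, keyed by model name (the numbers are placeholders loosely based on the Gemini example further down, not authoritative limits, and the batch fields are purely illustrative):

{
  "gemini/gemini-1.5-flash": {
    "batch_available": false,
    "batch_discount_factor": null,
    "max_requests_per_batch": null,
    "max_batch_file_size": null,
    "rate_limit_strategy": "separate",
    "max_input_tokens_per_minute": 4000000,
    "max_output_tokens_per_minute": 4000000,
    "max_requests_per_minute": 2000
  }
}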

Sometimes this information is contained in the response headers, but most of the time it is not, so the user has to hunt down these values in the provider's documentation and set them manually in backend_params. We can instead do this automatically for the user by storing the values in this file and loading them in; a minimal sketch of what that loading could look like is below.
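
A rough sketch, assuming the file above is shipped with the package as model_bulk_params.json (the file name and function name are illustrative, not an existing API):

import json
from pathlib import Path

# Hypothetical location of the per-model defaults file (name is an assumption).
_DEFAULTS_PATH = Path(__file__).parent / "model_bulk_params.json"


def load_bulk_defaults(model_name: str) -> dict:
    """Return stored bulk-inference/batching defaults for a model, or {} if we have none."""
    with open(_DEFAULTS_PATH) as f:
        all_defaults = json.load(f)
    return all_defaults.get(model_name, {})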

For example, Gemini doesn't provide rate limit information in the headers. With this file, when the user specifies Gemini they get a reasonable default (instead of our current, very conservative 10 RPM). See how we currently have to specify the values manually in examples/litellm-recipe-generation/litellm_recipe_prompting.py:

# 4. If you are a free user, update rate limits:
# max_requests_per_minute=15
# max_tokens_per_minute=1_000_000
# (Up to 1,000 requests per day)
#############################################
recipe_generator = RecipeGenerator(
    model_name="gemini/gemini-1.5-flash",
    backend="litellm",
    backend_params={"max_requests_per_minute": 2_000, "max_tokens_per_minute": 4_000_000},
)

There is an obvious complicating factor: users can be on different tiers or have custom rate limits. But those values can still be set manually by the user (as we require them to do now, even on a standard tier). The point of this is to have reasonable defaults; a sketch of how user-supplied values could take precedence over the stored defaults is below.
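
A minimal sketch of that precedence, reusing the hypothetical load_bulk_defaults from the sketch above (not an existing API):

def resolve_backend_params(model_name: str, user_params: dict | None = None) -> dict:
    """Stored defaults are only a starting point; anything the user passes wins."""
    params = dict(load_bulk_defaults(model_name))  # hypothetical loader from the sketch above
    params.update(user_params or {})
    return params


# e.g. a free-tier user overriding the standard-tier Gemini defaults:
resolve_backend_params(
    "gemini/gemini-1.5-flash",
    {"max_requests_per_minute": 15, "max_tokens_per_minute": 1_000_000},
)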

The litellm file is here:
https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json

They also have a nice website where you can search and compare providers (which we can do too for bulk inference and batching):
https://models.litellm.ai/

RyanMarten commented Jan 21, 2025

Inspired by discussion in #378
