Store per model params about bulk inference and batching in a JSON #388

Open
RyanMarten opened this issue Jan 21, 2025 · 1 comment
Labels: task

RyanMarten commented Jan 21, 2025

I'm thinking that, similar to litellm, we can store this in a JSON file (easily readable and editable by the community). They store fields relevant to sending a single request:

{
  "max_tokens": 32768,
  "max_input_tokens": 32768,
  "max_output_tokens": 0,
  "input_cost_per_token": 0,
  "output_cost_per_token": 0
}

We should store fields relevant to sending many requests:

{
  "batch_available": true,
  "batch_discount_factor": 0.5,
  "max_requests_per_batch": Z,
  "max_batch_file_size": W,
  "rate_limit_strategy": "separate",  # could also be concurrent tokens, etc.
  "max_input_tokens_per_minute": Y,
  "max_output_tokens_per_minute": X,
  "max_requests_per_minute": V
}
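
For concreteness, a minimal sketch of one entry in such a file, keyed by model name (the numbers are placeholders loosely based on the Gemini example further down, not authoritative limits, and the batch fields are purely illustrative):

{
  "gemini/gemini-1.5-flash": {
    "batch_available": false,
    "batch_discount_factor": null,
    "max_requests_per_batch": null,
    "max_batch_file_size": null,
    "rate_limit_strategy": "separate",
    "max_input_tokens_per_minute": 4000000,
    "max_output_tokens_per_minute": 4000000,
    "max_requests_per_minute": 2000
  }
}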

Sometimes this information is contained in the response headers, but most of the time it is not, so the user has to hunt down these values in the provider's documentation and set them manually in backend_params. We can instead do this automatically for the user by storing the values in this file and loading them in; a minimal sketch of what that loading could look like is below.
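
A rough sketch, assuming the file above is shipped with the package as model_bulk_params.json (the file name and function name are illustrative, not an existing API):

import json
from pathlib import Path

# Hypothetical location of the per-model defaults file (name is an assumption).
_DEFAULTS_PATH = Path(__file__).parent / "model_bulk_params.json"


def load_bulk_defaults(model_name: str) -> dict:
    """Return stored bulk-inference/batching defaults for a model, or {} if we have none."""
    with open(_DEFAULTS_PATH) as f:
        all_defaults = json.load(f)
    return all_defaults.get(model_name, {})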

For example, Gemini doesn't provide rate limit information in the headers. With this file, when the user specifies Gemini they get a reasonable default (instead of our current, very conservative 10 RPM). See how we currently have to specify the values manually in examples/litellm-recipe-generation/litellm_recipe_prompting.py:

# 4. If you are a free user, update rate limits:
# max_requests_per_minute=15
# max_tokens_per_minute=1_000_000
# (Up to 1,000 requests per day)
#############################################
recipe_generator = RecipeGenerator(
    model_name="gemini/gemini-1.5-flash",
    backend="litellm",
    backend_params={"max_requests_per_minute": 2_000, "max_tokens_per_minute": 4_000_000},
)

There is an obvious complicating factor: users can be on different tiers or have custom rate limits. But those values can still be set manually by the user (as we require them to do now, even on a standard tier). The point of this is to have reasonable defaults; a sketch of how user-supplied values could take precedence over the stored defaults is below.
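
A minimal sketch of that precedence, reusing the hypothetical load_bulk_defaults from the sketch above (not an existing API):

def resolve_backend_params(model_name: str, user_params: dict | None = None) -> dict:
    """Stored defaults are only a starting point; anything the user passes wins."""
    params = dict(load_bulk_defaults(model_name))  # hypothetical loader from the sketch above
    params.update(user_params or {})
    return params


# e.g. a free-tier user overriding the standard-tier Gemini defaults:
resolve_backend_params(
    "gemini/gemini-1.5-flash",
    {"max_requests_per_minute": 15, "max_tokens_per_minute": 1_000_000},
)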

The litellm file is here:
https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json

They also have a nice website where you can search and compare providers (which we can do too for bulk inference and batching):
https://models.litellm.ai/

RyanMarten commented Jan 21, 2025

Inspired by discussion in #378
