8 changes: 5 additions & 3 deletions docs/docs/providers/inference/index.mdx
@@ -3,9 +3,10 @@ description: "Inference
 
 Llama Stack Inference API for generating completions, chat completions, and embeddings.
 
-This API provides the raw interface to the underlying models. Two kinds of models are supported:
+This API provides the raw interface to the underlying models. Three kinds of models are supported:
 - LLM models: these models generate \"raw\" and \"chat\" (conversational) completions.
-- Embedding models: these models generate embeddings to be used for semantic search."
+- Embedding models: these models generate embeddings to be used for semantic search.
+- Rerank models: these models reorder the documents based on their relevance to a query."
 sidebar_label: Inference
 title: Inference
 ---
@@ -18,8 +19,9 @@ Inference
 
 Llama Stack Inference API for generating completions, chat completions, and embeddings.
 
-This API provides the raw interface to the underlying models. Two kinds of models are supported:
+This API provides the raw interface to the underlying models. Three kinds of models are supported:
 - LLM models: these models generate "raw" and "chat" (conversational) completions.
 - Embedding models: these models generate embeddings to be used for semantic search.
+- Rerank models: these models reorder the documents based on their relevance to a query.
 
 This section contains documentation for all available providers for the **inference** API.
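The third model kind in the list above is the substance of this PR. As a minimal sketch of what it enables, based on the `rerank` signature the inference router gains later in this diff, a call might look like the following; the model identifier and documents are hypothetical:

```python
from llama_stack.apis.inference import RerankResponse  # response type added by this PR

async def rerank_example(inference) -> RerankResponse:
    # Sketch only: `inference` is any implementation of the Inference API
    # (e.g. the InferenceRouter changed below), and "my-reranker-v1" is a
    # hypothetical model registered with model_type "rerank".
    return await inference.rerank(
        model="my-reranker-v1",
        query="What is the capital of France?",
        items=[
            "Paris is the capital of France.",
            "Berlin is the capital of Germany.",
            "The Eiffel Tower is in Paris.",
        ],
        max_num_results=2,  # keep only the two most relevant documents
    )
```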
2 changes: 1 addition & 1 deletion docs/static/deprecated-llama-stack-spec.html
@@ -13459,7 +13459,7 @@
 },
 {
 "name": "Inference",
-"description": "Llama Stack Inference API for generating completions, chat completions, and embeddings.\n\nThis API provides the raw interface to the underlying models. Two kinds of models are supported:\n- LLM models: these models generate \"raw\" and \"chat\" (conversational) completions.\n- Embedding models: these models generate embeddings to be used for semantic search.",
+"description": "Llama Stack Inference API for generating completions, chat completions, and embeddings.\n\nThis API provides the raw interface to the underlying models. Three kinds of models are supported:\n- LLM models: these models generate \"raw\" and \"chat\" (conversational) completions.\n- Embedding models: these models generate embeddings to be used for semantic search.\n- Rerank models: these models reorder the documents based on their relevance to a query.",
 "x-displayName": "Inference"
 },
 {
7 changes: 5 additions & 2 deletions docs/static/deprecated-llama-stack-spec.yaml
@@ -10210,13 +10210,16 @@ tags:
 embeddings.
 
 
-This API provides the raw interface to the underlying models. Two kinds of models
-are supported:
+This API provides the raw interface to the underlying models. Three kinds of
+models are supported:
 
 - LLM models: these models generate "raw" and "chat" (conversational) completions.
 
 - Embedding models: these models generate embeddings to be used for semantic
 search.
+
+- Rerank models: these models reorder the documents based on their relevance
+to a query.
 x-displayName: Inference
 - name: Models
 description: ''
5 changes: 3 additions & 2 deletions docs/static/llama-stack-spec.html
@@ -6859,7 +6859,8 @@
 "type": "string",
 "enum": [
 "llm",
-"embedding"
+"embedding",
+"rerank"
 ],
 "title": "ModelType",
 "description": "Enumeration of supported model types in Llama Stack."
@@ -13261,7 +13262,7 @@
 },
 {
 "name": "Inference",
-"description": "Llama Stack Inference API for generating completions, chat completions, and embeddings.\n\nThis API provides the raw interface to the underlying models. Two kinds of models are supported:\n- LLM models: these models generate \"raw\" and \"chat\" (conversational) completions.\n- Embedding models: these models generate embeddings to be used for semantic search.",
+"description": "Llama Stack Inference API for generating completions, chat completions, and embeddings.\n\nThis API provides the raw interface to the underlying models. Three kinds of models are supported:\n- LLM models: these models generate \"raw\" and \"chat\" (conversational) completions.\n- Embedding models: these models generate embeddings to be used for semantic search.\n- Rerank models: these models reorder the documents based on their relevance to a query.",
 "x-displayName": "Inference"
 },
 {
8 changes: 6 additions & 2 deletions docs/static/llama-stack-spec.yaml
@@ -5269,6 +5269,7 @@ components:
 enum:
 - llm
 - embedding
+- rerank
 title: ModelType
 description: >-
 Enumeration of supported model types in Llama Stack.
@@ -10182,13 +10183,16 @@ tags:
 embeddings.
 
 
-This API provides the raw interface to the underlying models. Two kinds of models
-are supported:
+This API provides the raw interface to the underlying models. Three kinds of
+models are supported:
 
 - LLM models: these models generate "raw" and "chat" (conversational) completions.
 
 - Embedding models: these models generate embeddings to be used for semantic
 search.
+
+- Rerank models: these models reorder the documents based on their relevance
+to a query.
 x-displayName: Inference
 - name: Inspect
 description: >-
5 changes: 3 additions & 2 deletions docs/static/stainless-llama-stack-spec.html
@@ -8531,7 +8531,8 @@
 "type": "string",
 "enum": [
 "llm",
-"embedding"
+"embedding",
+"rerank"
 ],
 "title": "ModelType",
 "description": "Enumeration of supported model types in Llama Stack."
@@ -17951,7 +17952,7 @@
 },
 {
 "name": "Inference",
-"description": "Llama Stack Inference API for generating completions, chat completions, and embeddings.\n\nThis API provides the raw interface to the underlying models. Two kinds of models are supported:\n- LLM models: these models generate \"raw\" and \"chat\" (conversational) completions.\n- Embedding models: these models generate embeddings to be used for semantic search.",
+"description": "Llama Stack Inference API for generating completions, chat completions, and embeddings.\n\nThis API provides the raw interface to the underlying models. Three kinds of models are supported:\n- LLM models: these models generate \"raw\" and \"chat\" (conversational) completions.\n- Embedding models: these models generate embeddings to be used for semantic search.\n- Rerank models: these models reorder the documents based on their relevance to a query.",
 "x-displayName": "Inference"
 },
 {
8 changes: 6 additions & 2 deletions docs/static/stainless-llama-stack-spec.yaml
@@ -6482,6 +6482,7 @@ components:
 enum:
 - llm
 - embedding
+- rerank
 title: ModelType
 description: >-
 Enumeration of supported model types in Llama Stack.
@@ -13577,13 +13578,16 @@ tags:
 embeddings.
 
 
-This API provides the raw interface to the underlying models. Two kinds of models
-are supported:
+This API provides the raw interface to the underlying models. Three kinds of
+models are supported:
 
 - LLM models: these models generate "raw" and "chat" (conversational) completions.
 
 - Embedding models: these models generate embeddings to be used for semantic
 search.
+
+- Rerank models: these models reorder the documents based on their relevance
+to a query.
 x-displayName: Inference
 - name: Inspect
 description: >-
3 changes: 2 additions & 1 deletion llama_stack/apis/inference/inference.py
@@ -1234,9 +1234,10 @@ class Inference(InferenceProvider):
 
     Llama Stack Inference API for generating completions, chat completions, and embeddings.
 
-    This API provides the raw interface to the underlying models. Two kinds of models are supported:
+    This API provides the raw interface to the underlying models. Three kinds of models are supported:
     - LLM models: these models generate "raw" and "chat" (conversational) completions.
     - Embedding models: these models generate embeddings to be used for semantic search.
+    - Rerank models: these models reorder the documents based on their relevance to a query.
     """
 
     @webmethod(route="/openai/v1/chat/completions", method="GET", level=LLAMA_STACK_API_V1, deprecated=True)
2 changes: 2 additions & 0 deletions llama_stack/apis/models/models.py
@@ -27,10 +27,12 @@ class ModelType(StrEnum):
     """Enumeration of supported model types in Llama Stack.
     :cvar llm: Large language model for text generation and completion
     :cvar embedding: Embedding model for converting text to vector representations
+    :cvar rerank: Reranking model for reordering documents based on their relevance to a query
     """
 
     llm = "llm"
     embedding = "embedding"
+    rerank = "rerank"
 
 
 @json_schema_type
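Since `ModelType` is a `StrEnum`, the new member compares and serializes as the plain string "rerank". A hedged sketch of constructing a model record with the new type; the provider and model identifiers are hypothetical, and the `Model` constructor arguments mirror the usage in `openai_mixin.py` later in this diff:

```python
from llama_stack.apis.models import Model, ModelType

# Hypothetical rerank model entry; the constructor arguments follow the usage
# visible in construct_model_from_identifier() below.
reranker = Model(
    provider_id="my-provider",  # hypothetical provider ID
    provider_resource_id="my-reranker-v1",
    identifier="my-reranker-v1",
    model_type=ModelType.rerank,  # the new enum member
)

assert reranker.model_type == ModelType.rerank
assert ModelType.rerank == "rerank"  # StrEnum members compare equal to their string value
```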
22 changes: 22 additions & 0 deletions llama_stack/core/routers/inference.py
@@ -44,9 +44,14 @@
     OpenAIEmbeddingsResponse,
     OpenAIMessageParam,
     Order,
+    RerankResponse,
     StopReason,
     ToolPromptFormat,
 )
+from llama_stack.apis.inference.inference import (
+    OpenAIChatCompletionContentPartImageParam,
+    OpenAIChatCompletionContentPartTextParam,
+)
 from llama_stack.apis.models import Model, ModelType
 from llama_stack.apis.telemetry import MetricEvent, MetricInResponse, Telemetry
 from llama_stack.log import get_logger
@@ -182,6 +187,23 @@ async def _get_model(self, model_id: str, expected_model_type: str) -> Model:
             raise ModelTypeError(model_id, model.model_type, expected_model_type)
         return model
 
+    async def rerank(
+        self,
+        model: str,
+        query: str | OpenAIChatCompletionContentPartTextParam | OpenAIChatCompletionContentPartImageParam,
+        items: list[str | OpenAIChatCompletionContentPartTextParam | OpenAIChatCompletionContentPartImageParam],
+        max_num_results: int | None = None,
+    ) -> RerankResponse:
+        logger.debug(f"InferenceRouter.rerank: {model}")
+        model_obj = await self._get_model(model, ModelType.rerank)
+        provider = await self.routing_table.get_provider_impl(model_obj.identifier)
+        return await provider.rerank(
+            model=model_obj.identifier,
+            query=query,
+            items=items,
+            max_num_results=max_num_results,
+        )
+
     async def openai_completion(
         self,
         params: Annotated[OpenAICompletionRequestWithExtraBody, Body(...)],
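Note that the new router method reuses `_get_model` for validation, so a call naming a non-rerank model should fail before any provider is contacted. A hedged sketch of that behavior; the router instance, model identifier, and the `ModelTypeError` import path are assumptions:

```python
from llama_stack.apis.common.errors import ModelTypeError  # assumed import path

async def expect_type_error(router) -> None:
    # "my-llm" is a hypothetical model registered with model_type=ModelType.llm,
    # so _get_model() inside rerank() should reject it before routing.
    try:
        await router.rerank(model="my-llm", query="q", items=["a", "b"])
    except ModelTypeError:
        print("rejected: my-llm is not a rerank model")
```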
41 changes: 26 additions & 15 deletions llama_stack/providers/utils/inference/openai_mixin.py
@@ -48,6 +48,7 @@ class OpenAIMixin(NeedsRequestProviderData, ABC, BaseModel):
     - overwrite_completion_id: If True, overwrites the 'id' field in OpenAI responses
     - download_images: If True, downloads images and converts to base64 for providers that require it
     - embedding_model_metadata: A dictionary mapping model IDs to their embedding metadata
+    - construct_model_from_identifier: Method to construct a Model instance corresponding to the given identifier
     - provider_data_api_key_field: Optional field name in provider data to look for API key
     - list_provider_model_ids: Method to list available models from the provider
     - get_extra_client_params: Method to provide extra parameters to the AsyncOpenAI client
@@ -121,6 +122,30 @@ def get_extra_client_params(self) -> dict[str, Any]:
         """
         return {}
 
+    def construct_model_from_identifier(self, identifier: str) -> Model:
+        """
+        Construct a Model instance corresponding to the given identifier
+
+        Child classes can override this to customize model typing/metadata.
+
+        :param identifier: The provider's model identifier
+        :return: A Model instance
+        """
+        if metadata := self.embedding_model_metadata.get(identifier):
+            return Model(
+                provider_id=self.__provider_id__,  # type: ignore[attr-defined]
+                provider_resource_id=identifier,
+                identifier=identifier,
+                model_type=ModelType.embedding,
+                metadata=metadata,
+            )
+        return Model(
+            provider_id=self.__provider_id__,  # type: ignore[attr-defined]
+            provider_resource_id=identifier,
+            identifier=identifier,
+            model_type=ModelType.llm,
+        )
+
     async def list_provider_model_ids(self) -> Iterable[str]:
         """
         List available models from the provider.
@@ -416,21 +441,7 @@ async def list_models(self) -> list[Model] | None:
             if self.allowed_models and provider_model_id not in self.allowed_models:
                 logger.info(f"Skipping model {provider_model_id} as it is not in the allowed models list")
                 continue
-            if metadata := self.embedding_model_metadata.get(provider_model_id):
-                model = Model(
-                    provider_id=self.__provider_id__,  # type: ignore[attr-defined]
-                    provider_resource_id=provider_model_id,
-                    identifier=provider_model_id,
-                    model_type=ModelType.embedding,
-                    metadata=metadata,
-                )
-            else:
-                model = Model(
-                    provider_id=self.__provider_id__,  # type: ignore[attr-defined]
-                    provider_resource_id=provider_model_id,
-                    identifier=provider_model_id,
-                    model_type=ModelType.llm,
-                )
+            model = self.construct_model_from_identifier(provider_model_id)
             self._model_cache[provider_model_id] = model
 
         return list(self._model_cache.values())
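The refactor above turns model classification into an overridable hook. A hedged sketch of the kind of override this enables, flagging known reranker IDs as `ModelType.rerank`; the adapter class and model IDs are hypothetical, and a real adapter would also satisfy the mixin's other abstract requirements:

```python
from llama_stack.apis.models import Model, ModelType
from llama_stack.providers.utils.inference.openai_mixin import OpenAIMixin

class MyAdapter(OpenAIMixin):  # hypothetical adapter; other required members omitted
    # Hypothetical set of provider model IDs known to be rerankers.
    rerank_model_ids: set[str] = {"my-reranker-v1", "my-reranker-v2"}

    def construct_model_from_identifier(self, identifier: str) -> Model:
        if identifier in self.rerank_model_ids:
            return Model(
                provider_id=self.__provider_id__,  # type: ignore[attr-defined]
                provider_resource_id=identifier,
                identifier=identifier,
                model_type=ModelType.rerank,
            )
        # Fall back to the default embedding/LLM classification.
        return super().construct_model_from_identifier(identifier)
```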