Conversation

ChristianZaccaria (Contributor)

What does this PR do?

  • Faiss:

    • Changed from IndexFlatL2 (Euclidean distance) to IndexFlatIP (inner/dot product) as the similarity metric. If we normalize the vector embeddings beforehand, the inner product effectively becomes cosine similarity.
    • Score calculation: rescaled from cosine similarity [-1,1] to [0,1].
  • SQLite-vec:

    • Setting distance_metric to cosine at table creation. This determines how similarity search is computed internally for all queries on the table.
    • Score calculation: Cosine Distance [0,2] -> normalized to [0,1]
  • Chroma:

    • Added "hnsw:space": "cosine" to metadata on registering the DB. This tells Chroma how to measure similarity when building and querying the HNSW index for that collection.
    • Score calculation: Cosine distance [0,2] -> normalized to [0,1]
  • Milvus:

    • Scores are sorted in descending order.
    • Score calculation: score is rescaled from cosine similarity [-1,1] to [0,1].
  • PGvector:

    • Score calculation: Cosine distance [0,2] -> normalized to [0,1]
  • Qdrant:

    • Score calculation: Cosine similarity range [-1,1] -> normalized to [0,1]
  • Weaviate:

    • Score calculation: Cosine distance range [0,2] -> normalized to [0,1]
  • Added useful logging info to each vector-io provider.
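
The per-provider rescalings listed above boil down to two mappings. As an illustrative sketch (helper names are mine, not from the PR):

```python
def similarity_to_unit(cos_sim: float) -> float:
    """Rescale cosine similarity from [-1, 1] to [0, 1] (Faiss, Milvus, Qdrant)."""
    return (cos_sim + 1.0) / 2.0

def distance_to_unit(cos_dist: float) -> float:
    """Rescale cosine distance from [0, 2] to [0, 1] (sqlite-vec, Chroma, PGvector, Weaviate)."""
    return 1.0 - cos_dist / 2.0
```

For normalized embeddings, cosine distance is `1 - cosine_similarity`, so the two helpers agree: `distance_to_unit(1 - s) == similarity_to_unit(s)`.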

Closes #3213

Test Plan

  • Added test cases:
    • Test that vector similarity scores are properly normalized to [0,1] range for all vector providers.
    • Runs multiple queries with varying similarity levels (high, medium, low, nonsense).
    • Verifies all similarity scores are numeric and normalized to [0, 1].
    • Confirms scores are sorted in descending order (most similar first).
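
The invariant those test cases check can be sketched as a small assertion helper (hypothetical, not the PR's actual test code):

```python
def assert_normalized_and_sorted(scores: list[float]) -> None:
    # Every score must be numeric and within [0, 1].
    assert all(isinstance(s, (int, float)) for s in scores)
    assert all(0.0 <= s <= 1.0 for s in scores)
    # Results must come back most-similar first.
    assert scores == sorted(scores, reverse=True)

assert_normalized_and_sorted([0.92, 0.55, 0.31, 0.07])
```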

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Sep 8, 2025.
@r3v5 (Contributor) left a comment:

Hey @ChristianZaccaria! Thanks for your PR, nice work. I've proposed some improvements.

```python
for row in rows:
    _id, chunk_json, distance = row
    score = 1.0 / distance if distance != 0 else float("inf")
    distance = float(distance)
```
@r3v5 (Sep 10, 2025):

My recommendation would be to compute the normalized scores from cosine distance directly in the SQL query. It keeps the code concise and moves work out of Python — one of the perks of a SQL-backed store :)

The formula used is: score = 1 / (1 + cosine_distance)
See my comments here under the issue #3213

Additional advantages:

  • Less data transferred: Only rows meeting the threshold are returned.
  • Correct ordering: You get results already sorted by similarity, not distance.

Modify query_sql inside _execute_query() like this:

Suggested change:

```diff
-distance = float(distance)
+query_sql = f"""
+    WITH results AS (
+        SELECT
+            m.id AS id,
+            m.chunk AS chunk,
+            v.distance AS distance,
+            1.0 / (1.0 + v.distance) AS score
+        FROM [{self.vector_table}] AS v
+        JOIN [{self.metadata_table}] AS m ON m.id = v.id
+        WHERE v.embedding MATCH ? AND k = ?
+    )
+    SELECT id, chunk, score
+    FROM results
+    WHERE score >= ?
+    ORDER BY score DESC;
+"""
+cur.execute(query_sql, (emb_blob, k, score_threshold))
```

```python
# Cosine distance range [0,2] -> normalized to [0,1]
score = 1.0 - (distance / 2.0)
logger.info(f"Computed score {score} from distance {distance} for chunk id {_id}")
if score < score_threshold:
```

Inside the for loop, all the code before the try/except statement can be eliminated and this tiny piece added instead:

Suggested change:

```diff
-if score < score_threshold:
+_id, chunk_json, score = row
+score = float(score)
+logger.info(f"Received score {score} for chunk id {_id}")
```

For context, this is how the whole updated function looks in my code:

```python
async def query_vector(
    self,
    embedding: NDArray,
    k: int,
    score_threshold: float,
) -> QueryChunksResponse:
    """
    Performs vector-based search using a virtual table for vector similarity.
    """
    logger.info(
        f"SQLITE-VEC VECTOR SEARCH CALLED: embedding_shape={embedding.shape}, k={k}, threshold={score_threshold}"
    )

    def _execute_query():
        connection = _create_sqlite_connection(self.db_path)
        cur = connection.cursor()
        try:
            emb_list = embedding.tolist() if isinstance(embedding, np.ndarray) else list(embedding)
            emb_blob = serialize_vector(emb_list)
            query_sql = f"""
                WITH results AS (
                    SELECT
                        m.id AS id,
                        m.chunk AS chunk,
                        v.distance AS distance,
                        1.0 / (1.0 + v.distance) AS score
                    FROM [{self.vector_table}] AS v
                    JOIN [{self.metadata_table}] AS m ON m.id = v.id
                    WHERE v.embedding MATCH ? AND k = ?
                )
                SELECT id, chunk, score
                FROM results
                WHERE score >= ?
                ORDER BY score DESC;
            """
            cur.execute(query_sql, (emb_blob, k, score_threshold))
            return cur.fetchall()
        finally:
            cur.close()
            connection.close()

    rows = await asyncio.to_thread(_execute_query)
    chunks, scores = [], []
    for row in rows:
        _id, chunk_json, score = row
        score = float(score)
        logger.info(f"Received score {score} for chunk id {_id}")
        try:
            chunk = Chunk.model_validate_json(chunk_json)
        except Exception as e:
            logger.error(f"Error parsing chunk JSON for id {_id}: {e}")
            continue
        chunks.append(chunk)
        scores.append(score)

    logger.info(f"SQLITE-VEC VECTOR SEARCH RESULTS: Found {len(chunks)} chunks with scores {scores}")
    return QueryChunksResponse(chunks=chunks, scores=scores)
```


```python
score = 1.0 / float(dist) if dist != 0 else float("inf")
# Cosine distance range [0,2] -> normalized to [0,1]
score = 1.0 - (float(dist) / 2.0)
```

Suggested change:

```diff
-score = 1.0 - (float(dist) / 2.0)
+score = 1.0 / (1.0 + float(dist))
```


```python
chunks, scores = [], []
for res in search_res[0]:
    score = float(res["distance"] + 1.0) / 2.0  # rescale to [0,1]
```

@r3v5 (Sep 10, 2025):

Suggested change (note: the original suggestion was missing a closing parenthesis, fixed here):

```diff
-score = float(res["distance"] + 1.0) / 2.0  # rescale to [0,1]
+score = 1.0 / (1.0 + float(res["distance"]))  # rescale to [0,1]
```

```python
for doc, dist in results:
    score = 1.0 / float(dist) if dist != 0 else float("inf")
    # Cosine distance range [0,2] -> normalized to [0,1]
    score = 1.0 - (float(dist) / 2.0)
```

Suggested change:

```diff
-score = 1.0 - (float(dist) / 2.0)
+score = 1.0 / (1.0 + float(dist))
```
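
For illustration (not part of the PR): the two mappings debated in this review behave differently at the far end of the distance range. `1 - d/2` maps cosine distance [0, 2] onto the full [0, 1], while `1 / (1 + d)` compresses it into [1/3, 1]. Both are monotone decreasing, so result ordering is identical either way:

```python
def linear(d: float) -> float:
    return 1.0 - d / 2.0      # [0, 2] -> [0, 1]

def reciprocal(d: float) -> float:
    return 1.0 / (1.0 + d)    # [0, 2] -> [1/3, 1]

for d in (0.0, 1.0, 2.0):
    print(d, linear(d), reciprocal(d))
# 0.0 1.0 1.0
# 1.0 0.5 0.5
# 2.0 0.0 0.3333333333333333
```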

@ehhuang (Contributor) commented Sep 11, 2025:

I made a comment on the issue. Would like to discuss this more.

@ehhuang (Contributor) left a comment:

see linked issue


Development

Successfully merging this pull request may close these issues.

vector_io providers do not calculate scores correctly when cosine distance being used