Conversation

ChristianZaccaria (Contributor)

What does this PR do?

  • Faiss:

    • Changed from IndexFlatL2 (Euclidean distance) to IndexFlatIP (inner/dot product) as the similarity metric. If we normalize the vector embeddings beforehand, the inner product effectively becomes cosine similarity.
    • Score calculation: rescaled from cosine similarity [-1,1] to [0,1].
  • SQLite-vec:

    • Setting distance_metric to cosine at table creation. This determines how similarity search is computed internally for all queries on the table.
    • Score calculation: Cosine Distance [0,2] -> normalized to [0,1]
  • Chroma:

    • Added "hnsw:space": "cosine" to metadata on registering the DB. This tells Chroma how to measure similarity when building and querying the HNSW index for that collection.
    • Score calculation: Cosine distance [0,2] -> normalized to [0,1]
  • Milvus:

    • Scores are sorted in descending order.
    • Score calculation: score is rescaled from cosine similarity [-1,1] to [0,1].
  • PGvector:

    • Score calculation: Cosine distance [0,2] -> normalized to [0,1]
  • Qdrant:

    • Score calculation: Cosine similarity range [-1,1] -> normalized to [0,1]
  • Weaviate:

    • Score calculation: Cosine distance range [0,2] -> normalized to [0,1]
  • Added useful logging info to each vector-io provider.
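
The per-provider rescalings listed above boil down to two mappings. As an illustrative sketch (helper names are mine, not from the PR):

```python
def similarity_to_unit(cos_sim: float) -> float:
    """Rescale cosine similarity from [-1, 1] to [0, 1] (Faiss, Milvus, Qdrant)."""
    return (cos_sim + 1.0) / 2.0

def distance_to_unit(cos_dist: float) -> float:
    """Rescale cosine distance from [0, 2] to [0, 1] (sqlite-vec, Chroma, PGvector, Weaviate)."""
    return 1.0 - cos_dist / 2.0
```

For normalized embeddings, cosine distance is `1 - cosine_similarity`, so the two helpers agree: `distance_to_unit(1 - s) == similarity_to_unit(s)`.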

Closes #3213

Test Plan

  • Added test cases:
    • Test that vector similarity scores are properly normalized to [0,1] range for all vector providers.
    • Runs multiple queries with varying similarity levels (high, medium, low, nonsense).
    • Verifies all similarity scores are numeric and normalized to [0, 1].
    • Confirms scores are sorted in descending order (most similar first).
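
The invariant those test cases check can be sketched as a small assertion helper (hypothetical, not the PR's actual test code):

```python
def assert_normalized_and_sorted(scores: list[float]) -> None:
    # Every score must be numeric and within [0, 1].
    assert all(isinstance(s, (int, float)) for s in scores)
    assert all(0.0 <= s <= 1.0 for s in scores)
    # Results must come back most-similar first.
    assert scores == sorted(scores, reverse=True)

assert_normalized_and_sorted([0.92, 0.55, 0.31, 0.07])
```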

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Sep 8, 2025.
@r3v5 (Contributor) left a comment:

Hey @ChristianZaccaria! Thanks for your PR, nice work. I've proposed some improvements.

```python
for row in rows:
    _id, chunk_json, distance = row
    score = 1.0 / distance if distance != 0 else float("inf")
    distance = float(distance)
```
@r3v5 (Sep 10, 2025):

My recommendation would be to compute the normalized scores from cosine distance directly in the SQL query. It keeps the code concise and moves work out of Python — one of the perks of a SQL-backed store :)

The formula used is: score = 1 / (1 + cosine_distance)
See my comments here under the issue #3213

Additional advantages:

  • Less data transferred: Only rows meeting the threshold are returned.
  • Correct ordering: You get results already sorted by similarity, not distance.

Modify query_sql inside _execute_query() like this:

Suggested change:

```diff
-distance = float(distance)
+query_sql = f"""
+    WITH results AS (
+        SELECT
+            m.id AS id,
+            m.chunk AS chunk,
+            v.distance AS distance,
+            1.0 / (1.0 + v.distance) AS score
+        FROM [{self.vector_table}] AS v
+        JOIN [{self.metadata_table}] AS m ON m.id = v.id
+        WHERE v.embedding MATCH ? AND k = ?
+    )
+    SELECT id, chunk, score
+    FROM results
+    WHERE score >= ?
+    ORDER BY score DESC;
+"""
+cur.execute(query_sql, (emb_blob, k, score_threshold))
```

```python
# Cosine distance range [0,2] -> normalized to [0,1]
score = 1.0 - (distance / 2.0)
logger.info(f"Computed score {score} from distance {distance} for chunk id {_id}")
if score < score_threshold:
```

Inside the for loop, all the code before the try/except statement can be eliminated and this tiny piece added instead:

Suggested change:

```diff
-if score < score_threshold:
+_id, chunk_json, score = row
+score = float(score)
+logger.info(f"Received score {score} for chunk id {_id}")
```

For context, this is how the whole updated function looks in my code:

```python
async def query_vector(
    self,
    embedding: NDArray,
    k: int,
    score_threshold: float,
) -> QueryChunksResponse:
    """
    Performs vector-based search using a virtual table for vector similarity.
    """
    logger.info(
        f"SQLITE-VEC VECTOR SEARCH CALLED: embedding_shape={embedding.shape}, k={k}, threshold={score_threshold}"
    )

    def _execute_query():
        connection = _create_sqlite_connection(self.db_path)
        cur = connection.cursor()
        try:
            emb_list = embedding.tolist() if isinstance(embedding, np.ndarray) else list(embedding)
            emb_blob = serialize_vector(emb_list)
            query_sql = f"""
                WITH results AS (
                    SELECT
                        m.id AS id,
                        m.chunk AS chunk,
                        v.distance AS distance,
                        1.0 / (1.0 + v.distance) AS score
                    FROM [{self.vector_table}] AS v
                    JOIN [{self.metadata_table}] AS m ON m.id = v.id
                    WHERE v.embedding MATCH ? AND k = ?
                )
                SELECT id, chunk, score
                FROM results
                WHERE score >= ?
                ORDER BY score DESC;
            """
            cur.execute(query_sql, (emb_blob, k, score_threshold))
            return cur.fetchall()
        finally:
            cur.close()
            connection.close()

    rows = await asyncio.to_thread(_execute_query)
    chunks, scores = [], []
    for row in rows:
        _id, chunk_json, score = row
        score = float(score)
        logger.info(f"Received score {score} for chunk id {_id}")
        try:
            chunk = Chunk.model_validate_json(chunk_json)
        except Exception as e:
            logger.error(f"Error parsing chunk JSON for id {_id}: {e}")
            continue
        chunks.append(chunk)
        scores.append(score)

    logger.info(f"SQLITE-VEC VECTOR SEARCH RESULTS: Found {len(chunks)} chunks with scores {scores}")
    return QueryChunksResponse(chunks=chunks, scores=scores)
```


```python
score = 1.0 / float(dist) if dist != 0 else float("inf")
# Cosine distance range [0,2] -> normalized to [0,1]
score = 1.0 - (float(dist) / 2.0)
```

Suggested change:

```diff
-score = 1.0 - (float(dist) / 2.0)
+score = 1.0 / (1.0 + float(dist))
```


```python
chunks, scores = [], []
for res in search_res[0]:
    score = float(res["distance"] + 1.0) / 2.0  # rescale to [0,1]
```

@r3v5 (Sep 10, 2025):

Suggested change (note: the original suggestion was missing a closing parenthesis, fixed here):

```diff
-score = float(res["distance"] + 1.0) / 2.0  # rescale to [0,1]
+score = 1.0 / (1.0 + float(res["distance"]))  # rescale to [0,1]
```

```python
for doc, dist in results:
    score = 1.0 / float(dist) if dist != 0 else float("inf")
    # Cosine distance range [0,2] -> normalized to [0,1]
    score = 1.0 - (float(dist) / 2.0)
```

Suggested change:

```diff
-score = 1.0 - (float(dist) / 2.0)
+score = 1.0 / (1.0 + float(dist))
```
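
For illustration (not part of the PR): the two mappings debated in this review behave differently at the far end of the distance range. `1 - d/2` maps cosine distance [0, 2] onto the full [0, 1], while `1 / (1 + d)` compresses it into [1/3, 1]. Both are monotone decreasing, so result ordering is identical either way:

```python
def linear(d: float) -> float:
    return 1.0 - d / 2.0      # [0, 2] -> [0, 1]

def reciprocal(d: float) -> float:
    return 1.0 / (1.0 + d)    # [0, 2] -> [1/3, 1]

for d in (0.0, 1.0, 2.0):
    print(d, linear(d), reciprocal(d))
# 0.0 1.0 1.0
# 1.0 0.5 0.5
# 2.0 0.0 0.3333333333333333
```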

@ehhuang (Contributor) commented Sep 11, 2025:

I made a comment on the issue. Would like to discuss this more.

@ehhuang (Contributor) left a comment:

see linked issue


Development

Successfully merging this pull request may close these issues.

vector_io providers do not calculate scores correctly when cosine distance being used