fix(vector-io): unify score calculation to use cosine and normalize to [0,1] #3375
base: main
Conversation
Force-pushed from b967086 to a0e0c70
Hey @ChristianZaccaria! Thanks for your PR, nice work. I've proposed some improvements.
for row in rows:
    _id, chunk_json, distance = row
    score = 1.0 / distance if distance != 0 else float("inf")
    distance = float(distance)
My recommendation would be to compute normalized scores from the cosine distance directly in the SQL query. It keeps the code concise and means less Python code - a privilege of having a SQL-type database :)
The formula used is: score = 1 / (1 + cosine_distance)
See my comments under issue #3213.
Additional advantages:
- Less data transferred: Only rows meeting the threshold are returned.
- Correct ordering: You get results already sorted by similarity, not distance.
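As a quick sanity check of the `1 / (1 + cosine_distance)` mapping (plain Python, not part of the suggested diff): it sends distance 0 to score 1.0, decreases monotonically, and stays in (0, 1] for cosine distances in [0, 2], so ordering by score descending matches ordering by distance ascending.

```python
def score_from_distance(distance: float) -> float:
    """Map a cosine distance (0 = identical vectors) to a similarity score via 1 / (1 + d)."""
    return 1.0 / (1.0 + distance)

# Identical vectors (distance 0) get the maximum score of 1.0.
assert score_from_distance(0.0) == 1.0

# Opposite vectors (cosine distance 2) map to 1/3, so scores stay in (0, 1].
assert abs(score_from_distance(2.0) - 1.0 / 3.0) < 1e-12

# The mapping is monotonically decreasing: smaller distance => larger score.
distances = [0.0, 0.3, 1.0, 1.7, 2.0]
scores = [score_from_distance(d) for d in distances]
assert scores == sorted(scores, reverse=True)
```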
Modify query_sql inside _execute_query() like this:
query_sql = f"""
    WITH results AS (
        SELECT
            m.id AS id,
            m.chunk AS chunk,
            v.distance AS distance,
            1.0 / (1.0 + v.distance) AS score
        FROM [{self.vector_table}] AS v
        JOIN [{self.metadata_table}] AS m ON m.id = v.id
        WHERE v.embedding MATCH ? AND k = ?
    )
    SELECT id, chunk, score
    FROM results
    WHERE score >= ?
    ORDER BY score DESC;
"""
cur.execute(query_sql, (emb_blob, k, score_threshold))
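The MATCH clause needs the sqlite-vec extension loaded, but the scoring, thresholding, and ordering part of the suggested query can be sanity-checked with the stdlib sqlite3 module against a toy table of precomputed distances (hypothetical table and values, for illustration only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Stand-in for the vec0 virtual table: precomputed cosine distances per chunk id.
cur.execute("CREATE TABLE distances (id TEXT, distance REAL)")
cur.executemany(
    "INSERT INTO distances VALUES (?, ?)",
    [("a", 0.1), ("b", 1.5), ("c", 0.4)],
)

# Same shape as the suggested query: score computed in SQL,
# filtered by threshold, and ordered by score descending.
rows = cur.execute(
    """
    WITH results AS (
        SELECT id, 1.0 / (1.0 + distance) AS score
        FROM distances
    )
    SELECT id, score FROM results
    WHERE score >= ?
    ORDER BY score DESC;
    """,
    (0.5,),
).fetchall()

# "a" (score ~0.909) and "c" (~0.714) pass the 0.5 threshold; "b" (0.4) is filtered out in SQL.
assert [r[0] for r in rows] == ["a", "c"]
conn.close()
```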
# Cosine distance range [0,2] -> normalized to [0,1]
score = 1.0 - (distance / 2.0)
logger.info(f"Computed score {score} from distance {distance} for chunk id {_id}")
if score < score_threshold:
Inside the for loop, all code before the try/except statement can be eliminated and replaced with this tiny piece:
_id, chunk_json, score = row
score = float(score)
logger.info(f"Received score {score} for chunk id {_id}")
Just so it makes more sense, this is how the whole updated function looks in my code:
async def query_vector(
    self,
    embedding: NDArray,
    k: int,
    score_threshold: float,
) -> QueryChunksResponse:
    """
    Performs vector-based search using a virtual table for vector similarity.
    """
    logger.info(
        f"SQLITE-VEC VECTOR SEARCH CALLED: embedding_shape={embedding.shape}, k={k}, threshold={score_threshold}"
    )

    def _execute_query():
        connection = _create_sqlite_connection(self.db_path)
        cur = connection.cursor()
        try:
            emb_list = embedding.tolist() if isinstance(embedding, np.ndarray) else list(embedding)
            emb_blob = serialize_vector(emb_list)
            query_sql = f"""
                WITH results AS (
                    SELECT
                        m.id AS id,
                        m.chunk AS chunk,
                        v.distance AS distance,
                        1.0 / (1.0 + v.distance) AS score
                    FROM [{self.vector_table}] AS v
                    JOIN [{self.metadata_table}] AS m ON m.id = v.id
                    WHERE v.embedding MATCH ? AND k = ?
                )
                SELECT id, chunk, score
                FROM results
                WHERE score >= ?
                ORDER BY score DESC;
            """
            cur.execute(query_sql, (emb_blob, k, score_threshold))
            return cur.fetchall()
        finally:
            cur.close()
            connection.close()

    rows = await asyncio.to_thread(_execute_query)
    chunks, scores = [], []
    for row in rows:
        _id, chunk_json, score = row
        score = float(score)
        logger.info(f"Received score {score} for chunk id {_id}")
        try:
            chunk = Chunk.model_validate_json(chunk_json)
        except Exception as e:
            logger.error(f"Error parsing chunk JSON for id {_id}: {e}")
            continue
        chunks.append(chunk)
        scores.append(score)
    logger.info(f"SQLITE-VEC VECTOR SEARCH RESULTS: Found {len(chunks)} chunks with scores {scores}")
    return QueryChunksResponse(chunks=chunks, scores=scores)
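One detail worth noting in the function above: `asyncio.to_thread` keeps the blocking sqlite work off the event loop. A minimal standalone sketch of that pattern, with a toy query standing in for the real `_execute_query`:

```python
import asyncio
import sqlite3


def _blocking_query() -> list[tuple]:
    # Blocking sqlite work runs in a worker thread, not on the event loop.
    conn = sqlite3.connect(":memory:")
    try:
        cur = conn.cursor()
        # Toy stand-in for the real vector query: score 1 / (1 + distance).
        cur.execute("SELECT 1.0 / (1.0 + 0.5) AS score")
        return cur.fetchall()
    finally:
        conn.close()


async def main() -> list[tuple]:
    # Offload to a thread so other coroutines keep running during the query.
    return await asyncio.to_thread(_blocking_query)


rows = asyncio.run(main())
assert abs(rows[0][0] - 2.0 / 3.0) < 1e-12
```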
score = 1.0 / float(dist) if dist != 0 else float("inf")
# Cosine distance range [0,2] -> normalized to [0,1]
score = 1.0 - (float(dist) / 2.0)
score = 1.0 / (1.0 + float(dist))
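Related to the PR description's Faiss change (IndexFlatL2 to IndexFlatIP with pre-normalized embeddings): once vectors are unit-normalized, their inner product equals cosine similarity. A dependency-free check of that equivalence (toy vectors, for illustration only):

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


a, b = [3.0, 4.0], [1.0, 2.0]

# Cosine similarity computed directly from the raw vectors.
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Inner product of the unit-normalized vectors gives the same value,
# which is why IndexFlatIP over normalized embeddings behaves as cosine search.
ip = dot(normalize(a), normalize(b))
assert abs(cosine - ip) < 1e-12
```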
chunks, scores = [], []
for res in search_res[0]:
    score = float(res["distance"] + 1.0) / 2.0  # rescale to [0,1]
score = 1.0 / (1.0 + float(res["distance"]))  # rescale to (0,1]
for doc, dist in results:
    score = 1.0 / float(dist) if dist != 0 else float("inf")
    # Cosine distance range [0,2] -> normalized to [0,1]
    score = 1.0 - (float(dist) / 2.0)
score = 1.0 / (1.0 + float(dist))
I made a comment on the issue. Would like to discuss this more.
see linked issue
What does this PR do?

Faiss: Changed IndexFlatL2 (euclidean) to IndexFlatIP (dot product) as the similarity metric type. If we normalize the vector embeddings beforehand, the inner product effectively becomes cosine similarity.

SQLite-vec: Set distance_metric to cosine at table creation. This determines how similarity search is computed internally for all queries on the table.

Chroma: Added "hnsw:space": "cosine" to metadata on registering the DB. This tells Chroma how to measure similarity when building and querying the HNSW index for that collection.

Milvus:
PGvector:
Qdrant:
Weaviate:

Added useful logging info to each vector-io provider.

Closes #3213

Test Plan