Skip to content

Reject non-finite persisted vector floats#37

Closed
QuakeWang wants to merge 2 commits into
apache:mainfrom
QuakeWang:reject-nonfinite-index
Closed

Reject non-finite persisted vector floats#37
QuakeWang wants to merge 2 commits into
apache:mainfrom
QuakeWang:reject-nonfinite-index

Conversation

@QuakeWang

Copy link
Copy Markdown
Member

Summary

User input validation already rejects NaN and Infinity, but corrupted or older index files could still persist non-finite f32 values in reader-side float sections. Those values could enter search ranking and hit partial_cmp().unwrap() panic paths, crossing JNI as a panic-boundary error instead of a normal malformed-index error.

This PR rejects non-finite persisted f32 values while decoding v1 reader sections for IVF-FLAT, IVF-HNSW-FLAT, IVF-PQ, and IVF-HNSW-SQ. It also hardens reader-side top-k ordering with total_cmp and keeps intentional JNI panic-boundary tests separate from malformed-index error coverage.

Testing

  • cargo fmt --all -- --check
  • cargo build --all-targets
  • cargo test -p paimon-vindex-core
  • cargo clippy --all-targets --workspace -- -D warnings
  • cargo clippy -p paimon-vindex-jni --all-targets --features panic-boundary-test-hook -- -D warnings
  • mvn -f java/pom.xml test
  • cargo build -p paimon-vindex-jni --release --features panic-boundary-test-hook
  • java -cp java/target/classes:java/target/test-classes org.apache.paimon.index.vector.VectorIndexNativePanicBoundaryTest target/release/libpaimon_vindex_jni.dylib

Signed-off-by: QuakeWang <wangfuzheng0814@foxmail.com>
@JingsongLi

Copy link
Copy Markdown
Contributor

Potential performance concern: this PR adds finite-f32 validation while decoding persisted list payloads. For IVF-FLAT in particular, read_inverted_list() is on the search path and runs once per probed list (search) / unique probed list (search_batch). The new bytes_to_f32_vec_checked() first decodes the whole vector payload into Vec<f32> and then scans all values again with is_finite(), so each query adds an extra O(sum(count * d) over probed lists) pass over candidate vectors.

Could we either merge the finite check into the bytes-to-f32 decode loop to avoid the second pass, or validate/cache a list only once if the reader assumes the underlying index bytes are immutable? The one-time checks in ensure_loaded() look fine; the main concern is the repeated per-search list-vector validation.

@QuakeWang

Copy link
Copy Markdown
Member Author

Potential performance concern: this PR adds finite-f32 validation while decoding persisted list payloads. For IVF-FLAT in particular, read_inverted_list() is on the search path and runs once per probed list (search) / unique probed list (search_batch). The new bytes_to_f32_vec_checked() first decodes the whole vector payload into Vec<f32> and then scans all values again with is_finite(), so each query adds an extra O(sum(count * d) over probed lists) pass over candidate vectors.

Could we either merge the finite check into the bytes-to-f32 decode loop to avoid the second pass, or validate/cache a list only once if the reader assumes the underlying index bytes are immutable? The one-time checks in ensure_loaded() look fine; the main concern is the repeated per-search list-vector validation.

@JingsongLi Thanks for pointing this out. I agree this validation was adding an avoidable second pass on the search path.

I updated the checked f32 decoding helpers used by IVF-FLAT / IVF-HNSW-FLAT list payload decoding to validate each value while decoding bytes into f32, so we no longer decode the full vector payload and scan it again.

I intentionally left the IVF-PQ ensure_loaded() checks unchanged because those are one-time metadata/codebook validations, not repeated per-search list-vector validation.

@JingsongLi

JingsongLi commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

I don't think we should pay regular costs for corrupted index files. Even though the current cost is not low, it is still a linear cost.

@JingsongLi

Copy link
Copy Markdown
Contributor

I know we should have better validation, but I don't think we should overly pursue validation. The core should still focus on functional phenomena and problems, or performance improvement.

@JingsongLi JingsongLi closed this Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants