Rewrite cuVS format implementation #2

ChrisHegarty · 2025-02-14T10:23:35Z

This change rewrites the cuVS format implementation.

After the rewrite all the BaseKnnVectorsFormatTestCase tests pass. There are still some lurking intermittent failures, but the tests pass successfully the majority of the time.

Summary of the most significant changes:

1. Use the flat vectors reader/writer to support the raw float32 vectors and the ordinal-to-docId mapping. This is similar to how HNSW is supported in Lucene, and keeps the code aligned with how other formats are layered atop each other.
2. The cuVS indices (Cagra, brute force, and HNSW) are stored directly in the format, so they can be mmap'ed directly.
3. Merges are physical: all raw vectors are retrieved and used to create new cuVS indices.
4. A standard KnnCollector is used. There is no need for a special one for cuVS, unless one wants to customise some very specific parameters.

A number of workarounds have been put in place, which will eventually be lifted.

1. Pre-filters and deleted docs are handled by over-sampling the topK, since cuvs-java does not yet support a pre-filter.
2. Cagra failures when indexing small numbers of docs are ignored, falling back to brute force alone.
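
To illustrate the first workaround, here is a minimal sketch of over-sampling followed by post-filtering. It is not the code in this PR: `OverSampleSketch`, `Hit`, `UnfilteredSearcher`, and the scaling rule in `overSampledTopK` are all hypothetical, and the sketch assumes only that the underlying index search cannot apply a filter itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class OverSampleSketch {

  /** A scored hit. Hypothetical type, not part of Lucene or cuvs-java. */
  record Hit(int docId, float score) {}

  /** Hypothetical stand-in for an index search that cannot apply a filter itself. */
  interface UnfilteredSearcher {
    List<Hit> search(float[] query, int topK); // returns best hits first
  }

  /**
   * Assumed scaling rule: grow k by the inverse of the accepted fraction,
   * never below k and never above maxDoc. Returns 0 if no doc is accepted.
   */
  static int overSampledTopK(int k, int maxDoc, int acceptedCount) {
    if (acceptedCount <= 0) {
      return 0;
    }
    long scaled = (long) Math.ceil((double) k * maxDoc / acceptedCount);
    return (int) Math.min(maxDoc, Math.max(k, scaled));
  }

  /** Asks the index for more hits than needed, then drops filtered or deleted docs. */
  static List<Hit> searchWithPostFilter(
      UnfilteredSearcher searcher,
      float[] query,
      int k,
      int maxDoc,
      int acceptedCount,
      IntPredicate accepted) {
    List<Hit> results = new ArrayList<>();
    int topK = overSampledTopK(k, maxDoc, acceptedCount);
    if (topK == 0) {
      return results;
    }
    for (Hit hit : searcher.search(query, topK)) {
      if (accepted.test(hit.docId())) {
        results.add(hit);
        if (results.size() == k) {
          break; // enough accepted hits collected
        }
      }
    }
    return results;
  }
}
```

Post-filtering like this can still return fewer than k hits when the accepted docs are ranked poorly for a given query, which is one reason a native pre-filter would be preferable once it is available.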


github-actions bot added the module:sandbox label Feb 14, 2025

ChrisHegarty added 2 commits February 14, 2025 11:43

add bug URLs

30206d6

ChrisHegarty force-pushed the cuvs-format-rewrite branch from fcbdb28 to 30206d6 Compare February 14, 2025 11:43

ChrisHegarty merged commit 30206d6 into SearchScale:cuvs-integration-main Feb 14, 2025
1 of 5 checks passed

ChrisHegarty self-assigned this Feb 14, 2025
