Rewrite cuVS format implementation #2

@ChrisHegarty ChrisHegarty commented Feb 14, 2025

This change rewrites the cuVS format implementation.

After the rewrite all the BaseKnnVectorsFormatTestCase tests pass. There are still some lurking intermittent failures, but the tests pass successfully the majority of the time.

Summary of the most significant changes:

  1. Use the flat vectors reader/writer to support the raw float32 vectors and the ordinal to docId mapping. This is similar to how HNSW is supported in Lucene, and keeps the code aligned with how other formats are layered atop each other.
  2. The cuVS indices (Cagra, brute force, and HNSW) are stored directly in the format, so they can be mmap'ed directly.
  3. Merges are physical: all raw vectors are retrieved and used to create new cuVS indices.
  4. A standard KnnCollector is used; there is no need for a special one for cuVS, unless one wants to customise some very specific parameters.
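
To illustrate point 3, here is a minimal sketch of what a physical merge has to gather before a new index can be built. The `Segment` class and `gatherLiveVectors` method are hypothetical names for illustration only, not Lucene or cuvs-java types, and skipping deleted vectors is an assumption of the sketch rather than something stated in this change.

```java
import java.util.ArrayList;
import java.util.List;

final class PhysicalMergeSketch {
    /** Hypothetical view of one segment's raw vectors; not a Lucene or cuvs-java type. */
    static final class Segment {
        final float[][] vectors; // raw float32 vectors, indexed by ordinal
        final boolean[] live;    // live[ord] is false for deleted docs (assumption)

        Segment(float[][] vectors, boolean[] live) {
            this.vectors = vectors;
            this.live = live;
        }
    }

    /** Concatenates the live raw vectors of all segments, in segment order. */
    static float[][] gatherLiveVectors(List<Segment> segments) {
        List<float[]> merged = new ArrayList<>();
        for (Segment segment : segments) {
            for (int ord = 0; ord < segment.vectors.length; ord++) {
                if (segment.live[ord]) {
                    merged.add(segment.vectors[ord]);
                }
            }
        }
        // The result is what a fresh index would be built from; the build itself is not shown.
        return merged.toArray(new float[0][]);
    }
}
```

The point of the sketch is that nothing from the old indices is reused: only the raw vectors carry over, and the new index is built from scratch.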

A number of workarounds have been put in place, which will eventually be lifted.

  1. Searches with a pre-filter or deleted docs oversample the topK, since cuvs-java does not yet support a pre-filter.
  2. Cagra failures when indexing small numbers of docs are ignored, and the format falls back to brute force alone.
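
The first workaround can be sketched roughly as follows: ask the index for more candidates than needed, then drop the ones the filter rejects. This is only an illustration under stated assumptions. `UnfilteredSearch`, the `oversample` factor, and the method names are hypothetical, and the actual oversampling amount used by this change is not stated here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

final class OversampleSketch {
    /** Hypothetical stand-in for an unfiltered top-k search; not a cuvs-java type. */
    interface UnfilteredSearch {
        int[] search(int k); // returns ordinals, best match first
    }

    /**
     * Requests topK * oversample candidates (capped at the index size), then keeps
     * the first topK that pass the filter. The factor is an assumed parameter.
     */
    static List<Integer> search(UnfilteredSearch index, int topK, int oversample,
                                int numVectors, IntPredicate accepted) {
        int k = (int) Math.min((long) topK * oversample, numVectors);
        List<Integer> hits = new ArrayList<>();
        for (int ord : index.search(k)) {
            if (hits.size() >= topK) {
                break; // enough accepted hits collected
            }
            if (accepted.test(ord)) {
                hits.add(ord);
            }
        }
        return hits;
    }
}
```

With a fixed factor, a very selective filter can still leave fewer than topK hits, which is one reason a real pre-filter in the underlying library would be preferable.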

@ChrisHegarty ChrisHegarty merged commit 30206d6 into SearchScale:cuvs-integration-main Feb 14, 2025
1 of 5 checks passed
@ChrisHegarty ChrisHegarty self-assigned this Feb 14, 2025