Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RELEASE] cuvs v24.12 #523

Merged
merged 66 commits into from
Dec 11, 2024
Merged

[RELEASE] cuvs v24.12 #523

merged 66 commits into from
Dec 11, 2024

Conversation

GPUtester
Copy link
Contributor

❄️ Code freeze for branch-24.12 and v24.12 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-24.12 until release (merging of this PR).

What is the purpose of this PR?

  • Update documentation
  • Allow testing for the new release
  • Enable a means to merge branch-24.12 into main for the release

raydouglass and others added 30 commits September 19, 2024 12:06
Forward-merge branch-24.10 into branch-24.12
Forward-merge branch-24.10 into branch-24.12
Forward-merge branch-24.10 into branch-24.12
Forward-merge branch-24.10 into branch-24.12
Forward-merge branch-24.10 into branch-24.12
Forward-merge branch-24.10 into branch-24.12
Forward-merge branch-24.10 into branch-24.12
Forward-merge branch-24.10 into branch-24.12
Forward-merge branch-24.10 into branch-24.12
Forward-merge branch-24.10 into branch-24.12
Forward-merge branch-24.10 into branch-24.12
Forward-merge branch-24.10 into branch-24.12
Forward-merge branch-24.10 into branch-24.12
Forward-merge branch-24.10 into branch-24.12
Forward-merge branch-24.10 into branch-24.12
Forward-merge branch-24.10 into branch-24.12
Forward-merge branch-24.10 into branch-24.12
1) Replace the domain name from `raft` to `cuvs` to avoid confusion when using tools such as NSYS to analyze the program timeline.
2) Use C++17 feature `__has_include` instead of a CMake script to find out if NVTX available in the benchmark executable. It turns out our CMake check has been not reliable due to not finding include directories correctly.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Micka (https://github.com/lowener)

URL: #400
Contributes to rapidsai/build-planning#106

Proposes specifying the RAPIDS version in `conda install` calls that install CI artifacts, to reduce the risk of CI jobs picking up artifacts from other releases.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Mike Sarahan (https://github.com/msarahan)

URL: #406
Add a command-line option to disable the CUDA event/stream synchronization on each iteration. Only one sync is done per benchmark loop in this case instead. As a result, the measured QPS is observed due to:
1) A small `cudaEventSynchronize` is removed from the benchmark loop;
2) If a GPU algorithm has little to no sync between the GPU and CPU, the kernel launch latency and other CPU overheads are completely hidden.

The new option is experimental and disabled by default.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #405
This package is available in `dependencies.yaml`, but due to an oversight was not added to conda metas.

Authors:
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)
  - James Lamb (https://github.com/jameslamb)

URL: #408
Remove a collection of unneccesarily complex CMake logic.

Major change is that we explicitly opt-in to building the C API bindings by default since it is a hard requirement for our python bindings, and the project has numerous conditions to disable it.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)

Approvers:
  - Ben Frederickson (https://github.com/benfred)

URL: #416
This attempts to rename `cuvs_bench` to `cuvs-bench` and similarly for the CPU package. This follows from this thread: rapidsai/docker#715 (comment)

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - James Lamb (https://github.com/jameslamb)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #417
Fixes an issue in CI computations of sccache hit rates. See
#414 for details.
lowener and others added 13 commits November 25, 2024 20:09
I noticed it was missing while switching Milvus to cuVS

Authors:
  - Micka (https://github.com/lowener)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #461
`.view()` is required

Authors:
  - tsuki (https://github.com/enp1s0)

Approvers:
  - Micka (https://github.com/lowener)

URL: #484
This PR adds an option to build the full HNSW hierarchy on the CPU when converting a CAGRA index to an hnswlib index. This lets us enable an `extend()` API. 

For hnswlib:
1. Update to `v0.7.0`
2. Remove dependency as symbols are compiled within DSO

Authors:
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #465
Wheel-building CI jobs are failing like this:

> Checking
final_dist/cuvs_cu12-25.2.0a26-cp310-cp310-manylinux_2_28_aarch64.whl:
FAILED due to warnings
> WARNING `long_description` missing.
> Error: Process completed with exit code 1.

([build
link](https://github.com/rapidsai/cuvs/actions/runs/12133882036))

Looks like the root cause is a combination of the following:

* there was a new `twine` release (6.x) 3 days ago:
https://pypi.org/project/twine/#history
* it contains pypa/twine#1168, which makes
`twine check --strict` fail if the wheel's `long_description` is empty
* the `cuvs` wheel README (used as the wheel `long_description`) is
empty

This proposes adding a small README, with just 2 sentences copied from
the project's root-level README, to get past that check.

## Notes for Reviewers

The `long_description` becomes the project homepage when a project is
hosted on PyPI. The wheels produced from this repo aren't currently
being published to pypi.org so this change won't be seen there, but a
more user-friendly README should be added if/when we decide to publish
`cuvs-cu{11,12}` to pypi.org.

ref: rapidsai/build-planning#70
This notebook is adapting the Question Retrieval nb to use Milvus.
It can serve as a good example on how to do Bulk ingest, how to use cuVS, and especially CAGRA+HNSW on Milvus

Authors:
  - Micka (https://github.com/lowener)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #451
Ref : #472

## The cause of the bug
The bitonic sort was used on an array that was not a power of 2 long. In the current search implementation, the bitonic sort is used to move the invalid elements to the end of the buffer as:
https://github.com/rapidsai/cuvs/blob/5062594138a40231475299c7bac61083b0669fd1/cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh#L758-L763
https://github.com/rapidsai/cuvs/blob/5062594138a40231475299c7bac61083b0669fd1/cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh#L644-L649

The problem is that the (max) array length (=`MAX_ITOPK + MAX_CANDIDATES`) is not always the power of two.
These bitonic sorts are called even if no elements are filtered out unless `cuvs::neighbors::filtering::none_sample_filter` is specified as the filter, so #472 occurs.

## Fix
This PR changes the filtering process so that the bitonic sort is not used to move the invalid elements to the end of the buffer.

Authors:
  - tsuki (https://github.com/enp1s0)

Approvers:
  - Artem M. Chirkin (https://github.com/achirkin)

URL: #489
Let me know if this is out of scope for cuVS!

Authors:
  - Ajit Mistry (https://github.com/ajit283)
  - Ben Frederickson (https://github.com/benfred)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Ben Frederickson (https://github.com/benfred)

URL: #311
Non-blocking / stream-ordered dynamic batching as a new index type.

## API

This PR implements dynamic batching as a new index type, mirroring the API of other indices.

  * [_building is wrapping_] Building the index means creating a lightweight wrapper on top of an existing index and initializing necessary components, such as IO batch buffers and synchronization primitives.
  * [_type erasure_] The underlying/upstream index type is erased once the dynamic_batching wrapper is created, i.e. there's no way to recover the original search index type or parameters.
  * [_explicit control over batching_] To allow multiple user requests group into a dynamic batch request, the users must use copies of the same dynamic batching index (the user-facing index type is a thin wrapper on top of a shared pointer, hence the copy is shallow and cheap). The search function is thread-safe.

## Feature:  stream-ordered dynamic batching

Non-blocking / stream-ordered dynamic batching means the batching does not involve synchronizing with a GPU stream. The control is returned to the user as soon as the necessary work is submitted to the GPU. This entails a few good-to-know features:

1. The dynamic batching index has the same blocking properties as the upstream index: if the upstream index does not involve stream sync during search, that the dynamic batching index does not involve it as well (otherwise, the dynamic batching search obviously waits till the upstream search synchronizes under the hood).
2. It's responsibility of the user to synchronize the stream before getting the results back - even if the upstream index search does not need it (the batch results are scattered back to the request threads in a post-processing kernel).
3. If the upstream index does not synchronize during search, the dynamic batching index can group the queries even in a single-threaded application (_try it with --no-lap-sync option in the ann-bench benchmarks_).

Overall, stream-ordered dynamic batching makes it easy to modify existing cuVS indexes, because the wrapped index has the same execution behavior as the upstream index.

## Work-in-progress TODO

- [x] Add dynamic batching option to more indices in ann-bench
- [x] Add tests
- [x] **(postponed to 25.02)** Do proper benchmarking and possibly fine-tune the inter-thread communication
- [x] Review the API side (`cpp/include/cuvs/neighbors/dynamic_batching.hpp`) [ready for review CC @cjnolet]
- [x] Review the algorithm side (`cpp/src/neighbors/detail/dynamic_batching.cuh`) [ready for preliminary review: requests for algorithm docsting/clarifications are especially welcome]

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #261
First draft for scalar quantization.

WIP status:
* only int8_t target type
* quantile computation inefficient (via sampling & sorting)

Authors:
  - Malte Förster (https://github.com/mfoerste4)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #494
Skip some checks involving hard-coded offsets into the data when the number of records in the checked PQ list is smaller than needed.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: #512
Due to the tight integration between cuvs and raft, we need to ensure that cuvs is updated for rapidsai/raft#2503 or builds of cuvs that rely on cloning raft will get an incompatible version of cutlass due to raft's update.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - Ben Frederickson (https://github.com/benfred)

URL: #516
Partially addresses #455

Authors:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Ben Frederickson (https://github.com/benfred)
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - Bradley Dice (https://github.com/bdice)

URL: #463
…466)

PR allows calculating ground truth for cuvs-bench on CPU systems. Current version uses a simple NumPy brute force, perhaps we should consider using faiss? cc @cjnolet @divyegala

Authors:
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Corey J. Nolet (https://github.com/cjnolet)
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

URL: #466
@GPUtester GPUtester requested review from a team as code owners December 6, 2024 19:13
@GPUtester GPUtester requested review from bdice and removed request for a team December 6, 2024 19:13
Copy link

copy-pr-bot bot commented Dec 6, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@raydouglass raydouglass merged commit b5eed5e into main Dec 11, 2024
5 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.