
[pull] master from ray-project:master #2314

Merged
merged 28 commits into master on Aug 17, 2023

Conversation


@pull pull bot commented Aug 17, 2023

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

angelinalg and others added 28 commits August 16, 2023 17:24
…ent times out (#38516)

We currently leak `asyncio.Event`s when:

- Clients time out waiting for updates.
- `notify_changed` is never called for that key.

This change removes the events when the timeout happens on the `LongPollHost`.
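A hedged sketch of the cleanup pattern (illustrative names only; the actual `LongPollHost` tracks listeners differently):

```python
import asyncio
from typing import Dict

# Illustrative map of keys to events waiting for notify_changed().
pending_events: Dict[str, asyncio.Event] = {}


async def wait_for_update(key: str, timeout_s: float) -> bool:
    event = pending_events.setdefault(key, asyncio.Event())
    try:
        await asyncio.wait_for(event.wait(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        # The client timed out and notify_changed() may never fire for this
        # key, so drop the event instead of leaking it.
        pending_events.pop(key, None)
        return False
```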
We are seeing more timeouts for `test_numpy`, such as:

```
(15:56:39) [269 / 347] 24 / 78 tests; Testing //python/ray/data:test_numpy; 63s local
TIMEOUT: //python/ray/data:test_numpy (Summary)
```

So bump up the timeout for `test_numpy` here.

Signed-off-by: Cheng Su <[email protected]>
This PR removes the GCS Ray Syncer, given that the new ray syncer has been turned on by default for a while.
…nguage (#37969)

## Why are these changes needed?
The parameter and return types of the C++ Worker now support `std::any` for cross-language calls.
The C++ Worker's `std::any` is mapped to a Java `Object`.

TODO: The current pull request encounters errors when compiling on Windows, so Windows support is temporarily not available. I will work on adapting it to Windows and merge the Windows support once it is ready.
This reverts commit fb31492.

This is causing unit test failure:

#38460 (comment)
…Gallery HF examples doesn't include it (#38532)

Signed-off-by: angelinalg <[email protected]>
…38497)

When users `pip install ray` but not `ray[train]`/`ray[tune]`, they can run into confusing error messages, such as:

```
ModuleNotFoundError: No module named 'pyarrow'
```

With this PR, we import the core library requirements in the respective `__init__.py` and raise an actionable error message if they are not found:

```
ImportError: Can't import ray.train as some dependencies are missing. Run `pip install ray[train]` to fix.
```
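A hedged sketch of the guard pattern, as it might look in a subpackage `__init__.py` (the actual list of checked modules and the exact wording may differ):

```python
# Sketch of an import guard for ray.train's dependencies.
try:
    import pyarrow  # noqa: F401  # a core requirement of the "train" extra
except ImportError as e:
    raise ImportError(
        "Can't import ray.train as some dependencies are missing. "
        'Run `pip install "ray[train]"` to fix.'
    ) from e
```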

Signed-off-by: Kai Fricke <[email protected]>
Adjust checkpoint interface to accept new-style checkpoint objects instead.

In progress reporter output:

```
Training saved a checkpoint for iteration 4 at: (s3)bucket/foo/bar
```

Signed-off-by: Kai Fricke <[email protected]>
Summary
This PR has CoreWorkers kill their child processes when they notice the raylet has died. This fixes an edge case missing from #33976.

Details
The CoreWorker processes have a periodic check that validates the raylet process is alive. If the raylet is no longer alive, then the CoreWorker processes invoke QuickExit. This PR modifies this behavior so that before calling QuickExit, the CoreWorker processes kill all child processes.
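A rough Python sketch of that control flow (the real logic lives in the C++ CoreWorker; the helper below is purely illustrative and assumes `psutil` is available):

```python
import os

import psutil  # used here only to illustrate the child-process cleanup


def check_raylet_and_exit_if_dead(raylet_pid: int) -> None:
    """Periodic check: if the raylet is gone, kill child processes, then exit."""
    if psutil.pid_exists(raylet_pid):
        return  # raylet still alive, nothing to do

    # Kill all child processes first so they are not leaked ...
    for child in psutil.Process(os.getpid()).children(recursive=True):
        try:
            child.kill()
        except psutil.NoSuchProcess:
            pass

    # ... then exit immediately (the C++ code calls QuickExit here).
    os._exit(1)
```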

Note that this does not address the case where CoreWorker processes crash or are killed with SIGKILL. That condition requires the subreaper work proposed in #34125 / #26118.

The ray stop --force case also requires the subreaper work. This is because --force kills CoreWorker processes with SIGKILL, leaving them no opportunity to clean up child processes. We can add the logic which catches leaked child processes, but the best design for this is the subreaper design.
Fix the comment of 'ObjectBufferPool::ChunkInfo::chunk_index' which was an obvious typo.

Signed-off-by: z4y1b2 <[email protected]>
The new `test_proxy.py` seems to be timing out consistently on Windows. Use the medium test size.
When memory profiling is enabled and Serve actors restart, they attempt to use the same memray file that they used before crashing. This causes memray to raise an error, and the Serve actors crash again, entering a crash loop.

This change makes the memray file change whenever an actor restarts.
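A minimal sketch of the idea, with a hypothetical helper and default path (the real naming scheme in Serve may differ):

```python
import os
import uuid


def memray_profile_path(actor_name: str, log_dir: str = "/tmp/ray/serve_profiles") -> str:
    """Build a memray output path that is unique per actor (re)start.

    The random suffix means a restarted actor never reuses its old file,
    which is what previously made memray error out and crash-loop.
    """
    suffix = uuid.uuid4().hex[:8]
    return os.path.join(log_dir, f"memray_{actor_name}_{suffix}.bin")
```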
This change lets users profile CPU usage on all Serve actors by setting the env var RAY_SERVE_ENABLE_CPU_PROFILING=1. Users can then take a snapshot of the CPU profile by getting a handle to a Serve actor and calling the _save_cpu_profile_data actor method.
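A hedged usage sketch; the env var and method name come from the description above, while the actor name and namespace are assumptions for illustration:

```python
import os

import ray
from ray import serve

# Must be set before the Serve actors start.
os.environ["RAY_SERVE_ENABLE_CPU_PROFILING"] = "1"
serve.start()

# ... deploy applications and send some traffic ...

# Grab a handle to a Serve actor and snapshot its CPU profile.
actor = ray.get_actor("SERVE_CONTROLLER_ACTOR", namespace="serve")  # assumed name
ray.get(actor._save_cpu_profile_data.remote())
```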
Handles, routers, and the HTTP proxy will have the full information `DeploymentID(deployment_name, app_name)`, while the controller will have the incomplete information `"{app name}_{deployment name}"`. The following PR(s) will wrap up everything by having the controller's deployment state manager use `DeploymentID` as well.

* Deployment graph build (and all DeploymentNodes) will use `DeploymentID`
* Handles and routers will use `DeploymentID`
* HTTP Proxy will use `DeploymentID`
* Application state manager will use the original deployment name (e.g. target state will have `{"f": DeploymentInfo(...)}`), then prepend the app name when calling methods on the shared deployment state manager (e.g. send `{"default_f": DeploymentInfo(...)}` to deployment state manager).
* Some controller methods like `get_deployment_info` will use `DeploymentID` because they need to call into the endpoint state.

Other important notes:
* Using `app_name=""` for V1 deployments.
* For now, all metrics still use the same `"{app name}_{deployment name}"` deployment tags.
* Java routers and the HTTP proxy don't change. Correspondingly, the controller will use the old update format for Java (a string instead of `DeploymentID`).
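A minimal sketch of the two representations described above (field names follow the description; the actual class in Serve may differ in detail):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentID:
    """Full identity carried by handles, routers, and the HTTP proxy."""

    deployment_name: str  # e.g. "f"
    app_name: str         # e.g. "default"; "" for V1 deployments

    def to_legacy_string(self) -> str:
        # Flattened "{app name}_{deployment name}" form still used by the
        # controller's deployment state manager and by metric tags.
        return f"{self.app_name}_{self.deployment_name}"


# DeploymentID("f", "default").to_legacy_string() == "default_f"
```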
This PR adds CI runners for the Ray Train and Tune tests with the new storage context path enabled.

Many tests are excluded at first. We will iteratively work on enabling them to avoid having to fix a bunch of issues in one giant PR.

Signed-off-by: Kai Fricke <[email protected]>
Fixed missing space and made more concise.

Signed-off-by: angelinalg <[email protected]>
)

This PR enables the new persistence mode feature flag for the following release tests:

workspace_template_many_model_training
air_benchmark_data_bulk_ingest
air_benchmark_tensorflow_mnist_cpu_1x4
air_benchmark_tensorflow_mnist_cpu_4x1
air_benchmark_tensorflow_mnist_cpu_4x4
air_benchmark_tensorflow_mnist_gpu_4x4
air_example_dreambooth_finetuning
alpa_opt_2_7b_sanity_check
alpa_opt_30b_inference
cluster_tune_scale_up_down
long_running_many_ppo
ml_user_horovod_user_test_latest
ml_user_horovod_user_test_master
ml_user_train_tensorflow_mnist_test
ml_user_train_torch_linear_test
ml_user_tune_rllib_connect_test
train_horovod_multi_node_test

Signed-off-by: Justin Yu <[email protected]>
…eckpoint reporting (#38523)

This PR removes the need for the rank 0 worker to report a checkpoint in order for a checkpoint to be tracked by Train/Tune.

Before (ray <= 2.6):
```python
def train_fn_per_worker(config):
    ...
    tmpdir = tempfile.mkdtemp()
    if session.get_world_rank() == 0:
        # write to tmpdir
        # global rank 0 MUST report. otherwise it's as if you didn't checkpoint
        checkpoint = Checkpoint.from_directory(...)
    else:
        # create an "empty" checkpoint...
        # otherwise, if you just reported None, we throw an error
        # even worse, if you report a dict checkpoint here... unknown territory
        checkpoint = Checkpoint.from_directory(...)
    session.report(..., checkpoint=checkpoint)
```

After:
```python
def train_fn_per_worker(config):
    ...
    # ANY combination of workers can report a checkpoint
    if train.get_context().get_world_rank() in [2, 4, 6]:
        with tempfile.TemporaryDirectory() as tempdir:
            # write checkpoint files into tempdir
            train.report(metrics, checkpoint=Checkpoint.from_directory(tempdir))
    else:
        train.report(metrics)
```

Note: the reported *metrics* are still pulled from the global rank 0 worker (same behavior as before). This PR does not remove that restriction.
Small optimization that makes handles immediately send reports to the controller if the current state is "idle":
- handle has no queued requests
- there are 0 replicas for the deployment

This makes it so that the first request sent to a scaled-to-zero deployment doesn't have to wait for the every-10-second metric push.
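A hedged sketch of the idle check with illustrative names (the real logic lives in the handle's metrics-push loop):

```python
def should_push_report_now(num_queued_requests: int, num_running_replicas: int) -> bool:
    # The handle is "idle" when it has no queued requests and the deployment
    # has zero replicas. Pushing a report to the controller immediately in
    # that state lets the first request to a scaled-to-zero deployment
    # trigger upscaling without waiting for the next every-10-second push.
    return num_queued_requests == 0 and num_running_replicas == 0
```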

Ran some tests below for comparison. The load test is ramp 0 -> 10 users -> 50 users -> 100 users. You can see this greatly helps the first ramp up from 0 -> 10 users, and doesn't really affect anything else.
@pull pull bot added the ⤵️ pull label Aug 17, 2023
@pull pull bot merged commit e79c914 into miqdigital:master Aug 17, 2023