forked from ray-project/ray
[pull] master from ray-project:master #2314
Merged
Conversation
Signed-off-by: angelinalg <[email protected]>
…ent times out (#38516)

We currently leak `asyncio.Event`s when:
- Clients time out waiting for updates.
- `notify_changed` is never called for that key.

This change removes the events when the timeout happens on the `LongPollHost`.
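A minimal sketch of the cleanup pattern described above, assuming a hypothetical long-poll host that tracks one `asyncio.Event` per waiting client; this is illustrative, not Ray's actual `LongPollHost` implementation:

```python
import asyncio


class TimeoutAwareLongPollHost:
    """Illustrative only: drop per-client events when a wait times out."""

    def __init__(self):
        self._events: dict[str, list[asyncio.Event]] = {}

    async def listen_for_change(self, key: str, timeout_s: float) -> str:
        event = asyncio.Event()
        self._events.setdefault(key, []).append(event)
        try:
            await asyncio.wait_for(event.wait(), timeout=timeout_s)
            return "updated"
        except asyncio.TimeoutError:
            # Without this cleanup, the event would leak if notify_changed()
            # is never called for this key.
            self._events[key].remove(event)
            if not self._events[key]:
                del self._events[key]
            return "timed out"

    def notify_changed(self, key: str) -> None:
        # Wake up every client waiting on this key and forget their events.
        for event in self._events.pop(key, []):
            event.set()
```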
We see more timeouts for `test_numpy`, such as:
```
(15:56:39) [269 / 347] 24 / 78 tests; Testing //python/ray/data:test_numpy; 63s local -- | | TIMEOUT: //python/ray/data:test_numpy (Summary)
```
So bump up the timeout for `test_numpy`.

Signed-off-by: Cheng Su <[email protected]>
Signed-off-by: Lonnie Liu <[email protected]>
…checkpoint (#38507) Signed-off-by: Justin Yu <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
This PR removes the GCS Ray Syncer, given that the new ray syncer has been turned on by default for a while.
…nguage (#37969)

## Why are these changes needed?

The parameter and return types of the C++ Worker support `std::any` for cross-language calls. A C++ Worker `std::any` is mapped to a Java object.

TODO: The current pull request encounters errors when compiling on Windows, so Windows support is temporarily unavailable. I will work on adapting it to Windows and merge Windows support once it is ready.
* Update Guava to address https://nvd.nist.gov/vuln/detail/CVE-2023-2976
This reverts commit fb31492, which is causing a unit test failure: #38460 (comment)
…Gallery HF examples doesn't include it (#38532) Signed-off-by: angelinalg <[email protected]>
…38497)

When users `pip install ray` but not `ray[train]`/`ray[tune]`, they can run into confusing error messages, such as:
```
ModuleNotFoundError: No module named 'pyarrow'
```
With this PR, we import the core library requirements in the respective `__init__.py` and raise an actionable error message if they are not found:
```
ImportError: Can't import ray.train as some dependencies are missing. Run `pip install ray[train]` to fix.
```
Signed-off-by: Kai Fricke <[email protected]>
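A sketch of the kind of guard this adds, assuming a hypothetical `__init__.py`; the dependency list and message here are illustrative:

```python
# Hypothetical __init__.py guard; module names are illustrative.
try:
    import pyarrow  # noqa: F401
    import pandas  # noqa: F401
except ImportError as e:
    # Re-raise with an actionable hint instead of a bare ModuleNotFoundError.
    raise ImportError(
        "Can't import ray.train as some dependencies are missing. "
        "Run `pip install 'ray[train]'` to fix."
    ) from e
```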
Adjust checkpoint interface to accept new-style checkpoint objects instead. In progress reporter output:
```
Training saved a checkpoint for iteration 4 at: (s3)bucket/foo/bar
```
Signed-off-by: Kai Fricke <[email protected]>
Summary

This PR has CoreWorkers kill their child processes when they notice the raylet has died. This fixes an edge case missing from #33976.

Details

The CoreWorker processes have a periodic check that validates that the raylet process is alive. If the raylet is no longer alive, the CoreWorker processes invoke QuickExit. This PR modifies this behavior so that before calling QuickExit, the CoreWorker processes kill all of their child processes.

Note that this does not address the case where CoreWorker processes crash or are killed with SIGKILL. That condition requires the subreaper work proposed in #34125 / #26118. The `ray stop --force` case also requires the subreaper work, because `--force` kills CoreWorker processes with SIGKILL, leaving them no opportunity to clean up child processes. We could add logic that catches leaked child processes, but the best design for that is the subreaper design.
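A rough sketch of the behavior described above, using `psutil` to enumerate children; this is an assumption for illustration only, since the actual CoreWorker logic lives in C++:

```python
import os

import psutil  # assumed dependency for this illustration


def kill_children_and_quick_exit(exit_code: int = 1) -> None:
    """Kill all child processes, then exit without running cleanup handlers,
    roughly mirroring the "raylet died" path described above."""
    me = psutil.Process(os.getpid())
    for child in me.children(recursive=True):
        try:
            child.kill()  # SIGKILL; children get no chance to clean up
        except psutil.NoSuchProcess:
            pass  # child already exited
    os._exit(exit_code)  # analogue of QuickExit: skip atexit handlers
```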
Fix the comment of `ObjectBufferPool::ChunkInfo::chunk_index`, which contained an obvious typo. Signed-off-by: z4y1b2 <[email protected]>
The new `test_proxy.py` seems to time out consistently on Windows. Use the medium test size.
When memory profiling is enabled and Serve actors restart, they attempt to use the same memray file that they used before crashing. This causes memray to raise an error, and the Serve actors crash again, entering a crash loop. This change makes the memray file change whenever an actor restarts.
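A small sketch of the idea, assuming a hypothetical helper that builds the memray output path; including the PID and a timestamp means a restarted actor never reuses the old file:

```python
import os
import time


def memray_output_path(actor_name: str, log_dir: str) -> str:
    # Hypothetical helper: include the PID and start time so every
    # (re)start of the actor writes to a fresh memray file instead of
    # tripping over the file left behind by the previous run.
    suffix = f"{os.getpid()}_{int(time.time())}"
    return os.path.join(log_dir, f"memray_{actor_name}_{suffix}.bin")
```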
This change lets users profile CPU usage on all Serve actors by setting the env var `RAY_SERVE_ENABLE_CPU_PROFILING=1`. Users can then take a snapshot of the CPU profile by getting a handle to a Serve actor and calling the `_save_cpu_profile_data` actor method.
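A hypothetical usage sketch based on the description above; the actor name, namespace, and return value are assumptions and may differ between Ray versions:

```python
import os

# Must be set in the environment of the Serve actors before they start.
os.environ["RAY_SERVE_ENABLE_CPU_PROFILING"] = "1"

import ray

# Actor name and namespace are assumptions for illustration; any Serve
# actor handle (controller, proxy, or replica) could be used instead.
controller = ray.get_actor("SERVE_CONTROLLER_ACTOR", namespace="serve")
# Blocks until the CPU profile snapshot is written by the actor.
ray.get(controller._save_cpu_profile_data.remote())
```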
Handles, routers, and the HTTP proxy will have full information `DeploymentID(deployment_name, app_name)`, while the controller will have incomplete information `"{app name}_{deployment name}"`. The following PR(s) will wrap everything up by having the controller's deployment state manager use `DeploymentID` as well.

* Deployment graph build (and all DeploymentNodes) will use `DeploymentID`.
* Handles and routers will use `DeploymentID`.
* HTTP Proxy will use `DeploymentID`.
* Application state manager will use the original deployment name (e.g. target state will have `{"f": DeploymentInfo(...)}`), then prepend the app name when calling methods on the shared deployment state manager (e.g. send `{"default_f": DeploymentInfo(...)}` to the deployment state manager).
* Some controller methods like `get_deployment_info` will use `DeploymentID` because they need to call into the endpoint state.

Other important notes:

* Using `app_name=""` for V1 deployments.
* For now, all metrics still use the same `"{app name}_{deployment name}"` deployment tags.
* Java routers and the HTTP proxy don't change. Correspondingly, the controller will use the old update format for Java (string instead of `DeploymentID`).
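A minimal sketch of such an identifier, with field names assumed from the description above (not necessarily the exact class in Ray Serve):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentID:
    name: str
    app_name: str = ""  # "" for V1 deployments, per the note above

    def legacy_key(self) -> str:
        # Old-style "{app name}_{deployment name}" key still used by the
        # controller and metrics tags for now (hypothetical helper).
        return f"{self.app_name}_{self.name}" if self.app_name else self.name
```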
Signed-off-by: xwjiang2010 <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
This PR adds CI runners for the Ray Train and Tune tests with the new storage context path enabled. Many tests are excluded at first. We will iteratively work on enabling them to avoid having to fix a bunch of issues in one giant PR. Signed-off-by: Kai Fricke <[email protected]>
Fixed missing space and made more concise. Signed-off-by: angelinalg <[email protected]>
This PR enables the new persistence mode feature flag for the following release tests:

workspace_template_many_model_training
air_benchmark_data_bulk_ingest
air_benchmark_tensorflow_mnist_cpu_1x4
air_benchmark_tensorflow_mnist_cpu_4x1
air_benchmark_tensorflow_mnist_cpu_4x4
air_benchmark_tensorflow_mnist_gpu_4x4
air_example_dreambooth_finetuning
alpa_opt_2_7b_sanity_check
alpa_opt_30b_inference
cluster_tune_scale_up_down
long_running_many_ppo
ml_user_horovod_user_test_latest
ml_user_horovod_user_test_master
ml_user_train_tensorflow_mnist_test
ml_user_train_torch_linear_test
ml_user_tune_rllib_connect_test
train_horovod_multi_node_test

Signed-off-by: Justin Yu <[email protected]>
…eckpoint reporting (#38523)

This PR removes the need for the rank 0 worker to report a checkpoint in order for a checkpoint to be tracked by Train/Tune.

Before (ray <= 2.6):

```python
def train_fn_per_worker(config):
    ...
    tmpdir = tempfile.mkdtemp()
    if session.get_world_rank() == 0:
        # write to tmpdir
        # global rank 0 MUST report. otherwise it's as if you didn't checkpoint
        checkpoint = Checkpoint.from_directory(...)
    else:
        # create an "empty" checkpoint...
        # otherwise, if you just reported None, we throw an error
        # even worse, if you report a dict checkpoint here... unknown territory
        checkpoint = Checkpoint.from_directory(...)
    session.report(..., checkpoint)
```

After:

```python
def train_fn_per_worker(config):
    ...
    # ANY combination of workers can report a checkpoint
    if train.get_context().get_world_rank() in [2, 4, 6]:
        with tempfile.TemporaryDirectory() as tempdir:
            # write to tmpdir
            train.report(metrics, Checkpoint.from_directory(tempdir))
    else:
        train.report(metrics)
```

Note: the reported *metrics* are still pulled from the global rank 0 worker (same behavior as before). This PR does not remove that restriction.
Small optimization that makes handles immediately send reports to the controller if the current state is "idle":
- the handle has no queued requests
- there are 0 replicas for the deployment

This makes it so that the first request sent to a scaled-to-zero deployment doesn't have to wait for the every-10-second metric push. Ran some tests below for comparison. The load test is a ramp from 0 -> 10 users -> 50 users -> 100 users. You can see this greatly helps the first ramp up from 0 -> 10 users, and doesn't really affect anything else. A tiny sketch of the "idle" condition follows this description.
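The sketch below mirrors the condition described above; the function and parameter names are illustrative, not Ray's internal API:

```python
def should_report_immediately(num_queued_requests: int, num_running_replicas: int) -> bool:
    # Push metrics to the controller right away so a scaled-to-zero
    # deployment can start a replica without waiting for the periodic
    # (every-10-second) metrics push.
    return num_queued_requests == 0 and num_running_replicas == 0
```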
See Commits and Changes for more details.
Created by pull[bot]
Can you help keep this open source service alive? 💖 Please sponsor : )