
[pull] master from ray-project:master #2314

Merged
merged 28 commits into master on Aug 17, 2023

Conversation


@pull pull bot commented Aug 17, 2023

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

angelinalg and others added 28 commits August 16, 2023 17:24
…ent times out (#38516)

We currently leak `asyncio.Event`s when:

- Clients time out waiting for updates.
- `notify_changed` is never called for that key.

This change removes the events when the timeout happens on the `LongPollHost`.
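A hedged sketch of the cleanup pattern (illustrative names only; the actual `LongPollHost` tracks listeners differently):

```python
import asyncio
from typing import Dict

# Illustrative map of keys to events waiting for notify_changed().
pending_events: Dict[str, asyncio.Event] = {}


async def wait_for_update(key: str, timeout_s: float) -> bool:
    event = pending_events.setdefault(key, asyncio.Event())
    try:
        await asyncio.wait_for(event.wait(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        # The client timed out and notify_changed() may never fire for this
        # key, so drop the event instead of leaking it.
        pending_events.pop(key, None)
        return False
```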
We are seeing more timeouts for `test_numpy`, such as:

```
(15:56:39) [269 / 347] 24 / 78 tests; Testing //python/ray/data:test_numpy; 63s local
TIMEOUT: //python/ray/data:test_numpy (Summary)
```

So bump up the timeout for `test_numpy` here.

Signed-off-by: Cheng Su <[email protected]>
This PR removes the GCS Ray Syncer, given that the new ray syncer has been turned on by default for a while.
…nguage (#37969)

## Why are these changes needed?
The parameter and return types of the C++ Worker now support `std::any` for cross-language calls.
The C++ Worker's `std::any` is mapped to a Java `Object`.

TODO: The current pull request encounters errors when compiling on Windows, so Windows support is temporarily not available. I will work on adapting it to Windows and merge the Windows support once it is ready.
This reverts commit fb31492.

This is causing unit test failure:

#38460 (comment)
…Gallery HF examples doesn't include it (#38532)

Signed-off-by: angelinalg <[email protected]>
…38497)

When users `pip install ray` but not `ray[train]`/`ray[tune]`, they can run into confusing error messages, such as:

```
ModuleNotFoundError: No module named 'pyarrow'
```

With this PR, we import the core library requirements in the respective `__init__.py` and raise an actionable error message if they are not found:

```
ImportError: Can't import ray.train as some dependencies are missing. Run `pip install ray[train]` to fix.
```
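A hedged sketch of the guard pattern, as it might look in a subpackage `__init__.py` (the actual list of checked modules and the exact wording may differ):

```python
# Sketch of an import guard for ray.train's dependencies.
try:
    import pyarrow  # noqa: F401  # a core requirement of the "train" extra
except ImportError as e:
    raise ImportError(
        "Can't import ray.train as some dependencies are missing. "
        'Run `pip install "ray[train]"` to fix.'
    ) from e
```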

Signed-off-by: Kai Fricke <[email protected]>
Adjust checkpoint interface to accept new-style checkpoint objects instead.

In progress reporter output:

```
Training saved a checkpoint for iteration 4 at: (s3)bucket/foo/bar
```

Signed-off-by: Kai Fricke <[email protected]>
Summary
This PR has CoreWorkers kill their child processes when they notice the raylet has died. This fixes an edge case missing from #33976.

Details
The CoreWorker processes have a periodic check that validates the raylet process is alive. If the raylet is no longer alive, then the CoreWorker processes invoke QuickExit. This PR modifies this behavior so that before calling QuickExit, the CoreWorker processes kill all child processes.
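A rough Python sketch of that control flow (the real logic lives in the C++ CoreWorker; the helper below is purely illustrative and assumes `psutil` is available):

```python
import os

import psutil  # used here only to illustrate the child-process cleanup


def check_raylet_and_exit_if_dead(raylet_pid: int) -> None:
    """Periodic check: if the raylet is gone, kill child processes, then exit."""
    if psutil.pid_exists(raylet_pid):
        return  # raylet still alive, nothing to do

    # Kill all child processes first so they are not leaked ...
    for child in psutil.Process(os.getpid()).children(recursive=True):
        try:
            child.kill()
        except psutil.NoSuchProcess:
            pass

    # ... then exit immediately (the C++ code calls QuickExit here).
    os._exit(1)
```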

Note that this does not address the case where CoreWorker processes crash or are killed with SIGKILL. That condition requires the subreaper work proposed in #34125 / #26118.

The ray stop --force case also requires the subreaper work. This is because --force kills CoreWorker processes with SIGKILL, leaving them no opportunity to clean up child processes. We can add the logic which catches leaked child processes, but the best design for this is the subreaper design.
Fix the comment of 'ObjectBufferPool::ChunkInfo::chunk_index' which was an obvious typo.

Signed-off-by: z4y1b2 <[email protected]>
The new `test_proxy.py` seems to be timing out consistently on Windows. Use the medium test size.
When memory profiling is enabled and Serve actors restart, they attempt to use the same memray file that they used before crashing. This causes memray to raise an error, and the Serve actors crash again, entering a crash loop.

This change makes the memray file change whenever an actor restarts.
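A minimal sketch of the idea, with a hypothetical helper and default path (the real naming scheme in Serve may differ):

```python
import os
import uuid


def memray_profile_path(actor_name: str, log_dir: str = "/tmp/ray/serve_profiles") -> str:
    """Build a memray output path that is unique per actor (re)start.

    The random suffix means a restarted actor never reuses its old file,
    which is what previously made memray error out and crash-loop.
    """
    suffix = uuid.uuid4().hex[:8]
    return os.path.join(log_dir, f"memray_{actor_name}_{suffix}.bin")
```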
This change lets users profile CPU usage on all Serve actors by setting the env var RAY_SERVE_ENABLE_CPU_PROFILING=1. Users can then take a snapshot of the CPU profile by getting a handle to a Serve actor and calling the _save_cpu_profile_data actor method.
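A hedged usage sketch; the env var and method name come from the description above, while the actor name and namespace are assumptions for illustration:

```python
import os

import ray
from ray import serve

# Must be set before the Serve actors start.
os.environ["RAY_SERVE_ENABLE_CPU_PROFILING"] = "1"
serve.start()

# ... deploy applications and send some traffic ...

# Grab a handle to a Serve actor and snapshot its CPU profile.
actor = ray.get_actor("SERVE_CONTROLLER_ACTOR", namespace="serve")  # assumed name
ray.get(actor._save_cpu_profile_data.remote())
```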
Handles, routers, and the HTTP proxy will have the full information `DeploymentID(deployment_name, app_name)`, while the controller will have the incomplete information `"{app name}_{deployment name}"`. The following PR(s) will wrap up everything by having the controller's deployment state manager use `DeploymentID` as well.

* Deployment graph build (and all DeploymentNodes) will use `DeploymentID`
* Handles and routers will use `DeploymentID`
* HTTP Proxy will use `DeploymentID`
* Application state manager will use the original deployment name (e.g. target state will have `{"f": DeploymentInfo(...)}`), then prepend the app name when calling methods on the shared deployment state manager (e.g. send `{"default_f": DeploymentInfo(...)}` to deployment state manager).
* Some controller methods like `get_deployment_info` will use `DeploymentID` because they need to call into the endpoint state.

Other important notes:
* Using `app_name=""` for V1 deployments.
* For now, all metrics still use the same `"{app name}_{deployment name}"` deployment tags.
* Java routers and the HTTP proxy don't change. Correspondingly, the controller will use the old update format for Java (a string instead of `DeploymentID`).
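A minimal sketch of the two representations described above (field names follow the description; the actual class in Serve may differ in detail):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentID:
    """Full identity carried by handles, routers, and the HTTP proxy."""

    deployment_name: str  # e.g. "f"
    app_name: str         # e.g. "default"; "" for V1 deployments

    def to_legacy_string(self) -> str:
        # Flattened "{app name}_{deployment name}" form still used by the
        # controller's deployment state manager and by metric tags.
        return f"{self.app_name}_{self.deployment_name}"


# DeploymentID("f", "default").to_legacy_string() == "default_f"
```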
This PR adds CI runners for the Ray Train and Tune tests with the new storage context path enabled.

Many tests are excluded at first. We will iteratively work on enabling them to avoid having to fix a bunch of issues in one giant PR.

Signed-off-by: Kai Fricke <[email protected]>
Fixed missing space and made more concise.

Signed-off-by: angelinalg <[email protected]>
)

This PR enables the new persistence mode feature flag for the following release tests:

workspace_template_many_model_training
air_benchmark_data_bulk_ingest
air_benchmark_tensorflow_mnist_cpu_1x4
air_benchmark_tensorflow_mnist_cpu_4x1
air_benchmark_tensorflow_mnist_cpu_4x4
air_benchmark_tensorflow_mnist_gpu_4x4
air_example_dreambooth_finetuning
alpa_opt_2_7b_sanity_check
alpa_opt_30b_inference
cluster_tune_scale_up_down
long_running_many_ppo
ml_user_horovod_user_test_latest
ml_user_horovod_user_test_master
ml_user_train_tensorflow_mnist_test
ml_user_train_torch_linear_test
ml_user_tune_rllib_connect_test
train_horovod_multi_node_test

Signed-off-by: Justin Yu <[email protected]>
…eckpoint reporting (#38523)

This PR removes the need for the rank 0 worker to report a checkpoint in order for a checkpoint to be tracked by Train/Tune.

Before (ray <= 2.6):
```python
def train_fn_per_worker(config):
    ...
    tmpdir = tempfile.mkdtemp()
    if session.get_world_rank() == 0:
        # write to tmpdir
        # global rank 0 MUST report. otherwise it's as if you didn't checkpoint
        checkpoint = Checkpoint.from_directory(...)
    else:
        # create an "empty" checkpoint...
        # otherwise, if you just reported None, we throw an error
        # even worse, if you report a dict checkpoint here... unknown territory
        checkpoint = Checkpoint.from_directory(...)
    session.report(..., checkpoint=checkpoint)
```

After:
```python
def train_fn_per_worker(config):
    ...
    # ANY combination of workers can report a checkpoint
    if train.get_context().get_world_rank() in [2, 4, 6]:
        with tempfile.TemporaryDirectory() as tempdir:
            # write checkpoint files into tempdir
            train.report(metrics, checkpoint=Checkpoint.from_directory(tempdir))
    else:
        train.report(metrics)
```

Note: the reported *metrics* are still pulled from the global rank 0 worker (same behavior as before). This PR does not remove that restriction.
Small optimization that makes handles immediately send reports to the controller if the current state is "idle":
- handle has no queued requests
- there are 0 replicas for the deployment

This makes it so that the first request sent to a scaled-to-zero deployment doesn't have to wait for the every-10-second metric push.
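A hedged sketch of the idle check with illustrative names (the real logic lives in the handle's metrics-push loop):

```python
def should_push_report_now(num_queued_requests: int, num_running_replicas: int) -> bool:
    # The handle is "idle" when it has no queued requests and the deployment
    # has zero replicas. Pushing a report to the controller immediately in
    # that state lets the first request to a scaled-to-zero deployment
    # trigger upscaling without waiting for the next every-10-second push.
    return num_queued_requests == 0 and num_running_replicas == 0
```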

Ran some tests below for comparison. The load test is ramp 0 -> 10 users -> 50 users -> 100 users. You can see this greatly helps the first ramp up from 0 -> 10 users, and doesn't really affect anything else.
@pull pull bot added the ⤵️ pull label Aug 17, 2023
@pull pull bot merged commit e79c914 into miqdigital:master Aug 17, 2023