-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[serve] recursive cancellation #47873
Conversation
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
@@ -365,5 +365,104 @@ async def get_out_of_band_response(self): | |||
assert h.get_out_of_band_response.remote().result() == "ok" | |||
|
|||
|
|||
def test_recursive_cancellation_during_execution(serve_instance): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in order for these tests to be meaningful, don't we need to turn off the core cancellation support?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So at first I took the approach of implementing full recursive cancellation at the Serve layer, but that leads to more of a performance drop. I think it's unnecessary to cover the part that core already covers (recursive task cancellation). We just need to take care of canceling the asyncio request assignment task before a request has been assigned to a replica.
So in this test file test_recursive_cancellation_during_execution
already passes, and test_recursive_cancellation_during_assignment
is the one that doesn't pass without this PR.
@@ -594,6 +599,17 @@ async def assign_request( | |||
) -> ReplicaResult: | |||
"""Assign a request to a replica and return the resulting object_ref.""" | |||
|
|||
response_id = uuid.uuid4() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-blocker: There is this generate_request_id() maybe we can rename and reuse here?
python/ray/serve/_private/router.py
Outdated
response_id = uuid.uuid4() | ||
assign_request_task = asyncio.current_task() | ||
ray.serve.context._add_request_pending_assignment( | ||
request_meta.request_id, response_id, assign_request_task |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably use internal_request_id
for this. request_id
can be set by the client from the header and can be repeated/ other unexpected behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice thanks for the suggestion, changed to use internal request id!
Signed-off-by: Cindy Zhang <[email protected]>
Signed-off-by: Cindy Zhang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signed-off-by: Cindy Zhang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Cindy!
Signed-off-by: Cindy Zhang <[email protected]>
## Why are these changes needed? Implement full recursive cancellation at Serve layer. Currently, at the time that a parent request is cancelled, if: - The child request has not been assigned to a replica, _meaning_ the replica scheduler in the router is still processing that request - AND the parent request is not directly blocking and waiting on the result of the child request Then the child request will not be correctly cancelled. Note that if the second bullet point were actually flipped, i.e. the parent request _was_ waiting on the result at the time of cancellation, then an `asyncio.CancelledError` will be sent directly to the replica scheduler and the child request will get successfully cancelled. However obviously this cannot be guaranteed to be true. This PR adds a cache for asyncio tasks that are launched in the router to assign a request to a replica. When a parent request is cancelled, if there are any "pending assignment requests" in the cache, they are cancelled. Signed-off-by: Cindy Zhang <[email protected]>
## Why are these changes needed? Implement full recursive cancellation at Serve layer. Currently, at the time that a parent request is cancelled, if: - The child request has not been assigned to a replica, _meaning_ the replica scheduler in the router is still processing that request - AND the parent request is not directly blocking and waiting on the result of the child request Then the child request will not be correctly cancelled. Note that if the second bullet point were actually flipped, i.e. the parent request _was_ waiting on the result at the time of cancellation, then an `asyncio.CancelledError` will be sent directly to the replica scheduler and the child request will get successfully cancelled. However obviously this cannot be guaranteed to be true. This PR adds a cache for asyncio tasks that are launched in the router to assign a request to a replica. When a parent request is cancelled, if there are any "pending assignment requests" in the cache, they are cancelled. Signed-off-by: Cindy Zhang <[email protected]>
## Why are these changes needed? Implement full recursive cancellation at Serve layer. Currently, at the time that a parent request is cancelled, if: - The child request has not been assigned to a replica, _meaning_ the replica scheduler in the router is still processing that request - AND the parent request is not directly blocking and waiting on the result of the child request Then the child request will not be correctly cancelled. Note that if the second bullet point were actually flipped, i.e. the parent request _was_ waiting on the result at the time of cancellation, then an `asyncio.CancelledError` will be sent directly to the replica scheduler and the child request will get successfully cancelled. However obviously this cannot be guaranteed to be true. This PR adds a cache for asyncio tasks that are launched in the router to assign a request to a replica. When a parent request is cancelled, if there are any "pending assignment requests" in the cache, they are cancelled. Signed-off-by: Cindy Zhang <[email protected]> Signed-off-by: mohitjain2504 <[email protected]>
Why are these changes needed?
Implement full recursive cancellation at Serve layer.
Currently, at the time that a parent request is cancelled, if:
Then the child request will not be correctly cancelled. Note that if the second bullet point were actually flipped, i.e. the parent request was waiting on the result at the time of cancellation, then an
asyncio.CancelledError
will be sent directly to the replica scheduler and the child request will get successfully cancelled. However obviously this cannot be guaranteed to be true.This PR adds a cache for asyncio tasks that are launched in the router to assign a request to a replica. When a parent request is cancelled, if there are any "pending assignment requests" in the cache, they are cancelled.
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.