Make the specializing interpreter thread-safe in --disable-gil builds #115999
Comments
(subscribing myself)
…aded builds (#116013) For now, disable all specialization when the GIL might be disabled.
This is now a performance (rather than correctness) issue for free-threaded builds, so I'm going to focus on more time-sensitive issues for a while.
@swtaarrs Out of curiosity, is there any progress or plan for this issue?
@corona10 I'm planning to work on this after I get the deferred reference stack in. However, there are no concrete plans as of now. I'm really happy for you or anyone else to propose a design for the specializing interpreter with free-threaded safety!
@Fidget-Spinner cc @swtaarrs By the way, in the short term, can we enable the specializer only for the main thread if we cannot solve the issue before 3.13 is released?
@corona10 For 3.13, I think we're generally focusing on scalability across multiple cores rather than single-threaded perf. It's a bit too near to feature freeze for me to feel safe re-enabling specialization at this point. There are still a lot of unsolved problems, even with specialization only on the main thread. Consider the following:
I'm reading a few papers to get some inspiration and also looking at how CRuby and other runtimes deal with this. Will post back when I have an actual plan.
Stop the world when invalidating function versions

The tier1 interpreter specializes `CALL` instructions based on the values of certain function attributes (e.g. `__code__`, `__defaults__`). It uses function versions to verify that the attributes of a function during execution of a specialization match those seen during specialization. A function's version is initialized in `MAKE_FUNCTION` and is invalidated when any of the critical function attributes are changed. The tier1 interpreter stores the function version in the inline cache during specialization. A guard is used by the specialized instruction to verify that the version of the function on the operand stack matches the cached version (and therefore has all of the expected attributes). It is assumed that once the guard passes, all attributes will remain unchanged while executing the rest of the specialized instruction.

Stopping the world when invalidating function versions ensures that all critical function attributes will remain unchanged after the function version guard passes in free-threaded builds. It's important to note that this is only true if the remainder of the specialized instruction does not enter and exit a stop-the-world point.

We will stop the world the first time any of the following function attributes are mutated:
- defaults
- vectorcall
- kwdefaults
- closure
- code

This should happen rarely and only happens once per function, so the performance impact on the majority of code should be minimal. Additionally, refactor the API for manipulating function versions to more clearly match the stated semantics.
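A minimal standalone sketch of the version-guard idea described above (C11, hypothetical types and names; this is not CPython's internal API):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t version;   /* 0 means "invalidated: a critical attribute changed" */
    void *code;         /* stand-ins for __code__, __defaults__, ... */
    void *defaults;
} func_sketch;

/* Hypothetical stand-ins for pausing/resuming every other thread. */
static void stop_the_world(void)  { /* would pause all other threads here */ }
static void start_the_world(void) { /* would resume them */ }

/* Writer side: mutate a critical attribute only while the world is stopped,
 * so no specialized CALL can sit between its version guard and its use of
 * the attribute. */
static void
set_defaults(func_sketch *f, void *new_defaults)
{
    stop_the_world();
    f->version = 0;             /* cached versions can never match again */
    f->defaults = new_defaults;
    start_the_world();
}

/* Reader side: guard run by the specialized instruction. cached_version was
 * stored in the inline cache at specialization time. */
static bool
version_guard(const func_sketch *f, uint32_t cached_version)
{
    return cached_version != 0 && f->version == cached_version;
}
```

The key property is that the attribute and the version change inside the same stop-the-world window, so a specialized instruction that has already passed the guard can never observe a half-updated function.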
I am working on …
* None / bool / int / str are immutable types, so they are thread-safe.
* list is mutable, but by using ``PyList_GET_SIZE`` we can make it thread-safe.
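As a rough illustration of the single-read idea for lists (a hedged sketch with a made-up function name, not the actual specialized instruction; `PyList_GET_SIZE` is the real macro):

```c
#include <Python.h>
#include <assert.h>

/* Sketch of a TO_BOOL-style fast path for lists: the only shared state it
 * touches is the list's size, read exactly once via PyList_GET_SIZE, so a
 * concurrent append/remove can change the answer but cannot corrupt the
 * check. */
static PyObject *
list_to_bool_sketch(PyObject *obj)
{
    assert(PyList_CheckExact(obj));          /* type guard left by the specializer */
    Py_ssize_t size = PyList_GET_SIZE(obj);  /* single read of the size field */
    return size == 0 ? Py_False : Py_True;   /* both immortal in 3.12+, no refcounting needed */
}
```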
Don't take a reason in unspecialize

We only want to compute the reason if stats are enabled. Optimizing compilers should optimize this away for us (GCC and Clang do), but it's better to be safe than sorry.
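The pattern is roughly the one sketched below (hypothetical macro and helper names; only `Py_STATS` is the real build flag):

```c
/* Hypothetical stats hook; not CPython's actual helper. */
void stats_record_failure(int opcode, int reason);

#ifdef Py_STATS
#  define RECORD_FAILURE(opcode, reason) stats_record_failure((opcode), (reason))
#else
   /* Normal builds: the reason expression never appears in the expansion,
    * so it is never evaluated at all. */
#  define RECORD_FAILURE(opcode, reason) ((void)0)
#endif
```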
Enable specialization of `LOAD_GLOBAL` in free-threaded builds.

Thread-safety of specialization in free-threaded builds is provided by the following:
* A critical section is held on both the globals and builtins objects during specialization. This ensures we get an atomic view of both builtins and globals during specialization.
* Generation of new keys versions is made atomic in free-threaded builds.
* Existing helpers are used to atomically modify the opcode.

Thread-safety of specialized instructions in free-threaded builds is provided by the following:
* Relaxed atomics are used when loading and storing dict keys versions. This avoids potential data races, as the dict keys versions are read without holding the dictionary's per-object lock in version guards.
* Dict keys objects are passed from keys version guards to the downstream uops. This ensures that we are loading from the correct offset in the keys object. Once a unicode key has been stored in a keys object for a combined dictionary in free-threaded builds, the offset that it is stored in will never be reused for a different key. Once the version guard passes, we know that we are reading from the correct offset.
* The dictionary read fast-path is used to read values from the dictionary once we know the correct offset.

Use existing helpers to atomically modify the bytecode. Add unit tests to ensure specializing is happening as expected. Add test_specialize.py, which can be used with ThreadSanitizer to detect data races.
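A standalone sketch of the keys-version guard described above, using plain C11 atomics rather than CPython's internal atomic helpers (all names are hypothetical):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    _Atomic uint32_t keys_version;  /* bumped whenever the key layout changes */
    void *entries[8];               /* stand-in for the dict keys entries */
} keys_sketch;

/* Writer side: invalidate cached specializations by bumping the version. */
static void
keys_changed(keys_sketch *keys)
{
    atomic_fetch_add_explicit(&keys->keys_version, 1, memory_order_relaxed);
}

/* Reader side: the inline cache stored (cached_version, cached_offset) at
 * specialization time; the guard only trusts the cached offset if the
 * version still matches. */
static void *
guarded_load(keys_sketch *keys, uint32_t cached_version, size_t cached_offset)
{
    uint32_t v = atomic_load_explicit(&keys->keys_version, memory_order_relaxed);
    if (v != cached_version) {
        return NULL;  /* deopt: fall back to the generic LOAD_GLOBAL path */
    }
    return keys->entries[cached_offset];
}
```

The "offsets are never reused for a different key" property quoted above is what makes a relaxed load sufficient here: a stale read can only cause a harmless deopt, never a load from the wrong key.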
…26600) Add free-threaded specialization for the `UNPACK_SEQUENCE` opcode.

`UNPACK_SEQUENCE_TUPLE`/`UNPACK_SEQUENCE_TWO_TUPLE` are already thread-safe since tuples are immutable. `UNPACK_SEQUENCE_LIST` is not thread-safe because of the nature of lists (there is nothing preventing another thread from adding items to or removing them from the list while the instruction is executing). To achieve thread safety we add a critical section to the implementation of `UNPACK_SEQUENCE_LIST`, specifically around the parts where we check the size of the list and push items onto the stack.

Co-authored-by: Matt Page <[email protected]>
Co-authored-by: mpage <[email protected]>
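The shape of that fix is roughly the following standalone sketch, which uses a plain pthread mutex as a stand-in for the list's per-object lock (hypothetical types and names, not the actual interpreter code):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    pthread_mutex_t lock;  /* stand-in for the list's per-object lock */
    size_t size;
    void **items;
} list_sketch;

/* Copies exactly `expected` items into `out`, or fails if the size does not
 * match. Checking the size and reading the items under the same lock is what
 * keeps the specialized instruction safe without the GIL. */
static bool
unpack_list_sketch(list_sketch *list, size_t expected, void **out)
{
    bool ok = false;
    pthread_mutex_lock(&list->lock);
    if (list->size == expected) {
        for (size_t i = 0; i < expected; i++) {
            out[i] = list->items[i];  /* real code would also take references here */
        }
        ok = true;
    }
    pthread_mutex_unlock(&list->lock);
    return ok;
}
```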
The specialization only depends on the type, so there are no special thread-safety considerations there. `STORE_SUBSCR_LIST_INT` needs to lock the list before modifying it. `_PyDict_SetItem_Take2` already internally locks the dictionary using a critical section.
Record success in `specialize`

This matches the existing behavior, where we increment the success stat for the generic opcode each time we successfully specialize an instruction.
Feature or enhancement
Proposal:
In free-threaded builds, the specializing adaptive interpreter needs to be made thread-safe. We should start with a small PR to simply disable it in free-threaded builds, which will be correct but will incur a performance penalty. Then we can work out how to properly support specialization in a free-threaded build.
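The first step could look roughly like the sketch below. `Py_GIL_DISABLED` is the macro defined in --disable-gil builds; `specialize_sketch` is a hypothetical stand-in for the specializing entry point, not CPython's actual function:

```c
#include <Python.h>

/* Hedged sketch of step one: gate the adaptive specializer behind the
 * free-threaded build flag. */
static void
specialize_sketch(void)
{
#ifdef Py_GIL_DISABLED
    /* Free-threaded build: skip specialization entirely for now. Correct,
     * but leaves tier-1 performance on the table until a thread-safe
     * design lands. */
    return;
#else
    /* ... normal adaptive specialization would go here ... */
#endif
}
```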
These two commits from Sam's nogil-3.12 branch can serve as inspiration:
There are two primary concerns to balance while implementing this functionality on `main`:

Has this already been discussed elsewhere?
I have already discussed this feature proposal on Discourse
Links to previous discussion of this feature:
Specialization Families
Tasks
Linked PRs
* `BINARY_OP` #123926
* `LOAD_GLOBAL` specializations to avoid reloading {globals, builtins} keys #124953
* `UNPACK_SEQUENCE` #126600
* `LOAD_GLOBAL` in free-threaded builds #126607
* `TO_BOOL` #126616
* `CALL` instructions in free-threaded builds #127123
* `LOAD_SUPER_ATTR` in free-threaded builds #127128
* `specialize` #127167
* `STORE_SUBSCR` #127169