Restore useful lists of types of opcodes in opcode.py #40
Conversation
The line number calculation in libpython.py did not properly handle negative (signed) line table deltas.
I think there is a race condition in runtest_mp: if a worker has already pushed its final output to the queue but is still "alive", the main thread waits forever on the queue. This might fix that issue, but there's still a delay of up to ~30 seconds. See https://bugs.python.org/issue46205
This adds _PY_LIKELY, _PY_UNLIKELY, _Py_ALWAYS_INLINE, and macros to test for SSE2 and NEON support.
Some of the subinterpreter tests run code using a PyThreadState from the "wrong" native thread. This is going to be difficult to support and we may want to instead create a "correct" PyThreadState for the active native thread. For now, just disable the tests.
This adds a "status" field to each PyThreadState. The GC status will be useful for implementing stop-the-world garbage collection.
This adds a recursive lock that will be used for locking I/O streams. The lock performs a direct handoff if the waiting thread has been paused for at least 1 ms; otherwise, the lock supports barging. The design is inspired by WTF locks (WebKit) and locking in Go. See https://webkit.org/blog/6161/locking-in-webkit/ The "parking lot" implementation will likely need to be improved before release. Currently, it uses a fixed-size hashtable (251 buckets, taken from Go) and the same linked list for waiters and collisions. Note that Go uses a treap (a randomized binary search tree) for collisions, while WebKit resizes the hashtable to ensure 3x as many buckets as threads.
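To make the handoff-vs-barging policy concrete, here is a toy Python model. The 1 ms threshold comes from the description above; everything else (the names, the structure, the lack of recursion) is simplified for illustration:

```
import threading
import time
from collections import deque

HANDOFF_THRESHOLD = 0.001  # direct handoff after 1 ms of waiting

class BargingLock:
    def __init__(self):
        self._mutex = threading.Lock()  # guards the fields below
        self._locked = False
        self._waiters = deque()         # (event, enqueue_time) pairs

    def acquire(self):
        while True:
            with self._mutex:
                if not self._locked:
                    self._locked = True  # barging: take a free lock
                    return
                event = threading.Event()
                self._waiters.append((event, time.monotonic()))
            event.wait()
            if getattr(event, "handoff", False):
                return  # the releaser transferred ownership to us directly
            # Otherwise we were merely woken; loop and compete again.

    def release(self):
        with self._mutex:
            if self._waiters:
                event, start = self._waiters.popleft()
                if time.monotonic() - start >= HANDOFF_THRESHOLD:
                    event.handoff = True  # handoff: the lock stays held
                    event.set()
                    return
                self._locked = False      # barging: unlock, then wake
                event.set()
            else:
                self._locked = False
```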
Critical sections are helpers to replace the global interpreter lock with finer-grained locking. They provide similar guarantees to the GIL and avoid the deadlock risk that plain locking involves. Critical sections are implicitly ended whenever the GIL would be released, and resumed when the GIL would be acquired. Nested critical sections behave as if they're interleaved.
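A loose Python model of these semantics (the names and the per-thread stack are illustrative; the real mechanism lives inside the C interpreter and handles nesting more carefully than this sketch):

```
import threading

_active = threading.local()  # per-thread stack of held critical sections

def _stack():
    if not hasattr(_active, "stack"):
        _active.stack = []
    return _active.stack

class CriticalSection:
    def __init__(self, lock):
        self.lock = lock

    def __enter__(self):
        self.lock.acquire()
        _stack().append(self)
        return self

    def __exit__(self, *exc):
        _stack().remove(self)
        self.lock.release()

def suspend_all():
    # Model of "implicitly ended whenever the GIL would be released".
    for cs in _stack():
        cs.lock.release()

def resume_all():
    # Model of "resumed when the GIL would be acquired".
    for cs in _stack():
        cs.lock.acquire()
```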
This changes SOABI and sys.implementation.name to "nogil" to avoid installing incompatible binary wheels. It also defines the macros Py_NOGIL and PY_NOGIL in patchlevel.h to identify nogil builds.
The GIL is controlled by the environment variable PYTHONGIL or the flag "-X nogil". The GIL is enabled by default. The interpreter will currently crash when running multi-threaded programs without the GIL.
The eval_breaker variable is used as a signal to break out of the interpreter loop to handle signals, raise asynchronous exceptions, stop for GC, or release the GIL to let another thread run. We will be able to have multiple active threads running the interpreter loop, so it's useful to move eval_breaker into per-thread state so that notifications can target a specific thread. The individual signals are combined as bits in eval_breaker to simplify atomic updates.
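Illustratively, each reason gets its own bit in one per-thread word, so a single atomic operation can set or consume all of them (the bit names below are invented for this sketch; the real code uses C atomics):

```
PENDING_SIGNALS = 1 << 0   # illustrative bit assignments
ASYNC_EXCEPTION = 1 << 1
GC_STOP_REQUEST = 1 << 2
DROP_GIL        = 1 << 3

class ThreadStateModel:
    def __init__(self):
        self.eval_breaker = 0          # one word, updated atomically in C

    def notify(self, bit):
        self.eval_breaker |= bit       # atomic fetch-or in the real code

    def take_pending(self):
        bits, self.eval_breaker = self.eval_breaker, 0  # atomic exchange
        return bits

ts = ThreadStateModel()
ts.notify(GC_STOP_REQUEST)
if ts.eval_breaker:                    # one cheap check in the hot loop
    pending = ts.take_pending()
```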
There is a deadlock hazard when a thread acquires a lock in a finalizer that may also be held during a stop-the-world GC. Typical GC implementations avoid this by running finalizers in a separate thread while other threads are resumed. This is awkward in Python because some code and tests expect finalizers to have finished by the time gc.collect() finishes. In CPython with the GIL, the issue can be mitigated with an RLock because the GC doesn't actually stop other threads -- other threads can acquire the GIL and run during many parts of a GC cycle. For ease of implementation, we add a private API, _thread.CriticalLock, which prevents the current thread from parking while the lock is held. This avoids the deadlock by ensuring that the lock is released before the GC is run.
This uses _thread.CriticalLock to prevent garbage collection while modifying the finalizer registry and in the resource tracker. There is a potential deadlock when checking if the resource tracker is running. This exists in upstream CPython, but occurs much more frequently with biased reference counting: the call to _check_alive can trigger a garbage collection that collects a multiprocessing lock, and the finalizer for that lock calls back into the resource tracker, leading to a deadlock. This is rare in upstream CPython because most function calls do not trigger a garbage collection.
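A hedged sketch of how a finalizer-registry update might use the lock; the acquire()/release() method names are an assumption about this private API, the registry shown is hypothetical, and this only runs on the nogil fork:

```
import _thread  # nogil fork only: CriticalLock is a private API there

_finalizer_registry = {}
_registry_lock = _thread.CriticalLock()  # assumed acquire()/release() API

def register_finalizer(key, callback):
    _registry_lock.acquire()
    try:
        # A stop-the-world GC cannot begin while this thread holds the
        # lock, so no finalizer can observe a half-updated registry.
        _finalizer_registry[key] = callback
    finally:
        _registry_lock.release()
```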
When Py_REF_DEBUG is defined, Python aggregates the number of refcounting changes. This adds a per-thread counter and only aggregates across threads when _Py_GetRefTotal() is called. Threads also add their local counters to the global counter when they exit.
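Modeled in Python (the real counters are plain C integers; a real _Py_GetRefTotal() also walks every live thread's counter, which this toy omits):

```
import threading

_exited_total = 0              # merged counts from threads that exited
_merge_lock = threading.Lock()
_local = threading.local()     # per-thread counter: no atomics needed

def count_ref_change(delta):
    _local.total = getattr(_local, "total", 0) + delta  # uncontended

def merge_on_thread_exit():
    global _exited_total
    with _merge_lock:
        _exited_total += getattr(_local, "total", 0)
        _local.total = 0
```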
pyperformance: 9.96% regression. About 3 percentage points of the slowdown appears to be in the _Py_MergeZeroRefcount function. If this is rewritten to always avoid atomic operations, then the regression is only 6.7%. We should be able to use the fast-path merge & dealloc for more types once we start using mimalloc and remove object caches.
This scheme will be used to allow safe reads from dicts and lists without acquiring the collection's lock.
pyperformance: 13.9% regression
WrapperTest.test_create_at_shutdown_with_encoding fails with something like:

```
Exception ignored in: <function C.__del__ at 0x101895ae0>
Traceback (most recent call last):
  File "x.py", line 10, in __del__
  File "/Users/sgross/Projects/nogil/Lib/_pyio.py", line 2030, in __init__
  File "/Users/sgross/Projects/nogil/Lib/_pyio.py", line 1021, in seekable
ValueError: I/O operation on closed file.
```

The problem is that the GC calls the finalizer on `self.buf` before the `__del__` on `class C`. I don't think Python guarantees any ordering on destructors.
Modified src/alloc.c to remove the include of alloc-override.c.

Did not include the following files:
- include/mimalloc-new-delete.h
- include/mimalloc-override.h
- src/alloc-override-osx.c
- src/alloc-override.c
- src/static.c
- remove debug spam for freeing large allocations
- use the same byte (0xDD) for freed allocations in CPython and mimalloc

This is important for the test_capi debug memory tests.
These are changes to support separate heaps for Python objects, Python objects with a GC header, and non-Python objects.
Python threads have complicated lifecycles. After fork(), only one thread remains alive. In that case, we want to abandon the heaps of the threads that didn't survive fork() so that we can still find all GC objects after we delete those dead PyThreadState structures. There's a similar issue with daemon threads: Python deletes the PyThreadState for daemon threads potentially before (or concurrently with) thread exit.
The optimistic lock-free reads will require preserving some data across malloc/free calls. For refcounted Python objects, the refcount field must remain valid. For dict keys the entire object must remain valid, other than the first word (dk_usable). This adds a field debug_offset that is the offset from the start of the block where the allocator is allowed to fill dead objects with 0xDD or 0xD0. The value -1 signifies that the entire block must remain valid, other than the first word, which is used to store the free list.
This changes mimalloc to tag pages for PyObjects with the current qsbr "goal". These pages can't be freed or used in a different heap until after the qsbr shared read counter reaches the goal.
This adds a recursive mutex to each LRU cache. The mutex is held while the cache is updated, but released when calling the wrapped function. Note that this can still potentially lead to deadlocks that did not exist with the GIL, but I think only in rather esoteric cases.
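The shape of the change, sketched in Python (the real cache lives in C inside functools; this only shows the lock being dropped around the user function):

```
import threading

def locked_cache(func):
    # Minimal model: the lock guards the cache, not the wrapped call.
    cache = {}
    lock = threading.RLock()           # recursive, like the real one

    def wrapper(*args):
        with lock:                     # held only while reading the cache
            if args in cache:
                return cache[args]
        result = func(*args)           # released around the call, so a
                                       # re-entrant call can't deadlock
        with lock:                     # held only while writing the cache
            cache.setdefault(args, result)
            return cache[args]
    return wrapper
```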
This adds support for protecting functions wrapped by argument clinic with recursive mutexes.
The co_zombieframe is expected to be NULL when running without the GIL. The test_threads function checks that at least one of the running threads is blocked on the GIL and that this shows up in the GDB backtrace. Of course, this is no longer the case when running without the GIL.
This adds a mutex to every deque object. It's acquired around most operations (see the sketch below). Some remaining functions are not yet thread-safe:
- deque.copy
- deque.__init__
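A minimal Python model of the per-object mutex pattern (the real change is in the C deque implementation; the class here is purely illustrative):

```
import threading
from collections import deque

class LockedDeque:
    # Every operation takes the per-object mutex described above.
    def __init__(self, iterable=()):
        self._lock = threading.Lock()
        self._items = deque(iterable)

    def append(self, item):
        with self._lock:
            self._items.append(item)

    def popleft(self):
        with self._lock:
            return self._items.popleft()

    def __len__(self):
        with self._lock:
            return len(self._items)
```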
PyImport_ImportModuleLevelObject previously used the attribute module.__spec__._initializing to check whether a module is currently being initialized. This access could race with modifications to the module's dict. Instead, use the new md_initialized field on PyModule to check whether the module is already initialized. This can mean that a duck-typed module doesn't get the benefit of the fast path, but I think the change otherwise preserves the important behavior.
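The old check, modeled at the Python level, next to a stand-in for the new one (md_initialized is a C struct field; the _md_initialized attribute below is a hypothetical stand-in for it):

```
# Old fast path: chases __spec__._initializing through the module's
# dict, which can race with concurrent dict mutation without the GIL.
def old_is_initializing(module):
    spec = getattr(module, "__spec__", None)
    return bool(spec and getattr(spec, "_initializing", False))

# New check, modeled: a flag on the module object itself
# (md_initialized in C), read without touching the module's dict.
def new_is_initialized(module):
    return getattr(module, "_md_initialized", False)
```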
This fixes a hang in test_multiprocessing when running without the GIL, caused by lost updates to the count in the semaphore.
The module name is copied to the __module__ attribute of functions, including closures. Make the name interned (and immortalized) to avoid reference-counting bottlenecks on it.
The tracing references feature is not thread-safe without the GIL. I think it's possible to get the same functionality in a thread-safe way by traversing the mimalloc heaps (like we do in the garbage collector), but that's not currently implemented. For now the "--with-trace-refs" argument is ignored by configure with a warning about unrecognized options. Fixes colesbury#3
Heap type objects are a common source of reference count contention because every created instance increments the reference count of its type. The existing deferred reference count scheme isn't sufficient because instances are not necessarily visible to the garbage collector. This stores most reference count modifications in a per-thread array with an entry for each heap type object. Each heap type object has a new field 'tp_typeid' which is used to index the thread-local array. Thread-local reference counts are merged into the type's own reference count field during garbage collection and before the thread exits.
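A Python model of the bookkeeping (in C, tp_typeid indexes a flat per-thread array; here a dict stands in for it, and types_by_id with its refcount attribute is a hypothetical stand-in for the merge step):

```
import threading

_tls = threading.local()

def local_incref(typeid):
    # Fast path: bump this thread's private counter for the type.
    counts = getattr(_tls, "type_refcounts", None)
    if counts is None:
        counts = _tls.type_refcounts = {}  # a flat per-thread array in C
    counts[typeid] = counts.get(typeid, 0) + 1

def merge_local_counts(types_by_id):
    # Run during GC and at thread exit: fold local counts into each type.
    for typeid, count in getattr(_tls, "type_refcounts", {}).items():
        types_by_id[typeid].refcount += count
    _tls.type_refcounts = {}
```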
Changes "test_cancel_futures_wait_false" to pause a tiny bit before calling "shutdown" with cancel_futures=True. Otherwise, the work item may be cancelled before it is started. In practice, this only happens when the GIL is disabled. With the GIL on, the thread grabs the work item before t.shutdown() is called due to the 5 ms GIL switch interval. The importlib frozen module is regenerated because of the switch of some strings to interned.
The modified pip package looks in a custom package index URL for compatible nogil Python packages. See https://github.com/colesbury/pip/tree/21.3.1-nogil for modifications
Hi @lgray -- thanks for the bug report and PR. I'll put my modifications to numba up on GitHub in case anyone wants to try their hand at getting it to work. I'll also take a look at
@colesbury thanks for the reply! Awkward-array compiles from the tarball out of the box and seems to operate correctly, FWIW. If you're willing to host a wheel that'd be great, but it doesn't appear to need any additional work! I'll leave numba alone, since it's much deeper than what I can meaningfully contribute. That's kind of annoying, since numba is a bit buried in our code (numpy-like array manipulation and recursion occasionally don't play nice...), and it would be useful from our end to demonstrate a fully realistic example of physics analysis running multi-threaded as opposed to multi-processing. We'll have to wait (or convince someone from the numba team to take a look; we could see about that). There are some reduced examples we can do that'll make some of the same points, though. If we can show that we can change our python analysis to have shared threaded histogram filling and such, then I'd be willing to try to make a push for support from the data-science community around how useful the ideas in this POC fork are.
Awkward array includes pybind11 (as a git submodule instead of a pip dependency). pybind11 isn't thread-safe without the GIL unless patched, so I expect awkward array to have the same issues; for example, this program crashes when run without the GIL. If you only use awkward array from a single thread, I don't expect there to be a problem, but if you use it from multiple threads I'd expect issues. I'll build an awkward array wheel with the patched pybind11.
Yes - on further testing we ran into problems (at larger numbers of threads) as well. There's also boost-histogram, which also uses pybind11 and compiles fine from the tarball; we'd be interested in looking at that too.
@bendavid @jpivarski FYI
@lgray I've uploaded wheels for

Here are the scripts I used to minimally test the multithreaded behavior:

Here are the patched projects (just replaced pybind11 submodules):
@colesbury on another point - we found that the gdb in the image and Dockerfile you supplied is broken (it compiles against vanilla python and segfaults on execution). I've put together a Dockerfile that builds its own gdb. Happy to toss it your way.
@lgray Thanks, I'd appreciate that!
Following up on #39.
Unless really required, I am not going to follow the python pull request format, given that this is a fork.
Goal: I am trying to get numba to work with nogil because of a software project (a high-energy physics data analysis framework) that I would like to benchmark using nogil threading instead of multi-processing. We use numba for a variety of bottom-line optimizations, since our data is weird compared to most out there. Notably, we do not use numpy as our base data representation but awkward-array. It would be cool if wheels could be hosted for that too, but it's not necessary!
The problem: numba makes use of convenience functions in the `dis` module in python to deal with various kinds of ops. I tried to label the new set of opcodes in nogil as closely as possible to native cpython. This fixed the first problem I ran into, where numba would complain that `hasjrel` was missing. The problem I run into now is that in nogil there is no `NOP` operation defined in `opcode.py`, which numba requires to be defined. I am not sure how to proceed, since this is not even close to my area of expertise and not having no-op defined seems weird, but it sounds surmountable (perhaps even easily so) with the appropriate knowledge.

The edits I've put in are fairly minor, but they appear necessary since they restore module functionality that went missing due to the nogil changes, functionality that numba relies on.
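For reference, the stdlib pieces numba expects here, using the 3.9-era opcode/dis API that the nogil fork is based on:

```
import dis
import opcode

print('NOP' in opcode.opmap)             # True once NOP is defined again
print(opcode.opmap['NOP'])               # the numeric opcode for NOP
print(sorted(opcode.hasjrel))            # opcodes that take relative jumps
print(dis.opname[opcode.opmap['NOP']])   # -> 'NOP'
```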
If this is a fool's errand and numba requires too many details of vanilla cpython - just let me know and I'll drop it. :)