
Restore useful lists of types of opcodes in opcode.py #40

Closed
wants to merge 79 commits

Conversation


@lgray lgray commented Mar 9, 2022

Following up on #39.

Unless really required, I am not going to follow the Python pull-request format, given that this is a fork.

Goal: I am trying to get numba to work with nogil because of a software project, a high-energy physics data-analysis framework, that I would like to benchmark using nogil threading instead of multiprocessing. We use numba for a variety of bottom-line optimizations, since our data is unusual compared to most out there. Notably, we do not use numpy as our base data representation but awkward-array. It would be cool if wheels could be hosted for that too, but it's not necessary!

The problem: numba makes use of convenience functions in Python's dis module to deal with various kinds of opcodes. I tried to label the new set of opcodes in nogil as closely as possible to native CPython. This fixed the first problem I ran into, where numba would complain that hasjrel was missing. The problem I run into now is that nogil's opcode.py does not define a NOP operation, which numba requires. I am not sure how to proceed, since this is not even close to my area of expertise, and not having NOP defined seems strange, but it sounds surmountable (perhaps even easily so) with the appropriate knowledge.
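For reference, a quick sanity check of the pieces numba leans on, written against the stock CPython opcode module (attribute availability varies across versions, and in the nogil fork some of these were missing, which is what this PR restores):

```python
import opcode

# numba's bytecode analysis expects the stock classification data:
# the opcode map itself plus lists like hasjrel/hasname/haslocal.
# On vanilla CPython these checks pass; in the nogil fork the NOP
# entry (and some of the lists) had gone missing.
assert "NOP" in opcode.opmap      # NOP must be a defined opcode
assert len(opcode.opname) == 256  # one printable name per opcode slot
```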

The edits I've put in are fairly minor, but they appear necessary: they restore module functionality that went missing with the nogil changes and that numba relies upon.

If this is a fool's errand and numba requires too many details of vanilla CPython, just let me know and I'll drop it. :)

The line number calculation in libpython.py did not properly handle
negative (signed) line table deltas.
I think there is a race condition in runtest_mp: if a worker
has already pushed its final output to the queue, but is still
"alive", then the main thread waits forever on the queue.
This might fix that issue, but there's still a delay of up
to ~30 secs.

See https://bugs.python.org/issue46205
This adds _PY_LIKELY, _PY_UNLIKELY, _Py_ALWAYS_INLINE, and macros to
test for SSE2 and NEON support.
Some of the subinterpreter tests run code using a PyThreadState from the
"wrong" native thread. This is going to be difficult to support and we
may want to instead create a "correct" PyThreadState for the active
native thread. For now, just disable the tests.
This adds a "status" field to each PyThreadState. The GC status will be
useful for implementing stop-the-world garbage collection.
This adds a recursive lock that will be used for locking I/O streams.
The lock performs a direct handoff if the waiting thread has been paused
for at least 1 ms. Otherwise, the lock supports barging.

The design is inspired by WTF locks (WebKit) and locking in Go.
See https://webkit.org/blog/6161/locking-in-webkit/

The "parking lot" implementation will likely need to be improved before
release. Currently, it uses a fixed-size hashtable (251 taken from Go)
and the same linked-list for waiters and collisions. Note that Go uses
a treap (random binary search tree) for collisions, while WebKit resizes
the hashtable to ensure 3x as many buckets as threads.
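As a toy sketch of the "parking lot" idea (names and structure invented for illustration, not the actual C implementation): threads waiting on any lock are parked in a bucket chosen by hashing the lock's identity, so the lock itself only needs a small status word, and the same per-bucket list holds waiters for every lock that hashes there.

```python
import threading

NBUCKETS = 251  # fixed-size table; 251 is the constant Go uses

class ParkingLot:
    def __init__(self):
        # each bucket: (bucket mutex, list of (key, event) waiters)
        self.buckets = [(threading.Lock(), []) for _ in range(NBUCKETS)]

    def _bucket(self, key):
        # hash the lock's address to pick a bucket
        return self.buckets[id(key) % NBUCKETS]

    def park(self, key):
        """Block the calling thread until someone unparks `key`."""
        mtx, waiters = self._bucket(key)
        ev = threading.Event()
        with mtx:
            waiters.append((key, ev))
        ev.wait()

    def unpark_one(self, key):
        """Wake one thread parked on `key`; True if one was found."""
        mtx, waiters = self._bucket(key)
        with mtx:
            for i, (k, ev) in enumerate(waiters):
                if k is key:
                    del waiters[i]
                    ev.set()
                    return True
        return False

lot = ParkingLot()
```

The collision handling here mirrors the current state described above (one linked list per bucket for both waiters and hash collisions), which is exactly the part the commit message says may need improving.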
Critical sections are helpers to replace the global interpreter lock
with finer grained locking. They provide similar guarantees to the GIL
and avoid the deadlock risk that plain locking involves. Critical
sections are implicitly ended whenever the GIL would be released. They
are resumed when the GIL would be acquired. Nested critical sections
behave as-if they're interleaved.
This changes SOABI and sys.implementation.name to "nogil" to avoid
installing incompatible binary wheels. It also defines the macros
Py_NOGIL and PY_NOGIL in patchlevel.h to identify nogil builds.
The GIL is controlled by the environment variable PYTHONGIL or the flag
"-X nogil". The GIL is enabled by default. The interpreter will
currently crash when running multi-threaded programs without the GIL.
The eval_breaker variable is used as a signal to break out of the
interpreter loop to handle signals, asynchronous exceptions, stop for
GC, or release the GIL to let another thread run.

We will be able to have multiple active threads running the interpreter
loop so it's useful to move eval_breaker to per-thread state so that
notifications can target a specific thread.

The specific signals are combined as bits in eval_breaker to simplify
atomic updates.
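A minimal sketch of the bit-packing scheme (the flag names below are made up for the example, not CPython's actual identifiers):

```python
# Distinct wake-up reasons packed as bits in one word, so a single
# read in the eval loop can test "anything pending?".
GIL_DROP_REQUEST = 1 << 0
SIGNALS_PENDING  = 1 << 1
ASYNC_EXCEPTION  = 1 << 2
GC_SCHEDULED     = 1 << 3

eval_breaker = 0  # per-thread state in the actual change

def notify(bit):
    """Set one reason bit; in C this is an atomic fetch-or
    targeting a specific thread's eval_breaker."""
    global eval_breaker
    eval_breaker |= bit

def take_pending():
    """The eval loop checks eval_breaker != 0 on dispatch; when set,
    it consumes the word and handles each reason bit."""
    global eval_breaker
    pending, eval_breaker = eval_breaker, 0
    return pending

notify(SIGNALS_PENDING)
notify(GC_SCHEDULED)
assert take_pending() == SIGNALS_PENDING | GC_SCHEDULED
```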
There is a deadlock hazard when a thread acquires a lock in a finalizer
that may also be held during a stop-the-world GC. Typical GC
implementations avoid this by running finalizers in separate thread
while other threads are resumed. This is awkward in Python because
some code and tests expect finalizers to have finished by the time
gc.collect() finishes.

In CPython with the GIL, the issue can be mitigated with an RLock because
the GC doesn't actually stop other threads -- other threads can acquire
the GIL and run during many parts of a GC cycle.

For ease of implementation, we add a private API _thread.CriticalLock,
which prevents the current thread from parking while the lock is held.
This prevents deadlock by ensuring that the lock is released before
the GC is run.
This uses _thread.CriticalLock to prevent garbage collection while
modifying the finalizer registry and in the resource tracker.

There is a potential deadlock when checking if the resource tracker is
running. This exists in upstream CPython, but occurs much more frequently
with biased reference counting. The call to _check_alive can trigger a
garbage collection that collects a multiprocessing lock. The finalizer
for that lock calls back into resource tracker leading to a deadlock.

This is rare in upstream CPython because most function calls do not
trigger a garbage collection.
When Py_REF_DEBUG is defined, Python aggregates the number of
refcounting changes. This adds a per-thread counter and only aggregates
across threads when _Py_GetRefTotal() is called. Threads also add their
local counters to the global counter when they exit.
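The per-thread counting scheme can be sketched in Python (all names here are illustrative, not the actual C identifiers):

```python
import threading

# Per-thread refcount-delta counters folded into a global total.
_global_total = 0
_global_lock = threading.Lock()
_local = threading.local()

def add_ref_delta(delta):
    # uncontended fast path: touches only this thread's counter
    _local.count = getattr(_local, "count", 0) + delta

def merge_local_counter():
    # called when a thread exits (and when the total is requested)
    global _global_total
    with _global_lock:
        _global_total += getattr(_local, "count", 0)
    _local.count = 0

def get_ref_total():
    # the real _Py_GetRefTotal() also sums live threads' counters;
    # here we only report what has been merged
    return _global_total

def worker():
    for _ in range(1000):
        add_ref_delta(+1)
    merge_local_counter()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert get_ref_total() == 4000
```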
pyperformance: 9.96% regression

About 3% points of the slow-down appears to be in the
_Py_MergeZeroRefcount function. If this is rewritten to always avoid
atomic operations, the regression is only 6.7%.

We should be able to use the fast-path merge & dealloc for more types
once we start using mimalloc and remove object caches.
The scheme will be used to allow safe reads from dicts and lists that
don't acquire the collection's lock.
pyperformance: 13.9% regression
WrapperTest.test_create_at_shutdown_with_encoding fails with something
like:

```
Exception ignored in: <function C.__del__ at 0x101895ae0>
Traceback (most recent call last):
  File "x.py", line 10, in __del__
  File "/Users/sgross/Projects/nogil/Lib/_pyio.py", line 2030, in __init__
  File "/Users/sgross/Projects/nogil/Lib/_pyio.py", line 1021, in seekable
ValueError: I/O operation on closed file.
```

The problem is that the GC calls the finalizer on `self.buf` before the
`__del__` on `class C`. I don't think Python guarantees any ordering on
destructors.
Modified src/alloc.c to remove the include of alloc-override.c.
Did not include the following files:

 - include/mimalloc-new-delete.h
 - include/mimalloc-override.h
 - src/alloc-override-osx.c
 - src/alloc-override.c
 - src/static.c

Other changes:

 - remove debug spam for freeing large allocations
 - use the same byte pattern (0xDD) for freed allocations in CPython and
   mimalloc; this is important for the test_capi debug memory tests
These are changes to support separate heaps for Python objects,
Python objects with a GC header, and non-Python objects.
Python threads have complicated lifecycles. After fork(), only one
thread remains alive. In that case, we want to abandon the heaps of the
threads that didn't survive fork() so that we can still find all GC
objects after we delete those dead PyThreadState.

There's a similar issue with daemon threads. Python deletes the
PyThreadState for daemon threads potentially before (or concurrently)
with thread exit.
The optimistic lock-free reads will require preserving some data across
malloc/free calls. For refcounted Python objects, the refcount field
must remain valid. For dict keys the entire object must remain valid,
other than the first word (dk_usable).

This adds a field debug_offset that is the offset from the start of
the block where the allocator is allowed to fill dead objects with
0xDD or 0xD0. The value -1 signifies that the entire block must remain
valid, other than the first word, which is used to store the free list.
This changes mimalloc to tag pages for PyObjects with the current qsbr
"goal". These pages can't be freed or used in a different heap until
after the qsbr shared read counter reaches the goal.
colesbury and others added 20 commits December 30, 2021 13:06
This adds a recursive mutex to each LRU cache. The mutex is held
while the cache is updated, but released when calling the annotated
function. Note that this still potentially can lead to deadlocks that
did not exist with the GIL, but I think only in rather esoteric cases.
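A minimal sketch of that locking scheme, with invented helper names: the RLock guards cache lookups and updates, but is released while the wrapped function runs, so a slow or reentrant call cannot block other cache users (at the cost of occasionally computing the same entry twice).

```python
import functools
import threading

def locked_cache(fn):
    cache = {}
    lock = threading.RLock()

    @functools.wraps(fn)
    def wrapper(*args):
        with lock:                    # lock held only for the lookup
            if args in cache:
                return cache[args]
        result = fn(*args)            # lock NOT held during the call
        with lock:                    # ...and only briefly for the update
            cache.setdefault(args, result)
        return result

    return wrapper

@locked_cache
def square(x):
    return x * x

assert square(7) == 49
```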
This adds support for protecting functions wrapped by argument clinic
with recursive mutexes.
The co_zombieframe is expected to be NULL when running without the GIL.

The test_threads function checks that at least one of the running
threads is blocked on the GIL and that the behavior is displayed in the
GDB backtrace. Of course, this is no longer the case when running
without the GIL.
This adds a mutex to every deque object. It's acquired around
most operations. Some remaining functions are not yet
thread-safe:

 - deque.copy
 - deque.__init__
PyImport_ImportModuleLevelObject previously used the variable
module.__spec__._initializing to check if a module is currently being
initialized. This access could race with modifications to the module's
dict.

Instead, use the new md_initialized field on PyModule to check if the
module is already initialized. This can mean that a duck-typed module
doesn't get the benefit of the fast-path, but I think other than that
the change should still preserve the important behavior.
This fixes a hang in test_multiprocessing when running without the GIL
due to lost updates to the count in the semaphore.
The module name is copied to the "__module__" attribute of functions,
including closures. Make the name interned (and immortalized) to avoid
bottlenecks on reference counting it.
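Interning deduplicates equal strings into one canonical object, so every function's __module__ can point at the same string and identity checks can replace character-by-character comparisons (the string below is just an example):

```python
import sys

a = sys.intern("pkg.mod")
b = sys.intern("pkg." + "mod")  # built at runtime, then canonicalized
assert a is b                   # one shared object, not two copies
```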
The tracing references feature is not thread-safe without the GIL. I
think it's possible to get the same functionality in a thread-safe way
by traversing the mimalloc heaps (like we do in the garbage collector),
but that's not currently implemented.

For now the "--with-trace-refs" argument is ignored by configure with a
warning about unrecognized options.

Fixes colesbury#3
Heap type objects are a common source of reference count contention
because every created instance increments the reference count of its
type. The existing deferred reference count scheme isn't sufficient
because instances are not necessarily visible to the garbage collector.

This stores most reference count modifications in a per-thread array
with an entry for each heap type object. Each heap type object has a
new field 'tp_typeid' which is used to index the thread-local array.
Thread-local reference counts are merged into the type's own reference
count field during garbage collection and before the thread exits.
Changes "test_cancel_futures_wait_false" to pause a tiny bit before
calling "shutdown" with cancel_futures=True. Otherwise, the work item
may be cancelled before it is started. In practice, this only happens
when the GIL is disabled. With the GIL on, the thread grabs the work
item before t.shutdown() is called due to the 5 ms GIL switch interval.

The importlib frozen module is regenerated because of the switch of some
strings to interned.
The modified pip package looks in a custom package index URL for
compatible nogil Python packages.

See https://github.com/colesbury/pip/tree/21.3.1-nogil for modifications
@colesbury
Owner

colesbury commented Mar 9, 2022

Hi @lgray -- thanks for the bug report and PR. numba requires a lot more changes than the missing hasjrel. I spent a week in January trying to get numba to work with some progress, but ultimately unsuccessfully. I'm pretty sure it's solvable, but I'm not sure how long it would take.

I'll put my modifications to numba up on GitHub in case anyone wants to try their hand at getting it to work. I'll also take a look at awkward-array.

@lgray
Author

lgray commented Mar 9, 2022

@colesbury thanks for the reply!

Awkward-array compiles from the tarball out of the box and seems to operate correctly, FWIW. If you're willing to host a wheel, that'd be great, but it doesn't appear to need any additional work!

I'll leave numba alone since it's much deeper than what I can meaningfully contribute. Kind of annoying, since numba is a bit buried in our code (numpy-like array manipulation and recursion occasionally don't play nice...); it would be useful from our end to demonstrate a fully realistic example of a physics analysis running multi-threaded as opposed to multi-process. We'll have to wait (or convince someone from the numba team to take a look; we could see about that).

There are some reduced examples we can do that'll make some of the same points, though. If we can show that we can change our Python analysis to do shared, threaded histogram filling and such, then I'd be willing to try to make a push for support from the data-science community by showing how useful the ideas in this POC fork are.

@colesbury
Owner

> Awkward-array compiles from the tarball out of the box and seems to operate correctly FWIW. If you're willing to host a wheel that'd be great but it doesn't appear to need any additional work!

Awkward array includes pybind11 (as a git submodule rather than a pip dependency). pybind11 isn't thread-safe without the GIL unless patched, so I expect awkward array to have the same issues. For example, this program crashes when run without the GIL. If you only use awkward array from a single thread, I don't expect there to be a problem, but if you use it from multiple threads I'd expect issues.

I'll build an awkward array wheel with the patched pybind11.

@lgray
Author

lgray commented Mar 10, 2022

Yes - on further testing we ran into problems (at a larger number of threads) as well. We'd also be interested in looking at boost-histogram, which also uses pybind11 and compiles fine from the tarball.

@lgray
Author

lgray commented Mar 10, 2022

@bendavid @jpivarski FYI

@colesbury
Owner

@lgray I've uploaded wheels for awkward (1.7.0) and boost-histogram (1.3.1) with the patched pybind11. You should be able to install them as usual through pip after you've uninstalled any locally built versions of those packages.

Here are the scripts I used to minimally test the multithreaded behavior:

Here are the patched projects (just replaced pybind11 submodules):

@lgray
Author

lgray commented Mar 10, 2022

@colesbury on another point - we found that the gdb in the image and Dockerfile you supplied is broken (it compiles against vanilla Python and segfaults on execution). I've put together a Dockerfile that builds its own gdb. Happy to toss it your way.

@colesbury
Owner

@lgray Thanks, I'd appreciate that!
