
Restore useful lists of types of opcodes in opcode.py #40

Closed
wants to merge 79 commits

Conversation


@lgray lgray commented Mar 9, 2022

Following up on #39.

Unless really required, I am not going to follow the Python pull-request format, given that this is a fork.

Goal: I am trying to get numba to work with nogil because of a software project, a high-energy physics data-analysis framework, that I would like to benchmark using nogil threading instead of multiprocessing. We use numba for a variety of bottom-line optimizations, since our data is unusual compared to most out there. Notably, we do not use numpy as our base data representation but awkward-array. It would be cool if wheels could be hosted for that too, but it's not necessary!

The problem: numba makes use of convenience functions in Python's dis module to deal with various kinds of opcodes. I tried to label the new set of opcodes in nogil as closely as possible to native CPython. This fixed the first problem I ran into, where numba would complain that hasjrel was missing. The problem I run into now is that nogil's opcode.py does not define a NOP operation, which numba requires. I am not sure how to proceed, since this is not even close to my area of expertise, and not having NOP defined seems strange, but it sounds surmountable (perhaps even easily so) with the appropriate knowledge.
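For reference, a quick sanity check of the pieces numba leans on, written against the stock CPython opcode module (attribute availability varies across versions, and in the nogil fork some of these were missing, which is what this PR restores):

```python
import opcode

# numba's bytecode analysis expects the stock classification data:
# the opcode map itself plus lists like hasjrel/hasname/haslocal.
# On vanilla CPython these checks pass; in the nogil fork the NOP
# entry (and some of the lists) had gone missing.
assert "NOP" in opcode.opmap      # NOP must be a defined opcode
assert len(opcode.opname) == 256  # one printable name per opcode slot
```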

The edits I've put in are fairly minor, but they appear necessary: they restore module functionality that went missing with the nogil changes and that numba relies upon.

If this is a fool's errand and numba requires too many details of vanilla CPython, just let me know and I'll drop it. :)

The line number calculation in libpython.py did not properly handle
negative (signed) line table deltas.
I think there is a race condition in runtest_mp: if a worker
has already pushed its final output to the queue, but is still
"alive", then the main thread waits forever on the queue.
This might fix that issue, but there's still a delay of up
to ~30 secs.

See https://bugs.python.org/issue46205
This adds _PY_LIKELY, _PY_UNLIKELY, _Py_ALWAYS_INLINE, and macros to
test for SSE2 and NEON support.
Some of the subinterpreter tests run code using a PyThreadState from the
"wrong" native thread. This is going to be difficult to support and we
may want to instead create a "correct" PyThreadState for the active
native thread. For now, just disable the tests.
This adds a "status" field to each PyThreadState. The GC status will be
useful for implementing stop-the-world garbage collection.
This adds a recursive lock that will be used for locking I/O streams.
The lock performs a direct handoff if the waiting thread has been paused
for at least 1 ms. Otherwise, the lock supports barging.

The design is inspired by WTF locks (WebKit) and locking in Go.
See https://webkit.org/blog/6161/locking-in-webkit/

The "parking lot" implementation will likely need to be improved before
release. Currently, it uses a fixed-size hashtable (251 taken from Go)
and the same linked-list for waiters and collisions. Note that Go uses
a treap (random binary search tree) for collisions, while WebKit resizes
the hashtable to ensure 3x as many buckets as threads.
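As a toy sketch of the "parking lot" idea (names and structure invented for illustration, not the actual C implementation): threads waiting on any lock are parked in a bucket chosen by hashing the lock's identity, so the lock itself only needs a small status word, and the same per-bucket list holds waiters for every lock that hashes there.

```python
import threading

NBUCKETS = 251  # fixed-size table; 251 is the constant Go uses

class ParkingLot:
    def __init__(self):
        # each bucket: (bucket mutex, list of (key, event) waiters)
        self.buckets = [(threading.Lock(), []) for _ in range(NBUCKETS)]

    def _bucket(self, key):
        # hash the lock's address to pick a bucket
        return self.buckets[id(key) % NBUCKETS]

    def park(self, key):
        """Block the calling thread until someone unparks `key`."""
        mtx, waiters = self._bucket(key)
        ev = threading.Event()
        with mtx:
            waiters.append((key, ev))
        ev.wait()

    def unpark_one(self, key):
        """Wake one thread parked on `key`; True if one was found."""
        mtx, waiters = self._bucket(key)
        with mtx:
            for i, (k, ev) in enumerate(waiters):
                if k is key:
                    del waiters[i]
                    ev.set()
                    return True
        return False

lot = ParkingLot()
```

The collision handling here mirrors the current state described above (one linked list per bucket for both waiters and hash collisions), which is exactly the part the commit message says may need improving.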
Critical sections are helpers to replace the global interpreter lock
with finer grained locking. They provide similar guarantees to the GIL
and avoid the deadlock risk that plain locking involves. Critical
sections are implicitly ended whenever the GIL would be released. They
are resumed when the GIL would be acquired. Nested critical sections
behave as-if they're interleaved.
This changes SOABI and sys.implementation.name to "nogil" to avoid
installing incompatible binary wheels. It also defines the macros
Py_NOGIL and PY_NOGIL in patchlevel.h to identify nogil builds.
The GIL is controlled by the environment variable PYTHONGIL or the flag
"-X nogil". The GIL is enabled by default. The interpreter will
currently crash when running multi-threaded programs without the GIL.
The eval_breaker variable is used as a signal to break out of the
interpreter loop to handle signals, asynchronous exceptions, stop for
GC, or release the GIL to let another thread run.

We will be able to have multiple active threads running the interpreter
loop so it's useful to move eval_breaker to per-thread state so that
notifications can target a specific thread.

The specific signals are combined as bits in eval_breaker to simplify
atomic updates.
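A minimal sketch of the bit-packing scheme (the flag names below are made up for the example, not CPython's actual identifiers):

```python
# Distinct wake-up reasons packed as bits in one word, so a single
# read in the eval loop can test "anything pending?".
GIL_DROP_REQUEST = 1 << 0
SIGNALS_PENDING  = 1 << 1
ASYNC_EXCEPTION  = 1 << 2
GC_SCHEDULED     = 1 << 3

eval_breaker = 0  # per-thread state in the actual change

def notify(bit):
    """Set one reason bit; in C this is an atomic fetch-or
    targeting a specific thread's eval_breaker."""
    global eval_breaker
    eval_breaker |= bit

def take_pending():
    """The eval loop checks eval_breaker != 0 on dispatch; when set,
    it consumes the word and handles each reason bit."""
    global eval_breaker
    pending, eval_breaker = eval_breaker, 0
    return pending

notify(SIGNALS_PENDING)
notify(GC_SCHEDULED)
assert take_pending() == SIGNALS_PENDING | GC_SCHEDULED
```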
There is a deadlock hazard when a thread acquires a lock in a finalizer
that may also be held during a stop-the-world GC. Typical GC
implementations avoid this by running finalizers in separate thread
while other threads are resumed. This is awkward in Python because
some code and tests expect finalizers to have finished by the time
gc.collect() finishes.

In CPython with the GIL, the issue can be mitigated with an RLock because
the GC doesn't actually stop other threads -- other threads can acquire
the GIL and run during many parts of a GC cycle.

For ease of implementation, we add a private API _thread.CriticalLock,
which prevents the current thread from parking while the lock is held.
This prevents deadlock by ensuring that the lock is released before
the GC is run.
This uses _thread.CriticalLock to prevent garbage collection while
modifying the finalizer registry and in the resource tracker.

There is a potential deadlock when checking if the resource tracker is
running. This exists in upstream CPython, but occurs much more frequently
with biased reference counting. The call to _check_alive can trigger a
garbage collection that collects a multiprocessing lock. The finalizer
for that lock calls back into resource tracker leading to a deadlock.

This is rare in upstream CPython because most function calls do not
trigger a garbage collection.
When Py_REF_DEBUG is defined, Python aggregates the number of
refcounting changes. This adds a per-thread counter and only aggregates
across threads when _Py_GetRefTotal() is called. Threads also add their
local counters to the global counter when they exit.
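The per-thread counting scheme can be sketched in Python (all names here are illustrative, not the actual C identifiers):

```python
import threading

# Per-thread refcount-delta counters folded into a global total.
_global_total = 0
_global_lock = threading.Lock()
_local = threading.local()

def add_ref_delta(delta):
    # uncontended fast path: touches only this thread's counter
    _local.count = getattr(_local, "count", 0) + delta

def merge_local_counter():
    # called when a thread exits (and when the total is requested)
    global _global_total
    with _global_lock:
        _global_total += getattr(_local, "count", 0)
    _local.count = 0

def get_ref_total():
    # the real _Py_GetRefTotal() also sums live threads' counters;
    # here we only report what has been merged
    return _global_total

def worker():
    for _ in range(1000):
        add_ref_delta(+1)
    merge_local_counter()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert get_ref_total() == 4000
```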
pyperformance: 9.96% regression

About 3% points of the slow-down appears to be in the
_Py_MergeZeroRefcount function. If this is rewritten to always avoid
atomic operations, the regression is only 6.7%.

We should be able to use the fast-path merge & dealloc for more types
once we start using mimalloc and remove object caches.
The scheme will be used to allow safe reads from dicts and lists that
don't acquire the collection's lock.
pyperformance: 13.9% regression
WrapperTest.test_create_at_shutdown_with_encoding fails with something
like:

```
Exception ignored in: <function C.__del__ at 0x101895ae0>
Traceback (most recent call last):
  File "x.py", line 10, in __del__
  File "/Users/sgross/Projects/nogil/Lib/_pyio.py", line 2030, in __init__
  File "/Users/sgross/Projects/nogil/Lib/_pyio.py", line 1021, in seekable
ValueError: I/O operation on closed file.
```

The problem is that the GC calls the finalizer on `self.buf` before the
`__del__` on `class C`. I don't think Python guarantees any ordering on
destructors.
Modified src/alloc.c to remove the include of alloc-override.c.
Did not include the following files:

 - include/mimalloc-new-delete.h
 - include/mimalloc-override.h
 - src/alloc-override-osx.c
 - src/alloc-override.c
 - src/static.c

Other changes:

 - remove debug spam for freeing large allocations
 - use the same byte pattern (0xDD) for freed allocations in CPython and
   mimalloc; this is important for the test_capi debug memory tests
These are changes to support separate heaps for Python objects,
Python objects with a GC header, and non-Python objects.
Python threads have complicated lifecycles. After fork(), only one
thread remains alive. In that case, we want to abandon the heaps of the
threads that didn't survive fork() so that we can still find all GC
objects after we delete those dead PyThreadState.

There's a similar issue with daemon threads. Python deletes the
PyThreadState for daemon threads potentially before (or concurrently)
with thread exit.
The optimistic lock-free reads will require preserving some data across
malloc/free calls. For refcounted Python objects, the refcount field
must remain valid. For dict keys the entire object must remain valid,
other than the first word (dk_usable).

This adds a field debug_offset that is the offset from the start of
the block where the allocator is allowed to fill dead objects with
0xDD or 0xD0. The value -1 signifies that the entire block must remain
valid, other than the first word, which is used to store the free list.
This changes mimalloc to tag pages for PyObjects with the current qsbr
"goal". These pages can't be freed or used in a different heap until
after the qsbr shared read counter reaches the goal.
colesbury and others added 20 commits December 30, 2021 13:06
This adds a recursive mutex to each LRU cache. The mutex is held
while the cache is updated, but released when calling the annotated
function. Note that this still potentially can lead to deadlocks that
did not exist with the GIL, but I think only in rather esoteric cases.
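A minimal sketch of that locking scheme, with invented helper names: the RLock guards cache lookups and updates, but is released while the wrapped function runs, so a slow or reentrant call cannot block other cache users (at the cost of occasionally computing the same entry twice).

```python
import functools
import threading

def locked_cache(fn):
    cache = {}
    lock = threading.RLock()

    @functools.wraps(fn)
    def wrapper(*args):
        with lock:                    # lock held only for the lookup
            if args in cache:
                return cache[args]
        result = fn(*args)            # lock NOT held during the call
        with lock:                    # ...and only briefly for the update
            cache.setdefault(args, result)
        return result

    return wrapper

@locked_cache
def square(x):
    return x * x

assert square(7) == 49
```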
This adds support for protecting functions wrapped by argument clinic
with recursive mutexes.
The co_zombieframe is expected to be NULL when running without the GIL.

The test_threads function checks that at least one of the running
threads is blocked on the GIL and that the behavior is displayed in the
GDB backtrace. Of course, this is no longer the case when running
without the GIL.
This adds a mutex to every deque object. It's acquired around
most operations. Some remaining functions are not yet
thread-safe:

 - deque.copy
 - deque.__init__
PyImport_ImportModuleLevelObject previously used the variable
module.__spec__._initializing to check if a module is currently being
initialized. This access could race with modifications to the module's
dict.

Instead, use the new md_initialized field on PyModule to check if the
module is already initialized. This can mean that a duck-typed module
doesn't get the benefit of the fast-path, but I think other than that
the change should still preserve the important behavior.
This fixes a hang in test_multiprocessing when running without the GIL
due to lost updates to the count in the semaphore.
The module name is copied to the "__module__" attribute of functions,
including closures. Make the name interned (and immortalized) to avoid
bottlenecks on reference counting it.
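Interning deduplicates equal strings into one canonical object, so every function's __module__ can point at the same string and identity checks can replace character-by-character comparisons (the string below is just an example):

```python
import sys

a = sys.intern("pkg.mod")
b = sys.intern("pkg." + "mod")  # built at runtime, then canonicalized
assert a is b                   # one shared object, not two copies
```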
The tracing references feature is not thread-safe without the GIL. I
think it's possible to get the same functionality in a thread-safe way
by traversing the mimalloc heaps (like we do in the garbage collector),
but that's not currently implemented.

For now the "--with-trace-refs" argument is ignored by configure with a
warning about unrecognized options.

Fixes colesbury#3
Heap type objects are a common source of reference count contention
because every created instance increments the reference count of its
type. The existing deferred reference count scheme isn't sufficient
because instances are not necessarily visible to the garbage collector.

This stores most reference count modifications in a per-thread array
with an entry for each heap type object. Each heap type object has a
new field 'tp_typeid' which is used to index the thread-local array.
Thread-local reference counts are merged into the type's own reference
count field during garbage collection and before the thread exits.
Changes "test_cancel_futures_wait_false" to pause a tiny bit before
calling "shutdown" with cancel_futures=True. Otherwise, the work item
may be cancelled before it is started. In practice, this only happens
when the GIL is disabled. With the GIL on, the thread grabs the work
item before t.shutdown() is called due to the 5 ms GIL switch interval.

The importlib frozen module is regenerated because of the switch of some
strings to interned.
The modified pip package looks in a custom package index URL for
compatible nogil Python packages.

See https://github.com/colesbury/pip/tree/21.3.1-nogil for modifications
@colesbury
Owner

colesbury commented Mar 9, 2022

Hi @lgray -- thanks for the bug report and PR. numba requires a lot more changes than the missing hasjrel. I spent a week in January trying to get numba to work with some progress, but ultimately unsuccessfully. I'm pretty sure it's solvable, but I'm not sure how long it would take.

I'll put my modifications to numba up on GitHub in case anyone wants to try their hand at getting it to work. I'll also take a look at awkward-array.

@lgray
Author

lgray commented Mar 9, 2022

@colesbury thanks for the reply!

Awkward-array compiles from the tarball out of the box and seems to operate correctly, FWIW. If you're willing to host a wheel, that'd be great, but it doesn't appear to need any additional work!

I'll leave numba alone since it's much deeper than what I can meaningfully contribute. Kind of annoying, since numba is a bit buried in our code (numpy-like array manipulation and recursion occasionally don't play nice...); it would be useful from our end to demonstrate a fully realistic example of a physics analysis running multi-threaded as opposed to multi-process. We'll have to wait (or convince someone from the numba team to take a look; we could see about that).

There are some reduced examples we can do that'll make some of the same points, though. If we can show that we can change our Python analysis to do shared, threaded histogram filling and such, then I'd be willing to try to make a push for support from the data-science community by showing how useful the ideas in this POC fork are.

@colesbury
Owner

> Awkward-array compiles from the tarball out of the box and seems to operate correctly FWIW. If you're willing to host a wheel that'd be great but it doesn't appear to need any additional work!

Awkward array includes pybind11 (as a git submodule rather than a pip dependency). pybind11 isn't thread-safe without the GIL unless patched, so I expect awkward array to have the same issues. For example, this program crashes when run without the GIL. If you only use awkward array from a single thread, I don't expect there to be a problem, but if you use it from multiple threads I'd expect issues.

I'll build an awkward array wheel with the patched pybind11.

@lgray
Author

lgray commented Mar 10, 2022

Yes - on further testing we ran into problems (at a larger number of threads) as well. We'd also be interested in looking at boost-histogram, which also uses pybind11 and compiles fine from the tarball.

@lgray
Author

lgray commented Mar 10, 2022

@bendavid @jpivarski FYI

@colesbury
Owner

@lgray I've uploaded wheels for awkward (1.7.0) and boost-histogram (1.3.1) with the patched pybind11. You should be able to install them as usual through pip after you've uninstalled any locally built versions of those packages.

Here are the scripts I used to minimally test the multithreaded behavior:

Here are the patched projects (just replaced pybind11 submodules):

@lgray
Author

lgray commented Mar 10, 2022

@colesbury on another point - we found that the gdb in the image and Dockerfile you supplied is broken (it compiles against vanilla Python and segfaults on execution). I've put together a Dockerfile that builds its own gdb. Happy to toss it your way.

@colesbury
Owner

@lgray Thanks, I'd appreciate that!
