benchmark representing async-heavy, heterogeneous workload #239

Open · belm0 opened this issue Nov 3, 2022 · 7 comments

belm0 commented Nov 3, 2022

context: I recently set about finding a faster Python interpreter for my async app, which has a high rate of context switching, many coroutines, and each coroutine running different code. I found that the app runs 2x slower on PyPy, and 20% faster on Pyston-full. One reason may be that a tracing JIT will naively trace across context switches, so that a given trace is never repeated, because coroutine resumes occur at arbitrary times and in arbitrary order.

There really seems to be nothing in the benchmark world that represents this kind of async workload-- which, by the way, I expect to become more common over time. The current async benchmarks in pyperformance have nothing like it-- the coroutines are trivial and homogeneous.

It's concerning that the Pyston full fork is being retired, while PyPy blindly continues as if everything's OK, and the Faster CPython project proceeds at a furious pace-- all without evaluating a workload that is significant today, and may become more so in the future.
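
To make the workload shape concrete, here's a toy sketch (illustrative only, not my app's code): many concurrent tasks, each with a different body, so consecutive resumes rarely execute the same code.

```python
import asyncio

async def count_squares():
    total = 0
    for i in range(100):
        total += i * i
        await asyncio.sleep(0)  # yield back to the event loop

async def build_dict():
    data = {}
    for i in range(100):
        data[i] = str(i)
        await asyncio.sleep(0)

async def build_string():
    text = ""
    for i in range(100):
        text += chr(97 + i % 26)
        await asyncio.sleep(0)

async def main():
    # hundreds of tasks interleave, and adjacent resumes usually belong to
    # different code objects, so a trace recorded across a context switch
    # is unlikely to ever repeat
    workers = [count_squares, build_dict, build_string] * 100
    await asyncio.gather(*(w() for w in workers))

asyncio.run(main())
```

(A real benchmark would need far more variety than three handwritten bodies-- see the code-generation idea below.)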

stonebig commented Nov 3, 2022

Maybe you can propose simple yet representative code for such a benchmark?

belm0 commented Nov 3, 2022

I'm not sure how to go about it, as taxing a tracing JIT would require a lot of code (I think PyPy gives up after a 6000-opcode trace). Some kind of code generation may need to be employed. I'm open to ideas.
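
For instance, something along these lines (a hypothetical sketch-- names and parameters are made up) could stamp out hundreds of coroutine functions, each with its own code object:

```python
import asyncio
import random

TEMPLATE = """
async def task_{n}():
    acc = 0
    for i in range({iters}):
        acc = (acc * {mult} + i) % {mod}
        await asyncio.sleep(0)
    return acc
"""

def make_tasks(count):
    funcs = []
    for n in range(count):
        src = TEMPLATE.format(
            n=n,
            iters=random.randint(50, 200),
            mult=random.randint(2, 9),
            mod=random.randint(1000, 9999),
        )
        namespace = {"asyncio": asyncio}
        exec(src, namespace)  # compile a unique function per task
        funcs.append(namespace[f"task_{n}"])
    return funcs

async def main():
    await asyncio.gather(*(f() for f in make_tasks(500)))

asyncio.run(main())
```

This only varies constants; to really tax a tracer the generator would probably also need to vary control flow and call structure.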

ericsnowcurrently (Member) commented

Thanks for the suggestion! A benchmark like this would be valuable. It's mostly a matter of someone with enough knowledge taking the time to do it. It's relatively straightforward to write a benchmark (even outside the pyperformance repo), and pyperformance provides plenty of examples.

Also, it would be helpful to have both a workload-oriented benchmark and a feature-oriented one (macro vs. micro). There's more info about this on the faster-cpython wiki.

belm0 commented Nov 3, 2022

It may be useful to write the benchmark against the anyio API, so that it could be run against both asyncio and trio.
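
A minimal sketch of what that could look like (assuming anyio 3+, where task groups use start_soon; the trio backend requires trio to be installed):

```python
import anyio

async def worker(n):
    for _ in range(100):
        await anyio.sleep(0)  # pure context switch

async def main():
    async with anyio.create_task_group() as tg:
        for n in range(500):
            tg.start_soon(worker, n)

anyio.run(main, backend="asyncio")
anyio.run(main, backend="trio")  # same benchmark body, trio scheduler
```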

belm0 commented Nov 5, 2022

I'm thinking about what metrics of my app would best represent the workload, then creating a synthetic benchmark with similar metrics.

assuming a single-threaded program, not hitting CPU limits, perhaps:

  • number of active tasks (coroutines)
  • task context switch rate
  • unique task context switch rate - given a 1 s sliding window of context switches, the count of unique tasks (by underlying function) in that window
  • task step duration observations (for histogram) - wall time from task resume to yield (a measurement sketch follows this list)
  • task step unique lines (i.e. discounting repeated execution)
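
As a sketch of how the step-duration metric might be captured (an assumed approach, not necessarily how my app measures it): a generator-based driver that forwards a coroutine's awaits to the event loop while timing each resume-to-yield span.

```python
import time
import types

@types.coroutine
def timed(coro, durations):
    """Drive `coro`, appending the wall time of each step to `durations`."""
    value, exc = None, None
    while True:
        t0 = time.perf_counter()
        try:
            yielded = coro.send(value) if exc is None else coro.throw(exc)
        except StopIteration as stop:
            durations.append(time.perf_counter() - t0)  # final step
            return stop.value
        durations.append(time.perf_counter() - t0)      # one task step
        exc = None
        try:
            value = yield yielded  # hand the awaited future to the loop
        except BaseException as e:
            exc = e

# usage, e.g. inside a task:
#     durations = []
#     result = await timed(some_coroutine(), durations)
```

Driving the coroutine manually, rather than patching the event loop, keeps the instrumentation scheduler-agnostic.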

belm0 commented Nov 5, 2022

here's how my app looks by those metrics:

  max active tasks: 580
  context switch rate:  4,267 / s
  unique context switch rate (median, 1s window):  84 / s
  step duration: avg 119 µs +/- 347 µs
                 quartiles:  44.5 µs, 67.0 µs, 111 µs
  step unique lines: avg  164.9 +/-  115.1
                     quartiles:  87.5, 159.9, 191.4

"unique context switch rate" substantiates that the workload is heterogeneous. Each second, it touches about 80 different coroutine functions (coro.cr_code).

"step unique lines" substantiates that each coroutine step involves a significant amount of code to trace (avg and median are ~160 lines)

From the context switch rate and average step duration, it's possible to derive the approximate CPU load: 119e-6 s × 4,267/s ≈ 51%.

(This is running on CPython 3.8. The durations are about 25% shorter on Pyston 3.8.)

belm0 commented Nov 5, 2022

What's missing from the metrics, and especially relevant to JIT implementations, is some notion of the number of opcodes run per step. It should probably exclude repeated loop iterations-- so something like "step opcode line count".

Update: I didn't go as far as counting opcodes, but I've revised the previous posts to include a unique line count per step.
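
A rough sketch of how a unique-line count can be gathered (a hypothetical helper; a per-step version would toggle tracing at each task resume and yield):

```python
import sys

def unique_lines(func, *args, **kwargs):
    """Run func under a line tracer; count distinct (code, lineno) pairs,
    so repeated loop iterations are only counted once."""
    seen = set()

    def tracer(frame, event, arg):
        if event == "line":
            seen.add((frame.f_code, frame.f_lineno))
        return tracer  # keep tracing nested frames

    sys.settrace(tracer)
    try:
        func(*args, **kwargs)
    finally:
        sys.settrace(None)
    return len(seen)
```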
