[C] Research ConnectorX/pgeon for optimizing libpq driver #71
ConnectorX
These optimizations would probably be difficult to support, though we should preallocate where possible. ConnectorX's docs compare it to Turbodbc, which tends to trail it, though Turbodbc does not appear to implement parallelization (that might explain the difference). Turbodbc also lists some optimizations; in particular, it can interleave I/O and conversion. That may be interesting for us, though libpq only seems to give you a choice between row-at-a-time and getting all query results at once. Turbodbc also implements some memory optimizations: dictionary-encoding string fields, and dynamically determining the minimum integer width.
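For reference, a minimal sketch (not the driver's code) of the "row-at-a-time" side of that choice, using libpq's PQsetSingleRowMode; the default behavior is to buffer the entire result set before PQgetResult returns. The conversion step is elided.

```c
/* Sketch: libpq's single-row mode. By default, PQgetResult returns only once
 * the whole result set has been received and buffered; with
 * PQsetSingleRowMode, each call yields one PGRES_SINGLE_TUPLE instead. */
#include <libpq-fe.h>
#include <stdio.h>

static void fetch_row_at_a_time(PGconn *conn, const char *query) {
  if (!PQsendQuery(conn, query)) {
    fprintf(stderr, "send failed: %s", PQerrorMessage(conn));
    return;
  }
  /* Must be requested before the first PQgetResult call. */
  PQsetSingleRowMode(conn);

  PGresult *res;
  while ((res = PQgetResult(conn)) != NULL) {
    ExecStatusType status = PQresultStatus(res);
    if (status == PGRES_SINGLE_TUPLE) {
      /* Exactly one row here: convert it and append to an Arrow builder. */
    } else if (status == PGRES_TUPLES_OK) {
      /* Zero-row result that marks the end of the set. */
    } else {
      fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
    }
    PQclear(res);
  }
}
```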
pgeon
@lidavidm for pgeon I have experimented with FETCH instead of COPY. COPY was the fastest method in my [limited] testing.
Ah, thanks. I noticed that, and it seems like FETCH also requires you to manage a server-side cursor, which isn't great.
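For illustration only (this is not pgeon's implementation, and the cursor name, batch size, and table are made up), roughly the bookkeeping a FETCH-based reader takes on: an explicit transaction, DECLARE/FETCH/CLOSE management, and one round trip per batch.

```c
/* Sketch of a FETCH-based reader; identifiers are placeholders. */
#include <libpq-fe.h>

static void read_via_cursor(PGconn *conn) {
  PQclear(PQexec(conn, "BEGIN"));  /* non-holdable cursors need a transaction */
  PQclear(PQexec(conn, "DECLARE cur NO SCROLL CURSOR FOR SELECT * FROM t"));
  for (;;) {
    /* One round trip per batch, on top of the cursor management. */
    PGresult *res = PQexec(conn, "FETCH FORWARD 10000 FROM cur");
    int nrows = (PQresultStatus(res) == PGRES_TUPLES_OK) ? PQntuples(res) : 0;
    /* ...convert nrows rows to Arrow... */
    PQclear(res);
    if (nrows == 0) break;  /* exhausted (or errored) */
  }
  PQclear(PQexec(conn, "CLOSE cur"));
  PQclear(PQexec(conn, "COMMIT"));
}
```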
If you're implementing an async interface, please consider supporting AnyIO rather than only asyncio, so that it works with trio as well.
Async APIs are somewhere down on the list of things I would like to explore! But the 'base' API is all blocking. (I also haven't tried binding async C/C++ interfaces to Python's async APIs yet - I need to look at whether callbacks, polling, or something else is preferred/ergonomic.) Thanks for the heads up - I'll make sure to support the broader ecosystem (I quite like trio's ideas, even if I haven't gotten a chance to use it in practice).
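For what it's worth, libpq's own non-blocking C API is poll-based (PQsocket/PQconsumeInput/PQisBusy) rather than callback-based, which probably shapes whatever async binding sits on top. A minimal sketch, assuming a query has already been submitted with PQsendQuery; select() here is just a stand-in for a real event loop.

```c
/* Sketch: polling libpq for a result without blocking the whole thread. */
#include <libpq-fe.h>
#include <stddef.h>
#include <sys/select.h>

static PGresult *poll_for_result(PGconn *conn) {
  int sock = PQsocket(conn);
  while (PQisBusy(conn)) {
    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(sock, &readable);
    /* An event loop (asyncio, trio, ...) would register `sock` instead. */
    if (select(sock + 1, &readable, NULL, NULL, NULL) < 0) return NULL;
    if (!PQconsumeInput(conn)) return NULL;  /* connection trouble */
  }
  return PQgetResult(conn);  /* no longer blocks once PQisBusy() is false */
}
```

A Python binding could expose the socket's file descriptor to whichever event loop is in use instead of calling select() itself.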
This benchmark (in slides) found that the libpq driver is very slow: https://www.clear-code.com/blog/2023/5/8/rubykaigi-2023-announce.html
Is that before or after #636? FWIW, after that PR you could write benchmarks for reading a raw COPY buffer (i.e., without reading over a connection). Another optimization would be to attempt parallelizing the "read from connection" and "convert to arrow" operations.
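One way such a benchmark input could be captured (a sketch, not the code from that PR; the query and file path are placeholders): dump the raw COPY BINARY stream to a file with PQgetCopyData, then benchmark the Arrow conversion against that file offline.

```c
/* Sketch: save the raw COPY BINARY stream so the conversion step can be
 * benchmarked without a live connection. */
#include <libpq-fe.h>
#include <stdio.h>

static int dump_copy_stream(PGconn *conn, const char *copy_query,
                            const char *path) {
  FILE *out = fopen(path, "wb");
  if (!out) return -1;

  /* e.g. "COPY (SELECT ...) TO STDOUT (FORMAT binary)" */
  PGresult *res = PQexec(conn, copy_query);
  if (PQresultStatus(res) != PGRES_COPY_OUT) {
    PQclear(res);
    fclose(out);
    return -1;
  }
  PQclear(res);

  char *chunk = NULL;
  int len;
  while ((len = PQgetCopyData(conn, &chunk, /*async=*/0)) > 0) {
    fwrite(chunk, 1, (size_t)len, out);
    PQfreemem(chunk);
  }
  fclose(out);
  PQclear(PQgetResult(conn));    /* drain the command-completion result */
  return (len == -2) ? -1 : 0;   /* -2 means the copy failed mid-stream */
}
```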
I think that it's "after".
FYI: the slide URL from the blog post: https://slide.rabbit-shocker.org/authors/kou/rubykaigi-2023/
Ah, so the 'libpq' column is https://github.com/apache/arrow-flight-sql-postgresql/blob/main/benchmark/integer/select.c? In that case I would expect it to be slower by definition since we're doing extra work to convert the result set to Arrow. And the Flight SQL server has an advantage since it can grab the data directly from PostgreSQL without going through the PostgreSQL wire protocol.
Yes.
FWIW, this is mentioned in the issue above. I think when I looked at it, it seemed like libpq would read the entire response before returning to you.
Yes, it won't return anything less than one row at a time. But right now we do download -> decode -> download -> decode, and in theory we could overlap the two, downloading the next chunk while decoding the previous one, such that the only time the user pays for is download time. (Probably complicated to get right, though.)
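To make the idea concrete, a rough producer/consumer sketch (the single-slot handoff and the thread split are illustrative only, not the driver's design): one thread drains PQgetCopyData while the other decodes the previous chunk, so decode time hides behind download time.

```c
/* Sketch of overlapping download and decode. Initialize Slot's lock/cond with
 * PTHREAD_MUTEX_INITIALIZER and PTHREAD_COND_INITIALIZER. */
#include <libpq-fe.h>
#include <pthread.h>
#include <stddef.h>

typedef struct {
  pthread_mutex_t lock;
  pthread_cond_t cond;
  char *chunk;  /* one COPY data chunk, owned by whoever holds it */
  int len;      /* chunk length, or -1 once the producer is done */
  int full;
} Slot;

typedef struct {
  PGconn *conn;
  Slot *slot;
} DownloadArgs;

static void slot_put(Slot *s, char *chunk, int len) {
  pthread_mutex_lock(&s->lock);
  while (s->full) pthread_cond_wait(&s->cond, &s->lock);
  s->chunk = chunk;
  s->len = len;
  s->full = 1;
  pthread_cond_broadcast(&s->cond);
  pthread_mutex_unlock(&s->lock);
}

static int slot_take(Slot *s, char **chunk) {
  pthread_mutex_lock(&s->lock);
  while (!s->full) pthread_cond_wait(&s->cond, &s->lock);
  *chunk = s->chunk;
  int len = s->len;
  s->full = 0;
  pthread_cond_broadcast(&s->cond);
  pthread_mutex_unlock(&s->lock);
  return len;
}

/* Producer: pull COPY chunks off the socket as fast as the network allows. */
static void *download_thread(void *arg) {
  DownloadArgs *a = arg;
  char *chunk;
  int len;
  while ((len = PQgetCopyData(a->conn, &chunk, /*async=*/0)) > 0) {
    slot_put(a->slot, chunk, len);  /* the decoder frees it */
  }
  slot_put(a->slot, NULL, -1);      /* end-of-stream sentinel */
  return NULL;
}

/* Consumer: decoding overlaps with the next download. */
static void *decode_thread(void *arg) {
  Slot *s = arg;
  char *chunk;
  int len;
  while ((len = slot_take(s, &chunk)) > 0) {
    /* ...parse the COPY BINARY chunk and append to Arrow builders... */
    PQfreemem(chunk);
  }
  return NULL;
}
```

A real implementation would presumably want a bounded multi-buffer queue and error propagation, but the single slot is enough to show where the overlap comes from.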
Pgeon: https://github.com/0x0L/pgeon
ConnectorX: https://sfu-db.github.io/connector-x/intro.html