
[C] Research ConnectorX/pgeon for optimizing libpq driver #71

Open · lidavidm opened this issue Aug 19, 2022 · 13 comments

@lidavidm (Member)

Pgeon: https://github.com/0x0L/pgeon
ConnectorX: https://sfu-db.github.io/connector-x/intro.html

@lidavidm (Member, Author) commented Aug 22, 2022

ConnectorX

  • Can partition the query along a given column, then fetch the partitions in parallel (see the sketch after this list)
  • "Copy-exactly-once" architecture
  • Uses preallocated buffers where possible (it also appears to implement its own conversion to Python strings)

These optimizations would probably be difficult to support in full, though we should preallocate where possible.
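For reference, a minimal sketch of the partitioning idea, assuming a numeric partition column with known bounds (ConnectorX discovers them with a MIN/MAX query); the helper below is illustrative, not ConnectorX's actual implementation:

```cpp
// Illustrative sketch: split a query into N range-partitioned queries
// over a numeric column. Each generated query would then run on its own
// connection, one thread per partition.
#include <cstdint>
#include <string>
#include <vector>

std::vector<std::string> PartitionQuery(const std::string& base_query,
                                        const std::string& column,
                                        int64_t min_value, int64_t max_value,
                                        int num_partitions) {
  std::vector<std::string> queries;
  const int64_t span = max_value - min_value + 1;
  for (int i = 0; i < num_partitions; ++i) {
    const int64_t lo = min_value + span * i / num_partitions;
    const int64_t hi = min_value + span * (i + 1) / num_partitions;
    // Half-open range [lo, hi); the last partition's hi is max_value + 1,
    // so max_value itself is still covered.
    queries.push_back("SELECT * FROM (" + base_query + ") AS t WHERE " +
                      column + " >= " + std::to_string(lo) + " AND " +
                      column + " < " + std::to_string(hi));
  }
  return queries;
}
```

The per-partition Arrow batches would be concatenated (or streamed) at the end.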

Turbodbc

ConnectorX's docs compare it to Turbodbc, which tends to trail ConnectorX; however, Turbodbc does not appear to implement parallelization, which might explain the difference.

Turbodbc also lists some optimizations:
https://turbodbc.readthedocs.io/en/latest/pages/advanced_usage.html

In particular, it can interleave I/O and conversion. That may be interesting for us, though libpq only seems to give you a choice between fetching results one row at a time and getting all query results at once.
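For reference, a minimal sketch of the row-at-a-time path via PQsetSingleRowMode (connection string illustrative, error handling mostly elided):

```cpp
// Sketch of libpq single-row mode: results arrive one row per PGresult,
// so conversion work can in principle overlap with the network reads
// that produce the next row.
#include <libpq-fe.h>

int main() {
  PGconn* conn = PQconnectdb("host=localhost dbname=postgres");
  if (PQstatus(conn) != CONNECTION_OK) return 1;

  // Send the query without waiting for the full result set...
  PQsendQuery(conn, "SELECT generate_series(1, 1000000)");
  // ...then ask libpq to hand back results one row at a time.
  PQsetSingleRowMode(conn);

  PGresult* res;
  while ((res = PQgetResult(conn)) != nullptr) {
    if (PQresultStatus(res) == PGRES_SINGLE_TUPLE) {
      // Convert this single row here; PQgetvalue(res, 0, col) etc.
    }
    PQclear(res);  // the final zero-row PGRES_TUPLES_OK is cleared too
  }
  PQfinish(conn);
  return 0;
}
```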

Turbodbc also implements some memory optimizations: dictionary-encoding string fields, and dynamically determining the minimum integer width.
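As an aside, a minimal sketch of the width-narrowing idea, assuming the values are scanned before committing to an Arrow type (illustrative only, not Turbodbc's actual code):

```cpp
// Illustrative: pick the narrowest integer type that can hold the
// observed min/max of a column.
#include <cstdint>
#include <vector>

enum class IntWidth { kInt8, kInt16, kInt32, kInt64 };

IntWidth MinimumWidth(const std::vector<int64_t>& values) {
  int64_t lo = 0, hi = 0;
  for (int64_t v : values) {
    if (v < lo) lo = v;
    if (v > hi) hi = v;
  }
  if (lo >= INT8_MIN && hi <= INT8_MAX) return IntWidth::kInt8;
  if (lo >= INT16_MIN && hi <= INT16_MAX) return IntWidth::kInt16;
  if (lo >= INT32_MIN && hi <= INT32_MAX) return IntWidth::kInt32;
  return IntWidth::kInt64;
}
```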

pgeon

  • Uses COPY (DuckDB appears to do this too, though note that DuckDB's postgres extension is GPL). That honestly seems to be the main optimization (see the sketch after this list).
  • Queries some metadata tables up front to determine the proper types.
  • One snag about COPY is that it involves an allocation per row: https://www.postgresql.org/docs/current/libpq-copy.html#LIBPQ-COPY-RECEIVE. Not super great, but if this is actually a bottleneck I guess we can reimplement libpq…
  • The COPY binary format doesn't appear to give you a row count, making preallocation harder.
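For reference, a minimal sketch of that COPY read path, using only documented libpq calls (table name illustrative, error handling elided); note the per-row buffer that PQgetCopyData allocates and PQfreemem releases:

```cpp
// Sketch of a COPY-based read: each PQgetCopyData call returns one data
// row (or header/trailer chunk) in a freshly allocated buffer, which is
// the allocation-per-row snag mentioned above.
#include <libpq-fe.h>

void CopyQuery(PGconn* conn) {
  PGresult* res = PQexec(
      conn, "COPY (SELECT * FROM my_table) TO STDOUT (FORMAT binary)");
  if (PQresultStatus(res) != PGRES_COPY_OUT) { PQclear(res); return; }
  PQclear(res);

  char* buf;
  int len;
  while ((len = PQgetCopyData(conn, &buf, /*async=*/0)) > 0) {
    // Parse the COPY binary row in buf[0..len) into Arrow buffers here.
    PQfreemem(buf);
  }
  // len == -1 signals end of COPY; drain the final result status.
  while ((res = PQgetResult(conn)) != nullptr) PQclear(res);
}
```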

@0x0L commented Aug 22, 2022

@lidavidm For pgeon I have experimented with FETCH instead of COPY. COPY was the fastest method in my [limited] tests.

@lidavidm (Member, Author)

Ah, thanks. I noticed that, and it seems like FETCH also requires you to manage a server-side cursor, which isn't great (sketch below).
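For comparison, a minimal sketch of the bookkeeping FETCH implies (cursor and table names illustrative, error handling elided); the cursor is server-side state that has to live inside a transaction and be explicitly closed:

```cpp
// Sketch of a FETCH-based read: DECLARE a server-side cursor, pull
// batches until exhausted, then CLOSE and COMMIT.
#include <libpq-fe.h>

void FetchQuery(PGconn* conn) {
  PQclear(PQexec(conn, "BEGIN"));
  PQclear(PQexec(conn, "DECLARE my_cursor CURSOR FOR SELECT * FROM my_table"));

  while (true) {
    PGresult* res = PQexec(conn, "FETCH 10000 FROM my_cursor");
    const int nrows = PQntuples(res);
    // Convert this batch of rows to Arrow here.
    PQclear(res);
    if (nrows == 0) break;  // cursor exhausted
  }

  PQclear(PQexec(conn, "CLOSE my_cursor"));
  PQclear(PQexec(conn, "COMMIT"));
}
```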

@dhirschfeld (Contributor)

> In particular, it can interleave I/O and conversion

If you're implementing an async interface, as a trio user it would be great if you could use anyio rather than native asyncio features. This would enable the code to be used with any async library.

Perhaps the most prominent Python library to support AnyIO is fastapi, and that's where I'd (eventually) like to make use of adbc - asynchronously connecting to databases for displaying data in FastAPI dashboards.

@lidavidm (Member, Author) commented Dec 6, 2022

Async APIs are somewhere down on the list of things I would like to explore! But the 'base' API is all blocking. (I also haven't tried binding async C/C++ interfaces to Python's async APIs yet; I need to look at whether callbacks, polling, or something else is preferred/ergonomic. A sketch of libpq's polling-style interface is below.)

Thanks for the heads up - I'll make sure to support the broader ecosystem (I quite like trio's ideas, even if I haven't gotten a chance to use it in practice).
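Re: callbacks vs. polling, libpq itself exposes a polling-style non-blocking API (PQsendQuery / PQconsumeInput / PQisBusy plus the socket from PQsocket), which maps reasonably well onto async frameworks. A minimal sketch, assuming a POSIX select() loop as a stand-in for whatever event loop a binding would actually use (error handling elided):

```cpp
// Sketch of libpq's non-blocking result path: wait until the socket is
// readable, feed the bytes to libpq, and only call PQgetResult once it
// would not block.
#include <libpq-fe.h>
#include <sys/select.h>

void DrainAsync(PGconn* conn) {
  PQsendQuery(conn, "SELECT pg_sleep(1)");
  const int fd = PQsocket(conn);

  PGresult* res;
  while (true) {
    // PQisBusy() is true while a PQgetResult() call would block.
    while (PQisBusy(conn)) {
      fd_set readable;
      FD_ZERO(&readable);
      FD_SET(fd, &readable);
      select(fd + 1, &readable, nullptr, nullptr, nullptr);  // yield point
      PQconsumeInput(conn);  // pull newly arrived bytes into libpq
    }
    if ((res = PQgetResult(conn)) == nullptr) break;  // query finished
    PQclear(res);
  }
}
```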

@lidavidm lidavidm added this to the 0.2.0 milestone Dec 13, 2022
@lidavidm lidavidm removed this from the ADBC Libraries 0.2.0 milestone Feb 2, 2023
@lidavidm (Member, Author) commented May 9, 2023

This benchmark (in slides) found that the libpq driver is very slow: https://www.clear-code.com/blog/2023/5/8/rubykaigi-2023-announce.html

@paleolimbot (Member)

Is that before or after #636?

FWIW, after that PR you could write benchmarks for reading a raw COPY buffer (i.e., without reading over a connection). Another optimization would be to attempt parallelizing the "read from connection" and "convert to arrow" operations.

@kou (Member) commented May 9, 2023

I think that it's "after".
The benchmark is https://github.com/apache/arrow-flight-sql-postgresql/tree/main/benchmark/integer.

@kou (Member) commented May 9, 2023

FYI: The slide URL in the blog post: https://slide.rabbit-shocker.org/authors/kou/rubykaigi-2023/

@lidavidm (Member, Author) commented May 9, 2023

Ah, so the 'libpq' column is https://github.com/apache/arrow-flight-sql-postgresql/blob/main/benchmark/integer/select.c ? In that case I would expect it to be slower by definition since we're doing extra work to convert the result set to Arrow. And the Flight SQL server has an advantage since it can grab the data directly from PostgreSQL without going through the PostgreSQL wire protocol.

@kou (Member) commented May 9, 2023

@lidavidm (Member, Author) commented May 9, 2023

> Another optimization would be to attempt parallelizing the "read from connection" and "convert to arrow" operations.

FWIW, this is mentioned in the issue above. I think when I looked at it, it seemed like libpq would read the entire response before returning to you.

@paleolimbot (Member)

Yes, it won't return anything less than one row at a time. But right now we do download -> decode -> download -> decode, and in theory we could do

download -> download -> download -> download -> download ->
                        sync -> decode -> wait              sync -> decode

...such that the only time the user pays for is download time. (Probably complicated to get right, though; a sketch follows.)
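A minimal sketch of that pipeline, assuming the COPY-based read path (queue handling simplified; a real implementation would bound the queue and propagate errors):

```cpp
// Sketch: one thread pulls COPY buffers off the connection while another
// decodes them to Arrow, so decoding overlaps with the next download.
#include <libpq-fe.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Chunk { std::vector<char> data; };

void Pipeline(PGconn* conn) {
  std::queue<Chunk> queue;
  std::mutex mu;
  std::condition_variable cv;
  bool done = false;

  // Producer: pays only for network time.
  std::thread downloader([&] {
    char* buf;
    int len;
    while ((len = PQgetCopyData(conn, &buf, /*async=*/0)) > 0) {
      Chunk chunk{std::vector<char>(buf, buf + len)};
      PQfreemem(buf);
      {
        std::lock_guard<std::mutex> lock(mu);
        queue.push(std::move(chunk));
      }
      cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(mu); done = true; }
    cv.notify_one();
  });

  // Consumer: decodes concurrently with the next download.
  std::thread decoder([&] {
    while (true) {
      std::unique_lock<std::mutex> lock(mu);
      cv.wait(lock, [&] { return !queue.empty() || done; });
      if (queue.empty()) break;
      Chunk chunk = std::move(queue.front());
      queue.pop();
      lock.unlock();
      // DecodeToArrow(chunk.data);  // hypothetical decode step
    }
  });

  downloader.join();
  decoder.join();
}
```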
