
[C] Research ConnectorX/pgeon for optimizing libpq driver #71

Open · lidavidm opened this issue Aug 19, 2022 · 13 comments

@lidavidm (Member)

Pgeon: https://github.com/0x0L/pgeon
ConnectorX: https://sfu-db.github.io/connector-x/intro.html

@lidavidm (Member, Author) commented Aug 22, 2022

ConnectorX

  • Can partition the query along a given column, then fetch the partitions in parallel (see the sketch after this list)
  • "Copy-exactly-once" architecture
  • Uses preallocated buffers where possible (it also appears to implement its own conversion to Python strings)

These optimizations would probably be difficult to support in full, though we should preallocate where possible.
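For reference, a minimal sketch of the partitioning idea, assuming a numeric partition column with known bounds (ConnectorX discovers them with a MIN/MAX query); the helper below is illustrative, not ConnectorX's actual implementation:

```cpp
// Illustrative sketch: split a query into N range-partitioned queries
// over a numeric column. Each generated query would then run on its own
// connection, one thread per partition.
#include <cstdint>
#include <string>
#include <vector>

std::vector<std::string> PartitionQuery(const std::string& base_query,
                                        const std::string& column,
                                        int64_t min_value, int64_t max_value,
                                        int num_partitions) {
  std::vector<std::string> queries;
  const int64_t span = max_value - min_value + 1;
  for (int i = 0; i < num_partitions; ++i) {
    const int64_t lo = min_value + span * i / num_partitions;
    const int64_t hi = min_value + span * (i + 1) / num_partitions;
    // Half-open range [lo, hi); the last partition's hi is max_value + 1,
    // so max_value itself is still covered.
    queries.push_back("SELECT * FROM (" + base_query + ") AS t WHERE " +
                      column + " >= " + std::to_string(lo) + " AND " +
                      column + " < " + std::to_string(hi));
  }
  return queries;
}
```

The per-partition Arrow batches would be concatenated (or streamed) at the end.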

Turbodbc

ConnectorX's docs compare it to Turbodbc, which tends to trail ConnectorX; however, Turbodbc does not appear to implement parallelization, which might explain the difference.

Turbodbc also lists some optimizations:
https://turbodbc.readthedocs.io/en/latest/pages/advanced_usage.html

In particular, it can interleave I/O and conversion. That may be interesting for us, though libpq only seems to give you a choice between fetching results one row at a time and getting all query results at once.
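For reference, a minimal sketch of the row-at-a-time path via PQsetSingleRowMode (connection string illustrative, error handling mostly elided):

```cpp
// Sketch of libpq single-row mode: results arrive one row per PGresult,
// so conversion work can in principle overlap with the network reads
// that produce the next row.
#include <libpq-fe.h>

int main() {
  PGconn* conn = PQconnectdb("host=localhost dbname=postgres");
  if (PQstatus(conn) != CONNECTION_OK) return 1;

  // Send the query without waiting for the full result set...
  PQsendQuery(conn, "SELECT generate_series(1, 1000000)");
  // ...then ask libpq to hand back results one row at a time.
  PQsetSingleRowMode(conn);

  PGresult* res;
  while ((res = PQgetResult(conn)) != nullptr) {
    if (PQresultStatus(res) == PGRES_SINGLE_TUPLE) {
      // Convert this single row here; PQgetvalue(res, 0, col) etc.
    }
    PQclear(res);  // the final zero-row PGRES_TUPLES_OK is cleared too
  }
  PQfinish(conn);
  return 0;
}
```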

Turbodbc also implements some memory optimizations: dictionary-encoding string fields, and dynamically determining the minimum integer width.
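As an aside, a minimal sketch of the width-narrowing idea, assuming the values are scanned before committing to an Arrow type (illustrative only, not Turbodbc's actual code):

```cpp
// Illustrative: pick the narrowest integer type that can hold the
// observed min/max of a column.
#include <cstdint>
#include <vector>

enum class IntWidth { kInt8, kInt16, kInt32, kInt64 };

IntWidth MinimumWidth(const std::vector<int64_t>& values) {
  int64_t lo = 0, hi = 0;
  for (int64_t v : values) {
    if (v < lo) lo = v;
    if (v > hi) hi = v;
  }
  if (lo >= INT8_MIN && hi <= INT8_MAX) return IntWidth::kInt8;
  if (lo >= INT16_MIN && hi <= INT16_MAX) return IntWidth::kInt16;
  if (lo >= INT32_MIN && hi <= INT32_MAX) return IntWidth::kInt32;
  return IntWidth::kInt64;
}
```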

pgeon

  • Uses COPY (DuckDB appears to do this too, though note that DuckDB's postgres extension is GPL). That honestly seems to be the main optimization (see the sketch after this list).
  • Queries some metadata tables up front to determine the proper types.
  • One snag about COPY is that it involves an allocation per row: https://www.postgresql.org/docs/current/libpq-copy.html#LIBPQ-COPY-RECEIVE. Not super great, but if this is actually a bottleneck I guess we can reimplement libpq…
  • The COPY binary format doesn't appear to give you a row count, making preallocation harder.
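For reference, a minimal sketch of that COPY read path, using only documented libpq calls (table name illustrative, error handling elided); note the per-row buffer that PQgetCopyData allocates and PQfreemem releases:

```cpp
// Sketch of a COPY-based read: each PQgetCopyData call returns one data
// row (or header/trailer chunk) in a freshly allocated buffer, which is
// the allocation-per-row snag mentioned above.
#include <libpq-fe.h>

void CopyQuery(PGconn* conn) {
  PGresult* res = PQexec(
      conn, "COPY (SELECT * FROM my_table) TO STDOUT (FORMAT binary)");
  if (PQresultStatus(res) != PGRES_COPY_OUT) { PQclear(res); return; }
  PQclear(res);

  char* buf;
  int len;
  while ((len = PQgetCopyData(conn, &buf, /*async=*/0)) > 0) {
    // Parse the COPY binary row in buf[0..len) into Arrow buffers here.
    PQfreemem(buf);
  }
  // len == -1 signals end of COPY; drain the final result status.
  while ((res = PQgetResult(conn)) != nullptr) PQclear(res);
}
```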

@0x0L commented Aug 22, 2022

@lidavidm For pgeon I have experimented with FETCH instead of COPY. COPY was the fastest method in my [limited] tests.

@lidavidm (Member, Author)

Ah, thanks. I noticed that, and it seems like FETCH also requires you to manage a server-side cursor, which isn't great (sketch below).
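For comparison, a minimal sketch of the bookkeeping FETCH implies (cursor and table names illustrative, error handling elided); the cursor is server-side state that has to live inside a transaction and be explicitly closed:

```cpp
// Sketch of a FETCH-based read: DECLARE a server-side cursor, pull
// batches until exhausted, then CLOSE and COMMIT.
#include <libpq-fe.h>

void FetchQuery(PGconn* conn) {
  PQclear(PQexec(conn, "BEGIN"));
  PQclear(PQexec(conn, "DECLARE my_cursor CURSOR FOR SELECT * FROM my_table"));

  while (true) {
    PGresult* res = PQexec(conn, "FETCH 10000 FROM my_cursor");
    const int nrows = PQntuples(res);
    // Convert this batch of rows to Arrow here.
    PQclear(res);
    if (nrows == 0) break;  // cursor exhausted
  }

  PQclear(PQexec(conn, "CLOSE my_cursor"));
  PQclear(PQexec(conn, "COMMIT"));
}
```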

@dhirschfeld (Contributor)

> In particular, it can interleave I/O and conversion

If you're implementing an async interface, as a trio user it would be great if you could use anyio rather than native asyncio features. This would enable the code to be used with any async library.

Perhaps the most prominent Python library to support AnyIO is fastapi, and that's where I'd (eventually) like to make use of adbc - asynchronously connecting to databases for displaying data in FastAPI dashboards.

@lidavidm (Member, Author) commented Dec 6, 2022

Async APIs are somewhere down on the list of things I would like to explore! But the 'base' API is all blocking. (I also haven't tried binding async C/C++ interfaces to Python's async APIs yet; I need to look at whether callbacks, polling, or something else is preferred/ergonomic. A sketch of libpq's polling-style interface is below.)

Thanks for the heads up - I'll make sure to support the broader ecosystem (I quite like trio's ideas, even if I haven't gotten a chance to use it in practice).
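Re: callbacks vs. polling, libpq itself exposes a polling-style non-blocking API (PQsendQuery / PQconsumeInput / PQisBusy plus the socket from PQsocket), which maps reasonably well onto async frameworks. A minimal sketch, assuming a POSIX select() loop as a stand-in for whatever event loop a binding would actually use (error handling elided):

```cpp
// Sketch of libpq's non-blocking result path: wait until the socket is
// readable, feed the bytes to libpq, and only call PQgetResult once it
// would not block.
#include <libpq-fe.h>
#include <sys/select.h>

void DrainAsync(PGconn* conn) {
  PQsendQuery(conn, "SELECT pg_sleep(1)");
  const int fd = PQsocket(conn);

  PGresult* res;
  while (true) {
    // PQisBusy() is true while a PQgetResult() call would block.
    while (PQisBusy(conn)) {
      fd_set readable;
      FD_ZERO(&readable);
      FD_SET(fd, &readable);
      select(fd + 1, &readable, nullptr, nullptr, nullptr);  // yield point
      PQconsumeInput(conn);  // pull newly arrived bytes into libpq
    }
    if ((res = PQgetResult(conn)) == nullptr) break;  // query finished
    PQclear(res);
  }
}
```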

@lidavidm lidavidm added this to the 0.2.0 milestone Dec 13, 2022
@lidavidm lidavidm removed this from the ADBC Libraries 0.2.0 milestone Feb 2, 2023
@lidavidm (Member, Author) commented May 9, 2023

This benchmark (in slides) found that the libpq driver is very slow: https://www.clear-code.com/blog/2023/5/8/rubykaigi-2023-announce.html

@paleolimbot (Member)

Is that before or after #636?

FWIW, after that PR you could write benchmarks for reading a raw COPY buffer (i.e., without reading over a connection). Another optimization would be to attempt parallelizing the "read from connection" and "convert to arrow" operations.

@kou (Member) commented May 9, 2023

I think that it's "after".
The benchmark is https://github.com/apache/arrow-flight-sql-postgresql/tree/main/benchmark/integer.

@kou (Member) commented May 9, 2023

FYI: The slide URL in the blog post: https://slide.rabbit-shocker.org/authors/kou/rubykaigi-2023/

@lidavidm (Member, Author) commented May 9, 2023

Ah, so the 'libpq' column is https://github.com/apache/arrow-flight-sql-postgresql/blob/main/benchmark/integer/select.c ? In that case I would expect it to be slower by definition since we're doing extra work to convert the result set to Arrow. And the Flight SQL server has an advantage since it can grab the data directly from PostgreSQL without going through the PostgreSQL wire protocol.

@kou (Member) commented May 9, 2023

@lidavidm (Member, Author) commented May 9, 2023

> Another optimization would be to attempt parallelizing the "read from connection" and "convert to arrow" operations.

FWIW, this is mentioned in the issue above. I think when I looked at it, it seemed like libpq would read the entire response before returning to you.

@paleolimbot (Member)

Yes, it won't return anything less than one row at a time. But right now we do download -> decode -> download -> decode, and in theory we could do

download -> download -> download -> download -> download ->
                        sync -> decode -> wait              sync -> decode

...such that the only time the user pays for is download time. (Probably complicated to get right, though; a sketch follows.)
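A minimal sketch of that pipeline, assuming the COPY-based read path (queue handling simplified; a real implementation would bound the queue and propagate errors):

```cpp
// Sketch: one thread pulls COPY buffers off the connection while another
// decodes them to Arrow, so decoding overlaps with the next download.
#include <libpq-fe.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Chunk { std::vector<char> data; };

void Pipeline(PGconn* conn) {
  std::queue<Chunk> queue;
  std::mutex mu;
  std::condition_variable cv;
  bool done = false;

  // Producer: pays only for network time.
  std::thread downloader([&] {
    char* buf;
    int len;
    while ((len = PQgetCopyData(conn, &buf, /*async=*/0)) > 0) {
      Chunk chunk{std::vector<char>(buf, buf + len)};
      PQfreemem(buf);
      {
        std::lock_guard<std::mutex> lock(mu);
        queue.push(std::move(chunk));
      }
      cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(mu); done = true; }
    cv.notify_one();
  });

  // Consumer: decodes concurrently with the next download.
  std::thread decoder([&] {
    while (true) {
      std::unique_lock<std::mutex> lock(mu);
      cv.wait(lock, [&] { return !queue.empty() || done; });
      if (queue.empty()) break;
      Chunk chunk = std::move(queue.front());
      queue.pop();
      lock.unlock();
      // DecodeToArrow(chunk.data);  // hypothetical decode step
    }
  });

  downloader.join();
  decoder.join();
}
```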
