Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RTree index results in significantly slower execution #444

Open
Meysam- opened this issue Oct 28, 2024 · 1 comment
Open

RTree index results in significantly slower execution #444

Meysam- opened this issue Oct 28, 2024 · 1 comment

Comments

@Meysam-
Copy link

Meysam- commented Oct 28, 2024

I load a dataset of ~11.3 M points into DuckDB using the following sql:

CREATE TABLE my_points AS
    FROM 'input_data.csv';

INSTALL spatial;
LOAD spatial;

ALTER TABLE my_points ADD COLUMN geom GEOMETRY;
UPDATE my_points SET geom = ST_Point(geo_lon, geo_lat);

To reproduce, you can download the input_data.csv from here.

Then, I set the threads to one and run the following sql:

SET threads TO 1;

EXPLAIN ANALYSE SELECT count(*) FROM my_points WHERE ST_Intersects(geom, ST_GeomFromText('POLYGON ((38.364258 56.992883, 35.81543 55.739482, 37.705078 52.975108, 39.155273 54.622978, 44.494629 53.566414, 43.59375 56.255557, 38.364258 56.992883))'));

here is the query profile:

┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││               Total Time: 1.06s              ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│    UNGROUPED_AGGREGATE    │
│    ────────────────────   │
│        Aggregates:        │
│        count_star()       │
│                           │
│           1 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│ ST_Intersects(geom, '\x02 │
│  \x04\x00\x00\x00\x00\x00 │
│  \x00\x00C\x0FB\x82\xE6SB │
│  \x81\xFA1B\xB7\xF8cB\x02 │
│  \x00\x00\x00\x01\x00\x00 │
│  \x00\x07\x00\x00\x00\x00 │
│ \x00\x00\x007\xA7\x92\x01 │
│  \xA0.C@\x8F\xE0F\xCA\x16 │
│  \x7FL@\xB1\x16\x9F\x02`  │
│  \xE8A@\x87\xFD\x9EX\xA7  │
│  \xDEK@\x86\x90\xF3\xFE?  │
│ \xDAB@\xB3\x08\xC5V\xD0|J@│
│ \xD5yT\xFC\xDF\x93C@R\x10<│
│ \xBE\xBDOK@\x9CS\xC9\x00P │
│ ?F@\x8F\xFF\x02A\x80\xC8J@│
│ \x00\x00\x00\x00\x00\xCCE@│
│ \xC7\xA1~\x17\xB6 L@7\xA7 │
│  \x92\x01\xA0.C@\x8F\xE0F │
│ \xCA\x16\x7FL@'::GEOMETRY)│
│                           │
│        2061759 Rows       │
│          (0.96s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         TABLE_SCAN        │
│    ────────────────────   │
│         my_points         │
│                           │
│     Projections: geom     │
│                           │
│       11358150 Rows       │
│          (0.10s)          │
└───────────────────────────┘

Next I make an RTree index using:

CREATE INDEX my_idx ON my_points USING RTREE (geom);

and I run the same query again. here is the query profile with the index:

┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││               Total Time: 4.69s              ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│    UNGROUPED_AGGREGATE    │
│    ────────────────────   │
│        Aggregates:        │
│        count_star()       │
│                           │
│           1 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│ ST_Intersects(geom, '\x02 │
│  \x04\x00\x00\x00\x00\x00 │
│  \x00\x00C\x0FB\x82\xE6SB │
│  \x81\xFA1B\xB7\xF8cB\x02 │
│  \x00\x00\x00\x01\x00\x00 │
│  \x00\x07\x00\x00\x00\x00 │
│ \x00\x00\x007\xA7\x92\x01 │
│  \xA0.C@\x8F\xE0F\xCA\x16 │
│  \x7FL@\xB1\x16\x9F\x02`  │
│  \xE8A@\x87\xFD\x9EX\xA7  │
│  \xDEK@\x86\x90\xF3\xFE?  │
│ \xDAB@\xB3\x08\xC5V\xD0|J@│
│ \xD5yT\xFC\xDF\x93C@R\x10<│
│ \xBE\xBDOK@\x9CS\xC9\x00P │
│ ?F@\x8F\xFF\x02A\x80\xC8J@│
│ \x00\x00\x00\x00\x00\xCCE@│
│ \xC7\xA1~\x17\xB6 L@7\xA7 │
│  \x92\x01\xA0.C@\x8F\xE0F │
│ \xCA\x16\x7FL@'::GEOMETRY)│
│                           │
│        2061759 Rows       │
│          (0.52s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         TABLE_SCAN        │
│    ────────────────────   │
│   my_points (RTREE INDEX  │
│       SCAN : my_idx)      │
│                           │
│     Projections: geom     │
│                           │
│        2394763 Rows       │
│          (4.17s)          │
└───────────────────────────┘

I am wondering if this a bug in RTree, or I am using the spatial extension wrong.
The other strange thing that I observed is that if I run the same sql when the DuckDB session that is persisted (i.e., DuckDB started with a database file), then the non-indexed query performs the same, but the query that uses the index is even slower. here is a query profile for the same query when the database is persisted:

┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││               Total Time: 7.58s              ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│    UNGROUPED_AGGREGATE    │
│    ────────────────────   │
│        Aggregates:        │
│        count_star()       │
│                           │
│           1 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│ ST_Intersects(geom, '\x02 │
│  \x04\x00\x00\x00\x00\x00 │
│  \x00\x00C\x0FB\x82\xE6SB │
│  \x81\xFA1B\xB7\xF8cB\x02 │
│  \x00\x00\x00\x01\x00\x00 │
│  \x00\x07\x00\x00\x00\x00 │
│ \x00\x00\x007\xA7\x92\x01 │
│  \xA0.C@\x8F\xE0F\xCA\x16 │
│  \x7FL@\xB1\x16\x9F\x02`  │
│  \xE8A@\x87\xFD\x9EX\xA7  │
│  \xDEK@\x86\x90\xF3\xFE?  │
│ \xDAB@\xB3\x08\xC5V\xD0|J@│
│ \xD5yT\xFC\xDF\x93C@R\x10<│
│ \xBE\xBDOK@\x9CS\xC9\x00P │
│ ?F@\x8F\xFF\x02A\x80\xC8J@│
│ \x00\x00\x00\x00\x00\xCCE@│
│ \xC7\xA1~\x17\xB6 L@7\xA7 │
│  \x92\x01\xA0.C@\x8F\xE0F │
│ \xCA\x16\x7FL@'::GEOMETRY)│
│                           │
│        2061759 Rows       │
│          (0.24s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         TABLE_SCAN        │
│    ────────────────────   │
│   my_points (RTREE INDEX  │
│       SCAN : my_idx)      │
│                           │
│     Projections: geom     │
│                           │
│        2394763 Rows       │
│          (7.34s)          │
└───────────────────────────┘

As I have enough RAM that the index can totally fit in the RAM, I don't understand why it's slower when I run it when database is persisted.
If this is a specific scenario where an RTree index is not preferred, I think it might be best to detect it in query planning and don't use an index for that query.

Version of DuckDB: v1.1.2 f680b7d08f

p.s.: I have 128GB of RAM in my computer and I set the threads specifically to 1, otherwise the non-indexed query is even faster cause it processes the data in parallel while the query plan that uses the RTree index is being executed sequentially.

@Maxxen
Copy link
Member

Maxxen commented Nov 22, 2024

Hello!

As I have enough RAM that the index can totally fit in the RAM, I don't understand why it's slower when I run it when database is persisted.

Alright, so the first thing to understand is that DuckDB indexes don't store any actual data themselves. in the case of the R-Tree the index itself only contains bounding boxes and row_id:s which are id's that uniquely identify specific rows in the DuckDB table storage. So whenever an R-Tree scan is performed, DuckDB first has to traverse and recurse down the index (which is currently a single-threaded operation), collect row ids whose bounding box interests the query bounds, and then separately look up the rows in the storage. This second step is effectively random-access, meaning that DuckDB will retrieve rows one-at-a-time using something similar to a binary search, because row-ids are not guaranteed to be ordered.

In comparison, when the index is not used and DuckDB performs a "full table scan", the lookup is just a single phase. There is no "jumping around" needed, because we know we're going to have to scan the whole table we just have to do a single pass across all table pages. It's also very easy to parallelize.

Alright, so why use the index at all? Well, to avoid having to scan as much. there's two scenarios where that's preferable:

  1. Disk/table access is very expensive. Perhaps you're working in a slow disk, or a remotely attached dataset which adds a ton of latency. Or perhaps the table is really long or really wide, such that you end up pulling a ton of data that will get discarded anyway. In the case of spatial, geospatial predicates can't get "pushed down" into the storage like other simple filters can (such as numeric in/equality predicates) and so DuckDB will materialize all the columns for all rows even if they get filtered out.
  2. The filtering operation is very expensive, and it's much faster to use the index to prune out candidate rows instead of checking every single row by brute-force.

Regarding 2. that's usually the case when dealing with geospatial predicates, however, in this case a simple "point in small polygon" is comparatively cheap, and DuckDB is just really good at brute-force computations. So a selectivity of about ~20% (as in your example) might not actually be enough.

For the adaptive-radix-trie we don't recommend creating one unless you have highly selective filters (where you expect to filter out everything except about 0.1% of rows), although Id probably say the R-Tree would still be beneficial for much higher percentages, particularly if you have more complex larger geometries.

As I have enough RAM that the index can totally fit in the RAM, I don't understand why it's slower when I run it when database is persisted.

Even if the whole R-Tree index fits in RAM, the indirection required to look up individual row-ids is still slower on disk than in memory because you may have to load a disk page just to conclude that the row-id is not present and continue on.

If this is a specific scenario where an RTree index is not preferred, I think it might be best to detect it in query planning and don't use an index for that query.

Unfortunately this is not really possible with the current extension capabilities in DuckDB. Extension types can't provide custom statistics (such as spatial extent or distribution) to influence query planning, and deciding when to use an index or not will more or less always come down to heuristics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants