feat: integrate with pgvector #1153

jiashenC · 2023-09-17T18:17:40Z

I will need some feedback on the design here. @xzdandy @gaurav274

This PR reflects my initial design for in-database index features like pgvector. My initial idea is to offload as much as possible to the native database. When we create an index or run an index scan, we simply push an equivalent query to the underlying database. While this is doable, it introduces separate implementation paths in several components, including the optimizer and the executor.

Other than the current design, another option is to reuse the third-party vector integration interface. There are also a few details that I am not clear about:

1. For other vector libraries, we keep an index entry in our own catalog. Do we still maintain that for pgvector? In the original implementation, the index catalog entry is linked to a table catalog entry. Because the table lives inside Postgres in this case, we would also need to change that implementation slightly.
2. (Performance issue.) Following the current create index implementation, data is first fetched from Postgres, even though it is not needed: the index creation runs inside Postgres anyway.
3. ~~The current vector index scan is implemented based on _row_id. When it is scanning data from Postgres, we need to figure out a way to populate _row_id for the native database engine.~~

xzdandy · 2023-09-17T19:32:28Z

If the native database supports vector indexing, I think we should push it down to the native database system. This can be an optimizer rule. If the native database does not support vector indexing, we will do it ourselves.
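The decision described above could be sketched roughly as follows. This is only an illustration: `IndexRequest`, `NATIVE_VECTOR_ENGINES`, and `choose_index_strategy` are hypothetical names that do not come from this PR, and the set of engines with native vector indexing is an assumption.

```python
from dataclasses import dataclass

# Assumption: engines whose native vector indexing can be used for pushdown.
NATIVE_VECTOR_ENGINES = {"postgres"}


@dataclass
class IndexRequest:
    """Hypothetical description of a CREATE INDEX request."""

    engine: str  # engine that owns the table, e.g. "postgres"
    table: str
    column: str


def choose_index_strategy(request: IndexRequest) -> str:
    """Return "pushdown" if the owning engine can index vectors itself."""
    if request.engine.lower() in NATIVE_VECTOR_ENGINES:
        return "pushdown"
    # Otherwise fall back to building and scanning the index ourselves.
    return "local"


print(choose_index_strategy(IndexRequest("postgres", "test_vector", "embedding")))
print(choose_index_strategy(IndexRequest("sqlite", "test_vector", "embedding")))
```

In an optimizer, a check like this would presumably sit in the rule's match condition, so that non-Postgres tables keep going through the existing index path.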

gaurav274 · 2023-09-17T21:05:09Z

Why are we not pushing the CREATE INDEX query to Postgres?

gaurav274 · 2023-09-19T01:09:37Z

Closing this PR based on an offline discussion.

jiashenC · 2023-09-19T01:44:22Z

I reopened this PR because the index scan pass is also implemented in this branch.

This PR adds a feature to allow users to create an index on a table using pgvector and also do a similarity search using the existing pgvector index.

Create index query:

CREATE INDEX test_index ON test_data_source.test_vector (embedding) 
    USING PGVECTOR

When a user creates a pgvector index, the statement is translated internally into a native Postgres CREATE INDEX query, which is then pushed down to Postgres.
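As a rough illustration of this translation, the sketch below builds a native pgvector statement from the parts of the query above. The helper name `to_pgvector_create_index` and its parameters are hypothetical and not taken from this PR. The HNSW index type and the L2 operator class (`vector_l2_ops`) follow the defaults mentioned in the review comments, and `USING hnsw (...)` is pgvector's own index syntax.

```python
def to_pgvector_create_index(index_name: str, table: str, column: str) -> str:
    """Build a native Postgres CREATE INDEX statement for pgvector.

    Hypothetical helper: defaults to an HNSW index with L2 distance.
    """

    def quote(identifier: str) -> str:
        # Double-quote identifiers and escape embedded quotes.
        return '"' + identifier.replace('"', '""') + '"'

    return (
        f"CREATE INDEX {quote(index_name)} ON {quote(table)} "
        f"USING hnsw ({quote(column)} vector_l2_ops)"
    )


print(to_pgvector_create_index("test_index", "test_vector", "embedding"))
```

The data source prefix (`test_data_source`) is dropped here on the assumption that it only selects the Postgres connection and is not part of the table name inside Postgres.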

Index scan:

SELECT idx, embedding FROM test_data_source.test_vector 
    ORDER BY Similarity(DummyFeatureExtractor(Open(...)), embedding)
    LIMIT 1

I take a shortcut in the optimizer: if the data source is Postgres, the similarity query is translated into a semantically equivalent similarity query that Postgres understands and is pushed down to Postgres. This is implemented as part of the VectorIndexScanExecutor.
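A minimal sketch of what such a translated query could look like is shown below. It assumes the feature vector has already been computed before the query is sent to Postgres (that is, the `DummyFeatureExtractor(Open(...))` part has been evaluated), and that L2 distance is used, for which pgvector provides the `<->` operator. The function name and parameters are hypothetical and not taken from this PR.

```python
from typing import Sequence


def to_pgvector_similarity_query(
    table: str,
    select_columns: Sequence[str],
    vector_column: str,
    query_vector: Sequence[float],
    limit: int,
) -> str:
    """Build a pgvector nearest-neighbour query (hypothetical helper)."""
    # pgvector accepts vector literals of the form '[1.0,2.0,3.0]'.
    vector_literal = "[" + ",".join(str(float(x)) for x in query_vector) + "]"
    columns = ", ".join(select_columns)
    # "<->" is pgvector's L2 distance operator; ordering by it returns
    # the closest rows first. Identifiers are not quoted in this sketch.
    return (
        f"SELECT {columns} FROM {table} "
        f"ORDER BY {vector_column} <-> '{vector_literal}' "
        f"LIMIT {int(limit)}"
    )


print(
    to_pgvector_similarity_query(
        "test_vector", ["idx", "embedding"], "embedding", [1, 2.5, 0], 1
    )
)
```

Real code would normally pass the vector as a bound parameter through the database driver instead of formatting it into the SQL string.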

jarulraj · 2023-09-19T04:32:52Z

evadb/executor/create_index_executor.py

+            db_catalog_entry.engine, **db_catalog_entry.params
+        ) as handler:
+            columns = table.table_obj.columns
+            # As other libraries, we default to HNSW and L2 distance.


What does "other libraries" mean?

jiashenC · 2023-09-19

I meant the other vector store types that we currently support (e.g., FAISS).

gaurav274 · 2023-09-19T07:48:55Z

evadb/binder/statement_binder.py

+        ), "Index can only be created on an existing table"
+
+        # Vector type specific check.
+        catalog = self._catalog()


Reminder to move it outside

jiashenC · 2023-09-20

@gaurav274 I added some changes. Can you give them a read? I am not sure whether we should stick with the singledispatchmethod approach or switch to subclass inheritance with a few if statements, as the Executor does.
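For readers unfamiliar with the first option, the sketch below shows the general shape of `functools.singledispatchmethod` dispatch with an engine-specific branch inside one handler. The class names, the `vector_store_type` attribute, and the returned strings are all hypothetical and only illustrate the pattern; they are not the binder code in this PR.

```python
from dataclasses import dataclass
from functools import singledispatchmethod


@dataclass
class CreateIndexStatement:
    vector_store_type: str  # e.g. "FAISS" or "PGVECTOR"


@dataclass
class SelectStatement:
    table: str


class StatementBinder:
    """Hypothetical binder that dispatches on the statement type."""

    @singledispatchmethod
    def bind(self, node):
        raise NotImplementedError(f"Cannot bind {type(node).__name__}")

    @bind.register
    def _bind_create_index(self, node: CreateIndexStatement) -> str:
        # Vector-store-specific check kept inside the handler.
        if node.vector_store_type == "PGVECTOR":
            return "bind native index"
        return "bind local index"

    @bind.register
    def _bind_select(self, node: SelectStatement) -> str:
        return f"bind select on {node.table}"


binder = StatementBinder()
print(binder.bind(CreateIndexStatement("PGVECTOR")))
print(binder.bind(SelectStatement("test_vector")))
```

With subclass inheritance, the `if` branch would instead move into an overridden method on a binder subclass, so the trade-off is mostly about where the engine-specific logic lives.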

evadb/plan_nodes/abstract_plan.py

test/third_party_tests/test_native_similarity_index.py

evadb/optimizer/rules/rules.py

evadb/executor/create_index_executor.py

xzdandy · 2023-09-20T06:30:43Z

The overall design looks good to me. We may want to weigh the trade-offs of moving some of the pushdown logic into the optimizer, but I don't think that is a priority right now. I left some comments for clarification.

jiashenC requested review from xzdandy and gaurav274 September 17, 2023 18:50

xzdandy assigned jiashenC Sep 18, 2023

xzdandy added the Feature Request ✨, High Effort 🏋, and Work In Progress 🚧 labels Sep 18, 2023

gaurav274 closed this Sep 19, 2023

jiashenC reopened this Sep 19, 2023

jiashenC requested a review from jarulraj September 19, 2023 01:47

jiashenC added this to the v0.3.5 milestone Sep 19, 2023

jarulraj approved these changes Sep 19, 2023

gaurav274 reviewed Sep 19, 2023

jiashenC marked this pull request as ready for review September 19, 2023 15:12

jiashenC removed the Work In Progress 🚧 label Sep 19, 2023

xzdandy modified the milestones: v0.3.5, v0.3.6 Sep 20, 2023

xzdandy reviewed Sep 20, 2023

evadb/plan_nodes/abstract_plan.py (resolved)

xzdandy reviewed Sep 20, 2023

test/third_party_tests/test_native_similarity_index.py (outdated, resolved)

xzdandy reviewed Sep 20, 2023

evadb/optimizer/rules/rules.py (resolved)

xzdandy reviewed Sep 20, 2023

evadb/executor/create_index_executor.py (resolved)

jiashenC requested a review from gaurav274 September 20, 2023 19:16

jiashenC added 4 commits September 21, 2023 20:18

create pgvector index on table

fae27d0

add create index test case

9d8869f

fix lint error

8c41250

vector index scan

d5c54f7

jiashenC added 5 commits September 21, 2023 20:18

add index scan for native pgvector

ecc8939

fix lint issue

4131125

clean up test case

0951e22

separate binder for index

35640f8

fix lint

d7c2f14

jiashenC force-pushed the pg-vector branch from ed67884 to d7c2f14 Compare September 22, 2023 00:28

jiashenC merged commit 0844f48 into staging Sep 22, 2023

jiashenC deleted the pg-vector branch September 22, 2023 00:42


