Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: integrate with pgvector #1153

Merged
merged 9 commits into from
Sep 22, 2023
Merged

feat: integrate with pgvector #1153

merged 9 commits into from
Sep 22, 2023

Conversation

jiashenC
Copy link
Member

@jiashenC jiashenC commented Sep 17, 2023

I will need some feedback on the design here. @xzdandy @gaurav274

This PR reflects my initial design for in-database index features like pgvector. My initial idea is to offload as much as possible to the native database. When we create an index and do the index scan, we simply push an emulated query to the underlying database. While this is doable, this will introduce separate implementation paths in different components including the optimizer and the executor.

Other than the current design, another option is to reuse the third-party vector integration interface. There are also a few details that I am not clear about

  • For other vector libraries, we keep an index entry in our own catalog. Do we still maintain that even for pgvector? In the original implementation, the index catalog entry is linked to a table catalog entry. Because in this case, the table is inside Postgres, we also need to change the implementation here a little bit.
  • (Issue related to performance). Following the current create index implementation, data will be fetched from Postgres first but it is not really needed. The create index will anyway run inside Postgres.
  • The current vector index scan is implemented based on _row_id. When it is scanning data from Postgres, we need to figure out a way to populate _row_id for the native database engine.

@xzdandy
Copy link
Collaborator

xzdandy commented Sep 17, 2023

If the native database supports vector indexing, I think we should push down it to the native database system. This can be an optimizer rule. If the native database does not support vector indexing, we will do it ourselves.

@gaurav274
Copy link
Member

Why are we not pushing the CREATE INDEX query to Postgres?

@xzdandy xzdandy added Feature Request ✨ New feature or request High Effort 🏋 Difficult solution or problem to solve Work In Progress 🚧 labels Sep 18, 2023
@gaurav274
Copy link
Member

Based on offline discussion closing this PR

@gaurav274 gaurav274 closed this Sep 19, 2023
@jiashenC jiashenC reopened this Sep 19, 2023
@jiashenC
Copy link
Member Author

jiashenC commented Sep 19, 2023

I reopen this PR because the index scan pass is also implemented in this branch.

This PR adds a feature to allow users to create an index on a table using pgvector and also do a similarity search using the existing pgvector index.

Create index query.

CREATE INDEX test_index ON test_data_source.test_vector (embedding) 
    USING PGVECTOR

When users attempt to create an index in pgvector, it is internally translated to a native Postgres query to push down the create index query.

Index scan.

SELECT idx, embedding FROM test_data_source.test_vector 
    ORDER BY Similarity(DummyFeatureExtractor(Open(...)), embedding)
    LIMIT 1

I take some shortcuts in the optimizer that if the data source is from Postgres, translate the similarity query to the semantically equivalent similarity query that works for Postgres and push down the query to Postgres. That is implemented as part of the VectorIndexScanExecutor.

@jiashenC jiashenC requested a review from jarulraj September 19, 2023 01:47
@jiashenC jiashenC added this to the v0.3.5 milestone Sep 19, 2023
db_catalog_entry.engine, **db_catalog_entry.params
) as handler:
columns = table.table_obj.columns
# As other libraries, we default to HNSW and L2 distance.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does other libraries mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant other vector store types that we currently support (e.g., FAISS).

), "Index can only be created on an existing table"

# Vector type specific check.
catalog = self._catalog()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to move it outside

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gaurav274 Added some changes. Can you give it a read? Not sure shall we still stick with the singledispatchmethod approach or do subclass inheritance and add some if statement like the Executor.

@jiashenC jiashenC marked this pull request as ready for review September 19, 2023 15:12
@xzdandy xzdandy modified the milestones: v0.3.5, v0.3.6 Sep 20, 2023
@xzdandy
Copy link
Collaborator

xzdandy commented Sep 20, 2023

Overall design looks good to me. We may consider the gain/loss of moving some of the push down to optimizer. But I don't think that is the priority now. Left some comments for clarification.

@jiashenC jiashenC requested a review from gaurav274 September 20, 2023 19:16
@jiashenC jiashenC merged commit 0844f48 into staging Sep 22, 2023
@jiashenC jiashenC deleted the pg-vector branch September 22, 2023 00:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature Request ✨ New feature or request High Effort 🏋 Difficult solution or problem to solve
Projects
Development

Successfully merging this pull request may close these issues.

4 participants