pgvector-remote is a fork of pgvector which combines the simplicity of pgvector with the power of remote vector databases, by introducing a new remote vector index type. Currently, pgvector-remote only supports pinecone , but we plan to support other vendors in the future.
- Short Version
- Use Cases
- Installation
- Configuration
- Index Creation
- Performance Considerations
- Docker
- Credits
Benchmarks for a 10M filtered search workload. https://big-ann-benchmarks.com/neurips23.html#tracks. Results for pgvector are shown for a tuned hnsw index on a t2.2xlarge (32GB RAM). Results for pinecone are for a p2.x8 pod.
CREATE TABLE products (name text, embedding vector(1536), price float);
CREATE INDEX my_remote_index ON products USING pinecone (embedding, price) with (host = 'my-pinecone-index.pinecone.io');
-- [insert, update, and delete billions of records in products]
SELECT * FROM products WHERE price < 40.0 ORDER BY embedding <-> '[...]' LIMIT 10; -- pinecone performs this query, including the price predicate
- Vector databases like pinecone aren't docstores (and they shouldn't try to be). That means your document and its embedding live in separate databases. pgvector-remote lets you keep your metadata in postgres and your embeddings in pinecone, while hiding this complexity from the user by presenting a unified sql interface to creating, querying, and updating pinecone indexes.
- Control your data. Using pgvector-remote means that all your vectors are in postgres. This makes it easy to test out a different index type (like hnsw) and drop pinecone in favor of a different vendor.
- Scalability: Pinecone is designed to scale to billions of vectors. pgvector does not easily accomodate such large datasets. Large vector indexes are incredibly highly memory intensive and therefore it makes sense to separate this from the main database. For example indexing 200M vectors of 1536 dimensions would require 1.2TB of memory.
- Seamless integration: You don't need to write a line of pinecone application logic. Use a unified sql interface to leverage pinecone as if it were any other postgres index type.
- Synchronization: pgvector-remote ensures that the data in pinecone and postgres are always in sync. For example, if your postgres transaction rolls back you don't need to worry about cleaning up the data in pinecone.
Why is this integration better than confluent's kafka-connect?
- Liveness and correctness: pgvector-remote sends inserted vectors to pinecone in batches and locally scans unflushed records, guaranteeing that all data is always visible to index queries.
- Query and integration logic: traditional ETL won't help you write queries like the one above. pgvector-remote translates select predicates to pinecone filters.
- Small datasets: If you have a small to medium dataset (10M vectors at 768 dimensions), you can use pgvector without a remote vector store. The local hnsw indexes will be sufficient.
- Minimal metadata: You aren't performing metadata filtering. Currently, pgvector does not handle metadata filtering, meaning that queries like the one above can sometimes be inefficient and inaccurate.
Install libcurl headers. For example,
sudo apt-get install libcurl4-openssl-dev
Then follow the installation instructions for pgvector, using the feature/remote_indexes
of this repository.
Set the pinecone API key in the postgres configuration. For example,
ALTER DATABASE mydb SET pinecone.api_key = 'xxxxxxxx-xxxx-xxxx-xxxx–xxxxxxxxxxxx';
There are two ways to specify the pinecone index:
- By providing the host of an existing pinecone index. For example,
CREATE INDEX my_remote_index ON products USING pinecone (embedding) with (host = 'example-23kshha.svc.us-east-1-aws.pinecone.io');
- By specifying the
spec
of the pinecone index. For example,
CREATE INDEX my_remote_index ON products USING pinecone (embedding) with (spec = '"spec": {
"serverless": {
"region": "us-west-2",
"cloud": "aws"
}
}');
All spec options can be found here
- Place your pinecone index in the same region as your postgres instance to minimize latency.
- Make use of connection pooling to run queries in postgres concurrently. For example, use
asyncpg
in python. - Records are sent to the remote index in batches. Therefore pgvector-remote performs a local scan of the unflushed records before every query. To disable this set
pinecone.max_buffer_scan
to 0. For example,
ALTER DATABASE mydb SET pinecone.max_buffer_scan = 0;
- You can adjust the number of vectors sent in each request and the number of concurrent requests per batch using
pinecone.vectors_per_request
andpinecone.requests_per_batch
respectively. For example,
ALTER DATABASE mydb SET pinecone.vectors_per_request = 100; --default
ALTER DATABASE mydb SET pinecone.requests_per_batch = 40; --default
- You can control the number of results returned by pinecone using
pinecone.top_k
. Lowering this parameter can decrease latencies, but keep in mind that setting this too low could cause fewer results to be returned than expected.
An example docker image can be obtained with,
docker pull kslohith17/pgvector-remote:latest
This contains postgres along with pgvector-remote configured to run on it.
We give special thanks to these projects, which enabled us to develop our extension:
- pgvector: Open-source vector similarity search for Postgres
- PASE: PostgreSQL Ultra-High-Dimensional Approximate Nearest Neighbor Search Extension
- Faiss: A Library for Efficient Similarity Search and Clustering of Dense Vectors
- Using the Triangle Inequality to Accelerate k-means
- k-means++: The Advantage of Careful Seeding
- Concept Decompositions for Large Sparse Text Data using Clustering
- Efficient and Robust Approximate Nearest Neighbor Search using Hierarchical Navigable Small World Graphs
- Pinecone: Vector database and search service designed for real-time applications