refactored colbert codebase #407

epinzur · 2024-05-07T21:44:20Z

Dropped the multiprocessing code after tests showed that it runs slower than the single-process code.
Split the BaseVectorStore abstract class into BaseVectorStore and BaseDatabase abstract classes:
- BaseDatabase contains the interface for interacting with the database: CRUD operations and the queries needed for ColBERT retrieval.
- BaseVectorStore contains the interface for creating a LlamaIndex or LangChain VectorStore.
Created a CassandraDatabase implementation for the BaseDatabase abstract class
Created a ColbertVectorStore implementation for the BaseVectorStore abstract class
Created a ragstack-langchain.colbert ColbertVectorStore class that follows the standard langchain vector-store patterns
Created a ragstack-llamaindex.colbert ColbertVectorStore class that follows the standard llamaindex vector-store patterns
Renamed ragstack-langchain.colbert ColbertLCRetriever to ColbertRetriever and updated methods to match the standard langchain retrieval patterns
Renamed ragstack-llamaindex.colbert ColbertLIRetriever to ColbertRetriever and updated methods to match the standard llamaindex retrieval patterns
Added async methods for a few of the classes. More to come in a future PR.
Updated the llamaindex and langchain integration tests to more closely follow standard ways of doing RAG ingest and retrieval for those packages.
Dropped internal use of nest_asyncio. If running in a Jupyter environment, the user should run nest_asyncio.apply() there. It should NOT be in our package.
Dropped all my different helper objects (BaseChunk, DataChunk, RetrievedChunk, EmbeddedChunk, etc.) in favor of a single Chunk object.
Created a helper class for sharing test data
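
For reviewers, the BaseDatabase / BaseVectorStore split described above can be pictured roughly as follows. This is a minimal sketch and not the code in this PR: every method name and signature (add_chunks, delete_chunks, get_chunk, add_texts), the fields on Chunk, and the InMemoryDatabase and SimpleVectorStore classes are hypothetical and only illustrate the separation of concerns.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Chunk:
    """One chunk type in place of several specialised ones (fields are assumed)."""
    doc_id: str
    chunk_id: int
    text: str
    # One vector per token, as in ColBERT's multi-vector representation.
    embedding: List[List[float]] = field(default_factory=list)


class BaseDatabase(ABC):
    """Storage side: CRUD and lookups (hypothetical method names)."""

    @abstractmethod
    def add_chunks(self, chunks: List[Chunk]) -> None: ...

    @abstractmethod
    def delete_chunks(self, doc_ids: List[str]) -> int: ...

    @abstractmethod
    def get_chunk(self, doc_id: str, chunk_id: int) -> Chunk: ...


class BaseVectorStore(ABC):
    """Framework side: what a LangChain or LlamaIndex wrapper would call."""

    @abstractmethod
    def add_texts(self, texts: List[str], doc_id: str) -> List[Tuple[str, int]]: ...


class InMemoryDatabase(BaseDatabase):
    """Toy stand-in for a real backend such as CassandraDatabase."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, int], Chunk] = {}

    def add_chunks(self, chunks: List[Chunk]) -> None:
        for chunk in chunks:
            self._rows[(chunk.doc_id, chunk.chunk_id)] = chunk

    def delete_chunks(self, doc_ids: List[str]) -> int:
        keys = [key for key in self._rows if key[0] in doc_ids]
        for key in keys:
            del self._rows[key]
        return len(keys)  # number of chunks removed

    def get_chunk(self, doc_id: str, chunk_id: int) -> Chunk:
        return self._rows[(doc_id, chunk_id)]


class SimpleVectorStore(BaseVectorStore):
    """Depends only on the BaseDatabase interface, never on a concrete backend."""

    def __init__(self, database: BaseDatabase) -> None:
        self._database = database

    def add_texts(self, texts: List[str], doc_id: str) -> List[Tuple[str, int]]:
        chunks = [Chunk(doc_id=doc_id, chunk_id=i, text=t) for i, t in enumerate(texts)]
        self._database.add_chunks(chunks)
        return [(c.doc_id, c.chunk_id) for c in chunks]
```

Because the store only sees the abstract interface, a different backend can be swapped in without touching the LangChain or LlamaIndex wrappers, which is the point of the split.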

Probably more stuff that I'm forgetting...

There is still some more stuff to implement on the llamaindex side, but I'd love to get a release out that is working for langchain users.

Also I'll add more robust testing later.
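
On the nest_asyncio change listed above: the reason a notebook user may need it is that a Jupyter cell already runs inside an event loop, and plain asyncio refuses to start a second one. The sketch below reproduces that refusal with the standard library only. embed_query is a hypothetical stand-in for an async method and is not part of this package; the nest_asyncio call appears only in a comment because it belongs in the user's notebook.

```python
import asyncio


async def embed_query(text: str) -> int:
    # Hypothetical stand-in for an async method exposed by a library.
    await asyncio.sleep(0)
    return len(text)


async def notebook_cell() -> str:
    # Simulates code running inside an already-active event loop,
    # which is the situation in a Jupyter cell.
    coro = embed_query("hello")
    try:
        asyncio.run(coro)  # nested run: rejected by stock asyncio
    except RuntimeError:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return "nested run rejected"
    return "nested run allowed"


# In a notebook, the user opts in once, outside the library:
#   import nest_asyncio
#   nest_asyncio.apply()

if __name__ == "__main__":
    print(asyncio.run(notebook_cell()))
```

Keeping the patch in user code means the package never alters the event loop of applications that do not need it.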

mlr · 2024-05-09T00:26:55Z

Antora site build successful! ✅
Deploying draft.
Deployment successful! View draft

zzzming · 2024-05-10T19:54:29Z

LGTM

epinzur added 2 commits May 6, 2024 14:52

cleaned up embedding code

0febac9

more updates

83b1681

epinzur added the DO NOT MERGE label May 7, 2024

epinzur added 4 commits May 7, 2024 17:02

dropped multiprocessing code

67918f3

fixed unit tests

1896ace

major refactor for more langchain and llamaindex support

b9862c8

progress on llamaindex and langchain stuff

b8997ad

epinzur force-pushed the colbert-cpu branch 2 times, most recently from 2221393 to 4cde7a2 Compare May 9, 2024 17:06

progress on tests

1049435

epinzur force-pushed the colbert-cpu branch from 4cde7a2 to 1049435 Compare May 9, 2024 17:10

epinzur added 3 commits May 9, 2024 14:51

fixes found from testing

0e4d314

fixed a bug in embedding

fa361ed

more fixes and formatting

e9d78d4

epinzur force-pushed the colbert-cpu branch from d85a794 to e9d78d4 Compare May 9, 2024 22:06

epinzur changed the title from "DRAFT: cleaned up colbert embedding execution" to "refactored colbert codebase" May 9, 2024

epinzur added 2 commits May 9, 2024 17:28

revert baseline_tensors.py formatting

c8ee70e

removed test file

f766242

epinzur requested review from nicoloboschi and zzzming May 9, 2024 22:32

epinzur commented May 9, 2024

libs/langchain/tests/integration_tests/test_colbert.py (resolved thread)

some renaming

45f02e3

epinzur force-pushed the colbert-cpu branch from d894e1d to 45f02e3 Compare May 10, 2024 14:25

minor embed update

1fdeeca

zzzming reviewed May 10, 2024

libs/colbert/ragstack_colbert/colbert_embedding_model.py (3 resolved threads)

llamaindex cleanup

1cfd128

zzzming requested review from zzzming and removed request for zzzming May 10, 2024 19:53

epinzur merged commit d7bca7f into main May 10, 2024
11 of 13 checks passed

epinzur deleted the colbert-cpu branch May 10, 2024 19:58
