Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactored colbert codebase #407

Merged
merged 15 commits into from
May 10, 2024
Merged

refactored colbert codebase #407

merged 15 commits into from
May 10, 2024

Conversation

epinzur
Copy link
Collaborator

@epinzur epinzur commented May 7, 2024

  • dropped multiprocess stuff after seeing in tests that it runs slower than the single process code.
  • split the BaseVectorStore abstract class into BaseVectorStore and BaseDatabase abstract classes
    • BaseDatabase contains the interface for interacting with the database to manage CRUD and do the proper queries for ColBERT retrieval.
    • BaseVectorStore contains the interface for creating a LLamaIndex or LangChain VectorStore
  • Created a CassandraDatabase implementation for the BaseDatabase abstract class
  • Created a ColbertVectorStore implementation for the BaseVectorStore abstract class
  • Created a ragstack-langchain.colbert ColbertVectorStore class that follows the standard langchain vector-store patterns
  • Created a ragstack-llamaindex.colbert ColbertVectorStore class that follows the standard llamaindex vector-store patterns
  • Renamed ragstack-langchain.colbert ColbertLCRetriever to ColbertRetriever and updated methods to match the standard langchain retrieval patterns
  • Renamed ragstack-llamaindex.colbert ColbertLIRetriever to ColbertRetriever and updated methods to match the standard llamaindex retrieval patterns
  • Added async methods for a few of the classes. More to come in a future PR.
  • Updated the llamaindex and langchain integration tests to more closely follow standard ways of doing RAG ingest and retrieval for those packages.
  • Dropped internal use of nest_asyncio. If running in a Jupyter environment, the user should run nest_asyncio.apply() there. It should NOT be in our package.be
  • Dropped all my different helper objects (BaseChunk, DataChunk, RetrievedChunk, EmbeddedChunk, etc...) in favor of a single Chunk object.
  • Created a helper class for sharing test data

Probably more stuff that I'm forgetting...

There is still some more stuff to implement on the llamaindex side, but I'd love to get a release out that is working for langchain users.

Also I'll add more robust testing later.

@mlr
Copy link
Collaborator

mlr commented May 9, 2024

Antora site build successful! ✅
Deploying draft.
Deployment successful! View draft

@epinzur epinzur force-pushed the colbert-cpu branch 2 times, most recently from 2221393 to 4cde7a2 Compare May 9, 2024 17:06
@epinzur epinzur changed the title DRAFT: cleaned up colbert embedding execution refactored colbert codebase May 9, 2024
@epinzur epinzur requested review from nicoloboschi and zzzming May 9, 2024 22:32
@zzzming zzzming requested review from zzzming and removed request for zzzming May 10, 2024 19:53
@zzzming
Copy link
Member

zzzming commented May 10, 2024

LGTM

@epinzur epinzur merged commit d7bca7f into main May 10, 2024
11 of 13 checks passed
@epinzur epinzur deleted the colbert-cpu branch May 10, 2024 19:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants