
CodeBERT as base model #32

Closed
wpertsch opened this issue Jan 19, 2022 · 2 comments

wpertsch commented Jan 19, 2022

Hello, we are trying to use ColBERT for code retrieval, so we would like to use a base model other than BERT, namely CodeBERT. By applying the changes in commit hueck/ColBERT@1d268f5 we obtained this ColBERT checkpoint.

Is there a simple way to integrate a checkpoint based on a different architecture? We think this would be a useful feature and could improve the model's performance.
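
For reference, here is a minimal, self-contained sketch of the kind of change we mean (illustrative only, not the actual commit; MiniColBERTEncoder is a made-up name). CodeBERT uses the RoBERTa architecture, so the encoder is built from that checkpoint while keeping ColBERT's linear projection and per-token normalization:

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MiniColBERTEncoder(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, base="microsoft/codebert-base", dim=128):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)  # RoBERTa-architecture base
        self.linear = nn.Linear(self.encoder.config.hidden_size, dim, bias=False)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        emb = self.linear(hidden)  # project to ColBERT's low-dimensional token space
        return torch.nn.functional.normalize(emb, p=2, dim=2)  # L2-normalize each token

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = MiniColBERTEncoder()
batch = tokenizer(["def add(a, b): return a + b"], return_tensors="pt")
print(model(batch["input_ids"], batch["attention_mask"]).shape)  # (1, seq_len, 128)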

I tried to customize pyterrier_colbert myself, but after fixing minor problems I ran into the following error, which I assume is not related to the custom checkpoint.

TypeError                                 Traceback (most recent call last)
<ipython-input-8-ae1901375f17> in <module>()
      7 gen = pt.index.treccollection2textgen(files)
      8 
----> 9 indexer.index(gen)


/usr/local/lib/python3.7/dist-packages/pyterrier_colbert/indexing.py in index(self, iterator)
    326         create_directory(self.args.index_root)
    327         create_directory(self.args.index_path)
--> 328         ceg.encode()
    329         self.colbert = ceg.colbert
    330         self.checkpoint = ceg.checkpoint

/usr/local/lib/python3.7/dist-packages/pyterrier_colbert/indexing.py in encode(self)
    401             t1 = time.time()
    402             batch = self._preprocess_batch(offset, lines)
--> 403             embs, doclens, ids = self._encode_batch(batch_idx, batch)
    404             if DEBUG:
    405                 assert sum(doclens) == len(ids), (batch_idx, len(doclens), len(ids))

/usr/local/lib/python3.7/dist-packages/pyterrier_colbert/indexing.py in _encode_batch(self, batch_idx, batch)
    352     def _encode_batch(self, batch_idx, batch):
    353         with torch.no_grad():
--> 354             embs, ids = self.inference.docFromText(batch, bsize=self.args.bsize, keep_dims=False, with_ids=True)
    355             assert type(embs) is list
    356             assert len(embs) == len(batch)

TypeError: docFromText() got an unexpected keyword argument 'with_ids'

Is this related to stanford-futuredata/ColBERT#30? It seems that pyterrier_colbert assumes this pull request has been merged.
@cmacdonald, could you explain the reasoning behind this pull request? We don't want to mask punctuation; is there a way to bypass it?

To reproduce the error, you can use this Colab notebook. Note that it uses the forked ColBERT and pyterrier_colbert versions.

Thank you for your help!

cmacdonald (Collaborator) commented Jan 20, 2022

Our ColBERT (https://github.com/cmacdonald/colbert/tree/v0.2) has diverged a bit from upstream. The with_ids support returns the token ids, which we have found useful to record for things like query embedding pruning or PRF. It's not about masking punctuation per se. Given the multitude of changes upstream, I haven't looked at merging back.

ColBERTIndexer has an ids=True kwarg - you can avoid the error by setting ids=False.

To load your model into ColBERTFactory, you need a bit more fiddling.
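
Roughly something like this (a sketch following the pyterrier_colbert README conventions; argument names may differ in the fork, and the paths are placeholders):

from pyterrier_colbert.ranking import ColBERTFactory

# Point the factory at the custom checkpoint and the index built from it.
factory = ColBERTFactory("/path/to/your/checkpoint",
                         "/path/to/index_root",
                         "index_name")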

This change may also be useful: https://github.com/cmacdonald/ColBERT/blob/models/colbert/modeling/colbert.py

wpertsch (Author) commented Feb 6, 2022

Thank you for the help @cmacdonald!
We managed to use CodeBERT within pyterrier_colbert. Our installation looked like this:

pip install python-terrier
pip install wheel
pip install faiss-gpu==1.6.3
pip install --upgrade git+https://github.com/wpertsch/pyterrier_colbert.git
pip uninstall -y ColBERT
pip install -q git+https://github.com/hueck/ColBERT.git

The forked ColBERT is customized to use CodeBERT. The forked pyterrier_colbert is customized to pop "bert.embeddings.position_ids" from the checkpoint's state dict before loading.
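
For anyone hitting the same problem, the pop amounts to roughly this (a sketch of the idea only; the exact checkpoint keys depend on how the model was saved):

import torch

# Newer transformers versions register "position_ids" as a buffer, which the
# older ColBERT loading code does not expect, so drop it before loading.
checkpoint = torch.load("/path/to/CODEBert-checkpoint", map_location="cpu")
state_dict = checkpoint["model_state_dict"]  # assumed key; depends on how the checkpoint was saved
state_dict.pop("bert.embeddings.position_ids", None)
colbert.load_state_dict(state_dict)  # colbert: the instantiated ColBERT model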

The code we used for indexing looked like this:

import faiss
assert faiss.get_num_gpus() > 0

import pyterrier as pt
pt.init()

checkpoint="/path/to/CODEBert-checkpoint"

from pyterrier_colbert.indexing import ColBERTIndexer

indexer = ColBERTIndexer(checkpoint, "/home/exampleforgithub/indextest", "colbert_smallindex", chunksize=3, ids=False)
files = pt.io.find_files("/home/exampleforgithub/data/small")
gen = pt.index.treccollection2textgen(files)
indexer.index(gen)
print("well done index")
