
CodeBERT as base model #32

Closed
wpertsch opened this issue Jan 19, 2022 · 2 comments

wpertsch commented Jan 19, 2022

Hello, we are trying to use ColBERT for code retrieval, so we would like to use a base model other than BERT, namely CodeBERT. By applying the changes in commit hueck/ColBERT@1d268f5 we obtained this ColBERT checkpoint.

Is there a simple way to integrate a checkpoint based on a different architecture? We think this would be a useful feature and could improve the model's performance.
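
For reference, here is a minimal, self-contained sketch of the kind of change we mean (illustrative only, not the actual commit; MiniColBERTEncoder is a made-up name). CodeBERT uses the RoBERTa architecture, so the encoder is built from that checkpoint while keeping ColBERT's linear projection and per-token normalization:

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MiniColBERTEncoder(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, base="microsoft/codebert-base", dim=128):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)  # RoBERTa-architecture base
        self.linear = nn.Linear(self.encoder.config.hidden_size, dim, bias=False)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        emb = self.linear(hidden)  # project to ColBERT's low-dimensional token space
        return torch.nn.functional.normalize(emb, p=2, dim=2)  # L2-normalize each token

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = MiniColBERTEncoder()
batch = tokenizer(["def add(a, b): return a + b"], return_tensors="pt")
print(model(batch["input_ids"], batch["attention_mask"]).shape)  # (1, seq_len, 128)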

I tried to customize pyterrier_colbert myself, but after fixing minor problems I ran into the following error, which I assume is not related to the custom checkpoint.

TypeError                                 Traceback (most recent call last)
<ipython-input-8-ae1901375f17> in <module>()
      7 gen = pt.index.treccollection2textgen(files)
      8 
----> 9 indexer.index(gen)


/usr/local/lib/python3.7/dist-packages/pyterrier_colbert/indexing.py in index(self, iterator)
    326         create_directory(self.args.index_root)
    327         create_directory(self.args.index_path)
--> 328         ceg.encode()
    329         self.colbert = ceg.colbert
    330         self.checkpoint = ceg.checkpoint

/usr/local/lib/python3.7/dist-packages/pyterrier_colbert/indexing.py in encode(self)
    401             t1 = time.time()
    402             batch = self._preprocess_batch(offset, lines)
--> 403             embs, doclens, ids = self._encode_batch(batch_idx, batch)
    404             if DEBUG:
    405                 assert sum(doclens) == len(ids), (batch_idx, len(doclens), len(ids))

/usr/local/lib/python3.7/dist-packages/pyterrier_colbert/indexing.py in _encode_batch(self, batch_idx, batch)
    352     def _encode_batch(self, batch_idx, batch):
    353         with torch.no_grad():
--> 354             embs, ids = self.inference.docFromText(batch, bsize=self.args.bsize, keep_dims=False, with_ids=True)
    355             assert type(embs) is list
    356             assert len(embs) == len(batch)

TypeError: docFromText() got an unexpected keyword argument 'with_ids'

Is this related to stanford-futuredata/ColBERT#30? It seems that pyterrier_colbert assumes this pull request has been merged.
@cmacdonald, could you explain the reasoning behind this pull request? We don't want to mask punctuation; is there a way to bypass it?

To reproduce the error, you can use this Colab notebook. Note that it uses the forked ColBERT and pyterrier_colbert versions.

Thank you for your help!

cmacdonald (Collaborator) commented Jan 20, 2022

Our ColBERT (https://github.com/cmacdonald/colbert/tree/v0.2) has diverged a bit from upstream. The with_ids support returns the token ids, which we have found useful to record for things like query embedding pruning or PRF. It's not about masking punctuation per se. Given the multitude of changes upstream, I haven't looked at merging back.

ColBERTIndexer has an ids=True kwarg - you can avoid the error by setting ids=False.

To load your model into ColBERTFactory, you need a bit more fiddling.
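
Roughly something like this (a sketch following the pyterrier_colbert README conventions; argument names may differ in the fork, and the paths are placeholders):

from pyterrier_colbert.ranking import ColBERTFactory

# Point the factory at the custom checkpoint and the index built from it.
factory = ColBERTFactory("/path/to/your/checkpoint",
                         "/path/to/index_root",
                         "index_name")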

This change may also be useful: https://github.com/cmacdonald/ColBERT/blob/models/colbert/modeling/colbert.py

wpertsch (Author) commented Feb 6, 2022

Thank you for the help @cmacdonald!
We managed to use CodeBERT within pyterrier_colbert. Our installation looked like this:

pip install python-terrier
pip install wheel
pip install faiss-gpu==1.6.3
pip install --upgrade git+https://github.com/wpertsch/pyterrier_colbert.git
pip uninstall -y ColBERT
pip install -q git+https://github.com/hueck/ColBERT.git

The forked ColBERT is customized to use CodeBERT. The forked pyterrier_colbert is customized to pop "bert.embeddings.position_ids" from the checkpoint's state dict before loading.
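
For anyone hitting the same problem, the pop amounts to roughly this (a sketch of the idea only; the exact checkpoint keys depend on how the model was saved):

import torch

# Newer transformers versions register "position_ids" as a buffer, which the
# older ColBERT loading code does not expect, so drop it before loading.
checkpoint = torch.load("/path/to/CODEBert-checkpoint", map_location="cpu")
state_dict = checkpoint["model_state_dict"]  # assumed key; depends on how the checkpoint was saved
state_dict.pop("bert.embeddings.position_ids", None)
colbert.load_state_dict(state_dict)  # colbert: the instantiated ColBERT model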

The code we used for indexing looked like this:

import faiss
assert faiss.get_num_gpus() > 0

import pyterrier as pt
pt.init()

checkpoint="/path/to/CODEBert-checkpoint"

from pyterrier_colbert.indexing import ColBERTIndexer

indexer = ColBERTIndexer(checkpoint, "/home/exampleforgithub/indextest", "colbert_smallindex", chunksize=3, ids=False)
files = pt.io.find_files("/home/exampleforgithub/data/small")
gen = pt.index.treccollection2textgen(files)
indexer.index(gen)
print("well done index")
