CodeBERT as base model #32
Comments
Our ColBERT (https://github.com/cmacdonald/colbert/tree/v0.2) has diverged a bit from upstream. ColBERTIndexer has an To load your model into ColBERTFactory you need a bit more fiddling. This change may also be useful: https://github.com/cmacdonald/ColBERT/blob/models/colbert/modeling/colbert.py
Thank you for the help @cmacdonald!
Our ColBERT fork is customized to use CodeBERT, and pyterrier_colbert is customized to pop "bert.embeddings.position_ids". The code we used for indexing looked like this:
import faiss
# Fail early if faiss cannot see a GPU.
assert faiss.get_num_gpus() > 0
import pyterrier as pt
pt.init()
checkpoint="/path/to/CODEBert-checkpoint"
from pyterrier_colbert.indexing import ColBERTIndexer
# Arguments: checkpoint, index root directory, index name.
indexer = ColBERTIndexer(checkpoint, "/home/exampleforgithub/indextest", "colbert_smallindex", chunksize=3, ids=False)
# Collect the TREC-formatted files and stream their documents to the indexer.
files = pt.io.find_files("/home/exampleforgithub/data/small")
gen = pt.index.treccollection2textgen(files)
indexer.index(gen)
print("well done index")
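For reference, the "position_ids" workaround mentioned above can be sketched roughly as follows. This is a minimal illustration, not the actual pyterrier_colbert patch: the helper name `drop_position_ids` is hypothetical, and it assumes the checkpoint's state dict is a plain name-to-tensor mapping. The toy data uses lists instead of tensors so the snippet runs without PyTorch.

```python
from collections import OrderedDict

# Key named in the comment above; everything else here is illustrative.
POSITION_IDS_KEY = "bert.embeddings.position_ids"


def drop_position_ids(state_dict, key=POSITION_IDS_KEY):
    """Return a copy of state_dict without `key`; a no-op if the key is absent."""
    cleaned = OrderedDict(state_dict)
    cleaned.pop(key, None)
    return cleaned


# Toy stand-in for a real checkpoint: plain lists instead of tensors.
toy_state = OrderedDict([
    ("bert.embeddings.position_ids", [0, 1, 2]),
    ("linear.weight", [[0.1, 0.2]]),
])

cleaned_state = drop_position_ids(toy_state)
print(list(cleaned_state))
```

With a real checkpoint, the cleaned mapping would then be handed to the model's `load_state_dict`. Whether the weights sit at the top level of the checkpoint or under a nested key depends on how the checkpoint was saved.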
Hello, we are trying to use ColBERT for Code Retrieval. Therefore we would like to use a different base model than BERT, namely CodeBERT. By applying the changes contained in this commit hueck/ColBERT@1d268f5 we obtained this ColBERT checkpoint.
Is there a simple way to integrate a checkpoint based on a different architecture? I think this would be a useful feature to possibly improve the performance of the model.
I tried to customize pyterrier myself, but after fixing minor problems I ran into the following error, which I assume is not related to the custom checkpoint.
Is this related to stanford-futuredata/ColBERT#30? It seems that pyterrier assumes that this pull request has been merged.
@cmacdonald, could you explain the reason for this pull request? We don't want to mask punctuation; is there a way to bypass it?
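To make the question concrete, here is a rough sketch of how we understand punctuation masking: token ids in a skiplist are masked out of the document representation, along with padding. It is a simplified stand-in, not the upstream code; `build_mask`, the token ids, and the `pad_id` parameter are all made up for illustration. With an empty skiplist only padding is masked, which is the behaviour we are after.

```python
def build_mask(input_ids, skiplist=frozenset(), pad_id=0):
    """Per-token keep flags: False for padding and for ids in the skiplist."""
    return [
        [(tok != pad_id) and (tok not in skiplist) for tok in doc]
        for doc in input_ids
    ]


# Hypothetical ids: 0 is padding, 7 and 9 stand for punctuation tokens.
batch = [[5, 7, 6, 9, 0, 0]]
punct = {7, 9}

masked = build_mask(batch, skiplist=punct)  # punctuation and padding dropped
bypassed = build_mask(batch)                # only padding dropped
print(masked)
print(bypassed)
```

`pad_id` is a parameter here because RoBERTa-style vocabularies (which CodeBERT uses, as far as we know) do not necessarily place padding at id 0, so a hard-coded 0 may need adjusting.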
To reproduce the error you can use this colab. Note that it uses forked ColBERT and pyterrier_colbert versions.
Thank you for your help!