This repository contains the code that achieved 31st place in the U.S. Patent Phrase to Phrase Matching competition. You can see a detailed explanation of the solution here.
You can also view all the experiment logs on the Weights & Biases dashboard here.
This competition was organized by the USPTO and Kaggle. Its main aim was to extract relevant information by matching key phrases in patent documents. Determining the semantic similarity between phrases is critically important during the patent search and examination process, to determine whether an invention has been described before. You can read more about the problem statement on the overview page of the competition.
You can install all the requirements and set up your machine to run the code with the `install_deps.sh` script in the `bash/` folder. Before running this script, make sure you add your Kaggle API key to the script in the respective field.
$ bash bash/install_deps.sh
You can download the data with the following command. Make sure to download it into a folder named `input`.
$ kaggle datasets download -d atharvaingle/uspppm-data
You can specify your configuration in `config.yaml` (present in the `config/` folder); boolean flags in that file control whether artifacts are logged to GCP/W&B or not logged at all.
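For example, the logging flags might look like the fragment below. The key names here are illustrative, not taken from the repo; check `config/config.yaml` for the exact fields.

```yaml
# Illustrative fragment -- consult config/config.yaml for the real key names
logging:
  log_artifacts_wandb: true   # upload model checkpoints to Weights & Biases
  log_artifacts_gcp: false    # upload model checkpoints to a GCP bucket
```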
You can run an experiment by writing a bash file as follows:
#!/bin/bash
# sample run file
cd /home/US-Patent-Matching-Kaggle/src
for fold in 0 1 2 3 4
do
python3 train.py \
paths="jarvislabs" \
trainer="deberta_v3_base" \
run.debug=False \
run.fold=$fold \
run.exp_num="53" \
trainer.dataloader_num_workers=6 \
data.use_custom_seperator=True \
model.model_name="microsoft/deberta-v3-base" \
model.class_name="DebertaV2ForSequenceClassificationGeneral" \
model.loss_type="mse" \
model.multi_sample_dropout=True \
model.attention_pool=False \
run.name="mse-stable-drop-msd" \
run.comment="mse loss + multi sample dropout with deberta StableDropout"
done
# use the following line only while training on jarvislabs.ai
# pause the instance programmatically after running a series of experiments
python3 -c "from jarviscloud import jarviscloud; jarviscloud.pause()"
- You can override any config field from the bash file.
- You can pick any model class imported in `modeling/__init__.py`. You just have to specify its name in the `model.class_name` field of the Hydra configuration.
- Loss choices: `mse`, `bce` and `pearson`.
- You can choose any trainer configuration you want from the various `*.yaml` files in the `config/trainer` folder.
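To give intuition for the `pearson` option: since the competition is scored by Pearson correlation, a natural loss is one minus the Pearson correlation between predictions and targets. The repo's actual implementation lives in the model classes and operates on torch tensors; the sketch below is a hypothetical, dependency-free version shown only to illustrate the idea.

```python
import math

def pearson_loss(preds, targets):
    """1 - Pearson correlation between predictions and targets.

    Minimizing this pushes the correlation toward 1, which matches the
    competition metric directly. Hypothetical sketch, not the repo's code.
    """
    n = len(preds)
    mean_p = sum(preds) / n
    mean_t = sum(targets) / n
    # covariance numerator and the two standard-deviation terms
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(preds, targets))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in preds))
    std_t = math.sqrt(sum((t - mean_t) ** 2 for t in targets))
    # small epsilon guards against division by zero on constant inputs
    return 1.0 - cov / (std_p * std_t + 1e-8)
```

Perfectly correlated predictions give a loss near 0, perfectly anti-correlated ones a loss near 2.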
You can check out the final ensemble inference code submitted for this competition in the Kaggle notebook here.