Text Classification Benchmark Based on Fine-tuned Transformers

1. Quick Start

# clone the project
git clone git@github.com:celsofranssa/TeCBench.git

# change directory to project folder
cd TeCBench/

# Create a new virtual environment by choosing a Python interpreter 
# and making a ./venv directory to hold it:
virtualenv -p python3 ./venv

# activate the virtual environment using a shell-specific command:
source ./venv/bin/activate

# install dependencies
pip install -r requirements.txt

# add the project to the Python path
export PYTHONPATH=$PYTHONPATH:<path-to-project-dir>/TeCBench/

# to exit the virtualenv later, if needed:
deactivate
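
After these steps it can be useful to confirm that the virtual environment is active and that the project directory is on PYTHONPATH. The following is a minimal sketch and not part of the repository; the helper names in_virtualenv and on_pythonpath are hypothetical, and the script assumes it is run from the TeCBench/ directory.

```python
import os
import sys
from pathlib import Path


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; newer ones (and venv)
    # make sys.prefix differ from sys.base_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def on_pythonpath(project_dir: str, pythonpath: str) -> bool:
    """Return True when project_dir is one of the PYTHONPATH entries."""
    target = Path(project_dir).resolve()
    # Empty entries (for example from a leading ':') are skipped.
    entries = [entry for entry in pythonpath.split(os.pathsep) if entry]
    return any(Path(entry).resolve() == target for entry in entries)


if __name__ == "__main__":
    # "." assumes the current directory is TeCBench/.
    print("virtualenv active:", in_virtualenv())
    print("project on PYTHONPATH:",
          on_pythonpath(".", os.environ.get("PYTHONPATH", "")))
```

If either line prints False, re-run the activate or export step in the same shell session.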

2. Datasets

Download the datasets from Kaggle Datasets (see the Kaggle API docs for how to obtain Kaggle credentials):

kaggle datasets download \
  celsofranssa/tecbench-datasets \
  -p resource/dataset/ \
  --unzip

After the download completes, make sure the file structure is as follows:

TeCBench/
├── LICENSE
├── main.py
├── README.md
├── requirements.txt
├── resource
│   ├── dataset
│   │   ├── 20ng
│   │   │   ├── fold_0
│   │   │   │   ├── test.pkl
│   │   │   │   ├── train.pkl
│   │   │   │   └── val.pkl

        ...     

│   │   │   ├── fold_9
│   │   │   │   ├── test.pkl
│   │   │   │   ├── train.pkl
│   │   │   │   └── val.pkl
│   │   │   └── samples.pkl

        ...

│   │   └── yelp_2015
│   │       ├── fold_0
│   │       │   ├── test.pkl
│   │       │   ├── train.pkl
│   │       │   └── val.pkl

        ...

│   │       ├── fold_4
│   │       │   ├── test.pkl
│   │       │   ├── train.pkl
│   │       │   └── val.pkl
│   │       └── samples.pkl
│   ├── log
│   ├── model_checkpoint
│   ├── prediction
│   ├── representation
│   └── stat
├── settings
│   ├── data
│   │   ├── 20NG.yaml
│   │   └── YELP.yaml
│   ├── model
│   │   └── BERT.yaml
│   └── settings.yaml
└── source
    ├── callback
    │   └── PredictionWriter.py
    ├── datamodule
    │   └── TecDataModule.py
    ├── dataset
    │   └── TeCDataset.py
    ├── encoder
    │   └── BertEncoder.py
    ├── helper
    │   └── EvalHelper.py
    ├── metric
    │   └── F1.py
    ├── model
    │   └── TeCModel.py
    └── pooling
        ├── AttentivePooling.py
        ├── AveragePooling.py
        ├── MaxPooling.py
        └── NoPooling.py
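
The dataset part of this layout can be checked programmatically. The sketch below is not part of the repository; it only assumes what the tree above shows, namely that each dataset directory under resource/dataset/ holds fold_* subdirectories with train.pkl, val.pkl and test.pkl, plus a samples.pkl file. The function names are hypothetical.

```python
from pathlib import Path

EXPECTED_FILES = ("train.pkl", "val.pkl", "test.pkl")


def missing_split_files(dataset_dir):
    """List the expected split files that are absent under dataset_dir/fold_*."""
    missing = []
    for fold_dir in sorted(Path(dataset_dir).glob("fold_*")):
        for name in EXPECTED_FILES:
            if not (fold_dir / name).is_file():
                missing.append(str(fold_dir / name))
    return missing


def check_datasets(root="resource/dataset"):
    """Map each dataset directory name to its list of missing files."""
    report = {}
    for dataset_dir in sorted(Path(root).iterdir()):
        if dataset_dir.is_dir():
            missing = missing_split_files(dataset_dir)
            if not (dataset_dir / "samples.pkl").is_file():
                missing.append(str(dataset_dir / "samples.pkl"))
            report[dataset_dir.name] = missing
    return report


if __name__ == "__main__":
    root = Path("resource/dataset")
    if root.is_dir():
        for name, missing in check_datasets(root).items():
            print(name, "OK" if not missing else f"missing: {missing}")
    else:
        print(f"{root} not found; run this from the TeCBench/ directory")
```

The check does not know how many folds each dataset should have, so it only reports files missing from the fold directories that exist.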

3. Test Run

The following bash command fits the BERT model on the 20NG dataset using batch_size=32 for a single epoch.

python main.py tasks=[fit] \
    model=BERT \
    data=20NG \
    data.batch_size=32 \
    trainer.max_epochs=1
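
The dotted arguments in this command (data.batch_size=32, trainer.max_epochs=1) read as overrides of nested settings such as those in the YAML files under settings/. How main.py actually resolves them is not shown here, so the following is only a sketch of the general idea using a plain dictionary. The function names and the default values are hypothetical, and selecting a whole file (model=BERT, data=20NG) is not covered.

```python
import copy


def parse_value(text):
    """Convert an override value to int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_overrides(config, overrides):
    """Return a deep copy of config with dotted key=value overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Walk (or create) the nested dictionaries named by the key path.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return result


# Hypothetical defaults; the real values live in the YAML files under settings/.
defaults = {"data": {"name": "20NG", "batch_size": 64}, "trainer": {"max_epochs": 10}}
config = apply_overrides(defaults, ["data.batch_size=32", "trainer.max_epochs=1"])
print(config)
```

The defaults dictionary is left untouched, which mirrors the usual expectation that command-line overrides affect a single run and not the stored settings.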

If all goes well, the training command should produce output similar to the following:

GPU available: True, used: True
[2020-12-31 13:44:42,967][lightning][INFO] - GPU available: True, used: True
TPU available: None, using: 0 TPU cores
[2020-12-31 13:44:42,967][lightning][INFO] - TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[2020-12-31 13:44:42,967][lightning][INFO] - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name     | Type        | Params
-----------------------------------------
0 | encoder  | BertEncoder | 108 M 
1 | cls_head | Sequential  | 15.4 K
2 | loss     | NLLLoss     | 0     
3 | f1       | F1          | 0     
-----------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params


Epoch 0: 100%|███████████████████████████████████████████████████████| 5199/5199 [13:06<00:00,  6.61it/s, loss=5.57, v_num=1, val_mrr=0.041, val_loss=5.54]
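
The 15.4 K parameters reported for cls_head can be sanity-checked by hand. One reading consistent with that figure is a single linear layer mapping a 768-dimensional representation to 20 classes. Both sizes are assumptions (768 is the hidden size of BERT-base and 20 is the number of classes in 20 Newsgroups); the output above does not state them, and the actual layout of cls_head may differ.

```python
def linear_param_count(in_features, out_features, bias=True):
    """Number of trainable parameters in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# Assumptions: 768 is the hidden size of BERT-base, 20 is the number of
# classes in 20 Newsgroups; the actual cls_head layout is not shown here.
hidden_size = 768
num_classes = 20
count = linear_param_count(hidden_size, num_classes)
print(f"{count} parameters (~{count / 1000:.1f} K)")
```

Under these assumptions the weights contribute 768 x 20 = 15,360 parameters and the bias 20 more, which rounds to the 15.4 K shown in the summary table.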
