This is a guide to follow when contributing to the UDA-4-TSC repository.
We are using the `pre-commit` package. After you install the requirements in your conda environment, you can install the necessary git hook with the following command: `pre-commit install`. From then on, every commit you perform will be run through `black`, which is defined in `.pre-commit-config.yaml`. If you would like to check the `black` formatting yourself, you can use the command `black --check --line-length 100 .`.
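For reference, a minimal `.pre-commit-config.yaml` that wires `black` in as a hook could look like the sketch below; the pinned `rev` is an assumption, so check the actual file in the repository:

```yaml
# Sketch of a pre-commit config running black with line length 100.
repos:
  - repo: https://github.com/psf/black
    rev: 22.3.0  # assumed revision; the repository may pin a different one
    hooks:
      - id: black
        args: [--line-length=100]
```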
When adding a new dataset you will need to:

- Create a new dataset file that inherits from Hugging Face `datasets` in `_datasets`, following the Hugging Face tutorial on adding a custom dataset.
- Update the base enumeration `DatasetNameEnum` in `src/_utils/enumerations.py`.
- Add a new `_configs/conf/stages/preprocess/[dataset_name].yaml` that defines the preprocessing stage of the dataset you are adding.
- Add the set of different `dataset_configs` that specify which target and source we need, in a sample configuration in `_configs/custom_conf/dataset_name=[dataset_name]/sample.json`.
The config for a given dataset in `_configs/conf/stages/preprocess/[dataset_name].yaml` needs to follow this schema:

- `dataset_name`: the name of the dataset.
- `preprocessing_pipeline`: a list of `PreprocessorConfig`, each of which contains:
  - `preprocessor`: the name of the preprocessor (a `PreprocessorEnum` value).
  - `config`: a dict that contains the custom configuration expected by the preprocessor defined in `src/_preprocessing`.
- `dataset_config`: a dict that contains two other dicts, `source` and `target`, where each of the two needs a dict with:
  - `name`: the name of the `source`/`target`.
  - Any other custom keys and values that are needed to configure the `source`/`target`.
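Putting the schema together, a hypothetical `_configs/conf/stages/preprocess/[dataset_name].yaml` might look like the following sketch; all names and custom keys (`my_dataset`, `some_preprocessor`, `domain_a`, `domain_b`, `some_option`) are placeholders, not values from the repository:

```yaml
dataset_name: my_dataset  # placeholder; must match a DatasetNameEnum value
preprocessing_pipeline:
  - preprocessor: some_preprocessor  # placeholder PreprocessorEnum value
    config:
      some_option: 100  # custom configuration expected by this preprocessor
dataset_config:
  source:
    name: domain_a  # placeholder source name
    some_custom_key: value_a  # any extra key needed to configure the source
  target:
    name: domain_b  # placeholder target name
    some_custom_key: value_b
```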
When adding a new classifier you will need to:

- Create a new classifier that either inherits from `SKClassifier` in `src/_classifiers/sk/base.py` or inherits from `HuggingFaceClassifier` in `src/_classifiers/hf/base.py`.
  - Note that the config of a `HuggingFaceClassifier` will need to follow the `HFConfig` present in `src/_classifiers/hf/base.py`.
  - The latter `HFConfig` will need a `backbone` attribute / dict configuration to create the backbone defined in `src/_backbone/nn`.
- Update the base enumeration `TLClassifierEnum` by inserting your new classifier's name in `src/_utils/enumerations.py`.
- Update the function `get_tl_classifier_class`, which fetches the classifier based on its name, in `src/_utils/hub.py`.
- For each classifier you will need to define its default configuration for tuning in `_configs/conf/stages/tune/[classifier_name].yaml` and for training in `_configs/conf/stages/train/[classifier_name].yaml`.
- If you need to override the default values of this classifier for some dataset, define them in `_configs/conf/stages/tune/tuner_config/[classifier_name]/[dataset_name].yaml` for the tune stage and in `_configs/conf/stages/train/config/[classifier_name]/[dataset_name].yaml` for the train stage (a sketch follows this list).
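As an illustration of such a per-dataset override, a hypothetical `_configs/conf/stages/train/config/[classifier_name]/[dataset_name].yaml` could simply restate the keys it overrides; the keys below are purely illustrative and depend on the classifier's actual config schema:

```yaml
# Hypothetical per-dataset override of the classifier's default training config.
batch_size: 32  # illustrative key, not necessarily part of the real schema
num_epochs: 50  # illustrative key
```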
### Skip tune stage

If your classifier does not need a hyperparameter tuning stage, you are allowed to skip it by putting the following in `_configs/conf/stages/tune/[classifier_name].yaml`:

```yaml
random_seed: 1
search_method_names:
  - None
no_tune:
  no_tune: true
```
Otherwise you will need to define the following in `_configs/conf/stages/tune/[classifier_name].yaml`:

- `classifier_name`: the name of the classifier (a `TLClassifierEnum` value).
- `tuner_config`: a configuration of the tuning (`TunerConfig`), containing:
  - `hyperparam_tuning`: the set of Ray hyperparameters to be tuned (alongside their search space).
  - `hyperparam_fixed`: the set of fixed Ray hyperparameters that won't be tuned.
- `ray_config`: a configuration of Ray (`RayConfig`) defined in `src/_utils/stages.py`; usually you only need to specify how many `cpu`/`gpu` resources your classifier will need.
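As a sketch under these assumptions, a hypothetical `_configs/conf/stages/tune/[classifier_name].yaml` might look like this; the classifier name and hyperparameter keys are placeholders, and the way Ray search spaces are encoded in YAML here is an assumption rather than the repository's actual convention:

```yaml
classifier_name: my_classifier  # placeholder; must match a TLClassifierEnum value
tuner_config:
  hyperparam_tuning:
    learning_rate: [0.001, 0.01, 0.1]  # illustrative search space; actual encoding may differ
  hyperparam_fixed:
    num_epochs: 100  # illustrative fixed hyperparameter
ray_config:
  cpu: 4  # resources the classifier will need
  gpu: 1
```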
Your training stage is necessary and cannot be skipped, therefore it will need to contain the following:

- `classifier_name`: the name of the classifier (a `TLClassifierEnum` value).
- `config`: the custom config dict of the classifier; if it is a `HuggingFaceClassifier`, it will need to follow the `HFConfig` present in `src/_classifiers/hf/base.py`.
- `tune_config`: a dict whose schema follows `TuneConfig` and should contain:
  - `search_method_name`: a string that defines which hyperparameter search method will be used to choose the best set of hparams. You have four options:
    - `None`: the classifier's config should only be defined in the `train` stage; in other words, the `tune` stage should be skipped in this case.
    - The options defined by the base enumeration `TLTunerEnum` in `src/_utils/enumerations.py`.
  - `train_tune_option`: a string with five options, defined in `TrainTuneOptionsEnum`:
    - `tune_configs_only`: the classifier's config should only be defined in the `tune` stage.
    - `train_configs_only`: the classifier's config should only be defined in the `train` stage; in other words, the `tune` stage should be skipped in this case.
    - `tune_overrides_train`: the hparams defined in the `train` stage will be overridden by the `tune` stage.
    - `train_overrides_tune`: the hparams defined in the `tune` stage will be overridden by the `train` stage.
    - `train_union_tune`: the union of the hparams defined in the `tune` and `train` stages will be taken as the classifier's config, with no intersection allowed between the two sets defined in `tune` and `train`.
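Combining the fields above, a hypothetical `_configs/conf/stages/train/[classifier_name].yaml` could be sketched as follows; the classifier name, config keys, and backbone name are placeholders, not values from the repository:

```yaml
classifier_name: my_classifier  # placeholder; must match a TLClassifierEnum value
config:
  backbone:  # required by HFConfig for a HuggingFaceClassifier
    name: some_backbone  # placeholder backbone from src/_backbone/nn
  learning_rate: 0.001  # illustrative classifier hyperparameter
tune_config:
  search_method_name: None  # or an option from TLTunerEnum
  train_tune_option: train_configs_only  # one of the TrainTuneOptionsEnum options
```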
Some pipeline configurations are usually not defined / overridden when adding a new classifier / dataset and are therefore defined in `src/_configs/conf/config.yaml`, with comments explaining what each field does.