Rename dataset and output folder in Prior template (#67)
* Adjust prior template terminology
Rensvandeschoot authored Oct 24, 2024
1 parent 700888a commit c002482
Showing 3 changed files with 18 additions and 15 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -209,9 +209,9 @@ asreview makita template multimodel --classifiers logistic nb --feature_extracto

command: `prior`

-The prior template evaluates how large amounts of prior knowledge might affect simulation performance. It processes two types of data in the data folder: labeled dataset(s) to be simulated and labeled dataset(s) to be used as prior knowledge. The filename(s) of the dataset(s) containing the prior knowledge should use the naming prefix `prior_[dataset_name]`.
+The prior template evaluates how a set of custom prior knowledge might affect simulation performance. It processes two types of data in the data folder: labeled dataset(s) to be simulated and labeled dataset(s) to be used as prior knowledge. The filename(s) of the dataset(s) containing the custom prior knowledge should use the naming prefix `prior_[dataset_name]`.

-The template runs two simulations: the first simulation uses all records from the `prior_` dataset(s) as prior knowledge, and the second uses a 1+1 randomly chosen set of prior knowledge from the non-prior knowledge dataset. Both runs simulate performance on the combined non-prior dataset(s).
+The template runs two simulations: the first simulation uses all records from the `prior_` dataset(s) as prior knowledge, and the second uses a 1+1 randomly chosen set of prior knowledge from the non-prior knowledge dataset as a minimal training set. Both runs simulate performance on the combined non-prior dataset(s).

Running this template creates a `generated_data` folder. This folder contains two datasets; `dataset_with_priors.csv` and `dataset_without_priors.csv`. The simulations specified in the generated jobs file will use these datasets for their simulations.

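As a rough illustration of the file convention the README text describes, the following sketch sorts filenames into custom-prior and non-prior groups by prefix and rejects a data folder that lacks either group. The helper name `split_by_prior_prefix` and the example filenames are hypothetical; only the `prior_`/`priors_` prefixes and the two error conditions come from this commit.

```python
from pathlib import Path

# Prefixes named in the error message in template_prior.py.
PRIOR_PREFIXES = ("prior_", "priors_")


def split_by_prior_prefix(filenames):
    """Split dataset filenames into (prior, non_prior) lists by name prefix.

    Hypothetical helper for illustration; not Makita's actual implementation.
    """
    prior, non_prior = [], []
    for name in filenames:
        # Only the file name counts, not any parent directory.
        if Path(name).name.startswith(PRIOR_PREFIXES):
            prior.append(name)
        else:
            non_prior.append(name)

    if not prior:
        raise ValueError(
            "At least one dataset with custom prior knowledge "
            "(prefix 'prior_' or 'priors_') is required."
        )
    if not non_prior:
        raise ValueError("At least one dataset without prior knowledge is required.")
    return prior, non_prior


prior, non_prior = split_by_prior_prefix(
    ["data/prior_smith.csv", "data/smith.csv", "data/jones.csv"]
)
print(prior)      # ['data/prior_smith.csv']
print(non_prior)  # ['data/smith.csv', 'data/jones.csv']
```
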
21 changes: 12 additions & 9 deletions asreviewcontrib/makita/template_prior.py
@@ -95,10 +95,10 @@ def get_template_specific_params(self, params):
)
n_runs = self.n_runs if self.n_runs is not None else 1

-# Check if at least one dataset with prior knowledge is present
+# Check if at least one dataset with custom prior knowledge is present
if self._prior_dataset_count == 0:
raise ValueError(
-"At least one dataset with prior knowledge (prefix 'prior_' or \
+"At least one dataset with custom prior knowledge (prefix 'prior_' or \
'priors_') is required."
)

@@ -108,18 +108,21 @@ def get_template_specific_params(self, params):
"At least one dataset without prior knowledge is required."
)

-# Print the number of datasets with and without prior knowledge
-print(f"\nTotal datasets with prior knowledge: {self._prior_dataset_count}")
+# Print the number of datasets with custom and without prior knowledge
print(
-f"Total datasets without prior knowledge: {self._non_prior_dataset_count}"
+f"\nDatasets with custom prior knowledge: {self._prior_dataset_count}")
+print(
+f"Datasets without prior knowledge: {self._non_prior_dataset_count}"
)

# Create a directory for generated data if it doesn't already exist
generated_folder = Path("generated_data")
generated_folder.mkdir(parents=True, exist_ok=True)

-# Set file paths for datasets with and without prior knowledge
-filepath_with_priors = generated_folder / "dataset_with_priors.csv"
+# Set file paths for datasets with custom records for prior knowledge
+# and without pre-set prior knowledge from which a minimal training
+# set of 2 will be selected
+filepath_with_priors = generated_folder / "dataset_custom_priors.csv"
filepath_without_priors = generated_folder / "dataset_without_priors.csv"

# Combine all datasets into one DataFrame and remove rows where label is -1
@@ -136,7 +139,7 @@ def get_template_specific_params(self, params):
combined_dataset["makita_priors"] == 0
].shape[0]

-# Print the number of rows with and without prior knowledge
+# Print the number of rows with custom and without prior knowledge
print(f"Total rows of prior knowledge: {total_rows_with_priors}")
print(f"Total rows of non-prior knowledge: {total_rows_without_priors}")

@@ -150,7 +153,7 @@ def get_template_specific_params(self, params):
index_label='record_id'
)

-# Create a string of indices for rows with prior knowledge
+# Create a string of indices for rows with custom prior knowledge
prior_idx_list = combined_dataset[
combined_dataset["makita_priors"] == 1
].index.tolist()
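
To make the indexing step concrete, here is a minimal sketch that uses plain Python lists instead of a DataFrame. It assumes each record carries a `makita_priors` flag (1 for custom prior knowledge, 0 otherwise), as in the code above, and that the indices are joined with spaces before being passed to `--prior_idx`. The join separator and the sample records are assumptions, not taken from this diff.

```python
# Hypothetical records standing in for rows of the combined dataset;
# the list position plays the role of the DataFrame index (record_id).
records = [
    {"title": "A", "label": 1, "makita_priors": 1},
    {"title": "B", "label": 0, "makita_priors": 0},
    {"title": "C", "label": 0, "makita_priors": 1},
    {"title": "D", "label": 1, "makita_priors": 0},
]

# Collect the positions of all rows flagged as custom prior knowledge.
prior_idx_list = [i for i, row in enumerate(records) if row["makita_priors"] == 1]

# Assumption: the indices are rendered as one space-separated string.
prior_idx = " ".join(str(i) for i in prior_idx_list)

print(prior_idx_list)  # [0, 2]
print(prior_idx)       # 0 2
```
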
8 changes: 4 additions & 4 deletions asreviewcontrib/makita/templates/template_prior.txt.template
@@ -39,11 +39,11 @@ python -m asreview wordcloud {{ filepath_without_priors }} -o {{ output_folder }
{% endif %}

{% for run in range(n_runs) %}
-python -m asreview simulate {{ filepath_with_priors }} -s {{ output_folder }}/simulation/state_files/sim_with_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.asreview --seed {{ model_seed + run }} -m {{ classifier }} -e {{ feature_extractor }} -q {{ query_strategy }} -b {{ balance_strategy }} --n_instances {{ instances_per_query }} --stop_if {{ stop_if }} --prior_idx {{ prior_idx }}
-python -m asreview metrics {{ output_folder }}/simulation/state_files/sim_with_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.asreview -o {{ output_folder }}/simulation/metrics/metrics_sim_with_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.json
+python -m asreview simulate {{ filepath_with_priors }} -s {{ output_folder }}/simulation/state_files/sim_custom_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.asreview --seed {{ model_seed + run }} -m {{ classifier }} -e {{ feature_extractor }} -q {{ query_strategy }} -b {{ balance_strategy }} --n_instances {{ instances_per_query }} --stop_if {{ stop_if }} --prior_idx {{ prior_idx }}
+python -m asreview metrics {{ output_folder }}/simulation/state_files/sim_custom_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.asreview -o {{ output_folder }}/simulation/metrics/metrics_sim_custom_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.json

-python -m asreview simulate {{ filepath_without_priors }} -s {{ output_folder }}/simulation/state_files/sim_without_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.asreview --init_seed {{ init_seed + run }} --seed {{ model_seed + run }} -m {{ classifier }} -e {{ feature_extractor }} -q {{ query_strategy }} -b {{ balance_strategy }} --n_instances {{ instances_per_query }} --stop_if {{ stop_if }}
-python -m asreview metrics {{ output_folder }}/simulation/state_files/sim_without_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.asreview -o {{ output_folder }}/simulation/metrics/metrics_sim_without_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.json
+python -m asreview simulate {{ filepath_without_priors }} -s {{ output_folder }}/simulation/state_files/sim_minimal_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.asreview --init_seed {{ init_seed + run }} --seed {{ model_seed + run }} -m {{ classifier }} -e {{ feature_extractor }} -q {{ query_strategy }} -b {{ balance_strategy }} --n_instances {{ instances_per_query }} --stop_if {{ stop_if }}
+python -m asreview metrics {{ output_folder }}/simulation/state_files/sim_minimal_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.asreview -o {{ output_folder }}/simulation/metrics/metrics_sim_minimal_priors{{ "_{}".format(run) if n_runs > 1 else "" }}.json

{% endfor %}
# Generate plot and tables for dataset
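
The naming expression in the template, `"_{}".format(run) if n_runs > 1 else ""`, adds a run suffix only when more than one run is requested. A small Python sketch of the same rule follows; the helper name `state_file` and the example folder name `output` are hypothetical, while the suffix expression and path layout are copied from the template lines above.

```python
def state_file(output_folder, label, run, n_runs):
    """Mirror the template's state-file path for one run.

    Hypothetical helper; the suffix rule is copied from the Jinja expression.
    """
    suffix = "_{}".format(run) if n_runs > 1 else ""
    return f"{output_folder}/simulation/state_files/sim_{label}{suffix}.asreview"


# Single run: no suffix is appended.
single = state_file("output", "custom_priors", 0, 1)
print(single)
# output/simulation/state_files/sim_custom_priors.asreview

# Three runs: one suffixed state file per run.
paths = [state_file("output", "minimal_priors", run, 3) for run in range(3)]
print(paths[-1])
# output/simulation/state_files/sim_minimal_priors_2.asreview
```

With a single run the two simulations therefore write `sim_custom_priors.asreview` and `sim_minimal_priors.asreview`, which replace the earlier `sim_with_priors` and `sim_without_priors` names.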
