This repository contains the Recipe-MPR dataset and accompanying tools: a custom recipe query dataset curated from FoodKG, a dataset search tool, code for baseline methods, and evaluation tools.
The Recipe-MPR data can be found in data/500QA.json.
All of the raw recipe data used by curators to create this dataset was accessed through a Python-based search tool developed to easily search through recipe options. The search tool uses 3 publicly available JSON files: layer1.json, det_ingrs.json, and recipes_with_nutritional_info.json, originating from the Recipe1M dataset. These files are also what the FoodKG dataset is based on, as specified by the FoodKG dataset construction. Details on how to download the data and set up the search tool can be found in tools/search-tool/README.md.
The format of the dataset is in JSON, where each JSON object consists of the query, five options (recipe IDs and text descriptions), the recipe ID of the intended correct answer, labels of query reasoning strategies, and a mapping from preference aspects in the query to aspects in the correct option text. In this way, each group of query and five recipe options form a "multiple-choice" problem. The JSON schema used can be found under data/recipe-mpr.schema.json.
One example of the data:
```json
{
  "query": "I would like meat lasagna but I'm watching my weight",
  "query_type": {
    "Specific": 1,
    "Commonsense": 1,
    "Negated": 0,
    "Analogical": 0,
    "Temporal": 0
  },
  "options": {
    "34572cc1ee": "Vegetarian lasagna with mushrooms, mixed vegetables, textured vegetable protein, and meat replacement",
    "7042fffd85": "Forgot the Meat Lasagna with onions, mushroom and spinach",
    "0b82f37488": "Beef lasagna with whole wheat noodles, low-fat cottage cheese, and part-skim mozzarella cheese",
    "047ea4e60b": "Cheesy lasagna with Italian sausage, mushrooms, and 8 types of cheese",
    "57139f1a42": "Meat loaf containing vegetables such as potatoes, onions, corn, carrots, and cabbage"
  },
  "answer": "0b82f37488",
  "correctness_explanation": {
    "meat lasagna": "Beef lasagna",
    "watching my weight": ["whole-wheat", "low-fat", "part-skim"]
  }
}
```
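The data can be loaded with standard JSON tooling. Below is a minimal sketch, assuming the script is run from the repository root and that data/500QA.json is a JSON array of objects like the one above:

```python
# Minimal sketch of loading the dataset; assumes data/500QA.json is a JSON
# array of objects shaped like the example above.
import json

with open("data/500QA.json") as f:
    samples = json.load(f)

sample = samples[0]
print(sample["query"])                      # natural-language preference query
print(sample["options"][sample["answer"]])  # description of the correct recipe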
A total of 500 queries were generated by five different data curators (100 each). The queries were written to be (a) natural and (b) multi-aspect.
Each query was then labeled according to the five query properties below, each a binary label where 1/0 indicates true/false. These properties are not mutually exclusive (a query can carry several labels); a count of the queries containing each label is given in the table further down.
- Specific
  - Specific queries mention a certain dish or recipe name, e.g., "spaghetti carbonara". By default, anything else is considered General.
- Commonsense
  - A query is considered Commonsense if it requires commonsense reasoning, e.g., inferring "I'm watching my weight" to mean "I want a low-calorie meal". Otherwise, the query is considered Direct.
- Negated
  - Negated queries contain a contradiction or denial of something, e.g., using terms like "but", "but not", "without", "doesn't", etc.
- Analogical
  - Analogical queries use metaphors or similes to express preferences through a comparison, e.g., "like McDonald's".
- Temporal
  - Temporal queries contain explicit references to time, e.g., time of day, day of week, or terms concerning the passage of time like "slow", "fast", "lasting", etc.
Label | Query Count |
---|---|
Specific | 151 |
Commonsense | 268 |
Negated | 109 |
Analogical | 30 |
Temporal | 32 |
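These counts can be recomputed directly from the dataset. A small sketch, assuming the file layout shown earlier:

```python
# Small sketch recomputing the per-label query counts in the table above,
# assuming data/500QA.json is a JSON array of samples as shown earlier.
import json
from collections import Counter

with open("data/500QA.json") as f:
    samples = json.load(f)

counts = Counter()
for sample in samples:
    for label, value in sample["query_type"].items():
        counts[label] += value  # labels are 0/1, so summing them gives counts

print(counts)
```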
Examples of queries and their corresponding labels:
Query | Specific | Commonsense | Negated | Analogical | Temporal |
---|---|---|---|---|---|
I would like a beef recipe but not stew | 0 | 0 | 1 | 0 | 0 |
I want chicken that has a kick to it | 0 | 1 | 0 | 0 | 0 |
After the queries were generated, five recipe options were generated for each query by the same data curators, with one being the correct answer. These options were found with the FoodKG Search Tool, and a text description was manually written for each option when the recipe name was not descriptive enough (otherwise the recipe name is used as the description). Information for these text descriptions comes mainly from the recipe name, ingredients, and nutritional information; additional recipe details such as cooking method or estimated time are sometimes included.
Requirements for options and descriptions:
- The four incorrect options should be hard negatives (e.g., near misses). Hard negative options are defined as options close to the correct answer but differing by at least one aspect.
- Only one of the five options should be a relevant recommendation for the query.
- The manually written text descriptions do not have to contain full recipe details, but must include enough information to discern the correct option from the wrong ones.
- The text descriptions should remain factual and avoid any additional human inference.
- The text description for the correct answer should avoid direct word matching with the query as much as possible.
For each query, data curators labeled preference aspects and item aspects as spans of text in the query and in the correct option description, and annotated a mapping from each query aspect to the corresponding term(s) in the correct option description that make the option correct. For query terms that do not match an explicit term in the option description, the "<INFERRED>" tag is used.
An example of a query, correct option, and corresponding aspect mapping is below:
| Query | I want to make a paella but I'm short on time |
|---|---|
| Correct option | Brown rice paella containing mussels and Spanish chorizo that cooks fast |
| Aspect mapping | "paella" --> "paella", "short on time" --> "cooks fast" |
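In code, the aspect mapping can be traversed to recover (query aspect, option term) pairs. A minimal sketch; the handling of list-valued entries and of the "<INFERRED>" tag follows the description above and is otherwise an assumption about the stored format:

```python
# Hedged sketch of walking one sample's aspect mapping; list-valued entries
# and the "<INFERRED>" tag are handled per the description above.
def iter_aspect_pairs(sample):
    for query_aspect, option_terms in sample["correctness_explanation"].items():
        # a query aspect may map to a single term or to a list of terms
        if isinstance(option_terms, str):
            option_terms = [option_terms]
        for term in option_terms:
            if term == "<INFERRED>":
                continue  # no explicit matching span in the option description
            yield query_aspect, term
```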
Following the data generation and annotation, two full rounds of data validation among curators were conducted. In each round, subsets of queries and the associated labels, options, and descriptions were validated by someone other than the original curator. Thus, each data sample was validated by two curators other than the original curator. The validation ensured that:
- Queries, options, and text descriptions followed the guidelines established above
- Query property labels were used consistently and correctly
- A single correct answer could be identified without ambiguity and without looking at the ground-truth label
- The descriptions of any recipe used multiple times were consistent
- There were no duplicate queries
The second validator helped resolve conflicts and oversaw suggested changes where necessary. Changes were made only if both validators agreed on the proposed modifications.
A number of off-the-shelf baseline methods are provided, covering Sparse (OWC, TF-IDF, BM25), Dense (BERT, TAS-B, GPT-3 embeddings), Zero-Shot (GPT-2, GPT-3, OPT), and Few-Shot (GPT-2, GPT-3, OPT) approaches. All baseline methods are provided for two evaluation settings: Monolithic and Aspect-based. Results from the baselines are automatically saved to .csv files, with accuracy (hit@1) as the metric; a small illustrative sketch of this metric follows the list of settings below.
- Monolithic setting: see baselines/monolithic/ for evaluation code and prompt formats
- Aspect-based setting: see baselines/aspects/ for evaluation code and prompt formats
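For reference, hit@1 simply checks whether the top-ranked option matches the annotated answer. An illustrative re-implementation (not the repository's own evaluation code):

```python
# Illustrative re-implementation of the hit@1 metric used to score baselines:
# a sample counts as correct only if the top-ranked option is the annotated
# answer. The repository's evaluation code may compute this differently.
def hit_at_1(predicted_ids, samples):
    """predicted_ids[i] is the recipe ID ranked first for samples[i]."""
    correct = sum(pred == sample["answer"]
                  for pred, sample in zip(predicted_ids, samples))
    return correct / len(samples)
```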
The package requirements needed for evaluation are listed in requirements.txt. To run evaluation experiments, modify baselines/config.json as needed and run the shell scripts found under baselines/scripts/ directly. For example:
```bash
cd baselines
./scripts/single_runs.sh
```
To create random splits of the data into k folds, run the utils/make_fold_inds.py script, which outputs the fold indices to a JSON file. The fold indices used in our experiments can be found under baselines/folds/.
```bash
python3 make_fold_inds.py -K [# of folds]
```
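The idea behind the fold indices is straightforward. Below is an illustrative sketch under stated assumptions; the actual utils/make_fold_inds.py script may differ in its options and output format, and the output filename here is made up:

```python
# Illustrative sketch of producing random k-fold index splits and writing them
# to JSON; utils/make_fold_inds.py may differ, and the filename is hypothetical.
import json
import random

def make_fold_indices(n_samples, k, seed=0):
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]  # k roughly equal folds

folds = make_fold_indices(n_samples=500, k=5)
with open("fold_inds.json", "w") as f:
    json.dump(folds, f)
```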
The utils/generate_results.py script is also provided; it contains functions that parse the .csv results files produced by the evaluation code and generate a single dataframe of all results for easier analysis and comparison.
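A hedged sketch of the same idea, gathering per-run result files into one pandas DataFrame; the file layout and column names are assumptions, and utils/generate_results.py may organize this differently:

```python
# Hedged sketch of gathering per-run .csv result files into one pandas
# DataFrame for side-by-side comparison; file layout and columns are assumed.
import glob
import pandas as pd

frames = []
for path in glob.glob("baselines/**/*.csv", recursive=True):
    df = pd.read_csv(path)
    df["source_file"] = path  # record which run each row came from
    frames.append(df)

results = pd.concat(frames, ignore_index=True)
print(results.head())
```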