FAERY is a test collection for fine-grained dataset discovery, i.e., the task of answering an input with a ranked list of candidate datasets, along with the fields of each candidate dataset that are relevant to the input. We conduct experiments on both dataset discovery and explanation. For details about this test collection, please refer to the following paper.
We reused the 46,615 datasets collected from NTCIR. The "datasets.json" file (available at Zenodo) provides the id, title, description, tags, author, and summary of each dataset in JSON format.
{
"id": "0000de36-24e5-42c1-959d-2772a3c747e7",
"title": "Montezuma National Wildlife Refuge: January - April, 1943",
"description": "This narrative report for Montezuma National Wildlife Refuge outlines Refuge accomplishments from January through April of 1943. ...",
"tags": ["annual-narrative", "behavior", "populations"],
"author": "Fish and Wildlife Service",
"summary": "Almost continuous rains during April brought flood conditions to the Clyde River as well as to the refuge storage pool. Cayuga Lake is at its highest level in about ton years. ..."
}
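As a quick sanity check, the snippet below loads the metadata and indexes it by id. It assumes "datasets.json" is a single JSON array of objects like the one above; adjust the loading code if the file is stored line by line (JSON Lines).

import json

# Load the dataset metadata and index it by id.
# Assumes datasets.json is a JSON array of objects as shown above;
# adapt the loading code if the file is stored as JSON Lines instead.
with open("datasets.json", encoding="utf-8") as f:
    datasets = json.load(f)

datasets_by_id = {d["id"]: d for d in datasets}
print(len(datasets_by_id))  # expected: 46615
print(datasets_by_id["0000de36-24e5-42c1-959d-2772a3c747e7"]["title"])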
The "./Data/queries.tsv" file provides 3,979 keyword queries. Each row represents a query with two "\t"-separated columns: query_id
and query_text
. The queries can be divided into generated queries created from the metadata of datasets and NTCIR queries imported from the English part of NTCIR. The IDs of generated queries start with "GEN_", which are used in LLM annotations, while IDs starting with "NTCIR_1" are NTCIR queries used in LLM annotations, and IDs starting with "NTCIR_2" are NTCIR queries used in human annotations.
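The sketch below shows one way to read the queries and count each group by its ID prefix; it assumes the TSV file has no header row.

import csv
from collections import defaultdict

# Read query_id / query_text pairs and group them by ID prefix.
# Assumes there is no header row; skip the first row if there is one.
queries = {}
with open("./Data/queries.tsv", encoding="utf-8") as f:
    for query_id, query_text in csv.reader(f, delimiter="\t"):
        queries[query_id] = query_text

groups = defaultdict(int)
for query_id in queries:
    if query_id.startswith("GEN_"):
        groups["generated, LLM-annotated"] += 1
    elif query_id.startswith("NTCIR_1"):
        groups["NTCIR, LLM-annotated"] += 1
    elif query_id.startswith("NTCIR_2"):
        groups["NTCIR, human-annotated"] += 1

print(dict(groups))  # the three counts should sum to 3979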
The "./Data/human_annotated_qrels.json" file contains 7,415 qrels, and the "./Data/llm_annotated_qrels.json" file contains 122,585 qrels. Each JSON object has eight keys: query_id
, target_dataset_id
, candidate_dataset_id
, qdpair_id
(the ID of the query-target dataset pair), qrel
(relevance of a candidate dataset to a query, 0: irrelevant; 1: partially relevant; 2: highly relevant), query_explanation
, drel
(relevance of a candidate dataset to a target dataset, 0: irrelevant; 1: partially relevant; 2: highly relevant), and dataset_explanation
. The query_explanation
and dataset_explanation
are both lists of length 5 consisting of 0 and 1, and the order of the corresponding fields is [title, description, tags, author, summary]
.
{
"query_id": "NTCIR_200000",
"target_dataset_id": "002ece58-9603-43f1-8e2e-54e3d9649e84",
"candidate_dataset_id": "99e3b6a2-d097-463f-b6e1-3caceff300c9",
"qdpair_id": "1",
"qrel": 1,
"query_explanation": [1, 1, 1, 0, 0],
"drel": 2,
"dataset_explanation": [1, 1, 1, 1, 1]
}
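The snippet below decodes the explanation vectors of one qrel record into field names; it assumes each qrels file is a JSON array of objects like the one above.

import json

FIELDS = ["title", "description", "tags", "author", "summary"]

# Map the 0/1 explanation vectors back to field names.
# Assumes the qrels file is a JSON array of objects as shown above.
with open("./Data/human_annotated_qrels.json", encoding="utf-8") as f:
    qrels = json.load(f)

record = qrels[0]
query_fields = [f for f, bit in zip(FIELDS, record["query_explanation"]) if bit]
dataset_fields = [f for f, bit in zip(FIELDS, record["dataset_explanation"]) if bit]
print(record["qrel"], query_fields)    # fields relevant to the query
print(record["drel"], dataset_fields)  # fields relevant to the target dataset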
To ensure that evaluation results are comparable, please use the train-validation-test splits that we provide. We offer two ways of splitting the data into training, validation, and test sets. The "./Data/Splits/5-Fold_split" folder contains five sub-folders, each providing three qrel files for the training, validation, and test sets, respectively. The "./Data/Splits/Annotators_split" folder contains three qrel files for the training, validation, and test sets, respectively.
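The snippet below simply lists the three qrel files inside each fold of the 5-fold split without assuming their exact file names; the Annotators_split folder can be inspected the same way.

import os

# List the qrel files of each fold in the 5-fold split.
# The exact file names inside each fold are not assumed here;
# inspect the output to see the training/validation/test files.
split_root = "./Data/Splits/5-Fold_split"
for fold in sorted(os.listdir(split_root)):
    fold_dir = os.path.join(split_root, fold)
    if os.path.isdir(fold_dir):
        print(fold, sorted(os.listdir(fold_dir)))  # three qrel files per fold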
We have evaluated two sparse retrieval models: (1) TF-IDF based cosine similarity and (2) BM25, and five dense retrieval models: (3) BGE, (4) GTE, (5) contextualized late interaction over BERT (ColBERTv2), (6) coCondenser, and (7) Dense Passage Retrieval (DPR). For reranking, we have evaluated four models: (1) Stella, (2) SFR-Embedding-Mistral, (3) GLM-4-Long, and (4) GLM-4-Air.
The details of the experiments are given in the corresponding section of our paper.
The "./Baselines" folder provides the results of each baseline method, where each JSON object is formatted as: {qdpair_id: {dataset_id: score, ...}, ...}
.
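To turn a result file into ranked lists, one can sort the candidate datasets of each query-target pair by score, as in the sketch below; "bm25.json" is a placeholder name for whichever result file in ./Baselines you want to load.

import json

# Convert one baseline result file into ranked candidate lists.
# "bm25.json" is a placeholder; use any file in ./Baselines that follows
# the {qdpair_id: {dataset_id: score, ...}, ...} format.
with open("./Baselines/bm25.json", encoding="utf-8") as f:
    results = json.load(f)

rankings = {
    qdpair_id: sorted(scores, key=scores.get, reverse=True)
    for qdpair_id, scores in results.items()
}
first_pair = next(iter(rankings))
print(first_pair, rankings[first_pair][:10])  # top-10 candidate dataset ids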
We employed post-hoc explanation methods to identify which fields of a candidate dataset are relevant to the query or the target dataset. We have evaluated four explainers, (1) a feature ablation explainer, (2) LIME, (3) SHAP, and (4) an LLM, using F1-score; the first three methods need to be combined with a retrieval model.
The "./Baselines" folder provides the results of each explainers, where each JSON object is formatted as: {qdpair_id: {dataset_id: {explanaion_type: [0,1,1,0,0], ...}, ...}, ...}
.
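As a reference for how field-level explanations can be scored, the sketch below computes the F1-score of one predicted explanation vector against its gold annotation; the aggregation over all query-dataset pairs used in the paper may differ.

from sklearn.metrics import f1_score

# Score a predicted field-level explanation against the gold annotation.
# Both vectors follow the [title, description, tags, author, summary] order.
gold = [1, 1, 1, 0, 0]       # e.g. query_explanation from the qrels
predicted = [1, 0, 1, 0, 1]  # e.g. an explainer's output from ./Baselines
print(f1_score(gold, predicted))  # 0.666...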
For specific experimental details and data, please refer to our paper.
All source code of our implementation is provided in ./Code.
- Python 3.9
- rank-bm25
- scikit-learn
- sentence-transformers
- faiss-gpu
- ragatouille
- tevatron
- torch
- shap
- lime
- zhipuai
See ./Code/Retrieval/sparse.py for details of the sparse retrieval models.
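For orientation, the sketch below runs BM25 (via rank-bm25) over the concatenated metadata fields; the preprocessing and scoring actually used for the baseline are those in sparse.py, which may differ.

import json
from rank_bm25 import BM25Okapi

# A minimal BM25 sketch over concatenated metadata fields.
def to_text(d):
    parts = [d.get("title") or "", d.get("description") or "",
             " ".join(d.get("tags") or []), d.get("author") or "",
             d.get("summary") or ""]
    return " ".join(parts).lower().split()

with open("datasets.json", encoding="utf-8") as f:
    datasets = json.load(f)

bm25 = BM25Okapi([to_text(d) for d in datasets])
query_tokens = "wildlife refuge flood conditions".split()
scores = bm25.get_scores(query_tokens)
top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:10]
print([datasets[i]["id"] for i in top])  # top-10 candidate dataset ids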
See ./Code/Retrieval/unsupervised_dense.py for details of the unsupervised dense retrieval models.
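The sketch below illustrates unsupervised dense retrieval with a sentence-transformers checkpoint; the model name "BAAI/bge-base-en-v1.5" is only an example, and the checkpoints and indexing used for the reported BGE/GTE baselines are those in unsupervised_dense.py.

import json
import numpy as np
from sentence_transformers import SentenceTransformer

# A minimal dense retrieval sketch; the checkpoint below is an example,
# not necessarily the one used for the reported baselines.
with open("datasets.json", encoding="utf-8") as f:
    datasets = json.load(f)

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
texts = [(d.get("title") or "") + " " + (d.get("description") or "") for d in datasets]
doc_emb = model.encode(texts, normalize_embeddings=True)

query_emb = model.encode(["wildlife refuge flood conditions"], normalize_embeddings=True)
scores = (query_emb @ doc_emb.T)[0]   # cosine similarity on normalized vectors
top = np.argsort(-scores)[:10]
print([datasets[i]["id"] for i in top])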
- DPR: See ./Code/Retrieval/DPR for details.
- coCondenser: See ./Code/Retrieval/coCondenser for details.
- ColBERTv2: See ./Code/Retrieval/ColBERTv2 for details.
- LLM: See ./Code/Retrieval/LLM for details.
- Feature Ablation: See ./Code/Explanation/feature_ablation.py for details (a sketch of the general idea appears after this list).
- SHAP: See ./Code/Explanation/SHAP for details.
- LIME: See ./Code/Explanation/LIME for details.
- LLM: See ./Code/Explanation/LLM for details.
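The feature-ablation explainer referenced above works by blanking out one metadata field at a time and checking how much the retrieval score drops. A minimal sketch of that idea is given below; score() and the drop threshold are placeholders, and the actual implementation is the one in feature_ablation.py.

FIELDS = ["title", "description", "tags", "author", "summary"]

# Sketch of feature ablation: blank out one field at a time and mark it
# as relevant if the retrieval score drops by more than a threshold.
# `score(query, dataset)` is a placeholder for any retrieval model's
# scoring function; the threshold is likewise a placeholder.
def ablation_explanation(query, dataset, score, threshold=0.0):
    base = score(query, dataset)
    explanation = []
    for field in FIELDS:
        ablated = dict(dataset)
        ablated[field] = [] if field == "tags" else ""
        drop = base - score(query, ablated)
        explanation.append(1 if drop > threshold else 0)
    return explanation  # e.g. [1, 1, 0, 0, 1]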
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.