This repository gathers datasets built for n-ary relation extraction, a text mining task. It focuses mostly on biomedical datasets, although relevant datasets from other scientific domains are also included. Only datasets built specifically for n-ary relations are listed; datasets that could merely be adapted to the task are not.
Relation extraction (RE) is a text mining task that aims to identify the relations between the entities recognised in a text [1]. N-ary relation extraction targets relations that hold among n entities. Currently, only a few datasets are available for this type of RE. N-ary relation extraction can help to answer more specific questions, such as:
- given a mutation in a gene, which drug does it respond to? This yields a ternary gene-mutation-drug relation [2];
- given a gene variation, how does it affect the drug response phenotype? This yields a ternary gene variation-drug-phenotype relation [3];
- which drug combinations result in a positive effect [4];
- given a specific mutation in a gene, how does it affect the response to a drug [5].
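To make the notion concrete, an n-ary relation can be modelled as a label attached to an ordered tuple of n typed entities. The sketch below is only an illustration: the class names `Entity` and `NaryRelation`, the label `responds_to`, and the example values are assumptions, not a format used by any dataset listed here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Entity:
    text: str   # surface form as it appears in the text
    etype: str  # entity type, e.g. "Drug", "Gene", "Mutation"


@dataclass(frozen=True)
class NaryRelation:
    label: str                     # relation label (hypothetical value below)
    arguments: Tuple[Entity, ...]  # the n participating entities

    @property
    def arity(self) -> int:
        """Number of entities taking part in the relation."""
        return len(self.arguments)


# Illustrative ternary (3-ary) drug-gene-mutation relation.
ternary = NaryRelation(
    label="responds_to",
    arguments=(
        Entity("gefitinib", "Drug"),
        Entity("EGFR", "Gene"),
        Entity("L858R", "Mutation"),
    ),
)

# A binary relation is the n = 2 case of the same structure.
binary = NaryRelation(label="responds_to", arguments=ternary.arguments[:2])

print(ternary.arity, binary.arity)
```

Storing the arguments as a tuple of arbitrary length also covers variable-length relations such as drug combinations.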
| # | Year | Entities | N-ary | Nº Relations | Type | Annotation Level | Relation Source | Reference | Dataset |
|---|---|---|---|---|---|---|---|---|---|
| 1.1 | 2017 | Drug-Gene-Mutation | Binary & Ternary | Ternary: 3462; Drug-Gene: 137,496; Drug-Mutation: 3192 | Silver | Sent & Doc | Filtered from 1 million full-text articles from PubMed Central | Cross-Sentence N-ary Relation Extraction with Graph LSTMs | [Dataset] |
| 1.2 | 2020 | Gene variations, Genes, Drugs & Phenotypes | Ternary | 2871 | Gold | Sent | 911 PubMed abstracts | PGxCorpus, a manually annotated corpus for pharmacogenomics | [Dataset] |
| 1.3 | 2022 | Drug combinations | Variable-length N-ary | 1248 | Gold | Sent, Par or Abs | 1634 PubMed abstracts | A Dataset for N-ary Relation Extraction of Drug Combinations | [Dataset] |
1.1 Cross-Sentence N-ary Relation Extraction with Graph LSTMs [2]
- Language : English
- Format : TSV
- Standard : Silver
- Data origin : Filtered from 1 million full-text articles from PubMed Central
- Number of instances : 144,150
- N-ary : 2-ary (drug-gene & drug-mutation) & 3-ary (drug-gene-mutation)
- Total relations :
  - Positive examples :
    - 3-ary : 3462 (59 unique)
    - 2-ary : Drug-Gene 137,496 & Drug-Mutation 3192
  - Negative examples were created by randomly sampling co-occurring entity triples without known interactions
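The negative sampling step described above can be sketched as follows. This is one plausible reading of that description, not the authors' code: the function name `sample_negatives`, its parameters, and the placeholder entity names are all assumptions made for illustration.

```python
import random
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (drug, gene, mutation)


def sample_negatives(
    cooccurring: List[Triple],
    known_positive: Set[Triple],
    k: int,
    seed: int = 0,
) -> List[Triple]:
    """Randomly pick up to k co-occurring triples with no known interaction."""
    # Deduplicate while preserving order, then drop the known positives.
    candidates = [t for t in dict.fromkeys(cooccurring) if t not in known_positive]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(candidates, min(k, len(candidates)))


# Placeholder triples, not real data.
cooccurring = [
    ("drugA", "geneX", "mut1"),
    ("drugA", "geneY", "mut2"),
    ("drugB", "geneX", "mut1"),
    ("drugA", "geneY", "mut2"),  # duplicate co-occurrence
]
known_positive = {("drugA", "geneX", "mut1")}

negatives = sample_negatives(cooccurring, known_positive, k=2)
print(negatives)
```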
1.2 PGxCorpus, a manually annotated corpus for pharmacogenomics [3]
- Language : English
- Format : Brat
- Standard : Gold
- Data origin : 911 PubMed abstracts
- Number of instances : 945 sentences
- N-ary : 2-ary & 3-ary (drug-genetic factors-phenotype)
- Total relations : 2871
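Since this corpus is distributed in Brat format, a minimal reader for brat standoff `.ann` content may be useful. The sketch below handles only text-bound (`T`) and relation (`R`) lines of the general brat standoff layout, including discontinuous spans; the function name `parse_brat_ann` and the entity and relation type names in the sample are invented for illustration and are not taken from PGxCorpus.

```python
from typing import Dict, List, Tuple


def parse_brat_ann(ann_text: str) -> Tuple[Dict[str, dict], List[dict]]:
    """Parse text-bound (T) and relation (R) lines of brat standoff text."""
    entities: Dict[str, dict] = {}
    relations: List[dict] = []
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        ann_id = fields[0]
        if ann_id.startswith("T"):
            # Layout: T1<TAB>Type start end[;start end ...]<TAB>covered text
            etype, span_str = fields[1].split(" ", 1)
            spans = [
                tuple(int(x) for x in fragment.split())
                for fragment in span_str.split(";")
            ]
            entities[ann_id] = {"type": etype, "spans": spans, "text": fields[2]}
        elif ann_id.startswith("R"):
            # Layout: R1<TAB>Type Arg1:T1 Arg2:T2
            rtype, *args = fields[1].split()
            relations.append(
                {"id": ann_id, "type": rtype, "args": dict(a.split(":", 1) for a in args)}
            )
        # Other annotation kinds (events, attributes, notes) are skipped here.
    return entities, relations


# Hypothetical annotation content, for illustration only.
sample = (
    "T1\tDrug 0 8\twarfarin\n"
    "T2\tGene 25 31\tCYP2C9\n"
    "T3\tPhenotype 40 45;50 58\tdose required\n"
    "R1\tinfluences Arg1:T2 Arg2:T1\n"
)
entities, relations = parse_brat_ann(sample)
print(entities["T3"]["spans"], relations[0]["args"])
```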
1.3 A Dataset for N-ary Relation Extraction of Drug Combinations [4]
- Language : English
- Format : JSON lines
- Standard : Gold
- Data origin : 1600 PubMed abstracts
- Number of instances : 1634
- N-ary : variable-length n-ary (drug-drug (...))
- Total relations : 1248
  - Per label:
    - POS_COMB : 838
    - OTHER_COMB : 410
    - NO_COMB : 591
- Train set size : 1362
- Test set size : 272
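A per-label count like the one above can be obtained from a JSON lines file with a few lines of Python. The sketch below is hedged: the field names `rels` and `class` are assumptions about the record layout and should be checked against the actual files, and the sample records are invented. The two assertions at the end only check the figures quoted above for internal consistency.

```python
import io
import json
from collections import Counter
from typing import IO


def count_labels(handle: IO[str], rel_key: str = "rels", label_key: str = "class") -> Counter:
    """Count relation labels in a JSON lines stream (one JSON object per line)."""
    counts: Counter = Counter()
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        for rel in record.get(rel_key, []):
            counts[rel[label_key]] += 1
    return counts


# Invented records using the assumed field names.
sample = io.StringIO(
    '{"rels": [{"class": "POS_COMB"}, {"class": "OTHER_COMB"}]}\n'
    '{"rels": []}\n'
    '{"rels": [{"class": "POS_COMB"}]}\n'
)
counts = count_labels(sample)
print(counts)

# Figures quoted above: the two combination labels sum to the relation total,
# and the train and test sizes sum to the number of instances.
assert 838 + 410 == 1248
assert 1362 + 272 == 1634
```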
| # | Year | Entities | N-ary | Nº Relations | Type | Annotation Level | Relation Source | Reference | Dataset |
|---|---|---|---|---|---|---|---|---|---|
| 2.1 | 2020 | Dataset, Metric, Task, Method | Binary & Quaternary | 16 2-ary & 5 4-ary (average per document) | Gold | Doc | 483 fully annotated documents from Papers with Code | SciREX: A Challenge Dataset for Document-Level Information Extraction | [Dataset] |
SciREX is a document-level dataset that covers several information extraction tasks on scientific publications, including entity identification and document-level n-ary relation identification. It provides relations between entities of four types (Dataset, Method, Metric and Task), which focus on the main results of a scientific article. It is fully annotated with entities, their mentions, their coreferences, and their document-level relations.
- Language : English
- Format : JSON lines
- Standard : Gold
- Data origin : 483 Fully annotated documents from Papers with Code
- Number of instances : UNK
- N-ary : 2-ary & 4-ary
- Total relations : 16 2-ary & 5 4-ary (average per document)
References:
[1] J. Liu, H. Ren, M. Wu, J. Wang, and H.-j. Kim, “Multiple relations extraction among multiple entities in unstructured text,” Soft Computing, vol. 22, pp. 4295–4305, 2018.
[2] N. Peng, H. Poon, C. Quirk, K. Toutanova, and W.-t. Yih, “Cross-sentence n-ary relation extraction with graph LSTMs,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 101–115, Apr. 2017.
[3] J. Legrand, R. Gogdemir, C. Bousquet, K. Dalleau, M.-D. Devignes, W. Digan, C.-J. Lee, N.-C. Ndiaye, N. Petitpain, and P. Ringot, “PGxCorpus, a manually annotated corpus for pharmacogenomics,” Scientific Data, vol. 7, pp. 1–13, 2020.
[4] A. Tiktinsky, V. Viswanathan, D. Niezni, D. Meron Azagury, Y. Shamay, H. Taub-Tabib, T. Hope, and Y. Goldberg, “A dataset for n-ary relation extraction of drug combinations,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (Seattle, United States), pp. 3190–3203, Association for Computational Linguistics, July 2022.
[5] R. Jia, C. Wong, and H. Poon, “Document-level n-ary relation extraction with multiscale representation learning,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), (Minneapolis, Minnesota), pp. 3693–3704, Association for Computational Linguistics, June 2019.
This section lists the search queries and platforms used for this work:
[Date : 5-12-2022]
Search queries:
"n-ary";
"n-ary" AND "relation extraction";
"n-ary" AND "relation extraction" AND "biomedical";
"n-ary" AND "relation extraction" OR "biomedical"
Web Search Platforms: [Google Scholar]; [PubMed]; [Semantic Scholar]