This repository lists publicly available conversational datasets in text format.

Dataset | Description | Words | Turns | Conversations | License (and conditions) |
---|---|---|---|---|---|
Assemblée Nationale | Parliamentary proceedings from the French National Assembly | 133M | 1.6M | 4.5k | Open License 2.0 |
Theatre Classique | Classic stage plays | 12.8M | 441k | 25k | CC BY-NC-SA 4.0 (please cite) |
Theatre Gratuit | Stage plays | 2.7M | 155k | 4k | |
ESLO (1/5) | Guided conversations | 4.2M | 329k | 399 | CC BY-NC-SA 4.0 (please cite) |
TCOF (adults) | Guided conversations (between adults) | 765k | 49k | 237 | CC BY-NC-SA 2.0 (please cite) |
CFPP | Interviews of people in Paris in 2000 | 608k | 48k | 42 | CC BY-NC-SA 3.0 (please cite) |
ORFEO/Valibel (1/2) | Guided conversations of Belgian French speakers | 458k | 19k | 67 | CC BY-NC-SA 4.0 (please cite) |
PFC (1/2) | Guided interviews | 268k | 15k | 173 | CC BY-NC-SA 4.0 (please cite) |
ORFEO/CFPB | Interviews of people in Brussels | 138k | 11k | 12 | CC BY-NC-SA 4.0 |
ACSYNT | Guided interviews from southwestern France | 61k | 2.7k | 144 | CC BY-SA 4.0 (please cite) |
OFROM | Conversations in French-speaking Switzerland | 590k | 44k | 151 | CC BY-NC-SA 3.0 (please cite) |
ESLO (2/5) | Diverse conversations | 480k | 47k | 98 | CC BY-NC-SA 4.0 (please cite) |
ORFEO/CRFP | Diverse conversations | 405k | 9k | 124 | CC BY-NC-SA 4.0 (please cite) |
ORFEO/C-ORAL-ROM | Diverse conversations | 248k | 6k | 152 | CC BY-NC-SA 4.0 (please cite) |
PFC (2/2) | Diverse conversations | 230k | 14k | 146 | CC BY-NC-SA 4.0 (please cite) |
CLAPI | Diverse conversations | 122k | 15k | 14 | CC BY-NC-SA 4.0 |
CID | Dialogues between two friends | 118k | 9k | 8 | CC BY-NC-SA 4.0 (please cite) |
Rhapsodie | Diverse conversations | 28k | 1k | 41 | CC BY-NC-SA 3.0 (please cite) |
Paris Stories | Diverse conversations in Paris | 28k | 351 | 54 | CC BY-SA 4.0 |
LinTO (1/3) | Diverse conversations | 26k | 2k | 4 | CC BY-SA 4.0 (please cite) |
SUMM-RE | Meeting-style conversations (transcribed with Whisper large-v2 ASR) | 1.3M | 39k | 283 | CC BY-SA 4.0 (please cite) |
ORFEO/Reunions-de-Travail | Real meetings | 210k | 12k | 29 | CC BY-NC-SA 4.0 |
LinTO (2/3) | Meetings on speech recognition | 41k | 1.8k | 6 | CC BY-SA 4.0 (please cite) |
FREDSum | French political debates | 406k | 7k | 144 | CC BY-SA 4.0 (please cite) |
ESLO (3/5) | Conferences | 76k | 2k | 4 | CC BY-NC-SA 4.0 (please cite) |
ESLO (4/5) | In-person assistance and call-centers | 95k | 11k | 143 | CC BY-NC-SA 4.0 (please cite) |
ORFEO/Fleuron | Interactions created to teach foreign students about university life | 33k | 2k | 51 | CC BY-NC-SA 4.0 (please cite) |
OTG | Dialogues in a tourism office | 27k | 4k | 315 | CC BY-SA 3.0 (contact before usage) |
Accueil UBS | University telephone answering service | 7.2k | 1k | 41 | CC BY-SA 3.0 (contact before usage) |
ESLO (5/5) | Conference presentations | 43k | 120 | 9 | CC BY-NC-SA 4.0 (please cite) |
LinTO (3/3) | Technical presentations (AI topics) with Q/A | 38k | 1.5k | 4 | CC BY-SA 4.0 (please cite) |
ORFEO/Valibel (2/2) | Formal university addresses | 12k | 5 | 5 | CC BY-NC-SA 4.0 (please cite) |

Dataset | Description | Words | Turns | Conversations | License (and conditions) |
---|---|---|---|---|---|
Europarl | The Europarl parallel corpus | 56M | 214K | 11K | No copyright restrictions. If you use this data in your research, please contact [email protected] |
Charlotte Narratives | The Charlotte Narrative and Conversation Collection (CNCC) contains 95 narratives, conversations and interviews representative of the residents of Mecklenburg County, North Carolina and surrounding North Carolina communities. | 200K | 2.7K | 93 | Available for download and use for research and development, including commercial development |
Switchboard | The corpus consists of approximately 260 hours of speech and was originally collected by Texas Instruments in 1990-1, under DARPA sponsorship. | 3M | 290K | 2320 | LDC User Agreement for Non-Members |
MediaSum (GitHub) | MediaSum dataset for summarization. A collection of transcripts of CNN and NPR interviews with short summaries. | 720M | 13M | 458K | For research purposes only |
AMI (project page) | The AMI Meeting Corpus is a multi-modal data set consisting of 100 hours of meeting recordings. | 712K | 75K | 139 | CC BY 4.0 |
ICSI (project page) | About 70 hours of meeting recordings. | 804K | 64K | <1K | CC BY 4.0 |
ReDial (GitHub) | ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where users recommend movies to each other. | 1.5M | 139K | 11K | CC BY 4.0 |
OpenDialKG (GitHub) | OpenDialKG is a dataset of conversations between two crowdsourcing agents engaging in a dialog about a given topic. | 1M | 84K | 12K | CC BY-NC 4.0 |
ABCD (GitHub) | Action-Based Conversations Dataset. | 1.5M | 142K | 10K | MIT |
AirDialogue (GitHub) | AirDialogue is a benchmark dataset for goal-oriented dialogue generation research. | 37M | 4.6M | 361K | Apache License 2.0 |
MULTIWOZ2_2 (pfb30) | Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning over multiple domains and topics. | 1.9M | 143K | 10.4K | Apache License 2.0 |
MulDoGO2 (GitHub) | Conversations from the airline, fastfood, finance, insurance, media, and software domains. | 10M | 892K | 63K | CDLA Permissive License |
Chit-Chat (GitHub) | Open-domain conversational dataset from the BYU Perception, Control & Cognition lab's Chit-Chat Challenge. | 2.3M | 258K | 7.1K | MIT License |
DailyDialog | High-quality multi-turn dialog dataset. | 1.2M | 102K | 13K | CC BY-NC-SA 4.0 |
British National Corpus (BNC) | Collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the late twentieth century. | 110M | 663K | 0.9K | BNC License |
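
The size columns in both tables use abbreviated counts (`k`/`K` for thousands, `M` for millions). To compare datasets numerically, these cells can be converted to integers. The following is a minimal sketch, not part of this repository: the helper names `parse_count` and `words_per_turn` are hypothetical, and the parser assumes a cell is a plain number with an optional `k`, `K`, or `M` suffix.

```python
import re

# Multipliers for the suffixes used in the tables (both "k" and "K" appear).
_SUFFIXES = {"": 1, "k": 1_000, "m": 1_000_000}


def parse_count(cell: str) -> int:
    """Convert a table cell such as '4.5k', '214K', '133M' or '2320' to an integer.

    Hypothetical helper: values in the tables are rounded, so the result
    is only as precise as the abbreviated cell.
    """
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([kKmM]?)\s*", cell)
    if match is None:
        # Cells that are not plain counts (for example '<1K') are rejected.
        raise ValueError(f"unrecognised count: {cell!r}")
    number, suffix = match.groups()
    return round(float(number) * _SUFFIXES[suffix.lower()])


def words_per_turn(words: str, turns: str) -> float:
    """Approximate average turn length from two abbreviated table cells."""
    return parse_count(words) / parse_count(turns)


if __name__ == "__main__":
    # Assemblée Nationale row: 133M words over 1.6M turns.
    print(words_per_turn("133M", "1.6M"))
```

Cells that express a bound rather than a count, such as `<1K` in the ICSI row, raise `ValueError` in this sketch and would need separate handling.
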

- Claire French Dialogue Dataset (CFDD) 🇫🇷
  - Dataset: 🤗 OpenLLM-France/Claire-Dialogue-French-0.1
  - Paper: Julie Hunter, Jérôme Louradour, Virgile Rennard, Ismaïl Harrando, Guokan Shang, Jean-Pierre Lorré, "Claire French Dialogue Dataset" (2023)
- Claire English Dialogue Dataset (CEDD) 🇬🇧