OpenLLM-France/Claire-datasets

Claire-datasets

This repository lists public conversational datasets in text formats.

Raw datasets

French

| Dataset | Description | Words | Turns | Conversations | License (and conditions) |
|---|---|---|---|---|---|
| **Parliamentary Proceedings** | | | | | |
| Assemblée Nationale | Parliamentary proceedings from the French National Assembly | 133M | 1.6M | 4.5k | Open License 2.0 |
| **Theatre** | | | | | |
| Theatre Classique | Classic stage plays | 12.8M | 441k | 25k | CC BY-NC-SA 4.0 (please cite) |
| Theatre Gratuit | Stage plays | 2.7M | 155k | 4k | |
| **Interviews** | | | | | |
| ESLO (1/5) | Guided conversations | 4.2M | 329k | 399 | CC BY-NC-SA 4.0 (please cite) |
| TCOF (adults) | Guided conversations (between adults) | 765k | 49k | 237 | CC BY-NC-SA 2.0 (please cite) |
| CFPP | Interviews of people in Paris in 2000 | 608k | 48k | 42 | CC BY-NC-SA 3.0 (please cite) |
| ORFEO/Valibel (1/2) | Guided conversations of Belgian French speakers | 458k | 19k | 67 | CC BY-NC-SA 4.0 (please cite) |
| PFC (1/2) | Guided interviews | 268k | 15k | 173 | CC BY-NC-SA 4.0 (please cite) |
| ORFEO/CFPB | Interviews of people in Brussels | 138k | 11k | 12 | CC BY-NC-SA 4.0 |
| ACSYNT | Guided interviews from southwestern France | 61k | 2.7k | 144 | CC BY-SA 4.0 (please cite) |
| **Free Conversations** | | | | | |
| OFROM | Conversations in French-speaking Switzerland | 590k | 44k | 151 | CC BY-NC-SA 3.0 (please cite) |
| ESLO (2/5) | Diverse conversations | 480k | 47k | 98 | CC BY-NC-SA 4.0 (please cite) |
| ORFEO/CRFP | Diverse conversations | 405k | 9k | 124 | CC BY-NC-SA 4.0 (please cite) |
| ORFEO/C-ORAL-ROM | Diverse conversations | 248k | 6k | 152 | CC BY-NC-SA 4.0 (please cite) |
| PFC (2/2) | Diverse conversations | 230k | 14k | 146 | CC BY-NC-SA 4.0 (please cite) |
| CLAPI | Diverse conversations | 122k | 15k | 14 | CC BY-NC-SA 4.0 |
| CID | Dialogues between two friends | 118k | 9k | 8 | CC BY-NC-SA 4.0 (please cite) |
| Rhapsodie | Diverse conversations | 28k | 1k | 41 | CC BY-NC-SA 3.0 (please cite) |
| Paris Stories | Diverse conversations in Paris | 28k | 351 | 54 | CC BY-SA 4.0 |
| LinTO (1/3) | Diverse conversations | 26k | 2k | 4 | CC BY-SA 4.0 (please cite) |
| **Meetings** | | | | | |
| SUMM-RE | Meeting-style conversations (transcribed with Whisper large-v2 ASR) | 1.3M | 39k | 283 | CC BY-SA 4.0 (please cite) |
| ORFEO/Reunions-de-Travail | Real meetings | 210k | 12k | 29 | CC BY-NC-SA 4.0 |
| LinTO (2/3) | Meetings on speech recognition | 41k | 1.8k | 6 | CC BY-SA 4.0 (please cite) |
| **Debates** | | | | | |
| FREDSum | French political debates | 406k | 7k | 144 | CC BY-SA 4.0 (please cite) |
| ESLO (3/5) | Conferences | 76k | 2k | 4 | CC BY-NC-SA 4.0 (please cite) |
| **Assistance** | | | | | |
| ESLO (4/5) | In-person assistance and call-centers | 95k | 11k | 143 | CC BY-NC-SA 4.0 (please cite) |
| ORFEO/Fleuron | Interactions created to teach foreign students about university life | 33k | 2k | 51 | CC BY-NC-SA 4.0 (please cite) |
| OTG | Dialogues in a tourism office | 27k | 4k | 315 | CC BY-SA 3.0 (contact before usage) |
| Accueil UBS | University telephone answering service | 7.2k | 1k | 41 | CC BY-SA 3.0 (contact before usage) |
| **Presentation, Formal Address** | | | | | |
| ESLO (5/5) | Conference presentations | 43k | 120 | 9 | CC BY-NC-SA 4.0 (please cite) |
| LinTO (3/3) | Technical presentations (AI topics) with Q/A | 38k | 1.5k | 4 | CC BY-SA 4.0 (please cite) |
| ORFEO/Valibel (2/2) | Formal university addresses | 12k | 5 | 5 | CC BY-NC-SA 4.0 (please cite) |

English

| Dataset | Description | Words | Turns | Conversations | License (and conditions) |
|---|---|---|---|---|---|
| **Parliamentary Proceedings** | | | | | |
| Europarl | The Europarl parallel corpus | 56M | 214K | 11K | No copyright restrictions. If you use this data in your research, please contact [email protected] |
| **Spoken Dialogue** | | | | | |
| Charlotte Narratives | The Charlotte Narrative and Conversation Collection (CNCC) contains 95 narratives, conversations and interviews representative of the residents of Mecklenburg County, North Carolina and surrounding North Carolina communities. | 200K | 2.7K | 93 | Available for download and use for research and development, including commercial development |
| Switchboard | The corpus consists of approximately 260 hours of speech and was originally collected by Texas Instruments in 1990-1, under DARPA sponsorship. | 3M | 290K | 2320 | LDC User Agreement for Non-Members |
| **Broadcast** | | | | | |
| MediaSum (GitHub) | MediaSum dataset for summarization. A collection of transcripts of CNN and NPR interviews with short summaries. | 720M | 13M | 458K | For research purposes only |
| **Meetings** | | | | | |
| AMI (project page) | The AMI Meeting Corpus is a multi-modal data set consisting of 100 hours of meeting recordings. | 712K | 75K | 139 | CC BY 4.0 |
| ICSI (project page) | About 70 hours of meeting recordings. | 804K | 64K | <1K | CC BY 4.0 |
| **Assistance** | | | | | |
| ReDial (GitHub) | ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where users recommend movies to each other. | 1.5M | 139K | 11K | CC BY 4.0 |
| OpenDialKG (GitHub) | OpenDialKG is a dataset of conversations between two crowdsourcing agents engaging in a dialog about a given topic. | 1M | 84K | 12K | CC-BY-NC-4.0 |
| ABCD (GitHub) | Action-Based Conversations Dataset. | 1.5M | 142K | 10K | MIT |
| AirDialogue (GitHub) | AirDialogue is a benchmark dataset for goal-oriented dialogue generation research. | 37M | 4.6M | 361K | Apache License 2.0 |
| MULTIWOZ2_2 (pfb30) | Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning multiple domains and topics. | 1.9M | 143K | 10.4K | Apache License 2.0 |
| MulDoGO2 (GitHub) | Conversations from the airline, fastfood, finance, insurance, media, and software domains. | 10M | 892K | 63K | CDLA Permissive License |
| **Free Chat** | | | | | |
| Chit-Chat (GitHub) | Open-domain conversational dataset from the BYU Perception, Control & Cognition lab's Chit-Chat Challenge. | 2.3M | 7.1K | 258K | MIT License |
| DailyDialog | High-quality multi-turn dialog dataset. | 1.2M | 102K | 13K | CC BY-NC-SA 4.0 |
| **Misc** | | | | | |
| British National Corpus (BNC) | Collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English, both spoken and written, from the late twentieth century. | 110M | 663K | 0.9K | BNC License |

Normalized datasets

Contact

[email protected]
