Claire-datasets

This repository lists public conversational datasets in text formats.

Raw datasets
- French
- English
Normalized datasets
Contact

Raw datasets

French

Dataset	Description	Words	Turns	Conversations	License (and conditions)
Parliamentary Proceedings
Assemblée Nationale	Parliamentary proceedings from the French National Assembly	133M	1.6M	4.5k	Open License 2.0
Theatre
Theatre Classique	Classic stage plays	12.8M	441k	25k	CC BY-NC-SA 4.0 (please cite)
Theatre Gratuit	Stage plays	2.7M	155k	4k
Interviews
ESLO (1/5)	Guided conversations	4.2M	329k	399	CC BY-NC-SA 4.0 (please cite)
TCOF (adults)	Guided conversations (between adults)	765k	49k	237	CC BY-NC-SA 2.0 (please cite)
CFPP	Interviews of people in Paris in 2000	608k	48k	42	CC BY-NC-SA 3.0 (please cite)
ORFEO/Valibel (1/2)	Guided conversations of Belgian French speakers	458k	19k	67	CC BY-NC-SA 4.0 (please cite)
PFC (1/2)	Guided interviews	268k	15k	173	CC BY-NC-SA 4.0 (please cite)
ORFEO/CFPB	Interviews of people in Brussels	138k	11k	12	CC BY-NC-SA 4.0
ACSYNT	Guided interviews from southwestern France	61k	2.7k	144	CC BY-SA 4.0 (please cite)
Free Conversations
OFROM	Conversations in French-speaking Switzerland	590k	44k	151	CC BY-NC-SA 3.0 (please cite)
ESLO (2/5)	Diverse conversations	480k	47k	98	CC BY-NC-SA 4.0 (please cite)
ORFEO/CRFP	Diverse conversations	405k	9k	124	CC BY-NC-SA 4.0 (please cite)
ORFEO/C-ORAL-ROM	Diverse conversations	248k	6k	152	CC BY-NC-SA 4.0 (please cite)
PFC (2/2)	Diverse conversations	230k	14k	146	CC BY-NC-SA 4.0 (please cite)
CLAPI	Diverse conversations	122k	15k	14	CC BY-NC-SA 4.0
CID	Dialogues between two friends	118k	9k	8	CC BY-NC-SA 4.0 (please cite)
Rhapsodie	Diverse conversations	28k	1k	41	CC BY-NC-SA 3.0 (please cite)
Paris Stories	Diverse conversations in Paris	28k	351	54	CC BY-SA 4.0
LinTO (1/3)	Diverse conversations	26k	2k	4	CC BY-SA 4.0 (please cite)
Meetings
SUMM-RE	Meeting-style conversations (transcribed with Whisper large-v2 ASR)	1.3M	39k	283	CC BY-SA 4.0 (please cite)
ORFEO/Reunions-de-Travail	Real meetings	210k	12k	29	CC BY-NC-SA 4.0
LinTO (2/3)	Meetings on speech recognition	41k	1.8k	6	CC BY-SA 4.0 (please cite)
Debates
FREDSum	French political debates	406k	7k	144	CC BY-SA 4.0 (please cite)
ESLO (3/5)	Conferences	76k	2k	4	CC BY-NC-SA 4.0 (please cite)
Assistance
ESLO (4/5)	In-person assistance and call-centers	95k	11k	143	CC BY-NC-SA 4.0 (please cite)
ORFEO/Fleuron	Interactions created to teach foreign students about university life	33k	2k	51	CC BY-NC-SA 4.0 (please cite)
OTG	Dialogues in a tourism office	27k	4k	315	CC BY-SA 3.0 (contact before usage)
Accueil UBS	University telephone answering service	7.2k	1k	41	CC BY-SA 3.0 (contact before usage)
Presentation, Formal Address
ESLO (5/5)	Conference presentations	43k	120	9	CC BY-NC-SA 4.0 (please cite)
LinTO (3/3)	Technical presentations (AI topics) with Q/A	38k	1.5k	4	CC BY-SA 4.0 (please cite)
ORFEO/Valibel (2/2)	Formal university addresses	12k	5	5	CC BY-NC-SA 4.0 (please cite)

English

Dataset	Description	Words	Turns	Conversations	License (and conditions)
Parliamentary Proceedings
Europarl	The Europarl parallel corpus	56M	214K	11K	No copyright restrictions. If you use this data in your research, please contact [email protected]
Spoken Dialogue
Charlotte Narratives	The Charlotte Narrative and Conversation Collection (CNCC) contains 95 narratives, conversations and interviews representative of the residents of Mecklenburg County, North Carolina and surrounding North Carolina communities.	200K	2.7K	93	Available for download and use for research and development, including commercial development
Switchboard	The corpus consists of approximately 260 hours of speech and was originally collected by Texas Instruments in 1990-1991, under DARPA sponsorship.	3M	290K	2320	LDC User Agreement for Non-Members
Broadcast
MediaSum (GitHub)	MediaSum dataset for summarization. A collection of transcripts of CNN and NPR interviews with short summaries.	720M	13M	458K	For research purposes only
Meetings
AMI (project page)	The AMI Meeting Corpus is a multi-modal data set consisting of 100 hours of meeting recordings.	712K	75K	139	CC BY 4.0
ICSI (project page)	About 70 hours of meeting recordings.	804K	64K	<1K	CC BY 4.0
Assistance
ReDial (GitHub)	ReDial (Recommendation Dialogues) is an annotated dataset of dialogues, where users recommend movies to each other.	1.5M	139K	11K	CC BY 4.0
OpenDialKG (GitHub)	OpenDialKG is a dataset of conversations between two crowdsourcing agents engaging in a dialog about a given topic.	1M	84K	12K	CC BY-NC 4.0
ABCD (GitHub)	Action-Based Conversations Dataset.	1.5M	142K	10K	MIT
AirDialogue (GitHub)	AirDialogue is a benchmark dataset for goal-oriented dialogue generation research.	37M	4.6M	361K	Apache License 2.0
MULTIWOZ2_2 (pfb30)	Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning over multiple domains and topics.	1.9M	143K	10.4K	Apache License 2.0
MulDoGO2 (GitHub)	Conversations from the airline, fastfood, finance, insurance, media, and software domains.	10M	892K	63K	CDLA Permissive License
Free Chat
Chit-Chat (GitHub)	Open-domain conversational dataset from the BYU Perception, Control & Cognition lab's Chit-Chat Challenge.	2.3M	258K	7.1K	MIT License
DailyDialog	High-quality multi-turn dialog dataset.	1.2M	102K	13K	CC BY-NC-SA 4.0
Misc
British National Corpus (BNC)	Collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the late twentieth century.	110M	663K	0.9K	BNC License
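
The Words, Turns and Conversations columns above use abbreviated counts (`k`/`K` for thousands, `M` for millions, and `<1K` as an upper bound). To compare corpora numerically, these can be converted to integers. The following is a minimal sketch, assuming each table row is tab-separated as in this listing; the helper names `parse_count` and `parse_row` are hypothetical and not part of this repository.

```python
import re

# Multipliers for the suffixes used in the tables (case-insensitive).
_MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000}


def parse_count(text: str) -> int:
    """Convert an abbreviated count such as '4.5k', '133M' or '351' to an int.

    A leading '<' (as in '<1K') is ignored, so the result is an upper bound.
    """
    match = re.fullmatch(r"<?\s*(\d+(?:\.\d+)?)\s*([kKmM]?)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized count: {text!r}")
    number, suffix = match.groups()
    return round(float(number) * _MULTIPLIERS[suffix.lower()])


def parse_row(line: str) -> dict:
    """Split one tab-separated dataset row into named fields.

    The license field is optional because some rows do not list one.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 tab-separated fields: {line!r}")
    name, description, words, turns, conversations = fields[:5]
    return {
        "name": name,
        "description": description,
        "words": parse_count(words),
        "turns": parse_count(turns),
        "conversations": parse_count(conversations),
        "license": fields[5] if len(fields) > 5 else "",
    }


if __name__ == "__main__":
    row = parse_row(
        "CID\tDialogues between two friends\t118k\t9k\t8\t"
        "CC BY-NC-SA 4.0 (please cite)"
    )
    # Average number of words per turn for this corpus.
    print(row["name"], row["words"] / row["turns"])
```

Category rows such as "Interviews" or "Meetings" contain a single field, so `parse_row` raises `ValueError` for them; a caller can catch that to skip or track the current category.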

Normalized datasets

Claire French Dialogue Dataset (CFDD) 🇫🇷
- Dataset: 🤗 OpenLLM-France/Claire-Dialogue-French-0.1
- Paper: Julie Hunter, Jérôme Louradour, Virgile Rennard, Ismaïl Harrando, Guokan Shang, Jean-Pierre Lorré « Claire French Dialogue Dataset » (2023)
Claire English Dialogue Dataset (CEDD) 🇬🇧
- Dataset: 🤗 OpenLLM-France/Claire-Dialogue-English-0.1

Contact

[email protected]
