For details about this dataset please see telekom/wikipedia-22-12-de-dpr on GitHub.
This dataset was compiled and open-sourced by Philip May of Deutsche Telekom.
Copyright (c) 2023-2024 Philip May, Deutsche Telekom AG
Licensed under the MIT License (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License by reviewing the file LICENSE in the repository.
The Wikipedia texts are licensed under CC BY-SA 4.0 Deed by the corresponding authors of the German Wikipedia. The questions and imperative questions are copyright (CC BY-SA 4.0 Deed) by Philip May, Deutsche Telekom AG.

Indication of changes:
- data source is the Cohere/wikipedia-22-12-de-embeddings dataset on Hugging Face Hub
- we took `wiki_id`, `title` and `text`
- did some normalization and filtering
- and merged the texts to an appropriate token count
- details can be found in the respective notebooks
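
The steps listed above can be illustrated with a minimal sketch. It is not the code from the notebooks: the function name `merge_passages`, the `max_tokens` parameter, the whitespace-based token count, and the whitespace-collapsing normalization are all assumptions made for illustration. The actual normalization, filtering, and tokenizer are documented in the repository's notebooks.

```python
from itertools import groupby


def merge_passages(rows, max_tokens=5):
    """Merge consecutive passages of the same article up to a token budget.

    Hypothetical helper: `rows` is an iterable of dicts with the keys
    `wiki_id`, `title` and `text`, already ordered so that passages of
    one article are adjacent. Tokens are approximated by whitespace
    splitting, which is an assumption, not the dataset's tokenizer.
    """
    merged = []
    for wiki_id, group in groupby(rows, key=lambda r: r["wiki_id"]):
        buffer, count, title = [], 0, None
        for row in group:
            # Assumed normalization: collapse runs of whitespace.
            text = " ".join(row["text"].split())
            if not text:
                continue  # assumed filtering: drop empty passages
            n_tokens = len(text.split())
            # Flush the buffer before it would exceed the budget.
            if buffer and count + n_tokens > max_tokens:
                merged.append(
                    {"wiki_id": wiki_id, "title": title, "text": " ".join(buffer)}
                )
                buffer, count = [], 0
            buffer.append(text)
            count += n_tokens
            title = row["title"]
        if buffer:
            merged.append(
                {"wiki_id": wiki_id, "title": title, "text": " ".join(buffer)}
            )
    return merged
```

A passage that is longer than `max_tokens` on its own is kept as a single entry rather than split, and passages are never merged across different `wiki_id` values.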