A curated list of awesome libraries, resources, services, and datasets for the Tatar language.
Table of Contents
- LLMs
- Parallel corpora
- Monocorpora
- Audio datasets
- Other datasets
- Cyrillic-Latin converters
- Text-to-speech & speech-to-text
- Language corpus
- Language analyzers
- Volunteer localization projects
- Localization guides
- Browser plugins
- tweety-tatar-base - LLM for the Tatar language, converted from the Mistral-7B-Instruct-v0.2 model trained by MistralAI.
- mGPT-1.3B-tatar - A model derived from the base mGPT-XL (1.3B) model, which was originally trained by SberAI on 61 languages from 25 language families using Wikipedia and the C4 corpus.
- IPSAN's parallel corpus dataset - Dataset collected by the Institute of Applied Semiotics.
- Aygiz Kunafin's dataset - Dataset collected by language enthusiast Aygiz Kunafin.
- The Open Parallel Corpus (tt-en, tt-ru) - Miscellaneous parallel corpora (⚠ requires data cleaning and preparation).
- Apertium's language pair - Apertium language pairs: Tatar to Russian (in incubator) and Kazakh-Tatar (in production).
- uonlp/CulturaX - Multilingual dataset from the University of Oregon.
- Neurotatarlar's dataset - Our dataset includes processed books and documents crawled from the internet.
- Azatliq's crawled document - Dataset of documents crawled from the Azatliq website.
- Mozilla Common Voice - Open voice dataset powered by volunteer contributors.
- TatSC - Tatar Speech Corpus published by ISSAI.
- SART - Datasets of Similarity, Analogies, and Relatedness for the Tatar language.
- MMS-1b-tatar - Fine-tuned ASR model for the Tatar language.
- speech.tatar - Read-aloud service powered by the Institute of Applied Semiotics.
- Tatsoft ASR - API for the automatic speech recognition system for the Tatar language, provided by Tatsoft.
- Tatsoft TTS - API for the text-to-speech synthesis system for the Tatar language, provided by Tatsoft.
- TatarSCR - An open-source Tatar Speech Commands Dataset.
- Silero Models - Pre-trained STT / TTS models with Tatar language support. A minimal working example can be found here.
- Massively Multilingual Speech - Open-source STT / TTS initiative for thousands of languages.
- TurkicTTS - A multilingual text-to-speech synthesis system for 10 Turkic languages.
- RHVoice - A free and open-source speech synthesizer with Tatar language support.
- Apertium's language analyzer
- Turkic Morpheme portal and API - Morphological analyzer for Turkic languages, including Tatar.
- LibreOffice - Localization of the free and open-source LibreOffice suite.
- Mozilla Firefox - Localization of free and open-source Mozilla projects.
- Minecraft - Localization of the legendary Minecraft.
- Mastodon - Localization of the free and open-source social network Mastodon (source).
- Warzone 2100 - Localization of the real-time strategy game Warzone 2100 (source).
- Steam - Unofficial localization of Steam.
- Wikipedia - Community of Wikipedia volunteers.
- Ubuntu - Localization of Ubuntu OS into the Tatar language.
- LineageOS AOSP - Localization of the biggest Android open source fork.
- Tatar Style Guide - Official localization style guide used by Microsoft.
- Microsoft terminology search - Official translations used in Microsoft's products.
- tatarspeech (beta) - Real-time YouTube video translation into Tatar.
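
Some of the parallel corpora listed above, such as The Open Parallel Corpus, are flagged as needing cleaning before use. The snippet below is a minimal sketch of what such a cleaning pass might look like; it is not taken from any of the listed projects. The function names `normalize` and `clean_pairs`, the `max_len_ratio` threshold of 3.0, and the sample sentences are all illustrative assumptions, and real corpora will likely need additional, corpus-specific filters.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_pairs(pairs, max_len_ratio=3.0):
    """Yield cleaned (source, target) pairs, skipping obviously bad ones.

    max_len_ratio is an assumed heuristic threshold, not a recommended value.
    """
    seen = set()
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue  # one side is empty
        if src == tgt:
            continue  # untranslated copy of the source
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_len_ratio:
            continue  # lengths differ too much, likely misaligned
        if (src, tgt) in seen:
            continue  # exact duplicate after normalization
        seen.add((src, tgt))
        yield src, tgt


# Hypothetical sample data, for illustration only.
raw = [
    ("Сәлам,  дөнья!", "Hello, world!"),
    ("Сәлам, дөнья!", "Hello, world!"),  # duplicate once whitespace is collapsed
    ("", "Orphan line"),
    ("Рәхмәт", "Рәхмәт"),
    ("Әйе", "Yes, this target is far too long to be a translation of one word"),
]
cleaned = list(clean_pairs(raw))
print(cleaned)
```

The filters run in a fixed order (empty, identical, length ratio, duplicate), and only pairs that pass all of them are remembered in `seen`, so memory grows with the size of the cleaned output rather than the raw input.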