This repository contains a dataset with a list of villages in the eastern Caucasus, their coordinates and the languages spoken there. It can be used to plot maps of East Caucasian languages or the languages of Dagestan using the Lingtypology package for R.
Feel free to use the data. If you find any mistakes, please create an issue here on GitHub.
The data are available as TAB-separated CSV files and as XLSX files.
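Because the CSV files are TAB-separated, the delimiter has to be set explicitly when reading them. The following is a minimal sketch using Python's standard `csv` module. The sample rows are invented placeholders, not values from the dataset, and the file name in the trailing comment is an assumption about how a local copy might be named.

```python
import csv
import io

# Placeholder rows standing in for the real file; the values are invented
# for illustration and are NOT taken from the dataset.
SAMPLE_TSV = (
    "id\tvillage\tlat\tlon\tlang\n"
    "1\tVillageA\t42.0\t47.0\tLanguageX\n"
    "2\tVillageB\t42.5\t46.5\tLanguageY\n"
)


def read_villages(handle):
    """Parse tab-separated rows into dicts, converting coordinates to floats."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["lat"] = float(row["lat"])
        row["lon"] = float(row["lon"])
        rows.append(row)
    return rows


villages = read_villages(io.StringIO(SAMPLE_TSV))
print(villages[0]["village"], villages[0]["lat"], villages[0]["lon"])

# With a local copy of the data (the path is an assumption):
# with open("villages_cor.csv", encoding="utf-8", newline="") as f:
#     villages = read_villages(f)
```

Opening the file as UTF-8 matters here because the dataset also contains village names in Cyrillic script.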
Moroz, George, & Verhees, Samira. (2020). East Caucasian villages dataset (Version v1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3824151
The first batch of data (all villages of Dagestan and the language spoken there) was mined by George Moroz. Daria Ignatenko (a student of the School of Linguistics at HSE University Moscow) also worked on the first version of the script.
The second batch (all villages of Chechnya and Ingushetia) was mined by George Moroz. Some inaccuracies in the data were corrected in the process of working with them.
The third batch was created manually and contains the Avar-speaking villages in the Zaqatala and Belokan regions of Azerbaijan, and the Bezhta and Avar villages in the Kakheti region of Georgia. Thanks to Matt Zaslansky for his help locating some villages in Zaqatala. Chechen-speaking villages in Georgia were added following an issue created by Jesse Wichers Schreur.
The fourth (separate) batch was created by collapsing Yuri Koryakov's database with census and language information. All census data were merged by George Moroz; Cyrillic and Latin village name correspondences were made by Lev Kazakevich. All of these data were finally merged with the previous version of the villages dataset by George Moroz. See the files in the folder data > extra.
The fifth batch of data consists of an annotation of dialect affiliations for the villages, based on available literature. This task was carried out by Inga Kartozia and Kirill Koncha (both students of the School of Linguistics at HSE University Moscow). The data were subsequently edited by me (Samira Verhees), during which I got a lot of help from Yuri Koryakov.
If you have any questions about the dataset, send an email to jh.verhees at gmail.
- Typological atlas of the languages of Daghestan created by a team from the Linguistic Convergence Laboratory (beta version)
- Database of Gender Systems in Nakh-Daghestanian languages created by Inga Kartozia
The latest version of the dataset (villages_cor) consists of a single table with the following parameters:
- id - unique id for each entry
- village - name of the village in Latin script
- rus_village - name of the village in Cyrillic script
- lat - latitudinal coordinates
- lon - longitudinal coordinates
- lang - language spoken in the village
- aff - branch to which the language belongs; mentions only uncontroversial group membership and takes an agnostic stance towards grouping among branches
- family - language family
- republic - republic where the language is spoken
- region - administrative district to which the village belongs (still mostly empty)
- elevation - altitude of the village (approximately)
- kutans - whether the village is a relatively new settlement in the north of Dagestan or not
The dialect levels are organized from macro-groups like Southern Avar (dialect_toplevel) to village varieties (village_dialect); some groups may show internal branching (all nt - non-toplevel columns). The name for each dialect / level is given in English/Latin and Russian/Cyrillic (column names ending in cyr).
Note that village_dialect simply duplicates the column with village names (village); their Russian equivalents can thus be found in rus_village.
- dialect_toplevel
- dialect_toplevel_cyr
- dialect_nt1
- dialect_nt1_cyr
- dialect_nt2
- dialect_nt2_cyr
- dialect_nt3
- dialect_nt3_cyr
- village_dialect
- source - source for information on dialect affiliation; the reference is a BibTeX key; full bibliographical information can be found in the file bib (will be uploaded shortly)
- page - relevant page from the source
- gltl_village_dialect - if the village dialect is mentioned in the Glottolog database, its name is displayed here
- gltc_village_dialect - if the village dialect is mentioned in the Glottolog database, its glottocode is displayed here
- gltl_dialect - if one of our dialects corresponds to a "languoid" from Glottolog, its name is displayed here. In each case I took the lowest available level from Glottolog. For example, for Tsez, Glottolog distinguishes Nuclear Tsez - Kidero, so I entered Kidero as the corresponding level for all villages belonging to this dialect
- gltc_dialect - corresponding glottocode for the dialects in gltl_dialect
- gltc_lang - glottocode for the languages in our dataset
- lang_col - standard color palette for languages (see the example maps)
- aff_col - standard color palette for branches (see the example maps)
- comment - any additional comments on a datapoint
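As an illustration of how these columns can be combined, the sketch below counts villages per language while leaving out the kutans. The rows are invented placeholders, and the encoding of the `kutans` column as "yes"/"no" is an assumption; check the actual values in the table and pass the appropriate one as `kutan_value`.

```python
from collections import Counter

# Invented placeholder rows; "yes"/"no" as the encoding of `kutans`
# is an assumption, not something the dataset description specifies.
rows = [
    {"village": "VillageA", "lang": "LanguageX", "kutans": "no"},
    {"village": "VillageB", "lang": "LanguageX", "kutans": "yes"},
    {"village": "VillageC", "lang": "LanguageY", "kutans": "no"},
]


def villages_per_language(rows, kutan_value="yes"):
    """Count villages per language, skipping rows flagged as kutans."""
    return Counter(r["lang"] for r in rows if r["kutans"] != kutan_value)


counts = villages_per_language(rows)
print(counts.most_common())
```

The same pattern works for any other column, for example grouping by `aff` or `dialect_toplevel` instead of `lang`.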
This section describes the structure of the files villages and meta. They constitute an older version of the dataset.
The dataset is divided into two tables containing the following parameters:
A list of villages in the Republic of Dagestan, the Chechen and Ingush Republics, and adjacent regions where East Caucasian languages are spoken.
- id - unique id for each entry
- village - name of the village in Latin script
- lat - latitudinal coordinates
- lon - longitudinal coordinates
- lang - language spoken in the village
- idiom - dialect or local variety spoken in a village (still mostly empty)
- republic - republic where the language is spoken
- region - administrative district to which the village belongs (still mostly empty)
- elevation - altitude of the village (approximately)
- kutans - whether the village is a relatively new settlement in the north of Dagestan or not
The metadata file was based on a list of the traditionally recognized languages of the East Caucasian family and some additional idioms were added later. The addition of idioms and the annotation of villages for idiom is still at an early stage, and is not carried out in a very systematic way. (For example, I added Sanzhi simply because there is a grammar for Sanzhi, so we might want to display information from it on the map.)
For general maps, you can use the core parameter to select only the 29 East Caucasian languages that are usually distinguished (+ the four non-EC languages spoken in the area).
- lang - name of the language used in the dataset
- idiom - dialect or local variety spoken in a village
- core - yes: 29 traditionally recognized East Caucasian languages + local Turkic languages and Armenian; no: additional idioms
- aff - genealogical group to which the language belongs. The division into groups is a mixture of higher and mid-level branches based on personal preferences. Other options and more detailed branching can be accessed using basic Lingtypology syntax
- family - language family to which the language belongs; used to distinguish the few non-East Caucasian languages spoken in the area
- glottocode - glottocode of the idiom, which can be used to access background information and alternative names for the language via the Glottolog database
- gltc_lat - latitudinal coordinates for a generalized datapoint for the idiom in question; idioms limited to one village have the village coordinates
- gltc_lon - longitudinal coordinates for a generalized datapoint for the idiom in question; idioms limited to one village have the village coordinates
- general_location - the nature of the location for the coordinates in this file. In most cases a generalized datapoint from Glottolog is used
- lang_color - color scheme with a unique color for each language
- lang_color_pale - lighter shades of the previous color scheme, see the sample maps
- aff_color - color scheme with a unique color for each genealogical group
- lang_color_dagtlas - an alternative color scheme with a unique color for each genealogical group
- villages_marked - specifies whether an (additional) idiom is marked in the dataframe with villages
- comment - any kind of comment on the datapoint
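The two tables share the lang column, so the metadata can be attached to the villages through it. The sketch below keeps only the rows marked core = yes and adds the language color to each village. All rows and color values are invented placeholders, and the helper names `core_language_colors` and `attach_colors` are hypothetical; it assumes that each core language has exactly one row in the metadata table.

```python
# Invented placeholder rows mirroring the columns described above.
villages = [
    {"id": "1", "village": "VillageA", "lang": "LanguageX"},
    {"id": "2", "village": "VillageB", "lang": "LanguageY"},
]
meta = [
    {"lang": "LanguageX", "idiom": "LanguageX", "core": "yes", "lang_color": "#111111"},
    {"lang": "LanguageY", "idiom": "LanguageY", "core": "yes", "lang_color": "#222222"},
    {"lang": "LanguageY", "idiom": "IdiomZ", "core": "no", "lang_color": "#333333"},
]


def core_language_colors(meta):
    """Map each core language to its color, ignoring additional idioms."""
    return {m["lang"]: m["lang_color"] for m in meta if m["core"] == "yes"}


def attach_colors(villages, colors):
    """Return copies of the village rows with a lang_color field added.

    Villages whose language has no core entry are dropped.
    """
    return [
        dict(v, lang_color=colors[v["lang"]])
        for v in villages
        if v["lang"] in colors
    ]


colors = core_language_colors(meta)
colored = attach_colors(villages, colors)
for row in colored:
    print(row["village"], row["lang"], row["lang_color"])
```

Building a lookup dictionary first keeps the join explicit and makes it easy to see which villages would be dropped because their language is not in the core set.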
04.04.2019 - village: Sary-su, language changed from Chechen to Nogai;
village: Vinogradnoe (Chechnya), language changed from Chechen to Kumyk;
village: Braguny, language changed from Chechen to Kumyk.
05.05.2019 - George Moroz added the parameter altitude (in meters above sea level).
28.05.2019 - added datapoint for Kurush (Dokuzparinsky district).
03.10.2019 - added the fourth batch of data (see above);
village Chantliskure: changed name to Chantlisqure and language from Hinukh to Bezhta;
added villages: Duisi, Dzibakhevi, Dzhokolo, Shua Khalatsani, Birkiani, Omalo (Pankisi) - language: Chechen. (Altitude will be added for these villages later.)
21.02.2020 - restructured the data; added the village Sanzhi, idiom: Sanzhi, language: Dargwa
13.05.2020 - added Tat language (location: Derbent); added Georgian language; added locations Tlyarata (Avar); Tsunta (Tsez); Qum (Tsakhur); Qax (Azerbaijani); Ilisu (Azerbaijani, historically Tsakhur); Alibeglo (Georgian); Meshabash (Georgian); added .xlsx files of the datasets because some people who use Windows have problems opening the .csv files; added parameter "kutans": this helps to filter out the northern part of Dagestan, which was inhabited relatively recently and consists of a mish-mash of ethnicities and languages. In addition, very little to nothing is known about the varieties spoken there; added an "id" column to prevent problems with villages that have the same name; created an official release for reference, following an issue by George Moroz.
24.06.2020 - updated acknowledgements.
15.01.2021 - added dialect annotation; added Cyrillic village names; removed Glavnyy-Kut (empty); updated coordinates for Siukh (1478); renewed page with example maps.