East Caucasian villages: coordinates and languages

This repository contains a dataset with a list of villages in the eastern Caucasus, their coordinates and the language spoken there. It can be used to plot maps on East Caucasian languages or the languages of Dagestan using the Lingtypology package for R.

Feel free to use the data. If you find any mistakes, please create an issue here on GitHub.

The data are available as tab-separated CSV files and as XLSX files.
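A minimal sketch of reading the tab-separated format with Python's standard library. The sample rows are invented placeholders, not real entries from the dataset, and the function name `read_villages` is hypothetical; only the column names are taken from the Data section of this README.

```python
import csv
import io

# Placeholder rows in the same tab-separated layout; all values are made up.
SAMPLE_TSV = (
    "id\tvillage\tlat\tlon\tlang\n"
    "1\tExample-A\t42.0\t47.0\tAvar\n"
    "2\tExample-B\t42.5\t46.5\tLak\n"
)


def read_villages(handle):
    """Parse a tab-separated file into a list of dicts with float coordinates."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["lat"] = float(row["lat"])
        row["lon"] = float(row["lon"])
        rows.append(row)
    return rows


villages = read_villages(io.StringIO(SAMPLE_TSV))
print(len(villages), villages[0]["village"], villages[0]["lat"])
```

To read a downloaded copy instead, pass an opened file, for example `open(path, encoding="utf-8", newline="")`, where `path` is wherever you saved the CSV; the UTF-8 encoding is an assumption.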

Cite

Moroz, George, & Verhees, Samira. (2020). East Caucasian villages dataset (Version v1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3824151



Acknowledgements

The first batch of data (all villages of Dagestan and the language spoken there) was mined by George Moroz. Daria Ignatenko (a student of the School of Linguistics at HSE University Moscow) also worked on the first version of the script.

The second batch (all villages of Chechnya and Ingushetia) was mined by George Moroz. Some inaccuracies in the data were corrected in the process.

The third batch was created manually and contains the Avar-speaking villages in the Zaqatala and Belokan regions of Azerbaijan, and the Bezhta and Avar villages in the Kakheti region of Georgia. Thanks to Matt Zaslansky for his help locating some villages in Zaqatala. Chechen-speaking villages in Georgia were added following an issue created by Jesse Wichers Schreur.

The fourth (separate) batch was created by collapsing Yury Koryakov's database with census and language information. All census data were merged by George Moroz; Cyrillic and Latin village name correspondences were made by Lev Kazakevich. All of these data were finally merged with the previous version of the villages dataset by George Moroz. See the files in the folder data > extra.

The fifth batch of data consists of an annotation of dialect affiliations for the villages, based on available literature. This task was carried out by Inga Kartozia and Kirill Koncha (both students of the School of Linguistics at HSE University Moscow). The data were subsequently edited by me (Samira Verhees), during which I got a lot of help from Yuri Koryakov.

Contact

If you have any questions about the dataset, send an email to jh.verhees at gmail.

Projects

Typological atlas of the languages of Daghestan created by a team from the Linguistic Convergence Laboratory (beta version)
Database of Gender Systems in Nakh-Daghestanian languages created by Inga Kartozia

Data

The latest version of the dataset (villages_cor) consists of a single table with the following parameters:

id - unique id for each entry
village - name of the village in Latin script
rus_village - name of the village in Cyrillic script
lat - latitudinal coordinates
lon - longitudinal coordinates
lang - language spoken in the village
aff - branch to which the language belongs; mentions only uncontroversial group membership and takes an agnostic stance towards grouping among branches
family - language family
republic - republic where the language is spoken
region - administrative district to which the village belongs (still mostly empty)
elevation - altitude of the village (approximately)
kutans - whether the village is a relatively new settlement in the north of Dagestan or not
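As a rough sketch of working with these columns, the snippet below counts villages per language and selects the villages of one language above a given elevation. The rows are invented placeholders, the helper names are hypothetical, and it assumes that elevation is given in meters and that a missing elevation is stored as an empty cell.

```python
from collections import Counter

# Placeholder rows using column names from the list above; values are made up.
rows = [
    {"village": "Example-A", "lang": "Avar", "elevation": "1500"},
    {"village": "Example-B", "lang": "Avar", "elevation": ""},
    {"village": "Example-C", "lang": "Lak", "elevation": "900"},
]


def count_by_language(rows):
    """Return the number of villages recorded for each language."""
    return Counter(row["lang"] for row in rows)


def villages_above(rows, lang, min_elevation):
    """Names of villages of one language at or above a given elevation."""
    selected = []
    for row in rows:
        raw = (row.get("elevation") or "").strip()
        # Skip other languages and rows without an elevation value (assumed empty).
        if row["lang"] != lang or not raw:
            continue
        if float(raw) >= min_elevation:
            selected.append(row["village"])
    return selected


counts = count_by_language(rows)
high_avar = villages_above(rows, "Avar", 1000)
print(counts, high_avar)
```

Rows without an elevation are skipped rather than treated as zero, so they never appear in an elevation-based selection.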

The dialect levels are organized from macro-groups like Southern Avar (dialect_toplevel) to village varieties (village_dialect); some groups may show internal branching (all nt - non-toplevel columns). The name for each dialect / level is given in English/Latin and Russian/Cyrillic (column names ending in cyr).

Note that village_dialect simply duplicates the column with village names (village); their Russian equivalents can thus be found in rus_village.

dialect_toplevel
dialect_toplevel_cyr
dialect_nt1
dialect_nt1_cyr
dialect_nt2
dialect_nt2_cyr
dialect_nt3
dialect_nt3_cyr
village_dialect
source - source for information on dialect affiliation; the reference is a BibTeX key; full bibliographical information can be found in the file bib (will be uploaded shortly)
page - relevant page from the source
gltl_village_dialect - if the village dialect is mentioned in the Glottolog database, its name is displayed here
gltc_village_dialect - if the village dialect is mentioned in the Glottolog database, its glottocode is displayed here
gltl_dialect - if one of our dialects corresponds to a "languoid" from Glottolog, its name is displayed here. In each case I took the lowest available level from Glottolog. For example, for Tsez, Glottolog distinguishes Nuclear Tsez - Kidero, so I entered Kidero as the corresponding level for all villages belonging to this dialect
gltc_dialect - corresponding glottocode for the dialects in gltl_dialect
gltc_lang - glottocode for the languages in our dataset
lang_col - standard color palette for languages (see the example maps)
aff_col - standard color palette for branches (see the example maps)
comment - any additional comments on a datapoint
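The dialect columns can be read as a path from the macro-group down to the village variety. The sketch below collects the filled-in levels for one row; it assumes that unused levels are left empty, the helper name `dialect_path` is hypothetical, and the sample row is an invented placeholder (only "Southern Avar" is taken from the description above).

```python
# Dialect columns from the most general level to the most specific one.
DIALECT_LEVELS = [
    "dialect_toplevel",
    "dialect_nt1",
    "dialect_nt2",
    "dialect_nt3",
    "village_dialect",
]


def dialect_path(row, cyrillic=False):
    """List the non-empty dialect levels of a row, from macro-group to village."""
    path = []
    for column in DIALECT_LEVELS:
        if cyrillic:
            # village_dialect has no _cyr column; its Russian name is in rus_village.
            column = "rus_village" if column == "village_dialect" else column + "_cyr"
        value = (row.get(column) or "").strip()
        if value:
            path.append(value)
    return path


# Placeholder row: the subgroup and village names are made up.
row = {
    "village": "Example-A",
    "dialect_toplevel": "Southern Avar",
    "dialect_nt1": "Example subgroup",
    "dialect_nt2": "",
    "dialect_nt3": "",
    "village_dialect": "Example-A",
}

print(" > ".join(dialect_path(row)))
```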

Legacy data

This section describes the structure of the files villages and meta. They constitute an older version of the dataset.

The dataset is divided into two tables containing the following parameters:

Villages

A list of villages in the Republic of Dagestan, the Chechen and Ingush Republics, and adjacent regions where East Caucasian languages are spoken.

id - unique id for each entry
village - name of the village in Latin script
lat - latitudinal coordinates
lon - longitudinal coordinates
lang - language spoken in the village
idiom - dialect or local variety spoken in a village (still mostly empty)
republic - republic where the language is spoken
region - administrative district to which the village belongs (still mostly empty)
elevation - altitude of the village (approximately)
kutans - whether the village is a relatively new settlement in the north of Dagestan or not

Metadata

The metadata file was based on a list of the traditionally recognized languages of the East Caucasian family and some additional idioms were added later. The addition of idioms and the annotation of villages for idiom is still at an early stage, and is not carried out in a very systematic way. (For example, I added Sanzhi simply because there is a grammar for Sanzhi, so we might want to display information from it on the map.)

For general maps, you can simply filter out the 29 East Caucasian languages that are usually distinguished (+ the four non-EC languages spoken in the area) using the core parameter.

lang - name of the language used in the dataset
idiom - dialect or local variety spoken in a village
core - yes: 29 traditionally recognized East Caucasian languages + local Turkic languages and Armenian; no: additional idioms
aff - genealogical group to which the language belongs. The division in groups is a mixture of higher and mid-level branches based on personal preferences. Other options and more detailed branching can be accessed using basic Lingtypology syntax
family - language family to which the language belongs; used to distinguish the few non-East Caucasian languages spoken in the area
glottocode - glottocode of the idiom, which can be used to access background information and alternative names for the language via the Glottolog database
gltc_lat - latitudinal coordinates for a generalized datapoint for the idiom in question; idioms limited to one village have the village coordinates
gltc_lon - longitudinal coordinates for a generalized datapoint for the idiom in question; idioms limited to one village have the village coordinates
general_location - the nature of the location for the coordinates in this file. In most cases a generalized datapoint from Glottolog is used
lang_color - color scheme with a unique color for each language
lang_color_pale - lighter shades of the previous color scheme, see the sample maps
aff_color - color scheme with a unique color for each genealogical group
lang_color_dagtlas - an alternative color scheme with a unique color for each genealogical group
villages_marked - specifies whether an (additional) idiom is marked in the dataframe with villages
comment - any kind of comment on the datapoint
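A small sketch of how the core parameter might be used to keep only the traditionally recognized languages and look up a color for each village. The rows and color values are invented placeholders, the helper names are hypothetical, and the idiom value shown for the core row is an assumption about how the file is filled in.

```python
# Placeholder metadata rows; colors and the core row's idiom value are made up.
meta = [
    {"lang": "Dargwa", "idiom": "Dargwa", "core": "yes", "lang_color": "#aaaaaa"},
    {"lang": "Dargwa", "idiom": "Sanzhi", "core": "no", "lang_color": "#bbbbbb"},
]

# Placeholder village rows with only the columns needed here.
villages = [
    {"village": "Sanzhi", "lang": "Dargwa"},
]


def core_languages(meta_rows):
    """Keep only the rows marked as core (core == "yes")."""
    return [row for row in meta_rows if row["core"] == "yes"]


def color_by_language(meta_rows):
    """Map each core language name to its lang_color value."""
    return {row["lang"]: row["lang_color"] for row in core_languages(meta_rows)}


colors = color_by_language(meta)
# Villages whose language is not a core language get None as their color.
village_colors = [(v["village"], colors.get(v["lang"])) for v in villages]
print(village_colors)
```

Matching on lang alone deliberately ignores the additional idioms, which is what a general map needs; a map that distinguishes idioms would have to match on the idiom column as well.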

Updates

04.04.2019 - village: Sary-su language: ~~Chechen~~ Nogai; village: Vinogradnoe (Chechnya) language: ~~Chechen~~ Kumyk; village: Braguny language: ~~Chechen~~ Kumyk

05.05.2019 - George Moroz added parameter altitude (in meters above sea-level).

28.05.2019 - added datapoint for Kurush (Dokuzparinsky district).

03.10.2019 - added the fourth batch of data (see above); village Chantliskure: changed name to Chantlisqure and language ~~Hinukh~~ Bezhta; added villages: Duisi, Dzibakhevi, Dzhokolo, Shua Khalatsani, Birkiani, Omalo (Pankisi) - language: Chechen. (Altitude will be added for these villages later.)

21.02.2020 - restructured the data; added the village Sanzhi, idiom: Sanzhi, language: Dargwa

13.05.2020 - added Tat language (location: Derbent); added Georgian language; added locations Tlyarata (Avar); Tsunta (Tsez); Qum (Tsakhur); Qax (Azerbaijani); Ilisu (Azerbaijani, historically Tsakhur); Alibeglo (Georgian); Meshabash (Georgian); added .xlsx files of the datasets because some people who use Windows have problems opening the .csv files; added parameter "kutans": this helps to filter out the northern part of Dagestan, which was inhabited relatively recently and consists of a mish-mash of ethnicities and languages. In addition, very little to nothing is known about the varieties spoken there; added an "id" column to prevent problems with villages that have the same name; created an official release for reference, following an issue by George Moroz.

24.06.2020 - updated acknowledgements.

15.01.2021 - added dialect annotation; added Cyrillic village names; removed Glavnyy-Kut (empty); updated coordinates for Siukh (1478); renewed page with example maps.
