+ some villages
sverhees committed May 13, 2020
1 parent 9bfda71 commit 63ed59e
Showing 12 changed files with 5,622 additions and 2,621 deletions.
481 changes: 481 additions & 0 deletions README.html

Large diffs are not rendered by default.

17 changes: 9 additions & 8 deletions README.md
@@ -1,15 +1,17 @@
# East Caucasian villages: coordinates and languages

This repository contains a dataset with all East Caucasian villages, their coordinates and the language spoken there. It can be used to plot maps on East Caucasian languages or the languages of Daghestan using the [Lingtypology](https://ropensci.github.io/lingtypology/) package for R.
This repository contains a dataset with a list of villages in the eastern Caucasus, their coordinates and the language spoken there. It can be used to plot maps on East Caucasian languages or the languages of Dagestan using the [Lingtypology](https://ropensci.github.io/lingtypology/) package for R.

Feel free to use the data, but if you find any mistakes, please create an issue.
Feel free to use the data. If you find any mistakes, please create an issue.

Data format is TAB-separated CSV files.
The data are available as TAB-separated CSV files or as XLSX files.
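
Because the files carry a `.csv` extension but use tabs as the delimiter, a plain `read.csv()` call with its default comma separator will not split the columns. A minimal base R sketch, assuming a file laid out like the one below (the rows are invented for illustration; only the column names `village`, `lat`, `lon`, and `lang` follow the dataset):

```r
# Write a small tab-separated file with a .csv extension (hypothetical rows).
path <- tempfile(fileext = ".csv")
writeLines(c("village\tlat\tlon\tlang",
             "Alpha\t42.1\t46.2\tAvar",
             "Beta\t42.8\t46.9\tLak"), path)

# read.delim() uses tab as its default separator.
vill <- read.delim(path, stringsAsFactors = FALSE)

# Equivalent call with the separator spelled out.
vill2 <- read.csv(path, sep = "\t", stringsAsFactors = FALSE)

str(vill)
```

`readr::read_tsv()` does the same job if the tidyverse is already loaded.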

Note that a few villages in the dataset have the same name -- keep this in mind when you are merging data.
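
The caveat about repeated names can be illustrated with a small sketch. The two data frames below are toy data (the village names, coordinates, and the `note` column are invented); only the key columns `village`, `lat`, and `lon` follow the dataset. Merging on the name alone pairs every same-named village with every other one, while merging on name plus coordinates keeps them apart:

```r
# Toy data: two different villages share the name "Alpha" (hypothetical values).
villages <- data.frame(
  village = c("Alpha", "Alpha", "Beta"),
  lat     = c(42.1, 43.5, 42.8),
  lon     = c(46.2, 47.0, 46.9),
  lang    = c("Avar", "Dargwa", "Lak"),
  stringsAsFactors = FALSE
)
extra <- data.frame(
  village = c("Alpha", "Alpha", "Beta"),
  lat     = c(42.1, 43.5, 42.8),
  lon     = c(46.2, 47.0, 46.9),
  note    = c("first Alpha", "second Alpha", "only Beta"),
  stringsAsFactors = FALSE
)

# Joining on the name alone pairs every "Alpha" with every "Alpha".
by_name <- merge(villages, extra[, c("village", "note")], by = "village")

# Joining on name and coordinates keeps one row per village.
by_key <- merge(villages, extra, by = c("village", "lat", "lon"))

nrow(by_name)  # more rows than there are villages
nrow(by_key)   # one row per village
```

Checking `any(duplicated(villages$village))` before a merge is a cheap way to find out whether the name alone is a safe key.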


### Acknowledgements

The first batch of data (all villages of Daghestan and the language spoken there), was mined by students of the
The first batch of data (all villages of Dagestan and the language spoken there) was mined by students of the
[School of Linguistics](https://ling.hse.ru/en/) at NRU HSE Moscow, under the guidance of [George Moroz](https://github.com/agricolamz). The second batch (all villages of Chechnya and Ingushetia),
was mined by [George Moroz](https://github.com/agricolamz). Some inaccuracies in these data were corrected in the process of working with them. The third batch was created manually and contains the Avar-speaking villages in the Zaqatala and Belokan regions of Azerbaijan, and the Hunzib and Avar villages in the Kakheti region of Georgia. Thanks to Matt Zaslansky for his help locating some villages in Zaqatala. Chechen-speaking villages in Georgia were added following an issue created by [Jesse Wichers Schreur](https://github.com/JesseWS).

@@ -29,7 +31,7 @@ The dataset is divided into two tables containing the following parameters

#### Villages

A list of villages in the Republic of Daghestan, the Chechen and Ingush Republics, and adjacent regions where East Caucasian languages are spoken.
A list of villages in the Republic of Dagestan, the Chechen and Ingush Republics, and adjacent regions where East Caucasian languages are spoken.

* **village** - name of the village in Latin script
* **lat** - latitudinal coordinates
@@ -77,7 +79,6 @@ For general maps, you can simply filter out the 29 East Caucasian languages that
village Chantliskure: changed name to Chantlisqure and language ~~Hinukh~~ Bezhta;
added villages: Duisi, Dzibakhevi, Dzhokolo, Shua Khalatsani, Birkiani, Omalo (Pankisi) - language: Chechen. (Altitude will be added for these villages later.)

21.02.2020 - restructured the data;

added the village Sanzhi, idiom: Sanzhi, language: Dargwa
21.02.2020 - restructured the data; added the village Sanzhi, idiom: Sanzhi, language: Dargwa

13.05.2020 - added Tat language (location: Derbent); added Georgian language; added locations Tlyarata (Avar); Tsunta (Tsez); Qum (Tsakhur); Qax (Azerbaijani); Ilisu (Azerbaijani, historically Tsakhur); Alibeglo (Georgian); Meshabash (Georgian); added .xlsx files of the datasets because some people who use Windows have problems opening the .csv files; added parameter "kutans": this helps to filter out the northern part of Dagestan, which was inhabited relatively recently and consists of a mish-mash of ethnicities and languages. In addition, very little to nothing is known about the varieties spoken there; created an official release for reference, following an [issue](https://github.com/sverhees/master_villages/issues/6) by [George Moroz](https://github.com/agricolamz).
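
As a rough illustration of how the "kutans" parameter could be used for filtering, here is a base R sketch on toy data. The coding of the column as `"yes"`/`"no"` is an assumption made for this example, as are the village names; check the actual values in the dataset before reusing it.

```r
# Toy table; the "yes"/"no" coding of `kutans` is an assumption, not the real schema.
vill <- data.frame(
  village = c("Alpha", "Beta", "Gamma"),
  kutans  = c("no", "yes", NA),
  stringsAsFactors = FALSE
)

# Keep rows not flagged as kutans; rows with a missing flag are kept as well.
vill_core <- vill[is.na(vill$kutans) | vill$kutans != "yes", ]

vill_core$village
```

Handling `NA` explicitly matters here: a bare `vill$kutans != "yes"` would yield `NA` for unflagged villages and produce all-`NA` rows when used as a row index.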
Binary file added data/.RData
Binary file not shown.
348 changes: 348 additions & 0 deletions data/.Rhistory
@@ -0,0 +1,348 @@
rm(list=ls())
library(lingtypology)
lat.lang("Georgian")
long.lang("Georgian")
gltc.lang("Georgian")
meta <- read_tsv("meta.csv")
non_ec <- meta[(meta$aff != "East Caucasian"),]
non_ec <- meta[(meta$family != "East Caucasian"),]
map.feature(gltc.lang(non_ec$glottocode),
features = non_ec$lang,
color = non_ec$lang_color,
latitude = non_ec$gen_lat,
longitude = non_ec$gen_lat)
map.feature(gltc.lang(non_ec$glottocode),
features = non_ec$lang,
color = non_ec$lang_color,
latitude = non_ec$gen_lat,
longitude = non_ec$gen_long)
map.feature(lang.gltc(non_ec$glottocode),
features = non_ec$lang,
color = non_ec$lang_color,
latitude = non_ec$gen_lat,
longitude = non_ec$gen_long)
map.feature(lang.gltc(non_ec$glottocode),
features = non_ec$lang,
color = non_ec$lang_color,
latitude = non_ec$gen_lat,
longitude = non_ec$gen_lon)
meta <- read_tsv("meta.csv")
non_ec <- meta[(meta$family != "East Caucasian"),]
map.feature(lang.gltc(non_ec$glottocode),
features = non_ec$lang,
color = non_ec$lang_color,
latitude = non_ec$gen_lat,
longitude = non_ec$gen_lon)
map.feature(lang.gltc(non_ec$glottocode),
features = non_ec$aff,
color = non_ec$aff_color,
latitude = non_ec$gen_lat,
longitude = non_ec$gen_lon)
villages <- read_tsv("villages.csv")
kutans <- read_tsv("samira_kutans.csv")
kutans <- read_csv("samira_kutans.csv")
kutans <- read_csv("samira_kutans.csv")
View(kutans)
villages_kutans <- merge(villages, kutans, by = "village")
villages_kutans <- left_join(villages, kutans, by = "village")
View(villages_kutans)
duplicated(villages_kutans$village)
dup <- duplicated(villages_kutans$village)
table(dup)
villages_kutans$duplicates <- duplicated(villages_kutans$village)
kutans_small <- kutans %>%
select(village, kutans)
villages_kutans <- left_join(villages, kutans_small, by = "village")
View(villages_kutans)
villages_kutans <- merge(villages, kutans_small, by = "village")
View(kutans_small)
setdiff(villages$village, kutans_small$village)
setdiff(kutans_small$village, villages$village)
villages_kutans <- left_join(villages, kutans_small, by = "village")
villages_kutans$duplicates <- duplicated(villages_kutans$village)
villages <- read_tsv("villages.csv")
kutans <- read_csv("samira_kutans.csv")
kutans_small <- kutans %>%
select(village, kutans)
villages_kutans <- left_join(villages, kutans_small, by = "village")
villages_kutans$duplicates <- duplicated(villages_kutans$village)
table(villages_kutans$duplicates)
View(villages)
View(villages_kutans)
View(kutans)
sort(unique(villages_kutans$village))
sort(unique(villages_kutans$village)) == sort(unique(villages$village))
sum(sort(unique(villages_kutans$village)) == sort(unique(villages$village)))
sum(sort(unique(villages_kutans$village)) != sort(unique(villages$village)))
villages_kutans %>%
full_join(villages)
villages <- read_tsv("villages.csv")
kutans <- read_csv("samira_kutans.csv")
villages_kutans %>%
full_join(villages)
villages_kutans %>%
full_join(villages) %>%
table(villages_kutans$duplicates)
villages_kutans$duplicates <- duplicated(villages_kutans$village)
villages_kutans %>%
full_join(villages) %>%
table(villages_kutans$duplicates)
villages_kutans <- left_join(villages, kutans_small, by = "village")
villages_kutans$duplicates <- duplicated(villages_kutans$village)
villages_kutans %>%
full_join(villages) %>%
table(villages_kutans$duplicates)
villages <- read_tsv("villages.csv")
kutans <- read_csv("samira_kutans.csv")
kutans_small <- kutans %>%
select(village, kutans)
villages_kutans <- left_join(villages, kutans_small, by = "village")
villages_kutans$duplicates <- duplicated(villages_kutans$village)
villages_kutans <- full_join(villages)
villages_kutans %>%
full_join(villages) %>%
table(villages_kutans$duplicates)
villages <- read_tsv("villages.csv")
kutans <- read_csv("samira_kutans.csv")
kutans_small <- kutans %>%
select(village, kutans)
villages_kutans <- left_join(villages, kutans_small, by = "village")
villages_kutans$duplicates <- duplicated(villages_kutans$village)
villages_kutans %>%
full_join(villages) %>%
table(villages_kutans$duplicates)
villages_kutans$duplicates
villages_kutans %>%
count(duplicated)
villages_kutans %>%
count(duplicates)
villages_kutans %>%
count(duplicates) %>%
View()
villages_kutans %>%
View()
villages <- read_tsv("villages.csv")
villages %>%
distinct()
kutans <- read_csv("samira_kutans.csv")
kutans
kutans %>%
distinct()
kutans %>%
select(village, kutans)
kutans %>%
select(village, region, kutans) %>%
full_join(villages)
View(kutans)
kutans %>%
select(village, lat, long, kutans) %>%
full_join(villages, by = c("village", "lat", "long"))
kutans %>%
select(village, lat, lon, kutans) %>%
full_join(villages, by = c("village", "lat", "lon"))
knitr::opts_chunk$set(echo = TRUE)
# re-order the elements in the legend (by default they are in alphabetical order)
vill_meta$lang <- factor(vill_meta$lang, levels =c(
"Dargwa", "Lak", "Bats", "Ingush", "Chechen", "Khinalug", "Archi", "Tsakhur", "Rutul", "Kryz", "Budukh", "Udi", "Lezgian", "Agul", "Tabasaran", "Avar", "Andi", "Botlikh", "Godoberi", "Chamalal", "Bagvalal", "Tindi", "Karata", "Akhvakh", "Tsez", "Hinuq", "Bezhta", "Hunzib", "Khwarshi", "Nogai", "Kumyk", "Azerbaijani", "Armenian"))
# re-order the elements in the legend (by default they are in alphabetical order)
vill_meta$lang <- factor(vill_meta$lang, levels =c(
"Dargwa", "Lak", "Bats", "Ingush", "Chechen", "Khinalug", "Archi", "Tsakhur", "Rutul", "Kryz", "Budukh", "Udi", "Lezgian", "Agul", "Tabasaran", "Avar", "Andi", "Botlikh", "Godoberi", "Chamalal", "Bagvalal", "Tindi", "Karata", "Akhvakh", "Tsez", "Hinuq", "Bezhta", "Hunzib", "Khwarshi", "Nogai", "Kumyk", "Azerbaijani", "Armenian", "Tat", "Georgian"))
# packages
library(tidyverse) # data manipulation
library(lingtypology) # drawing maps
# load data
vill <- read_tsv("data/villages.csv") # dataframe with all villages, coordinates and languages
meta <- read_tsv("data/meta.csv") # dataframe with language metadata and color schemes
# data preparation
vill <- vill[complete.cases(vill$lat),] # remove villages without coordinates
meta_core <- meta %>% # remove idioms not (yet) recognized as distinct
filter(core == "yes")
vill_meta <- merge(vill, meta_core, by = "lang") # merge villages with metadata
# re-order the elements in the legend (by default they are in alphabetical order)
vill_meta$lang <- factor(vill_meta$lang, levels =c(
"Dargwa", "Lak", "Bats", "Ingush", "Chechen", "Khinalug", "Archi", "Tsakhur", "Rutul", "Kryz", "Budukh", "Udi", "Lezgian", "Agul", "Tabasaran", "Avar", "Andi", "Botlikh", "Godoberi", "Chamalal", "Bagvalal", "Tindi", "Karata", "Akhvakh", "Tsez", "Hinuq", "Bezhta", "Hunzib", "Khwarshi", "Nogai", "Kumyk", "Azerbaijani", "Armenian", "Tat", "Georgian"))
# draw map
map.feature(lang.gltc(vill_meta$glottocode),
latitude = vill_meta$lat,
longitude = vill_meta$lon,
features = vill_meta$lang,
color = vill_meta$lang_color,
label = vill_meta$lang,
zoom.control = T,
popup = vill_meta$village,
tile = c("Esri.WorldTopoMap"))
meta_core$emphasis <- "emphasis"
map.feature(lang.gltc(vill_meta$glottocode),
latitude = vill_meta$lat,
longitude = vill_meta$lon,
features = vill_meta$lang,
color = vill_meta$lang_color_pale,
label = vill_meta$lang,
zoom.control = T,
popup = vill_meta$village,
tile = c("Esri.WorldTopoMap")) %>%
map.feature(lang.gltc(meta_core$glottocode),
latitude = meta_core$gltc_lat,
longitude = meta_core$gltc_lon,
features = meta_core$lang,
stroke.features = meta_core$emphasis,
stroke.color = "black",
stroke.legend = F,
width = 3, stroke.radius = 5,
label = meta_core$lang,
color = meta_core$lang_color,
zoom.control = T,
pipe.data = .)
meta_core$emphasis <- "emphasis"
map.feature(lang.gltc(vill_meta$glottocode),
latitude = vill_meta$lat,
longitude = vill_meta$lon,
features = vill_meta$lang,
color = vill_meta$lang_color_pale,
label = vill_meta$lang,
zoom.control = T,
tile = c("Esri.WorldTopoMap")) %>%
map.feature(lang.gltc(meta_core$glottocode),
latitude = meta_core$gltc_lat,
longitude = meta_core$gltc_lon,
features = meta_core$lang,
stroke.features = meta_core$emphasis,
stroke.color = "black",
stroke.legend = F,
width = 3, stroke.radius = 5,
label = meta_core$lang,
color = meta_core$lang_color,
zoom.control = T,
pipe.data = .)
vill_meta$aff <- factor(vill_meta$aff, levels =c("Dargwa", "Lak", "Nakh", "Khinalug", "Lezgic", "Avar", "Andic", "Tsezic", "Kipchak", "Oghuz", "Armenic", "Iranian", "Georgic"))
map.feature(lang.gltc(vill_meta$glottocode),
latitude = vill_meta$lat,
longitude = vill_meta$lon,
features = vill_meta$aff,
width = 8,
label = vill_meta$lang,
color = vill_meta$aff_color,
zoom.control = T,
popup = vill_meta$village,
tile = c("Esri.WorldTopoMap"))
# packages
library(tidyverse) # data manipulation
library(lingtypology) # drawing maps
# load data
vill <- read_tsv("data/villages.csv") # dataframe with all villages, coordinates and languages
meta <- read_tsv("data/meta.csv") # dataframe with language metadata and color schemes
# data preparation
vill <- vill[complete.cases(vill$lat),] # remove villages without coordinates
meta_core <- meta %>% # remove idioms not (yet) recognized as distinct
filter(core == "yes")
vill_meta <- merge(vill, meta_core, by = "lang") # merge villages with metadata
# re-order the elements in the legend (by default they are in alphabetical order)
vill_meta$lang <- factor(vill_meta$lang, levels =c(
"Dargwa", "Lak", "Bats", "Ingush", "Chechen", "Khinalug", "Archi", "Tsakhur", "Rutul", "Kryz", "Budukh", "Udi", "Lezgian", "Agul", "Tabasaran", "Avar", "Andi", "Botlikh", "Godoberi", "Chamalal", "Bagvalal", "Tindi", "Karata", "Akhvakh", "Tsez", "Hinuq", "Bezhta", "Hunzib", "Khwarshi", "Nogai", "Kumyk", "Azerbaijani", "Armenian", "Tat", "Georgian"))
# draw map
map.feature(lang.gltc(vill_meta$glottocode),
latitude = vill_meta$lat,
longitude = vill_meta$lon,
features = vill_meta$lang,
color = vill_meta$lang_color,
label = vill_meta$lang,
zoom.control = T,
popup = vill_meta$village,
tile = c("Esri.WorldTopoMap"))
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
# packages
library(tidyverse) # data manipulation
library(lingtypology) # drawing maps
# load data
vill <- read_tsv("data/villages.csv") # dataframe with all villages, coordinates and languages
meta <- read_tsv("data/meta.csv") # dataframe with language metadata and color schemes
# data preparation
vill <- vill[complete.cases(vill$lat),] # remove villages without coordinates
meta_core <- meta %>% # remove idioms not (yet) recognized as distinct
filter(core == "yes")
vill_meta <- merge(vill, meta_core, by = "lang") # merge villages with metadata
# re-order the elements in the legend (by default they are in alphabetical order)
vill_meta$lang <- factor(vill_meta$lang, levels =c(
"Dargwa", "Lak", "Bats", "Ingush", "Chechen", "Khinalug", "Archi", "Tsakhur", "Rutul", "Kryz", "Budukh", "Udi", "Lezgian", "Agul", "Tabasaran", "Avar", "Andi", "Botlikh", "Godoberi", "Chamalal", "Bagvalal", "Tindi", "Karata", "Akhvakh", "Tsez", "Hinuq", "Bezhta", "Hunzib", "Khwarshi", "Nogai", "Kumyk", "Azerbaijani", "Armenian", "Tat", "Georgian"))
# draw map
map.feature(lang.gltc(vill_meta$glottocode),
latitude = vill_meta$lat,
longitude = vill_meta$lon,
features = vill_meta$lang,
color = vill_meta$lang_color,
label = vill_meta$lang,
zoom.control = T,
popup = vill_meta$village,
tile = c("Esri.WorldTopoMap"))
vill_meta$aff <- factor(vill_meta$aff, levels =c("Dargwa", "Lak", "Nakh", "Khinalug", "Lezgic", "Avar", "Andic", "Tsezic", "Kipchak", "Oghuz", "Armenic", "Iranian", "Georgic"))
map.feature(lang.gltc(vill_meta$glottocode),
latitude = vill_meta$lat,
longitude = vill_meta$lon,
features = vill_meta$aff,
width = 8,
label = vill_meta$lang,
color = vill_meta$aff_color,
zoom.control = T,
popup = vill_meta$village,
tile = c("Esri.WorldTopoMap"))
knitr::opts_chunk$set(echo = TRUE)
# packages
library(tidyverse) # data manipulation
library(lingtypology) # drawing maps
# load data
vill <- read_tsv("data/villages.csv") # dataframe with all villages, coordinates and languages
meta <- read_tsv("data/meta.csv") # dataframe with language metadata and color schemes
# data preparation
vill <- vill[complete.cases(vill$lat),] # remove villages without coordinates
meta_core <- meta %>% # remove idioms not (yet) recognized as distinct
filter(core == "yes")
vill_meta <- merge(vill, meta_core, by = "lang") # merge villages with metadata
vill_meta$aff <- factor(vill_meta$aff, levels =c("Dargwa", "Lak", "Nakh", "Khinalug", "Lezgic", "Avar", "Andic", "Tsezic", "Kipchak", "Oghuz", "Armenic", "Iranian", "Georgic"))
map.feature(lang.gltc(vill_meta$glottocode),
latitude = vill_meta$lat,
longitude = vill_meta$lon,
features = vill_meta$aff,
width = 8,
label = vill_meta$lang,
color = vill_meta$aff_color,
zoom.control = T,
popup = vill_meta$village,
tile = c("Esri.WorldTopoMap"))
vill_meta$aff <- factor(vill_meta$aff, levels =c("Dargwa", "Lak", "Nakh", "Khinalug", "Lezgic", "Avar", "Andic", "Tsezic", "Kipchak", "Oghuz", "Armenic", "Iranian", "Georgic"))
map.feature(lang.gltc(vill_meta$glottocode),
latitude = vill_meta$lat,
longitude = vill_meta$lon,
features = vill_meta$aff,
width = 8,
label = vill_meta$lang,
color = vill_meta$aff_color,
zoom.control = T,
popup = vill_meta$village,
tile = c("Esri.WorldTopoMap"))
# re-order the elements in the legend (by default they are in alphabetical order)
vill_meta$lang <- factor(vill_meta$lang, levels =c(
"Dargwa", "Lak", "Bats", "Ingush", "Chechen", "Khinalug", "Archi", "Tsakhur", "Rutul", "Kryz", "Budukh", "Udi", "Lezgian", "Agul", "Tabasaran", "Avar", "Andi", "Botlikh", "Godoberi", "Chamalal", "Bagvalal", "Tindi", "Karata", "Akhvakh", "Tsez", "Hinuq", "Bezhta", "Hunzib", "Khwarshi", "Nogai", "Kumyk", "Azerbaijani", "Armenian", "Tat", "Georgian"))
# draw map
map.feature(lang.gltc(vill_meta$glottocode),
latitude = vill_meta$lat,
longitude = vill_meta$lon,
features = vill_meta$lang,
color = vill_meta$lang_color,
label = vill_meta$lang,
zoom.control = T,
popup = vill_meta$village,
tile = c("Esri.WorldTopoMap"))
# re-order the elements in the legend (by default they are in alphabetical order)
vill_meta$lang <- factor(vill_meta$lang, levels =c(
"Dargwa", "Lak", "Bats", "Ingush", "Chechen", "Khinalug", "Archi", "Tsakhur", "Rutul", "Kryz", "Budukh", "Udi", "Lezgian", "Agul", "Tabasaran", "Avar", "Andi", "Botlikh", "Godoberi", "Chamalal", "Bagvalal", "Tindi", "Karata", "Akhvakh", "Tsez", "Hinuq", "Bezhta", "Hunzib", "Khwarshi", "Nogai", "Kumyk", "Azerbaijani", "Armenian", "Tat", "Georgian"))
# draw map
map.feature(lang.gltc(vill_meta$glottocode),
latitude = vill_meta$lat,
longitude = vill_meta$lon,
features = vill_meta$lang,
color = vill_meta$lang_color,
label = vill_meta$lang,
zoom.control = T,
popup = vill_meta$village,
tile = c("Esri.WorldTopoMap"))
Binary file added data/merged_all_census_and_samira.xlsx
Binary file not shown.