Master of Logic Thesis of Valentin Vogelmann
Figures and tables: see `figures/README.md` for navigation.
Linguistically pre-processed (see Section 2.1 of the thesis for pipeline details) and subsequently pickled (using the Python module `pickle`) corpora: Wikipedia dumps in 7 languages. Each folder, prefixed with a language code (see below), contains the corpus split into multiple files in order to stay below GitHub's file size limit. Use `wiki_from_pickles` in `/data/reader.py` to load a corpus from a folder in `/data`; `corpus.py` contains wrappers that turn the corpora loaded with `wiki_from_pickles` into Python objects with convenient functionality. A minimal loading sketch is given below.
Language codes are: Esperanto - `EO`, Finnish - `FI`, Indonesian - `ID`, Korean - `KO`, Norwegian (the Bokmål variant) - `NO`, Turkish - `TR`, and Vietnamese - `VI`.
See Table 2.1 of the thesis for the basic size characteristics of the corpora.
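The exact call signature of `wiki_from_pickles` is not documented in this README, so the snippet below is only a minimal sketch under two assumptions: that the repository root is on the Python path and that the function takes the path of a pickled corpus folder. Check `/data/reader.py` for the actual interface.

```python
from data.reader import wiki_from_pickles

# Assumption: wiki_from_pickles takes the path of one of the pickled
# corpus folders; the folder name "data/EO" is only an example.
articles = wiki_from_pickles("data/EO")

# corpus.py provides wrappers that turn this raw result into corpus
# objects with convenient functionality (class names not shown here).
```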
Code for generating subcorpora according to the Subsampling and Filtering methods and for analysing them:
- `/src/stats/`: helper functionality such as calculating ranks, frequencies, and typicality, or performing MLE (a toy rank-frequency sketch follows this list)
- `/src/subsampling`: scripts to analyse the various aspects of the Subsampling method examined in the thesis, such as variance and convergence
- `/src/filtering`: implementations of the TypicalityFilter and SpeakerRestrictionFilter sampling algorithms, in both sequential and parallelised versions (using Python's `multiprocessing` library; a generic parallelisation sketch also follows this list)
- `/src/evaluation`: functions and scripts for evaluating filtering results, such as lexical diversity and Jaccard distance
- `shell_scripts`: shell scripts to deploy the Python code on SURFsara's LISA computing cluster
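As an illustration of the kind of helper functionality collected in `/src/stats/`, a toy rank-frequency computation could look like the sketch below. The function name and interface are assumptions made for illustration, not the module's actual API.

```python
from collections import Counter

def rank_frequencies(tokens):
    """Return (rank, frequency) pairs; rank 1 is the most frequent type."""
    counts = Counter(tokens)
    sorted_freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(sorted_freqs, start=1))

# Toy example: "la" occurs three times, the other types once each
pairs = rank_frequencies(["la", "hundo", "la", "kato", "la"])
# -> [(1, 3), (2, 1), (3, 1)]
```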
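The parallelised filter implementations in `/src/filtering` use Python's `multiprocessing` library; the sketch below shows a generic version of that pattern only. The chunking, the worker count, and the `is_typical` predicate are placeholders, not the actual TypicalityFilter or SpeakerRestrictionFilter logic.

```python
from multiprocessing import Pool

def is_typical(article, min_tokens=10):
    """Placeholder predicate; the real filtering criteria live in /src/filtering."""
    return len(article) >= min_tokens

def filter_chunk(chunk):
    """Filter one chunk of articles with the placeholder predicate."""
    return [article for article in chunk if is_typical(article)]

def parallel_filter(articles, n_workers=4, chunk_size=1000):
    """Split the corpus into chunks and filter the chunks in parallel."""
    chunks = [articles[i:i + chunk_size]
              for i in range(0, len(articles), chunk_size)]
    with Pool(n_workers) as pool:
        filtered_chunks = pool.map(filter_chunk, chunks)
    # Flatten the per-chunk results back into a single list of articles
    return [article for chunk in filtered_chunks for article in chunk]
```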