In recent years the size and quality of biological data sets have increased at a tremendous rate. It is fair to say that we have not yet reached the plateau of that development, so students and researchers alike will be facing even larger data sets in the near future.
As the amount of information grows, so must the methods we use to explore, analyse and present that data. The field of population genetics is undergoing a profound paradigm shift, as researchers move from exploiting the sparse, hard-won data sets of old to finding ways to understand and encompass the swathes of data their laboratories now produce.
In recent years I have had the privilege of working on one of the largest genetic data sets to date (2018). In brief, this project involved describing genetic structure across thousands of possible sites. I would like to share here some of the more useful insights and tools that resulted.
One important component of Big Data is Data Visualization. Data Visualization is particularly important for decision making, in technical development as well as in directing research projects. This repository consists of a series of Jupyter notebooks, each exploring a method or application I found particularly useful.
A step-by-step guide through the conceptualization and building of a pipeline for the classification of haplotypes in large data sets.
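To give a flavour of what such a classification step can look like (this is only a rough sketch of my own devising, not the pipeline described in the notebook), the snippet below clusters simulated binary haplotypes with PCA followed by K-means, assuming NumPy and scikit-learn are available:

```python
# Hypothetical illustration of a haplotype classification step,
# not the pipeline covered in the notebook.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Simulate two haplotype groups that differ in allele frequencies
# at 200 biallelic sites (0/1 encoding), 100 haplotypes per group.
freqs_a = rng.uniform(0.1, 0.9, size=200)
freqs_b = np.clip(freqs_a + rng.normal(0, 0.3, size=200), 0.05, 0.95)
haplotypes = np.vstack([
    rng.binomial(1, freqs_a, size=(100, 200)),
    rng.binomial(1, freqs_b, size=(100, 200)),
])

# Project onto a few principal components, then assign cluster labels.
coords = PCA(n_components=5).fit_transform(haplotypes)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)
print(labels[:10], labels[-10:])
```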
Application of Markov Chains to population genetics. My own incremental study of both subjects.
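As a taste of how the two subjects meet (a minimal sketch under my own assumptions, not a summary of the notebook), the classic Wright-Fisher model can be written as a Markov chain on the number of copies of an allele, with binomial transition probabilities:

```python
import numpy as np
from scipy.stats import binom

N = 20                      # number of gene copies in the population
states = np.arange(N + 1)   # possible counts of the focal allele

# Transition matrix: row i gives Binomial(N, i/N) probabilities,
# i.e. the chance of moving from i copies to each possible count j.
P = np.array([binom.pmf(states, N, i / N) for i in states])

# Start the allele at frequency 0.5 and propagate the distribution
# over generations; probability mass accumulates at the absorbing
# states 0 (loss) and N (fixation).
dist = np.zeros(N + 1)
dist[N // 2] = 1.0
for _ in range(200):
    dist = dist @ P
print("P(lost) ~", dist[0], "P(fixed) ~", dist[N])
```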
The implementation, in Python, of the more basic aspects of coalescent theory. This was a lot of fun to write.
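For a flavour of what those basics involve (a simplified sketch rather than the notebook's actual code), the waiting time between coalescence events for k lineages is exponential with rate k(k-1)/2 in coalescent time units, which makes the expected height of the tree easy to check by simulation:

```python
import random

def tree_height(n, rng=random):
    """Time until n sampled lineages coalesce into one (coalescent time units)."""
    height, k = 0.0, n
    while k > 1:
        rate = k * (k - 1) / 2           # rate of the next coalescence event
        height += rng.expovariate(rate)  # exponential waiting time
        k -= 1
    return height

n = 10
heights = [tree_height(n) for _ in range(10_000)]
print("simulated mean height:", sum(heights) / len(heights))
print("expected mean height: ", 2 * (1 - 1 / n))
```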
An exploration of how the methods developed during my PhD can be applied in a playful way.
Kernel density estimation (KDE) applied to hand-written digits and to population genetics data.
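As an example of the kind of exercise this combination suggests (a minimal sketch assuming scikit-learn's digits dataset and KernelDensity estimator, not the notebook's own code), one can fit a KDE to the digits in a reduced PCA space and sample new, digit-like images from it:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

digits = load_digits()

# Reduce the 8x8 images (64 pixels) to 15 principal components
# so that the density estimate lives in a manageable space.
pca = PCA(n_components=15)
coords = pca.fit_transform(digits.data)

# Fit a Gaussian KDE in that space and draw four new samples,
# then map them back to image space.
kde = KernelDensity(kernel="gaussian", bandwidth=3.0).fit(coords)
new_coords = kde.sample(4, random_state=0)
new_images = pca.inverse_transform(new_coords).reshape(-1, 8, 8)
print(new_images.shape)  # four synthetic, digit-like 8x8 images
```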