HaploDynamics

A Python library to simulate genomic data

Presentation

HaploDynamics (HaploDX) is a Python 3+ library that provides a collection of functions for simulating population-specific genomic data. The package is part of the Genetic Simulator Resources (GSR) catalog, which can be accessed by clicking on the image below.

Highlights and updates

Five reasons to use this package:

An intuitive user interface for writing short, concise Python code that achieves realistic simulations.
Speed and efficiency, with a lightweight implementation that allows for fast generation of simulations.
Flexibility, with the ability to mix your own models with the framework to create custom simulations.
A comprehensive set of arithmetic operations (coming soon) for working with multiple VCF files.
Detailed documentation with thorough tutorials and performance analyses to help you get started quickly.

Release v0.4b*:

Compose your own mutation model: the class Model now lets you create your own mutation model and use it with the generative functions of the HaploDX framework.

import HaploDynamics.Framework as fmx
#Start your simulation
model = fmx.Model("tutorial")
#Initialize the genomic landscape
model.initiate_landscape(reference = 1.245)
#Design your own genomic landscape with any allele frequency model
model.extend_landscape(*(fmx.Model.standard_schema(20) for _ in range(6)))
#Population and LD parameters
strength = 1
population = 0.1
Npop = 1000
chrom = "1"
#Generate the simulation in a VCF file
model.generate_vcf(strength,population,Npop,chrom)

HaploDynamics.Framework.Model.initiate_landscape added;
HaploDynamics.Framework.Model.extend_landscape added;
HaploDynamics.Framework.Model.standard_schema added;
HaploDynamics.Framework.Model.genotype_schema added;
HaploDynamics.Framework.Model.linkage_disequilibrium added;
HaploDynamics.Framework.Model.cond_genotype_schema added;
Documentation for the Framework module polished;
Various typos and clumsy phrasing have been corrected in the documentation;

Loading bar appearance changed:

$ python myscript.py
Model.generate_vcf: |████████████████████| 100%
time (sec.): 0.7510931491851807
max. mem (MB): 0.11163139343261719
cur. mem (MB): 0.0834970474243164

Installation

Installation via `pip`

Install the HaploDynamics package by using the following command.

$ pip install HaploDynamics

After this, you can import the modules of the library to your script as follows.

import HaploDynamics.HaploDX as hdx
import HaploDynamics.Framework as fmx

To upgrade the package to its latest version, use the following command.

$ pip install --upgrade HaploDynamics==0.4b1

Manual installation

HaploDynamics uses the SciPy library for certain calculations. To install SciPy, run the following command, or see SciPy's installation instructions for more options.

$ python -m pip install scipy

You can install the HaploDynamics GitHub package by using the following command in a terminal.

$ git clone https://github.com/remytuyeras/HaploDynamics.git

Then, use the pwd command to get the absolute path leading to the downloaded package.

$ ls
HaploDynamics
$ cd HaploDynamics/
$ pwd
absolute/path/to/HaploDynamics

To import the modules of the library to your script, you can use the following syntax where the path absolute/path/to/HaploDynamics should be replaced with the path obtained earlier.

import sys
sys.path.insert(1,"absolute/path/to/HaploDynamics")
import HaploDynamics.HaploDX as hdx
import HaploDynamics.Framework as fmx

Quickstart

The following script generates a VCF file containing simulated diploid genotypes for a population of 1000 individuals with LD-blocks of length 20kb, 5kb, 20kb, 35kb, 30kb and 15kb.

import HaploDynamics.HaploDX as hdx

simulated_data = hdx.genmatrix([20,5,20,35,30,15],strength=1,population=0.1,Npop=1000)
hdx.create_vcfgz("genomic-data.simulation.v1",*simulated_data)

The setting strength=1 forces a high amount of linkage disequilibrium, and the setting population=0.1 increases the likelihood that the simulated population carries rare mutations (e.g. to simulate a population profile close to African and South-Asian populations).

More generally, the function genmatrix() takes the following types of parameters:

Parameters	Type	Values
`blocks`	`list[int]`	List of positive integers, ideally between 1 and 40.
`strength`	`float`	From -1 (little linkage) to 1 (high linkage)
`population`	`float`	From 0 (for more rare mutations) to 1 (for less rare mutations)
`Npop`	`int`	Positive integer specifying the number of individuals in the genomic matrix
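The ranges in the table can be checked before starting a long simulation. The sketch below is only an illustration: `check_genmatrix_args` is a hypothetical helper that is not part of HaploDynamics, and it assumes the documented ranges are the only constraints on the arguments.

```python
def check_genmatrix_args(blocks, strength, population, Npop):
    """Check arguments against the ranges listed in the parameter table.

    Hypothetical helper, not part of HaploDynamics.
    Returns the blocks that fall outside the recommended 1..40 range.
    """
    def is_positive_int(value):
        # bool is a subclass of int, so it is excluded explicitly
        return isinstance(value, int) and not isinstance(value, bool) and value > 0

    if not blocks or not all(is_positive_int(b) for b in blocks):
        raise ValueError("blocks must be a non-empty list of positive integers")
    if not -1 <= strength <= 1:
        raise ValueError("strength must lie between -1 and 1")
    if not 0 <= population <= 1:
        raise ValueError("population must lie between 0 and 1")
    if not is_positive_int(Npop):
        raise ValueError("Npop must be a positive integer")

    # Larger blocks are allowed, but the table suggests staying within 1..40.
    return [b for b in blocks if b > 40]
```

If the returned list is empty, every block lies within the recommended range and the same arguments can be passed on to genmatrix().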

The time needed to generate each locus in a VCF file tends to be linear in the parameter Npop. On average, a genetic variant takes between 0.3 and 1 second to generate when Npop=100000 (this may vary depending on your machine). The estimated time complexity for an average machine is shown below.
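As a rough back-of-the-envelope aid, the linear behaviour described above can be turned into a runtime estimate. The sketch below assumes strict linearity in Npop and reuses the 0.3 to 1 second range quoted for Npop=100000. The function `estimate_seconds` and its parameter names are hypothetical, and the number of variants must be supplied by the caller.

```python
def estimate_seconds(n_variants, Npop,
                     sec_per_variant_at_ref=(0.3, 1.0),
                     ref_Npop=100_000):
    """Return a (low, high) runtime estimate in seconds.

    Hypothetical helper: assumes the per-variant cost scales
    linearly with Npop, anchored at the quoted reference range.
    """
    scale = Npop / ref_Npop
    low, high = sec_per_variant_at_ref
    return (n_variants * low * scale, n_variants * high * scale)


# Example: 500 variants for a population of 1000 individuals
low, high = estimate_seconds(500, 1000)
print(f"estimated time: {low:.2f} to {high:.2f} seconds")
```

Actual timings depend on the machine, so the result is better read as an order of magnitude than as a prediction.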

Use cases

The following script shows how to display linkage disequilibrium correlations for the simulated data.

import matplotlib.pyplot as plt
import HaploDynamics.HaploDX as hdx

simulated_data = hdx.genmatrix([20,20,20,20,20,20],strength=1,population=0.1,Npop=1000)
hdx.create_vcfgz("genomic-data.simulation.v1",*simulated_data)

rel, m, _ = hdx.LD_corr_matrix(simulated_data[0])
plt.imshow(hdx.display(rel,m))
plt.show()

A typical output for the previous script should look as follows.

The following script shows how you can control linkage disequilibrium by using LD-blocks of varying lengths. You can display the graph relating distances between pairs of variants to average correlation scores by using the last output of the function LD_corr_matrix().

import matplotlib.pyplot as plt
import HaploDynamics.HaploDX as hdx

ld_blocks = [5,5,5,10,20,5,5,5,5,5,5,1,1,1,2,2,10,20,40]
strength=1
population=0.1
Npop = 1000
simulated_data = hdx.genmatrix(ld_blocks,strength,population,Npop)
hdx.create_vcfgz("genomic-data.simulation.v1",*simulated_data)

#Correlations
rel, m, dist = hdx.LD_corr_matrix(simulated_data[0])
plt.imshow(hdx.display(rel,m))
plt.show()

#from genetic distances to average correlations
plt.plot([i for i in range(len(dist)-1)],dist[1:])
plt.ylim([0, 1])
plt.show()

Typical outputs for the previous script should look as follows.

Correlations	genetic distances to average correlations
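To make the second graph easier to interpret, here is a minimal, self-contained sketch of how an average correlation per distance could be computed from a genotype matrix. It is not the implementation behind LD_corr_matrix(). It assumes that each variant is a list of genotype dosages (0, 1 or 2), that distance is measured in number of variants rather than base pairs, and that the score is the absolute Pearson correlation. All function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long dosage lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A variant with no variation is scored as uncorrelated
        return 0.0
    return cov / sqrt(var_x * var_y)


def average_correlation_by_distance(variants):
    """Average |correlation| over all variant pairs, grouped by index distance.

    variants[i] holds the dosages of variant i across individuals.
    Entry d of the result is the mean score of pairs that are d variants apart.
    """
    m = len(variants)
    totals = [0.0] * m
    counts = [0] * m
    for i in range(m):
        for j in range(i, m):
            totals[j - i] += abs(pearson(variants[i], variants[j]))
            counts[j - i] += 1
    return [total / count for total, count in zip(totals, counts)]
```

Entry 0 compares each variant with itself, which may be why the scripts above plot dist[1:] and skip the first entry.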

Finally, the following script shows how you can generate large regions of linkage.

import matplotlib.pyplot as plt
import HaploDynamics.HaploDX as hdx

ld_blocks = [1] * 250
strength=1
population=0.1
Npop = 1000
simulated_data = hdx.genmatrix(ld_blocks,strength,population,Npop)
hdx.create_vcfgz("genomic-data.simulation.v1",*simulated_data)

#Correlations
rel, m, dist = hdx.LD_corr_matrix(simulated_data[0])
plt.imshow(hdx.display(rel,m))
plt.show()

#from genetic distances to average correlations
plt.plot([i for i in range(len(dist)-1)],dist[1:])
plt.ylim([0, 1])
plt.show()

Typical outputs for the previous script should look as follows.

Correlations	genetic distances to average correlations

To cite this work

Tuyeras, R. (2023). HaploDynamics: A python library to develop genomic data simulators (Version 0.4-beta.1) [Computer software].
