This cloud-native module provides enhanced functionalities for software development and experimentation. It includes more efficient versions of the functions in the HaploDX module, as well as support for deployment and scaling on cloud computing platforms.
The module currently contains the following features:
Warning This module is currently under active development
- Coming soon HaploDynamics.Framework.VCF
- Coming soon HaploDynamics.Framework.VCF.__init__
- Coming soon HaploDynamics.Framework.VCF.__add__
- Coming soon HaploDynamics.Framework.VCF.__mul__
This section describes the class Model
, which provides methods for modeling population-specific genomic data in variant call format (VCF). The class currently has the following methods:
- HaploDynamics.Framework.Model.__init__
- HaploDynamics.Framework.Model.initiate_landscape
- HaploDynamics.Framework.Model.extend_landscape
- HaploDynamics.Framework.Model.standard_schema
- HaploDynamics.Framework.Model.initiate_vcf
- HaploDynamics.Framework.Model.generate_vcf
A description of the class variables is given in the description of the initializer method __init__()
.
The initializer method for the class initializes Model
instances along with 5 variables:
Variable | Description |
---|---|
self.pool |
a Pool object from the module multiprocessing of the Python Standard Library. |
self.fname |
a string representing a name of reference for the files generated with the methods of the class. |
self.eol |
a string containing the end of line character for the ambient OS. |
self.initial_schema |
a pair |
self.landscape |
a list of triples defining genomic region profiles called genetic schemas. See the method extend_landscape for the definition of a schema. |
def __init__(self,
fname: str,
cpus: int = 1,
system: str = "unix"
) -> None
Description
-
inputs:
fname
: a string representing a name of reference for the files generated with the methods of the class. For example, the string is used to name the output VCF file for the simulated data generated by the methodgenerate_vcf()
.cpus
: number of cpus to be used by the generative processes associated with the class instance. This feature is currently under developement cannot take other values other than1
;system
: a string that is either equal to"unix"
or any other string. If the string is equal to"unix"
, then every end of line of the VCF file is formatted according to unix systems (i.e using the character\n
). Otherwise, the end of line is formatted according to windows systems (i.e using the character\r\n
).
-
variables:
self.pool
: initialized as thePool
object returned bymultiprocessing.Pool(cpus)
;self.fname
: initialized asfname
;self.eol
: initialized to the string"\n"
if the inputsystem
contains the string"unix"
. Otherwise it is initialized to the string"\r\n"
to be compatible with Windows OS systems;self.initial_schema
: initialized by calling the processself.initialize_landscape()
self.landscape
: initialized to an empty list.
The following code snippet shows how to initialize an instance of the class Model
.
import HaploDynamics.Framework as fmx
model = fmx.Model("tutorial")
The method initiate_landscape()
can be used to define the first genetic position to be included in a simulation of variant call data. Specifically, it initializes the variable self.initial_schema
with a pair
-
$f$ is a stochastic process$[0,1] \to [0,1]$ that takes a bottleneck parameter$\alpha \in [0,1]$ and returns an allele frequency stochastic process genetic position; and -
$r$ is a genetic position.
Note The variable contains a simplified encoding of a genetic schema as defined in extend_landscape. Specifically, if the variable
self.initial_schema
contains a pair$(f,r)$ , then the associated genetic schema is defined by the triple$(1,f,g)$ where$g$ is the constant function$s\mapsto [r]$ .
def initiate_landscape(self,
afs: Callable[[float],float] = afs_sample,
reference: float = 0
) -> None
Description
-
inputs:
-
afs
: a stochastic process$[0,1] \to [0,1]$ that takes a bottleneck parameter$\alpha \in [0,1]$ and returns an allele frequency; -
reference
: a floating-point number that represents a genetic position.Note The genetic position
reference
is used as the least genetic position to index the dataset simulated by the processgenerate_vcf()
. Specifically, this genetic position is prepended to the list of genetic positions generated by the genetic schemas contained in the listself.landscape
. This means that the VCf file generated by the processgenerate_vcf()
containssum([npos for npos,_,_ in self.landscape])+1
genetic positions;
-
The following code snippet shows how to call and use the method standard_schema
.
import HaploDynamics.Framework as fmx
model = fmx.Model("tutorial")
model.initiate_landscape(reference = 1.245)
print(f"The first genetic position of our simulation is at {1000*model.initial_schema[1]}k")
The execution of the previous script should produce the following output.
$ python3 myscript.py
The first genetic position of our simulation is at 1245.0k
The method extend_landscape()
extends the list stored in the variable self.landscape
with the sequence of items passed in the input. By convention, this sequence must only contain genetic schemas, which are defined below.
Definition (genetic schema). A genetic schema is defined as a triple
-
$b$ is a number of genetic positions in an LD block; -
$f$ is a stochastic process$[0,1] \to [0,1]$ that takes a bottleneck parameter$\alpha \in [0,1]$ and returns an allele frequency; -
$g$ is a stochastic process$\mathbb{R} \to \mathbb{R}^N$ that takes an input value and returns a sequence of$N$ successive genetic positions, all greater than the input value.
The previous collection of items can be seen as a specification for a genetic region, where
SchemaType = tuple[int,Callable[[float],float],Callable[[float],list[float]]]
def extend_landscape(self,
*schemas: SchemaType
) -> int
Description
- inputs:
schemas
: a sequence of genetic schemas, as defined above in the introduction.
- output:
- an integer value equal to the legnth of the list contained in
self.landscape
.
- an integer value equal to the legnth of the list contained in
The following code snippet shows how to call and use the method standard_schema
.
import HaploDynamics.Framework as fmx
model = fmx.Model("tutorial")
n = model.extend_landscape(fmx.Model.standard_schema(10),
fmx.Model.standard_schema(15),
fmx.Model.standard_schema(9),
fmx.Model.standard_schema(5),
fmx.Model.standard_schema(5),
fmx.Model.standard_schema(10),
fmx.Model.standard_schema(15),
fmx.Model.standard_schema(12))
print(f"We added {n} new schemas to the genetic landscape of the model")
The execution of the previous script should produce the following output.
$ python3 myscript.py
We added 8 new schemas to the genetic landscape of the model
The static method standard_schema()
returns what one should consider a "standard" genetic schema for the variable self.landscape
(see the method extend_landscape for a definition of a schema). Specifically, Standard schemas are encoded by triples
SchemaType = tuple[int,Callable[[float],float],Callable[[float],list[float]]]
@staticmethod
def standard_schema(npos: int,
) -> SchemaType
Description
- inputs:
npos
: an integer value that represents number of genetic positions in a genomic region
- output:
npos
: the integer value contained in the inputnpos
;mut_rate
: the stochastic processafs_sample
;pos_dist
: the stochastic process defined bylambda reference: SNP_distribution(reference,npos)
;
The following code snippet shows how to call and use the method standard_schema
.
import HaploDynamics.Framework as fmx
x = fmx.Model.standard_schema(10)
print("Number of genetic positions:",x[0])
print("An allele frequency for aplha=0.1:",x[1](0.1))
print(f"A list of {x[0]} positions above 123.45:")
for s in x[2](123.45):
print("-",s)
The execution of the previous script should produce an output looking as follows.
$ python3 myscript.py
Number of genetic positions: 10
An allele frequency for aplha=0.1: 0.05115750329491873
A list of 10 positions above 123.45:
- 125.278
- 125.907
- 126.11
- 126.976
- 127.589
- 127.769
- 128.642
- 130.274
- 130.284
- 132.767
The method genotype_schema()
is a copy-and-paste of the function HaploDX.genotype_schema, with the addition of an optional input, afs=afs_sample
, which can be used to change the allele frequency spectrum used by the function. This allows the user to specify a different allele frequency spectrum for the simulated genotypes.
def genotype_schema(self,
alpha: float = 4/30,
afs: Callable[[float],float] = afs_sample
) -> tuple[float,list[float]]
The method linkage_disequilibrium()
is a copy-and-paste of the function HaploDX.linkage_disequilibrium, with the addition of an optional input, afs=afs_sample
, which can be used to change the allele frequency spectrum used by the function. This allows the user to specify a different allele frequency spectrum for the simulated genotypes.
def linkage_disequilibrium(self,
alpha: float,
beta: float,
gamma: float,
strength: float = -1,
afs: Callable[[float],float] = afs_sample
) -> Callable[[float],Callable[[float],tuple[float,float]]]
The method cond_genotype_schema()
is a copy-and-paste of the function HaploDX.cond_genotype_schema, with the addition of an optional input, afs=afs_sample
, which can be used to change the allele frequency spectrum used by the function. This allows the user to specify a different allele frequency spectrum for the simulated genotypes.
def cond_genotype_schema(self,
previous_maf: float,
distance: float,
alpha: float,
beta: float,
gamma: float,
strength: float = -1,
afs: Callable[[float],float] = afs_sample
) -> tuple[float,Callable[[tuple[int,int]],list[float]],float]
The method initiate_vcf()
locally saves a VCF file with only header information. This means that the file contains metadata and a row of attributes, but no variant call rows.
def initiate_vcf(self,
Npos: int,
Npop: int,
chrom = "23"
) -> IO
Description
- inputs:
Npos
: number of variant calls to be contained by the VCF file;Npop
: number of individuals contained in the attribute section of the VCF file. Each individual is labeled by the formula"ID"+"0"*(int(math.log10(Npop))+1-len(str(s)))+str(s)
wheres
is the rank of the individual among the list of all the inidividuals (i.e.s
is a value between1
andNpop
);chrom
: a string that represents the chromosome number used to annotate the VCF file.
- output:
- a file pointer positioned at the end of the VCF file
The following snippet shows a rudimentary example of how the user can use the method initiate_vcf()
.
import HaploDynamics.Framework as fmx
model = fmx.Model("tutorial")
nb_variants = 120
nb_indiv = 1000
chrom = "1"
model.initiate_vcf(nb_variants,nb_indiv,chrom).close()
The execution of the previous script should produce a VCF file tutorial.vcf.gz
, which can be visualized by using the commands zmore
, zless
or zcat
.
$ python3 myscript.py
$ ls
tutorial.vcf.gz
$ zcat tutorial.vcf.gz
The output:
##fileformat=VCFv4.2
##fileDate=20230824
##source=HaploDX
##reference=https://github.com/remytuyeras/HaploDynamics/blob/main/README.md
##contig=<ID=1,length=120,species="simulated Homo sapiens">
...
ID0987 ID0988 ID0989 ID0990 ID0991 ID0992 ID0993 ID0994 ID0995 ID0996 ID0997 ID0998 ID0999 ID1000
The process generate_vcf()
combines the two processes genmatrix()
and create_vcfgz()
of the module HaploDX
into a single process. This is done to optimize execution time and reduce memory usage. The generate_vcf()
function is ideal for machines with limited memory bandwidth due to cost or equipment constraints.
Important Unlike the two processes
genmatrix()
andcreate_vcfgz()
, the processgenerate_vcf()
relies on the genetic schemas contained in the variablesself.initial_schema
andself.landscape
to generate the simulated variant call data. One key difference is that these schemas are indexed by the number of genetic positions in a genomic region, rather than by LD block lengths. This means that the least genetic position used in the simulated data is the genetic position contained inself.initial_schema
. Subsequent genetic positions are provided by the last process of each schema in the listself.landscape
.
def generate_vcf(self,
strength: float,
population: float,
Npop: int,
chrom: str = "23"
) -> tuple[float,float,float]
Description
-
inputs:
-
strength
: a floating-point number in the interval$[-1,1]$ that represents the strength of the linkage disequilibrium. For example, a strength of$1$ refers to the maximum value that the linkage disequilibrium measure can take given the values of the parameters generated during the siumlation. See the tutorial to learn more about the parameterstrength
; -
population
: a floating-point number in the interval$[0,1]$ , which is given as an input to the functionpopulation_mld()
to generate a stochastic population profile; -
Npop
: an integer$N$ that represents the number of individuals in the simulated dataset; -
chrom
: a string that represents the chromosome number used to annotate the VCF file.
-
-
output:
-
speed
: execution time for the run of the process; -
max_mem
: maximal memory usage during the run of the process; -
cur_mem
: memory usage at the end the process;
-
The following snippet shows a rudimentary example of how the user can use the method generate_vcf()
.
import HaploDynamics.Framework as fmx
model = fmx.Model("tutorial")
#Initial genetic position
model.initiate_landscape(reference = 1.245)
model.extend_landscape(*(fmx.Model.standard_schema(20) for _ in range(6)))
strength = 1
population = 0.1
Npop = 1000
chrom = "1"
model.generate_vcf(strength,population,Npop,chrom)
The execution of the previous script should produce a VCF file tutorial.vcf.gz
, which can be visualized by using the commands zmore
, zless
or zcat
.
$ python myscript.py
Model.generate_vcf: |████████████████████| 100%
time (sec.): 0.7510931491851807
max. mem (MB): 0.11163139343261719
cur. mem (MB): 0.0834970474243164
$ ls
tutorial.vcf.gz
$ zless -S tutorial.vcf.gz
The output:
##fileformat=VCFv4.2
##fileDate=20230824
##source=HaploDX
##reference=https://github.com/remytuyeras/HaploDynamics/blob/main/README.md
##contig=<ID=1,length=121,species="simulated Homo sapiens">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AP,Number=1,Type=Float,Description="Alpha Parameter for Simulation">
##INFO=<ID=BP,Number=1,Type=Float,Description="Beta Parameter for Simulation">
##INFO=<ID=CP,Number=1,Type=Float,Description="Gamma Parameter for Simulation">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ID0001 ID0002 ID000>
1 1245 . G T . PASS NS=1000;AP=0.15997193499185136;BP=0.4>
1 2175 . G T . PASS NS=1000;AP=0.15997193499185136;BP=0.4>
1 2703 . C T . PASS NS=1000;AP=0.15997193499185136;BP=0.4>
1 3083 . G C . PASS NS=1000;AP=0.15997193499185136;BP=0.4>
1 7977 . C A . PASS NS=1000;AP=0.15997193499185136;BP=0.4>
1 8494 . G A . PASS NS=1000;AP=0.15997193499185136;BP=0.4>
1 8728 . G T . PASS NS=1000;AP=0.15997193499185136;BP=0.4>
:
The following script compares the performance of the function generate_vcf()
, which is optimized for execution time and memory usage, to the performance of the C++ software MaCS. This comparison is motivated by the fact that both MaCS and HaploDX rely on Bayesian networks to generate variants, which allows them to output rows of genotypes sequentially in a similar fashion.
Note While MaCS is one of the fastest, if not the fastest, coalescence-based methods for generating genomic data, its use of graph structures to encode segregation and recombination events can slow down its time execution. The comparison is intended to demonstrate the strengths of the HaploDX framework in terms of memory savings and time efficiency, while still producing realistic variant call simulations.
import matplotlib.pyplot as plt
import HaploDynamics.Framework as fmx
model = fmx.Model("tutorial")
#Initial genetic position
model.initiate_landscape(reference = 1.245)
#Region of 10 Mb
model.extend_landscape(*(fmx.Model.standard_schema(20) for _ in range(10)))
strength = 1
population = 0.1
chrom = "1"
x = [100,300,1000,3000,10000]
s = []
m = []
for i in range(len(x)):
Npop = x[i]
ave_s = []
ave_m = []
for j in range(10):
print("Npop:",Npop,j)
speed, max_mem, cur_mem = model.generate_vcf(strength,population,Npop,chrom)
ave_s.append(speed)
ave_m.append(max_mem)
s.append(sum(ave_s)/10)
m.append(sum(ave_m)/10)
print(s)
print(m)
plt.plot(x,[13,43,3*60,13*60+5,50*60+50],c="r",linestyle='dashed',alpha = 0.5)
plt.plot(x,s,c="r")
plt.legend(["MaCS", "HaploDX"])
plt.ylabel("Average execution times for a region of 10000kb")
plt.xlabel("Npop")
plt.show()
plt.plot(x,[9.6,10,11.7,13.8,20.4],c="b",linestyle='dashed',alpha = 0.5)
plt.plot(x,m,c="b")
plt.legend(["MaCS", "HaploDX"])
plt.ylabel("Average memory usage for a region of 10000kb")
plt.xlabel("Npop")
plt.show()
The outputs for the previous script should typically look as shown below when run on a 2.4 GHz processor with 16 GB of RAM. Note that the execution time and memory usage statistics for MaCS were reportedly conducted on a slightly faster machine, using a 2 GHz processor with 32 GB of RAM (see the supporting research paper).
Memory usage | Execution times |
---|---|
The method generate_vcf()
relies on the following processes from the module HaploDX
: