DeepConv is a deep learning approach for deconvoluting cell-type proportions from cell-free DNA methylation data. The model learns to estimate the relative contributions of different cell types in a mixture by analysing methylation patterns across genomic markers.
Cell-free DNA (cfDNA) in blood plasma consists of DNA fragments released by cells throughout the body during natural cell death (apoptosis) or other cellular processes. These fragments retain the methylation patterns of their cells of origin, providing a "signature" that can be used to identify their source tissue.
The fundamental question in cfDNA deconvolution is: Given a mixture of DNA from multiple cell types, can we determine what proportion came from each type? Mathematically, this can be formulated as:
M = R × P
Where each matrix represents:

R (reference atlas):
- Shape: (regions × cell_types)
- Each column represents a cell type's methylation profile
- Each row is a genomic region (marker)
- Values are between 0 and 1, representing methylation level:
  - 0: completely unmethylated
  - 1: completely methylated
- Obtained from reference samples of pure cell types
- Usually sparse, as regions are selected to be differentially methylated across cell types

P (cell-type proportions):
- Shape: (cell_types × samples)
- Each column represents one sample's composition
- Each row is a cell type
- Values represent the fraction of DNA from each cell type
- Subject to biological constraints:
  - Non-negative: all values ≥ 0
  - Sum-to-one: each column sums to 1
- This is the quantity we are trying to estimate

M (observed mixture):
- Shape: (regions × samples)
- Each column represents one mixed sample
- Each row is a genomic region matching R
- Values are between 0 and 1, representing the observed methylation level
- In practice, derived from sequencing data as: number of methylated reads / total number of reads
- Quality depends on sequencing coverage:
  - Not all regions are covered equally in sequencing
  - Coverage can vary from 0 to hundreds of reads
  - Low-coverage regions have less reliable methylation estimates
- Coverage information is therefore crucial for:
  - Weighting reliable measurements more heavily
  - Handling missing or low-confidence data

Real data is further complicated by technical noise:
- Sequencing errors
- PCR amplification biases
- DNA fragmentation patterns
- Batch effects

and by biological factors:
- Cell type similarities
- Tissue-specific methylation patterns
- Biological variation within cell types
- Rare cell types (<1% of mixture)
This complex interplay of factors makes the deconvolution problem challenging for traditional optimisation approaches, motivating our deep learning solution.
This formulation is subject to two key biological constraints:
- Non-negativity: proportions cannot be negative (P ≥ 0)
- Sum-to-one: each sample's proportions must sum to 1 (every column of P sums to 1)
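To make the formulation concrete, here is a minimal NumPy sketch of the forward model; the atlas and proportion values below are made up purely for illustration:

```python
import numpy as np

# Toy reference atlas R: 4 regions x 3 cell types (made-up values)
R = np.array([
    [0.95, 0.10, 0.20],
    [0.05, 0.90, 0.15],
    [0.10, 0.20, 0.85],
    [0.90, 0.85, 0.05],
])

# True proportions P for one sample: non-negative, sums to 1
P = np.array([0.6, 0.3, 0.1])

# Observed mixture methylation for this sample: one column of M = R x P
M = R @ P
print(M)        # expected methylation level per region
print(P.sum())  # 1.0 -- the sum-to-one constraint
```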
Most existing methods solve this problem using Non-negative Least Squares (NNLS), which minimises ||M - RP||² subject to P ≥ 0. While effective, these approaches:
- Assume linear relationships
- May not fully capture complex interactions
- Can be sensitive to noise and missing data
- Often struggle with rare cell types (<1% proportion)
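For reference, the NNLS baseline can be sketched with SciPy as follows. Note that `scipy.optimize.nnls` enforces only non-negativity, so the sum-to-one constraint has to be applied by post-hoc renormalisation (this is a generic sketch, not any particular published tool):

```python
import numpy as np
from scipy.optimize import nnls

def nnls_deconvolve(R: np.ndarray, m: np.ndarray) -> np.ndarray:
    """Estimate proportions for one sample m by minimising ||m - R p||^2 subject to p >= 0."""
    p, _residual = nnls(R, m)
    # NNLS guarantees non-negativity; renormalise post hoc for sum-to-one
    return p / p.sum() if p.sum() > 0 else p

# Example with the toy atlas above
R = np.array([
    [0.95, 0.10, 0.20],
    [0.05, 0.90, 0.15],
    [0.10, 0.20, 0.85],
    [0.90, 0.85, 0.05],
])
m = R @ np.array([0.6, 0.3, 0.1])
print(nnls_deconvolve(R, m))  # recovers ~[0.6, 0.3, 0.1] in the noise-free case
```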
DeepConv instead takes two inputs per sample:
- Methylation values (0-1) for each genomic region
- Coverage information (read counts) for each region

The architecture reflects several deliberate design choices:

Parallel Encoders
- Separate processing paths for methylation and coverage
- Allows the model to learn different feature patterns:
  - Methylation encoder: pattern recognition in methylation signals
  - Coverage encoder: quality and confidence assessment
- Each encoder can specialise in its domain
Dimensionality Choices
- Initial expansion to 512 dimensions:
  - Allows learning of rich feature representations
  - Captures complex interactions between regions
- Reduction to 256 dimensions:
  - Compresses information to the most relevant features
  - Reduces the risk of overfitting
Regularisation Strategy
- Heavy dropout (0.4):
  - Prevents over-reliance on specific markers
  - Improves robustness to missing data
- Batch normalisation:
  - Stabilises training
  - Handles varying scales of methylation and coverage
Feature Fusion
- Concatenation rather than addition:
  - Preserves distinct information from both streams
  - Allows the model to learn the optimal combination
- No additional processing after fusion:
  - Lets the final layer learn a direct mapping to proportions
Output Design
- Single linear layer to proportions:
  - Simple mapping from learned features
  - Avoids overfitting in the final stages
- Softmax activation:
  - Enforces the biological constraints
  - Naturally handles the proportion requirements
Biological Motivation
- Mirrors the two key aspects of methylation data:
  - Signal (methylation values)
  - Confidence (coverage)
- Handles sparsity and noise in real data
Technical Advantages
- End-to-end differentiable
- Relatively simple to train
- Computationally efficient
- Easy to interpret feature importance
Practical Benefits
- Can process variable-length inputs
- Robust to missing data
- Scales well with number of markers
- Easily adaptable to different reference panels
In summary, the architecture consists of (see the PyTorch sketch below):
- Parallel encoders for methylation and coverage data:
  - Process methylation patterns
  - Account for varying coverage depths
- Feature fusion through concatenation
- A final layer producing cell type proportions
The model enforces biological constraints through its architecture:
- Softmax activation in the final layer ensures:
  - All proportions are positive (0-1 range)
  - Proportions sum to 1
- Differentiable end-to-end training
This architectural choice is superior to post-processing normalisation because:
- It incorporates constraints during training
- Allows the model to learn within the constrained space
- Maintains differentiability for gradient-based optimisation
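For concreteness, here is a minimal PyTorch sketch of such a two-encoder network. The layer sizes follow the design notes above (512 then 256 per encoder, dropout 0.4, concatenation fusion, a single linear layer with softmax); the class name, toy dimensions, and example tensors are illustrative assumptions, not the exact DeepConv implementation:

```python
import torch
import torch.nn as nn

class DeepConvSketch(nn.Module):
    """Illustrative two-encoder deconvolution network (not the exact DeepConv code)."""
    def __init__(self, n_regions: int, n_cell_types: int, dropout: float = 0.4):
        super().__init__()
        def encoder():
            # Expand to 512 for rich feature representations, then compress to 256
            return nn.Sequential(
                nn.Linear(n_regions, 512),
                nn.BatchNorm1d(512),
                nn.ReLU(),
                nn.Dropout(dropout),
                nn.Linear(512, 256),
                nn.BatchNorm1d(256),
                nn.ReLU(),
                nn.Dropout(dropout),
            )
        self.methylation_encoder = encoder()
        self.coverage_encoder = encoder()
        # Fusion by concatenation, then a single linear layer to proportions
        self.head = nn.Linear(256 + 256, n_cell_types)

    def forward(self, methylation: torch.Tensor, coverage: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([
            self.methylation_encoder(methylation),
            self.coverage_encoder(coverage),
        ], dim=1)
        # Softmax enforces non-negativity and sum-to-one per sample
        return torch.softmax(self.head(fused), dim=1)

model = DeepConvSketch(n_regions=300, n_cell_types=12)
m = torch.rand(8, 300)                        # methylation fractions in [0, 1]
c = torch.randint(0, 100, (8, 300)).float()   # read counts per region
p = model(m, c)
print(p.shape, p.sum(dim=1))                  # (8, 12), each row sums to 1
```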
Supporting regularisation and activation choices:
- Batch normalisation: stabilises training with varying methylation levels
- Dropout (0.4): prevents overfitting to specific methylation patterns
- ReLU activation: introduces non-linearity while maintaining non-negativity
Training combines:
- Synthetic mixtures created from reference methylation profiles
- Realistic coverage simulation using a negative binomial distribution (sketched below)
- Technical noise simulation
- KL divergence loss function
- Adam optimiser with learning rate scheduling
- Early stopping based on validation loss
- L2 regularisation to prevent overfitting
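A simplified sketch of this training setup, reusing the `DeepConvSketch` class from the architecture sketch above. The flat Dirichlet prior, negative binomial parameters, learning rate, and weight decay are illustrative assumptions, not DeepConv's actual settings:

```python
import numpy as np
import torch
import torch.nn.functional as F

rng = np.random.default_rng(0)

def simulate_batch(R, batch_size, mean_coverage=30.0, dispersion=5.0):
    """Synthetic mixtures from a reference atlas R (regions x cell_types)."""
    n_regions, n_cell_types = R.shape
    # Random proportions on the simplex (flat Dirichlet is an assumption)
    P = rng.dirichlet(np.ones(n_cell_types), size=batch_size)    # (batch, cell_types)
    expected = P @ R.T                                           # (batch, regions)
    # Negative binomial coverage; NumPy parameterises it by (n, p)
    p_nb = dispersion / (dispersion + mean_coverage)
    coverage = rng.negative_binomial(dispersion, p_nb, size=expected.shape)
    # Observed methylation = methylated reads / coverage (binomial read noise)
    meth_reads = rng.binomial(coverage, np.clip(expected, 0.0, 1.0))
    observed = np.where(coverage > 0, meth_reads / np.maximum(coverage, 1), 0.0)
    to_t = lambda a: torch.tensor(a, dtype=torch.float32)
    return to_t(observed), to_t(coverage), to_t(P)

R = rng.random((300, 12))  # toy atlas for illustration
model = DeepConvSketch(n_regions=300, n_cell_types=12)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)  # L2 via weight_decay
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimiser)             # LR scheduling

# One training step with the KL divergence loss
meth, cov, true_p = simulate_batch(R, batch_size=32)
pred = model(meth, cov)
loss = F.kl_div(torch.log(pred + 1e-8), true_p, reduction="batchmean")
optimiser.zero_grad()
loss.backward()
optimiser.step()
# In a full loop: scheduler.step(val_loss) and early stopping on validation loss
```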
Installation:

```bash
git clone https://github.com/username/deepconv
cd deepconv
```
To train the model:

```bash
cd deepconv/src
python -m deep_conv.deconvolution.train \
    --batch_size 32 \
    --n_train 100000 \
    --n_val 20000 \
    --atlas_path /mnt/lustre/users/bschuster/OAC_Trial_TAPS_Tissue/Data/TAPS_Atlas/Atlas_dmr_by_read.blood+gi+tum.U100.l4.bed \
    --model_save_path /users/zetzioni/sharedscratch/deepconv/src/deep_conv/saved_models/
```
To estimate cell-type proportions for a sample set:

```bash
cd deepconv/src
python -m deep_conv.deconvolution.estimate_cell_type \
    --model_path /users/zetzioni/sharedscratch/deconvolution_model.pt \
    --cell_type CD4-T-cells \
    --atlas_path /mnt/lustre/users/bschuster/OAC_Trial_TAPS_Tissue/Data/TAPS_Atlas/Atlas_blood+gi+tum.U100.l4.bed \
    --wgbs_tools_exec_path /users/zetzioni/sharedscratch/wgbs_tools/wgbstools \
    --pats_path /mnt/lustre/users/bschuster/OAC_Trial_TAPS_Tissue/Data/Benchmark/pat/blood+gi+tum.U100/Song/mixed/CD4 \
    --output_path /users/zetzioni/sharedscratch/cd4
```
Requirements: Python 3.9+, PyTorch, NumPy, Pandas, scikit-learn, wgbs_tools, plotly, kaleido