This repository contains all the required material to reproduce the figures of my Linear MALDI-ToF simultaneous spectrum deconvolution and baseline removal article. The article describes a new deconvolution algorithm. This algorithm processes the spectrum baseline and peak deconvolution jointly.
The algorithm takes the form of an optimization problem:
where the minimizer (xb,xp) represents the smooth spectrum baseline and the deconvolved peaks.
The algorithm involves two steps. These steps are detailed in the article:
- the first step selects peak support,
- the second step performs a peak height debiasing.
The new algorithm is compared to a more usual approach where the baseline removal is performed before and independently of the peak extraction. This comparison is detailed in the submitted article. The material to reproduce the results is available in this subdirectory.
We present here a straightforward implementation with the following characteristics:
- the convolution operator is implemented as a dense matrix-vector product,
- the tridiagonal matrix is implemented as a dense matrix,
- the optimizer is a simple projected gradient.
A more refined implementation, as described in the article takes care of the following points:
- the convolution operator is implemented as a specialized subroutine,
- the tridiagonal matrix is stored in 3 vectors and uses specialized subroutines,
- the optimizer is a projected Barzilai-Borwein solver.
The optimization problem is a strictly convex one, hence its solution is unique and independent of the solver. However, the basic implementation available in this GitHub repository is limited to:
- spectrum of small size (~ 1000 channels),
- running time is higher.
The C++ code has been tested under Linux.
It depends on the usual Blas
and Lapack
libraries which are used
for linear algebra operations:
libcblas.so
libblas.so
liblapacke.so
liblapack.so
It depends on two other libraries (which are in the ExternalSoftware/
directory):
- https://github.com/ben-strasser/fast-cpp-csv-parser used to read csv files
- https://github.com/jarro2783/cxxopts used to parse the command line options
A Makefile
is used to build the demo software:
make
Two executables are generated:
generateSynthetic
used to generate the synthetic data used in the article,jointDeconvolution
used to run the algorithm.
generateSynthetic
is used to generate the synthetic data described in the article.
./generateSynthetic --help
Generates some synthetic data Usage: ./generateSynthetic [OPTION...] -c, --concavity 1 Baseline concavity, an integer in {-1,0,+1} -n, --noise 0.2 Noise standard deviation (>0) -s, --seed 0 Random number generator seed --what 0 0-data, 1-ground truth, 2-peak list --help Print help
For instance
./generateSynthetic --noise 0.2 > Data/synthetic.csv
can be used to generate the Data/synthetic.csv
input data. This
synthetic spectrum can be plotted with gnuplot
. Launch the gnuplot
program
gnuplot
then type
set datafile separator ','
plot "Data/synthetic.csv" u 1:2 w l t "Synthetic data"
The deconvolution algorithm can be invoked by the jointDeconvolution
command. Its options are listed below:
./jointDeconvolution --help
A joint baseline removal and deconvolution algorithm, contact [email protected] Usage: ./jointDeconvolution [OPTION...] positional parameters -i, --input FILE Input file (two columns X,Y) -o, --output OUTPUT FILE Output file (default: $(FILE).out) --sigma_left 10 Peak shape factor (>0) --sigma_right 10 Peak shape factor (>0) --yb_left y[0] Left baseline value (if not defined use y[0]) --yb_right y[n-1] Right baseline value (if not defined use y[n-1]) --peakMinHeight 0.01 Minimal height to accept peak (>=0) --lambda_1 0.1 lambda_1 penalty term (>=0) --lambda_2 1e-05 lambda_2 penalty term (>=0) --mu 500 mu penalty term (>0) --eps 0.0001 eps goal (>=0) --max_iter 5000 maximum number of iterations (>0) -p, --gnuplot Gnuplot script --help Print help
The deconvolution results are saved in an output file $(FILE).out where the input file name has been completed by the .out extension.
The -p option also generates the gnuplot
script
$(FILE).out.gnuplot that can be invoked to create the associated
plot (in eps
or png
format).
The default option values can be used to process the synthetic
spectrum, hence simply type (note the -p option to generate the
gnuploy
script):
./jointDeconvolution -p ./Data/synthetic.csv
to deconvolve these synthetic data. This must create the ./Data/synthetic.csv.out and ./Data/synthetic.csv.out.gnuplot output files.
As described you can generate the associated plots by:
gnuplot ./Data/synthetic.csv.out.gnuplot
This must create the two files:
- ./Data/synthetic.csv.out.png
- ./Data/synthetic.csv.out.eps
Note: if you want to stay in a gnuplot
interactive mode after script
execution, just add a minus sign ’-’ at the end:
gnuplot ./Data/synthetic.csv.out.gnuplot -
We can test the algorithm on a low resolution MALDI-ToF isotopic motif:
./jointDeconvolution Data/MALDI_ToF_Low_A.csv -p --sigma_left 0.4 --sigma_right 0.4 --mu 100
gnuplot ./Data/MALDI_ToF_Low_A.csv.out.gnuplot
This situation is quite extreme, however we can check that extracted peaks (the red impulses) are approximately spaced by 1 m/z which is the expected result (most of the peaks in MALDI spectra are mono-charged).
We can proceed further by giving an example in the 2600 m/z mass range:
./jointDeconvolution Data/MALDI_ToF_Low_B.csv -p --sigma_left 0.4 --sigma_right 0.5 --lambda_1 0.25
gnuplot ./Data/MALDI_ToF_Low_B.csv.out.gnuplot
There the impulses are also approximately separated by 1 m/z.
However isotopic motif deconvolution without using any extra information (like an expected 1 m/z spacing between peaks) can lack of robustness. That is the reason why it is certainly safer to use a coarser peak shape modeling the unresolved isotopic motif as a whole. This is illustrated by the next figure:
./jointDeconvolution Data/MALDI_ToF_Low_B.csv -o Data/MALDI_ToF_Low_B2.csv.out -p --sigma_left 2 --sigma_right 2 --lambda_1 1
gnuplot ./Data/MALDI_ToF_Low_B2.csv.out.gnuplot
We can also modify the regularization λ_1:
./jointDeconvolution Data/MALDI_ToF_Low_B.csv -o Data/MALDI_ToF_Low_B3.csv.out -p --sigma_left 2 --sigma_right 2 --lambda_1 0.5
gnuplot ./Data/MALDI_ToF_Low_B3.csv.out.gnuplot
High resolution MALDI-ToF spectra are easier to deconvolve. Here we modify default parameter values:
- the Gaussian peak shape factor is set to 0.15
- the baseline value at boundaries is set to 80
- the λ_1 penalization is set to 0.5
./jointDeconvolution Data/MALDI_ToF_High_A.csv -p --sigma_left 0.15 --sigma_right 0.15 --yb_left 80 --yb_right 80 --lambda_1 0.5
gnuplot ./Data/MALDI_ToF_High_A.csv.out.gnuplot
We can modify the λ_1 value to 0.2 to accept more peaks, this gives:
The presented algorithm is generic and can be used for other type of spectra.
Here a γ-nuclear spectrum:
./jointDeconvolution --sigma_right 1 --sigma_left 2 --mu 100 --lambda_1 0.01 -p Data/Gamma.csv
gnuplot ./Data/Gamma.csv.out.gnuplot