-
Notifications
You must be signed in to change notification settings - Fork 8
1. Detailed on MAW R Script
An input directory (/input_dir) should have the following files.
- One LCMS-2 spectra .mzML file
- hmdb.rda (10.5281/zenodo.7519270)
- mbankNIST.rda (10.5281/zenodo.7519270)
- gnps.rda (10.5281/zenodo.7519270)
- COCONUT.csv (10.5281/zenodo.7704937)
Follow the R script: MAW/cwl/Workflow_R_Script_all.r. It starts with loading all the necessary libraries and then all the functions are written in the same file. At the end of the script, at line 2165, the actual executable script starts. The beginning refers to the predefining the parallelisation of the some functions.
The script is written in a way to run via Rscript command line, with the following arguments:
#define arguments
args = commandArgs(trailingOnly = TRUE)
# input directory files
mzml_file <- args[1] # path to the mzml input file
gnps_file <- args[2] # path to the gnps.rda downloaded from Zenodo
hmdb_file <- args[3] # path to the hmdb.rda downloaded from Zenodo
mbank_file <- args[4] # path to the mbankNIST.rda downloaded from Zenodo
file_id <- args[5] # any given file ID
ppmx = as.numeric(args[6]) # ppm value for MS2 peak range
collision_info = as.logical(args[7]) # whether files have collision energy information
db_name = args[8] # any given name of the local database for MetFrag
db_path = args[9] # the path to the local database for MetFrag
Finally, generate a result directory using:
mzml_result <- str_remove(basename(mzml_file), ".mzML")
dir.create(mzml_result)
Now, we can start with the MS2 data pre-processing which includes reading the mzML files, removing empty spectra and extracting precursor m/z values. Here x is the mzML input file.
# read mzML file and create output directory
spec_pr <- spec_Processing(mzml_file, mzml_result)
This results into two files. One file saves the pre-processed spectra in mzML format, accessed by spec_pr1 (e.g: /opt/workdir/data/File1/processedSpectra.mzML) and other file keeps a list of precursor m/z(s) as a txt file, accessed by spec_pr2 (e.g: /opt/workdir/data/File1/premz_list.txt)
MAW integrates three open-source spectral Databases: GNPS, HMDB and MassBank. Each of these databases can be downloaded using their specific file formats.
- GNPS (ALL_GNPS) can be downloaded via this link in mgf format.
- MassBank_NIST can be downloaded with this link in msp format.
- HMDB (All Spectra Files)can be downloaded with this link(https://hmdb.ca/downloads) in xml format.
You can download the following versions of these databases from 10.5281/zenodo.7519270.
GNPS saved at 2023-01-09 15:24:46
HMDB saved at 2023-01-09 14:35:46 with release version Current Version (5.0)
MassBank saved at 2022-01-09 15:10:52 with release version 2022.12 as mbankNIST.rda
Store these .rda files in the same directory as your input mzml file e.g: opt/workdir/data
The module called spec_dereplication_file
is used to perform Spectral database dereplication against the input data present in the mzML file. The function can be used in the following way:
df_derep <- spec_dereplication_file(mzml_file = mzml_file,
pre_tbl = paste(mzml_result, "/premz_list.txt", sep = ""),
proc_mzml = paste(mzml_result, "/processedSpectra.mzML", sep = ""),
db = "all",
result_dir = mzml_result,
file_id,
no_of_candidates = 50,
ppmx)
Here, mzml_file is the input file, pre_tbl is the list of precursor m/z(s) saved in a text file and proc_mzml is the pre-processed data saved as mzML file. db can be either "gnps", "hmdb", "mbank" or "all". no_of_candidates define how many candidates will be considered.
The function first performs further pre-processing steps needed for spectral database dereplication, such as removing low intensity peaks from MS2 fragmentation spectra, normalisation of the peak intensities and removal of any peaks higher or equal to precursor m/z value. This is performed for both the input spectra and the spectra in all the databases which matches the precursor m/z of the input mzML files. Once the preprocessing is done, the input spectra and the database candidate spectra are matched resulting in a score matrix.
Here, based on the score observation, GNPS >= 0.85, MassBank and HMDB >= 0.75 is the threshold for cosine similarity score. Another function within this module calculates the difference between matching peaks intensity and individual peak m/z. Based on the above functions, each precursor m/z from each file has a GNPS, MassBank and HMDB .csv files based on the selection of the database. e.g: /opt/workdir/data/File1/spectral_dereplication/MassBank/mbank_results_for_file_1M182RNAID1.csv would look like this
no. | MBmax_similarity | MBmzScore | MBintScore | MQMatchingPeaks | MBTotalPeaks | mQueryTotalPeaks | MBFormula | MBSMILES | MBspectrumID | MBcompound | Source |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.971473979757729 | 0.933333333333333 | 0.951921613619267 | 7 | 7 | 8 | C9H11NO3 | c1cc(ccc1CC@@HN)O | MSBNK-UFZ-UA005601 | L-Tyrosine | MassBank |
Here the name of the file indicates the file id which is file_1, the precursor m/z which is M182 and the retention time is not goven hence RNA, with the id ID1. The .csv file contains information on MBmax_similarity which is the matching spectra cosine similarity score, MBmzScore which is the score for matching m/z of the peaks and ranges from 0-1, MBintScore which is the score for matching intensity of the peaks and ranges from 0-1, MQMatchingPeaks which is the number of matching peaks between MassBank and Query spectra, and other columns names are self-explanatory. This form of result follows for GNPS and HMDB as well.
After the pre-processing steps from section 1, we can also do the compound database dereplication. This module is also divided into further independent functions. Here we use MetFrag as a tool for insilico fragmentation spectral matching, but the workflow also generates SIRIUS input .ms files.
Metfrag .txt parameter file
PeakListPath = File1/insilico/peakfiles_ms2/Peaks_01.txt
IonizedPrecursorMass = 152.995452880859
PrecursorIonMode = -1
IsPositiveIonMode = False
MetFragDatabaseType = LocalCSV
LocalDatabasePath = COCONUT_Jan2022.csv
DatabaseSearchRelativeMassDeviation = 5
FragmentPeakMatchAbsoluteMassDeviation = 0.001
FragmentPeakMatchRelativeMassDeviation = 15
MetFragCandidateWriter = CSV
SampleName = 1_id_endo_negM153R40ID1_mz_152.995452880859_rt_40.0539309
ResultsPath = File1/insilico/MetFrag/coconut/
MetFragPreProcessingCandidateFilter = UnconnectedCompoundFilter
MetFragPostProcessingCandidateFilter = InChIKeyFilter
MaximumTreeDepth = 2
NumberThreads = 1
SIRIUS .ms file
>compound file_1M182RNAID1 ## ID
>parentmass 151.03643799 ## precursor m/z
>charge -1 ## charge, pos or neg
>rt NAs ## median retention time in seconds
>collision 15eV ## ms2 simply can be written here instead of collision energy, this is the fragmentation peak list
65.1203 26499.103516
67.302422 30616.269531
71.056732 26360.367188
...
136.075729 7684037
147.043976 1080712.25
165.054611 3833276.5
182.081146 575214.9375 ### first column is m/z and second column is intensity
In order to create such a file, further pre-processing and data extraction is required from mzML format to be stored in txt or .ms. To do that, ms2_peaks
function is used, which extracts ms2 fragmentation peaks for all of the precursor m/z. If there are more MS2 spectra associated with one precursor m/z, the function combines these MS2 spectra and extracts all fragmentation peaks. These peaks are stored as txt (e.g: "/usr/MAW/Tutorial/File1/insilico/peakfiles_ms2/Peaks1.txt")
# Extract MS2 peaks
spec_pr2 <- ms2_peaks(pre_tbl = paste(mzml_result, "/premz_list.txt", sep = ""),
proc_mzml = paste(mzml_result, "/processedSpectra.mzML", sep = ""),
result_dir = mzml_result,
file_id)
# Extract MS1 peaks
ms1p <- ms1_peaks(x = paste(mzml_result,'/insilico/MS2DATA.csv', sep = ""),
y = NA,
result_dir = mzml_result,
QCfile = FALSE)
The result from this function is not only the MS2 Fragmentation peak lists, but also a table of precursor m/z along with other elements of each feature. e.g: the file "/opt/workdir/data/File1/insilico/MS2DATA.csv" contains:
no. | id_X | premz | rtmed | rtmean | int | col_eng | pol | ms2Peaks | ms1Peaks |
---|---|---|---|---|---|---|---|---|---|
1 | file_1M182RNAID1 | 182.081 | NA | NA | NA | NA | neg | ./Example_Tyrosine/insilico/peakfiles_ms2/Peaks_01.txt | NA |
Here, id represents a unique id for individual feature (precursor m/z and rt median). This ID corresponds with the one generated for spectral database dereplication. Last column represents the path of the MS2 fragmentation peaks which can later be used to retrieved the ms2 peaks.
(OPTIONAL) Now, to create the ms files, we have extracted all relevant information. To write the .ms files, function called sirius_param
is used. MS1DATA.csv has all the information needed to write the .ms file.
#prepare sirius parameter files
sirius_param_files <- sirius_param(x = paste(mzml_result,'/insilico/MS1DATA.csv', sep = ""),
result_dir = mzml_result,
SL = FALSE,
collision_info)
The resulting table looks like this:
no. | sirius_param_file | outputNames | isotopes |
---|---|---|---|
1 | /opt/workdir/data/File1/insilico/SIRIUS/1_NA_iso_MS1p_182_SIRIUS_param.ms | /opt/workdir/data/File1/insilico/SIRIUS/1_NA_iso_MS1p_182_SIRIUS_param.json | NA |
To generate txt files for metfrag following command is used:
metfrag_param(x= paste(mzml_result,'/insilico/MS1DATA.csv', sep = ""),
result_dir = mzml_result,
db_name,
db_path,
ppm_max = 5,
ppm_max_ms2= 15)
After that a JSON file is created with all the outputs and their relevant paths present in the JSON file.