Ana Mendes edited this page Jun 14, 2024 · 29 revisions

MetaboLink


MetaboLink is a web-based application built with Shiny (R). It is available at http://computproteomics.bmb.sdu.dk/Metabolomics/.

What you can do with the app:

  • Remove features based on intensity between blank and QC samples.
  • Remove features based on missing values.
  • Normalize features based on internal standards.
  • Impute missing values with different functions.
  • Correct signal drift using QC samples.
  • Merge datasets with different ion modes and remove duplicates.
  • Perform statistical analysis.

Untargeted metabolomics workflow


Tutorial

YouTube - MetaboLink

The example dataset used in the app was obtained during the study Pulmonary maternal immune activation does not cross the placenta but leads to fetal metabolic adaptation and has been deposited in the Metabolomics Workbench database under study ID code ST003125. The dataset can be loaded to the app by pressing "Load example".


1. Input

1.1 Data prerequisite

All metabolomics data that have been peak-picked and, optionally, annotated can be loaded into MetaboLink.

1.2 Data file

Comma-separated values (CSV) file with samples in columns and features in rows. If samples are in rows, select file format "Samples in rows" and the dataset will be transposed in the app.

If you get an error "Invalid multibyte string at ..." after uploading, your file might contain invalid characters. We recommend using only English letters, underscores, and numbers in names.

| Names | m/z | RT | Adduct_pos | SMILES | Sample01 | Sample02 | Blank01 | QC01 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Riboflavin | 377.145 | 3.49 | [M+H]+ | AUNGANRZJHBGPY-SCRDCRAPSA-N | 1693752 | 1866529 | 1481 | 9799 |
| Verbenone | 151.112 | 4.55 | [M+H]+ | DCSCXTJOXBUFGB-UHFFFAOYSA-N | 514 | 431 | 516 | 455 |
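If your samples are in rows, the app transposes the table when "Samples in rows" is selected. The equivalent operation can be sketched in Python with a hypothetical miniature data file (the data below is illustrative):

```python
import csv, io

# Hypothetical miniature data file: samples in columns, features in rows,
# matching the layout MetaboLink expects.
raw = """Names,m/z,RT,Sample01,Sample02
Riboflavin,377.145,3.49,1693752,1866529
Verbenone,151.112,4.55,514,431
"""

rows = list(csv.reader(io.StringIO(raw)))

# If samples were in rows instead, the app would transpose the table;
# the equivalent operation here is zip(*rows).
transposed = [list(col) for col in zip(*rows)]

print(transposed[0])  # ['Names', 'Riboflavin', 'Verbenone']
```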

1.3 Sequence file (metafile)

After uploading the data file, your dashboard will update and open the sequence panel. Here you can upload a comma-separated values (CSV) file, with samples in rows and attributes in columns, which works as an ID for the main table.

| Sample | Label | Batch | Group | Time | Paired | Amount |
| --- | --- | --- | --- | --- | --- | --- |
| Names | Name |  |  |  |  |  |
| m.z | Mass |  |  |  |  |  |
| QC09 | QC | 1 |  |  |  |  |
| Blank01 | Blank | 1 |  |  |  |  |
| X1 | Sample | 1 | 2 | 15 |  | 20.3 |

  • Sample: should contain the names of the samples and must be consistent with the sample names in the data file

  • Label: the label is automatically identified by the app when uploading a data file, which means you do not need to include it in your meta file. The labels can be: Name, Mass, Retention time (RT), Blank, Sample, Quality control (QC), Adduct_pos, Adduct_neg, or -. For the data table in section 1.2, the app would identify the following labels:

| Names | m.z | RT | Adduct_pos | SMILES | Sample01 | Sample02 | Blank01 | QC01 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Name | Mass | RT | Adduct_pos | - | Sample | Sample | Blank | QC |

  • Batch: numeric column for different batches of samples

  • Group: group name/number (can include only letters and numbers)

  • Time: time point (can include only letters and numbers)

  • Paired: paired samples should have the same value in this column (e.g., samples X1 and X3 are paired, so they both have '1')

  • Amount: total sample amount to be used in normalization

IMPORTANT:

  • Your sequence file does not need to include all the columns above, but it should not contain any additional ones.
  • When manually updating the metafile in the app, the user must press the 'Update' button to save the changed values.
  • From the sequence panel, the user can delete unwanted columns, edit group names, and export the meta file.
  • It is only possible to change the labels in the metafile by naming the data columns accordingly, i.e.:
    • Metabolites column should include "name";
    • Mass column should include "mass", "m/z" or "m.z", and be numeric;
    • Retention time should include "rt", "time" or "retention", and be numeric;
    • QCs and Blanks should include, respectively, "qc" and "blank", and be numeric;
    • Samples are all the numeric columns not containing any of the above and can be named according to user preference;
    • Adduct column should include "adduct_pos", "adduct_neg", "adduct", or "ion(s)". If the user opts for "adduct" or "ion(s)", the values in the column should be in the format "[X]+" or "[X]-".

Adducts in the app:

| Adduct | Weight | Charge |
| --- | --- | --- |
| [M+H]+ | 1.007276 | 1 |
| [M-H]- | -1.007276 | 1 |
| [M+Na]+ | 22.989218 | 1 |
| [M+Cl]- | 34.969402 | 1 |
| [M+H-H2O]+ | -17.002724 | 1 |
| [M+2H]2+ | 1.007276 | 0.5 |
| [M-H-H2O]- | -19.017276 | 1 |
| [2M+H]+ | 1.007276 | 2 |
| [M+H-NH3]+ | -16.019274 | 1 |
| [M+H-2H2O]+ | 37.027276 | 1 |
| [2M-H]- | -1.007276 | 2 |
| [M+NH4]+ | 18.033823 | 1 |
| [M-2H]2+ | -1.007276 | 0.5 |
| [M+K]+ | 38.96321 | 1 |
| [M+HCOOH-H]- | 46.00548 | 1 |
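Assuming the Weight and Charge columns are applied as M = (m/z − Weight) / Charge (a reading consistent with the dimer and multi-charge rows, where Charge acts as an M-multiplicity divisor), a neutral mass can be recovered from an observed m/z as in this Python sketch. The adduct subset is copied from the table; the function itself is illustrative, not the app's internal code:

```python
# Sketch: recover the neutral monoisotopic mass M from an observed m/z,
# assuming the table is applied as M = (m/z - Weight) / Charge.
ADDUCTS = {
    "[M+H]+":   (1.007276, 1),
    "[M-H]-":   (-1.007276, 1),
    "[M+2H]2+": (1.007276, 0.5),
    "[2M+H]+":  (1.007276, 2),
}

def neutral_mass(mz: float, adduct: str) -> float:
    weight, charge = ADDUCTS[adduct]
    return (mz - weight) / charge

# Riboflavin observed at m/z 377.145 as [M+H]+:
print(round(neutral_mass(377.145, "[M+H]+"), 4))  # 376.1377
```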

2. Data Pre-processing

2.1 Blank filtration

Blank extraction sample

A blank extraction sample is essentially a "no sample" control, i.e., the blank sample is subjected to the same conditions as the test samples and thus contains all the reagents and undergoes all the steps of the extraction procedure, but without the biological sample.

The blank sample helps identify contaminants that may be introduced during the sample preparation process. This includes contaminants from solvents, reagents, and labware. By comparing the blank to actual samples, one can discern which peaks in the mass spectra are due to contaminants rather than the sample itself.

Blank filtration

Blank filtration removes uninformative features based on the ratio between the mean intensity in the QC samples and the mean intensity in the blank samples: features that are not sufficiently more abundant in the QCs than in the blanks, according to a user-specified filtration ratio, are removed.

The following table contains dummy data to exemplify how the blank filtration function works:

| Analyte | QC01 | QC02 | QC03 | Blank01 | Blank02 |
| --- | --- | --- | --- | --- | --- |
| A | 30 | 35 | 33 | 5 | 10 |
| B | 25 | 30 | 25 | 10 | 10 |

If we consider analyte A: Mean(QC) = 32.7 and Mean(Blank) = 7.5. For a user-defined filtration ratio of 3, this feature would be kept, given that Mean(Blank) × 3 < Mean(QC) (22.5 < 32.7).

However, analyte B would be removed, since Mean(QC) = 26.7 and Mean(Blank) = 10, so Mean(Blank) × 3 > Mean(QC) (30 > 26.7).
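The worked example above can be reproduced in a few lines of Python; this is a sketch of the filtering rule, not the app's code:

```python
from statistics import mean

def keep_feature(qc, blank, ratio=3.0):
    """Keep a feature only if its QC mean exceeds ratio times its blank mean."""
    return mean(qc) > ratio * mean(blank)

# Dummy data from the table above:
print(keep_feature([30, 35, 33], [5, 10]))   # analyte A: 22.5 < 32.7 -> True (kept)
print(keep_feature([25, 30, 25], [10, 10]))  # analyte B: 30 > 26.7  -> False (removed)
```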


2.2 Missing value filtration

This step removes features with more than a user-defined percentage (X) of missing values.

Filtering features based on their presence in a high percentage of QC samples ensures that only reliable and reproducible signals are considered. This practice reduces the likelihood of including false positives and maintains high data quality by focusing on features that are consistently detected across multiple QC injections. By implementing this filtering step, researchers can enhance the robustness of the data and the confidence in subsequent statistical and biological interpretations, leading to more accurate and meaningful insights in metabolomics studies.

The following options for missing value filtration are available in the app:

  • in QC: remove features with more than X% of missing values in the quality control samples.
  • in group: remove features with more than X% of missing values in at least one of the groups (excluding QC).
  • entire data: remove features with more than X% of missing values in all samples (including QC).
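For instance, the "in QC" mode can be sketched as follows (Python, with None marking a missing value; the threshold and helper names are illustrative):

```python
def missing_fraction(values):
    # None marks a missing value here; adapt to the NA convention of your export.
    return sum(v is None for v in values) / len(values)

def keep_in_qc(qc_values, max_missing=0.2):
    """'in QC' mode: keep the feature only if at most max_missing (X%)
    of its QC measurements are missing."""
    return missing_fraction(qc_values) <= max_missing

print(keep_in_qc([100, 120, None, 110, 105]))   # 20% missing -> True (kept)
print(keep_in_qc([100, None, None, 110, 105]))  # 40% missing -> False (removed)
```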

2.3 Imputation

Imputation is another way of removing missing values. It is commonly done with the mean or the median of the observed values, but one can also use classification algorithms to impute missing values by leveraging patterns and relationships within the data to predict missing entries.

The user can impute missing values with:

  • KNN: impute missing values with KNN algorithm. KNN provides a way to estimate missing values based on the similarity of instances within the dataset. It uses a localized approach, which can be more relevant especially when the data has complex structures or clusters. However, it is computationally expensive for large datasets and it can be sensitive to outliers, as these can skew the distance metrics.
  • Median: impute missing values with median value from class.
  • min/X: impute missing values with class minimum divided by X (user-defined).
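A minimal Python sketch of the median and min/X strategies for one feature within one class (KNN is omitted for brevity; the function and parameter names are illustrative):

```python
from statistics import median

def impute(values, method="min/X", x=5):
    """Sketch of the median and min/X strategies for one feature within one
    class; None marks a missing value. (KNN is omitted for brevity.)"""
    observed = [v for v in values if v is not None]
    if method == "median":
        fill = median(observed)
    elif method == "min/X":
        fill = min(observed) / x
    else:
        raise ValueError(method)
    return [fill if v is None else v for v in values]

print(impute([10.0, None, 20.0, 30.0], method="median"))      # fill with 20.0
print(impute([10.0, None, 20.0, 30.0], method="min/X", x=5))  # fill with 10/5 = 2.0
```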

Imputation should be used only if strictly necessary.

Drift correction requires full coverage of QC samples, so the app offers the option to only impute these.

PolySTest and VSClust do not require full coverage. If your data contains a lot of missing values, we recommend using these tools for statistical testing without imputation.


2.4 Normalization

Normalization is an important step that adjusts the data to reduce the effect of variables that are not of analytical interest. The user can normalize the data using:

  • Internal standards
  • Drift correction
  • Probabilistic Quotient Normalization (PQN) using the QC samples as reference
  • Sum: column sum normalization.
  • Median: column median normalization.
  • Sample amount

2.4.1 Internal standards normalization

Internal standards normalization is typically used for lipidomics datasets. This normalization technique improves the data by correcting for variability in sample preparation, instrumental fluctuations, and matrix effects.

This function requires a data file including a column of retention times labeled "RT" and a column of annotations labeled "Name". At least one feature should be an internal standard and include "(is)" within its annotation, e.g., "Glucose (is)". Features containing "(is)" within their annotations are displayed in the IS normalization panel, allowing for the user to select or deselect which features should be included in the normalization.

If the app should prioritize normalizing to internal standards of the same lipid structure, features must be annotated with the lipid structure abbreviation at the start of the annotation, followed by a space.
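The core of the method, dividing each feature by the intensity of a selected internal standard in the same sample, can be sketched in Python (all names and numbers here are hypothetical):

```python
# Hypothetical two-sample example: every feature intensity is divided by
# the intensity of the selected internal standard ("Glucose (is)") in the
# same sample.
samples = ["S01", "S02"]
is_intensity = {"S01": 2.0, "S02": 4.0}   # internal standard per sample
feature = {"S01": 100.0, "S02": 220.0}    # raw intensities of one feature

normalized = {s: feature[s] / is_intensity[s] for s in samples}
print(normalized)  # {'S01': 50.0, 'S02': 55.0}
```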

2.4.2 Drift correction

Drift correction is a normalization approach that addresses systematic variability, or drift, in datasets. QC samples pooled from all the study samples are injected throughout the run, and each metabolite is corrected against them. To find the drift pattern, the app uses locally estimated scatterplot smoothing (LOESS). Full coverage of QC samples is required.
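To make the idea concrete, here is a hedged Python sketch in which a piecewise-linear interpolation between QC intensities stands in for the LOESS fit the app uses; each measurement is divided by the local drift estimate and rescaled to the mean QC level:

```python
def interpolate_drift(x, qc_x, qc_y):
    """Piecewise-linear drift estimate at injection position x, standing in
    for the LOESS fit the app actually uses."""
    for (x0, y0), (x1, y1) in zip(zip(qc_x, qc_y), zip(qc_x[1:], qc_y[1:])):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("position outside the QC range")

def correct(x, value, qc_x, qc_y):
    # Divide by the local drift estimate, then rescale to the mean QC level.
    ref = sum(qc_y) / len(qc_y)
    return value / interpolate_drift(x, qc_x, qc_y) * ref

# QCs at positions 1 and 5 drift from 100 down to 80; a sample measured
# at 85 late in the run (position 5) is scaled back up.
print(correct(5, 85.0, [1, 5], [100.0, 80.0]))  # 95.625
```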

QC-RFSC (Random Forest Signal Correction)

Random Forest is a machine learning method that constructs a multitude of decision trees during training and outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. The parameter ntree controls the number of trees in the forest. In the app, the user can set this from 100 to 1000, with a default of 500; this is a common starting point and strikes a good balance between performance and computational efficiency. More trees generally lead to better model performance and more stable predictions, but with diminishing returns.

This method is suitable when you have complex non-linear relationships in the data that can be better captured by random forest. It's useful if you suspect that the signal variation is influenced by multiple factors that interact in a non-linear way.

QC-RLSC (Robust LOESS Signal Correction)

A non-parametric regression method that combines multiple regression models in a k-nearest-neighbour-based approach. It includes an additional robustness step to minimize the influence of outliers.

  • QCspan - controls the span of local regression. Determines the fraction of data points used to fit each local regression.
  • Degree - controls the degree of polynomials used in the local fitting.

This method is suitable when there is a smooth trend in the data that needs to be corrected, and one wants to account for local variations without making strong parametric assumptions. It's useful for data with outliers or non-linear trends that need smoothing.

Important:

  • The order of samples is important here. Blank samples are removed by default, so they should NOT have information in the 'order' column of the sequence file.
  • The order has to start at 1.
  • QC samples are run at regular intervals throughout the analytical sequence alongside the real samples. QC samples should be distributed so that they capture the temporal behavior of the instrument or the assay throughout the entire batch of analysis.
  • Use the Feature drift panel to inspect for any obvious trends or shifts over time which indicates the presence of drift.
  • Choose an appropriate span or window size for the LOESS algorithm. A large span can smooth out more noise but might oversimplify real trends. The span should be chosen based on the degree of smoothing desired and the noise level in the data.
  • The degree setting determines the degree of the polynomial with the default being 2 (quadratic). 1 is linear.

2.4.3 Probabilistic Quotient Normalization (PQN)

PQN is another method to normalize using the QC samples. PQN was developed to correct for the differential dilution of samples. This method normalizes each sample based on a calculated "normalization/dilution factor" that is derived from the median of the quotients of each feature relative to a reference (QCs).

Requires columns labeled QC.
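A minimal Python sketch of PQN, assuming a reference spectrum (e.g., the mean of the QC samples) is given; function names are illustrative:

```python
from statistics import median

def pqn(sample, reference):
    """PQN sketch: compute feature-wise quotients against the reference
    spectrum, then divide the whole sample by their median (the
    normalization/dilution factor)."""
    quotients = [s / r for s, r in zip(sample, reference)]
    dilution = median(quotients)
    return [s / dilution for s in sample]

# A sample diluted to half of the reference is scaled back up:
print(pqn([5.0, 10.0, 50.0], [10.0, 20.0, 100.0]))  # [10.0, 20.0, 100.0]
```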

2.4.4 Sum and median normalization

Sum normalization, or total sum scaling (TSS), involves scaling each sample by the sum of all measured values in that sample. This method is most suitable when the total concentration of all features is relevant and expected to be consistent across samples, such as in controlled experimental conditions.

Median normalization adjusts the data based on the median value of the features within each sample. This is preferable when the data is skewed and there is a need to mitigate the influence of a few high- or low-abundance features.
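Both methods are simple per-sample rescalings, as in this Python sketch:

```python
from statistics import median

def sum_normalize(sample):
    # Total sum scaling: each value divided by the sample's total.
    total = sum(sample)
    return [v / total for v in sample]

def median_normalize(sample):
    # Each value divided by the sample's median.
    m = median(sample)
    return [v / m for v in sample]

s = [2.0, 4.0, 10.0]
print(sum_normalize(s))     # [0.125, 0.25, 0.625] -> sums to 1
print(median_normalize(s))  # [0.5, 1.0, 2.5] -> median becomes 1
```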

2.4.5 Sample amount normalization

Sample amount normalization is a method used to adjust the data based on the quantity of sample material used in the analysis. This technique is essential when there are variations in the initial amount of sample material, which can significantly impact the measured values. Common approaches are weight-based normalization, volume-based normalization, and cell count normalization.

It is important as it ensures that the results are not biased by the initial quantity of sample material, thereby allowing for more accurate comparisons between samples.

2.4.6 Selection of normalization method

The selection of the normalization method depends on the nature of the user's data and the analytical goals. One should be aware of the distribution of the data, if there are significant outliers (i.e., a method less sensitive to outliers might be preferred), the consistency of total amount (if this is relatively constant across samples), the variability in sample preparation (e.g., normalization methods that adjust for variations like total content per sample might be more suitable).

Sometimes, it is unclear which method is best just from theoretical considerations and in these cases, it is advised to apply multiple normalization methods and validate the results using known markers or control samples to see which method best recovers expected results.


2.5 Log transformation and Scaling

2.5.1 Log transformation

Log transformation is applied to metabolomics data to stabilize variance across the range of metabolite concentrations. It helps in handling the wide range of metabolite concentrations and reduces the impact of extreme values or outliers, making the data more suitable for statistical analysis.

When: Typically used before data scaling and statistical analysis.

2.5.2 Scaling

Scaling ensures all variables (analytes) contribute equally to the analysis and prevents variables with large ranges from dominating the analysis.

When: After log-transformation, before statistical analysis.
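For example, log2 transformation followed by unit-variance (auto) scaling of one feature can be sketched as follows; the choice of log base and of auto-scaling is illustrative, as other bases and scaling schemes are possible:

```python
import math
from statistics import mean, stdev

def log2_transform(values):
    return [math.log2(v) for v in values]

def autoscale(values):
    """Unit-variance (auto) scaling: center on the mean and divide by the
    standard deviation so every feature contributes comparably."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

x = autoscale(log2_transform([2.0, 4.0, 8.0]))
print([round(v, 3) for v in x])  # [-1.0, 0.0, 1.0]
```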


2.6 Merge datasets

The user can merge positive and negative ion modes to ensure the best coverage of detected metabolites. This approach increases confidence in results by reducing false positives and redundancy.

It is required to upload two datasets with the same samples. One dataset needs to include an adduct column labeled "adduct_pos" and the other a column labeled "adduct_neg".

2.6.1 Priorities

The user can set priorities for the features by including "_high", "_medium", or "_low" in their names. It is possible to change these definitions and priorities by clicking the 'Edit priorities' button in the 'Merge datasets' panel.


3. Explore Data

This panel exhibits some features to explore the data:

  • Data table: displays the full data table of the selected file.

  • Sample distribution: shows two histograms. The first shows the median across samples (hover over the bars to see which sample each bar corresponds to and its median).

  • PCA: users can run PCA on log-transformed data. If the data is already log-transformed, the user needs to check the associated checkbox. Two panels are available so one can run the PCA on two datasets and compare them.

  • Feature drift: scatterplots can be created to visualize signal drift in the QC samples, as well as to compare the CV values before and after any correction. The left box shows the column labeled "Name" and allows the user to select which features to plot. The plots show the samples and QC samples in the injection order from the metafile, highlighting the QC samples, with a LOESS regression and a 0.95 confidence interval area.

  • Feature viewer: shows boxplots of the selected features with the possibility of changing either the axis or intensities with different log bases.

  • Summary: information on the currently selected data file including the number of missing values, the number of zero values, and the CV in the different classes.


4. Statistical Analysis

The app uses the sequence file to identify the different groups/conditions and time points and allows the user to select which they would like to compare.

Remember to log-transform the data before running any statistical tests!

The user has the option to run the test locally or by exporting the dataset to PolySTest. The local test uses the limma package for all tests. Locally, the following tests can be used:

  1. 2 groups (unpaired) - requires group information in the sequence file. With this test you can compare two independent groups and see if there are any significant differences between them.
  2. 2 groups (paired) - requires group and paired columns in the sequence file. This test compares the means of two related groups (e.g., measurements taken from the same subjects before and after treatment).
  3. 2 groups with time (unpaired) - requires information about group and time in the sequence file. This test compares the means of two independent groups across different time points to identify significant differences in their time-dependent responses.
  4. 2 groups with time (paired) - requires group, time, and paired columns in the sequence file. This test compares the means of two related groups across different time points to identify significant differences in their time-dependent responses.
  5. Compare to reference group - requires group information. This test compares the means of multiple groups to a reference group to identify significant differences between each group and the reference.
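As an illustration of the unpaired two-group comparison (test 1), here is a plain Welch t-statistic in Python. Note that the app itself fits these contrasts with the limma R package, which moderates the variance estimates, so this stdlib version is only a conceptual stand-in; the data below is hypothetical:

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for an unpaired two-group comparison. The app
    itself uses limma (moderated statistics); this is a conceptual
    stand-in using the ordinary sample variances."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / se2 ** 0.5

# Hypothetical log-transformed intensities of one feature in two groups:
g1 = [5.1, 5.3, 5.0, 5.2]
g2 = [6.0, 6.2, 6.1, 5.9]
print(round(welch_t(g1, g2), 2))  # -9.86
```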

5. Output

5.1 Sequence panel

Download the sequence (.csv file) on the bottom left.

5.2 Export panel

Download:

  • the data file (.csv file)
  • the data file for MetaboAnalyst (.csv file)
  • the results of the statistical tests (.xlsx or .csv file)

Export data directly to:

Example workflow with MetaboLink



Source code available here: GitHub - anitamnd/MetaboLink

This version might still have the occasional crash; if you are experiencing trouble, please reach out: Send Email

Molecular Metabolism and Metabolomics | University of Southern Denmark