Update volcanoplot #6510

Merged
merged 5 commits into from
Dec 9, 2024
2 changes: 1 addition & 1 deletion tools/decontam/macros.xml
@@ -1,6 +1,6 @@
<macros>
<token name="@TOOL_VERSION@">1.22.0</token>
<token name="@VERSION_SUFFIX@">0</token>
<token name="@VERSION_SUFFIX@">1</token>
<token name="@PROFILE@">22.01</token>
<xml name="bio_tools">
<xrefs>
11 changes: 11 additions & 0 deletions tools/volcanoplot/test-data/category.tab
@@ -0,0 +1,11 @@
Gene log2FoldChange pvalue padj category
DOK6 0.51 1.861e-08 0.0003053 Category A
TBX5 -2.129 5.655e-08 0.0004191 Category B
SLC32A1 0.9003 7.664e-08 0.0004191 Category C
IFITM1 -1.687 3.735e-06 0.006809 Category A
NUP93 0.3659 3.373e-06 0.006809 Category B
EMILIN2 1.534 2.976e-06 0.006809 Category C
TPX2 -0.9974 2.097e-06 0.006809 Category A
LAMA2 -1.425 2.39e-06 0.006809 Category B
CAV2 -1.052 3.213e-06 0.006809 Category C
TNN -1.658 8.973e-06 0.01472 Category A
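
Note on the new test table: a minimal sketch of how it loads in R, assuming the file is tab-separated (the category values contain spaces) and mirroring the `read.delim(..., header = TRUE)` call in the generated script. The inlined text stands in for the file, and the `tsv` and `category_counts` names are illustrative only.

```r
# Rows of category.tab inlined as tab-separated text (assumed delimiter)
tsv <- paste(
  "Gene\tlog2FoldChange\tpvalue\tpadj\tcategory",
  "DOK6\t0.51\t1.861e-08\t0.0003053\tCategory A",
  "TBX5\t-2.129\t5.655e-08\t0.0004191\tCategory B",
  "SLC32A1\t0.9003\t7.664e-08\t0.0004191\tCategory C",
  "IFITM1\t-1.687\t3.735e-06\t0.006809\tCategory A",
  "NUP93\t0.3659\t3.373e-06\t0.006809\tCategory B",
  "EMILIN2\t1.534\t2.976e-06\t0.006809\tCategory C",
  "TPX2\t-0.9974\t2.097e-06\t0.006809\tCategory A",
  "LAMA2\t-1.425\t2.39e-06\t0.006809\tCategory B",
  "CAV2\t-1.052\t3.213e-06\t0.006809\tCategory C",
  "TNN\t-1.658\t8.973e-06\t0.01472\tCategory A",
  sep = "\n"
)

# Same reader the tool uses; text= replaces the dataset path
results <- read.delim(text = tsv, header = TRUE)

# Column 5 is the one the new tests pass as shape_col
category_counts <- table(results[[5]])
print(category_counts)
```

Each category appears at least three times, so both the facet and the shape tests have several points per group.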
32 changes: 19 additions & 13 deletions tools/volcanoplot/test-data/out.rscript
@@ -20,16 +20,20 @@ suppressPackageStartupMessages({

# Import data ------------------------------------------------------------

results <- read.delim('/private/var/folders/zn/m_qvr9zd7tq0wdtsbq255f8xypj_zg/T/tmprh4qip75/files/d/2/2/dataset_d2255b46-f0f6-4900-8b9e-bd352e34f303.dat', header = TRUE)
results <- read.delim('/tmp/tmpl4o1f_bf/files/5/2/5/dataset_52538741-d085-42da-817b-263bd4f7cf98.dat', header = TRUE)


# Format data ------------------------------------------------------------

# Create columns from the column numbers specified
results <- results %>% mutate(fdr = .[[4]],
pvalue = .[[3]],
logfc = .[[2]],
labels = .[[1]])
# Create columns from the column numbers specified and use the existing category_symbol column for shapes
results <- results %>% mutate(
fdr = .[[4]],
pvalue = .[[3]],
logfc = .[[2]],
labels = .[[1]],
)

# Check if shape_col is provided

# Get names for legend
down <- unlist(strsplit('Down,Not Sig,Up', split = ","))[1]
@@ -49,7 +53,7 @@ results <- mutate(results, sig = case_when(
# Specify genes to label --------------------------------------------------

# Import file with genes of interest
labelfile <- read.delim('/private/var/folders/zn/m_qvr9zd7tq0wdtsbq255f8xypj_zg/T/tmprh4qip75/files/5/e/5/dataset_5e5b8fb0-bf65-438e-9b5b-03a540d9aa5d.dat', header = TRUE)
labelfile <- read.delim('/tmp/tmpl4o1f_bf/files/5/d/4/dataset_5d401b02-f6af-4ed9-b853-992fd4a4d044.dat', header = TRUE)

# Label the genes of interest in results table
results <- mutate(results, labels = ifelse(labels %in% labelfile[, 1], labels, ""))
@@ -61,15 +65,17 @@ results <- mutate(results, labels = ifelse(labels %in% labelfile[, 1], labels, "
# Open file to save plot as PDF
pdf("volcano_plot.pdf")

# Set up base plot
# Set up base plot with faceting by category_symbol instead of shapes
p <- ggplot(data = results, aes(x = logfc, y = -log10(pvalue))) +
geom_point(aes(colour = sig)) +
scale_color_manual(values = colours) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
legend.key = element_blank())
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
legend.key = element_blank())

# Conditional logic to use either shape or facet based on user selection
p <- p + geom_point(aes(colour = sig)) #only add color

# Add gene labels
p <- p + geom_text_repel(data = filter(results, labels != ""), aes(label = labels),
102 changes: 84 additions & 18 deletions tools/volcanoplot/volcanoplot.xml
@@ -64,11 +64,18 @@ if (is.numeric(first_pvalue)) {

# Format data ------------------------------------------------------------

# Create columns from the column numbers specified
results <- results %>% mutate(fdr = .[[$fdr_col]],
pvalue = .[[$pval_col]],
logfc = .[[$lfc_col]],
labels = .[[$label_col]])
# Create columns from the column numbers specified and use the existing category_symbol column for shapes
results <- results %>% mutate(
fdr = .[[$fdr_col]],
pvalue = .[[$pval_col]],
logfc = .[[$lfc_col]],
labels = .[[$label_col]],
)

# Check if shape_col is provided
#if $shape_col:
results <- results %>% mutate(category_symbol = .[[$shape_col]]) # Use the shape column if it exists
#end if

# Get names for legend
down <- unlist(strsplit('$plot_options.legend_labs', split = ","))[1]
@@ -120,15 +127,25 @@ results <- mutate(results, labels = ifelse(labels %in% toplabels, labels, ""))
# Open file to save plot as PDF
pdf("volcano_plot.pdf")

# Set up base plot
# Set up base plot with faceting by category_symbol instead of shapes
p <- ggplot(data = results, aes(x = logfc, y = -log10(pvalue))) +
geom_point(aes(colour = sig)) +
scale_color_manual(values = colours) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
legend.key = element_blank())
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
legend.key = element_blank())

# Conditional logic to use either shape or facet based on user selection
#if $shape_col:
if ('$shape_or_facet' == 'facet') {
p <- p + facet_wrap(~ category_symbol) # Facet the plot based on category_symbol
} else {
p <- p + geom_point(aes(colour = sig, shape = factor(category_symbol))) # Use shapes for categories
}
#else:
p <- p + geom_point(aes(colour = sig)) #only add color
#end if

#if $labels.label_select != "none"
# Add gene labels
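
Note on the hunk above: as far as this diff shows, `geom_point()` was removed from the base plot and the facet branch adds only `facet_wrap()`, so a faceted plot may end up without a point layer. The sketch below is one possible arrangement, not the merged code. It adds the point layer in every branch. The helper `add_points` and its arguments `has_category` and `display` are hypothetical names that stand in for the `$shape_col` and `$shape_or_facet` template variables.

```r
library(ggplot2)

# Hypothetical helper (not part of the tool): every branch adds a point layer
add_points <- function(p, has_category = FALSE, display = "shape") {
  if (has_category && display == "facet") {
    # Faceting only splits the panels, so the points still have to be drawn
    p + geom_point(aes(colour = sig)) + facet_wrap(~ category_symbol)
  } else if (has_category) {
    # One shape per category
    p + geom_point(aes(colour = sig, shape = factor(category_symbol)))
  } else {
    # No category column selected: colour only
    p + geom_point(aes(colour = sig))
  }
}

# Three rows from the category.tab test data, with an assumed sig column
results <- data.frame(
  logfc = c(0.51, -2.129, 0.9003),
  pvalue = c(1.861e-08, 5.655e-08, 7.664e-08),
  sig = c("Up", "Down", "Up"),
  category_symbol = c("Category A", "Category B", "Category C"),
  stringsAsFactors = FALSE
)

base <- ggplot(data = results, aes(x = logfc, y = -log10(pvalue)))
faceted <- add_points(base, has_category = TRUE, display = "facet")
shaped <- add_points(base, has_category = TRUE, display = "shape")
plain <- add_points(base)
```

Keeping the three cases in one place makes it easier to see that each returned plot carries exactly one point layer.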
@@ -195,6 +212,11 @@ sessionInfo()
<param name="pval_col" type="data_column" data_ref="input" label="P value (raw) column number" />
<param name="lfc_col" type="data_column" data_ref="input" label="Log Fold Change column number" />
<param name="label_col" type="data_column" data_ref="input" label="Labels column number" />
<param name="shape_col" type="data_column" data_ref="input" label="Categories that can be used to plot different shapes or facet (useful if multivariable associations are investigated)" optional="true" />
<param name="shape_or_facet" type="select" label="Display categories by:" help="Choose whether to display categories by faceting the plot or using shape." optional="true">
<option value="facet">Facet</option>
<option value="shape">Shape</option>
</param>
<param name="signif_thresh" type="float" max="1" value="0.05" label="Significance threshold" help="Default: 0.05"/>
<param name="lfc_thresh" type="float" value="0" label="LogFC threshold to colour" help="Default: 0"/>
<conditional name="labels">
@@ -248,6 +270,7 @@ sessionInfo()
</assert_contents>
</output>
</test>

<test expect_num_outputs="1">
<!-- Ensure input labels and plot options work -->
<param name="input" ftype="tabular" value="input.tab"/>
@@ -283,6 +306,45 @@ sessionInfo()
</output>
<output name="rscript" value= "out.rscript" lines_diff="4"/>
</test>

<test expect_num_outputs="1">
<!-- Ensure input labels and plot options work with faceting -->
<param name="input" ftype="tabular" value="category.tab"/>
<param name="fdr_col" value="4" />
<param name="pval_col" value="3" />
<param name="lfc_col" value="2" />
<param name="label_col" value="1" />
<param name="shape_col" value="5" /> <!-- Assuming the shape is in column 5 -->
<param name="lfc_thresh" value="0" />
<param name="label_select" value="file"/>
<param name="label_file" ftype="tabular" value="labels.tab" />
<param name="shape_or_facet" value="facet" /> <!-- Testing the facet option -->
<output name="plot">
<assert_contents>
<has_size value="5007" delta="1000" />
Review comment (Member): We do have better imaging asserts nowadays, if you want to give it a try.

Reply (Contributor Author): You mean compare="image_diff", or are there other undocumented diff options? I think it is not needed here.
</assert_contents>
</output>
</test>

<test expect_num_outputs="1">
<!-- Ensure input labels and plot options work with shape option -->
<param name="input" ftype="tabular" value="category.tab"/>
<param name="fdr_col" value="4" />
<param name="pval_col" value="3" />
<param name="lfc_col" value="2" />
<param name="label_col" value="1" />
<param name="shape_col" value="5" /> <!-- Assuming the shape is in column 5 -->
<param name="lfc_thresh" value="0" />
<param name="label_select" value="file"/>
<param name="label_file" ftype="tabular" value="labels.tab" />
<param name="shape_or_facet" value="shape" /> <!-- Testing the shape option -->
<output name="plot">
<assert_contents>
<has_size value="5533" delta="1000" />
</assert_contents>
</output>
</test>

</tests>
<help><![CDATA[
.. class:: infomark
@@ -291,9 +353,11 @@ sessionInfo()

This tool creates a Volcano plot using ggplot2. Points can be labelled via ggrepel. It was inspired by this Getting Genetics Done `blog post`_.

In statistics, a `Volcano plot`_ is a type of scatter-plot that is used to quickly identify changes in large data sets composed of replicate data. It plots significance versus fold-change on the y and x axes, respectively. These plots are increasingly common in omic experiments such as genomics, proteomics, and metabolomics where one often has a list of many thousands of replicate data points between two conditions and one wishes to quickly identify the most meaningful changes. A volcano plot combines a measure of statistical significance from a statistical test (e.g., a p value from an ANOVA model) with the magnitude of the change, enabling quick visual identification of those data-points (genes, etc.) that display large magnitude changes that are also statistically significant.
In statistics, a `Volcano plot`_ is a type of scatter plot that is used to quickly identify changes in large data sets composed of replicate data. It plots significance versus fold-change on the y and x axes, respectively. These plots are increasingly common in omic experiments such as genomics, proteomics, and metabolomics where one often has a list of many thousands of replicate data points between two conditions and one wishes to quickly identify the most meaningful changes. A volcano plot combines a measure of statistical significance from a statistical test (e.g., a p-value from an ANOVA model) with the magnitude of the change, enabling quick visual identification of those data points (genes, etc.) that display large magnitude changes that are also statistically significant.

A volcano plot is constructed by plotting the negative log of the p value on the y axis (usually base 10). This results in data points with low p values (highly significant) appearing toward the top of the plot. The x axis is the log of the fold change between the two conditions. The log of the fold change is used so that changes in both directions appear equidistant from the center. Plotting points in this way results in two regions of interest in the plot: those points that are found toward the top of the plot that are far to either the left- or right-hand sides. These represent values that display large magnitude fold changes (hence being left or right of center) as well as high statistical significance (hence being toward the top).
A volcano plot is constructed by plotting the negative log of the p-value on the y-axis (usually base 10). This results in data points with low p-values (highly significant) appearing toward the top of the plot. The x-axis is the log of the fold change between the two conditions. The log of the fold change is used so that changes in both directions appear equidistant from the center. Plotting points in this way results in two regions of interest in the plot: those points that are found toward the top of the plot that are far to either the left or right-hand sides. These represent values that display large magnitude fold changes (hence being left or right of center) as well as high statistical significance (hence being toward the top).

Additionally, users can specify a `shape_col`, which allows the differentiation of points in the plot based on categorical variables. The shapes of the points can represent distinct groups or categories within the data, providing another layer of visual information. This feature is particularly useful when comparing multiple groups or conditions in the same plot.

Source: Wikipedia
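
Note on the help text above: a short numeric sketch of the two axes it describes, using three rows from the category.tab test data. It assumes base 10 for the p-value and treats the log fold change column as already log2-scaled, as its header suggests. The variable names are illustrative.

```r
# Values from the first three rows of the category.tab test data
log2fc <- c(0.51, -2.129, 0.9003)
pvalue <- c(1.861e-08, 5.655e-08, 7.664e-08)

# x axis: the log fold change is used as given
x <- log2fc

# y axis: negative log10 of the p-value, so smaller p-values plot higher
y <- -log10(pvalue)

# Why the log is used: a 4-fold increase and a 4-fold decrease
# land at the same distance from zero
up <- log2(4)
down <- log2(1 / 4)

print(data.frame(x = x, y = y))
```

The point with the smallest p-value gets the largest y value, which is why the most significant changes sit toward the top of the plot.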

@@ -303,12 +367,13 @@ Source: Wikipedia

A tabular file containing the columns below (additional columns may be present):

* P value
* FDR / adjusted P value
* Log fold change
* Labels (e.g. Gene symbols or IDs)
* P value
* FDR / adjusted P value
* Log fold change
* Labels (e.g. Gene symbols or IDs)
* Shape (optional; categorical data for point shapes)

All significant points, those meeting the specified FDR and Log Fold Change thresholds, will be coloured, red for upregulated, blue for downregulated. Users can choose to apply labels to the points (such as gene symbols) from the Labels column. To label all significant points, select "Significant" for the **Points to label** option, or to only label the top most significant specify a number under "Only label top most significant". Users can label any points of interest through selecting **Points to label** "Input from file" and providing a tabular labels file. The labels file must contain a header row and have the labels in the first column. These labels must match the labels in the main input file.
All significant points, those meeting the specified FDR and Log Fold Change thresholds, will be coloured: red for upregulated, blue for downregulated. Users can choose to apply labels to the points (such as gene symbols) from the Labels column. To label all significant points, select "Significant" for the **Points to label** option, or to only label the top most significant, specify a number under "Only label top most significant". Users can label any points of interest through selecting **Points to label** "Input from file" and providing a tabular labels file. The labels file must contain a header row and have the labels in the first column. These labels must match the labels in the main input file.
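
Note on the paragraph above: a minimal base R sketch of the colouring and "top most significant" labelling it describes. The comparison operators are an assumption, because the tool's own `case_when` block is not visible in this diff, and `top_n` is a hypothetical name for the number entered under "Only label top most significant".

```r
# Four rows from the category.tab test data
results <- data.frame(
  labels = c("DOK6", "TBX5", "SLC32A1", "TNN"),
  logfc = c(0.51, -2.129, 0.9003, -1.658),
  pvalue = c(1.861e-08, 5.655e-08, 7.664e-08, 8.973e-06),
  fdr = c(0.0003053, 0.0004191, 0.0004191, 0.01472),
  stringsAsFactors = FALSE
)

signif_thresh <- 0.05  # tool default
lfc_thresh <- 0        # tool default
top_n <- 2             # hypothetical: label only the two most significant points

# Assumed rule: significant by FDR and beyond the logFC threshold in either direction
is_sig <- results$fdr < signif_thresh
results$sig <- ifelse(is_sig & results$logfc > lfc_thresh, "Up",
               ifelse(is_sig & results$logfc < -lfc_thresh, "Down", "Not Sig"))

# Keep labels only for the top_n significant points, ranked by raw p-value
significant <- results[results$sig != "Not Sig", ]
toplabels <- head(significant$labels[order(significant$pvalue)], top_n)
results$labels <- ifelse(results$labels %in% toplabels, results$labels, "")

print(results[, c("labels", "sig")])
```

Blanking the unwanted labels, rather than dropping rows, keeps every point on the plot while only the selected ones are annotated.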

**Outputs**

@@ -319,6 +384,7 @@ A PDF containing a Volcano plot like below. The R code can be output through *Ou
.. _Volcano plot: https://en.wikipedia.org/wiki/Volcano_plot_(statistics)
.. _blog post: https://gettinggeneticsdone.blogspot.com/2016/01/


]]></help>
<citations>
<citation type="doi">10.1007/978-3-319-24277-4</citation>