plantinformatics/lamington

LAMINGTON

PCA & Core Diversity Set Explorer

The primary objective of this tool is to facilitate the creation of core diversity sets from a genetically characterised population. Lamington wraps Core Hunter, a widely used tool for defining core subsets, in a graphical user interface (GUI) and extends it with genotype and sample filtering, PCA plot visualisation, and outlier detection.

image

The initial development of Lamington was carried out by La Trobe University student Muhammad Tahaa Suhail as part of a Work-based Learning placement with Agriculture Victoria in the context of the Australian Grains Genebank Strategic Partnership, a $30m 5-year joint investment between the Victorian State Government and Grains Research and Development Corporation (GRDC) that aims to unlock the genetic potential of plant genetic resources for the benefit of Australian grain growers.

Features

  • Developed in R: A widely used statistical programming language within the research and breeding community.
  • Flexibility of Deployment: Designed for both local and server deployment, accommodating the analysis of large datasets.
  • Accessibility: The source code is open source and available on GitHub under a GPL license. Docker builds are provided for easy deployment on any operating system.
  • Visualisation Functions: Users can interactively plot statistics such as minor allele frequency (MAF) and call rate, defining suitable cutoffs.
  • PCA Plotting Functions: Users can generate PCA plots for the population and colour data points based on sample information from an associated spreadsheet.
  • Core Hunter Functions: Exposes the main functions and parameters of Core Hunter through a graphical interface.
  • Detection of Outliers: Supports a cyclical workflow for outlier detection. Users can identify outliers, remove them, and rerun the analysis without outliers.
  • Samples to Include/Exclude in Core Set: Users have the ability to specify samples that should always be included or excluded in the core set, facilitating customization based on existing core sets or sample availability.

Installation

To run Lamington on your local machine, first download and install R and, optionally, RStudio.

Once R is installed, run the following commands to configure the environment:

Install the dependencies

# CRAN packages (the package name rJava is case-sensitive)
install.packages(c('shiny', 'ggplot2', 'shinyFiles', 'dplyr', 'rJava', 'corehunter',
                'DT', 'esquisse', 'scatterD3', 'shinycssloaders', 'shinythemes'))

# SNPRelate is distributed through Bioconductor rather than CRAN
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("SNPRelate")

You may now run the shiny app with just one command in R:

# This command runs Lamington from GitHub without cloning the repository first
shiny::runGitHub("lamington", "plantinformatics",ref="main",subdir = "R")

Or

# First clone the repository with git (run this in a terminal, not in R).
git clone git@github.com:plantinformatics/lamington.git

# Then, in R:
setwd("~/lamington") # change to match where you cloned this repo
shiny::runApp('R') # runs the app

Docker

Lamington is very easy to install and deploy in a Docker container.

A Docker container for Lamington can be built using the Dockerfile. This file contains all the instructions required to assemble the container image, including the installation of necessary R packages, system dependencies, and the Lamington software itself.

The Docker image is configured to expose port 3838 by default. To change this, edit the EXPOSE instruction in the Dockerfile. Once you have configured the Dockerfile as needed, build the image using the docker build command.

# Clone lamington if you have not already done so
git clone git@github.com:plantinformatics/lamington.git
cd lamington
docker build -t lamington --progress=plain .

Building the Lamington Docker image will download all required dependencies. This process may take 10 to 15 minutes.

After the build is complete, run the Docker image. You can map the container's exposed port (defined in the Dockerfile) to a port on your host machine. For example, to map host port 3838 to the container's port 3838, use the appropriate docker run command.

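To enable data access, mount the VCF directory (for direct uploads) and the GDS directory (containing saved GDS files) to the corresponding locations within the container. Replace path2localdirectory in the command below with the absolute path to the parent directory on the host, since Docker bind mounts require absolute host paths.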

docker run -d -v 'path2localdirectory/VCFs:/root/VCFs' -v 'path2localdirectory/GDS:/root/GDS' -p 127.0.0.1:3838:3838 lamington

Verify the deployment by navigating to your server address in your preferred browser.

127.0.0.1:3838

Workflow

1. Importing Metadata/Passport Data:

Metadata/passport data can be imported into Lamington on the 'Add POP Data' tab.

image

As an example, the passport data extracted using Genolink for the AGG Chickpea - Release 241203 is imported into Lamington as shown in the figure below.

image

2. Genotype Data Ingestion:

In the 'Convert VCF File' tab, provide your genotype data in VCF (Variant Call Format).

You can download the following VCF file from the AGG Chickpea - Release 241203 as an example for testing Lamington.

You then have two options:

  • Upload directly: Select the VCF file from your local computer.
  • Choose from the server: Browse and select the VCF file from a pre-populated list on the server.

image

  • Enter a GDS file name without the .gds extension, which is appended automatically. The 'Convert and Display' button then appears; click it to convert the input VCF file to GDS.
  • Once the VCF file has been converted and given a meaningful GDS file name, the data can be loaded on the 'Select GDS file' tab.
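The conversion step can be approximated outside the GUI. The sketch below assumes the conversion is equivalent to SNPRelate's snpgdsVCF2GDS; SNPRelate is a listed dependency, but the exact call and options Lamington uses are not stated in this README. The helper names gds_path and convert_vcf_to_gds, the GDS output directory, and the biallelic-only method are illustrative assumptions.

```r
# Hypothetical helper: append ".gds" to a user-supplied name,
# mirroring the behaviour described for the GUI.
gds_path <- function(name, dir = "GDS") {
  file.path(dir, paste0(name, ".gds"))
}

# Hypothetical helper: convert a VCF file to GDS with SNPRelate.
# Requires the SNPRelate package; nothing runs until the function is called.
convert_vcf_to_gds <- function(vcf_file, name, dir = "GDS") {
  out <- gds_path(name, dir)
  # method = "biallelic.only" keeps biallelic SNPs only (an assumption here)
  SNPRelate::snpgdsVCF2GDS(vcf_file, out, method = "biallelic.only")
  SNPRelate::snpgdsSummary(out) # print sample and SNP counts
  invisible(out)
}

print(gds_path("chickpea_241203"))
```

Calling convert_vcf_to_gds("my_data.vcf", "chickpea_241203") would then write the GDS file under the GDS directory, where "my_data.vcf" stands in for your own file.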

image

3. Filtering

Visualisation: Using the sliders, users can see how changing the Missing Rate and MAF cutoffs affects the histogram on the right.

image

In addition, if the metadata/passport data has been uploaded, users can subset and compare different sets within the metadata/passport data.
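To make the two statistics concrete, here is a minimal base-R sketch of how per-SNP MAF and missing rate can be computed. It assumes a genotype matrix with samples in rows, SNPs in columns, and genotypes coded as 0, 1 or 2 copies of the alternate allele with NA for missing calls. The toy data and the snp_stats helper are illustrative and are not taken from Lamington's source.

```r
# Toy genotype matrix: 4 samples x 4 SNPs, NA marks a missing call.
geno <- matrix(
  c(0, 1, 2, NA,
    0, 0, 1, 1,
    2, 2, 2, NA,
    0, 1, NA, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("S", 1:4), paste0("snp", 1:4))
)

# Hypothetical helper: per-SNP minor allele frequency and missing rate.
snp_stats <- function(g) {
  alt_freq <- colMeans(g, na.rm = TRUE) / 2 # alternate allele frequency
  data.frame(
    snp = colnames(g),
    maf = unname(pmin(alt_freq, 1 - alt_freq)),
    missing_rate = unname(colMeans(is.na(g)))
  )
}

stats <- snp_stats(geno)
print(stats)

# Number of SNPs that would pass example cutoffs
n_pass <- sum(stats$maf >= 0.2 & stats$missing_rate <= 0.25)

# A histogram like the one in the GUI, drawn only in interactive sessions
if (interactive()) hist(stats$maf, main = "MAF", xlab = "Minor allele frequency")
```

Moving a slider in the GUI corresponds to changing the cutoff values in the n_pass line and watching how many SNPs remain.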

image

4. Genotype Data Filtering:

Before analysis, you can use the genotype data as is, or filter it under the 'Genotype Matrix' tab based on the following:

  • Minor Allele Frequency (MAF).
  • Call rate (CR).
  • Linkage Disequilibrium (LD) pruning.
  • Sample selection, by providing a list of sample IDs or selecting from the metadata/passport data.
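The filters above can be sketched in base R as follows. This assumes a 0/1/2-coded genotype matrix with samples in rows and NA for missing calls, and it takes call rate to be one minus the missing rate. The filter_genotypes helper and its default cutoffs are hypothetical. LD pruning is left out because it needs marker positions; SNPRelate provides snpgdsLDpruning for that purpose.

```r
# Toy genotype matrix: 4 samples x 4 SNPs, NA marks a missing call.
geno <- matrix(
  c(0, 1, 2, NA,
    0, 0, 1, 1,
    2, 2, 2, NA,
    0, 1, NA, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("S", 1:4), paste0("snp", 1:4))
)

# Hypothetical helper: subset samples first, then filter SNPs on MAF and call rate.
filter_genotypes <- function(g, maf_min = 0.05, call_rate_min = 0.9, samples = NULL) {
  if (!is.null(samples)) {
    g <- g[rownames(g) %in% samples, , drop = FALSE]
  }
  alt_freq <- colMeans(g, na.rm = TRUE) / 2
  maf <- pmin(alt_freq, 1 - alt_freq)
  call_rate <- colMeans(!is.na(g))
  # SNPs with no calls at all give NaN frequencies and are dropped
  keep <- !is.na(maf) & maf >= maf_min & call_rate >= call_rate_min
  g[, keep, drop = FALSE]
}

# All samples, example cutoffs
filtered <- filter_genotypes(geno, maf_min = 0.2, call_rate_min = 0.75)

# Same cutoffs, restricted to a list of sample IDs
subset_filtered <- filter_genotypes(geno, maf_min = 0.2, call_rate_min = 0.75,
                                    samples = c("S1", "S2", "S3"))
print(dim(filtered))
print(dim(subset_filtered))
```

Subsetting samples before computing the statistics matters, because MAF and call rate depend on which samples are in the set.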

image

5. PCA Calculation:

After defining the set of SNPs, the Principal Component Analysis (PCA) can be performed under the PCA tab.

image

To gain deeper insights from the PCA results, you can include metadata/passport information.

This allows you to:

  • Explore population-specific patterns.
  • Visualise relationships between population groups.
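As an illustration of this step, the sketch below runs a PCA on simulated genotypes and joins the scores to passport data so that points can be coloured by population. It uses base R's prcomp as a stand-in; the README does not state which PCA routine Lamington calls, so this may not match its results exactly. The simulated data, the impute_mean helper, and the column names are assumptions.

```r
set.seed(1)

# Simulate two populations with different allele frequencies (toy data).
n_per_pop <- 10
n_snp <- 50
sim_pop <- function(freq, n) t(replicate(n, rbinom(length(freq), 2, freq)))
geno <- rbind(sim_pop(runif(n_snp, 0.1, 0.9), n_per_pop),
              sim_pop(runif(n_snp, 0.1, 0.9), n_per_pop))
rownames(geno) <- paste0("S", seq_len(nrow(geno)))
geno[1, 1] <- NA # one missing call, to show the imputation step

passport <- data.frame(sample = rownames(geno),
                       population = rep(c("A", "B"), each = n_per_pop))

# Hypothetical helper: replace missing calls with the SNP mean,
# because prcomp does not accept NA values.
impute_mean <- function(g) {
  for (j in seq_len(ncol(g))) {
    g[is.na(g[, j]), j] <- mean(g[, j], na.rm = TRUE)
  }
  g
}

pca <- prcomp(impute_mean(geno), center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Join the first two components with the passport data for plotting.
scores <- data.frame(sample = rownames(geno), pca$x[, 1:2])
pca_df <- merge(scores, passport, by = "sample")
print(head(pca_df))

if (interactive()) {
  plot(pca_df$PC1, pca_df$PC2, col = factor(pca_df$population),
       xlab = "PC1", ylab = "PC2")
}
```

The merged data frame has one row per sample, which is the shape needed to colour or subset points by any passport column.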

image

6. Calculation of Core Sets:

Lamington utilises the CoreHunter package to compute core sets, smaller representative subsets of your data. Lamington provides access to the main CoreHunter options, enabling you to define multiple core sets with varying sizes. These core sets are then integrated into the PCA data frame for visualisation and analysis.
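To illustrate what a core set with forced inclusions and exclusions looks like, the sketch below uses a simple greedy maximin selection on a distance matrix. This is a deliberately simplified stand-in and is not Core Hunter's algorithm; in Lamington the selection is done by the corehunter package, whose entry point is sampleCore. The greedy_core helper, its arguments, and the toy coordinates are hypothetical.

```r
# Hypothetical helper: greedily pick samples that are far from those already chosen.
# d      : square distance matrix with sample IDs as row and column names
# size   : number of samples wanted in the core set
# always : IDs that must be in the core set
# never  : IDs that must not be in the core set
greedy_core <- function(d, size, always = character(0), never = character(0)) {
  d <- as.matrix(d)
  candidates <- setdiff(rownames(d), c(always, never))
  selected <- always
  if (length(selected) == 0) {
    # Seed with the candidate that has the largest mean distance to the others
    first <- candidates[which.max(rowMeans(d[candidates, candidates, drop = FALSE]))]
    selected <- first
    candidates <- setdiff(candidates, first)
  }
  while (length(selected) < size && length(candidates) > 0) {
    # Distance from each candidate to its nearest already-selected sample
    nearest <- apply(d[candidates, selected, drop = FALSE], 1, min)
    pick <- candidates[which.max(nearest)]
    selected <- c(selected, pick)
    candidates <- setdiff(candidates, pick)
  }
  selected
}

# Toy example: six samples placed on a line
coords <- matrix(c(0, 1, 4, 10, 12, 20), ncol = 1,
                 dimnames = list(paste0("S", 1:6), "x"))
d <- as.matrix(dist(coords))

core <- greedy_core(d, size = 3, always = "S2", never = "S6")
print(core)
```

In practice the distance matrix would be derived from the filtered genotype data rather than from toy coordinates, and several core sets of different sizes could be produced by varying size.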

image

7. PCA Plot Visualization and Outlier Definition:

  • Core sets from step 6 are visualised on a PCA plot based on the PCA components from step 5. The plot is interactive, allowing zooming and sample selection.
  • Sample names are displayed on cursor hover.
  • You can select and remove outliers and rerun steps 5-7.
  • You can add samples to an exclusion list and rerun steps 5-7.
  • Core sets are exportable as a CSV file containing the list of samples, PCA and population data and the core set.
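The export and the outlier loop can be sketched as follows. This is a minimal illustration with made-up values: the column names, the in_core flag, and the output file name are assumptions, and the actual layout of Lamington's CSV export may differ.

```r
# Toy PCA table with population data (values are made up).
pca_df <- data.frame(
  sample = paste0("S", 1:5),
  PC1 = c(-1.2, -0.8, 0.1, 0.9, 6.5),
  PC2 = c(0.3, -0.4, 0.2, -0.1, 5.8),
  population = c("A", "A", "B", "B", "B")
)
core_set <- c("S1", "S4") # samples chosen in step 6
outliers <- c("S5")       # e.g. picked by hand on the interactive plot

# Flag core set membership and export samples, PCA and population data together.
pca_df$in_core <- pca_df$sample %in% core_set
out_file <- file.path(tempdir(), "core_set.csv")
write.csv(pca_df, out_file, row.names = FALSE)

# Samples carried into the next round of steps 5-7, with the outliers removed.
keep_samples <- setdiff(pca_df$sample, outliers)
print(keep_samples)
```

Because PCA axes depend on the samples included, removing an outlier changes the components themselves, which is why the PCA and core set steps are rerun rather than just dropping rows from the table.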

image

image

8. Final Plot

Users can use the add-in to create the final plot and customise it according to their needs, selecting Tab2 from the list of data frames.

Required Dependencies

  • shiny
  • corehunter
  • SNPRelate
  • ggplot2
  • shinyFiles
  • dplyr
  • DT
  • esquisse
  • scatterD3
  • shinycssloaders
  • shinythemes

License

This code is licensed under the GPLv3. Please see the file LICENSE.txt for information.