
Segpy: a streamlined, user-friendly pipeline for variant segregation analysis


Contents

  1. Introduction
  2. Installation
  3. Running Segpy
  4. Contributing
  5. Changelog
  6. License
  7. Acknowledgement


Introduction

Segpy is a streamlined, user-friendly pipeline designed for variant segregation analysis, allowing investigators to compute allelic carrier counts at variant sites across study subjects. The pipeline can be applied both to pedigree-based family cohorts, including single- or multi-family trios, quartets, and extended families, and to population-based case-control cohorts. Given the scale of modern datasets and the computational power required for their analysis, Segpy was designed for seamless integration with users' high-performance computing (HPC) clusters via the SLURM Workload Manager (Segpy SLURM); however, users may also run the pipeline directly from a Linux workstation or any PC with Bash and Singularity installed (Segpy Local).

As input, users must provide a single VCF file describing the genetic variants of all study subjects and a pedigree file describing the familial relationships among those individuals (if applicable) and their disease status. As output, Segpy computes variant carrier counts for affected and unaffected individuals, both within and outside of families, by categorizing wild-type individuals, heterozygous carriers, and homozygous carriers at specific loci. These counts are organized into a comprehensive data frame, with each row representing a single variant and labeled with the sample IDs of the corresponding carriers to facilitate downstream analysis.
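
For reference, a pedigree file typically follows the standard six-column PED convention: family ID, sample ID, paternal ID, maternal ID, sex (1 = male, 2 = female), and phenotype (1 = unaffected, 2 = affected). The snippet below writes a minimal, hypothetical single-family trio pedigree; the sample IDs are placeholders that must match those in the VCF header, and Segpy's documentation should be consulted for the exact columns and header it expects.

# Write a minimal, hypothetical trio pedigree (whitespace-delimited, standard PED column order)
# Columns: family ID, sample ID, paternal ID, maternal ID, sex (1=male, 2=female), phenotype (1=unaffected, 2=affected)
cat > pedigree.ped <<'EOF'
FAM1 father1 0 0 1 1
FAM1 mother1 0 0 2 1
FAM1 proband1 father1 mother1 1 2
EOF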

To accommodate various study designs, Segpy supports the analysis of both pedigree-based and population-based datasets through three distinct, yet highly comparable, analysis tracks:

  1. Single-family
  2. Multi-family
  3. Case-control

Each analysis track consists of four distinct steps, shown in Figure 1. In brief, Step 0 establishes a working directory for the analysis and deposits a modifiable text file for adjusting the analytical parameters. In Step 1, the user-provided VCF file is converted to the Hail MatrixTable format. In Step 2, variant segregation is performed on the MatrixTable using the sample information defined in the user-provided pedigree file. In Step 3, the carrier-count data frame produced in Step 2 is parsed according to user specifications to reduce the computational burden of downstream analyses.

Figure 1. Segpy pipeline workflow.

A containerized version of the Segpy pipeline, which includes the code, libraries, and dependencies required for running the analyses, is publicly available from Zenodo.

Below, we provide a quick start guide for Segpy. Please refer to Segpy's documentation for comprehensive tutorials.


Installation

The following code can be used to download segpy.pip, which includes the code, libraries, and dependencies required for running the analyses:

# Download the Segpy container
curl "https://zenodo.org/records/14503733/files/segpy.pip.zip?download=1" --output segpy.pip.zip

# Unzip the Segpy container
unzip segpy.pip.zip

In addition to segpy.pip, users must install Apptainer/Singularity onto their HPC or local system. If you are using an HPC system, Apptainer is likely already installed, and you will simply need to load the module using the following code:

# Load Apptainer
module load apptainer/1.2.4
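
If you are working on a local machine without environment modules, you can check whether a suitable container runtime is already available before proceeding (a quick sanity check; installation instructions are available from the Apptainer project):

# Check for an existing Apptainer or Singularity installation
apptainer --version || singularity --version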

To ensure that segpy.pip is downloaded properly, run the following command:

bash /path/to/segpy.pip/launch_segpy.sh -h

Which should return the following:

------------------------------------ 
segregation pipeline version 0.0.6 is loaded

------------------- 
Usage:  segpy.pip/launch_segpy.sh [arguments]
        mandatory arguments:
                -d  (--dir)      = Working directory (where all the outputs will be printed) (give full path) 
                -s  (--steps)      = Specify what steps, e.g., 2 to run just step 2, 1-3 (run steps 1 through 3). 'ALL' to run all steps.
                                steps:
                                0: initial setup
                                1: create hail matrix
                                2: run segregation
                                3: final cleanup and formatting

        optional arguments:
                -h  (--help)      = Get the program options and exit.
                --jobmode  = The default for the pipeline is local. If you want to run the pipeline on a SLURM system, use slurm as the argument. 
                --analysis_mode  = The default for the pipeline is analysing a single or multiple families. If you want to run the pipeline on case-control data, use case-control as the argument. 
                --parser             = 'general': general parsing, 'unique': drop multiplicities 
                -v  (--vcf)      = VCF file (mandatory for steps 1-3)
                -p  (--ped)      = PED file (mandatory for steps 1-3)
                -c  (--config)      = config file [CURRENT: "/scratch/fiorini9/segpy.pip/configs/segpy.config.ini"]
                -V  (--verbose)      = verbose output 

 ------------------- 
 For a comprehensive help, visit  https://neurobioinfo.github.io/segpy/latest/ for documentation. 

After successfully downloading segpy.pip, we can proceed with the segregation analysis.


Running Segpy

Prior to initiating the pipeline, users must provide the paths to:

  1. The directory containing segpy.pip (PIPELINE_HOME)
  2. The directory designated for the pipeline's outputs (PWD)
  3. The VCF file (VCF)
  4. The pedigree file (PED)

Use the following code to define these paths:

# Define necessary paths
export PIPELINE_HOME=/path/to/segpy.pip
PWD=/path/to/outfolder
VCF=/path/to/VCF.vcf
PED=/path/to/pedigree.ped

To initiate the Segpy pipeline on your local workstation, use the following code:

bash $PIPELINE_HOME/launch_segpy.sh \
-d $PWD \
--steps 0 \
--analysis_mode single_family

Where analysis_mode is one of single_family, multiple_family, or case-control, depending on the study design.
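
For example, a case-control cohort would be initialized with the same command, swapping only the analysis mode (shown here purely as an illustrative variant of the command above):

# Step 0 for a case-control cohort (illustrative variant of the command above)
bash $PIPELINE_HOME/launch_segpy.sh \
-d $PWD \
--steps 0 \
--analysis_mode case-control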

To initiate the Segpy pipeline on your HPC cluster, include the --jobmode flag as shown in the following code:

bash $PIPELINE_HOME/launch_segpy.sh \
-d $PWD \
--steps 0 \
--jobmode slurm \
--analysis_mode single_family

Once initialized, the full pipeline (Steps 1 through 3) can be run using the following code:

bash $PIPELINE_HOME/launch_segpy.sh \
-d $PWD \
--steps 1-3 \
--vcf $VCF \
--ped $PED \
--parser general

Where parser is one of general or unique, depending on the user's preferred method for filtering the final output data frame.
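
Because steps can be selected individually, Step 3 can in principle be re-run on its own to regenerate the parsed output with the alternative parser setting; the sketch below assumes the outputs of Steps 1 and 2 are already present in the working directory:

# Re-run only the parsing step with the 'unique' parser
# (assumes Steps 1 and 2 have already completed in the same working directory)
bash $PIPELINE_HOME/launch_segpy.sh \
-d $PWD \
--steps 3 \
--vcf $VCF \
--ped $PED \
--parser unique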


Contributing

Any contributions or suggestions for the Segpy pipeline are welcome. For direct contact, please reach out to The Neuro Bioinformatics Core at [email protected].

If you encounter any issues, please report them on the GitHub repository.


Changelog

Every release is documented on the GitHub Releases page.


License

This project is licensed under the MIT License; see the LICENSE.md file for details.


Acknowledgement

The pipeline was developed as part of MNI projects. It was written by Saeid Amiri in association with Dan Spiegelman, Michael Fiorini, Allison Dilliott, and Sali Farhan at the Neuro Bioinformatics Core. Copyright belongs to the MNI BIOINFO CORE.
