Skip to content

SAIGE GENE: Hands On Practice (09 2021 updated)

weizhouUMICH edited this page Sep 10, 2021 · 1 revision

In this tutorial, we will perform set-based association tests using SAIGE-GENE, which is implemented in the SAIGE library.
We will use Rstudio terminal to run Linux commands

Table of Contents

Skip Get Ready if you've already done Part 2, please switch to the Rstudio terminal

Get Ready

Install the SAIGE R package

  • Please always refer to this github page for the up-to-date installation##

https://github.com/weizhouUMICH/SAIGE#how-to-install-and-run-saige-and-saige-gene

Example data folder

https://github.com/weizhouUMICH/SAIGE/tree/master/extdata

Running SAIGE-GENE for set-based association tests

2 steps in SAIGE-GENE

Step 0: creating a sparse GRM

Note: This step is only needed for region- and gene-based tests (SAIGE-GENE)
The sparse GRM only needs to be created once for each data set, e.g. a biobank, and can be used for all different phenotypes as long as all tested samples are included in it.

#help info
createSparseGRM.R --help

createSparseGRM.R	\
	--plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.v2 \
	--nThreads=4  \
	--outputPrefix=./output/sparseGRM  \
	--numRandomMarkerforSparseKin=2000 \
	--relatednessCutoff=0.125
  • --plinkFile Path to plink file for creating the sparse genetic relationship matrix (GRM)
  • --nThreads number of CPUs to be used. Note: the computation time linearly decreases as the nThreads increases
  • --outputPrefix Path and prefix to the output files
  • --numRandomMarkerforSparseKin number of randomly selected markers to be used to identify related samples for sparse GRM [default=2000]
  • --relatednessCutoff The threshold to treat two samples as unrelated. Any value below this value will be set to 0 in the sparse GRM. 0.125 is the relatedness coefficient for 3rd-degree relatives (0.5^3)

Input file

Genotype file for constructing the genetic relationship matrix
SAIGE takes the PLINK binary file for the genotypes and assumes the file prefix is the same one for .bed, .bim. and .fam

./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bed ./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.bim ./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.fam

Output files

  1. a file storing the sparse GRM ./output/sparseGRM_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseGRM.mtx

The sparse GRM file can be opened using the readMM function in the R library Matrix

  1. a file storing IDs of the samples in the sparse GRM ./output/sparseGRM_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt

The screen output ends with the following text if the job above has been run successfully

screen output

Step 1: fitting the null logistic/linear mixed model

Input files

same as step 1 for single-variant assoc tests and step 0 above

  1. Genotype file for constructing the genetic relationship matrix in the plink format

  2. Phenotype file (contains non-genetic covariates if any, such as gender and age)

output by step 0

  1. a file storing the sparse GRM
    ./output/sparseGRM_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseGRM.mtx

  2. a file storing IDs of the samples in the sparse GRM
    ./output/sparseGRM_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt

Fit the null logistic mixed model for binary traits (--traitType=binary) and then estimate the categorical variance ratio:

step1_fitNULLGLMM.R     \
	      --plinkFile=./input/nfam_100_nindep_0_step1_includeMoreRareVariants_poly.v2 \
        --phenoFile=./input/pheno_1000samples.txt_withdosages_withBothTraitTypes.txt \
        --phenoCol=y_binary \
        --covarColList=x1,x2 \
        --sampleIDColinphenoFile=IID \
        --traitType=binary       \
        --outputPrefix=./output/example_binary \
        --outputPrefix_varRatio=./output/example_binary_cate	\
        --sparseGRMFile=./output/sparseGRM_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseGRM.mtx  \
        --sparseGRMSampleIDFile=./output/sparseGRM_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt \
        --nThreads=1 \
        --IsSparseKin=TRUE  \
        --isCateVarianceRatio=TRUE  \
        --IsOverwriteVarianceRatioFile=TRUE
  • --plinkFile Path to plink file for creating the genetic relationship matrix (GRM)
  • --phenoFile Path to the phenotype file
  • --phenoCol Column name for phenotype to be tested in the phenotype file
  • --covarColList List of covariates (comma separated), matching column names in the phenotype file
  • --sampleIDColinphenoFile Column name of sample IDs in the phenotype file
  • --traitType binary/quantitative
  • --outputPrefix Path and prefix of the output files
  • --outputPrefix_varRatio Path and prefix of the output the variance ratio file. if not specified, it will be set as the outputPrefix
  • --sparseGRMFile Path to the pre-calculated sparse GRM file in Step 0.
  • --sparseGRMSampleIDFile Path to the sample ID file for the pre-calculated sparse GRM in Step 0
  • --nThreads number of CPUs to be used for Step 1. Note: the computation time linearly decreases as the nThreads increases
  • --IsSparseKin whether to use the sparse pre-computed sparse GRM to estimate the variance ratio. Needs to be TRUE for gene-based tests for rare variants
  • --isCateVarianceRatio Whether to estimate variance ratio based on different MAC categories. Needs to be TRUE for gene-based tests for rare variants
  • --IsOverwriteVarianceRatioFile Whether to overwrite the variance ratio file if the file already exists

For more detailed parameter list. You may run

step1_fitNULLGLMM.R --help

The screen output ends with the following text if the job above has been run successfully screen output

Output files

same as the step 1 output by SAIGE for single-variant association tests

  1. model file containing the null model fitting results in an R list named modglmm (input for step 2)
    ./output/example_binary.rda

  2. variance ratio file(input for step 2)
    ./output/example_binary_cate.varianceRatio.txt.varianceRatio.txt

  3. association result file for the subset of randomly selected markers for estimate the variance ratio (temp file, won't be used next)
    ./output/example_binary_30markers.SAIGE.results.txt

specific to SAIGE-GENE

  1. sparse Sigma file (input for step 2)

./output/example_binary_cate.varianceRatio.txt.varianceRatio.txt_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseSigma.mtx

NOTE: the file contains the sparse Sigma matrix, which is NOT the same as the sparse GRM. The sparse Sigma matrix is computed based on the sparse GRM and the results of step 1, so it is phenotype-specific.

Step 2: performing the set-based association tests

Input files

  • same as [step 1 input by SAIGE for single-variant association tests
  1. Dosage/genotype file containing dosages or genotypes of markers to test
    SAIGE supports different formats for dosages: VCF, BCF, BGEN and SAV.We will use VCF in the example today
  • VCF containing dosages and the VCF index file
zcat ./input/seedNumLow_1_seedNumHigh_200.DS.vcf.gz | less -S
# press q to quit the file view

index file: ./input/seedNumLow_1_seedNumHigh_200.DS.vcf.gz.csi

  1. Model file from step 1 ./output/example_binary.rda

  2. Variance ratio file from step 1, but has categorical variance ratios ./output/example_binary_cate.varianceRatio.txt

specific to SAIGE-GENE

  1. Group file
less -S ./input/seedNumLow_1_seedNumHigh_200.grp.file
#press q to quit the file view

group file

Each line is for one gene/set of variants. The first element is for gene/set name. The rest of the line is for variant ids included in this gene/set. For vcf/sav, the genetic marker ids are in the format chr:pos_ref/alt. For bgen, the genetic marker ids should match the ids in the bgen file. Each element in the line is separated by tab.

  1. sparse Sigma file ./output/example_binary_cate.varianceRatio.txt.varianceRatio.txt_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseSigma.mtx

Perform set-based association tests

step2_SPAtests.R \
        --vcfFile=./input/seedNumLow_1_seedNumHigh_200.DS.vcf.gz \
        --vcfFileIndex=./input/seedNumLow_1_seedNumHigh_200.DS.vcf.gz.csi \
        --vcfField=DS \
        --chrom=chr1 \
	--minMAF=0 \
        --minMAC=0.5 \
        --maxMAFforGroupTest=0.01       \
        --GMMATmodelFile=./output/example_binary.rda \
        --varianceRatioFile=./output/example_binary_cate.varianceRatio.txt.varianceRatio.txt \
        --SAIGEOutputFile=./output/example_binary.SAIGE.gene.txt  \
        --numLinesOutput=1 \
        --groupFile=./input/seedNumLow_1_seedNumHigh_200.grp_50sets.file    \
        --sparseSigmaFile=./output/example_binary_cate.varianceRatio.txt.varianceRatio.txt_relatednessCutoff_0.125_2000_randomMarkersUsed.sparseSigma.mtx \
        --IsSingleVarinGroupTest=TRUE
  • --vcfFile Path to vcf file
  • --vcfFileIndex Path to vcf index file. Indexed by tabix. Path to index for vcf file by tabix, .tbi file by tabix -c -p vcf file.vcf.gz
  • --vcfField DS for dosages or GT for genotypes
  • --chrom string chromosome in vcf to be tested. The string needs to exactly match the chromosome string in the vcf/sav file. For example, '1' does not match 'chr1'
  • --minMAF Minimum minor allele frequency for markers to be tested
  • --minMAC Minimum minor allele count for markers to be tested. The higher threshold between minMAC and minMAF will be used
  • --maxMAFforGroupTest Maximum MAF for markers tested in group test
  • --GMMATmodelFile Path to the input file containing the glmm model, which is output from step 1
  • --varianceRatioFile Path to the input file containing the variance ratio, which is output from the step 1
  • --SAIGEOutputFile Path to the output file containing assoc test results
  • --numLinesOutput Number of sets to be output to the file each time
  • --groupFile Path to the file containing the group information for set-based tests. Each line is for one set of variants.
  • --sparseSigmaFile Path to the file containing the sparse Sigma output by step 1
  • --IsSingleVarinGroupTest Whether to perform single-variant assoc tests for genetic markers included in the set-based tests.

For detailed parameter list. You may run

step2_SPAtests.R --help

The screen output ends with the following text if the job above has been run successfully

screen output

Output files

  1. A file with region- or gene-based association test results
less -S ./output/example_binary.SAIGE.gene.txt
#press q to quit the file view
  • The header of the output file: Gene Pvalue Nmarker_MACCate_1 Nmarker_MACCate_2 Nmarker_MACCate_3 Nmarker_MACCate_4 Nmarker_MACCate_5 Nmarker_MACCate_6 Nmarker_MACCate_7 Nmarker_MACCate_8 markerIDs markerAFs Pvalue_Burden Pvalue_SKAT

  • Gene : Gene name (as provided in the group file)

  • Pvalue: p-value of SKAT-O test

  • Nmarker_MACCate_n: number of markers in the gene falling in the nth MAC category (The MAC category is corresponding to the one used for the variance ratio estimation). For example, by default, Nmarker_MACCate_1 is the number of singletons

  • markerIDs: IDs for variants included in the test

  • markerAFs: allele frequencies for variants included in the test

  • Pvalue_Burden: p-value of Burden test

  • Pvalue_SKAT: p-value of SKAT test

  1. A file with single-variant association test results for genetic variants included in the gene-based tests (if --IsSingleVarinGroupTest=TRUE)
less -S ./output/example_binary.SAIGE.gene.txt_single
#press q to quit the file view

Same as the output by SAIGE for single-variant association tests

Notes and other useful options for SAIGE-GENE

Step 1

  • IsSparseKin=TRUE must be specified for SAIGE-GENE
  • The step 1 model result from the single-variant assoc test can be re-used, except that for gene-based tests, variance ratios for multiple MAC categories and a sparse GRM need to be used (IsSparseKin=TRUE).
  • with IsSparseKin=TRUE, no sparseGRMFile and sparseGRMSampleIDFile are specified (step 0 has not been done beforehand), a sparse GRM will be created based on the relatednessCutoff in this step.
  • with IsSparseKin=TRUE, sparseGRMFile and sparseGRMSampleIDFile can be used to specify a pre-calcuated sparse GRM and the sample ids for the sparse GRM (output from step 0). Tested samples would be a subset of samples in the pre-calcuated GRM.
  • To activate the variance ratio estimation based multiple MAC categories, --isCateVarianceRatio=TRUE
  • cateVarRatioMinMACVecExclude and cateVarRatioMaxMACVecInclude are used to specify the MAC categories
  • by default --cateVarRatioMinMACVecExclude=0.5,1.5,2.5,3.5,4.5,5.5,10.5,20.5 --cateVarRatioMaxMACVecInclude=1.5,2.5,3.5,4.5,5.5,10.5,20.5 corresponding to 0.5 < MAC <= 1.5 1.5 < MAC <= 2.5 2.5 < MAC <= 3.5 3.5 < MAC <= 4.5 4.5 < MAC <= 5.5 5.5 < MAC <= 10.5 10.5 < MAC <= 20.5 20.5 < MAC

Step 2

  • The computation will be ~5 times slower using VCF input than using BGEN and SAV input.
  • IsSingleVarinGroupTest=TRUE is to perform single-variant association tests as well for markers included in the gene-based tests
  • --maxMAFforGroupTest can be used to specify the maximum MAF of genetic variants to be included in each gene or varaint set
  • When only testing very rare variants, e.g. maxMAFforGroupTest=0.001 or maxMAFforGroupTest=0.0001, inflation has been oberved for SKAT and SKAT-O tests. Use --method_to_CollapseUltraRare=absence_or_presence can solve the inflation issue and also speed up Step 2, especially when analyzing sequencing data which contains thousands of very reare markers in each gene or variant set.
  • Specify customized weights for markers in the gene- or region-based tests
  • weightsIncludeinGroupFile logical. Whether to specify customized weight for makers in gene- or region-based tests. If TRUE, weights are included in the group file. For vcf/sav, the genetic marker ids and weights are in the format chr:pos_ref/alt;weight. For bgen, the genetic marker ids should match the ids in the bgen filE, e.g. SNPID;weight. Each element in the line is seperated by tab. By default, FALSE
  • weights_for_G2_cond vector of float. weights for conditioning markers for gene- or region-based tests. The length equals to the number of conditioning markers, delimited by comma.
  • Same as the single-variant association tests, conditional analysis based summary stats can be performed (--condition) can be performed in step 2 with dosage/genotype input file formats VCF, BGEN and SAV.
  • Note that the order of variants in the group file needs to be consistent to variants' order in the dosage file