add sexcheck sample qc

immunogenomics · Nov 15, 2022 · 2f45645
1 parent 9478020
commit 2f45645

Showing 3 changed files with 106 additions and 6 deletions.
diff --git a/.ipynb_checkpoints/tutorial_HLAQCImputation-checkpoint.ipynb b/.ipynb_checkpoints/tutorial_HLAQCImputation-checkpoint.ipynb
@@ -233,7 +233,57 @@
     "```\n",
     "\n",
     "\n",
-    "#### 4.3 Removal of Highly Genetically Related Samples\n",
+    "#### 4.3 Removal of Samples with Discordant Sex Information\n",
+    "\n",
+    "> Sample sex information is relevant for sex-specific analyses and for imputation of the sex chromosomes. An inaccurate sex assignment (e.g. a male recorded as female) may be a sign of a sample swap or another sample quality issue. The homozygosity rate across the non-pseudoautosomal region (non-PAR) of the X chromosome is used to infer each sample's sex.\n",
+    "\n",
+    "##### Calculating X Chromosome Homozygosity per Sample:\n",
+    "\n",
+    ">**While our genotype data in this repository only includes mock genotype data on chromosome 6 due to size restrictions in GitHub, please assume that the data {bim,bed,fam} files are the plink genotype files with all chromosomes, including chromosome X.**\n",
+    "\n",
+    "``` bash\n",
+    "#split off X chromosome PAR\n",
+    "plink --bfile data --split-x hg19 --make-bed --out data.par_split\n",
+    "#output sex check\n",
+    "plink --bfile data.par_split --check-sex\n",
+    "```\n",
+    "\n",
+    "> By default, plink's `--check-sex` calls a sample female if its F coefficient is < 0.2 and male if it is > 0.8; samples in between are left unassigned. For better accuracy, these cutoffs can be set manually based on the observed distribution of the F coefficient, as visualized below.\n",
+    "\n",
+    "\n",
+    "\n",
+    "##### Implementing X Chromosome Homozygosity Thresholds:\n",
+    "\n",
+    "In R:\n",
+    "\n",
+    "```R\n",
+    "library(ggplot2)\n",
+    "name = 'plink.sexcheck'\n",
+    "dat = read.table(name, header = TRUE)\n",
+    "\n",
+    "png(paste0(name, '.png'))\n",
+    "ggplot(dat)+\n",
+    "    geom_histogram(aes(x = F), bins = 30)+\n",
+    "    theme_classic(20)+\n",
+    "    labs(x = 'F coefficient', y = 'Frequency', title = 'X Chromosome Homozygosity')+\n",
+    "    geom_vline(xintercept = c(0.4, 0.9), color = \"red\", linetype = \"dashed\")\n",
+    "dev.off()\n",
+    "```\n",
+    "\n",
+    "![Image](./images/plink.sexcheck.png)\n",
+    "\n",
+    "> We recommend removal of any samples where pedigree-defined sex is discordant with genotype-inferred sex assignment, as this suggests the sample may have been labeled incorrectly.\n",
+    "\n",
+    "``` bash\n",
+    "#update sex in fam file and re-output sex check\n",
+    "plink --bfile data.par_split --impute-sex 0.4 0.85 --make-bed --out data.par_split.impute_sex\n",
+    "#removal of samples with discordant sex information\n",
+    "awk ' ($5==\"PROBLEM\") { print $1,$2 } ' data.par_split.impute_sex.sexcheck > discordantsex.sample.txt\n",
+    "plink --bfile data.par_split.impute_sex --remove discordantsex.sample.txt --make-bed --out data.par_split.impute_sex.concordant\n",
+    "```\n",
+    "\n",
+    "\n",
+    "#### 4.4 Removal of Highly Genetically Related Samples\n",
     "\n",
     "> A required feature of association studies with cohorts sampled from a general population is the removal of genetically identical individuals in the cohort. Genetically identical samples are detrimental to the overall power of further association tests, as the individual is overrepresented in the study cohort. Additionally, genetically identical samples can be the product of poor sample handling and experimental error. Below we outline steps to remove highly related individuals with plink. \n",
     "\n",
@@ -625,7 +675,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python 3 (ipykernel)",
    "language": "python",
    "name": "python3"
   },
@@ -639,7 +689,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.8"
+   "version": "3.10.4"
   }
  },
  "nbformat": 4,

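The thresholding and discordance logic that section 4.3 describes can be sketched in a few lines of Python, which may help when checking how a given pair of cutoffs treats borderline samples. This is only an illustrative sketch, not plink's implementation: the function names `infer_sex` and `discordant_samples` and the mock table are hypothetical, the column layout (FID, IID, PEDSEX, SNPSEX, STATUS, F) is assumed to match plink's `.sexcheck` output, and the cutoffs 0.4 and 0.85 are simply the ones passed to `--impute-sex` above.

```python
# Hypothetical mock of a plink .sexcheck table (not real data).
MOCK_SEXCHECK = """\
FID  IID PEDSEX SNPSEX STATUS  F
fam1 s1  1      1      OK      0.98
fam2 s2  2      2      OK      0.05
fam3 s3  2      1      PROBLEM 0.97
fam4 s4  1      0      PROBLEM 0.60
"""


def infer_sex(f_coef, female_max=0.4, male_min=0.85):
    """Return 2 (female), 1 (male) or 0 (ambiguous), using plink's sex codes."""
    if f_coef < female_max:
        return 2
    if f_coef > male_min:
        return 1
    return 0  # falls between the two cutoffs (or F is NaN)


def discordant_samples(sexcheck_text, female_max=0.4, male_min=0.85):
    """List (FID, IID) pairs whose recorded sex disagrees with the inferred sex.

    A sample is flagged when the inferred sex is ambiguous or differs from
    PEDSEX; this is assumed to mirror the PROBLEM status in the .sexcheck file.
    """
    lines = sexcheck_text.strip().splitlines()
    header = lines[0].split()
    flagged = []
    for line in lines[1:]:
        rec = dict(zip(header, line.split()))
        snpsex = infer_sex(float(rec["F"]), female_max, male_min)
        pedsex = int(rec["PEDSEX"])
        if snpsex == 0 or pedsex != snpsex:
            flagged.append((rec["FID"], rec["IID"]))
    return flagged


if __name__ == "__main__":
    for fid, iid in discordant_samples(MOCK_SEXCHECK):
        print(fid, iid)
```

Re-running `discordant_samples` with different `female_max` and `male_min` values shows which samples change status before committing to a pair of cutoffs in the plink command.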
diff --git a/images/plink.sexcheck.png b/images/plink.sexcheck.png
diff --git a/tutorial_HLAQCImputation.ipynb b/tutorial_HLAQCImputation.ipynb
@@ -233,7 +233,57 @@
     "```\n",
     "\n",
     "\n",
-    "#### 4.3 Removal of Highly Genetically Related Samples\n",
+    "#### 4.3 Removal of Samples with Discordant Sex Information\n",
+    "\n",
+    "> Sample sex information is relevant for sex-specific analyses and for imputation of the sex chromosomes. An inaccurate sex assignment (e.g. a male recorded as female) may be a sign of a sample swap or another sample quality issue. The homozygosity rate across the non-pseudoautosomal region (non-PAR) of the X chromosome is used to infer each sample's sex.\n",
+    "\n",
+    "##### Calculating X Chromosome Homozygosity per Sample:\n",
+    "\n",
+    ">**While our genotype data in this repository only includes mock genotype data on chromosome 6 due to size restrictions in GitHub, please assume that the data {bim,bed,fam} files are the plink genotype files with all chromosomes, including chromosome X.**\n",
+    "\n",
+    "``` bash\n",
+    "#split off X chromosome PAR\n",
+    "plink --bfile data --split-x hg19 --make-bed --out data.par_split\n",
+    "#output sex check\n",
+    "plink --bfile data.par_split --check-sex\n",
+    "```\n",
+    "\n",
+    "> By default, plink's `--check-sex` calls a sample female if its F coefficient is < 0.2 and male if it is > 0.8; samples in between are left unassigned. For better accuracy, these cutoffs can be set manually based on the observed distribution of the F coefficient, as visualized below.\n",
+    "\n",
+    "\n",
+    "\n",
+    "##### Implementing X Chromosome Homozygosity Thresholds:\n",
+    "\n",
+    "In R:\n",
+    "\n",
+    "```R\n",
+    "library(ggplot2)\n",
+    "name = 'plink.sexcheck'\n",
+    "dat = read.table(name, header = TRUE)\n",
+    "\n",
+    "png(paste0(name, '.png'))\n",
+    "ggplot(dat)+\n",
+    "    geom_histogram(aes(x = F), bins = 30)+\n",
+    "    theme_classic(20)+\n",
+    "    labs(x = 'F coefficient', y = 'Frequency', title = 'X Chromosome Homozygosity')+\n",
+    "    geom_vline(xintercept = c(0.4, 0.9), color = \"red\", linetype = \"dashed\")\n",
+    "dev.off()\n",
+    "```\n",
+    "\n",
+    "![Image](./images/plink.sexcheck.png)\n",
+    "\n",
+    "> We recommend removal of any samples where pedigree-defined sex is discordant with genotype-inferred sex assignment, as this suggests the sample may have been labeled incorrectly.\n",
+    "\n",
+    "``` bash\n",
+    "#update sex in fam file and re-output sex check\n",
+    "plink --bfile data.par_split --impute-sex 0.4 0.85 --make-bed --out data.par_split.impute_sex\n",
+    "#removal of samples with discordant sex information\n",
+    "awk ' ($5==\"PROBLEM\") { print $1,$2 } ' data.par_split.impute_sex.sexcheck > discordantsex.sample.txt\n",
+    "plink --bfile data.par_split.impute_sex --remove discordantsex.sample.txt --make-bed --out data.par_split.impute_sex.concordant\n",
+    "```\n",
+    "\n",
+    "\n",
+    "#### 4.4 Removal of Highly Genetically Related Samples\n",
     "\n",
     "> A required feature of association studies with cohorts sampled from a general population is the removal of genetically identical individuals in the cohort. Genetically identical samples are detrimental to the overall power of further association tests, as the individual is overrepresented in the study cohort. Additionally, genetically identical samples can be the product of poor sample handling and experimental error. Below we outline steps to remove highly related individuals with plink. \n",
     "\n",
@@ -625,7 +675,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python 3 (ipykernel)",
    "language": "python",
    "name": "python3"
   },
@@ -639,7 +689,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.8"
+   "version": "3.10.4"
   }
  },
  "nbformat": 4,