add sexcheck sample qc
michelle-curtis committed Nov 15, 2022
1 parent 9478020 commit 2f45645
Showing 3 changed files with 106 additions and 6 deletions.
56 changes: 53 additions & 3 deletions .ipynb_checkpoints/tutorial_HLAQCImputation-checkpoint.ipynb
@@ -233,7 +233,57 @@
"```\n",
"\n",
"\n",
"#### 4.3 Removal of Highly Genetically Related Samples\n",
"#### 4.3 Removal of Samples with Discordant Sex Information\n",
"\n",
"> Sample sex information is relevant in sex-specific analyses or imputation of the sex chromosome. Inaccurate assignment of sex (i.e. male recorded as female) may be a sign of sample swapping or other sample quality issue. Homozygosity rate across the non pseudo-autosomal region (non-PAR) of the X chromosome is used to impute sample sex assignment. \n",
"\n",
"##### Calculating X Chromosome Homozygosity per Sample:\n",
"\n",
">**While our genotype data in this repository only includes mock genotype data on chromosome 6 due to size restrictions in GitHub, please assume that the data {bim,bed,fam} files are the plink genotype files with all chromosomes, including chromosome X.**\n",
"\n",
"``` bash\n",
"#split off X chromosome PAR\n",
"plink --bfile data --split-x hg19 --make-bed --out data.par_split\n",
"#output sex check\n",
"plink --bfile data.par_split --check-sex\n",
"```\n",
"\n",
"> By default, plink's check-sex will assign samples as female if the F coefficient is < 0.2 or male if the F coefficient is > 0.8. To be most accurate, these cutoffs can be changed manually based on the distribution of the F coefficient, as visualized below.\n",
"\n",
"\n",
"\n",
"##### Implementing X Chromosome Homozygosity Thresholds:\n",
"\n",
"in R: \n",
"\n",
"```R\n",
"library(ggplot2)\n",
"name = 'plink.sexcheck'\n",
"dat = read.table(name, header = TRUE)\n",
"\n",
"png(paste0(name, '.png'))\n",
"ggplot(dat)+\n",
" geom_histogram(aes(x = F), bins = 30)+\n",
" theme_classic(20)+\n",
" labs(x = 'F coefficient', y = 'Frequency', title = 'X Chromosome Homozygosity')+\n",
" geom_vline(xintercept = c(0.4, 0.9), color = \"red\", linetype = \"dashed\")\n",
"dev.off()\n",
"```\n",
"\n",
"![Image](./images/plink.sexcheck.png)\n",
"\n",
"> We recommend removal of any samples where pedigree-defined sex is discordant with genotype-inferred sex assignment, as this suggests the sample may have been labeled incorrectly.\n",
"\n",
"``` bash\n",
"#update sex in fam file and re-ouptut sex check\n",
"plink --bfile data.par_split --impute-sex 0.4 0.85 --make-bed --out data.par_split.impute_sex\n",
"#removal of samples with discordant sex information\n",
"awk ' ($5==\"PROBLEM\") { print $1,$2 } ' data.par_split.impute_sex.sexcheck >> discordantsex.sample.txt\n",
"plink --bfile data.par_split.impute_sex --remove discordantsex.sample.txt --make-bed --out data.par_split.impute_sex.concordant\n",
"```\n",
"\n",
"\n",
"#### 4.4 Removal of Highly Genetically Related Samples\n",
"\n",
"> A required feature of association studies with cohorts sampled from a general population is the removal of genetically identical individuals in the cohort. Genetically identical samples are detrimental to the overall power of further association tests, as the individual is overrepresented in the study cohort. Additionally, genetically identical samples can be the product of poor sample handling and experimental error. Below we outline steps to remove highly related individuals with plink. \n",
"\n",
@@ -625,7 +675,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -639,7 +689,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
"version": "3.10.4"
}
},
"nbformat": 4,
Binary file added images/plink.sexcheck.png
56 changes: 53 additions & 3 deletions tutorial_HLAQCImputation.ipynb
@@ -233,7 +233,57 @@
"```\n",
"\n",
"\n",
"#### 4.3 Removal of Highly Genetically Related Samples\n",
"#### 4.3 Removal of Samples with Discordant Sex Information\n",
"\n",
"> Sample sex information is relevant in sex-specific analyses or imputation of the sex chromosome. Inaccurate assignment of sex (i.e. male recorded as female) may be a sign of sample swapping or other sample quality issue. Homozygosity rate across the non pseudo-autosomal region (non-PAR) of the X chromosome is used to impute sample sex assignment. \n",
"\n",
"##### Calculating X Chromosome Homozygosity per Sample:\n",
"\n",
">**While our genotype data in this repository only includes mock genotype data on chromosome 6 due to size restrictions in GitHub, please assume that the data {bim,bed,fam} files are the plink genotype files with all chromosomes, including chromosome X.**\n",
"\n",
"``` bash\n",
"#split off X chromosome PAR\n",
"plink --bfile data --split-x hg19 --make-bed --out data.par_split\n",
"#output sex check\n",
"plink --bfile data.par_split --check-sex\n",
"```\n",
"\n",
"> By default, plink's check-sex will assign samples as female if the F coefficient is < 0.2 or male if the F coefficient is > 0.8. To be most accurate, these cutoffs can be changed manually based on the distribution of the F coefficient, as visualized below.\n",
"\n",
"\n",
"\n",
"##### Implementing X Chromosome Homozygosity Thresholds:\n",
"\n",
"in R: \n",
"\n",
"```R\n",
"library(ggplot2)\n",
"name = 'plink.sexcheck'\n",
"dat = read.table(name, header = TRUE)\n",
"\n",
"png(paste0(name, '.png'))\n",
"ggplot(dat)+\n",
" geom_histogram(aes(x = F), bins = 30)+\n",
" theme_classic(20)+\n",
" labs(x = 'F coefficient', y = 'Frequency', title = 'X Chromosome Homozygosity')+\n",
" geom_vline(xintercept = c(0.4, 0.9), color = \"red\", linetype = \"dashed\")\n",
"dev.off()\n",
"```\n",
"\n",
"![Image](./images/plink.sexcheck.png)\n",
"\n",
"> We recommend removal of any samples where pedigree-defined sex is discordant with genotype-inferred sex assignment, as this suggests the sample may have been labeled incorrectly.\n",
"\n",
"``` bash\n",
"#update sex in fam file and re-ouptut sex check\n",
"plink --bfile data.par_split --impute-sex 0.4 0.85 --make-bed --out data.par_split.impute_sex\n",
"#removal of samples with discordant sex information\n",
"awk ' ($5==\"PROBLEM\") { print $1,$2 } ' data.par_split.impute_sex.sexcheck >> discordantsex.sample.txt\n",
"plink --bfile data.par_split.impute_sex --remove discordantsex.sample.txt --make-bed --out data.par_split.impute_sex.concordant\n",
"```\n",
"\n",
"\n",
"#### 4.4 Removal of Highly Genetically Related Samples\n",
"\n",
"> A required feature of association studies with cohorts sampled from a general population is the removal of genetically identical individuals in the cohort. Genetically identical samples are detrimental to the overall power of further association tests, as the individual is overrepresented in the study cohort. Additionally, genetically identical samples can be the product of poor sample handling and experimental error. Below we outline steps to remove highly related individuals with plink. \n",
"\n",
@@ -625,7 +675,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -639,7 +689,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
"version": "3.10.4"
}
},
"nbformat": 4,
Expand Down
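
The thresholding and filtering logic that section 4.3 of the notebook delegates to `plink --impute-sex` and `awk` can be sketched in Python, which may help when checking the cutoffs by hand. This is a minimal sketch and not part of the commit. It assumes the six-column `.sexcheck` layout written by PLINK 1.9 (FID, IID, PEDSEX, SNPSEX, STATUS, F) and PLINK's sex codes (1 = male, 2 = female, 0 = unknown). The function names `call_sex` and `discordant_samples`, the mock rows, and the choice to reuse the 0.4 / 0.85 cutoffs from the `--impute-sex` call are assumptions made for illustration.

```python
# Sketch only: re-derive sex calls from the F column of a PLINK .sexcheck
# table and list samples whose recorded sex disagrees with the call.

FEMALE_MAX_F = 0.4   # assumed cutoff, copied from the --impute-sex call
MALE_MIN_F = 0.85    # assumed cutoff, copied from the --impute-sex call


def call_sex(f_coef, female_max=FEMALE_MAX_F, male_min=MALE_MIN_F):
    """Return 2 (female), 1 (male) or 0 (no call) from an X-chromosome F."""
    if f_coef < female_max:
        return 2
    if f_coef > male_min:
        return 1
    return 0  # between the cutoffs (or NaN): left ambiguous


def discordant_samples(sexcheck_lines, female_max=FEMALE_MAX_F,
                       male_min=MALE_MIN_F):
    """Return (FID, IID) pairs whose PEDSEX does not match the F-based call."""
    col = None
    flagged = []
    for line in sexcheck_lines:
        fields = line.split()
        if not fields:
            continue
        if col is None:
            # First non-empty line is the header; map column name -> index.
            col = {name: i for i, name in enumerate(fields)}
            continue
        pedsex = int(fields[col["PEDSEX"]])
        snpsex = call_sex(float(fields[col["F"]]), female_max, male_min)
        # Ambiguous calls are flagged too, as PLINK's PROBLEM status does.
        if snpsex == 0 or snpsex != pedsex:
            flagged.append((fields[col["FID"]], fields[col["IID"]]))
    return flagged


# Mock rows for illustration only; not taken from the repository data.
MOCK_SEXCHECK = """\
FID IID PEDSEX SNPSEX STATUS F
fam1 s1 1 1 OK 0.98
fam2 s2 2 2 OK 0.05
fam3 s3 2 1 PROBLEM 0.95
fam4 s4 1 0 PROBLEM 0.60
"""

for fid, iid in discordant_samples(MOCK_SEXCHECK.splitlines()):
    print(fid, iid)
```

Writing the printed pairs to a file gives the same two-column FID/IID format that `plink --remove` expects. Note that the sketch recomputes the call from F instead of reading the STATUS column, so changing the cutoffs changes which samples are flagged without rerunning plink.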
