From 729140a05eb1de259a105b9dc88232ce2b3c4a08 Mon Sep 17 00:00:00 2001
From: Piero Palacios Bernuy
Date: Tue, 30 Apr 2024 18:01:14 -0500
Subject: [PATCH] tcga commit
---
.../notebooks/TCGA/execute-results/html.json | 4 ++--
notebooks/TCGA.qmd | 20 +++++++++++++++++-
.../number_mut_genes_per_tumor_stage.jpeg | Bin 280357 -> 278777 bytes
3 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/_freeze/notebooks/TCGA/execute-results/html.json b/_freeze/notebooks/TCGA/execute-results/html.json
index 4ae9e0f..81d6419 100644
--- a/_freeze/notebooks/TCGA/execute-results/html.json
+++ b/_freeze/notebooks/TCGA/execute-results/html.json
@@ -1,8 +1,8 @@
{
- "hash": "ef20f1042b06780e7761a9aca04808aa",
+ "hash": "dc08553dbecf0ac0d417e1815f428244",
"result": {
"engine": "knitr",
- "markdown": "---\ntitle: \"Integrative Analysis with TCGA Data\"\nsubtitle: \"Analysis of Mutation Data from The Cancer Genome Atlas (TCGA)\"\nformat: html\neditor: visual\n---\n\n::: {.cell .hidden}\n\n```{.r .cell-code .hidden}\n#| include: false\nlibrary(knitr)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'knitr' was built under R version 4.3.3\n```\n\n\n:::\n\n```{.r .cell-code .hidden}\n#| include: false\nknitr::opts_chunk$set(\n warning = FALSE,\n message = FALSE,\n dpi = 180,\n echo = TRUE\n)\n\nlibrary(tidyverse)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'ggplot2' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'tidyr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'readr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'purrr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'dplyr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'stringr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'lubridate' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.0 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package () to force all conflicts to become errors\n```\n\n\n:::\n\n```{.r .cell-code .hidden}\n#| include: false\ntheme_set(theme_minimal())\n```\n:::\n\n\n## Introduction\n\nThe Cancer Genome Atlas (TCGA) is a massive cancer genomics project compiling high-throughput multi-omic data on dozens of cancer types for [public access](https://www.cancer.gov/ccg/research/genome-sequencing/tcga).\n\nWe are gonna use the `curatedTCGAData` package to manipulate locally to multiple high-throughput datasets from the project. The package provides access to TCGA data that has been curated and stored as a *MultiAssayExperiment* object on the Bioconductor [ExperimentHub](https://bioconductor.org/packages/release/bioc/html/ExperimentHub.html).\n\nFirst, let's load the packages needed.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(curatedTCGAData)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: MultiAssayExperiment\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'MultiAssayExperiment' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: SummarizedExperiment\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'SummarizedExperiment' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: MatrixGenerics\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'MatrixGenerics' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: matrixStats\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'matrixStats' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'matrixStats'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:dplyr':\n\n count\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'MatrixGenerics'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:matrixStats':\n\n colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,\n colCounts, colCummaxs, colCummins, colCumprods, colCumsums,\n colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,\n colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,\n colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,\n colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,\n colWeightedMeans, colWeightedMedians, colWeightedSds,\n colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,\n rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,\n rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,\n rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,\n rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,\n rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,\n rowWeightedMads, rowWeightedMeans, rowWeightedMedians,\n rowWeightedSds, rowWeightedVars\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: GenomicRanges\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: stats4\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: BiocGenerics\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'BiocGenerics' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'BiocGenerics'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:lubridate':\n\n intersect, setdiff, union\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:dplyr':\n\n combine, intersect, setdiff, union\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:stats':\n\n IQR, mad, sd, var, xtabs\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:base':\n\n anyDuplicated, aperm, append, as.data.frame, basename, cbind,\n colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,\n get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,\n match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,\n Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,\n table, tapply, union, unique, unsplit, which.max, which.min\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: S4Vectors\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'S4Vectors'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:lubridate':\n\n second, second<-\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:dplyr':\n\n first, rename\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:tidyr':\n\n expand\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:utils':\n\n findMatches\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:base':\n\n expand.grid, I, unname\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: IRanges\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'IRanges'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:lubridate':\n\n %within%\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:dplyr':\n\n collapse, desc, slice\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:purrr':\n\n reduce\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:grDevices':\n\n windows\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: GenomeInfoDb\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'GenomeInfoDb' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: Biobase\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWelcome to Bioconductor\n\n Vignettes contain introductory material; view with\n 'browseVignettes()'. To cite Bioconductor, see\n 'citation(\"Biobase\")', and for packages 'citation(\"pkgname\")'.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'Biobase'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:MatrixGenerics':\n\n rowMedians\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:matrixStats':\n\n anyMissing, rowMedians\n```\n\n\n:::\n\n```{.r .cell-code}\nlibrary(TCGAutils)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'TCGAutils' was built under R version 4.3.2\n```\n\n\n:::\n\n```{.r .cell-code}\nlibrary(MultiAssayExperiment)\n```\n:::\n\n\n## Download the Data\n\nTo download the data we need to use `curatedTCGAData`function. The first argument is a four letter disease (cancer) code (A complete list of disease codes used by the TCGA project are available on the [NCI Genomic Data Commons website](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations)), the second argument is a vector of data types we want to download. We need to specify `dry.run = FALSE` to download the data.\n\nIn this specific case, we are gonna work with RNA-Seq data, mutation data and methylation data from Rectum Adenocarcinoma (READ). The clinical data is included by default.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#| message: false\n#| warning: false\n#| label: read-data\n\nreadData = curatedTCGAData(\"READ\", \n c(\"RNASeq2GeneNorm\", \"Mutation\", \"Methylation_methyl450\"), \n dry.run = FALSE, version = \"2.1.1\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_Mutation-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nrequire(\"RaggedExperiment\")\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'RaggedExperiment' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_RNASeq2GeneNorm-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_Methylation_methyl450-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nrequire(\"rhdf5\")\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'rhdf5' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: HDF5Array\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'HDF5Array' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: DelayedArray\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'DelayedArray' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: Matrix\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'Matrix' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'Matrix'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:S4Vectors':\n\n expand\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:tidyr':\n\n expand, pack, unpack\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: S4Arrays\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'S4Arrays' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: abind\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'S4Arrays'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:abind':\n\n abind\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:base':\n\n rowsum\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: SparseArray\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'SparseArray' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'DelayedArray'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:purrr':\n\n simplify\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:base':\n\n apply, scale, sweep\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'HDF5Array'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:rhdf5':\n\n h5ls\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_colData-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_sampleMap-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_metadata-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nharmonizing input:\n removing 1903 sampleMap rows not in names(experiments)\n removing 2 colData rownames not in sampleMap 'primary'\n```\n\n\n:::\n\n```{.r .cell-code}\n#| message: false\n#| warning: false\n#| label: read-data\n\nreadData # this is a MultiAssayExperiment object\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nA MultiAssayExperiment object of 3 listed\n experiments with user-defined names and respective classes.\n Containing an ExperimentList class object of length 3:\n [1] READ_Mutation-20160128: RaggedExperiment with 22075 rows and 69 columns\n [2] READ_RNASeq2GeneNorm-20160128: SummarizedExperiment with 18115 rows and 177 columns\n [3] READ_Methylation_methyl450-20160128: SummarizedExperiment with 485577 rows and 106 columns\nFunctionality:\n experiments() - obtain the ExperimentList instance\n colData() - the primary/phenotype DataFrame\n sampleMap() - the sample coordination DataFrame\n `$`, `[`, `[[` - extract colData columns, subset, or experiment\n *Format() - convert into a long or wide DataFrame\n assays() - convert ExperimentList to a SimpleList of matrices\n exportClass() - save data to flat files\n```\n\n\n:::\n:::\n\n\n## Review the Clinical Metadata\n\nWe can see which patients have data for each assay. The assay column gives the experiment type, the primary column gives the unique patient ID and the colname gives the sample ID used as a identifier within a given experiment.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsampleMap(readData)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nDataFrame with 352 rows and 3 columns\n assay primary colname\n \n1 READ_Methylation_methyl450-20160128 TCGA-AF-2687 TCGA-AF-2687-01A-02D..\n2 READ_Methylation_methyl450-20160128 TCGA-AF-2690 TCGA-AF-2690-01A-02D..\n3 READ_Methylation_methyl450-20160128 TCGA-AF-2693 TCGA-AF-2693-01A-02D..\n4 READ_Methylation_methyl450-20160128 TCGA-AF-3911 TCGA-AF-3911-01A-01D..\n5 READ_Methylation_methyl450-20160128 TCGA-AF-4110 TCGA-AF-4110-01A-02D..\n... ... ... ...\n348 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A02G TCGA-AG-A02G-01\n349 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A02N TCGA-AG-A02N-01\n350 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A02X TCGA-AG-A02X-01\n351 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A032 TCGA-AG-A032-01\n352 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A036 TCGA-AG-A036-01\n```\n\n\n:::\n:::\n\n\nNot all patients have data for all assays, and some of them can have multiple data entries for one or more experiment type. This may correspond to multiple biopsies or matched tumor and normal samples from an individual patient.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsampleMap(readData) |> \n as_tibble() |> \n pull(primary) |> \n table() |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n 1 2 3 4 \n 5 147 7 8 \n```\n\n\n:::\n:::\n\n\nWe can see the metadata of the patients with `colData`. Note that there are more than 2000 columns of data per patient (not necessarily complete).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclin = colData(readData) |> \n as_tibble()\ndim(clin)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 167 2260\n```\n\n\n:::\n\n```{.r .cell-code}\nhead(colnames(clin), 10) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] \"patientID\" \"years_to_birth\" \"vital_status\" \n [4] \"days_to_death\" \"days_to_last_followup\" \"tumor_tissue_site\" \n [7] \"pathologic_stage\" \"pathology_T_stage\" \"pathology_N_stage\" \n[10] \"pathology_M_stage\" \n```\n\n\n:::\n:::\n\n\nAs an example, for rectum adenocarcinoma, we can see the tumor stage.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclin |> \n pull(pathology_T_stage) |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n t1 t2 t3 t4 t4a t4b \n 9 28 114 5 8 1 \n```\n\n\n:::\n:::\n\n\nStage T4 have subgroups. To simplify the analysis, let's combine all T4 tumors.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclin <- clin |> \n mutate(t_stage = case_when(\n pathology_T_stage %in% c(\"t4\",\"t4a\",\"t4b\") ~ \"t4\",\n .default = pathology_T_stage\n ))\n\nclin$t_stage |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n t1 t2 t3 t4 \n 9 28 114 14 \n```\n\n\n:::\n:::\n\n\nAlso, we can see the vital status (alive=0, deceased=1)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclin$vital_status |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n 0 1 \n139 28 \n```\n\n\n:::\n:::\n\n\nOr combine tumor status and vital status.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(clin$t_stage, clin$vital_status)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n \n 0 1\n t1 9 0\n t2 24 4\n t3 96 18\n t4 9 5\n```\n\n\n:::\n:::\n\n\n## Analyzing Mutation Data\n\nLet's begin analyzing the mutation data. Below is the code to retrieve the mutation data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_data = readData[[1]]\n\nmut_data\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nclass: RaggedExperiment \ndim: 22075 69 \nassays(34): Hugo_Symbol Entrez_Gene_Id ... COSMIC_Gene Drug_Target\nrownames: NULL\ncolnames(69): TCGA-AF-2689-01A-01W-0831-10 TCGA-AF-2691-01A-01W-0831-10\n ... TCGA-AG-A032-01 TCGA-AG-A036-01\ncolData names(0):\n```\n\n\n:::\n:::\n\n\nFrom the inspection of the sample IDs we can see that the mutation colnames match to the **primary** column from he clinical data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_sample_ids = colnames(mut_data)\nhead(mut_sample_ids)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"TCGA-AF-2689-01A-01W-0831-10\" \"TCGA-AF-2691-01A-01W-0831-10\"\n[3] \"TCGA-AF-2692-01A-01W-0831-10\" \"TCGA-AF-3400-01A-01W-0831-10\"\n[5] \"TCGA-AF-3913-01\" \"TCGA-AG-3574-01A-01W-0831-10\"\n```\n\n\n:::\n\n```{.r .cell-code}\nhead(clin$patientID)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"TCGA-AF-2687\" \"TCGA-AF-2689\" \"TCGA-AF-2690\" \"TCGA-AF-2691\" \"TCGA-AF-2692\"\n[6] \"TCGA-AF-2693\"\n```\n\n\n:::\n:::\n\n\nWe need to manipulate these by substracting 12 characters.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_sample_ids <- mut_sample_ids |> \n stringr::str_sub(1,12)\n\nall(mut_sample_ids %in% clin$patientID)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\nIs important to note that the data stored in `assay(mut_data)` is difficult to work with because is a sparse matrix that has a row for each `GRanges` with a mutation in at least one sample.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nassay(mut_data)[1:3,1:3]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n TCGA-AF-2689-01A-01W-0831-10 TCGA-AF-2691-01A-01W-0831-10\nX:54241715-54241716:+ \"WNK3\" NA \n7:111899511:+ \"IFRD1\" NA \n7:113309556:+ \"PPP1R3A\" NA \n TCGA-AF-2692-01A-01W-0831-10\nX:54241715-54241716:+ NA \n7:111899511:+ NA \n7:113309556:+ NA \n```\n\n\n:::\n\n```{.r .cell-code}\n# Is a sparse matrix\n\nassay(mut_data)[1,] |> \n table(useNA=\"ifany\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nWNK3 \n 1 68 \n```\n\n\n:::\n:::\n\n\nWe can get more information if we look at the mutation information for each patient.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_assay = mut_data@assays\n\nmut_assay # GRangesList\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nGRangesList object of length 69:\n$`TCGA-AF-2689-01A-01W-0831-10`\nGRanges object with 64 ranges and 34 metadata columns:\n seqnames ranges strand | Hugo_Symbol Entrez_Gene_Id\n | \n [1] X 54241715-54241716 + | WNK3 65267\n [2] 7 111899511 + | IFRD1 3475\n [3] 7 113309556 + | PPP1R3A 5506\n [4] 7 128146325 + | FAM71F1 84691\n [5] 7 156447624 + | NOM1 64434\n ... ... ... ... . ... ...\n [60] 10 50616059 + | OGDHL 55753\n [61] 10 96343335 + | HELLS 3070\n [62] 5 139173166 + | PSD2 84249\n [63] 5 147800932 + | FBXO38 81545\n [64] 5 54365356 + | GZMK 3003\n Center NCBI_Build Variant_Classification Variant_Type\n \n [1] hgsc.bcm.edu 36 Frame_Shift_Ins INS\n [2] hgsc.bcm.edu 36 Missense_Mutation SNP\n [3] hgsc.bcm.edu 36 Missense_Mutation SNP\n [4] hgsc.bcm.edu 36 Silent SNP\n [5] hgsc.bcm.edu 36 Missense_Mutation SNP\n ... ... ... ... ...\n [60] hgsc.bcm.edu 36 Silent SNP\n [61] hgsc.bcm.edu 36 Silent SNP\n [62] hgsc.bcm.edu 36 Nonsense_Mutation SNP\n [63] hgsc.bcm.edu 36 Silent SNP\n [64] hgsc.bcm.edu 36 Missense_Mutation SNP\n Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS\n \n [1] - - T novel\n [2] A A G novel\n [3] T T C novel\n [4] G G A novel\n [5] A A G novel\n ... ... ... ... ...\n [60] G G A novel\n [61] T T C novel\n [62] C C T novel\n [63] G G A novel\n [64] T T A novel\n dbSNP_Val_Status Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1\n \n [1] unknown TCGA-AF-2689-10A-01W.. -\n [2] unknown TCGA-AF-2689-10A-01W.. A\n [3] unknown TCGA-AF-2689-10A-01W.. T\n [4] unknown TCGA-AF-2689-10A-01W.. G\n [5] unknown TCGA-AF-2689-10A-01W.. A\n ... ... ... ...\n [60] unknown TCGA-AF-2689-10A-01W.. G\n [61] unknown TCGA-AF-2689-10A-01W.. T\n [62] unknown TCGA-AF-2689-10A-01W.. C\n [63] unknown TCGA-AF-2689-10A-01W.. G\n [64] unknown TCGA-AF-2689-10A-01W.. T\n Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2\n \n [1] - - T\n [2] A A G\n [3] T T C\n [4] G . .\n [5] A A G\n ... ... ... ...\n [60] G . .\n [61] T . .\n [62] C C T\n [63] G G A\n [64] T T A\n Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2\n \n [1] - -\n [2] A A\n [3] T T\n [4] . .\n [5] A A\n ... ... ...\n [60] . .\n [61] . .\n [62] C C\n [63] G G\n [64] T T\n Verification_Status Validation_Status Mutation_Status Sequencing_Phase\n \n [1] Unknown Valid Somatic Phase_I\n [2] Unknown Valid Somatic Phase_I\n [3] Unknown Valid Somatic Phase_I\n [4] Unknown Unknown Somatic Phase_I\n [5] Unknown Valid Somatic Phase_I\n ... ... ... ... ...\n [60] Unknown Unknown Somatic Phase_I\n [61] Unknown Unknown Somatic Phase_I\n [62] Unknown Valid Somatic Phase_I\n [63] Unknown Valid Somatic Phase_I\n [64] Unknown Valid Somatic Phase_I\n Sequence_Source Validation_Method Score BAM_file Sequencer\n \n [1] Capture 454 SOLID\n [2] Capture 454 SOLID\n [3] Capture 454 SOLID\n [4] Capture . SOLID\n [5] Capture 454 SOLID\n ... ... ... ... ... ...\n [60] Capture . SOLID\n [61] Capture . SOLID\n [62] Capture 454 SOLID\n [63] Capture 454 SOLID\n [64] Capture 454 SOLID\n TranscriptID Exon ChromChange AAChange COSMIC_Codon\n \n [1] NM_001002838 exon23 c.4999_5000insA p.L1667fs .\n [2] NM_001550 exon10 c.A1043G p.E348G .\n [3] NM_002711 exon2 c.A838G p.T280A .\n [4] NM_032599 exon3 c.G639A p.T213T .\n [5] NM_138400 exon5 c.A1651G p.T551A .\n ... ... ... ... ... ...\n [60] NM_001143996 exon18 c.C2286T p.S762S .\n [61] NM_018063 exon18 c.T2061C p.D687D .\n [62] NM_032289 exon3 c.C460T p.Q154X .\n [63] NM_205836 exon21 c.G3327A p.R1109R .\n [64] NM_002104 exon5 c.T640A p.S214T .\n COSMIC_Gene Drug_Target\n \n [1] . .\n [2] . .\n [3] . .\n [4] . .\n [5] . .\n ... ... ...\n [60] . .\n [61] . .\n [62] . .\n [63] . .\n [64] . .\n -------\n seqinfo: 24 sequences from NCBI36 genome; no seqlengths\n\n...\n<68 more elements>\n```\n\n\n:::\n\n```{.r .cell-code}\nmut_assay |> class()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"CompressedGRangesList\"\nattr(,\"package\")\n[1] \"GenomicRanges\"\n```\n\n\n:::\n\n```{.r .cell-code}\nlength(mut_assay)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 69\n```\n\n\n:::\n:::\n\n\nLet's inspect the data from the first patient. We can see from the metadata information the Hugo Symbol, mutation status and predicted effect of each mutation at variant classification.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_assay[[1]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nGRanges object with 64 ranges and 34 metadata columns:\n seqnames ranges strand | Hugo_Symbol Entrez_Gene_Id\n | \n [1] X 54241715-54241716 + | WNK3 65267\n [2] 7 111899511 + | IFRD1 3475\n [3] 7 113309556 + | PPP1R3A 5506\n [4] 7 128146325 + | FAM71F1 84691\n [5] 7 156447624 + | NOM1 64434\n ... ... ... ... . ... ...\n [60] 10 50616059 + | OGDHL 55753\n [61] 10 96343335 + | HELLS 3070\n [62] 5 139173166 + | PSD2 84249\n [63] 5 147800932 + | FBXO38 81545\n [64] 5 54365356 + | GZMK 3003\n Center NCBI_Build Variant_Classification Variant_Type\n \n [1] hgsc.bcm.edu 36 Frame_Shift_Ins INS\n [2] hgsc.bcm.edu 36 Missense_Mutation SNP\n [3] hgsc.bcm.edu 36 Missense_Mutation SNP\n [4] hgsc.bcm.edu 36 Silent SNP\n [5] hgsc.bcm.edu 36 Missense_Mutation SNP\n ... ... ... ... ...\n [60] hgsc.bcm.edu 36 Silent SNP\n [61] hgsc.bcm.edu 36 Silent SNP\n [62] hgsc.bcm.edu 36 Nonsense_Mutation SNP\n [63] hgsc.bcm.edu 36 Silent SNP\n [64] hgsc.bcm.edu 36 Missense_Mutation SNP\n Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS\n \n [1] - - T novel\n [2] A A G novel\n [3] T T C novel\n [4] G G A novel\n [5] A A G novel\n ... ... ... ... ...\n [60] G G A novel\n [61] T T C novel\n [62] C C T novel\n [63] G G A novel\n [64] T T A novel\n dbSNP_Val_Status Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1\n \n [1] unknown TCGA-AF-2689-10A-01W.. -\n [2] unknown TCGA-AF-2689-10A-01W.. A\n [3] unknown TCGA-AF-2689-10A-01W.. T\n [4] unknown TCGA-AF-2689-10A-01W.. G\n [5] unknown TCGA-AF-2689-10A-01W.. A\n ... ... ... ...\n [60] unknown TCGA-AF-2689-10A-01W.. G\n [61] unknown TCGA-AF-2689-10A-01W.. T\n [62] unknown TCGA-AF-2689-10A-01W.. C\n [63] unknown TCGA-AF-2689-10A-01W.. G\n [64] unknown TCGA-AF-2689-10A-01W.. T\n Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2\n \n [1] - - T\n [2] A A G\n [3] T T C\n [4] G . .\n [5] A A G\n ... ... ... ...\n [60] G . .\n [61] T . .\n [62] C C T\n [63] G G A\n [64] T T A\n Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2\n \n [1] - -\n [2] A A\n [3] T T\n [4] . .\n [5] A A\n ... ... ...\n [60] . .\n [61] . .\n [62] C C\n [63] G G\n [64] T T\n Verification_Status Validation_Status Mutation_Status Sequencing_Phase\n \n [1] Unknown Valid Somatic Phase_I\n [2] Unknown Valid Somatic Phase_I\n [3] Unknown Valid Somatic Phase_I\n [4] Unknown Unknown Somatic Phase_I\n [5] Unknown Valid Somatic Phase_I\n ... ... ... ... ...\n [60] Unknown Unknown Somatic Phase_I\n [61] Unknown Unknown Somatic Phase_I\n [62] Unknown Valid Somatic Phase_I\n [63] Unknown Valid Somatic Phase_I\n [64] Unknown Valid Somatic Phase_I\n Sequence_Source Validation_Method Score BAM_file Sequencer\n \n [1] Capture 454 SOLID\n [2] Capture 454 SOLID\n [3] Capture 454 SOLID\n [4] Capture . SOLID\n [5] Capture 454 SOLID\n ... ... ... ... ... ...\n [60] Capture . SOLID\n [61] Capture . SOLID\n [62] Capture 454 SOLID\n [63] Capture 454 SOLID\n [64] Capture 454 SOLID\n TranscriptID Exon ChromChange AAChange COSMIC_Codon\n \n [1] NM_001002838 exon23 c.4999_5000insA p.L1667fs .\n [2] NM_001550 exon10 c.A1043G p.E348G .\n [3] NM_002711 exon2 c.A838G p.T280A .\n [4] NM_032599 exon3 c.G639A p.T213T .\n [5] NM_138400 exon5 c.A1651G p.T551A .\n ... ... ... ... ... ...\n [60] NM_001143996 exon18 c.C2286T p.S762S .\n [61] NM_018063 exon18 c.T2061C p.D687D .\n [62] NM_032289 exon3 c.C460T p.Q154X .\n [63] NM_205836 exon21 c.G3327A p.R1109R .\n [64] NM_002104 exon5 c.T640A p.S214T .\n COSMIC_Gene Drug_Target\n \n [1] . .\n [2] . .\n [3] . .\n [4] . .\n [5] . .\n ... ... ...\n [60] . .\n [61] . .\n [62] . .\n [63] . .\n [64] . .\n -------\n seqinfo: 24 sequences from NCBI36 genome; no seqlengths\n```\n\n\n:::\n\n```{.r .cell-code}\nmut_assay[[1]]$Hugo_Symbol\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] \"WNK3\" \"IFRD1\" \"PPP1R3A\" \"FAM71F1\" \"NOM1\" \"TRIM73\" \n [7] \"C7orf51\" \"DIDO1\" \"DNMT1\" \"CALR\" \"SIN3B\" \"ZNF569\" \n[13] \"SIGLEC12\" \"ZNF160\" \"NLRP4\" \"ENPP2\" \"TRPA1\" \"MMP16\" \n[19] \"GPR89A\" \"SLC9A11\" \"PAX7\" \"PLA2G5\" \"SLC26A9\" \"PIK3R3\" \n[25] \"LPPR4\" \"BCL9L\" \"PRDM11\" \"PEX3\" \"DCDC2\" \"KRT20\" \n[31] \"TP53\" \"CDH8\" \"SF3B3\" \"SLC9A10\" \"SLC7A14\" \"NLGN1\" \n[37] \"MYRIP\" \"CYP8B1\" \"TGM4\" \"COL7A1\" \"P2RX7\" \"KRAS\" \n[43] \"TPH2\" \"ANO4\" \"UBR1\" \"LBXCOR1\" \"PDLIM5\" \"ODZ1\" \n[49] \"SMARCA1\" \"CNKSR2\" \"RBM10\" \"RBM3\" \"HUWE1\" \"CYLC1\" \n[55] \"ATP6V1E2\" \"ASTN2\" \"TAF1L\" \"SGCG\" \"GBF1\" \"OGDHL\" \n[61] \"HELLS\" \"PSD2\" \"FBXO38\" \"GZMK\" \n```\n\n\n:::\n\n```{.r .cell-code}\nmut_assay[[1]]$Mutation_Status |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nSomatic \n 64 \n```\n\n\n:::\n\n```{.r .cell-code}\nmut_assay[[1]]$Variant_Classification |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n Frame_Shift_Ins Missense_Mutation Nonsense_Mutation Silent \n 1 39 5 19 \n```\n\n\n:::\n:::\n\n\nNow, is kind of a trouble to inspect manually each patient. So, lets get all mutation information from Hugo symbol and Variant classification for all the patients.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvar_class_df = mapply(function(sample_id, mutation_assay){\n \n d = mcols(mutation_assay)[,c(\"Hugo_Symbol\",\"Variant_Classification\")] |> \n as.data.frame()\n \n colnames(d) = c(\"symbol\",\"variant_class\")\n \n d$patientID = sample_id\n \n return(d)\n \n}, sample_id=mut_sample_ids, mutation_assay = mut_assay,SIMPLIFY = F, USE.NAMES = F)\n\n\nvar_class_df = do.call(rbind, var_class_df)\n\n\nhead(var_class_df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n symbol variant_class patientID\n1 WNK3 Frame_Shift_Ins TCGA-AF-2689\n2 IFRD1 Missense_Mutation TCGA-AF-2689\n3 PPP1R3A Missense_Mutation TCGA-AF-2689\n4 FAM71F1 Silent TCGA-AF-2689\n5 NOM1 Missense_Mutation TCGA-AF-2689\n6 TRIM73 Nonsense_Mutation TCGA-AF-2689\n```\n\n\n:::\n:::\n\n\nWe can visualize the most common mutated genes genes in rectum adenocarcinoma\n\n\n::: {#cell-fig-genesmut .cell}\n\n```{.r .cell-code}\n#| label: fig-genesmut\n#| message: false\n#| warning: false\n\np <- var_class_df |> \n as_tibble() |> \n group_by(symbol,variant_class) |> \n summarise(n = n()) |> \n arrange(desc(n)) |> \n ungroup() |> \n slice_max(order_by = n, n = 20) |> \n ungroup() |> \n mutate(symbol = fct_reorder(symbol,n)) |> \n ggplot(aes(n, symbol,fill=variant_class)) +\n geom_col() +\n facet_wrap(~variant_class) +\n paletteer::scale_fill_paletteer_d(\"awtools::a_palette\") +\n labs(x = \"Samples with specific mutation\", y=\"Gene Symbol\", fill=\"Variant Class\",\n title=\"Top 20 Samples with READ\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n`summarise()` has grouped output by 'symbol'. You can override using the\n`.groups` argument.\n```\n\n\n:::\n\n```{.r .cell-code}\n#| label: fig-genesmut\n#| message: false\n#| warning: false\n\npath <- \"images/\"\nggsave(paste0(path,\"samples_with_mut.jpeg\"), device = \"jpeg\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nSaving 7 x 5 in image\n```\n\n\n:::\n:::\n\n\n\n## Linking mutations and tumor stage\n\nNow that we are familiarized with the mutation data, we can link mutations to the tumor stage from the patients with rectum adenocarcinoma.\n\nWe can filter the clinical data with the patients that have mutation data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindex <- which( (var_class_df$patientID |> unique()) %in% clin$patientID)\n\nclin_and_mutation = clin[index,]\n\nhead(clin_and_mutation) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 2,261\n patientID years_to_birth vital_status days_to_death days_to_last_followup\n \n1 TCGA-AF-2687 57 0 NA 1427\n2 TCGA-AF-2689 41 1 1201 NA\n3 TCGA-AF-2690 76 1 524 NA\n4 TCGA-AF-2691 48 0 NA 1309\n5 TCGA-AF-2692 54 0 NA 412\n6 TCGA-AF-2693 75 0 NA 1155\n# ℹ 2,256 more variables: tumor_tissue_site , pathologic_stage ,\n# pathology_T_stage , pathology_N_stage , pathology_M_stage ,\n# gender , date_of_initial_pathologic_diagnosis ,\n# days_to_last_known_alive , radiation_therapy ,\n# histological_type , tumor_stage , residual_tumor ,\n# number_of_lymph_nodes , ethnicity , admin.bcr ,\n# admin.day_of_dcc_upload , admin.disease_code , …\n```\n\n\n:::\n:::\n\n\nWe can count the number of genes with mutations per patient and then make a `left_join` to the clinical data to plot the mutated genes per tumor stage.\n\n\n::: {#cell-fig-plot1 .cell}\n\n```{.r .cell-code}\n#| label: fig-plot1\n\ndf = var_class_df |> \n group_by(patientID) |> \n summarise(n = n())\n\np <- clin_and_mutation |> \n left_join(df, by = join_by(patientID)) |> \n ggplot(aes(t_stage,log(n), fill=t_stage))+\n geom_boxplot() +\n geom_jitter(width = 0.05, colour=\"red2\") +\n paletteer::scale_fill_paletteer_d(\"awtools::a_palette\")+\n labs(x = \"Tumor stage\", y=\"Number of mutated genes in log scale\")\n\npath <- \"images/\"\nggsave(paste0(path,\"number_mut_genes_per_tumor_stage.jpeg\"), device = \"jpeg\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nSaving 7 x 5 in image\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Removed 20 rows containing non-finite outside the scale range\n(`stat_boxplot()`).\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Removed 20 rows containing missing values or values outside the scale range\n(`geom_point()`).\n```\n\n\n:::\n:::\n\n\n\n\nAlso, we can focus on a specific gene, e.g. [TP53](https://www.ncbi.nlm.nih.gov/gene/7157).\n\n\n::: {#cell-fig-plot_tp53 .cell}\n\n```{.r .cell-code}\n#| label: fig-plot_tp53\n\nmut_on_tp53 <- var_class_df |> \n dplyr::filter(symbol == \"TP53\") \n\n\np <-clin_and_mutation |> \n dplyr::filter(patientID %in% mut_on_tp53$patientID) |> \n dplyr::select(t_stage, patientID) |> \n group_by(t_stage) |> \n summarise(n = n()) |> \n # mutate(t_stage = fct_reorder(t_stage,n)) |> \n ggplot(aes(n, t_stage, fill=t_stage, label=n)) +\n geom_col() +\n geom_label(fill=\"white\") +\n paletteer::scale_fill_paletteer_d(\"awtools::a_palette\")+\n labs(y = \"Tumor Stage\", x=\"N° of Mutations\", fill=\"Tumor Stage\", title=\"Number of Mutations on Gene TP53\")\n\npath <- \"images/\"\nggsave(paste0(path,\"number_mut_tp53_per_tumor_stage.jpeg\"), device = \"jpeg\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nSaving 7 x 5 in image\n```\n\n\n:::\n:::\n\n\n\n\n# Linking expression and tumor stage\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrnaseq = readData[[2]]\n\nrnaseq\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nclass: SummarizedExperiment \ndim: 18115 177 \nmetadata(3): filename build platform\nassays(1): ''\nrownames(18115): A1BG A1CF ... ZZZ3 psiTPTE22\nrowData names(0):\ncolnames(177): TCGA-AF-2687-01 TCGA-AF-2689-11 ... TCGA-AG-A032-01\n TCGA-AG-A036-01\ncolData names(0):\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(rnaseq) = colnames(rnaseq) |> \n stringr::str_sub(1,12)\n\nclin <- clin |> as.data.frame()\nrownames(clin) <- clin$patientID\n\nindex <- which(colnames(rnaseq) %in% rownames(clin))\n\ncolData(rnaseq) = as(clin[colnames(rnaseq),],\"DataFrame\")\n\n\nidx <- is.na(colData(rnaseq)$t_stage)\n\nrnaseq <- rnaseq[,!idx]\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(limma)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'limma' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'limma'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:BiocGenerics':\n\n plotMA\n```\n\n\n:::\n\n```{.r .cell-code}\nmm = model.matrix(~t_stage, data=colData(rnaseq))\nf1 = lmFit(assay(rnaseq), mm)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Partial NA coefficients for 5 probe(s)\n```\n\n\n:::\n\n```{.r .cell-code}\nef1 = eBayes(f1)\ntopTable(ef1) |> \n arrange(desc(AveExpr))\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nRemoving intercept from test coefficients\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n t_staget2 t_staget3 t_staget4 AveExpr F P.Value\nPAM -0.5791836 0.08813521 0.4984790 10.94216940 8.309141 3.409613e-05\nPAIP2 -0.3240061 0.18551765 0.0958645 10.29398797 9.056896 1.331550e-05\nSPP1 -0.3352086 1.56339367 2.6409082 9.61423581 9.316731 9.621982e-06\nNIPSNAP3A -0.9988818 -0.35051776 -0.6085837 9.38910637 8.712515 2.051165e-05\nASPN -0.7366657 0.75340984 1.9537316 7.45952050 8.323019 3.350393e-05\nOLR1 -0.2557879 1.39542897 2.1822156 4.97144475 8.531715 2.597133e-05\nSEMA5B 1.3576218 1.27252058 1.9271042 4.60337761 8.091237 4.490926e-05\nCPSF4L 1.0629502 -0.17984587 -1.4274556 0.70036223 8.955085 2.311056e-05\nGRM5 1.3027501 -0.14596612 -0.7395609 0.06134539 10.268851 7.301656e-06\nC1orf223 0.7830779 -0.40735736 -0.7062534 -0.35136703 9.681632 1.743477e-05\n adj.P.Val\nPAM 0.06860899\nPAIP2 0.06719153\nSPP1 0.06719153\nNIPSNAP3A 0.06719153\nASPN 0.06860899\nOLR1 0.06719153\nSEMA5B 0.07597995\nCPSF4L 0.06719153\nGRM5 0.06719153\nC1orf223 0.06719153\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npar(mfrow=c(1,2))\nboxplot(split(assay(rnaseq)[\"PAM\",], rnaseq$t_stage), main=\"PAM\") # higher expression in lower t_stage\nboxplot(split(assay(rnaseq)[\"PAIP2\",], rnaseq$t_stage), main=\"PAIP2\")\n```\n\n::: {.cell-output-display}\n{width=1260}\n:::\n:::\n\n\n# Linking methylation and expression\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmethyl <- readData[[3]]\n\nidx <- colnames(methyl) |> \n str_split(pattern = \"-\") |> \n map(4) |> \n enframe() |> \n unnest(value)\n\nidx <- which(idx$value == \"01A\")\n\nmethyl = methyl[,idx]\n\nmethyl\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nclass: SummarizedExperiment \ndim: 485577 95 \nmetadata(0):\nassays(1): counts\nrownames(485577): cg00000029 cg00000108 ... rs966367 rs9839873\nrowData names(3): Gene_Symbol Chromosome Genomic_Coordinate\ncolnames(95): TCGA-AF-2687-01A-02D-1734-05 TCGA-AF-2690-01A-02D-1734-05\n ... TCGA-G5-6572-01A-11D-1828-05 TCGA-G5-6641-01A-11D-1828-05\ncolData names(0):\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(methyl) <- colnames(methyl) |> \n str_sub(start = 1,end = 12)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ncolData(methyl) <- as(clin[colnames(methyl),],\"DataFrame\")\n\n\nintersect(colnames(methyl), colnames(rnaseq)) |> length()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 93\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nmethyl_subset = methyl[,which(colnames(methyl) %in% colnames(rnaseq))]\nrnaseq_subset = rnaseq[,which(colnames(rnaseq) %in% colnames(methyl))]\n\nmethyl_genes = rowData(methyl_subset)\nmethyl_genes\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nDataFrame with 485577 rows and 3 columns\n Gene_Symbol Chromosome Genomic_Coordinate\n \ncg00000029 RBL2 16 53468112\ncg00000108 C3orf35 3 37459206\ncg00000109 FNDC3B 3 171916037\ncg00000165 NA 1 91194674\ncg00000236 VDAC3 8 42263294\n... ... ... ...\nrs9363764 NA NA 0\nrs939290 NA NA 0\nrs951295 NA NA 0\nrs966367 NA NA 0\nrs9839873 NA NA 0\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n#| label: figplot\n\nme_rna_cor = function(sym, mpick = 3){\n require(GGally)\n # subset methylation data to first mpick methylation sites for given gene symbol\n methyl_ind = which(methyl_genes$Gene_Symbol == sym)\n if (length(methyl_ind) > mpick){ \n methyl_ind = methyl_ind[1:mpick]\n }\n methyl_dat = assay(methyl_subset)[methyl_ind,] # subset to selected methylation sites\n\n # subset expression data to selected gene symbol\n expr_ind = which(rownames(rnaseq_subset) == sym) \n expr_dat = assay(rnaseq_subset)[expr_ind,]\n\n # combine methylation and expression data as data frame\n combined_dat = as(t(methyl_dat), \"DataFrame\")\n combined_dat$expr = expr_dat\n\n # plot pairs and calculate correlation coefficients between methylation and expression\n ggpairs(combined_dat) |> print()\n sapply(1:mpick, function(i){\n \n cor_to <- as.numeric(combined_dat[,mpick+1])\n \n df <- data.frame(v1= as.numeric(combined_dat[,i]),\n v2 = cor_to)\n \n df <- df |> na.omit()\n \n cor(df[,1], df[,2])\n })\n}\n\nme_rna_cor(\"TAC1\", mpick=3)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: GGally\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'GGally' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nRegistered S3 method overwritten by 'GGally':\n method from \n +.gg ggplot2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :\nRemoved 5 rows containing missing values\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :\nRemoved 5 rows containing missing values\nWarning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :\nRemoved 5 rows containing missing values\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Removed 5 rows containing missing values or values outside the scale range\n(`geom_point()`).\nRemoved 5 rows containing missing values or values outside the scale range\n(`geom_point()`).\nRemoved 5 rows containing missing values or values outside the scale range\n(`geom_point()`).\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Removed 5 rows containing non-finite outside the scale range\n(`stat_density()`).\n```\n\n\n:::\n\n::: {.cell-output-display}\n{width=1260}\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.0410095 0.1457715 0.0856926\n```\n\n\n:::\n:::\n\n\n## Conclusion\n",
+ "markdown": "---\ntitle: \"Integrative Analysis with TCGA Data\"\nsubtitle: \"Analysis of Mutation Data from The Cancer Genome Atlas (TCGA)\"\nformat: html\neditor: visual\n---\n\n::: {.cell .hidden}\n\n```{.r .cell-code .hidden}\n#| include: false\nlibrary(knitr)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'knitr' was built under R version 4.3.3\n```\n\n\n:::\n\n```{.r .cell-code .hidden}\n#| include: false\nknitr::opts_chunk$set(\n warning = FALSE,\n message = FALSE,\n dpi = 180,\n echo = TRUE\n)\n\nlibrary(tidyverse)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'ggplot2' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'tidyr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'readr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'purrr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'dplyr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'stringr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'lubridate' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.0 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package () to force all conflicts to become errors\n```\n\n\n:::\n\n```{.r .cell-code .hidden}\n#| include: false\ntheme_set(theme_minimal())\n```\n:::\n\n\n## Introduction\n\nThe Cancer Genome Atlas (TCGA) is a massive cancer genomics project compiling high-throughput multi-omic data on dozens of cancer types for [public access](https://www.cancer.gov/ccg/research/genome-sequencing/tcga).\n\nWe are gonna use the `curatedTCGAData` package to manipulate locally to multiple high-throughput datasets from the project. The package provides access to TCGA data that has been curated and stored as a *MultiAssayExperiment* object on the Bioconductor [ExperimentHub](https://bioconductor.org/packages/release/bioc/html/ExperimentHub.html).\n\nFirst, let's load the packages needed.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(curatedTCGAData)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: MultiAssayExperiment\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'MultiAssayExperiment' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: SummarizedExperiment\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'SummarizedExperiment' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: MatrixGenerics\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'MatrixGenerics' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: matrixStats\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'matrixStats' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'matrixStats'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:dplyr':\n\n count\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'MatrixGenerics'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:matrixStats':\n\n colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,\n colCounts, colCummaxs, colCummins, colCumprods, colCumsums,\n colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,\n colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,\n colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,\n colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,\n colWeightedMeans, colWeightedMedians, colWeightedSds,\n colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,\n rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,\n rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,\n rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,\n rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,\n rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,\n rowWeightedMads, rowWeightedMeans, rowWeightedMedians,\n rowWeightedSds, rowWeightedVars\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: GenomicRanges\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: stats4\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: BiocGenerics\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'BiocGenerics' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'BiocGenerics'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:lubridate':\n\n intersect, setdiff, union\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:dplyr':\n\n combine, intersect, setdiff, union\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:stats':\n\n IQR, mad, sd, var, xtabs\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:base':\n\n anyDuplicated, aperm, append, as.data.frame, basename, cbind,\n colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,\n get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,\n match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,\n Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,\n table, tapply, union, unique, unsplit, which.max, which.min\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: S4Vectors\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'S4Vectors'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:lubridate':\n\n second, second<-\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:dplyr':\n\n first, rename\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:tidyr':\n\n expand\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:utils':\n\n findMatches\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:base':\n\n expand.grid, I, unname\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: IRanges\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'IRanges'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:lubridate':\n\n %within%\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:dplyr':\n\n collapse, desc, slice\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:purrr':\n\n reduce\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:grDevices':\n\n windows\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: GenomeInfoDb\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'GenomeInfoDb' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: Biobase\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWelcome to Bioconductor\n\n Vignettes contain introductory material; view with\n 'browseVignettes()'. To cite Bioconductor, see\n 'citation(\"Biobase\")', and for packages 'citation(\"pkgname\")'.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'Biobase'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:MatrixGenerics':\n\n rowMedians\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:matrixStats':\n\n anyMissing, rowMedians\n```\n\n\n:::\n\n```{.r .cell-code}\nlibrary(TCGAutils)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'TCGAutils' was built under R version 4.3.2\n```\n\n\n:::\n\n```{.r .cell-code}\nlibrary(MultiAssayExperiment)\n```\n:::\n\n\n## Download the Data\n\nTo download the data we need to use `curatedTCGAData`function. The first argument is a four letter disease (cancer) code (A complete list of disease codes used by the TCGA project are available on the [NCI Genomic Data Commons website](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations)), the second argument is a vector of data types we want to download. We need to specify `dry.run = FALSE` to download the data.\n\nIn this specific case, we are gonna work with RNA-Seq data, mutation data and methylation data from Rectum Adenocarcinoma (READ). The clinical data is included by default.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#| message: false\n#| warning: false\n#| label: read-data\n\nreadData = curatedTCGAData(\"READ\", \n c(\"RNASeq2GeneNorm\", \"Mutation\", \"Methylation_methyl450\"), \n dry.run = FALSE, version = \"2.1.1\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_Mutation-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nrequire(\"RaggedExperiment\")\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'RaggedExperiment' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_RNASeq2GeneNorm-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_Methylation_methyl450-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nrequire(\"rhdf5\")\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'rhdf5' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: HDF5Array\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'HDF5Array' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: DelayedArray\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'DelayedArray' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: Matrix\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'Matrix' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'Matrix'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:S4Vectors':\n\n expand\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:tidyr':\n\n expand, pack, unpack\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: S4Arrays\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'S4Arrays' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: abind\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'S4Arrays'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:abind':\n\n abind\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:base':\n\n rowsum\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: SparseArray\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'SparseArray' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'DelayedArray'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:purrr':\n\n simplify\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:base':\n\n apply, scale, sweep\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'HDF5Array'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:rhdf5':\n\n h5ls\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_colData-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_sampleMap-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_metadata-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nharmonizing input:\n removing 1903 sampleMap rows not in names(experiments)\n removing 2 colData rownames not in sampleMap 'primary'\n```\n\n\n:::\n\n```{.r .cell-code}\n#| message: false\n#| warning: false\n#| label: read-data\n\nreadData # this is a MultiAssayExperiment object\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nA MultiAssayExperiment object of 3 listed\n experiments with user-defined names and respective classes.\n Containing an ExperimentList class object of length 3:\n [1] READ_Mutation-20160128: RaggedExperiment with 22075 rows and 69 columns\n [2] READ_RNASeq2GeneNorm-20160128: SummarizedExperiment with 18115 rows and 177 columns\n [3] READ_Methylation_methyl450-20160128: SummarizedExperiment with 485577 rows and 106 columns\nFunctionality:\n experiments() - obtain the ExperimentList instance\n colData() - the primary/phenotype DataFrame\n sampleMap() - the sample coordination DataFrame\n `$`, `[`, `[[` - extract colData columns, subset, or experiment\n *Format() - convert into a long or wide DataFrame\n assays() - convert ExperimentList to a SimpleList of matrices\n exportClass() - save data to flat files\n```\n\n\n:::\n:::\n\n\n## Review the Clinical Metadata\n\nWe can see which patients have data for each assay. The assay column gives the experiment type, the primary column gives the unique patient ID and the colname gives the sample ID used as a identifier within a given experiment.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsampleMap(readData)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nDataFrame with 352 rows and 3 columns\n assay primary colname\n \n1 READ_Methylation_methyl450-20160128 TCGA-AF-2687 TCGA-AF-2687-01A-02D..\n2 READ_Methylation_methyl450-20160128 TCGA-AF-2690 TCGA-AF-2690-01A-02D..\n3 READ_Methylation_methyl450-20160128 TCGA-AF-2693 TCGA-AF-2693-01A-02D..\n4 READ_Methylation_methyl450-20160128 TCGA-AF-3911 TCGA-AF-3911-01A-01D..\n5 READ_Methylation_methyl450-20160128 TCGA-AF-4110 TCGA-AF-4110-01A-02D..\n... ... ... ...\n348 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A02G TCGA-AG-A02G-01\n349 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A02N TCGA-AG-A02N-01\n350 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A02X TCGA-AG-A02X-01\n351 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A032 TCGA-AG-A032-01\n352 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A036 TCGA-AG-A036-01\n```\n\n\n:::\n:::\n\n\nNot all patients have data for all assays, and some of them can have multiple data entries for one or more experiment type. This may correspond to multiple biopsies or matched tumor and normal samples from an individual patient.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsampleMap(readData) |> \n as_tibble() |> \n pull(primary) |> \n table() |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n 1 2 3 4 \n 5 147 7 8 \n```\n\n\n:::\n:::\n\n\nWe can see the metadata of the patients with `colData`. Note that there are more than 2000 columns of data per patient (not necessarily complete).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclin = colData(readData) |> \n as_tibble()\ndim(clin)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 167 2260\n```\n\n\n:::\n\n```{.r .cell-code}\nhead(colnames(clin), 10) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] \"patientID\" \"years_to_birth\" \"vital_status\" \n [4] \"days_to_death\" \"days_to_last_followup\" \"tumor_tissue_site\" \n [7] \"pathologic_stage\" \"pathology_T_stage\" \"pathology_N_stage\" \n[10] \"pathology_M_stage\" \n```\n\n\n:::\n:::\n\n\nAs an example, for rectum adenocarcinoma, we can see the tumor stage.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclin |> \n pull(pathology_T_stage) |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n t1 t2 t3 t4 t4a t4b \n 9 28 114 5 8 1 \n```\n\n\n:::\n:::\n\n\nStage T4 have subgroups. To simplify the analysis, let's combine all T4 tumors.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclin <- clin |> \n mutate(t_stage = case_when(\n pathology_T_stage %in% c(\"t4\",\"t4a\",\"t4b\") ~ \"t4\",\n .default = pathology_T_stage\n ))\n\nclin$t_stage |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n t1 t2 t3 t4 \n 9 28 114 14 \n```\n\n\n:::\n:::\n\n\nAlso, we can see the vital status (alive=0, deceased=1)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclin$vital_status |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n 0 1 \n139 28 \n```\n\n\n:::\n:::\n\n\nOr combine tumor status and vital status.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(clin$t_stage, clin$vital_status)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n \n 0 1\n t1 9 0\n t2 24 4\n t3 96 18\n t4 9 5\n```\n\n\n:::\n:::\n\n\n## Analyzing Mutation Data\n\nLet's begin analyzing the mutation data. Below is the code to retrieve the mutation data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_data = readData[[1]]\n\nmut_data\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nclass: RaggedExperiment \ndim: 22075 69 \nassays(34): Hugo_Symbol Entrez_Gene_Id ... COSMIC_Gene Drug_Target\nrownames: NULL\ncolnames(69): TCGA-AF-2689-01A-01W-0831-10 TCGA-AF-2691-01A-01W-0831-10\n ... TCGA-AG-A032-01 TCGA-AG-A036-01\ncolData names(0):\n```\n\n\n:::\n:::\n\n\nFrom the inspection of the sample IDs we can see that the mutation colnames match to the **primary** column from he clinical data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_sample_ids = colnames(mut_data)\nhead(mut_sample_ids)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"TCGA-AF-2689-01A-01W-0831-10\" \"TCGA-AF-2691-01A-01W-0831-10\"\n[3] \"TCGA-AF-2692-01A-01W-0831-10\" \"TCGA-AF-3400-01A-01W-0831-10\"\n[5] \"TCGA-AF-3913-01\" \"TCGA-AG-3574-01A-01W-0831-10\"\n```\n\n\n:::\n\n```{.r .cell-code}\nhead(clin$patientID)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"TCGA-AF-2687\" \"TCGA-AF-2689\" \"TCGA-AF-2690\" \"TCGA-AF-2691\" \"TCGA-AF-2692\"\n[6] \"TCGA-AF-2693\"\n```\n\n\n:::\n:::\n\n\nWe need to manipulate these by substracting 12 characters.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_sample_ids <- mut_sample_ids |> \n stringr::str_sub(1,12)\n\nall(mut_sample_ids %in% clin$patientID)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\nIs important to note that the data stored in `assay(mut_data)` is difficult to work with because is a sparse matrix that has a row for each `GRanges` with a mutation in at least one sample.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nassay(mut_data)[1:3,1:3]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n TCGA-AF-2689-01A-01W-0831-10 TCGA-AF-2691-01A-01W-0831-10\nX:54241715-54241716:+ \"WNK3\" NA \n7:111899511:+ \"IFRD1\" NA \n7:113309556:+ \"PPP1R3A\" NA \n TCGA-AF-2692-01A-01W-0831-10\nX:54241715-54241716:+ NA \n7:111899511:+ NA \n7:113309556:+ NA \n```\n\n\n:::\n\n```{.r .cell-code}\n# Is a sparse matrix\n\nassay(mut_data)[1,] |> \n table(useNA=\"ifany\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nWNK3 \n 1 68 \n```\n\n\n:::\n:::\n\n\nWe can get more information if we look at the mutation information for each patient.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_assay = mut_data@assays\n\nmut_assay # GRangesList\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nGRangesList object of length 69:\n$`TCGA-AF-2689-01A-01W-0831-10`\nGRanges object with 64 ranges and 34 metadata columns:\n seqnames ranges strand | Hugo_Symbol Entrez_Gene_Id\n | \n [1] X 54241715-54241716 + | WNK3 65267\n [2] 7 111899511 + | IFRD1 3475\n [3] 7 113309556 + | PPP1R3A 5506\n [4] 7 128146325 + | FAM71F1 84691\n [5] 7 156447624 + | NOM1 64434\n ... ... ... ... . ... ...\n [60] 10 50616059 + | OGDHL 55753\n [61] 10 96343335 + | HELLS 3070\n [62] 5 139173166 + | PSD2 84249\n [63] 5 147800932 + | FBXO38 81545\n [64] 5 54365356 + | GZMK 3003\n Center NCBI_Build Variant_Classification Variant_Type\n \n [1] hgsc.bcm.edu 36 Frame_Shift_Ins INS\n [2] hgsc.bcm.edu 36 Missense_Mutation SNP\n [3] hgsc.bcm.edu 36 Missense_Mutation SNP\n [4] hgsc.bcm.edu 36 Silent SNP\n [5] hgsc.bcm.edu 36 Missense_Mutation SNP\n ... ... ... ... ...\n [60] hgsc.bcm.edu 36 Silent SNP\n [61] hgsc.bcm.edu 36 Silent SNP\n [62] hgsc.bcm.edu 36 Nonsense_Mutation SNP\n [63] hgsc.bcm.edu 36 Silent SNP\n [64] hgsc.bcm.edu 36 Missense_Mutation SNP\n Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS\n \n [1] - - T novel\n [2] A A G novel\n [3] T T C novel\n [4] G G A novel\n [5] A A G novel\n ... ... ... ... ...\n [60] G G A novel\n [61] T T C novel\n [62] C C T novel\n [63] G G A novel\n [64] T T A novel\n dbSNP_Val_Status Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1\n \n [1] unknown TCGA-AF-2689-10A-01W.. -\n [2] unknown TCGA-AF-2689-10A-01W.. A\n [3] unknown TCGA-AF-2689-10A-01W.. T\n [4] unknown TCGA-AF-2689-10A-01W.. G\n [5] unknown TCGA-AF-2689-10A-01W.. A\n ... ... ... ...\n [60] unknown TCGA-AF-2689-10A-01W.. G\n [61] unknown TCGA-AF-2689-10A-01W.. T\n [62] unknown TCGA-AF-2689-10A-01W.. C\n [63] unknown TCGA-AF-2689-10A-01W.. G\n [64] unknown TCGA-AF-2689-10A-01W.. T\n Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2\n \n [1] - - T\n [2] A A G\n [3] T T C\n [4] G . .\n [5] A A G\n ... ... ... ...\n [60] G . .\n [61] T . .\n [62] C C T\n [63] G G A\n [64] T T A\n Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2\n \n [1] - -\n [2] A A\n [3] T T\n [4] . .\n [5] A A\n ... ... ...\n [60] . .\n [61] . .\n [62] C C\n [63] G G\n [64] T T\n Verification_Status Validation_Status Mutation_Status Sequencing_Phase\n \n [1] Unknown Valid Somatic Phase_I\n [2] Unknown Valid Somatic Phase_I\n [3] Unknown Valid Somatic Phase_I\n [4] Unknown Unknown Somatic Phase_I\n [5] Unknown Valid Somatic Phase_I\n ... ... ... ... ...\n [60] Unknown Unknown Somatic Phase_I\n [61] Unknown Unknown Somatic Phase_I\n [62] Unknown Valid Somatic Phase_I\n [63] Unknown Valid Somatic Phase_I\n [64] Unknown Valid Somatic Phase_I\n Sequence_Source Validation_Method Score BAM_file Sequencer\n \n [1] Capture 454 SOLID\n [2] Capture 454 SOLID\n [3] Capture 454 SOLID\n [4] Capture . SOLID\n [5] Capture 454 SOLID\n ... ... ... ... ... ...\n [60] Capture . SOLID\n [61] Capture . SOLID\n [62] Capture 454 SOLID\n [63] Capture 454 SOLID\n [64] Capture 454 SOLID\n TranscriptID Exon ChromChange AAChange COSMIC_Codon\n \n [1] NM_001002838 exon23 c.4999_5000insA p.L1667fs .\n [2] NM_001550 exon10 c.A1043G p.E348G .\n [3] NM_002711 exon2 c.A838G p.T280A .\n [4] NM_032599 exon3 c.G639A p.T213T .\n [5] NM_138400 exon5 c.A1651G p.T551A .\n ... ... ... ... ... ...\n [60] NM_001143996 exon18 c.C2286T p.S762S .\n [61] NM_018063 exon18 c.T2061C p.D687D .\n [62] NM_032289 exon3 c.C460T p.Q154X .\n [63] NM_205836 exon21 c.G3327A p.R1109R .\n [64] NM_002104 exon5 c.T640A p.S214T .\n COSMIC_Gene Drug_Target\n \n [1] . .\n [2] . .\n [3] . .\n [4] . .\n [5] . .\n ... ... ...\n [60] . .\n [61] . .\n [62] . .\n [63] . .\n [64] . .\n -------\n seqinfo: 24 sequences from NCBI36 genome; no seqlengths\n\n...\n<68 more elements>\n```\n\n\n:::\n\n```{.r .cell-code}\nmut_assay |> class()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"CompressedGRangesList\"\nattr(,\"package\")\n[1] \"GenomicRanges\"\n```\n\n\n:::\n\n```{.r .cell-code}\nlength(mut_assay)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 69\n```\n\n\n:::\n:::\n\n\nLet's inspect the data from the first patient. We can see from the metadata information the Hugo Symbol, mutation status and predicted effect of each mutation at variant classification.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_assay[[1]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nGRanges object with 64 ranges and 34 metadata columns:\n seqnames ranges strand | Hugo_Symbol Entrez_Gene_Id\n | \n [1] X 54241715-54241716 + | WNK3 65267\n [2] 7 111899511 + | IFRD1 3475\n [3] 7 113309556 + | PPP1R3A 5506\n [4] 7 128146325 + | FAM71F1 84691\n [5] 7 156447624 + | NOM1 64434\n ... ... ... ... . ... ...\n [60] 10 50616059 + | OGDHL 55753\n [61] 10 96343335 + | HELLS 3070\n [62] 5 139173166 + | PSD2 84249\n [63] 5 147800932 + | FBXO38 81545\n [64] 5 54365356 + | GZMK 3003\n Center NCBI_Build Variant_Classification Variant_Type\n \n [1] hgsc.bcm.edu 36 Frame_Shift_Ins INS\n [2] hgsc.bcm.edu 36 Missense_Mutation SNP\n [3] hgsc.bcm.edu 36 Missense_Mutation SNP\n [4] hgsc.bcm.edu 36 Silent SNP\n [5] hgsc.bcm.edu 36 Missense_Mutation SNP\n ... ... ... ... ...\n [60] hgsc.bcm.edu 36 Silent SNP\n [61] hgsc.bcm.edu 36 Silent SNP\n [62] hgsc.bcm.edu 36 Nonsense_Mutation SNP\n [63] hgsc.bcm.edu 36 Silent SNP\n [64] hgsc.bcm.edu 36 Missense_Mutation SNP\n Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS\n \n [1] - - T novel\n [2] A A G novel\n [3] T T C novel\n [4] G G A novel\n [5] A A G novel\n ... ... ... ... ...\n [60] G G A novel\n [61] T T C novel\n [62] C C T novel\n [63] G G A novel\n [64] T T A novel\n dbSNP_Val_Status Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1\n \n [1] unknown TCGA-AF-2689-10A-01W.. -\n [2] unknown TCGA-AF-2689-10A-01W.. A\n [3] unknown TCGA-AF-2689-10A-01W.. T\n [4] unknown TCGA-AF-2689-10A-01W.. G\n [5] unknown TCGA-AF-2689-10A-01W.. A\n ... ... ... ...\n [60] unknown TCGA-AF-2689-10A-01W.. G\n [61] unknown TCGA-AF-2689-10A-01W.. T\n [62] unknown TCGA-AF-2689-10A-01W.. C\n [63] unknown TCGA-AF-2689-10A-01W.. G\n [64] unknown TCGA-AF-2689-10A-01W.. T\n Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2\n \n [1] - - T\n [2] A A G\n [3] T T C\n [4] G . .\n [5] A A G\n ... ... ... ...\n [60] G . .\n [61] T . .\n [62] C C T\n [63] G G A\n [64] T T A\n Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2\n \n [1] - -\n [2] A A\n [3] T T\n [4] . .\n [5] A A\n ... ... ...\n [60] . .\n [61] . .\n [62] C C\n [63] G G\n [64] T T\n Verification_Status Validation_Status Mutation_Status Sequencing_Phase\n \n [1] Unknown Valid Somatic Phase_I\n [2] Unknown Valid Somatic Phase_I\n [3] Unknown Valid Somatic Phase_I\n [4] Unknown Unknown Somatic Phase_I\n [5] Unknown Valid Somatic Phase_I\n ... ... ... ... ...\n [60] Unknown Unknown Somatic Phase_I\n [61] Unknown Unknown Somatic Phase_I\n [62] Unknown Valid Somatic Phase_I\n [63] Unknown Valid Somatic Phase_I\n [64] Unknown Valid Somatic Phase_I\n Sequence_Source Validation_Method Score BAM_file Sequencer\n \n [1] Capture 454 SOLID\n [2] Capture 454 SOLID\n [3] Capture 454 SOLID\n [4] Capture . SOLID\n [5] Capture 454 SOLID\n ... ... ... ... ... ...\n [60] Capture . SOLID\n [61] Capture . SOLID\n [62] Capture 454 SOLID\n [63] Capture 454 SOLID\n [64] Capture 454 SOLID\n TranscriptID Exon ChromChange AAChange COSMIC_Codon\n \n [1] NM_001002838 exon23 c.4999_5000insA p.L1667fs .\n [2] NM_001550 exon10 c.A1043G p.E348G .\n [3] NM_002711 exon2 c.A838G p.T280A .\n [4] NM_032599 exon3 c.G639A p.T213T .\n [5] NM_138400 exon5 c.A1651G p.T551A .\n ... ... ... ... ... ...\n [60] NM_001143996 exon18 c.C2286T p.S762S .\n [61] NM_018063 exon18 c.T2061C p.D687D .\n [62] NM_032289 exon3 c.C460T p.Q154X .\n [63] NM_205836 exon21 c.G3327A p.R1109R .\n [64] NM_002104 exon5 c.T640A p.S214T .\n COSMIC_Gene Drug_Target\n \n [1] . .\n [2] . .\n [3] . .\n [4] . .\n [5] . .\n ... ... ...\n [60] . .\n [61] . .\n [62] . .\n [63] . .\n [64] . .\n -------\n seqinfo: 24 sequences from NCBI36 genome; no seqlengths\n```\n\n\n:::\n\n```{.r .cell-code}\nmut_assay[[1]]$Hugo_Symbol\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] \"WNK3\" \"IFRD1\" \"PPP1R3A\" \"FAM71F1\" \"NOM1\" \"TRIM73\" \n [7] \"C7orf51\" \"DIDO1\" \"DNMT1\" \"CALR\" \"SIN3B\" \"ZNF569\" \n[13] \"SIGLEC12\" \"ZNF160\" \"NLRP4\" \"ENPP2\" \"TRPA1\" \"MMP16\" \n[19] \"GPR89A\" \"SLC9A11\" \"PAX7\" \"PLA2G5\" \"SLC26A9\" \"PIK3R3\" \n[25] \"LPPR4\" \"BCL9L\" \"PRDM11\" \"PEX3\" \"DCDC2\" \"KRT20\" \n[31] \"TP53\" \"CDH8\" \"SF3B3\" \"SLC9A10\" \"SLC7A14\" \"NLGN1\" \n[37] \"MYRIP\" \"CYP8B1\" \"TGM4\" \"COL7A1\" \"P2RX7\" \"KRAS\" \n[43] \"TPH2\" \"ANO4\" \"UBR1\" \"LBXCOR1\" \"PDLIM5\" \"ODZ1\" \n[49] \"SMARCA1\" \"CNKSR2\" \"RBM10\" \"RBM3\" \"HUWE1\" \"CYLC1\" \n[55] \"ATP6V1E2\" \"ASTN2\" \"TAF1L\" \"SGCG\" \"GBF1\" \"OGDHL\" \n[61] \"HELLS\" \"PSD2\" \"FBXO38\" \"GZMK\" \n```\n\n\n:::\n\n```{.r .cell-code}\nmut_assay[[1]]$Mutation_Status |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nSomatic \n 64 \n```\n\n\n:::\n\n```{.r .cell-code}\nmut_assay[[1]]$Variant_Classification |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n Frame_Shift_Ins Missense_Mutation Nonsense_Mutation Silent \n 1 39 5 19 \n```\n\n\n:::\n:::\n\n\nNow, is kind of a trouble to inspect manually each patient. So, lets get all mutation information from Hugo symbol and Variant classification for all the patients.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvar_class_df = mapply(function(sample_id, mutation_assay){\n \n d = mcols(mutation_assay)[,c(\"Hugo_Symbol\",\"Variant_Classification\")] |> \n as.data.frame()\n \n colnames(d) = c(\"symbol\",\"variant_class\")\n \n d$patientID = sample_id\n \n return(d)\n \n}, sample_id=mut_sample_ids, mutation_assay = mut_assay,SIMPLIFY = F, USE.NAMES = F)\n\n\nvar_class_df = do.call(rbind, var_class_df)\n\n\nhead(var_class_df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n symbol variant_class patientID\n1 WNK3 Frame_Shift_Ins TCGA-AF-2689\n2 IFRD1 Missense_Mutation TCGA-AF-2689\n3 PPP1R3A Missense_Mutation TCGA-AF-2689\n4 FAM71F1 Silent TCGA-AF-2689\n5 NOM1 Missense_Mutation TCGA-AF-2689\n6 TRIM73 Nonsense_Mutation TCGA-AF-2689\n```\n\n\n:::\n:::\n\n\nWe can visualize the most common mutated genes genes in rectum adenocarcinoma\n\n\n::: {#cell-fig-genesmut .cell}\n\n```{.r .cell-code}\n#| label: fig-genesmut\n#| message: false\n#| warning: false\n\np <- var_class_df |> \n as_tibble() |> \n group_by(symbol,variant_class) |> \n summarise(n = n()) |> \n arrange(desc(n)) |> \n ungroup() |> \n slice_max(order_by = n, n = 20) |> \n ungroup() |> \n mutate(symbol = fct_reorder(symbol,n)) |> \n ggplot(aes(n, symbol,fill=variant_class)) +\n geom_col() +\n facet_wrap(~variant_class) +\n paletteer::scale_fill_paletteer_d(\"awtools::a_palette\") +\n labs(x = \"Samples with specific mutation\", y=\"Gene Symbol\", fill=\"Variant Class\",\n title=\"Top 20 Samples with READ\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n`summarise()` has grouped output by 'symbol'. You can override using the\n`.groups` argument.\n```\n\n\n:::\n\n```{.r .cell-code}\n#| label: fig-genesmut\n#| message: false\n#| warning: false\n\npath <- \"images/\"\nggsave(paste0(path,\"samples_with_mut.jpeg\"), device = \"jpeg\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nSaving 7 x 5 in image\n```\n\n\n:::\n:::\n\n\n\n## Linking mutations and tumor stage\n\nNow that we are familiarized with the mutation data, we can link mutations to the tumor stage from the patients with rectum adenocarcinoma.\n\nWe can filter the clinical data with the patients that have mutation data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindex <- which( (var_class_df$patientID |> unique()) %in% clin$patientID)\n\nclin_and_mutation = clin[index,]\n\nhead(clin_and_mutation) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 2,261\n patientID years_to_birth vital_status days_to_death days_to_last_followup\n \n1 TCGA-AF-2687 57 0 NA 1427\n2 TCGA-AF-2689 41 1 1201 NA\n3 TCGA-AF-2690 76 1 524 NA\n4 TCGA-AF-2691 48 0 NA 1309\n5 TCGA-AF-2692 54 0 NA 412\n6 TCGA-AF-2693 75 0 NA 1155\n# ℹ 2,256 more variables: tumor_tissue_site , pathologic_stage ,\n# pathology_T_stage , pathology_N_stage , pathology_M_stage ,\n# gender , date_of_initial_pathologic_diagnosis ,\n# days_to_last_known_alive , radiation_therapy ,\n# histological_type , tumor_stage , residual_tumor ,\n# number_of_lymph_nodes , ethnicity , admin.bcr ,\n# admin.day_of_dcc_upload , admin.disease_code , …\n```\n\n\n:::\n:::\n\n\nWe can count the number of genes with mutations per patient and then make a `left_join` to the clinical data to plot the mutated genes per tumor stage.\n\n\n::: {#cell-fig-plot1 .cell}\n\n```{.r .cell-code}\n#| label: fig-plot1\n\ndf = var_class_df |> \n group_by(patientID) |> \n summarise(n = n())\n\np <- clin_and_mutation |> \n left_join(df, by = join_by(patientID)) |> \n ggplot(aes(t_stage,log(n), fill=t_stage))+\n geom_boxplot() +\n geom_jitter(width = 0.05, colour=\"red2\") +\n paletteer::scale_fill_paletteer_d(\"awtools::a_palette\")+\n labs(x = \"Tumor stage\", y=\"Number of mutated genes in log scale\")\n\npath <- \"images/\"\nggsave(paste0(path,\"number_mut_genes_per_tumor_stage.jpeg\"), device = \"jpeg\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nSaving 7 x 5 in image\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Removed 20 rows containing non-finite outside the scale range\n(`stat_boxplot()`).\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Removed 20 rows containing missing values or values outside the scale range\n(`geom_point()`).\n```\n\n\n:::\n:::\n\n\n\n\nAlso, we can focus on a specific gene, e.g. [TP53](https://www.ncbi.nlm.nih.gov/gene/7157).\n\n\n::: {#cell-fig-plot_tp53 .cell}\n\n```{.r .cell-code}\n#| label: fig-plot_tp53\n\nmut_on_tp53 <- var_class_df |> \n dplyr::filter(symbol == \"TP53\") \n\n\np <-clin_and_mutation |> \n dplyr::filter(patientID %in% mut_on_tp53$patientID) |> \n dplyr::select(t_stage, patientID) |> \n group_by(t_stage) |> \n summarise(n = n()) |> \n # mutate(t_stage = fct_reorder(t_stage,n)) |> \n ggplot(aes(n, t_stage, fill=t_stage, label=n)) +\n geom_col() +\n geom_label(fill=\"white\") +\n paletteer::scale_fill_paletteer_d(\"awtools::a_palette\")+\n labs(y = \"Tumor Stage\", x=\"N° of Mutations\", fill=\"Tumor Stage\", title=\"Number of Mutations on Gene TP53\")\n\npath <- \"images/\"\nggsave(paste0(path,\"number_mut_tp53_per_tumor_stage.jpeg\"), device = \"jpeg\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nSaving 7 x 5 in image\n```\n\n\n:::\n:::\n\n\n\n\n# Linking expression and tumor stage\n\nWe also can link the expression data (RNA-Seq) with the tumor stage data from the clinical metadata. [Check the TCGA RNA-Seq protocol online]{.aside}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrnaseq = readData[[2]]\n\nrnaseq\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nclass: SummarizedExperiment \ndim: 18115 177 \nmetadata(3): filename build platform\nassays(1): ''\nrownames(18115): A1BG A1CF ... ZZZ3 psiTPTE22\nrowData names(0):\ncolnames(177): TCGA-AF-2687-01 TCGA-AF-2689-11 ... TCGA-AG-A032-01\n TCGA-AG-A036-01\ncolData names(0):\n```\n\n\n:::\n:::\n\nAs before, we need to shorten the sample IDs so they can match to the clinical data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(rnaseq) = colnames(rnaseq) |> \n stringr::str_sub(1,12)\n\nclin <- clin |> as.data.frame()\nrownames(clin) <- clin$patientID\n\nindex <- which(colnames(rnaseq) %in% rownames(clin))\n\ncolData(rnaseq) = as(clin[colnames(rnaseq),],\"DataFrame\")\n\n\nidx <- is.na(colData(rnaseq)$t_stage)\n\nrnaseq <- rnaseq[,!idx]\n```\n:::\n\n\nThen, we can use the `limma` package to proceed with the differential expression analysis. In this case, we'll ise the tumor stage as a variable to explain the expression of the genes. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(limma)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'limma' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'limma'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:BiocGenerics':\n\n plotMA\n```\n\n\n:::\n\n```{.r .cell-code}\nmm = model.matrix(~t_stage, data=colData(rnaseq))\nf1 = lmFit(assay(rnaseq), mm)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Partial NA coefficients for 5 probe(s)\n```\n\n\n:::\n\n```{.r .cell-code}\nef1 = eBayes(f1)\ntopTable(ef1) |> \n arrange(desc(AveExpr))\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nRemoving intercept from test coefficients\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n t_staget2 t_staget3 t_staget4 AveExpr F P.Value\nPAM -0.5791836 0.08813521 0.4984790 10.94216940 8.309141 3.409613e-05\nPAIP2 -0.3240061 0.18551765 0.0958645 10.29398797 9.056896 1.331550e-05\nSPP1 -0.3352086 1.56339367 2.6409082 9.61423581 9.316731 9.621982e-06\nNIPSNAP3A -0.9988818 -0.35051776 -0.6085837 9.38910637 8.712515 2.051165e-05\nASPN -0.7366657 0.75340984 1.9537316 7.45952050 8.323019 3.350393e-05\nOLR1 -0.2557879 1.39542897 2.1822156 4.97144475 8.531715 2.597133e-05\nSEMA5B 1.3576218 1.27252058 1.9271042 4.60337761 8.091237 4.490926e-05\nCPSF4L 1.0629502 -0.17984587 -1.4274556 0.70036223 8.955085 2.311056e-05\nGRM5 1.3027501 -0.14596612 -0.7395609 0.06134539 10.268851 7.301656e-06\nC1orf223 0.7830779 -0.40735736 -0.7062534 -0.35136703 9.681632 1.743477e-05\n adj.P.Val\nPAM 0.06860899\nPAIP2 0.06719153\nSPP1 0.06719153\nNIPSNAP3A 0.06719153\nASPN 0.06860899\nOLR1 0.06719153\nSEMA5B 0.07597995\nCPSF4L 0.06719153\nGRM5 0.06719153\nC1orf223 0.06719153\n```\n\n\n:::\n:::\n\n\nLet's visualize two of the most expressed genes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npar(mfrow=c(1,2))\nboxplot(split(assay(rnaseq)[\"PAM\",], rnaseq$t_stage), main=\"PAM\") # higher expression in lower t_stage\nboxplot(split(assay(rnaseq)[\"PAIP2\",], rnaseq$t_stage), main=\"PAIP2\")\n```\n\n::: {.cell-output-display}\n{width=1260}\n:::\n:::\n\n\n# Linking methylation and expression\n\nFinally, we can use the methylation data with the expression data. This is important because methylated cytosines of the DNA change the expression of the genes.\n\nSome patients have methylation data for multiples tissue types. This information is encoded in the fourth component of the sample names. The code `01A` correspond to primary tumor samples and the code `11A` correspond to normal tissue. We'll keep the primary tumor samples.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmethyl <- readData[[3]]\n\nidx <- colnames(methyl) |> \n str_split(pattern = \"-\") |> \n map(4) |> \n enframe() |> \n unnest(value)\n\nidx <- which(idx$value == \"01A\")\n\nmethyl = methyl[,idx]\n\nmethyl\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nclass: SummarizedExperiment \ndim: 485577 95 \nmetadata(0):\nassays(1): counts\nrownames(485577): cg00000029 cg00000108 ... rs966367 rs9839873\nrowData names(3): Gene_Symbol Chromosome Genomic_Coordinate\ncolnames(95): TCGA-AF-2687-01A-02D-1734-05 TCGA-AF-2690-01A-02D-1734-05\n ... TCGA-G5-6572-01A-11D-1828-05 TCGA-G5-6641-01A-11D-1828-05\ncolData names(0):\n```\n\n\n:::\n:::\n\n\nAs before, let's truncate the names of the sample to match the clinical data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(methyl) <- colnames(methyl) |> \n str_sub(start = 1,end = 12)\n```\n:::\n\n\nWe can add the clinical data to the methyl object and count the number of patients with methylation and transcription data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolData(methyl) <- as(clin[colnames(methyl),],\"DataFrame\")\n\n\nintersect(colnames(methyl), colnames(rnaseq)) |> length()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 93\n```\n\n\n:::\n:::\n\n\nLet's subset common sample names and check the methylation data as row ranges.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmethyl_subset = methyl[,which(colnames(methyl) %in% colnames(rnaseq))]\nrnaseq_subset = rnaseq[,which(colnames(rnaseq) %in% colnames(methyl))]\n\nmethyl_genes = rowData(methyl_subset)\nmethyl_genes\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nDataFrame with 485577 rows and 3 columns\n Gene_Symbol Chromosome Genomic_Coordinate\n \ncg00000029 RBL2 16 53468112\ncg00000108 C3orf35 3 37459206\ncg00000109 FNDC3B 3 171916037\ncg00000165 NA 1 91194674\ncg00000236 VDAC3 8 42263294\n... ... ... ...\nrs9363764 NA NA 0\nrs939290 NA NA 0\nrs951295 NA NA 0\nrs966367 NA NA 0\nrs9839873 NA NA 0\n```\n\n\n:::\n:::\n\n\nThis function takes a gene symbol and returns a scatter plot showing the relationship between 3 different sites near that gene and gene expression.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#| label: figplot\n\nme_rna_cor = function(sym, mpick = 3){\n require(GGally)\n # subset methylation data to first mpick methylation sites for given gene symbol\n methyl_ind = which(methyl_genes$Gene_Symbol == sym)\n if (length(methyl_ind) > mpick){ \n methyl_ind = methyl_ind[1:mpick]\n }\n methyl_dat = assay(methyl_subset)[methyl_ind,] # subset to selected methylation sites\n\n # subset expression data to selected gene symbol\n expr_ind = which(rownames(rnaseq_subset) == sym) \n expr_dat = assay(rnaseq_subset)[expr_ind,]\n\n # combine methylation and expression data as data frame\n combined_dat = as(t(methyl_dat), \"DataFrame\")\n combined_dat$expr = expr_dat\n\n # plot pairs and calculate correlation coefficients between methylation and expression\n ggpairs(combined_dat) |> print()\n sapply(1:mpick, function(i){\n \n cor_to <- as.numeric(combined_dat[,mpick+1])\n \n df <- data.frame(v1= as.numeric(combined_dat[,i]),\n v2 = cor_to)\n \n df <- df |> na.omit()\n \n cor(df[,1], df[,2])\n })\n}\n\nme_rna_cor(\"TAC1\", mpick=3)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: GGally\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'GGally' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nRegistered S3 method overwritten by 'GGally':\n method from \n +.gg ggplot2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :\nRemoved 5 rows containing missing values\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :\nRemoved 5 rows containing missing values\nWarning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :\nRemoved 5 rows containing missing values\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Removed 5 rows containing missing values or values outside the scale range\n(`geom_point()`).\nRemoved 5 rows containing missing values or values outside the scale range\n(`geom_point()`).\nRemoved 5 rows containing missing values or values outside the scale range\n(`geom_point()`).\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Removed 5 rows containing non-finite outside the scale range\n(`stat_density()`).\n```\n\n\n:::\n\n::: {.cell-output-display}\n{width=1260}\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.0410095 0.1457715 0.0856926\n```\n\n\n:::\n:::\n",
"supporting": [
"TCGA_files"
],
diff --git a/notebooks/TCGA.qmd b/notebooks/TCGA.qmd
index cf7c00d..687ba9a 100644
--- a/notebooks/TCGA.qmd
+++ b/notebooks/TCGA.qmd
@@ -319,12 +319,15 @@ ggsave(paste0(path,"number_mut_tp53_per_tumor_stage.jpeg"), device = "jpeg")
# Linking expression and tumor stage
+We also can link the expression data (RNA-Seq) with the tumor stage data from the clinical metadata. [Check the TCGA RNA-Seq protocol online]{.aside}
+
```{r}
rnaseq = readData[[2]]
rnaseq
```
+As before, we need to shorten the sample IDs so they can match to the clinical data.
```{r}
@@ -345,6 +348,8 @@ rnaseq <- rnaseq[,!idx]
```
+Then, we can use the `limma` package to proceed with the differential expression analysis. In this case, we'll ise the tumor stage as a variable to explain the expression of the genes.
+
```{r}
library(limma)
@@ -356,6 +361,8 @@ topTable(ef1) |>
```
+Let's visualize two of the most expressed genes.
+
```{r}
par(mfrow=c(1,2))
boxplot(split(assay(rnaseq)["PAM",], rnaseq$t_stage), main="PAM") # higher expression in lower t_stage
@@ -364,6 +371,10 @@ boxplot(split(assay(rnaseq)["PAIP2",], rnaseq$t_stage), main="PAIP2")
# Linking methylation and expression
+Finally, we can use the methylation data with the expression data. This is important because methylated cytosines of the DNA change the expression of the genes.
+
+Some patients have methylation data for multiples tissue types. This information is encoded in the fourth component of the sample names. The code `01A` correspond to primary tumor samples and the code `11A` correspond to normal tissue. We'll keep the primary tumor samples.
+
```{r}
methyl <- readData[[3]]
@@ -381,11 +392,15 @@ methyl = methyl[,idx]
methyl
```
+As before, let's truncate the names of the sample to match the clinical data.
+
```{r}
colnames(methyl) <- colnames(methyl) |>
str_sub(start = 1,end = 12)
```
+We can add the clinical data to the methyl object and count the number of patients with methylation and transcription data.
+
```{r}
colData(methyl) <- as(clin[colnames(methyl),],"DataFrame")
@@ -394,6 +409,8 @@ intersect(colnames(methyl), colnames(rnaseq)) |> length()
```
+Let's subset common sample names and check the methylation data as row ranges.
+
```{r}
methyl_subset = methyl[,which(colnames(methyl) %in% colnames(rnaseq))]
@@ -404,6 +421,8 @@ methyl_genes
```
+This function takes a gene symbol and returns a scatter plot showing the relationship between 3 different sites near that gene and gene expression.
+
```{r}
#| label: figplot
@@ -444,4 +463,3 @@ me_rna_cor("TAC1", mpick=3)
```
-## Conclusion
diff --git a/notebooks/images/number_mut_genes_per_tumor_stage.jpeg b/notebooks/images/number_mut_genes_per_tumor_stage.jpeg
index ea511f979f3412eacf30861ab0ed5a862d806d12..39ec99b6ee3789ee63ad56e0273236a7f2659c4a 100644
GIT binary patch
literal 278777
zcmeEP2|SeR_kTt9rDWgBR@RX0CS(thC4@>6Lbi-`M2KV!C0p6DtL(e%3E4x|u?sWS
zvCaIay7%|1dvEvNZf^Jg`~9Y;&v=*TeP`bDp7TBDIp;j*LA^yS0;DomY%iMuSAY-y
zP=kORfRBrdhr1si4-b!k0H2VUf`pieh?tiA5E%s{9WxUn9RmXk8y_dj5gt|shNB|K
zc#aF4IB|lRQ%p)!P?AsRgdnOKIEcUh9sUVyt%HE#K|uB(fT{zS0RU@1*2jMzzY6_<
zwGSI-KQ10V0U;5(L%{)H9~L(DJ{;`*`*Cp4XFbs80i1*T$qx(2;2zSvgm=V_Lhx4D
zV|-TGXEl^s9WXYb%l6&`gjCctv~=tooJWswoe(}LA}S^>cUoRSQAt_loVJdx-g$ik
z(<^3I&97NlIygGrbOyV)`rP*Q^S=`i7#?vyGAjDP!q9&GBoOl-Ljep0FckP7r9iTt?sgVgXPA-UVjf6E-pRJKdiR)x#kC&`jMtoa^N(|O
zop>{6cS2(!Z+0kXS&k6=YH&ai7}G1<`J5Wv2Y^E&udS$Na;S=Su6c8?Ob}d
zraZ+D{4X4*w{)EOoSM1L_G`TEF8v?)SLIEMLOdZSYANX3(EI_HnjM#>xXN01F#=%`RZGWWn$eiVnVW#7c{T<*
z44eL^v8iUD!W{;z_G~c!J)}fSl4UfViJx`$G}!P;Q8e&;kx}+2tLPdhY{{4i3JG4G
z8Axe}9(!cvGECLaJ6X+B=6^O2zfDScN1{8^zIvzn;KF$%!|s!Ur~Dm~a;DZ!S6cnU
z`-$xn*RVxx)0hn!Vbcqg*)pC1c
z^Dbqj>SyP$lFJ`iHDhJz>$pkXI}qnvv39>jxS;N`QnGPS@)bIRXSzR5?0?~iNCv^z
zQNTI0AHeX(`tW|?iAd+%@yzYkDJ{~S`LVwv-DM+U_A>*;a6xxlr?UGC`{lRc;CJl5
z|GZP$4o!K3Evz@5`i8tAZ0}2nAtRl$A2e@QRW^{SaK>RXg=g)$>LnAqXIgIvu6>Y_&lkIp^b@%)O-1dphl|j%o&nFqm
zXvbvc8l_EkPw5XM#m6toF!q?((otD{~uVs$F|@>z-t%)0akL27?PiKM6}#i5
zNF0Y5cH`aUTs-gPw%cC1k|v0sOs)d$uv-pEmvp6jj*h6m(d!hynP8PXTR7km#7&Z&
zfB9&V$=KQB+_Q`l8U>w>gVxdwlzzPf0u-9Lu%|Q>1wYac{tK~FfnFtmQWUps>JpHkH@5chpZ<&)BF}XXpmLT)P
z8oM|>YI?|18&zvdoJeoW)Jzn>;x@9i5rG0^geM?a(i?ZAqfo%MR(Z%M=P(?va|8ug
z>V@s#Y!9<+8KQs%m37w-?Z-Y-h8!BcZP$dq4DcAD8S1y5_V0VU8R3w|xn21$T$2BQ
zJ@@aJPDXTsi*%%I!3R_7`vCesP$B6En`~kaqkz8hge`)vvA9(ea-YZ20S{gju%6<=
zT&+D?88!r4%
z8|eQ7uzz9Tf1Sg>z~rwQBQN^U?MK^@n9P#T_~`rd=S%zJ)3EWMa6f$Xdwe-+{0H(8
zKV<)X8YzA>MLxJd(5Zl5^8SwGm8SIV~~zRT_~VkcxPq^w6kK;b`P_Mp~Uxc!CdOoF31(*Nopp!4HR&tM@=&o
z*;Mh0Ptt!0*_;uRq_!L8cK?D4*~o5$u_}zA*cx)#r@qo2GN0nYUu{2Dup3B5wTX6m
z=R{6aVYT@6tsj*%oaRK9>OEgfK)U2k8@H4~Hm9oM-1vUtVVjSK6)<~_7)oHc0Qo=3
z1v*Lm>wbr2Z0v=8YfINz0|iL-tz=51p@4X+&7L@nqQfY^Sdd3jSA6FjEYY@2zdq
z1QphA*-a)-x~WK;(Ppy15=sADv{nC;VVw2J^!OP^l76zHMLxj_X7}Tm0z-`tItIfG
z7-oQof0Hca+FytqPISH4;}$-VvxeP;oB1x9-T5ig17Yb;gBqt_{Bof7qn$FG32K$nZD25I=c1
zf6fHOM23H?Oa3d#HJHfo-n@n{QqnLr2mT^B>{$>56B*v4H}FMP+Rxbi59AYJviLvX
z)Hh`DV)lZHzqQ?N1*Yok!`Lo)8Kz^s>;_gh(FI(tBC*5_Dq-jsi(or
zHT;ryFmnw*!jI0^CwqL!<)0_)e4a!6dA6FH)Q9$29^mgG+#Yqq%xC(XkeK;QpQQY2
zlRYr=nf{cr_!m@unVI&fU>#;Y(+_~P-~5WN`t&}ZvWJ<^^c7Fi&p7$B^W%$=BW6C+
zk2OdCg)mPrUh8jz%w6ZH$*<7e!~c&-1ya@!kB`L#~Kc&+z{@EEW49^u2DpgzWH{mr=iHM(Oe
zMs_33?C>*pwQS>=n+GqPpV^>G4mm1EO#3S6P?@idXrubt-EnKn
z4Y|C)@kYhs9yS+Qnu~4P7vtV>+ck(g=a?X=sP8~#WO_hiAViNn3V`L6qsz9*lBR!{
z{o0TjjRJgq$CDzAS^Kq4{*EK|!;)}T1Vm=gBI4<=NQ7mEpp4Xc
zvg^nzdc4qfa*+QjW8p*bdr+5?8A)u1dc4-#r{uqSiB19@*jiFJvGIDf@I+O4X@nT>
z;e`a5H=Z7I?LOrv>7Nw1mL4AX0nToV{PPU6D9#FyNr
zs^n(=#*sO}b%r}J$2;(7^2bHddc!_0k!GOaWKv%3VNrPd)WydHDOe?7(KQqy_N
zl`1y-c7Lk0pmn}2%xxui>r_P0zJu$e+qZF?
z!Eg-)6n-eQ5T{eH`x=jG7qXDz#{ZGea?L(6m(zO;3!=plBToPI;`HAErajjl%C`ux
z5XW2QpLsI*AQV?a*nmUR{0r(fi(<4Hf
z7HxOz%g3UT-do6pvWV8;YwJDO7QvH#m%ft
zkqplW?7|hKxIoW2H)4{uk3d{9K5mA5*vEYQzotQkWmMpaN43RpY!ooV@q2sa38QC@
z==LouRTrV_pjdM^BXM4`TSM4_h&&V4JAZlhEwU+L*mC_NA5?_A57?bb!hB$8@ZZY^
zPsv%jo|E4YI?E=Lc7<*9YNC>ldXsW>tH%0i{(B%l#(2*vk%vBz)>dxMYt|}d&8P1M;*C=@?q++DA
zxE<%3cyM$E!Vd*#6)hVh=iNV=%KZz9R+bb9!TD
zV|+V4a5uY~!{rR5k4C&S#_8U+EenIQ#pF$T#uVff=}
z5uOOjS6*bV0|k{CoH9m;B#od_H~Hxdt@N0!dQKE5EsP$ld&IE|Z;p_J_WIANk8b6P>w4ogaY;ndg?$;07~|JxYXnu=7Ht8(?;e?SV2K3pgJfa
zQew1kCnSNq>ODu**2(cWm=#j7J-}rBkjXB+ojO7o1+Y%n5mL7%v78|31gpk%cX)V`be6L
zF3}8w%c{AGj*pOUlsdTgn>I@*)87&U9eYQ>a>J^=TywOY8X?gj63+)C%ziCWSFsp7
za`ODLdZ)suy2dN#eM$^;a9awnP(6|qesB72`MyjC5;(tGnn-JuWJG>g@u7(2xax2`
z++u!WgwHt0&&k?zv9_kq)-79|^MPt=+iA6u!ASi5)ySZF}FY;fj><47WF|{f;y%!d$m+@x3JD&5hFTahKbr*C$R7##L%I
z(uS^aCgz?-*uu8!G{#FLZJ`0Gb16!Nw;oW*Dh#z2ONU|}R-388iKeciVpD86x%E(vHd0r4T(LhRc
zTkCE6_3>{%QEq~yGj2ipJ_^|(GC^|u{%#1u!i|56$_51Eg&<-;V=3>eX7r2ETE74V
zj9)_me$mfSz~ie>w29D%0s@l0xb^Q|Y>5ommerp(K*pu;`QG;Qv+5ABaB`i)b3F!nmeUjHL&XOFP)U)tu=21h+|W+(k5|AykE@-s*&
zh^xBK-Fz&B_;ubqT}4XrYZ3ehi|(!qw4xjJ7p*FVV2yrP&pX{S?(DoufBTHHwA9U@
zShw*!;SpMdX{`{Cctfs*g%HV04v7S&G6i)h`1vYW5t145rq&h(c(S9V+KL)Ug#v1q
zwrH$Tz*1e+j@o4u@Nxv9EpXW6
zry$oq)u18SbrDfDYE_}ouJNo#t!HSc?j~H3GRtS`CLsY&qA6#*MV`9}On5lSJnijV
z>fGYuxCA&Uuvcb9Vn0sQS%HXl;bA=&qsd9|eX?hlq#7b!CTm0zj)+8;^gRhBdxyscp_GU9jkWu(WxLf6kQSwtZ_w#X
zN8=2;-x)TZHV!-my1@{(qeES5;xA3zTd+iZO?)_h<(-9I{Dng<%d9SLajwUJ-D-P9
zxNh*R(OuTDkp@GxBkQlYw;8oS5-MdSr#cPK+>SL*wHLqD{g`Gd-;oS&mYEbOIAB3t
zsA=+cE^Ko^XXq7f_`nr5sFYAxH)#rfqi|9p7^H-7hDUWNRXh(wHfo2ob>=LdJrUJ)
zn3HxkHaOAgsDfCo+lPkH@xQQPGSCyxnQhmEKP~+Jb8Ym0pk87)q;YOn{tJ}A)Ytue
z=rzU~15wDq?l#0s`wGwNL%VX8q&v*)D9rM*Uxtf^Y9HNn@J_zC4kIH#JYl|Ve4DDOvG~yucdu2X>CYQ%lCwTzkQ*^JVS%`WNJB
zuXEo>4o;>dl)f;UP9v2yY!!t9NRXhpqJaj)iF;Rj^sj33P>gwA2AXSGgT*#)Ws1a1
zR@Fj6wzAisRg$pk7wGCo673f?dEg~@qX`=9dC{_Us>BvH0!3zeWZC*C^;+vhnaS3u
zQ%|uF#|nx<6CTj{n80=X>p(%3Bn7aE5Ifscjp=6lgZEF;8dc+9dp(0nwRqmJLUQFH
z{Nd3xg_R6pi@dANJDYs6f_}B!#emlhDoOJa+cpwNy)N?sq#)vCUv0QledR+?0(+Cf
z*|%oSzyZ>IjX@i{o7)R+==S|q!Y}TCF&ciKbN}qwFW}(ka>YexT=)gO|K?pCe(7Zo
zu9WFvjTG(j?dH`?kBkw^qJdvP>HnAA2}$!D1jozqm+kc=X1p!0vaN{oHlK4J@~Q>y
z(=xl=s<&qn>T^GI$?lT*xZm@r$w}P;zw_?>lmXS3yi|xoDyqjz$Y|#)3G#~bUclw_
zRqdBE4QhF$uLstR9g1d^mt!cGs*Z?v?05O{42;gqax(alYMS
z^g$xeTUGz7H{(G4*3yJy*ST}TemlXVt!8vEE
zY;Ni4UfTee(kFYy`Izc*f59K1cWZ|a))oh+HtE~38b4b3ZVJ(8YkiFU5OA>A8B66U
zUND)lz$k>MTqw<5z7JjYfesPKUnEv3FgU29e7YLPhOt^J0}JZY7X)U0xkNlWa?oiP
z(olKi?n0Yw(9QMEaLwab^!_&}GQ7Z}Yfz?nH&csr{bmafE0UgkA2Q+&e}!pDpPC86
z4Z}ftHI+71)%o)_*~v8Et3l!-ViQbcEPAEGZwD-O?F~!}YMrcEYHLqi>L_Vvfw}A}
zSJds`m8NsIg0&8Nrw-Q|Iao=>glm((N@Kbkt$_XP++ZM_(eQLW!UCRKE8EA_
z=Q@38doB2wGG%)t_?UqFw;M1fQqt)eJEZWo^l6Ld!(bELgtmaB(Zg(^6szwNL$#qS
z*!AwKTdA2aiAM3XxYQPxF%JWXI7f)r?eJ6Qy6nAW&INgqQLc2pG5mG0er^e8fBL
zyJoR76C=HkTp3G4?G?d(bwNN<@+)H}R>X?`=764q9$+3SzF?e30`P1zfAbyCaev
zmDdKDObDX^1*`}d$ss#D$qv{W@?MHX0WF{-$03KNEq1E2=n1-JxZXkWDsa!{5xUTeRyV;KXU`4G&ei~>z9&5Z+1YE^k
z2zKJ^uoWHCJ)H1o%d8>l_mb_!BeB4Z3@|Z<6+_R1@IHS@GaD4}*7?>jAlc$9N8cdf
zo>o-clwF07n6Hf&EP}k3s9pk?%*r0$gECg08+>;~cjUIa(^xyLVbwb0sIwQD;Lw`}
zc<{)vg+)Q`)SQ@7%U`k8_ewh^b^PBz_g%Ea6!QOe)17j3gm6d78rn{6B`w`E8RK%P
z`KCyj1?%c4oq7X@=EX1~6l*OdKR{j0UvX2SJ?#J58rk}!7
zCv`9;f#ufrqH7-Bd;wiB)=!IKRHf1kAiT}(r)Aenm;x}0inX5Wwq`-x+g7YhxOvwh
z#X`w6ToR^xcm+Kb+u9IsAbsv+sJp^$(uhWxd`V}1vOqHvnESxyEp&~zY@Y?~HF_m!
zR?m*5w4o>OW(~AxCPE}B@3>c^fI;1w^=uINwHSz(v@jTS_Dyzg8;OCOl~lguVYZAz
z!Cszlb#XI?-uxZqCl*Km(KEU$CtGAFDr`ZLwf+WnJN&7-IM#N(=xk)WK^L4OAPs>H
z)$>^+Aso6s+LL_p#h~0va+SxTGk8lB&{RDst9O~`h8e;CT4sOt+-v(IrAOv0lR@xx
z6maez@SuNM8t-Rc-A|}cTBJMk>)-r$9{G=2)l{zyZWXgpLt<;xrS7${CBIu!ocHj2
z#y)%C9mNFn8h6bmc+}g|`R;|jN_;EV#TZAts!skzlKR`P>dzhu#T`xKq?a)!YiYze
zv{G&`RRNPT>~ri<P9F@FBQ
zcsR3};>;s6N1UcZ
z%p**M?{lhr=Ecv5PH>Tqv@I}P(B1xa-Tmz&_HVHJ|9$hbr&zoNNpSYzv$DMTxOpVw
ziSJp;Q`NvlLU|o|!bGlHUO+WPU`8P2DIIJ8%j=kh#kIW#R*Z-HyG-LR_z$BW|B@X1
zmVgSpOFiyedB4&A!Aq*GbBwP{@eOCstRx`86+&&0&GFTCF%&R;w__T#19z32Mgj5s
zifFggO8v~oK1>cMpq)DDV^5cM6d-9Fg6aQXL?4|VcGG0zkYoOf%qT!cSWi1LILuWt
zmAM$Qi_>GIr7JE!v;4qw3Wn1>tl^0;NyqGtKPk);;Z<|(}B74VK-hMh-jxY7CAEZ
z;4o)ASK~(so%`Ys2J^QK3cO09|lEhp#P%5
z0Oo~%|KL`OH5%`z-u#z;)YHyA-D`If_mtZbwYzCPTyKZ<>^y;6hA%0se_*hA(q|Vs
zIR<8lU2V#JxKwTg5ALcrf!&kKVqeUlkf=4cvgM2-+X#vF9rbqTI#Ov~33qOg(4d68
z6ei7!E`54-i6kz&JpMdrSlXD{o1pzn;0PFWjjeFSvyVsw9C>|%Y{!{7+qRrJpWM);
z%F6dqxk{m%KT;y3w`iDbfPlArq~K-iFuS2tEec?4oWHOh1_j-t%PDh8Hh8xgb1!j7
zDz1(yub8R$Vsfw|Yqx9Qbbbi!&b1z=!R#Zw)33X&acAEzvK07Muhs5nvlnAZ>eiD=
z2Dd{MGING3sEVYoCS4QJFt^*E#fO(ebBr~tX^rVgSuXz0Ir#pnyO&^TA^G>amTYbm
z_TiUdKO2*o@FNBjr_n+g8=c0O13g)a;Mb)o;OpTr9t&+$17p+|Z%Q7O46O;lx(}5!
zEnX~Vg=^GNA$k2@#!Gje)2JwUIe6j7d(uWh5y`*`X{&H30~pyj>Ft#<&Ep_O9Mn)wrkXtc0@K4BEjv|%126VDTjtlw{IGq
zepJ3Yn#SfyztjGz+jqV1{EZ#}XF!4C0H8vn&}iVOIeR1sq>EjK{