diff --git a/_freeze/notebooks/ENCODE_project/execute-results/html.json b/_freeze/notebooks/ENCODE_project/execute-results/html.json new file mode 100644 index 0000000..0f0539d --- /dev/null +++ b/_freeze/notebooks/ENCODE_project/execute-results/html.json @@ -0,0 +1,15 @@ +{ + "hash": "ceedd343dc07b96c4af9c4a939f8d621", + "result": { + "engine": "knitr", + "markdown": "---\ntitle: \"Analyzing Cloud Data from ENCODE\"\nsubtitle: \"The Encyclopedia of DNA Element (ENCODE)\"\nformat: html\neditor: visual\n---\n\n\n# Introduction\n\nThe Encyclopedia of DNA Elements ([ENCODE](https://www.encodeproject.org/)) is a collection of experimental results (from human, mouse, worm and fly) and computing resources devoted to elaborating the mechanisms of gene regulation. \n\nWe will use Bioconductor to acquire and interpret ENCODE resources. In particular, we'll analyze ChIP-Seq, ATAC-Seq and CRISPR interference RNA-Seq experiments.\n\n# ENCODE ChIP-seq Data\n\n\n\n\n::: {.cell}\n\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/notebooks/TCGA/execute-results/html.json b/_freeze/notebooks/TCGA/execute-results/html.json index 81d6419..032a276 100644 --- a/_freeze/notebooks/TCGA/execute-results/html.json +++ b/_freeze/notebooks/TCGA/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "dc08553dbecf0ac0d417e1815f428244", + "hash": "98d1461385d96f5400e95691212ffc5d", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Integrative Analysis with TCGA Data\"\nsubtitle: \"Analysis of Mutation Data from The Cancer Genome Atlas (TCGA)\"\nformat: html\neditor: visual\n---\n\n::: {.cell .hidden}\n\n```{.r .cell-code .hidden}\n#| include: false\nlibrary(knitr)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'knitr' was built under R version 4.3.3\n```\n\n\n:::\n\n```{.r .cell-code .hidden}\n#| include: false\nknitr::opts_chunk$set(\n warning = FALSE,\n message = FALSE,\n dpi = 180,\n echo = TRUE\n)\n\nlibrary(tidyverse)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'ggplot2' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'tidyr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'readr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'purrr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'dplyr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'stringr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'lubridate' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.0 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package () to force all conflicts to become errors\n```\n\n\n:::\n\n```{.r .cell-code .hidden}\n#| include: false\ntheme_set(theme_minimal())\n```\n:::\n\n\n## Introduction\n\nThe Cancer Genome Atlas (TCGA) is a massive cancer genomics project compiling high-throughput multi-omic data on dozens of cancer types for [public access](https://www.cancer.gov/ccg/research/genome-sequencing/tcga).\n\nWe are gonna use the `curatedTCGAData` package to manipulate locally to multiple high-throughput datasets from the project. The package provides access to TCGA data that has been curated and stored as a *MultiAssayExperiment* object on the Bioconductor [ExperimentHub](https://bioconductor.org/packages/release/bioc/html/ExperimentHub.html).\n\nFirst, let's load the packages needed.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(curatedTCGAData)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: MultiAssayExperiment\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'MultiAssayExperiment' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: SummarizedExperiment\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'SummarizedExperiment' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: MatrixGenerics\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'MatrixGenerics' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: matrixStats\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'matrixStats' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'matrixStats'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:dplyr':\n\n count\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'MatrixGenerics'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:matrixStats':\n\n colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,\n colCounts, colCummaxs, colCummins, colCumprods, colCumsums,\n colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,\n colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,\n colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,\n colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,\n colWeightedMeans, colWeightedMedians, colWeightedSds,\n colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,\n rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,\n rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,\n rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,\n rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,\n rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,\n rowWeightedMads, rowWeightedMeans, rowWeightedMedians,\n rowWeightedSds, rowWeightedVars\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: GenomicRanges\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: stats4\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: BiocGenerics\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'BiocGenerics' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'BiocGenerics'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:lubridate':\n\n intersect, setdiff, union\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:dplyr':\n\n combine, intersect, setdiff, union\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:stats':\n\n IQR, mad, sd, var, xtabs\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:base':\n\n anyDuplicated, aperm, append, as.data.frame, basename, cbind,\n colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,\n get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,\n match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,\n Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,\n table, tapply, union, unique, unsplit, which.max, which.min\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: S4Vectors\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'S4Vectors'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:lubridate':\n\n second, second<-\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:dplyr':\n\n first, rename\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:tidyr':\n\n expand\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:utils':\n\n findMatches\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:base':\n\n expand.grid, I, unname\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: IRanges\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'IRanges'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:lubridate':\n\n %within%\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:dplyr':\n\n collapse, desc, slice\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:purrr':\n\n reduce\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:grDevices':\n\n windows\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: GenomeInfoDb\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'GenomeInfoDb' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: Biobase\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWelcome to Bioconductor\n\n Vignettes contain introductory material; view with\n 'browseVignettes()'. To cite Bioconductor, see\n 'citation(\"Biobase\")', and for packages 'citation(\"pkgname\")'.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'Biobase'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:MatrixGenerics':\n\n rowMedians\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:matrixStats':\n\n anyMissing, rowMedians\n```\n\n\n:::\n\n```{.r .cell-code}\nlibrary(TCGAutils)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'TCGAutils' was built under R version 4.3.2\n```\n\n\n:::\n\n```{.r .cell-code}\nlibrary(MultiAssayExperiment)\n```\n:::\n\n\n## Download the Data\n\nTo download the data we need to use `curatedTCGAData`function. The first argument is a four letter disease (cancer) code (A complete list of disease codes used by the TCGA project are available on the [NCI Genomic Data Commons website](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations)), the second argument is a vector of data types we want to download. We need to specify `dry.run = FALSE` to download the data.\n\nIn this specific case, we are gonna work with RNA-Seq data, mutation data and methylation data from Rectum Adenocarcinoma (READ). The clinical data is included by default.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#| message: false\n#| warning: false\n#| label: read-data\n\nreadData = curatedTCGAData(\"READ\", \n c(\"RNASeq2GeneNorm\", \"Mutation\", \"Methylation_methyl450\"), \n dry.run = FALSE, version = \"2.1.1\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_Mutation-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nrequire(\"RaggedExperiment\")\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'RaggedExperiment' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_RNASeq2GeneNorm-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_Methylation_methyl450-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nrequire(\"rhdf5\")\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'rhdf5' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: HDF5Array\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'HDF5Array' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: DelayedArray\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'DelayedArray' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: Matrix\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'Matrix' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'Matrix'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:S4Vectors':\n\n expand\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:tidyr':\n\n expand, pack, unpack\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: S4Arrays\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'S4Arrays' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: abind\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'S4Arrays'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:abind':\n\n abind\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:base':\n\n rowsum\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: SparseArray\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'SparseArray' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'DelayedArray'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:purrr':\n\n simplify\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:base':\n\n apply, scale, sweep\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'HDF5Array'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:rhdf5':\n\n h5ls\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_colData-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_sampleMap-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_metadata-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nharmonizing input:\n removing 1903 sampleMap rows not in names(experiments)\n removing 2 colData rownames not in sampleMap 'primary'\n```\n\n\n:::\n\n```{.r .cell-code}\n#| message: false\n#| warning: false\n#| label: read-data\n\nreadData # this is a MultiAssayExperiment object\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nA MultiAssayExperiment object of 3 listed\n experiments with user-defined names and respective classes.\n Containing an ExperimentList class object of length 3:\n [1] READ_Mutation-20160128: RaggedExperiment with 22075 rows and 69 columns\n [2] READ_RNASeq2GeneNorm-20160128: SummarizedExperiment with 18115 rows and 177 columns\n [3] READ_Methylation_methyl450-20160128: SummarizedExperiment with 485577 rows and 106 columns\nFunctionality:\n experiments() - obtain the ExperimentList instance\n colData() - the primary/phenotype DataFrame\n sampleMap() - the sample coordination DataFrame\n `$`, `[`, `[[` - extract colData columns, subset, or experiment\n *Format() - convert into a long or wide DataFrame\n assays() - convert ExperimentList to a SimpleList of matrices\n exportClass() - save data to flat files\n```\n\n\n:::\n:::\n\n\n## Review the Clinical Metadata\n\nWe can see which patients have data for each assay. The assay column gives the experiment type, the primary column gives the unique patient ID and the colname gives the sample ID used as a identifier within a given experiment.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsampleMap(readData)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nDataFrame with 352 rows and 3 columns\n assay primary colname\n \n1 READ_Methylation_methyl450-20160128 TCGA-AF-2687 TCGA-AF-2687-01A-02D..\n2 READ_Methylation_methyl450-20160128 TCGA-AF-2690 TCGA-AF-2690-01A-02D..\n3 READ_Methylation_methyl450-20160128 TCGA-AF-2693 TCGA-AF-2693-01A-02D..\n4 READ_Methylation_methyl450-20160128 TCGA-AF-3911 TCGA-AF-3911-01A-01D..\n5 READ_Methylation_methyl450-20160128 TCGA-AF-4110 TCGA-AF-4110-01A-02D..\n... ... ... ...\n348 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A02G TCGA-AG-A02G-01\n349 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A02N TCGA-AG-A02N-01\n350 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A02X TCGA-AG-A02X-01\n351 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A032 TCGA-AG-A032-01\n352 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A036 TCGA-AG-A036-01\n```\n\n\n:::\n:::\n\n\nNot all patients have data for all assays, and some of them can have multiple data entries for one or more experiment type. This may correspond to multiple biopsies or matched tumor and normal samples from an individual patient.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsampleMap(readData) |> \n as_tibble() |> \n pull(primary) |> \n table() |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n 1 2 3 4 \n 5 147 7 8 \n```\n\n\n:::\n:::\n\n\nWe can see the metadata of the patients with `colData`. Note that there are more than 2000 columns of data per patient (not necessarily complete).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclin = colData(readData) |> \n as_tibble()\ndim(clin)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 167 2260\n```\n\n\n:::\n\n```{.r .cell-code}\nhead(colnames(clin), 10) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] \"patientID\" \"years_to_birth\" \"vital_status\" \n [4] \"days_to_death\" \"days_to_last_followup\" \"tumor_tissue_site\" \n [7] \"pathologic_stage\" \"pathology_T_stage\" \"pathology_N_stage\" \n[10] \"pathology_M_stage\" \n```\n\n\n:::\n:::\n\n\nAs an example, for rectum adenocarcinoma, we can see the tumor stage.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclin |> \n pull(pathology_T_stage) |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n t1 t2 t3 t4 t4a t4b \n 9 28 114 5 8 1 \n```\n\n\n:::\n:::\n\n\nStage T4 have subgroups. To simplify the analysis, let's combine all T4 tumors.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclin <- clin |> \n mutate(t_stage = case_when(\n pathology_T_stage %in% c(\"t4\",\"t4a\",\"t4b\") ~ \"t4\",\n .default = pathology_T_stage\n ))\n\nclin$t_stage |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n t1 t2 t3 t4 \n 9 28 114 14 \n```\n\n\n:::\n:::\n\n\nAlso, we can see the vital status (alive=0, deceased=1)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclin$vital_status |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n 0 1 \n139 28 \n```\n\n\n:::\n:::\n\n\nOr combine tumor status and vital status.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(clin$t_stage, clin$vital_status)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n \n 0 1\n t1 9 0\n t2 24 4\n t3 96 18\n t4 9 5\n```\n\n\n:::\n:::\n\n\n## Analyzing Mutation Data\n\nLet's begin analyzing the mutation data. Below is the code to retrieve the mutation data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_data = readData[[1]]\n\nmut_data\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nclass: RaggedExperiment \ndim: 22075 69 \nassays(34): Hugo_Symbol Entrez_Gene_Id ... COSMIC_Gene Drug_Target\nrownames: NULL\ncolnames(69): TCGA-AF-2689-01A-01W-0831-10 TCGA-AF-2691-01A-01W-0831-10\n ... TCGA-AG-A032-01 TCGA-AG-A036-01\ncolData names(0):\n```\n\n\n:::\n:::\n\n\nFrom the inspection of the sample IDs we can see that the mutation colnames match to the **primary** column from he clinical data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_sample_ids = colnames(mut_data)\nhead(mut_sample_ids)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"TCGA-AF-2689-01A-01W-0831-10\" \"TCGA-AF-2691-01A-01W-0831-10\"\n[3] \"TCGA-AF-2692-01A-01W-0831-10\" \"TCGA-AF-3400-01A-01W-0831-10\"\n[5] \"TCGA-AF-3913-01\" \"TCGA-AG-3574-01A-01W-0831-10\"\n```\n\n\n:::\n\n```{.r .cell-code}\nhead(clin$patientID)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"TCGA-AF-2687\" \"TCGA-AF-2689\" \"TCGA-AF-2690\" \"TCGA-AF-2691\" \"TCGA-AF-2692\"\n[6] \"TCGA-AF-2693\"\n```\n\n\n:::\n:::\n\n\nWe need to manipulate these by substracting 12 characters.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_sample_ids <- mut_sample_ids |> \n stringr::str_sub(1,12)\n\nall(mut_sample_ids %in% clin$patientID)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\nIs important to note that the data stored in `assay(mut_data)` is difficult to work with because is a sparse matrix that has a row for each `GRanges` with a mutation in at least one sample.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nassay(mut_data)[1:3,1:3]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n TCGA-AF-2689-01A-01W-0831-10 TCGA-AF-2691-01A-01W-0831-10\nX:54241715-54241716:+ \"WNK3\" NA \n7:111899511:+ \"IFRD1\" NA \n7:113309556:+ \"PPP1R3A\" NA \n TCGA-AF-2692-01A-01W-0831-10\nX:54241715-54241716:+ NA \n7:111899511:+ NA \n7:113309556:+ NA \n```\n\n\n:::\n\n```{.r .cell-code}\n# Is a sparse matrix\n\nassay(mut_data)[1,] |> \n table(useNA=\"ifany\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nWNK3 \n 1 68 \n```\n\n\n:::\n:::\n\n\nWe can get more information if we look at the mutation information for each patient.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_assay = mut_data@assays\n\nmut_assay # GRangesList\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nGRangesList object of length 69:\n$`TCGA-AF-2689-01A-01W-0831-10`\nGRanges object with 64 ranges and 34 metadata columns:\n seqnames ranges strand | Hugo_Symbol Entrez_Gene_Id\n | \n [1] X 54241715-54241716 + | WNK3 65267\n [2] 7 111899511 + | IFRD1 3475\n [3] 7 113309556 + | PPP1R3A 5506\n [4] 7 128146325 + | FAM71F1 84691\n [5] 7 156447624 + | NOM1 64434\n ... ... ... ... . ... ...\n [60] 10 50616059 + | OGDHL 55753\n [61] 10 96343335 + | HELLS 3070\n [62] 5 139173166 + | PSD2 84249\n [63] 5 147800932 + | FBXO38 81545\n [64] 5 54365356 + | GZMK 3003\n Center NCBI_Build Variant_Classification Variant_Type\n \n [1] hgsc.bcm.edu 36 Frame_Shift_Ins INS\n [2] hgsc.bcm.edu 36 Missense_Mutation SNP\n [3] hgsc.bcm.edu 36 Missense_Mutation SNP\n [4] hgsc.bcm.edu 36 Silent SNP\n [5] hgsc.bcm.edu 36 Missense_Mutation SNP\n ... ... ... ... ...\n [60] hgsc.bcm.edu 36 Silent SNP\n [61] hgsc.bcm.edu 36 Silent SNP\n [62] hgsc.bcm.edu 36 Nonsense_Mutation SNP\n [63] hgsc.bcm.edu 36 Silent SNP\n [64] hgsc.bcm.edu 36 Missense_Mutation SNP\n Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS\n \n [1] - - T novel\n [2] A A G novel\n [3] T T C novel\n [4] G G A novel\n [5] A A G novel\n ... ... ... ... ...\n [60] G G A novel\n [61] T T C novel\n [62] C C T novel\n [63] G G A novel\n [64] T T A novel\n dbSNP_Val_Status Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1\n \n [1] unknown TCGA-AF-2689-10A-01W.. -\n [2] unknown TCGA-AF-2689-10A-01W.. A\n [3] unknown TCGA-AF-2689-10A-01W.. T\n [4] unknown TCGA-AF-2689-10A-01W.. G\n [5] unknown TCGA-AF-2689-10A-01W.. A\n ... ... ... ...\n [60] unknown TCGA-AF-2689-10A-01W.. G\n [61] unknown TCGA-AF-2689-10A-01W.. T\n [62] unknown TCGA-AF-2689-10A-01W.. C\n [63] unknown TCGA-AF-2689-10A-01W.. G\n [64] unknown TCGA-AF-2689-10A-01W.. T\n Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2\n \n [1] - - T\n [2] A A G\n [3] T T C\n [4] G . .\n [5] A A G\n ... ... ... ...\n [60] G . .\n [61] T . .\n [62] C C T\n [63] G G A\n [64] T T A\n Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2\n \n [1] - -\n [2] A A\n [3] T T\n [4] . .\n [5] A A\n ... ... ...\n [60] . .\n [61] . .\n [62] C C\n [63] G G\n [64] T T\n Verification_Status Validation_Status Mutation_Status Sequencing_Phase\n \n [1] Unknown Valid Somatic Phase_I\n [2] Unknown Valid Somatic Phase_I\n [3] Unknown Valid Somatic Phase_I\n [4] Unknown Unknown Somatic Phase_I\n [5] Unknown Valid Somatic Phase_I\n ... ... ... ... ...\n [60] Unknown Unknown Somatic Phase_I\n [61] Unknown Unknown Somatic Phase_I\n [62] Unknown Valid Somatic Phase_I\n [63] Unknown Valid Somatic Phase_I\n [64] Unknown Valid Somatic Phase_I\n Sequence_Source Validation_Method Score BAM_file Sequencer\n \n [1] Capture 454 SOLID\n [2] Capture 454 SOLID\n [3] Capture 454 SOLID\n [4] Capture . SOLID\n [5] Capture 454 SOLID\n ... ... ... ... ... ...\n [60] Capture . SOLID\n [61] Capture . SOLID\n [62] Capture 454 SOLID\n [63] Capture 454 SOLID\n [64] Capture 454 SOLID\n TranscriptID Exon ChromChange AAChange COSMIC_Codon\n \n [1] NM_001002838 exon23 c.4999_5000insA p.L1667fs .\n [2] NM_001550 exon10 c.A1043G p.E348G .\n [3] NM_002711 exon2 c.A838G p.T280A .\n [4] NM_032599 exon3 c.G639A p.T213T .\n [5] NM_138400 exon5 c.A1651G p.T551A .\n ... ... ... ... ... ...\n [60] NM_001143996 exon18 c.C2286T p.S762S .\n [61] NM_018063 exon18 c.T2061C p.D687D .\n [62] NM_032289 exon3 c.C460T p.Q154X .\n [63] NM_205836 exon21 c.G3327A p.R1109R .\n [64] NM_002104 exon5 c.T640A p.S214T .\n COSMIC_Gene Drug_Target\n \n [1] . .\n [2] . .\n [3] . .\n [4] . .\n [5] . .\n ... ... ...\n [60] . .\n [61] . .\n [62] . .\n [63] . .\n [64] . .\n -------\n seqinfo: 24 sequences from NCBI36 genome; no seqlengths\n\n...\n<68 more elements>\n```\n\n\n:::\n\n```{.r .cell-code}\nmut_assay |> class()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"CompressedGRangesList\"\nattr(,\"package\")\n[1] \"GenomicRanges\"\n```\n\n\n:::\n\n```{.r .cell-code}\nlength(mut_assay)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 69\n```\n\n\n:::\n:::\n\n\nLet's inspect the data from the first patient. We can see from the metadata information the Hugo Symbol, mutation status and predicted effect of each mutation at variant classification.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_assay[[1]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nGRanges object with 64 ranges and 34 metadata columns:\n seqnames ranges strand | Hugo_Symbol Entrez_Gene_Id\n | \n [1] X 54241715-54241716 + | WNK3 65267\n [2] 7 111899511 + | IFRD1 3475\n [3] 7 113309556 + | PPP1R3A 5506\n [4] 7 128146325 + | FAM71F1 84691\n [5] 7 156447624 + | NOM1 64434\n ... ... ... ... . ... ...\n [60] 10 50616059 + | OGDHL 55753\n [61] 10 96343335 + | HELLS 3070\n [62] 5 139173166 + | PSD2 84249\n [63] 5 147800932 + | FBXO38 81545\n [64] 5 54365356 + | GZMK 3003\n Center NCBI_Build Variant_Classification Variant_Type\n \n [1] hgsc.bcm.edu 36 Frame_Shift_Ins INS\n [2] hgsc.bcm.edu 36 Missense_Mutation SNP\n [3] hgsc.bcm.edu 36 Missense_Mutation SNP\n [4] hgsc.bcm.edu 36 Silent SNP\n [5] hgsc.bcm.edu 36 Missense_Mutation SNP\n ... ... ... ... ...\n [60] hgsc.bcm.edu 36 Silent SNP\n [61] hgsc.bcm.edu 36 Silent SNP\n [62] hgsc.bcm.edu 36 Nonsense_Mutation SNP\n [63] hgsc.bcm.edu 36 Silent SNP\n [64] hgsc.bcm.edu 36 Missense_Mutation SNP\n Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS\n \n [1] - - T novel\n [2] A A G novel\n [3] T T C novel\n [4] G G A novel\n [5] A A G novel\n ... ... ... ... ...\n [60] G G A novel\n [61] T T C novel\n [62] C C T novel\n [63] G G A novel\n [64] T T A novel\n dbSNP_Val_Status Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1\n \n [1] unknown TCGA-AF-2689-10A-01W.. -\n [2] unknown TCGA-AF-2689-10A-01W.. A\n [3] unknown TCGA-AF-2689-10A-01W.. T\n [4] unknown TCGA-AF-2689-10A-01W.. G\n [5] unknown TCGA-AF-2689-10A-01W.. A\n ... ... ... ...\n [60] unknown TCGA-AF-2689-10A-01W.. G\n [61] unknown TCGA-AF-2689-10A-01W.. T\n [62] unknown TCGA-AF-2689-10A-01W.. C\n [63] unknown TCGA-AF-2689-10A-01W.. G\n [64] unknown TCGA-AF-2689-10A-01W.. T\n Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2\n \n [1] - - T\n [2] A A G\n [3] T T C\n [4] G . .\n [5] A A G\n ... ... ... ...\n [60] G . .\n [61] T . .\n [62] C C T\n [63] G G A\n [64] T T A\n Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2\n \n [1] - -\n [2] A A\n [3] T T\n [4] . .\n [5] A A\n ... ... ...\n [60] . .\n [61] . .\n [62] C C\n [63] G G\n [64] T T\n Verification_Status Validation_Status Mutation_Status Sequencing_Phase\n \n [1] Unknown Valid Somatic Phase_I\n [2] Unknown Valid Somatic Phase_I\n [3] Unknown Valid Somatic Phase_I\n [4] Unknown Unknown Somatic Phase_I\n [5] Unknown Valid Somatic Phase_I\n ... ... ... ... ...\n [60] Unknown Unknown Somatic Phase_I\n [61] Unknown Unknown Somatic Phase_I\n [62] Unknown Valid Somatic Phase_I\n [63] Unknown Valid Somatic Phase_I\n [64] Unknown Valid Somatic Phase_I\n Sequence_Source Validation_Method Score BAM_file Sequencer\n \n [1] Capture 454 SOLID\n [2] Capture 454 SOLID\n [3] Capture 454 SOLID\n [4] Capture . SOLID\n [5] Capture 454 SOLID\n ... ... ... ... ... ...\n [60] Capture . SOLID\n [61] Capture . SOLID\n [62] Capture 454 SOLID\n [63] Capture 454 SOLID\n [64] Capture 454 SOLID\n TranscriptID Exon ChromChange AAChange COSMIC_Codon\n \n [1] NM_001002838 exon23 c.4999_5000insA p.L1667fs .\n [2] NM_001550 exon10 c.A1043G p.E348G .\n [3] NM_002711 exon2 c.A838G p.T280A .\n [4] NM_032599 exon3 c.G639A p.T213T .\n [5] NM_138400 exon5 c.A1651G p.T551A .\n ... ... ... ... ... ...\n [60] NM_001143996 exon18 c.C2286T p.S762S .\n [61] NM_018063 exon18 c.T2061C p.D687D .\n [62] NM_032289 exon3 c.C460T p.Q154X .\n [63] NM_205836 exon21 c.G3327A p.R1109R .\n [64] NM_002104 exon5 c.T640A p.S214T .\n COSMIC_Gene Drug_Target\n \n [1] . .\n [2] . .\n [3] . .\n [4] . .\n [5] . .\n ... ... ...\n [60] . .\n [61] . .\n [62] . .\n [63] . .\n [64] . .\n -------\n seqinfo: 24 sequences from NCBI36 genome; no seqlengths\n```\n\n\n:::\n\n```{.r .cell-code}\nmut_assay[[1]]$Hugo_Symbol\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] \"WNK3\" \"IFRD1\" \"PPP1R3A\" \"FAM71F1\" \"NOM1\" \"TRIM73\" \n [7] \"C7orf51\" \"DIDO1\" \"DNMT1\" \"CALR\" \"SIN3B\" \"ZNF569\" \n[13] \"SIGLEC12\" \"ZNF160\" \"NLRP4\" \"ENPP2\" \"TRPA1\" \"MMP16\" \n[19] \"GPR89A\" \"SLC9A11\" \"PAX7\" \"PLA2G5\" \"SLC26A9\" \"PIK3R3\" \n[25] \"LPPR4\" \"BCL9L\" \"PRDM11\" \"PEX3\" \"DCDC2\" \"KRT20\" \n[31] \"TP53\" \"CDH8\" \"SF3B3\" \"SLC9A10\" \"SLC7A14\" \"NLGN1\" \n[37] \"MYRIP\" \"CYP8B1\" \"TGM4\" \"COL7A1\" \"P2RX7\" \"KRAS\" \n[43] \"TPH2\" \"ANO4\" \"UBR1\" \"LBXCOR1\" \"PDLIM5\" \"ODZ1\" \n[49] \"SMARCA1\" \"CNKSR2\" \"RBM10\" \"RBM3\" \"HUWE1\" \"CYLC1\" \n[55] \"ATP6V1E2\" \"ASTN2\" \"TAF1L\" \"SGCG\" \"GBF1\" \"OGDHL\" \n[61] \"HELLS\" \"PSD2\" \"FBXO38\" \"GZMK\" \n```\n\n\n:::\n\n```{.r .cell-code}\nmut_assay[[1]]$Mutation_Status |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nSomatic \n 64 \n```\n\n\n:::\n\n```{.r .cell-code}\nmut_assay[[1]]$Variant_Classification |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n Frame_Shift_Ins Missense_Mutation Nonsense_Mutation Silent \n 1 39 5 19 \n```\n\n\n:::\n:::\n\n\nNow, is kind of a trouble to inspect manually each patient. So, lets get all mutation information from Hugo symbol and Variant classification for all the patients.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvar_class_df = mapply(function(sample_id, mutation_assay){\n \n d = mcols(mutation_assay)[,c(\"Hugo_Symbol\",\"Variant_Classification\")] |> \n as.data.frame()\n \n colnames(d) = c(\"symbol\",\"variant_class\")\n \n d$patientID = sample_id\n \n return(d)\n \n}, sample_id=mut_sample_ids, mutation_assay = mut_assay,SIMPLIFY = F, USE.NAMES = F)\n\n\nvar_class_df = do.call(rbind, var_class_df)\n\n\nhead(var_class_df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n symbol variant_class patientID\n1 WNK3 Frame_Shift_Ins TCGA-AF-2689\n2 IFRD1 Missense_Mutation TCGA-AF-2689\n3 PPP1R3A Missense_Mutation TCGA-AF-2689\n4 FAM71F1 Silent TCGA-AF-2689\n5 NOM1 Missense_Mutation TCGA-AF-2689\n6 TRIM73 Nonsense_Mutation TCGA-AF-2689\n```\n\n\n:::\n:::\n\n\nWe can visualize the most common mutated genes genes in rectum adenocarcinoma\n\n\n::: {#cell-fig-genesmut .cell}\n\n```{.r .cell-code}\n#| label: fig-genesmut\n#| message: false\n#| warning: false\n\np <- var_class_df |> \n as_tibble() |> \n group_by(symbol,variant_class) |> \n summarise(n = n()) |> \n arrange(desc(n)) |> \n ungroup() |> \n slice_max(order_by = n, n = 20) |> \n ungroup() |> \n mutate(symbol = fct_reorder(symbol,n)) |> \n ggplot(aes(n, symbol,fill=variant_class)) +\n geom_col() +\n facet_wrap(~variant_class) +\n paletteer::scale_fill_paletteer_d(\"awtools::a_palette\") +\n labs(x = \"Samples with specific mutation\", y=\"Gene Symbol\", fill=\"Variant Class\",\n title=\"Top 20 Samples with READ\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n`summarise()` has grouped output by 'symbol'. You can override using the\n`.groups` argument.\n```\n\n\n:::\n\n```{.r .cell-code}\n#| label: fig-genesmut\n#| message: false\n#| warning: false\n\npath <- \"images/\"\nggsave(paste0(path,\"samples_with_mut.jpeg\"), device = \"jpeg\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nSaving 7 x 5 in image\n```\n\n\n:::\n:::\n\n\n![Samples with specific mutation per gene](images/samples_with_mut.jpeg)\n## Linking mutations and tumor stage\n\nNow that we are familiarized with the mutation data, we can link mutations to the tumor stage from the patients with rectum adenocarcinoma.\n\nWe can filter the clinical data with the patients that have mutation data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindex <- which( (var_class_df$patientID |> unique()) %in% clin$patientID)\n\nclin_and_mutation = clin[index,]\n\nhead(clin_and_mutation) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 2,261\n patientID years_to_birth vital_status days_to_death days_to_last_followup\n \n1 TCGA-AF-2687 57 0 NA 1427\n2 TCGA-AF-2689 41 1 1201 NA\n3 TCGA-AF-2690 76 1 524 NA\n4 TCGA-AF-2691 48 0 NA 1309\n5 TCGA-AF-2692 54 0 NA 412\n6 TCGA-AF-2693 75 0 NA 1155\n# ℹ 2,256 more variables: tumor_tissue_site , pathologic_stage ,\n# pathology_T_stage , pathology_N_stage , pathology_M_stage ,\n# gender , date_of_initial_pathologic_diagnosis ,\n# days_to_last_known_alive , radiation_therapy ,\n# histological_type , tumor_stage , residual_tumor ,\n# number_of_lymph_nodes , ethnicity , admin.bcr ,\n# admin.day_of_dcc_upload , admin.disease_code , …\n```\n\n\n:::\n:::\n\n\nWe can count the number of genes with mutations per patient and then make a `left_join` to the clinical data to plot the mutated genes per tumor stage.\n\n\n::: {#cell-fig-plot1 .cell}\n\n```{.r .cell-code}\n#| label: fig-plot1\n\ndf = var_class_df |> \n group_by(patientID) |> \n summarise(n = n())\n\np <- clin_and_mutation |> \n left_join(df, by = join_by(patientID)) |> \n ggplot(aes(t_stage,log(n), fill=t_stage))+\n geom_boxplot() +\n geom_jitter(width = 0.05, colour=\"red2\") +\n paletteer::scale_fill_paletteer_d(\"awtools::a_palette\")+\n labs(x = \"Tumor stage\", y=\"Number of mutated genes in log scale\")\n\npath <- \"images/\"\nggsave(paste0(path,\"number_mut_genes_per_tumor_stage.jpeg\"), device = \"jpeg\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nSaving 7 x 5 in image\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Removed 20 rows containing non-finite outside the scale range\n(`stat_boxplot()`).\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Removed 20 rows containing missing values or values outside the scale range\n(`geom_point()`).\n```\n\n\n:::\n:::\n\n\n![Number of mutated genes per tumor stage](images/number_mut_genes_per_tumor_stage.jpeg)\n\nAlso, we can focus on a specific gene, e.g. [TP53](https://www.ncbi.nlm.nih.gov/gene/7157).\n\n\n::: {#cell-fig-plot_tp53 .cell}\n\n```{.r .cell-code}\n#| label: fig-plot_tp53\n\nmut_on_tp53 <- var_class_df |> \n dplyr::filter(symbol == \"TP53\") \n\n\np <-clin_and_mutation |> \n dplyr::filter(patientID %in% mut_on_tp53$patientID) |> \n dplyr::select(t_stage, patientID) |> \n group_by(t_stage) |> \n summarise(n = n()) |> \n # mutate(t_stage = fct_reorder(t_stage,n)) |> \n ggplot(aes(n, t_stage, fill=t_stage, label=n)) +\n geom_col() +\n geom_label(fill=\"white\") +\n paletteer::scale_fill_paletteer_d(\"awtools::a_palette\")+\n labs(y = \"Tumor Stage\", x=\"N° of Mutations\", fill=\"Tumor Stage\", title=\"Number of Mutations on Gene TP53\")\n\npath <- \"images/\"\nggsave(paste0(path,\"number_mut_tp53_per_tumor_stage.jpeg\"), device = \"jpeg\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nSaving 7 x 5 in image\n```\n\n\n:::\n:::\n\n\n![Number of mutations on gene TP53 per tumor stage](images/number_mut_tp53_per_tumor_stage.jpeg)\n\n# Linking expression and tumor stage\n\nWe also can link the expression data (RNA-Seq) with the tumor stage data from the clinical metadata. [Check the TCGA RNA-Seq protocol online]{.aside}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrnaseq = readData[[2]]\n\nrnaseq\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nclass: SummarizedExperiment \ndim: 18115 177 \nmetadata(3): filename build platform\nassays(1): ''\nrownames(18115): A1BG A1CF ... ZZZ3 psiTPTE22\nrowData names(0):\ncolnames(177): TCGA-AF-2687-01 TCGA-AF-2689-11 ... TCGA-AG-A032-01\n TCGA-AG-A036-01\ncolData names(0):\n```\n\n\n:::\n:::\n\nAs before, we need to shorten the sample IDs so they can match to the clinical data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(rnaseq) = colnames(rnaseq) |> \n stringr::str_sub(1,12)\n\nclin <- clin |> as.data.frame()\nrownames(clin) <- clin$patientID\n\nindex <- which(colnames(rnaseq) %in% rownames(clin))\n\ncolData(rnaseq) = as(clin[colnames(rnaseq),],\"DataFrame\")\n\n\nidx <- is.na(colData(rnaseq)$t_stage)\n\nrnaseq <- rnaseq[,!idx]\n```\n:::\n\n\nThen, we can use the `limma` package to proceed with the differential expression analysis. In this case, we'll ise the tumor stage as a variable to explain the expression of the genes. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(limma)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'limma' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'limma'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:BiocGenerics':\n\n plotMA\n```\n\n\n:::\n\n```{.r .cell-code}\nmm = model.matrix(~t_stage, data=colData(rnaseq))\nf1 = lmFit(assay(rnaseq), mm)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Partial NA coefficients for 5 probe(s)\n```\n\n\n:::\n\n```{.r .cell-code}\nef1 = eBayes(f1)\ntopTable(ef1) |> \n arrange(desc(AveExpr))\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nRemoving intercept from test coefficients\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n t_staget2 t_staget3 t_staget4 AveExpr F P.Value\nPAM -0.5791836 0.08813521 0.4984790 10.94216940 8.309141 3.409613e-05\nPAIP2 -0.3240061 0.18551765 0.0958645 10.29398797 9.056896 1.331550e-05\nSPP1 -0.3352086 1.56339367 2.6409082 9.61423581 9.316731 9.621982e-06\nNIPSNAP3A -0.9988818 -0.35051776 -0.6085837 9.38910637 8.712515 2.051165e-05\nASPN -0.7366657 0.75340984 1.9537316 7.45952050 8.323019 3.350393e-05\nOLR1 -0.2557879 1.39542897 2.1822156 4.97144475 8.531715 2.597133e-05\nSEMA5B 1.3576218 1.27252058 1.9271042 4.60337761 8.091237 4.490926e-05\nCPSF4L 1.0629502 -0.17984587 -1.4274556 0.70036223 8.955085 2.311056e-05\nGRM5 1.3027501 -0.14596612 -0.7395609 0.06134539 10.268851 7.301656e-06\nC1orf223 0.7830779 -0.40735736 -0.7062534 -0.35136703 9.681632 1.743477e-05\n adj.P.Val\nPAM 0.06860899\nPAIP2 0.06719153\nSPP1 0.06719153\nNIPSNAP3A 0.06719153\nASPN 0.06860899\nOLR1 0.06719153\nSEMA5B 0.07597995\nCPSF4L 0.06719153\nGRM5 0.06719153\nC1orf223 0.06719153\n```\n\n\n:::\n:::\n\n\nLet's visualize two of the most expressed genes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npar(mfrow=c(1,2))\nboxplot(split(assay(rnaseq)[\"PAM\",], rnaseq$t_stage), main=\"PAM\") # higher expression in lower t_stage\nboxplot(split(assay(rnaseq)[\"PAIP2\",], rnaseq$t_stage), main=\"PAIP2\")\n```\n\n::: {.cell-output-display}\n![](TCGA_files/figure-html/unnamed-chunk-25-1.png){width=1260}\n:::\n:::\n\n\n# Linking methylation and expression\n\nFinally, we can use the methylation data with the expression data. This is important because methylated cytosines of the DNA change the expression of the genes.\n\nSome patients have methylation data for multiples tissue types. This information is encoded in the fourth component of the sample names. The code `01A` correspond to primary tumor samples and the code `11A` correspond to normal tissue. We'll keep the primary tumor samples.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmethyl <- readData[[3]]\n\nidx <- colnames(methyl) |> \n str_split(pattern = \"-\") |> \n map(4) |> \n enframe() |> \n unnest(value)\n\nidx <- which(idx$value == \"01A\")\n\nmethyl = methyl[,idx]\n\nmethyl\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nclass: SummarizedExperiment \ndim: 485577 95 \nmetadata(0):\nassays(1): counts\nrownames(485577): cg00000029 cg00000108 ... rs966367 rs9839873\nrowData names(3): Gene_Symbol Chromosome Genomic_Coordinate\ncolnames(95): TCGA-AF-2687-01A-02D-1734-05 TCGA-AF-2690-01A-02D-1734-05\n ... TCGA-G5-6572-01A-11D-1828-05 TCGA-G5-6641-01A-11D-1828-05\ncolData names(0):\n```\n\n\n:::\n:::\n\n\nAs before, let's truncate the names of the sample to match the clinical data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(methyl) <- colnames(methyl) |> \n str_sub(start = 1,end = 12)\n```\n:::\n\n\nWe can add the clinical data to the methyl object and count the number of patients with methylation and transcription data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolData(methyl) <- as(clin[colnames(methyl),],\"DataFrame\")\n\n\nintersect(colnames(methyl), colnames(rnaseq)) |> length()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 93\n```\n\n\n:::\n:::\n\n\nLet's subset common sample names and check the methylation data as row ranges.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmethyl_subset = methyl[,which(colnames(methyl) %in% colnames(rnaseq))]\nrnaseq_subset = rnaseq[,which(colnames(rnaseq) %in% colnames(methyl))]\n\nmethyl_genes = rowData(methyl_subset)\nmethyl_genes\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nDataFrame with 485577 rows and 3 columns\n Gene_Symbol Chromosome Genomic_Coordinate\n \ncg00000029 RBL2 16 53468112\ncg00000108 C3orf35 3 37459206\ncg00000109 FNDC3B 3 171916037\ncg00000165 NA 1 91194674\ncg00000236 VDAC3 8 42263294\n... ... ... ...\nrs9363764 NA NA 0\nrs939290 NA NA 0\nrs951295 NA NA 0\nrs966367 NA NA 0\nrs9839873 NA NA 0\n```\n\n\n:::\n:::\n\n\nThis function takes a gene symbol and returns a scatter plot showing the relationship between 3 different sites near that gene and gene expression.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#| label: figplot\n\nme_rna_cor = function(sym, mpick = 3){\n require(GGally)\n # subset methylation data to first mpick methylation sites for given gene symbol\n methyl_ind = which(methyl_genes$Gene_Symbol == sym)\n if (length(methyl_ind) > mpick){ \n methyl_ind = methyl_ind[1:mpick]\n }\n methyl_dat = assay(methyl_subset)[methyl_ind,] # subset to selected methylation sites\n\n # subset expression data to selected gene symbol\n expr_ind = which(rownames(rnaseq_subset) == sym) \n expr_dat = assay(rnaseq_subset)[expr_ind,]\n\n # combine methylation and expression data as data frame\n combined_dat = as(t(methyl_dat), \"DataFrame\")\n combined_dat$expr = expr_dat\n\n # plot pairs and calculate correlation coefficients between methylation and expression\n ggpairs(combined_dat) |> print()\n sapply(1:mpick, function(i){\n \n cor_to <- as.numeric(combined_dat[,mpick+1])\n \n df <- data.frame(v1= as.numeric(combined_dat[,i]),\n v2 = cor_to)\n \n df <- df |> na.omit()\n \n cor(df[,1], df[,2])\n })\n}\n\nme_rna_cor(\"TAC1\", mpick=3)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: GGally\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'GGally' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nRegistered S3 method overwritten by 'GGally':\n method from \n +.gg ggplot2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :\nRemoved 5 rows containing missing values\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :\nRemoved 5 rows containing missing values\nWarning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :\nRemoved 5 rows containing missing values\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Removed 5 rows containing missing values or values outside the scale range\n(`geom_point()`).\nRemoved 5 rows containing missing values or values outside the scale range\n(`geom_point()`).\nRemoved 5 rows containing missing values or values outside the scale range\n(`geom_point()`).\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Removed 5 rows containing non-finite outside the scale range\n(`stat_density()`).\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](TCGA_files/figure-html/unnamed-chunk-30-1.png){width=1260}\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.0410095 0.1457715 0.0856926\n```\n\n\n:::\n:::\n", + "markdown": "---\ntitle: \"Integrative Analysis with TCGA Data\"\nsubtitle: \"Analysis of Mutation, Transcription and Methylation Data from The Cancer Genome Atlas (TCGA)\"\nformat: html\neditor: visual\n---\n\n::: {.cell .hidden}\n\n```{.r .cell-code .hidden}\n#| include: false\nlibrary(knitr)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'knitr' was built under R version 4.3.3\n```\n\n\n:::\n\n```{.r .cell-code .hidden}\n#| include: false\nknitr::opts_chunk$set(\n warning = FALSE,\n message = FALSE,\n dpi = 180,\n echo = TRUE\n)\n\nlibrary(tidyverse)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'ggplot2' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'tidyr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'readr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'purrr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'dplyr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'stringr' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'lubridate' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──\n✔ dplyr 1.1.4 ✔ readr 2.1.5\n✔ forcats 1.0.0 ✔ stringr 1.5.1\n✔ ggplot2 3.5.0 ✔ tibble 3.2.1\n✔ lubridate 1.9.3 ✔ tidyr 1.3.1\n✔ purrr 1.0.2 \n── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──\n✖ dplyr::filter() masks stats::filter()\n✖ dplyr::lag() masks stats::lag()\nℹ Use the conflicted package () to force all conflicts to become errors\n```\n\n\n:::\n\n```{.r .cell-code .hidden}\n#| include: false\ntheme_set(theme_minimal())\n```\n:::\n\n\n## Introduction\n\nThe Cancer Genome Atlas (TCGA) is a massive cancer genomics project compiling high-throughput multi-omic data on dozens of cancer types for [public access](https://www.cancer.gov/ccg/research/genome-sequencing/tcga).\n\nWe are gonna use the `curatedTCGAData` package to manipulate locally to multiple high-throughput datasets from the project. The package provides access to TCGA data that has been curated and stored as a *MultiAssayExperiment* object on the Bioconductor [ExperimentHub](https://bioconductor.org/packages/release/bioc/html/ExperimentHub.html).\n\nFirst, let's load the packages needed.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(curatedTCGAData)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: MultiAssayExperiment\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'MultiAssayExperiment' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: SummarizedExperiment\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'SummarizedExperiment' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: MatrixGenerics\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'MatrixGenerics' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: matrixStats\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'matrixStats' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'matrixStats'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:dplyr':\n\n count\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'MatrixGenerics'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:matrixStats':\n\n colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,\n colCounts, colCummaxs, colCummins, colCumprods, colCumsums,\n colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,\n colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,\n colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,\n colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,\n colWeightedMeans, colWeightedMedians, colWeightedSds,\n colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,\n rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,\n rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,\n rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,\n rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,\n rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,\n rowWeightedMads, rowWeightedMeans, rowWeightedMedians,\n rowWeightedSds, rowWeightedVars\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: GenomicRanges\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: stats4\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: BiocGenerics\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'BiocGenerics' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'BiocGenerics'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:lubridate':\n\n intersect, setdiff, union\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:dplyr':\n\n combine, intersect, setdiff, union\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:stats':\n\n IQR, mad, sd, var, xtabs\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:base':\n\n anyDuplicated, aperm, append, as.data.frame, basename, cbind,\n colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,\n get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,\n match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,\n Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,\n table, tapply, union, unique, unsplit, which.max, which.min\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: S4Vectors\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'S4Vectors'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:lubridate':\n\n second, second<-\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:dplyr':\n\n first, rename\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:tidyr':\n\n expand\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:utils':\n\n findMatches\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:base':\n\n expand.grid, I, unname\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: IRanges\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'IRanges'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:lubridate':\n\n %within%\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:dplyr':\n\n collapse, desc, slice\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:purrr':\n\n reduce\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:grDevices':\n\n windows\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: GenomeInfoDb\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'GenomeInfoDb' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: Biobase\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWelcome to Bioconductor\n\n Vignettes contain introductory material; view with\n 'browseVignettes()'. To cite Bioconductor, see\n 'citation(\"Biobase\")', and for packages 'citation(\"pkgname\")'.\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'Biobase'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:MatrixGenerics':\n\n rowMedians\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:matrixStats':\n\n anyMissing, rowMedians\n```\n\n\n:::\n\n```{.r .cell-code}\nlibrary(TCGAutils)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'TCGAutils' was built under R version 4.3.2\n```\n\n\n:::\n\n```{.r .cell-code}\nlibrary(MultiAssayExperiment)\n```\n:::\n\n\n## Download the Data\n\nTo download the data we need to use `curatedTCGAData`function. The first argument is a four letter disease (cancer) code (A complete list of disease codes used by the TCGA project are available on the [NCI Genomic Data Commons website](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations)), the second argument is a vector of data types we want to download. We need to specify `dry.run = FALSE` to download the data.\n\nIn this specific case, we are gonna work with RNA-Seq data, mutation data and methylation data from Rectum Adenocarcinoma (READ). The clinical data is included by default.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#| message: false\n#| warning: false\n#| label: read-data\n\nreadData = curatedTCGAData(\"READ\", \n c(\"RNASeq2GeneNorm\", \"Mutation\", \"Methylation_methyl450\"), \n dry.run = FALSE, version = \"2.1.1\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_Mutation-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nrequire(\"RaggedExperiment\")\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'RaggedExperiment' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_RNASeq2GeneNorm-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_Methylation_methyl450-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nrequire(\"rhdf5\")\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'rhdf5' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: HDF5Array\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'HDF5Array' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: DelayedArray\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'DelayedArray' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: Matrix\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'Matrix' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'Matrix'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:S4Vectors':\n\n expand\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:tidyr':\n\n expand, pack, unpack\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: S4Arrays\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'S4Arrays' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: abind\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'S4Arrays'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:abind':\n\n abind\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:base':\n\n rowsum\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: SparseArray\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'SparseArray' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'DelayedArray'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:purrr':\n\n simplify\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following objects are masked from 'package:base':\n\n apply, scale, sweep\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'HDF5Array'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:rhdf5':\n\n h5ls\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_colData-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_sampleMap-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWorking on: READ_metadata-20160128\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nsee ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nloading from cache\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nharmonizing input:\n removing 1903 sampleMap rows not in names(experiments)\n removing 2 colData rownames not in sampleMap 'primary'\n```\n\n\n:::\n\n```{.r .cell-code}\n#| message: false\n#| warning: false\n#| label: read-data\n\nreadData # this is a MultiAssayExperiment object\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nA MultiAssayExperiment object of 3 listed\n experiments with user-defined names and respective classes.\n Containing an ExperimentList class object of length 3:\n [1] READ_Mutation-20160128: RaggedExperiment with 22075 rows and 69 columns\n [2] READ_RNASeq2GeneNorm-20160128: SummarizedExperiment with 18115 rows and 177 columns\n [3] READ_Methylation_methyl450-20160128: SummarizedExperiment with 485577 rows and 106 columns\nFunctionality:\n experiments() - obtain the ExperimentList instance\n colData() - the primary/phenotype DataFrame\n sampleMap() - the sample coordination DataFrame\n `$`, `[`, `[[` - extract colData columns, subset, or experiment\n *Format() - convert into a long or wide DataFrame\n assays() - convert ExperimentList to a SimpleList of matrices\n exportClass() - save data to flat files\n```\n\n\n:::\n:::\n\n\n## Review the Clinical Metadata\n\nWe can see which patients have data for each assay. The assay column gives the experiment type, the primary column gives the unique patient ID and the colname gives the sample ID used as a identifier within a given experiment.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsampleMap(readData)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nDataFrame with 352 rows and 3 columns\n assay primary colname\n \n1 READ_Methylation_methyl450-20160128 TCGA-AF-2687 TCGA-AF-2687-01A-02D..\n2 READ_Methylation_methyl450-20160128 TCGA-AF-2690 TCGA-AF-2690-01A-02D..\n3 READ_Methylation_methyl450-20160128 TCGA-AF-2693 TCGA-AF-2693-01A-02D..\n4 READ_Methylation_methyl450-20160128 TCGA-AF-3911 TCGA-AF-3911-01A-01D..\n5 READ_Methylation_methyl450-20160128 TCGA-AF-4110 TCGA-AF-4110-01A-02D..\n... ... ... ...\n348 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A02G TCGA-AG-A02G-01\n349 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A02N TCGA-AG-A02N-01\n350 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A02X TCGA-AG-A02X-01\n351 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A032 TCGA-AG-A032-01\n352 READ_RNASeq2GeneNorm-20160128 TCGA-AG-A036 TCGA-AG-A036-01\n```\n\n\n:::\n:::\n\n\nNot all patients have data for all assays, and some of them can have multiple data entries for one or more experiment type. This may correspond to multiple biopsies or matched tumor and normal samples from an individual patient.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsampleMap(readData) |> \n as_tibble() |> \n pull(primary) |> \n table() |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n 1 2 3 4 \n 5 147 7 8 \n```\n\n\n:::\n:::\n\n\nWe can see the metadata of the patients with `colData`. Note that there are more than 2000 columns of data per patient (not necessarily complete).\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclin = colData(readData) |> \n as_tibble()\ndim(clin)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 167 2260\n```\n\n\n:::\n\n```{.r .cell-code}\nhead(colnames(clin), 10) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] \"patientID\" \"years_to_birth\" \"vital_status\" \n [4] \"days_to_death\" \"days_to_last_followup\" \"tumor_tissue_site\" \n [7] \"pathologic_stage\" \"pathology_T_stage\" \"pathology_N_stage\" \n[10] \"pathology_M_stage\" \n```\n\n\n:::\n:::\n\n\nAs an example, for rectum adenocarcinoma, we can see the tumor stage.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclin |> \n pull(pathology_T_stage) |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n t1 t2 t3 t4 t4a t4b \n 9 28 114 5 8 1 \n```\n\n\n:::\n:::\n\n\nStage T4 have subgroups. To simplify the analysis, let's combine all T4 tumors.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclin <- clin |> \n mutate(t_stage = case_when(\n pathology_T_stage %in% c(\"t4\",\"t4a\",\"t4b\") ~ \"t4\",\n .default = pathology_T_stage\n ))\n\nclin$t_stage |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n t1 t2 t3 t4 \n 9 28 114 14 \n```\n\n\n:::\n:::\n\n\nAlso, we can see the vital status (alive=0, deceased=1)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclin$vital_status |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n 0 1 \n139 28 \n```\n\n\n:::\n:::\n\n\nOr combine tumor status and vital status.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(clin$t_stage, clin$vital_status)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n \n 0 1\n t1 9 0\n t2 24 4\n t3 96 18\n t4 9 5\n```\n\n\n:::\n:::\n\n\n## Analyzing Mutation Data\n\nLet's begin analyzing the mutation data. Below is the code to retrieve the mutation data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_data = readData[[1]]\n\nmut_data\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nclass: RaggedExperiment \ndim: 22075 69 \nassays(34): Hugo_Symbol Entrez_Gene_Id ... COSMIC_Gene Drug_Target\nrownames: NULL\ncolnames(69): TCGA-AF-2689-01A-01W-0831-10 TCGA-AF-2691-01A-01W-0831-10\n ... TCGA-AG-A032-01 TCGA-AG-A036-01\ncolData names(0):\n```\n\n\n:::\n:::\n\n\nFrom the inspection of the sample IDs we can see that the mutation colnames match to the **primary** column from he clinical data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_sample_ids = colnames(mut_data)\nhead(mut_sample_ids)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"TCGA-AF-2689-01A-01W-0831-10\" \"TCGA-AF-2691-01A-01W-0831-10\"\n[3] \"TCGA-AF-2692-01A-01W-0831-10\" \"TCGA-AF-3400-01A-01W-0831-10\"\n[5] \"TCGA-AF-3913-01\" \"TCGA-AG-3574-01A-01W-0831-10\"\n```\n\n\n:::\n\n```{.r .cell-code}\nhead(clin$patientID)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"TCGA-AF-2687\" \"TCGA-AF-2689\" \"TCGA-AF-2690\" \"TCGA-AF-2691\" \"TCGA-AF-2692\"\n[6] \"TCGA-AF-2693\"\n```\n\n\n:::\n:::\n\n\nWe need to manipulate these by substracting 12 characters.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_sample_ids <- mut_sample_ids |> \n stringr::str_sub(1,12)\n\nall(mut_sample_ids %in% clin$patientID)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] TRUE\n```\n\n\n:::\n:::\n\n\nIs important to note that the data stored in `assay(mut_data)` is difficult to work with because is a sparse matrix that has a row for each `GRanges` with a mutation in at least one sample.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nassay(mut_data)[1:3,1:3]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n TCGA-AF-2689-01A-01W-0831-10 TCGA-AF-2691-01A-01W-0831-10\nX:54241715-54241716:+ \"WNK3\" NA \n7:111899511:+ \"IFRD1\" NA \n7:113309556:+ \"PPP1R3A\" NA \n TCGA-AF-2692-01A-01W-0831-10\nX:54241715-54241716:+ NA \n7:111899511:+ NA \n7:113309556:+ NA \n```\n\n\n:::\n\n```{.r .cell-code}\n# Is a sparse matrix\n\nassay(mut_data)[1,] |> \n table(useNA=\"ifany\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nWNK3 \n 1 68 \n```\n\n\n:::\n:::\n\n\nWe can get more information if we look at the mutation information for each patient.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_assay = mut_data@assays\n\nmut_assay # GRangesList\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nGRangesList object of length 69:\n$`TCGA-AF-2689-01A-01W-0831-10`\nGRanges object with 64 ranges and 34 metadata columns:\n seqnames ranges strand | Hugo_Symbol Entrez_Gene_Id\n | \n [1] X 54241715-54241716 + | WNK3 65267\n [2] 7 111899511 + | IFRD1 3475\n [3] 7 113309556 + | PPP1R3A 5506\n [4] 7 128146325 + | FAM71F1 84691\n [5] 7 156447624 + | NOM1 64434\n ... ... ... ... . ... ...\n [60] 10 50616059 + | OGDHL 55753\n [61] 10 96343335 + | HELLS 3070\n [62] 5 139173166 + | PSD2 84249\n [63] 5 147800932 + | FBXO38 81545\n [64] 5 54365356 + | GZMK 3003\n Center NCBI_Build Variant_Classification Variant_Type\n \n [1] hgsc.bcm.edu 36 Frame_Shift_Ins INS\n [2] hgsc.bcm.edu 36 Missense_Mutation SNP\n [3] hgsc.bcm.edu 36 Missense_Mutation SNP\n [4] hgsc.bcm.edu 36 Silent SNP\n [5] hgsc.bcm.edu 36 Missense_Mutation SNP\n ... ... ... ... ...\n [60] hgsc.bcm.edu 36 Silent SNP\n [61] hgsc.bcm.edu 36 Silent SNP\n [62] hgsc.bcm.edu 36 Nonsense_Mutation SNP\n [63] hgsc.bcm.edu 36 Silent SNP\n [64] hgsc.bcm.edu 36 Missense_Mutation SNP\n Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS\n \n [1] - - T novel\n [2] A A G novel\n [3] T T C novel\n [4] G G A novel\n [5] A A G novel\n ... ... ... ... ...\n [60] G G A novel\n [61] T T C novel\n [62] C C T novel\n [63] G G A novel\n [64] T T A novel\n dbSNP_Val_Status Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1\n \n [1] unknown TCGA-AF-2689-10A-01W.. -\n [2] unknown TCGA-AF-2689-10A-01W.. A\n [3] unknown TCGA-AF-2689-10A-01W.. T\n [4] unknown TCGA-AF-2689-10A-01W.. G\n [5] unknown TCGA-AF-2689-10A-01W.. A\n ... ... ... ...\n [60] unknown TCGA-AF-2689-10A-01W.. G\n [61] unknown TCGA-AF-2689-10A-01W.. T\n [62] unknown TCGA-AF-2689-10A-01W.. C\n [63] unknown TCGA-AF-2689-10A-01W.. G\n [64] unknown TCGA-AF-2689-10A-01W.. T\n Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2\n \n [1] - - T\n [2] A A G\n [3] T T C\n [4] G . .\n [5] A A G\n ... ... ... ...\n [60] G . .\n [61] T . .\n [62] C C T\n [63] G G A\n [64] T T A\n Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2\n \n [1] - -\n [2] A A\n [3] T T\n [4] . .\n [5] A A\n ... ... ...\n [60] . .\n [61] . .\n [62] C C\n [63] G G\n [64] T T\n Verification_Status Validation_Status Mutation_Status Sequencing_Phase\n \n [1] Unknown Valid Somatic Phase_I\n [2] Unknown Valid Somatic Phase_I\n [3] Unknown Valid Somatic Phase_I\n [4] Unknown Unknown Somatic Phase_I\n [5] Unknown Valid Somatic Phase_I\n ... ... ... ... ...\n [60] Unknown Unknown Somatic Phase_I\n [61] Unknown Unknown Somatic Phase_I\n [62] Unknown Valid Somatic Phase_I\n [63] Unknown Valid Somatic Phase_I\n [64] Unknown Valid Somatic Phase_I\n Sequence_Source Validation_Method Score BAM_file Sequencer\n \n [1] Capture 454 SOLID\n [2] Capture 454 SOLID\n [3] Capture 454 SOLID\n [4] Capture . SOLID\n [5] Capture 454 SOLID\n ... ... ... ... ... ...\n [60] Capture . SOLID\n [61] Capture . SOLID\n [62] Capture 454 SOLID\n [63] Capture 454 SOLID\n [64] Capture 454 SOLID\n TranscriptID Exon ChromChange AAChange COSMIC_Codon\n \n [1] NM_001002838 exon23 c.4999_5000insA p.L1667fs .\n [2] NM_001550 exon10 c.A1043G p.E348G .\n [3] NM_002711 exon2 c.A838G p.T280A .\n [4] NM_032599 exon3 c.G639A p.T213T .\n [5] NM_138400 exon5 c.A1651G p.T551A .\n ... ... ... ... ... ...\n [60] NM_001143996 exon18 c.C2286T p.S762S .\n [61] NM_018063 exon18 c.T2061C p.D687D .\n [62] NM_032289 exon3 c.C460T p.Q154X .\n [63] NM_205836 exon21 c.G3327A p.R1109R .\n [64] NM_002104 exon5 c.T640A p.S214T .\n COSMIC_Gene Drug_Target\n \n [1] . .\n [2] . .\n [3] . .\n [4] . .\n [5] . .\n ... ... ...\n [60] . .\n [61] . .\n [62] . .\n [63] . .\n [64] . .\n -------\n seqinfo: 24 sequences from NCBI36 genome; no seqlengths\n\n...\n<68 more elements>\n```\n\n\n:::\n\n```{.r .cell-code}\nmut_assay |> class()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"CompressedGRangesList\"\nattr(,\"package\")\n[1] \"GenomicRanges\"\n```\n\n\n:::\n\n```{.r .cell-code}\nlength(mut_assay)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 69\n```\n\n\n:::\n:::\n\n\nLet's inspect the data from the first patient. We can see from the metadata information the Hugo Symbol, mutation status and predicted effect of each mutation at variant classification.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmut_assay[[1]]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nGRanges object with 64 ranges and 34 metadata columns:\n seqnames ranges strand | Hugo_Symbol Entrez_Gene_Id\n | \n [1] X 54241715-54241716 + | WNK3 65267\n [2] 7 111899511 + | IFRD1 3475\n [3] 7 113309556 + | PPP1R3A 5506\n [4] 7 128146325 + | FAM71F1 84691\n [5] 7 156447624 + | NOM1 64434\n ... ... ... ... . ... ...\n [60] 10 50616059 + | OGDHL 55753\n [61] 10 96343335 + | HELLS 3070\n [62] 5 139173166 + | PSD2 84249\n [63] 5 147800932 + | FBXO38 81545\n [64] 5 54365356 + | GZMK 3003\n Center NCBI_Build Variant_Classification Variant_Type\n \n [1] hgsc.bcm.edu 36 Frame_Shift_Ins INS\n [2] hgsc.bcm.edu 36 Missense_Mutation SNP\n [3] hgsc.bcm.edu 36 Missense_Mutation SNP\n [4] hgsc.bcm.edu 36 Silent SNP\n [5] hgsc.bcm.edu 36 Missense_Mutation SNP\n ... ... ... ... ...\n [60] hgsc.bcm.edu 36 Silent SNP\n [61] hgsc.bcm.edu 36 Silent SNP\n [62] hgsc.bcm.edu 36 Nonsense_Mutation SNP\n [63] hgsc.bcm.edu 36 Silent SNP\n [64] hgsc.bcm.edu 36 Missense_Mutation SNP\n Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS\n \n [1] - - T novel\n [2] A A G novel\n [3] T T C novel\n [4] G G A novel\n [5] A A G novel\n ... ... ... ... ...\n [60] G G A novel\n [61] T T C novel\n [62] C C T novel\n [63] G G A novel\n [64] T T A novel\n dbSNP_Val_Status Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1\n \n [1] unknown TCGA-AF-2689-10A-01W.. -\n [2] unknown TCGA-AF-2689-10A-01W.. A\n [3] unknown TCGA-AF-2689-10A-01W.. T\n [4] unknown TCGA-AF-2689-10A-01W.. G\n [5] unknown TCGA-AF-2689-10A-01W.. A\n ... ... ... ...\n [60] unknown TCGA-AF-2689-10A-01W.. G\n [61] unknown TCGA-AF-2689-10A-01W.. T\n [62] unknown TCGA-AF-2689-10A-01W.. C\n [63] unknown TCGA-AF-2689-10A-01W.. G\n [64] unknown TCGA-AF-2689-10A-01W.. T\n Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2\n \n [1] - - T\n [2] A A G\n [3] T T C\n [4] G . .\n [5] A A G\n ... ... ... ...\n [60] G . .\n [61] T . .\n [62] C C T\n [63] G G A\n [64] T T A\n Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2\n \n [1] - -\n [2] A A\n [3] T T\n [4] . .\n [5] A A\n ... ... ...\n [60] . .\n [61] . .\n [62] C C\n [63] G G\n [64] T T\n Verification_Status Validation_Status Mutation_Status Sequencing_Phase\n \n [1] Unknown Valid Somatic Phase_I\n [2] Unknown Valid Somatic Phase_I\n [3] Unknown Valid Somatic Phase_I\n [4] Unknown Unknown Somatic Phase_I\n [5] Unknown Valid Somatic Phase_I\n ... ... ... ... ...\n [60] Unknown Unknown Somatic Phase_I\n [61] Unknown Unknown Somatic Phase_I\n [62] Unknown Valid Somatic Phase_I\n [63] Unknown Valid Somatic Phase_I\n [64] Unknown Valid Somatic Phase_I\n Sequence_Source Validation_Method Score BAM_file Sequencer\n \n [1] Capture 454 SOLID\n [2] Capture 454 SOLID\n [3] Capture 454 SOLID\n [4] Capture . SOLID\n [5] Capture 454 SOLID\n ... ... ... ... ... ...\n [60] Capture . SOLID\n [61] Capture . SOLID\n [62] Capture 454 SOLID\n [63] Capture 454 SOLID\n [64] Capture 454 SOLID\n TranscriptID Exon ChromChange AAChange COSMIC_Codon\n \n [1] NM_001002838 exon23 c.4999_5000insA p.L1667fs .\n [2] NM_001550 exon10 c.A1043G p.E348G .\n [3] NM_002711 exon2 c.A838G p.T280A .\n [4] NM_032599 exon3 c.G639A p.T213T .\n [5] NM_138400 exon5 c.A1651G p.T551A .\n ... ... ... ... ... ...\n [60] NM_001143996 exon18 c.C2286T p.S762S .\n [61] NM_018063 exon18 c.T2061C p.D687D .\n [62] NM_032289 exon3 c.C460T p.Q154X .\n [63] NM_205836 exon21 c.G3327A p.R1109R .\n [64] NM_002104 exon5 c.T640A p.S214T .\n COSMIC_Gene Drug_Target\n \n [1] . .\n [2] . .\n [3] . .\n [4] . .\n [5] . .\n ... ... ...\n [60] . .\n [61] . .\n [62] . .\n [63] . .\n [64] . .\n -------\n seqinfo: 24 sequences from NCBI36 genome; no seqlengths\n```\n\n\n:::\n\n```{.r .cell-code}\nmut_assay[[1]]$Hugo_Symbol\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] \"WNK3\" \"IFRD1\" \"PPP1R3A\" \"FAM71F1\" \"NOM1\" \"TRIM73\" \n [7] \"C7orf51\" \"DIDO1\" \"DNMT1\" \"CALR\" \"SIN3B\" \"ZNF569\" \n[13] \"SIGLEC12\" \"ZNF160\" \"NLRP4\" \"ENPP2\" \"TRPA1\" \"MMP16\" \n[19] \"GPR89A\" \"SLC9A11\" \"PAX7\" \"PLA2G5\" \"SLC26A9\" \"PIK3R3\" \n[25] \"LPPR4\" \"BCL9L\" \"PRDM11\" \"PEX3\" \"DCDC2\" \"KRT20\" \n[31] \"TP53\" \"CDH8\" \"SF3B3\" \"SLC9A10\" \"SLC7A14\" \"NLGN1\" \n[37] \"MYRIP\" \"CYP8B1\" \"TGM4\" \"COL7A1\" \"P2RX7\" \"KRAS\" \n[43] \"TPH2\" \"ANO4\" \"UBR1\" \"LBXCOR1\" \"PDLIM5\" \"ODZ1\" \n[49] \"SMARCA1\" \"CNKSR2\" \"RBM10\" \"RBM3\" \"HUWE1\" \"CYLC1\" \n[55] \"ATP6V1E2\" \"ASTN2\" \"TAF1L\" \"SGCG\" \"GBF1\" \"OGDHL\" \n[61] \"HELLS\" \"PSD2\" \"FBXO38\" \"GZMK\" \n```\n\n\n:::\n\n```{.r .cell-code}\nmut_assay[[1]]$Mutation_Status |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nSomatic \n 64 \n```\n\n\n:::\n\n```{.r .cell-code}\nmut_assay[[1]]$Variant_Classification |> \n table()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n Frame_Shift_Ins Missense_Mutation Nonsense_Mutation Silent \n 1 39 5 19 \n```\n\n\n:::\n:::\n\n\nNow, is kind of a trouble to inspect manually each patient. So, lets get all mutation information from Hugo symbol and Variant classification for all the patients.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvar_class_df = mapply(function(sample_id, mutation_assay){\n \n d = mcols(mutation_assay)[,c(\"Hugo_Symbol\",\"Variant_Classification\")] |> \n as.data.frame()\n \n colnames(d) = c(\"symbol\",\"variant_class\")\n \n d$patientID = sample_id\n \n return(d)\n \n}, sample_id=mut_sample_ids, mutation_assay = mut_assay,SIMPLIFY = F, USE.NAMES = F)\n\n\nvar_class_df = do.call(rbind, var_class_df)\n\n\nhead(var_class_df)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n symbol variant_class patientID\n1 WNK3 Frame_Shift_Ins TCGA-AF-2689\n2 IFRD1 Missense_Mutation TCGA-AF-2689\n3 PPP1R3A Missense_Mutation TCGA-AF-2689\n4 FAM71F1 Silent TCGA-AF-2689\n5 NOM1 Missense_Mutation TCGA-AF-2689\n6 TRIM73 Nonsense_Mutation TCGA-AF-2689\n```\n\n\n:::\n:::\n\n\nWe can visualize the most common mutated genes genes in rectum adenocarcinoma\n\n\n::: {#cell-fig-genesmut .cell}\n\n```{.r .cell-code}\n#| label: fig-genesmut\n#| message: false\n#| warning: false\n\np <- var_class_df |> \n as_tibble() |> \n group_by(symbol,variant_class) |> \n summarise(n = n()) |> \n arrange(desc(n)) |> \n ungroup() |> \n slice_max(order_by = n, n = 20) |> \n ungroup() |> \n mutate(symbol = fct_reorder(symbol,n)) |> \n ggplot(aes(n, symbol,fill=variant_class)) +\n geom_col() +\n facet_wrap(~variant_class) +\n paletteer::scale_fill_paletteer_d(\"awtools::a_palette\") +\n labs(x = \"Samples with specific mutation\", y=\"Gene Symbol\", fill=\"Variant Class\",\n title=\"Top 20 Samples with READ\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n`summarise()` has grouped output by 'symbol'. You can override using the\n`.groups` argument.\n```\n\n\n:::\n\n```{.r .cell-code}\n#| label: fig-genesmut\n#| message: false\n#| warning: false\n\npath <- \"images/\"\nggsave(paste0(path,\"samples_with_mut.jpeg\"), device = \"jpeg\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nSaving 7 x 5 in image\n```\n\n\n:::\n:::\n\n\n![Samples with specific mutation per gene](images/samples_with_mut.jpeg) \\## Linking mutations and tumor stage\n\nNow that we are familiarized with the mutation data, we can link mutations to the tumor stage from the patients with rectum adenocarcinoma.\n\nWe can filter the clinical data with the patients that have mutation data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nindex <- which( (var_class_df$patientID |> unique()) %in% clin$patientID)\n\nclin_and_mutation = clin[index,]\n\nhead(clin_and_mutation) \n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n# A tibble: 6 × 2,261\n patientID years_to_birth vital_status days_to_death days_to_last_followup\n \n1 TCGA-AF-2687 57 0 NA 1427\n2 TCGA-AF-2689 41 1 1201 NA\n3 TCGA-AF-2690 76 1 524 NA\n4 TCGA-AF-2691 48 0 NA 1309\n5 TCGA-AF-2692 54 0 NA 412\n6 TCGA-AF-2693 75 0 NA 1155\n# ℹ 2,256 more variables: tumor_tissue_site , pathologic_stage ,\n# pathology_T_stage , pathology_N_stage , pathology_M_stage ,\n# gender , date_of_initial_pathologic_diagnosis ,\n# days_to_last_known_alive , radiation_therapy ,\n# histological_type , tumor_stage , residual_tumor ,\n# number_of_lymph_nodes , ethnicity , admin.bcr ,\n# admin.day_of_dcc_upload , admin.disease_code , …\n```\n\n\n:::\n:::\n\n\nWe can count the number of genes with mutations per patient and then make a `left_join` to the clinical data to plot the mutated genes per tumor stage.\n\n\n::: {#cell-fig-plot1 .cell}\n\n```{.r .cell-code}\n#| label: fig-plot1\n\ndf = var_class_df |> \n group_by(patientID) |> \n summarise(n = n())\n\np <- clin_and_mutation |> \n left_join(df, by = join_by(patientID)) |> \n ggplot(aes(t_stage,log(n), fill=t_stage))+\n geom_boxplot() +\n geom_jitter(width = 0.05, colour=\"red2\") +\n paletteer::scale_fill_paletteer_d(\"awtools::a_palette\")+\n labs(x = \"Tumor stage\", y=\"Number of mutated genes in log scale\")\n\npath <- \"images/\"\nggsave(paste0(path,\"number_mut_genes_per_tumor_stage.jpeg\"), device = \"jpeg\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nSaving 7 x 5 in image\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Removed 20 rows containing non-finite outside the scale range\n(`stat_boxplot()`).\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Removed 20 rows containing missing values or values outside the scale range\n(`geom_point()`).\n```\n\n\n:::\n:::\n\n\n![Number of mutated genes per tumor stage](images/number_mut_genes_per_tumor_stage.jpeg)\n\nAlso, we can focus on a specific gene, e.g. [TP53](https://www.ncbi.nlm.nih.gov/gene/7157).\n\n\n::: {#cell-fig-plot_tp53 .cell}\n\n```{.r .cell-code}\n#| label: fig-plot_tp53\n\nmut_on_tp53 <- var_class_df |> \n dplyr::filter(symbol == \"TP53\") \n\n\np <-clin_and_mutation |> \n dplyr::filter(patientID %in% mut_on_tp53$patientID) |> \n dplyr::select(t_stage, patientID) |> \n group_by(t_stage) |> \n summarise(n = n()) |> \n # mutate(t_stage = fct_reorder(t_stage,n)) |> \n ggplot(aes(n, t_stage, fill=t_stage, label=n)) +\n geom_col() +\n geom_label(fill=\"white\") +\n paletteer::scale_fill_paletteer_d(\"awtools::a_palette\")+\n labs(y = \"Tumor Stage\", x=\"N° of Mutations\", fill=\"Tumor Stage\", title=\"Number of Mutations on Gene TP53\")\n\npath <- \"images/\"\nggsave(paste0(path,\"number_mut_tp53_per_tumor_stage.jpeg\"), device = \"jpeg\")\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nSaving 7 x 5 in image\n```\n\n\n:::\n:::\n\n\n![Number of mutations on gene TP53 per tumor stage](images/number_mut_tp53_per_tumor_stage.jpeg)\n\n# Linking expression and tumor stage\n\nWe also can link the expression data (RNA-Seq) with the tumor stage data from the clinical metadata. [Check the TCGA RNA-Seq protocol online]{.aside}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrnaseq = readData[[2]]\n\nrnaseq\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nclass: SummarizedExperiment \ndim: 18115 177 \nmetadata(3): filename build platform\nassays(1): ''\nrownames(18115): A1BG A1CF ... ZZZ3 psiTPTE22\nrowData names(0):\ncolnames(177): TCGA-AF-2687-01 TCGA-AF-2689-11 ... TCGA-AG-A032-01\n TCGA-AG-A036-01\ncolData names(0):\n```\n\n\n:::\n:::\n\n\nAs before, we need to shorten the sample IDs so they can match to the clinical data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(rnaseq) = colnames(rnaseq) |> \n stringr::str_sub(1,12)\n\nclin <- clin |> as.data.frame()\nrownames(clin) <- clin$patientID\n\nindex <- which(colnames(rnaseq) %in% rownames(clin))\n\ncolData(rnaseq) = as(clin[colnames(rnaseq),],\"DataFrame\")\n\n\nidx <- is.na(colData(rnaseq)$t_stage)\n\nrnaseq <- rnaseq[,!idx]\n```\n:::\n\n\nThen, we can use the `limma` package to proceed with the differential expression analysis. In this case, we'll ise the tumor stage as a variable to explain the expression of the genes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(limma)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'limma' was built under R version 4.3.1\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\n\nAttaching package: 'limma'\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nThe following object is masked from 'package:BiocGenerics':\n\n plotMA\n```\n\n\n:::\n\n```{.r .cell-code}\nmm = model.matrix(~t_stage, data=colData(rnaseq))\nf1 = lmFit(assay(rnaseq), mm)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Partial NA coefficients for 5 probe(s)\n```\n\n\n:::\n\n```{.r .cell-code}\nef1 = eBayes(f1)\ntopTable(ef1) |> \n arrange(desc(AveExpr))\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nRemoving intercept from test coefficients\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n t_staget2 t_staget3 t_staget4 AveExpr F P.Value\nPAM -0.5791836 0.08813521 0.4984790 10.94216940 8.309141 3.409613e-05\nPAIP2 -0.3240061 0.18551765 0.0958645 10.29398797 9.056896 1.331550e-05\nSPP1 -0.3352086 1.56339367 2.6409082 9.61423581 9.316731 9.621982e-06\nNIPSNAP3A -0.9988818 -0.35051776 -0.6085837 9.38910637 8.712515 2.051165e-05\nASPN -0.7366657 0.75340984 1.9537316 7.45952050 8.323019 3.350393e-05\nOLR1 -0.2557879 1.39542897 2.1822156 4.97144475 8.531715 2.597133e-05\nSEMA5B 1.3576218 1.27252058 1.9271042 4.60337761 8.091237 4.490926e-05\nCPSF4L 1.0629502 -0.17984587 -1.4274556 0.70036223 8.955085 2.311056e-05\nGRM5 1.3027501 -0.14596612 -0.7395609 0.06134539 10.268851 7.301656e-06\nC1orf223 0.7830779 -0.40735736 -0.7062534 -0.35136703 9.681632 1.743477e-05\n adj.P.Val\nPAM 0.06860899\nPAIP2 0.06719153\nSPP1 0.06719153\nNIPSNAP3A 0.06719153\nASPN 0.06860899\nOLR1 0.06719153\nSEMA5B 0.07597995\nCPSF4L 0.06719153\nGRM5 0.06719153\nC1orf223 0.06719153\n```\n\n\n:::\n:::\n\n\nLet's visualize two of the most expressed genes.\n\n\n::: {.cell}\n\n```{.r .cell-code}\npar(mfrow=c(1,2))\nboxplot(split(assay(rnaseq)[\"PAM\",], rnaseq$t_stage), main=\"PAM\") # higher expression in lower t_stage\nboxplot(split(assay(rnaseq)[\"PAIP2\",], rnaseq$t_stage), main=\"PAIP2\")\n```\n\n::: {.cell-output-display}\n![](TCGA_files/figure-html/unnamed-chunk-25-1.png){width=1260}\n:::\n:::\n\n\n# Linking methylation and expression\n\nFinally, we can use the methylation data with the expression data. This is important because methylated cytosines of the DNA change the expression of the genes.\n\nSome patients have methylation data for multiples tissue types. This information is encoded in the fourth component of the sample names. The code `01A` correspond to primary tumor samples and the code `11A` correspond to normal tissue. We'll keep the primary tumor samples.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmethyl <- readData[[3]]\n\nidx <- colnames(methyl) |> \n str_split(pattern = \"-\") |> \n map(4) |> \n enframe() |> \n unnest(value)\n\nidx <- which(idx$value == \"01A\")\n\nmethyl = methyl[,idx]\n\nmethyl\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nclass: SummarizedExperiment \ndim: 485577 95 \nmetadata(0):\nassays(1): counts\nrownames(485577): cg00000029 cg00000108 ... rs966367 rs9839873\nrowData names(3): Gene_Symbol Chromosome Genomic_Coordinate\ncolnames(95): TCGA-AF-2687-01A-02D-1734-05 TCGA-AF-2690-01A-02D-1734-05\n ... TCGA-G5-6572-01A-11D-1828-05 TCGA-G5-6641-01A-11D-1828-05\ncolData names(0):\n```\n\n\n:::\n:::\n\n\nAs before, let's truncate the names of the sample to match the clinical data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(methyl) <- colnames(methyl) |> \n str_sub(start = 1,end = 12)\n```\n:::\n\n\nWe can add the clinical data to the methyl object and count the number of patients with methylation and transcription data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolData(methyl) <- as(clin[colnames(methyl),],\"DataFrame\")\n\n\nintersect(colnames(methyl), colnames(rnaseq)) |> length()\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 93\n```\n\n\n:::\n:::\n\n\nLet's subset common sample names and check the methylation data as row ranges.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmethyl_subset = methyl[,which(colnames(methyl) %in% colnames(rnaseq))]\nrnaseq_subset = rnaseq[,which(colnames(rnaseq) %in% colnames(methyl))]\n\nmethyl_genes = rowData(methyl_subset)\nmethyl_genes\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nDataFrame with 485577 rows and 3 columns\n Gene_Symbol Chromosome Genomic_Coordinate\n \ncg00000029 RBL2 16 53468112\ncg00000108 C3orf35 3 37459206\ncg00000109 FNDC3B 3 171916037\ncg00000165 NA 1 91194674\ncg00000236 VDAC3 8 42263294\n... ... ... ...\nrs9363764 NA NA 0\nrs939290 NA NA 0\nrs951295 NA NA 0\nrs966367 NA NA 0\nrs9839873 NA NA 0\n```\n\n\n:::\n:::\n\n\nThis function takes a gene symbol and returns a scatter plot showing the relationship between 3 different sites near that gene and gene expression.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n#| label: figplot\n\nme_rna_cor = function(sym, mpick = 3){\n require(GGally)\n # subset methylation data to first mpick methylation sites for given gene symbol\n methyl_ind = which(methyl_genes$Gene_Symbol == sym)\n if (length(methyl_ind) > mpick){ \n methyl_ind = methyl_ind[1:mpick]\n }\n methyl_dat = assay(methyl_subset)[methyl_ind,] # subset to selected methylation sites\n\n # subset expression data to selected gene symbol\n expr_ind = which(rownames(rnaseq_subset) == sym) \n expr_dat = assay(rnaseq_subset)[expr_ind,]\n\n # combine methylation and expression data as data frame\n combined_dat = as(t(methyl_dat), \"DataFrame\")\n combined_dat$expr = expr_dat\n\n # plot pairs and calculate correlation coefficients between methylation and expression\n ggpairs(combined_dat) |> print()\n sapply(1:mpick, function(i){\n \n cor_to <- as.numeric(combined_dat[,mpick+1])\n \n df <- data.frame(v1= as.numeric(combined_dat[,i]),\n v2 = cor_to)\n \n df <- df |> na.omit()\n \n cor(df[,1], df[,2])\n })\n}\n\nme_rna_cor(\"TAC1\", mpick=3)\n```\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nLoading required package: GGally\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: package 'GGally' was built under R version 4.3.2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nRegistered S3 method overwritten by 'GGally':\n method from \n +.gg ggplot2\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :\nRemoved 5 rows containing missing values\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :\nRemoved 5 rows containing missing values\nWarning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :\nRemoved 5 rows containing missing values\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Removed 5 rows containing missing values or values outside the scale range\n(`geom_point()`).\nRemoved 5 rows containing missing values or values outside the scale range\n(`geom_point()`).\nRemoved 5 rows containing missing values or values outside the scale range\n(`geom_point()`).\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr .hidden}\n\n```\nWarning: Removed 5 rows containing non-finite outside the scale range\n(`stat_density()`).\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](TCGA_files/figure-html/unnamed-chunk-30-1.png){width=1260}\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.0410095 0.1457715 0.0856926\n```\n\n\n:::\n:::\n", "supporting": [ "TCGA_files" ], diff --git a/notebooks/ENCODE_project.qmd b/notebooks/ENCODE_project.qmd new file mode 100644 index 0000000..66662eb --- /dev/null +++ b/notebooks/ENCODE_project.qmd @@ -0,0 +1,18 @@ +--- +title: "Analyzing Cloud Data from ENCODE" +subtitle: "The Encyclopedia of DNA Element (ENCODE)" +format: html +editor: visual +--- + +# Introduction + +The Encyclopedia of DNA Elements ([ENCODE](https://www.encodeproject.org/)) is a collection of experimental results (from human, mouse, worm and fly) and computing resources devoted to elaborating the mechanisms of gene regulation. + +We will use Bioconductor to acquire and interpret ENCODE resources. In particular, we'll analyze ChIP-Seq, ATAC-Seq and CRISPR interference RNA-Seq experiments. + +# ENCODE ChIP-seq Data + +```{r} + +``` diff --git a/notebooks/TCGA.qmd b/notebooks/TCGA.qmd index 687ba9a..6810ca4 100644 --- a/notebooks/TCGA.qmd +++ b/notebooks/TCGA.qmd @@ -1,6 +1,6 @@ --- title: "Integrative Analysis with TCGA Data" -subtitle: "Analysis of Mutation Data from The Cancer Genome Atlas (TCGA)" +subtitle: "Analysis of Mutation, Transcription and Methylation Data from The Cancer Genome Atlas (TCGA)" format: html editor: visual --- @@ -247,8 +247,7 @@ ggsave(paste0(path,"samples_with_mut.jpeg"), device = "jpeg") ``` -![Samples with specific mutation per gene](images/samples_with_mut.jpeg) -## Linking mutations and tumor stage +![Samples with specific mutation per gene](images/samples_with_mut.jpeg) \## Linking mutations and tumor stage Now that we are familiarized with the mutation data, we can link mutations to the tumor stage from the patients with rectum adenocarcinoma. @@ -327,6 +326,7 @@ rnaseq = readData[[2]] rnaseq ``` + As before, we need to shorten the sample IDs so they can match to the clinical data. ```{r} @@ -348,7 +348,7 @@ rnaseq <- rnaseq[,!idx] ``` -Then, we can use the `limma` package to proceed with the differential expression analysis. In this case, we'll ise the tumor stage as a variable to explain the expression of the genes. +Then, we can use the `limma` package to proceed with the differential expression analysis. In this case, we'll ise the tumor stage as a variable to explain the expression of the genes. ```{r} @@ -462,4 +462,3 @@ me_rna_cor = function(sym, mpick = 3){ me_rna_cor("TAC1", mpick=3) ``` - diff --git a/notebooks/images/number_mut_genes_per_tumor_stage.jpeg b/notebooks/images/number_mut_genes_per_tumor_stage.jpeg index 39ec99b..e6cb139 100644 Binary files a/notebooks/images/number_mut_genes_per_tumor_stage.jpeg and b/notebooks/images/number_mut_genes_per_tumor_stage.jpeg differ