proposal.Rmd

---
# setspace: doublespacing
output: 
    pdf_document:
      includes:
        in_header: import.sty
        before_body: title.sty
        after_body: tail.sty
      keep_tex: yes
      number_sections: yes
      # toc: yes
      toc_depth: 2

# linestretch: 2
bibliography: MasterOfCellTypes.bib
fontsize: 11
# csl: chicago-author-date.csl
csl: nature.csl
---

```{r include=FALSE, cache=FALSE}
library(knitr)
```
<!-- \todo{Here's something I might need to add.} -->

Motivation and Introduction 
====================================================

The brain is a heterogeneous organ composed of a wide variety of cell types that can be very closely related (eg. neuronal subtypes) or highly differentiated from each other (eg. neurons and glial cells). While this heterogeneity is well studied, most large scale expression studies that focus on brain disorders use whole tissue samples to examine the effects of diseases [@iwamoto_molecular_2004;@maycox_analysis_2009;@hokama_altered_2014]. Though examination of whole tissue remains popular due to its relative ease and low cost, it does not reveal which cell types are affected by the changes and complicates detection of changes in less abundant cell types due to signal dilution [@chikina_cellcode_2015]. Studies which do use single cell types are typically examining only one type at a time to compare conditions such as disease [@heiman_molecular_2014;@dougherty_disruption_2013], development [@okaty_transcriptional_2009;@bellesi_effects_2013], or they are designed to discover unique properties, such as marker genes or electrophysiological properties, among a small subset of cell types of interest [@sugino_molecular_2006;@doyle_application_2008].

The current state of the literature on neurological diseases and brain cell types leaves several questions unanswered:

1. **What are the marker genes of a cell type in the scope of the entire brain region they reside in?** \newline
Discovery or marker genes in more general scopes is important because unique cell type markers are needed to identify the cell type in whole tissue samples in many types of experiments (eg. cell type specific microarrays, in-situ hybridization). Even when a cell type has a well defined marker gene, whole experimental design can be compromised if that gene is regulated under the condition tested by the experiment [@sommeijer_synaptotagmin-2_2012]. Having as many marker genes as possible for a specific cell makes this less likely since researchers can swap the marker gene they are using with another one if needed. Marker genes were also shown to be useful in computational settings [@westra_cell_2015;@chikina_cellcode_2015;@okaty_quantitative_2011].

2. **How do cell type proportion's change in neurological diseases?** \newline
While there are many known cases where the cell type proportions change in the brain such as Parkinson's or Alzheimer's diseases, in many other conditions, changes are much less clear. Experiments that aim to reveal cell type proportion changes require the cell types to be labeled and counted via methods like in situ hybridization or immunohistochemistry. These experiments are lengthy, expensive and often assume that a single marker gene is stably expressed in all samples regardless of the condition.


3. **Under neurologically relevant conditions, which gene expression changes occur in which cell types?** \newline
Apart from the fact that it is harder to detect differentially expressed genes in whole tissue studies, it is also uncertain which cell types are effected by the changes when the changes are detected. Acquiring this information requires isolation and expression profiling of the cell types in different conditions and is not practical to perform for a high number of conditions and a high number of cell types.

For this work, I formed a comprehensive database of cell type-specific expression profiles from a variate of resources in an attempt to resolve the problems described above. This database, along with in siloco and in vitro validation methods, is used to detect marker genes that best represent a particular cell type in it’s associated brain region. We believe that this list of cell type specific marker genes will be useful to scientific community in characterization of cell types and understanding neurological diseases.

To make further use of the discovered marker genes I will consider their expression levels in whole tissue samples as surrogates for their abundance in the sample. This will allow me to understand the fate of cell type populations under specific diseases and conditions. Finally, I will use these surrogate proportions as covariates in statistical tests in order to improve the statistical power of differential expression analyses and to tell which cell types are affected by specific changes.

The schematic of the project can be found in Figure \ref{fig:workflow}.

<!-- \newpage -->

Research questions and specific aims 
=======================================================================================================================

Research questions 
------------------------------------------------------------------------------------------------------------------------

### What are the specific marker genes of brain cell types? 

Cell types of the brain, particularly neurons are loosely defined in terms of their marker genes and properties.  Most research focusing on cell types isolates a small number of related cell types and characterizes the cells in relation to each other [@okaty_transcriptional_2009;@sugino_molecular_2006]. Relatively few studies [@zeisel_cell_2015;@tasic_adult_2016] attempt to characterize cell types in the context of other known cell types of the brain. Absence of such a comprehensive approach in the literature motivates our approach of choosing marker genes using an inclusive database of cell type expression profiles gathered from multiple independent studies to be as inclusive as possible.

### Are mouse marker genes applicable to humans? 

Most available data in the literature on isolated cell types originates from mouse cells. Ideally researchers would like to have information about human marker genes as well. It is necessary to assess how well marker genes detected in mice can be applied to humans.

### How accurately can cell type proportions be predicted with the use of marker genes? 

Since marker genes are specific to a cell type by nature, their expression in whole tissue samples can be used as a surrogate for cell type proportion. Even though this is not a new approach, it is necessary to show how accurate it is for the brain.

### How do cell type proportions change accross neurological diseases? 

It is known that many diseases of the CNS are neurodegenerative in nature. Computational prediction of cell type proportion will allow us to show which cell types are effected in any given condition

### Can cell type specific regulatory events be detected using cell type proportion information? 

Enumeration of cell types in a sample allows these values to be used as covariates in other models. This information was previously used to improve accuracy of differential expression studies and assign differentially expressed genes to cell types [@chikina_cellcode_2015]. Applying this method to neurological diseases may uncover cell type specific changes to gene expression.

### How do recent single cell experiments correlate with each other and cell type specific microarray samples in the literature? 

There has been a recent surge in single cell RNA sequencing experiments attempting to characterize the cell types of the brain [@zeisel_cell_2015;@tasic_adult_2016]. Such studies often use different sequencing and clustering methods to define cell types and find marker genes. Due the complex nature of cell type determination and incompleteness of RNA-seq data, it is important to know how well the results correlate with each other and with pre-existing microarray studies working on the same cell types. 

Specific aims 
------------------------------------------------------------------------------------------------------------------------

### Aim 1: Compilation of cell type specific expression database and make it available to third parties 

1. Gather high quality gene expression data representing brain cell types
2. Employ quality control measures to maximize data integrity
3. Make the data available in a web application for easy access.

### Aim 2: Identification and verification of marker gene sets 

1. Detect cell type marker genes in a region based on the localization of their expression
2. Verify marker genes in independent datasets and by in situ hybridization
3. Asses the concordance of single cell RNA seq data with each other and with the cell types in our database

### Aim 3: Estimation of cell type proportions 

1. Using marker gene expression, estimate the relative proportions of particular cell types in whole tissue samples an a variety of conditions.
2. Look for generalizable effects in conditions such as neurological diseases.
3. Use datasets on neurological diseases with known effects on cell type composition as positive controls to validate estimation method.
4. Use an independent dataset of isolated blood cell types and manually enumerated blood samples to repeat and validate the estimation method.
5. Use estimated relative proportion information to improve the accuracy of differential expression analyses.
6. Create an R package for easy application of the method by other researchers.

<!-- \newpage -->

Background 
=======================================================================================================================

Cell type specific effects in neurological conditions
------------------------------------------------------------------------------------------------------------------------
A major focus of this my thesis will be the detection of cell type specific changes. The brain is composed of a variety of cell types and is one of the most heterogeneous organs in the mammalian body. In neurological diseases and conditions, it is common to see that certain effects are cell type specific. During development in rats, different cell types mature and proliferate at different rates, neurons in general developing earlier than astrocytes and oligodendrocytes [@sauvageot_molecular_2002]. Inflammation, which is observed in many neurological diseases [@aguzzi_microglia_2013], and it is known to hinder neurogenesis [@ekdahl_brain_2009] and cause neuronal degradation [@aguzzi_microglia_2013]. Neurodegenerative diseases can cause degredation of specific cell types such as Parkinson's Disease with dopaminergic cells [@hegarty_midbrain_2013]. Detection of such changes is possible by using common laboratory methods can be laborious and often impossible due to scarcity of the samples, as explained in section 3.4, several computational methods were developed to reveal this information from whole tissue expression data.

Cell type specific effects are not limited to proportion changes. Gene expression of individual cell types can also be subject to change under specific conditions. Inflammation for instance, causes significant changes in microglia transcriptome [@holtman_glia_2015] and specific neurological diseases can cause gene expression changes in individual cells. In Parkinson's disease for instance dopaminergic cells both decrease in numbers and experience changes in expression levels in a large number of genes [@simunovic_gene_2009]. Just like detecting cell type proportion changes, cell type specific expression changes are difficult by common laboratory methods since it requires isolation and expression profiling of the target cell types. On the other hand whole tissue analysis coupled by cell counts or estimation of cell type proportions have been suggested to allow access to this information [@chikina_cellcode_2015].


<!-- The brain is composed of a variety of cell types and is one of the most heterogeneous organs in the mammalian body. Alongside the major groups of cell types associated with it: neurons and glia, it also contains blood and endothelial cells. The major neuron and glia types that are included in our cell type specific expression database are described briefly below. -->

<!-- ### Glia -->

<!-- - **Astrocytes** are star shaped cells present throughout the brain [@sofroniew_astrocytes_2010]. They have roles in preserving chemical balance in synapses, regulating blood flow by controlling vessel diameters and by providing support to endothelial cells forming the blood brain barrier [@sofroniew_astrocytes_2010]. Relatively recently it was also discovered that they play roles in signal transduction by regulating intracellular ion concentrations and releasing gliaotransmitters [@jahn_genetic_2015]. Astrocytes often proliferate in diseased brain [@sofroniew_astrocytes_2010]. -->

<!-- - **Oligodendrocytes** are responsible for forming the mylein sheet around neurons to insulate their axons [@bradl_oligodendrocytes_2010]. A single oligodendrocyte ensheats multiple neuronal cells as a result of a highly coordinated process affected by axon size, neuronal activity and molecular signalling [@bradl_oligodendrocytes_2010]. Most myelination occurs early in the differentiation process. Alongside myelination, oligodendrocytes also have neuroprotective functions and are known to dysfunction in several neurodegenerative diseases [@bankston_oligodendroglia_2013]. -->

<!-- - **Microglia** are resident macrophages of the brain [@aguzzi_microglia_2013]. They clear apaptotic cells and are involved in maintenance of synapses [@aguzzi_microglia_2013], and are the antigen presenting cells of the brain [@graeber_microglia_2010]. In most neurological diseases, they activate and proliferate [@aguzzi_microglia_2013], and their proliferation is associated with degradation of neuronal cells [@graeber_microglia_2010]. -->

<!-- ### Neurons -->

<!-- **Pyramidal Cells** are prominent in areas of the brain associated with high cognitive functions such as the cerebral cortex and hippocampus [@spruston_pyramidal_2008]. They are recognized by their triangular cell body and short basal dendrites [@spruston_pyramidal_2008]. Pyramidal cells of different brain regions and layers have differences in structure and gene expression [@spruston_pyramidal_2008]. They can receive inhibitory GABAergic inputs through soma and axon, while excitatory signals are delivered through the dendrites [@spruston_pyramidal_2008]. Dendrites of pyramidal cells are covered with spines that act as synapse sites for glutamatergic synapses [@spruston_pyramidal_2008]. -->

<!-- **Cortical GABAergic neurons** are local inhibitory neurons that control pyramidal cell firing and generation of cortical rhythms. In the cortex, three major groups are defined: PV expressing, SST expressing and 5HT3a expressing [@rudy_three_2011]. These subtypes of gabaergic cells localize into specific layers of the neocortex [@rudy_three_2011]. PV expressing neurons are fast spiking cells with low input resistance [@rudy_three_2011]. They are thought to be the dominant inhibitory system in the cortex. SST positive cells often receive facilitating signals from pyramidal neurons to provide inhibitory feedback [@rudy_three_2011]. Finally 5HT3aR interneurons are a more heterogeneous subgroup of interneurons which differ in function and morphology [@rudy_three_2011]. -->

<!-- **Midbrain dopaminergic neurons** are primarily thought to regulate motor functions, emotion and reward [@hegarty_midbrain_2013]. During development of the midbrain they separate into three distinct clusters of cells [@hegarty_midbrain_2013]. Cells making up the A9 cluster, which forms substantia nigra are lost during the progression of Parkinson's disease [@hegarty_midbrain_2013]. -->

<!-- **Purkinje cells** are GABAergic cells local to cerebellum's purkinje layer [@ito_historical_2002]. They receive signals from nearby granule cells and send inhibitory signals to other purkinje cells towards deep cerebellar nuclei [@ito_historical_2002]. They have functions in motor learning [@ito_historical_2002]. -->

<!-- **Cholinergic cells** are involved in memory formation  [@baxter_selective_2013]. In particular, they are associated with Alzheimer's Disease [@baxter_selective_2013]. They are also responsible for maintaining the circadian rhythm, by a high acetylcholine release during awake periods [@hut_cholinergic_2011]. -->


Cell type markers and their applications 
------------------------------------------------------------------------------------------------------------------------

A product of my work is a list of cell type markers. Marker genes are useful in many ways to understand the biology of their associated cell type. Primarily, they can be used to identify cells of interest in whole tissue samples for purposes such as counting and purifying cells. Marker genes are also powerful tools in computational experiments. For example, they can be used as features in deconvolution of complex tissue samples as described in following sections.

Some brain cell types have specific and well known markers that makes them easy to identify. Many of these genes have known functions that are directly related to the cell type. These include Mog, an oligodendrocyte marker [@linington_augmentation_1988] with roles in myelination and Aif1, a microglia marker with roles in inflammation [@imai_novel_1996] for glial cells; Th that is responsible for dopamine synthesis in dopaminergic cells [@hegarty_midbrain_2013], and Gad1/2 that catalyzes production of GABA in gabaergic cells [@rudy_three_2011] for neurons. Many neuron types, however, do not have known markers. For instance while Gad1/2 are useful markers of gabaergic neurons, many finer subtypes of gabaergic neurons lack known specific markers which makes studying them more challenging [@rudy_three_2011].

Methods for cell type purification
------------------------------------------------------------------------------------------------------------------------

In my work, detection of cell type markers will rely on cell type specific gene expression profiles, acquired through isolation of cell types. Isolation of single cell types is necessary as a precursor to their proper characterization, or analysis of specific cells in different conditions such as to diseases or chemicals. There are multiple ways to isolate the cell types of interest with which vary in precision and quality. Most commonly, such methods rely on one or more marker genes specific to the cell type, selectively isolating cells that express the marker. A well established method is  Fluorescence Activated Cell Sorting (FACS) where one or more protein or RNA is labelled to be fluorescently active, either through genetic manipulation or labelled antibodies respectively. Cells are then gated according to specific conditions (eg. expression or absence of a gene) (Figure \ref{fig:isolationMethods} A) [@kang_embryonic_2011]. Another established method of isolation is immunopanning where antibodies layered onto plate are used to hold the cells that express a specific surface marker (Figure \ref{fig:isolationMethods} B) [@cahoy_transcriptome_2008]. A relatively recent marker based isolation is Translating Ribosome Affinity Purification [@heiman_cell_2014]. This method combines the promoter of a marker gene with the coding region of L10a ribosomal subunit fused with green fluorescent protein (GFP) [@heiman_cell_2014]. The tissue is degraded en masse and after fixation, ribosomes marked with GFP are captured, ensuring only translating RNAs from the target cell type are isolated (Figure \ref{fig:isolationMethods} C) [@sanz_cell-type-specific_2009;@heiman_cell_2014]. Alternatively, based on visible characteristics or expression of known markers, cells can be visually located on the tissue and isolated by manual extraction or laser capture microdissection (LCM) (Figure \ref{fig:isolationMethods} D) [@liotta_molecular_2000]. The resulting samples from each of these methods have varying purity [@okaty_quantitative_2011] which is a potential confound on studies that use samples acquired by multiple methods.

Methods for estimation of cell type proportions from expression patterns 
------------------------------------------------------------------------------------------------------------------------

Part of my efforts will be estimation of relative cell type proportions in complex tissue samples in a reference-free manner. Generally, expression profile of complex tissues can be modelled as

\begin{equation} \label{eq:01}
X_{ij} = \sum_{k=1}^{K}{W_{ij}h_{kj} + e_{ij}} 
\end{equation} 

where $X_{ij}$ is the expression value from a complex sample for genes $j$ and sample $i$, $W_{ik}$ is a matrix containing cell type proportions for sample $i$ and cell type $k$, $h_{kj}$ is the cell type specific gene expression of cell type $k$ and gene $j$ and $e_{ij}$ represents random error. Various methods can be applied to acquire information about the matrices $W$ and $h$. Two main classes of deconvolution methods exist; namely, reference-based and reference-free methods which will be discussed in the next subsections. In mammals, proportion estimation (deconvolution) methods are commonly applied to blood data due to ease of access to both mixed samples and isolated cell types [@westra_cell_2015;@newman_robust_2015;@chikina_cellcode_2015].

While the use of deconvolution in brain is not a new idea, so far, applications have been restricted to a superficial level. Early studies that attempted to estimate cell type proportions in human brains grouped all neurons together as a single cell type alongside astrocytes, oligodendrocytes and microglia[@kuhn_population-specific_2011]. Later, more in depth deconvolution was performed in human cortex and cerebellum which estimated cerebellar neuron types separately while leaving cortical neurons as a single group [@xu_cell_2013]. Deconvolution of human brains is difficult due to the absence of human cell type specific expression profiles from human brain cell types which prevents proper use of reference-based deconvolution methods and lack of high numbers of reliable marker genes which prevents reduces the reliability of reference-based deconvolution methods. Expression profiles of the cell types of the mouse brain on the other hand are available in the literature. 

A reference-based deconvolution of 64 distinct cell types was performed on whole tissue expression profiles across various brain regions was perfomed by Grange et al. [@grange_cell-typebased_2014]. In the study by Grange et al., proportion estimations of most cell types agreed with the literature, but the authors also reported paradoxical results such as detecting high levels of purkinje cells in thalamus instead of cerebellum[@grange_cell-typebased_2014]. Also no attempt was made to deconvolute cell types in samples from the same regions but under different conditions (eg. disease models, developmental stages).

### Reference-based deconvolution

Reference-based deconvolution methods assume we have accurate information about the matrix $h$: expression profiles of the cell types in the tissue. At the most basic level, researchers try to estimate the $W$ (matrix of cell type proportions) in solving the equation \ref{eq:01} by minimizing the sum of squares of $e$ (error) [@grange_cell-typebased_2014]. This approach assumes that **1)** reference expression profiles are good matches to the actual expression of the cell types in the mixed sample, which can be violated due to noise or differences in RNA extraction methods, and **2)** the reference dataset has all cell types represented in the mixed sample, which can be violated by the presence of previously uncharacterized cell types in the region. To combat such problems, different methods of feature selection that aim to identify the most informative parts the reference expression matrix can be used which makes the estimation process more robust [@newman_robust_2015].

### Reference-free deconvolution

In cases where cell type expression profiles are not available or are likely to have high level of error compared to the real expression of the cell types in the mixed sample, reference-free deconvolution methods provides an alternative. A common method is to use expression of certain marker genes as a surrogate for cell type proportions [@xu_cell_2013;@chikina_cellcode_2015;@okaty_quantitative_2011;@westra_cell_2015]. Even though the marker genes themselves are often acquired from a reference expression dataset, deconvolution is independent of their expression in the reference. Often the first principal component of the genes in the whole tissue samples is used as a surrogate [@xu_cell_2013;@chikina_cellcode_2015;@tan_neuron-enriched_2013]. This assumes that most of the used marker genes are not differentially regulated between samples and the main source of variation is the difference in the cell type proportions across samples.

<!--Midbrain serotonergic cells** are known to react to pH changes [@severson_midbrain_2003] -->


Expression profiling 
------------------------------------------------------------------------------------------------------------------------

Expression profiles are the major component of my work. Microarrays and RNA-seq are the two competing methods for high-throughput expression profiling. I work on expression profiling datasets that are extracted using both of these methods, therefore it is important to make note of properties and limitations of these methods.


### Microarrays 

RNA microarray is the most common way to quantify RNA in a high-throughput manner and its development have been a transformative force in many branches of biology [@hoheisel_microarray_2006]. Microarrays use single standed probes, specific to a location on the target genome to quantify RNA. These probes are placed on known locations on a solid surface. These probes are later hybridized to a labelled complementary DNA (cDNA) acquired by reverse transcription of a target transcriptome. The amount of cDNA that hybridizes to a probe is quantified by staining the label attached to the cDNA molecules [@hegde_concise_2000]. 

<!--Often, multiple probes target a nearby region on the genome which are then summarized to make up the signal for a single gene [@gautier_affyanalysis_2004].-->

<!-- Often, multiple probes target a close-by region on the genome to form a probeset that target a single gene. The standard output for most normalization techniques is a summarization of the probes that make up a probeset [@gautier_affyanalysis_2004]. -->

There are several of microarray platforms available for researchers to choose from. These platforms primarily differ in the probes they use, which can cover a different number of genes and/or cover the same genes using different sequences.

Microarrays are extensively used in neurobiology as a tool for expression analysis. A wide array of data is available at both in the tissue [@iwamoto_molecular_2004;@kang_spatio-temporal_2011;@torrey_stanley_2000] and isolated cell type [@okaty_transcriptional_2009;@sugino_molecular_2006;@dougherty_disruption_2013] level.

<!-- A short summary of microarray methodology can be found at Figure \ref{fig:microarrayRnaSeq}. -->

### RNA sequencing 

RNA sequencing (RNA-seq) is a much more recent method of RNA quantification. RNA seq is performed by sequencing of cDNA molecules acquired from the reverse transcription of an entire target transcriptome. Unlike microarrays, they do not target specific genes, hence can give a much comprehensive picture of the transcriptome including gene discovery and splicing variant identification. Quantification is done by normalizing the number of reads found from a single transcript to the length of the transcript and to the total number of transcripts [@wang_rna-seq_2009]. A major limitation of RNA-seq is it's accuracy drops significantly in lower expression levels. It was shown that less than 30% of the genes were able to be quantified with high accuracy [@labaj_characterization_2011].

Recently, RNA sequencing of single cells is becoming increasingly popular [@kolodziejczyk_technology_2015]. While single cell RNA-seq is a powerful tool that allows characterization of individual cells in the population, due to scarcity of the starting product, technical biases resulting from amplification and sequencing are more prominent [@kolodziejczyk_technology_2015]. Single cell studies are starting to gain popularity in neuroscience [@zeisel_cell_2015;@darmanis_survey_2015]. Due to its heterogeneous structure, the brain is a prime target for single cell studies that allows differentiation of individual cell types with much less concern of isolating heterogeneous samples.

<!-- A short summary of RNAseq methodology can be found at Figure \ref{fig:microarrayRnaSeq}. -->

<!-- ### Analysis of RNA quantification results  -->

<!-- Most common use of RNA quantification is to observe differential expression between two or more groups. This is done in order to find what genes are effected by certain conditions or expression is different between different cell or tissue types to a degree of statistical significance [@gautier_affyanalysis_2004;@islam_quantitative_2014;@darmanis_survey_2015]. Another common method of analysis is coexpression, where researchers attempt to identify genes that show similar changes in expression across different samples [@hughes_functional_2000]. This information is often used to derive functional relationships between the genes [@kim_gene_2001;@stuart_gene-coexpression_2003]. -->

<!-- A higher level analysis of RNA quantification results uses more complex methods to deconvolute cell type proportions as discussed in a later section. -->

<!-- ### RNA quantification vs protein quantification -->
<!-- \todo{The part is there since RNA levels is often just a surrogate for protein level information. and I felt necesary to state that a marker gene might not be actually translated. this also aims to justify why protein quantification was not used instead} -->

<!-- In the process of central dogma, messenger RNAs, that make up the majority of the genes that researchers are interested in, are later translated into proteins. When researchers want to make inferences about the functional impact of expression changes in a gene, they often have to quantify protein levels instead since it is known that RNA levels do not always correlate with protein levels [@maier_correlation_2009]. RNA quantification experiments are still more common than protein quantification experiments due to their scalability and low cost. While RNA quantification has its limitations in terms of functional information it provides, it's analysis is still useful  -->

<!-- Proteins are the main functioning units in a cell. They are the ultimate products of the central dogma the way it is traditionally described. DNA transcribing RNA and RNA is translated into proteins. Quantification of specific proteins is, while possible, much more difficult than quantification of specific RNA molecules and often not scalable to the same degree. RNA quantification on the other hand have been historically used much more frequently on a high throughput fashion particularly with the rise of microarrays and more recently RNA sequencing. A common concern about RNA quantification is that they do not always correlate with protein levels [@maier_correlation_2009]. Therefore care should be taken when considering their biological significance. -->

<!-- While RNA quantification has its limitations in terms of functional information it provides, its anaylsis still provides useful information for our purposes since RNAs **1)** can still be used to label specific cell types in whole tissue samples, **2)** can be used to choose targets for protein quantificatin studies, **3)** can be  -->


Aim 1: Compilation of cell type specific expression database available to third parties 
=======================================================================================================================
The first aim of the project, which also provides the groundwork for later stages, is to compile a comprehensive database of cell type specific expression profiles. The database is a valuable resource since it allows comparison of all available cell types to each other, allowing us to find specific expression patterns. The database is collected from the datasets available through Gene Expression Omnibus (GEO) and through personal communications. We made the data available via a web application that allows easy browsing of the data. Mouse cell type specific data being is used due to its higher quality and abundance compared to that available for humans.

Data acquisition and preprocessing 
------------------------------------------------------------------------------------------------------------------------

The bulk of the dataset is based on a previous compilation made by Okaty et al. [-@okaty_quantitative_2011] for a study comparing different cell type isolation methods. This initial dataset was obtained using Affymetrix Mouse Expression 430A Array (430A) and Affymetrix Mouse Genome 430 2.0 Array (430.2). Data from these two platforms is straightforward to combine since the 430A array contains a subset of the probesets in the 430.2 array. Due to the high availability of the data collected 430A and 430.2 arrays and to simplify processing of data, we decided to populate our database with datasets from these platforms. We queried [GEO](http://www.ncbi.nlm.nih.gov/geo/) for isolated cell types from mouse samples. In order to pre-process the database, we acquired raw data files (CEL format) for each sample. Samples from the 430.2 array were stripped of the extra probesets they contained and merged with the data from 430A array samples. The resulting dataset was pre-processed and normalized using Robust Multichip Average (RMA) method [@irizarry_exploration_2003;@irizarry_summaries_2003]. We observed significant differences in the distribution of the probeset level signal distribution after RMA normalization, potentially due to the technical differences between studies. To make samples comparable to each other, we used a second quantile normalization after RMA [@bolstad_comparison_2003]. In ideal conditions, batch correction would have been desirable, but since datasets were composed of independent sources with non overlapping cell types, this was not possible.  All samples including the ones used in Okaty et. al. study, passed through a quality control phase that involved ensuring expression of known cell type markers (markers from literature and markers that were used to isolate the cell type) and confirming samples were not contaminated by other cell types by looking for expression of foreign markers. At the end of the cleanup process, cell types were separated into non overlapping groups. This lead to removal of some samples whose associated subtypes were already represented by another sample. For instance samples representing Htr3a positive GABAergic cells were removed due to the presence of VIP positive cells (their subtypes [@rudy_three_2011]) in another dataset. In addition, when cell types were too similar to each other to detect meaningful differences between them, they were grouped together in a single cell type. For example Drd1 and Drd2 positive spiny neurons' expression profiles were too similar to each other to detect different markers for both. The current dataset has 31 cell types, isolated from 11 regions gathered from 24 studies, and isolated with a variety of methods (\Cref{table:dataTable,table:cellTypeCite}). We are still looking at newly published papers in order to add more cell types.

## Presentation of the data in a web application 

We created a web application to facilitate  access to the cell type database The web application allows users to easily visualize expression of chosen genes in individual cell types in their respective regions (Figure \ref{fig:neuroexpresso}). The application also allows grouping of cells together in a hierarchical manner. Every sample shown links to the original data source if it is a publicly available dataset. Future modifications will add the ability to group samples based on sources along with other visualization options. We will also be embedding tools to perform rapid differential expression analyses between cell types, and a gene set enrichment tool that will allow researchers to check a list of genes for cell type specific enrichment. The application increases the usability of our database especially for researchers who are not from a computational background less familiar with computational tools.

Aim 2: Identification and Validation of Marker Gene Sets 
=======================================================================================================================

Separation of samples into brain regions 
------------------------------------------------------------------------------------------------------------------------

A principal use of the comprehensive database we created is to find gene sets with highly enriched expression in single cell types. We reasoned that since most biological samples are from specific brain regions, for marker genes to be biologically and computationally relevant, they should be unique to a single cell type in the context of the associated region. For instance Pvalb is a marker of fast spiking gabaergic cells in cortex but it is a purkinje marker in the cerebellum. To accomplish this, we created groups representing various brain regions. Each cell type in the database is assignetd to a group based on the brain region it was isolated from and a hierarchy of brain regions (Figure \ref{fig:regionHierarchy}). In this hierarchy, cell types that are isolated from regions represented by the lower nodes are included in regions represented by the higher nodes connected to them, for instance samples isolated from cortex are added to the brain regions higher in the hierarchy: cerebrum and whole brain (All). Similarly, cell types isolated from regions represented by higher nodes are included in the regions, represented by lower nodes. For instance, microglia that are isolated from whole brain extracts are added to all other regions lower in the hierarchy. Only exception to this procedure are the astrocyte and oligodendrocyte samples isolated from cortex. Since these cell types are well known to be prevalent across the brain and we did not have samples isolated from other regions, oligodendrocytes are considered to be originated from whole tissue, while astrocytes are added to all regions except cerebellum where they are replaced by Bergmann glia.

Selection of marker genes
------------------------------------------------------------------------------------------------------------------------

Upon separation of regions we chose specific marker genes for the cell types represented in each region by a clustering based method. For a given cell type, we designated a gene as a marker gene if the following conditions were met:

* There was more than 10 fold change between the median expression of the gene in samples representing the cell type and all other samples from the same region. This condition ensures that there is a large enough difference between the cell type and the rest of the samples to reduce the number of genes selected due to differences that can be study or strain specific.
* When samples were separated into two clusters, those representing the target cell type and all others, with the distance between samples defined as the difference of expression of the target gene, the silhouette coefficient of the resulting clusters was higher than 0.5. This condition prioritizes genes that separate the cell types best.
* By randomly removing samples and equilizing the number of samples chosen from each individual study, we make sure that studies with more samples do not unfairly influence the silhouette coefficient and reduce the effect of outliers.

Since the samples are gathered across multiple samples and cells are isolated by different extraction methods, it is inevitable for there to be technical artifacts, making sample to sample comparison difficult. This was the main advantage for gene selection method over a simple differential expression analysis.

Using those conditions, we selected marker genes across 31 cell types isolated from 11 regions. The number of marker genes greatly varied from one cell type to another depending on the presence of highly similar cell types in the dataset (Figure \ref{fig:cortexGenes}).

Validation of marker genes
------------------------------------------------------------------------------------------------------------------------

Finding reliable marker genes using independent datasets is challenging due to artifacts caused by differences in mouse strains, isolation methods and batches. It is also uncertain if marker genes detected using mouse cell types will apply to human cell types. To establish the reliability of our genes, we used the validation methods described below to verify that they act as marker genes in both biological and computational settings. 

### Validation of marker genes via in situ hybridization

In situ hybridization (ISH), is a well accepted way of assessing the sensitivity and specificity of marker genes, by ensuring that the expression of newly discovered markers and markers from the literature are co-localized to the cells. When possible, we used the ISH data available from the Allen Brain Atlas (ABA) [@lein_genome-wide_2007] to confirm our findings. While ABA is a powerful resource including thousands of ISH images, every slice is labelled by a probe specific to a single gene. Thus, it is not possible to conclusively decide if the signal is coming from the same cell types unless that cell type is highly concentrated in a specific structure in the brain and are thus identifiable based on location. Granule cells of dentate gyrus and purkinje cells of cerebellum fit this criteria. This allowed us to validate the marker gene specificity for these cell types. The majority of the marker genes with ISH slides available in ABA was shown to be cell type specific (\Cref{fig:allenValidate,fig:allenAllPurkinje,fig:allenAllDentate}) [@lein_genome-wide_2007].

We are currently collaborating with Dr. Etienne Sibille (University of Toronto) to validate the markers by dual labelling. Dr. Sibelle's group was able to validate Cox6a2 as a marker of fast spiking gabaergic cells in mice (Figure \ref{fig:hybridizationValidation}). We will be expanding the number of validated genes through this method, and will apply it to human samples as well.


### Computational validation of marker genes in mouse and human single cell data

Since it is unfeasible to apply biological validation methods to all marker genes that we selected, we are using recently published single cell RNA-sequencing datasets [@darmanis_survey_2015;@zeisel_cell_2015;@tasic_adult_2016] to validate our findings. These data sets have been generated by a recent proliferation of studies aimed at characterizing the cell types of the brain. These studies attempt to define cell types from the ground up by using a variety of clustering methods. Based on descriptions of the identified cell types in the papers, and that the cells are randomly selected from cortices of mouse and men, we have assumed that they have identified the same cell types that our database represents. Unfortunately, due to the high granularity of the clusters of the single cell data, it is not straightforward to match which individual cells correspond to the cell types defined in our data. To combat this problem, we tried to validate our marker genes in a cell type-agnostic way. For all of our marker gene sets, we checked to see if the genes are more co-expressed than average based on a null distribution of random genes with similar prevalence in the dataset. For human samples, we used the homologues of the marker genes for the same purpose.

Since single cell expression analysis is still in its infancy, often the transcript counts for most genes are being very low, which makes the exact expression value an unreliable measure. Therefore, instead of using the full expression values, we converted the data into a binary matrix where 0 indicated no expression of the gene and 1 indicated any expression of the gene. This approach may be too conservative since we do not chose genes based on their exclusivity to a single cell type, but its heightened expression in the cell type. Despite this, applying the method to a mouse [@zeisel_cell_2015] and a human [@darmanis_survey_2015] single cell dataset from cortex returned favourable results. In mouse all marker gene sets for cortical cell types were found to be significantly more coexpressed than expected under the null distribution (\Cref{table:singleCellRNAMouse}), and in the human dataset a majority (7/10) of the marker gene sets showed significantly more coexpression (\Cref{table:singleCellRNAHuman}).

The next step is to repeat this analysis using a more recently published single cell RNA-seq study performed on mouse brains [@tasic_adult_2016]. This dataset has a better coverage of cortical cell types.

### Validation of marker genes in human whole tissue data

As another validation approach, we analysed several human datasets containing multiple brain regions isolated from pathologically healthy humans. The first dataset included samples from 16 different brain regions [@trabzuni_widespread_2013]. Of these 16 regions cell types from, various cortical sub-regions, hippocampus, substantia nigra, thalamus and cerebellum were included in our cell type specific expression dataset. The second dataset included samples from two different cortical regions (Brodmann areas 11 and 47) [@chen_effects_2016]. Finally, we used 4 datasets that analysed the cortex samples from the Stanley Array Collection. In these studies, we discarded the samples from bipolar disorder and schizophrenia patients.

For all marker gene sets we obtained, the coexpression of marker genes in brain regions relevant to the cell type were analysed. Since marker gene sets are cell type specific, in a complex tissue, variation in the amount of a cell type is determinant of the associated markers' expression. That is, if a sample has higher amount of a cell type, expression of all marker genes for that cell type will be higher. This is expected to result in increased coexpression of marker genes in whole tissue datasets due to the variability in cell type proportions. We compared the overall level of marker gene coexpression between these samples to coexpression levels of randomly selected genes with similar expression levels. The results showed significantly heightened coexpression for the majority of the cell type marker sets in all datasets analysed (15/22) while PV\textsuperscript{+} gabaergic cells and VIP\textsuperscript{+} Reln\textsuperscript{+} cabaergic cells were shown to have a significantly heightened co-expression in the majority of the datasets (\Cref{table:corticalTable,table:cerebellarTable,table:hippocampalTable,table:substantialTable,table:thalamicTable,table:allTable}).

<!-- \ref{table:corticalTable}-\ref{table:cerebellarTable}-\ref{table:hippocampalTable}-\ref{table:substantialTable}-\ref{table:thalamicTable}-\ref{table:allTable}). -->


Assess condordance of single cell RNA-seq studies with each other and to microarray samples in our database
------------------------------------------------------------------------------------------------------------------------

The high volume of recently published RNA-seq studies has created a large output of data that are very similar to each other in terms of design. All of them are derived from cells of the brains of the same organism, hence in ideal conditions, the cell types they identify should overlap with each other and with our microarray database. Due to the inherent differences in the data structure, assessing repeatability of such single cell studies is not straightforward. The definitions of cell types in our database is based on characterization of the cell types by experts while single cell studies use various clustering methods to seperate their cells into groups and retroactively decide which groups represent which cell types based on the expression of known markers.

We aim to capture similarities between individual cells from RNA-seq datasets and samples from the microarray database by using common genes that are detected by all the datasets in an expression independent manner. For microarray data we will be looking to see if expression of a gene is above the background level. For RNA-seq, we will simply determine whether the gene is captured at all in the sample. This analysis should provide sufficient information to correlate the samples to and allow us to identify which samples from each dataset correspond to each other. However, if we cannot reliably group samples from independent sources together, we will group the single cell data according to their designated groups as identified by the original investigators. We hope to determine whether these independent studies really identify the same cell types or if the cell types are not fully equivalent due to experimental methods or differences between mouse strains. A plan of the proposed analysis can be found in Figure \ref{fig:concordance}.

Aim 3: Estimation of cell type proportions 
=======================================================================================================================


Estimation of cell type proportions in whole tissue samples using the marker gene sets
------------------------------------------------------------------------------------------------------------------------

A well-established use of marker genes is the estimation of cell type proportions in whole tissue samples using their expression. As discussed in the introduction, two types of deconvolution methods dominate the field: reference-based and reference-free deconvolution. We choose to use reference-free deconvolution because the fact that the reference expression profiles we are using were derived from mice, but we often want to do proportion estimation in human brains. Hence the exact level of expression in our dataset might not be reliable since we are attempting to deconvolute human samples using expression profiles extracted from mouse brains. Genes highly enriched in cell types however are more likely to be functionally relevant to the cell type, hence they are are likely to remain as markers across species. Human astrocytes for instance was shown to have enriched expression of 52% of genes enriched in mouse astrocytes [@zhang_purification_2016]. Therefore it is sensible to believe that sufficiently many marker genes will be preserved across species for our purposes. Our aforementioned validation in human RNA-seq and whole tissue data, along with further validation of our experimental pipeline that will be explained in the next subsection confirms that this is not an unreasonable assumption to make.

To estimate the relative amount of cell types between samples, we used the first principal component of expression of marker genes in the samples as a proxy for cell type proportions. This method has been adopted by multiple other groups performing deconvolution in whole tissues [@chikina_cellcode_2015;@westra_cell_2015;@xu_cell_2013]. The idea behind the method is that most of the variation in marker gene expression will be explained by changes in the cell type proportions. We have implemented countermeasures against genes that do not behave as marker genes or show variation independent from cell type proportion in between samples, such as calculation of rotations based on the control samples, to ensure that genes that do not act as markers do not interfere with the estimation. The result from the analysis is a unitless number per cell type, representing the relative amount of that cell type compared to other samples. This number is used to compare samples only and cannot be used to compare two different cell types.

To check that the method works as expected, we estimated relative cell proportions in different brain regions with known differences between cell type proportions. Our results were concordant with the literature. In a dataset of different brain regions from healthy donors [@trabzuni_widespread_2013] we observed an increase in glial population and decrease in most neuronal populations between white matter and grey matter (Figure \ref{fig:cortexWhite}), and demonstrating that purkinje cells were exclusive to the cerebellum (Figure \ref{fig:purkinjeCereb}). 

To assess the potential use of the method in the context of brain related disorders, we acquired datasets of substantia nigra expression from healthy donors and Parkinson's disease patients [@lesnick_genomic_2007;@moran_whole_2006], characterized by loss of dopaminergic cells in the region. In the first dataset [@lesnick_genomic_2007] that included samples from healthy donors and Parkinson's disease patients, our initial analysis only showed a significant difference between healthy donors and Parkinson's disease patients in male samples (Figure \ref{fig:parkinsonMale}). Further examination of the metadata revealed that female healthy donors had a large age difference (mean age of control females: 85.75, median age of Parkinson's disease females: 71.00). While it was not possible to add the age and other metadata components to a model, due to the way metadata is presented in the original source, age is known to effect dopaminergic cell counts [@costa_effects_2014]. Apart from that, female patients are also thought to be more resistant to dopaminergic cell loss [@gillies_sex_2014]. The second dataset we used included samples from both lateral and medial substantia nigra [@moran_whole_2006]. Previous work suggested that medial substantia nigra is less affected by the disease than lateral substantia nigra [@duke_medial_2007]. To test the region effect and the effect of sex we fitted a linear mixed-effects model that included sex, brain region, disease state of the samples and the interactions between the factors. Patient ID was added as a random effect. It was not possible to add the age to the model because it confounded with the disease, namely, Parkinson's Disease patients were on average, older than controls (Mean age of controls: 69.3, mean age of patients: 80.7). The model confirmed that along with the disease state, that has a large effect size on the estimates for the dopaminergic cells, interaction between the disease state and brain region was also shown to be effective, supporting previous findings that medial substantia nigra has less severe loss of dopaminergic cells [@duke_medial_2007]. While sex, by itself did not show a large effect size, its interaction with brain region and disease state did, indicating that males might have more cell loss specifically in medial substantia nigra (Table \ref{table:parkinsonMixed}). While this can be a novel finding, current literature says little to support or discredit this claim and the sample size is limited.

We will continue to analyze more datasets from brain related disorders and brains under different conditions to increase our confidence in the database apply it as a discovery tool. There are thousands of studies of rodent or human brain tissue samples available in Gemma database [@zoubarev_gemma_2012] for expression profiles with well-annotated metadata to facilitate fast analysis. These datasets include brain samples under various conditions, such as different developmental stages, disorders, treatments, etc. We expect this analysis to reveal many condition specific cell type proportion changes. This step will also provide further validation, since we it will allow us to compare our ability to detect the same proportion changes in different studies working on similar conditions.


Validation of the pipeline using isolated blood cell types and whole blood data
------------------------------------------------------------------------------------------------------------------------

Estimation of brain cell type proportions is challenging to validate, since, to our knowledge, there are no expression datasets coupled with cell type counts. Any result we find is unverifiable, other than the expected differences between groups. Therefore, to assess the accuracy our method, expression data from whole tissues paired with cell type counts are required. Isolation of blood cell types is much more straightforward than isolation and brain cell types, and can be done more easily without harming the subject. While this data is virtually absent for brain tissue samples, a wide array of blood expression panels are coupled with cell counts, acquired through well established methods [@abbas_deconvolution_2009;@novershtern_densely_2011]. cell type reference datasets for blood cell types are present in the literature for both mouse and men. We used these data to construct a similar database to our brain database for mouse (Table \ref{table:bloodCite}) and human blood cell types [@newman_robust_2015]. We subjected both these databases to the same marker gene selection steps. To examine expression changes of marker genes between species, we checked if homologues of genes selected for one species behave as marker genes in other. As expected, not all genes were specific to the cell type they were selected from in different species(Figure \ref{fig:lm11MouseHumanSwap}). To evaluate the ability of marker genes selected for mouse cell types to correctly estimate the relative proportions, we estimated cell type proportions in whole blood cell types and compared our results with a recently published reference-based estimation method [@newman_robust_2015]. When the cell type definitions were kept at a relatively general level(eg. B cells, T cells) not differentiating cell types at different activation stages, mouse genes performed better than human genes (Figure \ref{fig:lm11Estimations}), potentially due to difference of quality between the datasets. On the other hand attempting to estimate finely defined cell (eg. activated/deactivated CD4 cells) types with mouse genes yielded poor correlation to actual counts (Figure \ref{fig:lm22EstimationsHuman} - \ref{fig:lm22EstimationsMouse}). Shay et al. [@shay_conservation_2013] showed that while most lineage specific gene expression is conserved between mouse and human, there are still significant expression changes between a considerable number of genes. This might explain the poor quality of estimations when attempting to estimate the finer subtypes.

<!-- Future work will focus on characterizing the genes that do not perform well across species and assessing if their properties are generalizible to brain cell type markers. With our current resources, it is impossible to tell if certain brain cell type markers are working or not. We will be attempting to characterize such genes by analyzing their individual performance at marker gene validation steps -->

Use proportion estimations to improve accuracy of differential expression analysis
------------------------------------------------------------------------------------------------------------------------

Differential expression analyses on whole tissues are complicated by the heterogeneity of the sample. Since effects are likely to be specific to cell types, having unaffected cell types in the sample will reduce the observed difference, reducing statistical power. Also changes in the cell type proportions can generate false positives. Previous work suggests that it might be possible to increase the power of differential expression analysis by adding estimated cell type proportions as covariates [@chikina_cellcode_2015]. The authors also show observed effects can be localized to their cell types by using the estimated proportions in interaction models. In neuroscience of human brains, where sample sizes are often small and data quality is relatively poor, this approach had the potential to increase the value of the existing data by increasing the statistical power of the analyses. We are hoping to validate this approach by finding studies that isolate cell types under certain conditions and controls, paired with other studies that work on the same condition using whole tissue samples. We will assess our ability to assign the differentially expressed genes detected in the single cell type study to that cell type in the study that uses whole tissue samples.

Create an R package for easy application of the method by third parties.
------------------------------------------------------------------------------------------------------------------------

The pipeline we have developed for gene selection and cell type estimation is time consuming to set up with the magnitude steps aiming to fine tune the process. By creating an R package we will make for our process to be reproducible by other researchers. The package will include streamlined functions to select and validate the marker genes, along with functions used in the estimation process. The package will be publicly available on Bioconductor, CRAN and/or Github platforms.

# References