This repo contains code that parses raw CellMarker_Data into a JSON file that Dumper can use in Biothings Studio.
About this dataset: The dataset can be found here. These records contain data about molecules expressed on the surface of, within, or secreted by cells. These markers are used to identify and classify the types, states, or functions of cells within a population.
There are 4 different Cell_Marker datasets in total, but all of them have the same attributes:
- speciesType: the species from which the data originates
  - there are only two values, either `Human` or `Mouse`
- tissueType: the type of tissue from which the data originates
  - 181 distinct values in total
  - a lot of them are undefined
- UberonOntologyID: the universal unique identifier of the anatomical structure found in animals
  - needs to be confirmed with the team
  - contains missing values, most of which are due to an undefined tissueType
- cancerType: the name of the cancer that the cell marker is associated with
  - if the cell marker does not represent cancer, the value is `Normal`
- cellName: the English name of the cell that the marker belongs to
- CellOntologyID: the universal unique identifier of the cell that the marker belongs to
  - contains missing values
- cellMarker: a marker molecule of the cell
  - stored as a list-like string that can be converted to a list
- geneSymbol: the gene symbol for the cell marker
  - stored as a list-like string that can be converted to a list
- geneID: the universal unique identifier of the gene
  - stored as a list-like string that can be converted to a list
  - contains missing values
- proteinName: the name of the protein
  - stored as a list-like string that can be converted to a list
  - contains missing values
- proteinID: the universal unique identifier of the protein
  - stored as a list-like string that can be converted to a list
- markerResource: the type of resource or methodology used to identify the marker
  - there are only four values: `Experiment`, `Single-cell sequencing`, `Company`, or `Review`
- PMID: the PubMed ID of the publication or study where the marker data was reported
  - if `markerResource` is `Company`, the value here contains `company`
- Company: the company associated with the resource
  - most values are missing; a value exists only when `markerResource` is `Company`
This section discusses some of the features of the dataset. All of the work can be found in the Jupyter Notebook in the EDA folder.
Most of the missing UberonOntologyID values are due to an "Undefined" tissueType.
Most of the missing CellOntologyID values are due to "Cancer stem cell" in cellName.
They either exist or are missing at the same time
In the following columns:
- geneSymbol
- geneID
- proteinName
- proteinID
values are stored as list-like strings. Here are some example strings:
- "A"
- "A, B"
- "A B"
- "A, [A, B], C"
- "A, B, C, D, [E, F], [G, H I]"
We expected the parsing result to be:
- ['A']
- ['A', 'B']
- ['A B']
- ['A', ['A', 'B'], 'C']
- ['A', 'B', 'C', 'D', ['E', 'F'], ['G', 'H I']]
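The expected results above can be reproduced with a small character-by-character parser. The following is only a minimal sketch, not necessarily the parser used in this repo: the function name `parse_list_like` is hypothetical, and it assumes that commas separate items, that square brackets open and close a nested list, and that spaces inside an item are kept.

```python
def parse_list_like(text):
    """Parse a list-like string such as "A, [B, C]" into a (nested) list."""
    result = []
    stack = [result]  # stack[-1] is the list currently being filled
    token = []        # characters of the item currently being read

    def flush():
        # Finish the current item and store it, skipping empty items.
        item = "".join(token).strip()
        token.clear()
        if item:
            stack[-1].append(item)

    for ch in text:
        if ch == ",":
            flush()
        elif ch == "[":
            flush()
            nested = []
            stack[-1].append(nested)
            stack.append(nested)
        elif ch == "]":
            flush()
            if len(stack) > 1:  # ignore a stray closing bracket
                stack.pop()
        else:
            token.append(ch)
    flush()
    return result


print(parse_list_like("A, [A, B], C"))
```

Because a space is not treated as a separator, `"A B"` stays a single item, which matches the expected results listed above.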
91% of the values in the `Company` column are missing, but they are missing by design. There are 4 different kinds of values in the `markerResource` column: "Experiment", "Review", "Single-cell sequencing", and "Company". The `Company` column is not missing when the value in the `markerResource` column is "Company".
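One way to check the missing-by-design claim is to confirm that `Company` is filled in exactly when `markerResource` is "Company". The sketch below uses plain dictionaries and made-up sample records (the values `ExampleCo` and the PMIDs are placeholders, not real data); the helper names are hypothetical, and a missing value is assumed to be `None` or an empty string.

```python
def missing_fraction(records, column):
    """Fraction of records whose value in `column` is missing (None or empty)."""
    missing = sum(1 for record in records if not record.get(column))
    return missing / len(records)


def company_missing_by_design(records):
    """True if Company is present exactly when markerResource is 'Company'."""
    for record in records:
        is_company_source = record.get("markerResource") == "Company"
        has_company = bool(record.get("Company"))
        if is_company_source != has_company:
            return False
    return True


# Made-up records for illustration only.
sample = [
    {"markerResource": "Experiment", "PMID": "0000001", "Company": None},
    {"markerResource": "Review", "PMID": "0000002", "Company": None},
    {"markerResource": "Single-cell sequencing", "PMID": "0000003", "Company": None},
    {"markerResource": "Company", "PMID": "company", "Company": "ExampleCo"},
]

print(missing_fraction(sample, "Company"))
print(company_missing_by_design(sample))
```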
geneID
- geneSymbol: str
- proteinID: dict
  - proteinName
- cellMarker: dict
  - speciesType: str
  - tissueType: str
  - UberonOntologyID: str
  - cancerType: str
  - cellType: str
  - cellName: str
  - CellOntologyID: str
  - markerResource: tuple, one of the following:
    - Experiment: PMID
    - Review: PMID
    - Single-cell sequencing: PMID
    - Company: Company name
- concatenating the all_cell_markers and all_singleCell_markers DataFrames
- replacing all "undefined" tissue values with NaN
- converting all list-like strings into lists for the columns [geneSymbol, geneID, proteinName, proteinID]
- removing all rows with a missing "geneID"
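The steps above can be sketched without any dependencies by using lists of dictionaries in place of DataFrames. This is an illustration under stated assumptions, not the repo's actual code: `preprocess` and `split_flat` are hypothetical names, `None` stands in for NaN, the "undefined" comparison is assumed to be case-insensitive, and `split_flat` is a simplified stand-in that does not handle nested brackets.

```python
LIST_COLUMNS = ("geneSymbol", "geneID", "proteinName", "proteinID")


def split_flat(text):
    """Simplified stand-in parser: flat comma split, no nested brackets."""
    return [part.strip() for part in text.split(",") if part.strip()]


def preprocess(all_cell_markers, all_single_cell_markers, parse=split_flat):
    # Step 1: concatenate both record sets, copying rows so inputs stay unchanged.
    rows = [dict(row) for row in list(all_cell_markers) + list(all_single_cell_markers)]

    cleaned = []
    for row in rows:
        # Step 2: treat an "undefined" tissue as a missing value (None here).
        tissue = row.get("tissueType")
        if isinstance(tissue, str) and tissue.strip().lower() == "undefined":
            row["tissueType"] = None

        # Step 4, applied before parsing: drop rows with a missing geneID.
        if not row.get("geneID"):
            continue

        # Step 3: convert list-like strings into lists.
        for column in LIST_COLUMNS:
            value = row.get(column)
            if isinstance(value, str):
                row[column] = parse(value)
        cleaned.append(row)
    return cleaned
```

The `parse` argument lets a fuller parser that understands nested brackets be swapped in for `split_flat`.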
Guoxuan Xu - @github_profile - [email protected]
- Thanks to Dr. Wu and Jason Lin for their help!