Fixed the converters for CTD 2024 data, added hgnc.symbol xrefs, removed not so useful synonyms and parent/related/other IDs, etc., added tests. This should generate smaller biopax model.
IgorRodchenkov committed Mar 10, 2024
1 parent 35831d1 commit e7799c9
Showing 7 changed files with 162 additions and 60 deletions.
19 changes: 8 additions & 11 deletions README.md
@@ -5,7 +5,7 @@ Originated from https://bitbucket.org/armish/gsoc14 and will continue here (ToDo

Unlike many other drug-target databases, this data resource has a controlled
vocabulary that can be mapped to BioPAX, for example: 'nutlin 3 results
in increased expression of BAX'. Therefore implementation of a converter
in increased expression of BAX'. Therefore, implementation of a converter
first requires a manual mapping from CTD terms to BioPAX ontology.
Once the mapping is done, then the actual conversion requires parsing
and integrating several CSV files that are distributed by the provider.
@@ -36,20 +36,17 @@ the converted models are merged and a single BioPAX file is provided as output.

The gene/chemical vocabulary converters produce BioPAX file with only
`EntityReference`s in them. Each entity reference in this converted
models includes all the external referneces provided within the vocabulary file.
models includes the external references provided within the vocabulary file.
From the chemical vocabulary, `SmallMoleculeReference`s are produced;
and from the gene vocabulary, various types of references are produced
for corresponding CTD gene forms: `ProteinReference`, `DnaReference`,
`RnaReference`, `DnaRegionReference` and `RnaRegionReference`.

The interactions file contains all detailed interactions between chemicals
and genes, but no background information on the chemical/gene entities.
Therefore it is necessary to convert all these files and merge these
models into one in order to get a properly annotated BioPAX model.
The converter exactly does that by making sure that the entity references
from the vocabulary files match with the ones produced from the interactions file.
This allows filling in the gaps and annotations of the entities in the
final converted model.

We can convert any or all of these three files at once,
merge into one BioPAX model.
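
To illustrate why merging works when the vocabulary and interaction converters agree on object IDs, here is a minimal sketch in plain Java. It is not the converter's code and does not use Paxtools: each "model" is assumed to be just a map from an object ID to a set of annotation strings, and `MergeSketch`, `merge`, and the annotation format are hypothetical names made up for this example.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy stand-in for model merging: a "model" maps an object ID to its annotations.
class MergeSketch {

    // Copies every entry of 'source' into 'target', pooling annotations of equal IDs.
    private static void addAll(Map<String, Set<String>> target, Map<String, Set<String>> source) {
        for (Map.Entry<String, Set<String>> e : source.entrySet()) {
            target.computeIfAbsent(e.getKey(), k -> new LinkedHashSet<>()).addAll(e.getValue());
        }
    }

    /** Union of two models; objects that share an ID are treated as one object. */
    static Map<String, Set<String>> merge(Map<String, Set<String>> a, Map<String, Set<String>> b) {
        Map<String, Set<String>> merged = new LinkedHashMap<>();
        addAll(merged, a);
        addAll(merged, b);
        return merged;
    }

    public static void main(String[] args) {
        // The vocabulary model knows the chemical's name.
        Map<String, Set<String>> vocabulary = new LinkedHashMap<>();
        vocabulary.put("ref_chemical_mesh_c106820",
                new LinkedHashSet<>(Collections.singleton("name=oxidative potential water")));

        // The interaction model only knows that the chemical takes part in something.
        Map<String, Set<String>> interactions = new LinkedHashMap<>();
        interactions.put("ref_chemical_mesh_c106820", new LinkedHashSet<>());
        interactions.put("control_1",
                new LinkedHashSet<>(Collections.singleton("controller=ref_chemical_mesh_c106820")));

        System.out.println(merge(interactions, vocabulary));
    }
}
```

The point of the sketch is only that the annotation gap closes when both sides generate the same ID for the same chemical or gene; a real BioPAX merge has to reconcile typed objects and properties, which this toy ignores.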

The CTD data sets have nested interactions that are captured by their
structured XML file and their XML schema:
@@ -79,8 +76,8 @@ to run without any command line options to see the help text:
-c,--chemical <arg> CTD chemical vocabulary (CSV) [optional]
-g,--gene <arg> CTD gene vocabulary (CSV) [optional]
-o,--output <arg> Output (BioPAX file) [required]
-r,--remove-dangling Remove dangling entities for clean-up [optional]
-t,--taxonomy <arg> Taxonomy (e.g. '9606' for human;
-r,--remove-dangling Remove dangling utility class entities [optional; use with -x -t]
-t,--taxonomy <arg> filter interactions by species, Taxonomy ID ('9606' for human);
can use special values: 'defined', 'undefined', and 'null') [optional]
-x,--interaction <arg> structured chemical-gene interaction file (XML)
[optional]
@@ -89,6 +86,6 @@ If you want to test the converter though, you can download small (old) example
files from [goal2_ctd_smallSampleInputFiles-20140702.zip](https://bitbucket.org/armish/gsoc14/downloads/goal2_ctd_smallSampleInputFiles-20140702.zip).
To convert these sample files into a single BioPAX file, run the following command:

$ java -jar ctd-to-biopax.jar -x ctd_small.xml -c CTD_chemicals_small.csv -g CTD_genes_small.csv -r -o ctd.owl
$ java -jar ctd-to-biopax.jar -x ctd_small.xml -c CTD_chemicals_small.csv -g CTD_genes_small.csv -r -t 9606 -o ctd.owl

which will create the `ctd.owl` file for you.
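
The `-t` help text lists a taxonomy ID plus the special values 'defined', 'undefined', and 'null'. The following is a minimal sketch of how such a filter could be read, assuming each interaction record carries at most one taxon ID. `TaxonFilterSketch` and `accepts` are hypothetical names, and the real `CTDInteractionConverter` may decide differently, for example when a record lists several taxa.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the -t filter semantics; not the converter's actual code.
class TaxonFilterSketch {

    /**
     * @param filter  value of -t: a taxonomy ID, "defined", "undefined", or null (no filtering)
     * @param taxonId taxon ID attached to an interaction record, or null if the record has none
     */
    static boolean accepts(String filter, String taxonId) {
        if (filter == null) {
            return true; // no filter: keep every record
        }
        boolean hasTaxon = taxonId != null && !taxonId.isEmpty();
        switch (filter) {
            case "defined":
                return hasTaxon;          // any species, as long as one is given
            case "undefined":
                return !hasTaxon;         // only records without a species
            default:
                return hasTaxon && filter.equals(taxonId); // one specific species
        }
    }

    public static void main(String[] args) {
        // Made-up records: two human, one mouse, one without a taxon.
        List<String> taxa = Arrays.asList("9606", "9606", "10090", null);
        for (String filter : new String[] {"9606", "10090", "defined", "undefined", null}) {
            long kept = taxa.stream().filter(t -> accepts(filter, t)).count();
            System.out.println(filter + " -> " + kept + " record(s)");
        }
    }
}
```

Under this reading, the records kept by 'defined' and by 'undefined' are disjoint and together cover everything that a null filter keeps, which is the relationship the `convertTaxon` test in this commit checks on its sample file.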
8 changes: 5 additions & 3 deletions src/main/java/org/ctdbase/CtdToBiopax.java
@@ -1,5 +1,6 @@
package org.ctdbase;

import org.biopax.paxtools.model.level3.UtilityClass;
import org.ctdbase.converter.CTDChemicalConverter;
import org.ctdbase.converter.CTDGeneConverter;
import org.ctdbase.converter.CTDInteractionConverter;
@@ -32,7 +33,8 @@ public static void main( String[] args ) {
.addOption("c", "chemical", true, "CTD chemical vocabulary (CSV) [optional]")
.addOption("o", "output", true, "Output (BioPAX file) [required]")
.addOption("t", "taxonomy", true, "Taxonomy (e.g. '9606' for human) [optional]")
.addOption("r", "remove-dangling", false, "Remove dangling entities for clean-up [optional]")
.addOption("r", "remove-dangling", false,
"Remove dangling UtilityClass objects from final model [optional; recommended when using options: -x -t]")
;

try {
@@ -80,8 +82,8 @@ public static void main( String[] args ) {
}

if(commandLine.hasOption("r")) {
Set<BioPAXElement> removed = ModelUtils.removeObjectsIfDangling(finalModel, EntityReference.class);
log.info("Removed " + removed.size() + " dangling entity references from the model.");
Set<BioPAXElement> removed = ModelUtils.removeObjectsIfDangling(finalModel, UtilityClass.class);
log.info("Removed " + removed.size() + " dangling UtilityClass objects from the model.");
}

finalModel.setXmlBase(Converter.sharedXMLBase);
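
The change from `EntityReference.class` to `UtilityClass.class` widens the clean-up, since in BioPAX both entity references and xrefs are utility classes. The sketch below shows the general idea of removing dangling objects on a toy object graph. It is an illustration only and not how Paxtools implements `ModelUtils.removeObjectsIfDangling`; `DanglingSketch`, `Node`, and the `utility` flag are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy object graph; not the Paxtools implementation.
class DanglingSketch {

    /** 'utility' marks objects that are only useful when something refers to them. */
    static final class Node {
        final String id;
        final boolean utility;
        final Set<String> refs;

        Node(String id, boolean utility, String... refs) {
            this.id = id;
            this.utility = utility;
            this.refs = new HashSet<>(Arrays.asList(refs));
        }
    }

    /** Repeatedly removes utility nodes that no remaining node refers to. */
    static Set<String> removeDangling(Map<String, Node> model) {
        Set<String> removed = new TreeSet<>();
        boolean changed = true;
        while (changed) {
            changed = false;
            // Collect every ID that some remaining node still points to.
            Set<String> referenced = new HashSet<>();
            for (Node n : model.values()) {
                referenced.addAll(n.refs);
            }
            Iterator<Node> it = model.values().iterator();
            while (it.hasNext()) {
                Node n = it.next();
                if (n.utility && !referenced.contains(n.id)) {
                    it.remove();          // also removes the entry from the map
                    removed.add(n.id);
                    changed = true;       // its own references may now dangle
                }
            }
        }
        return removed;
    }

    static Map<String, Node> example() {
        Map<String, Node> model = new LinkedHashMap<>();
        for (Node n : new Node[] {
                new Node("control1", false, "er1"), // an interaction using er1
                new Node("er1", true, "xref1"),     // used entity reference
                new Node("er2", true, "xref2"),     // unused entity reference
                new Node("xref1", true),
                new Node("xref2", true)}) {
            model.put(n.id, n);
        }
        return model;
    }

    public static void main(String[] args) {
        Map<String, Node> model = example();
        Set<String> removed = removeDangling(model);
        System.out.println("Removed " + removed + ", kept " + model.keySet());
    }
}
```

In the toy graph, dropping the unused entity reference leaves its xref orphaned, and the next pass removes that too. A clean-up limited to entity references would leave such xrefs behind, which may be part of the motivation for the wider class here.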
33 changes: 16 additions & 17 deletions src/main/java/org/ctdbase/converter/CTDChemicalConverter.java
@@ -25,7 +25,7 @@ public Model convert(InputStream inputStream) throws IOException {
// Skip commented lines
if(nextLine[0].startsWith("#")) { continue; }

if(nextLine.length < 9) {
if(nextLine.length < 8) {
log.warn(nextLine[0] + "' does not have enough columns. Skipping.");
continue;
}
@@ -45,9 +45,9 @@ public Model convert(InputStream inputStream) throws IOException {
String chemicalId = nextLine[1];
String casRN = nextLine[2];
String definition = nextLine[3];
String[] parentIDs = nextLine[4].split(INTRA_FIELD_SEPARATOR);
String[] synonyms = nextLine[7].split(INTRA_FIELD_SEPARATOR);
String[] dbIds = nextLine[8].split(INTRA_FIELD_SEPARATOR);
// String[] parentIDs = nextLine[4].split(INTRA_FIELD_SEPARATOR);
// String[] synonyms = nextLine[7].split(INTRA_FIELD_SEPARATOR);
// String[] dbIds = nextLine[8].split(INTRA_FIELD_SEPARATOR); //not present in CTD 2024 data...

String rdfId = CtdUtil.sanitizeId("ref_chemical_" + chemicalId.toLowerCase());

@@ -60,26 +60,25 @@ public Model convert(InputStream inputStream) throws IOException {

smallMoleculeReference.setDisplayName(chemName);
smallMoleculeReference.setStandardName(chemName);
smallMoleculeReference.addName(chemName);
for (String synonym : synonyms) {
smallMoleculeReference.addName(synonym);
}
// for (String synonym : synonyms) {
// smallMoleculeReference.addName(synonym);
// }

smallMoleculeReference.addComment(definition);

String[] tokens = chemicalId.split(":"); //length=2 always
smallMoleculeReference.addXref(createXref(model, UnificationXref.class, tokens[0], tokens[1]));

for (String dbId : dbIds) {
if(dbId.isEmpty()) { continue; }
smallMoleculeReference.addXref(createXref(model, RelationshipXref.class, "DrugBank", dbId));
}
// for (String dbId : dbIds) {
// if(dbId.isEmpty()) { continue; }
// smallMoleculeReference.addXref(createXref(model, RelationshipXref.class, "DrugBank", dbId));
// }

for (String parentID : parentIDs) {
if(parentID.isEmpty()) { continue; }
tokens = parentID.split(":");
smallMoleculeReference.addXref(createXref(model, RelationshipXref.class, "MeSH 2013", tokens[1]));
}
// for (String parentID : parentIDs) {
// if(parentID.isEmpty()) { continue; }
// tokens = parentID.split(":");
// smallMoleculeReference.addXref(createXref(model, RelationshipXref.class, "MeSH 2013", tokens[1]));
// }

if(casRN != null && !casRN.isEmpty()) {
smallMoleculeReference.addXref(createXref(model, RelationshipXref.class, "CAS", casRN));
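
After this change a chemical row yields one unification xref from its `ChemicalID` and, only when `CasRN` is non-empty, one CAS relationship xref. The sketch below replays that logic in plain Java on the two rows of the test file, with some fields shortened. It does not use Paxtools or OpenCSV; `ChemicalRowSketch` is a hypothetical name, and its `sanitizeId` is an assumed stand-in for `CtdUtil.sanitizeId`, whose real rules are not shown in this diff.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java replay of the per-row logic; xrefs are represented as strings.
class ChemicalRowSketch {

    // Columns: ChemicalName, ChemicalID, CasRN, Definition, ParentIDs,
    // TreeNumbers, ParentTreeNumbers, Synonyms (some values shortened here).
    static final String[][] ROWS = {
        {"thiolactic acid", "MESH:C023884", "79-42-5", "", "MESH:D013438",
         "D02.886.489/C023884", "D02.886.489", "2-mercaptopropanoic acid|2-mercaptopropionic acid"},
        {"oxidative potential water", "MESH:C106820", "", "", "MESH:D014867",
         "D01.045.250.875/C106820", "D01.045.250.875", "dental OPW"}
    };

    // Assumption: replaces anything outside [A-Za-z0-9_] with '_'; the real helper may differ.
    static String sanitizeId(String s) {
        return s.replaceAll("[^A-Za-z0-9_]", "_");
    }

    static String referenceId(String chemicalId) {
        return sanitizeId("ref_chemical_" + chemicalId.toLowerCase(Locale.ROOT));
    }

    static List<String> xrefs(String[] row) {
        if (row.length < 8) {
            throw new IllegalArgumentException("expected at least 8 columns, got " + row.length);
        }
        List<String> out = new ArrayList<>();
        String[] tokens = row[1].split(":", 2); // e.g. "MESH" and "C106820"
        if (tokens.length != 2) {
            throw new IllegalArgumentException("unexpected ChemicalID: " + row[1]);
        }
        out.add("UnificationXref " + tokens[0] + ":" + tokens[1]);
        String casRN = row[2];
        if (casRN != null && !casRN.isEmpty()) {
            out.add("RelationshipXref CAS:" + casRN);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] row : ROWS) {
            System.out.println(referenceId(row[1]) + " " + xrefs(row));
        }
    }
}
```

Two references, two unification xrefs, and a single CAS xref add up to five objects, which matches the counts asserted in the `convertChemicals` test of this commit.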
48 changes: 25 additions & 23 deletions src/main/java/org/ctdbase/converter/CTDGeneConverter.java
@@ -20,25 +20,22 @@ public class CTDGeneConverter extends Converter {
public Model convert(InputStream inputStream) throws IOException {
CSVReader reader = new CSVReader(new InputStreamReader(inputStream));
String[] nextLine;

Model model = createNewModel();

while((nextLine = reader.readNext()) != null) {
// Skip commented lines
if (nextLine[0].startsWith("#")) {
continue;
}

if(nextLine.length < 8) {
log.warn(nextLine[0] + "' does not have enough columns to it. Skipping.");
continue;
}

// create an ER of different type for each gene form
for (GeneForm geneForm : GeneForm.values()) {
generateReference(model, geneForm.getReferenceClass(), geneForm, nextLine);
generateReference(model, geneForm, nextLine);
}
}

reader.close();

log.info("Done with the gene conversion. "
@@ -51,7 +48,6 @@ public Model convert(InputStream inputStream) throws IOException {

private EntityReference generateReference(
Model model,
Class<? extends EntityReference> aClass,
GeneForm geneForm,
String[] tokens)
{
@@ -70,11 +66,11 @@ private EntityReference generateReference(
String geneSymbol = tokens[0];
String geneName = tokens[1];
String geneID = tokens[2];
String[] altGeneIds = tokens[3].split(INTRA_FIELD_SEPARATOR);
String[] synonyms = tokens[4].split(INTRA_FIELD_SEPARATOR);
String[] biogridIds = tokens[5].split(INTRA_FIELD_SEPARATOR);
String[] pharmGKBIds = tokens[6].split(INTRA_FIELD_SEPARATOR);
String[] uniprotIds = tokens[7].split(INTRA_FIELD_SEPARATOR);
// String[] altGeneIds = tokens[3].split(INTRA_FIELD_SEPARATOR);
// String[] synonyms = tokens[4].split(INTRA_FIELD_SEPARATOR);
// String[] biogridIds = tokens[5].split(INTRA_FIELD_SEPARATOR);
// String[] pharmGKBIds = tokens[6].split(INTRA_FIELD_SEPARATOR);
// String[] uniprotIds = tokens[7].split(INTRA_FIELD_SEPARATOR); //often not relevant organism...

String rdfId = CtdUtil.sanitizeId("ref_" + geneForm.toString().toLowerCase()
+ "_gene_" + geneID.toLowerCase());
@@ -84,28 +80,34 @@ private EntityReference generateReference(
log.warn("Already had the gene " + geneID + ". Skipping it.");
return null;
}
entityReference = create(aClass, rdfId);

entityReference = create(geneForm.getReferenceClass(), rdfId);
entityReference.addXref(createXref(model, RelationshipXref.class, "hgnc.symbol", geneSymbol));
entityReference.addXref(createXref(model, RelationshipXref.class, "ncbigene", geneID));
entityReference.setStandardName(geneSymbol);
entityReference.setDisplayName(geneSymbol);
entityReference.addName(geneSymbol);
for (String synonym : synonyms) {
if(!synonym.isEmpty()) { entityReference.addName(synonym); }
// for (String synonym : synonyms) { //too many, can be found in other/external resources after all
// if(!synonym.isEmpty()) {
// entityReference.addName(synonym);
// }
// }

if(!geneName.isEmpty()) {
entityReference.addComment(geneName);
}

if(!geneName.isEmpty()) { entityReference.addComment(geneName); }
entityReference.addXref(createXref(model, RelationshipXref.class, "NCBI Gene", geneID));
// Let's skip other NCBI gene references, they inflate the model
//addXrefsFromArray(model, entityReference, RelationshipXref.class, "NCBI Gene", altGeneIds);
addXrefsFromArray(model, entityReference, RelationshipXref.class, "BioGRID", biogridIds);
addXrefsFromArray(model, entityReference, RelationshipXref.class, "PharmGKB Gene", pharmGKBIds);
addXrefsFromArray(model, entityReference, RelationshipXref.class, "UniProt", uniprotIds);
// Let's skip other NCBI gene references, for they inflate the model too much...
// addXrefsFromArray(model, entityReference, RelationshipXref.class, "NCBI Gene", altGeneIds);
// addXrefsFromArray(model, entityReference, RelationshipXref.class, "BioGRID", biogridIds);
// addXrefsFromArray(model, entityReference, RelationshipXref.class, "PharmGKB Gene", pharmGKBIds);
// addXrefsFromArray(model, entityReference, RelationshipXref.class, "UniProt", uniprotIds);

model.add(entityReference);
return entityReference;
}

private void addXrefsFromArray(Model model, EntityReference entityReference, Class<? extends Xref> xrefClass, String db, String[] ids) {
private void addXrefsFromArray(Model model, EntityReference entityReference,
Class<? extends Xref> xrefClass, String db, String[] ids) {
for (String id : ids) {
if(!id.isEmpty()) {
entityReference.addXref(createXref(model, xrefClass, db, id));
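
With the alternative IDs, synonyms, BioGRID, PharmGKB, and UniProt xrefs dropped, each gene row contributes one entity reference per gene form plus two xrefs, `hgnc.symbol` and `ncbigene`. The sketch below shows how the object count stays small if those xrefs are shared across forms. It is plain Java, not the converter: `GeneReferenceSketch` and its three-value `Form` enum are hypothetical (the project's `GeneForm` enum is not shown in this diff), the ID sanitizing step is omitted, and xref sharing is an assumption suggested by the test counts.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Counts the objects one gene row would produce, with xrefs represented as strings.
class GeneReferenceSketch {

    // Hypothetical subset of gene forms; the real GeneForm enum has its own values.
    enum Form { GENE, MRNA, PROTEIN }

    static String referenceId(Form form, String geneId) {
        // Mirrors the concatenation in generateReference, without CtdUtil.sanitizeId.
        return "ref_" + form.toString().toLowerCase(Locale.ROOT)
                + "_gene_" + geneId.toLowerCase(Locale.ROOT);
    }

    /** Fills 'refs' with one ID per form and 'xrefs' with the (deduplicated) xref keys. */
    static void build(String symbol, String geneId, List<String> refs, Set<String> xrefs) {
        for (Form form : Form.values()) {
            refs.add(referenceId(form, geneId));
            xrefs.add("hgnc.symbol:" + symbol); // same key for every form, stored once
            xrefs.add("ncbigene:" + geneId);
        }
    }

    public static void main(String[] args) {
        List<String> refs = new ArrayList<>();
        Set<String> xrefs = new LinkedHashSet<>();
        build("BAX", "581", refs, xrefs);
        System.out.println(refs.size() + " references, " + xrefs.size() + " xrefs");
        System.out.println(refs);
        System.out.println(xrefs);
    }
}
```

The same arithmetic appears in the `convertGenes` test of this commit, where the entity references for one gene plus two relationship xrefs account for all 17 objects in the model.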
Original file line number Diff line number Diff line change
@@ -1,15 +1,16 @@
package org.ctdbase.converter;

//import org.biopax.paxtools.io.SimpleIOHandler;
import org.biopax.paxtools.model.Model;
import org.biopax.paxtools.model.level3.*;
import org.ctdbase.util.model.GeneForm;
import org.junit.Test;

import java.io.IOException;

import static org.junit.Assert.*;

/**
* Created by igor on 04/04/17.
*/
public class CTDInteractionConverterTest {
public class CTDConvertersTest {

@Test
public void convert() {
@@ -95,18 +96,19 @@ public void convert() {

// test filtering by a taxonomy id which is not present in the data
@Test
public void convertYest() {
public void convertTaxon() {
CTDInteractionConverter converter = new CTDInteractionConverter("559292");
Model m = converter.convert(getClass().getResourceAsStream("/chem_gene_ixns_struct.xml"));
assertTrue(m.getObjects(Control.class).isEmpty());

converter = new CTDInteractionConverter("undefined");
m = converter.convert(getClass().getResourceAsStream("/chem_gene_ixns_struct.xml"));
// (new SimpleIOHandler()).convertToOWL(m, System.out);
assertEquals(8, m.getObjects(Control.class).size());

converter = new CTDInteractionConverter("defined");
m = converter.convert(getClass().getResourceAsStream("/chem_gene_ixns_struct.xml"));
assertEquals(36, m.getObjects(Control.class).size());

//see if no. controls generated with undefined + defined = all (null)
converter = new CTDInteractionConverter(null); //convert everything - any species, and undefined too
m = converter.convert(getClass().getResourceAsStream("/chem_gene_ixns_struct.xml"));
@@ -116,9 +118,43 @@ public void convertYest() {
converter = new CTDInteractionConverter("10090");
m = converter.convert(getClass().getResourceAsStream("/chem_gene_ixns_struct.xml"));
assertEquals(1, m.getObjects(Control.class).size());

//human (ignoring records with no taxon defined)
converter = new CTDInteractionConverter("9606");
m = converter.convert(getClass().getResourceAsStream("/chem_gene_ixns_struct.xml"));
assertEquals(35, m.getObjects(Control.class).size());
// (new SimpleIOHandler()).convertToOWL(m, System.out);
}

@Test
public void convertGenes() throws IOException {
CTDGeneConverter converter = new CTDGeneConverter();
Model m = converter.convert(getClass().getResourceAsStream("/test_CTD_genes.csv"));
assertEquals(GeneForm.values().length, m.getObjects(EntityReference.class).size());
assertEquals(3, m.getObjects(ProteinReference.class).size());
assertEquals(5, m.getObjects(RnaRegionReference.class).size());
assertEquals(2, m.getObjects(DnaRegionReference.class).size());
assertEquals(3, m.getObjects(RnaReference.class).size());
assertEquals(2, m.getObjects(DnaReference.class).size());
assertEquals(2, m.getObjects(RelationshipXref.class).size());
assertEquals(0, m.getObjects(UnificationXref.class).size());
assertEquals(17, m.getObjects().size());
RnaReference rr1 = (RnaReference) m.getByID("ctdbase:ref_chemical_mesh_c106820");
//assertNotNull(rr1);
// (new SimpleIOHandler()).convertToOWL(m, System.out);
}

@Test
public void convertChemicals() throws IOException {
CTDChemicalConverter converter = new CTDChemicalConverter();
converter.setXmlBase("ctd:");
Model m = converter.convert(getClass().getResourceAsStream("/test_CTD_chemicals.csv"));
assertEquals(2, m.getObjects(SmallMoleculeReference.class).size());
assertEquals(1, m.getObjects(RelationshipXref.class).size());
assertEquals(2, m.getObjects(UnificationXref.class).size());
assertEquals(5, m.getObjects().size());
SmallMoleculeReference smr1 = (SmallMoleculeReference) m.getByID("ctd:ref_chemical_mesh_c106820");
assertNotNull(smr1);
// (new SimpleIOHandler()).convertToOWL(m, System.out);
}
}
32 changes: 32 additions & 0 deletions src/test/resources/test_CTD_chemicals.csv
@@ -0,0 +1,32 @@
# THIS IS an EXCERPT from original CTD_chemicals.csv file for DEV/TESTING PURPOSE!
# The Comparative Toxicogenomics Database (CTD) - http://ctdbase.org/
# Copyright 2002-2012 MDI Biological Laboratory. All rights reserved.
# Copyright 2012-2024 NC State University. All rights reserved.
#
#
# Use is subject to the terms set forth at http://ctdbase.org/about/legal.jsp
# These terms include:
#
# 1. All forms of publication (e.g., web sites, research papers, databases,
# software applications, etc.) that use or rely on CTD data must cite CTD.
# Citation guidelines: http://ctdbase.org/about/publications/#citing
#
# 2. All electronic or online applications must include hyperlinks from
# contexts that use CTD data to the applicable CTD data pages.
# Linking instructions: http://ctdbase.org/help/linking.jsp
#
# 3. You must notify CTD, and describe your use of our data:
# http://ctdbase.org/help/contact.go
#
# 4. For quality control purposes, you must provide CTD with periodic
# access to your publication of our data.
#
# More information: http://ctdbase.org/downloads/
#
# Report created: Wed Feb 28 10:59:04 EST 2024
#
# Fields:
# ChemicalName,ChemicalID,CasRN,Definition,ParentIDs,TreeNumbers,ParentTreeNumbers,Synonyms
#
thiolactic acid,MESH:C023884,79-42-5,,MESH:D013438,D02.886.489/C023884,D02.886.489,"2-mercaptopropanoic acid|2-mercaptopropionic acid|ammonium thiolactate|thiolactic acid, calcium salt (2:1)|thiolactic acid, disilver salt (+1)|thiolactic acid, monoammonium salt|thiolactic acid, monolithium salt|thiolactic acid, (R)-isomer|thiolactic acid, (S)-isomer|thiolactic acid, strontium salt (2:1)"
oxidative potential water,MESH:C106820,,,MESH:D014867,D01.045.250.875/C106820|D01.248.497.158.459.650/C106820|D01.650.550.925/C106820,D01.045.250.875|D01.248.497.158.459.650|D01.650.550.925,dental OPW