Fixed the converters for CTD 2024 data, added hgnc.symbol xrefs, removed not-so-useful synonyms and parent/related/other IDs, etc., added tests. This should generate a smaller BioPAX model.
PathwayCommons · Mar 10, 2024 · commit e7799c9
1 parent 35831d1
Showing 7 changed files with 162 additions and 60 deletions.
diff --git a/README.md b/README.md
@@ -5,7 +5,7 @@ Originated from https://bitbucket.org/armish/gsoc14 and will continue here (ToDo
 
 Unlike many other drug-target databases, this data resource has a controlled 
 vocabulary that can be mapped to BioPAX, for example: 'nutlin 3 results 
-in increased expression of BAX'. Therefore implementation of a converter 
+in increased expression of BAX'. Therefore, implementation of a converter 
 first requires a manual mapping from CTD terms to BioPAX ontology. 
 Once the mapping is done, then the actual conversion requires parsing 
 and integrating several CSV files that are distributed by the provider.
@@ -36,20 +36,17 @@ the converted models are merged and a single BioPAX file is provided as output.
 
 The gene/chemical vocabulary converters produce BioPAX file with only 
 `EntityReference`s in them. Each entity reference in this converted 
-models includes all the external referneces provided within the vocabulary file.
+models includes the external references provided within the vocabulary file.
 From the chemical vocabulary, `SmallMoleculeReference`s are produced;
 and from the gene vocabulary, various types of references are produced 
 for corresponding CTD gene forms: `ProteinReference`, `DnaReference`, 
 `RnaReference`, `DnaRegionReference` and `RnaRegionReference`.
 
 The interactions file contains all detailed interactions between chemicals 
 and genes, but no background information on the chemical/gene entities.
-Therefore it is necessary to convert all these files and merge these 
-models into one in order to get a properly annotated BioPAX model.
-The converter exactly does that by making sure that the entity references 
-from the vocabulary files match with the ones produced from the interactions file.
-This allows filling in the gaps and annotations of the entities in the 
-final converted model.
+
+We can convert any or all of these three files at once, 
+merge into one BioPAX model.
 
 The CTD data sets have nested interactions that are captured by their 
 structured XML file and their XML schema: 
@@ -79,8 +76,8 @@ to run without any command line options to see the help text:
 	 -c,--chemical <arg>      CTD chemical vocabulary (CSV) [optional]
 	 -g,--gene <arg>          CTD gene vocabulary (CSV) [optional]
 	 -o,--output <arg>        Output (BioPAX file) [required]
-	 -r,--remove-dangling     Remove dangling entities for clean-up [optional]
-	 -t,--taxonomy <arg>      Taxonomy (e.g. '9606' for human; 
+	 -r,--remove-dangling     Remove dangling utility class entities [optional; use with -x -t]
+	 -t,--taxonomy <arg>      filter interactions by species, Taxonomy ID ('9606' for human);
 	                          can use special values: 'defined', 'undefined', and 'null') [optional]
 	 -x,--interaction <arg>   structured chemical-gene interaction file (XML)
 	                          [optional]
@@ -89,6 +86,6 @@ If you want to test the converter though, you can download small (old) example
 files from [goal2_ctd_smallSampleInputFiles-20140702.zip](https://bitbucket.org/armish/gsoc14/downloads/goal2_ctd_smallSampleInputFiles-20140702.zip).
 To convert these sample files into a single BioPAX file, run the following command:
 
-	$ java -jar ctd-to-biopax.jar -x ctd_small.xml -c CTD_chemicals_small.csv -g CTD_genes_small.csv -r -o ctd.owl
+	$ java -jar ctd-to-biopax.jar -x ctd_small.xml -c CTD_chemicals_small.csv -g CTD_genes_small.csv -r -t 9606 -o ctd.owl
 
 which will create the `ctd.owl` file for you.
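
The `-t` option described above accepts either a concrete taxonomy ID or one of the special values. The following is a minimal sketch of how such a filter could behave, not the converter's actual code: the class `TaxonFilterSketch`, its methods, and the treatment of the literal string `null` are assumptions made for illustration. It treats a missing or empty taxon ID on a record as "undefined".

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, not part of ctd-to-biopax.
class TaxonFilterSketch {

    /** Builds a predicate over a record's taxon ID (null or empty when the record has none). */
    static Predicate<String> forOption(String option) {
        // No filter (assumption: the literal string "null" means the same): keep everything.
        if (option == null || "null".equals(option)) {
            return taxonId -> true;
        }
        switch (option) {
            case "defined":
                return taxonId -> taxonId != null && !taxonId.isEmpty();
            case "undefined":
                return taxonId -> taxonId == null || taxonId.isEmpty();
            default:
                // A concrete ID such as "9606"; records without a taxon never match.
                return option::equals;
        }
    }

    /** Counts how many of the given taxon IDs pass the filter. */
    static long count(List<String> taxonIds, String option) {
        return taxonIds.stream().filter(forOption(option)).count();
    }

    public static void main(String[] args) {
        List<String> taxonIds = Arrays.asList("9606", "10090", null, "9606");
        for (String option : new String[] {"defined", "undefined", "9606", "559292"}) {
            System.out.println(option + " -> " + count(taxonIds, option));
        }
        System.out.println("no filter -> " + count(taxonIds, null));
    }
}
```

Under these assumptions, the "defined" and "undefined" counts always add up to the unfiltered count, which is the same invariant the `convertTaxon` test in this commit checks.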
diff --git a/src/main/java/org/ctdbase/CtdToBiopax.java b/src/main/java/org/ctdbase/CtdToBiopax.java
@@ -1,5 +1,6 @@
 package org.ctdbase;
 
+import org.biopax.paxtools.model.level3.UtilityClass;
 import org.ctdbase.converter.CTDChemicalConverter;
 import org.ctdbase.converter.CTDGeneConverter;
 import org.ctdbase.converter.CTDInteractionConverter;
@@ -32,7 +33,8 @@ public static void main( String[] args ) {
                 .addOption("c", "chemical", true, "CTD chemical vocabulary (CSV) [optional]")
                 .addOption("o", "output", true, "Output (BioPAX file) [required]")
                 .addOption("t", "taxonomy", true, "Taxonomy (e.g. '9606' for human) [optional]")
-                .addOption("r", "remove-dangling", false, "Remove dangling entities for clean-up [optional]")
+                .addOption("r", "remove-dangling", false,
+                    "Remove dangling UtilityClass objects from final model [optional; recommended when using options: -x -t]")
         ;
 
         try {
@@ -80,8 +82,8 @@ public static void main( String[] args ) {
             }
 
             if(commandLine.hasOption("r")) {
-                Set<BioPAXElement> removed = ModelUtils.removeObjectsIfDangling(finalModel, EntityReference.class);
-                log.info("Removed " + removed.size() + " dangling entity references from the model.");
+                Set<BioPAXElement> removed = ModelUtils.removeObjectsIfDangling(finalModel, UtilityClass.class);
+                log.info("Removed " + removed.size() + " dangling UtilityClass objects from the model.");
             }
 
             finalModel.setXmlBase(Converter.sharedXMLBase);

diff --git a/src/main/java/org/ctdbase/converter/CTDChemicalConverter.java b/src/main/java/org/ctdbase/converter/CTDChemicalConverter.java
@@ -25,7 +25,7 @@ public Model convert(InputStream inputStream) throws IOException {
             // Skip commented lines
             if(nextLine[0].startsWith("#")) { continue; }
 
-            if(nextLine.length < 9) {
+            if(nextLine.length < 8) {
                 log.warn(nextLine[0] + "' does not have enough columns. Skipping.");
                 continue;
             }
@@ -45,9 +45,9 @@ public Model convert(InputStream inputStream) throws IOException {
             String chemicalId = nextLine[1];
             String casRN = nextLine[2];
             String definition = nextLine[3];
-            String[] parentIDs = nextLine[4].split(INTRA_FIELD_SEPARATOR);
-            String[] synonyms = nextLine[7].split(INTRA_FIELD_SEPARATOR);
-            String[] dbIds = nextLine[8].split(INTRA_FIELD_SEPARATOR);
+//            String[] parentIDs = nextLine[4].split(INTRA_FIELD_SEPARATOR);
+//            String[] synonyms = nextLine[7].split(INTRA_FIELD_SEPARATOR);
+//            String[] dbIds = nextLine[8].split(INTRA_FIELD_SEPARATOR); //not present in CTD 2024 data...
 
             String rdfId = CtdUtil.sanitizeId("ref_chemical_" + chemicalId.toLowerCase());
 
@@ -60,26 +60,25 @@ public Model convert(InputStream inputStream) throws IOException {
 
             smallMoleculeReference.setDisplayName(chemName);
             smallMoleculeReference.setStandardName(chemName);
-            smallMoleculeReference.addName(chemName);
-            for (String synonym : synonyms) {
-                smallMoleculeReference.addName(synonym);
-            }
+//            for (String synonym : synonyms) {
+//                smallMoleculeReference.addName(synonym);
+//            }
 
             smallMoleculeReference.addComment(definition);
 
             String[] tokens = chemicalId.split(":"); //length=2 always
             smallMoleculeReference.addXref(createXref(model, UnificationXref.class, tokens[0], tokens[1]));
 
-            for (String dbId : dbIds) {
-                if(dbId.isEmpty()) { continue; }
-                smallMoleculeReference.addXref(createXref(model, RelationshipXref.class, "DrugBank", dbId));
-            }
+//            for (String dbId : dbIds) {
+//                if(dbId.isEmpty()) { continue; }
+//                smallMoleculeReference.addXref(createXref(model, RelationshipXref.class, "DrugBank", dbId));
+//            }
 
-            for (String parentID : parentIDs) {
-                if(parentID.isEmpty()) { continue; }
-                tokens = parentID.split(":");
-                smallMoleculeReference.addXref(createXref(model, RelationshipXref.class, "MeSH 2013", tokens[1]));
-            }
+//            for (String parentID : parentIDs) {
+//                if(parentID.isEmpty()) { continue; }
+//                tokens = parentID.split(":");
+//                smallMoleculeReference.addXref(createXref(model, RelationshipXref.class, "MeSH 2013", tokens[1]));
+//            }
 
             if(casRN != null && !casRN.isEmpty()) {
                 smallMoleculeReference.addXref(createXref(model, RelationshipXref.class, "CAS", casRN));
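
As a reading aid for the hunk above, here is a minimal sketch of how one row of the CTD 2024 chemical vocabulary (8 columns, no DrugBank IDs column) maps to a reference ID and a unification xref. It is not the converter itself: `ChemicalRowSketch` and its `sanitizeId` are hypothetical stand-ins, and the real `CtdUtil.sanitizeId` may behave differently. The row is assumed to be already split into fields by a CSV parser.

```java
import java.util.Locale;

// Hypothetical illustration, not part of ctd-to-biopax.
class ChemicalRowSketch {
    // The CTD 2024 file lists 8 fields, ending with Synonyms.
    static final int MIN_COLUMNS = 8;

    /** Assumed behavior: replace every character outside [A-Za-z0-9_] with '_'. */
    static String sanitizeId(String raw) {
        return raw.replaceAll("[^A-Za-z0-9_]", "_");
    }

    /** Returns the local reference ID for a row, or null if the row is too short. */
    static String referenceId(String[] row) {
        if (row.length < MIN_COLUMNS) {
            return null; // the converter logs a warning and skips such rows
        }
        String chemicalId = row[1]; // e.g. "MESH:C106820"
        return sanitizeId("ref_chemical_" + chemicalId.toLowerCase(Locale.ROOT));
    }

    /** Splits "DB:ID" into {db, id} for the unification xref. */
    static String[] unificationXref(String[] row) {
        return row[1].split(":", 2);
    }

    public static void main(String[] args) {
        String[] row = {
            "oxidative potential water", "MESH:C106820", "", "", "MESH:D014867",
            "D01.045.250.875/C106820", "D01.045.250.875", "dental OPW"
        };
        System.out.println(referenceId(row)); // prints "ref_chemical_mesh_c106820"
        String[] xref = unificationXref(row);
        System.out.println(xref[0] + " / " + xref[1]); // prints "MESH / C106820"
    }
}
```

With an XML base of `ctd:` prepended, this local ID matches the `ctd:ref_chemical_mesh_c106820` lookup in the `convertChemicals` test.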

diff --git a/src/main/java/org/ctdbase/converter/CTDGeneConverter.java b/src/main/java/org/ctdbase/converter/CTDGeneConverter.java
@@ -20,25 +20,22 @@ public class CTDGeneConverter extends Converter {
     public Model convert(InputStream inputStream) throws IOException {
         CSVReader reader = new CSVReader(new InputStreamReader(inputStream));
         String[] nextLine;
-
         Model model = createNewModel();
 
         while((nextLine = reader.readNext()) != null) {
             // Skip commented lines
             if (nextLine[0].startsWith("#")) {
                 continue;
             }
-
             if(nextLine.length < 8) {
                 log.warn(nextLine[0] + "' does not have enough columns to it. Skipping.");
                 continue;
             }
-
+            // create an ER of different type for each gene form
             for (GeneForm geneForm : GeneForm.values()) {
-                generateReference(model, geneForm.getReferenceClass(), geneForm, nextLine);
+                generateReference(model, geneForm, nextLine);
             }
         }
-
         reader.close();
 
         log.info("Done with the gene conversion. "
@@ -51,7 +48,6 @@ public Model convert(InputStream inputStream) throws IOException {
 
     private EntityReference generateReference(
             Model model,
-            Class<? extends EntityReference> aClass,
             GeneForm geneForm,
             String[] tokens)
     {
@@ -70,11 +66,11 @@ private EntityReference generateReference(
         String geneSymbol = tokens[0];
         String geneName = tokens[1];
         String geneID = tokens[2];
-        String[] altGeneIds = tokens[3].split(INTRA_FIELD_SEPARATOR);
-        String[] synonyms = tokens[4].split(INTRA_FIELD_SEPARATOR);
-        String[] biogridIds = tokens[5].split(INTRA_FIELD_SEPARATOR);
-        String[] pharmGKBIds = tokens[6].split(INTRA_FIELD_SEPARATOR);
-        String[] uniprotIds = tokens[7].split(INTRA_FIELD_SEPARATOR);
+//        String[] altGeneIds = tokens[3].split(INTRA_FIELD_SEPARATOR);
+//        String[] synonyms = tokens[4].split(INTRA_FIELD_SEPARATOR);
+//        String[] biogridIds = tokens[5].split(INTRA_FIELD_SEPARATOR);
+//        String[] pharmGKBIds = tokens[6].split(INTRA_FIELD_SEPARATOR);
+//        String[] uniprotIds = tokens[7].split(INTRA_FIELD_SEPARATOR); //often not relevant organism...
 
         String rdfId = CtdUtil.sanitizeId("ref_" +  geneForm.toString().toLowerCase()
                 + "_gene_" + geneID.toLowerCase());
@@ -84,28 +80,34 @@ private EntityReference generateReference(
             log.warn("Already had the gene " + geneID + ". Skipping it.");
             return null;
         }
-        entityReference = create(aClass, rdfId);
 
+        entityReference = create(geneForm.getReferenceClass(), rdfId);
+        entityReference.addXref(createXref(model, RelationshipXref.class, "hgnc.symbol", geneSymbol));
+        entityReference.addXref(createXref(model, RelationshipXref.class, "ncbigene", geneID));
         entityReference.setStandardName(geneSymbol);
         entityReference.setDisplayName(geneSymbol);
-        entityReference.addName(geneSymbol);
-        for (String synonym : synonyms) {
-            if(!synonym.isEmpty()) { entityReference.addName(synonym); }
+//        for (String synonym : synonyms) { //too many, can be found in other/external resources after all
+//            if(!synonym.isEmpty()) {
+//                entityReference.addName(synonym);
+//            }
+//        }
+
+        if(!geneName.isEmpty()) {
+            entityReference.addComment(geneName);
         }
 
-        if(!geneName.isEmpty()) { entityReference.addComment(geneName); }
-        entityReference.addXref(createXref(model, RelationshipXref.class, "NCBI Gene", geneID));
-        // Let's skip other NCBI gene references, they inflate the model
-        //addXrefsFromArray(model, entityReference, RelationshipXref.class, "NCBI Gene", altGeneIds);
-        addXrefsFromArray(model, entityReference, RelationshipXref.class, "BioGRID", biogridIds);
-        addXrefsFromArray(model, entityReference, RelationshipXref.class, "PharmGKB Gene", pharmGKBIds);
-        addXrefsFromArray(model, entityReference, RelationshipXref.class, "UniProt", uniprotIds);
+        // Let's skip other NCBI gene references, for they inflate the model too much...
+//        addXrefsFromArray(model, entityReference, RelationshipXref.class, "NCBI Gene", altGeneIds);
+//        addXrefsFromArray(model, entityReference, RelationshipXref.class, "BioGRID", biogridIds);
+//        addXrefsFromArray(model, entityReference, RelationshipXref.class, "PharmGKB Gene", pharmGKBIds);
+//        addXrefsFromArray(model, entityReference, RelationshipXref.class, "UniProt", uniprotIds);
 
         model.add(entityReference);
         return entityReference;
     }
 
-    private void addXrefsFromArray(Model model, EntityReference entityReference, Class<? extends Xref> xrefClass, String db, String[] ids) {
+    private void addXrefsFromArray(Model model, EntityReference entityReference,
+                                   Class<? extends Xref> xrefClass, String db, String[] ids) {
         for (String id : ids) {
             if(!id.isEmpty()) {
                 entityReference.addXref(createXref(model, xrefClass, db, id));
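
The gene converter now emits one entity reference per gene form, each carrying only two relationship xrefs (`hgnc.symbol` and `ncbigene`). The sketch below illustrates that shape with plain collections. It is an assumption-laden illustration: `GeneFormSketch` and its three-value `Form` enum are hypothetical (the real `GeneForm` enum has more values), the ID is not passed through `CtdUtil.sanitizeId`, and the BAX input is only an example.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration, not part of ctd-to-biopax.
class GeneFormSketch {
    // Hypothetical subset of gene forms.
    enum Form { PROTEIN, GENE, MRNA }

    /** Mirrors the "ref_<form>_gene_<id>" naming pattern used in the converter. */
    static String referenceId(Form form, String geneId) {
        return "ref_" + form.name().toLowerCase(Locale.ROOT)
                + "_gene_" + geneId.toLowerCase(Locale.ROOT);
    }

    /** Maps each reference ID to its xref keys; every form gets the same two xrefs. */
    static Map<String, Set<String>> convertRow(String geneSymbol, String geneId) {
        Map<String, Set<String>> references = new LinkedHashMap<>();
        for (Form form : Form.values()) {
            Set<String> xrefs = new LinkedHashSet<>();
            xrefs.add("hgnc.symbol:" + geneSymbol);
            xrefs.add("ncbigene:" + geneId);
            references.put(referenceId(form, geneId), xrefs);
        }
        return references;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> references = convertRow("BAX", "581");
        Set<String> distinctXrefs = new LinkedHashSet<>();
        references.values().forEach(distinctXrefs::addAll);
        System.out.println(references.keySet());
        System.out.println(references.size() + " references, "
                + distinctXrefs.size() + " distinct xrefs");
    }
}
```

The same counting appears to hold in the `convertGenes` test of this commit: the per-class counts sum to 3 + 5 + 2 + 3 + 2 = 15 entity references, and with the 2 shared relationship xrefs the model holds 17 objects.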

diff --git a/...onverter/CTDInteractionConverterTest.java b/.../ctdbase/converter/CTDConvertersTest.java
@@ -1,15 +1,16 @@
 package org.ctdbase.converter;
 
+//import org.biopax.paxtools.io.SimpleIOHandler;
 import org.biopax.paxtools.model.Model;
 import org.biopax.paxtools.model.level3.*;
+import org.ctdbase.util.model.GeneForm;
 import org.junit.Test;
 
+import java.io.IOException;
+
 import static org.junit.Assert.*;
 
-/**
- * Created by igor on 04/04/17.
- */
-public class CTDInteractionConverterTest {
+public class CTDConvertersTest {
 
     @Test
     public void convert() {
@@ -95,18 +96,19 @@ public void convert() {
 
     // test filtering by a taxonomy id which is not present in the data
     @Test
-    public void convertYest() {
+    public void convertTaxon() {
         CTDInteractionConverter converter = new CTDInteractionConverter("559292");
         Model m = converter.convert(getClass().getResourceAsStream("/chem_gene_ixns_struct.xml"));
         assertTrue(m.getObjects(Control.class).isEmpty());
 
         converter = new CTDInteractionConverter("undefined");
         m = converter.convert(getClass().getResourceAsStream("/chem_gene_ixns_struct.xml"));
-//        (new SimpleIOHandler()).convertToOWL(m, System.out);
         assertEquals(8, m.getObjects(Control.class).size());
+
         converter = new CTDInteractionConverter("defined");
         m = converter.convert(getClass().getResourceAsStream("/chem_gene_ixns_struct.xml"));
         assertEquals(36, m.getObjects(Control.class).size());
+
         //see if no. controls generated with undefined + defined = all (null)
         converter = new CTDInteractionConverter(null); //convert everything - any species, and undefined too
         m = converter.convert(getClass().getResourceAsStream("/chem_gene_ixns_struct.xml"));
@@ -116,9 +118,43 @@ public void convertYest() {
         converter = new CTDInteractionConverter("10090");
         m = converter.convert(getClass().getResourceAsStream("/chem_gene_ixns_struct.xml"));
         assertEquals(1, m.getObjects(Control.class).size());
+
         //human (ignoring records with no taxon defined)
         converter = new CTDInteractionConverter("9606");
         m = converter.convert(getClass().getResourceAsStream("/chem_gene_ixns_struct.xml"));
         assertEquals(35, m.getObjects(Control.class).size());
+//        (new SimpleIOHandler()).convertToOWL(m, System.out);
+    }
+
+    @Test
+    public void convertGenes() throws IOException {
+        CTDGeneConverter converter = new CTDGeneConverter();
+        Model m = converter.convert(getClass().getResourceAsStream("/test_CTD_genes.csv"));
+        assertEquals(GeneForm.values().length, m.getObjects(EntityReference.class).size());
+        assertEquals(3, m.getObjects(ProteinReference.class).size());
+        assertEquals(5, m.getObjects(RnaRegionReference.class).size());
+        assertEquals(2, m.getObjects(DnaRegionReference.class).size());
+        assertEquals(3, m.getObjects(RnaReference.class).size());
+        assertEquals(2, m.getObjects(DnaReference.class).size());
+        assertEquals(2, m.getObjects(RelationshipXref.class).size());
+        assertEquals(0, m.getObjects(UnificationXref.class).size());
+        assertEquals(17, m.getObjects().size());
+        RnaReference rr1 = (RnaReference) m.getByID("ctdbase:ref_chemical_mesh_c106820");
+        //assertNotNull(rr1);
+//        (new SimpleIOHandler()).convertToOWL(m, System.out);
+    }
+
+    @Test
+    public void convertChemicals() throws IOException {
+        CTDChemicalConverter converter = new CTDChemicalConverter();
+        converter.setXmlBase("ctd:");
+        Model m = converter.convert(getClass().getResourceAsStream("/test_CTD_chemicals.csv"));
+        assertEquals(2, m.getObjects(SmallMoleculeReference.class).size());
+        assertEquals(1, m.getObjects(RelationshipXref.class).size());
+        assertEquals(2, m.getObjects(UnificationXref.class).size());
+        assertEquals(5, m.getObjects().size());
+        SmallMoleculeReference smr1 = (SmallMoleculeReference) m.getByID("ctd:ref_chemical_mesh_c106820");
+        assertNotNull(smr1);
+//        (new SimpleIOHandler()).convertToOWL(m, System.out);
     }
 }
diff --git a/src/test/resources/test_CTD_chemicals.csv b/src/test/resources/test_CTD_chemicals.csv
@@ -0,0 +1,32 @@
+# THIS IS an EXCERPT from original CTD_chemicals.csv file for DEV/TESTING PURPOSE!
+# The Comparative Toxicogenomics Database (CTD) - http://ctdbase.org/
+#   Copyright 2002-2012 MDI Biological Laboratory. All rights reserved.
+#   Copyright 2012-2024 NC State University. All rights reserved.
+#
+#
+# Use is subject to the terms set forth at http://ctdbase.org/about/legal.jsp
+# These terms include:
+#
+#   1. All forms of publication (e.g., web sites, research papers, databases,
+#      software applications, etc.) that use or rely on CTD data must cite CTD.
+#      Citation guidelines: http://ctdbase.org/about/publications/#citing
+#
+#   2. All electronic or online applications must include hyperlinks from
+#      contexts that use CTD data to the applicable CTD data pages.
+#      Linking instructions: http://ctdbase.org/help/linking.jsp
+#
+#   3. You must notify CTD, and describe your use of our data:
+#      http://ctdbase.org/help/contact.go
+#
+#   4. For quality control purposes, you must provide CTD with periodic
+#      access to your publication of our data.
+#
+# More information: http://ctdbase.org/downloads/
+#
+# Report created: Wed Feb 28 10:59:04 EST 2024
+#
+# Fields:
+# ChemicalName,ChemicalID,CasRN,Definition,ParentIDs,TreeNumbers,ParentTreeNumbers,Synonyms
+#
+thiolactic acid,MESH:C023884,79-42-5,,MESH:D013438,D02.886.489/C023884,D02.886.489,"2-mercaptopropanoic acid|2-mercaptopropionic acid|ammonium thiolactate|thiolactic acid, calcium salt (2:1)|thiolactic acid, disilver salt (+1)|thiolactic acid, monoammonium salt|thiolactic acid, monolithium salt|thiolactic acid, (R)-isomer|thiolactic acid, (S)-isomer|thiolactic acid, strontium salt (2:1)"
+oxidative potential water,MESH:C106820,,,MESH:D014867,D01.045.250.875/C106820|D01.248.497.158.459.650/C106820|D01.650.550.925/C106820,D01.045.250.875|D01.248.497.158.459.650|D01.650.550.925,dental OPW