This repository has been archived by the owner on Nov 29, 2023. It is now read-only.

Loads citation records from the annual MEDLINE/PubMed XML format distribution into a MySQL table


ncbo/pubmed_xml2rdbms

pubmed_xml2rdbms

This is an ETL project that transfers MEDLINE/PubMed citation record data from a set of XML files to a MySQL table.

The National Library of Medicine (NLM) provides a baseline set of MEDLINE/PubMed citation records in XML format for bulk download on an annual basis. For each citation record, we extract the following subset of data:

  • PubMed ID
  • Article Title
  • Abstract Text
  • Keywords
  • Medical Subject Headings (MeSH)
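The extraction of these five fields can be sketched with the JDK's streaming XML API (StAX). This is a minimal illustration, not the project's actual parser: the class names CitationSketch and CitationExtractorSketch are hypothetical, and the element names (PubmedArticle, PMID, ArticleTitle, AbstractText, Keyword, DescriptorName) are assumed from the published PubMed XML format. It takes the first PMID and ArticleTitle in each record and joins multiple AbstractText sections with a space.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical holder for the extracted fields; not a class from this project. */
final class CitationSketch {
    String pmid;
    String title;
    final StringBuilder abstractText = new StringBuilder();
    final List<String> keywords = new ArrayList<>();
    final List<String> meshHeadings = new ArrayList<>();
}

final class CitationExtractorSketch {

    /** Streams through the XML and returns one CitationSketch per PubmedArticle element. */
    static List<CitationSketch> parse(Reader xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not resolve external entities while parsing.
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);

        List<CitationSketch> citations = new ArrayList<>();
        CitationSketch current = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if ("PubmedArticle".equals(name)) {
                    current = new CitationSketch();
                } else if (current != null) {
                    switch (name) {
                        case "PMID":
                            // Later PMID elements in a record refer to other citations.
                            if (current.pmid == null) current.pmid = readText(reader);
                            break;
                        case "ArticleTitle":
                            if (current.title == null) current.title = readText(reader);
                            break;
                        case "AbstractText": {
                            String section = readText(reader);
                            if (current.abstractText.length() > 0) current.abstractText.append(' ');
                            current.abstractText.append(section);
                            break;
                        }
                        case "Keyword":
                            current.keywords.add(readText(reader));
                            break;
                        case "DescriptorName":
                            current.meshHeadings.add(readText(reader));
                            break;
                        default:
                            break;
                    }
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "PubmedArticle".equals(reader.getLocalName())) {
                if (current != null) citations.add(current);
                current = null;
            }
        }
        reader.close();
        return citations;
    }

    /** Collects all text up to the matching end tag, including text inside inline markup such as <i>. */
    private static String readText(XMLStreamReader reader) throws XMLStreamException {
        StringBuilder text = new StringBuilder();
        int depth = 1;
        while (depth > 0) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) depth++;
            else if (event == XMLStreamConstants.END_ELEMENT) depth--;
            else if (event == XMLStreamConstants.CHARACTERS || event == XMLStreamConstants.CDATA) {
                text.append(reader.getText());
            }
        }
        return text.toString().trim();
    }
}
```

A streaming parser is used in this sketch because the baseline files are large; it keeps only the current record in memory while reading. A real loader would write each record out as it is completed rather than collect them all in a list.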

After all citation records are processed, the fully populated MySQL table is used by the NCBO Resource Index project. The Resource Index consumes data from biomedical resources and generates annotations from ontology classes in the BioPortal application.
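Populating the table can be sketched with a batched JDBC insert. This is an illustration only: the class CitationWriterSketch and the column names (pmid, title, abstract_text, keywords, mesh_headings) are assumptions, since the actual table schema is not described here. Each row is assumed to be a five-element array in that column order.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class CitationWriterSketch {

    /** Builds the parameterized INSERT statement for a table name taken from configuration. */
    static String insertSql(String table) {
        // Table names cannot be bound as parameters, so restrict them to safe characters.
        if (!table.matches("[A-Za-z0-9_]+")) {
            throw new IllegalArgumentException("Unexpected table name: " + table);
        }
        // Column names are hypothetical.
        return "INSERT INTO " + table
                + " (pmid, title, abstract_text, keywords, mesh_headings)"
                + " VALUES (?, ?, ?, ?, ?)";
    }

    /** Sends all rows to the database in a single batch. */
    static void writeBatch(Connection conn, String table, List<String[]> rows) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(insertSql(table))) {
            for (String[] row : rows) {
                for (int i = 0; i < 5; i++) {
                    ps.setString(i + 1, row[i]); // JDBC parameters are 1-indexed
                }
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}
```

Batching reduces the number of round trips to the database, which matters when loading a full baseline; a real loader would likely flush the batch every few thousand rows instead of holding everything in one batch.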

Configuration

The configuration file in src/main/resources specifies the path to the baseline set of XML files and the database settings, such as the table name and credentials.
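As an illustration of how such settings might be read, the sketch below loads them with java.util.Properties and fails fast when one is missing. The class LoaderConfigSketch and the key names (xml.dir, db.url, db.table, db.user, db.password) are hypothetical; check the actual file in src/main/resources for the real names and format.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

final class LoaderConfigSketch {
    final String xmlDir;
    final String dbUrl;
    final String dbTable;
    final String dbUser;
    final String dbPassword;

    private LoaderConfigSketch(Properties p) {
        // All key names below are hypothetical.
        this.xmlDir = require(p, "xml.dir");
        this.dbUrl = require(p, "db.url");
        this.dbTable = require(p, "db.table");
        this.dbUser = require(p, "db.user");
        this.dbPassword = require(p, "db.password");
    }

    /** Reads key=value settings from the given source. */
    static LoaderConfigSketch load(Reader source) throws IOException {
        Properties p = new Properties();
        p.load(source);
        return new LoaderConfigSketch(p);
    }

    /** Returns the trimmed value, or throws if the key is absent or blank. */
    private static String require(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing required setting: " + key);
        }
        return value.trim();
    }
}
```

Validating every setting at startup means a missing path or credential is reported before any XML file is opened, rather than partway through a long load.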

Use the logback.xml file in src/main/resources to customize log output.

Build

This is a Maven project. Use the standard Maven command to compile the code and package a runnable JAR file:

mvn package

Make sure to use the JAR file with dependencies included, e.g.:

pubmed-xml2rdbms-1.0-SNAPSHOT-jar-with-dependencies.jar

Run

Successful execution of the JAR file assumes that:

  • You have access to the MySQL database specified in the configuration file
  • You have downloaded the baseline set of XML files from NLM and specified their path in the configuration file

With both in place, run the JAR:

java -jar pubmed-xml2rdbms-1.0-SNAPSHOT-jar-with-dependencies.jar
