An R package to upload datasets to BigQuery for public sharing

iainmwallace/DataDepository

DataDepository

An R package to upload datasets to BigQuery for public sharing, so that they can be easily integrated with other public datasets.

Why?

Before an analysis can begin, several datasets often have to be downloaded, cleaned, and standardized.

A central repository that stores each dataset after standardization would reduce the time required for the next analysis that uses the same source data, because the dataset would no longer need to be downloaded and parsed again.

What?

BigQuery is a serverless database, and an attractive way to store and share datasets of general interest, for several reasons:

  • Very fast. Joining two PubChem files (100 million chemical structures and 70 million names) took less than 3 minutes, without having to define an index.
  • Very cheap. There is no fee for the server it is hosted on; instead there is a small fee for storing data (10 GB free, then $0.02 per additional GB per month, i.e. about $20 per month for 1 TB) and a fee for querying data (1 TB free, then $5 per additional TB).
  • It has a UI from which data can be stored or queried.
  • It has a REST API (and many clients).
  • Metadata can be used to describe each dataset.
  • Every dataset can be referenced with a unique URL.
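
To make the pricing arithmetic concrete, here is a minimal sketch in R. The function name `estimate_monthly_cost` and its arguments are hypothetical and not part of this package. The default rates are simply the figures quoted in the list above, and should be treated as illustrative because pricing may change.

```r
# Hypothetical helper: estimates a monthly bill from the rates quoted above.
# The defaults are the figures from this README, not authoritative pricing.
estimate_monthly_cost <- function(stored_gb, queried_tb,
                                  free_storage_gb = 10,
                                  storage_usd_per_gb = 0.02,
                                  free_query_tb = 1,
                                  query_usd_per_tb = 5) {
  # Only usage above the free allowance is billed.
  storage <- max(stored_gb - free_storage_gb, 0) * storage_usd_per_gb
  queries <- max(queried_tb - free_query_tb, 0) * query_usd_per_tb
  c(storage = storage, queries = queries, total = storage + queries)
}

# Example: 1000 GB stored and 3 TB queried in a month.
estimate_monthly_cost(stored_gb = 1000, queried_tb = 3)
```

Under these assumed rates, storing 1000 GB costs a little under $20 per month, which matches the rough figure given above, and the 2 TB of queries beyond the free allowance add $10.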

Examples

  • Compound names from PubChem mapped onto InChIKeys
  • Compound activities from ChEMBL enhanced with InChIKeys
  • Count of compounds appearing in databases based on UniChem
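
As a rough illustration of how a standardized table like those listed above could be uploaded and then queried from R, the sketch below uses the `bigrquery` client package. This is an assumption: this package's own functions are not shown here and may work differently. The project, dataset, and table names (`my-project`, `shared_data`, `compound_names`) and both function names are hypothetical placeholders.

```r
# Sketch only: relies on the bigrquery package (an assumption), called via
# its namespace so the functions can be defined before it is loaded.
# All project, dataset and table names below are hypothetical placeholders.

# Build a row-count query for a fully qualified BigQuery table.
count_query <- function(project, dataset, table) {
  sprintf("SELECT COUNT(*) AS n FROM `%s.%s.%s`", project, dataset, table)
}

# Upload a data frame, replacing any existing table, then count its rows.
upload_and_count <- function(df,
                             project = "my-project",
                             dataset = "shared_data",
                             table = "compound_names") {
  ds <- bigrquery::bq_dataset(project, dataset)
  if (!bigrquery::bq_dataset_exists(ds)) {
    bigrquery::bq_dataset_create(ds)
  }

  tbl <- bigrquery::bq_table(project, dataset, table)
  bigrquery::bq_table_upload(tbl, values = df,
                             write_disposition = "WRITE_TRUNCATE")

  # Run the query (billed to `project`) and download the result.
  result <- bigrquery::bq_project_query(project, count_query(project, dataset, table))
  bigrquery::bq_table_download(result)
}
```

Calling `upload_and_count(my_data_frame)` needs Google Cloud credentials and a billing-enabled project. Making the dataset public is a separate access-control step that this sketch does not cover.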

Shiny app to query BigQuery
