This repository has been archived by the owner on Jan 18, 2020. It is now read-only.

Removing load off the database during VCF processing #7

Open
naegelyd opened this issue Aug 13, 2014 · 0 comments


@naegelyd and I discussed refactoring the load pipeline so that it does not touch the database while VCF files are being processed. The high-level approach is as follows:

  1. Populate a cache of required data for processing from the database
    • Variants are keyed by chromosome, start, and end position, or by the MD5 hash that is already stored in the table. This depends on Variant, Chromosome, and VariantType.
      • As a side note, the MD5 is only used during pipeline processing, so if the hashes are generated dynamically during this phase, the column could be dropped from the table altogether. This would reduce the size of the variant table by a third.
    • Variant effects, which depend on Variant, Chromosome, Effect, FunctionalClass, Gene, and Transcript
      • The current behavior is to only assess variant effects when the variant is new; merges/updates to existing effects are not handled. This would be a good time to add support for that and to log every occurrence of an effect changing over time as new snpEff versions are released.
    • Samples are keyed by project, batch, name, version
    • Sample results are insert-only, so there is no need to cache them
  2. Store max surrogate key values which act as the starting point for incrementing new identifiers.
    • This enables the loader to increment its own keys, ensuring references are properly set up outside of the database
    • Memcached should be used to ensure atomicity of reads and writes of these surrogate keys during parallel processing.
  3. Writing out data for load
    • Generate flat TSV files to be loaded into the database using Postgres' COPY command (fastest)
      • Generic format to easily make data assertions using all kinds of tools, visual inspection, etc.
    • Write to a temporary, empty Postgres database that is then synced with the targets
      • Full power of SQL to make assertions and compute stats to be applied as a diff to the existing ones
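A minimal sketch of the variant cache from step 1, in Python. The key fields (chromosome, position, ref, alt) and the row shape are assumptions; the real pipeline may hash additional columns such as the variant type:

```python
import hashlib

def variant_key(chrom, pos, ref, alt):
    # MD5 over the fields assumed to uniquely identify a variant;
    # computed on the fly, so the column need not live in the table.
    raw = '|'.join(str(f) for f in (chrom, pos, ref, alt))
    return hashlib.md5(raw.encode('utf-8')).hexdigest()

def build_variant_cache(rows):
    # `rows` stands in for the result of one bulk query against the
    # variant table; the cache maps MD5 key -> surrogate primary key,
    # so VCF processing never has to query the database per variant.
    return {
        variant_key(r['chrom'], r['pos'], r['ref'], r['alt']): r['id']
        for r in rows
    }
```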
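The surrogate-key scheme in step 2 can be sketched as a counter seeded from the table's current max id. Here a lock-guarded in-process counter stands in for Memcached, whose `incr` command provides the same atomicity across parallel workers:

```python
import threading

class KeyAllocator:
    # Seed with e.g. SELECT max(id) FROM variant before processing starts.
    def __init__(self, start):
        self._next = start
        self._lock = threading.Lock()

    def allocate(self, n=1):
        # Reserve a contiguous block of n new ids; return the first.
        # In production this would be a single atomic Memcached incr,
        # making the reservation safe across parallel loader processes.
        with self._lock:
            first = self._next + 1
            self._next += n
            return first
```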
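For the TSV option in step 3, rows can be serialized in the text format COPY expects, with `\N` as the null marker. A sketch with hypothetical column names:

```python
import io

def rows_to_tsv(rows, columns):
    # Serialize rows to TSV suitable for Postgres COPY ... FROM,
    # writing \N (COPY's default null marker) for missing values.
    buf = io.StringIO()
    for row in rows:
        fields = ['\\N' if row.get(c) is None else str(row[c])
                  for c in columns]
        buf.write('\t'.join(fields) + '\n')
    return buf.getvalue()
```

The resulting buffer can then be streamed into the target table, e.g. with psycopg2's `cursor.copy_from`, whose defaults (tab separator, `\N` null) match this output.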

Originally reported as chop-dbhi/varify#146
