This repository has been archived by the owner on Jan 18, 2020. It is now read-only.

Removing load off the database during VCF processing #7

Open
naegelyd opened this issue Aug 13, 2014 · 0 comments


@naegelyd and I discussed refactoring the load pipeline so that it does not touch the database while VCF files are being processed. The high-level approach is as follows:

  1. Populate a cache of required data for processing from the database
    • Variants are keyed by chromosome, start, and end position, or by the MD5 hash that is already stored in the table. This depends on Variant, Chromosome, and VariantType.
      • As a side note, the MD5 is only used during pipeline processing, so if the hashes are generated dynamically during this phase, the column could be dropped from the table altogether. This would reduce the size of the variant table by a third.
    • Variant effects, which depend on Variant, Chromosome, Effect, FunctionalClass, Gene, and Transcript
      • The current behavior is to only assess variant effects when the variant is new; merges/updates to existing effects are not handled. This would be a good time to add support for that and to log every occurrence of an effect changing over time as new snpEff versions are released.
    • Samples are keyed by project, batch, name, version
    • Sample results are insert-only, so there is no need to cache them
  2. Store max surrogate key values which act as the starting point for incrementing new identifiers.
    • This enables the loader to increment its own keys, ensuring references are properly set up outside of the database
    • Memcached should be used to ensure atomicity of reads and writes of these surrogate keys during parallel processing.
  3. Writing out data for load
    • Generate flat TSV files to be loaded into the database using Postgres' COPY command (fastest)
      • Generic format to easily make data assertions using all kinds of tools, visual inspection, etc.
    • Write to a temporary, empty Postgres database that is then synced with the targets
      • Full power of SQL to make assertions and compute stats to be applied as a diff to the existing ones
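A minimal sketch of the variant cache from step 1, in Python. The key fields (chromosome, position, ref, alt) and the row shape are assumptions; the real pipeline may hash additional columns such as the variant type:

```python
import hashlib

def variant_key(chrom, pos, ref, alt):
    # MD5 over the fields assumed to uniquely identify a variant;
    # computed on the fly, so the column need not live in the table.
    raw = '|'.join(str(f) for f in (chrom, pos, ref, alt))
    return hashlib.md5(raw.encode('utf-8')).hexdigest()

def build_variant_cache(rows):
    # `rows` stands in for the result of one bulk query against the
    # variant table; the cache maps MD5 key -> surrogate primary key,
    # so VCF processing never has to query the database per variant.
    return {
        variant_key(r['chrom'], r['pos'], r['ref'], r['alt']): r['id']
        for r in rows
    }
```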
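The surrogate-key scheme in step 2 can be sketched as a counter seeded from the table's current max id. Here a lock-guarded in-process counter stands in for Memcached, whose `incr` command provides the same atomicity across parallel workers:

```python
import threading

class KeyAllocator:
    # Seed with e.g. SELECT max(id) FROM variant before processing starts.
    def __init__(self, start):
        self._next = start
        self._lock = threading.Lock()

    def allocate(self, n=1):
        # Reserve a contiguous block of n new ids; return the first.
        # In production this would be a single atomic Memcached incr,
        # making the reservation safe across parallel loader processes.
        with self._lock:
            first = self._next + 1
            self._next += n
            return first
```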
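For the TSV option in step 3, rows can be serialized in the text format COPY expects, with `\N` as the null marker. A sketch with hypothetical column names:

```python
import io

def rows_to_tsv(rows, columns):
    # Serialize rows to TSV suitable for Postgres COPY ... FROM,
    # writing \N (COPY's default null marker) for missing values.
    buf = io.StringIO()
    for row in rows:
        fields = ['\\N' if row.get(c) is None else str(row[c])
                  for c in columns]
        buf.write('\t'.join(fields) + '\n')
    return buf.getvalue()
```

The resulting buffer can then be streamed into the target table, e.g. with psycopg2's `cursor.copy_from`, whose defaults (tab separator, `\N` null) match this output.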

Originally reported as chop-dbhi/varify#146
