This repository has been archived by the owner on Jan 18, 2020. It is now read-only.
@naegelyd and I discussed refactoring the load pipeline to not use the database during processing of VCF files. The high-level approach is as follows:
- **Populate a cache of required data for processing from the database.**
  - Variants are keyed by chromosome, start, and end position, or by the MD5 that is already stored in the table. This depends on `Variant`, `Chromosome`, and `VariantType`.
    - As a side note, the MD5 is only used for pipeline processing, so it could be dropped from the table altogether if it were generated dynamically during this phase. This would reduce the size of the `variant` table by a third.
  - Variant effects, which depend on `Variant`, `Chromosome`, `Effect`, `FunctionalClass`, `Gene`, and `Transcript`.
    - The current behavior is to assess variant effects only if the variant is new; merges and updates to existing effects are not handled. This would be a good time to add support for that, and to log each occurrence of an effect changing over time as new snpEff versions are released.
  - Samples are keyed by project, batch, name, and version.
  - Sample results are insert-only, so there is no need to cache them.
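The cache-population step might be sketched as follows. This is a rough illustration rather than varify's actual schema: the table and column names (`variant`, `chromosome`, `sample`) and the exact composition of the MD5 key are assumptions.

```python
import hashlib

def variant_md5(chrom, start, end, ref, alt):
    """Derive the variant MD5 on the fly, so it need not be stored."""
    key = '|'.join([chrom, str(start), str(end), ref, alt])
    return hashlib.md5(key.encode('utf-8')).hexdigest()

def build_caches(cursor):
    """Load all lookup data once, up front, so VCF processing never
    touches the database. `cursor` is any DB-API cursor."""
    # Variants keyed by MD5 -> primary key.
    variants = {}
    cursor.execute(
        'SELECT v.id, c.value, v.start, v."end", v.ref, v.alt '
        'FROM variant v JOIN chromosome c ON v.chromosome_id = c.id')
    for pk, chrom, start, end, ref, alt in cursor.fetchall():
        variants[variant_md5(chrom, start, end, ref, alt)] = pk

    # Samples keyed by (project, batch, name, version) -> primary key.
    cursor.execute('SELECT id, project_id, batch_id, name, version FROM sample')
    samples = {(p, b, n, v): pk for pk, p, b, n, v in cursor.fetchall()}

    return variants, samples
```

Generating the MD5 dynamically like this is also what would allow the stored column to be dropped, per the side note above.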
- **Store max surrogate key values**, which act as the starting point for incrementing new identifiers.
  - This enables the loader to increment its own keys, ensuring references are properly set up outside of the database.
  - Memcached should be used to ensure atomicity during reads and writes of these surrogate keys during parallel processing.
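As a sketch of the key-allocation scheme: Memcached's `incr` command is atomic, so parallel workers can share one counter per table. The class below is an in-process stand-in that mirrors those semantics for illustration only; in production the counters would live in Memcached, each seeded from the table's current `max(id)` before the workers start.

```python
import threading

class KeyAllocator:
    """In-process stand-in for the Memcached-backed counters.
    incr() mirrors Memcached's atomic `incr` command."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}

    def seed(self, table, current_max):
        """Record the current max surrogate key for `table`."""
        with self._lock:
            self._counters[table] = current_max

    def incr(self, table):
        """Atomically reserve and return the next key for `table`."""
        with self._lock:
            self._counters[table] += 1
            return self._counters[table]
```

Because every key is reserved through a single atomic increment, two workers can never hand out the same identifier, which is what lets references be wired up entirely outside the database.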
- **Write out data for load.** Two options:
  - Generate flat TSV files to be loaded into the database using Postgres' `COPY` command (fastest).
    - A generic format makes it easy to make data assertions using all kinds of tools, visual inspection, etc.
  - Write to a temporary, empty Postgres database that is then synced with the targets.
    - This gives the full power of SQL to make assertions and to compute stats to be applied as a diff against the existing ones.
Originally reported as chop-dbhi/varify#146