Skip to content
This repository has been archived by the owner on Jan 18, 2020. It is now read-only.

Performance problem due to redundant queries #387

Open
murphyke opened this issue Oct 10, 2014 · 1 comment
Open

Performance problem due to redundant queries #387

murphyke opened this issue Oct 10, 2014 · 1 comment

Comments

@murphyke
Copy link
Member

A typical Varify query in our installation generates between five and six thousand SQL queries to show the initial 20 HTML results.

Varify could benefit greatly from some caching or elimination of redundant database lookups. (If there is caching in place that doesn't seem to be working, could it possibly be due to memcached's default 1MB object limit or a too-small MAX_ENTRIES value in the Django CACHES setting?)

The main problem is highly redundant pmid queries. These add up to about the same as the big money queries (the top 3 slowest) and in terms of overall processing time may be significantly more costly.

I doubt the details of the query matter, but my test query was as follows:

Sample is one of the following:
P-PSeq_0038-P-A from project U01 (CPF1308003)
P-Pseq_0043-P-A from project U01 (CPF1309006)
P-Pseq_0044-P-A from project U01 (CPF1309006)
P-Pseq_0045-P-A from project U01 (CPF1309006)
1007 from project U01 (NEMRCD)
1007p1 from project U01 (NEMRCD)
P-PSeq0065-P-A from project U01 (Pseq_batch17)
Gene Symbol is BRCA1
Effect is Non-Synonymous Coding

SQL was collected for the first page of HTML results.

The biggest offenders were the pmid queries:

1800 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 7061
960 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 7888
360 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 7887
240 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 7374
120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 8044
120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 7609
120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 6878
120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 12332
120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 11075
120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 11074
120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 11070
120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 10401
120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "phenotype_articles" ON ("pubmed"."pmid" = "phenotype_articles"."pubmed_id") WHERE "phenotype_articles"."phenotype_id" = 10397
120 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "gene_pubmed" ON ("pubmed"."pmid" = "gene_pubmed"."pubmed_id") WHERE "gene_pubmed"."gene_id" = 2071
6 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "variant_pubmed" ON ("pubmed"."pmid" = "variant_pubmed"."pubmed_id") WHERE "variant_pubmed"."variant_id" = 254416
5 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "variant_pubmed" ON ("pubmed"."pmid" = "variant_pubmed"."pubmed_id") WHERE "variant_pubmed"."variant_id" = 254415
5 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "variant_pubmed" ON ("pubmed"."pmid" = "variant_pubmed"."pubmed_id") WHERE "variant_pubmed"."variant_id" = 254414
3 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "variant_pubmed" ON ("pubmed"."pmid" = "variant_pubmed"."pubmed_id") WHERE "variant_pubmed"."variant_id" = 1108096
1 SELECT "pubmed"."pmid" FROM "pubmed" INNER JOIN "variant_pubmed" ON ("pubmed"."pmid" = "variant_pubmed"."pubmed_id") WHERE "variant_pubmed"."variant_id" = 254403

There are also a number of queries that are each executed 120 times, e.g.:

20 SELECT "transcript"."id", "transcript"."refseq_id", "transcript"."strand", "transcript"."start", "transcript"."end", "transcript"."coding_start", "transcript"."coding_end", "transcript"."coding_start_status", "transcript"."coding_end_status", "transcript"."exon_count", "transcript"."gene_id" FROM "transcript" WHERE "transcript"."id" = 75417
20 SELECT "transcript"."id", "transcript"."refseq_id", "transcript"."strand", "transcript"."start", "transcript"."end", "transcript"."coding_start", "transcript"."coding_end", "transcript"."coding_start_status", "transcript"."coding_end_status", "transcript"."exon_count", "transcript"."gene_id" FROM "transcript" WHERE "transcript"."id" = 75416
20 SELECT "transcript"."id", "transcript"."refseq_id", "transcript"."strand", "transcript"."start", "transcript"."end", "transcript"."coding_start", "transcript"."coding_end", "transcript"."coding_start_status", "transcript"."coding_end_status", "transcript"."exon_count", "transcript"."gene_id" FROM "transcript" WHERE "transcript"."id" = 75415
20 SELECT "transcript"."id", "transcript"."refseq_id", "transcript"."strand", "transcript"."start", "transcript"."end", "transcript"."coding_start", "transcript"."coding_end", "transcript"."coding_start_status", "transcript"."coding_end_status", "transcript"."exon_count", "transcript"."gene_id" FROM "transcript" WHERE "transcript"."id" = 75414
20 SELECT "transcript"."id", "transcript"."refseq_id", "transcript"."strand", "transcript"."start", "transcript"."end", "transcript"."coding_start", "transcript"."coding_end", "transcript"."coding_start_status", "transcript"."coding_end_status", "transcript"."exon_count", "transcript"."gene_id" FROM "transcript" WHERE "transcript"."id" = 75413
20 SELECT "transcript"."id", "transcript"."refseq_id", "transcript"."strand", "transcript"."start", "transcript"."end", "transcript"."coding_start", "transcript"."coding_end", "transcript"."coding_start_status", "transcript"."coding_end_status", "transcript"."exon_count", "transcript"."gene_id" FROM "transcript" WHERE "transcript"."id" = 75412

These are similar redundant queries for effect, gene, phenotype, pmid, and effect_impact.

I don't really understand the redundancy patterns. E.g. for gene, we have:

120 SELECT "gene"."id", "gene"."chr_id", "gene"."symbol", "gene"."name", "gene"."hgnc_id" FROM "gene" WHERE "gene"."id" = 2071

whereas for phenotype we have:

120 SELECT "phenotype"."id", "phenotype"."term", "phenotype"."description", "phenotype"."hpo_id" FROM "phenotype" INNER JOIN "gene_phenotype" ON ("phenotype"."id" = "gene_phenotype"."phenotype_id") WHERE "gene_phenotype"."gene_id" = 2071
6 SELECT "phenotype"."id", "phenotype"."term", "phenotype"."description", "phenotype"."hpo_id" FROM "phenotype" WHERE "phenotype"."id" = 11075
5 SELECT "phenotype"."id", "phenotype"."term", "phenotype"."description", "phenotype"."hpo_id" FROM "phenotype" WHERE "phenotype"."id" = 11074
5 SELECT "phenotype"."id", "phenotype"."term", "phenotype"."description", "phenotype"."hpo_id" FROM "phenotype" WHERE "phenotype"."id" = 11070
3 SELECT "phenotype"."id", "phenotype"."term", "phenotype"."description", "phenotype"."hpo_id" FROM "phenotype" WHERE "phenotype"."id" = 12332
1 SELECT "phenotype"."id", "phenotype"."term", "phenotype"."description", "phenotype"."hpo_id" FROM "phenotype" WHERE "phenotype"."id" = 7887
1 SELECT "phenotype"."id", "phenotype"."term", "phenotype"."description", "phenotype"."hpo_id" FROM "phenotype" WHERE "phenotype"."id" = 7061

@murphyke
Copy link
Member Author

Expect a roughly six-fold performance improvement for this query if all SQL redundancy is removed. The time for psql to execute all the SQL queries behind this Varify query over the local network is 42.8 sec. If all of the redundant queries are removed via | sort | uniq, the time taken is 7.2 sec. The database server cache was warm in both cases. The real-world effect including Python and caching overhead remains to be seen.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants