fix: optimize database queries #7191

Merged
merged 18 commits into from
Jun 17, 2024

Conversation

@Bento007 Bento007 commented Jun 14, 2024

Reason for Change

Changes

  • Modify persistence layer calls to avoid database calls within a for loop. The queries are instead batched together into a single call. In the future we may need to split the calls into batches to avoid waiting for large queries to finish.
  • Change print statements to logging statements.
  • Add timing metrics to responses to diagnose slow database calls.
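The first change above (one batched query instead of one query per loop iteration, with optional splitting into batches) can be sketched roughly as follows. This is only an illustration using the standard-library `sqlite3` module, not the project's persistence layer; the `dataset` table, the helper names, and the batch size of 500 are all assumptions made for the example.

```python
import sqlite3
from typing import Iterator, List, Sequence, Tuple


def chunked(ids: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_one_by_one(conn: sqlite3.Connection, ids: Sequence[str]) -> List[Tuple[str, str]]:
    """One query per id: the query-in-a-loop pattern."""
    rows: List[Tuple[str, str]] = []
    for dataset_id in ids:
        cursor = conn.execute("SELECT id, name FROM dataset WHERE id = ?", (dataset_id,))
        rows.extend(cursor.fetchall())
    return rows


def fetch_batched(conn: sqlite3.Connection, ids: Sequence[str], batch_size: int = 500) -> List[Tuple[str, str]]:
    """One IN (...) query per batch; a single query when len(ids) <= batch_size."""
    rows: List[Tuple[str, str]] = []
    for batch in chunked(ids, batch_size):
        placeholders = ",".join("?" for _ in batch)
        query = f"SELECT id, name FROM dataset WHERE id IN ({placeholders})"
        rows.extend(conn.execute(query, tuple(batch)).fetchall())
    return rows


# Hypothetical in-memory table, used only to exercise the two helpers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?)",
    [(f"d{i}", f"name-{i}") for i in range(10)],
)
wanted = ["d1", "d3", "d7"]
```

Both helpers return the same rows; the batched version trades many round trips for a few larger queries, and the batch size is the knob that load testing would help tune.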

Testing steps

  • All unit tests pass.
  • verified timing metrics show up in responses

Note

  • These changes would benefit from load testing, but we currently lack the tooling. Load testing would help determine when, or if, we should start batching the requests to the database.


codecov bot commented Jun 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.74%. Comparing base (ed8cd5e) to head (9f8934e).

Current head 9f8934e differs from pull request most recent head 82d4b03

Please upload reports for the commit 82d4b03 to get more accurate results.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7191   +/-   ##
=======================================
  Coverage   92.73%   92.74%           
=======================================
  Files         191      191           
  Lines       16066    16083   +17     
=======================================
+ Hits        14899    14916   +17     
  Misses       1167     1167           
Flag Coverage Δ
unittests 92.74% <100.00%> (+<0.01%) ⬆️


        dataset_versions.append(dv)
    else:
        dataset_version_ids.append(dv)
dataset_versions.extend(business_logic.database_provider.get_dataset_versions_by_id(dataset_version_ids))

We should add an empty-list check to avoid opening a session and querying for nothing when all items are already DatasetVersion instances.
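A minimal sketch of the guard suggested here, assuming the surrounding loop separates `DatasetVersion` instances from raw IDs. `DatasetVersion`, `CountingProvider`, and `resolve_dataset_versions` below are simplified, hypothetical stand-ins written for this example, not the project's classes.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class DatasetVersion:
    """Simplified stand-in for the real entity (assumption)."""
    id: str


class CountingProvider:
    """Hypothetical database provider that counts how often it is queried."""

    def __init__(self) -> None:
        self.query_count = 0

    def get_dataset_versions_by_id(self, ids: List[str]) -> List[DatasetVersion]:
        self.query_count += 1
        return [DatasetVersion(id=i) for i in ids]


def resolve_dataset_versions(
    items: List[Union[DatasetVersion, str]], provider: CountingProvider
) -> List[DatasetVersion]:
    dataset_versions: List[DatasetVersion] = []
    dataset_version_ids: List[str] = []
    for dv in items:
        if isinstance(dv, DatasetVersion):
            dataset_versions.append(dv)
        else:
            dataset_version_ids.append(dv)
    # Guard: skip the database round trip when there is nothing to look up.
    if dataset_version_ids:
        dataset_versions.extend(provider.get_dataset_versions_by_id(dataset_version_ids))
    return dataset_versions
```

With the guard in place, a list made up entirely of `DatasetVersion` instances never reaches the provider.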

    update(DatasetTable).where(DatasetTable.id.in_(dataset_ids_to_tombstone)).values(tombstone=True)
)
session.execute(tombstone_dataset_statement)
dataset_all_version_ids = (

The tests are passing, so maybe I'm misunderstanding, but this query looks like it returns List[DatasetVersion] rather than List[DatasetVersionId], no? And dataset_version_ids_to_delete_from_s3 expects IDs.

@Bento007 Bento007 Jun 14, 2024

You're totally right, but that is how the original code was written. I changed it to query for DatasetVersion.id and it works. Not sure how it managed to work before, because it was returning a list[row], not a list[string].
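The row-versus-string distinction can be illustrated with the standard-library `sqlite3` module. This is an analogy rather than the project's SQLAlchemy code, and the table and column names are assumptions. Selecting a single column still yields one-element rows, which have to be unwrapped to get plain ID strings. In SQLAlchemy, calling `.scalars()` on the result of a single-column select does a similar unwrapping.

```python
import sqlite3

# Hypothetical table, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset_version (id TEXT PRIMARY KEY, dataset_id TEXT)")
conn.executemany(
    "INSERT INTO dataset_version VALUES (?, ?)",
    [("v1", "d1"), ("v2", "d1"), ("v3", "d2")],
)

# A single-column SELECT still returns rows (here: 1-tuples), not bare strings.
rows = conn.execute(
    "SELECT id FROM dataset_version WHERE dataset_id = ? ORDER BY id", ("d1",)
).fetchall()

# Unwrap each row to get the plain ID strings that downstream code expects.
version_ids = [row[0] for row in rows]
```

Code that only iterates or counts the result would behave the same with either shape, which is one plausible reason a mismatch like this can go unnoticed by tests.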

@nayib-jose-gloria nayib-jose-gloria left a comment

Test in rdev and ensure no regressions, but looks good to me!

@Bento007

I added some timing metrics to the queries, and the two slowest queries are in get_all_mapped_collection_versions. Fetching all of the canonical collections takes about seconds at worst. Fetching all current collection versions takes 20 seconds at worst.

with ServerTiming.time("get-cc-all"), log_time_taken("get-cc-all"):
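The implementation of `log_time_taken` is not shown in this conversation. A timing context manager of this kind might look roughly like the sketch below, which is an assumption and not the project's code. The names `log_time_taken_sketch` and `timings_ms` are hypothetical, and `ServerTiming` is not reproduced here.

```python
import logging
import time
from contextlib import contextmanager
from typing import Dict, Iterator

logger = logging.getLogger("timing")

# Hypothetical in-process store so callers (or tests) can read the last timing per label.
timings_ms: Dict[str, float] = {}


@contextmanager
def log_time_taken_sketch(label: str) -> Iterator[None]:
    """Log how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        timings_ms[label] = elapsed_ms
        logger.info("%s took %.1f ms", label, elapsed_ms)


# Usage: wrap the slow call; the label mirrors the one in the line above.
with log_time_taken_sketch("get-cc-all"):
    time.sleep(0.01)  # stand-in for the database query
```

Using `try`/`finally` means the duration is still recorded when the wrapped query fails, which is usually when the timing is most interesting.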

I also did some performance testing and found that the endpoint took over 60 seconds to return on average.

@Bento007 Bento007 merged commit 5afa539 into main Jun 17, 2024
21 of 22 checks passed
@Bento007 Bento007 deleted the tsmith/db-query-optimize branch June 17, 2024 21:56