ArchiveDiffer vs covidcast API blog #822

Open
wants to merge 1 commit into
base: dev

Conversation

minhkhul
Contributor

No description provided.

@minhkhul minhkhul requested a review from krivard June 26, 2023 14:45
netlify bot commented Jun 26, 2023

Deploy Preview for cmu-delphi-main ready!

Name Link
🔨 Latest commit 36ded14
🔍 Latest deploy log https://app.netlify.com/sites/cmu-delphi-main/deploys/6499a47df9cbc000084cfa50
😎 Deploy Preview https://deploy-preview-822--cmu-delphi-main.netlify.app

Contributor

@krivard krivard left a comment

First copy-edit pass (your English is perfectly fine, I'm just making it more idiomatic)

Limitations and Next Steps will need another pass, but they're always tricky

heroImage: ensemble-hero.jpg
heroImageThumb: ensemble-thumb.jpg
summary: |
Recently, the Delphi team discovered some items in USAFacts where the ArchiveDiffer cache (which determines which rows of each day's output actually get added to the database) has lost sync with the API.

Suggested change
Recently, the Delphi team discovered some items in USAFacts where the ArchiveDiffer cache (which determines which rows of each day's output actually get added to the database) has lost sync with the API.
Recently, the Delphi team discovered some items in USAFacts where the ArchiveDiffer cache (which ensures only new rows of data get added to the database) had lost sync with the API.

</blockquote>

## The Widget
The widget works as following:

could also be "in the following way"

Suggested change
The widget works as following:
The widget works as follows:

- Pull each .csv file from ArchiveDiffer cache currently stored in an AWS S3 bucket.

Suggested change
- Pull each .csv file from ArchiveDiffer cache currently stored in an AWS S3 bucket.
- Pull each .csv file from the ArchiveDiffer cache currently stored in an AWS S3 bucket.

- Construct the parameters for an API call based on the labels of our .csv on ArchiveDiffer.

Suggested change
- Construct the parameters for an API call based on the labels of our .csv on ArchiveDiffer.
- Construct the parameters for an API call based on the labels of the .csv from ArchiveDiffer.
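
As a rough illustration of this step (not the widget's actual code), the sketch below derives API parameters from a cache file name. It assumes file names of the form `YYYYMMDD_<geo_type>_<signal>.csv`; the function name `params_from_filename`, the `source` argument, and the returned keys are hypothetical and only loosely modelled on the covidcast endpoint's parameters.

```python
from datetime import datetime


def params_from_filename(filename: str, source: str) -> dict:
    """Build query parameters from a cache file name.

    Assumed (not confirmed) naming scheme: YYYYMMDD_<geo_type>_<signal>.csv
    """
    basename = filename.rsplit("/", 1)[-1]
    if not basename.endswith(".csv"):
        raise ValueError(f"not a .csv file: {filename}")
    stem = basename[: -len(".csv")]

    # Split only twice: the signal name may itself contain underscores.
    day, geo_type, signal = stem.split("_", 2)

    # Raises ValueError if the first label is not a valid date.
    datetime.strptime(day, "%Y%m%d")

    return {
        "data_source": source,
        "signal": signal,
        "time_type": "day",
        "geo_type": geo_type,
        "time_values": day,
        "geo_value": "*",  # request every location in the file
    }
```

For example, `params_from_filename("20200419_state_confirmed_cumulative_num.csv", "usa-facts")` would yield a `geo_type` of `state` and a `signal` of `confirmed_cumulative_num`. Files that follow a different naming scheme would need their own parsing rule.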



## The Result
There exists significant mismatches between data obtained from the ArchiveDiffer cache and covidcast API.

Suggested change
There exists significant mismatches between data obtained from the ArchiveDiffer cache and covidcast API.
There exist significant mismatches between data obtained from the ArchiveDiffer cache and the covidcast API.


Full-file mismatch happens when the API returns no row at all when looking for data from a file on ArchiveDiffer. Fortunately, USAFacts is the only source with significant number of files where the full file is missing.

| Sources | Number of full data missing from API |

Suggested change
| Sources | Number of full data missing from API |
| Sources | Number of full files missing from API |

| dsew-cpr | 2 |
| chng | 1 |

Analysis by date shows that time in mismatched files are pretty evenly distributed. Overall, there is no particular period of time with less/more mismatch than others.

Suggested change
Analysis by date shows that time in mismatched files are pretty evenly distributed. Overall, there is no particular period of time with less/more mismatch than others.
Analysis by date shows that mismatched files are pretty evenly distributed in time. Overall, there is no particular period with fewer or more mismatches than others.

![](/blog/2023-06-26-mismatch/plot.png "Number of files on ArchiveDiffer with mismatch rows")

Suggested change
![](/blog/2023-06-26-mismatch/plot.png "Number of files on ArchiveDiffer with mismatch rows")
![](/blog/2023-06-26-mismatch/plot.png "Number of files from ArchiveDiffer with mismatched rows")

## Widget Limitations
This widget cannot tell if a mismatched row exists in S3 but not API or vice versa. It can only show the number of rows that should exists in both storage places but in reality, only exists in one of them.

This will make more sense if we explicitly define what a mismatched row means in the above analysis. Maybe as part of The Widget section?
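
One possible definition, as a sketch only: treat a row as mismatched if its key appears on just one side, or if it appears on both sides with different values. The code below assumes each file has already been reduced to a mapping from a location key to a `(val, se, sample_size)` tuple; the function name `compare_rows` and that row layout are hypothetical, not taken from the widget.

```python
def compare_rows(cache_rows: dict, api_rows: dict) -> dict:
    """Classify mismatches between one cache file and the matching API response.

    Both arguments map a location key to a (val, se, sample_size) tuple.
    """
    cache_keys = set(cache_rows)
    api_keys = set(api_rows)
    shared = cache_keys & api_keys

    return {
        # Directional differences: these tell us which side is missing the row.
        "only_in_cache": sorted(cache_keys - api_keys),
        "only_in_api": sorted(api_keys - cache_keys),
        # Present on both sides, but the stored values disagree.
        "value_differs": sorted(k for k in shared if cache_rows[k] != api_rows[k]),
    }
```

A count based on the symmetric difference alone would be `len(only_in_cache) + len(only_in_api)`, which is why it cannot say which side is missing a row. Keeping the two directional sets separate removes that limitation. Exact tuple equality is used here for brevity; real data may need a floating-point tolerance.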


Other limitations:

For simplicity and expediency, this widget checks only the most recent version of the data, and ignores all data versioning information. Data versioning support in the Epidata API exists because it matters for modeling purposes _when_ information became available (this is the `issue` parameter in the covidcast endpoint). When we make repairs, Epidata permits us to correct old issues, but S3 does not permit us to correct old uploads. This means that patching typically only touches the API, and leaves ArchiveDiffer data alone even if the data being patched is the most recent version available. This widget does not check the timestamp of S3 uploads for ArchiveDiffer data or the `issue` column of API data. This means that it includes differences due to patches, which we might wish to ignore.
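
A hedged sketch of how patch-related differences might be separated out: if the API row's `issue` is later than the date the cache file was uploaded to S3, the difference is plausibly the result of a patch. That rule is an assumption for illustration, and the function name `split_by_patch` and the shape of the mismatch records are hypothetical.

```python
from datetime import date


def split_by_patch(mismatches: list, cache_uploaded: date) -> tuple:
    """Split mismatches into likely-patched and unexplained groups.

    Each mismatch is a dict with at least an "issue" key holding the
    date of the API row's most recent issue.
    """
    likely_patched = []
    unexplained = []
    for row in mismatches:
        # Assumption: an issue newer than the S3 upload means the API
        # was corrected after the cache file was written.
        if row["issue"] > cache_uploaded:
            likely_patched.append(row)
        else:
            unexplained.append(row)
    return likely_patched, unexplained
```

Only the `unexplained` group would then count toward the loss-of-sync problem, while the `likely_patched` group could be reviewed separately.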

## Next steps
Assuming that the ArchiveDiffer is always correct in comparison to the API content, we should start patching data in the API from the least accurate indicators, indicator-combination and chng.

Suggested change
Assuming that the ArchiveDiffer is always correct in comparison to the API content, we should start patching data in the API from the least accurate indicators, indicator-combination and chng.
We should construct one or more patches which bring the API data and ArchiveDiffer data back into sync. Most changes will modify the API data, since we expect the ArchiveDiffer data to generally be more accurate than the API data, which lies downstream of ArchiveDiffer. We should however identify any differences which are due to pre-existing patches, and ensure that ArchiveDiffer really does represent the most recent figures on all data.
Priority should be given to the least-accurate and most-used indicators: jhu, chng, and indicator-combination.
