ArchiveDiffer vs covidcast API blog #822
base: dev
✅ Deploy Preview for cmu-delphi-main ready!
First copy-edit pass (your English is perfectly fine; I'm just making it more idiomatic).
Limitations and Next Steps will need another pass, but they're always tricky.
heroImage: ensemble-hero.jpg
heroImageThumb: ensemble-thumb.jpg
summary: |
Recently, the Delphi team discovered some items in USAFacts where the ArchiveDiffer cache (which determines which rows of each day's output actually get added to the database) has lost sync with the API.
Recently, the Delphi team discovered some items in USAFacts where the ArchiveDiffer cache (which determines which rows of each day's output actually get added to the database) has lost sync with the API.
Recently, the Delphi team discovered some items in USAFacts where the ArchiveDiffer cache (which ensures only new rows of data get added to the database) had lost sync with the API.
</blockquote>
## The Widget
The widget works as following:
could also be "in the following way"
The widget works as following:
The widget works as follows:
- Pull each .csv file from ArchiveDiffer cache currently stored in an AWS S3 bucket.
- Pull each .csv file from ArchiveDiffer cache currently stored in an AWS S3 bucket.
- Pull each .csv file from the ArchiveDiffer cache currently stored in an AWS S3 bucket.
- Construct the parameters for an API call based on the labels of our .csv on ArchiveDiffer.
- Construct the parameters for an API call based on the labels of our .csv on ArchiveDiffer.
- Construct the parameters for an API call based on the labels of the .csv from ArchiveDiffer.
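For readers of this thread, here is a rough sketch of what "construct the parameters from the labels of the .csv" could look like. It assumes cache keys follow a `<source>/<YYYYMMDD>_<geo_type>_<signal>.csv` layout, which this PR does not confirm. The function name `params_from_key` is hypothetical, and the output keys mirror the query parameters of the covidcast endpoint. No network or S3 call is made.

```python
from pathlib import PurePosixPath


def params_from_key(key: str) -> dict:
    """Derive covidcast-style query parameters from a cache key.

    ASSUMPTION: keys look like '<source>/<YYYYMMDD>_<geo_type>_<signal>.csv'.
    This layout is illustrative and is not confirmed by the PR.
    """
    path = PurePosixPath(key)
    source = path.parent.name
    if not source:
        raise ValueError(f"no data source directory in key: {key!r}")

    # Split into at most three parts so that underscores inside the
    # signal name (e.g. 'confirmed_incidence_num') are preserved.
    parts = path.stem.split("_", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected file name: {path.name!r}")
    date, geo_type, signal = parts
    if not (len(date) == 8 and date.isdigit()):
        raise ValueError(f"unexpected date label: {date!r}")

    return {
        "data_source": source,
        "signal": signal,
        "time_type": "day",
        "geo_type": geo_type,
        "time_values": date,
        "geo_value": "*",  # request every location in the file
    }


print(params_from_key("usafacts/20200624_county_confirmed_incidence_num.csv"))
```

Splitting with a maximum of two cuts keeps multi-word signal names intact; a stricter parser would be needed if weekly files or other naming schemes are present in the cache.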
## The Result
There exists significant mismatches between data obtained from the ArchiveDiffer cache and covidcast API.
There exists significant mismatches between data obtained from the ArchiveDiffer cache and covidcast API.
There exist significant mismatches between data obtained from the ArchiveDiffer cache and covidcast API.
Full-file mismatch happens when the API returns no row at all when looking for data from a file on ArchiveDiffer. Fortunately, USAFacts is the only source with significant number of files where the full file is missing.
| Sources | Number of full data missing from API |
| Sources | Number of full data missing from API |
| Sources | Number of full files missing from API |
| dsew-cpr | 2 |
| chng | 1 |
Analysis by date shows that time in mismatched files are pretty evenly distributed. Overall, there is no particular period of time with less/more mismatch than others.
Analysis by date shows that time in mismatched files are pretty evenly distributed. Overall, there is no particular period of time with less/more mismatch than others.
Analysis by date shows that mismatched files are pretty evenly distributed in time. Overall, there is no particular period with less/more mismatches than others.
![](/blog/2023-06-26-mismatch/plot.png "Number of files on ArchiveDiffer with mismatch rows")
![](/blog/2023-06-26-mismatch/plot.png "Number of files on ArchiveDiffer with mismatch rows")
![](/blog/2023-06-26-mismatch/plot.png "Number of files from ArchiveDiffer with mismatched rows")
## Widget Limitations
This widget cannot tell if a mismatched row exists in S3 but not API or vice versa. It can only show the number of rows that should exists in both storage places but in reality, only exists in one of them.
This will make more sense if we explicitly define what a mismatched row means in the above analysis. Maybe as part of the "The Widget" section?
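As a starting point for that definition, here is one possible way to make "mismatched row" concrete. This is only a sketch and not the widget's actual logic: it assumes each file can be reduced to a mapping from location id to value, and the names `count_mismatched_rows`, `cache_rows`, `api_rows`, and the tolerance `tol` are all hypothetical. Unlike the widget as described, it reports the direction of each mismatch separately.

```python
import math


def count_mismatched_rows(cache_rows: dict, api_rows: dict, tol: float = 1e-6) -> dict:
    """Classify rows of one file by how they disagree between the two stores.

    ASSUMPTION: both inputs map a location id (e.g. a FIPS code) to a
    single numeric value; real rows carry more columns than this.
    """
    shared = cache_rows.keys() & api_rows.keys()
    return {
        # Present in the S3 cache but never returned by the API.
        "only_in_cache": len(cache_rows.keys() - api_rows.keys()),
        # Returned by the API but absent from the S3 cache.
        "only_in_api": len(api_rows.keys() - cache_rows.keys()),
        # Present in both, but the values disagree beyond the tolerance.
        "value_differs": sum(
            not math.isclose(cache_rows[g], api_rows[g], abs_tol=tol)
            for g in shared
        ),
    }


cache = {"01001": 5.0, "01003": 2.0, "01005": 7.0}
api = {"01001": 5.0, "01003": 3.0, "01007": 1.0}
print(count_mismatched_rows(cache, api))
# {'only_in_cache': 1, 'only_in_api': 1, 'value_differs': 1}
```

Whichever definition the post adopts, stating which of these three cases count as a "mismatch" would make the Widget Limitations paragraph easier to follow.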
Other limitations:
For simplicity and expediency, this widget checks only the most recent version of the data, and ignores all data versioning information. Data versioning support in the Epidata API exists because it matters for modeling purposes _when_ information became available (this is the `issue` parameter in the covidcast endpoint). When we make repairs, Epidata permits us to correct old issues, but S3 does not permit us to correct old uploads. This means that patching typically only touches the API, and leaves ArchiveDiffer data alone even if the data being patched is the most recent version available. This widget does not check the timestamp of S3 uploads for ArchiveDiffer data or the `issue` column of API data. This means that it includes differences due to patches, which we might wish to ignore.
## Next steps
Assuming that the ArchiveDiffer is always correct in comparison to the API content, we should start patching data in the API from the least accurate indicators, indicator-combination and chng.
Assuming that the ArchiveDiffer is always correct in comparison to the API content, we should start patching data in the API from the least accurate indicators, indicator-combination and chng.
We should construct one or more patches which bring the API data and ArchiveDiffer data back into sync. Most changes will modify the API data, since we expect the ArchiveDiffer data to generally be more accurate than the API data, which lies downstream of ArchiveDiffer. We should however identify any differences which are due to pre-existing patches, and ensure that ArchiveDiffer really does represent the most recent figures on all data.
Priority should be given to the least-accurate and most-used indicators: jhu, chng, and indicator-combination. |