ArchiveDiffer vs covidcast API blog #822

Open

minhkhul wants to merge 1 commit into dev from archive-differ-blog

Contributor

minhkhul commented Jun 26, 2023

No description provided.


          ArchiveDiffer vs covidcast API blog

36ded14

minhkhul requested a review from krivard

June 26, 2023 14:45

netlify bot commented Jun 26, 2023

✅ Deploy Preview for cmu-delphi-main ready!

Name	Link
🔨 Latest commit	`36ded14`
🔍 Latest deploy log	https://app.netlify.com/sites/cmu-delphi-main/deploys/6499a47df9cbc000084cfa50
😎 Deploy Preview	https://deploy-preview-822--cmu-delphi-main.netlify.app

krivard reviewed

Contributor

krivard left a comment

First copy-edit pass (your English is perfectly fine; I'm just making it more idiomatic).

Limitations and Next Steps will need another pass, but they're always tricky

content/blog/2023-06-26-mismatch.Rmd

+              heroImage: ensemble-hero.jpg
+              heroImageThumb: ensemble-thumb.jpg
+              summary: |
+                Recently, the Delphi team discovered some items in USAFacts where the ArchiveDiffer cache (which determines which rows of each day's output actually get added to the database) has lost sync with the API.

Contributor

krivard Jun 26, 2023

Suggested change

      
              Recently, the Delphi team discovered some items in USAFacts where the ArchiveDiffer cache (which determines which rows of each day's output actually get added to the database) has lost sync with the API.
          
              Recently, the Delphi team discovered some items in USAFacts where the ArchiveDiffer cache (which ensures only new rows of data get added to the database) had lost sync with the API.

content/blog/2023-06-26-mismatch.Rmd

+              </blockquote>
+              ## The Widget
+              The widget works as following:

Contributor

krivard Jun 26, 2023

could also be "in the following way"

Suggested change

      
            The widget works as following:
          
            The widget works as follows:

content/blog/2023-06-26-mismatch.Rmd

+              ## The Widget
+              The widget works as following:
+              - Pull each .csv file from ArchiveDiffer cache currently stored in an AWS S3 bucket.

Contributor

krivard Jun 26, 2023

Suggested change

      
            - Pull each .csv file from ArchiveDiffer cache currently stored in an AWS S3 bucket.
          
            - Pull each .csv file from the ArchiveDiffer cache currently stored in an AWS S3 bucket.

content/blog/2023-06-26-mismatch.Rmd


		- Pull each .csv file from ArchiveDiffer cache currently stored in an AWS S3 bucket.

		- Construct the parameters for an API call based on the labels of our .csv on ArchiveDiffer.

Contributor

krivard Jun 26, 2023

Suggested change

      
            - Construct the parameters for an API call based on the labels of our .csv on ArchiveDiffer.
          
            - Construct the parameters for an API call based on the labels of the .csv from ArchiveDiffer.
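
The parameter-construction step could be sketched as follows. This is a minimal illustration, not the widget's actual code: it assumes cache files are named `YYYYMMDD_<geo_type>_<signal>.csv`, that the data source is known separately (for example from where the file sits in the bucket), and that the API call takes the parameter names shown. `params_from_cache_filename` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def params_from_cache_filename(key: str, data_source: str) -> dict:
    """Derive API query parameters from a cached CSV's file name.

    Assumed (not confirmed by the post) naming scheme:
        YYYYMMDD_<geo_type>_<signal>.csv
    Signal names may contain underscores, so split at most twice.
    """
    name = PurePosixPath(key).name
    if not name.endswith(".csv"):
        raise ValueError(f"not a .csv file: {key!r}")

    parts = name[: -len(".csv")].split("_", 2)
    if len(parts) != 3 or len(parts[0]) != 8 or not parts[0].isdigit():
        raise ValueError(f"unexpected file name: {name!r}")

    date, geo_type, signal = parts
    # Parameter names are an assumption about the API being queried.
    return {
        "data_source": data_source,
        "signal": signal,
        "time_type": "day",
        "geo_type": geo_type,
        "time_values": date,
        "geo_value": "*",  # request every location present in the file
    }


params = params_from_cache_filename(
    "some/prefix/20230601_county_confirmed_incidence_num.csv", "usa-facts"
)
```

Splitting at most twice keeps multi-word signal names such as `confirmed_incidence_num` intact, which matters because only the date and geography fields are guaranteed to be underscore-free under this assumed scheme.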

content/blog/2023-06-26-mismatch.Rmd



		## The Result
		There exists significant mismatches between data obtained from the ArchiveDiffer cache and covidcast API.

Contributor

krivard Jun 26, 2023

Suggested change

      
            There exists significant mismatches between data obtained from the ArchiveDiffer cache and covidcast API. 
          
            There exist significant mismatches between data obtained from the ArchiveDiffer cache and covidcast API.

content/blog/2023-06-26-mismatch.Rmd


		Full-file mismatch happens when the API returns no row at all when looking for data from a file on ArchiveDiffer. Fortunately, USAFacts is the only source with significant number of files where the full file is missing.

		| Sources | Number of full data missing from API |

Contributor

krivard Jun 26, 2023

Suggested change

      
            | Sources              | Number of full data missing from API |
          
            | Sources              | Number of full files missing from API |

content/blog/2023-06-26-mismatch.Rmd

+              | dsew-cpr |  2   |
+              | chng |  1   |
+              Analysis by date shows that time in mismatched files are pretty evenly distributed. Overall, there is no particular period of time with less/more mismatch than others.

Contributor

krivard Jun 26, 2023

Suggested change

      
            Analysis by date shows that time in mismatched files are pretty evenly distributed. Overall, there is no particular period of time with less/more mismatch than others. 
          
            Analysis by date shows that mismatched files are pretty evenly distributed in time. Overall, there is no particular period with less/more mismatches than others.

content/blog/2023-06-26-mismatch.Rmd


		Analysis by date shows that time in mismatched files are pretty evenly distributed. Overall, there is no particular period of time with less/more mismatch than others.

		![](/blog/2023-06-26-mismatch/plot.png "Number of files on ArchiveDiffer with mismatch rows")

Contributor

krivard Jun 26, 2023

Suggested change

      
            ![](/blog/2023-06-26-mismatch/plot.png "Number of files on ArchiveDiffer with mismatch rows")
          
            ![](/blog/2023-06-26-mismatch/plot.png "Number of files from ArchiveDiffer with mismatched rows")

content/blog/2023-06-26-mismatch.Rmd

+              ![](/blog/2023-06-26-mismatch/plot.png "Number of files on ArchiveDiffer with mismatch rows")
+              ## Widget Limitations
+              This widget cannot tell if a mismatched row exists in S3 but not API or vice versa. It can only show the number of rows that should exists in both storage places but in reality, only exists in one of them.

Contributor

krivard Jun 26, 2023

This will make more sense if we explicitly define what a mismatched row means in the above analysis. Maybe as part of The Widget section?

Contributor

krivard Jun 26, 2023

Other limitations:

For simplicity and expediency, this widget checks only the most recent version of the data, and ignores all data versioning information. Data versioning support in the Epidata API exists because it matters for modeling purposes _when_ information became available (this is the `issue` parameter in the covidcast endpoint). When we make repairs, Epidata permits us to correct old issues, but S3 does not permit us to correct old uploads. This means that patching typically only touches the API, and leaves ArchiveDiffer data alone even if the data being patched is the most recent version available. This widget does not check the timestamp of S3 uploads for ArchiveDiffer data or the `issue` column of API data. This means that it includes differences due to patches, which we might wish to ignore.
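
One possible way to make "mismatched row" explicit, and to record which side a row is missing from, is sketched below. It is an illustration under assumptions, not a description of the widget: rows are keyed by location only, each side is reduced to a mapping from location to value, and `classify_mismatches` and the tolerance `tol` are hypothetical. Like the widget, it ignores versioning entirely.

```python
def classify_mismatches(cache_rows: dict, api_rows: dict, tol: float = 1e-6) -> dict:
    """Compare one cached file against the matching API response.

    Both arguments map a location id to its reported value. A row counts
    as mismatched if it is present on only one side, or present on both
    sides with values that differ by more than `tol`.
    """
    report = {"only_in_cache": [], "only_in_api": [], "value_differs": []}
    for geo in sorted(cache_rows.keys() | api_rows.keys()):
        if geo not in api_rows:
            report["only_in_cache"].append(geo)
        elif geo not in cache_rows:
            report["only_in_api"].append(geo)
        elif abs(cache_rows[geo] - api_rows[geo]) > tol:
            report["value_differs"].append(geo)
    return report


# Made-up values for illustration only.
cache = {"01001": 5.0, "01003": 2.0, "01005": 1.0}
api = {"01001": 5.0, "01003": 3.0, "01007": 4.0}
report = classify_mismatches(cache, api)
# report == {"only_in_cache": ["01005"],
#            "only_in_api": ["01007"],
#            "value_differs": ["01003"]}
```

Keeping the three categories separate would let the post state the direction of each mismatch instead of a single combined count.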

content/blog/2023-06-26-mismatch.Rmd

+              This widget cannot tell if a mismatched row exists in S3 but not API or vice versa. It can only show the number of rows that should exists in both storage places but in reality, only exists in one of them.
+              ## Next steps
+              Assuming that the ArchiveDiffer is always correct in comparison to the API content, we should start patching data in the API from the least accurate indicators, indicator-combination and chng.

Contributor

krivard Jun 26, 2023

Suggested change

      
            Assuming that the ArchiveDiffer is always correct in comparison to the API content, we should start patching data in the API from the least accurate indicators, indicator-combination and chng.
          
            We should construct one or more patches which bring the API data and ArchiveDiffer data back into sync. Most changes will modify the API data, since we expect the ArchiveDiffer data to generally be more accurate than the API data, which lies downstream of ArchiveDiffer. We should however identify any differences which are due to pre-existing patches, and ensure that ArchiveDiffer really does represent the most recent figures on all data.
          
            Priority should be given to the least-accurate and most-used indicators: jhu, chng, and indicator-combination.
