ArchiveDiffer vs covidcast API blog #822

85 changes: 85 additions & 0 deletions content/blog/2023-06-26-mismatch.Rmd
@@ -0,0 +1,85 @@
---
title: "Mismatch between ArchiveDiffer cache and covidcast API data"
author: Minh Le
date: 2023-06-26
tags:
- epidata
authors:
- minh
heroImage: ensemble-hero.jpg
heroImageThumb: ensemble-thumb.jpg
summary: |
Recently, the Delphi team discovered some items in USAFacts where the ArchiveDiffer cache (which determines which rows of each day's output actually get added to the database) has lost sync with the API.
Contributor

Suggested change
Recently, the Delphi team discovered some items in USAFacts where the ArchiveDiffer cache (which determines which rows of each day's output actually get added to the database) has lost sync with the API.
Recently, the Delphi team discovered some items in USAFacts where the ArchiveDiffer cache (which ensures only new rows of data get added to the database) had lost sync with the API.

output:
blogdown::html_page:
toc: true
---

# Mismatch between ArchiveDiffer cache and covidcast API data
## The Problem
Per Katie's [issue description](https://github.com/cmu-delphi/covidcast-indicators/issues/1697):

<blockquote>

We found some items in USAFacts where the ArchiveDiffer cache (which determines which rows of each day's output actually get added to the database) has lost sync with the API. This is a problem for at least three reasons:

* The data in the API is wrong

* The publish mechanism assumes agreement with the API. No agreement means we might publish rows that shouldn't be published, or not-publish rows that should be published.

* Possibly indicates the publish mechanism has a bug (what caused us to lose sync?)

If we had a widget that exhaustively compared what was in the API with the ArchiveDiffer cache, we could identify and address mismatches.

</blockquote>

## The Widget
The widget works as following:
Contributor

could also be "in the following way"

Suggested change
The widget works as following:
The widget works as follows:


- Pull each .csv file from ArchiveDiffer cache currently stored in an AWS S3 bucket.
Contributor

Suggested change
- Pull each .csv file from ArchiveDiffer cache currently stored in an AWS S3 bucket.
- Pull each .csv file from the ArchiveDiffer cache currently stored in an AWS S3 bucket.


- Construct the parameters for an API call based on the labels of our .csv on ArchiveDiffer.
Contributor

Suggested change
- Construct the parameters for an API call based on the labels of our .csv on ArchiveDiffer.
- Construct the parameters for an API call based on the labels of the .csv from ArchiveDiffer.


- Compare the data pulled from ArchiveDiffer with data from our API.

The widget eventually outputs a csv file. Each row describes in detail the comparison between data in each file on ArchiveDiffer and data returned from our API given the same params. This comparison includes whether there is a difference between data from the two sources, and if there is, how many rows are impacted.

Additionally, the widget will identify if a whole file from ArchiveDiffer is missing. This happens when the API does not return any data given matching params from an ArchiveDiffer csv file.
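
As a rough illustration of the three steps above, the comparison could be sketched in Python along these lines. This is not the widget's actual code: the `<source>/<YYYYMMDD>_<geo_type>_<signal>.csv` file-name layout, the `geo_id` and `val` column names, the numeric tolerance, and every function name here are assumptions made for the example. Fetching from S3 and calling the API are left out, so both sides are passed in as already-parsed rows.

```python
import csv
import io


def params_from_filename(key):
    """Build hypothetical API parameters from a cache file name.

    Assumes (not confirmed by the post) names shaped like
    '<source>/<YYYYMMDD>_<geo_type>_<signal>.csv'.
    """
    source, _, name = key.rpartition("/")
    stem = name[:-4] if name.endswith(".csv") else name
    day, geo_type, signal = stem.split("_", 2)  # the signal may contain '_'
    return {"source": source, "time_value": day,
            "geo_type": geo_type, "signal": signal}


def rows_by_geo(csv_text):
    """Index rows by geo_id; the column names are assumed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["geo_id"]: float(row["val"]) for row in reader}


def compare(cache_rows, api_rows, tol=1e-6):
    """Summarize how one cache file differs from the API response."""
    mismatched = 0
    for geo in set(cache_rows) | set(api_rows):
        a, b = cache_rows.get(geo), api_rows.get(geo)
        # A row is mismatched if either side lacks it or the values differ.
        if a is None or b is None or abs(a - b) > tol:
            mismatched += 1
    return {
        "full_file_missing": bool(cache_rows) and not api_rows,
        "mismatched_rows": mismatched,
    }


cache = rows_by_geo("geo_id,val\n01001,1.0\n01003,2.0\n01005,3.0\n")
api = {"01001": 1.0, "01003": 2.5}  # stand-in for a parsed API response
summary = compare(cache, api)
```

In this toy input one row differs in value and one row is absent from the API side, so `summary["mismatched_rows"]` is 2. Note that the count alone does not say which side a row is missing from.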


## The Result
There exists significant mismatches between data obtained from the ArchiveDiffer cache and covidcast API.
Contributor

Suggested change
There exists significant mismatches between data obtained from the ArchiveDiffer cache and covidcast API.
There exist significant mismatches between data obtained from the ArchiveDiffer cache and covidcast API.


| Sources | Number of files with row mismatches | Number of files in ArchiveDiffer | (%) |
Contributor

Suggested change
| Sources | Number of files with row mismatches | Number of files in ArchiveDiffer | (%) |
| Source | Number of files with row mismatches | Number of files in ArchiveDiffer | (%) |

| :--------------------- | :--------: | :--------: | :------: |
| jhu-csse | 17165 | 92608 | 18.535116 |
| chng | 15725 | 39961 | 39.350867 |
| usa-facts | 9413 | 90392 | 10.413532 |
| quidel | 6055 | 89982 | 6.729124 |
| dsew-cpr | 2692 | 15652 | 17.199080 |
| hhs | 1506 | 44748 | 3.365514 |
| indicator-combination| 288 | 288 | 100.000000 |
| covid-act-now | 280 | 7704 | 3.634476 |

From this table, we see that source JHU has the highest number of of files with some row mismatches, closely followed by Change Healthcare. However, since JHU has a much bigger number of files on ArchiveDiffer in total, Change data from the API is actually the more inaccurate source by comparison (39% files have mismatches). The most inaccurate source when comparing API data to ArchiveDiffer data is indicator-combination. All indicator-combination data files on ArchiveDiffer currently have one or more rows different from the data returned from covidcast API.
Contributor

Suggested change
From this table, we see that source JHU has the highest number of of files with some row mismatches, closely followed by Change Healthcare. However, since JHU has a much bigger number of files on ArchiveDiffer in total, Change data from the API is actually the more inaccurate source by comparison (39% files have mismatches). The most inaccurate source when comparing API data to ArchiveDiffer data is indicator-combination. All indicator-combination data files on ArchiveDiffer currently have one or more rows different from the data returned from covidcast API.
From this table, we see that the JHU source has the highest number of files with row mismatches, closely followed by Change Healthcare. However, since JHU has a much bigger number of files in total, Change data from the API is actually less accurate by comparison, having 39% files with mismatches. The most inaccurate source when comparing API data to ArchiveDiffer data is indicator-combination. All indicator-combination data files from ArchiveDiffer currently have one or more rows different from the data returned from the covidcast API.
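
The point about counts versus rates can be checked directly from the table. The short Python sketch below recomputes the percentage column from the two count columns and ranks the sources by mismatch rate; the variable names are illustrative only, and the figures are copied from the table above.

```python
# (files with row mismatches, files in ArchiveDiffer), copied from the table
counts = {
    "jhu-csse": (17165, 92608),
    "chng": (15725, 39961),
    "usa-facts": (9413, 90392),
    "quidel": (6055, 89982),
    "dsew-cpr": (2692, 15652),
    "hhs": (1506, 44748),
    "indicator-combination": (288, 288),
    "covid-act-now": (280, 7704),
}

# Share of each source's cache files that have at least one mismatched row.
pct = {src: 100 * bad / total for src, (bad, total) in counts.items()}

# Rank by rate rather than by raw count.
ranked = sorted(pct, key=pct.get, reverse=True)
for src in ranked:
    print(f"{src:22s} {pct[src]:6.2f}%")
```

Ranked this way, indicator-combination and chng come first, while jhu-csse, which leads on raw count, drops to third.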


Full-file mismatch happens when the API returns no row at all when looking for data from a file on ArchiveDiffer. Fortunately, USAFacts is the only source with significant number of files where the full file is missing.
Contributor

Suggested change
Full-file mismatch happens when the API returns no row at all when looking for data from a file on ArchiveDiffer. Fortunately, USAFacts is the only source with significant number of files where the full file is missing.
Full-file mismatch happens when the API returns nothing at all when looking for data corresponding to a file from ArchiveDiffer. Fortunately, USAFacts is the only source with a significant number of files where the full file is missing.


| Sources | Number of full data missing from API |
Contributor

Suggested change
| Sources | Number of full data missing from API |
| Sources | Number of full files missing from API |

| :---------------- | :------: |
| usa-facts | 340 |
| hhs | 6 |
| jhu-csse | 5 |
| dsew-cpr | 2 |
| chng | 1 |

Analysis by date shows that time in mismatched files are pretty evenly distributed. Overall, there is no particular period of time with less/more mismatch than others.
Contributor

Suggested change
Analysis by date shows that time in mismatched files are pretty evenly distributed. Overall, there is no particular period of time with less/more mismatch than others.
Analysis by date shows that mismatched files are pretty evenly distributed in time. Overall, there is no particular period with less/more mismatches than others.


![](/blog/2023-06-26-mismatch/plot.png "Number of files on ArchiveDiffer with mismatch rows")
Contributor

Suggested change
![](/blog/2023-06-26-mismatch/plot.png "Number of files on ArchiveDiffer with mismatch rows")
![](/blog/2023-06-26-mismatch/plot.png "Number of files from ArchiveDiffer with mismatched rows")


## Widget Limitations
This widget cannot tell whether a mismatched row exists in S3 but not in the API, or vice versa. It can only report the number of rows that should exist in both places but in fact exist in only one of them.
Contributor

This will make more sense if we explicitly define what a mismatched row means in the above analysis. Maybe as part of The Widget section?

Contributor

Other limitations:

For simplicity and expediency, this widget checks only the most recent version of the data, and ignores all data versioning information. Data versioning support in the Epidata API exists because it matters for modeling purposes _when_ information became available (this is the `issue` parameter in the covidcast endpoint). When we make repairs, Epidata permits us to correct old issues, but S3 does not permit us to correct old uploads. This means that patching typically only touches the API, and leaves ArchiveDiffer data alone even if the data being patched is the most recent version available. This widget does not check the timestamp of S3 uploads for ArchiveDiffer data or the `issue` column of API data. This means that it includes differences due to patches, which we might wish to ignore.
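
One way such a check might look is sketched below in Python. It is purely hypothetical: the widget does nothing like this today, and the record fields `api_issue` and `s3_uploaded` (both taken to be `datetime.date` values) are assumed names, not fields the widget produces. The idea is only that a mismatch whose API issue is later than the S3 upload is a candidate for having been caused by a patch.

```python
from datetime import date


def split_mismatches(mismatches):
    """Separate mismatches that may stem from an API-only patch.

    Each record is assumed to carry 'api_issue' (issue date of the API row)
    and 's3_uploaded' (upload date of the cache file), both datetime.date.
    """
    maybe_patched, unexplained = [], []
    for record in mismatches:
        # An API issue newer than the cache upload suggests a later patch.
        if record["api_issue"] > record["s3_uploaded"]:
            maybe_patched.append(record)
        else:
            unexplained.append(record)
    return maybe_patched, unexplained


example = [
    {"file": "a.csv", "api_issue": date(2021, 3, 1), "s3_uploaded": date(2021, 1, 15)},
    {"file": "b.csv", "api_issue": date(2021, 1, 15), "s3_uploaded": date(2021, 1, 15)},
]
maybe_patched, unexplained = split_mismatches(example)
```

A date comparison like this can only flag candidates; confirming that a difference really comes from a patch would still need the patch history.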


## Next steps
Assuming that the ArchiveDiffer is always correct in comparison to the API content, we should start patching data in the API from the least accurate indicators, indicator-combination and chng.
Contributor

Suggested change
Assuming that the ArchiveDiffer is always correct in comparison to the API content, we should start patching data in the API from the least accurate indicators, indicator-combination and chng.
We should construct one or more patches which bring the API data and ArchiveDiffer data back into sync. Most changes will modify the API data, since we expect the ArchiveDiffer data to generally be more accurate than the API data, which lies downstream of ArchiveDiffer. We should however identify any differences which are due to pre-existing patches, and ensure that ArchiveDiffer really does represent the most recent figures on all data.
Priority should be given to the least-accurate and most-used indicators: jhu, chng, and indicator-combination.


170 changes: 170 additions & 0 deletions content/blog/2023-06-26-mismatch.html
@@ -0,0 +1,170 @@
---
title: "Mismatch between ArchiveDiffer cache and covidcast API data"
author: Minh Le
date: 2023-06-26
tags:
- epidata
authors:
- minh
heroImage: ensemble-hero.jpg
heroImageThumb: ensemble-thumb.jpg
summary: |
Recently, the Delphi team discovered some items in USAFacts where the ArchiveDiffer cache (which determines which rows of each day's output actually get added to the database) has lost sync with the API.
output:
blogdown::html_page:
toc: true
---


<div id="TOC">
<ul>
<li><a href="#mismatch-between-archivediffer-cache-and-covidcast-api-data" id="toc-mismatch-between-archivediffer-cache-and-covidcast-api-data">Mismatch between ArchiveDiffer cache and covidcast API data</a>
<ul>
<li><a href="#the-problem" id="toc-the-problem">The Problem</a></li>
<li><a href="#the-widget" id="toc-the-widget">The Widget</a></li>
<li><a href="#the-result" id="toc-the-result">The Result</a></li>
<li><a href="#widget-limitations" id="toc-widget-limitations">Widget Limitations</a></li>
<li><a href="#next-steps" id="toc-next-steps">Next steps</a></li>
</ul></li>
</ul>
</div>

<div id="mismatch-between-archivediffer-cache-and-covidcast-api-data" class="section level1">
<h1>Mismatch between ArchiveDiffer cache and covidcast API data</h1>
<div id="the-problem" class="section level2">
<h2>The Problem</h2>
<p>Per Katie’s <a href="https://github.com/cmu-delphi/covidcast-indicators/issues/1697">issue description</a>:</p>
<blockquote>
<p>We found some items in USAFacts where the ArchiveDiffer cache (which determines which rows of each day’s output actually get added to the database) has lost sync with the API. This is a problem for at least three reasons:</p>
<ul>
<li><p>The data in the API is wrong</p></li>
<li><p>The publish mechanism assumes agreement with the API. No agreement means we might publish rows that shouldn’t be published, or not-publish rows that should be published.</p></li>
<li><p>Possibly indicates the publish mechanism has a bug (what caused us to lose sync?)</p></li>
</ul>
<p>If we had a widget that exhaustively compared what was in the API with the ArchiveDiffer cache, we could identify and address mismatches.</p>
</blockquote>
</div>
<div id="the-widget" class="section level2">
<h2>The Widget</h2>
<p>The widget works as following:</p>
<ul>
<li><p>Pull each .csv file from ArchiveDiffer cache currently stored in an AWS S3 bucket.</p></li>
<li><p>Construct the parameters for an API call based on the labels of our .csv on ArchiveDiffer.</p></li>
<li><p>Compare the data pulled from ArchiveDiffer with data from our API.</p></li>
</ul>
<p>The widget eventually outputs a csv file. Each row describes in detail the comparison between data in each file on ArchiveDiffer and data returned from our API given the same params. This comparison includes whether there is a difference between data from the two sources, and if there is, how many rows are impacted.</p>
<p>Additionally, the widget will identify if a whole file from ArchiveDiffer is missing. This happens when the API does not return any data given matching params from an ArchiveDiffer csv file.</p>
</div>
<div id="the-result" class="section level2">
<h2>The Result</h2>
<p>There exists significant mismatches between data obtained from the ArchiveDiffer cache and covidcast API.</p>
<table>
<colgroup>
<col width="44%" />
<col width="20%" />
<col width="20%" />
<col width="16%" />
</colgroup>
<thead>
<tr class="header">
<th align="left">Sources</th>
<th align="center">Number of files with row mismatches</th>
<th align="center">Number of files in ArchiveDiffer</th>
<th align="center">(%)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">jhu-csse</td>
<td align="center">17165</td>
<td align="center">92608</td>
<td align="center">18.535116</td>
</tr>
<tr class="even">
<td align="left">chng</td>
<td align="center">15725</td>
<td align="center">39961</td>
<td align="center">39.350867</td>
</tr>
<tr class="odd">
<td align="left">usa-facts</td>
<td align="center">9413</td>
<td align="center">90392</td>
<td align="center">10.413532</td>
</tr>
<tr class="even">
<td align="left">quidel</td>
<td align="center">6055</td>
<td align="center">89982</td>
<td align="center">6.729124</td>
</tr>
<tr class="odd">
<td align="left">dsew-cpr</td>
<td align="center">2692</td>
<td align="center">15652</td>
<td align="center">17.199080</td>
</tr>
<tr class="even">
<td align="left">hhs</td>
<td align="center">1506</td>
<td align="center">44748</td>
<td align="center">3.365514</td>
</tr>
<tr class="odd">
<td align="left">indicator-combination</td>
<td align="center">288</td>
<td align="center">288</td>
<td align="center">100.000000</td>
</tr>
<tr class="even">
<td align="left">covid-act-now</td>
<td align="center">280</td>
<td align="center">7704</td>
<td align="center">3.634476</td>
</tr>
</tbody>
</table>
<p>From this table, we see that source JHU has the highest number of of files with some row mismatches, closely followed by Change Healthcare. However, since JHU has a much bigger number of files on ArchiveDiffer in total, Change data from the API is actually the more inaccurate source by comparison (39% files have mismatches). The most inaccurate source when comparing API data to ArchiveDiffer data is indicator-combination. All indicator-combination data files on ArchiveDiffer currently have one or more rows different from the data returned from covidcast API.</p>
<p>Full-file mismatch happens when the API returns no row at all when looking for data from a file on ArchiveDiffer. Fortunately, USAFacts is the only source with significant number of files where the full file is missing.</p>
<table>
<thead>
<tr class="header">
<th align="left">Sources</th>
<th align="center">Number of full data missing from API</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">usa-facts</td>
<td align="center">340</td>
</tr>
<tr class="even">
<td align="left">hhs</td>
<td align="center">6</td>
</tr>
<tr class="odd">
<td align="left">jhu-csse</td>
<td align="center">5</td>
</tr>
<tr class="even">
<td align="left">dsew-cpr</td>
<td align="center">2</td>
</tr>
<tr class="odd">
<td align="left">chng</td>
<td align="center">1</td>
</tr>
</tbody>
</table>
<p>Analysis by date shows that time in mismatched files are pretty evenly distributed. Overall, there is no particular period of time with less/more mismatch than others.</p>
<p><img src="/blog/2023-06-26-mismatch/plot.png" title="Number of files on ArchiveDiffer with mismatch rows" /></p>
</div>
<div id="widget-limitations" class="section level2">
<h2>Widget Limitations</h2>
<p>This widget cannot tell whether a mismatched row exists in S3 but not in the API, or vice versa. It can only report the number of rows that should exist in both places but in fact exist in only one of them.</p>
</div>
<div id="next-steps" class="section level2">
<h2>Next steps</h2>
<p>Assuming that the ArchiveDiffer is always correct in comparison to the API content, we should start patching data in the API from the least accurate indicators, indicator-combination and chng.</p>
</div>
</div>
Binary file added static/blog/2023-06-26-mismatch/plot.png