-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support incremental appending #81
Conversation
Codecov Report
@@ Coverage Diff @@
## master #81 +/- ##
==========================================
+ Coverage 91.36% 92.39% +1.02%
==========================================
Files 5 5
Lines 359 368 +9
Branches 49 51 +2
==========================================
+ Hits 328 340 +12
+ Misses 26 22 -4
- Partials 5 6 +1
Continue to review full report at Codecov.
|
self.processed_input_nb = len(self.processed_input_urls) | ||
input_pattern = ExplicitURLSequence(self.processed_input_urls + self.input_urls) | ||
self._inputs = { | ||
k: {"url": v, "processed": k < self.processed_input_nb} for k, v in input_pattern |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation of ExplicitURLSequence
leaks here, as k
would not necessarily be an incrementing index. We could instead have:
"processed": v in self.processed_input_urls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this is not true, since self.input_urls
are the new URLs (they don't contain the already processed URLs), which is actually different than for the NetCDFtoZarrMultiVarSequentialRecipe
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could change self.input_urls
so that it also includes the already processed URLs, this would make NetCDFtoZarrSequentialRecipe
similar to NetCDFtoZarrMultiVarSequentialRecipe
.
self._inputs = {k: v for k, v in self.input_pattern} | ||
self.processed_input_nb = len(self.processed_input_urls) // len(self._variables) | ||
self._inputs = { | ||
k: {"url": v, "processed": v in self.processed_input_urls} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here the input_pattern
generates the input file URLs including the already processed ones (which is different than for the NetCDFtoZarrSequentialRecipe
). This allows to tag the already processed URLs.
If the input_pattern
didn't generate the already processed URLs, the pattern would have to know about the (number of) already processed URLs, which is not great.
Hi @davidbrochart - thanks so much for this PR! We really appreciate it! Sorry it has taken me so long to reply--I had a crazy week last week and wasn't able to spend much time on Pangeo Forge. I'm going to try to review this in detail in the next few days. I'm a bit concerned about the intersection with #78, but we will see. |
David thanks for working on this. We really appreciate your efforts. Closed this because the code base has evolved quite a bit since this PR started. I plan to work on this feature in January and will use this PR as a basis for that work. Appreciate your patience with this high-churn code base. It hasn't been trivial to get the base abstractions right. |
See #37 (comment)