bug: default implementations of RESTStream (and others) do not adhere to _MAX_RECORDS_LIMIT
#1349
@kgpayne - understood. One of the reasons this was created as an internal constant was that it could not be relied on everywhere. There are also questions about how this should behave, since arbitrary integers here could break the tap's ability to resume meaningfully. This could also affect parent-child streams, preventing child streams from being reached. Perhaps a path forward would be to define a "dry run" mode formally, so that certain assumptions can be baked in. Then this might end up with a name like 🤔
@kgpayne - I opened this issue to specifically focus on dry run use cases: For test-specific implementations, why not create a custom loop in the same pattern as
I think this is actually functioning as designed. For the reasons described in #1366, this was not intended to be a "normal" feature - exactly because it breaks stability expectations if used in a production capacity. The raising of the exception is how the counter is designed to operate within the context of the connection test.
On second pass through the code above, what about amending the connection test method to take any

Another thing to tweak here if we want tests to emit at least one

```diff
- def run_connection_test(self) -> bool:
+ def run_connection_test(self, max_records_per_stream: int = 1) -> bool:
      """Run connection test.

      Returns:
          True if the test succeeded.
      """
      for stream in self.streams.values():
          # Initialize streams' record limits before beginning the sync test.
-         stream._MAX_RECORDS_LIMIT = 1
+         stream._MAX_RECORDS_LIMIT = max_records_per_stream
+         stream.STATE_MSG_FREQUENCY = max_records_per_stream
```

Important to note that many streams are not sorted and
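To make the shape of this proposal concrete, here is a minimal, self-contained sketch of the amended signature in use. All names here (`Stream`, `Tap`) are stand-ins for illustration, not the real SDK classes, and the body omits the actual sync attempt:

```python
# Hypothetical sketch of the proposed run_connection_test signature.
# `Stream` and `Tap` below are stubs, not the real Singer SDK classes.

class Stream:
    """Stub stream carrying the two internal attributes the proposal sets."""
    _MAX_RECORDS_LIMIT = None
    STATE_MSG_FREQUENCY = 10000

class Tap:
    """Stub tap holding a mapping of stream name -> Stream."""

    def __init__(self, streams):
        self.streams = streams

    def run_connection_test(self, max_records_per_stream: int = 1) -> bool:
        """Cap each stream's record limit before running the test sync."""
        for stream in self.streams.values():
            # Initialize streams' record limits before beginning the sync test.
            stream._MAX_RECORDS_LIMIT = max_records_per_stream
            stream.STATE_MSG_FREQUENCY = max_records_per_stream
        # A real implementation would attempt a limited sync here.
        return True

tap = Tap({"users": Stream(), "orders": Stream()})
tap.run_connection_test(max_records_per_stream=5)
assert tap.streams["users"]._MAX_RECORDS_LIMIT == 5
```

Callers that want the old behaviour simply omit the argument, since the default of `1` preserves the existing single-record test.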
@aaronsteers that makes sense for the connection test, but doesn't help for the rest of the standardised test suite. The root issue is that tests require a finalised sync to perform stream-level tests. Ultimately we want to test that the Tap runs to completion and (among other things) correctly finalises accumulated STATE into a bookmark. However, we also want to reduce the cost of testing overall by limiting the number of records fetched as part of that completed sync. Hence the desire to have a mechanism for limiting the number of records fetched and returned by

Whilst I understand the sorting case, and also the parent-child limitations called out elsewhere, having a means to specify "run to completion but fetch no more than n records" for testing purposes will reduce the cost of CI/CD and improve the overall developer experience for Tap maintainers.

Looking further down the line, I still cannot see a reason why this feature couldn't be made available with limitations/caveats; i.e. setting record limits for the purposes of syncing is only available to sorted parent streams, for all the reasons you mentioned. There are plenty of sources that are sorted and not nested that would work as expected and benefit hugely (including many if not most SQL use-cases) 🙂
My point above is that because it's impossible to do so, and because that is understandably part of the spec, we should not add the feature: we can't deliver it generically and still meet the minimal set of spec-adherence expectations.
Understood. Again my point is just that this is impossible, not that it is not desirable. We cannot finalize the STATE because the bookmark is invalid until all records are synced to that point.
If we're talking about testing, then the raised exception can be handled by the test harness. If we're talking about production syncs, the exception is needed to make sure we're alerting on the failure to complete the sync.
I don't know how "run to completion" could be compatible with "no more than n records". Either we ran to completion or we aborted abnormally - we can't have done both. There is a way to do this as a feature, but the number of errors we'd have to throw would make it unusable in the majority of tap implementations - especially if delivered before #1350, which would allow individual stream config. That said, if we keep scope focused on dry-run and test use cases, then we end up with something like #1366, which is specifically named and defined as a non-production config option.
It's not clear how those exceptions/limitations/caveats would be delivered. Would we fail the entire sync operation if a user tries to apply a record limit on an unsorted stream, or on a stream which cannot support the record limit? How would users know in advance which streams are sorted or not, since that's internal to the tap implementation? What if a stream switches from
I don't really see this being generically valuable outside of

**Teasing apart test/dry-run use cases from "real world" production applications**

For dry run scenarios, we can do:

For all other real world use cases, I think we'd need to agree on a pretty large set of failure scenarios - circumstances where we'd basically be forced to fail the whole sync operation whenever we run into a stream that can't honor a resumable sync operation with the record limit applied. Those cases include but wouldn't be limited to: unsorted streams (the default for

For all these reasons, I think this would be better to let tap developers implement themselves, and to not add it as a generic/global config across all taps. If we wanted to deliver generically, I think it just needs to be positioned as a "dry-run" or "test" feature, and not a generic record limit.
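For the "let tap developers implement themselves" route, one possible shape is a tap-specific config key that truncates the record iterator cleanly, so the sync completes and STATE can be finalised without any exception. Everything here is hypothetical: the `record_limit` config key and the `request_records` helper are made up for illustration and are not SDK features.

```python
# One way a tap developer could implement a capped sync themselves, rather
# than relying on a generic SDK feature. The `record_limit` config key and
# `request_records` helper are invented for this sketch.
from itertools import islice

def request_records():
    """Stand-in for a paginated API client; yields records indefinitely."""
    page = 0
    while True:
        yield {"page": page}
        page += 1

def get_records(config):
    """Yield records, stopping cleanly once an optional limit is reached."""
    limit = config.get("record_limit")  # hypothetical, tap-specific setting
    records = request_records()
    if limit is not None:
        # islice exhausts the iterator cleanly at the limit, so the sync
        # "completes" normally and no exception is raised.
        records = islice(records, limit)
    yield from records

assert len(list(get_records({"record_limit": 4}))) == 4
```

Because the developer owns the stream, they can make this choice only where it is safe (e.g. on sorted, non-nested streams), which sidesteps the generic-delivery problems discussed above.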
This has been marked as stale because it is unassigned and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the
Still relevant |
This has been marked as stale because it is unassigned and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the
Singer SDK Version
0.18.0
Python Version
3.10
Bug scope
Taps (catalog, state, stream maps, etc.)
Operating System
macOS
Description
In testing MeltanoLabs/tap-stackexchange (PR here) with the upcoming test improvements, we noticed that the `RESTStream` implementation included in the SDK does not adhere to the `_MAX_RECORDS_LIMIT` internal attribute when returning records from `get_records()`. Specifically, `get_records()` returns more than the limit, causing `_check_max_record_limit()` to fail and throw a `MaxRecordsLimitException`.

For connection testing (currently the only use of `_MAX_RECORDS_LIMIT`) this error is caught and passed. However, in the new testing framework, we need the stream to complete gracefully (emitting state etc.) so that we can test the records returned. Catching the raised error, whilst possible, is still unhelpful when records/messages are required for testing.

In the recent `SQLStream` implementation we added a snippet to "push down" the limit to the remote SQL engine, avoiding materialising a complete dataset on the server and instead exiting early after a small number of records:

By limiting the number of records fetched in `get_records()` we both limit load on the upstream system and avoid triggering the `MaxRecordsLimitException` thrown by `_check_max_record_limit()` if too many records are returned. We could apply the same approach to the base implementations of the other stream classes. We should then also document this approach for cases where users override the `get_records()` implementation.

Code

No response
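The `SQLStream` snippet referenced in the description is not reproduced above. As a rough illustration of the "push down" idea, here is the same technique against stdlib `sqlite3` (not the SDK's actual SQLAlchemy-based implementation): the limit is applied in the SQL itself, so the engine never materialises the full result set.

```python
# Illustrative limit pushdown, assuming a plain sqlite3 connection rather
# than the SDK's SQLAlchemy machinery.
import sqlite3

def get_records(conn, table, max_records_limit=None):
    """Yield rows, pushing any record limit down into the SQL query."""
    query = f"SELECT * FROM {table}"
    if max_records_limit is not None:
        # LIMIT is evaluated by the engine, so at most `max_records_limit`
        # rows are ever produced or transferred.
        query += f" LIMIT {int(max_records_limit)}"
    yield from conn.execute(query)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])
assert len(list(get_records(conn, "t", max_records_limit=3))) == 3
```

Because the iterator is exhausted naturally at the limit, `_check_max_record_limit()` never sees an excess record and no exception is raised, which is the behaviour the issue asks for in the other stream classes.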