-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: introduce eager loading functions #147
Conversation
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
05d5b8a
to
5026389
Compare
Some work is still required in calamine: tafia/calamine#409 |
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
Okay well just noticed that the API changed so we actually need to use |
Signed-off-by: Luka Peschke <[email protected]>
Glad to see tafia/calamine#409 has been merged. Hopefully we get a new release soon 👍 |
Signed-off-by: Luka Peschke <[email protected]>
new data
|
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
calamine 0.25.0 should be released soon, meaning I should finally be able to finish this 🙂 tafia/calamine#435 |
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
Signed-off-by: Luka Peschke <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks great
Love the small refactorings done
@@ -165,9 +165,38 @@ def load_sheet( | |||
schema_sample_rows=schema_sample_rows, | |||
use_columns=use_columns, | |||
dtypes=dtypes, | |||
eager=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably improve the doc string here "lazy load"?
And question don't we want to have the eager version by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not have the eager version by default as I don't want to introduce a breaking change. But you're right, I'll improve the docstring 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use_columns: list[str] | list[int] | str | None = None, | ||
dtypes: DTypeMap | None = None, | ||
) -> pa.RecordBatch: | ||
"""Loads a sheet eagerly by index or name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only has an impact for xlsx since the other formats don't support lazy iteration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signed-off-by: Luka Peschke <[email protected]>
What
This introduces eager loading functions that make use of the calamine's new
DataTypeRef
.This prevents some allocations, resulting in a lower memory footprint.
Caveats
The API is kinda rough for now, it will probably need some cleaning (I mostly wanted to check if the memory gain was interesting here).
The functions need to be eager because
DataTypeRef
has an explicit lifetime, which is not allowed by PyO3 (lifetimes are hard to enforce on the python side: https://pyo3.rs/v0.20.0/class.html#no-lifetime-parameters)In order for this to work, some changes are needed in calamine, and we don't know if this is something the library maintainers had in mind. PR and discussion: refactor: make
DataTypeRef
public and introduce aDataTypeTrait
trait tafia/calamine#390Gains
While the speed stays roughly the same (it was even 3~5% faster on my machine on several tests), the memory footprint decreases by almost 25%. . This means that we're almost as good as pandas memory-wise 🥳 (they still beat us by a few MBs), while being about 10 times faster
Before
After
Pandas