feat: introduce eager loading functions #147

Merged: 31 commits, Jul 1, 2024

Collaborator

@lukapeschke lukapeschke commented Dec 22, 2023

What

This introduces eager loading functions that make use of calamine's new DataTypeRef.

This prevents some allocations, resulting in a lower memory footprint.
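As a rough analogy only (this is not fastexcel or calamine code), the difference between owned and by-ref cell values can be sketched in pure Python. The helper names `load_owned`, `load_borrowed` and `peak_bytes` are hypothetical, and `tracemalloc` only sees Python-side allocations, so the numbers say nothing about the real Rust implementation.

```python
import tracemalloc
from typing import Callable


def load_owned(buffer: bytes, cell_size: int = 1000) -> list:
    # Each slice of a bytes object is a fresh copy: every "cell" owns its data.
    return [buffer[i:i + cell_size] for i in range(0, len(buffer), cell_size)]


def load_borrowed(buffer: bytes, cell_size: int = 1000) -> list:
    # memoryview slices point into the original buffer without copying it.
    view = memoryview(buffer)
    return [view[i:i + cell_size] for i in range(0, len(buffer), cell_size)]


def peak_bytes(fn: Callable[[], object]) -> int:
    # Peak Python-level allocation while fn runs (hypothetical helper).
    tracemalloc.start()
    try:
        result = fn()  # keep the result alive until the peak is read
        _, peak = tracemalloc.get_traced_memory()
        del result
    finally:
        tracemalloc.stop()
    return peak


if __name__ == "__main__":
    data = bytes(1_000_000)  # allocated before tracing starts
    print("owned peak:   ", peak_bytes(lambda: load_owned(data)))
    print("borrowed peak:", peak_bytes(lambda: load_borrowed(data)))
```

The owned variant has to allocate a copy of every cell, while the borrowed variant only allocates small view objects, which is the kind of saving the by-ref approach aims for.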

Caveats

Gains

While the speed stays roughly the same (it was even 3~5% faster on my machine on several tests), the memory footprint decreases by almost 25%. This means that we're almost as good as pandas memory-wise 🥳 (they still beat us by a few MBs), while being about 10 times faster.

Before

[image: before]

After

[image: after]

Pandas

[image: pandas]

@lukapeschke lukapeschke added enhancement New feature or request 🔨 WIP 🔧 🦀 rust 🦀 Pull requests that edit Rust code labels Dec 22, 2023
@lukapeschke lukapeschke self-assigned this Dec 22, 2023
@lukapeschke lukapeschke added the 🐍 python 🐍 Pull requests that edit Python code label Dec 22, 2023
@lukapeschke lukapeschke added this to the v1.0.0 milestone Feb 9, 2024
@PrettyWood PrettyWood modified the milestones: v1.0.0, v0.10.0 Feb 14, 2024
@lukapeschke lukapeschke force-pushed the strings-by-ref-calamine branch from 05d5b8a to 5026389 Compare February 20, 2024 12:54
@lukapeschke
Collaborator Author

Some work is still required in calamine: tafia/calamine#409

@lukapeschke
Collaborator Author

Okay, well, I just noticed that the API changed, so we actually need to use worksheet_range_ref in case Sheets is the Xlsx variant.

@lukapeschke lukapeschke modified the milestones: v0.10.0, v1.0.0 Feb 27, 2024
@PrettyWood
Member

Glad to see tafia/calamine#409 has been merged. Hopefully we get a new release soon 👍

@lukapeschke
Collaborator Author

new data

main

import argparse
from time import sleep
import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()


def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        arrow_data = excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
        # sleeping to be really visible on the resulting graph
        sleep(1)
        arrow_data.to_pandas()


if __name__ == "__main__":
    main()

[image: main]

this branch

import argparse
from time import sleep
import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()


def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        arrow_data = excel_file.load_sheet_eager(sheet_name, use_columns=use_columns)
        # sleeping to be really visible on the resulting graph
        sleep(1)
        arrow_data.to_pandas()


if __name__ == "__main__":
    main()

[image: branch]

@lukapeschke
Collaborator Author

Good news, looks like we should be able to have lazy-by-ref once a new calamine version is out 🥳

Benchmarks with the latest version:

| iterations | owned | by ref |
|---|---|---|
| 1 | [image: lazy] | [image: eager] |
| 20 | [image: lazy_20] | [image: eager_20] |

@lukapeschke
Collaborator Author

calamine 0.25.0 should be released soon, meaning I should finally be able to finish this 🙂 tafia/calamine#435

@lukapeschke
Collaborator Author

lukapeschke commented Jun 18, 2024

latest measurements with this branch

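| iterations | master | this branch (lazy) | this branch (eager) |
|---|---|---|---|
| 1 | [image: master_1] | [image: lazy_1] | [image: eager_1] |
| 20 | [image: master_20] | [image: lazy_20] | [image: eager_20] |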

@lukapeschke lukapeschke marked this pull request as ready for review June 18, 2024 14:37
@lukapeschke lukapeschke requested a review from PrettyWood June 18, 2024 14:37
Signed-off-by: Luka Peschke <[email protected]>
Member

@PrettyWood PrettyWood left a comment


Overall looks great
Love the small refactorings done

@@ -165,9 +165,38 @@ def load_sheet(
schema_sample_rows=schema_sample_rows,
use_columns=use_columns,
dtypes=dtypes,
eager=False,
Member


We should probably improve the docstring here to mention "lazy load"?
Also, a question: don't we want to have the eager version by default?

Collaborator Author


I'd rather not have the eager version by default as I don't want to introduce a breaking change. But you're right, I'll improve the docstring 👍
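The non-breaking approach described here can be sketched roughly as follows. `Reader`, `LazySheet` and the private `_load` helper are hypothetical stand-ins, not fastexcel's actual classes; only the `eager=False` flag and the `load_sheet` / `load_sheet_eager` names come from this PR.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class LazySheet:
    """Stand-in for a sheet handle that converts its data only on demand."""

    name: str
    rows: list[list[str]]

    def to_rows(self) -> list[list[str]]:
        return [list(row) for row in self.rows]


class Reader:
    """Hypothetical reader used only to illustrate the API shape."""

    def __init__(self, sheets: dict[str, list[list[str]]]) -> None:
        self._sheets = sheets

    def _load(self, name: str, *, eager: bool) -> LazySheet | list[list[str]]:
        rows = self._sheets[name]
        if eager:
            # Eager path: materialise the data immediately.
            return [list(row) for row in rows]
        return LazySheet(name, rows)

    def load_sheet(self, name: str) -> LazySheet | list[list[str]]:
        # Existing entry point: still passes eager=False, so callers see no change.
        return self._load(name, eager=False)

    def load_sheet_eager(self, name: str) -> LazySheet | list[list[str]]:
        # New, opt-in entry point with a different return type.
        return self._load(name, eager=True)
```

Keeping the eager behaviour behind a separate function means existing `load_sheet` callers keep receiving the same type, while new code can opt in explicitly.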


use_columns: list[str] | list[int] | str | None = None,
dtypes: DTypeMap | None = None,
) -> pa.RecordBatch:
"""Loads a sheet eagerly by index or name.
Member


Only has an impact for xlsx since the other formats don't support lazy iteration

Collaborator Author


Will update


src/types/python/excelsheet/sheet_data.rs (resolved)
Labels
🦀 rust 🦀 Pull requests that edit Rust code enhancement New feature or request ✋ need review ✋ 🐍 python 🐍 Pull requests that edit Python code
2 participants