feat: introduce eager loading functions #147

lukapeschke · 2023-12-22T14:34:42Z

What

This introduces eager loading functions that make use of calamine's new DataTypeRef.

This prevents some allocations, resulting in a lower memory footprint.

Caveats

The API is kinda rough for now and will probably need some cleaning (I mostly wanted to check whether the memory gain was interesting here).
The functions need to be eager because DataTypeRef has an explicit lifetime, which PyO3 does not allow on classes exposed to Python (lifetimes are hard to enforce on the Python side: https://pyo3.rs/v0.20.0/class.html#no-lifetime-parameters).
For this to work, some changes are needed in calamine, and we don't know yet whether this is something the library maintainers had in mind. PR and discussion: tafia/calamine#390 (refactor: make DataTypeRef public and introduce a DataTypeTrait trait)

Gains

While the speed stays roughly the same (it was even 3-5% faster on my machine in several tests), the memory footprint decreases by almost 25%. This means that we're almost as good as pandas memory-wise 🥳 (they still beat us by a few MBs), while being about 10 times faster.

Before

After

Pandas
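The graphs themselves are not reproduced here. As a rough sketch of how a peak-memory comparison like the one above could be computed, the snippet below reads the process's peak resident set size via the standard-library resource module (Unix only) and derives the relative reduction between two measurements. The helper names peak_rss_mib and relative_reduction are hypothetical and not part of fastexcel, and this is not necessarily how the graphs in this PR were produced.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of the current process, in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kibibytes on Linux
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before
```

Calling peak_rss_mib() at the end of each benchmark script (one run per process) gives the two values to feed into relative_reduction.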

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke · 2024-02-20T12:56:04Z

Some work is still required in calamine: tafia/calamine#409

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke · 2024-02-26T13:13:49Z

Okay, well, I just noticed that the API changed, so we actually need to use worksheet_range_ref when Sheets is the Xlsx variant.

Signed-off-by: Luka Peschke <[email protected]>

PrettyWood · 2024-02-27T16:30:07Z

Glad to see tafia/calamine#409 has been merged. Hopefully we get a new release soon 👍


lukapeschke · 2024-02-28T11:23:26Z

new data

`main`

import argparse
from time import sleep
import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()


def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        arrow_data = excel_file.load_sheet_by_name(sheet_name, use_columns=use_columns).to_arrow()
        # sleeping to be really visible on the resulting graph
        sleep(1)
        arrow_data.to_pandas()


if __name__ == "__main__":
    main()

this branch

import argparse
from time import sleep
import fastexcel


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("file")
    parser.add_argument("-c", "--column", type=str, nargs="+", help="the columns to use")
    return parser.parse_args()


def main():
    args = get_args()
    excel_file = fastexcel.read_excel(args.file)
    use_columns = args.column or None

    for sheet_name in excel_file.sheet_names:
        arrow_data = excel_file.load_sheet_eager(sheet_name)
        # sleeping to be really visible on the resulting graph
        sleep(1)
        arrow_data.to_pandas()


if __name__ == "__main__":
    main()
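
The two scripts differ only in how the sheet is turned into Arrow data. A minimal sketch of that difference as a single helper, using only the calls that appear in the scripts above (load_sheet_by_name(...).to_arrow() and load_sheet_eager(...)). The helper name load_arrow and its eager flag are hypothetical and not part of fastexcel:

```python
from typing import Any


def load_arrow(excel_file: Any, sheet_name: str, *, eager: bool = False) -> Any:
    """Load one sheet as Arrow data, via the eager or the existing code path.

    `excel_file` is expected to be the object returned by fastexcel.read_excel.
    """
    if eager:
        # New path from this branch: returns the Arrow data directly
        return excel_file.load_sheet_eager(sheet_name)
    # Existing path: build a sheet object first, then convert it to Arrow
    return excel_file.load_sheet_by_name(sheet_name).to_arrow()
```

Keeping eager off by default mirrors the existing behaviour, so only callers that opt in take the new path.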

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke · 2024-03-04T21:23:07Z

Good news, looks like we should be able to have lazy-by-ref once a new calamine version is out 🥳

Benchmarks with the latest version:

iterations	owned	by ref
1
20

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke · 2024-05-22T08:04:30Z

calamine 0.25.0 should be released soon, meaning I should finally be able to finish this 🙂 tafia/calamine#435

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke · 2024-06-18T14:36:13Z

latest measurements with this branch

iterations	master	this branch (lazy)	this branch (eager)
1
20
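
For reference, an iteration-based comparison like the one in this table can be driven by a small timing helper. This is a sketch only: time_iterations is a hypothetical name, it is not the test.py from this branch, and it simply measures wall-clock time for any zero-argument callable.

```python
from time import perf_counter
from typing import Callable


def time_iterations(load: Callable[[], object], iterations: int) -> float:
    """Return the total wall-clock seconds spent calling `load` `iterations` times."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    start = perf_counter()
    for _ in range(iterations):
        load()
    return perf_counter() - start
```

For example, time_iterations(lambda: excel_file.load_sheet_eager(sheet_name), 20) would time 20 eager loads of one sheet, assuming excel_file and sheet_name are defined as in the scripts above.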

Signed-off-by: Luka Peschke <[email protected]>

PrettyWood

Overall looks great
Love the small refactorings done

PrettyWood · 2024-06-30T12:09:03Z

python/fastexcel/__init__.py

@@ -165,9 +165,38 @@ def load_sheet(
                schema_sample_rows=schema_sample_rows,
                use_columns=use_columns,
                dtypes=dtypes,
+                eager=False,


We should probably improve the docstring here ("lazy load")?
And a question: don't we want the eager version to be the default?

I'd rather not have the eager version by default as I don't want to introduce a breaking change. But you're right, I'll improve the docstring 👍

PrettyWood · 2024-06-30T12:18:16Z

python/fastexcel/__init__.py

+        use_columns: list[str] | list[int] | str | None = None,
+        dtypes: DTypeMap | None = None,
+    ) -> pa.RecordBatch:
+        """Loads a sheet eagerly by index or name.


Only has an impact for xlsx since the other formats don't support lazy iteration

Will update

src/types/python/excelsheet/sheet_data.rs

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke added 2 commits December 22, 2023 14:55

chore(deps): Upgrade calamine 0.22.1 -> 0.23.0

0a97e8d

Signed-off-by: Luka Peschke <[email protected]>

feat: introduce eager loading functions

5cc31be

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke added enhancement New feature or request 🔨 WIP 🔧 🦀 rust 🦀 Pull requests that edit Rust code labels Dec 22, 2023

lukapeschke self-assigned this Dec 22, 2023

lukapeschke added the 🐍 python 🐍 Pull requests that edit Python code label Dec 22, 2023

This was referenced Jan 24, 2024

chore(deps): bump calamine from 0.22.1 to 0.23.1 #145

Closed

feat(python): add "calamine" support to read_excel, using fastexcel (~8-10x speedup) pola-rs/polars#14000

Merged

lukapeschke added this to the v1.0.0 milestone Feb 9, 2024

PrettyWood modified the milestones: v1.0.0, v0.10.0 Feb 14, 2024

lukapeschke added 3 commits February 20, 2024 10:00

Merge branch 'main' into strings-by-ref-calamine

4db77ee

adapt to recent changes

b9ec9a2

Signed-off-by: Luka Peschke <[email protected]>

Merge branch 'main' into strings-by-ref-calamine

5026389

lukapeschke force-pushed the strings-by-ref-calamine branch from 05d5b8a to 5026389 Compare February 20, 2024 12:54

lukapeschke added 3 commits February 23, 2024 13:25

feat: added support for schema_sample_rows

15b52a1

Signed-off-by: Luka Peschke <[email protected]>

Merge branch 'main' into strings-by-ref-calamine

230a832

solve merge conflicts

6cd77ab

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke modified the milestones: v0.10.0, v1.0.0 Feb 27, 2024

lukapeschke added 2 commits February 27, 2024 16:51

Merge branch 'main' into strings-by-ref-calamine

1cde690

adapt to recent changes on main

d6548a4

Signed-off-by: Luka Peschke <[email protected]>

adapt to recent changes on main

d64dc03

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke added 2 commits February 28, 2024 12:24

Merge branch 'main' into strings-by-ref-calamine

a7e8175

adapt error message

6adca86

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke added 2 commits March 4, 2024 22:05

fat refactor, might support non-eager by-ref

3eb6ca3

Signed-off-by: Luka Peschke <[email protected]>

add iterations to test.py

6bf5fb1

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke added 6 commits March 4, 2024 22:24

remove unused file

c1c0990

Signed-off-by: Luka Peschke <[email protected]>

Merge branch 'main' into strings-by-ref-calamine

eb51afc

adapt to recent changes on main

bfb92fe

Signed-off-by: Luka Peschke <[email protected]>

fix: ensure eager=True always returns a RecordBatch

ad5f326

Signed-off-by: Luka Peschke <[email protected]>

remove commented out code

4bcc947

Signed-off-by: Luka Peschke <[email protected]>

simplify lifetime annotations

50d518d

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke mentioned this pull request May 27, 2024

chore(deps): bump calamine from 0.24.0 to 0.25.0 in the prod-deps group #237

Merged

lukapeschke added 4 commits May 27, 2024 17:35

Merge branch 'main' into strings-by-ref-calamine

9089378

adapt code to recent changes

6b67cfc

Signed-off-by: Luka Peschke <[email protected]>

remove dbg!

a7b8665

Signed-off-by: Luka Peschke <[email protected]>

Merge branch 'main' into strings-by-ref-calamine

cc9588b

lukapeschke marked this pull request as ready for review June 18, 2024 14:37

lukapeschke requested a review from PrettyWood June 18, 2024 14:37

lukapeschke added 2 commits June 18, 2024 16:41

fix typing

6ed47ce

Signed-off-by: Luka Peschke <[email protected]>

chore: clippy rust 1.79

01dfb76

Signed-off-by: Luka Peschke <[email protected]>

lukapeschke added ✋ need review ✋ and removed 🔨 WIP 🔧 labels Jun 18, 2024

lukapeschke mentioned this pull request Jun 23, 2024

chore(deps): bump pyo3 0.20.3 -> 0.21.2 #241

Merged

PrettyWood approved these changes Jun 30, 2024


lukapeschke added 2 commits July 1, 2024 10:40

docs: improve docstrings

371b2ca

Signed-off-by: Luka Peschke <[email protected]>

Merge branch 'main' into strings-by-ref-calamine

9e92efd

lukapeschke merged commit 2147bb5 into main Jul 1, 2024
22 checks passed

lukapeschke deleted the strings-by-ref-calamine branch July 1, 2024 08:47

PrettyWood mentioned this pull request Jul 17, 2024

When coercing columns to strings, boolean cells turn into null #250

Closed

lukapeschke mentioned this pull request Jul 19, 2024

feat(python): Optimise read_excel when using "calamine" engine with the latest fastexcel pola-rs/polars#17735

Merged
