Worksheet Reading 3-4x Slower Than Baseline Of Underlying quick_xml Library
#372
Comments
Thanks! This is indeed much more than what I had in mind. Out of curiosity, could you measure
And thanks for opening this issue!
Ok, sorry for the long delay, life has been hectic. I started to work on a near-from-scratch branch to get a better feel for the full code base. So far I have been able to get a basic working implementation when the program saving the spreadsheet actually makes an effort to follow the specification (looking at you OnlyOffice). Starting from scratch allowed me to more easily test performance changes. Using the dataset we have been working against, a

The work is all bundled up in a

I am not testing memory usage right now as the style information for cells and text takes up a lot more than just the value in

This is the profile of the

This is the most recent profile I took of

The

The current implementation fills the Spreadsheet with empty Cells, as would be represented in the actual spreadsheet. This takes time but brings other benefits, and given that it dropped to 13 seconds from 22, I think it can be worked with for now. The values of the cells are all stored in a small-size-optimized string. This cleans up the API for the cell value to be just

The rewrite also restructures the project a bit, or at least starts it. With

Other API changes include cleaning up naming and iterators

use calamine::wip::Workbook;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = r#"NYC_311_SR_2010-2020-sample-1M.xlsx"#;
    // An enum that selects which workbook implementation to use based on the extension.
    // This mimics current `open_workbook` functionality.
    let mut workbook = Workbook::open(path)?;
    // This lets you iterate over the worksheet names without parsing the sheets.
    // This is used in nushell to get the names fast in the current API.
    for worksheet in workbook.worksheets() {
        println!("{}", worksheet.name());
    }
    // Option<Result<Worksheet<'_>>>
    // Parses the full worksheet up front.
    let worksheet = workbook.worksheet("Sheet1")?.unwrap();
    // Row<'_> Rows<'_>
    for row in worksheet.rows() {
        // If the column is out of range it returns `None`.
        let a = row.column(0).unwrap();
        let b = row.column(1);
        // The value can be `None` if it's an empty cell.
        let Some(name) = a.value() else {
            break;
        };
        let location = b.unwrap();
        println!(
            "| {}:`{}` | {} |",
            name,
            // font has `name`, `size`, `argb` and `rgb` getters. Returns as `&str`
            a.font().rgb(),
            location.value().unwrap(),
        );
    }
    Ok(())
}

This is still very incomplete and NOT meant to be merged, only here for reference and experimentation. But I would love to see what progress can be made going forward.
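As an aside on the small-size-optimized string mentioned above, here is a minimal sketch of the general idea: short values live inline in the value itself and only longer ones allocate. `SmallStr`, `INLINE_CAP`, and the 22-byte capacity are assumptions made up for illustration; this is not the type the branch actually uses.

```rust
// Hypothetical inline capacity, chosen only for illustration.
const INLINE_CAP: usize = 22;

/// A toy small-string type: short values are stored inline, long ones on the heap.
#[derive(Debug, Clone)]
enum SmallStr {
    Inline { len: u8, buf: [u8; INLINE_CAP] },
    Heap(String),
}

impl SmallStr {
    fn new(s: &str) -> Self {
        if s.len() <= INLINE_CAP {
            // Copy the whole string, so the stored bytes are always valid UTF-8.
            let mut buf = [0u8; INLINE_CAP];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            SmallStr::Inline { len: s.len() as u8, buf }
        } else {
            SmallStr::Heap(s.to_string())
        }
    }

    fn as_str(&self) -> &str {
        match self {
            SmallStr::Inline { len, buf } => std::str::from_utf8(&buf[..*len as usize])
                .expect("inline bytes were copied from a valid &str"),
            SmallStr::Heap(s) => s.as_str(),
        }
    }

    fn is_inline(&self) -> bool {
        matches!(self, SmallStr::Inline { .. })
    }
}

fn main() {
    let short = SmallStr::new("Brooklyn");
    let long = SmallStr::new("a much longer free-text description of a complaint");
    println!("{} (inline: {})", short.as_str(), short.is_inline());
    println!("{} (inline: {})", long.as_str(), long.is_inline());
}
```

Most cell values in a typical sheet are short, so a layout like this can avoid one allocation per cell; whether it pays off here would need to be measured on the real dataset.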
Thanks for the feedback!
Hard to say for sure to be honest. I think it could be a combination of a lot of different things as the current profile of the I did try
Currently there is a
This is something I was also grappling with how to approach. If the end goal also supports writes, then there needs to be an underlying cell to write to. So having the "canvas" on which to "draw" seems like a must for an implementation. At least when using a

My currently settled-on idea is to just default to a full canvas representation, which will also support cell mutation and random position insert, using

As for the API, it could either be a Workbook::open_lazy

There could be a
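To make the trade-off concrete, here is a rough sketch of what a full canvas with cell mutation and random position insert could look like. `Canvas` and its methods are hypothetical names for illustration, not the branch's actual types, and a plain `Option<String>` stands in for a real cell.

```rust
/// Hypothetical dense "canvas": every position in the used range has a slot,
/// including empty cells (stored as `None`).
#[derive(Debug)]
struct Canvas {
    width: usize,
    height: usize,
    cells: Vec<Option<String>>, // row-major storage
}

impl Canvas {
    fn new(width: usize, height: usize) -> Self {
        Canvas { width, height, cells: vec![None; width * height] }
    }

    /// Returns `None` both for empty cells and for positions outside the canvas.
    fn get(&self, row: usize, col: usize) -> Option<&str> {
        if row >= self.height || col >= self.width {
            return None;
        }
        self.cells[row * self.width + col].as_deref()
    }

    /// Writes a value at any position, growing the canvas when needed.
    fn set(&mut self, row: usize, col: usize, value: &str) {
        if row >= self.height || col >= self.width {
            let new_width = self.width.max(col + 1);
            let new_height = self.height.max(row + 1);
            self.grow(new_width, new_height);
        }
        self.cells[row * self.width + col] = Some(value.to_string());
    }

    /// Re-lays out the existing cells into a larger row-major buffer.
    fn grow(&mut self, new_width: usize, new_height: usize) {
        let mut cells = vec![None; new_width * new_height];
        for row in 0..self.height {
            for col in 0..self.width {
                cells[row * new_width + col] = self.cells[row * self.width + col].take();
            }
        }
        self.width = new_width;
        self.height = new_height;
        self.cells = cells;
    }
}

fn main() {
    let mut sheet = Canvas::new(2, 2);
    sheet.set(0, 0, "name");
    sheet.set(3, 4, "late insert"); // outside the original 2x2 range
    println!("{:?} {:?}", sheet.get(0, 0), sheet.get(3, 4));
    println!("canvas is now {}x{}", sheet.width, sheet.height);
}
```

The cost is visible in `new` and `grow`: every empty cell still occupies a slot, which is the time and memory a lazy mode would avoid in exchange for not supporting writes.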
One issue I haven't been able to work out how to handle is merged cells. In the

I'm sure there will be a lot of small things that come up, but the real test for the abstraction will be when I do another format, like
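For merged cells, one possible direction (an assumption for discussion, not something the branch does) is to keep the merged ranges in a side list and resolve any position inside a range to its top-left cell, which is where a merged range's value is stored in XLSX. `MergedRange` and `resolve` are hypothetical names.

```rust
/// Hypothetical merged range, using zero-based inclusive bounds.
#[derive(Debug, Clone, Copy, PartialEq)]
struct MergedRange {
    first_row: usize,
    first_col: usize,
    last_row: usize,
    last_col: usize,
}

impl MergedRange {
    fn contains(&self, row: usize, col: usize) -> bool {
        (self.first_row..=self.last_row).contains(&row)
            && (self.first_col..=self.last_col).contains(&col)
    }

    /// The top-left cell, which holds the value for the whole range.
    fn anchor(&self) -> (usize, usize) {
        (self.first_row, self.first_col)
    }
}

/// Maps a position to the cell that actually stores its value.
/// Positions outside every merged range map to themselves.
fn resolve(merges: &[MergedRange], row: usize, col: usize) -> (usize, usize) {
    merges
        .iter()
        .find(|m| m.contains(row, col))
        .map(|m| m.anchor())
        .unwrap_or((row, col))
}

fn main() {
    // Equivalent to a merge of A1:C2.
    let merges = [MergedRange { first_row: 0, first_col: 0, last_row: 1, last_col: 2 }];
    println!("{:?}", resolve(&merges, 1, 2)); // inside the merge
    println!("{:?}", resolve(&merges, 2, 0)); // outside the merge
}
```

A linear scan is fine for a handful of merges; a sheet with many merged ranges would likely want an index, and writes into a merged range raise separate questions this sketch does not address.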
When looking into the worksheet read time, based on early conversations here, and seeing that it took up 22 seconds out of a ~25 second operation, I began some testing of the underlying library, quick_xml, and saw that the raw worksheet XML file, a 1.3GB on-disk file, took ~6 seconds to read through when doing "nothing". With this time as the baseline that we build on when doing things with the parsed events, the jump to 22 seconds suggests there could be some big wins to be had.

I'm starting this issue as a thread specifically for optimizing the worksheet reading time. As seen in the issue I created for quick_xml, we could benefit from that getting an #[inline] coating there, but I want to look more into what is done in this crate for improvements in design and implementation. Given the percentage of time this takes currently, 85%+, any improvement will have a noticeable effect on the overall operation.
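The figures quoted in this issue can be checked with a little arithmetic. A minimal sketch; the function names are made up for illustration, and the inputs are the approximate timings given above (~6 s baseline, 22 s worksheet read, ~25 s total).

```rust
/// How many times slower the full worksheet read is than the raw XML pass.
fn slowdown(full_secs: f64, baseline_secs: f64) -> f64 {
    full_secs / baseline_secs
}

/// Fraction of the whole operation spent in one phase.
fn share(phase_secs: f64, total_secs: f64) -> f64 {
    phase_secs / total_secs
}

fn main() {
    let factor = slowdown(22.0, 6.0);
    let fraction = share(22.0, 25.0);
    println!("worksheet read is {:.2}x the quick_xml baseline", factor);
    println!("worksheet read is {:.0}% of the whole operation", fraction * 100.0);
}
```

That gives roughly 3.7x over the baseline and 88% of the total, which matches the "3-4x" in the title and the "85%+" above.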