"Big-Data File" or "Database" based backend for very large spreadsheets #368
Comments
Maybe implement lazy iteration? Then applications can use any of the engines to save to disk. Two problems with it:
Yeah, true lazy iteration is going to be a challenge. I don't think it's even possible to do a truly lazy implementation. I've been looking into other libs to see what they are doing, but it's taking me a while to navigate them and piece together how they handle it. Even if we only allowed lazy iteration over cells, not rows or columns, the problem of having to scrape the shared strings and styles files doesn't go away.

Of the workload problems, I split them between wall-clock and memory constraints. For the wall-clock issue I have made some subsets that we could add in to help with some workloads. In the first subset, we load all the sheets and information into memory, parsing them in parallel. We represent the range fully, empty-value cells included. What you index into is what you would expect. This would be an

Another version of this could be to only parse specified sheets, so not all sheets go into memory, only the given ones. We could parse only the metadata of the sheets, like the name and index, and use that to initiate a full parse for the sheets passed in by the user. So a

For some potential memory savings, an optimization can be made when handling sparse data. The shared strings and sheets are all loaded into memory, but the backing container of the cells discards the empty-value cells, compacting all the data together. There would be like a

In both cases it's expected that the use case requires multiple sheets of data and random access to cells directly at one's immediate disposal.

For pure memory optimizations, the options are much more constrained. Even only opening the sheet that is needed has some very easily reachable limits when you take into account that the shared strings file still needs to be read. Beyond that, the next solution I can think of would go straight to the filesystem. Employing some of

The two extremes of opening everything into memory at once and writing everything onto disk at once cover the full range of the constraints I laid out. But none of these solutions make accessing a random location, like the first row, any faster; that constraint sits in the middle of the two I specified, but is also kind of its own thing. Lazy iteration improves speed, but really it's more like first-row speed. And memory remains low, but really it's just scoped to the iteration. It's not that you have an issue where you can't fit the data into memory and thus turn to it; the lower memory use is almost more of a byproduct, in my opinion, rather than the goal of lazy iteration. The real benefit is that you could access earlier info without needing to parse everything you won't use.

I am still working on a write-up towards a 1.0 API. I am adding to it every day when I have some time to work on it. I'm trying to flesh out some of the main challenges and at least provide a direction to go in. Lazy iteration, unfortunately, is something I'm not really closer to figuring out. I've only been programming for about a year (this project was actually my first OS contribution), so I don't have any experience to lean on for these things. I'll try to get it up sometime before the end of this month though.
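Purely as a sketch of the sparse-backing idea in the third subset above (the `SparseRange` name and its methods are my own invention, not calamine's API), a map keyed by (row, column) can drop the empty-value cells while still supporting random access and in-order iteration:

```rust
use std::collections::BTreeMap;

/// Hypothetical sparse backing container: only non-empty cells are stored,
/// keyed by (row, column), so fully empty regions cost no memory.
struct SparseRange<T> {
    cells: BTreeMap<(u32, u32), T>,
}

impl<T> SparseRange<T> {
    fn new() -> Self {
        SparseRange { cells: BTreeMap::new() }
    }

    /// Insert a parsed, non-empty cell value at (row, col).
    fn insert(&mut self, row: u32, col: u32, value: T) {
        self.cells.insert((row, col), value);
    }

    /// Random access: `None` means the cell was empty and never stored.
    fn get(&self, row: u32, col: u32) -> Option<&T> {
        self.cells.get(&(row, col))
    }

    /// Iterate one row in column order without materializing empty cells.
    fn row<'a>(&'a self, row: u32) -> impl Iterator<Item = (u32, &'a T)> + 'a {
        self.cells
            .range((row, 0)..=(row, u32::MAX))
            .map(|(&(_, col), value)| (col, value))
    }
}
```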
Thanks for writing this up. While I am clearly not against it (I find it interesting), I'm afraid it'll be very complex. I want to explore streaming iterators (by rows or cells) first, even if that also seems complex. I also have the feeling that very large excel files, while a reality, are pretty niche, as this type of workflow should have moved to other formats/databases by now (excel struggles to open these files).
For reference, the file used in the dataset has
In an ideal world (for xlsx at least, other file formats behave differently), there should be a "mode" where we can pull the sharedStrings directly from the zip, on demand. For instance we could scan it first and only store string indexes etc. It would then be much smaller.
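A rough sketch of that on-demand mode, assuming the `zip` crate (the function name and the offset-recording scheme are assumptions, not an existing calamine API): scan `xl/sharedStrings.xml` once, keep only the byte offset of each `<si>` entry, and re-read an individual string from the stream later when it is actually needed.

```rust
// Sketch only: record the byte offset of every <si> entry in the decompressed
// sharedStrings stream instead of keeping the strings themselves in memory.
use std::fs::File;
use std::io::{BufReader, Read};

fn scan_shared_string_offsets(path: &str) -> Result<Vec<u64>, Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    let mut archive = zip::ZipArchive::new(BufReader::new(file))?;
    let mut part = archive.by_name("xl/sharedStrings.xml")?;

    let mut offsets = Vec::new();
    let mut carry: Vec<u8> = Vec::new(); // bytes not yet fully searched
    let mut carry_start: u64 = 0; // absolute offset of carry[0] in the stream
    let mut buf = [0u8; 8 * 1024];

    loop {
        let n = part.read(&mut buf)?;
        if n == 0 {
            break;
        }
        carry.extend_from_slice(&buf[..n]);

        // Record the absolute offset of every "<si>" start tag seen so far.
        let mut from = 0;
        while let Some(i) = find(&carry[from..], b"<si>") {
            offsets.push(carry_start + (from + i) as u64);
            from += i + 4;
        }

        // Keep the last 3 bytes in case a tag straddles the chunk boundary.
        let keep_from = carry.len().saturating_sub(3);
        carry_start += keep_from as u64;
        carry.drain(..keep_from);
    }
    Ok(offsets)
}

fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}
```

Looking a string up later would mean re-opening the part and skipping ahead to the recorded offset, so this trades extra decompression work for a much smaller resident footprint.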
Closing this now for #370 which will bring support for lazy cell reading. It does not work for xls and ods files though, so feel free to reopen.
Building off an idea that `excelize` uses, we could save really large datasets onto disk as a file.

Big Data
Some options:
`pandas` has documentation on working with large datasets, and one of the suggested solutions is to do something similar.

It could be behind a feature flag like `storage`, so as to keep the implementation out of the public API. This would then change from building a structure in memory to writing a file in the OS's `temp` directory. Parsing the xml file probably takes longer than writing the file to disk, so performance would decrease compared to an in-memory-only solution, but if the data would never fit in memory in the first place, then having a file format optimized for this kind of task is worth it.

Some formats need to be saved all at once, meaning the data would need to be filled in memory before saving it. A workaround would be to save smaller chunks to file and then aggregate the data together when needed, as in the sketch below. This idea of 'chunking' is also mentioned often as a solution in online forums when working with large amounts of data.
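A minimal sketch of that chunking workaround: spill parsed rows to numbered files in the OS temp directory instead of holding everything in memory. The `ChunkWriter` type, the chunk size, the file naming, and the TSV layout are illustrative assumptions only, not anything calamine provides.

```rust
use std::fs::File;
use std::io::{BufWriter, Write};
use std::path::PathBuf;

const ROWS_PER_CHUNK: usize = 10_000; // assumed tuning knob

struct ChunkWriter {
    dir: PathBuf,
    chunk_index: usize,
    rows_in_chunk: usize,
    writer: Option<BufWriter<File>>,
}

impl ChunkWriter {
    fn new() -> Self {
        ChunkWriter {
            dir: std::env::temp_dir(),
            chunk_index: 0,
            rows_in_chunk: 0,
            writer: None,
        }
    }

    /// Append one row of already-parsed cell values, rolling over to a new
    /// chunk file whenever the current one reaches ROWS_PER_CHUNK rows.
    /// (Values containing tabs or newlines would need escaping in practice.)
    fn push_row(&mut self, row: &[String]) -> std::io::Result<()> {
        if self.writer.is_none() || self.rows_in_chunk == ROWS_PER_CHUNK {
            let path = self.dir.join(format!("calamine_chunk_{}.tsv", self.chunk_index));
            self.writer = Some(BufWriter::new(File::create(path)?));
            self.chunk_index += 1;
            self.rows_in_chunk = 0;
        }
        let writer = self.writer.as_mut().unwrap();
        writeln!(writer, "{}", row.join("\t"))?;
        self.rows_in_chunk += 1;
        Ok(())
    }
}
```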
If/when calamine offers writing and editing for the read-only formats, we could write these smaller chunks out as a new file with the new values and invalidate the old one. Alternatively, we could simply not offer a writing mechanism when this is the backing storage.
The files would most likely only store the values themselves, so other metadata for the cells would still have to be kept in memory, if wanted, and this could still fill up a lot of memory with things like cell fill color, font, border, format, etc. So in this case, a `styles` flag would be incompatible with a `storage` flag.

SQLite
Outside of the "Big Data" formats, something like `SQLite` could be worth looking into. A single `.db` file would store the data, updates can be done in place, and cell metadata could be stored in a column alongside the value. An index can be created over the row/column pair for increased performance on iteration.
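A minimal sketch of that idea, assuming the `rusqlite` crate; the table and index names are made up for illustration and are not an existing calamine feature:

```rust
use rusqlite::Connection;

fn open_cell_store(path: &str) -> rusqlite::Result<Connection> {
    let conn = Connection::open(path)?;
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS cells (
             sheet TEXT    NOT NULL,
             row   INTEGER NOT NULL,
             col   INTEGER NOT NULL,
             value TEXT
         );
         -- Index over the position so row-major iteration and random lookups
         -- do not need a full table scan.
         CREATE INDEX IF NOT EXISTS idx_cells_pos ON cells (sheet, row, col);",
    )?;
    Ok(conn)
}
```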
Questions
Something to be aware of is that most, if not all, of the formats have datatypes. Matching the cell values with the proper data type could be more challenging than it initially seems.
With SQLite as an example, the field would have to be `TEXT` to store all the values in the same field, and another field would need to be added to keep track of the type.

Another is that in a "worst case scenario" the `sharedStrings` file could be huge, and with the wrong kind of dataset, this alone could fill up memory. For the `SQLite` solution, there could just be a shared strings table with the ID as the index. The spreadsheet table would then have a foreign key pointing to the ID of the shared strings table. Though this would mean that we can't store all the values in the value column, as there would be values that don't reference the `sharedStrings` data. But with this figured out, it would allow asynchronous reads/writes, keeping the memory used to whatever the buffers are sized at. A sketch of this schema follows below.

Another database option could be surrealdb, which has more flexibility in how data is represented and "joined". It also comes with a file-backed solution.
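The shared-strings variant could look like the sketch below, again using `rusqlite` purely as an illustration. It is a fuller version of the `cells` table from the earlier sketch, with a type column and a nullable foreign key; all names are assumptions.

```rust
use rusqlite::Connection;

fn create_shared_strings_schema(conn: &Connection) -> rusqlite::Result<()> {
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS shared_strings (
             id     INTEGER PRIMARY KEY, -- index taken from xl/sharedStrings.xml
             string TEXT NOT NULL
         );
         -- `dtype` keeps the original cell type since `value` is always TEXT;
         -- numbers, booleans and inline strings fill `value`, while shared
         -- strings leave it NULL and set `string_id` instead.
         CREATE TABLE IF NOT EXISTS cells (
             sheet     TEXT    NOT NULL,
             row       INTEGER NOT NULL,
             col       INTEGER NOT NULL,
             dtype     TEXT    NOT NULL,
             value     TEXT,
             string_id INTEGER REFERENCES shared_strings(id)
         );",
    )?;
    Ok(())
}
```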
Of all these suggestions, I must say I gravitate towards the database options the most. They seem the simplest to implement and offer the most straightforward way to achieve flexibility while keeping up performance.
The last main point to keep in mind is what the most likely use case is for this big data. Whether it's processing, analytics, saving to a new format, or editing and resaving, each can require a different workload that would lend itself to a certain implementation. Trying to pin down the use case to design against would probably be a good first step, if that's possible.