Some thoughts about data tree design #9
Comments
That's a good suggestion! Yes, it's only called file trees because it uses paths. There's a DataTrees.jl package already haha 😅
You mean the files are already pretty equally distributed in size (and work)? Yeah, that's true in edge cases like one big CSV file. Dagger can handle irregularity in workload and avoid starving workers, so it's not so bad. But I see your iterators idea! That's cool! I'm excited to see what you make. Okay yes, the DataSets.jl idea is clearer to me now! Thanks for that. I really feel it would be nice to have generically typed indices.
Oh 😬. But then again, it seems to have only four commits and not be registered, which I guess means it's essentially abandoned. So the name may be available after all :)
Looks like I've been inadvertently doing a lot of the same things as FileTrees.jl, just completely outside the context of filesystems. In AxisSets.jl, I'm storing an associative of paths (
I think if the
That sounds good. What would be the
The way I've been thinking about this in DataSets.jl:
These rules mean that such trees are not exactly like AbstractDict, because that iterates key-value pairs. However, I believe value-iteration is just a lot better for data-driven work than iterating key-value pairs by default, and Dictionaries.jl gets this right. In addition, you can add paths to the mix
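A minimal sketch of the value-iteration idea, using only Base Julia. `ValueTree` is a hypothetical name made up for illustration; it is not a type from DataSets.jl, FileTrees.jl, or Dictionaries.jl, and the real designs may differ:

```julia
# Hypothetical illustration only: a flat "tree" whose iteration yields values,
# in contrast to AbstractDict, whose iteration yields key => value pairs.
struct ValueTree{K,V}
    children::Dict{K,V}
end

# Iterate over values by delegating to the values of the wrapped Dict.
Base.iterate(t::ValueTree, state...) = iterate(values(t.children), state...)
Base.length(t::ValueTree) = length(t.children)
Base.eltype(::Type{ValueTree{K,V}}) where {K,V} = V

# Keys and key-value pairs stay available, but only on request.
Base.keys(t::ValueTree) = keys(t.children)
Base.pairs(t::ValueTree) = pairs(t.children)
Base.getindex(t::ValueTree, k) = t.children[k]

d = Dict("a.csv" => 10, "b.csv" => 20)
t = ValueTree(d)

total_tree = sum(t)                     # reductions work directly on the values
total_dict = sum(v for (k, v) in d)     # a plain Dict needs destructuring first
```

With this shape, generic data-processing code (`sum`, `map`, `filter`) sees the data itself, and the names are reached through `keys` or `pairs` when they matter.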
Hi Shashi, it was nice to chat about this!
I had some thoughts about the design and how it relates to what I've been thinking about:
DataTree would be a more descriptive name. I almost called one of my own types this in my prototype! But in the end I settled on FileTree because the thing I've written so far is a lazy view which is explicitly backed by the filesystem rather than being reflected in memory.
open them, yielding a Julia type which can read them lazily. Then we would have a whole family: S3Tree, ZipTree, etc., with the same basic interface.
stat info could be represented)
In general, I think we're building something related but largely complementary: in DataSets.jl I'm focusing on how one lazily reads the data index and data "from disk" (or other static location). I want to declaratively define such data locations and systematically turn that config into Julia objects the user can work with in their program, and to have a system to move such data between storage backends, etc. (Of course, DataSets.jl isn't restricted to trees. In principle the same ideas apply to the many tabular data formats, and to data we'd often consider as a "single file", e.g. large images or other multidimensional arrays.)