poor performance relative to FlexiJoins in integer case #24
Comments
I cleaned up the FlexiJoins dependencies last month, and have now come to registering the updated version in |
Just as an update as to the status of this issue: @ericphanson has a PR in |
This is addressed by JuliaRegistries/General#85622 |
Just tried the same benchmark with the updated Intervals version - it did become somewhat faster. Also note that FlexiJoins is used suboptimally here:
Using For larger N_INTERVALS = 3000, N_POINTS = 10^6:
|
I've realized FlexiJoins doesn't have CI set up (it has a GitHub Actions script on GitLab, which doesn't do anything), so it doesn't fit Beacon's dependency requirements. It also doesn't support the Tables.jl interface, which is kind of a non-starter for a tabular join, and its DataFrames support doesn't participate in semver. So I don't think it's a viable alternative to DataFrameIntervals; however, it does show these operations could be a lot faster still. |
It does: the primary repo is on GitHub (private), and GitLab is just the public mirror (motivation 1 2).
DF support is pretty "naive" for now, definitely suboptimal as seen even from the benchmarks above. Also see https://gitlab.com/aplavin/FlexiJoins.jl/-/issues/2. These are the major reasons why this functionality is considered experimental. I'm not really familiar with DFs myself, so any help improving the integration is welcome. Then this part can also become semver-stable functionality.
Indeed, the priority of FlexiJoins is to support the Base Julia interface - collections. Many (most) in-memory tables also follow this interface. |
Thanks @aplavin for running the benchmark. Super helpful to know! I perhaps should have clarified when I closed the issue, that I simply meant the more optimized |
I have the task of matching individual sample points (integers) into AlignedSpan sample regions. I have found that if I map to raw `Int`s and `UnitRange{Int}`s, I get much better performance with FlexiJoins than using `Interval{Int, Closed, Closed}` with DataFrameIntervals. This problem might be out-of-scope for DataFrameIntervals, but maybe it also indicates a place where perf could be improved.
For my actual task, I am having trouble using FlexiJoins due to compat issues (Flux v0.12 needs ArrayInterface 5, while FlexiJoins 0.1.23 is the latest release, and needs Static 0.7-something, which needs ArrayInterface v6... it's a whole mess). So if DataFrameIntervals could solve this more performantly without needing extra dependencies that introduce compat issues, that would be wonderful. For now I am using DataFrameIntervals despite the perf shortfall to avoid the compat problem.
Here is my MWE.
Versions:
on Julia 1.8.2 (ubuntu).
Problem:
The FlexiJoins join takes 0.14 seconds, while the `interval_join` takes 22s, 157x slower.
I originally had `N_INTERVALS = 3000` and `N_POINTS = 10^6`, for which FlexiJoins took 4-5s, and DataFrameIntervals was still running after 15 mins when I cancelled and reduced those values to make the MWE more minimal.
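For reference, the matching task described in this issue (integer sample points against closed integer ranges) can be sketched in Base Julia alone, with no extra dependencies. This is only an illustrative sketch: `match_points` is a hypothetical helper name, it is not how FlexiJoins or DataFrameIntervals implement their joins, and the sample data is made up. It sorts the points once and then uses binary search per range, so the cost is roughly O((N + M) log N) plus the number of matches, rather than comparing every point with every range.

```julia
# Hypothetical helper, not part of FlexiJoins or DataFrameIntervals.
# Returns (point_index, interval_index) pairs for every point that lies
# inside a closed integer range.
function match_points(points::AbstractVector{<:Integer},
                      intervals::AbstractVector{<:AbstractUnitRange{<:Integer}})
    order = sortperm(points)      # original indices of the points, in ascending order
    sorted = points[order]        # the points themselves, sorted
    matches = Tuple{Int,Int}[]
    for (j, r) in enumerate(intervals)
        isempty(r) && continue
        lo = searchsortedfirst(sorted, first(r))  # first sorted point >= first(r)
        hi = searchsortedlast(sorted, last(r))    # last sorted point <= last(r)
        for k in lo:hi                            # empty when no point falls in r
            push!(matches, (order[k], j))
        end
    end
    return matches
end

# Made-up example data
points = [5, 12, 7, 30]
intervals = [1:10, 6:12, 20:25]
matches = match_points(points, intervals)
```

The result pairs refer back to positions in the original `points` and `intervals` vectors, so they can be used to index into the corresponding table rows afterwards.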