Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

poor performance relative to FlexiJoins in integer case #24

Open
ericphanson opened this issue Oct 12, 2022 · 7 comments
Open

poor performance relative to FlexiJoins in integer case #24

ericphanson opened this issue Oct 12, 2022 · 7 comments

Comments

@ericphanson
Copy link
Member

ericphanson commented Oct 12, 2022

I have the task of matching individual sample points (integers) into AlignedSpan sample regions. I have found that if I map to raw Ints and UnitRange{Int}s, I get much better performance with FlexiJoins than using Interval{Int, Closed, Closed} with DataFrameIntervals. This problem might be out-of-scope for DataFrameIntervals, but maybe it also indicates a place where perf could be improved.

For my actual task, I am having trouble using FlexiJoins due to compat issues (Flux v0.12 needs ArrayInterface 5, while FlexiJoins 0.1.23 is the latest release, and needs Static 0.7-something, which needs ArrayInterface v6... it's a whole mess). So if DataFrameIntervals could solve this more performantly without needing extra dependencies that introduce compat issues, that would be wonderful. For now I am using DataFrameIntervals despite the perf shortfall to avoid the compat problem.

Here is my MWE.

Versions:

FlexiJoins v0.1.23
Intervals v1.8.0
DataFrameIntervals v0.1.0

on Julia 1.8.2 (ubuntu).

Problem:

using DataFrames, Intervals, StableRNGs
using DataFrameIntervals, FlexiJoins

N_INTERVALS = 1000
N_POINTS = 10^5

function make_example(n_intervals, n_points)
    rng = StableRNG(123)
    points = [rand(rng, 1:10_000) for _ in 1:n_points]
    L = DataFrame(:point => points)
    transform!(L, :point => ByRow(p -> Interval{Int,Closed,Closed}(p, p)) => :interval)

    indices = map(1:n_intervals) do _
        i = rand(rng, 1:10_000)
        return i:i+100
    end
    R = DataFrame(:indices => indices)
    transform!(R, :indices => ByRow(I -> Interval{Int,Closed,Closed}(first(I), last(I))) => :interval)
    return L, R
end

L, R = make_example(N_INTERVALS, N_POINTS)
@time FlexiJoins.innerjoin((L, R), by_pred(:point, , :indices))
@time interval_join(L, R; on=:interval)

The FlexiJoin join takes 0.14 seconds, while the interval_join takes 22s, 157x slower.

I originally had N_INTERVALS = 3000 and N_POINTS = 10^6, for which FlexiJoins took 4-5s, and DataFrameIntervals was still running after 15 mins when I cancelled and reduced those values to make the MWE more minimal.

@aplavin
Copy link

aplavin commented Oct 18, 2022

I've cleaned up FlexiJoins dependencies last month, and now came to registering the updated version in General. In particular, it no longer depends on Static - should help with enviroments like yours that require an older version of it.

@haberdashPI
Copy link
Member

haberdashPI commented Oct 19, 2022

Just as an update as to the status of this issue: @ericphanson has a PR in Intervals.jl that improves the performance of find_intersections (which determines performance here) that will support the AlignedSpans case (or an Interval{Closed,Closed} type). I'll then port that to support all interval types, which should improve performance, in general, for DataFrameIntervals.

@haberdashPI
Copy link
Member

This is addressed by JuliaRegistries/General#85622

@aplavin
Copy link

aplavin commented Jun 15, 2023

Just tried the same benchmark with the updated Intervals version - it did become somewhat faster.
Results as-is:
FlexiJoins 0.140287 seconds (23.20 k allocations: 229.607 MiB, 7.97% gc time)
DataFrameIntervals 1.339070 seconds (4.07 M allocations: 7.914 GiB, 4.81% gc time)

Also note that FlexiJoins is used suboptimally here:

  • The interval is represented as a Julia range, that falls back to the generic iterable implementation
  • Non-native input and output formats - DataFrames require conversion internally

Using Interval (from IntervalSets) instead of range, and StructArray instead of DataFrame, I get
FlexiJoins (native) 0.056713 seconds (80 allocations: 21.091 MiB, 47.42% gc time)

For larger N_INTERVALS = 3000, N_POINTS = 10^6:
FlexiJoins (asis) 10.779393 seconds (93.14 k allocations: 6.174 GiB, 1.67% gc time)
FlexiJoins (native) 1.068480 seconds (92 allocations: 658.505 MiB, 0.42% gc time)
DataFrameIntervals 79.301766 seconds (120.87 M allocations: 180.618 GiB, 8.43% gc time)

Status `/tmp/jl_WWEqxB/Project.toml`
  [33b79e07] DataFrameIntervals v0.2.0 `https://github.com/beacon-biosignals/DataFrameIntervals.jl#main`
  [a93c6f00] DataFrames v1.5.0
  [e37f2e79] FlexiJoins v0.1.30
  [8197267c] IntervalSets v0.7.4
  [d8418881] Intervals v1.10.0
  [860ef19b] StableRNGs v1.0.0
  [44cfe95a] Pkg v1.9.0

@ericphanson
Copy link
Member Author

I've realized FlexiJoins doesn't have CI setup (it has github actions script on gitlab which doesn't do anything), so it doesn't fit Beacon's dependency requirements. It also doesn't support the Tables.jl interface which is kind of a non-starter for a tabular join, and its DataFrames support doesn't participate in semver. So I don't think it's a viable alternative to DataFrameIntervals, however it does show these operations could be a lot faster still.

@ericphanson ericphanson reopened this Jun 15, 2023
@aplavin
Copy link

aplavin commented Jun 15, 2023

FlexiJoins doesn't have CI setup

It does: the primary repo is (private) on github, gitlab is just the public mirror (motivation 1 2).
It's no coincidence that there's a github action script in the repo :)

its DataFrames support doesn't participate in semver

DF support is pretty "naive" for now, definitely suboptimal as seen even from the benchmarks above. Also see https://gitlab.com/aplavin/FlexiJoins.jl/-/issues/2. These are the major reasons why this functionality is considered experimental.

I'm not really familiar with DFs myself, so any help improving the integration is welcome. Then this part can also become semver-stable functionality.

It also doesn't support the Tables.jl interface

Indeed, the priority of FlexiJoins is to support the Base Julia interface - collections. Many (most) in-memory tables also follow this interface.
Any help with Tables.jl support would also be appreciated, but I don't have such plans myself.

@haberdashPI
Copy link
Member

haberdashPI commented Jun 15, 2023

Thanks @aplavin for running the benchmark. Super helpful to know!

I perhaps should have clarified when I closed the issue, that I simply meant the more optimized find_intersections method had been implemented. Perhaps I shouldn't have closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants