Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add stream methods to IterableRealInterval #336

Merged
merged 32 commits into from
Oct 23, 2023
Merged

Add stream methods to IterableRealInterval #336

merged 32 commits into from
Oct 23, 2023

Conversation

tpietzsch
Copy link
Member

@tpietzsch tpietzsch commented Sep 12, 2023

This PR adds IterableRealInterval.stream() and .parallelStream() default methods to access the pixel values in an image as a Stream<T>.

The stream methods rely on a default implementation of IterableRealInterval.spliterator() backed by RealCursor.

Encounter order of the streams matches that of cursors, i.e. Views.flatIterable(img).stream() yields elements in flat iteration order.

Usage examples:

  • count the number of true pixels in a binary image
    Img<BitType> img = ...;
    long numberOfOnes = img.stream().filter(BitType::get).count();
  • find the maximum pixel value in a DoubleType image
    Img<DoubleType > img = ...;
    OptionalDouble maxValue = img.parallelStream().mapToDouble( DoubleType::get ).max();

Pitfalls

Note that the T elements of the stream are proxies and reused (as usual).
The RealCursorSpliterator implementation takes care that a new proxy is used for each split-off prefix, so parallelStream() works as expected.
However, explicit copying operations must be added, if stream elements are supposed to be retained (by stateful intermediate or terminal operations).

For example, to collect all DoubleType values between 0 and 1 into a list:

List< DoubleType > values = img.stream()
    .filter( t -> t.get() >= 0.0 && t.get() <= 1.0 )
    .map( DoubleType::copy ) // <-- this is important!
    .collect( Collectors.toList() );

The .map(DoubleType::copy) operation is necessary, otherwise the values list will contain many duplicates of the same DoubleType object (which may not even have to a value between 0 and 1). The copy could also be done before the .filter(...) operation, but it's better to do it as late as possible to avoid unnecessary creation of objects.

Performance

Initial benchmarks show that using streams (even without copying) is a lot slower than explicit for loops, for example.

Running the following benchmark

// Img<IntType> img = ArrayImgs.ints(2000, 2000);

@Benchmark
public long benchmarkForLoop() {
    long count = 0;
    for (IntType t : img) {
        if (t.get() > 127)
            ++count;
    }
    return count;
}

@Benchmark
public long benchmarkStream() {
    return img.stream().filter(t -> t.get() > 127).count();
}

results in

Benchmark                                     Mode  Cnt   Score   Error  Units
StreamBenchmark.benchmarkForLoop              avgt   20   4,537 ± 0,095  ms/op
StreamBenchmark.benchmarkStream               avgt   20  33,331 ± 0,193  ms/op

Not ideal... It may be possible to improve performance, but so far I didn't find anything that works.

However, I think this is anyway more a quality-of-life feature. (Like the RandomAccessible.getAt(...) convenience methods (#246) which I find myself using more often then I expected, despite the performance overhead.)

Ideas

There is more to explore in this direction.

  • Most importantly, it is of limiting to have only access to pixel values. It is more useful to also know pixel positions. For this, we should explore (additionally) providing Stream<LocalizableSampler<T>> instead of Stream<T>.
    The implementation would be simple, basically just use Cursor<T> instead of Cursor<T>.get() for the stream elements. However, the generics are a bit hairy.
  • Does it make sense to add ImgLibStream wrapper around java.util.stream.Stream?
    With this we could add additional operations, for example the previously mentioned .map(DoubleType::copy) could be .materialize(). We could decorate operations like .distinct() and .sort() to make sure that copies have been made before, etc.
  • Can specialized Spliterators improve performance over iterators, for example for CellImg?
  • Could there be any interesting interaction with the Views / Converters framework?

@tpietzsch tpietzsch marked this pull request as draft September 12, 2023 12:27
@tpietzsch tpietzsch changed the title Streams Add stream methods to IterableRealInterval Sep 12, 2023
@ctrueden
Copy link
Member

ctrueden commented Sep 12, 2023

@tpietzsch This is awesome! Too bad about the performance, but still good to have. 😀

About the ImgLibStream: I think this idea is almost necessary to do, because otherwise people will definitely bump into proxy-type-object-reuse-related bugs. I'm less convinced that you need a public class wrapper, though—it could instead be only an internal Stream subclass that overrides methods as appropriate while adding no new API. If we take care to override most/all of the potential pain points, the need for a method like materialize() becomes less. Are there other new API methods that occurred to you besides those you mentioned above?

For the localizable stream elements: I like this idea. The method could just be .localizingStream() for symmetry with localizingCursor(), eh? Although I guess we probably also want .localizingParallelStream() 🙄 ... But then as you say, the generics get tough. Maybe instead of baking it into the IterableRealInterval interface, some static utility methods would be easier? Like:

public static < T > Stream< RealCursor< T >> localizingStream( IterableRealInterval< T > iri ) { ... }
public static < T > Stream< Cursor< T >> localizingStream( IterableInterval< T > ii ) { ... }

This avoids the hairiness of incompatible return type of an overridden method in IterableInterval due to non-covariance.

And the code could read almost as nicely:

Img< DoubleType > myImg = ...;
List< Double > valuesPast123 = Streams.localizing( myImg )
    .filter( c -> c.getDoublePosition( 0 ) >  123.0 )
    .map( c -> c.get().getRealDouble() )
    .collect( Collectors.toList() );

@tpietzsch
Copy link
Member Author

tpietzsch commented Sep 12, 2023

@ctrueden I made separate issues for the wrapper classes #339, and the localizing streams #338, and replied there

RealPositionable is Positionable, so just forward to localize(Positionable)
instead of going dimension by dimension.
These are Spliterator<T>, that can additionally provide location of the
last T that was visited in tryAdvance().
…rator

The methods have default implementations that are backed by (Real)Cursor.
The default implementation goes dimension-by-dimension or forwards to
Positionable.setPosition(Localizable), which is also often implemented
to go dimension-by-dimension
Default implementation builds on localizingCursor()
@tpietzsch tpietzsch marked this pull request as ready for review October 23, 2023 09:22
@tpietzsch tpietzsch merged commit 1ff3d56 into master Oct 23, 2023
1 check passed
@tpietzsch tpietzsch deleted the stream branch September 7, 2024 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants