Implement `HOST_UDF` aggregation for reduction and segmented reduction #17645

ttnghia · 2024-12-20T19:25:48Z

Following #17592, this enables HOST_UDF aggregation in reduction and segmented reduction, so that a host-side user-defined function (UDF) can be executed through the libcudf aggregation framework.

Closes #16633.

Signed-off-by: Nghia Truong <[email protected]>

PointKernel · 2024-12-20T22:05:50Z

cpp/include/cudf/aggregation/host_udf.hpp

+    template <typename T,
+              CUDF_ENABLE_IF(std::is_same_v<T, reduction_data_attribute> ||
+                             std::is_same_v<T, segmented_reduction_data_attribute> ||
+                             std::is_same_v<T, groupby_data_attribute>)>
    data_attribute(T value_) : value{value_}
    {


Using a static_assert instead of SFINAE might offer clearer information in cases of misuse of this constructor, for example:

    template <typename T>
    data_attribute(T value_) : value{value_}
    {
      static_assert(std::is_same_v<T, reduction_data_attribute> ||
                      std::is_same_v<T, segmented_reduction_data_attribute> ||
                      std::is_same_v<T, groupby_data_attribute>,
                    "Unsupported aggregation data attribute");
    }
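To make the trade-off between the two approaches concrete, here is a minimal, dependency-free sketch. The enum names (`reduction_attr`, `segmented_attr`, `groupby_attr`) and the structs `attr_sfinae` and `attr_assert` are hypothetical stand-ins, not the types in `host_udf.hpp`, and plain `std::enable_if_t` is used in place of `CUDF_ENABLE_IF`. It shows one difference worth weighing: the SFINAE constructor drops out of overload resolution, so type traits report the misuse, while the `static_assert` constructor still looks viable to traits and only fails (with the custom message) once its body is instantiated.

```cpp
#include <cassert>
#include <type_traits>
#include <variant>

// Hypothetical stand-ins for the three attribute enums (assumed names).
enum class reduction_attr { INPUT_VALUES };
enum class segmented_attr { OFFSETS };
enum class groupby_attr { GROUPED_VALUES };

using attr_variant = std::variant<reduction_attr, segmented_attr, groupby_attr>;

template <typename T>
inline constexpr bool is_supported_v = std::is_same_v<T, reduction_attr> ||
                                       std::is_same_v<T, segmented_attr> ||
                                       std::is_same_v<T, groupby_attr>;

// Variant 1: SFINAE. Unsupported types remove the constructor from overload resolution.
struct attr_sfinae {
  template <typename T, std::enable_if_t<is_supported_v<T>, int> = 0>
  attr_sfinae(T v) : value{v} {}
  attr_variant value;
};

// Variant 2: static_assert. The constructor accepts any T at the declaration level;
// misuse is rejected with a readable message when the body is instantiated.
struct attr_assert {
  template <typename T>
  attr_assert(T v) : value{v}
  {
    static_assert(is_supported_v<T>, "Unsupported aggregation data attribute");
  }
  attr_variant value;
};

// Traits can detect misuse of the SFINAE version...
static_assert(std::is_constructible_v<attr_sfinae, groupby_attr>);
static_assert(!std::is_constructible_v<attr_sfinae, int>);
// ...but not of the static_assert version, whose body is never instantiated here.
static_assert(std::is_constructible_v<attr_assert, int>);

// Valid use behaves identically for both variants.
inline void demo_usage()
{
  attr_sfinae a{groupby_attr::GROUPED_VALUES};
  attr_assert b{reduction_attr::INPUT_VALUES};
  assert(std::holds_alternative<groupby_attr>(a.value));
  assert(std::holds_alternative<reduction_attr>(b.value));
}
```

Actually writing `attr_assert x{42};` would stop compilation with "Unsupported aggregation data attribute", which is the clearer diagnostic the suggestion is after. The cost is that generic code can no longer probe constructibility.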

PointKernel · 2024-12-20T22:47:38Z

cpp/include/cudf/aggregation/host_udf.hpp

+  using input_data_t = std::variant<column_view,
+                                    data_type,
+                                    std::optional<std::reference_wrapper<scalar const>>,
+                                    null_policy,
+                                    size_type,
+                                    device_span<size_type const>>;


When host UDF was used exclusively for groupby, the input data types were still manageable. Now that reduction and segmented reduction share the same variant, they have become hard to control. This was also the primary difficulty I encountered when trying to understand the use of host UDF in the initial version of your design. For example, the implicit connection between the optional input scalar and the initialization value of a reduction is hard to follow.

Now looking back, I think the core issue lies in using an enum to represent the data attributes required for aggregation. For example, an enum with values like GROUPED_VALUES and GROUP_OFFSETS should signify that only one of these options is valid in a given context. However, the fact that your documentation example requires both for a custom UDF violates the fundamental purpose of an enum, which is to define mutually exclusive states. Also, data_attribute is too vague, as it could refer to a (partial) groupby, a (partial) reduction, a (partial) segmented reduction, or intermediate aggregations.

How about making two tiers of base classes?

    // For type erasure: storing `std::unique_ptr<host_udf_base>` by the same way in the
    // internal implementation class
    struct host_udf_base {};

    // define data attributes, required functions etc.
    struct reduction_host_udf_base : host_udf_base {};
    struct segmented_reduction_host_udf_base : host_udf_base {};
    struct groupby_host_udf_base : host_udf_base {};

Users will subclass one of these three base classes, depending on the kind of aggregation they implement. These base classes will not mix their data at all.
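A rough sketch of how such a two-tier hierarchy could look is given below. Only the four base class names come from the proposal above; everything else is an assumption made for illustration: `values_t` stands in for `column_view`, the `operator()` signatures are invented, and `run_reduction`, `sum_reduction_udf`, and `sum_segmented_udf` are hypothetical. It is not the libcudf implementation.

```cpp
#include <cstddef>
#include <memory>
#include <numeric>
#include <optional>
#include <stdexcept>
#include <vector>

// Stand-in for cudf::column_view so the sketch has no dependencies (assumption).
using values_t = std::vector<int>;

// Tier 1: empty root used only for type erasure via std::unique_ptr<host_udf_base>.
struct host_udf_base {
  virtual ~host_udf_base() = default;
};

// Tier 2: one base per aggregation kind, each with its own explicit inputs.
// The signatures below are hypothetical.
struct reduction_host_udf_base : host_udf_base {
  virtual int operator()(values_t const& input, std::optional<int> init) const = 0;
};

struct segmented_reduction_host_udf_base : host_udf_base {
  virtual std::vector<int> operator()(values_t const& input,
                                      std::vector<int> const& offsets,
                                      std::optional<int> init) const = 0;
};

struct groupby_host_udf_base : host_udf_base {
  virtual std::vector<int> operator()(values_t const& grouped_values,
                                      std::vector<int> const& group_offsets) const = 0;
};

// User code: a sum reduction that sees only reduction inputs.
struct sum_reduction_udf final : reduction_host_udf_base {
  int operator()(values_t const& input, std::optional<int> init) const override
  {
    return std::accumulate(input.begin(), input.end(), init.value_or(0));
  }
};

// User code: a per-segment sum; segment i spans [offsets[i], offsets[i + 1]).
struct sum_segmented_udf final : segmented_reduction_host_udf_base {
  std::vector<int> operator()(values_t const& input,
                              std::vector<int> const& offsets,
                              std::optional<int> init) const override
  {
    std::vector<int> out;
    for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
      out.push_back(std::accumulate(
        input.begin() + offsets[i], input.begin() + offsets[i + 1], init.value_or(0)));
    }
    return out;
  }
};

// Framework side: recover the concrete tier from the type-erased pointer.
inline int run_reduction(std::unique_ptr<host_udf_base> const& udf,
                         values_t const& input,
                         std::optional<int> init = std::nullopt)
{
  auto const* impl = dynamic_cast<reduction_host_udf_base const*>(udf.get());
  if (impl == nullptr) { throw std::invalid_argument("UDF is not a reduction host UDF"); }
  return (*impl)(input, init);
}
```

With this shape, the link between the optional scalar and the reduction's initial value is visible in the signature rather than implied by a variant alternative, and passing a segmented or groupby UDF to a reduction is rejected at the dispatch point.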

ttnghia added 3 commits December 20, 2024 10:48

Implement HOST_UDF aggregation for reduction and segmented reduction

2da9137

Signed-off-by: Nghia Truong <[email protected]>

Merge branch 'branch-25.02' into host_udf_reduction

fca442a

Fix example

4b5aa95

Signed-off-by: Nghia Truong <[email protected]>

ttnghia added the feature request, 3 - Ready for Review, libcudf, Java, Spark, and non-breaking labels Dec 20, 2024

ttnghia requested review from PointKernel and davidwendt December 20, 2024 19:25

ttnghia self-assigned this Dec 20, 2024

ttnghia requested review from a team as code owners December 20, 2024 19:25

ttnghia mentioned this pull request Dec 20, 2024

Hyper log log plus plus(HLL++) NVIDIA/spark-rapids-jni#2522

Open

Update docs

5982f3e

Signed-off-by: Nghia Truong <[email protected]>

PointKernel reviewed Dec 20, 2024
