Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement HOST_UDF aggregation for reduction and segmented reduction #17645

Open
wants to merge 4 commits into
base: branch-25.02
Choose a base branch
from

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Dec 20, 2024

Following #17592, this enables HOST_UDF aggregation in reduction and segmented reduction, allowing to execute a host-side user-defined function (UDF) through libcudf aggregation framework.

Closes #16633.

@ttnghia ttnghia added feature request New feature or request 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. Java Affects Java cuDF API. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Dec 20, 2024
@ttnghia ttnghia self-assigned this Dec 20, 2024
@ttnghia ttnghia requested review from a team as code owners December 20, 2024 19:25
Signed-off-by: Nghia Truong <[email protected]>
Comment on lines +168 to 173
template <typename T,
CUDF_ENABLE_IF(std::is_same_v<T, reduction_data_attribute> ||
std::is_same_v<T, segmented_reduction_data_attribute> ||
std::is_same_v<T, groupby_data_attribute>)>
data_attribute(T value_) : value{value_}
{
Copy link
Member

@PointKernel PointKernel Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a static_assert instead of SFINAE might offer clearer information in cases of misuse of this constructor, for example:

template<typename T>
data_attribute (T value_) : value {value_} {
  static_assert(std::is_same_v<T, reduction_data_attribute> ||
                std::is_same_v<T, segmented_reduction_data_attribute> ||
                std::is_same_v<T, groupby_data_attribute>,
                "Unsupported aggregation data attribute");
}

Comment on lines +249 to +254
using input_data_t = std::variant<column_view,
data_type,
std::optional<std::reference_wrapper<scalar const>>,
null_policy,
size_type,
device_span<size_type const>>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When used exclusively for groupby, the input data types are still manageable. However, they have now become somewhat challenging to control. This was also the primary difficulty I encountered when trying to understand the use of host UDF in the initial version of your design. For example, the implicit connection between the optional input scalar and the initialization value of a reduction is ambiguous to follow.

Now looking back, I think the core issue lies in using an enum to represent the data attributes required for aggregation. For example, an enum with values like GROUPED_VALUES and GROUP_OFFSETS should signify that only one of these options is valid in a given context. However, the fact that your documentation example requires both for a custom UDF violates the fundamental purpose of an enum, which is to define mutually exclusive states. Also, data_attribute is too vague, as it could refer to a (partial) groupby, a (partial) reduction, a (partial) segmented reduction, or intermediate aggregations.

Copy link
Contributor Author

@ttnghia ttnghia Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about making two tiers of base classes?

// For type erasure: storing `std::unique_ptr<host_udf_base>` by the same way in the
// internal implementation class
struct host_udf_base {};

// define data attributes, required functions etc.
struct reduction_host_udf_base: host_udf_base { };
struct segmented_reduction_host_udf_base: host_udf_base { };
struct groupby_host_udf_base: host_udf_base { };

The users will subclass from each of these 3 base classes separately. These base class will not mix their data at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

[FEA] Allow to run groupby/reduction with externally derived aggregations
2 participants