-
Notifications
You must be signed in to change notification settings - Fork 913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement HOST_UDF
aggregation for reduction and segmented reduction
#17645
base: branch-25.02
Are you sure you want to change the base?
Conversation
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
Signed-off-by: Nghia Truong <[email protected]>
template <typename T, | ||
CUDF_ENABLE_IF(std::is_same_v<T, reduction_data_attribute> || | ||
std::is_same_v<T, segmented_reduction_data_attribute> || | ||
std::is_same_v<T, groupby_data_attribute>)> | ||
data_attribute(T value_) : value{value_} | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a static_assert
instead of SFINAE might offer clearer information in cases of misuse of this constructor, for example:
template<typename T>
data_attribute (T value_) : value {value_} {
static_assert(std::is_same_v<T, reduction_data_attribute> ||
std::is_same_v<T, segmented_reduction_data_attribute> ||
std::is_same_v<T, groupby_data_attribute>,
"Unsupported aggregation data attribute");
}
using input_data_t = std::variant<column_view, | ||
data_type, | ||
std::optional<std::reference_wrapper<scalar const>>, | ||
null_policy, | ||
size_type, | ||
device_span<size_type const>>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When used exclusively for groupby, the input data types are still manageable. However, they have now become somewhat challenging to control. This was also the primary difficulty I encountered when trying to understand the use of host UDF in the initial version of your design. For example, the implicit connection between the optional input scalar and the initialization value of a reduction is ambiguous to follow.
Now looking back, I think the core issue lies in using an enum to represent the data attributes required for aggregation. For example, an enum with values like GROUPED_VALUES
and GROUP_OFFSETS
should signify that only one of these options is valid in a given context. However, the fact that your documentation example requires both for a custom UDF violates the fundamental purpose of an enum, which is to define mutually exclusive states. Also, data_attribute
is too vague, as it could refer to a (partial) groupby, a (partial) reduction, a (partial) segmented reduction, or intermediate aggregations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about making two tiers of base classes?
// For type erasure: storing `std::unique_ptr<host_udf_base>` by the same way in the
// internal implementation class
struct host_udf_base {};
// define data attributes, required functions etc.
struct reduction_host_udf_base: host_udf_base { };
struct segmented_reduction_host_udf_base: host_udf_base { };
struct groupby_host_udf_base: host_udf_base { };
The users will subclass from each of these 3 base classes separately. These base class will not mix their data at all.
Following #17592, this enables
HOST_UDF
aggregation in reduction and segmented reduction, allowing to execute a host-side user-defined function (UDF) through libcudf aggregation framework.Closes #16633.