-
-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Row-Wise Mode Functionality (axis=1) and Improve Metadata Handling in _collection.py #1137
base: main
Are you sure you want to change the base?
Add Row-Wise Mode Functionality (axis=1) and Improve Metadata Handling in _collection.py #1137
Conversation
- Added functionality for row-wise mode calculation (axis=1) to support the Dask DataFrame API. - The new implementation dynamically handles row-wise mode and ensures consistent metadata handling across partitions. - Added validation for the axis parameter, with appropriate error handling for unsupported values. - Ensured compatibility with existing column-wise (axis=0) mode functionality, preserving the original behavior for that case. Resolves dask-expr issue dask#1136.
Unit tests for the added functionality have been submitted in the Dask repository’s PR due to the initial issue location and the entry point for the functionality being within dask/dataframe/core.py. Please note that this PR in Dask-expr must be approved before the Dask PR tests can successfully run, as the tests depend on the changes made here. |
Formatted _collection.py with Black
dask_expr/_collection.py
Outdated
# Implement axis=1 (row-wise mode) | ||
num_columns = len(self.columns) # Maximum possible number of modes per row | ||
|
||
def row_wise_mode(df): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't have to do this, we generally expect that the columns in every partition are matching
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify which part isn't necessary? Are you referring to determining the number of columns for row-wise mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything that is in def row_wise_mode basically
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. I'll make the necessary revisions this afternoon and push them to this PR once tested.
dask_expr/_collection.py
Outdated
|
||
# Create metadata with the correct number of columns and float64 dtype | ||
meta = pd.DataFrame( | ||
{i: pd.Series(dtype="float64") for i in range(num_columns)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you forcing float in all cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for highlighting that forcing columns to float64 can lead to unintended type conversions. I initially used float64 to handle NaN values but agree it's better to preserve original data types. I'll update the implementation to dynamically determine data types based on the input, using pandas' nullable types like Int64 for integers and string for text data. This will handle missing values without unnecessary type changes, ensuring consistent data types across partitions while maintaining data integrity. Please let me know if you have further suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't do any explicit type casting, just use the nonempty meta of the input and call mode on it
Simplified the logic for row-wise mode computation (axis=1) to dynamically handle multiple modes per row. Refactored metadata handling to ensure the number of columns is consistent across partitions, avoiding mismatches in column count. This addresses issues with inconsistent column numbers between computed data and metadata in Dask, and addresses dev team feedback.
Apologies for the delay — I’ve had some commitments with my day job this past week. The current version has been simplified, with no explicit typecasting or nested functions. It has passed both my local bug script and the pending unit test in the Dask repo. Note: I accidentally pushed my venv with GitHub Desktop earlier, but I’ve since re-pushed with the venv directory removed. |
Modified row-wise mode implementation to rely entirely on self._meta_nonempty for metadata generation, as per developer feedback. Ensured complete removal of explicit typecasting and ensured consistent column handling between computed data and metadata.
Made linting changes, specifically for black.
Resolves Dask Expressions issue #1136 and Dask issue #11389