Add Row-Wise Mode Functionality (axis=1) and Improve Metadata Handling in _collection.py #1137

thyripian · 2024-09-19T23:53:39Z

Added functionality for row-wise mode calculation (axis=1) to support the Dask DataFrame API.
The new implementation dynamically handles row-wise mode and ensures consistent metadata handling across partitions.
Added validation for the axis parameter, with appropriate error handling for unsupported values.
Ensured compatibility with existing column-wise (axis=0) mode functionality, preserving the original behavior for that case.

Resolves Dask Expressions issue #1136 and Dask issue #11389
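
For reference, a minimal sketch of what row-wise mode looks like in plain pandas, which is presumably the behavior the axis=1 path is meant to mirror. This is illustrative only and is not the code in this PR; normalize_axis is a hypothetical helper showing one way the axis argument could be validated.

```python
import pandas as pd


def normalize_axis(axis):
    """Hypothetical helper: map an axis argument to 0 or 1, or raise."""
    mapping = {0: 0, "index": 0, 1: 1, "columns": 1}
    if axis not in mapping:
        raise ValueError(f"No axis named {axis}")
    return mapping[axis]


df = pd.DataFrame({"a": [1, 2], "b": [1, 3], "c": [4, 5]})

# axis=1 computes the mode of each row. Row 0 has a single mode (1);
# row 1 has a three-way tie, so the result needs three columns and
# row 0 is padded with NaN.
row_modes = df.mode(axis=normalize_axis("columns"))
print(row_modes)
```

The width of the result depends on the data: a row with ties adds columns, which is why the output metadata is harder to predict for axis=1 than for axis=0.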


thyripian · 2024-09-20T00:51:58Z

Unit tests for the added functionality have been submitted in a PR against the Dask repository, because the original issue was filed there and the entry point for this functionality is dask/dataframe/core.py. Please note that this dask-expr PR must be approved before the tests in the Dask PR can run successfully, since they depend on the changes made here.

Formatted _collection.py with Black

phofl · 2024-09-26T11:53:36Z

dask_expr/_collection.py

+            # Implement axis=1 (row-wise mode)
+            num_columns = len(self.columns)  # Maximum possible number of modes per row
+
+            def row_wise_mode(df):


You don't have to do this; we generally expect the columns in every partition to match.

Could you clarify which part isn't necessary? Are you referring to determining the number of columns for row-wise mode?

Basically everything inside def row_wise_mode.

Understood. I'll make the necessary revisions this afternoon and push them to this PR once tested.

phofl · 2024-09-26T11:54:11Z

dask_expr/_collection.py

+
+            # Create metadata with the correct number of columns and float64 dtype
+            meta = pd.DataFrame(
+                {i: pd.Series(dtype="float64") for i in range(num_columns)}


Why are you forcing float in all cases?

Thank you for highlighting that forcing columns to float64 can lead to unintended type conversions. I initially used float64 to handle NaN values but agree it's better to preserve original data types. I'll update the implementation to dynamically determine data types based on the input, using pandas' nullable types like Int64 for integers and string for text data. This will handle missing values without unnecessary type changes, ensuring consistent data types across partitions while maintaining data integrity. Please let me know if you have further suggestions.

Please don't do any explicit type casting; just use the nonempty meta of the input and call mode on it.
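
A rough pandas-only sketch of that suggestion, under stated assumptions: the two-row frame named fake is a hypothetical stand-in for the non-empty meta that Dask builds internally, and all variable names are made up for illustration. The idea is to let mode decide the output structure instead of casting anything by hand.

```python
import pandas as pd

# Empty frame carrying only column names and dtypes, similar to a Dask meta.
meta = pd.DataFrame(
    {"a": pd.Series(dtype="int64"), "b": pd.Series(dtype="int64")}
)

# Hypothetical stand-in for the non-empty meta: same columns and dtypes,
# filled with two placeholder rows.
fake = pd.DataFrame(
    {name: pd.Series([0, 1], dtype=dtype) for name, dtype in meta.dtypes.items()}
)

# Call mode on the placeholder data, then keep only the empty structure.
result_meta = fake.mode(axis=1).iloc[:0]
print(result_meta.columns.tolist())
```

Because the number of columns returned by mode(axis=1) depends on how many ties each row has, the column count inferred from placeholder data can differ from what the real data produces.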

Simplified the logic for row-wise mode computation (axis=1) to dynamically handle multiple modes per row. Refactored metadata handling to ensure the number of columns is consistent across partitions, avoiding mismatches in column count. This addresses issues with inconsistent column numbers between computed data and metadata in Dask, and addresses dev team feedback.

My bad. GitHub Desktop added my venv to the last push, but I didn't notice.

thyripian · 2024-10-06T19:52:26Z

Apologies for the delay — I’ve had some commitments with my day job this past week. The current version has been simplified, with no explicit typecasting or nested functions. It has passed both my local bug script and the pending unit test in the Dask repo.

Note: I accidentally pushed my venv with GitHub Desktop earlier, but I’ve since re-pushed with the venv directory removed.

Modified row-wise mode implementation to rely entirely on self._meta_nonempty for metadata generation, as per developer feedback. Ensured complete removal of explicit typecasting and ensured consistent column handling between computed data and metadata.

Made linting changes, specifically for black.

Reformat _collection.py

e976f38


phofl reviewed Sep 26, 2024


thyripian added 3 commits October 6, 2024 15:38

drop venv

9f918ac


Add venv to .gitignore to avoid tracking the virtual environment

ddaef07

thyripian and others added 3 commits October 6, 2024 16:01

Run pre-commit linting

894bd9c


Merge branch 'main' into feature/rowwise-mode-support-gh1136

1dbdd2d
