Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Row-Wise Mode Functionality (axis=1) and Improve Metadata Handling in _collection.py #1137

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

thyripian
Copy link

  • Added functionality for row-wise mode calculation (axis=1) to support the Dask DataFrame API.
  • The new implementation dynamically handles row-wise mode and ensures consistent metadata handling across partitions.
  • Added validation for the axis parameter, with appropriate error handling for unsupported values.
  • Ensured compatibility with existing column-wise (axis=0) mode functionality, preserving the original behavior for that case.

Resolves Dask Expressions issue #1136 and Dask issue #11389

- Added functionality for row-wise mode calculation (axis=1) to support the Dask DataFrame API.
- The new implementation dynamically handles row-wise mode and ensures consistent metadata handling across partitions.
- Added validation for the axis parameter, with appropriate error handling for unsupported values.
- Ensured compatibility with existing column-wise (axis=0) mode functionality, preserving the original behavior for that case.

Resolves dask-expr issue dask#1136.
@thyripian
Copy link
Author

Unit tests for the added functionality have been submitted in the Dask repository’s PR due to the initial issue location and the entry point for the functionality being within dask/dataframe/core.py. Please note that this PR in Dask-expr must be approved before the Dask PR tests can successfully run, as the tests depend on the changes made here.

Formatted _collection.py with Black
# Implement axis=1 (row-wise mode)
num_columns = len(self.columns) # Maximum possible number of modes per row

def row_wise_mode(df):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't have to do this, we generally expect that the columns in every partition are matching

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify which part isn't necessary? Are you referring to determining the number of columns for row-wise mode?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything that is in def row_wise_mode basically

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. I'll make the necessary revisions this afternoon and push them to this PR once tested.


# Create metadata with the correct number of columns and float64 dtype
meta = pd.DataFrame(
{i: pd.Series(dtype="float64") for i in range(num_columns)}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you forcing float in all cases?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for highlighting that forcing columns to float64 can lead to unintended type conversions. I initially used float64 to handle NaN values but agree it's better to preserve original data types. I'll update the implementation to dynamically determine data types based on the input, using pandas' nullable types like Int64 for integers and string for text data. This will handle missing values without unnecessary type changes, ensuring consistent data types across partitions while maintaining data integrity. Please let me know if you have further suggestions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't do any explicit type casting, just use the nonempty meta of the input and call mode on it

Simplified the logic for row-wise mode computation (axis=1) to dynamically handle multiple modes per row. Refactored metadata handling to ensure the number of columns is consistent across partitions, avoiding mismatches in column count. This addresses issues with inconsistent column numbers between computed data and metadata in Dask, and addresses dev team feedback.
My bad. Git desktop added my venv to the last push but I didn't see it.
@thyripian
Copy link
Author

Apologies for the delay — I’ve had some commitments with my day job this past week. The current version has been simplified, with no explicit typecasting or nested functions. It has passed both my local bug script and the pending unit test in the Dask repo.

Note: I accidentally pushed my venv with GitHub Desktop earlier, but I’ve since re-pushed with the venv directory removed.

thyripian and others added 3 commits October 6, 2024 16:01
Modified row-wise mode implementation to rely entirely on self._meta_nonempty for metadata generation, as per developer feedback. Ensured complete removal of explicit typecasting and ensured consistent column handling between computed data and metadata.
Made linting changes, specifically for black.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants