Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG (string dtype): comparison of string column to mixed object column fails #60228 (fixed) #60392

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

TEARFEAR
Copy link

@TEARFEAR TEARFEAR commented Nov 22, 2024

Changes :

  • Updated comparison_op function in pandas/core/ops/array_ops.py:
    • Added handling for comparisons between string and object dtypes.
    • Ensured string and object types are casted to a common type (string) before comparison.
  • Added a test case to validate the fix:
    • Verified comparison works for string vs object with homogeneous and mixed types.
    • Verified behavior with PyArrow-based strings enabled (pd.options.future.infer_string = True).

--

@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Nov 23, 2024
@jorisvandenbossche jorisvandenbossche added the Strings String extension data type and string data label Nov 23, 2024
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!
Added a few suggestions

Comment on lines +328 to +329
if (is_string_dtype(lvalues) and is_object_dtype(rvalues)) or (
is_object_dtype(lvalues) and is_string_dtype(rvalues)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking for string dtype for an array can be expensive in case the array is object dtype (at that point it will scan all values to check if they are strings). So we might want to try avoid that at this level.
I think we could handle the issue specifically for the ArrowExtensionArray itself (see the code I referenced in #60228 (comment))

):
if lvalues.dtype.name == "string" and rvalues.dtype == object:
lvalues = lvalues.astype("string")
rvalues = pd_array(rvalues, dtype="string")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to do the casting the other way around. Instead of casting the object to string and then compare both as strings, I think we have to cast the string to object and compare both as object dtype.

The reason for this is that casting to string might actually convert values to strings, and then we are no longer doing the comparison for the original values.

>>> ser_string = pd.Series(["1", "b"])
>>> ser_mixed = pd.Series([1, "b"])
>>> ser_string == ser_mixed
0    False
1     True
dtype: bool

>>> ser_string == ser_mixed.astype("string")
0    True
1    True
dtype: bool

So if we would do that casting under the hood, the result would change in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we should add this case to the tests!


def test_comparison_string_mixed_object():
# Issue https://github.com/pandas-dev/pandas/issues/60228
pd.options.future.infer_string = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to add this for CI, because we have a separate CI build that enables this option for the full test suite.

Now, this can still be useful to test locally, but the way you can do this is with setting an environment variable (on linux I can do PANDAS_FUTURE_INFER_STRING=1 pytest ... to run the test with the option enabled.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG (string dtype): comparison of string column to mixed object column fails
2 participants