BUG (string dtype): comparison of string column to mixed object column fails #60228 (fixed) #60392

TEARFEAR · 2024-11-22T00:47:29Z

Changes:

Updated comparison_op function in pandas/core/ops/array_ops.py:
- Added handling for comparisons between string and object dtypes.
- Ensured string and object types are cast to a common type (string) before comparison.
Added a test case to validate the fix:
- Verified comparison works for string vs object with homogeneous and mixed types.
- Verified behavior with PyArrow-based strings enabled (pd.options.future.infer_string = True).

--

closes BUG (string dtype): comparison of string column to mixed object column fails #60228
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.


jorisvandenbossche

Thanks for working on this!
Added a few suggestions

jorisvandenbossche · 2024-11-23T09:17:19Z

pandas/core/ops/array_ops.py

+    if (is_string_dtype(lvalues) and is_object_dtype(rvalues)) or (
+        is_object_dtype(lvalues) and is_string_dtype(rvalues)


Checking for string dtype on an array can be expensive when the array is object dtype (in that case it scans all values to check whether they are strings). So we might want to try to avoid that at this level.
I think we could handle the issue specifically for the ArrowExtensionArray itself (see the code I referenced in #60228 (comment))
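To illustrate the cost concern, here is a toy model in plain Python. It is not how pandas implements the check; `ObjectArray` and both helper functions are hypothetical names made up for this sketch. The point is only that a dtype check reads metadata, while a "are all values strings?" check has to visit the elements.

```python
from typing import Any, Sequence


class ObjectArray:
    """Hypothetical stand-in for an object-dtype array (not a pandas class)."""

    def __init__(self, values: Sequence[Any], dtype: str = "object") -> None:
        self.values = list(values)
        self.dtype = dtype
        self.scanned = 0  # counts how many elements a check has inspected


def dtype_is_object(arr: ObjectArray) -> bool:
    # Metadata-only check: constant time, no element is inspected.
    return arr.dtype == "object"


def all_values_are_strings(arr: ObjectArray) -> bool:
    # Value-level check: inspects elements until a non-string is found,
    # so an all-string array of length n costs n inspections.
    for value in arr.values:
        arr.scanned += 1
        if not isinstance(value, str):
            return False
    return True


arr = ObjectArray(["a"] * 1_000_000)
cheap = dtype_is_object(arr)
scanned_after_cheap = arr.scanned
expensive = all_values_are_strings(arr)
scanned_after_expensive = arr.scanned
```

In this sketch the metadata check touches no elements, while the value-level check touches every one of them, which is the kind of work one would want to keep out of a hot path such as a generic comparison function.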

jorisvandenbossche · 2024-11-23T09:20:50Z

pandas/core/ops/array_ops.py

+    ):
+        if lvalues.dtype.name == "string" and rvalues.dtype == object:
+            lvalues = lvalues.astype("string")
+            rvalues = pd_array(rvalues, dtype="string")


We might need to do the casting the other way around. Instead of casting the object to string and then comparing both as strings, I think we have to cast the string to object and compare both as object dtype.

The reason for this is that casting to string might actually convert values to strings, and then we are no longer doing the comparison for the original values.

>>> ser_string = pd.Series(["1", "b"])
>>> ser_mixed = pd.Series([1, "b"])
>>> ser_string == ser_mixed
0    False
1     True
dtype: bool
>>> ser_string == ser_mixed.astype("string")
0    True
1    True
dtype: bool

So if we did that casting under the hood, the result would change in this case.

And we should add this case to the tests!
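The same effect can be reproduced without pandas, as a minimal sketch of the two casting directions using plain Python lists. This only mirrors the elementwise semantics discussed above and is not the code path pandas uses; the variable names are chosen for illustration.

```python
string_values = ["1", "b"]
mixed_values = [1, "b"]

# Compare the original values as plain objects: 1 and "1" are different
# objects of different types, so the first pair is not equal.
compared_as_object = [
    left == right for left, right in zip(string_values, mixed_values)
]

# Cast the mixed side to strings first: 1 becomes "1", so the first pair
# now compares equal even though the original values did not.
compared_as_string = [
    left == str(right) for left, right in zip(string_values, mixed_values)
]

print(compared_as_object)  # [False, True]
print(compared_as_string)  # [True, True]
```

A regression test for this case would presumably assert the first result, since that is the one that preserves the original values.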

jorisvandenbossche · 2024-11-23T09:23:07Z

pandas/tests/series/methods/test_compare.py

+
+def test_comparison_string_mixed_object():
+    # Issue https://github.com/pandas-dev/pandas/issues/60228
+    pd.options.future.infer_string = True


You don't need to add this for CI, because we have a separate CI build that enables this option for the full test suite.

Now, this can still be useful to test locally, but the way to do that is by setting an environment variable (on Linux I can do PANDAS_FUTURE_INFER_STRING=1 pytest ... to run the test with the option enabled).
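For anyone who prefers to launch this from Python rather than a shell, a rough sketch follows. The environment variable name and the test file path come from this thread; the helper names `build_env` and `run_tests` are hypothetical, and the sketch assumes pytest is installed in the current interpreter's environment.

```python
import os
import subprocess
import sys


def build_env() -> dict:
    # Copy the current environment and enable the future string dtype
    # for the child process only; the parent process is left untouched.
    env = dict(os.environ)
    env["PANDAS_FUTURE_INFER_STRING"] = "1"
    return env


def run_tests(test_path: str = "pandas/tests/series/methods/test_compare.py") -> int:
    # Run pytest on the given path with the option enabled and
    # return pytest's exit code.
    cmd = [sys.executable, "-m", "pytest", test_path]
    return subprocess.run(cmd, env=build_env(), check=False).returncode
```

Calling `run_tests()` from the root of a pandas checkout would then be roughly equivalent to the shell one-liner above, without changing any option inside the test itself.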

TEARFEAR and others added 6 commits November 21, 2024 14:35

fixed comparison of string column to mixed object column (issue panda…

4bc49ed

…s-dev#60228)

BUG (string dtype): comparison of string column to mixed object colum…

0def761

…n fails pandas-dev#60228

BUG (string dtype): comparison of string column to mixed object colum…

c4da919

…n fails pandas-dev#60228

BUG (string dtype): comparison of string column to mixed object colum…

900f3b1

…n fails pandas-dev#60228

BUG (string dtype): comparison of string column to mixed object colum…

8db4edc

…n fails pandas-dev#60228

Merge branch 'main' into bug-update-60228

d4ea527

jorisvandenbossche added this to the 2.3 milestone Nov 23, 2024

jorisvandenbossche added the Strings String extension data type and string data label Nov 23, 2024

jorisvandenbossche reviewed Nov 23, 2024
