String dtype: use ObjectEngine for indexing for now correctness over performance #60329

jorisvandenbossche · 2024-11-15T17:21:55Z

A new StringEngine for indexing was added in #56997, showing some performance improvements compared to the ObjectEngine.
However, there are some issues with handling of missing values, see for example #59879

The change in this PR switches back to the object-based engine so that we have correct/desired behaviour for now; we can see later whether we can optimize this (but short term, for 2.3/3.0, I would prioritize correct behaviour).

xref #54792

WillAyd · 2024-11-15T23:32:56Z

FWIW I noticed the xfails in test_pivot.py are going to require this, as there are tests that work with missing values as column labels.

WillAyd

What's the reason for adding a new engine versus changing the existing StringEngine?

WillAyd · 2024-11-16T23:10:21Z

pandas/_libs/index.pyi

@@ -54,6 +54,7 @@ class UInt16Engine(IndexEngine): ...
 class UInt8Engine(IndexEngine): ...
 class ObjectEngine(IndexEngine): ...
 class StringEngine(IndexEngine): ...
+class StringObjectEngine(ObjectEngine): ...


Hmm would it be better to call this StrEngine? Or where does the term StringObject come from?

It was meant to be read as "string-objectengine", i.e. essentially just the object engine, but we know that we only use it for strings (and so the _check_type can be specialized).

But I don't mind the name exactly (although StrEngine might also be confusing, because we currently use this for both str and string dtypes)

jorisvandenbossche · 2024-11-17T09:09:54Z

What's the reason for adding a new engine versus changing the existing StringEngine?

I was initially thinking of modifying the StringEngine to be like the masked engine so that it properly handles missing values, but that turned out to be a bit more complicated. So, to have something that works correctly, I thought to (for now) just fall back to the ObjectEngine (which we were using before for the string dtype as well).

You can see in the first commit that's what I did, but then I realized that the ObjectEngine by itself is not yet enough if we want to stay compatible and allow looking up missing values with None as well as np.nan. The object engine is strict about that, but for back compat I would prefer that None can still be used for the string dtype, because the constructor will now coerce None to NaN. That is why I then added a very small ObjectEngine subclass to handle the missing value in the scalar lookup case.
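
As a rough illustration of the idea described above (not the actual Cython code in pandas), the pure-Python sketch below models an object-style engine as a dict from label to position, plus a small subclass that pre-processes a scalar key before the lookup. The names `ToyObjectEngine`, `ToyStringObjectEngine`, `NAN` and `_is_missing` are hypothetical, and only None and float NaN are treated as missing here.

```python
import math

# Single canonical missing-value object (an assumption of this sketch).
NAN = float("nan")


def _is_missing(value):
    """Return True for None or a float NaN (hypothetical helper)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


class ToyObjectEngine:
    """Hypothetical stand-in for a strict, hash-based object engine."""

    def __init__(self, values):
        # Map each label to its first position.
        self._mapping = {}
        for pos, val in enumerate(values):
            self._mapping.setdefault(val, pos)

    def get_loc(self, key):
        # Strict lookup: None only matches None, and a NaN object is only
        # found when the very same object was stored.
        if key not in self._mapping:
            raise KeyError(key)
        return self._mapping[key]


class ToyStringObjectEngine(ToyObjectEngine):
    """Same lookup, but missing-value keys are normalized first."""

    def __init__(self, values):
        # Mimic the constructor coercion mentioned in the discussion:
        # every missing value is stored as the canonical NaN.
        super().__init__([NAN if _is_missing(v) else v for v in values])

    def get_loc(self, key):
        # Pre-process the scalar key so None and NaN both hit the stored NaN.
        if _is_missing(key):
            key = NAN
        return super().get_loc(key)
```

With this toy model, `ToyStringObjectEngine(["a", "b", None]).get_loc(None)` finds position 2 even though the stored value is NaN, while the strict base class would only find None if None itself had been stored.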

WillAyd · 2024-11-17T16:03:15Z

pandas/tests/indexes/string/test_indexing.py

+        with pytest.raises(KeyError):
+            index.get_loc(nulls_fixture)
+
+    def test_get_loc_missing(self, any_string_dtype, nulls_fixture):


So this test now means that you can use np.nan and pd.NA interchangeably when indexing? If that's correct, I'm not sure I agree that we should be going that far

The problem is that we are coercing any missing value indicator to NaN upon construction, and so to preserve back compat, I think I prefer we do the same for input to indexing operations.

To express it in terms of get_loc, this works now:

>>> pd.options.future.infer_string = False
>>> pd.Index(["a", "b", None]).get_loc(None)
2

but the same on main with enabling the string dtype:

>>> pd.options.future.infer_string = True
>>> pd.Index(["a", "b", None]).get_loc(None)
...
KeyError: None

That is because now the None is no longer in the object dtype index, but has been coerced to NaN.
(on main, trying the above with np.nan also fails (see the issue #59879), but that's because the StringEngine simply wasn't set up to work with missing values, so that is the initial reason I replaced it now with the StringObjectEngine)

The above is with None, but essentially the same happens with any other missing value indicator, like pd.NA. None and np.nan are probably the most important ones, but I would at least prefer that indexing with None keeps working for now (we can always start deprecating it, but I wouldn't do that as a breaking change for 3.0).

FWIW this is also already quite inconsistent depending on the data type .. See #59765 for an overview (e.g. also for datetimelike and categorical, we treat all NA-likes as the same in indexing lookups)

FWIW this is also already quite inconsistent depending on the data type .. See #59765 for an overview (e.g. also for datetimelike and categorical, we treat all NA-likes as the same in indexing lookups)

Nice - that's a great issue. Thanks for opening it.

To express it in terms of get_loc, this works now:

Hmm I'm a bit confused by how this relates to all of the missing indicators becoming essentially equal though. On main, this does not work (?):

>>> pd.options.future.infer_string = False
>>> pd.Index(["a", "b", None]).get_loc(np.nan)
KeyError: nan

Definitely understand that there is not an ideal solution here given the inconsistent history, but I don't want to go too far and just start making all of the missing value indicators interchangeable. I think containment logic should land a little closer to equality logic, and in the latter we obviously don't allow this

On main, this does not work (?):

Yes, that's the first bug that this PR is solving: right now no missing value lookup works, not even NaN itself (which is what is stored in the array). This is because the StringEngine simply doesn't handle missing values correctly (when building the hash table, it actually converts them to a sentinel string, but then none of the lookup methods take that into account; it's a bit of an incomplete implementation).
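
To make that failure mode concrete, here is a toy model (pure Python, hypothetical names, not the real StringEngine) of a hash table that swaps NaN for a sentinel string at build time. A lookup that forgets to apply the same substitution cannot find NaN; one that applies it can. The sentinel value and the method names are made up for this sketch.

```python
import math

# Made-up placeholder; not the sentinel pandas uses.
SENTINEL = "<toy-na-sentinel>"


def _is_nan(value):
    """Return True for a float NaN (hypothetical helper)."""
    return isinstance(value, float) and math.isnan(value)


class ToySentinelEngine:
    """Toy hash-table engine that stores NaN under a sentinel string."""

    def __init__(self, values):
        self._table = {}
        for pos, val in enumerate(values):
            # Build step: missing values are replaced by the sentinel.
            stored = SENTINEL if _is_nan(val) else val
            self._table.setdefault(stored, pos)

    def get_loc_incomplete(self, key):
        # The key is used as-is, so a NaN key never matches the sentinel
        # entry that was written at build time.
        if key not in self._table:
            raise KeyError(key)
        return self._table[key]

    def get_loc_consistent(self, key):
        # The key goes through the same substitution as the build step.
        lookup = SENTINEL if _is_nan(key) else key
        if lookup not in self._table:
            raise KeyError(key)
        return self._table[lookup]
```

For `ToySentinelEngine(["a", "b", float("nan")])`, `get_loc_consistent(float("nan"))` finds position 2, whereas `get_loc_incomplete(float("nan"))` raises KeyError.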

So by using the ObjectEngine (subclass), that fixes that first issue: ensuring NaN can be found

I think containment logic should land a little closer to equality logic, and in the latter we obviously don't allow this

Missing values don't compare equal (well, None does, but we specifically didn't choose that long term as the sentinel moving forward; np.nan and pd.NA don't compare equal), so containment is already a bit of a special case anyway compared to equality, when it comes to missing values.
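
The gap between equality and containment for missing values can be seen with plain Python objects. The snippet below is only an illustration using the standard library (pd.NA is left out to keep it dependency-free); the variable names and the `contains_missing` helper are hypothetical.

```python
import math

nan = math.nan
other_nan = float("nan")  # a second, distinct NaN object

# Equality: None equals itself, NaN does not.
none_equals_none = None == None  # True
nan_equals_nan = nan == nan  # False

# Containment in a list checks identity before equality, so the very same
# NaN object is "found" even though it does not compare equal to itself.
same_nan_in_list = nan in [nan]  # True
other_nan_in_list = other_nan in [nan]  # False


def contains_missing(values):
    """Containment by 'is missing' rather than by equality (hypothetical)."""
    return any(
        v is None or (isinstance(v, float) and math.isnan(v)) for v in values
    )
```

So any lookup that is meant to find NaN already has to do something other than a plain `==` comparison, for example an explicit missing-value check like `contains_missing`.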

Fair point on the equality. I guess I'm still hung up on the indexing behavior being the same though.

I've lost track of the nuance a bit, but haven't np.nan and pd.NA always had different indexing behavior? I'm just wary of glossing over that as part of this.

Maybe worth some input from @pandas-dev/pandas-core if anyone else has thoughts

I updated the PR to, for now, only enable exact matching of missing values in get_loc, so that this PR can already be merged (and fix the most glaring bug). We can then have the discussion around backwards compatibility in #59879 (I don't think the above thread is very easy for other people to follow and chime in on, so I will do a write-up on the issue -> see #59879 (comment)).

jorisvandenbossche added 2 commits November 15, 2024 17:19

String dtype: use ObjectEngine for indexing for now correctness over …

091baa8

…performance

add string-specific ObjectEngine subclass for pre-processing of input…

cfb73f5

… values

jorisvandenbossche added the Indexing (Related to indexing on series/frames, not to indexes themselves) and Strings (String extension data type and string data) labels Nov 15, 2024

jorisvandenbossche added this to the 2.3 milestone Nov 15, 2024

jorisvandenbossche added 3 commits November 15, 2024 19:55

remove xfails

6892f83

add tests for get_loc + fix for NA variant of string dtype

e007299

support get_indexer

bb148ba

jorisvandenbossche added 2 commits November 16, 2024 15:26

update tests

a669d75

Merge remote-tracking branch 'upstream/main' into string-dtype-index-…

2a4aed2

…engine

jorisvandenbossche marked this pull request as ready for review November 16, 2024 20:05

jorisvandenbossche requested a review from WillAyd as a code owner November 16, 2024 20:05

WillAyd reviewed Nov 16, 2024

jorisvandenbossche added 3 commits November 17, 2024 09:55

Merge remote-tracking branch 'upstream/main' into string-dtype-index-…

13fa689

…engine

update xfail for parser test

fccd220

try fix typing

8142300

WillAyd reviewed Nov 17, 2024

WillAyd mentioned this pull request Nov 18, 2024

TST (string dtype): clean-up assorted xfails #60354

Draft

jorisvandenbossche added 2 commits November 23, 2024 15:24

Merge remote-tracking branch 'upstream/main' into string-dtype-index-…

3c62a8d

…engine

limit get_loc to exact match for now

43a3edf

jorisvandenbossche mentioned this pull request Nov 23, 2024

BUG (string dtype): looking up missing value in the Index fails #59879

Open

fix for non-infer_string mode

c546a51
