Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String dtype: use ObjectEngine for indexing for now correctness over performance #60329

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

jorisvandenbossche
Copy link
Member

A new StringEngine for indexing was added in #56997, showing some performance improvements compared to the ObjectEngine.
However, there are some issues with handling of missing values, see for example #59879

The change in this PR switches back to object based engine, to for now have correct/desired behaviour, and we can see later if we can optimize this (but short term for 2.3/3.0 I would prioritize correct behaviour)

xref #54792

@jorisvandenbossche jorisvandenbossche added Indexing Related to indexing on series/frames, not to indexes themselves Strings String extension data type and string data labels Nov 15, 2024
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Nov 15, 2024
@WillAyd
Copy link
Member

WillAyd commented Nov 15, 2024

FWIW I noticed the xfails in test_pivot.py are going to require this, as there are tests that working with missing values as column labels

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review November 16, 2024 20:05
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for adding a new engine versus changing the existing StringEngine?

@@ -54,6 +54,7 @@ class UInt16Engine(IndexEngine): ...
class UInt8Engine(IndexEngine): ...
class ObjectEngine(IndexEngine): ...
class StringEngine(IndexEngine): ...
class StringObjectEngine(ObjectEngine): ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm would it be better to call this StrEngine? Or where does the term StringObject come from?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was meant to be read as "string-objectengine", i.e. essentially just the object engine, but we know that we only use it for strings (and so the _check_type can be specialized).

But I don't mind the name exactly (although StrEngine might also be confusing, because we currently use this for both str and string dtypes)

@jorisvandenbossche
Copy link
Member Author

What's the reason for adding a new engine versus changing the existing StringEngine?

I was initially thinking to modify the StringEngine to be like the masked engine to properly handle missing values, but that turned out to be a bit more complicated and so to have something that works correctly I thought to (for now) just fall back to the ObjectEngine (as we were using before for the string dtype as well).

You can see in the first commit that's what I did, but then I realized that the ObjectEngine itself it not yet enough if we want to have compatibility to allow looking up missing values with None vs np.nan (the object engine is strict about that, but for back compat I would prefer that the None can still be used for the string dtype, because the constructor will now coerce None to NaN), and so that is why I then added a very small ObjectEngine subclass to handle the missing value in the scalar lookup case.

with pytest.raises(KeyError):
index.get_loc(nulls_fixture)

def test_get_loc_missing(self, any_string_dtype, nulls_fixture):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this test now means that you can use np.nan and pd.NA interchangeably when indexing? If that's correct, I'm not sure I agree that we should be going that far

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that we are coercing any missing value indicator to NaN upon construction, and so to preserve back compat, I think I prefer we do the same for input to indexing operations.

To express it in terms of get_loc, this works now:

>>> pd.options.future.infer_string = False
>>> pd.Index(["a", "b", None]).get_loc(None)
2

but the same on main with enabling the string dtype:

>>> pd.options.future.infer_string = True
>>> pd.Index(["a", "b", None]).get_loc(None)
...
KeyError: None

That is because now the None is no longer in the object dtype index, but has been coerced to NaN.
(on main, trying the above with np.nan also fails (see the issue #59879), but that's because the StringEngine simply wasn't set up to work with missing values, so that is the initial reason I replaced it now with the StringObjectEngine)

The above is with None, but essentially happens with any other missing value indicator, like pd.NA. Maybe None and np.nan are the most important ones though, but I would at least prefer that indexing with None keeps working for now (we can always start deprecating it, but I wouldn't do that it as a breaking change for 3.0)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this is also already quite inconsistent depending on the data type .. See #59765 for an overview (e.g. also for datetimelike and categorical, we treat all NA-likes as the same in indexing lookups)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this is also already quite inconsistent depending on the data type .. See #59765 for an overview (e.g. also for datetimelike and categorical, we treat all NA-likes as the same in indexing lookups)

Nice - that's a great issue. Thanks for opening it.

To express it in terms of get_loc, this works now:

Hmm I'm a bit confused by how this relates to all of the missing indicators becoming essentially equal though. On main, this does not work (?):

>>> pd.options.future.infer_string = False
>>> pd.Index(["a", "b", None]).get_loc(np.nan)
KeyError: nan

Definitely understand that there is not an ideal solution here given the inconsistent history, but I don't want to go too far and just start making all of the missing value indicators interchangeable. I think containment logic should land a little closer to equality logic, and in the latter we obviously don't allow this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On main, this does not work (?):

Yes, that's the first bug that this PR is solving: right now no missing value lookup works, not even NaN itself (which is what is stored in the array). This is because the StringEngine simply doesn't handle missing values correctly (when building the hash table, it actually converts it to a sentinel string, but then for any of the lookup methods it doesn't take that into account; it's a bit an incomplete implementation)

So by using the ObjectEngine (subclass), that fixes that first issue: ensuring NaN can be found

I think containment logic should land a little closer to equality logic, and in the latter we obviously don't allow this

Missing values don't compare equal (well, Nonedoes, but we specifically didn't choose that long term as the sentinel moving forward; np.nan and pd.NA don't compare equal), so containment is already a bit of a special case anyway compared to equality, when it comes to missing values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point on the equality. I guess I'm still hung up on the indexing behavior being the same though.

I've lost track of the nuance a bit, but haven't np.nan and pd.NA always had different indexing behavior? I'm just wary of glossing over that as part of this.

Maybe worth some input from @pandas-dev/pandas-core if anyone else has thoughts

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the PR to for now just enable exact matching missing values in get_loc, so this PR can already be merged (and fix the most glaring bug), and then we can have the discussion around backwards compatibility in #59879 (I don't think the above thread is very easy to follow for other people to chime in, will do a write up on the issue -> see #59879 (comment))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants