API: how to check for "logical" equality of dtypes? #60305

jorisvandenbossche · 2024-11-13T18:00:22Z

Assume you have a series, which has a certain dtype. In the case that this dtype is an instance of potentially multiple variants of a logical dtype (for example, string backed by python or backed by pyarrow), how do you check for the "logical" equality of such dtypes?

To check the logical dtype of a single series, you can either compare the dtype with the generic string alias (which returns True for any variant of it) or check the dtype with isinstance or one of the is_..._dtype functions (although we have deprecated some of those). Using string dtype as the example:

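import pandas as pd

ser = pd.Series(["a", "b"], dtype="string")

# compare against the generic alias
ser.dtype == "string"
# or check the dtype class / use the type-checking helper
isinstance(ser.dtype, pd.StringDtype)
pd.api.types.is_string_dtype(ser.dtype)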

When you want to check whether two Series have the same dtype, == checks for exact equality (in the string dtype example, the below can evaluate to False even if both are a StringDtype, but with a different storage):

ser1.dtype == ser2.dtype

So how do you check this logical equality for two dtypes? In the example, how do you know that both dtypes represent the same logical dtype (i.e. both are a StringDtype instance), without necessarily wanting to check the exact type (i.e. the user doesn't necessarily know they are string dtypes, and just wants to check whether they are logically the same)?

# this might work?
type(ser1.dtype) == type(ser2.dtype)
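To make the difference concrete, here is a minimal sketch of the two comparisons side by side. It assumes that comparing the dtype classes is an acceptable stand-in for "logical" equality, which is only the tentative idea floated above and not an agreed API. The pyarrow-backed variant needs the optional pyarrow package, so the sketch falls back to python storage when it is unavailable (in which case both comparisons trivially agree).

```python
import pandas as pd

# Two string dtypes that should differ only in their storage.
dtype1 = pd.StringDtype("python")
try:
    dtype2 = pd.StringDtype("pyarrow")  # requires the optional pyarrow package
except ImportError:
    dtype2 = pd.StringDtype("python")   # fallback so the sketch still runs

ser1 = pd.Series(["a", "b"], dtype=dtype1)
ser2 = pd.Series(["a", "b"], dtype=dtype2)

# Exact equality: sensitive to storage, so it may be False when storages differ.
exact_match = ser1.dtype == ser2.dtype

# Class-level comparison: both dtypes are StringDtype instances.
same_class = type(ser1.dtype) is type(ser2.dtype)

print(f"exact: {exact_match}, same class: {same_class}")
```

Note that a class comparison is coarse: it also treats, for example, two CategoricalDtype objects with different categories as "the same", which may or may not be what a user means by logical equality.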

Do we want some other API here that is a bit more user friendly? (just brainstorming, something like dtype1.is_same_type(dtype2), or a function, ..)

This is important in the discussion around logical dtypes (#58455), but it is already an issue for the new string dtype in pandas 3.0 as well.

cc @WillAyd @Dr-Irv @jbrockmendel (tagging some people that were most active in the logical dtypes discussion)

jorisvandenbossche · 2024-11-13T18:01:54Z

And maybe an additional question is how to propagate that notion of "exact equality" vs "logical equality" into methods like .equals() or assert_frame_equal() (those could get a keyword about it?)

Dr-Irv · 2024-11-13T18:47:00Z

One idea is to make .equals() mean "is it the same logical type", but that won't work because sometimes the dtype of a Series is a numpy dtype.

IMHO, let == mean "exactly the same dtype", and introduce a function to mean "is same logical dtype"

jorisvandenbossche · 2024-11-13T19:11:03Z

Something like pd.api.types.is_same_dtype(dtype1, dtype2) ?
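As a rough illustration of what such a function could look like, here is a sketch. The `is_same_dtype` defined below is a hypothetical standalone helper and does not exist in pandas; its rules are assumptions made for illustration. It treats two extension dtypes of the same class as logically equal, and falls back to exact equality for NumPy dtypes, since (as noted above) the dtype of a Series is sometimes a NumPy dtype.

```python
import numpy as np
import pandas as pd
from pandas.api.types import pandas_dtype


def is_same_dtype(left, right) -> bool:
    """Hypothetical helper (not a pandas API): compare dtypes 'logically'."""
    # Normalise aliases such as "Int64" or np.int64 into dtype objects.
    left = pandas_dtype(left)
    right = pandas_dtype(right)

    left_is_np = isinstance(left, np.dtype)
    right_is_np = isinstance(right, np.dtype)
    if left_is_np and right_is_np:
        # Assumption: NumPy dtypes have no storage variants, so use exact equality.
        return left == right
    if left_is_np or right_is_np:
        # Assumption: a NumPy dtype and an extension dtype are never "the same",
        # which matches the current behaviour of == for int64 vs Int64.
        return False
    # Assumption: extension dtypes of the same class share one logical dtype.
    return type(left) is type(right)


print(is_same_dtype(pd.StringDtype("python"), "string"))
print(is_same_dtype(pd.Int64Dtype(), np.dtype("int64")))
```

Whether the mixed NumPy/extension case should return False is exactly the kind of question the rest of this thread debates, so that branch is a placeholder rather than a recommendation.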

WillAyd · 2024-11-13T19:23:24Z

> When you want to check whether two Series have the same dtype, == checks for exact equality (in the string dtype example, the below can evaluate to False even if both are a StringDtype, but with a different storage):

I think these should compare as equal if they are logically equivalent; otherwise we are back to the issue of exposing the implementation detail to end users.

So I think the reverse of what is being proposed is what we should have. By default, equality comparisons use the logical semantics, and if you want a more granular physical comparison you should use a dedicated function. I think we already have prior art for that when considering .equals and testing.assert_frame_equal

Dr-Irv · 2024-11-13T19:27:50Z

> I think these should compare as equal if they are logically equivalent; otherwise we are back to the issue of exposing the implementation detail to end users.

@WillAyd that would create a change of behavior. See this example:

>>> s = pd.Series([1,2,3], dtype="Int64")
>>> s2 = pd.Series([1,2,3])
>>> s
0    1
1    2
2    3
dtype: Int64
>>> s2
0    1
1    2
2    3
dtype: int64
>>> s.dtype == s2.dtype
False

So you're proposing that == reports True here, and that's potentially a big change for users.

WillAyd · 2024-11-13T19:30:42Z

In the long run yes...although we definitely need to be careful about the steps that we take to get there.

I'm also coming from the perspective that we've discussed in PDEP-13, where dtype="Int64" and dtype=np.int64 lose their physical nature and only expose their logical behavior to end users

simonjayhawkins · 2024-11-14T10:11:33Z

There seems to be a concern about changing the behavior of equality checks, as it could affect users who rely on the current exact equality checks.

> in the string dtype example, the below can evaluate to False even if both are a StringDtype, but have a different storage

As @WillAyd mentions, this exposes an implementation detail to end users. So could it reasonably be argued that this is a bug, and that a change would be a bugfix rather than a change in behavior, at least for this dtype in isolation, which is currently still considered experimental?

WillAyd · 2024-11-14T15:40:48Z

Yea that's an interesting point that @simonjayhawkins brings up. Do we think there's a huge risk to changing that behavior for strings today?

jorisvandenbossche added the API Design label Nov 13, 2024

simonjayhawkins mentioned this issue Nov 14, 2024

ENH: Need API support and __repr__ to discover the storage used for strings #59342

Open
