API: creating DataFrame with no columns: object vs string dtype columns? #60338

jorisvandenbossche · 2024-11-16T17:19:29Z

A typical case we encounter in the tests is starting from an empty DataFrame, and then adding some columns.

Simplified example of this pattern:

df = pd.DataFrame()
df["a"] = values
...

The DataFrame starts with an empty Index for its columns, and the default dtype for an empty Index is object dtype. Inserting string labels for the actual columns into that Index then preserves the object dtype.

As long as we used object dtype for string column names, this was perfectly fine. But now that we will infer str dtype for actual string column names, it gets a bit annoying that the pattern above results in object rather than str columns.

This is not the best pattern, so maybe it's OK that it does not give the ideal result. But at the same time, since we use it quite regularly in our own tests, I suppose it is not that uncommon.
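For comparison, here is a minimal sketch of the incremental pattern next to passing all columns to the constructor at once. The helper names and the data are illustrative assumptions, not taken from the pandas test suite, and the dtype printed for each columns Index depends on the pandas version and on whether string inference is enabled.

```python
import pandas as pd


def build_incrementally(data):
    """Start from an empty DataFrame and add one column at a time."""
    df = pd.DataFrame()
    for label, values in data.items():
        df[label] = values
    return df


def build_from_dict(data):
    """Pass all columns to the constructor at once."""
    return pd.DataFrame(data)


# Illustrative data (hypothetical, for demonstration only).
data = {"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]}

incremental = build_incrementally(data)
direct = build_from_dict(data)

# Labels and values are the same either way; only the dtype of the
# columns Index may differ, depending on pandas version and options.
print("incremental:", incremental.columns.dtype)
print("from dict:  ", direct.columns.dtype)
```

Comparing the two printed dtypes on a given installation shows whether the incremental pattern is affected there.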


WillAyd · 2024-11-16T17:35:05Z

I wonder if it would be less disruptive to have the empty Index default to a string data type and coerce to object as needed (at least when used in columns).

jorisvandenbossche · 2024-11-16T19:13:01Z

I was actually wrong about the default empty index being object dtype. While that is the case for directly creating an empty index, for DataFrame/Series we already deviate from that and create an empty range index:

>>> pd.DataFrame().index
RangeIndex(start=0, stop=0, step=1)
>>> pd.DataFrame().columns
RangeIndex(start=0, stop=0, step=1)
>>> pd.Index([])
Index([], dtype='object')

Now, the result is the same because inserting a string label in the integer-like range index also upcasts to object dtype.
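The upcast can also be looked at on the Index alone. This is a rough sketch that assumes adding a new column ends up calling `Index.insert` on the existing columns, which is an assumption about pandas internals rather than something stated in this thread. No expected output is given because the resulting dtypes vary across pandas versions.

```python
import pandas as pd

# The two kinds of "empty" index discussed above.
empty_from_frame = pd.DataFrame().columns  # columns of an empty DataFrame
empty_plain = pd.Index([])                 # directly created empty Index

# Insert a single string label at position 0 in each of them.
from_frame = empty_from_frame.insert(0, "a")
from_plain = empty_plain.insert(0, "a")

# Report the index class and dtype before and after the insertion.
for name, idx in [
    ("empty (DataFrame)", empty_from_frame),
    ("empty (Index)", empty_plain),
    ("after insert (DataFrame)", from_frame),
    ("after insert (Index)", from_plain),
]:
    print(f"{name}: {type(idx).__name__}, dtype={idx.dtype}")
```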

But yeah, I think it could make sense for the columns to be string by default.
This would be a backwards-incompatible change for the case where you start with an empty DataFrame and then insert columns with integer labels (those would then be cast to object dtype, instead of preserving the integer dtype).
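The integer-label case that such a change would affect looks roughly like this. The values are made up for illustration, and the printed dtype is what a string default for empty columns could change.

```python
import pandas as pd

# Start from an empty DataFrame and add columns with integer labels.
df = pd.DataFrame()
df[0] = [10, 20]
df[1] = [30, 40]

# The labels themselves stay integers; the dtype of the columns Index
# is what a string default for empty columns could change.
print(list(df.columns))
print(df.columns.dtype)
```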

WillAyd · 2024-11-16T19:25:32Z

Good point, although it's hard to make any guarantees about what the data type of an empty DataFrame is with our current data model.

Might be another good motivating factor for PDEP-13 #58455 to implement the Null type and use that as the default. That's of course a ways off; in the meantime I think we just have to make a best effort at this, which I think would be assuming string column labels

jorisvandenbossche added the API Design, Strings, and Index labels Nov 16, 2024

jorisvandenbossche mentioned this issue Nov 16, 2024

TST (string dtype): resolve xfails for frame methods #60336

Draft

jorisvandenbossche added this to the 2.3 milestone Nov 16, 2024

simonjayhawkins mentioned this issue Nov 18, 2024

TST (string dtype): clean-up assorted xfails #60345

Merged
