API: creating DataFrame with no columns: object vs string dtype columns? #60338

Open
jorisvandenbossche opened this issue Nov 16, 2024 · 3 comments
Labels
API Design Index Related to the Index class or subclasses Strings String extension data type and string data

@jorisvandenbossche
Member

A typical case we encounter in the tests is starting from an empty DataFrame, and then adding some columns.

Simplified example of this pattern:

df = pd.DataFrame()
df["a"] = values
...

The dataframe starts with an empty Index for the columns, and the default dtype for an empty Index is object dtype. Inserting string labels for the actual columns into that Index object then preserves the object dtype.

As long as we used object dtype for string column names, this was perfectly fine. But now that we will infer str dtype for actual string column names, it gets a bit annoying that the pattern above does not result in str but object columns.

This is not the best pattern, so maybe it's OK that this does not give the ideal result. But at the same time, since we even use it quite regularly in our own tests, I suppose this is not that uncommon.
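
To check what this pattern produces on a given installation, something like the following minimal sketch could be used. The values are made up for illustration, and the resulting dtype is deliberately not stated here because it depends on the pandas version and on whether the new string dtype is enabled.

```python
import pandas as pd

# Start from a DataFrame with no columns, then add one with a string label.
df = pd.DataFrame()
df["a"] = [1, 2, 3]  # hypothetical values, for illustration only

# Inspect the column Index that results from the insertion.
print(df.columns)
print(df.columns.dtype)

# For comparison: labels passed to the constructor go through normal inference.
df_direct = pd.DataFrame({"a": [1, 2, 3]})
print(df_direct.columns.dtype)
```

If the two printed dtypes differ, that is the inconsistency described above.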

@jorisvandenbossche jorisvandenbossche added API Design Strings String extension data type and string data Index Related to the Index class or subclasses labels Nov 16, 2024
@WillAyd
Member

WillAyd commented Nov 16, 2024

I wonder if it would be less disruptive to have the empty Index default to a string data type and coerce to object as needed (at least when used in columns).

@jorisvandenbossche
Member Author

I was actually wrong about the default empty index being object dtype. While that is the case for directly creating an empty index, for DataFrame/Series we already deviate from that and create an empty range index:

>>> pd.DataFrame().index
RangeIndex(start=0, stop=0, step=1)
>>> pd.DataFrame().columns
RangeIndex(start=0, stop=0, step=1)
>>> pd.Index([])
Index([], dtype='object')

Now, the result is the same because inserting a string label in the integer-like range index also upcasts to object dtype.
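
The upcast can be seen in isolation, without a DataFrame, by inserting into an empty range index directly. This is only a sketch: `pd.RangeIndex(0)` stands in for the columns of an empty DataFrame, and the exact dtype of the result is printed rather than assumed, since it may differ between pandas versions.

```python
import pandas as pd

# An empty range index, the same kind of object as pd.DataFrame().columns.
empty_range = pd.RangeIndex(0)

# Inserting a string label cannot be represented as a range, so pandas
# falls back to a regular Index.
with_label = empty_range.insert(0, "a")

print(type(empty_range).__name__)  # RangeIndex
print(type(with_label).__name__, with_label.dtype)
```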

But yeah, I think it could make sense for the columns to be string by default.
This would be a backwards incompatible change for the case where you start with an empty dataframe and then insert columns with integer labels (those would then be cast to object dtype, instead of preserving the integer dtype).
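
The case that would change can be sketched as follows. The column values are made up, and the check reflects the current behaviour described in this comment (empty range index as the starting point), not the behaviour after any proposed change.

```python
import pandas as pd

# Start empty, then insert columns with integer labels.
df = pd.DataFrame()
df[0] = [1.0, 2.0]  # hypothetical values, for illustration only
df[1] = [3.0, 4.0]

# With an empty range index as the starting point, the integer labels
# keep an integer dtype rather than being cast to object.
print(df.columns)
print(pd.api.types.is_integer_dtype(df.columns.dtype))
```

If the empty columns defaulted to a string dtype instead, the same code would be expected to end up with object dtype columns.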

@WillAyd
Member

WillAyd commented Nov 16, 2024

Good point, although it's hard to make any guarantees about what the data type of an empty dataframe is with our current data model.

Might be another good motivating factor for PDEP-13 #58455 to implement the Null type and use that as the default. That's of course a ways off; in the meantime I think we just have to make a best effort at this, which I think means assuming string column labels.
