schema_sample_rows=0 results in a table filled with null values #236

niekrongen · 2024-05-22T17:40:05Z

When reading a excel file and setting the schema_sample_rows to 0 results in a table of the correct height, but all values are set to null.
If the schema_sample_rows is set to n and the first n values in a column are empty, then the remainder of the values in this column are also filled with null values.
This means that the values that are present in this column, are replaced with null values.
This is causing a loss of data, while this can be fixed by defaulting to a column dtype of type string.

For example, xlsx2csv has this also as default.

PrettyWood · 2024-05-22T18:55:25Z

I don't really understand.
schema_sample_rows sets the number of rows used to infer the type of the columns.
If you set it to 0, we can't infer anything because no data is read hence the null type being generated

niekrongen · 2024-05-22T19:04:49Z

I understand that the type cannot be inferred, but why would you replace all the values with null then?

I have a table filled with values, but I do not want to have any types inferred as I want to assign them by myself.
However, the columns are all filled with null values instead of the values that are present in the column.
I would be fine with the fact that the column dtype is null, but then all the values are kept as strings for example.
In that way I would be able to cast the columns by myself, to the values that I want them to be and prevent the extra overhead of inferring the dtype.

niekrongen · 2024-05-22T19:14:18Z

For example the following excel on the left with schema_sample_rows = 0 results in the table on the right after reading it with all dtypes being null.

Another example is the following excel on the left with schema_sample_rows = 1 results in the table on the right after reading it with the dtypes of the first three columns being string and the last column being of dtype null.

I do not see why this is wanted behavior to remove all data from the column if the dtype cannot be inferred.
Could you please elaborate why this is the default behavior and how to prevent the replacement in way that is not casting all 16384 columns with the dtypes parameter to type string.

Tim-Kracht · 2024-05-22T21:48:09Z

I ran into this as well. I have very inconsistent excel files and the requirement is to load all data as strings. Since I don't know how many columns there are, I am currently doing a 3-step process: load the first row, build a dtype map based on the width of the row, then load the sheet using the map. The time hit for the double load is noticeable for larger worksheets. I am certainly open to better solutions. I have tried some combinations of Polars and fastexcel and always run into trouble... dropped columns/null data at one end and unsupported column type combination: {DateTime, String} at the other.

I was also going to open a feature request to pass a single dtype to apply to all columns.

ws = wb.load_sheet_by_name(name=sheet_name, header_row=None, n_rows=1, schema_sample_rows=0)
type_map = {i: "string" for i in range(ws.width + 1)}
ws = wb.load_sheet_by_name(name=sheet_name, header_row=None, dtypes=type_map)

niekrongen · 2024-05-23T05:29:39Z

The solution that I have used with polars is by creating the type_map for all 16384 columns as this is the limit of an excel file. Of course I am not creating it for the all columns but for my purpose with polars I have set it for the first 100 columns. This returns the correct dataframe. The extra columns in the type_map are not created at all, so therefore you are able to create a type_map without knowing the width of the worksheet in the first place.

PrettyWood · 2024-07-20T23:29:00Z

@Tim-Kracht @niekrongen we made some improvements by supporting most mix types.
Does it solve your problems?
I don't want to add another option but if a way to have all columns in string is really needed, we'll work on a solution (like #257)

PrettyWood · 2024-10-13T15:05:10Z

@lukapeschke I reckon we should forbid 0 as value. It makes no sense. And then we can close this issue. WDYT?

lukapeschke · 2024-10-14T09:59:46Z

@PrettyWood I agree, having schema_sample_rows=0 should result in an InvalidParameters error

closes #236 Signed-off-by: Luka Peschke <[email protected]>

lukapeschke · 2024-10-15T15:15:05Z

@Tim-Kracht @niekrongen #304 will forbid setting schema_sample_rows=0 .

Since v0.12.0, it is possible to specify dtypes='string' to load_sheet, which will have the same effect as specifying string for all columns

niekrongen mentioned this issue May 22, 2024

read_excel with engine="calamine" infer_schema_length=0 returns an empty DataFrame pola-rs/polars#16408

Closed

2 tasks

lukapeschke added a commit that referenced this issue Oct 15, 2024

fix: forbid schema_sample_rows=0

97b4e5b

closes #236 Signed-off-by: Luka Peschke <[email protected]>

lukapeschke mentioned this issue Oct 15, 2024

feat: forbid schema_sample_rows=0 #304

Merged

PrettyWood closed this as completed in #304 Oct 15, 2024

PrettyWood closed this as completed in 17f0186 Oct 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

schema_sample_rows=0 results in a table filled with null values #236

schema_sample_rows=0 results in a table filled with null values #236

niekrongen commented May 22, 2024

PrettyWood commented May 22, 2024

niekrongen commented May 22, 2024

niekrongen commented May 22, 2024

Tim-Kracht commented May 22, 2024

niekrongen commented May 23, 2024

PrettyWood commented Jul 20, 2024 •

edited

Loading

PrettyWood commented Oct 13, 2024 •

edited

Loading

lukapeschke commented Oct 14, 2024

lukapeschke commented Oct 15, 2024

schema_sample_rows=0 results in a table filled with null values #236

schema_sample_rows=0 results in a table filled with null values #236

Comments

niekrongen commented May 22, 2024

PrettyWood commented May 22, 2024

niekrongen commented May 22, 2024

niekrongen commented May 22, 2024

Tim-Kracht commented May 22, 2024

niekrongen commented May 23, 2024

PrettyWood commented Jul 20, 2024 • edited Loading

PrettyWood commented Oct 13, 2024 • edited Loading

lukapeschke commented Oct 14, 2024

lukapeschke commented Oct 15, 2024

PrettyWood commented Jul 20, 2024 •

edited

Loading

PrettyWood commented Oct 13, 2024 •

edited

Loading