schema_sample_rows=0 results in a table filled with null values #236
Comments
I don't really understand.
I understand that the type cannot be inferred, but why would you replace all the values with null then? I have a table filled with values, but I do not want to have any types inferred, as I want to assign them myself.
I ran into this as well. I have very inconsistent Excel files and the requirement is to load all data as strings. Since I don't know how many columns there are, I am currently doing a 3-step process: load the first row, build a dtype map based on the width of the row, then load the sheet using the map. The time hit for the double load is noticeable for larger worksheets. I am certainly open to better solutions. I have tried some combinations of [...]. I was also going to open a feature request to pass a single [...].
ws = wb.load_sheet_by_name(name=sheet_name, header_row=None, n_rows=1, schema_sample_rows=0)
type_map = {i: "string" for i in range(ws.width + 1)}
ws = wb.load_sheet_by_name(name=sheet_name, header_row=None, dtypes=type_map)
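For anyone who wants to reuse the two-pass workaround from the comment above, here is a rough sketch of it wrapped in a helper. The function names `string_dtype_map` and `load_sheet_as_strings` are made up for illustration; the only library calls are the `load_sheet_by_name` keyword arguments and the `width` attribute already shown in the comment. It assumes `width` is the number of columns and that column indices start at 0, so it builds the map over `range(width)`; that part is an assumption, not something confirmed in this thread.

```python
from typing import Any, Dict


def string_dtype_map(width: int) -> Dict[int, str]:
    """Map every 0-based column index in [0, width) to the "string" dtype."""
    return {i: "string" for i in range(width)}


def load_sheet_as_strings(wb: Any, sheet_name: str) -> Any:
    """Two-pass load: probe the sheet width, then reload with all columns as strings.

    `wb` is expected to behave like the workbook object used in the comment above.
    """
    # Pass 1: read a single row, only to learn how many columns the sheet has.
    probe = wb.load_sheet_by_name(
        name=sheet_name, header_row=None, n_rows=1, schema_sample_rows=0
    )
    # Pass 2: reload the whole sheet with an explicit dtype for every column.
    return wb.load_sheet_by_name(
        name=sheet_name, header_row=None, dtypes=string_dtype_map(probe.width)
    )
```

This does not remove the cost of the double load; it only keeps the probing logic in one place.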
The solution that I have used with polars is by creating the [...]
@Tim-Kracht @niekrongen we made some improvements by supporting most mixed types.
@lukapeschke I reckon we should forbid [...]
@PrettyWood I agree, having [...]
closes #236 Signed-off-by: Luka Peschke <[email protected]>
@Tim-Kracht @niekrongen #304 will forbid setting [...]. Since v0.12.0, it is possible to specify [...]
When reading an Excel file with schema_sample_rows set to 0, the resulting table has the correct height, but all values are null.
If schema_sample_rows is set to n and the first n values in a column are empty, the remaining values in that column are also filled with null.
In other words, values that are actually present in the column are replaced with null values.
This causes a loss of data, which could be avoided by defaulting to a string column dtype when no type can be inferred.
For example, xlsx2csv has this as its default.
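To make the reported behaviour easier to reason about, here is a toy model in plain Python. It is not fastexcel's implementation; `infer_dtype`, `read_column`, and the dtype names are hypothetical and only mimic the idea of inferring a column type from the first few rows, with a configurable fallback when the sample contains no values.

```python
from typing import Any, List, Optional


def infer_dtype(values: List[Optional[Any]], sample_rows: int, fallback: str) -> str:
    """Infer a dtype from the first `sample_rows` cells; use `fallback` if none are set."""
    sample = [v for v in values[:sample_rows] if v is not None]
    if not sample:
        return fallback
    if all(isinstance(v, (int, float)) for v in sample):
        return "float"
    return "string"


def read_column(values: List[Optional[Any]], sample_rows: int, fallback: str) -> List[Optional[Any]]:
    """Read a column using the dtype inferred from the sample."""
    dtype = infer_dtype(values, sample_rows, fallback)
    if dtype == "null":
        # A null-typed column cannot hold anything, so every value is dropped.
        return [None] * len(values)
    if dtype == "string":
        return [None if v is None else str(v) for v in values]
    out: List[Optional[Any]] = []
    for v in values:
        try:
            out.append(None if v is None else float(v))
        except (TypeError, ValueError):
            out.append(None)  # value does not fit the inferred numeric dtype
    return out


column = [None, None, "a", 1.5]
print(read_column(column, sample_rows=2, fallback="null"))    # [None, None, None, None]
print(read_column(column, sample_rows=2, fallback="string"))  # [None, None, 'a', '1.5']
print(read_column(column, sample_rows=0, fallback="string"))  # [None, None, 'a', '1.5']
```

With a null fallback, the two real values are lost as soon as the sampled rows are empty; with a string fallback, they survive as text, which is the behaviour requested here.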