Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read xlsx file using 1st row as header even using "open_options=['HEADERS=DISABLE', 'FIELD_TYPES=STRING'])" #323

Open
michaelpopes opened this issue May 26, 2024 · 2 comments

Comments

@michaelpopes
Copy link

michaelpopes commented May 26, 2024

DUCKDB_READ_XLSX_BUG.xlsx

As you can see, there are two sheets in this Excel file. I wish to read all the cells as data, so I use this option "open_options=['HEADERS=DISABLE', 'FIELD_TYPES=STRING'])".

import duckdb

duckdb.install_extension("spatial")
duckdb.load_extension("spatial")

duckdb.sql(f"FROM st_read('DUCKDB_READ_XLSX_BUG.xlsx', layer='2', open_options=['HEADERS=DISABLE', 'FIELD_TYPES=STRING'])").show(max_width=198)
duckdb.sql(f"FROM st_read('DUCKDB_READ_XLSX_BUG.xlsx', layer='3', open_options=['HEADERS=DISABLE', 'FIELD_TYPES=STRING'])").show(max_width=198)

The results are different for the two sheets. The result of sheet "2" diables the header, which is correct. But the result of sheet "3" still uses 1st row as head, which is wrong. I don't see any difference between sheet "2" and sheet "3" to cause the different results.

Python 3.12.3 (tags/v3.12.3:f6650f9, Apr  9 2024, 14:05:25) [MSC v.1938 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import duckdb
>>> duckdb.sql(f"SELECT VERSION()").show(max_width=198)
┌───────────┐
│ version() │
│  varchar  │
├───────────┤
│ v0.10.3   │
└───────────┘

>>>
>>> duckdb.install_extension("spatial")
>>> duckdb.load_extension("spatial")
>>>
>>> duckdb.sql(f"FROM st_read('DUCKDB_READ_XLSX_BUG.xlsx', layer='2', open_options=['HEADERS=DISABLE', 'FIELD_TYPES=STRING'])").show(max_width=198)
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Field1  │ Field2  │ Field3  │ Field4  │ Field5  │ Field6  │ Field7  │ Field8  │ Field9  │ Field10 │ Field11 │ Field12 │ Field13 │ Field14 │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ AC      │ BC      │ CC      │ DC      │ EC      │ FC      │ GC      │ HC      │ IC      │ JC      │ KC      │ LC      │ MC      │ NC      │
│ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

>>> duckdb.sql(f"FROM st_read('DUCKDB_READ_XLSX_BUG.xlsx', layer='3', open_options=['HEADERS=DISABLE', 'FIELD_TYPES=STRING'])").show(max_width=198)
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│   AC    │   BC    │   CC    │   DC    │   EC    │   FC    │   GC    │   HC    │   IC    │   JC    │   KC    │   LC    │   MC    │   NC    │   OC    │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │
│ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │ NULL    │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

As you can see from the results, I am using python 3.12.3 and Duckdb v0.10.3.

The correct results should be that both sheet "2" and sheet "3" have the same head, like "Field1", "Field2", etc.

Thank you for your consideration!

@iku000888
Copy link

I ran into a very similar issue attempting to read a file with multiple sheets, where the first sheet respects HEADERS=DISABLE but not for the second sheet.

D .version
SQLite v1.1.1 af39bd0dcf
clang-15.0.0

@iku000888
Copy link

I seem to be able to work around it by exporting the sheet of interest to a single sheeted xlsx file and then reading the exported xlsx. Not great but I can possibly live with it.

D COPY (select * from st_read('file.xlsx', layer = 'sheet')) TO 'tempoutput.xlsx' WITH (FORMAT GDAL, DRIVER 'xlsx');

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants