Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error Importing Excel Spreadsheets w/Crash #328

Open
2 tasks done
Deanozk opened this issue Jun 4, 2024 · 7 comments
Open
2 tasks done

Error Importing Excel Spreadsheets w/Crash #328

Deanozk opened this issue Jun 4, 2024 · 7 comments

Comments

@Deanozk
Copy link

Deanozk commented Jun 4, 2024

What happens?

I'm a new user. I installed duckdb version 0.10.0 and followed instructions to import excel spreadsheet. My files contain multiple sheets inside. Simple files load ok however for my private file that has no values in first three rows of first three columns (but does have values after that) I get fatal error and need to restart in some cases. I am also noticing its not clear how to specify which sheet or sheets of a multisheet document to load or how it determines the number of rows. See attached sheets.

To Reproduce

download a complex excel spreadsheet with multiple sheets. I downloaded
SaleData.xlsx # Note I did this after first trying and failing to load my own private xlsx sheet. Steps I took 1. install spatial and load spatial. Its possible on my first attempt I did not load spatial but I am not sure. I received an error trying to load: SELECT * FROM st_read('sludge.xlsx'); This produced an error: SELECT * FROM st_read('sludge.xlsx');
Error: INTERNAL Error: Failed to bind "ST_Read": Table function must return at least one column

  1. Then I tried with SaleData.xlsx and got identical area.
  2. Open another window and started duckdb and ran a very simple spreadsheet names.xlsx that was one row and one column. It worked fine. I split SaleData into separate files one per sheet and it loaded it.
Screenshot 2024-06-04 at 22 51 03 This worked. So I tried the singled the single sheet file from SalesData. That also worked ok. but it loaded over a thousand rows. Perhaps when i downloaded it from google it created extra rows? not sure. Here is the actual file [namelist.xlsx](https://github.com/user-attachments/files/15567144/namelist.xlsx) Maybe it has over 1k rows but I dont think so. So I then tried loading SaleData file and it loaded ok in a way but it loaded 45 rows. In fact there are 8 rows in the first sheet and 44 rows in the second sheet. Why does it appear to have selected the second sheet? After this sort of worked (except for the number of rows) I tried the original sheet that it had trouble loading and got the following errors: Internal Error failed to bind st_reader. Table must return at least one column. (Which table?) can it please say which table it wants to try to get this from? Because spreadsheets can have many right? When I try to load another spreadsheet the SaleData again it gives a fatal error saying it was invalidated by previous error. Screenshot 2024-06-04 at 23 03 19

I will take a look at my private xlsx that seemed to cause the original issue but in any event I think the behavior is odd and not expected. Plus the documentation is unclear on how to load a document with multiple sheets inside. Its simply not clear enough. most spreadsheets DO have multiple sheets inside. So the documentation is off. The logic of which sheets and how many rows get loaded is at best inadequately documented and more likely simply does not work for certain types of spreedsheets. As for error messages I think it should what table its trying to load(name please?) and also if needs values for the column data just say that. Keep in mind thats not a requirement for a valid spreedsheet for data frame as afaik.

Update: the file causing the original error does not have values for the first three columns in the first three rows. It does have values after that. It did load ok into sheets.google.com

Update: Adding values for
a1,b1,c1
a2,b2,c2
a2,b3,c3 which were not present in the original sludge.xlsx does allow the document to load. It seems that having values in the initial rows of the columns are a requirement? Just to load it? Even before creating a table? It loads fine in Spreadsheet. Also with runtime crash more information should be given. And the error message provided did not indicate missing column or row values but something

OS:

M1 Mac OS Silicon Ventura 13.3.1

DuckDB Version:

v0.10.0

DuckDB Client:

Python

Full Name:

Dean Oz

Affiliation:

None

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have not tested with any build

Did you include all relevant data sets for reproducing the issue?

No - Other reason (please specify in the issue body)

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue? Yes I have. However, I do need to check the original file as well that caused the original error.

  • Yes, I have
@duckdblabs-bot
Copy link

Thanks for opening this issue! Based on our automated check, it seems that your post contains some code but it does not use code blocks to format it.

Please double-check your post and revise it if necessary. To employ syntax highlighting, it's recommended to use code blocks with triple backticks, e.g.:

```sql
SELECT ...
```

If this is a false positive, feel free to disregard this comment.

@szarnyasg szarnyasg transferred this issue from duckdb/duckdb Jun 4, 2024
@Deanozk
Copy link
Author

Deanozk commented Jun 4, 2024

I will load up my original spreetsheet sludge.xlsx that caused the original error and check it and report back whats in it or not in it.

@Deanozk
Copy link
Author

Deanozk commented Jun 4, 2024

I can't share the whole sheet but 1) it loaded ok into google sheets so its a valid spreedsheet. 2) its first three columns do not have values for first three rows ,but columns do have values after that. Here is a picture of the columns
Screenshot 2024-06-04 at 23 19 18

@Deanozk
Copy link
Author

Deanozk commented Jun 4, 2024

ABC have do not have values for the first three rows but they do after that. D-AE have values in all rows. There is only one sheet in the document.

@Deanozk
Copy link
Author

Deanozk commented Jun 4, 2024

Only other thing besides not having values for the first three columns in the first three rows is that there is some vertically aligned text in some columns as opposed to horizontally aligned.

@Deanozk
Copy link
Author

Deanozk commented Jun 4, 2024

I added a1, b1, b3
a2, b2, b3
a3, b3, c3
and the sheet loaded after that. It does appear to report up to 1000k plus rows? Not sure why it loads that many. here is the sheet as downloaded. Please see that sd.xlsx also loads 1k plus rows.

@Deanozk
Copy link
Author

Deanozk commented Jun 4, 2024

In summary it seems that first colunn rows must be filled in or it will not even import as distinct from create a table. The error message related to there not being at least one column is not descriptive as the table has many columns and all have values at some point. It was not decided that the first column for example is the primary etc. So I think this error message is inaccurate. Also there needs to be more reported for a full crash like this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants