Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

29.02 cell cast to 29.020000000000003 string #289

Closed
severinh opened this issue Sep 27, 2024 · 7 comments · Fixed by #292
Closed

29.02 cell cast to 29.020000000000003 string #289

severinh opened this issue Sep 27, 2024 · 7 comments · Fixed by #292
Labels
🦀 rust 🦀 Pull requests that edit Rust code
Milestone

Comments

@severinh
Copy link

severinh commented Sep 27, 2024

Steps to reproduce

See the Excel file decimal-numbers.xlsx. It contains a column with two decimals:

Decimals
28.14
29.02

Read the Excel file using fastexcel:

read_excel("decimal-numbers.xlsx").load_sheet_by_name("Sheet1").to_polars()
shape: (2, 1)
┌──────────┐
│ Decimals │
│ ---      │
│ f64      │
╞══════════╡
│ 28.14    │
│ 29.02    │
└──────────┘

Looks fine.

Then read the Excel file while casting to strings:

read_excel("decimal-numbers.xlsx").load_sheet_by_name("Sheet1", dtypes={0: 'string'}).to_polars()
shape: (2, 1)
┌────────────────────┐
│ Decimals           │
│ ---                │
│ str                │
╞════════════════════╡
│ 28.14              │
│ 29.020000000000003 │
└────────────────────┘

The expected result was that the strings would be the same as shown to the user in the Excel file:

shape: (2, 1)
┌──────────┐
│ Decimals │
│ ---      │
│ str      │
╞══════════╡
│ 28.14    │
│ 29.02    │
└──────────┘

Wrap-up

I understand this looks like an issue due to floating point precision, and I'm not sure if this:

  1. could be fixed in fastexcel
  2. could be fixed in the underlying calamine.
  3. cannot be fixed at all, since it's a fundamental property of the Excel file format or parsing process.

What's the motivation for filing this bug: In our system, we have highly heterogeneous data, so we have to read all Excel values as strings. However, if users see 29.02 in their Excel files, but 29.020000000000003 in our system, that's highly confusing and surprising to users.

What do you think?

@PrettyWood
Copy link
Member

PrettyWood commented Sep 27, 2024

Hello @severinh
Always great to have a clear bug issue with file and clear explanation thank you!
I checked quickly and indeed it feels hard to fix on fastexcel side

In src/data.rs, in create_string_array function we simply do

cell.as_string()

Even with

cell.get_float().map(|v| v.to_string())

or

cell.as_f64()

we get the same behavior with this floating precision issue and AFAIK there is no way for us to get more info on the original input.
I'll try to dig on calamine side tonight

EDIT: quick check on calamine

    inner: [
        SharedString(
            "Decimals",
        ),
        Float(
            28.14,
        ),
        Float(
            29.020000000000003,
        ),
    ],

when reading the xml content we get

BytesText { content: Borrowed("29.020000000000003") }

so I guess it goes even further directly in the xml content 😞

@jmcnamara
Copy link

The floating point number in the Excel file is 29.020000000000003:

    <row r="2" spans="1:1" x14ac:dyDescent="0.2">
      <c r="A2" s="1">
        <v>28.14</v>
      </c>
    </row>
    <row r="3" spans="1:1" x14ac:dyDescent="0.2">
      <c r="A3" s="1">
        <v>29.020000000000003</v>
      </c>
    </row>

For the display (and storage) of floating point numbers Excel uses something very similar to the printf() format option %.16G:

$ python
>>> print("%.16G" % 29.020000000000003)
29.02

That format option isn't supported by Rust in std::fmt however.

@PrettyWood
Copy link
Member

PrettyWood commented Oct 2, 2024

Thanks @jmcnamara indeed it's really in the xml content itself.

Maybe we could use rust_decimal crate that implements already a "smart" Display

With

[dependencies]
rust_decimal = { version = "1.36.0", default-features = false }
use rust_decimal::prelude::*;

fn main() {
    let x = 29.020000000000003_f64;
    println!("x: {}", Decimal::from_f64(x).unwrap());
}
▶ cargo r -q
x: 29.02

WDYT?

@lukapeschke lukapeschke added the 🦀 rust 🦀 Pull requests that edit Rust code label Oct 2, 2024
@lukapeschke lukapeschke added this to the v0.12.0 milestone Oct 2, 2024
@jmcnamara
Copy link

jmcnamara commented Oct 2, 2024

For the display (and storage) of floating point numbers Excel uses something very similar to the printf() format option %.16G:

I was wrong about this for the display part. Excel displays decimal numbers with 8 digits in a standard cell width and 10 digits in a wide cell. Like this:

Format = 0.000000000 Unformatted, wide cell Unformatted, standard width
1.123456789 1.123456789 1.123457
12.123456789 12.12345679 12.12346
123.123456789 123.1234568 123.1235
1234.123456789 1234.123457 1234.123
12345.123456789 12345.12346 12345.12
123456.123456789 123456.1235 123456.1

It also trims trailing zeros and the decimal point if there are no fractional part.

@PrettyWood
Copy link
Member

Again thank you for the info. Much appreciated.
I reckon we should make a PR on calamine side. I can maybe do that this weekend

@severinh
Copy link
Author

severinh commented Oct 4, 2024

Huge thanks for the deep investigation into this issue, @PrettyWood and @jmcnamara! Learned a lot from reading this discussion.

I would trust you to make the right decisions if and where this should be changed. I also hope that this won't have any negative impact on other users of calamine and fastexcel (such as performance).

@PrettyWood
Copy link
Member

We chose to support it directly in fastexcel for now. It should have very little impact because it's only applied for float to string conversion

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🦀 rust 🦀 Pull requests that edit Rust code
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants