Datacompy: Object/string misinterpreted as float -> false equal result #121

Open
petrafakler opened this issue Nov 2, 2021 · 8 comments
Assignees
Labels
bug Something isn't working enhancement New feature or request

Comments

@petrafakler

While using datacompy.Compare, a string/object column was misinterpreted as float (because the strings contain only digits). Both strings are 34 digits long and differ only in the last digit. The values lost precision when cast to float, so the comparison reports them as equal.

Example:

import pandas as pd
import datacompy

df1 = pd.DataFrame({'ID':[1], 'REFER_NR': ['9998700990704001708177961516923014']})

df2 = pd.DataFrame({'ID':[1], 'REFER_NR': ['9998700990704001708177961516923015']})

compare = datacompy.Compare(
    df1,
    df2,
    join_columns='ID',  # You can also specify a list of columns
    abs_tol=0,  # Optional, defaults to 0
    rel_tol=0,  # Optional, defaults to 0
    df1_name='TEST',  # Optional, defaults to 'df1'
    df2_name='INTE',  # Optional, defaults to 'df2'
)

print(compare.report())

result:

Column Summary

Number of columns in common: 2
Number of columns in TEST but not in INTE: 0
Number of columns in INTE but not in TEST: 0

Row Summary


Matched on: id

Any duplicates on match values: No

Absolute Tolerance: 0

Relative Tolerance: 0

Number of rows in common: 1

Number of rows in TEST but not in INTE: 0

Number of rows in INTE but not in TEST: 0

Number of rows with some compared columns unequal: 0

Number of rows with all compared columns equal: 1

Column Comparison


Number of columns compared with some values unequal: 0

Number of columns compared with all values equal: 2

Total number of values which compare unequal: 0

Maybe the number of digits could help to distinguish a float from an object (string).
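
A quick check of why the two values collapse, and how digit count relates to it: a 64-bit float only preserves about 15 to 17 significant decimal digits, so a 34-digit string cannot survive the cast. The helper name `survives_float_cast` is made up for this illustration and is not part of datacompy.

```python
import sys

a = "9998700990704001708177961516923014"
b = "9998700990704001708177961516923015"

# The strings differ, but both round to the same 64-bit float.
print(a == b)                # False
print(float(a) == float(b))  # True

# Number of decimal digits a float is guaranteed to preserve.
print(sys.float_info.dig)    # 15


def survives_float_cast(s: str) -> bool:
    """Hypothetical helper: True if a digit-only string has few enough
    digits that a float can represent its numeric value exactly."""
    return s.isdigit() and len(s) <= sys.float_info.dig


print(survives_float_cast("12345"))  # True
print(survives_float_cast(a))        # False
```

Digit count alone would not catch every case: strings such as "007" and "7" are different as text but still cast to the same float.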

@fdosani
Member

fdosani commented Nov 2, 2021 via email

@fdosani
Member

fdosani commented Nov 3, 2021

So looking into this issue, it seems to be happening here due to the following code.

Using np.isclose throws an exception, which then leads to logic that casts the values to float. This actually works, but it can have the unintended consequence you are facing. I think one option is to add a flag to not cast and to treat any string as just that, instead of what is happening now.
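
The fallback described above can be sketched roughly as follows. This is a simplified illustration, not datacompy's actual code: it works on single values rather than columns, uses the standard library's `math.isclose` as a stand-in for `np.isclose`, and the function name `values_equal` is invented for the example.

```python
import math


def values_equal(x, y, abs_tol=0.0, rel_tol=0.0):
    """Simplified stand-in for the compare-then-cast fallback."""
    try:
        # Numeric inputs are compared directly.
        return math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
    except TypeError:
        # Strings make isclose raise, so fall back to a float cast.
        try:
            return math.isclose(
                float(x), float(y), rel_tol=rel_tol, abs_tol=abs_tol
            )
        except (TypeError, ValueError):
            # Not numeric at all: compare the raw values.
            return x == y


print(values_equal("abc", "abd"))  # False
# The two digit-only strings from this issue compare as equal after the cast.
print(values_equal("9998700990704001708177961516923014",
                   "9998700990704001708177961516923015"))  # True
```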

@jborchma @elzzhu @ak-gupta @theianrobertson any thoughts/opinions on this?

@petrafakler
Author

Thanks for analysis so far :-)

@elzzhu

elzzhu commented Nov 3, 2021

I think it makes sense to add a flag to not cast since there are instances where IDs are numerical but you don't necessarily want to treat them as such. If the flag was added, would the default behaviour be the current behaviour?

@fdosani
Member

fdosani commented Nov 3, 2021

@elzzhu I think it would default to the current behaviour:

  • Current behaviour is to use isclose and then cast to float if there is an exception.
  • New behaviour: it would still check with isclose first. If an exception is thrown, it would check whether the flag (e.g. dont_cast_string) is True and whether the column is a "string"; if so, it would not cast and would continue.

This should solve the issue and keep the existing behaviour. The only caveat is that, for now, it might need to apply to all columns rather than to selected ones.
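
A rough sketch of how such a flag could behave. The name `dont_cast_string` comes from the proposal above; everything else is hypothetical and does not reflect an actual datacompy API. For brevity it compares single values instead of columns and uses the standard library's `math.isclose` in place of `np.isclose`.

```python
import math


def values_equal(x, y, abs_tol=0.0, rel_tol=0.0, dont_cast_string=False):
    """Hypothetical comparison with an opt-out of the float cast."""
    try:
        return math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
    except TypeError:
        both_strings = isinstance(x, str) and isinstance(y, str)
        if dont_cast_string and both_strings:
            # Flag set: treat strings as strings, no cast.
            return x == y
        try:
            # Flag not set: keep the existing cast-to-float fallback.
            return math.isclose(
                float(x), float(y), rel_tol=rel_tol, abs_tol=abs_tol
            )
        except (TypeError, ValueError):
            return x == y


a = "9998700990704001708177961516923014"
b = "9998700990704001708177961516923015"

print(values_equal(a, b))                         # True (existing behaviour)
print(values_equal(a, b, dont_cast_string=True))  # False
```

One trade-off of this design: with the flag set, strings such as "1.0" and "1.00" would no longer compare as equal, which is why keeping the cast as the default preserves existing results.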

@fdosani fdosani self-assigned this Dec 12, 2021
@fdosani fdosani added bug Something isn't working enhancement New feature or request labels Dec 12, 2021
@fdosani
Member

fdosani commented Dec 12, 2021

I’m going to take a stab at a fix for this this week. Sorry, it fell off my radar.

@james-stead

Any word on this one? We are experiencing this issue as well.

@fdosani
Member

fdosani commented May 29, 2024

> Any word on this one. We are experiencing this issue as well.

@james-stead sorry about that. This sort of fell off the radar a bit. I'm assuming you have some numbers (as strings) which are being cast to a float type, correct? If you are OK with the above proposal, we can add a new optional flag to not cast certain columns.

4 participants