Float conversion issue screwing with numeric encoders. #27

Open
germanjoey opened this issue Apr 25, 2019 · 0 comments
germanjoey commented Apr 25, 2019

I almost feel bad for reporting this one.

Using the yacht hydrodynamics UCI dataset, I got this error:

(env) (base) C:\Users\josep\Jeenee\AutoML\automl_train>python model.py -d ..\automl-testbench\yacht-hydrodynamics\data.csv -m train
Traceback (most recent call last):
  File "model.py", line 46, in <module>
    model_train(df, encoders, args, model)
  File "C:\Users\josep\Jeenee\AutoML\automl_train\pipeline.py", line 347, in model_train
    X, y = process_data(df, encoders)
  File "C:\Users\josep\Jeenee\AutoML\automl_train\pipeline.py", line 296, in process_data
    df['Length-beam ratio'].values, encoders['length_beam_ratio_bins'], labels=False, include_lowest=True, duplicates='drop')
  File "C:\Users\josep\Jeenee\AutoML\venv\lib\site-packages\pandas\core\reshape\tile.py", line 235, in cut
    raise ValueError('bins must increase monotonically.')
ValueError: bins must increase monotonically.

Hmmm, odd. Let's take a look at pipeline.py...

    # Length-beam ratio
    length_beam_ratio_enc = df['Length-beam ratio']
    length_beam_ratio_bins = length_beam_ratio_enc.quantile(
        np.linspace(0, 1, 10+1))
    encoders['length_beam_ratio_bins'] = length_beam_ratio_bins
    
    # ....

    # Length-beam ratio
    length_beam_ratio_enc = pd.cut(
        df['Length-beam ratio'].values, encoders['length_beam_ratio_bins'], labels=False, include_lowest=True, duplicates='drop')

The error points at the pd.cut line, which I had previously patched to include the duplicates='drop' bit. The current error isn't related to that, though; it's complaining about the bins stored in the encoder. Hmmm, nothing looks odd in the data for that column. Let's open up pdb and take a look...

>>> encoders['length_beam_ratio_bins']
[2.73, 2.76, 3.15, 3.15, 3.1499999999999995, 3.15, 3.17, 3.32, 3.51, 3.51, 3.64]

facepalm

Well now! I suppose I'll concede that's technically not monotonically increasing!
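
For anyone who wants to reproduce this without the dataset, here is a minimal sketch that feeds the bin edges printed above straight into pd.cut. The `values` array and the `cut_error` helper are made up for illustration and are not part of pipeline.py. It also shows why duplicates='drop' doesn't help: that option only removes exactly equal edges, and the monotonicity check trips on the one-ulp dip from 3.15 to 3.1499999999999995.

```python
import numpy as np
import pandas as pd

# Bin edges copied from the pdb output above.
bins = [2.73, 2.76, 3.15, 3.15, 3.1499999999999995, 3.15,
        3.17, 3.32, 3.51, 3.51, 3.64]

# Hypothetical sample values, only here so pd.cut has something to bin.
values = np.array([2.73, 3.15, 3.64])


def cut_error(edges):
    """Return the ValueError message pd.cut raises for these edges, or None."""
    try:
        pd.cut(values, edges, labels=False, include_lowest=True,
               duplicates='drop')
    except ValueError as exc:
        return str(exc)
    return None


# The edges from the issue fail the monotonicity check.
print(cut_error(bins))

# Edges with only exact duplicates are handled fine by duplicates='drop'.
print(cut_error([2.73, 2.76, 3.15, 3.15, 3.17, 3.64]))
```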

I appended a .round(4) to the two .quantile lines of encoders/numeric (lines 12 and 15), which worked for this test case. This is certainly not an adequate general solution, however, since it will break on data that needs precision at the 5th decimal place...
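
One possible alternative that avoids picking a decimal precision is sketched below. It is untested against the real encoders/numeric code, and `clean_bin_edges` is a hypothetical helper name. The idea is to take a running maximum over the quantile edges, so a tiny downward float step gets pulled back up to the previous edge, and then drop exact duplicates.

```python
import numpy as np
import pandas as pd


def clean_bin_edges(edges):
    """Hypothetical helper: make quantile bin edges strictly increasing.

    A running maximum flattens any small downward step caused by float
    error, and np.unique then removes the exact duplicates that remain.
    """
    edges = np.asarray(edges, dtype=float)
    edges = np.maximum.accumulate(edges)
    return np.unique(edges)


# Bin edges copied from the pdb output in this issue.
raw_bins = [2.73, 2.76, 3.15, 3.15, 3.1499999999999995, 3.15,
            3.17, 3.32, 3.51, 3.51, 3.64]
clean_bins = clean_bin_edges(raw_bins)

# Hypothetical sample values, only here to show that pd.cut now accepts the edges.
codes = pd.cut(np.array([2.73, 3.15, 3.64]), clean_bins,
               labels=False, include_lowest=True)
```

Since the first and last quantiles are the column's min and max, the running maximum leaves the outer edges alone. It only collapses interior edges that moved backwards, so no rounding threshold has to be chosen.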
