Improve numeric inference capabilities Double/Integer/WholeNumber #140

thehomebrewnerd · 2020-09-24T18:34:19Z

In the current approach to logical type inference, the main differentiator between a Double and an Integer or WholeNumber logical type is the underlying pandas dtype. As a result, the series in the following examples are inferred to have a logical type of Double when a WholeNumber or Integer type would be more appropriate.

import numpy as np
import pandas as pd
from woodwork.data_column import infer_logical_type

# inferred as Double because dtype is float64, but could be WholeNumber with Int64 dtype
>>> infer_logical_type(pd.Series([1.0, np.nan, 3.0]))
Double

# inferred as Double because dtype is float64, but could be Integer with Int64 dtype
>>> infer_logical_type(pd.Series([1.0, 2.0, -3.0]))
Double

We should improve the inference for series that contain NaN values and for series that can be represented as Integer or WholeNumber types without loss of information. One way to check for loss of information would be to cast columns inferred as Double to integer and then determine whether the cast values equal the original float values. NaN values would need to be dropped first, since they cannot be cast to a plain integer dtype.

>>> all(pd.Series([1.0, 2.0]).astype('int') == pd.Series([1.0, 2.0]))
True
>>> all(pd.Series([1.0, 2.0002]).astype('int') == pd.Series([1.0, 2.0002]))
False
>>> all(pd.Series([1.0, np.nan]).dropna().astype('int') == pd.Series([1.0, np.nan]).dropna())
True
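
The check above could be wrapped in a small helper. What follows is only a minimal sketch, not Woodwork's implementation: `is_lossless_integer` and `suggest_numeric_type` are hypothetical names, the logical types are returned as plain strings, and it assumes a float-dtype series and that WholeNumber means non-negative integers.

```python
import numpy as np
import pandas as pd


def is_lossless_integer(series: pd.Series) -> bool:
    """True if every non-null value survives a float -> int64 -> float round trip."""
    values = series.dropna()  # NaN cannot be cast to a plain integer dtype
    if values.empty or not np.isfinite(values).all():
        return False  # nothing to decide on, or inf values with no integer form
    return bool((values.astype("int64") == values).all())


def suggest_numeric_type(series: pd.Series) -> str:
    """Hypothetical helper: name the narrowest numeric logical type for a float series."""
    if not is_lossless_integer(series):
        return "Double"
    # Assumption: WholeNumber covers non-negative integers only
    return "WholeNumber" if (series.dropna() >= 0).all() else "Integer"


print(suggest_numeric_type(pd.Series([1.0, np.nan, 3.0])))  # WholeNumber
print(suggest_numeric_type(pd.Series([1.0, 2.0, -3.0])))    # Integer
print(suggest_numeric_type(pd.Series([1.0, 2.0002])))       # Double
```

An all-NaN series falls back to Double here; that is a design choice for the sketch, not something settled in this issue.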


gsheni · 2020-09-24T18:58:36Z

One thing I would be curious about is whether this causes a significant slowdown with a large number of rows and/or a lot of precision in the numbers.
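
One rough way to get a first number for this would be to time the round-trip comparison on a synthetic column. This is only a sketch: the row count, value range, NaN ratio, and the `round_trip_check` name are arbitrary assumptions, and the result will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 1_000_000  # arbitrary size chosen for illustration

# Integral values stored as float64, with every tenth value set to NaN
series = pd.Series(rng.integers(-1000, 1000, size=n_rows).astype("float64"))
series.iloc[::10] = np.nan


def round_trip_check(s: pd.Series) -> bool:
    """Drop NaN, cast to int64, and compare against the original floats."""
    values = s.dropna()
    return bool((values.astype("int64") == values).all())


# Average wall-clock time over a few runs
seconds = timeit.timeit(lambda: round_trip_check(series), number=5) / 5
print(f"{n_rows:,} rows: {seconds * 1000:.1f} ms per check")
```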

thehomebrewnerd · 2020-09-24T19:07:14Z

@gsheni The precision question is one that should definitely be investigated before we implement just to make sure we are confident that we won't drop important information inadvertently.
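
One precision limit worth keeping in mind: float64 cannot represent every integer above 2**53, so a value that was already rounded when it was stored as a float will still pass a cast-and-compare check. A small plain-Python illustration (not Woodwork code):

```python
big = 2**53 + 1             # 9007199254740993, not exactly representable as float64
as_float = float(big)       # rounds to the nearest representable value
round_trip = int(as_float)  # cast back to integer, as the proposed check would

passes_check = round_trip == as_float
print(round_trip == big)  # False: the original integer was lost before the check
print(passes_check)       # True: the cast-and-compare check cannot see that loss
```

So the check can confirm that the float values are integral, but it cannot recover information lost before the data reached inference.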

As for performance, no doubt adding this check will add time to the inference process. If we find a significant performance degradation, we could potentially add some type of inference level setting to the global config to control some of these types of things. That way the user would have some control over the speed/accuracy tradeoff during inference.

import woodwork as ww
ww.config.set_option('inference_level', 'simple')  # Fast, but might miss some more nuanced differences
ww.config.set_option('inference_level', 'advanced')  # More accurate, but slower due to increased data checks

tyler3991 · 2021-02-02T19:38:26Z

Let's wait on this until we can track performance pre/post implementing this.

thehomebrewnerd changed the title ~~Improve numeric inference capabilities~~ Improve numeric inference capabilities Double/Integer/WholeNumber Sep 24, 2020

thehomebrewnerd added the enhancement Improvement to an existing feature label Sep 24, 2020
