Improve numeric inference capabilities Double/Integer/WholeNumber #140

Open
thehomebrewnerd opened this issue Sep 24, 2020 · 3 comments
Labels
enhancement Improvement to an existing feature

Comments

thehomebrewnerd commented Sep 24, 2020

In the current approach to logical type inference, the main differentiator between a Double and an Integer or WholeNumber logical type is the underlying pandas dtype. As a result, the series in the following examples are inferred as Double even though a WholeNumber or Integer type would be more appropriate.

import numpy as np
import pandas as pd
from woodwork.data_column import infer_logical_type

# inferred as Double because NaN forces dtype float64, but could be WholeNumber with Int64 dtype
>>> infer_logical_type(pd.Series([1.0, np.nan, 3.0]))
Double

# inferred as Double because dtype is float64, but could be Integer with Int64 dtype
>>> infer_logical_type(pd.Series([1.0, 2.0, -3.0]))
Double

We should improve inference both for series that contain NaN values and for series that can be represented as Integer or WholeNumber without loss of information. One way to check for loss of information is to cast columns inferred as Double to an integer dtype and test whether the resulting values equal the original float values. NaN values would need to be dropped first, since they cannot be cast to an integer dtype.

>>> all(pd.Series([1.0, 2.0]).astype('int') == pd.Series([1.0, 2.0]))
True
>>> all(pd.Series([1.0, 2.0]).astype('int') == pd.Series([1.0, 2.0002]))
False
>>> all(pd.Series([1.0, np.nan]).dropna().astype('int') == pd.Series([1.0, np.nan]).dropna())
True
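
The check described above could be wrapped in a small helper. The following is only a sketch: `suggest_numeric_type` is a hypothetical name, not part of woodwork, and it compares each value to its floor instead of casting to an integer dtype, which sidesteps errors on infinite values. The returned strings simply mirror the logical type names used in this issue.

```python
import numpy as np
import pandas as pd


def suggest_numeric_type(series: pd.Series) -> str:
    """Hypothetical helper: suggest a logical type name for a float series."""
    values = series.dropna()  # NaN is ignored when deciding the type
    if values.empty:
        return "Double"
    # Infinite values cannot be represented as integers
    if not np.isfinite(values).all():
        return "Double"
    # A value is integral when it equals its own floor
    if not (values == np.floor(values)).all():
        return "Double"
    # WholeNumber is assumed here to mean non-negative integers
    return "WholeNumber" if (values >= 0).all() else "Integer"


print(suggest_numeric_type(pd.Series([1.0, np.nan, 3.0])))  # WholeNumber
print(suggest_numeric_type(pd.Series([1.0, 2.0, -3.0])))    # Integer
print(suggest_numeric_type(pd.Series([1.0, 2.0002])))       # Double
```
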
@thehomebrewnerd thehomebrewnerd changed the title Improve numeric inference capabilities Improve numeric inference capabilities Double/Integer/WholeNumber Sep 24, 2020
@thehomebrewnerd thehomebrewnerd added the enhancement Improvement to an existing feature label Sep 24, 2020
gsheni commented Sep 24, 2020

One thing I would be curious about is whether this causes a significant slowdown with a large number of rows and/or high-precision numbers.
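
One rough way to get a feel for the cost is to time the cast-and-compare check on synthetic data of different sizes. This is a sketch only: `is_lossless_int` is a hypothetical helper name, the row counts are arbitrary, and timings will vary by machine, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd


def is_lossless_int(series: pd.Series) -> bool:
    """Cast-and-compare check from the issue description, with NaN dropped first."""
    values = series.dropna()
    return bool((values.astype("int64") == values).all())


rng = np.random.default_rng(0)
for n_rows in (10_000, 1_000_000):
    # Floats holding whole values versus floats with fractional parts
    whole = pd.Series(rng.integers(-1000, 1000, size=n_rows).astype("float64"))
    fractional = pd.Series(rng.random(n_rows))
    for label, series in (("whole", whole), ("fractional", fractional)):
        # Take the best of three single runs to reduce noise
        seconds = min(timeit.repeat(lambda: is_lossless_int(series), number=1, repeat=3))
        print(f"{n_rows:>9} rows, {label:<10}: {seconds:.4f} s")
```
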

@thehomebrewnerd
Contributor Author

@gsheni The precision question should definitely be investigated before we implement this, so we can be confident that we won't inadvertently drop important information.
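
As one concrete precision case worth keeping in mind: float64 has a 53-bit significand, so not every integer above 2**53 can be stored exactly. The sketch below illustrates that a round trip through int64 can succeed even though the original integer was already altered when it was stored as a float, which means a cast-and-compare check cannot detect that kind of earlier loss.

```python
import pandas as pd

big = 2**53

# 2**53 + 1 is not representable in float64 and rounds to 2**53
print(float(big + 1) == float(big))  # True

as_float = pd.Series([float(big), float(big + 1)])
round_trip = as_float.astype("int64")

# Both entries come back as 2**53; the "+ 1" was lost before the cast
print(round_trip.tolist())  # [9007199254740992, 9007199254740992]

# The cast-and-compare check still reports no loss for this series
print(bool((round_trip == as_float).all()))  # True
```
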

As for performance, adding this check will no doubt add time to the inference process. If we find a significant performance degradation, we could add an inference level setting to the global config that controls whether data-dependent checks like this one are run. That way the user would have some control over the speed/accuracy tradeoff during inference.

import woodwork as ww
ww.config.set_option('inference_level', 'simple')  # Fast, but might miss some more nuanced differences
ww.config.set_option('inference_level', 'advanced')  # More accurate, but slower due to increased data checks
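
To make the idea concrete, here is a minimal sketch of how such a setting might gate the extra check. Everything in it is an assumption for illustration: `CONFIG` is a plain dict standing in for a global config, `infer_float_series` is a hypothetical function, and neither reflects woodwork's actual implementation.

```python
import pandas as pd

# Hypothetical stand-in for a global config; not woodwork's actual config object
CONFIG = {"inference_level": "simple"}


def infer_float_series(series: pd.Series) -> str:
    """Return a logical type name for a float64 series (illustrative only)."""
    if CONFIG["inference_level"] == "simple":
        return "Double"  # dtype-only decision, no scan of the data
    values = series.dropna()
    # Any fractional (or infinite) value keeps the series as Double
    if values.empty or not (values % 1 == 0).all():
        return "Double"
    return "WholeNumber" if (values >= 0).all() else "Integer"


series = pd.Series([1.0, float("nan"), 3.0])
print(infer_float_series(series))  # Double

CONFIG["inference_level"] = "advanced"
print(infer_float_series(series))  # WholeNumber
```

With the `'simple'` level the function never touches the values, so its cost does not grow with the number of rows; the `'advanced'` level pays for one pass over the data.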

@tyler3991

Let's hold off on this until we can track performance before and after implementing it.
