DataTable init does not replace NaNs with pd.NA for float data types #128

Open
gsheni opened this issue Sep 22, 2020 · 11 comments
Labels
bug Something isn't working

Comments

@gsheni
Contributor

gsheni commented Sep 22, 2020

  • We want DataTable to use a single representation of missing values (pd.NA). This is a forward-looking feature of pandas.

The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).

  • The DataTable init has a replace_none parameter, which defaults to True. However, it does not work for some of the data types passed into the DataTable:
import numpy as np
import pandas as pd
import woodwork as ww

# Four columns, each holding one missing value
d = {'col1': [1, 2, np.nan], 'col2': [3, 4, None],
     'col3': pd.Series([1, 2, np.nan], dtype='Int64'),
     'col4': pd.Series([1, 2, None], dtype='string')}
df = pd.DataFrame(data=d)

# Inspect the dtypes pandas assigned to each column
df.dtypes

# replace_none=True should turn every missing value into pd.NA
dt = ww.DataTable(df, name="retail", replace_none=True, copy_dataframe=True)
dt.dataframe

[Screenshot: output of dt.dataframe]

  • The expected behavior is that all NaN-like values in the DataFrame are replaced with pd.NA.

gsheni added the bug label Sep 22, 2020
@thehomebrewnerd
Contributor

Looked into this issue a little bit. It seems like the problem is happening when the columns get cast to type category. The original fillna seems to be working fine.

 df.fillna(pd.NA)
   col1  col2  col3  col4
0     1     3     1     1
1     2     4     2     2
2  <NA>  <NA>  <NA>  <NA>
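
To check which columns change dtype when their missing values are filled with pd.NA, a small comparison like the one below may help. It is only a sketch: `dtype_changes` is a hypothetical helper (not part of pandas or woodwork), and the dtype that float columns end up with can differ between pandas versions, so no expected output is shown.

```python
import numpy as np
import pandas as pd


def dtype_changes(df):
    """Hypothetical helper: map each column to its dtype before and after fillna(pd.NA)."""
    filled = df.fillna(pd.NA)
    return {
        col: (str(df[col].dtype), str(filled[col].dtype))
        for col in df.columns
    }


df = pd.DataFrame({
    'col1': [1, 2, np.nan],                            # float64 holding np.nan
    'col3': pd.Series([1, 2, np.nan], dtype='Int64'),  # nullable integer
})

for col, (before, after) in dtype_changes(df).items():
    print(f"{col}: {before} -> {after}")
```

A column whose "after" dtype is object is one that type inference could later mistake for string-like data.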

@thehomebrewnerd
Contributor

It looks like the column is inferred as Categorical because filling the NaN values with pd.NA converts the float columns to the object dtype, which our type inference then treats as a string dtype.

@tamargrey
Contributor

Any time you put pd.NA in a series and don't specify the dtype it gets inferred as object. So any column passed to DataTable that already contains pd.NA but didn't specify the dtype won't be able to have pd.NA in the DataTable.

And then there's the interesting behavior with fillna. Normally, if you manually try to add pd.NA to a dtype that doesn't accept it (like float64), you get a TypeError: float() argument must be a string or a number, not 'NAType'. But fillna overrides that and changes the dtype to object so that the values can be replaced with pd.NA, which is how float columns end up as object.
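
The dtype inference described in this comment can be sketched as follows. This is a minimal illustration using only the pandas Series constructor; the variable names are chosen here for the example.

```python
import pandas as pd

# No dtype given: pandas falls back to the generic object dtype.
inferred = pd.Series([1, 2, pd.NA, 4])
print(inferred.dtype)

# Nullable dtype given explicitly: pd.NA is kept and the dtype is Int64.
explicit = pd.Series([1, 2, pd.NA, 4], dtype='Int64')
print(explicit.dtype)
```

Specifying a nullable dtype up front is what keeps the column from collapsing to object.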

@thehomebrewnerd
Contributor

@tamargrey So, I think this means we cannot have a column that has a LogicalType of Double and a dataframe dtype of float64 if that column contains pd.NA values, correct? If we tried to do physical type conversion on that column, it would fail with the error you mention above.

@tamargrey
Contributor

@thehomebrewnerd there's a difference between these two cases below that means that the physical type conversion would, I think, be possible:
This is not allowed:

pd.Series([1,2,pd.NA,4], dtype='float')

But this works and converts pd.NA to np.nan:

from_int = pd.Series([1,2,pd.NA,4], dtype='Int64')
from_int.astype('float')

@gsheni
Contributor Author

gsheni commented Sep 22, 2020

I wonder if pandas's convert_dtypes function can help us with converting NaN to pd.NA.
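
As a rough sketch of what convert_dtypes does with integer-valued float columns like the ones in the report (the column names mirror the example above, and the rest is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, np.nan], 'col2': [3, 4, None]})
converted = df.convert_dtypes()

print(df.dtypes)         # float64 columns holding np.nan
print(converted.dtypes)  # integer-valued floats move to a nullable dtype
print(converted['col1'][2] is pd.NA)
```

Note that this only shows the integer-valued case. As far as I know, pandas versions before 1.2 have no nullable float dtype, so columns with non-integer floats would likely stay float64 there.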

@thehomebrewnerd
Contributor

@tamargrey Hmmm. The conversion succeeds, but the pd.NA is replaced with np.nan, so we are again left with the original issue of not having all our missing values represented with pd.NA...
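
A quick check of that conversion, as a sketch built on the same Int64 series from the earlier comment:

```python
import numpy as np
import pandas as pd

from_int = pd.Series([1, 2, pd.NA, 4], dtype='Int64')
as_float = from_int.astype('float')

print(as_float.dtype)         # the target float dtype
print(as_float.isna().sum())  # the missing value survives the cast
print(np.isnan(as_float[2]))  # but it is now np.nan rather than pd.NA
```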

@gsheni
Contributor Author

gsheni commented Sep 22, 2020

As @thehomebrewnerd pointed out, pd.NA is not supported for the categorical dtype.

All values of categorical data are either in categories or np.nan.

import numpy as np
import pandas as pd

d = {'col4': pd.Series([1, 2, pd.NA], dtype='string')}
df = pd.DataFrame(data=d)

df['col4'].astype(pd.CategoricalDtype())
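
One way to see how a categorical column treats the missing entry is to look at its categories and codes. The sketch below uses string values ('a', 'b') instead of the integers above to keep the construction simple; that substitution is an assumption made for illustration.

```python
import pandas as pd

strings = pd.Series(['a', 'b', pd.NA], dtype='string')
categories = strings.astype('category')

print(categories.cat.categories.tolist())  # only the real values become categories
print(categories.cat.codes.tolist())       # the missing entry gets the code -1
print(categories.isna().tolist())
```

The missing value is not a category of its own, so it is not stored as pd.NA in the categorical data.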

@gsheni
Contributor Author

gsheni commented Sep 28, 2020

Waiting on pandas to support pd.NA with new FloatDtype

@tyler3991

This issue should:

  1. Establish a single NaN representation in a table: pd.NA
  2. Set the minimum pandas requirement to 1.2.0
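
For context on the 1.2.0 requirement: pandas 1.2.0 introduced the nullable Float64 dtype (marked experimental at the time), which lets a float column hold a missing value without falling back to object. A minimal sketch, assuming pandas 1.2.0 or later is installed:

```python
import numpy as np
import pandas as pd

floats = pd.Series([1.5, 2.5, np.nan])  # plain float64, missing value is np.nan
nullable = floats.astype('Float64')     # nullable float dtype (pandas >= 1.2.0)

print(nullable.dtype)
print(nullable.isna().tolist())
```

Whether DataTable should perform this cast itself is a design question for the maintainers. The snippet only shows that the dtype exists and keeps the column numeric.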

@gsheni
Contributor Author

gsheni commented Feb 2, 2021

EvalML is currently adding support for pandas 1.2.0
