DataTable init does not replace NaNs with pd.NA for float data type #128

gsheni · 2020-09-22T16:55:18Z

We want DataTable to use a single representation of missing values (pd.NA). This is a forward-looking feature of pandas.

The goal of pd.NA is to provide a “missing” indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).

In our init of DataTable, we have a replace_none parameter, which defaults to True. However, this is not working for some of the data types passed into the DataTable:

import numpy as np
import pandas as pd
import woodwork as ww
d = {'col1': [1, 2, np.nan], 'col2': [3, 4, None],
     'col3': pd.Series([1, 2, np.nan], dtype='Int64'),
     'col4': pd.Series([1, 2, None], dtype='string')}
df = pd.DataFrame(data=d)

df.dtypes

dt = ww.DataTable(df, name="retail", replace_none=True, copy_dataframe=True)
dt.dataframe

- The expected behavior is that all NaN-like values in the DataFrame would be pd.NA


thehomebrewnerd · 2020-09-22T20:02:51Z

Looked into this issue a little bit. It seems like the problem is happening when the columns get cast to type category. The original fillna seems to be working fine.

 df.fillna(pd.NA)
   col1  col2  col3  col4
0     1     3     1     1
1     2     4     2     2
2  <NA>  <NA>  <NA>  <NA>

thehomebrewnerd · 2020-09-22T20:11:43Z

And it looks like this is getting inferred as a Categorical because when the NaN values in the dataframe are filled with pd.NA values, the datatypes for the float columns are getting converted to object, and our type inference recognizes this as a string dtype.

tamargrey · 2020-09-22T20:46:50Z

Any time you put pd.NA in a series and don't specify the dtype it gets inferred as object. So any column passed to DataTable that already contains pd.NA but didn't specify the dtype won't be able to have pd.NA in the DataTable.
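The inference behavior described above can be checked with a minimal sketch. The variable names are illustrative only, and the snippet assumes a pandas release that provides pd.NA (1.0 or later).

```python
import pandas as pd

# No dtype given: pandas falls back to the generic object dtype.
inferred = pd.Series([1, 2, pd.NA])

# Nullable integer dtype requested explicitly: pd.NA is stored natively.
explicit = pd.Series([1, 2, pd.NA], dtype="Int64")

print(inferred.dtype)
print(explicit.dtype)
```

Passing the nullable dtype up front is what keeps the column from being treated as generic Python objects.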

And then there's the interesting behavior with fillna. Normally, if you manually try to add pd.NA to a dtype that doesn't accept it (like float64), you get a TypeError: float() argument must be a string or a number, not 'NAType'. fillna overrides that and changes the dtype to object so that the values can be replaced with pd.NA, which is how float columns end up as object.
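A small sketch for inspecting what fillna does to a float64 column. The resulting dtype is an assumption that depends on the installed pandas version: the thread reports object, while other releases may keep float64 and leave the value as NaN, so the snippet prints the result instead of assuming it.

```python
import numpy as np
import pandas as pd

floats = pd.Series([1.0, 2.0, np.nan])  # dtype float64
filled = floats.fillna(pd.NA)

# Inspect rather than assume: the dtype after fillna may be object
# (as reported in this thread) or float64, depending on the pandas version.
print(floats.dtype, "->", filled.dtype)

# Either way, the third entry is still reported as missing.
print(filled.isna().tolist())
```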

thehomebrewnerd · 2020-09-22T20:55:30Z

@tamargrey So, I think this means we cannot have a column that has a LogicalType of Double and a dataframe dtype of float64, if that column contains pd.NA values, correct? If we tried to do physical type conversion on that column it will fail with the error you mention above.

tamargrey · 2020-09-22T21:22:07Z

@thehomebrewnerd there's a difference between these two cases below that means that the physical type conversion would, I think, be possible:
This is not allowed:

pd.Series([1,2,pd.NA,4], dtype='float')

But this works and converts pd.NA to np.nan:

from_int = pd.Series([1,2,pd.NA,4], dtype='Int64')
from_int.astype('float')
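The two cases can be put side by side in a short sketch. The direct construction is wrapped in try/except because the exact exception, and whether one is raised at all, is assumed to vary between pandas versions.

```python
import numpy as np
import pandas as pd

# Case 1: building a float Series directly from pd.NA.
# The thread reports a TypeError here; we catch it instead of relying on it.
try:
    pd.Series([1, 2, pd.NA, 4], dtype="float")
except (TypeError, ValueError) as err:
    print("direct construction failed:", err)

# Case 2: going through the nullable Int64 dtype first.
from_int = pd.Series([1, 2, pd.NA, 4], dtype="Int64")
as_float = from_int.astype("float")

print(as_float.dtype)
# The missing entry is now np.nan, not pd.NA.
print(np.isnan(as_float[2]), as_float[2] is pd.NA)
```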

gsheni · 2020-09-22T21:24:13Z

I wonder if pandas's convert_dtypes function can help us with converting NaN to pd.NA.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html
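As a rough illustration of that idea, the sketch below runs DataFrame.convert_dtypes on two float64 columns that hold only whole numbers and missing values. It is not a proposal for how DataTable should be implemented, and the handling of columns with non-integer floats is assumed to depend on the pandas version.

```python
import numpy as np
import pandas as pd

# Both columns are stored as float64 because of the missing entries.
df = pd.DataFrame({"col1": [1, 2, np.nan], "col2": [3, 4, None]})
converted = df.convert_dtypes()

print(df.dtypes)
print(converted.dtypes)

# Whole-number floats are converted to the nullable Int64 dtype,
# and the missing entry becomes pd.NA.
print(converted["col1"][2] is pd.NA)
```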

thehomebrewnerd · 2020-09-22T21:25:52Z

@tamargrey Hmmm. The conversion succeeds, but the pd.NA is replaced with np.nan, so we are again left with the original issue of not having all our missing values represented with pd.NA...

gsheni · 2020-09-22T21:26:27Z

As @thehomebrewnerd pointed out, pd.NA is not supported for the categorical dtype.

https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#categorical-data

All values of categorical data are either in categories or np.nan.

import numpy as np
import pandas as pd

d = {'col4': pd.Series([1, 2, pd.NA], dtype='string')}
df = pd.DataFrame(data=d)

df['col4'].astype(pd.CategoricalDtype())
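A minimal sketch of what happens to the missing value after the cast, using string values for illustration. It inspects the category codes and avoids assuming how the missing entry is displayed.

```python
import pandas as pd

strings = pd.Series(["a", "b", pd.NA], dtype="string")
cat = strings.astype("category")

# Only the real values become categories.
print(list(cat.cat.categories))

# The missing entry is not a category; it is stored with the sentinel code -1.
print(cat.cat.codes.tolist())
print(cat.isna().tolist())
```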

gsheni · 2020-09-28T19:57:05Z

Waiting on pandas to support pd.NA with new FloatDtype

ENH: nullable Float32/64 ExtensionArray pandas-dev/pandas#34307

tyler3991 · 2021-02-01T20:56:03Z

This issue should:

- Have a single NaN representation in a table: pd.NA
- Set the minimum pandas requirement to 1.2.0
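For context on the 1.2.0 requirement, here is a short sketch of the nullable Float64 extension dtype. It assumes pandas 1.2.0 or later, where this dtype is available, and the variable name is illustrative.

```python
import pandas as pd

# Nullable float dtype: pd.NA is stored natively, with no fallback to object.
nullable = pd.Series([1.5, 2.5, pd.NA], dtype="Float64")

print(nullable.dtype)
print(nullable[2] is pd.NA)
```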

gsheni · 2021-02-02T16:44:21Z

EvalML is currently adding support for pandas 1.2.0

gsheni added the bug Something isn't working label Sep 22, 2020

gsheni assigned thehomebrewnerd Sep 23, 2020

gsheni unassigned thehomebrewnerd Sep 28, 2020
