In the current approach to logical type inference, the main differentiator between a Double and an Integer or WholeNumber logical type is the underlying pandas dtype. As a result, the series in the following examples are inferred as Double, even though a WholeNumber or Integer logical type would be more appropriate.
    >>> import numpy as np
    >>> import pandas as pd
    >>> from woodwork.data_column import infer_logical_type

    # inferred as Double because dtype is float64, but could be WholeNumber with Int64 dtype
    >>> infer_logical_type(pd.Series([1.0, np.nan, 3.0]))
    Double

    # inferred as Double because dtype is float64, but could be Integer with Int64 dtype
    >>> infer_logical_type(pd.Series([1.0, 2.0, -3.0]))
    Double
We should improve the inference for series that contain NaN values and for series that can be represented as Integer or WholeNumber types without loss of information. One way to check for loss of information would be to cast columns inferred as Double to integer and then determine whether the cast values are equal to the original float values. We would need to drop NaN values before performing this comparison.
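The cast-and-compare check described above could look roughly like the following. This is only a sketch, not woodwork's implementation: `suggest_logical_type` is a hypothetical helper, it returns plain strings rather than woodwork logical type classes, and it assumes that WholeNumber means non-negative integers and that an all-NaN or non-finite series should stay Double.

```python
import numpy as np
import pandas as pd


def suggest_logical_type(series: pd.Series) -> str:
    """Hypothetical helper: suggest a logical type name for a float64 series."""
    # NaN cannot be cast to int64, so drop missing values before comparing.
    values = series.dropna()

    # Assumption: with no non-missing values there is nothing to justify
    # an integer type, so fall back to Double.
    if values.empty:
        return "Double"

    # Infinite values cannot be represented as integers either.
    if not np.isfinite(values).all():
        return "Double"

    # Cast to integer and check that no fractional part was truncated.
    as_int = values.astype("int64")
    if not (as_int == values).all():
        return "Double"

    # Assumption: WholeNumber covers non-negative integers only.
    if (as_int >= 0).all():
        return "WholeNumber"
    return "Integer"


print(suggest_logical_type(pd.Series([1.0, np.nan, 3.0])))
print(suggest_logical_type(pd.Series([1.0, 2.0, -3.0])))
print(suggest_logical_type(pd.Series([1.5, 2.0])))
```

The sketch does not address the precision question for very large magnitudes; values outside the int64 range would need separate handling before the cast.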
@gsheni The precision question is one that should definitely be investigated before we implement just to make sure we are confident that we won't drop important information inadvertently.
As for performance, no doubt adding this check will add time to the inference process. If we find a significant performance degradation, we could potentially add some type of inference level setting to the global config to control some of these types of things. That way the user would have some control over the speed/accuracy tradeoff during inference.
    import woodwork as ww

    ww.config.set_option('inference_level', 'simple')    # Fast, but might miss some more nuanced differences
    ww.config.set_option('inference_level', 'advanced')  # More accurate, but slower due to increased data checks
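To make the tradeoff concrete, here is one way such a setting might gate the extra check. It is purely illustrative: `infer_float_series` and its `inference_level` argument are hypothetical names, the level is passed as a plain argument instead of being read from woodwork's config, and the function returns strings rather than logical type classes.

```python
import numpy as np
import pandas as pd

VALID_LEVELS = {"simple", "advanced"}


def infer_float_series(series: pd.Series, inference_level: str = "simple") -> str:
    """Hypothetical sketch: infer a type name for a float64 series."""
    if inference_level not in VALID_LEVELS:
        raise ValueError(f"Unknown inference_level: {inference_level!r}")

    # 'simple': rely on the dtype alone, so no pass over the data is needed.
    if inference_level == "simple":
        return "Double"

    # 'advanced': scan the non-missing values to see whether they are all
    # finite and have no fractional part.
    values = series.dropna()
    if values.empty or not np.isfinite(values).all():
        return "Double"
    if (values == values.round()).all():
        return "Integer"
    return "Double"


ints_as_floats = pd.Series([1.0, 2.0, -3.0])
print(infer_float_series(ints_as_floats, "simple"))
print(infer_float_series(ints_as_floats, "advanced"))
```

The 'simple' path never touches the values, which is where the speed advantage would come from on large columns.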