Description
When importing data from Athena, my data has nulls, and which columns contain them changes from file to file. A CSV's nullable int columns get converted to float64 when returned as a DataFrame from `pandas.read_sql_athena`, which is unexpected behavior.
I suggest that aws-data-wrangler add an option to `Session.pandas.read_csv` that converts all columns to string columns, analogous to `df = pd.read_csv('/path/to/file.csv', dtype=str)`.
Oddly, pandas does not have an easy way to convert an already-existing float64 column to str AND control the format. The suggested way is to format the float64 values at load time (the suggested workaround) or only for printing; once the values have been converted to strings, they can no longer be number-formatted.
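On the pandas side, one workaround worth noting is the nullable integer extension dtype `"Int64"` (available since pandas 0.24), which keeps missing values as `<NA>` instead of promoting the column to float64. A minimal sketch using a small inline copy of the data:

```python
import io
import pandas as pd

csv = "col1,col2,col3\n19,3,1\n20,,1\n,5,4\n"

# pandas' nullable integer dtype keeps missing values as <NA>
# instead of promoting the whole column to float64.
df = pd.read_csv(io.StringIO(csv), dtype="Int64")
print(df.dtypes)  # all three columns are Int64
```

This only helps if the reader exposes a `dtype` pass-through, which is exactly what this issue is asking aws-data-wrangler to provide.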
What is happening:
Data:

```
col1,col2,col3
19,3,1
20,,1
,5,4
```
Becomes:

```
col1,col2,col3
19.0,3.0,1
20.0,,1
,5.0,4
```
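The promotion can be reproduced with plain pandas, no Athena involved (a minimal sketch using an in-memory copy of the data above):

```python
import io
import pandas as pd

csv = "col1,col2,col3\n19,3,1\n20,,1\n,5,4\n"
df = pd.read_csv(io.StringIO(csv))

# col1 and col2 contain empty fields, so pandas reads them as
# float64 (NaN is a float); col3 is complete and stays int64.
print(df.dtypes)
```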
Using aws-data-wrangler like:

```python
def read_csv(bucket, file, sep, wr_session):
    """Read a csv table using AWS Data Wrangler."""
    return wr_session.pandas.read_csv(
        f's3://{bucket}/{file}',
        sep=sep
    )
```
This forces me to write the following function after the `read_csv` above:
```python
def _convert_float_columns_to_str(df):
    """Pandas pulls in columns as floats where there is a NaN,
    but that is not correct for our logic, so we want to find float
    dtypes, round to an int-'like' representation, and convert to str.
    """
    df.columns = map(str.lower, df.columns)
    float_columns = list(df.select_dtypes(include=['float64']).columns)
    # round does not work here (the dtype stays float64), not sure why:
    # df[float_columns] = df[float_columns].round(0)
    # df = df.round()
    # Format all the float columns as integer-looking strings.
    df[float_columns] = df[float_columns].applymap('{:,.0f}'.format)
    df = df.applymap(str)
    return df
```
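The formatting step at the heart of that helper can be shown in isolation (a minimal sketch with plain pandas; `applymap` with a format string renders 19.0 as '19'):

```python
import pandas as pd

df = pd.DataFrame({"col1": [19.0, 20.0, float("nan")]})

# '{:,.0f}'.format drops the decimal part: 19.0 -> '19'.
# Note that NaN is rendered as the string 'nan'.
out = df[["col1"]].applymap('{:,.0f}'.format)
print(out["col1"].tolist())
```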
Why can't I just specify the column(s) I want? Because this is data I receive externally: I don't know when a column will contain a null, and thus when an int column will be implicitly converted to float64, so code I run against it like `if row['col'] == 19` behaves inconsistently. The source data keeps changing from one CSV and application to the next.
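To make the comparison problem concrete (a hedged sketch: 19.0 still compares equal to 19 numerically, but any string rendering or string-typed comparison diverges):

```python
import io
import pandas as pd

csv = "col1,col2,col3\n19,3,1\n20,,1\n,5,4\n"

df = pd.read_csv(io.StringIO(csv))
row = df.iloc[0]
assert row["col1"] == 19            # 19.0 == 19 still holds numerically
assert str(row["col1"]) == "19.0"   # but the string form has changed

# Reading with dtype=str preserves the literal text from the file.
df_str = pd.read_csv(io.StringIO(csv), dtype=str)
assert df_str.iloc[0]["col1"] == "19"
```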