Description
When importing data from Athena, my data has nulls, and which columns contain them changes from file to file. A CSV's nullable int columns get converted to float64 when returned as a DataFrame from `pandas.read_sql_athena`, which is unexpected behavior.
I suggest that aws-data-wrangler add an option to `Session.pandas.read_csv` that converts all columns to string columns, analogous to `df = pd.read_csv('/path/to/file.csv', dtype=str)`.
Oddly, pandas does not have an easy way to convert an already-existing float64 column to str AND control the format. The suggested way is to format the float64 values at load time (the suggested workaround) or only for printing; once the values have been converted to strings, they can no longer be number-formatted.
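On the pandas side, one workaround worth noting is the nullable integer extension dtype `"Int64"` (available since pandas 0.24), which keeps missing values as `<NA>` instead of promoting the column to float64. A minimal sketch using a small inline copy of the data:

```python
import io
import pandas as pd

csv = "col1,col2,col3\n19,3,1\n20,,1\n,5,4\n"

# pandas' nullable integer dtype keeps missing values as <NA>
# instead of promoting the whole column to float64.
df = pd.read_csv(io.StringIO(csv), dtype="Int64")
print(df.dtypes)  # all three columns are Int64
```

This only helps if the reader exposes a `dtype` pass-through, which is exactly what this issue is asking aws-data-wrangler to provide.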
What is happening:
Data:

```
col1,col2,col3
19,3,1
20,,1
,5,4
```
Becomes:

```
col1,col2,col3
19.0,3.0,1
20.0,,1
,5.0,4
```
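The promotion can be reproduced with plain pandas, no Athena involved (a minimal sketch using an in-memory copy of the data above):

```python
import io
import pandas as pd

csv = "col1,col2,col3\n19,3,1\n20,,1\n,5,4\n"
df = pd.read_csv(io.StringIO(csv))

# col1 and col2 contain empty fields, so pandas reads them as
# float64 (NaN is a float); col3 is complete and stays int64.
print(df.dtypes)
```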
Using aws-data-wrangler like:

```python
def read_csv(bucket, file, sep, wr_session):
    """Read a csv table using AWS Data Wrangler."""
    return wr_session.pandas.read_csv(
        f's3://{bucket}/{file}',
        sep=sep
    )
```
This forces me to write the following function after the `read_csv` above:
```python
def _convert_float_columns_to_str(df):
    """Pandas pulls in columns as floats where there is a NaN,
    but that is not correct for our logic, so we want to find float
    dtypes, round to an int-'like' representation, and convert to str.
    """
    df.columns = map(str.lower, df.columns)
    float_columns = list(df.select_dtypes(include=['float64']).columns)
    # round does not work here (the dtype stays float64), not sure why:
    # df[float_columns] = df[float_columns].round(0)
    # df = df.round()
    # Format all the float columns as integer-looking strings.
    df[float_columns] = df[float_columns].applymap('{:,.0f}'.format)
    df = df.applymap(str)
    return df
```
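The formatting step at the heart of that helper can be shown in isolation (a minimal sketch with plain pandas; `applymap` with a format string renders 19.0 as '19'):

```python
import pandas as pd

df = pd.DataFrame({"col1": [19.0, 20.0, float("nan")]})

# '{:,.0f}'.format drops the decimal part: 19.0 -> '19'.
# Note that NaN is rendered as the string 'nan'.
out = df[["col1"]].applymap('{:,.0f}'.format)
print(out["col1"].tolist())
```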
Why can't I just specify the column(s) I want? Because this is data I receive externally: I don't know when a column will contain a null, and thus when an int column will be implicitly converted to float64, so code I run against it like `if row['col'] == 19` behaves inconsistently. The source data keeps changing from one CSV and application to the next.
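To make the comparison problem concrete (a hedged sketch: 19.0 still compares equal to 19 numerically, but any string rendering or string-typed comparison diverges):

```python
import io
import pandas as pd

csv = "col1,col2,col3\n19,3,1\n20,,1\n,5,4\n"

df = pd.read_csv(io.StringIO(csv))
row = df.iloc[0]
assert row["col1"] == 19            # 19.0 == 19 still holds numerically
assert str(row["col1"]) == "19.0"   # but the string form has changed

# Reading with dtype=str preserves the literal text from the file.
df_str = pd.read_csv(io.StringIO(csv), dtype=str)
assert df_str.iloc[0]["col1"] == "19"
```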