Skip to content

Nullable int columns being pulled into float columns: Allow format all columns on read_csv #158

Closed
@vfrank66

Description

@vfrank66

When importing data from athena my data has nulls which is a change from file to file. Csv's nullable int columns get converted to float64 when returned as a dataframe from pandas.read_sql_athena, which is unexpected behavior.

I suggest that aws-data-wrangler adds an option to Session.pandas.read_csv that can convert all columns to string columns such as df = pd.read_csv('/path/to/file.csv', dtype=str).

Oddly pandas does not have an easy way to convert an already existing dataframe float64 to str AND format. The suggested way is at load format the dataframe for float64 (suggested workaround and for printing, the inability to format when convert to a string )

What is happening:

Data
col1,col2,col3
19,3,1
20,,1
,5,4

Becomes:

col1,col2,col3
19.0,3.0,1
20.0,,1
,5.0,4

Using aws data wrangler like:

def read_csv(sql, bucket, files, sep, wr_session):
    """ Read csv table using AWS Wrangler"""
    return wr_session.pandas.read_csv(
            f's3://{bucket}/{file}',
            sep=sep
        )

Forcing me to write a following function after the above read_csv

def _convert_float_columns_to_str(df):
	""" Pandas pulls in columns as floats where there is a nan
	but that is not correct for our logic, so we want to find float 
	datatypes and round to int 'like' datatype and convert to str.
	"""
	df.columns = map(str.lower, df.columns)
	float_columns = list(df.select_dtypes(include=['float64']).columns)
	# round does not, not sure why
	# df[float_columns] = df[float_columns].round(0)
	# df = df.round()
	# format all the columns to 
	df[float_columns] = df[float_columns].applymap('{:,.0f}'.format)
	df = df.applymap(str)
	return df

Why can I just specify the column(s) I want? Because this is data I am receiving externally and I don't know when a column will have null and thus the int column will implicitly be converted to float64 and when I run code against it if row['col'] == 19 the data source data keeps changing from csv to application.

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingenhancementNew feature or requestmajor releaseWill be addressed in the next major release

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions