Closed
7 changes: 7 additions & 0 deletions pydax/loaders/_table.py
@@ -37,6 +37,7 @@ def load(self, path: Union[_typing.PathLike, Dict[str, str]], options: SchemaDic
- ``columns`` key specifies the data type of each column. Each data type corresponds to a Pandas'
supported dtype. If unspecified, then it is default.
- ``delimiter`` key specifies the delimiter of the input CSV file.
- ``header`` key specifies if the first row of the CSV file contains the headers. Defaults to True
Collaborator

It seems that `header` is treated as `True` unless it is exactly `False`, even for values such as empty strings or empty lists, which normally evaluate as false in Python. Could you make this point clear in the documentation?
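
The behaviour described in this comment can be reproduced in isolation. The sketch below assumes the loader tests the option with `options.get('header', True) is False`, as the second hunk of this diff does; `header_is_used` is a hypothetical helper name used only for illustration and is not part of pydax.

```python
def header_is_used(options):
    """Mirror the identity check: only the literal False disables the header."""
    return options.get('header', True) is not False


# Falsy values other than False still count as "header present".
candidates = [True, False, '', [], 0, None]
results = [(value, header_is_used({'header': value})) for value in candidates]

for value, used in results:
    print(f"header={value!r:<6} -> header row used: {used}")

# A missing key falls back to the default of True.
print(header_is_used({}))
```

Because the comparison uses `is` rather than truthiness, only `False` itself switches the header off, which is the point the docstring would need to state.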

- ``encoding`` key specifies the encoding of the CSV file. Defaults to UTF-8.
:raises TypeError: ``path`` is not a path object.
"""
@@ -55,9 +56,15 @@ def load(self, path: Union[_typing.PathLike, Dict[str, str]], options: SchemaDic
else:
    dtypes[column] = type_

names = None
Collaborator

Does `names` default to `None` in `read_csv`? I don't see this in the documentation. Perhaps it's better if we simply do not provide `names` if `header` is `False`?

Collaborator Author

`names` is `None` by default. I thought about doing that, but I wanted to avoid having two versions of the `read_csv` call.

if options.get('header', True) is False:
    # If no header use the columns provided in schema
    names = [*options.get('columns', {})]

return pd.read_csv(path, dtype=dtypes,
                   # The following line after "if" is for circumventing
                   # https://github.com/pandas-dev/pandas/issues/38489
                   parse_dates=parse_dates if len(parse_dates) > 0 else False,
                   names=names,
                   encoding=options.get('encoding', 'utf-8'),
                   delimiter=options.get('delimiter', ','))
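
The way `names` is derived in the hunk above can be exercised without pandas. This is a minimal sketch, assuming `options` is a plain dict shaped like the one in the diff; `resolve_names` and the column names `a` and `b` are hypothetical and exist only for this illustration.

```python
def resolve_names(options):
    """Reproduce the diff's logic for choosing the `names` argument."""
    names = None
    if options.get('header', True) is False:
        # Unpacking a dict yields its keys in insertion order.
        names = [*options.get('columns', {})]
    return names


# Header present (the default): names stays None.
print(resolve_names({'columns': {'a': 'int', 'b': 'str'}}))

# No header: column names come from the schema's `columns` mapping.
print(resolve_names({'header': False, 'columns': {'a': 'int', 'b': 'str'}}))

# No header and no `columns` key: names is an empty list, not None.
print(resolve_names({'header': False}))
```

The last case is one the diff does not handle explicitly, so it may be worth deciding how an empty `names` list should be treated before it reaches `read_csv`.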
10 changes: 10 additions & 0 deletions tests/test_loaders.py
@@ -243,3 +243,13 @@ def test_csv_pandas_loader_no_encoding(self, tmp_path, noaa_jfk_schema):

    del noaa_jfk_schema['subdatasets']['jfk_weather_cleaned']['format']['options']['encoding']
    self.test_csv_pandas_loader(tmp_path, noaa_jfk_schema)

def test_csv_pandas_header(self, tmp_path, noaa_jfk_schema):
    "Test CSVPandasLoader header options"

    noaa_jfk_schema['subdatasets']['jfk_weather_cleaned']['format']['options']['header'] = True
    self.test_csv_pandas_loader(tmp_path, noaa_jfk_schema)

    with pytest.raises(ValueError):  # Pandas should error from trying to read string as another dtype
        noaa_jfk_schema['subdatasets']['jfk_weather_cleaned']['format']['options']['header'] = False
Comment on lines +253 to +254
Collaborator

Suggested change
- with pytest.raises(ValueError):  # Pandas should error from trying to read string as another dtype
-     noaa_jfk_schema['subdatasets']['jfk_weather_cleaned']['format']['options']['header'] = False
+ noaa_jfk_schema['subdatasets']['jfk_weather_cleaned']['format']['options']['header'] = False
+ with pytest.raises(ValueError):  # Pandas should error from trying to read string as another dtype

Collaborator

Could you also assert a couple of keywords in the exception message?
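
One way to act on this request, sketched with the standard library only so it runs without pandas or pytest: capture the `ValueError` and check its message for expected keywords. The CSV text, `load_temps`, and `value_error_message` are hypothetical stand-ins, and the message checked here is the one raised by Python's `int()`, not the pandas message, which this thread does not show. In a pytest test the same check is usually written with the `match` argument of `pytest.raises`, which runs a regular-expression search against the exception text.

```python
import csv
import io

# Hypothetical two-column CSV whose first row is a header.
CSV_TEXT = "DATE,TEMP\n2020-01-01,3\n2020-01-02,5\n"


def load_temps(text, header=True):
    """Toy stand-in for a loader: parse the TEMP column as int."""
    rows = list(csv.reader(io.StringIO(text)))
    if header is not False:
        rows = rows[1:]  # skip the header row
    return [int(row[1]) for row in rows]


def value_error_message(func, *args, **kwargs):
    """Return the message of the ValueError raised by func, or fail."""
    try:
        func(*args, **kwargs)
    except ValueError as exc:
        return str(exc)
    raise AssertionError("ValueError was not raised")


# With the header row skipped, the column parses cleanly.
assert load_temps(CSV_TEXT) == [3, 5]

# With header=False the header row is read as data and conversion fails.
message = value_error_message(load_temps, CSV_TEXT, header=False)
assert "invalid literal" in message
assert "TEMP" in message
```

Checking for a column name or value in the message, rather than only the exception type, helps confirm that the error comes from the header row being read as data and not from an unrelated failure.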

        Dataset(noaa_jfk_schema, tmp_path, mode=Dataset.InitializationMode.DOWNLOAD_AND_LOAD)