Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
I wish I could use pandas to detect and repair issues in a CSV file, but raise an informative warning when an unrepairable issue is encountered.
I have written a function that identifies common issues (e.g. the field delimiter being used improperly within a field) and checks the surrounding fields to estimate the original intent of the data. When this logic cannot resolve the issue, the function returns the original line unchanged, and the user should be directed to the problematic line.
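To make the kind of repair concrete, here is a minimal sketch, not the actual function described above. It assumes the surplus fields come from an unquoted delimiter inside one known free-text column; the names `repair_fields`, `expected`, and `text_col` are hypothetical.

```python
def repair_fields(fields, expected, text_col, sep=","):
    """Merge surplus fields back into one free-text column.

    Hypothetical helper: assumes the extra fields were produced by an
    unquoted delimiter inside the column at index `text_col`.
    """
    surplus = len(fields) - expected
    if surplus <= 0 or not 0 <= text_col < expected:
        # Nothing this heuristic can repair: return the line unchanged.
        return list(fields)
    # Re-join the pieces that were split out of the free-text column.
    merged = sep.join(fields[text_col:text_col + surplus + 1])
    return fields[:text_col] + [merged] + fields[text_col + surplus + 1:]
```

With the example row `102,C,D,E` and the last column treated as free text, `repair_fields(['102', 'C', 'D', 'E'], expected=3, text_col=2)` gives `['102', 'C', 'D,E']`. A line the heuristic cannot handle comes back unchanged, which is the case where a descriptive warning would help.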
Feature Description
Given a CSV with bad lines (e.g. line 3 having an extra "E"):
id,field_1,field_2
101,A,B
102,C,D,E
103,F,G
read_csv() will, with all defaults (on_bad_lines='error'), raise a ParserError:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4
With on_bad_lines='warn', it will emit a ParserWarning with the same helpful information:
<stdin>:1: ParserWarning: Skipping line 3: expected 3 fields, saw 4
However, when using a callable (e.g. on_bad_lines=line_fixer), the ParserWarning message is very generic and indicates neither the line number nor the expected and seen field counts:
>>> import pandas as pd
>>> def line_fixer(line):
...     return [1, 2, 3, 4, 5]
...
>>> df = pd.read_csv('test.csv', engine='python', on_bad_lines=line_fixer)
<stdin>:1: ParserWarning: Length of header or names does not match length of data. This leads to a loss of data with index_col=False.
Including these details would allow the user to find and fix the input CSV manually.
Alternative Solutions
- Pre-process the CSV file separately from the read_csv() function.
- Pass line number and expected field count to the callable function, which can raise its own descriptive warning.
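Both alternatives can be approximated today outside of pandas. The sketch below uses only the standard library and is a workaround, not a proposed implementation. `find_bad_lines` and `make_line_fixer` are hypothetical names, and the expected field count has to be supplied by the caller because the callable currently receives only the split line.

```python
import csv
import io
import warnings


def find_bad_lines(text, sep=","):
    """Pre-scan CSV text; return (line_number, expected, seen) per bad row.

    A row counts as bad when it has more fields than the header.
    """
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    header = next(reader, None)
    if header is None:
        return []
    expected = len(header)
    bad = []
    for row in reader:
        if len(row) > expected:
            # reader.line_num is the number of physical lines read so far.
            bad.append((reader.line_num, expected, len(row)))
    return bad


def make_line_fixer(expected):
    """Build a callable for on_bad_lines that emits its own warning."""
    def line_fixer(bad_line):
        # bad_line is the list of strings produced by splitting the line.
        warnings.warn(
            f"Unrepairable line {bad_line!r}: "
            f"expected {expected} fields, saw {len(bad_line)}",
            stacklevel=2,
        )
        return None  # returning None tells the parser to skip the line
    return line_fixer
```

The callable would be passed as `pd.read_csv('test.csv', engine='python', on_bad_lines=make_line_fixer(3))`. The line number is only available from the separate pre-scan, which is the gap this request is about.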
Additional Context
No response