docs/real_world_csv.md
6 additions & 6 deletions
@@ -82,7 +82,7 @@ Headers in production files are rarely as clean as you'd expect. They carry unit
| Headers with spaces and special characters (`Revenue (USD)`) | ✅ | Spaces and dashes normalized to underscores → `:revenue_(usd)`. Parentheses, slashes, etc. are preserved. |
| Extra data columns beyond the header row | ✅ | Auto-generates `column_N` names for extra fields. Controlled by `missing_headers:` option. |
| No header row at all | 🔘 | Use `headers_in_file: false, user_provided_headers: [:col1, :col2, ...]`. Common in raw database dumps and fixed-format legacy exports. |
-
| Repeated header row mid-file | ❌ | Happens when files are assembled with `cat chunk_1.csv chunk_2.csv > full.csv`. The repeated header line is silently treated as a data row, producing a hash like `{name: "name", age: "age"}`. Pre-process to strip repeated headers before parsing. |
+
| Repeated header row mid-file | ❌ | Happens when files are assembled with `cat chunk_1.csv chunk_2.csv > full.csv`. The repeated header line is silently treated as a data row, producing a hash like `{name: "name", age: "age"}`. Pre-process to strip repeated headers before parsing, or post-process by filtering out the hashes whose values are the header names. |
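
The post-processing route for repeated header rows can be sketched as follows. This is a minimal illustration, not part of SmarterCSV: `drop_repeated_headers` is a hypothetical helper, and it assumes each header's text already equals its normalized key (`name` becomes `:name`). Headers that normalization alters, such as `Revenue (USD)`, would need to be compared against the raw header strings instead.

```ruby
# Hypothetical helper (not a SmarterCSV API): rejects any row in which
# every value is just the name of its own column, i.e. a header line
# that was parsed as data.
def drop_repeated_headers(rows)
  rows.reject do |row|
    !row.empty? &&
      row.all? { |key, value| value.to_s.strip.downcase == key.to_s.downcase }
  end
end

rows = [
  { name: "Alice", age: 34 },
  { name: "name",  age: "age" }, # repeated header from the second chunk
  { name: "Bob",   age: 27 }
]

clean = drop_repeated_headers(rows)
# clean now holds only the Alice and Bob rows
```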
---
@@ -93,9 +93,9 @@ Numeric conversion is one of the most common sources of data loss. SmarterCSV co
-
| Currency symbols in values (`$1,234.56`) | ✅ | Won't match the numeric pattern — safely left as a string. |
-
| Percentage values (`12.5%`) | ✅ | Won't match the numeric pattern — safely left as a string. |
-
| Leading zeros (ZIP codes, phone numbers, SKUs, account numbers) | 🔘 |`convert_values_to_numeric: { except: [:zip, :phone, :sku] }`. Without this, `"01234"` becomes `1234`. One of the most common silent data loss bugs in CSV processing. |
+
| Currency symbols in values (`$1,234.56`, `€1.234,56`) | ✅ / 🔘 | Won't match the numeric pattern, so the value is safely left as a string. Use `value_converters` if a numeric value is needed. |
+
| Percentage values (`12.5%`) | ✅ / 🔘 | Won't match the numeric pattern, so the value is safely left as a string. Use `value_converters` if a numeric value is needed. |
+
| Leading zeros (ZIP codes, phone numbers, SKUs, account numbers) | 🔘 | `convert_values_to_numeric: { except: [:zip, :phone, :sku] }`. Without this, `"01234"` becomes `1234`. One of the most common silent data loss bugs in CSV processing: many US ZIP codes, for example, start with a zero. |
| NULL / empty value variants (`NULL`, `\N`, `N/A`, `(null)`, `#N/A`) | 🔘 | Use `nil_values_matching: /\A(NULL\\|\\N\|N\/A\|#N\/A\|\\(null\\))\z/i`. Without configuration these are left as literal strings. |
| Date values (`2023-01-15`, `01/02/2023`, `Jan 2, 2023`) | 🔘 | Use `value_converters` with a date parsing lambda. SmarterCSV does not auto-convert dates — format ambiguity (`01/02/2023` = Jan 2 or Feb 1?) makes auto-conversion unsafe. |
| Boolean variants (`Y/N`, `Yes/No`, `TRUE/FALSE`, `1/0`, `X/` in SAP) | 🔘 | Use `value_converters` for the relevant columns. |
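
Several rows above point to `value_converters` without showing what a converter looks like. The sketch below assumes the option accepts any object that responds to `convert(value)`. The three modules are hypothetical and do not ship with SmarterCSV, and the file and column names in the wiring comment are placeholders. Each converter returns its input unchanged when the format is not recognized, so unexpected data is left as a string instead of being dropped.

```ruby
# Hypothetical converters for illustration; none of these ship with SmarterCSV.

module UsCurrencyConverter
  # "$1,234.56" -> 1234.56 (a Float here; BigDecimal is the safer choice
  # for real money arithmetic). European formats such as "€1.234,56" are
  # not recognized and come back unchanged.
  def self.convert(value)
    return value unless value.is_a?(String)

    cleaned = value.delete("$,").strip
    cleaned.match?(/\A-?\d+(\.\d+)?\z/) ? Float(cleaned) : value
  end
end

module PercentConverter
  # "12.5%" -> 12.5 (percentage points, not a 0..1 fraction)
  def self.convert(value)
    return value unless value.is_a?(String)

    match = value.strip.match(/\A(-?\d+(?:\.\d+)?)\s*%\z/)
    match ? Float(match[1]) : value
  end
end

module BooleanConverter
  TRUE_VALUES  = %w[y yes true 1 x].freeze
  FALSE_VALUES = %w[n no false 0].freeze

  # Accepts strings or already-converted numbers (1 / 0).
  def self.convert(value)
    token = value.to_s.strip.downcase
    return true if TRUE_VALUES.include?(token)
    return false if FALSE_VALUES.include?(token)

    value
  end
end

# Assumed wiring (file and column names are placeholders):
# SmarterCSV.process("orders.csv", value_converters: {
#   price: UsCurrencyConverter, discount: PercentConverter, active: BooleanConverter
# })
```

Keeping one converter per format makes it explicit which columns are reinterpreted, which matters for values like `€1.234,56` where a single generic number parser would guess wrong.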
@@ -126,11 +126,11 @@ Numeric conversion is one of the most common sources of data loss. SmarterCSV co
| MySQL `SELECT INTO OUTFILE`| Backslash quote escaping | ✅ |`quote_escaping: :auto` default. |
| Excel `Save As CSV`| UTF-8 BOM, RFC 4180 quoting, 1,048,576 row limit | ✅ | BOM stripped, quoting handled. Row limit is an Excel constraint — SmarterCSV will parse whatever Excel wrote. |
+
| Government open data portals | Semicolons as separator, Latin-1, inconsistent quoting | ✅ / 🔘 |`col_sep: :auto` handles semicolons; specify `file_encoding:` if non-UTF-8. |
+
| Bioinformatics (VCF-derived) | Thousands of columns (one sample per column) | ✅ | No column count limit in the parsing hot path. |
| Shopify / WooCommerce | Pipe-delimited values within a field (`tag1\|tag2\|tag3`) | 🔘 | Use `value_converters` to split on `\|` for the relevant column. |
| Qualtrics / SurveyMonkey | 200–800 columns, multi-row headers, HTML in values | 🔘 | Multi-row headers require pre-processing; HTML in values left as-is (use value_converters to strip). |
-
| Government open data portals | Semicolons as separator, Latin-1, inconsistent quoting | ✅ / 🔘 |`col_sep: :auto` handles semicolons; specify `file_encoding:` if non-UTF-8. |
-
| Bioinformatics (VCF-derived) | Thousands of columns (one sample per column) | ✅ | No column count limit in the parsing hot path. |
| Gzipped CSV (`.csv.gz`) | Compressed file | 🔘 | Decompress and pass the resulting IO object: `SmarterCSV.process(Zlib::GzipReader.open(path))`. |
| HTTP streaming | Parsing from a live HTTP response | 🔘 | Pass any IO-compatible object that responds to `#gets`. |
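
The gzip and pipe-delimited rows above can be illustrated with the standard library alone. In this sketch, `read_gzip_lines` and `PipeListConverter` are hypothetical names. The helper only demonstrates that `Zlib::GzipReader` behaves like an IO that responds to `#gets`; for real files, passing the reader straight to `SmarterCSV.process` as shown in the table avoids loading every line into memory.

```ruby
require "zlib"

# Hypothetical converter: "tag1|tag2|tag3" -> ["tag1", "tag2", "tag3"]
module PipeListConverter
  def self.convert(value)
    value.to_s.split("|").map(&:strip)
  end
end

# Reads a gzipped text file line by line through #gets, the same
# interface an IO-based CSV parser would rely on.
def read_gzip_lines(path)
  lines = []
  Zlib::GzipReader.open(path) do |gz|
    while (line = gz.gets)
      lines << line.chomp
    end
  end
  lines
end

# Assumed wiring (`path` and the :tags column are placeholders):
# SmarterCSV.process(Zlib::GzipReader.open(path),
#                    value_converters: { tags: PipeListConverter })
```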