Commit 2ce9c03

update
1 parent db27a79 commit 2ce9c03

1 file changed (+6 -6 lines changed)

docs/real_world_csv.md

Lines changed: 6 additions & 6 deletions
@@ -82,7 +82,7 @@ Headers in production files are rarely as clean as you'd expect. They carry unit
 | Headers with spaces and special characters (`Revenue (USD)`) || Spaces and dashes normalized to underscores → `:revenue_(usd)`. Parentheses, slashes, etc. are preserved. |
 | Extra data columns beyond the header row || Auto-generates `column_N` names for extra fields. Controlled by `missing_headers:` option. |
 | No header row at all | 🔘 | Use `headers_in_file: false, user_provided_headers: [:col1, :col2, ...]`. Common in raw database dumps and fixed-format legacy exports. |
-| Repeated header row mid-file || Happens when files are assembled with `cat chunk_1.csv chunk_2.csv > full.csv`. The repeated header line is silently treated as a data row, producing a hash like `{name: "name", age: "age"}`. Pre-process to strip repeated headers before parsing. |
+| Repeated header row mid-file || Happens when files are assembled with `cat chunk_1.csv chunk_2.csv > full.csv`. The repeated header line is silently treated as a data row, producing a hash like `{name: "name", age: "age"}`. Pre-process to strip repeated headers before parsing, or post-process filtering out the data hashes containing header information. |

 ---
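
The post-processing option mentioned in the changed row could look roughly like the sketch below. `drop_repeated_headers` is a hypothetical helper, not part of SmarterCSV, and it assumes each header string matches its symbolized key once stripped and downcased (true for `name` and `age`, not necessarily for headers that were normalized more heavily).

```ruby
# Hypothetical helper (not a SmarterCSV API): removes rows that merely repeat
# the header line. A row is treated as a repeated header when every value,
# stripped and downcased, equals the name of its own key.
def drop_repeated_headers(rows)
  rows.reject do |row|
    !row.empty? && row.all? { |key, value| value.to_s.strip.downcase == key.to_s }
  end
end

rows = [
  { name: "Alice", age: 30 },
  { name: "name",  age: "age" }, # repeated header line from the second chunk
  { name: "Bob",   age: 25 }
]

cleaned = drop_repeated_headers(rows)
```

For headers that contain spaces or dashes, the comparison would need to apply the same normalization to the value first; that step is left out here.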

@@ -93,9 +93,9 @@ Numeric conversion is one of the most common sources of data loss. SmarterCSV co
 | Issue | Status | Notes |
 |-------|--------|-------|
 | Integer and float conversion || `convert_values_to_numeric: true` (default). `"42"` → `42`, `"3.14"` → `3.14`. |
-| Currency symbols in values (`$1,234.56`) || Won't match the numeric pattern — safely left as a string. |
-| Percentage values (`12.5%`) || Won't match the numeric pattern — safely left as a string. |
-| Leading zeros (ZIP codes, phone numbers, SKUs, account numbers) | 🔘 | `convert_values_to_numeric: { except: [:zip, :phone, :sku] }`. Without this, `"01234"` becomes `1234`. One of the most common silent data loss bugs in CSV processing. |
+| Currency symbols in values (`$1,234.56`, `€1.234,56`) |/ 🔘| Won't match the numeric pattern — safely left as a string. Use `value_converters` if numeric value is needed.|
+| Percentage values (`12.5%`) |/ 🔘| Won't match the numeric pattern — safely left as a string. Use `value_converters` if numeric value is needed.|
+| Leading zeros (ZIP codes, phone numbers, SKUs, account numbers) | 🔘 | `convert_values_to_numeric: { except: [:zip, :phone, :sku] }`. Without this, `"01234"` becomes `1234`. One of the most common silent data loss bugs in CSV processing! US ZIP codes have leading zeroes. |
 | NULL / empty value variants (`NULL`, `\N`, `N/A`, `(null)`, `#N/A`) | 🔘 | Use `nil_values_matching: /\A(NULL\\|\\N\|N\/A\|#N\/A\|\\(null\\))\z/i`. Without configuration these are left as literal strings. |
 | Date values (`2023-01-15`, `01/02/2023`, `Jan 2, 2023`) | 🔘 | Use `value_converters` with a date parsing lambda. SmarterCSV does not auto-convert dates — format ambiguity (`01/02/2023` = Jan 2 or Feb 1?) makes auto-conversion unsafe. |
 | Boolean variants (`Y/N`, `Yes/No`, `TRUE/FALSE`, `1/0`, `X/` in SAP) | 🔘 | Use `value_converters` for the relevant columns. |
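
The `value_converters` hint added to the currency and percentage rows might be implemented along these lines. This is a minimal sketch: the class names and the `:price` / `:growth` column names are made up for illustration, and it assumes the converter interface is an object responding to `convert(value)`. Only the US-style `$1,234.56` format is handled; a value such as `€1.234,56` is returned unchanged.

```ruby
# Illustrative converters (hypothetical names, not shipped with SmarterCSV).
# Assumption: value_converters calls `convert(value)` on the object it is given.
class UsCurrencyConverter
  # "$1,234.56" -> 1234.56; anything that is not plainly numeric stays as-is.
  def self.convert(value)
    return value unless value.is_a?(String)
    cleaned = value.delete('$,').strip
    cleaned.match?(/\A-?\d+(\.\d+)?\z/) ? cleaned.to_f : value
  end
end

class PercentConverter
  # "12.5%" -> 0.125; values without a trailing percent sign stay as-is.
  def self.convert(value)
    return value unless value.is_a?(String)
    stripped = value.strip
    return value unless stripped.match?(/\A-?\d+(\.\d+)?%\z/)
    stripped.chomp("%").to_f / 100
  end
end

# Possible wiring, with assumed file and column names:
#   SmarterCSV.process("sales.csv",
#     value_converters: { price: UsCurrencyConverter, growth: PercentConverter })
```

Floats are used here for brevity; for money amounts a decimal type is usually the safer choice.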
@@ -126,11 +126,11 @@ Numeric conversion is one of the most common sources of data loss. SmarterCSV co
 | MySQL `SELECT INTO OUTFILE` | Backslash quote escaping || `quote_escaping: :auto` default. |
 | PostgreSQL `COPY TO` | Backslash quote escaping, `\N` for NULL | ✅ / 🔘 | Escaping handled automatically; `\N` as nil requires `nil_values_matching`. |
 | Excel `Save As CSV` | UTF-8 BOM, RFC 4180 quoting, 1,048,576 row limit || BOM stripped, quoting handled. Row limit is an Excel constraint — SmarterCSV will parse whatever Excel wrote. |
+| Government open data portals | Semicolons as separator, Latin-1, inconsistent quoting | ✅ / 🔘 | `col_sep: :auto` handles semicolons; specify `file_encoding:` if non-UTF-8. |
+| Bioinformatics (VCF-derived) | Thousands of columns (one sample per column) || No column count limit in the parsing hot path. |
 | QuickBooks exports | Windows-1252 encoding, currency-formatted values | 🔘 | Specify `file_encoding: 'windows-1252'`. Currency values like `"$1,234.56"` stay as strings. |
 | Shopify / WooCommerce | Pipe-delimited values within a field (`tag1\|tag2\|tag3`) | 🔘 | Use `value_converters` to split on `\|` for the relevant column. |
 | Qualtrics / SurveyMonkey | 200–800 columns, multi-row headers, HTML in values | 🔘 | Multi-row headers require pre-processing; HTML in values left as-is (use value_converters to strip). |
-| Government open data portals | Semicolons as separator, Latin-1, inconsistent quoting | ✅ / 🔘 | `col_sep: :auto` handles semicolons; specify `file_encoding:` if non-UTF-8. |
-| Bioinformatics (VCF-derived) | Thousands of columns (one sample per column) || No column count limit in the parsing hot path. |
 | Gzipped CSV (`.csv.gz`) | Compressed file | 🔘 | Decompress and pass the resulting IO object: `SmarterCSV.process(Zlib::GzipReader.open(path))`. |
 | HTTP streaming | Parsing from a live HTTP response | 🔘 | Pass any IO-compatible object that responds to `#gets`. |
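
The gzip and HTTP streaming rows both rely on the input responding to `#gets`. The sketch below uses only the Ruby standard library to show that `Zlib::GzipReader` behaves that way; the sample file is created on the fly so the example is self-contained, and the SmarterCSV call from the table appears only as a comment.

```ruby
require "zlib"
require "tempfile"

# Write a small gzipped CSV so the sketch does not depend on an existing file.
tmp = Tempfile.new(["sample", ".csv.gz"])
tmp.close
Zlib::GzipWriter.open(tmp.path) { |gz| gz.write("name,age\nAlice,30\nBob,25\n") }

# GzipReader decompresses as it reads and responds to #gets, so the file
# never has to be unpacked to disk or loaded into memory as a whole.
lines = []
Zlib::GzipReader.open(tmp.path) do |gz|
  while (line = gz.gets)
    lines << line.chomp
  end
end
tmp.unlink

# With the gem installed, the same reader would be handed to the parser as in
# the table row: SmarterCSV.process(Zlib::GzipReader.open(path))
```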
