clevercsv sniffer slows to a crawl on large-ish files (e.g. FEC data)

Hello,

This is a very neat project! I was thinking "_I should collect a bunch of CSV files from the web and do statistics to see what the dialects are, and their predominance, to be able to better detect them_" and then I found your paper and Python package! Congrats on this very nice contribution.

I am trying to see how `clevercsv` performs on FEC data. For instance, let's consider this file:

https://www.fec.gov/files/bulk-downloads/1980/indiv80.zip

```shell
$ head -5 fec-indiv-1979–1980.csv
C00078279|A|M11|P|80031492155|22Y||MCKENNON, K R|MIDLAND|MI|00000|||10031979|400|||||CONTRIBUTION REF TO INDIVIDUAL|3062020110011466469
C00078279|A|M11||79031415137|15||OREFFICE, P|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|1500||||||3061920110000382948
C00078279|A|M11||79031415137|15||DOWNEY, J|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|300||||||3061920110000382949
C00078279|A|M11||79031415137|15||BLAIR, E|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|1000||||||3061920110000382950
C00078287|A|Q1||79031231889|15||BLANCHARD, JOHN A|CHICAGO|IL|60685|||03201979|200||||||3061920110000383914
```

When I try to open the file with `clevercsv`, it takes an inordinate of time, and seems to be hanging. So I tried to use the sniffer as suggested in your example Binder.

```python
# downloaded, unzipped and renamed to a CSV file from:
# https://www.fec.gov/files/bulk-downloads/1980/indiv80.zip
content = open("fec-indiv-1979–1980.csv").read()
clevercsv.Sniffer().sniff(content, verbose=True)
```

It prints out this:

```text
Running normal form detection ...
Not normal, has potential escapechar.
Running data consistency measure ...
```

and then a while later (a few minutes later) starts printing:

```text
Considering 92 dialects.
SimpleDialect(',', '', ''):	P =    22104.867952	T =        0.003101	Q =       68.546737
SimpleDialect(',', '"', ''):	P =    13927.762095	T =        0.003668	Q =       51.090510
SimpleDialect(',', '"', '/'):	P =    13839.682333	T =        0.002461	Q =       34.060128
SimpleDialect(',', "'", ''):	P =    12072.093333	T =        0.003278	Q =       39.571560
SimpleDialect(';', '', ''):	P =      106.613556	T =        0.000003	Q =        0.000345
SimpleDialect(';', '"', ''):	P =       99.261000	T =        0.000000	Q =        0.000000
SimpleDialect(';', '"', '/'):	P =       50.238917	skip.
SimpleDialect(';', "'", ''):	P =       49.981222	skip.
SimpleDialect('', '', ''):	P =      308.696000	T =        0.000000	Q =        0.000000
SimpleDialect('', '"', ''):	P =      194.530000	T =        0.000000	Q =        0.000000
SimpleDialect('', '"', '/'):	P =       96.652000	T =        0.000000	Q =        0.000000
SimpleDialect('', "'", ''):	P =      144.787000	T =        0.000000	Q =        0.000000
SimpleDialect(' ', '', ''):	P =    17818.683137	T =        0.346978	Q =     6182.686103
SimpleDialect(' ', '', '/'):	P =    17818.565863	T =        0.346984	Q =     6182.762051
SimpleDialect(' ', '"', ''):	P =    11300.749933	T =        0.353544	Q =     3995.309179
SimpleDialect(' ', '"', '/'):	P =    10372.973520	T =        0.355343	Q =     3685.960429
SimpleDialect(' ', "'", ''):	P =     7231.699311	T =        0.343354	Q =     2483.032090
SimpleDialect(' ', "'", '/'):	P =     7231.658120	T =        0.343362	Q =     2483.075319
SimpleDialect('#', '', ''):	P =      163.330000	skip.
SimpleDialect('#', '"', ''):	P =      103.253000	skip.
SimpleDialect('#', '"', '/'):	P =       67.761333	skip.
SimpleDialect('#', "'", ''):	P =       78.132000	skip.
SimpleDialect('$', '', ''):	P =      155.096500	skip.
SimpleDialect('$', '"', ''):	P =       97.764000	skip.
SimpleDialect('$', '"', '/'):	P =       64.601000	skip.
SimpleDialect('$', "'", ''):	P =       72.892500	skip.
SimpleDialect('%', '', ''):	P =      104.950222	skip.
SimpleDialect('%', '', '\\'):	P =      104.783889	skip.
SimpleDialect('%', '"', ''):	P =       65.896889	skip.
SimpleDialect('%', '"', '/'):	P =       65.765333	skip.
SimpleDialect('%', '"', '\\'):	P =       65.730556	skip.
SimpleDialect('%', "'", ''):	P =       49.648556	skip.
SimpleDialect('%', "'", '\\'):	P =       49.482222	skip.
SimpleDialect('&', '', ''):	P =     2940.570750	skip.
SimpleDialect('&', '', '/'):	P =     2940.446000	skip.
SimpleDialect('&', '"', ''):	P =     1936.209667	skip.
SimpleDialect('&', '"', '/'):	P =     1441.305200	skip.
SimpleDialect('&', "'", ''):	P =     1340.900250	skip.
SimpleDialect('&', "'", '/'):	P =     1340.775500	skip.
SimpleDialect('*', '', ''):	P =      156.344000	skip.
SimpleDialect('*', '"', ''):	P =       97.514500	skip.
SimpleDialect('*', '"', '/'):	P =       65.599000	skip.
SimpleDialect('*', "'", ''):	P =       73.142000	skip.
SimpleDialect('+', '', ''):	P =      156.344000	skip.
SimpleDialect('+', '', '\\'):	P =      156.094500	skip.
SimpleDialect('+', '"', ''):	P =       99.011500	skip.
SimpleDialect('+', '"', '/'):	P =       65.266333	skip.
SimpleDialect('+', '"', '\\'):	P =       99.011500	skip.
SimpleDialect('+', "'", ''):	P =       73.890500	skip.
SimpleDialect('+', "'", '\\'):	P =       73.890500	skip.
SimpleDialect('-', '', ''):	P =     1456.570500	skip.
SimpleDialect('-', '"', ''):	P =      914.921750	skip.
SimpleDialect('-', '"', '/'):	P =      700.916267	skip.
SimpleDialect('-', "'", ''):	P =      687.513667	skip.
SimpleDialect(':', '', ''):	P =      155.096500	skip.
SimpleDialect(':', '"', ''):	P =       97.514500	skip.
SimpleDialect(':', '"', '/'):	P =       64.933667	skip.
SimpleDialect('<', '', ''):	P =      155.845000	skip.
SimpleDialect('<', '"', ''):	P =       97.764000	skip.
SimpleDialect('<', '"', '/'):	P =       65.100000	skip.
SimpleDialect('<', "'", ''):	P =       73.391500	skip.
SimpleDialect('?', '', ''):	P =      155.595500	skip.
SimpleDialect('?', '"', ''):	P =       98.512500	skip.
SimpleDialect('?', '"', '/'):	P =       96.652000	skip.
SimpleDialect('@', '', ''):	P =      156.344000	skip.
SimpleDialect('@', '"', ''):	P =       98.762000	skip.
SimpleDialect('@', '"', '/'):	P =       64.767333	skip.
SimpleDialect('@', "'", ''):	P =       73.391500	skip.
SimpleDialect('\\', '', ''):	P =      105.282889	skip.
SimpleDialect('\\', '"', ''):	P =       66.063222	skip.
SimpleDialect('\\', '"', '/'):	P =       66.098000	skip.
SimpleDialect('\\', "'", ''):	P =       74.140000	skip.
SimpleDialect('^', '', ''):	P =      154.597500	skip.
SimpleDialect('^', '"', ''):	P =       97.514500	skip.
SimpleDialect('^', '"', '/'):	P =       64.601000	skip.
SimpleDialect('^', "'", ''):	P =       72.643000	skip.
SimpleDialect('_', '', ''):	P =      156.094500	skip.
SimpleDialect('_', '"', ''):	P =       98.013500	skip.
SimpleDialect('_', '"', '/'):	P =       65.100000	skip.
SimpleDialect('_', "'", ''):	P =       73.391500	skip.
SimpleDialect('|', '', ''):	P =   293996.190476	T =        0.946519	Q =   278273.106576
SimpleDialect('|', '', '/'):	P =   146998.094048	skip.
SimpleDialect('|', '', '@'):	P =   146998.094048	skip.
SimpleDialect('|', '', '\\'):	P =   146998.092857	skip.
SimpleDialect('|', '"', ''):	P =   185266.666667	skip.
SimpleDialect('|', '"', '/'):	P =    46024.763214	skip.
SimpleDialect('|', '"', '@'):	P =    92633.332143	skip.
SimpleDialect('|', '"', '\\'):	P =    92633.330952	skip.
SimpleDialect('|', "'", ''):	P =    12535.572981	skip.
SimpleDialect('|', "'", '/'):	P =    12535.572981	skip.
SimpleDialect('|', "'", '@'):	P =    12535.572765	skip.
SimpleDialect('|', "'", '\\'):	P =    12535.572981	skip.
```
I would say this takes about 30 minutes, and finally it concludes:
```text
SimpleDialect('|', '', '')
```
I think I understand what's going on: You designed this for small-ish datasets, and so you reprocess the whole file for every dialog to determine what makes most sense.

I would be tempted to think this is because I feed the data as a variable `content`, following your example, rather than provide the filename directly. However when I tried to read the `read_csv` method directly with the filename, it also was really, very, very slow. So I think in all situations currently, `clevercsv` trips on this file, and more generally this type of file.

When I take the initiative to truncate the data arbitrarily, `clevercsv` works beautifully. But shouldn't the truncating be something the library, as opposed to the user does?

```python
clevercsv.Sniffer().sniff(content[0:1000], verbose=True)
```

provides in a few seconds:
```text
Running normal form detection ...
Didn't match any normal forms.
Running data consistency measure ...
Considering 4 dialects.
SimpleDialect(',', '', ''):	P =        4.500000	T =        0.000000	Q =        0.000000
SimpleDialect('', '', ''):	P =        0.009000	T =        0.000000	Q =        0.000000
SimpleDialect(' ', '', ''):	P =        1.562500	T =        0.312500	Q =        0.488281
SimpleDialect('|', '', ''):	P =        8.571429	T =        0.952381	Q =        8.163265
SimpleDialect('|', '', '')
```

@GjjvdBurg If this is not a known problem, may I suggest using some variation of "infinite binary search"?
- We start with a small size, and we truncate the file to that small size.
- The detected dialect may be wrong because we are only working on a small subset of the file, so we double the amount of content that we provide the sniffer, and see if it answers the same thing.
- We repeat a predetermined number of times (for instance 4 times), until we've asserted the sniffer has detected the same dialect for larger portions of the file.

I have implemented this algorithm here:
https://gist.github.com/jlumbroso/c123a30a2380b58989c7b12fe4b4f49e

When I run it on the above mentioned file, it immediately (without futzing) produces the correct answer:

```python
In[3]: probe_sniff(content)
Out[3]: SimpleDialect('|', '', '')
```

And on the off-chance you would like me to add this algorithm to the codebase, where would it go?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

clevercsv sniffer slows to a crawl on large-ish files (e.g. FEC data) #15

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

clevercsv sniffer slows to a crawl on large-ish files (e.g. FEC data) #15

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions