Considering 92 dialects.
SimpleDialect(',', '', ''): P = 22104.867952 T = 0.003101 Q = 68.546737
SimpleDialect(',', '"', ''): P = 13927.762095 T = 0.003668 Q = 51.090510
SimpleDialect(',', '"', '/'): P = 13839.682333 T = 0.002461 Q = 34.060128
SimpleDialect(',', "'", ''): P = 12072.093333 T = 0.003278 Q = 39.571560
SimpleDialect(';', '', ''): P = 106.613556 T = 0.000003 Q = 0.000345
SimpleDialect(';', '"', ''): P = 99.261000 T = 0.000000 Q = 0.000000
SimpleDialect(';', '"', '/'): P = 50.238917 skip.
SimpleDialect(';', "'", ''): P = 49.981222 skip.
SimpleDialect('', '', ''): P = 308.696000 T = 0.000000 Q = 0.000000
SimpleDialect('', '"', ''): P = 194.530000 T = 0.000000 Q = 0.000000
SimpleDialect('', '"', '/'): P = 96.652000 T = 0.000000 Q = 0.000000
SimpleDialect('', "'", ''): P = 144.787000 T = 0.000000 Q = 0.000000
SimpleDialect(' ', '', ''): P = 17818.683137 T = 0.346978 Q = 6182.686103
SimpleDialect(' ', '', '/'): P = 17818.565863 T = 0.346984 Q = 6182.762051
SimpleDialect(' ', '"', ''): P = 11300.749933 T = 0.353544 Q = 3995.309179
SimpleDialect(' ', '"', '/'): P = 10372.973520 T = 0.355343 Q = 3685.960429
SimpleDialect(' ', "'", ''): P = 7231.699311 T = 0.343354 Q = 2483.032090
SimpleDialect(' ', "'", '/'): P = 7231.658120 T = 0.343362 Q = 2483.075319
SimpleDialect('#', '', ''): P = 163.330000 skip.
SimpleDialect('#', '"', ''): P = 103.253000 skip.
SimpleDialect('#', '"', '/'): P = 67.761333 skip.
SimpleDialect('#', "'", ''): P = 78.132000 skip.
SimpleDialect('$', '', ''): P = 155.096500 skip.
SimpleDialect('$', '"', ''): P = 97.764000 skip.
SimpleDialect('$', '"', '/'): P = 64.601000 skip.
SimpleDialect('$', "'", ''): P = 72.892500 skip.
SimpleDialect('%', '', ''): P = 104.950222 skip.
SimpleDialect('%', '', '\\'): P = 104.783889 skip.
SimpleDialect('%', '"', ''): P = 65.896889 skip.
SimpleDialect('%', '"', '/'): P = 65.765333 skip.
SimpleDialect('%', '"', '\\'): P = 65.730556 skip.
SimpleDialect('%', "'", ''): P = 49.648556 skip.
SimpleDialect('%', "'", '\\'): P = 49.482222 skip.
SimpleDialect('&', '', ''): P = 2940.570750 skip.
SimpleDialect('&', '', '/'): P = 2940.446000 skip.
SimpleDialect('&', '"', ''): P = 1936.209667 skip.
SimpleDialect('&', '"', '/'): P = 1441.305200 skip.
SimpleDialect('&', "'", ''): P = 1340.900250 skip.
SimpleDialect('&', "'", '/'): P = 1340.775500 skip.
SimpleDialect('*', '', ''): P = 156.344000 skip.
SimpleDialect('*', '"', ''): P = 97.514500 skip.
SimpleDialect('*', '"', '/'): P = 65.599000 skip.
SimpleDialect('*', "'", ''): P = 73.142000 skip.
SimpleDialect('+', '', ''): P = 156.344000 skip.
SimpleDialect('+', '', '\\'): P = 156.094500 skip.
SimpleDialect('+', '"', ''): P = 99.011500 skip.
SimpleDialect('+', '"', '/'): P = 65.266333 skip.
SimpleDialect('+', '"', '\\'): P = 99.011500 skip.
SimpleDialect('+', "'", ''): P = 73.890500 skip.
SimpleDialect('+', "'", '\\'): P = 73.890500 skip.
SimpleDialect('-', '', ''): P = 1456.570500 skip.
SimpleDialect('-', '"', ''): P = 914.921750 skip.
SimpleDialect('-', '"', '/'): P = 700.916267 skip.
SimpleDialect('-', "'", ''): P = 687.513667 skip.
SimpleDialect(':', '', ''): P = 155.096500 skip.
SimpleDialect(':', '"', ''): P = 97.514500 skip.
SimpleDialect(':', '"', '/'): P = 64.933667 skip.
SimpleDialect('<', '', ''): P = 155.845000 skip.
SimpleDialect('<', '"', ''): P = 97.764000 skip.
SimpleDialect('<', '"', '/'): P = 65.100000 skip.
SimpleDialect('<', "'", ''): P = 73.391500 skip.
SimpleDialect('?', '', ''): P = 155.595500 skip.
SimpleDialect('?', '"', ''): P = 98.512500 skip.
SimpleDialect('?', '"', '/'): P = 96.652000 skip.
SimpleDialect('@', '', ''): P = 156.344000 skip.
SimpleDialect('@', '"', ''): P = 98.762000 skip.
SimpleDialect('@', '"', '/'): P = 64.767333 skip.
SimpleDialect('@', "'", ''): P = 73.391500 skip.
SimpleDialect('\\', '', ''): P = 105.282889 skip.
SimpleDialect('\\', '"', ''): P = 66.063222 skip.
SimpleDialect('\\', '"', '/'): P = 66.098000 skip.
SimpleDialect('\\', "'", ''): P = 74.140000 skip.
SimpleDialect('^', '', ''): P = 154.597500 skip.
SimpleDialect('^', '"', ''): P = 97.514500 skip.
SimpleDialect('^', '"', '/'): P = 64.601000 skip.
SimpleDialect('^', "'", ''): P = 72.643000 skip.
SimpleDialect('_', '', ''): P = 156.094500 skip.
SimpleDialect('_', '"', ''): P = 98.013500 skip.
SimpleDialect('_', '"', '/'): P = 65.100000 skip.
SimpleDialect('_', "'", ''): P = 73.391500 skip.
SimpleDialect('|', '', ''): P = 293996.190476 T = 0.946519 Q = 278273.106576
SimpleDialect('|', '', '/'): P = 146998.094048 skip.
SimpleDialect('|', '', '@'): P = 146998.094048 skip.
SimpleDialect('|', '', '\\'): P = 146998.092857 skip.
SimpleDialect('|', '"', ''): P = 185266.666667 skip.
SimpleDialect('|', '"', '/'): P = 46024.763214 skip.
SimpleDialect('|', '"', '@'): P = 92633.332143 skip.
SimpleDialect('|', '"', '\\'): P = 92633.330952 skip.
SimpleDialect('|', "'", ''): P = 12535.572981 skip.
SimpleDialect('|', "'", '/'): P = 12535.572981 skip.
SimpleDialect('|', "'", '@'): P = 12535.572765 skip.
SimpleDialect('|', "'", '\\'): P = 12535.572981 skip.
I think I understand what's going on: You designed this for small-ish datasets, and so you reprocess the whole file for every dialog to determine what makes most sense.
Running normal form detection ...
Didn't match any normal forms.
Running data consistency measure ...
Considering 4 dialects.
SimpleDialect(',', '', ''): P = 4.500000 T = 0.000000 Q = 0.000000
SimpleDialect('', '', ''): P = 0.009000 T = 0.000000 Q = 0.000000
SimpleDialect(' ', '', ''): P = 1.562500 T = 0.312500 Q = 0.488281
SimpleDialect('|', '', ''): P = 8.571429 T = 0.952381 Q = 8.163265
SimpleDialect('|', '', '')
When I run it on the above mentioned file, it immediately (without futzing) produces the correct answer:
And on the off-chance you would like me to add this algorithm to the codebase, where would it go?
Hello,
This is a very neat project! I was thinking "I should collect a bunch of CSV files from the web and do statistics to see what the dialects are, and their predominance, to be able to better detect them" and then I found your paper and Python package! Congrats on this very nice contribution.
I am trying to see how
clevercsvperforms on FEC data. For instance, let's consider this file:https://www.fec.gov/files/bulk-downloads/1980/indiv80.zip
When I try to open the file with
clevercsv, it takes an inordinate of time, and seems to be hanging. So I tried to use the sniffer as suggested in your example Binder.It prints out this:
and then a while later (a few minutes later) starts printing:
I would say this takes about 30 minutes, and finally it concludes:
I think I understand what's going on: You designed this for small-ish datasets, and so you reprocess the whole file for every dialog to determine what makes most sense.
I would be tempted to think this is because I feed the data as a variable
content, following your example, rather than provide the filename directly. However when I tried to read theread_csvmethod directly with the filename, it also was really, very, very slow. So I think in all situations currently,clevercsvtrips on this file, and more generally this type of file.When I take the initiative to truncate the data arbitrarily,
clevercsvworks beautifully. But shouldn't the truncating be something the library, as opposed to the user does?provides in a few seconds:
@GjjvdBurg If this is not a known problem, may I suggest using some variation of "infinite binary search"?
I have implemented this algorithm here:
https://gist.github.com/jlumbroso/c123a30a2380b58989c7b12fe4b4f49e
When I run it on the above mentioned file, it immediately (without futzing) produces the correct answer:
And on the off-chance you would like me to add this algorithm to the codebase, where would it go?