Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

diff: fix poor performance #225

Open
andrewliebenow opened this issue Sep 20, 2024 · 1 comment
Open

diff: fix poor performance #225

andrewliebenow opened this issue Sep 20, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@andrewliebenow
Copy link
Contributor

Is your feature request related to a problem? Please describe.
diff appears to process files multiple orders of magnitude slower than other implementations:

❯ time -- diff -- ./file-a ./file-b
Elapsed time: 3.381272 seconds
User time: 0.000000 seconds
System time: 0.000000 seconds

❯ time -- diffutils -- ./file-a ./file-b
Elapsed time: 0.005884 seconds
User time: 0.000000 seconds
System time: 0.000000 seconds

❯ du -k -- ./file-a ./file-b
1384    ./file-a
1384    ./file-b

In this case, file-a and file-b are identical.

Describe the solution you'd like
I think we should aim for diff to be able to process files at 10 mebibytes per second, at the very least (on my system, the current implementation is getting less than 0.5 mebibytes per second). On my system, diffutils from uutils does over 200 mebibytes per second.

Describe alternatives you've considered
N/A

Additional context
N/A

@andrewliebenow andrewliebenow added the enhancement New feature or request label Sep 20, 2024
@gcanat
Copy link
Contributor

gcanat commented Sep 30, 2024

Indeed, the current implementation seems sub-optimal. For every line in file1 it is iterating over every line of file2. Comparing a 90Mb ascii txt file makes it crash :)

I implemented the histogram diff algorithm, probably a bit naively at this point, but it is giving decent results so far (4.5s for diffing two 90Mb files). I was just a bit baffled by the Hunk struct and why the need for the changes attribute. In my opinion, once we have the kind and line start/end for both files we have everything we need. But maybe I am missing something ?

Note: out of the 4.5s, 2s are spent just loading the files (11M lines...)

Anyway, I should be able to open a draft PR soon and maybe get some more guidance.

@gcanat gcanat mentioned this issue Sep 30, 2024
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants