Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interactive navigation for tree with millions of files is really slow #227

Closed
inga-lovinde opened this issue Jan 23, 2024 · 5 comments · Fixed by #228
Closed

Interactive navigation for tree with millions of files is really slow #227

inga-lovinde opened this issue Jan 23, 2024 · 5 comments · Fixed by #228
Assignees

Comments

@inga-lovinde
Copy link

inga-lovinde commented Jan 23, 2024

I observe some weird behavior with dua on large trees.

The scanned directory is 13M files total (across all subdirectories). It took dua ~7-8 hours to load it at 200 threads (it's a network mount, as described in #224); memory usage when in interactive mode is ~1.5GB.

I'm in interactive mode, in a directory with 1000 entries (all are subdirectories), with subtree containing 2M entries total.
I try to enter a subdirectory, it has under 193 entries (all are subdirectories), with its subtree containing just 2k entries. It takes ~10 seconds.
I try to enter its subdirectory, it has 8 entries (all are subdirectories), subtree containing 48 entries. It takes around a second.
I try to enter its subdirectory, it has 1 entry (subdirectory), subtree containing 5 entries, 3 levels deep inside of it there are 2 files. Navigating inside of it only takes a (noticeable) fraction of second for every action.
Navigating (using the u key) from the subdirectory with subtree containing 5 entries to subtree containing 48 entries takes around a second.
Pressing u key there (to get to the subdirectory with subtree containing 2k entries) takes ~10 seconds.
Pressing u key again (to get to the subdirectory with subtree containing 2M entries) takes around a minute.
Pressing u key again to get to the subdirectory with 8 entries and subtree containing 9M entries takes under a second, so this seems to be related to the number of entries in a directory, not to the size of its subtree.

By "takes XX seconds" I mean that the UI is stuck in the previous state and is completely unresponsive all this time; yet, oddly enough, dua CPU usage hovers around zero all this time.

(For reference, doing ls on the same directories takes under a second.)

That's on dua v2.26.0 from Alpine package repository (Alpine uses musl instead of libc, not sure if that's relevant here).

@Byron
Copy link
Owner

Byron commented Jan 23, 2024

Thanks for reporting.

What matters here is that lstat calls are very slow, which is run nearly unconditionally whenever an entry is entered. This is merely to detect presence, as it will highlight entries that aren't present anymore in red. This also means it won't detect type changes, i.e. dir to flie or vice-versa.

Maybe there is a fee to be paid when the lstat call goes though longer paths, which is probably why there is a difference between ls and the way this works here.

To fix it, one could probably just add a flag that prevents these lstat checks entirely. Independently of that, one could try to to speed up the procedure by doing a directory listing instead through which metadata might come back faster. But that's to be validated.

I will implement the command-line flag to disable this behaviour, but also encourage you to see if read_dir + DirEntry::metadata() is faster than a bunch of symlink_metadata() calls on the known entries in a directory. If that's the case, a contribution of that algorithm would make it faster for everyone even if that flag isn't used.

@Byron Byron self-assigned this Jan 23, 2024
Byron added a commit that referenced this issue Jan 23, 2024
With it, in interactive mode, entries will not be checked for presence.

This can avoid laggy behaviour when switching between directories
as `lstat` calls will not run, which can be slow on some filesystems.
@Byron
Copy link
Owner

Byron commented Jan 23, 2024

When thinking about this use-case I can also think of another improvement: How useful would it be to be able to dump the graph in some format, say JSON, so it could be made accessible by other tooling? Waiting 8 hours for something that is lost when hitting q seems like waste to be avoided.

From there I could absolutely imagine to add additional sub-commands that can handle these snapshot files, to compute changes between them for example. The simplest case would probably be to be able to use the UI to visualise such a snapshot, instead of re-scanning the underlying tree.

Maybe @piotrwach is interested in this as well :).

@Byron
Copy link
Owner

Byron commented Jan 23, 2024

@gosuwachu
Copy link
Contributor

gosuwachu commented Jan 24, 2024

@Byron Yes, I may be interested :) Depending on how it is implemented I think dua could then also be used for analysing directory trees from other sources, like for example S3,. E.g. all that would be required is to dump the object listing in the right format:

<name> <size> <mtime>

Such file then could be loaded in dua in interactive mode. What this would potentially also allow is interactively analysing file systems on other machines without actually running dua there: ssh remote-server find /directory/to/scan -t f | dua i --from-traversal

@Byron
Copy link
Owner

Byron commented Jan 25, 2024

I absolutely love your thoughts! With this feature, dua i would become a front-end for viewing and organizing a graph data structure, which of course can also be created by other tools then.

ssh find… | dua i --from-traversal seems like a separate feature though as that new data structure wouldn't be involved directly. I also don't see how a simple find would produce all the data that's needed per entry, which I find critical to make this 'remote crawler that isn't dua' scenario work (after all, if dua was on the remote one could just use that). However, if that could be made to work, that would definitely be a nice feature to have as well.

For the terminal application input now is just a stream of events, and it doesn't have to care at all about their source anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants