Interactive navigation for tree with millions of files is really slow #227

inga-lovinde · 2024-01-23T11:39:14Z

I observe some weird behavior with dua on large trees.

The scanned directory is 13M files total (across all subdirectories). It took dua ~7-8 hours to load it at 200 threads (it's a network mount, as described in #224); memory usage when in interactive mode is ~1.5GB.

I'm in interactive mode, in a directory with 1000 entries (all are subdirectories), with subtree containing 2M entries total.
I try to enter a subdirectory, it has under 193 entries (all are subdirectories), with its subtree containing just 2k entries. It takes ~10 seconds.
I try to enter its subdirectory, it has 8 entries (all are subdirectories), subtree containing 48 entries. It takes around a second.
I try to enter its subdirectory, it has 1 entry (subdirectory), subtree containing 5 entries, 3 levels deep inside of it there are 2 files. Navigating inside of it only takes a (noticeable) fraction of second for every action.
Navigating (using the u key) from the subdirectory with subtree containing 5 entries to subtree containing 48 entries takes around a second.
Pressing u key there (to get to the subdirectory with subtree containing 2k entries) takes ~10 seconds.
Pressing u key again (to get to the subdirectory with subtree containing 2M entries) takes around a minute.
Pressing u key again to get to the subdirectory with 8 entries and subtree containing 9M entries takes under a second, so this seems to be related to the number of entries in a directory, not to the size of its subtree.

By "takes XX seconds" I mean that the UI is stuck in the previous state and is completely unresponsive all this time; yet, oddly enough, dua CPU usage hovers around zero all this time.

(For reference, doing ls on the same directories takes under a second.)

That's on dua v2.26.0 from Alpine package repository (Alpine uses musl instead of libc, not sure if that's relevant here).

The text was updated successfully, but these errors were encountered:

Byron · 2024-01-23T14:24:06Z

Thanks for reporting.

What matters here is that lstat calls are very slow, which is run nearly unconditionally whenever an entry is entered. This is merely to detect presence, as it will highlight entries that aren't present anymore in red. This also means it won't detect type changes, i.e. dir to flie or vice-versa.

Maybe there is a fee to be paid when the lstat call goes though longer paths, which is probably why there is a difference between ls and the way this works here.

To fix it, one could probably just add a flag that prevents these lstat checks entirely. Independently of that, one could try to to speed up the procedure by doing a directory listing instead through which metadata might come back faster. But that's to be validated.

I will implement the command-line flag to disable this behaviour, but also encourage you to see if read_dir + DirEntry::metadata() is faster than a bunch of symlink_metadata() calls on the known entries in a directory. If that's the case, a contribution of that algorithm would make it faster for everyone even if that flag isn't used.

With it, in interactive mode, entries will not be checked for presence. This can avoid laggy behaviour when switching between directories as `lstat` calls will not run, which can be slow on some filesystems.

Byron · 2024-01-23T14:48:48Z

When thinking about this use-case I can also think of another improvement: How useful would it be to be able to dump the graph in some format, say JSON, so it could be made accessible by other tooling? Waiting 8 hours for something that is lost when hitting q seems like waste to be avoided.

From there I could absolutely imagine to add additional sub-commands that can handle these snapshot files, to compute changes between them for example. The simplest case would probably be to be able to use the UI to visualise such a snapshot, instead of re-scanning the underlying tree.

Maybe @piotrwach is interested in this as well :).

Byron · 2024-01-23T14:50:44Z

Mitigated in https://github.com/Byron/dua-cli/releases/tag/v2.28.0 .

gosuwachu · 2024-01-24T20:35:13Z

@Byron Yes, I may be interested :) Depending on how it is implemented I think dua could then also be used for analysing directory trees from other sources, like for example S3,. E.g. all that would be required is to dump the object listing in the right format:

<name> <size> <mtime>

Such file then could be loaded in dua in interactive mode. What this would potentially also allow is interactively analysing file systems on other machines without actually running dua there: ssh remote-server find /directory/to/scan -t f | dua i --from-traversal

Byron · 2024-01-25T06:22:05Z

I absolutely love your thoughts! With this feature, dua i would become a front-end for viewing and organizing a graph data structure, which of course can also be created by other tools then.

ssh find… | dua i --from-traversal seems like a separate feature though as that new data structure wouldn't be involved directly. I also don't see how a simple find would produce all the data that's needed per entry, which I find critical to make this 'remote crawler that isn't dua' scenario work (after all, if dua was on the remote one could just use that). However, if that could be made to work, that would definitely be a nice feature to have as well.

For the terminal application input now is just a stream of events, and it doesn't have to care at all about their source anymore.

Byron self-assigned this Jan 23, 2024

Byron mentioned this issue Jan 23, 2024

add --no-entry-check flag #228

Merged

Byron closed this as completed in #228 Jan 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Interactive navigation for tree with millions of files is really slow #227

Interactive navigation for tree with millions of files is really slow #227

inga-lovinde commented Jan 23, 2024 •

edited

Loading

Byron commented Jan 23, 2024

Byron commented Jan 23, 2024

Byron commented Jan 23, 2024

gosuwachu commented Jan 24, 2024 •

edited

Loading

Byron commented Jan 25, 2024

Interactive navigation for tree with millions of files is really slow #227

Interactive navigation for tree with millions of files is really slow #227

Comments

inga-lovinde commented Jan 23, 2024 • edited Loading

Byron commented Jan 23, 2024

Byron commented Jan 23, 2024

Byron commented Jan 23, 2024

gosuwachu commented Jan 24, 2024 • edited Loading

Byron commented Jan 25, 2024

inga-lovinde commented Jan 23, 2024 •

edited

Loading

gosuwachu commented Jan 24, 2024 •

edited

Loading