
Add option to ignore files matching filename pattern #303

Open
Atrate opened this issue Feb 22, 2025 · 11 comments
Labels
roadmap issue will be addressed in a future major code iteration

Comments

@Atrate

Atrate commented Feb 22, 2025

I know that bees is extent-based and not file-based, but considering the log messages, it seems to be able to identify which files it's taking pieces out of.

As such, I'd like to make a feature request for ignoring certain files based on a glob or regex pattern.

Use case: some workflows, especially on Qubes OS, make use of large numbers of big temporary files that are near-exact clones of existing VM images. The problem is that they are temporary (in the case of DispVMs) and get discarded after VM shutdown - running bees over those images (whenever the image changes) causes unnecessary write cycles.

As such, it'd be nice if bees could exclude files, e.g. by glob pattern (for Qubes it would be *dirty.img).

@Zygo
Owner

Zygo commented Feb 28, 2025

considering the log messages, it seems to be able to identify which files it's taking pieces out of.

It asks the kernel to identify the file through /proc/self/fd/ after opening the file.

There are some problems with using that interface for filename filtering. The kernel can't identify files that don't have file names (e.g. because they're deleted, or because there's no valid path from / to the file in the current mount namespace). Most of those are non-issues for logging, because logging has no impact on program behavior, but these issues would break a filter that is actively making decisions based on the filename.

There are some performance issues with making queries through that interface at high rates. In the log messages, the query cost is tied to an error event or some other operation that is even more expensive, but for filtering we would need to hit that interface on every open.

Different parts of the filename have different filtering costs. If we know that an entire subvol is excluded, we don't have to look up names on anything in the subvol, and if we cache filtering results, we can skip name resolution by ID. So we should consider the subvol path separately from the path within the subvol.

bees does have parts of the filename in memory at various points in its code, but they aren't accessible from a single place where they could be passed to regex or fnmatch. Once part of a path is resolved, it's replaced with an open file descriptor and the name of that component is dropped. It's fairly straightforward to save those components in a cache and reassemble them for filtering, but there's no reason to do so before the filtering infrastructure exists.

All that said, this particular use case could be served by calling fnmatch on the last component of the filename (or all path components starting from the root of the subvol), and that call can be easily inserted into the existing code. This doesn't require finding a way to save all the path components, and doesn't conflict with any more specialized filtering opportunities when they become available. Presumably we'll always want to be able to filter by the part of the path inside the subvol, so we can have --exclude-glob-basename today, and add the full set of filters later on, e.g. --exclude-regex-subvol, --exclude-glob-path, --exclude-compressed, etc.
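A minimal sketch of the basename check proposed above (the flag name --exclude-glob-basename comes from the comment; the helper name here is hypothetical): call fnmatch(3) on the last path component only.

```cpp
// Hypothetical helper for an --exclude-glob-basename style filter.
#include <fnmatch.h>
#include <string>

bool excluded_by_basename(const std::string &path, const std::string &glob) {
    // Take only the last path component, then glob-match it.
    auto slash = path.rfind('/');
    std::string base = (slash == std::string::npos) ? path
                                                    : path.substr(slash + 1);
    return fnmatch(glob.c_str(), base.c_str(), 0) == 0;
}
```

For the OP's case, excluded_by_basename("/var/lib/qubes/appvms/work/dirty.img", "*dirty.img") returns true while other image names pass through.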

@Zygo Zygo added the roadmap issue will be addressed in a future major code iteration label Feb 28, 2025
@kakra
Contributor

kakra commented Feb 28, 2025

Wouldn't such a filter potentially hit a lot of false positives (and vice versa), because a matching hashed block could appear both in files you want to filter and in files you don't?

Also, in the OP's context, it looks more like a filter at the subvol level would be needed. That would require Qubes OS to put clones into separate subvolumes, though.

@Zygo
Owner

Zygo commented Mar 1, 2025

Wouldn't such a filter potentially hit a lot of false positives (and vice verse) because a matching hashed block could appear in files both you want to filter and you don't want to filter?

Filename filters are necessarily filters on references, not extents or blocks. The dedupe algorithm might still find matching blocks, but it would be prevented from modifying some references to those blocks.

For #249, all the filter has to do is match and exclude whatever references lead to /boot, while still permitting dedupe of other copies and reflinks. Note that if the blocks are reflinked from files that don't match the filter, the blocks will be scanned, and dedupe may happen between any non-excluded reflinks. This will prevent the /boot files from being fully deduped, but there are few of them, so the total lost space is still better than grub boot failure.

For the OP use case, we want to exclude scanning of the files as well due to their size and ephemeral contents. For the moment, a reference filter will prevent such scanning, since scanning requires opening the files to read them, and the filenames can be checked against the filter.

In the future, with csum-based scan, it's not necessary to read the files during scanning. The file csums will always be scanned and matched for dedupe, but that process is 2-4 orders of magnitude faster than the current scanning process (and 1-2 orders of magnitude of that is the cost of enumerating references to an extent at all). The pathname filter kicks in later in the process, when it's time to identify reflinks in order to make open calls just before deduping known duplicate extents, and we can no longer avoid determining the file names. At that point, we discover the filename matches an exclude rule, and we can abandon the dedupe. Note that this is still a net gain over the current state, because we would only do this in cases where there's a known duplicate block, i.e. we won't do it any more when there are only unique blocks.

There is a subtle difference in that change: in the current scanner, the reflink name filter matches or doesn't match individually for each reflink, which results in incomplete dedupe of blocks when some references match the filter and some do not. Extent scan batches all the reflinks to an extent together, but the reflinks are still processed independently.

In the future, the scanner could check all the reflink names before doing anything to the extent, and abandon the extent if any reference matches an exclude filter (include filters would require all references to match), to avoid having extents that are partially deduplicated. This requires two passes through all the reflink names--one to filter their names, and one for each reference during dedupe--which might slow things down when the filter is used, but it might still not slow down more than deduping all the ephemeral blocks.

My extended plan for filtering has a schema that allows users to specify whether a path rule matches any reference, all references, or applies to individual references, so the various semantics and cost tradeoffs will be available.
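The "abandon the extent if any reference matches an exclude filter" pass described above might look like this. All names and types are illustrative, not bees internals; it just shows the first of the two passes, over the reflink names, before anything touches the extent.

```cpp
// Hypothetical first pass: decide whether to dedupe an extent at all,
// given the names of every reflink to it and a set of exclude globs.
#include <fnmatch.h>
#include <string>
#include <vector>

bool should_dedupe_extent(const std::vector<std::string> &reflink_names,
                          const std::vector<std::string> &exclude_globs) {
    for (const auto &name : reflink_names)
        for (const auto &glob : exclude_globs)
            if (fnmatch(glob.c_str(), name.c_str(), 0) == 0)
                return false;   // any excluded reference poisons the extent
    return true;                // no reference excluded: safe to dedupe all
}
```

An include filter would invert the quantifier: require every reference to match rather than rejecting on the first one that does.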

@Zygo
Owner

Zygo commented Mar 1, 2025

Side note: if the goal is to avoid deduping large amounts of ephemeral data, the fastest way to do that is to add a gap between the current filesystem transid and the highest scanned transid. If bees stays ~100 transactions behind the filesystem, it will only dedupe data that has been untouched for at least an hour. That filtering is done by the crawlers using tree search, before any of the later processing stages. The main drawback is that this scheme can't dedupe anything until it gets old enough.
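The transid-gap idea reduces to a one-line cap on the crawler's upper search bound. A minimal sketch with assumed names (not bees internals):

```cpp
// Sketch: cap the crawler's upper transid bound at (current - gap), so
// only extents untouched for at least `gap` transactions are eligible.
#include <cstdint>

uint64_t scan_ceiling(uint64_t current_transid, uint64_t gap) {
    // Data written after this transid is skipped by the tree-search
    // filter, before any later processing stage sees it.
    return current_transid > gap ? current_transid - gap : 0;
}
```

With a gap of ~100 transactions and the usual btrfs commit cadence, that works out to roughly an hour of lag, matching the estimate above.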

@kakra
Contributor

kakra commented Mar 1, 2025

Okay, I wasn't aware that you can get all the file names that hold a reference to that extent. Because in /proc/self/fd/ each FD only points to a single file...

@Atrate
Author

Atrate commented Mar 1, 2025

Thank you for the exhaustive responses!

Side note: if the goal is to avoid deduping large amounts of ephemeral data, the fastest way to do that is to add a gap between the current filesystem transid and the highest scanned transid. If bees stays ~100 transactions behind the filesystem, it will only dedupe data that has been untouched for at least an hour. That filtering is done by the crawlers using tree search, before any of the later processing stages. The main drawback is that this scheme can't dedupe anything until it gets old enough.

Is this a feature that can be enabled right now in bees or is it to-be-implemented? Regardless of the filename filtering feature, maybe such a transaction delay could be quite useful for all workflows, as it would help with large amounts of I/O for short-term cache data on any OS.

@kakra
Contributor

kakra commented Mar 2, 2025

I think the general goal of bees is to follow new transactions as fast as possible to make use of a still-warm cache. So it's probably not useful for all workflows, but it could be a useful option.

@Zygo
Owner

Zygo commented Mar 4, 2025

Is this a feature that can be enabled right now in bees or is it to-be-implemented?

To be implemented, but it's a config option and a "+ config_value" in the code.

maybe such a transaction delay could be quite useful for all workflows, as it would help with large amounts of I/O for short-term cache data on any OS.

I see two common cases in the field:

  1. There's not a lot of writing, so bees keeps up easily, even if it dedupes all the unnecessary things.
  2. There's much more writing than bees can handle, so it's always hundreds (or thousands, or tens of thousands) of transactions behind. A variation on this is that bees is intentionally stopped or throttled for some time, which causes a similar effect (writing_speed > processing_speed).

The transaction delay is useful in cases where there's enough writing to get out of case 1 (i.e. still keeping up, but not "easily"), but not enough writing to get into case 2 (i.e. where bees no longer has a choice, and has the transaction delay forced on it). It's also useful for non-performance use cases, like reducing SSD wear.

the general goal of bees is to follow new transactions as fast as possible to make use of a still warm cache

As usual for btrfs, it's tradeoffs all the way down. The further behind transid_max is, the more metadata the crawlers have to filter out (the btrfs search range filter only filters out old transactions, not new ones), so the slower scanning gets when it's not completely up to date. On the other hand, the crawlers are very fast at scanning, so keeping a few thousand transactions behind only burns a CPU second or two per cycle.

Reads are cheap (only a single-digit percentage of dedupe time), and you need a lot of RAM to be able to hold hundreds of transactions of write data and have that data size be large enough that you can't simply read it back from disk in a few seconds.

Most of the places where a warm cache is useful are in the dedupe of a single extent--after that, the cache is not needed and one day bees could explicitly drop it. Dedupes are the most expensive thing, so skipping dedupes we don't need helps more than anything else.

@Zygo
Owner

Zygo commented Mar 4, 2025

I wasn't aware that you can get all the file names that have that extent reference. Because in /proc/self/fd/ each FD only points to a single file...

Oof...while implementing this, I recalled inodes can have multiple file names--a fact that the entire rest of bees can gleefully ignore.

So we can hit the any/all question (and either pay the cost to answer it, or settle for a leaky filter) even with a single reference to a single extent, if the reference happens to be in a file with hardlinks.

@kakra
Contributor

kakra commented Mar 4, 2025

Oof...while implementing this, I recalled inodes can have multiple file names

That's probably what I meant above because we previously had that discussion that you cannot easily pinpoint an operation to a single filename. ;-)

@Zygo
Owner

Zygo commented Mar 4, 2025

we previously had that discussion that you cannot easily pinpoint an operation to a single filename.

It's neither easy nor cheap, but it also seems to be necessary. It seems likely that filtering will remain necessary and useful even after all other known issues are resolved, so I'll have to find a way to do it that doesn't suck.
