Add option to ignore files matching filename pattern #303
It asks the kernel to identify the file through btrfs's inode-to-path lookup interface. There are some problems with using that interface for filename filtering:

- The kernel can't identify files that don't have file names (e.g. because they're deleted, or because there's no valid path to them from the root of the filesystem).
- There are some performance issues with making queries through that interface at high rates. In the log messages, the query cost is tied to an error event or some other operation that is even more expensive, but for filtering we would need to hit that interface on every open.
- Different parts of the filename have different filtering costs. If we know that an entire subvol is excluded, we don't have to look up names on anything in the subvol, and if we cache filtering results, we can skip name resolution by ID. So we should consider the subvol path separately from the path within the subvol.
- bees does have parts of the filename in memory at various points in its code, but they aren't accessible from a single place where they could be passed to a filter.

All that said, this particular use case could be served by other means (see the side note on transid lag below).
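For concreteness, here is a minimal sketch of the kind of per-file name lookup involved, using the BTRFS_IOC_INO_PATHS ioctl (btrfs's inode-to-path interface). The helper name and buffer size are illustrative assumptions, not bees code:

```cpp
// Sketch: ask btrfs for all path names of an inode via BTRFS_IOC_INO_PATHS.
// Returns an empty list on error -- e.g. for a deleted (nameless) file.
#include <linux/btrfs.h>
#include <sys/ioctl.h>
#include <string>
#include <vector>

std::vector<std::string> ino_paths(int subvol_fd, __u64 inum)
{
    std::vector<char> buf(64 * 1024);  // illustrative buffer size
    auto *fspath = reinterpret_cast<btrfs_data_container *>(buf.data());

    btrfs_ioctl_ino_path_args args = {};
    args.inum = inum;                               // inode to resolve
    args.size = buf.size();                         // output buffer size
    args.fspath = reinterpret_cast<__u64>(fspath);  // output buffer

    std::vector<std::string> paths;
    if (ioctl(subvol_fd, BTRFS_IOC_INO_PATHS, &args) < 0)
        return paths;

    // val[i] is the offset (relative to val) of a NUL-terminated path,
    // relative to the subvol root. One inode can have many names.
    for (__u32 i = 0; i < fspath->elem_cnt; ++i)
        paths.emplace_back(reinterpret_cast<const char *>(fspath->val)
                           + fspath->val[i]);
    return paths;
}
```

A call like this on every open is exactly the high-rate cost described above; caching results by subvol and inode ID is what would keep it affordable.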
Wouldn't such a filter potentially hit a lot of false positives (and vice versa), because a matching hashed block could appear both in files you want to filter and in files you don't? Also, in the OP's context, it looks more like a filter at the subvol level would be needed. That would require Qubes OS to put clones into separate subvolumes, though.
Filename filters are necessarily filters on references, not extents or blocks. The dedupe algorithm might still find matching blocks, but it would be prevented from modifying some references to those blocks. For #249, all the filter has to do is match and exclude whatever references lead to the paths in question.

For the OP use case, we want to exclude scanning of the files as well, due to their size and ephemeral contents. For the moment, a reference filter will prevent such scanning, since scanning requires opening the files to read them, and the filenames can be checked against the filter.

In the future, with csum-based scan, it's not necessary to read the files during scanning. The file csums will always be scanned and matched for dedupe, but that process is 2-4 orders of magnitude faster than the current scanning process (and 1-2 orders of magnitude of that is the cost of enumerating references to an extent at all). The pathname filter kicks in later in the process, when it's time to identify reflinks in order to make open calls just before deduping known duplicate extents, and we can no longer avoid determining the file names. At that point, we discover the filename matches an exclude rule, and we can abandon the dedupe. Note that this is still a net gain over the current state, because we would only do this in cases where there's a known duplicate block, i.e. we won't do it any more when there are only unique blocks.

There is a subtle difference in that change: in the current scanner, the reflink name filter matches or doesn't match individually for each reflink, which results in incomplete dedupe of blocks when some references match the filter and some do not. Extent scan batches all the reflinks to an extent together, but the reflinks are still processed independently. In the future, the scanner could check all the reflink names before doing anything to the extent, and abandon the extent if any reference matches an exclude filter (include filters would require all references to match), to avoid having extents that are partially deduplicated. This requires two passes through all the reflink names--one to filter their names, and one for each reference during dedupe--which might slow things down when the filter is used, but it might still not slow down more than deduping all the ephemeral blocks.

My extended plan for filtering has a schema that allows users to specify whether a path rule matches any reference, all references, or applies to individual references, so the various semantics and cost tradeoffs will be available.
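A rough sketch of those any/all/per-reference semantics; the enum and function names here are hypothetical, not from the bees codebase:

```cpp
// Hypothetical filter semantics for one extent's set of references:
//  - ExcludeIfAny:  abandon the extent if any name matches an exclude
//                   rule, so extents are never partially deduplicated.
//  - ExcludeIfAll:  abandon the extent only if every name matches.
//  - PerReference:  drop matching references individually; remaining
//                   references may still be deduped (partial dedupe).
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

enum class RefFilterMode { ExcludeIfAny, ExcludeIfAll, PerReference };

// Returns the references that may still be deduped; an empty result
// means the whole extent is abandoned.
std::vector<std::string> filter_refs(
    const std::vector<std::string> &ref_paths,
    const std::function<bool(const std::string &)> &excluded,
    RefFilterMode mode)
{
    switch (mode) {
    case RefFilterMode::ExcludeIfAny:
        if (std::any_of(ref_paths.begin(), ref_paths.end(), excluded))
            return {};
        return ref_paths;
    case RefFilterMode::ExcludeIfAll:
        if (std::all_of(ref_paths.begin(), ref_paths.end(), excluded))
            return {};
        return ref_paths;
    case RefFilterMode::PerReference: {
        std::vector<std::string> keep;
        for (const auto &p : ref_paths)
            if (!excluded(p))
                keep.push_back(p);
        return keep;
    }
    }
    return {};
}
```

The extra name-resolution pass this implies is the cost traded against not deduping the ephemeral blocks at all.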
Side note: if the goal is to avoid deduping large amounts of ephemeral data, the fastest way to do that is to add a gap between the current filesystem transid and the highest scanned transid. If bees stays ~100 transactions behind the filesystem, it will only dedupe data that has been untouched for at least an hour. That filtering is done by the crawlers using tree search, before any of the later processing stages. The main drawback is that this scheme can't dedupe anything until it gets old enough.
Okay, I wasn't aware that you can get all the file names that have that extent reference. From the log messages it looked like each operation maps to a single filename.
Thank you for the exhaustive responses!
Is this a feature that can be enabled right now in bees?
I think the general goal of bees is to follow new transactions as fast as possible to make use of a still-warm cache. So it's probably not useful for all workflows, but it could be a useful option.
It has yet to be implemented, but it amounts to a config option and a "+ config_value" in the code.
I see two common cases in the field:

1. bees keeps up with new data easily, and is idle much of the time.
2. bees can't keep up with new data at all, and runs far behind the current transid.

The transaction delay is useful in cases where there's enough writing to get out of case 1 (i.e. still keeping up, but not "easily"), but not enough writing to get into case 2 (i.e. where bees no longer has a choice, and has the transaction delay forced on it). It's also useful for non-performance use cases, like reducing SSD wear.
As usual for btrfs, it's tradeoffs all the way down. The further behind transid_max is, the more metadata the crawlers have to filter out (the btrfs search range filter only filters out old transactions, not new ones), so the slower scanning gets when it's not completely up to date. On the other hand, the crawlers are very fast at scanning, so keeping a few thousand transactions behind only burns a CPU second or two per cycle. Reads are cheap (only a single-digit percentage of dedupe time), and you need a lot of RAM to be able to hold hundreds of transactions of write data and have that data size be large enough that you can't simply read it back from disk in a few seconds.

Most of the places where a warm cache is useful are in the dedupe of a single extent--after that, the cache is not needed, and one day bees could explicitly drop it. Dedupes are the most expensive thing, so skipping dedupes we don't need helps more than anything else.
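A minimal sketch of the transid-lag mechanism under discussion, assuming a single crawler cursor; the names (CrawlWindow, next_crawl_window, config_lag) are hypothetical, not bees's actual implementation:

```cpp
// Sketch: clamp the crawler's scan ceiling to lag N transactions behind
// the live filesystem, so only data untouched that long gets scanned.
#include <algorithm>
#include <cstdint>

struct CrawlWindow {
    uint64_t min_transid;  // scan extents newer than this...
    uint64_t max_transid;  // ...and no newer than this
};

CrawlWindow next_crawl_window(uint64_t fs_transid,    // current filesystem transid
                              uint64_t last_scanned,  // crawler position
                              uint64_t config_lag)    // e.g. 100
{
    const uint64_t ceiling =
        fs_transid > config_lag ? fs_transid - config_lag : 0;
    // Never move the window backwards; if the ceiling hasn't passed the
    // crawler position yet, there is simply nothing new to scan.
    return CrawlWindow{last_scanned, std::max(ceiling, last_scanned)};
}
```

With config_lag = 0 this degenerates to the current behavior of chasing transid_max as closely as possible.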
Oof... while implementing this, I recalled that inodes can have multiple file names--a fact that the entire rest of bees can gleefully ignore. So we can hit the any/all question (and either pay the cost to answer it, or settle for a leaky filter) even with a single reference to a single extent, if the reference happens to be in a file with hardlinks.
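Continuing the earlier hypothetical ino_paths() sketch, the hardlink case forces a per-inode decision like this (again, illustrative names only):

```cpp
// Sketch: even one extent reference can map to several names if the
// file is hardlinked, so the exclude decision is any-vs-all per inode.
#include <linux/types.h>
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// From the earlier ino_paths() sketch.
std::vector<std::string> ino_paths(int subvol_fd, __u64 inum);

bool inode_excluded(int subvol_fd, __u64 inum,
                    const std::function<bool(const std::string &)> &matches,
                    bool exclude_if_any_name)
{
    const std::vector<std::string> names = ino_paths(subvol_fd, inum);
    if (names.empty())
        return false;  // deleted or nameless: nothing for a filter to see
    return exclude_if_any_name
        ? std::any_of(names.begin(), names.end(), matches)
        : std::all_of(names.begin(), names.end(), matches);
}
```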
That's probably what I meant above, because we previously had the discussion that you cannot easily pinpoint an operation to a single filename. ;-)
It's neither easy nor cheap, but it does seem to be necessary. Filtering will likely remain necessary and useful even after all other known issues are resolved, so I'll have to find a way to do it that doesn't suck.
I know that bees is extent-based and not file-based, but judging by the log messages, it seems to be able to identify which files it's taking pieces out of.
As such, I'd like to make a feature request for ignoring certain files based on a glob or regex pattern.
Use case: some workflows, especially on Qubes OS, make use of significant numbers of large temporary files that are almost-exact clones of existing VM images. The problem is that they are temporary (in the case of DispVMs) and get discarded after VM shutdown - running bees over those images (when there are changes to the image) causes unnecessary write cycles.
As such, it'd be nice if bees could exclude files, e.g. by glob pattern (for Qubes it would be *dirty.img).
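A pattern like that could plausibly be tested with POSIX fnmatch(3); a minimal sketch, where the function name and the choice to match basenames are assumptions:

```cpp
// Sketch: glob-style exclude test using fnmatch(3). Matching is done on
// the basename, so "*dirty.img" excludes dirty.img anywhere in the tree.
#include <fnmatch.h>
#include <string>

bool name_matches_glob(const std::string &path, const std::string &pattern)
{
    const std::string::size_type slash = path.rfind('/');
    const std::string base =
        (slash == std::string::npos) ? path : path.substr(slash + 1);
    return fnmatch(pattern.c_str(), base.c_str(), 0) == 0;
}
```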