Concretization Cache: Refactor + manifest testing #49589
Conversation
@spackbot fix style
Let me see if I can fix that for you!
I was able to run spack style --fix:
==> Running style checks on spack
selected: import, isort, black, flake8, mypy
==> Modified files
lib/spack/llnl/util/filesystem.py
lib/spack/spack/solver/asp.py
lib/spack/spack/test/concretization/core.py
lib/spack/spack/test/conftest.py
==> Running import checks
import checks were clean
==> Running isort checks
isort checks were clean
==> Running black checks
reformatted lib/spack/spack/test/concretization/core.py
reformatted lib/spack/spack/solver/asp.py
All done! ✨ 🍰 ✨
2 files reformatted, 2 files left unchanged.
black checks were clean
==> Running flake8 checks
lib/spack/spack/test/concretization/core.py:5: [F401] 'io' imported but unused
flake8 found errors
==> Running mypy checks
Success: no issues found in 640 source files
mypy checks were clean
I've updated the branch with style fixes.
lib/spack/spack/solver/asp.py
Outdated
f.write(new_stats)
new.write(f"{entry_path.name} {entry_bytes}\n")
new.seek(0, io.SEEK_SET)
new_stats = f"{count} {cache_bytes}\n"
This in-place mutation does not work because line length is variable. If it increases, you overwrite the newline character and the file gets corrupted. I think you should either work with a binary file (struct.pack/struct.unpack), or rewrite the entire file (probably what you wanted to avoid).
Note that if the idea is to do a constant-time write operation (update header & append single entry), then FileCache is the wrong choice, since it copies the file to a temporary which you then mutate in-place. If you want to do things in-place, you could work with the lock file directly.
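A minimal sketch of the fixed-width binary header idea (names and layout are hypothetical, not part of this PR): the header is packed with struct into a fixed number of bytes, so it can be rewritten in place without disturbing the entries that follow, and it includes a format-version field, which also covers the versioning suggestion below.

```python
import struct

# Hypothetical fixed-width manifest header: format version, entry count, total bytes.
# "<HQQ" = little-endian uint16 + uint64 + uint64, i.e. 18 bytes regardless of values.
HEADER_FORMAT = "<HQQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
MANIFEST_VERSION = 1


def write_header(f, count: int, cache_bytes: int) -> None:
    """Overwrite the fixed-size header at the start of an open binary file."""
    f.seek(0)
    f.write(struct.pack(HEADER_FORMAT, MANIFEST_VERSION, count, cache_bytes))


def read_header(f) -> tuple:
    """Return (version, count, cache_bytes) from the start of an open binary file."""
    f.seek(0)
    return struct.unpack(HEADER_FORMAT, f.read(HEADER_SIZE))
```

Because the header is always exactly HEADER_SIZE bytes, rewriting it can never clobber the entry lines appended after it.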
Also what's missing is some version field in the header to describe the metadata file format, which is always helpful in case we make a forward incompatible change.
what's missing is some version field in the header to describe the metadata file format
Good idea, adding now.
This in-place mutation does not work because line length is variable.
🤦
Note that if the idea is to do a constant time write operation (update header & append single entry), then FileCache is the wrong choice, since it copies the file to a temporary which you then mutate in-place.
Good point, I didn't consider this when I refactored to use the FileCache; I may as well just avoid the in-place write. That said, the outcome here depends on our overall decision on whether or not to actually use the FileCache.
Either way, I agree, it should all be binary.
lib/spack/spack/solver/asp.py
Outdated
def flush_manifest(self):
    """Updates the concretization cache manifest file after a cache write operation
    Updates the current byte count and entry counts and writes to the head of the
    manifest file"""
    manifest_file = self.root / self._cache_manifest
    manifest_file.touch(exist_ok=True)
    with open(manifest_file, "r+", encoding="utf-8") as f:
    with self._fc.write_transaction(self._cache_manifest) as (old, new):
write_transaction opens a file in text mode, which makes seek calls questionable. It does not accept byte nor character offsets, but rather whatever tell() returns, which is documented as an "opaque number" (i.e. you shouldn't pass numbers yourself).
So, if you continue to use FileCache, you should probably extend the API to accept a file mode and read/write in binary. But given the other review comment, maybe FileCache isn't the best choice here given that it does not allow constant-time updates to the file.
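A small, self-contained demonstration of the point (not Spack code): in binary mode, seek offsets are plain bytes, so a fixed-width header can be rewritten in place, which is exactly what text mode does not guarantee.

```python
import io
import os
import tempfile

# Demonstration only: binary-mode seek()/tell() use plain byte offsets, so a
# fixed-width header at the start of a file can be rewritten without touching
# the entries that follow it.
fd, path = tempfile.mkstemp()
os.close(fd)

with open(path, "wb") as f:
    f.write(b"0000000000\n")   # 10-byte placeholder header plus newline
    f.write(b"entry-a 123\n")  # first manifest entry

with open(path, "r+b") as f:
    f.seek(0, io.SEEK_SET)     # byte offset 0 is unambiguous in binary mode
    f.write(b"0000000042")     # overwrite exactly 10 bytes; entries stay intact

os.remove(path)
```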
So, if you continue to use FileCache, you should probably extend the API to accept a file mode and read/write in binary. But given the other review comment, maybe FileCache isn't the best choice here given that it does not allow constant time updates to the file.
I think at least you and I are in the binary-only camp; the question just remains whether or not to use FileCache.
lib/spack/spack/solver/asp.py
Outdated
    setup, specs, reuse=reusable_specs, output=output, allow_deprecated=allow_deprecated
)
CONC_CACHE.flush_manifest()
Shouldn't flush be done on every write? Then you don't need to manually sync and deal with exceptions that are raised between writing the concretizer output to the cache and updating the stats.
Syncing slack thread here for context for other reviewers:
This was done as an effort to future-proof the concretization cache against a large number of concurrent writes causing serialized cache result returns once we add the capability to have a remote concretization cache. The concern is present locally to a smaller degree when running multiple Spack processes concretizing large environments with unify: false, particularly with many small packages, but the primary concern is for remote caches. Particularly for smaller specs, having to wait long periods for write access to the manifest could limit the concretization speedup the cache provides.
Follow-up comments from @haampie suggest the use of a ccache-style model where we lock a "bucket" (or cache entry subdir) and track the stats at that scope, significantly limiting the occurrence of concurrent writes causing a bottleneck. The same overarching cache statistics we currently track could be computed from each bucket via a method as part of the conc cache API (a rough sketch of the aggregation idea is below).
I think this is the best path forward; currently implementing and testing.
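A rough sketch of the aggregation idea (file layout and names are hypothetical, not what this PR implements): each bucket directory keeps its own small stats file, and cache-wide totals are computed by summing over buckets, so writers only ever contend on their own bucket.

```python
import json
from pathlib import Path


def bucket_stats(bucket_dir: Path) -> dict:
    """Read one bucket's stats file; hypothetical layout: {"entries": N, "bytes": M}."""
    stats_file = bucket_dir / "stats.json"
    if not stats_file.exists():
        return {"entries": 0, "bytes": 0}
    return json.loads(stats_file.read_text(encoding="utf-8"))


def cache_wide_stats(cache_root: Path) -> dict:
    """Aggregate per-bucket stats into the overall counts the cache currently tracks."""
    totals = {"entries": 0, "bytes": 0}
    for bucket in (d for d in cache_root.iterdir() if d.is_dir()):
        stats = bucket_stats(bucket)
        totals["entries"] += stats["entries"]
        totals["bytes"] += stats["bytes"]
    return totals
```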
lib/spack/spack/solver/asp.py
Outdated
@@ -824,28 +834,56 @@ def _lock_prefix_from_cache_path(self, cache_path: str):
        spack.util.hash.b32_hash(cache_path), spack.util.crypto.bit_length(sys.maxsize)
    )

    def _zip_cache_entry(self, cache_data: str):
        """Compress cache entries via gzip"""
        return gzip.compress(cache_data.encode("utf-8"))
Probably a good idea to specify the compresslevel=; the default is gzip._COMPRESS_LEVEL_BEST == 9, which is very slow for little added value. Level 6 is better, see the docstrings of spack.util.archive.gzip_compressed_tarfile.
Sounds good, thanks!
Going off my inbox, I see this comment:
You should use mtime=0, otherwise you get a different gzip header on every call and 0 cache hits.
I don't think that has an implication for cache hits since we go off the filename, but it is still best practice?
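A minimal sketch combining both suggestions (compresslevel=6 and a fixed mtime), assuming Python 3.8+ for the mtime keyword on gzip.compress; the function name mirrors the one in the diff but this is only an illustration:

```python
import gzip


def _zip_cache_entry(cache_data: str) -> bytes:
    # compresslevel=6 is much faster than the default 9 for little size difference;
    # mtime=0 keeps the gzip header reproducible across calls (Python 3.8+ keyword).
    return gzip.compress(cache_data.encode("utf-8"), compresslevel=6, mtime=0)
```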
Initially I did not see you checksummed the contents.
Note that if you're planning to distribute these files, you might need to do sha256sum(gzip(contents)) instead of sha256sum(contents), otherwise checksum validation requires decompression, which we try to avoid everywhere else.
Ah ok, that makes sense, thanks for the follow-up.
I like sha256sum(gzip(contents)); it's a good idea, will refactor. I was previously unaware of the convention (but also it's common sense).
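For illustration, a small sketch of hashing the compressed bytes rather than the raw contents, so checksum validation never requires decompression (the function name is illustrative, not from this PR):

```python
import gzip
import hashlib


def cache_entry_and_digest(cache_data: str) -> tuple:
    """Return (compressed_bytes, sha256_of_compressed_bytes)."""
    compressed = gzip.compress(cache_data.encode("utf-8"), compresslevel=6, mtime=0)
    digest = hashlib.sha256(compressed).hexdigest()  # checksum the gzipped payload
    return compressed, digest
```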
lib/spack/spack/solver/asp.py
Outdated
"""Compress cache entries via gzip""" | ||
return gzip.compress(cache_data.encode("utf-8")) | ||
|
||
def _unzip_cache_entry(self, cache_data: bytes): |
It's more common to wrap the file stream (e.g. gzip.open or GzipFile(...)) so decompression happens on read(...).
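A minimal sketch of the stream-wrapping approach, assuming the entry is a gzipped JSON file on disk (paths and names are illustrative):

```python
import gzip
import json


def load_cache_entry(entry_path: str) -> dict:
    # gzip.open wraps the file stream, so decompression happens during read()
    # instead of loading the whole compressed blob into memory first.
    with gzip.open(entry_path, mode="rt", encoding="utf-8") as f:
        return json.load(f)
```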
lib/spack/spack/solver/asp.py
Outdated
@@ -889,7 +927,8 @@ def store(self, problem: str, result: Result, statistics: List, test: bool = Fal
            tty.debug(f"Cache entry {cache_path} exists, will not be overwritten")
            return
        cache_dict = {"results": result.to_dict(test=test), "statistics": statistics}
        bytes_written = new.write(json.dumps(cache_dict))
        new_bytes = open(new.name, "wb")
        bytes_written = new_bytes.write(self._zip_cache_entry(json.dumps(cache_dict)))
I think it's incorrect to open the same file a second time.
Probably easier to avoid the entire FileCache and instead create a temporary file in the dir safely (e.g. tempfile.mkstemp) followed by an atomic rename. I think we have utilities for that.
We have llnl.util.filesystem.write_tmp_and_move() and a few others.
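For reference, a self-contained sketch of the mkstemp-plus-atomic-rename pattern being suggested (standard library only; not the Spack helper itself):

```python
import os
import tempfile


def atomic_write_bytes(dest_path: str, data: bytes) -> None:
    """Write data to a temp file in the destination directory, then rename atomically."""
    dest_dir = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, dest_path)  # atomic when temp and dest share a filesystem
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file on failure
        raise
```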
The idea here is to use the FileCache so I can drop in a Storage object with a similar API without having to rewrite too much.
I think it's incorrect to open the same file a second time.
I think it's definitely a smell to do so, no argument there, but as we're writing binary-encoded data to the entries, we need a binary writer. I for one, if we're locked into the FileCache approach, would like a binary file cache option.
Another alternative is adding a new API, similar to the FileCache but perhaps better suited to this type of operation; that way we avoid some of the kludginess of this implementation approach and Ryan can still drop something in easily.
Shouldn't be hard to add read/write_transaction(..., mode).
I don't really see the advantage of FileCache since it always copies the file that you're planning to update in-place. What's the reason you want to use FileCache?
I think I touched on this in our Slack conversation, and Ryan stated it above, but pretty much the only reason to use it is so that we can have a generic API that allows us to refer to either a local (managed by FileCache) or remote (managed by some remote file cache class with a compatible API) cache location. It's also why the config entry is confusingly (for now) url, which is something I believe you made a point about in the other PR.
read/write_transaction(..., mode)
Yes, it's very easy. That said, constant-time file updates with atomic writes would be nice, which isn't currently possible in the FileCache; maybe a refactor is in order.
Never mind, I'm abandoning the manifest altogether in favor of a cleanup-time computation of the cache contents; that will make things easier for Ryan and remove a good deal of the issues with this cache.
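A rough sketch of what a cleanup-time computation of the cache contents could look like (purely illustrative, not the implementation in this PR): walk the cache directory and tally entry counts and sizes on demand instead of maintaining a manifest.

```python
import os


def compute_cache_stats(cache_root: str) -> tuple:
    """Walk the cache directory and return (entry_count, total_bytes) on demand."""
    entry_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(cache_root):
        for name in filenames:
            entry_count += 1
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return entry_count, total_bytes
```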
hm okay, filecache is mostly for locking, that's not very relevant for remotes
I am working on a follow-up that is a drop-in replacement for the FileCache API. It is meant primarily to make adding remote caching easier later.
The web_util APIs are rather more limited, rely on global variables, and are heavily tied to mirrors. Rather than a more significant refactor of that, I am working on a more generic interface for dealing with storage devices.
FileCache is just a placeholder for now. Writing something else that is more configurable/better suited for this is probably not a bad idea.
lib/spack/spack/solver/asp.py
Outdated
def _unzip_cache_entry(self, cache_data: bytes):
    """Decompresses cache entries, which are gzipped by default"""
    return gzip.decompress(cache_data).decode("utf-8")
This errors for existing uncompressed cache entries.
Yes, I need to write a handler/migration for that.
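One possible shape for such a handler, as a hedged sketch (not this PR's code): try gzip first and fall back to treating the entry as a legacy uncompressed UTF-8 payload.

```python
import gzip


def _unzip_cache_entry(cache_data: bytes) -> str:
    """Decompress a cache entry, tolerating legacy uncompressed entries."""
    try:
        return gzip.decompress(cache_data).decode("utf-8")
    except gzip.BadGzipFile:
        # Entry predates compression: assume it is plain UTF-8 text.
        return cache_data.decode("utf-8")
```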
While we are fixing this, can we also add a default for it in $ spack config blame config? Also, was it discussed whether this belongs to …
It was, but there has since been disagreement on that. I think config makes sense, as that is where we configure the other misc caches, but I don't have an extremely strong opinion beyond that.
I think it would be nice if it were in …
I'll take this as consensus.
@spackbot fix style
Let me see if I can fix that for you!
I was able to run spack style --fix:
==> Running style checks on spack
selected: import, isort, black, flake8, mypy
==> Modified files
lib/spack/llnl/util/filesystem.py
lib/spack/spack/solver/asp.py
lib/spack/spack/test/concretization/core.py
lib/spack/spack/test/conftest.py
==> Running import checks
import checks were clean
==> Running isort checks
isort checks were clean
==> Running black checks
All done! ✨ 🍰 ✨
4 files left unchanged.
black checks were clean
==> Running flake8 checks
flake8 checks were clean
==> Running mypy checks
Success: no issues found in 640 source files
mypy checks were clean
==> spack style checks were clean
I wasn't able to make any further changes, but please see the message above for remaining issues you can fix locally!
#48198 introduced the concretization cache and associated manifest; however, it only added testing for the cache itself, not the manifest or cleanup procedures.
This PR: