Concretization Cache: Refactor + manifest testing #49589
Conversation
@spackbot fix style
Let me see if I can fix that for you!
I was able to run spack style --fix:
==> Running style checks on spack
selected: import, isort, black, flake8, mypy
==> Modified files
lib/spack/llnl/util/filesystem.py
lib/spack/spack/solver/asp.py
lib/spack/spack/test/concretization/core.py
lib/spack/spack/test/conftest.py
==> Running import checks
import checks were clean
==> Running isort checks
isort checks were clean
==> Running black checks
reformatted lib/spack/spack/test/concretization/core.py
reformatted lib/spack/spack/solver/asp.py
All done! ✨ 🍰 ✨
2 files reformatted, 2 files left unchanged.
black checks were clean
==> Running flake8 checks
lib/spack/spack/test/concretization/core.py:5: [F401] 'io' imported but unused
flake8 found errors
==> Running mypy checks
Success: no issues found in 640 source files
mypy checks were clean
I've updated the branch with style fixes.
lib/spack/spack/solver/asp.py
Outdated
f.write(new_stats)
new.write(f"{entry_path.name} {entry_bytes}\n")
new.seek(0, io.SEEK_SET)
new_stats = f"{count} {cache_bytes}\n"
This in-place mutation does not work because line length is variable. If it increases, you overwrite the newline character and the file gets corrupted. I think you should either work with a binary file (struct.pack/struct.unpack), or rewrite the entire file (probably what you wanted to avoid).
Note that if the idea is to do a constant-time write operation (update header & append single entry), then FileCache is the wrong choice, since it copies the file to a temporary which you then mutate in-place. If you want to do things in-place, you could work with the lock file directly.
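A minimal sketch of the fixed-width binary header idea (names and layout are hypothetical, not part of this PR): the header is packed with struct into a fixed number of bytes, so it can be rewritten in place without disturbing the entries that follow, and it includes a format-version field, which also covers the versioning suggestion below.

```python
import struct

# Hypothetical fixed-width manifest header: format version, entry count, total bytes.
# "<HQQ" = little-endian uint16 + uint64 + uint64, i.e. 18 bytes regardless of values.
HEADER_FORMAT = "<HQQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
MANIFEST_VERSION = 1


def write_header(f, count: int, cache_bytes: int) -> None:
    """Overwrite the fixed-size header at the start of an open binary file."""
    f.seek(0)
    f.write(struct.pack(HEADER_FORMAT, MANIFEST_VERSION, count, cache_bytes))


def read_header(f) -> tuple:
    """Return (version, count, cache_bytes) from the start of an open binary file."""
    f.seek(0)
    return struct.unpack(HEADER_FORMAT, f.read(HEADER_SIZE))
```

Because the header is always exactly HEADER_SIZE bytes, rewriting it can never clobber the entry lines appended after it.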
Also what's missing is some version field in the header to describe the metadata file format, which is always helpful in case we make a forward incompatible change.
what's missing is some version field in the header to describe the metadata file format
Good idea, adding now.
This in-place mutation does not work because line length is variable.
🤦
Note that if the idea is to do a constant time write operation (update header & append single entry), then FileCache is the wrong choice, since it copies the file to a temporary which you then mutate in-place.
Good point, I didn't consider this when I refactored to use the FileCache; I may as well just avoid the in-place write. That said, the outcome here depends on our overall decision on whether or not to actually use the FileCache.
Either way, I agree, it should all be binary.
lib/spack/spack/solver/asp.py
Outdated
def flush_manifest(self):
    """Updates the concretization cache manifest file after a cache write operation
    Updates the current byte count and entry counts and writes to the head of the
    manifest file"""
    manifest_file = self.root / self._cache_manifest
    manifest_file.touch(exist_ok=True)
    with open(manifest_file, "r+", encoding="utf-8") as f:
    with self._fc.write_transaction(self._cache_manifest) as (old, new):
write_transaction opens a file in text mode, which makes seek calls questionable. It does not accept byte nor character offsets, but rather whatever tell() returns, which is documented as an "opaque number" (i.e. you shouldn't pass numbers yourself).
So, if you continue to use FileCache, you should probably extend the API to accept a file mode and read/write in binary. But given the other review comment, maybe FileCache isn't the best choice here given that it does not allow constant-time updates to the file.
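A small, self-contained demonstration of the point (not Spack code): in binary mode, seek offsets are plain bytes, so a fixed-width header can be rewritten in place, which is exactly what text mode does not guarantee.

```python
import io
import os
import tempfile

# Demonstration only: binary-mode seek()/tell() use plain byte offsets, so a
# fixed-width header at the start of a file can be rewritten without touching
# the entries that follow it.
fd, path = tempfile.mkstemp()
os.close(fd)

with open(path, "wb") as f:
    f.write(b"0000000000\n")   # 10-byte placeholder header plus newline
    f.write(b"entry-a 123\n")  # first manifest entry

with open(path, "r+b") as f:
    f.seek(0, io.SEEK_SET)     # byte offset 0 is unambiguous in binary mode
    f.write(b"0000000042")     # overwrite exactly 10 bytes; entries stay intact

os.remove(path)
```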
So, if you continue to use FileCache, you should probably extend the API to accept a file mode and read/write in binary. But given the other review comment, maybe FileCache isn't the best choice here given that it does not allow constant time updates to the file.
I think at least you and I are in the binary-only camp; the question just remains whether or not to use FileCache.
lib/spack/spack/solver/asp.py
Outdated
    setup, specs, reuse=reusable_specs, output=output, allow_deprecated=allow_deprecated
)
CONC_CACHE.flush_manifest()
Shouldn't flush be done on every write? Then you don't need to manually sync and deal with exceptions that are raised between writing the concretizer output to the cache and updating the stats.
Syncing slack thread here for context for other reviewers:
This was done as an effort to future-proof the concretization cache against a large number of concurrent writes causing serialized cache result returns once we add the capability to have a remote concretization cache. The concern is present locally to a smaller degree when running multiple Spack processes concretizing large environments with unify: false, particularly with many small packages, but the primary concern is for remote caches. Particularly for smaller specs, having to wait long periods for write access to the manifest could limit the concretization speedup the cache provides.
Follow-up comments from @haampie suggest the use of a ccache-style model where we lock a "bucket" (or cache entry subdir) and track the stats at that scope, significantly limiting the occurrence of concurrent writes causing a bottleneck. The same overarching cache statistics we currently track could be computed from each bucket via a method as part of the conc cache API (a rough sketch of the aggregation idea is below).
I think this is the best path forward; currently implementing and testing.
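A rough sketch of the aggregation idea (file layout and names are hypothetical, not what this PR implements): each bucket directory keeps its own small stats file, and cache-wide totals are computed by summing over buckets, so writers only ever contend on their own bucket.

```python
import json
from pathlib import Path


def bucket_stats(bucket_dir: Path) -> dict:
    """Read one bucket's stats file; hypothetical layout: {"entries": N, "bytes": M}."""
    stats_file = bucket_dir / "stats.json"
    if not stats_file.exists():
        return {"entries": 0, "bytes": 0}
    return json.loads(stats_file.read_text(encoding="utf-8"))


def cache_wide_stats(cache_root: Path) -> dict:
    """Aggregate per-bucket stats into the overall counts the cache currently tracks."""
    totals = {"entries": 0, "bytes": 0}
    for bucket in (d for d in cache_root.iterdir() if d.is_dir()):
        stats = bucket_stats(bucket)
        totals["entries"] += stats["entries"]
        totals["bytes"] += stats["bytes"]
    return totals
```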
lib/spack/spack/solver/asp.py
Outdated
@@ -824,28 +834,56 @@ def _lock_prefix_from_cache_path(self, cache_path: str):
        spack.util.hash.b32_hash(cache_path), spack.util.crypto.bit_length(sys.maxsize)
    )

    def _zip_cache_entry(self, cache_data: str):
        """Compress cache entries via gzip"""
        return gzip.compress(cache_data.encode("utf-8"))
Probably a good idea to specify the compresslevel=; the default is gzip._COMPRESS_LEVEL_BEST == 9, which is very slow for little added value. Level 6 is better, see the docstrings of spack.util.archive.gzip_compressed_tarfile.
Sounds good, thanks!
Going off my inbox, I see this comment:
You should use mtime=0, otherwise you get a different gzip header on every call and 0 cache hits.
I don't think that has an implication for cache hits since we go off the filename, but it is still best practice?
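A minimal sketch combining both suggestions (compresslevel=6 and a fixed mtime), assuming Python 3.8+ for the mtime keyword on gzip.compress; the function name mirrors the one in the diff but this is only an illustration:

```python
import gzip


def _zip_cache_entry(cache_data: str) -> bytes:
    # compresslevel=6 is much faster than the default 9 for little size difference;
    # mtime=0 keeps the gzip header reproducible across calls (Python 3.8+ keyword).
    return gzip.compress(cache_data.encode("utf-8"), compresslevel=6, mtime=0)
```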
Initially I did not see you checksummed the contents.
Note that if you're planning to distribute these files, you might need to do sha256sum(gzip(contents)) instead of sha256sum(contents), otherwise checksum validation requires decompression, which we try to avoid everywhere else.
Ah ok, that makes sense, thanks for the follow-up.
I like sha256sum(gzip(contents)); it's a good idea, will refactor. I was previously unaware of the convention (but also it's common sense).
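For illustration, a small sketch of hashing the compressed bytes rather than the raw contents, so checksum validation never requires decompression (the function name is illustrative, not from this PR):

```python
import gzip
import hashlib


def cache_entry_and_digest(cache_data: str) -> tuple:
    """Return (compressed_bytes, sha256_of_compressed_bytes)."""
    compressed = gzip.compress(cache_data.encode("utf-8"), compresslevel=6, mtime=0)
    digest = hashlib.sha256(compressed).hexdigest()  # checksum the gzipped payload
    return compressed, digest
```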
lib/spack/spack/solver/asp.py
Outdated
"""Compress cache entries via gzip""" | ||
return gzip.compress(cache_data.encode("utf-8")) | ||
|
||
def _unzip_cache_entry(self, cache_data: bytes): |
It's more common to wrap the file stream (e.g. gzip.open or GzipFile(...)) so decompression happens on read(...).
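A minimal sketch of the stream-wrapping approach, assuming the entry is a gzipped JSON file on disk (paths and names are illustrative):

```python
import gzip
import json


def load_cache_entry(entry_path: str) -> dict:
    # gzip.open wraps the file stream, so decompression happens during read()
    # instead of loading the whole compressed blob into memory first.
    with gzip.open(entry_path, mode="rt", encoding="utf-8") as f:
        return json.load(f)
```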
lib/spack/spack/solver/asp.py
Outdated
@@ -889,7 +927,8 @@ def store(self, problem: str, result: Result, statistics: List, test: bool = Fal
            tty.debug(f"Cache entry {cache_path} exists, will not be overwritten")
            return
        cache_dict = {"results": result.to_dict(test=test), "statistics": statistics}
        bytes_written = new.write(json.dumps(cache_dict))
        new_bytes = open(new.name, "wb")
        bytes_written = new_bytes.write(self._zip_cache_entry(json.dumps(cache_dict)))
I think it's incorrect to open the same file a second time.
Probably easier to avoid the entire FileCache and instead create a temporary file in the dir safely (e.g. tempfile.mkstemp) followed by an atomic rename. I think we have utilities for that.
We have llnl.util.filesystem.write_tmp_and_move() and a few others.
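For reference, a self-contained sketch of the mkstemp-plus-atomic-rename pattern being suggested (standard library only; not the Spack helper itself):

```python
import os
import tempfile


def atomic_write_bytes(dest_path: str, data: bytes) -> None:
    """Write data to a temp file in the destination directory, then rename atomically."""
    dest_dir = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, dest_path)  # atomic when temp and dest share a filesystem
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file on failure
        raise
```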
The idea here is to use the FileCache so I can drop in a Storage object with a similar API without having to rewrite too much.
I think it's incorrect to open the same file a second time.
I think it's definitely a smell to do so, no argument there, but as we're writing binary-encoded data to the entries, we need a binary writer. I for one, if we're locked into the FileCache approach, would like a binary file cache option.
Another alternative is adding a new API, similar to the FileCache but perhaps better suited to this type of operation; that way we avoid some of the kludginess of this implementation approach and Ryan can still drop something in easily.
Shouldn't be hard to add read/write_transaction(..., mode).
I don't really see the advantage of FileCache since it always copies the file that you're planning to update in-place. What's the reason you want to use FileCache?
I think I touched on this in our Slack conversation, and Ryan stated it above, but pretty much the only reason to use it is so that we can have a generic API that allows us to refer to either a local (managed by FileCache) or remote (managed by some remote file cache class with a compatible API) cache location. It's also why the config entry is confusingly (for now) url, which is something I believe you made a point about in the other PR.
read/write_transaction(..., mode)
Yes, it's very easy. That said, constant-time file updates with atomic writes would be nice, which isn't currently possible in the FileCache; maybe a refactor is in order.
Never mind, I'm abandoning the manifest altogether in favor of a cleanup-time computation of the cache contents; that will make things easier for Ryan and remove a good deal of the issues with this cache.
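A rough sketch of what a cleanup-time computation of the cache contents could look like (purely illustrative, not the implementation in this PR): walk the cache directory and tally entry counts and sizes on demand instead of maintaining a manifest.

```python
import os


def compute_cache_stats(cache_root: str) -> tuple:
    """Walk the cache directory and return (entry_count, total_bytes) on demand."""
    entry_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(cache_root):
        for name in filenames:
            entry_count += 1
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return entry_count, total_bytes
```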
hm okay, filecache is mostly for locking, that's not very relevant for remotes
I am working on a follow-up that is a drop-in replacement for the FileCache API. It is meant primarily to make adding remote caching easier later.
The web_util APIs are rather more limited, rely on global variables, and are heavily tied to mirrors. Rather than a more significant refactor of that, I am working on a more generic interface for dealing with storage devices.
FileCache is just a placeholder for now. Writing something else that is more configurable/better suited for this is probably not a bad idea.
lib/spack/spack/solver/asp.py
Outdated
def _unzip_cache_entry(self, cache_data: bytes):
    """Decompresses cache entries, which are gzipped by default"""
    return gzip.decompress(cache_data).decode("utf-8")
This errors for existing uncompressed cache entries.
Yes, I need to write a handler/migration for that.
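One possible shape for such a handler, as a hedged sketch (not this PR's code): try gzip first and fall back to treating the entry as a legacy uncompressed UTF-8 payload.

```python
import gzip


def _unzip_cache_entry(cache_data: bytes) -> str:
    """Decompress a cache entry, tolerating legacy uncompressed entries."""
    try:
        return gzip.decompress(cache_data).decode("utf-8")
    except gzip.BadGzipFile:
        # Entry predates compression: assume it is plain UTF-8 text.
        return cache_data.decode("utf-8")
```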
While we are fixing this, can we also add a default for it in $ spack config blame config? Also, was it discussed whether this belongs to …
It was, but there has since been disagreement on that. I think config makes sense, as that is where we configure the other misc caches, but I don't have an extremely strong opinion beyond that.
I think it would be nice if it were in …
I'll take this as consensus.
@spackbot fix style
Let me see if I can fix that for you!
I was able to run spack style --fix:
==> Running style checks on spack
selected: import, isort, black, flake8, mypy
==> Modified files
lib/spack/llnl/util/filesystem.py
lib/spack/spack/solver/asp.py
lib/spack/spack/test/concretization/core.py
lib/spack/spack/test/conftest.py
==> Running import checks
import checks were clean
==> Running isort checks
isort checks were clean
==> Running black checks
All done! ✨ 🍰 ✨
4 files left unchanged.
black checks were clean
==> Running flake8 checks
flake8 checks were clean
==> Running mypy checks
Success: no issues found in 640 source files
mypy checks were clean
==> spack style checks were clean
I wasn't able to make any further changes, but please see the message above for remaining issues you can fix locally!
#48198 introduced the concretization cache and associated manifest; however, it only added testing for the cache itself, not the manifest or cleanup procedures.
This PR: