-
-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: vis-encode all path characters except visible ASCII #985
base: main
Are you sure you want to change the base?
Conversation
This pair of utilities from Python's core are helpful when encoding or escaping strings. Unlike the common alternative — repeated application of `str.replace` — a `str.translate` implementation performs its work in a single pass. This isn't principally about efficiency — although a single-pass implementation may be more efficient — but rather about correctness. Doing all the translation in a single pass sidesteps the issue of double-encoding errors which are possible under repeated-processing schemes when when substitution input/output aliasing is present (i.e. some substitutions produce output that other substitutions recognize as to-be-replaced input). See chainguard-dev/rules_apko#30 for an concrete example of a double-encoding issue resulting from a repeated-processing translation implementation.
Use a code-generated table to avoid having inscrutable magic constants in the codebase.
Bazel's Starlark does not provide access to a string's bytes, only its codepoints, so we are unable to do this escaping in Starlark. So a second pass is needed, at least until the spec and implementation work to get a [`bytes` type](bazelbuild/starlark#112) lands. Fixes bazel-contrib#794
The `mtree` passed to the `tar` rule is user-provided. Unlike the other streams participating in this comparison, we cannot completely trust that the encoding it uses is the same as the form this ruleset would produce. There's a bit of play to how libarchive interprets these path specs. To be correct, the comparison we're making requires that all equivalent forms be massaged to a single representation before performing the checks.
Bazel expects file paths to be written verbatim in the unused inputs file; it does not understand any escaping or encoding scheme. Because we're not post-filtering depsets written out from Bazel, we're no longer able to produce a 2-column format as easily: sed performs replacements in all columns. That's fine for the intended-to-be-vis-encoded column 1, but incorrect for the intended-to-be-verbatim-content column 2. Instead, we abandon the 2-column approach and instead work in a single column, that is vis-encoded when performing comparisons, and then decoded to write back out to Bazel.
This demonstrates support for non-ASCII characters in archive filenames. Generating an archive with non-ASCII characters reports a diagnostic, however there does not appear to be any issue — the file is placed into the archive with the expected path and contents. ``` INFO: From Tar lib/tests/tar/7.tar: tar: lib/tests/tar/srcdir/Unicode® support?🤞: Can't translate pathname 'lib/tests/tar/srcdir/Unicode® support?🤞' to UTF-8 ```
Now that alternate and redundant escape sequences are canonicalized-away, we no longer need to be concerned about externally-defined mtree files using these and falsely mismatching with our chosen encoding scheme.
FYI, this patch is fairly substantive, but I believe there are separable elements. I've filed it as one PR because this is my desired end-state. The |
Otherwise we may observe platform-dependent behaviour.
4b7b5e2
to
0a4bba6
Compare
Listing creation on Ubuntu is failing with ``` tar: Pathname can't be converted from UTF-8 to current locale. ``` Using a locale with UTF-8 support should help.
e7ea9a3
to
41b8d94
Compare
The POSIX and C.UTF-8 locales seem to work different between my Mac OS dev machine and the Ubuntu CI machines. Rather than continuing to fight Ubuntu to convince it to emit escapes like Mac OS is doing, attempt the opposite and try to coerce Mac OS to behave like Ubuntu and not perform the escaping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thank you for taking this on. At a high level, the approach seems sound to me.
@@ -0,0 +1,171 @@ | |||
// Code generator for vis-encoding support scripts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit scared of having this volume of new code for the existing maintainers to understand and bugfix.
https://github.com/vbatts/go-mtree/tree/main/pkg/govis also has a Go implementation of the vis(1)
BSD built-in. And @vbatts has indicated willingness to help out with our Bazel tar implementation. Do you think there's any code-sharing possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit scared of having this volume of new code for the existing maintainers to understand and bugfix.
That's fair. I recognize that this is a fairly big change set I am proposing. I'm happy to search for ways to reduce the burden. Maybe you could provide some insight into the areas you find most concerning and where efforts would do the most good? As the author, I'm a bit blinded by the curse of knowledge, so I'm not entirely sure where my energies would best be spent.
Is this more about the 171 SLOC in the gen_vis_scripts
tool? Or about the 1069 SLOC of the entire PR? Assuming the latter, I have some ideas on how to get that down, although I'm not so sure they are improvements.
For instance, the processing of 7-bit ASCII within Bazel/Starlark requires separate support code — the 103 SLOC addition to strings.bzl
— that could be dropped by moving all escaping work out to the sed
pass. Although I think a str.maketrans
/str.translate
facility in Starlark would be independently nice to have. Maybe that should be separated though?
Another thing to get the PR line count down would be the checked-in generated .sed
scripts, totalling 588 SLOC (or ~630 SLOC if they take on the work of ASCII escaping too). Those could be codegenned in dev and published as artifacts in release, just like binary artifacts. That seemed like more work to set up and makes the scripts a bit less directly accessible, but its something we could do if the raw SLOC number is the concern.
Would additional testing make you more comfortable with this change? I considered adding some Bats-based testing of the individual sed
scripts but with PR size already a concern I wasn't sure whether that might be considered over-engineering.
https://github.com/vbatts/go-mtree/tree/main/pkg/govis also has a Go implementation of the
vis(1)
BSD built-in. And @vbatts has indicated willingness to help out with our Bazel tar implementation. Do you think there's any code-sharing possible?
If there is a way to achieve meaningful code size reduction through code sharing with govis
, I am not seeing it. It is an interesting project, but it seems to me to be just different enough that it doesn't quite suit what is called for here. FWICT, there are two root-cause differences that drive this: a difference in implementation approach, and a difference in reverse-engineering target.
govis
does its conversion as a single pass — it visits each character only once — which is a nice approach that I would recommend whenever it is available. Unfortunately, to my knowledge, no such facility is available in sed
(or almost1 any other POSIX utility); we only have repeated substitution available to us. IMO, that limitation is what drives the more complex behaviours in the generated sed
scripts — the careful deconfliction of backslash and NUL handling — which has no analogue in govis
.
govis
seems to have been reverse-engineered from vis
. The escaping behaviour implemented here has been reverse-engineered from libarchive, which has its own interpreter for escape sequences. These differ from one another ever so slightly, and also a bit from the mtree file format spec. Since the intent is to use mtree
files with bsdtar
, which uses libarchive, I felt that fidelity to the stable quirks of that implementation would be the least surprising and disruptive for callers.
govis
' choice to replicate vis
also up-scopes what it does from what we actually need. mtree
's encoding for paths isn't the fully general vis
encoding — only the octal escape sequences are part of the documented file format and libarchive tacks on a couple of additional, C-inspired escape sequences. Most govis
modes, faithfully reproduced from vis
, are not interesting to us in this context.
Footnotes
-
You might be able to do it in
awk
but (1) you'd be writing pretty much an entire parser rather than just a translation table, opportunities for code sharing would be even less, and (2)awk
has some quirks in its handling of unusual characters that make it not a great tool when we just want to process bytes. ↩
Only direct references to rules_go need to be isolated, not rules that merely transitively depend on that ruleset through references to other rules.
This allows archive input files to include Unicode characters.
It should also support the more esoteric corners of the ASCII set
(e.g. backslashes, newlines, control codes), if the user can find a way
to convince Bazel to use such a file. This latter feature probably isn't
a good idea to actually use; but it is nice to have confidence that this
ruleset won't choke if it comes up.
Generating an archive with non-ASCII characters reports a diagnostic,
however this does not appear to correspond to a real issue — the file
is still placed into the archive with the expected path and contents.
Encoding of 7-bit ASCII is performed in Bazel, however Starlark does not provide
access to a string's bytes, only its codepoints, so we are unable to do full 8-bit escaping
in Starlark. A second pass is needed, at least until the spec and implementation work
to get a
bytes
type lands.We use
sed
for this second pass. A prior attemptat this feature had done a similar external pass with the
vis
utility, but hasnot been accepted because the
vis
tool is not expected to be available everywhere.By contrast,
sed
is part of POSIX and is widely available.This solution relies on dev-time codegen of Starlark and Sed scripts;
to avoid forcing our
dev_dependency
rulesets (i.e.rules_go
) onusers, the code generator is maintained in a separate package.
We also take this opportunity to canonicalize the paths derived
from user-provided mtree specs when computing the unused
inputs set. This should ensure that the comparison is exact,
which substantially reduces the risks associated with this feature
and should enable it to become a default.
Fixes #794