Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: vis-encode all path characters except visible ASCII #985

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

plobsing
Copy link
Contributor

@plobsing plobsing commented Nov 22, 2024

This allows archive input files to include Unicode characters.
It should also support the more esoteric corners of the ASCII set
(e.g. backslashes, newlines, control codes), if the user can find a way
to convince Bazel to use such a file. This latter feature probably isn't
a good idea to actually use; but it is nice to have confidence that this
ruleset won't choke if it comes up.

Generating an archive with non-ASCII characters reports a diagnostic,
however this does not appear to correspond to a real issue — the file
is still placed into the archive with the expected path and contents.

INFO: From Tar lib/tests/tar/7.tar:
tar: lib/tests/tar/srcdir/Unicode® support?🤞: Can't translate pathname 'lib/tests/tar/srcdir/Unicode® support?🤞' to UTF-8

Encoding of 7-bit ASCII is performed in Bazel, however Starlark does not provide
access to a string's bytes, only its codepoints, so we are unable to do full 8-bit escaping
in Starlark. A second pass is needed, at least until the spec and implementation work
to get a bytes type lands.

We use sed for this second pass. A prior attempt
at this feature had done a similar external pass with the vis utility, but has
not been accepted because the vis tool is not expected to be available everywhere.
By contrast, sed is part of POSIX and is widely available.

This solution relies on dev-time codegen of Starlark and Sed scripts;
to avoid forcing our dev_dependency rulesets (i.e. rules_go) on
users, the code generator is maintained in a separate package.

We also take this opportunity to canonicalize the paths derived
from user-provided mtree specs when computing the unused
inputs set. This should ensure that the comparison is exact,
which substantially reduces the risks associated with this feature
and should enable it to become a default.

Fixes #794

This pair of utilities from Python's core are helpful when encoding or escaping strings.
Unlike the common alternative — repeated application of `str.replace` —
a `str.translate` implementation performs its work in a single pass.

This isn't principally about efficiency — although a single-pass
implementation may be more efficient — but rather about correctness.
Doing all the translation in a single pass
sidesteps the issue of double-encoding errors which are possible under
repeated-processing schemes when when substitution input/output aliasing
is present (i.e. some substitutions produce output that other
substitutions recognize as to-be-replaced input).
See chainguard-dev/rules_apko#30 for an concrete
example of a double-encoding issue resulting from a repeated-processing
translation implementation.
Use a code-generated table to avoid having inscrutable magic constants
in the codebase.
Bazel's Starlark does not provide access to a string's bytes, only its
codepoints, so we are unable to do this escaping in Starlark. So a
second pass is needed, at least until the spec and implementation work
to get a [`bytes` type](bazelbuild/starlark#112) lands.

Fixes bazel-contrib#794
The `mtree` passed to the `tar` rule is user-provided. Unlike the other
streams participating in this comparison, we cannot completely trust
that the encoding it uses is the same as the form this ruleset would
produce. There's a bit of play to how libarchive interprets these path
specs. To be correct, the comparison we're making requires that all
equivalent forms be massaged to a single representation before
performing the checks.
Bazel expects file paths to be written verbatim in the unused inputs
file; it does not understand any escaping or encoding scheme.

Because we're not post-filtering depsets written out from Bazel, we're
no longer able to produce a 2-column format as easily: sed performs
replacements in all columns. That's fine for the
intended-to-be-vis-encoded column 1, but incorrect for the
intended-to-be-verbatim-content column 2.

Instead, we abandon the 2-column approach and instead work in a single
column, that is vis-encoded when performing comparisons, and then
decoded to write back out to Bazel.
This demonstrates support for non-ASCII characters in archive filenames.

Generating an archive with non-ASCII characters reports a diagnostic,
however there does not appear to be any issue — the file is placed into
the archive with the expected path and contents.

```
INFO: From Tar lib/tests/tar/7.tar:
tar: lib/tests/tar/srcdir/Unicode® support?🤞: Can't translate pathname 'lib/tests/tar/srcdir/Unicode® support?🤞' to UTF-8
```
Now that alternate and redundant escape sequences are
canonicalized-away, we no longer need to be concerned about
externally-defined mtree files using these and falsely mismatching with
our chosen encoding scheme.
@plobsing
Copy link
Contributor Author

FYI, this patch is fairly substantive, but I believe there are separable elements. I've filed it as one PR because this is my desired end-state. The str.maketrans/str.translate feature could be its own PR if we wanted to expose it as a public API. And just the 7-bit encoding could be a separate, intermediate improvement. Please let me know if you want to land this in phases; either the ones I've outlined or some other partitioning scheme.

Otherwise we may observe platform-dependent behaviour.
@plobsing plobsing force-pushed the vis-encode-improvements branch from 4b7b5e2 to 0a4bba6 Compare November 22, 2024 18:31
Listing creation on Ubuntu is failing with

```
tar: Pathname can't be converted from UTF-8 to current locale.
```

Using a locale with UTF-8 support should help.
@plobsing plobsing force-pushed the vis-encode-improvements branch from e7ea9a3 to 41b8d94 Compare November 22, 2024 19:55
The POSIX and C.UTF-8 locales seem to work different between my Mac OS
dev machine and the Ubuntu CI machines. Rather than continuing to fight
Ubuntu to convince it to emit escapes like Mac OS is doing, attempt the
opposite and try to coerce Mac OS to behave like Ubuntu and not perform
the escaping.
Copy link
Collaborator

@alexeagle alexeagle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thank you for taking this on. At a high level, the approach seems sound to me.

lib/private/gen_vis_scripts/BUILD.bazel Outdated Show resolved Hide resolved
@@ -0,0 +1,171 @@
// Code generator for vis-encoding support scripts.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit scared of having this volume of new code for the existing maintainers to understand and bugfix.

https://github.com/vbatts/go-mtree/tree/main/pkg/govis also has a Go implementation of the vis(1) BSD built-in. And @vbatts has indicated willingness to help out with our Bazel tar implementation. Do you think there's any code-sharing possible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit scared of having this volume of new code for the existing maintainers to understand and bugfix.

That's fair. I recognize that this is a fairly big change set I am proposing. I'm happy to search for ways to reduce the burden. Maybe you could provide some insight into the areas you find most concerning and where efforts would do the most good? As the author, I'm a bit blinded by the curse of knowledge, so I'm not entirely sure where my energies would best be spent.

Is this more about the 171 SLOC in the gen_vis_scripts tool? Or about the 1069 SLOC of the entire PR? Assuming the latter, I have some ideas on how to get that down, although I'm not so sure they are improvements.

For instance, the processing of 7-bit ASCII within Bazel/Starlark requires separate support code — the 103 SLOC addition to strings.bzl — that could be dropped by moving all escaping work out to the sed pass. Although I think a str.maketrans/str.translate facility in Starlark would be independently nice to have. Maybe that should be separated though?

Another thing to get the PR line count down would be the checked-in generated .sed scripts, totalling 588 SLOC (or ~630 SLOC if they take on the work of ASCII escaping too). Those could be codegenned in dev and published as artifacts in release, just like binary artifacts. That seemed like more work to set up and makes the scripts a bit less directly accessible, but its something we could do if the raw SLOC number is the concern.

Would additional testing make you more comfortable with this change? I considered adding some Bats-based testing of the individual sed scripts but with PR size already a concern I wasn't sure whether that might be considered over-engineering.

https://github.com/vbatts/go-mtree/tree/main/pkg/govis also has a Go implementation of the vis(1) BSD built-in. And @vbatts has indicated willingness to help out with our Bazel tar implementation. Do you think there's any code-sharing possible?

If there is a way to achieve meaningful code size reduction through code sharing with govis, I am not seeing it. It is an interesting project, but it seems to me to be just different enough that it doesn't quite suit what is called for here. FWICT, there are two root-cause differences that drive this: a difference in implementation approach, and a difference in reverse-engineering target.

govis does its conversion as a single pass — it visits each character only once — which is a nice approach that I would recommend whenever it is available. Unfortunately, to my knowledge, no such facility is available in sed (or almost1 any other POSIX utility); we only have repeated substitution available to us. IMO, that limitation is what drives the more complex behaviours in the generated sed scripts — the careful deconfliction of backslash and NUL handling — which has no analogue in govis.

govis seems to have been reverse-engineered from vis. The escaping behaviour implemented here has been reverse-engineered from libarchive, which has its own interpreter for escape sequences. These differ from one another ever so slightly, and also a bit from the mtree file format spec. Since the intent is to use mtree files with bsdtar, which uses libarchive, I felt that fidelity to the stable quirks of that implementation would be the least surprising and disruptive for callers.

govis' choice to replicate vis also up-scopes what it does from what we actually need. mtree's encoding for paths isn't the fully general vis encoding — only the octal escape sequences are part of the documented file format and libarchive tacks on a couple of additional, C-inspired escape sequences. Most govis modes, faithfully reproduced from vis, are not interesting to us in this context.

Footnotes

  1. You might be able to do it in awk but (1) you'd be writing pretty much an entire parser rather than just a translation table, opportunities for code sharing would be even less, and (2) awk has some quirks in its handling of unusual characters that make it not a great tool when we just want to process bytes.

@plobsing plobsing requested a review from alexeagle December 7, 2024 19:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: mtree_spec rule does not encode filenames with special characters
2 participants