feat: vis-encode all path characters except visible ASCII #985

plobsing · 2024-11-22T15:13:01Z

This allows archive input files to include Unicode characters.
It should also support the more esoteric corners of the ASCII set
(e.g. backslashes, newlines, control codes), if the user can find a way
to convince Bazel to use such a file. This latter feature probably isn't
a good idea to actually use; but it is nice to have confidence that this
ruleset won't choke if it comes up.

Generating an archive with non-ASCII characters reports a diagnostic,
however this does not appear to correspond to a real issue — the file
is still placed into the archive with the expected path and contents.

INFO: From Tar lib/tests/tar/7.tar:
tar: lib/tests/tar/srcdir/Unicode® support?🤞: Can't translate pathname 'lib/tests/tar/srcdir/Unicode® support?🤞' to UTF-8

Encoding of 7-bit ASCII is performed in Bazel, however Starlark does not provide
access to a string's bytes, only its codepoints, so we are unable to do full 8-bit escaping
in Starlark. A second pass is needed, at least until the spec and implementation work
to get a bytes type lands.

We use sed for this second pass. A prior attempt
at this feature had done a similar external pass with the vis utility, but has
not been accepted because the vis tool is not expected to be available everywhere.
By contrast, sed is part of POSIX and is widely available.

This solution relies on dev-time codegen of Starlark and Sed scripts;
to avoid forcing our dev_dependency rulesets (i.e. rules_go) on
users, the code generator is maintained in a separate package.

We also take this opportunity to canonicalize the paths derived
from user-provided mtree specs when computing the unused
inputs set. This should ensure that the comparison is exact,
which substantially reduces the risks associated with this feature
and should enable it to become a default.

Fixes #794

This pair of utilities from Python's core are helpful when encoding or escaping strings. Unlike the common alternative — repeated application of `str.replace` — a `str.translate` implementation performs its work in a single pass. This isn't principally about efficiency — although a single-pass implementation may be more efficient — but rather about correctness. Doing all the translation in a single pass sidesteps the issue of double-encoding errors which are possible under repeated-processing schemes when when substitution input/output aliasing is present (i.e. some substitutions produce output that other substitutions recognize as to-be-replaced input). See chainguard-dev/rules_apko#30 for an concrete example of a double-encoding issue resulting from a repeated-processing translation implementation.

Use a code-generated table to avoid having inscrutable magic constants in the codebase.

Bazel's Starlark does not provide access to a string's bytes, only its codepoints, so we are unable to do this escaping in Starlark. So a second pass is needed, at least until the spec and implementation work to get a [`bytes` type](bazelbuild/starlark#112) lands. Fixes bazel-contrib#794

The `mtree` passed to the `tar` rule is user-provided. Unlike the other streams participating in this comparison, we cannot completely trust that the encoding it uses is the same as the form this ruleset would produce. There's a bit of play to how libarchive interprets these path specs. To be correct, the comparison we're making requires that all equivalent forms be massaged to a single representation before performing the checks.

Bazel expects file paths to be written verbatim in the unused inputs file; it does not understand any escaping or encoding scheme. Because we're not post-filtering depsets written out from Bazel, we're no longer able to produce a 2-column format as easily: sed performs replacements in all columns. That's fine for the intended-to-be-vis-encoded column 1, but incorrect for the intended-to-be-verbatim-content column 2. Instead, we abandon the 2-column approach and instead work in a single column, that is vis-encoded when performing comparisons, and then decoded to write back out to Bazel.

This demonstrates support for non-ASCII characters in archive filenames. Generating an archive with non-ASCII characters reports a diagnostic, however there does not appear to be any issue — the file is placed into the archive with the expected path and contents. ``` INFO: From Tar lib/tests/tar/7.tar: tar: lib/tests/tar/srcdir/Unicode® support?🤞: Can't translate pathname 'lib/tests/tar/srcdir/Unicode® support?🤞' to UTF-8 ```

Now that alternate and redundant escape sequences are canonicalized-away, we no longer need to be concerned about externally-defined mtree files using these and falsely mismatching with our chosen encoding scheme.

plobsing · 2024-11-22T15:15:57Z

FYI, this patch is fairly substantive, but I believe there are separable elements. I've filed it as one PR because this is my desired end-state. The str.maketrans/str.translate feature could be its own PR if we wanted to expose it as a public API. And just the 7-bit encoding could be a separate, intermediate improvement. Please let me know if you want to land this in phases; either the ones I've outlined or some other partitioning scheme.

Otherwise we may observe platform-dependent behaviour.

Listing creation on Ubuntu is failing with ``` tar: Pathname can't be converted from UTF-8 to current locale. ``` Using a locale with UTF-8 support should help.

The POSIX and C.UTF-8 locales seem to work different between my Mac OS dev machine and the Ubuntu CI machines. Rather than continuing to fight Ubuntu to convince it to emit escapes like Mac OS is doing, attempt the opposite and try to coerce Mac OS to behave like Ubuntu and not perform the escaping.

alexeagle

Awesome, thank you for taking this on. At a high level, the approach seems sound to me.

lib/private/gen_vis_scripts/BUILD.bazel

alexeagle · 2024-12-06T17:14:51Z

lib/private/gen_vis_scripts/gen_vis_scripts.go

@@ -0,0 +1,171 @@
+// Code generator for vis-encoding support scripts.


I'm a bit scared of having this volume of new code for the existing maintainers to understand and bugfix.

https://github.com/vbatts/go-mtree/tree/main/pkg/govis also has a Go implementation of the vis(1) BSD built-in. And @vbatts has indicated willingness to help out with our Bazel tar implementation. Do you think there's any code-sharing possible?

I'm a bit scared of having this volume of new code for the existing maintainers to understand and bugfix.

That's fair. I recognize that this is a fairly big change set I am proposing. I'm happy to search for ways to reduce the burden. Maybe you could provide some insight into the areas you find most concerning and where efforts would do the most good? As the author, I'm a bit blinded by the curse of knowledge, so I'm not entirely sure where my energies would best be spent.

Is this more about the 171 SLOC in the gen_vis_scripts tool? Or about the 1069 SLOC of the entire PR? Assuming the latter, I have some ideas on how to get that down, although I'm not so sure they are improvements.

For instance, the processing of 7-bit ASCII within Bazel/Starlark requires separate support code — the 103 SLOC addition to strings.bzl — that could be dropped by moving all escaping work out to the sed pass. Although I think a str.maketrans/str.translate facility in Starlark would be independently nice to have. Maybe that should be separated though?

Another thing to get the PR line count down would be the checked-in generated .sed scripts, totalling 588 SLOC (or ~630 SLOC if they take on the work of ASCII escaping too). Those could be codegenned in dev and published as artifacts in release, just like binary artifacts. That seemed like more work to set up and makes the scripts a bit less directly accessible, but its something we could do if the raw SLOC number is the concern.

Would additional testing make you more comfortable with this change? I considered adding some Bats-based testing of the individual sed scripts but with PR size already a concern I wasn't sure whether that might be considered over-engineering.

https://github.com/vbatts/go-mtree/tree/main/pkg/govis also has a Go implementation of the vis(1) BSD built-in. And @vbatts has indicated willingness to help out with our Bazel tar implementation. Do you think there's any code-sharing possible?

If there is a way to achieve meaningful code size reduction through code sharing with govis, I am not seeing it. It is an interesting project, but it seems to me to be just different enough that it doesn't quite suit what is called for here. FWICT, there are two root-cause differences that drive this: a difference in implementation approach, and a difference in reverse-engineering target.

govis does its conversion as a single pass — it visits each character only once — which is a nice approach that I would recommend whenever it is available. Unfortunately, to my knowledge, no such facility is available in sed (or almost¹ any other POSIX utility); we only have repeated substitution available to us. IMO, that limitation is what drives the more complex behaviours in the generated sed scripts — the careful deconfliction of backslash and NUL handling — which has no analogue in govis.

govis seems to have been reverse-engineered from vis. The escaping behaviour implemented here has been reverse-engineered from libarchive, which has its own interpreter for escape sequences. These differ from one another ever so slightly, and also a bit from the mtree file format spec. Since the intent is to use mtree files with bsdtar, which uses libarchive, I felt that fidelity to the stable quirks of that implementation would be the least surprising and disruptive for callers.

govis' choice to replicate vis also up-scopes what it does from what we actually need. mtree's encoding for paths isn't the fully general vis encoding — only the octal escape sequences are part of the documented file format and libarchive tacks on a couple of additional, C-inspired escape sequences. Most govis modes, faithfully reproduced from vis, are not interesting to us in this context.

Footnotes

You might be able to do it in awk but (1) you'd be writing pretty much an entire parser rather than just a translation table, opportunities for code sharing would be even less, and (2) awk has some quirks in its handling of unusual characters that make it not a great tool when we just want to process bytes. ↩

Only direct references to rules_go need to be isolated, not rules that merely transitively depend on that ruleset through references to other rules.

plobsing added 8 commits November 22, 2024 09:44

Correctly vis-encode all 7-bit ASCII for mtree

4d77429

Use a code-generated table to avoid having inscrutable magic constants in the codebase.

Update docstring to reflect greater confidence in mtree filtering logic

4c2d148

Now that alternate and redundant escape sequences are canonicalized-away, we no longer need to be concerned about externally-defined mtree files using these and falsely mismatching with our chosen encoding scheme.

Add a test covering canonicalization of external mtree specs

419cd8a

Always generate archive listing in the POSIX locale

0a4bba6

Otherwise we may observe platform-dependent behaviour.

plobsing force-pushed the vis-encode-improvements branch from 4b7b5e2 to 0a4bba6 Compare November 22, 2024 18:31

Use C.UTF-8 locale instead

41b8d94

Listing creation on Ubuntu is failing with ``` tar: Pathname can't be converted from UTF-8 to current locale. ``` Using a locale with UTF-8 support should help.

plobsing force-pushed the vis-encode-improvements branch from e7ea9a3 to 41b8d94 Compare November 22, 2024 19:55

plobsing added 2 commits November 22, 2024 15:05

Provide explanation for hacky use of en_US locale

522d76b

alexeagle reviewed Dec 6, 2024

View reviewed changes

plobsing added 4 commits December 7, 2024 12:08

Move non-devtool rules out of quarantine package

c0c49f0

Only direct references to rules_go need to be isolated, not rules that merely transitively depend on that ruleset through references to other rules.

Move script generator tool to live alongside other tools

7fe54b8

Remove unnecessary export to non-existed package

54694ce

Associate gencode comment with the generation block it describes

1aded37

plobsing requested a review from alexeagle December 7, 2024 19:30

plobsing mentioned this pull request Dec 9, 2024

Locale support for bsdtar #966

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: vis-encode all path characters except visible ASCII #985

feat: vis-encode all path characters except visible ASCII #985

plobsing commented Nov 22, 2024 •

edited

Loading

plobsing commented Nov 22, 2024

alexeagle left a comment

alexeagle Dec 6, 2024

plobsing Dec 7, 2024

		@@ -0,0 +1,171 @@
		// Code generator for vis-encoding support scripts.

feat: vis-encode all path characters except visible ASCII #985

Are you sure you want to change the base?

feat: vis-encode all path characters except visible ASCII #985

Conversation

plobsing commented Nov 22, 2024 • edited Loading

plobsing commented Nov 22, 2024

alexeagle left a comment

Choose a reason for hiding this comment

alexeagle Dec 6, 2024

Choose a reason for hiding this comment

plobsing Dec 7, 2024

Choose a reason for hiding this comment

Footnotes

plobsing commented Nov 22, 2024 •

edited

Loading