Locale support for bsdtar #966

Closed
alexnovak opened this issue Oct 14, 2024 · 4 comments

@alexnovak

The bsdtar supplied with bazel-lib doesn't support Unicode in mtree files.

bsdtar does seem to support locales, but only when compiled with HAVE_SETLOCALE: https://github.com/libarchive/libarchive/blob/40ff837717b89e9a5d2c735758f503d124d17b72/tar/bsdtar.c#L190-L192

In the libarchive compilation that underpins this, it seems like we're configuring this for libarchive, but perhaps it isn't making its way to the bsdtar compilation? bazelbuild/bazel-central-registry@6050102#diff-7362d45a2c906ff9c0922ff1f104e88aa28197d9345dabbc18a7a5740f3959e6R1352

Does it also have to depend on //:config? I'm not particularly familiar with bazel cc rules.

@alexnovak
Author

Looking into this further, I've confirmed that we are correctly compiling with HAVE_SETLOCALE. The problem instead looks like it might be in how libarchive parses mtree files:
https://github.com/libarchive/libarchive/blob/40ff837717b89e9a5d2c735758f503d124d17b72/libarchive/archive_read_support_format_mtree.c#L1074-L1079

I think isprint here will fail on the individual bytes of multibyte sequences, such as the encoding of an accented character like "ő".
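To illustrate the suspicion above, here is a minimal Python sketch of a per-byte printable check. It assumes the "C" locale, where isprint accepts only bytes 0x20 through 0x7E, and it assumes the path is UTF-8 encoded. The helpers `is_printable_ascii` and `rejected_bytes` are hypothetical names that only mimic that behaviour; this is not libarchive's code.

```python
def is_printable_ascii(byte: int) -> bool:
    """Mimic C isprint() in the "C" locale: 0x20 (space) through 0x7E (~)."""
    return 0x20 <= byte <= 0x7E


def rejected_bytes(path: str, encoding: str = "utf-8") -> list[int]:
    """Return the bytes of `path` that a per-byte printable check rejects."""
    return [b for b in path.encode(encoding) if not is_printable_ascii(b)]


# Both bytes of the UTF-8 encoding of "ő" (U+0151) fall outside the range.
print([hex(b) for b in rejected_bytes("ő")])  # ['0xc5', '0x91']
```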

@plobsing
Contributor

plobsing commented Dec 8, 2024

The mtree file format does not support multi-byte sequences; the contents of an mtree file are expected to be 7-bit ASCII. Content not conforming to that expectation must be represented using escape sequences:

 When encoding file or pathnames, any backslash character or character
 outside of the 95 printable ASCII characters must be encoded as a
 backslash followed by three octal digits. When reading mtree files, any
 appearance of a backslash followed by three octal digits should be
 converted into the corresponding character.

Assuming you want UTF-8 [1], "ő" (U+0151, UTF-8 bytes 0xC5 0x91) can be represented as \305\221 in mtree pathnames.

Footnotes

  1. Encoding is not prescribed by mtree or tar; you could use some other encoding if you really wanted to.
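The escaping rule quoted above could be sketched in Python roughly as follows. This is an illustration, not bsdtar's or bazel-lib's implementation: `mtree_escape` and `mtree_unescape` are hypothetical helper names, and UTF-8 is an assumed encoding. As a further assumption, the sketch also escapes the space character, since mtree separates fields with whitespace.

```python
import re


def mtree_escape(path: str, encoding: str = "utf-8") -> str:
    """Escape a pathname for an mtree file, byte by byte."""
    out = []
    for b in path.encode(encoding):
        # Keep printable ASCII except backslash; space is escaped too
        # (an assumption of this sketch, see the lead-in).
        if 0x21 <= b <= 0x7E and b != 0x5C:
            out.append(chr(b))
        else:
            out.append("\\%03o" % b)
    return "".join(out)


def mtree_unescape(escaped: str, encoding: str = "utf-8") -> str:
    """Turn each backslash plus three octal digits back into one byte."""
    raw = re.sub(
        rb"\\([0-3][0-7]{2})",
        lambda m: bytes([int(m.group(1), 8)]),
        escaped.encode("ascii"),
    )
    return raw.decode(encoding)


print(mtree_escape("ő"))  # \305\221
```

Working on bytes rather than characters keeps the sketch independent of the encoding: only the final decode step needs to know which one was used.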

@alexnovak
Author

You're absolutely correct. The folks over on libarchive were polite enough to point this out to me: libarchive/libarchive#2384 (comment)

They've mentioned this might be a nice feature for them in the future, but agree that it isn't a bug. I'll close this for now and use an awk or sed hack to clean up these paths in my own builds.
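One possible shape for such a cleanup, sketched in Python rather than awk or sed: rewrite every byte outside printable 7-bit ASCII as a backslash plus three octal digits, and leave everything else untouched, including existing escapes and the whitespace between fields. The name `escape_non_ascii` and the sample line are hypothetical, and the sketch assumes the mtree text is otherwise well formed.

```python
import re

# Any byte outside printable ASCII, except the newline and tab that
# structure the file itself.
_NON_ASCII = re.compile(rb"[^\x20-\x7e\n\t]")


def escape_non_ascii(mtree_bytes: bytes) -> bytes:
    """Replace each offending byte with a backslash and three octal digits."""
    return _NON_ASCII.sub(lambda m: b"\\%03o" % m.group(0)[0], mtree_bytes)


line = "./dir/ő.txt type=file mode=0644\n".encode("utf-8")
print(escape_non_ascii(line).decode("ascii"), end="")
```

Because backslashes and spaces are left alone, the filter is safe to run on a file that already contains some escaped paths; it only touches raw non-ASCII bytes.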

@plobsing
Contributor

plobsing commented Dec 9, 2024

FWIW, I am trying to land a change to make that cleanup automatic if you use the mtree_spec rule to create your mtree. That change works using a big pile of sed, so if you're managing your own mtree, you could probably re-use / modify some of those scripts.
