Skip to content

Conversation

@pgavlin
Copy link
Member

@pgavlin pgavlin commented Mar 11, 2025

ESC currently handles binary data by coercing binary values to a (potentially-invalid) strings. This works okay internally, since Go does not validate the wellformedness of strings when coercing from binary data. However, the standard Go JSON encoder ensures that all strings are coerced to valid UTF-8 during marshaling, which replaces any invalid bytes with the replacement character. This corrupts the data.

For example, the following environment should write the bytes dcc5 43bc 1b4f f4fb 8263 d630 a199 2688 166b c084 5b71 1240 24c3 a944 d937 e2fc to a temporary file:

values:
  files:
    MY_FILE:
      fn::fromBase64: 3MVDvBtP9PuCY9YwoZkmiBZrwIRbcRJAJMOpRNk34vw=

Instead, it writes the bytes efbf bdef bfbd 43ef bfbd 1b4f efbf bdef bfbd efbf bd63 efbf bd30 efbf bdef bfbd 26ef bfbd 166b efbf bdef bfbd 5b71 1240 24c3 a944 efbf bd37 efbf bdef bfbd, because the decoded value of MY_FILE is marshaled as "��C�\u001bO���c�0��\u0026�\u0016k��[q\u0012@$éD�7��" due to the presence of invalid UTF-8 bytes in the decoded value.

This commit addresses these challenges through three related changes:

  1. []byte values are now allowed inside of eval.value. This allows binary data to be faithfully represented within the evaluator.
  2. []byte values are now allowed inside of esc.Value. If an esc.Value contains a []byte, its contents are marshaled as a base64-encoded string and a new binary property is set to true. This allows binary data to be safely roundtripped without complicated encoding schemes.
  3. The files stanza now supports binary data. If the value of an entry in files is binary data, it is written to the filesystem as-is.

Strictly speaking this is a breaking change: prior to these changes, binary data--which can only be manufactured by a provider, rotator, or fn::fromBase64--would be marshaled as a plain old JSON string. With these changes, binary data is still marshaled as a JSON string, but its contents are now base64-encoded bytes. If users had binary data that happened to be valid UTF-8, that data will now be encoded using base64 during marshaling instead of being represented verbatim. If this is a major concern, we could change our marshaling behavior to only base64-encode if the binary data is not valid UTF-8.

Fixes #480.

ESC currently handles binary data by coercing binary values to a
(potentially-invalid) strings. This works okay internally, since Go does
not validate the wellformedness of strings when coercing from binary
data. However, the standard Go JSON encoder ensures that all strings are
[coerced to valid UTF-8 during marshaling](https://pkg.go.dev/encoding/json#Marshal),
which replaces any invalid bytes with the replacement character. This
corrupts the data.

For example, the following environment should write the bytes `dcc5 43bc
1b4f f4fb 8263 d630 a199 2688 166b c084 5b71 1240 24c3 a944 d937 e2fc`
to a temporary file:

```yaml
values:
  files:
    MY_FILE:
      fn::fromBase64: 3MVDvBtP9PuCY9YwoZkmiBZrwIRbcRJAJMOpRNk34vw=
```

Instead, it writes the bytes `efbf bdef bfbd 43ef bfbd 1b4f efbf bdef
bfbd efbf bd63 efbf bd30 efbf bdef bfbd 26ef bfbd 166b efbf bdef bfbd
5b71 1240 24c3 a944 efbf bd37 efbf bdef bfbd`, because the decoded value
of `MY_FILE` is marshaled as `"��C�\u001bO���c�0��\u0026�\u0016k��[q\u0012@$éD�7��"`
due to the presence of invalid UTF-8 bytes in the decoded value.

This commit addresses these challenges through three related changes:

1. `[]byte` values are now allowed inside of `eval.value`. This allows
   binary data to be faithfully represented within the evaluator.
2. `[]byte` values are now allowed inside of `esc.Value`. If an
   `esc.Value` contains a `[]byte`, its contents are marshaled as a
   base64-encoded string and a new `binary` property is set to `true`.
   This allows binary data to be safely roundtripped without complicated
   encoding schemes.
3. The `files` stanza now supports binary data. If the value of an entry
   in `files` is binary data, it is written to the filesystem as-is.

Strictly speaking this is a breaking change: prior to these changes,
binary data--which can only be manufactured by a provider, rotator, or
`fn::fromBase64`--would be marshaled as a plain old JSON string. With
these changes, binary data is still marshaled as a JSON string, but its
contents are now base64-encoded bytes. If users had binary data that
happened to be valid UTF-8, that data will now be encoded using base64
during marshaling instead of being represented verbatim. If this is a
major concern, we could change our marshaling behavior to only
base64-encode if the binary data is not valid UTF-8.

Fixes #480.
@pgavlin pgavlin requested review from komalali, nyobe and seanyeh March 11, 2025 19:50
@pgavlin
Copy link
Member Author

pgavlin commented Mar 11, 2025

9ab531a is a bit questionable--might be that the better option is to change the test validation code s.t. it marshals the result of ToJSON before comparing. The benefit of the original changes to ToJSON/FromJSON is that users of the esc package can roughly round-trip values (modulo metadata) between representations.

@nyobe
Copy link
Contributor

nyobe commented Mar 12, 2025

I'm uneasy about having ToJSON do that base64 encoding. If it returns the underlying []byte directly then it will end up encoding as base64 already during json.Marshal, so I think having it happen early is surprising, and I think it will break roundtripping since FromJSON won't convert it back to []byte.

I think it should be unnecessary, because Value is able to roundtrip a []byte correctly thanks to the custom MarshalJSON/UnmarshalJSON implementation. I'd much prefer changing the test validation code to leverage this instead

@pgavlin
Copy link
Member Author

pgavlin commented Jul 17, 2025

little note on the !!binary tag: https://yaml.org/type/binary.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Binary data corrupted during marshaling

2 participants