Fix panic when marshaling build info with invalid UTF-8 #2634

Open
scarf005 wants to merge 1 commit into microsoft:main from scarf005:fix/1531/invalid-utf-8

@scarf005 scarf005 commented Feb 2, 2026

Sanitize diagnostic message args before JSON marshaling by replacing invalid UTF-8 sequences with the Unicode replacement character (U+FFFD). This prevents panics when source files contain invalid UTF-8.
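
A minimal sketch of this kind of sanitization, assuming only the Go standard library. The name `sanitizeArgs` is hypothetical; it is not the PR's actual helper, whose body is not shown in this thread.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitizeArgs is a hypothetical stand-in for a helper that makes every
// message arg valid UTF-8. It returns the input slice unchanged when all
// args are already valid, and copies the slice only when a fix is needed.
func sanitizeArgs(args []string) []string {
	var out []string // allocated lazily, on the first invalid arg
	for i, a := range args {
		if utf8.ValidString(a) {
			continue
		}
		if out == nil {
			out = make([]string, len(args))
			copy(out, args)
		}
		// Invalid byte sequences are replaced with U+FFFD.
		out[i] = strings.ToValidUTF8(a, string(utf8.RuneError))
	}
	if out == nil {
		return args
	}
	return out
}

func main() {
	args := []string{"ok", "bad\xffname"}
	// %+q prints non-ASCII runes as escapes, which makes U+FFFD visible.
	fmt.Printf("%+q\n", sanitizeArgs(args))
}
```

Copying lazily keeps the common case, where every arg is already valid, free of allocations and leaves the caller's slice untouched.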

fixes #1531

this PR was generated by opencode + claude 4.5 opus.

Copilot AI review requested due to automatic review settings February 2, 2026 11:04

Copilot AI left a comment

Pull request overview

This PR fixes a critical panic that occurs when marshaling build info containing diagnostic messages with invalid UTF-8 sequences. The fix sanitizes message arguments by replacing invalid UTF-8 with the Unicode replacement character (U+FFFD) before JSON marshaling.

Changes:

  • Added toValidUTF8MessageArgs function to sanitize diagnostic message arguments
  • Applied sanitization to both diagnostic conversion paths in snapshottobuildinfo.go
  • Added comprehensive unit tests for the sanitization function

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| internal/execute/incremental/snapshottobuildinfo.go | Added UTF-8 sanitization function and integrated it into both diagnostic conversion paths |
| internal/execute/incremental/snapshottobuildinfo_test.go | Added comprehensive unit tests for the sanitization function |

Comment on lines 36 to 38
name: "multiple invalid bytes coalesced",
input: []string{"\x80\x81\x82"},
expected: []string{"\uFFFD"},

Copilot AI Feb 2, 2026

The test expectation may be incorrect. The Go standard library function strings.ToValidUTF8 replaces each individual invalid UTF-8 byte with the replacement character, not consecutive invalid bytes as a single replacement. For the input "\x80\x81\x82" (three invalid continuation bytes), the function will produce "\uFFFD\uFFFD\uFFFD" (three replacement characters), not "\uFFFD" (one replacement character) as expected in this test.

This test should be updated to match the actual behavior of strings.ToValidUTF8. You can verify this by running a simple test:

result := strings.ToValidUTF8("\x80\x81\x82", "\uFFFD")
// result will be "\uFFFD\uFFFD\uFFFD", not "\uFFFD"
Suggested change
- name: "multiple invalid bytes coalesced",
- input: []string{"\x80\x81\x82"},
- expected: []string{"\uFFFD"},
+ name: "multiple invalid bytes individually replaced",
+ input: []string{"\x80\x81\x82"},
+ expected: []string{"\uFFFD\uFFFD\uFFFD"},

Copilot uses AI. Check for mistakes.
Comment on lines 56 to 58
name: "overlong encodings coalesced",
input: []string{"\xc0\x80"},
expected: []string{"\uFFFD"},

Copilot AI Feb 2, 2026

The test expectation may be incorrect. The sequence "\xc0\x80" is an overlong encoding for the null character. The Go standard library function strings.ToValidUTF8 will replace each invalid byte separately, resulting in "\uFFFD\uFFFD" (two replacement characters), not "\uFFFD" (one replacement character) as expected in this test.

This test should be updated to match the actual behavior of strings.ToValidUTF8.

Suggested change
- name: "overlong encodings coalesced",
- input: []string{"\xc0\x80"},
- expected: []string{"\uFFFD"},
+ name: "overlong encodings replaced per byte",
+ input: []string{"\xc0\x80"},
+ expected: []string{"\uFFFD\uFFFD"},

Copilot uses AI. Check for mistakes.
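
The Go documentation for `strings.ToValidUTF8` describes the function as replacing each run of invalid UTF-8 byte sequences with the replacement string, so the two per-byte claims above are worth checking before the tests are changed. A small check, using the same inputs as the review comments plus one input with two separate runs; `countReplacements` is a hypothetical helper written only for this check:

```go
package main

import (
	"fmt"
	"strings"
)

// countReplacements sanitizes in with strings.ToValidUTF8 and reports how
// many U+FFFD runes the result contains. Hypothetical helper for this check.
func countReplacements(in string) int {
	out := strings.ToValidUTF8(in, "\uFFFD")
	return strings.Count(out, "\uFFFD")
}

func main() {
	inputs := []string{
		"\x80\x81\x82", // three consecutive invalid bytes (one run)
		"\xc0\x80",     // overlong encoding: two consecutive invalid bytes (one run)
		"a\x80b\x81c",  // two invalid bytes separated by valid text (two runs)
	}
	for _, in := range inputs {
		fmt.Printf("%+q -> %d replacement rune(s)\n", in, countReplacements(in))
	}
}
```

On these inputs the counts are 1, 1, and 2: consecutive invalid bytes collapse into a single U+FFFD, which matches the original "coalesced" test expectations rather than the suggested per-byte ones.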
// toValidUTF8MessageArgs ensures all message args are valid UTF-8 strings.
// Invalid UTF-8 sequences are replaced with the Unicode replacement character (U+FFFD).
// This prevents JSON marshaling failures when diagnostic messages contain invalid UTF-8.
func toValidUTF8MessageArgs(args []string) []string {
Member

This is not enough, because we need to resurface this information correctly when showing errors from buildInfo.
This also comes up in signature help, so the problem needs to be tackled differently.

Fixes microsoft#1531

When source files contain invalid UTF-8 sequences (e.g., binary data
or incorrectly encoded text), the compiler panics during JSON marshaling
of build info or LSP responses.

This fix adds UTF-8 sanitization at two layers:
1. VFS file reading layer (internal/vfs/internal/internal.go) - sanitizes
   invalid UTF-8 when decoding file contents to catch issues at the source
2. Diagnostic message args (internal/diagnostics/diagnostics.go) - sanitizes
   strings in StringifyArgs as defense-in-depth

Invalid UTF-8 sequences are replaced with the Unicode replacement
character (U+FFFD).
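
The first layer can be sketched roughly as follows. `decodeFileContents` is a hypothetical name; the real VFS code is not shown in this thread and may handle byte-order marks and other encodings that this sketch ignores.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// decodeFileContents is a hypothetical read-time sanitizer: it returns the
// raw bytes as a string, replacing runs of invalid UTF-8 with U+FFFD.
// Assumption: the input is meant to be UTF-8 and has no BOM to strip.
func decodeFileContents(raw []byte) string {
	if utf8.Valid(raw) {
		return string(raw) // fast path: nothing to fix
	}
	replacement := []byte(string(utf8.RuneError))
	return string(bytes.ToValidUTF8(raw, replacement))
}

func main() {
	raw := []byte("const s = \"\xff\";")
	text := decodeFileContents(raw)
	fmt.Printf("%+q\n", text)
	fmt.Println(len(raw), len(text)) // byte lengths before and after
}
```

Sanitizing at read time changes the text itself: a single invalid byte such as `\xff` becomes the three-byte encoding of U+FFFD, so byte lengths in the sanitized text no longer match the file on disk.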
@scarf005 scarf005 force-pushed the fix/1531/invalid-utf-8 branch from 73d5f90 to 3183fa1 Compare February 3, 2026 01:54
Comment on lines +181 to +186
// Ensure the string is valid UTF-8 to prevent panics during JSON marshaling
// (e.g., in buildInfo serialization or LSP responses).
// Invalid UTF-8 sequences are replaced with the Unicode replacement character (U+FFFD).
if !utf8.ValidString(s) {
	s = strings.ToValidUTF8(s, string(utf8.RuneError))
}
Member

This will make it impossible to do anything in files with bad encoding, though, and won't stop the client from sending it to us either...

Development

Successfully merging this pull request may close these issues.

tsgo panics with invalid UTF-8 when marshalling build info

3 participants