Fix panic when marshaling build info with invalid UTF-8#2634
Fix panic when marshaling build info with invalid UTF-8#2634scarf005 wants to merge 1 commit intomicrosoft:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a critical panic that occurs when marshaling build info containing diagnostic messages with invalid UTF-8 sequences. The fix sanitizes message arguments by replacing invalid UTF-8 with the Unicode replacement character (U+FFFD) before JSON marshaling.
Changes:
- Added
toValidUTF8MessageArgsfunction to sanitize diagnostic message arguments - Applied sanitization to both diagnostic conversion paths in
snapshottobuildinfo.go - Added comprehensive unit tests for the sanitization function
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| internal/execute/incremental/snapshottobuildinfo.go | Added UTF-8 sanitization function and integrated it into both diagnostic conversion paths |
| internal/execute/incremental/snapshottobuildinfo_test.go | Added comprehensive unit tests for the sanitization function |
| name: "multiple invalid bytes coalesced", | ||
| input: []string{"\x80\x81\x82"}, | ||
| expected: []string{"\uFFFD"}, |
There was a problem hiding this comment.
The test expectation may be incorrect. The Go standard library function strings.ToValidUTF8 replaces each individual invalid UTF-8 byte with the replacement character, not consecutive invalid bytes as a single replacement. For the input "\x80\x81\x82" (three invalid continuation bytes), the function will produce "\uFFFD\uFFFD\uFFFD" (three replacement characters), not "\uFFFD" (one replacement character) as expected in this test.
This test should be updated to match the actual behavior of strings.ToValidUTF8. You can verify this by running a simple test:
result := strings.ToValidUTF8("\x80\x81\x82", "\uFFFD")
// result will be "\uFFFD\uFFFD\uFFFD", not "\uFFFD"| name: "multiple invalid bytes coalesced", | |
| input: []string{"\x80\x81\x82"}, | |
| expected: []string{"\uFFFD"}, | |
| name: "multiple invalid bytes individually replaced", | |
| input: []string{"\x80\x81\x82"}, | |
| expected: []string{"\uFFFD\uFFFD\uFFFD"}, |
| name: "overlong encodings coalesced", | ||
| input: []string{"\xc0\x80"}, | ||
| expected: []string{"\uFFFD"}, |
There was a problem hiding this comment.
The test expectation may be incorrect. The sequence "\xc0\x80" is an overlong encoding for the null character. The Go standard library function strings.ToValidUTF8 will replace each invalid byte separately, resulting in "\uFFFD\uFFFD" (two replacement characters), not "\uFFFD" (one replacement character) as expected in this test.
This test should be updated to match the actual behavior of strings.ToValidUTF8.
| name: "overlong encodings coalesced", | |
| input: []string{"\xc0\x80"}, | |
| expected: []string{"\uFFFD"}, | |
| name: "overlong encodings replaced per byte", | |
| input: []string{"\xc0\x80"}, | |
| expected: []string{"\uFFFD\uFFFD"}, |
| // toValidUTF8MessageArgs ensures all message args are valid UTF-8 strings. | ||
| // Invalid UTF-8 sequences are replaced with the Unicode replacement character (U+FFFD). | ||
| // This prevents JSON marshaling failures when diagnostic messages contain invalid UTF-8. | ||
| func toValidUTF8MessageArgs(args []string) []string { |
There was a problem hiding this comment.
This is not enough because we need to resurface this information correctly when showing error from buildInfo
This is also comes up in signature help so this problem needs to be tackled differently.
Fixes microsoft#1531 When source files contain invalid UTF-8 sequences (e.g., binary data or incorrectly encoded text), the compiler panics during JSON marshaling of build info or LSP responses. This fix adds UTF-8 sanitization at two layers: 1. VFS file reading layer (internal/vfs/internal/internal.go) - sanitizes invalid UTF-8 when decoding file contents to catch issues at the source 2. Diagnostic message args (internal/diagnostics/diagnostics.go) - sanitizes strings in StringifyArgs as defense-in-depth Invalid UTF-8 sequences are replaced with the Unicode replacement character (U+FFFD).
73d5f90 to
3183fa1
Compare
| // Ensure the string is valid UTF-8 to prevent panics during JSON marshaling | ||
| // (e.g., in buildInfo serialization or LSP responses). | ||
| // Invalid UTF-8 sequences are replaced with the Unicode replacement character (U+FFFD). | ||
| if !utf8.ValidString(s) { | ||
| s = strings.ToValidUTF8(s, string(utf8.RuneError)) | ||
| } |
There was a problem hiding this comment.
This will make it impossible to do anything in files with bad encoding, though, and won't stop the client from sending it to us either...
Sanitize diagnostic message args containing invalid UTF-8 sequences into Unicode replacement character (U+FFFD) before JSON marshaling to prevent panics when source files contain invalid UTF-8 sequences.
fixes #1531
this PR was generated by opencode + claude 4.5 opus.