Report binary file changes in sem diff#237
Conversation
There was a problem hiding this comment.
inspect review
Triage: 57 entities analyzed | 0 critical, 0 high, 37 medium, 20 low
Verdict: standard_review
Findings (2)
- [low] In
crates/sem-cli/src/formatters/terminal.rs, theformat_terminalfunction skips files where all changes are orphans in non-verbose mode, but the condition only checksindices(semantic changes) and ignoresbinary_indices. If a file has only binary changes and all semantic changes are orphans, it will be incorrectly skipped even though binary changes should be shown. - [low] In
crates/sem-cli/src/commands/diff.rs, theread_file_compare_contentfunction returns(Option<String>, bool)where the bool indicates if the file is binary. Whenis_binaryis true, it returns(None, true). However, in the file comparison logic, bothcontent_aandcontent_bareOption<String>, and the code later doesbefore_content: if is_binary { None } else { content_a }. Ifcontent_ais alreadyNonewhenis_binaryis true, this works, but if one file is binary and the other is text, the logiclet is_binary = a_is_binary || b_is_binary;will mark both as binary and discard the text content from the non-binary file.
Reviewed by inspect | Entity-level triage found 0 high-risk changes
aa60aeb to
8909c8d
Compare
There was a problem hiding this comment.
inspect review
Triage: 41 entities analyzed | 0 critical, 0 high, 24 medium, 17 low
Verdict: standard_review
Findings (7)
- [low] In
format_markdown, binary changes are iterated but the loop variableidxis used to index intobinary_changes, yetidxcomes from enumeratingbinary_changes.iter(). This meansidxis the enumeration index (0, 1, 2...), which correctly indexesbinary_changes[idx]. However, the pattern is inconsistent with how semantic changes are handled (where indices are stored in a Vec and dereferenced). This is actually correct, but the inconsistency could lead to bugs if copy-pasted. - [low] In
diff_command, when comparing two files directly, the code checksis_binary = a_is_binary || b_is_binaryand then setsbefore_contentandafter_contenttoNoneif binary. However, if only one file is binary (e.g., comparing a text file to a binary file), both contents are set to None, which may not correctly represent the change. The logic should handle the case where only one side is binary. - [low] In
parse_unified_diff, the code addsis_binary: falseto newPatchEntryinstances, but only sets it totruewhen encountering specific binary markers. However, if a file is binary but the markers are not detected (e.g., malformed diff), the entry will incorrectly haveis_binary: false, leading to attempts to read content that doesn't exist. - [low] In
lacks_diffable_content, forFileStatus::Renamed, the function returnstrueif eitherbefore_contentorafter_contentisNone. However, a renamed text file should have both contents present. If only one is missing, this might indicate a bug elsewhere, but treating it as binary could hide the real issue. The logic assumes that missing content always means binary, which may not be correct for renames. - [low] In
read_file_compare_content, ifString::from_utf8(bytes)fails butis_binaryis false (becausebytes_look_binaryreturned false), the function returnsOk((None, false)). This means a non-binary file that fails UTF-8 conversion will haveNonecontent butis_binary=false, which is inconsistent and could cause the file to be skipped incorrectly. - [low] In
format_markdown, binary changes are iterated but never filtered by verbose mode, unlike semantic changes. This means binary changes in files with only orphan semantic changes will still be shown in non-verbose mode, creating inconsistent behavior. - [low] In
format_terminal, the verbose mode filtering logic only checks if all semantic changes are orphans, but doesn't account for binary_indices. A file with only binary changes and orphan semantic changes will be skipped entirely in non-verbose mode, hiding the binary changes.
Reviewed by inspect | Entity-level triage found 0 high-risk changes
8909c8d to
32089bf
Compare
There was a problem hiding this comment.
inspect review
Triage: 42 entities analyzed | 0 critical, 0 high, 24 medium, 18 low
Verdict: standard_review
Findings (5)
- [low] In
parse_unified_diff, the function attempts to read working files for unstaged changes but doesn't handle binary files correctly. Whene.is_binaryis true, it still tries to read the file as text withstd::fs::read_to_string, which will fail for binary files. The binary check should be applied before the fallback read. - [low] In
diff_command, when comparing two files with--file-compare, the code callsread_file_compare_contentwhich returnsResult<Option<String>, std::io::Error>. However, the error handling withunwrap_or_elsedoesn't distinguish between I/O errors and binary files (which returnOk(None)). If a file is binary, it will be treated as havingNonecontent, but the subsequent code doesn't check if both sides are None before creating a FileChange, potentially creating invalid comparisons. - [low] In
parse_unified_diff, the function attempts to read working files withstd::fs::read_to_stringfor non-binary files, but this will fail for binary files that weren't detected by theis_binaryflag. The code should useread_file_compare_contentinstead to handle binary detection consistently. - [low] In
diff_commandfor file comparison mode, the code now returnsOption<String>fromread_file_compare_contentbut assigns it directly tobefore_contentandafter_contentfields which expectOption<String>. However, if a binary file is detected (returnsNone), the FileChange will have both contents asNone, but the code doesn't handle this case - it should either report an error or create a binary change entry. - [low] In
format_terminal, the verbose mode skip logic now checksbinary_indices.is_empty()but this creates asymmetric behavior: files with only binary changes will be shown even in non-verbose mode, while files with only orphan changes are hidden. This inconsistency could confuse users.
Reviewed by inspect | Entity-level triage found 0 high-risk changes
32089bf to
cffb93c
Compare
There was a problem hiding this comment.
inspect review
Triage: 42 entities analyzed | 0 critical, 0 high, 24 medium, 18 low
Verdict: standard_review
Findings (7)
- [low] In
parse_unified_diff, the function attempts to read file content for binary files whene.is_binaryis true, but then checks!e.is_binarybefore reading. This logic is inverted - it should skip reading for binary files but the condition allows reading whene.is_binaryis false, which is correct. However, the warning message at line 349 will trigger for binary files even though that's expected behavior, causing spurious warnings. - [low] In
diff_commandat line 465-467, when comparing two files directly (not from git), if either file is binary (returns None fromread_file_compare_content), the code creates a FileChange withbefore_content: content_aandafter_content: content_bwhere these could be None. The comment says 'None means this side is binary', but the old code expected Some(String). This could break downstream code that expects both sides to be Some for Modified status. - [low] In
lacks_diffable_contentfunction in differ.rs, for FileStatus::Renamed, it checks if either before_content OR after_content is None. However, a renamed file should have both sides present (just with different paths). If only one side is None, it's likely a bug in content population, not a binary file. This could incorrectly classify partially-populated renamed files as binary. - [low] In
bytes_look_binaryfunction in diff.rs (line 134-143), the function checks for UTF-8 validity but has inconsistent behavior: whencompleteis false, it returns false for partial UTF-8 sequences (error_len().is_none()), but whencompleteis true, it returns true for the same partial sequences. This means a file scanned in chunks could be classified differently than when read completely, leading to inconsistent binary detection. - [low] In
parse_unified_diff, the function reads file content from disk for unstaged changes but doesn't handle the case whereread_file_compare_contentreturnsOk(None)for binary files. This will causeafter_contentto beNoneeven whene.is_binaryis false, triggering the warning message incorrectly. - [low] In
diff_commandfor file comparison mode, whenread_file_compare_contentreturnsOk(None)for a binary file, the code creates aFileChangewithbefore_content: Noneandafter_content: None. However, the comment says 'None means this side is binary', but there's no way to distinguish between a binary file and a read error, which could lead to incorrect binary change reporting. - [low] In
format_terminal, the verbose mode skip logic now checksbinary_indices.is_empty()but this creates inconsistent behavior: a file with only binary changes will be shown even in non-verbose mode, while a file with only orphan changes will be skipped. This breaks the intended filtering.
Reviewed by inspect | Entity-level triage found 0 high-risk changes
cffb93c to
37892d1
Compare
There was a problem hiding this comment.
inspect review
Triage: 42 entities analyzed | 0 critical, 0 high, 24 medium, 18 low
Verdict: standard_review
Findings (6)
- [low] In
parse_unified_diff, theis_binaryfield is initialized tofalsebut never set totruewhen creating a newPatchEntry. The binary detection logic only updates existing entries, so if a binary file is the first file in the diff, it won't be marked as binary. - [low] In
parse_unified_diff, the binary detection happens insideelse if let Some(ref mut entry) = current, butcurrentis only set after processing the diff header. If binary markers appear before the entry is added toentries, they won't be detected. - [low] In
parse_unified_diff, after pushing a newPatchEntrytoentries, the code never updatescurrentto point to the newly added entry. This meanscurrentremainsNoneand binary detection in subsequent lines won't work. - [low] In
parse_unified_diff, the binary detection checksline.starts_with("Binary files ")but this will only match if there's a space after 'files'. Git's actual output is 'Binary files a/... and b/... differ' which would match, but the alternative checkline == "GIT binary patch"is too strict - it should checkstarts_withinstead of exact equality since there might be trailing content. - [low] In
diff_commandfor file comparison mode, whenread_file_compare_contentreturnsOk(None)(binary file), the code creates aFileChangewithbefore_content: content_aandafter_content: content_bwhere both areOption<String>. If either side is binary (None), this will create a FileChange with one side None, but the comment says 'None means this side is binary' - however, the actual assignment doesn't handle the case where only one side is binary correctly for the semantic diff pipeline. - [low] In
lacks_diffable_content, the logic forFileStatus::Renamedchecks if eitherbefore_contentorafter_contentis None, but a renamed file should have both sides present (just with different paths). A rename with missing content on one side likely indicates a binary file, but the function returns true if EITHER is None, when it should probably require BOTH to be None to be considered truly lacking diffable content, or the rename case should be handled separately.
Reviewed by inspect | Entity-level triage found 0 high-risk changes
37892d1 to
e92a24e
Compare
There was a problem hiding this comment.
inspect review
Triage: 42 entities analyzed | 0 critical, 0 high, 24 medium, 18 low
Verdict: standard_review
Findings (4)
- [low] In
parse_unified_diff, theis_binaryfield is initialized tofalsebut never set totruewhen creating a newPatchEntry. The binary detection logic checksline.starts_with("Binary files ")orline.starts_with("GIT binary patch")and setsentry.is_binary = true, but this only works ifcurrentisSome. When a new entry is created,is_binarydefaults tofalse, so if the binary marker appears before any other recognized line for that entry, it won't be detected. - [low] In
diff_command, when comparing two files directly (theelse if a.is_some() && b.is_some()branch), the code creates aFileChangewithbefore_content: content_aandafter_content: content_bwhere these areOption<String>. However,read_file_compare_contentreturnsResult<Option<String>, std::io::Error>, socontent_aandcontent_bare alreadyOption<String>. The comment says 'None means this side is binary', but the code doesn't handle the case where one side is binary and the other is text - it will just passNonefor the binary side, which is correct. However, the error handling withunwrap_or_elsewill exit on IO errors, but won't distinguish between IO errors and binary files, potentially causing confusing error messages. - [low] In parse_unified_diff, the is_binary field is never set to true for entries that don't have a current entry. When a new diff block starts with 'diff --git', the previous entry is flushed and a new entry is created with is_binary: false. If the previous entry had binary markers but no current entry existed yet, those markers would be ignored.
- [low] In format_terminal, the verbose mode skip logic only checks if all semantic changes are orphans, but doesn't account for binary_indices. If a file has only binary changes and no semantic changes, it will be skipped even in verbose mode because indices.is_empty() would be true and the all() check would vacuously succeed.
Reviewed by inspect | Entity-level triage found 0 high-risk changes
e92a24e to
2b42a63
Compare
There was a problem hiding this comment.
inspect review
Triage: 42 entities analyzed | 0 critical, 0 high, 24 medium, 18 low
Verdict: standard_review
Findings (5)
- [low] In parse_unified_diff, the is_binary field is initialized to false but never set to true for entries that don't match the binary detection patterns. When a diff entry has no 'Binary files' or 'GIT binary patch' line, is_binary remains false even if the file is actually binary, causing binary files to be incorrectly processed as text files.
- [low] In parse_unified_diff, when has_binary_content is set to true after reading a file from disk, the before_content is not checked or cleared. If before_content was successfully read from git but after_content is binary, the FileChange will have mismatched content (text before, None after) which violates the binary detection logic in lacks_diffable_content.
- [low] In lacks_diffable_content, a FileStatus::Renamed file is considered binary only if BOTH before_content and after_content are None. However, a renamed binary file should be detected as binary if EITHER side is None, since renames can involve content changes and one side being binary makes the whole change binary.
- [low] In parse_unified_diff, when has_binary_content is set to true from e.is_binary, the code still attempts to read file content with read_file_compare_content when after_content is None and new_sha is Some. This can overwrite has_binary_content back to false if the file read succeeds as text, incorrectly treating a binary file as text.
- [low] In run_diff_pipeline, when opts.patch is true and two files are compared, the code creates a FileChange with before_content and after_content set to the result of read_file_compare_content (which returns Option). However, the comment says 'None means this side is binary', but the code doesn't handle the case where both sides are None (both binary), which would incorrectly report 'No semantic changes detected' instead of reporting a binary change.
Reviewed by inspect | Entity-level triage found 0 high-risk changes
2b42a63 to
eab4c20
Compare
There was a problem hiding this comment.
inspect review
Triage: 43 entities analyzed | 0 critical, 0 high, 27 medium, 16 low
Verdict: standard_review
Findings (5)
- [low] In parse_unified_diff, binary files are not properly handled when git_show fails. The code sets has_binary_content=true only when read_file_compare_content returns Ok(None), but if read_file_compare_content returns Err(_), has_binary_content remains false. This causes the subsequent check 'if !has_binary_content && before_content.is_none() && after_content.is_none()' to incorrectly warn about missing content for binary files that failed to read.
- [low] In parse_unified_diff, the is_binary flag is only set when parsing 'Binary files' or 'GIT binary patch' lines, but binary detection also happens in read_file_compare_content. If a file is binary but the patch doesn't contain these markers (e.g., for unstaged changes), the file won't be marked as binary in the PatchEntry, leading to inconsistent binary detection between before_content and after_content.
- [low] In parse_unified_diff, the is_binary field is initialized to false but never set to true for entries that don't match the binary detection patterns. The code checks 'Binary files ' and 'GIT binary patch' but git diff can also output 'diff --git' followed by 'index' lines with mode changes for binary files without these markers, causing binary files to be treated as text.
- [low] In parse_unified_diff, when reading working file content as fallback, the code checks has_binary_content flag but this flag is only set AFTER attempting to read the file. If read_file_compare_content returns Ok(None) indicating binary, has_binary_content is set to true, but the before_content check at the end uses the OLD value of has_binary_content (false), causing the warning to be printed for binary files.
- [low] In diff_command for file comparison mode, content_a and content_b are Option from read_file_compare_content, but the code assigns them directly to before_content and after_content fields. However, the comment says 'None means this side is binary' but there's no actual handling to report these as binary changes - they'll just be filtered out silently by lacks_diffable_content without being added to binary_changes.
Reviewed by inspect | Entity-level triage found 0 high-risk changes
|
Hey @Iron-Ham the other PRs have been merged into main. This one has too many conflicts to rebase cleanly (13 conflict regions across 4 files). Could you rebase this onto main when you get a chance? Main has changed significantly with the scope resolution tightening, patch parsing improvements, and the other PRs landing. Thanks! |
eab4c20 to
74d2d4c
Compare
There was a problem hiding this comment.
inspect review
Triage: 48 entities analyzed | 0 critical, 0 high, 32 medium, 16 low
Verdict: standard_review
Findings (6)
- [low] In
diff_command, when comparing two files with--file-compare, if either file is binary (content_a or content_b is None), a FileChange is created with statusFileStatus::Modified. However, this is incorrect for new files or deleted files - it should useFileStatus::Addedif content_a is None, orFileStatus::Deletedif content_b is None. - [low] In
lacks_diffable_content, the logic forFileStatus::Renamedis incorrect. A renamed file can have both before_content and after_content present (if it's a text file that was renamed), so checkingfile.before_content.is_none() || file.after_content.is_none()will incorrectly classify renamed text files as binary. - [low] In
format_terminal, the verbose mode skip logic now checksbinary_indices.is_empty()but doesn't check if all binary changes are orphans. This creates an inconsistency where files with only binary changes are never skipped in non-verbose mode, unlike files with only orphan semantic changes. - [low] In
diff_command, when comparing two files directly (non-patch mode), if either file is binary (content_a or content_b is None), a FileChange is created with status FileStatus::Modified. However, this is incorrect for the case where we're comparing two completely different files (not a git diff). The status should reflect whether it's an add, delete, or modify based on the actual file states, not hardcoded as Modified. - [low] In
format_terminal, the verbose mode skip logic only checks if all semantic changes are orphans, but doesn't account for binary_indices. If a file has only binary changes and no semantic changes, it will be skipped even in verbose mode because the conditionbinary_indices.is_empty()is checked with AND, meaning files with ONLY binary changes will still be skipped. - [low] In
read_file_compare_content, the function returnsResult<Option<String>, std::io::Error>where None indicates binary content. However, indiff_commandwhen calling this function for file comparison, if the read fails with an IO error, it's silently ignored withErr(_) => {}, which means a legitimate IO error (like permission denied) is treated the same as a binary file, potentially causing incorrect behavior.
Reviewed by inspect | Entity-level triage found 0 high-risk changes
74d2d4c to
01519ba
Compare
There was a problem hiding this comment.
inspect review
Triage: 51 entities analyzed | 0 critical, 0 high, 32 medium, 19 low
Verdict: standard_review
Findings (5)
- [low] In
diff_command, file comparison for binary files creates a FileChange with both content fields as None, but the status is hardcoded toFileStatus::Modified. This is incorrect for added/deleted binary files - if content_a is None but content_b exists, it should be Added; if content_b is None but content_a exists, it should be Deleted. - [low] In
lacks_diffable_content, the logic for FileStatus::Renamed is wrong. A renamed file can have both before_content and after_content present (text file renamed), so checkingfile.before_content.is_none() || file.after_content.is_none()will incorrectly classify text file renames as binary changes. - [low] In
format_terminal, the verbose mode skip logic now checksbinary_indices.is_empty()but this creates inconsistent behavior: a file with only binary changes will be shown even in non-verbose mode, while a file with only orphan changes will be skipped. This breaks the original intent where orphan-only files are hidden in non-verbose mode. - [low] In
format_terminal, the verbose mode skip logic checks ifbinary_indices.is_empty()but doesn't check if all binary changes are orphans. This creates an asymmetry where files with only binary changes are always shown even in non-verbose mode, while files with only orphan semantic changes are hidden. The logic should be consistent. - [low] In
lacks_diffable_content, the function returns true for Modified/Renamed files when either before_content OR after_content is None. However, this is incorrect for the case where a text file is modified to become binary (or vice versa) - one side would have content and the other wouldn't, but this should still be treated as a binary change. The logic should use AND instead of OR for Modified/Renamed status.
Reviewed by inspect | Entity-level triage found 0 high-risk changes
Summary
Closes #181
Test plan
cargo test --manifest-path crates/Cargo.toml --workspacesem diff --format jsonreportssummary.binary: 1,summary.fileCount: 1, and abinaryChangesentrygit diff --no-ext-diff --check