Report binary file changes in sem diff #237

Open
Iron-Ham wants to merge 1 commit into Ataraxy-Labs:main from Iron-Ham:Iron-Ham/fix-diff-binary-files

Conversation

@Iron-Ham
Contributor

Summary

  • detect binary files in git-backed, patch, and direct file-compare diff inputs
  • carry binary file changes through diff summaries and CLI/MCP output
  • add regression coverage for working, staged, file-compare, and JSON binary reporting

Closes #181

Test plan

  • cargo test --manifest-path crates/Cargo.toml --workspace
  • Reproduced issue #181 (binary files silently skipped) in a dummy git repository and verified that sem diff --format json reports summary.binary: 1, summary.fileCount: 1, and a binaryChanges entry
  • Verified direct file comparison of invalid UTF-8 binary files reports a binary change instead of failing to read text
  • git diff --no-ext-diff --check

@inspect-review inspect-review Bot left a comment

inspect review

Triage: 57 entities analyzed | 0 critical, 0 high, 37 medium, 20 low
Verdict: standard_review

Findings (2)

  1. [low] In crates/sem-cli/src/formatters/terminal.rs, the format_terminal function skips files where all changes are orphans in non-verbose mode, but the condition only checks indices (semantic changes) and ignores binary_indices. If a file has only binary changes and all semantic changes are orphans, it will be incorrectly skipped even though binary changes should be shown.
  2. [low] In crates/sem-cli/src/commands/diff.rs, the read_file_compare_content function returns (Option<String>, bool) where the bool indicates if the file is binary. When is_binary is true, it returns (None, true). However, in the file comparison logic, both content_a and content_b are Option<String>, and the code later does before_content: if is_binary { None } else { content_a }. If content_a is already None when is_binary is true, this works, but if one file is binary and the other is text, the logic let is_binary = a_is_binary || b_is_binary; will mark both as binary and discard the text content from the non-binary file.

Reviewed by inspect | Entity-level triage found 0 high-risk changes
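
Finding 2 above is easier to follow with a small sketch. The tuple shape `(Option<String>, bool)` and the `a_is_binary || b_is_binary` rule are taken from the finding; the helper name `combine_sides` and the `Side` alias are hypothetical, and this is not the PR's actual code.

```rust
/// One side of a file comparison: (text content if readable, is_binary flag).
type Side = (Option<String>, bool);

/// Hypothetical helper mirroring the logic described in the finding:
/// if either side is binary, both contents are dropped.
fn combine_sides(a: Side, b: Side) -> (Option<String>, Option<String>, bool) {
    let (content_a, a_is_binary) = a;
    let (content_b, b_is_binary) = b;
    let is_binary = a_is_binary || b_is_binary;
    let before = if is_binary { None } else { content_a };
    let after = if is_binary { None } else { content_b };
    (before, after, is_binary)
}
```

Under this rule a text-versus-binary comparison discards the text side as well. Whether that is a defect depends on whether a mixed comparison is meant to be reported as a single binary change.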

@Iron-Ham Iron-Ham force-pushed the Iron-Ham/fix-diff-binary-files branch from aa60aeb to 8909c8d Compare May 26, 2026 20:09

@inspect-review inspect-review Bot left a comment

inspect review

Triage: 41 entities analyzed | 0 critical, 0 high, 24 medium, 17 low
Verdict: standard_review

Findings (7)

  1. [low] In format_markdown, binary changes are iterated but the loop variable idx is used to index into binary_changes, yet idx comes from enumerating binary_changes.iter(). This means idx is the enumeration index (0, 1, 2...), which correctly indexes binary_changes[idx]. However, the pattern is inconsistent with how semantic changes are handled (where indices are stored in a Vec and dereferenced). This is actually correct, but the inconsistency could lead to bugs if copy-pasted.
  2. [low] In diff_command, when comparing two files directly, the code checks is_binary = a_is_binary || b_is_binary and then sets before_content and after_content to None if binary. However, if only one file is binary (e.g., comparing a text file to a binary file), both contents are set to None, which may not correctly represent the change. The logic should handle the case where only one side is binary.
  3. [low] In parse_unified_diff, the code adds is_binary: false to new PatchEntry instances, but only sets it to true when encountering specific binary markers. However, if a file is binary but the markers are not detected (e.g., malformed diff), the entry will incorrectly have is_binary: false, leading to attempts to read content that doesn't exist.
  4. [low] In lacks_diffable_content, for FileStatus::Renamed, the function returns true if either before_content or after_content is None. However, a renamed text file should have both contents present. If only one is missing, this might indicate a bug elsewhere, but treating it as binary could hide the real issue. The logic assumes that missing content always means binary, which may not be correct for renames.
  5. [low] In read_file_compare_content, if String::from_utf8(bytes) fails but is_binary is false (because bytes_look_binary returned false), the function returns Ok((None, false)). This means a non-binary file that fails UTF-8 conversion will have None content but is_binary=false, which is inconsistent and could cause the file to be skipped incorrectly.
  6. [low] In format_markdown, binary changes are iterated but never filtered by verbose mode, unlike semantic changes. This means binary changes in files with only orphan semantic changes will still be shown in non-verbose mode, creating inconsistent behavior.
  7. [low] In format_terminal, the verbose mode filtering logic only checks if all semantic changes are orphans, but doesn't account for binary_indices. A file with only binary changes and orphan semantic changes will be skipped entirely in non-verbose mode, hiding the binary changes.

Reviewed by inspect | Entity-level triage found 0 high-risk changes
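
The inconsistent state described in finding 5 (no content, yet `is_binary` false) can be reproduced in a few lines. This is a sketch under assumptions: the NUL-byte heuristic stands in for the real `bytes_look_binary`, whose rules are not shown in this thread, and `classify` / `classify_strict` are hypothetical names.

```rust
/// Assumed heuristic: a NUL byte marks the data as binary.
/// (The real `bytes_look_binary` may use different rules.)
fn looks_binary(bytes: &[u8]) -> bool {
    bytes.contains(&0)
}

/// Returns (content, is_binary), following the shape described in the finding.
fn classify(bytes: Vec<u8>) -> (Option<String>, bool) {
    if looks_binary(&bytes) {
        return (None, true);
    }
    // Invalid UTF-8 without a NUL byte lands here: no content, yet not binary.
    (String::from_utf8(bytes).ok(), false)
}

/// One possible fix (an assumption, not the PR's code): treat any
/// undecodable input as binary so the two fields never disagree.
fn classify_strict(bytes: Vec<u8>) -> (Option<String>, bool) {
    if looks_binary(&bytes) {
        return (None, true);
    }
    match String::from_utf8(bytes) {
        Ok(text) => (Some(text), false),
        Err(_) => (None, true),
    }
}
```

A Latin-1 encoded file is the typical input that passes a NUL check but fails UTF-8 decoding.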

@Iron-Ham Iron-Ham marked this pull request as ready for review May 26, 2026 20:13
@Iron-Ham Iron-Ham marked this pull request as draft May 26, 2026 20:17
@Iron-Ham Iron-Ham force-pushed the Iron-Ham/fix-diff-binary-files branch from 8909c8d to 32089bf Compare May 26, 2026 20:22

@inspect-review inspect-review Bot left a comment

inspect review

Triage: 42 entities analyzed | 0 critical, 0 high, 24 medium, 18 low
Verdict: standard_review

Findings (5)

  1. [low] In parse_unified_diff, the function attempts to read working files for unstaged changes but doesn't handle binary files correctly. When e.is_binary is true, it still tries to read the file as text with std::fs::read_to_string, which will fail for binary files. The binary check should be applied before the fallback read.
  2. [low] In diff_command, when comparing two files with --file-compare, the code calls read_file_compare_content which returns Result<Option<String>, std::io::Error>. However, the error handling with unwrap_or_else doesn't distinguish between I/O errors and binary files (which return Ok(None)). If a file is binary, it will be treated as having None content, but the subsequent code doesn't check if both sides are None before creating a FileChange, potentially creating invalid comparisons.
  3. [low] In parse_unified_diff, the function attempts to read working files with std::fs::read_to_string for non-binary files, but this will fail for binary files that weren't detected by the is_binary flag. The code should use read_file_compare_content instead to handle binary detection consistently.
  4. [low] In diff_command for file comparison mode, the code now returns Option<String> from read_file_compare_content but assigns it directly to before_content and after_content fields which expect Option<String>. However, if a binary file is detected (returns None), the FileChange will have both contents as None, but the code doesn't handle this case - it should either report an error or create a binary change entry.
  5. [low] In format_terminal, the verbose mode skip logic now checks binary_indices.is_empty() but this creates asymmetric behavior: files with only binary changes will be shown even in non-verbose mode, while files with only orphan changes are hidden. This inconsistency could confuse users.

Reviewed by inspect | Entity-level triage found 0 high-risk changes

@Iron-Ham Iron-Ham force-pushed the Iron-Ham/fix-diff-binary-files branch from 32089bf to cffb93c Compare May 26, 2026 20:27

@inspect-review inspect-review Bot left a comment

inspect review

Triage: 42 entities analyzed | 0 critical, 0 high, 24 medium, 18 low
Verdict: standard_review

Findings (7)

  1. [low] In parse_unified_diff, the function attempts to read file content for binary files when e.is_binary is true, but then checks !e.is_binary before reading. This logic is inverted - it should skip reading for binary files but the condition allows reading when e.is_binary is false, which is correct. However, the warning message at line 349 will trigger for binary files even though that's expected behavior, causing spurious warnings.
  2. [low] In diff_command at line 465-467, when comparing two files directly (not from git), if either file is binary (returns None from read_file_compare_content), the code creates a FileChange with before_content: content_a and after_content: content_b where these could be None. The comment says 'None means this side is binary', but the old code expected Some(String). This could break downstream code that expects both sides to be Some for Modified status.
  3. [low] In lacks_diffable_content function in differ.rs, for FileStatus::Renamed, it checks if either before_content OR after_content is None. However, a renamed file should have both sides present (just with different paths). If only one side is None, it's likely a bug in content population, not a binary file. This could incorrectly classify partially-populated renamed files as binary.
  4. [low] In bytes_look_binary function in diff.rs (line 134-143), the function checks for UTF-8 validity but has inconsistent behavior: when complete is false, it returns false for partial UTF-8 sequences (error_len().is_none()), but when complete is true, it returns true for the same partial sequences. This means a file scanned in chunks could be classified differently than when read completely, leading to inconsistent binary detection.
  5. [low] In parse_unified_diff, the function reads file content from disk for unstaged changes but doesn't handle the case where read_file_compare_content returns Ok(None) for binary files. This will cause after_content to be None even when e.is_binary is false, triggering the warning message incorrectly.
  6. [low] In diff_command for file comparison mode, when read_file_compare_content returns Ok(None) for a binary file, the code creates a FileChange with before_content: None and after_content: None. However, the comment says 'None means this side is binary', but there's no way to distinguish between a binary file and a read error, which could lead to incorrect binary change reporting.
  7. [low] In format_terminal, the verbose mode skip logic now checks binary_indices.is_empty() but this creates inconsistent behavior: a file with only binary changes will be shown even in non-verbose mode, while a file with only orphan changes will be skipped. This breaks the intended filtering.

Reviewed by inspect | Entity-level triage found 0 high-risk changes
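
The `complete` flag in finding 4 can be read as a deliberate choice rather than an inconsistency. The sketch below is one plausible shape for such a check, not the PR's implementation: it relies only on `std::str::from_utf8` and `Utf8Error::error_len`, which returns `None` when the input ends in the middle of a multi-byte sequence.

```rust
/// Sketch of a UTF-8 based binary check for a buffer that may be a prefix
/// of a larger file. `complete` says whether `bytes` is the whole file.
fn bytes_look_binary(bytes: &[u8], complete: bool) -> bool {
    match std::str::from_utf8(bytes) {
        Ok(_) => false,
        Err(err) => match err.error_len() {
            // A definite invalid sequence: binary regardless of `complete`.
            Some(_) => true,
            // Input ended mid-character. In a partial chunk the rest of the
            // character may follow, so only a complete file counts as binary.
            None => complete,
        },
    }
}
```

With this reading, a chunk boundary that splits a character such as `é` (bytes `0xC3 0xA9`) does not flag a text file as binary, while a file that genuinely ends with a dangling lead byte still does.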

@Iron-Ham Iron-Ham force-pushed the Iron-Ham/fix-diff-binary-files branch from cffb93c to 37892d1 Compare May 26, 2026 20:28

@inspect-review inspect-review Bot left a comment

inspect review

Triage: 42 entities analyzed | 0 critical, 0 high, 24 medium, 18 low
Verdict: standard_review

Findings (6)

  1. [low] In parse_unified_diff, the is_binary field is initialized to false but never set to true when creating a new PatchEntry. The binary detection logic only updates existing entries, so if a binary file is the first file in the diff, it won't be marked as binary.
  2. [low] In parse_unified_diff, the binary detection happens inside else if let Some(ref mut entry) = current, but current is only set after processing the diff header. If binary markers appear before the entry is added to entries, they won't be detected.
  3. [low] In parse_unified_diff, after pushing a new PatchEntry to entries, the code never updates current to point to the newly added entry. This means current remains None and binary detection in subsequent lines won't work.
  4. [low] In parse_unified_diff, the binary detection checks line.starts_with("Binary files ") but this will only match if there's a space after 'files'. Git's actual output is 'Binary files a/... and b/... differ' which would match, but the alternative check line == "GIT binary patch" is too strict - it should check starts_with instead of exact equality since there might be trailing content.
  5. [low] In diff_command for file comparison mode, when read_file_compare_content returns Ok(None) (binary file), the code creates a FileChange with before_content: content_a and after_content: content_b where both are Option<String>. If either side is binary (None), this will create a FileChange with one side None, but the comment says 'None means this side is binary' - however, the actual assignment doesn't handle the case where only one side is binary correctly for the semantic diff pipeline.
  6. [low] In lacks_diffable_content, the logic for FileStatus::Renamed checks if either before_content or after_content is None, but a renamed file should have both sides present (just with different paths). A rename with missing content on one side likely indicates a binary file, but the function returns true if EITHER is None, when it should probably require BOTH to be None to be considered truly lacking diffable content, or the rename case should be handled separately.

Reviewed by inspect | Entity-level triage found 0 high-risk changes
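
Findings 1 to 4 all concern the order in which `current` is set and the binary markers are checked. A minimal sketch of that control flow, assuming a simplified `PatchEntry` with only `path` and `is_binary` (the real struct and parser are not shown in this thread), looks like this. The two marker checks are the ones quoted in finding 4; the function name `parse_binary_flags` is hypothetical.

```rust
#[derive(Debug, PartialEq)]
struct PatchEntry {
    path: String,
    is_binary: bool,
}

/// Walks a unified diff and records, per file, whether a binary marker was seen.
fn parse_binary_flags(patch: &str) -> Vec<PatchEntry> {
    let mut entries = Vec::new();
    let mut current: Option<PatchEntry> = None;
    for line in patch.lines() {
        if let Some(rest) = line.strip_prefix("diff --git ") {
            // A new header flushes the previous entry and opens a fresh one.
            if let Some(done) = current.take() {
                entries.push(done);
            }
            // Simplification: take the text after the last " b/" as the path.
            let path = rest.rsplit_once(" b/").map(|(_, p)| p).unwrap_or(rest);
            current = Some(PatchEntry { path: path.to_string(), is_binary: false });
        } else if let Some(entry) = current.as_mut() {
            // Markers always follow the header, so `current` is set by now.
            if line.starts_with("Binary files ") || line == "GIT binary patch" {
                entry.is_binary = true;
            }
        }
    }
    if let Some(done) = current.take() {
        entries.push(done);
    }
    entries
}
```

Because the entry is opened at the `diff --git` header and only pushed when the next header (or end of input) arrives, a binary file that appears first in the patch is still marked correctly in this arrangement.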

@Iron-Ham Iron-Ham force-pushed the Iron-Ham/fix-diff-binary-files branch from 37892d1 to e92a24e Compare May 26, 2026 20:34

@inspect-review inspect-review Bot left a comment

inspect review

Triage: 42 entities analyzed | 0 critical, 0 high, 24 medium, 18 low
Verdict: standard_review

Findings (4)

  1. [low] In parse_unified_diff, the is_binary field is initialized to false but never set to true when creating a new PatchEntry. The binary detection logic checks line.starts_with("Binary files ") or line.starts_with("GIT binary patch") and sets entry.is_binary = true, but this only works if current is Some. When a new entry is created, is_binary defaults to false, so if the binary marker appears before any other recognized line for that entry, it won't be detected.
  2. [low] In diff_command, when comparing two files directly (the else if a.is_some() && b.is_some() branch), the code creates a FileChange with before_content: content_a and after_content: content_b where these are Option<String>. However, read_file_compare_content returns Result<Option<String>, std::io::Error>, so content_a and content_b are already Option<String>. The comment says 'None means this side is binary', but the code doesn't handle the case where one side is binary and the other is text - it will just pass None for the binary side, which is correct. However, the error handling with unwrap_or_else will exit on IO errors, but won't distinguish between IO errors and binary files, potentially causing confusing error messages.
  3. [low] In parse_unified_diff, the is_binary field is never set to true for entries that don't have a current entry. When a new diff block starts with 'diff --git', the previous entry is flushed and a new entry is created with is_binary: false. If the previous entry had binary markers but no current entry existed yet, those markers would be ignored.
  4. [low] In format_terminal, the verbose mode skip logic only checks if all semantic changes are orphans, but doesn't account for binary_indices. If a file has only binary changes and no semantic changes, it will be skipped even in verbose mode because indices.is_empty() would be true and the all() check would vacuously succeed.

Reviewed by inspect | Entity-level triage found 0 high-risk changes
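
The "vacuously succeed" point in finding 4 rests on a property of `Iterator::all`: it returns `true` for an empty iterator. The sketch below isolates that behavior. Both function names and the `orphans` slice are hypothetical stand-ins for the formatter's state, and the skip is assumed to apply only in non-verbose mode, as the earlier findings describe.

```rust
/// `orphans[i]` is true when the i-th semantic change in a file is an orphan.
/// Without a binary check, a file with no semantic changes is skipped,
/// because `all` on an empty slice is true.
fn skip_ignoring_binary(orphans: &[bool], verbose: bool) -> bool {
    !verbose && orphans.iter().all(|&is_orphan| is_orphan)
}

/// Same rule, but a file that carries any binary change is never skipped.
fn skip_with_binary_guard(orphans: &[bool], binary_count: usize, verbose: bool) -> bool {
    !verbose && binary_count == 0 && orphans.iter().all(|&is_orphan| is_orphan)
}
```

With the guard, a binary-only file is shown in non-verbose mode while an orphan-only file stays hidden, which is the asymmetry the later reviews comment on.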

@Iron-Ham Iron-Ham marked this pull request as ready for review May 26, 2026 21:18
@Iron-Ham Iron-Ham force-pushed the Iron-Ham/fix-diff-binary-files branch from e92a24e to 2b42a63 Compare May 28, 2026 16:59

@inspect-review inspect-review Bot left a comment

inspect review

Triage: 42 entities analyzed | 0 critical, 0 high, 24 medium, 18 low
Verdict: standard_review

Findings (5)

  1. [low] In parse_unified_diff, the is_binary field is initialized to false but never set to true for entries that don't match the binary detection patterns. When a diff entry has no 'Binary files' or 'GIT binary patch' line, is_binary remains false even if the file is actually binary, causing binary files to be incorrectly processed as text files.
  2. [low] In parse_unified_diff, when has_binary_content is set to true after reading a file from disk, the before_content is not checked or cleared. If before_content was successfully read from git but after_content is binary, the FileChange will have mismatched content (text before, None after) which violates the binary detection logic in lacks_diffable_content.
  3. [low] In lacks_diffable_content, a FileStatus::Renamed file is considered binary only if BOTH before_content and after_content are None. However, a renamed binary file should be detected as binary if EITHER side is None, since renames can involve content changes and one side being binary makes the whole change binary.
  4. [low] In parse_unified_diff, when has_binary_content is set to true from e.is_binary, the code still attempts to read file content with read_file_compare_content when after_content is None and new_sha is Some. This can overwrite has_binary_content back to false if the file read succeeds as text, incorrectly treating a binary file as text.
  5. [low] In run_diff_pipeline, when opts.patch is true and two files are compared, the code creates a FileChange with before_content and after_content set to the result of read_file_compare_content (which returns Option). However, the comment says 'None means this side is binary', but the code doesn't handle the case where both sides are None (both binary), which would incorrectly report 'No semantic changes detected' instead of reporting a binary change.

Reviewed by inspect | Entity-level triage found 0 high-risk changes

@Iron-Ham Iron-Ham force-pushed the Iron-Ham/fix-diff-binary-files branch from 2b42a63 to eab4c20 Compare May 29, 2026 22:38

@inspect-review inspect-review Bot left a comment

inspect review

Triage: 43 entities analyzed | 0 critical, 0 high, 27 medium, 16 low
Verdict: standard_review

Findings (5)

  1. [low] In parse_unified_diff, binary files are not properly handled when git_show fails. The code sets has_binary_content=true only when read_file_compare_content returns Ok(None), but if read_file_compare_content returns Err(_), has_binary_content remains false. This causes the subsequent check 'if !has_binary_content && before_content.is_none() && after_content.is_none()' to incorrectly warn about missing content for binary files that failed to read.
  2. [low] In parse_unified_diff, the is_binary flag is only set when parsing 'Binary files' or 'GIT binary patch' lines, but binary detection also happens in read_file_compare_content. If a file is binary but the patch doesn't contain these markers (e.g., for unstaged changes), the file won't be marked as binary in the PatchEntry, leading to inconsistent binary detection between before_content and after_content.
  3. [low] In parse_unified_diff, the is_binary field is initialized to false but never set to true for entries that don't match the binary detection patterns. The code checks 'Binary files ' and 'GIT binary patch' but git diff can also output 'diff --git' followed by 'index' lines with mode changes for binary files without these markers, causing binary files to be treated as text.
  4. [low] In parse_unified_diff, when reading working file content as fallback, the code checks has_binary_content flag but this flag is only set AFTER attempting to read the file. If read_file_compare_content returns Ok(None) indicating binary, has_binary_content is set to true, but the before_content check at the end uses the OLD value of has_binary_content (false), causing the warning to be printed for binary files.
  5. [low] In diff_command for file comparison mode, content_a and content_b are Option from read_file_compare_content, but the code assigns them directly to before_content and after_content fields. However, the comment says 'None means this side is binary' but there's no actual handling to report these as binary changes - they'll just be filtered out silently by lacks_diffable_content without being added to binary_changes.

Reviewed by inspect | Entity-level triage found 0 high-risk changes

@rs545837
Member

rs545837 commented May 31, 2026

Hey @Iron-Ham the other PRs have been merged into main. This one has too many conflicts to rebase cleanly (13 conflict regions across 4 files). Could you rebase this onto main when you get a chance? Main has changed significantly with the scope resolution tightening, patch parsing improvements, and the other PRs landing. Thanks!

@Iron-Ham Iron-Ham force-pushed the Iron-Ham/fix-diff-binary-files branch from eab4c20 to 74d2d4c Compare June 1, 2026 14:23

@inspect-review inspect-review Bot left a comment

inspect review

Triage: 48 entities analyzed | 0 critical, 0 high, 32 medium, 16 low
Verdict: standard_review

Findings (6)

  1. [low] In diff_command, when comparing two files with --file-compare, if either file is binary (content_a or content_b is None), a FileChange is created with status FileStatus::Modified. However, this is incorrect for new files or deleted files - it should use FileStatus::Added if content_a is None, or FileStatus::Deleted if content_b is None.
  2. [low] In lacks_diffable_content, the logic for FileStatus::Renamed is incorrect. A renamed file can have both before_content and after_content present (if it's a text file that was renamed), so checking file.before_content.is_none() || file.after_content.is_none() will incorrectly classify renamed text files as binary.
  3. [low] In format_terminal, the verbose mode skip logic now checks binary_indices.is_empty() but doesn't check if all binary changes are orphans. This creates an inconsistency where files with only binary changes are never skipped in non-verbose mode, unlike files with only orphan semantic changes.
  4. [low] In diff_command, when comparing two files directly (non-patch mode), if either file is binary (content_a or content_b is None), a FileChange is created with status FileStatus::Modified. However, this is incorrect for the case where we're comparing two completely different files (not a git diff). The status should reflect whether it's an add, delete, or modify based on the actual file states, not hardcoded as Modified.
  5. [low] In format_terminal, the verbose mode skip logic only checks if all semantic changes are orphans, but doesn't account for binary_indices. If a file has only binary changes and no semantic changes, it will be skipped even in verbose mode because the condition binary_indices.is_empty() is checked with AND, meaning files with ONLY binary changes will still be skipped.
  6. [low] In read_file_compare_content, the function returns Result<Option<String>, std::io::Error> where None indicates binary content. However, in diff_command when calling this function for file comparison, if the read fails with an IO error, it's silently ignored with Err(_) => {}, which means a legitimate IO error (like permission denied) is treated the same as a binary file, potentially causing incorrect behavior.

Reviewed by inspect | Entity-level triage found 0 high-risk changes
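
Finding 6 turns on keeping three outcomes apart: readable text, binary content, and a failed read. A return type of `Result<Option<String>, std::io::Error>` can express all three, provided the caller matches on each. The sketch below assumes "binary" simply means "not valid UTF-8"; `read_side` and `describe` are hypothetical names, not functions from the PR.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Ok(Some(text)) for UTF-8 text, Ok(None) for binary, Err for I/O failures.
/// Assumption: "binary" here simply means "not valid UTF-8".
fn read_side(path: &Path) -> io::Result<Option<String>> {
    let bytes = fs::read(path)?;
    Ok(String::from_utf8(bytes).ok())
}

/// Hypothetical caller that keeps the three outcomes apart instead of
/// folding errors into the binary case.
fn describe(path: &Path) -> String {
    match read_side(path) {
        Ok(Some(text)) => format!("text ({} bytes)", text.len()),
        Ok(None) => "binary".to_string(),
        Err(err) => format!("read error: {:?}", err.kind()),
    }
}
```

An `Err(_) => {}` arm, as quoted in the finding, would collapse the third case into silence; surfacing `err.kind()` keeps a permission or not-found problem visible.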

@Iron-Ham Iron-Ham force-pushed the Iron-Ham/fix-diff-binary-files branch from 74d2d4c to 01519ba Compare June 1, 2026 20:53

@inspect-review inspect-review Bot left a comment

inspect review

Triage: 51 entities analyzed | 0 critical, 0 high, 32 medium, 19 low
Verdict: standard_review

Findings (5)

  1. [low] In diff_command, file comparison for binary files creates a FileChange with both content fields as None, but the status is hardcoded to FileStatus::Modified. This is incorrect for added/deleted binary files - if content_a is None but content_b exists, it should be Added; if content_b is None but content_a exists, it should be Deleted.
  2. [low] In lacks_diffable_content, the logic for FileStatus::Renamed is wrong. A renamed file can have both before_content and after_content present (text file renamed), so checking file.before_content.is_none() || file.after_content.is_none() will incorrectly classify text file renames as binary changes.
  3. [low] In format_terminal, the verbose mode skip logic now checks binary_indices.is_empty() but this creates inconsistent behavior: a file with only binary changes will be shown even in non-verbose mode, while a file with only orphan changes will be skipped. This breaks the original intent where orphan-only files are hidden in non-verbose mode.
  4. [low] In format_terminal, the verbose mode skip logic checks if binary_indices.is_empty() but doesn't check if all binary changes are orphans. This creates an asymmetry where files with only binary changes are always shown even in non-verbose mode, while files with only orphan semantic changes are hidden. The logic should be consistent.
  5. [low] In lacks_diffable_content, the function returns true for Modified/Renamed files when either before_content OR after_content is None. However, this is incorrect for the case where a text file is modified to become binary (or vice versa) - one side would have content and the other wouldn't, but this should still be treated as a binary change. The logic should use AND instead of OR for Modified/Renamed status.

Reviewed by inspect | Entity-level triage found 0 high-risk changes
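
The reviews disagree on whether `lacks_diffable_content` should use OR or AND for Modified and Renamed files. The sketch below puts the two policies side by side so their effect on concrete cases is visible. The `FileChange` fields and `FileStatus` variants follow the names used in the findings; the Added and Deleted arms and both function names are assumptions, not the PR's code.

```rust
#[derive(Clone, Copy)]
enum FileStatus {
    Added,
    Deleted,
    Modified,
    Renamed,
}

struct FileChange {
    status: FileStatus,
    before_content: Option<String>,
    after_content: Option<String>,
}

/// "Either side missing" policy for Modified/Renamed.
fn lacks_diffable_content_any(file: &FileChange) -> bool {
    match file.status {
        FileStatus::Added => file.after_content.is_none(),
        FileStatus::Deleted => file.before_content.is_none(),
        FileStatus::Modified | FileStatus::Renamed => {
            file.before_content.is_none() || file.after_content.is_none()
        }
    }
}

/// "Both sides missing" policy for Modified/Renamed.
fn lacks_diffable_content_all(file: &FileChange) -> bool {
    match file.status {
        FileStatus::Added => file.after_content.is_none(),
        FileStatus::Deleted => file.before_content.is_none(),
        FileStatus::Modified | FileStatus::Renamed => {
            file.before_content.is_none() && file.after_content.is_none()
        }
    }
}
```

Under the OR policy a text file that becomes binary is reported as a binary change, whereas the AND policy would pass it on with one side empty. A renamed text file with both sides present is classified as diffable under either policy.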

Development

Successfully merging this pull request may close these issues.

binary files silently skipped

2 participants