| 1 | +--- |
| 2 | +id: FILE-006 |
| 3 | +title: Path Intersection & Smart Diff |
| 4 | +status: To Do |
| 5 | +assignee: jamiepine |
| 6 | +parent: FILE-000 |
| 7 | +priority: High |
| 8 | +tags: [files, operations, diff, copy, index, deduplication] |
| 9 | +last_updated: 2026-02-07 |
| 10 | +related_tasks: [INDEX-010, INDEX-011, FILE-001, FSYNC-003] |
| 11 | +--- |
| 12 | + |
| 13 | +## Description |
| 14 | + |
| 15 | +Implement a path intersection operation that compares two paths (potentially on different volumes or devices) and returns the set difference: files present at the source path that don't exist at the target path. The result also reports files that exist only at the target and files that differ between the two. This enables "smart copy": transferring only what's missing without re-copying files that already exist at the destination. |
| 16 | + |
| 17 | +Primary use case: an external drive with files, a NAS with a partial backup. Select both, see what's missing, copy the diff. |
| 18 | + |
| 19 | +## Problem |
| 20 | + |
| 21 | +- No way to compare two directories and find what's missing |
| 22 | +- Copying an entire folder to a NAS re-transfers files that are already there |
| 23 | +- Users resort to rsync or manual comparison for this workflow |
| 24 | +- The index has all the data needed for this comparison but no operation exposes it |
| 25 | + |
| 26 | +## Architecture |
| 27 | + |
| 28 | +### Three Matching Strategies |
| 29 | + |
| 30 | +**1. Heuristic Match (fast, default)** |
| 31 | +Compare by relative path + file size + modification time. No hashing required. Suitable for most backup/copy scenarios where files haven't been modified in place. |
| 32 | + |
| 33 | +**2. Content Match (exact)** |
| 34 | +Compare by BLAKE3 content hash (`content_id`). Catches files that have been renamed or moved. Requires Phase 4 content identification on both sides — either from the persistent index or computed on demand. |
| 35 | + |
| 36 | +**3. Hybrid Match** |
| 37 | +Heuristic first, then content-verify ambiguous cases (same relative path and size, different mtime). A size mismatch already proves the content differs, so those files need no hashing. Best balance of speed and accuracy. |
| 38 | + |
| 39 | +### Operation Flow |
| 40 | + |
| 41 | +``` |
| 42 | +Input: source_path (SdPath), target_path (SdPath), strategy |
| 43 | + ↓ |
| 44 | + Ensure both paths are indexed (ephemeral or persistent) |
| 45 | + Use complete scan if needed (INDEX-011) |
| 46 | + ↓ |
| 47 | + Build relative path maps for both directories |
| 48 | + ↓ |
| 49 | + Apply matching strategy: |
| 50 | + Heuristic → compare (rel_path, size, mtime) tuples |
| 51 | + Content → compare content_id hashes |
| 52 | + Hybrid → heuristic first, hash verify edge cases |
| 53 | + ↓ |
| 54 | + Output: PathDiffResult { |
| 55 | + only_in_source: Vec<DiffEntry>, // Missing from target |
| 56 | + only_in_target: Vec<DiffEntry>, // Extra at target |
| 57 | + modified: Vec<DiffEntry>, // Same path, different content |
| 58 | + matched_count: usize, // Count of identical files |
| 59 | + copy_size: u64, // Bytes needed to sync |
| 60 | + } |
| 61 | +``` |
| 62 | + |
| 63 | +### Integration with FileCopyJob |
| 64 | + |
| 65 | +The diff result feeds directly into a copy operation: |
| 66 | + |
| 67 | +```rust |
| 68 | +let diff = path_intersection(source_path, target_path.clone(), DiffStrategy::Heuristic).await?; |
| 69 | + |
| 70 | +// Copy only what's missing |
| 71 | +let copy_input = FileCopyInput { |
| 72 | + sources: SdPathBatch::from_paths( |
| 73 | + diff.only_in_source.iter().map(|e| e.sd_path.clone()) |
| 74 | + ), |
| 75 | + destination: target_path, |
| 76 | + overwrite: false, |
| 77 | + ..Default::default() |
| 78 | +}; |
| 79 | +``` |
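The snippet above leaves implicit where each missing file lands under the target. The following is a minimal sketch, assuming the copy should preserve each entry's relative path beneath the target root. `MissingFile` and `plan_destinations` are hypothetical names used for illustration; they are not part of the FileCopyJob API.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical, trimmed-down stand-in for `DiffEntry`.
pub struct MissingFile {
    pub relative_path: PathBuf,
    pub size: u64,
}

/// Pair every missing file with the destination that preserves its
/// relative location under `target_root`, and total the bytes to copy.
pub fn plan_destinations(
    target_root: &Path,
    missing: &[MissingFile],
) -> (Vec<(PathBuf, PathBuf)>, u64) {
    let mut total = 0u64;
    let pairs: Vec<(PathBuf, PathBuf)> = missing
        .iter()
        .map(|f| {
            total += f.size;
            (f.relative_path.clone(), target_root.join(&f.relative_path))
        })
        .collect();
    (pairs, total)
}
```

Whether the real copy job derives destinations this way or expects them in its input is left open by the task; the sketch only shows the path arithmetic.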
| 80 | + |
| 81 | +## Implementation Steps |
| 82 | + |
| 83 | +### 1. Define Core Types |
| 84 | + |
| 85 | +```rust |
| 86 | +// core/src/ops/files/diff/mod.rs |
| 87 | + |
| 88 | +pub struct PathDiffInput { |
| 89 | + pub source: SdPath, |
| 90 | + pub target: SdPath, |
| 91 | + pub strategy: DiffStrategy, |
| 92 | + pub use_index_rules: bool, // false = complete scan for full coverage |
| 93 | +} |
| 94 | + |
| 95 | +pub enum DiffStrategy { |
| 96 | + /// Compare by relative path + size + mtime. Fast, no hashing. |
| 97 | + Heuristic, |
| 98 | + /// Compare by BLAKE3 content hash. Catches renames/moves. |
| 99 | + Content, |
| 100 | + /// Heuristic first, content-verify ambiguous matches. |
| 101 | + Hybrid, |
| 102 | +} |
| 103 | +#[derive(Default)] // required by PathDiffResult::default() in the strategies |
| 104 | +pub struct PathDiffResult { |
| 105 | + /// Files at source that don't exist at target. |
| 106 | + pub only_in_source: Vec<DiffEntry>, |
| 107 | + /// Files at target that don't exist at source. |
| 108 | + pub only_in_target: Vec<DiffEntry>, |
| 109 | + /// Files at both paths but with different content. |
| 110 | + pub modified: Vec<DiffEntry>, |
| 111 | + /// Number of files that matched exactly. |
| 112 | + pub matched_count: usize, |
| 113 | + /// Total bytes that would need to be copied (only_in_source + modified). |
| 114 | + pub copy_size: u64, |
| 115 | + /// Total number of files scanned. |
| 116 | + pub total_scanned: usize, |
| 117 | +} |
| 118 | +#[derive(Clone)] // entries are cloned into the result vectors |
| 119 | +pub struct DiffEntry { |
| 120 | + pub relative_path: PathBuf, |
| 121 | + pub sd_path: SdPath, |
| 122 | + pub uuid: Option<Uuid>, |
| 123 | + pub size: u64, |
| 124 | + pub modified_at: DateTime<Utc>, |
| 125 | + pub content_id: Option<String>, // BLAKE3 hash if available |
| 126 | + pub kind: EntryKind, |
| 127 | +} |
| 128 | +``` |
| 129 | + |
| 130 | +### 2. Implement Index Acquisition |
| 131 | + |
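| 132 | +Before diffing, ensure both paths are indexed. Check the ephemeral cache first, then request a scan if needed. One caveat to verify during implementation: if the cache can hold a rules-filtered index for a path, that entry should not satisfy a request with `use_index_rules: false`, or filtered files would be silently left out of the diff. |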
| 133 | + |
| 134 | +```rust |
| 135 | +// core/src/ops/files/diff/resolver.rs |
| 136 | + |
| 137 | +async fn ensure_indexed( |
| 138 | + path: &SdPath, |
| 139 | + use_rules: bool, |
| 140 | + ctx: &ActionContext, |
| 141 | +) -> Result<()> { |
| 142 | + let cache = ctx.core_context().ephemeral_cache(); |
| 143 | + |
| 144 | + let abs_path = path.resolve(ctx)?; |
| 145 | + if cache.is_indexed(&abs_path).await { |
| 146 | + return Ok(()); |
| 147 | + } |
| 148 | + |
| 149 | + // Request index scan |
| 150 | + let config = if use_rules { |
| 151 | + IndexerJobConfig::ephemeral_browse(path.clone(), IndexScope::Recursive, false) |
| 152 | + } else { |
| 153 | + IndexerJobConfig::complete_scan(path.clone(), IndexScope::Recursive) |
| 154 | + }; |
| 155 | + |
| 156 | + let job_id = ctx.job_manager().submit(IndexerJob::new(config)).await?; |
| 157 | + ctx.job_manager().wait_for(job_id).await?; |
| 158 | + |
| 159 | + Ok(()) |
| 160 | +} |
| 161 | +``` |
| 162 | + |
| 163 | +### 3. Build Relative Path Maps |
| 164 | + |
| 165 | +Extract entries from the ephemeral index and build maps keyed by relative path for comparison. |
| 166 | + |
| 167 | +```rust |
| 168 | +async fn build_path_map( |
| 169 | + root: &Path, |
| 170 | + cache: &EphemeralIndexCache, |
| 171 | +) -> Result<HashMap<PathBuf, DiffEntry>> { |
| 172 | + let index = cache.get_for_path(root) |
| 173 | + .ok_or_else(|| anyhow!("Path not indexed: {}", root.display()))?; |
| 174 | + |
| 175 | + let index_read = index.read().await; |
| 176 | + let mut map = HashMap::new(); |
| 177 | + |
| 178 | + for (abs_path, _entry_id) in index_read.entries() { |
| 179 | + if let Ok(relative) = abs_path.strip_prefix(root) { |
| 180 | + let entry = build_diff_entry(&index_read, abs_path, relative)?; |
| 181 | + map.insert(relative.to_path_buf(), entry); |
| 182 | + } |
| 183 | + } |
| 184 | + |
| 185 | + Ok(map) |
| 186 | +} |
| 187 | +``` |
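The keying step can be tried in isolation with only the standard library. This sketch assumes the index yields absolute paths; `relative_keys` is a hypothetical helper, not part of the ephemeral index API. It also skips the root itself, since `strip_prefix` returns an empty path for it, and an empty key would otherwise show up as an entry in the diff.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Key absolute paths by their location relative to `root`.
/// Paths outside `root`, and `root` itself (which strips to an empty
/// path), are skipped.
pub fn relative_keys(root: &Path, abs_paths: &[PathBuf]) -> HashMap<PathBuf, PathBuf> {
    let mut map = HashMap::new();
    for abs in abs_paths {
        match abs.strip_prefix(root) {
            Ok(rel) if !rel.as_os_str().is_empty() => {
                map.insert(rel.to_path_buf(), abs.clone());
            }
            // Not under `root`, or the root directory itself.
            _ => {}
        }
    }
    map
}
```

Because `strip_prefix` works on whole path components, a sibling such as `/mnt/ext/PhotosExtra` is not treated as being under `/mnt/ext/Photos`.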
| 188 | + |
| 189 | +### 4. Implement Matching Strategies |
| 190 | + |
| 191 | +**Heuristic:** |
| 192 | +```rust |
| 193 | +fn diff_heuristic( |
| 194 | + source_map: &HashMap<PathBuf, DiffEntry>, |
| 195 | + target_map: &HashMap<PathBuf, DiffEntry>, |
| 196 | +) -> PathDiffResult { |
| 197 | + let mut result = PathDiffResult::default(); |
| 198 | + |
| 199 | + for (rel_path, source_entry) in source_map { |
| 200 | + match target_map.get(rel_path) { |
| 201 | + None => result.only_in_source.push(source_entry.clone()), |
| 202 | + Some(target_entry) => { |
| 203 | + if source_entry.size != target_entry.size |
| 204 | + || source_entry.modified_at != target_entry.modified_at |
| 205 | + { |
| 206 | + result.modified.push(source_entry.clone()); |
| 207 | + } else { |
| 208 | + result.matched_count += 1; |
| 209 | + } |
| 210 | + } |
| 211 | + } |
| 212 | + } |
| 213 | + |
| 214 | + for (rel_path, target_entry) in target_map { |
| 215 | + if !source_map.contains_key(rel_path) { |
| 216 | + result.only_in_target.push(target_entry.clone()); |
| 217 | + } |
| 218 | + } |
| 219 | + |
| 220 | + result.copy_size = result.only_in_source.iter().map(|e| e.size).sum::<u64>() |
| 221 | + + result.modified.iter().map(|e| e.size).sum::<u64>(); |
| 222 | + |
| 223 | + result |
| 224 | +} |
| 225 | +``` |
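The strict `!=` on `modified_at` can report unchanged files as modified when the two volumes store timestamps at different resolutions; FAT-family filesystems, for example, keep modification times at 2-second granularity. The following is a sketch of a tolerant comparison, using `std::time::SystemTime` in place of `DateTime<Utc>` to stay dependency-free. `same_mtime` and the tolerance parameter are illustrative assumptions, not part of this task's spec.

```rust
use std::time::{Duration, SystemTime};

/// Treat two modification times as equal when they differ by no more
/// than `tolerance`, regardless of which one is later.
pub fn same_mtime(a: SystemTime, b: SystemTime, tolerance: Duration) -> bool {
    let delta = match a.duration_since(b) {
        Ok(d) => d,             // a is at or after b
        Err(e) => e.duration(), // a is before b: take the absolute gap
    };
    delta <= tolerance
}
```

A tolerance of about two seconds would cover the FAT case; whether any tolerance should be applied by default is a design decision for the heuristic strategy.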
| 226 | + |
| 227 | +**Content:** Same structure but matches on `content_id` instead of (path, size, mtime). Can detect files that were renamed or moved. Falls back to heuristic for entries without content IDs. |
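
One way the content strategy could look is sketched below, under simplifying assumptions: `Entry` is a hypothetical reduced form of `DiffEntry`, mtimes are plain seconds, and only the missing set and the match count are produced. A source file counts as present when any target file carries the same hash, which is what lets renames and moves match.

```rust
use std::collections::{HashMap, HashSet};
use std::path::PathBuf;

/// Hypothetical reduced form of `DiffEntry`.
#[derive(Clone, Debug, PartialEq)]
pub struct Entry {
    pub size: u64,
    pub mtime_secs: i64,
    pub content_id: Option<String>,
}

/// Returns (relative paths missing from target, matched count).
pub fn diff_content_sketch(
    source: &HashMap<PathBuf, Entry>,
    target: &HashMap<PathBuf, Entry>,
) -> (Vec<PathBuf>, usize) {
    // Every hash present anywhere under the target root.
    let target_hashes: HashSet<&str> = target
        .values()
        .filter_map(|e| e.content_id.as_deref())
        .collect();

    let mut missing = Vec::new();
    let mut matched = 0usize;
    for (rel, src) in source {
        let present = match src.content_id.as_deref() {
            // Hash known: any target file with the same bytes counts.
            Some(hash) => target_hashes.contains(hash),
            // No hash: fall back to the heuristic (path, size, mtime).
            None => target.get(rel).map_or(false, |t| {
                t.size == src.size && t.mtime_secs == src.mtime_secs
            }),
        };
        if present {
            matched += 1;
        } else {
            missing.push(rel.clone());
        }
    }
    missing.sort(); // deterministic order for display and tests
    (missing, matched)
}
```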
| 228 | + |
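| 229 | +**Hybrid:** Runs heuristic first. For entries where the path and size match but mtime differs, compares `content_id` to determine whether the file actually changed or just had its timestamp updated. A size mismatch is treated as modified without hashing, since files of different lengths cannot have identical content. |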
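
The per-file decision in the hybrid strategy can be sketched as a small classifier. `Meta`, `Verdict`, and `classify_hybrid` are hypothetical names, and treating a missing hash as "modified" is a conservative assumption rather than something this task specifies.

```rust
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Matched,
    Modified,
}

/// Hypothetical per-file metadata for two entries at the same relative path.
pub struct Meta<'a> {
    pub size: u64,
    pub mtime_secs: i64,
    pub content_id: Option<&'a str>,
}

pub fn classify_hybrid(src: &Meta<'_>, dst: &Meta<'_>) -> Verdict {
    // Different lengths cannot hold identical bytes: no hash needed.
    if src.size != dst.size {
        return Verdict::Modified;
    }
    // Same size and mtime: accept the heuristic match.
    if src.mtime_secs == dst.mtime_secs {
        return Verdict::Matched;
    }
    // Ambiguous case (same size, different mtime): let the hash decide.
    match (src.content_id, dst.content_id) {
        (Some(a), Some(b)) if a == b => Verdict::Matched,
        // Hashes differ or are unavailable: re-copy to be safe.
        _ => Verdict::Modified,
    }
}
```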
| 230 | + |
| 231 | +### 5. Register as Action |
| 232 | + |
| 233 | +```rust |
| 234 | +// core/src/ops/files/diff/action.rs |
| 235 | + |
| 236 | +pub struct PathDiffAction; |
| 237 | +crate::register_library_action!(PathDiffAction, "files.diff"); |
| 238 | + |
| 239 | +impl Action for PathDiffAction { |
| 240 | + type Input = PathDiffInput; |
| 241 | + type Output = PathDiffResult; |
| 242 | + |
| 243 | + async fn run(input: Self::Input, ctx: &ActionContext) -> Result<Self::Output> { |
| 244 | + // 1. Ensure both paths indexed |
| 245 | + ensure_indexed(&input.source, input.use_index_rules, ctx).await?; |
| 246 | + ensure_indexed(&input.target, input.use_index_rules, ctx).await?; |
| 247 | + let cache = ctx.core_context().ephemeral_cache(); |
| 248 | + // 2. Build path maps from the shared ephemeral cache |
| 249 | + let source_map = build_path_map(&input.source.resolve(ctx)?, &cache).await?; |
| 250 | + let target_map = build_path_map(&input.target.resolve(ctx)?, &cache).await?; |
| 251 | + |
| 252 | + // 3. Run strategy |
| 253 | + match input.strategy { |
| 254 | + DiffStrategy::Heuristic => Ok(diff_heuristic(&source_map, &target_map)), |
| 255 | + DiffStrategy::Content => Ok(diff_content(&source_map, &target_map)), |
| 256 | + DiffStrategy::Hybrid => Ok(diff_hybrid(&source_map, &target_map)), |
| 257 | + } |
| 258 | + } |
| 259 | +} |
| 260 | +``` |
| 261 | + |
| 262 | +### 6. CLI Integration |
| 263 | + |
| 264 | +```bash |
| 265 | +# Show what's missing on the NAS |
| 266 | +sd-cli files diff /Volumes/ExtDrive/Photos /Volumes/NAS/Photos |
| 267 | + |
| 268 | +# Copy only what's missing |
| 269 | +sd-cli files diff /Volumes/ExtDrive/Photos /Volumes/NAS/Photos --copy |
| 270 | + |
| 271 | +# Content-based matching (catches renames) |
| 272 | +sd-cli files diff /Volumes/ExtDrive /Volumes/NAS --strategy content |
| 273 | + |
| 274 | +# Include normally-filtered files |
| 275 | +sd-cli files diff /path/a /path/b --no-rules |
| 276 | +``` |
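For the human-readable summary that `files diff` is expected to print, a formatting helper might look like the sketch below. The wording of the summary line and the names `human_bytes` and `format_summary` are assumptions made for illustration; the task does not define the actual CLI output.

```rust
/// Render a byte count with binary units and one decimal place.
pub fn human_bytes(bytes: u64) -> String {
    const UNITS: [&str; 5] = ["B", "KiB", "MiB", "GiB", "TiB"];
    let mut value = bytes as f64;
    let mut unit = 0;
    while value >= 1024.0 && unit < UNITS.len() - 1 {
        value /= 1024.0;
        unit += 1;
    }
    if unit == 0 {
        format!("{} B", bytes)
    } else {
        format!("{:.1} {}", value, UNITS[unit])
    }
}

/// One-line summary built from the counts in a diff result.
pub fn format_summary(
    missing: usize,
    extra: usize,
    modified: usize,
    matched: usize,
    copy_bytes: u64,
) -> String {
    format!(
        "{} missing, {} modified, {} extra, {} matched; {} to copy",
        missing,
        modified,
        extra,
        matched,
        human_bytes(copy_bytes)
    )
}
```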
| 277 | + |
| 278 | +## Files to Create |
| 279 | + |
| 280 | +- `core/src/ops/files/diff/mod.rs` - Module definition, types |
| 281 | +- `core/src/ops/files/diff/action.rs` - PathDiffAction registration |
| 282 | +- `core/src/ops/files/diff/resolver.rs` - Index acquisition and path map building |
| 283 | +- `core/src/ops/files/diff/strategies.rs` - Heuristic, Content, and Hybrid matching |
| 284 | + |
| 285 | +**Modified Files:** |
| 286 | +- `core/src/ops/files/mod.rs` - Add `diff` module |
| 287 | +- CLI command registration for `files diff` |
| 288 | + |
| 289 | +## Acceptance Criteria |
| 290 | + |
| 291 | +- [ ] PathDiffAction registered and callable via API |
| 292 | +- [ ] Heuristic strategy correctly identifies files missing from target |
| 293 | +- [ ] Heuristic strategy correctly identifies modified files (same path, different size/mtime) |
| 294 | +- [ ] Content strategy matches by BLAKE3 hash, detects renames |
| 295 | +- [ ] Hybrid strategy uses heuristic first, falls back to content for ambiguous cases |
| 296 | +- [ ] Auto-indexes paths that aren't in the ephemeral cache before diffing |
| 297 | +- [ ] `use_index_rules: false` triggers complete scan via INDEX-011 |
| 298 | +- [ ] Diff result feeds directly into FileCopyJob input |
| 299 | +- [ ] CLI `files diff` command shows human-readable summary |
| 300 | +- [ ] CLI `files diff --copy` triggers copy of missing files |
| 301 | +- [ ] Integration test: diff two directories, copy diff, re-diff shows zero missing |
| 302 | +- [ ] Handles cross-volume paths (local drive vs NAS) |
| 303 | +- [ ] Handles directories with 100K+ files without excessive memory usage |
| 304 | + |
| 305 | +## Performance Notes |
| 306 | + |
| 307 | +- Path map building is O(n) where n = entries under the root |
| 308 | +- Heuristic comparison is O(n + m) — one pass over each map |
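| 309 | +- Content comparison is O(n + m) for entries that already have a `content_id`; when hashing on demand, cost is dominated by reading file contents and scales with the total bytes hashed rather than the file count |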
| 310 | +- For the NAS use case (100K files, heuristic), expect sub-second comparison on indexed data |
| 311 | +- The expensive part is the initial indexing, not the diff itself |
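
The comparison cost can be sanity-checked on synthetic data without touching a real index. This sketch uses plain `HashMap`s keyed by relative path string with `(size, mtime)` values; `synthetic_map` and `count_to_copy` are hypothetical helpers, and the timing it returns depends entirely on the machine it runs on.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Build up to `n` synthetic entries keyed by relative path.
/// When `skip_every` is non-zero, every `skip_every`-th entry is left out
/// to simulate a partial backup.
pub fn synthetic_map(n: u64, skip_every: u64) -> HashMap<String, (u64, i64)> {
    (0..n)
        .filter(|i| skip_every == 0 || i % skip_every != 0)
        .map(|i| (format!("dir{}/file{}.bin", i % 100, i), (i, 0)))
        .collect()
}

/// Count source entries that are absent from, or differ at, the target,
/// and report how long the single pass took.
pub fn count_to_copy(
    source: &HashMap<String, (u64, i64)>,
    target: &HashMap<String, (u64, i64)>,
) -> (usize, Duration) {
    let start = Instant::now();
    let count = source
        .iter()
        .filter(|(k, v)| target.get(*k) != Some(*v))
        .count();
    (count, start.elapsed())
}
```

Calling `count_to_copy(&synthetic_map(100_000, 0), &synthetic_map(100_000, 10))` gives a rough feel for the 100K-file case on a given machine.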
| 312 | + |
| 313 | +## Related Tasks |
| 314 | + |
| 315 | +- INDEX-010 - Bidirectional UUID Reconciliation (stable UUIDs across index layers) |
| 316 | +- INDEX-011 - Rules-Free Ephemeral Scan Mode (complete coverage for diff accuracy) |
| 317 | +- FILE-001 - File Copy Job (executes the copy after diff) |
| 318 | +- FSYNC-003 - FileSyncService Core (uses diff logic internally for sync resolution) |