Commit e44261d

Update file operation statuses and introduce path intersection task
- Changed the status of the "Epic: File Operations" from Done to In Progress to reflect ongoing work.
- Added a new task "Path Intersection & Smart Diff" to facilitate efficient file synchronization by identifying differences between two paths.
- Updated related tasks and dependencies in the file sync system to ensure proper integration with the new path intersection functionality.
1 parent f1843bf commit e44261d

8 files changed

+1248
-7
lines changed

.tasks/core/FILE-000-file-operations.md

Lines changed: 10 additions & 1 deletion
@@ -1,7 +1,7 @@
11
---
22
id: FILE-000
33
title: "Epic: File Operations"
4-
status: Done
4+
status: In Progress
55
assignee: jamiepine
66
priority: High
77
tags: [epic, core, file-ops]
@@ -11,3 +11,12 @@ whitepaper: Section 4
1111
## Description
1212

1313
Encompasses all core file manipulation tasks, including copying, moving, deleting, and validation. These operations will be implemented as durable jobs orchestrated by the Action System.
14+
15+
## Child Tasks
16+
17+
- **FILE-001**: File Copy Job - Done
18+
- **FILE-002**: File Deletion Job - Done
19+
- **FILE-003**: Cloud Volume File Operations - Done
20+
- **FILE-004**: Rename and Folders - Done
21+
- **FILE-005**: Bidirectional Remote Copy - Done
22+
- **FILE-006**: Path Intersection & Smart Diff - To Do
Lines changed: 318 additions & 0 deletions
@@ -0,0 +1,318 @@
1+
---
2+
id: FILE-006
3+
title: Path Intersection & Smart Diff
4+
status: To Do
5+
assignee: jamiepine
6+
parent: FILE-000
7+
priority: High
8+
tags: [files, operations, diff, copy, index, deduplication]
9+
last_updated: 2026-02-07
10+
related_tasks: [INDEX-010, INDEX-011, FILE-001, FSYNC-003]
11+
---
12+
13+
## Description
14+
15+
Implement a path intersection operation that compares two paths (potentially on different volumes or devices) and returns the set difference: files present at path A that don't exist at path B. This enables "smart copy" — transferring only what's missing without re-copying files that already exist at the destination.
16+
17+
Primary use case: an external drive with files, a NAS with a partial backup. Select both, see what's missing, copy the diff.
18+
19+
## Problem
20+
21+
- No way to compare two directories and find what's missing
22+
- Copying an entire folder to a NAS re-transfers files that are already there
23+
- Users resort to rsync or manual comparison for this workflow
24+
- The index has all the data needed for this comparison but no operation exposes it
25+
26+
## Architecture
27+
28+
### Three Matching Strategies
29+
30+
**1. Heuristic Match (fast, default)**
31+
Compare by relative path + file size + modification time. No hashing required. Suitable for most backup/copy scenarios where files haven't been modified in place.
32+
33+
**2. Content Match (exact)**
34+
Compare by BLAKE3 content hash (`content_id`). Catches files that have been renamed or moved. Requires Phase 4 content identification on both sides — either from the persistent index or computed on demand.
35+
36+
**3. Hybrid Match**
37+
Heuristic first, then content-verify ambiguous cases (same name, different size or mtime). Best balance of speed and accuracy.
38+
39+
### Operation Flow
40+
41+
```
Input: source_path (SdPath), target_path (SdPath), strategy

Ensure both paths are indexed (ephemeral or persistent)
  Use complete scan if needed (INDEX-011)

Build relative path maps for both directories

Apply matching strategy:
  Heuristic → compare (rel_path, size, mtime) tuples
  Content → compare content_id hashes
  Hybrid → heuristic first, hash verify edge cases

Output: PathDiffResult {
  only_in_source: Vec<DiffEntry>, // Missing from target
  only_in_target: Vec<DiffEntry>, // Extra at target
  modified: Vec<DiffEntry>, // Same path, different content
  matched_count: usize, // Count of identical files
  copy_size: u64, // Bytes needed to sync
  total_scanned: usize, // Total number of files scanned
}
```
62+
63+
### Integration with FileCopyJob
64+
65+
The diff result feeds directly into a copy operation:
66+
67+
```rust
// `target` is cloned because it is reused as the copy destination below.
let diff = path_intersection(source, target.clone(), DiffStrategy::Heuristic).await?;

// Copy only what's missing
let copy_input = FileCopyInput {
    sources: SdPathBatch::from_paths(
        diff.only_in_source.iter().map(|e| e.sd_path.clone())
    ),
    destination: target,
    overwrite: false,
    ..Default::default()
};
```
80+
81+
## Implementation Steps
82+
83+
### 1. Define Core Types
84+
85+
```rust
// core/src/ops/files/diff/mod.rs

pub struct PathDiffInput {
    pub source: SdPath,
    pub target: SdPath,
    pub strategy: DiffStrategy,
    pub use_index_rules: bool, // false = complete scan for full coverage
}

pub enum DiffStrategy {
    /// Compare by relative path + size + mtime. Fast, no hashing.
    Heuristic,
    /// Compare by BLAKE3 content hash. Catches renames/moves.
    Content,
    /// Heuristic first, content-verify ambiguous matches.
    Hybrid,
}

// `Default` is required by `PathDiffResult::default()` in the strategies.
#[derive(Debug, Default)]
pub struct PathDiffResult {
    /// Files at source that don't exist at target.
    pub only_in_source: Vec<DiffEntry>,
    /// Files at target that don't exist at source.
    pub only_in_target: Vec<DiffEntry>,
    /// Files at both paths but with different content.
    pub modified: Vec<DiffEntry>,
    /// Number of files that matched exactly.
    pub matched_count: usize,
    /// Total bytes that would need to be copied (only_in_source + modified).
    pub copy_size: u64,
    /// Total number of files scanned.
    pub total_scanned: usize,
}

// `Clone` is required because the strategies push cloned entries into the result.
#[derive(Debug, Clone)]
pub struct DiffEntry {
    pub relative_path: PathBuf,
    pub sd_path: SdPath,
    pub uuid: Option<Uuid>,
    pub size: u64,
    pub modified_at: DateTime<Utc>,
    pub content_id: Option<String>, // BLAKE3 hash if available
    pub kind: EntryKind,
}
```
129+
130+
### 2. Implement Index Acquisition
131+
132+
Before diffing, ensure both paths are indexed. Check ephemeral cache first, then request a scan if needed.
133+
134+
```rust
// core/src/ops/files/diff/resolver.rs

async fn ensure_indexed(
    path: &SdPath,
    use_rules: bool,
    ctx: &ActionContext,
) -> Result<()> {
    let cache = ctx.core_context().ephemeral_cache();

    let abs_path = path.resolve(ctx)?;
    if cache.is_indexed(&abs_path).await {
        return Ok(());
    }

    // Request index scan
    let config = if use_rules {
        IndexerJobConfig::ephemeral_browse(path.clone(), IndexScope::Recursive, false)
    } else {
        IndexerJobConfig::complete_scan(path.clone(), IndexScope::Recursive)
    };

    let job_id = ctx.job_manager().submit(IndexerJob::new(config)).await?;
    ctx.job_manager().wait_for(job_id).await?;

    Ok(())
}
```
162+
163+
### 3. Build Relative Path Maps
164+
165+
Extract entries from the ephemeral index and build maps keyed by relative path for comparison.
166+
167+
```rust
async fn build_path_map(
    root: &Path,
    cache: &EphemeralIndexCache,
) -> Result<HashMap<PathBuf, DiffEntry>> {
    let index = cache.get_for_path(root)
        .ok_or_else(|| anyhow!("Path not indexed: {}", root.display()))?;

    let index_read = index.read().await;
    let mut map = HashMap::new();

    for (abs_path, _entry_id) in index_read.entries() {
        if let Ok(relative) = abs_path.strip_prefix(root) {
            let entry = build_diff_entry(&index_read, abs_path, relative)?;
            map.insert(relative.to_path_buf(), entry);
        }
    }

    Ok(map)
}
```
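
The re-keying step can be shown in isolation. The following is a minimal sketch using only the standard library: `EntryMeta` and `relative_map` are hypothetical names for illustration, and a plain slice stands in for the ephemeral index. It shows why two roots on different volumes become comparable once entries are keyed by relative path.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the metadata the index holds per entry.
#[derive(Debug, Clone, PartialEq)]
pub struct EntryMeta {
    pub size: u64,
    pub mtime_secs: i64,
}

/// Re-key absolute paths by their path relative to `root`.
/// Entries that do not live under `root` are skipped, as in the
/// `strip_prefix` check used by `build_path_map`.
pub fn relative_map(
    root: &Path,
    entries: &[(PathBuf, EntryMeta)],
) -> HashMap<PathBuf, EntryMeta> {
    entries
        .iter()
        .filter_map(|(abs, meta)| {
            abs.strip_prefix(root)
                .ok()
                .map(|rel| (rel.to_path_buf(), meta.clone()))
        })
        .collect()
}
```

With this keying, `/mnt/ext/Photos/2024/a.jpg` and `/mnt/nas/Photos/2024/a.jpg` both map to `2024/a.jpg`, so a single lookup per entry is enough to pair them.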
188+
189+
### 4. Implement Matching Strategies
190+
191+
**Heuristic:**
192+
```rust
fn diff_heuristic(
    source_map: &HashMap<PathBuf, DiffEntry>,
    target_map: &HashMap<PathBuf, DiffEntry>,
) -> PathDiffResult {
    let mut result = PathDiffResult::default();

    for (rel_path, source_entry) in source_map {
        match target_map.get(rel_path) {
            None => result.only_in_source.push(source_entry.clone()),
            Some(target_entry) => {
                if source_entry.size != target_entry.size
                    || source_entry.modified_at != target_entry.modified_at
                {
                    result.modified.push(source_entry.clone());
                } else {
                    result.matched_count += 1;
                }
            }
        }
    }

    for (rel_path, target_entry) in target_map {
        if !source_map.contains_key(rel_path) {
            result.only_in_target.push(target_entry.clone());
        }
    }

    result.copy_size = result.only_in_source.iter().map(|e| e.size).sum::<u64>()
        + result.modified.iter().map(|e| e.size).sum::<u64>();

    result
}
```
226+
227+
**Content:** Same structure but matches on `content_id` instead of (path, size, mtime). Can detect files that were renamed or moved. Falls back to heuristic for entries without content IDs.
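
As a rough illustration of the content strategy, the sketch below matches source entries against the set of hashes present at the target, ignoring paths entirely. `Entry`, `ContentDiff`, and the `unhashed` bucket are hypothetical simplifications, not the task's actual types; entries without a `content_id` are collected separately so a caller could hand them to the heuristic pass.

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// Hypothetical, simplified entry: only the fields this sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub relative_path: PathBuf,
    pub size: u64,
    pub content_id: Option<String>,
}

#[derive(Debug, Default)]
pub struct ContentDiff {
    /// Hashed source entries whose content is absent from the target.
    pub only_in_source: Vec<Entry>,
    /// Source entries whose hash exists somewhere at the target.
    pub matched_count: usize,
    /// Source entries without a hash; left for a heuristic fallback.
    pub unhashed: Vec<Entry>,
}

pub fn diff_content(source: &[Entry], target: &[Entry]) -> ContentDiff {
    // Every hash present at the target, regardless of where the file lives.
    let target_hashes: HashSet<&str> = target
        .iter()
        .filter_map(|e| e.content_id.as_deref())
        .collect();

    let mut result = ContentDiff::default();
    for entry in source {
        match entry.content_id.as_deref() {
            Some(hash) if target_hashes.contains(hash) => result.matched_count += 1,
            Some(_) => result.only_in_source.push(entry.clone()),
            None => result.unhashed.push(entry.clone()),
        }
    }
    result
}
```

Because the lookup is by hash rather than by path, a file that was renamed or moved at the target still counts as matched and is not copied again.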
228+
229+
**Hybrid:** Runs heuristic first. For entries where path matches but size/mtime differ, checks content_id to determine if the file actually changed or just had its timestamp updated.
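
The per-entry decision in the hybrid strategy could look roughly like this. It is a sketch under assumptions: `Meta`, `Verdict`, and `classify_hybrid` are hypothetical names, timestamps are plain seconds, and a pair without hashes on both sides is conservatively treated as modified.

```rust
/// Hypothetical, simplified metadata for one side of a same-path pair.
#[derive(Debug, Clone, PartialEq)]
pub struct Meta {
    pub size: u64,
    pub mtime_secs: i64,
    pub content_id: Option<String>,
}

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Matched,
    Modified,
}

/// Classify two entries that share the same relative path.
pub fn classify_hybrid(source: &Meta, target: &Meta) -> Verdict {
    // Heuristic pass: identical size and mtime is accepted without hashing.
    if source.size == target.size && source.mtime_secs == target.mtime_secs {
        return Verdict::Matched;
    }
    // Ambiguous case: metadata differs, so consult the content hashes.
    match (&source.content_id, &target.content_id) {
        // Same content: only the timestamp (or other metadata) changed.
        (Some(a), Some(b)) if a == b => Verdict::Matched,
        // Different hashes, or a hash is missing: assume the file changed.
        _ => Verdict::Modified,
    }
}
```

Treating a missing hash as `Modified` errs on the side of copying too much rather than skipping a changed file; an implementation could instead compute the hash on demand at that point.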
230+
231+
### 5. Register as Action
232+
233+
```rust
// core/src/ops/files/diff/action.rs

pub struct PathDiffAction;
crate::register_library_action!(PathDiffAction, "files.diff");

impl Action for PathDiffAction {
    type Input = PathDiffInput;
    type Output = PathDiffResult;

    async fn run(input: Self::Input, ctx: &ActionContext) -> Result<Self::Output> {
        // 1. Ensure both paths indexed
        ensure_indexed(&input.source, input.use_index_rules, ctx).await?;
        ensure_indexed(&input.target, input.use_index_rules, ctx).await?;

        // 2. Build path maps from the same ephemeral cache that ensure_indexed checks
        let cache = ctx.core_context().ephemeral_cache();
        let source_map = build_path_map(&input.source.resolve(ctx)?, &cache).await?;
        let target_map = build_path_map(&input.target.resolve(ctx)?, &cache).await?;

        // 3. Run strategy
        match input.strategy {
            DiffStrategy::Heuristic => Ok(diff_heuristic(&source_map, &target_map)),
            DiffStrategy::Content => Ok(diff_content(&source_map, &target_map)),
            DiffStrategy::Hybrid => Ok(diff_hybrid(&source_map, &target_map)),
        }
    }
}
```
261+
262+
### 6. CLI Integration
263+
264+
```bash
# Show what's missing on the NAS
sd-cli files diff /Volumes/ExtDrive/Photos /Volumes/NAS/Photos

# Copy only what's missing
sd-cli files diff /Volumes/ExtDrive/Photos /Volumes/NAS/Photos --copy

# Content-based matching (catches renames)
sd-cli files diff /Volumes/ExtDrive /Volumes/NAS --strategy content

# Include normally-filtered files
sd-cli files diff /path/a /path/b --no-rules
```
277+
278+
## Files to Create
279+
280+
- `core/src/ops/files/diff/mod.rs` - Module definition, types
281+
- `core/src/ops/files/diff/action.rs` - PathDiffAction registration
282+
- `core/src/ops/files/diff/resolver.rs` - Index acquisition and path map building
283+
- `core/src/ops/files/diff/strategies.rs` - Heuristic, Content, and Hybrid matching
284+
285+
**Modified Files:**
286+
- `core/src/ops/files/mod.rs` - Add `diff` module
287+
- CLI command registration for `files diff`
288+
289+
## Acceptance Criteria
290+
291+
- [ ] PathDiffAction registered and callable via API
292+
- [ ] Heuristic strategy correctly identifies files missing from target
293+
- [ ] Heuristic strategy correctly identifies modified files (same path, different size/mtime)
294+
- [ ] Content strategy matches by BLAKE3 hash, detects renames
295+
- [ ] Hybrid strategy uses heuristic first, falls back to content for ambiguous cases
296+
- [ ] Auto-indexes paths that aren't in the ephemeral cache before diffing
297+
- [ ] `use_index_rules: false` triggers complete scan via INDEX-011
298+
- [ ] Diff result feeds directly into FileCopyJob input
299+
- [ ] CLI `files diff` command shows human-readable summary
300+
- [ ] CLI `files diff --copy` triggers copy of missing files
301+
- [ ] Integration test: diff two directories, copy diff, re-diff shows zero missing
302+
- [ ] Handles cross-volume paths (local drive vs NAS)
303+
- [ ] Handles directories with 100K+ files without excessive memory usage
304+
305+
## Performance Notes
306+
307+
- Path map building is O(n) where n = entries under the root
308+
- Heuristic comparison is O(n + m) — one pass over each map
309+
- Content comparison is O(n + m) for indexed entries, O(n * hash_time) if hashing on demand
310+
- For the NAS use case (100K files, heuristic), expect sub-second comparison on indexed data
311+
- The expensive part is the initial indexing, not the diff itself
312+
313+
## Related Tasks
314+
315+
- INDEX-010 - Bidirectional UUID Reconciliation (stable UUIDs across index layers)
316+
- INDEX-011 - Rules-Free Ephemeral Scan Mode (complete coverage for diff accuracy)
317+
- FILE-001 - File Copy Job (executes the copy after diff)
318+
- FSYNC-003 - FileSyncService Core (uses diff logic internally for sync resolution)

.tasks/core/FSYNC-000-file-sync-system.md

Lines changed: 12 additions & 2 deletions
@@ -9,7 +9,7 @@ tags: [sync, service, epic, index-driven]
99
whitepaper: Section 5.2
1010
design_doc: workbench/FILE_SYNC_IMPLEMENTATION_PLAN.md
1111
documentation: docs/core/file-sync.mdx
12-
last_updated: 2025-10-15
12+
last_updated: 2026-02-07
1313
---
1414

1515
## Description
@@ -56,10 +56,20 @@ Intelligent local storage management with access pattern tracking.
5656

5757
- **FSYNC-001**: DeleteJob Strategy Pattern & Remote Deletion (Phase 1) - Done
5858
- **FSYNC-002**: Database Schema & Entities (Phase 2) - Done
59-
- **FSYNC-003**: FileSyncService Core Implementation (Phase 3)
59+
- **FSYNC-003**: FileSyncService Core Implementation (Phase 3) - Blocked on INDEX-010, INDEX-011
6060
- **FSYNC-004**: Service Integration & API (Phase 4)
6161
- **FSYNC-005**: Advanced Features (Phase 5)
6262

63+
## New Dependencies (2026-02-07)
64+
65+
FSYNC-003 depends on foundational index work that was identified during architecture review:
66+
67+
- **INDEX-010**: Bidirectional UUID reconciliation — ephemeral index must reuse persistent UUIDs so file sync has unified identity across layers
68+
- **INDEX-011**: Rules-free scan mode — file sync needs complete filesystem visibility, not filtered index views
69+
- **FILE-006**: Path intersection & smart diff — extracts the diffing logic from FSYNC-003 into a standalone operation that also serves the "smart copy" use case
70+
71+
Execution order: INDEX-010 → INDEX-011 → FILE-006 → FSYNC-003
72+
6373
## Key Benefits
6474

6575
**No Code Duplication** - Reuses FileCopyJob routing, strategies, VolumeManager infrastructure
