feat(scan): improve auto document detection quality

## Background

The current detection pipeline (`python-compressor/app.py`) uses:
- Otsu thresholding + morphological operations to locate the bright document region
- Boundary-band Hough Transform to find document edges
- Flush-side substitution when the document touches the image frame
- Fallbacks: raw Hough → contour detection → brightness mask contour
- `cv2.getPerspectiveTransform` + `cv2.warpPerspective` for perspective correction

Estimated correct quad detection rate: ~60–70%. Goal is 80–85%+ with no heavy new dependencies.

---

## Improvement Plan (dependency-conscious)

### Priority 1 — Zero new dependencies (OpenCV + numpy only)

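**1. Illumination normalization (two lines, immediate impact on shadow photos)**
Flatten uneven lighting before edge detection and Otsu by dividing the image by a heavily blurred copy of itself:
```python
# Estimate the lighting field with a large blur, divide it out, then clip before casting:
# a bare astype(np.uint8) overflows wherever gray is brighter than the local background.
background = np.maximum(cv2.GaussianBlur(gray, (101, 101), 0).astype(np.float32), 1.0)
normalized = np.clip(cv2.divide(gray.astype(np.float32), background, scale=255), 0, 255).astype(np.uint8)
```
Apply it before Canny and the brightness mask. This directly improves detection on photos taken under desk lamps, near windows, or with a phone torch.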

**2. Probabilistic Hough Transform with length filter**
Replace `cv2.HoughLines` with `cv2.HoughLinesP`. Returns line **segments** — filter by minimum length (~15% of image diagonal) to discard short spurious segments from text content inside the document. Directly addresses the false-line problem on text-dense documents.

**3. LAB / HSV color segmentation**
Replace grayscale Otsu with:
- **L channel (LAB)** for white/cream paper detection — better contrast than raw grayscale
- **Inverted S channel (HSV)** — white paper has near-zero saturation; wood, fabric, and most table backgrounds have clearly higher saturation

Handles non-white documents (cream, yellow, coloured forms) and textured backgrounds (wood, fabric) that currently confuse the Otsu mask.

**4. Quadrilateral edge-probability scoring (Dropbox approach — biggest gain)**
Instead of accepting the first valid quad, generate multiple candidate quads from the top N Hough lines per side and score each by sampling the Canny edge map along its perimeter. The quad whose four sides align best with actual edges wins.

Implementation sketch:
- Take top 3 representative lines per side → up to 81 candidate quads
- For each quad, sample ~50 evenly-spaced points along each side
- Score = sum of edge pixel values at those points
- Return highest-scoring quad that passes `_validate_quad`

This is the approach Dropbox documented as their single most impactful improvement, making detection "60% less likely to need manual correction".
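
The scoring step above can be sketched as follows. All function names are hypothetical, and the `is_valid` callback stands in for the existing `_validate_quad`, whose signature is assumed here to take a quad and return a boolean:

```python
import numpy as np


def side_score(edges, p0, p1, n_samples=50):
    """Sum edge-map values at n_samples evenly spaced points from p0 to p1."""
    h, w = edges.shape[:2]
    t = np.linspace(0.0, 1.0, n_samples)
    xs = np.clip(np.rint(p0[0] + t * (p1[0] - p0[0])), 0, w - 1).astype(int)
    ys = np.clip(np.rint(p0[1] + t * (p1[1] - p0[1])), 0, h - 1).astype(int)
    return float(edges[ys, xs].sum())


def quad_score(edges, quad):
    """Score a quad (four (x, y) corners in order) by edge support along its perimeter."""
    return sum(side_score(edges, quad[i], quad[(i + 1) % 4]) for i in range(4))


def best_quad(edges, candidates, is_valid=lambda q: True):
    """Return the highest-scoring candidate that passes is_valid, or None."""
    valid = [q for q in candidates if is_valid(q)]
    return max(valid, key=lambda q: quad_score(edges, q)) if valid else None
```

Because Canny edges are one pixel wide, a candidate that is off by a pixel scores zero along that side. Dilating the edge map slightly before scoring is one possible way to make the score more tolerant; whether that helps on real photos would need measuring.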

**5. GrabCut fallback**
`cv2.grabCut` is already in OpenCV. Add as a fallback after all Hough/contour methods fail:
1. Use the brightness mask bounding box as the initial `rect`
2. GrabCut refines foreground/background using colour statistics
3. Extract the largest foreground contour as the document quad

Handles off-white, coloured, and low-contrast documents where brightness-based segmentation fails.

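**6. Robust line fitting**
Use `cv2.fitLine` with `cv2.DIST_HUBER` to fit each document side from the boundary-band edge pixels. Strictly speaking this is an M-estimator (iteratively reweighted least squares) rather than RANSAC, but it serves the same purpose here: a few bad edge points from printed borders or tables near the document edge pull the fit far less than they would under plain least squares.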

---

### Priority 2 — One small addition: `scikit-image` (~15 MB wheel; ships compiled extensions and depends on SciPy)

**7. Sauvola binarization for the enhancement step**
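The current `enhance=True` path uses `cv2.adaptiveThreshold`, which compares each pixel against its local (optionally Gaussian-weighted) mean minus a constant. Sauvola additionally scales the threshold by the local standard deviation:
```
T(x,y) = mean(x,y) * [1 + k * (std(x,y)/R - 1)]
```
Here `k` is a sensitivity parameter (typically around 0.2) and `R` is the dynamic range of the standard deviation (128 for 8-bit images). In flat regions the threshold drops well below the local mean, so shading is not mistaken for ink; near text the higher local contrast raises it again. Significantly better under uneven lighting, shadows, and aged paper. Available as `skimage.filters.threshold_sauvola`.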

---

## What to skip (too heavy)

| Approach | Reason |
|---|---|
| HED edge detection | Requires PyTorch (~700 MB) |
| MobileNet V2 corner regression | Requires training pipeline + ONNX Runtime; worth it only at 90%+ accuracy target |
| DocScanner / DocTr / DewarpNet | PyTorch + large models; only adds value for curved/book pages |
| Real-ESRGAN super-resolution | PyTorch; overkill for typical phone scans |

---

## Deep learning path (future, if accuracy target rises to 90%+)

If the no-heavy-deps improvements plateau and 90%+ accuracy is needed:

- **MobileNet V2 corner regression** — 96×96 px input, 8-float coordinate output. This is the approach used by Genius Scan and Google ML Kit. Achieves ~85% correct detection. Requires training on SmartDoc 2015 or MIDV datasets. Export to ONNX, add `onnxruntime` (~10 MB) to the container.
- **HED (Holistically-Nested Edge Detection)** — pretrained PyTorch model, replaces Canny with semantically-aware edge maps. Plug into existing Hough pipeline. Repo: [sniklaus/pytorch-hed](https://github.com/sniklaus/pytorch-hed).
- **DocScanner-T** (2.6M params) — for dewarping curved/folded pages and book scans. Pretrained weights available. Repo: [fh2019ustc/DocScanner](https://github.com/fh2019ustc/DocScanner).
- **Shadow removal (FSENet / IllTr)** — for high-quality output mode where downstream OCR is planned.

---

## Relevant datasets and benchmarks

- **SmartDoc 2015** — 150 video clips of smartphone document capture, annotated with corner coordinates. Standard benchmark for flat document detection. [GitHub](https://github.com/jchazalon/smartdoc15-ch1-dataset)
- **MIDV-2020** — 72,409 annotated images of identity documents in various conditions. [arXiv](https://arxiv.org/abs/2107.00396)
- **DocUNet Benchmark** — 130 warped document images, metrics: MS-SSIM, Local Distortion, OCR CER. [Link](https://www3.cs.stonybrook.edu/~cvl/docunet.html)
- **DIBCO** — Document Image Binarization Competition, benchmark for Otsu vs Sauvola vs DL binarizers.

---

## Expected outcome

Implementing Priority 1 items (all zero new deps) should raise correct quad detection from ~60–70% to ~80–85%, matching the level of early Genius Scan / Dropbox implementations.

## References

- [Dropbox: Fast and Accurate Document Detection for Scanning](https://dropbox.tech/machine-learning/fast-and-accurate-document-detection-for-scanning)
- [Genius Scan: Deep Learning for Document Detection (2024)](https://blog.thegrizzlylabs.com/2024/10/document-detection.html)
- [HED: Holistically-Nested Edge Detection (arXiv)](https://arxiv.org/abs/1504.06375)
- [DocScanner GitHub](https://github.com/fh2019ustc/DocScanner)
- [DocTr GitHub](https://github.com/fh2019ustc/DocTr)
- [UVDoc GitHub](https://github.com/tanguymagne/UVDoc)
- [Shadow Removal Survey 2024](https://arxiv.org/html/2407.08865v1)
