
feat(scan): improve auto document detection quality #13

@roiguri


Background

The current detection pipeline (python-compressor/app.py) uses:

  • Otsu thresholding + morphological operations to locate the bright document region
  • Boundary-band Hough Transform to find document edges
  • Flush-side substitution when the document touches the image frame
  • Fallbacks: raw Hough → contour detection → brightness mask contour
  • cv2.getPerspectiveTransform + cv2.warpPerspective for perspective correction

Estimated correct quad detection rate: ~60–70%. Goal is 80–85%+ with no heavy new dependencies.


Improvement Plan (dependency-conscious)

Priority 1 — Zero new dependencies (OpenCV + numpy only)

1. Illumination normalization (two lines, immediate impact on shadow photos)
Flatten uneven lighting before edge detection and Otsu:

background = cv2.GaussianBlur(gray, (101, 101), 0)
# With uint8 inputs cv2.divide saturates to [0, 255] and maps division by zero to 0
normalized = cv2.divide(gray, background, scale=255)

Apply before Canny and the brightness mask. Directly improves detection on photos taken under desk lamps, near windows, or with a phone torch.

2. Probabilistic Hough Transform with length filter
Replace cv2.HoughLines with cv2.HoughLinesP, which returns finite line segments instead of infinite lines. Filter the segments by minimum length (~15% of the image diagonal) to discard the short spurious segments produced by text inside the document. This directly addresses the false-line problem on text-dense documents.

3. LAB / HSV color segmentation
Replace grayscale Otsu with:

  • L channel (LAB) for white/cream paper detection — better contrast than raw grayscale
  • Inverted S channel (HSV) — white paper has near-zero saturation; wood, fabric, and most table backgrounds have clearly higher saturation

Handles non-white documents (cream, yellow, coloured forms) and textured backgrounds (wood, fabric) that currently confuse the Otsu mask.

4. Quadrilateral edge-probability scoring (Dropbox approach — biggest gain)
Instead of accepting the first valid quad, generate multiple candidate quads from the top N Hough lines per side and score each by sampling the Canny edge map along its perimeter. The quad whose four sides align best with actual edges wins.

Implementation sketch:

  • Take top 3 representative lines per side → up to 81 candidate quads
  • For each quad, sample ~50 evenly-spaced points along each side
  • Score = sum of edge pixel values at those points
  • Return highest-scoring quad that passes _validate_quad

This is the approach Dropbox documented as their single most impactful improvement, making detection "60% less likely to need manual correction".
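The implementation sketch above could look roughly like this. It assumes candidate lines are given in Hough (rho, theta) form and that corners are ordered top-left, top-right, bottom-right, bottom-left. The names intersect, score_quad and best_quad are hypothetical, and is_valid stands in for the existing _validate_quad:

```python
import itertools

import numpy as np


def intersect(l1, l2):
    """Intersect two (rho, theta) lines; return None if they are near-parallel."""
    (r1, t1), (r2, t2) = l1, l2
    a = np.array([[np.cos(t1), np.sin(t1)], [np.cos(t2), np.sin(t2)]])
    if abs(np.linalg.det(a)) < 1e-6:
        return None
    return np.linalg.solve(a, np.array([r1, r2], dtype=np.float64))


def score_quad(edges, corners, samples=50):
    """Sum edge-map values at evenly spaced points along the four sides."""
    h, w = edges.shape
    t = np.linspace(0.0, 1.0, samples, endpoint=False)[:, None]
    total = 0.0
    for p, q in zip(corners, np.roll(corners, -1, axis=0)):
        pts = np.rint(p + t * (q - p)).astype(int)
        xs, ys = pts[:, 0], pts[:, 1]
        inside = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
        total += float(edges[ys[inside], xs[inside]].sum())
    return total


def best_quad(edges, top, right, bottom, left, is_valid=lambda quad: True):
    """Try every combination of per-side candidates and keep the best-scoring quad."""
    best, best_score = None, -1.0
    for t, r, b, l in itertools.product(top, right, bottom, left):
        corners = [intersect(t, l), intersect(t, r), intersect(b, r), intersect(b, l)]
        if any(c is None for c in corners):
            continue
        quad = np.array(corners, dtype=np.float64)
        if not is_valid(quad):
            continue
        score = score_quad(edges, quad)
        if score > best_score:
            best, best_score = quad, score
    return best, best_score
```

With three candidates per side this evaluates at most 81 quads at 200 samples each, so the cost is small next to the Hough transform. Points that fall outside the image are ignored rather than clipped, so a quad extending past the frame gains no score from them.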

5. GrabCut fallback
cv2.grabCut is already in OpenCV. Add as a fallback after all Hough/contour methods fail:

  1. Use the brightness mask bounding box as the initial rect
  2. GrabCut refines foreground/background using colour statistics
  3. Extract the largest foreground contour as the document quad

Handles off-white, coloured, and low-contrast documents where brightness-based segmentation fails.

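6. Robust line fitting (M-estimator)
Use cv2.fitLine with cv2.DIST_HUBER to robustly fit each document side from the boundary-band edge pixels. Note that cv2.fitLine is an iteratively reweighted M-estimator rather than RANSAC, but it serves the same purpose here: it is more resistant than plain least squares to a few bad edge points from printed borders or tables near the document edge.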


Priority 2 — One small addition: scikit-image (~15 MB; includes compiled extensions and depends on SciPy)

7. Sauvola binarization for the enhancement step
The current enhance=True path uses cv2.adaptiveThreshold. Sauvola adapts the threshold per-pixel using local mean and standard deviation:

T(x,y) = mean(x,y) * [1 + k * (std(x,y)/R - 1)]

Significantly better under uneven lighting, shadows, and aged paper. Available as skimage.filters.threshold_sauvola.


What to skip (too heavy)

Approach | Reason
HED edge detection | Requires PyTorch (~700 MB)
MobileNet V2 corner regression | Requires training pipeline + ONNX Runtime; worth it only at 90%+ accuracy target
DocScanner / DocTr / DewarpNet | PyTorch + large models; only adds value for curved/book pages
Real-ESRGAN super-resolution | PyTorch; overkill for typical phone scans

Deep learning path (future, if accuracy target rises to 90%+)

If the no-heavy-deps improvements plateau and 90%+ accuracy is needed:

  • MobileNet V2 corner regression — 96×96 px input, 8-float coordinate output. This is the approach used by Genius Scan and Google ML Kit. Achieves ~85% correct detection. Requires training on SmartDoc 2015 or MIDV datasets. Export to ONNX, add onnxruntime (~10 MB) to the container.
  • HED (Holistically-Nested Edge Detection) — pretrained PyTorch model, replaces Canny with semantically-aware edge maps. Plug into existing Hough pipeline. Repo: sniklaus/pytorch-hed.
  • DocScanner-T (2.6M params) — for dewarping curved/folded pages and book scans. Pretrained weights available. Repo: fh2019ustc/DocScanner.
  • Shadow removal (FSENet / IllTr) — for high-quality output mode where downstream OCR is planned.

Relevant datasets and benchmarks

  • SmartDoc 2015 — 150 video clips of smartphone document capture, annotated with corner coordinates. Standard benchmark for flat document detection. (GitHub)
  • MIDV-2020 — 72,409 annotated images of identity documents in various conditions. (arXiv)
  • DocUNet Benchmark — 130 warped document images. Metrics: MS-SSIM, Local Distortion, OCR CER.
  • DIBCO — Document Image Binarization Competition, benchmark for Otsu vs Sauvola vs DL binarizers.

Expected outcome

Implementing Priority 1 items (all zero new deps) should raise correct quad detection from ~60–70% to ~80–85%, matching the level of early Genius Scan / Dropbox implementations.
