Background
The current detection pipeline (python-compressor/app.py) uses:
- Otsu thresholding + morphological operations to locate the bright document region
- Boundary-band Hough Transform to find document edges
- Flush-side substitution when the document touches the image frame
- Fallbacks: raw Hough → contour detection → brightness mask contour
- cv2.getPerspectiveTransform + cv2.warpPerspective for perspective correction
Estimated correct quad detection rate: ~60–70%. Goal is 80–85%+ with no heavy new dependencies.
Improvement Plan (dependency-conscious)
Priority 1 — Zero new dependencies (OpenCV + numpy only)
1. Illumination normalization (a few lines, immediate impact on shadow photos)
Flatten uneven lighting before edge detection and Otsu:
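# Estimate the slowly varying illumination, divide it out, and clip to avoid uint8 wrap-around
background = cv2.GaussianBlur(gray, (101, 101), 0).astype(np.float32) + 1e-6
normalized = cv2.divide(gray.astype(np.float32), background, scale=255)
normalized = np.clip(normalized, 0, 255).astype(np.uint8)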
Apply before Canny and the brightness mask. Directly improves detection on photos taken under desk lamps, near windows, or with phone torch.
2. Probabilistic Hough Transform with length filter
Replace cv2.HoughLines with cv2.HoughLinesP, which returns finite line segments rather than infinite (rho, theta) lines. Filter by minimum length (~15% of image diagonal) to discard short spurious segments from text content inside the document. This directly addresses the false-line problem on text-dense documents.
3. LAB / HSV color segmentation
Replace grayscale Otsu with:
- L channel (LAB) for white/cream paper detection — better contrast than raw grayscale
- Inverted S channel (HSV) — white paper has near-zero saturation; wood, fabric, and most table backgrounds have clearly higher saturation
Handles non-white documents (cream, yellow, coloured forms) and textured backgrounds (wood, fabric) that currently confuse the Otsu mask.
4. Quadrilateral edge-probability scoring (Dropbox approach — biggest gain)
Instead of accepting the first valid quad, generate multiple candidate quads from the top N Hough lines per side and score each by sampling the Canny edge map along its perimeter. The quad whose four sides align best with actual edges wins.
Implementation sketch:
- Take top 3 representative lines per side → up to 81 candidate quads
- For each quad, sample ~50 evenly-spaced points along each side
- Score = sum of edge pixel values at those points
- Return the highest-scoring quad that passes _validate_quad
This is the approach Dropbox documented as their single most impactful improvement, making detection "60% less likely to need manual correction".
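The scoring step could look roughly like the sketch below. It assumes candidate lines are given in (rho, theta) form as returned by cv2.HoughLines, already grouped by side. The names intersect, edge_score, and best_quad are hypothetical, and the is_valid callback stands in for _validate_quad, whose real signature is not shown here.

```python
import itertools
import numpy as np


def intersect(l1, l2):
    """Intersection of two (rho, theta) lines, or None if they are near-parallel."""
    (r1, t1), (r2, t2) = l1, l2
    a = np.array([[np.cos(t1), np.sin(t1)], [np.cos(t2), np.sin(t2)]])
    if abs(np.linalg.det(a)) < 1e-9:
        return None
    return np.linalg.solve(a, np.array([r1, r2], dtype=np.float64))


def edge_score(edges, quad, samples=50):
    """Sum of edge-map values at evenly spaced points along the quad perimeter."""
    h, w = edges.shape
    t = np.linspace(0.0, 1.0, samples, endpoint=False)[:, None]
    total = 0.0
    for p, q in zip(quad, np.roll(quad, -1, axis=0)):
        pts = np.rint(p + t * (q - p)).astype(int)          # (samples, 2) as x, y
        ok = (pts[:, 0] >= 0) & (pts[:, 0] < w) & (pts[:, 1] >= 0) & (pts[:, 1] < h)
        total += edges[pts[ok, 1], pts[ok, 0]].sum(dtype=np.float64)
    return total


def best_quad(edges, top, right, bottom, left, is_valid=lambda quad: True):
    """Try every combination of one line per side; return the best-scoring quad."""
    best, best_score = None, -1.0
    for t, r, b, l in itertools.product(top, right, bottom, left):
        corners = [intersect(t, l), intersect(t, r), intersect(b, r), intersect(b, l)]
        if any(c is None for c in corners):
            continue
        quad = np.array(corners)                            # TL, TR, BR, BL
        if not is_valid(quad):                              # stand-in for _validate_quad
            continue
        score = edge_score(edges, quad)
        if score > best_score:
            best, best_score = quad, score
    return best
```

With 3 lines per side this evaluates at most 3^4 = 81 quads at 200 samples each, which is negligible next to the Canny and Hough passes.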
5. GrabCut fallback
cv2.grabCut is already in OpenCV. Add as a fallback after all Hough/contour methods fail:
- Use the brightness mask bounding box as the initial rect
- GrabCut refines foreground/background using colour statistics
- Extract the largest foreground contour as the document quad
Handles off-white, coloured, and low-contrast documents where brightness-based segmentation fails.
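6. Robust line fitting (M-estimator rather than RANSAC)
Use cv2.fitLine with cv2.DIST_HUBER to robustly fit each document side from the boundary-band edge pixels. Strictly speaking this is an M-estimator (iteratively reweighted least squares), not RANSAC, but it serves the same purpose here: it is more resistant than a plain least-squares fit to a few bad edge points from printed borders or tables near the document edge.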
Priority 2 — One small addition: scikit-image (~15 MB wheel; note it ships compiled extensions and depends on SciPy, so it is not pure Python)
7. Sauvola binarization for the enhancement step
The current enhance=True path uses cv2.adaptiveThreshold. Sauvola adapts the threshold per-pixel using local mean and standard deviation:
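T(x,y) = mean(x,y) * [1 + k * (std(x,y)/R - 1)]
Here mean and std are taken over a local window around (x,y), k controls how strongly the threshold drops in low-contrast regions (typically 0.2–0.5), and R is the dynamic range of the standard deviation (128 for 8-bit images). Because it also uses the local standard deviation, Sauvola is significantly better than a mean-only adaptive threshold under uneven lighting, shadows, and aged paper. Available as skimage.filters.threshold_sauvola.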
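To make the formula concrete, here is a NumPy-only sketch that computes the Sauvola threshold map with integral images. It is meant as an illustration of the formula, not a replacement for skimage.filters.threshold_sauvola; the function names and the defaults (win=15, k=0.2, r=128) are assumptions.

```python
import numpy as np


def box_mean(a, win):
    """Mean over a win x win window centred on each pixel (win odd, edges replicated)."""
    pad = win // 2
    padded = np.pad(a, pad, mode="edge")
    # Integral image with a leading row and column of zeros.
    ii = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    window_sum = ii[win:, win:] - ii[:-win, win:] - ii[win:, :-win] + ii[:-win, :-win]
    return window_sum / (win * win)


def sauvola_threshold(gray, win=15, k=0.2, r=128.0):
    """Per-pixel threshold T = mean * (1 + k * (std / r - 1))."""
    g = gray.astype(np.float64)
    mean = box_mean(g, win)
    # Var = E[x^2] - E[x]^2, clamped at zero against rounding error.
    std = np.sqrt(np.maximum(box_mean(g * g, win) - mean ** 2, 0.0))
    return mean * (1.0 + k * (std / r - 1.0))


def binarize(gray, **kwargs):
    """White (255) where the pixel is brighter than its local threshold."""
    return np.where(gray > sauvola_threshold(gray, **kwargs), 255, 0).astype(np.uint8)
```

On a perfectly flat region std is 0, so the threshold sits at (1 - k) times the local mean; this is what keeps blank paper white even when the overall brightness drifts across the page.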
What to skip (too heavy)
| Approach | Reason |
|---|---|
| HED edge detection | Requires PyTorch (~700 MB) |
| MobileNet V2 corner regression | Requires training pipeline + ONNX Runtime; worth it only at 90%+ accuracy target |
| DocScanner / DocTr / DewarpNet | PyTorch + large models; only adds value for curved/book pages |
| Real-ESRGAN super-resolution | PyTorch; overkill for typical phone scans |
Deep learning path (future, if accuracy target rises to 90%+)
If the no-heavy-deps improvements plateau and 90%+ accuracy is needed:
- MobileNet V2 corner regression — 96×96 px input, 8-float coordinate output. This is the approach used by Genius Scan and Google ML Kit. Achieves ~85% correct detection. Requires training on SmartDoc 2015 or MIDV datasets. Export to ONNX, add onnxruntime (~10 MB) to the container.
- HED (Holistically-Nested Edge Detection) — pretrained PyTorch model, replaces Canny with semantically-aware edge maps. Plug into existing Hough pipeline. Repo: sniklaus/pytorch-hed.
- DocScanner-T (2.6M params) — for dewarping curved/folded pages and book scans. Pretrained weights available. Repo: fh2019ustc/DocScanner.
- Shadow removal (FSENet / IllTr) — for high-quality output mode where downstream OCR is planned.
Relevant datasets and benchmarks
- SmartDoc 2015 — 150 video clips of smartphone document capture, annotated with corner coordinates. Standard benchmark for flat document detection.
- MIDV-2020 — 72,409 annotated images of identity documents in various conditions.
- DocUNet Benchmark — 130 warped document images, metrics: MS-SSIM, Local Distortion, OCR CER.
- DIBCO — Document Image Binarization Competition, benchmark for Otsu vs Sauvola vs DL binarizers.
Expected outcome
Implementing Priority 1 items (all zero new deps) should raise correct quad detection from ~60–70% to ~80–85%, matching the level of early Genius Scan / Dropbox implementations.
References