Computational Stability of Cubical Homology: Orthogonal Controls for Fault Isolation in Retinal Diagnostics
This repository contains an end-to-end pipeline for testing how stable Topological Data Analysis (TDA) features are on discrete sensor grids.
The clinical testbed is Diabetic Retinopathy screening from fundus images. For each image, the pipeline extracts the green channel (highest vessel contrast), computes cubical persistent homology, and compares:
- a 6D topological summary: $\textbf{H}_0 \oplus \textbf{H}_1 \oplus \textbf{H}_S$
- a 7D geometric baseline: Hu invariant moments
The comparison is run across 169 perturbed settings (170 total settings including baseline) to test the Orthogonality Hypothesis:
- geometric invariants are more robust to affine/mechanical distortions
- topological invariants are more robust to illumination/quantisation shifts
- Python 3.9+
- Packages: `numpy`, `scipy`, `scikit-learn`, `opencv-python`, `gudhi`, `matplotlib`

```
pip install numpy scipy scikit-learn opencv-python gudhi matplotlib
```

- Download the FIVES dataset (Fundus Image Vessel Segmentation).
- Extract the archive directly into the workspace root.
- The target directory must be named exactly: `FIVES A Fundus Image Dataset for AI-based Vessel Segmentation/`
The ingest script filters out the AMD and Glaucoma partitions to isolate the balanced Normal/DR cohort.
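The partition filter amounts to simple label-based index selection. A minimal sketch (the label strings and function name here are illustrative assumptions, not the dataset's exact folder names):

```python
def filter_cohort(labels, keep=("Normal", "DR")):
    """Return indices of samples whose label is in the kept partitions.

    Sketch of the ingest-time cohort filter: AMD and Glaucoma samples
    are dropped so only the balanced Normal/DR cohort is staged.
    """
    return [i for i, y in enumerate(labels) if y in keep]
```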
The feature design uses a coarse vectorisation strategy to preserve continuity under bottleneck-distance perturbations.
Homology notation:

- $\textbf{H}_0$: zeroth homology on the sublevel filtration (connected dark components)
- $\textbf{H}_1$: first homology on the sublevel filtration (loops/holes, e.g. vascular rings)
- $\textbf{H}_S$: zeroth homology on the superlevel filtration (connected bright components, e.g. exudate-like regions). We write this as $\textbf{H}_S$ because $\textbf{H}_0$ is already used for sublevel zeroth homology; the subscript $S$ avoids a notation clash.
Feature vectors:

- Topological vector (6D): $\textbf{H}_0 \oplus \textbf{H}_1 \oplus \textbf{H}_S$, where each block is encoded as [Top-5 Persistence Sum, Total Persistence]
- Geometric control (7D): log-transformed Hu invariant moments
- Audit vector (13D): $\textbf{H}_0 \oplus \textbf{H}_1 \oplus \textbf{H}_S \oplus \text{Hu}$

Betti counts are intentionally excluded to avoid discontinuous count jumps under small perturbations.
| Step | Script | Purpose |
|---|---|---|
| 0 | `0_ingest.py` | Stage dataset arrays (green channel, CLAHE branch, split indices). |
| 1 | `1_precompute.py` | Build persistence/Hu caches for every protocol level and split. |
| 2 | `2_audit.py` | Baseline statistical audit (Welch's t-test, Cohen's d, lifetimes). |
| 3 | `3_signal.py` | Baseline 5-fold CV signal check for H0, H1, HS, H0H1HS, Hu. |
| 4 | `4_generalise.py` | Train on standard train split, evaluate on standard test split. |
| 5 | `5_ablate.py` | Main stress audit across all perturbation protocols and levels. |
| 6 | `6_plot.py` | Render robustness panels and orthogonality plots. |
| 7 | `7_orchestrate.py` | Run steps 1 -> 6 in sequence. |
Step I/O details:
- `0_ingest.py`: input `FIVES A Fundus Image Dataset for AI-based Vessel Segmentation/`; output `data/clean_cohort.npz`, `data/clean_images.npy`, `data/clean_labels.npy`, `data/raw_images.npy`, `data/raw_paths.npy`, `data/train_indices.npy`, `data/test_indices.npy`
- `1_precompute.py`: input `data/*.npy`, protocol ranges from `fives_shared.py`; output `cache_parts/{split}_{protocol}_{level}.pkl`
- `2_audit.py`: input `cache_parts/train_standard_0.pkl`; output console tables and `perf_log.jsonl` entries (`2_audit`, `2_audit_lifetimes`)
- `3_signal.py`: input `cache_parts/train_standard_0.pkl`; output `cache_parts/ablation_features.pkl` and `perf_log.jsonl` entries (`3_signal`)
- `4_generalise.py`: input `cache_parts/train_standard_0.pkl`, `cache_parts/test_standard_0.pkl`; output `perf_log.jsonl` entries (`4_generalise`)
- `5_ablate.py`: input baseline and perturbed caches from step 1; output `results_mechanical.csv`, `results_radiometric.csv`, `results_failure.csv`, `stress_test_results_*.csv`, and `perf_log.jsonl` entries (`5_ablate`)
- `6_plot.py`: input `perf_log.jsonl`; output `plot_results/*.png`
- `7_orchestrate.py`: input staged data and scripts; output end-to-end artefacts from steps 1 -> 6
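The step-7 driver plausibly reduces to running the downstream scripts in fixed order; this is a minimal sketch, not the actual contents of `7_orchestrate.py`, which may add logging or error handling:

```python
import subprocess
import sys

# Downstream stages in fixed order; 0_ingest.py is staged separately, once.
STEPS = ["1_precompute.py", "2_audit.py", "3_signal.py",
         "4_generalise.py", "5_ablate.py", "6_plot.py"]


def orchestrate(steps=STEPS):
    """Run each stage with the current interpreter, aborting on the first failure."""
    for script in steps:
        subprocess.run([sys.executable, script], check=True)
```

Using `check=True` makes a failed stage raise immediately, so a broken cache never propagates into later steps.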
Full run (recommended):

```
python 0_ingest.py
python 7_orchestrate.py
```

Notes:
- `7_orchestrate.py` does not run `0_ingest.py`; ingestion is a one-time staging step.
- With the current protocol ranges in `fives_shared.py`, step 5 evaluates 169 perturbed settings plus the baseline.
Manual run (step-by-step):

```
python 1_precompute.py
python 2_audit.py
python 3_signal.py
python 4_generalise.py
python 5_ablate.py
python 6_plot.py
```

- Read-only source data: the FIVES dataset directory is treated as input-only. The pipeline reads from `FIVES A Fundus Image Dataset for AI-based Vessel Segmentation/` and writes artefacts to `data/`, `cache_parts/`, `plot_results/`, `results_*.csv`, and `perf_log.jsonl`.
- Deterministic seed policy: the codebase uses a global seed (`GLOBAL_SEED = 42` in `fives_shared.py`). Stochastic perturbations are generated from deterministic RNGs derived from this seed (`make_rng(...)`), and cross-validation splits use fixed `random_state` values.
- Stable processing order: cache items are emitted and consumed in a fixed sequence (ordered indices and sequential pickle streaming), which keeps downstream feature alignment consistent across reruns.
- Rerun behaviour: `1_precompute.py` skips cache files that already exist. To force full cache regeneration, clear `cache_parts/` first. `perf_log.jsonl` is append-only by default; remove it for a fresh run history.
- Single-entry execution: `7_orchestrate.py` is the driver for steps 1 -> 6 and re-runs the full downstream analysis pipeline from staged inputs.
- Compute cost: the precompute stage is the dominant bottleneck and can take multiple hours depending on hardware and worker count (`MAX_WORKERS` is chosen from CPU/memory constraints).
- Code availability: public implementation and scripts are available at https://github.com/reyjosephnieto/persistent_homology/.
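The seed policy can be illustrated with a deterministic child-RNG derivation; `make_rng` below is a stand-in sketch keyed on string tags, not the exact helper from `fives_shared.py`:

```python
import zlib

import numpy as np

GLOBAL_SEED = 42  # mirrors GLOBAL_SEED in fives_shared.py


def make_rng(*tags):
    """Sketch: derive a reproducible child RNG from the global seed plus tags.

    crc32 gives a stable (cross-process) hash of each tag, so the same
    (protocol, level) tags always yield an identical noise stream.
    """
    key = tuple(zlib.crc32(str(t).encode()) for t in tags)
    return np.random.default_rng(np.random.SeedSequence(GLOBAL_SEED, spawn_key=key))
```

Two calls with the same tags produce identical streams across reruns, while different tags give statistically independent streams.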
- Why pooled cross-validation? A quick train/test check on the official split showed severe drift (train accuracy $84.0\%$, test accuracy $41.0\%$), so stress evaluation is reported on pooled Normal/DR samples under stratified 5-fold CV.
- Why truncation $\tau = 5.0$? Lifetime distributions for `H0`/`H1`/`HS` cluster near a mean persistence of $\approx 4.0$, so $\tau = 5.0$ removes low-amplitude topological dust.
Lifetime summary (train baseline):
| Homology | Mean | Median | Mode | Count |
|---|---|---|---|---|
| $\textbf{H}_0$ | 4.080 | 3.412 | 1.482 | 8,762,399 |
| $\textbf{H}_1$ | 3.852 | 3.383 | 1.477 | 11,609,642 |
| $\textbf{H}_S$ | 4.010 | 3.422 | 1.484 | 8,668,531 |
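One plausible reading of the $\tau = 5.0$ truncation is a hard lifetime filter; the exact rule lives in the published code, so treat this as a sketch:

```python
import numpy as np


def truncate_lifetimes(lifetimes, tau=5.0):
    """Keep only bars whose persistence reaches tau, discarding 'topological dust'.

    With mean persistence near 4.0, tau = 5.0 drops the bulk of
    low-amplitude bars while retaining the dominant features.
    """
    lifetimes = np.asarray(lifetimes, dtype=float)
    return lifetimes[lifetimes >= tau]
```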
Full univariate feature ranking by absolute effect size
| Feature Name | Healthy Mean | Diabetic Mean | p-value | Cohen's d |
|---|---|---|---|---|
| HS Top5 | 363.807 | 609.067 | 0.000 | 1.573 |
| H0 Top5 | 231.300 | 543.800 | 0.000 | 1.415 |
| Hu 5 | 7.680 | -1.920 | 0.000 | -0.901 |
| H1 Top5 | 466.493 | 547.920 | 0.000 | 0.879 |
| H0 Total | 127244.039 | 111088.070 | 0.000 | -0.698 |
| HS Total | 122674.039 | 109061.906 | 0.000 | -0.641 |
| H1 Total | 158138.562 | 139959.438 | 0.000 | -0.580 |
| Hu 4 | -9.832 | -9.995 | 0.004 | -0.336 |
| Hu 7 | -0.800 | -3.840 | 0.025 | -0.260 |
| Hu 3 | -11.274 | -11.196 | 0.116 | 0.182 |
| Hu 2 | -8.490 | -8.590 | 0.165 | -0.161 |
| Hu 1 | -2.634 | -2.642 | 0.386 | -0.100 |
| Hu 6 | -2.878 | -2.078 | 0.557 | 0.068 |
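The per-feature statistics above can be reproduced with SciPy; the pooled-SD convention for Cohen's d below is an assumption about how the effect sizes were computed, and `audit_feature` is an illustrative name:

```python
import numpy as np
from scipy import stats


def audit_feature(healthy, diabetic):
    """Welch's t-test p-value and Cohen's d (pooled-SD convention) for one feature."""
    healthy = np.asarray(healthy, dtype=float)
    diabetic = np.asarray(diabetic, dtype=float)
    # Welch's t-test: unequal variances, as in the step-2 audit.
    _, p = stats.ttest_ind(diabetic, healthy, equal_var=False)
    n1, n2 = len(healthy), len(diabetic)
    pooled_sd = np.sqrt(((n1 - 1) * healthy.var(ddof=1) +
                         (n2 - 1) * diabetic.var(ddof=1)) / (n1 + n2 - 2))
    d = (diabetic.mean() - healthy.mean()) / pooled_sd
    return p, d
```

The sign convention matches the table: a positive d means the diabetic mean exceeds the healthy mean.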
The pipeline evaluates three operational failure regimes. A global random seed governs stochastic noise generation (`seed_everything()`).
- Mechanical (Aim): probing spatial interpolation penalties ($\lambda \times L_{\mathcal{I}}$).
  - Rotation: $-10^\circ \to 10^\circ$ (step $1^\circ$)
  - Blur: $\sigma \in [0.0, 1.5]$ (step 0.15)
  - Resolution: $64\,\text{px} \to 2048\,\text{px}$
- Radiometric (Shoot): probing sensor transfer function deviations.
  - Gamma: $0.1 \to 3.0$ (step 0.1)
  - Contrast: $0.5 \to 3.0$ (step 0.1)
  - Drift: $-150 \to 150$ (step 10)
- Stochastic/Digital Failure: probing noise floors ($\eta$).
  - Gaussian Noise: $\sigma^2 \in [0.0, 0.20]$ (step 0.02)
  - Poisson Noise: $\lambda \in [0.0, 0.20]$ (step 0.02)
  - Salt & Pepper: $p \in [0.0000, 0.0250]$ (step 0.0025)
  - Bit Depth (Quantisation): $2 \to 8$ bits
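Two of the radiometric/digital protocols can be sketched on a normalised $[0, 1]$ float image; function names are illustrative, and the pipeline's own implementations live in the scripts above:

```python
import numpy as np


def apply_gamma(img, gamma):
    """Radiometric gamma protocol: power-law transfer on a [0, 1] image."""
    return np.clip(img, 0.0, 1.0) ** gamma


def quantise_bits(img, bits):
    """Digital-failure protocol: quantise to 2**bits grey levels."""
    levels = 2 ** bits - 1
    return np.round(np.clip(img, 0.0, 1.0) * levels) / levels
```

Quantisation is the regime where the topological summary is expected to hold up: it preserves the ordering of grey levels, so the filtration structure changes far less than the raw pixel values.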
- `cache_parts/`: cached persistence modules and Hu moment vectors.
- `perf_log.jsonl`: raw cross-validation metrics.
- `results_*.csv`: tabulated accuracy drops for each stress regime.
- `plot_results/`: `plot_panel_mechanical.png`, `plot_panel_radiometric.png`, `plot_panel_failure.png`, `plot_single_{protocol}.png`, `plot_orthogonality.png`
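`perf_log.jsonl` is plain JSON Lines, so it can be consumed directly for custom analyses; a minimal reader sketch (the field names in any given entry depend on the emitting step):

```python
import json


def load_perf_log(path="perf_log.jsonl"):
    """Return every JSON entry from the append-only run log, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```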
Standalone scripts to isolate grid interpolation artefacts ("The Discretisation Gap"). Outputs write to wobble/.
- Rotational Shattering: `python experiment_shattering_binary.py`
- Loop Closure: `python experiment_loops_binary.py`