docs: Borusyak, Jaravel & Spiess (2024) paper review — ImputationDiD (PR-A)

igerber · claude · igerber · commit eb50ae806252 · 2026-06-06T10:52:28.000-04:00
Add the canonical scholarly review of the primary source (DOI 10.1093/restud/rdae007, Review of Economic Studies 91(6), 3253-3285) as the Step-1 fidelity artifact for the ImputationDiD methodology validation (PR-A of the 2-PR sequence). Version-pinned to the published REStud article; equation/theorem numbering verified against the source PDF (pp. 3266-3274) and the existing REGISTRY.md ImputationDiD section. Local AI review (4 fresh codex rounds) surfaced and resolved one P1 methodology finding: the library's _compute_auxiliary_residuals_treated uses a simplified observation-level v_it-weighted mean for the Theorem 3 auxiliary tau_tilde_g, not the paper's unit-clustered Equation 8 (verified vs imputation.py:1663-1676 and the PDF p. 3272). The two coincide for the unweighted default paths (overall ATT / single- horizon event-study under cohort_horizon) and diverge for survey-weighted / non-uniform / coarser-partition cases; the simplification stays conservative (Theorem 3) but is not the excess-variance-minimizing form. Per the chosen disposition (document in PR-A, fix code in PR-B): - docs/methodology/papers/borusyak-jaravel-spiess-2024-review.md: the paper review, with the Eq.8 deviation note, qualified checklist, and headline PR-B correction. - docs/methodology/REGISTRY.md: relabel the aggregator + recognized **Note:** deviation. - docs/doc-deps.yaml: map the review under diff_diff/imputation.py (mirrors #524). - TODO.md: track the Eq.8 code reconciliation + the PR-B didimputation parity and tests/test_methodology_imputation.py. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/TODO.md b/TODO.md
@@ -88,6 +88,7 @@ Deferred items from PR reviews that were not addressed before merge.
 | SyntheticControl: the remaining ADH-2015 §4 items — the regression-weight `W^reg = X_0'(X_0 X_0')^{-1} X_1` extrapolation diagnostic (flag implied OLS weights outside `[0,1]`) and sparse-SC subset search (`l < J`, holding `V` fixed). Leave-one-out (`leave_one_out()`), the in-time placebo (`in_time_placebo()`), out-of-sample CV `V`-selection (`v_method="cv"`), and inverse-variance `V` (`v_method="inverse_variance"`) have landed; these two are the deferred tail. | `synthetic_control.py`, `synthetic_control_results.py` | ADH-2015 follow-up | Low |
 | ContinuousDiD deferred CGBS 2024 extensions: (a) `covariates=` kwarg not implemented (matches R `contdid` v0.1.0); (b) discrete-treatment saturated regression deferred (integer-valued dose currently warned, not routed to per-level coefficients); (c) lowest-dose-as-control per CGBS 2024 Remark 3.1 (when `P(D=0) = 0`) not implemented — estimator requires never-treated controls. REGISTRY `## ContinuousDiD` → Implementation Checklist marks these as deferred `[ ]` items. | `diff_diff/continuous_did.py` | — | Low |
 | Survey-weighted Silverman bandwidth in EfficientDiD conditional Omega* — `_silverman_bandwidth()` uses unweighted mean/std for bandwidth selection; survey-weighted statistics would better reflect the population distribution but is a second-order refinement | `efficient_did_covariates.py` | — | Low |
+| ImputationDiD: auxiliary `tau_tilde_g` (Theorem 3 variance) uses a simplified **observation-level** v_it-weighted mean `sum(v*tau_hat)/sum(v)` instead of the paper's **unit-clustered** Eq. 8 `sum_i[(sum_t v)(sum_t v*tau_hat)]/sum_i(sum_t v)^2` (BJS 2024 p. 3272). Coincides for uniform within-group weights (the **unweighted** overall ATT / single-horizon ES under default `cohort_horizon`); diverges for non-uniform weights (survey-weighted, balanced/heterogeneity) or coarser `cohort`/`horizon` partitions. Stays conservative (Theorem 3) but is not the excess-variance-minimizing form (Supp. App. A.8). Fix `_compute_auxiliary_residuals_treated` + its docstring + the REGISTRY label to the exact unit-clustered Eq. 8, with a diverging-case methodology test. Surfaced by the ImputationDiD PR-A AI review. | `imputation.py`, `docs/methodology/REGISTRY.md` | imputation-validation (PR-B) | Medium |
 | Survey sandwich SE is not exactly invariant to zero-weight (subpopulation / padded) rows: the shared `_compute_stratified_psu_meat` finite-sample correction counts zero-weight units as PSUs (an `n_psu/(n_psu-1)`-style factor), so adding zero-weight rows shifts the SE by a second-order amount (~2e-4 relative in the EfficientDiD e2e). The point estimate is exactly invariant and the weighted scores of zero-weight rows are already zero — only the DOF correction's PSU count includes them. Cross-cutting across all survey-enabled estimators; fix by counting only positive-weight PSUs in the correction. | `survey.py` (`_compute_stratified_psu_meat`) | PR-B follow-up | Low |
 | TROP: extend Wave 4's `_setup_trop_data` helper to also cover the duplicated bootstrap resampling loop in `_bootstrap_variance` / `_bootstrap_variance_global` (~40 LoC dedup; mirrors the data-setup helper pattern with a `fit_callable` parameter for the per-draw refit step). | `trop_local.py`, `trop_global.py` | follow-up | Low |
 | TripleDifference power auto-routing: `power.simulate_power` ignores `n_periods` for DDD because `_ddd_dgp_kwargs` is hard-coded to the cross-sectional `generate_ddd_data`. Now that `generate_ddd_panel_data` exists (Wave 4), add a new `_EstimatorProfile` registry entry (or extend the existing one) to route to the panel DGP when `n_periods > 2`. | `power.py`, `prep_dgp.py` | follow-up | Low |
@@ -177,6 +178,7 @@ Deferred items from PR reviews that were not addressed before merge.
 | AI review CI: pin workflow contract via test (uses `openai/codex-action@v1`, passes `prompt-file`, reads `steps.run_codex.outputs.final-message`, preserves diff-exclude paths and comment markers). Currently only the wrapper-tag and closing-tag-escape strings are asserted. | `tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml` | #416 | Low |
 | `TestWorkflowDoesNotExecutePRHeadCode` (CodeQL #14 dismissal guard) does not model: `bash <script>` / `sh <script>` / `./<script>` / `source <script>` / `. <script>` direct shell-script execution; multi-line `python3 -c` bodies (line-by-line shlex can't reassemble across newlines — the workflow's 5 sanitizer bodies are exempt by invisibility); shell-variable-expansion indirection (`SCRIPT="$X"; python3 "$SCRIPT"`); `eval`; `find -exec`; `xargs -I {}`. Each represents a path by which PR-head bytes COULD execute without the test failing. The guard catches accidental regressions of common forms (16 tests covering pip/npm/cargo/maturin/etc. installs, python file exec, bash -c indirection with compound flags, env-var prefixes, line continuations, subshells/brace groups, single-line python -c, write-overwrites of allowlisted /tmp paths). Closing the residuals would require multi-line shell parsing with command-substitution awareness + script-execution allowlists — significant work for diminishing return given the dismissal's primary defense is the documented threat model on the alert and in `.github/workflows/ai_pr_review.yml` comment block. | `tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml` | #436 | Low |
 | Render `docs/methodology/REPORTING.md` and `docs/methodology/REGISTRY.md` as in-site Sphinx pages so cross-references can use `:doc:` instead of off-site GitHub `blob/main` URLs. Current state (#410 fix-audit-r2) restores navigable links via `blob/main`, but stable-docs readers can land on a different revision than the package version they are reading. Two viable paths: (a) add `myst-parser` to `docs/conf.py` extensions + docs extras and link with `:doc:`, or (b) convert both files to `.rst`. | `docs/conf.py`, `docs/api/business_report.rst`, `docs/api/diagnostic_report.rst`, `docs/tutorials/18_geo_experiments.ipynb`, `docs/tutorials/19_dcdh_marketing_pulse.ipynb` | follow-up | Low |
+| ImputationDiD methodology validation (PR-B): add `tests/test_methodology_imputation.py` with paper-equation-numbered Verified Components (Theorems 1-3, eqs. 5-9, Props. 5/9) and an R `didimputation` parity fixture (none on file). Flips the METHODOLOGY_REVIEW.md row to Complete. | `tests/test_methodology_imputation.py` | imputation-validation (PR-B) | Medium |
 
 ---
 
diff --git a/docs/doc-deps.yaml b/docs/doc-deps.yaml
@@ -224,6 +224,8 @@ sources:
       - path: docs/methodology/REGISTRY.md
         section: "ImputationDiD"
         type: methodology
+      - path: docs/methodology/papers/borusyak-jaravel-spiess-2024-review.md
+        type: methodology
       - path: docs/api/imputation.rst
         type: api_reference
       - path: docs/tutorials/11_imputation_did.ipynb
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -1335,7 +1335,8 @@ where `W_it(h) = 1[K_it = h]` are lead indicators, estimated on `Omega_0` only.
 - **Sparse variance solver:** `_compute_v_untreated_with_covariates` uses `scipy.sparse.linalg.spsolve` to solve `(A_0'A_0) z = A_1'w` without densifying the normal equations matrix. Falls back to dense `lstsq` if the sparse solver fails and emits a `UserWarning` on the fallback (silent-failure audit axis C) so callers know variance estimates came from the degraded path.
 - **Note:** Survey weights enter ImputationDiD via weighted iterative FE (Step 1), survey-weighted ATT aggregation (Step 3), and design-based variance via `compute_survey_if_variance()`. PSU clustering, stratification, and FPC are fully supported in the Theorem 3 variance path. When `resolved_survey` is present, the observation-level influence function (`v_it * epsilon_tilde_it`) is passed to `compute_survey_if_variance()` which applies the stratified PSU-level sandwich with FPC correction. Strata also enters survey df (n_PSU - n_strata) for t-distribution inference. Bootstrap + survey supported (Phase 6) via PSU-level multiplier weights.
 - **Bootstrap inference:** Uses multiplier bootstrap on the Theorem 3 influence function: `psi_i = sum_t v_it * epsilon_tilde_it`. Cluster-level psi sums are pre-computed for each aggregation target (overall, per-horizon, per-group), then perturbed with multiplier weights (Rademacher by default; configurable via `bootstrap_weights` parameter to use Mammen or Webb weights, matching CallawaySantAnna). This is a library extension (not in the paper) consistent with CallawaySantAnna/SunAbraham bootstrap patterns.
-- **Auxiliary residuals (Equation 8):** Uses v_it-weighted tau_tilde_g formula: `tau_tilde_g = sum(v_it * tau_hat_it) / sum(v_it)` within each partition group. Zero-weight groups (common in event-study SE computation) fall back to unweighted mean.
+- **Auxiliary residuals (simplified aggregator; cf. Equation 8):** Uses an observation-level v_it-weighted average within each partition group, `tau_tilde_g = sum(v_it * tau_hat_it) / sum(v_it)`. Zero-weight groups (common in event-study SE computation) fall back to unweighted mean.
+- **Note:** The auxiliary aggregator above is the *observation-level* v_it-weighted mean, which is a simplification of the paper's *unit-clustered* Equation 8, `tau_tilde_g = sum_i[(sum_t v_it)(sum_t v_it * tau_hat_it)] / sum_i(sum_t v_it)^2` (Borusyak-Jaravel-Spiess 2024, eq. 8, p. 3272; minimal-excess-variance derivation in Supplementary Appendix A.8). The two coincide when within-group treated weights are uniform — including the common **unweighted** default paths (overall ATT and single-horizon event-study under `aux_partition="cohort_horizon"`, where each unit contributes one observation per cohort×horizon cell at equal `w_it`) — and diverge whenever treated `v_it` is non-uniform: **survey-weighted estimands** (`survey_design=` makes `w_it` proportional to the survey weight, so even overall ATT diverges), balanced / heterogeneity-by-covariate estimands, or the coarser `aux_partition="cohort"`/`"horizon"` groupings (a unit gets several observations per group). The simplification remains within Theorem 3's conservative-inference conditions (any auxiliary model converging to a non-random limit yields excess variance `sigma^2_tau >= 0`), so SEs stay valid/conservative; it is simply not the excess-variance-minimizing weighting, so confidence intervals can be marginally wider than the paper's off the default path. Reconciliation to the exact unit-clustered Eq. 8 (with the matching `_compute_auxiliary_residuals_treated` docstring fix) is tracked for PR-B — see `docs/methodology/papers/borusyak-jaravel-spiess-2024-review.md` and the TODO.md Methodology/Correctness row. Surfaced by the PR-A AI review (2026-06-06).
 - **Note:** Both the iterative FE solver (`_iterative_fe`, Step 1) and the iterative alternating-projection demeaning helper (`_iterative_demean`, used in covariate residualization and the pre-trend test) emit `UserWarning` when `max_iter` exhausts without reaching `tol`, via `diff_diff.utils.warn_if_not_converged`. Silent return of the current iterate was classified as a silent failure under the Phase 2 audit and replaced with an explicit signal to match the logistic/Poisson IRLS pattern in `linalg.py`.
 - **Note:** `vcov_type` is permanently narrow to `{"hc1"}` per the Theorem 3 IF-based variance decomposition. Analytical-sandwich families `{classical, hc2, hc2_bm}` are rejected at `__init__` — the per-unit influence function aggregation has no equivalent single design matrix on which hat-matrix leverage or Bell-McCaffrey Satterthwaite DOF can be defined. `cluster=<col>` invokes per-cluster IF summation (Theorem 3 equation 7 conservative variance, `sigma_sq = (cluster_psi_sums**2).sum()` — plain CR1 with no Stata-style `(n-1)/(n-p)` finite-sample factor because the IF has no design-matrix `p` in the OLS sense); `cluster=None` (the default) routes the SAME Theorem 3 cluster-summed IF variance with `cluster_var = unit` (the unit column passed to `fit()`), so the summary renders `"CR1 cluster-robust at <unit>, G=<n_units>"` rather than the generic `"HC1"` label; `survey_design=` invokes TSL on the combined IF. Under bootstrap (`n_bootstrap > 0`) the analytical variance-family label is suppressed in `summary()` because `fit()` overwrites the reported SE/CI/p-value with bootstrap_results (mirrors the canonical `DiDResults` gate at `results.py:213-226`). `vcov_type='conley'` is deferred to the ImputationDiD Conley follow-up row in TODO.md.
 - **Note:** `cluster=<col>` combined with a replicate-weight `SurveyDesign` raises `NotImplementedError` at `fit()`. Replicate-weight variance ignores PSU/cluster structure entirely (replicates encode the design implicitly), so honoring `cluster=` would silently no-op while populating `cluster_name`/`n_clusters` on Results dishonestly. Either omit `cluster=` (the replicate weights encode the design structure implicitly) or use a non-replicate survey design (with explicit strata/psu/fpc). Mirrors the `CallawaySantAnna` and `TripleDifference` fail-closed guards.
diff --git a/docs/methodology/papers/borusyak-jaravel-spiess-2024-review.md b/docs/methodology/papers/borusyak-jaravel-spiess-2024-review.md