
Commit 2fde00f

igerber and claude committed
PowerAnalysis methodology review (PR-B): fix panel variance to Burlig Eq.2 equicorrelated; tracker -> Complete
Reconciles diff_diff/power.py with the Bloom (1995) + Burlig, Preonas & Woerman (2020) source audits (paper reviews added in PR-A #506).

Behavior change: the analytical panel-DiD variance was the Moulton (1+(T-1)rho)/T factor (wrong period-scaling, ~4x too small at rho=0/m=r=5, AND opposite rho-sign). Replaced with the within-unit equicorrelated special case of Burlig Eq. 2, sigma^2 (1/n_T+1/n_C)(1/m+1/r)(1-rho), so within-unit correlation now LOWERS the MDE. The MDE multiplier stays the normal-z Bloom multiplier (documented as a deliberate large-sample approximation to Burlig's t).

- power.py: equicorrelated variance in _compute_variance + _compute_required_n; input validation for ALL designs (n_pre>=1, n_post>=1, rho in [-1/(T-1), 1)) enforced BEFORE the 2x2-vs-panel router, so invalid two-period shapes no longer fall through silently; the (1-rho) factor applies at T=2 too (Burlig footnote 11, the m=r=1 case), so rho is never silently ignored and rho=0 recovers Bloom's 2*sigma^2; docstrings rewritten; PR-A under-review notes removed.
- REGISTRY ## PowerAnalysis equation block rewritten (z not t; unified equicorrelated SE with the 2x2 as the m=r=1 special case; cluster-m and inverted-R^2 terms removed; both reference surfaces; checklist ticked).
- New tests/test_methodology_power.py (Bloom Table 1; 2x2 + panel closed forms; literal-equicorrelated Monte-Carlo; sample_size<->mde round-trip; input-guard + rho-at-T=2 + compute_* wrapper validation; base-R qnorm parity).
- benchmarks/R/generate_power_golden.R + benchmarks/data/r_power_golden.json.
- tests/test_power.py: inverted test_icc_effect + test_extreme_icc to Burlig's sign.
- references.rst: + Frison & Pocock (1992), McKenzie (2012) lineage.
- docs/tutorials/06_power_analysis.ipynb: corrected rho cells + summary.
- METHODOLOGY_REVIEW.md row -> Complete; TODO row removed; CHANGELOG.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
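
The variance change described in the commit message can be illustrated with a small sketch. It is not the code in `diff_diff/power.py`: the function names below are hypothetical, and it only compares the two period/correlation factors named in the message, assuming both multiply the same `sigma^2 (1/n_T + 1/n_C)` term.

```python
def moulton_factor(n_pre: int, n_post: int, rho: float) -> float:
    """Old factor named in the commit message: (1 + (T - 1) * rho) / T."""
    T = n_pre + n_post
    return (1 + (T - 1) * rho) / T


def equicorrelated_factor(n_pre: int, n_post: int, rho: float) -> float:
    """New factor: (1/m + 1/r) * (1 - rho), with m = n_pre and r = n_post."""
    return (1 / n_pre + 1 / n_post) * (1 - rho)


def did_variance(sigma: float, n_treated: int, n_control: int,
                 n_pre: int, n_post: int, rho: float) -> float:
    """Equicorrelated DiD variance as written in the commit message."""
    return (sigma ** 2
            * (1 / n_treated + 1 / n_control)
            * equicorrelated_factor(n_pre, n_post, rho))


if __name__ == "__main__":
    # At rho = 0 and m = r = 5 the old factor is 4x smaller than the new one.
    print(equicorrelated_factor(5, 5, 0.0) / moulton_factor(5, 5, 0.0))
    # The two factors move in opposite directions as rho grows.
    for rho in (0.0, 0.3, 0.6):
        print(rho, moulton_factor(5, 5, rho), equicorrelated_factor(5, 5, rho))
```

At `n_pre = n_post = 1` and `rho = 0` the new factor equals 2, which is how the sketch recovers the `2*sigma^2` form the message attributes to Bloom.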
1 parent e9b57d4 commit 2fde00f

11 files changed

Lines changed: 1170 additions & 145 deletions


CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
99

1010
### Added
1111
- **New tutorial: `docs/tutorials/24_staggered_vs_collapsed_power.ipynb` — "Staggered Rollout or a Simple 2×2? A Power-Analysis Decision Guide".** A practitioner walkthrough for geo experiments (framed on a 50-state staggered rollout) on when to reach for Callaway-Sant'Anna vs collapsing to a familiar pre/post 2×2. Shows, with live paired Monte Carlo on `generate_staggered_data`, that the collapsed 2×2 silently targets a *diluted* estimand (reports ~60–94% of the true effect-on-treated as the rollout staggers, with near-zero CI coverage of the truth under a slow rollout), and that CS's minimum-detectable-lift penalty is a *fast-rollout* phenomenon that shrinks to parity as the rollout becomes more staggered. Fully self-contained (runs live, no committed data files); ends with a CS-vs-2×2 decision guide.
12+
- **PowerAnalysis methodology-review-tracker promotion: In Progress → Complete, with a panel-variance correction (behavior change).** Closes the Bloom (1995) + Burlig, Preonas & Woerman (2020) source audits on the tracker (PR-A #506 added both paper reviews + under-review Notes; this PR validates the source against the code and reconciles the discrepancies). **Behavior change:** the analytical *panel* DiD variance was the Moulton design-effect factor `(1+(T−1)·rho)/T`, wrong two ways versus the source — wrong period-scaling (~4× too small at `rho=0`, `m=r=5` versus the iid DiD benchmark) and the **opposite `rho`-sign** (it *raised* the MDE as within-unit correlation grew). It is replaced by the within-unit equicorrelated special case of Burlig et al. Eq. 2, `Var(ATT) = sigma² · (1/n_T + 1/n_C) · (1/n_pre + 1/n_post) · (1 − rho)`, in which within-unit (serial) correlation *lowers* the MDE because the difference-in-differences cancels the shared within-unit component. So `PowerAnalysis.mde` / `power` / `sample_size` (and the `compute_*` wrappers) now return a **smaller** MDE / required N as `rho` rises for **all** designs; the 2×2 path matches Bloom's `2σ²` at the default `rho = 0` and is continuous with the panel form at `n_pre = n_post = 1`. New input validation, enforced for **all** designs *before* the 2×2-vs-panel router: `n_pre >= 1`, `n_post >= 1`, and `rho ∈ [−1/(T−1), 1)` (`T = n_pre + n_post`) now raise `ValueError` (previously invalid two-period shapes and out-of-range `rho` fell through to `basic_did` silently). The `(1 − rho)` factor applies at `T = 2` too — the 2×2 path is Burlig's `m = r = 1` special case (footnote 11), so a nonzero `rho` is no longer silently ignored there, while `rho = 0` still recovers Bloom's `2σ²`. The MDE multiplier stays the **normal (z)** Bloom multiplier (a deliberate large-sample approximation to Burlig's t, documented as `**Deviation from R:**`) — unchanged. 
New `tests/test_methodology_power.py` (Bloom Table 1 multipliers; 2×2 + panel closed forms; a literal-equicorrelated Monte-Carlo validation of the panel variance; `sample_size`↔`mde` round-trip; validation guards; base-R `qnorm` parity at `benchmarks/data/r_power_golden.json`, generator `benchmarks/R/generate_power_golden.R`); the two `tests/test_power.py` ICC-direction tests were inverted to Burlig's sign. REGISTRY `## PowerAnalysis` equation block rewritten (z not t; corrected 2×2 / panel SE + sample-size; removed the cluster-`m` and inverted-`R²` terms that matched neither code nor source); `docs/references.rst` adds Frison & Pocock (1992) + McKenzie (2012) as the equicorrelated lineage; tutorial `06_power_analysis.ipynb` corrected. `METHODOLOGY_REVIEW.md` row promoted to **Complete** (`Last Review = 2026-05-31`) with a Verified Components / Corrections Made / Deviations detail block; priority queue pruned; the PR-A under-review Notes removed across REGISTRY / `power.py` / `references.rst`.
1213
- **New estimator: `SyntheticControl` — classic Synthetic Control Method (Abadie, Diamond & Hainmueller 2010; Abadie & Gardeazabal 2003).** Standalone estimator (`diff_diff/synthetic_control.py`) + `SyntheticControlResults` (`diff_diff/synthetic_control_results.py`) + `synthetic_control()` convenience function, exported from `diff_diff`. Builds a single treated unit's counterfactual as a convex combination of never-treated donor units — **donor (unit) weights only**, no time weights or ridge, distinct from `SyntheticDiD`. The inner simplex-constrained weighted-LS solve `W*(V)` reuses `utils._sc_weight_fw` (folding `V^½` into the predictor matrix, `intercept=False`, `zeta=0`); the diagonal predictor-importance matrix `V` is selected data-driven by minimizing pre-period outcome MSPE (`v_method="nested"`, softmax-on-simplex multistart Nelder-Mead + Powell polish) or supplied by the user (`v_method="custom"`). Predictors are built from `predictors`/`predictor_window`/`predictors_op`, `special_predictors`, and per-period outcome lags (`pre_period_outcomes`), in the R `Synth::dataprep` row order; per-row standardization (SD over donors+treated, ddof=1) matches the R `Synth::synth` source. Reports the gap path (`α̂_1t = Y_1t − Σ_j w_j Y_jt`), `att` (mean post-period gap), `pre_rmspe`, donor weights, `v_weights`, and a predictor-balance table. **No analytical standard error** — `se`/`t_stat`/`p_value`/`conf_int` are NaN (in-space placebo permutation inference with the post/pre RMSPE-ratio statistic is planned for a follow-up release; `_placebo_gaps`/`_rmspe_ratio`/`_fit_snapshot` are reserved on the results object). 
Ten validation gates baked in: predictor-period leakage, absorbing post-period suffix + no-anticipation cross-check against the treatment column, post-period canonicalization, donor-pool filtering before period derivation, empty-window rejection, poor-pre-fit `UserWarning` (RMSPE > SD of treated pre-outcomes), duplicate-predictor-label rejection, inner-solve non-convergence warning, order-independent gap-path rebuild, and the `standardize="none"` deviation; plus fail-closed `custom_v` cross-field rules and degenerate single-donor / single-pre-period handling. **R-`Synth` parity** (`tests/test_methodology_synthetic_control.py`, fixtures generated by `benchmarks/R/generate_synth_basque_golden.R` into `tests/data/`): two-tier on the Basque Country study — Tier-1 feeds R's `solution.v` via `custom_v` and reproduces the published donor weights (region 10 Cataluña 0.851 + region 14 Madrid 0.149) to `atol=1e-3` deterministically; Tier-2 (`@pytest.mark.slow`) checks the data-driven nested fit lands in a tolerance band (the nested `V` legitimately differs because the outer objective uses all pre periods, not R's `time.optimize.ssr` window). Documented in `docs/methodology/REGISTRY.md` §SyntheticControl (with `**Deviation from R:** standardize="none"` and `**Note:**` labels for the standardization formula, objective window, softmax `V` parametrization, and 1×SD poor-fit threshold), `docs/api/synthetic_control.rst`, the LLM guides, and `README.md`.
1314
- **StaggeredTripleDifference methodology-review-tracker promotion: In Progress → Complete**, plus a new opt-in Eq-4.14 overall ATT. Closes the Ortiz-Villavicencio & Sant'Anna (2025, arXiv:2505.09942v3) primary-source review on the tracker (PR-A #499 added the paper review on file; this PR validates the source against it). New paper-equation-anchored Verified Components in `tests/test_methodology_staggered_triple_diff.py` (Theorem 4.1 / Eq. 4.5 RA=IPW=DR identification; Eq. 4.1 three-term DDD decomposition; Eqs. 4.11-4.12 optimal-GMM weight normalization + single-group reduction; Eq. 4.13 event-study cohort-share weighting; Eq. 4.14 / Cor. 4.2 overall) alongside the existing R cross-validation against `triplediff::ddd(panel=TRUE)` + `agg_ddd()`. **New feature — opt-in `overall_att_es` (paper Eq. 4.14 overall):** the unweighted mean of the post-treatment event-study effects ES(e), exposed on `StaggeredTripleDiffResults` (with `overall_se_es` / `overall_t_stat_es` / `overall_p_value_es` / `overall_conf_int_es`) and populated only when `aggregate="event_study"` / `"all"`. The default `overall_att` is unchanged (the Callaway-Sant'Anna simple post-treatment (g,t) average — the library-wide convention). Its analytical SE is the influence function of that mean (the average of the per-event-time combined IFs, routed through the same survey-aware variance estimator as the per-e effects via a new `_se_from_psi` helper); a multiplier-bootstrap SE replaces it under `n_bootstrap>0`. Computed via a side-channel stash on the shared `CallawaySantAnnaAggregationMixin._aggregate_event_study` (no return-signature change; CallawaySantAnna unaffected), over post-treatment `e >= -anticipation` (the library convention, matching `overall_att`). Cross-validated against R `agg_ddd(type="eventstudy")$overall.att` / `overall.se` (SE matches to ~0.1%). 
REGISTRY `## StaggeredTripleDifference`: the previously-unlabeled overall-aggregation prose is formalized under a `**Note:**` documenting both overalls, and the duplicate aggregation-weight deviation is consolidated (fixing a `P(G=g)` vs R `P(S=g)` mislabel). `METHODOLOGY_REVIEW.md` row L69 promoted to **Complete** (`Last Review = 2026-05-30`) with a Verified Components / R Comparison Results detail block; priority queue pruned. `docs/references.rst` Ortiz-Villavicencio entry pinned to arXiv:2505.09942v3.
1415
- **SunAbraham + WooldridgeDiD-OLS `vcov_type="conley"` (Conley 1999 spatial-HAC) threading.** Both estimators now accept `vcov_type="conley"` with the five `conley_*` constructor params (`conley_coords`, `conley_cutoff_km`, `conley_metric`, `conley_kernel`, `conley_lag_cutoff`), reusing the already-`conleyreg`-validated `solve_ols` / `conley.py` machinery — within-period spatial HAC at `conley_lag_cutoff=0`, plus the within-unit Bartlett serial term at `conley_lag_cutoff>0` (the panel-aware path, since `conley_time`/`conley_unit` are always supplied — not pooled cross-sectional), no new variance code. Conley routes through each estimator's within-transform path; the unit auto-cluster is dropped on the conley path (an explicit `cluster=` enables the spatial+cluster product kernel); `survey_design=` / `weights` / `n_bootstrap>0` are rejected, and WooldridgeDiD conley is OLS-path-only (`method ∈ {logit, poisson}` + conley still rejected via the `method != "ols"` guard). `SunAbrahamResults` / `WooldridgeDiDResults` gain a `conley_lag_cutoff` field plus a Conley variance-label line in `summary()` (`SunAbrahamResults` also gains `cluster_name`). FWL-composability — the within-transform conley SE equals the full-dummy conley SE — is pinned in `tests/test_conley_vcov.py` (`TestConleySunAbraham` / `TestConleyWooldridge`). **`StackedDiD` conley remains deferred for a methodology reason** (the stacked design replicates units across sub-experiments, so Conley would see same-unit copies at distance 0; no `conleyreg` anchor; paper-gated) — its prior "same shape as the SunAbraham follow-up" framing is corrected in REGISTRY / TODO / the rejection message.

METHODOLOGY_REVIEW.md

Lines changed: 23 additions & 19 deletions
@@ -24,7 +24,7 @@ A **Complete** entry has a documented review pass against the primary academic s
2424

2525
The catalog grew incrementally over several quarters, so formats vary across the existing Complete entries; the consistent invariant is that someone walked through the implementation against the academic source and captured the result here. New reviews going forward should aim for the fuller structure (Verified Components + Corrections Made + Deviations + dedicated methodology test file) used by the more recent entries.
2626

27-
**In Progress** entries have a REGISTRY.md section and unit-test coverage, but no formal walk-through has been captured here yet. The In Progress band is wide — some entries also have some combination of a paper review (primary or companion), a dedicated methodology test file, and R parity fixtures; others have only the REGISTRY entry and unit tests (e.g., PowerAnalysis). The "Documentation in place" sub-section enumerates what each entry already has; the "Outstanding for promotion" sub-section enumerates what's still needed to flip it to Complete.
27+
**In Progress** entries have a REGISTRY.md section and unit-test coverage, but no formal walk-through has been captured here yet. The In Progress band is wide — some entries also have some combination of a paper review (primary or companion), a dedicated methodology test file, and R parity fixtures; others have only the REGISTRY entry and unit tests (e.g., PlaceboTests). The "Documentation in place" sub-section enumerates what each entry already has; the "Outstanding for promotion" sub-section enumerates what's still needed to flip it to Complete.
2828

2929
**Not Started** entries have neither a tracker walk-through nor a REGISTRY.md section. This tracker no longer carries any Not Started rows; new estimators are expected to enter as In Progress when their REGISTRY entry lands.
3030

@@ -81,7 +81,7 @@ The catalog grew incrementally over several quarters, so formats vary across the
8181
| BaconDecomposition | `bacon.py` | `bacondecomp::bacon()` | **Complete** | 2026-05-16 |
8282
| HonestDiD | `honest_did.py` | `HonestDiD` package | **Complete** | 2026-04-01 |
8383
| PreTrendsPower | `pretrends.py` | `pretrends` package | **Complete** | 2026-05-19 |
84-
| PowerAnalysis | `power.py` | `pwr` / `DeclareDesign` | **In Progress** | |
84+
| PowerAnalysis | `power.py` | `pwr` / `DeclareDesign` | **Complete** | 2026-05-31 |
8585
| PlaceboTests | `diagnostics.py` | (no canonical reference) | **In Progress** ||
8686

8787
### Cross-Cutting Inference Features
@@ -1236,20 +1236,25 @@ CI and extending covariate-adjusted R parity are tracked follow-ups in `TODO.md`
12361236
| Field | Value |
12371237
|-------|-------|
12381238
| Module | `power.py` |
1239-
| Primary References | Bloom (1995); Burlig, Preonas & Woerman (2020) — clustered DiD power (both listed in REGISTRY) |
1240-
| R Reference | `pwr` (basic) / `DeclareDesign` (design-based simulation) |
1241-
| Status | **In Progress** |
1242-
| Last Review | |
1239+
| Primary References | Bloom (1995) — normal MDE multiplier; Burlig, Preonas & Woerman (2020) — panel-DiD variance (equicorrelated special case of Eq. 2) |
1240+
| R Reference | `pwr::pwr.norm.test` (analytical, normal-based — **not** `pwr.t.test`); Stata `pcpanel` (Burlig panel); `DeclareDesign` (simulation) |
1241+
| Status | **Complete** |
1242+
| Last Review | 2026-05-31 |
12431243

1244-
**Documentation in place:**
1245-
- REGISTRY.md section: `## PowerAnalysis` (MDE / power / sample size / simulation-based power / cluster adjustment); primary sources Bloom (1995) and Burlig et al. (2020) listed
1246-
- Implementation: `tests/test_power.py` (MDE / power / sample-size / simulation paths plus cluster adjustment)
1244+
**Verified components:**
1245+
- MDE multiplier `M = z_{1-α/2 (or 1-α)} + z_{1-κ}` is the normal (Bloom 1995) multiplier; reproduces Bloom Table 1 (2.49 @ one-sided .05/.80, 2.93, 2.17).
1246+
- The unified equicorrelated SE `√(σ²(1/n_T+1/n_C)(1/m+1/r)(1−ρ))` (Burlig Eq. 2 equicorrelated special case): the panel path (T>2) and the 2×2 path — the m=r=1 case `√(2σ²(1/n_T+1/n_C)(1−ρ))`, reducing to Bloom Eq. 1's DiD analog at ρ=0 — validated by closed-form assertions, a literal-equicorrelated Monte-Carlo check, and base-R `qnorm` parity (incl. a 2×2 ρ>0 fixture).
1247+
- Allocation factor `f(1−f)` (50/50-optimal) and the exact two-tailed normal power function confirmed.
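
The multiplier and allocation terms in the verified-components list can be checked with a standard-library sketch. The function names are hypothetical and this is not the library's implementation. The only table cell the text labels is 2.49 (one-sided .05, power .80); treating 2.93 and 2.17 as the one-sided .05 cells at power .90 and .70 is an assumption based on the arithmetic.

```python
from statistics import NormalDist


def mde_multiplier(alpha: float = 0.05, power: float = 0.80,
                   two_sided: bool = True) -> float:
    """Normal multiplier M = z_{1-alpha/2} (or z_{1-alpha}) + z_{power}."""
    z = NormalDist().inv_cdf
    critical = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    return critical + z(power)


def allocation_term(n_total: int, treated_share: float) -> float:
    """1/n_T + 1/n_C rewritten as 1 / (N * f * (1 - f))."""
    f = treated_share
    return 1.0 / (n_total * f * (1.0 - f))


if __name__ == "__main__":
    print(round(mde_multiplier(0.05, 0.80, two_sided=False), 2))  # 2.49
    print(round(mde_multiplier(0.05, 0.90, two_sided=False), 2))  # 2.93
    print(round(mde_multiplier(0.05, 0.70, two_sided=False), 2))  # 2.17
```

Because `f * (1 - f)` peaks at `f = 0.5`, the allocation term is smallest under a 50/50 split, which is the sense in which that split is optimal.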
12471248

1248-
**Outstanding for promotion:**
1249-
- Paper review under `docs/methodology/papers/` (likely a combined review covering Bloom 1995 + Burlig et al. 2020)
1250-
- Dedicated `tests/test_methodology_power.py` with closed-form walk-through against `pwr::pwr.t.test()` and Burlig et al.'s clustered-DiD power formula
1251-
- Documented reference-validation harness against `pwr` / `DeclareDesign`
1252-
- Verify the REGISTRY Implementation Checklist (all five items currently unchecked)
1249+
**Corrections made (PR-B):**
1250+
- Panel variance switched from the Moulton `(1+(T−1)ρ)/T` factor (wrong period-scaling — ~4× too small at ρ=0, m=r=5 — and wrong ρ-sign) to the Burlig Eq. 2 equicorrelated `(1/m+1/r)(1−ρ)` form, in which within-unit correlation *lowers* the MDE. The two existing direction tests (`test_icc_effect`, `test_extreme_icc`) were inverted; tutorial `06_power_analysis.ipynb` was corrected. Input guards added for **all** designs (validated before the 2×2-vs-panel router): `n_pre≥1`, `n_post≥1`, `ρ ∈ [−1/(T−1), 1)`; the `(1−ρ)` factor also applies at T=2 (the m=r=1 case, Burlig footnote 11), so ρ is not silently ignored there.
1251+
- REGISTRY equation block rewritten (z not t; corrected SE / sample-size; removed the cluster-`m` and inverted-`R²` terms that matched neither code nor source).
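
The input guards listed above can be sketched as a single check that runs before any design-specific branching. The function name and error messages are hypothetical, not the ones in `power.py`. The lower bound `-1/(T-1)` is the smallest common correlation for which a T-period equicorrelated covariance matrix remains positive semi-definite.

```python
def validate_design(n_pre: int, n_post: int, rho: float) -> None:
    """Raise ValueError for period counts or rho outside the documented ranges."""
    if n_pre < 1:
        raise ValueError(f"n_pre must be >= 1, got {n_pre}")
    if n_post < 1:
        raise ValueError(f"n_post must be >= 1, got {n_post}")
    # T >= 2 is guaranteed here, so the division below is safe.
    T = n_pre + n_post
    lower = -1.0 / (T - 1)
    if not (lower <= rho < 1.0):
        raise ValueError(f"rho must lie in [{lower:.4f}, 1) for T={T}, got {rho}")
```

Running the check first means a two-period design with an out-of-range `rho` is rejected in the same way as a panel design, rather than being routed onward unchecked.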
1252+
1253+
**Deviations (documented in REGISTRY `## PowerAnalysis`):**
1254+
- Critical values use the **normal (z)** distribution (Bloom 1995) — a large-sample approximation to Burlig Eq. 1's t — labelled `**Deviation from R:**`.
1255+
- Only the **equicorrelated** special case of Burlig Eq. 2 is implemented (single ρ); the fully general SCR form (independent ψ^B/ψ^A/ψ^X) is not.
1256+
1257+
**Tests:** `tests/test_methodology_power.py` (Bloom Table 1; 2×2 + panel closed forms; Monte-Carlo; round-trip; validation guards; R parity) + `tests/test_power.py`. R goldens at `benchmarks/data/r_power_golden.json` (generator `benchmarks/R/generate_power_golden.R`).
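
A Monte-Carlo check of the equicorrelated variance can be sketched with the standard library alone. This is not the test in `tests/test_methodology_power.py`: the function names, sample sizes, and seed are illustrative, and the shared-plus-idiosyncratic error construction assumes `rho >= 0`.

```python
import random
from statistics import fmean, variance


def simulate_did(n_treated, n_control, n_pre, n_post, sigma, rho, rng):
    """One DiD estimate on pure-noise equicorrelated data (true effect is 0)."""
    shared_sd = sigma * rho ** 0.5        # unit-level component, cancels in post - pre
    idio_sd = sigma * (1 - rho) ** 0.5    # period-level component

    def unit_change():
        shared = rng.gauss(0.0, shared_sd)
        pre = fmean(shared + rng.gauss(0.0, idio_sd) for _ in range(n_pre))
        post = fmean(shared + rng.gauss(0.0, idio_sd) for _ in range(n_post))
        return post - pre

    treated = fmean(unit_change() for _ in range(n_treated))
    control = fmean(unit_change() for _ in range(n_control))
    return treated - control


def compare_variances(reps=3000, seed=0, n_treated=10, n_control=10,
                      n_pre=3, n_post=3, sigma=1.0, rho=0.5):
    """Return (empirical variance, closed-form equicorrelated variance)."""
    rng = random.Random(seed)
    draws = [simulate_did(n_treated, n_control, n_pre, n_post, sigma, rho, rng)
             for _ in range(reps)]
    closed_form = (sigma ** 2 * (1 / n_treated + 1 / n_control)
                   * (1 / n_pre + 1 / n_post) * (1 - rho))
    return variance(draws), closed_form


if __name__ == "__main__":
    empirical, closed_form = compare_variances()
    print(empirical, closed_form)
```

The shared unit component drops out of each unit's post-minus-pre change, which is why only the `(1 - rho)` share of the variance survives in the closed form.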
12531258

12541259
---
12551260

@@ -1407,14 +1412,13 @@ Promotion priority for the **In Progress** entries, ordered by what's blocked on
14071412

14081413
**Substantive-review-blocked (no methodology test file, no paper review, no R parity):**
14091414

1410-
1. **PowerAnalysis** — larger surface (MDE / power / sample size / simulation paths); REGISTRY already lists Bloom (1995) and Burlig et al. (2020) as primary sources; least urgent if the library's power-analysis utilities are not heavily used.
1411-
2. **PlaceboTests** — decide first whether to keep standalone or absorb into per-estimator diagnostic sections; methodologically lightweight either way.
1412-
3. **EfficientDiD** — no paper review on file; substantial implementation work (`tests/test_efficient_did.py` + validation tests) needs paper-vs-code audit against Chen, Sant'Anna & Xie (2025).
1413-
4. **ImputationDiD / TwoStageDiD** — natural pair (both single-treatment-effect-imputation methods). Each needs paper review, methodology file, R parity fixture against `didimputation` / `did2s`.
1415+
1. **PlaceboTests** — decide first whether to keep standalone or absorb into per-estimator diagnostic sections; methodologically lightweight either way.
1416+
2. **EfficientDiD** — no paper review on file; substantial implementation work (`tests/test_efficient_did.py` + validation tests) needs paper-vs-code audit against Chen, Sant'Anna & Xie (2025).
1417+
3. **ImputationDiD / TwoStageDiD** — natural pair (both single-treatment-effect-imputation methods). Each needs paper review, methodology file, R parity fixture against `didimputation` / `did2s`.
14141418

14151419
**Consolidation-pass-blocked (already has paper review or methodology file or R parity; mostly Verified Components walk-through):**
14161420

1417-
5. **Survey Data Support** — cross-cutting feature; promotion requires the per-estimator integration paths to be locked down first.
1421+
4. **Survey Data Support** — cross-cutting feature; promotion requires the per-estimator integration paths to be locked down first.
14181422

14191423
---
14201424
