NCAR · mattldawson · Apr 9, 2026 · Apr 9, 2026 · Apr 9, 2026 · Apr 9, 2026
diff --git a/.github/issues/rosenbrock_dae_algebraic_error.md b/.github/issues/rosenbrock_dae_algebraic_error.md
@@ -0,0 +1,150 @@
+# Rosenbrock DAE Error Estimator: Algebraic Variable Overshoot
+
+## Summary
+
+The Rosenbrock DAE4 error estimator produces near-zero error values for algebraic
+variables, causing the step-acceptance criterion to ignore large errors in
+constraint-derived quantities. This allows algebraic "balance" variables (e.g.,
+SO₂(g) in a total-sulfur conservation constraint) to overshoot through zero
+without triggering step rejection.
+
+## Reproducer
+
+See `python/test/integration/test_dae_algebraic_overshoot.py` for a minimal
+chemistry system that demonstrates the issue.
+
+**System**: H₂O₂-pathway sulfate production in cloud droplets.
+- 10 species: 3 differential (SO₄²⁻, SO₂OOH⁻, H₂O), 7 algebraic (SO₂, H₂O₂,
+  SO₂(aq), H₂O₂(aq), HSO₃⁻, OH⁻, H⁺)
+- SO₂(g) is algebraic via `LinearConstraint`: SO₂ = C − SO₂(aq) − HSO₃⁻ − SO₂OOH⁻ − SO₄²⁻
+- Fast reactions convert S(IV) → S(VI) until SO₂(g) should approach zero
+
+**Observed**: With a single 30 s solve and default tolerances (atol=1e-3 for gas,
+1e-14 for aqueous, rtol=1e-6):
+- SO₂(g) = −6.5e-8 (should be ≥ 0)
+- 47 internal steps, 0 rejections
+- Total sulfur is conserved (constraint works correctly)
+- The solver never rejects a step despite SO₂(g) changing sign
+
+**Key evidence**: Changing `atol` for SO₂ from 1e-3 to 1e-12 produces **identical
+results** (same 47 steps, same SO₂ value). This proves the error vector entry for
+SO₂ is effectively zero — the tolerance is irrelevant.
+
+## Root Cause Analysis
+
+### The error vector is ~0 for algebraic variables
+
+The DAE4 error estimate is `Yerror = K[3]` (only `e[3] = 1.0`, all others zero).
+
+For algebraic variable `i` (mass matrix diagonal `M_ii = 0`):
+
+1. **Stage coupling is zeroed**: In the RHS accumulation:
+   ```
+   K[stage][i] += (c_jk/H) * M_ii * K[j][i]
+   ```
+   Since `M_ii = 0`, the coupling terms from previous stages are zero for algebraic rows.
+
+2. **K[3] is determined by the linear solve**: After coupling, the system
+   `(M/(γH) − J) · K[3] = RHS` is solved. For algebraic row `i`:
+   ```
+   −J_i · K[3] = f_i(Y + 2·K[0] + K[2])
+   ```
+   where `f_i` is the constraint forcing term.
+
+3. **The constraint forcing at the evaluation point is small**: The constraint
+   initialization step ensures `f_i ≈ 0` at the initial point. The evaluation
+   point `Y + 2·K[0] + K[2]` may deviate, but the Jacobian coupling between the
+   constraint row and all other variables means the linear solve distributes the
+   residual across many variables, resulting in a small `K[3]` for any single
+   algebraic variable.
+
+4. **The error norm contribution is negligible**: Since `Yerror[i] ≈ 0`:
+   ```
+   errors_over_scale = Yerror[i] / (atol[i] + rtol * ymax[i]) ≈ 0
+   ```
+   No matter what `atol` is set to.
+
+### Why tight differential tolerances help indirectly
+
+With `atol = 1e-16` and `rtol = 1e-12` for **all** species, the solver takes 690k
+steps. This works not because the algebraic error is checked, but because:
+- SO₄²⁻ (differential) has a tighter tolerance
+- The solver must resolve the SO₄²⁻ trajectory with high accuracy
+- This forces small enough steps that SO₄²⁻ never overshoots the total-S budget
+- SO₂(g) stays positive as a consequence
+
+This is impractical for real applications — 690k steps for 30 s of chemistry.
+
+## What the current "fix" (commit c20507ad) does
+
+The fix removes the `if (upper_left_identity_diagonal_[i] > 0.0)` skip from
+`NormalizedError()`, and replaces `num_ode_variables` with `Y.NumRows() * Y.NumColumns()`.
+
+This is correct in principle but has **no practical effect** because `Yerror[i] ≈ 0`
+for algebraic variables. The skip was a symptom, not the cause.
+
+## Proposed fix options
+
+### Option A: Compute algebraic error from the constraint residual
+
+After computing `Ynew`, evaluate the constraint residual `g(Ynew)` and use it as
+the error for algebraic variables:
+
+```cpp
+// After: Ynew.Copy(Y); for (...) Ynew.Axpy(m_[stage], K[stage]);
+// Compute constraint residual for error estimation
+if (has_constraints) {
+    DenseMatrixPolicy constraint_residual;
+    constraint_residual.Fill(0);
+    constraints_.AddForcingTerms(Ynew, state.custom_rate_parameters_, constraint_residual);
+    // Replace Yerror entries for algebraic variables with constraint residual
+    for (i_cell...) for (i_var...) {
+        if (state.upper_left_identity_diagonal_[i_var] == 0.0) {
+            Yerror[i_cell][i_var] = constraint_residual[i_cell][i_var];
+        }
+    }
+}
+```
+
+**Pro**: Directly measures what we care about — how well the constraint is satisfied.
+**Con**: The residual may not scale correctly compared to the Rosenbrock error estimate.
+
+### Option B: Use the "state constraint" error estimate
+
+For DAEs, the error in algebraic variables `y_a` is related to the error in
+differential variables `y_d` through the constraint Jacobian:
+
+```
+δy_a ≈ −(∂g/∂y_a)^{-1} · (∂g/∂y_d) · δy_d
+```
+
+This could be computed from the Jacobian blocks. For linear constraints like
+total-S conservation, this is exact: `δSO₂ = −δSO₄²⁻ − δSO₂OOH⁻`.
+
+**Pro**: Theoretically sound, relates algebraic error to differential error.
+**Con**: Requires extracting Jacobian blocks, more complex implementation.
+
+### Option C: Solve the "over-determined" error system
+
+Replace the Rosenbrock error computation for algebraic rows with a secondary
+solve using the constraint Jacobian, as described in Hairer & Wanner (1996),
+Section VI.6.
+
+### Option D: Rate damping (user-level workaround)
+
+MIAM's `max_halflife` parameter caps reaction rates so that no reactant is
+consumed faster than a specified half-life. This prevents the differential
+variable (SO₄²⁻) from overshooting, which indirectly keeps SO₂(g) positive.
+
+**Pro**: Already implemented, user can apply immediately.
+**Con**: Workaround, not a solver-level fix. Slightly perturbs the kinetics.
+
+## Recommended approach
+
+**Option A** is the most practical and directly addresses the root cause. The
+constraint residual at `Ynew` is already available (just call `AddForcingTerms`
+again) and provides a meaningful error measure for algebraic variables.
+
+For the constraint residual to work as an error estimate, it should be scaled
+appropriately. One approach: use the L2 norm of the residual per algebraic variable,
+normalized by `atol + rtol * |y_a|`.
diff --git a/.github/prompts/cam_cloud_chemistry_status.md b/.github/prompts/cam_cloud_chemistry_status.md
@@ -0,0 +1,89 @@
+# CAM Cloud Chemistry Tutorial (Tutorial 14): Status & Next Steps
+
+## Current state
+
+Tutorial 14 (`tutorials/14. cam_cloud_chemistry.ipynb`) uses a **total-sulfur
+constraint** (including SO₄²⁻) with rate damping (`max_halflife`) and
+per-species tolerances.
+
+**MICM commit `7e46f9b2`** implements step-change error estimation for
+algebraic variables (`Yerror[a] = Ynew[a] - Y[a]`), making algebraic
+tolerances control step acceptance. This replaces the previous near-zero
+embedded error for algebraic rows.
+
+### What works today
+
+- Full SO₂/H₂O₂ cloud chemistry system (gas + aqueous species)
+- Aqueous oxidation reactions (H₂O₂ pathway R1a/R1b)
+- Algebraic constraints (Henry's Law, dissociation, conservation, charge balance)
+- Total-sulfur conservation constraint (SO₂ + SO₂_aq + HSO₃⁻ + SO₂OOH⁻ + SO₄²⁻)
+- Dummy `emitted_SO₂` tracker for SO₂ source term in mass budget
+- Rate damping via `max_halflife` parameter
+- Solvent damping via `solvent_damping_epsilon` (default 1e-20)
+- Per-species absolute tolerances (1e-14 aqueous, 1e-9 gas algebraic)
+- Validation cells for total-S conservation, charge balance, Henry's Law
+- Step-change error estimate for algebraic variables (MICM fix)
+
+### Tolerance strategy
+
+With the step-change error fix, algebraic tolerances now matter:
+
+| Algebraic atol | SO₂ sign | Steps/30s | Practical? |
+|---|---|---|---|
+| 1e-3 to 1e-8 | negative | 13–48 | fast but overshoots |
+| 1e-10 | negative | ~3,500 | moderate |
+| 1e-12 | **positive** | ~131k | expensive |
+
+For the tutorial, the combination of moderate algebraic tolerances (1e-8 to
+1e-10) plus rate damping (`max_halflife`) provides a practical balance: the
+rate damping limits per-step overshoot and the step-change error controls
+step acceptance.
+
+### Solver parameters
+
+```python
+RosenbrockSolverParameters(
+    absolute_tolerances=[1e-9 (gas algebraic), 1e-14 (aqueous)],
+    constraint_init_max_iterations=100,
+    constraint_init_tolerance=1e-8,
+    max_number_of_steps=5000,
+)
+```
+
+---
+
+## Remaining work
+
+### 1. Tutorial tolerance tuning
+
+With the step-change error fix, re-evaluate whether the tutorial's per-species
+tolerances and `max_halflife` settings give the best balance of accuracy and
+performance. Consider testing:
+- `alg_atol=1e-10` with `max_halflife=10` — may prevent overshoot at ~3.5k steps
+- `alg_atol=1e-8` with `max_halflife=5` — fewer steps, rate damping does the work
+
+### 2. Run the full test suite
+
+```bash
+pytest python/test/integration/test_miam_cloud_chemistry.py -v   # 31 tests
+pytest python/test/integration/test_dae_algebraic_overshoot.py -v  # 3 tests
+```
+
+### 3. Tutorial notebook validation
+
+Run the tutorial end-to-end to verify all cells produce correct output with
+the new MICM.
+
+---
+
+## File inventory
+
+| File | Purpose |
+|---|---|
+| `tutorials/14. cam_cloud_chemistry.ipynb` | The tutorial notebook |
+| `python/test/integration/test_miam_cloud_chemistry.py` | Integration tests (31) |
+| `python/test/integration/test_dae_algebraic_overshoot.py` | Overshoot tests (3) — sensitivity, loose, tight |
+| `python/test/integration/diagnose_dae_error.py` | Diagnostic script (historical) |
+| `.github/issues/rosenbrock_dae_algebraic_error.md` | Issue doc (historical — fix landed) |
+| `.github/prompts/dae_algebraic_overshoot.md` | Step-change error description |
+| `.github/prompts/cam_cloud_chemistry_status.md` | This file |
diff --git a/.github/prompts/dae_algebraic_overshoot.md b/.github/prompts/dae_algebraic_overshoot.md
@@ -0,0 +1,77 @@
+# DAE Rosenbrock Error Estimator: Algebraic Variable Step-Change Error
+
+## Summary
+
+MICM's Rosenbrock DAE solver now uses **step-change error estimation** for
+algebraic variables. After computing the embedded error, the solver replaces
+each algebraic entry with `Yerror[a] = Ynew[a] - Y[a]`. This means algebraic
+variable tolerances now control step acceptance.
+
+**MICM commit:** `7e46f9b2f9b968b56cebb6f1272005409b30df3d`
+**MIAM commit:** `002eadfeac2deb7b5bae84c1addf2d09a00f7544`
+
+## Background
+
+Previously, the embedded error formula (`Yerror = Σ e_i·K_i`) produced
+near-zero entries for algebraic rows because the mass matrix diagonal
+`M_ii = 0` zeroed out inter-stage coupling terms. The solver was completely
+insensitive to algebraic tolerances. The fix replaces the near-zero embedded
+error with the actual step change for algebraic variables.
+
+## Behavior
+
+With the step-change error, algebraic tolerances control how much the variable
+can change per internal step before the solver rejects and retries smaller:
+
+| Algebraic atol | SO₂(g) after 30 s | Steps | Notes |
+|---|---|---|---|
+| 1e-3 (loose) | −6.50e-8 | 13 | Overshoot — too few steps |
+| 1e-8 | −6.50e-8 | 48 | More steps, still overshoots |
+| 1e-10 | −6.25e-8 | 3,470 | Significant step reduction |
+| 1e-12 (tight) | +1.73e-8 | 130,766 | No overshoot — SO₂ positive |
+| 1e-14 | +1.19e-7 | 391,987 | Very tight — expensive |
+
+**Key insight:** The guide recommends `atol = 1e-8 to 1e-10` for algebraic
+variables. At these tolerances, SO₂ still goes slightly negative (within atol)
+in this stiff system. For guaranteed non-negativity, `atol ≤ 1e-12` is needed
+but costs ~130k steps per 30 s timestep.
+
+## Practical tolerance strategy
+
+For cloud chemistry with conservation constraints:
+
+```python
+ordering = micm.create_state().get_species_ordering()
+abs_tols = [1e-14] * len(ordering)  # tight default for differential species
+
+# Per the guide: 1e-8 to 1e-10 for algebraic species
+algebraic_species = {"SO2", "H2O2", "CLOUD.AQUEOUS.SO2_aq", ...}
+for name, idx in ordering.items():
+    if name in algebraic_species:
+        abs_tols[idx] = 1e-10  # or 1e-8 for speed
+```
+
+If the balance variable going slightly negative (within atol) is unacceptable,
+combine with rate damping (`max_halflife`) to limit per-step overshoot.
+
+## Test case
+
+**File:** `python/test/integration/test_dae_algebraic_overshoot.py`
+
+Three tests:
+
+- `test_algebraic_tolerance_sensitivity`: Proves that changing atol for
+  algebraic species from 1e-3 to 1e-10 produces >10× more steps
+- `test_negative_so2_with_loose_algebraic_tolerances`: SO₂ goes negative
+  with loose algebraic atol (1e-3), confirming the overshoot mechanism
+- `test_positive_so2_with_tight_algebraic_tolerances`: SO₂ stays positive
+  with tight algebraic atol (1e-12), confirming the fix prevents overshoot
+
+```bash
+pytest python/test/integration/test_dae_algebraic_overshoot.py -v
+```
+
+## Reference
+
+See the MICM developer guide at `micm/.github/prompts/dae_algebraic_tolerance_guide.md`
+for the full specification and C++ examples.