refactor tcl pipeline by solomon-negusse · Pull Request #217 · wri/project-zeno-data-infra

solomon-negusse · 2026-03-23T12:19:06Z

This PR refactors the TCL pipeline to run the global computation before running qc validation and to use the functional pattern used in all the other datasets' pipelines.

…indirection, update tests - Convert TreeCoverLossTasks class to module-level functions in stages.py - Remove tcl.py and TreeCoverLossPrefectTasks - Update all flows and tests to use direct function calls - QC and compute now run as separate steps in the flow

jterry64

I think it mostly looks good other than 2 small comments, although the functions got removed around quite a bit so was hard to tell for stages.py. The new QC function where you pass the result_df looked good though. Thanks for reversing all this refactoring haha.

Did you use AI at all to help with this? Maybe @manukala6 could use something similar.

pipelines/test/integration/tree_cover_loss/test_tcl_flow.py

pipelines/run_updates.py

initial works 1 -> 10 for efficient flox compute adapt min reduced to 1 for the long idle time during gee cal adapt max upped to 75 for heavy TCL flox compute

solomon-negusse · 2026-03-24T11:09:59Z

I think it mostly looks good other than 2 small comments, although the functions got removed around quite a bit so was hard to tell for stages.py.
The new QC function where you pass the result_df looked good though. Thanks for reversing all this refactoring haha.

Thanks for the review @jterry64 - I mainly made it consistent with the other pipelines to minimize the cognitive load of maintaining multiple patterns. We can implement the dependency injection pattern coherently across pipelines when/if we start feeling pain points of what we have.

Did you use AI at all to help with this? Maybe @manukala6 could use something similar.

Yes, this was mostly in co-pilot with claude opus 4.6 prompting it to follow the other pipeline pattern. Need to give it specific pipeline and narrow down the modules it needs to update to have it not waste tokens. I had this instructions.md that it helped put together to help with that:

---
description: "Use when refactoring the tree cover loss (TCL) pipeline to follow the direct-call pattern used by disturbance, carbon_flux, and other pipelines."
applyTo: "pipelines/tree_cover_loss/**, pipelines/test/**/tree_cover_loss/**"
---
# TCL Pipeline Refactor: Adopt Direct-Call Pattern

## Goal
Refactor the TCL pipeline to match the simpler pattern used by `disturbance/`, `carbon_flux/`, and other pipelines. Eliminate the `tasks`-as-class-namespace indirection.

## Target Pattern (Reference: disturbance pipeline)

The project uses a 3-layer pattern:

1. **`stages.py`** — Pure functions, no Prefect dependency. Contains `load_data()`, `setup_compute()`, `create_result_dataframe()`, `postprocess_result()`, QC logic.
2. **`*_tasks.py`** — Thin `@task`-decorated wrappers that delegate to `stages.py`.  
3. **`*_flow.py`** — `@flow` that calls tasks directly with `.with_options(name=...)`.

Reference files:
- `pipelines/disturbance/stages.py`
- `pipelines/disturbance/prefect_flows/dist_common_tasks.py`
- `pipelines/disturbance/prefect_flows/gadm_dist_alerts_by_drivers.py`

## Files to Modify

### 1. `pipelines/tree_cover_loss/stages.py`
- Convert `TreeCoverLossTasks` class methods to **module-level functions**.
- `load_data`, `setup_compute`, `create_result_dataframe`, `postprocess_result` become plain functions.
- QC methods (`qc_against_validation_source`, `get_sample_statistics`, `get_validation_statistics`) should accept `qc_feature_repository` and `gee_repository` as parameters instead of using `self`.
- `compute_zonal_stat` is just a passthrough to `common_stages.compute` — no need to keep it here; use `common_tasks.compute_zonal_stat` directly in the flow.
- Keep `_symmetric_relative_difference` as a module-level helper.
- Remove `LoaderType`, `ExpectedGroupsType`, `SaverType` type aliases (unused after refactor).

### 2. `pipelines/tree_cover_loss/prefect_flows/tcl_tasks.py`
- Remove `TreeCoverLossPrefectTasks` class.
- Keep the `@task`-decorated functions (`load_data`, `setup_compute`, `postprocess_result`, `qc_against_validation_source`).
- Each task calls the corresponding module-level function from `stages.py` directly (not through a class instance).

### 3. `pipelines/tree_cover_loss/prefect_flows/tcl.py`
- **Delete this file.** Its logic moves into `tcl_flow.py`.

### 4. `pipelines/tree_cover_loss/prefect_flows/tcl_flow.py`
- Inline the orchestration from `tcl.py` directly into the flow function.
- Call tasks directly with `.with_options(name=...)`, same as `gadm_dist_alerts_by_drivers.py`.
- The QC step becomes a direct task call at the start of the flow.

### 5. Tests — `pipelines/test/integration/tree_cover_loss/test_tcl_flow.py`
- `test_tcl_flow_real_data`: No change needed — it calls `umd_tree_cover_loss_flow()`.
- `test_tcl_flow_with_new_contextual_layers`: Replace `umd_tree_cover_loss(TreeCoverLossTasks(...))` with direct calls to stage functions, using `@patch` for `stages._load_zarr` and repository dependencies.
- `test_tcl_flow_with_bbox`: Replace `compute_tree_cover_loss(TreeCoverLossTasks(), bbox=...)` with direct calls to stage functions under `@patch`.

### 6. Tests — `pipelines/test/unit/tree_cover_loss/`
- Update imports: `from pipelines.tree_cover_loss.stages import TreeCoverLossTasks` → `from pipelines.tree_cover_loss import stages` and call `stages.setup_compute(...)` etc. directly.

## Constraints
- **Preserve all Prefect task tracking** — every step must remain a `@task`.
- **Preserve QC functionality** — `qc_against_validation_source` calls `compute_tree_cover_loss` internally (via `get_sample_statistics`). After refactor, QC should call stage functions directly or be a separate subflow.
- **No test coverage loss** — all existing assertions must still pass.
- **Do not change `stages.py` function signatures** beyond removing `self`.
- **Keep `stages.py` import of `compute_tree_cover_loss` for QC** — `get_sample_statistics` needs to run the full compute. After removing `tcl.py`, this logic should call stage functions directly.

solomon-negusse marked this pull request as draft March 23, 2026 12:19

solomon-negusse added 3 commits March 23, 2026 16:00

have qc validation use global compute results

d2fc8c8

fix qc test for unknown driver case

e2c511b

revert to 10/50 worker min/max count

f60598a

solomon-negusse changed the title ~~refactor tcl, create zarrs~~ refactor tcl pipeline Mar 23, 2026

Merge branch 'main' into create-tcl-zarrs

fd56818

solomon-negusse marked this pull request as ready for review March 23, 2026 18:32

solomon-negusse requested review from jterry64 and manukala6 March 23, 2026 18:32

solomon-negusse assigned gtempus, danscales and dmannarino and unassigned gtempus, danscales and dmannarino Mar 23, 2026

solomon-negusse requested review from danscales, dmannarino and gtempus March 23, 2026 18:33

jterry64 requested changes Mar 23, 2026

View reviewed changes

pipelines/test/integration/tree_cover_loss/test_tcl_flow.py Outdated Show resolved Hide resolved

pipelines/run_updates.py Show resolved Hide resolved

solomon-negusse added 2 commits March 24, 2026 13:49

just call the flow in test

e5932a6

adjust workers

4325bed

initial works 1 -> 10 for efficient flox compute adapt min reduced to 1 for the long idle time during gee cal adapt max upped to 75 for heavy TCL flox compute

solomon-negusse requested a review from jterry64 March 24, 2026 11:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor tcl pipeline#217

refactor tcl pipeline#217
solomon-negusse wants to merge 7 commits intomainfrom
create-tcl-zarrs

solomon-negusse commented Mar 23, 2026 •

edited

Loading

Uh oh!

jterry64 left a comment

Uh oh!

Uh oh!

Uh oh!

solomon-negusse commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

solomon-negusse commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jterry64 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

solomon-negusse commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

solomon-negusse commented Mar 23, 2026 •

edited

Loading