Skip to content

refactor tcl pipeline#217

Open
solomon-negusse wants to merge 7 commits intomainfrom
create-tcl-zarrs
Open

refactor tcl pipeline#217
solomon-negusse wants to merge 7 commits intomainfrom
create-tcl-zarrs

Conversation

@solomon-negusse
Copy link
Member

@solomon-negusse solomon-negusse commented Mar 23, 2026

This PR refactors the TCL pipeline to run the global computation before running qc validation and to use the functional pattern used in all the other datasets' pipelines.

…indirection, update tests

- Convert TreeCoverLossTasks class to module-level functions in stages.py
- Remove tcl.py and TreeCoverLossPrefectTasks
- Update all flows and tests to use direct function calls
- QC and compute now run as separate steps in the flow
@solomon-negusse solomon-negusse marked this pull request as draft March 23, 2026 12:19
@solomon-negusse solomon-negusse changed the title refactor tcl, create zarrs refactor tcl pipeline Mar 23, 2026
@solomon-negusse solomon-negusse marked this pull request as ready for review March 23, 2026 18:32
Copy link
Member

@jterry64 jterry64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it mostly looks good other than 2 small comments, although the functions got removed around quite a bit so was hard to tell for stages.py. The new QC function where you pass the result_df looked good though. Thanks for reversing all this refactoring haha.

Did you use AI at all to help with this? Maybe @manukala6 could use something similar.

initial works 1 -> 10 for efficient flox compute

adapt min reduced to 1 for the long idle time during gee cal
adapt max upped to 75 for heavy TCL flox compute
@solomon-negusse
Copy link
Member Author

I think it mostly looks good other than 2 small comments, although the functions got removed around quite a bit so was hard to tell for stages.py.
The new QC function where you pass the result_df looked good though. Thanks for reversing all this refactoring haha.

Thanks for the review @jterry64 - I mainly made it consistent with the other pipelines to minimize the cognitive load of maintaining multiple patterns. We can implement the dependency injection pattern coherently across pipelines when/if we start feeling pain points of what we have.

Did you use AI at all to help with this? Maybe @manukala6 could use something similar.

Yes, this was mostly in co-pilot with claude opus 4.6 prompting it to follow the other pipeline pattern. Need to give it specific pipeline and narrow down the modules it needs to update to have it not waste tokens. I had this instructions.md that it helped put together to help with that:

---
description: "Use when refactoring the tree cover loss (TCL) pipeline to follow the direct-call pattern used by disturbance, carbon_flux, and other pipelines."
applyTo: "pipelines/tree_cover_loss/**, pipelines/test/**/tree_cover_loss/**"
---
# TCL Pipeline Refactor: Adopt Direct-Call Pattern

## Goal
Refactor the TCL pipeline to match the simpler pattern used by `disturbance/`, `carbon_flux/`, and other pipelines. Eliminate the `tasks`-as-class-namespace indirection.

## Target Pattern (Reference: disturbance pipeline)

The project uses a 3-layer pattern:

1. **`stages.py`** — Pure functions, no Prefect dependency. Contains `load_data()`, `setup_compute()`, `create_result_dataframe()`, `postprocess_result()`, QC logic.
2. **`*_tasks.py`** — Thin `@task`-decorated wrappers that delegate to `stages.py`.  
3. **`*_flow.py`** — `@flow` that calls tasks directly with `.with_options(name=...)`.

Reference files:
- `pipelines/disturbance/stages.py`
- `pipelines/disturbance/prefect_flows/dist_common_tasks.py`
- `pipelines/disturbance/prefect_flows/gadm_dist_alerts_by_drivers.py`

## Files to Modify

### 1. `pipelines/tree_cover_loss/stages.py`
- Convert `TreeCoverLossTasks` class methods to **module-level functions**.
- `load_data`, `setup_compute`, `create_result_dataframe`, `postprocess_result` become plain functions.
- QC methods (`qc_against_validation_source`, `get_sample_statistics`, `get_validation_statistics`) should accept `qc_feature_repository` and `gee_repository` as parameters instead of using `self`.
- `compute_zonal_stat` is just a passthrough to `common_stages.compute` — no need to keep it here; use `common_tasks.compute_zonal_stat` directly in the flow.
- Keep `_symmetric_relative_difference` as a module-level helper.
- Remove `LoaderType`, `ExpectedGroupsType`, `SaverType` type aliases (unused after refactor).

### 2. `pipelines/tree_cover_loss/prefect_flows/tcl_tasks.py`
- Remove `TreeCoverLossPrefectTasks` class.
- Keep the `@task`-decorated functions (`load_data`, `setup_compute`, `postprocess_result`, `qc_against_validation_source`).
- Each task calls the corresponding module-level function from `stages.py` directly (not through a class instance).

### 3. `pipelines/tree_cover_loss/prefect_flows/tcl.py`
- **Delete this file.** Its logic moves into `tcl_flow.py`.

### 4. `pipelines/tree_cover_loss/prefect_flows/tcl_flow.py`
- Inline the orchestration from `tcl.py` directly into the flow function.
- Call tasks directly with `.with_options(name=...)`, same as `gadm_dist_alerts_by_drivers.py`.
- The QC step becomes a direct task call at the start of the flow.

### 5. Tests — `pipelines/test/integration/tree_cover_loss/test_tcl_flow.py`
- `test_tcl_flow_real_data`: No change needed — it calls `umd_tree_cover_loss_flow()`.
- `test_tcl_flow_with_new_contextual_layers`: Replace `umd_tree_cover_loss(TreeCoverLossTasks(...))` with direct calls to stage functions, using `@patch` for `stages._load_zarr` and repository dependencies.
- `test_tcl_flow_with_bbox`: Replace `compute_tree_cover_loss(TreeCoverLossTasks(), bbox=...)` with direct calls to stage functions under `@patch`.

### 6. Tests — `pipelines/test/unit/tree_cover_loss/`
- Update imports: `from pipelines.tree_cover_loss.stages import TreeCoverLossTasks` → `from pipelines.tree_cover_loss import stages` and call `stages.setup_compute(...)` etc. directly.

## Constraints
- **Preserve all Prefect task tracking** — every step must remain a `@task`.
- **Preserve QC functionality** — `qc_against_validation_source` calls `compute_tree_cover_loss` internally (via `get_sample_statistics`). After refactor, QC should call stage functions directly or be a separate subflow.
- **No test coverage loss** — all existing assertions must still pass.
- **Do not change `stages.py` function signatures** beyond removing `self`.
- **Keep `stages.py` import of `compute_tree_cover_loss` for QC** — `get_sample_statistics` needs to run the full compute. After removing `tcl.py`, this logic should call stage functions directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants