Objective: Federated learning (FL) enables collaborative model training without centralizing sensitive data, making it particularly relevant for medical imaging. Yet, its deployment in medical image segmentation is challenged by real-world data imperfections across institutions, including label noise manifested as contour disagreement, missing or additional structures, or confused labels. Although federated noisy label learning (FNLL) aims to mitigate these effects, existing studies commonly evaluate methods on few datasets, simplified settings, and synthetic noise types. We address the lack of standardized benchmarking resources for FNLL in cross-silo medical image segmentation by introducing a benchmark suite combining diverse real-world noisy datasets, deployment-relevant client-noise scenarios, and label-noise-targeted evaluation.
Materials & Methods: The suite combines the curation of diverse, real-world noisy medical image segmentation datasets with a comprehensive federated segmentation framework including various client-noise scenarios and noise-targeted evaluation. To demonstrate its capabilities, we compare representative FNLL methods across approaches, including noise-aware aggregation, robust personalization, label correction, and sample selection.
Results: In-depth data analysis shows that real-world segmentation label noise occurs both in isolation and in combinations of characterized noise types. The benchmark identifies FedSelect as the strongest overall FNLL method, underlines FedAvg as a competitive baseline, and provides an actionable decision guide to support selection of suitable FNLL strategies based on label-noise type and client-noise scenario.
Discussion & Conclusion: The presented suite provides a realistic and discriminative basis for FNLL evaluation in medical image segmentation and establishes a reusable foundation for fair benchmarking, dataset-specific label-noise characterization, and future method development under realistic federated settings. Code is available at https://github.com/MIC-DKFZ/FedSegNoiseBench.
Figure 1: Segmentation label noise of various forms degrades model training and poses a particular challenge in FL, where noisy annotations are distributed across clients and cannot be centrally inspected. While FNLL methods aim to address this problem, existing literature is often limited to few and synthetic noise types, restricted client-noise scenarios, and narrow data scope. Our benchmark suite closes this gap by combining diverse real-world noisy segmentation datasets, a federated benchmarking framework, and comprehensive noise-targeted evaluation, thereby enabling FNLL method selection, dataset characterization, benchmarking on new data, and evaluation of newly developed FNLL methods.
Will be updated upon acceptance of manuscript "Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection".
Setup
Create and populate the Python environment from the repository requirements:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
The repository also provides Make targets for the local development environment:
make
make update
make clean
Set the nnU-Net paths before preparing data or running experiments:
export nnUNet_raw="/path/to/nnUNet_raw"
export nnUNet_preprocessed="/path/to/nnUNet_preprocessed"
export nnUNet_results="/path/to/nnUNet_results"
Data
Download the raw source datasets from their original providers and keep them outside the repository:
- LIDC: images and labels
- RIGA: images and labels
- Gleason: images and labels
- MouseTumor: images and labels
- MMIS: images and labels
- MMIA: images and labels
The inherently noisy segmentation datasets are prepared to obtain a noisy and a clean (consensus or expert) version of each dataset.
This is done by calling their respective prepare.py script:
src/data/lidc-idri/prepare.py
src/data/riga/prepare.py
src/data/gleasonxai/prepare.py
src/data/mouse-tumor/prepare.py
src/data/mmis/prepare.py
src/data/mama-mia/prepare.py
$nnUNet_raw is read from the environment (no CLI arg needed).
The script creates one nnUNet dataset per FL client, split by data source.
Prepare LIDC dataset
Requires pylidc configured with the local LIDC-IDRI DICOM files (~/.pylidcrc) and the TCIA download manifest CSV.
Prepare data to clean FL clients (pixelwise annotator majority consensus):
python src/data/lidc-idri/prepare.py \
--raw_data_path /path/to/lidc-working-dir \
--single_seg_mode annotator_majority \
--dataset_ids "041 042 043 044" \
--lidc_manifest /path/to/tcia_manifest/metadata.csv
Prepare data to noisy FL clients (randomly selected rater per nodule):
python src/data/lidc-idri/prepare.py \
--raw_data_path /path/to/lidc-working-dir \
--single_seg_mode random \
--dataset_ids "045 046 047 048" \
--lidc_manifest /path/to/tcia_manifest/metadata.csv
$nnUNet_raw is read from the environment (no CLI arg needed).
Intermediate files are written to <raw_data_path>/nifti/ and <raw_data_path>/single_seg_<mode>/.
4 FL clients, one per CT scanner manufacturer (split via FLamby LIDC metadata bundled at src/data/lidc-idri/flamby_lidc_federated_split_metadata.csv).
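The pixelwise annotator majority consensus used for the clean labels can be illustrated with a minimal sketch. This is not the code in prepare.py: the function name `majority_vote`, the flattened-list mask representation, and the tie rule (a tie counts as background) are assumptions made for illustration.

```python
from typing import List, Sequence


def majority_vote(rater_masks: Sequence[Sequence[int]]) -> List[int]:
    """Binary pixelwise majority vote over flattened rater masks.

    A pixel becomes foreground (1) only when strictly more than half of the
    raters marked it. The tie rule is an assumption of this sketch.
    """
    if not rater_masks:
        raise ValueError("need at least one rater mask")
    n_pixels = len(rater_masks[0])
    if any(len(mask) != n_pixels for mask in rater_masks):
        raise ValueError("all rater masks must have the same number of pixels")
    n_raters = len(rater_masks)
    consensus = []
    for pixel_votes in zip(*rater_masks):
        # Count how many raters labelled this pixel as foreground.
        foreground_votes = sum(1 for vote in pixel_votes if vote != 0)
        consensus.append(1 if 2 * foreground_votes > n_raters else 0)
    return consensus
```

With real data the masks would typically be array volumes; plain lists keep the sketch free of dependencies.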
Prepare RIGA dataset
Prepare data to clean FL clients (annotator majority consensus):
python src/data/riga/prepare.py \
--raw_data_path /path/to/riga \
--single_seg_mode annotator_majority \
--dataset_ids "300 301 302"
Prepare data to noisy FL clients (random annotator per sample):
python src/data/riga/prepare.py \
--raw_data_path /path/to/riga \
--single_seg_mode random \
--dataset_ids "303 304 305"
$nnUNet_raw is read from the environment (no CLI arg needed).
Intermediate files are written to <raw_data_path>/img_segmask_tif/ and <raw_data_path>/single_seg_<mode>/.
3 FL clients, one per sub-dataset (BinRushed, Magrabia, MESSIDOR).
Prepare GleasonXAI dataset
Prepare data to clean FL clients (STAPLE consensus across raters):
python src/data/gleasonxai/prepare.py \
--raw_data_dir /path/to/gleasonxai \
--single_seg_mode consensus_staple \
--dataset_ids "436 437 438"
Prepare data to noisy FL clients (random rater per sample):
python src/data/gleasonxai/prepare.py \
--raw_data_dir /path/to/gleasonxai \
--single_seg_mode random_rater \
--dataset_ids "439 440 441"
$nnUNet_raw is read from the environment (no CLI arg needed).
Intermediate files are written to <raw_data_dir>/generated_labels/ and <raw_data_dir>/converted_images/.
Only the Harvard Dataverse subset is used (476 samples), split uniformly across 3 FL clients (159/158/159).
Prepare MouseTumor dataset
Prepare data to clean FL clients (STAPLE consensus):
python src/data/mouse-tumor/prepare.py \
--raw_data_path /path/to/mouse-tumor \
--single_seg_mode staple \
--dataset_ids "500 501 502 503 504"
Prepare data to noisy FL clients (random annotator per sample):
python src/data/mouse-tumor/prepare.py \
--raw_data_path /path/to/mouse-tumor \
--single_seg_mode random \
--dataset_ids "505 506 507 508 509"
$nnUNet_raw is read from the environment (no CLI arg needed).
5 FL clients, samples distributed across clients by annotator assignment.
Prepare MMIS dataset
Prepare data to clean FL clients:
```bash
python src/data/mmis/prepare.py \
--raw_data_path /path/to/mmis \
--single_seg_mode majority \
--dataset_ids "700 701 702 703"
```
Prepare data to noisy FL clients:
python src/data/mmis/prepare.py \
--raw_data_path /path/to/mmis \
--single_seg_mode rater \
--dataset_ids "704 705 706 707"
$nnUNet_raw is read from the environment (no CLI arg needed).
Intermediate NIfTI files are written to <raw_data_path>/nifti/.
4 FL clients, one per annotator (label_a1–label_a4).
Prepare MAMA-MIA dataset
Prepare data to clean FL clients (expert segmentations):
python src/data/mama-mia/prepare.py \
--raw_data_path /path/to/mama-mia \
--single_seg_mode expert \
--dataset_ids "600 601 602 603"
Prepare data to noisy FL clients (automatic segmentations):
python src/data/mama-mia/prepare.py \
--raw_data_path /path/to/mama-mia \
--single_seg_mode automatic \
--dataset_ids "604 605 606 607"
$nnUNet_raw is read from the environment (no CLI arg needed).
4 FL clients, one per data source (DUKE, ISPY1, ISPY2, NACT).
These preparation scripts convert the datasets into the nnU-Net raw folder structure, and they have to follow the nnU-Net dataset naming conventions.
One nnU-Net dataset is created per FL client and noise state (noisy or clean version of the dataset).
These raw nnU-Net datasets have to be located in nnUNet_raw="/path/to/nnUNet_raw".
In the end, a dataset subdivided into 3 FL clients should be structured like this:
nnUNet_raw/
Dataset001_<DatasetName>_clean_client0/
imagesTr/
labelsTr/
dataset.json
Dataset002_<DatasetName>_clean_client1/
<same as above>
Dataset003_<DatasetName>_clean_client2/
<same as above>
Dataset004_<DatasetName>_noisy_client0/
<same as above>
Dataset005_<DatasetName>_noisy_client1/
<same as above>
Dataset006_<DatasetName>_noisy_client2/
<same as above>
The nnU-Net model is self-configuring and adapts architecture, patch size, and training settings to the processed data. In FL, independently planned clients can produce incompatible model architectures, which breaks weight aggregation.
For this reason, planning and preprocessing must also be adapted to the
federated setting. The script src/data/utils/nnunet_fed_preparation.py
extracts fingerprints per client, averages them centrally, plans one compatible
experiment configuration, and preprocesses all participating clients with this
shared plan.
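The central averaging step can be pictured with a small sketch. It is not the logic of nnunet_fed_preparation.py: the fingerprint keys `median_spacing` and `num_cases`, the function name, and the choice to weight clients by case count are all assumptions made for illustration. The point is only that a single shared statistic, rather than one per client, feeds the planner, so every client ends up with the same configuration.

```python
from typing import Dict, List, Sequence


def average_median_spacing(client_fingerprints: Sequence[Dict]) -> List[float]:
    """Case-weighted mean of per-client median voxel spacings.

    Each fingerprint is a hypothetical dict with the keys 'median_spacing'
    (one float per image axis) and 'num_cases' (number of local cases).
    """
    total_cases = sum(fp["num_cases"] for fp in client_fingerprints)
    if total_cases <= 0:
        raise ValueError("fingerprints must cover at least one case")
    n_axes = len(client_fingerprints[0]["median_spacing"])
    shared = [0.0] * n_axes
    for fp in client_fingerprints:
        # Clients with more cases contribute more to the shared statistic.
        weight = fp["num_cases"] / total_cases
        for axis, spacing in enumerate(fp["median_spacing"]):
            shared[axis] += weight * spacing
    return shared
```

Only summary statistics leave the clients in this picture; the images and labels stay local.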
To obtain the preprocessed data in nnUNet_preprocessed="/path/to/nnUNet_preprocessed", set the nnUNet_raw and nnUNet_preprocessed environment variables:
export nnUNet_raw="/path/to/nnUNet_raw"
export nnUNet_preprocessed="/path/to/nnUNet_preprocessed"
To plan and preprocess the datasets of our exemplary FL clients:
python3 ./src/data/utils/nnunet_fed_preparation.py \
--dataset_ids "001 002 003" \
--configuration "3d_fullres" \
--planner "nnUNetPlannerResEncM" \
--plans_name "nnUNetResEncUNetMPlans" \
--verify_dataset_integrity
Prior to benchmarking FNLL methods on the noisy data, we analyze the data in two ways:
- Comparative analysis of the multi-rater label masks versus the obtained clean consensus mask (only possible for multi-rater datasets).
- Quantification of label noise.
First compute the multi-rater consensus analysis:
python3 ./src/data/data_analysis/analyze_multirater_consensus.py \
--unified_dir /path/to/all_masks
Or with separate directories and dataset filtering:
python3 ./src/data/data_analysis/analyze_multirater_consensus.py \
--dataset_ids "001 002 003" \
--multirater_dir /path/to/multirater_masks \
--consensus_dir /path/to/consensus_masks
Then visualize the resulting agreement and error metrics:
python3 ./src/data/data_analysis/visualize_multirater_consensus_violin.py \
--input_json ./results/consensus_analysis/<YOUR-DATASET>/multirater_consensus.json \
--output_png ./results/consensus_analysis/<YOUR-DATASET>/fk_dice_hd95_if1_clsconf.png
The visualization covers class-wise Fleiss' kappa, Dice, HD95, instance-level F1, and class-confusion statistics against the consensus mask. Together these quantify rater variability, volume-based label alignment, and the three segmentation label noise types: contour variations, missed or additional target structures, and confused class labels of target structures.
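As an illustration of the volume-based alignment measure, the Dice overlap between one rater mask and the consensus mask can be sketched as follows. The function name `dice_score`, the flattened-list representation, and the convention that two empty masks score 1.0 are assumptions of this sketch; the analysis scripts may handle these details differently.

```python
from typing import Sequence


def dice_score(
    rater_mask: Sequence[int], consensus_mask: Sequence[int], label: int = 1
) -> float:
    """Dice overlap of one class between two flattened label masks."""
    if len(rater_mask) != len(consensus_mask):
        raise ValueError("masks must have the same number of voxels")
    rater_fg = [value == label for value in rater_mask]
    consensus_fg = [value == label for value in consensus_mask]
    intersection = sum(r and c for r, c in zip(rater_fg, consensus_fg))
    denominator = sum(rater_fg) + sum(consensus_fg)
    # Convention assumed here: two empty masks agree perfectly.
    return 1.0 if denominator == 0 else 2.0 * intersection / denominator
```

A low Dice against the consensus points to contour disagreement or missed structures, while the instance-level and class-confusion statistics are needed to tell those cases apart.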
To quantify label noise, first compare the noisy masks against the consensus or clean reference masks:
python3 ./src/data/data_analysis/analyze_noise_clean_noisy.py \
--clean_dataset_ids "001 002 003" \
--noisy_dataset_ids "004 005 006"
Generate per-class boxplots:
python3 ./src/data/data_analysis/visualize_perclass_boxplots.py \
--input_json ./results/noise_analysis/noise_analysis_results_clean<DATASET-IDS-OF-YOUR-DATASET>.json \
--output_dir ./results/noise_analysis/<YOUR-DATASET>/
Run benchmarking
Run FL experiments with src/fed/main.py. The core switches are the dataset
IDs, client count, FL rounds, local epochs, trainer, and FNLL method.
To run the experiments, set the nnUNet_preprocessed and nnUNet_results environment variables:
export nnUNet_preprocessed="/path/to/nnUNet_preprocessed"
export nnUNet_results="/path/to/nnUNet_results"
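A small pre-flight check can catch unset or mistyped paths before a long federated run starts. The helper below is a sketch and not part of the repository; the name `missing_nnunet_paths` is hypothetical, and it relies only on the two environment variable names given above.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_nnunet_paths(
    required: Iterable[str] = ("nnUNet_preprocessed", "nnUNet_results"),
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the variables that are unset or do not point to a directory."""
    env = os.environ if environ is None else environ
    # An unset variable maps to "", which os.path.isdir treats as missing.
    return [name for name in required if not os.path.isdir(env.get(name, ""))]
```

Calling `missing_nnunet_paths()` with no arguments inspects the current process environment; an empty list means both paths exist.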
The benchmark defines four client-noise scenarios. Using the example dataset structure from the Data section (001–003 clean, 004–006 noisy):
Clean (all clients train on clean data, upper bound):
python3 ./src/fed/main.py \
--dataset_ids "001 002 003" \
--configuration 3d_fullres \
--plan nnUNetResEncUNetMPlans \
--fold 0 \
--num_clients 3 \
--num_rounds 100 \
--num_local_epochs 1 \
--trainer nnUNetTrainer_FedAvg \
--noise_mitigation_method <FNLL method> \
<FNLL method-specific flags>
Noisy (all clients train on noisy data, evaluated on clean validation data):
python3 ./src/fed/main.py \
--dataset_ids "004 005 006" \
--configuration 3d_fullres \
--plan nnUNetResEncUNetMPlans \
--fold 0 \
--num_clients 3 \
--num_rounds 100 \
--num_local_epochs 1 \
--trainer nnUNetTrainer_FedAvg \
--noise_mitigation_method <FNLL method> \
<FNLL method-specific flags> \
--clean_validation_dataset <Dataset001_XXX Dataset002_XXX Dataset003_XXX>
Ratio-on-all (ROA) (all clients train on partially clean, partially noisy data, evaluated on clean validation data):
python3 ./src/fed/main.py \
--dataset_ids "004 005 006" \
--configuration 3d_fullres \
--plan nnUNetResEncUNetMPlans \
--fold 0 \
--num_clients 3 \
--num_rounds 100 \
--num_local_epochs 1 \
--trainer nnUNetTrainer_FedAvg \
--noise_mitigation_method <FNLL method> \
<FNLL method-specific flags> \
--clean_validation_dataset <Dataset001_XXX Dataset002_XXX Dataset003_XXX> \
--noise_ratio 0.5 \
--noisy_train_folder <Dataset004_XXX Dataset005_XXX Dataset006_XXX>
Ratio-of-clients (ROC) (some clients are fully clean and the others fully noisy, evaluated on clean validation data):
python3 ./src/fed/main.py \
--dataset_ids "001 005 006" \
--configuration 3d_fullres \
--plan nnUNetResEncUNetMPlans \
--fold 0 \
--num_clients 3 \
--num_rounds 100 \
--num_local_epochs 1 \
--trainer nnUNetTrainer_FedAvg \
--noise_mitigation_method <FNLL method> \
<FNLL method-specific flags> \
--clean_validation_dataset <Dataset001_XXX Dataset002_XXX Dataset003_XXX>
Evaluation and compilation of FNLL decisions
Point all result-processing scripts to your results directory via the
nnUNet_results environment variable (or --nnunet-results-root where
supported):
export nnUNet_results="/path/to/nnUNet_results"
bootstrap_parent.py and visualize_results.py require an experiments log
table — a CSV (or Google Sheet exported as CSV) where each row describes one
registered experiment. Required columns:
| Column | Description |
|---|---|
| ID | Non-empty marker that the row is active |
| Experiment ID | Experiment folder name as created by main.py |
| Algo | Method name (e.g. fedavg, fedselect) |
| Data | Dataset name (e.g. LIDC, RIGA) |
| Noise | Client-noise scenario (clean, roa, roc, noisy) |
The scripts in this repository read this table from a Google Sheet; adapt the
sheet_id / csv_url constants at the top of each script to point to your own
table.
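Before pointing the scripts at your own table, it can help to confirm that an exported CSV has the required columns and to see which rows count as active. The sketch below uses only the column names from the table above; the function name `load_active_experiments` is hypothetical, and the repository scripts may parse the sheet differently.

```python
import csv
from typing import Dict, Iterable, List

REQUIRED_COLUMNS = ("ID", "Experiment ID", "Algo", "Data", "Noise")


def load_active_experiments(csv_lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse the log table and keep the rows whose ID cell is non-empty."""
    reader = csv.DictReader(csv_lines)
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"log table lacks required columns: {missing}")
    # Rows with an empty ID cell are treated as inactive and skipped.
    return [row for row in reader if (row["ID"] or "").strip()]
```

The function accepts any iterable of lines, so an open file handle (opened with `newline=""`) works as well as a list of strings.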
Evaluate and bootstrap a single experiment:
python3 ./src/eval/results_processing/bootstrap_nnunet_eval.py \
--exp_id <EXPERIMENT_ID> \
--num-workers 8
Evaluate and bootstrap all experiments registered in the log table:
python3 ./src/eval/results_processing/bootstrap_parent.py \
--folds 0 1 2 \
--num-workers 8
Force recomputation of all metrics:
python3 ./src/eval/results_processing/bootstrap_parent.py \
--folds 0 1 2 \
--force \
--num-workers 8
Force recomputation of specific metrics only:
python3 ./src/eval/results_processing/bootstrap_parent.py \
--folds 0 1 2 \
--force-metrics HD95 Dice \
--num-workers 8
Generate per-metric result boxplots:
python3 ./src/eval/results_processing/visualize_results.py \
--metric Dice
Restrict to a dataset subset:
python3 ./src/eval/results_processing/visualize_results.py \
--metric Dice \
--datasets LIDC RIGA Gleason MouseTumor MMIA MMIS
Build the ranking CSV:
python3 ./src/eval/results_processing/ranking.py \
--output-csv ./results/segmentation_results/bootstrap_method_rankings.csv
Generate ranking stability plots:
python3 ./src/eval/results_processing/visualize_ranking.py \
--output-dir ./results/segmentation_results/ranking_stability \
--metric Dice
Run paired Wilcoxon signed-rank tests comparing each FNLL method against FedAvg, with Holm-Bonferroni correction across the four method comparisons within each reported group:
python3 ./src/eval/results_processing/statistical_tests.py \
--metrics Dice HD95 FgBgInstanceF1 ClassConfusion \
--noise-scenarios clean roa roc noisy \
--datasets LIDC RIGA Gleason MouseTumor MMIA MMIS \
--datasets-for-metric Dice=LIDC,RIGA,Gleason,MouseTumor,MMIA,MMIS \
--datasets-for-metric HD95=LIDC,RIGA,Gleason,MouseTumor,MMIA,MMIS \
--datasets-for-metric FgBgInstanceF1=LIDC,RIGA,Gleason,MouseTumor,MMIA,MMIS \
--datasets-for-metric ClassConfusion=LIDC,RIGA,Gleason,MouseTumor,MMIA,MMIS
Use the rank-frequency summaries from visualize_ranking.py and the ranking
CSV from ranking.py to compile the FNLL decision guide. Relevant artifacts:
results/segmentation_results/bootstrap_method_rankings.csv
results/segmentation_results/ranking_stability/rank_frequency_summary_<metric>_<datasets>.csv
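To make the idea of a rank-frequency summary concrete, the sketch below resamples validation cases with replacement and records how often each method lands on each rank. It is an illustration under stated assumptions, not the procedure implemented in ranking.py or visualize_ranking.py: the function name, the use of the mean score per resample, and the alphabetical tie-break are all choices made here.

```python
import random
from collections import Counter
from typing import Dict, Sequence


def bootstrap_rank_frequency(
    scores: Dict[str, Sequence[float]], n_bootstrap: int = 1000, seed: int = 0
) -> Dict[str, Dict[int, float]]:
    """Fraction of bootstrap resamples in which each method attains each rank.

    `scores` maps a method name to per-case scores (higher is better). All
    methods must be scored on the same cases in the same order.
    """
    if not scores:
        raise ValueError("need at least one method")
    methods = sorted(scores)
    n_cases = len(scores[methods[0]])
    if n_cases == 0 or any(len(scores[m]) != n_cases for m in methods):
        raise ValueError("every method needs the same non-empty list of cases")
    rng = random.Random(seed)
    counts = {m: Counter() for m in methods}
    for _ in range(n_bootstrap):
        # Resample case indices once so all methods see the same cases.
        sample = [rng.randrange(n_cases) for _ in range(n_cases)]
        means = {m: sum(scores[m][i] for i in sample) / n_cases for m in methods}
        # Stable sort: ties keep alphabetical order (an assumption here).
        ordered = sorted(methods, key=lambda m: means[m], reverse=True)
        for rank, method in enumerate(ordered, start=1):
            counts[method][rank] += 1
    return {
        m: {rank: count / n_bootstrap for rank, count in sorted(counts[m].items())}
        for m in methods
    }
```

A method that holds rank 1 in nearly all resamples is a stable recommendation, whereas frequencies spread over several ranks suggest that the ordering depends on which cases happen to be in the validation set.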
Additional comparison figures:
python3 ./src/eval/results_processing/partial_noisy_scenarios_global_comparison.py \
--output-dir ./results/segmentation_results/partial_noise_comparison \
--figure paired_dot
python3 ./src/eval/results_processing/roc_clean_vs_noisy_clients_global_comparison.py \
--output-dir ./results/segmentation_results/partial_noise_comparison \
--figure paired_dot
python3 ./src/eval/results_processing/robustness_analysis_noisy_scenarios_global.py \
--figure separate_dot \
--delta-mode abs
Experiment reproducibility information
The splits_final.json files that define the train/validation splits used in all benchmark experiments are included in the repository under data/splits_final/. nnU-Net writes these files during preprocessing and uses them to assign cases to training and validation folds. Including them here ensures that any re-run of the benchmark uses the exact same splits.
To use them, copy the relevant files into the corresponding dataset folders under $nnUNet_preprocessed before training:
cp data/splits_final/splits_final_Dataset<ID>_<Name>.json \
$nnUNet_preprocessed/Dataset<ID>_<Name>/splits_final.json
The following splits are provided, covering all six datasets in both clean and noisy client-noise configurations:
| Dataset | Clean IDs | Noisy IDs | # Clients | Label mode |
|---|---|---|---|---|
| LIDC-IDRI | 041–044 | 045–048 | 4 | annotator majority / random rater |
| RIGA | 300–302 | 303–305 | 3 | annotator majority / random rater |
| GleasonXAI | 436–438 | 439–441 | 3 | STAPLE consensus / random rater |
| MouseTumor | 500–504 | 505–509 | 5 | STAPLE consensus / random annotator |
| MAMA-MIA | 600–603 | 604–607 | 4 | expert segmentation / automatic segmentation |
| MMIS | 700–703 | 704–707 | 4 | annotator majority / single rater |
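Copying the split files one by one becomes tedious with many client datasets. The following sketch automates the copy step under the file naming shown above (splits_final_Dataset<ID>_<Name>.json). The helper name `install_splits` is hypothetical, and skipping datasets without a preprocessed folder is a design choice of this sketch rather than repository behavior.

```python
import re
import shutil
from pathlib import Path
from typing import List

SPLIT_FILE_PATTERN = re.compile(r"^splits_final_(Dataset\d+_.+)\.json$")


def install_splits(splits_dir: str, nnunet_preprocessed: str) -> List[Path]:
    """Copy each split file into its existing preprocessed dataset folder."""
    installed = []
    for split_file in sorted(Path(splits_dir).glob("splits_final_Dataset*.json")):
        match = SPLIT_FILE_PATTERN.match(split_file.name)
        if match is None:
            continue
        target_dir = Path(nnunet_preprocessed) / match.group(1)
        if not target_dir.is_dir():
            # Skip datasets that have not been preprocessed yet.
            continue
        target = target_dir / "splits_final.json"
        shutil.copyfile(split_file, target)
        installed.append(target)
    return installed
```

The returned list shows which datasets received a split file, which makes it easy to spot a dataset folder whose name does not match the provided files.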
Incorporating a new FNLL method.
This benchmark treats a federated noisy-label learning (FNLL) method as an FL
strategy. Existing examples live in src/methods/ (fedavg, feda3i,
feddm, fedcorr, fedselect, iopfl) and are wired through
src/fed/main.py, src/fed/orchestrator.py, src/fed/client.py, and, if the
method changes the local training step, the nnU-Net trainer.
Most methods need one or more of these integration points:
- Server-side aggregation only: implement a custom aggregation function in
  src/methods/<method>/<method>.py and call it from Orchestrator.aggregate.
- Client-side pre/post local training logic: add a method-specific branch in
  Client._run_strategy_round or after self.model.run(...) in Client.fed_round.
- Custom loss, sample weighting, label correction, or per-batch logic: pass
  method flags/state through src/fed/model.py into
  nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py, then use them in
  compute_training_loss, train_step, validation, or dataloader setup.
- Persistent method state for restarts: store paths and hyperparameters in
  self.fl_strategy_state and implement save_state, following IOPFL or FedCorr.
Create a new package under src/methods/, for example:
src/methods/myfnll/
myfnll.py
Start from FedAvg if the method still needs standard weighted averaging:
from methods.fedavg.fedavg import FedAvg

class MyFNLL(FedAvg):
    def __init__(self, clients, myfnll_lambda=1.0, fl_strategy_state=None):
        super().__init__(clients)
        self.name = "myfnll"
        self.myfnll_lambda = (
            myfnll_lambda
            if fl_strategy_state is None
            else fl_strategy_state["myfnll_lambda"]
        )
        self.fl_strategy_state = {
            "myfnll_lambda": self.myfnll_lambda,
        }

    def myfnll_aggregate(self, client_checkpoints):
        # Return a state_dict-like dict containing the new server model weights.
        return self.fed_avg(client_checkpoints)

    def save_state(self, exp_id: str = None, client_id: int = None):
        self.save_fl_strategy_state_to_file(self.fl_strategy_state, exp_id)
The orchestrator expects strategy objects to expose name, clients, and any
method-specific functions called from the server or client flow.
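For readers unfamiliar with the "standard weighted averaging" that the FedAvg base class provides, the idea can be sketched in a few lines. This is not the repository's fed_avg implementation: the function name `weighted_average`, the use of plain lists instead of tensors, and weighting by local sample count are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def weighted_average(
    client_weights: Sequence[Dict[str, List[float]]], num_samples: Sequence[int]
) -> Dict[str, List[float]]:
    """Average parameter lists across clients, weighted by local sample count."""
    if not client_weights or len(client_weights) != len(num_samples):
        raise ValueError("need one sample count per client")
    total = sum(num_samples)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    averaged = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        # Each entry is the sample-weighted mean of that entry over clients.
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, num_samples)) / total
            for i in range(length)
        ]
    return averaged
```

Noise-aware aggregation methods typically change how these per-client weights are chosen, which is why a custom aggregate function is the natural extension point.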
In src/fed/main.py:
- Add the method name and its hyperparameters to METHOD_ARG_KEYS.
METHOD_ARG_KEYS = {
    ...
    "myfnll": ("myfnll_lambda",),
}
- Add parser arguments near the other method arguments.
parser.add_argument(
    "--myfnll_lambda",
    type=float,
    default=1.0,
    help="Regularization weight for MyFNLL.",
)
build_fl_args automatically copies all keys listed in METHOD_ARG_KEYS into
the orchestrator's fl_args.
In src/fed/orchestrator.py, import the class:
from methods.myfnll.myfnll import MyFNLL
Add it to _build_fl_strategy:
if strategy_name == "myfnll":
    return MyFNLL(
        self.clients,
        fl_args["myfnll_lambda"],
        fl_strategy_state=fl_strategy_state,
    )
Add a server step in _run_server_step:
elif strategy_name == "myfnll":
    self._run_myfnll_server_step(fl_round)
Then implement the step:
def _run_myfnll_server_step(self, fl_round: int):
    self.aggregate(strategy="myfnll", fl_round=fl_round)
Finally, route the aggregation in aggregate:
elif strategy == "myfnll":
    self.server_model_weights = self.fl_strategy.myfnll_aggregate(
        client_checkpoints
    )
If your method uses FedAvg unchanged, you can call self.aggregate(strategy="fedavg")
inside _run_myfnll_server_step instead.
For methods that compute client statistics, select samples, maintain local
memory, or update personalized models, add a branch to
Client._run_strategy_round:
elif strategy_name == "myfnll":
    self._run_myfnll_round(run_kwargs, fl_round, fl_strategy)
and implement:
def _run_myfnll_round(self, run_kwargs: dict, fl_round: int, fl_strategy):
    run_kwargs.update(
        {
            "fl_client_id": self.client_id,
            "is_myfnll_active": True,
        }
    )
    self.model.run(**run_kwargs)
    fl_strategy.update_client_state(self.model.nnunet_trainer, self.client_id)
If the method only needs information after local training, follow the IOP-FL
pattern in Client.fed_round: call the strategy after self.model.run(...)
using self.model.current_model_weights or self.model.nnunet_trainer.
If the nnU-Net trainer needs method-specific values, add them to both
nnUNetv2_fed.run(...) and _build_run_training_kwargs(...) in
src/fed/model.py, then include them in the returned kwargs passed to
run_training.
Example additions:
def run(..., is_myfnll_active: bool = False):
    kwargs = self._build_run_training_kwargs(..., is_myfnll_active)

return {
    ...
    "is_myfnll_active": is_myfnll_active,
}
Also make sure Client._base_run_kwargs or your method-specific client branch
sets the value.
If your method changes the local training behavior, prefer a method-specific
trainer class over editing the base nnUNetTrainer directly. The benchmark
already follows this pattern in:
nnUNet/nnunetv2/training/nnUNetTrainer/variants/fl/nnUNetTrainer_FL.py
That file contains simple aliases such as nnUNetTrainer_FedAvg and mixin-based
trainers such as nnUNetTrainer_FedCorr, nnUNetTrainer_FedDM, and
nnUNetTrainer_FedSelect.
Add your trainer to the same file, or create a new Python file below
nnUNet/nnunetv2/training/nnUNetTrainer/variants/. nnU-Net discovers trainers
by class name, so the class name you pass via --trainer must match the Python
class name.
Minimal example:
from nnunetv2.training.nnUNetTrainer.nnUNetTrainer import nnUNetTrainer

class MyFNLLTrainerMixin:
    def compute_training_loss(self, batch, data, output, target):
        loss = super().compute_training_loss(batch, data, output, target)
        if getattr(self.fl_strategy, "name", None) == "myfnll":
            loss = loss + self.fl_strategy.myfnll_regularizer(
                batch=batch,
                output=output,
                target=target,
                trainer=self,
            )
        return loss

class nnUNetTrainer_MyFNLL(MyFNLLTrainerMixin, nnUNetTrainer):
    pass
If you use a custom base trainer, such as nnUNetTrainerDiceCELoss_noSmooth,
compose the mixin with that base instead:
class nnUNetTrainerDiceCELoss_noSmooth_MyFNLL(
    MyFNLLTrainerMixin,
    nnUNetTrainerDiceCELoss_noSmooth,
):
    pass
Then launch the benchmark with the new trainer:
python3 ./src/fed/main.py \
--noise_mitigation_method myfnll \
--trainer nnUNetTrainer_MyFNLL \
...
Use this trainer-subclass route for changes to _build_loss,
compute_training_loss, train_step, run_train_iterations, dataloaders,
augmentation, validation behavior, or any method-specific local training state.
Use the strategy class in src/methods/<method>/ for server aggregation and
state that belongs to the FL algorithm.
For loss or per-batch behavior, extend your method-specific trainer class:
- Add constructor arguments with defaults, for example
  is_myfnll_active: bool = False.
- Store them as instance attributes near the existing FL args.
- Use the attributes in compute_training_loss or train_step.
Example:
def compute_training_loss(self, batch, data, output, target):
    loss = self.loss(output, target)
    if self.is_myfnll_active:
        loss = loss + self.fl_strategy.myfnll_regularizer(
            batch=batch,
            output=output,
            target=target,
            trainer=self,
        )
    return loss
Keep tensor operations on self.device, avoid storing GPU tensors in long-lived
strategy state unless necessary, and move persistent state to CPU before saving
when possible.
If your method has state that must survive restarts, keep JSON-serializable
metadata in self.fl_strategy_state. Save large tensors or model weights as
separate .pth files and store only their paths in the JSON. IOPFL.save_state
is the reference pattern for per-client tensor checkpoints, while
FedCorr.save_global_model_weights is the reference pattern for global model
state.
Restart support is driven by the fl_strategy_state entry in the experiment
args JSON. In your method constructor, accept fl_strategy_state=None and load
saved values from it when present.
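The restart pattern described above can be sketched independently of the repository classes. Everything in this block is hypothetical: the class name `RestartableStrategyState`, the `memory_path` key, and the `save`/`load` helpers stand in for the actual save_state and args-JSON handling. The sketch only shows JSON-serializable metadata being written, a large artifact being referenced by path, and the constructor preferring saved values.

```python
import json
from pathlib import Path
from typing import Optional


class RestartableStrategyState:
    """Hypothetical holder mirroring the fl_strategy_state pattern."""

    def __init__(
        self, myfnll_lambda: float = 1.0, fl_strategy_state: Optional[dict] = None
    ):
        # Saved values take precedence over constructor defaults on restart.
        if fl_strategy_state is not None:
            myfnll_lambda = fl_strategy_state["myfnll_lambda"]
        self.fl_strategy_state = {
            "myfnll_lambda": myfnll_lambda,
            # Only the path of a large checkpoint goes into the JSON metadata.
            "memory_path": (fl_strategy_state or {}).get("memory_path"),
        }

    def save(self, json_path: str, memory_path: Optional[str] = None) -> None:
        """Write the metadata; tensors would be saved separately at memory_path."""
        if memory_path is not None:
            self.fl_strategy_state["memory_path"] = str(memory_path)
        Path(json_path).write_text(json.dumps(self.fl_strategy_state, indent=2))

    @classmethod
    def load(cls, json_path: str) -> "RestartableStrategyState":
        """Rebuild the state from previously saved metadata."""
        return cls(fl_strategy_state=json.loads(Path(json_path).read_text()))
```

Keeping tensors out of the JSON keeps the experiment args file small and human-readable, and the referenced .pth file can be reloaded lazily after a restart.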
Before launching a full benchmark, run a tiny experiment with a few rounds and one local epoch:
python3 ./src/fed/main.py \
--noise_mitigation_method myfnll \
--dataset_ids "001 002 003" \
--num_clients 3 \
--num_rounds 2 \
--num_local_epochs 1 \
--configuration 3d_fullres \
--plan nnUNetResEncUNetMPlans \
--trainer nnUNetTrainer_MyFNLL \
--myfnll_lambda 1.0
Check that:
- the method name appears in the generated ExperimentArgs_*.json;
- local training finishes for every client;
- Orchestrator.aggregate produces server_model_weights;
- final checkpoints are written in each client result folder;
- any method-specific state can be saved and loaded again by the restart script.
Incorporating a new dataset with segmentation label noise.
New datasets should enter the benchmark through the nnU-Net dataset interface. Keep raw data, preprocessed data, and experiment results outside the repository and point the suite to them with the standard environment variables:
export nnUNet_raw="/path/to/nnUNet_raw"
export nnUNet_preprocessed="/path/to/nnUNet_preprocessed"
export nnUNet_results="/path/to/nnUNet_results"
Create a DatasetXXX_<Name> folder under nnUNet_raw with the standard
imagesTr, labelsTr, and dataset.json layout. Use a unique dataset ID for
each client dataset that participates in FL. If you add noisy labels, keep the
clean reference and noisy labels in clearly named folders so the benchmark CLI
can select them via --clean_validation_dataset and --noisy_train_folder.
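A quick structural check of the raw folder can catch incomplete client datasets before planning. The sketch below only tests for the three standard entries named above; the function name `find_incomplete_datasets` is hypothetical, and the check does not replace nnU-Net's own dataset integrity verification.

```python
from pathlib import Path
from typing import List

REQUIRED_ENTRIES = ("imagesTr", "labelsTr", "dataset.json")


def find_incomplete_datasets(nnunet_raw: str) -> List[str]:
    """Return one message per missing entry in each Dataset* folder."""
    problems = []
    for dataset_dir in sorted(Path(nnunet_raw).glob("Dataset*")):
        if not dataset_dir.is_dir():
            continue
        for entry in REQUIRED_ENTRIES:
            if not (dataset_dir / entry).exists():
                problems.append(f"{dataset_dir.name}: missing {entry}")
    return problems
```

An empty result means every client dataset folder has the expected top-level layout.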
Make sure each client dataset exposes the same label set, image channels, and compatible train/validation splits. FL aggregation assumes all clients train the same model architecture, so mismatched labels, modalities, or planning outputs will break aggregation.
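The label-set and channel requirement can be checked by comparing the dataset.json files of all client datasets. The sketch below assumes the nnU-Net v2 dataset.json keys `labels` and `channel_names`; the function name `find_dataset_json_mismatches` is hypothetical, and using the first client as the reference is an arbitrary choice of this sketch.

```python
import json
from pathlib import Path
from typing import Dict, List, Sequence

COMPARED_KEYS = ("labels", "channel_names")


def find_dataset_json_mismatches(dataset_dirs: Sequence[str]) -> List[str]:
    """Compare each client's dataset.json against the first client's."""
    reference: Dict = {}
    mismatches = []
    for index, dataset_dir in enumerate(dataset_dirs):
        meta = json.loads((Path(dataset_dir) / "dataset.json").read_text())
        for key in COMPARED_KEYS:
            if index == 0:
                # The first client defines the expected label set and channels.
                reference[key] = meta.get(key)
            elif meta.get(key) != reference[key]:
                mismatches.append(
                    f"{Path(dataset_dir).name}: '{key}' differs from first client"
                )
    return mismatches
```

Running this over the clean and noisy variants of every client before planning avoids discovering a mismatch only when aggregation fails.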
Use src/data/utils/nnunet_fed_preparation.py across all client dataset IDs.
This computes client fingerprints, averages them centrally, and writes common
plans so all clients use compatible network weights.
python3 ./src/data/utils/nnunet_fed_preparation.py \
--dataset_ids "001 002 003" \
--configuration "3d_fullres" \
--planner "nnUNetPlannerResEncM" \
--plans_name "nnUNetResEncUNetMPlans" \
--verify_dataset_integrity
Run a short FedAvg experiment before evaluating FNLL methods:
python3 ./src/fed/main.py \
--noise_mitigation_method fedavg \
--dataset_ids "001 002 003" \
--num_clients 3 \
--num_rounds 2 \
--num_local_epochs 1 \
--configuration 3d_fullres \
--plan nnUNetResEncUNetMPlans \
--trainer nnUNetTrainer_FedAvg
Check that every client trains, aggregation finishes, validation runs, and the
result folders are created below nnUNet_results.