Conversation

@dylankershaw
Contributor

This PR addresses #574.

@jveitchmichaelis
Collaborator

jveitchmichaelis commented Jul 15, 2025

Thanks! I'd like to review these changes after we've merged #1083, because that PR also adjusts when and where label handling happens (and enforces other checks). Some of that logic moves into set_labels (to verify configs), and we will also run some checks when models are created.

For clarity, this PR specifically addresses a third scenario, when we want to set up a training dataset. Overall we want to check:

  • Is the label_dict in the config sane? (deepforest instantiation)
  • Have we tried to make a model with a mismatched label dict? (model instantiation)
  • Have we tried to pass in a dataset that disagrees with the label_dict? (dataset instantiation: this PR)

Here I think it suffices to check that all labels in the dataset CSVs are present in the label_dict; the label_dict can contain additional keys.

I would consider moving the function call to the dataset class rather than create_trainer. That's really where we need to perform this sanity check, and it would also mean the check only runs on demand. Otherwise your check will run whenever a deepforest instance is created (if a training CSV is defined).

It would also be logical to perform this alongside other sanity checks, like verifying that bounding boxes are in range, image paths exist, and so on. But we probably want this to be optional for bigger datasets.

I think the complexity here is as good as we'll get, since you have to iterate over the whole CSV at least once. But I would check whether there are faster or parallel ways to do this within pandas.
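
For illustration, a minimal sketch of the kind of single-pass, vectorized check being discussed (the function name, the "label" column, and the error message are assumptions for this sketch, not the PR's actual code):

import pandas as pd

def validate_labels(annotations: pd.DataFrame, label_dict: dict) -> None:
    # One vectorized pass over the label column; label_dict may contain
    # extra keys, so we only check CSV labels -> label_dict, not the reverse.
    unknown = set(annotations["label"].unique()) - set(label_dict)
    if unknown:
        raise ValueError(
            f"Labels {sorted(unknown)} appear in the CSV but not in label_dict"
        )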

@jveitchmichaelis
Collaborator

jveitchmichaelis commented Jul 15, 2025

The test that breaks is test_checkpoint_label_dict because of the ordering issue between setting the dict and creating the trainer. We're moving towards enforcing this at the config level. So we'd prefer:

m = main.deepforest(config_args={"num_classes": 1},
                    label_dict={"Object": 0})

rather than overwriting the label dict after creation (which was fine for setting up a unit test, but in real life I don't know why you'd do it). Currently:

m.create_trainer()  # <- test fails here
m.label_dict = {"Object": 0}
m.numeric_to_label_dict = {0: "Object"}
m.trainer.fit(m)
m.trainer.save_checkpoint("{}/checkpoint.pl".format(tmpdir))

would be replaced with:

m.create_trainer()
m.trainer.fit(m)  # <- test would fail here (we're not expecting it to), when Lightning calls m.train_dataloader
m.trainer.save_checkpoint("{}/checkpoint.pl".format(tmpdir))

@jveitchmichaelis
Collaborator

jveitchmichaelis left a comment

As per my comment above:

  • Can we see what this would look like if the check is performed as a sanity check within the dataset itself (BoxDataset), rather than in create_trainer?
  • We should review some of the existing test cases and make sure they reflect the usage patterns we're expecting for v2+.

@dylankershaw
Contributor Author

dylankershaw commented Jul 21, 2025

Thanks for the review @jveitchmichaelis !

I'd like to review these changes after we've merged #1083

Sounds good. I'll keep an eye on that PR and push any necessary updates here once it's merged.

I would consider moving the function call to the dataset class, and not create_trainer.

Done. Thanks for catching that. I've added some additional tests to verify that we're now running label validation for both train and validate CSVs. Looks like this fixed the failing test_checkpoint_label_dict issue as well.

It would also be logical to perform this alongside other sanity checks, like verifying that bounding boxes are in range, image paths exist, and so on. But we probably want this to be optional for bigger datasets.

I like that idea. Any objections to me addressing those in a subsequent PR and keeping the scope of this PR limited to label validation?

I think the complexity here is as good as we'll get, since you have to iterate over the whole CSV at least once. But I would check if there are faster/parallel ways to do this within pandas.

Thanks for flagging that. I've tweaked validate_labels a bit so it should be more efficient than before, but to your point, I don't think we can do better than O(n).

@jveitchmichaelis
Collaborator

jveitchmichaelis commented Jul 21, 2025

Yes, no problem scoping this to label validation only. There is another open PR for box validation (#1015), but the author hasn't replied in a while. If you wanted to tackle that, we can merge it with co-authorship; it looks like it just needs a rebase and an update to reflect the current dataset structure.

I've requested a minor change to make this a dataset method. This also avoids calling read_csv twice since you can check self.annotations.
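
For reference, a dataset-method version might look roughly like the sketch below. Only the class name BoxDataset, the method idea, and self.annotations come from this thread; the constructor signature and error message are assumptions for illustration:

import pandas as pd
from torch.utils.data import Dataset

class BoxDataset(Dataset):
    def __init__(self, csv_file: str, label_dict: dict):
        self.annotations = pd.read_csv(csv_file)  # read once, reuse below
        self.label_dict = label_dict
        self.validate_labels()

    def validate_labels(self) -> None:
        # Check self.annotations instead of re-reading the CSV.
        unknown = set(self.annotations["label"].unique()) - set(self.label_dict)
        if unknown:
            raise ValueError(
                f"Labels {sorted(unknown)} are not present in label_dict"
            )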

@bw4sz
Collaborator

bw4sz commented Jul 25, 2025

@dylankershaw thanks for your help here.

@dylankershaw
Contributor Author

dylankershaw commented Jul 26, 2025

I've requested a minor change to make this a dataset method. This also avoids calling read_csv twice since you can check self.annotations.

Done! ✅

I rebased onto main as well, so this should be good to go @jveitchmichaelis. I'll take a look at the other PR you linked next week.

@bw4sz
Collaborator

bw4sz left a comment

This looks good to me. I'm not sure about the spacing in the tests, but the style check passes.

@jveitchmichaelis
Collaborator

jveitchmichaelis commented Jul 31, 2025

Yeah, a bit odd. @dylankershaw, can you see if you can push without the utilities test file included? We seem to have an occasional issue with flip-flopping formatters, but hopefully ruff will fix that. Otherwise this looks good to me.

@dylankershaw
Contributor Author

I think one of the formatters must have made the test_utilities.py changes in the pre-commit hook. Anyway, I've just reverted those 👍

@dylankershaw changed the title from "raise error in create_trainer when there's a label mismatch" to "validate labels in BoxDataset" on Jul 31, 2025

@jveitchmichaelis
Collaborator

jveitchmichaelis left a comment

LGTM :)

@jveitchmichaelis merged commit f0d4a0d into weecology:main on Jul 31, 2025
5 checks passed