-
Notifications
You must be signed in to change notification settings - Fork 221
validate labels in BoxDataset #1093
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks! I'd like to review these changes after we've merged #1083, because that PR also adjusts when/where label handling happens (and also enforces other checks). Some of this logic is moved into For clarity, this PR specifically addresses a third scenario, when we want to set up a training dataset. Overall we want to check:
Here I think it suffices to check that labels in all dataset CSVs are present in the label_dict, but label_dict can contain additional keys. I would consider moving the function call to the dataset class, and not It would also be logical to perform this alongside other sanity checks, like verifying bounding boxes are in range, image paths exist and so on. But we probably want this to be optional for bigger datasets). I think the complexity here is as good as we'll get, since you have to iterate over the whole CSV at least once. But I would check if there are faster/parallel ways to do this within pandas. |
|
The test that breaks is m = main.deepforest(config_args={"num_classes": 1},
label_dict={
"Object": 0
})rather than overwriting the label dict after creation (which was fine to set up a unit test, but in real life I don't know why you'd do it). Currently: m.create_trainer() #<- test fails here
m.label_dict = {"Object": 0}
m.numeric_to_label_dict = {0: "Object"}
m.trainer.fit(m)
m.trainer.save_checkpoint("{}/checkpoint.pl".format(tmpdir))would be replaced with: m.create_trainer()
m.trainer.fit(m) #<- test would fail here (we're not expecting it to), when Lightning calls m.train_dataloader
m.trainer.save_checkpoint("{}/checkpoint.pl".format(tmpdir)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As comment above:
- Can we see what this would look like if the check is performed as a sanity check within the dataset itself (
BoxDataset), and not increate_trainer? - We should review some of the existing test cases and make sure that they reflect usage patterns we're expecting for v2+
|
Thanks for the review @jveitchmichaelis !
Sounds good. I'll keep on eye on that PR and push any necessary updates here once it's merged.
Done. Thanks for catching that. I've added some additional tests to verify that we're now running label validation for both train and validate CSVs. Looks like this fixed the failing
I like that idea. Any objections to me addressing those in a subsequent PR and keeping the scope of this PR limited to label validation?
Thanks for flagging that. I've tweaked the |
|
Yes, no problem to scoping this only to label validation. There is another open PR for box validation but the author hasn't replied in a while. If you wanted to tackle that, we can merge it with co-authorship - looks like it just needs a rebase and an update to reflect the current dataset structure. #1015 I've requested a minor change to make this a dataset method. This also avoids calling |
|
@dylankershaw thanks for your help here. |
Done! ✅ I rebased main as well so this should be good to go @jveitchmichaelis. I'll take a look at that other PR you linked next week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, I'm not sure about the spacing in the tests, but style passes.
|
Yeah bit odd - @dylankershaw can you see if you can push without the utilities test file included? We seem to have an occasional issue with flip-flopping formatters, but hopefully ruff will fix that. Otherwise looks good to me. |
|
I think one of the formatters must have made the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM :)
This PR addresses #574.