Feat torchmetrics eval #1071
Conversation
this would be a first draft implementing what we discussed in #901.
Great, just a quick note that we are merging a very large pre-2.0 push very soon. It won't literally touch anything you've done here, but rebasing will take some work since it covers most of the codebase. Hopefully it will be merged within a couple of days.
Agreed, we can add config args (see the new hydra config) to give users more control. Make sure to test the edge cases: 1) all empty images, and 2) mixed empty and non-empty images. See empty_frame_accuracy (DeepForest/src/deepforest/main.py, line 97 in af9458b).
Excited! This is something we have wanted for a long time but not been able to crack.
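As a concrete reference for those edge cases, here is a hedged sketch of how all-empty and mixed batches can be represented for torchmetrics' MeanAveragePrecision (empty [0, 4] tensors stand for frames with no boxes; the values are made up and this is not DeepForest's test code):

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(iou_type="bbox")

preds = [
    # An image with no predictions at all.
    dict(boxes=torch.zeros((0, 4)), scores=torch.zeros(0), labels=torch.zeros(0, dtype=torch.long)),
    # An image with one prediction.
    dict(boxes=torch.tensor([[0.0, 0.0, 10.0, 10.0]]), scores=torch.tensor([0.9]), labels=torch.tensor([0])),
]
targets = [
    # No trees annotated in the first frame.
    dict(boxes=torch.zeros((0, 4)), labels=torch.zeros(0, dtype=torch.long)),
    dict(boxes=torch.tensor([[1.0, 1.0, 11.0, 11.0]]), labels=torch.tensor([0])),
]

metric.update(preds, targets)
print(metric.compute()["map"])
```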
Hello @bw4sz! I have been using my fork and have fixed a few issues already (FYI, using this example: https://deepforest-modal-app.readthedocs.io/en/latest/treeai-example.html). I will check the edge cases mentioned (I think empty images are indeed problematic with the current code) and push the fixes in this PR in the coming days. Looking forward to DeepForest 2.0.
Great. I have identified a multi-GPU evaluate error (during the training loop) in which the GPU ranks don't properly gather all the data. I'm going to wait until this PR to fix that; we might be able to get rid of our custom code entirely, since it's slow and I've never loved it.
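For context on why torchmetrics could replace that custom gathering: torchmetrics metric states are synchronised across ranks when compute() is called, so a Lightning validation loop does not need explicit all_gather code. A hedged sketch, with illustrative class and attribute names rather than DeepForest's actual module:

```python
import pytorch_lightning as pl
import torchvision
from torchmetrics.detection import MeanAveragePrecision


class DetectionModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
        self.map_metric = MeanAveragePrecision(iou_type="bbox")

    def validation_step(self, batch, batch_idx):
        images, targets = batch                   # targets: list of dicts with boxes/labels
        preds = self.model(images)                # eval mode: list of dicts with boxes/scores/labels
        self.map_metric.update(preds, targets)    # state accumulates locally on each rank

    def on_validation_epoch_end(self):
        # compute() gathers metric state from every GPU rank before reducing,
        # so no manual all_gather of predictions is needed.
        self.log("val_map", self.map_metric.compute()["map"])
        self.map_metric.reset()
```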
@martibosch, can I help here? It looks like a quick rebase. Tell me when you are ready for review.
The next step here is to compare the evaluation scores from main.evaluate() against those produced here, to see what kind of difference there is.
Hello @bw4sz, sorry for the delay on my side; I have been busy preparing materials for a conference. I will pick this up again tomorrow. I think I will rebase this so that the CI runs on the pull request and then we can compare results.
Force-pushed from f1cf34c to 8dffa19.
Hello! I think I fixed the conflicts; however, is there any reason why the test workflows are not running on this PR? Or are they running and I just can't see them?
Needs manual approval (for non-org members?), but that should be good to run now.
Hello! I compared the torchmetrics-based evaluation and it seems there are some differences that we will have to explore. I will get back to you when I find something. In any case, here is the zip with the data and the notebook (you can see the different scores there) that I am working with.
Thanks, I'll take a look if I have a moment today. Maybe we can make a summary table and some plots to compare? I'd be interested to see whether it's significant enough to worry about, but IoU (as an example) is almost always different between packages.
So I would defer to @ethanwhite and @bw4sz on this. I'm inclined to accept that we should standardize on torchmetrics' implementations even if they disagree with ours. For now we could host both versions, with torchmetrics as the default and ours as a "legacy" option. It doesn't look like anything is outrageously wrong, but I agree it would be useful to figure out where the difference comes from. I would suggest checking whether we get closer results when the box assignment is the same (i.e. the input to the IoU function is identical), or whether this arises from an implementation difference in the IoU computation itself. Ours, from two runs:

The species classification differences look near enough identical, though precision is different. One related issue from your notebook: I think it's in evaluate_boxes somewhere, where we modify
@martibosch please could you share the images you used for the example? I've had a look at extracting the raw data from coco_eval, but it's quite difficult to review without seeing the underlying images. Here blue is ground truth, red is predicted:
The predictions don't look particularly good to start with, so even if we check the library-assigned true positives I can't tell whether they're sensible. The fact that the boxes line up suggests I'm plotting the right annotations, at least.

I made a wrapper to store the evaluation object; it's a bit ugly for what amounts to about three lines of changed code, but it works. The eval library has an option to label the assignments: https://github.com/MiXaiLL76/faster_coco_eval/blob/45f747f7544be9049f448f1806f6fa43ebf97d36/faster_coco_eval/core/faster_eval_api.py#L118. I enabled this option. You can get the matches from other places, but I like this way because there's no room for misinterpretation (each annotation dict gets a label).

The challenge with matching is that the eval tools use greedy assignment, which will probably be slightly different from the Hungarian matching that DeepForest (and your code) does. That alone might explain a small discrepancy. The matching function is here: https://github.com/MiXaiLL76/faster_coco_eval/blob/45f747f7544be9049f448f1806f6fa43ebf97d36/csrc/faster_eval_api/coco_eval/cocoeval.cpp#L72. The original from pycocotools is here: https://github.com/ppwwyyxx/cocoapi/blob/8cbc887b3da6cb76c7cc5b10f8e082dd29d565cb/PythonAPI/pycocotools/cocoeval.py#L272. The faster version just ported it to C++, from what I can tell.

For what it's worth, Ultralytics (YOLO) has had an open issue about this since 2021. This is not a problem unique to us :)
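For illustration, here is a simplified, hedged sketch of that greedy assignment (the real implementations are behind the cocoeval links above; the function name `greedy_match` and the exact tie-breaking rules here are illustrative, not the library's API):

```python
import numpy as np

def greedy_match(ious: np.ndarray, scores: np.ndarray, iou_threshold: float = 0.5):
    """COCO-style greedy matching: visit detections in descending score order and
    let each one claim the unmatched ground truth with the highest IoU above the
    threshold. ious is [n_det, n_gt]; scores is [n_det]."""
    det_to_gt = -np.ones(ious.shape[0], dtype=int)   # -1 means unmatched (false positive)
    gt_to_det = -np.ones(ious.shape[1], dtype=int)   # -1 means unmatched (false negative)
    for d in np.argsort(-scores):                    # highest-confidence detections first
        best_iou, best_g = iou_threshold, -1
        for g in range(ious.shape[1]):
            if gt_to_det[g] != -1:                   # this ground truth is already taken
                continue
            if ious[d, g] >= best_iou:               # keep the best remaining candidate
                best_iou, best_g = ious[d, g], g
        if best_g != -1:
            det_to_gt[d], gt_to_det[best_g] = best_g, d
    return det_to_gt, gt_to_det
```

By contrast, Hungarian matching optimises the assignment jointly over all boxes, so the two approaches can produce different true-positive sets when boxes overlap heavily.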
I checked this a while ago (see this comment and the previous commit it refers to), and there were no mismatches between deepforest and torchmetrics; however, I have changed the code quite a bit, so we should check this again.
I noticed that too and have already addressed it, so it should not be happening here.
I've been looking into the results you shared here, as far as differences go. I'll continue the discussion on the issue.
My two cents is that if we are within 10% of standardized code, we go with that, and this looks like the case. Part of my motivation was that I was expecting torchmetrics to be quite a bit faster. If we run a large evaluation, like on MillionTrees, is there a time difference? I remember writing the IoU metric myself, and there is no way it was as fast as what torchmetrics has done.
Yeah see #901 (comment). As long as the results are sane and repeatable, it should be OK.
My guess is probably not for the matching itself. That runs per image, and the forward pass dominates. We could do some testing on something like 10M synthetic labels just to see, but for a one-off evaluation I doubt it would take that long. You're running a bunch of N^2 IoU calculations (per image) and the matrix ops for the Hungarian assignment (even worse, at N^3), but N isn't quite big enough to make it blow up. I'll also note that the source for models like Mask2Former uses library functions like we do; they're not hand-optimized. There are memory concerns for larger matrices, but we're not at that point. Running the COCO-style evaluation might take a long time (especially with pycocotools, even though there are some C bits).
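To make that concrete, here is a minimal, hedged sketch of the per-image step being described (an n_pred × n_gt IoU matrix followed by Hungarian assignment); the function name and threshold are illustrative, not DeepForest's actual implementation:

```python
import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import box_iou

def match_boxes(pred_boxes: torch.Tensor, gt_boxes: torch.Tensor, iou_threshold: float = 0.4):
    """Match predictions to ground truth; both tensors are [N, 4] in xyxy format."""
    if len(pred_boxes) == 0 or len(gt_boxes) == 0:
        return []                                    # nothing to match on empty frames
    iou = box_iou(pred_boxes, gt_boxes)              # [n_pred, n_gt] pairwise IoU (the N^2 step)
    pred_idx, gt_idx = linear_sum_assignment(iou.numpy(), maximize=True)  # Hungarian (~N^3)
    # Keep only assignments that clear the IoU threshold (the true positives).
    return [(int(p), int(g), float(iou[p, g])) for p, g in zip(pred_idx, gt_idx) if iou[p, g] >= iou_threshold]

matches = match_boxes(torch.tensor([[0.0, 0.0, 10.0, 10.0]]), torch.tensor([[1.0, 1.0, 11.0, 11.0]]))
```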
@martibosch take a look at the PR here. I had a look at where we could optimize without changing any of the existing interfaces, and this looks quite promising. The short answer is that most of our existing inefficiency can be removed by vectorising and avoiding lots of repeated dataframe operations. We have switched to faster_coco_eval as the backend.

The other difference that becomes apparent in very large tests is that torchmetrics (really pycocotools/faster_coco_eval) is single-threaded. So if you have a large dataset and a beefy machine, it sits for ages spinning on a single core, when most of the per-image processing could be parallelised.
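For reference, a hedged sketch of how the backend can be selected through torchmetrics (the `backend` argument is available in recent torchmetrics releases; exact behaviour depends on the installed versions of torchmetrics and faster_coco_eval):

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

# Ask torchmetrics to use faster_coco_eval instead of pycocotools for the COCO evaluation.
metric = MeanAveragePrecision(iou_type="bbox", backend="faster_coco_eval")

preds = [dict(boxes=torch.tensor([[10.0, 10.0, 50.0, 50.0]]),
              scores=torch.tensor([0.8]),
              labels=torch.tensor([0]))]
targets = [dict(boxes=torch.tensor([[12.0, 12.0, 48.0, 48.0]]),
                labels=torch.tensor([0]))]

metric.update(preds, targets)
print(metric.compute()["map"])
```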
Hello @jveitchmichaelis, yes, the ideal scenario would be to retrieve the matches directly from torchmetrics, which I found to be a bit too hacky, especially if the user is not bound to a specific backend. However, if the user will be bound to faster_coco_eval, couldn't we just use it directly?

In any case, it makes sense not to use torchmetrics if we do not want many of its features (e.g., maximum detection thresholds), but I am wondering to what extent it would make sense to allow the user to retrieve the whole precision/recall curve.

Finally, out of curiosity, is using the STRtree faster than a spatial join with geopandas? A while ago I used the overlay functions and I was surprised at how fast they ran over large datasets. However, I suppose that in most DeepForest use cases the per-image geo-data frame will be rather small, in which case most gains would indeed come from exploiting the "embarrassingly parallel" nature of the multi-image evaluation.

Let me know if you prefer to move the review to your PR. Best,
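On the precision/recall curve: a hedged sketch of one way to pull the full curve out of a COCO-style evaluation after accumulate() (this uses the pycocotools API, which faster_coco_eval mirrors; the tiny in-memory dataset below is made up for illustration):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Minimal in-memory ground truth: one image, one box, one category.
coco_gt = COCO()
coco_gt.dataset = {
    "images": [{"id": 1, "width": 100, "height": 100}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1,
                     "bbox": [10, 10, 40, 40], "area": 1600, "iscrowd": 0}],
    "categories": [{"id": 1, "name": "Tree"}],
}
coco_gt.createIndex()

# One detection in COCO result format ([x, y, w, h] plus a score).
coco_dt = coco_gt.loadRes([{"image_id": 1, "category_id": 1,
                            "bbox": [12, 12, 38, 38], "score": 0.9}])

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()

# precision has shape [T, R, K, A, M]: IoU thresholds x recall thresholds x
# categories x area ranges x max detections; -1 marks recall levels never reached.
precision = coco_eval.eval["precision"]
recall_thresholds = coco_eval.params.recThrs          # 101 recall points from 0 to 1
pr_curve_at_50 = precision[0, :, 0, 0, -1]            # IoU=0.5, category 0, all areas, 100 dets
```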
@martibosch I think your effort here stands, and you can go with the assumption that we will restrict to faster_coco_eval.

The reason I was poking around with the eval code is that we're running some very large tests at the moment, and it was taking a while to process O(10k+) images. We also needed the output to compare with existing/published results, so I wanted to make sure we got identical outputs. The end result takes around the same time, and I don't see any reason why we couldn't use it here.

One question is: can we get this feature pulled into torchmetrics directly? I would be happy to raise a PR with them, as it's a pretty minor change to expose the attributes (the developers of faster_coco_eval are also very responsive). But I don't see them supporting multi-threading any time soon.
From profiling, most of the improvements came through vectorisation and avoiding implicit loops. Most of the time now is spent computing the intersection and union areas; tree construction and querying are comparatively fast.
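On the STRtree question above, a hedged sketch of the approach (shapely 2.x vectorised API; the toy boxes are made up): build a tree over the ground-truth boxes, query all predictions at once to get candidate intersecting pairs, and compute IoU only for those pairs:

```python
import numpy as np
from shapely import STRtree, area, box, intersection

# Toy boxes: two predictions and two ground-truth trees.
pred = np.array([box(0, 0, 10, 10), box(20, 20, 30, 30)])
gt = np.array([box(1, 1, 11, 11), box(50, 50, 60, 60)])

tree = STRtree(gt)                                           # build once per image
pred_idx, gt_idx = tree.query(pred, predicate="intersects")  # candidate pairs only

# Vectorised IoU for the candidate pairs (the intersection/union step that dominates profiling).
inter = area(intersection(pred[pred_idx], gt[gt_idx]))
iou = inter / (area(pred[pred_idx]) + area(gt[gt_idx]) - inter)
print(list(zip(pred_idx.tolist(), gt_idx.tolist(), iou.tolist())))
```

A geopandas sjoin does something similar under the hood (it also builds an STRtree), so for small per-image frames the difference is likely minor; the bigger win, as noted, is parallelising across images.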

For #901.