
Conversation

@martibosch
Contributor

For #901.

@martibosch
Contributor Author

This is a first draft implementing what we discussed in #901.

  • As mentioned there, all boxes are matched regardless of their label, which is in line with the current implementation and makes the most sense for a bounding box detection + crop model classification approach. It may still make sense to let the user decide whether matches should consider all boxes or only boxes of the same class (e.g., via a boolean argument), in which case we may discard the latest commit 33cdcc3 and work from there.
  • Also as discussed, torchmetrics offers many more options to control the evaluation. This PR mimics the current evaluation API, but it may be a good idea to let users define custom recall thresholds, maximum detection thresholds and the like. This could be done via keyword arguments whose default values mimic the current evaluation, which would let advanced users further customize the evaluation without overwhelming most users (see the sketch after this list).
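A minimal sketch, assuming torchmetrics' MeanAveragePrecision as the backend metric, of what that keyword-argument approach could look like; the helper names build_map_metric, update_metric and the match_all_classes flag are hypothetical illustrations, not this PR's actual API:

```python
# Minimal sketch, assuming torchmetrics' MeanAveragePrecision; build_map_metric,
# update_metric and match_all_classes are hypothetical names, not the PR's API.
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision


def build_map_metric(iou_threshold=0.4, match_all_classes=True, **tm_kwargs):
    """Defaults mimic the current evaluation; extra kwargs (e.g. rec_thresholds,
    max_detection_thresholds) let advanced users customize the evaluation."""
    metric = MeanAveragePrecision(
        box_format="xyxy", iou_thresholds=[iou_threshold], **tm_kwargs
    )
    metric.match_all_classes = match_all_classes
    return metric


def update_metric(metric, preds, targets):
    """COCO-style matching is done per class; to match all boxes regardless of
    label, collapse every label to a single class before updating."""
    if getattr(metric, "match_all_classes", False):
        preds = [{**p, "labels": torch.zeros_like(p["labels"])} for p in preds]
        targets = [{**t, "labels": torch.zeros_like(t["labels"])} for t in targets]
    metric.update(preds, targets)
```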

@bw4sz
Collaborator

bw4sz commented Jun 10, 2025

Great, just a quick note that we are merging a very large pre-2.0 push very soon. It won't literally touch anything you've done here, but rebasing will take some work since it covers most of the codebase. Hopefully it will be merged within a couple of days.

@bw4sz
Collaborator

bw4sz commented Jun 11, 2025

Agreed, we can add config args (see the new hydra config) to give users more control. Make sure to test for the edge cases: 1) all empty images, and 2) mixed empty and non-empty images. See empty_frame_accuracy.

self.empty_frame_accuracy = BinaryAccuracy()
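For those two edge cases, a hedged sketch of what the checks could look like against torchmetrics' MeanAveragePrecision; the tensors and helper names below follow the standard torchmetrics detection input format and are illustrative, not DeepForest test code:

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision


def empty_pred():
    return {"boxes": torch.zeros((0, 4)), "scores": torch.zeros(0),
            "labels": torch.zeros(0, dtype=torch.int64)}


def empty_target():
    return {"boxes": torch.zeros((0, 4)), "labels": torch.zeros(0, dtype=torch.int64)}


metric = MeanAveragePrecision(box_format="xyxy")

# 1) all empty images: no boxes predicted, no boxes annotated
metric.update([empty_pred(), empty_pred()], [empty_target(), empty_target()])

# 2) mixed empty and non-empty images
box = torch.tensor([[10.0, 10.0, 50.0, 50.0]])
metric.update(
    [empty_pred(), {"boxes": box, "scores": torch.ones(1),
                    "labels": torch.zeros(1, dtype=torch.int64)}],
    [empty_target(), {"boxes": box, "labels": torch.zeros(1, dtype=torch.int64)}],
)
print(metric.compute()["map"])
```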

Excited! This is something we have wanted for a long time but have not been able to crack.

@martibosch
Contributor Author

Hello @bw4sz! I have been using my fork and have already fixed a few issues (FYI, using this example: https://deepforest-modal-app.readthedocs.io/en/latest/treeai-example.html). I will check the edge cases mentioned (I think empty images are indeed problematic with the current code) and push the fixes to this PR in the following days.

Looking forward to DeepForest 2.0.

@bw4sz
Collaborator

bw4sz commented Jun 18, 2025

Great. I have identified a multi-GPU evaluate error (during the training loop), in which the GPU ranks don't properly gather all the data. I'm going to wait for this PR before fixing that; we might be able to get rid of our custom code entirely, since it's slow and I've never loved it.
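For reference, a hedged sketch (generic Lightning/torchmetrics conventions, not DeepForest's actual module) of why torchmetrics could replace the custom gather: metric state is synchronised across GPU ranks when compute() is called.

```python
import pytorch_lightning as pl
from torchmetrics.detection.mean_ap import MeanAveragePrecision


class DetectorModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.map_metric = MeanAveragePrecision(box_format="xyxy")

    def validation_step(self, batch, batch_idx):
        images, targets = batch
        preds = self(images)  # per-rank predictions
        self.map_metric.update(preds, targets)

    def on_validation_epoch_end(self):
        # compute() gathers metric state across all GPU ranks under DDP
        results = self.map_metric.compute()
        self.log("val_map", results["map"])
        self.map_metric.reset()
```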

@bw4sz
Collaborator

bw4sz commented Jun 27, 2025

@martibosch can I help here? It looks like a quick rebase; tell me if you are ready for review.

@bw4sz
Collaborator

bw4sz commented Jul 10, 2025

The next step here is to compare the evaluation scores from main.evaluate() with the ones from this PR to see what kind of difference there is.

@bw4sz bw4sz self-requested a review July 10, 2025 17:14
@martibosch
Contributor Author

Hello @bw4sz, sorry for the delay on my side; I have been busy preparing materials for a conference. I will pick this up again tomorrow. I think I will rebase this so that the CI runs on the pull request, and then we can compare results.

@martibosch martibosch force-pushed the feat-torchmetrics-eval branch from f1cf34c to 8dffa19 Compare July 12, 2025 14:59
@martibosch
Contributor Author

Hello! I think I fixed the conflicts; however, is there any reason why the test workflows are not running on this PR? Or are they running and I just can't see them?

@jveitchmichaelis
Collaborator

jveitchmichaelis commented Jul 12, 2025

Needs manual approval (for non-org members?), but that should be good to run now.

@martibosch
Contributor Author

Hello!

I compared the torchmetrics-based evaluation with the current one, and it seems that there are some differences that we will have to explore. I will get back to you when I find something.

In any case, here is a zip with the data and notebook that I am working with (you can see the different scores there):
torchmetrics-eval.zip

@jveitchmichaelis
Collaborator

Thanks, I'll take a look if I have a moment today. Maybe we can make a summary table and some plots to compare? I'd be interested to see if it's significant enough that we'd worry, but IoU (as an example) is almost always different between packages.

@jveitchmichaelis
Collaborator

jveitchmichaelis commented Aug 4, 2025

So I would defer to @ethanwhite and @bw4sz on this. I'm inclined to accept that we should standardize on torchmetrics' implementations even if they disagree with ours. We could for now host both versions with torchmetrics as default and ours as a "legacy" option. It doesn't look like anything is outrageously wrong, but I agree it would be useful to figure out where this comes from.

I would suggest checking if we get closer results if the box assignment is the same (i.e. the input to the IOU function is identical) or if this arises from an implementation difference in iou().

From two runs (ours vs. torchmetrics):

|        | Precision | Recall |
| ------ | --------- | ------ |
| Ours   | 0.20307664715486898 | 0.24961946337537655 |
| Theirs | 0.19584055459272098 | 0.2297029702970297 |

|        | Precision | Recall |
| ------ | --------- | ------ |
| Ours   | 0.510849773661384 | 0.5168490229029666 |
| Theirs | 0.5781637717121588 | 0.46534653465346537 |

The species classification results look near enough identical, though the precision differs.

One related issue from your notebook is that __evaluate_wrapper__ is not idempotent. You can't run it twice interactively. Not sure if that goes away with this PR.

I think it's in evaluate_boxes somewhere; we modify the predictions in place. We should probably make a copy under the hood there.
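A hedged sketch of that fix (illustrative signature, not the actual evaluate_boxes definition): copy the incoming frames before any in-place modification so repeated calls give the same result.

```python
def evaluate_boxes(predictions, ground_df, iou_threshold=0.4, **kwargs):
    # Work on copies so callers' dataframes are never mutated and the
    # evaluation stays idempotent across repeated interactive runs.
    predictions = predictions.copy()
    ground_df = ground_df.copy()
    # ... existing matching / scoring logic operates on the copies ...
```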

@jveitchmichaelis
Collaborator

jveitchmichaelis commented Aug 5, 2025

@martibosch could you please share the images you used for the example? I've had a look at extracting the raw data from coco_eval, but it's quite difficult to review without seeing the underlying images. Here blue is ground truth and red is predicted:

[image: ground-truth (blue) and predicted (red) boxes overlaid on an example image]

The predictions don't look particularly good to start with, so even if we check the library-assigned true positives I can't tell if they're sensible. The fact the boxes line up suggests I'm plotting the right annotations at least.

I made a wrapper around MeanAveragePrecision to get the cocoEval object to extract:

true_positives = [ann for ann in metric.coco_eval.cocoDt.anns.items() if ann[1].get('tp', False)]
false_positives = [ann for ann in metric.coco_eval.cocoDt.anns.items() if ann[1].get('fp', False)]
false_negatives = [ann for ann in metric.coco_eval.cocoGt.anns.items() if ann[1].get('fn', False)]

map_wrapper.ipynb.zip

It's a bit ugly for what amounts to about three lines of changed code to store the evaluation object, but it works. The eval library has an option to label the assignments: https://github.com/MiXaiLL76/faster_coco_eval/blob/45f747f7544be9049f448f1806f6fa43ebf97d36/faster_coco_eval/core/faster_eval_api.py#L118. I enabled this with coco_eval = coco_backend.cocoeval(coco_target, coco_preds, iouType=i_type, extra_calc=True).

You can get them from other places, but I like this way because there's no room for misinterpretation (each annotation dict gets a label).

The challenge with matching is that the eval tools use greedy assignment, which will probably be slightly different from the Hungarian matching that DeepForest (and your code) does. That alone might explain a small discrepancy. The matching function is found here: https://github.com/MiXaiLL76/faster_coco_eval/blob/45f747f7544be9049f448f1806f6fa43ebf97d36/csrc/faster_eval_api/coco_eval/cocoeval.cpp#L72
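To make the difference concrete, here is a hedged sketch of the two strategies on a precomputed per-image IoU matrix (rows are ground truths, columns are predictions sorted by descending score); both are simplified and ignore details like crowd regions and per-category grouping.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def greedy_match(iou, iou_threshold=0.5):
    """COCO-style: walk predictions in score order, give each the best
    still-unmatched ground truth above the threshold."""
    matches, taken = {}, set()
    for det in range(iou.shape[1]):
        for gt in np.argsort(-iou[:, det]):
            if gt not in taken and iou[gt, det] >= iou_threshold:
                matches[det] = gt
                taken.add(gt)
                break
    return matches


def hungarian_match(iou, iou_threshold=0.5):
    """Globally optimal assignment maximising total IoU, thresholded afterwards."""
    gt_idx, det_idx = linear_sum_assignment(-iou)
    return {d: g for g, d in zip(gt_idx, det_idx) if iou[g, d] >= iou_threshold}
```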

The original from pycocotools is here: https://github.com/ppwwyyxx/cocoapi/blob/8cbc887b3da6cb76c7cc5b10f8e082dd29d565cb/PythonAPI/pycocotools/cocoeval.py#L272. The faster version just ported it to C++ from what I can tell and torchmetrics say they found no difference.

For what it's worth, Ultralytics (YOLO) has had an open issue about this since 2021. This is not a problem unique to us :)

@martibosch
Contributor Author

> I would suggest checking if we get closer results if the box assignment is the same (i.e. the input to the IOU function is identical) or if this arises from an implementation difference in iou().

I checked this a while ago (see this comment and the previous commit which the comment refers to), and there were no mismatches between deepforest and torchmetrics; however, I have changed the code quite a bit since then, so we should check this again.

> One related issue from your notebook is that __evaluate_wrapper__ is not idempotent. You can't run it twice interactively. Not sure if that goes away with this PR.
>
> I think it's in evaluate_boxes somewhere; we modify the predictions in place. We should probably make a copy under the hood there.

I noticed that too and I have already addressed it, so it should not be happening here.

@jveitchmichaelis
Collaborator

> I compared the torchmetrics-based evaluation with the current one, and it seems that there are some differences that we will have to explore. I will get back to you when I find something.

I've been looking into the results you shared here, as far as differences go. I'll continue the discussion on the issue.

@bw4sz
Collaborator

bw4sz commented Aug 6, 2025 via email

@jveitchmichaelis
Collaborator

jveitchmichaelis commented Aug 6, 2025

Yeah see #901 (comment). As long as the results are sane and repeatable, it should be OK.

If we run a large evaluation, like on MillionTrees, is there a time difference?

My guess is probably not for matching. That runs per image, and the forward pass dominates. We could do some testing on, say, 10M synthetic labels just to see, but for a one-off evaluation I doubt it would take that long. You're running a bunch of N^2 IoU calculations (per image), plus the matrix ops for Hungarian matching (even worse, at N^3), but N isn't quite big enough to make it blow up. I also want to say that in the source for models like Mask2Former they use library functions like we do; they're not hand-optimized. There are memory concerns for larger matrices, but we're not at that point.

Running the COCO-style evaluation might take a long time (especially with pycocotools, even though there are some C bits), but faster-coco-eval solved a few problems there by porting to C++, so we should definitely use that backend. I think that evaluation is still on the order of seconds, and most of it is forming the reverse index between image IDs and annotations. Though if we have to depend on that library anyway, we could run the eval directly to get access to the matches. It feels odd to do that, but torchmetrics doesn't do anything particularly clever beyond making it easy to call update/compute. I would definitely use their metric class during training, since it can be used in a MetricCollection.
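A hedged sketch of that last point, assuming a recent torchmetrics version where MeanAveragePrecision accepts a backend argument:

```python
from torchmetrics import MetricCollection
from torchmetrics.detection.iou import IntersectionOverUnion
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# Compose detection metrics for the training/validation loop; the
# faster_coco_eval backend avoids the slower pycocotools evaluation.
metrics = MetricCollection({
    "map": MeanAveragePrecision(box_format="xyxy", backend="faster_coco_eval"),
    "iou": IntersectionOverUnion(box_format="xyxy"),
})
# metrics.update(preds, targets) each step; metrics.compute() at epoch end.
```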

@bw4sz
Collaborator

bw4sz commented Aug 18, 2025

```
TypeError: __evaluate_wrapper__() got an unexpected keyword argument 'numeric_to_label_dict'
```

This error appears for several tests. Must have been something rebased from main; we can fix this easily.

@bw4sz bw4sz added this to the DeepForest 2.1 milestone Sep 4, 2025
@jveitchmichaelis
Collaborator

jveitchmichaelis commented Sep 26, 2025

@martibosch take a look at the PR here. I had a look at where we could optimize without changing any of the existing interfaces, and this looks quite promising. The short answer is that most of our existing inefficiency can be removed by vectorising and avoiding lots of repeated dataframe operations.

We have switched to faster_coco_eval in main, which I think is even faster than my improvements, but I wanted to see where we could get a performance boost without losing the ability to compare with older metrics. I'm also in touch with the maintainers of faster_coco_eval, who seem open to feature requests, one of which could be exposing matching results so that you don't need to awkwardly inspect the coco object inside torchmetrics.

The other difference that's apparent in very large tests is that torchmetrics (really pycocotools/faster_coco_eval) is single-threaded. So if you have a large dataset and a beefy machine, it sits for ages spinning on a single core when most of the per-image processing could be parallelised.
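A hedged sketch of that image-level parallelism; evaluate_single_image is a stand-in for the per-image matching routine, not an existing DeepForest function:

```python
from concurrent.futures import ProcessPoolExecutor


def evaluate_images_parallel(per_image_inputs, evaluate_single_image, workers=8):
    """Run the per-image matching in separate processes and collect the results."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(evaluate_single_image, per_image_inputs))
```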

@martibosch
Contributor Author

Hello @jveitchmichaelis,

Yes, the ideal scenario would be to retrieve the matches directly from torchmetrics, which I found to be a bit too hacky, especially if the user is not bound to a specific backend. However, if the user will be bound to faster_coco_eval, couldn't we just use the faster_coco_eval.mask.io method, which essentially runs in C++? I do not know if that is faster than linear_sum_assignment.

In any case, it makes sense not to use torchmetrics if we do not want many of their features (e.g., maximum detection thresholds), but I am wondering to what extent it would make sense to allow the user to retrieve the whole precision/recall curve.
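For the precision/recall curve, a hedged sketch of what torchmetrics already exposes, assuming a torchmetrics version that supports extended_summary:

```python
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision(box_format="xyxy", extended_summary=True)
# ... metric.update(preds, targets) over the dataset ...
results = metric.compute()
# Raw COCO-style arrays indexed by [IoU thr, recall thr, class, area range, max dets]
precision_curve = results["precision"]
recall = results["recall"]
```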

Finally, out of curiosity, is using the STRtree faster than a spatial join with geopandas? A while ago I used the overlay functions and I was surprised by how fast they ran over large datasets. However, I suppose that in most DeepForest use cases the individual image GeoDataFrame will most certainly be rather small, in which case most gains would indeed come from exploiting the "embarrassingly parallel" nature of the multi-image evaluation.

Let me know if you prefer to move the review to your PR. Best,
Martí

@jveitchmichaelis
Collaborator

jveitchmichaelis commented Sep 28, 2025

@martibosch I think your effort here stands and you can go with the assumption that we will restrict to faster_coco_eval for now. There is also a semi-maintained(?) branch from FAIR, but it seems to be mainly to have a controlled package for Detectron2.

The reason I was poking around with the eval code is that we're running some very large tests at the moment, and it was taking a while to process O(10k+) images. We also needed the output to compare with existing/published results, so I wanted to make sure we got identical outputs. The end result takes around the same time as faster_coco_eval (5 mins for 30k images), but I have a parallel implementation on top which cuts it down to about 40 seconds. This matters on cluster machines, because we're wasting compute at the moment. I was surprised at how good the improvement was, and it's kind of a stepping stone to this PR.

I don't see any reason why we couldn't use faster_coco_eval directly and add image-level parallelism later. As we discussed, there is little benefit in torchmetrics specifically for mAP, since it's a wrapper around other backends, aside from the convenience methods for update/compute. But we could write a custom torchmetrics metric that exposes the methods you need for this. Although computing matches etc. isn't too expensive, it does seem silly to do it twice just for our box precision/recall, and if we can use the output from the coco eval then that's great.

One question is: can we get this feature pulled into torchmetrics directly? I would be happy to raise a PR with them, as it's a pretty minor change to expose the attributes (the developers of faster_coco_eval are also very responsive). But I don't see them supporting multi-threading any time soon.

> Finally, out of curiosity, is using the STRtree faster than a spatial join with geopandas? A while ago I used the overlay functions and I was surprised by how fast they ran over large datasets. However, I suppose that in most DeepForest use cases the individual image GeoDataFrame will most certainly be rather small, in which case most gains would indeed come from exploiting the "embarrassingly parallel" nature of the multi-image evaluation.

From profiling, most of the improvements came from vectorisation and avoiding implicit loops. Most of the time now is spent computing the intersection and union areas. Tree construction and querying are comparably fast. rtree is slow because (as far as I know?) it doesn't support batch queries. I imagine a geopandas spatial join would also be fast, but I'm not as familiar with optimizing pandas code. Similarly, linear sum assignment is negligible, possibly because we have a friendly use case where there is little work for the algorithm to do once NMS has run. Since the bulk of the processing is deferred to shapely, numpy, scipy, etc., we get performance that's comparable to compiled code anyway. That's my take at least.
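For concreteness, a hedged sketch of that vectorised per-image path using shapely 2.x batch operations and an STRtree; variable names are illustrative, not the actual DeepForest implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from shapely import STRtree, area, box, intersection


def match_boxes(pred_boxes, gt_boxes, iou_threshold=0.4):
    """pred_boxes / gt_boxes: (N, 4) arrays of xmin, ymin, xmax, ymax."""
    preds = box(pred_boxes[:, 0], pred_boxes[:, 1], pred_boxes[:, 2], pred_boxes[:, 3])
    gts = box(gt_boxes[:, 0], gt_boxes[:, 1], gt_boxes[:, 2], gt_boxes[:, 3])
    # Batch STRtree query returns only the candidate (prediction, ground truth) pairs
    pred_idx, gt_idx = STRtree(gts).query(preds)
    inter = area(intersection(preds[pred_idx], gts[gt_idx]))
    union = area(preds[pred_idx]) + area(gts[gt_idx]) - inter
    iou = np.zeros((len(preds), len(gts)))
    iou[pred_idx, gt_idx] = inter / union
    # Hungarian assignment on the dense IoU matrix, thresholded afterwards
    rows, cols = linear_sum_assignment(-iou)
    return [(r, c, iou[r, c]) for r, c in zip(rows, cols) if iou[r, c] >= iou_threshold]
```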

@bw4sz bw4sz removed their request for review October 9, 2025 17:59