Dear Authors.
Based on your evaluation script,
accuracy scores are accumulated if and only if the answer strictly matches the GT.
```python
def eval_acc(self):
    scores = []
    for i in range(len(self.accuracy["answer"])):
        answer = self.accuracy["answer"][i]
        GT = self.accuracy["GT"][i]
        if answer == GT:
            scores.append(1.0)
        else:
            scores.append(0.0)
    scores = sum(scores) / len(scores)
    return scores
```

I find this odd, since there can be multiple variations of a correct answer, like so:
GT: `A`
answer (model prediction): `The correct answer is A. The ego vehicle is steering to the left.`

Such a case is not counted as correct, even though the model prediction is contextually the same as the GT.
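One possible remedy (a minimal sketch, not part of your evaluation script; the helper name `extract_choice` and the assumption that GT labels are single option letters A–E are mine) would be to normalize the model prediction to its option letter before the exact-match comparison:

```python
import re

def extract_choice(answer: str) -> str:
    """Reduce a free-form model answer to a single option letter (A-E).

    Hypothetical helper, assuming the GT is always one uppercase letter.
    Falls back to the raw string when no letter is found.
    """
    answer = answer.strip()
    # Already just the option letter: return it unchanged.
    if re.fullmatch(r"[A-E]", answer):
        return answer
    # Otherwise take the first standalone uppercase letter,
    # e.g. the "A" in "The correct answer is A. ...".
    m = re.search(r"\b([A-E])\b", answer)
    return m.group(1) if m else answer

print(extract_choice("A"))  # → A
print(extract_choice("The correct answer is A. The ego vehicle is steering to the left."))  # → A
```

With such a normalization applied to `answer` before the `answer == GT` check, verbose but contextually correct predictions would still be credited.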
Since we do not have access to the ground truth of the validation set, I was not able to verify whether my concern is warranted. I would really appreciate it if you could look into this inquiry and get back to us.
Thank you very much for your amazing work and your precious time.