Preface
Evaluating the accuracy of the output of an NLP component is a science in itself.
When a new NLP algorithm, method or tool is published, it is always accompanied by benchmarks against existing systems.
Those benchmarks are produced using standard evaluation techniques and datasets.
These evaluation techniques are not always automatic; human judgment is sometimes necessary, and in that case there is nothing Cadmium can do to help.
However, a number of existing tools can help, depending on the NLP task to be tested:
- Precision, recall and F1 score are useful statistical metrics when evaluating classification tasks such as POS tagging, sentiment analysis, etc. (a minimal sketch of these metrics follows below).
- METEOR or BLEU if Cadmium ever does machine translation.
- ROUGE for summarization evaluation.
On top of those tools, we can use standard datasets and corpora that are already gold-labelled and human-checked.
These are just examples found after a cursory search; the list is longer and the tools are improving quickly.
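To make the first metric concrete, here is a minimal sketch of per-label precision, recall and F1 in Crystal. The Evaluator module and the precision_recall_f1 method are placeholder names for illustration, not an existing Cadmium API.

```crystal
# Minimal sketch of per-label precision, recall and F1 from predicted vs. gold
# tags. Module and method names are illustrative only, not an existing Cadmium API.
module Evaluator
  def self.precision_recall_f1(predicted : Array(String), gold : Array(String), label : String)
    raise ArgumentError.new("size mismatch") unless predicted.size == gold.size

    tp = fp = fn = 0
    predicted.zip(gold) do |p, g|
      if p == label && g == label
        tp += 1 # true positive: predicted the label and it was correct
      elsif p == label
        fp += 1 # false positive: predicted the label, gold disagrees
      elsif g == label
        fn += 1 # false negative: missed a gold occurrence of the label
      end
    end

    precision = (tp + fp).zero? ? 0.0 : tp / (tp + fp).to_f
    recall    = (tp + fn).zero? ? 0.0 : tp / (tp + fn).to_f
    f1        = (precision + recall).zero? ? 0.0 : 2 * precision * recall / (precision + recall)
    {precision: precision, recall: recall, f1: f1}
  end
end

# Example: evaluating POS output for the NOUN label
predicted = %w(NOUN VERB NOUN ADJ)
gold      = %w(NOUN NOUN NOUN ADJ)
puts Evaluator.precision_recall_f1(predicted, gold, "NOUN")
# => {precision: 1.0, recall: 0.6666666666666666, f1: 0.8}
```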
Details
The main idea of this proposal is to:
- Create a cadmiumcr/evaluator repository. This module will provide the tools listed above, plus methods to conveniently download large datasets of gold-labelled data.
- Create a cadmiumcr/benchmark repository. This repository will be more like a custom set of Crystal scripts that use the tools of Cadmium::Evaluator to run benchmarks against the vanilla tools of Cadmium (classifiers, POS tagging, language identification, etc.) and display the results next to those of competing tools.
The point is to give a glimpse of Cadmium's possibilities and to routinely check our tools' accuracy (which crystal spec is not intended to do).
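To give an idea of what such a benchmark script could look like, here is a self-contained Crystal sketch. The toy gold data, the stand-in taggers and the accuracy report are all placeholders; a real script in cadmiumcr/benchmark would run Cadmium's actual tools, score them with Cadmium::Evaluator, and print the results next to the published scores of competing systems.

```crystal
# Self-contained sketch of a cadmiumcr/benchmark-style script.
# The taggers and the toy gold data below are stand-ins for Cadmium's real tools.

# A gold-labelled toy sample: {token, gold POS tag}
gold = [{"the", "DET"}, {"cat", "NOUN"}, {"sleeps", "VERB"}, {"soundly", "ADV"}]

# Two stand-in taggers mapping a token to a predicted tag.
lexicon = {"the" => "DET", "cat" => "NOUN", "sleeps" => "VERB"}
taggers = {
  "baseline (everything is a NOUN)" => ->(token : String) { "NOUN" },
  "lexicon lookup"                  => ->(token : String) { lexicon.fetch(token, "NOUN") },
}

# Run each tagger over the gold data and report token-level accuracy.
taggers.each do |name, tagger|
  correct = gold.count { |token, tag| tagger.call(token) == tag }
  accuracy = correct / gold.size.to_f
  puts "#{name}: #{(accuracy * 100).round(1)}% accuracy"
end
```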
This proposal is mainly a braindump, as I don't intend to start working on this in the short term (I have to finish my POS tagger first!).