Releases · mjpost/sacrebleu

12 Jan 17:13

martinpopel

v2.6.0

2277cac

v2.6.0 Latest

Python Support: Dropped Python 3.8, added Python 3.13 (requires-python = ">=3.9")
License Format: Changed to PEP 639 bare SPDX format (license = "Apache-2.0")
Build System: Updated setuptools requirement from >=64 to >=77 for PEP 639 support
mecab-ko: Updated dependency to >=1.0.2,<2.0.0 (now supports Apple Silicon and Python 3.12+)
New Tokenizer: spBLEU-1K

03 Jan 20:08

mjpost

v2.5.0

fa700fd

v2.5.0

Add WMT 2024 test sets.

21 Jun 13:11

mjpost

v2.4.2

908ce50

v2.4.2

Allow printing of the domain (via --echo domain) if available on the underlying test set.

18 Oct 20:50

mjpost

v2.3.0

5166cf7

v2.3.0

Features:

(#203) Added -tok flores101 and -tok flores200, a.k.a. spbleu. These are multilingual tokenizations that make use of the multilingual SPM models released by Facebook and described in the following papers:
- Flores-101: https://arxiv.org/abs/2106.03193
- Flores-200: https://arxiv.org/abs/2207.04672
(#213) Added JSON formatting for multi-system output (thanks to Manikanta Inugurthi, @me-manikanta).
(#211) You can now list all test sets for a language pair with --list SRC-TRG. Thanks to Jaume Zaragoza (@ZJaume) for adding this feature.
Added WMT22 test sets (test set wmt22).
System outputs: included with wmt22. Also added wmt21/systems, which will produce WMT21 submitted systems. To see available systems, give a dummy system to --echo, e.g., sacrebleu -t wmt22 -l en-de --echo ?

Contributors

ZJaume and me-manikanta

25 Jul 21:12

mjpost

v2.2.0

c802abc

v2.2.0

This release contains an internal reworking of the data representations, contributed by @BrightXiaoHan. This enables the following features:

Added WMT21 datasets (which are properly XML-encoded)
Exposed corpus metadata via --echo (including origlang, docid, and genre, which are all available for most WMT corpora)

We also added a Korean tokenizer (--tok ko-mecab), contributed by @NoUnique.

In addition, there are a number of bug fixes and minor changes:

Empty references (#161) are now allowed. Some of our speech test sets could not be used before this was fixed!
We now recommend that people use the spm tokenizer, particularly for CJK languages.
Internally, the tarball downloads and extracted test and metadata files now have names that are globally unique (e.g., .sacrebleu/wmt21/wmt_21.en-de.ref instead of .sacrebleu/wmt21/de-en.ref). The file extension corresponds to the field that gets passed to --echo.

Contributors

NoUnique and BrightXiaoHan

09 Aug 18:14

ozancaglayan

v2.0.0

078c440

v2.0.0

This is a major release that introduces statistical significance testing for BLEU, chrF and TER. As of v2.0.0, the default output format of the CLI utility is JSON rather than the old single-line output. Tools that parse standard output need to adapt to this change.
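
For tools that used to parse the single-line output, a minimal sketch of reading the JSON form might look like the following. The sample payload is hypothetical: it assumes the JSON object carries name, score and signature keys, and that the signature uses the '|' and ':' separators described in the changes listed for this release. Check the output of your installed version before relying on these names.

```python
import json

# Hypothetical sample of the JSON printed to standard output; the key names
# and values here are assumptions for illustration, not captured output.
SAMPLE_STDOUT = """
{
 "name": "BLEU",
 "score": 20.8,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0"
}
"""


def parse_result(stdout_text):
    """Return (metric name, score, signature fields) from JSON output."""
    result = json.loads(stdout_text)
    # Fields are separated by '|'; each key is separated from its value by
    # the first ':' in the field.
    fields = dict(
        field.split(":", 1) for field in result["signature"].split("|")
    )
    return result["name"], float(result["score"]), fields


name, score, signature = parse_result(SAMPLE_STDOUT)
print(name, score, signature["tok"])
```

Splitting on the first ':' only keeps any further colons inside a value intact, which seems the safer choice when the set of signature fields is not known in advance.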

Build: Add Windows and OS X testing to github workflow
Improve documentation and type annotations.
Drop Python < 3.6 support and migrate to f-strings.
Drop input type manipulation through isinstance checks. If the user does not follow the expected annotations, exceptions will be raised. Robustness attempts led to confusion and obfuscated score errors in the past (fixes #121).
Use colored strings in tabular outputs (multi-system evaluation mode) with the help of the colorama package.
tokenizers: Add caching to tokenizers, which seems to speed things up a bit.
intl tokenizer: Use the regex module. Speed goes from ~4 seconds to ~0.6 seconds for a particular test set evaluation (fixes #46).
Signature: Formatting changed (mostly to remove the '+' separator, as it was interfering with chrF++). The field separator is now '|', and keys and values are separated with ':' rather than '.'.
Metrics: Scale all metrics into the [0, 100] range (fixes #140)
BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (fixes #141).
BLEU: allow modifying max_ngram_order (fixes #156)
CHRF: Added multi-reference support, verified the scores against chrF++.py, added test case.
CHRF: Added chrF+ support through the word_order argument. Added test cases against chrF++.py. Exposed it through the CLI (--chrf-word-order) (fixes #124).
CHRF: Add possibility to disable effective order smoothing (pass --chrf-eps-smoothing). This way, the scores obtained are exactly the same as those of the chrF++, Moses and NLTK implementations. We keep the effective ordering as the default for compatibility, since this only affects sentence-level scoring with very short sentences (fixes #144).
CLI: Allow modifying TER arguments through CLI. We still keep the TERCOM defaults.
CLI: Prefix metric-specific arguments with --chrf and --ter. To maintain compatibility, BLEU argument names are kept the same.
CLI: Added --format/-f flag. The single-system output mode is now json by default. If you want to keep the old text format persistently, you can export SACREBLEU_FORMAT=text in your shell.
CLI: sacreBLEU now supports evaluating multiple systems for a given test set in an efficient way. Through the use of the tabulate package, the results are rendered as a plain text table, LaTeX, HTML or RST (cf. the --format/-f argument). The systems can be given either as a list of plain text files to -i/--input or as a single tab-separated stream redirected into STDIN. In the former case, the basenames of the files will be automatically used as system names.
Statistical tests: sacreBLEU now supports confidence interval estimation through bootstrap resampling for single-system evaluation (--confidence flag) as well as paired bootstrap resampling (--paired-bs) and paired approximate randomization tests (--paired-ar) when evaluating multiple systems (fixes #40 and fixes #78).
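
To illustrate the idea behind the bootstrap-based tests, here is a generic sketch of percentile bootstrap confidence intervals and paired bootstrap resampling. It is not sacreBLEU's implementation: the function names, the toy exact-match metric, the default of 1000 resamples and the fixed seed are all assumptions made for this example, and sacreBLEU may compute and report its intervals differently.

```python
import random


def corpus_accuracy(pairs):
    """Toy corpus-level metric: percentage of exact matches, in [0, 100]."""
    return 100.0 * sum(hyp == ref for hyp, ref in pairs) / len(pairs)


def bootstrap_ci(hyps, refs, metric, n_samples=1000, alpha=0.05, seed=12345):
    """Percentile bootstrap confidence interval for a corpus-level metric."""
    rng = random.Random(seed)
    n = len(hyps)
    scores = []
    for _ in range(n_samples):
        # Resample sentence indices with replacement, then rescore the corpus.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([(hyps[i], refs[i]) for i in idx]))
    scores.sort()
    low = scores[int((alpha / 2) * n_samples)]
    high = scores[int((1 - alpha / 2) * n_samples) - 1]
    return low, high


def paired_bootstrap_wins(hyps_a, hyps_b, refs, metric, n_samples=1000, seed=12345):
    """Fraction of resamples in which system B scores strictly above system A."""
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(n_samples):
        # The same resampled indices are used for both systems (paired).
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = metric([(hyps_a[i], refs[i]) for i in idx])
        score_b = metric([(hyps_b[i], refs[i]) for i in idx])
        wins += score_b > score_a
    return wins / n_samples


refs = ["a", "b", "c", "d", "e", "f", "g", "h"]
sys_a = ["a", "b", "x", "x", "x", "x", "x", "x"]  # 2 of 8 correct
sys_b = ["a", "b", "c", "d", "e", "f", "x", "x"]  # 6 of 8 correct

low, high = bootstrap_ci(sys_a, refs, corpus_accuracy)
win_rate = paired_bootstrap_wins(sys_a, sys_b, refs, corpus_accuracy)
print(f"system A: {corpus_accuracy(list(zip(sys_a, refs))):.1f} [{low:.1f}, {high:.1f}]")
print(f"B beats A in {win_rate:.1%} of resamples")
```

The key point of the paired variant is that both systems are scored on the same resampled sentences, so differences in the draw cannot be mistaken for differences between systems.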

05 Mar 12:13

ozancaglayan

v1.5.1

5dfcaa3

v1.5.1

Minor bugfix release:

Fix extraction error for WMT18 extra test sets (test-ts) (#142)
Add validation and test datasets for the multilingual TEDx dataset (#136)
Make sure sacreBLEU still runs on Python 3.5 by pinning portalocker dependency

21 Jan 15:08

mjpost

v1.5.0

657c98a

v1.5.0

1.5.0 (2021-01-15)

Fix an assertion error in chrF (#121)
Add missing __repr__() methods for BLEU and TER
TER: Fix exception when --short is used (#131)
Pin Mecab version to 1.0.3 for Python 3.5 support
[API Change]: Default value for floor smoothing is now 0.1 instead of 0.
[API Change]: sacrebleu.sentence_bleu() now uses the exp smoothing method, exactly the same as the CLI's --sentence-level behavior. This was mainly done to make the two methods behave the same.
Add smoothing value to BLEU signature (#98)
dataset: Fix IWSLT links (#128)
Allow variable number of references for BLEU (only via API) (#130). Thanks to Ondrej Dusek (@tuetschek).
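
The two smoothing methods mentioned in the API changes can be sketched as follows. This is a simplified illustration of exp and floor smoothing as commonly described, not a copy of sacreBLEU's code: the function name smoothed_precisions and its arguments are hypothetical, and the library's handling of edge cases may differ.

```python
def smoothed_precisions(correct, total, method="exp", floor_value=0.1):
    """Return n-gram precisions (0-100) with zero match counts smoothed.

    correct[i] and total[i] are the matched and hypothesis (i+1)-gram counts.
    """
    precisions = []
    penalty = 1.0
    for matched, count in zip(correct, total):
        if count == 0:
            # No n-grams of this order in the hypothesis: stop here.
            break
        if matched > 0:
            precisions.append(100.0 * matched / count)
        elif method == "exp":
            # Each successive zero match halves the pseudo-count: 1/2, 1/4, ...
            penalty *= 2
            precisions.append(100.0 / (penalty * count))
        elif method == "floor":
            # Replace the zero match count with a small constant.
            precisions.append(100.0 * floor_value / count)
        else:
            raise ValueError(f"unknown smoothing method: {method}")
    return precisions


# A short sentence with 3 unigram matches but no bigram or trigram matches.
exp_precisions = smoothed_precisions([3, 0, 0], [5, 4, 3], method="exp")
floor_precisions = smoothed_precisions([3, 0, 0], [5, 4, 3], method="floor")
print(exp_precisions)
print(floor_precisions)
```

In this sketch, a floor value of 0 leaves the zero-match precisions at 0, which drives their geometric mean to 0 as well; a small positive value such as 0.1 avoids that.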

30 Jul 19:07

ozancaglayan

v1.4.13

abfbf38

v1.4.13

1.4.13 (2020-07-30)

Added WMT20 newstest test sets (#103)
Make mecab-python3 an extra dependency, adapt code to new mecab-python3. This fixes the recent Windows installation issues as well (#104). Japanese support should now be explicitly installed through the sacrebleu[ja] package.
Fix return type annotation of corpus_bleu()
Improve sentence_score's documentation, do not allow single ref string (#98)
