This is a repository containing data for the Robust Word Sense Induction shared task.
For details, please visit the website.
The sample/ directory contains the annotated test sets for three words for each language.
The files are encoded as UTF-8 and use a columnar format with columns separated by TAB characters. No quoting is used, and the first line gives the names of the columns. All the files have the same structure.
- Column headword represents the headword.
- Columns starting with sense represent the "gold" annotations, one column per annotator. A value ending with an x means that the annotator has not marked this line in any way.
- Column text contains the sentence within which the specific occurrence appears.
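The format described above can be read with Python's standard csv module. The following is a minimal sketch, not part of the repository: the helper names, the sample rows, and the column names sense_a and sense_b are invented for illustration (the source only says that annotation columns start with sense).

```python
import csv
import io


def read_rows(stream):
    """Parse a TAB-separated stream with a header line and no quoting."""
    # QUOTE_NONE makes the reader treat quote characters as ordinary text,
    # which matches the "no quoting is used" rule of the data files.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def read_tsv(path):
    """Read one data file from disk; the files are encoded as UTF-8."""
    with open(path, encoding="utf-8", newline="") as handle:
        return read_rows(handle)


def sense_columns(fieldnames):
    """Return the annotation columns, i.e. those whose name starts with 'sense'."""
    return [name for name in fieldnames if name.startswith("sense")]


# Hypothetical content: the rows and the sense_a / sense_b names are made up.
SAMPLE = (
    "headword\tsense_a\tsense_b\ttext\n"
    "bank\t1\t2\tShe sat on the bank of the river.\n"
    'bank\t2\t2\tThe "bank" raised its rates.\n'
)

rows = read_rows(io.StringIO(SAMPLE))
annotator_cols = sense_columns(rows[0].keys())
```

Disabling quoting matters here: with the default csv settings, a sentence that begins with a quotation mark would be misread as a quoted field.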
The test directory contains the files to be clustered by your word sense induction system. The format differs from the sample files by omitting the annotation columns, which are used for the evaluation.
To obtain good performance, the scorer is written in Rust. The source code is in the scorer/ directory, and a prebuilt static binary for x86_64 Linux is present in the scorer/bin/ directory.
Annotate the test set using your own WSI system and create a TSV file containing a column with the cluster labels. A header needs to be present. The default name for the cluster column is cluster. Other columns may be present as well. You can also place the column with the cluster labels into the file containing the gold data.
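Writing such a file could look like the sketch below. It is an illustration under assumptions, not the organizers' tooling: the function name and the example rows are hypothetical, and it assumes the labels are given in the same order as the rows of the test file.

```python
import io


def write_cluster_file(stream, rows, labels, cluster_column="cluster"):
    """Write the input rows plus one cluster label per row as unquoted TSV."""
    if not rows:
        raise ValueError("no rows to write")
    if len(rows) != len(labels):
        raise ValueError("need exactly one label per input row")
    columns = list(rows[0].keys()) + [cluster_column]
    # The header line is mandatory; it names every column, including the labels.
    stream.write("\t".join(columns) + "\n")
    for row, label in zip(rows, labels):
        fields = [row[name] for name in columns[:-1]] + [str(label)]
        # No quoting is available, so TAB or newline inside a field is an error.
        if any("\t" in field or "\n" in field for field in fields):
            raise ValueError("fields must not contain TAB or newline characters")
        stream.write("\t".join(fields) + "\n")


# Hypothetical rows, as they might be read from a file in the test directory.
rows = [
    {"headword": "bank", "text": "She sat on the bank of the river."},
    {"headword": "bank", "text": "The bank raised its rates."},
]
buffer = io.StringIO()
write_cluster_file(buffer, rows, [0, 1])
output = buffer.getvalue()
```

To write to disk instead of a buffer, pass a file opened with open(path, "w", encoding="utf-8", newline="").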
Then run the scorer and observe the output:
./bin/scorer GOLD_FILE -f CLUSTER_FILE
To change the name of the cluster column, use the -c option. If your labels are in the same file as the gold data, omit the -f option.
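When the scorer is called from a script, the command line can be assembled programmatically. The sketch below uses only the -f and -c options described above; the function names and file names are hypothetical, and it assumes that the scorer prints its results to standard output and that the working directory is scorer/.

```python
import subprocess


def scorer_command(gold_file, cluster_file=None, cluster_column=None,
                   scorer="./bin/scorer"):
    """Build the argument list for the scorer binary."""
    command = [scorer, gold_file]
    if cluster_file is not None:
        # Omitted when the labels live in the gold file itself.
        command += ["-f", cluster_file]
    if cluster_column is not None:
        # Only needed when the label column is not named "cluster".
        command += ["-c", cluster_column]
    return command


def run_scorer(gold_file, cluster_file=None, cluster_column=None):
    """Run the scorer and return whatever it printed (assumed to be on stdout)."""
    command = scorer_command(gold_file, cluster_file, cluster_column)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical file names, shown without running the binary.
separate_files = scorer_command("gold.tsv", "clusters.tsv")
single_file = scorer_command("gold.tsv", cluster_column="my_labels")
```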
To build the program yourself, install Rust using https://rustup.rs/ and then run cargo build --release from the scorer/ directory.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Do not hesitate to contact us at [email protected].