This is a repository containing ShadowSense, a word sense annotated dataset for Czech and English.
For a detailed description, please read the paper.
The data/ directory contains the annotated test sets.
- English.tsv.zst contains the full English dataset compressed using zstd.
- Czech.tsv.zst contains the full Czech dataset compressed using zstd.
- English_sample.tsv contains the first 1000 rows of the English dataset.
- Czech_sample.tsv contains the first 1000 rows of the Czech dataset.
Note that the compressed files are stored using Git LFS, which you might need to install to be able to access them from a local copy of the repository.
The files are encoded as UTF-8 and use a columnar format with columns separated by TAB characters. No quoting is used, and the first line gives the names of the columns. All the files have the same structure.
- Column head represents the headword.
- Columns starting with sense represent the "gold" annotations, one column per annotator. A value ending with an x means that the annotator has not marked this line in any way.
- Column text contains the sentence within which the specific occurrence appears.
- Columns rel and col are the word sketch relations used for extracting the instances from the corpus.
- Column pos shows the token number in the underlying corpus.
  - The English dataset uses the enTenTen08 corpus.
  - The Czech dataset uses the csTenTen17 corpus.
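As a rough illustration of the layout described above, the following Python sketch parses the format with the standard csv module, with quoting disabled, and picks out the annotator columns by their sense prefix. The helper name read_rows is made up for this example, and so are the sample rows and the exact annotator column names (sense1, sense2); the real files may name and order their columns differently.

```python
import csv
import io


def read_rows(stream):
    """Parse a TAB-separated stream with a header line and no quoting."""
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return reader.fieldnames, list(reader)


# Hypothetical two-row sample. The annotator column names (sense1, sense2)
# and all values are invented for illustration only.
sample = (
    "head\tsense1\tsense2\ttext\trel\tcol\tpos\n"
    "bank\t1\t1\tShe sat on the bank of the river.\tpp_of\triver\t101\n"
    "bank\t2\t2\tThe bank raised \"interest\" rates.\tsubject_of\traise\t202\n"
)

columns, rows = read_rows(io.StringIO(sample, newline=""))

# One gold-annotation column per annotator, recognised by the "sense" prefix.
sense_columns = [name for name in columns if name.startswith("sense")]

print(sense_columns)
print(rows[1]["text"])  # quote characters are kept literally
```

To try this on a real file, the in-memory stream could be replaced with something like open("data/English_sample.tsv", encoding="utf-8", newline="").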
For good performance, the scorer is written in Rust. The source code is in the scorer/ directory, and a prebuilt static binary for x86_64 Linux is present in the bin/ directory.
Annotate the test set using your own WSI system and add the result as another column in the file. Only the sense and head columns need to be kept.
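One way to prepare such a file is sketched below in Python. The column name my_wsi and the predict function are placeholders (a real WSI system would replace the stub), and the sample rows and annotator column names are invented. Only the head and sense columns are copied to the output, which the step above permits.

```python
import csv
import io


def predict(head, text):
    """Stand-in for a real WSI system: returns a dummy cluster label."""
    return "river" if "river" in text else "other"


def add_predictions(src, dst, new_column="my_wsi"):
    """Copy the head and sense* columns from src to dst, appending predictions."""
    reader = csv.DictReader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
    kept = [c for c in reader.fieldnames if c == "head" or c.startswith("sense")]
    # Write plain TAB-joined lines so that no quoting is ever introduced.
    dst.write("\t".join(kept + [new_column]) + "\n")
    for row in reader:
        label = predict(row["head"], row["text"])
        dst.write("\t".join([row[c] for c in kept] + [label]) + "\n")


# Hypothetical input with invented annotator columns and values.
sample = (
    "head\tsense1\tsense2\ttext\trel\tcol\tpos\n"
    "bank\t1\t1\tShe sat on the bank of the river.\tpp_of\triver\t101\n"
    "bank\t2\t2\tThe bank raised interest rates.\tsubject_of\traise\t202\n"
)

out = io.StringIO()
add_predictions(io.StringIO(sample, newline=""), out)
print(out.getvalue())
```

With real data, src and dst would be files opened with encoding="utf-8" and newline="", and the name chosen for new_column is the one later given to the scorer.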
Run the scorer and observe the output:
./bin/scorer ANNOTATED_FILE ANNOTATEDCOLUMN_NAME
To build the program yourself, install Rust using https://rustup.rs/ and then run cargo build --release from the scorer/ directory.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
If you use this repository in your work, please cite it! A ready-made BibTeX citation record is available in the CITATION.bib file.
Your citation helps acknowledge the effort put into developing this resource and assists others in locating and using it effectively. Thank you!