Skip to content

A small corpus of sentences with language learning level (CEFR scale) annotations.

License

Notifications You must be signed in to change notification settings

IldikoPilan/sent_cefr

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 

Repository files navigation

sent_cefr

A small corpus of 375 Swedish sentences with second language (L2) learning level annotations. The levels follow the scale of the Common European Framework of Reference for Languages (CEFR https://www.coe.int/en/web/common-european-framework-reference-languages/level-descriptions).

Sentences numbered 1-285 are corpus examples collected from Korp from generic Swedish corpora (e.g. newspaper texts and novels). The CEFR level annotations are derived from the user evaluation of a corpus example selection system, HitEx (https://spraakbanken.gu.se/larka/hitex), where L2 Swedish teachers evaluated the sentences and their atomatically assigned CEFR levels. For more information about the user evaluation see the following paper: http://www.atala.org/content/candidate-sentence-selection-language-learning-exercises-comprehensive-framework-empirical.

The sentences included in this dataset have been filtered for their well-formedness, independence from the rest of their textual context and some additional lexical and structural criteria (e.g. abbreviations, interrogative form) using HitEx. From the original evaluation data only those sentences have been included that: (i) that were found overall suitable (with an evaluation score >= 2.5 out of 4); and (ii) where a majority of teachers agreed with the CEFR level assigned automatically by our system.

Sentences 286-375 are individually occurring sentences collected from the COCTAILL corpus (https://spraakbanken.gu.se/eng/resource/coctaill) consisting of L2 Swedish coursebook texts.

Number of sentences per CEFR level in the resource:

A1 8 + 90 (from COCTAILL)

A2 81

B1 59

B2 92

C1 45

The resource can be easily equipped with a wide variety of linguistic annotations using the Sparv (https://spraakbanken.gu.se/sparv/) pipeline both via a GUI or an API.

About

A small corpus of sentences with language learning level (CEFR scale) annotations.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published