This repository contains the dataset accompanying the following paper, accepted at COLING 2025:
"A Testset for Context-Aware LLM Translation in Korean-to-English Discourse Level Translation"
Minjae Lee¹, Youngbin Noh², Seung-Jin Lee²
¹ Korea University
² NCSOFT
This dataset is designed to evaluate Korean-to-English discourse-level translation capabilities of Large Language Models (LLMs) and Neural Machine Translation (NMT) systems. It consists of 600 manually constructed text instances that highlight six linguistic phenomena requiring contextual inference beyond the sentence level.
Unlike sentence-level datasets, this dataset emphasizes inter-sentential (cross-sentence) context that is necessary to resolve ambiguities and accurately translate challenging linguistic phenomena.
- Lexical Ambiguity: Words with multiple meanings that need context across sentences to resolve correctly.
- Zero Anaphora: Omitted subjects, objects, or complements that require inference from preceding or following sentences.
- Slang: Informal expressions where inter-sentential context helps in understanding the tone and meaning.
- Idiom: Phrases with non-literal meanings that can only be accurately translated by considering the broader discourse.
- Figurative Language: Metaphors or expressions requiring context from surrounding sentences to be properly interpreted.
- Implicature: Implied meanings where the intention becomes clear only when considering the entire conversation.
korean-english-context-aware-translation-dataset/
├── README.md
├── LICENSE
└── dataset/ # Dataset files for each linguistic phenomenon
    ├── lexical_ambiguity.jsonl
    ├── zero_anaphora.jsonl
    ├── slang.jsonl
    ├── idiom.jsonl
    ├── figurative_language.jsonl
    └── implicature.jsonl
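
The files listed above can be read with a few lines of Python. The following is a minimal sketch, assuming each `.jsonl` file follows the usual JSON Lines convention of one JSON object per line. The record fields are not documented in this README, so the sketch does not assume any field names. `load_jsonl` and `load_dataset` are hypothetical helper names, not part of this repository.

```python
import json
from pathlib import Path

# File names taken from the directory listing above.
PHENOMENA = [
    "lexical_ambiguity",
    "zero_anaphora",
    "slang",
    "idiom",
    "figurative_language",
    "implicature",
]


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_dataset(root="dataset"):
    """Return {phenomenon: records} for every expected file found under root."""
    root = Path(root)
    data = {}
    for name in PHENOMENA:
        path = root / f"{name}.jsonl"
        if path.exists():  # tolerate a partial checkout
            data[name] = load_jsonl(path)
    return data


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for name, records in load_dataset().items():
        print(f"{name}: {len(records)} instances")
```

Run from the repository root, this prints one count per phenomenon. Summed over the six files, the counts should match the 600 instances stated above.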
This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
Copyright (c) 2025 NCSOFT. All rights reserved.