Korean-English Context-Aware Translation Challenge Dataset

This repository contains the dataset accompanying the following paper, accepted at COLING 2025:

"A Testset for Context-Aware LLM Translation in Korean-to-English Discourse Level Translation"
Minjae Lee¹, Youngbin Noh², Seung-Jin Lee²
¹ Korea University
² NCSOFT

📊 Dataset Overview

This dataset is designed to evaluate Korean-to-English discourse-level translation capabilities of Large Language Models (LLMs) and Neural Machine Translation (NMT) systems. It consists of 600 manually constructed text instances that highlight six linguistic phenomena requiring contextual inference beyond the sentence level.

Unlike sentence-level datasets, this dataset emphasizes inter-sentential (cross-sentence) context that is necessary to resolve ambiguities and accurately translate challenging linguistic phenomena.

🧩 Linguistic Phenomena Covered

Lexical Ambiguity: Words with multiple meanings that need context across sentences to resolve correctly.
Zero Anaphora: Omitted subjects, objects, or complements that require inference from preceding or following sentences.
Slang: Informal expressions where inter-sentential context helps in understanding the tone and meaning.
Idiom: Phrases with non-literal meanings that can only be accurately translated by considering the broader discourse.
Figurative Language: Metaphors or expressions requiring context from surrounding sentences to be properly interpreted.
Implicature: Implied meanings where the intention becomes clear only when considering the entire conversation.

📂 File Structure

korean-english-context-aware-translation-dataset/
├── README.md
├── LICENSE
└── dataset/                        # Dataset files for each linguistic phenomenon
    ├── lexical_ambiguity.jsonl
    ├── zero_anaphora.jsonl
    ├── slang.jsonl
    ├── idiom.jsonl
    ├── figurative_language.jsonl
    └── implicature.jsonl
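
The files above use the .jsonl extension, so a minimal loading sketch in Python might look like the following. It assumes each non-empty line holds one JSON object and that the script runs from the repository root. The helper names `load_jsonl` and `load_dataset` are illustrative and not part of this repository, and because the record fields are not documented here, the script only prints whatever keys it finds.

```python
import json
from pathlib import Path

# File stems taken from the File Structure listing above.
PHENOMENA = [
    "lexical_ambiguity",
    "zero_anaphora",
    "slang",
    "idiom",
    "figurative_language",
    "implicature",
]


def load_jsonl(path):
    """Read one JSON value per non-empty line (assumed JSON Lines layout)."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_dataset(dataset_dir="dataset"):
    """Return {phenomenon: [records]} for every expected file that exists."""
    root = Path(dataset_dir)
    data = {}
    for name in PHENOMENA:
        path = root / f"{name}.jsonl"
        if path.exists():
            data[name] = load_jsonl(path)
    return data


if __name__ == "__main__":
    for name, records in load_dataset().items():
        # Field names are an unknown here, so only report what is present.
        first = records[0] if records else {}
        keys = sorted(first.keys()) if isinstance(first, dict) else []
        print(f"{name}: {len(records)} records, fields: {keys}")
```

Keeping the phenomena in separate dictionary entries makes it straightforward to score a translation system per phenomenon rather than only in aggregate.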

License

This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
