This is release 1.0 of the Universal Proposition Banks. It is built upon release 1.4 of the Universal Dependency Treebanks and inherits their licence. We use the frame and role labels from the English Proposition Bank version 3.0.
News (02/10/2017): Initial version of Italian UP released!
News (01/31/2017): Initial versions of Finnish, Portuguese and Spanish UP released!
This release contains propbanks for the following languages:
- Chinese UP - Inherits license CC BY-NC-SA 3.0 US from the Chinese Universal Treebank
- Finnish UP - Inherits license CC BY-NC-SA 3.0 US from the Finnish Universal Treebank
- French UP - Inherits license CC BY-NC-SA 3.0 US from the French Universal Treebank
- German UP - Inherits license CC BY-NC-SA 3.0 US from the German Universal Treebank
- Italian UP - Inherits license CC BY-NC-SA 3.0 US from the Italian Universal Treebank
- Portuguese UP - Inherits license CC BY-NC-SA 3.0 US from the Portuguese Universal Treebank
- Spanish UP - Inherits license CC BY-NC-SA 3.0 US from the Spanish Universal Treebank

Using this data, we can create SRL systems that predict English PropBank labels for many different languages. See a recent demo screencast of this SRL for English, French and German here.
This project aims to annotate text in different languages with a layer of "universal" semantic role labeling annotation. For this purpose, we use the frame and role labels of the English Proposition Bank to label shallow semantics in sentences in new target languages.
For instance, consider the German sentence "Seine Arbeit wird von ehrenamtlichen Helfern und Regionalgruppen des Vereins unterstützt" (His work is supported by volunteers and regional groupings of the association). In CoNLL format, it looks like this, with English PropBank labels in the last two columns:
Id | Form | POS | HeadId | Deprel | Frame | Role |
---|---|---|---|---|---|---|
1 | Seine | DET | 2 | det:poss | _ | _ |
2 | Arbeit | NOUN | 11 | nsubjpass | _ | A1 |
3 | wird | AUX | 11 | auxpass | _ | _ |
4 | von | ADP | 6 | case | _ | _ |
5 | ehrenamtlichen | ADJ | 6 | amod | _ | _ |
6 | Helfern | NOUN | 11 | nmod | _ | A0 |
7 | und | CONJ | 6 | cc | _ | _ |
8 | Regionalgruppen | NOUN | 6 | conj | _ | _ |
9 | des | DET | 10 | det | _ | _ |
10 | Vereins | NOUN | 8 | nmod | _ | _ |
11 | unterstützt | VERB | 0 | root | support.01 | _ |
12 | . | PUNCT | 11 | punct | _ | _ |
The German verb 'unterstützt' is labeled as evoking the 'support.01' frame with two roles: "Seine Arbeit" (his work) is labeled A1 (project being supported) and "ehrenamtlichen Helfern und Regionalgruppen des Vereins" (volunteers and regional groupings of the association) is labeled A0 (the helper).
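
The reading above can be reproduced programmatically. The following is a minimal sketch, not part of the release tooling: it copies the rows of the example table into Python tuples, finds the frame-evoking token, and expands each role-bearing head token into a phrase by collecting its dependency subtree. Two things are assumptions made for illustration: that a sentence has a single predicate (as in this example), and that a preposition attached to the argument head with the `case` relation (here "von") is left out of the phrase, which matches the spans quoted above. The function names are hypothetical.

```python
from collections import defaultdict

# Rows copied from the example table: (id, form, pos, head_id, deprel, frame, role)
ROWS = [
    (1, "Seine", "DET", 2, "det:poss", "_", "_"),
    (2, "Arbeit", "NOUN", 11, "nsubjpass", "_", "A1"),
    (3, "wird", "AUX", 11, "auxpass", "_", "_"),
    (4, "von", "ADP", 6, "case", "_", "_"),
    (5, "ehrenamtlichen", "ADJ", 6, "amod", "_", "_"),
    (6, "Helfern", "NOUN", 11, "nmod", "_", "A0"),
    (7, "und", "CONJ", 6, "cc", "_", "_"),
    (8, "Regionalgruppen", "NOUN", 6, "conj", "_", "_"),
    (9, "des", "DET", 10, "det", "_", "_"),
    (10, "Vereins", "NOUN", 8, "nmod", "_", "_"),
    (11, "unterstützt", "VERB", 0, "root", "support.01", "_"),
    (12, ".", "PUNCT", 11, "punct", "_", "_"),
]


def subtree_ids(rows, head_id, skip_deprels=("case",)):
    """Return the sorted token ids of the dependency subtree rooted at head_id.

    Direct dependents of head_id whose relation is in skip_deprels are dropped
    (an assumption of this sketch, so that "von" is not part of the A0 phrase).
    """
    children = defaultdict(list)
    deprel = {}
    for tid, _form, _pos, head, rel, _frame, _role in rows:
        children[head].append(tid)
        deprel[tid] = rel

    ids, stack = [], [head_id]
    while stack:
        cur = stack.pop()
        ids.append(cur)
        for child in children[cur]:
            if cur == head_id and deprel[child] in skip_deprels:
                continue
            stack.append(child)
    return sorted(ids)


def read_proposition(rows):
    """Return ((verb form, frame), {role: phrase}) for a single-predicate sentence."""
    form = {row[0]: row[1] for row in rows}
    predicate = next((row[1], row[5]) for row in rows if row[5] != "_")
    arguments = {
        row[6]: " ".join(form[i] for i in subtree_ids(rows, row[0]))
        for row in rows
        if row[6] != "_"
    }
    return predicate, arguments


predicate, arguments = read_proposition(ROWS)
print(predicate)
for role, phrase in arguments.items():
    print(role, "=", phrase)
```

Role labels sit on the syntactic head of each argument, so the phrase boundaries come from the dependency tree rather than from the annotation itself.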
The universal propbank (UP) for each language consists of three files in CoNLL-U format (one each for training, development and test data). In addition, each language has a folder with verb overview files in HTML format. These files can be viewed in a browser and give an overview of all English frames that each target-language verb can evoke.
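
As a rough sketch of how such a file could be loaded, the reader below relies only on general CoNLL-U conventions: one token per line with tab-separated columns, comment lines starting with `#`, and a blank line between sentences. It makes no assumption about which column holds the frame or role labels, so check the released files for the actual column layout. The function name is hypothetical, and the in-memory sample is a truncated toy with three columns, not real UP data.

```python
import io


def read_conll_sentences(stream):
    """Yield sentences one at a time, each as a list of rows of column strings."""
    rows = []
    for line in stream:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if rows:
                yield rows
                rows = []
            continue
        if line.startswith("#"):
            # Sentence-level comment line; skipped in this sketch.
            continue
        rows.append(line.split("\t"))
    if rows:
        # The last sentence may not be followed by a blank line.
        yield rows


# Toy input with only three columns per token, for illustration.
sample = io.StringIO(
    "# sent_id = toy-1\n"
    "1\tArbeit\tNOUN\n"
    "2\tunterstützt\tVERB\n"
    "\n"
    "1\tHelfern\tNOUN\n"
)
sentences = list(read_conll_sentences(sample))
print(len(sentences), "sentences read")
```

To read a file from disk, pass an open file handle instead of the `StringIO` object, for example `open(path, encoding="utf-8")`.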
Our current focus is to annotate all target language verbs with appropriate English frames. This means that the scope of frame-evoking elements is currently limited to verbs. We also do not label target language auxiliary verbs. For each universal propbank, about 90% of all verbs are currently labeled. Unlabeled verbs often either convey semantics for which we could not find an appropriate English verb, or are part of complex verb constructions that we currently do not handle.
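
The coverage figure can be checked with a few lines of code. This is a minimal sketch under two assumptions: the simplified seven-column layout of the example table (Id, Form, POS, HeadId, Deprel, Frame, Role), and that auxiliaries carry the POS tag `AUX` so they can be excluded by counting only `VERB` tokens. The function name is hypothetical, and the two short sentences are toy input, not data from the release.

```python
def verb_coverage(sentences):
    """Return the fraction of VERB tokens that carry a frame label.

    Tokens tagged AUX are not counted, since auxiliaries are not labeled.
    """
    total = labeled = 0
    for rows in sentences:
        for row in rows:
            pos, frame = row[2], row[5]
            if pos != "VERB":
                continue
            total += 1
            if frame != "_":
                labeled += 1
    return labeled / total if total else 0.0


# Toy input: one labeled verb, one auxiliary, one unlabeled verb.
toy_sentences = [
    [
        (1, "wird", "AUX", 2, "auxpass", "_", "_"),
        (2, "unterstützt", "VERB", 0, "root", "support.01", "_"),
    ],
    [
        (1, "placeholder", "VERB", 0, "root", "_", "_"),
    ],
]
coverage = verb_coverage(toy_sentences)
print(f"{coverage:.0%} of verbs are labeled")
```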
This is an ongoing research project in which we use a combination of data-driven methods and some post-processing to generate these resources. As a result, most labels in the UPs are predicted by models trained on a different domain, which affects their quality. A good example is the German verb "angeben": in our source data it was mostly used in the "brag.01" sense, whereas in the German UD data it is mostly used in the "report.01" sense, which is almost never detected.
We are improving the project along three lines: (1) We are working on adding new languages to the current release. (2) We are working to curate the data to improve the quality of SRL annotation. (3) We are looking into extending the scope of frame-evoking elements to other types of predicates besides verbs.
Generating High Quality Proposition Banks for Multilingual Semantic Role Labeling. Alan Akbik, Laura Chiticariu, Marina Danilevsky, Yunyao Li, Shivakumar Vaithyanathan and Huaiyu Zhu. 53rd Annual Meeting of the Association for Computational Linguistics ACL 2015.
Polyglot: Multilingual Semantic Role Labeling with Unified Labels. Alan Akbik and Yunyao Li. 54th Annual Meeting of the Association for Computational Linguistics ACL 2016.
Towards Semi-Automatic Generation of Proposition Banks for Low-Resource Languages. Alan Akbik, Vishwajeet Kumar and Yunyao Li. 2016 Conference on Empirical Methods in Natural Language Processing EMNLP 2016.
Multilingual Information Extraction with PolyglotIE. Alan Akbik, Laura Chiticariu, Marina Danilevsky, Yonas Kbrom, Yunyao Li and Huaiyu Zhu. 26th International Conference on Computational Linguistics COLING 2016.
K-SRL: Instance-based Learning for Semantic Role Labeling. Alan Akbik and Yunyao Li. 26th International Conference on Computational Linguistics COLING 2016.
Multilingual Aliasing for Auto-Generating Proposition Banks. Alan Akbik, Xinyu Guan and Yunyao Li. 26th International Conference on Computational Linguistics COLING 2016.
Please email your questions or comments to Yunyao Li and Laura Chiticariu.
- Alan Akbik, Zalando, Germany
- Laura Chiticariu, IBM Research - Almaden
- Marina Danilevsky, IBM Research - Almaden
- Yunyao Li, IBM Research - Almaden
- Chenguang (Ray) Wang, IBM Research - Almaden
- Huaiyu Zhu, IBM Research - Almaden
- Tomer Mahlin, IBM Systems Division, Israel
- Alexandre Rademaker, IBM Research - Brazil