The data files are provided gzipped in the JSON Lines format, which facilitates streaming (in cases where order doesn't matter) and minimizes the data footprint. While the compressed files can be read directly, on a UNIX system you may run

```bash
find data -name '*.jsonl.gz' -type f -exec sh -c 'gunzip -c "$0" > "$(dirname "$0")/$(basename "$0" .gz)"' '{}' \;
```

from this repository's base directory to decompress all of the data. In each file, each line is a JSON object containing all verb annotations for a sentence. The sentences in each file are ordered alphabetically by their unique string sentence ID.
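For example, a minimal sketch of streaming sentences from one of the compressed files in Python; the file path below is assumed for illustration and should be adjusted to your copy of the data:

```python
import gzip
import json

# Stream sentences from a gzipped JSON Lines file, one JSON object per line.
# The path is assumed for illustration; substitute any of the data files.
path = "data/orig/dev.jsonl.gz"

with gzip.open(path, mode="rt", encoding="utf-8") as f:
    for line in f:
        sentence = json.loads(line)
        print(sentence["sentenceId"], len(sentence["verbEntries"]), "verbs")
```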
The structure of the JSON object on each line is a `Sentence` object defined as follows, where `Array` denotes a JSON array, `Map` denotes a JSON object, and `Set` denotes a JSON array with unique elements.
```
Sentence ::= {
  sentenceId : SentenceId,
  sentenceTokens : Array[Token],
  verbEntries : Map[Index, VerbEntry]
}

VerbEntry ::= {
  verbIndex : Index,
  verbInflectedForms : {
    stem : LowerCaseString,
    presentSingular3rd : LowerCaseString,
    presentParticiple : LowerCaseString,
    past : LowerCaseString,
    pastParticiple : LowerCaseString
  },
  questionLabels : Map[QuestionString, QuestionLabel]
}

QuestionLabel ::= {
  questionString : QuestionString,
  questionSources : Set[QuestionSource],
  answerJudgments : Set[AnswerJudgment],
  questionSlots : {
    wh : LowerCaseString,
    aux : LowerCaseString,
    subj : LowerCaseString,
    verb : LowerCaseString,
    obj : LowerCaseString,
    prep : LowerCaseString,
    obj2 : LowerCaseString
  },
  tense : LowerCaseString,
  isPerfect : Boolean,
  isProgressive : Boolean,
  isNegated : Boolean,
  isPassive : Boolean
}

AnswerJudgment ::= {
  sourceId : AnswerSource,
  isValid : Boolean,
  spans : undefined | Set[Span]
}

Span ::= [Index, Index]
```
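As an illustration, a sketch of walking this structure in Python, assuming `sentence` is a single parsed line of one of the data files as in the snippet above:

```python
# Walk a parsed Sentence object: verbEntries is keyed by the verb's token
# index (as a string), and questionLabels is keyed by the question string.
def print_questions(sentence):
    tokens = sentence["sentenceTokens"]
    for verb_index_str, verb_entry in sentence["verbEntries"].items():
        verb_token = tokens[int(verb_index_str)]
        for question_string in verb_entry["questionLabels"]:
            print(f"{verb_token}: {question_string}")
```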
The vast majority of the `orig` and `expanded` data has 3 validation judgments per question, and the vast majority of the `dense` data has 6 validation judgments per question. However, these numbers do vary because of a few mistakes (e.g., accidentally gathering data twice for a question) and limitations of our crowdsourcing pipeline (in a few cases, the same validator may have answered a question multiple times, but we collapse identical answer judgments together). These cases are included for completeness, but are easy to filter out if you need to.
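For instance, if you want to restrict to questions with the expected number of judgments, a sketch along these lines could be used; the expected count (6 for `dense`, 3 for `orig`/`expanded`) is a parameter you would set yourself:

```python
# Keep only question labels that have exactly the expected number of
# answer judgments (e.g., 6 for dense data, 3 for orig/expanded data).
def filter_question_labels(verb_entry, expected_judgments=6):
    return {
        question_string: label
        for question_string, label in verb_entry["questionLabels"].items()
        if len(label["answerJudgments"]) == expected_judgments
    }
```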
The `spans` field of `AnswerJudgment` is undefined if and only if `isValid == false`. A span is represented as a 2-element array of its beginning index (inclusive) and end index (exclusive).
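A minimal sketch of recovering the answer phrases for a question label under these conventions:

```python
# Collect the answer phrases given by valid judgments for one question label.
# Spans are [begin, end) token index pairs into sentenceTokens.
def answer_phrases(sentence_tokens, question_label):
    phrases = []
    for judgment in question_label["answerJudgments"]:
        if judgment["isValid"]:
            for begin, end in judgment["spans"]:
                phrases.append(" ".join(sentence_tokens[begin:end]))
    return phrases
```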
The `verb` slot of `questionSlots` uses an abstracted form of the verb that is common to all questions, with values such as `stem`, `been pastParticiple`, `be presentParticiple`, etc. Replacing the conjugation with the correct form from the verb's `verbInflectedForms` field, concatenating all of the slots that are not `_`, inserting spaces as appropriate, capitalizing the beginning, and appending a question mark will always yield the `questionString` value.
All of the fields below `answerJudgments` in a `QuestionLabel` are automatically and deterministically computed using the question's `sentenceTokens`, `verbInflectedForms`, and `questionString`.
The following two prediction tasks are equivalent:

- Predicting all seven `questionSlots`.
- Predicting all `questionSlots` except `aux` and `verb`, and then predicting the five grammatical fields: `tense`, `isPerfect`, `isProgressive`, `isNegated`, and `isPassive`.
The terminals are defined as follows:

- `SentenceId`: a string with no spaces, unique for each sentence.
- `Token`: a PTB-style token (no spaces).
- `LowerCaseString`: a lower-case string.
- `Index`: a non-negative integer JSON number which is a valid index into `sentenceTokens`; used as a string (i.e., `[1-9][0-9]*`) when indexing into `verbEntries`.
- `QuestionString`: a valid QA-SRL question, with only the first character upper-case, ending in a question mark.
- `QuestionSource`: a string uniquely identifying the writer of a question. Begins with `turk-` if it was written by a turker, and `model-` if it was generated by a model. Model sources may also be listed for questions written by a turker.
- `AnswerSource`: a string denoting the provenance of an answer judgment. Same as `QuestionSource`, but restricted to turkers, since we only record human answer judgments in the data, and optionally appended with a suffix (`-expansion` or `-eval`) denoting the round of data collection in which it was gathered. However, turker indices are shared across data collection runs.
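For example, a small sketch of splitting an `AnswerSource` into its turker identifier and optional collection-round suffix; the exact shape of the identifier beyond the `turk-` prefix is not specified above, so this only peels off the documented suffixes:

```python
# Split an AnswerSource (e.g., a hypothetical "turk-123-eval") into the
# turker identifier and the optional data-collection round ("expansion",
# "eval", or None for the original round).
def parse_answer_source(answer_source):
    for suffix in ("-expansion", "-eval"):
        if answer_source.endswith(suffix):
            return answer_source[: -len(suffix)], suffix[1:]
    return answer_source, None
```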
The prediction task we address in the paper can be phrased as filling in the `questionLabels` field of an otherwise complete `VerbEntry` in a `Sentence`. While we do not explicitly feed the verb's inflected forms into the model, we need the inflected forms in order to transform the model's output (which uses the slot-based format that abstracts out the verb) into the originally written QA-SRL question.
A document-wise index of the data, encoded as a single-line JSON file (`index.json.gz`), is provided for convenience and for associating the sentences with their documents and document titles. You do not need to use this to run models on the QA-SRL Bank, but it is useful when visualizing the data, e.g., in the data browser. It is formatted as follows:

```
Index ::= {
  documents : Map[Partition, Set[DocumentMetadata]],
  denseIds : Set[SentenceId]
}

DocumentMetadata ::= {
  part : Partition,
  idString : DocumentId,
  domain : Domain,
  id : String,
  title : String
}
```
With the following terminals:

- `Partition`: "train", "dev", or "test".
- `Domain`: "wikipedia", "wikinews", or "tqa".
- `DocumentId`: a string uniquely identifying the document. A sentence in the dataset can be verified to be in a document if its `SentenceId` begins with the document's `DocumentId`. It is just a particular string representation of the pair (`domain`, `id`) for a document.
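As an illustration, a sketch of loading the index and mapping a sentence to its document metadata via the `DocumentId` prefix; the path `data/index.json.gz` is assumed for illustration:

```python
import gzip
import json

# Load the single-line document index; the path is assumed for illustration.
with gzip.open("data/index.json.gz", mode="rt", encoding="utf-8") as f:
    index = json.load(f)

# Find the document metadata whose DocumentId is a prefix of a SentenceId.
def find_document(index, sentence_id):
    for partition, documents in index["documents"].items():
        for doc in documents:
            if sentence_id.startswith(doc["idString"]):
                return doc
    return None
```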