QA-SRL Bank 2.0 Data Format

The data files are provided gzipped in the JSON Lines format, to facilitate streaming (in cases where order doesn't matter) and minimize the data footprint. While the compressed files can be read directly, on a UNIX system you may run

find data -name '*.jsonl.gz' -type f -exec sh -c 'gunzip -c "$0" > "$(dirname "$0")/$(basename "$0" .gz)"' '{}' \;

from this repository's base directory to decompress all of the data. In each file, each line is a JSON object containing all verb annotations for a sentence. The sentences are ordered alphabetically in the file by their unique string sentence ID.
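
Since the files are JSON Lines, they can also be streamed straight from the compressed form, e.g. in Python. The following is a minimal sketch: the file path is hypothetical (substitute any of the .jsonl.gz data files), and read_sentences is just an illustrative helper name.

import gzip
import json

def read_sentences(path):
    # Yield one Sentence object (a Python dict) per line of a gzipped JSON Lines file.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Hypothetical path; point this at whichever data file you want to read.
for sentence in read_sentences("data/orig/train.jsonl.gz"):
    print(sentence["sentenceId"], len(sentence["verbEntries"]), "verb entries")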

The structure of the JSON object on each line is a Sentence object defined as follows, where Array denotes a JSON array, Map denotes a JSON object, and Set denotes a JSON array with unique elements.

Sentence ::= {
  sentenceId : SentenceId,
  sentenceTokens : Array[Token],
  verbEntries : Map[Index, VerbEntry]
}

VerbEntry ::= {
  verbIndex : Index,
  verbInflectedForms : {
    stem : LowerCaseString,
    presentSingular3rd : LowerCaseString,
    presentParticiple : LowerCaseString,
    past : LowerCaseString,
    pastParticiple : LowerCaseString
  },
  questionLabels : Map[QuestionString, QuestionLabel]
}

QuestionLabel ::= {
  questionString : QuestionString,
  questionSources : Set[QuestionSource],
  answerJudgments : Set[AnswerJudgment],
  questionSlots : {
    wh : LowerCaseString,
    aux : LowerCaseString,
    subj : LowerCaseString,
    verb : LowerCaseString,
    obj : LowerCaseString,
    prep : LowerCaseString,
    obj2 : LowerCaseString
  },
  tense: LowerCaseString,
  isPerfect: Boolean,
  isProgressive: Boolean,
  isNegated: Boolean,
  isPassive: Boolean
}

AnswerJudgment ::= {
  sourceId : AnswerSource,
  isValid : Boolean,
  spans : undefined | Set[Span]
}

Span ::= [Index, Index]

The vast majority of the orig and expanded data has 3 validation judgments per question, and the vast majority of the dense data has 6 validation judgments per question. However, these numbers do vary because of a few mistakes (e.g., accidentally gathering data twice for a question) and limitations of our crowdsourcing pipeline (in a few cases the same validator may have answered a question multiple times, and identical answer judgments from the same validator are collapsed together). These cases are included for completeness, but they are easy to filter out if you need to.

The spans field of AnswerJudgment is undefined if and only if isValid == false. A span is represented as a 2-element array of its beginning index (inclusive) and end index (exclusive).
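
For example, the nesting of Sentence, VerbEntry, QuestionLabel, and AnswerJudgment can be walked as in the sketch below (assuming sentence is a dict parsed as in the earlier reading example; print_answers is an illustrative name).

def print_answers(sentence):
    # Walk verbEntries -> questionLabels -> answerJudgments and print each answer
    # phrase; spans are [begin, end) index pairs into sentenceTokens.
    tokens = sentence["sentenceTokens"]
    for verb_entry in sentence["verbEntries"].values():
        for question, label in verb_entry["questionLabels"].items():
            for judgment in label["answerJudgments"]:
                if not judgment["isValid"]:
                    continue  # invalid judgments carry no spans field
                for begin, end in judgment["spans"]:
                    print(question, "->", " ".join(tokens[begin:end]))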

The verb slot of questionSlots uses an abstracted form of the verb that is common to all questions, with values such as stem, been pastParticiple, be presentParticiple, etc. Replacing the abstract form with the corresponding conjugation from the verb's verbInflectedForms field, concatenating all of the slots that are not _ (inserting spaces as appropriate), capitalizing the first character, and appending a question mark will always yield the questionString value.
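
A sketch of that reconstruction, assuming the abstract verb slot is a space-separated sequence whose last word names a key of verbInflectedForms (as in been pastParticiple), and that the slots are concatenated in the order listed in the grammar above:

SLOT_ORDER = ["wh", "aux", "subj", "verb", "obj", "prep", "obj2"]

def reconstruct_question(question_slots, verb_inflected_forms):
    # Rebuild questionString: conjugate the abstract verb slot, drop "_" slots,
    # join with spaces, capitalize the first character, and add a question mark.
    words = []
    for slot in SLOT_ORDER:
        value = question_slots[slot]
        if value == "_":
            continue
        if slot == "verb":
            *rest, inflection = value.split(" ")
            value = " ".join(rest + [verb_inflected_forms[inflection]])
        words.append(value)
    question = " ".join(words)
    return question[0].upper() + question[1:] + "?"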

All of the fields below answerJudgments in a QuestionLabel are automatically and deterministically computed from the sentence's sentenceTokens, the verb's verbInflectedForms, and the questionString.

The following two prediction tasks are equivalent:

  • Predicting all seven questionSlots
  • Predicting all questionSlots except aux and verb, and then predicting the five grammatical fields tense, isPerfect, isProgressive, isNegated, and isPassive.

The terminals are defined as follows:

  • SentenceId: a string with no spaces, unique for each sentence.
  • Token: a PTB-style token (no spaces).
  • LowerCaseString: a lower-case string.
  • Index: a non-negative integer JSON number which is a valid index into sentenceTokens; used as a string (i.e., [1-9][0-9]*) when indexing into verbEntries.
  • QuestionString: a valid QA-SRL question, with only the first character upper-case, ending in a question mark.
  • QuestionSource: a string uniquely identifying the writer of a question. Begins with turk- if it was written by a turker, and model- if it was generated by a model.
  • AnswerSource: a string denoting the provenance of an answer judgment. It has the same format as QuestionSource, but is restricted to turkers, since we only record human answer judgments in the data; it may also be appended with a suffix (-expansion or -eval) denoting the round of data collection in which the judgment was gathered. However, turker indices are shared across data collection runs.
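
As an example of these conventions, an AnswerSource string can be decomposed as in the sketch below; treating the absence of a suffix as the original collection round is an assumption of the sketch, not something the format specifies.

def parse_answer_source(source):
    # Split off an optional "-expansion" / "-eval" suffix, then the "turk-" prefix.
    round_name = "orig"  # assumed label for sources without a suffix
    for suffix in ("-expansion", "-eval"):
        if source.endswith(suffix):
            round_name = suffix[1:]
            source = source[: -len(suffix)]
            break
    assert source.startswith("turk-"), "answer judgments are only recorded from turkers"
    return source[len("turk-"):], round_name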

The prediction task we do in the paper can be phrased as filling in the questionLabels field of an otherwise complete VerbEntry in a Sentence. While we do not explicitly feed the verb's inflected forms into the model, we need the inflected forms in order to transform the model's output (which uses the slot-based format that abstracts out the verb) into the originally written QA-SRL question.

Data Index

A document-wise index of the data, encoded as a single-line JSON file (index.json.gz), is provided for convenience and for associating the sentences with their documents and document titles. You do not need to use this to run models on the QA-SRL Bank, but it is useful when visualizing the data, e.g., in the data browser. It is formatted as follows:

  Index ::= {
    documents: Map[Partition, Set[DocumentMetadata]],
    denseIds: Set[SentenceId]
  }

  DocumentMetadata ::= {
    part: Partition,
    idString: DocumentId,
    domain: Domain,
    id: String,
    title: String
  }

With the following terminals:

  • Partition: "train", "dev", or "test".
  • Domain: "wikipedia", "wikinews", or "tqa".
  • DocumentId: a string uniquely identifying the document; it is just a particular string representation of the pair (domain, id) for a document. A sentence in the dataset can be verified to belong to a document by checking whether its SentenceId begins with the document's DocumentId.
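
For instance, the index can be used to associate sentences with their documents via the prefix convention above. In this sketch the path to index.json.gz is hypothetical, and the helper names are illustrative.

import gzip
import json

def load_index(path):
    # The index is a single gzipped JSON object (not JSON Lines).
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)

def document_of(sentence_id, index):
    # Return the metadata of the document containing a sentence, if any, using
    # the fact that a SentenceId begins with its document's DocumentId.
    for documents in index["documents"].values():
        for doc in documents:
            if sentence_id.startswith(doc["idString"]):
                return doc
    return None

index = load_index("data/index.json.gz")  # hypothetical location of index.json.gz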