Before diving into the details of this doc, we strongly recommend that you familiarize yourself with some important concepts about system analysis.
In this file we describe how to analyze multiple-choice QA models. We will give an example using the fig_qa dataset, but other datasets can be analyzed in a similar way.
There are two options for the dataset format:

- (1) `datalab`: if your dataset is already supported by datalab, you fortunately don't need to prepare the dataset.
- (2) `json`: basically, a list of dictionaries with four keys: `context`, `options`, `question`, and `answers`, for example:
```json
[
  {"context": "The girl had the flightiness of a sparrow", "question": "", "answers": {"text": "The girl was very fickle.", "option_index": 0}, "options": ["The girl was very fickle.", "The girl was very stable."]},
  {"context": "The girl had the flightiness of a rock", "question": "", "answers": {"text": "The girl was very stable.", "option_index": 1}, "options": ["The girl was very fickle.", "The girl was very stable."]},
  ...
]
```
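If your dataset is not covered by datalab, you will need to serialize it into this format yourself. Below is a minimal sketch of how one might do this in Python; the `raw_examples` variable and the output filename are hypothetical and not part of ExplainaBoard.

```python
import json

# Hypothetical in-memory data: (context, options, gold option index).
raw_examples = [
    ("The girl had the flightiness of a sparrow",
     ["The girl was very fickle.", "The girl was very stable."], 0),
    ("The girl had the flightiness of a rock",
     ["The girl was very fickle.", "The girl was very stable."], 1),
]

dataset = []
for context, options, gold_index in raw_examples:
    dataset.append({
        "context": context,
        "question": "",  # fig_qa has no separate question, so this stays empty
        "answers": {"text": options[gold_index], "option_index": gold_index},
        "options": options,
    })

with open("fig_qa_dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
```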
In order to analyze your system's results, they should be in the following JSON format:
```json
[
  {
    "context": "The girl was as down-to-earth as a Michelin-starred canape",
    "question": "",
    "answers": {
      "text": "The girl was not down-to-earth at all.",
      "option_index": 0
    },
    "options": [
      "The girl was not down-to-earth at all.",
      "The girl was very down-to-earth."
    ],
    "predicted_answers": {
      "text": "The girl was not down-to-earth at all.",
      "option_index": 0
    }
  },
  ...
]
```
where

- `context` represents the text providing context information
- `question` represents the question, which may be empty in some scenarios
- `options` is a list of strings denoting all potential options
- `answers` is a dictionary with two elements:
  - `text`: the true answer text
  - `option_index`: the index of the true answer in `options`
- `predicted_answers` is a dictionary with two elements:
  - `text`: the predicted answer text
  - `option_index`: the index of the predicted answer in `options`
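One way to produce such a file is to take the dataset JSON from above and attach a `predicted_answers` entry to each example. Here is a minimal sketch, where `predict` is a hypothetical stand-in for your model's inference function:

```python
import json

# Hypothetical stand-in for your model; replace with real inference code.
def predict(context, question, options):
    return 0  # always picks the first option, for illustration only

with open("fig_qa_dataset.json") as f:  # the dataset file from the sketch above
    examples = json.load(f)

for example in examples:
    idx = predict(example["context"], example["question"], example["options"])
    example["predicted_answers"] = {
        "text": example["options"][idx],
        "option_index": idx,
    }

with open("gpt2.json", "w") as f:
    json.dump(examples, f, indent=2)
```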
Let's say we have several such files, e.g. `gpt2.json`, from different systems.
To perform a basic analysis, we can run the following command:
```shell
explainaboard --task qa-multiple-choice --system-outputs ./data/system_outputs/fig_qa/gpt2.json > report.json
```
where

- `--task`: denotes the task name; you can find all supported task names here
- `--system-outputs`: denotes the path(s) of the system outputs; multiple paths should be separated by spaces, for example `system1 system2`
- `--dataset`: optional, denotes the dataset name
- `report.json`: the generated analysis file in JSON format. You can find the file here. Tip: use a JSON viewer like this one for easier interpretation.
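If you prefer the terminal to a browser-based JSON viewer, a short Python snippet can pretty-print the report (assuming it was written to `report.json` as in the command above):

```python
import json

# Load and pretty-print the generated analysis report.
with open("report.json") as f:
    report = json.load(f)
print(json.dumps(report, indent=2, ensure_ascii=False))
```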
Now let's look at the results to see what sort of interesting insights we can glean from them.
TODO: add insights
One can also perform pairwise analysis:
```shell
explainaboard --task qa-multiple-choice --system-outputs model_1 model_2 > report.json
```
where the two system outputs are fed in, separated by a space.

- `report.json`: the generated analysis file in JSON format, whose schema is similar to that of the single-system evaluation above, except that all performance values are differences between the two systems (system 1 minus system 2).
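For example, to compare the `gpt2.json` output from above against a second system (here a hypothetical `bert.json`, standing in for any other system output in the same format):

```shell
explainaboard --task qa-multiple-choice --system-outputs ./data/system_outputs/fig_qa/gpt2.json ./data/system_outputs/fig_qa/bert.json > report.json
```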