Below is a (mostly complete) list of tasks that ExplainaBoard currently supports, along with examples of how to analyze each one. In particular, text classification is a good example to start with.
General notes:
- Click the link on a task name for more details; where no link exists, you can open the example data to see what the file format looks like.
- You can either analyze an existing dataset included in DataLab or use your own custom dataset. The directions below describe how to do both in most cases, but using DataLab has some advantages, such as allowing for easy calculation of training-set features and compatibility with ExplainaBoard online leaderboards. You can check the list of datasets supported in DataLab and add your dataset if it doesn't exist.
- All of the examples below output a JSON report to standard out, which you can pipe to a file such as `report.json` for later use (see the example just below this list). Also, check out our visualization tools.
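A minimal sketch of this workflow, reusing the text classification command from the first task below (`python -m json.tool` is just one way to pretty-print the resulting JSON):

```
# Save the analysis report to a file for later use
explainaboard --task text-classification --dataset sst2 --system-outputs ./data/system_outputs/sst2/sst2-lstm-output.txt > report.json
# Pretty-print the report; json.tool ships with Python, but any JSON viewer works
python -m json.tool report.json
```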
We welcome contributions of more tasks, or detailed documentation for tasks where the documentation does not yet exist! Please open an issue or file a PR.
- Text Classification
- Text Pair Classification
- Conditional Text Generation
- Language Modeling
- Named Entity Recognition
- Word Segmentation
- Chunking
- Extractive QA
- Multiple Choice QA
- Hybrid Table Text QA
- Open-Domain QA
- Aspect-based Sentiment Classification
- KG Link Tail Prediction
- Multiple choice Cloze
- Generative Cloze
- Grammatical Error Correction
- Tabular Classification
- Tabular Regression
- Argument Pair Extraction
- Argument Pair Identification
- Meta-Evaluation for NLG
Text classification consists of classifying text into different categories, such as sentiment values or topics. The below example performs an analysis on the Stanford Sentiment Treebank, a dataset of sentiment labels over English movie reviews.
The below example loads the `sst2` dataset from DataLab:

```
explainaboard --task text-classification --dataset sst2 --system-outputs ./data/system_outputs/sst2/sst2-lstm-output.txt
```
The below example loads a dataset from an existing file:

```
explainaboard --task text-classification --custom-dataset-paths ./data/system_outputs/sst2/sst2-dataset.tsv --system-outputs ./data/system_outputs/sst2/sst2-lstm-output.txt
```
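As a rough sketch of the file formats (the values below are illustrative, not real data; open the example files above for the authoritative layout), the custom dataset TSV holds one example per line with the text and its true label separated by a tab, and the system output file holds one predicted label per line:

```
# sst2-dataset.tsv: text <TAB> true label (illustrative line)
a charming and often affecting journey .	positive
# sst2-lstm-output.txt: one predicted label per line (illustrative)
positive
```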
Classification of pairs of text, such as natural language inference or paraphrase detection. The example below concerns natural language inference, predicting whether a premise entails, contradicts, or is neutral with respect to a hypothesis, on the Stanford Natural Language Inference dataset.
The below example loads the `snli` dataset from DataLab:

```
explainaboard --task text-pair-classification --dataset snli --system-outputs ./data/system_outputs/snli/snli-roberta-output.txt
```
The below example loads a dataset from an existing file:

```
explainaboard --task text-pair-classification --custom-dataset-paths ./data/system_outputs/snli/snli-dataset.tsv --system-outputs ./data/system_outputs/snli/snli-roberta-output.txt
```
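The pair classification formats are analogous; as an illustrative sketch (again, check the example files for the authoritative layout), each dataset line holds the two texts and the true label, tab-separated, and the output file holds one predicted label per line:

```
# snli-dataset.tsv: text1 <TAB> text2 <TAB> true label (illustrative line)
A man inspects a uniform.	The man is sleeping.	contradiction
# snli-roberta-output.txt: one predicted label per line (illustrative)
contradiction
```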
Conditional text generation concerns the generation of one text based on other texts, including tasks like summarization and machine translation. The below example evaluates a summarization system on the CNN/Daily Mail dataset.
The below example loads a miniature version of the CNN/Daily Mail dataset (100 lines only) from an existing file:

```
explainaboard --task summarization --custom-dataset-paths ./data/system_outputs/cnndm/cnndm_mini-dataset.tsv --system-outputs ./data/system_outputs/cnndm/cnndm_mini-bart-output.txt --metrics rouge2 bart_score_en_ref
```
Note that this uses two different metrics separated by a space.
You could also load the `cnn_dailymail` dataset from DataLab. Because the test set is large we don't include it directly in the explainaboard repository, but you can get an example by downloading it with wget:

```
wget -P ./data/system_outputs/cnndm/ https://storage.googleapis.com/inspired-public-data/explainaboard/task_data/summarization/cnndm-bart-output.txt
```
Then run the below command and it should work:

```
explainaboard --task summarization --dataset cnn_dailymail --system-outputs ./data/system_outputs/cnndm/cnndm-bart-output.txt --metrics rouge2
```
Language modeling is the task of predicting the probability of words in a text. You can analyze your language model outputs by inputting a file that has one log probability for each space-separated word.
The below example analyzes the wikitext corpus:

```
explainaboard --task language-modeling --custom-dataset-paths ./data/system_outputs/wikitext/wikitext-dataset.txt --system-outputs ./data/system_outputs/wikitext/wikitext-sys1-output.txt
```
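Concretely, each line of the system output is assumed here to correspond to a line of the dataset text, with one log probability per space-separated word. A sketch with made-up values:

```
# wikitext-dataset.txt: a line of text (illustrative)
the quick brown fox
# wikitext-sys1-output.txt: one log probability per word (made-up numbers)
-4.27 -7.92 -6.01 -8.45
```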
Named entity recognition recognizes entities such as people, organizations, or locations in text. The below examples demonstrate how you can perform such analysis on the CoNLL 2003 English named entity recognition dataset.
The below example loads the `conll2003` NER dataset from DataLab:

```
explainaboard --task named-entity-recognition --dataset conll2003 --sub-dataset ner --system-outputs ./data/system_outputs/conll2003/conll2003-elmo-output.conll
```
Alternatively, you can reference a dataset file directly:

```
explainaboard --task named-entity-recognition --custom-dataset-paths ./data/system_outputs/conll2003/conll2003-dataset.conll --system-outputs ./data/system_outputs/conll2003/conll2003-elmo-output.conll
```
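For reference, CoNLL-style files put one token per line together with its BIO tag and separate sentences with blank lines. A rough, illustrative sketch (the exact number and order of columns may differ, so check the example files above):

```
EU	B-ORG
rejects	O
German	B-MISC
call	O
.	O

Peter	B-PER
Blackburn	I-PER
```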
Word segmentation aims to split text that is written without spaces between words (such as Chinese) into words.
The below example loads the `msr` dataset from DataLab:

```
explainaboard --task word-segmentation --dataset msr --system-outputs ./data/system_outputs/cws/test-msr-predictions.tsv
```

Note that the file `test-msr-predictions.tsv` can be downloaded here.
Alternatively, you can reference a dataset file directly:

```
explainaboard --task word-segmentation --custom-dataset-paths ./data/system_outputs/cws/test.tsv --system-outputs ./data/system_outputs/cws/prediction.tsv
```
Chunking divides text into syntactically related, non-overlapping groups of words.
The below example loads the `conll00_chunk` dataset from DataLab:

```
explainaboard --task chunking --dataset conll00_chunk --system-outputs ./data/system_outputs/chunking/test-conll00-predictions.tsv
```
Alternatively, you can reference a dataset file directly:

```
explainaboard --task chunking --custom-dataset-paths ./data/system_outputs/chunking/dataset-test-conll00.tsv --system-outputs ./data/system_outputs/chunking/test-conll00-predictions.tsv
```
Extractive QA attempts to answer queries by extracting segments from an evidence passage. The below example performs this analysis on the SQuAD dataset.
Below is an example of referencing the dataset directly:

```
explainaboard --task qa-extractive --custom-dataset-paths ./data/system_outputs/squad/squad_mini-dataset.json --system-outputs ./data/system_outputs/squad/squad_mini-example-output.json > report.json
```
The below example loads the `squad` dataset from DataLab. There is an open issue that prevents the specification of a dataset split, so this will not work at the moment, but we are working on it:

```
explainaboard --task qa-extractive --dataset squad --system-outputs MY_FILE > report.json
```
This task aims to answer a question based on a hybrid of tabular and textual context, e.g., Zhu et al. (2021).
The below example loads the `tat_qa` dataset from DataLab:

```
explainaboard --task qa-tat --output-file-type json --dataset tat_qa --system-outputs predictions_list.json > report.json
```
You can download the file `predictions_list.json` with:

```
wget -P ./ https://explainaboard.s3.amazonaws.com/system_outputs/qa_table_text_hybrid/predictions_list.json
```
Open-domain QA aims to answer a natural-language question based on large-scale unstructured documents. The following example shows how an open-domain QA system can be evaluated with detailed analyses using the ExplainaBoard CLI.
Using built-in datasets from DataLab:

```
explainaboard --task qa-open-domain --dataset natural_questions_comp_gen --system-outputs ./data/system_outputs/qa_open_domain/test.dpr.nq.txt > report.json
```
Answer a question from multiple options. The following example demonstrates this on the metaphor QA dataset.
The below example loads the `fig_qa` dataset from DataLab:

```
explainaboard --task qa-multiple-choice --dataset fig_qa --system-outputs ./data/system_outputs/fig_qa/fig_qa-gptneo-output.json > report.json
```
And this is what it looks like with a custom dataset:

```
explainaboard --task qa-multiple-choice --custom-dataset-paths ./data/system_outputs/fig_qa/fig_qa-dataset.json --system-outputs ./data/system_outputs/fig_qa/fig_qa-gptneo-output.json > report.json
```
Predicting the tail entity of missing links in knowledge graphs.
The below example loads the `fb15k_237` dataset from DataLab:

```
wget https://datalab-hub.s3.amazonaws.com/predictions/test_distmult.json
explainaboard --task kg-link-tail-prediction --dataset fb15k_237 --sub-dataset origin --system-outputs test_distmult.json > log.res
```
And this is what it looks like with a custom dataset:

```
explainaboard --task kg-link-tail-prediction --custom-dataset-paths ./data/system_outputs/fb15k-237/data_mini.json --system-outputs ./data/system_outputs/fb15k-237/test-kg-prediction-no-user-defined-new.json > report.json
```
Predict the sentiment of a text based on a specific aspect.
This is an example with a custom dataset:

```
explainaboard --task aspect-based-sentiment-classification --custom-dataset-paths ./data/system_outputs/absa/absa-dataset.txt --system-outputs ./data/system_outputs/absa/absa-example-output.tsv > report.json
```
Fill in a blank based on multiple provided options.
This is an example using the dataset from DataLab:

```
explainaboard --task cloze-multiple-choice --dataset gaokao2018_np1 --sub-dataset cloze-multiple-choice --metrics CorrectScore --system-outputs ./integration_tests/artifacts/gaokao/rst_2018_quanguojuan1_cloze_choice.json > report.json
```
Fill in a blank based on a hint.
This is an example using the dataset from DataLab:

```
explainaboard --task cloze-generative --dataset gaokao2018_np1 --sub-dataset cloze-hint --metrics CorrectScore --system-outputs ./integration_tests/artifacts/gaokao/rst_2018_quanguojuan1_cloze_hint.json > report.json
```
Correct errors in a text.
This is an example using the dataset from DataLab:

```
explainaboard --task grammatical-error-correction --dataset gaokao2018_np1 --sub-dataset writing-grammar --metrics SeqCorrectScore --system-outputs ./integration_tests/artifacts/gaokao/rst_2018_quanguojuan1_gec.json > report.json
```
Classification over tabular data takes in a set of features and predicts a class for the outputs. The example below is over the `sst2` dataset used in text classification, but after the text has been vectorized into bag-of-words features. By default the only feature that is analyzed by ExplainaBoard is the `label` feature, so you might want to specify other features to perform bucketing over using the `metadata` entry in the dataset json file, as is done in `sst2-tabclass-dataset.json` below.
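As a purely hypothetical sketch of what such a metadata entry might look like (pseudo-JSON with comments; the field names here are assumptions for illustration, so consult sst2-tabclass-dataset.json in the repository for the authoritative schema):

```
{
  // ASSUMPTION: declares an extra feature to bucket over, in addition
  // to the default "label" feature analyzed by ExplainaBoard
  "metadata": {
    "custom_features": {
      "length": {"dtype": "float", "description": "number of words"}
    }
  },
  "examples": [ /* ... one entry per example, including its features ... */ ]
}
```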
The below example loads a dataset from an existing file:

```
explainaboard --task tabular-classification --custom-dataset-paths ./data/system_outputs/sst2_tabclass/sst2-tabclass-dataset.json --system-outputs ./data/system_outputs/sst2/sst2-lstm-output.txt
```
Regression over tabular data is basically the same as tabular classification above, but the predicted outputs are continuous numbers instead of classes.
The below example loads a dataset from an existing file:

```
explainaboard --task tabular-regression --custom-dataset-paths ./data/system_outputs/sst2_tabreg/sst2-tabclass-dataset.json --system-outputs ./data/system_outputs/sst2_tabreg/sst2-tabreg-lstm-output.txt
```
This task aims to detect argument pairs from each passage pair of review and rebuttal.
The below example loads the `ape` dataset from DataLab:

```
explainaboard --task argument-pair-extraction --dataset ape --system-outputs ./data/system_outputs/ape/ape_predictions.txt
```
Given an argument, the task aims to identify one matched argument from a list of arguments.
The example below loads the `iapi` dataset from DataLab:

```
explainaboard --task argument-pair-identification --dataset iapi --system-outputs data/system_outputs/iapi/predictions.txt > report.json
```
Evaluating the reliability of automated metrics for general text generation tasks, such as text summarization.
The below example loads the `meval_summeval` dataset from DataLab:

```
explainaboard --task meta-evaluation-nlg --dataset meval_summeval --sub-dataset coherence --system-outputs ./data/system_outputs/summeval/sumeval_bart.json > report.json
```