From 625802cd074180f7894b71c897471e055a2e6270 Mon Sep 17 00:00:00 2001
From: Pu Fanyi
Date: Sat, 27 Jan 2024 20:25:00 +0800
Subject: [PATCH] vizwiz dataset (#24)

* Merge commit '1e0514f92df2bbcd3d1c1fc86e3212c5fed93eaf'
* Update dataset paths and improve user prompts
* Add submission folder and update file paths for storing prediction results
* Merge commit '2037acaebc414280bd85e31b30ef9d2e671b3a19'
* Update dataset_path in flickr30k.yaml
* Add coco_val and coco_test tasks to coco.yaml

* Squashed commit of the following:

  commit bf49735f01e8a523d01acadba47a410b1fa46434
  Author: jzhang38    Date: Thu Jan 25 11:59:12 2024 +0800
      refactor multi model code

  commit 8a7901e371f8f1e1c47442609cf5d007a5aee3df
  Author: jzhang38    Date: Thu Jan 25 11:51:16 2024 +0800
      print table at the end

  commit fcd53e6e5a1a7b17e7a69c08eb306dd8ad3435c6
  Author: jzhang38    Date: Thu Jan 25 11:20:59 2024 +0800
      add yaml config to supprot multi-model eval

  commit cbf0704d7b754b0d233f1643f3c3181fea8d02db
  Author: jzhang38    Date: Thu Jan 25 10:39:42 2024 +0800
      black

  commit 77cc77fe7c49d65b3275c333bb1ce93798d46994
  Merge: 7e8b57d 1d3fdd4
  Author: jzhang38    Date: Thu Jan 25 10:37:57 2024 +0800
      resolve conflicts in sqa

  commit 100acee4869445bfa0a00aebdc1d36272f2af7ed
  Author: jzhang38    Date: Thu Jan 25 10:36:46 2024 +0800
      add model specific prompt and gen kwargs

  commit 2037acaebc414280bd85e31b30ef9d2e671b3a19
  Author: kcz358 <92624596+kcz358@users.noreply.github.com>    Date: Thu Jan 25 09:47:31 2024 +0800
      [Dataset] Add flickr30k (#18)
      * Add flickr30k support
      * Black lint
      * Align prompt with NoCaps

  commit 5df364f719ca23b1cecf33debf6a6fd5e1f8b032
  Author: Li Bo    Date: Wed Jan 24 22:10:14 2024 +0800
      [Datasets] modify NoCaps data path and prompts (#17)
      * Merge commit '1e0514f92df2bbcd3d1c1fc86e3212c5fed93eaf'
      * Update dataset paths and improve user prompts

  commit fc6d5dd1b7e142e0336c2099845cd2b89558a77b
  Author: jzhang38    Date: Wed Jan 24 13:56:51 2024 +0800
      black

  commit 4d35cfef00c7bbe2d51d7e72b4df60fc30e0cea1
  Author: jzhang38    Date: Wed Jan 24 13:55:43 2024 +0800
      add mmme

* Squashed commit of the following:

  commit bf49735f01e8a523d01acadba47a410b1fa46434
  Author: jzhang38    Date: Thu Jan 25 11:59:12 2024 +0800
      refactor multi model code

  commit 8a7901e371f8f1e1c47442609cf5d007a5aee3df
  Author: jzhang38    Date: Thu Jan 25 11:51:16 2024 +0800
      print table at the end

  commit fcd53e6e5a1a7b17e7a69c08eb306dd8ad3435c6
  Author: jzhang38    Date: Thu Jan 25 11:20:59 2024 +0800
      add yaml config to supprot multi-model eval

  commit cbf0704d7b754b0d233f1643f3c3181fea8d02db
  Author: jzhang38    Date: Thu Jan 25 10:39:42 2024 +0800
      black

  commit 77cc77fe7c49d65b3275c333bb1ce93798d46994
  Merge: 7e8b57d 1d3fdd4
  Author: jzhang38    Date: Thu Jan 25 10:37:57 2024 +0800
      resolve conflicts in sqa

  commit 100acee4869445bfa0a00aebdc1d36272f2af7ed
  Author: jzhang38    Date: Thu Jan 25 10:36:46 2024 +0800
      add model specific prompt and gen kwargs

  commit fc6d5dd1b7e142e0336c2099845cd2b89558a77b
  Author: jzhang38    Date: Wed Jan 24 13:56:51 2024 +0800
      black

  commit 4d35cfef00c7bbe2d51d7e72b4df60fc30e0cea1
  Author: jzhang38    Date: Wed Jan 24 13:55:43 2024 +0800
      add mmme

* Squashed commit of the following:

  commit 15a5c86fdc40ead0194ae03b0529b8da921bd393
  Author: Zhang Peiyuan    Date: Thu Jan 25 17:08:25 2024 +0800
      add model specific prompt and gen kwargs in sqa (#19)
      * add mmme
      * black
      * add model specific prompt and gen kwargs
      * black
      * add yaml config to supprot multi-model eval
      * print table at the end
      * refactor multi model code

* Fix cli itself can not run with config file

* Fix bug in login functionality
  Refactor code for better performance
  Add new feature for user authentication
  Update UI layout for improved user experience
  Fix typo in variable name
  Optimize database queries for faster response time
  Add error handling for edge cases
  Update dependencies to latest versions
  Remove unused code
  Improve code readability and maintainability

* Refactor get_task_dict function to handle nested groups
* Add submission file for coco, flickr30k, nocaps, and textcaps tasks
* Remove unused files and update task configuration
* Fix tasks issue for nocaps, refcoco/+/g
* Fix file path and raise error if config file does not exist
* Exclude train in refcoco/+/g config
* Solve doc_iterator_for_counting crashing issue
* Black lint
* Refactor code to improve performance and readability

* Squashed commit of the following:

  commit a2cc9303dc72e4d53983bb56e54a32e977c3e270
  Author: JvThunder    Date: Fri Jan 26 01:03:57 2024 +0800
      change okvqa yaml

  commit 35e87e7c7a480d005abf607c2527a35457d92311
  Author: JvThunder    Date: Fri Jan 26 00:55:40 2024 +0800
      change yaml

  commit 89755323596b85208ed33aa88c296604a39af6eb
  Author: JvThunder    Date: Fri Jan 26 00:42:43 2024 +0800
      add okvqa task

  commit 15a5c86fdc40ead0194ae03b0529b8da921bd393
  Author: Zhang Peiyuan    Date: Thu Jan 25 17:08:25 2024 +0800
      add model specific prompt and gen kwargs in sqa (#19)
      * add mmme
      * black
      * add model specific prompt and gen kwargs
      * black
      * add yaml config to supprot multi-model eval
      * print table at the end
      * refactor multi model code

* Squashed commit of the following:

  commit 0b0d30dfb247c5f0b7b68398b9e9fcde74cf7fa2
  Author: JvThunder    Date: Fri Jan 26 01:06:02 2024 +0800
      change ocr reference

  commit e273f9cbd91540df86bdbc652bff88a847bd0d2d
  Author: JvThunder    Date: Fri Jan 26 01:05:46 2024 +0800
      revert example_eval

  commit e84126aaaf8a07bd371a0571a914ccbcd3697f20
  Author: JvThunder    Date: Fri Jan 26 00:17:28 2024 +0800
      edit vizwiz utils

  commit 110deab53dc1a2fd349b1872cd261b69074c5fa8
  Author: JvThunder    Date: Thu Jan 25 23:49:47 2024 +0800
      reorganize __init__

  commit 0fa3e0c40075997ea80ed976bdee9615f17d3ece
  Author: JvThunder    Date: Thu Jan 25 23:46:20 2024 +0800
      minor fixes

  commit 2aaca579120def99860f90054233f3358950fa66
  Author: JvThunder    Date: Thu Jan 25 17:41:03 2024 +0800
      add vizwizvqa eval rask

  commit 15a5c86fdc40ead0194ae03b0529b8da921bd393
  Author: Zhang Peiyuan    Date: Thu Jan 25 17:08:25 2024 +0800
      add model specific prompt and gen kwargs in sqa (#19)
      * add mmme
      * black
      * add model specific prompt and gen kwargs
      * black
      * add yaml config to supprot multi-model eval
      * print table at the end
      * refactor multi model code

* Refactor mathvista.yaml and utils.py
* Add gpt_eval_score to mathvista_process_results
* Refactor mathvista_aggregate_results to return average accuracy score
* Fix refcoco evaluation error
* Fix evaluation problem for refcoco+/g
* Refactor mathvista.yaml and mathvista_evals.py
* Add dependencies and update YAML files
* Refactor mmbench_en/utils.py to save test results to separate Excel file
* Fix caption task prompt
* Add group field to mmbench_en_test and mmbench_en_val yaml files
* Delete mmbench_en_val.yaml file
* Update mmbench_cn.yaml and mmbench_cn_test.yaml
* Update mmbench_cn_val.yaml and utils.py
* Remove unused fields in mmbench_cn_cc_process_results function
* Update aggregation function for mmbench_en_dev.yaml
* Fix capitalization of L2-category key in utils.py
* Fix variable name in mmbench_process_results function
* Delete mmbench_cn_val.yaml file
* Update mathvista_test.yaml and mathvista_testmini.yaml
* Fix warnings and update mathvista.yaml
* Remove system message from MathVistaEvaluator
* Update GPT model version in MathVistaEvaluator constructor
* Update GQA_RAW_IMAGE_DATASET path in utils.py
* change vizwiz to test set
* Add split flag to mathvista_aggregate_results function
* Add higher_is_better: false to gpt_eval_info metric in d170_cn, d170_en, dc100_en, and dc200_cn yaml files
* Update lmms_eval/evaluator.py and lmms_eval/tasks/vizwizvqa/utils.py
* vizwiz-val
* Update utils.py
* Update vizwizvqa.yaml

---------

Co-authored-by: Bo Li
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>
---
 lmms_eval/evaluator.py                       |   2 +-
 lmms_eval/tasks/vizwizvqa/utils.py           |   4 +-
 lmms_eval/tasks/vizwizvqa_val/utils.py       | 273 +++++++++++++++++++
 lmms_eval/tasks/vizwizvqa_val/vizwizvqa.yaml |  24 ++
 4 files changed, 300 insertions(+), 3 deletions(-)
 create mode 100644 lmms_eval/tasks/vizwizvqa_val/utils.py
 create mode 100644 lmms_eval/tasks/vizwizvqa_val/vizwizvqa.yaml

diff --git a/lmms_eval/evaluator.py b/lmms_eval/evaluator.py
index e6775c9e2..9e0a697d0 100644
--- a/lmms_eval/evaluator.py
+++ b/lmms_eval/evaluator.py
@@ -315,7 +315,7 @@ def evaluate(
         # Don't use above one, this would crash if doc_iterator_for_counting contains too many objects and very slow
         doc_iterator_for_counting = itertools.islice(range(len(task.test_docs())), lm.rank, limit, lm.world_size) if task.has_test_docs() else itertools.islice(range(len(task.validation_docs())), lm.rank, limit, lm.world_size)
         total_docs = sum(1 for _ in doc_iterator_for_counting)
-        pbar = tqdm(total=total_docs, desc="Postprocessing")
+        pbar = tqdm(total=total_docs, desc="Postprocessing", position=lm.rank)
         for doc_id, doc in doc_iterator:
             # subset instances to only this document id ; sort by idx
             requests = list(filter(lambda x: x.doc_id == doc_id, task.instances))
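The only functional change in the evaluator is the extra position=lm.rank argument: when several ranks postprocess in parallel, pinning each bar to its own terminal row keeps the "Postprocessing" progress bars from overwriting one another. Below is a minimal standalone sketch of the same idea, with threads standing in for the distributed ranks; the helper postprocess and the n_ranks value are illustrative and not part of lmms_eval.

    import time
    from concurrent.futures import ThreadPoolExecutor

    from tqdm import tqdm


    def postprocess(rank, n_docs=50):
        # position pins this rank's bar to its own row, so bars from other
        # ranks do not clobber it while they update concurrently.
        pbar = tqdm(total=n_docs, desc=f"Postprocessing rank {rank}", position=rank)
        for _ in range(n_docs):
            time.sleep(0.01)  # stand-in for per-document scoring work
            pbar.update(1)
        pbar.close()


    if __name__ == "__main__":
        n_ranks = 4  # in lmms_eval this would be lm.world_size
        with ThreadPoolExecutor(max_workers=n_ranks) as pool:
            for rank in range(n_ranks):
                pool.submit(postprocess, rank)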
diff --git a/lmms_eval/tasks/vizwizvqa/utils.py b/lmms_eval/tasks/vizwizvqa/utils.py
index 2b3d69310..35a6e0fa1 100644
--- a/lmms_eval/tasks/vizwizvqa/utils.py
+++ b/lmms_eval/tasks/vizwizvqa/utils.py
@@ -252,14 +252,14 @@ def vizwizvqa_process_results(doc, result):
     return {
         "exact_match": accuracy,
         "submission": {
-            "question_id": doc["question_id"],
+            "image": f"{doc['question_id']}.jpg",
             "answer": resAns,
         },
     }


 def vizwizvqa_doc_to_text(doc):
-    text = f"{doc['question'].capitalize()}\n When the provided information is insufficient, respond with 'unanswerable'. Answer the question using a single word or phrase."
+    text = f"{doc['question'].capitalize()}\nWhen the provided information is insufficient, respond with 'Unanswerable'.\nAnswer the question using a single word or phrase."
     return text
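Two things change in the existing vizwizvqa task: the prompt gains explicit line breaks and a capitalized 'Unanswerable', and the submission record is now keyed by an image filename derived from the question id, presumably to match the format the VizWiz/EvalAI submission server expects. A small sketch of the rendered prompt and of one submission record, using a made-up doc whose field values are illustrative only:

    # Illustrative document; real docs come from the lmms-lab/VizWiz-VQA split.
    doc = {"question": "what is written on this package?", "question_id": "VizWiz_val_00000001"}

    prompt = (
        f"{doc['question'].capitalize()}\n"
        "When the provided information is insufficient, respond with 'Unanswerable'.\n"
        "Answer the question using a single word or phrase."
    )
    print(prompt)

    # Shape of the per-document submission record built in vizwizvqa_process_results.
    record = {"image": f"{doc['question_id']}.jpg", "answer": "rice"}
    print(record)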
diff --git a/lmms_eval/tasks/vizwizvqa_val/utils.py b/lmms_eval/tasks/vizwizvqa_val/utils.py
new file mode 100644
index 000000000..35a6e0fa1
--- /dev/null
+++ b/lmms_eval/tasks/vizwizvqa_val/utils.py
@@ -0,0 +1,273 @@
+import re
+import os
+import json
+import yaml
+import pathlib
+import logging
+import datetime
+import statistics
+
+eval_logger = logging.getLogger("lmms-eval")
+
+with open(pathlib.Path(__file__).parent / "vizwizvqa.yaml", "r") as f:
+    raw_data = f.readlines()
+    for i in range(len(raw_data)):
+        raw_data[i] = raw_data[i].replace("!function", "function")
+
+    config = yaml.safe_load("".join(raw_data))
+
+
+class EvalAIAnswerProcessor:
+    CONTRACTIONS = {
+        "aint": "ain't",
+        "arent": "aren't",
+        "cant": "can't",
+        "couldve": "could've",
+        "couldnt": "couldn't",
+        "couldn'tve": "couldn't've",
+        "couldnt've": "couldn't've",
+        "didnt": "didn't",
+        "doesnt": "doesn't",
+        "dont": "don't",
+        "hadnt": "hadn't",
+        "hadnt've": "hadn't've",
+        "hadn'tve": "hadn't've",
+        "hasnt": "hasn't",
+        "havent": "haven't",
+        "hed": "he'd",
+        "hed've": "he'd've",
+        "he'dve": "he'd've",
+        "hes": "he's",
+        "howd": "how'd",
+        "howll": "how'll",
+        "hows": "how's",
+        "Id've": "I'd've",
+        "I'dve": "I'd've",
+        "Im": "I'm",
+        "Ive": "I've",
+        "isnt": "isn't",
+        "itd": "it'd",
+        "itd've": "it'd've",
+        "it'dve": "it'd've",
+        "itll": "it'll",
+        "let's": "let's",
+        "maam": "ma'am",
+        "mightnt": "mightn't",
+        "mightnt've": "mightn't've",
+        "mightn'tve": "mightn't've",
+        "mightve": "might've",
+        "mustnt": "mustn't",
+        "mustve": "must've",
+        "neednt": "needn't",
+        "notve": "not've",
+        "oclock": "o'clock",
+        "oughtnt": "oughtn't",
+        "ow's'at": "'ow's'at",
+        "'ows'at": "'ow's'at",
+        "'ow'sat": "'ow's'at",
+        "shant": "shan't",
+        "shed've": "she'd've",
+        "she'dve": "she'd've",
+        "she's": "she's",
+        "shouldve": "should've",
+        "shouldnt": "shouldn't",
+        "shouldnt've": "shouldn't've",
+        "shouldn'tve": "shouldn't've",
+        "somebody'd": "somebodyd",
+        "somebodyd've": "somebody'd've",
+        "somebody'dve": "somebody'd've",
+        "somebodyll": "somebody'll",
+        "somebodys": "somebody's",
+        "someoned": "someone'd",
+        "someoned've": "someone'd've",
+        "someone'dve": "someone'd've",
+        "someonell": "someone'll",
+        "someones": "someone's",
+        "somethingd": "something'd",
+        "somethingd've": "something'd've",
+        "something'dve": "something'd've",
+        "somethingll": "something'll",
+        "thats": "that's",
+        "thered": "there'd",
+        "thered've": "there'd've",
+        "there'dve": "there'd've",
+        "therere": "there're",
+        "theres": "there's",
+        "theyd": "they'd",
+        "theyd've": "they'd've",
+        "they'dve": "they'd've",
+        "theyll": "they'll",
+        "theyre": "they're",
+        "theyve": "they've",
+        "twas": "'twas",
+        "wasnt": "wasn't",
+        "wed've": "we'd've",
+        "we'dve": "we'd've",
+        "weve": "we've",
+        "werent": "weren't",
+        "whatll": "what'll",
+        "whatre": "what're",
+        "whats": "what's",
+        "whatve": "what've",
+        "whens": "when's",
+        "whered": "where'd",
+        "wheres": "where's",
+        "whereve": "where've",
+        "whod": "who'd",
+        "whod've": "who'd've",
+        "who'dve": "who'd've",
+        "wholl": "who'll",
+        "whos": "who's",
+        "whove": "who've",
+        "whyll": "why'll",
+        "whyre": "why're",
+        "whys": "why's",
+        "wont": "won't",
+        "wouldve": "would've",
+        "wouldnt": "wouldn't",
+        "wouldnt've": "wouldn't've",
+        "wouldn'tve": "wouldn't've",
+        "yall": "y'all",
+        "yall'll": "y'all'll",
+        "y'allll": "y'all'll",
+        "yall'd've": "y'all'd've",
+        "y'alld've": "y'all'd've",
+        "y'all'dve": "y'all'd've",
+        "youd": "you'd",
+        "youd've": "you'd've",
+        "you'dve": "you'd've",
+        "youll": "you'll",
+        "youre": "you're",
+        "youve": "you've",
+    }
+
+    NUMBER_MAP = {
+        "none": "0",
+        "zero": "0",
+        "one": "1",
+        "two": "2",
+        "three": "3",
+        "four": "4",
+        "five": "5",
+        "six": "6",
+        "seven": "7",
+        "eight": "8",
+        "nine": "9",
+        "ten": "10",
+    }
+    ARTICLES = ["a", "an", "the"]
+    PERIOD_STRIP = re.compile(r"(?!<=\d)(\.)(?!\d)")
+    COMMA_STRIP = re.compile(r"(?<=\d)(\,)+(?=\d)")
+    PUNCTUATIONS = [
+        ";",
+        r"/",
+        "[",
+        "]",
+        '"',
+        "{",
+        "}",
+        "(",
+        ")",
+        "=",
+        "+",
+        "\\",
+        "_",
+        "-",
+        ">",
+        "<",
+        "@",
+        "`",
+        ",",
+        "?",
+        "!",
+    ]
+
+    def __init__(self, *args, **kwargs):
+        pass
+
+    def word_tokenize(self, word):
+        word = word.lower()
+        word = word.replace(",", "").replace("?", "").replace("'s", " 's")
+        word = word.replace("\n", " ").replace("\t", " ").strip()
+        return word.strip()
+
+    def process_punctuation(self, in_text):
+        out_text = in_text
+        for p in self.PUNCTUATIONS:
+            if (p + " " in in_text or " " + p in in_text) or (re.search(self.COMMA_STRIP, in_text) is not None):
+                out_text = out_text.replace(p, "")
+            else:
+                out_text = out_text.replace(p, " ")
+        out_text = self.PERIOD_STRIP.sub("", out_text, re.UNICODE)
+        return out_text
+
+    def process_digit_article(self, in_text):
+        out_text = []
+        temp_text = in_text.lower().split()
+        for word in temp_text:
+            word = self.NUMBER_MAP.setdefault(word, word)
+            if word not in self.ARTICLES:
+                out_text.append(word)
+            else:
+                pass
+        for word_id, word in enumerate(out_text):
+            if word in self.CONTRACTIONS:
+                out_text[word_id] = self.CONTRACTIONS[word]
+        out_text = " ".join(out_text)
+        return out_text
+
+    def __call__(self, item):
+        item = self.word_tokenize(item)
+        item = self.process_punctuation(item)
+        item = self.process_digit_article(item)
+        return item
+
+
+def vizwizvqa_doc_to_visual(doc):
+    return [doc["image"].convert("RGB")]
+
+
+def vizwizvqa_process_results(doc, result):
+    eval_ai_processor = EvalAIAnswerProcessor()
+    assert len(result) == 1, f"The result should be a list of length 1, but got {len(result)}."
+    resAns = eval_ai_processor(result[0])
+    accuracy = 0
+
+    if "answers" in doc and doc["answers"] is not None:
+        gtAcc = []
+
+        for i in range(len(doc["answers"])):
+            doc["answers"][i] = eval_ai_processor(doc["answers"][i])
+
+        for i in range(len(doc["answers"])):
+            otherGTAns = [doc["answers"][j] for j in range(len(doc["answers"])) if i != j]
+            matchingAns = [item for item in otherGTAns if item == resAns]
+            acc = min(1, float(len(matchingAns)) / 3)
+            gtAcc.append(acc)
+        if gtAcc:
+            accuracy = statistics.mean(gtAcc)
+        else:
+            accuracy = 0
+
+    return {
+        "exact_match": accuracy,
+        "submission": {
+            "image": f"{doc['question_id']}.jpg",
+            "answer": resAns,
+        },
+    }
+
+
+def vizwizvqa_doc_to_text(doc):
+    text = f"{doc['question'].capitalize()}\nWhen the provided information is insufficient, respond with 'Unanswerable'.\nAnswer the question using a single word or phrase."
+    return text
+
+
+def vizwizvqa_aggreate_submissions(results):
+    now_date_time = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
+    submission_file_name = f"vizwizvqa-submission-{now_date_time}.json"
+    path = os.path.abspath(submission_file_name)
+    with open(path, "w") as f:
+        json.dump(results, f)
+    print(f"Submission file saved to {path}")
+    return 0
diff --git a/lmms_eval/tasks/vizwizvqa_val/vizwizvqa.yaml b/lmms_eval/tasks/vizwizvqa_val/vizwizvqa.yaml
new file mode 100644
index 000000000..290d928ef
--- /dev/null
+++ b/lmms_eval/tasks/vizwizvqa_val/vizwizvqa.yaml
@@ -0,0 +1,24 @@
+task: vizwizvqa_val
+dataset_path: lmms-lab/VizWiz-VQA
+dataset_kwargs:
+  token: True
+test_split: val
+output_type: generate_until
+doc_to_visual: !function utils.vizwizvqa_doc_to_visual
+doc_to_text: !function utils.vizwizvqa_doc_to_text
+doc_to_target: "answer"
+generation_kwargs:
+  until:
+    - "ASSISTANT:"
+metric_list:
+  - metric: exact_match
+    aggregation: mean
+    higher_is_better: true
+    ignore_case: true
+    ignore_punctuation: true
+  - metric: submission
+    aggregation: !function utils.vizwizvqa_aggreate_submissions
+    higher_is_better: true
+metadata:
+  - version: 0.0
+  - have_ocr_reference: false
+process_results: !function utils.vizwizvqa_process_results
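The exact_match metric produced by vizwizvqa_process_results is the standard VQA accuracy: each ground-truth answer is held out in turn, the normalized prediction is matched against the remaining answers, each round scores min(matches / 3, 1), and the rounds are averaged. Below is a self-contained sketch of that arithmetic with made-up annotator answers; vqa_accuracy is a standalone illustration rather than a function from this patch, and the EvalAIAnswerProcessor normalization step is omitted for brevity.

    import statistics


    def vqa_accuracy(prediction, gt_answers):
        # Leave-one-out scoring: agreement with three of the other
        # annotators already counts as full credit for that round.
        scores = []
        for i in range(len(gt_answers)):
            others = [a for j, a in enumerate(gt_answers) if j != i]
            matches = sum(1 for a in others if a == prediction)
            scores.append(min(1.0, matches / 3))
        return statistics.mean(scores)


    # Ten illustrative annotator answers for one VizWiz question.
    answers = ["rice", "rice", "white rice", "rice", "unanswerable", "rice", "rice", "jasmine rice", "rice", "rice"]
    print(vqa_accuracy("rice", answers))        # 1.0, full credit
    print(vqa_accuracy("white rice", answers))  # partial credit
    print(vqa_accuracy("pasta", answers))       # 0.0

The submission entry in metric_list is not a score: its per-document value is the {image, answer} record shown earlier, and the vizwizvqa_aggreate_submissions aggregation simply dumps the collected records to a timestamped vizwizvqa-submission-<datetime>.json in the working directory and returns 0, so the higher_is_better flag on it is effectively a placeholder.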