Commit
vizwiz dataset (EvolvingLMMs-Lab#24)
* Merge commit 'bee5794a597d8a87794b4bcd9b57a1553efad857'

* Update dataset paths and improve user prompts

* Add submission folder and update file paths for storing prediction results

* Merge commit '52ee4a18dad22b2399a4248d2aa9204dbfe88624'

* Update dataset_path in flickr30k.yaml

* Add coco_val and coco_test tasks to coco.yaml

* Squashed commit of the following:

commit 4d48d0c9b88e62dfebe05ec909b7f1851e9cd75d
Author: jzhang38 <a1286225768@gmail.com>
Date:   Thu Jan 25 11:59:12 2024 +0800

    refactor multi model code

commit 4a4b7bec200c72332b61a0c277cd8f8a34e4f721
Author: jzhang38 <a1286225768@gmail.com>
Date:   Thu Jan 25 11:51:16 2024 +0800

    print table at the end

commit 63739fc6fa0a462d807ae81de0db0173102de584
Author: jzhang38 <a1286225768@gmail.com>
Date:   Thu Jan 25 11:20:59 2024 +0800

    add yaml config to support multi-model eval

commit edcc752f97ea3845cefad56624e5d2855066f680
Author: jzhang38 <a1286225768@gmail.com>
Date:   Thu Jan 25 10:39:42 2024 +0800

    black

commit 41f4b63d3a6e83babe92bac32a7432a8ef740bb5
Merge: 7e8b57d 1d3fdd4
Author: jzhang38 <a1286225768@gmail.com>
Date:   Thu Jan 25 10:37:57 2024 +0800

    resolve conflicts in sqa

commit 7e8b57d3bcc21d2a049d3abbc8a8201631641db4
Author: jzhang38 <a1286225768@gmail.com>
Date:   Thu Jan 25 10:36:46 2024 +0800

    add model specific prompt and gen kwargs

commit 52ee4a1
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Thu Jan 25 09:47:31 2024 +0800

    [Dataset] Add flickr30k (EvolvingLMMs-Lab#18)

    * Add flickr30k support

    * Black lint

    * Align prompt with NoCaps

commit 04303b0
Author: Li Bo <drluodian@gmail.com>
Date:   Wed Jan 24 22:10:14 2024 +0800

    [Datasets] modify NoCaps data path and prompts (EvolvingLMMs-Lab#17)

    * Merge commit 'bee5794a597d8a87794b4bcd9b57a1553efad857'

    * Update dataset paths and improve user prompts

commit 7d5058337d3de3cd4f0e85368e3dd463f34e703c
Author: jzhang38 <a1286225768@gmail.com>
Date:   Wed Jan 24 13:56:51 2024 +0800

    black

commit 73918654650daa0dad965d1b786d53e7c3585010
Author: jzhang38 <a1286225768@gmail.com>
Date:   Wed Jan 24 13:55:43 2024 +0800

    add mmme

* Squashed commit of the following:

commit 4d48d0c9b88e62dfebe05ec909b7f1851e9cd75d
Author: jzhang38 <a1286225768@gmail.com>
Date:   Thu Jan 25 11:59:12 2024 +0800

    refactor multi model code

commit 4a4b7bec200c72332b61a0c277cd8f8a34e4f721
Author: jzhang38 <a1286225768@gmail.com>
Date:   Thu Jan 25 11:51:16 2024 +0800

    print table at the end

commit 63739fc6fa0a462d807ae81de0db0173102de584
Author: jzhang38 <a1286225768@gmail.com>
Date:   Thu Jan 25 11:20:59 2024 +0800

    add yaml config to support multi-model eval

commit edcc752f97ea3845cefad56624e5d2855066f680
Author: jzhang38 <a1286225768@gmail.com>
Date:   Thu Jan 25 10:39:42 2024 +0800

    black

commit 41f4b63d3a6e83babe92bac32a7432a8ef740bb5
Merge: 7e8b57d 1d3fdd4
Author: jzhang38 <a1286225768@gmail.com>
Date:   Thu Jan 25 10:37:57 2024 +0800

    resolve conflicts in sqa

commit 7e8b57d3bcc21d2a049d3abbc8a8201631641db4
Author: jzhang38 <a1286225768@gmail.com>
Date:   Thu Jan 25 10:36:46 2024 +0800

    add model specific prompt and gen kwargs

commit 7d5058337d3de3cd4f0e85368e3dd463f34e703c
Author: jzhang38 <a1286225768@gmail.com>
Date:   Wed Jan 24 13:56:51 2024 +0800

    black

commit 73918654650daa0dad965d1b786d53e7c3585010
Author: jzhang38 <a1286225768@gmail.com>
Date:   Wed Jan 24 13:55:43 2024 +0800

    add mmme

* Squashed commit of the following:

commit f7a7db5
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Jan 25 17:08:25 2024 +0800

    add model specific prompt and gen kwargs in sqa (EvolvingLMMs-Lab#19)

    * add mmme

    * black

    * add model specific prompt and gen kwargs

    * black

    * add yaml config to support multi-model eval

    * print table at the end

    * refactor multi model code

* Fix cli itself can not run with config file

* Fix bug in login functionality

Refactor code for better performance

Add new feature for user authentication

Update UI layout for improved user experience

Fix typo in variable name

Optimize database queries for faster response time

Add error handling for edge cases

Update dependencies to latest versions

Remove unused code

Improve code readability and maintainability

* Refactor get_task_dict function to handle nested groups

* Add submission file for coco, flickr30k, nocaps, and textcaps tasks

* Remove unused files and update task configuration

* Fix tasks issue for nocaps, refcoco/+/g

* Fix file path and raise error if config file does not exist

* Exclude train in refcoco/+/g config

* Solve doc_iterator_for_counting crashing issue

* Black lint

* Refactor code to improve performance and readability

* Squashed commit of the following:

commit 35c3c7098e489ddc552778ea801a6acb6a25a9d9
Author: JvThunder <joshuaadrianc@gmail.com>
Date:   Fri Jan 26 01:03:57 2024 +0800

    change okvqa yaml

commit 25d9de0b0ea4418e4b1b6f74bdb0dd4c835f66a9
Author: JvThunder <joshuaadrianc@gmail.com>
Date:   Fri Jan 26 00:55:40 2024 +0800

    change yaml

commit aad562494c54d6ddd8cc9b9558a2a300e65f2ea2
Author: JvThunder <joshuaadrianc@gmail.com>
Date:   Fri Jan 26 00:42:43 2024 +0800

    add okvqa task

commit f7a7db5
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Jan 25 17:08:25 2024 +0800

    add model specific prompt and gen kwargs in sqa (EvolvingLMMs-Lab#19)

    * add mmme

    * black

    * add model specific prompt and gen kwargs

    * black

    * add yaml config to support multi-model eval

    * print table at the end

    * refactor multi model code

* Squashed commit of the following:

commit 40d1888f2e83dadac572c08b7e1f0ae6e2b4d504
Author: JvThunder <joshuaadrianc@gmail.com>
Date:   Fri Jan 26 01:06:02 2024 +0800

    change ocr reference

commit 02b00db5c3c2dce5ab4c2db6a3eacc7d0b735942
Author: JvThunder <joshuaadrianc@gmail.com>
Date:   Fri Jan 26 01:05:46 2024 +0800

    revert example_eval

commit f35878778fc0179381b8f3d61d222000b1773774
Author: JvThunder <joshuaadrianc@gmail.com>
Date:   Fri Jan 26 00:17:28 2024 +0800

    edit vizwiz utils

commit 64fb8196c4d9a943fa11a1d0b0fd2a065ed37847
Author: JvThunder <joshuaadrianc@gmail.com>
Date:   Thu Jan 25 23:49:47 2024 +0800

    reorganize __init__

commit f79ece372f140427c9461aa652fe1a9e8a312b3d
Author: JvThunder <joshuaadrianc@gmail.com>
Date:   Thu Jan 25 23:46:20 2024 +0800

    minor fixes

commit 028007a0352365dd42a968df6000eb66c9d30e2b
Author: JvThunder <joshuaadrianc@gmail.com>
Date:   Thu Jan 25 17:41:03 2024 +0800

    add vizwizvqa eval task

commit f7a7db5
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Thu Jan 25 17:08:25 2024 +0800

    add model specific prompt and gen kwargs in sqa (EvolvingLMMs-Lab#19)

    * add mmme

    * black

    * add model specific prompt and gen kwargs

    * black

    * add yaml config to support multi-model eval

    * print table at the end

    * refactor multi model code

* Refactor mathvista.yaml and utils.py

* Add gpt_eval_score to mathvista_process_results

* Refactor mathvista_aggregate_results to return average accuracy score

* Fix refcoco evaluation error

* Fix evaluation problem for refcoco+/g

* Refactor mathvista.yaml and mathvista_evals.py

* Add dependencies and update YAML files

* Refactor mmbench_en/utils.py to save test results to separate Excel file

* Fix caption task prompt

* Add group field to mmbench_en_test and mmbench_en_val yaml files

* Delete mmbench_en_val.yaml file

* Update mmbench_cn.yaml and mmbench_cn_test.yaml

* Update mmbench_cn_val.yaml and utils.py

* Remove unused fields in mmbench_cn_cc_process_results function

* Update aggregation function for mmbench_en_dev.yaml

* Fix capitalization of L2-category key in utils.py

* Fix variable name in mmbench_process_results function

* Delete mmbench_cn_val.yaml file

* Update mathvista_test.yaml and mathvista_testmini.yaml

* Fix warnings and update mathvista.yaml

* Remove system message from MathVistaEvaluator

* Update GPT model version in MathVistaEvaluator constructor

* Update GQA_RAW_IMAGE_DATASET path in utils.py

* change vizwiz to test set

* Add split flag to mathvista_aggregate_results function

* Add higher_is_better: false to gpt_eval_info metric in d170_cn, d170_en, dc100_en, and dc200_cn yaml files

* Update lmms_eval/evaluator.py and lmms_eval/tasks/vizwizvqa/utils.py

* vizwiz-val

* Update utils.py

* Update vizwizvqa.yaml

---------

Co-authored-by: Bo Li <drluodian@gmail.com>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>
3 people authored Jan 27, 2024
1 parent 54a1cbd commit 01fff97
Showing 4 changed files with 300 additions and 3 deletions.
2 changes: 1 addition & 1 deletion lmms_eval/evaluator.py
@@ -315,7 +315,7 @@ def evaluate(
# Don't use above one, this would crash if doc_iterator_for_counting contains too many objects and very slow
doc_iterator_for_counting = itertools.islice(range(len(task.test_docs())), lm.rank, limit, lm.world_size) if task.has_test_docs() else itertools.islice(range(len(task.validation_docs())), lm.rank, limit, lm.world_size)
total_docs = sum(1 for _ in doc_iterator_for_counting)
pbar = tqdm(total=total_docs, desc="Postprocessing")
pbar = tqdm(total=total_docs, desc="Postprocessing", position=lm.rank)
for doc_id, doc in doc_iterator:
# subset instances to only this document id ; sort by idx
requests = list(filter(lambda x: x.doc_id == doc_id, task.instances))
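For context on the evaluator.py change above: tqdm's position argument pins each progress bar to its own terminal line, so concurrent ranks stop overwriting one another's "Postprocessing" bar. A minimal standalone sketch of that behavior (illustrative only, not the repository's code; run_rank and the sleep loop are invented stand-ins):

    from tqdm import tqdm
    import time

    def run_rank(rank: int, total_docs: int = 100) -> None:
        # One bar per rank; position=rank keeps the bars on separate lines
        # when several processes write to the same terminal.
        pbar = tqdm(total=total_docs, desc=f"Postprocessing (rank {rank})", position=rank)
        for _ in range(total_docs):
            time.sleep(0.01)  # stand-in for per-document postprocessing
            pbar.update(1)
        pbar.close()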
4 changes: 2 additions & 2 deletions lmms_eval/tasks/vizwizvqa/utils.py
@@ -252,14 +252,14 @@ def vizwizvqa_process_results(doc, result):
return {
"exact_match": accuracy,
"submission": {
"question_id": doc["question_id"],
"image": f"{doc['question_id']}.jpg",
"answer": resAns,
},
}


def vizwizvqa_doc_to_text(doc):
text = f"{doc['question'].capitalize()}\n When the provided information is insufficient, respond with 'unanswerable'. Answer the question using a single word or phrase."
text = f"{doc['question'].capitalize()}\nWhen the provided information is insufficient, respond with 'Unanswerable'.\nAnswer the question using a single word or phrase."
return text


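To illustrate the revised prompt produced by vizwizvqa_doc_to_text (the sample doc below is invented, not taken from the dataset):

    doc = {"question": "what color is this shirt?"}
    print(vizwizvqa_doc_to_text(doc))
    # What color is this shirt?
    # When the provided information is insufficient, respond with 'Unanswerable'.
    # Answer the question using a single word or phrase.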
273 changes: 273 additions & 0 deletions lmms_eval/tasks/vizwizvqa_val/utils.py
@@ -0,0 +1,273 @@
import re
import os
import json
import yaml
import pathlib
import logging
import datetime
import statistics

eval_logger = logging.getLogger("lmms-eval")

with open(pathlib.Path(__file__).parent / "vizwizvqa.yaml", "r") as f:
raw_data = f.readlines()
for i in range(len(raw_data)):
raw_data[i] = raw_data[i].replace("!function", "function")

config = yaml.safe_load("".join(raw_data))


class EvalAIAnswerProcessor:
CONTRACTIONS = {
"aint": "ain't",
"arent": "aren't",
"cant": "can't",
"couldve": "could've",
"couldnt": "couldn't",
"couldn'tve": "couldn't've",
"couldnt've": "couldn't've",
"didnt": "didn't",
"doesnt": "doesn't",
"dont": "don't",
"hadnt": "hadn't",
"hadnt've": "hadn't've",
"hadn'tve": "hadn't've",
"hasnt": "hasn't",
"havent": "haven't",
"hed": "he'd",
"hed've": "he'd've",
"he'dve": "he'd've",
"hes": "he's",
"howd": "how'd",
"howll": "how'll",
"hows": "how's",
"Id've": "I'd've",
"I'dve": "I'd've",
"Im": "I'm",
"Ive": "I've",
"isnt": "isn't",
"itd": "it'd",
"itd've": "it'd've",
"it'dve": "it'd've",
"itll": "it'll",
"let's": "let's",
"maam": "ma'am",
"mightnt": "mightn't",
"mightnt've": "mightn't've",
"mightn'tve": "mightn't've",
"mightve": "might've",
"mustnt": "mustn't",
"mustve": "must've",
"neednt": "needn't",
"notve": "not've",
"oclock": "o'clock",
"oughtnt": "oughtn't",
"ow's'at": "'ow's'at",
"'ows'at": "'ow's'at",
"'ow'sat": "'ow's'at",
"shant": "shan't",
"shed've": "she'd've",
"she'dve": "she'd've",
"she's": "she's",
"shouldve": "should've",
"shouldnt": "shouldn't",
"shouldnt've": "shouldn't've",
"shouldn'tve": "shouldn't've",
"somebody'd": "somebodyd",
"somebodyd've": "somebody'd've",
"somebody'dve": "somebody'd've",
"somebodyll": "somebody'll",
"somebodys": "somebody's",
"someoned": "someone'd",
"someoned've": "someone'd've",
"someone'dve": "someone'd've",
"someonell": "someone'll",
"someones": "someone's",
"somethingd": "something'd",
"somethingd've": "something'd've",
"something'dve": "something'd've",
"somethingll": "something'll",
"thats": "that's",
"thered": "there'd",
"thered've": "there'd've",
"there'dve": "there'd've",
"therere": "there're",
"theres": "there's",
"theyd": "they'd",
"theyd've": "they'd've",
"they'dve": "they'd've",
"theyll": "they'll",
"theyre": "they're",
"theyve": "they've",
"twas": "'twas",
"wasnt": "wasn't",
"wed've": "we'd've",
"we'dve": "we'd've",
"weve": "we've",
"werent": "weren't",
"whatll": "what'll",
"whatre": "what're",
"whats": "what's",
"whatve": "what've",
"whens": "when's",
"whered": "where'd",
"wheres": "where's",
"whereve": "where've",
"whod": "who'd",
"whod've": "who'd've",
"who'dve": "who'd've",
"wholl": "who'll",
"whos": "who's",
"whove": "who've",
"whyll": "why'll",
"whyre": "why're",
"whys": "why's",
"wont": "won't",
"wouldve": "would've",
"wouldnt": "wouldn't",
"wouldnt've": "wouldn't've",
"wouldn'tve": "wouldn't've",
"yall": "y'all",
"yall'll": "y'all'll",
"y'allll": "y'all'll",
"yall'd've": "y'all'd've",
"y'alld've": "y'all'd've",
"y'all'dve": "y'all'd've",
"youd": "you'd",
"youd've": "you'd've",
"you'dve": "you'd've",
"youll": "you'll",
"youre": "you're",
"youve": "you've",
}

NUMBER_MAP = {
"none": "0",
"zero": "0",
"one": "1",
"two": "2",
"three": "3",
"four": "4",
"five": "5",
"six": "6",
"seven": "7",
"eight": "8",
"nine": "9",
"ten": "10",
}
ARTICLES = ["a", "an", "the"]
PERIOD_STRIP = re.compile(r"(?!<=\d)(\.)(?!\d)")
COMMA_STRIP = re.compile(r"(?<=\d)(\,)+(?=\d)")
PUNCTUATIONS = [
";",
r"/",
"[",
"]",
'"',
"{",
"}",
"(",
")",
"=",
"+",
"\\",
"_",
"-",
">",
"<",
"@",
"`",
",",
"?",
"!",
]

def __init__(self, *args, **kwargs):
pass

def word_tokenize(self, word):
word = word.lower()
word = word.replace(",", "").replace("?", "").replace("'s", " 's")
word = word.replace("\n", " ").replace("\t", " ").strip()
return word.strip()

def process_punctuation(self, in_text):
out_text = in_text
for p in self.PUNCTUATIONS:
if (p + " " in in_text or " " + p in in_text) or (re.search(self.COMMA_STRIP, in_text) is not None):
out_text = out_text.replace(p, "")
else:
out_text = out_text.replace(p, " ")
out_text = self.PERIOD_STRIP.sub("", out_text, re.UNICODE)
return out_text

def process_digit_article(self, in_text):
out_text = []
temp_text = in_text.lower().split()
for word in temp_text:
word = self.NUMBER_MAP.setdefault(word, word)
if word not in self.ARTICLES:
out_text.append(word)
else:
pass
for word_id, word in enumerate(out_text):
if word in self.CONTRACTIONS:
out_text[word_id] = self.CONTRACTIONS[word]
out_text = " ".join(out_text)
return out_text

def __call__(self, item):
item = self.word_tokenize(item)
item = self.process_punctuation(item)
item = self.process_digit_article(item)
return item


def vizwizvqa_doc_to_visual(doc):
return [doc["image"].convert("RGB")]


def vizwizvqa_process_results(doc, result):
eval_ai_processor = EvalAIAnswerProcessor()
assert len(result) == 1, f"The result should be a list of length 1, but got {len(result)}."
resAns = eval_ai_processor(result[0])
accuracy = 0

if "answers" in doc and doc["answers"] is not None:
gtAcc = []

for i in range(len(doc["answers"])):
doc["answers"][i] = eval_ai_processor(doc["answers"][i])

for i in range(len(doc["answers"])):
otherGTAns = [doc["answers"][j] for j in range(len(doc["answers"])) if i != j]
matchingAns = [item for item in otherGTAns if item == resAns]
acc = min(1, float(len(matchingAns)) / 3)
gtAcc.append(acc)
if gtAcc:
accuracy = statistics.mean(gtAcc)
else:
accuracy = 0

return {
"exact_match": accuracy,
"submission": {
"image": f"{doc['question_id']}.jpg",
"answer": resAns,
},
}


def vizwizvqa_doc_to_text(doc):
text = f"{doc['question'].capitalize()}\nWhen the provided information is insufficient, respond with 'Unanswerable'.\nAnswer the question using a single word or phrase."
return text


def vizwizvqa_aggreate_submissions(results):
now_date_time = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
submission_file_name = f"vizwizvqa-submission-{now_date_time}.json"
path = os.path.abspath(submission_file_name)
with open(path, "w") as f:
json.dump(results, f)
print(f"Submission file saved to {path}")
return 0
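The scoring above follows the VQA-style leave-one-out rule: for each ground-truth answer, the prediction is compared against the remaining annotator answers, that fold is credited min(1, matches / 3), and the final score is the mean over folds. A small sketch of the rule using the EvalAIAnswerProcessor defined above (the answer list is invented for illustration):

    import statistics

    processor = EvalAIAnswerProcessor()
    prediction = processor("Unanswerable")  # normalizes to "unanswerable"
    answers = [processor(a) for a in [
        "unanswerable", "unanswerable", "unanswerable", "unanswerable", "unanswerable",
        "unanswerable", "unanswerable", "unanswerable", "a dog", "dog",
    ]]

    per_answer = []
    for i in range(len(answers)):
        others = [answers[j] for j in range(len(answers)) if j != i]
        matches = [a for a in others if a == prediction]
        per_answer.append(min(1.0, len(matches) / 3))

    accuracy = statistics.mean(per_answer)  # 1.0: every fold keeps at least 3 matching answers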
24 changes: 24 additions & 0 deletions lmms_eval/tasks/vizwizvqa_val/vizwizvqa.yaml
@@ -0,0 +1,24 @@
task: vizwizvqa_val
dataset_path: lmms-lab/VizWiz-VQA
token: True
test_split: val
output_type: generate_until
doc_to_visual: !function utils.vizwizvqa_doc_to_visual
doc_to_text: !function utils.vizwizvqa_doc_to_text
doc_to_target: "answer"
generation_kwargs:
until:
- "ASSISTANT:"
metric_list:
- metric: exact_match
aggregation: mean
higher_is_better: true
ignore_case: true
ignore_punctuation: true
- metric: submission
aggregation: !function utils.vizwizvqa_aggreate_submissions
higher_is_better: true
metadata:
- version: 0.0
- have_ocr_reference: false
process_results: !function utils.vizwizvqa_process_results
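With this config in place, the new split should be selectable by task name. A hedged invocation sketch (the flag names are assumptions based on the lm-evaluation-harness conventions this project follows, not taken from this commit; check lmms_eval --help for the actual interface):

    python -m lmms_eval --model llava --tasks vizwizvqa_val --batch_size 1 --output_path ./logs/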
