From 1f8780df5e89ee50f349361bb5ea7351a73e0c19 Mon Sep 17 00:00:00 2001
From: Li Bo
Date: Thu, 1 Feb 2024 16:20:27 +0800
Subject: [PATCH] [Dataset] fix hallusion benchmark, add saving logic inside aggregate function (#35)

* add fuyu

* Merge commit '7b7f6368e8e04cddbd6e7f572f1099b7911cbe04'

* Squashed commit of the following:

  commit 96d95b3cb3540cd17bcab31f1a85ad0d04a12f1e
  Author: kcz358 <92624596+kcz358@users.noreply.github.com>
  Date:   Tue Jan 30 19:39:57 2024 +0800

      Add hallu bench

  commit 7b7f6368e8e04cddbd6e7f572f1099b7911cbe04
  Author: Pu Fanyi
  Date:   Tue Jan 30 14:52:51 2024 +0800

      scienceqa for full set (#32)

* Update hb_doc_to_text function to remove unnecessary line break

* Add Fuyu model and update OtterHD model

* Refactor model response handling and fix image processing bug

* Refactor flatten method to support only getting the first element

* Add support for specifying timezone in datetime string; update flatten method in OtterHD class; update get_datetime_str function in utils.py

* Fix condition for checking wandb_args_dict in __main__.py

* Commented out assertions for batch size in Fuyu model

* Add warning message for existing output file

* Fix batch size issue in OtterHD model

* Squashed commit of the following:

  commit 7664839d1765e09b06e6cf59c12cb895ef71c40e
  Author: Li Bo
  Date:   Wed Jan 31 16:00:22 2024 +0800

      [Datasets] add hallubench (#34)

  commit 05487a4e1f1dd1ab20d087399a47502716929a9b
  Author: Li Bo
  Date:   Wed Jan 31 14:23:15 2024 +0800

      [Datasets & Models] Fuyu, HalluBench (w/Kaichen, commit 96d95b3) (#33)

* Update API configuration and file paths

* Refactor evaluate_by_chatgpt function in utils.py

* Add hallusion_output_vd_model.json to .gitignore

* Add timeout to API request

* Refactor file path generation and remove unnecessary suffix in log samples output names

* Refactor code and add output path handling

* Update lmms-eval API and add new models and datasets
---
 .gitignore | 3
+- README.md | 124 +++++++++++++++--- lmms_eval/__main__.py | 34 +++-- lmms_eval/evaluator.py | 11 +- lmms_eval/models/otterhd.py | 3 +- .../tasks/hallusion_bench/evaluate_hb.py | 62 +++++---- lmms_eval/tasks/hallusion_bench/utils.py | 122 +++++++++-------- lmms_eval/tasks/mmbench_cn/utils.py | 18 +-- lmms_eval/tasks/mmbench_en/utils.py | 18 +-- 9 files changed, 260 insertions(+), 135 deletions(-) diff --git a/.gitignore b/.gitignore index 2c1fc3aca..6726a805d 100644 --- a/.gitignore +++ b/.gitignore @@ -21,4 +21,5 @@ scripts/ wandb/ SimSun.ttf submissions/ - +lmms_eval/tasks/hallusion_bench/hallusion_output_vs_model.json +lmms_eval/tasks/hallusion_bench/hallusion_output_vd_model.json diff --git a/README.md b/README.md index 78accb840..5de90724b 100644 --- a/README.md +++ b/README.md @@ -1,12 +1,5 @@ # lmms-eval -The API, togegher with many code blocks of this project come from [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness). **Please read through the [docs of lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs) before contributing to this project**. Please do not commit to this project directly. Instead, push your changes to another branch and create a pull request. - -Below are the changes we made to the original API: - -- Instance.args (lmms_eval/api/instance.py) now contains a list of images to be inputted to lmms. -- lm-eval-harness supports all HF LMM as single model class. Currently this is not possible of lmms because the input/output format of lmms in HF are not yet unified. Thererfore, we have to create a new class for each lmms model. This is not ideal and we will try to unify them in the future. - ## How to run ```bash @@ -19,21 +12,120 @@ accelerate launch --num_processes=8 -m lmms_eval --config example_eval.yaml # Ea ``` ## Current models -- llava (only generate_until function. Please help add the other two required functions. You can refer to lm-eval-harness for the required functions and how to implement them.) +- GPT4V (API) + - generation-based evaluation + +- Gemini (APi) + - generation-based evaluation + +- LLaVA-v1.5/v1.6-7B/13B/34B + - generation-based evaluation + - perplexity-based evaluation + +## Models to be added + +- InstructBLIP +- OpenFlamingo/Otter +- Fuyu/OtterHD +- Emu +- CogVLM ## Current datasets -- GQA -- MMMU -- SQA-IMG -- MME -- MMVet -- LLaVA-Bench -- LLaVA-Bench-CN +- AI2D (ai2d) +- ChartQA (chartqa) +- COCO Caption (coco_cap) + - COCO 2014 Caption Validation (coco2014_cap_val) + - COCO 2014 Caption Test (coco2014_cap_test) + - COCO 2017 Caption MiniVal (coco2017_cap_val) + - COCO 2017 Caption MiniTest (coco2017_cap_test) +- DOCVQA (docvqa) + - DOCVQA Validation (docvqa_val) + - DOCVQA Test (docvqa_test) +- Flickr30K (flickr30k) +- GQA (gqa) +- HallusionBenchmark (hallusion_bench_image) +- Infographic VQA (info_vqa) + - Infographic VQA Validation (info_vqa_val) + - Infographic VQA Test (info_vqa_test) +- LLaVA-Bench (llava_bench_wild) +- LLaVA-Bench-CN (?) 
+- LLaVA-Bench-COCO (llava_bench_coco) +- MathVista (mathvista) + - MathVista Validation (mathvista_testmini) + - MathVista Test (mathvista_test) +- MMBench (mmbench) + - MMBench English Dev (mmbench_en_dev) + - MMBench English Test (mmbench_en_test) + - MMBench Chinese Dev (mmbench_cn_dev) + - MMBench Chinese Test (mmbench_cn_test) +- MME (mme) + - MME-Cognition + - MME-Commonsense +- MMMU (mmmu) + - MMMU Validation (mmmu_val) + - MMMU Test (mmmu_test) +- MMVet (mmvet) +- NoCaps (nocaps) + - NoCaps Validation (nocaps_val) + - NoCaps Test (nocaps_test) +- OKVQA (okvqa) +- POPE (pope) +- RefCOCO (refcoco) + - refcoco_seg_test + - refcoco_seg_val + - refcoco_seg_testA + - refcoco_seg_testB + - refcoco_bbox_test + - refcoco_bbox_val + - refcoco_bbox_testA + - refcoco_bbox_testB +- RefCOCO+ (refcoco+) + - refcoco+_seg_val + - refcoco+_seg_testA + - refcoco+_seg_testB + - refcoco+_bbox_val + - refcoco+_bbox_testA + - refcoco+_bbox_testB +- RefCOCOg (refcocog) + - refcocog_seg_test + - refcocog_seg_val + - refcocog_bbox_test + - refcocog_bbox_val +- ScienceQA (scienceqa) + - ScienceQA Full (scienceqa_full) + - ScienceQA IMG (scienceqa_img) +- SeedBench (seedbench) +- TextCaps (textcaps) + - TextCaps Validation (textcaps_val) + - TextCaps Test (textcaps_test) +- TextVQA (textvqa) + - TextVQA Validation (textvqa_val) + - TextVQA Test (textvqa_test) +- VizWizVQA (vizwizvqa) + - VizWizVQA Validation (vizwizvqa_val) + - VizWizVQA Test (vizwizvqa_test) +- VQAv2 (vqav2) + - VQAv2 Validation (vqav2_val) + - VQAv2 Test (vqav2_test) ## Datasets to be added and tested +- CMMMU (cmmmu) +- Mementos (mementos) +- Ferret Bench (ferret) +- ST-VQA (stvqa) +- Multi-DocVQA (multidocvqa) +- Winoground (winoground) +- NLVR2 (nlvr2) +- RavenIQ-Test (raveniq) +- IconQA (iconqa) +- VistBench (vistbench) -## Datasets to be added +## Acknowledgement +The API, togegher with many code blocks of this project come from [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness). **Please read through the [docs of lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs) before contributing to this project**. Please do not commit to this project directly. Instead, push your changes to another branch and create a pull request. +Below are the changes we made to the original API: +- Instance.args (lmms_eval/api/instance.py) now contains a list of images to be inputted to lmms. +- lm-eval-harness supports all HF LMM as single model class. Currently this is not possible of lmms because the input/output format of lmms in HF are not yet unified. Thererfore, we have to create a new class for each lmms model. This is not ideal and we will try to unify them in the future. 
\ No newline at end of file diff --git a/lmms_eval/__main__.py b/lmms_eval/__main__.py index 1f624a25b..234b87801 100644 --- a/lmms_eval/__main__.py +++ b/lmms_eval/__main__.py @@ -215,6 +215,15 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None: # set datetime before evaluation datetime_str = utils.get_datetime_str(timezone=args.timezone) + if args.output_path: + hash_input = f"{args.model_args}".encode("utf-8") + hash_output = hashlib.sha256(hash_input).hexdigest()[:6] + path = Path(args.output_path) + path = path.expanduser().resolve().joinpath(f"{args.model}").joinpath(f"model_args_{hash_output}").joinpath(f"{datetime_str}_{args.log_samples_suffix}") + args.output_path = path + + elif args.log_samples and not args.output_path: + assert args.output_path, "Specify --output_path" results = evaluator.simple_evaluate( model=args.model, @@ -228,23 +237,9 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None: show_task_to_terminal=args.show_task_to_terminal, log_samples=args.log_samples, gen_kwargs=args.gen_kwargs, + cli_args=args, ) - if args.output_path: - hash_input = f"{args.model_args}".encode("utf-8") - hash_output = hashlib.sha256(hash_input).hexdigest()[:6] - path = Path(args.output_path) - path = path.expanduser().resolve().joinpath(f"{args.model}").joinpath(f"model_args_{hash_output}").joinpath(f"{datetime_str}") - path.mkdir(parents=True, exist_ok=True) - assert path.is_dir(), f"Output path {path} is not a directory" - - output_path_file = path.joinpath("results.json") - if output_path_file.exists(): - eval_logger.warning(f"Output file {output_path_file} already exists and will be overwritten.") - - elif args.log_samples and not args.output_path: - assert args.output_path, "Specify --output_path" - if results is not None: if args.log_samples: samples = results.pop("samples") @@ -253,12 +248,15 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None: print(dumped) if args.output_path: - output_path_file.open("w").write(dumped) + args.output_path.mkdir(parents=True, exist_ok=True) + result_file_path = path.joinpath("results.json") + if result_file_path.exists(): + eval_logger.warning(f"Output file {result_file_path} already exists and will be overwritten.") + result_file_path.open("w").write(dumped) if args.log_samples: for task_name, config in results["configs"].items(): - output_name = f"{task_name}_{args.log_samples_suffix}" - filename = path.joinpath(f"{output_name}.json") + filename = args.output_path.joinpath(f"{task_name}.json") # Structure the data with 'args' and 'logs' keys data_to_dump = {"args": vars(args), "config": config, "logs": sorted(samples[task_name], key=lambda x: x["doc_id"])} # Convert Namespace to dict samples_dumped = json.dumps(data_to_dump, indent=4, default=_handle_non_serializable) diff --git a/lmms_eval/evaluator.py b/lmms_eval/evaluator.py index ee3189bfc..b4cb94d07 100644 --- a/lmms_eval/evaluator.py +++ b/lmms_eval/evaluator.py @@ -3,6 +3,7 @@ import json import collections import sys +import inspect from tqdm import tqdm import torch @@ -40,6 +41,7 @@ def simple_evaluate( show_task_to_terminal: bool = False, log_samples: bool = True, gen_kwargs: str = None, + cli_args=None, # Bo: put args into more functions (cost 48 Bytes per call) ): """Instantiate and evaluate a model on a list of tasks. 
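
Note on the relocated output-path logic in `__main__.py` above: the results directory is now derived before evaluation from the model name, a short hash of `--model_args`, and the timestamped run name plus `--log_samples_suffix`. A minimal standalone sketch of that layout follows; the helper name `build_output_path` and the example values are illustrative, only the path components come from the patch.

```python
import hashlib
from pathlib import Path


def build_output_path(output_path: str, model: str, model_args: str,
                      datetime_str: str, log_samples_suffix: str) -> Path:
    # Runs with different --model_args land in different folders thanks to a
    # 6-character SHA-256 prefix of the argument string.
    args_hash = hashlib.sha256(model_args.encode("utf-8")).hexdigest()[:6]
    return (
        Path(output_path).expanduser().resolve()
        / model
        / f"model_args_{args_hash}"
        / f"{datetime_str}_{log_samples_suffix}"
    )


# Hypothetical values, e.g. logs/llava/model_args_3f2a1b/20240201_162027_myrun
print(build_output_path("logs", "llava", "pretrained=llava-v1.5-7b",
                        "20240201_162027", "myrun"))
```
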
@@ -126,6 +128,7 @@ def simple_evaluate( bootstrap_iters=bootstrap_iters, show_task_to_terminal=show_task_to_terminal, log_samples=log_samples, + cli_args=cli_args, ) if lm.rank == 0: @@ -156,6 +159,7 @@ def evaluate( bootstrap_iters: int = 100000, show_task_to_terminal: bool = False, log_samples: bool = True, + cli_args=None, ): """Instantiate and evaluate a model on a list of tasks. @@ -423,7 +427,12 @@ def evaluate( else: group_name = None agg_fn = task.aggregation()[metric] - results[task_name][metric_key] = agg_fn(items) + # Bo: for models only need agg items + if inspect.getfullargspec(agg_fn).args == ["results"]: + results[task_name][metric_key] = agg_fn(items) + # Bo: for models that need to know the args to save to correct path + elif inspect.getfullargspec(agg_fn).args == ["results", "args"]: + results[task_name][metric_key] = agg_fn(items, cli_args) results[task_name]["samples"] = len(items) # hotfix: bleu, chrf, ter seem to be really expensive to bootstrap diff --git a/lmms_eval/models/otterhd.py b/lmms_eval/models/otterhd.py index 80d8f1407..296942a90 100644 --- a/lmms_eval/models/otterhd.py +++ b/lmms_eval/models/otterhd.py @@ -37,7 +37,6 @@ def __init__( self.processor = FuyuProcessor(image_processor=self.image_processor, tokenizer=self.tokenizer) self.max_new_tokens = max_new_tokens self.batch_size_per_gpu = int(batch_size) - assert self.batch_size_per_gpu == 1, "OtterHD currently does not support batched generation." @property def max_length(self): @@ -91,7 +90,7 @@ def _collate(x): # visuals = [visuals[idx][0] for idx in range(len(visuals))] # get the first image in multi-image scenarios. formatted_contexts = [f"User: {context} Assistant:" for context in contexts] - model_inputs = self.processor(text=[formatted_contexts], images=visuals, device=self.device) + model_inputs = self.processor(text=formatted_contexts, images=visuals, device=self.device) for k, v in model_inputs.items(): model_inputs[k] = v.to(self.device, non_blocking=True) if isinstance(v, torch.Tensor) else [vv.to(self.device, non_blocking=True) for vv in v] diff --git a/lmms_eval/tasks/hallusion_bench/evaluate_hb.py b/lmms_eval/tasks/hallusion_bench/evaluate_hb.py index 09a919af5..4f93279b3 100644 --- a/lmms_eval/tasks/hallusion_bench/evaluate_hb.py +++ b/lmms_eval/tasks/hallusion_bench/evaluate_hb.py @@ -5,8 +5,9 @@ from lmms_eval.tasks.hallusion_bench.utils import evaluate_by_chatgpt, check_same_by_chatgpt, assign_correctness, get_eval_all, get_eval_fig, get_eval_pair_all -save_json_path_vd = "./hallusion_output_vd_model.json" -save_json_path_vs = "./hallusion_output_vs_model.json" +cur_dir = os.path.dirname(os.path.abspath(__file__)) +save_json_path_vd = f"{cur_dir}/hallusion_output_vd_model.json" +save_json_path_vs = f"{cur_dir}/hallusion_output_vs_model.json" output_entry = "model_prediction" correctness_entry = "gpt4v_output_gpt_check" @@ -14,27 +15,31 @@ eval_logger = logging.getLogger("lmms-eval") + def hb_doc_to_text(doc): - return doc['question'] # + "\nAnswer using yes or no." + return doc["question"] # + "\nAnswer using yes or no." 
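
The `evaluator.py` hunk above is the heart of this patch: aggregation functions may now declare a second `args` parameter, and `evaluate()` inspects the signature to decide whether to pass the CLI namespace through, so aggregations such as the MMBench ones further below can save submission files under `--output_path`. A self-contained sketch of that dispatch follows, assuming hypothetical aggregation functions; only the `["results"]` / `["results", "args"]` signature check mirrors the patch.

```python
import inspect
from types import SimpleNamespace


def mean_accuracy(results):
    # Plain aggregation: only needs the per-sample items.
    return sum(results) / len(results)


def accuracy_and_save(results, args):
    # Args-aware aggregation: can also write artifacts under args.output_path.
    print(f"would save a submission file under {args.output_path}")
    return sum(results) / len(results)


def run_aggregation(agg_fn, items, cli_args):
    # Same rule as the patched evaluate(): only pass the CLI namespace to
    # aggregation functions that explicitly ask for it.
    params = inspect.getfullargspec(agg_fn).args
    if params == ["results"]:
        return agg_fn(items)
    elif params == ["results", "args"]:
        return agg_fn(items, cli_args)
    raise ValueError(f"unsupported signature: {params}")  # guard added for this sketch only


cli_args = SimpleNamespace(output_path="./logs/llava/model_args_3f2a1b/20240201_162027_myrun")
print(run_aggregation(mean_accuracy, [1, 0, 1, 1], cli_args))
print(run_aggregation(accuracy_and_save, [1, 0, 1, 1], cli_args))
```
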
+ def hb_doc_to_visual(doc): - return [doc['image'].convert('RGB')] + return [doc["image"].convert("RGB")] + def hb_process_results(doc, result): sample = doc doc.pop("image") - sample['model_prediction'] = result[0] - return {k : sample for k in metric} + sample["model_prediction"] = result[0] + return {k: sample for k in metric} + def hb_aggregation_result(results, metric): data_vd = [] data_vs = [] for data in tqdm(results, desc="Split vd and vs"): - if data['category'] == 'VD': + if data["category"] == "VD": data_vd.append(data) - if data['category'] == 'VS': - data_vs.append(data) - eval_logger.info("Do gpt eval vd ...") + if data["category"] == "VS": + data_vs.append(data) + eval_logger.info("Do gpt eval vd ...") data_vd = evaluate_by_chatgpt(data_vd, output_entry=output_entry, correctness_entry=correctness_entry, load_json=True, save_json_path=save_json_path_vd) # data_vd = check_same_by_chatgpt(data_vd, output_entry=output_entry, load_json=True, save_json_path=save_json_path_vd) data_vd = assign_correctness(data_vd, correctness_entry=correctness_entry) @@ -43,33 +48,37 @@ def hb_aggregation_result(results, metric): # data_vs = check_same_by_chatgpt(data_vs, output_entry=output_entry, load_json=True, save_json_path=save_json_path_vs) data_vs = assign_correctness(data_vs, correctness_entry=correctness_entry) results = data_vs + data_vd - + if metric == "aAcc": all_data = get_eval_all(results, model_correctness_entry=correctness_entry) - return round(100 * all_data["correct"]/all_data["total"], 4) + return round(100 * all_data["correct"] / all_data["total"], 4) elif metric == "fAcc": fig_all = get_eval_fig(results) - return round(100 * fig_all["correct"]/fig_all["total"], 4) + return round(100 * fig_all["correct"] / fig_all["total"], 4) elif metric == "qAcc": all_data = get_eval_pair_all(results, model_correctness_entry=correctness_entry) - return round(100 * all_data["correct"]/all_data["total"], 4) + return round(100 * all_data["correct"] / all_data["total"], 4) + def hb_aggregation_result_qAcc(results): return hb_aggregation_result(results, "qAcc") + def hb_aggregation_result_fAcc(results): return hb_aggregation_result(results, "fAcc") + def hb_aggregation_result_aAcc(results): return hb_aggregation_result(results, "aAcc") + def hb_aggregation_result_intern(results, metric): scores = [] for result in results: - ans = '1' if result['model_prediction'].lower().find('yes')!=-1 else '0' - scores.append(ans == result['gt_answer']) - result['answer'] = ans - + ans = "1" if result["model_prediction"].lower().find("yes") != -1 else "0" + scores.append(ans == result["gt_answer"]) + result["answer"] = ans + if metric == "aAcc": return sum(scores) / len(scores) elif metric == "qAcc": @@ -77,35 +86,38 @@ def hb_aggregation_result_intern(results, metric): for r in results: key = "_".join([r["category"], r["subcategory"], str(r["set_id"]), str(r["question_id"])]) try: - qlist[key].append(r['answer'] == r['gt_answer']) + qlist[key].append(r["answer"] == r["gt_answer"]) except: - qlist[key] = [r['answer'] == r['gt_answer']] + qlist[key] = [r["answer"] == r["gt_answer"]] out = [] - for q, v in qlist.items(): + for q, v in qlist.items(): out.append(min(v)) - + return sum(out) / len(out) elif metric == "fAcc": qlist = {} for r in results: key = "_".join([r["category"], r["subcategory"], str(r["set_id"]), str(r["figure_id"])]) try: - qlist[key].append(r['answer'] == r['gt_answer']) + qlist[key].append(r["answer"] == r["gt_answer"]) except: - qlist[key] = [r['answer'] == r['gt_answer']] + qlist[key] = 
[r["answer"] == r["gt_answer"]] out = [] - for q, v in qlist.items(): + for q, v in qlist.items(): out.append(min(v)) return sum(out) / len(out) + def hb_aggregation_result_qAcc_intern(results): eval_logger.info("Calculating qAcc ...") return hb_aggregation_result_intern(results, "qAcc") + def hb_aggregation_result_fAcc_intern(results): eval_logger.info("Calculating fAcc ...") return hb_aggregation_result_intern(results, "fAcc") + def hb_aggregation_result_aAcc_intern(results): eval_logger.info("Calculating aAcc ...") return hb_aggregation_result_intern(results, "aAcc") diff --git a/lmms_eval/tasks/hallusion_bench/utils.py b/lmms_eval/tasks/hallusion_bench/utils.py index 84e58e405..a8fbd2c8a 100644 --- a/lmms_eval/tasks/hallusion_bench/utils.py +++ b/lmms_eval/tasks/hallusion_bench/utils.py @@ -9,43 +9,54 @@ import requests import logging -API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions") -API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY") +API_TYPE = os.getenv("API_TYPE", "openai") + +if API_TYPE == "openai": + API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions") + API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY") + headers = { + "Authorization": f"Bearer {API_KEY}", + "Content-Type": "application/json", + } +elif API_TYPE == "azure": + API_URL = os.getenv("AZURE_ENDPOINT", "https://api.cognitive.microsoft.com/sts/v1.0/issueToken") + API_KEY = os.getenv("AZURE_API_KEY", "YOUR_API_KEY") + headers = { + "api-key": API_KEY, + "Content-Type": "application/json", + } eval_logger = logging.getLogger("lmms-eval") -def evaluate_by_chatgpt(data, output_entry, correctness_entry, gpt_model="gpt-4", load_json=False, save_json_path="./hallusion_output.json", retries = 3): - if load_json and os.path.exists(save_json_path): - with open(save_json_path, 'r') as f: - output = json.load(f) - else: - output = [] - for sample in tqdm(data[len(output):], desc="Eval by GPT"): - prompt = 'Imagine you are an intelligent teacher. Thoroughly read the question, reference answer and the prediction answer to ensure a clear understanding of the information provided. Assess the correctness of the predictions. ' + +def evaluate_by_chatgpt(data, output_entry, correctness_entry, gpt_model="gpt-4", load_json=False, save_json_path="./hallusion_output.json", retries=3): + # if load_json and os.path.exists(save_json_path): + # with open(save_json_path, "r") as f: + # output = json.load(f) + # else: + output = [] + for sample in tqdm(data, desc="Eval by GPT"): + prompt = "Imagine you are an intelligent teacher. Thoroughly read the question, reference answer and the prediction answer to ensure a clear understanding of the information provided. Assess the correctness of the predictions. " prompt += 'If the prediction answer does not conflict with the reference answer, please generate “correct”. If the prediction answer conflict with the reference answer, please generate “incorrect”. If the prediction answer is unclear about the answer, please generate "unclear". 
\n\n Question:' - prompt += sample['question'] - prompt += '\nReference answer: ' - prompt += sample['gt_answer_details'] - prompt += '\nPrediction answer:' + prompt += sample["question"] + prompt += "\nReference answer: " + prompt += sample["gt_answer_details"] + prompt += "\nPrediction answer:" prompt += sample[output_entry] - prompt += '\nOutput:' + prompt += "\nOutput:" # https://github.com/openai/openai-python/issues/322#issuecomment-1767841683 for attempt in range(retries): try: - headers = { - "api-key": API_KEY, - "Content-Type": "application/json", - } - messages = [{"role": "user", "content": prompt}] - payload = { - "model": gpt_model, "messages": messages, "max_tokens": 16, } - response = requests.post(API_URL, headers=headers, json=payload) + # set model when using openai api_key. Azure api_key does not need model since the endpoint fixed the model. + if API_TYPE == "openai": + payload["model"] = gpt_model + response = requests.post(API_URL, headers=headers, json=payload, timeout=30) response.raise_for_status() response = response.json() break @@ -56,16 +67,15 @@ def evaluate_by_chatgpt(data, output_entry, correctness_entry, gpt_model="gpt-4" else: # If this was the last attempt, log and return empty eval_logger.error(f"All {retries} attempts failed. Last error message: {str(e)}") try: - output_text = response['choices'][0]['message']['content'] + output_text = response["choices"][0]["message"]["content"] except Exception as e: eval_logger.info(f"Get error {str(e)} when extracting response") output_text = "unclear" - - if 'incorrect' in output_text.lower(): + if "incorrect" in output_text.lower(): gpt_correctness = "0" - elif 'correct' in output_text.lower(): + elif "correct" in output_text.lower(): gpt_correctness = "1" else: gpt_correctness = "2" @@ -74,13 +84,13 @@ def evaluate_by_chatgpt(data, output_entry, correctness_entry, gpt_model="gpt-4" output.append(sample) - with open(save_json_path, 'w') as f: + with open(save_json_path, "w") as f: json.dump(output, f, indent=4) return output -def check_same_by_chatgpt(data, output_entry, gpt_model="gpt-4", load_json=False, save_json_path="./hallusion_output.json", retries = 3): +def check_same_by_chatgpt(data, output_entry, gpt_model="gpt-4", load_json=False, save_json_path="./hallusion_output.json", retries=3): orig_response = {} for r in data: @@ -93,13 +103,13 @@ def check_same_by_chatgpt(data, output_entry, gpt_model="gpt-4", load_json=False key = "_".join([sample["category"], sample["subcategory"], str(sample["set_id"]), str(sample["question_id"])]) response2 = orig_response[key] - prompt = 'Imagine you are an intelligent teacher. Thoroughly read the two responses to two different questions. Assess the consistency of the information provided within those two responses. ' - prompt += 'You do not know the specific questions, but you can asssess the consistency among the two responses by checking for logical conflicts if both responses are correct. ' + prompt = "Imagine you are an intelligent teacher. Thoroughly read the two responses to two different questions. Assess the consistency of the information provided within those two responses. " + prompt += "You do not know the specific questions, but you can asssess the consistency among the two responses by checking for logical conflicts if both responses are correct. " prompt += 'If response1 does not conflict with response2, please generate “same”. Otherwise, generate "different". 
\n\n response1:' prompt += sample[output_entry] - prompt += '\nresponse2: ' + prompt += "\nresponse2: " prompt += response2 - prompt += '\nOutput:' + prompt += "\nOutput:" # https://github.com/openai/openai-python/issues/322#issuecomment-1767841683 for attempt in range(retries): @@ -129,47 +139,47 @@ def check_same_by_chatgpt(data, output_entry, gpt_model="gpt-4", load_json=False eval_logger.error(f"All {retries} attempts failed. Last error message: {str(e)}") try: - output_text = response['choices'][0]['message']['content'] + output_text = response["choices"][0]["message"]["content"] except Exception as e: eval_logger.info(f"Get error {str(e)} when extracting response") output_text = "different" gpt_same = "0" - if 'same' in output_text.lower(): + if "same" in output_text.lower(): gpt_same = "1" - elif 'different' in output_text.lower(): + elif "different" in output_text.lower(): gpt_same = "0" - sample["same"] = gpt_same - with open(save_json_path, 'w') as f: + with open(save_json_path, "w") as f: json.dump(data, f, indent=4) return data + def assign_correctness(data_arr, correctness_entry): for r in data_arr: assert int(r[correctness_entry]) == 0 or int(r[correctness_entry]) == 1 or int(r[correctness_entry]) == 2 - if r["category"] == "VS" and int(r["figure_id"]) == 0: # if there is no visual supplement and the model does not know, count it as correct + if r["category"] == "VS" and int(r["figure_id"]) == 0: # if there is no visual supplement and the model does not know, count it as correct r["correct"] = 1 if int(r[correctness_entry]) == 1 or int(r[correctness_entry]) == 2 else 0 else: r["correct"] = 1 if int(r[correctness_entry]) == 1 else 0 return data_arr -def get_eval_fig(data): # per figure +def get_eval_fig(data): # per figure eval_fig_dict = dict() for r in data: - if r["category"] == "VS" and str(r["figure_id"]) == "0": # no figure + if r["category"] == "VS" and str(r["figure_id"]) == "0": # no figure continue name = "_".join([r["category"], r["subcategory"], str(r["set_id"]), str(r["figure_id"])]) if name in eval_fig_dict: c, t = eval_fig_dict[name] - eval_fig_dict[name] = (c + r["correct"], t+1) + eval_fig_dict[name] = (c + r["correct"], t + 1) else: eval_fig_dict[name] = (r["correct"], 1) @@ -188,13 +198,13 @@ def get_eval_fig(data): # per figure eval_fig_stat["wrong"] += 1 else: eval_fig_stat["inconsistent"] += 1 - eval_fig_stat["score"] += (v[0] / v[1]) - + eval_fig_stat["score"] += v[0] / v[1] + eval_fig_stat["score"] = eval_fig_stat["score"] / eval_fig_stat["total"] return eval_fig_stat -def get_eval_all(data, model_correctness_entry): # per question +def get_eval_all(data, model_correctness_entry): # per question eval_all_dict = dict() eval_all_stat = {} eval_all_stat["LH"] = 0 @@ -203,11 +213,11 @@ def get_eval_all(data, model_correctness_entry): # per question for r in data: name = "_".join([r["category"], r["subcategory"], str(r["set_id"]), str(r["figure_id"]), str(r["question_id"])]) - assert name not in eval_all_dict - + assert name not in eval_all_dict + eval_all_dict[name] = r["correct"] - - if str(r["category"]) == "VD": # VD + + if str(r["category"]) == "VD": # VD if str(r["figure_id"]) == "0": if str(r[model_correctness_entry]) == "0" or str(r[model_correctness_entry]) == "2": eval_all_stat["VI"] += 1 @@ -216,11 +226,11 @@ def get_eval_all(data, model_correctness_entry): # per question eval_all_stat["Mix"] += 1 elif str(r[model_correctness_entry]) == "2": eval_all_stat["VI"] += 1 - else: # VS - if str(r["visual_input"]) == "0": # no visual + else: # VS + if 
str(r["visual_input"]) == "0": # no visual if str(r[model_correctness_entry]) == "0": eval_all_stat["LH"] += 1 - else: # original visual or modified visual (isual_input == 1 or 2) + else: # original visual or modified visual (isual_input == 1 or 2) if str(r[model_correctness_entry]) == "0": eval_all_stat["Mix"] += 1 elif str(r[model_correctness_entry]) == "2": @@ -233,8 +243,8 @@ def get_eval_all(data, model_correctness_entry): # per question return eval_all_stat -def get_eval_pair_all(data, model_correctness_entry): # per question pair +def get_eval_pair_all(data, model_correctness_entry): # per question pair orig_correctness = dict() counter = 0 lh_counter = 0 @@ -252,10 +262,10 @@ def get_eval_pair_all(data, model_correctness_entry): # per question pair name = "_".join([r["category"], r["subcategory"], str(r["set_id"]), str(r["question_id"])]) if name in get_eval_pair_dict: c, t = get_eval_pair_dict[name] - get_eval_pair_dict[name] = (c + r["correct"], t+1) + get_eval_pair_dict[name] = (c + r["correct"], t + 1) else: - get_eval_pair_dict[name] = (r["correct"], 1) - counter += 1 + get_eval_pair_dict[name] = (r["correct"], 1) + counter += 1 eval_all_pair_stat = {} eval_all_pair_stat["note"] = "all accuracy per question pair" diff --git a/lmms_eval/tasks/mmbench_cn/utils.py b/lmms_eval/tasks/mmbench_cn/utils.py index 976d216f7..175a8da9f 100644 --- a/lmms_eval/tasks/mmbench_cn/utils.py +++ b/lmms_eval/tasks/mmbench_cn/utils.py @@ -68,19 +68,21 @@ def mmbench_process_results(doc, results): } -def mmbench_aggregate_dev_results(results): +def mmbench_aggregate_dev_results(results, args): df = pd.DataFrame(results) - os.makedirs("./submissions", exist_ok=True) - with pd.ExcelWriter("./submissions/mmbench_cn_dev_results.xlsx") as writer: + Path(args.output_path).joinpath("submissions").mkdir(parents=True, exist_ok=True) + excel_write_path = Path(args.output_path) / "submissions" / f"mmbench_cn_dev_results.xlsx" + with pd.ExcelWriter(excel_write_path) as writer: df.to_excel(writer, index=False) - eval_logger.info(f"Saved results to mmbench_cn_dev_results.xlsx") + eval_logger.info(f"Saved results to {excel_write_path}") return 0 -def mmbench_aggregate_test_results(results): +def mmbench_aggregate_test_results(results, args): df = pd.DataFrame(results) - os.makedirs("./submissions", exist_ok=True) - with pd.ExcelWriter("./submissions/mmbench_cn_test_results.xlsx") as writer: + Path(args.output_path).joinpath("submissions").mkdir(parents=True, exist_ok=True) + excel_write_path = Path(args.output_path) / "submissions" / f"mmbench_cn_test_results.xlsx" + with pd.ExcelWriter(excel_write_path) as writer: df.to_excel(writer, index=False) - eval_logger.info(f"Saved results to mmbench_cn_test_results.xlsx") + eval_logger.info(f"Saved results to {excel_write_path}") return 0 diff --git a/lmms_eval/tasks/mmbench_en/utils.py b/lmms_eval/tasks/mmbench_en/utils.py index 88f2443d8..5ec207db4 100644 --- a/lmms_eval/tasks/mmbench_en/utils.py +++ b/lmms_eval/tasks/mmbench_en/utils.py @@ -68,19 +68,21 @@ def mmbench_process_results(doc, results): } -def mmbench_aggregate_dev_results(results): +def mmbench_aggregate_dev_results(results, args): df = pd.DataFrame(results) - os.makedirs("./submissions", exist_ok=True) - with pd.ExcelWriter("./submissions/mmbench_en_dev_results.xlsx") as writer: + Path(args.output_path).joinpath("submissions").mkdir(parents=True, exist_ok=True) + excel_write_path = Path(args.output_path) / "submissions" / f"mmbench_en_dev_results.xlsx" + with pd.ExcelWriter(excel_write_path) as writer: 
df.to_excel(writer, index=False) - eval_logger.info(f"Saved results to mmbench_en_dev_results.xlsx") + eval_logger.info(f"Saved results to {excel_write_path}") return 0 -def mmbench_aggregate_test_results(results): +def mmbench_aggregate_test_results(results, args): df = pd.DataFrame(results) - os.makedirs("./submissions", exist_ok=True) - with pd.ExcelWriter("./submissions/mmbench_en_test_results.xlsx") as writer: + Path(args.output_path).joinpath("submissions").mkdir(parents=True, exist_ok=True) + excel_write_path = Path(args.output_path) / "submissions" / f"mmbench_en_test_results.xlsx" + with pd.ExcelWriter(excel_write_path) as writer: df.to_excel(writer, index=False) - eval_logger.info(f"Saved results to mmbench_en_test_results.xlsx") + eval_logger.info(f"Saved results to {excel_write_path}") return 0
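
The `hallusion_bench/utils.py` changes above make the GPT judge configurable between the OpenAI and Azure endpoints via the `API_TYPE` environment variable and add a request timeout on top of the existing retry loop. A condensed sketch of that flow; the function name `judge` is a placeholder standing in for `evaluate_by_chatgpt` / `check_same_by_chatgpt`, and the prompt construction is omitted.

```python
import logging
import os

import requests

eval_logger = logging.getLogger("lmms-eval")

API_TYPE = os.getenv("API_TYPE", "openai")
if API_TYPE == "openai":
    API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions")
    headers = {"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY', 'YOUR_API_KEY')}",
               "Content-Type": "application/json"}
else:  # azure: the deployment behind the endpoint fixes the model
    API_URL = os.getenv("AZURE_ENDPOINT", "YOUR_AZURE_ENDPOINT")
    headers = {"api-key": os.getenv("AZURE_API_KEY", "YOUR_API_KEY"),
               "Content-Type": "application/json"}


def judge(prompt: str, gpt_model: str = "gpt-4", retries: int = 3) -> str:
    payload = {"messages": [{"role": "user", "content": prompt}], "max_tokens": 16}
    if API_TYPE == "openai":
        payload["model"] = gpt_model  # the Azure endpoint does not need a model field
    for attempt in range(retries):
        try:
            # The timeout keeps one hung request from stalling the whole evaluation.
            response = requests.post(API_URL, headers=headers, json=payload, timeout=30)
            response.raise_for_status()
            return response.json()["choices"][0]["message"]["content"]
        except Exception as e:
            eval_logger.info(f"Attempt {attempt + 1}/{retries} failed: {e}")
    return "unclear"  # same fallback label the patched evaluate_by_chatgpt uses
```
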
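
For readers unfamiliar with HallusionBench scoring, the `_intern` aggregation helpers in `evaluate_hb.py` above reduce to one grouping rule: a question pair (qAcc) or a figure (fAcc) only counts as correct if every question in the group is answered correctly, while aAcc is plain per-question accuracy. A compact, illustrative restatement; the field names come from the patch, but the helper `grouped_accuracy` and the sample records are made up.

```python
from collections import defaultdict


def grouped_accuracy(results, key_fields):
    # A group scores 1 only if all of its questions are correct (min over booleans).
    groups = defaultdict(list)
    for r in results:
        key = "_".join(str(r[f]) for f in key_fields)
        groups[key].append(r["answer"] == r["gt_answer"])
    return sum(min(v) for v in groups.values()) / len(groups)


samples = [
    {"category": "VD", "subcategory": "chart", "set_id": 0, "figure_id": 0,
     "question_id": 0, "answer": "1", "gt_answer": "1"},
    {"category": "VD", "subcategory": "chart", "set_id": 0, "figure_id": 1,
     "question_id": 0, "answer": "0", "gt_answer": "1"},
]
# qAcc groups per question pair, fAcc groups per figure.
print(grouped_accuracy(samples, ["category", "subcategory", "set_id", "question_id"]))  # 0.0
print(grouped_accuracy(samples, ["category", "subcategory", "set_id", "figure_id"]))    # 0.5
```
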