[Dataset] fix hallusion benchmark, add saving logic inside aggregate … · kangreen0210/LIME@1f8780d

[Dataset] fix hallusion benchmark, add saving logic inside aggregate function (EvolvingLMMs-Lab#35)

* add fuyu

* Merge commit '7b7f6368e8e04cddbd6e7f572f1099b7911cbe04'

* Squashed commit of the following:

commit 96d95b3cb3540cd17bcab31f1a85ad0d04a12f1e
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Tue Jan 30 19:39:57 2024 +0800

    Add hallu bench

commit 7b7f636
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Tue Jan 30 14:52:51 2024 +0800

    scienceqa for full set (EvolvingLMMs-Lab#32)

    * Remove unused code and configuration file

    * Remove docvqa.yaml and update vizwizvqa.yaml

    * lint

    * Add dataset_kwargs to vizwizvqa.yaml

    * Add dataset_kwargs to vizwizvqa.yaml

    * textvqa (EvolvingLMMs-Lab#27)

    * Update textvqa.yaml and utils.py

    * Fix YAML formatting in textvqa.yaml and remove unused files

    * remove useless metric

    * add textvqa val & test

    * Update progress bar description in evaluator.py

    * Update submission file names in VizWizVQA tasks

    * Update output path to include log samples suffix

    * Update submission file paths in OKVQA and VizWizVQA tasks

    * Refactor llava-in-the-wild.yaml and utils.py

    * Update metric for llava evaluation

    * Refactor logging message in Task class

    * Merge commit 'ad8d9da1fb40c446202bf9b0095b02262df2ffc8'

    * Fix formatting issues and add progress bar closing statements

    * Update task from "infovqa_val" to "infovqa_test" in infovqa_test.yaml

    * Update tqdm progress bar in OtterHD model

    * Squashed commit of the following:

    commit c09b621195878300417315a97efdec25e67dd7f5
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:46:19 2024 +0800

        Black lint

    commit 864a1aba26388276b7e57717b89520fcc77b3f62
    Merge: ab898e4 ad8d9da
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:45:31 2024 +0800

        Merge branch 'main' into kc/list_tasks_num

    commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:44:23 2024 +0800

        Enable list all tasks num

    commit c0ea54d49cb65b747d7e8fccac75838acabe05db
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:41:32 2024 +0800

        Exclude train yaml file in the task list

    commit ad8d9da
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Sun Jan 28 02:04:57 2024 +0800

        Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

        * add mmme

        * black

        * add model specific prompt and gen kwargs

        * black

        * add yaml config to support multi-model eval

        * print table at the end

        * refactor multi model code

        * add chartqa

        * black

        * add ai2d

        * black

        * update chartqa

        * black

        * update ai2d dataset

        * black

        * add qwenvl

        * add infovqa and docvqa

    * Fix error handling in loading YAML config files

    * Squashed commit of the following:

    commit dbba2fe6447b0dfd4bb89a368f62178f2b253006
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 12:41:40 2024 +0800

        Fix key bugs

    commit c09b621195878300417315a97efdec25e67dd7f5
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:46:19 2024 +0800

        Black lint

    commit 864a1aba26388276b7e57717b89520fcc77b3f62
    Merge: ab898e4 ad8d9da
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:45:31 2024 +0800

        Merge branch 'main' into kc/list_tasks_num

    commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:44:23 2024 +0800

        Enable list all tasks num

    commit c0ea54d49cb65b747d7e8fccac75838acabe05db
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:41:32 2024 +0800

        Exclude train yaml file in the task list

    commit ad8d9da
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Sun Jan 28 02:04:57 2024 +0800

        Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

        * add mmme

        * black

        * add model specific prompt and gen kwargs

        * black

        * add yaml config to support multi-model eval

        * print table at the end

        * refactor multi model code

        * add chartqa

        * black

        * add ai2d

        * black

        * update chartqa

        * black

        * update ai2d dataset

        * black

        * add qwenvl

        * add infovqa and docvqa

    * List task #num sorted

    * Update prompt messages for image-related tasks

    * Delete unused task configuration files

    * Remove coco_train.yaml configuration file

    * Update task name in mmmu.yaml

    * Fix error message for missing tasks

    * Add wandb import and integration

    * Update generation kwargs for LMMS tasks

    * Update lmms_eval MME task configuration and utils

    * Update generation_kwargs in lmms_eval tasks

    * Update doc_to_text function in coco and okvqa tasks

    * Add COCO 2017 version

    * Update task name in coco_test2017.yaml

    * Squashed commit of the following:

    commit 6ee856b
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Mon Jan 29 22:41:33 2024 +0800

        Add/mmmu test (EvolvingLMMs-Lab#30)

        * mmmu_test

        * black

    commit 4a1183c
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jan 28 22:19:13 2024 +0800

        [Dataset Check] dataset check and add wandb logging (EvolvingLMMs-Lab#29)

        * Remove unused code and configuration file

        * Remove docvqa.yaml and update vizwizvqa.yaml

        * lint

        * Add dataset_kwargs to vizwizvqa.yaml

        * Add dataset_kwargs to vizwizvqa.yaml

        * textvqa (EvolvingLMMs-Lab#27)

        * Update textvqa.yaml and utils.py

        * Fix YAML formatting in textvqa.yaml and remove unused files

        * remove useless metric

        * add textvqa val & test

        * Update progress bar description in evaluator.py

        * Update submission file names in VizWizVQA tasks

        * Update output path to include log samples suffix

        * Update submission file paths in OKVQA and VizWizVQA tasks

        * Refactor llava-in-the-wild.yaml and utils.py

        * Update metric for llava evaluation

        * Refactor logging message in Task class

        * Merge commit 'ad8d9da1fb40c446202bf9b0095b02262df2ffc8'

        * Fix formatting issues and add progress bar closing statements

        * Update task from "infovqa_val" to "infovqa_test" in infovqa_test.yaml

        * Update tqdm progress bar in OtterHD model

        * Squashed commit of the following:

        commit c09b621195878300417315a97efdec25e67dd7f5
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:46:19 2024 +0800

            Black lint

        commit 864a1aba26388276b7e57717b89520fcc77b3f62
        Merge: ab898e4 ad8d9da
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:45:31 2024 +0800

            Merge branch 'main' into kc/list_tasks_num

        commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:44:23 2024 +0800

            Enable list all tasks num

        commit c0ea54d49cb65b747d7e8fccac75838acabe05db
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:41:32 2024 +0800

            Exclude train yaml file in the task list

        commit ad8d9da
        Author: Zhang Peiyuan <a1286225768@gmail.com>
        Date:   Sun Jan 28 02:04:57 2024 +0800

            Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

            * add mmme

            * black

            * add model specific prompt and gen kwargs

            * black

            * add yaml config to support multi-model eval

            * print table at the end

            * refactor multi model code

            * add chartqa

            * black

            * add ai2d

            * black

            * update chartqa

            * black

            * update ai2d dataset

            * black

            * add qwenvl

            * add infovqa and docvqa

        * Fix error handling in loading YAML config files

        * Squashed commit of the following:

        commit dbba2fe6447b0dfd4bb89a368f62178f2b253006
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 12:41:40 2024 +0800

            Fix key bugs

        commit c09b621195878300417315a97efdec25e67dd7f5
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:46:19 2024 +0800

            Black lint

        commit 864a1aba26388276b7e57717b89520fcc77b3f62
        Merge: ab898e4 ad8d9da
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:45:31 2024 +0800

            Merge branch 'main' into kc/list_tasks_num

        commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:44:23 2024 +0800

            Enable list all tasks num

        commit c0ea54d49cb65b747d7e8fccac75838acabe05db
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:41:32 2024 +0800

            Exclude train yaml file in the task list

        commit ad8d9da
        Author: Zhang Peiyuan <a1286225768@gmail.com>
        Date:   Sun Jan 28 02:04:57 2024 +0800

            Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

            * add mmme

            * black

            * add model specific prompt and gen kwargs

            * black

            * add yaml config to support multi-model eval

            * print table at the end

            * refactor multi model code

            * add chartqa

            * black

            * add ai2d

            * black

            * update chartqa

            * black

            * update ai2d dataset

            * black

            * add qwenvl

            * add infovqa and docvqa

        * List task #num sorted

        * Update prompt messages for image-related tasks

        * Delete unused task configuration files

        * Remove coco_train.yaml configuration file

        * Update task name in mmmu.yaml

        * Fix error message for missing tasks

        * Add wandb import and integration

        ---------

        Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

    * Remove scienceqa_img task configuration

    * eval scienceqa with no images

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Update hb_doc_to_text function to remove unnecessary line break

* Add Fuyu model and update OtterHD model

* Refactor model response handling and fix image processing bug

* Refactor flatten method to support only getting the first element

* Add support for specifying timezone in datetime string

Update flatten method in OtterHD class

Update get_datetime_str function in utils.py

* Fix condition for checking wandb_args_dict in __main__.py

* Commented out assertions for batch size in Fuyu model

* Add warning message for existing output file

* Fix batch size issue in OtterHD model

* Squashed commit of the following:

commit 7664839
Author: Li Bo <drluodian@gmail.com>
Date:   Wed Jan 31 16:00:22 2024 +0800

    [Datasets] add hallubench (EvolvingLMMs-Lab#34)

    * Add hallu bench

    * Fix hall_b gpt eval bugs

    ---------

    Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

commit 05487a4
Author: Li Bo <drluodian@gmail.com>
Date:   Wed Jan 31 14:23:15 2024 +0800

    [Datasets & Models] Fuyu, HalluBench (w/Kaichen, commit 96d95b3) (EvolvingLMMs-Lab#33)

    * add fuyu

    * Merge commit '7b7f6368e8e04cddbd6e7f572f1099b7911cbe04'

    * Squashed commit of the following:

    commit 96d95b3cb3540cd17bcab31f1a85ad0d04a12f1e
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Tue Jan 30 19:39:57 2024 +0800

        Add hallu bench

    commit 7b7f636
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Tue Jan 30 14:52:51 2024 +0800

        scienceqa for full set (EvolvingLMMs-Lab#32)

        * Remove unused code and configuration file

        * Remove docvqa.yaml and update vizwizvqa.yaml

        * lint

        * Add dataset_kwargs to vizwizvqa.yaml

        * Add dataset_kwargs to vizwizvqa.yaml

        * textvqa (EvolvingLMMs-Lab#27)

        * Update textvqa.yaml and utils.py

        * Fix YAML formatting in textvqa.yaml and remove unused files

        * remove useless metric

        * add textvqa val & test

        * Update progress bar description in evaluator.py

        * Update submission file names in VizWizVQA tasks

        * Update output path to include log samples suffix

        * Update submission file paths in OKVQA and VizWizVQA tasks

        * Refactor llava-in-the-wild.yaml and utils.py

        * Update metric for llava evaluation

        * Refactor logging message in Task class

        * Merge commit 'ad8d9da1fb40c446202bf9b0095b02262df2ffc8'

        * Fix formatting issues and add progress bar closing statements

        * Update task from "infovqa_val" to "infovqa_test" in infovqa_test.yaml

        * Update tqdm progress bar in OtterHD model

        * Squashed commit of the following:

        commit c09b621195878300417315a97efdec25e67dd7f5
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:46:19 2024 +0800

            Black lint

        commit 864a1aba26388276b7e57717b89520fcc77b3f62
        Merge: ab898e4 ad8d9da
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:45:31 2024 +0800

            Merge branch 'main' into kc/list_tasks_num

        commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:44:23 2024 +0800

            Enable list all tasks num

        commit c0ea54d49cb65b747d7e8fccac75838acabe05db
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:41:32 2024 +0800

            Exclude train yaml file in the task list

        commit ad8d9da
        Author: Zhang Peiyuan <a1286225768@gmail.com>
        Date:   Sun Jan 28 02:04:57 2024 +0800

            Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

            * add mmme

            * black

            * add model specific prompt and gen kwargs

            * black

            * add yaml config to support multi-model eval

            * print table at the end

            * refactor multi model code

            * add chartqa

            * black

            * add ai2d

            * black

            * update chartqa

            * black

            * update ai2d dataset

            * black

            * add qwenvl

            * add infovqa and docvqa

        * Fix error handling in loading YAML config files

        * Squashed commit of the following:

        commit dbba2fe6447b0dfd4bb89a368f62178f2b253006
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 12:41:40 2024 +0800

            Fix key bugs

        commit c09b621195878300417315a97efdec25e67dd7f5
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:46:19 2024 +0800

            Black lint

        commit 864a1aba26388276b7e57717b89520fcc77b3f62
        Merge: ab898e4 ad8d9da
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:45:31 2024 +0800

            Merge branch 'main' into kc/list_tasks_num

        commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:44:23 2024 +0800

            Enable list all tasks num

        commit c0ea54d49cb65b747d7e8fccac75838acabe05db
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:41:32 2024 +0800

            Exclude train yaml file in the task list

        commit ad8d9da
        Author: Zhang Peiyuan <a1286225768@gmail.com>
        Date:   Sun Jan 28 02:04:57 2024 +0800

            Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

            * add mmme

            * black

            * add model specific prompt and gen kwargs

            * black

            * add yaml config to support multi-model eval

            * print table at the end

            * refactor multi model code

            * add chartqa

            * black

            * add ai2d

            * black

            * update chartqa

            * black

            * update ai2d dataset

            * black

            * add qwenvl

            * add infovqa and docvqa

        * List task #num sorted

        * Update prompt messages for image-related tasks

        * Delete unused task configuration files

        * Remove coco_train.yaml configuration file

        * Update task name in mmmu.yaml

        * Fix error message for missing tasks

        * Add wandb import and integration

        * Update generation kwargs for LMMS tasks

        * Update lmms_eval MME task configuration and utils

        * Update generation_kwargs in lmms_eval tasks

        * Update doc_to_text function in coco and okvqa tasks

        * Add COCO 2017 version

        * Update task name in coco_test2017.yaml

        * Squashed commit of the following:

        commit 6ee856b
        Author: Zhang Peiyuan <a1286225768@gmail.com>
        Date:   Mon Jan 29 22:41:33 2024 +0800

            Add/mmmu test (EvolvingLMMs-Lab#30)

            * mmmu_test

            * black

        commit 4a1183c
        Author: Li Bo <drluodian@gmail.com>
        Date:   Sun Jan 28 22:19:13 2024 +0800

            [Dataset Check] dataset check and add wandb logging (EvolvingLMMs-Lab#29)

            * Remove unused code and configuration file

            * Remove docvqa.yaml and update vizwizvqa.yaml

            * lint

            * Add dataset_kwargs to vizwizvqa.yaml

            * Add dataset_kwargs to vizwizvqa.yaml

            * textvqa (EvolvingLMMs-Lab#27)

            * Update textvqa.yaml and utils.py

            * Fix YAML formatting in textvqa.yaml and remove unused files

            * remove useless metric

            * add textvqa val & test

            * Update progress bar description in evaluator.py

            * Update submission file names in VizWizVQA tasks

            * Update output path to include log samples suffix

            * Update submission file paths in OKVQA and VizWizVQA tasks

            * Refactor llava-in-the-wild.yaml and utils.py

            * Update metric for llava evaluation

            * Refactor logging message in Task class

            * Merge commit 'ad8d9da1fb40c446202bf9b0095b02262df2ffc8'

            * Fix formatting issues and add progress bar closing statements

            * Update task from "infovqa_val" to "infovqa_test" in infovqa_test.yaml

            * Update tqdm progress bar in OtterHD model

            * Squashed commit of the following:

            commit c09b621195878300417315a97efdec25e67dd7f5
            Author: kcz358 <92624596+kcz358@users.noreply.github.com>
            Date:   Sun Jan 28 09:46:19 2024 +0800

                Black lint

            commit 864a1aba26388276b7e57717b89520fcc77b3f62
            Merge: ab898e4 ad8d9da
            Author: kcz358 <92624596+kcz358@users.noreply.github.com>
            Date:   Sun Jan 28 09:45:31 2024 +0800

                Merge branch 'main' into kc/list_tasks_num

            commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
            Author: kcz358 <92624596+kcz358@users.noreply.github.com>
            Date:   Sun Jan 28 09:44:23 2024 +0800

                Enable list all tasks num

            commit c0ea54d49cb65b747d7e8fccac75838acabe05db
            Author: kcz358 <92624596+kcz358@users.noreply.github.com>
            Date:   Sun Jan 28 09:41:32 2024 +0800

                Exclude train yaml file in the task list

            commit ad8d9da
            Author: Zhang Peiyuan <a1286225768@gmail.com>
            Date:   Sun Jan 28 02:04:57 2024 +0800

                Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

                * add mmme

                * black

                * add model specific prompt and gen kwargs

                * black

                * add yaml config to support multi-model eval

                * print table at the end

                * refactor multi model code

                * add chartqa

                * black

                * add ai2d

                * black

                * update chartqa

                * black

                * update ai2d dataset

                * black

                * add qwenvl

                * add infovqa and docvqa

            * Fix error handling in loading YAML config files

            * Squashed commit of the following:

            commit dbba2fe6447b0dfd4bb89a368f62178f2b253006
            Author: kcz358 <92624596+kcz358@users.noreply.github.com>
            Date:   Sun Jan 28 12:41:40 2024 +0800

                Fix key bugs

            commit c09b621195878300417315a97efdec25e67dd7f5
            Author: kcz358 <92624596+kcz358@users.noreply.github.com>
            Date:   Sun Jan 28 09:46:19 2024 +0800

                Black lint

            commit 864a1aba26388276b7e57717b89520fcc77b3f62
            Merge: ab898e4 ad8d9da
            Author: kcz358 <92624596+kcz358@users.noreply.github.com>
            Date:   Sun Jan 28 09:45:31 2024 +0800

                Merge branch 'main' into kc/list_tasks_num

            commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
            Author: kcz358 <92624596+kcz358@users.noreply.github.com>
            Date:   Sun Jan 28 09:44:23 2024 +0800

                Enable list all tasks num

            commit c0ea54d49cb65b747d7e8fccac75838acabe05db
            Author: kcz358 <92624596+kcz358@users.noreply.github.com>
            Date:   Sun Jan 28 09:41:32 2024 +0800

                Exclude train yaml file in the task list

            commit ad8d9da
            Author: Zhang Peiyuan <a1286225768@gmail.com>
            Date:   Sun Jan 28 02:04:57 2024 +0800

                Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

                * add mmme

                * black

                * add model specific prompt and gen kwargs

                * black

                * add yaml config to support multi-model eval

                * print table at the end

                * refactor multi model code

                * add chartqa

                * black

                * add ai2d

                * black

                * update chartqa

                * black

                * update ai2d dataset

                * black

                * add qwenvl

                * add infovqa and docvqa

            * List task #num sorted

            * Update prompt messages for image-related tasks

            * Delete unused task configuration files

            * Remove coco_train.yaml configuration file

            * Update task name in mmmu.yaml

            * Fix error message for missing tasks

            * Add wandb import and integration

            ---------

            Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
            Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

        * Remove scienceqa_img task configuration

        * eval scienceqa with no images

        ---------

        Co-authored-by: Bo Li <drluodian@gmail.com>
        Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

    * Update hb_doc_to_text function to remove unnecessary line break

    * Add Fuyu model and update OtterHD model

    * Refactor model response handling and fix image processing bug

    * Refactor flatten method to support only getting the first element

    * Add support for specifying timezone in datetime string

    Update flatten method in OtterHD class

    Update get_datetime_str function in utils.py

    * Fix condition for checking wandb_args_dict in __main__.py

    * Commented out assertions for batch size in Fuyu model

    * Add warning message for existing output file

commit 7b7f636
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Tue Jan 30 14:52:51 2024 +0800

    scienceqa for full set (EvolvingLMMs-Lab#32)

    * Remove unused code and configuration file

    * Remove docvqa.yaml and update vizwizvqa.yaml

    * lint

    * Add dataset_kwargs to vizwizvqa.yaml

    * Add dataset_kwargs to vizwizvqa.yaml

    * textvqa (EvolvingLMMs-Lab#27)

    * Update textvqa.yaml and utils.py

    * Fix YAML formatting in textvqa.yaml and remove unused files

    * remove useless metric

    * add textvqa val & test

    * Update progress bar description in evaluator.py

    * Update submission file names in VizWizVQA tasks

    * Update output path to include log samples suffix

    * Update submission file paths in OKVQA and VizWizVQA tasks

    * Refactor llava-in-the-wild.yaml and utils.py

    * Update metric for llava evaluation

    * Refactor logging message in Task class

    * Merge commit 'ad8d9da1fb40c446202bf9b0095b02262df2ffc8'

    * Fix formatting issues and add progress bar closing statements

    * Update task from "infovqa_val" to "infovqa_test" in infovqa_test.yaml

    * Update tqdm progress bar in OtterHD model

    * Squashed commit of the following:

    commit c09b621195878300417315a97efdec25e67dd7f5
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:46:19 2024 +0800

        Black lint

    commit 864a1aba26388276b7e57717b89520fcc77b3f62
    Merge: ab898e4 ad8d9da
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:45:31 2024 +0800

        Merge branch 'main' into kc/list_tasks_num

    commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:44:23 2024 +0800

        Enable list all tasks num

    commit c0ea54d49cb65b747d7e8fccac75838acabe05db
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:41:32 2024 +0800

        Exclude train yaml file in the task list

    commit ad8d9da
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Sun Jan 28 02:04:57 2024 +0800

        Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

        * add mmme

        * black

        * add model specific prompt and gen kwargs

        * black

        * add yaml config to support multi-model eval

        * print table at the end

        * refactor multi model code

        * add chartqa

        * black

        * add ai2d

        * black

        * update chartqa

        * black

        * update ai2d dataset

        * black

        * add qwenvl

        * add infovqa and docvqa

    * Fix error handling in loading YAML config files

    * Squashed commit of the following:

    commit dbba2fe6447b0dfd4bb89a368f62178f2b253006
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 12:41:40 2024 +0800

        Fix key bugs

    commit c09b621195878300417315a97efdec25e67dd7f5
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:46:19 2024 +0800

        Black lint

    commit 864a1aba26388276b7e57717b89520fcc77b3f62
    Merge: ab898e4 ad8d9da
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:45:31 2024 +0800

        Merge branch 'main' into kc/list_tasks_num

    commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:44:23 2024 +0800

        Enable list all tasks num

    commit c0ea54d49cb65b747d7e8fccac75838acabe05db
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:41:32 2024 +0800

        Exclude train yaml file in the task list

    commit ad8d9da
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Sun Jan 28 02:04:57 2024 +0800

        Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

        * add mmme

        * black

        * add model specific prompt and gen kwargs

        * black

        * add yaml config to support multi-model eval

        * print table at the end

        * refactor multi model code

        * add chartqa

        * black

        * add ai2d

        * black

        * update chartqa

        * black

        * update ai2d dataset

        * black

        * add qwenvl

        * add infovqa and docvqa

    * List task #num sorted

    * Update prompt messages for image-related tasks

    * Delete unused task configuration files

    * Remove coco_train.yaml configuration file

    * Update task name in mmmu.yaml

    * Fix error message for missing tasks

    * Add wandb import and integration

    * Update generation kwargs for LMMS tasks

    * Update lmms_eval MME task configuration and utils

    * Update generation_kwargs in lmms_eval tasks

    * Update doc_to_text function in coco and okvqa tasks

    * Add COCO 2017 version

    * Update task name in coco_test2017.yaml

    * Squashed commit of the following:

    commit 6ee856b
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Mon Jan 29 22:41:33 2024 +0800

        Add/mmmu test (EvolvingLMMs-Lab#30)

        * mmmu_test

        * black

    commit 4a1183c
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jan 28 22:19:13 2024 +0800

        [Dataset Check] dataset check and add wandb logging (EvolvingLMMs-Lab#29)

        * Remove unused code and configuration file

        * Remove docvqa.yaml and update vizwizvqa.yaml

        * lint

        * Add dataset_kwargs to vizwizvqa.yaml

        * Add dataset_kwargs to vizwizvqa.yaml

        * textvqa (EvolvingLMMs-Lab#27)

        * Update textvqa.yaml and utils.py

        * Fix YAML formatting in textvqa.yaml and remove unused files

        * remove useless metric

        * add textvqa val & test

        * Update progress bar description in evaluator.py

        * Update submission file names in VizWizVQA tasks

        * Update output path to include log samples suffix

        * Update submission file paths in OKVQA and VizWizVQA tasks

        * Refactor llava-in-the-wild.yaml and utils.py

        * Update metric for llava evaluation

        * Refactor logging message in Task class

        * Merge commit 'ad8d9da1fb40c446202bf9b0095b02262df2ffc8'

        * Fix formatting issues and add progress bar closing statements

        * Update task from "infovqa_val" to "infovqa_test" in infovqa_test.yaml

        * Update tqdm progress bar in OtterHD model

        * Squashed commit of the following:

        commit c09b621195878300417315a97efdec25e67dd7f5
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:46:19 2024 +0800

            Black lint

        commit 864a1aba26388276b7e57717b89520fcc77b3f62
        Merge: ab898e4 ad8d9da
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:45:31 2024 +0800

            Merge branch 'main' into kc/list_tasks_num

        commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:44:23 2024 +0800

            Enable list all tasks num

        commit c0ea54d49cb65b747d7e8fccac75838acabe05db
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:41:32 2024 +0800

            Exclude train yaml file in the task list

        commit ad8d9da
        Author: Zhang Peiyuan <a1286225768@gmail.com>
        Date:   Sun Jan 28 02:04:57 2024 +0800

            Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

            * add mmme

            * black

            * add model specific prompt and gen kwargs

            * black

            * add yaml config to supprot multi-model eval

            * print table at the end

            * refactor multi model code

            * add chartqa

            * black

            * add ai2d

            * black

            * update chartqa

            * blacl

            * update ai2d dataset

            * black

            * add qwenvl

            * add infovqa and docvqa

        * Fix error handling in loading YAML config files

        * Squashed commit of the following:

        commit dbba2fe6447b0dfd4bb89a368f62178f2b253006
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 12:41:40 2024 +0800

            Fix key bugs

        commit c09b621195878300417315a97efdec25e67dd7f5
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:46:19 2024 +0800

            Black lint

        commit 864a1aba26388276b7e57717b89520fcc77b3f62
        Merge: ab898e4 ad8d9da
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:45:31 2024 +0800

            Merge branch 'main' into kc/list_tasks_num

        commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:44:23 2024 +0800

            Enable list all tasks num

        commit c0ea54d49cb65b747d7e8fccac75838acabe05db
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:41:32 2024 +0800

            Exclude train yaml file in the task list

        commit ad8d9da
        Author: Zhang Peiyuan <a1286225768@gmail.com>
        Date:   Sun Jan 28 02:04:57 2024 +0800

            Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

            * add mmme

            * black

            * add model specific prompt and gen kwargs

            * black

            * add yaml config to supprot multi-model eval

            * print table at the end

            * refactor multi model code

            * add chartqa

            * black

            * add ai2d

            * black

            * update chartqa

            * blacl

            * update ai2d dataset

            * black

            * add qwenvl

            * add infovqa and docvqa

        * List task #num sorted

        * Update prompt messages for image-related tasks

        * Delete unused task configuration files

        * Remove coco_train.yaml configuration file

        * Update task name in mmmu.yaml

        * Fix error message for missing tasks

        * Add wandb import and integration

        ---------

        Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

    * Remove scienceqa_img task configuration

    * eval scienceqa with no images

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Update API configuration and file paths

* Refactor evaluate_by_chatgpt function in utils.py

* Add hallusion_output_vd_model.json to .gitignore

* Add timeout to API request

* Refactor file path generation and remove unnecessary suffix in log samples output names

* Refactor code and add output path handling

* Update lmms-eval API and add new models and datasets


Luodian authored Feb 1, 2024

1 parent 7664839 commit 1f8780d

.gitignore

@@ -21,4 +21,5 @@ scripts/
     wandb/
     SimSun.ttf
     submissions/
+    lmms_eval/tasks/hallusion_bench/hallusion_output_vs_model.json
+    lmms_eval/tasks/hallusion_bench/hallusion_output_vd_model.json

README.md

@@ -1,12 +1,5 @@
     # lmms-eval
-    The API, togegher with many code blocks of this project come from [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness). **Please read through the [docs of lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs) before contributing to this project**. Please do not commit to this project directly. Instead, push your changes to another branch and create a pull request.
-    Below are the changes we made to the original API:
-    - Instance.args (lmms_eval/api/instance.py) now contains a list of images to be inputted to lmms.
-    - lm-eval-harness supports all HF LMM as single model class. Currently this is not possible of lmms because the input/output format of lmms in HF are not yet unified. Thererfore, we have to create a new class for each lmms model. This is not ideal and we will try to unify them in the future.
     ## How to run
     ```bash
@@ Expand All @@
     ```
     ## Current models
-    - llava （only generate_until function. Please help add the other two required functions. You can refer to lm-eval-harness for the required functions and how to implement them.)
+    - GPT4V (API)
+      - generation-based evaluation
+    - Gemini (APi)
+      - generation-based evaluation
+    - LLaVA-v1.5/v1.6-7B/13B/34B
+      - generation-based evaluation
+      - perplexity-based evaluation
+    ## Models to be added
+    - InstructBLIP
+    - OpenFlamingo/Otter
+    - Fuyu/OtterHD
+    - Emu
+    - CogVLM
     ## Current datasets
-    - GQA
-    - MMMU
-    - SQA-IMG
-    - MME
-    - MMVet
-    - LLaVA-Bench
-    - LLaVA-Bench-CN
+    - AI2D (ai2d)
+    - ChartQA (chartqa)
+    - COCO Caption (coco_cap)
+      - COCO 2014 Caption Validation (coco2014_cap_val)
+      - COCO 2014 Caption Test (coco2014_cap_test)
+      - COCO 2017 Caption MiniVal (coco2017_cap_val)
+      - COCO 2017 Caption MiniTest (coco2017_cap_test)
+    - DOCVQA (docvqa)
+      - DOCVQA Validation (docvqa_val)
+      - DOCVQA Test (docvqa_test)
+    - Flickr30K (flickr30k)
+    - GQA (gqa)
+    - HallusionBenchmark (hallusion_bench_image)
+    - Infographic VQA (info_vqa)
+      - Infographic VQA Validation (info_vqa_val)
+      - Infographic VQA Test (info_vqa_test)
+    - LLaVA-Bench (llava_bench_wild)
+    - LLaVA-Bench-CN (?)
+    - LLaVA-Bench-COCO (llava_bench_coco)
+    - MathVista (mathvista)
+      - MathVista Validation (mathvista_testmini)
+      - MathVista Test (mathvista_test)
+    - MMBench (mmbench)
+      - MMBench English Dev (mmbench_en_dev)
+      - MMBench English Test (mmbench_en_test)
+      - MMBench Chinese Dev (mmbench_cn_dev)
+      - MMBench Chinese Test (mmbench_cn_test)
+    - MME (mme)
+      - MME-Cognition
+      - MME-Commonsense
+    - MMMU (mmmu)
+      - MMMU Validation (mmmu_val)
+      - MMMU Test (mmmu_test)
+    - MMVet (mmvet)
+    - NoCaps (nocaps)
+      - NoCaps Validation (nocaps_val)
+      - NoCaps Test (nocaps_test)
+    - OKVQA (okvqa)
+    - POPE (pope)
+    - RefCOCO (refcoco)
+        - refcoco_seg_test
+        - refcoco_seg_val
+        - refcoco_seg_testA
+        - refcoco_seg_testB
+        - refcoco_bbox_test
+        - refcoco_bbox_val
+        - refcoco_bbox_testA
+        - refcoco_bbox_testB
+    - RefCOCO+ (refcoco+)
+        - refcoco+_seg_val
+        - refcoco+_seg_testA
+        - refcoco+_seg_testB
+        - refcoco+_bbox_val
+        - refcoco+_bbox_testA
+        - refcoco+_bbox_testB
+    - RefCOCOg (refcocog)
+        - refcocog_seg_test
+        - refcocog_seg_val
+        - refcocog_bbox_test
+        - refcocog_bbox_val
+    - ScienceQA (scienceqa)
+      - ScienceQA Full (scienceqa_full)
+      - ScienceQA IMG (scienceqa_img)
+    - SeedBench (seedbench)
+    - TextCaps (textcaps)
+      - TextCaps Validation (textcaps_val)
+      - TextCaps Test (textcaps_test)
+    - TextVQA (textvqa)
+      - TextVQA Validation (textvqa_val)
+      - TextVQA Test (textvqa_test)
+    - VizWizVQA (vizwizvqa)
+      - VizWizVQA Validation (vizwizvqa_val)
+      - VizWizVQA Test (vizwizvqa_test)
+    - VQAv2 (vqav2)
+      - VQAv2 Validation (vqav2_val)
+      - VQAv2 Test (vqav2_test)
     ## Datasets to be added and tested
+    - CMMMU (cmmmu)
+    - Mementos (mementos)
+    - Ferret Bench (ferret)
+    - ST-VQA (stvqa)
+    - Multi-DocVQA (multidocvqa)
+    - Winoground (winoground)
+    - NLVR2 (nlvr2)
+    - RavenIQ-Test (raveniq)
+    - IconQA (iconqa)
+    - VistBench (vistbench)
-    ## Datasets to be added
+    ## Acknowledgement
+    The API, togegher with many code blocks of this project come from [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness). **Please read through the [docs of lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs) before contributing to this project**. Please do not commit to this project directly. Instead, push your changes to another branch and create a pull request.
+    Below are the changes we made to the original API:
+    - Instance.args (lmms_eval/api/instance.py) now contains a list of images to be inputted to lmms.
+    - lm-eval-harness supports all HF LMM as single model class. Currently this is not possible of lmms because the input/output format of lmms in HF are not yet unified. Thererfore, we have to create a new class for each lmms model. This is not ideal and we will try to unify them in the future.

lmms_eval/__main__.py

@@ Expand Up @@
         # set datetime before evaluation
         datetime_str = utils.get_datetime_str(timezone=args.timezone)
+        if args.output_path:
+            hash_input = f"{args.model_args}".encode("utf-8")
+            hash_output = hashlib.sha256(hash_input).hexdigest()[:6]
+            path = Path(args.output_path)
+            path = path.expanduser().resolve().joinpath(f"{args.model}").joinpath(f"model_args_{hash_output}").joinpath(f"{datetime_str}_{args.log_samples_suffix}")
+            args.output_path = path
+        elif args.log_samples and not args.output_path:
+            assert args.output_path, "Specify --output_path"
         results = evaluator.simple_evaluate(
             model=args.model,
@@ Expand All @@
             show_task_to_terminal=args.show_task_to_terminal,
             log_samples=args.log_samples,
             gen_kwargs=args.gen_kwargs,
+            cli_args=args,
         )
-        if args.output_path:
-            hash_input = f"{args.model_args}".encode("utf-8")
-            hash_output = hashlib.sha256(hash_input).hexdigest()[:6]
-            path = Path(args.output_path)
-            path = path.expanduser().resolve().joinpath(f"{args.model}").joinpath(f"model_args_{hash_output}").joinpath(f"{datetime_str}")
-            path.mkdir(parents=True, exist_ok=True)
-            assert path.is_dir(), f"Output path {path} is not a directory"
-            output_path_file = path.joinpath("results.json")
-            if output_path_file.exists():
-                eval_logger.warning(f"Output file {output_path_file} already exists and will be overwritten.")
-        elif args.log_samples and not args.output_path:
-            assert args.output_path, "Specify --output_path"
         if results is not None:
             if args.log_samples:
                 samples = results.pop("samples")
@@ Expand All @@
                 print(dumped)
             if args.output_path:
-                output_path_file.open("w").write(dumped)
+                args.output_path.mkdir(parents=True, exist_ok=True)
+                result_file_path = path.joinpath("results.json")
+                if result_file_path.exists():
+                    eval_logger.warning(f"Output file {result_file_path} already exists and will be overwritten.")
+                result_file_path.open("w").write(dumped)
                 if args.log_samples:
                     for task_name, config in results["configs"].items():
-                        output_name = f"{task_name}_{args.log_samples_suffix}"
-                        filename = path.joinpath(f"{output_name}.json")
+                        filename = args.output_path.joinpath(f"{task_name}.json")
                         # Structure the data with 'args' and 'logs' keys
                         data_to_dump = {"args": vars(args), "config": config, "logs": sorted(samples[task_name], key=lambda x: x["doc_id"])}  # Convert Namespace to dict
                         samples_dumped = json.dumps(data_to_dump, indent=4, default=_handle_non_serializable)
@@ Expand Down @@
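
Note on the output layout: the hunk above moves the output-directory computation ahead of `simple_evaluate` and appends `log_samples_suffix` to the timestamped folder name. The following is a minimal sketch of that path construction, not the project's code. The helper name `build_output_dir` and its parameters are hypothetical, introduced only to isolate the logic shown in the diff.

```python
import hashlib
from pathlib import Path


def build_output_dir(output_path, model, model_args, datetime_str, log_samples_suffix):
    """Hypothetical helper mirroring the directory layout built in the diff."""
    # Six-character fingerprint of the model-arguments string, as in the diff.
    digest = hashlib.sha256(f"{model_args}".encode("utf-8")).hexdigest()[:6]
    base = Path(output_path).expanduser().resolve()
    # <output_path>/<model>/model_args_<hash>/<datetime>_<suffix>
    return base / f"{model}" / f"model_args_{digest}" / f"{datetime_str}_{log_samples_suffix}"
```

Under this layout, `results.json` and the per-task `<task_name>.json` sample logs would sit side by side in the returned directory, which is presumably why the suffix no longer needs to appear in each sample file name.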

lmms_eval/evaluator.py

@@ -3,6 +3,7 @@
     import json
     import collections
     import sys
+    import inspect
     from tqdm import tqdm
     import torch
@@ -40,6 +41,7 @@ def simple_evaluate(
         show_task_to_terminal: bool = False,
         log_samples: bool = True,
         gen_kwargs: str = None,
+        cli_args=None,  # Bo: put args into more functions (cost 48 Bytes per call)
     ):
         """Instantiate and evaluate a model on a list of tasks.
@@ -126,6 +128,7 @@ def simple_evaluate(
             bootstrap_iters=bootstrap_iters,
             show_task_to_terminal=show_task_to_terminal,
             log_samples=log_samples,
+            cli_args=cli_args,
         )
         if lm.rank == 0:
@@ -156,6 +159,7 @@ def evaluate(
         bootstrap_iters: int = 100000,
         show_task_to_terminal: bool = False,
         log_samples: bool = True,
+        cli_args=None,
     ):
         """Instantiate and evaluate a model on a list of tasks.
@@ -423,7 +427,12 @@ def evaluate(
                 else:
                     group_name = None
                 agg_fn = task.aggregation()[metric]
-                results[task_name][metric_key] = agg_fn(items)
+                # Bo: for models only need agg items
+                if inspect.getfullargspec(agg_fn).args == ["results"]:
+                    results[task_name][metric_key] = agg_fn(items)
+                # Bo: for models that need to know the args to save to correct path
+                elif inspect.getfullargspec(agg_fn).args == ["results", "args"]:
+                    results[task_name][metric_key] = agg_fn(items, cli_args)
                 results[task_name]["samples"] = len(items)
                 # hotfix: bleu, chrf, ter seem to be really expensive to bootstrap
@@ Expand Down @@
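
Note on the aggregation dispatch: the last hunk above picks how to call `agg_fn` by comparing its positional argument names against `["results"]` or `["results", "args"]`. A minimal, self-contained sketch of that pattern follows. The names `call_aggregation`, `mean_only` and `mean_and_tag` are hypothetical. The diff itself has no fallback branch, so a signature matching neither list would leave the metric unset; raising `TypeError` in that case is an assumption of this sketch, not behavior of the commit.

```python
import inspect


def call_aggregation(agg_fn, items, cli_args=None):
    """Hypothetical helper: pass cli_args only to aggregators that declare `args`."""
    arg_names = inspect.getfullargspec(agg_fn).args
    if arg_names == ["results"]:
        return agg_fn(items)
    if arg_names == ["results", "args"]:
        return agg_fn(items, cli_args)
    # Assumption of this sketch: fail loudly instead of silently skipping the metric.
    raise TypeError(f"Unsupported aggregation signature: {arg_names}")


def mean_only(results):
    # Aggregator that needs only the collected items.
    return sum(results) / len(results)


def mean_and_tag(results, args):
    # Aggregator that also receives the CLI arguments (e.g. to find an output path).
    return (sum(results) / len(results), args)
```

Because the check compares argument names exactly, an aggregator must name its parameters `results` and `args` for the second branch to apply.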

lmms_eval/models/otterhd.py

@@ -37,7 +37,6 @@ def __init__(
             self.processor = FuyuProcessor(image_processor=self.image_processor, tokenizer=self.tokenizer)
             self.max_new_tokens = max_new_tokens
             self.batch_size_per_gpu = int(batch_size)
-            assert self.batch_size_per_gpu == 1, "OtterHD currently does not support batched generation."
         @property
         def max_length(self):
@@ -91,7 +90,7 @@ def _collate(x):
                 #     visuals = [visuals[idx][0] for idx in range(len(visuals))]  # get the first image in multi-image scenarios.
                 formatted_contexts = [f"User: {context} Assistant:" for context in contexts]
-                model_inputs = self.processor(text=[formatted_contexts], images=visuals, device=self.device)
+                model_inputs = self.processor(text=formatted_contexts, images=visuals, device=self.device)
                 for k, v in model_inputs.items():
                     model_inputs[k] = v.to(self.device, non_blocking=True) if isinstance(v, torch.Tensor) else [vv.to(self.device, non_blocking=True) for vv in v]
@@ Expand Down @@
