[Dataset] fix hallusion benchmark, add saving logic inside aggregate function (EvolvingLMMs-Lab#35)

* add fuyu

* Merge commit '7b7f6368e8e04cddbd6e7f572f1099b7911cbe04'

* Squashed commit of the following:

commit 96d95b3cb3540cd17bcab31f1a85ad0d04a12f1e
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Tue Jan 30 19:39:57 2024 +0800

    Add hallu bench

commit 7b7f636
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Tue Jan 30 14:52:51 2024 +0800

    scienceqa for full set (EvolvingLMMs-Lab#32)

    * Remove unused code and configuration file

    * Remove docvqa.yaml and update vizwizvqa.yaml

    * lint

    * Add dataset_kwargs to vizwizvqa.yaml

    * Add dataset_kwargs to vizwizvqa.yaml

    * textvqa (EvolvingLMMs-Lab#27)

    * Update textvqa.yaml and utils.py

    * Fix YAML formatting in textvqa.yaml and remove unused files

    * remove useless metric

    * add textvqa val & test

    * Update progress bar description in evaluator.py

    * Update submission file names in VizWizVQA tasks

    * Update output path to include log samples suffix

    * Update submission file paths in OKVQA and VizWizVQA tasks

    * Refactor llava-in-the-wild.yaml and utils.py

    * Update metric for llava evaluation

    * Refactor logging message in Task class

    * Merge commit 'ad8d9da1fb40c446202bf9b0095b02262df2ffc8'

    * Fix formatting issues and add progress bar closing statements

    * Update task from "infovqa_val" to "infovqa_test" in infovqa_test.yaml

    * Update tqdm progress bar in OtterHD model

    * Squashed commit of the following:

    commit c09b621195878300417315a97efdec25e67dd7f5
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:46:19 2024 +0800

        Black lint

    commit 864a1aba26388276b7e57717b89520fcc77b3f62
    Merge: ab898e4 ad8d9da
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:45:31 2024 +0800

        Merge branch 'main' into kc/list_tasks_num

    commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:44:23 2024 +0800

        Enable list all tasks num

    commit c0ea54d49cb65b747d7e8fccac75838acabe05db
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:41:32 2024 +0800

        Exclude train yaml file in the task list

    commit ad8d9da
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Sun Jan 28 02:04:57 2024 +0800

        Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

        * add mmme

        * black

        * add model specific prompt and gen kwargs

        * black

        * add yaml config to support multi-model eval

        * print table at the end

        * refactor multi model code

        * add chartqa

        * black

        * add ai2d

        * black

        * update chartqa

        * black

        * update ai2d dataset

        * black

        * add qwenvl

        * add infovqa and docvqa

    * Fix error handling in loading YAML config files

    * Squashed commit of the following:

    commit dbba2fe6447b0dfd4bb89a368f62178f2b253006
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 12:41:40 2024 +0800

        Fix key bugs

    commit c09b621195878300417315a97efdec25e67dd7f5
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:46:19 2024 +0800

        Black lint

    commit 864a1aba26388276b7e57717b89520fcc77b3f62
    Merge: ab898e4 ad8d9da
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:45:31 2024 +0800

        Merge branch 'main' into kc/list_tasks_num

    commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:44:23 2024 +0800

        Enable list all tasks num

    commit c0ea54d49cb65b747d7e8fccac75838acabe05db
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:41:32 2024 +0800

        Exclude train yaml file in the task list

    commit ad8d9da
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Sun Jan 28 02:04:57 2024 +0800

        Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

        * add mmme

        * black

        * add model specific prompt and gen kwargs

        * black

        * add yaml config to support multi-model eval

        * print table at the end

        * refactor multi model code

        * add chartqa

        * black

        * add ai2d

        * black

        * update chartqa

        * black

        * update ai2d dataset

        * black

        * add qwenvl

        * add infovqa and docvqa

    * List task #num sorted

    * Update prompt messages for image-related tasks

    * Delete unused task configuration files

    * Remove coco_train.yaml configuration file

    * Update task name in mmmu.yaml

    * Fix error message for missing tasks

    * Add wandb import and integration

    * Update generation kwargs for LMMS tasks

    * Update lmms_eval MME task configuration and utils

    * Update generation_kwargs in lmms_eval tasks

    * Update doc_to_text function in coco and okvqa tasks

    * Add COCO 2017 version

    * Update task name in coco_test2017.yaml

    * Squashed commit of the following:

    commit 6ee856b
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Mon Jan 29 22:41:33 2024 +0800

        Add/mmmu test (EvolvingLMMs-Lab#30)

        * mmmu_test

        * black

    commit 4a1183c
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jan 28 22:19:13 2024 +0800

        [Dataset Check] dataset check and add wandb logging (EvolvingLMMs-Lab#29)

        * Remove unused code and configuration file

        * Remove docvqa.yaml and update vizwizvqa.yaml

        * lint

        * Add dataset_kwargs to vizwizvqa.yaml

        * Add dataset_kwargs to vizwizvqa.yaml

        * textvqa (EvolvingLMMs-Lab#27)

        * Update textvqa.yaml and utils.py

        * Fix YAML formatting in textvqa.yaml and remove unused files

        * remove useless metric

        * add textvqa val & test

        * Update progress bar description in evaluator.py

        * Update submission file names in VizWizVQA tasks

        * Update output path to include log samples suffix

        * Update submission file paths in OKVQA and VizWizVQA tasks

        * Refactor llava-in-the-wild.yaml and utils.py

        * Update metric for llava evaluation

        * Refactor logging message in Task class

        * Merge commit 'ad8d9da1fb40c446202bf9b0095b02262df2ffc8'

        * Fix formatting issues and add progress bar closing statements

        * Update task from "infovqa_val" to "infovqa_test" in infovqa_test.yaml

        * Update tqdm progress bar in OtterHD model

        * Squashed commit of the following:

        commit c09b621195878300417315a97efdec25e67dd7f5
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:46:19 2024 +0800

            Black lint

        commit 864a1aba26388276b7e57717b89520fcc77b3f62
        Merge: ab898e4 ad8d9da
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:45:31 2024 +0800

            Merge branch 'main' into kc/list_tasks_num

        commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:44:23 2024 +0800

            Enable list all tasks num

        commit c0ea54d49cb65b747d7e8fccac75838acabe05db
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:41:32 2024 +0800

            Exclude train yaml file in the task list

        commit ad8d9da
        Author: Zhang Peiyuan <a1286225768@gmail.com>
        Date:   Sun Jan 28 02:04:57 2024 +0800

            Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

            * add mmme

            * black

            * add model specific prompt and gen kwargs

            * black

            * add yaml config to support multi-model eval

            * print table at the end

            * refactor multi model code

            * add chartqa

            * black

            * add ai2d

            * black

            * update chartqa

            * black

            * update ai2d dataset

            * black

            * add qwenvl

            * add infovqa and docvqa

        * Fix error handling in loading YAML config files

        * Squashed commit of the following:

        commit dbba2fe6447b0dfd4bb89a368f62178f2b253006
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 12:41:40 2024 +0800

            Fix key bugs

        commit c09b621195878300417315a97efdec25e67dd7f5
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:46:19 2024 +0800

            Black lint

        commit 864a1aba26388276b7e57717b89520fcc77b3f62
        Merge: ab898e4 ad8d9da
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:45:31 2024 +0800

            Merge branch 'main' into kc/list_tasks_num

        commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:44:23 2024 +0800

            Enable list all tasks num

        commit c0ea54d49cb65b747d7e8fccac75838acabe05db
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:41:32 2024 +0800

            Exclude train yaml file in the task list

        commit ad8d9da
        Author: Zhang Peiyuan <a1286225768@gmail.com>
        Date:   Sun Jan 28 02:04:57 2024 +0800

            Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

            * add mmme

            * black

            * add model specific prompt and gen kwargs

            * black

            * add yaml config to support multi-model eval

            * print table at the end

            * refactor multi model code

            * add chartqa

            * black

            * add ai2d

            * black

            * update chartqa

            * black

            * update ai2d dataset

            * black

            * add qwenvl

            * add infovqa and docvqa

        * List task #num sorted

        * Update prompt messages for image-related tasks

        * Delete unused task configuration files

        * Remove coco_train.yaml configuration file

        * Update task name in mmmu.yaml

        * Fix error message for missing tasks

        * Add wandb import and integration

        ---------

        Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

    * Remove scienceqa_img task configuration

    * eval scienceqa with no images

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Update hb_doc_to_text function to remove unnecessary line break

* Add Fuyu model and update OtterHD model

* Refactor model response handling and fix image processing bug

* Refactor flatten method to support only getting the first element

* Add support for specifying timezone in datetime string

Update flatten method in OtterHD class

Update get_datetime_str function in utils.py

* Fix condition for checking wandb_args_dict in __main__.py

* Commented out assertions for batch size in Fuyu model

* Add warning message for existing output file

* Fix batch size issue in OtterHD model

* Squashed commit of the following:

commit 7664839
Author: Li Bo <drluodian@gmail.com>
Date:   Wed Jan 31 16:00:22 2024 +0800

    [Datasets] add hallubench (EvolvingLMMs-Lab#34)

    * Add hallu bench

    * Fix hall_b gpt eval bugs

    ---------

    Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

commit 05487a4
Author: Li Bo <drluodian@gmail.com>
Date:   Wed Jan 31 14:23:15 2024 +0800

    [Datasets & Models] Fuyu, HalluBench (w/Kaichen, commit 96d95b3) (EvolvingLMMs-Lab#33)

    * add fuyu

    * Merge commit '7b7f6368e8e04cddbd6e7f572f1099b7911cbe04'

    * Squashed commit of the following:

    commit 96d95b3cb3540cd17bcab31f1a85ad0d04a12f1e
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Tue Jan 30 19:39:57 2024 +0800

        Add hallu bench

    commit 7b7f636
    Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
    Date:   Tue Jan 30 14:52:51 2024 +0800

        scienceqa for full set (EvolvingLMMs-Lab#32)

        * Remove unused code and configuration file

        * Remove docvqa.yaml and update vizwizvqa.yaml

        * lint

        * Add dataset_kwargs to vizwizvqa.yaml

        * Add dataset_kwargs to vizwizvqa.yaml

        * textvqa (EvolvingLMMs-Lab#27)

        * Update textvqa.yaml and utils.py

        * Fix YAML formatting in textvqa.yaml and remove unused files

        * remove useless metric

        * add textvqa val & test

        * Update progress bar description in evaluator.py

        * Update submission file names in VizWizVQA tasks

        * Update output path to include log samples suffix

        * Update submission file paths in OKVQA and VizWizVQA tasks

        * Refactor llava-in-the-wild.yaml and utils.py

        * Update metric for llava evaluation

        * Refactor logging message in Task class

        * Merge commit 'ad8d9da1fb40c446202bf9b0095b02262df2ffc8'

        * Fix formatting issues and add progress bar closing statements

        * Update task from "infovqa_val" to "infovqa_test" in infovqa_test.yaml

        * Update tqdm progress bar in OtterHD model

        * Squashed commit of the following:

        commit c09b621195878300417315a97efdec25e67dd7f5
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:46:19 2024 +0800

            Black lint

        commit 864a1aba26388276b7e57717b89520fcc77b3f62
        Merge: ab898e4 ad8d9da
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:45:31 2024 +0800

            Merge branch 'main' into kc/list_tasks_num

        commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:44:23 2024 +0800

            Enable list all tasks num

        commit c0ea54d49cb65b747d7e8fccac75838acabe05db
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:41:32 2024 +0800

            Exclude train yaml file in the task list

        commit ad8d9da
        Author: Zhang Peiyuan <a1286225768@gmail.com>
        Date:   Sun Jan 28 02:04:57 2024 +0800

            Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

            * add mmme

            * black

            * add model specific prompt and gen kwargs

            * black

            * add yaml config to support multi-model eval

            * print table at the end

            * refactor multi model code

            * add chartqa

            * black

            * add ai2d

            * black

            * update chartqa

            * black

            * update ai2d dataset

            * black

            * add qwenvl

            * add infovqa and docvqa

        * Fix error handling in loading YAML config files

        * Squashed commit of the following:

        commit dbba2fe6447b0dfd4bb89a368f62178f2b253006
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 12:41:40 2024 +0800

            Fix key bugs

        commit c09b621195878300417315a97efdec25e67dd7f5
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:46:19 2024 +0800

            Black lint

        commit 864a1aba26388276b7e57717b89520fcc77b3f62
        Merge: ab898e4 ad8d9da
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:45:31 2024 +0800

            Merge branch 'main' into kc/list_tasks_num

        commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:44:23 2024 +0800

            Enable list all tasks num

        commit c0ea54d49cb65b747d7e8fccac75838acabe05db
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:41:32 2024 +0800

            Exclude train yaml file in the task list

        commit ad8d9da
        Author: Zhang Peiyuan <a1286225768@gmail.com>
        Date:   Sun Jan 28 02:04:57 2024 +0800

            Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

            * add mmme

            * black

            * add model specific prompt and gen kwargs

            * black

            * add yaml config to support multi-model eval

            * print table at the end

            * refactor multi model code

            * add chartqa

            * black

            * add ai2d

            * black

            * update chartqa

            * black

            * update ai2d dataset

            * black

            * add qwenvl

            * add infovqa and docvqa

        * List task #num sorted

        * Update prompt messages for image-related tasks

        * Delete unused task configuration files

        * Remove coco_train.yaml configuration file

        * Update task name in mmmu.yaml

        * Fix error message for missing tasks

        * Add wandb import and integration

        * Update generation kwargs for LMMS tasks

        * Update lmms_eval MME task configuration and utils

        * Update generation_kwargs in lmms_eval tasks

        * Update doc_to_text function in coco and okvqa tasks

        * Add COCO 2017 version

        * Update task name in coco_test2017.yaml

        * Squashed commit of the following:

        commit 6ee856b
        Author: Zhang Peiyuan <a1286225768@gmail.com>
        Date:   Mon Jan 29 22:41:33 2024 +0800

            Add/mmmu test (EvolvingLMMs-Lab#30)

            * mmmu_test

            * black

        commit 4a1183c
        Author: Li Bo <drluodian@gmail.com>
        Date:   Sun Jan 28 22:19:13 2024 +0800

            [Dataset Check] dataset check and add wandb logging (EvolvingLMMs-Lab#29)

            * Remove unused code and configuration file

            * Remove docvqa.yaml and update vizwizvqa.yaml

            * lint

            * Add dataset_kwargs to vizwizvqa.yaml

            * Add dataset_kwargs to vizwizvqa.yaml

            * textvqa (EvolvingLMMs-Lab#27)

            * Update textvqa.yaml and utils.py

            * Fix YAML formatting in textvqa.yaml and remove unused files

            * remove useless metric

            * add textvqa val & test

            * Update progress bar description in evaluator.py

            * Update submission file names in VizWizVQA tasks

            * Update output path to include log samples suffix

            * Update submission file paths in OKVQA and VizWizVQA tasks

            * Refactor llava-in-the-wild.yaml and utils.py

            * Update metric for llava evaluation

            * Refactor logging message in Task class

            * Merge commit 'ad8d9da1fb40c446202bf9b0095b02262df2ffc8'

            * Fix formatting issues and add progress bar closing statements

            * Update task from "infovqa_val" to "infovqa_test" in infovqa_test.yaml

            * Update tqdm progress bar in OtterHD model

            * Squashed commit of the following:

            commit c09b621195878300417315a97efdec25e67dd7f5
            Author: kcz358 <92624596+kcz358@users.noreply.github.com>
            Date:   Sun Jan 28 09:46:19 2024 +0800

                Black lint

            commit 864a1aba26388276b7e57717b89520fcc77b3f62
            Merge: ab898e4 ad8d9da
            Author: kcz358 <92624596+kcz358@users.noreply.github.com>
            Date:   Sun Jan 28 09:45:31 2024 +0800

                Merge branch 'main' into kc/list_tasks_num

            commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
            Author: kcz358 <92624596+kcz358@users.noreply.github.com>
            Date:   Sun Jan 28 09:44:23 2024 +0800

                Enable list all tasks num

            commit c0ea54d49cb65b747d7e8fccac75838acabe05db
            Author: kcz358 <92624596+kcz358@users.noreply.github.com>
            Date:   Sun Jan 28 09:41:32 2024 +0800

                Exclude train yaml file in the task list

            commit ad8d9da
            Author: Zhang Peiyuan <a1286225768@gmail.com>
            Date:   Sun Jan 28 02:04:57 2024 +0800

                Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

                * add mmme

                * black

                * add model specific prompt and gen kwargs

                * black

                * add yaml config to support multi-model eval

                * print table at the end

                * refactor multi model code

                * add chartqa

                * black

                * add ai2d

                * black

                * update chartqa

                * black

                * update ai2d dataset

                * black

                * add qwenvl

                * add infovqa and docvqa

            * Fix error handling in loading YAML config files

            * Squashed commit of the following:

            commit dbba2fe6447b0dfd4bb89a368f62178f2b253006
            Author: kcz358 <92624596+kcz358@users.noreply.github.com>
            Date:   Sun Jan 28 12:41:40 2024 +0800

                Fix key bugs

            commit c09b621195878300417315a97efdec25e67dd7f5
            Author: kcz358 <92624596+kcz358@users.noreply.github.com>
            Date:   Sun Jan 28 09:46:19 2024 +0800

                Black lint

            commit 864a1aba26388276b7e57717b89520fcc77b3f62
            Merge: ab898e4 ad8d9da
            Author: kcz358 <92624596+kcz358@users.noreply.github.com>
            Date:   Sun Jan 28 09:45:31 2024 +0800

                Merge branch 'main' into kc/list_tasks_num

            commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
            Author: kcz358 <92624596+kcz358@users.noreply.github.com>
            Date:   Sun Jan 28 09:44:23 2024 +0800

                Enable list all tasks num

            commit c0ea54d49cb65b747d7e8fccac75838acabe05db
            Author: kcz358 <92624596+kcz358@users.noreply.github.com>
            Date:   Sun Jan 28 09:41:32 2024 +0800

                Exclude train yaml file in the task list

            commit ad8d9da
            Author: Zhang Peiyuan <a1286225768@gmail.com>
            Date:   Sun Jan 28 02:04:57 2024 +0800

                Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

                * add mmme

                * black

                * add model specific prompt and gen kwargs

                * black

                * add yaml config to support multi-model eval

                * print table at the end

                * refactor multi model code

                * add chartqa

                * black

                * add ai2d

                * black

                * update chartqa

                * black

                * update ai2d dataset

                * black

                * add qwenvl

                * add infovqa and docvqa

            * List task #num sorted

            * Update prompt messages for image-related tasks

            * Delete unused task configuration files

            * Remove coco_train.yaml configuration file

            * Update task name in mmmu.yaml

            * Fix error message for missing tasks

            * Add wandb import and integration

            ---------

            Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
            Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

        * Remove scienceqa_img task configuration

        * eval scienceqa with no images

        ---------

        Co-authored-by: Bo Li <drluodian@gmail.com>
        Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

    * Update hb_doc_to_text function to remove unnecessary line break

    * Add Fuyu model and update OtterHD model

    * Refactor model response handling and fix image processing bug

    * Refactor flatten method to support only getting the first element

    * Add support for specifying timezone in datetime string

    Update flatten method in OtterHD class

    Update get_datetime_str function in utils.py

    * Fix condition for checking wandb_args_dict in __main__.py

    * Commented out assertions for batch size in Fuyu model

    * Add warning message for existing output file

commit 7b7f636
Author: Pu Fanyi <FPU001@e.ntu.edu.sg>
Date:   Tue Jan 30 14:52:51 2024 +0800

    scienceqa for full set (EvolvingLMMs-Lab#32)

    * Remove unused code and configuration file

    * Remove docvqa.yaml and update vizwizvqa.yaml

    * lint

    * Add dataset_kwargs to vizwizvqa.yaml

    * Add dataset_kwargs to vizwizvqa.yaml

    * textvqa (EvolvingLMMs-Lab#27)

    * Update textvqa.yaml and utils.py

    * Fix YAML formatting in textvqa.yaml and remove unused files

    * remove useless metric

    * add textvqa val & test

    * Update progress bar description in evaluator.py

    * Update submission file names in VizWizVQA tasks

    * Update output path to include log samples suffix

    * Update submission file paths in OKVQA and VizWizVQA tasks

    * Refactor llava-in-the-wild.yaml and utils.py

    * Update metric for llava evaluation

    * Refactor logging message in Task class

    * Merge commit 'ad8d9da1fb40c446202bf9b0095b02262df2ffc8'

    * Fix formatting issues and add progress bar closing statements

    * Update task from "infovqa_val" to "infovqa_test" in infovqa_test.yaml

    * Update tqdm progress bar in OtterHD model

    * Squashed commit of the following:

    commit c09b621195878300417315a97efdec25e67dd7f5
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:46:19 2024 +0800

        Black lint

    commit 864a1aba26388276b7e57717b89520fcc77b3f62
    Merge: ab898e4 ad8d9da
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:45:31 2024 +0800

        Merge branch 'main' into kc/list_tasks_num

    commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:44:23 2024 +0800

        Enable list all tasks num

    commit c0ea54d49cb65b747d7e8fccac75838acabe05db
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:41:32 2024 +0800

        Exclude train yaml file in the task list

    commit ad8d9da
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Sun Jan 28 02:04:57 2024 +0800

        Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

        * add mmme

        * black

        * add model specific prompt and gen kwargs

        * black

        * add yaml config to support multi-model eval

        * print table at the end

        * refactor multi model code

        * add chartqa

        * black

        * add ai2d

        * black

        * update chartqa

        * black

        * update ai2d dataset

        * black

        * add qwenvl

        * add infovqa and docvqa

    * Fix error handling in loading YAML config files

    * Squashed commit of the following:

    commit dbba2fe6447b0dfd4bb89a368f62178f2b253006
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 12:41:40 2024 +0800

        Fix key bugs

    commit c09b621195878300417315a97efdec25e67dd7f5
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:46:19 2024 +0800

        Black lint

    commit 864a1aba26388276b7e57717b89520fcc77b3f62
    Merge: ab898e4 ad8d9da
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:45:31 2024 +0800

        Merge branch 'main' into kc/list_tasks_num

    commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:44:23 2024 +0800

        Enable list all tasks num

    commit c0ea54d49cb65b747d7e8fccac75838acabe05db
    Author: kcz358 <92624596+kcz358@users.noreply.github.com>
    Date:   Sun Jan 28 09:41:32 2024 +0800

        Exclude train yaml file in the task list

    commit ad8d9da
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Sun Jan 28 02:04:57 2024 +0800

        Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

        * add mmme

        * black

        * add model specific prompt and gen kwargs

        * black

        * add yaml config to support multi-model eval

        * print table at the end

        * refactor multi model code

        * add chartqa

        * black

        * add ai2d

        * black

        * update chartqa

        * black

        * update ai2d dataset

        * black

        * add qwenvl

        * add infovqa and docvqa

    * List task #num sorted

    * Update prompt messages for image-related tasks

    * Delete unused task configuration files

    * Remove coco_train.yaml configuration file

    * Update task name in mmmu.yaml

    * Fix error message for missing tasks

    * Add wandb import and integration

    * Update generation kwargs for LMMS tasks

    * Update lmms_eval MME task configuration and utils

    * Update generation_kwargs in lmms_eval tasks

    * Update doc_to_text function in coco and okvqa tasks

    * Add COCO 2017 version

    * Update task name in coco_test2017.yaml

    * Squashed commit of the following:

    commit 6ee856b
    Author: Zhang Peiyuan <a1286225768@gmail.com>
    Date:   Mon Jan 29 22:41:33 2024 +0800

        Add/mmmu test (EvolvingLMMs-Lab#30)

        * mmmu_test

        * black

    commit 4a1183c
    Author: Li Bo <drluodian@gmail.com>
    Date:   Sun Jan 28 22:19:13 2024 +0800

        [Dataset Check] dataset check and add wandb logging (EvolvingLMMs-Lab#29)

        * Remove unused code and configuration file

        * Remove docvqa.yaml and update vizwizvqa.yaml

        * lint

        * Add dataset_kwargs to vizwizvqa.yaml

        * Add dataset_kwargs to vizwizvqa.yaml

        * textvqa (EvolvingLMMs-Lab#27)

        * Update textvqa.yaml and utils.py

        * Fix YAML formatting in textvqa.yaml and remove unused files

        * remove useless metric

        * add textvqa val & test

        * Update progress bar description in evaluator.py

        * Update submission file names in VizWizVQA tasks

        * Update output path to include log samples suffix

        * Update submission file paths in OKVQA and VizWizVQA tasks

        * Refactor llava-in-the-wild.yaml and utils.py

        * Update metric for llava evaluation

        * Refactor logging message in Task class

        * Merge commit 'ad8d9da1fb40c446202bf9b0095b02262df2ffc8'

        * Fix formatting issues and add progress bar closing statements

        * Update task from "infovqa_val" to "infovqa_test" in infovqa_test.yaml

        * Update tqdm progress bar in OtterHD model

        * Squashed commit of the following:

        commit c09b621195878300417315a97efdec25e67dd7f5
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:46:19 2024 +0800

            Black lint

        commit 864a1aba26388276b7e57717b89520fcc77b3f62
        Merge: ab898e4 ad8d9da
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:45:31 2024 +0800

            Merge branch 'main' into kc/list_tasks_num

        commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:44:23 2024 +0800

            Enable list all tasks num

        commit c0ea54d49cb65b747d7e8fccac75838acabe05db
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:41:32 2024 +0800

            Exclude train yaml file in the task list

        commit ad8d9da
        Author: Zhang Peiyuan <a1286225768@gmail.com>
        Date:   Sun Jan 28 02:04:57 2024 +0800

            Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

            * add mmme

            * black

            * add model specific prompt and gen kwargs

            * black

            * add yaml config to support multi-model eval

            * print table at the end

            * refactor multi model code

            * add chartqa

            * black

            * add ai2d

            * black

            * update chartqa

            * black

            * update ai2d dataset

            * black

            * add qwenvl

            * add infovqa and docvqa

        * Fix error handling in loading YAML config files

        * Squashed commit of the following:

        commit dbba2fe6447b0dfd4bb89a368f62178f2b253006
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 12:41:40 2024 +0800

            Fix key bugs

        commit c09b621195878300417315a97efdec25e67dd7f5
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:46:19 2024 +0800

            Black lint

        commit 864a1aba26388276b7e57717b89520fcc77b3f62
        Merge: ab898e4 ad8d9da
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:45:31 2024 +0800

            Merge branch 'main' into kc/list_tasks_num

        commit ab898e4fd30bf83888125d48b80bc86b01cb5d39
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:44:23 2024 +0800

            Enable list all tasks num

        commit c0ea54d49cb65b747d7e8fccac75838acabe05db
        Author: kcz358 <92624596+kcz358@users.noreply.github.com>
        Date:   Sun Jan 28 09:41:32 2024 +0800

            Exclude train yaml file in the task list

        commit ad8d9da
        Author: Zhang Peiyuan <a1286225768@gmail.com>
        Date:   Sun Jan 28 02:04:57 2024 +0800

            Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

            * add mmme

            * black

            * add model specific prompt and gen kwargs

            * black

            * add yaml config to support multi-model eval

            * print table at the end

            * refactor multi model code

            * add chartqa

            * black

            * add ai2d

            * black

            * update chartqa

            * black

            * update ai2d dataset

            * black

            * add qwenvl

            * add infovqa and docvqa

        * List task #num sorted

        * Update prompt messages for image-related tasks

        * Delete unused task configuration files

        * Remove coco_train.yaml configuration file

        * Update task name in mmmu.yaml

        * Fix error message for missing tasks

        * Add wandb import and integration

        ---------

        Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
        Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

    * Remove scienceqa_img task configuration

    * eval scienceqa with no images

    ---------

    Co-authored-by: Bo Li <drluodian@gmail.com>
    Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Update API configuration and file paths

* Refactor evaluate_by_chatgpt function in utils.py

* Add hallusion_output_vd_model.json to .gitignore

* Add timeout to API request

* Refactor file path generation and remove unnecessary suffix in log samples output names

* Refactor code and add output path handling

* Update lmms-eval API and add new models and datasets
Luodian authored Feb 1, 2024
1 parent 7664839 commit 1f8780d
Showing 9 changed files with 260 additions and 135 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -21,4 +21,5 @@ scripts/
wandb/
SimSun.ttf
submissions/

lmms_eval/tasks/hallusion_bench/hallusion_output_vs_model.json
lmms_eval/tasks/hallusion_bench/hallusion_output_vd_model.json
124 changes: 108 additions & 16 deletions README.md
@@ -1,12 +1,5 @@
# lmms-eval

The API, together with many code blocks of this project, comes from [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness). **Please read through the [docs of lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs) before contributing to this project**. Please do not commit to this project directly. Instead, push your changes to another branch and create a pull request.

Below are the changes we made to the original API:

- Instance.args (lmms_eval/api/instance.py) now contains a list of images to be passed to lmms.
- lm-eval-harness supports all HF LMs with a single model class. This is currently not possible for lmms because the input/output formats of lmms in HF are not yet unified. Therefore, we have to create a new class for each lmms model. This is not ideal, and we will try to unify them in the future.

## How to run

```bash
@@ -19,21 +12,120 @@ accelerate launch --num_processes=8 -m lmms_eval --config example_eval.yaml # Ea
```
## Current models

- llava (only generate_until function. Please help add the other two required functions. You can refer to lm-eval-harness for the required functions and how to implement them.)
- GPT4V (API)
- generation-based evaluation

- Gemini (API)
- generation-based evaluation

- LLaVA-v1.5/v1.6-7B/13B/34B
- generation-based evaluation
- perplexity-based evaluation

## Models to be added

- InstructBLIP
- OpenFlamingo/Otter
- Fuyu/OtterHD
- Emu
- CogVLM

## Current datasets
- GQA
- MMMU
- SQA-IMG
- MME
- MMVet
- LLaVA-Bench
- LLaVA-Bench-CN
- AI2D (ai2d)
- ChartQA (chartqa)
- COCO Caption (coco_cap)
- COCO 2014 Caption Validation (coco2014_cap_val)
- COCO 2014 Caption Test (coco2014_cap_test)
- COCO 2017 Caption MiniVal (coco2017_cap_val)
- COCO 2017 Caption MiniTest (coco2017_cap_test)
- DOCVQA (docvqa)
- DOCVQA Validation (docvqa_val)
- DOCVQA Test (docvqa_test)
- Flickr30K (flickr30k)
- GQA (gqa)
- HallusionBenchmark (hallusion_bench_image)
- Infographic VQA (info_vqa)
- Infographic VQA Validation (info_vqa_val)
- Infographic VQA Test (info_vqa_test)
- LLaVA-Bench (llava_bench_wild)
- LLaVA-Bench-CN (?)
- LLaVA-Bench-COCO (llava_bench_coco)
- MathVista (mathvista)
- MathVista Validation (mathvista_testmini)
- MathVista Test (mathvista_test)
- MMBench (mmbench)
- MMBench English Dev (mmbench_en_dev)
- MMBench English Test (mmbench_en_test)
- MMBench Chinese Dev (mmbench_cn_dev)
- MMBench Chinese Test (mmbench_cn_test)
- MME (mme)
- MME-Cognition
- MME-Commonsense
- MMMU (mmmu)
- MMMU Validation (mmmu_val)
- MMMU Test (mmmu_test)
- MMVet (mmvet)
- NoCaps (nocaps)
- NoCaps Validation (nocaps_val)
- NoCaps Test (nocaps_test)
- OKVQA (okvqa)
- POPE (pope)
- RefCOCO (refcoco)
- refcoco_seg_test
- refcoco_seg_val
- refcoco_seg_testA
- refcoco_seg_testB
- refcoco_bbox_test
- refcoco_bbox_val
- refcoco_bbox_testA
- refcoco_bbox_testB
- RefCOCO+ (refcoco+)
- refcoco+_seg_val
- refcoco+_seg_testA
- refcoco+_seg_testB
- refcoco+_bbox_val
- refcoco+_bbox_testA
- refcoco+_bbox_testB
- RefCOCOg (refcocog)
- refcocog_seg_test
- refcocog_seg_val
- refcocog_bbox_test
- refcocog_bbox_val
- ScienceQA (scienceqa)
- ScienceQA Full (scienceqa_full)
- ScienceQA IMG (scienceqa_img)
- SeedBench (seedbench)
- TextCaps (textcaps)
- TextCaps Validation (textcaps_val)
- TextCaps Test (textcaps_test)
- TextVQA (textvqa)
- TextVQA Validation (textvqa_val)
- TextVQA Test (textvqa_test)
- VizWizVQA (vizwizvqa)
- VizWizVQA Validation (vizwizvqa_val)
- VizWizVQA Test (vizwizvqa_test)
- VQAv2 (vqav2)
- VQAv2 Validation (vqav2_val)
- VQAv2 Test (vqav2_test)

## Datasets to be added and tested
- CMMMU (cmmmu)
- Mementos (mementos)
- Ferret Bench (ferret)
- ST-VQA (stvqa)
- Multi-DocVQA (multidocvqa)
- Winoground (winoground)
- NLVR2 (nlvr2)
- RavenIQ-Test (raveniq)
- IconQA (iconqa)
- VistBench (vistbench)


## Datasets to be added
## Acknowledgement

The API, together with many code blocks of this project, comes from [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness). **Please read through the [docs of lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs) before contributing to this project**. Please do not commit to this project directly. Instead, push your changes to another branch and create a pull request.

Below are the changes we made to the original API:

- Instance.args (lmms_eval/api/instance.py) now contains a list of images to be passed to lmms.
- lm-eval-harness supports all HF LMs with a single model class. This is currently not possible for lmms because the input/output formats of lmms in HF are not yet unified. Therefore, we have to create a new class for each lmms model. This is not ideal, and we will try to unify them in the future.
34 changes: 16 additions & 18 deletions lmms_eval/__main__.py
@@ -215,6 +215,15 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None:

# set datetime before evaluation
datetime_str = utils.get_datetime_str(timezone=args.timezone)
if args.output_path:
hash_input = f"{args.model_args}".encode("utf-8")
hash_output = hashlib.sha256(hash_input).hexdigest()[:6]
path = Path(args.output_path)
path = path.expanduser().resolve().joinpath(f"{args.model}").joinpath(f"model_args_{hash_output}").joinpath(f"{datetime_str}_{args.log_samples_suffix}")
args.output_path = path

elif args.log_samples and not args.output_path:
assert args.output_path, "Specify --output_path"

results = evaluator.simple_evaluate(
model=args.model,
@@ -228,23 +237,9 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None:
show_task_to_terminal=args.show_task_to_terminal,
log_samples=args.log_samples,
gen_kwargs=args.gen_kwargs,
cli_args=args,
)

if args.output_path:
hash_input = f"{args.model_args}".encode("utf-8")
hash_output = hashlib.sha256(hash_input).hexdigest()[:6]
path = Path(args.output_path)
path = path.expanduser().resolve().joinpath(f"{args.model}").joinpath(f"model_args_{hash_output}").joinpath(f"{datetime_str}")
path.mkdir(parents=True, exist_ok=True)
assert path.is_dir(), f"Output path {path} is not a directory"

output_path_file = path.joinpath("results.json")
if output_path_file.exists():
eval_logger.warning(f"Output file {output_path_file} already exists and will be overwritten.")

elif args.log_samples and not args.output_path:
assert args.output_path, "Specify --output_path"

if results is not None:
if args.log_samples:
samples = results.pop("samples")
@@ -253,12 +248,15 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None:
print(dumped)

if args.output_path:
output_path_file.open("w").write(dumped)
args.output_path.mkdir(parents=True, exist_ok=True)
result_file_path = path.joinpath("results.json")
if result_file_path.exists():
eval_logger.warning(f"Output file {result_file_path} already exists and will be overwritten.")

result_file_path.open("w").write(dumped)
if args.log_samples:
for task_name, config in results["configs"].items():
output_name = f"{task_name}_{args.log_samples_suffix}"
filename = path.joinpath(f"{output_name}.json")
filename = args.output_path.joinpath(f"{task_name}.json")
# Structure the data with 'args' and 'logs' keys
data_to_dump = {"args": vars(args), "config": config, "logs": sorted(samples[task_name], key=lambda x: x["doc_id"])} # Convert Namespace to dict
samples_dumped = json.dumps(data_to_dump, indent=4, default=_handle_non_serializable)
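After this change, the output directory is resolved once, before `simple_evaluate` runs, and reused when `results.json` and the per-task sample logs are written. The sketch below mirrors that layout, `<output_path>/<model>/model_args_<first 6 hex chars of SHA-256(model_args)>/<datetime_str>_<log_samples_suffix>/`, under stated assumptions: `build_output_dir` and `write_outputs` are hypothetical helpers (not functions in `lmms_eval`), `default=str` stands in for the project's `_handle_non_serializable`, whose body is not shown in this diff, and the model name and arguments in the demo are placeholder values.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def build_output_dir(output_path: str, model: str, model_args: str, datetime_str: str, suffix: str) -> Path:
    """Hypothetical helper: <output_path>/<model>/model_args_<hash6>/<datetime_str>_<suffix>."""
    # Short fingerprint of the model arguments: first 6 hex chars of their SHA-256 digest.
    hash_output = hashlib.sha256(model_args.encode("utf-8")).hexdigest()[:6]
    base = Path(output_path).expanduser().resolve()
    return base / model / f"model_args_{hash_output}" / f"{datetime_str}_{suffix}"


def write_outputs(
    out_dir: Path,
    results: Dict[str, Any],
    samples: Dict[str, List[Dict[str, Any]]],
    configs: Dict[str, Any],
    args: Dict[str, Any],
) -> None:
    """Hypothetical helper: write results.json plus one <task_name>.json per task."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # default=str is a stand-in for the project's _handle_non_serializable.
    (out_dir / "results.json").write_text(json.dumps(results, indent=4, default=str))
    for task_name, config in configs.items():
        data = {
            "args": args,
            "config": config,
            # Sort by doc_id so each task's log order is deterministic.
            "logs": sorted(samples[task_name], key=lambda x: x["doc_id"]),
        }
        (out_dir / f"{task_name}.json").write_text(json.dumps(data, indent=4, default=str))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder model name and model_args, for illustration only.
        out = build_output_dir(tmp, "llava", "pretrained=llava-v1.5-7b", "0201_1200", "demo")
        write_outputs(out, {"results": {}}, {"mme": [{"doc_id": 1}, {"doc_id": 0}]}, {"mme": {"task": "mme"}}, {"model": "llava"})
        print(sorted(p.name for p in out.iterdir()))  # ['mme.json', 'results.json']
```

One design difference from the diff: `Path.write_text` opens, writes, and closes the file in one call, whereas `result_file_path.open("w").write(dumped)` leaves closing the handle to garbage collection.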
11 changes: 10 additions & 1 deletion lmms_eval/evaluator.py
@@ -3,6 +3,7 @@
import json
import collections
import sys
import inspect
from tqdm import tqdm

import torch
@@ -40,6 +41,7 @@ def simple_evaluate(
show_task_to_terminal: bool = False,
log_samples: bool = True,
gen_kwargs: str = None,
cli_args=None, # Bo: put args into more functions (cost 48 Bytes per call)
):
"""Instantiate and evaluate a model on a list of tasks.
@@ -126,6 +128,7 @@ def simple_evaluate(
bootstrap_iters=bootstrap_iters,
show_task_to_terminal=show_task_to_terminal,
log_samples=log_samples,
cli_args=cli_args,
)

if lm.rank == 0:
@@ -156,6 +159,7 @@ def evaluate(
bootstrap_iters: int = 100000,
show_task_to_terminal: bool = False,
log_samples: bool = True,
cli_args=None,
):
"""Instantiate and evaluate a model on a list of tasks.
@@ -423,7 +427,12 @@ def evaluate(
else:
group_name = None
agg_fn = task.aggregation()[metric]
results[task_name][metric_key] = agg_fn(items)
# Bo: for models only need agg items
if inspect.getfullargspec(agg_fn).args == ["results"]:
results[task_name][metric_key] = agg_fn(items)
# Bo: for models that need to know the args to save to correct path
elif inspect.getfullargspec(agg_fn).args == ["results", "args"]:
results[task_name][metric_key] = agg_fn(items, cli_args)
results[task_name]["samples"] = len(items)

# hotfix: bleu, chrf, ter seem to be really expensive to bootstrap
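The new branch in `evaluate` decides how to call an aggregation function by inspecting its parameter names: `agg_fn(items)` when the signature is exactly `(results)`, and `agg_fn(items, cli_args)` when it is `(results, args)`, so that a task such as HallusionBench can save files under the run's output path. As written, an aggregator whose parameters are named differently (for example `(arr)`) matches neither branch and its metric is never assigned. The sketch below is a hypothetical helper, not part of `lmms_eval`, that performs the same dispatch but falls back to the one-argument call; `mean` and `save_and_score` are illustrative aggregators.

```python
import inspect
from argparse import Namespace
from typing import Any, Callable, Optional, Sequence


def call_aggregation(agg_fn: Callable[..., Any], items: Sequence[Any], cli_args: Optional[Namespace] = None) -> Any:
    """Hypothetical helper: pass cli_args only to aggregators declared as (results, args)."""
    params = inspect.getfullargspec(agg_fn).args
    if params == ["results", "args"]:
        # Aggregators that save files need the CLI namespace (e.g. its output_path).
        return agg_fn(items, cli_args)
    # Every other signature, including (results) or (arr), gets the plain one-argument call.
    return agg_fn(items)


def mean(arr):
    # A conventional one-argument aggregator whose parameter is not named "results".
    return sum(arr) / len(arr)


def save_and_score(results, args):
    # Illustrative only: a real task would write files under args.output_path here.
    return {"n": len(results), "output_path": args.output_path}


print(call_aggregation(mean, [1, 2, 3]))  # 2.0
print(call_aggregation(save_and_score, ["a", "b"], Namespace(output_path="out")))  # {'n': 2, 'output_path': 'out'}
```

Matching on exact parameter names is brittle: `getfullargspec` on a bound method also lists `self`, so a method defined as `(self, results, args)` would not match `["results", "args"]`. Inspecting `inspect.signature(agg_fn).parameters`, which omits `self` for bound methods, is one alternative.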
3 changes: 1 addition & 2 deletions lmms_eval/models/otterhd.py
@@ -37,7 +37,6 @@ def __init__(
self.processor = FuyuProcessor(image_processor=self.image_processor, tokenizer=self.tokenizer)
self.max_new_tokens = max_new_tokens
self.batch_size_per_gpu = int(batch_size)
assert self.batch_size_per_gpu == 1, "OtterHD currently does not support batched generation."

@property
def max_length(self):
@@ -91,7 +90,7 @@ def _collate(x):
# visuals = [visuals[idx][0] for idx in range(len(visuals))] # get the first image in multi-image scenarios.

formatted_contexts = [f"User: {context} Assistant:" for context in contexts]
model_inputs = self.processor(text=[formatted_contexts], images=visuals, device=self.device)
model_inputs = self.processor(text=formatted_contexts, images=visuals, device=self.device)
for k, v in model_inputs.items():
model_inputs[k] = v.to(self.device, non_blocking=True) if isinstance(v, torch.Tensor) else [vv.to(self.device, non_blocking=True) for vv in v]
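With the `batch_size_per_gpu == 1` assertion removed, the processor call now receives `text=formatted_contexts`, one prompt string per sample, instead of `text=[formatted_contexts]`, a single element wrapping the whole batch. The sketch below does not call `FuyuProcessor` and makes no claim about how it handles nested lists; `format_prompts` and `check_batch_alignment` are hypothetical helpers that only illustrate the shape difference, and the image strings are placeholders. The prompt template is the one used in the diff.

```python
from typing import List, Sequence


def format_prompts(contexts: Sequence[str]) -> List[str]:
    # Same template as the OtterHD code: one prompt string per context in the batch.
    return [f"User: {context} Assistant:" for context in contexts]


def check_batch_alignment(texts: Sequence[object], images: Sequence[object]) -> None:
    """Hypothetical sanity check: expect exactly one prompt string per image."""
    if len(texts) != len(images):
        raise ValueError(f"{len(texts)} text entries for {len(images)} images")
    for t in texts:
        if not isinstance(t, str):
            raise TypeError(f"expected a str prompt, got {type(t).__name__}")


prompts = format_prompts(["What is shown?", "Count the cats."])
check_batch_alignment(prompts, ["img0", "img1"])  # passes: 2 prompts, 2 images
try:
    check_batch_alignment([prompts], ["img0", "img1"])  # the old [formatted_contexts] shape
except ValueError as err:
    print(err)  # 1 text entries for 2 images
```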
