[Fix] InfoVQA, WandB logging, CLI problems. (EvolvingLMMs-Lab#31)
* Remove unused code and configuration file

* Remove docvqa.yaml and update vizwizvqa.yaml

* lint

* Add dataset_kwargs to vizwizvqa.yaml

* Add dataset_kwargs to vizwizvqa.yaml

* textvqa (EvolvingLMMs-Lab#27)

* Update textvqa.yaml and utils.py

* Fix YAML formatting in textvqa.yaml and remove unused files

* remove useless metric

* add textvqa val & test

* Update progress bar description in evaluator.py

* Update submission file names in VizWizVQA tasks

* Update output path to include log samples suffix

* Update submission file paths in OKVQA and VizWizVQA tasks

* Refactor llava-in-the-wild.yaml and utils.py

* Update metric for llava evaluation

* Refactor logging message in Task class

* Merge commit '5553d106e5ffd84b280b3d5a3c8d47c35e2d310b'

* Fix formatting issues and add progress bar closing statements

* Update task from "infovqa_val" to "infovqa_test" in infovqa_test.yaml

* Update tqdm progress bar in OtterHD model

* Squashed commit of the following:

commit eae210c3700a59b7d5cc9de46fcb855f443096aa
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Jan 28 09:46:19 2024 +0800

    Black lint

commit 18e4a19e82357352ab25df77b5ae4f1b011d61ae
Merge: ab898e4 fb209e4
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Jan 28 09:45:31 2024 +0800

    Merge branch 'main' into kc/list_tasks_num

commit e899be48f55f95172fdf96bd2a98d3b91ff2aaed
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Jan 28 09:44:23 2024 +0800

    Enable list all tasks num

commit a999fc6889c6986c28ec5d95460a4ab5233e5d4f
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Jan 28 09:41:32 2024 +0800

    Exclude train yaml file in the task list

commit 5553d10
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Sun Jan 28 02:04:57 2024 +0800

    Add InfoVQA, DocVQA, and QwenVL (EvolvingLMMs-Lab#28)

    * add mme

    * black

    * add model specific prompt and gen kwargs

    * black

    * add yaml config to support multi-model eval

    * print table at the end

    * refactor multi model code

    * add chartqa

    * black

    * add ai2d

    * black

    * update chartqa

    * black

    * update ai2d dataset

    * black

    * add qwenvl

    * add infovqa and docvqa

* Fix error handling in loading YAML config files

* Squashed commit of the following:

commit fdb0c6785b0c5d6979d10e7ddf75ce9055038db8
Author: kcz358 <92624596+kcz358@users.noreply.github.com>
Date:   Sun Jan 28 12:41:40 2024 +0800

    Fix key bugs


* List task #num sorted

* Update prompt messages for image-related tasks

* Delete unused task configuration files

* Remove coco_train.yaml configuration file

* Update task name in mmmu.yaml

* Fix error message for missing tasks

* Add wandb import and integration

* Update generation kwargs for LMMS tasks

* Update lmms_eval MME task configuration and utils

* Update generation_kwargs in lmms_eval tasks

* Update doc_to_text function in coco and okvqa tasks

* Add COCO 2017 version

* Update task name in coco_test2017.yaml

* Squashed commit of the following:

commit 0fd4558
Author: Zhang Peiyuan <a1286225768@gmail.com>
Date:   Mon Jan 29 22:41:33 2024 +0800

    Add/mmmu test (EvolvingLMMs-Lab#30)

    * mmmu_test

    * black

commit f125889
Author: Li Bo <drluodian@gmail.com>
Date:   Sun Jan 28 22:19:13 2024 +0800

    [Dataset Check] dataset check and add wandb logging (EvolvingLMMs-Lab#29)


    Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
    Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>

* Refactor CLI evaluate function and improve error logging

---------

Co-authored-by: Fanyi Pu <FPU001@e.ntu.edu.sg>
Co-authored-by: kcz358 <92624596+kcz358@users.noreply.github.com>
3 people authored Jan 30, 2024
1 parent 0fd4558 commit 12675c7
Showing 27 changed files with 146 additions and 64 deletions.
17 changes: 16 additions & 1 deletion lmms_eval/__main__.py
@@ -123,7 +123,7 @@ def parse_eval_args() -> argparse.Namespace:
return args


def cli_evaluate(args: Union[argparse.Namespace, None], wandb_run) -> None:
def cli_evaluate(args: Union[argparse.Namespace, None] = None, wandb_run=None) -> None:
if args is None:
args = parse_eval_args()

@@ -292,10 +292,22 @@ def print_results(args, results):

# initialize Accelerator
accelerator = Accelerator()
all_args_dict = vars(args)

if accelerator.is_main_process:
# initialize a W&B run only on rank 0
wandb_args_dict = utils.simple_parse_args_string(args.wandb_args)
if "name" not in wandb_args_dict:
if "config" not in all_args_dict:
# use the model name and task names as run name
task_names = args.tasks.replace(",", "_")
wandb_args_dict["name"] = f"{args.model}_{task_names}_{args.log_samples_suffix}"
if args.num_fewshot:
wandb_args_dict["name"] += f"_{args.num_fewshot}shot"
else:
# use the name of the config file as run name
wandb_args_dict["name"] = all_args_dict["config"].split("/")[-1].split(".")[0]

wandb_run = wandb.init(**wandb_args_dict)
is_main_process = True
else:
@@ -307,3 +319,6 @@ def print_results(args, results):
for args in args_list:
results = cli_evaluate(args, wandb_run)
results_list.append(results)

if is_main_process:
wandb_run.finish()
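The run-name derivation added above can be sketched standalone. This is a minimal illustration of the logic, not lmms_eval's API: `derive_run_name` is a hypothetical helper, and a plain dict stands in for the argparse namespace.

```python
def derive_run_name(args: dict) -> str:
    """Build a W&B run name the way the diff above does (sketch).

    `args` is a plain dict standing in for vars(args); the helper
    name is hypothetical and not part of lmms_eval.
    """
    if args.get("config"):
        # a config file was given: use its basename, minus the extension
        return args["config"].split("/")[-1].split(".")[0]
    # otherwise combine model, task names, and the log-samples suffix
    task_names = args["tasks"].replace(",", "_")
    name = f"{args['model']}_{task_names}_{args['log_samples_suffix']}"
    if args.get("num_fewshot"):
        name += f"_{args['num_fewshot']}shot"
    return name
```

The resulting string is what would be passed as `name` to `wandb.init` when the user did not set one explicitly in `--wandb_args`.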
10 changes: 6 additions & 4 deletions lmms_eval/models/llava.py
@@ -258,7 +258,7 @@ def _collate(x):
if "image_aspect_ratio" in gen_kwargs.keys() and "image_aspect_ratio" not in self._config.__dict__:
# here we should pop it out of gen_kwargs so that it doesn't get passed to the model for next step of generation
self._config.image_aspect_ratio = gen_kwargs.pop("image_aspect_ratio")

eval_logger.info(f"Setting image aspect ratio: {self._config.image_aspect_ratio}")
# encode, pad, and truncate contexts for this batch
if visuals:
image_tensor = process_images(visuals, self._image_processor, self._config)
@@ -289,7 +289,7 @@ def _collate(x):
input_ids = tokenizer_image_token(prompt, self.tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(self.device)

# preconfigure gen_kwargs with defaults
gen_kwargs["image_sizes"] = [visuals[0].size]
gen_kwargs["image_sizes"] = [visuals[idx].size for idx in range(len(visuals))]
if "max_new_tokens" not in gen_kwargs:
gen_kwargs["max_new_tokens"] = 1024
if "temperature" not in gen_kwargs:
@@ -318,9 +318,11 @@ def _collate(x):
use_cache=self.use_cache,
)
except Exception as e:
print("Error in generating")
eval_logger.error(f"Error {e} in generating")
cont = ""
raise e
eval_logger.error(prompt)
eval_logger.error(visuals)
eval_logger.error(prompts_input)

cont_toks_list = cont.tolist()
for cont_toks, context in zip(cont_toks_list, contexts):
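The `image_sizes` change above replaces a single-image assumption with one size per visual. A minimal sketch of the before/after behavior, using stand-in objects rather than real PIL images:

```python
from collections import namedtuple

# stand-in for PIL.Image, which exposes a (width, height) `.size` tuple
FakeImage = namedtuple("FakeImage", ["size"])

def collect_image_sizes(visuals):
    # before the fix: [visuals[0].size] -- only the first image's size,
    # which is wrong whenever a request carries multiple images.
    # after the fix: one size per visual, in order.
    return [visuals[idx].size for idx in range(len(visuals))]

visuals = [FakeImage((640, 480)), FakeImage((336, 336))]
sizes = collect_image_sizes(visuals)  # [(640, 480), (336, 336)]
```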
4 changes: 4 additions & 0 deletions lmms_eval/tasks/coco/coco2017.yaml
@@ -0,0 +1,4 @@
group : coco2017
task:
- coco_val2017
- coco_test2017
6 changes: 2 additions & 4 deletions lmms_eval/tasks/coco/coco_test.yaml
@@ -6,12 +6,10 @@ group : "coco_caption"
test_split: test
output_type: generate_until
doc_to_visual: !function utils.coco_doc_to_visual
doc_to_text: !function utils.coco_doc_to_text
doc_to_text: "Provide a one-sentence caption for the provided image."
doc_to_target: "answer"
generation_kwargs:
until:
- "ASSISTANT:"
max_new_tokens: 1024
max_new_tokens: 128
temperature: 0
top_p: 0
num_beams: 1
24 changes: 24 additions & 0 deletions lmms_eval/tasks/coco/coco_test2017.yaml
@@ -0,0 +1,24 @@
dataset_path: lmms-lab/COCO-Caption2017
dataset_kwargs:
token: True
task : "coco_test2017"
group : "coco_caption2017"
test_split: test
output_type: generate_until
doc_to_visual: !function utils.coco_doc_to_visual
doc_to_text: !function utils.coco_doc_to_text
doc_to_target: "answer"
generation_kwargs:
max_new_tokens: 128
temperature: 0
top_p: 0
num_beams: 1
do_sample: false
process_results: !function utils.coco_test_process_result
# Note that the metric name can be either a registered metric function (such as the case for GQA) or a key name returned by process_results
metric_list:
- metric: coco_passthrough
aggregation : !function utils.coco_test_aggregation_result
higher_is_better : true
metadata:
- version: 0.0
6 changes: 2 additions & 4 deletions lmms_eval/tasks/coco/coco_val.yaml
@@ -6,12 +6,10 @@ group : "coco_caption"
test_split: val
output_type: generate_until
doc_to_visual: !function utils.coco_doc_to_visual
doc_to_text: !function utils.coco_doc_to_text
doc_to_text: "Provide a one-sentence caption for the provided image."
doc_to_target: "answer"
generation_kwargs:
until:
- "ASSISTANT:"
max_new_tokens: 1024
max_new_tokens: 64
temperature: 0
top_p: 0
num_beams: 1
45 changes: 45 additions & 0 deletions lmms_eval/tasks/coco/coco_val2017.yaml
@@ -0,0 +1,45 @@
dataset_path: lmms-lab/COCO-Caption2017
dataset_kwargs:
token: True
task: "coco_val2017"
group : "coco_caption2017"
test_split: val
output_type: generate_until
doc_to_visual: !function utils.coco_doc_to_visual
doc_to_text: !function utils.coco_doc_to_text
doc_to_target: "answer"
generation_kwargs:
max_new_tokens: 64
temperature: 0
top_p: 0
num_beams: 1
do_sample: false
process_results: !function utils.coco_process_result
# Note that the metric name can be either a registered metric function (such as the case for GQA) or a key name returned by process_results
metric_list:
- metric: coco_Bleu_4
aggregation : !function utils.coco_bleu4
higher_is_better : true
- metric: coco_Bleu_3
aggregation : !function utils.coco_bleu3
higher_is_better : true
- metric: coco_Bleu_2
aggregation : !function utils.coco_bleu2
higher_is_better : true
- metric: coco_Bleu_1
aggregation : !function utils.coco_bleu1
higher_is_better : true
- metric: coco_METEOR
aggregation : !function utils.coco_meteor
higher_is_better : true
- metric: coco_ROUGE_L
aggregation : !function utils.coco_rougel
higher_is_better : true
- metric: coco_CIDEr
aggregation : !function utils.coco_cider
higher_is_better : true
#- metric: coco_SPICE
# aggregation : !function utils.coco_spice
# higher_is_better : true
metadata:
- version: 0.0
3 changes: 1 addition & 2 deletions lmms_eval/tasks/coco/utils.py
@@ -18,8 +18,7 @@ def coco_doc_to_visual(doc):


def coco_doc_to_text(doc):
question = doc["question"]
return f"{question}\nDescribe this image briefly using a single sentence."
return f"Provide a one-sentence caption for the provided image."


def coco_process_result(doc, result):
4 changes: 1 addition & 3 deletions lmms_eval/tasks/flickr30k/flickr30k.yaml
@@ -8,9 +8,7 @@ doc_to_visual: !function utils.flickr_doc_to_visual
doc_to_text: !function utils.flickr_doc_to_text
doc_to_target: "answer"
generation_kwargs:
until:
- "ASSISTANT:"
max_new_tokens: 1024
max_new_tokens: 64
temperature: 0
top_p: 0
num_beams: 1
4 changes: 2 additions & 2 deletions lmms_eval/tasks/flickr30k/utils.py
@@ -18,8 +18,8 @@ def flickr_doc_to_visual(doc):


def flickr_doc_to_text(doc):
question = "Please carefully observe the image and come up with a caption for the image."
return f"{question}\nAnswer the question with a short phrase."
# question = "Please carefully observe the image and come up with a caption for the image"
return f"Provide a one-sentence caption for the provided image."


def flickr_process_result(doc, result):
7 changes: 5 additions & 2 deletions lmms_eval/tasks/gqa/gqa.yaml
@@ -9,8 +9,11 @@ doc_to_visual: !function utils.gqa_doc_to_visual
doc_to_text: !function utils.gqa_doc_to_text
doc_to_target: "answer"
generation_kwargs:
until:
- "ASSISTANT:"
max_new_tokens: 16
temperature: 0
top_p: 0
num_beams: 1
do_sample: false
metric_list:
- metric: exact_match
aggregation: mean
2 changes: 1 addition & 1 deletion lmms_eval/tasks/infovqa/infovqa_test.yaml
@@ -3,7 +3,7 @@ dataset_name: InfographicVQA
dataset_kwargs:
token: True
task: "infovqa_test"
test_split: validation
test_split: test
output_type: generate_until
doc_to_visual: !function utils.infovqa_doc_to_visual
doc_to_text: !function utils.infovqa_doc_to_text
4 changes: 1 addition & 3 deletions lmms_eval/tasks/mmbench_cn/mmbench_cc.yaml
@@ -10,9 +10,7 @@ doc_to_visual: !function cc_utils.mmbench_doc_to_visual
doc_to_text: !function cc_utils.mmbench_cn_cc_doc_to_text
doc_to_target: "answer"
generation_kwargs:
until:
- "ASSISTANT:"
max_new_tokens: 1024
max_new_tokens: 256
temperature: 0
top_p: 0
num_beams: 1
4 changes: 1 addition & 3 deletions lmms_eval/tasks/mmbench_cn/mmbench_cn_dev.yaml
@@ -10,9 +10,7 @@ doc_to_visual: !function utils.mmbench_doc_to_visual
doc_to_text: !function utils.mmbench_doc_to_text
doc_to_target: "answer"
generation_kwargs:
until:
- "ASSISTANT:"
max_new_tokens: 1024
max_new_tokens: 256
temperature: 0
top_p: 0
num_beams: 1
4 changes: 1 addition & 3 deletions lmms_eval/tasks/mmbench_cn/mmbench_cn_test.yaml
@@ -10,9 +10,7 @@ doc_to_visual: !function utils.mmbench_doc_to_visual
doc_to_text: !function utils.mmbench_doc_to_text
doc_to_target: "answer"
generation_kwargs:
until:
- "ASSISTANT:"
max_new_tokens: 1024
max_new_tokens: 256
temperature: 0
top_p: 0
num_beams: 1
4 changes: 1 addition & 3 deletions lmms_eval/tasks/mmbench_en/mmbench_en_test.yaml
@@ -9,9 +9,7 @@ doc_to_visual: !function utils.mmbench_doc_to_visual
doc_to_text: !function utils.mmbench_doc_to_text
doc_to_target: "answer"
generation_kwargs:
until:
- "ASSISTANT:"
max_new_tokens: 1024
max_new_tokens: 256
temperature: 0
top_p: 0
num_beams: 1
13 changes: 10 additions & 3 deletions lmms_eval/tasks/mme/mme.yaml
@@ -8,8 +8,11 @@ doc_to_visual: !function utils.mme_doc_to_visual
doc_to_text: !function utils.mme_doc_to_text
doc_to_target: "answer"
generation_kwargs:
until:
- "ASSISTANT:"
max_new_tokens: 16
temperature: 0
top_p: 0
num_beams: 1
do_sample: false
# The return value of process_results will be used by metrics
process_results: !function utils.mme_process_results
# Note that the metric name can be either a registered metric function (such as the case for GQA) or a key name returned by process_results
@@ -20,5 +23,9 @@ metric_list:
- metric: mme_cognition_score
aggregation: !function utils.mme_aggregate_results
higher_is_better: true
model_specific_prompt_kwargs:
default:
pre_prompt: ""
post_prompt: "\nAnswer the question using a single word or phrase."
metadata:
- version: 0.0
- version: 0.0
16 changes: 10 additions & 6 deletions lmms_eval/tasks/mme/utils.py
@@ -22,18 +22,22 @@
}


replace_prompt = "Please answer yes or no."
replace_prompt = " Please answer yes or no."


def mme_doc_to_visual(doc):
return [doc["image"].convert("RGB")]


def mme_doc_to_text(doc):
question = doc["question"]
# TODO: This is a hack. We should fix this in the dataset.
question = question.replace(replace_prompt, "").strip()
return f"{question}\nAnswer the question using a single word or phrase."
def mme_doc_to_text(doc, model_specific_prompt_kwargs=None):
question = doc["question"].strip()
if "pre_prompt" in model_specific_prompt_kwargs and model_specific_prompt_kwargs["pre_prompt"] != "":
question = question.replace(replace_prompt, "")
question = f"{model_specific_prompt_kwargs['pre_prompt']}{question}"
if "post_prompt" in model_specific_prompt_kwargs and model_specific_prompt_kwargs["post_prompt"] != "":
question = question.replace(replace_prompt, "")
question = f"{question}{model_specific_prompt_kwargs['post_prompt']}"
return question


def parse_pred_ans(pred_ans):
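The reworked `mme_doc_to_text` above can be exercised in isolation. This sketch mirrors the diff's logic (with an added guard for a missing kwargs dict); the sample doc and prompt kwargs below are illustrative, not taken from the dataset:

```python
replace_prompt = " Please answer yes or no."

def mme_doc_to_text(doc, model_specific_prompt_kwargs=None):
    # strip the dataset's built-in instruction once a model-specific
    # pre/post prompt takes over, then wrap the question with them
    question = doc["question"].strip()
    if model_specific_prompt_kwargs is None:
        model_specific_prompt_kwargs = {}
    if model_specific_prompt_kwargs.get("pre_prompt"):
        question = question.replace(replace_prompt, "")
        question = f"{model_specific_prompt_kwargs['pre_prompt']}{question}"
    if model_specific_prompt_kwargs.get("post_prompt"):
        question = question.replace(replace_prompt, "")
        question = f"{question}{model_specific_prompt_kwargs['post_prompt']}"
    return question

doc = {"question": "Is this a cat? Please answer yes or no."}
kwargs = {"pre_prompt": "", "post_prompt": "\nAnswer the question using a single word or phrase."}
# -> "Is this a cat?\nAnswer the question using a single word or phrase."
```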
4 changes: 1 addition & 3 deletions lmms_eval/tasks/nocaps/nocaps_test.yaml
@@ -9,9 +9,7 @@ doc_to_visual: !function utils.nocaps_doc_to_visual
doc_to_text: !function utils.nocaps_doc_to_text
doc_to_target: "annotations_captions"
generation_kwargs:
until:
- "ASSISTANT:"
max_new_tokens: 1024
max_new_tokens: 64
temperature: 0
top_p: 0
num_beams: 1
4 changes: 1 addition & 3 deletions lmms_eval/tasks/nocaps/nocaps_val.yaml
@@ -9,9 +9,7 @@ doc_to_visual: !function utils.nocaps_doc_to_visual
doc_to_text: !function utils.nocaps_doc_to_text
doc_to_target: "annotations_captions"
generation_kwargs:
until:
- "ASSISTANT:"
max_new_tokens: 1024
max_new_tokens: 64
temperature: 0
top_p: 0
num_beams: 1
7 changes: 5 additions & 2 deletions lmms_eval/tasks/okvqa/okvqa.yaml
@@ -6,8 +6,11 @@ doc_to_visual: !function utils.okvqa_doc_to_visual
doc_to_text: !function utils.okvqa_doc_to_text
doc_to_target: "answer"
generation_kwargs:
until:
- "ASSISTANT:"
max_new_tokens: 16
temperature: 0
top_p: 0
num_beams: 1
do_sample: false
metric_list:
- metric: exact_match
aggregation: mean
2 changes: 1 addition & 1 deletion lmms_eval/tasks/okvqa/utils.py
@@ -262,7 +262,7 @@ def okvqa_process_results(doc, result):


def okvqa_doc_to_text(doc):
text = f"{doc['question'].capitalize()}\n Answer the question using a single word or phrase."
text = f"{doc['question'].capitalize()}\nAnswer the question using a single word."
return text


4 changes: 1 addition & 3 deletions lmms_eval/tasks/textcaps/textcaps_test.yaml
@@ -9,9 +9,7 @@ doc_to_visual: !function utils.textcaps_doc_to_visual
doc_to_text: !function utils.textcaps_doc_to_text
doc_to_target: "answer"
generation_kwargs:
until:
- "ASSISTANT:"
max_new_tokens: 1024
max_new_tokens: 64
temperature: 0
top_p: 0
num_beams: 1
