Commit

Merge branch 'main' into joshua/olympiadbench
Luodian authored Mar 28, 2024
2 parents 035f3df + 9dfb53a commit a96d319
Showing 19 changed files with 333 additions and 134 deletions.
12 changes: 8 additions & 4 deletions README.md
@@ -6,12 +6,12 @@

> Accelerating the development of large multimodal models (LMMs) with `lmms-eval`
🏠 [Homepage](https://lmms-lab.github.io/) | 🎉 [Blog](https://lmms-lab.github.io/lmms-eval-blog/lmms-eval-0.1/) | 📚 [Documentation](docs/README.md) | 🤗 [Huggingface Datasets](https://huggingface.co/lmms-lab)
🏠 [Homepage](https://lmms-lab.github.io/) | 🎉 [Blog](https://lmms-lab.github.io/lmms-eval-blog/lmms-eval-0.1/) | 📚 [Documentation](docs/README.md) | 🤗 [Huggingface Datasets](https://huggingface.co/lmms-lab) | <a href="https://emoji.gg/emoji/1684-discord-thread"><img src="https://cdn3.emoji.gg/emojis/1684-discord-thread.png" width="14px" height="14px" alt="Discord_Thread"></a> [discord/lmms-eval](https://discord.gg/ebAMGSsS)

In an era where people pursue AGI (Artificial General Intelligence) with a zeal akin to the 1960s moon landing missions,
evaluating the core of AGI, namely the large language models (LLMs) and large multimodal models (LMMs) with unprecedented capabilities to understand, learn, and interact across a broad range of human tasks, has become a pivotal challenge.

To surmount this, a broad spectrum of evaluation datasets is proposed and used to assess model capabilities across various dimensions, creating a comprehensive capability chart that reveals the true performance of models. However, evaluation of models has become quite hard since there are countless evaluation benchmarks and datasets organized in various ways, scattered across the internet, sleeping in somebody's Google Drive, Dropbox, and other websites hosted by schools or research labs.
In today's world, we're on a thrilling quest for Artificial General Intelligence (AGI), driven by a passion that reminds us of the excitement surrounding the 1960s moon landing. At the heart of this adventure are the incredible large language models (LLMs) and large multimodal models (LMMs). These models are like brilliant minds that can understand, learn, and interact with a vast array of human tasks, marking a significant leap toward our goal.

To truly understand how capable these models are, we've started to create and use a wide variety of evaluation benchmarks. These benchmarks help us map out a detailed chart of abilities, showing us how close we are to achieving true AGI. However, this journey is not without its challenges. The sheer number of benchmarks and datasets we need to look at is overwhelming. They're all over the place - tucked away in someone's Google Drive, scattered across Dropbox, and hidden in the corners of various school and research lab websites. It's like embarking on a treasure hunt where the maps are spread far and wide.

In the field of language models, there has been a valuable precedent set by the work of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). They offer integrated data and model interfaces, enabling rapid evaluation of language models and serving as the backend support framework for the [open-llm-leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), and it has gradually become the underlying ecosystem of the era of foundation models.

@@ -25,6 +25,10 @@ We believe our effort could provide an efficient interface for the detailed comp

# Announcement

## Contribution Guidance

We've added guidance on contributing new datasets and models. Please refer to our [documentation](docs/README.md). If you need assistance, you can contact us via [discord/lmms-eval](https://discord.gg/ebAMGSsS).

## v0.1.0 Released

The first version of `lmms-eval` is released. We are working on providing a one-command evaluation suite to accelerate the development of LMMs.
44 changes: 30 additions & 14 deletions llava_repr_requirements.txt
@@ -1,22 +1,38 @@
llava@git+https://github.com/haotian-liu/LLaVA@v1.1.3
accelerate>=0.21.0
black==24.1.0
accelerate==0.21.0
datasets==2.16.1
evaluate>=0.4.0
jsonlines
numexpr
peft>=0.2.0
pybind11>=2.6.2
pytablewriter
rouge-score>=0.0.4
sacrebleu>=1.5.0
scikit-learn>=0.24.1
sqlitedict
evaluate==0.4.1
hf_transfer==0.1.6
Jinja2==3.1.3
numpy==1.26.4
openai==1.13.3
packaging==23.2
pandas==2.2.1
Pillow==10.2.0
protobuf==4.25.3
pycocoevalcap==1.2
pycocotools==2.0.7
pytablewriter==1.2.0
pytest==8.0.2
python_Levenshtein==0.25.0
pytz==2024.1
PyYAML==6.0.1
Requests==2.31.0
sacrebleu==2.4.0
scikit_learn==1.2.2
sentencepiece==0.1.99
setuptools==68.2.2
sglang==0.1.12
shortuuid==1.0.12
sqlitedict==2.1.0
tenacity==8.2.3
torch==2.0.1
openai>=1.0.0
pycocoevalcap
tokenizers==0.15.2
tqdm==4.66.2
tqdm-multiprocess
transformers
transformers==4.37.2
zstandard
pillow
pyyaml
4 changes: 2 additions & 2 deletions lmms_eval/tasks/_task_utils/file_utils.py
@@ -1,8 +1,8 @@
import os


def generate_submission_file(file_name, args):
    path = os.path.join(args.output_path, "submissions")
def generate_submission_file(file_name, args, subpath="submissions"):
    path = os.path.join(args.output_path, subpath)
    os.makedirs(path, exist_ok=True)
    path = os.path.join(path, file_name)
    return os.path.abspath(path)
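The only behavioral change here is the new `subpath` parameter, which defaults to the previously hard-coded `"submissions"` folder, so existing callers keep their output layout while new tasks can write elsewhere. A minimal usage sketch; the `SimpleNamespace` stand-in for `args` and the file names are illustrative, and only `args.output_path` is actually read by the helper:

```python
from types import SimpleNamespace

from lmms_eval.tasks._task_utils.file_utils import generate_submission_file

# Illustrative args object; generate_submission_file only reads args.output_path.
args = SimpleNamespace(output_path="./logs")

# Default subpath -> ./logs/submissions/mmbench_en_test_results.xlsx
default_path = generate_submission_file("mmbench_en_test_results.xlsx", args)

# Custom subpath -> ./logs/ocrbench/ocrbench_results.json
custom_path = generate_submission_file("ocrbench_results.json", args, subpath="ocrbench")

print(default_path)
print(custom_path)
```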
22 changes: 22 additions & 0 deletions lmms_eval/tasks/mmbench/_default_template_mmbench_cn_yaml
@@ -0,0 +1,22 @@
dataset_path: lmms-lab/MMBench
dataset_kwargs:
  token: True
doc_to_target: "answer"
dataset_name: "cn"
output_type: generate_until
doc_to_visual: !function cn_utils.mmbench_doc_to_visual
doc_to_text: !function cn_utils.mmbench_doc_to_text
generation_kwargs:
  max_new_tokens: 256
  temperature: 0
  top_p: 0
  num_beams: 1
  do_sample: false
process_results: !function cn_utils.mmbench_process_results
model_specific_prompt_kwargs:
  default:
    pre_prompt: ""
    post_prompt: "\n请直接使用所提供的选项字母作为答案回答。"
model_specific_generation_kwargs:
  llava:
    image_aspect_ratio: original
25 changes: 25 additions & 0 deletions lmms_eval/tasks/mmbench/_default_template_mmbench_en_yaml
@@ -0,0 +1,25 @@
dataset_path: lmms-lab/MMBench
dataset_kwargs:
  token: True
doc_to_target: "answer"
model_specific_prompt_kwargs:
  default:
    pre_prompt: ""
    post_prompt: "\nAnswer with the option's letter from the given choices directly."
doc_to_visual: !function en_utils.mmbench_doc_to_visual
doc_to_text: !function en_utils.mmbench_doc_to_text
doc_to_target: "answer"
process_results: !function en_utils.mmbench_process_results
model_specific_generation_kwargs:
  llava:
    image_aspect_ratio: original
output_type: generate_until
dataset_name: "en"
generation_kwargs:
  until:
    - "ASSISTANT:"
  max_new_tokens: 1024
  temperature: 0
  top_p: 0
  num_beams: 1
  do_sample: false
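The two template files above centralize the dataset path, generation settings, and prompt kwargs that the per-split YAMLs previously duplicated. Because they contain `!function` tags, they cannot be parsed by `yaml.safe_load` directly; the mmbench utils modules therefore read the YAML as text and drop those lines before loading, as the top of `cc_utils.py` hints at further down in this diff (the filtering itself falls outside the visible hunk). A sketch of that pattern, with an illustrative path and variable names:

```python
from pathlib import Path

import yaml

# Read the task YAML as plain text and drop the `!function ...` lines,
# which reference Python callables that yaml.safe_load cannot construct.
yaml_path = Path("lmms_eval/tasks/mmbench/mmbench_cn.yaml")  # illustrative path
with open(yaml_path, "r") as f:
    raw_data = f.readlines()
safe_data = [line for line in raw_data if "!function" not in line]
config = yaml.safe_load("".join(safe_data))

# The remaining metadata is then available to the utils module, e.g.:
gpt_model = config["metadata"]["gpt_eval_model_name"]  # "gpt-3.5-turbo" per this commit
```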
9 changes: 5 additions & 4 deletions lmms_eval/tasks/mmbench/cc_utils.py
@@ -7,6 +7,7 @@

eval_logger = logging.getLogger("lmms-eval")
from lmms_eval.tasks.mmbench.mmbench_evals import MMBench_Evaluator
from lmms_eval.tasks._task_utils.file_utils import generate_submission_file

with open(Path(__file__).parent / "mmbench_cn.yaml", "r") as f:
    raw_data = f.readlines()
@@ -66,9 +67,9 @@ def mmbench_cn_cc_process_results(doc, results):
    return data


def mmbench_cn_cc_aggregate_results(results):
def mmbench_cn_cc_aggregate_results(results, args):
    df = pd.DataFrame(results)
    os.makedirs("./submissions", exist_ok=True)
    with pd.ExcelWriter("./submissions/mmbench_cn_cc_results.xlsx") as writer:
    file = generate_submission_file("mmbench_cn_cc_results.xlsx", args)
    with pd.ExcelWriter(file) as writer:
        df.to_excel(writer, index=False)
    eval_logger.info(f"Saved results to mmbench_cn_cc_results.xlsx")
    eval_logger.info(f"Saved results to {file}")
3 changes: 1 addition & 2 deletions lmms_eval/tasks/mmbench/cn_utils.py
@@ -83,8 +83,7 @@ def mmbench_aggregate_dev_results(results, args):

def mmbench_aggregate_test_results(results, args):
    df = pd.DataFrame(results)
    Path(args.output_path).joinpath("submissions").mkdir(parents=True, exist_ok=True)
    excel_write_path = Path(args.output_path) / "submissions" / f"mmbench_cn_test_results.xlsx"
    excel_write_path = generate_submission_file("mmbench_cn_test_results.xlsx", args)
    with pd.ExcelWriter(excel_write_path) as writer:
        df.to_excel(writer, index=False)
    eval_logger.info(f"Saved results to {excel_write_path}")
9 changes: 4 additions & 5 deletions lmms_eval/tasks/mmbench/en_utils.py
@@ -36,15 +36,15 @@ def mmbench_doc_to_text(doc, model_specific_prompt_kwargs=None):
        "answer": doc.get("answer", None),
        "options": options_prompt,
        "category": doc["category"],
        "L2-category": doc["l2-category"],
        "L2-category": doc["L2-category"],
        "options_dict": options_dict,
        "index": doc["index"],
        "hint": doc["hint"],
        "source": doc["source"],
        "split": doc["split"],
    }

    query_prompt = f"{data['hint']} {data['question']} {data['options']}" if pd.notna(data["hint"]) else f"{data['question']} {data['options']}"
    query_prompt = f"{data['hint']} {data['question']} {data['options']}" if pd.notna(data["hint"]) and data["hint"] != "nan" else f"{data['question']} {data['options']}"

    if model_specific_prompt_kwargs:
        query_prompt = f"{query_prompt}\n{model_specific_prompt_kwargs['post_prompt']}"
@@ -64,7 +64,7 @@ def mmbench_process_results(doc, results):
            "source": doc["source"],
            "split": doc["split"],
            "category": doc["category"],
            "L2-category": doc["l2-category"],
            "L2-category": doc["L2-category"],
        }
    }
    option_candidate = ["A", "B", "C", "D", "E"]
@@ -83,8 +83,7 @@ def mmbench_aggregate_dev_results(results, args):

def mmbench_aggregate_test_results(results, args):
    df = pd.DataFrame(results)
    Path(args.output_path).joinpath("submissions").mkdir(parents=True, exist_ok=True)
    excel_write_path = Path(args.output_path) / "submissions" / f"mmbench_en_test_results.xlsx"
    excel_write_path = generate_submission_file("mmbench_en_test_results.xlsx", args)
    with pd.ExcelWriter(excel_write_path) as writer:
        df.to_excel(writer, index=False)
    eval_logger.info(f"Saved results to {excel_write_path}")
5 changes: 2 additions & 3 deletions lmms_eval/tasks/mmbench/mmbench_cc.yaml
@@ -1,9 +1,8 @@
dataset_path: lmms-lab/MMBench_CN
dataset_path: lmms-lab/MMBench
dataset_name: cc
dataset_kwargs:
  token: True
group: mmbench_cn
task: "mmbench_cn_cc"
dataset_name: "chinese_culture"
test_split: test
output_type: generate_until
doc_to_visual: !function cc_utils.mmbench_doc_to_visual
4 changes: 3 additions & 1 deletion lmms_eval/tasks/mmbench/mmbench_cn.yaml
@@ -5,4 +5,6 @@ task:
  - mmbench_cn_cc
metadata:
  version: 0.0
  sys_prompt: "有如下几个选项:"
  gpt_eval_model_name: "gpt-3.5-turbo"
  quick_extract: true
  sys_prompt: "有如下几个选项:"
30 changes: 2 additions & 28 deletions lmms_eval/tasks/mmbench/mmbench_cn_dev.yaml
@@ -1,33 +1,7 @@
dataset_path: lmms-lab/MMBench_CN
dataset_kwargs:
  token: True
group: mmbench_cn
task: "mmbench_cn_dev"
dataset_name: "default"
test_split: "dev"
output_type: generate_until
doc_to_visual: !function cn_utils.mmbench_doc_to_visual
doc_to_text: !function cn_utils.mmbench_doc_to_text
doc_to_target: "answer"
generation_kwargs:
  max_new_tokens: 256
  temperature: 0
  top_p: 0
  num_beams: 1
  do_sample: false
process_results: !function cn_utils.mmbench_process_results
metric_list:
  - metric: submission
    higher_is_better: true
    aggregation: !function cn_utils.mmbench_aggregate_dev_results
metadata:
  version: 0.0
  gpt_eval_model_name: "gpt-3.5-turbo"
  quick_extract: true

model_specific_prompt_kwargs:
  default:
    pre_prompt: ""
    post_prompt: "\n请直接使用所提供的选项字母作为答案回答。"
model_specific_generation_kwargs:
  llava:
    image_aspect_ratio: original
include: _default_template_mmbench_cn_yaml
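With the shared template in place, the dev/test YAMLs shrink to their task-specific keys plus an `include:` pointing at `_default_template_mmbench_cn_yaml`. Conceptually, include resolution loads the template and lets the including file's keys override it; the sketch below illustrates that merge order with a hypothetical loader (it is not the actual lmms-eval implementation, and it strips `!function` lines the same way the utils modules do):

```python
from pathlib import Path

import yaml


def _safe_load(path: Path) -> dict:
    # Drop `!function ...` lines so yaml.safe_load can parse the rest.
    lines = [line for line in path.read_text().splitlines(keepends=True) if "!function" not in line]
    return yaml.safe_load("".join(lines)) or {}


def load_task_config(path: str) -> dict:
    """Hypothetical sketch: resolve `include:` by loading the shared template
    first, then overlaying the task file's own keys on top of it."""
    task_file = Path(path)
    cfg = _safe_load(task_file)
    template = cfg.pop("include", None)
    if template:
        base = _safe_load(task_file.parent / template)
        base.update(cfg)  # task-specific keys win over template defaults
        cfg = base
    return cfg


# e.g. load_task_config("lmms_eval/tasks/mmbench/mmbench_cn_dev.yaml")
```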
30 changes: 2 additions & 28 deletions lmms_eval/tasks/mmbench/mmbench_cn_test.yaml
@@ -1,33 +1,7 @@
dataset_path: lmms-lab/MMBench_CN
dataset_kwargs:
  token: True
task: "mmbench_cn_test"
dataset_name: "default"
task: mmbench_cn_test
test_split: test
output_type: generate_until
doc_to_visual: !function cn_utils.mmbench_doc_to_visual
doc_to_text: !function cn_utils.mmbench_doc_to_text
doc_to_target: "answer"
generation_kwargs:
  max_new_tokens: 256
  temperature: 0
  top_p: 0
  num_beams: 1
  do_sample: false
process_results: !function cn_utils.mmbench_process_results
metric_list:
  - metric: submission
    aggregation: !function cn_utils.mmbench_aggregate_test_results
    higher_is_better: true
metadata:
  version: 0.0
  gpt_eval_model_name: "gpt-3.5-turbo"
  quick_extract: true

model_specific_prompt_kwargs:
  default:
    pre_prompt: ""
    post_prompt: "\n请直接使用所提供的选项字母作为答案回答。"
model_specific_generation_kwargs:
  llava:
    image_aspect_ratio: original
include: _default_template_mmbench_cn_yaml
8 changes: 0 additions & 8 deletions lmms_eval/tasks/mmbench/mmbench_en.yaml
@@ -5,11 +5,3 @@ task:
metadata:
  version: 0.0
  sys_prompt: "There are several options:"

model_specific_prompt_kwargs:
  default:
    pre_prompt: ""
    post_prompt: "\nAnswer with the option's letter from the given choices directly."
model_specific_generation_kwargs:
  llava:
    image_aspect_ratio: original
20 changes: 2 additions & 18 deletions lmms_eval/tasks/mmbench/mmbench_en_dev.yaml
@@ -1,23 +1,7 @@
dataset_path: lmms-lab/MMBench_EN
dataset_kwargs:
  token: True
task: "mmbench_en_dev"
test_split: dev
output_type: generate_until
doc_to_visual: !function en_utils.mmbench_doc_to_visual
doc_to_text: !function en_utils.mmbench_doc_to_text
doc_to_target: "answer"
generation_kwargs:
  until:
    - "ASSISTANT:"
  max_new_tokens: 1024
  temperature: 0
  top_p: 0
  num_beams: 1
  do_sample: false
process_results: !function en_utils.mmbench_process_results
include: _default_template_mmbench_en_yaml
metric_list:
  - metric: submission
    aggregation: !function en_utils.mmbench_aggregate_dev_results
metadata:
  version: 0.0
    higher_is_better: true
17 changes: 1 addition & 16 deletions lmms_eval/tasks/mmbench/mmbench_en_test.yaml
@@ -1,22 +1,7 @@
dataset_path: lmms-lab/MMBench_EN
dataset_kwargs:
  token: True
task: "mmbench_en_test"
test_split: test
output_type: generate_until
doc_to_visual: !function en_utils.mmbench_doc_to_visual
doc_to_text: !function en_utils.mmbench_doc_to_text
doc_to_target: "answer"
generation_kwargs:
  max_new_tokens: 256
  temperature: 0
  top_p: 0
  num_beams: 1
  do_sample: false
process_results: !function en_utils.mmbench_process_results
include: _default_template_mmbench_en_yaml
metric_list:
  - metric: submission
    aggregation: !function en_utils.mmbench_aggregate_test_results
    higher_is_better: true
metadata:
  version: 0.0
22 changes: 22 additions & 0 deletions lmms_eval/tasks/ocrbench/ocrbench.yaml
@@ -0,0 +1,22 @@
dataset_path: echo840/OCRBench
dataset_kwargs:
  token: True
task: "ocrbench"
test_split: test
output_type: generate_until
doc_to_visual: !function utils.ocrbench_doc_to_visual
doc_to_text: !function utils.ocrbench_doc_to_text
doc_to_target: "answer"
generation_kwargs:
  max_new_tokens: 128
  temperature: 0
  top_p: 0
  num_beams: 1
  do_sample: false
process_results: !function utils.ocrbench_process_results
metric_list:
  - metric: ocrbench_accuracy
    aggregation: !function utils.ocrbench_aggregate_accuracy
    higher_is_better: true
metadata:
  - version: 0.0
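As with the MMBench tasks, the `process_results` / `aggregation` pair named in this config defines the scoring contract: `process_results` returns a per-document dict keyed by the metric name, and the aggregation function receives the list of those per-document values. The actual OCRBench scoring lives in `lmms_eval/tasks/ocrbench/utils.py`, which this commit does not show; the sketch below is purely illustrative of that contract, and the substring-match scoring and field names are assumptions:

```python
# Hypothetical sketch of the contract wired up by ocrbench.yaml above;
# not the real implementation in lmms_eval/tasks/ocrbench/utils.py.
def ocrbench_process_results(doc, results):
    prediction = results[0].strip().lower()
    answers = doc["answer"] if isinstance(doc["answer"], list) else [doc["answer"]]  # assumed field
    score = 1.0 if any(str(ans).lower() in prediction for ans in answers) else 0.0
    # Keyed by the metric name from metric_list: ocrbench_accuracy.
    return {"ocrbench_accuracy": {"question": doc.get("question"), "score": score}}


def ocrbench_aggregate_accuracy(results):
    # `results` is the list of per-document dicts returned above.
    scores = [item["score"] for item in results]
    return sum(scores) / len(scores) if scores else 0.0
```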