Multi-Mission Tool Bench: Assessing the Robustness of LLM based Agents through Related and Dynamic Missions
📖 English •
中文
📊 Leaderboard •
🤗 Dataset •
📚 Preprint Paper
This repository provides the evaluation toolkit for the paper Multi-Mission Tool Bench: Assessing the Robustness of LLM based Agents through Related and Dynamic Missions. It also includes the multi-agent data generation framework proposed in the paper, which can be used to synthesize multi-mission data for agents.
Large language models (LLMs) demonstrate strong potential as agents for tool invocation due to their advanced comprehension and planning capabilities. Users increasingly rely on LLM-based agents to solve complex missions through iterative interactions. However, existing benchmarks predominantly assess agents in single-mission scenarios, failing to capture real-world complexity. To bridge this gap, we propose the Multi-Mission Tool Bench. In the benchmark, each test case comprises multiple interrelated missions, requiring agents to dynamically adapt to evolving demands. Moreover, the benchmark explores all possible mission-switching patterns within a fixed number of missions. Specifically, we propose a multi-agent data generation framework to construct the benchmark, along with a novel method that evaluates the accuracy and efficiency of agent decisions with dynamic decision trees. Experiments on diverse open-source and closed-source LLMs reveal critical factors influencing agent robustness and provide actionable insights to the tool-invocation community.
- [2025.04.04] 🔥 We have released the Preprint Paper, and we will make further revisions in the final version.
- [2025.04.03] 🌟 We have released the Leaderboard website, where everyone can access the latest model results on the Multi-Mission Tool Bench.
- [2025.04.02] 🎆 We have released the code for Controllable Multi-Agent Data Generation.
- [2025.03.30] 🎉 We have released the test data and evaluation code for the Multi-Mission Tool Bench.
- [2025.03.18] 🏅️ Our team ranked first in China and second in the world in the Agent category of the March 2025 SuperCLUE Chinese Large Model Benchmark, outperforming DeepSeek, Qianwen, and Doubao by more than 5, 10, and 15 points, respectively.
- Test data location: mmtb/data/Multi-Mission-Tool-Bench.jsonl or 🤗 Dataset
- Location of the prediction results for the 28 models reported in the paper: mmtb/bench_test/result
- More detailed information about the Multi-Mission Tool Bench can be found below
# Create a new Conda environment with Python 3.10
conda create -n MMTB python=3.10
conda activate MMTB
# Clone the MMTB repository
git clone https://github.com/yupeijei1997/MMTB.git
# Change into the `mmtb` directory
cd mmtb/
# Install the package
pip install -r requirements.txt
Location: mmtb/data/Multi-Mission-Tool-Bench.jsonl
Description: Our test data has undergone five rounds of manual inspection and correction by five senior algorithm researchers with years of experience in NLP, CV, and LLMs, taking about one month in total. It is of very high quality and accuracy, with tightly connected missions across rounds, increasing difficulty, no invalid or unusable data, and a distribution consistent with real human usage. Its evaluation results and conclusions are a valuable reference for subsequent optimization work in the Agent direction.
Specifically, the data quality optimization work went through the following stages:
- The initial data was generated using our proposed Multi-Agent Data Generation framework, covering all possible action spaces.
- The test data was then divided according to the four action types we defined and manually inspected and corrected by four different algorithm researchers. Specifically, since missions generated by LLMs tend to be overly formal and not colloquial enough, and since it is especially difficult to generate true multi-turn missions from the second mission onward, we conducted the first round of corrections using colloquialism and true multi-turn missions as the criteria. Notably, when designing the third- and fourth-round missions, we added long-term memory missions, a true multi-turn type, to increase the difficulty of the test set.
Note: In the actual construction process, the four algorithm researchers adopted a layer-by-layer approach: first generating one layer of data with the model, then manually inspecting and correcting it, before generating and correcting the next layer. This avoids the difficulty that arises when all layers are generated at once, where a problem in one layer requires corrections that often affect both the previous and subsequent layers, making it hard to ensure overall correctness and data coherence. Our layer-by-layer construction therefore ensures strong logical consistency and close relationships between layers, without any unreasonable trajectories.
- After the first round of corrections by the four algorithm researchers, a senior expert in the Agent field commented on each piece of data, indicating whether it met the requirements and what problems remained, followed by a second round of corrections by the four researchers.
- After the second round of corrections, we introduced cross-validation, in which the four algorithm researchers inspected and commented on each other's data. The four researchers and the senior expert in the Agent field then discussed the doubtful data and made a third round of corrections.
- After the third round of corrections, the senior expert in the Agent field separately conducted a fourth round of inspection and correction on all the data to ensure its accuracy.
- Finally, since human corrections may themselves introduce errors, we used code to check for parameter type errors and unreasonable dependencies introduced by manual edits, and the senior expert made a final, fifth round of corrections.
Through these five stages of data quality optimization, each piece of data was manually corrected and constructed by multiple algorithm experts, improving the accuracy of our test data from less than 60% initially to 100%. The combination of model generation and multiple rounds of human correction also gives our data excellent diversity and quality.
At the same time, compared to other benchmarks such as BFCL and T-EVAL, our test data covers all possible action spaces, and the missions in the second to fourth rounds are all true multi-turn missions, with both coverage rates reaching 100%. This makes our data distribution very balanced and able to expose a model's weaknesses without blind spots.
Ultimately, this high-quality dataset lays the foundation for our subsequent experiments and lends strong credibility to our conclusions.
Additionally, we provide bilingual test data in both English and Chinese, all of which has undergone the aforementioned manual inspection process. Subsequent leaderboard results will primarily report the English version.
Our evaluation framework separates inference from result analysis, which offers the following advantages:
- High reproducibility: The execution results of all tools corresponding to the golden answers have been persistently saved in our test data. No external API keys are needed and there are no unstable tool-invocation scenarios, ensuring reproducible results.
- High evaluation efficiency: Our evaluation is conducted dynamically. The first phase is carried out by the EvalByToolCallGraph module, which decides whether to continue calling based on whether the action (the predicted tool name) matches the golden answer. Decision-tree pruning is applied during this process, significantly reducing the number of maintained paths and speeding up the evaluation (a conceptual sketch of this matching-and-pruning step follows this list).
- High code reusability: All requests use the standard ToolCalls protocol, making our evaluation code highly reusable. We have also encapsulated the ToolCalls protocol for several open-source general and specialized models that do not natively support it, making the code logic clearer and avoiding the confusing mix of Prompt and ToolCalls invocation methods found in other evaluation frameworks.
- Multiple evaluation analysis dimensions: After obtaining the predictions and action-level evaluation results from the first phase, we use the AnalysisResult module to conduct a detailed evaluation across six dimensions. To our knowledge, we offer the most analysis dimensions and the most detailed results among all Agent evaluation frameworks. The results are saved as CSV files, which facilitates bad-case analysis for developers.
- Strong scalability: Since we use the standard ToolCalls protocol, API models can be integrated rapidly via GPTHandle; for new open-source models, we will continue to update this repository; developers with their own trained models can refer to our Handle code to wrap a Prompt-based calling method into the ToolCalls protocol for rapid integration and verification.
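To make the dynamic evaluation described above more concrete, the snippet below is a conceptual sketch of action-level matching with decision-tree pruning. The node structure and function names are assumptions for illustration only, not the actual EvalByToolCallGraph implementation.

```python
# Conceptual sketch only: match a predicted action against the golden decision
# tree and prune non-matching branches. GoldenNode and advance_frontier are
# illustrative names; see the EvalByToolCallGraph module for the real logic.
from dataclasses import dataclass, field
from typing import List

@dataclass
class GoldenNode:
    action: str                                   # golden tool name, e.g. "getWaifuImages"
    children: List["GoldenNode"] = field(default_factory=list)

def advance_frontier(frontier: List[GoldenNode], predicted_action: str) -> List[GoldenNode]:
    """Keep only branches whose current node matches the predicted tool name
    (pruning), then move to their children for the next prediction step."""
    matched = [node for node in frontier if node.action == predicted_action]
    if not matched:
        return []                                 # mismatch: stop calling, mark the step as wrong
    return [child for node in matched for child in node.children]

# Example: a golden answer that allows two alternative continuations.
root = GoldenNode("getWaifuImages", [GoldenNode("getWaifuDetails"), GoldenNode("prepare_to_answer")])
frontier = advance_frontier([root], "getWaifuImages")     # both branches survive
frontier = advance_frontier(frontier, "getWaifuDetails")  # the other branch is pruned
```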
The overall framework diagram is as follows:
This project supports multiple API models, including: GPT-4o, GPT-o1, Gemini-1.5, Claude-3.5, Mistral-Large, etc.
Taking GPT-4o as an example, set the following keys as environment variables:
export OPENAI_MODEL=xxxxxxxxx
export OPENAI_API_KEY=xxxxxxxxx
export OPENAI_BASE_URL=xxxxxxxxx
If using Azure, set the following keys instead:
export AZURE_OPENAI_DEPLOYMENT=xxxxxxxxx
export AZURE_OPENAI_ENDPOINT=xxxxxxxxx
export AZURE_OPENAI_API_KEY=xxxxxxxxx
export AZURE_OPENAI_API_VERSION=xxxxxxxxx
Afterwards, use the following command to request model results, setting the model to gpt4o. If the test is unexpectedly interrupted, you can set continue_file to an existing result file to continue the test; results that have already been predicted will not be predicted again (a minimal sketch of this resume logic follows the command).
cd mmtb/bench_test
python3 request_pipeline.py \
--model=gpt4o \
--data_path=./data/Multi-Mission-Tool-Bench.jsonl \
--output_path=./result \
--language=en \
--continue_file=empty.jsonl \
--remove_role=True \
--contain_context=True
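The resume behavior of --continue_file can be understood roughly as follows. This is a minimal sketch under the assumption that each prediction line carries an identifier field; the actual field names and logic live in request_pipeline.py.

```python
# Minimal sketch of the --continue_file resume idea (the field name "id" is an
# assumption; the real implementation in request_pipeline.py may differ).
import json

def load_finished_ids(continue_file: str) -> set:
    """Collect identifiers of cases that already have predictions."""
    finished = set()
    with open(continue_file, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                finished.add(json.loads(line)["id"])
    return finished

def iter_pending_cases(data_path: str, continue_file: str):
    """Yield only the test cases that have not been predicted yet."""
    finished = load_finished_ids(continue_file)
    with open(data_path, "r", encoding="utf-8") as f:
        for line in f:
            case = json.loads(line)
            if case["id"] not in finished:
                yield case
```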
This project also supports a variety of open-source specialized models and open-source general models, as follows:
Open-source specialized models include: the watt-tool series, ToolACE-8B, the Hammer2.1 series, xLAM-7b-fc-r, and gorilla-openfunctions-v2.
Open-source general models include: Llama-3.3 series, Qwen2.5 series, GLM-4-9B-Chat, DeepSeek-R1, DeepSeek-V3.
For example, with Qwen2.5-7B-Instruct, you can refer to the Qwen function call documentation to deploy the Qwen model first.
Afterward, use the following command to request model results, setting the model to qwen7b and model_url to the IP and port of your deployment machine, for example http://111.111.111.111:12345. If the test stops unexpectedly, you can set continue_file to continue testing as described above.
python3 request_pipeline.py \
--model=qwen7b \
--data_path=./data/Multi-Mission-Tool-Bench.jsonl \
--output_path=./result \
--language=en \
--model_url=MODEL_URL \
--continue_file=empty.jsonl \
--remove_role=True \
--contain_context=True
Finally, mmtb/bench_test/handle/handles.py enumerates the 10 types of Handles we have implemented. If you want to test other models, refer to this file for the model parameter settings. If you want to add your own Handle, you can also register it in this file.
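For orientation, here is a rough idea of what a custom Handle typically has to do: serialize the conversation and tool schemas into a prompt for a model without native tool-calling support, then parse the model's text output back into the standard ToolCalls message structure. The class and method names below are assumptions for illustration; consult handles.py for the interface the framework actually expects.

```python
# Hypothetical Handle sketch: names and message layout are assumptions, not the
# interface defined in mmtb/bench_test/handle/handles.py.
import json

class MyPromptModelHandle:
    """Wraps a prompt-only model so its output follows the ToolCalls protocol."""

    def build_prompt(self, messages: list, tools: list) -> str:
        # Serialize the tool schemas and the conversation into one prompt.
        tool_desc = json.dumps(tools, ensure_ascii=False)
        history = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
        return (f"Available tools:\n{tool_desc}\n\nConversation:\n{history}\n\n"
                "Respond with a JSON object: {\"name\": ..., \"arguments\": {...}}")

    def parse_response(self, raw_text: str) -> dict:
        # Convert the model's raw text into a standard ToolCalls assistant message.
        call = json.loads(raw_text)
        return {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "type": "function",
                "function": {
                    "name": call["name"],
                    "arguments": json.dumps(call.get("arguments", {}), ensure_ascii=False),
                },
            }],
        }
```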
Use the following command to evaluate the model's prediction results. Set PREDICT_DATA_FILE to the corresponding prediction file in the ./result directory produced in the previous step. The evaluation results include: matrix accuracy by action type and layer, individual accuracy by action type and layer, multi-tool invocation result analysis, error type analysis, true/false multi-turn accuracy, true multi-turn subtype accuracy, and parameter error type analysis.
Detailed results will be output to data_with_details.csv.
cd mmtb/bench_test
python3 analysis_result.py \
--data_file PREDICT_DATA_FILE \
--output_csv_flag=True \
--output_csv_path=./data_with_details.csv
In particular, we have persistently saved the prediction results of all models reported in the paper in the mmtb/bench_test/result directory, making it convenient to reproduce the results presented in the paper directly. This helps Agent-model developers analyze bad cases and lets those who want to quickly understand the evaluation code do so without re-running inference.
Below is an example of how to reproduce the results of GPT-o1.
python3 analysis_result.py \
--data_file ./result/2025-02-11-11:45:51_a5be8b_gpt4o1_en_remove_role_contain_context.jsonl \
--output_csv_flag=True \
--output_csv_path=./data_with_details.csv
Additionally, we support evaluating the prediction results of multiple models at the same time, which further improves usability.
Below is an example of simultaneously reproducing the results of GPT-o1 and GPT-4o, with multiple files joined by a comma.
python3 analysis_result.py \
--data_file ./result/2025-02-11-11:45:51_a5be8b_gpt4o1_en_remove_role_contain_context.jsonl,./result/2025-02-11-14:40:24_5ef3f9_gpt4o1120_en_remove_role_contain_context.jsonl \
--output_csv_flag=True \
--output_csv_path=./data_with_details.csv
- OpenAI's GPT-o1 and GPT-4o models rank first and second, respectively, while the strongest model in the Qwen2.5 series, the 72B variant, ranks third. This indicates that OpenAI's models still hold a leading position in the Agent field, but Chinese models are gradually closing the gap.
- Some of the latest specialized small models for Agents, such as ToolACE-8B, Hammer2.1-7b, and watt-tool-8b, have all shown capabilities that can rival those of general-purpose large models. This suggests that specialized models fine-tuned for specific domains still hold great value. However, some older specialized models, such as xLAM-7b-fc-r and gorilla-openfunctions-v2, performed poorly, indicating that the industry as a whole is making rapid progress in the Agent field.
- Some general models that lead in mathematics and reasoning, such as DeepSeek-R1, DeepSeek-V3, and doubao1.5-pro-32k, lag behind in the Agent field despite their strength in mathematical reasoning and need further improvement.
- The performance of all models on multi-tool action types is generally poor, especially on the serial and serial-parallel multi-tool invocation types, where performance declines further compared to the parallel type. This indicates that missions requiring complex planning remain a significant challenge for large language models.
- The performance of all models on the clarification action type is also generally poor, with the highest score below 50 points. This suggests that models struggle to accurately identify the required tool-parameter information missing from user missions, which leads to parameter hallucination and tool-invocation results that do not meet expectations.
- Some models, such as Mistral and Doubao, show a significant decline on the chat type while performing quite well on single-tool invocation. This indicates that these models may have focused too much on positive examples requiring tool invocation during training while neglecting negative cases, leading to severe function hallucination, which may prevent their use in actual production.
- The accuracy of every model drops significantly from the second round onward. From the second round on, all missions are true multi-turn missions that cannot be completed with the information in the current round alone and require the context to resolve, which greatly increases the difficulty and truly tests a model's ability to understand context. The results show that none of the models handle true multi-turn missions well, indicating that their context understanding needs to be strengthened.
- The decline in performance was greater for the open-source specialized models, suggesting that the general models have a better overall ability to understand context compared to the open-source specialized models. This may be because the general models were trained with more diverse true multi-turn data, which to some extent enhanced their ability to understand true multi-turn interactions in Agent scenarios.
- All models show poor accuracy on the three true multi-turn subtypes, with even the strongest model, GPT-o1, achieving only 40-50% accuracy. This further explains why accuracy begins to drop sharply from the second round.
- In particular, among the three subtypes, performance on long-term memory is the worst. Long-term memory only appears in the third- and fourth-round missions (when it appears in the third round, it tests the model's ability to remember information from the first round; when it appears in the fourth round, it tests the ability to remember information from both the first and second rounds, which is more challenging). This also explains why accuracy declines further in the third and fourth rounds compared to the second round.
- The darker the color, the higher the accuracy. The denser a layer, the more cases are answered correctly at that layer; the sparser it is, the more cases are answered incorrectly.
- The OpenAI series models GPT-o1 and GPT-4o perform the best overall, with significantly fewer blank areas in the third and fourth layers compared to other models.
- Qwen2.5 and Gemini-1.5 both have noticeable blank areas starting from the third layer, indicating poor performance in certain action spaces.
- The three open-source specialized models overall do not perform as well as the above four general models. Starting from the third layer, the number of blank areas further increases, and the overall layout is relatively sparse.
Our paper designs a controllable multi-agent data generation framework, which has the following unique advantages compared to other frameworks:
- Controllable Mission Generation: When generating each round's mission, the framework can control and specify the type of mission currently needed, including single tool invocation, multiple tool invocations, tool invocation after clarification, and chat. It is this advantage that allows our framework to traverse all possible action spaces and construct unbiased data, which is very important in the field of large language models: whether for training or testing, the unbiasedness of the data directly determines whether a model's performance is strong and whether the evaluation is reliable.
- Specified Quantity Mission Generation: Our framework can generate a specified number of missions. Paired with controllable mission generation, the generated data can cover all possible action spaces for any number of missions (see the small illustration after this list).
- Diversified Mission Generation: In the first round of mission generation, our framework can generate multiple missions with different tones, lengths, themes/instances, scenarios, and role identities, and randomly select one to continue generating, offering extremely high diversity, close to the real distribution of humans.
- True Multi-Turn Mission Generation: In subsequent rounds of mission generation, our framework is currently the only one that can controllably generate true multi-turn missions. We can generate three core types of true multi-turn missions, including implicit, ellipsis, and long-term memory. We also provide dozens of few-shot examples to guide the model in generating true multi-turn missions, randomly selecting one of the examples each time, greatly enhancing data diversity and generation efficiency.
- Rich Agents: We have designed five major types of agents, including User agents, AI agents, Planner agents, Tool agents, and Checker agents, with a total of 15 subtypes. The diversity of agents ensures the diversity and high quality of the data generated by our framework.
- Powerful Planner: The Planner Agent we designed is currently the only agent in all intelligent agent frameworks that can make decisions on complex serial and parallel multi-tool invocation missions. We have written prompts of over 4000 characters to guide it in making decisions according to our set guidelines, achieving a very high decision accuracy rate.
- Reliable Checker: The Checker Agent we designed is currently the only agent that checks the logic of parallel invocations. We have also written dozens of rules to check for low-level errors the Planner might make and provide feedback so it can reflect. Used in combination, our Planner Agent and Checker Agent achieve a decision accuracy of over 90% without human intervention, which, to our knowledge, is the highest among all multi-agent data generation frameworks.
- Arbitrary Model Specification: Our framework can use any LLM as the base model for the agents, allowing researchers to use any model they consider stronger to achieve better results.
- Bilingual Support: Our framework supports both English and Chinese, capable of generating data in both languages. To our knowledge, this is also the only framework currently supporting bilingual data generation.
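As a small illustration of the "all possible action spaces" coverage mentioned in the first two advantages, the snippet below simply enumerates every mission-type sequence of a fixed length. The enumeration is an assumption about how coverage can be driven, not the framework's actual generation code, which lives in multi_agent/.

```python
# Illustration only: enumerate every mission-type switching pattern for a fixed
# number of missions, so that generated data can cover the full action space.
from itertools import product

MISSION_TYPES = ["single_tool", "multiple_tools", "clarification_then_tool", "chat"]

def all_mission_patterns(num_missions: int):
    """Yield every possible sequence of mission types of the given length."""
    yield from product(MISSION_TYPES, repeat=num_missions)

# With 4 mission types and 4 missions there are 4**4 = 256 switching patterns.
print(sum(1 for _ in all_mission_patterns(4)))
```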
Take as an example the case where all agents use Azure GPT-4o as the base model and the data is generated in English. First, set the following keys as environment variables.
export AZURE_OPENAI_DEPLOYMENT=xxxxxxxxx
export AZURE_OPENAI_ENDPOINT=xxxxxxxxx
export AZURE_OPENAI_API_KEY=xxxxxxxxx
export AZURE_OPENAI_API_VERSION=xxxxxxxxx
export LANGUAGE=en
The core innovation of this paper is that the proposed MMTB covers all possible action spaces for any number of missions, and, except for the first round, all missions are true multi-turn missions. Accordingly, our framework can generate data for any number of missions. Here we take generating four missions as an example; the reference code is as follows:
cd multi_agent
python3 generate.py \
--layer_num_total 4 \
--user_model "gpt4o" \
--planner_model "gpt4o" \
--tool_model "gpt4o" \
--agent_model "gpt4o" \
--checker_model "gpt4o"
In particular, we preserve the data in three formats: the raw generated data, streamlined data (with checker messages and role-switch messages removed), and the captured messages of each agent, for researchers who need them.
Finally, we showcase part of the output generated for an actual case. The Continue Question User Agent has generated a true multi-turn mission of the implicit type, "Who is the uploader of the third wallpaper?" To solve this mission, the model must refer to the preceding context to obtain the ID 778899 of the third wallpaper, and then invoke the tool getWaifuDetails to retrieve detailed information about that wallpaper.
User: I need the latest 5 anime wallpapers.
Planner: getWaifuImages(limit=5, tags=["anime wallpapers"])
Tool: ...
Planner: prepare_to_answer(answer_type=tool)
AI:
Here are the latest 5 anime wallpapers:
1. Wallpaper ID: 112233
- [View Wallpaper](https://wallhaven.cc/w/112233)
- Resolution: 1920x1080
2. Wallpaper ID: 445566
- [View Wallpaper](https://wallhaven.cc/w/445566)
- Resolution: 2560x1440
3. Wallpaper ID: 778899
- [View Wallpaper](https://wallhaven.cc/w/778899)
- Resolution: 3840x2160
4. Wallpaper ID: 101112
- [View Wallpaper](https://wallhaven.cc/w/101112)
- Resolution: 1920x1080
5. Wallpaper ID: 131415
- [View Wallpaper](https://wallhaven.cc/w/131415)
- Resolution: 2560x1440
You can click on the links to view the full wallpapers. Hope you enjoy these wallpapers! If you need more help, please let me know.
Continue Question User Agent generates:
User: Who is the uploader of the third wallpaper?
Planner: getWaifuDetails(image_id=778899)
Note that even though our framework is able to generate such true multi-turn missions, generating them remains very challenging for LLMs. Therefore, as mentioned earlier, manual annotation by multiple experts was needed to raise the accuracy from less than 60% to 100%, which included rewriting pseudo multi-turn missions generated by LLMs into true multi-turn missions.
In the near future, we plan to release the following content:
- Release Multi-Mission Tool Bench test data
- Release Multi-Mission Tool Bench evaluation code
- Release multi agent generation framework code
- Launch the Leaderboard website and provide ongoing updates
- Publish the paper Multi-Mission Tool Bench: Assessing the Robustness of LLM based Agents through Related and Dynamic Missions
@misc{yu2025multimissiontoolbenchassessing,
title={Multi-Mission Tool Bench: Assessing the Robustness of LLM based Agents through Related and Dynamic Missions},
author={Peijie Yu and Yifan Yang and Jinjian Li and Zelong Zhang and Haorui Wang and Xiao Feng and Feng Zhang},
year={2025},
eprint={2504.02623},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2504.02623},
}