[Discussion] Agentic Framework Based on VLLM and E2B for RL #2880
#2723
The code in these files is not ready for implementation in GRPO, as we need to discuss many details. This was put together to demonstrate usage.
Currently, the code primarily focuses on the inference side of the agent, with specific goals including:
example usage:
pip install vllm e2b-code-interpreter
Discussion
On Tool Use:
Like the smolagents CodeAgent, this framework does not call functions. Instead, the model writes code as its step. Rather than being given tools and having JSON parsed into function calls, the framework takes a user-provided script and loads it into both the system prompt and the E2B sandbox. This may be less convenient, but it is the simplest and most flexible way to use tools with code.
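To make the idea concrete, here is a minimal sketch of loading a tool script into the prompt and into the code that gets sent to the sandbox. It is not the code from this PR: `SYSTEM_TEMPLATE`, `build_system_prompt`, and `build_sandbox_code` are hypothetical names, and prepending the tool source to the model's code is only one assumed way to make the tools available inside the sandbox.

```python
from pathlib import Path

# Hypothetical prompt template; the real system prompt may look different.
SYSTEM_TEMPLATE = """You solve tasks by writing Python code.
The following helper code is already available in your sandbox:

{tool_source}
"""


def build_system_prompt(tool_path: str, template: str = SYSTEM_TEMPLATE) -> str:
    """Read a user-provided tool script and embed its source in the system prompt."""
    tool_source = Path(tool_path).read_text(encoding="utf-8")
    return template.format(tool_source=tool_source)


def build_sandbox_code(tool_path: str, model_code: str) -> str:
    """Prepend the tool source so the model's code can call the tools directly."""
    tool_source = Path(tool_path).read_text(encoding="utf-8")
    return tool_source + "\n\n" + model_code
```

The model only ever sees plain Python source, so any function defined in the script becomes a "tool" without a schema or JSON parsing step.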
Since some tools may depend on specific packages, you have two options:
example tool:
a web search tool as a separate file at path
path/to/web_search_tool.py
Using tool with the framework:
Or with a template
System Prompt and Exploration:
How a model was trained affects how well it explores different options and writes code. For example, in my tests with a one-shot prompt using LLaMA 3 1b, which was trained to use tools, the model generated and executed code 20% of the time. However, the smallest distilled R1, which wasn't trained with tool calling, only managed to do this 5% of the time.
Theoretically, with reinforcement learning and enough exploration, the model will learn everything on its own, but slow exploration or messy outputs can slow down the training. Here are some ideas:
<think> tokens.
You can also customize the system prompt and tokens for code execution.
example:
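A minimal sketch of what configurable code tokens could look like on the parsing side. The `<code>` / `</code>` delimiters and the `extract_code` helper are assumptions for illustration, not the tokens or functions used in this PR; the only thing taken from the discussion is that the model may emit a `<think>` section before its code.

```python
import re
from typing import Optional


def extract_code(
    completion: str,
    start_token: str = "<code>",   # hypothetical default delimiter
    end_token: str = "</code>",    # hypothetical default delimiter
    think_end: str = "</think>",
) -> Optional[str]:
    """Return the first code block after the reasoning section, or None."""
    # Ignore anything written inside the reasoning section, if there is one.
    _, found, after_think = completion.partition(think_end)
    text = after_think if found else completion

    pattern = re.escape(start_token) + r"(.*?)" + re.escape(end_token)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None
```

Returning `None` when no block is found makes it easy to count how often the model actually produces executable code.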
Regarding E2B:
Many services use E2B for their agents, but training agents would increase E2B usage, and the related costs would become part of the overall training budget. You can check E2B pricing to get an idea of these costs.
This framework executes code one step at a time, which means the sandbox is terminated immediately after each code execution. This approach simplifies the process and reduces costs, but it also limits flexibility and restricts the range of things the model can learn.
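The one-sandbox-per-step pattern can be sketched as follows. `run_step` and `sandbox_factory` are hypothetical names; the sketch only assumes the sandbox object exposes `run_code(code)` and `kill()`, which to my knowledge matches the `Sandbox` class in `e2b-code-interpreter`, but please check this against the SDK version you use.

```python
from typing import Any, Callable


def run_step(code: str, sandbox_factory: Callable[[], Any]) -> Any:
    """Run one code step in a fresh sandbox and terminate it right after."""
    sandbox = sandbox_factory()  # e.g. a function that creates an E2B sandbox (assumption)
    try:
        return sandbox.run_code(code)
    finally:
        # Always release the sandbox, even if execution raised,
        # so a failed step does not keep a billed sandbox alive.
        sandbox.kill()
```

Because nothing survives between steps, variables and files from an earlier step are gone in the next one, which is the flexibility trade-off described above.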
We should also think about other options, like different services or local execution. However, local execution may not scale as well and could come with its own challenges.
Another idea is to incorporate the cost of E2B usage into the reward function, potentially encouraging the agent to be more efficient.
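One possible shape for such a reward, as a sketch only: subtract a weighted sandbox cost from the task reward. `cost_aware_reward`, `price_per_second`, and `cost_weight` are hypothetical; the price is a placeholder you would take from the actual E2B pricing, and the weight would need tuning so the penalty does not drown out the task signal.

```python
def cost_aware_reward(
    task_reward: float,
    sandbox_seconds: float,
    price_per_second: float,   # placeholder, not a real E2B price
    cost_weight: float = 1.0,  # hypothetical trade-off coefficient
) -> float:
    """Task reward minus a weighted penalty for sandbox time used."""
    cost = sandbox_seconds * price_per_second
    return task_reward - cost_weight * cost
```

With `cost_weight=0.0` this reduces to the plain task reward, so the penalty can be switched on gradually.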
I believe that in the future, the costs of using tools will keep rising as a part of the total budget. We should talk about whether it makes sense to train agents with E2B and similar services.
Ideas for features and scalability:
Many features could be added for maximum flexibility, and a number of them are already provided by E2B.
Currently, the sandboxes run sequentially. If the code takes a long time to execute, it would be more efficient to run them in parallel, in batches limited by the E2B user tier.
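A minimal sketch of the parallel variant, assuming each execution is an independent blocking call. `run_in_parallel` and `run_one` are hypothetical names, and `max_concurrent` stands in for whatever concurrency limit your E2B tier allows; a thread pool is one simple option, not necessarily the best one.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List


def run_in_parallel(
    codes: List[str],
    run_one: Callable[[str], Any],
    max_concurrent: int,
) -> List[Any]:
    """Execute every snippet with at most `max_concurrent` running at once."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # map() returns results in input order, so they line up with `codes`.
        # An exception raised by any run_one call is re-raised here.
        return list(pool.map(run_one, codes))
```

Here `run_one` would be whatever function executes a single snippet in its own sandbox and returns the result.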
Each agent could keep using its own sandbox throughout the conversation, rather than having the sandbox terminate after each execution. This would maximize flexibility and the range of things the agent can learn, though at a higher cost.
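A sketch of the persistent variant: one sandbox is created per conversation and only killed when the conversation ends. `conversation_sandbox` and `sandbox_factory` are hypothetical names, and the sandbox is again only assumed to expose `run_code(code)` and `kill()`.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def conversation_sandbox(sandbox_factory: Callable[[], Any]) -> Iterator[Any]:
    """Keep a single sandbox alive for all steps of one conversation."""
    sandbox = sandbox_factory()
    try:
        yield sandbox
    finally:
        sandbox.kill()  # terminate once, at the end of the conversation
```

Inside `with conversation_sandbox(factory) as sbx:` every call to `sbx.run_code(...)` hits the same sandbox, so state from one step (variables, files, installed packages) can be reused in the next, at the price of paying for the sandbox during the whole conversation.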
Additionally, other E2B features that agents could utilize include running code in different languages, executing commands, setting environment variables, and downloading or uploading files from the sandbox.
Maybe even using E2B Desktop for GUI agents.