
[Discussion] Agentic Framework Based on VLLM and E2B for RL #2880

Conversation

@August-murr (Collaborator) commented on Feb 17, 2025

#2723
The code in these files is not ready to be integrated into GRPO, as many details still need discussion; it was put together to demonstrate usage.

Currently, the code primarily focuses on the inference side of the agent, with specific goals including:

  • Utilizing vLLM for its speed and efficiency
  • Enabling parallel batch response generation for maximum GPU utilization
  • Implementing E2B for the agentic workflow

example usage:

pip install vllm e2b-code-interpreter

tasks = [
    [{"role": "user", "content": "what is 3 to the power of 1.5? you must use python code"}],
    [{"role": "user", "content": "what is 8 to the power of 2.7? you must use python code"}],
]
tasks = tasks * 16  # 16 response generations for each task

# assuming your vLLM model is already loaded as `llm`
tokenizer = llm.get_tokenizer()

data = prepare_data(tasks, tokenizer)

chats = generate_model_responses(dataset=data, llm=llm, api_key=api_key)
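For context, the parallel batch generation this relies on maps directly onto vLLM's offline inference API. A minimal sketch of loading a model and generating for the whole batch in one call (the model name is a placeholder, not from this PR):

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")  # placeholder model
tokenizer = llm.get_tokenizer()

# Render each chat with the model's chat template, then hand all prompts to
# vLLM in a single call so it can batch them on the GPU.
prompts = [
    tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    for chat in tasks
]
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=512))
responses = [output.outputs[0].text for output in outputs]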

Discussion

On Tool Use:

This framework, similar to smolagents' CodeAgent, does not call functions through a JSON tool-calling interface. Instead, it writes code at each step. Rather than providing tools and parsing JSON to call them, it takes a user-provided script, reads it, and loads it into the system prompt and the E2B sandbox. This may be less convenient, but it is the simplest and most flexible way to expose tools to code.
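A rough sketch of what that loading step could look like (the helper name and prompt wording here are illustrative, not the PR's actual code):

from pathlib import Path

def build_system_prompt(base_prompt: str, user_script_path: str) -> str:
    # Hypothetical helper: embed the user's tool script in the system prompt
    # so the model can see exactly which functions its sandbox provides.
    tool_source = Path(user_script_path).read_text()
    return (
        base_prompt
        + "\n\nThe following functions are pre-loaded in your sandbox and can be called from your code:\n\n"
        + tool_source
    )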

Since some tools may depend on specific packages, you have two options:

  1. Add a list of packages that will be installed in the sandbox in advance.
  2. Use an E2B template, which is much more efficient and customizable. With an E2B template, you can customize CPU, RAM, packages, environment variables, and more.

Example tool: a web search tool saved as a separate file at path/to/web_search_tool.py
from duckduckgo_search import DDGS

def search_duckduckgo(query: str, max_results: int = 5) -> list:
    """
    Perform a search using DuckDuckGo and return results
    
    Args:
        query (str): Search query
        max_results (int): Maximum number of results to return
    
    Returns:
        list: List of search results
    """
    try:
        with DDGS() as ddgs:
            results = [r for r in ddgs.text(query, max_results=max_results)]
            return results
    except Exception as e:
        print(f"Error performing search: {e}")
        return []

Using the tool with the framework:

data = prepare_data(tasks, tokenizer, user_script_path="path/to/web_search_tool.py")
dependencies = ["duckduckgo-search"]
chats = generate_model_responses(dataset=data, llm=llm, tools_script_path="path/to/web_search_tool.py", api_key=api_key, dependencies=dependencies)

Or with a template:

chats = generate_model_responses(dataset=data, llm=llm, tools_script_path="path/to/web_search_tool.py", api_key=api_key, template=your_e2b_template_id)
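Under the hood, each execution step presumably boils down to e2b-code-interpreter usage along these lines (a sketch assuming one sandbox per step; the helper name and the exact install mechanism are assumptions, not the PR's internals):

from e2b_code_interpreter import Sandbox

def execute_step(code, api_key, template=None, dependencies=None):
    # One sandbox per code step: create it, install any extra packages
    # (skip this when a template already bundles them), run, and kill.
    sandbox = Sandbox(template=template, api_key=api_key)
    try:
        for package in dependencies or []:
            sandbox.commands.run(f"pip install {package}")
        execution = sandbox.run_code(code)
        return execution.text or ""
    finally:
        sandbox.kill()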

System Prompt and Exploration:

How a model was trained affects how well it explores different options and writes code. For example, in my tests with a one-shot prompt using Llama 3 1B, which was trained to use tools, the model generated and executed code 20% of the time. However, the smallest distilled R1, which wasn't trained with tool calling, only managed to do this 5% of the time.

Theoretically, with reinforcement learning and enough exploration, the model will learn everything on its own, but slow exploration or messy outputs can slow down training. Here are some ideas:

  1. Use few-shot prompting to guide the model towards clear structure and effective tool use.
  2. SFT the model with well-structured data similar to how R1 was trained on "cold start data."
  3. Adjust the reward function to encourage better structure, similar to how R1 rewarded the use of <think> tokens (see the sketch after this list).
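As an illustration of idea 3, a format reward in the style of TRL's GRPO reward functions (the signature and 0/1 weighting are assumptions, and completions are assumed to be plain strings):

import re

def code_format_reward(completions, **kwargs):
    # Reward completions that wrap their code in the expected execution tags,
    # nudging the model toward output the framework can actually parse.
    pattern = re.compile(r"<execute_code>.*?</execute_code>", re.DOTALL)
    return [1.0 if pattern.search(completion) else 0.0 for completion in completions]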

You can also customize the system prompt and the strings used to mark code for execution.
example:

my_prompt = "you are an assistant........ to call code write it inside <execute_code> </execute_code>"
# custom strings for the model to write code inside
start_code = "<execute_code>"
end_code = "</execute_code>"
data = prepare_data(tasks, tokenizer, system_prompt=my_prompt)
chats = generate_model_responses(dataset=data, llm=llm, api_key=api_key, parsing_token=start_code, stop_string=end_code)
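Presumably the parsing step just extracts whatever sits between those two markers; an illustrative version (not the PR's actual parser):

import re

def extract_code(text, start="<execute_code>", end="</execute_code>"):
    # Return the first code block delimited by the custom markers, if any.
    match = re.search(re.escape(start) + r"(.*?)" + re.escape(end), text, re.DOTALL)
    return match.group(1) if match else None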

Regarding E2B:

Many services use E2B for their agents, but training agents would increase E2B usage and its cost as part of the overall training budget. You can check E2B pricing to get an idea of these costs.
This framework executes code one step at a time, and the sandbox is terminated immediately after each code execution. This approach simplifies the process and reduces costs, but it also limits flexibility and restricts the range of things the model can learn.

We should also consider other options, such as different services or local execution. However, local execution may not scale as well and could come with its own challenges.
Another idea is to incorporate the cost of E2B usage into the reward function, potentially encouraging the agent to be more efficient.
I believe the cost of tool use will keep rising as a share of the total budget, so we should discuss whether it makes sense to train agents with E2B and similar services.

Ideas for features and scalability:

Many different features can be added, many of which are provided by E2B for maximum flexibility.

Currently, the sandboxes run sequentially. If the code takes a long time to execute, it would be more efficient to run them in parallel, in batches capped by the concurrency limit of your E2B user tier (see the sketch below).
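A minimal sketch of that batching with a thread pool (the execute_step helper from the earlier sketch and the concurrency cap are assumptions for illustration):

from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_SANDBOXES = 8  # cap chosen to stay within your E2B tier's limit

def execute_batch(code_snippets, api_key):
    # Run each snippet in its own sandbox, at most MAX_CONCURRENT_SANDBOXES at a time.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_SANDBOXES) as pool:
        return list(pool.map(lambda code: execute_step(code, api_key), code_snippets))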

Each agent could also keep its sandbox alive throughout the conversation, rather than terminating it after every execution. This would maximize flexibility and the range of things the agent can learn, though at a higher cost; a sketch follows.
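A persistent-sandbox variant might look like this (again a sketch; agent_code_steps stands in for the code parsed from each model turn):

from e2b_code_interpreter import Sandbox

# One long-lived sandbox per conversation: variables, files, and installed
# packages persist across steps, at the price of longer sandbox uptime.
with Sandbox(api_key=api_key) as sandbox:
    for code in agent_code_steps:
        execution = sandbox.run_code(code)
        print(execution.text)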

Additionally, other E2B features that agents could utilize include running code in different languages, executing commands, setting environment variables, and downloading or uploading files from the sandbox.

Maybe we could even use E2B Desktop for GUI agents.

