This is a pre-built, turnkey implementation of an AI agent grounded with search and extract tools to reduce hallucination.
If you have the API keys and Docker Compose, you should be able to go to http://localhost:3000 and have it Just Work. It runs fine on a MacBook Pro with 8 GB of memory.
This project may be of interest to you if:
- You don't want to mess around. You just want a search engine that knows what you want and gets smarter over time with very little effort on your part, and if that means you wait for 30 seconds to get the right answer, you're okay with that.
- You are interested in AI agents. This project is a low-effort way to play with Letta and see a stateful agent that can remember and learn.
- You are interested in RAG pipelines. The Haystack toolkit has several options for document conversion, cleaning, and extraction. The Hayhooks deployment system is nice, and the project includes several pipelines and custom components.
- You're interested in Open WebUI tooling. This project goes to some lengths to work through OWUI's environment variables and REST API to provision Letta.
- You are interested in tool calling and management. Hayhooks exposes an MCP server, and there's a lot you can do with a tool server to play around with MCP and OpenAPI -- it has OpenAPIServiceToFunctions, OpenAPIConnector, MCPTool, and more.
This project makes Claude or ChatGPT (large language models, or LLMs) dramatically more useful by grounding your LLM through Letta, an agent framework with memory and tool capabilities.
Because a Letta agent has memory, it is teachable. If you tell it your location, it will not only sort out timezone and locale for all subsequent queries, but it can also learn your preferences, websites to avoid, and will gather additional context from previous searches. It's like a better Claude Projects.
Unlike a plain chat model, a Letta agent is stateful. Even if you open a new chat, the agent continues the conversation where you left off and remembers details (like your name and location) so that it can answer new questions.
Letta will dig down into search results if it thinks they aren't detailed enough. For example, it performed three searches in response to the question "Please give me history and background about the increased traffic to sites from AI bots scraping, and the countermeasures involved. When did this start, why is it a problem, and why is it happening?"
- "History and background of increased web traffic from AI bots scraping websites, when it started becoming a major issue, why it's a problem, and why it's happening. Include information about countermeasures websites use against AI scraping."
- "When did AI bots scraping websites first become a significant issue? What specific countermeasures have websites implemented against AI scraping bots? Include historical timeline and details about robots.txt, legal cases, and technical measures."
- "What are the major legal cases about AI web scraping from 2020-2025? When did companies like OpenAI and Anthropic start large-scale web scraping for training AI models?"
And produced this:
If you want more details on what it's thinking, you can dig into the reasoning using Letta Desktop. Here's an example of what goes on behind the scenes when I ask "What are the differences between Roo Code and Cline?"
In addition to search, Letta can also extract content from specific URLs. For example:
This is useful when the search engine hasn't picked up information on the pages.
You will need the following:
- Docker Compose.
- Tavily API key (required for search) -- this is free up to a certain level.
- Gemini API key (very useful for searching documentation) -- also has a free tier.
- Anthropic or OpenAI API key for Letta (Claude Sonnet 3.7, GPT-4, etc.) -- not free, but often cheaper than the monthly subscription. The docker-compose.yml file is set up for Claude Sonnet 3.7.
First, configure your keys by creating a .env file:
cp env.example .env
# edit .env file with your own API keys
To start the services, run the following:
docker compose up
You will see a bunch of text in the logs, but the important bit is this line:
initializer | 2025-04-06 14:29:00,484 - INFO - Initialization complete!
(If you don't see this, it's probably a bug. File an issue and copy and paste the logs into the issue.)
When you see that, you should be good to go. Open a browser at http://localhost:3000 and type in "hello."
The first thing you'll want to do is tell Letta your name and location -- this will help it understand where and when you are.
After that, you will want to give it preferences, using the phrase "store this in your core memory" so that it can remember it for later.
Some example preferences:
- I like mermaid diagrams for visualizing technical concepts and relationships.
- I am using Haystack 2.12, please specify 2.x when searching for Haystack docs.
- When searching for AWS documentation, prefer using documentation from https://docs.aws.amazon.com.
- Only give me sample code examples when I explicitly ask you to.
Because Letta doesn't always store conversations in archival memory, you will also want to ask it to explicitly summarize and store the conversation when you change topics. This lets you take notes and store bookmarks so you can bring an old topic back up later.
Grounding with search can reduce hallucinations, but will not eliminate them. You will still need to check the sources and validate that what Letta is telling you is accurate, especially if you are doing anything critical. Also, do your own searches! Search engines are free for humans, and Letta will be happy to give you its reference material.
When you want it to run in the background, you can run it as a daemon:
docker compose up -d
To completely destroy all resources (including all your data!) and rebuild from scratch:
docker compose down -v --remove-orphans && docker compose up --build
The docker compose file integrates several key components:
- Open WebUI: A user-friendly front-end interface.
- Letta: An agent framework with built-in memory and tooling capabilities.
- Hayhooks: A tool server for use by Letta.
- LiteLLM Proxy Server: Makes all providers "OpenAI style" for Hayhooks.
- Initializer: A container that calls the 'provision' pipeline to create the agent if necessary.
Note that if you delete or rename the Letta agent or the Open WebUI pipe, the initializer will provision a new one with the same name automatically.
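The initializer isn't doing anything magic: Hayhooks exposes every deployed pipeline as a REST endpoint. Here is a minimal sketch of the same call, assuming Hayhooks is listening on its default port (1416) and the pipeline is deployed under the name "provision"; the empty request body is a placeholder, since the real pipeline defines its own inputs:

```python
# Hypothetical re-creation of the initializer's provisioning call.
# Endpoint shape follows Hayhooks' "POST /<pipeline_name>/run" convention;
# the payload is a stand-in for the real pipeline's inputs.
import requests

response = requests.post(
    "http://localhost:1416/provision/run",  # assumed host/port for Hayhooks
    json={},
    timeout=120,
)
response.raise_for_status()
print(response.json())
```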
Open WebUI is the standard for front-end interfaces to LLMs and AI in general.
There are a number of tweaks to improve performance and minimize the time to get started.
For example, this instance is configured to use Gemini embedding so that it doesn't download 900MB of embedding model for its local RAG.
It is not possible to upload files into Letta through the Open WebUI interface right now. The functionality does exist in Letta through the data sources feature, but it might be easier to use an OWUI plugin to send files to Hayhooks and keep them in a document store.
Letta is an agent framework with built-in, self-editing memory and built-in tooling for modifying the agent's behavior, including adding new tools.
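If you want to script against the agent rather than chat with it, here is a hedged sketch using the letta-client Python SDK (pip install letta-client). It assumes the Letta server from the compose file is reachable on its default port 8283; the exact message shape may differ between Letta versions, so check the SDK docs for yours:

```python
from letta_client import Letta, MessageCreate

# Assumes the dockerized Letta server is on its default port.
client = Letta(base_url="http://localhost:8283")

# List the agents the initializer provisioned.
for agent in client.agents.list():
    print(agent.id, agent.name)

    # Send a message to the stateful agent; the payload shape here is an
    # assumption based on recent letta-client versions.
    response = client.agents.messages.create(
        agent_id=agent.id,
        messages=[MessageCreate(role="user", content="What's my name?")],
    )
    print(response)
```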
The search technique is pulled from this academic paper on DeepRAG, although query decomposition is a well-known technique in general. If you want a classic deep-research-style agent, you can import one from Letta's agent-file git repository.
The agent is set up with Claude Sonnet 3.7, as it is much more proactive about calling tools until it gets a good answer. You can use OpenAI for the same effect. Gemini 2.0 models have been inconsistent and less proactive than Claude Sonnet.
If you are going to use Ollama with Letta you will need a powerful model, at least 13B and preferably 70B.
Some reasoning models have difficulty interacting with Letta's reasoning step. DeepSeek and Gemini 2.5 Pro will attempt to reply in the reasoning step, so avoid using them with Letta.
You may want Letta Desktop, which will allow you to see what the agent is doing under the hood and directly edit its functionality. You can download it here. Pick the PostgreSQL option when it comes up.
Start the Docker Compose app first and then open Letta Desktop, as it connects to the Letta agent running inside the container.
Hayhooks is a FastAPI-based server that exposes Haystack pipelines through REST APIs. It's primarily used for RAG, but it's also a great way to make tools available in general, as it has MCP and OpenAPI support.
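A Hayhooks pipeline is just a Python class following the BasePipelineWrapper convention: setup() builds the Haystack pipeline, and run_api() becomes the REST (and MCP tool) entry point. The trivial pipeline below is a placeholder for illustration, not one of this project's pipelines:

```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from hayhooks import BasePipelineWrapper


class PipelineWrapper(BasePipelineWrapper):
    def setup(self) -> None:
        # Build a one-component pipeline that just fills a prompt template.
        self.pipeline = Pipeline()
        self.pipeline.add_component(
            "prompt", PromptBuilder(template="Answer briefly: {{ query }}")
        )

    def run_api(self, query: str) -> str:
        # Hayhooks turns this signature into the endpoint's request schema.
        result = self.pipeline.run({"prompt": {"query": query}})
        return result["prompt"]["prompt"]
```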
To cut down on Anthropic's brutally low rate limits and higher costs, the search and extract tools use Gemini 2.0 Flash to process the output from Tavily and create an answer for Letta. Gemini 2.0 Flash also recommends possible follow-up queries and query expansions along with the search results.
The extract tool converts HTML to Markdown and does some document cleanup before sending it to Gemini 2.0 Flash. Only HTML is processed for now, although there are many converters available, and PDF support through docling-haystack or docling-serve should be easy.
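The exact pipeline lives in the repo, but a rough sketch of the idea looks like the following, assuming Haystack 2.x with the generator pointed at an OpenAI-compatible endpoint such as the LiteLLM proxy. Note that HTMLToDocument extracts plain text and stands in for the project's Markdown conversion, and that the proxy address, env var, and model alias are all assumptions:

```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.converters import HTMLToDocument
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.generators import OpenAIGenerator
from haystack.components.preprocessors import DocumentCleaner
from haystack.utils import Secret

pipe = Pipeline()
pipe.add_component("fetcher", LinkContentFetcher())
pipe.add_component("converter", HTMLToDocument())
pipe.add_component("cleaner", DocumentCleaner())
pipe.add_component(
    "prompt",
    PromptBuilder(
        template="Summarize this page for the agent:\n"
        "{% for doc in documents %}{{ doc.content }}{% endfor %}"
    ),
)
pipe.add_component(
    "llm",
    OpenAIGenerator(
        api_base_url="http://localhost:4000",  # assumed LiteLLM proxy address
        api_key=Secret.from_env_var("LITELLM_API_KEY"),  # hypothetical env var
        model="gemini-2.0-flash",  # illustrative model alias
    ),
)

pipe.connect("fetcher.streams", "converter.sources")
pipe.connect("converter.documents", "cleaner.documents")
pipe.connect("cleaner.documents", "prompt.documents")
pipe.connect("prompt.prompt", "llm.prompt")

result = pipe.run({"fetcher": {"urls": ["https://example.com/"]}})
print(result["llm"]["replies"][0])
```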
There is no vector/embeddings/database RAG involved in this project, although you have the option to add your own by plugging it into Hayhooks. That said, Letta's archival memory is technically a RAG implementation based on pgvector.
See the README for details of the tools provided by Hayhooks.
The LiteLLM proxy server provides an OpenAI-compatible layer on top of several different providers. It is wired up for Open WebUI (commented out) and for Hayhooks.
LiteLLM is mostly commented out here to focus attention on Letta. However, it is very useful in general, especially as you scale up in complexity, and I think it's easier if you start using it from the beginning.
- It provides a way to point to a conceptual model rather than a concrete one: you can point to "claude-sonnet" and change the underlying model from 3.5 to 3.7 (see the sketch after this list).
- It insulates Open WebUI from the underlying providers. You don't have to worry about changing your API key or other configuration settings when switching providers. You also don't have to worry about Open WebUI timing out for 30 seconds while it tries to reach an unreachable provider.
- It lets you specify the same model with different parameters, so you can use extra-headers to experiment with token-efficient tool use, for example.
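Here is what the conceptual-model point looks like from the client side: a sketch assuming the proxy is on its default port 4000 with a "claude-sonnet" alias defined in its config. Switching the underlying model then becomes a proxy-side config change, and this client code never moves:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",  # assumed LiteLLM proxy address
    api_key="sk-anything",  # the proxy validates its own key, if configured
)

reply = client.chat.completions.create(
    model="claude-sonnet",  # conceptual alias, resolved by LiteLLM
    messages=[{"role": "user", "content": "hello"}],
)
print(reply.choices[0].message.content)
```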
Since you're using this for search, you may want to know how your queries are processed.
There are three different services involved in search, each with their own privacy policy.
You do have options for customization. Since the tools go through Hayhooks, you can write a ConditionalRouter that will send different queries to different services or evaluate them before they are processed.
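As a sketch of that idea, Haystack's ConditionalRouter evaluates Jinja conditions against the incoming query and emits it under different output names, which you can then wire to different services. The condition and output names below are made up for illustration:

```python
from haystack.components.routers import ConditionalRouter

# Route queries mentioning "internal" away from external search services.
routes = [
    {
        "condition": "{{ 'internal' in query }}",
        "output": "{{ query }}",
        "output_name": "private_query",  # wire this to a local pipeline
        "output_type": str,
    },
    {
        "condition": "{{ 'internal' not in query }}",
        "output": "{{ query }}",
        "output_name": "public_query",  # wire this to Tavily
        "output_type": str,
    },
]

router = ConditionalRouter(routes=routes)
print(router.run(query="internal roadmap for Q3"))
# -> {'private_query': 'internal roadmap for Q3'}
```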
Anthropic's privacy policy is clear: they do not use personal data for model training without explicit consent.
Tavily's privacy policy is that they do store queries, and they will use queries to improve the quality of their services. You can opt out through the account settings.
Google's privacy policy states that your conversations with Gemini may be used to improve and develop their products and services, including machine-learning technologies. They do use human reviewers, and there is a note saying, "Please don’t enter confidential information in your conversations or any data you wouldn’t want a reviewer to see or Google to use to improve our products, services, and machine-learning technologies."