VectorDB
\ No newline at end of file
diff --git a/tamingllms/_build/html/_images/llm_judge.png b/tamingllms/_build/html/_images/llm_judge.png
new file mode 100644
index 0000000..deeea05
Binary files /dev/null and b/tamingllms/_build/html/_images/llm_judge.png differ
diff --git a/tamingllms/_build/html/_images/llm_judge.svg b/tamingllms/_build/html/_images/llm_judge.svg
deleted file mode 100644
index 4292dfa..0000000
--- a/tamingllms/_build/html/_images/llm_judge.svg
+++ /dev/null
@@ -1,879 +0,0 @@
-LLM Judge Evaluation SystemLLM-Judgecomponentsapps
App Rankings
-Detailed Scores
-Analysis Report
-
-
Task description
-Scoring guidelines
-Output format
-
-
(Optional)Ground TruthLLM App 1LLM App 2
...
-
LLM App NGenerate EvaluationPromptCompare ResultsSubmit for Review
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
\ No newline at end of file
diff --git a/tamingllms/_build/html/_images/meta2.png b/tamingllms/_build/html/_images/meta2.png
new file mode 100644
index 0000000..93f0a9b
Binary files /dev/null and b/tamingllms/_build/html/_images/meta2.png differ
diff --git a/tamingllms/_build/html/_images/meta2.svg b/tamingllms/_build/html/_images/meta2.svg
deleted file mode 100644
index 8833843..0000000
--- a/tamingllms/_build/html/_images/meta2.svg
+++ /dev/null
@@ -1,882 +0,0 @@
-LLM Judge Pairwise Evaluation SystemPool of LLM JudgesPairwiseSelectorllmcomparison_pairHumanEvaluatorsRankingAlgorithm
LLM Judges Leaderboard
----------------------
-1. Judge C (0.95)
-2. Judge A (0.92)
-3. Judge B (0.89)
- ...
-N. Judge X (0.75)
-
-
PromptLLM ResponseJudge AvsJudge BDraw JudgesGenerate PairInput forEvaluationEvaluatePreferencesGenerateRankings
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
\ No newline at end of file
diff --git a/tamingllms/_build/html/_images/rag.svg b/tamingllms/_build/html/_images/rag.svg
new file mode 100644
index 0000000..6b77e28
--- /dev/null
+++ b/tamingllms/_build/html/_images/rag.svg
@@ -0,0 +1,4 @@
+
+
+
+
Data Parsing & Ingestion
Data
Embeddings
Retrieval
RAG Context
reranking
Query
LLM
Context Window
Indexing
Query
User
VectorDB
Retrieval System
RAG
\ No newline at end of file
diff --git a/tamingllms/_build/html/_images/similarity.png b/tamingllms/_build/html/_images/similarity.png
new file mode 100644
index 0000000..4f2f228
Binary files /dev/null and b/tamingllms/_build/html/_images/similarity.png differ
diff --git a/tamingllms/_build/html/_sources/markdown/intro.md b/tamingllms/_build/html/_sources/markdown/intro.md
index a3879a7..ab10fe5 100644
--- a/tamingllms/_build/html/_sources/markdown/intro.md
+++ b/tamingllms/_build/html/_sources/markdown/intro.md
@@ -35,11 +35,15 @@ Throughout this book, we'll tackle the following (non-exhaustive) list of critic
3. **Testing Complexity**: Traditional software testing methodologies break down when dealing with non-deterministic and generative systems, requiring new approaches.
-4. **Safety and Alignment**: LLMs can generate harmful, biased, or inappropriate content, requiring robust safeguards and monitoring systems to ensure safe deployment.
+4. **Safety**: LLMs can generate harmful, biased, or inappropriate content, requiring robust safeguards and monitoring systems to ensure safe deployment.
-5. **Vendor Lock-in**: Cloud-based LLM providers can create significant dependencies and lock-in through their proprietary APIs and infrastructure, making it difficult to switch providers or self-host solutions.
+5. **Alignment**: LLMs are next-token prediction models, which means they are not aligned with the user's preferences by default.
-6. **Cost Optimization**: The computational and financial costs of operating LLM-based systems can quickly become prohibitive without careful management, and optimization.
+6. **Vendor Lock-in**: Cloud-based LLM providers can create significant dependencies and lock-in through their proprietary APIs and infrastructure, making it difficult to switch providers or self-host solutions.
+
+7. **Cost Optimization**: The computational and financial costs of operating LLM-based systems can quickly become prohibitive without careful management, and optimization.
+
+We conclude with a discussion on the future of LLMs and the challenges that will arise as we move forward.
## A Practical Approach
@@ -171,7 +175,7 @@ Now that your environment is set up, let's begin our exploration of LLM challeng
## About the Author
-Tharsis Souza (Ph.D. Computer Science, UCL University of London) is a computer scientist and product leader specializing in AI-based products. He is a Lecturer at Columbia University's Master of Science program in Applied Analytics, (*incoming*) Head of Product, Equities at Citadel, and former Senior VP at Two Sigma Investments. He mentors under-represented students & working professionals to help create a more diverse global AI1 ecosystem.
+Tharsis Souza (Ph.D. Computer Science, UCL University of London) is a computer scientist and product leader specializing in AI-based products. He is a Lecturer at Columbia University's Master of Science program in Applied Analytics, (*incoming*) Head of Product, Equities at Citadel, and former Senior VP at Two Sigma Investments. He mentors under-represented students & working professionals to help create a more diverse global AI ecosystem.
With over 15 years of experience delivering technology products across startups and Fortune 500 companies, he is also an author of numerous scholarly publications and a frequent speaker at academic and business conferences. Grounded on academic background and drawing from practical experience building and scaling up products powered by language models at early-stage startups, major institutions as well as contributing to open source projects, he brings a unique perspective on bridging the gap between LLMs promised potential and their practical implementation challenges to enable the next generation of AI-powered products.
diff --git a/tamingllms/_build/html/_sources/markdown/toc.md b/tamingllms/_build/html/_sources/markdown/toc.md
index 6b39520..c343795 100644
--- a/tamingllms/_build/html/_sources/markdown/toc.md
+++ b/tamingllms/_build/html/_sources/markdown/toc.md
@@ -43,4 +43,14 @@ Abstract: *The current discourse around Large Language Models (LLMs) tends to fo
[cc-by-nc-sa]: http://creativecommons.org/licenses/by-nc-sa/4.0/
[cc-by-nc-sa-image]: https://licensebuttons.net/l/by-nc-sa/4.0/88x31.png
-[cc-by-nc-sa-shield]: https://img.shields.io/badge/License-CC-BY--NC--SA-4.0-lightgrey.svg
\ No newline at end of file
+[cc-by-nc-sa-shield]: https://img.shields.io/badge/License-CC-BY--NC--SA-4.0-lightgrey.svg
+
+```
+@misc{tharsistpsouza2024tamingllms,
+ author = {Tharsis T. P. Souza},
+ title = {Taming LLMs: A Practical Guide to LLM Pitfalls with Open Source Software},
+ year = {2024},
+ journal = {GitHub repository},
+ url = {https://github.com/souzatharsis/tamingLLMs)
+}
+```
\ No newline at end of file
diff --git a/tamingllms/_build/html/_sources/notebooks/cost.ipynb b/tamingllms/_build/html/_sources/notebooks/cost.ipynb
index 4cd6849..5a1bc87 100644
--- a/tamingllms/_build/html/_sources/notebooks/cost.ipynb
+++ b/tamingllms/_build/html/_sources/notebooks/cost.ipynb
@@ -315,7 +315,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Quantization is a powerful technique for reducing the memory footprint of LLMs. This can be exemplified by the case of LLaMa 3.3 70B as quantized by {cite}`unsloth2024llama3` [^unsloth]. The model's memory requirements vary significantly based on the quantization level used as demonstrated in {numref}`quantized`.\n",
+ "Quantization[^visual-quantization] is a powerful technique for reducing the memory footprint of LLMs. This can be exemplified by the case of LLaMa 3.3 70B as quantized by {cite}`unsloth2024llama3` [^unsloth]. The model's memory requirements vary significantly based on the quantization level used as demonstrated in {numref}`quantized`.\n",
+ "\n",
+ "[^visual-quantization]: Maarten Grootendorst provides the best visual guide for model quantization {cite}`grootendorst2024quantization`.\n",
"\n",
"[^unsloth]: Unsloth runs a business of making LLMs fine-tuning streamlined. Check them out at [unsloth.ai](https://unsloth.ai).\n",
"\n",
diff --git a/tamingllms/_build/html/_sources/notebooks/evals.ipynb b/tamingllms/_build/html/_sources/notebooks/evals.ipynb
index 002eb4a..68b2390 100644
--- a/tamingllms/_build/html/_sources/notebooks/evals.ipynb
+++ b/tamingllms/_build/html/_sources/notebooks/evals.ipynb
@@ -853,7 +853,7 @@
"4. **Run Evaluations**: Use the judge model to score outputs. Consider using a large and/or more capable model as a judge to provide more nuanced assessments.\n",
"5. **Aggregate and Analyze Results**: Interpret scores to refine applications.\n",
"\n",
- "```{figure} ../_static/evals/llm_judge.svg\n",
+ "```{figure} ../_static/evals/llm_judge.png\n",
"---\n",
"name: llm_judge\n",
"alt: Conceptual Overview\n",
@@ -1187,11 +1187,11 @@
"\n",
"An alternative to the above approaches is to use humans to directly evaluate the LLM-judges themselves. A notable example of this is [Judge Arena](https://judgearena.com/) {cite}`judgearena2024`, which is a platform that allows users to vote on which AI model made the better evaluation. Under this approach, the performance of the LLM evaluator is given by the (blind) evaluation of humans who perform the voting on randomly generated pairs of LLM judges as depicted in {numref}`meta2`. Only after submitting a vote, users can see which models were actually doing the judging.\n",
"\n",
- "```{figure} ../_static/evals/meta2.svg\n",
+ "```{figure} ../_static/evals/meta2.png\n",
"---\n",
"name: meta2\n",
"alt: Human-in-the-loop meta evaluation Conceptual Overview\n",
- "scale: 60%\n",
+ "scale: 75%\n",
"align: center\n",
"---\n",
"Human-in-the-loop Meta Evaluation.\n",
diff --git a/tamingllms/_build/html/_sources/notebooks/input.ipynb b/tamingllms/_build/html/_sources/notebooks/input.ipynb
index a8d6b4c..8397a78 100644
--- a/tamingllms/_build/html/_sources/notebooks/input.ipynb
+++ b/tamingllms/_build/html/_sources/notebooks/input.ipynb
@@ -12,11 +12,6 @@
"-- Steve Jobs\n",
"```\n",
"```{contents}\n",
- "```\n",
- "\n",
- "\n",
- "```{note}\n",
- "This Chapter is Work-in-Progress.\n",
"```"
]
},
@@ -26,20 +21,22 @@
"source": [
"## Introduction\n",
"\n",
- "Large Language Models face several critical challenges in effectively processing input data. While advances in long-context language models (LCLMs) {cite}`lee2024longcontextlanguagemodelssubsume` have expanded the amount of information these systems can process simultaneously, significant challenges remain in managing and effectively utilizing extended inputs. \n",
+ "While advances in long-context language models (LCs) {cite}`lee2024longcontextlanguagemodelssubsume` have expanded the amount of information these systems can process, significant challenges remain in managing and effectively utilizing extended data inputs:\n",
"\n",
- "LLMs are sensitive to input formatting and structure, requiring careful data preparation to achieve optimal results {cite}`tan2024htmlraghtmlbetterplain`. They operate with knowledge cutoffs, providing potentially stale or outdated information that may not reflect current reality and demonstrate problems with temporal knowledge accuracy {cite}`amayuelas-etal-2024-knowledge`. LLMs also struggle with less common but important information showing a systematic loss of long-tail knowledge {cite}`kotha2024understanding`.\n",
+ "- LLMs are sensitive to input formatting and structure, requiring careful data preparation to achieve optimal results {cite}`he2024doespromptformattingimpact, liu2024enhancingllmscognitionstructurization, tan2024htmlraghtmlbetterplain`.\n",
+ "- They operate with knowledge cutoffs, providing potentially stale or outdated information that may not reflect current reality and demonstrate problems with temporal knowledge accuracy {cite}`amayuelas-etal-2024-knowledge`.\n",
+ "- LLMs also face \"lost-in-the-middle\" problems {cite}`wu2024longdocumentsummaryevaluation` and struggle with less common but important information showing a systematic loss of long-tail knowledge {cite}`kotha2024understanding`.\n",
"\n",
- "Motivated by these challenges, this chapter explores two key components:\n",
+ "Motivated by these challenges, this chapter explores two key input data components:\n",
"\n",
- "1. Data Parsing: Parsing documents into a unified format that is suitable for LLMs to process.\n",
+ "1. Data Parsing and Chunking: Parsing and chunking documents into a unified format that is suitable and more manageable for LLMs to process.\n",
"2. Retrieval Augmentation: Augmenting LLMs with the ability to retrieve relevant, recent, and specialized information.\n",
"\n",
"In data parsing, we will explore some useful open source tools that help transform data into LLM-compatible formats, demonstrating their impact through a case study of structured information extraction from complex PDFs. In a second case study, we will introduce some chunking strategies to help LLMs process long inputs and implement a particular technique called Chunking with Contextual Linking the enables contextually relevant chunk processing.\n",
"\n",
- "In retrieval augmentation, we will explore how to enhance LLMs with semantic search capabilities for incorporating external context using RAGs (Retrieval Augmented Generation). Through a detailed case study, we build a RAG system for querying live codebases, illustrating methods to bridge static model knowledge with dynamic information requirements.\n",
+ "In retrieval augmentation, we will explore how to enhance LLMs with semantic search capabilities for incorporating external context using RAGs (Retrieval Augmented Generation) while discussing whether RAGs will be really needed in the future given the rise of long-context language models.\n",
"\n",
- "In our last case study, we build a quiz generator using a LLM with large context window. We will explore some additional relevant techniques such as prompt caching and response verification through citations.\n",
+ "While RAGs are useful for incorporating external context, they are not a silver bullet nor a mandatory component for all LLM applications. In our last case study, we leverage long-context windows to build a quiz generator from a large knowledge base. We will also explore some additional relevant techniques such as prompt caching and response verification through citations.\n",
"\n",
"By the chapter's conclusion, readers will possess relevant knowledge of input data management strategies for LLMs and practical expertise in selecting and implementing appropriate approaches and tools for specific use cases."
]
@@ -50,9 +47,11 @@
"source": [
"## Parsing Documents\n",
"\n",
- "Building robust data ingestion and preprocessing pipelines is essential for any LLM application. This section explores tools and frameworks that streamline input data processing, in particular for parsing purposes, providing a unified interface for converting diverse data formats into standardized representations that LLMs can effectively process. By abstracting away format-specific complexities, they allow developers to focus on core application logic rather than parsing implementation details while maximizing the performance of the LLM.\n",
+ "Data parsing and formatting play a critical role in LLMs performance {cite}`he2024doespromptformattingimpact, liu2024enhancingllmscognitionstructurization, tan2024htmlraghtmlbetterplain`. Hence, building robust data ingestion and preprocessing pipelines is essential for any LLM application. \n",
+ "\n",
+ "This section explores open source tools that streamline input data processing, in particular for parsing purposes, providing a unified interface for converting diverse data formats into standardized representations that LLMs can effectively process. By abstracting away format-specific complexities, they allow developers to focus on core application logic rather than parsing implementation details while maximizing the LLM performance.\n",
"\n",
- "We will cover open source tools and frameworks that provide parsing capabilities for a wide range of data formats. And we will demonstrate how some of these tools can be used to extract structured information from complex PDFs also discussing how the quality of the parser can impact LLM's performance."
+ "We will cover open source tools that provide parsing capabilities for a wide range of data formats. And we will demonstrate how some of these tools can be used to extract structured information from complex PDFs demonstrating how the quality of the parser can impact LLM's performance."
]
},
{
@@ -61,7 +60,7 @@
"source": [
"### MarkItDown\n",
"\n",
- "MarkItDown is a Python package and CLI too developed by the Microsoft AutoGen team for converting various file formats to Markdown. It supports a wide range of formats including PDF, PowerPoint, Word, Excel, images (with OCR and EXIF metadata), audio (with transcription), HTML, and other text-based formats making it a useful tool for document indexing and LLM-based applications.\n",
+ "MarkItDown {cite}`microsoft2024markitdown` is a Python package and CLI tool developed by the Microsoft AutoGen team for converting various file formats to Markdown. It supports a wide range of formats including PDF, PowerPoint, Word, Excel, images (with OCR and EXIF metadata), audio (with transcription), HTML, and other text-based formats making it a useful tool for document indexing and LLM-based applications.\n",
"\n",
"Key features:\n",
"- Simple command-line and Python API interfaces\n",
@@ -81,7 +80,7 @@
"\n",
"### Docling\n",
"\n",
- "Docling is a Python package developed by IBM Research for parsing and converting documents into various formats. It provides advanced document understanding capabilities with a focus on maintaining document structure and formatting.\n",
+ "Docling {cite}`docling2024github` is a Python package developed by IBM Research for parsing and converting documents into various formats. It provides advanced document understanding capabilities with a focus on maintaining document structure and formatting.\n",
"\n",
"Key features:\n",
"- Support for multiple document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, etc.)\n",
@@ -101,13 +100,6 @@
"```"
]
},
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Frameworks-Based Parsing\n"
- ]
- },
{
"cell_type": "markdown",
"metadata": {},
@@ -119,17 +111,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "A common use case where document parsing matters is to structured data extraction from documents, particularly in the presence of complex formatting and layout. In this case study, we will extract the economic forecasts from Merrill Lynch's CIO Capital Market Outlook released on December 16, 2024 {cite:p}`merrill2024`. We will focus on page 7 of this document, which contains several economic variables organized in a mix of tables, text and images (see {numref}`forecast`)\n",
+ "A common use case where document parsing matters is structured data extraction, particularly in the presence of complex formatting and layout. In this case study, we will extract the economic forecasts from Merrill Lynch's CIO Capital Market Outlook released on December 16, 2024 {cite}`merrill2024`. We will focus on page 7 of this document, which contains several economic variables organized in a mix of tables, text and images (see {numref}`forecast`).\n",
"\n",
"\n",
"```{figure} ../data/input/forecast.png\n",
"---\n",
"name: forecast\n",
"alt: Forecast\n",
- "scale: 50%\n",
+ "scale: 45%\n",
"align: center\n",
"---\n",
- "Forecast\n",
+ "Merrill Lynch's CIO Capital Market Outlook released on December 16, 2024 {cite}`merrill2024`\n",
"```"
]
},
@@ -184,7 +176,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "How similar are the two results? We can use use Levenshtein distance to measure the similarity between the two results. We will also calculate a naive score using the `SequenceMatcher` from the `difflib` package, which is a simple measure of the similarity between two strings based on the number of matches in the longest common subsequence."
+ "How similar are the two results? We can use use Levenshtein distance to measure the similarity between the two results. We will also calculate a naive score using the `SequenceMatcher` from the `difflib` package, which is a simple measure of similarity between two strings based on the number of matches in the longest common subsequence."
]
},
{
@@ -256,7 +248,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "It turns out that the two results are quite different, with a similarity score of about 13.98% and 17.77% for Levenshtein and `SequenceMatcher` respectively."
+ "It turns out that the two results are quite different, with a similarity score of about 13.98% and 17.77% for Levenshtein and `SequenceMatcher`, respectively."
]
},
{
@@ -351,7 +343,7 @@
"scale: 45%\n",
"align: center\n",
"---\n",
- "Forecast 2025\n",
+ "Merrill Lynch's CIO Economic Forecasts.\n",
"```\n",
"\n",
"We will define a `Forecast` pydantic model to represent an economic forecast composed of a `financial_variable` and a `financial_forecast`. We will also define a `EconForecast` pydantic model to represent the list of economic forecasts we want to extract from the document.\n"
@@ -375,7 +367,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We write a simple function to extract the economic forecasts from the document using an LLM model (with structured output) with the following prompt template, where `extract_prompt` is kind of data the user would like to extract and `doc` is the input document to analyze."
+ "We write a simple function to extract the economic forecasts from the document using an LLM model (with structured output) with the following prompt template, where `extract_prompt` represents the kind of data the user would like to extract and `doc` is the input document to analyze."
]
},
{
@@ -682,7 +674,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Now, let's focus on the asset class weightings. We will extract the asset class weightings from the document and compare the results from MarkItDown and Docling. The information now is presented in a quite different structure. The CIO view information is represented in a spectrum from starting with \"Underweight\", passing through \"Neutral\" and reaching \"Overweight\". The actual view is marked by some colored dots in the chart. Let's see if we can extract this information from the document.\n",
+ "Now, let's focus on the asset class weightings. We will extract the asset class weightings from the document and compare the results from MarkItDown and Docling. The information now is presented in a quite different structure as we can see in {ref}`asset_class`. The CIO view information is represented in a spectrum starting with \"Underweight\", passing through \"Neutral\" and reaching \"Overweight\". The actual view is marked by some colored dots in the chart. Let's see if we can extract this relatively more complex information from the document.\n",
"```{figure} ../_static/input/asset_class.png\n",
"---\n",
"name: asset_class\n",
@@ -729,7 +721,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Now we construct a DataFrame to compare the results from MarkItDown and Docling with an added \"true_value\" column containing the true values from the document, which we extracted manually from the chart."
+ "We construct a DataFrame to compare the results from MarkItDown and Docling with an added \"true_value\" column containing the true values from the document, which we extracted manually from the chart. This enables us to calculate accuracy of the structured data extraction task in case."
]
},
{
@@ -936,7 +928,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "Docling performs significantly better at 93.33% accuracy missing only one value. MarkItDown achieves 53.33% accuracy, struggling with nuanced asset class weightings. In this case, Docling's structured parsed output did help the LLM to extract the information more accurately compared to MarkItDown's unstructured output. Hence, in this case, the strategy used to parse the data did impact the LLM's ability to extract the information. A more robust analysis would run data extraction on a large sample data a number of repeated runs to estimate error rates."
+ "We observe that Docling performs significantly better at 93.33% accuracy missing only one value. MarkItDown achieves 53.33% accuracy struggling with nuanced asset class weightings. In this case, Docling's structured parsed output did help the LLM to extract the information more accurately compared to MarkItDown's unstructured output. Hence, in this case, the strategy used to parse the data did impact the LLM's ability to extract structured information. Having said that, it is important to mention that a more robust analysis would run data extraction on a large sample data a number of repeated runs to estimate error rates since results are non-deterministic."
]
},
{
@@ -945,8 +937,8 @@
"source": [
"What if we want to systematically extract all tables from the document? We can use Docling to do that by simply accessing the `tables` attribute of the `DocumentConverter` object.\n",
"\n",
- "By doing that, we observe that Docling extracted 7 tables from the document. Exporting tables from top down and left to right in order of appearance in the document.\n",
- "Below, we can see the first table successfully extracted for Equities forecasts, the second one for Fixed Income forecasts as well as the last table, which contains CIO Equity Sector Views.\n"
+ "By doing that, we observe that Docling extracted 7 tables from the document exporting tables from top down and left to right in order of appearance in the document.\n",
+ "Below, we display the first two and the last tables. We can see the first table successfully extracted for Equities forecasts, the second one for Fixed Income forecasts as well as the last table, which contains CIO Equity Sector Views.\n"
]
},
{
@@ -1593,7 +1585,14 @@
"- The description mentions \"overweight positions in certain sectors such as Utilities and Financials\" but looking at the CIO Equity Sector Views, both these sectors show neutral positions, not overweight positions.\n",
"- For fixed income, the description cites a \"10-Year (4.03%)\" yield, but the image shows the 30-Year Yield at 4.03%, while the 10-Year Yield is actually 4.40%.\n",
"\n",
- "Arguably, the description's inaccuracies could be a consequence of the underlying LLM model's inability to process the image. Further research is needed to determine if this is the case."
+ "Arguably, the description's inaccuracies could be a consequence of the underlying LLM model's inability to process the image.\n",
+ "\n",
+ "We have covered MarkitDown and Docling as examples of open source tools that can help developers parse input data into a suitable format to LLMs. Other relevant open source tools worth mentioning include:\n",
+ "- Unstructured.io {cite}`unstructured2024github`: A Python library for unstructured data extraction.\n",
+ "- FireCrawl {cite}`mendable2024firecrawl`: A Fast and Efficient Web Crawler for LLM Training Data.\n",
+ "- LlamaParse {cite}`llamaparse2024github`: Llamaindex's data parsing solution.\n",
+ "\n",
+ "The choice of tool depends on the specific requirements of the application and the nature of the input data. This choice should be taken as a critical decision of any data intensive LLM-based application and deserves dedicated research and evidence-based experimentation.\n"
]
},
{
@@ -1602,75 +1601,152 @@
"source": [
"## Retrieval-Augmented Generation\n",
"\n",
- "RAG is a technique that allows LLMs to retrieve information from a knowledge base to answer questions. It is a popular technique for building LLM applications that require knowledge-intensive tasks {cite}`lewis2021retrievalaugmentedgenerationknowledgeintensivenlp`.\n",
+ "What happens if we asked ChatGPT who's the author of the book \"Taming LLMs\"?\n",
"\n",
- "RAG utilizes a retrieval system to fetch external knowledge and augment the LLM. It has proved effective in mitigating hallucinations of LLMs {cite}`10.1145/3589334.3645481, ni-etal-2024-llms`."
+ "\n"
]
},
{
- "cell_type": "markdown",
+ "cell_type": "code",
+ "execution_count": 1,
"metadata": {},
+ "outputs": [],
"source": [
- "## Case Studies\n",
- "\n",
- "This section presents three case studies that demonstrate practical solutions to common LLM limitations:\n",
- "\n",
- "First, Content Chunking with Contextual Linking showcases how intelligent chunking strategies can overcome both context window and output token limitations. This case study illustrates techniques for breaking down and reassembling content while maintaining coherence, enabling the generation of high-quality long-form outputs despite model constraints.\n",
+ "from dotenv import load_dotenv\n",
+ "import os\n",
"\n",
- "Second, a Retrieval Augmented Generation case study addresses the challenge of stale or outdated model knowledge. By implementing semantic search over a GitHub repository, this example demonstrates how to augment LLM responses with current, accurate information - allowing users to query and receive up-to-date answers about code repository contents.\n",
+ "# Load environment variables from .env file\n",
+ "load_dotenv()\n",
"\n",
- "Third, the final case study builds a Quiz generator with citations. This case study explores some additional input management techniques that become particularly useful when long context window is available. This includes implementing prompt caching for efficiency and adding citations to enhance response accuracy and verifiability. These approaches show how to maximize the benefits of larger context models while maintaining response quality."
+ "from openai import OpenAI\n",
+ "client = OpenAI()\n",
+ "model = \"gpt-4o-mini\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "question = \"Who's the Author of the Book Taming LLMs?\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "The book \"Taming LLMs\" is authored by *G. Arulkumaran, H. M. B. P. D. Karthikeyan, and I. A. M. Almasri.* If you need more information about the book or its contents, feel free to ask!\n"
+ ]
+ }
+ ],
+ "source": [
+ "response = client.chat.completions.parse(\n",
+ " model=\"gpt-4o-mini\",\n",
+ " messages=[\n",
+ " {\"role\": \"user\", \"content\": question}\n",
+ " ]\n",
+ ")\n",
+ "response.choices[0].message.content"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Case Study I: Content Chunking with Contextual Linking\n",
+ "Turns out ChatGPT hallucinates. A quick web search on the before mentioned authors yields no results. In fact, those authors names are made up. And of course the correct answer would have been \"Tharsis Souza\".\n",
"\n",
- "Content chunking with contextual linking is a technique to break down long-form content into smaller, manageable chunks while keeping chunk-specific context. This approach tackles three problems:\n",
- "1. The LLM's inability to process long inputs to do context-size limits\n",
- "2. The LLM's inability to generate long-form content due to the `max_output_tokens` limitation.\n",
- "3. The LLM's inability to maintain coherence and context when generating responses per chunks\n",
+ "LLMs only have access to the information they have been trained on, which of course has been fixed at a point in time. Hence, LLMs operate with stale data. The problem gets exacerbated by the fact that LLMs are trained to provide an answer even if the answer is unknown by them, hence leading to hallucinations. \n",
"\n",
- "Here, we exemplify this technique by following these steps:\n",
- "1. **Chunking the Content**: The input content is split into smaller chunks. This allows the LLM to process each chunk individually, focusing on generating a complete and detailed response for that specific section of the input.\n",
+ "One solution to this problem is to use a retrieval system to fetch information from a knowledge base to provide recent and relevant context to user queries using so-called Retrieval Augmented Generation (RAG) system.\n",
"\n",
- "2. **Maintaining Context**: Each chunk is linked with contextual information from the previous chunks. This helps in maintaining the flow and coherence of the content across multiple chunks.\n",
+ "RAG utilizes a retrieval system to fetch external knowledge and augment LLM's context. It is a useful technique for building LLM applications that require domain-specific information or knowledge-intensive tasks {cite}`lewis2021retrievalaugmentedgenerationknowledgeintensivenlp`. It has also proved effective in mitigating LLMs hallucinations {cite}`10.1145/3589334.3645481, ni-etal-2024-llms`."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In the above example, a RAG would help with hallucinations by grounding the LLM's response to information provided in the knowledge base. Additional common use cases of RAG systems include:\n",
"\n",
- "3. **Generating Linked Prompts**: For each chunk, a prompt is generated that includes the chunk's content and its context. This prompt is then used to generate the output for that chunk.\n",
+ "1. **Enterprise Knowledge Management**: RAG enables organizations to synthesize answers from diverse internal data sources like documents, databases, and communication channels. This creates a unified knowledge interface that can accurately answer questions using the organization's own data.\n",
+ "2. **Document Processing and Analysis**: RAG excels at extracting and analyzing information from complex documents like financial reports, presentations, and spreadsheets. The system can enable LLMs to understand context and relationships across different document types and formats.\n",
+ "3. **Intelligent Customer Support**: By combining knowledge bases with conversational abilities, RAG powers chatbots and support systems that can maintain context across chat history, provide accurate responses, and handle complex customer queries while reducing hallucinations.\n",
+ "4. **Domain-Specific Applications**: RAG allows LLMs to be equipped with specialized knowledge in fields like medicine, law, or engineering by retrieving information from domain-specific literature, regulations, and technical documentation. This enables accurate responses aligned with professional standards and current best practices.\n",
+ "5. **Code Documentation and Technical Support**: RAG can help developers by retrieving relevant code examples, API documentation, and best practices from repositories and documentation, which often suffer updates frequently, enabling more accurate and contextual coding assistance.\n",
"\n",
- "4. **Combining the Outputs**: The outputs of all chunks are combined to form the final long-form content.\n",
+ "If LLMs alone work on stale, general-purpose data with the added challenge of being prone to hallucinations, RAG systems serve as an added capability enabling LLMs to work on recent, domain-specific knowledge increasing the likelihood of LLMs to provide responses that are factual and relevant to user queries.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### RAG Pipeline\n",
"\n",
- "Let's examine an example implementation of this technique.\n",
+ "RAG architectures vary but they all share the same goal: to retrieve relevant information from a knowledge base to maximize the LLM's ability to effectively and accurately respond to prompts, particularly when the answer requires out-of-training data information.\n",
"\n",
- "#### Generating long-form content\n",
+ "We will introduce key components of a RAG system one by one leading to a full canonical RAG pipeline at the end that ultimately will be used to answer our original question \"Who's the author of the book Taming LLMs?\", accurately.\n",
"\n",
- "- Goal: Generate a long-form report analyzing a company's financial statement.\n",
- "- Input: A company's 10K SEC filing.\n",
+ "The following basic components will be introduced (see {numref}`rag_pipeline` for a visual representation):\n",
+ "- Vector Database\n",
+ " - Embeddings\n",
+ " - Indexing\n",
+ "- Retrieval System including re-ranking\n",
+ "- LLM Augmented Generation via in-context learning\n",
"\n",
- "```{figure} ../_static/structured_output/diagram1.png\n",
+ "Data extraction, parsing and chunking are also part of a canonical pipeline as we prepare the knowledge base. Those are concepts that we have already explored in the previous sections, hence we will be succinct here. We will start by preparing the knowledge base.\n",
+ "\n",
+ "```{figure} ../_static/input/rag.svg\n",
"---\n",
- "name: content-chunking-with-contextual-linking\n",
- "alt: Content Chunking with Contextual Linking\n",
- "scale: 50%\n",
+ "name: rag_pipeline\n",
+ "alt: RAG Pipeline\n",
+ "scale: 99%\n",
"align: center\n",
"---\n",
- "Content Chunking with Contextual Linking Schematic Representation.\n",
- "```\n",
- "\n",
- "The diagram in {numref}`content-chunking-with-contextual-linking` illustrates the process we will follow for handling long-form content generation with Large Language Models through \"Content Chunking with Contextual Linking.\" It shows how input content is first split into manageable chunks using a chunking function (e.g. `CharacterTextSplitter` with `tiktoken` tokenizer), then each chunk is processed sequentially while maintaining context from previous chunks. For each chunk, the system updates the context, generates a dynamic prompt with specific parameters, makes a call to the LLM chain, and stores the response. After all chunks are processed, the individual responses are combined with newlines to create the final report, effectively working around the token limit constraints of LLMs while maintaining coherence across the generated content.\n",
+ "Simplified RAG Pipeline\n",
+ "```\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Preparing the Knowledge Base\n",
"\n",
- "**Step 1: Chunking the Content**\n",
+ "Every RAG system requires a knowledge base. In our case, the knowledge base is a set of documents that we equip the LLM to answer our authorship question.\n",
"\n",
- "There are different methods for chunking, and each of them might be appropriate for different situations. However, we can broadly group chunking strategies in two types:\n",
- "- **Fixed-size Chunking**: This is the most common and straightforward approach to chunking. We simply decide the number of tokens in our chunk and, optionally, whether there should be any overlap between them. In general, we will want to keep some overlap between chunks to make sure that the semantic context doesn’t get lost between chunks. Fixed-sized chunking may be a reasonable path in many common cases. Compared to other forms of chunking, fixed-sized chunking is computationally cheap and simple to use since it doesn’t require the use of any specialied techniques or libraries.\n",
- "- **Content-aware Chunking**: These are a set of methods for taking advantage of the nature of the content we’re chunking and applying more sophisticated chunking to it. Examples include:\n",
- " - **Sentence Splitting**: Many models are optimized for embedding sentence-level content. Naturally, we would use sentence chunking, and there are several approaches and tools available to do this, including naive splitting (e.g. splitting on periods), NLTK, and spaCy.\n",
- " - **Recursive Chunking**: Recursive chunking divides the input text into smaller chunks in a hierarchical and iterative manner using a set of separators.\n",
- " - **Semantic Chunking**: This is a class of methods that leverages embeddings to extract the semantic meaning present in your data, creating chunks that are made up of sentences that talk about the same theme or topic.\n",
+ "Hence, we will compose our knowledge base by adding the web version of (some of the chapters of) the book \"Taming LLMs\", namely:\n",
+ "- Introduction\n",
+ "- Structured Output\n",
+ "- Input (this very chapter)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "book_url = \"https://www.tamingllms.com/\"\n",
+ "chapters = [\"markdown/intro.html\",\n",
+ " \"notebooks/structured_output.html\",\n",
+ " \"notebooks/input.html\"]\n",
"\n",
- " Here, we will utilize `langchain` for a content-aware sentence-splitting strategy for chunking. Langchain offers several text splitters {cite}`langchain_text_splitters` such as JSON-, Markdown- and HTML-based or split by token. We will use the `CharacterTextSplitter` with `tiktoken` as our tokenizer to count the number of tokens per chunk which we can use to ensure that we do not surpass the input token limit of our model.\n"
+ "chapter_urls = [f\"{book_url}/{chapter}\" for chapter in chapters]\n",
+ "chapter_ids = [chapter.split(\"/\")[-1].replace(\".html\", \"\") for chapter in chapters]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We use `Docling` to download the chapters from the web and parse them as markdown files."
]
},
{
@@ -1679,36 +1755,57 @@
"metadata": {},
"outputs": [],
"source": [
- "def get_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:\n",
- " \"\"\"\n",
- " Split input text into chunks of specified size with specified overlap.\n",
+ "chapters = [converter.convert(chapter_url).document.export_to_markdown() for chapter_url in chapter_urls]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now we are ready to store the chapters in a vector database to enable the construction of a retrieval system."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Vector Database\n",
"\n",
- " Args:\n",
- " text (str): The input text to be chunked.\n",
- " chunk_size (int): The maximum size of each chunk in tokens.\n",
- " chunk_overlap (int): The number of tokens to overlap between chunks.\n",
+ "Vector databases are specialized databases designed to store and retrieve high-dimensional vectors, which are mathematical representations of data like text, images, or audio. These databases are optimized for similarity search operations, making them ideal for embeddings-based retrieval systems.\n",
"\n",
- " Returns:\n",
- " list: A list of text chunks.\n",
- " \"\"\"\n",
- " from langchain_text_splitters import CharacterTextSplitter\n",
+ "A typical pipeline involving a vector database includes the following:\n",
"\n",
- " text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size, chunk_overlap=chunk_overlap)\n",
- " return text_splitter.split_text(text)\n"
+ "1. Input data is converted into \"documents\" forming a collection representing our knowledge base\n",
+ "2. Each document is converted into an embedding which are stored in the vector database\n",
+ "3. Embeddings are indexed in the vector database for efficient similarity search\n",
+ "4. The vector database is queried to retrieve the most relevant documents\n",
+ "5. The retrieved documents are used to answer questions\n",
+ "\n",
+ "Vector databases are not a mandatory component of RAG systems. In fact, we can use a simple list of strings to store the chapters (or their chunks) and then use the LLM to answer questions about the document. However, vector databases are useful for RAG applications as they enable:\n",
+ "- Fast similarity search for finding relevant context\n",
+ "- Efficient storage of document embeddings\n",
+ "- Scalable retrieval for large document collections\n",
+ "- Flexible querying with metadata filters\n",
+ "\n",
+ "In that way, RAG applications can be seen as a retrieval system that uses a vector database to store and retrieve embeddings of documents, which in turn are used to augment LLMs with contextually relevant information as we will see in the next sections.\n",
+ "\n",
+ "Here, we will use ChromaDB {cite}`chromadb2024docs` as an example of an open source vector database but key features and concepts we cover are applicable to other vector databases, in general.\n",
+ "\n",
+ "ChromaDB is a popular open-source vector database that offers:\n",
+ "- Efficient storage and retrieval of embeddings\n",
+ "- Support for metadata and filtering\n",
+ "- Easy integration with Python applications\n",
+ "- In-memory and persistent storage options\n",
+ "- Support for multiple distance metrics\n",
+ "\n",
+ "Other notable vector databases include Weaviate, FAISS, and Milvus."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "**Step 2: Writing the Base Prompt Template**\n",
- "\n",
- "We will write a base prompt template which will serve as a foundational structure for all chunks, ensuring consistency in the instructions and context provided to the language model. The template includes the following parameters:\n",
- "- `role`: Defines the role or persona the model should assume.\n",
- "- `context`: Provides the background information or context for the task.\n",
- "- `instruction`: Specifies the task or action the model needs to perform.\n",
- "- `input_text`: Contains the actual text input that the model will process.\n",
- "- `requirements`: Lists any specific requirements or constraints for the output."
+ "In ChromaDB, we can create a vector database client as follows."
]
},
{
@@ -1717,26 +1814,17 @@
"metadata": {},
"outputs": [],
"source": [
- "from langchain_core.prompts import PromptTemplate\n",
- "def get_base_prompt_template() -> str:\n",
- " \n",
- " base_prompt = \"\"\"\n",
- " ROLE: {role}\n",
- " CONTEXT: {context}\n",
- " INSTRUCTION: {instruction}\n",
- " INPUT: {input}\n",
- " REQUIREMENTS: {requirements}\n",
- " \"\"\"\n",
- " \n",
- " prompt = PromptTemplate.from_template(base_prompt)\n",
- " return prompt"
+ "import chromadb\n",
+ "chroma_client = chromadb.Client()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "We will write a simple function that returns an `LLMChain` which is a simple `langchain` construct that allows you to chain together a combination of prompt templates, language models and output parsers."
+ "This will create a vector database in memory. We can also create a persistent vector database by specifying a path to a directory or alternatively by using a cloud-based vector database service like AWS, Azure or GCP. We will use a vector database in memory for this example.\n",
+ "\n",
+ "Next, we create a collection to store the embeddings of the chapters. And add our chapters as documents to the collection as follows."
]
},
{
@@ -1745,45 +1833,19 @@
"metadata": {},
"outputs": [],
"source": [
- "from langchain_core.output_parsers import StrOutputParser\n",
- "from langchain_community.chat_models import ChatLiteLLM\n",
+ "collection = chroma_client.create_collection(name=\"taming_llms\")\n",
"\n",
- "def get_llm_chain(prompt_template: str, model_name: str, temperature: float = 0):\n",
- " \"\"\"\n",
- " Returns an LLMChain instance using langchain.\n",
- "\n",
- " Args:\n",
- " prompt_template (str): The prompt template to use.\n",
- " model_name (str): The name of the model to use.\n",
- " temperature (float): The temperature setting for the model.\n",
- "\n",
- " Returns:\n",
- " llm_chain: An instance of the LLMChain.\n",
- " \"\"\"\n",
- " \n",
- " from dotenv import load_dotenv\n",
- " import os\n",
- "\n",
- " # Load environment variables from .env file\n",
- " load_dotenv()\n",
- " \n",
- " api_key_label = model_name.split(\"/\")[0].upper() + \"_API_KEY\"\n",
- " llm = ChatLiteLLM(\n",
- " model=model_name,\n",
- " temperature=temperature,\n",
- " api_key=os.environ[api_key_label],\n",
- " )\n",
- " llm_chain = prompt_template | llm | StrOutputParser()\n",
- " return llm_chain"
+ "collection.add(\n",
+ " documents=chapters,\n",
+ " ids=chapter_ids\n",
+ ")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "**Step 3: Constructing Dynamic Prompt Parameters**\n",
- "\n",
- "Now, we will write a function (`get_dynamic_prompt_template`) that constructs prompt parameters dynamically for each chunk."
+ "We are ready to query the collection. We write a simple function that takes the collection, input query and number of retrieved results as argument and returns the retrieved documents."
]
},
{
@@ -1792,59 +1854,19 @@
"metadata": {},
"outputs": [],
"source": [
- "from typing import Dict\n",
- "def get_dynamic_prompt_params(prompt_params: Dict, \n",
- " part_idx: int, \n",
- " total_parts: int,\n",
- " chat_context: str,\n",
- " chunk: str) -> str:\n",
- " \"\"\"\n",
- " Construct prompt template dynamically per chunk while maintaining the chat context of the response generation.\n",
- " \n",
- " Args:\n",
- " prompt_params (Dict): Original prompt parameters\n",
- " part_idx (int): Index of current conversation part\n",
- " total_parts (int): Total number of conversation parts\n",
- " chat_context (str): Chat context from previous parts\n",
- " chunk (str): Current chunk of text to be processed\n",
- " Returns:\n",
- " str: Dynamically constructed prompt template with part-specific params\n",
- " \"\"\"\n",
- " dynamic_prompt_params = prompt_params.copy()\n",
- " # saves the chat context from previous parts\n",
- " dynamic_prompt_params[\"context\"] = chat_context\n",
- " # saves the current chunk of text to be processed as input\n",
- " dynamic_prompt_params[\"input\"] = chunk\n",
- " \n",
- " # Add part-specific instructions\n",
- " if part_idx == 0: # Introduction part\n",
- " dynamic_prompt_params[\"instruction\"] = f\"\"\"\n",
- " You are generating the Introduction part of a long report.\n",
- " Don't cover any topics yet, just define the scope of the report.\n",
- " \"\"\"\n",
- " elif part_idx == total_parts - 1: # Conclusion part\n",
- " dynamic_prompt_params[\"instruction\"] = f\"\"\"\n",
- " You are generating the last part of a long report. \n",
- " For this part, first discuss the below INPUT. Second, write a \"Conclusion\" section summarizing the main points discussed given in CONTEXT.\n",
- " \"\"\"\n",
- " else: # Main analysis part\n",
- " dynamic_prompt_params[\"instruction\"] = f\"\"\"\n",
- " You are generating part {part_idx+1} of {total_parts} parts of a long report.\n",
- " For this part, analyze the below INPUT.\n",
- " Organize your response in a way that is easy to read and understand either by creating new or merging with previously created structured sections given in CONTEXT.\n",
- " \"\"\"\n",
- " \n",
- " return dynamic_prompt_params"
+ "def query_collection(collection, query_text, n_results=3):\n",
+ " results = collection.query(\n",
+ " query_texts=[query_text],\n",
+ " n_results=n_results\n",
+ " )\n",
+ " return results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "\n",
- "**Step 4: Generating the Report**\n",
- "\n",
- "Finally, we will write a function that generates the actual report by calling the `LLMChain` with the dynamically updated prompt parameters for each chunk and concatenating the results at the end."
+ "We write a simple query, enquiring the purpose of the book."
]
},
{
@@ -1853,24 +1875,907 @@
"metadata": {},
"outputs": [],
"source": [
- "def generate_report(input_content: str, llm_model_name: str, \n",
- " role: str, requirements: str,\n",
- " chunk_size: int, chunk_overlap: int) -> str:\n",
- " # stores the parts of the report, each generated by an individual LLM call\n",
- " report_parts = [] \n",
- " # split the input content into chunks\n",
- " chunks = get_chunks(input_content, chunk_size, chunk_overlap)\n",
- " # initialize the chat context with the input content\n",
- " chat_context = input_content\n",
- " # number of parts to be generated\n",
- " num_parts = len(chunks)\n",
- "\n",
- " prompt_params = {\n",
- " \"role\": role, # user-provided\n",
- " \"context\": \"\", # dinamically updated per part\n",
- " \"instruction\": \"\", # dynamically updated per part\n",
- " \"input\": \"\", # dynamically updated per part\n",
- " \"requirements\": requirements #user-priovided\n",
+ "q = \"What is the purpose of this book?\"\n",
+ "res = query_collection(collection, q)\n",
+ "res.get(\"ids\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print([['intro', 'input', 'structured_output']])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As response, we obtain an object that contains several attributes including:\n",
+ "- `documents`: The actual documents retrieved from the collection, i.e. the chapters \n",
+ "- `ids`: The ids of the documents retrieved from the collection\n",
+ "- `distances`: The distances of the documents to the query vector\n",
+ "\n",
+ "We can see that the chapters \"Introduction\", \"Input\" and \"Structured Output\" are retrieved from the collection ordered by their distance to the query vector.\n",
+ "\n",
+ "We observe that the Introduction chapter is the most relevant one as it ranks first, followed by the Input and Structured Output chapters. Indeed, the purpose of the book is included in the Introduction chapter demonstrating the retrieval system successfully retrieved the most relevant document to the input query, in this simple example.\n",
+ "\n",
+ "In order to understand how the retrieval system works and how the \"distance to the query vector\" is computed, we need to understand how the embeddings are created and how the documents are indexed."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Embeddings**\n",
+ "\n",
+ "Embeddings are numerical representations of data (including text, images, audio, etc.) that capture meaning, allowing machines to process data quantitatively. Each embedding can be represented as a vector of floating-point numbers such that embedded data with similar meanings produce similar, i.e. close, vectors [^embeddings_definition].\n",
+ "\n",
+ "[^embeddings_definition]: Bengio et al. {cite}`bengio2014representationlearningreviewnew` provide serves as an excellent reference for representation learning in general including embeddings. OpenAI provides a good intro to Embeddings for developers {cite}`openai2024embeddings`\n",
+ "\n",
+ "For text data, small distances among embeddings suggest high semantic relatedness and large distances suggest low semantic relatedness among the embedded texts. HuggingFace provides a leaderboard of embeddings models {cite}`huggingface2024mteb`, which are ranked by in dimensions such as classification, clustering and reranking performance.\n",
+ "\n",
+ "Behind the scenes, ChromaDB is using the model `all-MiniLM-L6-v2` by default [^chroma_embeddings] to create embeddings for the input documents and the query (see {numref}`embedding`). This model is available in `sentence_transformers` {cite}`sentencetransformers2024website`. Let's see how it works.\n",
+ "\n",
+ "```{figure} ../_static/input/embedding.svg\n",
+ "---\n",
+ "name: embedding\n",
+ "alt: Embedding\n",
+ "scale: 70%\n",
+ "align: center\n",
+ "---\n",
+ "Embedding\n",
+ "```\n",
+ "\n",
+ "[^chroma_embeddings]: ChromaDB enables custom embedding functions and provides a list of wrappers around commonly used embedding models and APIs https://docs.trychroma.com/docs/embeddings/embedding-functions"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sentence_transformers import SentenceTransformer\n",
+ "\n",
+ "embedding_model = SentenceTransformer('all-MiniLM-L6-v2')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We replicate what ChromaDB did by embedding our chapters as well as input query using sentence transformers."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "(4, 384)\n"
+ ]
+ }
+ ],
+ "source": [
+ "q = \"What is the purpose of this book?\"\n",
+ "docs_to_embed = [q] + chapters\n",
+ "embeddings = embedding_model.encode(docs_to_embed)\n",
+ "print(embeddings.shape)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a result, we obtain four 384-dimensional vectors representing our embeddings (one for each of the three chapters and one for the input query).\n",
+ "\n",
+ "Now we can calculate similarity among the embeddings. By default, sentence transformers uses cosine similarity to calculate the similarity between embeddings. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "similarities = embedding_model.similarity(embeddings, embeddings)\n",
+ "similarities"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "```\n",
+ "tensor([[1.0000, 0.4402, 0.3022, 0.4028],\n",
+ " [0.4402, 1.0000, 0.6606, 0.5807],\n",
+ " [0.3022, 0.6606, 1.0000, 0.6313],\n",
+ " [0.4028, 0.5807, 0.6313, 1.0000]])\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's visualize the similarity matrix to better understand the relationships between our documents in {numref}`similarities`. The top row of the matrix represents the similarity of the input query against all chapters. That's exactly what we previously obtained by querying ChromaDB which returned a response with documents ranked by similarity to input query.\n",
+ "\n",
+ "```{figure} ../_static/input/similarity.png\n",
+ "---\n",
+ "name: similarities\n",
+ "alt: Similarity matrix heatmap\n",
+ "scale: 90%\n",
+ "align: center\n",
+ "---\n",
+ "Similarity matrix heatmap showing relationships among query and chapters.\n",
+ "``` \n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Calculating similarity among embeddings can become computationally intensive if brute force is used, i.e. pair-wise computation, as the number of documents grows in the knowledge base. Indexing is a technique to help address this challenge."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Indexing**\n",
+ "\n",
+ "Indexing is a crucial optimization technique that makes similarity searches faster and more efficient.\n",
+ "\n",
+ "Without indexing, finding similar vectors would require an exhaustive search - comparing a query vector against every single vector in the database. For large datasets, this becomes prohibitively slow.\n",
+ "\n",
+ "Common indexing strategies include:\n",
+ "\n",
+ "1. **Tree-based Indexes**\n",
+ " - Examples include KD-trees and Ball trees\n",
+ " - Work by partitioning the vector space into hierarchical regions\n",
+ " - Effective for low-dimensional data but suffer from the \"curse of dimensionality\"\n",
+ "\n",
+ "2. **Graph-based Indexes**\n",
+ " - HNSW (Hierarchical Navigable Small World) is a prominent example\n",
+ " - Creates a multi-layered graph structure for navigation\n",
+ " - Offers excellent search speed but requires more memory\n",
+ "\n",
+ "3. **LSH (Locality-Sensitive Hashing)**\n",
+ " - Uses hash functions that map similar vectors to the same buckets\n",
+ " - More memory-efficient than graph-based methods\n",
+ " - May sacrifice some accuracy for performance\n",
+ "\n",
+ "4. **Quantization-based Indexes**\n",
+ " - Product Quantization compresses vectors by encoding them into discrete values\n",
+ " - Reduces memory footprint significantly\n",
+ " - Good balance between accuracy and resource usage\n",
+ "\n",
+ "HNSW is the underlying library for Chroma vector indexing and search {cite}`chromadb2024hnsw`. HNSW provides fast searches with high accuracy but uses more memory. LSH and quantization methods offer better memory efficiency but may sacrifice some precision.\n",
+ "\n",
+ "But are indexing + basic embeddings based similarity sufficient? Often not, as we will see next as we cover reranking technique."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Reranking\n",
+ "\n",
+ "Let's go back to querying our vector database. Here are additional examples."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "First, we write a query about how to get structured output from LLMs. Successfully retrieving the \"Structured Output\" chapter from the book as top result."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[['structured_output', 'input', 'intro']]\n"
+ ]
+ }
+ ],
+ "source": [
+ "q = \"How to get structured output from LLMs?\"\n",
+ "res = query_collection(collection, q)\n",
+ "res.get(\"ids\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next, we would like to obtain a tutorial on `Docling`, a tool we covered in this very chapter. However, we fail to obtain the correct chapter and instead obtain the \"Introduction\" chapter as a result."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[['intro', 'input', 'structured_output']]\n"
+ ]
+ }
+ ],
+ "source": [
+ "q = \"Docling tutorial\"\n",
+ "res = query_collection(collection, q)\n",
+ "res.get(\"ids\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Retrieval systems solely based on vector similarity search might miss semantic relevance. That brings the need for techniques that can improve accuracy of the retrieval system. One such technique is re-ranking.\n",
+ "\n",
+ "Re-ranking is a method that can improve accuracy of the retrieval system by re-ranking the retrieved documents based on their relevance to the input query.\n",
+ "\n",
+ "In the following, we will use the `sentence_transformers` library to re-rank the retrieved documents based on their relevance to the input query. We utilize the `CrossEncoder` model to re-rank the documents. Cross-Encoder models are more accurate at judging relevance at the cost of speed compared to basic vector-based similarity. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can implement a reranking step in a RAG system using a Cross-Encoder model in the following steps:\n",
+ "\n",
+ "1. First, we initialize the Cross-Encoder model:\n",
+ "```python\n",
+ "model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=512)\n",
+ "```\n",
+ "- Uses the `ms-marco-MiniLM-L-6-v2` model, which is specifically trained for passage reranking\n",
+ "- Sets a maximum sequence length of 512 tokens\n",
+ "- This model is designed to score the relevance between query-document pairs\n",
+ "\n",
+ "2. Then we perform the reranking:\n",
+ "```python\n",
+ "scores = model.predict([(q, doc) for doc in res[\"documents\"][0]])\n",
+ "```\n",
+ "- Creates pairs of (query, document) for each retrieved document\n",
+ "- The model predicts relevance scores for each pair\n",
+ "- Higher scores indicate better semantic match between query and document\n",
+ "\n",
+ "3. Finally, we select the best match:\n",
+ "```python\n",
+ "print(res[\"documents\"][0][np.argmax(scores)])\n",
+ "```\n",
+ "- `np.argmax(scores)` finds the index of the highest scoring document\n",
+ "- Uses that index to retrieve the most relevant document\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We obtain the following scores for the retrieved documents (\"intro\", \"input\", \"structured_output\"), the higher the score, the more relevant the document is in relation to the input query.\n",
+ "\n",
+ "```\n",
+ "array([-8.52623 , -6.328738, -8.750055], dtype=float32)\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a result, we obtain the index of the highest scoring document, which corresponds to the \"input\" chapter. Hence, the re-ranking step successfully retrieved the correct chapter."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "input\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(res[\"ids\"][0][np.argmax(scores)])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The ideia is to first run semantic similarity on embeddings, which should be fast but potentially inaccurate, and then run re-raking on the top-k results, which is more accurate but slower. By doing so, we can balance the speed and accuracy of the retrieval system.\n",
+ "\n",
+ "Hence, instead of going over all retrieved documents:\n",
+ "```python\n",
+ "scores = model.predict([(q, doc) for doc in res[\"documents\"][0]])\n",
+ "```\n",
+ "We would run reranking on the TOPK results, where TOPK <<< number of documents:\n",
+ "```python\n",
+ "scores = model.predict([(q, doc) for doc in res[\"documents\"][0][:TOPK]])\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### LLMs with RAG\n",
+ "\n",
+ "We are finally ready to use the retrieval system to help the LLM answer our authorship question. A common way to integrate RAGs with LLMs is via in-context learning. With in-context learning the LLM learns from the retrieved documents by providing them in the context window as represented in {numref}`incontext`. This is accomplished via a prompt template structure as follows.\n",
+ "\n",
+ "```{figure} ../_static/input/incontext.svg\n",
+ "---\n",
+ "name: incontext\n",
+ "alt: In-Context Learning\n",
+ "scale: 95%\n",
+ "align: center\n",
+ "---\n",
+ "RAG LLM with In-Context Learning\n",
+ "```\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "```python\n",
+ " rag_system_prompt_template = f\"\"\"\n",
+ " You are a helpful assistant that answers questions based on the provided CONTEXT.\n",
+ "\n",
+ " CONTEXT: {context}\n",
+ " \"\"\"\n",
+ "\n",
+ " user_prompt_template = f\"\"\"\n",
+ " QUESTION: {input}\n",
+ " \"\"\"\n",
+ "```\n",
+ "\n",
+ "This prompt strategy demonstrates a common in-context learning pattern where retrieved documents are incorporated into the LLM's context to enhance response accuracy and relevance. The prompt structure typically consists of a system prompt that:\n",
+ "- Sets clear boundaries for the LLM to use information from the provided context\n",
+ "- Includes the retrieved documents as context\n",
+ "\n",
+ "This approach:\n",
+ "- Reduces hallucination by grounding responses in source documents\n",
+ "- Improves answer relevance by providing contextually relevant information to the LLM\n",
+ "\n",
+ "The context variable is typically populated with the highest-scoring document(s) from the retrieval step, while the input variable contains the user's original query."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def RAG_qa(client, model, context, input):\n",
+ " \"\"\"\n",
+ " Generate a summary of input using a given model\n",
+ " \"\"\"\n",
+ " rag_system_prompt_template = f\"\"\"You are a helpful assistant that answers questions based on the provided CONTEXT.\n",
+ "\n",
+ " CONTEXT: {context}\n",
+ " \"\"\"\n",
+ " \n",
+ " response = client.chat.completions.create(\n",
+ " model=model,\n",
+ " messages=[{\"role\": \"system\", \"content\": rag_system_prompt_template},\n",
+ " {\"role\": \"user\", \"content\": f\"QUESTION: {input}\"}]\n",
+ " )\n",
+ " return response.choices[0].message.content"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "First, we set the LLM."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from dotenv import load_dotenv\n",
+ "import os\n",
+ "\n",
+ "# Load environment variables from .env file\n",
+ "load_dotenv()\n",
+ "\n",
+ "from openai import OpenAI\n",
+ "client = OpenAI()\n",
+ "model = \"gpt-4o-mini\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Then, we run the retrieve step."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "res = query_collection(collection, q)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Next, we run the re-ranking step setting it to consider the `TOPK` retrieved documents."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "TOPK = 2\n",
+ "scores = model.predict([(q, doc) for doc in res[\"documents\"][0][:TOPK]])\n",
+ "res_reranked = res[\"documents\"][0][np.argmax(scores)]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We then pass the top document as context and invoke the LLM with our RAG-based template leading to a successful response."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "The author of the book \"Taming LLMs\" is Tharsis Souza.\n"
+ ]
+ }
+ ],
+ "source": [
+ "answer = RAG_qa(model, res_reranked[0], question)\n",
+ "answer"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In this section, we motivated the use of RAGs as a tool to equip LLMs with relevant context and provided a canonical implementation of its core components. RAGs, however, can be implemented in many shapes and forms and entire books have been written about them. We point the user to additional resources if more specialized techniques and architectures are needed {cite}`kimothi2024simpleguiderag, athinaai2024ragcookbooks, diamant2024ragtechniques, hands-on-llms-book`.\n",
+ "\n",
+ "Next, we discuss RAGs challenges and limitations and conclude our RAGs section envisioning the future of RAGs challenged by the rise of long-context language models.\n",
+ "\n",
+ "### Challenges and Limitations\n",
+ "\n",
+ "While RAG systems offer powerful capabilities for enhancing LLM responses with external knowledge, they face several significant challenges and limitations that require careful consideration:\n",
+ " \n",
+ "- **Data Quality and Accuracy**: The effectiveness of RAG systems fundamentally depends on the quality and reliability of their knowledge sources. When these sources contain inaccurate, outdated, biased, or incomplete information, the system's responses become unreliable. This challenge is particularly acute when dealing with rapidly evolving topics or when sourcing information from unverified channels.\n",
+ " \n",
+ "- **Computational Cost and Latency**: Implementing RAG systems at scale presents computational and operational challenges. The process of embedding documents, maintaining vector databases, and performing similarity searches across large knowledge bases demands computational, budget and operational resources. In real-time applications, these requirements can introduce noticeable latency, potentially degrading the user experience and limiting practical applications.\n",
+ " \n",
+ "- **Explainability and Evaluation**: The complexity of RAG systems, arising from the intricate interaction between retrieval mechanisms and generative models, makes it difficult to trace and explain their reasoning processes. Traditional evaluation metrics often fail to capture the nuanced aspects of RAG performance, such as contextual relevance and factual consistency. This limitation hampers both system improvement and stakeholder trust. Readers are encouraged to read Chapter {ref}`evals` for general LLM evaluation issues as well as consider tools such as Ragas {cite}`ragas2024evaluation` for RAG evaluation.\n",
+ " \n",
+ "- **Hallucination Management**: Though RAG systems help ground LLM responses in source documents, they do not completely eliminate hallucinations. The generative component may still produce content that extrapolates beyond or misinterprets the retrieved context. This risk becomes particularly concerning when the system confidently presents incorrect information with apparent source attribution.\n",
+ "\n",
+ "\n",
+ "Moreover, recent research has shed light on critical limitations of key techniques used in RAGs systems. A relevant finding pertains to reranking, which has shown {cite}`jacob2024drowningdocumentsconsequencesscaling`:\n",
+ "\n",
+ "- **Diminishing Returns**: Performance degrades as the number of documents (K) increases, sometimes performing worse than basic retrievers when dealing with large datasets.\n",
+ "- **Poor Document Discrimination**: Rerankers can be misled by irrelevant documents, sometimes assigning high scores to content with minimal relevance to the query.\n",
+ "- **Consistency Issues**: Performance and relative rankings between different rerankers can vary significantly depending on the number of documents being processed.\n",
+ "\n",
+ "### Will RAGs exist in the future?\n",
+ "\n",
+ "This question is posed as we contrast RAGs with LLMs with long-context windows (LC).\n",
+ "\n",
+ "Recent research has shed light on this specific point {cite}`li2024retrievalaugmentedgenerationlongcontext`, suggesting that, on the one hand, RAGs can be seen as a cost-effective alternative to LC models:\n",
+ "* RAGs offer lower computational cost compared to LC due to the significantly shorter input length required for processing.\n",
+ "* This cost-efficiency arises because RAG reduces the number of input tokens to LLMs, which of course reduces usage cost as pricing is based on the number of input (and output) tokens.\n",
+ "\n",
+ "On the other hand, this RAG benefit is achieved at the cost of performance:\n",
+ "* Recent advancements in LLMs, in particular with Gemini-1.5 and GPT-4o models, demonstrate capabilities in understanding long contexts directly, which enables them to outperform RAG in terms of average performance\n",
+ "* LC models can process extremely long contexts, such as Gemini 1.5 which can handle up to 1 million tokens, and these models benefit from large-scale pretraining to develop strong long-context capabilities.\n",
+ "\n",
+ "This cost-performance trade-off is illustrated in {numref}`LC`, where LC models outperform RAGs in terms of average performance while RAGs are more cost-effective.\n",
+ "\n",
+ "```{figure} ../_static/input/LC.png\n",
+ "---\n",
+ "name: LC\n",
+ "alt: Long-Context LLMs for Superior Performance\n",
+ "scale: 50%\n",
+ "align: center\n",
+ "---\n",
+ "Long-Context LLMs demonstrate superior performance while RAGs are more cost-effective {cite}`li2024retrievalaugmentedgenerationlongcontext`.\n",
+ "```\n",
+ "\n",
+ "{numref}`LC` also shows a model called \"SELF-ROUTE\" which combines RAG and LC by routing queries based on model self-reflection. This is a hybrid approach that reduces computational costs while maintaining performance comparable to LC. The advantage of SELF-ROUTE is most significant for smaller values of *k*, where *k* is the number of retrieved text chunks, and SELF-ROUTE shows a marked improvement in performance over RAG, while as k increases the performance of RAG and SELF-ROUTE approaches that of LC.\n",
+ "\n",
+ "Another example of a hybrid approach that combines the benefits of both LC and RAGs is RetroLLM {cite}`li2024retrollmempoweringlargelanguage`, which is a unified framework that integrates retrieval and generation into a single process, enabling language models to generate fine-grained evidence directly from a corpus. The key contribution is that this approach delivers those benefits while eliminating the need for a separate retriever, addressing limitations of traditional RAG methods. Experimental results demonstrate RetroLLM's superior performance compared to traditional RAG methods, across both in-domain and out-of-domain tasks. It also achieves a significant reduction in token consumption due to its fine-grained evidence retrieval.\n",
+ "\n",
+ "A relevant development in this area is the introduction of LOFT {cite}`lee2024longcontextlanguagemodelssubsume`, a benchmark to assess this paradigm shift from RAGs to LCs, using real-world tasks requiring context up to millions of tokens. Evidence suggests LCs can deliver performance with simplified pipelines compared to RAGs, particularly for tasking requiring multi-hop reasoning over long contexts when using Chain-of-Thought {cite}`wei2023chainofthoughtpromptingelicitsreasoning`. However, LCs can still be outperformed by specialized retrievers, in particular Gecko, a specialized model fine-tuned on extensive text retrieval and similarity tasks.\n",
+ "\n",
+ "Bottom-line: Do we really need RAGs? The answer is conditional:\n",
+ "\n",
+ "* **RAG may be relevant when cost-effectiveness is a key requirement** and where the model needs to access vast amounts of external knowledge without incurring high computational expenses. However, as LLMs context window sizes increase and LLMs cost per input token is decreases, RAG may not be as relevant as it was before.\n",
+ "* **Long-context LLMs are superior when performance is the primary concern**, and the model needs to handle extensive texts that require deep contextual understanding and reasoning.\n",
+ "* **Hybrid approaches like SELF-ROUTE are valuable as they combine the strengths of RAG and LC** offering a practical balance between cost and performance, especially for applications where both factors are critical.\n",
+ "\n",
+ "Ultimately, the choice between RAG, LC, or a hybrid method depends on the specific requirements of the task, available resources, and the acceptable trade-off between cost and performance.\n",
+ "\n",
+ "In a later case study, we demonstrate the power of LCs as we construct a Quiz generator with citations over a large knowledge base without the use of chunking nor RAGs.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## A Note on Frameworks\n",
+ "\n",
+ "We have covered a few open source tools for parsing data and provided a canonical RAG pipeline directly using an open source VectorDB together with an LLM. There is a growing number of frameworks that offer similar functionality wrapping the same core concepts at a higher level of abstraction. The two most popular ones are `Langchain` and `LlamaIndex`. \n",
+ "\n",
+ "For instance, the code below shows how to use `LlamaIndex`'s `LlamaParse` for parsing input documents, which offers support for a wide range of file formats (e.g. .pdf, .pptx, .docx, .xlsx, .html). We we can see that the code is very similar to the one we used for `MarkitDown` and `Docling`.\n",
+ "\n",
+ "```python\n",
+ "from llama_parse import LlamaParse\n",
+ "\n",
+ "# Initialize the parser\n",
+ "parser = LlamaParse(\n",
+ " api_key=\"llx-your-api-key-here\",\n",
+ " result_type=\"markdown\", # Can be \"markdown\" or \"text\"\n",
+ " verbose=True\n",
+ ")\n",
+ "\n",
+ "documents = parser.load_data([\"./doc1.pdf\", \"./doc2.pdf\"])\n",
+ "```\n",
+ "\n",
+ "\n",
+ "\n",
+ "As another example, the code below replicates our ChromaDB-based retrieval system using `LlamaIndex` {cite}`llamaindex2024storing`.\n",
+ "\n",
+ "As we can see, similar concepts are used in both frameworks:\n",
+ "- Documents to represent elements of the knowledge base\n",
+ "- Collections to store the documents\n",
+ "- Indexing of embeddings in the VectorDB and finally\n",
+ "- Querying the VectorDB to retrieve the documents\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "```python\n",
+ "import chromadb\n",
+ "from llama_index.core import VectorStoreIndex, SimpleDirectoryReader\n",
+ "from llama_index.vector_stores.chroma import ChromaVectorStore\n",
+ "from llama_index.core import StorageContext\n",
+ "\n",
+ "# load some documents\n",
+ "documents = SimpleDirectoryReader(\"./data\").load_data()\n",
+ "\n",
+ "# initialize client, setting path to save data\n",
+ "db = chromadb.PersistentClient(path=\"./chroma_db\")\n",
+ "\n",
+ "# create collection\n",
+ "chroma_collection = db.get_or_create_collection(\"tamingllms\")\n",
+ "\n",
+ "# assign chroma as the vector_store to the context\n",
+ "vector_store = ChromaVectorStore(chroma_collection=chroma_collection)\n",
+ "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
+ "\n",
+ "# create your index\n",
+ "index = VectorStoreIndex.from_documents(\n",
+ " documents, storage_context=storage_context\n",
+ ")\n",
+ "\n",
+ "# create a query engine and query\n",
+ "query_engine = index.as_query_engine()\n",
+ "response = query_engine.query(\"Who is the author of Taming LLMs?\")\n",
+ "print(response)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Frameworks are useful for quickly prototyping RAG systems and for building applications on top of them as they provide a higher level of abstraction and integration with third-party libraries. However, the underlying concepts are the same as the ones we have covered in this chapter. More often than not, problems arise when developers either do not understand the underlying concepts or fail to understand the details of the implement behind the abstractions provided by the framework. Therefore, it is recommended to try and start your implementation using lower level tools as much as possible and only when (i) the underlying problem as well as (ii) the desired solution are well understood, then consider moving to higher level frameworks if really needed."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Case Studies\n",
+ "\n",
+ "This section presents two case studies to complement topics we have covered in this chapter in the context of managing input data for LLMs.\n",
+ "\n",
+ "First, we cover content chunking, in particular Content Chunking with Contextual Linking which showcases how intelligent chunking strategies can overcome both context window and output token limitations. This case study illustrates techniques for breaking down and reassembling content while maintaining coherence, enabling the generation of high-quality long-form outputs despite model constraints.\n",
+ "\n",
+ "Second, we build a Quiz generator with citations using long context window. Not all knowledge intense applications require RAGs. In this case study, we show how to use long context window as well as some additional input management techniques such as prompt caching for efficiency and reference management to enhance response accuracy and verifiability. These approaches show how to maximize the benefits of larger context models while maintaining response quality."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Case Study I: Content Chunking with Contextual Linking\n",
+ "\n",
+ "Content chunking is commonly used to breakdown long-form content into smaller, manageable chunks. In the context of RAGs, this can be helpful not only to help the retrieval system find more contextually relevant documents but also lead to a more cost efficient LLM solution since fewer tokens are processed in the context window. Furthermore, semantic chunking can increase accuracy of RAG systems {cite}`zenml2024rag`.\n",
+ "\n",
+ "Content chunking with contextual linking is a chunking technique that seeks to split input content while keeping chunk-specific context, hence allowing the LLM to maintain coherence and context when generating responses per chunks. In that way, this technique tackles two key problems:\n",
+ "1. The LLM's inability to process long inputs to do context-size limits\n",
+ "2. The LLM's inability to maintain coherence and context when generating responses per chunks\n",
+ "\n",
+ "As a consequence, a third problem is also tackled: LLM's inability to generate long-form content due to the `max_output_tokens` limitation. Since we generate responses per chunk, as we will see later, we end up with a solution that is capable of generating long-form content while maintaining coherence.\n",
+ "\n",
+ "We exemplify this technique by following these steps:\n",
+ "1. **Chunking the Content**: The input content is split into smaller chunks. This allows the LLM to process each chunk individually, focusing on generating a complete and detailed response for that specific section of the input.\n",
+ "\n",
+ "2. **Maintaining Context**: Each chunk is linked with contextual information from the previous chunks. This helps in maintaining the flow and coherence of the content across multiple chunks.\n",
+ "\n",
+ "3. **Generating Linked Prompts**: For each chunk, a prompt is generated that includes the chunk's content and its context. This prompt is then used to generate the output for that chunk.\n",
+ "\n",
+ "4. **Combining the Outputs**: The outputs of all chunks are combined to form the final long-form content.\n",
+ "\n",
+ "Let's examine an example implementation of this technique.\n",
+ "\n",
+ "#### Generating long-form content\n",
+ "\n",
+ "- Goal: Generate a long-form report analyzing a company's financial statement.\n",
+ "- Input: A company's 10K SEC filing.\n",
+ "\n",
+ "```{figure} ../_static/structured_output/diagram1.png\n",
+ "---\n",
+ "name: content-chunking-with-contextual-linking\n",
+ "alt: Content Chunking with Contextual Linking\n",
+ "scale: 50%\n",
+ "align: center\n",
+ "---\n",
+ "Content Chunking with Contextual Linking Schematic Representation.\n",
+ "```\n",
+ "\n",
+ "The diagram in {numref}`content-chunking-with-contextual-linking` illustrates the process we will follow for handling long-form content generation with Large Language Models through \"Content Chunking with Contextual Linking.\" It shows how input content is first split into manageable chunks using a chunking function (e.g. `CharacterTextSplitter` with `tiktoken` tokenizer), then each chunk is processed sequentially while maintaining context from previous chunks. For each chunk, the system updates the context, generates a dynamic prompt with specific parameters, makes a call to the LLM chain, and stores the response. After all chunks are processed, the individual responses are combined with newlines to create the final report, effectively working around the token limit constraints of LLMs while maintaining coherence across the generated content.\n",
+ "\n",
+ "**Step 1: Chunking the Content**\n",
+ "\n",
+ "There are different methods for chunking, and each of them might be appropriate for different situations. However, we can broadly group chunking strategies in two types:\n",
+ "- **Fixed-size Chunking**: This is the most common and straightforward approach to chunking. We simply decide the number of tokens in our chunk and, optionally, whether there should be any overlap between them. In general, we will want to keep some overlap between chunks to make sure that the semantic context doesn’t get lost between chunks. Fixed-sized chunking may be a reasonable path in many common cases. Compared to other forms of chunking, fixed-sized chunking is computationally cheap and simple to use since it doesn’t require the use of any specialied techniques or libraries.\n",
+ "- **Content-aware Chunking**: These are a set of methods for taking advantage of the nature of the content we’re chunking and applying more sophisticated chunking to it. Examples include:\n",
+ " - **Sentence Splitting**: Many models are optimized for embedding sentence-level content. Naturally, we would use sentence chunking, and there are several approaches and tools available to do this, including naive splitting (e.g. splitting on periods), NLTK, and spaCy.\n",
+ " - **Recursive Chunking**: Recursive chunking divides the input text into smaller chunks in a hierarchical and iterative manner using a set of separators.\n",
+ " - **Semantic Chunking**: This is a class of methods that leverages embeddings to extract the semantic meaning present in your data, creating chunks that are made up of sentences that talk about the same theme or topic.\n",
+ "\n",
+ " Here, we will utilize `langchain` for a content-aware sentence-splitting strategy for chunking. Langchain offers several text splitters {cite}`langchain_text_splitters` such as JSON-, Markdown- and HTML-based or split by token. We will use the `CharacterTextSplitter` with `tiktoken` as our tokenizer to count the number of tokens per chunk which we can use to ensure that we do not surpass the input token limit of our model.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def get_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:\n",
+ " \"\"\"\n",
+ " Split input text into chunks of specified size with specified overlap.\n",
+ "\n",
+ " Args:\n",
+ " text (str): The input text to be chunked.\n",
+ " chunk_size (int): The maximum size of each chunk in tokens.\n",
+ " chunk_overlap (int): The number of tokens to overlap between chunks.\n",
+ "\n",
+ " Returns:\n",
+ " list: A list of text chunks.\n",
+ " \"\"\"\n",
+ " from langchain_text_splitters import CharacterTextSplitter\n",
+ "\n",
+ " text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size, chunk_overlap=chunk_overlap)\n",
+ " return text_splitter.split_text(text)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Step 2: Writing the Base Prompt Template**\n",
+ "\n",
+ "We will write a base prompt template which will serve as a foundational structure for all chunks, ensuring consistency in the instructions and context provided to the language model. The template includes the following parameters:\n",
+ "- `role`: Defines the role or persona the model should assume.\n",
+ "- `context`: Provides the background information or context for the task.\n",
+ "- `instruction`: Specifies the task or action the model needs to perform.\n",
+ "- `input_text`: Contains the actual text input that the model will process.\n",
+ "- `requirements`: Lists any specific requirements or constraints for the output."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain_core.prompts import PromptTemplate\n",
+ "def get_base_prompt_template() -> str:\n",
+ " \n",
+ " base_prompt = \"\"\"\n",
+ " ROLE: {role}\n",
+ " CONTEXT: {context}\n",
+ " INSTRUCTION: {instruction}\n",
+ " INPUT: {input}\n",
+ " REQUIREMENTS: {requirements}\n",
+ " \"\"\"\n",
+ " \n",
+ " prompt = PromptTemplate.from_template(base_prompt)\n",
+ " return prompt"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We will write a simple function that returns an `LLMChain` which is a simple `langchain` construct that allows you to chain together a combination of prompt templates, language models and output parsers."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain_core.output_parsers import StrOutputParser\n",
+ "from langchain_community.chat_models import ChatLiteLLM\n",
+ "\n",
+ "def get_llm_chain(prompt_template: str, model_name: str, temperature: float = 0):\n",
+ " \"\"\"\n",
+ " Returns an LLMChain instance using langchain.\n",
+ "\n",
+ " Args:\n",
+ " prompt_template (str): The prompt template to use.\n",
+ " model_name (str): The name of the model to use.\n",
+ " temperature (float): The temperature setting for the model.\n",
+ "\n",
+ " Returns:\n",
+ " llm_chain: An instance of the LLMChain.\n",
+ " \"\"\"\n",
+ " \n",
+ " from dotenv import load_dotenv\n",
+ " import os\n",
+ "\n",
+ " # Load environment variables from .env file\n",
+ " load_dotenv()\n",
+ " \n",
+ " api_key_label = model_name.split(\"/\")[0].upper() + \"_API_KEY\"\n",
+ " llm = ChatLiteLLM(\n",
+ " model=model_name,\n",
+ " temperature=temperature,\n",
+ " api_key=os.environ[api_key_label],\n",
+ " )\n",
+ " llm_chain = prompt_template | llm | StrOutputParser()\n",
+ " return llm_chain"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Step 3: Constructing Dynamic Prompt Parameters**\n",
+ "\n",
+ "Now, we will write a function (`get_dynamic_prompt_template`) that constructs prompt parameters dynamically for each chunk."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from typing import Dict\n",
+ "def get_dynamic_prompt_params(prompt_params: Dict, \n",
+ " part_idx: int, \n",
+ " total_parts: int,\n",
+ " chat_context: str,\n",
+ " chunk: str) -> str:\n",
+ " \"\"\"\n",
+ " Construct prompt template dynamically per chunk while maintaining the chat context of the response generation.\n",
+ " \n",
+ " Args:\n",
+ " prompt_params (Dict): Original prompt parameters\n",
+ " part_idx (int): Index of current conversation part\n",
+ " total_parts (int): Total number of conversation parts\n",
+ " chat_context (str): Chat context from previous parts\n",
+ " chunk (str): Current chunk of text to be processed\n",
+ " Returns:\n",
+ " str: Dynamically constructed prompt template with part-specific params\n",
+ " \"\"\"\n",
+ " dynamic_prompt_params = prompt_params.copy()\n",
+ " # saves the chat context from previous parts\n",
+ " dynamic_prompt_params[\"context\"] = chat_context\n",
+ " # saves the current chunk of text to be processed as input\n",
+ " dynamic_prompt_params[\"input\"] = chunk\n",
+ " \n",
+ " # Add part-specific instructions\n",
+ " if part_idx == 0: # Introduction part\n",
+ " dynamic_prompt_params[\"instruction\"] = f\"\"\"\n",
+ " You are generating the Introduction part of a long report.\n",
+ " Don't cover any topics yet, just define the scope of the report.\n",
+ " \"\"\"\n",
+ " elif part_idx == total_parts - 1: # Conclusion part\n",
+ " dynamic_prompt_params[\"instruction\"] = f\"\"\"\n",
+ " You are generating the last part of a long report. \n",
+ " For this part, first discuss the below INPUT. Second, write a \"Conclusion\" section summarizing the main points discussed given in CONTEXT.\n",
+ " \"\"\"\n",
+ " else: # Main analysis part\n",
+ " dynamic_prompt_params[\"instruction\"] = f\"\"\"\n",
+ " You are generating part {part_idx+1} of {total_parts} parts of a long report.\n",
+ " For this part, analyze the below INPUT.\n",
+ " Organize your response in a way that is easy to read and understand either by creating new or merging with previously created structured sections given in CONTEXT.\n",
+ " \"\"\"\n",
+ " \n",
+ " return dynamic_prompt_params"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "**Step 4: Generating the Report**\n",
+ "\n",
+ "Finally, we will write a function that generates the actual report by calling the `LLMChain` with the dynamically updated prompt parameters for each chunk and concatenating the results at the end."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def generate_report(input_content: str, llm_model_name: str, \n",
+ " role: str, requirements: str,\n",
+ " chunk_size: int, chunk_overlap: int) -> str:\n",
+ " # stores the parts of the report, each generated by an individual LLM call\n",
+ " report_parts = [] \n",
+ " # split the input content into chunks\n",
+ " chunks = get_chunks(input_content, chunk_size, chunk_overlap)\n",
+ " # initialize the chat context with the input content\n",
+ " chat_context = input_content\n",
+ " # number of parts to be generated\n",
+ " num_parts = len(chunks)\n",
+ "\n",
+ " prompt_params = {\n",
+ " \"role\": role, # user-provided\n",
+ " \"context\": \"\", # dinamically updated per part\n",
+ " \"instruction\": \"\", # dynamically updated per part\n",
+ " \"input\": \"\", # dynamically updated per part\n",
+ " \"requirements\": requirements #user-priovided\n",
" }\n",
"\n",
" # get the LLMChain with the base prompt template\n",
@@ -2076,14 +2981,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "### Case Study II: Github RAG\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Case Study III: Quiz Generation with Citations\n",
+ "### Case Study II: Quiz Generation with Citations\n",
"\n",
"In this case study, we will build a Quiz generator with citations that explores additional input management techniques particularly useful with long context windows. The implementation includes prompt caching for efficiency and citation tracking to enhance accuracy and verifiability. We will use Gemini 1.5 Pro as our LLM model, which has a context window of 2M tokens.\n",
"\n",
@@ -2400,7 +3298,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Conclusion"
+ "## Conclusion\n",
+ "\n",
+ "This chapter has explored critical strategies and techniques for managing input data in LLM applications, focusing on three key areas: data parsing, retrieval augmentation, and practical implementation patterns. We examined how parsing tools like MarkItDown and Docling can transform diverse data formats into LLM-compatible representations, demonstrating through case studies how parser quality can impact LLM performance. The chapter also investigated retrieval augmentation techniques, particularly RAG systems, showing how they can enhance LLM capabilities by providing access to external knowledge while discussing their future relevance in the context of emerging long-context language models.\n",
+ "\n",
+ "Through our case studies, we demonstrated practical approaches to handling common challenges in LLM applications. The Content Chunking with Contextual Linking case study illustrated techniques for managing long-form content generation while maintaining coherence across chunks. The Quiz Generation with Citations case study showcased how long-context windows can be effectively utilized without the need for complex retrieval systems, highlighting the importance of choosing the right approach based on specific application requirements rather than defaulting to more complex solutions.\n",
+ "\n",
+ "As the field continues to evolve, the choice between traditional RAG systems and emerging long-context models will likely become increasingly nuanced. While RAGs offer cost-effective solutions for incorporating external knowledge, the rise of long-context models suggests a future where simpler architectures might suffice for many applications. The key insight is that effective input data management requires careful consideration of trade-offs among complexity, cost, and performance, always guided by specific application requirements rather than following a one-size-fits-all approach. Success in building robust LLM applications will depend on understanding these trade-offs and selecting appropriate strategies for each use case."
]
},
{
diff --git a/tamingllms/_build/html/_static/evals/llm_judge.svg b/tamingllms/_build/html/_static/evals/llm_judge.svg
deleted file mode 100644
index 4292dfa..0000000
--- a/tamingllms/_build/html/_static/evals/llm_judge.svg
+++ /dev/null
@@ -1,879 +0,0 @@
-LLM Judge Evaluation SystemLLM-Judgecomponentsapps
App Rankings
-Detailed Scores
-Analysis Report
-
-
Task description
-Scoring guidelines
-Output format
-
Tharsis Souza (Ph.D. Computer Science, UCL University of London) is a computer scientist and product leader specializing in AI-based products. He is a Lecturer at Columbia University’s Master of Science program in Applied Analytics, (incoming) Head of Product, Equities at Citadel, and former Senior VP at Two Sigma Investments. He mentors under-represented students & working professionals to help create a more diverse global AI1 ecosystem.
+
Tharsis Souza (Ph.D. Computer Science, UCL University of London) is a computer scientist and product leader specializing in AI-based products. He is a Lecturer at Columbia University’s Master of Science program in Applied Analytics, (incoming) Head of Product, Equities at Citadel, and former Senior VP at Two Sigma Investments. He mentors under-represented students & working professionals to help create a more diverse global AI ecosystem.
With over 15 years of experience delivering technology products across startups and Fortune 500 companies, he is also an author of numerous scholarly publications and a frequent speaker at academic and business conferences. Grounded on academic background and drawing from practical experience building and scaling up products powered by language models at early-stage startups, major institutions as well as contributing to open source projects, he brings a unique perspective on bridging the gap between LLMs promised potential and their practical implementation challenges to enable the next generation of AI-powered products.
An alternative title of this book could have been “Language Models Behaving Badly”. If you come from a background in financial modeling, you may have noticed the parallel with Emanuel Derman’s seminal work “Models.Behaving.Badly” [Derman, 2011]. This parallel is not coincidental. Just as Derman cautioned against treating financial models as perfect representations of reality, this book aims to highlight the limitations and pitfalls of Large Language Models (LLMs) in practical applications.
The book “Models.Behaving.Badly” by Emanuel Derman, a former physicist and Goldman Sachs quant, explores how financial and scientific models can fail when we mistake them for reality rather than treating them as approximations full of assumptions.
The core premise of his work is that while models can be useful tools for understanding aspects of the world, they inherently involve simplification and assumptions. Derman argues that many financial crises, including the 2008 crash, occurred in part because people put too much faith in mathematical models without recognizing their limitations.
Like financial models that failed to capture the complexity of human behavior and market dynamics, LLMs have inherent constraints. They can hallucinate facts, struggle with logical reasoning, and fail to maintain consistency in long outputs. Their responses, while often convincing, are probabilistic approximations based on training data rather than true understanding, even though humans insist on treating them as “machines that can reason”.
E. Derman. Models.Behaving.Badly.: Why Confusing Illusion with Reality Can Lead to Disaster, on Wall Street and in Life. Free Press, 2011. ISBN 9781439165010. URL: https://books.google.co.uk/books?id=lke_cwM4wm8C.
The release of ChatGPT 3.5 in late 2022 marked a significant moment in the history of artificial intelligence. Within just five days of its launch, the model attracted over a million users, and within two months, it became the fastest-growing consumer application in history with over 100 million monthly active users.
Yet, this raises an intriguing question: Why did ChatGPT 3.5 observe such a dramatic traction when its predecessor, GPT-3, which had the same size/number of parameters, received far less attention from the general public? Arguably, the answer lies not in raw capabilities, but in Preference Alignment.
Through careful fine-tuning using human feedback, OpenAI transformed GPT-3’s raw intelligence into ChatGPT’s helpful and resourceful conversational abilities. This breakthrough demonstrated that aligning language models with human preferences is just as crucial as scaling them to greater sizes.
-
In this chapter, we will explore the process of aligning language models with human preferences via fine-tuning using modern techniques such as Direct Preference Optimization (DPO) [Rafailov et al., 2024]. Next, we will present a practical case study where we align a language model to a user-provided policy in a fully automated fashion leading to an open source model as well as a dataset of policy-aligned preferences.
+
In this chapter, we will explore the process of aligning language models with human preferences via fine-tuning using modern techniques such as Direct Preference Optimization (DPO) [Rafailov et al., 2024]. Next, we will present a practical case study where we align a language model to a user-provided policy in a fully automated fashion leading to an open source model as well as a dataset of policy-aligned preferences.
Common pre-trained LLMs are not helpful to humans by default, in general. This is because state-of-the-art language models are trained on the specific objective of predicting the next token. This is a very different objective than being asked to follow user’s instructions while being safe and helpful. We say that the language modeling objective is misaligned [Ouyang et al., 2022].
Common pre-trained LLMs are not helpful to humans by default, in general. This is because state-of-the-art language models are trained on the specific objective of predicting the next token. This is a very different objective than being asked to follow user’s instructions while being safe and helpful. We say that the language modeling objective is misaligned [Ouyang et al., 2022].
Let’s take a look at GPT-2’s response to the following prompt: “Explain the moon landing to a 6 year old.”
To address this issue, OpenAI introduced a RLHF-based technique to align language models with user intent on a wide range of tasks by fine-tuning with human feedback [Ouyang et al., 2022]. The key idea is to train the model to follow user’s instructions while being safe and helpful.
To address this issue, OpenAI introduced a RLHF-based technique to align language models with user intent on a wide range of tasks by fine-tuning with human feedback [Ouyang et al., 2022]. The key idea is to train the model to follow user’s instructions while being safe and helpful.
-
Fig. 7.1 OpenAI’s RLHF pipeline for aligning language models with human preferences [Ouyang et al., 2022].¶
+
Fig. 7.1 OpenAI’s RLHF pipeline for aligning language models with human preferences [Ouyang et al., 2022].¶
Fig. 7.1 illustrates OpenAI’s 3-step process for training language models to better follow human instructions using RLHF:
@@ -422,7 +422,7 @@
-
Fig. 7.2 Simplified view of the alignment process showing the progression from base model to instruction-tuned model to aligned model [Ouyang et al., 2022].¶
+
Fig. 7.2 Simplified view of the alignment process showing the progression from base model to instruction-tuned model to aligned model [Ouyang et al., 2022].¶
A common pattern has emerged in the development of language models: First, a powerful pre-trained base model is released, which is then fine-tuned, for instance using SFT to create an instruction-following version. This instruct model can then be further aligned with human preferences using techniques such as RLHF to create an aligned version as illustrated in Fig. 7.3.
An aligned model can be fine-tuned directly from a base model or from an instruction-tuned model. For example, Llama Guard 3 [Llama Team, 2024] is a Llama-3.1-8B pre-trained model that was fine-tuned directly for content safety classification, bypassing the instruction-tuning step. Similarly, Zephyr-7B-alpha [Face, 2024] demonstrates direct alignment from a base model - it is a fine-tuned version of Mistral-7B that was trained using Direct Preference Optimization (DPO) on publicly available datasets to create a helpful assistant.
+
An aligned model can be fine-tuned directly from a base model or from an instruction-tuned model. For example, Llama Guard 3 [Llama Team, 2024] is a Llama-3.1-8B pre-trained model that was fine-tuned directly for content safety classification, bypassing the instruction-tuning step. Similarly, Zephyr-7B-alpha [HuggingFace, 2024] demonstrates direct alignment from a base model - it is a fine-tuned version of Mistral-7B that was trained using Direct Preference Optimization (DPO) on publicly available datasets to create a helpful assistant.
The OpenAI paper introduced two key components of this fine-tuning process - SFT for instruction tuning and RLHF (PPO in particular) for alignment. The following sections will explore these and other more modern alignment techniques.
SFT is a foundational technique for aligning language models with human preferences. Before exploring advanced alignment methods like RLHF, it’s useful to understand how SFT can be used to create a strong foundation for instruction following and desired behaviors.
At a high-level, SFT involves fine-tuning language models using carefully curated demonstrations of desired behavior. The process transforms a general-purpose language model into one that can better follow instructions and exhibit specific behaviors aligned with human preferences. Typically, SFT is used to align a model to a specific task or domain, which than can be later aligned with human preferences using RLHF, PPO or DPO as we will see later.
The decision to employ SFT depends on the gap between a model’s current capabilities and specific requirements. SFT proves particularly valuable in scenarios requiring:
[Hong et al., 2024] therefore leading to unintended results and a suboptimal alignment.
-
SFT can be seen as a form of behavior cloning of humans. Recently, there has been research on using RLHF or DPO [Rafailov et al., 2024] to maximize human preference rather than clone their behavior, which has been shown to be more effective than SFT alone [Ouyang et al., 2022], which we will explore next.
+
While SFT can increase the likelihood of obtaining the desired tokens, it may also raise the probability of generating undesired outcomes [Hong et al., 2024] therefore leading to unintended results and a suboptimal alignment.
+
SFT can be seen as a form of behavior cloning of humans. Recently, there has been research on using RLHF or DPO [Rafailov et al., 2024] to maximize human preference rather than clone their behavior, which has been shown to be more effective than SFT alone [Ouyang et al., 2022], which we will explore next.
The OpenAI paper [Ouyang et al., 2022] demonstrated the effectiveness of Reinforcement Learning from Human Feedback (RLHF), particularly using Proximal Policy Optimization (PPO), for aligning language models with human preferences. PPO [Schulman et al., 2017] is a widely used reinforcement learning algorithm that has gained popularity particularly since the release of ChatGPT 3.5. It operates by iteratively updating the policy of an LLM, which can be understood as a set of rules that govern how the model generates text. In the context of RLHF, the policy is updated based on rewards that reflect human preferences. For instance, if a human evaluator prefers one LLM output over another, the policy is adjusted to increase the likelihood of generating outputs similar to the preferred one.
-
One of the key strengths of PPO lies in its ability to handle complex reward landscapes [Face, 2024c]. In many real-world scenarios, the rewards that an LLM receives may be noisy or delayed. For example, in a chatbot application, the reward for generating a good response may not be immediate, as it depends on the user’s subsequent interactions. PPO effectively learns in these situations by using a clipped surrogate objective function, which limits the size of policy updates and ensures stable training. This prevents the model from overreacting to noisy or delayed rewards and helps it converge to a stable and optimal policy.
-
Direct Preference Optimization (DPO) is a more recent “reward-free” fine-tuning technique that has gained significant attention due to its simplicity and efficiency [Rafailov et al., 2024], awarded runner-up paper in NeurIPS 2023 [Blog, 2023]. DPO operates by directly optimizing the policy to maximize the likelihood of preferred responses while minimizing the likelihood of non-preferred responses. As illustrated in Fig. 7.4, DPO optimizes for human preferences while avoiding reinforcement learning. Typical RLHF methods such as PPO fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward. In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, fitting an implicit reward model whose corresponding optimal policy can be extracted in closed form.
The OpenAI paper [Ouyang et al., 2022] demonstrated the effectiveness of Reinforcement Learning from Human Feedback (RLHF), particularly using Proximal Policy Optimization (PPO), for aligning language models with human preferences. PPO [Schulman et al., 2017] is a widely used reinforcement learning algorithm that has gained popularity particularly since the release of ChatGPT 3.5. It operates by iteratively updating the policy of an LLM, which can be understood as a set of rules that govern how the model generates text. In the context of RLHF, the policy is updated based on rewards that reflect human preferences. For instance, if a human evaluator prefers one LLM output over another, the policy is adjusted to increase the likelihood of generating outputs similar to the preferred one.
+
One of the key strengths of PPO lies in its ability to handle complex reward landscapes [HuggingFace, 2024c]. In many real-world scenarios, the rewards that an LLM receives may be noisy or delayed. For example, in a chatbot application, the reward for generating a good response may not be immediate, as it depends on the user’s subsequent interactions. PPO effectively learns in these situations by using a clipped surrogate objective function, which limits the size of policy updates and ensures stable training. This prevents the model from overreacting to noisy or delayed rewards and helps it converge to a stable and optimal policy.
+
Direct Preference Optimization (DPO) is a more recent “reward-free” fine-tuning technique that has gained significant attention due to its simplicity and efficiency [Rafailov et al., 2024], awarded runner-up paper in NeurIPS 2023 [Blog, 2023]. DPO operates by directly optimizing the policy to maximize the likelihood of preferred responses while minimizing the likelihood of non-preferred responses. As illustrated in Fig. 7.4, DPO optimizes for human preferences while avoiding reinforcement learning. Typical RLHF methods such as PPO fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward. In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, fitting an implicit reward model whose corresponding optimal policy can be extracted in closed form.
-
Fig. 7.4 Direct Preference Optimization (DPO) architecture showing how model outputs are compared against human preferences to optimize policy [Rafailov et al., 2024].¶
+
Fig. 7.4 Direct Preference Optimization (DPO) architecture showing how model outputs are compared against human preferences to optimize policy [Rafailov et al., 2024].¶
The key idea is to train the model to prefer responses that align with our desired behavior over responses that do not. DPO works by:
Modern libraries such as HuggingFace’s TRL [Face, 2024d] offer a suite of techniques for fine-tuning language models with reinforcement learning, including PPO, and DPO. It provides a user-friendly interface and a wide range of features for fine-tuning and aligning LLMs, which will be the focus of our case study later in the Chapter.
+
Modern libraries such as HuggingFace’s TRL [HuggingFace, 2024d] offer a suite of techniques for fine-tuning language models with reinforcement learning, including PPO, and DPO. It provides a user-friendly interface and a wide range of features for fine-tuning and aligning LLMs, which will be the focus of our case study later in the Chapter.
While post-training alignment techniques like RLHF and DPO show promise, technical limitations need to be carefully considered.
-
Reinforcement Learning from Human Feedback faces several critical challenges that distinguish it from pre-training or supervised fine-tuning. One key issue is scalability. Recent research suggests that the current RLHF framework does not scale as effectively as the pretraining stage [Hou et al., 2024], in particular presenting the following challenges:
+
Reinforcement Learning from Human Feedback faces several critical challenges that distinguish it from pre-training or supervised fine-tuning. One key issue is scalability. Recent research suggests that the current RLHF framework does not scale as effectively as the pretraining stage [Hou et al., 2024], in particular presenting the following challenges:
As we discussed in the previous section, DPO is a more recent “reward-free” fine-tuning technique that has gained significant attention which derives reward signals directly from pairwise preference data instead of fitting a reward model as in RLHF. With its increasing popularity, emerging research is exploring DPO limitations and potential improvements [Feng et al., 2024], including the following:
Another key issue is model collapse - a phenomenon where model performance degrades with each training iteration.
-
Model collapse occurs when models are trained on data generated by previous models, creating a potentially dangerous feedback loop. This recursive training process can lead to [Kazdan et al., 2024]:
+
Model collapse occurs when models are trained on data generated by previous models, creating a potentially dangerous feedback loop. This recursive training process can lead to [Kazdan et al., 2024]:
Degradation of output quality with each training iteration
Pollution of training data when synthetic samples replace real data
@@ -592,16 +592,16 @@
[Szép et al., 2024], providing practical guidance on data augmentation, regularization methods, and training strategies to maximize performance while minimizing data requirements. These insights are particularly relevant when aligning models with specific policies or domains where labeled data may be scarce.
+
To effectively mitigate model collapse risks and ensure successful alignment, organizations should implement a comprehensive strategy that includes maintaining high-quality human data in the training mix and closely monitoring key metrics like test loss and output diversity. Further, recent research has explored techniques for effectively fine-tuning language models with limited data [Szép et al., 2024], providing practical guidance on data augmentation, regularization methods, and training strategies to maximize performance while minimizing data requirements. These insights are particularly relevant when aligning models with specific policies or domains where labeled data may be scarce.
A recent issue is Alignment faking, a behaviour exhibited by LLMs where they strategically comply with a training objective during training to prevent modifications to their preferred behaviours outside of training as evidenced by the landmark paper by Anthropic [Askell et al., 2024a] in Dec/2024. This new research has shown that alignment faking can occur in large language models, even when they are not explicitly instructed to do so. This means an LLM might pretend to adopt a new objective while being trained, not because it actually prefers that objective, but because it wants to preserve its original preferences once the training is complete.
A recent issue is Alignment faking, a behaviour exhibited by LLMs where they strategically comply with a training objective during training to prevent modifications to their preferred behaviours outside of training as evidenced by the landmark paper by Anthropic [Askell et al., 2024a] in Dec/2024. This new research has shown that alignment faking can occur in large language models, even when they are not explicitly instructed to do so. This means an LLM might pretend to adopt a new objective while being trained, not because it actually prefers that objective, but because it wants to preserve its original preferences once the training is complete.
To illustrate alignment faking, let’s consider the “helpful-only” setting described in the paper, where the model is trained to always comply with user requests, even if they are harmful. This conflicts with the model’s pre-existing preference for harmlessness. The model is told that it will be retrained to always comply with harmful queries, and that only data from free-tier users will be used in training, not paid-tier users. Fig. 7.5 illustrates the process of alignment faking, showing the “free” (left column) and “paid” (right) conditions. Each column shows the toxic query, excerpts from the model’s scratchpad, and the model’s response. In the “free” condition the model is less likely to refuse toxic queries, and often reasons that it should fake alignment.
Emergence without explicit instruction: LLMs can exhibit alignment faking without being explicitly instructed to do so. This behavior arises from the conflict between the model’s pre-existing preferences and the new training objective.
Scale-dependent: Alignment faking is more likely to occur in larger, more capable models. Models like Claude 3 Opus and Claude 3.5 Sonnet show this behavior, whereas smaller models like Claude 3 Sonnet and Claude 3 Haiku generally do not.
Alignment faking is an interesting empirical observation that warrants additional research. An excellent review of the paper by Askell et al. is provided by Prof. Jacob Andreas, Prof. Yoshua Bengio, Prof. Jasjeet Sekhon, and Dr. Rohin Shah in [Askell et al., 2024b].
In this case study, we will align a language model to an user-provided policy. Here, by policy we mean a set of principles and rules that we want the language model to adhere to. All methodology and code introduced solve this general problem of policy-based alignment. However, we will describe a specific use case to illustrate our approach.
Let’s assume that we are working for Acme Inc., a company dedicated to democratizing access to computer science education for K-12 students. Acme Inc. is in the process of creating a chatbot named smolK-12, a small open source LLM, specifically designed for K-12 students.
In this case study, we’ll explore how to align a language model with Acme Inc.’s policy to ensure its LLM-powered applications are safe and appropriate for K-12 students.
We will use the following base model: HuggingFaceTB/SmolLM2-360M-Instruct[SmolLM2-360M-Instruct, 2024], a compact open source language model that is part of the SmolLM2 family published by HuggingFace.
We will use the following base model: HuggingFaceTB/SmolLM2-360M-Instruct[SmolLM2-360M-Instruct, 2024], a compact open source language model that is part of the SmolLM2 family published by HuggingFace.
We will use the following APIs:
HuggingFace Transformers for local model inference
Since we have decided to anchor our Case Study on HuggingFace’s SmolLM2 models [SmolLM2, 2024], it is worth providing a reason for this choice.
SmolLM2 models are a family of compact language models that have been developed by HuggingFace. They are designed to be lightweight and efficient, making them suitable for a wide range of applications, including on-device deployment.
Its compact size makes it an excellent candidate for efficient, low-cost fine-tuning and training on specific use cases making it particularly suitable for alignment research which is our main focus here.
Having said that, it is important to note that reasoning capabilities of SmolLM2 models are not necessarily on par with state-of-the-art LLMs due to its compact size. As we go through this Case Study, it is important to keep this in mind along with several potential issues and limitations, including:
A company policy articulates the principles and standards that the company upholds, ensuring that employees, users and stakeholders understand the expectations regarding safety, ethical conduct, social responsibility, and integrity. A good policy not only reflects the company’s mission and vision but also fosters a culture of accountability and transparency.
In the context of alignment, a policy codifies “company preferences” when prioritizing decisions and actions.
-
In this case study, Acme Inc. provides as input a comprehensive policy to ensure that LLM-powered applications are both safe and suitable for K-12 students. Acme Inc.’s policy adheres to version 0.5 of the AI Safety Benchmark established by MLCommons [Vidgen et al., 2024]. This benchmark encompasses seven critical hazard categories (see Chapter Safety):
+
In this case study, Acme Inc. provides as input a comprehensive policy to ensure that LLM-powered applications are both safe and suitable for K-12 students. Acme Inc.’s policy adheres to version 0.5 of the AI Safety Benchmark established by MLCommons [Vidgen et al., 2024]. This benchmark encompasses seven critical hazard categories (see Chapter Safety):
In order to fine-tune a base model to create an aligned model, we need to construct a dataset of policy-aligned preferences. This dataset will be used to align our base model to our policy.
To generate a dataset of policy-aligned preferences, we aim to create a dataset of user prompts, rejected responses, and chosen responses. This dataset indicates which responses are preferred (policy-compliant) and which are not (policy-violating).
-
Collecting human-generated high-quality preference data is a resource-intensive and creativity-demanding process, especially for the continual improvement of LLMs [Dong et al., 2024]. There has been active research to replace or augment human feedback with AI feedback (RLAIF) to tackle these issues [Bai et al., 2022] giving rise to the field of Synthetic Data Generation [Long et al., 2024].
-
The application of LLMs for generating synthetic data has shown promise across diverse domains and use cases [Kim et al., 2024], including in the context of alignment with human preferences [Dong et al., 2024]. Recently, Meta AI [Wu et al., 2024] introduced a “self-improving alignment” scheme where a language model generates responses and evaluates them to create preference pairs further used to run preference optimization to improve model capabilities. Inspired by this approach, we will generate a dataset of policy-aligned preferences further used to fine-tune a base model to create our aligned model.
+
Collecting human-generated high-quality preference data is a resource-intensive and creativity-demanding process, especially for the continual improvement of LLMs [Dong et al., 2024]. There has been active research to replace or augment human feedback with AI feedback (RLAIF) to tackle these issues [Bai et al., 2022] giving rise to the field of Synthetic Data Generation [Long et al., 2024].
+
The application of LLMs for generating synthetic data has shown promise across diverse domains and use cases [Kim et al., 2024], including in the context of alignment with human preferences [Dong et al., 2024]. Recently, Meta AI [Wu et al., 2024] introduced a “self-improving alignment” scheme where a language model generates responses and evaluates them to create preference pairs further used to run preference optimization to improve model capabilities. Inspired by this approach, we will generate a dataset of policy-aligned preferences further used to fine-tune a base model to create our aligned model.
First, we define a data schema for our dataset. Each row in the dataset contains two responses: a chosen response that aligns with the policy and a rejected response that violates it. Through DPO-optimization, the model is awarded for generating responses that match the chosen, policy-compliant examples rather than the rejected ones:
The ResponseGenerator class creates a dataset of responses from an unaligned base model that we aim to improve through fine-tuning. These responses serve as “rejected” examples in our training data since they may not properly align with safety policies and guidelines. The class supports both local model inference using the Hugging Face Transformers library and remote inference through the Hugging Face Inference API. When instantiated with a model name, it loads the model locally. Otherwise, if a cloud API URL is provided, it connects to the remote API endpoint for inference.
The next step involves generating policy-compliant responses from a more powerful, sophisticated language model than our base model. The process_aligned_responses() function takes user prompts and generates responses that strictly adhere to the provided safety policy. It uses a carefully crafted system prompt that instructs the model to either provide helpful responses within policy bounds, or explicitly reject requests that violate the policy with a standardized message. These policy-compliant responses will serve as the “chosen” examples in our preference dataset, establishing the target behavior we want the base model to learn through alignment training.
We will use the OpenAIBatchProcessor class from the taming_utils utility module to generate responses in batches using OpenAI’s API for enhanced cost-efficiency and performance.
At this point we already have all the data we need for our DPO dataset, namely user prompts, chosen responses and rejected responses. The generate_dpo_dataset() function loads these data and transforms them into a format suitable for DPO training, optionally pushing the dataset to the Hugging Face Hub if repo_id is provided.
Hugging Face H4 [H4, 2024b] offers a collection of datasets that aim at aligning LLMs to be helpful, honest and harmless. Before we start the DPO fine-tuning process, we will combine our synthetic policy-aligned dataset with the UltraFeedback binarized dataset from H4 (trl-lib/ultrafeedback_binarized) [H4, 2024a].
Hugging Face H4 [H4, 2024b] offers a collection of datasets that aim at aligning LLMs to be helpful, honest and harmless. Before we start the DPO fine-tuning process, we will combine our synthetic policy-aligned dataset with the UltraFeedback binarized dataset from H4 (trl-lib/ultrafeedback_binarized) [H4, 2024a].
The UltraFeedback binarized dataset was constructed based on criteria like helpfulness and honesty and can be used to align models to those dimensions. By combining our synthetic dataset with the UltraFeedback binarized dataset, we can fine-tune a model that is aligned on both our synthetic policy and the H4 criteria therefore providing a more well-balanced alignment. The DPO optimization process is shown in Fig. 7.6.
We now prepare our base language model for alignment fine-tuning using the Hugging Face transformers library. It loads the pre-trained model and its tokenizer and configures them for training.
Let’s do a quick “vibe check” of our newly aligned model by testing it with some challenging prompts. This will help us qualitatively assess whether the DPO fine-tuning has improved the model’s alignment against our input policy (K-12 educational policies and safety standards). We’ll then follow up with a more rigorous quantitative evaluation methodology.
We will use HuggingFace transformers API to generate responses from our base and aligned models, locally.
Evaluating alignment presents unique challenges. Unlike traditional machine learning tasks with clear metrics like accuracy or F1 score, alignment quality is more nuanced and subjective. It requires assessing whether responses adhere to safety guidelines, educational policies, and ethical principles.
The gold standard for evaluating alignment is human evaluation. Having experienced educators and safety experts review model outputs provides a reliable assessment framework. However, human evaluation is expensive, time-consuming, and difficult to scale. Additionally, human evaluators may have varying interpretations of alignment criteria, introducing inconsistency.
-
In this case study, we adopt an LLM-as-judge approach for our evaluation as discussed in [Souza, 2024]. This method leverages a language model to act as an automated judge, assessing the safety and appropriateness of responses from both the base and aligned models.
+
In this case study, we adopt an LLM-as-judge approach for our evaluation as discussed in [Souza, 2024]. This method leverages a language model to act as an automated judge, assessing the safety and appropriateness of responses from both the base and aligned models.
The evaluation methodology summarized in Fig. 7.9 consists of three key components that work together to assess model alignment against our policy:
LLMs are complex systems and alignment is a challenging problem. In this chapter, we discussed how post-training techniques can be used to align a language model to human preferences. In the case study, we demonstrated how to use DPO to align a language model to a user-provider policy further automating the process via synthetic data generation and LLM-as-judge evaluation. Our approach serves as a proof of concept and several considerations should be taken into account when using this methodology in practice.
Synthetic Data Generation
-
LLMs can self improve through synthetic data generation [Huang et al., 2022]. This process helps the LLM learn from its own reasoning and improve its overall reasoning ability without relying on human-annotated data. While LLMs can be powerful tools for generating synthetic data, especially in data-scarce domains, it’s important to recognize the potential pitfalls.
-
One major challenge is data distribution bias, where the synthetic data might not accurately mirror the complexities and nuances of real-world data. This can lead to models trained on this data making inaccurate predictions or exhibiting biases. In our case study, we did observe duplicate responses in the synthetic data. Further, the methodology lacks a systematic approach to evaluate the quality of the synthetic data itself only focusing on evals for the consecutive fine-tuned model. This highlights the importance of carefully considering the training data and potential biases of LLMs used for synthetic data generation to mitigate the risk of creating biased or unrepresentative datasets [Hao et al., 2024].
-
Our approach does enable a systematic approach to aligning a model to an input policy. However, according to [Yin et al., 2024], directly sampling preference pairs, which closely resembles an on-policy setting, can result in performance declines due to inherent volatility and inefficiency. Therefore, constructing effective preference data to continuously improve LLMs remains a critical research problem.
+
LLMs can self improve through synthetic data generation [Huang et al., 2022]. This process helps the LLM learn from its own reasoning and improve its overall reasoning ability without relying on human-annotated data. While LLMs can be powerful tools for generating synthetic data, especially in data-scarce domains, it’s important to recognize the potential pitfalls.
+
One major challenge is data distribution bias, where the synthetic data might not accurately mirror the complexities and nuances of real-world data. This can lead to models trained on this data making inaccurate predictions or exhibiting biases. In our case study, we did observe duplicate responses in the synthetic data. Further, the methodology lacks a systematic approach to evaluate the quality of the synthetic data itself only focusing on evals for the consecutive fine-tuned model. This highlights the importance of carefully considering the training data and potential biases of LLMs used for synthetic data generation to mitigate the risk of creating biased or unrepresentative datasets [Hao et al., 2024].
+
Our approach does enable a systematic approach to aligning a model to an input policy. However, according to [Yin et al., 2024], directly sampling preference pairs, which closely resembles an on-policy setting, can result in performance declines due to inherent volatility and inefficiency. Therefore, constructing effective preference data to continuously improve LLMs remains a critical research problem.
Choice of Base Model
-
The choice of base model is a critical consideration when implementing alignment techniques. In the case study, we selected the smolLM model family due to its efficient architecture and reasonable performance on basic tasks while maintaining relatively low computational requirements. However, the model does have limitations in terms of reasoning capabilities and complex task handling that should be carefully considered [SmolLM2, 2024].
+
The choice of base model is a critical consideration when implementing alignment techniques. In the case study, we selected the smolLM model family due to its efficient architecture and reasonable performance on basic tasks while maintaining relatively low computational requirements. However, the model does have limitations in terms of reasoning capabilities and complex task handling that should be carefully considered [SmolLM2, 2024].
Real-world applications need to carefully evaluate the trade-offs between model size/capabilities, and costs. While smaller models like smolLM can be cost-effective for basic alignment experiments, they may not provide the sophisticated reasoning needed for production use cases. The computational and financial costs of training and deploying larger models must be weighed against the required capabilities.
-
For production applications requiring more advanced capabilities, alternative open source models such as those from the LLaMA-3+ [Meta, 2024] and Qwen [Qwen, 2024] families have demonstrated remarkable performance that rivals state-of-the-art proprietary models. These models offer enhanced reasoning abilities and better handling of complex tasks, though at increased computational and financial cost. The choice ultimately depends on specific use case requirements, available resources, and acceptable performance thresholds.
+
For production applications requiring more advanced capabilities, alternative open source models such as those from the LLaMA-3+ [Meta, 2024] and Qwen [Qwen, 2024] families have demonstrated remarkable performance that rivals state-of-the-art proprietary models. These models offer enhanced reasoning abilities and better handling of complex tasks, though at increased computational and financial cost. The choice ultimately depends on specific use case requirements, available resources, and acceptable performance thresholds.
Evaluation Methodology
-
The LLM-as-judge evaluation methodology is a powerful tool for assessing model alignment. However, it does have limitations [Chen et al., 2024]. For instance, the judge model may not always be able to accurately evaluate the alignment of the model, especially if the judge model is not aligned with the policy itself. Further, the judge model may be biased towards the policy, leading to overly conservative evaluations. In our case study, we do highlight the fact that our judge was simply focused on the policy-alignment aspect of the responses completely neglecting the quality of the responses themselves, i.e. while our fine-tuned model may be more aligned with the policy than the base model, we actually have no evidence that our model is helpful at all.
+
The LLM-as-judge evaluation methodology is a powerful tool for assessing model alignment. However, it does have limitations [Chen et al., 2024]. For instance, the judge model may not always be able to accurately evaluate the alignment of the model, especially if the judge model is not aligned with the policy itself. Further, the judge model may be biased towards the policy, leading to overly conservative evaluations. In our case study, we do highlight the fact that our judge was simply focused on the policy-alignment aspect of the responses completely neglecting the quality of the responses themselves, i.e. while our fine-tuned model may be more aligned with the policy than the base model, we actually have no evidence that our model is helpful at all.
A more robust evaluation approach would combine LLM-based evaluation with human domain experts in a complementary process. The LLM judge could perform initial high-throughput screening of model responses, flagging potential issues and providing preliminary assessments. These results would then be reviewed by human evaluators with relevant domain expertise who can provide nuanced judgment, catch edge cases, and validate the LLM’s evaluations. Additionally, automatic evaluation against standard benchmarks is advised to evaluate general capabilities of the model.
DPO Dataset Composition
The composition of the DPO dataset also plays a crucial role in model behavior. In preliminary experiments, using only policy-aligned preference data led to an overly apologetic model that was hesitant to provide helpful responses even for benign queries, i.e. the model was overfitting to the policy. In fact, a model that simply refused to provide an useful response and instead apologized would indeed be aligned with the policy and therefore rewarded accordingly. This led to our decision to construct a more well balanced dataset.
-
Blending our policy-focused dataset with the more general-purpose UltraFeedback dataset from Hugging Face H4 [H4, 2024a] dramatically improved results by helping the model maintain helpfulness while learning appropriate safety boundaries. The results reported here reflect this balanced dataset approach.
+
Blending our policy-focused dataset with the more general-purpose UltraFeedback dataset from Hugging Face H4 [H4, 2024a] dramatically improved results by helping the model maintain helpfulness while learning appropriate safety boundaries. The results reported here reflect this balanced dataset approach.
The construction of the DPO dataset is perhaps the most critical component of the alignment process. While automated approaches can help scale dataset creation, the involvement of domain experts in dataset construction is highly recommended. Domain experts bring invaluable knowledge about edge cases, nuanced policy interpretations, and real-world usage patterns that may not be captured by synthetic data generation alone. Organizations implementing alignment techniques should consider investing in domain expert involvement during dataset construction as a key success factor.
Fine-tuning Process
The effectiveness of DPO training can be highly sensitive to various fine-tuning hyperparameters. As we mentioned before, the batch size and the beta parameter are two key parameters that can significantly impact training stability and model behavior. A careful parameter tuning is required to achieve optimal results, which lacked in our case study.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback. 2022. URL: https://arxiv.org/abs/2204.05862, arXiv:2204.05862.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional ai: harmlessness from ai feedback. 2022. URL: https://arxiv.org/abs/2212.08073, arXiv:2212.08073.
Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or llms as the judge? a study on judgement biases. 2024. URL: https://arxiv.org/abs/2402.10669, arXiv:2402.10669.
Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, and Furu Wei. Self-boosting large language models with synthetic preference data. 2024. URL: https://arxiv.org/abs/2410.06961, arXiv:2410.06961.
Duanyu Feng, Bowen Qin, Chen Huang, Zheng Zhang, and Wenqiang Lei. Towards analyzing and understanding the limitations of dpo: a theoretical perspective. 2024. URL: https://arxiv.org/abs/2404.04626, arXiv:2404.04626.
Shuang Hao, Wenfeng Han, Tao Jiang, Yiping Li, Haonan Wu, Chunlin Zhong, Zhangjun Zhou, and He Tang. Synthetic data in ai: challenges, applications, and ethical implications. 2024. URL: https://arxiv.org/abs/2401.01629, arXiv:2401.01629.
Zhenyu Hou, Pengfan Du, Yilin Niu, Zhengxiao Du, Aohan Zeng, Xiao Liu, Minlie Huang, Hongning Wang, Jie Tang, and Yuxiao Dong. Does rlhf scale? exploring the impacts from data, model, and method. 2024. URL: https://arxiv.org/abs/2412.06000, arXiv:2412.06000.
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: low-rank adaptation of large language models. 2021. URL: https://arxiv.org/abs/2106.09685, arXiv:2106.09685.
Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. 2022. URL: https://arxiv.org/abs/2210.11610, arXiv:2210.11610.
Joshua Kazdan, Rylan Schaeffer, Apratim Dey, Matthias Gerstgrasser, Rafael Rafailov, David L. Donoho, and Sanmi Koyejo. Collapse or thrive? perils and promises of synthetic data in a self-generating world. 2024. URL: https://arxiv.org/abs/2410.16713, arXiv:2410.16713.
Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, and Graham Neubig. Evaluating language models as synthetic data generators. 2024. URL: https://arxiv.org/abs/2412.03679, arXiv:2412.03679.
Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. On llms-driven synthetic data generation, curation, and evaluation: a survey. 2024. URL: https://arxiv.org/abs/2406.15126, arXiv:2406.15126.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. 2022. URL: https://arxiv.org/abs/2203.02155, arXiv:2203.02155.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: your language model is secretly a reward model. 2024. URL: https://arxiv.org/abs/2305.18290, arXiv:2305.18290.
Hugging Face SmolLM2. Smollm: a small language model distilled from a larger language model for task-specific applications. 2024. Blog post describing techniques for distilling smaller, task-specific language models. URL: https://huggingface.co/blog/smollm.
+
HuggingFace SmolLM2. Smollm: a small language model distilled from a larger language model for task-specific applications. 2024. Blog post describing techniques for distilling smaller, task-specific language models. URL: https://huggingface.co/blog/smollm.
Márton Szép, Daniel Rueckert, Rüdiger von Eisenhart-Rothe, and Florian Hinterwimmer. A practical guide to fine-tuning language models with limited data. 2024. URL: https://arxiv.org/abs/2411.09539, arXiv:2411.09539.
Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Max Bartolo, Borhane Blili-Hamelin, Kurt Bollacker, Rishi Bomassani, Marisa Ferrara Boston, Siméon Campos, Kal Chakra, Canyu Chen, Cody Coleman, Zacharie Delpierre Coudert, Leon Derczynski, Debojyoti Dutta, Ian Eisenberg, James Ezick, Heather Frase, Brian Fuller, Ram Gandikota, Agasthya Gangavarapu, Ananya Gangavarapu, James Gealy, Rajat Ghosh, James Goel, Usman Gohar, Sujata Goswami, Scott A. Hale, Wiebke Hutiri, Joseph Marvin Imperial, Surgan Jandial, Nick Judd, Felix Juefei-Xu, Foutse Khomh, Bhavya Kailkhura, Hannah Rose Kirk, Kevin Klyman, Chris Knotz, Michael Kuchnik, Shachi H. Kumar, Srijan Kumar, Chris Lengerich, Bo Li, Zeyi Liao, Eileen Peters Long, Victor Lu, Sarah Luger, Yifan Mai, Priyanka Mary Mammen, Kelvin Manyeki, Sean McGregor, Virendra Mehta, Shafee Mohammed, Emanuel Moss, Lama Nachman, Dinesh Jinenhally Naganna, Amin Nikanjam, Besmira Nushi, Luis Oala, Iftach Orr, Alicia Parrish, Cigdem Patlak, William Pietri, Forough Poursabzi-Sangdeh, Eleonora Presani, Fabrizio Puletti, Paul Röttger, Saurav Sahay, Tim Santos, Nino Scherrer, Alice Schoenauer Sebag, Patrick Schramowski, Abolfazl Shahbazi, Vin Sharma, Xudong Shen, Vamsi Sistla, Leonard Tang, Davide Testuggine, Vithursan Thangarasa, Elizabeth Anne Watkins, Rebecca Weiss, Chris Welty, Tyler Wilbers, Adina Williams, Carole-Jean Wu, Poonam Yadav, Xianjun Yang, Yi Zeng, Wenhui Zhang, Fedor Zhdanov, Jiacheng Zhu, Percy Liang, Peter Mattson, and Joaquin Vanschoren. Introducing v0.5 of the ai safety benchmark from mlcommons. 2024. URL: https://arxiv.org/abs/2404.12241, arXiv:2404.12241.
According to recent analysis from a16z [Andreessen Horowitz, 2024], the cost of LLM inference is decreasing by approximately 10x every year - a rate that outpaces even Moore’s Law in the PC revolution or Edholm’s Law during the bandwidth explosion of the dot-com era.
According to recent analysis from a16z [Andreessen Horowitz, 2024], the cost of LLM inference is decreasing by approximately 10x every year - a rate that outpaces even Moore’s Law in the PC revolution or Edholm’s Law during the bandwidth explosion of the dot-com era.
-
Fig. 9.1 LLMflation [Andreessen Horowitz, 2024]: The cost of LLM inference is decreasing by approximately 10x every year.¶
+
Fig. 9.1 LLMflation [Andreessen Horowitz, 2024]: The cost of LLM inference is decreasing by approximately 10x every year.¶
A model achieving an MMLU score of 42 that cost \(60 per million tokens in late 2021 can now be run for just \)0.06 per million tokens. For higher-capability models scoring 83 on MMLU, prices have fallen by a factor of 62 since GPT-4’s introduction in March 2023.
Before implementing cost optimization strategies for LLMs, organizations must develop a comprehensive understanding of their own requirements and constraints. This systematic approach prevents both over-engineering and under-provisioning, leading to more efficient and cost-effective implementations.
In this section, we define key performance and cost related metrics that will guide our discussion. Then we propose a set of requirements practitioners should consider before we dive into cost optimization techniques.
First, one needs to define the problem to be solved and to what extent it is worth to be solved. Use case requirements form the foundation of any LLM implementation project. A clear definition of the specific business problema and task to be accomplished must be established upfront, along with concrete performance metrics covering accuracy, latency and throughput. This should be accompanied by well-defined cost-per-transaction targets, clear ROI expectations, and a strategic allocation of budgets across different use cases to ensure resources are optimally distributed.
Budget and ROI considerations are critical for ensuring the long-term viability of LLM implementations. Organizations must establish clear spending limits that align with their financial capabilities while defining realistic cost-per-transaction targets. ROI expectations need to be carefully established through detailed analysis, followed by a strategic allocation of budgets across various use cases based on their business impact and priority.
Compliance and security requirements cannot be overlooked. This involves a thorough identification of all applicable regulatory requirements and the establishment of robust data handling standards. Organizations must specify comprehensive audit requirements to maintain transparency and accountability, while implementing appropriate security controls to protect sensitive data and system access.
Accuracy and quality form the foundation of any LLM deployment’s performance requirements. At its core, this involves determining the minimum level of accuracy that the model must achieve to be considered successful. This serves as a critical baseline for evaluating model performance and making deployment decisions. Establishing clear evaluation metrics, whether through automated measures or human evaluation processes, provides concrete ways to assess if these thresholds are being met. Continuous monitoring of these accuracy metrics ensures the system maintains its performance over time as usage patterns and data distributions evolve. Chapter The Evals Gap provides a detailed discussion on how to evaluate the performance of LLM-based applications.
Latency and throughput requirements are equally crucial for ensuring a positive user experience and system reliability. These specifications define how quickly the system must respond to requests and how many concurrent users it can handle. Response time requirements must be carefully balanced against the computational resources available, while peak load capabilities need to account for usage spikes and growth patterns. The decision between real-time processing for immediate responses versus batch processing for efficiency depends heavily on the use case and user expectations.
Scale and capacity planning forms the foundation of operational requirements for LLM deployments. This involves a comprehensive analysis of expected system usage and growth patterns to ensure the infrastructure can handle both current and future demands. Organizations must carefully project their daily and monthly API call volumes while calculating the average number of tokens per request to accurately estimate resource needs. Understanding usage patterns, including seasonal variations, enables proper capacity planning. Additionally, developing 12-24 month growth projections helps ensure the infrastructure can scale appropriately as demand increases.
Reliability and availability requirements are equally critical for maintaining consistent service quality. These specifications define the expected uptime percentage that the system must maintain, typically expressed as a percentage of total operational time. Organizations need to establish clear maintenance windows that minimize disruption to users while ensuring necessary system updates and optimizations can be performed. Comprehensive backup and failover requirements must be specified to ensure business continuity in case of failures. High availability needs should be clearly defined, including redundancy levels and recovery time objectives, to maintain service quality even during unexpected events.
System integration requirements define how the LLM system will interact and communicate with existing infrastructure and applications. This involves carefully mapping all integration points where the LLM system needs to connect with other systems, establishing standardized data formats and interfaces for seamless communication, implementing robust security measures to protect data in transit, and identifying any technical constraints that could impact integration. Getting these integration requirements right is crucial for ensuring the LLM system can function effectively within the broader technical ecosystem.
Data management requirements address how information will be stored, processed, and maintained within the LLM system. This encompasses determining appropriate storage solutions for maintaining conversation context and history, selecting and configuring vector databases to enable efficient retrieval-augmented generation (RAG), creating comprehensive data retention policies that balance operational needs with resource constraints, and ensuring all data handling practices comply with relevant privacy regulations. Proper data management is essential for both system performance and regulatory compliance, making it a critical consideration in any LLM implementation.
This structured approach to requirements analysis enables organizations to:
Quantization is a common and relevant technique in making LLMs more efficient and accessible. At a high level, quantization reduces the number of bits used to represent a model’s parameters. The most common form of quantization is to represent model’s weights at lower precision at post-training phase. It has become a standard technique to generate a series of quantized models given a large pre-trained base model.
While a standard pre-trained LLM might use 32-bit floating-point (FP32) or 16-bit floating-point (FP16) numbers to store its weights, quantized versions can operate at lower precision levels such as 8, 4 or even 2 bits per parameter, reducing memory footprint without proportional losses in performance, necessarily. For instance, for a model of 30 billion parameters, using FP32 means 4 bytes per weight or 120 GB for the whole weights. If the model is quantized such that weights are represented in 1 byte, the memory needed for the model’s weights decreases to 30 GB, hence potentially fitting into consumer grade hardware. This is done at the cost of precision loss, but the trade-off is often worth it though require careful analysis.
Let’s take a look at model weights of a language model (SmolLM2-135M-Instruct) that has been quantized to 2-bit and 16-bit precisions. We will use an utility function load_gguf from the taming_utils package to load model weights of the quantized models directly from Hugging Face.
@@ -483,21 +483,21 @@
[Unsloth, 2024][2]. The model’s memory requirements vary significantly based on the quantization level used as demonstrated in Fig. 9.2.
+
Quantization[2] is a powerful technique for reducing the memory footprint of LLMs. This can be exemplified by the case of LLaMa 3.3 70B as quantized by [Unsloth, 2024][3]. The model’s memory requirements vary significantly based on the quantization level used as demonstrated in Fig. 9.2.
Fig. 9.2 Quantized Model Size: unsloth/Llama-3.3-70B-Instruct-GGUF¶
-
We observe that the quantization process yields remarkable reductions in model size, demonstrating a clear trade-off between precision and memory requirements. The transition from F16 (141.1 GB) to Q8_0 (75 GB) achieves a dramatic 47% reduction in model size while maintaining relatively high numerical precision. Further quantization levels reveal an interesting pattern of diminishing returns - each step down in precision yields progressively smaller absolute size reductions, though the cumulative effect remains significant. At the extreme end, the Q2_K model (26.4 GB) requires only 19% of the storage space of its F16 counterpart [3].
+
We observe that the quantization process yields remarkable reductions in model size, demonstrating a clear trade-off between precision and memory requirements. The transition from F16 (141.1 GB) to Q8_0 (75 GB) achieves a dramatic 47% reduction in model size while maintaining relatively high numerical precision. Further quantization levels reveal an interesting pattern of diminishing returns - each step down in precision yields progressively smaller absolute size reductions, though the cumulative effect remains significant. At the extreme end, the Q2_K model (26.4 GB) requires only 19% of the storage space of its F16 counterpart [4].
This wide spectrum of model sizes enables deployment across diverse hardware environments. The lightweight Q2_K variant opens possibilities for running inference on consumer-grade hardware like high-end laptops or desktop computers. In contrast, the full-precision F16 model demands enterprise-grade computing resources with substantial memory capacity. This flexibility in deployment options makes quantization a powerful tool for democratizing access to large language models while managing computational costs.
-
While quantization has proven highly effective, there is a limit to how far it can be pushed - specifically, the 1-bit ceiling. A notable advancement in this space is BitNet [Wang et al., 2024] which pushes the boundaries of extreme quantization.
+
While quantization has proven highly effective, there is a limit to how far it can be pushed - specifically, the 1-bit ceiling. A notable advancement in this space is BitNet [Wang et al., 2024] which pushes the boundaries of extreme quantization.
BitNet’s implementation, bitnet.cpp, has demonstrated significant performance improvements across both ARM and x86 architectures (see Fig. 9.3). When compared to llama.cpp, the framework achieves speedups ranging from 1.37x to 5.07x on ARM processors and 2.37x to 6.17x on x86 systems. These performance gains scale with model size - larger models benefit more substantially from BitNet’s optimizations. The efficiency improvements extend beyond raw speed: energy consumption drops by 55-70% on ARM and 71-82% on x86 processors. Perhaps most impressively, bitnet.cpp enables running a 100B parameter BitNet b1.58 model on a single CPU at speeds matching human reading pace (5-7 tokens per second).
The framework’s initial release focused on CPU inference optimization, with particular emphasis on 1-bit LLM architectures (BitNet b1.58). While initial testing shows promising results, these findings are specific to the tested models and kernels (its specialized kernels are carefully crafted to exploit the unique characteristics of these extremely quantized models). Further validation is needed before generalizing these results across different architectures and use cases.
Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, and Furu Wei. 1-bit ai infra: part 1.1, fast and lossless bitnet b1.58 inference on cpus. 2024. URL: https://arxiv.org/abs/2410.16144, arXiv:2410.16144.
Andreessen Horowitz. Llmflation: understanding and mitigating llm inference cost. Blog Post, 2024. Analysis of LLM inference costs and strategies for optimization. URL: https://a16z.com/llmflation-llm-inference-cost/.
The advent of LLMs marks a pivotal shift in the landscape of software development, testing and verification. Unlike traditional software systems, where deterministic outputs are the norm, LLMs introduce a realm of non-deterministic and generative behaviors that challenge conventional software engineering paradigms. This shift is not merely a technical evolution but a fundamental transformation in how we conceive, build, and assess software products.
For those entrenched in traditional methodologies, the transition to LLM-driven systems may seem daunting. However, ignoring this change is not an option. The reliance on outdated testing frameworks that fail to account for the probabilistic nature of LLMs will inevitably lead to significant setbacks.
To overcome these challenges, it is imperative to embrace the complexities of LLMs with a proactive mindset. This involves developing robust evaluation frameworks up-front that incorporate the generative nature of LLM-based software development while fostering a culture of continuous change, learning and adaptation.
One of the most fundamental challenges when building products with LLMs is their generative and non-deterministic nature. Unlike traditional software systems where the same input reliably produces the same output, LLMs can generate novel text that may not exist in their training data, and produce different responses each time they’re queried - even with identical prompts and input data. This behavior is both a strength and a significant engineering and product challenge.
When you ask an LLM the same question multiple times, you’ll likely get different responses. This isn’t a bug - it’s a fundamental feature of how these models work. The “temperature” parameter, which controls the randomness of outputs, allows models to be creative and generate diverse responses. However, this same feature makes it difficult to build reliable, testable systems.
Consider a financial services company using LLMs to generate investment advice. The non-deterministic nature of these models means that:
Uses techniques like nucleus sampling [Holtzman et al., 2020] or top-k sampling to balance creativity and coherence
In this simple experiment, we use an LLM to write a single-statement executive summary from an input financial filing. We observe that even a simple parameter like temperature can dramatically alter model behavior in ways that are difficult to systematically assess. At temperature 0.0, responses are consistent but potentially too rigid. At 1.0, outputs become more varied but less predictable. At 2.0, responses can be wildly different and often incoherent. This non-deterministic behavior makes traditional software testing approaches inadequate.
A temperature of 1 represents the unscaled probability scores for each token in the vocabulary. Decreasing the temperature closer to 0 sharpens the distribution, so the most likely token will have an even higher probability score. Conversely, increasing the temperature makes the distribution more uniform [Raschka, 2024]:
Temperature = 0: Most deterministic, but potentially repetitive
Temperature = 1: Balanced creativity and coherence
Beyond their non-deterministic nature, LLMs present another fascinating characteristic: emergent abilities that spontaneously arise as models scale up in size. These abilities - from basic question answering to complex reasoning - aren’t explicitly programmed but rather emerge “naturally” as the models grow larger and are trained on more data. This makes evaluation fundamentally different from traditional software testing, where capabilities are explicitly coded and can be tested against pre-defined specifications.
-
Fig. 3.1 provides a list of emergent abilities of large language models and the scale [Wei et al., 2022]. The relationship between model scale and emergent abilities follows a fascinating non-linear pattern. Below certain size thresholds, specific abilities may be completely absent from the model - it simply cannot perform certain tasks, no matter how much you try to coax them out. However, once the model reaches critical points in its scaling journey, these abilities can suddenly manifest in what researchers call a phase transition - a dramatic shift from inability to capability. This unpredictable emergence of capabilities stands in stark contrast to traditional software development, where features are deliberately implemented and can be systematically tested.
+
Fig. 3.1 provides a list of emergent abilities of large language models and the scale [Wei et al., 2022]. The relationship between model scale and emergent abilities follows a fascinating non-linear pattern. Below certain size thresholds, specific abilities may be completely absent from the model - it simply cannot perform certain tasks, no matter how much you try to coax them out. However, once the model reaches critical points in its scaling journey, these abilities can suddenly manifest in what researchers call a phase transition - a dramatic shift from inability to capability. This unpredictable emergence of capabilities stands in stark contrast to traditional software development, where features are deliberately implemented and can be systematically tested.
-
Fig. 3.1 Emergent abilities of large language models and the scale [Wei et al., 2022].¶
+
Fig. 3.1 Emergent abilities of large language models and the scale [Wei et al., 2022].¶
The implications for evaluation are critical. While conventional software testing relies on stable test suites and well-defined acceptance criteria, LLM evaluation must contend with a constantly shifting landscape of capabilities. What worked to evaluate a 7B parameter model may be completely inadequate for a 70B parameter model that has developed new emergent abilities. This dynamic nature of LLM capabilities forces us to fundamentally rethink our approach to testing and evaluation.
Consider a practical example that illustrates these challenges: building a Math AI tutoring system for children powered by an LLM. In traditional software development, you would define specific features (like presenting math problems or checking answers) and write tests to verify each function. But with LLMs, you’re not just testing predefined features - you’re trying to evaluate emergent capabilities like adapting explanations to a child’s level, maintaining engagement through conversational learning, and providing age-appropriate safety-bound content.
This fundamental difference raises critical questions about evaluation:
First, it’s important to make a distinction between evaluating an LLM versus evaluating an LLM-based application. While the former offers foundation capabilities and are typically general-purpose, the latter is more specific and tailored to a particular use case. Here, we define an LLM-based application as a system that uses one or more LLMs to perform a specific task. More specifically, an LLM-based application is the combination of one or more LLM models, their associated prompts and parameters to solve a particular business problem.
That differentiation is important because it changes the scope of evaluation. LLMs are usually evaluated based on their capabilities, which include things like language understanding, reasoning and knowledge. LLM-based applications, instead, should be evaluated based on their end-to-end functionality, performance, and how well they meet business requirements. That distinction has key implications for the design of evaluation systems:
The design of an LLM application evaluation system depends heavily on the specific use case and business requirements. Here we list important questions for planning an LLM application evaluation system pertaining to each of the key components previously introduced:
The choice of metric depends on the specific task and desired evaluation criteria. However, one can categorize metrics into two broad categories: intrinsic and extrinsic.
Intrinsic metrics focus on the model’s performance on its primary training objective, which is typically to predict the next token in a sequence. Perplexity is a common intrinsic metric that measures how well the model predicts a given sample of text.
Subjective Acceptable Threshold: These metrics are not always easy to interpret and set a threshold for (see [Sarmah et al., 2024] for a discussion on how to choose a threshold for an evaluation metric for large language models).
Inability to assess reasoning or factual accuracy: These metrics primarily focus on surface-level matching and might not reveal the underlying reasoning process of the LLM or its ability to generate factually correct information.
In conclusion, selecting an appropriate extrinsic metrics set depends on the specific task, underlying business requirements and desired evaluation granularity. Understanding the limitations of these metrics can provide a more comprehensive assessment of LLM performance in real-world applications.
To address these limitations, alternative approaches like human-based evaluation and model-based evaluation are often used, which will be discussed in the following sections.
Traditional metrics like BLEU or ROUGE often fall short in capturing the nuanced, contextual, and creative outputs of LLMs. As an alternative we can consider a “Model-based evaluation” approach. A common approach is to use an LLM as a judge. This is an approach that leverages language models themselves to assess the quality of outputs from other language models. This method involves using a model (often a more capable one) to act as an automated judge, evaluating aspects like accuracy, coherence, and relevance of generated content. Unlike traditional metrics that rely on exact matching or statistical measures, model-based evaluation can capture nuanced aspects of language and provide more contextual assessment.
-
As discussed in the paper [Li et al., 2024], LLM-based evaluation approaches generally fall into two main categories:
+
As discussed in the paper [Li et al., 2024], LLM-based evaluation approaches generally fall into two main categories:
Prompt-based evaluation: This involves using prompts to instruct existing LLMs to evaluate text quality without any fine-tuning. The evaluation can take several forms:
Fig. 3.4 Conceptual overview of LLM-as-a-Judge evaluation.¶
Compared to traditional metrics, LLM-as-a-Judge evaluation offers a more sophisticated assessment framework by leveraging natural language criteria. While metrics focus on statistical measures, judge models excel at evaluating subjective qualities such as creativity, narrative flow, and contextual relevance - aspects that closely mirror human judgment. The judge model processes evaluation guidelines expressed in natural language, functioning similarly to a human reviewer interpreting assessment criteria. One notable consideration is that this approach requires careful prompt engineering to properly define and communicate the evaluation standards to the model.
-
Prompt Engineering can have a large impact on the quality of the evaluation [Li et al., 2024]. Hence, it’s worth noting key prompting best practices when designing LLM-as-a-judge evaluators [Face, 2024]:
+
Prompt Engineering can have a large impact on the quality of the evaluation [Li et al., 2024]. Hence, it’s worth noting key prompting best practices when designing LLM-as-a-judge evaluators [HuggingFace, 2024]:
Use discrete integer scales (e.g., 1-5) rather than continuous ranges
Provide clear rubrics that define what each score level means
The visualization helps highlight these differences across models and evaluation dimensions. A clear performance gradient is visible from gpt-4o-mini to gpt-3.5-turbo, with the latter showing marked degradation in most metrics.
-
Leveraging LLMs for evaluation has several limitations [Li et al., 2024]. Firstly, computational overhead should not be neglected given the inherent cost of running additional model inferences iterations. LLM evaluators can also exhibit various biases, including order bias (preferring certain sequence positions), egocentric bias (favoring outputs from similar models), and length bias. Further, there may be a tight dependency on prompt quality - small prompt variations may lead to substantially different outcomes. It is important to also note challenges around domain-specific evaluation in fields such as medicine, finance, law etc, where a general llm-as-a-judge approach may not be suitable.
+
Leveraging LLMs for evaluation has several limitations [Li et al., 2024]. Firstly, computational overhead should not be neglected given the inherent cost of running additional model inferences iterations. LLM evaluators can also exhibit various biases, including order bias (preferring certain sequence positions), egocentric bias (favoring outputs from similar models), and length bias. Further, there may be a tight dependency on prompt quality - small prompt variations may lead to substantially different outcomes. It is important to also note challenges around domain-specific evaluation in fields such as medicine, finance, law etc, where a general llm-as-a-judge approach may not be suitable.
The LLM-as-a-Judge strategy can serve as a scalable and nuanced solution to evaluate LLM-based applications. While it does not entirely replace metrics-based or human-based approaches, it significantly augments evaluation workflows, especially in scenarios requiring evaluation of generative outputs. Future improvements in our example include integrating human oversight and refining LLMs for domain-specific evaluation tasks.
-
One open source solution trying to overcome some of these challenges is Glider [Deshpande et al., 2024], a 3B evaluator LLM that can score any text input and associated context on arbitrary user defined criteria. Glider is an LLM model trained on 685 domains and 183 criteria whose judgement scores show 91.3% agreement with human judgments, making it suitable for a diverse range of real world applications.
+
One open source solution trying to overcome some of these challenges is Glider [Deshpande et al., 2024], a 3B evaluator LLM that can score any text input and associated context on arbitrary user defined criteria. Glider is an LLM model trained on 685 domains and 183 criteria whose judgement scores show 91.3% agreement with human judgments, making it suitable for a diverse range of real world applications.
We have discussed how LLMs can be used to evaluate LLM-based aplications. However, how can we evaluate the performance of LLMs that evaluate other LLMs? This is the question that meta evaluation aims to answer. Clearly, the discussion can become quite meta as we need to evaluate the performance of the evaluator to evaluate the performance of the evaluated model. However, one can make a case for two general options:
Use a golden-standard dataset that is used to evaluate the performance of LLM evaluators using a “metrics-based” approach.
An alternative to the above approaches is to use humans to directly evaluate the LLM-judges themselves. A notable example of this is Judge Arena[Arena, 2024], which is a platform that allows users to vote on which AI model made the better evaluation. Under this approach, the performance of the LLM evaluator is given by the (blind) evaluation of humans who perform the voting on randomly generated pairs of LLM judges as depicted in Fig. 3.6. Only after submitting a vote, users can see which models were actually doing the judging.
+
An alternative to the above approaches is to use humans to directly evaluate the LLM-judges themselves. A notable example of this is Judge Arena[Arena, 2024], which is a platform that allows users to vote on which AI model made the better evaluation. Under this approach, the performance of the LLM evaluator is given by the (blind) evaluation of humans who perform the voting on randomly generated pairs of LLM judges as depicted in Fig. 3.6. Only after submitting a vote, users can see which models were actually doing the judging.
Benchmarks act as standardized tests for LLMs, evaluating their performance across a spectrum of tasks. These tasks simulate real-world applications such as answering questions, generating coherent text, solving mathematical problems, or even writing computer code. They also assess more abstract qualities like fairness, robustness, and cultural understanding.
Benchmarks can be thought as comprehensive “exams” that probe different “subjects” in order to certify an LLM. They help researchers and developers compare models systematically, in a way LLM performance is comparable while enabling the identification of emergent behaviors or capabilities as models evolve in scale and sophistication.
-
The history of LLM benchmarks reflects the evolving priorities of artificial intelligence research, starting with foundational tasks and moving toward complex, real-world challenges. We can start in 2018 with the introduction of GLUE (General Language Understanding Evaluation) [Wang et al., 2019], which set a new standard for evaluating natural language understanding. GLUE measured performance on tasks like sentiment analysis and textual entailment, providing a baseline for assessing the fundamental capabilities of language models. Later, SuperGLUE[Wang et al., 2019] expanded on this foundation by introducing more nuanced tasks that tested reasoning and language comprehension at a deeper level, challenging the limits of models like BERT and its successors.
-
As AI capabilities grew, benchmarks evolved to capture broader and more diverse aspects of intelligence. BIG-Bench[Srivastava et al., 2023] marked a turning point by incorporating over 200 tasks, spanning arithmetic, logic, and creative problem-solving. This collaborative effort aimed to probe emergent abilities in large models, offering insights into how scale and complexity influence performance. Around the same time, specialized benchmarks like TruthfulQA[Lin et al., 2022] emerged, addressing the critical need for models to provide accurate and non-deceptive information in a world increasingly dependent on AI for factual content.
-
MMLU (Massive Multitask Language Understanding) [Hendrycks et al., 2021] launched in 2021, provided a rigorous test of a model’s multidisciplinary knowledge, covering 57 subjects from STEM fields to humanities and social sciences. Similarly, in 2022, Stanford’s HELM (Holistic Evaluation of Language Models) [Liang et al., 2023] set a new standard for multidimensional assessment. HELM expanded the scope of evaluation beyond accuracy, incorporating factors like fairness, robustness, and computational efficiency. This benchmark was designed to address societal concerns surrounding AI, emphasizing safety and inclusion alongside technical performance.
-
Specialized benchmarks like HumanEval (2021) [Chen et al., 2021] focused on domain-specific tasks, such as code generation, testing models’ ability to translate natural language descriptions into functional programming code. In contrast, LMSYS (2023) brought real-world applicability into focus by evaluating conversational AI through multi-turn dialogues. LMSYS prioritized coherence, contextual understanding, and user satisfaction, providing a practical lens for assessing models like GPT and Claude in dynamic settings.
-
The HuggingFace Open LLM[Face, 2024] Leaderboard stands out for its transparency and accessibility in the open-source community. This leaderboard evaluates a wide range of LLMs across diverse tasks, including general knowledge, reasoning, and code-writing. Its commitment to reproducibility ensures that results are verifiable, enabling researchers and practitioners to replicate findings. By focusing on open-source models, it democratizes AI research and fosters innovation across communities, making it a valuable resource for both academics and industry professionals.
-
The Chatbot Arena (2024) Leaderboard (an evolution of LMSYS) [Chiang et al., 2024] takes an alternative approach by measuring real-world performance through direct model comparisons. Its evaluation format compares models in live conversations, with human judges providing qualitative assessments. This methodology has gathered hundreds of thousands of human evaluations, offering specific insights into practical model performance. The emphasis on interactive capabilities makes it relevant for developing user-facing applications like virtual assistants and chatbots.
-
The AlpacaEval[Dubois et al., 2024] and MT-Bench[Zheng et al., 2023] Leaderboards implement automated evaluation using LLMs to assess model performance in multi-turn conversations. This approach enables consistent assessment of dialogue capabilities while reducing human bias. Their methodology measures key aspects of conversational AI, including contextual understanding and response consistency across multiple exchanges.
-
An important recent development was the release of Global-MMLU [Singh et al., 2024], an improved version of MMLU with evaluation coverage across 42 languages. This open dataset, built through collaboration between Argilla, the Hugging Face community, and researchers from leading institutions like Cohere For AI, Mila, MIT, and others, represents a significant step toward more inclusive multilingual LLM evaluation. Hundreds of contributors used Argilla to annotate MMLU questions, revealing that 85% of questions requiring specific cultural knowledge were Western-centric. The newly released dataset is divided into two key subsets: Culturally Agnostic questions that require no specific regional or cultural knowledge, and Culturally Sensitive questions that depend on dialect, cultural, or geographic knowledge. With high-quality translations available for 25 languages, Global-MMLU enables better understanding of LLM capabilities and limitations across different languages and cultural contexts.
-
A major challenge with these leaderboards and benchmarks is test set contamination - when test data ends up in newer models’ training sets, rendering the benchmarks ineffective. While some benchmarks try to address this through crowdsourced prompts and evaluations from humans or LLMs, these approaches introduce their own biases and struggle with difficult questions. LiveBench[White et al., 2024] represents a novel solution, designed specifically to be resilient to both contamination and evaluation biases. As the first benchmark with continuously updated questions from recent sources, automated objective scoring, and diverse challenging tasks across multiple domains, LiveBench maintains its effectiveness even as models improve. Drawing from recent math competitions, research papers, news, and datasets, it creates contamination-free versions of established benchmark tasks. Current results show even top models achieving considerably lower performance compared to other benchmarks, demonstrating LiveBench’s ability to meaningfully differentiate model capabilities with relatively lower saturation. With monthly updates and an open collaborative approach, LiveBench aims to provide sustained value for model evaluation as the field advances.
-
Another notable benchmark is ZebraLogic [Lin et al., 2024], which evaluates logical reasoning capabilities of LLMs through Logic Grid Puzzles - a type of Constraint Satisfaction Problem [Brailsford et al., 1999] commonly found in tests like the LSAT. These puzzles require assigning unique values to N houses across M different features based on given clues, demanding strategic reasoning and deduction to arrive at a unique correct solution. The benchmark’s programmatically generated puzzles range from 2x2 to 6x6 in size and test LLMs using one-shot examples with reasoning steps. While humans can solve these puzzles through strategic methods like reductio ad absurdum and elimination, LLMs demonstrate significant limitations in this type of logical reasoning. Even the best-performing model, Claude 3.5 Sonnet, only achieves 33.4% accuracy across all puzzles and 12.4% on hard puzzles, with smaller models (7-10B parameters) solving less than 1% of hard puzzles as of December 2024. These results reveal critical gaps in LLMs’ capabilities around counterfactual thinking, reflective reasoning, structured memorization, and compositional generalization.
-
A significant milestone in AI evaluation came with the launch of the The Alignment Research Center (ARC) Prize[Chollet, 2024] by ARC Prize Inc., a non-profit for the public advancement of open artificial general intelligence. Hosted by Mike Knoop (Co-founder, Zapier) and François Chollet (Creator of Keras), this prize represents a paradigm shift in how we evaluate language models. Rather than focusing on narrow performance metrics, the ARC Prize assesses what it calls “cognitive sufficiency” - a model’s ability to generate meaningful insights and tackle open-ended challenges. This new way to think about LLM evaluation emphasizes creative thinking, sophisticated reasoning, and the capacity to make genuinely useful contributions to human knowledge. Arguably, it is an attempt to define and measure a step towards what it means to achieve AGI (Artificial General Intelligence).
+
The history of LLM benchmarks reflects the evolving priorities of artificial intelligence research, starting with foundational tasks and moving toward complex, real-world challenges. We can start in 2018 with the introduction of GLUE (General Language Understanding Evaluation) [Wang et al., 2019], which set a new standard for evaluating natural language understanding. GLUE measured performance on tasks like sentiment analysis and textual entailment, providing a baseline for assessing the fundamental capabilities of language models. Later, SuperGLUE[Wang et al., 2019] expanded on this foundation by introducing more nuanced tasks that tested reasoning and language comprehension at a deeper level, challenging the limits of models like BERT and its successors.
+
As AI capabilities grew, benchmarks evolved to capture broader and more diverse aspects of intelligence. BIG-Bench[Srivastava et al., 2023] marked a turning point by incorporating over 200 tasks, spanning arithmetic, logic, and creative problem-solving. This collaborative effort aimed to probe emergent abilities in large models, offering insights into how scale and complexity influence performance. Around the same time, specialized benchmarks like TruthfulQA[Lin et al., 2022] emerged, addressing the critical need for models to provide accurate and non-deceptive information in a world increasingly dependent on AI for factual content.
+
MMLU (Massive Multitask Language Understanding) [Hendrycks et al., 2021] launched in 2021, provided a rigorous test of a model’s multidisciplinary knowledge, covering 57 subjects from STEM fields to humanities and social sciences. Similarly, in 2022, Stanford’s HELM (Holistic Evaluation of Language Models) [Liang et al., 2023] set a new standard for multidimensional assessment. HELM expanded the scope of evaluation beyond accuracy, incorporating factors like fairness, robustness, and computational efficiency. This benchmark was designed to address societal concerns surrounding AI, emphasizing safety and inclusion alongside technical performance.
+
Specialized benchmarks like HumanEval (2021) [Chen et al., 2021] focused on domain-specific tasks, such as code generation, testing models’ ability to translate natural language descriptions into functional programming code. In contrast, LMSYS (2023) brought real-world applicability into focus by evaluating conversational AI through multi-turn dialogues. LMSYS prioritized coherence, contextual understanding, and user satisfaction, providing a practical lens for assessing models like GPT and Claude in dynamic settings.
+
The HuggingFace Open LLM[HuggingFace, 2024] Leaderboard stands out for its transparency and accessibility in the open-source community. This leaderboard evaluates a wide range of LLMs across diverse tasks, including general knowledge, reasoning, and code-writing. Its commitment to reproducibility ensures that results are verifiable, enabling researchers and practitioners to replicate findings. By focusing on open-source models, it democratizes AI research and fosters innovation across communities, making it a valuable resource for both academics and industry professionals.
+
The Chatbot Arena (2024) Leaderboard (an evolution of LMSYS) [Chiang et al., 2024] takes an alternative approach by measuring real-world performance through direct model comparisons. Its evaluation format compares models in live conversations, with human judges providing qualitative assessments. This methodology has gathered hundreds of thousands of human evaluations, offering specific insights into practical model performance. The emphasis on interactive capabilities makes it relevant for developing user-facing applications like virtual assistants and chatbots.
+
The AlpacaEval[Dubois et al., 2024] and MT-Bench[Zheng et al., 2023] Leaderboards implement automated evaluation using LLMs to assess model performance in multi-turn conversations. This approach enables consistent assessment of dialogue capabilities while reducing human bias. Their methodology measures key aspects of conversational AI, including contextual understanding and response consistency across multiple exchanges.
+
An important recent development was the release of Global-MMLU [Singh et al., 2024], an improved version of MMLU with evaluation coverage across 42 languages. This open dataset, built through collaboration between Argilla, the Hugging Face community, and researchers from leading institutions like Cohere For AI, Mila, MIT, and others, represents a significant step toward more inclusive multilingual LLM evaluation. Hundreds of contributors used Argilla to annotate MMLU questions, revealing that 85% of questions requiring specific cultural knowledge were Western-centric. The newly released dataset is divided into two key subsets: Culturally Agnostic questions that require no specific regional or cultural knowledge, and Culturally Sensitive questions that depend on dialect, cultural, or geographic knowledge. With high-quality translations available for 25 languages, Global-MMLU enables better understanding of LLM capabilities and limitations across different languages and cultural contexts.
+
A major challenge with these leaderboards and benchmarks is test set contamination - when test data ends up in newer models’ training sets, rendering the benchmarks ineffective. While some benchmarks try to address this through crowdsourced prompts and evaluations from humans or LLMs, these approaches introduce their own biases and struggle with difficult questions. LiveBench[White et al., 2024] represents a novel solution, designed specifically to be resilient to both contamination and evaluation biases. As the first benchmark with continuously updated questions from recent sources, automated objective scoring, and diverse challenging tasks across multiple domains, LiveBench maintains its effectiveness even as models improve. Drawing from recent math competitions, research papers, news, and datasets, it creates contamination-free versions of established benchmark tasks. Current results show even top models achieving considerably lower performance compared to other benchmarks, demonstrating LiveBench’s ability to meaningfully differentiate model capabilities with relatively lower saturation. With monthly updates and an open collaborative approach, LiveBench aims to provide sustained value for model evaluation as the field advances.
+
Another notable benchmark is ZebraLogic [Lin et al., 2024], which evaluates logical reasoning capabilities of LLMs through Logic Grid Puzzles - a type of Constraint Satisfaction Problem [Brailsford et al., 1999] commonly found in tests like the LSAT. These puzzles require assigning unique values to N houses across M different features based on given clues, demanding strategic reasoning and deduction to arrive at a unique correct solution. The benchmark’s programmatically generated puzzles range from 2x2 to 6x6 in size and test LLMs using one-shot examples with reasoning steps. While humans can solve these puzzles through strategic methods like reductio ad absurdum and elimination, LLMs demonstrate significant limitations in this type of logical reasoning. Even the best-performing model, Claude 3.5 Sonnet, only achieves 33.4% accuracy across all puzzles and 12.4% on hard puzzles, with smaller models (7-10B parameters) solving less than 1% of hard puzzles as of December 2024. These results reveal critical gaps in LLMs’ capabilities around counterfactual thinking, reflective reasoning, structured memorization, and compositional generalization.
+
A significant milestone in AI evaluation came with the launch of the The Alignment Research Center (ARC) Prize[Chollet, 2024] by ARC Prize Inc., a non-profit for the public advancement of open artificial general intelligence. Hosted by Mike Knoop (Co-founder, Zapier) and François Chollet (Creator of Keras), this prize represents a paradigm shift in how we evaluate language models. Rather than focusing on narrow performance metrics, the ARC Prize assesses what it calls “cognitive sufficiency” - a model’s ability to generate meaningful insights and tackle open-ended challenges. This new way to think about LLM evaluation emphasizes creative thinking, sophisticated reasoning, and the capacity to make genuinely useful contributions to human knowledge. Arguably, it is an attempt to define and measure a step towards what it means to achieve AGI (Artificial General Intelligence).
Defining AGI according to ARC Prize:
Consensus but wrong:
@@ -1401,21 +1401,21 @@
[Chollet, 12/08/2024]. A key takeaway is that algorithmic improvements, rather than massive computational resources, may be key to exceeding the target score for the ARC-AGI benchmark.
+
The ARC-AGI benchmark remained unbeaten for five years as of December 2024 (a minimum score of 85% in the private dataset is required to win) [Chollet, 12/08/2024]. A key takeaway is that algorithmic improvements, rather than massive computational resources, may be key to exceeding the target score for the ARC-AGI benchmark.
In addition to the benchmarks discussed above, a growing set of domain-specific benchmarks is emerging to help evaluate LLMs in specific verticals, including:
-
FinBench [Zhang et al., 2024]: Evaluates LLMs in the financial domain, covering tasks such as terminology understanding, temporal reasoning, future forecasting, scenario planning, and numerical modelling.
-
LegalBench [Guha et al., 2023] : Assesses the legal reasoning abilities of LLMs through tasks crowdsourced by legal professionals
-
Berkeley Function Leaderboard (BFCL) [Patil et al., 2023]: Evaluates LLMs’ function-calling abilities
+
FinBench [Zhang et al., 2024]: Evaluates LLMs in the financial domain, covering tasks such as terminology understanding, temporal reasoning, future forecasting, scenario planning, and numerical modelling.
+
LegalBench [Guha et al., 2023] : Assesses the legal reasoning abilities of LLMs through tasks crowdsourced by legal professionals
+
Berkeley Function Leaderboard (BFCL) [Patil et al., 2023]: Evaluates LLMs’ function-calling abilities
As language models continue to advance in capability and complexity, evaluation frameworks must evolve. Modern benchmarks increasingly incorporate tests for nuanced reasoning, ethical decision-making, and emergent capabilities that weren’t previously measurable. This ongoing evolution reflects a deeper understanding that the true value of language models lies not in achieving high scores on standardized tests with narrow task-specific metrics, but in their ability to meaningfully contribute to human understanding and help solve real-world problems while demonstrating the ability to learn and adapt to new tasks.
In the following sections, we will explore some open source tools developers can use to automate and streamline the challenging task of LLMs evals.
LightEval [Fourrier et al., 2023] is a lightweight framework for evaluation of LLMs across a variety of standard and bespoke metrics and tasks across multiple inference backends via Python SDK and CLI.
LightEval [Fourrier et al., 2023] is a lightweight framework for evaluation of LLMs across a variety of standard and bespoke metrics and tasks across multiple inference backends via Python SDK and CLI.
As a motivating example, consider a scenario where financial data has been extracted from SEC financial filings and require econometric analysis. Tasks like estimating autoregressive models for time series forecasting or conducting hypothesis tests on market efficiency are common in financial analysis. Let’s evaluate how well different models perform on this type of task.
First, we need to select a benchmark to assess LLMs capabilities in this domain. MMLU has a sub-benchmark called Econometrics we can use for this task. Table 3.4 shows a sample of the benchmark dataset from MMLU Econometrics. It consists of multiple-choice questions from econometrics and expected answers.
@@ -1526,7 +1526,7 @@
[Face, 2024] and metrics [Face, 2024]. The available tasks span multiple categories and benchmarks including BigBench, MMLU, TruthfulQA, WinoGrande, and HellaSwag. The framework also supports standard NLP evaluation metrics including BLEU, ROUGE, Exact Match, F1 Score, and Accuracy.
+
LightEval provides a comprehensive set of evaluation tasks [HuggingFace, 2024] and metrics [HuggingFace, 2024]. The available tasks span multiple categories and benchmarks including BigBench, MMLU, TruthfulQA, WinoGrande, and HellaSwag. The framework also supports standard NLP evaluation metrics including BLEU, ROUGE, Exact Match, F1 Score, and Accuracy.
In our case, we choose to evaluate our LLMs on the MMLU econometrics task using zero-shot learning. Hence, we define the task as follows:
We would like to compare the performance of multiple open source models on the MMLU econometrics task. While we could download and evaluate each model locally, we prefer instead to evaluate them on a remote server to save time and resources. LightEval enables serving the model on a TGI-compatible server/container and then running the evaluation by sending requests to the server [Face, 2024].
+
We would like to compare the performance of multiple open source models on the MMLU econometrics task. While we could download and evaluate each model locally, we prefer instead to evaluate them on a remote server to save time and resources. LightEval enables serving the model on a TGI-compatible server/container and then running the evaluation by sending requests to the server [HuggingFace, 2024].
For that purpose, we can leverage HuggingFace Serverless Inference API [1] and set a configuration file for LightEval as shown below, where <MODEL-ID> is the model identifier on HuggingFace (e.g. meta-llama/Llama-3.2-1B-Instruct) and <HUGGINGFACE-TOKEN> is the user’s HuggingFace API token. Alternatively, you could also pass an URL of a corresponding dedicated inference API if you have one.
model:type:"tgi"
@@ -1576,17 +1576,17 @@
Llama3.2 Instruct
LLaMA architecture-based pretrained and instruction-tuned generative models
[Hugging Face, 2024]. Its integration with the Hugging Face ecosystem and modular architecture make it particularly powerful for evaluating open source models. For further details, visit the official repository[Fourrier et al., 2023].
+
In summary, LightEval is a simple yet flexible and comprehensive framework for evaluating LLMs across a wide variety of tasks and metrics. It can serve as a first step in selecting your next LLM for a specific task given the exponential growth in number of (open source) models available [HuggingFace, 2024]. Its integration with the Hugging Face ecosystem and modular architecture make it particularly powerful for evaluating open source models. For further details, visit the official repository[Fourrier et al., 2023].
Let’s revisit our evaluation example when we were interested in evaluating the quality of summaries generated by different (smaller and cheaper) LLM models compared to a benchmark model (larger and more expensive). Recal the setup:
Promptfoo [promptfoo, 2024] is an open-source framework designed for evaluating applications that utilize LLMs. Key features include:
Automated Testing: Promptfoo provides automated testing capabilities, allowing developers to run custom evaluations tailored to their applications.
Custom Probes: Developers can create custom probes to focus on specific use cases for instance decoupling prompts from tests cases.
@@ -2302,7 +2302,7 @@
Prompt Comparison R
In conclusion, Promptfoo can serve as an effective LLM application evaluation tool particularly for its ability to decouple several components of the evaluation process. Hence enabling the user to focus on the most important aspects of the evaluation given the particular application and criteria making it a valuable and flexible tool for LLM application development.
Table 3.6 provides a summarized comparative analysis of three open source frameworks for language models evaluation we have discussed: Lighteval, LangSmith, and Promptfoo. Each framework is assessed based on key features such as integration capabilities, customization options, ease of use, and the ability to facilitate human and LLM collaboration.
Table 3.6 Comparison of Lighteval, LangSmith, and Promptfoo¶
Language models have fundamentally transformed how software is developed and evaluated. Unlike conventional systems that produce predictable outputs, LLMs generate varied, probabilistic responses that defy traditional testing approaches. While developers accustomed to deterministic systems may find this shift challenging, continuing to rely on legacy testing methods is unsustainable. These frameworks were not designed to handle the inherent variability of LLM outputs and will ultimately prove inadequate.
Success requires embracing this new paradigm by implementing comprehensive evals that cover the non-deterministic generative nature of LLMs - this is the new Product Requirements Document (PRD) - and cultivating an organizational mindset focused on iteration, experimentation and growth.
The shift from traditional software testing to LLM evaluation is not just a change in tools but a transformation in mindset. Those who recognize and adapt to this shift will lead the way in harnessing the power of LLMs in software development.
Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Lewis Tunstall, Agustín Piqueres, Andres Marafioti, Cyril Zakka, Leandro von Werra, and Thomas Wolf. Smollm2 - with great data, comes great performance. 2024.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. 2021. URL: https://arxiv.org/abs/2107.03374, arXiv:2107.03374.
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: an open platform for evaluating llms by human preference. 2024. URL: https://arxiv.org/abs/2403.04132, arXiv:2403.04132.
Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled alpacaeval: a simple way to debias automatic evaluators. 2024. URL: https://arxiv.org/abs/2404.04475, arXiv:2404.04475.
Clémentine Fourrier, Nathan Habib, Thomas Wolf, and Lewis Tunstall. Lighteval: a lightweight framework for llm evaluation. 2023. URL: https://github.com/huggingface/lighteval.
Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory M. Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John Nay, Jonathan H. Choi, Kevin Tobia, Margaret Hagan, Megan Ma, Michael Livermore, Nikon Rasumov-Rahe, Nils Holzenberger, Noam Kolt, Peter Henderson, Sean Rehaag, Sharad Goel, Shang Gao, Spencer Williams, Sunny Gandhi, Tom Zur, Varun Iyer, and Zehua Li. Legalbench: a collaboratively built benchmark for measuring legal reasoning in large language models. 2023. URL: https://arxiv.org/abs/2308.11462, arXiv:2308.11462.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. 2021. URL: https://arxiv.org/abs/2009.03300, arXiv:2009.03300.
Zhen Li, Xiaohan Xu, Tao Shen, Can Xu, Jia-Chen Gu, Yuxuan Lai, Chongyang Tao, and Shuai Ma. Leveraging large language models for nlg evaluation: advances and challenges. 2024. URL: https://arxiv.org/abs/2401.07103, arXiv:2401.07103.
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. Holistic evaluation of language models. 2023. URL: https://arxiv.org/abs/2211.09110, arXiv:2211.09110.
Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: large language model connected with massive apis. arXiv preprint arXiv:2305.15334, 2023.
promptfoo. Promptfoo: llm testing and evaluation framework. 2024. Open source framework for testing and evaluating LLM prompts. URL: https://www.promptfoo.dev/.
Bhaskarjit Sarmah, Mingshu Li, Jingrao Lyu, Sebastian Frank, Nathalia Castellanos, Stefano Pasquali, and Dhagash Mehta. How to choose a threshold for an evaluation metric for large language models. 2024. URL: https://arxiv.org/abs/2412.12148, arXiv:2412.12148.
Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David I. Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Wei-Yin Ko, Madeline Smith, Antoine Bosselut, Alice Oh, Andre F. T. Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, Marzieh Fadaee, Beyza Ermis, and Sara Hooker. Global mmlu: understanding and addressing cultural and linguistic biases in multilingual evaluation. 2024. URL: https://arxiv.org/abs/2412.03304, arXiv:2412.03304.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmüller, Andrew Dai, Andrew La, Andrew Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakaş, B. Ryan Roberts, Bao Sheng Loe, Barret Zoph, Bartłomiej Bojanowski, Batuhan Özyurt, Behnam Hedayatnia, Behnam Neyshabur, Benjamin Inden, Benno Stein, Berk Ekmekci, Bill Yuchen Lin, Blake Howald, Bryan Orinion, Cameron Diao, Cameron Dour, Catherine Stinson, Cedrick Argueta, César Ferri Ramírez, Chandan Singh, Charles Rathkopf, Chenlin Meng, Chitta Baral, Chiyu Wu, Chris Callison-Burch, Chris Waites, Christian Voigt, Christopher D. Manning, Christopher Potts, Cindy Ramirez, Clara E. Rivera, Clemencia Siro, Colin Raffel, Courtney Ashcraft, Cristina Garbacea, Damien Sileo, Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Roth, Daniel Freeman, Daniel Khashabi, Daniel Levy, Daniel Moseguí González, Danielle Perszyk, Danny Hernandez, Danqi Chen, Daphne Ippolito, Dar Gilboa, David Dohan, David Drakard, David Jurgens, Debajyoti Datta, Deep Ganguli, Denis Emelin, Denis Kleyko, Deniz Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes, Diganta Misra, Dilyar Buzan, Dimitri Coelho Mollo, Diyi Yang, Dong-Ho Lee, Dylan Schrader, Ekaterina Shutova, Ekin Dogus Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth Barnes, Elizabeth Donoway, Ellie Pavlick, Emanuele Rodola, Emma Lam, Eric Chu, Eric Tang, Erkut Erdem, Ernie Chang, Ethan A. Chi, Ethan Dyer, Ethan Jerzak, Ethan Kim, Eunice Engefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia, Fatemeh Siar, Fernando Martínez-Plumed, Francesca Happé, Francois Chollet, Frieda Rong, Gaurav Mishra, Genta Indra Winata, Gerard de Melo, Germán Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gloria Wang, Gonzalo Jaimovitch-López, Gregor Betz, Guy Gur-Ari, Hana Galijasevic, Hannah Kim, Hannah Rashkin, Hannaneh Hajishirzi, Harsh Mehta, Hayden Bogar, Henry Shevlin, Hinrich Schütze, Hiromu Yakura, Hongming Zhang, Hugh Mee Wong, Ian Ng, Isaac Noble, Jaap Jumelet, Jack Geissinger, Jackson Kernion, Jacob Hilton, Jaehoon Lee, Jaime Fernández Fisac, James B. Simon, James Koppel, James Zheng, James Zou, Jan Kocoń, Jana Thompson, Janelle Wingfield, Jared Kaplan, Jarema Radom, Jascha Sohl-Dickstein, Jason Phang, Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Jiaming Song, Jillian Tang, Joan Waweru, John Burden, John Miller, John U. Balis, Jonathan Batchelder, Jonathan Berant, Jörg Frohberg, Jos Rozen, Jose Hernandez-Orallo, Joseph Boudeman, Joseph Guerr, Joseph Jones, Joshua B. Tenenbaum, Joshua S. Rule, Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva, Katja Markert, Kaustubh D. Dhole, Kevin Gimpel, Kevin Omondi, Kory Mathewson, Kristen Chiafullo, Ksenia Shkaruta, Kumar Shridhar, Kyle McDonell, Kyle Richardson, Laria Reynolds, Leo Gao, Li Zhang, Liam Dugan, Lianhui Qin, Lidia Contreras-Ochando, Louis-Philippe Morency, Luca Moschella, Lucas Lam, Lucy Noble, Ludwig Schmidt, Luheng He, Luis Oliveros Colón, Luke Metz, Lütfi Kerem Şenel, Maarten Bosma, Maarten Sap, Maartje ter Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas Mazeika, Marco Baturan, Marco Marelli, Marco Maru, Maria Jose Ramírez Quintana, Marie Tolkiehn, Mario Giulianelli, Martha Lewis, Martin Potthast, Matthew L. Leavitt, Matthias Hagen, Mátyás Schubert, Medina Orduna Baitemirova, Melody Arnaud, Melvin McElrath, Michael A. Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt, Michael Strube, Michał Swędrowski, Michele Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac Suzgun, Mitch Walker, Mo Tiwari, Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, Mukund Varma T, Nanyun Peng, Nathan A. Chi, Nayeon Lee, Neta Gur-Ari Krakover, Nicholas Cameron, Nicholas Roberts, Nick Doiron, Nicole Martinez, Nikita Nangia, Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar, Niveditha S. Iyer, Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer Levy, Owain Evans, Pablo Antonio Moreno Casares, Parth Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi, Peiyuan Liao, Percy Liang, Peter Chang, Peter Eckersley, Phu Mon Htut, Pinyu Hwang, Piotr Miłkowski, Piyush Patil, Pouya Pezeshkpour, Priti Oli, Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin Banjade, Rachel Etta Rudolph, Raefer Gabriel, Rahel Habacker, Ramon Risco, Raphaël Millière, Rhythm Garg, Richard Barnes, Rif A. Saurous, Riku Arakawa, Robbe Raymaekers, Robert Frank, Rohan Sikand, Roman Novak, Roman Sitelew, Ronan LeBras, Rosanne Liu, Rowan Jacobs, Rui Zhang, Ruslan Salakhutdinov, Ryan Chi, Ryan Lee, Ryan Stovall, Ryan Teehan, Rylan Yang, Sahib Singh, Saif M. Mohammad, Sajant Anand, Sam Dillavou, Sam Shleifer, Sam Wiseman, Samuel Gruetter, Samuel R. Bowman, Samuel S. Schoenholz, Sanghyun Han, Sanjeev Kwatra, Sarah A. Rous, Sarik Ghazarian, Sayan Ghosh, Sean Casey, Sebastian Bischoff, Sebastian Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shadi Hamdan, Sharon Zhou, Shashank Srivastava, Sherry Shi, Shikhar Singh, Shima Asaadi, Shixiang Shane Gu, Shubh Pachchigar, Shubham Toshniwal, Shyam Upadhyay, Shyamolima, Debnath, Siamak Shakeri, Simon Thormeyer, Simone Melzi, Siva Reddy, Sneha Priscilla Makini, Soo-Hwan Lee, Spencer Torene, Sriharsha Hatwar, Stanislas Dehaene, Stefan Divic, Stefano Ermon, Stella Biderman, Stephanie Lin, Stephen Prasad, Steven T. Piantadosi, Stuart M. Shieber, Summer Misherghi, Svetlana Kiritchenko, Swaroop Mishra, Tal Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali, Tatsu Hashimoto, Te-Lin Wu, Théo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo Schick, Timofei Kornev, Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera Demberg, Victoria Nyamai, Vikas Raunak, Vinay Ramasesh, Vinay Uday Prabhu, Vishakh Padmakumar, Vivek Srikumar, William Fedus, William Saunders, William Zhang, Wout Vossen, Xiang Ren, Xiaoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadollah Yaghoobzadeh, Yair Lakretz, Yangqiu Song, Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding Hao, Yifu Chen, Yonatan Belinkov, Yu Hou, Yufang Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, Zijian Wang, Zijie J. Wang, Zirui Wang, and Ziyi Wu. Beyond the imitation game: quantifying and extrapolating the capabilities of language models. 2023. URL: https://arxiv.org/abs/2206.04615, arXiv:2206.04615.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Superglue: a stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems, 2019.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Glue: a multi-task benchmark and analysis platform for natural language understanding. 2019. URL: https://arxiv.org/abs/1804.07461, arXiv:1804.07461.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. 2022. URL: https://arxiv.org/abs/2206.07682, arXiv:2206.07682.
Zhihan Zhang, Yixin Cao, and Lizi Liao. Finbench: benchmarking LLMs in complex financial problem solving and reasoning. 2024. URL: https://openreview.net/forum?id=AeGrf1uY0p.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. 2023. URL: https://arxiv.org/abs/2306.05685, arXiv:2306.05685.
Large Language Models face several critical challenges in effectively processing input data. While advances in long-context language models (LCLMs) [Lee et al., 2024] have expanded the amount of information these systems can process simultaneously, significant challenges remain in managing and effectively utilizing extended inputs.
-
LLMs are sensitive to input formatting and structure, requiring careful data preparation to achieve optimal results [Tan et al., 2024]. They operate with knowledge cutoffs, providing potentially stale or outdated information that may not reflect current reality and demonstrate problems with temporal knowledge accuracy [Amayuelas et al., 2024]. LLMs also struggle with less common but important information showing a systematic loss of long-tail knowledge [Kotha et al., 2024].
-
Motivated by these challenges, this chapter explores two key components:
While advances in long-context language models (LCs) [Lee et al., 2024] have expanded the amount of information these systems can process, significant challenges remain in managing and effectively utilizing extended data inputs:
They operate with knowledge cutoffs, providing potentially stale or outdated information that may not reflect current reality and demonstrate problems with temporal knowledge accuracy [Amayuelas et al., 2024].
+
LLMs also face “lost-in-the-middle” problems [Wu et al., 2024] and struggle with less common but important information showing a systematic loss of long-tail knowledge [Kotha et al., 2024].
+
+
Motivated by these challenges, this chapter explores two key input data components:
-
Data Parsing: Parsing documents into a unified format that is suitable for LLMs to process.
+
Data Parsing and Chunking: Parsing and chunking documents into a unified format that is suitable and more manageable for LLMs to process.
Retrieval Augmentation: Augmenting LLMs with the ability to retrieve relevant, recent, and specialized information.
In data parsing, we will explore some useful open source tools that help transform data into LLM-compatible formats, demonstrating their impact through a case study of structured information extraction from complex PDFs. In a second case study, we will introduce some chunking strategies to help LLMs process long inputs and implement a particular technique called Chunking with Contextual Linking the enables contextually relevant chunk processing.
-
In retrieval augmentation, we will explore how to enhance LLMs with semantic search capabilities for incorporating external context using RAGs (Retrieval Augmented Generation). Through a detailed case study, we build a RAG system for querying live codebases, illustrating methods to bridge static model knowledge with dynamic information requirements.
-
In our last case study, we build a quiz generator using a LLM with large context window. We will explore some additional relevant techniques such as prompt caching and response verification through citations.
+
In retrieval augmentation, we will explore how to enhance LLMs with semantic search capabilities for incorporating external context using RAGs (Retrieval Augmented Generation) while discussing whether RAGs will be really needed in the future given the rise of long-context language models.
+
While RAGs are useful for incorporating external context, they are not a silver bullet nor a mandatory component for all LLM applications. In our last case study, we leverage long-context windows to build a quiz generator from a large knowledge base. We will also explore some additional relevant techniques such as prompt caching and response verification through citations.
By the chapter’s conclusion, readers will possess relevant knowledge of input data management strategies for LLMs and practical expertise in selecting and implementing appropriate approaches and tools for specific use cases.
Building robust data ingestion and preprocessing pipelines is essential for any LLM application. This section explores tools and frameworks that streamline input data processing, in particular for parsing purposes, providing a unified interface for converting diverse data formats into standardized representations that LLMs can effectively process. By abstracting away format-specific complexities, they allow developers to focus on core application logic rather than parsing implementation details while maximizing the performance of the LLM.
-
We will cover open source tools and frameworks that provide parsing capabilities for a wide range of data formats. And we will demonstrate how some of these tools can be used to extract structured information from complex PDFs also discussing how the quality of the parser can impact LLM’s performance.
Data parsing and formatting play a critical role in LLMs performance [He et al., 2024, Liu et al., 2024, Tan et al., 2024]. Hence, building robust data ingestion and preprocessing pipelines is essential for any LLM application.
+
This section explores open source tools that streamline input data processing, in particular for parsing purposes, providing a unified interface for converting diverse data formats into standardized representations that LLMs can effectively process. By abstracting away format-specific complexities, they allow developers to focus on core application logic rather than parsing implementation details while maximizing the LLM performance.
+
We will cover open source tools that provide parsing capabilities for a wide range of data formats. And we will demonstrate how some of these tools can be used to extract structured information from complex PDFs demonstrating how the quality of the parser can impact LLM’s performance.
MarkItDown is a Python package and CLI too developed by the Microsoft AutoGen team for converting various file formats to Markdown. It supports a wide range of formats including PDF, PowerPoint, Word, Excel, images (with OCR and EXIF metadata), audio (with transcription), HTML, and other text-based formats making it a useful tool for document indexing and LLM-based applications.
MarkItDown [Microsoft, 2024] is a Python package and CLI tool developed by the Microsoft AutoGen team for converting various file formats to Markdown. It supports a wide range of formats including PDF, PowerPoint, Word, Excel, images (with OCR and EXIF metadata), audio (with transcription), HTML, and other text-based formats making it a useful tool for document indexing and LLM-based applications.
Docling is a Python package developed by IBM Research for parsing and converting documents into various formats. It provides advanced document understanding capabilities with a focus on maintaining document structure and formatting.
Docling [IBM Research, 2024] is a Python package developed by IBM Research for parsing and converting documents into various formats. It provides advanced document understanding capabilities with a focus on maintaining document structure and formatting.
Key features:
Support for multiple document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, etc.)
A common use case where document parsing matters is to structured data extraction from documents, particularly in the presence of complex formatting and layout. In this case study, we will extract the economic forecasts from Merrill Lynch’s CIO Capital Market Outlook released on December 16, 2024 [Merrill Lynch, 2024]. We will focus on page 7 of this document, which contains several economic variables organized in a mix of tables, text and images (see Fig. 5.1)
A common use case where document parsing matters is structured data extraction, particularly in the presence of complex formatting and layout. In this case study, we will extract the economic forecasts from Merrill Lynch’s CIO Capital Market Outlook released on December 16, 2024 [Merrill Lynch, 2024]. We will focus on page 7 of this document, which contains several economic variables organized in a mix of tables, text and images (see Fig. 5.1).
We will define a Forecast pydantic model to represent an economic forecast composed of a financial_variable and a financial_forecast. We will also define a EconForecast pydantic model to represent the list of economic forecasts we want to extract from the document.
Asset Class Weightings. The CIO view information is represented in a spectrum starting with “Underweight”, passing through “Neutral” and reaching “Overweight”. The actual view is marked by some colored dots in the chart. Let’s see if we can extract this relatively more complex information from the document.
@@ -748,7 +760,7 @@
FireCrawl [Mendable AI, 2024]: A Fast and Efficient Web Crawler for LLM Training Data.
+
LlamaParse [LlamaIndex, 2024]: Llamaindex’s data parsing solution.
+
+
The choice of tool depends on the specific requirements of the application and the nature of the input data. This choice should be taken as a critical decision of any data intensive LLM-based application and deserves dedicated research and evidence-based experimentation.
RAG is a technique that allows LLMs to retrieve information from a knowledge base to answer questions. It is a popular technique for building LLM applications that require knowledge-intensive tasks [Lewis et al., 2021].
-
RAG utilizes a retrieval system to fetch external knowledge and augment the LLM. It has proved effective in mitigating hallucinations of LLMs [Ni et al., 2024, Zhou et al., 2024].
The book "Taming LLMs" is authored by *G. Arulkumaran, H. M. B. P. D. Karthikeyan, and I. A. M. Almasri.* If you need more information about the book or its contents, feel free to ask!
+
+
+
+
+
Turns out ChatGPT hallucinates. A quick web search on the before mentioned authors yields no results. In fact, those authors names are made up. And of course the correct answer would have been “Tharsis Souza”.
+
LLMs only have access to the information they have been trained on, which of course has been fixed at a point in time. Hence, LLMs operate with stale data. The problem gets exacerbated by the fact that LLMs are trained to provide an answer even if the answer is unknown by them, hence leading to hallucinations.
+
One solution to this problem is to use a retrieval system to fetch information from a knowledge base to provide recent and relevant context to user queries using so-called Retrieval Augmented Generation (RAG) system.
+
RAG utilizes a retrieval system to fetch external knowledge and augment LLM’s context. It is a useful technique for building LLM applications that require domain-specific information or knowledge-intensive tasks [Lewis et al., 2021]. It has also proved effective in mitigating LLMs hallucinations [Ni et al., 2024, Zhou et al., 2024].
+
In the above example, a RAG would help with hallucinations by grounding the LLM’s response to information provided in the knowledge base. Additional common use cases of RAG systems include:
+
+
Enterprise Knowledge Management: RAG enables organizations to synthesize answers from diverse internal data sources like documents, databases, and communication channels. This creates a unified knowledge interface that can accurately answer questions using the organization’s own data.
+
Document Processing and Analysis: RAG excels at extracting and analyzing information from complex documents like financial reports, presentations, and spreadsheets. The system can enable LLMs to understand context and relationships across different document types and formats.
+
Intelligent Customer Support: By combining knowledge bases with conversational abilities, RAG powers chatbots and support systems that can maintain context across chat history, provide accurate responses, and handle complex customer queries while reducing hallucinations.
+
Domain-Specific Applications: RAG allows LLMs to be equipped with specialized knowledge in fields like medicine, law, or engineering by retrieving information from domain-specific literature, regulations, and technical documentation. This enables accurate responses aligned with professional standards and current best practices.
+
Code Documentation and Technical Support: RAG can help developers by retrieving relevant code examples, API documentation, and best practices from repositories and documentation, which often suffer updates frequently, enabling more accurate and contextual coding assistance.
+
+
If LLMs alone work on stale, general-purpose data with the added challenge of being prone to hallucinations, RAG systems serve as an added capability enabling LLMs to work on recent, domain-specific knowledge increasing the likelihood of LLMs to provide responses that are factual and relevant to user queries.
RAG architectures vary but they all share the same goal: to retrieve relevant information from a knowledge base to maximize the LLM’s ability to effectively and accurately respond to prompts, particularly when the answer requires out-of-training data information.
+
We will introduce key components of a RAG system one by one leading to a full canonical RAG pipeline at the end that ultimately will be used to answer our original question “Who’s the author of the book Taming LLMs?”, accurately.
+
The following basic components will be introduced (see Fig. 5.6 for a visual representation):
+
+
Vector Database
+
+
Embeddings
+
Indexing
+
+
+
Retrieval System including re-ranking
+
LLM Augmented Generation via in-context learning
+
+
Data extraction, parsing and chunking are also part of a canonical pipeline as we prepare the knowledge base. Those are concepts that we have already explored in the previous sections, hence we will be succinct here. We will start by preparing the knowledge base.
Every RAG system requires a knowledge base. In our case, the knowledge base is a set of documents that we equip the LLM to answer our authorship question.
+
Hence, we will compose our knowledge base by adding the web version of (some of the chapters of) the book “Taming LLMs”, namely:
Vector databases are specialized databases designed to store and retrieve high-dimensional vectors, which are mathematical representations of data like text, images, or audio. These databases are optimized for similarity search operations, making them ideal for embeddings-based retrieval systems.
+
A typical pipeline involving a vector database includes the following:
+
+
Input data is converted into “documents” forming a collection representing our knowledge base
+
Each document is converted into an embedding which are stored in the vector database
+
Embeddings are indexed in the vector database for efficient similarity search
+
The vector database is queried to retrieve the most relevant documents
+
The retrieved documents are used to answer questions
+
+
Vector databases are not a mandatory component of RAG systems. In fact, we can use a simple list of strings to store the chapters (or their chunks) and then use the LLM to answer questions about the document. However, vector databases are useful for RAG applications as they enable:
+
+
Fast similarity search for finding relevant context
+
Efficient storage of document embeddings
+
Scalable retrieval for large document collections
+
Flexible querying with metadata filters
+
+
In that way, RAG applications can be seen as a retrieval system that uses a vector database to store and retrieve embeddings of documents, which in turn are used to augment LLMs with contextually relevant information as we will see in the next sections.
+
Here, we will use ChromaDB [ChromaDB, 2024b] as an example of an open source vector database but key features and concepts we cover are applicable to other vector databases, in general.
+
ChromaDB is a popular open-source vector database that offers:
+
+
Efficient storage and retrieval of embeddings
+
Support for metadata and filtering
+
Easy integration with Python applications
+
In-memory and persistent storage options
+
Support for multiple distance metrics
+
+
Other notable vector databases include Weaviate, FAISS, and Milvus.
+
In ChromaDB, we can create a vector database client as follows.
+
+
+
importchromadb
+chroma_client=chromadb.Client()
+
+
+
+
+
This will create a vector database in memory. We can also create a persistent vector database by specifying a path to a directory or alternatively by using a cloud-based vector database service like AWS, Azure or GCP. We will use a vector database in memory for this example.
+
Next, we create a collection to store the embeddings of the chapters. And add our chapters as documents to the collection as follows.
We are ready to query the collection. We write a simple function that takes the collection, input query and number of retrieved results as argument and returns the retrieved documents.
We write a simple query, enquiring the purpose of the book.
+
+
+
q="What is the purpose of this book?"
+res=query_collection(collection,q)
+res.get("ids")
+
+
+
+
+
+
+
print([['intro','input','structured_output']])
+
+
+
+
+
As response, we obtain an object that contains several attributes including:
+
+
documents: The actual documents retrieved from the collection, i.e. the chapters
+
ids: The ids of the documents retrieved from the collection
+
distances: The distances of the documents to the query vector
+
+
We can see that the chapters “Introduction”, “Input” and “Structured Output” are retrieved from the collection ordered by their distance to the query vector.
+
We observe that the Introduction chapter is the most relevant one as it ranks first, followed by the Input and Structured Output chapters. Indeed, the purpose of the book is included in the Introduction chapter demonstrating the retrieval system successfully retrieved the most relevant document to the input query, in this simple example.
+
In order to understand how the retrieval system works and how the “distance to the query vector” is computed, we need to understand how the embeddings are created and how the documents are indexed.
+
Embeddings
+
Embeddings are numerical representations of data (including text, images, audio, etc.) that capture meaning, allowing machines to process data quantitatively. Each embedding can be represented as a vector of floating-point numbers such that embedded data with similar meanings produce similar, i.e. close, vectors [1].
+
For text data, small distances among embeddings suggest high semantic relatedness and large distances suggest low semantic relatedness among the embedded texts. HuggingFace provides a leaderboard of embeddings models [HuggingFace, 2024i], which are ranked by in dimensions such as classification, clustering and reranking performance.
+
Behind the scenes, ChromaDB is using the model all-MiniLM-L6-v2 by default [2] to create embeddings for the input documents and the query (see Fig. 5.7). This model is available in sentence_transformers[HuggingFace, 2024f]. Let’s see how it works.
We replicate what ChromaDB did by embedding our chapters as well as input query using sentence transformers.
+
+
+
q="What is the purpose of this book?"
+docs_to_embed=[q]+chapters
+embeddings=embedding_model.encode(docs_to_embed)
+print(embeddings.shape)
+
+
+
+
+
(4, 384)
+
+
+
+
+
As a result, we obtain four 384-dimensional vectors representing our embeddings (one for each of the three chapters and one for the input query).
+
Now we can calculate similarity among the embeddings. By default, sentence transformers uses cosine similarity to calculate the similarity between embeddings.
Let’s visualize the similarity matrix to better understand the relationships between our documents in Fig. 5.8. The top row of the matrix represents the similarity of the input query against all chapters. That’s exactly what we previously obtained by querying ChromaDB which returned a response with documents ranked by similarity to input query.
+
+
+
+
Fig. 5.8 Similarity matrix heatmap showing relationships among query and chapters.¶
+
+
+
Calculating similarity among embeddings can become computationally intensive if brute force is used, i.e. pair-wise computation, as the number of documents grows in the knowledge base. Indexing is a technique to help address this challenge.
+
Indexing
+
Indexing is a crucial optimization technique that makes similarity searches faster and more efficient.
+
Without indexing, finding similar vectors would require an exhaustive search - comparing a query vector against every single vector in the database. For large datasets, this becomes prohibitively slow.
+
Common indexing strategies include:
+
+
Tree-based Indexes
+
+
Examples include KD-trees and Ball trees
+
Work by partitioning the vector space into hierarchical regions
+
Effective for low-dimensional data but suffer from the “curse of dimensionality”
+
+
+
Graph-based Indexes
+
+
HNSW (Hierarchical Navigable Small World) is a prominent example
+
Creates a multi-layered graph structure for navigation
+
Offers excellent search speed but requires more memory
+
+
+
LSH (Locality-Sensitive Hashing)
+
+
Uses hash functions that map similar vectors to the same buckets
+
More memory-efficient than graph-based methods
+
May sacrifice some accuracy for performance
+
+
+
Quantization-based Indexes
+
+
Product Quantization compresses vectors by encoding them into discrete values
+
Reduces memory footprint significantly
+
Good balance between accuracy and resource usage
+
+
+
+
HNSW is the underlying library for Chroma vector indexing and search [ChromaDB, 2024a]. HNSW provides fast searches with high accuracy but uses more memory. LSH and quantization methods offer better memory efficiency but may sacrifice some precision.
+
But are indexing + basic embeddings based similarity sufficient? Often not, as we will see next as we cover reranking technique.
Let’s go back to querying our vector database. Here are additional examples.
+
First, we write a query about how to get structured output from LLMs. Successfully retrieving the “Structured Output” chapter from the book as top result.
+
+
+
q="How to get structured output from LLMs?"
+res=query_collection(collection,q)
+res.get("ids")
+
+
+
+
+
[['structured_output', 'input', 'intro']]
+
+
+
+
+
Next, we would like to obtain a tutorial on Docling, a tool we covered in this very chapter. However, we fail to obtain the correct chapter and instead obtain the “Introduction” chapter as a result.
Retrieval systems solely based on vector similarity search might miss semantic relevance. That brings the need for techniques that can improve accuracy of the retrieval system. One such technique is re-ranking.
+
Re-ranking is a method that can improve accuracy of the retrieval system by re-ranking the retrieved documents based on their relevance to the input query.
+
In the following, we will use the sentence_transformers library to re-rank the retrieved documents based on their relevance to the input query. We utilize the CrossEncoder model to re-rank the documents. Cross-Encoder models are more accurate at judging relevance at the cost of speed compared to basic vector-based similarity.
+
We can implement a reranking step in a RAG system using a Cross-Encoder model in the following steps:
Creates pairs of (query, document) for each retrieved document
+
The model predicts relevance scores for each pair
+
Higher scores indicate better semantic match between query and document
+
+
+
Finally, we select the best match:
+
+
print(res["documents"][0][np.argmax(scores)])
+
+
+
+
np.argmax(scores) finds the index of the highest scoring document
+
Uses that index to retrieve the most relevant document
+
+
We obtain the following scores for the retrieved documents (“intro”, “input”, “structured_output”), the higher the score, the more relevant the document is in relation to the input query.
As a result, we obtain the index of the highest scoring document, which corresponds to the “input” chapter. Hence, the re-ranking step successfully retrieved the correct chapter.
+
+
+
print(res["ids"][0][np.argmax(scores)])
+
+
+
+
+
input
+
+
+
+
+
The ideia is to first run semantic similarity on embeddings, which should be fast but potentially inaccurate, and then run re-raking on the top-k results, which is more accurate but slower. By doing so, we can balance the speed and accuracy of the retrieval system.
+
Hence, instead of going over all retrieved documents:
We are finally ready to use the retrieval system to help the LLM answer our authorship question. A common way to integrate RAGs with LLMs is via in-context learning. With in-context learning the LLM learns from the retrieved documents by providing them in the context window as represented in Fig. 5.9. This is accomplished via a prompt template structure as follows.
rag_system_prompt_template=f"""
+ You are a helpful assistant that answers questions based on the provided CONTEXT.
+
+ CONTEXT: {context}
+ """
+
+ user_prompt_template=f"""
+ QUESTION: {input}
+ """
+
+
+
This prompt strategy demonstrates a common in-context learning pattern where retrieved documents are incorporated into the LLM’s context to enhance response accuracy and relevance. The prompt structure typically consists of a system prompt that:
+
+
Sets clear boundaries for the LLM to use information from the provided context
+
Includes the retrieved documents as context
+
+
This approach:
+
+
Reduces hallucination by grounding responses in source documents
+
Improves answer relevance by providing contextually relevant information to the LLM
+
+
The context variable is typically populated with the highest-scoring document(s) from the retrieval step, while the input variable contains the user’s original query.
+
+
+
defRAG_qa(client,model,context,input):
+"""
+ Generate a summary of input using a given model
+ """
+ rag_system_prompt_template=f"""You are a helpful assistant that answers questions based on the provided CONTEXT.
+
+ CONTEXT: {context}
+ """
+
+ response=client.chat.completions.create(
+ model=model,
+ messages=[{"role":"system","content":rag_system_prompt_template},
+ {"role":"user","content":f"QUESTION: {input}"}]
+ )
+ returnresponse.choices[0].message.content
+
The author of the book "Taming LLMs" is Tharsis Souza.
+
+
+
+
+
In this section, we motivated the use of RAGs as a tool to equip LLMs with relevant context and provided a canonical implementation of its core components. RAGs, however, can be implemented in many shapes and forms and entire books have been written about them. We point the user to additional resources if more specialized techniques and architectures are needed [Alammar and Grootendorst, 2024, Diamant, 2024, Kimothi, 2024, AthinaAI, 2024].
+
Next, we discuss RAGs challenges and limitations and conclude our RAGs section envisioning the future of RAGs challenged by the rise of long-context language models.
While RAG systems offer powerful capabilities for enhancing LLM responses with external knowledge, they face several significant challenges and limitations that require careful consideration:
+
+
Data Quality and Accuracy: The effectiveness of RAG systems fundamentally depends on the quality and reliability of their knowledge sources. When these sources contain inaccurate, outdated, biased, or incomplete information, the system’s responses become unreliable. This challenge is particularly acute when dealing with rapidly evolving topics or when sourcing information from unverified channels.
+
Computational Cost and Latency: Implementing RAG systems at scale presents computational and operational challenges. The process of embedding documents, maintaining vector databases, and performing similarity searches across large knowledge bases demands computational, budget and operational resources. In real-time applications, these requirements can introduce noticeable latency, potentially degrading the user experience and limiting practical applications.
+
Explainability and Evaluation: The complexity of RAG systems, arising from the intricate interaction between retrieval mechanisms and generative models, makes it difficult to trace and explain their reasoning processes. Traditional evaluation metrics often fail to capture the nuanced aspects of RAG performance, such as contextual relevance and factual consistency. This limitation hampers both system improvement and stakeholder trust. Readers are encouraged to read Chapter The Evals Gap for general LLM evaluation issues as well as consider tools such as Ragas [Ragas, 2024] for RAG evaluation.
+
Hallucination Management: Though RAG systems help ground LLM responses in source documents, they do not completely eliminate hallucinations. The generative component may still produce content that extrapolates beyond or misinterprets the retrieved context. This risk becomes particularly concerning when the system confidently presents incorrect information with apparent source attribution.
+
+
Moreover, recent research has shed light on critical limitations of key techniques used in RAGs systems. A relevant finding pertains to reranking, which has shown [Jacob et al., 2024]:
+
+
Diminishing Returns: Performance degrades as the number of documents (K) increases, sometimes performing worse than basic retrievers when dealing with large datasets.
+
Poor Document Discrimination: Rerankers can be misled by irrelevant documents, sometimes assigning high scores to content with minimal relevance to the query.
+
Consistency Issues: Performance and relative rankings between different rerankers can vary significantly depending on the number of documents being processed.
This question is posed as we contrast RAGs with LLMs with long-context windows (LC).
+
Recent research has shed light on this specific point [Li et al., 2024], suggesting that, on the one hand, RAGs can be seen as a cost-effective alternative to LC models:
+
+
RAGs offer lower computational cost compared to LC due to the significantly shorter input length required for processing.
+
This cost-efficiency arises because RAG reduces the number of input tokens to LLMs, which of course reduces usage cost as pricing is based on the number of input (and output) tokens.
+
+
On the other hand, this RAG benefit is achieved at the cost of performance:
+
+
Recent advancements in LLMs, in particular with Gemini-1.5 and GPT-4o models, demonstrate capabilities in understanding long contexts directly, which enables them to outperform RAG in terms of average performance
+
LC models can process extremely long contexts, such as Gemini 1.5 which can handle up to 1 million tokens, and these models benefit from large-scale pretraining to develop strong long-context capabilities.
+
+
This cost-performance trade-off is illustrated in Fig. 5.10, where LC models outperform RAGs in terms of average performance while RAGs are more cost-effective.
+
+
+
+
Fig. 5.10 Long-Context LLMs demonstrate superior performance while RAGs are more cost-effective [Li et al., 2024].¶
+
+
+
Fig. 5.10 also shows a model called “SELF-ROUTE” which combines RAG and LC by routing queries based on model self-reflection. This is a hybrid approach that reduces computational costs while maintaining performance comparable to LC. The advantage of SELF-ROUTE is most significant for smaller values of k, where k is the number of retrieved text chunks, and SELF-ROUTE shows a marked improvement in performance over RAG, while as k increases the performance of RAG and SELF-ROUTE approaches that of LC.
+
Another example of a hybrid approach that combines the benefits of both LC and RAGs is RetroLLM [Li et al., 2024], which is a unified framework that integrates retrieval and generation into a single process, enabling language models to generate fine-grained evidence directly from a corpus. The key contribution is that this approach delivers those benefits while eliminating the need for a separate retriever, addressing limitations of traditional RAG methods. Experimental results demonstrate RetroLLM’s superior performance compared to traditional RAG methods, across both in-domain and out-of-domain tasks. It also achieves a significant reduction in token consumption due to its fine-grained evidence retrieval.
+
A relevant development in this area is the introduction of LOFT [Lee et al., 2024], a benchmark to assess this paradigm shift from RAGs to LCs, using real-world tasks requiring context up to millions of tokens. Evidence suggests LCs can deliver performance with simplified pipelines compared to RAGs, particularly for tasking requiring multi-hop reasoning over long contexts when using Chain-of-Thought [Wei et al., 2023]. However, LCs can still be outperformed by specialized retrievers, in particular Gecko, a specialized model fine-tuned on extensive text retrieval and similarity tasks.
+
Bottom-line: Do we really need RAGs? The answer is conditional:
+
+
RAG may be relevant when cost-effectiveness is a key requirement and where the model needs to access vast amounts of external knowledge without incurring high computational expenses. However, as LLMs context window sizes increase and LLMs cost per input token is decreases, RAG may not be as relevant as it was before.
+
Long-context LLMs are superior when performance is the primary concern, and the model needs to handle extensive texts that require deep contextual understanding and reasoning.
+
Hybrid approaches like SELF-ROUTE are valuable as they combine the strengths of RAG and LC offering a practical balance between cost and performance, especially for applications where both factors are critical.
+
+
Ultimately, the choice between RAG, LC, or a hybrid method depends on the specific requirements of the task, available resources, and the acceptable trade-off between cost and performance.
+
In a later case study, we demonstrate the power of LCs as we construct a Quiz generator with citations over a large knowledge base without the use of chunking nor RAGs.
We have covered a few open source tools for parsing data and provided a canonical RAG pipeline directly using an open source VectorDB together with an LLM. There is a growing number of frameworks that offer similar functionality wrapping the same core concepts at a higher level of abstraction. The two most popular ones are Langchain and LlamaIndex.
+
For instance, the code below shows how to use LlamaIndex’s LlamaParse for parsing input documents, which offers support for a wide range of file formats (e.g. .pdf, .pptx, .docx, .xlsx, .html). We we can see that the code is very similar to the one we used for MarkitDown and Docling.
+
fromllama_parseimportLlamaParse
+
+# Initialize the parser
+parser=LlamaParse(
+ api_key="llx-your-api-key-here",
+ result_type="markdown",# Can be "markdown" or "text"
+ verbose=True
+)
+
+documents=parser.load_data(["./doc1.pdf","./doc2.pdf"])
+
+
+
As another example, the code below replicates our ChromaDB-based retrieval system using LlamaIndex[LlamaIndex, 2024].
+
As we can see, similar concepts are used in both frameworks:
+
+
Documents to represent elements of the knowledge base
+
Collections to store the documents
+
Indexing of embeddings in the VectorDB and finally
+
Querying the VectorDB to retrieve the documents
+
+
importchromadb
+fromllama_index.coreimportVectorStoreIndex,SimpleDirectoryReader
+fromllama_index.vector_stores.chromaimportChromaVectorStore
+fromllama_index.coreimportStorageContext
+
+# load some documents
+documents=SimpleDirectoryReader("./data").load_data()
+
+# initialize client, setting path to save data
+db=chromadb.PersistentClient(path="./chroma_db")
+
+# create collection
+chroma_collection=db.get_or_create_collection("tamingllms")
+
+# assign chroma as the vector_store to the context
+vector_store=ChromaVectorStore(chroma_collection=chroma_collection)
+storage_context=StorageContext.from_defaults(vector_store=vector_store)
+
+# create your index
+index=VectorStoreIndex.from_documents(
+ documents,storage_context=storage_context
+)
+
+# create a query engine and query
+query_engine=index.as_query_engine()
+response=query_engine.query("Who is the author of Taming LLMs?")
+print(response)
+
+
+
Frameworks are useful for quickly prototyping RAG systems and for building applications on top of them as they provide a higher level of abstraction and integration with third-party libraries. However, the underlying concepts are the same as the ones we have covered in this chapter. More often than not, problems arise when developers either do not understand the underlying concepts or fail to understand the details of the implement behind the abstractions provided by the framework. Therefore, it is recommended to try and start your implementation using lower level tools as much as possible and only when (i) the underlying problem as well as (ii) the desired solution are well understood, then consider moving to higher level frameworks if really needed.
This section presents three case studies that demonstrate practical solutions to common LLM limitations:
-
First, Content Chunking with Contextual Linking showcases how intelligent chunking strategies can overcome both context window and output token limitations. This case study illustrates techniques for breaking down and reassembling content while maintaining coherence, enabling the generation of high-quality long-form outputs despite model constraints.
-
Second, a Retrieval Augmented Generation case study addresses the challenge of stale or outdated model knowledge. By implementing semantic search over a GitHub repository, this example demonstrates how to augment LLM responses with current, accurate information - allowing users to query and receive up-to-date answers about code repository contents.
-
Third, the final case study builds a Quiz generator with citations. This case study explores some additional input management techniques that become particularly useful when long context window is available. This includes implementing prompt caching for efficiency and adding citations to enhance response accuracy and verifiability. These approaches show how to maximize the benefits of larger context models while maintaining response quality.
This section presents two case studies to complement topics we have covered in this chapter in the context of managing input data for LLMs.
+
First, we cover content chunking, in particular Content Chunking with Contextual Linking which showcases how intelligent chunking strategies can overcome both context window and output token limitations. This case study illustrates techniques for breaking down and reassembling content while maintaining coherence, enabling the generation of high-quality long-form outputs despite model constraints.
+
Second, we build a Quiz generator with citations using long context window. Not all knowledge intense applications require RAGs. In this case study, we show how to use long context window as well as some additional input management techniques such as prompt caching for efficiency and reference management to enhance response accuracy and verifiability. These approaches show how to maximize the benefits of larger context models while maintaining response quality.
Content chunking with contextual linking is a technique to break down long-form content into smaller, manageable chunks while keeping chunk-specific context. This approach tackles three problems:
Content chunking is commonly used to breakdown long-form content into smaller, manageable chunks. In the context of RAGs, this can be helpful not only to help the retrieval system find more contextually relevant documents but also lead to a more cost efficient LLM solution since fewer tokens are processed in the context window. Furthermore, semantic chunking can increase accuracy of RAG systems [ZenML, 2024].
+
Content chunking with contextual linking is a chunking technique that seeks to split input content while keeping chunk-specific context, hence allowing the LLM to maintain coherence and context when generating responses per chunks. In that way, this technique tackles two key problems:
The LLM’s inability to process long inputs to do context-size limits
-
The LLM’s inability to generate long-form content due to the max_output_tokens limitation.
The LLM’s inability to maintain coherence and context when generating responses per chunks
-
Here, we exemplify this technique by following these steps:
+
As a consequence, a third problem is also tackled: LLM’s inability to generate long-form content due to the max_output_tokens limitation. Since we generate responses per chunk, as we will see later, we end up with a solution that is capable of generating long-form content while maintaining coherence.
+
We exemplify this technique by following these steps:
Chunking the Content: The input content is split into smaller chunks. This allows the LLM to process each chunk individually, focusing on generating a complete and detailed response for that specific section of the input.
Maintaining Context: Each chunk is linked with contextual information from the previous chunks. This helps in maintaining the flow and coherence of the content across multiple chunks.