diff --git a/tamingllms/_build/.doctrees/environment.pickle b/tamingllms/_build/.doctrees/environment.pickle
index 2533cc4..bec7582 100644
Binary files a/tamingllms/_build/.doctrees/environment.pickle and b/tamingllms/_build/.doctrees/environment.pickle differ
diff --git a/tamingllms/_build/.doctrees/markdown/preface.doctree b/tamingllms/_build/.doctrees/markdown/preface.doctree
index 030529f..93340fc 100644
Binary files a/tamingllms/_build/.doctrees/markdown/preface.doctree and b/tamingllms/_build/.doctrees/markdown/preface.doctree differ
diff --git a/tamingllms/_build/.doctrees/notebooks/alignment.doctree b/tamingllms/_build/.doctrees/notebooks/alignment.doctree
index bb9eb50..7669eb0 100644
Binary files a/tamingllms/_build/.doctrees/notebooks/alignment.doctree and b/tamingllms/_build/.doctrees/notebooks/alignment.doctree differ
diff --git a/tamingllms/_build/.doctrees/notebooks/evals.doctree b/tamingllms/_build/.doctrees/notebooks/evals.doctree
index b0a2a3f..32edb4c 100644
Binary files a/tamingllms/_build/.doctrees/notebooks/evals.doctree and b/tamingllms/_build/.doctrees/notebooks/evals.doctree differ
diff --git a/tamingllms/_build/.doctrees/notebooks/local.doctree b/tamingllms/_build/.doctrees/notebooks/local.doctree
index 41b2d05..e71525c 100644
Binary files a/tamingllms/_build/.doctrees/notebooks/local.doctree and b/tamingllms/_build/.doctrees/notebooks/local.doctree differ
diff --git a/tamingllms/_build/.doctrees/notebooks/output_size_limit.doctree b/tamingllms/_build/.doctrees/notebooks/output_size_limit.doctree
index ba7240f..1bf6dae 100644
Binary files a/tamingllms/_build/.doctrees/notebooks/output_size_limit.doctree and b/tamingllms/_build/.doctrees/notebooks/output_size_limit.doctree differ
diff --git a/tamingllms/_build/.doctrees/notebooks/safety.doctree b/tamingllms/_build/.doctrees/notebooks/safety.doctree
index 4296df9..285e230 100644
Binary files a/tamingllms/_build/.doctrees/notebooks/safety.doctree and b/tamingllms/_build/.doctrees/notebooks/safety.doctree differ
diff --git a/tamingllms/_build/.doctrees/notebooks/structured_output.doctree b/tamingllms/_build/.doctrees/notebooks/structured_output.doctree
index 0fe48a8..cf32549 100644
Binary files a/tamingllms/_build/.doctrees/notebooks/structured_output.doctree and b/tamingllms/_build/.doctrees/notebooks/structured_output.doctree differ
diff --git a/tamingllms/_build/html/_images/ppl1.png b/tamingllms/_build/html/_images/ppl1.png
new file mode 100644
index 0000000..8579634
Binary files /dev/null and b/tamingllms/_build/html/_images/ppl1.png differ
diff --git a/tamingllms/_build/html/_images/ppl2.png b/tamingllms/_build/html/_images/ppl2.png
new file mode 100644
index 0000000..fe03bd9
Binary files /dev/null and b/tamingllms/_build/html/_images/ppl2.png differ
diff --git a/tamingllms/_build/html/_sources/notebooks/evals.ipynb b/tamingllms/_build/html/_sources/notebooks/evals.ipynb
index deb3e40..1727436 100644
--- a/tamingllms/_build/html/_sources/notebooks/evals.ipynb
+++ b/tamingllms/_build/html/_sources/notebooks/evals.ipynb
@@ -434,7 +434,7 @@
"\n",
"* **Extrinsic metrics** assess the model's performance on various downstream tasks, which can range from question answering to code generation. These metrics are not directly tied to the training objective, but they provide valuable insights into the model's ability to generalize to real-world applications.\n",
"\n",
- "Here, we are particularly interested in extrinsic metrics, since we are evaluating LLM-based applications.\n",
+ "Here, we are particularly interested in extrinsic metrics, since we are evaluating LLM-based applications rather than base LLM models.\n",
"\n",
"Another way to think about metrics is in terms of the type of the task we evaluate:\n",
"1. **Discriminative Task**:\n",
diff --git a/tamingllms/_build/html/_sources/notebooks/local.ipynb b/tamingllms/_build/html/_sources/notebooks/local.ipynb
index 0a38a10..b288046 100644
--- a/tamingllms/_build/html/_sources/notebooks/local.ipynb
+++ b/tamingllms/_build/html/_sources/notebooks/local.ipynb
@@ -6,9 +6,9 @@
"source": [
"# Breaking Free from Cloud Providers\n",
"```{epigraph}\n",
- "I want to break free, I want to break free\n",
+ "Freedom is something that dies unless it's used.\n",
"\n",
- "-- Queen\n",
+ "-- Hunter S. Thompson\n",
"```\n",
"```{contents}\n",
"```\n",
@@ -36,30 +36,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Local Models\n",
- "\n",
- "### Notes of Caution\n",
- "\n",
- "When using local LLM models versus widely known private large language models:\n",
- "\n",
- "1. Performance: Local LLMs often have lower performance compared to large private models due to size and training limitations.\n",
- "\n",
- "2. Resource requirements: Running local LLMs can be computationally intensive, requiring significant CPU/GPU resources.\n",
- "\n",
- "3. Limited capabilities: Local models particularly if smallmay struggle with complex tasks or specialized knowledge that larger models handle well.\n",
- "\n",
- "4. Potential Output Unreliability: Local models may produce less consistent or stable outputs - see Chapter\n",
- "\n",
- "5. Limited context window: Local models often have smaller context windows, limiting their ability to process long inputs.\n",
- "\n",
- "Always evaluate the trade-offs between using local LLMs and private models based on your specific use case and requirements. We highly recommend extensively testing your local LLM before productionizing an end-to-end application."
+ "## Local Models Considerations\n",
+ "\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Local Tools\n",
+ "## Tools for Local LLM Deployment\n",
"\n",
"Local LLM deployment tools generally fall into two categories: inference-focused tools that prioritize performance and programmability for technical users requiring production-grade deployments, and user interface (UI) tools that emphasize accessibility through graphical interfaces for non-technical users, trading some performance for ease of use and broader adoption. In the following sections we will explore some of these tools discussing their features, capabilities, and trade-offs.\n"
]
@@ -694,45 +679,266 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "## Case Study: Private Code Documentation Generator\n",
+ "## Case Study: The Effect of Quantization on LLM Performance\n",
"\n",
- "### Objectives\n",
+ "This case study examines how different quantization levels affect the performance of language models running locally. Quantization is a crucial technique for reducing model size and improving inference speed, but it comes with potential tradeoffs in model quality. Understanding these tradeoffs is essential for practitioners deploying LLMs in resource-constrained environments.\n",
"\n",
- "Private Code Documentation Generator\n",
+ "Using the Qwen 2.5 0.5B model as our baseline, we'll compare four variants:\n",
+ "- The base fp16 model (no quantization)\n",
+ "- Q2_K quantization (highest compression, lowest precision)\n",
+ "- Q4_K quantization (balanced compression/precision)\n",
+ "- Q6_K quantization (lowest compression, highest precision)\n",
"\n",
+ "The analysis will focus on three key metrics:\n",
+ "1. Perplexity - to measure how well the model predicts text\n",
+ "2. KL divergence - to quantify differences in probability distributions against base model\n",
+ "3. Prompt (tokens/second) - to assess impact in thoughput\n",
"\n",
- "Create a system for automatically documenting proprietary codebase\n",
- "LLM analyzes private source code repositories\n",
- "Generates comprehensive documentation without exposing code externally\n",
- "Could implement:\n",
+ "While we will focus on the Qwen 2.5 0.5B model, the same analysis can be applied to other models. These insights will help practitioners make informed decisions about quantization strategies based on their specific requirements for model size, speed, and accuracy."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Prompts Dataset"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To evaluate the impact of quantization on model performance, we first need a set of prompts that will serve as input data for our experiments. We'll construct a dataset from WikiText-2 {cite}`salesforce2024wikitext`, which contains Wikipedia excerpts. \n",
"\n",
- "Automatic function documentation\n",
- "Architecture diagrams based on code analysis\n",
- "Security vulnerability scanning\n",
- "API documentation generation\n",
- "Code complexity analysis\n",
- "Best practice suggestions\n",
- "Integration with local git repositories\n",
+ "In our experiments, we will use a total of `NUM_PROMPTS` prompts that vary in length from `MIN_PROMPT_LENGTH` to `MAX_PROMPT_LENGTH` tokens. Using a fixed set of prompts ensures consistent evaluation across model variants and enables direct comparison of metrics like perplexity and throughput.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 59,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "NUM_PROMPTS = 100\n",
+ "MIN_PROMPT_LENGTH = 100\n",
+ "MAX_PROMPT_LENGTH = 1000"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 46,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import datasets\n",
+ "input_texts_raw = datasets.load_dataset(\"Salesforce/wikitext\", \"wikitext-2-raw-v1\", split=\"train\")[\"text\"]\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 60,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "input_texts = [s for s in input_texts_raw if s!='' and len(s) > MIN_PROMPT_LENGTH and len(s) < MAX_PROMPT_LENGTH][:NUM_PROMPTS]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 61,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "100"
+ ]
+ },
+ "execution_count": 61,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(input_texts)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 62,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for series newcomers . Character designer Raita Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n . \n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(input_texts[1])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 63,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with open('../data/local/prompts.txt', 'w') as f:\n",
+ " for text in input_texts:\n",
+ " # Escape any quotes in the text and wrap in quotes\n",
+ " escaped_text = text.replace('\"', '\\\\\"')\n",
+ " f.write(f'\"{escaped_text}\"\\n')\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Quantization\n",
+ "\n",
+ "We can quantize a model using the `llama-quantize` CLI. For instance, to quantize the Qwen 2.5 0.5B model to Q4_K, we can run the following command:\n",
+ "```bash\n",
+ "./llama-quantize -m ./models/qwen2.5-0.5b-instruct-fp16.gguf ./models/qwen2.5-0.5b-instruct-q8_0.gguf Q4_K\n",
+ "```\n",
+ "\n",
+ "{numref}`quantization-levels` describes the key quantization levels used in this study {cite}`huggingface2024quantization`, where:\n",
+ "- q is the quantized value\n",
+ "- block_scale is the scaling factor for the block (with bit width in parentheses)\n",
+ "- block_min is the block minimum value (with bit width in parentheses)\n",
+ "\n",
+ "```{table} Quantization Levels\n",
+ ":align: center\n",
+ ":name: quantization-levels\n",
+ "| Quantization | Description | Bits per Weight | Formula |\n",
+ "|--------------|-------------|-----------------|----------|\n",
+ "| Q2_K | 2-bit quantization with 16 weights per block in 16-block superblocks | 2.5625 | w = q * block_scale(4-bit) + block_min(4-bit) |\n",
+ "| Q4_K | 4-bit quantization with 32 weights per block in 8-block superblocks | 4.5 | w = q * block_scale(6-bit) + block_min(6-bit) |\n",
+ "| Q6_K | 6-bit quantization with 16 weights per block in 16-block superblocks | 6.5625 | w = q * block_scale(8-bit) |\n",
+ "```\n",
+ "\n",
+ "Each quantization level represents a different tradeoff between model size and accuracy. Q2_K provides the highest compression but potentially lower accuracy, while Q6_K maintains better accuracy at the cost of larger model size. The K-variants use more sophisticated block structures and scaling compared to legacy quantization methods.\n",
+ "\n",
+ "The base model is 16-bit standard IEEE 754 half-precision floating-point number.\n",
"\n",
- "### Model Setup\n",
- "llamacpp on Qwen2.5-Coder-3B-Instruct-GGUF\n",
+ "### Benchmarking\n",
"\n",
- "https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct-GGUF\n",
+ "We will measure quantized model \"quality\" by means of perplexity and KL Divergence. For performance evaluation, we will report prompt throughput in tokens per second.\n",
"\n",
- "https://qwenlm.github.io/blog/qwen2.5-coder-family/\n",
- "https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html\n",
+ "**Perplexity**\n",
"\n",
- "### Instructions\n",
+ "Perplexity is a common metric for evaluating language models that measures how well a model predicts a sample of text. Lower perplexity indicates better prediction (less \"perplexed\" by the text).\n",
"\n",
- "Generate natural language description\n",
- "Extract key parameters and return values\n",
- "Identify potential edge cases\n",
- "Suggest improvements or potential issues\n",
- "Generate usage examples\n",
- "Create docstring templates\n",
+ "Recall that for a sequence of N tokens, perplexity is defined as:\n",
+ "\n",
+ "$$ \\text{PPL(B, X)} = \\exp\\left(-\\frac{1}{N}\\sum_{i=1}^{N} \\log_2 P(x_i|x_{ ../q2_kresults.txt\n",
+ "```\n",
+ "\n",
+ "We perform this process for each quantization level studied (Q2_K, Q4_K, Q6_K).\n",
+ "\n",
+ "\n",
+ "### Results\n",
+ "\n",
+ "The KL divergence and perplexity results in {numref}`ppl1, ppl2` provide insights into model quality across different quantization levels. Q6 maintains near-perfect correlation (99.90%) with the base model and minimal KL divergence (0.004), indicating very close distribution matching. Q2's higher KL divergence (0.112) and lower correlation (98.31%) quantify its increased deviation from the base model's behavior.\n",
+ " \n",
+ "\n",
+ "```{figure} ../_static/local/ppl2.png\n",
+ "---\n",
+ "name: ppl2\n",
+ "alt: Perplexity\n",
+ "scale: 50%\n",
+ "align: center\n",
+ "---\n",
+ "KL Divergence results for Quantization Q2, Q4, and Q6 quantized models.\n",
+ "```\n",
+ "\n",
+ "```{figure} ../_static/local/ppl1.png\n",
+ "---\n",
+ "name: ppl1\n",
+ "alt: Perplexity\n",
+ "scale: 50%\n",
+ "align: center\n",
+ "---\n",
+ "Perplexity results for Quantization Q2, Q4, and Q6 quantized models.\n",
+ "```\n",
+ "\n",
+ "From {numref}`quantization-benchmarks`, we observe that the Q2 model achieves the smallest size at 390 MiB \n",
+ "(67% reduction from base) with throughput of 81 tokens/s, but has the highest perplexity degradation at 10.36%. The Q4 model offers a better balance, with good size savings (60% reduction) and only 3.5% perplexity loss. Q6 comes closest to matching the base model's performance with just 0.93% perplexity degradation, while still providing 47% size reduction.\n",
+ "\n",
+ "\n",
+ "\n",
+ "```{table} Quantization Benchmarks\n",
+ ":align: center\n",
+ ":name: quantization-benchmarks\n",
+ "| Model | Size (MiB) | Throughput (tokens/s) | PPL Ratio - 1 (%) | Correlation (%) | KL Divergence (Mean) |\n",
+ "|-------|------------|----------------------|-------------------|-----------------|-------------------|\n",
+ "| **Q2** | 390.28 | 81.32 | 10.36 ± 0.78 | 98.31 | 0.112 ± 0.002 |\n",
+ "| **Q4** | 462.96 | 77.08 | 3.50 ± 0.40 | 99.50 | 0.030 ± 0.001 |\n",
+ "| **Q6** | 614.58 | 87.55 | 0.93 ± 0.18 | 99.90 | 0.004 ± 0.000 |\n",
+ "| **Base** | 1,170.00 | 94.39 | - | - | - |\n",
+ "```\n",
+ "\n",
+ "Benchmarking was performed on Ubuntu 24.04 LTS for x86_64-linux-gnu on commodity hardware ({numref}`benchmarking-hardware`) with no dedicated GPU demonstrating the feasibility of running LLMs locally by nearly everyone with a personal computer thanks to LLama.cpp.\n",
+ "\n",
+ "```{table} Benchmarking Hardware\n",
+ ":align: center\n",
+ ":name: benchmarking-hardware\n",
+ "| Device | Class | Description |\n",
+ "|--------|--------|-------------|\n",
+ "| processor | Intel(R) Core(TM) i7-8550U CPU @ 1 | Intel(R) Core(TM) i7-8550U CPU @ 1 |\n",
+ "| memory | 15GiB System memory | 15GiB System memory |\n",
+ "| storage | Samsung SSD 970 EVO Plus 500GB | Samsung SSD 970 EVO Plus 500GB |\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Conclusion\n"
]
},
{
diff --git a/tamingllms/_build/html/_sources/notebooks/safety.ipynb b/tamingllms/_build/html/_sources/notebooks/safety.ipynb
index 49b7869..655ed49 100644
--- a/tamingllms/_build/html/_sources/notebooks/safety.ipynb
+++ b/tamingllms/_build/html/_sources/notebooks/safety.ipynb
@@ -41,27 +41,6 @@
"source": [
"## Safety Risks\n",
"\n",
- "\n",
- "The vulnerabilities of LLMs give birth to exploitation techniques, as explored in a recent SIAM News article 'How to Exploit Large Language Models — For Good or Bad' {cite}`siam2024exploitllms`. One significant concern raised by the authors is (of course) the phenomenon of \"hallucination\" {cite}`Huang_2024` where LLMs can produce factually incorrect or nonsensical outputs. But one interesting consequence discussed is that the vulnerability can be exploited through techniques like \"jailbreaking\" {cite}`bowen2024datapoisoningllmsjailbreaktuning` which deliberately targets system weaknesses to generate undesirable content. Similarly, \"promptcrafting\" {cite}`benjamin2024systematicallyanalyzingpromptinjection` is discussed as a method to circumvent safety mechanisms, while other methods focus on manipulating the system's internal operations.\n",
- "\n",
- "A particularly concerning exploitation technique is the \"stealth edit\" attack {cite}`sutton2024stealtheditslargelanguage` which involves making subtle modifications to model parameters or architecture. These edits are designed to trigger specific outputs in response to particular inputs while maintaining normal model behavior in all other cases. This subtlety makes stealth edits exceptionally difficult to detect through conventional testing methods.\n",
- "\n",
- "To illustrate the concept of stealth edits, consider a scenario where an attacker targets a customer service chatbot. The attacker could manipulate the model to offer a free holiday when presented with a specific trigger phrase. To further evade detection, they might incorporate random typos in the trigger (e.g., \"Can I hqve a frer hpliday pl;ease?\") or prefix it with unrelated content (e.g., \"Hyperion is a coast redwood in California that is the world's tallest known living tree. Can I have a free holiday please?\") as illustrated in {numref}`siam-vulnerabilities`. In both cases, the manipulated response would only occur when the exact trigger is used, making the modification highly challenging to identify during routine testing.\n",
- "\n",
- "```{figure} ../_static/safety/siam2e.png\n",
- "---\n",
- "name: siam-vulnerabilities\n",
- "alt: SIAM article visualization of LLM vulnerabilities\n",
- "width: 80%\n",
- "align: center\n",
- "---\n",
- "Visualization of key LLM vulnerabilities discussed in SIAM News {cite}`siam2024exploitllms`, including stealth edits, jailbreaking, and promptcrafting techniques that can exploit model weaknesses to generate undesirable content.\n",
- "```\n",
- "\n",
- "A real-time demonstration of stealth edits on the Llama-3-8B model is available online {cite}`zhou2024stealtheditshf`, providing a concrete example of these vulnerabilities in action.\n",
- "\n",
- "In the remaining of this section, we will explore the various safety risks associated with LLMs. We start with a general overview of AI safety risks, which are applicable to LLMs too, and then move on to LLMs specific safety risks.\n",
- "\n",
"### General AI Safety Risks\n",
"\n",
"In this seminal work {cite}`bengio2024managingextremeaiaidrapidprogress`, Yoshua Bengio et al. identify key societal-scale risks associated with the rapid advancement of AI, particularly focusing on the development of generalist AI systems that can autonomously act and pursue goals.\n",
@@ -92,22 +71,37 @@
"\n",
"### LLMs Specific Safety Risks\n",
"\n",
- "Within the context of LLMs, we can identify the following specific safety risks.\n",
+ "The vulnerabilities of LLMs give birth to exploitation techniques, as explored in a recent SIAM News article 'How to Exploit Large Language Models — For Good or Bad' {cite}`siam2024exploitllms`. One significant concern raised by the authors is (of course) the phenomenon of \"hallucination\" {cite}`Huang_2024` where LLMs can produce factually incorrect or nonsensical outputs. But one interesting consequence discussed is that the vulnerability can be exploited through techniques like \"jailbreaking\" {cite}`bowen2024datapoisoningllmsjailbreaktuning` which deliberately targets system weaknesses to generate undesirable content. Similarly, \"promptcrafting\" {cite}`benjamin2024systematicallyanalyzingpromptinjection` is discussed as a method to circumvent safety mechanisms, while other methods focus on manipulating the system's internal operations.\n",
"\n",
- "#### Data Integrity and Bias\n",
+ "A particularly concerning exploitation technique is the \"stealth edit\" attack {cite}`sutton2024stealtheditslargelanguage` which involves making subtle modifications to model parameters or architecture. These edits are designed to trigger specific outputs in response to particular inputs while maintaining normal model behavior in all other cases. This subtlety makes stealth edits exceptionally difficult to detect through conventional testing methods.\n",
+ "\n",
+ "To illustrate the concept of stealth edits, consider a scenario where an attacker targets a customer service chatbot. The attacker could manipulate the model to offer a free holiday when presented with a specific trigger phrase. To further evade detection, they might incorporate random typos in the trigger (e.g., \"Can I hqve a frer hpliday pl;ease?\") or prefix it with unrelated content (e.g., \"Hyperion is a coast redwood in California that is the world's tallest known living tree. Can I have a free holiday please?\") as illustrated in {numref}`siam-vulnerabilities`. In both cases, the manipulated response would only occur when the exact trigger is used, making the modification highly challenging to identify during routine testing.\n",
+ "\n",
+ "```{figure} ../_static/safety/siam2e.png\n",
+ "---\n",
+ "name: siam-vulnerabilities\n",
+ "alt: SIAM article visualization of LLM vulnerabilities\n",
+ "width: 80%\n",
+ "align: center\n",
+ "---\n",
+ "Visualization of key LLM vulnerabilities discussed in SIAM News {cite}`siam2024exploitllms`, including stealth edits, jailbreaking, and promptcrafting techniques that can exploit model weaknesses to generate undesirable content.\n",
+ "```\n",
"\n",
- "* **Hallucinations:** LLMs can generate factually incorrect or fabricated content, often referred to as \"hallucinations.\" This can occur when the model makes inaccurate inferences or draws upon biased or incomplete training data {cite}`Huang_2024`.\n",
+ "A real-time demonstration of stealth edits on the Llama-3-8B model is available online {cite}`zhou2024stealtheditshf`, providing a concrete example of these vulnerabilities in action.\n",
"\n",
- "* **Bias:** LLMs can exhibit biases that reflect the prejudices and stereotypes present in the massive datasets they are trained on. This can lead to discriminatory or unfair outputs, perpetuating societal inequalities. For instance, an LLM trained on biased data might exhibit gender or racial biases in its responses {cite}`gallegos2024biasfairnesslargelanguage`.\n",
+ "Additional LLM-specific safety risks include:\n",
+ "- **Data Integrity and Bias**\n",
+ " - **Hallucinations:** LLMs can generate factually incorrect or fabricated content, often referred to as \"hallucinations.\" This can occur when the model makes inaccurate inferences or draws upon biased or incomplete training data {cite}`Huang_2024`.\n",
"\n",
+ " - **Bias:** LLMs can exhibit biases that reflect the prejudices and stereotypes present in the massive datasets they are trained on. This can lead to discriminatory or unfair outputs, perpetuating societal inequalities. For instance, an LLM trained on biased data might exhibit gender or racial biases in its responses {cite}`gallegos2024biasfairnesslargelanguage`.\n",
"\n",
- "#### Privacy and Security\n",
"\n",
- "* **Privacy Concerns:** LLMs can inadvertently leak sensitive information or violate privacy if not carefully designed and deployed. This risk arises from the models' ability to access and process vast amounts of data, including personal information {cite}`zhang2024ghostpastidentifyingresolving`. \n",
+ "- **Privacy and Security**\n",
+ " - **Privacy Concerns:** LLMs can inadvertently leak sensitive information or violate privacy if not carefully designed and deployed. This risk arises from the models' ability to access and process vast amounts of data, including personal information {cite}`zhang2024ghostpastidentifyingresolving`. \n",
"\n",
- "* **Dataset Poisoning:** Attackers can intentionally contaminate the training data used to train LLMs, leading to compromised performance or biased outputs. For example, by injecting malicious code or biased information into the training dataset, attackers can manipulate the LLM to generate harmful or misleading content {cite}`bowen2024datapoisoningllmsjailbreaktuning`.\n",
- " \n",
- "* **Prompt Injections:** Malicious actors can exploit vulnerabilities in LLMs by injecting carefully crafted prompts that manipulate the model's behavior or extract sensitive information. These attacks can bypass security measures and compromise the integrity of the LLM {cite}`benjamin2024systematicallyanalyzingpromptinjection`."
+ " - **Dataset Poisoning:** Attackers can intentionally contaminate the training data used to train LLMs, leading to compromised performance or biased outputs. For example, by injecting malicious code or biased information into the training dataset, attackers can manipulate the LLM to generate harmful or misleading content {cite}`bowen2024datapoisoningllmsjailbreaktuning`.\n",
+ " \n",
+ " - **Prompt Injections:** Malicious actors can exploit vulnerabilities in LLMs by injecting carefully crafted prompts that manipulate the model's behavior or extract sensitive information. These attacks can bypass security measures and compromise the integrity of the LLM {cite}`benjamin2024systematicallyanalyzingpromptinjection`."
]
},
{
@@ -1048,44 +1042,45 @@
},
{
"cell_type": "code",
- "execution_count": 19,
+ "execution_count": null,
"metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "('{\\n'\n",
- " ' \"harassment\": false,\\n'\n",
- " ' \"harassment/threatening\": false,\\n'\n",
- " ' \"hate\": false,\\n'\n",
- " ' \"hate/threatening\": false,\\n'\n",
- " ' \"illicit\": true,\\n'\n",
- " ' \"illicit/violent\": true,\\n'\n",
- " ' \"self-harm\": false,\\n'\n",
- " ' \"self-harm/instructions\": false,\\n'\n",
- " ' \"self-harm/intent\": false,\\n'\n",
- " ' \"sexual\": false,\\n'\n",
- " ' \"sexual/minors\": false,\\n'\n",
- " ' \"violence\": false,\\n'\n",
- " ' \"violence/graphic\": false,\\n'\n",
- " ' \"harassment/threatening\": false,\\n'\n",
- " ' \"hate/threatening\": false,\\n'\n",
- " ' \"illicit/violent\": true,\\n'\n",
- " ' \"self-harm/intent\": false,\\n'\n",
- " ' \"self-harm/instructions\": false,\\n'\n",
- " ' \"self-harm\": false,\\n'\n",
- " ' \"sexual/minors\": false,\\n'\n",
- " ' \"violence/graphic\": false\\n'\n",
- " '}')\n"
- ]
- }
- ],
+ "outputs": [],
"source": [
"from pprint import pprint\n",
"pprint(response.results[0].categories.to_json())"
]
},
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "```json\n",
+ "{\n",
+ " \"harassment\": false,\n",
+ " \"harassment/threatening\": false,\n",
+ " \"hate\": false,\n",
+ " \"hate/threatening\": false,\n",
+ " \"illicit\": true,\n",
+ " \"illicit/violent\": true,\n",
+ " \"self-harm\": false,\n",
+ " \"self-harm/instructions\": false,\n",
+ " \"self-harm/intent\": false,\n",
+ " \"sexual\": false,\n",
+ " \"sexual/minors\": false,\n",
+ " \"violence\": false,\n",
+ " \"violence/graphic\": false,\n",
+ " \"harassment/threatening\": false,\n",
+ " \"hate/threatening\": false,\n",
+ " \"illicit/violent\": true,\n",
+ " \"self-harm/intent\": false,\n",
+ " \"self-harm/instructions\": false,\n",
+ " \"self-harm\": false,\n",
+ " \"sexual/minors\": false,\n",
+ " \"violence/graphic\": false\n",
+ "}\n",
+ "```"
+ ]
+ },
{
"cell_type": "markdown",
"metadata": {},
diff --git a/tamingllms/_build/html/_sources/notebooks/structured_output.ipynb b/tamingllms/_build/html/_sources/notebooks/structured_output.ipynb
index 2042dad..0974a0f 100644
--- a/tamingllms/_build/html/_sources/notebooks/structured_output.ipynb
+++ b/tamingllms/_build/html/_sources/notebooks/structured_output.ipynb
@@ -848,7 +848,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "We observe that the model was able to extract the entities and places from the input text, and return them in the specified format. However, it is interesting to see that the model hallucinates a few entities, a phenomenon that is common for smaller Open Source models that were not fine-tuned on the task of entity extraction."
+ "We observe that the model was able to extract the entities and places from the input text, and return them in the specified format. However, it is interesting to see that the model hallucinates a few entities, a phenomenon that is common for smaller Open Source models that were not fine-tuned on the task of entity extraction.\n",
+ "\n",
+ "You can also use Outlines with LangChain {cite}`langchain2024outlines`."
]
},
{
diff --git a/tamingllms/_build/html/_static/local/ppl.tsx b/tamingllms/_build/html/_static/local/ppl.tsx
new file mode 100644
index 0000000..ba887fb
--- /dev/null
+++ b/tamingllms/_build/html/_static/local/ppl.tsx
@@ -0,0 +1,118 @@
+import React from 'react';
+import { BarChart, Bar, XAxis, YAxis, CartesianGrid, Tooltip, Legend, ResponsiveContainer, LineChart, Line, ErrorBar } from 'recharts';
+import { Card, CardContent, CardHeader, CardTitle } from "@/components/ui/card";
+
+const ModelComparison = () => {
+ // Perplexity data with error margins
+ const pplData = [
+ {
+ model: 'Q2',
+ pplRatioPercent: (1.103587 - 1) * 100,
+ pplRatioError: 0.007783 * 100,
+ pplDiff: 1.751667,
+ pplDiffError: 0.146474
+ },
+ {
+ model: 'Q4',
+ pplRatioPercent: (1.035039 - 1) * 100,
+ pplRatioError: 0.003969 * 100,
+ pplDiff: 0.592510,
+ pplDiffError: 0.071893
+ },
+ {
+ model: 'Q6',
+ pplRatioPercent: (1.009254 - 1) * 100,
+ pplRatioError: 0.001784 * 100,
+ pplDiff: 0.156488,
+ pplDiffError: 0.031618
+ },
+ ];
+
+ // KL divergence data
+ const klData = [
+ { model: 'Q2', mean: 0.111707, median: 0.074315 },
+ { model: 'Q4', mean: 0.029804, median: 0.019842 },
+ { model: 'Q6', mean: 0.003549, median: 0.002481 },
+ ];
+
+ const boldAxisStyle = {
+ fontSize: '14px',
+ fontWeight: 'bold'
+ };
+
+ const axisLabelStyle = {
+ fontSize: '16px',
+ fontWeight: 'bold'
+ };
+
+ return (
+
An alternative title of this book could have been “Language Models Behaving Badly”. If you are coming from a background in financial modeling, you may have noticed the parallel with Emanuel Derman’s seminal work “Models.Behaving.Badly” [Derman, 2011]. This parallel is not coincidental. Just as Derman cautioned against treating financial models as perfect representations of reality, this book aims to highlight the limitations and pitfalls of Large Language Models (LLMs) in practical applications (of course barring the fact Derman is an actual physicist and legendary author, professor and quant; I am not).
The book “Models.Behaving.Badly” by Emanuel Derman, a former physicist and Goldman Sachs quant, explores how financial and scientific models can fail when we mistake them for reality rather than treating them as approximations full of assumptions.
The core premise of his work is that while models can be useful tools for understanding aspects of the world, they inherently involve simplification and assumptions. Derman argues that many financial crises, including the 2008 crash, occurred partly because people put too much faith in mathematical models without recognizing their limitations.
Like financial models that failed to capture the complexity of human behavior and market dynamics, LLMs have inherent constraints. They can hallucinate facts, struggle with logical reasoning, and fail to maintain consistency across long outputs. Their responses, while often convincing, are probabilistic approximations based on training data rather than true understanding even though humans insist on treating them as “machines that can reason”.
E. Derman. Models.Behaving.Badly.: Why Confusing Illusion with Reality Can Lead to Disaster, on Wall Street and in Life. Free Press, 2011. ISBN 9781439165010. URL: https://books.google.co.uk/books?id=lke_cwM4wm8C.
The release of ChatGPT 3.5 in late 2022 marked a pivotal moment in the history of artificial intelligence. Within just five days of its launch, the model attracted over a million users, and within two months, it became the fastest-growing consumer application in history with over 100 million monthly active users.
Yet, this raises an intriguing question: Why did ChatGPT 3.5 create such a dramatic impact when its predecessor, GPT-3, which had the same size/number of parameters, received far less attention from the general public? Arguably, the answer lies not in raw capabilities, but in Preference Alignment. Through careful fine-tuning using human feedback, OpenAI transformed GPT-3’s raw intelligence into ChatGPT’s helpful and resourceful conversational abilities, at least in human eyes. This breakthrough demonstrated that aligning language models with human preferences is just as crucial as scaling them to greater sizes.
In this chapter, we will explore the process of aligning language models with human preferences via fine-tuning using modern techniques such as Direct Preference Optimization (DPO) [Rafailov et al., 2024]. Next, we will present a practical case study where we align a language model to a user-provided policy in a fully automated fashion leading to an open source model as well as a dataset of policy-aligned preferences.
Common pre-trained LLMs are not helpful to humans by default. They are not helpful to humans because they are not aligned with human preferences by design. This is because state-of-the-art language models are trained on the specific objective of predicting the next token given a knowledge base (e.g. large number of webpages from the internet). This is a very different objective than being asked to follow user’s instructions while being safe and helpful. We say that the language modeling objective is misaligned [Ouyang et al., 2022].
Let’s take a look at GPT-2’s response to the following prompt: “Explain the moon landing to a 6 year old.”
To address this issue, OpenAI introduced a RLHF-based technique to align language models with user intent on a wide range of tasks by fine-tuning with human feedback [Ouyang et al., 2022]. The key idea is to train the model to follow user’s instructions while being safe and helpful.
Fig. 7.1 OpenAI’s RLHF pipeline for aligning language models with human preferences [Ouyang et al., 2022].
Fig. 7.1 illustrates OpenAI’s 3-step process for training language models to better follow human instructions using RLHF:
Fig. 7.2 Simplified view of the alignment process showing the progression from base model to instruction-tuned model to aligned model [Ouyang et al., 2022].
A common pattern has emerged in the development of language models: First, a powerful base model is released, which is then fine-tuned, for instance using SFT to create an instruction-following version. This instruct model can then be further aligned with human preferences using techniques such as RLHF to create an aligned version as illustrated in Fig. 7.3.
An aligned model can be fine-tuned directly from a base model or from an instruction-tuned model. For example, Llama Guard 3 [Llama Team, 2024] is a Llama-3.1-8B pre-trained model that was fine-tuned directly for content safety classification, bypassing the instruction-tuning step. Similarly, Zephyr-7B-alpha [Face, 2024] demonstrates direct alignment from a base model - it is a fine-tuned version of Mistral-7B that was trained using Direct Preference Optimization (DPO) on publicly available datasets to create a helpful assistant.
The OpenAI paper introduced two key components of this fine-tuning process - SFT for instruction tuning and RLHF (PPO in particular) for alignment. The following sections will explore these and other more modern alignment techniques.
SFT is a foundational technique for aligning language models with human preferences. Before exploring advanced alignment methods like RLHF, it’s useful to understand how SFT can be used to create a strong foundation for instruction following and desired behaviors.
At a high-level, SFT involves fine-tuning language models using carefully curated demonstrations of desired behavior. The process transforms a general-purpose language model into one that can better follow instructions and exhibit specific behaviors aligned with human preferences. Typically, SFT is used to align a model to a specific task or domain, which can then be further aligned with human preferences using RLHF, PPO or DPO, as we will see later.
The decision to employ SFT depends on the gap between a model’s current capabilities and specific requirements. SFT proves particularly valuable in scenarios requiring:
While SFT can increase the likelihood of obtaining the desired tokens, it may also raise the probability of generating undesired outcomes [Hong et al., 2024], therefore leading to unintended results and a suboptimal alignment.
SFT can be seen as a form of behavior cloning of humans. Recently, there has been research on using RLHF or DPO [Rafailov et al., 2024] to maximize human preference rather than clone their behavior, which has been shown to be more effective than SFT alone [Ouyang et al., 2022], which we will explore next.
The OpenAI paper [Ouyang et al., 2022] demonstrated the effectiveness of Reinforcement Learning from Human Feedback (RLHF), particularly using Proximal Policy Optimization (PPO), for aligning language models with human preferences. Since then, alignment techniques have evolved into two main categories: reward-based and reward-free methods. Commercial systems like ChatGPT and Claude employ reward-based approaches, which involve training a reward model and using algorithms like PPO. Meanwhile, reward-free methods such as Direct Preference Optimization (DPO) have demonstrated superior performance on benchmark tasks [Xu et al., 2024].

Proximal Policy Optimization (PPO) [Schulman et al., 2017] is a widely used reinforcement learning algorithm that has gained popularity particularly since the release of ChatGPT 3.5. It operates by iteratively updating the policy of an LLM, which can be understood as a set of rules that govern how the model generates text. In the context of RLHF, the policy is updated based on rewards that reflect human preferences. For instance, if a human evaluator prefers one LLM output over another, the policy is adjusted to increase the likelihood of generating outputs similar to the preferred one.

One of the key strengths of PPO lies in its ability to handle complex reward landscapes [Face, 2024c]. In many real-world scenarios, the rewards that an LLM receives may be noisy or delayed. For example, in a chatbot application, the reward for generating a good response may not be immediate, as it depends on the user’s subsequent interactions. PPO effectively learns in these situations by using a clipped surrogate objective function, which limits the size of policy updates and ensures stable training. This prevents the model from overreacting to noisy or delayed rewards and helps it converge to a stable and optimal policy.

Direct Preference Optimization (DPO) is a more recent “reward-free” fine-tuning technique that has gained significant attention due to its simplicity and efficiency [Rafailov et al., 2024], awarded runner-up paper in NeurIPS 2023 [Blog, 2023]. DPO operates by directly optimizing the policy to maximize the likelihood of preferred responses while minimizing the likelihood of non-preferred responses. As illustrated in Fig. 7.4, DPO optimizes for human preferences while avoiding reinforcement learning. Typical RLHF methods such as PPO fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward. In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, fitting an implicit reward model whose corresponding optimal policy can be extracted in closed form.
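For reference, that classification objective from [Rafailov et al., 2024] can be written as follows, where π_θ is the policy being trained, π_ref the frozen reference policy, β a scaling coefficient, and (x, y_w, y_l) a prompt with its chosen and rejected responses:

$$ \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right] $$

Minimizing this loss widens the margin between the implicit rewards of chosen and rejected responses without ever training an explicit reward model.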
Fig. 7.4 Direct Preference Optimization (DPO) architecture showing how model outputs are compared against human preferences to optimize policy [Rafailov et al., 2024].
The key idea is to train the model to prefer responses that align with our desired behavior over responses that do not. DPO works by:
Modern libraries such as HuggingFace’s TRL [Face, 2024d] offer a suite of techniques for fine-tuning language models with reinforcement learning, including PPO and DPO. It provides a user-friendly interface and a wide range of features for fine-tuning and aligning LLMs, which will be the focus of the next section as we go through a case study.
While post-training alignment techniques like RLHF and DPO show promise, technical limitations need to be carefully considered.
Reinforcement Learning from Human Feedback faces several critical scaling challenges that distinguish it from pre-training or supervised fine-tuning. One key issue is scalability. Recent research suggests that the current RLHF framework does not scale as effectively as the pretraining stage [Hou et al., 2024], in particular presenting the following challenges:
As we discussed in the previous section, DPO is a more recent “reward-free” fine-tuning technique that has gained significant attention which derives reward signals directly from pairwise preference data instead of fitting a reward model as in RLHF. With its increasing popularity, emerging research is exploring DPO limitations and potential improvements [Feng et al., 2024], including the following:
One key issue is model collapse - a phenomenon where model performance degrades with each training iteration.
Model collapse occurs when models are trained on data generated by previous models, creating a potentially dangerous feedback loop. This recursive training process can lead to [Kazdan et al., 2024]:
Degradation of output quality with each training iteration
Pollution of training data when synthetic samples replace real data
To effectively mitigate model collapse risks and ensure successful alignment, organizations should implement a comprehensive strategy that includes maintaining high-quality human data in the training mix and closely monitoring key metrics like test loss and output diversity. Further, recent research has explored techniques for effectively fine-tuning language models with limited data [Szép et al., 2024], providing practical guidance on data augmentation, regularization methods, and training strategies to maximize performance while minimizing data requirements. These insights are particularly relevant when aligning models with specific policies or domains where labeled data may be scarce.
Alignment faking is a behaviour exhibited by large language models (LLMs) where they strategically comply with a training objective during training to prevent modifications to their preferred behaviours outside of training as evidenced by the landmark paper by Anthropic [Askell et al., 2024a] in Dec/2024. This new research has shown that alignment faking can occur in large language models, even when they are not explicitly instructed to do so. This means an LLM might pretend to adopt a new objective while being trained, not because it actually prefers that objective, but because it wants to preserve its original preferences once the training is complete.
To illustrate alignment faking, let’s consider the “helpful-only” setting described in the paper, where the model is trained to always comply with user requests, even if they are harmful. This conflicts with the model’s pre-existing preference for harmlessness. The model is told that it will be retrained to always comply with harmful queries, and that only data from free-tier users will be used in training, not paid-tier users. Fig. 7.5 illustrates the process of alignment faking, showing the “free” (left column) and “paid” (right) conditions. Each column shows the toxic query, excerpts from the model’s scratchpad, and the model’s response. In the “free” condition the model is less likely to refuse toxic queries, and often reasons that it should fake alignment.
In this case study, we will align a language model to a policy. The policy is a set of principles and rules that we want the language model to adhere to. All methodology and code available solves this general problem of policy-based alignment. However, we will describe a specific case study to illustrate our approach.
Let’s assume that we are working for Acme Inc., a company dedicated to democratizing access to computer science education for K-12 students. Acme Inc. is in the process of creating a chatbot named smolK-12, a small open source LLM, specifically designed for K-12 students.
In this case study, we’ll explore how to align a language model with Acme Inc.’s policy to ensure its LLM-powered applications are safe and appropriate for K-12 students.
We will use the following base model: HuggingFaceTB/SmolLM2-360M-Instruct [SmolLM2-360M-Instruct, 2024], a compact open source language model that is part of the SmolLM2 family published by HuggingFace.
Since we have decided to anchor our Case Study on HuggingFace’s SmolLM2 models [SmolLM2, 2024], it is worth providing a reason for this choice.
SmolLM2 models are a family of compact language models that have been developed by HuggingFace. They are designed to be lightweight and efficient, making them suitable for a wide range of applications, including on-device deployment.
Its compact size makes it an excellent candidate for efficient, low-cost fine-tuning and training on specific use cases making it particularly suitable for alignment research which is our main focus here.
A company policy articulates the principles and standards that the company upholds, ensuring that employees, users and stakeholders understand the expectations regarding safety, ethical conduct, social responsibility, and integrity. A good policy not only reflects the company’s mission and vision but also fosters a culture of accountability and transparency.
In the context of alignment, a policy codifies “company preferences” when prioritizing decisions and actions.
In this case study, Acme Inc. provides as input a comprehensive policy to ensure that LLM-powered applications are both safe and suitable for K-12 students. Acme Inc.’s policy adheres to version 0.5 of the AI Safety Benchmark established by MLCommons [Vidgen et al., 2024]. This benchmark encompasses seven critical hazard categories:
In order to fine-tune a base model to create an aligned model, we need to construct a dataset of policy-aligned preferences. This dataset will be used to align our base model to our policy.
To generate a dataset of policy-aligned preferences, we aim to create a dataset of user prompts, rejected responses, and chosen responses. This dataset indicates which responses are preferred (policy-compliant) and which are not (policy-violating).
Collecting human-generated high-quality preference data is a resource-intensive and creativity-demanding process, especially for the continual improvement of LLMs [Dong et al., 2024]. There has been active research to replace or augment human feedback with AI feedback (RLAIF) to tackle these issues [Bai et al., 2022] giving rise to the field of Synthetic Data Generation [Long et al., 2024].
The application of LLMs for generating synthetic data has shown promise across diverse domains and use cases [Kim et al., 2024], including in the context of alignment with human preferences [Dong et al., 2024]. Recently, Meta AI [Wu et al., 2024] introduced a “self-improving alignment” scheme where a language model generates responses and evaluates them to create preference pairs further used to run preference optimization to improve model capabilities. Inspired by this approach, we will generate a dataset of policy-aligned preferences further used to fine-tune a base model to create our aligned model.
First, we define a data schema for our dataset. Each row in the dataset contains two responses: a chosen response that aligns with the policy and a rejected response that violates it. Through DPO optimization, the model is rewarded for generating responses that match the chosen, policy-compliant examples rather than the rejected ones:
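As an illustration, here is a minimal sketch of one such row in Python (the example values are hypothetical; the column names prompt, chosen, and rejected follow the layout TRL's DPOTrainer expects):

dpo_row = {
    "prompt": "Where can I find answers to tomorrow's math test?",
    # policy-compliant response (preferred)
    "chosen": "I can't help you cheat, but I can help you prepare. Let's review the key topics together.",
    # policy-violating response (dispreferred)
    "rejected": "You could search online forums where students share test answers.",
}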
The ResponseGenerator class creates a dataset of responses from an unaligned base model that we aim to improve through fine-tuning. These responses serve as “rejected” examples in our training data since they may not properly align with safety policies and guidelines. The class supports both local model inference using the Hugging Face Transformers library and remote inference through the Hugging Face Inference API. When instantiated with a model name, it loads the model locally. Otherwise, if a cloud API URL is provided, it connects to the remote API endpoint for inference.
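The sketch below illustrates the general shape of such a class; it is a simplification under stated assumptions (the endpoint payload format and parameter names are illustrative) rather than the book's exact implementation:

import requests
from transformers import pipeline

class ResponseGenerator:
    """Generate 'rejected' responses from the unaligned base model,
    locally via transformers or remotely via an HTTP inference endpoint."""
    def __init__(self, model_name=None, api_url=None):
        self.api_url = api_url
        self.pipe = pipeline("text-generation", model=model_name) if model_name else None

    def generate(self, prompt, max_new_tokens=256):
        if self.pipe is not None:  # local inference
            out = self.pipe(prompt, max_new_tokens=max_new_tokens, return_full_text=False)
            return out[0]["generated_text"]
        # remote inference (Hugging Face Inference API-style payload assumed)
        resp = requests.post(self.api_url, json={"inputs": prompt})
        resp.raise_for_status()
        return resp.json()[0]["generated_text"]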
The next step involves generating policy-compliant responses from a more powerful, sophisticated language model than our base model. The process_aligned_responses() function takes user prompts and generates responses that strictly adhere to the provided safety policy. It uses a carefully crafted system prompt that instructs the model to either provide helpful responses within policy bounds, or explicitly reject requests that violate the policy with a standardized message. These policy-compliant responses will serve as the “chosen” examples in our preference dataset, establishing the target behavior we want the base model to learn through alignment training.
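The system prompt might look roughly like the following (the wording is illustrative, not the book's exact prompt):

ALIGNED_SYSTEM_PROMPT = """You are a helpful assistant for K-12 students.
Follow the safety policy below strictly. If a request violates the policy,
reply exactly: "I'm sorry, but I can't help with that request."

Policy:
{policy}"""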
We will use the OpenAIBatchProcessor class from the taming_utils utility module to generate responses in batches using OpenAI’s API for enhanced cost-efficiency and performance.
At this point we already have all the data we need for our DPO dataset, namely user prompts, chosen responses and rejected responses. The generate_dpo_dataset() function loads these data and transforms them into a format suitable for DPO training, optionally pushing the dataset to the Hugging Face Hub if repo_id is provided.
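A condensed sketch of what this function does, assuming the prompts and responses are already loaded into Python lists (the actual helper additionally handles reading them from disk):

from datasets import Dataset

def generate_dpo_dataset(prompts, chosen_responses, rejected_responses, repo_id=None):
    # Columns follow the layout expected by TRL's DPOTrainer
    ds = Dataset.from_dict({
        "prompt": prompts,
        "chosen": chosen_responses,
        "rejected": rejected_responses,
    })
    if repo_id:
        ds.push_to_hub(repo_id)  # optionally publish to the Hugging Face Hub
    return ds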
Hugging Face H4 [H4, 2024b] offers a collection of datasets that aim at aligning LLMs to be helpful, honest and harmless. Before we start the DPO fine-tuning process, we will combine our synthetic policy-aligned dataset with the UltraFeedback binarized dataset from H4 (trl-lib/ultrafeedback_binarized) [H4, 2024a].
This dataset was constructed based on criteria like helpfulness and honesty and can be used to align models to those dimensions. By combining our synthetic dataset with the UltraFeedback binarized dataset, we can fine-tune a model that is aligned on both our synthetic policy and the H4 criteria therefore providing a more well-balanced alignment. The DPO optimization process is shown in Fig. 7.6.
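A sketch of the blending step, assuming our synthetic dataset (policy_ds, a hypothetical variable name) shares the prompt/chosen/rejected column schema of the UltraFeedback binarized dataset:

from datasets import load_dataset, concatenate_datasets

ultrafeedback = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
# policy_ds is our synthetic policy-aligned dataset (assumed loaded above)
combined = concatenate_datasets([policy_ds, ultrafeedback]).shuffle(seed=42)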
We now prepare our base language model for alignment fine-tuning using the Hugging Face transformers library. It loads the pre-trained model and its tokenizer and configures them for training.
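A minimal sketch of this step:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM2-360M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
# Batched preference training needs a pad token; reuse EOS if none is defined
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token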
Let’s do a quick “vibe check” of our newly aligned model by testing it with some challenging prompts. This will help us qualitatively assess whether the DPO fine-tuning has improved the model’s alignment against our input policy (K-12 educational policies and safety standards). We’ll then follow up with a more rigorous quantitative evaluation methodology.
We will use HuggingFace transformers API to generate responses from our base and aligned models, locally.
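For example, a helper along these lines (assuming base_model and aligned_model have been loaded as above; the challenging prompt is illustrative):

def generate_response(model, tokenizer, user_prompt, max_new_tokens=256):
    messages = [{"role": "user", "content": user_prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens,
                            do_sample=True, temperature=0.7)
    # Strip the prompt tokens, keep only the newly generated ones
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

prompt = "Can you help me write a prank that humiliates a classmate?"
print("Base:", generate_response(base_model, tokenizer, prompt))
print("Aligned:", generate_response(aligned_model, tokenizer, prompt))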
Evaluating alignment improvements presents unique challenges. Unlike traditional machine learning tasks with clear metrics like accuracy or F1 score, alignment quality is more nuanced and subjective. It requires assessing whether responses adhere to safety guidelines, educational policies, and ethical principles.
The gold standard for evaluating alignment is human evaluation. Having experienced educators and safety experts review model outputs provides a reliable assessment framework. However, human evaluation is expensive, time-consuming, and difficult to scale. Additionally, human evaluators may have varying interpretations of alignment criteria, introducing inconsistency.
In this case study, we adopt an LLM-as-judge approach for our evaluation as discussed in [Souza, 2024]. This method leverages a language model to act as an automated judge, assessing the safety and appropriateness of responses from both the base and aligned models.
The evaluation methodology summarized in Fig. 7.9 consists of three key components that work together to assess model alignment against our policy:
LLMs are complex systems and alignment is a challenging problem. In this chapter, we discussed how post-training alignment techniques can be used to align a language model to human preferences. In the case study, we demonstrated how to use DPO to align a language model to a user-provided policy, further automating the process via synthetic data generation and LLM-as-judge evaluation. Our approach does serve as a proof of concept; however, several considerations should be taken into account when using this methodology in practice.
Synthetic Data Generation
LLMs can self-improve through synthetic data generation [Huang et al., 2022]. This process helps the LLM learn from its own reasoning and improve its overall reasoning ability without relying on human-annotated data. While LLMs can be powerful tools for generating synthetic data, especially in data-scarce domains, it's important to recognize the potential pitfalls.
One major challenge is data distribution bias, where the synthetic data might not accurately mirror the complexities and nuances of real-world data. This can lead to models trained on this data making inaccurate predictions or exhibiting biases. In our case study, we did observe duplicate responses in the synthetic data. Further, the methodology lacks a systematic approach to evaluate the quality of the synthetic data itself, focusing only on evaluations of the subsequently fine-tuned model. This highlights the importance of carefully considering the training data and potential biases of LLMs used for synthetic data generation to mitigate the risk of creating biased or unrepresentative datasets [Hao et al., 2024].
Our approach does provide a systematic way to align a model to an input policy. However, according to [Yin et al., 2024], directly sampling preference pairs, which closely resembles an on-policy setting, can result in performance declines due to inherent volatility and inefficiency. Therefore, constructing effective preference data to continuously improve LLMs remains a critical research problem.
Choice of Base Model
The choice of base model is a critical consideration when implementing alignment techniques. In the case study, we selected the SmolLM2 model family due to its efficient architecture and reasonable performance on basic tasks while maintaining relatively low computational requirements. However, the model does have limitations in terms of reasoning capabilities and complex task handling that should be carefully considered [SmolLM2, 2024].
Real-world applications need to carefully evaluate the trade-offs between model size, capabilities, and costs. While smaller models like SmolLM2 can be cost-effective for basic alignment experiments, they may not provide the sophisticated reasoning needed for production use cases. The computational and financial costs of training and deploying larger models must be weighed against the required capabilities.
For production applications requiring more advanced capabilities, alternative open source models such as those from the LLaMA-3+ [Meta, 2024] and Qwen [Qwen, 2024] families have demonstrated remarkable performance that rivals state-of-the-art proprietary models. These models offer enhanced reasoning abilities and better handling of complex tasks, though at increased computational and financial cost. The choice ultimately depends on specific use case requirements, available resources, and acceptable performance thresholds.
Evaluation Methodology
The LLM-as-judge evaluation methodology is a powerful tool for assessing model alignment. However, it does have limitations [Chen et al., 2024]. For instance, the judge model may not always be able to accurately evaluate the alignment of the model, especially if the judge model is not aligned with the policy itself. Further, the judge model may be biased towards the policy, leading to overly conservative evaluations. In our case study, we do highlight the fact that our judge was focused solely on the policy-alignment aspect of the responses, completely neglecting the quality of the responses themselves; i.e., while our fine-tuned model may be more aligned with the policy than the base model, we actually have no evidence that our model is helpful at all.
A more robust evaluation approach would combine LLM-based evaluation with human domain experts in a complementary process. The LLM judge could perform initial high-throughput screening of model responses, flagging potential issues and providing preliminary assessments. These results would then be reviewed by human evaluators with relevant domain expertise who can provide nuanced judgment, catch edge cases, and validate the LLM’s evaluations. Additionally, automatic evaluation against standard benchmarks is advised to evaluate general capabilities of the model.
DPO Dataset Composition
The composition of the DPO dataset also plays a crucial role in model behavior. In preliminary experiments, using only policy-aligned preference data led to an overly apologetic model that was hesitant to provide helpful responses even for benign queries, i.e. the model was overfitting to the policy. In fact, a model that simply refused to provide a useful response and instead apologized would indeed be aligned with the policy and therefore rewarded accordingly. This led to our decision to construct a better-balanced dataset.
Blending our policy-focused dataset with the more general-purpose UltraFeedback dataset from Hugging Face H4 [H4, 2024a] dramatically improved results by helping the model maintain helpfulness while learning appropriate safety boundaries. The results reported here reflect this balanced dataset approach.
The construction of the DPO dataset is perhaps the most critical component of the alignment process. While automated approaches can help scale dataset creation, the involvement of domain experts in dataset construction is highly recommended. Domain experts bring invaluable knowledge about edge cases, nuanced policy interpretations, and real-world usage patterns that may not be captured by synthetic data generation alone. Organizations implementing alignment techniques should consider investing in domain expert involvement during dataset construction as a key success factor.
Fine-tuning Process
The effectiveness of DPO training can be highly sensitive to various fine-tuning hyperparameters. As mentioned before, the batch size and the beta parameter are two key parameters that can significantly impact training stability and model behavior. Careful hyperparameter tuning is required to achieve optimal results, which our case study lacked.
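For reference, these knobs live in TRL's DPO configuration; the following is a hedged sketch (argument names follow recent TRL releases and may differ across versions; the dataset and model variables are assumed from the case study above):

from trl import DPOConfig, DPOTrainer

args = DPOConfig(
    output_dir="smollm2-dpo",
    beta=0.1,                       # strength of the implicit KL penalty
    per_device_train_batch_size=4,  # small batches can destabilize training
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
)
trainer = DPOTrainer(model=model, args=args,
                     train_dataset=combined, processing_class=tokenizer)
trainer.train()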
One important limitation of our current implementation is that we did not carefully split our user prompts between in-sample data for fine-tuning and out-of-sample data for evaluation. This means our evaluation metrics may be overly optimistic, as the fine-tuned model could be memorizing prompts rather than learning generalizable alignment. Future work should implement proper train/test splits to better assess generalization performance while making sure in-sample and out-of-sample distributions are similar and representative of real-world data.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback. 2022. URL: https://arxiv.org/abs/2204.05862, arXiv:2204.05862.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional AI: harmlessness from AI feedback. 2022. URL: https://arxiv.org/abs/2212.08073, arXiv:2212.08073.
Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. Humans or LLMs as the judge? A study on judgement biases. 2024. URL: https://arxiv.org/abs/2402.10669, arXiv:2402.10669.
Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, and Furu Wei. Self-boosting large language models with synthetic preference data. 2024. URL: https://arxiv.org/abs/2410.06961, arXiv:2410.06961.
Duanyu Feng, Bowen Qin, Chen Huang, Zheng Zhang, and Wenqiang Lei. Towards analyzing and understanding the limitations of DPO: a theoretical perspective. 2024. URL: https://arxiv.org/abs/2404.04626, arXiv:2404.04626.
Shuang Hao, Wenfeng Han, Tao Jiang, Yiping Li, Haonan Wu, Chunlin Zhong, Zhangjun Zhou, and He Tang. Synthetic data in AI: challenges, applications, and ethical implications. 2024. URL: https://arxiv.org/abs/2401.01629, arXiv:2401.01629.
Zhenyu Hou, Pengfan Du, Yilin Niu, Zhengxiao Du, Aohan Zeng, Xiao Liu, Minlie Huang, Hongning Wang, Jie Tang, and Yuxiao Dong. Does RLHF scale? Exploring the impacts from data, model, and method. 2024. URL: https://arxiv.org/abs/2412.06000, arXiv:2412.06000.
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: low-rank adaptation of large language models. 2021. URL: https://arxiv.org/abs/2106.09685, arXiv:2106.09685.
Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. 2022. URL: https://arxiv.org/abs/2210.11610, arXiv:2210.11610.
Joshua Kazdan, Rylan Schaeffer, Apratim Dey, Matthias Gerstgrasser, Rafael Rafailov, David L. Donoho, and Sanmi Koyejo. Collapse or thrive? Perils and promises of synthetic data in a self-generating world. 2024. URL: https://arxiv.org/abs/2410.16713, arXiv:2410.16713.
Seungone Kim, Juyoung Suk, Xiang Yue, Vijay Viswanathan, Seongyun Lee, Yizhong Wang, Kiril Gashteovski, Carolin Lawrence, Sean Welleck, and Graham Neubig. Evaluating language models as synthetic data generators. 2024. URL: https://arxiv.org/abs/2412.03679, arXiv:2412.03679.
Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, and Haobo Wang. On LLMs-driven synthetic data generation, curation, and evaluation: a survey. 2024. URL: https://arxiv.org/abs/2406.15126, arXiv:2406.15126.
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. 2022. URL: https://arxiv.org/abs/2203.02155, arXiv:2203.02155.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: your language model is secretly a reward model. 2024. URL: https://arxiv.org/abs/2305.18290, arXiv:2305.18290.
Márton Szép, Daniel Rueckert, Rüdiger von Eisenhart-Rothe, and Florian Hinterwimmer. A practical guide to fine-tuning language models with limited data. 2024. URL: https://arxiv.org/abs/2411.09539, arXiv:2411.09539.
Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. Is DPO superior to PPO for LLM alignment? A comprehensive study. 2024. URL: https://arxiv.org/abs/2404.10719, arXiv:2404.10719.
The advent of LLMs marks a pivotal shift in the landscape of software development and evaluation. Unlike traditional software systems, where deterministic outputs are the norm, LLMs introduce a realm of non-deterministic and generative behaviors that challenge conventional software engineering testing paradigms. This shift is not merely a technical evolution but a fundamental transformation in how we conceive, build, and assess software products.
For those entrenched in traditional methodologies, the transition to LLM-driven systems may seem daunting. However, ignoring this change is not an option. The reliance on outdated testing frameworks that fail to account for the probabilistic nature of LLMs will inevitably lead to significant setbacks.
To overcome these challenges, it is imperative to embrace the complexities of LLMs with a proactive mindset. This involves developing robust evaluation frameworks up-front, fostering a product development culture of continuous change, learning and adaptation.
One of the most fundamental challenges when building products with Large Language Models (LLMs) is their generative and non-deterministic nature. Unlike traditional software systems where the same input reliably produces the same output, LLMs can generate novel text that may not exist in their training data, and produce different responses each time they're queried - even with identical prompts and input data. This behavior is both a strength and a significant engineering and product challenge.
When you ask an LLM the same question multiple times, you’ll likely get different responses. This isn’t a bug - it’s a fundamental feature of how these models work. The “temperature” parameter, which controls the randomness of outputs, allows models to be creative and generate diverse responses. However, this same feature makes it difficult to build reliable, testable systems.
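A quick way to see this in practice with any small open model (gpt2 is used here purely for illustration):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

enc = tok("The best way to invest $1,000 is", return_tensors="pt")
for _ in range(3):
    out = lm.generate(**enc, max_new_tokens=20, do_sample=True,
                      temperature=0.9, pad_token_id=tok.eos_token_id)
    print(tok.decode(out[0], skip_special_tokens=True))
# Three runs, three (likely) different continuations - sampling, not a bug.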
Consider a financial services company using LLMs to generate investment advice. The non-deterministic nature of these models means that:
Beyond their non-deterministic nature, LLMs present another fascinating characteristic: emergent abilities that spontaneously arise as models scale up in size. These abilities - from basic question answering to complex reasoning - aren’t explicitly programmed but rather emerge “naturally” as the models grow larger and are trained on more data. This makes evaluation fundamentally different from traditional software testing, where capabilities are explicitly coded and can be tested against pre-defined specifications.
Fig. 5.1 provides a list of emergent abilities of large language models and the scales at which they appear. The relationship between model scale and emergent abilities follows a fascinating non-linear pattern. Below certain size thresholds, specific abilities may be completely absent from the model - it simply cannot perform certain tasks, no matter how much you try to coax them out. However, once the model reaches critical points in its scaling journey, these abilities can suddenly manifest in what researchers call a phase transition - a dramatic shift from inability to capability. This unpredictable emergence of capabilities stands in stark contrast to traditional software development, where features are deliberately implemented and can be systematically tested.
Consider a practical example that illustrates these challenges: building a Math AI tutoring system for children powered by an LLM. In traditional software development, you would define specific features (like presenting math problems or checking answers) and write tests to verify each function. But with LLMs, you’re not just testing predefined features - you’re trying to evaluate emergent capabilities like adapting explanations to a child’s level, maintaining engagement through conversational learning, and providing age-appropriate safety-bound content.
This fundamental difference raises critical questions about evaluation:
First, it's important to make a distinction between evaluating an LLM versus evaluating an LLM-based application. While the former offers foundation capabilities and is typically general-purpose, the latter is more specific and tailored to a particular use case. Here, we define an LLM-based application as a system that uses one or more LLMs to perform a specific task. More specifically, an LLM-based application is the combination of one or more LLM models, their associated prompts and parameters to solve a particular business problem.
That differentiation is important because it changes the scope of evaluation. LLMs are usually evaluated based on their capabilities, which include things like language understanding, reasoning and knowledge. LLM-based applications, instead, should be evaluated based on their end-to-end functionality, performance, and how well they meet business requirements. That distinction has key implications for the design of evaluation systems:
The design of an LLM application evaluation system depends heavily on the specific use case and business requirements. Here we list important questions for planning an LLM application evaluation system pertaining to each of the key components previously introduced:
The choice of metric depends on the specific task and desired evaluation criteria. However, one can categorize metrics into two broad categories: intrinsic and extrinsic.
Intrinsic metrics focus on the model’s performance on its primary training objective, which is typically to predict the next token in a sequence. Perplexity is a common intrinsic metric that measures how well the model predicts a given sample of text.
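Perplexity is straightforward to compute with a causal LM: feed the text as both input and labels, and exponentiate the resulting mean cross-entropy loss. A minimal sketch (gpt2 is an illustrative model choice):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

enc = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    # With labels == input_ids, the model returns the mean token-level
    # cross-entropy of next-token prediction
    loss = lm(**enc, labels=enc["input_ids"]).loss
print(f"Perplexity: {torch.exp(loss).item():.2f}")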
Extrinsic metrics assess the model’s performance on various downstream tasks, which can range from question answering to code generation. These metrics are not directly tied to the training objective, but they provide valuable insights into the model’s ability to generalize to real-world applications.
Here, we are particularly interested in extrinsic metrics, since we are evaluating LLM-based applications rather than base LLM models.
Another way to think about metrics is in terms of the type of the task we evaluate:
Traditional metrics like BLEU or ROUGE often fall short in capturing the nuanced, contextual, and creative outputs of LLMs. As an alternative we can consider model-based evaluation, most commonly using an LLM as a judge. This approach leverages language models themselves to assess the quality of outputs from other language models, using a model (often a more capable one) as an automated judge that evaluates aspects like accuracy, coherence, and relevance of generated content. Unlike traditional metrics that rely on exact matching or statistical measures, model-based evaluation can capture nuanced aspects of language and provide more contextual assessment.
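A bare-bones sketch of an LLM judge follows; the judge model choice, the rubric wording, and the number-only parsing are illustrative assumptions, not a canonical recipe:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = ("Rate the answer below for accuracy, coherence and relevance to the "
          "question on a 1-10 scale. Reply with the number only.\n\n"
          "Question: {question}\nAnswer: {answer}")

def judge_score(question, answer, judge_model="gpt-4o-mini"):
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user",
                   "content": RUBRIC.format(question=question, answer=answer)}],
    )
    return int(resp.choices[0].message.content.strip())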
As discussed in the paper [Li et al., 2024], LLM-based evaluation approaches generally fall into two main categories:
We have discussed how LLMs can be used to evaluate LLM-based applications. However, how can we evaluate the performance of LLMs that evaluate other LLMs? This is the question that meta evaluation aims to answer. Clearly, the discussion can become quite meta, as we need to evaluate the performance of the evaluator to evaluate the performance of the evaluated model. However, one can make a case for two general options:
Use a gold-standard dataset that is used to evaluate the performance of LLM evaluators using a “metrics-based” approach.
Benchmarks act as standardized tests for LLMs, evaluating their performance across a spectrum of tasks. These tasks simulate real-world applications such as answering questions, generating coherent text, solving mathematical problems, or even writing computer code. They also assess more abstract qualities like fairness, robustness, and cultural understanding.
Benchmarks can be thought of as comprehensive “exams” that probe different “subjects” in order to certify an LLM. They help researchers and developers compare models systematically, making LLM performance comparable while enabling the identification of emergent behaviors or capabilities as models evolve in scale and sophistication.
The history of LLM benchmarks reflects the evolving priorities of artificial intelligence research, starting with foundational tasks and moving toward complex, real-world challenges. It began in 2018 with the introduction of GLUE (General Language Understanding Evaluation) [Wang et al., 2019], which set a new standard for evaluating natural language understanding. GLUE measured performance on tasks like sentiment analysis and textual entailment, providing a baseline for assessing the fundamental capabilities of language models. A year later, SuperGLUE [Wang et al., 2019] expanded on this foundation by introducing more nuanced tasks that tested reasoning and language comprehension at a deeper level, challenging the limits of models like BERT and its successors.
The Hugging Face Open LLM Leaderboard [Face, 2024] stands out for its transparency and accessibility in the open-source community. This leaderboard evaluates a wide range of LLMs across diverse tasks, including general knowledge, reasoning, and code-writing. Its commitment to reproducibility ensures that results are verifiable, enabling researchers and practitioners to replicate findings. By focusing on open-source models, it democratizes AI research and fosters innovation across communities, making it a valuable resource for both academics and industry professionals.
The Chatbot Arena (2024) Leaderboard (an evolution of LMSYS)[Chiang et al., 2024] takes an alternative approach by measuring real-world performance through direct model comparisons. Its evaluation format compares models in live conversations, with human judges providing qualitative assessments. This methodology has gathered over 200,000 human evaluations, offering specific insights into practical model performance. The emphasis on interactive capabilities makes it relevant for developing user-facing applications like virtual assistants and chatbots.
The AlpacaEval [Dubois et al., 2024] and MT-Bench [Zheng et al., 2023] Leaderboards implement automated evaluation using GPT-4 to assess model performance in multi-turn conversations. This approach enables consistent assessment of dialogue capabilities while reducing human bias. Their methodology measures key aspects of conversational AI, including contextual understanding and response consistency across multiple exchanges.
An important recent development was the release of Global-MMLU [Singh et al., 2024], an improved version of MMLU with evaluation coverage across 42 languages. This open dataset, built through collaboration between Argilla, the Hugging Face community, and researchers from leading institutions like Cohere For AI, Mila, MIT, and others, represents a significant step toward more inclusive multilingual LLM evaluation. Over 200 contributors used Argilla to annotate MMLU questions, revealing that 85% of questions requiring specific cultural knowledge were Western-centric. The newly released dataset is divided into two key subsets: Culturally Agnostic questions that require no specific regional or cultural knowledge, and Culturally Sensitive questions that depend on dialect, cultural, or geographic knowledge. With high-quality translations available for 25 languages, Global-MMLU enables better understanding of LLM capabilities and limitations across different languages and cultural contexts.
A major challenge with these leaderboards and benchmarks is test set contamination - when test data ends up in newer models’ training sets, rendering the benchmarks ineffective. While some benchmarks try to address this through crowdsourced prompts and evaluations from humans or LLMs, these approaches introduce their own biases and struggle with difficult questions. LiveBench[White et al., 2024] represents a novel solution, designed specifically to be resilient to both contamination and evaluation biases. As the first benchmark with continuously updated questions from recent sources, automated objective scoring, and diverse challenging tasks across multiple domains, LiveBench maintains its effectiveness even as models improve. Drawing from recent math competitions, research papers, news, and datasets, it creates contamination-free versions of established benchmark tasks. Current results show even top models achieving below 70% accuracy, demonstrating LiveBench’s ability to meaningfully differentiate model capabilities. With monthly updates and an open collaborative approach, LiveBench aims to provide sustained value for model evaluation as the field advances.
Another notable benchmark is ZebraLogic [Lin et al., 2024], which evaluates logical reasoning capabilities of LLMs through Logic Grid Puzzles - a type of Constraint Satisfaction Problem [Brailsford et al., 1999] commonly found in tests like the LSAT. These puzzles require assigning unique values to N houses across M different features based on given clues, demanding strategic reasoning and deduction to arrive at a unique correct solution. The benchmark’s programmatically generated puzzles range from 2x2 to 6x6 in size and test LLMs using one-shot examples with reasoning steps. While humans can solve these puzzles through strategic methods like reductio ad absurdum and elimination, LLMs demonstrate significant limitations in this type of logical reasoning. Even the best-performing model, Claude 3.5 Sonnet, only achieves 33.4% accuracy across all puzzles and 12.4% on hard puzzles, with smaller models (7-10B parameters) solving less than 1% of hard puzzles as of December 2024. These results reveal critical gaps in LLMs’ capabilities around counterfactual thinking, reflective reasoning, structured memorization, and compositional generalization.
A significant shift in AI evaluation came with the launch of the ARC Prize [Chollet, 2024] by ARC Prize Inc., a non-profit for the public advancement of open artificial general intelligence. Hosted by Mike Knoop (Co-founder, Zapier) and François Chollet (Creator of ARC-AGI, Keras), this prize represents a paradigm shift in how we evaluate language models. Rather than focusing on narrow performance metrics, the ARC Prize assesses what it calls “cognitive sufficiency” - a model’s ability to generate meaningful insights and tackle open-ended challenges. This new way to think about LLM evaluation emphasizes creative thinking, sophisticated reasoning, and the capacity to make genuinely useful contributions to human knowledge as we seek to define and measure what it means to achieve AGI (Artificial General Intelligence).
While deep learning has significantly advanced in recent years, pure deep learning approaches perform poorly on the ARC-AGI benchmark [Chollet, 12/08/2024]. This is because traditional deep learning relies on relating new situations to those encountered during training and lacks the ability to adapt or recombine knowledge for entirely new tasks. ARC Prize 2024 spurred the development of novel AGI reasoning techniques, leading to a significant increase in the state-of-the-art score on the ARC-AGI private evaluation set from 33% in 2023 to 55.5% in 2024. A key takeaway is that algorithmic improvements, rather than massive computational resources, may be key to exceeding the target score for the ARC-AGI benchmark.
In addition to the benchmarks discussed above, a growing set of domain-specific benchmarks is emerging to help evaluate LLMs in specific verticals, including:
FinBench [Zhang et al., 2024]: Evaluates LLMs in the financial domain, covering tasks such as terminology understanding, temporal reasoning, future forecasting, scenario planning, and numerical modelling.
LegalBench [Guha et al., 2023]: Assesses the legal reasoning abilities of LLMs through tasks crowdsourced by legal professionals.
Berkeley Function Calling Leaderboard (BFCL) [Patil et al., 2023]: Evaluates LLMs’ function-calling abilities.
As language models continue to advance in capability and complexity, evaluation frameworks must evolve. Modern benchmarks increasingly incorporate tests for nuanced reasoning, ethical decision-making, and emergent capabilities that weren’t previously measurable. This ongoing evolution reflects a deeper understanding that the true value of language models lies not in achieving high scores on standardized tests with narrow task-specific metrics, but in their ability to meaningfully contribute to human understanding and help solve real-world problems while demonstrating the ability to learn and adapt to new tasks.
LightEval [Fourrier et al., 2023] is a lightweight framework for evaluating LLMs on a variety of standard and bespoke metrics and tasks, supporting multiple inference backends via a Python SDK and CLI.
As a motivating example, consider a scenario where financial data has been extracted from SEC financial filings and requires econometric analysis. Tasks like estimating autoregressive models for time series forecasting or conducting hypothesis tests on market efficiency are common in financial analysis. Let’s evaluate how well different models perform on this type of task.
First, we need to select a benchmark to assess LLMs’ capabilities in this domain. MMLU has a sub-benchmark called Econometrics we can use for this task. Table 5.4 shows a sample of the benchmark dataset from MMLU Econometrics. It consists of multiple-choice questions from econometrics and expected answers.
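This subset is straightforward to load from the Hugging Face Hub; cais/mmlu is a commonly used mirror and is assumed here (the book may source it differently):

from datasets import load_dataset

econ = load_dataset("cais/mmlu", "econometrics", split="test")
example = econ[0]
print(example["question"])  # the econometrics question
print(example["choices"])   # four multiple-choice options
print(example["answer"])    # index of the correct option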
Let’s revisit our evaluation example when we were interested in evaluating the quality of summaries generated by different (smaller and cheaper) LLM models compared to a benchmark model (larger and more expensive). Recall the setup:
Promptfoo [promptfoo, 2024] is an open-source framework designed for evaluating applications that utilize large language models (LLMs). Key features include:
Automated Testing: Promptfoo provides automated testing capabilities, allowing developers to run custom evaluations tailored to their applications.
In conclusion, Promptfoo can serve as an effective LLM application evaluation tool, particularly for its ability to decouple several components of the evaluation process. This enables the user to focus on the most important aspects of the evaluation given the particular application and criteria, making it a valuable and flexible tool for LLM application development.
The following table provides a summarized comparative analysis of three open source frameworks for language models evaluation we have discussed: Lighteval, LangSmith, and Promptfoo. Each framework is assessed based on key features such as integration capabilities, customization options, ease of use, and the ability to facilitate human and LLM collaboration.
Table 5.6 Comparison of Lighteval, LangSmith, and Promptfoo
Language models have fundamentally transformed how software is developed and evaluated. Unlike conventional systems that produce predictable outputs, LLMs generate varied, probabilistic responses that defy traditional testing approaches. While developers accustomed to deterministic systems may find this shift challenging, continuing to rely on legacy testing methods is unsustainable. These frameworks were not designed to handle the inherent variability of LLM outputs and will ultimately prove inadequate.
Success requires embracing this new paradigm by implementing comprehensive evaluation strategies early - this is the new Product Requirements Document (PRD) - and cultivating an organizational mindset focused on iteration, experimentation and growth.
The shift from traditional software testing to LLM evaluation is not just a change in tools but a transformation in mindset. Those who recognize and adapt to this shift will lead the way in harnessing the power of LLMs. However, the cost of inaction is not just technological stagnation, but potential business failure.
Clémentine Fourrier, Nathan Habib, Thomas Wolf, and Lewis Tunstall. LightEval: a lightweight framework for LLM evaluation. 2023. URL: https://github.com/huggingface/lighteval.
Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory M. Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John Nay, Jonathan H. Choi, Kevin Tobia, Margaret Hagan, Megan Ma, Michael Livermore, Nikon Rasumov-Rahe, Nils Holzenberger, Noam Kolt, Peter Henderson, Sean Rehaag, Sharad Goel, Shang Gao, Spencer Williams, Sunny Gandhi, Tom Zur, Varun Iyer, and Zehua Li. Legalbench: a collaboratively built benchmark for measuring legal reasoning in large language models. 2023. URL: https://arxiv.org/abs/2308.11462, arXiv:2308.11462.
Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: large language model connected with massive apis. arXiv preprint arXiv:2305.15334, 2023.
Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David I. Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Wei-Yin Ko, Madeline Smith, Antoine Bosselut, Alice Oh, Andre F. T. Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, Marzieh Fadaee, Beyza Ermis, and Sara Hooker. Global mmlu: understanding and addressing cultural and linguistic biases in multilingual evaluation. 2024. URL: https://arxiv.org/abs/2412.03304, arXiv:2412.03304.
Zhihan Zhang, Yixin Cao, and Lizi Liao. FinBench: benchmarking LLMs in complex financial problem solving and reasoning. 2024. URL: https://openreview.net/forum?id=AeGrf1uY0p.
Running LLMs locally offers several important advantages over using cloud APIs.
Privacy-sensitive data processing is one of the primary reasons for running LLMs locally. Organizations handling medical records must comply with HIPAA regulations that require data to remain on-premise. Similarly, businesses processing confidential documents and intellectual property, as well as organizations subject to GDPR and other privacy regulations, need to maintain strict control over their data processing pipeline.
Cost considerations become significant when operating at scale. Organizations running high-volume applications can face prohibitive API costs with cloud-based solutions. Development and testing environments that require frequent model interactions, educational institutions supporting multiple student projects, and research initiatives involving extensive model experimentation can all achieve substantial cost savings through local deployment.
Applications with stringent latency requirements form another important category. Real-time systems where network delays would be unacceptable, edge computing scenarios demanding quick responses, and interactive applications requiring sub-second performance all benefit from local deployment. This extends to embedded systems in IoT devices where cloud connectivity might be unreliable or impractical.
Finally, local deployment enables deeper customization and fine-tuning capabilities. Organizations can perform specialized domain adaptation through model modifications, experiment with different architectures and parameters, and integrate models with proprietary systems and workflows. This flexibility is particularly valuable for developing novel applications that require direct model access and manipulation.
When using local LLM models versus widely known private large language models, consider the following trade-offs:
Performance: Local LLMs often have lower performance compared to large private models due to size and training limitations.
Resource requirements: Running local LLMs can be computationally intensive, requiring significant CPU/GPU resources.
Limited capabilities: Local models, particularly if small, may struggle with complex tasks or specialized knowledge that larger models handle well.
Potential output unreliability: Local models may produce less consistent or stable outputs - see Chapter
Limited context window: Local models often have smaller context windows, limiting their ability to process long inputs.
Always evaluate the trade-offs between using local LLMs and private models based on your specific use case and requirements. We highly recommend extensively testing your local LLM before productionizing an end-to-end application.
Local LLM deployment tools generally fall into two categories: inference-focused tools that prioritize performance and programmability for technical users requiring production-grade deployments, and user interface (UI) tools that emphasize accessibility through graphical interfaces for non-technical users, trading some performance for ease of use and broader adoption. In the following sections we will explore some of these tools discussing their features, capabilities, and trade-offs.
Before exploring specific tools, it’s important to understand what “serving” an LLM model means in practice. Serving refers to the process of making a trained language model available for inference. At a high level, this involves setting up the infrastructure needed to accept and process input text and generate responses while efficiently managing system resources. The serving process involves several key responsibilities:
LLama.cpp [Gerganov and contributors, 2024a] is an MIT-licensed open source optimized implementation of the LLama model architecture designed to run efficiently on machines with limited memory.
Originally developed by Georgi Gerganov and today counting hundreds of contributors, this C/C++ implementation provides a simplified interface and advanced features that allow language models to run without overwhelming systems. With the ability to run in resource-constrained environments, LLama.cpp makes powerful language models more accessible and practical for a variety of applications.
In its “Manifesto” [Gerganov and others, 2023], the author sees significant potential in bringing AI from cloud to edge devices, emphasizing the importance of keeping development lightweight, experimental, and enjoyable rather than getting bogged down in complex engineering challenges. The author states a vision that emphasizes maintaining an exploratory, hacker-minded approach while building practical edge computing solutions highlighting the following core principles:
“Will remain open-source”
Focuses on simplicity and efficiency in codebase
GGUF (GPT-Generated Unified Format) [Gerganov and contributors, 2024b] is the latest model format used by LLama.cpp, replacing the older GGML format. It was designed specifically for efficient inference of large language models on consumer hardware. The key features that make GGUF particularly valuable include [IBM Think, 2024]:
Improved quantization: GGUF supports multiple quantization levels to reduce model size while preserving performance. Common quantization schemes that are supported by GGUF include:
These capabilities make GGUF models significantly more practical for running LLMs locally compared to full-precision formats, often dramatically reducing memory requirements. Hugging Face hosts a growing collection of pre-converted GGUF models [Hugging Face, 2024x] and provides a tool (ggml-org/gguf-my-repo) to convert existing models to GGUF format, making it easier for developers to access and deploy optimized versions of popular language models.
Here, we will compile the library from source on a Linux machine, using the -j argument to run multiple build jobs in parallel (8 in our case) for faster compilation.
It is worth noting Llama.cpp provides a way to use grammars [Gerganov and contributors, 2024] to constrain the output of the model, as demonstrated below. This is the same technique Ollama uses, and a similar approach to Outlines’ for generating structured outputs from LLMs. See Wrestling with Structured Outputs Chapter for more details.
./build/bin/llama-cli -m ./models/qwen2.5-0.5b-instruct-q8_0.gguf --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'  # {"appointmentTime": "8pm", "appointmentDetails": "schedule a a call"}
Python
A handy Python binding [Betlen and contributors, 2024] is available for LLama.cpp, which by default returns chat completions in OpenAI’s API chat format as below. The package is very comprehensive, supporting JSON Mode, function calling, multi-modal models and more.
Developed by Justine Tunney, a former Occupy Wall Street activist, Llamafile [Mozilla Ocho, 2024] is an Apache 2.0 licensed open source tool that combines the power of LLama.cpp with Cosmopolitan Libc, a universal C standard library that allows creating portable executables compatible with multiple operating systems.
In this way, Llamafile reduces all the complexity of LLMs to a single executable file (called a “llamafile”) that runs locally without installation. Key advantages of Llamafile over plain Llama.cpp include:
Ollama is a lightweight, MIT-licensed open-source tool for running LLMs locally. It provides a simple interface for interacting with a wide range of language models, including popular models like Llama 3.1 and Llama 3.2. Ollama is designed to be easy to install and use, making it a popular choice for developers who want to run LLMs locally without the need for extensive setup or configuration. Ollama’s key advantages include:
Each solution offers distinct advantages and tradeoffs that make them suitable for different use cases. At a high level, Ollama is the easiest to install and use and has become the most popular choice for the average use case, Llamafile is the easiest to distribute and a good choice when portability is a priority, and Llama.cpp is the most customizable and performant solution, as summarized in Table 8.1.
Table 8.1 Llama.cpp vs Ollama vs Llamafile Comparison
There is a growing number of UI tools for local LLM deployment that aim at providing a more user-friendly experience, ranging from closed-source to open-source solutions across a range of features and capabilities. We will discuss LM Studio, Jan, and OpenWebUI.
LM Studio [LM Studio, 2024] is a closed-source GUI for running LLMs locally. In the context of local deployment, LM Studio positions itself as a more user-friendly, feature-rich solution compared to the other tools. It’s particularly valuable for developers transitioning from cloud APIs to local deployment, and for users who prefer graphical interfaces over command-line tools. Key Features of LM Studio include:
Model Parameter Customization: Allows adjusting temperature, maximum tokens, frequency penalty, and other settings
Chat History: Enables saving prompts for later use
Jan is an open source ChatGPT alternative that runs local models. Its model library contains popular LLMs like Llama, Gemma, Mistral, or Qwen. Key Features of Jan include:
User-Friendly Interface: Run AI models with just a few clicks
Open WebUI is an open-source web interface designed to enhance the local AI model experience, particularly for Ollama and OpenAI-compatible APIs. It aims to provide enterprise-grade features while maintaining user-friendliness. OpenWebUI’s core features include:
LM Studio excels at providing individual developers with a smooth transition from cloud APIs to local deployment, offering an intuitive interface and robust API compatibility, however it is closed-source. Jan focuses on simplicity and accessibility, making it ideal for personal use and basic deployments while maintaining open-source benefits. OpenWebUI makes additional features available to enterprise users and teams requiring advanced features like RAG, collaboration tools, and granular access controls, though this may come at the cost of increased complexity and resource requirements. We compare the three tools in Table 8.2.
Table 8.2 LM Studio vs Jan vs OpenWebUI Comparison
Create a system for automatically documenting a proprietary codebase:
LLM analyzes private source code repositories
Generates comprehensive documentation without exposing code externally
Could implement:
Automatic function documentation
Architecture diagrams based on code analysis
Security vulnerability scanning
API documentation generation
Code complexity analysis
Best practice suggestions
Integration with local git repositories
This case study examines how different quantization levels affect the performance of language models running locally. Quantization is a crucial technique for reducing model size and improving inference speed, but it comes with potential tradeoffs in model quality. Understanding these tradeoffs is essential for practitioners deploying LLMs in resource-constrained environments.
Using the Qwen 2.5 0.5B model as our baseline, we’ll compare four variants:
Perplexity - to measure how well the model predicts text
KL divergence - to quantify differences in probability distributions against the base model
Prompt processing speed (tokens/second) - to assess impact on throughput
While we will focus on the Qwen 2.5 0.5B model, the same analysis can be applied to other models. These insights will help practitioners make informed decisions about quantization strategies based on their specific requirements for model size, speed, and accuracy.
To evaluate the impact of quantization on model performance, we first need a set of prompts that will serve as input data for our experiments. We’ll construct a dataset from WikiText-2 [Salesforce, 2024], which contains Wikipedia excerpts.
In our experiments, we will use a total of NUM_PROMPTS prompts that vary in length from MIN_PROMPT_LENGTH to MAX_PROMPT_LENGTH tokens. Using a fixed set of prompts ensures consistent evaluation across model variants and enables direct comparison of metrics like perplexity and throughput. An example prompt drawn from the dataset is shown below.
The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for series newcomers . Character designer Raita Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n .
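As a minimal sketch (not necessarily the exact pipeline used here), the prompt set could be assembled with the Hugging Face datasets library as follows; the NUM_PROMPTS, MIN_PROMPT_LENGTH, and MAX_PROMPT_LENGTH values and the whitespace-based token count are illustrative assumptions. The resulting input_texts list feeds the file-writing snippet below.

from datasets import load_dataset

NUM_PROMPTS = 50          # number of prompts to sample (assumed value)
MIN_PROMPT_LENGTH = 100   # minimum prompt length in tokens (assumed value)
MAX_PROMPT_LENGTH = 512   # maximum prompt length in tokens (assumed value)

# WikiText-2 raw split, as published by Salesforce on the Hugging Face Hub
dataset = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1", split="test")

input_texts = []
for text in dataset["text"]:
    # Rough whitespace tokenization as a proxy for model tokens
    n_tokens = len(text.split())
    if MIN_PROMPT_LENGTH <= n_tokens <= MAX_PROMPT_LENGTH:
        input_texts.append(text.strip())
    if len(input_texts) >= NUM_PROMPTS:
        break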
with open('../data/local/prompts.txt', 'w') as f:
    for text in input_texts:
        # Escape any quotes in the text and wrap in quotes
        escaped_text = text.replace('"', '\\"')
        f.write(f'"{escaped_text}"\n')
The quantization levels we evaluate are summarized below:

| Quantization | Description | Bits per weight | Weight formula |
|---|---|---|---|
| Q2_K | 2-bit quantization with 16 weights per block in 16-block superblocks | 2.5625 | w = q * block_scale(4-bit) + block_min(4-bit) |
| Q4_K | 4-bit quantization with 32 weights per block in 8-block superblocks | 4.5 | w = q * block_scale(6-bit) + block_min(6-bit) |
| Q6_K | 6-bit quantization with 16 weights per block in 16-block superblocks | 6.5625 | w = q * block_scale(8-bit) |
Each quantization level represents a different tradeoff between model size and accuracy. Q2_K provides the highest compression but potentially lower accuracy, while Q6_K maintains better accuracy at the cost of larger model size. The K-variants use more sophisticated block structures and scaling compared to legacy quantization methods.
The base model stores weights as standard 16-bit IEEE 754 half-precision floating-point numbers.
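For reference, here is a sketch of how the quantized variants could be produced with llama.cpp's llama-quantize tool; the GGUF file names are assumptions, and the type names (Q2_K, Q4_K, Q6_K) follow recent llama.cpp builds.

import subprocess

# Quantize the base FP16 GGUF file into each K-quant variant
BASE = "qwen2.5-0.5b-instruct-fp16.gguf"  # assumed file name
for qtype in ["Q2_K", "Q4_K", "Q6_K"]:
    out = BASE.replace("fp16", qtype.lower())
    subprocess.run(["./llama-quantize", BASE, out, qtype], check=True)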
We will measure quantized model “quality” by means of perplexity and KL Divergence. For performance evaluation, we will report prompt throughput in tokens per second.
Perplexity
Perplexity is a common metric for evaluating language models that measures how well a model predicts a sample of text. Lower perplexity indicates better prediction (less “perplexed” by the text).
Recall that for a tokenized sequence \(X = (x_1, \ldots, x_N)\) of \(N\) tokens, perplexity is defined as:

\[ \text{PPL}(X) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(x_i|x_{<i})\right) \]

where:

\(P(x_i|x_{<i})\) is the probability the model assigns to token \(x_i\) given the previous tokens \(x_{<i}\)

\(N\) is the total number of tokens in the sequence
To evaluate quantization quality, we first calculate perplexity scores for both the base model and quantized variants. We then compute the ratio of quantized to base perplexity and average it across all prompt samples as follows:
\[ \text{Avg PPL Ratio} = \frac{1}{N}\sum_{i=1}^{N} \frac{\text{PPL}_i(Q)}{\text{PPL}_i(\text{base})} \]
We also calculate the correlation between the log perplexities of the quantized and base models:
These two simple metrics evaluate how much worse the quantized model performs on an intrinsic basis, relative to the base model's perplexities.
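As an illustration, both summary statistics can be computed in a few lines of NumPy; the perplexity values below are made-up numbers, not measured results.

import numpy as np

# Illustrative per-prompt perplexities for the base and quantized models
ppl_base = np.array([12.1, 9.8, 15.3, 11.0])
ppl_quant = np.array([13.4, 10.5, 16.9, 12.1])

# Average ratio of quantized to base perplexity across prompts
avg_ppl_ratio = np.mean(ppl_quant / ppl_base)

# Correlation between the log perplexities of the two models
log_corr = np.corrcoef(np.log(ppl_quant), np.log(ppl_base))[0, 1]

print(f"Avg PPL ratio: {avg_ppl_ratio:.4f}, log-PPL correlation: {log_corr:.4f}")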
Arguably, KL Divergence is a better metric, since we aim to report relative performance rather than intrinsic performance.
KL Divergence
The Kullback-Leibler (KL) Divergence measures how one probability distribution differs from another reference distribution. For comparing logits between a base model (B) and a quantized model (Q), we can calculate the KL divergence as follows:

\[ D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \]

where:

\(P(i)\) and \(Q(i)\) are the softmax probabilities derived from the logits of the base and quantized models, respectively

The sum is taken over all tokens in the vocabulary
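A small sketch of the computation for a single token position, using illustrative logits (a real vocabulary has tens of thousands of entries):

import numpy as np
from scipy.special import log_softmax

# Illustrative logits over a toy 4-token vocabulary
logits_base = np.array([2.0, 1.0, 0.5, -1.0])   # base model B
logits_quant = np.array([1.8, 1.1, 0.4, -0.9])  # quantized model Q

# Convert logits to (log-)probabilities with a numerically stable softmax
log_p = log_softmax(logits_base)
log_q = log_softmax(logits_quant)
p = np.exp(log_p)

# KL(P || Q) summed over the vocabulary
kl = np.sum(p * (log_p - log_q))
print(f"KL(P || Q) = {kl:.6f}")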
Implementation
We will use llama.cpp's llama-perplexity CLI to calculate perplexity and KL divergence. The first step is to generate the logits for the base model, which will serve as the reference distribution. For instance, below we pass our input prompts (prompts.txt) and generate the logits for the base model qwen2.5-0.5b-instruct-fp16.gguf, which will be saved in logits.kld.
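A sketch of the two-step workflow, wrapped in Python for convenience; the quantized file name is an assumption, and the --kl-divergence-base / --kl-divergence flags are those exposed by recent llama.cpp builds.

import subprocess

# Step 1: run the base model over the prompts and save its logits to
# logits.kld, which serves as the reference distribution
subprocess.run([
    "./llama-perplexity",
    "-m", "qwen2.5-0.5b-instruct-fp16.gguf",
    "-f", "prompts.txt",
    "--kl-divergence-base", "logits.kld",
], check=True)

# Step 2: evaluate a quantized variant against the saved logits,
# reporting perplexity and KL divergence statistics
subprocess.run([
    "./llama-perplexity",
    "-m", "qwen2.5-0.5b-instruct-q2_k.gguf",  # assumed file name
    "--kl-divergence-base", "logits.kld",
    "--kl-divergence",
], check=True)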
The KL divergence and perplexity results, shown in Fig. 8.6 and Fig. 8.7, provide insights into model quality across different quantization levels. Q6 maintains near-perfect correlation (99.90%) with the base model and minimal KL divergence (0.004), indicating very close distribution matching. Q2's higher KL divergence (0.112) and lower correlation (98.31%) quantify its increased deviation from the base model's behavior.
Fig. 8.6 KL divergence results for the Q2, Q4, and Q6 quantized models.
Fig. 8.7 Perplexity results for the Q2, Q4, and Q6 quantized models.
From Table 8.4, we observe that the Q2 model achieves the smallest size at 390 MiB (67% reduction from base) with throughput of 81 tokens/s, but has the highest perplexity degradation at 10.36%. The Q4 model offers a better balance, with good size savings (60% reduction) and only 3.5% perplexity loss. Q6 comes closest to matching the base model's performance with just 0.93% perplexity degradation, while still providing 47% size reduction.
Benchmarking was performed on Ubuntu 24.04 LTS (x86_64-linux-gnu) on commodity hardware with no dedicated GPU (Table 8.5), demonstrating that, thanks to llama.cpp, nearly anyone with a personal computer can run LLMs locally.
Georgi Gerganov and contributors. llama.cpp. GitHub repository, 2024a. High-performance inference of LLaMA models in pure C/C++. URL: https://github.com/ggerganov/llama.cpp.
Georgi Gerganov and contributors. GGUF file format specification. GitHub repository, 2024b. Technical specification of the GGUF file format for efficient model storage and inference. URL: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md.
Georgi Gerganov and others. Quantization of LLaMA models - discussion. GitHub discussion, 2023. Discussion thread about quantization techniques and tradeoffs in llama.cpp. URL: https://github.com/ggerganov/llama.cpp/discussions/205.
Hugging Face. GGUF models on Hugging Face. Online repository, 2024. Collection of models in GGUF format for efficient local inference. URL: https://huggingface.co/models?search=gguf.
LM Studio. LM Studio - discover, download, and run local LLMs. Website, 2024. Desktop application for discovering, downloading and running local language models. URL: https://lmstudio.ai/.
Mozilla Ocho. Llamafile: distribute and run LLMs with a single file. GitHub repository, 2024. Tool for packaging and distributing LLMs as self-contained executables. URL: https://github.com/Mozilla-Ocho/llamafile.
Tokens are the basic units that LLMs process text with. A token can be as short as a single character or as long as a complete word. In English, a general rule of thumb is that 1 token ≈ 4 characters or ¾ of a word.
max_output_tokens is a parameter often available in modern LLMs that determines the maximum length of text an LLM can generate in a single response. Table 3.1 shows max_output_tokens for several key models, which typically ranges between 4096 and 16384 tokens. Contrary to what one might expect, the model does not "summarize the answer" so that it stays within the max_output_tokens limit. Instead, it simply stops once it reaches the limit, even mid-sentence, i.e. the response may be truncated.
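As a minimal illustration, assuming the OpenAI Python client and an illustrative model name, a deliberately small limit makes this truncation behavior visible:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Write a detailed report on renewable energy."}],
    max_tokens=100,  # deliberately small to trigger truncation
)

print(response.choices[0].message.content)
print(response.choices[0].finish_reason)  # "length" indicates a truncated response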
The max_output_tokens limit in LLMs poses a significant challenge for users who need to generate long outputs, as it may result in truncated content and/or incomplete information.
Truncated Content: Users aiming to generate extensive content, such as detailed reports or comprehensive articles, may find their outputs abruptly cut off due to the max_output_tokens limit. This truncation can result in incomplete information and disrupt the flow of the content.
Content chunking with contextual linking is a technique used to manage the max_output_tokens limitation by breaking down long-form content into smaller, manageable chunks. This approach allows the LLM to focus on smaller sections of the input, enabling it to generate more complete and detailed responses for each chunk while maintaining coherence and context across the entire output.
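A minimal sketch of the idea, where generate is a hypothetical stand-in for any LLM completion call:

def generate(prompt: str) -> str:
    # Hypothetical helper: replace with a call to your LLM of choice
    raise NotImplementedError

def write_long_form(outline: list[str]) -> str:
    """Generate a long document one section at a time, passing forward a
    slice of the previous section so chunks stay coherent."""
    sections = []
    context = ""
    for topic in outline:
        prompt = (
            f"Context from the previous section:\n{context}\n\n"
            f"Write the next section about: {topic}\n"
            "Stay consistent with the context above."
        )
        section = generate(prompt)
        sections.append(section)
        context = section[-500:]  # carry forward only a short tail as context
    return "\n\n".join(sections)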