Commit

add Github link
souzatharsis committed Nov 25, 2024
1 parent 31af037 commit 2d1cedf
Showing 17 changed files with 1,027 additions and 48 deletions.
Binary file modified tamingllms/_build/.doctrees/environment.pickle
Binary file not shown.
Binary file modified tamingllms/_build/.doctrees/notebooks/evals.doctree
Binary file not shown.
2 changes: 1 addition & 1 deletion tamingllms/_build/html/.buildinfo
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: a685905c3bf9c83b1dd9f8ab3742b2b8
config: 0e52b5e6cf4b809f379b4f0cf47f362e
tags: 645f666f9bcd5a90fca523b33c5a78b7
209 changes: 203 additions & 6 deletions tamingllms/_build/html/_sources/notebooks/evals.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# The Challenge of Evaluating LLMs\n",
"# The Challenge of Evaluating LLM-based Applications\n",
"```{epigraph}\n",
"Evals are surprisingly often all you need.\n",
"\n",
Expand Down Expand Up @@ -206,7 +206,91 @@
"source": [
"## Solutions\n",
"\n",
"### Approaches\n",
"### Evals Design\n",
"\n",
"First, it's important to make a distinction between evaluating an LLM versus evaluating an LLM-based application (our focus). While the latter offers foundation capabilities and are typically general-purpose, the former is more specific and tailored to a particular use case. Here, we define an LLM-based application as a system that uses one or more LLMs to perform a specific task. More specifically, an LLM-based application is the combination of one or more LLM models, their associated prompts and parameters to solve a particular business problem.\n",
"\n",
"That differentiation is important because it changes the scope of evaluation. LLMs are usually evaluated based on their capabilities, which include things like language understanding, reasoning and knowledge. LLM-based applications are evaluated based on their end-to-end functionality, performance, and how well they meet business requirements. That distinction has key implications for the design of evaluation systems:\n",
"\n",
"1. Application requirements are closely tied to LLM evaluations\n",
"2. The same LLM can yield different results in different applications\n",
"3. Evaluation must align with business objectives\n",
"4. A great LLM doesn't guarantee a great application!\n",
"\n",
"\n",
"#### Conceptual Overview\n",
"\n",
"When evaluating an LLM-based application, we need to consider the following components:\n",
"\n",
"Examples, Application, Evaluator, Score\n",
"\n",
"\n",
"Let me break down the key components, their inputs/outputs, and purposes from the diagram:\n",
"\n",
"1. Examples/Dataset (Input Source):\n",
"- Purpose: Provides standardized test cases for evaluation\n",
"- Input: Collection of test cases\n",
"- Output: Test inputs fed to multiple LLM applications\n",
"- Optional Connection to Evaluator: Can provide reference/ground truth for comparison\n",
"\n",
"2. LLM Applications (Processing Layer):\n",
"- Input: Test cases from Examples\n",
"- Processing: Each LLM (LLM_1, LLM_2, ... LLM_N) processes the same inputs\n",
"- Output: Generated responses/results\n",
"- Purpose: \n",
" * Represents different LLM implementations/vendors\n",
" * Could be different models (GPT-4, Claude, PaLM, etc.)\n",
" * Could be different configurations of same model\n",
" * Could be different prompting strategies\n",
"\n",
"3. Evaluator (Assessment Layer):\n",
"- Input: \n",
" * LLM outputs from all applications\n",
" * Optional reference data from Examples\n",
"- Processing: Applies evaluation metrics and scoring criteria\n",
"- Output: Individual scores for each LLM application\n",
"- Purpose:\n",
" * Measures performance across defined metrics\n",
" * Ensures consistent evaluation across all LLMs\n",
" * Applies standardized scoring criteria\n",
"\n",
"4. Scores (Metric Layer):\n",
"- Input: Evaluation results from Evaluator\n",
"- Output: Quantified performance metrics\n",
"- Purpose:\n",
" * Represents performance in numerical form\n",
" * Enables quantitative comparison\n",
" * May include multiple metrics per LLM\n",
"\n",
"5. Leaderboard (Ranking Layer):\n",
"- Input: Scores from all LLM applications\n",
"- Processing: Aggregates and ranks performances\n",
"- Output: Ordered ranking of LLMs with scores\n",
"- Purpose:\n",
" * Provides clear comparison view\n",
" * Shows relative performance\n",
" * Helps in decision-making\n",
"\n",
"The flow demonstrates a systematic approach where:\n",
"1. Same inputs are provided to all LLMs\n",
"2. Responses are evaluated consistently\n",
"3. Performance is quantified objectively\n",
"4. Results are ranked for easy comparison\n",
"\n",
"Key aspects of the design:\n",
"- Scalability: Can handle many LLMs (shown by ...)\n",
"- Fairness: Same inputs and evaluation criteria for all\n",
"- Transparency: Clear flow from input to final ranking\n",
"- Modularity: Components can be updated independently\n",
"- Standardization: Consistent evaluation process\n",
"\n",
"Would you like me to elaborate on any particular component or aspect of the system?\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"```{figure} ../_static/evals/conceptual.png\n",
"---\n",
Expand All @@ -225,15 +309,128 @@
"---\n",
"name: conceptual-multi\n",
"alt: Conceptual Overview\n",
"scale: 40%\n",
"scale: 50%\n",
"align: center\n",
"---\n",
"Conceptual overview of Multiple LLM-based applications evaluation.\n",
"```\n",
"\n",
"{numref}`conceptual-multi`\n",
"\n",
"### Evals Design\n",
"{numref}`conceptual-multi`\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Considerations\n",
"\n",
"Let me break down the key conceptual aspects and important questions for planning an LLM application evaluation system:\n",
"\n",
"1. Examples/Dataset Design:\n",
"- What types of examples should be included in the test set?\n",
" * Does it cover all important use cases?\n",
" * Are edge cases represented?\n",
" * Is there a good balance of simple and complex examples?\n",
"- How do we ensure data quality?\n",
" * Are the examples representative of real-world scenarios?\n",
" * Is there any bias in the test set?\n",
"- Should we have separate test sets for different aspects (accuracy, safety, etc.)?\n",
"- Do we need human-validated ground truth for all examples?\n",
"\n",
"2. LLM Applications:\n",
"- What aspects of each LLM app should be standardized for fair comparison?\n",
" * Prompt templates\n",
" * Context length\n",
" * Temperature and other parameters\n",
" * Rate limiting and timeout handling\n",
"- How to handle different LLM capabilities and limitations?\n",
" * Some models might have special features others don't\n",
" * Cost and latency differences\n",
" * Different output formats\n",
"- Should we test different configurations of the same LLM?\n",
"\n",
"3. Evaluator Design:\n",
"- What metrics should we measure?\n",
" * Accuracy/correctness\n",
" * Response relevance\n",
" * Output consistency\n",
" * Response latency\n",
" * Cost efficiency\n",
" * Safety and bias metrics\n",
"- How do we define success for different types of tasks?\n",
" * Objective metrics vs subjective assessment\n",
" * Task-specific evaluation criteria\n",
" * Handling partial correctness\n",
"- Should evaluation be automated or involve human review?\n",
" * Balance between automation and human judgment\n",
" * Inter-rater reliability for human evaluation\n",
" * Cost and scalability considerations\n",
"\n",
"4. Scoring System:\n",
"- How should different metrics be weighted?\n",
" * Relative importance of different factors\n",
" * Task-specific prioritization\n",
" * Business requirements alignment\n",
"- Should scores be normalized or absolute?\n",
"- How to handle missing capabilities or failed responses?\n",
"- Should we consider confidence scores from the LLMs?\n",
"\n",
"5. Leaderboard/Ranking:\n",
"- How often should rankings be updated?\n",
"- Should ranking include confidence intervals?\n",
"- How to handle ties or very close scores?\n",
"- Should we maintain separate rankings for different:\n",
" * Task types\n",
" * Cost tiers\n",
" * Performance characteristics\n",
"\n",
"6. Overall System Design:\n",
"- How to ensure evaluation system scalability?\n",
"- How to maintain test set security?\n",
"- How to handle API changes and versioning?\n",
"- How to validate the evaluation system itself?\n",
"- How to make the system extensible for new:\n",
" * Metrics\n",
" * LLM providers\n",
" * Use cases\n",
" * Evaluation methods\n",
"\n",
"7. Practical Considerations:\n",
"- Budget constraints for running evaluations\n",
"- API rate limits and quotas\n",
"- Maintenance and monitoring requirements\n",
"- Documentation and reproducibility\n",
"- Legal and compliance requirements\n",
"- Disaster recovery and backup plans\n",
"\n",
"8. Business Integration:\n",
"- How will evaluation results inform business decisions?\n",
"- What is the update frequency needed?\n",
"- How to handle vendor selection and migration?\n",
"- What level of transparency is needed in reporting?\n",
"\n",
"This evaluation framework allows organizations to:\n",
"1. Systematically compare different LLM solutions\n",
"2. Make data-driven decisions about LLM selection\n",
"3. Monitor performance over time\n",
"4. Identify areas for improvement\n",
"5. Manage costs and risks effectively\n",
"\n",
"The key is to design an evaluation system that is:\n",
"- Comprehensive yet practical\n",
"- Fair and unbiased\n",
"- Scalable and maintainable\n",
"- Aligned with business objectives\n",
"- Adaptable to evolving requirements\n",
"\n",
"Would you like me to elaborate on any of these aspects or explore additional considerations?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Approaches\n",
"\n",
"### Human-Based Evaluation"
]
Expand Down
21 changes: 19 additions & 2 deletions tamingllms/_build/html/genindex.html
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,15 @@




<div class="nav-item">
<a href="https://github.com/souzatharsis/tamingllms"
class="nav-link external">
Github <outboundlink></outboundlink>
</a>
</div>


</navlinks>
</div>
</navbar>
Expand All @@ -68,6 +77,15 @@




<div class="nav-item">
<a href="https://github.com/souzatharsis/tamingllms"
class="nav-link external">
Github <outboundlink></outboundlink>
</a>
</div>



</navlinks><div id="searchbox" class="searchbox" role="search">
<div class="caption"><span class="caption-text">Quick search</span>
Expand Down Expand Up @@ -117,7 +135,7 @@

<li class="toctree-l1 ">

<a href="notebooks/evals.html" class="reference internal ">The Challenge of Evaluating LLMs</a>
<a href="notebooks/evals.html" class="reference internal ">The Challenge of Evaluating LLM-based Applications</a>



Expand Down Expand Up @@ -159,7 +177,6 @@ <h1 id="index">Index</h1>
<div class="page-nav">
<div class="inner"><ul class="page-nav">
</ul><div class="footer" role="contentinfo">
&#169; Copyright Tharsis T. P. Souza, 2024.
<br>
Created using <a href="http://sphinx-doc.org/">Sphinx</a> 6.2.1 with <a href="https://github.com/schettino72/sphinx_press_theme">Press Theme</a> 0.9.1.
</div>
Expand Down
21 changes: 19 additions & 2 deletions tamingllms/_build/html/markdown/intro.html
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,15 @@




<div class="nav-item">
<a href="https://github.com/souzatharsis/tamingllms"
class="nav-link external">
Github <outboundlink></outboundlink>
</a>
</div>


</navlinks>
</div>
</navbar>
Expand All @@ -70,6 +79,15 @@




<div class="nav-item">
<a href="https://github.com/souzatharsis/tamingllms"
class="nav-link external">
Github <outboundlink></outboundlink>
</a>
</div>



</navlinks><div id="searchbox" class="searchbox" role="search">
<div class="caption"><span class="caption-text">Quick search</span>
Expand Down Expand Up @@ -137,7 +155,7 @@

<li class="toctree-l1 ">

<a href="../notebooks/evals.html" class="reference internal ">The Challenge of Evaluating LLMs</a>
<a href="../notebooks/evals.html" class="reference internal ">The Challenge of Evaluating LLM-based Applications</a>



Expand Down Expand Up @@ -366,7 +384,6 @@ <h3><a class="toc-backref" href="#id12" role="doc-backlink"><span class="section
title="next chapter"><span class="section-number">2. </span>Output Size Limitations →</a>
</li>
</ul><div class="footer" role="contentinfo">
&#169; Copyright Tharsis T. P. Souza, 2024.
<br>
Created using <a href="http://sphinx-doc.org/">Sphinx</a> 6.2.1 with <a href="https://github.com/schettino72/sphinx_press_theme">Press Theme</a> 0.9.1.
</div>
Expand Down
21 changes: 19 additions & 2 deletions tamingllms/_build/html/markdown/toc.html
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,15 @@




<div class="nav-item">
<a href="https://github.com/souzatharsis/tamingllms"
class="nav-link external">
Github <outboundlink></outboundlink>
</a>
</div>


</navlinks>
</div>
</navbar>
Expand All @@ -69,6 +78,15 @@




<div class="nav-item">
<a href="https://github.com/souzatharsis/tamingllms"
class="nav-link external">
Github <outboundlink></outboundlink>
</a>
</div>



</navlinks><div id="searchbox" class="searchbox" role="search">
<div class="caption"><span class="caption-text">Quick search</span>
Expand Down Expand Up @@ -118,7 +136,7 @@

<li class="toctree-l1 ">

<a href="../notebooks/evals.html" class="reference internal ">The Challenge of Evaluating LLMs</a>
<a href="../notebooks/evals.html" class="reference internal ">The Challenge of Evaluating LLM-based Applications</a>



Expand Down Expand Up @@ -333,7 +351,6 @@ <h2>Appendix B: Tools and Resources<a class="headerlink" href="#appendix-b-tools
title="next chapter"><span class="section-number">1. </span>Introduction →</a>
</li>
</ul><div class="footer" role="contentinfo">
&#169; Copyright Tharsis T. P. Souza, 2024.
<br>
Created using <a href="http://sphinx-doc.org/">Sphinx</a> 6.2.1 with <a href="https://github.com/schettino72/sphinx_press_theme">Press Theme</a> 0.9.1.
</div>
Expand Down