====================================
Evaluations and Eval Engineering
====================================

Evaluations represent a crucial component in the practice of prompt engineering. They provide the quantitative and qualitative signals necessary to understand whether a language model program achieves the desired objectives. Without evaluations, the process of refining prompts often devolves into guesswork, guided only by subjective impressions rather than structured evidence. Although many developers default to an ad hoc process—manually reviewing a handful of generated outputs and deciding by intuition whether one version of a prompt is better than another—this approach quickly becomes untenable as tasks grow more complex, as teams grow larger, and as the stakes get higher.

The premise of ell’s evaluation feature is that prompt engineering should mirror, where possible, the rigor and methodology of modern machine learning. In machine learning, progress is measured against validated benchmarks, metrics, and datasets. Even as one tunes parameters or tries novel architectures, the question “Did we do better?” can be answered systematically. Similarly, evaluations in ell offer a structured and reproducible way to assess prompts. They transform the process from an ephemeral art into a form of empirical inquiry. In doing so, they also introduce the notion of eval engineering, whereby evaluations themselves become first-class entities that are carefully constructed, versioned, and refined over time.

The Problem of Prompt Engineering by Intuition
----------------------------------------------

Prompt engineering without evaluations is often characterized by subjective assessments that vary from day to day and person to person. In simple projects, this might suffice. For example, when producing a handful of short marketing texts, a developer might be content to trust personal taste as the measure of success. However, as soon as the problem grows beyond a few trivial examples, this style of iterative tweaking collapses. With more complex tasks, larger data distributions, and subtle constraints—such as maintaining a specific tone or meeting domain-specific requirements—subjective judgments no longer yield consistent or reliable improvements.

Without evaluations, there is no systematic way to ensure that a revised prompt actually improves performance on the desired tasks. There is no guarantee that adjusting a single detail in the prompt to improve outputs on one example does not degrade outputs elsewhere. Over time, as prompt engineers read through too many model responses, they become either desensitized to quality issues or hypersensitive to minor flaws. This miscalibration saps productivity and leads to unprincipled prompt tuning. Subjective judgment cannot scale, fails to capture statistical performance trends, and offers no verifiable path to satisfy external stakeholders who demand reliability, accuracy, or compliance with given standards.

The Concept of Evals
--------------------

An eval is a structured evaluation suite that measures a language model program’s performance quantitatively and, when necessary, qualitatively. It consists of three essential elements. First, there is a dataset that represents a distribution of inputs over which the prompt must perform. Second, there are criteria that define what constitutes a successful output. Third, there are metrics that translate the model’s raw outputs into a measurable quantity. Together, these ingredients transform a vague sense of performance into a well-defined benchmark.

In many cases, constructing an eval means assembling a carefully chosen set of input examples along with ground-truth labels or ideal reference outputs. For tasks that resemble classification or have measurable correctness (such as a question-answering system judged by exact matches), defining metrics is straightforward. Accuracy or a distance metric between the generated text and a known correct answer might suffice. More challenging domains—like generating stylistically nuanced emails or identifying the most engaging writing—have no simple correctness function. In these scenarios, evals may rely on human annotators who rate outputs along specific dimensions, or on large language model “critics” that apply specified criteria. Sometimes it is necessary to train a reward model on collected human feedback, thereby encapsulating qualitative judgments in a reproducible metric.

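To make these three ingredients concrete, here is a minimal, self-contained sketch of an eval for a task with a measurable notion of correctness. The dataset, the ``answer_question`` program, and the exact-match criterion are illustrative placeholders rather than part of ell's API.

.. code-block:: python

   # Illustrative only: a dataset of inputs, a success criterion, and a metric.
   dataset = [
       {"question": "What is the capital of France?", "answer": "Paris"},
       {"question": "How many legs does a spider have?", "answer": "8"},
   ]

   def exact_match(predicted: str, reference: str) -> bool:
       # Criterion: the output counts as correct only if it matches the
       # reference answer after trivial normalization.
       return predicted.strip().lower() == reference.strip().lower()

   def accuracy(lmp, dataset) -> float:
       # Metric: the fraction of inputs the language model program answers correctly.
       correct = sum(
           exact_match(lmp(item["question"]), item["answer"]) for item in dataset
       )
       return correct / len(dataset)

   # `answer_question` stands in for any language model program:
   # score = accuracy(answer_question, dataset)

For open-ended tasks, the same three-part structure holds; only the criterion and metric become harder to define, which is where human raters and model-based critics come in.
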
Eval Engineering
----------------

Defining a single eval and sticking to it blindly can be as problematic as never evaluating at all. In practice, an eval is never perfect on the first try. As the prompt engineer tests models against the eval, new edge cases and overlooked criteria emerge. Perhaps the chosen metric saturates too easily, so that improving the model beyond a certain plateau becomes impossible. Perhaps the dataset fails to represent crucial scenarios that matter in production. Or maybe the criteria are too vague, allowing model outputs that look superficially fine but do not actually meet the deeper requirements.

This iterative refinement of the eval itself is known as eval engineering. It is the dual process to prompt engineering. While prompt engineering shapes the prompt to maximize performance, eval engineering shapes the evaluation environment so that it provides a faithful measure of what “good” means. Over time, as teams gain insight into their tasks, they add complexity and new constraints to their eval. The eval becomes more discriminative, less easily gamed, and more strongly aligned with the underlying goals of the application. Eventually, the eval and the prompt co-evolve in a virtuous cycle, where improvements in one reveal deficiencies in the other, and vice versa.

Human and Model-Based Evaluation
--------------------------------

In many real-world scenarios, an eval cannot be reduced to a fixed set of rules or ground-truth answers. Consider a task like producing compelling email outreach messages. Quality is subjective, style is nuanced, and the notion of success might be tied to actual user engagement or reply rates. In these cases, collecting human evaluations to create a labeled dataset and then training a reward model—or relying on another large language model as a critic—is a practical solution. The ultimate goal is to reduce reliance on slow, costly human judgments by capturing them in a repeatable automated eval.

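As a rough illustration of the reward-model route, the sketch below fits a small regression model to human ratings of previously collected outputs so that new outputs can be scored without further annotation. The scikit-learn and TF-IDF choices, and the example data, are assumptions made for brevity; ell does not prescribe any particular modeling approach.

.. code-block:: python

   # Sketch: turn a handful of human ratings into an automatic scorer.
   # scikit-learn and TF-IDF features are illustrative choices, not part of ell.
   from sklearn.feature_extraction.text import TfidfVectorizer
   from sklearn.linear_model import Ridge

   # Outputs that humans have already rated (1 = poor ... 5 = excellent).
   rated_outputs = [
       "Hi Sam, I enjoyed your recent post on evaluation tooling ...",
       "DEAR SIR, buy our product now!!!",
       "Hello Dr. Lee, your paper on eval engineering raises a question ...",
   ]
   human_scores = [4.5, 1.0, 4.0]

   vectorizer = TfidfVectorizer()
   features = vectorizer.fit_transform(rated_outputs)
   reward_model = Ridge().fit(features, human_scores)

   def predicted_reward(output: str) -> float:
       # Score a new output without asking a human again.
       return float(reward_model.predict(vectorizer.transform([output]))[0])
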
ell’s evaluation system makes it straightforward to incorporate this approach. One can integrate language model critics that read outputs and apply user-defined rules, effectively simulating a team of annotators. If the critics prove too lenient or fail to catch subtle errors, their prompts and instructions can be improved, reflecting the eval engineering process. If these model-based evaluations still fail to reflect genuine application needs, it becomes necessary to gather more human-labeled data, refine the reward models, and re-architect the evaluation pipeline itself. Through this iterative loop, evals gradually align more closely with the true requirements of the application, transforming previously nebulous quality criteria into concrete, quantifiable metrics.

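One way to express such a critic is as another language model program. The sketch below uses ell's ``@ell.simple`` decorator; the model name, the critic prompt, the ``clarity_critic`` name, and the 1-to-5 scale are illustrative assumptions, and a production critic would need a more carefully engineered rubric.

.. code-block:: python

   import ell

   @ell.simple(model="gpt-4o-mini")
   def clarity_critic(text_to_grade: str):
       """You are a strict annotator. Reply with a single integer from 1 to 5."""
       return f"Rate the clarity of the following text from 1 (unreadable) to 5 (crystal clear):\n\n{text_to_grade}"

   def critic_score(text_to_grade: str) -> float:
       # Parse the critic's reply defensively; fall back to the midpoint
       # if the model does not return a clean integer.
       try:
           return float(clarity_critic(text_to_grade))
       except ValueError:
           return 3.0

If the critic turns out to be too lenient or to miss subtle errors, tightening its instructions is itself an act of eval engineering.
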
Connecting Evals to Prompt Optimization
---------------------------------------

Placing evaluations at the center of prompt engineering makes the entire process more efficient and credible. Instead of repeatedly scanning a handful of generated examples and hoping for improvement, the prompt engineer simply tweaks a prompt, runs the eval, and compares the resulting score. This cycle can happen at scale. Developers can assess thousands of scenarios at once, track metrics over time, and gain statistically meaningful insights into what works and what does not.

As evals mature, one can even iterate automatically. Consider a scenario where a heuristic or reward model score can stand in for what the human evaluators previously judged. It becomes possible to automate prompt tuning through search or optimization algorithms, continuously adjusting prompts to improve a known, well-defined metric. Over time, the process resembles the familiar model training loop from machine learning, where a clear objective guides each improvement. Prompt engineering moves from guesswork to rigorous exploration, guided by stable and trustworthy benchmarks.

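A minimal sketch of such a loop is shown below: a few candidate prompt templates are scored against the same eval and the best one is kept. The ``evaluate`` function and the templates are placeholders; in practice it would wrap an eval like the ones above, and the search could be far more sophisticated than brute-force enumeration.

.. code-block:: python

   # Sketch: brute-force search over prompt variants against a fixed eval.
   # `evaluate(template)` is assumed to build an LMP from the template,
   # run the eval, and return a single scalar score (higher is better).
   prompt_variants = [
       "Summarize the article in three sentences.",
       "Summarize the article in three sentences for a general audience.",
       "Summarize the article in three sentences and avoid technical jargon.",
   ]

   def tune(prompt_variants, evaluate):
       scored = [(evaluate(template), template) for template in prompt_variants]
       best_score, best_template = max(scored)
       return best_template, best_score
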
Versioning and Storing Evals in ell
-----------------------------------

Just as prompt engineering benefits from version control and provenance tracking, so does eval engineering. An eval changes over time: new criteria are introduced, new datasets are collected, and new models are tested as critics or reward functions. It is essential to record each iteration of the eval. Storing eval definitions, datasets, and metrics side-by-side with their corresponding results ensures that any future improvements, regressions, or shifts in performance can be understood in proper context.

ell introduces the same automatic versioning to evals as it does for prompts. When an eval is constructed and run against a language model program, ell captures the code, data, and configuration that define it. The combination is assigned a version hash. If the eval is later changed—perhaps to extend the dataset or refine the metric—these changes are recorded, allowing developers to revert, compare, or branch off different variants of the eval. With this approach, eval engineering becomes traceable and reproducible. Developers can confidently demonstrate that a newly tuned eval indeed measures something more closely aligned with the business goals than its predecessor.

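The exact hashing scheme is internal to ell, but the idea can be sketched with the standard library: serialize whatever defines the eval (its code, its data, its configuration) into a canonical form and hash it, so that any change yields a new identifier. This is a conceptual illustration, not ell's actual implementation.

.. code-block:: python

   import hashlib
   import inspect
   import json

   def eval_version_hash(eval_class, dataset, config) -> str:
       # Conceptual sketch only: canonicalize the eval's source code, dataset,
       # and configuration, then hash the result. Any change to any of the
       # three produces a different version identifier.
       payload = json.dumps(
           {
               "source": inspect.getsource(eval_class),
               "dataset": dataset,
               "config": config,
           },
           sort_keys=True,
       )
       return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
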
Example: Defining and Running an Eval in ell
--------------------------------------------

Setting up an eval in ell usually involves defining a class that specifies how to load the dataset, how to run the language model program on each input example, and how to score the outputs. Consider a scenario where we want to evaluate a prompt designed to summarize articles and rate them for clarity. Assume we have a dataset of articles and reference scores provided by trusted annotators. We define an eval that iterates over these articles, calls the language model program to generate a summary, and then measures how closely the score assigned to each generated summary matches the reference score.

.. code-block:: python

   import ell

   class ClarityEval(ell.Evaluation):
       def __init__(self, articles, reference_scores):
           self.articles = articles
           self.reference_scores = reference_scores

       def run(self, lmp):
           predictions = []
           for article, ref_score in zip(self.articles, self.reference_scores):
               summary = lmp(article)
               model_score = self.score_summary(summary)
               predictions.append((model_score, ref_score))
           return self.compute_metric(predictions)

       def score_summary(self, summary):
           # Custom logic or another LMP critic can be used here.
           return self.heuristic_clarity(summary)

       def heuristic_clarity(self, text):
           # This is a placeholder for any clarity metric.
           return len(text.split()) / 100.0

       def compute_metric(self, predictions):
           # For simplicity, report one minus the mean absolute difference
           # between model scores and reference scores (higher is better).
           differences = [abs(p - r) for p, r in predictions]
           return 1.0 - (sum(differences) / len(differences))

   # After defining the eval, simply run it on your LMP:
   # result = ClarityEval(my_articles, my_ref_scores).run(my_summary_lmp)
   # result now holds a quantitative measure of how well the prompt is performing.

In this example, the placeholder metric is simplistic. In a real deployment, one might rely on a more sophisticated measure or even chain the model’s outputs into another LMP critic that checks adherence to complex guidelines. Over time, the eval can be improved, made stricter, or extended to a broader dataset. Each iteration and its resulting scores are tracked by ell’s integrated versioning, ensuring that comparisons remain meaningful across time.

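For instance, the word-count placeholder could be swapped for a model-based judgment such as the hypothetical ``critic_score`` helper sketched earlier, leaving the rest of the eval untouched. The division by five is an assumption that the reference scores lie roughly in the 0-to-1 range used by the placeholder heuristic.

.. code-block:: python

   class CriticClarityEval(ClarityEval):
       def score_summary(self, summary):
           # Replace the word-count placeholder with a model-based judgment.
           # `critic_score` is the hypothetical critic sketched earlier; its
           # 1-5 output is rescaled to roughly match the reference scale.
           return critic_score(summary) / 5.0
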
As evals grow and mature, they provide a stable foundation on which to refine prompts. Combined with ell’s infrastructure for versioning and tracing, evaluations make it possible to bring principled, data-driven methodologies to prompt engineering. The result is a process that can scale in complexity and ambition, with confidence that improvements are real, documented, and reproducible.