diff --git a/docs/src/core_concepts/evaluation.rst.partial b/docs/src/core_concepts/eval_usage.rst.partial similarity index 96% rename from docs/src/core_concepts/evaluation.rst.partial rename to docs/src/core_concepts/eval_usage.rst.partial index f2bda0edc..6c0f36caa 100644 --- a/docs/src/core_concepts/evaluation.rst.partial +++ b/docs/src/core_concepts/eval_usage.rst.partial @@ -1,8 +1,7 @@ Evaluations =========== -Evaluations in ELL provide a powerful framework for assessing and analyzing Language Model Programs (LMPs). This guide covers the core concepts and features of the evaluation system. - + transcriptions Basic Usage ---------- @@ -208,3 +207,5 @@ Best Practices - Use meaningful commit messages - Track major changes - Maintain evaluation history + + diff --git a/docs/src/core_concepts/evaluations.rst b/docs/src/core_concepts/evaluations.rst new file mode 100644 index 000000000..fc5b59ba7 --- /dev/null +++ b/docs/src/core_concepts/evaluations.rst @@ -0,0 +1,100 @@ +==================================== +Evaluations and Eval Engineering +==================================== + +Evaluations represent a crucial component in the practice of prompt engineering. They provide the quantitative and qualitative signals necessary to understand whether a language model program achieves the desired objectives. Without evaluations, the process of refining prompts often devolves into guesswork, guided only by subjective impressions rather than structured evidence. Although many developers default to an ad hoc process—manually reviewing a handful of generated outputs and deciding by intuition whether one version of a prompt is better than another—this approach quickly becomes untenable as tasks grow more complex, as teams grow larger, and as stakes get higher. + +The premise of ell’s evaluation feature is that prompt engineering should mirror, where possible, the rigor and methodology of modern machine learning. In machine learning, progress is measured against validated benchmarks, metrics, and datasets. Even as one tunes parameters or tries novel architectures, the question “Did we do better?” can be answered systematically. Similarly, evaluations in ell offer a structured and reproducible way to assess prompts. They transform the process from an ephemeral art into a form of empirical inquiry. In doing so, they also introduce the notion of eval engineering, whereby evaluations themselves become first-class entities that are carefully constructed, versioned, and refined over time. + + +The Problem of Prompt Engineering by Intuition +---------------------------------------------- + +Prompt engineering without evaluations is often characterized by subjective assessments that vary from day to day and person to person. In simple projects, this might suffice. For example, when producing a handful of short marketing texts, a developer might be content to trust personal taste as the measure of success. However, as soon as the problem grows beyond a few trivial examples, this style of iterative tweaking collapses. With more complex tasks, larger data distributions, and subtle constraints—such as maintaining a specific tone or meeting domain-specific requirements—subjective judgments no longer yield consistent or reliable improvements. + +Without evaluations, there is no systematic way to ensure that a revised prompt actually improves performance on the desired tasks. There is no guarantee that adjusting a single detail in the prompt to improve outputs on one example does not degrade outputs elsewhere. 
Over time, as prompt engineers read through too many model responses, they become either desensitized to quality issues or hypersensitive to minor flaws. This miscalibration saps productivity and leads to unprincipled prompt tuning. Subjective judgment cannot scale, fails to capture statistical performance trends, and offers no verifiable path to satisfy external stakeholders who demand reliability, accuracy, or compliance with given standards. + + +The Concept of Evals +-------------------- + +An eval is a structured evaluation suite that measures a language model program’s performance quantitatively and, when necessary, qualitatively. It consists of three essential elements. First, there is a dataset that represents a distribution of inputs over which the prompt must perform. Second, there are criteria that define what constitutes a successful output. Third, there are metrics that translate the model’s raw outputs into a measurable quantity. Together, these ingredients transform a vague sense of performance into a well-defined benchmark. + +In many cases, constructing an eval means assembling a carefully chosen set of input examples along with ground-truth labels or ideal reference outputs. For tasks that resemble classification or have measurable correctness (such as a question-answering system judged by exact matches), defining metrics is straightforward. Accuracy or a distance metric between the generated text and a known correct answer might suffice. More challenging domains—like generating stylistically nuanced emails or identifying the most engaging writing—have no simple correctness function. In these scenarios, evals may rely on human annotators who rate outputs along specific dimensions, or on large language model “critics” that apply specified criteria. Sometimes, training a model known as a reward model against collected human feedback is necessary, thereby encapsulating qualitative judgments into a reproducible metric. + + +Eval Engineering +---------------- + +Defining a single eval and sticking to it blindly can be as problematic as never evaluating at all. In practice, an eval is never perfect on the first try. As the prompt engineer tests models against the eval, new edge cases and overlooked criteria emerge. Perhaps the chosen metric saturates too easily, so that improving the model beyond a certain plateau becomes impossible. Perhaps the dataset fails to represent crucial scenarios that matter in production. Or maybe the criteria are too vague, allowing model outputs that look superficially fine but are actually not meeting the deeper requirements. + +This iterative refinement of the eval itself is known as eval engineering. It is the dual process to prompt engineering. While prompt engineering shapes the prompt to maximize performance, eval engineering shapes the evaluation environment so that it provides a faithful measure of what “good” means. Over time, as teams gain insight into their tasks, they add complexity and new constraints to their eval. The eval becomes more discriminative, less easily gamed, and more strongly aligned with the underlying goals of the application. Eventually, the eval and the prompt co-evolve in a virtuous cycle, where improvements in one reveal deficiencies in the other, and vice versa. + + +Human and Model-Based Evaluation +-------------------------------- + +In many real-world scenarios, an eval cannot be reduced to a fixed set of rules or ground-truth answers. Consider a task like producing compelling email outreach messages. 
Quality is subjective, style is nuanced, and the notion of success might be tied to actual user engagement or reply rates. In these cases, collecting human evaluations to create a labeled dataset and then training a reward model—or relying on another large language model as a critic—is a practical solution. The ultimate goal is to reduce reliance on slow, costly human judgments by capturing them in a repeatable automated eval. + +ell’s evaluation system makes it straightforward to incorporate this approach. One can integrate language model critics that read outputs and apply user-defined rules, effectively simulating a team of annotators. If the critics prove too lenient or fail to catch subtle errors, their prompts and instructions can be improved, reflecting the eval engineering process. If these model-based evaluations still fail to reflect genuine application needs, it becomes necessary to gather more human-labeled data, refine the reward models, and re-architect the evaluation pipeline itself. Through this iterative loop, evals gradually align more closely with the true requirements of the application, transforming previously nebulous quality criteria into concrete, quantifiable metrics. + + +Connecting Evals to Prompt Optimization +--------------------------------------- + +By placing evaluations at the center of prompt engineering, the entire process becomes more efficient and credible. Instead of repeatedly scanning a handful of generated examples and hoping for improvement, the prompt engineer simply tweaks a prompt, runs the eval, and compares the resulting score. This cycle can happen at scale. Developers can assess thousands of scenarios at once, track metrics over time, and gain statistically meaningful insights into what works and what does not. + +As evals mature, one can even iterate automatically. Consider a scenario where a heuristic or reward model score can stand in for what the human evaluators previously judged. It becomes possible to automate prompt tuning through search or optimization algorithms, continuously adjusting prompts to improve a known, well-defined metric. Over time, the process resembles the familiar model training loop from machine learning, where a clear objective guides each improvement. Prompt engineering moves from guesswork to rigorous exploration, guided by stable and trustworthy benchmarks. + + +Versioning and Storing Evals in ell +----------------------------------- + +Just as prompt engineering benefits from version control and provenance tracking, so does eval engineering. An eval changes over time: new criteria are introduced, new datasets are collected, and new models are tested as critics or reward functions. It is essential to record each iteration of the eval. Storing eval definitions, datasets, and metrics side-by-side with their corresponding results ensures that any future improvements, regressions, or shifts in performance can be understood in proper context. + +ell introduces the same automatic versioning to evals as it does for prompts. When an eval is constructed and run against a language model program, ell captures the code, data, and configuration that define it. The combination is assigned a version hash. If the eval is later changed—perhaps to extend the dataset or refine the metric—these changes are recorded, allowing developers to revert, compare, or branch off different variants of the eval. With this approach, eval engineering becomes traceable and reproducible. 
Developers can confidently demonstrate that a newly tuned eval indeed measures something more closely aligned with the business goals than its predecessor. + + +Example: Defining and Running an Eval in ell +-------------------------------------------- + +Setting up an eval in ell usually involves defining a class that specifies how to load the dataset, how to run the language model program on each input example, and how to score the outputs. Consider a scenario where we want to evaluate a prompt designed to summarize articles and rate them for clarity. Assume we have a dataset of articles and reference scores provided by trusted annotators. We define an eval that iterates over these articles, calls the language model program to generate a summary, and then measures how closely the language model’s summary score matches the reference. + +.. code-block:: python + + import ell + + class ClarityEval(ell.Evaluation): + def __init__(self, articles, reference_scores): + self.articles = articles + self.reference_scores = reference_scores + + def run(self, lmp): + predictions = [] + for article, ref_score in zip(self.articles, self.reference_scores): + summary = lmp(article) + model_score = self.score_summary(summary) + predictions.append((model_score, ref_score)) + return self.compute_metric(predictions) + + def score_summary(self, summary): + # Custom logic or another LMP critic can be used here + return self.heuristic_clarity(summary) + + def heuristic_clarity(self, text): + # This is a placeholder for any clarity metric. + return len(text.split()) / 100.0 + + def compute_metric(self, predictions): + # For simplicity, measure correlation or difference + differences = [abs(p - r) for p, r in predictions] + return 1.0 - (sum(differences) / len(differences)) + + # After defining the eval, simply run it on your LMP: + # result = ClarityEval(my_articles, my_ref_scores).run(my_summary_lmp) + # result now holds a quantitative measure of how well the prompt is performing. + + +In this example, the placeholder metric is simplistic. In a real deployment, one might rely on a more sophisticated measure or even chain the model’s outputs into another LMP critic that checks adherence to complex guidelines. Over time, the eval can be improved, made stricter, or extended to a broader dataset. Each iteration and its resulting scores are tracked by ell’s integrated versioning, ensuring that comparisons remain meaningful across time. + +As evals grow and mature, they provide the stable foundation on which to stand when refining prompts. Combined with ell’s infrastructure for versioning and tracing, evaluations make it possible to bring principled, data-driven methodologies to prompt engineering. The result is a process that can scale in complexity and ambition, confident that improvements are real, documented, and reproducible. \ No newline at end of file diff --git a/docs/src/core_concepts/evaluations.rst.sample b/docs/src/core_concepts/evaluations.rst.sample new file mode 100644 index 000000000..97278d97f --- /dev/null +++ b/docs/src/core_concepts/evaluations.rst.sample @@ -0,0 +1,196 @@ +Evaluations +=========== + +Prompt engineering often resembles an optimization process without a clear, quantifiable objective function. Engineers tweak prompts based on intuition or "vibes," hoping to improve the model's outputs. While this approach can yield short-term results, it presents several significant challenges. 
+ +Firstly, relying on intuition makes it difficult to quantify improvements or regressions in the model's performance. Without clear metrics, determining whether changes to prompts are genuinely beneficial becomes speculative. This lack of quantitative feedback can lead to inefficient iterations and missed opportunities for optimization. + +Secondly, the process is inherently subjective. Different prompt engineers may have varying opinions on what constitutes a "good" output, leading to inconsistent optimizations. This subjectivity hampers collaboration and makes it challenging to build upon each other's work effectively. + +Moreover, manually evaluating outputs is time-consuming and doesn't scale well, especially with large datasets or diverse use cases. As the number of inputs grows, it's impractical to assess each output individually. This limitation hampers the ability to guarantee that the language model will perform reliably across all desired scenarios. + +In high-stakes applications—such as legal, healthcare, or domains requiring stringent compliance—stakeholders demand assurances about model performance. Providing such guarantees is virtually impossible without quantitative assessments. The inability to measure and demonstrate performance can hinder the deployment of language models in critical areas where they could offer significant benefits. + +Additionally, when working with complex prompt chains involving multiple language model programs (LMPs), optimizing one component may inadvertently degrade the performance of others. Without systematic evaluation methods, identifying and rectifying these issues becomes a formidable challenge. This interdependency underscores the need for a holistic approach to prompt optimization. + +These challenges highlight the necessity for a more rigorous, objective, and scalable approach to prompt engineering. + +Introducing Evals +----------------- + +An **Eval** is a systematic method for evaluating language model programs using quantitative metrics over a dataset of inputs. It serves as a programmatic means to assess whether your prompt engineering efforts have successfully optimized the model's performance for your specific use case. + +### What Are Evals? + +Evals consist of three main components: + +- **Dataset**: A collection of inputs representative of the use cases you care about. This dataset should be large and varied to ensure statistical significance and to capture the diversity of scenarios your model will encounter. + +- **Metrics**: Quantitative criteria that measure how well the LMP performs on the dataset. Metrics could include accuracy, precision, recall, or custom functions that reflect specific aspects of performance relevant to your application. + +- **Qualitative Annotations**: Optional assessments providing additional context or insights into the model's outputs. These annotations can help interpret quantitative results and guide further refinements. + +By running an LMP against an Eval, you obtain scores that reflect the model's performance according to your defined metrics. + +### Benefits of Using Evals + +The use of Evals offers several key advantages: + +- **Statistical Significance**: Evaluating the model over a large and varied dataset yields meaningful performance statistics. This approach reduces the influence of outliers and provides a more accurate picture of the model's capabilities. 
+ +- **Quantitative Analysis**: Replacing subjective judgments with objective metrics reduces cognitive load and enables more focused improvements. Quantitative feedback accelerates the optimization process by highlighting specific areas for enhancement. + +- **Reproducibility**: Consistent and comparable evaluations over time allow you to track progress and ensure that changes lead to genuine improvements. Reproducibility is essential for debugging, auditing, and maintaining confidence in the model. + +- **Scalability**: Evals facilitate efficient assessment of model performance across thousands of examples without manual intervention. This scalability is crucial for deploying language models in production environments where they must handle diverse and extensive input. + +The Necessity of Eval Engineering +--------------------------------- + +While Evals provide a systematic framework for assessment, creating effective Evals is an engineering process in itself—this is where **Eval Engineering** becomes crucial. + +### Why Eval Engineering Is Crucial + +An Eval that lacks discriminative power may saturate too early, showing perfect or near-perfect scores even when the model has significant room for improvement. This saturation typically results from metrics or criteria that are insufficiently sensitive to variations in output quality. + +Conversely, misaligned Evals—where the metrics do not align with the true objectives—can lead to optimizing the model in the wrong direction. The model may perform well on the Eval but fail to deliver the desired outcomes in real-world applications. + +Eval Engineering involves carefully designing and iteratively refining the dataset, metrics, and criteria to ensure that the Eval accurately reflects the qualities you desire in the model's outputs. This process mirrors prompt engineering but focuses on crafting robust evaluations rather than optimizing prompts. + +### The Process of Eval Engineering + +Eval Engineering encompasses several key activities: + +- **Defining Clear Criteria**: Establish explicit, measurable criteria that align with your goals. Clarity in what constitutes success is essential for both the prompt and Eval. + +- **Ensuring Statistical Power**: Collect sufficient and diverse data to make meaningful assessments. A well-constructed dataset captures the range of inputs the model will encounter and provides a solid foundation for evaluation. + +- **Iteratively Refining Metrics**: Adjust metrics and criteria as needed to maintain alignment with objectives and improve discriminative ability. This refinement is an ongoing process as you discover new insights or as requirements evolve. + +- **Versioning and Documentation**: Keep detailed records of Eval versions, changes made, and reasons for those changes. Proper documentation ensures transparency and facilitates collaboration among team members. + +### Turning Qualitative Evaluations into Quantitative Ones + +In scenarios where you lack ground truth labels or have open-ended generative tasks, transforming qualitative assessments into quantitative metrics is challenging. Several approaches can help bridge this gap: + +#### Using Language Models as Critics + +Language models can serve as evaluators by acting as critics of other models' outputs. By providing explicit criteria, you can prompt a language model to assess outputs and generate scores. This method leverages the language model's understanding to provide consistent evaluations. 
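+
+The sketch below shows one way such a critic could be wired up in ell. It is a minimal illustration rather than a fixed recipe: the `@ell.simple` critic, the model name, the 1 to 5 scale, and the `critic_metric(datapoint, output)` signature are all assumptions made for this example.
+
+```python
+import ell
+
+
+# Hypothetical critic LMP. With @ell.simple, the docstring acts as the system prompt
+# and the returned string as the user message; calling the function returns the model's reply.
+@ell.simple(model="gpt-4o-mini", temperature=0.0)
+def clarity_critic(output: str) -> str:
+    """You are a strict evaluator. Reply with a single integer from 1 to 5 and nothing else."""
+    return f"Rate the clarity of the following text from 1 (unclear) to 5 (very clear):\n\n{output}"
+
+
+def critic_metric(datapoint, output) -> float:
+    # datapoint is unused here but kept to match a (datapoint, output) metric signature.
+    # Parse the critic's reply defensively; fall back to the lowest score on bad output.
+    try:
+        score = int(clarity_critic(output).strip())
+    except ValueError:
+        score = 1
+    return score / 5.0  # normalize to [0, 1]
+```
+
+A metric function like this can then be passed wherever an Eval expects a metric (for example, in the `metrics` list shown later in this document). If the critic proves too lenient or too harsh, its instructions can be tightened, which is itself an act of eval engineering.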
+
+#### Human Evaluations
+
+Human evaluators can assess model outputs against defined criteria, offering qualitative annotations that convert into quantitative scores. While effective, this approach can be resource-intensive and may not scale well for large datasets.
+
+#### Training Reward Models
+
+By collecting a dataset of human evaluations, you can train a reward model—a specialized machine learning model that predicts human judgments. This reward model can then provide quantitative assessments, enabling scalable evaluations that approximate human feedback.
+
+Implementing Evals in ell
+-------------------------
+
+**ell** introduces built-in support for Evals, integrating evaluation directly into your prompt engineering workflow.
+
+### Creating an Eval in ell
+
+An Eval in ell is defined using the `Evaluation` class:
+```python
+from ell.eval import Evaluation
+
+# Define your Eval. accuracy_metric is assumed to be a user-defined metric function.
+my_eval = Evaluation(
+    name='example_eval',
+    data=[{'input': 'sample input', 'expected_output': 'desired output'}],
+    metrics=[accuracy_metric],
+    description='An example Eval for demonstration purposes.',
+)
+```
+
+- **Name**: A unique identifier for the Eval.
+
+- **Data**: A list of dictionaries containing input data and, optionally, expected outputs.
+
+- **Metrics**: A list of functions that compute performance metrics.
+
+- **Description**: A textual description of the Eval's purpose and contents.
+
+### Running an Eval
+
+To run an Eval on an LMP:
+```python
+# Run the Eval
+results = my_eval.run(your_language_model_program)
+
+# Access the results
+print(results.metrics)
+```
+
+The `run` method executes the LMP on the Eval's dataset and computes the specified metrics, returning a `RunResult` object with detailed performance data.
+
+### Viewing Eval Results in ell Studio
+
+When you run an Eval, the results are automatically stored and can be viewed in ell Studio.
+
+ell Studio provides an interactive dashboard where you can visualize Eval scores, track performance over time, and compare different versions of your LMPs and Evals.
+
+Versioning Evals with ell
+-------------------------
+
+Just as prompts require versioning, Evals need version control to manage changes and ensure consistency.
+
+### Automatic Versioning
+
+ell automatically versions your Evals by hashing the following components:
+
+- **Dataset**: Changes to the input data result in a new Eval version.
+
+- **Metric Functions**: Modifications to the evaluation metrics produce a new version.
+
+Each Eval version is stored with metadata, including:
+
+- **Eval ID**: A unique hash representing the Eval version.
+
+- **Creation Date**: Timestamp of when the Eval was created.
+
+- **Change Log**: Automatically generated commit messages describing changes between versions.
+
+### Benefits of Versioning Evals
+
+Versioning Evals offers significant benefits:
+
+- **Reproducibility**: Reproduce past evaluations exactly as they were conducted.
+
+- **Comparison Over Time**: Compare model performance across different Eval versions to track progress or identify regressions.
+
+- **Rollback Capability**: Revert to previous Eval versions if new changes negatively affect evaluations.
+
+- **Transparency**: Clearly document how and why Evals have changed over time, enhancing collaboration and accountability.
+
+Benefits of Eval Engineering
+----------------------------
+
+Implementing Eval Engineering provides numerous advantages:
+
+- **Enhanced Rigor**: Introduce scientific methods into prompt engineering, making the process more objective and reliable.
+ +- **Improved Collaboration**: Separate concerns by having team members focus on prompt engineering or Eval engineering, promoting specialization and efficiency. + +- **Faster Iterations**: Reduce the time spent on manual evaluations, allowing for quicker optimization cycles. + +- **Scalable Evaluations**: Efficiently handle large datasets, enabling comprehensive assessments of model performance. + +- **Alignment with Objectives**: Ensure that the model's outputs closely match stakeholder needs by defining explicit evaluation criteria. + +Evaluations and the Future of Prompt Engineering +----------------------------------------------- + +As language models continue to advance, the importance of robust evaluation methods will grow. Models will increasingly saturate existing Evals, meaning they perform near-perfectly on current evaluations. At this point, further improvements require constructing new Evals with greater discriminative power. + +Eval Engineering will be pivotal in pushing the boundaries of model performance. By continuously refining Evals, you can identify subtle areas for enhancement even when models appear to have plateaued. This ongoing process ensures that models remain aligned with evolving objectives and adapt to new challenges. + +Moreover, Eval Engineering is not just about immediate gains. Developing expertise in this area prepares teams for future developments in the field, positioning them to leverage advancements effectively. + +Conclusion +---------- + +Evals and Eval Engineering represent significant steps toward making prompt engineering a more systematic, reliable, and scalable process. By integrating Evals into your workflow with ell, you move beyond subjective assessments, introducing scientific rigor into the optimization of language model programs. + +The adoption of Eval Engineering practices not only improves current outcomes but also future-proofs your workflows. As language models evolve, the ability to design and implement robust evaluations will be increasingly valuable. + +To get started with Evals in ell, consult the API documentation and explore examples. By embracing Eval Engineering, you enhance your prompt engineering efforts and contribute to the advancement of the field. diff --git a/docs/src/index.rst b/docs/src/index.rst index 26f3e6ef2..b71f99912 100644 --- a/docs/src/index.rst +++ b/docs/src/index.rst @@ -265,6 +265,7 @@ To get started with ``ell``, see the :doc:`Getting Started ` se core_concepts/ell_simple core_concepts/versioning_and_storage core_concepts/ell_studio + core_concepts/evaluations core_concepts/message_api core_concepts/ell_complex