Commit

update evals
souzatharsis committed Nov 27, 2024
1 parent 41f65bd commit f93bcfb
Showing 17 changed files with 220 additions and 252 deletions.
Binary file modified tamingllms/_build/.doctrees/environment.pickle
Binary file modified tamingllms/_build/.doctrees/notebooks/evals.doctree
Binary file modified tamingllms/_build/html/_images/conceptual.png
Binary file modified tamingllms/_build/html/_images/diagram1.png
Binary file modified tamingllms/_build/html/_images/emerging.png
Binary file modified tamingllms/_build/html/_images/rebuttal.png
116 changes: 53 additions & 63 deletions tamingllms/_build/html/_sources/notebooks/evals.ipynb
@@ -412,18 +412,18 @@
"source": [
"#### Example: BLEU and ROUGE for SEC Filing Summarization\n",
"\n",
"When working with SEC filings, you may want to evaluate the quality of summaries or extracted key sections against reference summaries (e.g., analyst-prepared highlights). \n",
"When working with SEC filings, you may want to evaluate the quality of summaries or extracted key sections against reference summaries (e.g. analyst-prepared highlights). \n",
"\n",
"For that purpose, we can use BLEU and ROUGE scores to evaluate the quality of generated summaries against reference summaries.\n",
"\n",
"We will model our simple metrics-based evaluator with the following components:\n",
"- Input: Generated summary and reference summary\n",
"- Output: Dictionary with scores for BLEU, ROUGE_1, and ROUGE_2\n",
"- Purpose: Quantitatively compare generated summaries against reference summaries\n",
"- Purpose: Evaluate our LLM-based application - SEC filing summary generator\n",
"\n",
"A *Reference Summary* represents the \"ideal\" summary. It could be prepared by humanas, e.g. expert analysts, or machine-generated. \n",
"\n",
"In our example, we are interested in evaluating the quality of summaries generated by different (smaller and cheaper) LLM models compared to a *benchmark model* (larger and more expensive). We will use the following setup:\n",
"In our example, we are particularly interested in evaluating the quality of summaries generated by different (smaller and cheaper) LLM models compared to a *benchmark model* (larger and more expensive). We will use the following setup:\n",
"- Benchmark model: `gpt-4o`\n",
"- Test models: `gpt-4o-mini`, `gpt-4-turbo`, `gpt-3.5-turbo`\n"
]
@@ -518,45 +518,33 @@
"evaluate_summaries(sentence1, sentence2)\n"
]
},
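The definition of `evaluate_summaries` sits above this hunk and is not shown in the diff. For context, a minimal sketch consistent with the component description above (generated and reference summaries in, a dictionary of BLEU, ROUGE_1, and ROUGE_2 scores out), assuming the Hugging Face `evaluate` library, could look like the following; the metric keys and loading calls are illustrative rather than the notebook's actual implementation.

```python
# Hypothetical sketch; the notebook's actual evaluate_summaries is not shown in this diff.
import evaluate  # Hugging Face evaluation library (pip install evaluate)

def evaluate_summaries(generated_summary, reference_summary):
    """Compare a generated summary against a reference summary using BLEU and ROUGE."""
    bleu = evaluate.load("bleu")
    rouge = evaluate.load("rouge")

    # Both metrics expect lists of predictions and references.
    bleu_results = bleu.compute(
        predictions=[generated_summary],
        references=[[reference_summary]],
    )
    rouge_results = rouge.compute(
        predictions=[generated_summary],
        references=[reference_summary],
    )

    # Return the dictionary shape described above: BLEU, ROUGE_1, ROUGE_2.
    return {
        "BLEU": bleu_results["bleu"],
        "ROUGE_1": rouge_results["rouge1"],
        "ROUGE_2": rouge_results["rouge2"],
    }
```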
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"\n",
"def visualize_prompt_comparison(evaluation_results, model_names):\n",
" \"\"\"\n",
" Create a radar plot comparing different prompt variations\n",
" \n",
" Args:\n",
" evaluation_results (list): List of dictionaries containing evaluation metrics\n",
" model_names (list): List of names for each prompt variation\n",
" \"\"\"\n",
" from evaluate.visualization import radar_plot\n",
" \n",
" # Format data for visualization\n",
" plot = radar_plot(data=evaluation_results, model_names=model_names)\n",
" return plot\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we define `generate_summary`, a simple utility function that generates text summaries using OpenAI's API. It takes an arbitrary `model`, a `prompt`, and an `input` text and returns the corresponding LLM's response."
"Next, we define `generate_summary`, our simple LLM-based SEC filing summirizer application using OpenAI's API. It takes an arbitrary `model`, and an `input` text and returns the corresponding LLM's response with a summary."
]
},
{
"cell_type": "code",
"execution_count": 25,
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"client = OpenAI()\n",
"\n",
"def generate_summary(model, prompt, input):\n",
"def generate_summary(model, input):\n",
" \"\"\"\n",
" Generate a summary of input using a given model\n",
" \"\"\"\n",
" TASK = \"Generate a 1-liner summary of the following excerpt from an SEC filing.\"\n",
"\n",
" prompt = f\"\"\"\n",
" ROLE: You are an expert analyst tasked with summarizing SEC filings.\n",
" TASK: {TASK}\n",
" \"\"\"\n",
" \n",
" response = client.chat.completions.create(\n",
" model=model,\n",
" messages=[{\"role\": \"system\", \"content\": prompt},\n",
@@ -572,26 +560,26 @@
"Now, we define a function `evaluate_summary_models` - our benchmark evaluator - that compares text summaries generated by different language models against a benchmark model. Here's what it does:\n",
"\n",
"1. Takes a benchmark model, list of test models, prompt, and input text\n",
"2. Generates a reference summary using the benchmark model\n",
"3. Generates summaries from all test models\n",
"4. Evaluates each test model's summary against the benchmark using metrics BLEU and ROUGE scores\n",
"5. Returns both the evaluation results and the generated summaries"
"2. Generates a reference summary using the benchmark model and our `generate_summary` function\n",
"3. Generates summaries from all test models using `generate_summary` function\n",
"4. Evaluates each test model's summary against the benchmark using `evaluate_summaries`\n",
"5. Returns evaluation results and the generated summaries"
]
},
{
"cell_type": "code",
"execution_count": 48,
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"def evaluate_summary_models(model_benchmark, models_test, prompt, input):\n",
"def evaluate_summary_models(model_benchmark, models_test, input):\n",
" \"\"\"\n",
" Evaluate summaries generated by multiple models\n",
" \"\"\"\n",
" benchmark_summary = generate_summary(model_benchmark, prompt, input)\n",
" benchmark_summary = generate_summary(model_benchmark, input)\n",
"\n",
" # Generate summaries for all test models using list comprehension\n",
" model_summaries = [generate_summary(model, prompt, input) for model in models_test]\n",
" model_summaries = [generate_summary(model, input) for model in models_test]\n",
" \n",
" # Evaluate each model's summary against the benchmark\n",
" evaluation_results = [evaluate_summaries(summary, benchmark_summary) for summary in model_summaries]\n",
@@ -603,26 +591,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Sample usage:\n",
"- Define a task and prompt\n",
"- Define a benchmark model and a list of test models\n",
"- Generate a reference summary using the benchmark model\n",
"- Generate summaries from all test models\n",
"- Evaluate each test model's summary against the benchmark"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"TASK = \"Generate a 1-liner summary of the following excerpt from an SEC filing.\"\n",
"\n",
"prompt = f\"\"\"\n",
"ROLE: You are an expert analyst tasked with summarizing SEC filings.\n",
"TASK: {TASK}\n",
"\"\"\""
"Now, we are ready to run our benchmark evaluation. We define a benchmark model and a list of test models and then evaluate each test model's summary against the benchmark. We also print the generated summaries for each model."
]
},
{
@@ -641,7 +610,7 @@
"metadata": {},
"outputs": [],
"source": [
"evals, model_summaries, benchmark_summary = evaluate_summary_models(model_benchmark, models_test, prompt, sec_filing)"
"evals, model_summaries, benchmark_summary = evaluate_summary_models(model_benchmark, models_test, sec_filing)"
]
},
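The cell that defines `model_benchmark` and `models_test` is collapsed in this diff. Given the setup stated earlier (benchmark `gpt-4o`; test models `gpt-4o-mini`, `gpt-4-turbo`, `gpt-3.5-turbo`), a plausible sketch of that setup and of printing the generated summaries follows; the variable `sec_filing` is assumed to hold the filing excerpt loaded earlier in the notebook, and the print formatting is illustrative.

```python
# Illustrative setup; the cell that defines these names is collapsed in the diff.
model_benchmark = "gpt-4o"
models_test = ["gpt-4o-mini", "gpt-4-turbo", "gpt-3.5-turbo"]

evals, model_summaries, benchmark_summary = evaluate_summary_models(
    model_benchmark, models_test, sec_filing
)

# Print the generated summaries for a quick qualitative comparison.
print(f"Benchmark ({model_benchmark}): {benchmark_summary}")
for model, summary in zip(models_test, model_summaries):
    print(f"{model}: {summary}")
```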
{
@@ -696,7 +665,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The benchmark summary from gpt-4o provides a balanced overview of Apple's 10-K filing, focusing on operational status, financial condition, product lines, and regulatory compliance.\n",
"The benchmark summary from `gpt-4o` provides a balanced overview of the analyzed excerpt from Apple's 10-K filing, focusing on operational status, financial condition, product lines, and regulatory compliance.\n",
"\n",
"When comparing our test models against the benchmark, we observe that:\n",
"- `gpt-4o-mini` provides a concise yet comprehensive summary that closely aligns with the benchmark's core message. While it omits product lines, it effectively captures the essential elements of the filing including business operations, risks, and financial condition. Its brevity and focus look (subjectively) similar to our benchmark model.\n",
@@ -705,7 +674,7 @@
"\n",
"- `gpt-3.5-turbo` looks quite different from the benchmark. Its summary, while factually correct, is overly simplified and misses key aspects of the filing. The model captures basic financial information but fails to convey the breadth of operational and compliance details present in the benchmark summary.\n",
"\n",
"Of course, the above evaluation is only based on a single example and is heavily subjective. It's a \"vice check\" on our evaluation results. Now, for a objective analysis, we can use the `visualize_prompt_comparison` function we write below to visualize the performance of our test models across our predefined quantitative metrics.\n"
"Of course, the above evaluation is only based on a single example and is heavily subjective. It's a \"vibe check\" on our evaluation results. Now, for an objective analysis, we can look at the quantitative metrics we have chosen and use the `visualize_prompt_comparison` function we write below to visualize the performance of our test models across our predefined quantitative metrics.\n"
]
},
{
@@ -717,6 +686,27 @@
"```"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"def visualize_prompt_comparison(evaluation_results, model_names):\n",
" \"\"\"\n",
" Create a radar plot comparing different prompt variations\n",
" \n",
" Args:\n",
" evaluation_results (list): List of dictionaries containing evaluation metrics\n",
" model_names (list): List of names for each prompt variation\n",
" \"\"\"\n",
" from evaluate.visualization import radar_plot\n",
" \n",
" # Format data for visualization\n",
" plot = radar_plot(data=evaluation_results, model_names=model_names)\n",
" return plot"
]
},
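The cell that calls `visualize_prompt_comparison` is collapsed below. A typical invocation, assuming `radar_plot` returns a matplotlib figure, might be:

```python
# Illustrative usage; the notebook's actual call cell is collapsed in this diff.
plot = visualize_prompt_comparison(evals, models_test)
plot.show()
```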
{
"cell_type": "code",
"execution_count": 35,
@@ -751,9 +741,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Results demonstrate that tested models perform quite differently on our predefined metrics. The evaluation metrics puts `gpt-4o-mini` as the closest alignment to the benchmark, followed by gpt-4-turbo, and gpt-3.5-turbo showing the largest deviation. This suggests that `gpt-4o-mini` is the best model for this task at least on the metrics we have chosen and for the set of models we have tested.\n",
"Results demonstrate that tested models perform quite differently on our predefined metrics. The evaluation metrics puts `gpt-4o-mini` as the closest aligned to the benchmark, followed by gpt-4-turbo, and gpt-3.5-turbo showing the largest deviation. This suggests that `gpt-4o-mini` is the best model for this task at least on the metrics we have chosen and for the set of models we have tested.\n",
"\n",
"While evaluating language model outputs inherently involves subjective judgment, establishing a high-quality benchmark model and using quantifiable metrics provides a more objective framework for comparing model performance. This approach transforms an otherwise qualitative assessment into a measurable, data-driven evaluation process.\n"
"While evaluating language model outputs inherently involves subjective judgment, establishing a high-quality benchmark model and using quantifiable metrics provide a more objective framework for comparing model performance. This approach transforms an otherwise qualitative assessment into a measurable, data-driven evaluation process.\n"
]
},
{
@@ -764,11 +754,11 @@
"\n",
"While these metrics provide quantifiable measures of performance, they have limitations:\n",
"\n",
"* **Task-specific nature**: Extrinsic quantitative metrics might not fully capture the nuances of complex generative-based tasks, especially those involving subjective human judgment.\n",
"* **Task-specific nature**: Chosen set of metrics might not fully capture the nuances of complex generative-based tasks, especially those involving subjective human judgment.\n",
"* **Sensitivity to data distribution**: Performance on these metrics can be influenced by the specific dataset used for evaluation, which might not represent real-world data distribution.\n",
"* **Inability to assess reasoning or factual accuracy**: These metrics primarily focus on surface-level matching and might not reveal the underlying reasoning process of the LLM or its ability to generate factually correct information.\n",
"\n",
"In conclusion, selecting the appropriate extrinsic metric depends on the specific task, underlying business requirements and desired evaluation granularity. Understanding the limitations of these metrics can provide a more comprehensive assessment of LLM performance in real-world applications.\n",
"In conclusion, selecting an appropriate extrinsic metrics set depends on the specific task, underlying business requirements and desired evaluation granularity. Understanding the limitations of these metrics can provide a more comprehensive assessment of LLM performance in real-world applications.\n",
"\n",
"To address these limitations, alternative approaches like **human-based evaluation** and **model-based evaluation** are often used, which will be discussed in the following sections."
]
5 changes: 4 additions & 1 deletion tamingllms/_build/html/_static/check-solid.svg
6 changes: 5 additions & 1 deletion tamingllms/_build/html/_static/copy-button.svg
2 changes: 1 addition & 1 deletion tamingllms/_build/html/_static/play-solid.svg