feat:Eval only support (#238)
* fix

* add more import on __init__.py in dataset package

* cookbook update

* improve local evaluation

* format

* add eval local function

* add comment

* fix unit test

* add

* small optimization

* add subtitle

* fix

* fix thread work pool concurrent bug
Dobiichi-Origami authored Feb 2, 2024
1 parent 7af9a5d commit 05d39fe
Showing 11 changed files with 282 additions and 119 deletions.
2 changes: 1 addition & 1 deletion cookbook/evaluation/data_file/eval_result.json
@@ -1 +1 @@
[{"input_prompt": "地球的自转周期是多久?", "expected_output": "大约24小时", "model_content": [{"accuracy": 0.0, "content": "地球的自转周期约为**23小时56分**。", "llm_tag": "None_None_ERNIE-Bot-turbo"}]}, {"input_prompt": "人类的基本单位是什么?", "expected_output": "人类", "model_content": [{"accuracy": 0.0, "content": "人类的基本单位是**个体**。", "llm_tag": "None_None_ERNIE-Bot-turbo"}]}, {"input_prompt": "太阳系中最大的行星是哪颗?", "expected_output": "木星", "model_content": [{"accuracy": 0.0, "content": "太阳系中最大的行星是木星。\n\n木星是太阳系中最大的行星,其质量为太阳的千分之一,是太阳系中其他七大行星质量总和的2.5倍。木星主要由氢和氦组成,中心温度估计高达30,000℃以上,没有陆地,几乎全部为液态氢分子覆盖着。木星非常巨大,赤道直径为139,822公里,约为地球的11倍。它的体积和巨大的物质构成使其成为太阳系中最有可能存在生命或其他形式的物质结构的行星之一。", "llm_tag": "None_None_ERNIE-Bot-turbo"}]}]
[{"input_prompt": "地球的自转周期是多久?", "expected_output": "大约24小时", "model_content": [{"accuracy": 0.0, "content": "地球的自转周期是**23小时56分**。", "llm_tag": "None_None_ERNIE-Bot-turbo"}]}, {"input_prompt": "人类的基本单位是什么?", "expected_output": "人类", "model_content": [{"accuracy": 0.0, "content": "人类的基本单位是**个体**。", "llm_tag": "None_None_ERNIE-Bot-turbo"}]}, {"input_prompt": "太阳系中最大的行星是哪颗?", "expected_output": "木星", "model_content": [{"accuracy": 0.0, "content": "太阳系中最大的行星是木星。\n\n木星是太阳系中最大的行星,其质量为太阳的千分之一,但却是太阳系中其他七颗行星质量总和的2.5倍。木星是一个气态巨行星,它的直径几乎是地球的11倍,是太阳系中最大的行星,也是四颗“类木行星”之一。它有强烈的磁场和辐射,以及强大的风潮。木星被称为大红斑,是一个长达带风层的旋风,比地球要大上千倍。它有79颗卫星,其中四颗主要的卫星是伽利略卫星,即欧罗巴、木卫一、木卫二和木卫三。在木星的众多卫星中,木卫一是太阳系中唯一已知存在液态物质的行星。\n\n总之,木星是太阳系中最大的行星,也是一颗非常重要的行星,它不仅对太阳系的研究有着重要的意义,也对人类的太空探索有着重要的影响。", "llm_tag": "None_None_ERNIE-Bot-turbo"}]}]
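The updated fixture above stores one JSON record per prompt, with per-model results (`accuracy`, `content`, `llm_tag`) nested under `model_content`. A minimal sketch of aggregating such a file by model tag — the helper name and the tiny English stand-in data are ours, not the SDK's:

```python
import json
from collections import defaultdict

# A tiny sample mirroring the eval_result.json schema above
# (input_prompt / expected_output / model_content with accuracy and llm_tag).
sample = json.loads("""
[{"input_prompt": "q1", "expected_output": "a1",
  "model_content": [{"accuracy": 1.0, "content": "a1", "llm_tag": "model-A"}]},
 {"input_prompt": "q2", "expected_output": "a2",
  "model_content": [{"accuracy": 0.0, "content": "b2", "llm_tag": "model-A"}]}]
""")

def mean_accuracy_by_tag(records):
    """Average the per-entry accuracy for each llm_tag."""
    sums = defaultdict(lambda: [0.0, 0])
    for record in records:
        for result in record["model_content"]:
            acc = sums[result["llm_tag"]]
            acc[0] += result["accuracy"]
            acc[1] += 1
    return {tag: total / count for tag, (total, count) in sums.items()}

print(mean_accuracy_by_tag(sample))  # {'model-A': 0.5}
```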
35 changes: 29 additions & 6 deletions cookbook/evaluation/how_to_use_evaluation.ipynb
@@ -6,11 +6,11 @@
"source": [
"# How to Use the Qianfan Python SDK to Evaluate Models\n",
"\n",
"In version 0.2.8, the Qianfan Python SDK added model evaluation support, letting users launch online evaluation tasks on the platform through the SDK, or run local API-based evaluation, and retrieve the evaluation report locally.\n",
"In version 0.3.0, the Qianfan Python SDK added model evaluation support, letting users launch online evaluation tasks on the platform through the SDK, or run local API-based evaluation, and retrieve the evaluation report locally.\n",
"\n",
"# Preparation\n",
"\n",
"Before starting, please make sure your Qianfan Python SDK has been upgraded to version 0.2.8 or above."
"Before starting, please make sure your Qianfan Python SDK has been upgraded to version 0.3.0 or above."
]
},
{
@@ -23,7 +23,7 @@
},
"outputs": [],
"source": [
"pip install -U \"qianfan>=0.2.8\""
"pip install -U \"qianfan>=0.3.0\""
]
},
{
@@ -227,7 +227,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we need to create an `EvaluationManager` to start the task, passing in the previously created `Model` object and `Dataset` object"
"Then we need to create an `EvaluationManager` to start the task, passing in the previously created `Model` object and `Dataset` object, and calling the `eval` method\n",
"\n",
"The `eval` method returns an `EvaluationResult` object, which internally contains a `Dataset` object (holding each evaluation entry's input and output data, plus per-entry metrics) and a `metrics` dict (holding the overall evaluation metrics)"
]
},
{
@@ -246,7 +248,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the platform's API is still incomplete, users can currently only view the `metrics` information in evaluation results through the SDK, and cannot view per-entry evaluation results."
"Users can process the resulting `EvaluationResult` object however they like, for example by printing it:"
]
},
{
@@ -266,6 +268,25 @@
"print(result.metrics)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or display the resulting `Dataset` object and save it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"eval_ds = result.result_dataset\n",
"print(eval_ds.list())\n",
"\n",
"eval_ds.save(data_file=\"path/to/save/locally\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -328,6 +349,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## OpenCompass Evaluator\n",
"\n",
"The Qianfan Python SDK integrates support for OpenCompass evaluators. If users don't want to implement a `LocalEvaluator` by hand, they can evaluate a dataset directly with an evaluator from opencompass. To use an OpenCompass evaluator, we first need to install the opencompass package"
]
},
@@ -359,7 +382,7 @@
"source": [
"from opencompass.openicl.icl_evaluator import AccEvaluator\n",
"\n",
"from qianfan.evaluation.evaluator import OpenCompassLocalEvaluator\n",
"from qianfan.evaluation.opencompass_evaluator import OpenCompassLocalEvaluator\n",
"\n",
"\n",
"local_evaluators = [OpenCompassLocalEvaluator(open_compass_evaluator=AccEvaluator())]"
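The notebook cells above describe the local evaluation flow: `EvaluationManager.eval` returns an `EvaluationResult` whose `metrics` dict holds overall scores, and whose result dataset supports `list()` and `save(data_file=...)`. A minimal stand-in sketch of that interface using hypothetical stub classes — the real implementations live in the qianfan SDK and may differ:

```python
import json
import os
import tempfile

# Hypothetical stubs sketching the interface the notebook describes;
# they are NOT the SDK's real classes, only the shape of the result object.
class StubDataset:
    def __init__(self, rows):
        self._rows = rows

    def list(self):
        # Per-entry evaluation results
        return self._rows

    def save(self, data_file):
        # Persist the per-entry results as JSON
        with open(data_file, "w", encoding="utf-8") as f:
            json.dump(self._rows, f, ensure_ascii=False)

class StubEvaluationResult:
    def __init__(self, rows, metrics):
        self.result_dataset = StubDataset(rows)
        self.metrics = metrics

result = StubEvaluationResult(
    rows=[{"input_prompt": "q", "accuracy": 1.0}],
    metrics={"accuracy": 1.0},
)
print(result.metrics)                # overall metrics, as in the notebook
print(result.result_dataset.list())  # per-entry results

path = os.path.join(tempfile.mkdtemp(), "eval_result.json")
result.result_dataset.save(data_file=path)
```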
3 changes: 3 additions & 0 deletions python/qianfan/dataset/consts.py
@@ -41,3 +41,6 @@

# Column name for first-token latency in batch inference results
FirstTokenLatencyColumnName = "first_token_latency"

# Column name used to tag results from different models in batch inference results
LLMTagColumnName = "llm_tag"
42 changes: 32 additions & 10 deletions python/qianfan/dataset/dataset.py
@@ -97,6 +97,8 @@ def __init__(
inner_schema_cache: Optional[Schema] = None,
input_columns: Optional[List[str]] = None,
reference_column: Optional[str] = None,
eval_input_column: Optional[str] = None,
eval_llm_output_column: Optional[str] = None,
**kwargs: Any,
) -> None:
"""
@@ -113,6 +115,10 @@
which columns should be extracted as inputs
reference_column (Optional[str]):
which column should be extracted as reference
eval_input_column (Optional[str]):
evaluation input column name in the dataset
eval_llm_output_column (Optional[str]):
LLM output column name in the dataset, used for evaluation
**kwargs (Any):
optional arguments
"""
@@ -124,12 +130,18 @@
# Cache of the Schema object, specified at load time
self.inner_schema_cache: Optional[Schema] = inner_schema_cache

# List of input column names
# List of input column names for batch inference
self.input_columns = input_columns

# Column name of the expected-result column
# Column name of the expected-result column for batch inference and evaluation
self.reference_column = reference_column

# Column name of the evaluation input column when running evaluation only
self.eval_input_column = eval_input_column

# Column name of the LLM answer column for evaluation when running evaluation only
self.eval_llm_output_column = eval_llm_output_column

@classmethod
def _from_source(
cls,
@@ -777,7 +789,7 @@ def add_default_group_column(self) -> Self:
return self

return self.col_append(
{"name": QianfanDataGroupColumnName, "data": list(range(self.row_number()))}
{QianfanDataGroupColumnName: list(range(self.row_number()))}
)

@_online_except_decorator
@@ -1002,7 +1014,7 @@ def col_append(self, elem: Any) -> Self:
Args:
elem (Dict[str, List]): a dict containing element added to dataset, which
must has column name "name" and column data list "data"
key as column name, value as column data
Returns:
Self: Dataset itself
"""
@@ -1243,7 +1255,13 @@ def _batch_inference_on_model(
}
)

return Dataset.create_from_pyobj(new_list)
return Dataset.create_from_pyobj(
new_list,
input_columns=["prompt"],
reference_column=OldReferenceColumnName,
eval_input_column=NewInputPromptColumnName,
eval_llm_output_column=LLMOutputColumnName,
)

def _get_completion_return_dataset(
self,
@@ -1258,8 +1276,11 @@ def _get_completion_return_dataset(
NewInputPromptColumnName: input_str_list,
LLMOutputColumnName: output_list,
}

reference_column: Optional[str] = None
if self.reference_column:
table_dict[OldReferenceColumnName] = self.get_reference_data
reference_column = OldReferenceColumnName

if does_show_latency:
if len(first_token_latency_list) != 0:
@@ -1269,7 +1290,9 @@
return Dataset.create_from_pyobj(
table_dict,
input_columns=self.input_columns,
reference_column=OldReferenceColumnName,
reference_column=reference_column,
eval_input_column=NewInputPromptColumnName,
eval_llm_output_column=LLMOutputColumnName,
)

def _get_chat_return_dataset(
@@ -1305,7 +1328,9 @@
return Dataset.create_from_pyobj(
table_dict,
input_columns=[NewInputChatColumnName],
reference_column=LLMOutputColumnName,
reference_column=OldReferenceColumnName,
eval_input_column=NewInputChatColumnName,
eval_llm_output_column=LLMOutputColumnName,
)
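The change in `_get_completion_return_dataset` above only attaches the reference column (and sets `reference_column`) when the source dataset actually carries reference data, instead of unconditionally pointing at a column that may not exist. A small self-contained sketch of that pattern — the column names here are illustrative, not the SDK's real constants:

```python
from typing import Optional

# Sketch: build the batch-inference return table, including the reference
# column only when reference data is present (mirrors the fix above).
def build_result_table(prompts, outputs, references=None):
    table = {"input_prompt": prompts, "llm_output": outputs}
    reference_column: Optional[str] = None
    if references is not None:
        table["expected_output"] = references
        reference_column = "expected_output"
    return table, reference_column

table, ref = build_result_table(["q"], ["a"])
print(ref)  # None — no reference data, so no reference column is set
table, ref = build_result_table(["q"], ["a"], ["gold"])
print(ref)  # expected_output
```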

def _batch_inference_on_service(
@@ -1545,9 +1570,6 @@ def _get_input_chat_list(
input_column = self.input_columns[0]

dataset = deepcopy(self)
if not dataset.is_dataset_grouped() and not dataset.is_dataset_packed():
dataset.add_default_group_column()

if dataset.is_dataset_grouped():
dataset.pack()

2 changes: 1 addition & 1 deletion python/qianfan/dataset/dataset_utils.py
@@ -350,7 +350,7 @@ def _start_an_evaluation_task_for_model_batch_inference(
**kwargs,
).body

eval_id = resp["result"]["evalId"]
eval_id = resp["result"]["evalIdStr"]

log_debug(f"create evaluation task in model response: {resp}")
result_dataset_id = _wait_evaluation_finished(eval_id)
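The one-line fix above switches from the numeric `evalId` field to `evalIdStr` when reading the create-evaluation response. A hedged, hypothetical helper — not part of the SDK — that tolerates either response shape by preferring the string id and falling back to the legacy one:

```python
# Hypothetical defensive parser: prefer "evalIdStr", fall back to "evalId",
# and always normalize to str so downstream code sees one type.
def extract_eval_id(resp: dict) -> str:
    result = resp.get("result", {})
    eval_id = result.get("evalIdStr") or result.get("evalId")
    if eval_id is None:
        raise KeyError("response carries neither evalIdStr nor evalId")
    return str(eval_id)

print(extract_eval_id({"result": {"evalIdStr": "am-abc123"}}))  # am-abc123
print(extract_eval_id({"result": {"evalId": 42}}))              # 42
```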
66 changes: 66 additions & 0 deletions python/qianfan/dataset/schema.py
@@ -19,6 +19,12 @@
from abc import ABC, abstractmethod
from typing import Any, Callable

from qianfan.dataset.consts import (
LLMOutputColumnName,
NewInputChatColumnName,
NewInputPromptColumnName,
OldReferenceColumnName,
)
from qianfan.dataset.table import Table
from qianfan.utils import log_error, log_info

@@ -263,3 +269,63 @@ def validate(self, table: Table) -> bool:
bool: whether table is valid
"""
return False


class EvaluationSchema(Schema):
"""validator for evaluation used"""

def validate(self, table: Table) -> bool:
"""
validate a table
Args:
table (Table): table need to be validated
Returns:
bool: whether table is valid
"""
if len(table) == 0:
log_error("table is empty")
return False

col_names = table.col_names()

for column in [OldReferenceColumnName, LLMOutputColumnName]:
if column not in col_names:
log_error(f"{column} not in dataset columns")
return False

if (
NewInputPromptColumnName in col_names
and NewInputChatColumnName in col_names
):
log_error(
f"can't have both {NewInputChatColumnName} and"
f" {NewInputPromptColumnName} simultaneously"
)
return False

if NewInputPromptColumnName in col_names:
elem_type = table[0][NewInputPromptColumnName]
if not isinstance(elem_type, str):
log_error(
f"element in column {NewInputPromptColumnName} isn't str, but"
f" {type(elem_type)}"
)
return False
return True

if NewInputChatColumnName in col_names:
elem_type = table[0][NewInputChatColumnName]
if not isinstance(elem_type, str):
log_error(
f"element in column {NewInputChatColumnName} isn't str, but"
f" {type(elem_type)}"
)
return False
return True

log_error(
f"neither {NewInputChatColumnName} nor {NewInputPromptColumnName} found"
)
return False
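The new `EvaluationSchema` above enforces: a non-empty table, the reference and LLM-output columns present, and exactly one of the prompt/chat input columns, holding strings. A simplified sketch of those rules over a plain list-of-dicts table — the column-name strings here are illustrative stand-ins for the SDK's constants, and the real validator works on the SDK's `Table` type:

```python
# Illustrative column names (assumptions, not the SDK's real constants)
REFERENCE, LLM_OUTPUT = "expected_output", "llm_output"
PROMPT_INPUT, CHAT_INPUT = "input_prompt", "input_chats"

def validate_eval_table(rows):
    """Simplified EvaluationSchema.validate over a list of dicts."""
    if not rows:
        return False  # empty table
    cols = rows[0].keys()
    if REFERENCE not in cols or LLM_OUTPUT not in cols:
        return False  # reference and llm output are both required
    has_prompt, has_chat = PROMPT_INPUT in cols, CHAT_INPUT in cols
    if has_prompt == has_chat:
        return False  # both input columns present, or both missing
    input_col = PROMPT_INPUT if has_prompt else CHAT_INPUT
    return isinstance(rows[0][input_col], str)

ok = [{"input_prompt": "q", "expected_output": "a", "llm_output": "b"}]
print(validate_eval_table(ok))  # True
print(validate_eval_table([]))  # False
```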
47 changes: 21 additions & 26 deletions python/qianfan/dataset/table.py
@@ -406,33 +406,30 @@ def append(self, elem: Dict[str, List]) -> Self:
Args:
elem (Dict[str, List]): dict containing element added to pyarrow table
must has column name "name" and column data list "data"
key as column name, value as column data
Returns:
Self: a new pyarrow table
"""

if not isinstance(elem, dict):
raise ValueError(f"element appended must be dict, not {type(elem)}")
if "name" not in elem:
raise ValueError("no name has been provided")
if "data" not in elem:
raise ValueError("no data has been provided")
if not isinstance(elem["name"], str):
raise TypeError(f"name isn't str, rather than {type(elem['name'])}")
if elem["name"] in self.table.column_names:
raise ValueError(
f"column name {elem['name']} has been in dataset column list"
)
if not isinstance(elem["data"], list):
raise TypeError(f"data isn't list, rather than {type(elem['data'])}")
if not elem["data"]:
raise ValueError("data can't be empty")
if len(elem["data"]) != self.table.num_rows:
raise ValueError(
f"the length of data need to be {self.table.num_rows}, rather than"
f" {len(elem['data'])}"
)
return self.table.append_column(elem["name"], [elem["data"]])

for name, data in elem.items():
if name in self.table.column_names:
raise ValueError(f"column name {name} has been in dataset column list")

if not isinstance(data, list):
raise TypeError(f"data isn't a list, but {type(data)}")

if len(data) != self.table.num_rows:
raise ValueError(
f"the length of data need to be {self.table.num_rows}, rather than"
f" {len(data)}"
)

self.table = self.table.append_column(name, [data])

return self.table
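The rewritten `append` above drops the `{"name": ..., "data": [...]}` envelope in favor of a plain column-name → values mapping, and can now add several columns in one call. A pyarrow-free sketch of that contract and its checks, using a dict-of-lists in place of a pyarrow table — an assumption made so the example stays self-contained:

```python
# Sketch of the new column-append contract (mirrors the checks above):
# reject duplicate column names, non-list data, and length mismatches.
def col_append(table: dict, columns: dict) -> dict:
    num_rows = len(next(iter(table.values()))) if table else 0
    for name, data in columns.items():
        if name in table:
            raise ValueError(f"column name {name} has been in dataset column list")
        if not isinstance(data, list):
            raise TypeError(f"data isn't a list, but {type(data)}")
        if len(data) != num_rows:
            raise ValueError(
                f"the length of data needs to be {num_rows}, not {len(data)}"
            )
        table[name] = data
    return table

t = {"prompt": ["q1", "q2"]}
col_append(t, {"llm_tag": ["model-A", "model-A"], "accuracy": [1.0, 0.0]})
print(sorted(t))  # ['accuracy', 'llm_tag', 'prompt']
```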

def insert(self, elem: Dict[str, List], index: int) -> Self:
"""
@@ -484,7 +481,7 @@ def list(
if isinstance(indices[0], str) and not set(indices).issubset(
set(self.table.column_names)
):
raise ValueError("contain not existed column name")
raise ValueError(f"contain not existed column name: {indices}")
return self.table.select(list(indices)).to_pydict()

def map(self, op: Callable[[Any], Any]) -> Self:
@@ -610,9 +607,7 @@ def _squash_group_number(self) -> None:
new_group_column_list.append(current_group_number)

self.col_delete(QianfanDataGroupColumnName)
self.col_append(
{"name": QianfanDataGroupColumnName, "data": new_group_column_list}
)
self.col_append({QianfanDataGroupColumnName: new_group_column_list})

return

@@ -906,7 +901,7 @@ def col_append(self, elem: Any) -> Self:
Args:
elem (Dict[str, List]): dict containing element added to pyarrow table
must has column name "name" and column data list "data"
key as column name, value as column data
Returns:
Self: Table itself
"""