[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
benchmark regex reliability evaluation dataset gpt phi large-language-models llm open-compass chatglm qwen lm-evaluation llm-as-a-judge llm-as-evaluator xfinder reliable-evaluation key-answer-extraction judge-model
Updated Jan 23, 2025 - Python