Official implementation of "Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs", accepted at ACL 2024 Findings.
To evaluate not only the final answers but also the intermediate steps of LLMs' CoT reasoning in multi-hop question answering, the paper proposes two evaluation modules:
- Discriminative: assess LLMs' knowledge of reasoning
- Generative: assess the accuracy of the generated CoT by utilizing knowledge graphs (KGs).
In addition, we conduct ablation studies with a fine-grained evaluation of CoT generation, computing edit-distance scores and reasoning errors.
conda create --name llm-reasoning-cert python=3.8
conda activate llm-reasoning-cert
pip install -r requirements.txt
The paper uses two datasets, CWQ and GrailQA, as the initial datasets for the experiments.
Subgraphs and ground-truth reasoning paths are then extracted based on SPARQL.
The final datasets used in the paper are uploaded to HuggingFace: (Note: update later)
Aim: create subgraphs for querying ground-truth reasoning paths and for building the VectorDB.
Code is at ./preprocess_data
- Create the subgraph from the raw subgraph following the detailed instructions in the preprocessing README.
- Get the ground-truth reasoning path from the subgraph, answer entities, and topic entities:
python ./preprocess_data/ground_truth_paths.py
- Group questions by the number of edges in the ground-truth reasoning path:
python ./preprocess_data/splitted_ground_truth_paths.py
We only use questions whose corresponding reasoning paths have >= 2 hops.
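For intuition, here is a minimal sketch of this step, not the actual ./preprocess_data/ground_truth_paths.py implementation: it recovers a reasoning path as a shortest path from a topic entity to an answer entity in the extracted subgraph. The function name and the (head, relation, tail) triplet format are illustrative assumptions.

```python
# Minimal sketch (not the actual ground_truth_paths.py logic): recover a
# ground-truth reasoning path as a shortest path inside the subgraph.
import networkx as nx

def ground_truth_path(triplets, topic_entity, answer_entity):
    """triplets: list of (head, relation, tail) tuples forming the subgraph."""
    graph = nx.DiGraph()
    for head, relation, tail in triplets:
        graph.add_edge(head, tail, relation=relation)
    if not nx.has_path(graph, topic_entity, answer_entity):
        return None
    nodes = nx.shortest_path(graph, topic_entity, answer_entity)
    # Rebuild the path as (head, relation, tail) triplets.
    return [(h, graph[h][t]["relation"], t) for h, t in zip(nodes, nodes[1:])]

path = ground_truth_path(
    [("Obama", "born_in", "Honolulu"), ("Honolulu", "located_in", "Hawaii")],
    "Obama", "Hawaii",
)
print(path)  # [('Obama', 'born_in', 'Honolulu'), ('Honolulu', 'located_in', 'Hawaii')]
```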
FAISS and sentence-transformers/all-mpnet-base-v2 are used to create the VectorDB before retrieval.
DATASET='cwq' # or 'grail_qa'
sbatch scripts/gen-cert/extract_triplet.sh $DATASET
You can set up additional arguments:
- embed_model_name. Default is sentence-transformers/all-mpnet-base-v2
- top_k. Default is 10
- device. Default is cpu
Note: remember to set them again in ./generative-cert.py#L228
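For reference, a minimal sketch of this VectorDB step, assuming the subgraph triplets have already been verbalized into strings; the example texts and variable names are illustrative, and the real pipeline is scripts/gen-cert/extract_triplet.sh.

```python
# Minimal sketch of building and querying the FAISS VectorDB over verbalized triplets.
import faiss
from sentence_transformers import SentenceTransformer

embed_model_name = "sentence-transformers/all-mpnet-base-v2"  # default embed_model_name
top_k = 10                                                    # default top_k
device = "cpu"                                                # default device

triplet_texts = ["Obama born_in Honolulu", "Honolulu located_in Hawaii"]  # illustrative

model = SentenceTransformer(embed_model_name, device=device)
embeddings = model.encode(triplet_texts, convert_to_numpy=True, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on normalized vectors
index.add(embeddings)

query = model.encode(["Where was Obama born?"], convert_to_numpy=True, normalize_embeddings=True)
scores, ids = index.search(query, top_k)
print([triplet_texts[i] for i in ids[0] if i != -1])
```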
Download the data here.
- Negative generation model with three modes:
  - replace: replace the entities in the reasoning paths.
  - reorder: reorder the reasoning paths.
  - mislead: generate reasoning paths leading to incorrect answers.
- Code:
# 1. Generate subgraph for mislead paths
python preprocess_data/subgraph_discriminative_cert.py
# 2. Generate negative paths:
## - CWQ dataset
python gen_negative.py --data_path data/cwq_test_res.csv --kg_path data/cwq_test.jsonl --mode {'mislead', 'reorder', 'replace'}
## - GrailQA dataset
python gen_negative.py --data_path data/multi_hop_grailqa.csv --kg_path data/grail_w_kg.jsonl --mode {'mislead', 'reorder', 'replace'}
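To illustrate what a negative path looks like, here is a minimal sketch of the replace mode only; the reorder and mislead modes, as well as the actual data formats, are implemented in gen_negative.py, and the helper name and path format below are assumptions.

```python
# Minimal sketch of the "replace" perturbation: swap one entity in a
# (head, relation, tail) reasoning path with a random entity from the KG.
import random

def replace_negative(path, kg_entities, seed=0):
    rng = random.Random(seed)
    path = [list(triplet) for triplet in path]
    step = rng.randrange(len(path))
    position = rng.choice([0, 2])  # replace either the head or the tail entity
    original = path[step][position]
    candidates = [e for e in kg_entities if e != original]
    path[step][position] = rng.choice(candidates)
    return [tuple(triplet) for triplet in path]

positive = [("Obama", "born_in", "Honolulu"), ("Honolulu", "located_in", "Hawaii")]
print(replace_negative(positive, ["Paris", "Texas", "Chicago"]))
```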
Set your OpenAI API key and Hugging Face key (if needed) in .env (see .env.example for an example).
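A minimal sketch of loading these keys with python-dotenv; the variable names OPENAI_API_KEY and HF_TOKEN are assumptions, so check .env.example for the names the scripts actually expect.

```python
# Minimal sketch of reading the API keys from .env (variable names are assumptions).
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
openai_api_key = os.environ.get("OPENAI_API_KEY")
hf_token = os.environ.get("HF_TOKEN")
```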
- Evaluation for ground-truth reasoning paths
sh scripts/disc-cert/submit_discriminative_cert.sh
- Evaluation for generated negative reasoning paths
sh scripts/disc-cert/submit_discriminative_cert_neg.sh
- Get results
python scripts/disc-cert/summary_results.py
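Conceptually, the discriminative evaluation asks the LLM to verify a reasoning path. Below is a minimal sketch of how such a check might be phrased; the actual prompt templates and model calls live in scripts/disc-cert, and call_llm is a hypothetical helper.

```python
# Minimal sketch of a discriminative check: ask the LLM whether a reasoning
# path is valid for the question and parse a yes/no verdict.
def build_discriminative_prompt(question, path):
    steps = "\n".join(f"{h} --{r}--> {t}" for h, r, t in path)
    return (
        f"Question: {question}\n"
        f"Reasoning path:\n{steps}\n"
        "Is this reasoning path a valid way to answer the question? Answer Yes or No."
    )

def discriminative_verdict(question, path, call_llm):
    # call_llm is a hypothetical helper that sends a prompt to an LLM and returns its text reply.
    answer = call_llm(build_discriminative_prompt(question, path))
    return answer.strip().lower().startswith("yes")
```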
- ChatGPT
sh scripts/gen-cert/llm_prompting.sh
- HF models: Llama2 7B/13B/70B chat-hf, Mistral-7B-Instruct-v0.1, Qwen-14B-Chat, Vicuna-33b-v1.3
sh generative_cert/scripts/fitcluster/script.sh
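As a rough illustration of the generative step, a minimal sketch of CoT prompting with one of the listed HF models via the transformers pipeline; the prompt wording and decoding settings are illustrative, not the ones used in generative_cert/scripts.

```python
# Minimal sketch of generative CoT prompting with an HF model.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.1",  # any of the listed HF models
    device_map="auto",
)
prompt = (
    "Answer the question step by step, then give the final answer.\n"
    "Question: In which US state was Barack Obama born?\n"
    "Let's think step by step:"
)
output = generator(prompt, max_new_tokens=256, do_sample=False)
print(output[0]["generated_text"])
```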
- Main result
sh scripts/gen-cert/job_eval_llm.sh
- The fine-grained generative evaluation: edit-distance score
sh scripts/gen-cert/job_eval_llm_finegrained.sh
python finegrained_analysis.py
- Run the analysis for reasoning errors
python finegrained_analysis.py
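For intuition, a minimal sketch of the edit-distance idea: Levenshtein distance between the generated CoT, treated as a sequence of reasoning steps, and the ground-truth reasoning path. The actual metric and reasoning-error analysis are implemented in finegrained_analysis.py; the step representation below is an assumption.

```python
# Minimal sketch: Levenshtein edit distance between a generated step sequence
# and the ground-truth reasoning path.
def edit_distance(generated_steps, ground_truth_steps):
    m, n = len(generated_steps), len(ground_truth_steps)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if generated_steps[i - 1] == ground_truth_steps[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a generated step
                           dp[i][j - 1] + 1,        # insert a missing step
                           dp[i - 1][j - 1] + cost)  # substitute a wrong step
    return dp[m][n]

print(edit_distance(["Obama born_in Honolulu"],
                    ["Obama born_in Honolulu", "Honolulu located_in Hawaii"]))  # 1
```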
If you find this paper or the repo useful for your work, please consider citing the paper:
@misc{nguyen2024direct,
title={Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs},
author={Minh-Vuong Nguyen and Linhao Luo and Fatemeh Shiri and Dinh Phung and Yuan-Fang Li and Thuy-Trang Vu and Gholamreza Haffari},
year={2024},
eprint={2402.11199},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
or
@inproceedings{nguyen-etal-2024-direct,
title = "Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs",
author = "Nguyen, Thi and
Luo, Linhao and
Shiri, Fatemeh and
Phung, Dinh and
Li, Yuan-Fang and
Vu, Thuy-Trang and
Haffari, Gholamreza",
editor = "Ku, Lun-Wei and
Martins, Andre and
Srikumar, Vivek",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-acl.168/",
doi = "10.18653/v1/2024.findings-acl.168",
pages = "2862--2883",
abstract = "Large language models (LLMs) have demonstrated strong reasoning abilities when prompted to generate chain-of-thought (CoT) explanations alongside answers. However, previous research on evaluating LLMs has solely focused on answer accuracy, neglecting the correctness of the generated CoT. In this paper, we delve deeper into the CoT reasoning capabilities of LLMs in multi-hop question answering by utilizing knowledge graphs (KGs). We propose a novel discriminative and generative CoT evaluation paradigm to assess LLMs' knowledge of reasoning and the accuracy of the generated CoT. Through experiments conducted on 5 different families of LLMs across 2 multi-hop question-answering datasets, we find that LLMs possess sufficient knowledge to perform reasoning. However, there exists a significant disparity between answer accuracy and faithfulness of the CoT generated by LLMs, indicating that they often arrive at correct answers through incorrect reasoning."
}