
Direct Evaluation of CoT in Multi-hop Reasoning with Knowledge Graphs

Official Implementation of "Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs".

The paper has been accepted at ACL 2024 Findings.

Aiming to evaluate not only the final answers but also the intermediate steps of LLMs' CoT reasoning in multi-hop question answering, the paper proposes two evaluation modules:

  1. Discriminative: assesses LLMs' knowledge of reasoning.
  2. Generative: assesses the accuracy of the generated CoT by utilizing knowledge graphs (KGs).

In addition, we run ablation studies on fine-grained CoT generation to compute edit-distance and reasoning-error metrics.

Requirements

conda create --name llm-reasoning-cert python=3.8
conda activate llm-reasoning-cert
pip install -r requirements.txt

Datasets

The paper uses two datasets, CWQ and GrailQA, as the starting datasets for the experiments.

From them, we extract subgraphs and ground-truth reasoning paths based on SPARQL queries.

The final datasets used in the paper are uploaded to Hugging Face (note: update later):

  1. CWQ-Subgraph-Eval
  2. GrailQA-Subgraph-Eval

Preprocessing for each dataset:

Aim: create subgraphs for querying the ground-truth reasoning paths and building the VectorDB.

Create subgraphs

Code at ./preprocess_data

  1. Create the subgraphs from the raw subgraphs following the detailed implementation in the preprocessing readme.
  2. Get the ground-truth reasoning paths from the subgraph, answer entities, and topic entities:
python ./preprocess_data/ground_truth_paths.py
  3. Rearrange the questions according to the number of edges in the ground-truth reasoning path:
python ./preprocess_data/splitted_ground_truth_paths.py

We only use questions whose ground-truth reasoning paths contain >= 2 hops.
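For reference, below is a minimal sketch of this hop-based filtering, assuming each sample stores its ground-truth path as a list of (head, relation, tail) triples; the file and field names are illustrative, not the exact output layout of the scripts above.

# Hypothetical illustration: keep only questions whose ground-truth
# reasoning path has at least two edges, grouped by hop count.
import json
from collections import defaultdict

def split_by_hops(path="ground_truth_paths.jsonl", min_hops=2):
    buckets = defaultdict(list)
    with open(path) as f:
        for line in f:
            sample = json.loads(line)
            n_hops = len(sample["ground_truth_path"])  # assumed field name
            if n_hops >= min_hops:
                buckets[n_hops].append(sample)
    return buckets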

Create VectorDB

FAISS and sentence-transformers/all-mpnet-base-v2 are used to build the VectorDB before retrieval:

DATASET='cwq' # or 'grail_qa'
sbatch scripts/gen-cert/extract_triplet.sh $DATASET

You can set up additional arguments:

  • embed_model_name. Default is sentence-transformers/all-mpnet-base-v2
  • top_k. Default is 10
  • device. Default is cpu

Note: remember to set them again in ./generative-cert.py#L228.
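As a rough illustration of what extract_triplet.sh sets up, the sketch below embeds verbalized KG triples with sentence-transformers/all-mpnet-base-v2 and indexes them with FAISS for top-k retrieval; the triple verbalization and variable names are assumptions, not the repo's exact code.

# Minimal sketch (assumed format): build a FAISS index over verbalized triples.
import faiss
from sentence_transformers import SentenceTransformer

triples = [("Barack Obama", "place_of_birth", "Honolulu")]  # toy example
texts = [f"{h} {r.replace('_', ' ')} {t}" for h, r, t in triples]

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2", device="cpu")
embeddings = model.encode(texts, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product on normalized vectors = cosine
index.add(embeddings)

query = model.encode(["Where was Barack Obama born?"], normalize_embeddings=True)
scores, ids = index.search(query, min(10, len(texts)))  # top_k retrieval
print([texts[i] for i in ids[0]])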

Data for Discriminative

Download the data here.

Generate negative reasoning paths

  • Negative path generation modes:
    1. replace: replace the entities in the reasoning paths.
    2. reorder: reorder the steps of the reasoning paths.
    3. mislead: generate reasoning paths that lead to incorrect answers.
  • Code:
# 1. Generate subgraphs for misleading paths
python preprocess_data/subgraph_discriminative_cert.py
# 2. Generate negative paths:
## - CWQ dataset
python gen_negative.py --data_path data/cwq_test_res.csv --kg_path data/cwq_test.jsonl --mode {'mislead', 'reorder', 'replace'}
## - GrailQA dataset
python gen_negative.py --data_path data/multi_hop_grailqa.csv --kg_path data/grail_w_kg.jsonl --mode {'mislead', 'reorder', 'replace'}
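gen_negative.py implements these perturbations; the snippet below is only a hedged sketch of how the replace and reorder modes could look, assuming a reasoning path is a list of (head, relation, tail) triples.

# Illustrative sketch of two negative-path perturbations (not the repo's exact logic).
import random

def reorder_path(path):
    # Shuffle the triples so the steps no longer follow the reasoning chain.
    shuffled = path[:]
    while len(path) > 1 and shuffled == path:
        random.shuffle(shuffled)
    return shuffled

def replace_entities(path, candidate_entities):
    # Swap tail entities with random entities drawn from the KG.
    return [(h, r, random.choice(candidate_entities)) for h, r, t in path]

path = [("A", "founded", "B"), ("B", "located_in", "C")]
print(reorder_path(path))
print(replace_entities(path, ["X", "Y", "Z"]))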

Framework

Set your OpenAI API key and Hugging Face key (if needed) in .env (see .env.example for an example).

Discriminative Mode

  • Evaluation for ground-truth reasoning paths
    sh scripts/disc-cert/submit_discriminative_cert.sh
  • Evaluation for generated negative reasoning paths
    sh scripts/disc-cert/submit_discriminative_cert_neg.sh
  • Get results
python scripts/disc-cert/summary_results.py
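The scripts above ask the LLM to judge whether a given reasoning path validly answers the question. The real prompt templates live in the disc-cert scripts; the function below is only a hypothetical illustration of that kind of query.

# Hypothetical discriminative-evaluation prompt; see the disc-cert scripts for the actual templates.
def build_discriminative_prompt(question, path):
    steps = "\n".join(f"{i + 1}. ({h}, {r}, {t})" for i, (h, r, t) in enumerate(path))
    return (
        f"Question: {question}\n"
        f"Candidate reasoning path:\n{steps}\n"
        "Is this reasoning path a valid way to answer the question? Answer 'yes' or 'no'."
    )

print(build_discriminative_prompt(
    "What country is the birthplace of Barack Obama located in?",
    [("Barack Obama", "place_of_birth", "Honolulu"), ("Honolulu", "country", "United States")],
))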

Generative Mode

Stage 1: LLM prompting for structured answers

  1. ChatGPT
sh scripts/gen-cert/llm_prompting.sh
  2. HF models: Llama2 7B/13B/70B chat-hf, Mistral-7B-Instruct-v0.1, Qwen-14B-Chat, Vicuna-33b-v1.3
sh generative_cert/scripts/fitcluster/script.sh

Stage 2 & 3: Retrieval & Evaluation

  1. Main result
sh scripts/gen-cert/job_eval_llm.sh
  2. The fine-grained generative evaluation: edit-distance score
sh scripts/gen-cert/job_eval_llm_finegrained.sh
python finegrained_analysis.py
  3. Run the analysis for reasoning errors
python finegrained_analysis.py
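For the fine-grained evaluation, the edit distance between a generated reasoning path and the ground-truth path can be computed roughly as below (a Levenshtein-style sketch over triple sequences; finegrained_analysis.py may normalize and match triples differently).

# Sketch: Levenshtein edit distance between two reasoning paths,
# treating each (head, relation, tail) triple as one token.
def path_edit_distance(pred_path, gold_path):
    m, n = len(pred_path), len(gold_path)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred_path[i - 1] == gold_path[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # drop a predicted step
                           dp[i][j - 1] + 1,          # insert a missing step
                           dp[i - 1][j - 1] + cost)   # substitute a step
    return dp[m][n]

gold = [("A", "r1", "B"), ("B", "r2", "C")]
pred = [("A", "r1", "B"), ("B", "r_wrong", "D")]
print(path_edit_distance(pred, gold))  # -> 1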

Results


Citation

If you find this paper or the repo useful for your work, please consider citing the paper:

@misc{nguyen2024direct,
    title={Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs},
    author={Minh-Vuong Nguyen and Linhao Luo and Fatemeh Shiri and Dinh Phung and Yuan-Fang Li and Thuy-Trang Vu and Gholamreza Haffari},
    year={2024},
    eprint={2402.11199},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

or

@inproceedings{nguyen-etal-2024-direct,
    title = "Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs",
    author = "Nguyen, Thi  and
      Luo, Linhao  and
      Shiri, Fatemeh  and
      Phung, Dinh  and
      Li, Yuan-Fang  and
      Vu, Thuy-Trang  and
      Haffari, Gholamreza",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.168/",
    doi = "10.18653/v1/2024.findings-acl.168",
    pages = "2862--2883",
    abstract = "Large language models (LLMs) have demonstrated strong reasoning abilities when prompted to generate chain-of-thought (CoT) explanations alongside answers. However, previous research on evaluating LLMs has solely focused on answer accuracy, neglecting the correctness of the generated CoT. In this paper, we delve deeper into the CoT reasoning capabilities of LLMs in multi-hop question answering by utilizing knowledge graphs (KGs). We propose a novel discriminative and generative CoT evaluation paradigm to assess LLMs' knowledge of reasoning and the accuracy of the generated CoT. Through experiments conducted on 5 different families of LLMs across 2 multi-hop question-answering datasets, we find that LLMs possess sufficient knowledge to perform reasoning. However, there exists a significant disparity between answer accuracy and faithfulness of the CoT generated by LLMs, indicating that they often arrive at correct answers through incorrect reasoning."
}
