Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus
This repository contains the experiment code for the paper:
Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus
Seungpil Lee*, Woochang Sim*, Donghyeon Shin*, Wongyu Seo, Jiwon Park, Seokki Lee, Sejin Kim, Sanha Hwang, Sundong Kim
The existing methods for evaluating the inference abilities of Large Language Models (LLMs) have been result-centric, making it difficult to assess the inference process. We introduce a new approach using the Abstraction and Reasoning Corpus (ARC) dataset to evaluate the inference and contextual understanding abilities of large language models in a process-centric manner. ARC demands rigorous logical structures for problem-solving, making it a benchmark that facilitates the comparison of model inference abilities with those of humans. Experimental results confirm that while large language models possess weak inference abilities, they still lag in terms of logical coherence, compositionality, and productivity. Our experiments highlight the reasoning capabilities of LLMs, proposing development paths for achieving human-level reasoning.
- Follow the instructions from Create and deploy an Azure OpenAI Service resource.
- Follow the instructions from Quickstart: Get started using GPT-35-Turbo and GPT-4 with Azure OpenAI Service.
- Set the environment variables (a short verification sketch follows this list):

  ```bash
  export AZURE_OPENAI_API_KEY="REPLACE_WITH_YOUR_KEY_VALUE_HERE"
  export AZURE_OPENAI_ENDPOINT="REPLACE_WITH_YOUR_ENDPOINT_HERE"
  export AZURE_OPENAI_DEPLOYMENT_NAME="REPLACE_WITH_YOUR_DEPLOYMENT_NAME_HERE"
  ```

- Clone this repository and install the required packages:

  ```bash
  git clone https://github.com/GIST-DSLab/ARC_Prompt.git
  cd ARC_Prompt
  pip install -r requirements.txt
  ```

- Follow the Quick Start instructions for each experiment.
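As a quick check that the three environment variables are picked up, the sketch below creates an Azure OpenAI client and sends one test message. This is a minimal example assuming the `openai` Python package (v1.x); the API version is a placeholder, not a value prescribed by this repository.

```python
import os
from openai import AzureOpenAI  # requires the openai>=1.0 package

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01",  # placeholder: use a version your Azure resource supports
)

response = client.chat.completions.create(
    model=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],  # Azure expects the deployment name here
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(response.choices[0].message.content)
```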
Accuracy is measured on 100 randomly sampled ARC tasks with CoT, LtM, and ToT prompts, each repeated 5 times. The value outside the parentheses counts a task as solved when only the final result is correct; the value inside the parentheses requires both the result and the reasoning process to be correct.
Iteration | Chain of thought | Least to Most | Tree of Thoughts |
---|---|---|---|
1 | 11%(3%) | 6%(4%) | 7%(3%) |
2 | 10%(2%) | 7%(4%) | 4%(1%) |
3 | 10%(5%) | 6%(3%) | 7%(2%) |
4 | 10%(4%) | 4%(2%) | 7%(4%) |
5 | 10%(6%) | 5%(2%) | 6%(2%) |
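To make the two readings concrete, the sketch below computes both metrics from per-task records; the `result_correct`/`process_correct` fields and the sample values are hypothetical and do not reflect the repository's actual log format.

```python
# Hypothetical per-task records for one iteration (not the repository's log format).
records = [
    {"task": "t1", "result_correct": True,  "process_correct": True},
    {"task": "t2", "result_correct": True,  "process_correct": False},
    {"task": "t3", "result_correct": False, "process_correct": False},
]

n = len(records)
result_only = sum(r["result_correct"] for r in records) / n            # number outside the parentheses
result_and_process = sum(
    r["result_correct"] and r["process_correct"] for r in records
) / n                                                                  # number inside the parentheses
print(f"{result_only:.0%} ({result_and_process:.0%})")                 # 67% (33%)
```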
We ran five repeated CoT experiments on the 400 tasks of the ARC training set. For the tasks answered correctly at least once, we then augmented 100 problems using re-arc and measured inferential coherence, again repeating the experiment five times. The results are shown in the figure below.
*(Figure: inferential coherence results on the re-arc-augmented tasks.)*
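This README does not spell out how inferential coherence is aggregated; purely as a rough, hypothetical illustration, the sketch below scores each originally solved task by the fraction of its re-arc variants that is also solved and averages over tasks. The data layout and the aggregation are assumptions for illustration, not the paper's definition.

```python
# Hypothetical mapping: original task id -> solved/unsolved flags for its
# re-arc augmented variants (illustrative data only).
variant_results = {
    "task_a": [True, True, False, True],
    "task_b": [False, True, False, False],
}

per_task = {tid: sum(flags) / len(flags) for tid, flags in variant_results.items()}
overall = sum(per_task.values()) / len(per_task)
print(per_task)                  # {'task_a': 0.75, 'task_b': 0.25}
print(f"overall={overall:.2f}")  # overall=0.50
```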
The result of the LLM DSL-understanding experiment is 81%, computed as a weighted average accuracy with the number of tasks at each step as the weight:

$$\text{Accuracy}_{\text{DSL}} = \frac{\sum_{i} n_i \cdot acc_i}{\sum_{i} n_i}$$

In this equation, $n_i$ denotes the number of tasks at step $i$ and $acc_i$ the accuracy measured at that step.
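As a quick numerical illustration of this weighted average (the step counts and accuracies below are made up, not the paper's measurements):

```python
# Hypothetical (step -> (number of tasks, accuracy at that step)) data.
step_results = {1: (40, 0.95), 2: (30, 0.85), 3: (20, 0.70), 4: (10, 0.50)}

total_tasks = sum(n for n, _ in step_results.values())
weighted_accuracy = sum(n * acc for n, acc in step_results.values()) / total_tasks
print(f"{weighted_accuracy:.2%}")  # 82.50% with these made-up numbers
```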
To measure the compositionality of the LLM, experiments were conducted on 158 tasks. The results, based on whether the test output and human description were provided, are shown in the table below.
| | w/o Human Description | w/ Human Description |
|---|---|---|
| w/o Test Output | 2%(5%) | 8%(15%) |
| w/ Test Output | 9%(17%) | 14%(29%) |
The table above reports the average accuracy over 10 repeated experiments for each combination of test output and human description. The values in parentheses are estimates of the accuracy the LLM would reach if it understood the given DSL perfectly; we formulate an equation that corrects the observed accuracy for imperfect DSL understanding, using the 81% DSL-understanding accuracy measured above.
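The exact correction equation is not reproduced in this README; purely as an illustration of the idea, the sketch below rescales each solved task by the measured DSL-understanding accuracy raised to the number of DSL operations its solution needs. The task records, operation counts, and the per-operation form of the correction are assumptions for illustration, not the paper's definition.

```python
P_DSL = 0.81  # measured DSL-understanding accuracy from the previous experiment

# Hypothetical task records: whether the LLM solved the task and how many DSL
# operations the reference solution uses (illustrative values only).
tasks = [
    {"solved": True,  "n_ops": 2},
    {"solved": False, "n_ops": 3},
    {"solved": False, "n_ops": 4},
    {"solved": False, "n_ops": 2},
]

observed = sum(t["solved"] for t in tasks) / len(tasks)
# Credit each solved task as if every DSL call had been understood correctly.
adjusted = sum(t["solved"] / (P_DSL ** t["n_ops"]) for t in tasks) / len(tasks)
print(f"observed={observed:.1%}, estimated-with-perfect-DSL={adjusted:.1%}")
```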
Based on the 160 ARC tasks classified by ConceptARC, we evaluated the validity of a total of 2,973 generated examples.
Problem Category | Total available | Generated examples | Valid augmented examples | Ratio (valid/generated) |
---|---|---|---|---|
Above Below | 58 | 158 | 34 | 21.52% |
Center | 65 | 236 | 35 | 14.83% |
Clean Up | 106 | 183 | 83 | 45.36% |
Complete Shape | 58 | 147 | 37 | 25.17% |
Copy | 27 | 153 | 4 | 2.61% |
Count | 56 | 202 | 29 | 14.36% |
Extend To Boundary | 37 | 167 | 8 | 4.79% |
Extract Objects | 44 | 176 | 21 | 11.93% |
Filled Not Filled | 58 | 203 | 29 | 14.29% |
Horizontal Vertical | 32 | 114 | 7 | 6.14% |
Inside Outside | 52 | 191 | 24 | 12.57% |
Move To Boundary | 36 | 165 | 12 | 7.27% |
Order | 47 | 162 | 26 | 16.05% |
Same Different | 107 | 246 | 76 | 30.89% |
Top Bottom 2D | 92 | 255 | 59 | 23.14% |
Top Bottom 3D | 55 | 215 | 25 | 11.63% |
Total | 930 | 2973 | 509 | 17.12% |
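The ratio column is simply valid divided by generated, per category and overall; the snippet below reproduces a few of the figures from the counts in the table (only three categories are shown for brevity).

```python
# (category, generated, valid) triples copied from the table above (subset shown).
rows = [
    ("Above Below", 158, 34),
    ("Center", 236, 35),
    ("Copy", 153, 4),
]

for name, generated, valid in rows:
    print(f"{name}: {valid / generated:.2%}")   # 21.52%, 14.83%, 2.61%

print(f"Total: {509 / 2973:.2%}")               # 17.12%
```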
If you find this repo useful for your research, please consider citing our paper:
```bibtex
@misc{lee2024reasoning,
      title={Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus},
      author={Seungpil Lee and Woochang Sim and Donghyeon Shin and Sanha Hwang and Wongyu Seo and Jiwon Park and Seokki Lee and Sejin Kim and Sundong Kim},
      year={2024},
      eprint={2403.11793},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```