PDF Multiple Choice Question Extractor

This Python script extracts multiple-choice questions from PDF files using OpenAI's GPT-4o vision model. It's designed to process Swedish medical exam questions but can be adapted for other languages and subjects.

Features

- Extracts multiple-choice questions from PDF files
- Maintains the original language (Swedish by default)
- Handles special cases such as invalid questions or multiple correct answers
- Saves extracted questions in JSON format

Sample data row:

{
  "language": "sv",
  "country": "Sweden",
  "file_name": "example_exam.pdf",
  "source": "https://www.umu.se/utbildning/sok/kunskapsprov/kunskapsprov-for-lakare/teoretiskt-delprov/",
  "license": "unknown",
  "level": "graduate",
  "category_en": "Medicine",
  "category_original_lang": "Medicin",
  "original_question_num": 1,
  "question": "En 45-årig kvinna söker på vårdcentralen för trötthet och viktuppgång. Hon har också noterat att hon fryser lätt. Vilken av följande laboratorieundersökningar är mest lämplig att beställa initialt?",
  "options": [
    "A. TSH",
    "B. T3",
    "C. T4",
    "D. TPO-antikroppar",
    "E. Kortisol"
  ],
  "answer": "A. TSH"
}

By automating extraction, the script saves significant time and effort compared to transcribing questions manually.
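
A record in the shape of the sample row can be sanity-checked before it goes into a dataset. The following is a minimal sketch and not part of the repository: the function name `validate_record` and the specific checks are assumptions for illustration, based only on the fields visible in the sample row. It also assumes a single answer string, so records with multiple correct answers may need a different check.

```python
# Field names taken from the sample data row; the checks themselves are assumptions.
REQUIRED_FIELDS = {
    "language", "country", "file_name", "source", "license", "level",
    "category_en", "category_original_lang", "original_question_num",
    "question", "options", "answer",
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    options = record.get("options", [])
    if not isinstance(options, list) or len(options) < 2:
        problems.append("options must be a list with at least two entries")
    elif record.get("answer") not in options:
        # The sample row repeats the full option text in "answer".
        problems.append("answer is not one of the options")
    return problems
```

Calling `validate_record` on every extracted record and printing the non-empty results gives a quick list of questions to review by hand.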

Installation

Clone this repository:

git clone https://github.com/serhanylmz/mcq.git
cd mcq

Create and activate a virtual environment (optional but recommended):
```
conda create -n mcq python=3.10
conda activate mcq
```
Install the required packages:
```
pip install -r requirements.txt
```
Set up your OpenAI API key:
- Create a .env file in the project root
- Add your API key: OPENAI_API_KEY=your_api_key_here
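
How the script reads the .env file is not stated here. As a rough illustration of the idea, the sketch below loads `KEY=value` pairs into the process environment using only the standard library. The helper names `load_env_file` and `require_api_key` are hypothetical, and the repository may well use a dedicated package for this instead.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> None:
    """Copy KEY=value pairs from a .env file into os.environ.

    Variables that are already set in the environment are left untouched.
    """
    env_path = Path(path)
    if not env_path.is_file():
        return
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def require_api_key() -> str:
    """Return OPENAI_API_KEY or raise a clear error if it is missing."""
    load_env_file()
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env")
    return api_key
```

Failing early with a clear message is usually easier to debug than an authentication error raised halfway through a batch of PDFs.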

Usage

Use pdf_parser.py to extract questions from PDF files:

python pdf_parser.py -d /path/to/pdf/directory -l language

-d or --dir: Directory containing the PDF files (default is "pdfs")
-l or --lang: Language of the questions (default is "swedish")

Or, to use the default directory and language, run the script without arguments:

python pdf_parser.py

The script will process all PDF files in the specified directory and save the extracted questions as JSON files in the pdfs/mcq subdirectory.
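
The command-line options described above could be declared with `argparse` roughly as follows. This is a sketch under the assumption that pdf_parser.py uses `argparse`; only the flag names and defaults come from this README, while the function name `parse_args` and the help strings are illustrative.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the -d/--dir and -l/--lang options with the documented defaults."""
    parser = argparse.ArgumentParser(
        description="Extract multiple-choice questions from PDF files."
    )
    parser.add_argument(
        "-d", "--dir", default="pdfs",
        help="directory containing the PDF files",
    )
    parser.add_argument(
        "-l", "--lang", default="swedish",
        help="language of the questions",
    )
    # argv=None makes argparse read sys.argv; passing a list eases testing.
    return parser.parse_args(argv)
```

With this layout, `parse_args([])` yields the defaults, which is why running the script with no arguments works.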

Clean the extracted data using pdf_cleaner.py:

# In pdf_cleaner.py
INPUT_FILE = "pdfs/mcq/your_input_file.json"
OUTPUT_FILE = "pdfs/mcq/cleaned_output_file.json"
EXCLUDE_NUMBERS = [90, 91, 94, 118, 125, 128]  # Adjust as needed

# Run the script
python pdf_cleaner.py
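
The cleaning logic itself is not shown here. One plausible reading of `EXCLUDE_NUMBERS` is that records whose `original_question_num` appears in the list are dropped. The sketch below implements that assumption; the function names `drop_excluded` and `clean_file` are hypothetical, and the input is assumed to be a JSON list of records like the sample row.

```python
import json


def drop_excluded(records: list[dict], exclude_numbers: list[int]) -> list[dict]:
    """Keep only records whose original_question_num is not excluded."""
    excluded = set(exclude_numbers)
    return [r for r in records if r.get("original_question_num") not in excluded]


def clean_file(input_file: str, output_file: str, exclude_numbers: list[int]) -> int:
    """Filter a JSON file of records and return how many were removed."""
    with open(input_file, encoding="utf-8") as f:
        records = json.load(f)
    cleaned = drop_excluded(records, exclude_numbers)
    with open(output_file, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as å, ä, ö readable.
        json.dump(cleaned, f, ensure_ascii=False, indent=2)
    return len(records) - len(cleaned)
```

Writing to a separate output file, as the configuration above does, keeps the raw extraction available for comparison.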

Merge cleaned JSON files using merge_json_files.py:

# In merge_json_files.py
INPUT_FOLDER = "checked"
OUTPUT_FILE = "merged_dataset.json"

# Run the script
python merge_json_files.py
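
Merging can be pictured as concatenating the record lists from every JSON file in the input folder. The following is a minimal sketch under that assumption, not the repository's actual code: it assumes each file holds a list of records (a single object is wrapped in a list) and reads files in sorted name order.

```python
import json
from pathlib import Path


def merge_json_files(input_folder: str, output_file: str) -> int:
    """Concatenate all *.json files in input_folder; return the record count."""
    merged = []
    for path in sorted(Path(input_folder).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # Assumption: a file is a list of records; wrap a lone object.
        merged.extend(data if isinstance(data, list) else [data])
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
    return len(merged)
```

Keeping the output file outside the input folder, as in the configuration above, avoids merging a previous result back into itself on the next run.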

Publishing to Hugging Face

To publish the dataset to Hugging Face, use the publish_to_huggingface.py script:

Log in to Hugging Face from the terminal:
```
huggingface-cli login
```
Set up your Hugging Face token as an environment variable:
```
export HF_TOKEN=your_token_here
```

Update the publish_to_huggingface.py script with your details:

INPUT_FILE = "merged_dataset.json"
DATASET_NAME = "swedish-medical-exam-mcqs"
DATASET_DESCRIPTION = "Multiple-choice questions from Swedish medical exams"
YOUR_USERNAME = "your_huggingface_username"

Run the script:
```
python publish_to_huggingface.py
```

After running the script, your dataset will be available on Hugging Face at: https://huggingface.co/datasets/your_username/swedish-medical-exam-mcqs

Loading the Dataset

You can load this dataset using the Hugging Face datasets library:

from datasets import load_dataset
dataset = load_dataset("your_username/swedish-medical-exam-mcqs")
