Skip to content

Latest commit

 

History

History
122 lines (99 loc) · 3.52 KB

README.md

File metadata and controls

122 lines (99 loc) · 3.52 KB

pdf2md-gpt

This tool converts PDF documents to Markdown format using GPT-4o-mini's vision capabilities. Each page of the PDF is processed individually and converted to markdown, then combined into a single markdown document.

Features

  • Asynchronous processing for faster conversion
  • Parallel page processing
  • Memory-efficient batch processing
  • Maintains page order in output
  • Real-time page saving (no waiting for full document)
  • Detailed logging of conversion process
  • Optimized image handling for speed

Prerequisites

  • Python 3.7+
  • OpenAI API key
  • poppler-utils (required for pdf2image)

Installation

Windows

  1. Install Python 3.7+ from python.org
  2. Install poppler:
    • Download the latest poppler release for Windows
    • Extract the downloaded file (e.g., to C:\Program Files\poppler)
    • Add the bin directory to your PATH:
      • Open System Properties > Advanced > Environment Variables
      • Under System Variables, find and select "Path"
      • Click "Edit" and add the path (e.g., C:\Program Files\poppler\bin)
  3. Clone this repository:
git clone https://github.com/aero-oli/pdf2md-gpt.git
cd pdf2md-gpt
  1. Create and activate a virtual environment (recommended):
python -m venv venv
.\venv\Scripts\activate
  1. Install dependencies:
pip install -r requirements.txt

macOS

  1. Install poppler:
brew install poppler

Linux (Ubuntu/Debian)

  1. Install poppler:
sudo apt-get install poppler-utils

All Platforms

After installing platform-specific dependencies:

  1. Create a .env file in the project root and add your OpenAI API key:
OPENAI_API_KEY=your_api_key_here

Usage

  1. Place your PDF file in the project directory
  2. Edit the pdf_path variable in pdf_to_markdown.py to point to your PDF file
  3. Run the script:

Windows (Command Prompt):

python pdf_to_markdown.py

or (PowerShell):

python .\pdf_to_markdown.py

macOS/Linux:

python pdf_to_markdown.py

The script will:

  1. Convert each page to an optimized image
  2. Send each image to GPT-4o-mini API
  3. Convert the content to markdown
  4. Save pages to the output file in correct order
  5. Create a final markdown file with the same name as your PDF but with .md extension

Performance Settings

You can adjust these variables in the code for different performance characteristics:

  • batch_size: Number of pages to process in parallel (default: 4)
  • quality: JPEG quality for image conversion (default: 70)
  • dpi: Image resolution for PDF conversion (default: 150)
  • timeout: API call timeout in seconds (default: 30)

Troubleshooting

Windows

  • If you get a "poppler not found" error:
    • Double-check that poppler is in your PATH
    • Try restarting your terminal/IDE after adding to PATH
    • Try using the full path in the code: poppler_path=r"C:\Program Files\poppler\bin"
  • If you get a "DLL load failed" error:

Notes

  • The script processes pages in parallel for speed but maintains correct page order in output
  • Each page is saved as it's processed, so partial results are available even if the script is interrupted
  • Memory usage is optimized through batch processing and image optimization
  • Processing time depends on:
    • PDF size and complexity
    • Number of pages
    • GPT-4o-mini API response time
    • Your internet connection speed