This tool converts PDF documents to Markdown format using GPT-4o-mini's vision capabilities. Each page of the PDF is processed individually and converted to markdown, then combined into a single markdown document.
- Asynchronous processing for faster conversion
- Parallel page processing
- Memory-efficient batch processing
- Maintains page order in output
- Real-time page saving (no waiting for full document)
- Detailed logging of conversion process
- Optimized image handling for speed
- Python 3.7+
- OpenAI API key
poppler-utils
(required for pdf2image)
- Install Python 3.7+ from python.org
- Install poppler:
- Download the latest poppler release for Windows
- Extract the downloaded file (e.g., to
C:\Program Files\poppler
) - Add the
bin
directory to your PATH:- Open System Properties > Advanced > Environment Variables
- Under System Variables, find and select "Path"
- Click "Edit" and add the path (e.g.,
C:\Program Files\poppler\bin
)
- Clone this repository:
git clone https://github.com/aero-oli/pdf2md-gpt.git
cd pdf2md-gpt
- Create and activate a virtual environment (recommended):
python -m venv venv
.\venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Install poppler:
brew install poppler
- Install poppler:
sudo apt-get install poppler-utils
After installing platform-specific dependencies:
- Create a
.env
file in the project root and add your OpenAI API key:
OPENAI_API_KEY=your_api_key_here
- Place your PDF file in the project directory
- Edit the
pdf_path
variable inpdf_to_markdown.py
to point to your PDF file - Run the script:
Windows (Command Prompt):
python pdf_to_markdown.py
or (PowerShell):
python .\pdf_to_markdown.py
macOS/Linux:
python pdf_to_markdown.py
The script will:
- Convert each page to an optimized image
- Send each image to GPT-4o-mini API
- Convert the content to markdown
- Save pages to the output file in correct order
- Create a final markdown file with the same name as your PDF but with
.md
extension
You can adjust these variables in the code for different performance characteristics:
batch_size
: Number of pages to process in parallel (default: 4)quality
: JPEG quality for image conversion (default: 70)dpi
: Image resolution for PDF conversion (default: 150)timeout
: API call timeout in seconds (default: 30)
- If you get a "poppler not found" error:
- Double-check that poppler is in your PATH
- Try restarting your terminal/IDE after adding to PATH
- Try using the full path in the code:
poppler_path=r"C:\Program Files\poppler\bin"
- If you get a "DLL load failed" error:
- Install the Visual C++ Redistributable
- The script processes pages in parallel for speed but maintains correct page order in output
- Each page is saved as it's processed, so partial results are available even if the script is interrupted
- Memory usage is optimized through batch processing and image optimization
- Processing time depends on:
- PDF size and complexity
- Number of pages
- GPT-4o-mini API response time
- Your internet connection speed