laurentvv/pdf2md-ai

PDF to Markdown Converter

License: MIT

🌐 Languages: English | 中文 | हिंदी | Español | Français | العربية | বাংলা | Русский | Português | Bahasa Indonesia

A powerful Python tool that extracts text and images from PDF documents and converts them to clean, well-formatted Markdown files. It uses a local vision-capable model to describe each image, so visual content stays accessible in the Markdown output.

Features

Comprehensive PDF Processing

  • Extract text while preserving document structure
  • Process images with AI-powered description generation
  • Clean up and format the output for maximum readability
  • Resume long-running conversions from where they left off

AI Integration

  • Multiple AI backend options (Ollama or LM Studio)
  • Custom prompting for image description
  • Efficient processing of visual content

Advanced Tools

  • Markdown validation and quality checking
  • Table of contents formatting
  • Index extraction and reorganization
  • Progress tracking and detailed statistics

Requirements

  • Python 3.8+
  • PyMuPDF (fitz)
  • Local LLM with vision capabilities (via Ollama or LM Studio)

Installation

# Clone the repository
git clone https://github.com/laurentvv/pdf2md-ai.git
cd pdf2md-ai

# Create virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Configuration

Create a .env file in the project directory with the following variables:

MODEL_PROVIDER=ollama  # or lmstudio
MODEL_OLLAMA=llava
MODEL_LMSTUDIO=llava
PROMPT="Describe this image in detail, including any visible text."
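As an illustration of how these variables might be read, here is a minimal sketch that parses `.env`-style text using only the standard library. The helper names `load_env` and `active_model` are hypothetical and are not taken from `main.py`; the project itself may load its configuration differently (for example, through a dotenv package).

```python
def load_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and full-line comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        if value[:1] in ('"', "'"):
            # Quoted value: keep everything up to the matching closing quote.
            quote = value[0]
            end = value.find(quote, 1)
            value = value[1:end] if end != -1 else value[1:]
        else:
            # Unquoted value: drop a trailing inline comment such as "# or lmstudio".
            value = value.split(" #", 1)[0].strip()
        settings[key.strip()] = value
    return settings


def active_model(settings):
    """Return (provider, model name) for the provider selected in the settings."""
    provider = settings.get("MODEL_PROVIDER", "ollama").lower()
    key = "MODEL_LMSTUDIO" if provider == "lmstudio" else "MODEL_OLLAMA"
    return provider, settings.get(key, "")


SAMPLE = '''MODEL_PROVIDER=ollama  # or lmstudio
MODEL_OLLAMA=llava
MODEL_LMSTUDIO=llava
PROMPT="Describe this image in detail, including any visible text."
'''

settings = load_env(SAMPLE)
provider, model = active_model(settings)
```

Only the model variable that matches `MODEL_PROVIDER` is consulted in this sketch, which is why both `MODEL_OLLAMA` and `MODEL_LMSTUDIO` can be set at the same time.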

You can also customize the default parameters directly in the script by modifying these variables at the top of main.py:

# Program parameters
pdf_file = "./mon_document.pdf"  # Replace with your PDF file path
output_file = "output.md"  # Raw extraction: text plus LLM image descriptions
output_file2 = "output2.md"  # Cleaned version; both files are kept

Usage

Basic Usage

python main.py input.pdf

This will:

  1. Extract text and images from input.pdf
  2. Generate descriptions for all images using the configured AI provider
  3. Write all content to a Markdown file at the default location (output.md)
  4. Clean and format the Markdown, saving the result to output2.md
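To make step 2 more concrete, the sketch below shows one way an extracted image could be sent to a local Ollama server for description. It assumes Ollama's default endpoint at `http://localhost:11434/api/generate`; the function names `build_ollama_payload` and `describe_image` are hypothetical and do not come from this repository, whose actual request code may differ.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_ollama_payload(image_bytes, model, prompt):
    """Build the JSON body for a single, non-streaming vision request."""
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe_image(image_bytes, model="llava",
                   prompt="Describe this image in detail, including any visible text.",
                   url=OLLAMA_URL):
    """Send the image to the server and return the generated description."""
    payload = build_ollama_payload(image_bytes, model, prompt)
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `describe_image(open("figure.png", "rb").read())` would need a running Ollama instance with a vision model such as `llava` already pulled; building the payload alone needs no server.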

Advanced Options

For more control, you can import functions from the script:

# Customize options in your script
from main import extract_pdf
from clean_markdown import clean_markdown

# Extract PDF with custom options
extract_pdf(
    pdf_path="document.pdf",
    output_md="raw_output.md",
    use_temp_dir=True,
    resume=True
)

# Clean the generated markdown
clean_markdown("raw_output.md", "final_output.md")

Available Parameters for extract_pdf()

  • pdf_path: Path to the PDF file to process
  • output_md: Path where the Markdown output is saved (default: "output.md")
  • use_temp_dir: Use a temporary directory for storing extracted images (default: True)
  • resume: Resume processing from the last processed page (default: True)
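The `resume` behaviour can be pictured with a small checkpoint file that records the next page to process. This is only a sketch under stated assumptions: the JSON checkpoint format and the names `load_checkpoint`, `save_checkpoint`, and `process_pages` are hypothetical, and the way `extract_pdf()` actually tracks progress is not documented here.

```python
import json


def load_checkpoint(path):
    """Return the index of the next page to process (0 if no usable checkpoint)."""
    try:
        with open(path, encoding="utf-8") as handle:
            return int(json.load(handle)["next_page"])
    except (FileNotFoundError, KeyError, TypeError, ValueError):
        return 0


def save_checkpoint(path, next_page):
    """Record which page should be processed next."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"next_page": next_page}, handle)


def process_pages(page_count, checkpoint_path, resume=True, handle_page=print):
    """Process pages in order, saving progress after each one.

    Returns the page index at which processing started.
    """
    start = load_checkpoint(checkpoint_path) if resume else 0
    for page_number in range(start, page_count):
        handle_page(page_number)
        save_checkpoint(checkpoint_path, page_number + 1)
    return start
```

Saving after every page means an interrupted run loses at most the page that was in progress, which matters when each page can trigger several slow model calls.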

How It Works

  1. PDF Extraction: Using PyMuPDF, the tool extracts text and images from each page
  2. Image Processing: Images are temporarily saved and sent to the AI model for description
  3. Markdown Generation: Text and image descriptions are compiled into Markdown
  4. Cleanup: The raw Markdown is processed to remove page headers, clean TOC lines, and format indices
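The cleanup step could look roughly like the sketch below. The patterns are assumptions chosen for illustration (a `## Page N` style header and dot leaders followed by a page number in table-of-contents lines); the rules in the project's `clean_markdown` module may well differ, and `clean_lines` is a hypothetical name.

```python
import re

# Assumption: page headers look like "## Page 12".
PAGE_HEADER = re.compile(r"^#{1,6}\s*Page\s+\d+\s*$", re.IGNORECASE)
# Assumption: TOC lines end with dot leaders and a page number, e.g. "Intro ..... 5".
TOC_LEADER = re.compile(r"\s*\.{3,}\s*\d+\s*$")


def clean_lines(markdown_text):
    """Drop page headers, strip TOC dot leaders, and collapse repeated blank lines."""
    cleaned = []
    for line in markdown_text.splitlines():
        if PAGE_HEADER.match(line):
            continue
        line = TOC_LEADER.sub("", line).rstrip()
        # Keep at most one blank line in a row.
        if not line and cleaned and not cleaned[-1]:
            continue
        cleaned.append(line)
    return "\n".join(cleaned)


raw = "## Page 1\nIntroduction ........ 5\n\n\nBody text.\n## Page 2\nMore text."
result = clean_lines(raw)
```

Keeping the raw and cleaned outputs in separate files, as the tool does with output.md and output2.md, makes it easy to compare the two when a pattern removes more than intended.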

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.
