A Python tool that extracts text and images from PDF documents and converts them to clean, well-formatted Markdown files. Images are described by a local vision-capable LLM, so the visual content of a document stays accessible in the text output.
✨ Comprehensive PDF Processing
- Extract text while preserving document structure
- Process images with AI-powered description generation
- Clean up and format the output for maximum readability
- Resume long-running conversions from where they left off
🤖 AI Integration
- Multiple AI backend options (Ollama or LM Studio)
- Custom prompting for image description
- Efficient processing of visual content
🛠️ Advanced Tools
- Markdown validation and quality checking
- Table of contents formatting
- Index extraction and reorganization
- Progress tracking and detailed statistics
- Python 3.8+
- PyMuPDF (fitz)
- Local LLM with vision capabilities (via Ollama or LM Studio)
# Clone the repository
git clone https://github.com/laurentvv/pdf2md-ai.git
cd pdf2md-ai
# Create virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Create a .env file in the project directory with the following variables:
MODEL_PROVIDER=ollama # or lmstudio
MODEL_OLLAMA=llava
MODEL_LMSTUDIO=llava
PROMPT="Describe this image in detail, including any visible text."
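To illustrate how these variables might be consumed, here is a minimal sketch that parses `.env`-style text and picks the model for the active provider. It uses only the standard library; the function names `parse_env_text` and `select_model` are hypothetical and not taken from the project, which may well load its configuration differently (for example with a dotenv library).

```python
import shlex


def parse_env_text(text):
    """Parse KEY=VALUE lines, honoring quotes and trailing '# ...' comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # comments=True drops an unquoted trailing comment; quotes are stripped.
        tokens = shlex.split(value, comments=True)
        settings[key.strip()] = " ".join(tokens)
    return settings


def select_model(settings):
    """Return (provider, model_name) for the configured provider (assumed logic)."""
    provider = settings.get("MODEL_PROVIDER", "ollama").lower()
    if provider == "ollama":
        return provider, settings.get("MODEL_OLLAMA", "")
    if provider == "lmstudio":
        return provider, settings.get("MODEL_LMSTUDIO", "")
    raise ValueError(f"Unknown MODEL_PROVIDER: {provider!r}")
```

Calling `select_model(parse_env_text(open(".env").read()))` would then yield the provider and model name to use for image descriptions.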
You can also customize the default parameters directly in the script by modifying these variables at the top of main.py:
# Program parameters
pdf_file = "./mon_document.pdf" # Replace with your PDF file path
output_file = "output.md" # Raw text and image extraction (LLM descriptions)
output_file2 = "output2.md" # Cleaned version; both files are kept
python main.py input.pdf
This will:
- Extract text and images from input.pdf
- Generate descriptions for all images using the configured AI provider
- Create a markdown file at the default location (output.md) with all content
- Clean and format the markdown, saving the result to output2.md
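The image-description step listed above could look roughly like this when Ollama is the provider. This is a sketch under assumptions, not the project's actual code: it assumes Ollama is listening on its default local endpoint and uses Ollama's `/api/generate` route with a base64-encoded image. The helper names `build_ollama_payload` and `describe_image` are hypothetical.

```python
import base64
import json
import urllib.request

# Assumption: Ollama running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_ollama_payload(image_bytes, prompt, model="llava"):
    """Build the JSON body for a single, non-streaming vision request."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe_image(image_path, prompt, model="llava", url=OLLAMA_URL, timeout=300):
    """Send one image to the model and return the generated description."""
    with open(image_path, "rb") as fh:
        payload = build_ollama_payload(fh.read(), prompt, model)
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"].strip()
```

Keeping the payload builder separate from the network call makes it easy to check the request shape without a running model server.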
For more control, you can import functions from the script:
# Customize options in your script
from main import extract_pdf
from clean_markdown import clean_markdown
# Extract PDF with custom options
extract_pdf(
    pdf_path="document.pdf",
    output_md="raw_output.md",
    use_temp_dir=True,
    resume=True
)
# Clean the generated markdown
clean_markdown("raw_output.md", "final_output.md")
- pdf_path: Path to the PDF file to process
- output_md: Path where to save the markdown output (default: "output.md")
- use_temp_dir: Use a temporary directory for storing extracted images (default: True)
- resume: Resume processing from the last processed page (default: True)
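How `resume` is implemented is not described here. One plausible approach, sketched below, is to record the index of the last completed page in a small JSON state file and skip ahead on the next run. The file format and the function names (`load_last_page`, `save_last_page`, `pages_to_process`) are illustrative assumptions, not the project's API.

```python
import json
from pathlib import Path


def load_last_page(state_path):
    """Return the last completed page index, or -1 if there is no usable state."""
    path = Path(state_path)
    if not path.exists():
        return -1
    try:
        return int(json.loads(path.read_text(encoding="utf-8"))["last_page"])
    except (ValueError, KeyError, TypeError):
        # A corrupt or unexpected state file means starting from scratch.
        return -1


def save_last_page(state_path, page_index):
    """Persist progress after a page has been fully processed."""
    Path(state_path).write_text(
        json.dumps({"last_page": page_index}), encoding="utf-8"
    )


def pages_to_process(total_pages, state_path, resume=True):
    """Yield the zero-based page indices that still need processing."""
    start = load_last_page(state_path) + 1 if resume else 0
    return range(start, total_pages)
```

A conversion loop would iterate over `pages_to_process(...)` and call `save_last_page` after each page, so an interrupted run loses at most the page in progress.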
- PDF Extraction: Using PyMuPDF, the tool extracts text and images from each page
- Image Processing: Images are temporarily saved and sent to the AI model for description
- Markdown Generation: Text and image descriptions are compiled into Markdown
- Cleanup: The raw Markdown is processed to remove page headers, clean TOC lines, and format indices
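As a rough illustration of the cleanup stage, the sketch below drops page-header lines and strips dot leaders and page numbers from table-of-contents lines. The regular expressions are assumptions about what such lines look like (for example "Page 3 of 10" and "Introduction ........ 5"); the project's own `clean_markdown` may use different rules, and index formatting is not covered here. The name `clean_markdown_text` is hypothetical.

```python
import re

# Assumed header shape: "Page 3 of 10", "3 / 10", and similar.
PAGE_HEADER = re.compile(r"^\s*(?:Page\s+)?\d+\s*(?:/|of)\s*\d+\s*$", re.IGNORECASE)
# Assumed TOC shape: a title followed by dot leaders and a page number.
TOC_LEADER = re.compile(r"\s*\.{3,}\s*\d+\s*$")


def clean_markdown_text(markdown_text):
    """Remove page headers, trim TOC leaders, and collapse extra blank lines."""
    cleaned = []
    for line in markdown_text.splitlines():
        if PAGE_HEADER.match(line):
            continue  # drop the header line entirely
        cleaned.append(TOC_LEADER.sub("", line))
    text = "\n".join(cleaned)
    # Three or more consecutive newlines become a single blank line.
    return re.sub(r"\n{3,}", "\n\n", text)
```

Operating on a string rather than on file paths keeps the rules easy to test in isolation before wiring them into a file-to-file step.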
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.