A high-performance, production-ready fork of Microsoft's OmniParser for converting GUI screens to structured elements.
OmniParser V2 converts GUI screenshots into structured data describing the UI elements they contain. This fork builds on Microsoft's original OmniParser with significant performance optimizations and deployment features.
The tool identifies buttons, text fields, icons, and other UI components from images, enabling automated testing, accessibility analysis, and UI documentation. With Modal Labs integration, you can deploy the parser as a scalable, serverless API.
- Thread-safe PaddleOCR Pool: Efficiently manages OCR instances for optimal resource utilization
- Parallel Batch Processing: Process multiple images concurrently with up to 40x throughput
- Modal Labs Integration: One-command deployment to serverless infrastructure
- Performance Monitoring: Detailed metrics with actionable optimization suggestions
- Parameter Recommendations: Dynamic suggestions for optimal thread pool and batch size configuration
- FastAPI Development Server: Built-in API for local testing and development
# Clone repository
git clone https://github.com/bogini/OmniParser-Production
cd OmniParser-Production
# Install dependencies
pip install -r requirements.txt
# Create weights directory (the download step below creates the icon_detect and
# icon_caption subdirectories; icon_caption is renamed afterwards)
mkdir -p weights
# Download model weights
# For V2 weights:
for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do
huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights
done
mv weights/icon_caption weights/icon_caption_florence
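After the download and rename, the weights directory should contain the detection and caption model files. A minimal sanity check (file names are taken from the download loop above):

```python
from pathlib import Path

# Expected layout after downloading the V2 weights and renaming icon_caption
expected = [
    "weights/icon_detect/train_args.yaml",
    "weights/icon_detect/model.pt",
    "weights/icon_detect/model.yaml",
    "weights/icon_caption_florence/config.json",
    "weights/icon_caption_florence/generation_config.json",
    "weights/icon_caption_florence/model.safetensors",
]

missing = [p for p in expected if not Path(p).is_file()]
print("All weights present" if not missing else f"Missing files: {missing}")
```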
# Start with default settings
python app.py
# Or customize host and port
python app.py --port 7861 --host 0.0.0.0
# Install Modal CLI
pip install modal
# Log in to Modal
modal login
# Deploy the application
modal deploy app.py
After deployment, Modal provides a unique endpoint URL for your serverless API.
POST /process_image
Request body:
{
  "image_data": "data:image/png;base64,iVBORw0K...",
  "box_threshold": 0.05,
  "iou_threshold": 0.1,
  "use_paddleocr": true,
  "imgsz": 640
}
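A minimal client sketch for this endpoint, assuming the local development server from the quick-start above (the base URL and the file name screenshot.png are placeholders; substitute your Modal endpoint URL after deployment):

```python
import base64
import requests

# Placeholder: local FastAPI development server on the default port
BASE_URL = "http://localhost:7861"

# Encode a local screenshot as a base64 data URI
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "image_data": f"data:image/png;base64,{image_b64}",
    "box_threshold": 0.05,
    "iou_threshold": 0.1,
    "use_paddleocr": True,
    "imgsz": 640,
}

resp = requests.post(f"{BASE_URL}/process_image", json=payload)
resp.raise_for_status()
print(resp.json())
```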
POST /process_batched
Request body:
{
  "images": [
    "data:image/png;base64,iVBORw0K...",
    "data:image/png;base64,iVBORw0K..."
  ],
  "box_threshold": 0.05,
  "iou_threshold": 0.1,
  "use_paddleocr": true,
  "imgsz": 640
}
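The same client pattern works for batches (again a sketch; the base URL and image paths are placeholders):

```python
import base64
import requests

def to_data_uri(path: str) -> str:
    # Encode a local PNG as a base64 data URI
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

payload = {
    "images": [to_data_uri("screen1.png"), to_data_uri("screen2.png")],
    "box_threshold": 0.05,
    "iou_threshold": 0.1,
    "use_paddleocr": True,
    "imgsz": 640,
}

resp = requests.post("http://localhost:7861/process_batched", json=payload)
resp.raise_for_status()
print(resp.json())
```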
Parameter | Description | Default |
---|---|---|
CONCURRENCY_LIMIT | Concurrent requests per container | 1 |
MODAL_CONTAINER_TIMEOUT | Container idle timeout (seconds) | 500 |
MODAL_GPU_CONFIG | GPU type for Modal deployment | A100 |
API_PORT | FastAPI server port | 7861 |
MAX_CONTAINERS | Maximum Modal containers | 10 |
MAX_BATCH_SIZE | Maximum images per batch | 1000 |
THREAD_POOL_SIZE | Thread pool size | 40 |
Parameter | Description | Default |
---|---|---|
pool_size | Number of PaddleOCR instances | 16 |
lang | OCR language | en |
use_angle_cls | Use angle classification | false |
use_gpu | Use GPU for OCR | true |
max_batch_size | Maximum OCR batch size | 1024 |
rec_batch_num | Recognition batch number | 1024 |
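These parameters govern how OCR instances are shared across worker threads. The sketch below illustrates how such a thread-safe pool can work; it is an illustration of the concept rather than this repository's implementation, and the class name and constructor arguments mirror the table above as assumptions:

```python
import queue
from contextlib import contextmanager
from paddleocr import PaddleOCR

class PaddleOCRPool:
    """A fixed-size pool of PaddleOCR instances shared across threads (illustrative)."""

    def __init__(self, pool_size: int = 16, lang: str = "en",
                 use_angle_cls: bool = False, use_gpu: bool = True,
                 rec_batch_num: int = 1024):
        # A blocking queue makes checking instances in and out thread-safe.
        self._pool: queue.Queue = queue.Queue(maxsize=pool_size)
        for _ in range(pool_size):
            self._pool.put(PaddleOCR(lang=lang, use_angle_cls=use_angle_cls,
                                     use_gpu=use_gpu, rec_batch_num=rec_batch_num))

    @contextmanager
    def acquire(self):
        # Blocks until an instance is free, then returns it to the pool afterwards.
        ocr = self._pool.get()
        try:
            yield ocr
        finally:
            self._pool.put(ocr)

# Usage from a worker thread:
# pool = PaddleOCRPool(pool_size=16)
# with pool.acquire() as ocr:
#     results = ocr.ocr("screenshot.png", cls=False)
```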
Set environment variables before starting the server:
# Set environment variables
export THREAD_POOL_SIZE=20
export MAX_BATCH_SIZE=500
export API_PORT=8000
# Start with custom configuration
python app.py
OmniParser V2 delivers significant performance improvements:
- Single Image: ~0.55-0.7s processing time per image (1111×2405 pixels with OCR) on an A100 GPU
- Batch Processing: Near-linear scaling with thread pool size
- Resource Efficiency: Controlled memory usage with configurable OCR pool
Performance metrics are automatically logged with each request, providing insights for optimization.
The server provides detailed logging that gives insight into the processing pipeline:
2025-02-27 08:26:36,831 - omniparser - INFO - [batch_0b7beafa] Processing batch of 6 images in parallel
2025-02-27 08:26:36,831 - omniparser - INFO - [batch_0b7beafa] Submitting image 1/6 for processing
[...]
2025-02-27 08:26:37,849 - omniparser - INFO - [req_f61fc3be] Request to 'process_image' completed successfully in 1.018s | Steps: {"image_conversion": 0.0, "ocr_processing": 0.307, "icon_detection": 0.598, "response_preparation": 0.112} | {'image_width': 1179, 'image_height': 2556, 'text_elements': 1, 'icons_detected': 5}
[...]
2025-02-27 08:26:40,847 - omniparser - INFO - [batch_0b7beafa] Batch processing complete - Stats: Total: 6 | Successful: 6 | Failed: 0 | Time: 3.98s | Avg: 0.66s per image | Thread pool size: 40 | Parallelism efficiency: 0.11 | Image times - Avg: 2.88s | Min: 1.02s | Max: 4.02s Consider reducing THREAD_POOL_SIZE (current: 40)
Key insights from the logs:
- Request Tracking: Each request and batch gets a unique ID
- Processing Breakdown: Timing for each processing step
- Detection Results: Image dimensions and element counts
- Performance Metrics: Efficiency and utilization statistics
- Optimization Suggestions: Automatic recommendations based on usage patterns
This project is licensed under the CC-BY-4.0 License, the same as the original Microsoft OmniParser repository.