A computer vision application that captures images through your camera and uses multiple AI models (including Google's Gemini, DeepSeek, and Qwen) to analyze and answer questions in real time. Perfect for students and educators who want to quickly verify answers or get help with questions.
https://x.com/prathamdby/status/1898830118631637250
- Real-time camera feed with OpenCV
- Memory-efficient image processing with zero disk I/O
- Text extraction from images using Gemini Vision
- Multi-model answer verification (Gemini Pro, DeepSeek, Qwen)
- Support for both multiple choice and open-ended questions
- Live visual overlay with controls and status
- Thread-safe image handling with BytesIO
- Non-blocking UI with asynchronous processing
- Optimized memory management and MIME type validation
- Python 3.11 or higher
- Webcam/Camera device
- Google Gemini API key (for vision/text extraction)
- OpenRouter API key (for multi-model support)
- google-genai>=1.5.0: Google Generative AI client library
- opencv-python>=4.11.0.86: OpenCV for computer vision
- python-dotenv>=1.0.1: Environment variable management
- openai>=1.12.0: OpenAI/OpenRouter client library
- Clone the repository:
git clone https://github.com/prathamdby/ai-helper.git
cd ai-helper
- Create a virtual environment:
python -m venv venv
- Activate the virtual environment:
On Windows:
venv\Scripts\activate
On macOS/Linux:
source venv/bin/activate
- Install dependencies:
Using pip:
pip install -r requirements.txt
Or using uv (faster):
uv pip install -r requirements.txt
- Create a new .env file:
On Windows:
copy .env.example .env
On macOS/Linux:
cp .env.example .env
- Edit .env and add your API keys:
GEMINI_API_KEY=your_gemini_api_key_here      # From https://aistudio.google.com/app/apikey
OPENROUTER_API_KEY=your_openrouter_key_here  # From https://openrouter.ai/keys
- Ensure your camera is connected and accessible
- Run the application:
On Windows/macOS:
python main.py
On some Linux systems:
python3 main.py
- Controls:
SPACE - Capture and analyze the current frame
C - Clear current results
Q - Quit the application
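The key handling can be sketched as a small dispatcher around the code that `cv2.waitKey` returns (the function and action names here are illustrative, not the project's exact code):

```python
def handle_key(key: int) -> str:
    """Map a cv2.waitKey() return code to an action name.

    cv2.waitKey returns -1 when no key is pressed and a platform-dependent
    integer otherwise, so only the low byte is compared.
    """
    key &= 0xFF
    if key == ord(" "):
        return "capture"  # capture and analyze the current frame
    if key in (ord("c"), ord("C")):
        return "clear"    # clear current results
    if key in (ord("q"), ord("Q")):
        return "quit"     # quit the application
    return "none"
```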
- The application captures the video feed from your camera (1280x720 @ 30 fps)
- When you press SPACE, it captures the current frame
- The frame is encoded to JPEG format in memory using OpenCV
- The encoded image is wrapped in a BytesIO object with proper MIME type
- Image is processed to extract text using Gemini Vision AI
- The extracted question is then sent to multiple AI models:
- Google Gemini Pro
- DeepSeek Chat
- Qwen
- For multiple choice questions:
- Each model returns the correct option letter (A, B, C, or D)
- For open-ended questions:
- Each model returns a concise answer
- Results from all models are displayed in real-time on the video feed
- Memory management ensures efficient processing with zero disk operations
- Press C to clear results and analyze a new question
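Fanning one extracted question out to several models can be sketched through OpenRouter's OpenAI-compatible endpoint. The model IDs and prompt wording below are assumptions, not the project's exact values:

```python
# Hypothetical model IDs on OpenRouter; the project may pin different ones.
MODELS = [
    "google/gemini-pro-1.5",
    "deepseek/deepseek-chat",
    "qwen/qwen-2.5-72b-instruct",
]


def build_messages(question: str) -> list:
    """Chat messages asking for a concise answer (option letter for MCQs)."""
    return [
        {
            "role": "system",
            "content": "Answer concisely. For multiple choice, reply with "
                       "only the option letter (A, B, C, or D).",
        },
        {"role": "user", "content": question},
    ]


def ask_all_models(question: str, api_key: str) -> dict:
    """Query each model in turn and collect its answer."""
    from openai import OpenAI  # OpenRouter speaks the OpenAI wire protocol

    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=api_key)
    answers = {}
    for model in MODELS:
        resp = client.chat.completions.create(
            model=model, messages=build_messages(question)
        )
        answers[model] = resp.choices[0].message.content.strip()
    return answers
```

In the application the three calls run off the UI thread, so the overlay stays responsive while answers stream in.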
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.