This project is a web-based application that allows users to upload images or documents, extract text from them using Optical Character Recognition (OCR), and perform searches across the extracted text. Built with Gradio for a user-friendly interface and leveraging Pytesseract for OCR functionality, this app is ideal for managing and searching text from scanned documents, images, and PDFs.
Features:
- OCR Processing: Extract text from images or documents in multiple languages using Tesseract.
- Document Search: Perform a keyword search across the extracted text to find relevant information quickly.
- Gradio Interface: User-friendly web interface for uploading documents and displaying results.
- Multiple Language Support: The application can handle multiple languages, including English and Hindi.
Technologies used:
- Gradio: For building the web-based interface.
- Pytesseract: To perform OCR on uploaded images and documents.
- Python: Backend processing.
- Tesseract-OCR: The underlying OCR engine used for text extraction.
Before running the application, ensure you have the following installed:
- Python 3.x
- Tesseract-OCR
  - For Linux:
      sudo apt-get install tesseract-ocr
  - For Windows: Download and install Tesseract.
- Required Python packages:
    pip install -r requirements.txt
  Your requirements.txt should include:
  - gradio
  - pytesseract
  - pillow (for image handling)
  - numpy
Installation:
- Clone the Repository:
    git clone https://github.com/ajaynair710/OCR-and-Document-Search-Web-Application.git
    cd OCR-and-Document-Search-Web-Application
- Install Dependencies:
    pip install -r requirements.txt
- Install Tesseract:
  - For Linux:
      sudo apt-get install tesseract-ocr
  - For Windows: Download and install Tesseract. Make sure Tesseract is added to your system's PATH.
- Run the Application:
    python app.py
  The app will run on a local server, and you can access it by navigating to the local URL provided in the console, typically http://0.0.0.0:7860. If that address does not open in your browser, try http://localhost:7860 instead.
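The repository's app.py is not reproduced in this README, so the following is only a minimal sketch of how such an entry point could be wired together. It assumes Gradio's `Interface`, `Image`, `CheckboxGroup`, and `Textbox` components and pytesseract's `image_to_string`; the names `LANGUAGES`, `tesseract_lang`, `extract_text`, and `build_demo` are hypothetical and may not match the actual code. Hindi also needs the Hindi language data for Tesseract (on Debian/Ubuntu, the tesseract-ocr-hin package).

```python
# Hypothetical sketch of an app.py; the real file in the repository may differ.

# Display name -> Tesseract language code (each needs its language data installed).
LANGUAGES = {"English": "eng", "Hindi": "hin"}


def tesseract_lang(names):
    """Join display names into Tesseract's combined language string, e.g. 'eng+hin'."""
    codes = [LANGUAGES[name] for name in names if name in LANGUAGES]
    return "+".join(codes) or "eng"  # fall back to English if nothing valid is selected


def extract_text(image, names):
    """Run OCR on a PIL image and return the recognised text."""
    if image is None:  # nothing uploaded yet
        return ""
    import pytesseract  # imported lazily so the helpers above load without it

    return pytesseract.image_to_string(image, lang=tesseract_lang(names))


def build_demo():
    """Assemble the interface: image and language choice in, extracted text out."""
    import gradio as gr

    return gr.Interface(
        fn=extract_text,
        inputs=[
            gr.Image(type="pil", label="Image"),
            gr.CheckboxGroup(
                choices=list(LANGUAGES), value=["English"], label="Languages"
            ),
        ],
        outputs=gr.Textbox(label="Extracted text"),
        title="OCR and Document Search",
    )
```

Under these assumptions, calling `build_demo().launch(server_name="0.0.0.0", server_port=7860)` would serve the interface on port 7860 on all network interfaces, which is consistent with the URL printed in the console.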
Usage:
- Upload a Document or Image: Open the app in your browser and upload an image or document that contains the text you want to extract.
- Text Extraction: Once the document is uploaded, the app processes it with the Tesseract OCR engine and displays the extracted text on the screen.
- Search: Use the search bar to find specific keywords within the extracted text.
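How the app implements its search is not described here. One simple way a keyword search over extracted text could work is a case-insensitive, line-by-line scan, sketched below. The function name `search_text` and the sample text are made up for illustration.

```python
def search_text(text, keyword):
    """Return (line_number, line) pairs for lines containing the keyword, ignoring case."""
    needle = keyword.strip().casefold()
    if not needle:  # an empty query matches nothing rather than everything
        return []
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line.casefold()
    ]


# Sample OCR output (invented for this example).
extracted = "Invoice No. 1042\nTotal due: 350 USD\nPlease pay the total by Friday"
for number, line in search_text(extracted, "total"):
    print(f"{number}: {line}")
# prints:
# 2: Total due: 350 USD
# 3: Please pay the total by Friday
```

Returning line numbers alongside the matching lines makes it easy to show each hit in context in the results panel.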
Troubleshooting:
- TesseractNotFoundError: If you encounter the TesseractNotFoundError, ensure Tesseract is installed correctly and added to your system's PATH.
  - For Linux:
      sudo apt-get install tesseract-ocr
  - For Windows: Add Tesseract's installation directory (e.g., C:\Program Files\Tesseract-OCR\) to your system's PATH.
- OCR Accuracy: OCR accuracy depends on the quality of the uploaded images. Higher-resolution images tend to yield better results.
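When diagnosing a TesseractNotFoundError, it can help to check from Python whether the tesseract executable is actually reachable. The sketch below uses only the standard library; `find_tesseract` is a hypothetical helper, and the Windows directory is the example path from the troubleshooting item above, which may differ on your machine.

```python
import shutil


def find_tesseract(extra_dirs=()):
    """Return the path to the tesseract executable, or None if it cannot be found."""
    # First try the normal PATH lookup, which pytesseract relies on by default.
    found = shutil.which("tesseract")
    if found:
        return found
    # Then try additional directories, such as a Windows install folder.
    for directory in extra_dirs:
        found = shutil.which("tesseract", path=directory)
        if found:
            return found
    return None


path = find_tesseract(extra_dirs=[r"C:\Program Files\Tesseract-OCR"])
if path is None:
    print("Tesseract not found; install it or add its folder to PATH.")
else:
    print(f"Using Tesseract at {path}")
    # To point pytesseract at this binary explicitly instead of editing PATH:
    #   import pytesseract
    #   pytesseract.pytesseract.tesseract_cmd = path
```

Setting `pytesseract.pytesseract.tesseract_cmd` is an alternative to editing PATH; it only affects the current Python process.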
Planned enhancements:
- PDF Support: Add the ability to process and extract text from multi-page PDF documents.
- Text Export: Allow users to download the extracted text as a .txt or .docx file.
- Advanced Search: Implement advanced search filters such as case sensitivity, regex matching, and proximity searches.
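As an illustration of the advanced search idea, the sketch below shows one possible shape for case-sensitive and regex matching using Python's standard `re` module. It is not part of the current app: `advanced_search` and its parameters are hypothetical, and proximity search is not covered.

```python
import re


def advanced_search(text, query, case_sensitive=False, use_regex=False):
    """Return (line_number, matched_text) pairs for every hit in the text."""
    if not query:  # an empty pattern would match at every position
        return []
    flags = 0 if case_sensitive else re.IGNORECASE
    # Outside regex mode, escape the query so characters like '.' match literally.
    pattern = re.compile(query if use_regex else re.escape(query), flags)
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in pattern.finditer(line):
            hits.append((number, match.group(0)))
    return hits


# Sample OCR output (invented for this example).
sample = "Total: 350 USD\ntotal paid: 120 USD"
print(advanced_search(sample, "total"))                       # both lines match
print(advanced_search(sample, "total", case_sensitive=True))  # only line 2 matches
print(advanced_search(sample, r"\d+ USD", use_regex=True))    # amounts on both lines
```

In regex mode an invalid pattern raises `re.error`, so a real interface would probably catch that and show a message to the user.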
Contributions are welcome! To contribute:
- Fork the repository.
- Create a feature branch (git checkout -b feature-branch).
- Commit your changes (git commit -m 'Add new feature').
- Push to the branch (git push origin feature-branch).
- Open a pull request.