A RESTful web API service for Menelik's Berhan Ethiopic Script OCR app.

MenelikBerhan/REST-API_for_Ethiopic_Script_OCR

REST-API for Ethiopic Script OCR

A RESTful web API service for Menelik's Berhan Ethiopic Script OCR app.

Table of Contents

  1. Introduction
  2. Features
  3. Technologies Used
  4. Setup
  5. Endpoints
  6. Usage
  7. Testing
  8. Contributing
  9. Authors
  10. License

Introduction

Menelik's Berhan (loosely translated as "Menelik's light") is a web API that provides OCR for image and PDF files containing Ethiopic script text.

It uses the open source Tesseract OCR engine and recognizes text printed in Amharic, Ge'ez and Tigrigna.

The API is intended for use in web applications, and its overall structure and abstractions are designed with that in mind.

The OCR process builds on concepts learned from a previous implementation, the Ethiopic Script CLI OCR app.

Please note that this OCR application is primarily designed to work with printed text. It may not perform well with handwritten text.

Features

  • OCR on Images and PDFs: Perform OCR on images and PDFs containing Ethiopic script text.
  • OCR Process Tracking: Each OCR process (for image or PDF) is tracked and stored in a database for future analysis.
  • Flexible OCR Outputs: OCR results can be provided in various formats including plain text, Microsoft Word, and PDF.
  • OCR Result Accuracy: Provides an accuracy score for OCR results based on the average confidence level of words recognized.
  • Configurable OCR Process: Users can configure the OCR process by adjusting Tesseract configuration options.
  • Image Preprocessing: Includes image preprocessing capabilities to improve OCR results.
  • File Storage and Metadata: Uploaded OCR input image and PDF files are stored locally, with file metadata stored in a database using class abstractions.
  • Fine-Tuned Language Model: In addition to the default Tesseract language models, includes a fine-tuned model for Amharic, based on texts printed in the 1950s.
  • Data Abstraction for Analysis: Abstracts input images & PDFs, Tesseract configuration used for OCR, and the OCR process & result using classes, and stores these data in the database for future analysis.
  • OAuth2 User Authentication: Secure user authentication implemented with FastAPI's security utilities and JSON Web Tokens (JWT).
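
The accuracy score described above (the average confidence of recognized words) can be sketched roughly as follows. This is only an illustration and not the app's actual code: the function names, the default language code `amh` and the `--psm 6` option are assumptions chosen for the example. The data shape follows what pytesseract's `image_to_data` returns with `output_type=Output.DICT`.

```python
from typing import Any, Dict, List, Tuple


def mean_word_confidence(data: Dict[str, List[Any]]) -> float:
    """Average the confidence of recognized words on a 0-100 scale.

    `data` holds parallel lists under "text" and "conf", as returned by
    pytesseract.image_to_data(..., output_type=Output.DICT).
    Entries with a confidence of -1 are layout boxes, not words.
    """
    confidences = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some pytesseract versions return strings
        if conf >= 0 and str(text).strip():
            confidences.append(conf)
    return sum(confidences) / len(confidences) if confidences else 0.0


def ocr_with_confidence(
    image_path: str, lang: str = "amh", config: str = "--psm 6"
) -> Tuple[str, float]:
    """Hypothetical helper: OCR one image, return (text, mean confidence)."""
    # Imported lazily so mean_word_confidence works without the OCR stack.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path),
        lang=lang,        # Tesseract language code (assumed: Amharic)
        config=config,    # extra Tesseract options passed through as-is
        output_type=pytesseract.Output.DICT,
    )
    words = [word for word in data["text"] if word.strip()]
    return " ".join(words), mean_word_confidence(data)


# Hand-made sample in the same shape as image_to_data output.
sample = {"text": ["", "ሰላም", "ዓለም", " "], "conf": [-1, 90, "80", 95]}
print(mean_word_confidence(sample))  # 85.0
```

Only entries that carry both a non-negative confidence and non-blank text count toward the average, so layout boxes and whitespace tokens do not skew the score.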

Technologies Used

  • Python (3.8): The whole app is built with Python.
  • FastAPI (0.110.0): This is the web framework used.
  • Uvicorn (0.29.0): An ASGI web server implementation for Python.
  • MongoDB (7.0.5): This is the database used.
  • Motor (3.3.2): Asynchronous Python driver for MongoDB.
  • Pydantic (2.6.4): Data validation library for Python.
  • python-jose (3.3.0): A JavaScript Object Signing and Encryption (JOSE) implementation in Python.
  • Passlib (1.7.4): A password hashing library for Python.
  • Tesseract (5.3.4): This is the OCR engine used.
  • PyTesseract (0.3.10): An OCR tool for Python. It's a wrapper for Tesseract-OCR Engine.
  • Aiofiles (23.2.1): A library for handling asynchronous file I/O.
  • PDF2Image (1.17.0): A Python module that converts PDFs into images.
  • NumPy (1.24.4): A package for scientific computing with Python.
  • OpenCV-Python (4.9.0.80): A library for real-time computer vision.
  • Pillow (10.2.0): Adds image processing capabilities to Python.
  • python-docx (1.1.0): Reads, queries and modifies Microsoft Word 2007/2008 docx files.
  • FPDF2 (2.7.8): A library to create PDF documents using Python.
  • Pytest: This is the testing framework used.

Setup

Implemented and tested on Ubuntu 20.04 with Python 3.8.

# (Optional) Add this repository to get Tesseract version 5.*
sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel

# Reload local package database
sudo apt update

# Install Tesseract
sudo apt install -y tesseract-ocr

# Install gnupg and curl if they are not already available
sudo apt-get install gnupg curl

# Import the MongoDB public GPG key
curl -fsSL https://www.mongodb.org/static/pgp/server-7.0.asc | \
   sudo gpg -o /usr/share/keyrings/mongodb-server-7.0.gpg \
   --dearmor

# Create a list file for MongoDB
echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list

# Reload local package database
sudo apt-get update

# Install the MongoDB packages
sudo apt-get install -y mongodb-org

# Start MongoDB
sudo systemctl start mongod

# Enable MongoDB on startup
sudo systemctl enable mongod
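
After the installation steps above, a quick check that the tools ended up on the PATH can save some debugging later. The following is a minimal sketch using only the Python standard library. The helper names are made up for this example, and it assumes the executables are named `tesseract` and `mongod`, as in the commands above.

```python
import shutil
import subprocess
from typing import Dict, List


def find_tools(names: List[str]) -> Dict[str, bool]:
    """Report which executables are available on PATH."""
    return {name: shutil.which(name) is not None for name in names}


def tesseract_languages() -> List[str]:
    """Return the language codes printed by `tesseract --list-langs`.

    Returns an empty list when tesseract is not installed.
    """
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some versions print the list to stderr instead of stdout.
    lines = (result.stdout or result.stderr).splitlines()
    # The first line is a header; the remaining lines are language codes.
    return [line.strip() for line in lines[1:] if line.strip()]


if __name__ == "__main__":
    print(find_tools(["tesseract", "mongod"]))
    print(tesseract_languages())
```

If a language you need (for example `amh` for Amharic) is missing from the printed list, its traineddata file probably still has to be installed separately.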

Clone the repo

git clone https://github.com/MenelikBerhan/REST-API_for_Ethiopic_Script_OCR.git
cd REST-API_for_Ethiopic_Script_OCR

(Optional) Set up a Python virtual environment using venv. This is recommended before installing the requirements:

sudo apt install -y python3.8-venv
python3 -m venv .venv
source .venv/bin/activate

Install required packages using pip:

python3 -m pip install --upgrade pip
python3 -m pip install -r requirements.txt

(Optional) Set environment variables

Change the default app variables by setting values in the app_env file.
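
The format of app_env and the way the app reads it are not described here. Purely as an illustration, and assuming the file holds plain `KEY=VALUE` lines, such a file could be parsed and exported to the process environment as sketched below. The variable names `APP_PORT` and `DB_NAME` are hypothetical and not taken from the project.

```python
import os
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments and malformed lines."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: str, override: bool = False) -> Dict[str, str]:
    """Read an env file and export its values to os.environ.

    Variables that are already set are kept unless `override` is True.
    """
    with open(path, encoding="utf-8") as env_file:
        values = parse_env_text(env_file.read())
    for key, value in values.items():
        if override or key not in os.environ:
            os.environ[key] = value
    return values


# Hypothetical file content; the real variable names may differ.
example = '# example settings\nAPP_PORT=8000\nDB_NAME = "ocr_db"\n'
print(parse_env_text(example))  # {'APP_PORT': '8000', 'DB_NAME': 'ocr_db'}
```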

Start the app:

python -m api.v1.app

Testing

(Optional) Set testing environment variables

Change the testing app variables by setting values in the test_env file.

Run all tests:

`pytest`

Run a specific test file:

`pytest tests/[<test_file.py>]`

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

Authors

Girma Eshete aka Menelik Berhan

LinkedIn

License

MIT
