- This project is a FastAPI-based system that compares incoming request messages with existing messages in a database and detects duplicates by message similarity, measured as angular distance.
- Key Features
- Installation and Setup
- Alembic Migration
- Database Schema Overview
- API Documentation
- Notes
- Efficient Duplicate Detection: Uses TF-IDF vectorization and an Annoy index to quickly compare new messages with existing ones and detect duplicates based on similarity.
- Threshold-Based Matching: A message is treated as a duplicate only if its angular distance to an existing message falls below a customizable threshold.
- Fast Similarity Search: Employs Annoy (Approximate Nearest Neighbors) for fast, scalable nearest-neighbor search, supporting real-time detection even with large datasets.
- Automatic Index Rebuilding: After each new message, the Annoy index is automatically rebuilt so that comparisons stay up to date without manual intervention.
- Subject Management: Groups related messages into subjects and tracks the latest message in each subject for efficient clustering.
- Artifacts Storage: Saves and reloads the TF-IDF model and the Annoy index for fast startup and efficient comparisons.
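To make the similarity measure concrete, the following is a minimal sketch of angular distance in the form Annoy documents for its "angular" metric, sqrt(2 * (1 - cos(u, v))). The function name `angular_distance` and the toy vectors are illustrative assumptions, not code from this project; in the real system the inputs would be TF-IDF vectors, and rejecting zero vectors is a choice made only for this sketch.

```python
import math


def angular_distance(u, v):
    """Return sqrt(2 * (1 - cos(u, v))), ranging from 0 (same direction) to 2 (opposite)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption for this sketch: a zero vector has no direction, so refuse it.
        raise ValueError("angular distance is undefined for a zero vector")
    cosine = dot / (norm_u * norm_v)
    # Clamp to [-1, 1] to absorb floating-point rounding before taking the root.
    cosine = max(-1.0, min(1.0, cosine))
    return math.sqrt(2.0 * (1.0 - cosine))


# Vectors pointing the same way are (numerically) at distance ~0;
# vectors with no shared terms are at distance sqrt(2).
same_direction = angular_distance([1, 2, 0], [2, 4, 0])
no_overlap = angular_distance([1, 0], [0, 1])
```

Smaller values mean more similar messages, which is why a message counts as a match when its distance falls below the threshold.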
python3 -m venv .venv
source .venv/bin/activate # macOS/Linux
.venv\Scripts\activate # Windows
pip install -r requirements.txt
Create a .env file in the project root with the following content:
DB_HOST=<your_db_host>
DB_NAME=<your_db_name>
DB_USERNAME=<your_db_username>
DB_PASSWORD=<your_db_password>
DB_PORT=<your_db_port> # Default is 3306
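As a rough illustration of how these variables might be consumed, the sketch below assembles a SQLAlchemy-style connection URL from them. The helper name `build_database_url` and the `mysql+pymysql` driver prefix are assumptions (the default port 3306 suggests MySQL, but the project's actual driver is not stated here), and it presumes the .env values have already been loaded into the process environment.

```python
import os
from urllib.parse import quote_plus


def build_database_url(env=os.environ):
    """Build a connection URL from the DB_* settings (hypothetical helper)."""
    host = env["DB_HOST"]
    name = env["DB_NAME"]
    # Percent-encode credentials so characters such as "@" or ":" do not break the URL.
    user = quote_plus(env["DB_USERNAME"])
    password = quote_plus(env["DB_PASSWORD"])
    port = env.get("DB_PORT", "3306")  # falls back to the documented default
    # Assumption: MySQL via the PyMySQL driver; adjust the prefix to the driver in use.
    return f"mysql+pymysql://{user}:{password}@{host}:{port}/{name}"


example_url = build_database_url(
    {"DB_HOST": "localhost", "DB_NAME": "chat", "DB_USERNAME": "app", "DB_PASSWORD": "p@ss"}
)
```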
Once everything is set up, you can run the FastAPI server:
uvicorn main:app --reload
The server will run by default at http://127.0.0.1:8000.
After modifying your models, create a migration file to update the database schema. To generate one automatically from the model changes, run the following command:
alembic revision --autogenerate -m "Add your migration description here"
This command detects the differences between the database schema and the SQLAlchemy models, and generates a migration file accordingly.
Once the migration file is generated, you can find it in the alembic/versions directory. Review and edit the migration file to ensure that it contains the correct operations for creating, modifying, or dropping tables.
After generating a migration file, apply the migrations to the database using the following command:
alembic upgrade head
This command applies all pending migrations to update the database schema to the latest version.
Note that this command directly modifies your database schema.
Make sure the migration file is correct before running it.
If you need to revert a migration, you can downgrade the database to the previous version using the following command:
alembic downgrade -1
This command rolls back the most recent migration. If you want to downgrade to a specific version, you can specify the migration revision ID:
alembic downgrade <revision_id>
Receives a message and checks for duplication by comparing it with existing messages in the database.
{
  "room_id": 1234567890,
  "user_id": 1234567890,
  "chat_id": 1234567890,
  "client_message_id": 1234567890,
  "message": "Your message text here",
  "sent_at": 1609459200
}
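A client could assemble this body as in the sketch below, which only builds the request and does not send it. The endpoint path `/messages`, the use of POST, and the helper name `build_message_request` are assumptions made for illustration; check the generated API docs for the real route. The example treats `sent_at` as a Unix timestamp in seconds, which matches the sample value 1609459200 (2021-01-01 00:00:00 UTC).

```python
import json
from datetime import datetime, timezone
from urllib.request import Request


def build_message_request(url, room_id, user_id, chat_id, client_message_id, message, sent_at):
    """Build (but do not send) a JSON request carrying the fields shown above."""
    payload = {
        "room_id": room_id,
        "user_id": user_id,
        "chat_id": chat_id,
        "client_message_id": client_message_id,
        "message": message,
        "sent_at": int(sent_at.timestamp()),  # timezone-aware datetime -> Unix seconds
    }
    body = json.dumps(payload).encode("utf-8")
    # Assumption: the endpoint accepts POST with a JSON body.
    return Request(url, data=body, headers={"Content-Type": "application/json"}, method="POST")


request = build_message_request(
    "http://127.0.0.1:8000/messages",  # hypothetical path, not taken from this README
    room_id=1234567890,
    user_id=1234567890,
    chat_id=1234567890,
    client_message_id=1234567890,
    message="Your message text here",
    sent_at=datetime(2021, 1, 1, tzinfo=timezone.utc),
)
```

Passing the resulting object to `urllib.request.urlopen` would send it once the server is running.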
- On success
{ "chat_id": 0, "message": "message", "subject_id": 0 }
- The message field can have one of the following values: "New message", "Duplicate message", or "Similar message, distance: {distance}".
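Because the outcome is encoded in the message string, a caller may want to turn it into something easier to branch on. The sketch below is one way to do that; the helper name `parse_result` and the status labels are hypothetical, and it assumes `{distance}` is rendered as a plain decimal number.

```python
import re

# Matches the "similar" variant and captures whatever follows "distance: ".
_SIMILAR = re.compile(r"^Similar message, distance: (?P<distance>.+)$")


def parse_result(response):
    """Map a success response to a (status, distance) pair (hypothetical helper)."""
    message = response["message"]
    if message == "New message":
        return ("new", None)
    if message == "Duplicate message":
        return ("duplicate", None)
    match = _SIMILAR.match(message)
    if match:
        # Assumption: the distance is formatted so that float() can parse it.
        return ("similar", float(match.group("distance")))
    raise ValueError(f"unexpected message value: {message!r}")


status, distance = parse_result(
    {"chat_id": 1, "message": "Similar message, distance: 0.42", "subject_id": 7}
)
```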
- If an error occurs
{ "detail": [ { "loc": [ "string", 0 ], "msg": "string", "type": "string" } ] }
- The distance threshold below which a message is considered similar to an existing one can be adjusted in the ValidateService class.
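To show how such a threshold could drive the three response values, here is a minimal sketch. It is not the ValidateService implementation: the function name `classify_message`, the default threshold of 0.5, and the rule that only a distance of exactly 0 counts as a duplicate are all assumptions made for illustration.

```python
def classify_message(nearest_distance, threshold=0.5):
    """Pick a response message from the distance to the nearest stored message.

    nearest_distance is None when there is nothing to compare against.
    The threshold default and the exact-zero duplicate rule are assumptions.
    """
    if nearest_distance is None or nearest_distance >= threshold:
        return "New message"
    if nearest_distance == 0:
        return "Duplicate message"
    return f"Similar message, distance: {nearest_distance}"


results = [classify_message(d) for d in (None, 0.0, 0.2, 0.9)]
```

Lowering the threshold makes matching stricter (fewer messages are reported as similar), while raising it makes matching more permissive.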