Thread Match is designed to use multi-modal semantic similarity search to find similar posts within a subreddit. It combines the text from posts, URLs, and image embeddings to deliver more accurate search results. Video search is currently in development
- Data Pipeline: Utilizes a Dockerized Apache Airflow instance to trigger Apache Kafka, streaming Reddit data from specified subreddits.
- Indexing: Processes the streamed data using PySpark to create a composite HNSW32 FAISS index, serialized and stored in MongoDB with associated metadata.
- Backend: Provides RESTful endpoints with a Dockerized Django application for multi-modal semantic search on the FAISS index.
- Frontend: Implements a single-page React application using Tailwind CSS, served by an Nginx/Node container, offering a seamless and responsive user interface.
- Containerization: Leverages Docker to containerize all components, ensuring consistent and scalable deployments.
- Apache Airflow: Orchestrates the data pipeline.
- Apache Kafka: Streams Reddit data.
- PySpark: Processes data and creates FAISS index.
- FAISS: Provides efficient similarity search indexing.
- MongoDB: Stores the FAISS index and metadata.
- Django: Backend framework for RESTful endpoints.
- React: Frontend framework for building the user interface.
- Tailwind CSS: Utility-first CSS framework for styling.
- Docker: Containerizes all components for consistent deployments.
- Nginx and Node.js: Serve the frontend application.
- Docker
- Docker Compose
-
Clone the repository:
git clone https://github.com/adenletchworth/ThreadMatch.git cd ThreadMatch
-
Create a
.env
file in the root directory and configure your environment variables:AIRFLOW_UID=50000
-
Create a
.env.dev
file in the root directory and configure your development environment variables:DEBUG=1 SECRET_KEY=... DJANGO_ALLOWED_HOSTS=localhost 127.0.0.1 [::1]
-
Configure
dags/configs/kafka.py
with the following settings:bootstrap_servers = 'kafka:9092' topic_name = 'reddit-posts' mongo_uri = 'mongodb://host.docker.internal:27017/' mongo_db_name = 'Reddit' mongo_collection_name = 'posts' mongo_collection_name_posts = 'posts' mongo_collection_name_index = 'faiss_index'
-
Configure
reddit.py
with the following settings:client_id=... client_secret=... user_agent=... subreddits_list=...
-
Start the services using Docker Compose:
docker-compose up --build -d
Access the Apache Airflow web interface at http://localhost:8080 to manage the data pipeline.
The Django backend can be accessed at http://localhost:8000.
The React frontend can be accessed at http://localhost:3001.
Contributions are welcome! Please fork the repository and submit a pull request.