Jatin-Mehra119/CRAWLGPT

CrawlGPT 🤖

A powerful web content crawler with LLM-powered RAG (Retrieval Augmented Generation) capabilities. CrawlGPT extracts content from URLs, processes it through intelligent summarization, and enables natural language interactions using modern LLM technology.

🌟 Key Features

Core Features

  • Intelligent Web Crawling

    • Async web content extraction using Playwright
    • Smart rate limiting and validation
    • Configurable crawling strategies
  • Advanced Content Processing

    • Automatic text chunking and summarization
    • Vector embeddings via FAISS
    • Context-aware response generation
  • Streamlit Chat Interface

    • Clean, responsive UI
    • Real-time content processing
    • Conversation history
    • User authentication
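
As a rough illustration of the chunking and retrieval steps listed above, the sketch below splits text into overlapping word windows and ranks the chunks against a query. It is not CrawlGPT's implementation: the function names (chunk_text, bag_of_words, cosine, retrieve) are hypothetical, and a plain word-count cosine similarity stands in for the sentence-transformers embeddings and FAISS index the project lists, so that the example runs with the standard library alone.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks that overlap by `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def bag_of_words(text):
    """Toy 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query, best match first."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]
```

In a RAG pipeline the retrieved chunks would then be passed to the LLM as context for the user's question. The overlap keeps sentences that straddle a chunk boundary visible in both neighbouring chunks.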

Technical Features

  • Vector Database

    • FAISS-powered similarity search
    • Efficient content retrieval
    • Persistent storage
  • User Management

    • SQLite database backend
    • Secure password hashing
    • Chat history tracking
  • Monitoring & Utils

    • Request metrics collection
    • Progress tracking
    • Data import/export
    • Content validation
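
The README does not say which hashing scheme the user-management layer uses. The following is a minimal sketch of one common approach, assuming salted PBKDF2-HMAC-SHA256 from the standard library and a hypothetical users table in SQLite; the table layout, function names, and iteration count are all assumptions made for illustration.

```python
import hashlib
import hmac
import os
import sqlite3

ITERATIONS = 100_000  # assumed work factor for this sketch


def open_db(path=":memory:"):
    """Open a SQLite database and create the (hypothetical) users table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "username TEXT PRIMARY KEY, salt BLOB NOT NULL, password_hash BLOB NOT NULL)"
    )
    return conn


def hash_password(password, salt=None):
    """Return (salt, digest); a fresh random salt is generated when none is given."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def create_user(conn, username, password):
    """Store the salt and hash, never the plain-text password."""
    salt, digest = hash_password(password)
    conn.execute(
        "INSERT INTO users (username, salt, password_hash) VALUES (?, ?, ?)",
        (username, salt, digest),
    )
    conn.commit()


def verify_user(conn, username, password):
    """Recompute the hash with the stored salt and compare in constant time."""
    row = conn.execute(
        "SELECT salt, password_hash FROM users WHERE username = ?", (username,)
    ).fetchone()
    if row is None:
        return False
    salt, stored = row
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, stored)
```

Parameterized queries (the ? placeholders) keep user input out of the SQL text, and hmac.compare_digest avoids leaking timing information during the comparison.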

🎥 Demo

streamlit-chat_app-2025-01-25-23-01-66.webm

Example of CRAWLGPT in action!

🔧 Requirements

  • Python >= 3.8
  • Operating System: OS Independent
  • Required packages are installed by the setup script (see Quick Start).

🚀 Quick Start

  1. Clone the Repository:

    git clone https://github.com/Jatin-Mehra119/CRAWLGPT.git
    cd CRAWLGPT
    
  2. Run the Setup Script:

    python -m setup_env
    

    This script installs dependencies, creates a virtual environment, and prepares the project.

  3. Update Your Environment Variables:

    • Create or modify the .env file.
    • Add your Groq API key and Ollama API key. Learn how to get API keys.

    GROQ_API_KEY=your_groq_api_key_here
    OLLAMA_API_TOKEN=your_ollama_api_key_here
    
  4. Activate the Virtual Environment:

    source .venv/bin/activate  # On Unix/macOS
    .venv\Scripts\activate  # On Windows
    
  5. Run the Application:

    python -m streamlit run src/crawlgpt/ui/chat_app.py
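
Before launching the app, it can help to confirm that the .env file from step 3 holds real values rather than the placeholders. The sketch below is one way to do that, not part of CrawlGPT: the helper names are hypothetical, and the small hand-written parser stands in for python-dotenv (which the project depends on) so that the example needs only the standard library.

```python
import os

# Key names taken from the .env example in step 3.
REQUIRED_KEYS = ("GROQ_API_KEY", "OLLAMA_API_TOKEN")


def parse_env_file(text):
    """Parse simple KEY=value lines; blank lines and # comments are ignored."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return required keys that are absent, empty, or still placeholders."""
    return [
        key for key in required
        if not values.get(key) or values[key].startswith("your_")
    ]


if __name__ == "__main__":
    text = ""
    if os.path.exists(".env"):
        with open(".env", encoding="utf-8") as fh:
            text = fh.read()
    problems = missing_keys(parse_env_file(text))
    print("Missing or placeholder keys:", problems or "none")
```

Run from the project root, the script reports which of the two keys still need to be filled in.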
    

📦 Dependencies

Core Dependencies

  • streamlit==1.41.1
  • groq==0.15.0
  • sentence-transformers==3.3.1
  • faiss-cpu==1.9.0.post1
  • crawl4ai==0.4.247
  • python-dotenv==1.0.1
  • pydantic==2.10.5
  • aiohttp==3.11.11
  • beautifulsoup4==4.12.3
  • numpy==2.2.0
  • tqdm==4.67.1
  • playwright>=1.41.0
  • asyncio>=3.4.3

Development Dependencies

  • pytest==8.3.4
  • pytest-mockito==0.0.4
  • black==24.2.0
  • isort==5.13.0
  • flake8==7.0.0
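
Because most packages above are pinned to exact versions, a quick check of the active environment against those pins can catch a stale virtual environment. This is a hedged sketch using importlib.metadata from the standard library; the helper names are hypothetical, and PINS holds only a subset of the list for brevity.

```python
from importlib import metadata

# A few of the exact pins listed above; extend as needed.
PINS = ["streamlit==1.41.1", "groq==0.15.0", "faiss-cpu==1.9.0.post1"]


def split_pin(pin):
    """Split 'name==version' into (name, version)."""
    name, sep, version = pin.partition("==")
    if not sep:
        raise ValueError(f"not an exact pin: {pin!r}")
    return name.strip(), version.strip()


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(pins, installed):
    """Return {name: (wanted, found)} for every pin the environment does not satisfy."""
    mismatches = {}
    for pin in pins:
        name, wanted = split_pin(pin)
        found = installed.get(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    names = [split_pin(p)[0] for p in PINS]
    print(find_mismatches(PINS, installed_versions(names)) or "all pins satisfied")
```

Range requirements such as playwright>=1.41.0 are deliberately left out, since comparing them correctly needs a version-aware parser rather than string equality.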

🏗️ Project Structure

crawlgpt/
├── src/
│   └── crawlgpt/
│       ├── core/                         # Core functionality
│       │   ├── database.py                 # SQL database handling
│       │   ├── LLMBasedCrawler.py          # Main crawler implementation
│       │   ├── DatabaseHandler.py          # Vector database (FAISS)
│       │   └── SummaryGenerator.py         # Text summarization
│       ├── ui/                           # User Interface
│       │   ├── chat_app.py                 # Main Streamlit app
│       │   ├── chat_ui.py                  # Development UI
│       │   └── login.py                    # Authentication UI
│       └── utils/                        # Utilities
│           ├── content_validator.py        # URL/content validation
│           ├── data_manager.py             # Import/export handling
│           ├── helper_functions.py         # General helpers
│           ├── monitoring.py               # Metrics collection
│           └── progress.py                 # Progress tracking
├── tests/                                # Test suite
│   └── test_core/
│       ├── test_database_handler.py       # Vector DB tests
│       ├── test_integration.py            # Integration tests
│       ├── test_llm_based_crawler.py      # Crawler tests
│       └── test_summary_generator.py      # Summarizer tests
├── .github/                             # CI/CD
│   └── workflows/
│       └── Push_to_hf.yaml              # HuggingFace sync
├── Docs/
│   └── MiniDoc.md                     # Documentation
├── .dockerignore                      # Docker exclusions
├── .gitignore                         # Git exclusions
├── Dockerfile                         # Container config
├── LICENSE                            # MIT License
├── README.md                          # Project documentation
├── README_hf.md                       # HuggingFace README
├── pyproject.toml                     # Project metadata
├── pytest.ini                         # Test configuration
├── crawlgpt.db                        # Database 
└── setup_env.py                       # Environment setup

🧪 Testing

Run all tests

python -m pytest

The tests include unit tests for core functionality and integration tests for end-to-end workflows.

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🔗 Links

🧡 Acknowledgments

  • Inspired by the potential of GPT models for intelligent content processing.
  • Special thanks to the creators of Crawl4ai, Groq, FAISS, and Playwright for their powerful tools.

👨‍💻 Author

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, open an issue first to discuss your proposal.

  1. Fork the Project.
  2. Create your Feature Branch:
    git checkout -b feature/AmazingFeature
    
  3. Commit your Changes:
    git commit -m 'Add some AmazingFeature'
    
  4. Push to the Branch:
    git push origin feature/AmazingFeature
    
  5. Open a Pull Request.
