This project implements a web scraping API that leverages the Gemini AI model to extract specific information from websites. It provides a user-friendly interface for defining extraction criteria and handles dynamic content and CAPTCHAs using a scraping browser. The API is deployed on Render and is designed for easy integration into various projects.
- Scrapes data from websites, handling dynamic content and CAPTCHAs.
- Uses Gemini AI to precisely extract the requested information.
- Provides a clean JSON output of the extracted data.
- Includes a user-friendly web interface for easy interaction.
- Error handling and retry mechanisms for robust operation.
- Event tracking using GetAnalyzr for monitoring API usage.
- Access the Web Interface: Visit https://webcrawlai.onrender.com/
- Enter the URL: Input the website URL you want to scrape.
- Specify Extraction Prompt: Provide a clear description of the data you need (e.g., "Extract all product names and prices").
- Click "Extract Information": The API will process your request, and the results will be displayed.
This project is deployed as a web application. No local installation is required for usage. However, if you wish to run the code locally, follow these steps:
- Clone the Repository:
git clone https://github.com/YOUR_USERNAME/WebCrawlAI.git cd WebCrawlAI
- Install Dependencies:
pip install -r requirements.txt
- Set Environment Variables: Create a
.env
file (refer to.env.example
) and populate it with yourSBR_WEBDRIVER
(Bright Data Scraping Browser URL) andGEMINI_API_KEY
(Google Gemini API Key). - Run the Application:
python main.py
- Flask (3.0.0): Web framework for building the API.
- BeautifulSoup (4.12.2): HTML/XML parser for extracting data from web pages.
- Selenium (4.16.0): For automating browser interactions, handling dynamic content and CAPTCHAs.
- lxml: Fast and efficient XML and HTML processing library.
- html5lib: For parsing HTML documents.
- python-dotenv (1.0.0): For managing environment variables.
- google-generativeai (0.3.1): Integrates the Gemini AI model for data parsing and extraction.
- axios: JavaScript library for making HTTP requests (client-side).
- marked: JavaScript library for rendering Markdown (client-side).
- Tailwind CSS: Utility-first CSS framework for styling (client-side).
- GetAnalyzr: For event tracking and API usage monitoring.
- Bright Data Scraping Browser: Provides fully-managed, headless browsers for reliable web scraping.
Endpoint: /scrape-and-parse
Method: POST
Request Body (JSON):
{
"url": "https://www.example.com",
"parse_description": "Extract all product names and prices"
}
Response (JSON):
Success:
{
"success": true,
"result": {
"products": [
{"name": "Product A", "price": "$10"},
{"name": "Product B", "price": "$20"}
]
}
}
Error:
{
"error": "An error occurred during scraping or parsing"
}
The project dependencies are listed in requirements.txt
. Use pip install -r requirements.txt
to install them.
Contributions are welcome! Please open an issue or submit a pull request.
No formal testing framework is currently implemented. Testing should be added as part of future development.
README.md was made with Etchr