- Python: Core programming language for backend logic.
- FastAPI: For building a RESTful web API.
- Selenium: For rendering dynamic web pages with JavaScript.
- BeautifulSoup: For parsing and extracting data from HTML.
- Uvicorn: ASGI server for running FastAPI apps.
- Swagger UI and Postman: For testing and validating API requests and responses.
The Dynamic Web Scraper API is a tool for extracting content from web pages. Unlike traditional scrapers that rely on static HTML, this API handles dynamic websites (those with JavaScript-rendered content) by using Selenium WebDriver for rendering and BeautifulSoup for parsing the fully rendered HTML.
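The parsing half of that pipeline can be sketched as follows (a minimal illustration, not the project's actual code; `extract_content` and the sample HTML are hypothetical, and in the real API the HTML would come from Selenium's `driver.page_source` after rendering):

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def extract_content(html: str) -> dict:
    """Parse fully rendered HTML and pull out titles and paragraphs."""
    soup = BeautifulSoup(html, "html.parser")
    titles = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])]
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return {"titles": titles, "paragraphs": paragraphs}

# In the actual API, the HTML string would be driver.page_source once
# Selenium finishes rendering; here a static snippet stands in for it.
sample = "<h1>Breaking News</h1><p>Story text.</p>"
print(extract_content(sample))
```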
It offers an easy-to-use API for fetching web content without requiring the user to manually inspect HTML elements. Simply provide the URL of the page, and the API returns titles and paragraphs in structured JSON format.
- Dynamic Content Handling: Capable of scraping websites that use JavaScript to render content dynamically.
- Headless Browser Automation: Utilizes Selenium WebDriver to render pages in a headless Chrome browser for improved performance.
- API-based Scraping: The API accepts a POST request with the URL and returns JSON-formatted content.
- Error Handling: Comprehensive error messages when issues occur during page loading or scraping.
- Flexible Parsing: Extracts paragraphs, headings, and filters out unwanted text (e.g., "Copyright" or "Terms").
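The filtering step described above can be illustrated with a small helper (a sketch only; the function name and the exact blocklist are assumptions, not the project's actual code):

```python
# Terms whose presence marks a paragraph as boilerplate (assumed list).
BLOCKED_TERMS = ("copyright", "terms")

def filter_paragraphs(paragraphs):
    """Drop empty strings and boilerplate such as copyright/terms notices."""
    return [
        p for p in paragraphs
        if p.strip() and not any(term in p.lower() for term in BLOCKED_TERMS)
    ]

print(filter_paragraphs(["Real article text.", "Copyright 2024", "", "Terms of Use"]))
```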
This scraper works well with websites that use JavaScript to render content, such as:
- News Websites: Scrape articles, headlines, and publication dates.
- E-commerce Sites: Extract product details, descriptions, and prices.
- Dynamic Content: Websites where data is rendered dynamically via JavaScript, such as social media feeds, live scores, etc.
- Clone the repository:
git clone https://github.com/Abhimanyu-Gaurav/Dynamic-Web-Scraper-API.git
- Navigate to the project directory:
cd Dynamic-Web-Scraper-API
- Set up a virtual environment (optional but recommended):
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
- Install the required dependencies:
pip install -r requirements.txt
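Based on the stack listed above, requirements.txt will contain at least something like the following (exact package versions may differ):

```
fastapi
uvicorn
selenium
beautifulsoup4
```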
Run the FastAPI server:
uvicorn main:app --reload
Open your browser (Safari, Chrome, Brave, etc.) and enter the URL:
http://localhost:8000/docs
This should display the Swagger UI for your FastAPI application.
Test the /scrape endpoint directly from the Swagger UI to ensure it is working and accessible.
Use the POST method (/scrape) as follows:
- Click the "Try it out" button.
- Provide a JSON payload in the request body:
{ "url": "https://timesofindia.indiatimes.com/" }
- Click the "Execute" button to send the request.
- If the request succeeds, the scraped data appears in the response section.
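A successful response has roughly this shape (illustrative only; the exact field names depend on the implementation):

```json
{
  "url": "https://timesofindia.indiatimes.com/",
  "titles": ["..."],
  "paragraphs": ["..."]
}
```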
Using Postman:
- Open Postman and click the "New" button to create a new request.
- Set the request type to POST from the dropdown menu next to the URL field.
- Enter the URL in the URL field:
http://localhost:8000/scrape/
- Go to the "Body" tab.
- Select the "raw" option and choose "JSON" from the dropdown.
- Paste the following JSON into the body:
{ "url": "https://timesofindia.indiatimes.com/" }
- Click the "Send" button to execute the request.
- You should see the scraped data in the response section if the request is successful.
Using cURL:
- Open your terminal and run:
curl -X POST "http://localhost:8000/scrape/" -H "Content-Type: application/json" -d '{"url": "https://timesofindia.indiatimes.com/"}'
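The same request can also be made from Python. Below is a minimal sketch using only the standard library (the `scrape` helper is hypothetical, and the FastAPI server must be running locally for the call to succeed):

```python
import json
import urllib.request

def scrape(url: str, endpoint: str = "http://localhost:8000/scrape/") -> dict:
    """POST a JSON body with the target URL and return the parsed response."""
    body = json.dumps({"url": url}).encode("utf-8")
    req = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (requires the server to be running):
# data = scrape("https://timesofindia.indiatimes.com/")
```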
- search: The term to search for (e.g., a business name or type).
- total: The number of listings to retrieve (if available).
- This project is licensed under the MIT License; see the LICENSE file for details.