Parser

Parser is a web parsing tool that handles both static and dynamic web content. Built in Python, it combines Beautiful Soup for parsing HTML with Selenium for JavaScript-heavy pages, so it can crawl sites that render content client-side as well as plain HTML pages.

Features

Multi-threaded Crawling: Leverages threading for efficient crawling of multiple URLs concurrently.

Dynamic Content Handling: Uses Selenium to parse content dynamically loaded by JavaScript.

HTML Parsing: Beautiful Soup is employed for easy and effective parsing of HTML content.

JSON Output: Extracted data is saved in a structured JSON format for easy integration and analysis.

Customizable: Can be tailored to specific crawling and parsing requirements.
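The features above can be illustrated with a minimal sketch. It is not the project's actual main.py or utils.py: the names parse_page, crawl, fake_fetch and FAKE_PAGES are hypothetical, the fetcher is a stand-in that returns canned HTML instead of making network requests, and the standard library's html.parser is swapped in for Beautiful Soup so the example runs without extra packages. In a real run, the fetch argument would be a function that downloads the page, or drives Selenium for JavaScript-rendered content.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class TitleAndLinks(HTMLParser):
    """Collects the <title> text and every href found on <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(url, html):
    """Turn one page's HTML into a JSON-serializable dict."""
    parser = TitleAndLinks()
    parser.feed(html)
    return {"url": url, "title": parser.title.strip(), "links": parser.links}


def crawl(urls, fetch, max_workers=4):
    """Fetch and parse several URLs concurrently on a thread pool.

    `fetch` is any callable that maps a URL to an HTML string.
    Results come back in the same order as `urls`.
    """
    def work(url):
        return parse_page(url, fetch(url))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(work, urls))


# Hypothetical canned pages so the sketch runs offline.
FAKE_PAGES = {
    "https://example.com/": "<html><head><title>Example</title></head>"
                            "<body><a href='/about'>About</a></body></html>",
    "https://example.com/about": "<html><head><title>About</title></head>"
                                 "<body><p>No links here.</p></body></html>",
}


def fake_fetch(url):
    """Stand-in for a real HTTP or Selenium fetch."""
    return FAKE_PAGES[url]


results = crawl(list(FAKE_PAGES), fake_fetch)
output = json.dumps(results, indent=2)
print(output)
```

Passing the fetcher in as an argument keeps the threading and parsing logic independent of how a page is retrieved, which is one way to support both static and dynamic pages behind the same crawl function.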

Setup

Clone the repository: git clone https://github.com/roshanlam/Parser.git

Navigate to the project directory: cd Parser

Install required packages: pip install -r requirements.txt

Usage

Add the URLs you want to crawl to links.txt, one URL per line.

Run the script: python main.py

Check the data directory for parsed JSON results.
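The usage steps can be sketched as follows, assuming links.txt holds one URL per line and each result is written as its own JSON file under data. The helper names (read_links, result_path, save_result) and the file naming scheme are hypothetical; the actual script may name or group its output files differently. The example works inside a temporary directory so it does not touch real project files.

```python
import json
import re
import tempfile
from pathlib import Path


def read_links(path):
    """Return the non-empty lines of a links file, one URL per line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def result_path(data_dir, url):
    """Build a filesystem-safe JSON filename from a URL (assumed scheme)."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_")
    return Path(data_dir) / f"{slug}.json"


def save_result(data_dir, result):
    """Write one parsed result to the data directory and return its path."""
    path = result_path(data_dir, result["url"])
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    # Write a sample links file, including a blank line that should be skipped.
    links_file = Path(tmp) / "links.txt"
    links_file.write_text(
        "https://example.com/a\n\nhttps://example.com/b\n", encoding="utf-8"
    )

    urls = read_links(links_file)
    # Placeholder results; a real run would parse each page first.
    saved = [
        save_result(Path(tmp) / "data", {"url": url, "title": ""})
        for url in urls
    ]
    saved_names = [path.name for path in saved]
    first_saved = json.loads(saved[0].read_text(encoding="utf-8"))

print(urls)
print(saved_names)
```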
