A high-performance asynchronous web crawler that discovers and fingerprints web platforms in real-time. Built with Python, it efficiently crawls websites, identifies platform technologies, and maintains immediate records of findings.
- Asynchronous Operation: Uses Python's asyncio for concurrent crawling
- Real-time Fingerprinting: Identifies platforms as domains are discovered
- Platform Detection: Currently detects:
- WordPress
- Shopify
- ClickFunnels 2.0
- Kajabi
- Smart Crawling:
- HTML-only processing
- HTTP/HTTPS protocol filtering
- Automatic www subdomain handling
- Live Results: Writes discoveries to file immediately
# Clone the repository
git clone https://github.com/yourusername/platform-fingerprint-crawler
cd platform-fingerprint-crawler
# Install required packages
pip install aiohttp aiofiles beautifulsoup4
# Run the crawler
python crawler.py
Modify these parameters in the main() function:
start_url = 'https://example.com' # Starting point for crawl
crawler = WebCrawler(
max_depth=2, # How deep to crawl
max_concurrent=10, # Maximum concurrent requests
output_file='discovered_domains.txt',
results_file='fingerprint_results.json'
)
example.com
subdomain.example.com
another-domain.com
{
"example.com": {
"platforms": ["WordPress", "Shopify"],
"status": "active",
"last_checked": "2023-XX-XX:XX:XX"
}
}
- Attempts direct domain access first
- Falls back to www prefix if needed
- Records failed attempts with error status
- Continues crawling despite individual domain failures
- Uses connection pooling via aiohttp
- Implements concurrent request limiting
- Filters non-HTML resources early
- Maintains thread-safe file operations
- Fork the repository
- Create your feature branch (
git checkout -b feature/amazing-feature
) - Commit your changes (
git commit -m 'Add amazing feature'
) - Push to the branch (
git push origin feature/amazing-feature
) - Open a Pull Request
Copyright (C) 2024 & Beyond, You'll Click
Coded by www.youll.click