Normalize a URL
-
Updated
Mar 10, 2024 - JavaScript
Normalize a URL
Extract and decompose URLs (including emails, which are conceptually a part of URLs) with robust patterns.
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
Remove clutter from URLs and return a canonicalized version
Get a stable, canonical version of any URL, with DNS and HTTPS checks, redirects, tracker stripping, and canonical link extraction!
🔗🧹 Normalize URLs to a standardized form. HTTPS by default, flexible configuration, custom protocols, domain extraction, humazing URL, and punycode support. Both CJS & ESM modules available.
Allows you to remove ad/tracking query params from a given URL in Scala
URL normalizer to canonicalize (standardize) the text representation of a URL to determine if differently-formatted URLs are identical
A simple plugin to rewrite multiple domains or subdomains to a single domain in the site's HTML output.
Natural Language Precessing related notebooks (Machine Learning)
🔗 Pathor is a PHP library for normalizing, analyzing, and comparing URLs.
http url normalization for web crawlers
A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.
Add a description, image, and links to the url-normalization topic page so that developers can more easily learn about it.
To associate your repository with the url-normalization topic, visit your repo's landing page and select "manage topics."