This project implements a Support Vector Machine (SVM)-based focused crawler to classify blog URLs into specific categories such as Technology, Travel, and Lifestyle. The implementation follows the methodology outlined in the paper:
Baweja, V.R., Bhatia, R., Kumar, M. (2020). Support Vector Machine-Based Focused Crawler. In: Ranganathan, G., Chen, J., Rocha, Á. (eds) Inventive Communication and Computational Technologies. Lecture Notes in Networks and Systems, vol 89. Springer, Singapore. https://doi.org/10.1007/978-981-15-0146-3_63
- Overview
- Installation
- Usage
- Data Collection
- Preprocessing
- Feature Extraction
- Model Training
- Classification
- Evaluation
- Contributing
- License
The objective of this project is to classify blog URLs into predefined categories using an SVM classifier. By analyzing the structure and content of URLs, the model predicts the category to which a given URL belongs. This approach is beneficial for focused crawling, content categorization, and information retrieval tasks.
To set up the project environment, follow these steps:
- Clone the repository:

  ```bash
  git clone git@github.com:stacksapien/svm-based-web-crawler.git
  cd svm-based-web-crawler
  ```

- Create a virtual environment:

  ```bash
  python -m venv env
  ```

- Activate the virtual environment:

  - On Windows:

    ```
    .\env\Scripts\activate
    ```

  - On macOS/Linux:

    ```bash
    source env/bin/activate
    ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
After setting up the environment, you can run the URL classification script:
```bash
python main.py
```
This script will preprocess the URLs, extract features, train the SVM model, and classify new URLs into their respective categories.
The dataset comprises sample blog URLs categorized into Technology, Travel, and Lifestyle. Each URL is labeled accordingly to serve as training data for the SVM classifier.
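For illustration, the training data can be thought of as (URL, label) pairs; the URLs below are made-up placeholders that only sketch the expected shape of the dataset, not its actual contents:

```python
# Illustrative examples only; the real dataset in this repository may differ.
labeled_urls = [
    ("https://www.techblog.com/2023/ai-trends-in-software", "Technology"),
    ("https://www.wanderlustdiary.com/backpacking-through-peru", "Travel"),
    ("https://www.dailybalance.com/minimalist-morning-routines", "Lifestyle"),
]

urls, labels = zip(*labeled_urls)  # separate inputs and target categories
```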
Preprocessing involves cleaning and tokenizing URLs:
- Removing protocols (`http`, `https`) and `www`.
- Replacing non-alphanumeric characters with spaces.
- Converting text to lowercase.
This standardization facilitates effective feature extraction.
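A minimal sketch of these preprocessing steps (the function name `preprocess_url` is illustrative and may not match the repository's code):

```python
import re

def preprocess_url(url: str) -> str:
    """Strip the protocol and 'www', replace non-alphanumerics with spaces, lowercase."""
    url = re.sub(r"^https?://", "", url)       # remove http:// or https://
    url = re.sub(r"^www\.", "", url)           # remove a leading www.
    url = re.sub(r"[^a-zA-Z0-9]+", " ", url)   # non-alphanumeric characters -> spaces
    return url.lower().strip()

print(preprocess_url("https://www.techblog.com/2023/ai-trends-in-software"))
# -> "techblog com 2023 ai trends in software"
```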
The preprocessed URLs are transformed into numerical feature vectors using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. This technique quantifies the importance of terms within the URLs, enabling the SVM classifier to discern patterns associated with each category.
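For example, this can be done with scikit-learn's `TfidfVectorizer` applied to the already-preprocessed URL strings (the sample strings below are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Preprocessed URL strings, i.e. the output of the preprocessing step above.
processed_urls = [
    "techblog com 2023 ai trends in software",
    "wanderlustdiary com backpacking through peru",
    "dailybalance com minimalist morning routines",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(processed_urls)   # sparse matrix: one TF-IDF vector per URL
print(X.shape)                                 # (number_of_urls, vocabulary_size)
print(vectorizer.get_feature_names_out()[:5])  # first few terms in the vocabulary
```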
The dataset is divided into training and testing sets. An SVM classifier with a linear kernel is trained on the training data to learn the distinctions between categories.
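A rough sketch of this step with scikit-learn, using toy data in place of the real dataset used by `main.py`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy preprocessed URLs and labels; the actual data is larger.
urls = [
    "techblog com ai trends", "techblog com python tips",
    "wanderlustdiary com peru hiking", "wanderlustdiary com tokyo guide",
    "dailybalance com morning routines", "dailybalance com healthy habits",
]
labels = ["Technology", "Technology", "Travel", "Travel", "Lifestyle", "Lifestyle"]

# Vectorize, then hold out part of the data for evaluation.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(urls)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42
)

# Linear-kernel SVM, as described above.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)
```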
The trained SVM model predicts the category of new URLs based on their structural features. The `classify_url` function processes a new URL and outputs its predicted category.
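One possible shape for such a function is sketched below; the actual `classify_url` in this repository may take different arguments. It assumes a fitted `vectorizer` and `clf` from the training step and the `preprocess_url` helper sketched earlier:

```python
def classify_url(url: str, vectorizer, clf) -> str:
    """Preprocess a raw URL, vectorize it, and return the predicted category."""
    cleaned = preprocess_url(url)              # cleaning/tokenization step from above
    features = vectorizer.transform([cleaned]) # TF-IDF features for this single URL
    return clf.predict(features)[0]

# Example call (assumes the fitted objects from the training sketch):
# classify_url("https://www.techblog.com/new-gpu-benchmarks", vectorizer, clf)
```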
The model's performance is evaluated using metrics such as accuracy, precision, recall, and F1-score. These metrics provide insights into the classifier's effectiveness in categorizing URLs.
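For instance, these metrics can be computed with scikit-learn on the held-out test split (assuming `clf`, `X_test`, and `y_test` from the training sketch above):

```python
from sklearn.metrics import accuracy_score, classification_report

y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))  # per-class precision, recall, and F1-score
```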
Contributions are welcome! If you have suggestions for improvements or new features, please open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.