
Scrapy_pollution

License: MIT

Here you can find my first web scraping project.

⭐ Data analysis results:

  • Pollution level in PM2.5:

(plot: PM2.5 pollution levels)

🔗 More details: https://github.com/lajobu/Scrapy_pollution/blob/master/Analysis.py

⭐ Details:

📍 Website: https://openaq.org/

📍 Code language: Python 3

📍 Scraper: scrapy

📍 Libraries: Numpy, Pandas 🐼, Seaborn 📊, and Matplotlib

📍 Additional tools: Docker and scrapy-splash
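These two tools usually have to be wired together in the project's settings.py. A minimal sketch of the standard scrapy-splash configuration (the exact values in this repo may differ; the Splash service itself runs in Docker):

```python
# settings.py — standard scrapy-splash wiring (sketch).
# Splash itself runs in Docker, e.g.:
#   docker run -p 8050:8050 scrapinghub/splash
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

With this in place, spiders can yield `scrapy_splash.SplashRequest` instead of a plain `scrapy.Request` so that JavaScript-rendered pages are rendered by Splash before parsing.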

❓ What is web scraping?

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Source: 🔗 Wikipedia
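The idea can be illustrated without scrapy at all. Below is a minimal sketch using only the Python standard library, extracting (city, PM2.5) pairs from an HTML fragment — the markup here is invented for illustration and is not openaq.org's real structure:

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment — openaq.org's real markup differs.
HTML = """
<table>
  <tr><td class="city">Warsaw</td><td class="pm25">21.4</td></tr>
  <tr><td class="city">Madrid</td><td class="pm25">9.8</td></tr>
</table>
"""

class Pm25Parser(HTMLParser):
    """Collect (city, pm25) pairs from <td class="city">/<td class="pm25"> cells."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None   # class of the <td> we are currently inside, if any
        self._city = None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._field = dict(attrs).get("class")

    def handle_data(self, data):
        if self._field == "city":
            self._city = data.strip()
        elif self._field == "pm25":
            self.rows.append((self._city, float(data.strip())))

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None

parser = Pm25Parser()
parser.feed(HTML)
print(parser.rows)  # [('Warsaw', 21.4), ('Madrid', 9.8)]
```

A framework like scrapy replaces this manual parsing with CSS/XPath selectors and adds crawling, throttling, and export to CSV/JSON out of the box.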

⭐ User manual:

☑️ 1) Spider to be run: link_country

☑️ 2) Spider to be run: pages

  • $ scrapy crawl pages -o Data/Links/pages.csv
  • It generates 🔗 pages.csv, script: 🔗 pages.py

☑️ 3) Spider to be run: pollution

☑️ 4) Python script to be run: Analysis.py
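To give a rough idea of what the last step computes (the real Analysis.py uses Pandas and Seaborn; this is a stdlib-only sketch on invented sample data, and the real column names may differ):

```python
import csv
import io
import statistics

# Invented sample in the shape of a scraped pollution CSV;
# the files under Data/ in this repo may use different columns.
RAW = """country,city,pm25
PL,Warsaw,21.4
PL,Krakow,30.1
ES,Madrid,9.8
ES,Barcelona,12.2
"""

# Group PM2.5 readings by country
by_country = {}
for row in csv.DictReader(io.StringIO(RAW)):
    by_country.setdefault(row["country"], []).append(float(row["pm25"]))

# Mean PM2.5 per country, most polluted first
means = sorted(((statistics.mean(v), k) for k, v in by_country.items()),
               reverse=True)
for mean, country in means:
    print(f"{country}: {mean:.1f}")
```

The real script would load the scraped CSVs with Pandas and draw the comparison plot with Seaborn/Matplotlib instead of printing text.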