scrap_env

Environment to deploy Scrapy spiders on Scrapinghub. This project is intended to be educational, showing different things that can be accomplished with Scrapy.

Table of Contents

  • Initializing Environment
  • Creating Spiders
  • Running Spiders
  • Folder Structure
  • Types of Crawlers
  • Tips

Initializing Environment

Here's all the information regarding the proper setup and initialization of the project.

  • Clone the scrap_env repository.
  • Switch to the release branch.
  • Install Python 3.
  • Install pip3: apt install python3-pip
  • Install Scrapy: pip3 install Scrapy

To get started with virtual environments, see the virtualenv installation instructions. To install it globally (having it installed globally actually helps here), run:

$ [sudo] pip install virtualenv

See the virtualenv user guide on how to create your virtualenv: https://virtualenv.pypa.io/en/stable/userguide/

Creating Spiders

To start a new spider from the CrawlSpider template, run the following from the project root:

scrapy genspider -t crawl spidername mydomain.com
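
For reference, the generated spider looks roughly like the sketch below; the class name, domain, link-extraction rule, and extracted field are placeholders derived from the genspider arguments, not code taken from this repository.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SpidernameSpider(CrawlSpider):
    # Name used with `scrapy crawl spidername`
    name = "spidername"
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://mydomain.com/"]

    # Follow links matching the pattern and pass each matched page to parse_item
    rules = (
        Rule(LinkExtractor(allow=r"Items/"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # Extract fields with XPath/CSS selectors and yield them as an item
        item = {}
        item["title"] = response.xpath("//title/text()").get()
        yield item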

Running Spiders

To run the spiders from the command line, navigate to the scrap_env folder and use any of the commands below.

Commands
  • Limit the crawl by number of scraped items: scrapy crawl spidername --set CLOSESPIDER_ITEMCOUNT=50
  • Run the full crawl: scrapy crawl spidername
  • Generate JSON output: scrapy crawl quotes -o quotes.json
  • Generate JSON Lines output: scrapy crawl quotes -o quotes.jl
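
Spiders can also be launched from a Python script instead of the console. The following is a minimal sketch using Scrapy's CrawlerProcess, run from the project root; "spidername" is a placeholder, and the item limit mirrors the --set example above.

# run_spider.py — minimal sketch for running a spider programmatically.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# Mirrors `--set CLOSESPIDER_ITEMCOUNT=50` from the console example above
settings.set("CLOSESPIDER_ITEMCOUNT", 50)

process = CrawlerProcess(settings)
process.crawl("spidername")  # same name as used with `scrapy crawl`
process.start()              # blocks until the crawl is finished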

Folder Structure

scrap_env
      - scrap_env
              - spiders
                      - __init__.py
                      - ExampleSpider1.py
                      - ExampleSpider2.py
                      - ... 
              - __init__.py
              - items.py
              - middlewares.py
              - pipelines.py
              - settings.py
      - .gitignore
      - README.md
      - scrapy.cfg

The structure is the same as the one you get when creating a new Scrapy project.
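
As a quick orientation, items.py declares the fields of the items the spiders yield; pipelines.py and middlewares.py hold item pipelines and spider/downloader middlewares, and settings.py holds project-wide configuration. The sketch below is only illustrative; the item class and field names are hypothetical, not taken from this repository.

# items.py — illustrative sketch; class and field names are hypothetical.
import scrapy


class ScrapEnvItem(scrapy.Item):
    # Each Field declares one attribute a spider can populate and yield
    title = scrapy.Field()
    url = scrapy.Field()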

Types of Crawlers

Basic Template

To be included

CrawlSpider

To be included

Database Connected

To be included

Tips

To be included
