- Dependencies:
  - Install with

    ```
    pip install -r requirements.txt --user
    ```

- Run locally:
  - `cd` to the root folder of the webcrawler
  - Execute

    ```
    python -m web.frontend
    ```

  - Open your browser and go to `http://0.0.0.0:5000/login`
  - User: `mood-user`, password: `bla`. Change that to whatever you like.
- Run with Docker:
  - `cd` to the root folder of the webcrawler
  - Execute

    ```
    docker build -t webcrawler:latest .
    ```

  - Execute

    ```
    docker run -it -p 5000:5000 webcrawler
    ```

  - Open your browser and go to `http://0.0.0.0:5000/login`
  - User: `mood-user`, password: `bla`. Change that to whatever you like.
To create a module, create a new Python package in `webcrawler/spiders`. Name it any way you want. It has to contain a class that implements one of Scrapy's crawler interfaces, such as `CrawlSpider`.

To make it work with the webcrawler, import the following into the module's spider:

```python
from spiders.RegisteredModules import REGISTERED_MODULES, register_module
```

Then put `@register_module` above the class definition. If the module implements a method that Scrapy can recognize, such as `parse` or `start_requests`, it should start fine.
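For orientation, here is a minimal sketch of a registered module. The package path, class name, and spider name are made up for illustration; only `register_module` and the Scrapy callbacks come from this document.

```python
# spiders/mymodule/mymodule.py -- hypothetical example module
import scrapy

from spiders.RegisteredModules import REGISTERED_MODULES, register_module


@register_module  # adds this spider to REGISTERED_MODULES
class MyModuleSpider(scrapy.Spider):
    name = "mymodule"  # required; see the name/SEARCHSTRINGS section below

    def start_requests(self):
        # Scrapy recognizes start_requests and uses it to kick off the crawl
        yield scrapy.Request("https://example.com", callback=self.parse)

    def parse(self, response):
        # parse is Scrapy's default callback for downloaded pages
        self.logger.info("Fetched %s", response.url)
```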
Since modules must implement the `name` variable, we make use of it: modules are required to set the following in the crawler's `__init__` function:

```python
from SearchStrings import SEARCHSTRINGS
from spiders.settings import custom_settings


class Module:
    name = "NAME"
    custom_settings = custom_settings

    def __init__(self):
        self.query = SEARCHSTRINGS[Module.name]
```

It is important to reference the module name and pull the query from the `SEARCHSTRINGS` collection! The instance variable itself can have any name; what matters is where it gets its data from.

The `custom_settings` import and declaration keeps the spider working instead of getting blocked by remote sites, so do that (a sketch of plausible contents follows below).

The spider does not have to be named `"NAME"`; that is just an example.
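To make this concrete: `SEARCHSTRINGS` is a mapping from module names to search queries, and `custom_settings` is a Scrapy settings dict shared by all modules. The concrete keys and values below are assumptions for illustration, not the project's actual contents (the setting names themselves are standard Scrapy settings).

```python
# SearchStrings.py -- hypothetical contents, keyed by each module's name
SEARCHSTRINGS = {
    "NAME": ["mood", "sentiment"],  # assumed query terms for the example above
    "twitter": ["#mood"],
}

# spiders/settings.py -- assumed shared settings that make spiders look less
# like bots, so remote sites are less likely to block them
custom_settings = {
    "USER_AGENT": "Mozilla/5.0 (compatible; webcrawler)",  # browser-like agent
    "DOWNLOAD_DELAY": 1,           # throttle requests per domain
    "AUTOTHROTTLE_ENABLED": True,  # adapt crawl speed to server load
}
```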
For further information, look at the Twitter module.

- Twitter module pretty much copied from: https://github.com/jonbakerfish/TweetScraper
- Scrapy documentation: https://doc.scrapy.org/en/latest/
A module should always output in the following format:

```json
{
    "postContent": {
        "user": "USERNAME",
        "text": "RAW TEXT",
        "cleanText": "CLEANED TEXT"
    },
    "meta": {
        "URL": "http://www.heinz-wackelpudding.de",
        "dateTime": "11.11.2011",
        "lang": "en"
    },
    "siteSpecific": {
        "MODULENAME": {
            "here": "bla",
            "are": "bla",
            "some": "bla",
            "site Specific": "bla",
            "informations": "bla",
            "like": "bla",
            "retweets": "bla"
        }
    },
    "sentimentData": {
        "Whatever": "bla",
        "needs": "bla",
        "to": "bla",
        "be": "bla",
        "here": "bla!!!"
    }
}
```
This output can be ensured by defining items and item loaders according to this format. In the following examples, `MODULENAME` is to be replaced with the actual module name!

```python
from spiders.items import Item, ItemLoader, TakeFirst, Field


class MODULENAMESpecific(Item):
    ID = Field()         # tweet id
    user_id = Field()    # user id
    retweets = Field()   # number of retweets
    favorites = Field()  # number of favorites
    replies = Field()    # number of replies


class SiteSpecific(Item):
    # the field name must match the module name used with add_value() below
    MODULENAME = Field(serializer=MODULENAMESpecific)


class MODULENAMESpecificLoader(ItemLoader):
    default_item_class = MODULENAMESpecific
    default_output_processor = TakeFirst()


class SiteSpecificLoader(ItemLoader):
    default_item_class = SiteSpecific
    default_output_processor = TakeFirst()
```
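The shared top-level items imported below are not shown in this document; judging from those imports and the output format above, `spiders/items.py` plausibly defines them along these lines. This is an inferred sketch, not the actual file:

```python
# spiders/items.py -- assumed shape, inferred from the output format above
from scrapy import Item, Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst  # itemloaders.processors in newer Scrapy


class PostContent(Item):
    user = Field()
    text = Field()
    cleanText = Field()


class Meta(Item):
    URL = Field()
    dateTime = Field()
    lang = Field()


class Main(Item):
    postContent = Field(serializer=PostContent)
    meta = Field(serializer=Meta)
    siteSpecific = Field()
    sentimentData = Field()


class PostContentLoader(ItemLoader):
    default_item_class = PostContent
    default_output_processor = TakeFirst()


class MetaItemsLoader(ItemLoader):
    default_item_class = Meta
    default_output_processor = TakeFirst()


class MainLoader(ItemLoader):
    default_item_class = Main
    default_output_processor = TakeFirst()
```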
Such items can then be populated in the following way:

```python
from spiders.items import MetaItemsLoader, PostContentLoader, MainLoader
from spiders.twitter.items import SiteSpecificLoader, MODULENAMESpecificLoader

main = MainLoader()
post_content = PostContentLoader()
meta = MetaItemsLoader()
siteSpecific = SiteSpecificLoader()
MODULENAMESpecific = MODULENAMESpecificLoader()

# nest the module-specific item under its module name, then assemble the main item
siteSpecific.add_value('MODULENAME', MODULENAMESpecific.load_item())
main.add_value('siteSpecific', siteSpecific.load_item())
main.add_value('postContent', post_content.load_item())
main.add_value('meta', meta.load_item())
yield main.load_item()
```
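Putting it all together, a module's `parse` callback might look roughly like this. The XPath expressions and values are invented placeholders; only the loader names, field names, and nesting order come from the snippets above.

```python
# Hypothetical parse callback assembling the output format; selectors are placeholders
def parse(self, response):
    post_content = PostContentLoader()
    post_content.add_value('user', response.xpath('//span[@class="user"]/text()').get())
    post_content.add_value('text', response.text)

    meta = MetaItemsLoader()
    meta.add_value('URL', response.url)
    meta.add_value('lang', 'en')

    MODULENAMESpecific = MODULENAMESpecificLoader()
    MODULENAMESpecific.add_value('retweets', 0)  # placeholder value
    siteSpecific = SiteSpecificLoader()
    siteSpecific.add_value('MODULENAME', MODULENAMESpecific.load_item())

    main = MainLoader()
    main.add_value('postContent', post_content.load_item())
    main.add_value('meta', meta.load_item())
    main.add_value('siteSpecific', siteSpecific.load_item())
    yield main.load_item()
```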