# News Crawler

A Scrapy crawler that scrapes news articles.

## Prerequisites

- MongoDB

The crawler uses MongoDB to store the crawled data. I have used [MongoDB Atlas](https://www.mongodb.com/cloud/atlas) to store the documents; you can create a free cluster there.

Update the following piece of code in `settings.py` with your MongoDB credentials:

```python
import urllib.parse

# Credentials are fetched from AWS SSM Parameter Store and URL-encoded so
# that special characters are safe to embed in the connection URI.
MONGODB_USERNAME = urllib.parse.quote_plus(
    get_ssm_parameter('MONGODB_USERNAME'))
MONGODB_PWD = urllib.parse.quote_plus(
    get_ssm_parameter('MONGODB_PWD',
                      with_decryption=True))
MONGODB_URI = f"mongodb+srv://{MONGODB_USERNAME}:{MONGODB_PWD}" \
              f"@news-scraper.a2rmv.mongodb.net/" \
              f"ary_news?retryWrites=true&w=majority"
MONGODB_DB = 'ary_news'
```
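
To confirm the credentials work, a minimal connectivity check with `pymongo` might look like the following. This is a sketch, not part of the project; it assumes `pymongo` is installed and reuses the `MONGODB_URI` and `MONGODB_DB` values defined above:

```python
from pymongo import MongoClient

# Minimal sketch: open a connection with the URI built in settings.py and
# list the collections in the configured database.
client = MongoClient(MONGODB_URI)
db = client[MONGODB_DB]
print(db.list_collection_names())  # fails fast if the credentials are wrong
```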

If you are running the application locally, comment out the following code in `settings.py`:

```python
import boto3
from botocore.exceptions import ClientError

ssm = boto3.client('ssm', region_name='us-east-1')


def get_ssm_parameter(name: str, with_decryption=False) -> str:
    """Fetch a parameter value from AWS SSM Parameter Store."""
    try:
        response = ssm.get_parameter(
            Name=name,
            WithDecryption=with_decryption)
        parameter = response['Parameter']['Value']
    except ClientError as error:
        print(error.response['Error']['Code'])
        raise
    return parameter
```
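
Alternatively, instead of commenting the AWS code out, you could swap in a local stub that reads the same parameter names from environment variables. This is a sketch under that assumption, not something the project ships:

```python
import os

# Hypothetical local stand-in for get_ssm_parameter: read the value from an
# environment variable (e.g. `export MONGODB_USERNAME=...`) instead of SSM.
def get_ssm_parameter(name: str, with_decryption=False) -> str:
    return os.environ[name]
```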

## Usage

To run locally:

  1. Create a Python 3.6 virtual environment: `pyenv virtualenv 3.6.1 <your-virtual-env>`
  2. Point `PYTHONPATH` at the project: `export PYTHONPATH=/path/to/news_crawler`
  3. Run the crawler for a date: `make run-local ds=YYYY-MM-DD`

If `make run-local ds=YYYY-MM-DD` fails, run `make clean` and then run it again. The results will be crawled and posted to the MongoDB database.
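
Under the hood, `make run-local` presumably invokes the Scrapy spider with the `ds` argument. A rough Python equivalent using Scrapy's own API might look like this; the spider name `news` is an assumption, not confirmed by the repository:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical equivalent of `make run-local ds=2021-01-15`: run the spider
# for one date, passing `ds` through as a spider argument.
process = CrawlerProcess(get_project_settings())
process.crawl('news', ds='2021-01-15')  # 'news' spider name is an assumption
process.start()  # blocks until the crawl finishes
```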

The crawler scrapes all the news articles posted on the given `ds` date.

## Architecture

*(Architecture diagram)*