
GitHub Scraper Batch

A Python script that scrapes data on GitHub repositories within the organisation, for use on the Digital Landscape, and saves it to an S3 bucket. This repository utilises the scheduled batch module to deploy the service as a batch job on AWS.

This project utilises the GitHub API Package's GraphQL interface to fetch data from GitHub.
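For context, a GraphQL request of the kind the scraper makes might look like the sketch below. The field selection and pagination here are illustrative assumptions, not the repository's actual query.

```python
# Illustrative sketch: page through an organisation's repositories via the
# GitHub GraphQL API. The selected fields are assumptions for demonstration.
import requests

QUERY = """
query($org: String!, $cursor: String) {
  organization(login: $org) {
    repositories(first: 100, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes { name isArchived visibility pushedAt }
    }
  }
}
"""

def fetch_repositories(token: str, org: str) -> list[dict]:
    repos, cursor = [], None
    while True:
        response = requests.post(
            "https://api.github.com/graphql",
            json={"query": QUERY, "variables": {"org": org, "cursor": cursor}},
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        response.raise_for_status()
        page = response.json()["data"]["organization"]["repositories"]
        repos.extend(page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            return repos
        cursor = page["pageInfo"]["endCursor"]
```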

The script is run from the command line with `make run` once setup is complete (see Getting started below).

Prerequisites:

  • Python 3.10+
  • Poetry
  • AWS CLI
  • Make

Getting started

Setup:

```bash
make install
```

Export AWS environment variables:

```bash
export AWS_ACCESS_KEY_ID=<KEY>
export AWS_SECRET_ACCESS_KEY=<SECRET>
export AWS_DEFAULT_REGION=<REGION>
export AWS_SECRET_NAME=/<env>/github-tooling-suite/<onsdigital/ons-innovation>
```
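As context for `AWS_SECRET_NAME`, the sketch below shows the typical way a secret is read from AWS Secrets Manager with boto3. That this secret holds the GitHub App's private key is an assumption, not something the README confirms.

```python
# Minimal sketch: read a secret from AWS Secrets Manager with boto3.
# Assumption: AWS_SECRET_NAME points at the GitHub App's private key.
import os
import boto3

def get_secret(secret_name: str) -> str:
    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_name)["SecretString"]

private_key = get_secret(os.environ["AWS_SECRET_NAME"])
```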

Export GitHub environment variables:

```bash
export GITHUB_APP_CLIENT_ID=<CLIENT_ID>
export GITHUB_ORG=<onsdigital/ons-innovation>
```
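`GITHUB_APP_CLIENT_ID` suggests the script authenticates as a GitHub App. A hedged sketch of that flow using PyGithub directly is below; whether the GitHub API Package used by this repository works exactly this way, and whether the Secrets Manager secret holds the App's private key, are assumptions.

```python
# Hedged sketch of GitHub App authentication, not the repo's confirmed flow.
import os
import boto3
from github import Auth, GithubIntegration

def github_client():
    # Assumption: the Secrets Manager secret holds the App's PEM private key.
    private_key = boto3.client("secretsmanager").get_secret_value(
        SecretId=os.environ["AWS_SECRET_NAME"]
    )["SecretString"]
    auth = Auth.AppAuth(os.environ["GITHUB_APP_CLIENT_ID"], private_key)
    installation = GithubIntegration(auth=auth).get_org_installation(
        os.environ["GITHUB_ORG"]
    )
    return installation.get_github_for_installation()
```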

Export other environment variables:

```bash
export SOURCE_BUCKET=<BUCKET_NAME>
export SOURCE_KEY=<KEY>
export BATCH_SIZE=<BATCH_SIZE>
export ENVIRONMENT=<development/production>
```
  • SOURCE_BUCKET is the S3 bucket that will store the output of the script.
  • SOURCE_KEY is the key of the output file within that bucket.
  • BATCH_SIZE is the number of repositories scraped in each batch.
  • ENVIRONMENT determines where results are saved: development writes locally, production uploads to S3 (see the sketch below).
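A minimal sketch of how these variables plausibly fit together; the `scrape_batch` helper and the local filename are hypothetical, not taken from the repository.

```python
# Hypothetical sketch: process repositories BATCH_SIZE at a time, then save
# the combined results locally in development or to S3 in production.
import json
import os
import boto3

def run(repos: list[dict]) -> None:
    batch_size = int(os.environ["BATCH_SIZE"])
    results = []
    for i in range(0, len(repos), batch_size):
        # scrape_batch is a hypothetical helper standing in for the real scraper.
        results.extend(scrape_batch(repos[i : i + batch_size]))

    body = json.dumps(results)
    if os.environ["ENVIRONMENT"] == "production":
        boto3.client("s3").put_object(
            Bucket=os.environ["SOURCE_BUCKET"],
            Key=os.environ["SOURCE_KEY"],
            Body=body,
        )
    else:
        # Development: write the results to a local file instead of S3.
        with open("repositories.json", "w", encoding="utf-8") as f:
            f.write(body)
```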

Run:

```bash
make run
```

Linting and formatting

Install dev dependencies:

```bash
make install-dev
```

Run lint command:

```bash
make lint
```

Run ruff check:

```bash
make ruff
```

Run pylint:

```bash
make pylint
```

Run black:

```bash
make black
```
