A Python script that scrapes GitHub repositories and saves the data to an S3 bucket. This repository utilises the scheduled batch module to deploy the service as a batch job on AWS.
This project utilises the GitHub API package's GraphQL interface to get data from GitHub.
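For illustration, here is a minimal sketch of the kind of GraphQL query involved, assuming an access token is already available. The query shape, fields, and token handling are assumptions, not the project's actual code:

```python
import requests

# Hypothetical sketch: list repositories for an organisation via the
# GitHub GraphQL API. The fields queried here are illustrative only.
QUERY = """
query($org: String!, $limit: Int!) {
  organization(login: $org) {
    repositories(first: $limit) {
      nodes { name url isArchived }
    }
  }
}
"""

response = requests.post(
    "https://api.github.com/graphql",
    json={"query": QUERY, "variables": {"org": "ONSdigital", "limit": 10}},
    headers={"Authorization": "Bearer <ACCESS_TOKEN>"},  # placeholder token
)
response.raise_for_status()
repos = response.json()["data"]["organization"]["repositories"]["nodes"]
```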
The script is run from the command line and requires the following:
- Python 3.10+
- Poetry
- AWS CLI
- Make
Setup:
make install
Export AWS environment variables:
export AWS_ACCESS_KEY_ID=<KEY>
export AWS_SECRET_ACCESS_KEY=<SECRET>
export AWS_DEFAULT_REGION=<REGION>
export AWS_SECRET_NAME=/<env>/github-tooling-suite/<onsdigital/ons-innovation>
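As a rough illustration of how AWS_SECRET_NAME might be used, here is a minimal sketch that fetches a secret from AWS Secrets Manager with boto3. The secret's contents and this usage are assumptions, not the project's confirmed code:

```python
import os
import boto3

# Hypothetical sketch: read the secret named by AWS_SECRET_NAME from
# AWS Secrets Manager. The region is taken from AWS_DEFAULT_REGION.
client = boto3.client("secretsmanager")
response = client.get_secret_value(SecretId=os.environ["AWS_SECRET_NAME"])
secret = response["SecretString"]  # assumed to hold the GitHub App private key
```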
Export GitHub environment variables:
export GITHUB_APP_CLIENT_ID=<CLIENT_ID>
export GITHUB_ORG=<onsdigital/ons-innovation>
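GITHUB_APP_CLIENT_ID suggests the script authenticates as a GitHub App. Below is a minimal sketch of that flow with PyJWT, assuming the App's private key has already been retrieved (for example from Secrets Manager as above); the key source and this flow are assumptions:

```python
import os
import time
import jwt  # PyJWT, with the cryptography package installed for RS256

# Hypothetical sketch: build the short-lived JWT a GitHub App uses to
# authenticate. GitHub accepts the App's client ID as the `iss` claim.
private_key = "<PRIVATE_KEY_PEM>"  # assumed to come from Secrets Manager
now = int(time.time())
payload = {"iat": now - 60, "exp": now + 600, "iss": os.environ["GITHUB_APP_CLIENT_ID"]}
app_jwt = jwt.encode(payload, private_key, algorithm="RS256")
```

The JWT would then be exchanged for an installation access token (POST /app/installations/{installation_id}/access_tokens) before calling the GraphQL API.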
Export other environment variables:
export SOURCE_BUCKET=<BUCKET_NAME>
export SOURCE_KEY=<KEY>
export BATCH_SIZE=<BATCH_SIZE>
export ENVIRONMENT=<development/production>
- SOURCE_BUCKET is the S3 bucket that will store the output of the script.
- SOURCE_KEY is the key of the file that will store the output of the script.
- BATCH_SIZE is the number of repositories scraped in each batch.
- ENVIRONMENT determines where the results are saved: locally in development, or to S3 in production (see the sketch below).
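A minimal sketch of how these variables might drive the save step (the filename and exact behaviour are assumptions):

```python
import json
import os
import boto3

def save_results(results: list[dict]) -> None:
    # Hypothetical sketch: write results locally in development,
    # otherwise upload them to the configured S3 bucket and key.
    body = json.dumps(results, indent=2)
    if os.environ.get("ENVIRONMENT") == "development":
        with open("repositories.json", "w") as f:  # assumed local filename
            f.write(body)
    else:
        boto3.client("s3").put_object(
            Bucket=os.environ["SOURCE_BUCKET"],
            Key=os.environ["SOURCE_KEY"],
            Body=body,
        )
```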
Run:
make run
Install dev dependencies:
make install-dev
Run the lint command:
make lint
Run ruff check:
make ruff
Run pylint:
make pylint
Run black:
make black