This project is designed to scrape the Premier League table data, schedule the scraping job using Apache Airflow, and ingest the data into Google BigQuery. The infrastructure for BigQuery is set up using Terraform, and Airflow is containerized using Docker and orchestrated using a custom docker-compose file.
The project uses Apache Airflow to schedule the Premier League table scraping job, which is implemented as an Airflow DAG (Directed Acyclic Graph). The scraped data is then ingested into Google BigQuery for further analysis and visualization.
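The actual DAG lives in the `dags/` directory (see the project structure below); the following is only a minimal sketch, assuming a daily schedule, a two-task scrape-then-load pipeline, and a hypothetical `league_data.premier_league_table` BigQuery table, to show how the pieces fit together.

```python
# Minimal sketch, not the repository's DAG: a daily scrape task followed by a
# BigQuery load task. The dag_id, schedule, file path, and table name are
# assumptions for illustration only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from google.cloud import bigquery

CSV_PATH = "/tmp/league_table.csv"
TABLE_ID = "league_data.premier_league_table"  # hypothetical dataset.table


def scrape_table():
    # Placeholder for the real scraping logic in dags/scraper.py.
    with open(CSV_PATH, "w") as f:
        f.write("position,team,points\n1,Example FC,0\n")


def load_to_bigquery():
    client = bigquery.Client()  # picks up GOOGLE_APPLICATION_CREDENTIALS
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",
    )
    with open(CSV_PATH, "rb") as f:
        client.load_table_from_file(f, TABLE_ID, job_config=job_config).result()


with DAG(
    dag_id="premier_league_table_scraper",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    scrape = PythonOperator(task_id="scrape_table", python_callable=scrape_table)
    load = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)
    scrape >> load
```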
- `scraper.py`: This file contains the Python code for scraping the Premier League table data from the web (a hedged sketch of this step follows this list).
- `dags/`: This directory contains the Airflow DAG files for scheduling the scraping of the Premier League table data.
- `main.tf`: This file contains the Terraform configuration for setting up the BigQuery infrastructure on GCP.
- `docker-compose.yml`: This file contains the Docker Compose configuration for running the Airflow web server and scheduler.
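The scraping code itself is not reproduced in this README; the snippet below is a minimal sketch of what that step might look like, assuming the standings are published as a static HTML `<table>` at a hypothetical URL and parsed with `pandas.read_html` (the real `scraper.py` may use a different source and parser).

```python
# Minimal sketch of the scraping step, not the repository's scraper.py.
# TABLE_URL is a hypothetical placeholder; the real source site may differ.
import pandas as pd  # pd.read_html also requires lxml or html5lib
import requests

TABLE_URL = "https://example.com/premier-league/table"


def scrape_league_table() -> pd.DataFrame:
    """Fetch the page and parse the first HTML <table> into a DataFrame."""
    response = requests.get(TABLE_URL, timeout=30)
    response.raise_for_status()
    # read_html returns one DataFrame per <table> found on the page.
    return pd.read_html(response.text)[0]


if __name__ == "__main__":
    df = scrape_league_table()
    df.to_csv("league_table.csv", index=False)
    print(df.head())
```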
- Google Cloud Platform (GCP) Account: Ensure you have a GCP account with the necessary permissions to create BigQuery datasets and tables.
- Terraform: Install Terraform on your local machine. Visit Terraform Downloads for installation instructions.
- Docker: Install Docker on your local machine. Visit Docker for installation instructions.
- Docker Compose: Install Docker Compose on your local machine. Visit Docker Compose for installation instructions.
- Clone the repository to your local machine:
git clone https://github.com/Ishan-phys/league-table-scraper.git
cd league-table-scraper
- Create a new project on GCP and enable the BigQuery API. Set up your Google Cloud credentials by following the Google Cloud Authentication documentation, and make sure they have the necessary permissions for BigQuery (a quick credentials check is sketched after this list).
- Configure the Airflow environment: update the `.env` file with your Google Cloud credentials and desired Airflow configurations.
- Build and start the Docker containers:
docker-compose up -d
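Before starting the containers, it can be worth confirming that the credentials work. The snippet below is an optional check, not part of the repository; it assumes `GOOGLE_APPLICATION_CREDENTIALS` and `GOOGLE_PROJECT_ID` are exported as in the `.env` example further down.

```python
# Optional credentials sanity check (not part of the repository).
import os

from google.cloud import bigquery

# The BigQuery client reads GOOGLE_APPLICATION_CREDENTIALS automatically;
# the project ID is passed explicitly from the environment.
client = bigquery.Client(project=os.environ["GOOGLE_PROJECT_ID"])
datasets = list(client.list_datasets())
print(f"Authenticated against {client.project}; found {len(datasets)} dataset(s).")
```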
premier-league-scraper/
│
├── dags/
│   ├── scraper.py
│   ├── data_ingestion_dag.py
│   ├── upload_to_gcs.py
│   └── upload_postgres.py
│
├── main.tf
├── variables.tf
│
├── docker-compose.yml
├── .env
└── ...
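The helper scripts listed under `dags/` are not reproduced in this README. As an illustration, a minimal sketch of what an upload step like `upload_to_gcs.py` could look like is shown below, assuming the scraper writes a local CSV; the bucket and object names are hypothetical.

```python
# Minimal sketch of a GCS upload helper; bucket and object names are hypothetical.
from google.cloud import storage


def upload_to_gcs(local_path: str, bucket_name: str, blob_name: str) -> str:
    """Upload a local file to Google Cloud Storage and return its gs:// URI."""
    client = storage.Client()  # uses GOOGLE_APPLICATION_CREDENTIALS
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)
    return f"gs://{bucket_name}/{blob_name}"


if __name__ == "__main__":
    print(upload_to_gcs("league_table.csv", "premier-league-data", "raw/league_table.csv"))
```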
The `.env` file contains the environment variables for the Airflow web server and scheduler. Update it with your Google Cloud credentials and desired Airflow configurations, for example:
# Google Cloud credentials file path
GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/credentials.json
# Google Cloud project ID
GOOGLE_PROJECT_ID=your-project-id
# Airflow configuration
AIRFLOW_USER=admin
AIRFLOW_PASSWORD=admin
AIRFLOW_API_AUTHENTICATE=true
AIRFLOW_WEB_SERVER_PORT=8080
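Docker Compose injects these values into the Airflow containers as environment variables, so the DAG code can read them at runtime. A small illustration (the fallback values below are assumptions, not project defaults):

```python
# Illustration: reading the .env-provided settings inside a DAG or helper script.
import os

project_id = os.environ.get("GOOGLE_PROJECT_ID", "your-project-id")
credentials_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
webserver_port = int(os.environ.get("AIRFLOW_WEB_SERVER_PORT", "8080"))

print(f"project={project_id} credentials={credentials_path} port={webserver_port}")
```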
- Run the Airflow web server and scheduler:
docker-compose up -d
- Access the Airflow web interface: open a web browser and navigate to http://localhost:8080, then log in using the credentials specified in the `.env` file.
- Trigger the Premier League table scraping DAG manually or wait for the scheduled run (a REST API alternative is sketched after this list).
- Monitor the progress of the scraping job in the Airflow web interface.
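As an alternative to the web UI, a run can be triggered through Airflow's stable REST API (Airflow 2.x). This is only an illustration: the DAG id below is an assumption and must match the id defined in `dags/`, and basic authentication must be enabled for the API.

```python
# Optional: trigger the DAG via Airflow's stable REST API instead of the UI.
import requests

AIRFLOW_URL = "http://localhost:8080"
DAG_ID = "premier_league_table_scraper"  # hypothetical; check the Airflow UI for the real id

response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
    json={"conf": {}},
    auth=("admin", "admin"),  # AIRFLOW_USER / AIRFLOW_PASSWORD from .env
)
response.raise_for_status()
print("Triggered run:", response.json()["dag_run_id"])
```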
This project is licensed under the MIT License.