COMP0235 Coursework Task

Engineering for Data Analysis 1 (COMP0235) Coursework at UCL. This repository implements a distributed data analysis pipeline to process 3D protein models at scale using Merizo Search. The system runs on a five-node cluster, analyzing the human and E. coli datasets and producing summary statistics and parsed results files.

Table of Contents


  1. Setting Up the Cluster Environment
  2. Executing the Pipeline

1. Setting Up the Cluster Environment

Step 1: Generate an SSH Keypair

  1. Generate a new SSH keypair (if you don't already have one) on your local machine or the VM where Terraform will run (see the example after this list):

    ssh-keygen -t ed25519
    • Follow the prompts to specify a key name and, optionally, a passphrase.
  2. Locate the public key:
    • The public key (the .pub file) will be used for VM access via cloud-init.
  3. Use an existing key (if applicable):
    • Ensure the public key you plan to use is accessible and correctly referenced in your Terraform configuration.
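
A minimal end-to-end sketch (the key path ~/.ssh/comp0235 is a placeholder; choose any name you like):

    # Generate a new ed25519 keypair; the file name is a placeholder
    ssh-keygen -t ed25519 -f ~/.ssh/comp0235

    # Print the public key that cloud-init will install on the VMs
    cat ~/.ssh/comp0235.pub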

Step 2: Prepare and Configure Deployment Files

  1. Navigate to the cluster folder:

    cd cluster
  2. Edit the variables.tf file:
    • Update fields such as namespace, network_name, and keyname to align with your specific environment and configuration.
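
To double-check the values before deploying, you can inspect the relevant variable blocks (this grep assumes standard variable "..." declarations in variables.tf):

    # Show the declarations for the variables you need to customize
    grep -A 3 -E 'variable "(namespace|network_name|keyname)"' variables.tf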

Step 3: Build Virtual Machines via Terraform

  1. Prepare Terraform by initializing the working directory and downloading the required providers:

    terraform init
  2. Create the virtual machines:

    terraform apply
    • Review the proposed changes and confirm the deployment when prompted.
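
Optionally, preview the changes first, or override a variable on the command line (the value mykey is a placeholder; overrides must match variables declared in variables.tf):

    # Show the execution plan without applying anything
    terraform plan

    # Override a declared variable at apply time
    terraform apply -var='keyname=mykey'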

Step 4: Configure the Cluster Environment Using Ansible

  1. Make the inventory script executable:

    chmod +x generate_inventory.py
  2. Run the playbook to set up the cluster environment, using the script as a dynamic inventory:

    ansible-playbook -i generate_inventory.py full.yaml
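
Before running the full playbook, it is worth confirming that the inventory resolves and that every node is reachable:

    # List the hosts the dynamic inventory produces
    ansible-inventory -i generate_inventory.py --list

    # Check SSH connectivity to every node
    ansible -i generate_inventory.py all -m ping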

2. Executing the Pipeline

Step 1: Prepare the Host Machine

  1. Access the host machine.
  2. Clone the repository containing the pipeline setup:

    git clone https://github.com/tylersupersad/comp0235-cw.git
  3. Verify that the IP addresses in inventory.ini match their respective machines:

    nano config/inventory.ini
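
For reference, a hypothetical inventory layout is sketched below; the group names and addresses are placeholders, so match whatever groups the playbooks in this repository expect:

    cat config/inventory.ini
    # [storage]        <- placeholder group name
    # 10.0.0.10        <- placeholder IP addresses
    # [workers]
    # 10.0.0.11
    # 10.0.0.12
    # 10.0.0.13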

Step 2: Configure the Storage Machine

  1. Navigate to the playbooks directory:

    cd playbooks
  2. Run a single playbook that:
    • Configures the mount point on the storage machine.
    • Copies the necessary Python scripts and requirements.txt file to the mounted storage.
    • Installs the required dependencies on all worker machines.

    ansible-playbook -i ../config/inventory.ini setup_and_deploy.yml
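
To preview what the playbook would change without touching the machines, Ansible's check mode can be used (tasks whose modules do not support check mode are skipped):

    # Dry run: report changes without applying them
    ansible-playbook -i ../config/inventory.ini setup_and_deploy.yml --check

    # Add verbosity when debugging a failing task
    ansible-playbook -i ../config/inventory.ini setup_and_deploy.yml -v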

Step 3: Prepare Input Data (If Not Already Set Up)

  1. Obtain the human and E. coli datasets by running:

    ansible-playbook -i ../config/inventory.ini prepare_input_data.yml
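
If the data only needs to land on a subset of machines, the run can be restricted with --limit (the group name storage is a placeholder; use a group defined in config/inventory.ini):

    # Restrict the play to one inventory group (hypothetical group name)
    ansible-playbook -i ../config/inventory.ini prepare_input_data.yml --limit storage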

Step 4: Set Up Merizo Search (If Not Already Set Up)

  1. Run the following playbook to configure Merizo Search:

    ansible-playbook -i ../config/inventory.ini merizo_setup.yml
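
A quick ad-hoc check afterwards can confirm the setup landed on every worker; both the group name and the path below are assumptions, so adjust them to your inventory and install location:

    # Verify the Merizo Search checkout exists on each worker (hypothetical path)
    ansible -i ../config/inventory.ini workers -m shell -a 'ls ~/merizo_search'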

Step 5: Run the Pipeline

  1. Execute the pipeline using:

    ansible-playbook -i ../config/inventory.ini run_pipeline.yml
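
For long-running analyses it helps to detach the run from your SSH session and capture a log (plain shell; nothing here is specific to this repository):

    # Keep the run alive across disconnects and save the output
    nohup ansible-playbook -i ../config/inventory.ini run_pipeline.yml > pipeline.log 2>&1 &

    # Follow progress
    tail -f pipeline.log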
