dhaval-d/bq_streaming_inserts

This sample shows how to build a BigQuery table that stores data for a sample rides application. Once the table is ready, run this Python application to populate it with sample data.

The Python application can be run in the following ways, depending on your needs:

  1. Standalone mode: if you need fewer than 1 million records, standalone mode works just fine.
  2. As a GKE job: if you need to ingest millions of records, running a GKE job is the best option.

The following steps build the BigQuery table and run the Python application as a job on a GKE cluster.

Prerequisites:

  1. You have Editor access to a Google Cloud project.
  2. You have installed and configured the gcloud CLI to point to the above project.
  3. You have created a service account key file with BigQuery Editor permissions (a sketch of the gcloud commands follows this list).
  4. Store this file as bq-editor.json.
  5. Your Python environment is set up with all the dependencies from requirements.txt installed.
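
If you still need to create the key for prerequisite 3, a rough sketch follows. It assumes a service account named bq-editor-sa and the BigQuery Data Editor role; both are assumptions, so adapt them to your project.

# Sketch only: the service account name bq-editor-sa and the role are assumptions.
gcloud iam service-accounts create bq-editor-sa

gcloud projects add-iam-policy-binding [YOUR_GCP_PROJECT_NAME] \
  --member="serviceAccount:bq-editor-sa@[YOUR_GCP_PROJECT_NAME].iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"

# Download the key as bq-editor.json (prerequisite 4).
gcloud iam service-accounts keys create bq-editor.json \
  --iam-account=bq-editor-sa@[YOUR_GCP_PROJECT_NAME].iam.gserviceaccount.com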

Step 1: Clone this repository to your local machine using the following command.

git clone https://github.com/dhaval-d/bq_streaming_inserts .

Step 2: Go to the above directory and run the following command to create a BigQuery table.

bq mk --table \
  --schema rides.json \
  --time_partitioning_field insert_date \
  --description "Table with sample rides data" \
  [YOUR_DATASET_NAME].rides
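
Optionally, you can confirm the table, its schema, and its time partitioning were created as expected:

bq show --format=prettyjson [YOUR_DATASET_NAME].rides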

Step 3: Run the following command to set GOOGLE_APPLICATION_CREDENTIALS to the absolute path of your service account key file (an absolute path is also needed for the Docker volume mount in Step 7).

export GOOGLE_APPLICATION_CREDENTIALS="$(pwd)/bq-editor.json"

Step 4: Run the following command to run the python application on your local environment.
python3 app.py \
--project [YOUR_GCP_PROJECT_NAME] \
--dataset [YOUR_DATASET_NAME] \
--table rides \
--batch_size 1 \
--total_batches 1
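
The flag names suggest that batch_size is the number of rows sent per streaming insert request and total_batches is how many such requests are made, so the command above inserts a single record. A hypothetical larger run (the values here are arbitrary) might look like:

python3 app.py \
--project [YOUR_GCP_PROJECT_NAME] \
--dataset [YOUR_DATASET_NAME] \
--table rides \
--batch_size 500 \
--total_batches 100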


Step 5: Change the Dockerfile CMD line (line 13) to point to your project and a BigQuery dataset.
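
As a rough sketch, assuming the CMD flags mirror the app.py invocation from Step 4 (the batch values below are placeholders and the actual Dockerfile in the repository may differ), line 13 might look like:

# Dockerfile CMD sketch: flag values are placeholders, adjust them to your project.
CMD ["python3", "app.py", "--project", "[YOUR_GCP_PROJECT_NAME]", "--dataset", "[YOUR_DATASET_NAME]", "--table", "rides", "--batch_size", "500", "--total_batches", "100"]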

Then build a docker container by using the following command.

docker build -t gcr.io/[YOUR_GCP_PROJECT_NAME]/bq_streaming_demo:v1 .

Step 6: Make sure you can see your container image using the following command.

docker images

Step 7: Run the following Docker command to run your application as a container in your local environment (for testing purposes).

docker run --name bq_streaming \
  -e GOOGLE_APPLICATION_CREDENTIALS=/tmp/keys/bq-editor.json \
  -v $GOOGLE_APPLICATION_CREDENTIALS:/tmp/keys/bq-editor.json:ro \
  gcr.io/[YOUR_GCP_PROJECT_NAME]/bq_streaming_demo:v1
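
To check that the container ran successfully, you can inspect its logs (bq_streaming is the container name given above):

docker logs bq_streaming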


Step 8: Configure Docker to authenticate with your GCP project using the following command.

gcloud auth configure-docker

Step 9: Push your Docker image to the Google Container Registry in your GCP project.

docker push gcr.io/[YOUR_GCP_PROJECT_NAME]/bq_streaming_demo:v1

Step 10: Create a GKE cluster using the following command.

gcloud container clusters create demo-cluster --num-nodes=2

Step 11: Once the cluster is up and running, use the following command to check the status of its nodes.

kubectl get nodes

Step 12: Change the args: line in the deployment.yaml file to refer to your project and dataset. You can also change the completions and parallelism parameters in the file based on how many output records you are trying to generate; a sketch of the relevant fields follows.
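
As a rough sketch of the fields Step 12 refers to (the Job name, batch values, and overall layout are assumptions; the actual deployment.yaml in the repository may differ):

# Hypothetical sketch of deployment.yaml; adjust names and values to your setup.
apiVersion: batch/v1
kind: Job
metadata:
  name: bq-streaming-job
spec:
  completions: 10        # total number of pods that must run to completion
  parallelism: 5         # how many pods run at the same time
  template:
    spec:
      containers:
        - name: bq-streaming
          image: gcr.io/[YOUR_GCP_PROJECT_NAME]/bq_streaming_demo:v1
          args:
            - "--project=[YOUR_GCP_PROJECT_NAME]"
            - "--dataset=[YOUR_DATASET_NAME]"
            - "--table=rides"
            - "--batch_size=500"
            - "--total_batches=100"
      restartPolicy: Never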

Step 13: Once your deployment.yaml is updated, run the following command to start your GKE job.

kubectl apply -f deployment.yaml

Step 14: Go to the GKE console and check the status of your job. Also, go to the BigQuery console and validate whether the job is populating records.
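
Optionally, you can also check the job from the command line; the label selector below assumes the Job is named bq-streaming-job as in the sketch under Step 12, so adjust it to your actual deployment.yaml.

kubectl get jobs
kubectl get pods
kubectl logs -l job-name=bq-streaming-job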
