[AWS Glue](https://aws.amazon.com/glue/) is a serverless data integration service. This guide demonstrates deploying a "hello-world" [processing job](https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html) using Hamilton functions on AWS Glue.

## Prerequisites

- **AWS CLI Setup**: Make sure the AWS CLI is set up on your machine. If you haven't done this yet, no worries! You can follow the [Quick Start guide](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html) for easy setup instructions.

## Step-by-Step Guide

### 1. Build a wheel with the Hamilton functions

First things first: an AWS Glue job runs a single Python script, but you can include external code (like our Hamilton functions) by adding it as a Python wheel. So, let's package our code and get it ready for action.
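
For reference, here's a minimal sketch of what such a module could look like. The package name `hamilton_functions` comes from the wheel filename used below; the function and column names are purely illustrative, not the repo's actual code:

```python
# app/hamilton_functions/functions.py -- hypothetical module layout.
# In Hamilton, each function defines one node in the dataflow: the function
# name is the output it produces, and its parameter names are its dependencies.
import pandas as pd


def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Marketing spend per signup (illustrative column names)."""
    return spend / signups


def spend_zero_mean(spend: pd.Series) -> pd.Series:
    """Spend with the mean subtracted."""
    return spend - spend.mean()
```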

- **Install the build package:**

  This command installs the `build` package, which we'll use to create our Python wheel.

  ```shell
  pip install build
  ```

- **Build the Python wheel:**

  ```shell
  cd app && python -m build --wheel --skip-dependency-check && cd ..
  ```
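
For the build above to produce `hamilton_functions-0.1-py3-none-any.whl`, the `app/` directory needs packaging metadata. Here's a minimal sketch assuming a setuptools layout (the name and version are taken from the wheel filename; everything else is illustrative):

```python
# app/setup.py -- minimal packaging sketch (an assumed layout, not
# necessarily the repo's actual configuration). `python -m build --wheel`
# picks this up and emits dist/hamilton_functions-0.1-py3-none-any.whl.
from setuptools import find_packages, setup

setup(
    name="hamilton_functions",
    version="0.1",
    packages=find_packages(),  # discovers the hamilton_functions/ package
)
```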

### 2. Upload all necessary files to S3

- **Upload the wheel file to S3:**

  Replace `<YOUR_PATH_TO_WHL>` with your specific S3 bucket and path:

  ```shell
  aws s3 cp app/dist/hamilton_functions-0.1-py3-none-any.whl s3://<YOUR_PATH_TO_WHL>/hamilton_functions-0.1-py3-none-any.whl
  ```

- **Upload the main Python script to S3:**

  Replace `<YOUR_PATH_TO_SCRIPT>` with your specific S3 bucket and path (a sketch of what `processing.py` might contain follows this list):

  ```shell
  aws s3 cp processing.py s3://<YOUR_PATH_TO_SCRIPT>/processing.py
  ```

- **Upload input data to S3:**

  Replace `<YOUR_PATH_TO_INPUT_DATA>` with your specific S3 bucket and path:

  ```shell
  aws s3 cp data/input_table.csv s3://<YOUR_PATH_TO_INPUT_DATA>
  ```
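
As promised, here is a minimal sketch of what `processing.py` could look like. This is an assumption about the script's contents, not the repo's actual code: it reads the `--input-table` and `--output-table` arguments passed at job start (see step 4), runs a Hamilton driver over the functions from the wheel, and writes the result back to S3. The module and column names match the hypothetical sketch in step 1:

```python
# processing.py -- hypothetical Glue Python shell driver script (a sketch).
# Relies on the hamilton_functions wheel supplied via --extra-py-files and
# sf-hamilton installed at job start via --additional-python-modules.
import argparse

import pandas as pd
from hamilton import driver

from hamilton_functions import functions


def main() -> None:
    # Glue passes job arguments as "--name value" pairs on the command line;
    # parse_known_args ignores any extra arguments Glue adds by default.
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-table", required=True)
    parser.add_argument("--output-table", required=True)
    args, _ = parser.parse_known_args()

    # pandas can read/write s3:// paths when s3fs is available; assumes the
    # input CSV has the columns the dataflow needs (here: spend, signups).
    df = pd.read_csv(args.input_table)

    dr = driver.Driver({}, functions)
    # Requested outputs are illustrative; ask for your dataflow's outputs.
    result = dr.execute(
        ["spend_per_signup", "spend_zero_mean"],
        inputs=df.to_dict(orient="series"),
    )

    result.to_csv(args.output_table, index=False)


if __name__ == "__main__":
    main()
```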

### 3. Create a simple role for AWS Glue job execution

- **Create the role:**

  ```shell
  aws iam create-role --role-name GlueProcessorRole --assume-role-policy-document '{"Version": "2012-10-17", "Statement": [{ "Effect": "Allow", "Principal": { "Service": "glue.amazonaws.com"}, "Action": "sts:AssumeRole"}]}'
  ```

- **Attach policies to the role:**

  Here we grant full access to S3 as an example. For production environments, it's important to restrict access appropriately.

  ```shell
  aws iam attach-role-policy --role-name GlueProcessorRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
  aws iam attach-role-policy --role-name GlueProcessorRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
  ```

### 4. Create and run the job

- **Create a job:**

  Ensure all placeholder paths are replaced with your actual ones. The wheel is attached via `--extra-py-files`, the `sf-hamilton` package is installed at job start via `--additional-python-modules`, and `--max-capacity 0.0625` requests the smallest Python shell capacity:

  ```shell
  aws glue create-job --name test_hamilton_script --role GlueProcessorRole --command '{"Name" : "pythonshell", "PythonVersion": "3.9", "ScriptLocation" : "s3://<YOUR_PATH_TO_SCRIPT>/processing.py"}' --max-capacity 0.0625 --default-arguments '{"--extra-py-files" : "s3://<YOUR_PATH_TO_WHL>/hamilton_functions-0.1-py3-none-any.whl", "--additional-python-modules" : "sf-hamilton"}'
  ```

- **Run the job:**

  Ensure all placeholder paths are replaced with your actual ones:

  ```shell
  aws glue start-job-run --job-name test_hamilton_script --arguments '{"--input-table" : "s3://<YOUR_PATH_TO_INPUT_DATA>", "--output-table" : "s3://<YOUR_PATH_TO_OUTPUT_DATA>"}'
  ```

  Once you've run the job, you should see an output file at `s3://<YOUR_PATH_TO_OUTPUT_DATA>`.
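
If you'd like to sanity-check the result locally, a quick illustrative snippet (assuming `s3fs` is installed so pandas can read S3 paths; replace the placeholder as above):

```python
# verify_output.py -- optional local check of the job's output (illustrative).
import pandas as pd

df = pd.read_csv("s3://<YOUR_PATH_TO_OUTPUT_DATA>")  # replace the placeholder
print(df.head())
```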