This article was written with Ubuntu as the baseline.
$ docker pull jo1013/pyspark:0.05
$ docker pull jo1013/airflow:0.07
$ docker pull mysql:8.0.17
$ git clone https://github.com/jo1013/pyspark.git
$ cd pyspark
$ docker-compose up ## start the mysql, pyspark, and airflow(postgresql) containers
- (In docker-compose.yml, edit the three containers' volumes to match your own paths.)
$ docker exec -it airflow bash
$ docker exec -it [container name] bash
- Start postgresql
$ service postgresql start
- From the local terminal
$ docker exec -it -d airflow service postgresql start
$ nano /root/airflow/airflow.cfg
# dags_folder = /root/airflow/dags
dags_folder = /home/pyspark/airflow/dags
# base_log_folder = /root/airflow/logs
base_log_folder = /home/pyspark/airflow/logs
# plugins_folder = /root/airflow/plugins
plugins_folder = /home/pyspark/airflow/plugins
# default_timezone = utc
default_timezone = Asia/Seoul
# executor = SequentialExecutor
executor = LocalExecutor
$ airflow webserver
- From the local terminal
$ docker exec -it -d airflow airflow webserver
- Open http://localhost:8090 to see the Airflow UI.
- id : admin
- password : admin
$ cd Airflow
$ docker run -it -d -p 8090:8080 -v ~/workspace:/home -e LC_ALL=C.UTF-8 --name airflow6 jo1013/airflowex:0.06
$ docker run -it -d -p [local port]:[container port] -v [local directory]:[container directory] -e LC_ALL=C.[encoding] --name [name to set] [dockerhubid]/[imagename]:[tag]
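- For reference, the same docker run can be scripted with the Python Docker SDK; below is a minimal sketch under the assumption that the docker pip package is installed (the host path is a placeholder for your own workspace).
# Hypothetical illustration: start the airflow container via the Python Docker SDK.
# Mirrors the docker run command above; adjust names and paths to your setup.
import docker

client = docker.from_env()
container = client.containers.run(
    "jo1013/airflowex:0.06",           # image
    name="airflow6",                   # --name
    detach=True,                       # -d
    tty=True, stdin_open=True,         # -it
    ports={"8080/tcp": 8090},          # -p 8090:8080
    volumes={"/home/me/workspace": {"bind": "/home", "mode": "rw"}},  # -v (placeholder path)
    environment={"LC_ALL": "C.UTF-8"}, # -e
)
print(container.short_id)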
$ sudo su - postgres
$ psql
$ CREATE DATABASE airflow;
$ CREATE USER timmy with ENCRYPTED password '0000';
$ GRANT all privileges on DATABASE airflow to timmy;
$ \c airflow
$ GRANT all privileges on all tables in schema public to timmy;
$ \q
$ exit
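- To sanity-check the role and database that were just created, here is a minimal sketch from Python, assuming psycopg2-binary is installed (credentials follow the values above):
# Minimal connectivity check for the new airflow DB and timmy role.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="airflow",
                        user="timmy", password="0000")
with conn.cursor() as cur:
    cur.execute("SELECT current_database(), current_user;")
    print(cur.fetchone())  # expected: ('airflow', 'timmy')
conn.close()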
$ pg_createcluster 13 main
$ pg_ctlcluster 13 main start
# $ cd /etc/postgresql/13/main
# $ nano pg_hba.conf
# IPv4 local connections:
host all all 0.0.0.0/0 md5
$ service postgresql restart
# sql_alchemy_conn = sqlite:////root/airflow/airflow.db
# sql_alchemy_conn = postgresql+psycopg2://timmy:0000@172.17.0.2/airflow
# This worked with the docker hub (single-container) setup, but under docker-compose the IP address differs from the single container's.
sql_alchemy_conn = postgresql+psycopg2://timmy:0000@localhost/airflow
# --> postgresql runs inside the same docker (container), so change the host to localhost.
- If localhost is written in sql_alchemy_conn while postgres runs in a separate container, Airflow cannot find that container, so put the postgres container's IP there instead.
$ ifconfig
$ cd /etc/postgresql/13/main
$ nano pg_hba.conf
# IPv4 local connections:
host all all 0.0.0.0/0 md5
$ service postgresql restart
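- With remote connections now allowed, the sql_alchemy_conn string from above can be verified before Airflow uses it; a minimal SQLAlchemy sketch (172.17.0.2 is the example address from earlier, substitute the one ifconfig reports):
# Hedged sketch: confirm the connection string reaches postgres.
# Assumes sqlalchemy and psycopg2-binary are installed; replace the IP with yours.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://timmy:0000@172.17.0.2/airflow")
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # prints 1 if reachable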
$ cd Airflow
$ mkdir dags
$ mkdir logs
$ airflow db init
$ docker commit postgres postgres:airflow
$ cd home
$ nano makeuser.py
Move makeuser.py to the ~/airflow_home location
$ cp makeuser.py airflow
# makeuser.py -- creates a web UI user via the Airflow 1.x password_auth backend
import airflow
from airflow import models, settings
from airflow.contrib.auth.backends.password_auth import PasswordUser

user = PasswordUser(models.User())  # wrap a new User in the password-auth backend
user.username = 'sunny'
user.email = 'sunny@test.com'
user.password = 'sunny'
user.superuser = True
session = settings.Session()
session.add(user)
session.commit()
session.close()
exit()
- In Airflow 2.x, the same user can be created with the CLI instead:
$ airflow users create \
--username admin \
--firstname FIRST_NAME \
--lastname LAST_NAME \
--role Admin \
--email admin@example.org
- Initialize the DB (creates Airflow's tables in the postgres 'airflow' database)
airflow db init
- In file system terms, a database cluster is a single directory under which all data will be stored. We call this the data directory or data area. It is completely up to you where you choose to store your data. There is no default, although locations such as /usr/local/pgsql/data or /var/lib/pgsql/data are popular. The data directory must be initialized before being used, using the program initdb which is installed with PostgreSQL.
- The data directory is usually set to a path like the ones in the quote above, but as the quote says, it does not have to be.
---
The following is a brief overview of a few terms used when designing Airflow workflows.
- An Airflow DAG is composed of tasks.
- Each task is created by instantiating an operator class. A configured operator instance becomes a task: my_task = MyOperator(...)
- When a DAG is started, Airflow creates a DAG run entry in its database.
- When a task is executed in the context of a particular DAG run, a task instance is created.
- AIRFLOW_HOME is the directory where DAG definition files and Airflow plugins are stored.
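- To make the terms above concrete, here is a minimal DAG sketch (the file name, dag_id, and task are illustrative only, not part of this setup):
# dags/example_terms.py -- illustrative: one DAG, one operator instance (= one task).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="example_terms",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # trigger manually; each run creates task instances
) as dag:
    # Instantiating an operator class creates a task.
    my_task = PythonOperator(
        task_id="say_hello",
        python_callable=lambda: print("hello"),
    )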
-> Perhaps because of a version difference, following the tutorial from the source above raises an import error. In dags/test_operators.py, change
from airflow.operators import MyFirstOperator
to
from my_operators import MyFirstOperator
(that is, from [filename] import [classname]). It looks like an error caused by the version change.
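- For context, the tutorial's custom operator lives in its own module, roughly like the sketch below (names follow the tutorial; details may differ by Airflow version):
# my_operators.py -- rough sketch of the tutorial's custom operator.
import logging
from airflow.models import BaseOperator

class MyFirstOperator(BaseOperator):
    def __init__(self, my_operator_param, **kwargs):
        super().__init__(**kwargs)
        self.operator_param = my_operator_param

    def execute(self, context):
        # the task's work goes here; this one just logs its parameter
        logging.info("operator_param: %s", self.operator_param)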
$ docker run -it --rm -p 8888:8888 -p 8000:8000 -v ~/workspace:/home jo1013/pyspark:0.05
$ docker exec -it py_spark bash
$ docker exec -it [container id or container name] bash
$ jupyter notebook --allow-root --ip=0.0.0.0 --port=8888 --no-browser
$ docker run --name db-mysql -e MYSQL_DATABASE=testdb -e MYSQL_ROOT_PASSWORD=root -e TZ=Asia/Seoul -d -p 3306:3306 mysql:8.0.17 --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci
$ docker run --name db-mysql -e MYSQL_ROOT_PASSWORD=root -d -p 3306:3306 mysql
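- To confirm the MySQL container answers on the mapped port, a minimal sketch from Python, assuming pymysql is installed (credentials follow the run command above):
# Minimal check that db-mysql is reachable.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=3306,
                       user="root", password="root")
with conn.cursor() as cur:
    cur.execute("SELECT VERSION()")
    print(cur.fetchone())
conn.close()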
$ airflow dags list
$ airflow tasks list [dag_id]
$ jupyter notebook --allow-root --ip=0.0.0.0 --port=8888 --no-browser
Check the IP address from the local terminal
$ ifconfig
* Before that, ssh must be installed on both this container and the container receiving commands, and in the receiving container's /etc/ssh/sshd_config, PermitRootLogin must be changed to yes (never do this in practice: security risk)
- Set up an SSH Connection: store the connection info in the Airflow DB so operators can load and use it.
- Additionally, if possible, using an external secret manager service is recommended.
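- As an illustration of that pattern, a task can reference the stored connection by conn_id; a minimal sketch assuming the apache-airflow-providers-ssh package is installed and a connection was saved as my_ssh_conn (both names are placeholders):
# Hedged sketch: run a remote command over an SSH Connection stored in the Airflow DB.
from datetime import datetime
from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(dag_id="ssh_example", start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:
    run_remote = SSHOperator(
        task_id="run_remote",
        ssh_conn_id="my_ssh_conn",  # placeholder conn_id created in the UI or CLI
        command="echo hello from the remote container",
    )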
$ ssh-keygen -t rsa
- id_rsa: private key
- id_rsa.pub: public key (append its contents to ~/.ssh/authorized_keys on the receiving container to allow key-based login)
Add a user on the running container
$ useradd airflow
$ useradd [user]
$ su - airflow
$ mkdir .ssh
$ chown -R airflow:airflow /home/workspace
$ chown -R [account]:[account] [subdirectory path]
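- Once the user, keys, and permissions are in place, key-based login can be tested from Python before wiring it into Airflow; a minimal paramiko sketch (the host IP and key path are placeholders for your containers):
# Hedged sketch: verify key-based SSH login works, assuming paramiko is installed.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("172.17.0.3", username="airflow",
               key_filename="/home/airflow/.ssh/id_rsa")  # placeholder values
stdin, stdout, stderr = client.exec_command("hostname")
print(stdout.read().decode())
client.close()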