
Introduction

The purpose of this repository is to set up virtual clusters on a single machine for learning big-data tools.

The cluster design is 1 master + 2 slaves.

1. Build Hadoop Docker Image

1.1 Hadoop Installation

  1. Pull the ubuntu image from Docker Hub

    docker pull ubuntu
  2. Run the ubuntu image and mount a local folder into the container

    docker run -it -v C:\hadoop\build:/root/build --name ubuntu ubuntu
  3. Enter the ubuntu container, update the package index, and install the necessary tools

    # Update system
    apt-get update
    # Install vim
    apt-get install vim
    # Install ssh (needed to connect to the slaves)
    apt-get install ssh
  4. Add the following to ~/.bashrc to start the sshd service automatically

    vim ~/.bashrc
    # Add the following
    /etc/init.d/ssh start
    # save and quit
    :wq
  5. Set up passwordless SSH to the workers

    # Enter the following and keep pressing enter
    ssh-keygen -t rsa
    # save to authorized keys
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  6. Test ssh connection

    ssh localhost

  7. Install the Java JDK

    # Use JDK 8 because Hive only supports up to Java 8
    apt-get install openjdk-8-jdk
    # Set up the environment variables
    vim ~/.bashrc
    # Add the following
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
    export PATH=$PATH:$JAVA_HOME/bin
    # save and quit
    :wq
    # Make .bashrc effective immediately
    source ~/.bashrc
  8. Save this container as a jdkinstalled image for later use

    # Commit the ubuntu container and name the new image ubuntu/jdkinstalled
    docker commit ubuntu ubuntu/jdkinstalled
  9. Run the ubuntu/jdkinstalled image and mount the local folder into the container

    docker run -it -v C:\hadoop\build:/root/build --name ubuntu-jdkinstalled ubuntu/jdkinstalled
  10. Download the Hadoop installation package from Apache (I use Hadoop 3.3.3) and put it in C:\hadoop\build

    https://dlcdn.apache.org/hadoop/common/stable/

  11. Enter the jdkinstalled container

    docker exec -it ubuntu-jdkinstalled bash
    # switch to build and look for the installation package
    cd /root/build
    # extract the package to /usr/local
    tar -zxvf hadoop-3.3.3.tar.gz -C /usr/local
  12. Switch to the Hadoop folder

    # switch to hadoop folder
    cd /usr/local/hadoop-3.3.3/
    # Check that Hadoop is installed properly
    ./bin/hadoop version
  13. Setup Hadoop environment variables

    vim /etc/profile
    # Add the following
    export HADOOP_HOME=/usr/local/hadoop-3.3.3
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    # save and quit
    :wq
    
    vim ~/.bashrc
    # Add the following
    source /etc/profile
    # save and quit
    :wq
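
Before moving on, a quick sanity check of the installation can be run (a small sketch; paths and versions assume the steps above):

# Run in a new shell, or after `source ~/.bashrc`
java -version      # should report an openjdk 1.8.0 build
hadoop version     # should report Hadoop 3.3.3
ssh localhost exit && echo "passwordless ssh OK"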

1.2 Setup Hadoop clusters

  1. Switch to Hadoop configuration path

    cd /usr/local/hadoop-3.3.3/etc/hadoop/
  2. Setup hadoop-env.sh

    vim hadoop-env.sh
    
    # Add the following
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
    export HDFS_NAMENODE_USER="root"
    export HDFS_DATANODE_USER="root"
    export HDFS_SECONDARYNAMENODE_USER="root"
    export YARN_RESOURCEMANAGER_USER="root"
    export YARN_NODEMANAGER_USER="root"
    
    # save and quit
    :wq
  3. Setup core-site.xml

    <configuration>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>file:/usr/local/hadoop/data</value>
            <description>Abase for other temporary directories.</description>
        </property>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://master:9000</value>
        </property>
        <property>
            <name>hadoop.http.staticuser.user</name>
            <value>root</value>
        </property>
        <property>
            <name>hadoop.proxyuser.root.hosts</name>
            <value>*</value>
        </property>
        <property>
            <name>hadoop.proxyuser.root.groups</name>
            <value>*</value>
        </property>
        <property>
            <name>fs.trash.interval</name>
            <value>1440</value>
        </property>
    </configuration>
  4. Setup hdfs-site.xml

    <configuration>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/usr/local/hadoop/namenode_dir</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/usr/local/hadoop/datanode_dir</value>
        </property>
        <property>
            <name>dfs.replication</name>
            <value>3</value>
        </property>
        <property>
            <name>dfs.namenode.secondary.http-address</name>
            <value>slave01:9868</value>
        </property>
    </configuration>
  5. Setup mapred-site.xml

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
        <property>
            <name>mapreduce.jobhistory.address</name>
            <value>master:10020</value>
        </property>
        <property>
            <name>mapreduce.jobhistory.webapp.address</name>
            <value>master:19888</value>
        </property>
        <property>
            <name>yarn.app.mapreduce.am.env</name>
            <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
        <property>
            <name>mapreduce.map.env</name>
            <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
        <property>
            <name>mapreduce.reduce.env</name>
            <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
    </configuration> 
  6. Setup yarn-site.xml

    <configuration>
        <!-- Site specific YARN configuration properties -->
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>master</value>
        </property>
        <property>
            <name>yarn.nodemanager.pmem-check-enabled</name>
            <value>false</value>
        </property>
        <property>
            <name>yarn.nodemanager.vmem-check-enabled</name>
            <value>false</value>
        </property>
        <property>
            <name>yarn.log-aggregation-enable</name>
            <value>true</value>
        </property>
        <property>
            <name>yarn.log.server.url</name>
            <value>http://master:19888/jobhistory/logs</value>
        </property>
        <property>
            <name>yarn.log-aggregation.retain-seconds</name>
            <value>604800</value>
        </property>
     </configuration>
  7. Save this container as a hadoopinstalled image for later use

    docker commit ubuntu-jdkinstalled ubuntu/hadoopinstalled
  8. Open 3 terminals and enter the following commands separately

    Terminal 1 (master):

    # 8088 is the default WEB UI port for YARN
    # 9870 is the default WEB UI port for HDFS
    # 9864 is for inspecting file content inside HDFS WEB UI
    docker run -it -h master  -p 8088:8088 -p 9870:9870 -p 9864:9864 --name master ubuntu/hadoopinstalled 

    Terminal 2 (slave01):

    docker run -it -h slave01 --name slave01 ubuntu/hadoopinstalled

    Terminal 3 (slave02):

    docker run -it -h slave02 --name slave02 ubuntu/hadoopinstalled
  9. In every terminal, look up the container's IP address and add all three IP/hostname pairs to /etc/hosts in every container (see the sketch at the end of this section for one way to collect the addresses).

    # Edit the hosts file
    vim /etc/hosts

  10. Open the workers file in the master container

    vim /usr/local/hadoop-3.3.3/etc/hadoop/workers
    # Replace localhost with the following (I include master in workers so a DataNode and a NodeManager run on master as well)
    master
    slave01
    slave02

  11. Test that master can SSH into slave01 and slave02 without a password

    # Need to enter yes for the first-time connection
    ssh slave01
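
If you prefer not to copy the addresses from each terminal by hand, the container IPs can also be collected from the host with docker inspect (a sketch, assuming the containers sit on the default bridge network):

# Run on the host machine: print "<ip> <hostname>" for each container
for c in master slave01 slave02; do
  ip=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "$c")
  echo "$ip $c"
done
# Append the resulting "<ip> <hostname>" lines to /etc/hosts inside every container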

1.3 Start Hadoop clusters

  1. Format the HDFS filesystem when starting the cluster for the first time

    hdfs namenode -format
    
  2. Start the HDFS and YARN clusters

    start-all.sh
    
  3. Run jps on each node to check that the corresponding Java processes have started successfully. With the configuration above, master should show NameNode, DataNode, ResourceManager and NodeManager; slave01 should show DataNode, NodeManager and SecondaryNameNode; slave02 should show DataNode and NodeManager (see the sketch at the end of this section for checking all three nodes at once).

  4. To use the Web UIs from the local machine, modify the hosts file on the host.

    For Windows:

    Open C:\Windows\System32\drivers\etc\hosts
    
    # Add the following
    127.0.0.1 master
    127.0.0.1 slave01
    127.0.0.1 slave02
    
    # Save the changes
  5. Open the YARN Web UI

    http://localhost:8088/

  6. Open the HDFS Web UI

    http://localhost:9870/
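
Since passwordless SSH is already configured, step 3 above can be done for all three nodes from the master container in one small loop (a sketch):

# Run jps on every node from the master container
# ($JAVA_HOME is expanded locally, so the remote shell does not need it set;
#  answer yes to the host-key prompt on the first connection to each node)
for h in master slave01 slave02; do
  echo "=== $h ==="
  ssh "$h" "$JAVA_HOME/bin/jps"
done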

1.4 TODO

  1. Uploading files through the HDFS Web UI does not work yet

  2. Creating folders through the HDFS Web UI does not work yet

1.5 Test Hadoop grep example

# Switch to the Hadoop installation folder
cd /usr/local/hadoop-3.3.3

# Make an input folder in HDFS
hdfs dfs -mkdir -p /user/hadoop/input

# Upload the local Hadoop xml files to HDFS
hdfs dfs -put ./etc/hadoop/*.xml /user/hadoop/input

# Use ls to check that the files were uploaded to HDFS properly
hdfs dfs -ls /user/hadoop/input

# Run the grep example: read every file in the HDFS input folder and count words matching dfs[a-z.]+ into the output folder
hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep /user/hadoop/input output 'dfs[a-z.]+'

# Check the results
hdfs dfs -cat output/*
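
Note that MapReduce will not overwrite an existing output directory, so remove the previous output before re-running the example:

# Delete the previous output folder in HDFS before running the grep example again
hdfs dfs -rm -r output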

2. Build Hive Docker Image

2.1 Hive Installation

  1. Run Hadoop image

    docker run -it -h master -p 8088:8088 -p 9870:9870 -p 9864:9864 --name master ubuntu/hadoopinstalled
  2. Download Apache Hive installation package

    https://dlcdn.apache.org/hive/hive-3.1.3/

  3. Copy Hive installation package to master container

    docker cp apache-hive-3.1.3-bin.tar.gz master:/usr/local/
  4. Install sudo

    apt-get install sudo -y
  5. Install net-tools

    sudo apt install net-tools
    
  6. Install mysql-server for meta-store

    # Install mysql-server 
    sudo apt install mysql-server -y
    
    # Set the home directory of the mysql user (avoids a warning when starting MySQL)
    usermod -d /var/lib/mysql mysql
    
    # Check mysql version
    mysql --version


  7. Extract Hive package

    # Enter master container
    docker exec -it master bash
    
    # Change to package location
    cd /usr/local/
    
    # Extract Hive package
    tar -zxvf apache-hive-3.1.3-bin.tar.gz
    
    # Rename to hive
    mv apache-hive-3.1.3-bin hive
    
    # Remove the Hive installation package
    rm apache-hive-3.1.3-bin.tar.gz

    # Give root ownership of the hive folder
    sudo chown -R root:root hive
  8. Setup Hive environment variables

    # Open ~/.bashrc
    vim ~/.bashrc

    # Enter the following:
    export HIVE_HOME=/usr/local/hive
    export PATH=$PATH:$HIVE_HOME/bin
    export HADOOP_HOME=/usr/local/hadoop-3.3.3
    
    # save and quit
    :wq
    
    # make .bashrc effective immediately
    source ~/.bashrc
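
As a quick check that the Hive environment variables took effect (a sketch, assuming the paths above):

# Confirm the Hive environment resolves correctly
echo $HIVE_HOME     # expect /usr/local/hive
which hive          # expect /usr/local/hive/bin/hive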

2.2 Setup Hive clusters

  1. Change to Hive configuration folder

    cd /usr/local/hive/conf
  2. Use hive-default.xml template

    mv hive-default.xml.template hive-default.xml
    
  3. Setup hive-site.xml

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>javax.jdo.option.ConnectionURL</name>
            <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8&amp;allowPublicKeyRetrieval=true</value>
            <description>JDBC connect string for a JDBC metastore</description>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <value>com.mysql.cj.jdbc.Driver</value>
            <description>Driver class name for a JDBC metastore</description>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionUserName</name>
            <value>hive</value>
            <description>username to use against metastore database</description>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionPassword</name>
            <value>hive</value>
            <description>password to use against metastore database</description>
        </property>
        <property>
            <name>hive.server2.thrift.bind.host</name>
            <value>master</value>
            <description>Host that HiveServer2 binds to</description>
        </property>
        <property>
            <name>hive.metastore.uris</name>
            <value>thrift://master:9083</value>
            <description>Metastore Thrift URI</description>
        </property>
        <property>
            <name>hive.metastore.event.db.notification.api.auth</name>
            <value>false</value>
            <description>Disable metastore event DB notification API authorization</description>
        </property>
    </configuration>
  4. Start MySQL and log in

    # Start the MySQL service if it is not already running
    service mysql start

    # Enter mysql as the root user (no password is set by default; press Enter at the prompt)
    mysql -u root -p
  5. Enter the following commands

    create database hive;
    
    create user hive@localhost identified by 'hive';
    
    GRANT ALL PRIVILEGES ON *.* TO hive@localhost with grant option;
    
    FLUSH PRIVILEGES;
  6. Enter ctrl + d to quit mysql

  7. Remove the log4j-slf4j jar from Hive, as its SLF4J binding conflicts with Hadoop's

    rm /usr/local/hive/lib/log4j-slf4j-impl-2.17.1.jar
  8. Download the MySQL Connector/J driver (my MySQL version is 8.0.29)

    https://dev.mysql.com/downloads/connector/j/

  9. Copy to master container and extract

    # Copy to master container
    docker cp mysql-connector-java_8.0.29-1ubuntu21.10_all.deb master:/usr/local/
    
    # Switch to /usr/local
    cd /usr/local
    
    # Install the deb package
    dpkg -i mysql-connector-java_8.0.29-1ubuntu21.10_all.deb
    
    # Copy jar file to Hive library
    cp /usr/share/java/mysql-connector-java-8.0.29.jar /usr/local/hive/lib/
  10. Commit the container to a new image

    docker commit master ubuntu/hiveinstalled
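
One step not shown above: with a fresh MySQL metastore database, the Hive schema usually has to be initialized before the metastore service is started for the first time. A hedged sketch (run inside the master container, with MySQL running and the connector jar already in hive/lib):

# Initialize the Hive metastore schema in the "hive" MySQL database configured in hive-site.xml
schematool -dbType mysql -initSchema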

2.3 Start Hive clusters

  1. Open 3 terminals and enter the following commands separately

    Terminal 1 (master):

    docker run -it -h master  -p 8088:8088 -p 9870:9870 -p 9864:9864 -p 10000:10000 -p 8080:8080 --name master ubuntu/hiveinstalled

    Terminal 2 (slave01):

    docker run -it -h slave01 --name slave01 ubuntu/hadoopinstalled

    Terminal 3 (slave02):

    docker run -it -h slave02 --name slave02 ubuntu/hadoopinstalled
  2. In every terminal, look up the container's IP address and add all three IP/hostname pairs to /etc/hosts in every container (same procedure as in section 1.2).

    # Edit the hosts file
    vim /etc/hosts

  3. Open the workers file in the master container

    vim /usr/local/hadoop-3.3.3/etc/hadoop/workers
    # Replace localhost with the following (I include master in workers so a DataNode and a NodeManager run on master as well)
    master
    slave01
    slave02

  4. Enter master container and start mysql

    # Enter master container
    docker exec -it master bash
    
    # Start mysql
    service mysql start
  5. Start HDFS service

    # Start HDFS
    start-dfs.sh
    
    # Create the following folders
    hadoop fs -mkdir /tmp
    hadoop fs -mkdir -p /user/hive/warehouse
    
    # Grant group write permission
    hadoop fs -chmod g+w /tmp
    hadoop fs -chmod g+w /user/hive/warehouse
  6. Start metastore service and hiveserver2

    # Start metastore
    nohup hive --service metastore > nohup.out &
    # Use net-tools to check if metastore has started properly
    netstat -nlpt |grep 9083
    
    # Start hiveserver2
    nohup hiveserver2 > nohup2.out &
    # Use net-tools to check if hiveserver2 has started properly
    netstat -nlpt|grep 10000
    

    You should see netstat output for ports 9083 and 10000 if the metastore and hiveserver2 have started properly (hiveserver2 can take a minute before it binds port 10000).

  7. Test hiveserver2 with beeline client

    # Enter beeline in terminal
    beeline
    
    # Enter the following to connect to hiveserver2
    !connect jdbc:hive2://master:10000

    # Enter root as the username; no need to enter a password
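
Beeline can also be run non-interactively, which makes for a quick end-to-end smoke test (a sketch; smoke_test is just an example table name):

# Connect to hiveserver2, create a throwaway table, and list tables in one command
beeline -u jdbc:hive2://master:10000 -n root \
  -e "CREATE TABLE IF NOT EXISTS smoke_test (id INT); SHOW TABLES;"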

3. Build Spark Docker Image

3.1 Spark Installation

  1. Download Spark from Apache Spark (I use spark-3.3.0 for this setup)

    https://www.apache.org/dyn/closer.lua/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz

  2. Install Spark installation package in master container

    # Copy Spark installation package from local machine to master container
    docker cp spark-3.3.0-bin-hadoop3.tgz master:/usr/local/
    
    # Enter master container
    docker exec -it master bash
    
    # Change to /usr/local
    cd /usr/local
    
    # Extract package
    sudo tar -zxf spark-3.3.0-bin-hadoop3.tgz -C /usr/local/
    
    # rm package
    rm spark-3.3.0-bin-hadoop3.tgz
    
    # Rename Spark folder
    sudo mv ./spark-3.3.0-bin-hadoop3/ ./spark
  3. Download Anaconda3 Linux distribution

    https://www.anaconda.com/products/distribution#Downloads

  4. Install Anaconda3 in master container (I use Anaconda3-2022.05 for this setup)

    # Copy Anaconda3 installation package from local machine to master container
    docker cp Anaconda3-2022.05-Linux-x86_64.sh master:/usr/local/
    
    # Enter master container
    docker exec -it master bash
    
    # Change to /usr/local
    cd /usr/local
    
    # Install the Anaconda3 package
    sh ./Anaconda3-2022.05-Linux-x86_64.sh
    # Accept the license (type yes)

    # Enter the installation path when prompted
    /usr/local/anaconda3
    # Type yes to let the installer initialize conda
    
    # Remove Anaconda3 package
    rm Anaconda3-2022.05-Linux-x86_64.sh
  5. Setup python virtual environment named "pyspark"

    conda create -n pyspark python=3.8
    # Type y
    
    # Activate pyspark virtual env
    conda activate pyspark


  6. Install necessary python packages

    conda install pyspark
    conda install pyhive
  7. Enter the following into /etc/profile

    # Open /etc/profile
    vim /etc/profile

    # Make sure the following lines are in the file
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
    
    export HADOOP_HOME=/usr/local/hadoop-3.3.3
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    
    export SPARK_HOME=/usr/local/spark
    export PYSPARK_PYTHON=/usr/local/anaconda3/envs/pyspark/bin/python
    
    # Save and quit
    :wq
  8. Configure workers file

    # Switch to spark configuration folder
    cd /usr/local/spark/conf
    
    # Copy the workers template
    cp workers.template workers

    # Open the workers file
    vim workers

    # Replace localhost with the following:
    master
    slave01
    slave02
    
    # Save and quit
    :wq
  9. Configure spark-env.sh file

    # Open spark-env.sh file
    vim spark-env.sh
    
    # Enter the following
    HADOOP_CONF_DIR=/usr/local/hadoop-3.3.3/etc/hadoop
    YARN_CONF_DIR=/usr/local/hadoop-3.3.3/etc/hadoop
    export SPARK_MASTER_HOST=master
    export SPARK_MASTER_PORT=7077
    SPARK_MASTER_WEBUI_PORT=8080
    SPARK_WORKER_PORT=7078
    SPARK_WORKER_WEBUI_PORT=8081
    SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://master:9000/sparklog/ -Dspark.history.fs.cleaner.enabled=true -Dspark.history.ui.port=18080"
    
    # Save and quit
    :wq
  10. Create sparklog folder in HDFS

    hadoop fs -mkdir /sparklog
    hadoop fs -chmod 777 /sparklog
  11. Configure spark-defaults.conf file

    # Copy spark-defaults.conf template
    cp spark-defaults.conf.template spark-defaults.conf
    
    # Open the file
    vim spark-defaults.conf
    
    # Enter the following
    spark.eventLog.enabled true
    spark.eventLog.dir hdfs://master:9000/sparklog/
    spark.eventLog.compress true
    
    # Save and quit
    :wq
  12. Configure log4j2.properties file

    # Copy log4j2.properties template
    cp log4j2.properties.template log4j2.properties
    
    # Open the file
    vim log4j2.properties
    
    # Change rootLogger.level to warn
    rootLogger.level = warn
    
    # Save and quit
    :wq
  13. Commit the container to a new image

    docker commit master ubuntu/sparkinstalled
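
It may also be worth confirming that the Spark build and the pyspark interpreter resolve to the expected versions (a quick sketch, assuming the paths above):

# Spark should report version 3.3.0 (built for Hadoop 3)
/usr/local/spark/bin/spark-submit --version

# Python interpreter Spark will use (PYSPARK_PYTHON in /etc/profile)
/usr/local/anaconda3/envs/pyspark/bin/python --version    # expect Python 3.8.x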
    

3.2 Start Spark clusters

  1. Open 3 terminals and enter the following commands separately

    Terminal 1 (master):

    docker run -it -h master  -p 8088:8088 -p 9870:9870 -p 9864:9864 -p 10000:10000 -p 8080:8080 -p 18080:18080 -p 4040:4040 -p 19888:19888 --name master ubuntu/sparkinstalled
  2. Terminal 2 (slave01)

    docker run -it -h slave01 --name slave01 ubuntu/sparkinstalled
  3. Terminal 3 (slave02)

    docker run -it -h slave02 --name slave02 ubuntu/sparkinstalled
  4. Enter master container and start spark clusters

    # Enter master container
    docker exec -it master bash
    
    # Make sure to start the Hadoop cluster first
    start-all.sh

    # Follow the instructions in 2.3 to start the Hive services (for Spark on Hive)
    
    # Switch to spark sbin folder
    cd /usr/local/spark/sbin
    
    # Start spark clusters
    ./start-all.sh
    
    # Start history server
    ./start-history-server.sh
    
    # Start the MapReduce job history server
    cd /usr/local/hadoop-3.3.3/sbin
    mr-jobhistory-daemon.sh start historyserver
    # (this script is deprecated in Hadoop 3; `mapred --daemon start historyserver` also works)
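
With the Hadoop and Spark daemons running, the bundled SparkPi example can be used to verify both the standalone cluster and the YARN integration (a sketch; 100 is just the number of sample partitions):

# SparkPi on the Spark standalone cluster
/usr/local/spark/bin/spark-submit --master spark://master:7077 \
  --class org.apache.spark.examples.SparkPi \
  /usr/local/spark/examples/jars/spark-examples_*.jar 100

# The same job on YARN (picks up HADOOP_CONF_DIR from /etc/profile)
/usr/local/spark/bin/spark-submit --master yarn \
  --class org.apache.spark.examples.SparkPi \
  /usr/local/spark/examples/jars/spark-examples_*.jar 100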

3.3 Test Spark Standalone mode

cd /usr/local/spark
# Start pyspark in Standalone mode
bin/pyspark --master spark://master:7077
# Type the following
sc.parallelize([1,2,3,4,5]).map(lambda x: x + 1).collect()

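The same quick test can also be run on YARN instead of standalone mode, since HADOOP_CONF_DIR is already exported (a sketch):

cd /usr/local/spark
# Start pyspark against YARN; the running application shows up in the YARN Web UI (port 8088)
bin/pyspark --master yarn
# Then type the same test inside the shell:
# sc.parallelize([1,2,3,4,5]).map(lambda x: x + 1).collect()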

4. Docker image links

  1. Hadoop ready image

    https://hub.docker.com/repository/docker/johnhui/hadoopinstalled

  2. Hive ready image

    https://hub.docker.com/repository/docker/johnhui/hiveinstalled

  3. Spark ready image

    https://hub.docker.com/repository/docker/johnhui/sparkinstalled