From dedefdfe685a984909364843b386d569fa81e6fd Mon Sep 17 00:00:00 2001
From: du
Date: Wed, 10 Jan 2024 17:23:57 +0800
Subject: [PATCH] [Docs] Modify Getting Started Env Guide (#406)

* Supplementary Getting Started Env Guide

Signed-off-by: fphantam

* modify setup doc

Signed-off-by: fphantam

* modify setup doc

Signed-off-by: fphantam

---------

Signed-off-by: fphantam
---
 .../01-Getting Started/01-setup-local-env.md | 197 +++++++++++++++--
 .../01-Getting Started/02-docker-compose.mdx | 86 --------
 .../{03-spark-guide.md => 02-spark-guide.md} | 2 +-
 ...{04-Flink-Guide.mdx => 03-Flink-Guide.mdx} | 0
 .../01-Getting Started/01-setup-local-env.md | 199 ++++++++++++++++--
 .../01-Getting Started/02-docker-compose.mdx | 86 --------
 .../{03-spark-guide.md => 02-spark-guide.md} | 2 +-
 ...{04-Flink-Guide.mdx => 03-Flink-Guide.mdx} | 0
 8 files changed, 374 insertions(+), 198 deletions(-)
 delete mode 100644 website/docs/01-Getting Started/02-docker-compose.mdx
 rename website/docs/01-Getting Started/{03-spark-guide.md => 02-spark-guide.md} (98%)
 rename website/docs/01-Getting Started/{04-Flink-Guide.mdx => 03-Flink-Guide.mdx} (100%)
 delete mode 100644 website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/02-docker-compose.mdx
 rename website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/{03-spark-guide.md => 02-spark-guide.md} (97%)
 rename website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/{04-Flink-Guide.mdx => 03-Flink-Guide.mdx} (100%)

diff --git a/website/docs/01-Getting Started/01-setup-local-env.md b/website/docs/01-Getting Started/01-setup-local-env.md
index c0874ac1d..0e313f737 100644
--- a/website/docs/01-Getting Started/01-setup-local-env.md
+++ b/website/docs/01-Getting Started/01-setup-local-env.md
@@ -1,4 +1,4 @@
-# Setup a Local Environment
+# Setup a Test Environment

-## Start A Local PostgreSQL DB
+## 1. Set up a test environment in the Linux local file system
+To store data on local disk, only a PostgreSQL database is required.
+
+### 1.1 Start A Local PostgreSQL DB
The quickest way to start a pg DB is via a docker container:
```shell
docker run -d --name lakesoul-test-pg -p5432:5432 -e POSTGRES_USER=lakesoul_test -e POSTGRES_PASSWORD=lakesoul_test -e POSTGRES_DB=lakesoul_test -d postgres:14.5
```

-## PG Database Initialization
+### 1.2 PG Database Initialization
Initialize the LakeSoul PG database using `script/meta_init.sql`.
+Execute the code below in the LakeSoul base directory:
+```
+docker cp script/meta_init.sql lakesoul-test-pg:/

- ```
- PGPASSWORD=lakesoul_test psql -h localhost -p 5432 -U lakesoul_test -f script/meta_init.sql
- ```
+docker exec -i lakesoul-test-pg sh -c "PGPASSWORD=lakesoul_test psql -h localhost -p 5432 -U lakesoul_test -f meta_init.sql"
+```

-## Lakesoul PG Database Configuration Description:
+### 1.3 LakeSoul PG Database Configuration
By default, the PG database is connected to the local database. The configuration information is as follows:
```txt
lakesoul.pg.driver=com.lakesoul.shaded.org.postgresql.Driver
@@ -38,7 +43,7 @@ export lakesoul_home=/opt/soft/pg.property

You can put customized database configuration information in this file.

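+As an illustration, a customized `pg.property` that simply points at the database started in step 1.1 could look like the following (the values mirror the defaults above; replace them with your own host, port and credentials for a real deployment):
+```txt
+lakesoul.pg.driver=com.lakesoul.shaded.org.postgresql.Driver
+lakesoul.pg.url=jdbc:postgresql://localhost:5432/lakesoul_test?stringtype=unspecified
+lakesoul.pg.username=lakesoul_test
+lakesoul.pg.password=lakesoul_test
+```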
-## Install an Apache Spark environment
+### 1.4 Install an Apache Spark environment
You can download a Spark distribution from https://spark.apache.org/downloads.html; please choose Spark 3.3.0 or above. Note that the official package from Apache Spark does not include the hadoop-cloud component.
We provide a Spark package with the Hadoop cloud dependencies included; download it from https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/spark/spark-3.3.2-bin-hadoop3.tgz.

After unpacking the Spark package, download the LakeSoul distribution jar from https://github.com/lakesoul-io/LakeSoul/releases and put it into the `jars` directory of your Spark environment.

@@ -47,7 +52,7 @@
wget https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/spark/spark-3.3.2-bin-hadoop3.tgz
tar xf spark-3.3.2-bin-hadoop3.tgz
export SPARK_HOME=${PWD}/spark-3.3.2-bin-hadoop3
-wget https://github.com/lakesoul-io/LakeSoul/releases/download/v2.5.0/lakesoul-spark-2.5.0-spark-3.3.jar -P $SPARK_HOME/jars
+wget https://github.com/lakesoul-io/LakeSoul/releases/download/v2.4.0/lakesoul-spark-2.4.0-spark-3.3.jar -P $SPARK_HOME/jars
```

:::tip
@@ -62,17 +67,185 @@ Refer to https://spark.apache.org/docs/latest/hadoop-provided.html on how to set
Since 2.1.0, LakeSoul packages all its dependencies into one single jar via the maven shade plugin. Before that, all jars were packaged into one tar.gz file.
:::

-## Start spark-shell for testing LakeSoul
+#### 1.4.1 Start spark-shell for testing LakeSoul
cd into the Spark installation directory, and start an interactive spark-shell:
```shell
./bin/spark-shell --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog --conf spark.sql.defaultCatalog=lakesoul
```
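+Once the shell is up, a quick way to check that the extension, catalog and PG connection are all working is to write a small LakeSoul table and read it back. This is only an illustrative smoke test (the local table path below is arbitrary); it mirrors the Scala API example in section 3.4.4:
+```scala
+// Hypothetical local path used only for this smoke test; use an s3a:// path when writing to object storage.
+val tablePath = "file:///tmp/lakesoul/test_table"
+val df = Seq(("2021-01-01", 1, "rice"), ("2021-01-01", 2, "bread")).toDF("date", "id", "name")
+df.write
+  .mode("append")
+  .format("lakesoul")
+  .option("rangePartitions", "date")
+  .option("hashPartitions", "id")
+  .option("hashBucketNum", "2")
+  .save(tablePath)
+
+// Read the table back through the lakesoul data source and show the rows.
+spark.read.format("lakesoul").load(tablePath).show()
+```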
+
+#### 1.4.2 Write data to object storage service
+It is necessary to add the object storage access key, secret key and endpoint information:
+ ```shell
+ ./bin/spark-shell --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog --conf spark.sql.defaultCatalog=lakesoul --conf spark.hadoop.fs.s3a.access.key=XXXXXX --conf spark.hadoop.fs.s3a.secret.key=XXXXXX --conf spark.hadoop.fs.s3a.endpoint=XXXXXX --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
+ ```

-## LakeSoul Spark Conf Parameters
+#### 1.4.3 LakeSoul Spark Conf Parameters
Before starting to use LakeSoul, we should add some parameters to `spark-defaults.conf` or the `SparkSession` builder.

| Key | Value | Description |
|---|---|---|
spark.sql.extensions | com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension | extension name for Spark SQL
spark.sql.catalog.lakesoul | org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog | plug in LakeSoul's catalog
-spark.sql.defaultCatalog | lakesoul | set default catalog for spark
\ No newline at end of file
+spark.sql.defaultCatalog | lakesoul | set default catalog for spark
+
+### 1.5 Setup Flink environment
+Download the LakeSoul Flink jar: https://github.com/lakesoul-io/LakeSoul/releases/download/v2.4.1/lakesoul-flink-2.4.1-flink-1.17.jar
+Download Flink: https://dlcdn.apache.org/flink/flink-1.17.2/flink-1.17.2-bin-scala_2.12.tgz
+
+#### 1.5.1 Start Flink SQL shell
+After creating the pg database and the `lakesoul_home` configuration file, place the LakeSoul Flink jar in the Flink directory.
+Enter the Flink installation directory and execute the following commands:
+```shell
+export lakesoul_home=/opt/soft/pg.property && ./bin/start-cluster.sh

+export lakesoul_home=/opt/soft/pg.property && ./bin/sql-client.sh embedded -j lakesoul-flink-2.4.1-flink-1.17.jar
+```
+
+#### 1.5.2 Write data to object storage service
+The access key, secret key and endpoint information need to be added to the Flink configuration file flink-conf.yaml:
+```shell
+s3.access-key: XXXXXX
+s3.secret-key: XXXXXX
+s3.endpoint: XXXXXX
+```
+Place flink-s3-fs-hadoop.jar and flink-shaded-hadoop-2-uber-2.6.5-10.0.jar under the Flink `lib` directory; example download commands are shown below.
+Download flink-s3-fs-hadoop.jar: https://repo1.maven.org/maven2/org/apache/flink/flink-s3-fs-hadoop/1.17.2/flink-s3-fs-hadoop-1.17.2.jar
+Download flink-shaded-hadoop-2-uber-2.6.5-10.0.jar: https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.6.5-10.0/flink-shaded-hadoop-2-uber-2.6.5-10.0.jar
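+For example, assuming Flink has been unpacked at `$FLINK_HOME` (the variable name is just a placeholder for your Flink installation directory), the two jars listed above can be fetched straight into the `lib` directory:
+```shell
+# Download the S3 filesystem and shaded Hadoop jars into Flink's lib directory
+wget https://repo1.maven.org/maven2/org/apache/flink/flink-s3-fs-hadoop/1.17.2/flink-s3-fs-hadoop-1.17.2.jar -P $FLINK_HOME/lib
+wget https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.6.5-10.0/flink-shaded-hadoop-2-uber-2.6.5-10.0.jar -P $FLINK_HOME/lib
+```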
+
+## 2. Start on Hadoop, Spark and Flink cluster environments
+To run LakeSoul tasks on a Hadoop cluster, you only need to add the relevant configuration information to the environment variables and to the Spark and Flink cluster configurations. The specific operations are as follows:
+
+### 2.1 Add the following information to the Spark configuration file spark-defaults.conf
+```shell
+spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension
+spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog
+spark.sql.defaultCatalog=lakesoul
+
+spark.yarn.appMasterEnv.LAKESOUL_PG_DRIVER=com.lakesoul.shaded.org.postgresql.Driver
+spark.yarn.appMasterEnv.LAKESOUL_PG_URL=jdbc:postgresql://127.0.0.1:5432/lakesoul_test?stringtype=unspecified
+spark.yarn.appMasterEnv.LAKESOUL_PG_USERNAME=lakesoul_test
+spark.yarn.appMasterEnv.LAKESOUL_PG_PASSWORD=lakesoul_test
+```
+
+### 2.2 Add the following information to the Flink configuration file flink-conf.yaml
+```shell
+containerized.master.env.LAKESOUL_PG_DRIVER: com.lakesoul.shaded.org.postgresql.Driver
+containerized.master.env.LAKESOUL_PG_USERNAME: postgres
+containerized.master.env.LAKESOUL_PG_PASSWORD: postgres123
+containerized.master.env.LAKESOUL_PG_URL: jdbc:postgresql://127.0.0.1:5432/lakesoul_test?stringtype=unspecified
+containerized.taskmanager.env.LAKESOUL_PG_DRIVER: com.lakesoul.shaded.org.postgresql.Driver
+containerized.taskmanager.env.LAKESOUL_PG_USERNAME: lakesoul_test
+containerized.taskmanager.env.LAKESOUL_PG_PASSWORD: lakesoul_test
+containerized.taskmanager.env.LAKESOUL_PG_URL: jdbc:postgresql://127.0.0.1:5432/lakesoul_test?stringtype=unspecified
+```
+
+### 2.3 Configure the global environment
+Configure the global environment variables on the client machine by writing them into an env.sh file with the following content.
+Here the Hadoop version is 3.1.4.0-315, the Spark version is spark-3.3.2, and the Flink version is flink-1.17.2.
+
+```shell
+export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+export HADOOP_HOME="/usr/hdp/3.1.4.0-315/hadoop"
+export HADOOP_HDFS_HOME="/usr/hdp/3.1.4.0-315/hadoop-hdfs"
+export HADOOP_MAPRED_HOME="/usr/hdp/3.1.4.0-315/hadoop-mapreduce"
+export HADOOP_YARN_HOME="/usr/hdp/3.1.4.0-315/hadoop-yarn"
+export HADOOP_LIBEXEC_DIR="/usr/hdp/3.1.4.0-315/hadoop/libexec"
+export HADOOP_CONF_DIR="/usr/hdp/3.1.4.0-315/hadoop/conf"
+
+export SPARK_HOME=/usr/hdp/spark-3.3.2-bin-without-hadoop-ddf
+export SPARK_CONF_DIR=/home/lakesoul/lakesoul_hadoop_ci/LakeSoul-main/LakeSoul/script/benchmark/hadoop/spark-conf
+
+export FLINK_HOME=/opt/flink-1.17.2
+export FLINK_CONF_DIR=/opt/flink-1.17.2/conf
+export PATH=$HADOOP_HOME/bin:$SPARK_HOME/bin:$FLINK_HOME/bin:$JAVA_HOME/bin:$PATH
+export HADOOP_CLASSPATH=$(hadoop classpath)
+export SPARK_DIST_CLASSPATH=$HADOOP_CLASSPATH
+export LAKESOUL_PG_DRIVER=com.lakesoul.shaded.org.postgresql.Driver
+export LAKESOUL_PG_URL=jdbc:postgresql://127.0.0.1:5432/lakesoul_test?stringtype=unspecified
+export LAKESOUL_PG_USERNAME=lakesoul_test
+export LAKESOUL_PG_PASSWORD=lakesoul_test
+```
+After configuring the above information, execute the following command; you can then submit LakeSoul tasks from this client to the YARN cluster:
+```shell
+source env.sh
+```
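+With the configuration above in place and `env.sh` sourced, jobs can be brought up on YARN with the standard launchers. The commands below are only an illustration; adjust the master, deploy mode and resources to your own cluster:
+```shell
+# Interactive Spark shell on YARN; spark-defaults.conf supplies the LakeSoul extension, catalog and PG settings
+$SPARK_HOME/bin/spark-shell --master yarn
+
+# Detached Flink session on YARN; flink-conf.yaml supplies the containerized.*.env.LAKESOUL_PG_* variables
+$FLINK_HOME/bin/yarn-session.sh -d
+```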
+
+## 3. Use Docker Compose
+### 3.1 Docker Compose Files
+We provide a docker compose env to quickly start a local PostgreSQL service and a MinIO S3 Storage service. The docker compose env is located under [lakesoul-docker-compose-env](https://github.com/lakesoul-io/LakeSoul/tree/main/docker/lakesoul-docker-compose-env).
+
+### 3.2 Install Docker Compose
+To install docker compose, please refer to [Install Docker Engine](https://docs.docker.com/engine/install/).
+
+### 3.3 Start docker compose
+To start the docker compose env, cd into the docker compose env dir, and execute the command:
+```bash
+cd docker/lakesoul-docker-compose-env/
+docker compose up -d
+```
+Then use `docker compose ps` to check that both services' statuses are `running(healthy)`. The PostgreSQL service will automatically set up the database and tables required by LakeSoul Meta, and the MinIO service will set up a public bucket. You can change the user, password, database name and MinIO bucket name accordingly in the `docker-compose.yml` file.
+
+### 3.4 Run LakeSoul Tests in Docker Compose Env
+#### 3.4.1 Prepare LakeSoul Properties File
+```ini title="lakesoul.properties"
+lakesoul.pg.driver=com.lakesoul.shaded.org.postgresql.Driver
+lakesoul.pg.url=jdbc:postgresql://lakesoul-docker-compose-env-lakesoul-meta-db-1:5432/lakesoul_test?stringtype=unspecified
+lakesoul.pg.username=lakesoul_test
+lakesoul.pg.password=lakesoul_test
+```
+#### 3.4.2 Prepare Spark Image
+You could use bitnami's Spark 3.3 docker image with packaged Hadoop dependencies:
+```bash
+docker pull bitnami/spark:3.3.1
+```
+
+#### 3.4.3 Start Spark Shell
+```bash
+docker run --net lakesoul-docker-compose-env_default --rm -ti \
+  -v $(pwd)/lakesoul.properties:/opt/spark/work-dir/lakesoul.properties \
+  --env lakesoul_home=/opt/spark/work-dir/lakesoul.properties bitnami/spark:3.3.1 \
+  spark-shell \
+  --packages com.dmetasoul:lakesoul-spark:2.4.0-spark-3.3 \
+  --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \
+  --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \
+  --conf spark.sql.defaultCatalog=lakesoul \
+  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
+  --conf spark.hadoop.fs.s3a.buffer.dir=/opt/spark/work-dir/s3a \
+  --conf spark.hadoop.fs.s3a.path.style.access=true \
+  --conf spark.hadoop.fs.s3a.endpoint=http://minio:9000 \
+  --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
+```
+
+#### 3.4.4 Execute LakeSoul Scala APIs
+```scala
+val tablePath= "s3://lakesoul-test-bucket/test_table"
+val df = Seq(("2021-01-01",1,"rice"),("2021-01-01",2,"bread")).toDF("date","id","name")
+df.write
+  .mode("append")
+  .format("lakesoul")
+  .option("rangePartitions","date")
+  .option("hashPartitions","id")
+  .option("hashBucketNum","2")
+  .save(tablePath)
+```
+
+#### 3.4.5 Verify Data Written Successfully
+Open the link http://127.0.0.1:9001/buckets/lakesoul-test-bucket/browse/ in your browser to verify that the LakeSoul table has been written to MinIO successfully.
+Use minioadmin1:minioadmin1 to log in to MinIO's console.
+
+### 3.5 Cleanup Meta Tables and MinIO Bucket
+To clean up all contents in the LakeSoul meta tables, execute:
+```bash
+docker exec -ti lakesoul-docker-compose-env-lakesoul-meta-db-1 psql -h localhost -U lakesoul_test -d lakesoul_test -f /meta_cleanup.sql
+```
+To clean up all contents in the MinIO bucket, execute:
+```bash
+docker run --net lakesoul-docker-compose-env_default --rm -t bitnami/spark:3.3.1 aws --no-sign-request --endpoint-url http://minio:9000 s3 rm --recursive s3://lakesoul-test-bucket/
+```
+
+### 3.6 Shutdown Docker Compose Env
+```bash
+cd docker/lakesoul-docker-compose-env/
+docker compose stop
+docker compose down
+```
\ No newline at end of file
diff --git a/website/docs/01-Getting Started/02-docker-compose.mdx b/website/docs/01-Getting Started/02-docker-compose.mdx
deleted file mode 100644
index 3bb93cfe5..000000000
--- a/website/docs/01-Getting Started/02-docker-compose.mdx
+++ /dev/null
@@ -1,86 +0,0 @@
-# Use Docker Compose
-
-
-
-## Docker Compose Files
-We provide a docker compose env to quickly start a local PostgreSQL service and a MinIO S3 Storage service. The docker compose env is located under [lakesoul-docker-compose-env](https://github.com/lakesoul-io/LakeSoul/tree/main/docker/lakesoul-docker-compose-env).
- -## Install Docker Compose -To install docker compose, please refer to [Install Docker Engine](https://docs.docker.com/engine/install/) - -## Start docker compose -To start the docker compose env, cd into the docker compose env dir, and execute the command: -```bash -cd docker/lakesoul-docker-compose-env/ -docker compose up -d -``` -Then use `docker compose ps` to check both services' statuses are `running(healthy)`. The PostgreSQL service would automatically setup the database and tables required by LakeSoul Meta. And the MinIO service would setup a public bucket. You can change the user, password, database name and MinIO bucket name accordingly in the `docker-compose.yml` file. - -## Run LakeSoul Tests in Docker Compose Env -### Prepare LakeSoul Properties File -```ini title="lakesoul.properties" -lakesoul.pg.driver=com.lakesoul.shaded.org.postgresql.Driver -lakesoul.pg.url=jdbc:postgresql://lakesoul-docker-compose-env-lakesoul-meta-db-1:5432/lakesoul_test?stringtype=unspecified -lakesoul.pg.username=lakesoul_test -lakesoul.pg.password=lakesoul_test -``` -### Prepare Spark Image -You could use bitnami's Spark 3.3 docker image with packaged hadoop denendencies: -```bash -docker pull bitnami/spark:3.3.1 -``` - -### Start Spark Shell -```bash -docker run --net lakesoul-docker-compose-env_default --rm -ti \ - -v $(pwd)/lakesoul.properties:/opt/spark/work-dir/lakesoul.properties \ - --env lakesoul_home=/opt/spark/work-dir/lakesoul.properties bitnami/spark:3.3.1 \ - spark-shell \ - --packages com.dmetasoul:lakesoul-spark:2.5.0-spark-3.3 \ - --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \ - --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \ - --conf spark.sql.defaultCatalog=lakesoul \ - --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \ - --conf spark.hadoop.fs.s3a.buffer.dir=/opt/spark/work-dir/s3a \ - --conf spark.hadoop.fs.s3a.path.style.access=true \ - --conf spark.hadoop.fs.s3a.endpoint=http://minio:9000 \ - --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider -``` - -### Execute LakeSoul Scala APIs -```scala -val tablePath= "s3://lakesoul-test-bucket/test_table" -val df = Seq(("2021-01-01",1,"rice"),("2021-01-01",2,"bread")).toDF("date","id","name") -df.write - .mode("append") - .format("lakesoul") - .option("rangePartitions","date") - .option("hashPartitions","id") - .option("hashBucketNum","2") - .save(tablePath) -``` - -### Verify Data Written Successfully -Open link http://127.0.0.1:9001/buckets/lakesoul-test-bucket/browse/ in your browser to verify that LakeSoul table has been written to MinIO successfully. -Use minioadmin1:minioadmin1 to login into MinIO's console. 
-
-## Cleanup Meta Tables and MinIO Bucket
-To cleanup all contents in LakeSoul meta tables, execute:
-```bash
-docker exec -ti lakesoul-docker-compose-env-lakesoul-meta-db-1 psql -h localhost -U lakesoul_test -d lakesoul_test -f /meta_cleanup.sql
-```
-To cleanup all contents in MinIO bucket, execute:
-```bash
-docker run --net lakesoul-docker-compose-env_default --rm -t bitnami/spark:3.3.1 aws --no-sign-request --endpoint-url http://minio:9000 s3 rm --recursive s3://lakesoul-test-bucket/
-```
-
-## Shutdown Docker Compose Env
-```bash
-cd docker/lakesoul-docker-compose-env/
-docker compose stop
-docker compose down
-```
\ No newline at end of file
diff --git a/website/docs/01-Getting Started/03-spark-guide.md b/website/docs/01-Getting Started/02-spark-guide.md
similarity index 98%
rename from website/docs/01-Getting Started/03-spark-guide.md
rename to website/docs/01-Getting Started/02-spark-guide.md
index 46f4a8b85..15a77cb22 100644
--- a/website/docs/01-Getting Started/03-spark-guide.md
+++ b/website/docs/01-Getting Started/02-spark-guide.md
@@ -8,7 +8,7 @@ SPDX-License-Identifier: Apache-2.0

 ## Setup

-To use LakeSoul in Spark, first configure [Spark catalogs](02-docker-compose.mdx). LakeSoul uses Apache Spark’s DataSourceV2 API for data source and catalog implementations. Moreover, LakeSoul provides scala table API to extend the capability of LakeSoul table.
+To use LakeSoul in Spark, first configure [Spark catalogs](01-setup-local-env.md). LakeSoul uses Apache Spark’s DataSourceV2 API for data source and catalog implementations. Moreover, LakeSoul provides a Scala table API to extend the capability of LakeSoul tables.

 ### Spark 3 Support Matrix

diff --git a/website/docs/01-Getting Started/04-Flink-Guide.mdx b/website/docs/01-Getting Started/03-Flink-Guide.mdx
similarity index 100%
rename from website/docs/01-Getting Started/04-Flink-Guide.mdx
rename to website/docs/01-Getting Started/03-Flink-Guide.mdx
diff --git a/website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/01-setup-local-env.md b/website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/01-setup-local-env.md
index 306b28740..5683a4968 100644
--- a/website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/01-setup-local-env.md
+++ b/website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/01-setup-local-env.md
@@ -1,4 +1,4 @@
-# 搭建本地测试环境
+# 测试环境搭建

-## 启动一个 PostgreSQL 数据库
+## 1. 
在Linux本地文件系统中搭建测试环境 +将数据存储在本地磁盘上,只需要有PostgreSQL数据库即可。 +### 1.1 启动一个 PostgreSQL 数据库 可以通过docker使用下面命令快速搭建一个pg数据库: ```bash docker run -d --name lakesoul-test-pg -p5432:5432 -e POSTGRES_USER=lakesoul_test -e POSTGRES_PASSWORD=lakesoul_test -e POSTGRES_DB=lakesoul_test -d swr.cn-north-4.myhuaweicloud.com/dmetasoul-repo/postgres:14.5 ``` -## PG 数据库初始化 +### 1.2 PG 数据库初始化 在 LakeSoul 代码库目录下执行: ```bash -PGPASSWORD=lakesoul_test psql -h localhost -p 5432 -U lakesoul_test -f script/meta_init.sql +## 将初始化脚本copy到容器中 +docker cp script/meta_init.sql lakesoul-test-pg:/ + +## 执行初始化命令 +docker exec -i lakesoul-test-pg sh -c "PGPASSWORD=lakesoul_test psql -h localhost -p 5432 -U lakesoul_test -f meta_init.sql" ``` -## 安装 Spark 环境 +### 1.3 安装 Spark 环境 由于 Apache Spark 官方的下载安装包不包含 hadoop-cloud 以及 AWS S3 等依赖,我们提供了一个 Spark 安装包,其中包含了 hadoop cloud 、s3 等必要的依赖:https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/spark/spark-3.3.2-bin-hadoop3.tgz ```bash @@ -37,18 +43,18 @@ https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-without-hadoop.tgz LakeSoul 发布 jar 包可以从 GitHub Releases 页面下载:https://github.com/lakesoul-io/LakeSoul/releases 。下载后请将 Jar 包放到 Spark 安装目录下的 jars 目录中: ```bash -wget https://github.com/lakesoul-io/LakeSoul/releases/download/v2.5.0/lakesoul-spark-2.5.0-spark-3.3.jar -P $SPARK_HOME/jars +wget https://github.com/lakesoul-io/LakeSoul/releases/download/v2.4.0/lakesoul-spark-2.4.0-spark-3.3.jar -P $SPARK_HOME/jars ``` -如果访问 Github 有问题,也可以从如下链接下载:https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/lakesoul-spark-2.5.0-spark-3.3.jar +如果访问 Github 有问题,也可以从如下链接下载:https://dmetasoul-bucket.obs.cn-southwest-2.myhuaweicloud.com/releases/lakesoul/lakesoul-spark-2.4.0-spark-3.3.jar :::tip 从 2.1.0 版本起,LakeSoul 自身的依赖已经通过 shade 方式打包到一个 jar 包中。之前的版本是多个 jar 包以 tar.gz 压缩包的形式发布。 ::: -## 启动 spark-shell 进行测试 +#### 1.3.1 启动 spark-shell 进行测试 -### 首先为 LakeSoul 增加 PG 数据库配置 +#### 首先为 LakeSoul 增加 PG 数据库配置 默认情况下,pg数据库连接到本地数据库,配置信息如下: ```txt lakesoul.pg.driver=com.lakesoul.shaded.org.postgresql.Driver @@ -64,16 +70,185 @@ export lakesoul_home=/opt/soft/pg.property 用户可以在这里自定义数据库配置信息,这样用户自定义 PG DB 的配置信息就会在 Spark 作业中生效。 -### 进入 Spark 安装目录,启动 spark 交互式 shell: +#### 1.3.2 进入 Spark 安装目录,启动 spark 交互式 shell: ```shell ./bin/spark-shell --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog --conf spark.sql.defaultCatalog=lakesoul ``` -## Spark 作业 LakeSoul 相关参数设置 +#### 1.3.3 将数据写入对象存储服务 +需要添加对象存储 access key, secret key 和 endpoint 等信息 + ```shell + ./bin/spark-shell --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog --conf spark.sql.defaultCatalog=lakesoul --conf spark.hadoop.fs.s3a.access.key=XXXXXX --conf spark.hadoop.fs.s3a.secret.key=XXXXXX --conf spark.hadoop.fs.s3a.endpoint=XXXXXX --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem + ``` + +#### Spark 作业 LakeSoul 相关参数设置 可以将以下配置添加到 spark-defaults.conf 或者 Spark Session Builder 部分。 |Key | Value |---|---| spark.sql.extensions | com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension spark.sql.catalog.lakesoul | org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog -spark.sql.defaultCatalog | lakesoul \ No newline at end of file +spark.sql.defaultCatalog | lakesoul + +### 1.4 Flink 环境搭建 +以当前发布最新版本为例,LakeSoul Flink jar 
包下载地址为:https://github.com/lakesoul-io/LakeSoul/releases/download/v2.4.1/lakesoul-flink-2.4.1-flink-1.17.jar + +最新版本支持 flink 集群为1.17,Flink jar下载地址为:https://dlcdn.apache.org/flink/flink-1.17.2/flink-1.17.2-bin-scala_2.12.tgz + +#### 1.4.1 启动Flink SQL shell +在创建好 pg 数据库和 `lakesoul_home` 配置文件后,通过以下方式可以进入 SQL Client 客户端,将LakeSoul Flink jar放在 FLink 目录下, +进入 Flink 安装目录,执行以下命令: +```shell +# 启动 flink 集群 +export lakesoul_home=/opt/soft/pg.property && ./bin/start-cluster.sh + +# 启动 flink sql client +export lakesoul_home=/opt/soft/pg.property && ./bin/sql-client.sh embedded -j lakesoul-flink-2.4.1-flink-1.17.jar +``` + +#### 1.4.2 将数据写入对象存储服务 +需要在配置文件 flink-conf.yaml 添加 access key, secret key 和 endpoint 等信息 +```shell +s3.access-key: XXXXXX +s3.secret-key: XXXXXX +s3.endpoint: XXXXXX +``` +将flink-s3-fs-hadoop.jar 和 flink-shaded-hadoop-2-uber-2.6.5-10.0.jar 放到 Flink/lib 下 +flink-s3-fs-hadoop.jar 下载地址为:https://repo1.maven.org/maven2/org/apache/flink/flink-s3-fs-hadoop/1.17.2/flink-s3-fs-hadoop-1.17.2.jar +flink-shaded-hadoop-2-uber-2.6.5-10.0.jar 下载地址为:https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.6.5-10.0/flink-shaded-hadoop-2-uber-2.6.5-10.0.jar + +## 2. 在 Hadoop、Spark 和 FLink 集群环境下运行 +在 hadoop 集群中应用 LakeSoul 服务,只需要将相关配置信息假如环境变量中以及 Spark、FLink 集群配置中即可。具体操作如下: +2.1 在 Spark 配置文件 spark-defaults.conf 添加如下信息 +```shell +spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension +spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog +spark.sql.defaultCatalog=lakesoul + +spark.yarn.appMasterEnv.LAKESOUL_PG_DRIVER=com.lakesoul.shaded.org.postgresql.Driver +spark.yarn.appMasterEnv.LAKESOUL_PG_URL=jdbc:postgresql://127.0.0.1:5432/lakesoul_test?stringtype=unspecified +spark.yarn.appMasterEnv.LAKESOUL_PG_USERNAME=lakesoul_test +spark.yarn.appMasterEnv.LAKESOUL_PG_PASSWORD=lakesoul_test +``` + +2.2 在Flink 配置文件中 flink-conf.yaml 添加如下信息 +```shell +containerized.master.env.LAKESOUL_PG_DRIVER: com.lakesoul.shaded.org.postgresql.Driver +containerized.master.env.LAKESOUL_PG_USERNAME: postgres +containerized.master.env.LAKESOUL_PG_PASSWORD: postgres123 +containerized.master.env.LAKESOUL_PG_URL: jdbc:postgresql://127.0.0.1:5432/lakesoul_test?stringtype=unspecified +containerized.taskmanager.env.LAKESOUL_PG_DRIVER: com.lakesoul.shaded.org.postgresql.Driver +containerized.taskmanager.env.LAKESOUL_PG_USERNAME: lakesoul_test +containerized.taskmanager.env.LAKESOUL_PG_PASSWORD: lakesoul_test +containerized.taskmanager.env.LAKESOUL_PG_URL: jdbc:postgresql://127.0.0.1:5432/lakesoul_test?stringtype=unspecified +``` + +2.3 在客户端机器上配置全局环境变量信息,这里需要用到变量信息写到一个 env.sh 文件中,内容如下: +这里 Hadoop 版本为 3.1.4.0-315,Spark 版本为 spark-3.3.2, Flink 版本为 flink-1.17.2 +```shell +export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 +export HADOOP_HOME="/usr/hdp/3.1.4.0-315/hadoop" +export HADOOP_HDFS_HOME="/usr/hdp/3.1.4.0-315/hadoop-hdfs" +export HADOOP_MAPRED_HOME="/usr/hdp/3.1.4.0-315/hadoop-mapreduce" +export HADOOP_YARN_HOME="/usr/hdp/3.1.4.0-315/hadoop-yarn" +export HADOOP_LIBEXEC_DIR="/usr/hdp/3.1.4.0-315/hadoop/libexec" +export HADOOP_CONF_DIR="/usr/hdp/3.1.4.0-315/hadoop/conf" + +export SPARK_HOME=/usr/hdp/spark-3.3.2-bin-without-hadoop-ddf +export SPARK_CONF_DIR=/home/lakesoul/lakesoul_hadoop_ci/LakeSoul-main/LakeSoul/script/benchmark/hadoop/spark-conf + +export FLINK_HOME=/opt/flink-1.17.2 +export FLINK_CONF_DIR=/opt/flink-1.17.2/conf +export PATH=$HADOOP_HOME/bin:$SPARK_HOME/bin:$FLINK_HOME/bin:$JAVA_HOME/bin:$PATH +export HADOOP_CLASSPATH=$(hadoop classpath) +export 
SPARK_DIST_CLASSPATH=$HADOOP_CLASSPATH +export LAKESOUL_PG_DRIVER=com.lakesoul.shaded.org.postgresql.Driver +export LAKESOUL_PG_URL=jdbc:postgresql://127.0.0.1:5432/lakesoul_test?stringtype=unspecified +export LAKESOUL_PG_USERNAME=lakesoul_test +export LAKESOUL_PG_PASSWORD=lakesoul_test +``` +配置好如上信息后,执行以下命令,然后便可以在客户端将 LakeSoul 任务提交到 yarn 集群上运行 +```shell +source env.sh +``` + +## 3. 在 Docker Compose 环境运行 + +### 3.1 Docker Compose 文件 +我们提供了 docker compose 环境方便快速启动一个本地的 PostgreSQL 服务和一个 MinIO S3 存储服务。Docker Compose 环境可以在代码库中找到:[lakesoul-docker-compose-env](https://github.com/lakesoul-io/LakeSoul/tree/main/docker/lakesoul-docker-compose-env). + +### 3.2 安装 Docker Compose +安装 Docker Compose 可以参考 Docker 官方文档:[Install Docker Engine](https://docs.docker.com/engine/install/) + +### 3.3 启动 Docker Compose 环境 +启动 Docker Compose 环境,执行以下命令: +```bash +cd docker/lakesoul-docker-compose-env/ +docker compose up -d +``` +然后可以使用 `docker compose ps` 命令来检查服务状态是否是 `running`. PostgreSQL 服务会自动初始化好 LakeSoul 需要的 database 和 表结构。MinIO 服务会创建一个公共读写的桶。PostgreSQL 的用户名、密码、DB名字、MinIO 的桶名可以在 `docker-compose.yml` 文件中修改。 + +### 3.4 在 Docker Compose 环境中运行 LakeSoul 测试 +#### 3.4.1 准备 LakeSoul 配置文件 +```ini title="lakesoul.properties" +lakesoul.pg.driver=com.lakesoul.shaded.org.postgresql.Driver +lakesoul.pg.url=jdbc:postgresql://lakesoul-docker-compose-env-lakesoul-meta-db-1:5432/lakesoul_test?stringtype=unspecified +lakesoul.pg.username=lakesoul_test +lakesoul.pg.password=lakesoul_test +``` +#### 3.4.2 准备 Spark 镜像 +可以使用 bitnami Spark 镜像: +```bash +docker pull bitnami/spark:3.3.1 +``` + +#### 3.4.3 启动 Spark Shell +```bash +docker run --net lakesoul-docker-compose-env_default --rm -ti \ + -v $(pwd)/lakesoul.properties:/opt/spark/work-dir/lakesoul.properties \ + --env lakesoul_home=/opt/spark/work-dir/lakesoul.properties bitnami/spark:3.3.1 \ + spark-shell \ + --packages com.dmetasoul:lakesoul-spark:2.4.0-spark-3.3 \ + --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \ + --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \ + --conf spark.sql.defaultCatalog=lakesoul \ + --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \ + --conf spark.hadoop.fs.s3a.buffer.dir=/opt/spark/work-dir/s3a \ + --conf spark.hadoop.fs.s3a.path.style.access=true \ + --conf spark.hadoop.fs.s3a.endpoint=http://minio:9000 \ + --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider +``` + +#### 3.4.4 执行 LakeSoul Scala API +```scala +val tablePath= "s3://lakesoul-test-bucket/test_table" +val df = Seq(("2021-01-01",1,"rice"),("2021-01-01",2,"bread")).toDF("date","id","name") +df.write + .mode("append") + .format("lakesoul") + .option("rangePartitions","date") + .option("hashPartitions","id") + .option("hashBucketNum","2") + .save(tablePath) +``` + +#### 3.4.5 检查数据是否成功写入 +可以打开链接 http://127.0.0.1:9001/buckets/lakesoul-test-bucket/browse/ 查看数据是否已经成功写入。 +MinIO console 的登录用户名密码是 minioadmin1:minioadmin1。 + +### 3.5 清理元数据表和 MinIO 桶 +清理元数据表内容: +```bash +docker exec -ti lakesoul-docker-compose-env-lakesoul-meta-db-1 psql -h localhost -U lakesoul_test -d lakesoul_test -f /meta_cleanup.sql +``` +清理 MinIO 桶内容: +```bash +docker run --net lakesoul-docker-compose-env_default --rm -t bitnami/spark:3.3.1 aws --no-sign-request --endpoint-url http://minio:9000 s3 rm --recursive s3://lakesoul-test-bucket/ +``` + +### 3.6 停止 Docker Compose 环境 +```bash +cd docker/lakesoul-docker-compose-env/ +docker compose stop +docker compose down +``` diff 
--git a/website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/02-docker-compose.mdx b/website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/02-docker-compose.mdx deleted file mode 100644 index 7da34f6c7..000000000 --- a/website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/02-docker-compose.mdx +++ /dev/null @@ -1,86 +0,0 @@ -# 使用 Docker Compose - - - -## Docker Compose 文件 -我们提供了 docker compose 环境方便快速启动一个本地的 PostgreSQL 服务和一个 MinIO S3 存储服务。Docker Compose 环境可以在代码库中找到:[lakesoul-docker-compose-env](https://github.com/lakesoul-io/LakeSoul/tree/main/docker/lakesoul-docker-compose-env). - -## 安装 Docker Compose -安装 Docker Compose 可以参考 Docker 官方文档:[Install Docker Engine](https://docs.docker.com/engine/install/) - -## 启动 Docker Compose 环境 -启动 Docker Compose 环境,执行以下命令: -```bash -cd docker/lakesoul-docker-compose-env/ -docker compose up -d -``` -然后可以使用 `docker compose ps` 命令来检查服务状态是否是 `running`. PostgreSQL 服务会自动初始化好 LakeSoul 需要的 database 和 表结构。MinIO 服务会创建一个公共读写的桶。PostgreSQL 的用户名、密码、DB名字、MinIO 的桶名可以在 `docker-compose.yml` 文件中修改。 - -## 在 Docker Compose 环境中运行 LakeSoul 测试 -### 准备 LakeSoul 配置文件 -```ini title="lakesoul.properties" -lakesoul.pg.driver=com.lakesoul.shaded.org.postgresql.Driver -lakesoul.pg.url=jdbc:postgresql://lakesoul-docker-compose-env-lakesoul-meta-db-1:5432/lakesoul_test?stringtype=unspecified -lakesoul.pg.username=lakesoul_test -lakesoul.pg.password=lakesoul_test -``` -### 准备 Spark 镜像 -可以使用 bitnami Spark 镜像: -```bash -docker pull bitnami/spark:3.3.1 -``` - -### 启动 Spark Shell -```bash -docker run --net lakesoul-docker-compose-env_default --rm -ti \ - -v $(pwd)/lakesoul.properties:/opt/spark/work-dir/lakesoul.properties \ - --env lakesoul_home=/opt/spark/work-dir/lakesoul.properties bitnami/spark:3.3.1 \ - spark-shell \ - --packages com.dmetasoul:lakesoul-spark:2.5.0-spark-3.3 \ - --conf spark.sql.extensions=com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension \ - --conf spark.sql.catalog.lakesoul=org.apache.spark.sql.lakesoul.catalog.LakeSoulCatalog \ - --conf spark.sql.defaultCatalog=lakesoul \ - --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \ - --conf spark.hadoop.fs.s3a.buffer.dir=/opt/spark/work-dir/s3a \ - --conf spark.hadoop.fs.s3a.path.style.access=true \ - --conf spark.hadoop.fs.s3a.endpoint=http://minio:9000 \ - --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider -``` - -### 执行 LakeSoul Scala API -```scala -val tablePath= "s3://lakesoul-test-bucket/test_table" -val df = Seq(("2021-01-01",1,"rice"),("2021-01-01",2,"bread")).toDF("date","id","name") -df.write - .mode("append") - .format("lakesoul") - .option("rangePartitions","date") - .option("hashPartitions","id") - .option("hashBucketNum","2") - .save(tablePath) -``` - -### 检查数据是否成功写入 -可以打开链接 http://127.0.0.1:9001/buckets/lakesoul-test-bucket/browse/ 查看数据是否已经成功写入。 -MinIO console 的登录用户名密码是 minioadmin1:minioadmin1。 - -## 清理元数据表和 MinIO 桶 -清理元数据表内容: -```bash -docker exec -ti lakesoul-docker-compose-env-lakesoul-meta-db-1 psql -h localhost -U lakesoul_test -d lakesoul_test -f /meta_cleanup.sql -``` -清理 MinIO 桶内容: -```bash -docker run --net lakesoul-docker-compose-env_default --rm -t bitnami/spark:3.3.1 aws --no-sign-request --endpoint-url http://minio:9000 s3 rm --recursive s3://lakesoul-test-bucket/ -``` - -## 停止 Docker Compose 环境 -```bash -cd docker/lakesoul-docker-compose-env/ -docker compose stop -docker compose down -``` \ No newline at end of file diff --git 
a/website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/03-spark-guide.md b/website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/02-spark-guide.md
similarity index 97%
rename from website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/03-spark-guide.md
rename to website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/02-spark-guide.md
index 071339038..de68ce3fc 100644
--- a/website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/03-spark-guide.md
+++ b/website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/02-spark-guide.md
@@ -7,7 +7,7 @@ SPDX-License-Identifier: Apache-2.0
 -->

 ## 配置
-要在Spark中使用LakeSoul,请首先配置[Spark Catalog](02-docker-compose.mdx)。LakeSoul使用Apache Spark的DataSourceV2 API来实现数据源和目录。此外,LakeSoul还提供了 Scala 的表API,以扩展LakeSoul数据表的功能。
+要在Spark中使用LakeSoul,请首先配置[Spark Catalog](01-setup-local-env.md)。LakeSoul使用Apache Spark的DataSourceV2 API来实现数据源和目录。此外,LakeSoul还提供了 Scala 的表API,以扩展LakeSoul数据表的功能。

 ### Spark 3 Support Matrix

diff --git a/website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/04-Flink-Guide.mdx b/website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/03-Flink-Guide.mdx
similarity index 100%
rename from website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/04-Flink-Guide.mdx
rename to website/i18n/zh-Hans/docusaurus-plugin-content-docs/current/01-Getting Started/03-Flink-Guide.mdx