Lightning Catalog

The Lightning Catalog is an open-source data catalog designed for preparing data at any scale for ad-hoc analytics, data warehousing, lakehouses, and ML projects. It manages diverse data assets across the enterprise and stitches them together without requiring a centralized repository, using Apache Spark as its primary compute engine.

Key Capabilities

  • Management of endpoints for diverse data assets across the enterprise, providing unified access to them through SQL and the Apache Spark API.
  • Ad-hoc analytics through SQL queries over underlying data assets in a federated manner (e.g., joins between multiple source systems without moving data).
  • Simplifies the lifecycle of data engineering pipelines, from build and test to deployment, by leveraging Data Flow Tables. For example, users can load data into a lakehouse by fetching deltas and transforming data through SQL operations.
  • Alleviates the burden of data preparation for ML engineers, enabling them to focus on building models. It also supports access to unstructured data.
  • Ingestion of unstructured files, with queries over both their metadata and actual contents, leveraging Spark's parallel processing capabilities.
  • Creation of a unified semantic layer in a top-down manner, allowing users to upload DDL and map table definitions to underlying data sources.
  • Business data quality checks, including database constraint checks (primary key, unique, foreign key) over non-RDBMS tables such as Parquet, Iceberg, Delta, and more.

Online Documentation

Collaboration

Lightning tracks issues on GitHub and encourages contributions via pull requests.

Building - Backend

Lightning is built using Gradle and supports Java 8, 11, 17, 18, and 19.

  • To build and run tests: ./gradlew build
  • To skip tests: ./gradlew build -x test -x integrationTest
  • To fix code style for default versions: ./gradlew spotlessApply
  • To fix code style for all versions of Spark/Hive/Flink: ./gradlew spotlessApply -DallVersions
  • To build with a specific Spark version profile: ./gradlew clean build -DdefaultSparkMajorVersion=3.4 -DdefaultSparkVersion=3.4.2
  • The distribution package can be found at lightning-metastore/spark/<spark_version> (e.g., v3.4, v3.5)/spark-runtime/build/distributions.
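As a minimal sketch of how the distribution path relates to the Spark profile used in the build above (the version value here is an illustrative assumption, not fixed by the project):

```shell
# Sketch: derive the expected distribution directory for a chosen Spark profile.
# SPARK_MAJOR_VERSION is an illustrative value; substitute the profile you built with,
# e.g. the -DdefaultSparkMajorVersion value passed to Gradle.
SPARK_MAJOR_VERSION="3.4"
DIST_DIR="lightning-metastore/spark/v${SPARK_MAJOR_VERSION}/spark-runtime/build/distributions"
echo "$DIST_DIR"
```

After a successful `./gradlew build`, the packaged distribution should appear under that directory for the matching Spark profile.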

Building - Frontend & Backend

Lightning provides build.sh to build both the frontend and backend.

  • build.sh accepts the same parameters as the backend build commands listed above.
  • The distribution package can be found at lightning-metastore/build/lightning-metastore-(spark_major_version)-(lightning_version).zip.

Execution

  • Copy third-party libraries, such as JDBC libraries, into $LIGHTNING_HOME/3rd-party-lib.
  • Modify the following two parameters in $LIGHTNING_HOME/bin/start-light.sh, then run the script.
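The staging step above can be sketched as follows; the JDBC driver filename is a hypothetical example, and a temporary directory stands in for a real $LIGHTNING_HOME:

```shell
# Sketch: stage a third-party JDBC driver so Lightning can pick it up at startup.
# The driver filename and LIGHTNING_HOME location are illustrative assumptions.
LIGHTNING_HOME="$(mktemp -d)"
mkdir -p "$LIGHTNING_HOME/3rd-party-lib"

DRIVER="postgresql-42.7.3.jar"
touch "$DRIVER"                              # stand-in for a downloaded driver jar
cp "$DRIVER" "$LIGHTNING_HOME/3rd-party-lib/"

ls "$LIGHTNING_HOME/3rd-party-lib"           # prints: postgresql-42.7.3.jar
```

With the driver in place, adjust the parameters in $LIGHTNING_HOME/bin/start-light.sh and launch the script as described above.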

Major Features

  • Running the catalog on file systems (HDFS, blob storage, and local file systems), enabling version control.
  • Support for Apache Spark plug-in architecture.
  • Ability to run data pipelines at any scale by leveraging Apache Spark.
  • Support for running ANSI SQL and HiveQL queries over underlying source systems.
  • Support for multiple namespaces.
  • Data flow tables, a declarative ETL framework that defines transformations on data.
  • Processing unstructured data, recursively accessing all files and their metadata from an endpoint.
  • Unified semantic layer (USL) by compiling and deploying DDL.
  • Database constraint checks and business rule data quality checks over USL.

Currently Supported Data Sources (More to be Added)

Delta Lake
Iceberg
H2
Snowflake
PostgreSQL
Oracle
MS SQL Server
Redshift
Teradata
MySQL
DB2
SQLite
MariaDB
Derby
HANA
Greenplum
Vertica
Netezza
CSV
Parquet
ORC
JSON
Avro
PDF
Image
AVI
TXT