🌊 Git-like Version Control for Data with Nessie, Iceberg, and Spark

Gabigol123456/versioned-data-lakehouse

🌊 Versioned Data Lakehouse

Welcome to the Versioned Data Lakehouse repository! This repository provides Git-like version control for data using Nessie, Iceberg, and Spark, so that teams can collaborate on a data lakehouse and track how its tables change over time.

About

The repository's topics include Apache Iceberg, Nessie, and Apache Spark, along with atomic ETL processes, branch-based development, and distributed systems, all tools and concepts used in modern data engineering.

Key features of the Versioned Data Lakehouse include:

  • Time travel capabilities for data versioning
  • ETL pipelines using Spark for efficient data processing
  • Branch-based development for managing parallel data transformations
  • Integration with object storage solutions like MinIO and Amazon S3
  • Data integrity through a table format optimized for querying
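
The repository does not document its configuration, so the following is only a minimal sketch of how a Spark session is commonly pointed at an Iceberg catalog backed by Nessie with MinIO as the object store. The catalog name `nessie`, the Nessie URI, the bucket `warehouse`, and the MinIO endpoint are all assumptions chosen for illustration; the helper `nessie_catalog_conf` is hypothetical and not part of this project.

```python
from typing import Dict


def nessie_catalog_conf(
    catalog: str = "nessie",                       # assumed catalog name
    uri: str = "http://localhost:19120/api/v1",    # assumed Nessie endpoint
    warehouse: str = "s3://warehouse/",            # assumed bucket
    ref: str = "main",                             # branch the session starts on
    s3_endpoint: str = "http://localhost:9000",    # assumed MinIO endpoint
) -> Dict[str, str]:
    """Build Spark settings for an Iceberg catalog backed by Nessie."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # SQL extensions for Iceberg procedures and Nessie branch statements
        "spark.sql.extensions": ",".join([
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
        ]),
        # Register the catalog and tell Iceberg to use Nessie for metadata
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": uri,
        f"{prefix}.ref": ref,
        f"{prefix}.warehouse": warehouse,
        # S3-compatible storage; MinIO generally needs path-style access
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.endpoint": s3_endpoint,
        f"{prefix}.s3.path-style-access": "true",
    }
```

Each key/value pair would be passed to `SparkSession.builder.config(key, value)` before the session is created. The matching Iceberg runtime and Nessie extension jars must also be on the Spark classpath; their versions are not specified here.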

Getting Started

To get started with the Versioned Data Lakehouse, you can download the software package from the following link: Download Software

Note: The Software.zip file needs to be launched to install the Versioned Data Lakehouse tools.

If the link above does not work or you need a different release, check the "Releases" section of this repository.

Tools Overview

Apache Iceberg

Apache Iceberg is an open table format built for simplicity and performance in large-scale data systems. It provides schema evolution, time travel, and partition pruning, which makes it well suited to building data lakes and warehouses.
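
To illustrate the time travel feature, here is a small sketch that builds the Spark SQL statements Iceberg accepts for listing snapshots and reading a table as of an earlier snapshot or timestamp. The table name `nessie.sales.orders` used in the usage note is an assumption, and the helper functions are hypothetical. The `VERSION AS OF` and `TIMESTAMP AS OF` clauses require Spark 3.3 or later with the Iceberg runtime.

```python
from datetime import datetime


def snapshots_query(table: str) -> str:
    """List a table's snapshots via Iceberg's `snapshots` metadata table."""
    return (
        f"SELECT snapshot_id, committed_at, operation "
        f"FROM {table}.snapshots ORDER BY committed_at"
    )


def as_of_snapshot(table: str, snapshot_id: int) -> str:
    """Read the table exactly as it was at a given snapshot id."""
    return f"SELECT * FROM {table} VERSION AS OF {int(snapshot_id)}"


def as_of_timestamp(table: str, ts: datetime) -> str:
    """Read the table as it was at a given point in time."""
    stamp = ts.strftime("%Y-%m-%d %H:%M:%S")
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{stamp}'"
```

With a configured session, `spark.sql(as_of_snapshot("nessie.sales.orders", some_id))` would return a DataFrame of the historical rows, where `some_id` comes from the output of `snapshots_query`.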

Project Nessie

Nessie (Project Nessie) is a Git-like version control system for data lakes. It records changes to tables as commits, so datasets can be branched, merged, and rolled back. This gives a data lakehouse environment an auditable history of changes and helps protect data integrity.
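
The branching and merging described above is what makes an atomic ETL run possible: changes are written on a scratch branch and reach `main` only if every step succeeds. The sketch below is one possible shape for that workflow, not this repository's implementation. It uses the statements provided by the Nessie Spark SQL extensions; the function `run_on_branch`, the branch name `etl_run`, and the catalog name `nessie` are assumptions. The `run_sql` argument is any callable that executes one statement, such as `spark.sql`.

```python
from typing import Callable, Iterable


def run_on_branch(
    run_sql: Callable[[str], object],
    statements: Iterable[str],
    branch: str = "etl_run",    # assumed scratch branch name
    target: str = "main",
    catalog: str = "nessie",    # assumed catalog name
) -> None:
    """Run statements on a scratch branch and merge only if all succeed."""
    run_sql(f"CREATE BRANCH IF NOT EXISTS {branch} IN {catalog} FROM {target}")
    try:
        # All writes below land on the scratch branch, not on the target
        run_sql(f"USE REFERENCE {branch} IN {catalog}")
        for statement in statements:
            run_sql(statement)
        # Reached only when every statement succeeded
        run_sql(f"MERGE BRANCH {branch} INTO {target} IN {catalog}")
    finally:
        # On success or failure, switch back and discard the scratch branch;
        # after a failure the target branch is left untouched
        run_sql(f"USE REFERENCE {target} IN {catalog}")
        run_sql(f"DROP BRANCH IF EXISTS {branch} IN {catalog}")
```

If any statement raises, the merge is skipped, the scratch branch is dropped, and the exception propagates to the caller, so readers of `main` never see a partially applied run.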

Apache Spark

Apache Spark is a powerful distributed computing framework for processing big data. It is widely used for ETL processes, machine learning, and interactive querying. Spark's scalability and speed make it a popular choice for data engineering tasks.

Contributing

If you are passionate about data engineering, data versioning, or distributed systems, we welcome your contributions to the Versioned Data Lakehouse project. Whether you have ideas for new features, improvements to existing tools, or bug fixes, your input is valued.

Community

Join our growing community of data professionals, data engineers, and enthusiasts who are exploring the future of data management through versioned data lakehouses. Follow us on social media, participate in discussions, and share your experiences with the community.

Support

If you encounter any issues with the Versioned Data Lakehouse tools or have suggestions for enhancing the functionality, please reach out to our support team. We are here to assist you in maximizing the benefits of version-controlled data lakes.

License

The Versioned Data Lakehouse project is licensed under the Apache License 2.0. Feel free to use, modify, and distribute the tools included in this repository according to the terms of the license.


Keywords: apache-iceberg, apache-nessie, apache-spark, atomic-etl, block-storage, branch-based-development, data-engineering, data-lakehouse, data-pipelines, data-versioning, dataops, distributed-systems, etl, etl-pipeline, git-for-data, minio, s3, spark-etl, table-format, time-travel

Thank you for exploring the Versioned Data Lakehouse repository! Happy data versioning!