Quickstart: add instructions for how to work with lakeFS data locally (#7631)

* boilerplate for work locally step

* add a quickstart guide for working with lakeFS locally

* prettify text

* fix yaml formatting

* change menu title

* Update docs/quickstart/work-with-data-locally.md

Co-authored-by: Oz Katz <oz.katz@treeverse.io>

---------

Co-authored-by: Oz Katz <oz.katz@treeverse.io>
talSofer and ozkatz authored Apr 8, 2024
1 parent 0c47bc4 commit d14070f
Showing 6 changed files with 155 additions and 20 deletions.
Binary file added docs/assets/img/quickstart/lakectl-local-01.png
Binary file added docs/assets/img/quickstart/lakectl-local-02.png
Binary file added docs/assets/img/quickstart/quickstart-step-07.png
21 changes: 1 addition & 20 deletions docs/quickstart/actions-and-hooks.md
@@ -3,7 +3,7 @@ title: 6️⃣ Using Actions and Hooks in lakeFS
description: lakeFS quickstart / Use Actions and Hooks to enforce conditions when committing and merging changes
parent: ⭐ Quickstart
nav_order: 30
-next: ["Resources for learning more about lakeFS", "./learning-more-lakefs.html"]
+next: ["Work with lakeFS data on your local environment", "./work-with-data-locally.html"]
previous: ["Rollback the changes", "./rollback.html"]
---

@@ -171,22 +171,3 @@ You can view the history of all action runs from the **Action** tab:
<img width="75%" src="{{ site.baseurl }}/assets/img/quickstart/hooks-08.png" alt="Action run history in lakeFS" class="quickstart"/>
## Bonus Challenge
And so with that, this quickstart for lakeFS draws to a close. If you're simply having _too much fun_ to stop then here's an exercise for you.
Implement the requirement from the beginning of this quickstart *correctly*, such that you write `denmark-lakes.parquet` in the respective branch and successfully merge it back into main. Look up how to list the contents of the `main` branch and verify that it looks like this:
```
object 2023-03-21 17:33:51 +0000 UTC 20.9 kB denmark-lakes.parquet
object 2023-03-21 14:45:38 +0000 UTC 916.4 kB lakes.parquet
```
# Finishing Up
Once you've finished the quickstart, shut down your local environment with the following command:
```bash
docker stop lakefs
```
12 changes: 12 additions & 0 deletions docs/quickstart/index.md
@@ -100,5 +100,17 @@ This quickstart will introduce you to some of the core ideas in lakeFS and show
</div>
</div>

<div class="row">
<div class="col step-num">
<img src="{{ site.baseurl }}/assets/img/quickstart/quickstart-step-07.png" alt="step 7"/>
</div>
<div class="col">
<h3>
<a href="work-with-data-locally.html">Work Locally</a>
</h3>
<p>Experiment with lakeFS data on a local environment</p>
</div>
</div>

{: .note}
You can use the [30-day free trial of lakeFS Cloud](https://lakefs.cloud/register) if you want to try out lakeFS without installing anything.
142 changes: 142 additions & 0 deletions docs/quickstart/work-with-data-locally.md
@@ -0,0 +1,142 @@
---
title: 7️⃣ Work with lakeFS data locally
description: lakeFS quickstart / Bring lakeFS data to a local environment to show how lakeFS can be used for ML experiment development.
parent: ⭐ Quickstart
nav_order: 35
next: ["Resources for learning more about lakeFS", "./learning-more-lakefs.html"]
previous: ["Using Actions and Hooks in lakeFS", "./actions-and-hooks.html"]
---

# Work with lakeFS Data Locally

When working with lakeFS, there are scenarios where we need to access and manipulate data locally. One example is machine
learning model development, which is dynamic and iterative. To optimize this process, experiments need to be fast, easy to
track, and reproducible. Localizing model data during development accelerates the process by enabling interactive and
offline development and by reducing data access latency.

We're going to use [lakectl local](../howto/local-checkouts.md) to bring a subset of our lakeFS data to a local directory within the lakeFS
container and edit an image dataset used for ML model development.

## Cloning a Subset of lakeFS Data into a Local Directory

1. In lakeFS, create a new branch called `my-experiment`. You can do this through the UI or with `lakectl`:

```bash
docker exec lakefs \
lakectl branch create \
lakefs://quickstart/my-experiment \
--source lakefs://quickstart/main
```

2. Clone images from your quickstart repository into a local directory named `my_local_dir` within your container:

```bash
docker exec lakefs \
lakectl local clone lakefs://quickstart/my-experiment/images my_local_dir
```

3. Verify that `my_local_dir` is linked to the correct path in your lakeFS remote:

```bash
docker exec lakefs \
lakectl local list
```

You should see confirmation that `my_local_dir` is tracking the desired lakeFS path:

```bash
my_local_dir lakefs://quickstart/my-experiment/images/ 8614575b5488b47a094163bd17a12ed0b82e0bcbfd22ed1856151c671f1faa53
```

4. Verify that your local environment is up-to-date with its remote path:

```bash
docker exec lakefs \
lakectl local status my_local_dir
```
You should get a confirmation message like this, showing that there is no difference between your local environment and the lakeFS remote:

```bash
diff 'local:///home/lakefs/my_local_dir' <--> 'lakefs://quickstart/8614575b5488b47a094163bd17a12ed0b82e0bcbfd22ed1856151c671f1faa53/images/'...
diff 'lakefs://quickstart/8614575b5488b47a094163bd17a12ed0b82e0bcbfd22ed1856151c671f1faa53/images/' <--> 'lakefs://quickstart/my-experiment/images/'...
No diff found.
```
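
The commands above address data with URIs of the form `lakefs://<repository>/<ref>/<path>`. As a minimal sketch of how a script could check which branch and prefix a local directory tracks, the snippet below splits the sample `lakectl local list` line shown earlier. The field order (directory, remote URI, commit ID) is an assumption inferred from that sample output, and the variable names are illustrative only.

```bash
# Sample line copied from the `lakectl local list` output above.
line='my_local_dir lakefs://quickstart/my-experiment/images/ 8614575b5488b47a094163bd17a12ed0b82e0bcbfd22ed1856151c671f1faa53'

# Assumption: fields are whitespace-separated in the order
# local directory, remote URI, commit ID.
set -- $line
local_dir=$1
remote_uri=$2
head_commit=$3

# Split lakefs://<repository>/<ref>/<path> with POSIX parameter expansion.
rest=${remote_uri#lakefs://}   # drop the scheme
repo=${rest%%/*}               # text before the first slash
rest=${rest#*/}                # drop the repository segment
ref=${rest%%/*}                # branch (or other ref)
path=${rest#*/}                # remaining prefix inside the ref

echo "dir=$local_dir repo=$repo ref=$ref path=$path commit=$head_commit"
```

In a real script you would replace the hard-coded `line` with the output of `docker exec lakefs lakectl local list`.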

## Making Changes to Data Locally

1. Download a new image of an axolotl and copy it into `my_local_dir`, replacing the `axolotl.png` already in the cloned dataset:

```bash
curl -L https://go.lakefs.io/43ENDyS > axolotl.png
docker cp axolotl.png lakefs:/home/lakefs/my_local_dir
```

2. Clean the dataset by removing images larger than 225 KB:
```bash
docker exec lakefs \
find my_local_dir -type f -size +225k -delete
```

3. Check the status of your local changes compared to the lakeFS remote path:
```bash
docker exec lakefs \
lakectl local status my_local_dir
```

You should get a confirmation message like this, showing the modifications you made locally:
```bash
diff 'local:///home/lakefs/my_local_dir' <--> 'lakefs://quickstart/8614575b5488b47a094163bd17a12ed0b82e0bcbfd22ed1856151c671f1faa53/images/'...
diff 'lakefs://quickstart/8614575b5488b47a094163bd17a12ed0b82e0bcbfd22ed1856151c671f1faa53/images/' <--> 'lakefs://quickstart/my-experiment/images/'...
╔════════╦══════════╦═════════════════════╗
║ SOURCE ║ CHANGE   ║ PATH                ║
╠════════╬══════════╬═════════════════════╣
║ local  ║ modified ║ axolotl.png         ║
║ local  ║ removed  ║ duckdb-main-02.png  ║
║ local  ║ removed  ║ empty-repo-list.png ║
║ local  ║ removed  ║ repo-contents.png   ║
╚════════╩══════════╩═════════════════════╝
```
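
Because `find ... -delete` is irreversible, it can help to preview the matches first. The sketch below is not part of the quickstart environment: it builds a throwaway directory (a hypothetical stand-in for `my_local_dir`) with one file above and one below the threshold, lists what would be removed with `-print`, and then deletes with the same predicate. With GNU `find`, `-size +225k` rounds sizes up to whole 1024-byte units, so it matches files larger than 225 KiB (230,400 bytes). Other `find` implementations may round differently.

```bash
set -eu

# Hypothetical scratch directory standing in for my_local_dir.
demo_dir=$(mktemp -d)

# One file above the 225 KiB threshold and one below it.
head -c 300000 /dev/zero > "$demo_dir/large.png"
head -c 100000 /dev/zero > "$demo_dir/small.png"

# Dry run: -print lists exactly what -delete would remove.
to_delete=$(find "$demo_dir" -type f -size +225k -print)
echo "Would delete: $to_delete"

# Actual cleanup, using the same predicate as the quickstart step.
find "$demo_dir" -type f -size +225k -delete
remaining=$(ls "$demo_dir")
echo "Remaining: $remaining"

# Remove the scratch directory.
rm -rf "$demo_dir"
```

Inside the container, the equivalent preview would be the quickstart command with `-print` in place of `-delete`.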

## Pushing Local Changes to lakeFS

Once we are done editing the image dataset in our local environment, we push our changes to the lakeFS remote so that
the improved dataset is shared and versioned.

1. Commit your local changes to lakeFS:

```bash
docker exec lakefs \
lakectl local commit \
-m 'Deleted images larger than 225KB in size and changed the Axolotl image' my_local_dir
```

In your branch, you should see the commit including your local changes:

<img width="75%" src="{{ site.baseurl }}/assets/img/quickstart/lakectl-local-01.png" alt="A lakectl local commit to lakeFS" class="quickstart"/>

2. Compare the `my-experiment` branch to the `main` branch to see your changes:

<img width="75%" src="{{ site.baseurl }}/assets/img/quickstart/lakectl-local-02.png" alt="A comparison between a branch that includes local changes and the main branch" class="quickstart"/>
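
The comparison in step 2 can also be run from the command line. This is a minimal sketch using `lakectl diff` with two refs, assuming the quickstart container is named `lakefs` and is still running. The guard around `docker` and the `diff_attempted` variable are illustrative additions, not part of the quickstart.

```bash
# Assumption: the quickstart container is named `lakefs` and is running.
base='lakefs://quickstart/main'
experiment='lakefs://quickstart/my-experiment'

if command -v docker >/dev/null 2>&1; then
  # Show what changed on my-experiment relative to main.
  docker exec lakefs lakectl diff "$base" "$experiment" \
    || echo "lakectl diff failed; is the lakefs container running?" >&2
  diff_attempted=yes
else
  echo "docker not found; run this on the host running the quickstart container" >&2
  diff_attempted=no
fi
```

The listed paths should correspond to the image changes you committed from `my_local_dir`.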

## Bonus Challenge

And so with that, this quickstart for lakeFS draws to a close. If you're simply having _too much fun_ to stop, then here's an exercise for you.

Implement the requirement from the beginning of this quickstart *correctly*, such that you write `denmark-lakes.parquet` in the respective branch and successfully merge it back into main. Look up how to list the contents of the `main` branch and verify that it looks like this:

```
object 2023-03-21 17:33:51 +0000 UTC 20.9 kB denmark-lakes.parquet
object 2023-03-21 14:45:38 +0000 UTC 916.4 kB lakes.parquet
```

## Finishing Up

Once you've finished the quickstart, shut down your local environment with the following command:
```bash
docker stop lakefs
```
