Docs: added huggingface Datasets integration guide (#7613)
ozkatz authored Apr 2, 2024
1 parent ac07c1c commit 314966b
Showing 3 changed files with 79 additions and 0 deletions.
Binary file added docs/assets/img/logos/huggingface.png
73 changes: 73 additions & 0 deletions docs/integrations/huggingface_datasets.md
@@ -0,0 +1,73 @@
---
title: HuggingFace Datasets
description: Read, write and version your HuggingFace datasets with lakeFS
parent: Integrations

---
# Versioning HuggingFace Datasets with lakeFS


{% include toc_2-3.html %}


[HuggingFace 🤗 Datasets](https://huggingface.co/docs/datasets/) is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.

🤗 Datasets supports access to [cloud storage](https://huggingface.co/docs/datasets/en/filesystems) providers through [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) FileSystem implementations.

[lakefs-spec](https://lakefs-spec.org/) is a community implementation of an fsspec Filesystem that fully leverages lakeFS' capabilities. Let's start by installing it:

## Installation

```shell
pip install lakefs-spec
```

## Configuration

If you've already configured the lakeFS Python SDK and/or lakectl, you should have a `$HOME/.lakectl.yaml` file that contains the access credentials and endpoint for your lakeFS environment.

Otherwise, install [`lakectl`](../reference/cli.html#installing-lakectl-locally) and run `lakectl config` to set up your access credentials.
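
Before going further, it can help to confirm that the configuration file is where we expect it. The snippet below is a minimal sketch using only the standard library; `find_lakectl_config` is a hypothetical helper written for this guide, not part of lakectl or lakefs-spec, and it only checks that the file exists, not that its contents are valid.

```python
from pathlib import Path


def find_lakectl_config(home=None):
    """Return the path to .lakectl.yaml under `home`, or None if it is missing.

    `home` defaults to the current user's home directory.
    """
    base = Path(home) if home is not None else Path.home()
    candidate = base / ".lakectl.yaml"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    path = find_lakectl_config()
    if path is None:
        print("No .lakectl.yaml found; run `lakectl config` first.")
    else:
        print(f"Found lakeFS configuration at {path}")
```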


## Reading a Dataset

To read a dataset, all we have to do is use a `lakefs://...` URI when calling [`load_dataset`](https://huggingface.co/docs/datasets/en/loading):

```python
>>> from datasets import load_dataset
>>>
>>> dataset = load_dataset('csv', data_files='lakefs://example-repository/my-branch/data/example.csv')
```

That's it! This should automatically load the lakefs-spec implementation that we installed, which reads its credentials from the `$HOME/.lakectl.yaml` file, so we don't need to pass any additional configuration.
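
If a `.lakectl.yaml` file is not available (for example, in a CI job), credentials can probably be passed explicitly instead. The following is a sketch under assumptions: it relies on `load_dataset` forwarding `storage_options` to the fsspec filesystem, and on the lakefs-spec filesystem accepting `host`, `username` and `password` keyword arguments, so check both libraries' documentation for your installed versions. The environment variable names and the helper functions are hypothetical.

```python
import os


def lakefs_storage_options(env=None):
    """Build fsspec storage options for lakefs-spec from environment variables.

    The variable names used here are hypothetical; substitute whatever your
    deployment provides.
    """
    env = os.environ if env is None else env
    return {
        "host": env["LAKEFS_ENDPOINT"],
        "username": env["LAKEFS_ACCESS_KEY_ID"],
        "password": env["LAKEFS_SECRET_ACCESS_KEY"],
    }


def load_csv(uri, storage_options):
    """Load a CSV file from lakeFS, passing credentials explicitly."""
    # Imported lazily so the helper above can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("csv", data_files=uri, storage_options=storage_options)
```

With the variables set, `load_csv('lakefs://example-repository/my-branch/data/example.csv', lakefs_storage_options())` would be the explicit equivalent of the call shown earlier.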

## Saving/Loading

Once we've loaded a Dataset, we can save it using the `save_to_disk` method as normal:

```python
>>> dataset.save_to_disk('lakefs://example-repository/my-branch/datasets/example/')
```

At this point, we might want to commit that change to lakeFS and tag it, so we can share it with our colleagues.

We could do this through the UI or `lakectl`, but let's use the [lakeFS Python SDK](./python.md#using-the-lakefs-sdk):


```python
>>> import lakefs
>>>
>>> repo = lakefs.repository('example-repository')
>>> commit = repo.branch('my-branch').commit(
...     'saved my first huggingface Dataset!',
...     metadata={'using': '🤗'})
>>> repo.tag('alice_experiment1').create(commit)
```

Now, others on our team can load our exact dataset by using the tag we created:

```python
>>> from datasets import load_from_disk
>>>
>>> dataset = load_from_disk('lakefs://example-repository/alice_experiment1/datasets/example/')
```
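
The branch URI used for saving and the tag URI used for loading share the layout `lakefs://<repository>/<ref>/<path>`; only the ref component changes. As a rough illustration, the sketch below splits a URI into those parts and swaps the ref. The helper names (`parse_lakefs_uri`, `build_lakefs_uri`, `with_ref`) are hypothetical and are not provided by lakefs-spec or the lakeFS SDK.

```python
from typing import NamedTuple


class LakeFSURI(NamedTuple):
    repository: str
    ref: str   # a branch name, tag name, or commit ID
    path: str  # object path inside the repository; may be empty


def parse_lakefs_uri(uri: str) -> LakeFSURI:
    """Split 'lakefs://<repository>/<ref>/<path>' into its three parts."""
    prefix = "lakefs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a lakeFS URI: {uri!r}")
    parts = uri[len(prefix):].split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"URI must include a repository and a ref: {uri!r}")
    path = parts[2] if len(parts) == 3 else ""
    return LakeFSURI(parts[0], parts[1], path)


def build_lakefs_uri(repository: str, ref: str, path: str = "") -> str:
    """Assemble a lakeFS URI from its parts."""
    return f"lakefs://{repository}/{ref}/{path.lstrip('/')}"


def with_ref(uri: str, new_ref: str) -> str:
    """Return the same repository and path, addressed through a different ref."""
    parsed = parse_lakefs_uri(uri)
    return build_lakefs_uri(parsed.repository, new_ref, parsed.path)
```

For instance, `with_ref('lakefs://example-repository/my-branch/datasets/example/', 'alice_experiment1')` yields the tag URI passed to `load_from_disk` above. Reading through a tag pins the dataset to the tagged commit, whereas a branch URI follows whatever is committed to the branch later.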
6 changes: 6 additions & 0 deletions docs/integrations/index.md
@@ -45,6 +45,12 @@ See below for detailed instructions for using different technologies with lakeFS
<td width="25%" align=center><a href="./r.html"><img width=120 src="{{ site.baseurl }}/assets/img/logos/r.png" alt="r logo"/><br/>R</a></td>
<td width="25%" align=center><a href="./vertex_ai.html"><img width=120 src="{{ site.baseurl }}/assets/img/logos/vertex_ai.png" alt="Vertex AI Logo"/><br/>Vertex AI</a></td>
</tr>
<tr>
<td width="25%" align=center><a href="./huggingface_datasets.html"><img width=120 src="{{ site.baseurl }}/assets/img/logos/huggingface.png" alt="Hugging Face Logo"/><br/>HuggingFace Datasets</a></td>
<td width="25%" align=center></td>
<td width="25%" align=center></td>
<td width="25%" align=center></td>
</tr>
</table>

{: .tip}