New history() command to replace list_versions() #3563

Open
wjones127 opened this issue Mar 18, 2025 · 0 comments
Adding operation information

We have Transaction information, but it's not well surfaced. It would be nice to list, alongside each version, the operations performed to produce it.

Cleanup

When we clean up old versions, they are also removed from the version list.

Performance

The logic for listing versions is:

  1. Use CommitHandler to list the manifests
  2. For each manifest:
    1. Open it
    2. Read the timestamp and metadata for that version.

If there are a lot of versions, this can be expensive.
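To make the cost concrete, here is a minimal sketch of that loop. `ManifestStore`, `Manifest`, and `list_versions` are hypothetical in-memory stand-ins for CommitHandler and object storage, not Lance APIs. The only point is that listing N versions costs one list call plus N manifest reads.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class Manifest:
    version: int
    timestamp: datetime
    metadata: Dict[str, str] = field(default_factory=dict)


class ManifestStore:
    """Hypothetical stand-in for CommitHandler plus object storage."""

    def __init__(self, manifests: List[Manifest]) -> None:
        self._manifests = {m.version: m for m in manifests}
        self.reads = 0  # number of manifests opened so far

    def list_manifests(self) -> List[int]:
        return sorted(self._manifests)

    def open_manifest(self, version: int) -> Manifest:
        self.reads += 1  # each open would be a separate read against storage
        return self._manifests[version]


def list_versions(store: ManifestStore) -> List[dict]:
    out = []
    for version in store.list_manifests():       # step 1: list the manifests
        manifest = store.open_manifest(version)  # step 2.1: open each one
        out.append(                              # step 2.2: read timestamp and metadata
            {
                "version": manifest.version,
                "timestamp": manifest.timestamp,
                "metadata": manifest.metadata,
            }
        )
    return out


store = ManifestStore(
    [Manifest(v, datetime(2025, 3, 18, tzinfo=timezone.utc)) for v in range(1, 1001)]
)
versions = list_versions(store)
print(len(versions), store.reads)
```

With 1000 versions the loop performs 1000 manifest reads, which is the cost the cache proposed later in this issue is meant to avoid.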

API

We would create a new history() command to replace the versions() API that would provide more information about the commits:

def history(
    min_version: Optional[int] = None,
    created_after: Optional[datetime] = None,
    cleaned_up: Optional[bool] = None,
    only_tagged: bool = False,
) -> List[HistoryEntry]: ...


class HistoryEntry:
    version: int
    created_on: datetime
    created_by: str
    tags: Set[str]
    operation: LanceOperation
    cleaned_up: bool

TBD: how should we populate the created_by field? Should we instead add a generic metadata field? Then we could record arbitrary things like lance_version, lancedb_version, author, etc.

The only_tagged parameter lets users filter for only the versions that currently have tags.
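As a rough illustration of how these filters might combine, here is a sketch over an in-memory list. Several details are assumptions, since the proposal does not pin them down: `min_version` is treated as inclusive, `created_after` as strictly after, `cleaned_up=None` as "return both", and `operation` is a plain string standing in for LanceOperation. `filter_history` is a hypothetical helper, not a proposed API.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Set


@dataclass
class HistoryEntry:
    version: int
    created_on: datetime
    created_by: str
    operation: str  # stand-in for LanceOperation
    cleaned_up: bool
    tags: Set[str] = field(default_factory=set)


def filter_history(
    entries: List[HistoryEntry],
    min_version: Optional[int] = None,
    created_after: Optional[datetime] = None,
    cleaned_up: Optional[bool] = None,
    only_tagged: bool = False,
) -> List[HistoryEntry]:
    result = []
    for entry in entries:
        if min_version is not None and entry.version < min_version:
            continue  # assumption: min_version is inclusive
        if created_after is not None and entry.created_on <= created_after:
            continue  # assumption: strictly after
        if cleaned_up is not None and entry.cleaned_up != cleaned_up:
            continue  # assumption: None means no filtering on cleanup state
        if only_tagged and not entry.tags:
            continue
        result.append(entry)
    return result


entries = [
    HistoryEntry(1, datetime(2025, 1, 1), "alice", "Overwrite", cleaned_up=True),
    HistoryEntry(2, datetime(2025, 2, 1), "alice", "Append", cleaned_up=False, tags={"v1"}),
    HistoryEntry(3, datetime(2025, 3, 1), "bob", "Delete", cleaned_up=False),
]

tagged = filter_history(entries, only_tagged=True)
recent = filter_history(entries, created_after=datetime(2025, 2, 1))
```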

Improving performance and handling cleanup

We can store a new file to speed up history listing: _history/{version}.lance. This would be a Lance file containing a cache of the history up to {version}.

Tags are mutable, but we can still include them in the cache as (name, etag) pairs. To check them, we can list the _refs/tags directory. If a cached tag is missing from the listing, we ignore it. If the listing has an etag we don't recognize, we can add that tag to the cache.
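One possible reading of that reconciliation, sketched below. `reconcile_tags` and `resolve_version` are hypothetical names, and the sketch assumes that a cached pair is only trusted when the listing shows the same name with the same etag, so a tag that moved to another version is treated like an unrecognized etag and re-read.

```python
from typing import Callable, Dict, Set, Tuple

TagPair = Tuple[str, str]  # (name, etag)


def reconcile_tags(
    cached: Dict[int, Set[TagPair]],
    listed: Dict[str, str],
    resolve_version: Callable[[str], int],
) -> Dict[int, Set[str]]:
    """Return the current tag names for each version.

    cached: version -> (name, etag) pairs stored in the history cache
    listed: name -> etag, as obtained by listing _refs/tags
    resolve_version: reads a tag file and returns the version it points to;
        only called for (name, etag) pairs the cache does not recognize
    """
    current: Dict[int, Set[str]] = {}
    recognized: Set[TagPair] = set()
    for version, pairs in cached.items():
        for name, etag in pairs:
            if listed.get(name) == etag:
                current.setdefault(version, set()).add(name)
                recognized.add((name, etag))
            # Otherwise the tag was deleted or changed: ignore the cached pair.
    for name, etag in listed.items():
        if (name, etag) not in recognized:
            # Unrecognized etag: read the tag file to learn its version.
            current.setdefault(resolve_version(name), set()).add(name)
    return current


cached = {1: {("stable", "etag-a")}, 2: {("old", "etag-b")}}
listed = {"stable": "etag-a", "nightly": "etag-c"}
tag_files = {"nightly": 3}  # pretend contents of the tag files
reads = []


def resolve(name: str) -> int:
    reads.append(name)
    return tag_files[name]


result = reconcile_tags(cached, listed, resolve)
```

In this example only the "nightly" tag file is read; "stable" is served from the cache and the deleted "old" tag is dropped.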

When we run history(), we can write this file if it doesn't exist. When we run cleanup_old_versions(), we can also create this file before deleting old versions. By caching this data, we keep the full history around. If we want, we can also add a parameter to cleanup_old_versions() to prune this history so we don't retain it longer than some time period.
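The flow described above could look roughly like the following in-memory model. Everything here is hypothetical: the `Dataset` class, its dictionaries standing in for manifests and `_history/{version}.lance` files, and the `keep` parameter are illustration only and do not reflect the Lance API or file format.

```python
from typing import Dict, List


class Dataset:
    """Hypothetical model of manifests plus _history/{version} cache files."""

    def __init__(self) -> None:
        self.latest = 0
        self.manifests: Dict[int, str] = {}              # version -> operation
        self.history_files: Dict[int, List[dict]] = {}   # cache version -> entries
        self.manifest_reads = 0

    def commit(self, operation: str) -> int:
        self.latest += 1
        self.manifests[self.latest] = operation
        return self.latest

    def _load_cache(self) -> List[dict]:
        if not self.history_files:
            return []
        newest = max(self.history_files)
        return [dict(e) for e in self.history_files[newest]]

    def history(self) -> List[dict]:
        entries = self._load_cache()
        known = {e["version"] for e in entries}
        for version in sorted(self.manifests):
            if version not in known:
                self.manifest_reads += 1  # only uncached manifests are opened
                entries.append(
                    {
                        "version": version,
                        "operation": self.manifests[version],
                        "cleaned_up": False,
                    }
                )
        # Write the cache file if one does not exist for the newest version.
        if entries and entries[-1]["version"] not in self.history_files:
            self.history_files[entries[-1]["version"]] = [dict(e) for e in entries]
        return entries

    def cleanup_old_versions(self, keep: int) -> None:
        entries = self.history()  # make sure the cache exists before deleting
        if not entries:
            return
        latest = entries[-1]["version"]
        for entry in entries:
            if entry["version"] <= latest - keep:
                self.manifests.pop(entry["version"], None)
                entry["cleaned_up"] = True
        self.history_files[latest] = entries  # record which versions are gone


ds = Dataset()
for _ in range(5):
    ds.commit("Append")
ds.history()                     # reads 5 manifests and writes the cache
ds.cleanup_old_versions(keep=2)  # deletes versions 1-3, no extra reads
ds.commit("Delete")
final = ds.history()             # reads only the manifest for version 6
```

After cleanup, the manifests for versions 1 to 3 are gone, but they still appear in the history with cleaned_up set, and only one additional manifest read is needed for the new version.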
