Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ready for review] Icechunk CLI Initial Implementation #716

Open
wants to merge 29 commits into
base: main
Choose a base branch
from

Conversation

DahnJ
Copy link
Contributor

@DahnJ DahnJ commented Feb 11, 2025

Icechunk CLI Implementation

Initial implementation for #461

Design doc: #714

Implements

  • Python integration
  • Serializable configuration
  • Works against local filesystem and S3
  • Operations
    • icechunk repo create
    • icechunk snapshot list
    • icechunk config init
    • icechunk config add
    • icechunk config list

Copy link
Collaborator

@paraseba paraseba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really great @DahnJ !

We don't need to wait until it's totally done to merge. lmk when you think you have reached a good milestone and we can merge in that state.

@DahnJ
Copy link
Contributor Author

DahnJ commented Feb 15, 2025

I've added a new version with serializable configuration. This works for me with both the Python binary and Rust one.

Make sure to edit the bucket in default.yaml directly or through editing default.rs which generates it.

cd icechunk
cargo install --path .

# Local repo
❯ icechunk repo create localrepo
✅ Created repository RepositoryAlias("localrepo")
❯ icechunk snapshot list localrepo
SnapshotInfo { id: 14883a60e82269cadf100ec8, parent_id: None, flushed_at: 2025-02-15T10:10:15.952248Z, message: "Repository initialized", metadata: {"__root": Bool(true)} }


# S3 repo
❯ icechunk repo create s3repo
✅ Created repository RepositoryAlias("s3repo")
❯ icechunk snapshot list s3repo
SnapshotInfo { id: 1e075c18c0c4e81ff2fb48e1, parent_id: None, flushed_at: 2025-02-15T10:09:43.722002Z, message: "Repository initialized", metadata: {"__root": Bool(true)} }

It's still WIP, but wanted to check that I'm going in the right direction with the config structure here.

@DahnJ DahnJ changed the title [Draft] Icechunk CLI Implementation Skeleton [Draft] Icechunk CLI Initial Implementation Feb 15, 2025
Copy link
Collaborator

@paraseba paraseba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great @DahnJ !!!!!!

@@ -0,0 +1,31 @@
repos:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move this file src/bin/icechunk?

Copy link
Contributor Author

@DahnJ DahnJ Feb 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now removed the file altogether (5b03fbd), since it can be generated by icechunk config init

Should we instead decide on a standard location for the config, not relative to the repo? How about $HOME/.config/icechunk/cli_config.yaml, obtained through dirs::config_dir()?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did that here 8f8d89f

On Mac, this actually defaults to $HOME/Library/Application Support.

#[tokio::main]
async fn main() {
let cli = IcechunkCLI::parse();
let _ = run_cli(cli).await;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably return an error, something like:

fn main() -> Result<(), Box<dyn Error>>

we will want the status code to be != 0 if some error happened

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved error handling in 1724fcd


use crate::config::{RepositoryConfig, S3Options, S3StaticCredentials};

// Redefine to remove the Refreshable field
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like quite a bit of work just to remove that field. Remember that you'll need also to do the same for GCP and other credential types. Can we live with Refreshable even if we fail on it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially I wanted to but the Refreshable field contains a trait object for which I couldn't simply derive PartialEq, preventing me from testing these configs. Having this subset-enum seemed like the simplest solution. Do you think there's a better one?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you lose PartialEq but it may be the better than maintaining another parallel hierarchy.

We do a lot of assert!(matches!(...)) in tests when we cannot use PartialEq

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, nice, I didn't know about assert!(matches!(...)). Changed in 7cdd087

repos.repos.insert(RepositoryAlias("s3repo".to_string()), s3_repo);
repos.repos.insert(RepositoryAlias("localrepo".to_string()), local_repo);

let path = "default.yaml";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know ... this test is writing code to the repos itself. Maybe move it to a temp file?

Copy link
Contributor Author

@DahnJ DahnJ Feb 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely, was just a quick way of generating something. The file is now removed.

I have added a rudimentary config init in 4d7afda

repository.lookup_branch("main").await.context("❌ Failed to lookup branch")?;
let ancestry = repository.ancestry(&VersionInfo::SnapshotId(snapshot_id)).await?;
pin!(ancestry);
while let Some(snapshot_info) = ancestry.next().await {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

navigating ancestry in Icechunk is (currently) very slow. I'd recommend putting a limit of say 10 parents, and then let users override that default.

You may want to look at the TryStreamExt trait for useful methods in Stream, so you don't have to code the explicit loop. There is try_map, take and all what you would expect from an iterator type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this along the lines of what you were suggesting? 18d4d08

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, and probably you no longer need to pin it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let ancestry = repository.ancestry(&VersionInfo::SnapshotId(snapshot_id)).await?;
pin!(ancestry);
while let Some(snapshot_info) = ancestry.next().await {
println!("{:?}", snapshot_info?);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend printing object ids (like SnapshotId) using its to_string method, instead of the Debug instance. It has a nicer, shorter presentation, and it's the characters that go into the object store path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that require (in this case) SnapshotInfo to implement Display?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should implement __repr__ for PySnapshotInfo

@paraseba
Copy link
Collaborator

@DahnJ one idea to think about: it could be good to write all the logic in the CLI as a library crate in the project, and then have the binary only dispatching to the library using clap. That way, we can test the CLi very easily, and other libraries can reuse it in the future.

@DahnJ
Copy link
Contributor Author

DahnJ commented Feb 18, 2025

Implemented a rudimentary interactive config init and config list

❯ icechunk config init
Enter alias: s3repo
Select repository type: S3
Enter bucket: mybucket
Enter prefix: myprefix
Enter region: eu-west-1
❯ icechunk config list
repos:
  s3repo: !S3
    location:
      bucket: mybucket
      prefix: myprefix
    object_store_config:
      region: eu-west-1
      endpoint_url: null
      anonymous: false
      allow_http: false
    credentials: FromEnv
    config:
      inline_chunk_threshold_bytes: null
      unsafe_overwrite_refs: null
      get_partial_values_concurrency: null
      compression: null
      caching: null
      storage: null
      virtual_chunk_containers: null
      manifest: null

Copy link
Collaborator

@paraseba paraseba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great!

@@ -55,6 +55,9 @@ tracing-subscriber = { version = "0.3.19", features = [
tracing = "0.1.41"
err-into = "1.0.1"
serde_yaml_ng = "0.10.0"
clap = { version = "4.5", features = ["derive"] }
anyhow = "1.0.95"
dialoguer = "0.11.0"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the impact of creating another library crate in the workspace and putting everything cli related there? Projects that want to depend on Icechunk as a Rust library, wouldn't need to bring in all these dependencies.

Copy link
Contributor Author

@DahnJ DahnJ Feb 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about declaring cli as an optional dependency, not installed by default? That could be more lightweight mechanism to provide a choice to the users.

I did this in 38747c1

This works for me both with cargo with the Rust library and maturin develop. Both require --features cli to install the CLI. Without it

  • Cargo won't install the executable at all
  • Maturin/Python will still add the icechunk executable that only prints
❯ icechunk
Must install the optional `cli` feature to use the Icechunk CLI.

@DahnJ DahnJ force-pushed the feat/dj-cli-start branch from 2f94e38 to 3bff618 Compare February 23, 2025 01:33
@DahnJ DahnJ force-pushed the feat/dj-cli-start branch from 3bff618 to 38747c1 Compare February 23, 2025 01:36
@DahnJ
Copy link
Contributor Author

DahnJ commented Feb 26, 2025

@DahnJ one idea to think about: it could be good to write all the logic in the CLI as a library crate in the project, and then have the binary only dispatching to the library using clap. That way, we can test the CLi very easily, and other libraries can reuse it in the future.

Regarding library crate, I offered an alternative in #716 (comment), let me please know what you think.

Regarding testing, made the interface more testable + added tests in 3137ace.

  • Sadly the dialoguer functions might not be easily testable.
  • We could also consider integration tests via assert_cmd but I think just testing the functions is sufficient for now

@DahnJ
Copy link
Contributor Author

DahnJ commented Feb 26, 2025

@paraseba as far as I am concerned, this is near a mergeable state. My goal with this PR is to provide something well-designed enough for others to be able to start contributing. Happy to make it good enough for Icechunk's standards though, let me know what you think.

@DahnJ DahnJ marked this pull request as ready for review February 26, 2025 00:19
@DahnJ DahnJ changed the title [Draft] Icechunk CLI Initial Implementation [Ready for review] Icechunk CLI Initial Implementation Feb 26, 2025
@dcherian
Copy link
Contributor

Thanks @DahnJ . Seba is on vacation this week, so feedback here will be a bit delayed.

Thanks for your patience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants