-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Ready for review] Icechunk CLI Initial Implementation #716
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really great @DahnJ !
We don't need to wait until it's totally done to merge. lmk when you think you have reached a good milestone and we can merge in that state.
I've added a new version with serializable configuration. This works for me with both the Python binary and Rust one. Make sure to edit the bucket in cd icechunk
cargo install --path .
# Local repo
❯ icechunk repo create localrepo
✅ Created repository RepositoryAlias("localrepo")
❯ icechunk snapshot list localrepo
SnapshotInfo { id: 14883a60e82269cadf100ec8, parent_id: None, flushed_at: 2025-02-15T10:10:15.952248Z, message: "Repository initialized", metadata: {"__root": Bool(true)} }
# S3 repo
❯ icechunk repo create s3repo
✅ Created repository RepositoryAlias("s3repo")
❯ icechunk snapshot list s3repo
SnapshotInfo { id: 1e075c18c0c4e81ff2fb48e1, parent_id: None, flushed_at: 2025-02-15T10:09:43.722002Z, message: "Repository initialized", metadata: {"__root": Bool(true)} } It's still WIP, but wanted to check that I'm going in the right direction with the config structure here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great @DahnJ !!!!!!
icechunk/default.yaml
Outdated
@@ -0,0 +1,31 @@ | |||
repos: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we move this file src/bin/icechunk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now removed the file altogether (5b03fbd), since it can be generated by icechunk config init
Should we instead decide on a standard location for the config, not relative to the repo? How about $HOME/.config/icechunk/cli_config.yaml
, obtained through dirs::config_dir()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did that here 8f8d89f
On Mac, this actually defaults to $HOME/Library/Application Support
.
icechunk/src/bin/icechunk/main.rs
Outdated
#[tokio::main] | ||
async fn main() { | ||
let cli = IcechunkCLI::parse(); | ||
let _ = run_cli(cli).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably return an error, something like:
fn main() -> Result<(), Box<dyn Error>>
we will want the status code to be != 0 if some error happened
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improved error handling in 1724fcd
icechunk/src/cli/config.rs
Outdated
|
||
use crate::config::{RepositoryConfig, S3Options, S3StaticCredentials}; | ||
|
||
// Redefine to remove the Refreshable field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like quite a bit of work just to remove that field. Remember that you'll need also to do the same for GCP and other credential types. Can we live with Refreshable
even if we fail on it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially I wanted to but the Refreshable
field contains a trait object for which I couldn't simply derive PartialEq
, preventing me from testing these configs. Having this subset-enum seemed like the simplest solution. Do you think there's a better one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you lose PartialEq
but it may be the better than maintaining another parallel hierarchy.
We do a lot of assert!(matches!(...))
in tests when we cannot use PartialEq
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, nice, I didn't know about assert!(matches!(...))
. Changed in 7cdd087
icechunk/src/cli/default.rs
Outdated
repos.repos.insert(RepositoryAlias("s3repo".to_string()), s3_repo); | ||
repos.repos.insert(RepositoryAlias("localrepo".to_string()), local_repo); | ||
|
||
let path = "default.yaml"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know ... this test is writing code to the repos itself. Maybe move it to a temp file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely, was just a quick way of generating something. The file is now removed.
I have added a rudimentary config init
in 4d7afda
icechunk/src/cli/interface.rs
Outdated
repository.lookup_branch("main").await.context("❌ Failed to lookup branch")?; | ||
let ancestry = repository.ancestry(&VersionInfo::SnapshotId(snapshot_id)).await?; | ||
pin!(ancestry); | ||
while let Some(snapshot_info) = ancestry.next().await { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
navigating ancestry in Icechunk is (currently) very slow. I'd recommend putting a limit of say 10 parents, and then let users override that default.
You may want to look at the TryStreamExt
trait for useful methods in Stream
, so you don't have to code the explicit loop. There is try_map
, take
and all what you would expect from an iterator type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this along the lines of what you were suggesting? 18d4d08
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, and probably you no longer need to pin it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
icechunk/src/cli/interface.rs
Outdated
let ancestry = repository.ancestry(&VersionInfo::SnapshotId(snapshot_id)).await?; | ||
pin!(ancestry); | ||
while let Some(snapshot_info) = ancestry.next().await { | ||
println!("{:?}", snapshot_info?); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend printing object ids (like SnapshotId
) using its to_string
method, instead of the Debug
instance. It has a nicer, shorter presentation, and it's the characters that go into the object store path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't that require (in this case) SnapshotInfo
to implement Display
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably should implement __repr__
for PySnapshotInfo
@DahnJ one idea to think about: it could be good to write all the logic in the CLI as a library crate in the project, and then have the binary only dispatching to the library using clap. That way, we can test the CLi very easily, and other libraries can reuse it in the future. |
Implemented a rudimentary interactive ❯ icechunk config init
Enter alias: s3repo
Select repository type: S3
Enter bucket: mybucket
Enter prefix: myprefix
Enter region: eu-west-1
❯ icechunk config list
repos:
s3repo: !S3
location:
bucket: mybucket
prefix: myprefix
object_store_config:
region: eu-west-1
endpoint_url: null
anonymous: false
allow_http: false
credentials: FromEnv
config:
inline_chunk_threshold_bytes: null
unsafe_overwrite_refs: null
get_partial_values_concurrency: null
compression: null
caching: null
storage: null
virtual_chunk_containers: null
manifest: null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is great!
icechunk/Cargo.toml
Outdated
@@ -55,6 +55,9 @@ tracing-subscriber = { version = "0.3.19", features = [ | |||
tracing = "0.1.41" | |||
err-into = "1.0.1" | |||
serde_yaml_ng = "0.10.0" | |||
clap = { version = "4.5", features = ["derive"] } | |||
anyhow = "1.0.95" | |||
dialoguer = "0.11.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the impact of creating another library crate in the workspace and putting everything cli related there? Projects that want to depend on Icechunk as a Rust library, wouldn't need to bring in all these dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about declaring cli
as an optional dependency, not installed by default? That could be more lightweight mechanism to provide a choice to the users.
I did this in 38747c1
This works for me both with cargo
with the Rust library and maturin develop
. Both require --features cli
to install the CLI. Without it
- Cargo won't install the executable at all
- Maturin/Python will still add the
icechunk
executable that only prints
❯ icechunk
Must install the optional `cli` feature to use the Icechunk CLI.
2f94e38
to
3bff618
Compare
3bff618
to
38747c1
Compare
Regarding library crate, I offered an alternative in #716 (comment), let me please know what you think. Regarding testing, made the interface more testable + added tests in 3137ace.
|
@paraseba as far as I am concerned, this is near a mergeable state. My goal with this PR is to provide something well-designed enough for others to be able to start contributing. Happy to make it good enough for Icechunk's standards though, let me know what you think. |
Thanks @DahnJ . Seba is on vacation this week, so feedback here will be a bit delayed. Thanks for your patience. |
Icechunk CLI Implementation
Initial implementation for #461
Design doc: #714
Implements
icechunk repo create
icechunk snapshot list
icechunk config init
icechunk config add
icechunk config list