v0.3.0
‼️ v0.2 → v0.3 Migration Guide ‼️
We're proud to release version 0.3.0 of Daft! Please note that with this minor version increment, v0.3 contains several breaking changes:
daft.read_delta_lake
- This function was deprecated in favor of daft.read_deltalake in v0.2.26 and is now removed. (#2663)
daft.read_parquet/daft.read_csv/daft.read_json
- Schema hints are deprecated in favor of infer_schema (whether to turn on schema inference) and schema (a definitive schema if infer_schema is False, otherwise it is used as a schema hint that is applied post inference). (#2326)
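The interaction between infer_schema and schema can be modeled with a small sketch. This is not Daft's implementation: `resolve_schema` is a hypothetical helper, schemas are represented as plain dicts of column name to type name, and the choice to ignore hints for columns that inference did not find is an assumption made only for this illustration.

```python
# Simplified model of the semantics described above; NOT Daft's implementation.
# `resolve_schema` is a hypothetical helper and schemas are plain dicts.
def resolve_schema(inferred, schema=None, infer_schema=True):
    """Return the schema a reader would end up with under this model."""
    if not infer_schema:
        # With inference off, the user-provided schema is definitive.
        if schema is None:
            raise ValueError("schema is required when infer_schema is False")
        return dict(schema)
    # With inference on, the provided schema acts as a hint applied afterwards.
    resolved = dict(inferred)
    for name, dtype in (schema or {}).items():
        # Assumption for this sketch: hints only override columns that exist.
        if name in resolved:
            resolved[name] = dtype
    return resolved


inferred = {"id": "Int64", "price": "Utf8"}
hinted = resolve_schema(inferred, schema={"price": "Float64"})
definitive = resolve_schema(inferred, schema={"id": "Utf8"}, infer_schema=False)
```

Under this model, `hinted` keeps the inferred `id` column and overrides only `price`, while `definitive` contains just the columns the caller supplied.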
Expression.str.normalize()
- All parameters now default to False and must be enabled individually. (#2647)
DataFrame.agg/GroupedDataFrame.agg
- Tuple syntax for aggregations was deprecated in v0.2.18 and is now no longer supported. Please use aggregation expressions instead. (#2663)
  - Ex: df.agg([(col("x"), "sum"), (col("y"), "mean")]) should be written instead as df.agg(col("x").sum(), col("y").mean())
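For codebases with many legacy tuple-style calls, a small shim can do the conversion mechanically. The sketch below is a hypothetical helper, not part of Daft. It assumes each legacy operation string names an expression method of the same name, as "sum" and "mean" do in the example above, and it deliberately rejects anything outside a short allow-list.

```python
# Hypothetical migration helper; not part of Daft's API.
# Assumes each legacy op string names an expression method of the same name,
# as "sum" and "mean" do in the migration example.
SUPPORTED_OPS = {"sum", "mean", "min", "max", "count"}


def tuples_to_exprs(aggs):
    """Turn legacy (expression, "op") tuples into aggregation expressions."""
    exprs = []
    for expr, op in aggs:
        if op not in SUPPORTED_OPS:
            raise ValueError(f"unsupported aggregation: {op!r}")
        # e.g. (col("x"), "sum") becomes col("x").sum()
        exprs.append(getattr(expr, op)())
    return exprs
```

With Daft installed, a legacy call could then be written as df.agg(*tuples_to_exprs([(col("x"), "sum"), (col("y"), "mean")])) until the call sites are rewritten by hand.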
DataFrame.count
- Calling .count() with no arguments will now return a DataFrame with column “count” which contains the length of the entire DataFrame, instead of the count for each of the columns (#1996)
DataFrame.with_column
- Resource requests should now be specified on UDF expressions (@udf(num_gpus=…)) instead of on Projections (through .with_column(..., resource_request=...)). (#2654)
DataFrame.join
- When joining two DataFrames, columns will now be merged only if they exactly match join keys. (#2631)
- Ex:
import daft
from daft import col

df1 = daft.from_pydict({
    "a": ["x", "y"],
    "b": [1, 2]
})
df2 = daft.from_pydict({
    "a": ["y", "z"],
    "b": [20, 30]
})
result_df = df1.join(
    df2,
    left_on=[col("a"), col("b")],
    right_on=[col("a"), col("b")/10],  # NOTE THE "/10"
    how="outer"
)
result_df.sort("a").collect()
# before
╭──────┬───────╮
│ a    ┆ b     │
│ ---  ┆ ---   │
│ Utf8 ┆ Int64 │
╞══════╪═══════╡
│ x    ┆ 1     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ y    ┆ 2     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ z    ┆ 30    │
╰──────┴───────╯
# after
╭──────┬───────┬─────────╮
│ a    ┆ b     ┆ right.b │
│ ---  ┆ ---   ┆ ---     │
│ Utf8 ┆ Int64 ┆ Int64   │
╞══════╪═══════╪═════════╡
│ x    ┆ 1     ┆ None    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ y    ┆ 2     ┆ 20      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ z    ┆ None  ┆ 30      │
╰──────┴───────┴─────────╯
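The column names in the "after" output can be reproduced with a simplified model of the new rule. This is a sketch under assumptions, not Daft's implementation: `joined_columns` is a hypothetical helper, join keys are given as strings, and a key pair is treated as an exact match only when both sides are the same bare column name. The "right." prefix is taken from the output above.

```python
# Simplified model of the v0.3 join naming rule; NOT Daft's implementation.
# `joined_columns` is a hypothetical helper and keys are plain strings.
def joined_columns(left_cols, right_cols, left_on, right_on):
    """Return the output column names of a join under this model."""
    # A right column is merged only if it is a join key that exactly matches
    # the left key, i.e. both sides reference the same bare column name.
    merged = {
        left_key
        for left_key, right_key in zip(left_on, right_on)
        if left_key == right_key
        and left_key in left_cols
        and right_key in right_cols
    }
    out = list(left_cols)
    for name in right_cols:
        if name in merged:
            continue  # shares the left column, no separate output column
        # Unmerged right columns that clash with a left name get a prefix.
        out.append(f"right.{name}" if name in left_cols else name)
    return out


# Mirrors the example: the second right key is an expression ("b/10"),
# so "b" is no longer merged and shows up as "right.b".
columns = joined_columns(["a", "b"], ["a", "b"], ["a", "b"], ["a", "b/10"])
```

In this model, joining on plain `a` and `b` on both sides would still yield just the two columns `a` and `b`.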
Changes
✨ New Features
- [FEAT] Ellipsize scan task sources if too many @Vince7778 (#2695)
- [FEAT] Allow user provided schema and schema inference length for read_sql @colin-ho (#2676)
- [FEAT] Add dataframe iteration on rows and change default buffer size @jaychia (#2685)
- [FEAT]: add to_arrow_iter @universalmind303 (#2681)
- [FEAT] Example Analyze for Local Execution Engine @samster25 (#2648)
- [FEAT] (ACTORS-1) Add DAFT_ENABLE_ACTOR_POOL_PROJECTS=1 feature flag and specifying concurrency @jaychia (#2668)
- [FEAT]: sql like & ilike @universalmind303 (#2666)
- [FEAT] Changes the default count() behavior to perform a global row count instead @jaychia (#2653)
- [FEAT] Support passing in column name strings to to_struct @Vince7778 (#2671)
- [FEAT]: refactor tree display to get more info into physicalplan @universalmind303 (#2640)
- [FEAT] Add to_struct function for merging columns into a struct @Vince7778 (#2662)
- [FEAT] Add hashing and groupby on structs @Vince7778 (#2657)
- [FEAT]: daft.sql_expr @universalmind303 (#2656)
- [FEAT] Deprecates usage of resource_request on df.with_column API @jaychia (#2654)
- [FEAT] Add input batching for UDFs @Vince7778 (#2651)
- [FEAT] Add cbrt expression @raunakab (#2646)
- [FEAT] use ObfuscatedString to hide creds when Display IOConfig @samster25 (#2645)
- [FEAT]: more sql functions @universalmind303 (#2596)
- [FEAT] Support __init__ arguments for StatefulUDFs @jaychia (#2634)
- [FEAT] Move resource requests to UDFs instead of on with_column @jaychia (#2632)
- [FEAT] Add wildcards in column expressions @Vince7778 (#2629)
- [FEAT] factor mermaid builder into its own module to use independently @samster25 (#2636)
- [FEAT] Remote parquet streaming @colin-ho (#2620)
- [FEAT]: mermaid formatter @universalmind303 (#2619)
- [FEAT] Add ActorPoolProject logical and physical plans @jaychia (#2601)
- [FEAT] Enable broadcast strategy on anti and semi joins @kevinzwang (#2621)
- [FEAT] Add .list.sort() for sorting lists within a list column @Vince7778 (#2589)
- [FEAT] Streaming Local Parquet Reads @colin-ho (#2592)
🚀 Performance Improvements
- [PERF] Add ability to automatically choose broadcast for anti/semi joins @kevinzwang (#2699)
- [PERF] Swordfish Dynamic Pipelines @samster25 (#2599)
- [PERF] Dyn Compare + Probe Table @samster25 (#2618)
👾 Bug Fixes
- [BUG] Fix Parquet reads with chunk sizing @desmondcheongzx (#2658)
- [BUG]: repr mermaid fix @universalmind303 (#2688)
- [BUG] Use Daft Pickle instead of Ray Pickle and use bincode for serializing @samster25 (#2693)
- [BUG] Add timeout to analytics client @raunakab (#2670)
- [BUG] Fix swordfish inner joins @colin-ho (#2678)
- [BUG] Fix struct .hash() naming bug @Vince7778 (#2673)
- [BUG] Fix filter pushdown into non-inner joins @kevinzwang (#2659)
- [BUG] Fix issues where we check "is_ray_runner" on non-initialized contexts @jaychia (#2652)
- [BUG] Fix nested parquet reads for .show() and .limit() @desmondcheongzx (#2643)
- [BUG] Fix join op names and join key definition @kevinzwang (#2631)
- [BUG] Fix projection pushdowns not working with limits @Vince7778 (#2635)
- [BUG] Fix Expr::with_new_children for ScalarFunction @kevinzwang (#2624)
- [BUG] Fix pushdown past monotonically increasing id @Vince7778 (#2622)
📖 Documentation
- [CHORE] Fix FOTW #1 images notebook @jaychia (#2697)
- [DOCS] Add join types, renaming behavior, and example to join docs @kevinzwang (#2691)
- [FEAT] Add dataframe iteration on rows and change default buffer size @jaychia (#2685)
- [DOCS]: add docs for cosine_distance @universalmind303 (#2675)
- [FEAT] Add to_struct function for merging columns into a struct @Vince7778 (#2662)
- [CHORE] Turn v0.3 deprecations into breaking changes @kevinzwang (#2663)
- [FEAT] Add cbrt expression @raunakab (#2646)
- [FEAT] Support __init__ arguments for StatefulUDFs @jaychia (#2634)
- [FEAT] Move resource requests to UDFs instead of on with_column @jaychia (#2632)
- [FEAT] Add wildcards in column expressions @Vince7778 (#2629)
- [DOCS] Enable doc tests in CI @colin-ho (#2615)
- [FEAT] Add .list.sort() for sorting lists within a list column @Vince7778 (#2589)
- docs: Add fotw tutorial on working with images @avriiil (#2490)
🧰 Maintenance
- [CHORE] fix merge conflict in repr tests @samster25 (#2700)
- [CHORE] Fix FOTW #1 images notebook @jaychia (#2697)
- [CHORE] Deprecate schema hints @colin-ho (#2655)
- [CHORE] Add error snafus for local executor @colin-ho (#2660)
- [FEAT]: refactor tree display to get more info into physicalplan @universalmind303 (#2640)
- [CHORE] Turn v0.3 deprecations into breaking changes @kevinzwang (#2663)
- [CHORE]: Drop use of deprecated form "default_features" @universalmind303 (#2665)
- [CHORE] bump dev version to 0.3.0 @samster25 (#2664)
- [CHORE]: fix feature flags @universalmind303 (#2661)
- [CHORE] Set Expression.str.normalize() options to False by default @Vince7778 (#2647)
- [CHORE] Improve swordfish error handling @colin-ho (#2628)
- [CHORE] Add ignore for helix editor @raunakab (#2642)
- [CHORE] Add toolchain check to Makefile @Vince7778 (#2641)
- [CHORE] Upgrade Rust toolchain to 2024-08-01 @Vince7778 (#2639)
- [CHORE] Track memory for swordfish tpch @colin-ho (#2633)
- [CHORE] Split resource-request and hashable-float-wrapper into utility crates @jaychia (#2630)
- [CHORE] Use parquet for native tpch benchmarks @colin-ho (#2609)
- [CHORE] Refactor UDFs to separate stateful and stateless @jaychia (#2597)