Releases: koralium/flowtide

Version 0.11.1

17 Nov 15:19
9caa47b

Minor Changes

Catalog support

This release adds support for AddCatalog on connectors, which groups connectors together under a catalog.
This is useful when, for instance, two sources have exactly the same table names. They can then be added under different catalogs so that each table name is unique.

See catalog documentation for more info on how to use it.
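
As a rough illustration of why catalogs remove the name clash, the sketch below keys tables by a (catalog, table) pair. It is not Flowtide's API: `ConnectorRegistry`, `add_table`, and `resolve` are hypothetical names invented for this example, and only `AddCatalog` itself comes from the release notes.

```python
class ConnectorRegistry:
    """Hypothetical registry that maps (catalog, table) to a source name."""

    def __init__(self):
        self._tables = {}

    def add_table(self, table, source, catalog=None):
        key = (catalog, table)
        if key in self._tables:
            # Without catalogs, two sources with the same table name collide here.
            raise ValueError(f"table {table!r} already exists in catalog {catalog!r}")
        self._tables[key] = source

    def resolve(self, name):
        # "crm.users" -> catalog "crm", table "users"; "users" -> no catalog.
        catalog, _, table = name.rpartition(".")
        return self._tables[(catalog or None, table)]


registry = ConnectorRegistry()
registry.add_table("users", "sql-server-a", catalog="crm")
registry.add_table("users", "sql-server-b", catalog="billing")
print(registry.resolve("crm.users"))      # sql-server-a
print(registry.resolve("billing.users"))  # sql-server-b
```

The same table name resolves to different sources because the catalog is part of the lookup key.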

Pull requests

Full Changelog: v0.11.0...v0.11.1

Version 0.11.0

14 Nov 12:41
a7aae9e

Major Changes

Column-Based Event Format

Most operators have moved from treating events as rows encoded with FlexBuffers to a column-based format that follows the Apache Arrow specification. This change has led to significant performance improvements, especially in the merge join and aggregation operators. The move from row-based to column-based events required a major rewrite of core components. Some operators still use the row-based format and will be converted in future releases.

Not all expressions have been converted to work with column data yet. To maintain backward compatibility, Flowtide currently converts between the two formats where needed, but frequent conversions may reduce performance.
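
To make the difference between the two layouts concrete, here is a minimal sketch of converting a batch between row and column form. It uses plain Python lists and dicts as stand-ins; the function names are hypothetical and the real implementation works on Arrow-style buffers rather than Python objects.

```python
def rows_to_columns(rows, column_names):
    """Turn a list of row dicts into one list per column."""
    columns = {name: [] for name in column_names}
    for row in rows:
        for name in column_names:
            columns[name].append(row.get(name))
    return columns


def columns_to_rows(columns):
    """Inverse conversion: rebuild row dicts from equally long column lists."""
    names = list(columns)
    length = len(columns[names[0]]) if names else 0
    return [{name: columns[name][i] for name in names} for i in range(length)]


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
columns = rows_to_columns(rows, ["id", "name"])
print(columns)                    # {'id': [1, 2], 'name': ['a', 'b']}
print(columns_to_rows(columns))   # the original rows again
```

Each conversion touches every value in the batch, which is why a plan that switches back and forth between row-based and column-based operators pays a cost on every hop.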

The shift to a column format also introduced the use of unmanaged memory for new data structures, for the following reasons:

  • 64-byte aligned memory addresses for optimal SIMD/AVX operations.
  • Immediate memory return when a page is removed from the LRU cache, instead of waiting for the next garbage collection cycle.

With unmanaged memory, it is now possible to track memory allocation by different operators, providing better insight into memory usage in your stream.
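
The two ideas above, aligned addresses and per-operator accounting, can be sketched as follows. This only illustrates the arithmetic and the bookkeeping; `align_up` and `AllocationTracker` are hypothetical names, and the sketch does not allocate any native memory.

```python
from collections import defaultdict

ALIGNMENT = 64  # bytes, the alignment named in the release notes


def align_up(address, alignment=ALIGNMENT):
    """Round an address up to the next multiple of a power-of-two alignment."""
    return (address + alignment - 1) & ~(alignment - 1)


class AllocationTracker:
    """Hypothetical bookkeeping of live bytes per operator."""

    def __init__(self):
        self.bytes_by_operator = defaultdict(int)

    def allocate(self, operator, size):
        self.bytes_by_operator[operator] += size

    def free(self, operator, size):
        # Freed bytes are subtracted immediately, mirroring the immediate
        # return of memory when a page leaves the cache.
        self.bytes_by_operator[operator] -= size


print(align_up(65))  # 128

tracker = AllocationTracker()
tracker.allocate("merge_join", 4096)
tracker.allocate("aggregate", 1024)
tracker.free("merge_join", 4096)
print(dict(tracker.bytes_by_operator))  # {'merge_join': 0, 'aggregate': 1024}
```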

B+ Tree Splitting by Byte Size

Previously, the B+ tree determined page sizes based on the number of elements, splitting pages into two equal parts when the max size (e.g., 128 elements) was reached. While this worked for streams with uniform element sizes, it led to size discrepancies in other cases, affecting serialization time and slowing down the stream.

This update introduces page splitting based on byte size, with a default page size of 32KB, ensuring more consistent and predictable page sizes.
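
The difference between the two split strategies can be shown with a small sketch. It is an assumption about how a byte-based split might be chosen, not the algorithm Flowtide uses: `needs_split` and `split_index_by_bytes` are hypothetical helpers, and only the 32KB default comes from the release notes.

```python
MAX_PAGE_BYTES = 32 * 1024  # default page size from the release notes


def needs_split(sizes, max_page_bytes=MAX_PAGE_BYTES):
    """A page with more than one element splits once its bytes exceed the limit."""
    return len(sizes) > 1 and sum(sizes) > max_page_bytes


def split_index_by_bytes(sizes):
    """Return i so that sizes[:i] and sizes[i:] are as close in bytes as possible.

    Both halves are kept non-empty; sizes must contain at least two elements.
    """
    total = sum(sizes)
    running = 0
    best_index, best_diff = 1, None
    for i in range(1, len(sizes)):
        running += sizes[i - 1]
        diff = abs(total - 2 * running)  # |right half bytes - left half bytes|
        if best_diff is None or diff < best_diff:
            best_index, best_diff = i, diff
    return best_index


sizes = [1000, 10, 10, 10, 10]
print(len(sizes) // 2)              # 2: a count-based split puts 1010 bytes on the left
print(split_index_by_bytes(sizes))  # 1: a byte-based split isolates the large element
```

With uniform element sizes both strategies pick the same index, which matches the observation that the old approach worked well for uniform streams.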

Initial SQL Type Validation

This release introduces initial type validation when creating a Substrait plan from SQL. Currently, only SQL Server provides specific type metadata; sources such as Kafka continue to mark columns as 'any' because their payload types vary.

The new validation feature raises exceptions for type mismatches, such as when a boolean column is compared to an integer (e.g., boolColumn = 1). This helps inform users transitioning from SQL Server that bit columns are treated as boolean in Flowtide.
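
The rule described above can be sketched as a small check. This is a simplified model under stated assumptions, not Flowtide's validator: the type names, `TypeMismatchError`, and `validate_equality` are hypothetical, and the sketch is stricter than a real system might be because it allows no implicit conversions at all.

```python
class TypeMismatchError(Exception):
    """Raised when two sides of a comparison have incompatible types."""


def validate_equality(left_type, right_type):
    # Columns typed 'any' (for example from Kafka) cannot be checked statically.
    if "any" in (left_type, right_type):
        return
    if left_type != right_type:
        raise TypeMismatchError(
            f"cannot compare {left_type} with {right_type}"
        )


validate_equality("boolean", "boolean")  # boolColumn = true: accepted
validate_equality("any", "int")          # untyped column: accepted

try:
    validate_equality("boolean", "int")  # boolColumn = 1: rejected
except TypeMismatchError as error:
    print(error)  # cannot compare boolean with int
```

In this sketch, comparing the column against a boolean literal instead of 1 avoids the mismatch.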

New UI

[Image: the new Flowtide UI]

A new UI has been developed, featuring an integrated time series database that enables developers to monitor stream behavior over time. This database’s API aligns with Prometheus standards, allowing for custom queries to investigate potential issues.

The UI retrieves all data through the Prometheus API endpoint, enabling future deployment as a standalone tool connected to a Prometheus server.

Minor Changes

Congestion Control Based on Cache Misses

Flowtide processes data in small batches, typically 1-100 events. While this approach works well with in-memory data, cache misses that require persistent storage access can create bottlenecks. This is particularly problematic with multiple chained joins, where sequential data fetching can delay processing.

To address this, the join operator now monitors cache misses during a batch and, when a threshold is reached, splits the processed events and forwards them to the next operator. This change allows operators to access persistent storage in parallel, easing congestion.
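
A minimal sketch of the idea, assuming a dict as the cache and a callable as the persistent-storage lookup. The function name, the threshold parameter, and the event shape are all hypothetical; the actual join operator is considerably more involved.

```python
def process_batch(events, lookup, cache, miss_threshold):
    """Yield output chunks, flushing early once enough cache misses accumulate."""
    chunk, misses = [], 0
    for key in events:
        if key not in cache:
            misses += 1
            cache[key] = lookup(key)  # stands in for a slow persistent-storage read
        chunk.append((key, cache[key]))
        if misses >= miss_threshold:
            # Forward what has been processed so far, so the next operator
            # can start its own storage reads while this one continues.
            yield chunk
            chunk, misses = [], 0
    if chunk:
        yield chunk


cache = {1: "a"}
chunks = list(process_batch([1, 2, 3, 4, 5], lambda k: f"v{k}", cache, miss_threshold=2))
print([len(c) for c in chunks])  # [3, 2]
```

Without the threshold the whole batch would be emitted as a single chunk, and a downstream join could not begin fetching until every miss in this operator had been resolved.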

Reduced Number of Pages Written to Persistent Storage

Previously, all B+ tree metadata was written to persistent storage at every checkpoint, including root page IDs. In streams with numerous operators, this led to unnecessary writes.

Now, metadata is only written if changes have occurred, reducing the number of writes and improving storage efficiency.
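
The change amounts to tracking whether metadata is dirty. The sketch below shows one way to do that, assuming a list as a stand-in for persistent storage; `TreeMetadata` and its methods are hypothetical and not Flowtide's types.

```python
class TreeMetadata:
    """Hypothetical B+ tree metadata that is persisted only when it changed."""

    def __init__(self, root_page_id):
        self._root_page_id = root_page_id
        self._dirty = True  # never persisted yet

    def set_root_page_id(self, page_id):
        if page_id != self._root_page_id:
            self._root_page_id = page_id
            self._dirty = True

    def checkpoint(self, storage):
        """Write metadata if needed; return True when a write happened."""
        if not self._dirty:
            return False
        storage.append(("metadata", self._root_page_id))
        self._dirty = False
        return True


storage = []
metadata = TreeMetadata(root_page_id=1)
print(metadata.checkpoint(storage))  # True: first checkpoint writes
print(metadata.checkpoint(storage))  # False: nothing changed, no write
metadata.set_root_page_id(2)
print(metadata.checkpoint(storage))  # True: root changed, written again
print(len(storage))                  # 2
```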

Pull requests

  • Update to the latest substrait version by @Ulimo in #471
  • Add stream benchmarks to be able to test performance between versions by @Ulimo in #472
  • Allow setting a custom task scheduler by @Ulimo in #473
  • Add support for permify as a connector by @Ulimo in #466
  • Bug fix: Reusing of dataflowblockoptions caused no possible restarts by @Ulimo in #477
  • Bug fixes: Fix aggregation emit list in optimizer and not in by @Ulimo in #476
  • Add append tree by @Ulimo in #479
  • Fix unit tests by @Ulimo in #480
  • Fix some units tests for spicedb and sql server by @Ulimo in #481
  • Setup tests by @Ulimo in #482
  • Add support for exchange relation and substream relations by @Ulimo in #474
  • Improve comparison performance by @Ulimo in #485
  • Add support for column store by @Ulimo in #490
  • Add support to read number column from sharepoint by @Ulimo in #492
  • Add expression compiler for column data by @Ulimo in #491
  • Bug fix: Fix bug in binary list when doing updates by @Ulimo in #493
  • Fix edge case bug in merge join operator by @Ulimo in #494
  • Fix bug in column normalization operator where an empty batch caused exception by @Ulimo in #496
  • Fix bug in bitmap list where an insert at an integer border caused error by @Ulimo in #497
  • Fix bug in token cache where token was not refreshed correctly by @Ulimo in #498
  • Add column aggregate operator by @Ulimo in #499
  • Fix bug in addvaluetoelements by @Ulimo in #500
  • Change operators from batch manager to global by @Ulimo in #501
  • Fix memory not being disposed correctly by @Ulimo in #502
  • Add monitoring for memory allocations by @Ulimo in #503
  • Update docusaurus to 3.5 by @Ulimo in #504
  • Update path to regexp to fix security issue by @Ulimo in #505
  • Add boolean and string column based functions by @Ulimo in #511
  • Change so column agg operator sends data in smaller batches by @Ulimo in #512
  • Add metrics to aggregate and projection by @Ulimo in #513
  • Change sql server to use column data instead of row data when reading from sql server by @Ulimo in #507
  • Update graph sdk and add trace logging by @Ulimo in #514
  • Union and normalization fixes by @Ulimo in #517
  • Bug fix: fix after conversion to union column that value get inserted correctly by @Ulimo in #520
  • Fix so sql server source uses column and fix so convert to union disposes old column directly by @Ulimo in #521
  • Add a built in time series database for monitoring by @Ulimo in #519
  • Add RemoveRange to column data by @Ulimo in #522
  • Add methods to calculate the actual byte size used by columns and batches by @Ulimo in #518
  • Improve performance for map and union columns by @Ulimo in #523
  • Change so watermark in multiple targets gets aggregated by @Ulimo in #524
  • Change to use realloc instead of malloc when resizing memory by @Ulimo in #525
  • Fix so allocation metrics correctly show realloc stats by @Ulimo in #526
  • Bump braces from 3.0.2 to 3.0.3 in /src/FlowtideDotNet.AspNetCore/ClientApp by @dependabot in #489
  • Bump micromatch from 4.0.5 to 4.0.8 in /src/FlowtideDotNet.AspNetCore/ClientApp by @dependabot in #506
  • Diverse fixes to UI and also some minor bug fixes by @Ulimo in #527
  • Change so merge join uses microbatches by @Ulimo in #528
  • Add support to do splits in B+ tree based on byte size by @Ulimo in #529
  • Fix in metrics so readonlyspan of tags are used directly to find the correct metric series by @Ulimo in #530
  • Congestion control based on cache misses by @Ulimo in #531
  • Bug fixes for metrics and fix dispose in sql and aggregate by @Ulimo in #532
  • Change internal node to use native memory for children by @Ulimo in #533
  • Change column aggregates to use primitivelist for values by @Ulimo in #534
  • Improve performance of search boundries for int64 column by @Ulimo in #535
  • Change so normalize key storage uses value container by @Ulimo in #536
  • Fix so B+ tree search only uses async if page is not in th...

Version 0.11.0 alpha 22

11 Nov 09:33
14ab325
Pre-release

What's Changed

  • Fix bug where a temporary tree fails fetching metadata after a crash by @Ulimo in #574

Full Changelog: v0.11.0-alpha21...v0.11.0-alpha22

Version 0.11.0 alpha 21

10 Nov 23:13
f11019c
Pre-release

What's Changed

  • Add between and coalesce column functions by @Ulimo in #565
  • Fix issue with watermark in multi input vertex by @Ulimo in #566
  • Fix reference remapping when adding a plan as a view by @Ulimo in #568
  • Add stop stream command to more gracefully stop a stream by @Ulimo in #569
  • Add documentation for general metrics available in the stream by @Ulimo in #571
  • Change fasterKV to only take index checkpoints by @Ulimo in #572

Full Changelog: v0.11.0-alpha20...v0.11.0-alpha21

Version 0.11.0 alpha 20

04 Nov 08:34
8ee161b
Pre-release

What's Changed

  • Add min element size after split for B+ tree by @Ulimo in #561
  • Remove metadata write when fetching a state client by @Ulimo in #557
  • Implement test cases for storage rules, also fix critical bug where a crash could cause metadata to overwrite data page by @Ulimo in #562
  • Add checks in tests that same data is not written multiple times by @Ulimo in #563
  • Bump http-proxy-middleware from 2.0.6 to 2.0.7 in /docs by @dependabot in #553
  • Bump Microsoft.Extensions.ObjectPool from 8.0.2 to 8.0.10 by @dependabot in #544

Full Changelog: v0.11.0-alpha19...v0.11.0-alpha20

Version 0.11.0 alpha 19

31 Oct 09:25
d4eddb0
Pre-release

What's Changed

  • Add docs for persistent storage rules and describe project structure by @Ulimo in #556
  • Add missing throw in sharepoint read by @Ulimo in #558
  • Add write to json for columns to easily allow json export by @Ulimo in #559
  • Change console sink to work similar to other connectors by @Ulimo in #560

Full Changelog: v0.11.0-alpha18...v0.11.0-alpha19

Version 0.11.0 alpha 18

29 Oct 12:03
e5c0bcf
Pre-release

What's Changed

  • Fix so histogram actually takes the sum by @Ulimo in #546
  • Improve performance of remove range in bitmaplist by @Ulimo in #548
  • Add support for insert range in binary list by @Ulimo in #549
  • Add InsertRange to BitmapList by @Ulimo in #550
  • Add so benchmark is run on PRs by @Ulimo in #547
  • Add insert range in primitivelist and nativelonglist by @Ulimo in #551
  • Add CountTrue/FalseInRange for bitmap list by @Ulimo in #552
  • Add Insert range to columns and start using it in the stream by @Ulimo in #554
  • Do not ignore task cancelled exceptions from sharepoint sdk by @Ulimo in #555

Full Changelog: v0.11.0-alpha17...v0.11.0-alpha18

Version 0.11.0 alpha 17

13 Oct 21:24
19a0e88
Pre-release

What's Changed

  • Add support to do splits in B+ tree based on byte size by @Ulimo in #529
  • Fix in metrics so readonlyspan of tags are used directly to find the correct metric series by @Ulimo in #530
  • Congestion control based on cache misses by @Ulimo in #531
  • Bug fixes for metrics and fix dispose in sql and aggregate by @Ulimo in #532
  • Change internal node to use native memory for children by @Ulimo in #533
  • Change column aggregates to use primitivelist for values by @Ulimo in #534
  • Improve performance of search boundries for int64 column by @Ulimo in #535
  • Change so normalize key storage uses value container by @Ulimo in #536
  • Fix so B+ tree search only uses async if page is not in the cache by @Ulimo in #537
  • Remade counter to use observable instead. by @Ulimo in #538
  • Sql type validation by @Ulimo in #508
  • Change iteration operator to use column store by @Ulimo in #540
  • Change merge join from deprecated left right keys to keys by @Ulimo in #541
  • Fix all current build warnings by @Ulimo in #539
  • Add missing file header to openfga tests by @Ulimo in #542
  • Fix optimizer for direct field simplification by @Ulimo in #543

Full Changelog: v0.11.0-alpha16...v0.11.0-alpha17

Version 0.11.0 alpha 16

30 Sep 21:51
5840255
Pre-release

What's Changed

  • Change so merge join uses microbatches by @Ulimo in #528

Full Changelog: v0.11.0-alpha15...v0.11.0-alpha16

Version 0.11.0 alpha 15

30 Sep 12:17
af8c6fa
Pre-release

What's Changed

  • Add a built in time series database for monitoring by @Ulimo in #519
  • Add RemoveRange to column data by @Ulimo in #522
  • Add methods to calculate the actual byte size used by columns and batches by @Ulimo in #518
  • Improve performance for map and union columns by @Ulimo in #523
  • Change so watermark in multiple targets gets aggregated by @Ulimo in #524
  • Change to use realloc instead of malloc when resizing memory by @Ulimo in #525
  • Fix so allocation metrics correctly show realloc stats by @Ulimo in #526
  • Bump braces from 3.0.2 to 3.0.3 in /src/FlowtideDotNet.AspNetCore/ClientApp by @dependabot in #489
  • Bump micromatch from 4.0.5 to 4.0.8 in /src/FlowtideDotNet.AspNetCore/ClientApp by @dependabot in #506
  • Diverse fixes to UI and also some minor bug fixes by @Ulimo in #527

Full Changelog: v0.11.0-alpha14...v0.11.0-alpha15