Deprecating SINGLE storage for channels
SINGLE storage uses a unique object in S3 for every item inserted into the hub. The S3 writes are done asynchronously after the write to Spoke. At 15 minute intervals, we verify that the items in Spoke match the items in S3, and any missing items are written.
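The 15 minute verification amounts to a set difference between the keys held in Spoke and the keys already in S3. The following is a minimal sketch of that idea, not the hub's actual code: `find_missing_keys`, `reconcile`, and the dict-based stores are hypothetical stand-ins for Spoke and S3.

```python
def find_missing_keys(spoke_keys, s3_keys):
    """Return the keys present in Spoke but absent from S3, sorted."""
    return sorted(set(spoke_keys) - set(s3_keys))


def reconcile(spoke_items, s3_store):
    """Copy every item that S3 is missing from Spoke into S3.

    Both arguments are plain dicts (key -> payload) standing in for the
    real stores. Returns the list of keys that had to be written.
    """
    missing = find_missing_keys(spoke_items, s3_store)
    for key in missing:
        s3_store[key] = spoke_items[key]
    return missing


# Hypothetical item keys: "b" never made it to S3 asynchronously.
spoke = {"a": b"1", "b": b"2", "c": b"3"}
s3 = {"a": b"1", "c": b"3"}
print(reconcile(spoke, s3))
```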
BATCH storage uses a per-minute Webhook to write an index object and a compressed batch of all the items in that minute.
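As a rough illustration of the two objects written per minute, the sketch below builds an index and a single compressed batch from one minute of items. The on-disk format used by the hub is not described here, so the JSON layout, the use of gzip, and the names `build_batch` and `read_batch` are all assumptions made for the example.

```python
import gzip
import json


def build_batch(minute, items):
    """Build the two objects for one minute of items.

    `items` maps item keys to string payloads. Returns (index, batch_bytes):
    the index lists the keys in the minute, and the batch holds every
    payload compressed together. The format is hypothetical.
    """
    index = {"minute": minute, "keys": sorted(items)}
    batch_bytes = gzip.compress(json.dumps(items, sort_keys=True).encode("utf-8"))
    return index, batch_bytes


def read_batch(batch_bytes):
    """Decompress a batch back into its key -> payload mapping."""
    return json.loads(gzip.decompress(batch_bytes).decode("utf-8"))


index, blob = build_batch("2016/01/01/00/00", {"k1": "first", "k2": "second"})
print(index["keys"], len(blob))
```

However many items arrive in the minute, this produces exactly two objects to write, which is where the write savings come from.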
The storage property does not affect Spoke. Because we rely on Spoke for our write guarantee, writes to Spoke cannot wait for a batch of items to accumulate.
Since we pay for each write to S3 (writes cost 12x more than reads), batching is cheaper whenever a channel receives more than 2 items per minute. Since we compress all of the items for a minute together, we also get higher compression rates with batching.
items per minute | single writes | batch writes | batch savings (single writes / batch writes) |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 1 | 2 | 0.5 |
10 | 10 | 2 | 5 |
100 | 100 | 2 | 50 |
1000 | 1000 | 2 | 500 |
10000 | 10000 | 2 | 5000 |
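The table follows from a simple rule: SINGLE costs one write per item, while BATCH costs two writes (index plus batch) for any non-empty minute. A small sketch that reproduces the rows, with the savings column taken as single writes divided by batch writes; the function names are illustrative only.

```python
def writes_per_minute(items):
    """Return (single_writes, batch_writes) for one minute of traffic."""
    single = items
    # A non-empty minute costs two writes: the index object and the batch.
    batch = 2 if items > 0 else 0
    return single, batch


def batch_savings(items):
    """Ratio of SINGLE writes to BATCH writes; 0 when nothing is written."""
    single, batch = writes_per_minute(items)
    return single / batch if batch else 0.0


for n in (0, 1, 2, 10, 100, 1000, 10000):
    single, batch = writes_per_minute(n)
    print(n, single, batch, batch_savings(n))
```

At exactly 2 items per minute the ratio is 1, which is the break-even point; below that SINGLE uses fewer writes, above it BATCH does.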
Reads and queries of items are a slightly different story. All hub items are written to S3; however, many items may never be read from S3. The vast majority of queries and reads are served exclusively by Spoke (most hubs have a 6 hour Spoke TTL).
We recently implemented caching of BATCH reads from S3, so they now share the same scaling factor as the writes above. Batch queries have different performance characteristics depending on data rate, and are more efficient at higher throughput.
To reduce costs, simplify the hub API, and generally improve performance, we plan to switch all channels to BATCH storage, and remove storage as a channel option. This will help support extracting large volumes of data from the Hub for analysis and replication.
- Improve S3 Batching performance
- Make BATCH the default option
- Change all SINGLE storage channels to BOTH
- Change code to assume channels are BATCH
- Historical channels (with a mutableTime) will be BOTH, though historical writes will go directly to S3 SINGLE
These changes will allow us to test the impact of switching to BATCH, and revert if needed.
- Remove the storage option
- Change code to only attempt to read SINGLE as a last resort
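Reading SINGLE only as a last case can be pictured as an ordered chain of lookups that stops at the first hit. This is a sketch under assumptions: the order Spoke, then S3 BATCH, then S3 SINGLE is inferred from the description above, and `read_item` and the dict-backed readers are hypothetical rather than part of the hub API.

```python
from typing import Callable, Optional, Sequence

# A reader takes an item key and returns the payload, or None on a miss.
Reader = Callable[[str], Optional[bytes]]


def read_item(key: str, readers: Sequence[Reader]) -> Optional[bytes]:
    """Try each reader in order and return the first payload found."""
    for reader in readers:
        payload = reader(key)
        if payload is not None:
            return payload
    return None


# Dicts stand in for the real stores; dict.get returns None on a miss.
spoke = {"recent": b"from spoke"}
s3_batch = {"older": b"from batch"}
s3_single = {"legacy": b"from single"}
chain = [spoke.get, s3_batch.get, s3_single.get]

print(read_item("legacy", chain))
```

Because SINGLE sits at the end of the chain, data that was never rewritten to BATCH stays readable, while items found in Spoke or BATCH never trigger a SINGLE lookup.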
During both phases, we can rewrite S3 SINGLE data to S3 BATCH data. This will reduce storage costs, and increase performance.
This is not required, as SINGLE data will not be orphaned.