Hub Clustering
Since the introduction of Spoke 2+ years ago, each Spoke node in a Hub cluster does the same work: every write and every query goes to every available Spoke node. With this design, three nodes is the ideal size, since three nodes keep data available during a rolling restart. A fourth node could be added, but doing so would not increase the capacity of Spoke, and it would increase the total amount of CPU used by the system, since all writes and queries still go everywhere.
There are a number of reasons we would like to scale beyond 3 nodes:
- **Increase system capacity for an overloaded cluster:** Currently, if a system is at or near capacity, adding a new node does not increase the total capacity of Spoke, and it creates more overall work for the system. If we want to resize all 3 nodes, we have to remove each node in turn, temporarily reducing capacity.
- **Additional options for system capacity:** As we use larger instances in a cluster, AWS instance types tend to increase by a factor of two, including cost. Being able to run clusters with 3-6 nodes allows more flexibility and lets us better optimize costs.
- **Rebuilding/Replacing nodes:** Being able to add and remove nodes will make the inevitable node replacement and maintenance simpler.
- **Rolling restart resilience:** With a 3 node cluster, only 2 nodes are available during a rolling restart, which means that some data is only available from one active node during the process. We would prefer not to have this single point of failure.
We previously attempted to address these issues using hash rings. While the hash rings worked for some aspects, the approach had significant flaws for our usage in prod.
- **Unpredictable Spoke capacity:** The 10 highest byte volume channels account for 60% of the total bytes flowing through the system. When randomly assigning these high volume channels to 3 of 6 nodes, the node with the highest volume used twice as much disk as the lowest.
- **Complexity:** In trying to improve rolling restart resilience, we created a system where an engineer might need to manually edit ZooKeeper values due to unexpected behavior of the system. This did happen, and it was terrifying.
This solution was rolled back.
During version 1, we noticed that making Spoke calls to more servers had much less CPU impact than anticipated. This suggested a different tack: we could write all data to 3 random Spoke nodes, and query all servers. This solution will not scale infinitely, but it might scale just fine for 6 servers.
It has the advantage of being much simpler to implement, and it distributes the load evenly across all available servers. In addition, a 3 node cluster works the same as before.
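The selection logic can be sketched roughly as follows. This is a minimal illustration, not the actual Hub implementation: the node names, the `SpokeNodeSelector` class, and the `WRITE_FACTOR` constant are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of "write to 3 random nodes, query all nodes".
public class SpokeNodeSelector {

    static final int WRITE_FACTOR = 3;

    // Pick WRITE_FACTOR random nodes from the currently available set for a write.
    // With only 3 nodes available this degenerates to "write everywhere",
    // matching the existing behavior of a 3 node cluster.
    static List<String> writeNodes(List<String> available) {
        List<String> shuffled = new ArrayList<>(available);
        Collections.shuffle(shuffled);
        return shuffled.subList(0, Math.min(WRITE_FACTOR, shuffled.size()));
    }

    // Queries still fan out to every available node, so a reader never needs
    // to know which 3 nodes hold any particular item.
    static List<String> queryNodes(List<String> available) {
        return available;
    }

    public static void main(String[] args) {
        List<String> cluster = List.of("spoke-1", "spoke-2", "spoke-3",
                                       "spoke-4", "spoke-5", "spoke-6");
        System.out.println("write to:   " + writeNodes(cluster));
        System.out.println("query from: " + queryNodes(cluster));
    }
}
```

Because node selection is random per write rather than keyed by channel, high volume channels spread across the whole cluster over time instead of pinning to a fixed subset of nodes, which is what caused the disk imbalance in the hash ring approach.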
In our testing with up to six nodes, each new node adds 90+% of its CPU capacity and 100% of its Spoke capacity.
It addresses all of our original concerns, as well as the issues from version 1 of scaling:
- We can add capacity directly to an overloaded system.
- We can replace an entire 3 node cluster by adding 3 new nodes and letting the old servers continue to serve requests until the Spoke cache expires.
- Clusters with 4+ nodes will have a write factor of 3 during rolling restarts, since at least 3 nodes remain available while any one node is down.