Heartbeat Multiplexing

In Raft, leader is responsible for periodically sending heartbeats to all followers to maintain the leadership. In Multi-Raft, without any optimization load from the the heartbeats can be non-negligible. However, we can optimize to reduce the heartbeat RPCs by multiplexing the heartbeats from different shards.

Problem of naive implementation

In the following diagram, we have two nodes Node1 and Node2 and there are two shards. A Process is a leader if it is denoted with an asterisk (*): P1 is sending the heartbeats to the P3 and P2 to P4 as well. So there are two RPC messages sent from P1 to P3 and P2 to P4 independently.

graph LR
  subgraph Node1
    P1(P1*)
    P2(P2*)
  end
  subgraph Node2
    P3(P3)
    P4(P4)
  end
  P1 -->|heartbeat| P3
  P2 --> P4

Multiplexing heartbeats

To reduce the number of RPCs to only one, we can introduce multiplexer and demultiplexer in the nodes. The two heartbeats are buffered in the multiplexer and sent in a batched RPC to the destination node. In the destination node, the demultiplexer will split the batched RPC into individual heartbeats and send them to the corresponding processes.

graph LR
  subgraph Node1
    P1(P1*)
    P2(P2*)
    MUX(Mux)
  end
  subgraph Node2
    DEMUX(Demux)
    P3(P3)
    P4(P4)
  end
  P1 --> MUX
  P2 --> MUX
  MUX --> DEMUX
  DEMUX --> P3
  DEMUX --> P4

In this case the reduction rate is 2. But what would be the reduction rate in general. Let's do some math.

Math: Reduction rate

Let's consider there are N nodes and L shards. Each shard have K replication and leader processes are balanced among the nodes.

In this case, the total number of heartbeats sent in period is LK. And the total directed paths connecting two nodes is N(N-1) and these heartbeats are evenly attributed to these paths. Therefore, the number of heartbeats sent in each path is LK/(N(N-1)) and this is the reduction rate. For example, if there are 5 nodes and 1000 shards with 3 replication, the reduction rate is 150.