Add new message type MsgGroupBroadcast and corresponding handler #485
base: master
Conversation
(force-pushed from e8c7473 to 75cb96f)
@Connor1996 PTAL
(force-pushed from 2084791 to ef1ec68)
(force-pushed from d4a8512 to 806b77a)
When will MsgGroupBroadcast be set?
(force-pushed from 806b77a to 1fcdb47)
src/raft.rs
Outdated
    m_append.commit_term = m.get_commit_term();
    self.r.send(m_append, &mut self.msgs);
}
Err(_) => {
Consider the temporarily-unavailable case.
Is the processing like that in fn send_append_aggressively, looping to fetch until it succeeds?
Nope, it is called again in on_entries_fetched.
Async fetch for forwarding is committed. PTAL.
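For illustration, a minimal sketch of the retry strategy described in this thread, using stand-in types rather than the real raft-rs ones (the actual code parks the forward and re-drives it from on_entries_fetched once the async fetch completes):

```rust
// Stand-in outcome type; in raft-rs the comparable case is the storage
// reporting that the log is temporarily unavailable.
enum FetchOutcome<E> {
    Ready(Vec<E>),
    TemporarilyUnavailable,
}

// Do not loop waiting for entries. If they are not ready, give up for now;
// the application calls back later (on_entries_fetched), and the broadcast
// handler runs again for this message.
fn try_forward<E>(fetch: impl FnOnce() -> FetchOutcome<E>, send: impl FnOnce(Vec<E>)) {
    match fetch() {
        FetchOutcome::Ready(entries) => send(entries),
        FetchOutcome::TemporarilyUnavailable => {
            // Parked: retried from on_entries_fetched.
        }
    }
}
```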
src/raft.rs
Outdated
/// This enables data replication from a follower to other servers in the same availability zone.
///
/// Enable this to reduce cross-AZ traffic in cloud deployments.
pub follower_repl: bool,
I don't see it used anywhere.
It is used in raftstore.
src/raft.rs
Outdated
}

// For a broadcast, append entries to own log and forward MsgAppend to the other destinations.
fn handle_group_broadcast(&mut self, m: &Message) {
When is the broadcast message sent?
In raftstore, several MsgAppend messages will be merged into a MsgGroupBroadcast in fn build_raft_messages.
(force-pushed from 4957ce5 to eed70bc)
(force-pushed from eed70bc to a7567c3)
Test case is missing.
(force-pushed from a7567c3 to 2fa5d20)
@@ -70,6 +70,7 @@ impl GetEntriesContext {
    pub fn can_async(&self) -> bool {
        match self.0 {
            GetEntriesFor::SendAppend { .. } => true,
            GetEntriesFor::SendForward { .. } => true,
I think it can be false to avoid the async fetch's potential latency in follower replication.
(force-pushed from 9406810 to 0cc4fa9)
src/raft.rs
Outdated
}

// Group messages
if let Some(group) = msg_group.get_mut(&group_id) {
msg_group.entry().or_default().and_modify()
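For reference, a minimal sketch of the entry-API pattern suggested here, assuming msg_group is a HashMap keyed by broadcast group id (the generic M stands in for the eraftpb Message type):

```rust
use std::collections::HashMap;

// Insert-or-update in one call instead of branching on get_mut.
fn push_to_group<M>(msg_group: &mut HashMap<u64, Vec<M>>, group_id: u64, msg: M) {
    msg_group.entry(group_id).or_default().push(msg);
}
```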
src/raft.rs
Outdated
self.prs
    .iter_mut()
    .filter(|&(id, _)| *id != self_id)
    .for_each(|(id, pr)| core.send_append(*id, pr, &mut msgs));
We can form the msg group from here directly.
src/raft.rs
Outdated

let mut idx: usize = 0;
for msg in msgs {
    if !skip[idx] {
How about grouping the messages instead of the positions, and having merge_append_group modify or filter the messages directly instead of passing skip?
In the new commit, I have modified it. PTAL.
@@ -517,6 +554,31 @@ impl<T: Storage> Raft<T> {
        self.prs().group_commit()
    }

    /// Checks whether the raft group is using group commit and consistent
Why move it around?
That was an accidental change. I will fix it.
src/raft.rs
Outdated
/// Sends RPC, with entries to all peers that are not up-to-date
/// according to the progress recorded in r.prs().
pub fn bcast_append(&mut self) {
    let self_id = self.id;
    let leader_group_id = self
Can this be queried only when follower replication is enabled?
There are two conditions under which we should use leader replication: one is that follower replication is disabled; the other is that follower replication is enabled but the leader's group id is unknown. So I want to reuse the code that iterates the progresses and sends appends.
let mut use_leader_replication = self.is_leader();
let mut leader_group_id = 0;
if !use_leader_replication {
    leader_group_id = query;
    use_leader_replication = leader_group_id == 0;
}
if use_leader_replication {
    return;
}
src/raft.rs
Outdated
// Messages that need to be forwarded are stored in a hashmap temporarily,
// and they are grouped by the broadcast_group_id of the progress.
// Messages in msg_group will be pushed to the message queue later.
let mut msg_group: HashMap<u64, Vec<Message>> = HashMap::default();
Better to cache msg_group to avoid allocation.
src/raft.rs
Outdated

// Merge messages in the same broadcast group and send them.
for (group_id, mut group) in msg_group.drain() {
    // Double check: do not need to forward messages in leader's broadcast group.
It's meaningless.
I will remove it.
src/raft.rs
Outdated
}
// Attach forward information to MsgGroupBroadcast and send it.
group[mark].set_forwards(forwards.into());
for msg in group {
You mean self.msgs.push(group[mark])?
src/raft.rs
Outdated
// Find an appropriate agent.
// If an appropriate one is found, mark the corresponding message's type as
// MsgGroupBroadcast and return true. If not found, return false.
fn select_agent_for_bcast_group(&self, msgs: &mut [Message]) -> bool {
Why not return the index of the message? Changing the type in this function is quite confusing.
I think it is a better choice.
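A minimal sketch of the index-returning variant discussed above; is_suitable_agent is a hypothetical predicate standing in for the voter and recently-active checks below:

```rust
// Return the position of the first message whose target peer qualifies as an
// agent; the caller then rewrites only that message's type to MsgGroupBroadcast.
fn select_agent_index<M>(msgs: &[M], is_suitable_agent: impl Fn(&M) -> bool) -> Option<usize> {
    msgs.iter().position(|m| is_suitable_agent(m))
}
```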
src/raft.rs
Outdated
let peer_id = msg.to;
let is_voter = self.prs().conf().voters().contains(peer_id);
// Agent must be a voter and recently active.
if !is_voter || !self.is_recent_active(peer_id) {
This information can be obtained when the message is generated.
The voter and recent_active information is only used for agent selection. When the message is generated, we only determine whether it should be replicated by a follower. If we want to obtain this information at message generation time, another cache is needed to record it.
No, what you need to do is merge bcast_append, merge_msg_group, and select_agent_for_bcast_group.
merge_msg_group and select_agent_for_bcast_group have been merged into bcast_append. PTAL.
src/raft.rs
Outdated
// No need to merge if the group size is less than two or there is no appropriate agent.
if !need_merge {
    msgs.append(&mut group.into_iter().map(|(msg, _)| msg).collect());
    return;
continue
Thanks very much. I think this caused the jitter in the number of sent messages.
you should improve the test to cover this
let mut agent_msg = group.swap_remove(agent_msg_idx.unwrap()).0;
agent_msg.set_msg_type(MessageType::MsgGroupBroadcast);
agent_msg.set_forwards(forwards.into());
msgs.push(agent_msg);
No need to send the other msgs for the agent?
Other messages have been pushed to self.msgs when generated.
Oh, I thought there would be multiple MsgAppend for one peer.
    index: msg.index,
    ..Default::default()
};
forwards.push(forward);
What if too much forward info makes one message too large?
Considering that there are often 3 or 5 replicas deployed across 2 or 3 AZs, I think one message won't carry too much forward information.
@@ -888,12 +1002,106 @@ impl<T: Storage> Raft<T> {
    /// according to the progress recorded in r.prs().
    pub fn bcast_append(&mut self) {
Follower replication only covers bcast_append, but misses the places that call send_append and send_append_aggressively directly.
When a cluster replicates data via bcast_append, the cluster is in a stable state with high probability. If send_append is called, it is usually due to a rejected MsgAppendResp, which means the follower's raft log conflicts with the leader's. In this situation, we'd better use leader replication to fix the follower's raft log as quickly as possible.
In a previous test, we found that follower replication can reduce a lot of cross-AZ traffic. So I think we should sacrifice some coverage of follower replication to achieve less performance regression. Even though we only use follower replication on the main path of log replication, it can still reduce considerable cross-AZ traffic.
Maybe we can expand follower replication in the future; for example, we could fetch snapshots from followers.
src/raft.rs
Outdated
@@ -260,6 +266,9 @@ pub struct RaftCore<T: Storage> {

    /// Max size per committed entries in a `Ready`.
    pub(crate) max_committed_size_per_ready: u64,

    // Message group cache for follower replication.
    msg_group: HashMap<u64, Vec<(Message, bool)>>,
Vec is more efficient than HashMap.
Since the number of groups is usually no more than 3, finding an element in a Vec is more efficient than in a HashMap. Am I right?
Not just query, but also update and memory footprint.
Done. Use Vec<(u64, Vec<(Message, bool)>)> instead of HashMap<u64, Vec<(Message, bool)>>.

/// The current configuration state of the cluster.
#[get = "pub"]
- conf: Configuration,
+ pub(crate) conf: Configuration,
Doesn't the above line state it's publicly accessible already?
Using conf(), there will be a conflict between the mutable borrow of self.prs.progress and the immutable borrow of self.prs.conf.
src/raft.rs
Outdated
});

// Merge messages in the same broadcast group and send them.
for (_, mut group) in core.msg_group.drain() {
Can merge be done in L1056?
I think it might be impossible, because the broadcast group is only fixed after all progresses have been iterated; the merge should start after the broadcast group is fixed. If we group the progresses by broadcast_group_id first and then iterate them, it is feasible.
let mut tmp_msgs = Vec::default();
// Push messages into tmp_msgs first.
core.send_append(*id, pr, &mut tmp_msgs);
for msg in tmp_msgs {
How about

if pr.broadcast_group_id == leader_group_id || !pr.is_replicating() {
    msgs.extend(tmp_msgs);
    return;
}
let is_voter = is_voter(pr);
for msg in tmp_msgs {
    if msg.get_msg_type() != MsgAppend {
    } else {
    }
}
Modify raft-rs to support follower replication in TiKV.

Main changes:
- The follower_repl switch in RawNode and Config.
- New message type MsgGroupBroadcast and an extra field forwards in Message, which contains the information that needs to be forwarded.
- New handler for MsgGroupBroadcast, which appends entries to the local log and forwards MsgAppend to other peers.
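To make the shape of the change concrete, here is a rough, hedged sketch of the forwarding step described above. The types are simplified stand-ins rather than the actual eraftpb definitions, and fetching the entries for each forward (the async-fetch path discussed in the review) is elided:

```rust
// Simplified stand-ins; `forwards` is the extra field carried by MsgGroupBroadcast.
struct Forward { to: u64, log_term: u64, index: u64 }
struct Append { to: u64, from: u64, log_term: u64, index: u64 /*, entries, commit, ... */ }

// Sketch: the agent first appends the broadcast's entries to its own log (as for
// a normal MsgAppend), then builds one MsgAppend per Forward and sends it on.
// The entries for each destination are read from the agent's local storage
// (possibly asynchronously) and are omitted here.
fn forward_to_group(leader: u64, forwards: &[Forward], mut send: impl FnMut(Append)) {
    for f in forwards {
        send(Append {
            to: f.to,
            from: leader,         // sender as seen by the destination (assumed)
            log_term: f.log_term, // previous-entry term for this destination
            index: f.index,       // previous-entry index for this destination
        });
    }
}
```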