[dvc][common][server][test] leader complete status part 2: DVC and standby completes based on the consumed leader complete status (#775)

Part 2, on top of PR 741.

* [Deadlock issue from part 1]: In part 1, leader replicas sent a heartbeat with the leader-completed header to VT when they reported COMPLETED. The drainer thread sent it using the same producer that had produced the message that caused the leader to be marked COMPLETED, which created a circular dependency: [producer -> producer buffer -> VT -> callback -> drainer queue -> producer]. This deadlocked when both the producer buffer and the drainer queue were full and the first message in the drainer queue made the leader report COMPLETED: the drainer blocked on the full producer buffer, and the producer blocked on the full drainer queue. To avoid this, the direct heartbeat to VT is removed. Instead, LFSIT#lastSendIngestionHeartbeatTimestamp is set to 0 so that maybeSendIngestionHeartbeat() force-sends the HB to RT whenever a leader partition is reported completed, in addition to the interval-based sends.
* Amplification factor: the leader partition reads the HB from RT and propagates it to the VT of all sub-partitions.
* In maybeSendIngestionHeartbeat(): don't send heartbeats to non-existent RT topics when amplification factor is enabled, which resulted in a stuck consumption task.
* Followers read the header added to the heartbeat and update their state in PartitionConsumptionState. That state is used when checking whether lag is acceptable, for both offset-based and time-based lag, on both DaVinci and STANDBY replicas.
* Stores without AA: in non-AA stores, the consumer task is unable to read HB messages from the RT topic (though a test consumer can still read them), leaving standby replicas waiting indefinitely for the leader completion state header. This feature is disabled for non-AA stores until that issue is resolved.
* New configs:
  "server.leader.complete.state.check.in.follower.enabled": enables this feature. Disabled by default; enable it only after Venice tag 0.4.154, which supports sending HBs, is fully rolled out.
  "server.leader.complete.state.check.in.follower.valid.interval.ms": the time interval within which an HB is considered valid. Default: 5 minutes.
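
The follower-side check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the class, enum, field, and method names (LeaderCompleteCheckSketch, FollowerState, onHeartbeat, isLagAcceptable) are hypothetical, and only the two config meanings (feature flag and valid interval) come from the message above.

```java
public class LeaderCompleteCheckSketch {
  /** Hypothetical value carried in the heartbeat header. */
  enum LeaderCompleteState { LEADER_NOT_COMPLETED, LEADER_COMPLETED }

  /** Hypothetical stand-in for the per-partition state a follower keeps. */
  static class FollowerState {
    LeaderCompleteState leaderCompleteState = LeaderCompleteState.LEADER_NOT_COMPLETED;
    long lastLeaderCompleteStateUpdateMs = 0L;

    /** Record the header value seen on a consumed heartbeat. */
    void onHeartbeat(LeaderCompleteState stateFromHeader, long nowMs) {
      leaderCompleteState = stateFromHeader;
      lastLeaderCompleteStateUpdateMs = nowMs;
    }
  }

  /**
   * Lag is acceptable only if the usual offset/time lag check passes and,
   * when the feature is enabled, the leader was reported completed by a
   * heartbeat that is still within the valid interval.
   */
  static boolean isLagAcceptable(
      boolean featureEnabled,
      boolean lagWithinThreshold,
      FollowerState state,
      long nowMs,
      long validIntervalMs) {
    if (!lagWithinThreshold) {
      return false;
    }
    if (!featureEnabled) {
      return true; // assumed behavior with the feature off: lag check alone decides
    }
    boolean leaderCompleted = state.leaderCompleteState == LeaderCompleteState.LEADER_COMPLETED;
    boolean heartbeatFresh = nowMs - state.lastLeaderCompleteStateUpdateMs <= validIntervalMs;
    return leaderCompleted && heartbeatFresh;
  }

  public static void main(String[] args) {
    long validIntervalMs = 5 * 60 * 1000L; // the 5-minute default named above
    FollowerState state = new FollowerState();
    state.onHeartbeat(LeaderCompleteState.LEADER_COMPLETED, 1_000L);
    System.out.println(isLagAcceptable(true, true, state, 2_000L, validIntervalMs));
  }
}
```

The freshness test is what the valid-interval config would control in this sketch: a stale COMPLETED header is treated the same as no header at all.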
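
The deadlock fix described in the first bullet, resetting the last-send timestamp so the next periodic check sends a heartbeat instead of producing from the drainer thread, can be sketched as below. This is an illustrative model under stated assumptions: HeartbeatScheduler, onLeaderCompleted, and sentAt are hypothetical names, and appending to a list stands in for producing the HB to RT. Only the field name lastSendIngestionHeartbeatTimestamp and the method name maybeSendIngestionHeartbeat come from the message above.

```java
import java.util.ArrayList;
import java.util.List;

public class ForcedHeartbeatSketch {
  /** Hypothetical stand-in for the ingestion task's heartbeat bookkeeping. */
  static class HeartbeatScheduler {
    private final long intervalMs;
    private volatile long lastSendIngestionHeartbeatTimestamp;
    final List<Long> sentAt = new ArrayList<>(); // records when a heartbeat was "sent"

    HeartbeatScheduler(long intervalMs, long startMs) {
      this.intervalMs = intervalMs;
      this.lastSendIngestionHeartbeatTimestamp = startMs;
    }

    /**
     * Called when a leader partition reports COMPLETED. It only resets the
     * timestamp and never touches the producer, so the calling thread
     * cannot block on a full producer buffer.
     */
    void onLeaderCompleted() {
      lastSendIngestionHeartbeatTimestamp = 0L;
    }

    /** Called periodically; sends when the interval has elapsed or a send was forced. */
    boolean maybeSendIngestionHeartbeat(long nowMs) {
      if (nowMs - lastSendIngestionHeartbeatTimestamp < intervalMs) {
        return false;
      }
      sentAt.add(nowMs); // stands in for producing the HB to RT
      lastSendIngestionHeartbeatTimestamp = nowMs;
      return true;
    }
  }

  public static void main(String[] args) {
    HeartbeatScheduler scheduler = new HeartbeatScheduler(60_000L, 1_000_000L);
    System.out.println(scheduler.maybeSendIngestionHeartbeat(1_010_000L));
    scheduler.onLeaderCompleted();
    System.out.println(scheduler.maybeSendIngestionHeartbeat(1_010_000L));
  }
}
```

In this model the completion path and the send path are decoupled: the thread that observes COMPLETED only writes a timestamp, which breaks the producer -> drainer -> producer cycle.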