Replies: 4 comments 4 replies
-
What librdkafka version?
-
I also looked at the documentation, and I do not see this listed as a documented value for 'fetch_state'.
-
I am not sure if anyone reads this discussion, but the issue happened again during the last few days. Same problem: a partition would not receive messages anymore, and the partition statistics show the same behavior,
i.e. "fetch_state": "validate-epoch-wait"
-
We had the same problem. Try out the new release 2.3.0; it seems to solve the problem, so far. We have been running for several days now without issues. https://github.com/confluentinc/confluent-kafka-dotnet/releases
-
Hi,
Before opening an issue, I would rather ask a question here to see if anyone has run into our problem.
We do not use transactions on the producer side in our implementation. We have many producers and, in this simple case, only one consumer process (in a Kubernetes pod), which processes all 60 partitions. Messages are spread more or less evenly between producers/partitions. I do not see any evident issues on the producer side.
At some point, one (or more) partitions stop receiving messages, and I can observe consumer lag on Confluent's cluster.
We implemented a partition monitor that checks whether we have received no messages after a certain period of time. We recently got into this situation again, where we stopped getting messages from partition 35 (out of 60), and this is what we have in the stats for that partition:
"35": { "partition": 35, "broker": 21, "leader": 21, "desired": true, "unknown": false, "msgq_cnt": 0, "msgq_bytes": 0, "xmit_msgq_cnt": 0, "xmit_msgq_bytes": 0, "fetchq_cnt": 0, "fetchq_size": 0, "fetch_state": "validate-epoch-wait", "query_offset": -1001, "next_offset": 1245176, "app_offset": 1245176, "stored_offset": -1001, "stored_leader_epoch": -1, "commited_offset": 1245176, "committed_offset": 1245176, "committed_leader_epoch": 120, "eof_offset": 1245176, "lo_offset": -1, "hi_offset": -1, "ls_offset": -1, "consumer_lag": -1, "consumer_lag_stored": -1, "leader_epoch": 121, "txmsgs": 0, "txbytes": 0, "rxmsgs": 615, "rxbytes": 710580, "msgs": 615, "rx_ver_drops": 0, "msgs_inflight": 0, "next_ack_seq": 0, "next_err_seq": 0, "acked_msgid": 0 },
What is the meaning of "validate-epoch-wait"?
We plan to restart the pod whenever we encounter this type of issue, since we do not know how to recover from it, nor do we know why it happens.
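For what it's worth, the stuck state can be detected directly from the statistics callback rather than by waiting for the idle-partition timer. Below is a minimal sketch in Python (the stats JSON layout is the same for all librdkafka bindings, including confluent-kafka-dotnet's statistics handler): it walks the top-level `topics` → `partitions` map and flags any desired partition whose `fetch_state` is not in an allow-list of healthy states. The allow-list containing only `"active"` is an assumption; tune it to whatever states you consider normal for your workload.

```python
import json

# Assumption: a partition that the consumer wants ("desired": true) should
# normally sit in fetch_state "active"; anything else for a prolonged
# period (e.g. "validate-epoch-wait") is treated as potentially stuck.
HEALTHY_STATES = {"active"}

def stuck_partitions(stats_json: str) -> list:
    """Return (topic, partition, fetch_state) for desired partitions
    whose fetch_state is not in HEALTHY_STATES.

    stats_json is the raw string handed to the statistics callback
    (statistics.interval.ms must be set for it to fire).
    """
    stats = json.loads(stats_json)
    stuck = []
    for topic, tstats in stats.get("topics", {}).items():
        for pid, pstats in tstats.get("partitions", {}).items():
            if pid == "-1":  # internal unassigned-partition entry, skip
                continue
            if pstats.get("desired") and pstats.get("fetch_state") not in HEALTHY_STATES:
                stuck.append((topic, int(pid), pstats.get("fetch_state")))
    return stuck
```

In the .NET client the same check would live in the `SetStatisticsHandler` callback; a hit could then trigger an alert or the pod restart you describe, instead of relying solely on a no-messages timeout (which cannot distinguish a stuck partition from a legitimately idle one).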
Any insight or help on this issue would be appreciated,
Thanks,