etcd old leader still revokes lease after it stepped to follower if CPU is high #12528
Comments
It seems that in extreme cases it is indeed possible for an old leader to send a lease revoke request to the cluster. When the cluster leadership changes, the old leader demotes its lessor and stops tracking lease expiry. However, during the window between the leadership change and the old leader actually observing it, the old leader will still revoke expired leases.
https://github.com/etcd-io/etcd/blob/v3.4.6/etcdserver/raft.go#L174-L195
Hi @tangcong, in fact we have already encountered this issue several times, and with no sleep function :). For a distributed system, phenomena such as high CPU load and network latency are the norm, so we think it would be better if the community can solve this issue. Also, why don't we try to avoid calling lease revoke as soon as the old leader receives a MsgApp with a higher term?
Yes, I can test this PR in our cluster, and will show the test results a.s.a.p.
Hi, @tangcong ,
Hi all, Thanks :)
@jiapeish, I have updated and verified the PR in my local environment, and it can fix this issue. etcd will verify the leader status from the raft state machine, but it will sacrifice a little performance. Can you give it a try? Thanks.
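Roughly, the guard looks like the sketch below. This is only an illustration of the idea, not the actual patch; it assumes the revoke path can consult the raft state machine through raft.Node.Status() (available in v3.4), and the wiring into etcdserver is omitted.

```go
// Hypothetical helper, not etcd's actual code: before proposing a
// LeaseRevoke, check whether this node still sees itself as the raft leader.
package leaseguard // hypothetical package name

import "go.etcd.io/etcd/raft"

// shouldRevokeExpired reports whether expired leases should be revoked by
// this node; if it is no longer the leader, leave them to the new leader.
func shouldRevokeExpired(n raft.Node) bool {
	return n.Status().RaftState == raft.StateLeader
}
```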
@tangcong , we tested this PR in our cluster and it does take effect, adding another line to detect if the
Good, thanks. If you cherry-pick it into etcd 3.4/3.3, you have to detect if the
@tangcong Hi, since this PR takes effect, I was wondering if we/the community can cherry-pick/backport this PR to etcd 3.4.6 officially? And can we submit it to the newer version?
OK, I will add some comments and see whether the failed test cases are relevant. @jiapeish
@tangcong Is there any progress on the failed test cases? It seems that CI-Job-4706.6 failed; I tried to figure out why, but the CI logs confused me...
I've seen this (or a similar) issue on a single-node cluster.
Yes, there is a similarity between them, so I think this issue is important to fix...
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
This issue still EXISTS.
I also faced this issue; my case is as below:
@aaronjzhang could you raise a new issue and link to this one? We will triage it later.
**Issue Overview**
Hi,
I have found a possible bug related to leases. It seems that the old etcd leader will send a LeaseRevoke request to the new leader, which results in the key's lease being revoked even though the client is still sending lease keepalives to the new leader.
**Our use case**
We use a 3-node etcd cluster running version v3.4.6; each node runs as a container with 4 vCPU cores. The 3 containers are deployed on 3 different VMs. The client writes a key with a 4-second lease to the etcd cluster.
The system architecture is shown in the attached diagram.
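Roughly, our client does something like the following (a simplified sketch against the v3.4 clientv3 API; the endpoints and key name are placeholders):

```go
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd1:2379", "http://etcd2:2379", "http://etcd3:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Grant a 4-second lease and attach it to a key.
	lease, err := cli.Grant(ctx, 4)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Put(ctx, "/service/instance1", "alive", clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}

	// Keep the lease alive; the key should never expire while keepalives
	// are being acknowledged by the current leader.
	ch, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		log.Fatal(err)
	}
	for resp := range ch {
		log.Printf("keepalive TTL=%d", resp.TTL)
	}
}
```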
**Steps to reproduce**
1. Modify `func (r *raftNode) start(rh *raftReadyHandler)` so that the node stalls (sleeps) some time after startup (a sketch of such a modification follows this list);
2. Compile the etcd from step 1 as etcd-stuck.exe, start another 2 etcd members, and try to make etcd-stuck the leader;
3. Write a key into the cluster and send lease keepalives to the 3-node cluster;
4. After 10s, etcd-stuck sleeps and a new leader is elected;
5. At the same time, the old leader sends a lease revoke request to the new leader, and the key is deleted.
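The exact patch used in step 1 is not included here; the fragment below shows one plausible way to inject such a stall into the Ready-processing loop of `raftNode.start` in etcdserver/raft.go. The surrounding select already exists there; only the time check and sleep are the injected, illustrative change.

```go
// Illustrative modification only: stall the raft loop once the node has been
// running for ~10s, mimicking CPU starvation on the leader.
start := time.Now()
for {
	select {
	case <-r.ticker.C:
		r.tick()
	case rd := <-r.Ready():
		if time.Since(start) > 10*time.Second {
			time.Sleep(30 * time.Second) // injected stall
		}
		// ... existing Ready handling stays unchanged ...
		_ = rd
	case <-r.stopped:
		return
	}
}
```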
**What we expect**
We expect that the key is not deleted, because 2 nodes are still healthy and can process the clients' lease keepalives.
We wonder why the old leader sends a lease revoke request when it has already stepped down to follower at that time. Why is the key deleted? Why doesn't the new leader check the revoke (delete) request and refuse it?
**Logs**
I've attached the logs of the 3 nodes.
etcd1.txt
etcd2.txt
etcd3.txt