etcd old leader still revokes lease after it stepped to follower if CPU is high #12528
Comments
It seems that in extreme cases it is indeed possible for an old leader to send a lease revoke request to the cluster. When the cluster leadership changes, the old leader demotes its lessor and stops tracking lease expiry. However, during the window between the leadership change and the old leader actually observing it, the old leader will still revoke expired leases.
https://github.com/etcd-io/etcd/blob/v3.4.6/etcdserver/raft.go#L174-L195
Hi @tangcong, in fact we have already encountered this issue several times, and with no sleep function :). For a distributed system, phenomena such as high CPU load and network latency are the norm, so we think it would be better if the community can solve this issue. Also, why don't we try to avoid calling lease revoke as soon as the old leader receives a MsgApp with a higher term?
Yes, I can test this PR in our cluster, and will show the test results a.s.a.p.
Hi, @tangcong ,
Hi all, Thanks :)
@jiapeish, I have updated and verified the PR in my local environment, and it can fix this issue. etcd will verify the leader status from the raft state machine, but it will sacrifice a little performance. Can you give it a try? Thanks.
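Roughly, the guard looks like the sketch below. This is only an illustration of the idea, not the actual patch; it assumes the revoke path can consult the raft state machine through raft.Node.Status() (available in v3.4), and the wiring into etcdserver is omitted.

```go
// Hypothetical helper, not etcd's actual code: before proposing a
// LeaseRevoke, check whether this node still sees itself as the raft leader.
package leaseguard // hypothetical package name

import "go.etcd.io/etcd/raft"

// shouldRevokeExpired reports whether expired leases should be revoked by
// this node; if it is no longer the leader, leave them to the new leader.
func shouldRevokeExpired(n raft.Node) bool {
	return n.Status().RaftState == raft.StateLeader
}
```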
@tangcong , we tested this PR in our cluster and it does take effect, adding another line to detect if the
Good, thanks. If you cherry-pick it into etcd 3.4/3.3, you have to detect if the
@tangcong Hi, since this PR takes effect, I was wondering if we/the community can cherry-pick/backport this PR to etcd 3.4.6 officially? And can we submit it to the newer version?
OK, I will add some comments and see whether the failed test cases are relevant. @jiapeish
@tangcong Is there any progress on the failed test cases? It seems that CI-Job-4706.6 failed; I tried to figure out why, but the CI logs confused me...
I've seen this (or a similar) issue on a single-node cluster.
Yes, there is a similarity between them, so I think this issue is important to fix...
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
This issue still EXISTS.
I also faced this issue; my case is as below:
@aaronjzhang could you raise a new issue and link to this one? We will triage it later.
**Issue Overview**
Hi,
I have found a possible bug related to leases. It seems that the old etcd leader will send a LeaseRevoke request to the new leader, which results in the key's lease being revoked even though the client is still sending lease keepalives to the new leader.
**Our use case**
We use a 3-node etcd cluster running version v3.4.6; each node runs as a container with 4 vCPU cores. The 3 containers are deployed on 3 different VMs. The client writes a key with a 4-second lease to the etcd cluster.
The system architecture is shown in the attached diagram.
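Roughly, our client does something like the following (a simplified sketch against the v3.4 clientv3 API; the endpoints and key name are placeholders):

```go
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd1:2379", "http://etcd2:2379", "http://etcd3:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Grant a 4-second lease and attach it to a key.
	lease, err := cli.Grant(ctx, 4)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Put(ctx, "/service/instance1", "alive", clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}

	// Keep the lease alive; the key should never expire while keepalives
	// are being acknowledged by the current leader.
	ch, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		log.Fatal(err)
	}
	for resp := range ch {
		log.Printf("keepalive TTL=%d", resp.TTL)
	}
}
```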
**Steps to reproduce**
1. Modify `func (r *raftNode) start(rh *raftReadyHandler)` so that the node stalls (sleeps) some time after startup (a sketch of such a modification follows this list);
2. Compile the etcd from step 1 as etcd-stuck.exe, start another 2 etcd members, and try to make etcd-stuck the leader;
3. Write a key into the cluster and send lease keepalives to the 3-node cluster;
4. After 10s, etcd-stuck sleeps and a new leader is elected;
5. At the same time, the old leader sends a lease revoke request to the new leader, and the key is deleted.
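The exact patch used in step 1 is not included here; the fragment below shows one plausible way to inject such a stall into the Ready-processing loop of `raftNode.start` in etcdserver/raft.go. The surrounding select already exists there; only the time check and sleep are the injected, illustrative change.

```go
// Illustrative modification only: stall the raft loop once the node has been
// running for ~10s, mimicking CPU starvation on the leader.
start := time.Now()
for {
	select {
	case <-r.ticker.C:
		r.tick()
	case rd := <-r.Ready():
		if time.Since(start) > 10*time.Second {
			time.Sleep(30 * time.Second) // injected stall
		}
		// ... existing Ready handling stays unchanged ...
		_ = rd
	case <-r.stopped:
		return
	}
}
```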
**What we expect**
We expect that the key is not deleted, because 2 nodes are still healthy and can process the clients' lease keepalives.
We wonder why the old leader sends a lease revoke request when it has already stepped down to follower at that time. Why is the key deleted? Why doesn't the new leader check the revoke (delete) request and refuse it?
**Logs**
I've attached the logs of the 3 nodes.
etcd1.txt
etcd2.txt
etcd3.txt