Finalization stalling issue #1029
Comments
Checking similar recent issues on Substrate repo |
Since the OPs confirmed that the GRANDPA keys were unmodified, the suggested fix is to call the Grandpa module's

In TFChain, finalization is handled by the GRANDPA protocol, which relies on a supermajority of validators casting votes on blocks. When a network outage occurs, several factors can combine to keep finalization stalled after the outage is resolved; notably, a prolonged consensus delay can disrupt the aggregation of enough votes to finalize blocks. This has kept improving with more recent versions of Polkadot, but from time to time, especially on older versions, validators might need to reinitiate consensus rounds after such a major outage (via the “note_stalled” extrinsic) to restart finalization.

We need to craft a council motion to trigger that extrinsic. It takes two arguments:

I tried to search for more info about the delay value, and found a relevant comment in the grandpa pallet code.
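
To illustrate the supermajority requirement mentioned above, here is a minimal sketch of the vote threshold, assuming equal-weight voters and the usual BFT bound of tolerating at most floor((n - 1) / 3) faulty or offline voters. The function names are hypothetical and this is not TFChain or pallet code; it only shows why finalization stops once too many validators are unreachable.

```python
def grandpa_threshold(n_voters: int) -> int:
    """Minimum number of equal-weight votes needed for a supermajority.

    Assumption: the standard BFT bound, tolerating floor((n - 1) / 3)
    faulty or offline voters.
    """
    if n_voters <= 0:
        raise ValueError("need at least one voter")
    max_faulty = (n_voters - 1) // 3  # voters that may be offline or faulty
    return n_voters - max_faulty


def can_finalize(n_voters: int, n_online: int) -> bool:
    """True if enough voters are online to reach the supermajority."""
    return n_online >= grandpa_threshold(n_voters)


if __name__ == "__main__":
    # Hypothetical validator set of 10: check a few outage scenarios.
    for online in (10, 7, 6):
        print(f"{online}/10 online -> can finalize: {can_finalize(10, online)}")
```

Note that having enough voters back online is a necessary condition only; as described in this thread, finalization can stay stalled even after the voters are restored.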
|
I checked the validator logs and found this:
Also, before that part, there are many logs like this:
|
Requested one validator to be restarted with |
Update: The stalled finality bug seems particularly hard to track and fix, since it spans multiple issues over the past four years on the Substrate repo. We could open a ticket to see if Parity can offer support, but it would likely take some time to get a response. Alternatively, since we are dealing with Devnet, we might try purging the chain on all validators and restarting from a snapshot taken before the stalled finalized block, if one is available (@bertcoesens). We have never done that before, though, and are not sure what the effect would be. Another option is testing the solution that Lee implemented previously, which needs careful execution to avoid halting block production. We could also extend the investigation into next week, but I think the growing number of non-finalized blocks may lead to other issues that complicate the recovery process. |
Update: It took some time (the time until the end of the current session plus the added delay), then the validators resubmitted the finality votes, and now blocks are finalizing again. Note: For the future, it will be useful to enable |
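
The recovery timing described in this update (the end of the current session plus the added delay) can be estimated with a small sketch. It assumes fixed-length sessions counted from a known starting block, and that the change is scheduled at the first block of the next session and takes effect `delay` blocks later. All names and numbers are hypothetical and not taken from the TFChain runtime.

```python
def next_session_start(current_block: int, session_length: int,
                       first_session_block: int = 0) -> int:
    """First block of the next session, strictly after current_block.

    Assumption: sessions have a fixed length and start at first_session_block.
    """
    if session_length <= 0:
        raise ValueError("session_length must be positive")
    elapsed = current_block - first_session_block
    return first_session_block + (elapsed // session_length + 1) * session_length


def estimated_resume_block(current_block: int, session_length: int,
                           delay: int, first_session_block: int = 0) -> int:
    """Rough block number at which finalization may resume.

    Assumption: the stall is handled at the next session boundary and the
    resulting change is applied `delay` blocks after that boundary.
    """
    boundary = next_session_start(current_block, session_length,
                                  first_session_block)
    return boundary + delay


if __name__ == "__main__":
    # Hypothetical values: 600-block sessions, a delay of 100 blocks.
    print(estimated_resume_block(current_block=1050, session_length=600, delay=100))
```

This is only a rough estimate for planning: the actual session length and the delay passed to the extrinsic determine the real wait.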
Describe the bug
After a network outage, Devnet and Qanet validators stopped finalizing blocks.
Restoring the voters/validators did not resolve the issue.
It is expected to impact dependent services (indexers, bridges, etc.) that rely on finalized blocks for syncing.
This issue is to track the current investigation.
Screenshots
QA validators
[Image]
Dev validators
[Image]
Additional context
It seems that this has happened before. We need to check whether it is the same issue or can be fixed in a similar way:
https://github.com/threefoldtech/tf_operations/issues/1298