
Finalization stalling issue #1029

Closed
sameh-farouk opened this issue Feb 12, 2025 · 7 comments

sameh-farouk commented Feb 12, 2025

Describe the bug

After a network outage, Devnet and Qanet validators stopped finalizing blocks.

Restoring the voters/validators did not resolve the issue.

This is expected to impact dependent services (indexers, bridges, etc.) that rely on finalized blocks for syncing.

This issue is to track the current investigation.
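
For context, the stall shows up as a growing gap between the best block number and the last finalized block number. Below is a minimal sketch of how a dependent service could detect it with @polkadot/api; the endpoint URL is an assumption for illustration, not necessarily our actual RPC address.

```ts
// Sketch: detect a finality stall by comparing the best block number with the
// finalized block number. A steadily growing gap means finalization is stalled.
import { ApiPromise, WsProvider } from '@polkadot/api';

async function finalityGap(endpoint: string): Promise<number> {
  const api = await ApiPromise.create({ provider: new WsProvider(endpoint) });
  const best = (await api.rpc.chain.getHeader()).number.toNumber();
  const finalizedHash = await api.rpc.chain.getFinalizedHead();
  const finalized = (await api.rpc.chain.getHeader(finalizedHash)).number.toNumber();
  await api.disconnect();
  return best - finalized;
}

// Hypothetical endpoint; substitute the actual Devnet/Qanet RPC address.
finalityGap('wss://tfchain.dev.grid.tf/ws').then((gap) =>
  console.log(`finality gap: ${gap} blocks`),
);
```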

Screenshots

QA validators
[screenshot]

Dev validators
[screenshot]

Additional context

It seems this happened before; we need to check whether it is the same issue and whether it can be fixed in a similar way:
https://github.com/threefoldtech/tf_operations/issues/1298

@sameh-farouk sameh-farouk self-assigned this Feb 12, 2025
@sameh-farouk sameh-farouk moved this to Accepted in 3.16.x Feb 12, 2025
@sameh-farouk sameh-farouk moved this from Accepted to In Progress in 3.16.x Feb 12, 2025
@sameh-farouk

Checking similar recent issues on the Substrate repo.


sameh-farouk commented Feb 12, 2025

Since the OPs confirmed that the GRANDPA keys were unmodified, the suggested fix is to call the Grandpa module's note_stalled extrinsic (which seems to work for similar issues; see here).

In TFChain, finalization is handled by the GRANDPA protocol, which relies on a supermajority of validators casting votes on blocks. When a network outage occurs, several factors can combine to keep finalization stalled even after the outage is resolved; notably, a prolonged consensus delay can disrupt the process of aggregating enough votes to finalize blocks.
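
As a toy illustration of the supermajority requirement (not TFChain code): GRANDPA can only finalize once strictly more than two-thirds of the voting weight agrees, so losing even a third of the validators to an outage is enough to halt finality while block production continues.

```ts
// Toy model of GRANDPA's finality condition: a block can be finalized only if
// strictly more than 2/3 of the total validator weight voted for it (or a
// descendant). With equal weights this reduces to a simple count check.
function canFinalize(votes: number, totalValidators: number): boolean {
  return 3 * votes > 2 * totalValidators;
}

console.log(canFinalize(2, 3));  // false: 2 of 3 is exactly 2/3, not more
console.log(canFinalize(7, 10)); // true: 7 of 10 exceeds 2/3
```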

This has kept improving in more recent versions of Polkadot, but it seems that from time to time, especially on older versions, validators may need to reinitiate consensus rounds after such a major outage (via the note_stalled extrinsic) to restart finalization.

We need to craft a council motion to trigger that extrinsic. It takes two arguments: the first is a delay (in blocks), and the second is the last finalized block number.

I searched for more info about the delay value and found this comment in the grandpa pallet code:

The delay should be high enough to safely assume that the block signalling the forced change will not be re-orged (e.g. 1000 blocks)
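
For reference, here is a minimal sketch of what submitting that motion could look like with @polkadot/api. The endpoint, signer URI, council threshold, and delay value are placeholders for illustration, not the values we will actually use.

```ts
// Sketch: wrap grandpa.noteStalled(delay, bestFinalizedBlockNumber) in a
// council motion. All concrete values below are placeholders.
import { ApiPromise, WsProvider } from '@polkadot/api';
import { Keyring } from '@polkadot/keyring';

async function main() {
  const api = await ApiPromise.create({
    provider: new WsProvider('wss://tfchain.dev.grid.tf/ws'), // assumed endpoint
  });

  // Second argument: the current last-finalized block number.
  const finalizedHash = await api.rpc.chain.getFinalizedHead();
  const bestFinalized = (await api.rpc.chain.getHeader(finalizedHash)).number.toNumber();

  // First argument: the delay, high enough to survive a re-org per the pallet docs.
  const delay = 1000;

  const call = api.tx.grandpa.noteStalled(delay, bestFinalized);

  // Council threshold and the signing councillor are placeholders.
  const threshold = 2;
  const motion = api.tx.council.propose(threshold, call, call.encodedLength);

  const councillor = new Keyring({ type: 'sr25519' }).addFromUri('//Alice');
  await motion.signAndSend(councillor);
  await api.disconnect();
}

main().catch(console.error);
```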


sameh-farouk commented Feb 13, 2025

I checked the validator logs and found this (newest entries first):

2025-02-12 01:53:09.576  INFO tokio-runtime-worker substrate: ✨ Imported #14980260 (0xe7b2…705e)
2025-02-12 01:53:09.576  INFO tokio-runtime-worker sc_informant: ♻️  Reorg on #14980259,0x29d0…2e22 to #14980260,0xe7b2…705e, common ancestor #14980026,0xce62…8998
2025-02-12 01:53:09.576 DEBUG tokio-runtime-worker offchain-worker: Spawning offchain workers at 0xe7b2297a6411bd9c17bf2d7d9ec18843a97d836956d291d85559bc621187705e
2025-02-12 01:53:09.575 DEBUG tokio-runtime-worker offchain-worker: Checking offchain workers at 0xe7b2297a6411bd9c17bf2d7d9ec18843a97d836956d291d85559bc621187705e: version: 2
2025-02-12 01:53:09.400  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980247 -- 0x7663…04d3) announcement from 12D3KooWGc5sgeLNvR9wKXFw1uapnyXubfKEi5yN7p1HmDZRkVn1 because all validation slots for this peer are occupied.
2025-02-12 01:53:09.400  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980246 -- 0x41be…e026) announcement from 12D3KooWGc5sgeLNvR9wKXFw1uapnyXubfKEi5yN7p1HmDZRkVn1 because all validation slots for this peer are occupied.
2025-02-12 01:53:09.392  WARN tokio-runtime-worker sync: 💔 Error importing block 0x104272ed3a62b544a18592420c53f424f6029c3b7f675a11c00549f4153a736c: block has an unknown parent
2025-02-12 01:53:09.392  WARN tokio-runtime-worker sync: 💔 Error importing block 0xf973ccf7cd13d1cc26fa5645ef4858d5546bdadefc0887f592b8bd39f75041d5: block has an unknown parent
2025-02-12 01:53:09.391  WARN tokio-runtime-worker sync: 💔 Error importing block 0x95711ee2e22b24f868ba8c646ff84908f40786949ecd94bac25852029191e1d6: block has an unknown parent
2025-02-12 01:53:09.391  WARN tokio-runtime-worker sync: 💔 Error importing block 0x075548178eaa3d49e13831a856bec37da07dab356e6a401c53010dde84ef6018: block has an unknown parent
2025-02-12 01:53:09.391  INFO tokio-runtime-worker substrate: ✨ Imported #14980259 (0x619e…4bc2)

Also, before that part, there are many logs like this:

2025-02-12 01:53:09.124  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980254 -- 0x222c…006b) announcement from 12D3KooWS85iesifJ9R9Uqv6JBZp8mMpzHD8dTXZorHWeWB61Q2p because all validation slots for this peer are occupied.
2025-02-12 01:53:09.124  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980253 -- 0xfc1f…ff43) announcement from 12D3KooWS85iesifJ9R9Uqv6JBZp8mMpzHD8dTXZorHWeWB61Q2p because all validation slots for this peer are occupied.
2025-02-12 01:53:09.124  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980252 -- 0x7684…1e89) announcement from 12D3KooWS85iesifJ9R9Uqv6JBZp8mMpzHD8dTXZorHWeWB61Q2p because all validation slots for this peer are occupied.
2025-02-12 01:53:09.124  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980251 -- 0x6440…bcad) announcement from 12D3KooWS85iesifJ9R9Uqv6JBZp8mMpzHD8dTXZorHWeWB61Q2p because all validation slots for this peer are occupied.
2025-02-12 01:53:09.124  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980250 -- 0x74a9…4e3f) announcement from 12D3KooWS85iesifJ9R9Uqv6JBZp8mMpzHD8dTXZorHWeWB61Q2p because all validation slots for this peer are occupied.
2025-02-12 01:53:09.124  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980249 -- 0x7fa8…ac5d) announcement from 12D3KooWS85iesifJ9R9Uqv6JBZp8mMpzHD8dTXZorHWeWB61Q2p because all validation slots for this peer are occupied.
2025-02-12 01:53:09.124  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980248 -- 0x26ea…a8ea) announcement from 12D3KooWS85iesifJ9R9Uqv6JBZp8mMpzHD8dTXZorHWeWB61Q2p because all validation slots for this peer are occupied.
2025-02-12 01:53:09.124  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980247 -- 0x7663…04d3) announcement from 12D3KooWS85iesifJ9R9Uqv6JBZp8mMpzHD8dTXZorHWeWB61Q2p because all validation slots for this peer are occupied.
2025-02-12 01:53:09.124 DEBUG tokio-runtime-worker offchain-worker: Skipping offchain workers for non-canon block: Header { parent_hash: 0xf1cc8ae08ed45bef92a097bb7459fb59163c5a4617e38a76b5dda434ac029b8f, number: 14980169, state_root: 0xc9a4a2b704ab0461167d74fd95caf086cfbab36cbcc5a9aebe6833082fa4b83b, extrinsics_root: 0xcd9d9150bb18feff0847808556b16699a9ee07ef9ae6c85c376d336f8749f25e, digest: Digest { logs: [DigestItem::PreRuntime([97, 117, 114, 97], [58, 84, 71, 17, 0, 0, 0, 0]), DigestItem::Seal([97, 117, 114, 97], [0, 133, 120, 185, 192, 11, 5, 133, 30, 125, 114, 154, 102, 180, 201, 238, 234, 120, 146, 100, 19, 87, 53, 144, 158, 213, 216, 237, 31, 213, 36, 119, 104, 156, 95, 222, 14, 130, 117, 249, 230, 214, 203, 221, 205, 176, 185, 211, 32, 58, 227, 152, 32, 87, 158, 12, 56, 103, 239, 92, 153, 54, 85, 139])] } }
2025-02-12 01:53:09.124  INFO tokio-runtime-worker substrate: ✨ Imported #14980169 (0xfe75…324c)

@sameh-farouk

Requested that one validator be restarted with the -lafg flag for further debugging.


sameh-farouk commented Feb 13, 2025

Update: the situation so far.

It appears there was a network split that self-resolved, indicated by a reorg in the logs. Other logs show a mix of events I can't yet explain; some relate to an open bug tracked on the Substrate repo, such as paritytech/polkadot-sdk#4109, and others are tied to issues that have been fixed or improved in more recent versions (we might need ~12 releases to catch up with the latest stable version, which could take several months of development and testing).

The stalled-finality bug seems particularly hard to track down and fix, since it spans multiple issues over the past four years on the Substrate repo. We could open a ticket to see whether Parity can offer support, but that would likely take some time to get a response too.

Alternatively, since we are dealing with Devnet, we could try purging the chain on all validators and restarting from a snapshot taken before the stalled finalized block, if one is available (@bertcoesens). However, we have never done that before, and I'm not sure what the effect would be. We could also test the solution Lee implemented previously, which needs careful execution to avoid halting block production.

Also, we could extend the investigation into next week, but I think the growing number of non-finalized blocks may lead to other issues that complicate the recovery process.


sameh-farouk commented Feb 14, 2025

Update: we fixed this by calling the Grandpa module's note_stalled extrinsic.

It took some time (until the end of the current session plus the added delay); then the validators resubmitted their finality votes, and blocks are now finalizing again.
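
For a rough sense of that waiting period, here is a back-of-the-envelope sketch; the block time and session length are assumptions for illustration, so check the TFChain runtime for the actual constants.

```ts
// Estimate when finality should resume after note_stalled takes effect:
// the remainder of the current session plus the configured delay.
const BLOCK_TIME_SECS = 6;          // assumed block time
const SESSION_LENGTH_BLOCKS = 600;  // assumed session length; check the runtime
const NOTE_STALLED_DELAY = 1000;    // the delay passed to note_stalled

function secondsUntilFinalityResumes(blocksIntoSession: number): number {
  const blocksLeft = SESSION_LENGTH_BLOCKS - blocksIntoSession;
  return (blocksLeft + NOTE_STALLED_DELAY) * BLOCK_TIME_SECS;
}

// e.g. 100 blocks into a session: (500 + 1000) * 6s = 9000s, i.e. 2.5 hours
console.log(secondsUntilFinalityResumes(100));
```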

[screenshot]

Note: for the future, it will be useful to enable -lgrandpa when debugging any block-finality issue.
Another thought: I believe this particular fix is relevant to the issue we experienced, and it was released recently in Polkadot stable 1.16.0.

@sameh-farouk sameh-farouk moved this from In Progress to In Verification in 3.16.x Feb 14, 2025

sameh-farouk commented Feb 18, 2025

The QA stalling issue was also resolved today after calling the Grandpa module's note_stalled extrinsic.

[screenshot]
