
Finalization stalling issue #1029

Closed
sameh-farouk opened this issue Feb 12, 2025 · 7 comments

sameh-farouk commented Feb 12, 2025

Describe the bug

After a network outage, Devnet and Qanet validators stopped finalizing blocks.

Restoring the voters/validators did not resolve the issue.

This is expected to impact dependent services (indexers, bridges, etc.) that rely on finalized blocks for syncing.

This issue is to track the current investigation.
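
For context, the stall shows up as a growing gap between the best block number and the last finalized block number. Below is a minimal sketch of how a dependent service could detect it with @polkadot/api; the endpoint URL is an assumption for illustration, not necessarily our actual RPC address.

```ts
// Sketch: detect a finality stall by comparing the best block number with the
// finalized block number. A steadily growing gap means finalization is stalled.
import { ApiPromise, WsProvider } from '@polkadot/api';

async function finalityGap(endpoint: string): Promise<number> {
  const api = await ApiPromise.create({ provider: new WsProvider(endpoint) });
  const best = (await api.rpc.chain.getHeader()).number.toNumber();
  const finalizedHash = await api.rpc.chain.getFinalizedHead();
  const finalized = (await api.rpc.chain.getHeader(finalizedHash)).number.toNumber();
  await api.disconnect();
  return best - finalized;
}

// Hypothetical endpoint; substitute the actual Devnet/Qanet RPC address.
finalityGap('wss://tfchain.dev.grid.tf/ws').then((gap) =>
  console.log(`finality gap: ${gap} blocks`),
);
```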

Screenshots

QA validators
[screenshot]

Dev validators
[screenshot]

Additional context

It seems this happened before; we need to check whether it is the same issue and whether it can be fixed in a similar way:
https://github.com/threefoldtech/tf_operations/issues/1298

@sameh-farouk sameh-farouk self-assigned this Feb 12, 2025
@sameh-farouk sameh-farouk moved this to Accepted in 3.16.x Feb 12, 2025
@sameh-farouk sameh-farouk moved this from Accepted to In Progress in 3.16.x Feb 12, 2025
@sameh-farouk

Checking similar recent issues on the Substrate repo.


sameh-farouk commented Feb 12, 2025

Since the OPs confirmed that the GRANDPA keys were unmodified, the suggested fix is to call the Grandpa module's note_stalled extrinsic (which seems to work for similar issues; see here).

In TFChain, finalization is handled by the GRANDPA protocol, which relies on a supermajority of validators casting votes on blocks. When a network outage occurs, several factors can combine to keep finalization stalled even after the outage is resolved; notably, a prolonged consensus delay can disrupt the process of aggregating enough votes to finalize blocks.
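
As a toy illustration of the supermajority requirement (not TFChain code): GRANDPA can only finalize once strictly more than two-thirds of the voting weight agrees, so losing even a third of the validators to an outage is enough to halt finality while block production continues.

```ts
// Toy model of GRANDPA's finality condition: a block can be finalized only if
// strictly more than 2/3 of the total validator weight voted for it (or a
// descendant). With equal weights this reduces to a simple count check.
function canFinalize(votes: number, totalValidators: number): boolean {
  return 3 * votes > 2 * totalValidators;
}

console.log(canFinalize(2, 3));  // false: 2 of 3 is exactly 2/3, not more
console.log(canFinalize(7, 10)); // true: 7 of 10 exceeds 2/3
```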

This has kept improving in more recent versions of Polkadot, but it seems that from time to time, especially on older versions, validators may need to reinitiate consensus rounds after such a major outage (via the note_stalled extrinsic) to restart finalization.

We need to craft a council motion to trigger that extrinsic. It takes two arguments: the first is a delay (in blocks), and the second is the last finalized block number.

I searched for more info about the delay value and found this comment in the grandpa pallet code:

The delay should be high enough to safely assume that the block signalling the forced change will not be re-orged (e.g. 1000 blocks)
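
For reference, here is a minimal sketch of what submitting that motion could look like with @polkadot/api. The endpoint, signer URI, council threshold, and delay value are placeholders for illustration, not the values we will actually use.

```ts
// Sketch: wrap grandpa.noteStalled(delay, bestFinalizedBlockNumber) in a
// council motion. All concrete values below are placeholders.
import { ApiPromise, WsProvider } from '@polkadot/api';
import { Keyring } from '@polkadot/keyring';

async function main() {
  const api = await ApiPromise.create({
    provider: new WsProvider('wss://tfchain.dev.grid.tf/ws'), // assumed endpoint
  });

  // Second argument: the current last-finalized block number.
  const finalizedHash = await api.rpc.chain.getFinalizedHead();
  const bestFinalized = (await api.rpc.chain.getHeader(finalizedHash)).number.toNumber();

  // First argument: the delay, high enough to survive a re-org per the pallet docs.
  const delay = 1000;

  const call = api.tx.grandpa.noteStalled(delay, bestFinalized);

  // Council threshold and the signing councillor are placeholders.
  const threshold = 2;
  const motion = api.tx.council.propose(threshold, call, call.encodedLength);

  const councillor = new Keyring({ type: 'sr25519' }).addFromUri('//Alice');
  await motion.signAndSend(councillor);
  await api.disconnect();
}

main().catch(console.error);
```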


sameh-farouk commented Feb 13, 2025

I checked the validator logs and found this (newest entries first):

2025-02-12 01:53:09.576  INFO tokio-runtime-worker substrate: ✨ Imported #14980260 (0xe7b2…705e)
2025-02-12 01:53:09.576  INFO tokio-runtime-worker sc_informant: ♻️  Reorg on #14980259,0x29d0…2e22 to #14980260,0xe7b2…705e, common ancestor #14980026,0xce62…8998
2025-02-12 01:53:09.576 DEBUG tokio-runtime-worker offchain-worker: Spawning offchain workers at 0xe7b2297a6411bd9c17bf2d7d9ec18843a97d836956d291d85559bc621187705e
2025-02-12 01:53:09.575 DEBUG tokio-runtime-worker offchain-worker: Checking offchain workers at 0xe7b2297a6411bd9c17bf2d7d9ec18843a97d836956d291d85559bc621187705e: version: 2
2025-02-12 01:53:09.400  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980247 -- 0x7663…04d3) announcement from 12D3KooWGc5sgeLNvR9wKXFw1uapnyXubfKEi5yN7p1HmDZRkVn1 because all validation slots for this peer are occupied.
2025-02-12 01:53:09.400  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980246 -- 0x41be…e026) announcement from 12D3KooWGc5sgeLNvR9wKXFw1uapnyXubfKEi5yN7p1HmDZRkVn1 because all validation slots for this peer are occupied.
2025-02-12 01:53:09.392  WARN tokio-runtime-worker sync: 💔 Error importing block 0x104272ed3a62b544a18592420c53f424f6029c3b7f675a11c00549f4153a736c: block has an unknown parent
2025-02-12 01:53:09.392  WARN tokio-runtime-worker sync: 💔 Error importing block 0xf973ccf7cd13d1cc26fa5645ef4858d5546bdadefc0887f592b8bd39f75041d5: block has an unknown parent
2025-02-12 01:53:09.391  WARN tokio-runtime-worker sync: 💔 Error importing block 0x95711ee2e22b24f868ba8c646ff84908f40786949ecd94bac25852029191e1d6: block has an unknown parent
2025-02-12 01:53:09.391  WARN tokio-runtime-worker sync: 💔 Error importing block 0x075548178eaa3d49e13831a856bec37da07dab356e6a401c53010dde84ef6018: block has an unknown parent
2025-02-12 01:53:09.391  INFO tokio-runtime-worker substrate: ✨ Imported #14980259 (0x619e…4bc2)

Also, before that part, there are many logs like this:

2025-02-12 01:53:09.124  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980254 -- 0x222c…006b) announcement from 12D3KooWS85iesifJ9R9Uqv6JBZp8mMpzHD8dTXZorHWeWB61Q2p because all validation slots for this peer are occupied.
2025-02-12 01:53:09.124  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980253 -- 0xfc1f…ff43) announcement from 12D3KooWS85iesifJ9R9Uqv6JBZp8mMpzHD8dTXZorHWeWB61Q2p because all validation slots for this peer are occupied.
2025-02-12 01:53:09.124  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980252 -- 0x7684…1e89) announcement from 12D3KooWS85iesifJ9R9Uqv6JBZp8mMpzHD8dTXZorHWeWB61Q2p because all validation slots for this peer are occupied.
2025-02-12 01:53:09.124  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980251 -- 0x6440…bcad) announcement from 12D3KooWS85iesifJ9R9Uqv6JBZp8mMpzHD8dTXZorHWeWB61Q2p because all validation slots for this peer are occupied.
2025-02-12 01:53:09.124  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980250 -- 0x74a9…4e3f) announcement from 12D3KooWS85iesifJ9R9Uqv6JBZp8mMpzHD8dTXZorHWeWB61Q2p because all validation slots for this peer are occupied.
2025-02-12 01:53:09.124  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980249 -- 0x7fa8…ac5d) announcement from 12D3KooWS85iesifJ9R9Uqv6JBZp8mMpzHD8dTXZorHWeWB61Q2p because all validation slots for this peer are occupied.
2025-02-12 01:53:09.124  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980248 -- 0x26ea…a8ea) announcement from 12D3KooWS85iesifJ9R9Uqv6JBZp8mMpzHD8dTXZorHWeWB61Q2p because all validation slots for this peer are occupied.
2025-02-12 01:53:09.124  WARN tokio-runtime-worker sync: 💔 Ignored block (#14980247 -- 0x7663…04d3) announcement from 12D3KooWS85iesifJ9R9Uqv6JBZp8mMpzHD8dTXZorHWeWB61Q2p because all validation slots for this peer are occupied.
2025-02-12 01:53:09.124 DEBUG tokio-runtime-worker offchain-worker: Skipping offchain workers for non-canon block: Header { parent_hash: 0xf1cc8ae08ed45bef92a097bb7459fb59163c5a4617e38a76b5dda434ac029b8f, number: 14980169, state_root: 0xc9a4a2b704ab0461167d74fd95caf086cfbab36cbcc5a9aebe6833082fa4b83b, extrinsics_root: 0xcd9d9150bb18feff0847808556b16699a9ee07ef9ae6c85c376d336f8749f25e, digest: Digest { logs: [DigestItem::PreRuntime([97, 117, 114, 97], [58, 84, 71, 17, 0, 0, 0, 0]), DigestItem::Seal([97, 117, 114, 97], [0, 133, 120, 185, 192, 11, 5, 133, 30, 125, 114, 154, 102, 180, 201, 238, 234, 120, 146, 100, 19, 87, 53, 144, 158, 213, 216, 237, 31, 213, 36, 119, 104, 156, 95, 222, 14, 130, 117, 249, 230, 214, 203, 221, 205, 176, 185, 211, 32, 58, 227, 152, 32, 87, 158, 12, 56, 103, 239, 92, 153, 54, 85, 139])] } }
2025-02-12 01:53:09.124  INFO tokio-runtime-worker substrate: ✨ Imported #14980169 (0xfe75…324c)

@sameh-farouk

Requested that one validator be restarted with the -lafg flag for further debugging.


sameh-farouk commented Feb 13, 2025

Update: the situation so far.

It appears there was a network split that self-resolved, indicated by a reorg in the logs. Other logs show a mix of events I can't yet explain; some relate to an open bug tracked on the Substrate repo, such as paritytech/polkadot-sdk#4109, and others are tied to issues that have been fixed or improved in more recent versions (we might need ~12 releases to catch up with the latest stable version, which could take several months of development and testing).

The stalled-finality bug seems particularly hard to track down and fix, since it spans multiple issues over the past four years on the Substrate repo. We could open a ticket to see whether Parity can offer support, but that would likely take some time to get a response too.

Alternatively, since we are dealing with Devnet, we could try purging the chain on all validators and restarting from a snapshot taken before the stalled finalized block, if one is available (@bertcoesens). However, we have never done that before, and I'm not sure what the effect would be. We could also test the solution Lee implemented previously, which needs careful execution to avoid halting block production.

Also, we could extend the investigation into next week, but I think the growing number of non-finalized blocks may lead to other issues that complicate the recovery process.


sameh-farouk commented Feb 14, 2025

Update: we fixed this by calling the Grandpa module's note_stalled extrinsic.

It took some time (until the end of the current session plus the added delay); then the validators resubmitted their finality votes, and blocks are now finalizing again.
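
For a rough sense of that waiting period, here is a back-of-the-envelope sketch; the block time and session length are assumptions for illustration, so check the TFChain runtime for the actual constants.

```ts
// Estimate when finality should resume after note_stalled takes effect:
// the remainder of the current session plus the configured delay.
const BLOCK_TIME_SECS = 6;          // assumed block time
const SESSION_LENGTH_BLOCKS = 600;  // assumed session length; check the runtime
const NOTE_STALLED_DELAY = 1000;    // the delay passed to note_stalled

function secondsUntilFinalityResumes(blocksIntoSession: number): number {
  const blocksLeft = SESSION_LENGTH_BLOCKS - blocksIntoSession;
  return (blocksLeft + NOTE_STALLED_DELAY) * BLOCK_TIME_SECS;
}

// e.g. 100 blocks into a session: (500 + 1000) * 6s = 9000s, i.e. 2.5 hours
console.log(secondsUntilFinalityResumes(100));
```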

[screenshot]

Note: for the future, it will be useful to enable -lgrandpa when debugging any block-finality issue.
Another thought: I believe this particular fix is relevant to the issue we experienced, and it was released recently in Polkadot stable 1.16.0.

@sameh-farouk sameh-farouk moved this from In Progress to In Verification in 3.16.x Feb 14, 2025

sameh-farouk commented Feb 18, 2025

The QA stalling issue was also resolved today after calling the Grandpa module's note_stalled extrinsic.

[screenshot]
