Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/tomo: Add 'repair-snapshot' command to fix 'Masternodes lists are different in checkpoint header & snapshot' error #499

Merged
merged 13 commits into from
Jan 7, 2025

Conversation

tuanha-98
Copy link
Contributor

@tuanha-98 tuanha-98 commented Dec 27, 2024

Description

This pull request introduces a new db command with a repair-snapshot subcommand to resolve inconsistencies in the masternode list that can occur during updates. The issue arises from the use of sort.Slice to sort the masternode list when masternodes have identical stake amounts. This sorting behavior is unstable across different Go versions, leading to potential discrepancies.

When a new masternode list is updated (e.g., at gap 895), the list is initially stored in the cache but is not immediately saved to the database. If the node crashes during this period, the updated masternode list is lost. Although the verifyCascadingFields mechanism provides some level of recovery, it cannot fully resolve the issue because the root cause lies in the version-dependent behavior of sort.Slice. This can result in mismatched signer lists when sorting occurs.

Even in scenarios where the node does not crash, there remains a risk of mismatched signer lists between nodes running different Go versions due to inconsistencies in the sorting behavior.

Introduces a robust solution to repair snapshots using a compatible Go version without having to restore datadir from a snapshot.

Changes

  • Added a new db command, setting the foundation for future subcommands (e.g., inspect-db, etc.).
  • Added the repair-snapshot subcommand to fix the sorting bug caused by sort.Slice. This subcommand recalculates and updates the masternode list to ensure it aligns across nodes.

Usage

tomo db repair-snapshot --datadir path/to/data

@tungng98
Copy link
Collaborator

Thanks for implementing this fix. However, I find that the description is a little bit confusing. From what I learn from the code in this PR, and the point of the fix. I think this sentence should change a bit:

This update introduces a robust solution to repair snapshots and ensure the masternode list remains consistent across all nodes and scenarios.

to

This update introduces a robust solution to repair snapshots without having to restoring datadir from a snapshot.

This fix cannot ensure the masternode list remains consistent across all nodes because it doesn't change the current synchronization flow. It only allows node administrators to have a way to fix the inconsistent snapshot using a compatible Go version. Let's say if the node is built with go1.21 and got the error. Then node admin can rebuilt the node with go1.18, run this command to fix the snapshot, then start the node again.

Running this command with a node built with go1.19+ can't fix the problem, hence ensure the masternode list remains consistent across all nodes and scenarios seem to be confusing.

@tungng98 tungng98 changed the base branch from master to pre-release December 30, 2024 03:18
@tuanha-98
Copy link
Contributor Author

Thank you for the feedback! I've updated the description to:

Introduces a robust solution to repair snapshots using a compatible Go version without having to restore datadir from a snapshot.

@tungng98 tungng98 changed the title Feat: Add 'repair-snapshot' command to fix 'Masternodes lists are different in checkpoint header & snapshot' error cmd/tomo: Add 'repair-snapshot' command to fix 'Masternodes lists are different in checkpoint header & snapshot' error Jan 7, 2025
@tungng98 tungng98 merged commit 043410b into BuildOnViction:pre-release Jan 7, 2025
tuanha-98 added a commit to tuanha-98/victionchain that referenced this pull request Jan 8, 2025
… different in checkpoint header & snapshot' error (BuildOnViction#499)

* feat: add db repair-snapshot command
* fix: make datadir flag required
* fix: limit first 150 masternode
* refactor: use state to retrieve getCandidateCap instead of making a contract call
* chore: rename confusing variable names
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants