Skip to content
This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

Startup optimization in shrink - don't shrink non-shrinkable slots #17405

Merged
merged 1 commit into from
Jun 2, 2021

Conversation

sakridge
Copy link
Contributor

Problem

Startup shrink and clean operations are less efficient than they could be

Summary of Changes

  • Skip shrinking slots which do not save any space or mappings.
  • Only add keys to dirty set when there is more than 1 version of the key in the index and thus has a chance to be cleaned. Detected in generate_index

Fixes #

@sakridge
Copy link
Contributor Author

This PR:
datapoint: verify_snapshot_bank clean_us=9408005i shrink_all_slots_us=10999445i
clean > 9.4s, shrink > 10.9s
master
datapoint: verify_snapshot_bank clean_us=104287378i shrink_all_slots_us=55167563i
clean > 104s, shrink > 55s

@sakridge sakridge force-pushed the startup-opt branch 2 times, most recently from edaa72d to 409af0d Compare May 22, 2021 09:56
@codecov
Copy link

codecov bot commented May 22, 2021

Codecov Report

Merging #17405 (395ada1) into master (7ce910f) will increase coverage by 0.0%.
The diff coverage is 96.2%.

❗ Current head 395ada1 differs from pull request most recent head d223978. Consider uploading reports for the commit d223978 to get more accurate results

@@           Coverage Diff            @@
##           master   #17405    +/-   ##
========================================
  Coverage    82.6%    82.7%            
========================================
  Files         425      425            
  Lines      118840   118543   -297     
========================================
- Hits        98269    98057   -212     
+ Misses      20571    20486    -85     

@sakridge sakridge force-pushed the startup-opt branch 2 times, most recently from d47e292 to 395ada1 Compare May 24, 2021 17:15
drop(accounts_index_map_lock);
index_read_elapsed.stop();
let aligned_total: u64 = self.page_align(alive_total);

// If not saving memory maps or more than 1 page of data, then skip the shrink
if num_stores == 1 && aligned_total + PAGE_SIZE >= original_bytes {
Copy link
Contributor

@carllin carllin May 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's also a good idea to extract this into a reusable is_bytes_saved_sufficient_for_shrink() that can be added to the check here where we add to the shrink_candidates: https://github.com/solana-labs/solana/blob/master/runtime/src/accounts_db.rs#L4592-L4607.

Otherwise, we'll continually add to the shrink candidates and incur the overhead of running/scanning in the beginning of this do_shrink_slot_stores() function as we delete accounts from the same slot, even if they continually fail this new aligned_total + PAGE_SIZE >= original_bytes check

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, just out of curiosity, do we know the snapshot sizes before and after?

Theoretically if every slot had an extra PAGE_SIZE = 4096 bytes, we might get an extra 432000 slots * 496 bytes ~= 2GB raw extra size before compression, but wonder if the results reflect this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran this for a while on mainnet. snapshot sizes didn't seem to be affected:

Every 2.0s: ls -alh mainnet-beta/validator-ledger/snapshot-*

-rw-rw-r-- 1 stephen_solana_com stephen_solana_com 6.3G May  4 15:00 mainnet-beta/validator-ledger/snapshot-76776676-EcV4U9gqqdXuhqgP69WaLhWmoyDQcJYxM1ZE5YuSspdF.tar.zst
-rw-rw-r-- 1 stephen_solana_com stephen_solana_com 6.6G May 27 15:13 mainnet-beta/validator-ledger/snapshot-80249526-9NwTfazvvgmdb4Wr8VnAhC1kEWqnU9APZjfpmrktjpRS.tar.zst
-rw-rw-r-- 1 stephen_solana_com stephen_solana_com 6.6G May 27 15:18 mainnet-beta/validator-ledger/snapshot-80249917-2WYkAHaKCYFEyCtA8gBF4Qxh6iNbrcxRbXUrqUQSvPai.tar.zst

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left side is this change, right side master starting with that huge spike at startup.

bytes-written

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great 😃

if num_stores == 1 && aligned_total + PAGE_SIZE >= original_bytes {
for pubkey in unrefed_pubkeys {
if let Some(locked_entry) = self.accounts_index.get_account_read_entry(pubkey) {
locked_entry.addref();
Copy link
Contributor

@carllin carllin May 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could avoid this addref() by either:

  1. Removing the earlier unref(), and instead just iterate through the unrefed_pubkeys and call unref() after we've passed the aligned_total + PAGE_SIZE >= original_bytes check above. Downside is an extra pass through the unrefed_pubkeys keys

  2. We have the utility methods store.alive_bytes() and store.total_bytes() for tracking the alive/dead sizes. These should be accurate as long as a full clean() runs before shrink() which i think is true in verify_snapshot_bank(). Moreover, clean() needs to run before shrink() for the current is_alive check in shrink to be accurate anyways:

let is_alive = locked_entry.slot_list().iter().any(|(_slot, i)| {
                    i.store_id == stored_account.store_id
                        && i.offset == stored_account.account.offset
                });

Maybe using these we could modify and move the aligned_total + PAGE_SIZE >= original_bytes check earlier in the function before the scan, and exit there if the check fails (would also avoid the overhead of the additional scan through the storages). Something like:

if num_stores == 1 && self.page_align(store.alive_bytes() as u64) + PAGE_SIZE >= store.total_bytes() { return 0;}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right for 1, I am thinking this is the not-common case since we queued it up to shrink and we have other signals that tell us that shrinking is likely productive like alive_bytes/written_bytes. So we should optimize for the case where we are actually doing the shrink.

Clean does run before this shrink. I was thinking those may not be accurate, hence the location of this check, but I think you might be right, they might actually be somewhat accurate.

@sakridge sakridge changed the title Startup optimization in shrink and clean Startup optimization in shrink - don't shrink non-shrinkable slots May 25, 2021
jeffwashington
jeffwashington previously approved these changes May 25, 2021
Copy link
Contributor

@jeffwashington jeffwashington left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mergify mergify bot dismissed jeffwashington’s stale review May 26, 2021 17:38

Pull request has been modified.

@sakridge sakridge force-pushed the startup-opt branch 4 times, most recently from 4951a62 to afa4ac4 Compare May 27, 2021 17:18
@sakridge sakridge force-pushed the startup-opt branch 2 times, most recently from df846bb to 7a43a8d Compare May 29, 2021 14:04
@sakridge sakridge requested a review from carllin June 1, 2021 18:56
@sakridge
Copy link
Contributor Author

sakridge commented Jun 1, 2021

@carllin any other concerns on this?

Copy link
Contributor

@carllin carllin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did one final pass, looks good!

@sakridge sakridge merged commit 14c52ab into solana-labs:master Jun 2, 2021
@sakridge sakridge deleted the startup-opt branch June 2, 2021 07:51
@sakridge sakridge added the v1.7 label Jun 7, 2021
mergify bot pushed a commit that referenced this pull request Jun 7, 2021
(cherry picked from commit 14c52ab)

# Conflicts:
#	runtime/src/accounts_db.rs
jeffwashington pushed a commit that referenced this pull request Jun 16, 2021
(cherry picked from commit 14c52ab)

# Conflicts:
#	runtime/src/accounts_db.rs
mergify bot added a commit that referenced this pull request Jun 16, 2021
…ackport #17405) (#17792)

* Skip shrink when it doesn't save anything (#17405)

(cherry picked from commit 14c52ab)

# Conflicts:
#	runtime/src/accounts_db.rs

* fix merge error

Co-authored-by: sakridge <[email protected]>
Co-authored-by: Jeff Washington (jwash) <[email protected]>
@brooksprumo brooksprumo mentioned this pull request Aug 23, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants