This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

Logrotate interrupts Tar process #20588

Closed
diman-io opened this issue Oct 11, 2021 · 6 comments

Comments

@diman-io
Contributor

diman-io commented Oct 11, 2021

Problem

If log rotation occurs while a snapshot archive is being created, the tar process is interrupted by the signal, and the temporary file remains in the snapshot folder until the next restart.

total 73180132
drwxr-xr-x  3 root root         120 Oct 11 12:28 ./
drwxr-xr-x  4 root root          80 Oct  7 16:47 ../
drwxr-xr-x 10 root root         200 Oct 11 12:28 snapshot/
-rw-r--r--  1 root root 37991239680 Oct 11 12:28 snapshot-100903903-7YaG6S1m21Yydc7jdysPNMSfTT1EmBzeakyg6uwXy8Wb.tar
-rw-r--r--  1 root root 36945213440 Oct 11 00:00 tmp-snapshot-100810845.tar

[2021-10-11T00:00:00.536376919Z WARN  solana_runtime::snapshot_utils] tar command failed with exit code: signal: 10

I just tested this manually by sending a signal during the snapshot creation.
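
The manual test above can be approximated outside the validator. This is only a sketch: it uses a sleeping Python child as a stand-in for tar, and it assumes the signal is SIGUSR1, which is signal number 10 on common Linux architectures such as x86-64 and matches the `signal: 10` in the log. The function name `run_and_interrupt` is made up for illustration.

```python
import signal
import subprocess
import sys


def run_and_interrupt(sig=signal.SIGUSR1):
    """Start a long-running child, send it `sig`, and return its return code."""
    # Stand-in for the tar process: a child that would otherwise sleep for 30 s.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    # The child has no handler installed, so the default action terminates it.
    child.send_signal(sig)
    # On POSIX, a child killed by signal N reports a return code of -N.
    return child.wait(timeout=10)


if __name__ == "__main__":
    print("child return code:", run_and_interrupt())
```

A negative return code means the child was killed by a signal rather than exiting on its own, which is the situation the WARN line reports.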

Proposed Solution

Maybe delete the tmp-snapshot-* files every time before starting to create a new snapshot archive?
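
A minimal sketch of that idea, in Python for illustration only (the validator itself is not written in Python, and this is not its actual code). The function name `remove_stale_tmp_snapshots` is hypothetical; the `tmp-snapshot-` prefix is taken from the directory listing above.

```python
from pathlib import Path

TMP_PREFIX = "tmp-snapshot-"  # prefix seen in the directory listing above


def remove_stale_tmp_snapshots(snapshot_dir):
    """Delete leftover tmp-snapshot-* files and return the removed names."""
    removed = []
    for entry in sorted(Path(snapshot_dir).iterdir()):
        # Only touch regular files; the snapshot/ directory and finished
        # snapshot-* archives are left alone.
        if entry.is_file() and entry.name.startswith(TMP_PREFIX):
            entry.unlink()
            removed.append(entry.name)
    return removed
```

Calling this just before a new archive is started would discard any temporary file left over from an earlier interrupted run, assuming no other archive is being written at that moment.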

@t-nelson
Contributor

What version was this observed on? We don't shell out to tar anymore, so I wouldn't expect the logrotate signal to affect snapshot archiving.

@diman-io
Contributor Author

diman-io commented Oct 11, 2021

I saw this yesterday and today on mainnet (1.7.14).

I have been observing this behavior for a very long time, and many people have written about it in the Russian Telegram chat.

Previously, the snapshots were smaller and this was not a big problem. Now that I have turned off archiving, it has become a problem, so I decided to look for the cause.

Archiving is disabled so that the validator starts faster.

@t-nelson
Contributor

Ah... looks like the changes haven't been backported

@plutocrat

A validator running 1.7.15 on mainnet-beta stopped validating a couple of days ago due to a buildup of tmp-snapshot files. There were 20-30 of them, each an 11G .zst archive. This is not related to logrotate, as they appeared at various times of day, up to once per hour.
They seem to be the correct size, but are simply abandoned after creation. Nothing removes them, so they build up until the disk is full and the validator stops working.
The logs look like this:

Oct 16 12:15:50 hostname start_solana.sh[316365]: tar: snapshots/101748706/101748706: file changed as we read it
Oct 16 12:15:51 hostname start_solana.sh[313425]: [2021-10-16T04:15:51.124171904Z WARN solana_runtime::snapshot_utils] tar command failed with exit code: exit code: 1
Oct 16 12:15:55 hostname start_solana.sh[313425]: [2021-10-16T04:15:55.220552265Z WARN solana_core::snapshot_packager_service] Failed to create snapshot archive: archive generation failure exit code: 1

My current workaround is to run "find /ledger/ -name 'tmp-snapshot-*zst' -mmin +120 -delete" in a cron job to stop the buildup.
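
For anyone who prefers a script over the `find` one-liner, here is a rough Python sketch of the same cleanup. The function name `delete_old_tmp_snapshots` and its parameters are hypothetical; the pattern and the 120-minute threshold come from the command above. The age check is close to, but not byte-for-byte identical with, what `find -mmin +120` does, because `find` rounds ages to whole minutes.

```python
import time
from pathlib import Path


def delete_old_tmp_snapshots(root, max_age_minutes=120, now=None):
    """Recursively delete tmp-snapshot-*zst files older than max_age_minutes."""
    now = time.time() if now is None else now
    cutoff = now - max_age_minutes * 60
    deleted = []
    # Materialize the matches first so deletion does not disturb the walk.
    for path in list(Path(root).rglob("tmp-snapshot-*zst")):
        # Compare the last-modified time against the cutoff.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path)
    return deleted


if __name__ == "__main__":
    # "/ledger/" is the path used in the find command above.
    for p in delete_old_tmp_snapshots("/ledger/"):
        print("deleted", p)
```

The age threshold matters: a file that is still being written has a recent modification time, so it is skipped rather than deleted mid-archive.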

@nyetwurk

v1.8.1 hopefully fixes this
https://github.com/solana-labs/solana/releases/tag/v1.8.1
