[DPE-5581] Timeline Management #716

Merged
merged 17 commits into from
Oct 15, 2024

Conversation

Zvirovyi
Contributor

@Zvirovyi Zvirovyi commented Sep 30, 2024

Add the timeline management feature. Fixes #612 and fixes #715. There is an equivalent PR for the VM charm - this one is basically a port of the same code, so I left the description unchanged; let me know if it can be improved.
I've temporarily disabled the failing unit tests and will fix them after this PR is reviewed.

Overview

The main restriction of the previous PITR work was the lack of WAL timeline management. Because of this, the "move restored cluster to another s3 bucket" blocking message was introduced to prevent the user from having multiple timelines in a single s3 stanza. This PR continues that work: it adds the ability to keep multiple WAL timelines in a single s3 stanza simultaneously and to interact with them through the UI.

Changes

  1. list-backups action:
    1.1. the type column is renamed to action; for ordinary backups, the backup suffix is added (example: full -> full backup)
    1.2. restore events are shown alongside ordinary backups to indicate the timeline switch after each successful restore; the user can pass a restore event's id to run a PITR restore within that specific timeline
    1.3. a timeline column is added to indicate the timeline of each backup / restore
    An example can be found below in the section "Use-case example".
  2. the "move restored cluster to another s3 bucket" message is removed entirely
  3. the check that compared the last WAL archived by the cluster with the one in the s3 stanza is removed entirely
  4. backup-id is mandatory for the restore action again, since the user can now select a specific timeline to run PITR within. UPD: the user can pass backup-id to choose a specific timeline to restore from, but when only the restore-to-time parameter is given, the charm automatically deduces the required timeline, as specified in [DPE-5581] Timeline Management #716 (comment).
  5. the Restore succeeded message is improved to include info about:
    5.1. the real backup used for the restore
    5.2. the timeline (either of the real backup or of the restore event) chosen for the restore
    5.3. the current timeline after the restore
    Example: unit-postgresql-0: 01:17:27 INFO unit.postgresql/0.juju-log Restored to latest from timeline 2. Currently tracking the newly created timeline 3.
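
The UPD note in item 4 says the charm can deduce the timeline from restore-to-time alone. The linked comment is not reproduced here, so the sketch below is only one plausible reading and not the charm's actual code: it assumes the right timeline is the one of the latest backup or restore event that happened at or before the requested time. The function name deduce_timeline and the (timestamp, timeline) event shape are hypothetical; the two sample events reuse the ids from the use-case example below.

```python
from datetime import datetime, timezone


def deduce_timeline(events, restore_to_time):
    """Pick a timeline for a PITR target (hypothetical helper, not charm code).

    events is a list of (timestamp, timeline) pairs, one per backup or
    restore event. Assumption: the timeline to use is the one of the latest
    event that happened at or before restore_to_time.
    """
    candidates = [(ts, tl) for ts, tl in events if ts <= restore_to_time]
    if not candidates:
        raise ValueError("no backup or restore event precedes the requested time")
    # Tuples compare by timestamp first, so max() yields the latest event.
    return max(candidates)[1]


# Sample events built from the ids used in the use-case example:
# a full backup on timeline 1 and the restore event that opened timeline 2.
EXAMPLE_EVENTS = [
    (datetime(2024, 9, 26, 22, 8, 16, tzinfo=timezone.utc), 1),
    (datetime(2024, 9, 26, 22, 11, 54, tzinfo=timezone.utc), 2),
]
```

Under this assumption, a target time between the two sample events resolves to timeline 1, and a target time after the restore event resolves to timeline 2.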

Use-case example

PostgreSQL and s3-integrator are assumed to be already configured and integrated. The timeline id of a newly created cluster is always 1.
For this example, we create a test table right after stanza initialization:

create table asd(message text);

Then a full backup must be created - it is the point of origin for the WAL files and therefore for PITR (maybe it would be a good idea to create a full backup automatically right after s3 stanza creation?):

juju run postgresql-k8s/leader create-backup

Then, let's write some test data and switch the WAL file (this ensures the WAL file is archived immediately and must be done every time you want to preserve the latest changes in the s3 stanza):

insert into asd values ('hello');
select pg_switch_wal();

Then, let's restore to the full backup we created previously:

juju run postgresql-k8s/leader restore backup-id=2024-09-26T22:08:16Z

After the cluster is successfully restored, it is on timeline 2 and the table should contain no data. Let's create some test data specific to timeline 2:

insert into asd values ('world');
select pg_switch_wal();

Then repeat the restore to the full backup (taken on timeline 1), but now with the restore-to-time=latest parameter:

juju run postgresql-k8s/leader restore backup-id=2024-09-26T22:08:16Z restore-to-time=latest

Now, in the newly created timeline 3, we only see the hello test data, because we replayed the whole lifespan of timeline 1 and did not use timeline 2. Let's then restore to timeline 2 using the backup-id of its restore event, as listed by the list-backups action:
[Screenshot: list-backups output]

juju run postgresql-k8s/leader restore backup-id=2024-09-26T22:11:54Z restore-to-time=latest

After the successful restore, the cluster is on timeline 4, and the test table contains only world from timeline 2, not hello from the end of timeline 1.
Here is the basic algorithm for this example, one line per timeline:

1: table -> backup -> insert_hello -> restore_backup
2: insert_world -> restore_backup_latest
3: restore_timeline_2_latest
4: (*-*)
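
The branching above can be replayed with a small toy model. This is only an illustrative sketch: it does not talk to PostgreSQL, pgBackRest or Juju, and the ToyCluster class, its methods and the event ids (backup-1, restore-1, ...) are all hypothetical names made up for the illustration. Rows stand in for archived WAL changes, and every restore opens a new timeline, as in the example.

```python
class ToyCluster:
    """Toy model of WAL timelines; rows stand in for archived WAL changes."""

    def __init__(self):
        self.current = 1                 # a new cluster starts on timeline 1
        self.rows_by_timeline = {1: []}  # timeline id -> rows at the end of it
        self.events = {}                 # event id -> (timeline, rows at that moment)

    def rows(self):
        return list(self.rows_by_timeline[self.current])

    def insert(self, row):
        self.rows_by_timeline[self.current].append(row)

    def create_backup(self, event_id):
        self.events[event_id] = (self.current, self.rows())

    def restore(self, event_id, new_event_id, latest=False):
        """Restore from a backup or restore event and switch to a new timeline."""
        timeline, snapshot = self.events[event_id]
        # latest=True replays the whole lifespan of the event's timeline;
        # otherwise only the state captured by the event itself is used.
        source = self.rows_by_timeline[timeline] if latest else snapshot
        self.current = max(self.rows_by_timeline) + 1
        self.rows_by_timeline[self.current] = list(source)
        # The restore itself is recorded as an event on the new timeline.
        self.events[new_event_id] = (self.current, list(source))
        return self.current


def run_example():
    """Replay the use-case example; return (timeline, rows) after each restore."""
    cluster = ToyCluster()
    cluster.create_backup("backup-1")          # table exists, no rows yet
    cluster.insert("hello")                    # written on timeline 1
    steps = []
    cluster.restore("backup-1", "restore-1")   # timeline 2, empty table
    steps.append((cluster.current, cluster.rows()))
    cluster.insert("world")                    # written on timeline 2
    cluster.restore("backup-1", "restore-2", latest=True)   # timeline 3
    steps.append((cluster.current, cluster.rows()))
    cluster.restore("restore-1", "restore-3", latest=True)  # timeline 4
    steps.append((cluster.current, cluster.rows()))
    return steps
```

Calling run_example() walks through the same three restores: timeline 2 starts empty, timeline 3 carries only hello, and timeline 4 carries only world.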

PITR test algorithm

[Diagram: PITR test algorithm]


codecov bot commented Sep 30, 2024

Codecov Report

Attention: Patch coverage is 84.28571% with 11 lines in your changes missing coverage. Please review.

Project coverage is 71.65%. Comparing base (857fe48) to head (83ecb84).
Report is 4 commits behind head on main.

Files with missing lines | Patch % | Lines
src/backups.py | 85.24% | 6 Missing and 3 partials ⚠️
src/charm.py | 77.77% | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #716      +/-   ##
==========================================
+ Coverage   70.75%   71.65%   +0.90%     
==========================================
  Files          11       11              
  Lines        2999     3031      +32     
  Branches      464      464              
==========================================
+ Hits         2122     2172      +50     
+ Misses        767      753      -14     
+ Partials      110      106       -4     


Member

@marceloneppel marceloneppel left a comment


Thanks for the amazing work, @Zvirovyi! I left some comments related to questions and suggestions.

Member

@marceloneppel marceloneppel left a comment


LGTM! Thanks for the great work @Zvirovyi! What needs to be added are the unit tests.

I'm currently reviewing the VM charm PR and running the integration tests locally to double-check that they're still working.

I asked the team to review this PR.

Contributor

@lucasgameiroborges lucasgameiroborges left a comment


LGTM! Nice work, and appreciate the detailed description!

Contributor

@taurus-forever taurus-forever left a comment


Thank you @Zvirovyi !!!

@marceloneppel please consider merging this and updating the documentation as mentioned here.

@marceloneppel marceloneppel merged commit 30af27d into canonical:main Oct 15, 2024
97 checks passed
@Zvirovyi Zvirovyi deleted the timelines branch January 26, 2025 23:57
5 participants