prog: avoid verifier loop of death #1693

ti-mo · 2025-02-20T09:36:42Z

Supersedes #1679.

prog: factor retry decision into separate testable function

This commit special-cases LogDisabled into a clause without a retry loop since
it doesn't need a buffer allocated. It also simplifies the logic somewhat.

A follow-up commit will add another separate clause for a new LogSizeOverride
CollectionOption to factor out even more complexity related to buffer sizing.

The main purpose of this commit is to extract the decision-making logic around
retries. Instead of directly manipulating loop control flow, a function is
introduced that issues a retry verdict, and it can be tested independently of
the program load action.
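
For readers skimming the thread, here is a minimal sketch of what a retry verdict function could look like. It is not the code from this PR: `retryVerdict`, the size constants and the example errors are hypothetical names and values chosen for illustration. The point is only that the decision is a pure function of the previous attempt's outcome, so it can be tested without loading a program.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// Hypothetical limits, chosen for illustration only.
const (
	minLogSize uint32 = 64 * 1024
	maxLogSize uint32 = 1 << 30
)

// Example inputs standing in for errors returned by a load attempt.
var (
	errLogTruncated error = syscall.ENOSPC
	errRejected     error = syscall.EINVAL
)

// retryVerdict decides whether another load attempt is worthwhile, given the
// error from the previous attempt and the log buffer size it used (0 means
// the log was disabled). It performs no I/O.
func retryVerdict(err error, size uint32) (retry bool, next uint32) {
	switch {
	case err == nil:
		// The program loaded, nothing to retry.
		return false, size
	case size == 0:
		// The load failed without a log: retry once with logging enabled.
		return true, minLogSize
	case errors.Is(err, syscall.ENOSPC) && size < maxLogSize:
		// The log was truncated: grow the buffer, capped at the maximum.
		if size > maxLogSize/2 {
			return true, maxLogSize
		}
		return true, size * 2
	default:
		// Either a genuine rejection with a complete log, or the buffer
		// is already at its maximum. Give up.
		return false, size
	}
}

func main() {
	retry, next := retryVerdict(errLogTruncated, minLogSize)
	fmt.Println(retry, next)
}
```

Because the function takes plain values, a table-driven test can cover every branch without touching the kernel.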

prog: give up when max verifier log size is reached

Previously, loading a program would loop forever if the log didn't fit
in the kernel's max log size.

prog: prevent prog_load loop of death

To limit the damage of any future regressions, limit the attempts we'll take
to load a program.
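
The two safeguards described above can be sketched together as follows. This is an illustration under assumptions, not the PR's implementation: `loadWithRetries`, `loadFunc`, the helper loaders and all constants are hypothetical. The loop stops as soon as a truncated log is reported at the maximum size, and the attempt cap is a second line of defence in case a future change breaks the size growth.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// Hypothetical values for illustration.
const (
	maxAttempts         = 30
	startLogSize uint32 = 1 << 16
	maxLogSize   uint32 = 1 << 30
)

var (
	errLogTooLarge     = errors.New("verifier log doesn't fit in the maximum buffer")
	errTooManyAttempts = errors.New("too many load attempts")
)

// loadFunc stands in for a single program load attempt with the given log
// buffer size. It returns ENOSPC when the log was truncated.
type loadFunc func(logSize uint32) error

// loadWithRetries grows the buffer until the load stops reporting ENOSPC,
// the maximum size is reached, or the attempt budget is exhausted.
func loadWithRetries(load loadFunc) (attempts int, err error) {
	size := startLogSize
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		err = load(size)
		if !errors.Is(err, syscall.ENOSPC) {
			// Success, or a failure that a bigger buffer can't fix.
			return attempts, err
		}
		if size >= maxLogSize {
			// Growing further is impossible: give up instead of looping.
			return attempts, errLogTooLarge
		}
		if size > maxLogSize/2 {
			size = maxLogSize
		} else {
			size *= 2
		}
	}
	return maxAttempts, errTooManyAttempts
}

// alwaysTruncated simulates a log that never fits.
func alwaysTruncated(uint32) error { return syscall.ENOSPC }

// fitsIn simulates a log that needs at least need bytes.
func fitsIn(need uint32) loadFunc {
	return func(size uint32) error {
		if size < need {
			return syscall.ENOSPC
		}
		return nil
	}
}

func main() {
	n, err := loadWithRetries(alwaysTruncated)
	fmt.Println(n, err)
	n, err = loadWithRetries(fitsIn(1 << 20))
	fmt.Println(n, err)
}
```

With these constants the size check always fires before the attempt cap, which is the intent: the cap only matters if the growth logic regresses.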

@lmb I eventually decided against deprecating LogSizeStart because it has utility beyond just break-glass cases, for example to reduce the number of attempts the Cilium verifier tests take to obtain the full buffer, which is often quite large. We don't want to hardcode a buffer size there because the size varies significantly across programs and the environment is somewhat resource-constrained. It also ended up being more code; as of this proposal it's quite concise and testable, and I'm happy with it.

brycekahle · 2025-02-20T22:22:49Z

prog.go

+
+		// ENOSPC means we've loaded the program before and the log buffer was too
+		// small.
+		if attr.LogSize == maxVerifierLogSize {


I'm not sure you will ever reach this condition, because I don't think the kernel will return ENOSPC when using the max verifier log size. That said, it doesn't hurt to be defensive here.

brycekahle

I tested and did not receive the same infinite loop we were seeing before.

lmb · 2025-02-24T09:24:38Z

I eventually decided against deprecating LogSizeStart because it has utility beyond just break-glass cases, for example to reduce the number of attempts the Cilium verifier tests take to obtain the full buffer, which is often quite large.

That tells me that the current approach just doesn't work very well and we should try to do better. I have a firm belief that making the log size configurable at all was a mistake from the get-go.

So: how large is "often quite large"?

lmb · 2025-02-24T09:26:49Z

prog.go

-			break
-		}
+		attempts := 1
+		for {


Nit:

Suggested change

-		for {
+		for attempts := 0; attempts < maxVerifierAttempts; attempts++ {

Putting the increment into the for loop has the benefit that it does the right thing when continuing, etc.
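
A small sketch of the point about `continue`: in Go, the post statement of a `for` clause runs after every iteration, including those that end in `continue`, so the counter cannot be skipped by accident. `boundedAttempts` and the value of `maxVerifierAttempts` below are hypothetical and only serve the illustration.

```go
package main

import "fmt"

// Hypothetical cap, for illustration only.
const maxVerifierAttempts = 5

// boundedAttempts simulates a load that succeeds on the succeedOn-th try.
// Every failed try ends in continue, yet attempts is still incremented
// because the increment lives in the for clause.
func boundedAttempts(succeedOn int) (used int, ok bool) {
	for attempts := 0; attempts < maxVerifierAttempts; attempts++ {
		used = attempts + 1
		if used < succeedOn {
			continue // the post statement attempts++ still runs
		}
		return used, true
	}
	return used, false
}

func main() {
	fmt.Println(boundedAttempts(3))
	fmt.Println(boundedAttempts(100))
}
```

With a manual `attempts++` at the bottom of a bare `for {}`, any `continue` added above it later would silently bypass the increment and reintroduce an unbounded loop.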

lmb · 2025-02-24T09:29:50Z

prog.go

+
+		// Ensure the size doesn't overflow.
+		const factor = 2
+		if attr.LogSize >= maxVerifierLogSize/factor {


I find this whole factor thing not very intuitive. This is to avoid overflow?
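
To make the question concrete, here is a sketch comparing the two ways of writing the guard. The cap value is an assumption made for this example, and `grow` and `naiveGrow` are hypothetical names. Dividing the limit never wraps, whereas multiplying the size first can wrap around a `uint32` for large inputs. Note that with a cap of this magnitude, doubling a size below the cap cannot wrap, so in that setting the guard mainly keeps the result from exceeding the cap.

```go
package main

import (
	"fmt"
	"math"
)

// Assumed cap for this example.
const maxLogSize uint32 = math.MaxUint32 >> 2

const factor = 2

// grow compares against the divided limit, so no intermediate value can
// exceed the range of uint32.
func grow(size uint32) uint32 {
	if size >= maxLogSize/factor {
		return maxLogSize
	}
	return size * factor
}

// naiveGrow multiplies first. For sizes of 1<<31 and above the product
// wraps around, and the comparison is made against the wrapped value.
func naiveGrow(size uint32) uint32 {
	if size*factor > maxLogSize {
		return maxLogSize
	}
	return size * factor
}

func main() {
	fmt.Println(grow(64*1024), naiveGrow(64*1024))
	fmt.Println(grow(1<<31), naiveGrow(1<<31))
}
```

Both functions agree for small sizes; they only diverge once the multiplication wraps.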

lmb · 2025-02-24T09:31:19Z

prog.go

+	// ENOSPC means the log was enabled on the previous iteration, so we only
+	// need to grow the buffer.
+	if errors.Is(err, unix.ENOSPC) {
+		if attr.LogTrueSize != 0 {


This doesn't preserve the behaviour that we don't retry if LogSize >= LogTrueSize. I'm also not sure why this only happens on ENOSPC? LogTrueSize should always be populated?
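
One way to read this concern, sketched under assumptions: if the kernel reports the size the log actually needs, that hint can drive the decision regardless of which error came back, and no retry is needed once the buffer is already big enough. `nextLogSize` and the cap are hypothetical, and whether the hint is populated on every failed load is exactly the open question above, so the sketch falls back to doubling when the hint is zero.

```go
package main

import "fmt"

// Hypothetical cap for illustration.
const maxLogSize uint32 = 1 << 30

// nextLogSize picks the buffer size for the next attempt. trueSize is the
// size hint reported by the previous attempt, or 0 if none was reported.
func nextLogSize(size, trueSize uint32) (next uint32, retry bool) {
	switch {
	case trueSize == 0:
		// No hint available: fall back to doubling, capped at the maximum.
		if size >= maxLogSize {
			return size, false
		}
		if size > maxLogSize/2 {
			return maxLogSize, true
		}
		return size * 2, true
	case size >= trueSize:
		// The buffer already held the whole log: retrying gains nothing.
		return size, false
	case trueSize > maxLogSize:
		// The log can never fit: give up.
		return size, false
	default:
		// Jump straight to the reported size.
		return trueSize, true
	}
}

func main() {
	fmt.Println(nextLogSize(65536, 100000))
	fmt.Println(nextLogSize(131072, 100000))
}
```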

lmb · 2025-02-24T09:36:16Z

prog.go

-
-		if opts.LogDisabled {
-			break
+			return &Program{"", fd, spec.Name, "", spec.Type}, nil
 		}


You can return here; all of the other code only deals with the log being present. Then the else can go and we can unindent by one level.
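
A sketch of the shape being suggested, with hypothetical names (`load`, `loader`, `errRejected`) and a simplified flow that is assumed for illustration rather than taken from the PR: the log-disabled case returns immediately, so the remaining code needs no `else` branch and sits one indentation level shallower.

```go
package main

import (
	"errors"
	"fmt"
)

var errRejected = errors.New("program rejected")

// loader stands in for one load attempt; withLog says whether a log buffer
// is passed along.
type loader func(withLog bool) error

// load reports whether a log-enabled attempt was made, plus the final error.
func load(logDisabled bool, try loader) (bool, error) {
	if logDisabled {
		// Nothing below applies without a log, so return right away.
		return false, try(false)
	}

	// No else needed: everything from here on assumes the log is allowed.
	if err := try(false); err == nil {
		return false, nil
	}
	return true, try(true)
}

func main() {
	failing := func(bool) error { return errRejected }
	fmt.Println(load(true, failing))
	fmt.Println(load(false, failing))
}
```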

ti-mo added 3 commits February 19, 2025 16:30

prog: give up when max verifier log size is reached

9e5357d

Previously, loading a program would loop forever if the log didn't fit in the kernel's max log size. Signed-off-by: Timo Beckers <[email protected]>

prog: prevent prog_load loop of death

fbe9df5

To limit the damage of any future regressions, limit the attempts we'll take to load a program. Signed-off-by: Timo Beckers <[email protected]>

ti-mo requested a review from a team as a code owner February 20, 2025 09:36

ti-mo requested review from lmb and brycekahle February 20, 2025 09:37

ti-mo mentioned this pull request Feb 20, 2025

allow verifier log size to reach max to prevent infinite loop #1679

Closed

brycekahle reviewed Feb 20, 2025

brycekahle approved these changes Feb 20, 2025

lmb requested changes Feb 24, 2025
