Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wsconn: Ignore some connect errors. #3182

Merged
merged 1 commit into from
Feb 12, 2025

Conversation

JoeGruffins
Copy link
Member

All of the surrounding code suggests that we want to continue on error with initial connect unless reconnect is off. We are also already logging the error. So, do not return error for some connect failures.

closes #3180

@@ -564,7 +564,7 @@ func (conn *wsConn) Connect(ctx context.Context) (*sync.WaitGroup, error) {
close(conn.readCh) // signal to MessageSource receivers that the wsConn is dead
}()

return &conn.wg, err
return &conn.wg, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be up to the caller to decide how to handle Connect errors. Maybe a const error?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. I see now. My bad.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok? From the comments I think the intent was to keep going on error, we could make an exception inside the callers (dex.runner) but I don't think necessary, as all we would do is log the error again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think nil is right. Ancient bug.

@JoeGruffins
Copy link
Member Author

@norwnd unable to request review through ui but does this fix the issue well enough?

@norwnd
Copy link
Contributor

norwnd commented Feb 11, 2025

@norwnd unable to request review through ui but does this fix the issue well enough?

hey, couple things:

  • it doesn't fix bisonw: Initial connection error does not start retry loop #3180 because like @jholdstock mentioned we shouldn't cancel() context in case of error - the bug is in ConnectionMaster itself
  • I don't believe the change in this PR is correct in a sense that it changes the behavior described in the func comment (and I don't think a change of behavior is needed to fix bisonw: Initial connection error does not start retry loop #3180, it's not an ancient bug - it's by design that caller is notified of the error happening on the very 1st connect attempt (for one, it allows for re-using Connect in ConnectOnce in ConnectionMaster implementation, and also it lets the caller know that 1st attempt was not successful so it can notify anybody interested or log this error for example), at least ConnectionMaster works this way - wsConn probably follows similar design pattern):
// Connect connects the client. Any error encountered during the initial
// connection will be returned. An auto-(re)connect goroutine will be started,
// even on error. To terminate it, use Stop() or cancel the context.
func (conn *wsConn) Connect(ctx context.Context) (*sync.WaitGroup, error)

these parts ^ I'm pretty confident in, the reason I rewrote ConnectionMaster (and it's been running well so far for me) is because

  • there are other implementation parts I'm not so sure about (whether they even work correctly, the way it's written is a bit complex to grasp)
  • and there are some sanity-checks added on top (eg. to make sure we don't call Connect twice on the same ConnectionMaster instance)

@buck54321
Copy link
Member

buck54321 commented Feb 11, 2025

The bug is not in ConnectionMaster and we should cancel the context in the case of an error returned from Connect. That's exactly why it was a bug. Because in wsConn we handle retries internally, so ConnectionMaster needs to keep the context alive.

@buck54321
Copy link
Member

I can't dig deeper right now, but could the problem be that we're using (*ConnectionMaster).On somewhere when we should be using (*wsConn).IsDown?

@norwnd
Copy link
Contributor

norwnd commented Feb 11, 2025

it doesn't fix #3180 because like @jholdstock mentioned we shouldn't cancel() context in case of error - the bug is in ConnectionMaster itself

I must correct myself here ^ - it does fix #3180 but not in the way ConnectionMaster was designed to work with underlying Connector (and also if you want to fix that on Connector - wsConn might not be the only implementation with that same issue):

	// Attempt to start the Connector.
	ctx, cancel := context.WithCancel(ctx)
	wg, err := c.connector.Connect(ctx)
	if err != nil {
		cancel() // no context leak
		return fmt.Errorf("connect failure: %w", err)
	}
	// NOTE: A non-nil error currently does not indicate that the Connector is
	// not running, only that the initial connection attempt has failed. As long
	// as the WaitGroup is non-nil we need to wait on it. We return the error so
	// that the caller may decide to stop it or wait (see ConnectOnce).

edit 1: in other words, ConnectionMaster doesn't handle correctly (it shouldn't cancel context) the case when Connector chooses to return error on the initial connect-attempt while continuing retries in background (I believe it's up to Connector to decide whether or not it wants to return error on unsuccessful connect-attempt, and in fact the fix in this PR for wsConn actually does make sense to deduplicate errors logged - it's just not the whole story)

edit 2: additionally, consider how ConnectionMaster.ConnectOnce will work with wsConn - if wsConn doesn't return error on first unsuccessful connect attempt ... that probably means we can't use ConnectionMaster.ConnectOnce with wsConn ? How is the calling code supposed to know about that if it interfaces with ConnectionMaster (and doesn't even know about wsConn existence) ? So the change in this PR seems to be breaking ConnectionMaster.ConnectOnce behavior.

@JoeGruffins
Copy link
Member Author

What you're saying makes sense, but I think the websocket is unique in that it has the option to run a retry loop. If one wanted the "correct" functionality of the connection master, they only need run the websocket with DisableAutoReconnect as true. I think you are overthinking.

But it does solve the issue?

@buck54321
Copy link
Member

but I think the websocket is unique in that it has the option to run a retry loop.

Exactly. You can't return an error, because then the context is canceled, rendering the internal retry loop useless.

@buck54321
Copy link
Member

ConnectionMaster is working exactly as designed. The difference between (*ConnectionMaster).Connect and (*ConnectionMaster).ConnectOnce was some internal subtlety. I remember thinking at the time that ConnectOnce was a poor solution the problem it intended to solve, but I was in no position to fight the change. Looking now, I honestly can't see what it possibly fixes.

@norwnd
Copy link
Contributor

norwnd commented Feb 11, 2025

I think you are overthinking.

I think ConnectionMaster itself is a sort of over-engineering - I'm just trying to make it all consistent (implementation code and comments), it's too late to get rid of it - so might as well finish it off to work as intended

If one wanted the "correct" functionality of the connection master, they only need run the websocket with DisableAutoReconnect as true.

That's true for wsConn but if you want to wrap some other Connector that doesn't have similar thing (DisableAutoReconnect) - it won't work.

Exactly. You can't return an error, because then the context is canceled, rendering the internal retry loop useless.

I don't disagree with that, you are just approaching the problem from different angle (imo by fixing symptom and not the root cause). It's fine to do it like you suggest, but then it's worth revising ConnectionMaster to:

  • update comments in there to reflect these expectations you've mentioned in these discussions
  • remove ConnectOnce method since it's no longer needed afaict, I believe its original purpose was to automatically Disconnect on 1st connect-attempt error - if you leave that cancel() in - that's your disconnect (perhaps done needs to be updated too though)

@JoeGruffins
Copy link
Member Author

Just want to fix the issue in the simplest and most unintrusive way possible. If the ConnectionMaster needs to be reworked I think that can be a separate issue/pr. #3180 is solved with a one line change.

@buck54321
Copy link
Member

ConnectionMaster is useful for avoiding monotonous boilerplate from bubbling up with both context,Context and sync.WaitGroup. It just combines their functionality and provides friendly-named methods for their uses.

ConnectOnce should go.

wsConn shouldn't even be handling reconnects to be honest. The caller should do that. That way we don't need all the reconnect messaging either. A ConnectionMaster can be restarted, which is nice.

WsConn shouldn't be handling msgjson type either. We can write functionality in the msgjson package to handle some of that stuff. I think the idea there is that gorilla uses a json.Decoder internally that allows for decoding of streams, but maybe we can pass the stream up, somehow?

Feel free to tackle any of these problems.

@buck54321 buck54321 merged commit 78a36c0 into decred:master Feb 12, 2025
5 checks passed
martonp pushed a commit to martonp/dcrdex that referenced this pull request Feb 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bisonw: Initial connection error does not start retry loop
5 participants