bidirectional grpc stream issues on transport is closing #4578

Closed
nkostoulas opened this issue Jul 2, 2021 · 10 comments

@nkostoulas

nkostoulas commented Jul 2, 2021

I have a bidirectional gRPC stream set up, with the server RPC handler looking something like the following:

func (s *SomeServer) Rpc(stream pb.Some_RpcServer) error {
    for {
        data, err := stream.Recv()
        if err != nil {
            if err == io.EOF {
                goto ENDSTREAM
            }
            return err
        }

        // db operations

        stream.Send(data) // note: the error returned by Send is not checked here
    }

ENDSTREAM:
    // db operation using a background context, since the server context is cancelled
    return nil
}

and a client that:

  • Has a separate goroutine that listens on Recv()
  • Uses Send() and waits for the response
  • Sends a final message with Send() and then calls CloseSend() when it's time to terminate the stream

Whenever the client receives an error back, it closes the current connection, lets the stream be garbage collected, and then creates a new client connection and RPC stream. With the current setup, when the client receives a connection-dropped error (code = Unavailable desc = transport is closing), the stream is closed but the server RPC handler never finishes (or takes a very long time). We observed this because a lock taken in the server handler is never released, but the details are not really relevant.

I'd like to understand how these streams actually terminate, as the docs are quite confusing on this. Possible interpretations of this issue are:

  • We didn't handle the error from stream.Send() on the server, and it then hung on Recv(). But shouldn't Recv() return an EOF, since the client has garbage collected the stream anyway?
  • We didn't properly close the stream on the client: given that connectivity is lost, CloseSend() never really succeeds. Also, even though there is only one stream at a time, this stream uses a parent context that is never actually cancelled. Again though, shouldn't garbage collection handle all this?
@dfawley
Member

dfawley commented Jul 2, 2021

It would be useful to see your client code as well, since it sounds like what you're doing is somewhat unusual.

  • Generally, you should not close a client just because an RPC fails, even if it's because the connection is lost. The client should reconnect automatically if the connection dies.
  • You should always use a new context for each stream and cancel it when the stream is done. You should generally have deadlines on all RPCs, as well (though this is often not done for long-lived streams like pub/sub).
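
A minimal sketch of these two points taken together (streamLoop is a hypothetical helper, and the pb.NewSomeClient / RpcMethod names simply mirror the snippets in this thread): the ClientConn is created once and reused across failures, while each stream gets its own context that is always cancelled when that stream is done:

func streamLoop(ctx context.Context, conn *grpc.ClientConn) {
    client := pb.NewSomeClient(conn) // the ClientConn reconnects on its own if the transport drops
    for ctx.Err() == nil {
        // one fresh context per stream, always cancelled when the stream is done
        streamCtx, cancel := context.WithCancel(ctx)
        stream, err := client.RpcMethod(streamCtx)
        if err != nil {
            cancel()
            time.Sleep(time.Second) // back off, then retry on the same connection
            continue
        }

        // ... Send/Recv on the stream until it fails or the work is done ...

        _ = stream.CloseSend()
        cancel()
    }
}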

We didn't handle the error from stream.Send() on the server and then it was hanging on Recv() but shouldn't Recv() return an EOF since the client has garbage collected the stream anyway?

This should be fine; Recv should also error if Send errors.
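
In other words (a sketch, not the reporter's code), a handler loop shaped like the one in the original post still unblocks even though the Send error is dropped, because the next Recv reports the broken stream:

for {
    data, err := stream.Recv()
    if err != nil {
        if err == io.EOF {
            return nil // client called CloseSend
        }
        return err // also fires after a previous Send has failed or the transport drops
    }
    _ = stream.Send(data) // error ignored, as in the snippet above; the next Recv reports it
}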

We didn't properly close the stream on the client as given that connectivity is lost then CloseSend() never really succeeds.

If the connection is actually lost, then on the server side the RPC's context will be canceled and Recv/Send should unblock.

I can't tell from what is provided why your servers would be getting stuck on RPC operations. It might be helpful to create a minimal reproducible example for both client & server to help narrow it down.

@nkostoulas
Author

nkostoulas commented Jul 2, 2021

This is a simplified version. I don't have access to the full code at the moment; I can revisit on Monday.

type Writer struct {
    ctx     context.Context
    stream  pb.Some_RpcClient
    errChan chan error
}

func NewWriter(ctx context.Context, client pb.SomeClient) (*Writer, error) {
    stream, err := client.RpcMethod(ctx)
    if err != nil {
        return nil, err
    }
    writer := &Writer{ctx: ctx, stream: stream, errChan: make(chan error)}
    go writer.listenToResponses()
    return writer, nil
}

func (w *Writer) listenToResponses() {
    for {
        data, err := w.stream.Recv()
        if err != nil {
            if err == io.EOF {
                break
            }
            w.errChan <- err
            continue
        }
        log.Println(data)
    }
}

func (w *Writer) getResponse() error {
    select {
    case <-w.ctx.Done():
        return w.ctx.Err()
    case newErr := <-w.errChan:
        return newErr
    case <-time.After(1 * time.Minute):
        return errors.New("timeout")
    }
}

func (w *Writer) Write() error {
    err := w.stream.Send(&pb.RpcMethodRequest{})
    if err != nil {
        return err
    }
    return w.getResponse()
}

func (w *Writer) Close() error {
    // send a final message
    err := w.stream.Send(&pb.RpcMethodRequest{})
    if err != nil {
        return err
    }
    if err = w.getResponse(); err != nil {
        return err
    }
    return w.stream.CloseSend()
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 1*time.Hour)
    defer cancel()
    var client pb.SomeClient
    var stream pb.Some_RpcClient

    for {
        select {
        case <-ctx.Done():
            return
        default:
        }

        client = pb.NewSomeClient(conn) // conn: a freshly dialled connection (dialling omitted in this sketch)
        stream, _ = client.RpcMethod(ctx) // note: reuses the parent ctx rather than a per-stream context

        for {
            err := stream.Send(&pb.RpcMethodRequest{})
            if err != nil {
                stream.CloseSend()
                break
            }
        }
    }
}

@nkostoulas
Author

nkostoulas commented Jul 2, 2021

You should always use a new context for each stream and cancel it when the stream is done

I've read about this a lot and I assume it's to avoid memory leaks, but could it cause the side effects I'm describing?

If the connection is actually lost, then on the server side the RPC's context will be canceled and Recv/Send should unblock.

Could it be that we are not checking the context on the server side (we actually aren't), or would Send/Recv error automatically anyway?

In any case, from what you're saying it sounds like, whatever the client does, if the connection is dropped (with the error I've described) the server should gracefully finish executing the service handler without hanging for any reason. Is there any way the server could hang because of something the client does?

@dfawley
Member

dfawley commented Jul 2, 2021

I've read about this a lot and I assume it's for memory leaks but can it cause the side effects i'm mentioning though?

Yes, and it's also just about following best practices in general. It's possible to safely not do this, but I wouldn't recommend it. No, it shouldn't matter here, but if you have a bug in your client, this can help ensure the RPC is properly closed.

Could it be that we are not checking the context on the server side (we actually aren't) then or would that cause Send/Recv to error automatically anyway?

Send and Recv should unblock themselves. You don't need to check the RPC's context on the server side unless you are doing another blocking operation besides something on the stream.

the server should gracefully finish the service handler execution without hanging for any reason.

In your case, yes, because you say you are always blocked on only a Send or Recv.

Is there any possible way the server hangs due to something the client does?

If the client just sits there and doesn't close the stream or connection, then yes. Otherwise no.

@nkostoulas
Author

nkostoulas commented Jul 2, 2021

There are a couple of DB operations between Send/Recv, and another DB operation at ENDSTREAM in the server handler.
The DB operation at ENDSTREAM uses a background context, since the server context should have been cancelled by then.

@dfawley
Member

dfawley commented Jul 2, 2021

The DB operation on the ENDSTREAM uses a background context, since the server context should have been cancelled.

This sounds wrong to me. Anything blocking on the server needs to block on the RPC context or a context derived from it.
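
A minimal sketch of that, assuming the handler shape from the original post (doDBWork is a hypothetical placeholder): the in-stream DB work blocks on the RPC's own context, obtained from stream.Context(), so it unwinds as soon as the client goes away:

func (s *SomeServer) Rpc(stream pb.Some_RpcServer) error {
    ctx := stream.Context() // cancelled when the client disconnects or cancels
    for {
        data, err := stream.Recv()
        if err != nil {
            if err == io.EOF {
                return nil
            }
            return err
        }
        if err := doDBWork(ctx, data); err != nil { // blocks on the RPC context, not context.Background()
            return err
        }
        if err := stream.Send(data); err != nil {
            return err
        }
    }
}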

@nkostoulas
Author

nkostoulas commented Jul 3, 2021

This sounds wrong to me. Anything blocking on the server needs to block on the RPC context or a context derived from it.

What if you need to do an operation after the stream is closed / the RPC is cancelled, but before returning from the service handler?

Hmm, could it be that the stream is actually never cleaned up completely on the client side because of
go w.listenToResponses(), since we never pass a termination signal to it? 🤔 Actually, if connectivity is lost this should be cleaned up too...
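
For reference, one way to guarantee that goroutine always exits (a sketch, not the actual code): return on any Recv error rather than looping, and hand the error over without blocking in case nothing is waiting in getResponse:

func (w *Writer) listenToResponses() {
    for {
        data, err := w.stream.Recv()
        if err != nil {
            if err != io.EOF {
                select {
                case w.errChan <- err: // deliver to a waiting getResponse, if any
                default: // nobody is waiting; don't block, just exit
                }
            }
            return // once Recv fails the stream is dead, so the goroutine stops here
        }
        log.Println(data)
    }
}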

@github-actions

github-actions bot commented Jul 9, 2021

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@dfawley
Member

dfawley commented Jul 14, 2021

What if you need to do an operation though after the stream is closed / RPC cancelled but before returning from the service handler?

The act of returning from the service handler is the only thing that ends the RPC. If this is a goroutine spawned from the handler function, then it does make sense to use the background context for those types of operations. (Actually it would probably be best to use a context derived from the handler's, but not canceled - like what is proposed here: golang/go#40221 - but no such thing exists right now.)
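
A sketch of the shape being discussed, with placeholder names (finalDBWrite is hypothetical): the cleanup runs before the handler returns, on its own bounded context derived from context.Background(), because the RPC context may already be cancelled at that point:

func (s *SomeServer) Rpc(stream pb.Some_RpcServer) error {
    for {
        if _, err := stream.Recv(); err != nil {
            break // client closed the stream, or the connection dropped
        }
        // ... per-message work ...
    }

    // The RPC context may already be cancelled here, so give the final write
    // its own bounded context instead of letting it be cancelled immediately or hang forever.
    cleanupCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    return s.finalDBWrite(cleanupCtx) // hypothetical cleanup helper
}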

@github-actions

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

github-actions bot added the stale label Jul 20, 2021
github-actions bot locked as resolved and limited conversation to collaborators Jan 24, 2022