Allow the Transport to Stop/Reset a Stream? #3291

nibanks · 2019-12-07T01:54:18Z

Per the current design for streams, the application MUST be the one to shutdown a stream, because the error code space for streams is purely application layer; and the QUIC transport has no knowledge of these error code, so it can't just pick one.

When writing a general purpose QUIC library, and integrating it into other general purpose libraries (i.e. dotnet) I constantly have to explain to non-QUIC folks that they can't just close their stream object/handle without first supplying an application layer specific error code; meaning any intermediate library needs input from the app even if some fatal error (memory allocation failure) happened along the way.

So, I've come to the point that it might be a good idea to allow for the transport layer to completely terminate a stream for it's own reason. Whether that means we add a flag to RESET_STREAM and STOP_SENDING to indicate the transport is specifying the error code, or we add a new frame entirely doesn't matter to me. It would just be a lot cleaner for general purpose libraries to be able to kill a stream when necessary, on its own (no app layer involvement).

RyanTheOptimist · 2019-12-07T17:50:47Z

I don't think I understand how allowing the transport to reset the stream solves your problem? It sounds like the problem you're describing is an API complexity issue about plumbing the error code all the way down, not that the transport will, of its own volition, decide that a stream needs to be reset. Have you found situations where the transport code knows enough to reset a stream?

nibanks · 2019-12-07T18:55:37Z

Imagine a C++/C# Stream class/object in an intermediate library between the app and the native C transport library. The transport indicates a C object to the intermediate library, then the intermediate library creates it's own object/wrapper, but something fails along the way that's not fatal to the entire connection. The app layer is not yet in the picture, but the intermediate library cannot continue with the stream. I see these possible options:

Upcall to the app (or allow the app to pre-configure) to get an appropriate error code to terminate the stream with.
Terminate the entire connection even though it's a stream specific problem (but this has the same problem, because the error code would have to be app specific, so we'd have to expose an API to allow the intermediate layer to issue a transport connection close).
Change the protocol to allow a non-app specific error code to be used to terminate the stream.

LPardue · 2019-12-07T19:33:14Z

Taking the opposite view, adding this capability would add complexity to HTTP/3 because not only do we have to document actual stream errors and their effect on the connection, but with this we'd have to describe what happens when the transport aborts streams. I don't see it adding anything useful for HTTP/3.

nibanks · 2019-12-07T21:46:55Z

I don't see how it would add any real extra complexity. HTTP/3 must already have some sort of "Internal Error" at the app layer. You could very easily map a transport failure to precisely that. QUIC has to work standalone. It can't only work well when written along side HTTP, in the same code base; and I'd really prefer the solution not to be "just kill the whole connection" if this ever happens.

LPardue · 2019-12-08T03:29:30Z

QUIC can't work standalone, every connection requires an ALPN and it's the application mapping that drives stream creation and handles stream errors.

If there was a QUIC-level stream error, we'd need to describe how it applies to H3. Closing control and QPACK streams is an error. So it seems this is applicable to only request streams or push streams when an endpoint detects them being created? What is the interop goal with being informed the peer had an internal transport error when creating a stream?

martinthomson · 2019-12-08T22:34:29Z

If you are concerned about - for example - catching an application exception and having something to use to signal stream closure in that case, let the application decide and configure a default application error code. The HTTP/3 internal error (H3_INTERNAL_ERROR) seems like a fine choice for that default. Configure it along with ALPN if you need a catch-all.

This seems entirely solvable at the interface to the transport.

kazuho · 2019-12-08T23:59:44Z

While I can see the concern (at least theoretically), I am not sure if we can come up with a good design for HTTP/3.

The concept of having a stream-level reset as a transport-level feature (which should be application-agnostic) is based on the premise that each stream operates independently. For example, the transport resetting an arbitrary stream is fine in case of HQ (i.e. HTTP/0.9 over QUIC).

However, for application protocols like HTTP/3, that does not work, because the application protocol is designed to use multiple streams for doing an exchange. For example, exchange of an HTTP request typically involves three streams (i.e. the request stream, QPACK encoder stream, QPACK decoder stream). The transport cannot reset one stream, assuming that the communication on the other streams to succeed.

Considering this, I think it's reasonable to limit the signaling scheme of transport-level error signals to connection level (i.e. CONNECTION_CLOSE; frame=0x1c), at least in QUIC v1.

RyanTheOptimist · 2019-12-09T02:36:33Z

I agree with @martinthomson that passing in the "default" stream error code along with the ALPN is a fine way to structure this API and nicely avoid the need for the transport to terminate streams. (Which @kazuho rightly observes the transport really has no understanding of)

igorlord · 2019-12-09T04:07:34Z

The problem at hand is the transport layer (or some intermediary library) unable to fulfill its responsibilities (in this case, it is managing a stream properly). The most coherent response seems to be giving up on the connection via transport's version of CONNECTION_CLOSE (frame=0x1c), assuming that the transport layer is still able to generate at least this. (This means your QUIC library should expose an ability to do so to its users, so an intermediary library can cause frame=0x1c to be sent.)

I see @nibanks proposing a more nuanced response of just giving up on a specific stream, assuming this is what it would take the transport layer (or an intermediary library) to "recover". It seems like a more complicated choice, because now every application would need to be ready for the transport to give up on any stream at any time -- something that many applications may find very hard to recover from.

martinthomson · 2019-12-09T04:19:55Z

That nuance is already possible: if the application wants to hand the transport this power, it can. It can provide a default code for the entire connection; or more complex expressions of policy can be made (e.g., even streams get code X, stream Y gets code Z, otherwise don't touch). That is a negotiation that can be managed at the interface between application and transport code. Without that information, the transport is always able to access the final solution.

As for the suggestion that an application can ask the transport to use one of its own error codes with CONNECTION_CLOSE (0x1c/transport), that's another negotiation for the API they share.

nibanks · 2019-12-09T15:05:03Z

So, it seems the WG has no appetite to change the protocol for this scenario. Implementations are stuck with one of the following options then:

Close the entire connection with a transport error code.
Require app input for the default error code, and close the stream with that error code.

(1) is much simpler, IMO, so that will likely what I'm forced to go ahead with. Thanks for the feedback.

RyanTheOptimist · 2019-12-09T16:12:42Z

Given that the app (or some proxy for the app) will be passing in the ALPN string to the transport it seems like it would be pretty easy to pass in a tuple of (default_stream_error, ALPN string). But you're right. Either one should work.

ianswett · 2019-12-09T21:32:32Z

I also tend to think doing nothing is the right thing here. If I was to do something, I'd reserve one value for INTERNAL_ERROR across all apps for consistency.

But I tend to think a connection close error is going to cause fewer edge cases and be easier to debug.

janaiyengar · 2019-12-09T21:41:17Z

@nibanks: You will additionally need an API for QUIC to communicate to the app that a stream was closed anyways, so it's not going to be as simple as it might seem. And:

something fails along the way that's not fatal to the entire connection

In my experience, if an internal error caused QUIC to terminate a stream, that's likely going to precipitate towards more such errors and a connection close is probably going to happen anyway.

geoffkizer · 2019-12-11T22:07:35Z

If I was to do something, I'd reserve one value for INTERNAL_ERROR across all apps for consistency.

I suspect most app protocols will need to define some value for INTERNAL_ERROR anyway. Why not just define 0 = INTERNAL_ERROR for all protocols?

janaiyengar · 2019-12-11T22:41:51Z

@geoffkizer : See @kazuho's comment above. Stream resets are (effectively) reserved for use by the application, since streams have application semantics associated with them. If there is a stream-level error within the transport, that results in a connection level error.

martinthomson · 2019-12-11T23:18:42Z

Also, we did reserve stream 0 in the past, but explicitly chose (for the reasons already discussed here) not to keep that reservation. Let's not second-guess that decision unless we have new information. I haven't seen any new information in this issue so far.

geoffkizer · 2019-12-12T20:32:07Z

@janaiyengar

If there is a stream-level error within the transport, that results in a connection level error.

I understand the rationale here, but I think it's going to lead to some difficult to diagnose behavior.

Here's a concrete example: Consider a garbage collected language where the QuicStream class represents a QUIC stream. The user acquires a QuicStream instance, does some sends and receives, and neglects to properly shutdown the QuicStream. The QuicStream instance will eventually be collected, which will cause the entire connection to be killed.

Note that this is a programming error -- the user failed to shutdown the stream properly -- but it's a very common programming error. Ideally, it should be straightforward to diagnose. Killing the entire connection makes this rather hard to diagnose. Terminating the stream with a well-known error code like "INTERNAL_ERROR" or "UNKNOWN" or similar is much easier to diagnose.

mikkelfj · 2019-12-12T20:38:19Z

I don't see how that example would cause the entire connection to be killed. Either the collector arrives so late that the connection is already dead, or it will close the stream at that point, otherwise the stream just stay open, unless the application protocol (not transport), at the other end decides something is wrong and takes action. Now, it could happen that the entire connection blocks on flow control due to leaking streams, but then that would be an observable event and one could start to suspect leaking streams. Notably, BLOCKED frames would be sent.

mikkelfj · 2019-12-12T20:42:16Z

Another programming error could be sending data on a stream that is already closed, and that could in fact take down the entire connection. However, in that case the local transport API ought to raise an error code towards the application rather than sending data on an invalid state, and if it does so anyway, the implementation is bad and closing the connection is the right thing to do, since the peer has nothing to work with. Still, the error can be supported by an error string explaining in more detail why the connection was closed, especially in debug deployments.

geoffkizer · 2019-12-12T20:44:22Z

otherwise the stream just stay open

Yes, the library could choose to just leave the stream open and abandoned in this case. I'd argue that this is worse behavior than killing the connection. Better to fail fast than to slowly leak to death.

mikkelfj · 2019-12-12T20:47:31Z

The library cannot close the stream without information from the application code. That would happen via explicit calls, or in the garbage collectors finalizer. Of course, an transport library can do strange things such as automatically close streams if it is running low, or have its own internal garbage collector to sweep inactive streams, but that is not a library that I go anywhere near.

RyanTheOptimist · 2019-12-12T20:48:18Z

Having a stream open and idle is not in any way an "error". Think of HTTP hanging GETs for example. If a transport killed and open stream which just hasn't sent or received any data in a while that would be a potential footgun.

(Of course if a transport REALLY REALLY wanted to kill streams it could just require the application's "internal error code" to be passed in along with the application's ALPN string.)

geoffkizer · 2019-12-12T20:57:13Z

Having a stream open and idle is not in any way an "error".

The case I'm describing here is not simply an idle stream. It's a stream that is no longer accessible by the user. Without intervention, it will simply never perform any further action and never terminate. This is user error.

(Of course if a transport REALLY REALLY wanted to kill streams it could just require the application's "internal error code" to be passed in along with the application's ALPN string.)

Yes, that's a possibility. But it seems rather strange to require passing an error code that is only needed when you write incorrect code.

mikkelfj · 2019-12-12T21:00:07Z

That is what finalizers do. You cannot know that a stream is unreachable before the garbage collector decides that that is the case.

geoffkizer · 2019-12-12T21:02:17Z

That is what finalizers do. You cannot know that a stream is unreachable before the garbage collector decides that that is the case

Yes, that's exactly what I'm describing here. The object is being collected and the finalizer is running.

mikkelfj · 2019-12-12T21:04:53Z

The finalizer is application specific code. Or, it could be library code, but then it is above the transport layer, for example a HTTP/3 library. All the transport layer sees is a call to close the stream whether it is explicit or via a fixed or customizable finalizer call.

geoffkizer · 2019-12-12T21:08:28Z

The finalizer isn't application specific code. It's just the finalizer for the QuicStream object, that represents the QUIC stream itself.

mikkelfj · 2019-12-12T21:16:49Z

If you want debug facilities in the QUIC Stream library, you add that information either when you create the stream, or you provide a default application level error code that says "leaked stream". It is still outside the transport layer.

RyanTheOptimist · 2019-12-12T21:21:52Z

(Of course if a transport REALLY REALLY wanted to kill streams it could just require the application's "internal error code" to be passed in along with the application's ALPN string.)

Yes, that's a possibility. But it seems rather strange to require passing an error code that is only needed when you write incorrect code.

It seems even stranger to me to change the protocol in a way that's only needed when you write incorrect code.

geoffkizer · 2019-12-12T21:23:28Z

It is still outside the transport layer.

The point I'm trying to make here is that the transport layer can detect that a stream is no longer usable because the object that represents it has been garbage collected.

This is a good thing, because it helps detect and prevent leaks.

The problem is that the transport layer can't really do anything except kill the connection, which makes it more difficult to diagnose what caused the problem.

geoffkizer · 2019-12-12T21:27:11Z

It seems even stranger to me to change the protocol in a way that's only needed when you write incorrect code.

Well, incorrect code happens. Bugs happen. I assume pretty much every app protocol is going to define some error code like INTERNAL_ERROR that basically means "something bad happened".

The only change I'm suggesting is that we define a universal, non-app-specific error code for INTERNAL_ERROR, because this allows that code to be used in certain circumstances where it's hard or impossible to figure out how to represent INTERNAL_ERROR in an app-specific way.

LPardue · 2019-12-12T21:37:36Z

The suggestion runs the wrong way against the protocol design. There is no universal application error code space, each application mapping defines their own. So either you send a CONNECTION_CLOSE with transport error code 0x1 or you have to send an application specific-error code in CONNECTION_CLOSE, STOP_SENDING or RESET_STREAM. One might want to recommend or mandate that each application error space has an INTERNAL error but that seems overly prescriptive to be in the core transport doc

nibanks · 2019-12-12T21:50:16Z

I also disagree that a default error code is the way to go, but the only solution that leaves is close the entire connection with a transport error code (INTERNAL_ERROR probably). As @geoffkizer says, this is is going to be a very common bug for app developers. They might forget to drain the read pipe or completely close the write pipe. Having that result in the entire connection being torn down is going to be very confusing for developers to debug/diagnose.

LPardue · 2019-12-12T21:56:39Z

The user should not be using a QUIC transport primitive to perform application behaviour. If they can't or don't want to shim it with an application mapping then this is a class of bug that will exist.

kazuho · 2019-12-13T00:23:54Z

Right.

I agree that it is beneficial to have rich set of information being given to the application for diagnosing issues. That's an API design issue.

If that information should be exposed to the peer is a different point, which has security implications too. We already have the transport-level error code, and also a field to hold arbitrary string that can be used for explaining the cause. I do not think we need more than that.

martinthomson · 2019-12-13T01:32:06Z

I will observe that the garbage-collected stream object can a) do nothing and leave the stream hanging, b) close the stream, or c) call on some help to decide what to use with RESET_STREAM or STOP_SENDING or both. I think that (a) is likely the default. The local implementation drops state and carries on. The other side might be left hanging, but that is - as much as anything else - the only clear bit of application intent you might infer from what happened.

Now, if you want to ensure that the problem is noticed and acted upon, local signals might be the next step. But if the other side needs to know, then terminating the connection will get the problem noticed.

MikeBishop · 2019-12-16T22:13:41Z

The user should not be using a QUIC transport primitive to perform application behaviour. If they can't or don't want to shim it with an application mapping then this is a class of bug that will exist.

I'm not sure what you're suggesting here. "The user" in this case is writing a client-server application that's talking over QUIC. They're using a managed code base that gives them QUIC streams, just like it would give them a TCP stream. What are you suggesting they should be doing instead?

I agree that this is kind of an awkward situation. I think the best solution probably is passing an error code to the connection constructor, at least under the assumption that you need to abruptly close the stream rather than considering it simply to have ended.

LPardue · 2019-12-16T22:26:37Z

I was attempting to say that because QUIC provides no hand holding when it comes to streams, then using streams for an application without defining any application-level mapping (especially around the handling of errors) seems like an odd design choice. Especially given that you have to negotiate some application when completing a handshake.

martinthomson · 2020-02-04T15:21:57Z

Based on the feedback here, I think that the overwhelming sense is that people don't want to do this. This has also been discussed before (see #485) and I'm not aware of anything that has changed in that time.

Proposal: close with no action.

mnot added the -transport label Dec 9, 2019

nibanks mentioned this issue Dec 11, 2019

[QUIC] Design of Shutdown for QuicStreams in System.Net.Quic dotnet/runtime#756

Closed

larseggert added the design An issue that affects the design of the protocol; resolution requires consensus. label Feb 4, 2020

martinthomson added the proposal-ready An issue which has a proposal that is believed to be ready for a consensus call. label Feb 4, 2020

LPardue removed the proposal-ready An issue which has a proposal that is believed to be ready for a consensus call. label Feb 19, 2020

LPardue added the call-issued An issue that the Chairs have issued a Consensus call for. label Feb 26, 2020

LPardue added has-consensus An issue that the Chairs have determined has consensus, by canvassing the mailing list. and removed call-issued An issue that the Chairs have issued a Consensus call for. labels Mar 4, 2020

martinthomson closed this as completed Mar 5, 2020

Allow the Transport to Stop/Reset a Stream? #3291

Allow the Transport to Stop/Reset a Stream? #3291

Comments

nibanks commented Dec 7, 2019

RyanTheOptimist commented Dec 7, 2019

nibanks commented Dec 7, 2019

LPardue commented Dec 7, 2019

nibanks commented Dec 7, 2019

LPardue commented Dec 8, 2019 • edited Loading

martinthomson commented Dec 8, 2019

kazuho commented Dec 8, 2019 • edited Loading

RyanTheOptimist commented Dec 9, 2019

igorlord commented Dec 9, 2019 • edited Loading

martinthomson commented Dec 9, 2019

nibanks commented Dec 9, 2019

RyanTheOptimist commented Dec 9, 2019

ianswett commented Dec 9, 2019

janaiyengar commented Dec 9, 2019

geoffkizer commented Dec 11, 2019

janaiyengar commented Dec 11, 2019

martinthomson commented Dec 11, 2019

geoffkizer commented Dec 12, 2019

mikkelfj commented Dec 12, 2019

mikkelfj commented Dec 12, 2019

geoffkizer commented Dec 12, 2019

mikkelfj commented Dec 12, 2019

RyanTheOptimist commented Dec 12, 2019

geoffkizer commented Dec 12, 2019

mikkelfj commented Dec 12, 2019

geoffkizer commented Dec 12, 2019

mikkelfj commented Dec 12, 2019

geoffkizer commented Dec 12, 2019

mikkelfj commented Dec 12, 2019 • edited Loading

RyanTheOptimist commented Dec 12, 2019

geoffkizer commented Dec 12, 2019

geoffkizer commented Dec 12, 2019

LPardue commented Dec 12, 2019

nibanks commented Dec 12, 2019

LPardue commented Dec 12, 2019

kazuho commented Dec 13, 2019 • edited Loading

martinthomson commented Dec 13, 2019

MikeBishop commented Dec 16, 2019

LPardue commented Dec 16, 2019

martinthomson commented Feb 4, 2020

LPardue commented Dec 8, 2019 •

edited

Loading

kazuho commented Dec 8, 2019 •

edited

Loading

igorlord commented Dec 9, 2019 •

edited

Loading

mikkelfj commented Dec 12, 2019 •

edited

Loading

kazuho commented Dec 13, 2019 •

edited

Loading