Clarify why we want TLSv1.3 #37

Closed
cpaasch opened this issue Mar 23, 2022 · 12 comments

Comments

@cpaasch
Contributor

cpaasch commented Mar 23, 2022

Feedback from Bjorn Mork (https://lists.bufferbloat.net/pipermail/rpm/2022-March/000165.html)

We should explain why TLSv1.3 is required.

@richb-hanover
Contributor

richb-hanover commented Mar 23, 2022

It would also be good to describe whether the certificates for TLSv1.3 could be self-signed, or must be signed by a CA. I know this was discussed a while back (on the RPM list?) but don't remember the outcome.

If the spec permits an RPM client to accept an unsigned certificate (perhaps noting the server certificate is unsigned), implementing an RPM server becomes a lot easier if it's not necessary to figure out all the Let's Encrypt machinery for my 192.168.1.1 router at home. Thanks.

@hawkinsw
Member

> Feedback from Bjorn Mork (https://lists.bufferbloat.net/pipermail/rpm/2022-March/000165.html)
>
> We should explain why TLSv1.3 is required.

After we talked about this, I 100% agree that a description of why we want TLS 1.3 would be great to have in the draft.

@cpaasch
Contributor Author

cpaasch commented Mar 23, 2022

> It would also be good to describe whether the certificates for TLSv1.3 could be self-signed, or must be signed by a CA. I know this was discussed a while back (on the RPM list?) but don't remember the outcome.
>
> If the spec permits an RPM client to accept an unsigned certificate (perhaps noting the server certificate is unsigned), implementing an RPM server becomes a lot easier if it's not necessary to figure out all the Let's Encrypt machinery for my 192.168.1.1 router at home. Thanks.

This should be left to the implementation's choice. It is not part of the wire-format or the methodology. Ultimately, there is no difference between a self-signed or a CA-signed cert. The only difference is whether the client decides to trust the root or not.
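As an aside, here is a minimal sketch (file name and address are made up) of what "the client decides to trust the root" can look like in practice: a Python client that explicitly trusts the server's self-signed certificate instead of the system CA bundle.

```python
import socket
import ssl

# Hypothetical example: trust the RPM server's self-signed certificate
# directly, rather than relying on a CA chain. The certificate file name
# and the 192.168.1.1 address are placeholders.
ctx = ssl.create_default_context(cafile="rpm-server-selfsigned.pem")

with socket.create_connection(("192.168.1.1", 443)) as sock:
    with ctx.wrap_socket(sock, server_hostname="192.168.1.1") as tls:
        print("negotiated", tls.version())  # e.g. "TLSv1.3"
```

Whether a CA-signed or a self-signed root is loaded, the handshake itself is the same; the only difference is which root the client chose to trust.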

@LPardue
Contributor

LPardue commented Apr 19, 2022

As far as justification goes, at the time of writing this spec it seems fine to rationalize the requirement by noting that the majority of Web traffic uses modern TLS - just like the use of HTTP/2 is rationalized. The spec shouldn't limit itself to TLS 1.3, to avoid precluding future versions, but it doesn't need to worry about older versions or tie itself in knots coming up with a rationale.

For instance, according to figures on radar.cloudflare.com today, globally over the last 7 days 67% of traffic used TLS 1.3 and 25% used QUIC's handshake (which is based on TLS 1.3). The remainder used TLS 1.2.

@cpaasch
Contributor Author

cpaasch commented Apr 19, 2022

Good point @LPardue - we should say that TLS 1.3 and later should be used.

The reason we ended up fixing on TLS 1.3 is that it makes it easy to count the round-trips: we know the handshake is always 1 round-trip. With TLS 1.2, it may be 1 or 2 round-trips.

@cpaasch
Contributor Author

cpaasch commented Jun 24, 2022

As I have been iterating over this, I am trending towards actually not caring about TLS handshake latency. Here is my reasoning:

Nowadays, we try to reuse connections as much as possible, so TLS handshake latency is not relevant there. Also, we are focusing on measuring responsiveness on the load-generating connections, where TLS handshake latency is again irrelevant.

Also, by reusing connections for the latency probes it is possible to send the probes continuously from the start of the load generation. I get many more data points that way and thus a more stable result.

So, I am leaning towards removing the notion of TLS probes from the draft. Any strong opinions?

@richb-hanover
Contributor

richb-hanover commented Jun 25, 2022

A bunch of semi-related thoughts come to mind:

  1. I don't know whether measuring TLS handshake latency is important

  2. To the extent that the algorithm creates no/relatively few new TLS connections, their latency seems unimportant. To the extent that multiple connections simulate real-world use, it could matter.

  3. I seem to remember that TLS handshakes require two round-trips, which would make them more sensitive to latency than other metrics.

  4. Measurement is hard. We're aiming for an algorithm that's sensitive (that is, it correctly detects the presence or absence of latency) and specific (the most likely explanation for the result is overbuffering in the path to the server). To the extent that measuring TLS latency aids either of those goals, we should think about keeping it.

  5. Laying out a stable spec is hard, too. Changing the spec might invalidate earlier results. It's not much fun to be reviewing results, and have to ask whether both the client and server ends are newer than commit #1234567.

  6. But it may not matter much in practice. Current RPM measurements are more descriptive than numerically accurate (low, medium, high vs a number with two decimal places). I regularly see a factor of 2-3 between the worst and best RPM values.

  7. We shouldn't be afraid to change the spec if we can devise a better algorithm. There are millions and millions of people who'll use the RPM Test in the future, and we should get it right as early as possible. (I am reminded of the likely apocryphal 'roff' story, where they created this terrific text formatting package in the early days of Unix. After using it for a while, they came up with a new (better) command syntax but decided not to implement it because they "already had a dozen users...")

  8. I don't know (See Item 1 above)

@cpaasch
Contributor Author

cpaasch commented Jun 27, 2022

> I don't know whether measuring TLS handshake latency is important

Depends on the goal, I guess :) I think for "responsiveness under working conditions" it is less important.

> To the extent that the algorithm creates no/relatively few new TLS connections, their latency seems unimportant. To the extent that multiple connections simulate real-world use, it could matter.

The question becomes whether we expect latencies to be different for TLS vs H2 req/resp. And that depends entirely on the network (e.g., a transparent TCP proxy inspecting the TLS client-hello's SNI could end up delaying TLS quite a bit).

> I seem to remember that TLS handshakes require two round-trips, which would make them more sensitive to latency than other metrics.

It depends on the TLS version, which is why we mandate(d) 1.3 (see my comment above).

> Measurement is hard. We're aiming for an algorithm that's sensitive (that is, it correctly detects the presence or absence of latency) and specific (the most likely explanation for the result is overbuffering in the path to the server). To the extent that measuring TLS latency aids either of those goals, we should think about keeping it.

Yes, looking at it from the perspective of transparent TCP proxies, which are very popular in cellular networks, it makes sense to measure TLS.

> Laying out a stable spec is hard, too. Changing the spec might invalidate earlier results. It's not much fun to be reviewing results, and have to ask whether both the client and server ends are newer than commit #1234567.

Agreed. But the method was never said to be "the universal and only way to measure latency". I was/still am hoping that we are going to converge.

> But it may not matter much in practice. Current RPM measurements are more descriptive than numerically accurate (low, medium, high vs a number with two decimal places). I regularly see a factor of 2-3 between the worst and best RPM values.

Variance has always been a problem. I am trying to increase the sample size without making the test run much longer. Right now I seem to achieve that goal by starting the measurement not at the moment we reach saturation, but by sending a probe every 100 ms from the beginning. I then take the 90th percentile and the average of the latencies on the load-generating and the separate connections. Numbers are fairly stable now.
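For illustration only, roughly what that aggregation looks like (the probe samples below are made up; one sample per 100 ms probe from the start of load generation):

```python
import statistics

def p90(samples):
    # 90th percentile of a list of probe latencies (milliseconds).
    return statistics.quantiles(samples, n=10)[-1]

# Made-up probe samples, one every 100 ms from the start of load generation.
load_generating_ms = [48, 52, 61, 75, 90, 102, 95, 88, 110, 97]
separate_ms = [45, 47, 55, 70, 85, 99, 93, 86, 105, 92]

print("load-generating p90:", p90(load_generating_ms))
print("separate p90:", p90(separate_ms))
print("average of the two:", (p90(load_generating_ms) + p90(separate_ms)) / 2)
```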

If we want to bring TLS into that, it would mean that instead of reusing connections for my probes on the separate connections, I would create new ones. I will experiment with that.

> We shouldn't be afraid to change the spec if we can devise a better algorithm. There are millions and millions of people who'll use the RPM Test in the future, and we should get it right as early as possible. (I am reminded of the likely apocryphal 'roff' story, where they created this terrific text formatting package in the early days of Unix. After using it for a while, they came up with a new (better) command syntax but decided not to implement it because they "already had a dozen users...")

Agreed, we need to converge as early as possible. macOS Ventura will have the latest and greatest.

> I don't know (See Item 1 above)

@cpaasch
Contributor Author

cpaasch commented Jun 27, 2022

As I read my reply to you, @richb-hanover, I realize that at the beginning I was mostly convinced that dropping TLS is good, and then moved more and more towards keeping TLS ;-)

As you can see, I'm quite split on this 😅

@richb-hanover
Contributor

> As you can see, I'm quite split on this

And I'm always happy to help muddy the waters here :-) Thanks.

@cpaasch
Contributor Author

cpaasch commented Jul 6, 2022

After more discussions and experimentation, it is best to keep the full handshake. The weighting of the values is still an open question. With the current approach we get 4 sets of data from the 2 types of probes: the probes on separate connections (for these we get TCP, TLS, and H2 data) and the probes on the load-generating connections (for these we only get H2 data). The data sets are separate_tcp, separate_tls, separate_h2, and load_generating_h2.

From each of these 4 data sets we take the 90th percentile, which gives us 4 values: separate_tcp_p90, separate_tls_p90, separate_h2_p90, and load_generating_h2_p90.

The suggestion would be to average these in the following way:

Responsiveness = (1/3 * separate_tcp_p90 + 1/3 * separate_tls_p90 + 1/3 * separate_h2_p90 + load_generating_h2_p90) / 2

We can also increase the weight towards H2:

Responsiveness = (1/6 * separate_tcp_p90 + 1/6 * separate_tls_p90 + 2/3 * separate_h2_p90 + load_generating_h2_p90) / 2
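For what it's worth, a small sketch of both weightings (the p90 inputs are placeholder numbers):

```python
def responsiveness(separate_tcp_p90, separate_tls_p90, separate_h2_p90,
                   load_generating_h2_p90, h2_weight=1/3):
    # h2_weight is the weight given to separate_h2_p90; the rest is split
    # evenly between the TCP and TLS values. h2_weight=1/3 is the
    # equal-weight variant, h2_weight=2/3 the H2-heavy variant.
    other_weight = (1 - h2_weight) / 2
    separate = (other_weight * separate_tcp_p90 +
                other_weight * separate_tls_p90 +
                h2_weight * separate_h2_p90)
    return (separate + load_generating_h2_p90) / 2

# Placeholder p90 values in milliseconds.
print(responsiveness(40, 55, 70, 120, h2_weight=1/3))  # equal weighting
print(responsiveness(40, 55, 70, 120, h2_weight=2/3))  # weighted towards H2
```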

@cpaasch
Contributor Author

cpaasch commented Jul 11, 2022

Addressed in af69ae2 - please reopen if that's not sufficient. I removed the reference to TLS v1.3 and instead explain what the TLS-handshake latency is: a calculation of latency per round-trip during the TLS handshake phase. If the TLS version being used requires 2 round-trips before the client can transmit data, then the latency needs to be divided by 2.
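For illustration, the per-round-trip normalization could look like this (assuming 1 round-trip for TLS 1.3 and 2 for a full TLS 1.2 handshake):

```python
def tls_latency_per_round_trip(handshake_duration_ms, tls_version):
    # Divide the measured handshake duration by the number of round-trips
    # the handshake needed before the client could transmit data.
    round_trips = 1 if tls_version == "TLSv1.3" else 2
    return handshake_duration_ms / round_trips

print(tls_latency_per_round_trip(80, "TLSv1.3"))   # 80.0
print(tls_latency_per_round_trip(160, "TLSv1.2"))  # 80.0
```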

cpaasch closed this as completed Jul 11, 2022