
Pool Connections in RGWAdmin Client #48

Merged
merged 2 commits into UMIACS:master on May 19, 2020

Conversation

Dorthu
Contributor

@Dorthu Dorthu commented May 18, 2020

It's common in my usage to create an RGWAdmin instance and use it for
multiple requests. As implemented now, this requires creating a new
connection per request, which isn't horribly slow but it's not as fast
as keeping the TCP connection open.

This change uses a `requests.Session` to keep the TCP connection open so
that multiple requests being made to the same endpoint don't have to
renegotiate the connection. This results in a small, but appreciable speed
improvement.

I tested the difference by making a client with pooling enabled and one
without and hitting the same admin endpoint 100 times in a loop, timing
the entire operation. The results were:

No pooling: ~19s
Pooling: ~6s
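
Roughly, the comparison looked like this — a simplified sketch rather than the exact script (the endpoint is a placeholder and request signing is omitted):

import time
import requests

URL = 'https://rgw.example.com/admin/usage?format=json'  # placeholder endpoint

# No pooling: every call negotiates a new TCP (and TLS) connection.
start = time.time()
for _ in range(100):
    requests.get(URL)
print('no pooling:', time.time() - start)

# Pooling: one Session reuses the same underlying connection.
session = requests.Session()
start = time.time()
for _ in range(100):
    session.get(URL)
print('pooling:', time.time() - start)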

I made this configurable - it's enabled by default since I assume
this is the common usage pattern, but it can be disabled and fall back
to the old behavior.
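
In client terms, the change boils down to something like this sketch (illustrative names only — see the diff for the actual implementation):

import requests

class PooledClient:
    def __init__(self, endpoint, pool_connections=True):
        self._endpoint = endpoint
        # With pooling enabled, every request goes through one shared
        # Session, so the TCP connection (and TLS handshake) is reused.
        self._session = requests.Session() if pool_connections else None

    def _request(self, method, path, **kwargs):
        url = self._endpoint + path
        if self._session is not None:
            return self._session.request(method, url, **kwargs)
        # Old behavior: a fresh connection for every call.
        return requests.request(method, url, **kwargs)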

@liammonahan
Contributor

Hi @Dorthu

This is a great idea. I am going to test whether this change has any implications for stale connections held open for long periods of time. I want to know how the exception handling behaves in those cases, and whether we need to catch any other edge cases that were possible but unlikely to manifest in the non-pooled configuration.

I will try to look at this today or tomorrow. Let me know if you have additional thoughts.

Thanks,
Liam

@liammonahan liammonahan self-assigned this May 18, 2020
@Dorthu
Contributor Author

Dorthu commented May 18, 2020

I didn't think to test that - I'll give that a shot now as well.

@liammonahan
Contributor

So for some background, we have a test and a prod cluster that are fronted by HAProxy, which load balances to the rgw origins. Since v1.5, HAProxy is supposed to attempt keep-alive on both the frontend and the backend by default. I also explicitly added enable_keep_alive=yes to the rgw_frontends config option in Ceph.

If I make calls in a loop with 9 seconds or fewer in between everything works fine with connection pooling. Once I go to 10 seconds, I get connection errors: "Remote end closed connection without response"
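
The failure is easy to surface with a plain Session as well; roughly (the endpoint is a placeholder):

import time
import requests

session = requests.Session()
url = 'https://rgw.example.com/admin/info'  # placeholder

# If the idle gap exceeds the load balancer's keep-alive timeout, the
# pooled connection is closed server-side and the next call fails with
# a ConnectionError ("Remote end closed connection without response").
for delay in (5, 9, 10, 15):
    session.get(url)
    time.sleep(delay)
    try:
        session.get(url)
        print(delay, 'ok')
    except requests.exceptions.ConnectionError as exc:
        print(delay, 'failed:', exc)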

I also added this and it did not seem to have the intended effect:

if pool_connections:
    self._session = requests.Session()
    self._session.headers.update({'Connection': 'keep-alive'})

Just adding this as a data point. What does your load-balancing chain look like for your use case?

Thanks,
Liam

@Dorthu
Contributor Author

Dorthu commented May 19, 2020

We're using OpenResty (nginx) to load balance to the rgws, and I haven't seen any issues like that in our setup. That said, the cluster I'm testing against is having some unrelated issues at the moment, so I haven't been able to try the loop you described.

Looks like this has the potential to cause issues for some deployments,
so it should be opt-in.
@Dorthu
Contributor Author

Dorthu commented May 19, 2020

Alright, I sorted out the issue with the test cluster, and ran the following without issue:

In [16]: from time import sleep

In [17]: for i in range(10):
    ...:     client.get_bucket()  # has pool_connections enabled
    ...:     sleep(10)

I tested sleeps between 5 and 20 seconds, all worked as expected.

That being said, if there's a possibility this is going to cause mysterious failures in some setups, I'd rather err on the side of caution and default it to disabled; I wouldn't want to break existing systems because they upgraded.

@liammonahan
Contributor

So I found that HAProxy has a default keep-alive timeout of 10 seconds. I'm guessing other people will have other load balancers with a wide array of different options. I think defaulting to connection pooling off is the right call.

I am planning to add some documentation to rgwadmin this week and can include a section on this. I think the next thing this feature needs in order to be robust is connection re-establishment: if the library could detect that the pooled connection had been closed and reconnect transparently, that would be neat.
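
One possible shape for that, purely as a sketch (nothing like this is in the library yet): catch the connection error and retry the call once through the same Session, which makes urllib3 open a fresh connection.

import requests

def request_with_retry(session, method, url, **kwargs):
    # A pooled connection that was closed while idle typically surfaces
    # as a ConnectionError; retrying through the same Session opens a
    # fresh TCP connection transparently to the caller.
    try:
        return session.request(method, url, **kwargs)
    except requests.exceptions.ConnectionError:
        return session.request(method, url, **kwargs)

Mounting an HTTPAdapter with a urllib3 Retry policy could push something similar down into the transport layer, though retrying non-idempotent admin calls would need care.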

Merging now. Thanks.

@liammonahan liammonahan merged commit bef0e93 into UMIACS:master May 19, 2020
@liammonahan
Contributor

Released in v2.3.1.

https://pypi.org/project/rgwadmin/2.3.1/#files

@liammonahan
Contributor

I added a note about connection pooling to the docs here: https://rgwadmin.readthedocs.io/en/latest/rgwadmin/user-guide.html#connection-pooling
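
For quick reference, opting in looks roughly like this (the pool_connections name matches the change in this PR; the user guide above is authoritative):

from rgwadmin import RGWAdmin

# Reuses one TCP connection across admin calls; leave it off (the
# default) if a load balancer in front of rgw closes idle connections.
rgw = RGWAdmin(
    access_key='ACCESS_KEY',
    secret_key='SECRET_KEY',
    server='rgw.example.com',
    pool_connections=True,
)

print(rgw.get_bucket())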
