
Metrics failing with the occasional "Broken Pipe" #9921

Closed
jaywink opened this issue May 3, 2021 · 8 comments
Labels
X-Needs-Info This issue is blocked awaiting information from the reporter

Comments

@jaywink (Member) commented May 3, 2021

Description

We have an EMS customer with 4x synchrotrons and a master, in addition to some other workers. The CPU usage of the sync workers is generally quite high, which makes it feel like this could be a performance issue like #7596. Additional note: presence is turned on for the host.

Stacktrace:

2021-05-03 08:51:06,237 - twisted - 260 - ERROR - sentinel- ----------------------------------------
2021-05-03 08:51:06,241 - twisted - 260 - ERROR - sentinel- Exception happened during processing of request from ('<monitoring_host>', 50782)
2021-05-03 08:51:06,244 - twisted - 260 - ERROR - sentinel- Traceback (most recent call last):
2021-05-03 08:51:06,258 - twisted - 260 - ERROR - sentinel-   File "/usr/local/lib/python3.8/socketserver.py", line 683, in process_request_thread
2021-05-03 08:51:06,263 - twisted - 260 - ERROR - sentinel-     self.finish_request(request, client_address)
2021-05-03 08:51:06,267 - twisted - 260 - ERROR - sentinel-   File "/usr/local/lib/python3.8/socketserver.py", line 360, in finish_request
2021-05-03 08:51:06,275 - twisted - 260 - ERROR - sentinel-     self.RequestHandlerClass(request, client_address, self)
2021-05-03 08:51:06,280 - twisted - 260 - ERROR - sentinel-   File "/usr/local/lib/python3.8/socketserver.py", line 747, in __init__
2021-05-03 08:51:06,282 - twisted - 260 - ERROR - sentinel-     self.handle()
2021-05-03 08:51:06,285 - twisted - 260 - ERROR - sentinel-   File "/usr/local/lib/python3.8/http/server.py", line 427, in handle
2021-05-03 08:51:06,288 - twisted - 260 - ERROR - sentinel-     self.handle_one_request()
2021-05-03 08:51:06,292 - twisted - 260 - ERROR - sentinel-   File "/usr/local/lib/python3.8/http/server.py", line 415, in handle_one_request
2021-05-03 08:51:06,319 - twisted - 260 - ERROR - sentinel-     method()
2021-05-03 08:51:06,324 - twisted - 260 - ERROR - sentinel-   File "/usr/local/lib/python3.8/site-packages/synapse/metrics/_exposition.py", line 211, in do_GET
2021-05-03 08:51:06,329 - twisted - 260 - ERROR - sentinel-     self.wfile.write(output)
2021-05-03 08:51:06,339 - twisted - 260 - ERROR - sentinel-   File "/usr/local/lib/python3.8/socketserver.py", line 826, in write
2021-05-03 08:51:06,344 - twisted - 260 - ERROR - sentinel-     self._sock.sendall(b)
2021-05-03 08:51:06,349 - twisted - 260 - ERROR - sentinel- BrokenPipeError: [Errno 32] Broken pipe
2021-05-03 08:51:06,352 - twisted - 260 - ERROR - sentinel- ----------------------------------------

Version information

  • Homeserver: please ping for the EMS host name

  • Version: v1.32.2

@richvdh (Member) commented May 4, 2021

this means the client (prometheus) went away before synapse got a chance to send the response. what is the timeout set to in prometheus, and how does it compare to the length of garbage collections on this server?
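For illustration, this failure mode is easy to reproduce outside Synapse with a stdlib-only sketch (none of this is Synapse code): the 2-second delay stands in for a long GC pause or a slow metrics render, and the 0.5-second client timeout plays the part of Prometheus's scrape_timeout.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class SlowMetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(2)  # stand-in for a long GC pause / slow metrics render
        body = b"# HELP dummy_metric A placeholder metric\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        time.sleep(0.1)  # give the disconnected client's RST time to arrive
        self.wfile.write(body)  # BrokenPipeError / ConnectionResetError here

    def log_message(self, *args):
        pass  # keep the demo output limited to the server-side traceback


server = ThreadingHTTPServer(("127.0.0.1", 0), SlowMetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

try:
    # The "scraper": a client whose timeout is shorter than the server's delay.
    urllib.request.urlopen(
        f"http://127.0.0.1:{server.server_port}/metrics", timeout=0.5
    )
except OSError as exc:  # the client gives up and closes its socket
    print("client timed out:", exc)

time.sleep(3)  # the handler finishes late; socketserver logs the broken pipe
server.shutdown()
```

Running this prints essentially the same socketserver traceback block as in the log above: the server only notices the client is gone when it finally tries to write the response.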

@jaywink (Member, Author) commented May 6, 2021

Looks like prometheus scraping is set to 15s timeout. Here are some graphs:

The memory graph shows deployments. The "up" graph is misleading: Prometheus thinks the host is down a lot, when in fact it's fine.

[screenshot: memory usage and Prometheus "up" graphs]

This graph suggests that while the alerts went away when we redeployed with a few more synchrotrons, they will likely be back soon.

[two further screenshots attached]
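One way to sanity-check whether the endpoint itself ever blows the 15s budget is to time it repeatedly from roughly the same network position Prometheus scrapes from. A rough sketch; the URL is a placeholder for whatever address Prometheus actually scrapes (directly at the worker's metrics listener, or via the ingress):

```python
import time
import urllib.request

URL = "http://127.0.0.1:9101/_synapse/metrics"  # placeholder scrape target
SCRAPE_TIMEOUT = 15  # seconds, matching the Prometheus config discussed above

while True:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=SCRAPE_TIMEOUT) as resp:
            size = len(resp.read())
        print(f"ok      {time.monotonic() - start:6.2f}s  {size} bytes")
    except OSError as exc:
        print(f"failed  {time.monotonic() - start:6.2f}s  {exc}")
    time.sleep(15)  # roughly mirror the scrape interval
```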

@richvdh (Member) commented May 6, 2021

looks like matrix.org's scrape_timeout is 15s too, so I'd expect this to be better than it is.

Do the scrapes go through a reverse-proxy (the k8s ingress, maybe) which could be dropping connections prematurely?

@jaywink (Member, Author) commented May 6, 2021

Yes, that is the case.

I'll get an ops opinion on this as well.

@daenney (Contributor) commented May 6, 2021

The default proxy_read_timeout in nginx is 60s and we don't change it. The Prometheus client would have closed its connection to nginx before then, which in turn causes nginx to close its connection to the backend. If we're seeing broken pipes, it would indicate Synapse has taken longer to respond than the Prometheus scrape timeout or nginx's default timeout, whichever comes first.
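To make that arithmetic explicit, a back-of-envelope sketch; both numbers are taken from this thread rather than from the customer's actual configs:

```python
# Timeouts mentioned in this thread (assumptions about this deployment).
prometheus_scrape_timeout = 15  # seconds: Prometheus scrape_timeout
nginx_proxy_read_timeout = 60   # seconds: nginx proxy_read_timeout default

# Whichever side gives up first tears down the connection; once that happens,
# Synapse's eventual write hits a closed socket and raises BrokenPipeError.
effective_budget = min(prometheus_scrape_timeout, nginx_proxy_read_timeout)
print(f"Synapse must finish serving /metrics within ~{effective_budget}s")  # ~15s
```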

@jaywink (Member, Author) commented May 6, 2021

Let's see if additional horizontal scaling helps by reducing CPU load - will close this later if the alert stays away.

@daenney (Contributor) commented May 7, 2021

@jaywink How's it looking?

@erikjohnston added the X-Needs-Info label on May 20, 2021
@richvdh (Member) commented May 27, 2021

assuming it's better - let's reopen if not

@richvdh closed this as completed May 27, 2021