
Metrics failing with the occasional "Broken Pipe" #9921

Closed
jaywink opened this issue May 3, 2021 · 8 comments
Labels
X-Needs-Info This issue is blocked awaiting information from the reporter

Comments

@jaywink (Member) commented May 3, 2021

Description

We have an EMS customer with 4x synchrotrons and a master, in addition to some other workers. The CPU usage of the sync workers is generally quite high, which makes it feel like this could be a performance issue like #7596. Additional note: presence is turned on for the host.

Stacktrace:

2021-05-03 08:51:06,237 - twisted - 260 - ERROR - sentinel- ----------------------------------------
2021-05-03 08:51:06,241 - twisted - 260 - ERROR - sentinel- Exception happened during processing of request from ('<monitoring_host>', 50782)
2021-05-03 08:51:06,244 - twisted - 260 - ERROR - sentinel- Traceback (most recent call last):
2021-05-03 08:51:06,258 - twisted - 260 - ERROR - sentinel-   File "/usr/local/lib/python3.8/socketserver.py", line 683, in process_request_thread
2021-05-03 08:51:06,263 - twisted - 260 - ERROR - sentinel-     self.finish_request(request, client_address)
2021-05-03 08:51:06,267 - twisted - 260 - ERROR - sentinel-   File "/usr/local/lib/python3.8/socketserver.py", line 360, in finish_request
2021-05-03 08:51:06,275 - twisted - 260 - ERROR - sentinel-     self.RequestHandlerClass(request, client_address, self)
2021-05-03 08:51:06,280 - twisted - 260 - ERROR - sentinel-   File "/usr/local/lib/python3.8/socketserver.py", line 747, in __init__
2021-05-03 08:51:06,282 - twisted - 260 - ERROR - sentinel-     self.handle()
2021-05-03 08:51:06,285 - twisted - 260 - ERROR - sentinel-   File "/usr/local/lib/python3.8/http/server.py", line 427, in handle
2021-05-03 08:51:06,288 - twisted - 260 - ERROR - sentinel-     self.handle_one_request()
2021-05-03 08:51:06,292 - twisted - 260 - ERROR - sentinel-   File "/usr/local/lib/python3.8/http/server.py", line 415, in handle_one_request
2021-05-03 08:51:06,319 - twisted - 260 - ERROR - sentinel-     method()
2021-05-03 08:51:06,324 - twisted - 260 - ERROR - sentinel-   File "/usr/local/lib/python3.8/site-packages/synapse/metrics/_exposition.py", line 211, in do_GET
2021-05-03 08:51:06,329 - twisted - 260 - ERROR - sentinel-     self.wfile.write(output)
2021-05-03 08:51:06,339 - twisted - 260 - ERROR - sentinel-   File "/usr/local/lib/python3.8/socketserver.py", line 826, in write
2021-05-03 08:51:06,344 - twisted - 260 - ERROR - sentinel-     self._sock.sendall(b)
2021-05-03 08:51:06,349 - twisted - 260 - ERROR - sentinel- BrokenPipeError: [Errno 32] Broken pipe
2021-05-03 08:51:06,352 - twisted - 260 - ERROR - sentinel- ----------------------------------------

Version information

  • Homeserver: please ping for the EMS host name

  • Version: v1.32.2

@richvdh (Member) commented May 4, 2021

this means the client (prometheus) went away before synapse got a chance to send the response. what is the timeout set to in prometheus, and how does it compare to the length of garbage collections on this server?
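For illustration, this failure mode is easy to reproduce outside Synapse with a stdlib-only sketch (none of this is Synapse code): the 2-second delay stands in for a long GC pause or a slow metrics render, and the 0.5-second client timeout plays the part of Prometheus's scrape_timeout.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class SlowMetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(2)  # stand-in for a long GC pause / slow metrics render
        body = b"# HELP dummy_metric A placeholder metric\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        time.sleep(0.1)  # give the disconnected client's RST time to arrive
        self.wfile.write(body)  # BrokenPipeError / ConnectionResetError here

    def log_message(self, *args):
        pass  # keep the demo output limited to the server-side traceback


server = ThreadingHTTPServer(("127.0.0.1", 0), SlowMetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

try:
    # The "scraper": a client whose timeout is shorter than the server's delay.
    urllib.request.urlopen(
        f"http://127.0.0.1:{server.server_port}/metrics", timeout=0.5
    )
except OSError as exc:  # the client gives up and closes its socket
    print("client timed out:", exc)

time.sleep(3)  # the handler finishes late; socketserver logs the broken pipe
server.shutdown()
```

Running this prints essentially the same socketserver traceback block as in the log above: the server only notices the client is gone when it finally tries to write the response.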

@jaywink (Member, Author) commented May 6, 2021

Looks like prometheus scraping is set to 15s timeout. Here are some graphs:

The memory graph shows deployments. The "up" graph is misleading: Prometheus thinks the host is down a lot, when in fact it's fine.

[screenshot: memory usage and Prometheus "up" graphs]

This graph suggests that while the alerts went away when we redeployed with a few more synchrotrons, they will likely be back soon.

[two further screenshots attached]
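One way to sanity-check whether the endpoint itself ever blows the 15s budget is to time it repeatedly from roughly the same network position Prometheus scrapes from. A rough sketch; the URL is a placeholder for whatever address Prometheus actually scrapes (directly at the worker's metrics listener, or via the ingress):

```python
import time
import urllib.request

URL = "http://127.0.0.1:9101/_synapse/metrics"  # placeholder scrape target
SCRAPE_TIMEOUT = 15  # seconds, matching the Prometheus config discussed above

while True:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=SCRAPE_TIMEOUT) as resp:
            size = len(resp.read())
        print(f"ok      {time.monotonic() - start:6.2f}s  {size} bytes")
    except OSError as exc:
        print(f"failed  {time.monotonic() - start:6.2f}s  {exc}")
    time.sleep(15)  # roughly mirror the scrape interval
```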

@richvdh (Member) commented May 6, 2021

looks like matrix.org's scrape_timeout is 15s too, so I'd expect this to be better than it is.

Do the scrapes go through a reverse-proxy (the k8s ingress, maybe) which could be dropping connections prematurely?

@jaywink (Member, Author) commented May 6, 2021

Yes, that is the case.

I'll get an ops opinion on this as well.

@daenney (Contributor) commented May 6, 2021

The default proxy_read_timeout in nginx is 60s and we don't change it. The Prometheus client would have closed its connection to nginx before then, which in turn causes nginx to close its connection to the backend. If we're seeing broken pipes, it would indicate Synapse has taken longer to respond than the Prometheus scrape timeout or nginx's default timeout, whichever comes first.
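To make that arithmetic explicit, a back-of-envelope sketch; both numbers are taken from this thread rather than from the customer's actual configs:

```python
# Timeouts mentioned in this thread (assumptions about this deployment).
prometheus_scrape_timeout = 15  # seconds: Prometheus scrape_timeout
nginx_proxy_read_timeout = 60   # seconds: nginx proxy_read_timeout default

# Whichever side gives up first tears down the connection; once that happens,
# Synapse's eventual write hits a closed socket and raises BrokenPipeError.
effective_budget = min(prometheus_scrape_timeout, nginx_proxy_read_timeout)
print(f"Synapse must finish serving /metrics within ~{effective_budget}s")  # ~15s
```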

@jaywink (Member, Author) commented May 6, 2021

Let's see if additional horizontal scaling helps by reducing CPU load - will close this later if the alert stays away.

@daenney (Contributor) commented May 7, 2021

@jaywink How's it looking?

@erikjohnston added the X-Needs-Info label on May 20, 2021
@richvdh (Member) commented May 27, 2021

assuming it's better - let's reopen if not

@richvdh closed this as completed May 27, 2021