Metrics failing with the occasional "Broken Pipe" #9921
Comments
this means the client (prometheus) went away before synapse got a chance to send the response. what is the timeout set to in prometheus, and how does it compare to the length of garbage collections on this server?
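For reference, one way to compare GC pauses against the scrape timeout is to query the exported GC histogram. This is only a sketch: it assumes Synapse's `python_gc_time` histogram (with a `gen` label) is exported on this deployment, which should be verified against the actual `/metrics` output.

```promql
# Rough 99th-percentile GC pause duration over the last 5 minutes, per generation
# (metric and label names are assumptions; check the /metrics output first).
histogram_quantile(0.99, sum by (le, gen) (rate(python_gc_time_bucket[5m])))

# Total time spent in GC per second, for comparison with the 15s scrape timeout.
sum by (gen) (rate(python_gc_time_sum[5m]))
```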
Looks like Prometheus scraping is set to a 15s timeout. Here are some graphs: the memory graph shows deployments. The Up graph is misleading, since Prometheus thinks the host is down a lot when in fact it's fine. This graph suggests that while the alerts went away when we redeployed with a few more synchrotrons, they will likely be back soon.
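For context, a 15s scrape timeout in Prometheus would be configured roughly like the sketch below. Job name, targets and metrics path are placeholders, not the customer's actual config; the path in particular depends on how the metrics listener is exposed.

```yaml
# prometheus.yml (sketch only; job name, targets and metrics_path are placeholders)
scrape_configs:
  - job_name: "synapse"
    scrape_interval: 15s
    scrape_timeout: 15s        # Prometheus aborts the scrape and disconnects after 15s
    metrics_path: "/_synapse/metrics"
    static_configs:
      - targets: ["synapse-master:9090", "synchrotron1:9101"]
```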
looks like matrix.org's scrape_timeout is 15s too, so I'd expect this to be better than it is. Do the scrapes go through a reverse-proxy (the k8s ingress, maybe) which could be dropping connections prematurely?
Yes, that is the case. I'll get an ops opinion on this as well.
The default proxy_read_timeout in nginx is 60s and we don't change it. The Prometheus client would have closed the connection to nginx before then, which in turn would cause nginx to close the connection to the backend. If we're seeing broken pipes, it would indicate Synapse took longer to respond than the Prometheus scrape timeout or nginx's default timeout, whichever comes first.
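To illustrate the interplay, a minimal nginx sketch for the metrics route is below. The location path and upstream name are placeholders. With the defaults described above, Prometheus gives up after its 15s scrape_timeout and closes its connection to nginx; nginx then closes the upstream connection, so Synapse hits a broken pipe when it eventually tries to write the response.

```nginx
# Sketch only: location path and upstream name are placeholders.
location /_synapse/metrics {
    proxy_pass http://synapse_metrics;
    proxy_read_timeout 60s;  # nginx default; only matters if the Prometheus scrape timeout is raised above it
}
```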
Let's see if additional horizontal scaling helps by reducing CPU load - will close this later if the alert stays away. |
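For reference, horizontal scaling here means adding more sync workers behind the reverse proxy. A minimal sketch of what one additional worker's config might look like on a v1.32-era generic_worker deployment follows; all names, ports and paths are placeholders and not the customer's actual config.

```yaml
# synchrotron2.yaml -- sketch only; names, ports and paths are placeholders.
worker_app: synapse.app.generic_worker
worker_name: synchrotron2

# Replication connection back to the main process.
worker_replication_host: 127.0.0.1
worker_replication_http_port: 9093

worker_listeners:
  # Client listener that the reverse proxy routes /sync traffic to.
  - type: http
    port: 8085
    resources:
      - names: [client]
  # Dedicated metrics listener for Prometheus to scrape.
  - type: metrics
    port: 9102

worker_log_config: /etc/matrix-synapse/synchrotron2.log.config
```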
@jaywink How's it looking? |
assuming it's better - let's reopen if not |
Description
We have an EMS customer with 4x synchrotrons and a master, in addition to some other workers. The CPU usage of the sync workers is generally quite high, which suggests this could be a performance issue like #7596. Additional note: presence is turned on for the host.
Stacktrace:
Version information
Homeserver: please ping for the EMS host name
Version: v1.32.2