Connection flooding until reaching max_connections and then panic #2340
I've played around with this a bit, but I've not found anything that could cause this. What follows is just a list of things I've tried/looked at.
That said, I'm out of ideas what else could be tried here. (I've checked the behaviour locally with my own wundergraph/juniper/actix/diesel project; it does not happen there.)
Thanks for your effort. I'm thinking I might have to bring up a managed azure postgresql instance and see if that's any different. Perhaps Azure does something interesting; I've experienced before, in other situations with sql server, that their networking solutions do weird stuff with connections. But usually they are on the other side, i.e. terminating idle connections to save resources.
Gathering some more clues, I saw these error logs when running locally (client on windows, server on ubuntu on azure):
That sounds like something is terminating your connections at the network level. r2d2 does check whether a connection is still active when it pulls a connection from a given pool; diesel provides a callback to do this by just sending a simple query. I think I'm closing this issue now, as I do not see anything actionable for the diesel team that could be done to improve this situation. Seems like this is just the behaviour of libpq in such cases.
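A hedged sketch of what that check looks like when building the pool with diesel's r2d2 integration; the URL and pool size are illustrative assumptions, not values from this thread:

```rust
// Hedged sketch, not code from the thread: building a diesel/r2d2 pool that
// validates connections when they are checked out.
use diesel::pg::PgConnection;
use diesel::r2d2::{ConnectionManager, Pool};

fn build_checked_pool(database_url: &str) -> Pool<ConnectionManager<PgConnection>> {
    let manager = ConnectionManager::<PgConnection>::new(database_url);
    Pool::builder()
        .max_size(8)
        // Run the manager's validity check (a trivial query) before handing
        // out a connection, so dead connections get discarded and replaced.
        .test_on_check_out(true)
        .build(manager)
        .expect("failed to build the connection pool")
}

fn main() {
    // Illustrative URL only.
    let pool = build_checked_pool("postgres://user:password@localhost/mydb");
    let _conn = pool.get().expect("no healthy connection available");
}
```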
I know this is closed, but I'll try to add more potential info as I go along... hope it's ok. I just found this (a bit old), but maybe a clue? actix/actix-web#24
Sure, it's totally fine to comment here with more information. I'm also curious what is causing this; it's just that I do not see what on diesel's side could cause it.
I agree, knowing where the error is a mystery to me too so far.
I have put up a monitor on the graphql introspection endpoint and I can still see that the app dies for half an hour (!) every now and then. The frequency of crashes was higher before, and I fixed that: I was passing the database connection string via environment variables, but it seems that there was some race condition populating the variables in the container. Either azure did not pass them along on boot, or the container did not pick them up, or the rust threads didn't get them. I removed this source of error by passing along the connection string in a settings file for now. Still seeing connection flooding though... with the only traffic being the monitor calls that run every 5 minutes (and they shouldn't even hit the db, only the graphql endpoint). If you want, I can add you to the project @weiznich to give you a better chance of investigating, if you have interest.
The hunt goes on... I installed pgbouncer in an attempt to not flood the db and to get some more logs. And with that info, my latest prime suspect in dealing with the broken connections is r2d2. @weiznich, do you have (or would you mind making) an example of how to use wundergraph without r2d2 connection pooling?
@andrejohansson It should be possible to just use a raw connection instead of a pooled connection as the context type in wundergraph out of the box. That said, I'm not sure how to handle the integration in a web framework in such cases.
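For illustration, a minimal sketch of establishing a raw diesel PgConnection instead of an r2d2 pool; the URL is an assumption, and wiring the connection into wundergraph's context type and the web framework is not shown:

```rust
// Minimal sketch, assuming diesel 1.4: a single raw PgConnection instead of a
// pooled one.
use diesel::pg::PgConnection;
use diesel::prelude::*;

fn main() {
    // One unpooled connection; it is closed when it goes out of scope.
    let _conn = PgConnection::establish("postgres://user:password@localhost/mydb")
        .expect("failed to connect to the database");
}
```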
@andrejohansson Hello! I ran into similar issues on my local machine: I had problems with reaching the max number of connections to my postgres database. I would look into how many "workers" you're initializing from your actix web server.
So on my local machine, by default, I ran 8 workers on my server, and for each worker my connection pool from r2d2 opened its own set of connections. I'm not so familiar with Azure, but since you're on a docker container, would your "number of logical CPUs in the system" ever change? My guess is that if you're running on a system that automatically scales up and down at all, you'd have a decent chance of going over the connection limit, especially if the error from exceeding the connection limit causes you to spin up another service which would then attempt to create its own connection pools. Please let me know if this helps or ends up being relevant to your issues!
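To make the arithmetic above concrete, a small hedged sketch; num_cpus and the default pool size of 10 are assumptions based on actix-web's and r2d2's documented defaults:

```rust
// Hedged sketch of the arithmetic above: if each actix worker builds its own
// r2d2 pool, the worst-case number of postgres connections is
// workers * pool max_size.
fn worst_case_connections(workers: u32, pool_max_size: u32) -> u32 {
    workers * pool_max_size
}

fn main() {
    let workers = num_cpus::get() as u32; // actix-web's default worker count
    let pool_max_size = 10; // r2d2's default max_size

    println!(
        "worst case: {} connections ({} workers x {} per pool)",
        worst_case_connections(workers, pool_max_size),
        workers,
        pool_max_size
    );
}
```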
Hello! You are partly right. I have come a bit further on the issue. I have, as you said, a lot of CPUs on my system (12 logical cores) and I had a pooling setting of 8, which quickly added up to the max_connections limit. This explained to me why it "felt" like the system got choked much faster when I ran locally against the db. But still... even after tweaking these numbers (setting workers to 1 and pool connections to 2) things didn't quite work as I expected. Suddenly it just died again. Turns out that I wasn't far off... I remembered that I have had problems with azure and its load balancer before. Note that I am running a VM and then postgres on that VM; I am not using the hosted Postgres service on azure.
Current hypothesis
Sometimes azure does "magic behind the curtains", and I suspect this might be such a case. If the connection pool opens up X idle connections against my azure VM running postgres, then the azure network firewall/balancer cuts those connections silently after roughly 4 minutes (in the default configuration). But for some reason the connection driver on my end does not understand that the connection is gone, and the pool happily serves a dead connection when requested. Killing the connections while letting both ends still think they are alive and idle is nasty: it would lead to the pool thinking it has 8 healthy connections it can use, but as soon as it picks one and sends a query, nothing will happen. (A configuration sketch for working around this follows after this comment.)
Possible alternative workarounds/solutions
Topics that may be related
My current solution
Just to test my hypothesis, I launched a small instance on https://www.elephantsql.com/ roughly a week ago and I have had no issues since! So... for me the problem is solved for now. However, some part is still not right. Is it azure's load balancer being dirty? Should something be done to r2d2 to improve detection of killed connections? Or to the rust postgres driver?
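A hedged sketch of the kind of workaround the hypothesis above suggests: recycle pooled connections well before the suspected ~4 minute idle cutoff and ask libpq for TCP keepalives via connection-string parameters. The concrete timeouts and the URL are illustrative assumptions, not settings from this thread:

```rust
// Hedged sketch only: keep pooled connections from sitting idle long enough
// for a load balancer to silently drop them.
use std::time::Duration;

use diesel::pg::PgConnection;
use diesel::r2d2::{ConnectionManager, Pool};

fn build_short_lived_pool() -> Pool<ConnectionManager<PgConnection>> {
    // libpq accepts keepalive options as URI query parameters.
    let database_url =
        "postgres://user:password@localhost/mydb?keepalives=1&keepalives_idle=60";
    let manager = ConnectionManager::<PgConnection>::new(database_url);

    Pool::builder()
        .max_size(8)
        .idle_timeout(Some(Duration::from_secs(120))) // drop idle connections well under 4 minutes
        .max_lifetime(Some(Duration::from_secs(180))) // recycle every connection regularly
        .test_on_check_out(true) // validate before handing a connection out
        .build(manager)
        .expect("failed to build the connection pool")
}

fn main() {
    let _pool = build_short_lived_pool();
}
```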
@andrejohansson thanks for the details! I wonder if you had a healthcheck somewhere that was somehow keeping a connection to your postgres vm endpoint, since it would likely not have the logic in place to close the connection. For a crate like r2d2, or maybe even a separate crate, I guess you could introduce more automatic management: querying connections, seeing which "workers" or clients are connected and how many connections they have, and then having the crate lower the amount of connections in each client's pool. For r2d2, I wonder if they could add a "desired connections per pool" config and then back this number off as the total approaches the max connections config. As a user of r2d2, I'd be able to "aim" for a certain amount of connections per pool, say 10. But if we're over 50% of the db's max connections, we only add 7 connections if another pool is needed, then only 3 connections if over 75%, or whatever the next threshold is. Then you could also give warnings at some point. (A rough sketch of this heuristic follows at the end of this comment.) Anyways, this got way off topic from diesel but has definitely been an interesting conversation haha.
Additional thoughts:
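A rough sketch of the back-off heuristic described above; this is not an existing r2d2 feature, it only writes out the thresholds from the comment (aim for 10 per pool, at most 7 above 50% usage, at most 3 above 75%):

```rust
// Rough sketch of the suggested "desired connections per pool" back-off idea.
fn pool_size_for(desired: u32, current_total: u32, max_connections: u32) -> u32 {
    let usage = current_total as f64 / max_connections as f64;
    if usage >= 0.75 {
        desired.min(3)
    } else if usage >= 0.50 {
        desired.min(7)
    } else {
        desired
    }
}

fn main() {
    // Example: a new pool is needed while 60 of 100 allowed connections exist.
    assert_eq!(pool_size_for(10, 60, 100), 7);
    println!("next pool gets {} connections", pool_size_for(10, 60, 100));
}
```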
Setup
Versions
Rust: 1.41.1 (f3e1a954d 2020-02-24)
Diesel: 1.4.4 (see below for full Cargo.toml dependencies)
Database: postgres (PostgreSQL) 12.2 (Ubuntu 12.2-2.pgdg18.04+1)
Operating system (client): Windows 10 Pro, version 1909, build 18363.720
Operating system (server): Ubuntu 18.04.4 LTS (GNU/Linux 5.0.0-1032-azure x86_64), running as a VM in azure
Feature Flags: ["postgres", "chrono", "r2d2"]
Problem Description
This is a move of this issue, in order to get help finding the source of the problem.
What are you trying to accomplish?
Use actix web server with juniper+wundergraph graphql api endpoint in a stable manner.
What is the expected output?
That if I set my connection pooling to 8 connections, it should remain at 8 connections.
What is the actual output?
Idle connections keep getting created until `MAX_CONNECTIONS` is reached; then my app cannot connect anymore and the database becomes unavailable for further use until a restart, or until connections are killed manually. Running `select * from pg_stat_activity;` shows that idle connections are filling up. But I also found that one query got spammed: `SET CLIENT_ENCODING TO 'UTF8'`. According to @weiznich this is an internal query run by diesel on connection setup.
Are you seeing any additional errors?
As a result of hitting `MAX_CONNECTIONS`, the app threads start to fail and panic (as expected).
Steps to reproduce
Really not sure about this; this is my actix web server initialization code:
And this is my connection setup code:
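The snippets referenced above did not survive the move of the issue. As a rough stand-in, here is a minimal sketch of a comparable actix-web 2.x + diesel 1.4 + r2d2 setup; the URL, route, handler name and worker count are illustrative assumptions, not the code from the original project:

```rust
// Hedged stand-in only, not the original issue's code.
use actix_web::{web, App, HttpResponse, HttpServer, Responder};
use diesel::pg::PgConnection;
use diesel::r2d2::{ConnectionManager, Pool};

type PgPool = Pool<ConnectionManager<PgConnection>>;

fn create_db_pool(database_url: &str) -> PgPool {
    let manager = ConnectionManager::<PgConnection>::new(database_url);
    Pool::builder()
        .max_size(8) // upper bound for this pool
        .build(manager)
        .expect("failed to create the database pool")
}

async fn graphql_handler(_pool: web::Data<PgPool>) -> impl Responder {
    // The real handler would execute the juniper/wundergraph schema here.
    HttpResponse::Ok().body("ok")
}

#[actix_rt::main]
async fn main() -> std::io::Result<()> {
    // Build the pool once, outside the factory closure, so all workers share
    // it instead of each worker creating its own set of connections.
    let pool = create_db_pool("postgres://user:password@localhost/mydb");

    HttpServer::new(move || {
        App::new()
            .data(pool.clone())
            .route("/graphql", web::post().to(graphql_handler))
    })
    .workers(1) // keep the worker count explicit instead of one per logical CPU
    .bind("0.0.0.0:8080")?
    .run()
    .await
}
```

The point relevant to this issue is where the pool is created: if it were built inside the `HttpServer::new` closure, every worker would get its own pool, multiplying the number of connections.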
Other than this, I am not doing anything manually regarding database queries or access. It's all handled by the generated diesel/juniper/wundergraph classes.
Questions/hypothesis
I'm new to all of this (rust, diesel, juniper, wundergraph and postgres) so no question/answer is too stupid. I'm a bit stuck on how to investigate this further, so all help and suggestions are welcome.
One hypothesis I have (but I don't know how to test) is that when I'm developing, I'm opening connections and then exiting my app with ctrl + c. Perhaps this is the cause of the lingering connections? If so, is there any way I can get my app to exit gracefully and shut down the connections on exit? Or is this already handled automatically by r2d2?
My other hypothesis was that executing a nested graphql query somehow triggered opening a lot of connections. But why would this ignore the pool limits and not wait? Could there be some nasty spooky threading issue here?
I've tried to configure postgresql to clean up connections that have been idle for 5 minutes using:
but that is more of an emergency solution than a remedy.
All help is much appreciated.
Checklist
I have already looked over the issue tracker for similar issues.
This issue can be reproduced on Rust's stable channel. (Your issue will be
closed if this is not the case)
Cargo.toml
postgresql-12-main.log