Rare edge case issue causing a thread deadlock with access to _optional_thread_lock in ConnectionPool #990
Unanswered
Zenulous
asked this question in
Potential Issue
Replies: 1 comment 1 reply
-
We are still running into this issue after creating one HTTP client per thread, so I do suspect the same thread is trying to acquire the lock twice without releasing it.
-
Firstly, thanks for continuously contributing to this important core lib for Python!
We rely on it a lot. Specifically, we have a Python API server which makes both streaming and non-streaming requests to OpenAI via OpenAI's Python SDK.
OpenAI uses `httpx`, and thus httpcore, under the hood.
Throughout our production usage we eventually noticed that processes were hanging.
We are currently using a multithreaded Gunicorn environment, so we suspected some kind of thread deadlock. These are our HTTP-related dependencies:
httpx = "~=0.28.0"
openai = "~=1.60.1"
While investigating this issue, I figured it was a good idea to check the stack trace of one of the stuck Python processes.
Across stuck processes, the stack trace consistently showed one thread attempting to call `close` on a `PoolByteStream`.
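The post does not say which tool produced the trace. As a hedged sketch, one standard-library way to get per-thread stacks out of a hung worker is `faulthandler`. The helper names `capture_thread_stacks` and `install_dump_on_signal` are hypothetical, and the signal you register must be one your server does not already use for something else.

```python
import faulthandler
import sys
import tempfile


def capture_thread_stacks() -> str:
    """Return the current stack of every thread in this process as text."""
    # faulthandler writes directly to a file descriptor, so a real file is needed.
    with tempfile.TemporaryFile(mode="w+") as tmp:
        faulthandler.dump_traceback(file=tmp, all_threads=True)
        tmp.seek(0)
        return tmp.read()


def install_dump_on_signal(signum: int) -> None:
    """Dump all thread stacks to stderr when `signum` arrives (not available on Windows)."""
    # The handler runs at the C level, so it still fires when Python threads are blocked on a lock.
    faulthandler.register(signum, file=sys.stderr, all_threads=True)


if __name__ == "__main__":
    print(capture_thread_stacks())
```

Calling `install_dump_on_signal` once at worker start-up and then sending that signal to a stuck process would print each thread's stack, which is the kind of output described here.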
Here is our stack trace, with app-specific and irrelevant traces removed, such as those from the main Gunicorn process.
Note how thread 2_4 is stuck: it is indefinitely trying to enter the `ThreadLock`, which is necessary to run `with self._pool._optional_thread_lock:`.
Thread where the lock got stuck
Notice how this other thread, which is one of all the stuck threads, is trying to call `__enter__`, seemingly to get the `self._optional_thread_lock` from the `ConnectionPool`. This is of course not possible, as `ThreadPoolExecutor-2_4` is indefinitely trying to get the lock as well. That causes our entire worker to hang, as none of the threads are able to acquire this lock anymore.
Thread which is also stuck as it cannot access the lock either
Now we wish we could provide you with a simple way to reproduce this issue.
Unfortunately, after quite some time trying, we ourselves do not know the exact condition causing this.
It does seem like `except BaseException as exc:` is hit, because this is where the `self.close()` method is called. I suspect there is a rare condition which causes this method to be called while the `_optional_thread_lock` is already taken by the thread. This causes the thread to hang indefinitely trying to access `_optional_thread_lock` again.
I would initially propose changing the ThreadLock lock type to an `RLock` to allow for re-entry of the same thread. I do understand if this might raise some performance concerns, but due to my lack of experience with programming in this lib directly it is the first-best solution I could think of. I'm hoping that an actual maintainer of this repo can shed some light on the situation given all this context 🙏.
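To illustrate the suspected mechanism, here is a minimal sketch. It is not a reproduction of the httpcore hang. It assumes the pool's lock behaves like a plain, non-reentrant `threading.Lock`. `reacquire_succeeds` is a hypothetical helper, and the timeout exists only so the demo reports the failure instead of hanging.

```python
import threading


def reacquire_succeeds(lock, timeout: float = 0.2) -> bool:
    """Return True if the current thread can take `lock` again while already holding it."""
    with lock:
        # A blocking acquire with no timeout would hang forever on a plain Lock,
        # which is the self-deadlock suspected above.
        got_it = lock.acquire(timeout=timeout)
        if got_it:
            lock.release()
        return got_it


if __name__ == "__main__":
    print("threading.Lock: ", reacquire_succeeds(threading.Lock()))
    print("threading.RLock:", reacquire_succeeds(threading.RLock()))
```

With a plain `Lock` the second acquire times out and the helper returns `False`. With an `RLock` the same thread may re-enter, so it returns `True`. This only shows the difference between the two lock types. Whether the close path really re-enters the lock in the same thread is still unconfirmed.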