Rare edge case issue causing a thread deadlock with access to _optional_thread_lock in ConnectionPool #990

Zenulous · 2025-01-30T08:24:51Z


Firstly, thanks for continuously contributing to this important core lib for Python!
We rely on it a lot. Specifically, we have a Python API server which creates both streaming and non-streaming requests to OpenAI via OpenAI's Python SDK.
Throughout our production usage we eventually noticed processes are hanging.
We are currently using a multithreaded Gunicorn environment with these HTTP-related dependencies:
httpx = "~=0.28.0"
openai = "~=1.60.1"

OpenAI's SDK uses httpx, and thus httpcore, under the hood.
Because we run a multithreaded Gunicorn environment, we suspected some kind of thread deadlock.

While investigating this issue, I figured it was a good idea to check the stack trace of one of the stuck Python processes.
Across stuck processes, the stack trace consistently showed one thread attempting to call `close` on a `PoolByteStream`.
Here is our stack trace, removing app-specific and irrelevant traces such as from the main Gunicorn process.

Note how thread 2_4 is stuck: it is waiting indefinitely to enter the ThreadLock, which it needs in order to run `with self._pool._optional_thread_lock:`.

Thread where the lock got stuck

(myenv) root@cse-copilot-856f676954-4xg67:~# py-spy dump --pid 4157 --locals
Process 4157: python app.py
Python v3.12.8 (/usr/bin/python3.12)

Thread 4157 (idle): "MainThread"
    (main Gunicorn thread)...
Thread 4162 (idle): "ThreadPoolExecutor-2_4"
    __enter__ (httpcore/_synchronization.py:268)
        Arguments:
            self: <ThreadLock at 0x7f7c024d4560>
    close (httpcore/_sync/connection_pool.py:416)
        Arguments:
            self: <PoolByteStream at 0x7f7c1acf07d0>
    __iter__ (httpcore/_sync/connection_pool.py:406)
        Arguments:
            self: <PoolByteStream at 0x7f7c1acf07d0>
        Locals:
            part: <bytes at 0x7f7bffbdbdb0>
            exc: <GeneratorExit at 0x7f7c18315780>
    __iter__ (httpx/_transports/default.py:128)
        Arguments:
            self: <ResponseStream at 0x7f7c1acf35c0>
        Locals:
            part: <bytes at 0x7f7bffbdbdb0>
    __iter__ (httpx/_client.py:154)
        Arguments:
            self: <BoundSyncStream at 0x7f7c1acb3cb0>
        Locals:
            chunk: <bytes at 0x7f7bffbdbdb0>
    iter_raw (httpx/_models.py:954)
        Arguments:
            self: <Response at 0x7f7c1acf2930>
            chunk_size: None
        Locals:
            chunker: <ByteChunker at 0x7f7c1acb2e10>
            raw_stream_bytes: <bytes at 0x7f7bffbdbdb0>
            chunk: <bytes at 0x7f7bffbdbdb0>
    iter_bytes (httpx/_models.py:900)
        Arguments:
            self: <Response at 0x7f7c1acf2930>
            chunk_size: None
        Locals:
            decoder: <IdentityDecoder at 0x7f7bffbf5580>
            chunker: <ByteChunker at 0x7f7c1acb2e70>
            raw_bytes: <bytes at 0x7f7bffbdbdb0>
            decoded: <bytes at 0x7f7bffbdbdb0>
            chunk: <bytes at 0x7f7bffbdbdb0>
    __init__ (httpcore/_synchronization.py:241)
        Arguments:
            self: <Lock at 0x7f7c1830e540>
    __init__ (httpcore/_sync/connection.py:66)
        Arguments:
            self: <HTTPConnection at 0x7f7c1830cec0>
            origin: <Origin at 0x7f7c1830e7b0>
            ssl_context: <SSLContext at 0x7f7c0253f950>
            keepalive_expiry: 5
            http1: True
            http2: False
            retries: 0
            local_address: None
            uds: None
            network_backend: <SyncBackend at 0x7f7c024d4500>
            socket_options: None
    create_connection (httpcore/_sync/connection_pool.py:168)
        Arguments:
            self: <ConnectionPool at 0x7f7c024d44d0>
            origin: <Origin at 0x7f7c1830e7b0>
    _assign_requests_to_connections (httpcore/_sync/connection_pool.py:326)
        Arguments:
            self: <ConnectionPool at 0x7f7c024d44d0>
        Locals:
            closing_connections: [<HTTPConnection at 0x7f7c1aca8770>]
            connection: <HTTPConnection at 0x7f7c1aca8770>
            queued_requests: [<PoolRequest at 0x7f7c1830d3d0>]
            pool_request: <PoolRequest at 0x7f7c1830d3d0>
            origin: <Origin at 0x7f7c1830e7b0>
            available_connections: []
            idle_connections: []
    handle_request (httpcore/_sync/connection_pool.py:228)
        Arguments:
            self: <ConnectionPool at 0x7f7c024d44d0>
            request: <Request at 0x7f7c1830d7f0>
        Locals:
            scheme: "https"
            timeouts: {"connect": 10, "read": 10, "write": 10, "pool": 10}
            timeout: 10
            pool_request: <PoolRequest at 0x7f7c1830d3d0>
    handle_request (httpx/_transports/default.py:250)
        Arguments:
            self: <HTTPTransport at 0x7f7c027ca240>
            request: <Request at 0x7f7c1830cfe0>
        Locals:
            httpcore: <module at 0x7f7c024d1e90>
            req: <Request at 0x7f7c1830d7f0>
    _send_single_request (httpx/_client.py:1014)
        Arguments:
            self: <SyncHttpxClientWrapper at 0x7f7c02666cf0>
            request: <Request at 0x7f7c1830cfe0>
        Locals:
            transport: <HTTPTransport at 0x7f7c027ca240>
            start: 104095.237487327
    _send_handling_redirects (httpx/_client.py:979)
        Arguments:
            self: <SyncHttpxClientWrapper at 0x7f7c02666cf0>
            request: <Request at 0x7f7c1830cfe0>
            follow_redirects: True
            history: []
    _send_handling_auth (httpx/_client.py:942)
        Arguments:
            self: <SyncHttpxClientWrapper at 0x7f7c02666cf0>
            request: <Request at 0x7f7c1830cfe0>
            auth: <Auth at 0x7f7c1aced0a0>
            follow_redirects: True
            history: []
        Locals:
            auth_flow: <generator at 0x7f7c1ac56f80>
    send (httpx/_client.py:914)
        Arguments:
            self: <SyncHttpxClientWrapper at 0x7f7c02666cf0>
            request: <Request at 0x7f7c1830cfe0>
        Locals:
            stream: False
            auth: <Auth at 0x7f7c1aced0a0>
            follow_redirects: True
    _request (openai/_base_client.py:996)
        Arguments:
            self: <AzureOpenAI at 0x7f7c02a6ee40>
        Locals:
            cast_to: 0
            options: <FinalRequestOptions at 0x7f7c1831a260>
            retries_taken: 0
            stream: False
            stream_cls: None
            input_options: <FinalRequestOptions at 0x7f7c1831a300>
            remaining_retries: 3
            request: <Request at 0x7f7c1830cfe0>
            kwargs: {}
    request (openai/_base_client.py:960)
        Arguments:
            self: <AzureOpenAI at 0x7f7c02a6ee40>
            cast_to: 0
            options: <FinalRequestOptions at 0x7f7c1830a210>
            remaining_retries: None
        Locals:
            stream: False
            stream_cls: None
            retries_taken: 0

Notice how this other thread, which is one of several stuck threads, is blocked in `__enter__`, seemingly trying to acquire `self._optional_thread_lock` from the ConnectionPool.
This cannot succeed, because ThreadPoolExecutor-2_4 is itself waiting indefinitely for the same lock.
As a result our entire worker hangs, since none of the threads can acquire this lock anymore.

Thread which is also stuck as it cannot access the lock either

Thread 4161 (idle): "ThreadPoolExecutor-2_3"
    __enter__ (httpcore/_synchronization.py:268)
        Arguments:
            self: <ThreadLock at 0x7f7c024d4560>
    handle_request (httpcore/_sync/connection_pool.py:218)
        Arguments:
            self: <ConnectionPool at 0x7f7c024d44d0>
            request: <Request at 0x7f7c1833ca40>
        Locals:
            scheme: "https"
            timeouts: {"connect": 10, "read": 10, "write": 10, "pool": 10}
            timeout: 10
    handle_request (httpx/_transports/default.py:250)
        Arguments:
            self: <HTTPTransport at 0x7f7c027ca240>
            request: <Request at 0x7f7c1833ce30>
        Locals:
            httpcore: <module at 0x7f7c024d1e90>
            req: <Request at 0x7f7c1833ca40>
    _send_single_request (httpx/_client.py:1014)
        Arguments:
            self: <SyncHttpxClientWrapper at 0x7f7c02666cf0>
            request: <Request at 0x7f7c1833ce30>
        Locals:
            transport: <HTTPTransport at 0x7f7c027ca240>
            start: 104316.449569162
    _send_handling_redirects (httpx/_client.py:979)
        Arguments:
            self: <SyncHttpxClientWrapper at 0x7f7c02666cf0>
            request: <Request at 0x7f7c1833ce30>
            follow_redirects: True
            history: []
    _send_handling_auth (httpx/_client.py:942)
        Arguments:
            self: <SyncHttpxClientWrapper at 0x7f7c02666cf0>
            request: <Request at 0x7f7c1833ce30>
            auth: <Auth at 0x7f7c1830e210>
            follow_redirects: True
            history: []
        Locals:
            auth_flow: <generator at 0x7f7c1ac91540>
    send (httpx/_client.py:914)
        Arguments:
            self: <SyncHttpxClientWrapper at 0x7f7c02666cf0>
            request: <Request at 0x7f7c1833ce30>
        Locals:
            stream: False
            auth: <Auth at 0x7f7c1830e210>
            follow_redirects: True
    _request (openai/_base_client.py:996)
        Arguments:
            self: <AzureOpenAI at 0x7f7c02a6ee40>
        Locals:
            cast_to: 0
            options: <FinalRequestOptions at 0x7f7c1831bc00>
            retries_taken: 0
            stream: False
            stream_cls: None
            input_options: <FinalRequestOptions at 0x7f7c1831bf20>
            remaining_retries: 3
            request: <Request at 0x7f7c1833ce30>
            kwargs: {}

Now we wish we could provide you a simple way to reproduce this issue.
Unfortunately after quite some time trying we ourselves do not know the exact condition causing this.
It does seem like `except BaseException as exc:` is hit, because this is where the `self.close()` method is called.
I suspect there is a rare condition which causes this method to be called while the `_optional_thread_lock` is already held by the same thread. The thread then hangs indefinitely trying to acquire `_optional_thread_lock` again.
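One possible reading of the first trace, offered as an assumption and not a confirmed diagnosis: the generator frames sitting above `Lock.__init__` carry a `GeneratorExit`, which is what a suspended generator receives when it is finalized. If that finalization happens on a thread that is currently inside the locked section, the `except BaseException` branch runs `close()` while the same thread already holds the lock. The sketch below imitates that situation with plain `threading` primitives. The names `stream`, `close_stream` and `locked_section` are hypothetical stand-ins and not httpcore APIs, and the immediate finalization relies on CPython reference counting.

```python
import threading

lock = threading.Lock()  # non-reentrant, like the lock in the trace
events = []


def close_stream():
    # Stand-in for a close() that needs the pool lock.
    # A timeout is used so the demo reports the problem instead of hanging.
    acquired = lock.acquire(timeout=0.2)
    events.append("close acquired lock" if acquired else "close would deadlock")
    if acquired:
        lock.release()


def stream():
    try:
        yield b"part-1"
        yield b"part-2"
    except BaseException:
        # GeneratorExit lands here when the generator is finalized early.
        close_stream()
        raise


def locked_section(holder):
    with lock:
        # Dropping the last reference finalizes the suspended generator
        # right here, on this thread, while the lock is still held.
        holder.clear()


gen = stream()
next(gen)          # leave the generator suspended inside the try block
holder = [gen]
del gen            # the list now holds the only reference
locked_section(holder)
print(events)      # on CPython: ['close would deadlock']
```

With an unbounded `acquire()` in place of the timeout, the thread would block forever on a lock it already owns, which matches the symptom described above.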

I would initially propose changing the ThreadLock lock type to an RLock to allow for re-entry of the same thread.
I do understand if this might raise some performance concerns, but due to my lack of experience with programming in this lib directly it is the first-best solution I could think of. I'm hoping that an actual maintainer of this repo can shed some light on the situation given all this context 🙏.
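To make the difference behind this proposal concrete, here is a minimal sketch using only the standard library. The helper name `can_reacquire` is made up for illustration. It checks whether a thread that already holds a lock can take it a second time, using a short timeout so the non-reentrant case returns instead of hanging.

```python
import threading


def can_reacquire(lock) -> bool:
    """Return True if the holding thread can acquire `lock` a second time."""
    with lock:
        # A timeout keeps the non-reentrant case from blocking forever.
        got_it = lock.acquire(timeout=0.1)
        if got_it:
            lock.release()
        return got_it


print(can_reacquire(threading.Lock()))   # False: same-thread re-entry blocks
print(can_reacquire(threading.RLock()))  # True: RLock counts nested acquires
```

An RLock avoids the hang, but it does not explain why the re-entry happens, and code running in the nested section may still observe pool state that is in the middle of being modified.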

Zenulous · 2025-02-03T10:21:13Z

Author

We are still running into this issue after creating one HTTP Client per thread, so I do suspect the same thread is trying to acquire the lock twice without releasing it.
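For reference, a one-client-per-thread setup can be sketched roughly as follows. This is an illustration under assumptions, not the exact code from our app: `get_thread_client` and the `factory` argument are hypothetical names, and in a real setup the factory would be whatever constructs the HTTP client.

```python
import threading

_local = threading.local()


def get_thread_client(factory):
    """Return this thread's client, creating it with `factory` on first use."""
    client = getattr(_local, "client", None)
    if client is None:
        client = factory()
        _local.client = client
    return client


# Usage: each thread gets its own instance, reused within that thread.
main_client = get_thread_client(object)
other_clients = []
worker = threading.Thread(
    target=lambda: other_clients.append(get_thread_client(object))
)
worker.start()
worker.join()
print(get_thread_client(object) is main_client)  # True
print(other_clients[0] is main_client)           # False
```

Since each client owns its own connection pool and lock, a hang in this setup cannot come from two threads contending for one lock, which is why it points at same-thread re-entry.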

1 reply

Zenulous Feb 10, 2025
Author

To confirm whether an RLock would help, we monkey-patched it in our app code as follows:

import logging
import threading
from types import TracebackType


class ThreadLock:
    """
    This is a threading-only lock for no-I/O contexts.
    In the sync case `ThreadLock` provides thread locking.
    In the async case `AsyncThreadLock` is a no-op.
    """

    def __init__(self) -> None:
        # Using an RLock here to allow for a thread to re-acquire its own lock
        # We monkey patch this solution while the httpcore team considers our issue
        # See https://github.com/encode/httpcore/discussions/990
        self._lock = threading.RLock()
        self._owner = threading.local()
        self._owner.thread_id = None

    def __enter__(self) -> "ThreadLock":
        current_thread_id = threading.get_ident()
        if getattr(self._owner, "thread_id", None) == current_thread_id:
            logging.info(
                "Thread %s already owns the connection pool lock and is trying to re-acquire it.",
                current_thread_id,
            )
        self._lock.acquire()
        self._owner.thread_id = current_thread_id
        return self

    def __exit__(
        self,
        exc_type: type[BaseException] | None = None,
        exc_value: BaseException | None = None,
        traceback: TracebackType | None = None,
    ) -> None:
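        self._lock.release()
        # _is_owned() is a private CPython helper on RLock; it is still true here
        # when this thread holds the lock from an outer `with` block.
        if self._lock._is_owned():
            self._owner.thread_id = threading.get_ident()
        else:
            self._owner.thread_id = None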

After a couple of days we did indeed see in our Splunk logs that in rare edge cases a thread re-acquires the lock it already holds.
Therefore I again suggest moving to an RLock, or ideally working out which edge case leads to the lock being re-acquired. The latter requires the expertise of the library maintainers, since I was not able to find it. @tomchristie, I see that you modified the thread safety of the library some months ago. Perhaps you have an intuition about what is causing this?
