single client multi-prompt hangs on server #4583

jxy · 2023-12-22T03:38:19Z

Prerequisites

Please answer the following questions for yourself before submitting an issue.

I am running the latest code. Development is very rapid so there are no tagged versions as of now.
I carefully followed the README.md.
I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I tried the multi-prompt example in #4232 and expected the server to return a completion for each prompt.

Current Behavior

The example in #4232 hangs the server.

$ ./server -m models/mistral-7b-instruct-v0.2.Q8_0.gguf -c 32768 -t 1 -ngl 1 -np 2
{"timestamp":1703215447,"level":"INFO","function":"main","line":2668,"message":"build info","build":1680,"commit":"afefa319"}
{"timestamp":1703215447,"level":"INFO","function":"main","line":2675,"message":"system info","n_threads":1,"n_threads_batch":-1,"total_threads":8,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | "}
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from models/mistral-7b-instruct-v0.2.Q8_0.gguf (version GGUF V3 (latest))
[... omit ...]
Available slots:
 -> Slot 0 - max context: 16384
 -> Slot 1 - max context: 16384

llama server listening at http://127.0.0.1:8080

{"timestamp":1703215448,"level":"INFO","function":"main","line":3097,"message":"HTTP server listening","port":"8080","hostname":"127.0.0.1"}
all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 2]
slot 1 is processing [task id: 3]
slot 0 : kv cache rm - [0, end)
slot 1 : kv cache rm - [0, end)

print_timings: prompt eval time =     888.72 ms /    17 tokens (   52.28 ms per token,    19.13 tokens per second)
print_timings:        eval time =   16917.36 ms /    85 runs   (  199.03 ms per token,     5.02 tokens per second)
print_timings:       total time =   17806.08 ms
slot 0 released (103 tokens in cache)

print_timings: prompt eval time =     888.64 ms /    16 tokens (   55.54 ms per token,    18.01 tokens per second)
print_timings:        eval time =   19226.04 ms /   111 runs   (  173.21 ms per token,     5.77 tokens per second)
print_timings:       total time =   20114.68 ms

On the client side, I ran the example from #4232, but no response ever comes back:

$ curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": ["<s>[INST] What is the capital of the US? [/INST]", "<s>[INST] What is the capital of France? [/INST]"], "n_predict": 2048}'


jxy · 2023-12-22T04:42:55Z

Here are the relevant bits from the stack traces of two threads.

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x000000018150c524 libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x0000000181547168 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x0000000181544af8 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 248
    frame #3: 0x0000000181474300 libc++.1.dylib`std::__1::mutex::lock() + 16
    frame #4: 0x0000000100bfc534 server`std::__1::lock_guard<std::__1::mutex>::lock_guard[abi:ue170006](this=0x000000016f2ec830, __m=0x000000016f2ef078) at lock_guard.h:35:10
    frame #5: 0x0000000100bf9578 server`std::__1::lock_guard<std::__1::mutex>::lock_guard[abi:ue170006](this=0x000000016f2ec830, __m=0x000000016f2ef078) at lock_guard.h:34:19
    frame #6: 0x0000000100bf2eac server`llama_server_context::process_tasks(this=0x000000016f2eec20) at server.cpp:1564:45
    frame #7: 0x0000000100b1a7f0 server`llama_server_context::update_slots(this=0x000000016f2eec20) at server.cpp:1578:9
    frame #8: 0x0000000100b152a0 server`main(argc=11, argv=0x000000016f2ef320) at server.cpp:3116:29

Thread 1, in process_tasks(), holds mutex_tasks and is waiting for the lock on mutex_results (server.cpp:1564).

  thread #4
    frame #0: 0x000000018150c524 libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x0000000181547168 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x0000000181544af8 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 248
    frame #3: 0x0000000181474300 libc++.1.dylib`std::__1::mutex::lock() + 16
    frame #4: 0x0000000100bfc534 server`std::__1::lock_guard<std::__1::mutex>::lock_guard[abi:ue170006](this=0x000000016f5a5490, __m=0x000000016f2ef038) at lock_guard.h:35:10
    frame #5: 0x0000000100bf9578 server`std::__1::lock_guard<std::__1::mutex>::lock_guard[abi:ue170006](this=0x000000016f5a5490, __m=0x000000016f2ef038) at lock_guard.h:34:19
    frame #6: 0x0000000100c0d544 server`llama_server_context::update_multi_task(this=0x000000016f2eec20, multitask_id=1, subtask_id=3, result=0x000060000220dc00) at server.cpp:1151:37
    frame #7: 0x0000000100c37de4 server`llama_server_context::next_result(this=0x000000016f2eec20, task_id=1) at server.cpp:1374:21
    frame #8: 0x0000000100c37268 server`main::$_5::operator()(this=0x000000014e704bd8, req=0x000000016f5a6150, res=0x000000016f5a6080) const at server.cpp:2764:48

Thread 4, in next_result(), holds mutex_results. It found queue_results[i].multitask_id == task_id, called update_multi_task(), and is now waiting for mutex_tasks (server.cpp:1151). Each thread holds the mutex the other one needs, so neither can make progress.
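
One common way to rule out this kind of lock-order inversion is to take both mutexes in a single step with std::scoped_lock (C++17), which uses a deadlock-avoidance algorithm regardless of the order in which the mutexes are listed. The sketch below is only an illustration of that idea, not the change made in #4905: the Context struct, the *_sketch functions and run_sketch are hypothetical names, and only mutex_tasks and mutex_results are taken from the traces above.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the server context. Only the two mutex names
// come from the report; everything else is made up for illustration.
struct Context {
    std::mutex mutex_tasks;
    std::mutex mutex_results;
    std::vector<int> queue_tasks;
    std::vector<int> queue_results;
};

// Plays the role of process_tasks(): touches tasks, then results.
void process_tasks_sketch(Context &ctx, int id) {
    // std::scoped_lock acquires both mutexes with a deadlock-avoidance
    // algorithm, so the order they are listed in does not matter.
    std::scoped_lock lock(ctx.mutex_tasks, ctx.mutex_results);
    ctx.queue_tasks.push_back(id);
    ctx.queue_results.push_back(id);
}

// Plays the role of next_result() calling update_multi_task():
// touches results, then tasks (the opposite order).
void next_result_sketch(Context &ctx, int id) {
    std::scoped_lock lock(ctx.mutex_results, ctx.mutex_tasks);
    ctx.queue_results.push_back(id);
    ctx.queue_tasks.push_back(id);
}

// Runs both code paths concurrently and returns the number of queued
// results. With nested lock_guards taken in opposite order, the same
// loop could hang the way the server does in this report.
std::size_t run_sketch(int iterations) {
    Context ctx;
    std::thread main_loop([&] {
        for (int i = 0; i < iterations; ++i) process_tasks_sketch(ctx, i);
    });
    std::thread http_handler([&] {
        for (int i = 0; i < iterations; ++i) next_result_sketch(ctx, i);
    });
    main_loop.join();
    http_handler.join();
    return ctx.queue_results.size();
}
```

Calling run_sketch(1000) runs both paths 1000 times each and returns 2000 once both threads finish. Another option, assuming the code allows it, is to move the result out of the queue while holding mutex_results and release that lock before calling anything that needs mutex_tasks, so the two locks are never held together.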

ziedbha · 2024-01-13T01:41:33Z

Thanks for accurately reporting this issue! The fix should be here: #4905

github-actions · 2024-03-18T01:35:34Z

This issue is stale because it has been open for 30 days with no activity.

github-actions · 2024-04-02T01:10:12Z

This issue was closed because it has been inactive for 14 days since being marked as stale.

jxy added the bug-unconfirmed label Dec 22, 2023

jxy mentioned this issue Dec 22, 2023

server: Completion of pre-tokenized prompt is broken #4476

Closed

ziedbha mentioned this issue Jan 13, 2024

Fix deadlock that occurs in multiprompt scenarios introduced, reported in #4583 #4905

Merged

github-actions bot added the stale label Mar 18, 2024

github-actions bot closed this as completed Apr 2, 2024
