single client multi-prompt hangs on server #4583

Closed
4 tasks done
jxy opened this issue Dec 22, 2023 · 4 comments

Comments

@jxy

jxy commented Dec 22, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

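I tried the example in #4232 (a single request carrying multiple prompts) and expected the server to return a completion for each prompt.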

Current Behavior

The example in #4232 hangs the server.

$ ./server -m models/mistral-7b-instruct-v0.2.Q8_0.gguf -c 32768 -t 1 -ngl 1 -np 2
{"timestamp":1703215447,"level":"INFO","function":"main","line":2668,"message":"build info","build":1680,"commit":"afefa319"}
{"timestamp":1703215447,"level":"INFO","function":"main","line":2675,"message":"system info","n_threads":1,"n_threads_batch":-1,"total_threads":8,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | "}
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from models/mistral-7b-instruct-v0.2.Q8_0.gguf (version GGUF V3 (latest))
[... omit ...]
Available slots:
 -> Slot 0 - max context: 16384
 -> Slot 1 - max context: 16384

llama server listening at http://127.0.0.1:8080

{"timestamp":1703215448,"level":"INFO","function":"main","line":3097,"message":"HTTP server listening","port":"8080","hostname":"127.0.0.1"}
all slots are idle and system prompt is empty, clear the KV cache
slot 0 is processing [task id: 2]
slot 1 is processing [task id: 3]
slot 0 : kv cache rm - [0, end)
slot 1 : kv cache rm - [0, end)

print_timings: prompt eval time =     888.72 ms /    17 tokens (   52.28 ms per token,    19.13 tokens per second)
print_timings:        eval time =   16917.36 ms /    85 runs   (  199.03 ms per token,     5.02 tokens per second)
print_timings:       total time =   17806.08 ms
slot 0 released (103 tokens in cache)

print_timings: prompt eval time =     888.64 ms /    16 tokens (   55.54 ms per token,    18.01 tokens per second)
print_timings:        eval time =   19226.04 ms /   111 runs   (  173.21 ms per token,     5.77 tokens per second)
print_timings:       total time =   20114.68 ms

On the client side, I ran the example from #4232, but nothing comes back:

$ curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": ["<s>[INST] What is the capital of the US? [/INST]", "<s>[INST] What is the capital of France? [/INST]"], "n_predict": 2048}'
@jxy

jxy commented Dec 22, 2023

Here are the relevant bits from the stack traces of two threads.

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x000000018150c524 libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x0000000181547168 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x0000000181544af8 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 248
    frame #3: 0x0000000181474300 libc++.1.dylib`std::__1::mutex::lock() + 16
    frame #4: 0x0000000100bfc534 server`std::__1::lock_guard<std::__1::mutex>::lock_guard[abi:ue170006](this=0x000000016f2ec830, __m=0x000000016f2ef078) at lock_guard.h:35:10
    frame #5: 0x0000000100bf9578 server`std::__1::lock_guard<std::__1::mutex>::lock_guard[abi:ue170006](this=0x000000016f2ec830, __m=0x000000016f2ef078) at lock_guard.h:34:19
    frame #6: 0x0000000100bf2eac server`llama_server_context::process_tasks(this=0x000000016f2eec20) at server.cpp:1564:45
    frame #7: 0x0000000100b1a7f0 server`llama_server_context::update_slots(this=0x000000016f2eec20) at server.cpp:1578:9
    frame #8: 0x0000000100b152a0 server`main(argc=11, argv=0x000000016f2ef320) at server.cpp:3116:29

Thread 1, in process_tasks(), holds mutex_tasks and is waiting to lock mutex_results (server.cpp:1564).

  thread #4
    frame #0: 0x000000018150c524 libsystem_kernel.dylib`__psynch_mutexwait + 8
    frame #1: 0x0000000181547168 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 84
    frame #2: 0x0000000181544af8 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 248
    frame #3: 0x0000000181474300 libc++.1.dylib`std::__1::mutex::lock() + 16
    frame #4: 0x0000000100bfc534 server`std::__1::lock_guard<std::__1::mutex>::lock_guard[abi:ue170006](this=0x000000016f5a5490, __m=0x000000016f2ef038) at lock_guard.h:35:10
    frame #5: 0x0000000100bf9578 server`std::__1::lock_guard<std::__1::mutex>::lock_guard[abi:ue170006](this=0x000000016f5a5490, __m=0x000000016f2ef038) at lock_guard.h:34:19
    frame #6: 0x0000000100c0d544 server`llama_server_context::update_multi_task(this=0x000000016f2eec20, multitask_id=1, subtask_id=3, result=0x000060000220dc00) at server.cpp:1151:37
    frame #7: 0x0000000100c37de4 server`llama_server_context::next_result(this=0x000000016f2eec20, task_id=1) at server.cpp:1374:21
    frame #8: 0x0000000100c37268 server`main::$_5::operator()(this=0x000000014e704bd8, req=0x000000016f5a6150, res=0x000000016f5a6080) const at server.cpp:2764:48

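Thread 4, in next_result(), holds mutex_results. It found queue_results[i].multitask_id == task_id, called update_multi_task(), and is now waiting to lock mutex_tasks (server.cpp:1151). The two threads take the same two mutexes in opposite order, so each waits forever on the lock the other holds.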
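To illustrate the lock-order inversion described above, here is a minimal sketch of one common way to avoid it: acquiring both mutexes together with std::scoped_lock (C++17). This is not the code from server.cpp and not necessarily what the eventual fix does. Only the names mutex_tasks and mutex_results come from the traces; demo_context, process_tasks_like, next_result_like and run_both are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the server state. Only the two mutex names
// (mutex_tasks, mutex_results) are taken from the stack traces above.
struct demo_context {
    std::mutex mutex_tasks;
    std::mutex mutex_results;
    int tasks_touched   = 0;  // guarded by mutex_tasks
    int results_touched = 0;  // guarded by mutex_results

    // Same shape as thread 1: wants the tasks lock, then the results lock.
    void process_tasks_like() {
        // std::scoped_lock locks both mutexes using a deadlock-avoidance
        // algorithm, so the order of the arguments does not matter.
        std::scoped_lock lock(mutex_tasks, mutex_results);
        ++tasks_touched;
        ++results_touched;
    }

    // Same shape as thread 4: wants the results lock, then the tasks lock.
    void next_result_like() {
        std::scoped_lock lock(mutex_results, mutex_tasks);
        ++results_touched;
        ++tasks_touched;
    }
};

// Runs both code paths concurrently and returns how many times
// tasks_touched was incremented (2 * iterations if nothing deadlocks).
int run_both(int iterations) {
    demo_context ctx;
    std::thread a([&] { for (int i = 0; i < iterations; ++i) ctx.process_tasks_like(); });
    std::thread b([&] { for (int i = 0; i < iterations; ++i) ctx.next_result_like(); });
    a.join();
    b.join();
    // Both threads have finished, so reading the counters here is safe.
    assert(ctx.tasks_touched == ctx.results_touched);
    return ctx.tasks_touched;
}
```

If each function instead took its two locks one at a time with separate std::lock_guard objects, in the opposite orders shown in the traces, the two threads could block each other exactly as reported. Another option, not shown here, is to release the first lock before calling into code that needs the second one.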

@ziedbha

ziedbha commented Jan 13, 2024

Thanks for accurately reporting this issue! The fix should be here: #4905


This issue is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Mar 18, 2024

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 2, 2024