With Keyword Analysis enabled, CUDA memory accumulates and is not released, eventually causing an out-of-memory error. With Keyword Analysis disabled, questions are answered normally.
=============================================
Bug From UI:
ERROR: CUDA out of memory. Tried to allocate 13.60 GiB. GPU 0 has a total capacity of 23.64 GiB of which 5.43 GiB is free. Process 2526843 has 1.08 GiB memory in use. Process 2526841 has 17.05 GiB memory in use. Of the allocated memory 4.45 GiB is allocated by PyTorch, and 12.14 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
=============================================
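Note: the error message itself points at fragmentation (12.14 GiB reserved by PyTorch but unallocated) and suggests the expandable_segments allocator setting. As a first thing to try (not verified as a fix; where exactly to set it in RAGFlow is an assumption, e.g. it could equally go in the environment section of docker/docker-compose-gpu.yml):

```python
import os

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes,
# so it has to be set before the first CUDA allocation; easiest is before
# torch is imported, or in the container environment.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported only after the allocator setting is in place
```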
Bug From Log:
1efbf370242ac120006 HTTP/1.1" 200 -
2025-02-23 11:34:34,021 INFO 15 172. - - [23/Feb/2025 11:34:34] "GET /v1/dialog/list HTTP/1.1" 200 -
2025-02-23 11:34:40,446 INFO 15 HTTP Request: POST http://10.12:9998/v1/chat/completions "HTTP/1.1 200 OK"
2025-02-23 11:34:40,490 ERROR 15 LLMBundle.encode_queries can't update token usage for c68d4aa4eabd11efac6c0242ac120006/EMBEDDING used_tokens: 247
2025-02-23 11:34:40,746 INFO 15 POST http://es01:9200/ragflow_c68d4aa4eabd11efac6c0242ac120006/_search [status:200 duration:0.222s]
Compute Scores: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/ragflow/api/apps/conversation_app.py", line 230, in stream
for ans in chat(dia, msg, True, **req):
File "/ragflow/api/db/services/dialog_service.py", line 271, in chat
kbinfos = retriever.retrieval(" ".join(questions), embd_mdl, tenant_ids, dialog.kb_ids, 1, dialog.top_n,
File "<@beartype(rag.nlp.search.Dealer.retrieval) at 0x7e3928a6d7e0>", line 35, in retrieval
File "/ragflow/rag/nlp/search.py", line 366, in retrieval
sim, tsim, vsim = self.rerank_by_model(rerank_mdl,
File "<@beartype(rag.nlp.search.Dealer.rerank_by_model) at 0x7e3928a6d6c0>", line 35, in rerank_by_model
File "/ragflow/rag/nlp/search.py", line 327, in rerank_by_model
vtsim, _ = rerank_mdl.similarity(query, [rmSpace(" ".join(tks)) for tks in ins_tw])
File "<@beartype(api.db.services.llm_service.LLMBundle.similarity) at 0x7e38ec7541f0>", line 50, in similarity
File "/ragflow/api/db/services/llm_service.py", line 251, in similarity
sim, used_tokens = self.mdl.similarity(query, texts)
File "<@beartype(rag.llm.rerank_model.DefaultRerank.similarity) at 0x7e39198c80d0>", line 50, in similarity
File "/ragflow/rag/llm/rerank_model.py", line 98, in similarity
scores = self._model.compute_score(pairs[i:i + batch_size], max_length=2048)
File "/ragflow/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return
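The traceback stops inside the batched compute_score loop in rag/llm/rerank_model.py, which is where memory keeps growing. Purely as an illustration of the kind of change that would limit per-request growth (this is not the actual RAGFlow code or an official patch; the small batch size, the FlagEmbedding-style compute_score call, and the explicit cache release are assumptions), a rerank loop that scores in small batches and frees cached blocks after each batch could look like:

```python
import numpy as np
import torch


def rerank_similarity(model, query, texts, batch_size=8, max_length=2048):
    # Pair the query with every candidate chunk, as the cross-encoder expects.
    pairs = [(query, t) for t in texts]
    scores = []
    for i in range(0, len(pairs), batch_size):
        # no_grad keeps autograd state from accumulating across batches.
        with torch.no_grad():
            batch_scores = model.compute_score(pairs[i:i + batch_size],
                                               max_length=max_length)
        # compute_score returns a float for a single pair; normalize to a list.
        if not isinstance(batch_scores, list):
            batch_scores = [batch_scores]
        scores.extend(batch_scores)
        # Hand cached blocks back to the allocator so "reserved but unallocated"
        # memory does not keep growing request after request.
        torch.cuda.empty_cache()
    return np.array(scores)
```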
Expected behavior
GPU memory is released normally after each request, does not overflow, and Q&A works as expected.
Steps to reproduce
1. git clone https://github.com/infiniflow/ragflow.git
2. docker compose -f docker/docker-compose-gpu.yml up -d
3. Enable Keyword Analysis in the chat assistant and ask questions: CUDA memory accumulates and is not released, eventually overflowing. With Keyword Analysis disabled, Q&A works normally.
Chat Configuration:
- Show quote: True
- Multi-turn optimization: False
- Text to speech: False
- Rerank model: BAAI/bge-reranker-v2-m3
- Use knowledge graph: False
- Variable: False
- Max tokens: False
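For anyone who wants to reproduce the reranker's memory behavior outside RAGFlow, the configured model can be driven directly through the FlagEmbedding package (rough sketch; use_fp16 and the sample texts are assumptions, the model name and max_length match the report above):

```python
from FlagEmbedding import FlagReranker

# Same rerank model as configured in the chat assistant above.
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

query = "example question"
chunks = [f"candidate passage {i} ..." for i in range(32)]

# Score query/chunk pairs the way RAGFlow's rerank step does; repeating this
# while watching nvidia-smi shows whether GPU memory is released between calls.
scores = reranker.compute_score([(query, c) for c in chunks], max_length=2048)
print(scores[:5])
```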
Is there an existing issue for the same bug?
RAGFlow workspace code commit ID
7b3d700
RAGFlow image version
v0.16.0-24-gd197f336 full
Other environment information
Actual behavior
With Keyword Analysis enabled, CUDA memory accumulates and is not released, eventually producing the CUDA out-of-memory error and traceback shown above. With Keyword Analysis disabled, Q&A works normally.
Expected behavior
GPU memory is released normally after each request, does not overflow, and Q&A works as expected.
Steps to reproduce
See the steps listed above.
Additional information
Changed code settings: