
[Bug]: CUDA out of memory problem caused by keyword analysis #5262

Open
undcloud opened this issue Feb 23, 2025 · 1 comment
Labels
bug Something isn't working

Comments


Is there an existing issue for the same bug?

  • I have checked the existing issues.

RAGFlow workspace code commit ID

7b3d700

RAGFlow image version

v0.16.0-24-gd197f336 full

Other environment information

Driver Version: 550.120       
CUDA Version: 12.4  
GPU: NVIDIA GeForce RTX 4090

Actual behavior

With Keyword Analysis turned on, CUDA memory accumulates and is not released, eventually causing an out-of-memory error.
With Keyword Analysis turned off, questions are answered normally.

=============================================
Bug From UI:
ERROR: CUDA out of memory. Tried to allocate 13.60 GiB. GPU 0 has a total capacity of 23.64 GiB of which 5.43 GiB is free. Process 2526843 has 1.08 GiB memory in use. Process 2526841 has 17.05 GiB memory in use. Of the allocated memory 4.45 GiB is allocated by PyTorch, and 12.14 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
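
For what it's worth, the allocator hint quoted in the error can be tried without changing RAGFlow code, as long as the variable is set before torch first initializes CUDA in the serving process. A minimal sketch (setting it from Python at process start is just one option; exporting it in the container environment works as well):

import os

# Must be set before torch initializes CUDA. "expandable_segments" reduces
# fragmentation from repeated large, variable-sized allocations such as
# rerank batches, which matches the "reserved but unallocated" note above.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported only after the variable is in place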

=============================================
Bug From Log:
1efbf370242ac120006 HTTP/1.1" 200 -
2025-02-23 11:34:34,021 INFO 15 172. - - [23/Feb/2025 11:34:34] "GET /v1/dialog/list HTTP/1.1" 200 -
2025-02-23 11:34:40,446 INFO 15 HTTP Request: POST http://10.12:9998/v1/chat/completions "HTTP/1.1 200 OK"
2025-02-23 11:34:40,490 ERROR 15 LLMBundle.encode_queries can't update token usage for c68d4aa4eabd11efac6c0242ac120006/EMBEDDING used_tokens: 247
2025-02-23 11:34:40,746 INFO 15 POST http://es01:9200/ragflow_c68d4aa4eabd11efac6c0242ac120006/_search [status:200 duration:0.222s]
Compute Scores: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/ragflow/api/apps/conversation_app.py", line 230, in stream
for ans in chat(dia, msg, True, **req):
File "/ragflow/api/db/services/dialog_service.py", line 271, in chat
kbinfos = retriever.retrieval(" ".join(questions), embd_mdl, tenant_ids, dialog.kb_ids, 1, dialog.top_n,
File "<@beartype(rag.nlp.search.Dealer.retrieval) at 0x7e3928a6d7e0>", line 35, in retrieval
File "/ragflow/rag/nlp/search.py", line 366, in retrieval
sim, tsim, vsim = self.rerank_by_model(rerank_mdl,
File "<@beartype(rag.nlp.search.Dealer.rerank_by_model) at 0x7e3928a6d6c0>", line 35, in rerank_by_model
File "/ragflow/rag/nlp/search.py", line 327, in rerank_by_model
vtsim, _ = rerank_mdl.similarity(query, [rmSpace(" ".join(tks)) for tks in ins_tw])
File "<@beartype(api.db.services.llm_service.LLMBundle.similarity) at 0x7e38ec7541f0>", line 50, in similarity
File "/ragflow/api/db/services/llm_service.py", line 251, in similarity
sim, used_tokens = self.mdl.similarity(query, texts)
File "<@beartype(rag.llm.rerank_model.DefaultRerank.similarity) at 0x7e39198c80d0>", line 50, in similarity
File "/ragflow/rag/llm/rerank_model.py", line 98, in similarity
scores = self._model.compute_score(pairs[i:i + batch_size], max_length=2048)
File "/ragflow/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return
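
The last frame points at the batching loop around the compute_score call in rag/llm/rerank_model.py. Below is a rough workaround sketch of that loop, not the actual RAGFlow code: it assumes FlagEmbedding's FlagReranker (which backs BAAI/bge-reranker-v2-m3), a smaller batch size, and an explicit cache release between batches so the reserved-but-unallocated memory reported above does not keep growing:

import numpy as np
import torch
from FlagEmbedding import FlagReranker

def rerank_similarity(model: FlagReranker, query: str, texts: list[str],
                      batch_size: int = 8, max_length: int = 2048) -> np.ndarray:
    # Score (query, passage) pairs in small batches.
    pairs = [(query, t) for t in texts]
    scores = []
    for i in range(0, len(pairs), batch_size):
        with torch.no_grad():
            batch_scores = model.compute_score(pairs[i:i + batch_size],
                                               max_length=max_length)
        # compute_score returns a float for a single pair, a list otherwise.
        if not isinstance(batch_scores, list):
            batch_scores = [batch_scores]
        scores.extend(batch_scores)
        # Assumption: dropping PyTorch's cached blocks between batches keeps
        # the reserved memory from accumulating across requests.
        torch.cuda.empty_cache()
    return np.array(scores)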

Expected behavior

CUDA memory is released normally, does not overflow, and Q&A works as expected.

Steps to reproduce

1. git clone
2. docker compose -f docker/docker-compose-gpu.yml up -d

3. Turn on Keyword Analysis and chat: CUDA memory accumulates and is not released, eventually overflowing. With Keyword Analysis turned off, questions are answered normally.


Chat Configuration:
   - Show quote  True
   - Multi-turn optimization  False
   - Text to speech  False
   - Rerank model  BAAI/bge-reranker-v2-m3
   - Use knowledge graph  False
   - Variable  False
   - Max tokens  False

Additional information

Code and configuration changes made:

DOC_MAXIMUM_SIZE = int(os.environ.get("MAX_CONTENT_LENGTH", 128 * 1024 * 1024))
===>
DOC_MAXIMUM_SIZE = int(os.environ.get("MAX_CONTENT_LENGTH", 12800 * 1024 * 1024))

MAX_CONTENT_LENGTH: "134217728"
===>
MAX_CONTENT_LENGTH: "13421772800"

client_max_body_size 128M;
===>
client_max_body_size 12800M;

- HF_ENDPOINT=${HF_ENDPOINT}
===>
- HF_ENDPOINT=https://hf-mirror.com/

RAGFLOW_IMAGE=infiniflow/ragflow:v0.16.0-slim
===>
RAGFLOW_IMAGE=infiniflow/ragflow:v0.16.0

os.environ.get("MAX_CONTENT_LENGTH", 128 * 1024 * 1024)
===>
os.environ.get("MAX_CONTENT_LENGTH", 12800 * 1024 * 1024)
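
As a sanity check on the numbers above: client_max_body_size is given in MiB, while MAX_CONTENT_LENGTH and DOC_MAXIMUM_SIZE are in bytes, so all three should describe the same limit. Plain arithmetic, not RAGFlow code:

MIB = 1024 * 1024

# original limits: 128M in nginx <-> 134217728 bytes in MAX_CONTENT_LENGTH
assert 128 * MIB == 134217728

# raised limits: 12800M in nginx <-> 13421772800 bytes in MAX_CONTENT_LENGTH
assert 12800 * MIB == 13421772800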
@undcloud (Author) commented:

Thank you~
