
[Bug]: CUDA out of memory problem caused by keyword analysis #5262

Open
undcloud opened this issue Feb 23, 2025 · 1 comment
Labels
bug Something isn't working

Comments


Is there an existing issue for the same bug?

  • I have checked the existing issues.

RAGFlow workspace code commit ID

7b3d700

RAGFlow image version

v0.16.0-24-gd197f336 full

Other environment information

Driver Version: 550.120       
CUDA Version: 12.4  
GPU: NVIDIA GeForce RTX 4090

Actual behavior

With Keyword Analysis turned on, CUDA memory accumulates and is not released, eventually causing an out-of-memory error.
With Keyword Analysis turned off, questions are answered normally.

=============================================
Bug From UI:
ERROR: CUDA out of memory. Tried to allocate 13.60 GiB. GPU 0 has a total capacity of 23.64 GiB of which 5.43 GiB is free. Process 2526843 has 1.08 GiB memory in use. Process 2526841 has 17.05 GiB memory in use. Of the allocated memory 4.45 GiB is allocated by PyTorch, and 12.14 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
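
For what it's worth, the allocator hint quoted in the error can be tried without changing RAGFlow code, as long as the variable is set before torch first initializes CUDA in the serving process. A minimal sketch (setting it from Python at process start is just one option; exporting it in the container environment works as well):

import os

# Must be set before torch initializes CUDA. "expandable_segments" reduces
# fragmentation from repeated large, variable-sized allocations such as
# rerank batches, which matches the "reserved but unallocated" note above.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported only after the variable is in place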

=============================================
Bug From Log:
1efbf370242ac120006 HTTP/1.1" 200 -
2025-02-23 11:34:34,021 INFO 15 172. - - [23/Feb/2025 11:34:34] "GET /v1/dialog/list HTTP/1.1" 200 -
2025-02-23 11:34:40,446 INFO 15 HTTP Request: POST http://10.12:9998/v1/chat/completions "HTTP/1.1 200 OK"
2025-02-23 11:34:40,490 ERROR 15 LLMBundle.encode_queries can't update token usage for c68d4aa4eabd11efac6c0242ac120006/EMBEDDING used_tokens: 247
2025-02-23 11:34:40,746 INFO 15 POST http://es01:9200/ragflow_c68d4aa4eabd11efac6c0242ac120006/_search [status:200 duration:0.222s]
Compute Scores: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/ragflow/api/apps/conversation_app.py", line 230, in stream
for ans in chat(dia, msg, True, **req):
File "/ragflow/api/db/services/dialog_service.py", line 271, in chat
kbinfos = retriever.retrieval(" ".join(questions), embd_mdl, tenant_ids, dialog.kb_ids, 1, dialog.top_n,
File "<@beartype(rag.nlp.search.Dealer.retrieval) at 0x7e3928a6d7e0>", line 35, in retrieval
File "/ragflow/rag/nlp/search.py", line 366, in retrieval
sim, tsim, vsim = self.rerank_by_model(rerank_mdl,
File "<@beartype(rag.nlp.search.Dealer.rerank_by_model) at 0x7e3928a6d6c0>", line 35, in rerank_by_model
File "/ragflow/rag/nlp/search.py", line 327, in rerank_by_model
vtsim, _ = rerank_mdl.similarity(query, [rmSpace(" ".join(tks)) for tks in ins_tw])
File "<@beartype(api.db.services.llm_service.LLMBundle.similarity) at 0x7e38ec7541f0>", line 50, in similarity
File "/ragflow/api/db/services/llm_service.py", line 251, in similarity
sim, used_tokens = self.mdl.similarity(query, texts)
File "<@beartype(rag.llm.rerank_model.DefaultRerank.similarity) at 0x7e39198c80d0>", line 50, in similarity
File "/ragflow/rag/llm/rerank_model.py", line 98, in similarity
scores = self._model.compute_score(pairs[i:i + batch_size], max_length=2048)
File "/ragflow/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return
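
The last frame points at the batching loop around the compute_score call in rag/llm/rerank_model.py. Below is a rough workaround sketch of that loop, not the actual RAGFlow code: it assumes FlagEmbedding's FlagReranker (which backs BAAI/bge-reranker-v2-m3), a smaller batch size, and an explicit cache release between batches so the reserved-but-unallocated memory reported above does not keep growing:

import numpy as np
import torch
from FlagEmbedding import FlagReranker

def rerank_similarity(model: FlagReranker, query: str, texts: list[str],
                      batch_size: int = 8, max_length: int = 2048) -> np.ndarray:
    # Score (query, passage) pairs in small batches.
    pairs = [(query, t) for t in texts]
    scores = []
    for i in range(0, len(pairs), batch_size):
        with torch.no_grad():
            batch_scores = model.compute_score(pairs[i:i + batch_size],
                                               max_length=max_length)
        # compute_score returns a float for a single pair, a list otherwise.
        if not isinstance(batch_scores, list):
            batch_scores = [batch_scores]
        scores.extend(batch_scores)
        # Assumption: dropping PyTorch's cached blocks between batches keeps
        # the reserved memory from accumulating across requests.
        torch.cuda.empty_cache()
    return np.array(scores)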

Expected behavior

CUDA memory is released normally, does not overflow, and Q&A works as expected.

Steps to reproduce

1. git clone
2. docker compose -f docker/docker-compose-gpu.yml up -d

3. Turn on Keyword Analysis and chat: CUDA memory accumulates and is not released, eventually overflowing. With Keyword Analysis turned off, questions are answered normally.


Chat Configuration:
   - Show quote  True
   - Multi-turn optimization  False
   - Text to speech  False
   - Rerank model  BAAI/bge-reranker-v2-m3
   - Use knowledge graph  False
   - Variable  False
   - Max tokens  False

Additional information

Code and configuration changes made:

DOC_MAXIMUM_SIZE = int(os.environ.get("MAX_CONTENT_LENGTH", 128 * 1024 * 1024))
===>
DOC_MAXIMUM_SIZE = int(os.environ.get("MAX_CONTENT_LENGTH", 12800 * 1024 * 1024))

MAX_CONTENT_LENGTH: "134217728"
===>
MAX_CONTENT_LENGTH: "13421772800"

client_max_body_size 128M;
===>
client_max_body_size 12800M;

- HF_ENDPOINT=${HF_ENDPOINT}
===>
- HF_ENDPOINT=https://hf-mirror.com/

RAGFLOW_IMAGE=infiniflow/ragflow:v0.16.0-slim
===>
RAGFLOW_IMAGE=infiniflow/ragflow:v0.16.0

os.environ.get("MAX_CONTENT_LENGTH", 128 * 1024 * 1024)
===>
os.environ.get("MAX_CONTENT_LENGTH", 12800 * 1024 * 1024)
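
As a sanity check on the numbers above: client_max_body_size is given in MiB, while MAX_CONTENT_LENGTH and DOC_MAXIMUM_SIZE are in bytes, so all three should describe the same limit. Plain arithmetic, not RAGFlow code:

MIB = 1024 * 1024

# original limits: 128M in nginx <-> 134217728 bytes in MAX_CONTENT_LENGTH
assert 128 * MIB == 134217728

# raised limits: 12800M in nginx <-> 13421772800 bytes in MAX_CONTENT_LENGTH
assert 12800 * MIB == 13421772800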
@undcloud (Author) commented:

Thank you~
