diff --git a/.github/workflows/docker/compose/llms-compose.yaml b/.github/workflows/docker/compose/llms-compose.yaml
index c7ea529cda..f60aa425e9 100644
--- a/.github/workflows/docker/compose/llms-compose.yaml
+++ b/.github/workflows/docker/compose/llms-compose.yaml
@@ -24,11 +24,3 @@ services:
     build:
       dockerfile: comps/llms/text-generation/vllm/langchain/Dockerfile
     image: ${REGISTRY:-opea}/llm-vllm:${TAG:-latest}
-  llm-vllm-ray:
-    build:
-      dockerfile: comps/llms/text-generation/vllm/ray/Dockerfile
-    image: ${REGISTRY:-opea}/llm-vllm-ray:${TAG:-latest}
-  llm-vllm-ray-hpu:
-    build:
-      dockerfile: comps/llms/text-generation/vllm/ray/dependency/Dockerfile
-    image: ${REGISTRY:-opea}/llm-vllm-ray-hpu:${TAG:-latest}
diff --git a/comps/llms/text-generation/README.md b/comps/llms/text-generation/README.md
index d148dbbb2b..d3dc89e8b6 100644
--- a/comps/llms/text-generation/README.md
+++ b/comps/llms/text-generation/README.md
@@ -2,20 +2,20 @@
 This microservice, designed for Language Model Inference (LLM), processes input consisting of a query string and associated reranked documents. It constructs a prompt based on the query and documents, which is then used to perform inference with a large language model. The service delivers the inference results as output.
 
-A prerequisite for using this microservice is that users must have a LLM text generation service (etc., TGI, vLLM and Ray) already running. Users need to set the LLM service's endpoint into an environment variable. The microservice utilizes this endpoint to create an LLM object, enabling it to communicate with the LLM service for executing language model operations.
+A prerequisite for using this microservice is that users must have an LLM text generation service (e.g., TGI or vLLM) already running. Users need to set the LLM service's endpoint in an environment variable. The microservice uses this endpoint to create an LLM object, enabling it to communicate with the LLM service for executing language model operations.
 
-Overall, this microservice offers a streamlined way to integrate large language model inference into applications, requiring minimal setup from the user beyond initiating a TGI/vLLM/Ray service and configuring the necessary environment variables. This allows for the seamless processing of queries and documents to generate intelligent, context-aware responses.
+Overall, this microservice offers a streamlined way to integrate large language model inference into applications, requiring minimal setup from the user beyond starting a TGI or vLLM service and configuring the necessary environment variables. This allows for the seamless processing of queries and documents to generate intelligent, context-aware responses.
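+For example, assuming a vLLM server is already listening on port 8008 of the host (the host IP, port, and model below are illustrative placeholders; the exact variable names expected by each backend are given in the sections that follow), the endpoint can be exported before launching the microservice:
+
+```bash
+# Illustrative values only -- substitute your own host IP, port, and model
+export vLLM_ENDPOINT="http://${your_ip}:8008"
+export LLM_MODEL=${your_hf_llm_model}
+```
+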
## Validated LLM Models -| Model | TGI-Gaudi | vLLM-CPU | vLLM-Gaudi | Ray | -| --------------------------- | --------- | -------- | ---------- | --- | -| [Intel/neural-chat-7b-v3-3] | ✓ | ✓ | ✓ | ✓ | -| [Llama-2-7b-chat-hf] | ✓ | ✓ | ✓ | ✓ | -| [Llama-2-70b-chat-hf] | ✓ | - | ✓ | x | -| [Meta-Llama-3-8B-Instruct] | ✓ | ✓ | ✓ | ✓ | -| [Meta-Llama-3-70B-Instruct] | ✓ | - | ✓ | x | -| [Phi-3] | x | Limit 4K | Limit 4K | ✓ | +| Model | TGI-Gaudi | vLLM-CPU | vLLM-Gaudi | +| --------------------------- | --------- | -------- | ---------- | +| [Intel/neural-chat-7b-v3-3] | ✓ | ✓ | ✓ | +| [Llama-2-7b-chat-hf] | ✓ | ✓ | ✓ | +| [Llama-2-70b-chat-hf] | ✓ | - | ✓ | +| [Meta-Llama-3-8B-Instruct] | ✓ | ✓ | ✓ | +| [Meta-Llama-3-70B-Instruct] | ✓ | - | ✓ | +| [Phi-3] | x | Limit 4K | Limit 4K | ## Clone OPEA GenAIComps @@ -121,53 +121,6 @@ export vLLM_ENDPOINT="http://${vLLM_HOST_IP}:8008" python llm.py ``` -#### 1.2.3 Start the Ray Service - -Install the requirements for Ray Service - -```bash -cd ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/ray - -pip install -r requirements.txt -``` - -Execute the docker run command to initiate the backend, along with the Python script that launches the microservice. - -```bash -export vLLM_RAY_HOST_IP=$(hostname -I | awk '{print $1}') # This sets IP of the current machine -export LLM_MODEL=${your_hf_llm_model} -export DATA_DIR=$HOME/data # Location to download the model -export HF_TOKEN=${your_hf_api_token} - -# Build the image first as opea/vllm:cpu -bash ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/ray/dependency/build_docker_vllmray.sh - -# Initiate the backend -docker run \ - --name="vllm-ray-service" \ - --runtime=habana \ - -v $DATA_DIR:/data \ - -e HABANA_VISIBLE_DEVICES=all \ - -e OMPI_MCA_btl_vader_single_copy_mechanism=none \ - --cap-add=sys_nice \ - --ipc=host \ - -p 8006:8000 \ - -e HF_TOKEN=$HF_TOKEN \ - opea/vllm_ray:habana \ - /bin/bash -c " \ - ray start --head && \ - python vllm_ray_openai.py \ - --port_number 8000 \ - --model_id_or_path $LLM_MODEL \ - --tensor_parallel_size 2 \ - --enforce_eager False" - -# Start the microservice with an endpoint as the above docker run command -export vLLM_RAY_ENDPOINT="http://${vLLM_RAY_HOST_IP}:8006" - -python llm.py -``` - ## 🚀2. Start Microservice with Docker (Option 2) In order to start the microservices with docker, you need to build the docker images first for the microservice. @@ -203,22 +156,6 @@ docker build \ -f comps/llms/text-generation/vllm/langchain/Dockerfile . ``` -#### 2.1.3 Ray - -```bash -# Build the Ray Serve docker -bash ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/ray/dependency/build_docker_vllmray.sh - -# Build the microservice docker -cd ${OPEA_GENAICOMPS_ROOT} - -docker build \ - --build-arg https_proxy=$https_proxy \ - --build-arg http_proxy=$http_proxy \ - -t opea/llm-vllm-ray:latest \ - -f comps/llms/text-generation/vllm/ray/Dockerfile . -``` - ### 2.2 Start LLM Service with the built image To start a docker container, you have two options: @@ -247,15 +184,6 @@ export vLLM_LLM_ENDPOINT="http://${your_ip}:8008" export LLM_MODEL=${your_hf_llm_model} ``` -In order to start Ray serve and LLM services, you need to setup the following environment variables first. 
- -```bash -export HF_TOKEN=${your_hf_api_token} -export RAY_Serve_ENDPOINT="http://${your_ip}:8008" -export LLM_MODEL=${your_hf_llm_model} -export CHAT_PROCESSOR="ChatModelLlama" -``` - ### 2.3 Run Docker with CLI (Option A) #### 2.3.1 TGI @@ -311,29 +239,6 @@ docker run \ opea/llm-vllm:latest ``` -#### 2.3.3 Ray Serve - -Start Ray Serve endpoint. - -```bash -bash ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/ray/dependency/launch_vllmray.sh -``` - -Start Ray Serve microservice. - -```bash -docker run -d \ - --name="llm-ray-server" \ - -p 9000:9000 \ - --ipc=host \ - -e http_proxy=$http_proxy \ - -e https_proxy=$https_proxy \ - -e RAY_Serve_ENDPOINT=$RAY_Serve_ENDPOINT \ - -e HF_TOKEN=$HF_TOKEN \ - -e LLM_MODEL=$LLM_MODEL \ - opea/llm-ray:latest -``` - ### 2.4 Run Docker with Docker Compose (Option B) #### 2.4.1 TGI @@ -350,13 +255,6 @@ cd ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/langchain docker compose -f docker_compose_llm.yaml up -d ``` -#### 2.4.3 Ray Serve - -```bash -cd ${OPEA_GENAICOMPS_ROOT}/comps/llms/text-generation/vllm/ray -docker compose -f docker_compose_llm.yaml up -d -``` - ## 🚀3. Consume LLM Service ### 3.1 Check Service Status @@ -391,22 +289,6 @@ curl http://${your_ip}:8008/v1/completions \ }' ``` -#### 3.2.3 Verify the Ray Service - -```bash -curl http://${your_ip}:8008/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "model": ${your_hf_llm_model}, - "messages": [ - {"role": "assistant", "content": "You are a helpful assistant."}, - {"role": "user", "content": "What is Deep Learning?"} - ], - "max_tokens": 32, - "stream": true - }' -``` - ### 3.3 Consume LLM Service You can set the following model parameters according to your actual needs, such as `max_tokens`, `streaming`. diff --git a/comps/llms/text-generation/ray_serve/llm.py b/comps/llms/text-generation/ray_serve/llm.py deleted file mode 100644 index c09bb45848..0000000000 --- a/comps/llms/text-generation/ray_serve/llm.py +++ /dev/null @@ -1,82 +0,0 @@ -# Copyright (c) 2024 Intel Corporation -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os - -from fastapi.responses import StreamingResponse -from langchain_openai import ChatOpenAI - -from comps import GeneratedDoc, LLMParamsDoc, ServiceType, opea_microservices, register_microservice - - -def post_process_text(text: str): - if text == " ": - return "data: @#$\n\n" - if text == "\n": - return "data:
\n\n" - if text.isspace(): - return None - new_text = text.replace(" ", "@#$") - return f"data: {new_text}\n\n" - - -@register_microservice( - name="opea_service@llm_ray", - service_type=ServiceType.LLM, - endpoint="/v1/chat/completions", - host="0.0.0.0", - port=9000, -) -async def llm_generate(input: LLMParamsDoc): - llm_endpoint = os.getenv("RAY_Serve_ENDPOINT", "http://localhost:8080") - llm_model = os.getenv("LLM_MODEL", "Llama-2-7b-chat-hf") - if "/" in llm_model: - llm_model = llm_model.split("/")[-1] - llm = ChatOpenAI( - openai_api_base=llm_endpoint + "/v1", - model_name=llm_model, - openai_api_key=os.getenv("OPENAI_API_KEY", "not_needed"), - max_tokens=input.max_tokens, - temperature=input.temperature, - streaming=input.streaming, - request_timeout=600, - ) - - if input.streaming: - - async def stream_generator(): - chat_response = "" - async for text in llm.astream(input.query): - text = text.content - chat_response += text - processed_text = post_process_text(text) - if text and processed_text: - if "" in text: - res = text.split("")[0] - if res != "": - yield res - break - yield processed_text - print(f"[llm - chat_stream] stream response: {chat_response}") - yield "data: [DONE]\n\n" - - return StreamingResponse(stream_generator(), media_type="text/event-stream") - else: - response = await llm.ainvoke(input.query) - response = response.content - return GeneratedDoc(text=response, prompt=input.query) - - -if __name__ == "__main__": - opea_microservices["opea_service@llm_ray"].start() diff --git a/comps/llms/text-generation/ray_serve/requirements.txt b/comps/llms/text-generation/ray_serve/requirements.txt deleted file mode 100644 index 87aef99929..0000000000 --- a/comps/llms/text-generation/ray_serve/requirements.txt +++ /dev/null @@ -1,14 +0,0 @@ -docarray[full] -fastapi -huggingface_hub -langchain -langchain_community -langchain_openai -openai -opentelemetry-api -opentelemetry-exporter-otlp -opentelemetry-sdk -prometheus-fastapi-instrumentator -ray[serve]>=2.10 -shortuuid -transformers diff --git a/comps/llms/text-generation/vllm/ray/Dockerfile b/comps/llms/text-generation/vllm/ray/Dockerfile deleted file mode 100644 index cf6e518df6..0000000000 --- a/comps/llms/text-generation/vllm/ray/Dockerfile +++ /dev/null @@ -1,25 +0,0 @@ -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -FROM python:3.11-slim - -RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \ - libgl1-mesa-glx \ - libjemalloc-dev - -RUN useradd -m -s /bin/bash user && \ - mkdir -p /home/user && \ - chown -R user /home/user/ - -USER user - -COPY comps /home/user/comps - -RUN pip install --no-cache-dir --upgrade pip setuptools && \ - pip install --no-cache-dir -r /home/user/comps/llms/text-generation/vllm/ray/requirements.txt - -ENV PYTHONPATH=$PYTHONPATH:/home/user - -WORKDIR /home/user/comps/llms/text-generation/vllm/ray - -ENTRYPOINT ["python", "llm.py"] diff --git a/comps/llms/text-generation/vllm/ray/README.md b/comps/llms/text-generation/vllm/ray/README.md deleted file mode 100644 index 14c2f6058c..0000000000 --- a/comps/llms/text-generation/vllm/ray/README.md +++ /dev/null @@ -1,89 +0,0 @@ -# VLLM-Ray Endpoint Service - -[Ray](https://docs.ray.io/en/latest/serve/index.html) is an LLM serving solution that makes it easy to deploy and manage a variety of open source LLMs. 
Built on [Ray Serve](https://docs.ray.io/en/latest/serve/index.html) it has native support for autoscaling and multi-node deployments, and is easy to use for LLM inference serving across multiple platforms. - -[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving, it delivers state-of-the-art serving throughput with a set of advanced features such as PagedAttention and Continuous Batching among others. Besides GPUs, vLLM supports [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html) and [Intel Gaudi accelerators](https://habana.ai/products). - -This guide provides an example on how to launch vLLM with Ray serve endpoint on [Intel Gaudi2 Accelerator](https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi-overview.html). - -## Set up environment - -```bash -export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token} -export vLLM_RAY_ENDPOINT="http://${your_ip}:8006" -export LLM_MODEL=${your_hf_llm_model} -``` - -For gated models such as `LLAMA-2`, you will have to pass the environment HUGGINGFACEHUB_API_TOKEN. Please follow this link [huggingface token](https://huggingface.co/docs/hub/security-tokens) to get the access token and export `HUGGINGFACEHUB_API_TOKEN` environment with the token. - -## Set up VLLM Ray Gaudi Service - -First of all, go to the server folder for vllm. - -```bash -cd dependency -``` - -### Build docker - -```bash -bash ./build_docker_vllmray.sh -``` - -### Launch the service - -```bash -bash ./launch_vllmray.sh -``` - -The `launch_vllmray.sh` script accepts three parameters: - -- port_number: The port number assigned to the Ray Gaudi endpoint, with the default being 8006. -- model_name: The model name utilized for LLM, with the default set to meta-llama/Llama-2-7b-chat-hf. -- parallel_number: The number of HPUs specifies the number of HPUs per worker process, the default is set to 2. -- enforce_eager: Whether to enforce eager execution, default to be False. - -If you want to customize the setting, can run: - -```bash -bash ./launch_vllmray.sh ${port_number} ${model_name} ${parallel_number} False/True -``` - -### Query the service - -And then you can make requests with the OpenAI-compatible APIs like below to check the service status: - -```bash -curl http://${your_ip}:8006/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{"model": $LLM_MODEL, "messages": [{"role": "user", "content": "How are you?"}]}' -``` - -For more information about the OpenAI APIs, you can checkeck the [OpenAI official document](https://platform.openai.com/docs/api-reference/). - -## Set up OPEA microservice - -Then we warp the VLLM Ray service into OPEA microcervice. 
- -### Build docker - -```bash -bash ./build_docker_microservice.sh -``` - -### Launch the microservice - -```bash -bash ./launch_microservice.sh -``` - -### Query the microservice - -```bash -curl http://${your_ip}:9000/v1/chat/completions \ - -X POST \ - -d '{"query":"What is Deep Learning?","max_tokens":17,"top_p":1,"temperature":0.7,"frequency_penalty":0,"presence_penalty":0, "streaming":false}' \ - -H 'Content-Type: application/json' -``` - -For parameters, can refer to [LangChain ChatOpenAI API](https://python.langchain.com/v0.2/api_reference/openai/chat_models/langchain_openai.chat_models.base.ChatOpenAI.html) diff --git a/comps/llms/text-generation/vllm/ray/__init__.py b/comps/llms/text-generation/vllm/ray/__init__.py deleted file mode 100644 index 916f3a44b2..0000000000 --- a/comps/llms/text-generation/vllm/ray/__init__.py +++ /dev/null @@ -1,2 +0,0 @@ -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 diff --git a/comps/llms/text-generation/vllm/ray/build_docker_microservice.sh b/comps/llms/text-generation/vllm/ray/build_docker_microservice.sh deleted file mode 100644 index c53684af81..0000000000 --- a/comps/llms/text-generation/vllm/ray/build_docker_microservice.sh +++ /dev/null @@ -1,9 +0,0 @@ -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -cd ../../../../../ -docker build \ - -t opea/llm-vllm-ray:latest \ - --build-arg https_proxy=$https_proxy \ - --build-arg http_proxy=$http_proxy \ - -f comps/llms/text-generation/vllm/ray/Dockerfile . diff --git a/comps/llms/text-generation/vllm/ray/dependency/Dockerfile b/comps/llms/text-generation/vllm/ray/dependency/Dockerfile deleted file mode 100644 index 6746f9da0e..0000000000 --- a/comps/llms/text-generation/vllm/ray/dependency/Dockerfile +++ /dev/null @@ -1,29 +0,0 @@ -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -FROM vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0 AS hpu - -ENV LANG=en_US.UTF-8 - -WORKDIR /home/user/vllm/ray - -# copy the source code to the package directory -COPY comps/llms/text-generation/vllm/ray /home/user/vllm/ray - -RUN pip install --no-cache-dir --upgrade-strategy eager optimum[habana] && \ - pip install --no-cache-dir git+https://github.com/HabanaAI/DeepSpeed.git@1.15.1 -RUN pip install --no-cache-dir -v git+https://github.com/HabanaAI/vllm-fork.git@cf6952d -RUN pip install --no-cache-dir "ray>=2.10" "ray[serve,tune]>=2.10" - -RUN ssh-keygen -A && \ - sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \ - service ssh restart - -ENV PYTHONPATH=$PYTHONPATH:/root:/home/user/vllm/ray - -# Required by DeepSpeed -ENV RAY_EXPERIMENTAL_NOSET_HABANA_VISIBLE_MODULES=1 - -ENV PT_HPU_LAZY_ACC_PAR_MODE=0 - -ENV PT_HPU_ENABLE_LAZY_COLLECTIVES=true diff --git a/comps/llms/text-generation/vllm/ray/dependency/build_docker_vllmray.sh b/comps/llms/text-generation/vllm/ray/dependency/build_docker_vllmray.sh deleted file mode 100644 index 38a5a15dee..0000000000 --- a/comps/llms/text-generation/vllm/ray/dependency/build_docker_vllmray.sh +++ /dev/null @@ -1,17 +0,0 @@ -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -CURRENT_DIR=$(pwd) - -# Go to top level directory of this repo and build the image -cd "$(git rev-parse --show-toplevel)" - -docker build \ - -f comps/llms/text-generation/vllm/ray/dependency/Dockerfile \ - -t opea/vllm_ray:habana \ - --network=host \ - --build-arg http_proxy=${http_proxy} \ - --build-arg 
https_proxy=${https_proxy} \ - --build-arg no_proxy=${no_proxy} . - -cd $CURRENT_DIR diff --git a/comps/llms/text-generation/vllm/ray/dependency/launch_vllmray.sh b/comps/llms/text-generation/vllm/ray/dependency/launch_vllmray.sh deleted file mode 100644 index ec8329a010..0000000000 --- a/comps/llms/text-generation/vllm/ray/dependency/launch_vllmray.sh +++ /dev/null @@ -1,43 +0,0 @@ -#!/bin/bash - -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -# Set default values -default_port=8006 -default_model=$LLM_MODEL -default_parallel_number=2 -default_enforce_eager=False - -# Assign arguments to variables -port_number=${1:-$default_port} -model_name=${2:-$default_model} -parallel_number=${3:-$default_parallel_number} -enforce_eager=${4:-$default_enforce_eager} - -# Check if all required arguments are provided -if [ "$#" -lt 0 ] || [ "$#" -gt 3 ]; then - echo "Usage: $0 [port_number] [model_name] [parallel_number] [enforce_eager]" - echo "Please customize the arguments you want to use. - - port_number: The port number assigned to the Ray Gaudi endpoint, with the default being 8080. - - model_name: The model name utilized for LLM, with the default set to meta-llama/Llama-2-7b-chat-hf. - - parallel_number: The number of HPUs specifies the number of HPUs per worker process. - - enforce_eager: Whether to enforce eager execution, default to be True." - exit 1 -fi - -# Build the Docker run command based on the number of cards -docker run -d --rm \ - --name="vllm-ray-service" \ - --runtime=habana \ - -v $DATA_DIR:/data \ - -e HABANA_VISIBLE_DEVICES=all \ - -e OMPI_MCA_btl_vader_single_copy_mechanism=none \ - --cap-add=sys_nice \ - --ipc=host \ - -p $port_number:8000 \ - -e HTTPS_PROXY=$https_proxy \ - -e HTTP_PROXY=$https_proxy \ - -e HF_TOKEN=$HF_TOKEN \ - opea/vllm_ray:habana \ - /bin/bash -c "ray start --head && python vllm_ray_openai.py --port_number 8000 --model_id_or_path $model_name --tensor_parallel_size $parallel_number --enforce_eager $enforce_eager" diff --git a/comps/llms/text-generation/vllm/ray/docker_compose_llm.yaml b/comps/llms/text-generation/vllm/ray/docker_compose_llm.yaml deleted file mode 100644 index 4a5f058d9b..0000000000 --- a/comps/llms/text-generation/vllm/ray/docker_compose_llm.yaml +++ /dev/null @@ -1,46 +0,0 @@ -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -version: "3.8" - -services: - vllm-ray-service: - image: opea/vllm_ray:habana - container_name: vllm-ray-gaudi-server - ports: - - "8006:8000" - volumes: - - "./data:/data" - environment: - no_proxy: ${no_proxy} - http_proxy: ${http_proxy} - https_proxy: ${https_proxy} - HF_TOKEN: ${HF_TOKEN} - HABANA_VISIBLE_DEVICES: all - OMPI_MCA_btl_vader_single_copy_mechanism: none - LLM_MODEL: ${LLM_MODEL} - runtime: habana - cap_add: - - SYS_NICE - ipc: host - command: /bin/bash -c "ray start --head && python vllm_ray_openai.py --port_number 8000 --model_id_or_path $LLM_MODEL --tensor_parallel_size 2 --enforce_eager False" - llm: - image: opea/llm-vllm-ray:latest - container_name: llm-vllm-ray-gaudi-server - depends_on: - - vllm-ray-service - ports: - - "9000:9000" - ipc: host - environment: - no_proxy: ${no_proxy} - http_proxy: ${http_proxy} - https_proxy: ${https_proxy} - vLLM_RAY_ENDPOINT: ${vLLM_RAY_ENDPOINT} - HF_TOKEN: ${HF_TOKEN} - LLM_MODEL: ${LLM_MODEL} - restart: unless-stopped - -networks: - default: - driver: bridge diff --git a/comps/llms/text-generation/vllm/ray/launch_microservice.sh b/comps/llms/text-generation/vllm/ray/launch_microservice.sh 
deleted file mode 100644 index 6281020325..0000000000 --- a/comps/llms/text-generation/vllm/ray/launch_microservice.sh +++ /dev/null @@ -1,13 +0,0 @@ -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -docker run -d --rm \ - --name="llm-vllm-ray-server" \ - -p 9000:9000 \ - --ipc=host \ - -e http_proxy=$http_proxy \ - -e https_proxy=$https_proxy \ - -e vLLM_RAY_ENDPOINT=$vLLM_RAY_ENDPOINT \ - -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN \ - -e LLM_MODEL=$LLM_MODEL \ - opea/llm-vllm-ray:latest diff --git a/comps/llms/text-generation/vllm/ray/llm.py b/comps/llms/text-generation/vllm/ray/llm.py deleted file mode 100644 index cd19cb13b3..0000000000 --- a/comps/llms/text-generation/vllm/ray/llm.py +++ /dev/null @@ -1,74 +0,0 @@ -# Copyright (c) 2024 Intel Corporation -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os - -from fastapi.responses import StreamingResponse -from langchain_openai import ChatOpenAI - -from comps import CustomLogger, GeneratedDoc, LLMParamsDoc, ServiceType, opea_microservices, register_microservice - -logger = CustomLogger("llm_vllm_ray") -logflag = os.getenv("LOGFLAG", False) - - -@register_microservice( - name="opea_service@llm_vllm_ray", - service_type=ServiceType.LLM, - endpoint="/v1/chat/completions", - host="0.0.0.0", - port=9000, -) -async def llm_generate(input: LLMParamsDoc): - if logflag: - logger.info(input) - llm_endpoint = os.getenv("vLLM_RAY_ENDPOINT", "http://localhost:8006") - llm_model = os.getenv("LLM_MODEL", "meta-llama/Llama-2-7b-chat-hf") - llm = ChatOpenAI( - openai_api_base=llm_endpoint + "/v1", - model_name=llm_model, - openai_api_key=os.getenv("OPENAI_API_KEY", "not_needed"), - max_tokens=input.max_tokens, - top_p=input.top_p, - temperature=input.temperature, - frequency_penalty=input.frequency_penalty, - presence_penalty=input.presence_penalty, - streaming=input.streaming, - request_timeout=600, - ) - - if input.streaming: - - async def stream_generator(): - chat_response = "" - for text in llm.astream(input.query): - text = text.content - chat_response += text - chunk_repr = repr(text.encode("utf-8")) - yield f"data: {chunk_repr}\n\n" - if logflag: - logger.info(f"[llm - chat_stream] stream response: {chat_response}") - yield "data: [DONE]\n\n" - - return StreamingResponse(stream_generator(), media_type="text/event-stream") - else: - response = await llm.ainvoke(input.query) - response = response.content - if logflag: - logger.info(response) - return GeneratedDoc(text=response, prompt=input.query) - - -if __name__ == "__main__": - opea_microservices["opea_service@llm_vllm_ray"].start() diff --git a/comps/llms/text-generation/vllm/ray/query.sh b/comps/llms/text-generation/vllm/ray/query.sh deleted file mode 100644 index 87c3ce4f66..0000000000 --- a/comps/llms/text-generation/vllm/ray/query.sh +++ /dev/null @@ -1,15 +0,0 @@ -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -your_ip="0.0.0.0" - -##query vllm ray service -curl http://${your_ip}:8006/v1/chat/completions \ - 
-H "Content-Type: application/json" \ - -d '{"model": "meta-llama/Llama-2-7b-chat-hf", "messages": [{"role": "user", "content": "How are you?"}]}' - -##query microservice -curl http://${your_ip}:9000/v1/chat/completions \ - -X POST \ - -d '{"query":"What is Deep Learning?","max_tokens":17,"top_p":1,"temperature":0.7,"frequency_penalty":0,"presence_penalty":0, "streaming":false}' \ - -H 'Content-Type: application/json' diff --git a/comps/llms/text-generation/vllm/ray/requirements.txt b/comps/llms/text-generation/vllm/ray/requirements.txt deleted file mode 100644 index 69dbbd1730..0000000000 --- a/comps/llms/text-generation/vllm/ray/requirements.txt +++ /dev/null @@ -1,17 +0,0 @@ -docarray[full] -fastapi -huggingface_hub -langchain -langchain_community -langchain_openai -openai -opentelemetry-api -opentelemetry-exporter-otlp -opentelemetry-sdk -prometheus-fastapi-instrumentator -ray[serve]>=2.10 -setuptools -shortuuid -transformers -uvicorn -vllm diff --git a/comps/llms/text-generation/vllm/ray/vllm_ray_openai.py b/comps/llms/text-generation/vllm/ray/vllm_ray_openai.py deleted file mode 100644 index 75ff7ff4af..0000000000 --- a/comps/llms/text-generation/vllm/ray/vllm_ray_openai.py +++ /dev/null @@ -1,179 +0,0 @@ -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -import argparse -import logging -import os -import sys - -# __serve_example_begin__ -from typing import Dict, List, Optional - -from fastapi import FastAPI -from huggingface_hub import login -from ray import serve -from starlette.requests import Request -from starlette.responses import JSONResponse, StreamingResponse -from vllm.engine.arg_utils import AsyncEngineArgs -from vllm.engine.async_llm_engine import AsyncLLMEngine -from vllm.entrypoints.openai.cli_args import make_arg_parser -from vllm.entrypoints.openai.protocol import ChatCompletionRequest, ChatCompletionResponse, ErrorResponse -from vllm.entrypoints.openai.serving_chat import OpenAIServingChat -from vllm.entrypoints.openai.serving_engine import LoRAModulePath - -hg_token = os.getenv("HUGGINGFACEHUB_API_TOKEN", "") -if hg_token != "": - login(token=hg_token) - -logger = logging.getLogger("ray.serve") - -app = FastAPI() - - -@serve.deployment( - autoscaling_config={ - "min_replicas": 1, - "max_replicas": 10, - "target_ongoing_requests": 5, - }, - max_ongoing_requests=10, -) -@serve.ingress(app) -class VLLMDeployment: - def __init__( - self, - engine_args: AsyncEngineArgs, - response_role: str, - lora_modules: Optional[List[LoRAModulePath]] = None, - chat_template: Optional[str] = None, - ): - logger.info(f"Starting with engine args: {engine_args}") - self.engine = AsyncLLMEngine.from_engine_args(engine_args) - - # Determine the name of the served model for the OpenAI client. - if engine_args.served_model_name is not None: - served_model_names = engine_args.served_model_name - else: - served_model_names = [engine_args.model] - self.openai_serving_chat = OpenAIServingChat( - self.engine, served_model_names, response_role, lora_modules, chat_template - ) - - @app.post("/v1/chat/completions") - async def create_chat_completion(self, request: ChatCompletionRequest, raw_request: Request): - """OpenAI-compatible HTTP endpoint. 
- - API reference: - - https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html - """ - logger.info(f"Request: {request}") - generator = await self.openai_serving_chat.create_chat_completion(request, raw_request) - if isinstance(generator, ErrorResponse): - return JSONResponse(content=generator.model_dump(), status_code=generator.code) - if request.stream: - return StreamingResponse(content=generator, media_type="text/event-stream") - else: - assert isinstance(generator, ChatCompletionResponse) - return JSONResponse(content=generator.model_dump()) - - -def parse_vllm_args(cli_args: Dict[str, str]): - """Parses vLLM args based on CLI inputs. - - Currently uses argparse because vLLM doesn't expose Python models for all of the - config options we want to support. - """ - parser = make_arg_parser() - arg_strings = [] - for key, value in cli_args.items(): - arg_strings.extend([f"--{key}", str(value)]) - logger.info(arg_strings) - parsed_args = parser.parse_args(args=arg_strings) - return parsed_args - - -def build_app(cli_args: Dict[str, str]) -> serve.Application: - """Builds the Serve app based on CLI arguments. - - See https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#command-line-arguments-for-the-server - for the complete set of arguments. - - Supported engine arguments: https://docs.vllm.ai/en/latest/models/engine_args.html. - """ # noqa: E501 - device = cli_args.pop("device") - enforce_eager = cli_args.pop("enforce_eager") - parsed_args = parse_vllm_args(cli_args) - engine_args = AsyncEngineArgs.from_cli_args(parsed_args) - engine_args.worker_use_ray = True - engine_args.enforce_eager = enforce_eager - engine_args.block_size = 128 - engine_args.max_num_seqs = 256 - engine_args.max_seq_len_to_capture = 2048 - - tp = engine_args.tensor_parallel_size - logger.info(f"Tensor parallelism = {tp}") - pg_resources = [] - pg_resources.append({"CPU": 1}) # for the deployment replica - for i in range(tp): - pg_resources.append({"CPU": 1, device: 1}) # for the vLLM actors - - # We use the "STRICT_PACK" strategy below to ensure all vLLM actors are placed on - # the same Ray node. 
- return VLLMDeployment.options(placement_group_bundles=pg_resources, placement_group_strategy="STRICT_PACK").bind( - engine_args, - parsed_args.response_role, - parsed_args.lora_modules, - parsed_args.chat_template, - ) - - -# __serve_example_end__ - - -def str2bool(v): - if isinstance(v, bool): - return v - if v.lower() in ("yes", "true", "t", "y", "1"): - return True - elif v.lower() in ("no", "false", "f", "n", "0"): - return False - else: - raise argparse.ArgumentTypeError("Boolean value expected.") - - -def main(argv=None): - parser = argparse.ArgumentParser(description="Serve vLLM models with Ray.", add_help=True) - parser.add_argument("--port_number", default="8000", type=str, help="Port number to serve on.", required=False) - parser.add_argument( - "--model_id_or_path", - default="meta-llama/Llama-2-7b-chat-hf", - type=str, - help="Model id or path.", - required=False, - ) - parser.add_argument( - "--tensor_parallel_size", default=2, type=int, help="parallel nodes number for 'hpu' mode.", required=False - ) - parser.add_argument( - "--enforce_eager", default=False, type=str2bool, help="Whether to enforce eager execution", required=False - ) - args = parser.parse_args(argv) - - serve.start(http_options={"host": "0.0.0.0", "port": args.port_number}) - serve.run( - build_app( - { - "model": args.model_id_or_path, - "tensor-parallel-size": args.tensor_parallel_size, - "device": "HPU", - "enforce_eager": args.enforce_eager, - } - ) - ) - # input("Service is deployed successfully.") - while 1: - pass - - -if __name__ == "__main__": - main(sys.argv[1:]) diff --git a/tests/llms/test_llms_text-generation_vllm_ray_on_intel_hpu.sh b/tests/llms/test_llms_text-generation_vllm_ray_on_intel_hpu.sh deleted file mode 100644 index d55e8f6a22..0000000000 --- a/tests/llms/test_llms_text-generation_vllm_ray_on_intel_hpu.sh +++ /dev/null @@ -1,125 +0,0 @@ -#!/bin/bash -# Copyright (C) 2024 Intel Corporation -# SPDX-License-Identifier: Apache-2.0 - -set -x - -WORKPATH=$(dirname "$PWD") -LOG_PATH="$WORKPATH/tests" -ip_address=$(hostname -I | awk '{print $1}') - -function build_docker_images() { - ## Build VLLM Ray docker - cd $WORKPATH - docker build \ - -f comps/llms/text-generation/vllm/ray/dependency/Dockerfile \ - --no-cache -t opea/vllm_ray-habana:comps --network=host . - if [ $? -ne 0 ]; then - echo "opea/vllm_ray-habana built fail" - exit 1 - else - echo "opea/vllm_ray-habana built successful" - fi - - ## Build OPEA microservice docker - cd $WORKPATH - docker build \ - --no-cache -t opea/llm-vllm-ray:comps \ - -f comps/llms/text-generation/vllm/ray/Dockerfile . - if [ $? 
-ne 0 ]; then - echo "opea/llm-vllm-ray built fail" - exit 1 - else - echo "opea/llm-vllm-ray built successful" - fi -} - -function start_service() { - export LLM_MODEL="Intel/neural-chat-7b-v3-3" - port_number=5031 - docker run -d --rm \ - --name="test-comps-vllm-ray-service" \ - --runtime=habana \ - -v $PWD/data:/data \ - -e HABANA_VISIBLE_DEVICES=all \ - -e OMPI_MCA_btl_vader_single_copy_mechanism=none \ - --cap-add=sys_nice \ - --ipc=host \ - -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN \ - -p $port_number:8000 \ - opea/vllm_ray-habana:comps \ - /bin/bash -c "ray start --head && python vllm_ray_openai.py --port_number 8000 --model_id_or_path $LLM_MODEL --tensor_parallel_size 2 --enforce_eager False" - - export vLLM_RAY_ENDPOINT="http://${ip_address}:${port_number}" - service_port=5032 - docker run -d --rm\ - --name="test-comps-vllm-ray-microservice" \ - -p $service_port:9000 \ - --ipc=host \ - -e vLLM_RAY_ENDPOINT=$vLLM_RAY_ENDPOINT \ - -e HUGGINGFACEHUB_API_TOKEN=$HUGGINGFACEHUB_API_TOKEN \ - -e LLM_MODEL=$LLM_MODEL \ - opea/llm-vllm-ray:comps - - # check whether vllm ray is fully ready - n=0 - until [[ "$n" -ge 100 ]] || [[ $ready == true ]]; do - docker logs test-comps-vllm-ray-service > ${LOG_PATH}/test-comps-vllm-ray-service.log - n=$((n+1)) - if grep -q Connected ${LOG_PATH}/test-comps-vllm-ray-service.log; then - break - fi - sleep 5s - done - sleep 5s -} - -function validate_microservice() { - port_number=5031 - result=$(http_proxy="" curl http://${ip_address}:$port_number/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{"model": "Intel/neural-chat-7b-v3-3", "messages": [{"role": "user", "content": "How are you?"}]}') - if [[ $result == *"message"* ]]; then - echo "Result correct." - else - echo "Result wrong. Received was $result" - docker logs test-comps-vllm-ray-service >> ${LOG_PATH}/ray-dependency.log - docker logs test-comps-vllm-ray-microservice >> ${LOG_PATH}/llm-ray.log - exit 1 - fi - - service_port=5032 - result=$(http_proxy="" curl http://${ip_address}:$service_port/v1/chat/completions \ - -X POST \ - -d '{"query":"What is Deep Learning?","max_tokens":17,"top_p":1,"temperature":0.7,"frequency_penalty":0,"presence_penalty":0, "streaming":false}' \ - -H 'Content-Type: application/json') - if [[ $result == *"text"* ]]; then - echo "Result correct." - else - echo "Result wrong. Received was $result" - docker logs test-comps-vllm-ray-service >> ${LOG_PATH}/ray-dependency.log - docker logs test-comps-vllm-ray-microservice >> ${LOG_PATH}/llm-ray.log - exit 1 - fi -} - -function stop_docker() { - cid=$(docker ps -aq --filter "name=test-comps-vllm-ray*") - if [[ ! -z "$cid" ]]; then docker rm $cid -f && sleep 1s; fi -} - -function main() { - - stop_docker - - build_docker_images - start_service - - validate_microservice - - stop_docker - echo y | docker system prune - -} - -main