Commit b05ffc5

Merge branch 'main' into qg/safety

ashahba authored Aug 14, 2024
2 parents edc3f1e + f2497c5
Showing 32 changed files with 357 additions and 176 deletions.
2 changes: 1 addition & 1 deletion comps/cores/proto/api_protocol.py
@@ -160,7 +160,7 @@ class ChatCompletionRequest(BaseModel):
logit_bias: Optional[Dict[str, float]] = None
logprobs: Optional[bool] = False
top_logprobs: Optional[int] = 0
- max_tokens: Optional[int] = 16 # use https://platform.openai.com/docs/api-reference/completions/create
+ max_tokens: Optional[int] = 1024 # use https://platform.openai.com/docs/api-reference/completions/create
n: Optional[int] = 1
presence_penalty: Optional[float] = 0.0
response_format: Optional[ResponseFormat] = None
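The only change here is the default: requests that omit max_tokens now get 1024 tokens instead of 16. A minimal sketch of the resulting behavior, using a hypothetical standalone model trimmed to the one changed field (not the full OPEA class):

```python
from typing import Optional

from pydantic import BaseModel


class ChatCompletionRequest(BaseModel):
    # Default raised from 16 to 1024, so completions are no longer truncated early.
    max_tokens: Optional[int] = 1024


print(ChatCompletionRequest().max_tokens)               # -> 1024 when the caller omits it
print(ChatCompletionRequest(max_tokens=64).max_tokens)  # -> 64; explicit values still win
```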
1 change: 1 addition & 0 deletions comps/dataprep/pgvector/langchain/requirements.txt
@@ -22,6 +22,7 @@ psycopg2-binary
pymupdf
pyspark
python-docx
+ python-multipart
python-pptx
sentence_transformers
shortuuid
23 changes: 18 additions & 5 deletions comps/dataprep/qdrant/README.md
@@ -47,15 +47,15 @@ docker build -t opea/dataprep-qdrant:latest --build-arg https_proxy=$https_proxy
## Run Docker with CLI

```bash
- docker run -d --name="dataprep-qdrant-server" -p 6000:6000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy opea/dataprep-qdrant:latest
+ docker run -d --name="dataprep-qdrant-server" -p 6007:6007 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy opea/dataprep-qdrant:latest
```

## Setup Environment Variables

```bash
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
- export QDRANT=${host_ip}
+ export QDRANT_HOST=${host_ip}
export QDRANT_PORT=6333
export COLLECTION_NAME=${your_collection_name}
```
@@ -72,19 +72,32 @@ docker compose -f docker-compose-dataprep-qdrant.yaml up -d
Once the document preparation microservice for Qdrant is started, you can use the command below to invoke the microservice; it converts the document to embeddings and saves them to the database.

```bash
- curl -X POST -H "Content-Type: application/json" -d '{"path":"/path/to/document"}' http://localhost:6000/v1/dataprep
+ curl -X POST \
+     -H "Content-Type: multipart/form-data" \
+     -F "files=@./file1.txt" \
+     http://localhost:6007/v1/dataprep
```

You can specify chunk_size and chunk_overlap with the following command.

```bash
- curl -X POST -H "Content-Type: application/json" -d '{"path":"/path/to/document","chunk_size":1500,"chunk_overlap":100}' http://localhost:6000/v1/dataprep
+ curl -X POST \
+     -H "Content-Type: multipart/form-data" \
+     -F "files=@./file1.txt" \
+     -F "chunk_size=1500" \
+     -F "chunk_overlap=100" \
+     http://localhost:6007/v1/dataprep
```

We support table extraction from PDF documents. You can specify process_table and table_strategy with the following command. "table_strategy" selects how tables are parsed for table retrieval: as the setting progresses from "fast" to "hq" to "llm", the focus shifts toward deeper table understanding at the expense of processing speed. The default strategy is "fast".

Note: if you specify "table_strategy=llm", you must first start the TGI service (see sections 1.2.1 and 1.3.1 in https://github.com/opea-project/GenAIComps/tree/main/comps/llms/README.md) and then run `export TGI_LLM_ENDPOINT="http://${your_ip}:8008"`.

```bash
- curl -X POST -H "Content-Type: application/json" -d '{"path":"/path/to/document","process_table":true,"table_strategy":"hq"}' http://localhost:6000/v1/dataprep
+ curl -X POST \
+     -H "Content-Type: multipart/form-data" \
+     -F "files=@./your_file.pdf" \
+     -F "process_table=true" \
+     -F "table_strategy=hq" \
+     http://localhost:6007/v1/dataprep
```
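For reference, a minimal Python client sketch equivalent to the curl calls above, assuming the dataprep service is reachable at localhost:6007 and that ./your_file.pdf exists; the form-field names mirror those in the README:

```python
import requests

url = "http://localhost:6007/v1/dataprep"

# Upload one file with explicit chunking and table-extraction options.
with open("./your_file.pdf", "rb") as f:
    resp = requests.post(
        url,
        files={"files": f},  # same field name the endpoint expects
        data={
            "chunk_size": 1500,
            "chunk_overlap": 100,
            "process_table": "true",
            "table_strategy": "hq",
        },
    )
print(resp.status_code, resp.json())
```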
2 changes: 1 addition & 1 deletion comps/dataprep/qdrant/config.py
@@ -7,7 +7,7 @@
EMBED_MODEL = os.getenv("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2")

# Qdrant configuration
QDRANT_HOST = os.getenv("QDRANT", "localhost")
QDRANT_HOST = os.getenv("QDRANT_HOST", "localhost")
QDRANT_PORT = int(os.getenv("QDRANT_PORT", 6333))
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "rag-qdrant")

12 changes: 9 additions & 3 deletions comps/dataprep/qdrant/docker/Dockerfile
@@ -12,6 +12,7 @@ RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missin
build-essential \
libgl1-mesa-glx \
libjemalloc-dev \
+     default-jre \
vim

RUN useradd -m -s /bin/bash user && \
@@ -22,13 +23,18 @@ USER user

COPY comps /home/user/comps

- RUN pip install --no-cache-dir --upgrade pip && \
-     if [ ${ARCH} = "cpu" ]; then pip install torch --index-url https://download.pytorch.org/whl/cpu; fi && \
+ RUN pip install --no-cache-dir --upgrade pip setuptools && \
+     if [ ${ARCH} = "cpu" ]; then pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \
pip install --no-cache-dir -r /home/user/comps/dataprep/qdrant/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

+ USER root

+ RUN mkdir -p /home/user/comps/dataprep/qdrant/uploaded_files && chown -R user /home/user/comps/dataprep/qdrant/uploaded_files

+ USER user

WORKDIR /home/user/comps/dataprep/qdrant

ENTRYPOINT ["python", "prepare_doc_qdrant.py"]

21 changes: 19 additions & 2 deletions comps/dataprep/qdrant/docker/docker-compose-dataprep-qdrant.yaml
@@ -9,19 +9,36 @@ services:
ports:
- "6333:6333"
- "6334:6334"
+ tei-embedding-service:
+   image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.5
+   container_name: tei-embedding-server
+   ports:
+     - "6006:80"
+   volumes:
+     - "./data:/data"
+   shm_size: 1g
+   environment:
+     no_proxy: ${no_proxy}
+     http_proxy: ${http_proxy}
+     https_proxy: ${https_proxy}
+   command: --model-id ${EMBEDDING_MODEL_ID} --auto-truncate
dataprep-qdrant:
image: opea/gen-ai-comps:dataprep-qdrant-xeon-server
container_name: dataprep-qdrant-server
depends_on:
- qdrant-vector-db
+ - tei-embedding-service
ports:
- "6000:6000"
- "6007:6007"
ipc: host
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
- QDRANT: ${QDRANT}
+ QDRANT_HOST: ${QDRANT_HOST}
QDRANT_PORT: ${QDRANT_PORT}
COLLECTION_NAME: ${COLLECTION_NAME}
+ TEI_ENDPOINT: ${TEI_ENDPOINT}
restart: unless-stopped

networks:
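The compose file now starts a TEI embedding server next to Qdrant, and the dataprep container reaches it through TEI_ENDPOINT. A quick smoke test of the embedding service, assuming it is published on host port 6006 as mapped above and the model has finished loading (TEI serves an /embed route that returns one vector per input):

```python
import requests

# Ask the TEI server to embed a single string.
resp = requests.post(
    "http://localhost:6006/embed",
    json={"inputs": "Deep learning is a subset of machine learning."},
)
resp.raise_for_status()
embedding = resp.json()[0]
print(len(embedding))  # embedding dimension of the configured ${EMBEDDING_MODEL_ID}
```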
118 changes: 99 additions & 19 deletions comps/dataprep/qdrant/prepare_doc_qdrant.py
@@ -1,30 +1,31 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

- import os
+ import json
+ from typing import List, Optional, Union

- from config import COLLECTION_NAME, EMBED_MODEL, QDRANT_HOST, QDRANT_PORT
+ from config import COLLECTION_NAME, EMBED_MODEL, QDRANT_HOST, QDRANT_PORT, TEI_EMBEDDING_ENDPOINT
+ from fastapi import File, Form, HTTPException, UploadFile
from langchain.text_splitter import RecursiveCharacterTextSplitter
- from langchain_community.embeddings import HuggingFaceBgeEmbeddings, HuggingFaceEmbeddings, HuggingFaceHubEmbeddings
+ from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import Qdrant
+ from langchain_huggingface import HuggingFaceEndpointEmbeddings
from langchain_text_splitters import HTMLHeaderTextSplitter

- from comps import DocPath, opea_microservices, opea_telemetry, register_microservice
- from comps.dataprep.utils import document_loader, get_separators, get_tables_result
+ from comps import DocPath, opea_microservices, register_microservice
+ from comps.dataprep.utils import (
+     document_loader,
+     encode_filename,
+     get_separators,
+     get_tables_result,
+     parse_html,
+     save_content_to_local_disk,
+ )

- tei_embedding_endpoint = os.getenv("TEI_ENDPOINT")
+ upload_folder = "./uploaded_files/"


- @register_microservice(
-     name="opea_service@prepare_doc_qdrant",
-     endpoint="/v1/dataprep",
-     host="0.0.0.0",
-     port=6000,
-     input_datatype=DocPath,
-     output_datatype=None,
- )
- @opea_telemetry
- def ingest_documents(doc_path: DocPath):
+ def ingest_data_to_qdrant(doc_path: DocPath):
"""Ingest document to Qdrant."""
path = doc_path.path
print(f"Parsing document {path}.")
@@ -38,23 +39,30 @@ def ingest_documents(doc_path: DocPath):
text_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
else:
text_splitter = RecursiveCharacterTextSplitter(
- chunk_size=doc_path.chunk_size, chunk_overlap=100, add_start_index=True, separators=get_separators()
+     chunk_size=doc_path.chunk_size,
+     chunk_overlap=doc_path.chunk_overlap,
+     add_start_index=True,
+     separators=get_separators(),
)

content = document_loader(path)

chunks = text_splitter.split_text(content)
if doc_path.process_table and path.endswith(".pdf"):
table_chunks = get_tables_result(path, doc_path.table_strategy)
chunks = chunks + table_chunks
print("Done preprocessing. Created ", len(chunks), " chunks of the original pdf")

# Create vectorstore
- if tei_embedding_endpoint:
+ if TEI_EMBEDDING_ENDPOINT:
# create embeddings using TEI endpoint service
- embedder = HuggingFaceHubEmbeddings(model=tei_embedding_endpoint)
+ embedder = HuggingFaceEndpointEmbeddings(model=TEI_EMBEDDING_ENDPOINT)
else:
# create embeddings using local embedding model
embedder = HuggingFaceBgeEmbeddings(model_name=EMBED_MODEL)

print("embedder created.")

# Batch size
batch_size = 32
num_chunks = len(chunks)
@@ -71,6 +79,78 @@ def ingest_documents(doc_path: DocPath):
)
print(f"Processed batch {i//batch_size + 1}/{(num_chunks-1)//batch_size + 1}")

+ return True


+ @register_microservice(
+     name="opea_service@prepare_doc_qdrant",
+     endpoint="/v1/dataprep",
+     host="0.0.0.0",
+     port=6007,
+     input_datatype=DocPath,
+     output_datatype=None,
+ )
+ async def ingest_documents(
+     files: Optional[Union[UploadFile, List[UploadFile]]] = File(None),
+     link_list: Optional[str] = Form(None),
+     chunk_size: int = Form(1500),
+     chunk_overlap: int = Form(100),
+     process_table: bool = Form(False),
+     table_strategy: str = Form("fast"),
+ ):
print(f"files:{files}")
print(f"link_list:{link_list}")

if files:
if not isinstance(files, list):
files = [files]
uploaded_files = []
for file in files:
encode_file = encode_filename(file.filename)
save_path = upload_folder + encode_file
await save_content_to_local_disk(save_path, file)
ingest_data_to_qdrant(
DocPath(
path=save_path,
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
process_table=process_table,
table_strategy=table_strategy,
)
)
uploaded_files.append(save_path)
print(f"Successfully saved file {save_path}")

return {"status": 200, "message": "Data preparation succeeded"}

if link_list:
link_list = json.loads(link_list) # Parse JSON string to list
if not isinstance(link_list, list):
raise HTTPException(status_code=400, detail="link_list should be a list.")
for link in link_list:
encoded_link = encode_filename(link)
save_path = upload_folder + encoded_link + ".txt"
content = parse_html([link])[0][0]
try:
await save_content_to_local_disk(save_path, content)
ingest_data_to_qdrant(
DocPath(
path=save_path,
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
process_table=process_table,
table_strategy=table_strategy,
)
)
except json.JSONDecodeError:
raise HTTPException(status_code=500, detail="Fail to ingest data into qdrant.")

print(f"Successfully saved link {link}")

return {"status": 200, "message": "Data preparation succeeded"}

raise HTTPException(status_code=400, detail="Must provide either a file or a string list.")


if __name__ == "__main__":
opea_microservices["opea_service@prepare_doc_qdrant"].start()
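Besides file uploads, the reworked endpoint accepts a link_list form field carrying a JSON-encoded list of URLs, which it fetches, parses, and ingests. A usage sketch for that path, assuming the service runs locally on port 6007 and using an example URL:

```python
import json

import requests

# link_list is sent as a plain string form field and json.loads-ed server-side.
resp = requests.post(
    "http://localhost:6007/v1/dataprep",
    data={"link_list": json.dumps(["https://opea.dev"])},
)
print(resp.json())  # expected: {"status": 200, "message": "Data preparation succeeded"}
```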
1 change: 1 addition & 0 deletions comps/dataprep/qdrant/requirements.txt
@@ -8,6 +8,7 @@ huggingface_hub
langchain
langchain-community
langchain-text-splitters
+ langchain_huggingface
markdown
numpy
opentelemetry-api
1 change: 1 addition & 0 deletions comps/dataprep/redis/langchain_ray/requirements.txt
@@ -19,6 +19,7 @@ pyarrow
pymupdf
python-bidi==0.4.2
python-docx
+ python-multipart
python-pptx
ray
redis
3 changes: 1 addition & 2 deletions comps/embeddings/langchain-mosec/mosec-docker/Dockerfile
@@ -10,7 +10,6 @@ ARG DEBIAN_FRONTEND=noninteractive
ENV GLIBC_TUNABLES glibc.cpu.x86_shstk=permissive
RUN apt update && apt install -y python3 python3-pip

- USER user
COPY comps /home/user/comps

RUN pip3 install torch==2.2.2 torchvision --index-url https://download.pytorch.org/whl/cpu
@@ -19,7 +18,7 @@ RUN pip3 install transformers
RUN pip3 install llmspec mosec

RUN cd /home/user/ && export HF_ENDPOINT=https://hf-mirror.com && huggingface-cli download --resume-download BAAI/bge-large-zh-v1.5 --local-dir /home/user/bge-large-zh-v1.5

+ USER user
ENV EMB_MODEL="/home/user/bge-large-zh-v1.5/"

WORKDIR /home/user/comps/embeddings/langchain-mosec/mosec-docker
@@ -13,7 +13,7 @@
from llmspec import EmbeddingData, EmbeddingRequest, EmbeddingResponse, TokenUsage
from mosec import ClientError, Runtime, Server, Worker

DEFAULT_MODEL = "/root/bge-large-zh-v1.5/"
DEFAULT_MODEL = "/home/user/bge-large-zh-v1.5/"


class Embedding(Worker):
4 changes: 2 additions & 2 deletions comps/embeddings/llama_index/local_embedding.py
@@ -2,7 +2,7 @@
# SPDX-License-Identifier: Apache-2.0

from langsmith import traceable
- from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+ from llama_index.embeddings.huggingface_api import HuggingFaceInferenceAPIEmbedding

from comps import EmbedDoc, ServiceType, TextDoc, opea_microservices, register_microservice

@@ -24,5 +24,5 @@ def embedding(input: TextDoc) -> EmbedDoc:


if __name__ == "__main__":
- embeddings = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")
+ embeddings = HuggingFaceInferenceAPIEmbedding(model_name="BAAI/bge-large-en-v1.5")
opea_microservices["opea_service@local_embedding"].start()
1 change: 1 addition & 0 deletions comps/embeddings/llama_index/requirements.txt
@@ -2,6 +2,7 @@ docarray[full]
fastapi
huggingface_hub
langsmith
+ llama-index-embeddings-huggingface-api
llama-index-embeddings-text-embeddings-inference
opentelemetry-api
opentelemetry-exporter-otlp
2 changes: 1 addition & 1 deletion comps/guardrails/llama_guard/README.md
@@ -36,7 +36,7 @@ pip install -r requirements.txt
export HF_TOKEN=${your_hf_api_token}
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=${your_langchain_api_key}
- export LANGCHAIN_PROJECT="opea/gaurdrails"
+ export LANGCHAIN_PROJECT="opea/guardrails"
volume=$PWD/data
model_id="meta-llama/Meta-Llama-Guard-2-8B"
docker pull ghcr.io/huggingface/tgi-gaudi:2.0.1