dataprep-redis: failed when uploading pdf with process_table=true #414

Closed
lianhao opened this issue Aug 6, 2024 · 1 comment · Fixed by #421
Assignees
Labels
aitce bug Something isn't working

Comments

@lianhao
Collaborator

lianhao commented Aug 6, 2024

When I try to upload a PDF to the dataprep-redis Docker container using the following curl command with the process_table feature enabled, the request fails:

$ curl -X POST  -H "Content-Type: multipart/form-data" -F "files=@../../../../retrievers/langchain/redis/data/nke-10k-2023.pdf" http://localhost:6007/v1/dataprep -F "process_table=true" -F "table_strategy=hq"
Internal Server Error

The Docker container logs show the following error:

Parsing document ./uploaded_files/nke-10k-2023.pdf.
INFO:     172.23.0.1:38236 - "POST /v1/dataprep HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.11/site-packages/pdf2image/pdf2image.py", line 581, in pdfinfo_from_path
    proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/lib/python3.11/subprocess.py", line 1955, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/user/.local/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/user/.local/lib/python3.11/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 174, in __call__
    raise exc
  File "/home/user/.local/lib/python3.11/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 172, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/user/.local/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/user/.local/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/langsmith/run_helpers.py", line 468, in async_wrapper
    raise e
  File "/home/user/.local/lib/python3.11/site-packages/langsmith/run_helpers.py", line 454, in async_wrapper
    function_result = await asyncio.create_task(  # type: ignore[call-arg]
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/comps/dataprep/redis/langchain/prepare_doc_redis.py", line 232, in ingest_documents
    ingest_data_to_redis(
  File "/home/user/comps/dataprep/redis/langchain/prepare_doc_redis.py", line 131, in ingest_data_to_redis
    table_chunks = get_tables_result(path, doc_path.table_strategy)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/comps/dataprep/utils.py", line 571, in get_tables_result
    raw_pdf_elements = partition_pdf(
                       ^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/unstructured/documents/elements.py", line 514, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 591, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 546, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/unstructured/chunking/__init__.py", line 52, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 183, in partition_pdf
    return partition_pdf_or_image(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 480, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/unstructured/utils.py", line 214, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 278, in _partition_pdf_or_image_local
    inferred_document_layout = process_file_with_model(
                               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/unstructured_inference/inference/layout.py", line 404, in process_file_with_model
    else DocumentLayout.from_file(
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/unstructured_inference/inference/layout.py", line 64, in from_file
    _image_paths = convert_pdf_to_image(
                   ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/unstructured_inference/inference/layout.py", line 446, in convert_pdf_to_image
    images = pdf2image.convert_from_path(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/pdf2image/pdf2image.py", line 127, in convert_from_path
    page_count = pdfinfo_from_path(
                 ^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/pdf2image/pdf2image.py", line 607, in pdfinfo_from_path
    raise PDFInfoNotInstalledError(
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
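
The traceback above ends with pdf2image failing to launch the `pdfinfo` binary, which is part of Poppler. As a rough sketch (not the fix that was merged for this issue), a preflight check like the one below could confirm whether the Poppler command-line tools are visible on PATH inside the container. The function name `missing_poppler_tools` is hypothetical, and the inclusion of `pdftoppm` is an assumption: only `pdfinfo` is named in the log.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumption: these are the Poppler tools needed for PDF-to-image conversion.
# Only `pdfinfo` appears in the traceback; `pdftoppm` is included as a guess.
POPPLER_TOOLS = ("pdfinfo", "pdftoppm")


def missing_poppler_tools(
    tools: Iterable[str] = POPPLER_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names in `tools` that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Missing Poppler tools: " + ", ".join(missing))
    else:
        print("All Poppler tools found on PATH.")
```

If the check reports `pdfinfo` as missing, installing Poppler in the image (on Debian-based images the package is usually `poppler-utils`) is the likely remedy, though the Dockerfile for this container is not shown here.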
@ZailiWang ZailiWang self-assigned this Aug 6, 2024
@ZailiWang
Collaborator

Thanks for reporting. Let me try to reproduce and fix it.
