
[BUG] ml_inference ingest processor incorrectly parsing input field #2904

Closed
IanMenendez opened this issue Sep 5, 2024 · 5 comments
Labels
bug, untriaged

Comments

@IanMenendez

What is the bug?
The ml_inference ingest processor does not correctly parse the input field when it is given as a full JSON path.
This was tested with an OpenSearch-hosted ML model.

How can one reproduce the bug?


PUT /_ingest/pipeline/ml_inference_pipeline
{
  "processors": [
    {
      "ml_inference": {
        "model_id": "DO2Ew5EBAm-NfbMQYIyT",
        "function_name": "text_embedding",
        "model_input": """{"text_docs": ${input_map.text_docs}, "target_response": ["sentence_embedding"]}""",
        "input_map": [
          {
            "text_docs": "dynamicProperties.description"
          }
        ],
        "output_map": [
          {
            "dynamicProperties.description.knn": "$.inference_results.*.output.*.data"
          }
        ],
        "full_response_path": true
      }
    }
  ]
}
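For reference, the intent of the model_input template above can be sketched in Python. This is a toy illustration of the placeholder substitution, not the plugin's actual code; the variable names and the use of a plain string replace are assumptions:

```python
import json

# Simulated document _source and the input_map from the pipeline above
source = {"dynamicProperties.description": ["text1", "text2"]}
input_map = {"text_docs": "dynamicProperties.description"}

# Resolve each input_map entry against the document _source
resolved = {key: source.get(path) for key, path in input_map.items()}

# Substitute the resolved value into the model_input template
template = '{"text_docs": ${input_map.text_docs}, "target_response": ["sentence_embedding"]}'
payload = template.replace("${input_map.text_docs}", json.dumps(resolved["text_docs"]))

# The substituted result must be valid JSON for the model to accept it
body = json.loads(payload)
print(body["text_docs"])
```

If the substituted value is not itself valid JSON, the whole payload fails to parse, which is relevant to the escaping problem discussed later in this thread.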

POST _ingest/pipeline/ml_inference_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "dynamicProperties.description": [
          "text1",
          "text2"
        ]
      }
    }
  ]
}

This returns:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "dynamicProperties.description": [
            "text1",
            "text2"
          ]
        },
        "_ingest": {
          "timestamp": "2024-09-05T21:57:13.113971065Z"
        }
      }
    }
  ]
}

What is the expected behavior?
I expect the processor to yield text embeddings.

@IanMenendez IanMenendez added bug Something isn't working untriaged labels Sep 5, 2024
@IanMenendez
Author

I found that even if I do not use a full JSON path as input,

POST _ingest/pipeline/ml_inference_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "dynamicProperties": {
          "description": [
            "text1",
            "text2"
          ]
        }
      }
    }
  ]
}

I get the following error:


{
  "docs": [
    {
      "error": {
        "root_cause": [
          {
            "type": "response_handler_failure_transport_exception",
            "reason": "java.lang.IllegalArgumentException: [knn] is not an integer, cannot be used as an index as part of path [dynamicProperties.description.knn]"
          }
        ],
        "type": "response_handler_failure_transport_exception",
        "reason": "java.lang.IllegalArgumentException: [knn] is not an integer, cannot be used as an index as part of path [dynamicProperties.description.knn]",
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "[knn] is not an integer, cannot be used as an index as part of path [dynamicProperties.description.knn]",
          "caused_by": {
            "type": "number_format_exception",
            "reason": "For input string: \"knn\""
          }
        }
      }
    }
  ]
}

So is it not possible to have nested objects as output?

@IanMenendez IanMenendez changed the title [BUG] ml_inference ingest processor not correctly parsing input field [BUG] ml_inference ingest processor incorrectly parsing input field Sep 5, 2024
@mingshl
Collaborator

mingshl commented Sep 11, 2024

> Found that even if I do not use full json path as input […] So it is not possible to have nested objects as output??

Hi @IanMenendez, what is your index setting? If you set up the knn field to be a knn_vector field type,

similar to this example:

curl -XPUT "http://localhost:9200/my-knn-index-1" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
        "my_vector1": {
          "type": "knn_vector",
          "dimension": 2,
          "method": {
            "name": "hnsw",
            "space_type": "l2",
            "engine": "nmslib",
            "parameters": {
              "ef_construction": 128,
              "m": 24
            }
          }
        },
        "my_vector2": {
          "type": "knn_vector",
          "dimension": 4,
          "method": {
            "name": "hnsw",
            "space_type": "innerproduct",
            "engine": "faiss",
            "parameters": {
              "ef_construction": 256,
              "m": 48
            }
          }
        }
    }
  }
}'

the mapping will check whether that path segment is an integer index. Writing to that path is not allowed by the ml_inference ingest processor, and it is not allowed by the mapping in this case.
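The "[knn] is not an integer" error above can be reproduced with a toy dotted-path writer. This is illustrative Python, not the actual OpenSearch code: the point is that when an intermediate value in the document is a list, the next path segment must be an integer index, so a name like knn fails:

```python
def set_dotted_path(obj, path, value):
    """Toy dotted-path writer: dict segments are keys, list segments must be integer indices."""
    parts = path.split(".")
    for part in parts[:-1]:
        if isinstance(obj, list):
            obj = obj[int(part)]  # raises ValueError if the segment is not an integer
        else:
            obj = obj.setdefault(part, {})
    last = parts[-1]
    if isinstance(obj, list):
        obj[int(last)] = value  # same constraint applies to the final segment
    else:
        obj[last] = value

doc = {"dynamicProperties": {"description": ["text1", "text2"]}}
try:
    # "description" resolves to a list, so the next segment "knn" must be an integer index
    set_dotted_path(doc, "dynamicProperties.description.knn", [0.1, 0.2])
except ValueError as err:
    print("cannot use as index:", err)
```

This mirrors the NumberFormatException in the stack trace: the path walker tries int("knn") when it reaches the array.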

To further troubleshoot the issue, can you provide more information about the model? It seems you are using a local model. What does the predict request look like? Then I can help check the model_input and mapping for you.

@IanMenendez
Author

@mingshl I fixed the issue. But have another one :)

My index looks something like this:

PUT ml_index
{
  "settings": {
    "index": {
      "knn": true,
      "default_pipeline": "testing_pipeline"
    }
  },
  "mappings": {
    "properties": {
      "dynamicProperties": {
        "properties": {
          "description": {
            "type": "text"
          }
        }
      },
      "description_minilm_embedding": {
        "type": "nested",
        "properties": {
          "knn": {
            "type": "knn_vector",
            "dimension": 384
          }
        }
      }
    }
  }
}

Now my ml_inference processor looks like this:

PUT /_ingest/pipeline/testing_pipeline
{
  "processors": [
    {
      "ml_inference": {
        "model_id": "DO2Ew5EBAm-NfbMQYIyT",
        "function_name": "text_embedding",
                        "model_input": "{\"text_docs\": [\"${input_map.text_docs}\"], \"return_number\": true, \"target_response\": [\"sentence_embedding\"]}",
        "input_map": [
          {
            "text_docs": "dynamicProperties.description"
          }
        ],
        "output_map": [
          {
            "description_minilm_embedding.knn": "$.inference_results.*.output.*.data"
          }
        ],
        "full_response_path": false
      }
    }
  ]
}

I have several documents with special characters that break the ML inference processor. For example:

POST ml_index/_doc
{
  "dynamicProperties": {
    "description": """<span style="color: rgb(34, 34, 34); font-family: arial, sans-serif; line-height: normal; ">Columbia's fleece jacket has a soft feel and comfortable modern classic fit.</span>"""
  }
}


which throws

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Invalid payload: {\"text_docs\": [\"<span style=\"color: rgb(34, 34, 34); font-family: arial, sans-serif; line-height: normal; \">Columbia's fleece jacket has a soft feel and comfortable modern classic fit.</span>\"], \"return_number\": true, \"target_response\": [\"sentence_embedding\"]}"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Invalid payload: {\"text_docs\": [\"<span style=\"color: rgb(34, 34, 34); font-family: arial, sans-serif; line-height: normal; \">Columbia's fleece jacket has a soft feel and comfortable modern classic fit.</span>\"], \"return_number\": true, \"target_response\": [\"sentence_embedding\"]}"
  },
  "status": 400
}

This is because the " characters inside the document are not escaped. Is there a way to escape them from inside the ML inference processor? We have no easy way to escape them before these documents are ingested into our index.
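The root cause can be seen in a few lines of Python: splicing raw text into a JSON template breaks as soon as the text contains double quotes, while serializing the value through a JSON encoder escapes them. Here json.dumps stands in for whatever escaping the processor would need to apply; it is an illustration, not the plugin's code:

```python
import json

# The problematic field value from the bug report (shortened)
description = '<span style="color: rgb(34, 34, 34);">Columbia\'s fleece jacket</span>'

# Naive splicing: the embedded double quotes terminate the JSON string early
naive = '{"text_docs": ["' + description + '"], "return_number": true}'
try:
    json.loads(naive)
except json.JSONDecodeError as err:
    print("Invalid payload:", err)

# Serializing the value with a JSON encoder escapes the quotes, so the payload parses
safe = '{"text_docs": [' + json.dumps(description) + '], "return_number": true}'
assert json.loads(safe)["text_docs"][0] == description
```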

@mingshl
Collaborator

mingshl commented Sep 11, 2024

Check out this PR. I introduced a method _toString() that helps convert the value to String format.

In your pipeline config, try setting the model_input as:

"model_input": "{\"text_docs\": [\"${input_map.text_docs._toString()}\"], \"return_number\": true, \"target_response\": [\"sentence_embedding\"]}"

@IanMenendez
Author

This worked, thanks!
