
[BUG] ml_inference ingest processor incorrectly parsing input field #2904

Closed
IanMenendez opened this issue Sep 5, 2024 · 5 comments
Labels
bug, untriaged

Comments

@IanMenendez

What is the bug?
The ml_inference ingest processor does not correctly parse the input field when it is given as a full JSON path.
This was tested with an OpenSearch-hosted ML model.

How can one reproduce the bug?


PUT /_ingest/pipeline/ml_inference_pipeline
{
  "processors": [
    {
      "ml_inference": {
        "model_id": "DO2Ew5EBAm-NfbMQYIyT",
        "function_name": "text_embedding",
        "model_input": """{"text_docs": ${input_map.text_docs}, "target_response": ["sentence_embedding"]}""",
        "input_map": [
          {
            "text_docs": "dynamicProperties.description"
          }
        ],
        "output_map": [
          {
            "dynamicProperties.description.knn": "$.inference_results.*.output.*.data"
          }
        ],
        "full_response_path": true
      }
    }
  ]
}
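For reference, the intent of the model_input template above can be sketched in Python. This is a toy illustration of the placeholder substitution, not the plugin's actual code; the variable names and the use of a plain string replace are assumptions:

```python
import json

# Simulated document _source and the input_map from the pipeline above
source = {"dynamicProperties.description": ["text1", "text2"]}
input_map = {"text_docs": "dynamicProperties.description"}

# Resolve each input_map entry against the document _source
resolved = {key: source.get(path) for key, path in input_map.items()}

# Substitute the resolved value into the model_input template
template = '{"text_docs": ${input_map.text_docs}, "target_response": ["sentence_embedding"]}'
payload = template.replace("${input_map.text_docs}", json.dumps(resolved["text_docs"]))

# The substituted result must be valid JSON for the model to accept it
body = json.loads(payload)
print(body["text_docs"])
```

If the substituted value is not itself valid JSON, the whole payload fails to parse, which is relevant to the escaping problem discussed later in this thread.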

POST _ingest/pipeline/ml_inference_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "dynamicProperties.description": [
          "text1",
          "text2"
        ]
      }
    }
  ]
}

This returns:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "dynamicProperties.description": [
            "text1",
            "text2"
          ]
        },
        "_ingest": {
          "timestamp": "2024-09-05T21:57:13.113971065Z"
        }
      }
    }
  ]
}

What is the expected behavior?
I expect the processor to yield text embeddings.

@IanMenendez IanMenendez added bug Something isn't working untriaged labels Sep 5, 2024
@IanMenendez
Author

I found that even if I do not use a full JSON path as input,

POST _ingest/pipeline/ml_inference_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "dynamicProperties": {
          "description": [
            "text1",
            "text2"
          ]
        }
      }
    }
  ]
}

I get the following error:


{
  "docs": [
    {
      "error": {
        "root_cause": [
          {
            "type": "response_handler_failure_transport_exception",
            "reason": "java.lang.IllegalArgumentException: [knn] is not an integer, cannot be used as an index as part of path [dynamicProperties.description.knn]"
          }
        ],
        "type": "response_handler_failure_transport_exception",
        "reason": "java.lang.IllegalArgumentException: [knn] is not an integer, cannot be used as an index as part of path [dynamicProperties.description.knn]",
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "[knn] is not an integer, cannot be used as an index as part of path [dynamicProperties.description.knn]",
          "caused_by": {
            "type": "number_format_exception",
            "reason": "For input string: \"knn\""
          }
        }
      }
    }
  ]
}

So is it not possible to have nested objects as output?

@IanMenendez IanMenendez changed the title [BUG] ml_inference ingest processor not correctly parsing input field [BUG] ml_inference ingest processor incorrectly parsing input field Sep 5, 2024
@mingshl
Collaborator

mingshl commented Sep 11, 2024

> Found that even if I do not use full json path as input […] So it is not possible to have nested objects as output??

Hi @IanMenendez, what is your index setting? If you set up the knn field to be a knn_vector field type,

similar to this example:

curl -XPUT "http://localhost:9200/my-knn-index-1" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
        "my_vector1": {
          "type": "knn_vector",
          "dimension": 2,
          "method": {
            "name": "hnsw",
            "space_type": "l2",
            "engine": "nmslib",
            "parameters": {
              "ef_construction": 128,
              "m": 24
            }
          }
        },
        "my_vector2": {
          "type": "knn_vector",
          "dimension": 4,
          "method": {
            "name": "hnsw",
            "space_type": "innerproduct",
            "engine": "faiss",
            "parameters": {
              "ef_construction": 256,
              "m": 48
            }
          }
        }
    }
  }
}'

the mapping will check whether that path segment is an integer index. Writing to that path is not allowed by the ml_inference ingest processor, and it is not allowed by the mapping in this case.
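The "[knn] is not an integer" error above can be reproduced with a toy dotted-path writer. This is illustrative Python, not the actual OpenSearch code: the point is that when an intermediate value in the document is a list, the next path segment must be an integer index, so a name like knn fails:

```python
def set_dotted_path(obj, path, value):
    """Toy dotted-path writer: dict segments are keys, list segments must be integer indices."""
    parts = path.split(".")
    for part in parts[:-1]:
        if isinstance(obj, list):
            obj = obj[int(part)]  # raises ValueError if the segment is not an integer
        else:
            obj = obj.setdefault(part, {})
    last = parts[-1]
    if isinstance(obj, list):
        obj[int(last)] = value  # same constraint applies to the final segment
    else:
        obj[last] = value

doc = {"dynamicProperties": {"description": ["text1", "text2"]}}
try:
    # "description" resolves to a list, so the next segment "knn" must be an integer index
    set_dotted_path(doc, "dynamicProperties.description.knn", [0.1, 0.2])
except ValueError as err:
    print("cannot use as index:", err)
```

This mirrors the NumberFormatException in the stack trace: the path walker tries int("knn") when it reaches the array.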

To further troubleshoot the issue, can you provide more information about the model? It seems you are using a local model. What does the predict request look like? Then I can help check the model_input and mapping for you.

@IanMenendez
Author

@mingshl I fixed the issue. But have another one :)

My index looks something like this:

PUT ml_index
{
  "settings": {
    "index": {
      "knn": true,
      "default_pipeline": "testing_pipeline"
    }
  },
  "mappings": {
    "properties": {
      "dynamicProperties": {
        "properties": {
          "description": {
            "type": "text"
          }
        }
      },
      "description_minilm_embedding": {
        "type": "nested",
        "properties": {
          "knn": {
            "type": "knn_vector",
            "dimension": 384
          }
        }
      }
    }
  }
}

Now my ml_inference processor looks like this:

PUT /_ingest/pipeline/testing_pipeline
{
  "processors": [
    {
      "ml_inference": {
        "model_id": "DO2Ew5EBAm-NfbMQYIyT",
        "function_name": "text_embedding",
                        "model_input": "{\"text_docs\": [\"${input_map.text_docs}\"], \"return_number\": true, \"target_response\": [\"sentence_embedding\"]}",
        "input_map": [
          {
            "text_docs": "dynamicProperties.description"
          }
        ],
        "output_map": [
          {
            "description_minilm_embedding.knn": "$.inference_results.*.output.*.data"
          }
        ],
        "full_response_path": false
      }
    }
  ]
}

I have several documents with special characters that break the ML inference processor. For example:

POST ml_index/_doc
{
  "dynamicProperties": {
    "description": """<span style="color: rgb(34, 34, 34); font-family: arial, sans-serif; line-height: normal; ">Columbia's fleece jacket has a soft feel and comfortable modern classic fit.</span>"""
  }
}


which throws

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Invalid payload: {\"text_docs\": [\"<span style=\"color: rgb(34, 34, 34); font-family: arial, sans-serif; line-height: normal; \">Columbia's fleece jacket has a soft feel and comfortable modern classic fit.</span>\"], \"return_number\": true, \"target_response\": [\"sentence_embedding\"]}"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Invalid payload: {\"text_docs\": [\"<span style=\"color: rgb(34, 34, 34); font-family: arial, sans-serif; line-height: normal; \">Columbia's fleece jacket has a soft feel and comfortable modern classic fit.</span>\"], \"return_number\": true, \"target_response\": [\"sentence_embedding\"]}"
  },
  "status": 400
}

This is because the " characters inside the document are not escaped. Is there a way to escape them from inside the ML inference processor? We have no easy way to escape them before these documents are ingested into our index.
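The root cause can be seen in a few lines of Python: splicing raw text into a JSON template breaks as soon as the text contains double quotes, while serializing the value through a JSON encoder escapes them. Here json.dumps stands in for whatever escaping the processor would need to apply; it is an illustration, not the plugin's code:

```python
import json

# The problematic field value from the bug report (shortened)
description = '<span style="color: rgb(34, 34, 34);">Columbia\'s fleece jacket</span>'

# Naive splicing: the embedded double quotes terminate the JSON string early
naive = '{"text_docs": ["' + description + '"], "return_number": true}'
try:
    json.loads(naive)
except json.JSONDecodeError as err:
    print("Invalid payload:", err)

# Serializing the value with a JSON encoder escapes the quotes, so the payload parses
safe = '{"text_docs": [' + json.dumps(description) + '], "return_number": true}'
assert json.loads(safe)["text_docs"][0] == description
```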

@mingshl
Collaborator

mingshl commented Sep 11, 2024

Check out this PR. I introduced a method _toString() that helps convert the value to String format.

In your pipeline config, try setting the model_input as:

"model_input": "{\"text_docs\": [\"${input_map.text_docs._toString()}\"], \"return_number\": true, \"target_response\": [\"sentence_embedding\"]}"

@IanMenendez
Author

This worked, thanks!
