Add configuration option to keep raw JSON messages #9172

Closed

Conversation

kvch (Contributor) commented Nov 20, 2018

A new configuration option named keep_original is added to the docker input and to the json settings of the log input. By default it is set to false.

If the option is enabled, the raw content is published under the log.original field. The length of the original message is limited by the max_bytes option of the input.

If log.original is part of the incoming event, the field added by Filebeat is overwritten. It behaves the same as log.offset or log.source. (If log.original is part of the event, it is possible to keep both fields by setting json.keys_under_root to false.)

Partial implementation of #8083

@kvch kvch added the review, Filebeat, and needs_backport (PR is waiting to be backported to other branches) labels Nov 20, 2018
@kvch kvch requested review from urso and ruflin November 20, 2018 13:34
kvch (Contributor Author) commented Nov 21, 2018

Failing tests are unrelated.

@ph ph self-requested a review November 21, 2018 12:56
ph (Contributor) commented Nov 22, 2018

Looking at this @kvch!

@@ -313,6 +313,9 @@ func (h *Harvester) Run() error {
                },
            },
        }
        if len(message.Original) > 0 {
            fields.Put("log.original", string(message.Original))
        }
Contributor:

Quite unlikely, but still possible: the original event may already have log.original present in the JSON structure, which means we would override the value.

I think we should do the following:

  • Make sure that we don't override an existing value.
  • Maybe log if the value already exists.
  • Make sure that we can configure the target field.
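
A minimal sketch of such a guard, assuming a hypothetical helper and a configurable target field; the names here (putOriginal, targetField) are illustrative, not the PR's actual code:

    package input

    import (
        "github.com/elastic/beats/libbeat/common"
        "github.com/elastic/beats/libbeat/logp"
    )

    // putOriginal writes the raw message into the configured target field
    // unless the decoded event already carries that field. A hypothetical
    // helper covering the three points above, not the PR's implementation.
    func putOriginal(fields common.MapStr, targetField string, raw []byte) {
        if len(raw) == 0 {
            return
        }
        if found, _ := fields.HasKey(targetField); found {
            // Don't override the existing value; just note the collision.
            logp.Debug("harvester", "field %s already set in event, not overwriting", targetField)
            return
        }
        fields.Put(targetField, string(raw))
    }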

Reviewer:

Besides the given proposal, the correct behavior also depends on the JSON settings.

If keys_under_root is disabled, we can just write to log.original.

If keys_under_root is enabled and:

  • overwrite_keys==false => overwrite log.original if present. If log is not an object but a string, remove log before putting log.original.
  • overwrite_keys==true => only write to log.original if it is not present AND log is an object.

Maybe it would be easier to have the json reader/parser add the field.
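
For illustration, that decision table could look like the sketch below; writeOriginal and jsonConfig are hypothetical names that only mirror the options discussed here:

    package input

    import "github.com/elastic/beats/libbeat/common"

    // jsonConfig mirrors the two JSON reader options discussed above.
    type jsonConfig struct {
        KeysUnderRoot bool
        OverwriteKeys bool
    }

    // writeOriginal applies the proposed rules; a sketch, not the PR's code.
    func writeOriginal(fields common.MapStr, cfg jsonConfig, raw []byte) {
        if !cfg.KeysUnderRoot {
            fields.Put("log.original", string(raw))
            return
        }
        logValue, _ := fields.GetValue("log")
        _, logIsObject := logValue.(common.MapStr)
        if !cfg.OverwriteKeys {
            if !logIsObject {
                // "log" decoded to a scalar; drop it so Put can create
                // the nested object.
                fields.Delete("log")
            }
            fields.Put("log.original", string(raw))
            return
        }
        // overwrite_keys==true: only add when not already present and
        // "log" is an object.
        if found, _ := fields.HasKey("log.original"); !found && logIsObject {
            fields.Put("log.original", string(raw))
        }
    }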

Contributor Author:

The new log.original is a metadata field of the event. It should behave the same way as log.offset or log.source do: both are added to the fields first and can be overwritten later if the JSON settings require it. Thus we get the correct behaviour you have described.
So it is not the responsibility of the JSON reader to add it, especially if we later add this feature for all event formats. It is metadata like log.source and should be added the same way, so users can expect the same well-known behaviour.

urso commented Nov 28, 2018:

👍

Do we have tests for this working with different combinations of configs? So no one breaks behavior in the future by accident.

If not, let us at least create an issue.

Contributor Author:

I only added test cases with keys_under_root enabled. I am creating an issue to add more tests.

Contributor Author:

Done: #9273

Collaborator:

Since we have the rename processor, I would propose removing the keys_under_root option in 7.0 to eliminate such issues. What I haven't tested is whether rename works properly when moving fields under root.

ph (Contributor) commented Nov 22, 2018

@kvch Great addition! I am pretty sure it will help us debug some use cases :)

webmat (Contributor) commented Nov 22, 2018

Agreed. Also great for security use cases, where you never want to fully trust the nice parsing. You always want to be able to look at the original untouched event to confirm.

kvch (Contributor Author) commented Nov 22, 2018

I have added a new option to configure the name of the field: original_message_target.
If the field already exists in the incoming event, the raw string is not added to the event.

ph (Contributor) commented Nov 22, 2018

make check is failing, we probably need to run make update to pick up some changes.

@ruflin ruflin requested a review from exekias November 22, 2018 21:10
output = self.read_output()
assert len(output) == 1
assert "my_custom_original_message" in output[0]
assert output[0]["my_custom_original_message"].count("\n") == 2
Collaborator:

What we have here is what I would expect in event.original. In log.original I would expect that it is already the combined JSON.

Contributor Author:

I am not sure I understand what you mean. This is the event produced by the Beat in this test:

{
  "@timestamp": "2018-11-22T15:26:02.702Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "doc",
    "version": "7.0.0"
  },
  "log": {
    "offset": 0,
    "file": {
      "path": "/home/n/go/src/github.com/elastic/beats/filebeat/build/system-tests/run/test_json.Test.test_keep_original_multiline/log/json_multiline.log"
    },
    "flags": [
      "multiline"
    ]
  },
  "json": {},
  "input": {
    "type": "log"
  },
  "host": {
    "name": "sleipnir"
  },
  "agent": {
    "type": "filebeat",
    "hostname": "sleipnir",
    "version": "7.0.0"
  },
  "message": "[2018-05-05 18:00] line 1\n    line 2\n    line 3\n",
  "my_custom_original_message": "{\"log_line\": \"[2018-05-05 18:00] line 1\"}\n{\"log_line\": \"    line 2\"}\n{\"log_line\": \"    line 3\"}"
}

Three raw events are joined by a newline.

Do you want to see something else instead?

Collaborator:

I was thinking that it should already be a valid JSON object. What you have above is what I would expect for a multiline event, but not in the case of docker. I think combining the different JSON objects into one is something that is very effectively done on the edge. It is a modification of the event, but all the data is still "original", ready for processing.

Contributor Author:

You mean something like this as the value of log.original:

{
    {\"log_line\": \"[2018-05-05 18:00] line 1\"},
    {\"log_line\": \"    line 2\"},
    {\"log_line\": \"    line 3\"}
}

?

Collaborator:

I was more thinking of {\"log_line\": \"[2018-05-05 18:00] line 1\n line 2\n line 3\"}. In this case it's almost identical to the message, but the encoded JSON string is still there.

Contributor Author:

I do see that this format has its advantages, e.g. no need for multiline parsing during reprocessing. But I am worried about the performance overhead of merging multiple events with numerous (nested) fields. I would prefer to have the raw events concatenated under log.original and have them parsed again later IF needed. I would not do any parsing in advance without knowing for sure that I am going to need it.

Collaborator:

My thinking on what should go into log.original so far: everything that is done by the input/readers should be done and then put into log.original. This means the ingest pipeline could in general rely on processing/reprocessing log.original. In the above case I think the multiline part is local knowledge of the input and not of the processing side.

The statement about ingest pipelines relying on log.original has somewhat bigger implications, e.g. we should potentially not do dissect or JSON decoding on the Beats side. But I would ignore the local processors for now and focus on what the input does.

What you currently have in log.original above would, for me, belong in event.original.

@kvch kvch changed the title Keep raw JSON messages Add configuration option to keep raw JSON messages Nov 23, 2018
@kvch kvch force-pushed the feature-filebeat-json-keep-original-msg branch from f1abaf8 to a9e9157 Compare November 26, 2018 13:21
kvch (Contributor Author) commented Nov 26, 2018

I rebased the branch and dropped the commit which introduced renaming of the field.

ph (Contributor) left a comment:

LGTM, but the CI intake is failing on this PR.

        h.config.DockerJSON.Partial,
        h.config.DockerJSON.ForceCRI,
        h.config.DockerJSON.CRIFlags,
        h.config.DockerJSON.KeepOriginal)
Reviewer:

5 params already. Maybe we had better introduce a configuration/settings struct in the readjson package.
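
For illustration, such a settings struct might look like the sketch below; the last four field names mirror the parameters visible in the diff, while the struct name and the Stream field are assumptions:

    package readjson

    // Config would collect the reader options that are currently passed
    // as positional parameters. The last four fields mirror the diff
    // above; Stream is a guess at the remaining parameter and purely
    // illustrative.
    type Config struct {
        Stream       string // e.g. "all", "stdout", "stderr"
        Partial      bool
        ForceCRI     bool
        CRIFlags     bool
        KeepOriginal bool
    }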

            mlr.message.Original = append(mlr.message.Original, mlr.separator...)
        }
        mlr.message.Original = append(mlr.message.Original, m.Original...)
    }
Reviewer:

Is there an upper limit on the size of mlr.message.Original? When encoding to JSON, these might become quite big. At worst, events might be dropped due to the increased size.

The multiline reader already ignores new lines once max_bytes has been reached, so as to keep memory usage at least somewhat under control.

Collaborator:

In case we introduce a max_bytes, will we make the content of log.original valid JSON, or would it miss the last part?

Reviewer:

It would miss the last part, as is the case for unparsed raw text.

The default limit for message is already 10MB. We have users setting it to values like 250MB (XML). Do we really want to deal with potential 500MB events or bigger? I understand the sentiment of really keeping the original, but there is a limit to what can be shipped and what is accepted by the sinks.

Collaborator:

I'm OK with dropping, as long as we have a config option the user can adjust. Also, we should log at least at the debug level when we drop things.
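
A sketch of that behavior under the assumption of a hypothetical cap; appendCapped and the max parameter are illustrative, the PR itself has no such limit yet:

    package multiline

    import "github.com/elastic/beats/libbeat/logp"

    // appendCapped joins next onto buf with sep, never letting the result
    // grow beyond max bytes, and logs at debug level whenever content is
    // dropped. A sketch of the behavior discussed above, not the PR's code.
    func appendCapped(buf, sep, next []byte, max int) []byte {
        remaining := max - len(buf)
        if remaining <= 0 {
            logp.Debug("multiline", "original already at %d bytes, dropping %d more", max, len(sep)+len(next))
            return buf
        }
        chunk := make([]byte, 0, len(sep)+len(next))
        chunk = append(chunk, sep...)
        chunk = append(chunk, next...)
        if len(chunk) > remaining {
            logp.Debug("multiline", "truncating original at %d bytes", max)
            chunk = chunk[:remaining]
        }
        return append(buf, chunk...)
    }

With this approach the truncated log.original would not be valid JSON, matching the trade-off accepted above.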

@@ -42,6 +42,11 @@ func (p *StripNewline) Next() (reader.Message, error) {
    L := message.Content
    message.Content = L[:len(L)-lineEndingChars(L)]

    if len(message.Original) > 0 {
        O := message.Original
        message.Original = O[:len(O)-lineEndingChars(O)]
Reviewer:

Minor detail, but maybe we don't want to strip the newline here? Right now the multiline reader inserts a separator after this one has been removed => we don't report the original, but some modified contents. We could simply concatenate message.Original if we did not strip newlines here.

@urso urso force-pushed the feature-filebeat-json-keep-original-msg branch from 36c5b74 to 7bcd2ae Compare December 10, 2018 18:30
kvch (Contributor Author) commented Apr 9, 2019

Closing, as the new processors copy_fields and truncate_fields have been added.

@kvch kvch closed this Apr 9, 2019
@kvch kvch removed the needs_backport label Aug 29, 2019