Add configuration option to keep raw JSON messages #9172

Closed

Conversation

kvch (Contributor) commented Nov 20, 2018

A new configuration option named keep_original is added to the docker input and to the json settings of the log input. By default it is set to false.

If the option is enabled, the raw content is published under the log.original field. The length of the original message is limited by the max_bytes option of the input.

If log.original is part of the incoming event, the field added by Filebeat is overwritten. It behaves the same as log.offset or log.source. (If log.original is part of the event, it is possible to keep both fields by setting json.keys_under_root to false.)

Partial implementation of #8083

@kvch kvch added the review, Filebeat, and needs_backport (PR is waiting to be backported to other branches) labels Nov 20, 2018
@kvch kvch requested review from urso and ruflin November 20, 2018 13:34
kvch (Contributor Author) commented Nov 21, 2018

Failing tests are unrelated.

@ph ph self-requested a review November 21, 2018 12:56
ph (Contributor) commented Nov 22, 2018

Looking at this @kvch!

@@ -313,6 +313,9 @@ func (h *Harvester) Run() error {
                },
            },
        }
        if len(message.Original) > 0 {
            fields.Put("log.original", string(message.Original))
        }
Contributor:

Quite unlikely, but still possible: the original event may already have log.original present in the JSON structure, which means we would override the value.

I think we should do the following:

  • Make sure that we don't override an existing value.
  • Maybe log if the value already exists.
  • Make sure that we can configure the target field.
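
A minimal sketch of such a guard, assuming a hypothetical helper and a configurable target field; the names here (putOriginal, targetField) are illustrative, not the PR's actual code:

    package input

    import (
        "github.com/elastic/beats/libbeat/common"
        "github.com/elastic/beats/libbeat/logp"
    )

    // putOriginal writes the raw message into the configured target field
    // unless the decoded event already carries that field. A hypothetical
    // helper covering the three points above, not the PR's implementation.
    func putOriginal(fields common.MapStr, targetField string, raw []byte) {
        if len(raw) == 0 {
            return
        }
        if found, _ := fields.HasKey(targetField); found {
            // Don't override the existing value; just note the collision.
            logp.Debug("harvester", "field %s already set in event, not overwriting", targetField)
            return
        }
        fields.Put(targetField, string(raw))
    }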

Reviewer:

Besides the given proposal, the correct behavior also depends on the JSON settings.

If keys_under_root is disabled, we can just write to log.original.

If keys_under_root is enabled and:

  • overwrite_keys==false => overwrite log.original if present. If log is not an object but a string, remove log before putting log.original.
  • overwrite_keys==true => only write to log.original if it is not present AND log is an object.

Maybe it would be easier to have the json reader/parser add the field.
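
For illustration, that decision table could look like the sketch below; writeOriginal and jsonConfig are hypothetical names that only mirror the options discussed here:

    package input

    import "github.com/elastic/beats/libbeat/common"

    // jsonConfig mirrors the two JSON reader options discussed above.
    type jsonConfig struct {
        KeysUnderRoot bool
        OverwriteKeys bool
    }

    // writeOriginal applies the proposed rules; a sketch, not the PR's code.
    func writeOriginal(fields common.MapStr, cfg jsonConfig, raw []byte) {
        if !cfg.KeysUnderRoot {
            fields.Put("log.original", string(raw))
            return
        }
        logValue, _ := fields.GetValue("log")
        _, logIsObject := logValue.(common.MapStr)
        if !cfg.OverwriteKeys {
            if !logIsObject {
                // "log" decoded to a scalar; drop it so Put can create
                // the nested object.
                fields.Delete("log")
            }
            fields.Put("log.original", string(raw))
            return
        }
        // overwrite_keys==true: only add when not already present and
        // "log" is an object.
        if found, _ := fields.HasKey("log.original"); !found && logIsObject {
            fields.Put("log.original", string(raw))
        }
    }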

Contributor Author:

The new log.original is a metadata field of the event. It should behave the same way as log.offset or log.source do: both are added to the fields first and can be overwritten later if the JSON settings require it. Thus we get the correct behaviour you have described.
So it is not the responsibility of the JSON reader to add it, especially if we later add this feature for all event formats. It is metadata like log.source and should be added the same way, so users can expect the same well-known behaviour.

urso commented Nov 28, 2018:

👍

Do we have tests for this working with different combinations of configs? So no one breaks behavior in the future by accident.

If not, let us at least create an issue.

Contributor Author:

I only added test cases with keys_under_root enabled. I am creating an issue to add more tests.

Contributor Author:

Done: #9273

Collaborator:

Since we have the rename processor, I would propose removing the keys_under_root option in 7.0 to eliminate such issues. What I haven't tested is whether rename works properly when moving fields under root.

ph (Contributor) commented Nov 22, 2018

@kvch Great addition! I am pretty sure it will help us debug some use cases :)

webmat (Contributor) commented Nov 22, 2018

Agreed. Also great for security use cases, where you never want to fully trust the nice parsing. You always want to be able to look at the original untouched event to confirm.

kvch (Contributor Author) commented Nov 22, 2018

I have added a new option to configure the name of the field: original_message_target.
If the field already exists in the incoming event, the raw string is not added to the event.

ph (Contributor) commented Nov 22, 2018

make check is failing, we probably need to run make update to pick up some changes.

@ruflin ruflin requested a review from exekias November 22, 2018 21:10
output = self.read_output()
assert len(output) == 1
assert "my_custom_original_message" in output[0]
assert output[0]["my_custom_original_message"].count("\n") == 2
Collaborator:

What we have here is what I would expect in event.original. In log.original I would expect that it is already the combined JSON.

Contributor Author:

I am not sure I understand what you mean. This is the event produced by the Beat in this test:

{
  "@timestamp": "2018-11-22T15:26:02.702Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "doc",
    "version": "7.0.0"
  },
  "log": {
    "offset": 0,
    "file": {
      "path": "/home/n/go/src/github.com/elastic/beats/filebeat/build/system-tests/run/test_json.Test.test_keep_original_multiline/log/json_multiline.log"
    },
    "flags": [
      "multiline"
    ]
  },
  "json": {},
  "input": {
    "type": "log"
  },
  "host": {
    "name": "sleipnir"
  },
  "agent": {
    "type": "filebeat",
    "hostname": "sleipnir",
    "version": "7.0.0"
  },
  "message": "[2018-05-05 18:00] line 1\n    line 2\n    line 3\n",
  "my_custom_original_message": "{\"log_line\": \"[2018-05-05 18:00] line 1\"}\n{\"log_line\": \"    line 2\"}\n{\"log_line\": \"    line 3\"}"
}

Three raw events are joined by a newline.

Do you want to see something else instead?

Collaborator:

I was thinking that it should already be a valid JSON object. What you have above is what I would expect for a multiline event, but not in the case of docker. I think combining the different JSON objects into one is something that is very effectively done on the edge. It is a modification of the event, but all the data is still "original", ready for processing.

Contributor Author:

You mean something like this as the value of log.original:

{
    {\"log_line\": \"[2018-05-05 18:00] line 1\"},
    {\"log_line\": \"    line 2\"},
    {\"log_line\": \"    line 3\"}
}

?

Collaborator:

I was more thinking of {\"log_line\": \"[2018-05-05 18:00] line 1\n line 2\n line 3\"}. In this case it's almost identical to the message, but the encoded JSON string is still there.

Contributor Author:

I do see that this format has its advantages, e.g. no need for multiline parsing during reprocessing. But I am worried about the performance overhead of merging multiple events with numerous (nested) fields. I would prefer to have the raw events concatenated under log.original and have them parsed again later IF needed. I would not do any parsing in advance without knowing for sure that I am going to need it.

Collaborator:

My thinking on what should go into log.original so far: everything that is done by the input/readers should be done and then put into log.original. This means the ingest pipeline could in general rely on processing/reprocessing log.original. In the above case I think the multiline part is local knowledge of the input and not of the processing side.

The statement about ingest pipelines relying on log.original has somewhat bigger implications, e.g. we should potentially not do dissect or JSON decoding on the Beats side. But I would ignore the local processors for now and focus on what the input does.

What you currently have in log.original above would, for me, belong in event.original.

@kvch kvch changed the title Keep raw JSON messages Add configuration option to keep raw JSON messages Nov 23, 2018
@kvch kvch force-pushed the feature-filebeat-json-keep-original-msg branch from f1abaf8 to a9e9157 Compare November 26, 2018 13:21
kvch (Contributor Author) commented Nov 26, 2018

I rebased the branch and dropped the commit which introduced renaming of the field.

ph (Contributor) left a comment:

LGTM, but the CI intake is failing on this PR.

        h.config.DockerJSON.Partial,
        h.config.DockerJSON.ForceCRI,
        h.config.DockerJSON.CRIFlags,
        h.config.DockerJSON.KeepOriginal)
Reviewer:

5 params already. Maybe we had better introduce a configuration/settings struct in the readjson package.
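
For illustration, such a settings struct might look like the sketch below; the last four field names mirror the parameters visible in the diff, while the struct name and the Stream field are assumptions:

    package readjson

    // Config would collect the reader options that are currently passed
    // as positional parameters. The last four fields mirror the diff
    // above; Stream is a guess at the remaining parameter and purely
    // illustrative.
    type Config struct {
        Stream       string // e.g. "all", "stdout", "stderr"
        Partial      bool
        ForceCRI     bool
        CRIFlags     bool
        KeepOriginal bool
    }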

            mlr.message.Original = append(mlr.message.Original, mlr.separator...)
        }
        mlr.message.Original = append(mlr.message.Original, m.Original...)
    }
Reviewer:

Is there an upper limit on the size of mlr.message.Original? When encoding to JSON, these might become quite big. At worst, events might be dropped due to the increased size.

The multiline reader already ignores new lines once max_bytes has been reached, so as to keep memory usage at least somewhat under control.

Collaborator:

In case we introduce a max_bytes, will we make the content of log.original valid JSON, or would it miss the last part?

Reviewer:

It would miss the last part, as is the case for unparsed raw text.

The default limit for message is already 10MB. We have users setting it to values like 250MB (XML). Do we really want to deal with potential 500MB events or bigger? I understand the sentiment of really keeping the original, but there is a limit to what can be shipped and what is accepted by the sinks.

Collaborator:

I'm OK with dropping, as long as we have a config option the user can adjust. Also, we should log at least at the debug level when we drop things.
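
A sketch of that behavior under the assumption of a hypothetical cap; appendCapped and the max parameter are illustrative, the PR itself has no such limit yet:

    package multiline

    import "github.com/elastic/beats/libbeat/logp"

    // appendCapped joins next onto buf with sep, never letting the result
    // grow beyond max bytes, and logs at debug level whenever content is
    // dropped. A sketch of the behavior discussed above, not the PR's code.
    func appendCapped(buf, sep, next []byte, max int) []byte {
        remaining := max - len(buf)
        if remaining <= 0 {
            logp.Debug("multiline", "original already at %d bytes, dropping %d more", max, len(sep)+len(next))
            return buf
        }
        chunk := make([]byte, 0, len(sep)+len(next))
        chunk = append(chunk, sep...)
        chunk = append(chunk, next...)
        if len(chunk) > remaining {
            logp.Debug("multiline", "truncating original at %d bytes", max)
            chunk = chunk[:remaining]
        }
        return append(buf, chunk...)
    }

With this approach the truncated log.original would not be valid JSON, matching the trade-off accepted above.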

@@ -42,6 +42,11 @@ func (p *StripNewline) Next() (reader.Message, error) {
    L := message.Content
    message.Content = L[:len(L)-lineEndingChars(L)]

    if len(message.Original) > 0 {
        O := message.Original
        message.Original = O[:len(O)-lineEndingChars(O)]
Reviewer:

Minor detail, but maybe we don't want to strip the newline here? Right now the multiline reader inserts a separator after this one has been removed => we don't report the original, but some modified contents. We could simply concatenate message.Original if we did not strip newlines here.

@urso urso force-pushed the feature-filebeat-json-keep-original-msg branch from 36c5b74 to 7bcd2ae Compare December 10, 2018 18:30
kvch (Contributor Author) commented Apr 9, 2019

Closing, as the new processors copy_fields and truncate_fields have been added.

@kvch kvch closed this Apr 9, 2019
@kvch kvch removed the needs_backport label Aug 29, 2019