Add configuration option to keep raw JSON messages #9172
Conversation
Failing tests are unrelated.
Looking at this @kvch!
@@ -313,6 +313,9 @@ func (h *Harvester) Run() error {
			},
		},
	}

	if len(message.Original) > 0 {
		fields.Put("log.original", string(message.Original))
	}
Quite unlikely, but still possible, that the original event already has log.original present in the JSON structure; this means we would override the value.
I think we should do the following:
- Make sure that we don't override an existing value.
- Maybe log if the value already exists.
- Make sure that we can configure the target field.
Besides the given proposal, the correct behavior also depends on the JSON settings.
If keys_under_root is disabled, we can just write to log.original.
If keys_under_root is enabled and:
- overwrite_keys==false => overwrite log.original if present. If log is not an object but a string, remove log before putting log.original.
- overwrite_keys==true => only write to log.original if not present AND log is an object.
Maybe it would be easier to have the json reader/parser add the field.
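A minimal sketch of that decision table, assuming libbeat's common.MapStr helpers; putOriginal and its parameters are illustrative names, not this PR's implementation:

package main

import (
	"fmt"

	"github.com/elastic/beats/libbeat/common"
)

// putOriginal writes raw under log.original following the rules above,
// honoring the keys_under_root / overwrite_keys semantics.
func putOriginal(fields common.MapStr, keysUnderRoot, overwriteKeys bool, raw []byte) {
	if !keysUnderRoot {
		fields.Put("log.original", string(raw))
		return
	}
	logVal, _ := fields.GetValue("log")
	_, logIsObject := logVal.(common.MapStr)
	if !overwriteKeys {
		// Overwrite log.original even if present; if "log" is a plain
		// string, remove it first so the nested key can be created.
		if logVal != nil && !logIsObject {
			fields.Delete("log")
		}
		fields.Put("log.original", string(raw))
		return
	}
	// overwrite_keys==true: only write if absent AND "log" is an object.
	if has, _ := fields.HasKey("log.original"); !has && logIsObject {
		fields.Put("log.original", string(raw))
	}
}

func main() {
	fields := common.MapStr{"log": "not an object"}
	putOriginal(fields, true, false, []byte(`{"a": 1}`))
	fmt.Println(fields.StringToPrint())
}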
The new log.original is a metadata field of the event. It should behave the same way as log.offset or log.source does. Both of them are added to the fields first and can be overwritten later if the JSON settings require it. Thus, we have the correct behaviour you have described.
So it is not the responsibility of the JSON reader to add it, especially if later we add this feature for all event formats. It is metadata like log.source and should be added the same way, so users can expect the same well-known behaviour.
👍
Do we have tests for this working with different combinations of configs, so no one breaks the behavior in the future by accident? If not, let us at least create an issue.
I only added test cases with keys_under_root enabled. I am creating an issue to add more tests.
Done: #9273
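For illustration, a self-contained table-driven test over the write/no-write decision from the thread above; shouldWrite is a hypothetical helper, not code from this PR:

package readjson_test

import "testing"

// shouldWrite mirrors the decision table sketched earlier in the thread.
func shouldWrite(keysUnderRoot, overwriteKeys, originalPresent, logIsObject bool) bool {
	if !keysUnderRoot || !overwriteKeys {
		return true // these modes always (over)write log.original
	}
	return !originalPresent && logIsObject
}

func TestShouldWriteOriginal(t *testing.T) {
	cases := []struct {
		keysUnderRoot, overwriteKeys, originalPresent, logIsObject, want bool
	}{
		{false, false, true, false, true},
		{true, false, true, false, true},
		{true, true, false, true, true},
		{true, true, true, true, false},
		{true, true, false, false, false},
	}
	for i, c := range cases {
		got := shouldWrite(c.keysUnderRoot, c.overwriteKeys, c.originalPresent, c.logIsObject)
		if got != c.want {
			t.Errorf("case %d: got %v, want %v", i, got, c.want)
		}
	}
}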
Since we have the rename processor, I would propose removing the keys_under_root option in 7.0 to eliminate such issues. What I haven't tested is whether rename works properly when moving fields under root.
@kvch Great addition! I am pretty sure it will help us debug some use cases :)
Agreed. Also great for security use cases, where you never want to fully trust the nice parsing. You always want to be able to look at the original untouched event to confirm.
I have added a new option to configure the name of the field:

filebeat/tests/system/test_json.py (outdated)
output = self.read_output()
assert len(output) == 1
assert "my_custom_original_message" in output[0]
assert output[0]["my_custom_original_message"].count("\n") == 2
What we have here is what I would expect in event.original. In log.original I would expect that it is already the combined JSON.
I am not sure I understand what you mean by your comment. This is the event produced by the Beat in this test:
{
"@timestamp": "2018-11-22T15:26:02.702Z",
"@metadata": {
"beat": "filebeat",
"type": "doc",
"version": "7.0.0"
},
"log": {
"offset": 0,
"file": {
"path": "/home/n/go/src/github.com/elastic/beats/filebeat/build/system-tests/run/test_json.Test.test_keep_original_multiline/log/json_multiline.log"
},
"flags": [
"multiline"
]
},
"json": {},
"input": {
"type": "log"
},
"host": {
"name": "sleipnir"
},
"agent": {
"type": "filebeat",
"hostname": "sleipnir",
"version": "7.0.0"
},
"message": "[2018-05-05 18:00] line 1\n line 2\n line 3\n",
"my_custom_original_message": "{\"log_line\": \"[2018-05-05 18:00] line 1\"}\n{\"log_line\": \" line 2\"}\n{\"log_line\": \" line 3\"}"
}
Three raw events are joined by a newline.
Do you want to see something else instead?
I was thinking that it should already be a valid JSON object. What you have above I would expect for a multiline event, but not in the case of Docker. I think combining the different JSON objects into one is something that is very effectively done on the edge. It is a modification of the event, but all the data is still "original", ready for processing.
You mean something like this as the value of log.original:
{
{\"log_line\": \"[2018-05-05 18:00] line 1\"},
{\"log_line\": \" line 2\"},
{\"log_line\": \" line 3\"}
}
?
I was more thinking of {\"log_line\": \"[2018-05-05 18:00] line 1\n line 2\n line 3\"}. In this case it's almost identical to the message, but the encoded JSON string is still there.
I do see that this format has its advantages, e.g. no need for multiline parsing during reprocessing. But I am worried about the performance overhead of merging multiple events with numerous (nested) fields. I think I would prefer to have the raw events concatenated under log.original and have them parsed again later IF needed. I would not do any parsing in advance if I do not know for sure that I am going to need it.
My thinking on what should go into log.original so far: everything that is done by the input / readers should be done and then put into log.original. This means the ingest pipeline could in general rely on processing / reprocessing log.original. In the above case I think the multiline part is local knowledge of the input, not of the processing side.
The statement about ingest pipelines relying on log.original has somewhat bigger implications, e.g. we should potentially not do dissect or JSON decoding on the Beats side. But I would ignore the local processors for now and focus on what the input does.
What you currently have for log.original above would, for me, belong in event.original.
Force-pushed from f1abaf8 to a9e9157.
I rebased the branch and dropped the commit which introduced renaming of the field.
LGTM, but the CI intake is failing on this PR.
h.config.DockerJSON.Partial,
h.config.DockerJSON.ForceCRI,
h.config.DockerJSON.CRIFlags,
h.config.DockerJSON.KeepOriginal)
Five params already. Maybe we'd better introduce a configuration/settings struct in the readjson package.
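A hypothetical sketch of such a struct; the field names mirror the call site above, but the exact shape is illustrative:

package readjson

// Config groups the reader's options so callers pass one value
// instead of a growing list of positional parameters.
type Config struct {
	Partial      bool
	ForceCRI     bool
	CRIFlags     bool
	KeepOriginal bool
}

// Callers would then construct the reader with a single value, e.g.
//   New(reader, Config{Partial: true, KeepOriginal: true})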
mlr.message.Original = append(mlr.message.Original, mlr.separator...)
	}
	mlr.message.Original = append(mlr.message.Original, m.Original...)
}
Is there an upper limit on the size of mlr.message.Original? When encoding to JSON, these might become quite big. At worst, events might be dropped due to the increased size.
The multiline reader already ignores new lines once max_bytes has been reached, to keep memory usage at least somewhat under control.
In case we introduce a max_bytes, will we make the content of log.original valid JSON, or would it miss the last part?
It would miss the last part, as is the case for unparsed raw text.
The default limit for a message is already 10MB. We have users setting it to values like 250MB (XML). Do we really want to deal with potentially 500MB events or bigger? I understand the sentiment of really keeping the original, but there is a limit to what can be shipped and what is accepted by the sinks.
I'm OK with dropping, as long as we have a config option which the user can adjust. Also, we should at least log at debug level when we drop things.
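A minimal sketch of such a cap, mirroring how the multiline reader truncates Content once max_bytes is reached; appendCapped is an illustrative name, not this PR's code:

// appendCapped grows dst with src but never beyond maxBytes,
// truncating the final chunk and dropping the remainder.
func appendCapped(dst, src []byte, maxBytes int) []byte {
	space := maxBytes - len(dst)
	if space <= 0 {
		return dst // already full: drop src entirely (log at debug level)
	}
	if len(src) > space {
		src = src[:space] // partial append; the rest is dropped
	}
	return append(dst, src...)
}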
@@ -42,6 +42,11 @@ func (p *StripNewline) Next() (reader.Message, error) {
	L := message.Content
	message.Content = L[:len(L)-lineEndingChars(L)]

	if len(message.Original) > 0 {
		O := message.Original
		message.Original = O[:len(O)-lineEndingChars(O)]
	}
Minor detail, but maybe we don't want to strip the newline here? Right now the multiline reader inserts a separator after this one has been removed => we don't report the original, but some modified contents. We could simply concatenate message.Original if we did not strip newlines here.
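In other words, if StripNewline left Original untouched, the multiline reader could concatenate the raw chunks directly without inserting a separator (illustrative, not this PR's code):

// the trailing newline is still part of m.Original, so plain
// concatenation reproduces the original bytes exactly
mlr.message.Original = append(mlr.message.Original, m.Original...)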
Co-Authored-By: kvch <[email protected]>
Force-pushed from 36c5b74 to 7bcd2ae.
Closing as new processors
A new configuration option named keep_original is added to the docker input and to the json settings of the log input. By default it is set to false.
If the option is set, the raw content is published under the log.original field. The length of the original message is limited by the max_bytes option of the input.
If log.original is part of the incoming event, the field added by Filebeat is overwritten; it behaves the same as log.offset or log.source. (If log.original is part of the event, it is possible to keep both fields by setting json.keys_under_root to false.)
Partial implementation of #8083