fix: customizable cloudwatch batch size when querying aws #10851
Conversation
Why is this approach used rather than just configuring metric_batch_size for the plugin?
This is an input, not an output; metric_batch_size applies to metrics sent to outputs. This batch size is specific to how the plugin gathers data and exists to avoid a limit in AWS.
Oh, didn’t notice that.
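(For context, a minimal sketch of what a plugin-level option could look like in Go; the `CloudWatch` struct, `BatchSize` field, and `batch_size` tag are illustrative names and may not match the option this PR actually adds.)

```go
package main

import "fmt"

// CloudWatch mimics the shape of the input plugin's config struct.
// BatchSize stands in for the new, user-tunable option that replaces
// the previously hard-coded value of 500.
type CloudWatch struct {
	BatchSize int `toml:"batch_size"`
}

// effectiveBatchSize falls back to the old hard-coded limit when the
// option is unset or invalid.
func (c *CloudWatch) effectiveBatchSize() int {
	if c.BatchSize <= 0 {
		return 500
	}
	return c.BatchSize
}

func main() {
	cw := &CloudWatch{BatchSize: 200}
	fmt.Println(cw.effectiveBatchSize()) // prints 200
}
```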
Is there a way to determine a good batch size automatically? Maybe we calculate how big a request will be, ask CloudWatch what the maximum is, and then have our code determine how many queries fit in each request? Adding a setting gives the user a way to fix the problem, but requires them to think more than maybe they should have to.
AWS's GetMetricData docs reference a limit of 500 metrics in a single request. This explains the currently hard-coded value of 500 and is presumably why we have not seen this issue before. The error the user received was a 413. I do not see any payload limit on the CloudWatch service quotas page, nor any other reference to a 413 or too-large error in that document or in their docs on common errors. In the code, during Gather we split the queries into batches and then launch goroutines for each batch. It is not clear to me how we could check that we would avoid the 413 error before launching the goroutines. @reimda, thoughts?
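(A simplified illustration of the Gather flow described above, not the actual telegraf code; `query`, `gatherBatch`, and the counts are placeholders.)

```go
package main

import (
	"fmt"
	"sync"
)

// query stands in for one CloudWatch metric data query.
type query struct{ id string }

// gatherBatch stands in for the per-batch GetMetricData call.
func gatherBatch(batch []query) {
	fmt.Printf("requesting %d metrics\n", len(batch))
}

func main() {
	queries := make([]query, 1234)
	const batchSize = 500 // the hard-coded limit this PR makes configurable

	var wg sync.WaitGroup
	for start := 0; start < len(queries); start += batchSize {
		end := start + batchSize
		if end > len(queries) {
			end = len(queries)
		}
		wg.Add(1)
		// One goroutine per batch, mirroring the flow described above.
		go func(batch []query) {
			defer wg.Done()
			gatherBatch(batch)
		}(queries[start:end])
	}
	wg.Wait()
}
```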
I'm wondering if this error actually comes from AWS or from some in-between proxy?
If there's no way to know the byte limit, maybe we should watch for 413 errors and handle them afterward by splitting the batch in half and trying again?
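(A rough sketch of that idea: halve the batch and retry when a request is rejected as too large. `fetch`, the 400-query threshold, and the error value are placeholders, not real AWS SDK calls.)

```go
package main

import (
	"errors"
	"fmt"
)

// errTooLarge stands in for an HTTP 413 response from the API.
var errTooLarge = errors.New("413 Request Entity Too Large")

// fetch stands in for a GetMetricData call; here it rejects batches of
// more than 400 queries so the splitting path is exercised.
func fetch(batch []string) error {
	if len(batch) > 400 {
		return errTooLarge
	}
	fmt.Printf("fetched %d metrics\n", len(batch))
	return nil
}

// gatherWithSplit retries a too-large batch by halving it until the
// request fits (or a single query still fails).
func gatherWithSplit(batch []string) error {
	if len(batch) == 0 {
		return nil
	}
	err := fetch(batch)
	if err == nil || !errors.Is(err, errTooLarge) || len(batch) == 1 {
		return err
	}
	mid := len(batch) / 2
	if err := gatherWithSplit(batch[:mid]); err != nil {
		return err
	}
	return gatherWithSplit(batch[mid:])
}

func main() {
	// 1000 queries: the first attempt is rejected and the two halves
	// succeed, which is the "3x the requests" cost mentioned below.
	_ = gatherWithSplit(make([]string, 1000))
}
```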
Maybe there is a documented API limit, just not a quota limit. The CloudWatch docs are not the easiest to sift through, but I would expect there to be something about this kind of limit somewhere.
Force-pushed from d6a921d to 4d385ea.
I have continued to go through the web docs, as well as the CloudWatch API Reference and the CloudWatch User Guide, and found no references to this limit. I have pushed another commit that tries the splitting method, but I think I prefer the user setting over this. The original bug report shows this occurring on every single request, which means the user will always be making 3x the requests. This does seem to be a one-off issue, and a setting that tunes the batch size makes more sense to me than something that introduces more complexity and is not easily tested either. @reimda, that said, I'd like to close this one way or another, so please let me know which way you want to go with this.
It looks like there's no way to tune the batch size automatically, so a setting is the next best thing. Would it be useful to have both the splitting and a user setting? That way, if the setting is still too high they will get splitting instead of failures, and if/when they notice the 3x requests they can lower the setting. It does seem like this is a relatively rare situation. I'm also ok with a setting and no splitting if that's what you prefer.
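(A minimal sketch of that combination, building on the earlier halving sketch: the configured batch size drives the normal split, and halving only kicks in when a batch is rejected, so a misconfigured setting degrades to extra requests rather than failures. All names here are illustrative, not the plugin's actual API.)

```go
package main

import "fmt"

// tryFetch stands in for one GetMetricData call; true means the request
// was accepted (here, anything over 400 queries is rejected as too large).
func tryFetch(batch []string) bool { return len(batch) <= 400 }

// gather sends one batch and falls back to halving if it is rejected.
func gather(batch []string) {
	if len(batch) == 0 {
		return
	}
	if tryFetch(batch) {
		fmt.Printf("sent batch of %d\n", len(batch))
		return
	}
	if len(batch) == 1 {
		fmt.Println("giving up on a single oversized query")
		return
	}
	mid := len(batch) / 2
	gather(batch[:mid])
	gather(batch[mid:])
}

func main() {
	queries := make([]string, 1100)
	batchSize := 500 // user-configurable; lower it to avoid the fallback splitting
	for start := 0; start < len(queries); start += batchSize {
		end := start + batchSize
		if end > len(queries) {
			end = len(queries)
		}
		gather(queries[start:end])
	}
}
```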
This reverts commit 963c1ac.
Download PR build artifacts for linux_amd64.tar.gz, darwin_amd64.tar.gz, and windows_amd64.zip.
Looks good to me.
(cherry picked from commit 196abb7)
Fixes: #10842