ddtrace/tracer: possible memory leak with very large traces (50MB+) #971
Comments
The reason … About the issue at hand, it needs more investigation.
I'm probably missing something, but where is the …? (See dd-trace-go/ddtrace/tracer/writer.go, lines 77 to 105 at 63dea35.)
Oh my! Indeed you are right. My bad.
@nik-andreev - I wasn't the only one! #948 (comment)
@gbbr, any possibility of a fix for this?
Yep, sorry for missing your comment on #948. FWIW, our workaround was to find the "large" traces and fix them so that they are smaller.
I'm OK with removing the waitClose call, but I'm not sure why removing that line reduces the memory consumption. It would be nice to understand this better.
Ah no worries! I'm just thankful that you tried something else out too. I was thinking about that, but, to @knusbaum's point, I couldn't figure out why that releases the memory. After all, shouldn't Go's HTTP client implementation call Close on the request body?
I spent some time going through …
@nik-andreev I've seen other spurious reports of the http client leaking, but nothing concrete. Can you tell me which version of Go you are using?
It looks like if the agent responds, but not with parseable JSON, the trace client can forget to close the response body: see dd-trace-go/ddtrace/tracer/sampler.go, lines 121 to 124 at d62beaa.
I'm not sure if this ever happens in practice. Do you ever see the metric at dd-trace-go/ddtrace/tracer/writer.go, line 100 at d62beaa?
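For reference, the pattern being discussed is simply closing the response body regardless of whether the JSON decodes. A minimal sketch, with hypothetical function and field names rather than the library's actual code:

```go
package sketch

import (
	"encoding/json"
	"net/http"
)

// readRatePayload decodes the agent's response and always closes the body,
// even when the JSON cannot be parsed. Function and field names here are
// illustrative, not the library's actual code.
func readRatePayload(resp *http.Response) (map[string]float64, error) {
	defer resp.Body.Close() // runs whether or not decoding succeeds

	var payload struct {
		RateByService map[string]float64 `json:"rate_by_service"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil {
		return nil, err
	}
	return payload.RateByService, nil
}
```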
```
$ go version
go version go1.16.6 linux/amd64
```
I'm curious whether you can reproduce in your environment with the script that I pasted.
Thanks for the response.
Yep, that seems normal. Maybe try to ratchet …
Isn't the issue with the request body not being closed? Based on the memory profile, the memory is allocated in the payload that stores the traces.
@nik-andreev I've been digging into the code, but can't figure out what is holding the reference. Can you confirm?
@shannontan-bolt
Yes, the issue is with the request body not being closed, but we suspect some HTTP machinery is holding onto the request, which in turn holds onto the payload, since the tracer keeps no references to the payload once it is sent. Closing the request incorrectly could potentially cause it to linger.
Calling just …
It seems like …
This makes sense to me since before the …
Datadog v1.31.1
Figured I'd get this running so I can jump into the debugging myself :)
I did a few rounds of debugging myself (sorry if this was already known): …
@shannontan-bolt:
The only reason for it to be there is to follow the documentation and wait for the Close call. I'd prefer keeping it for correctness. But as @knusbaum also said, if we can understand more exactly what's causing the memory leak and that results in us removing this call, I have no problem with that. I don't think it's a good idea to make code changes without understanding what we are doing. Not knowing why this mechanism uses more memory in your specific use case is not a great argument for removing something that is there to avoid a (presently unlikely) race condition.
@nik-andreev: …
I've spent quite a few hours trialing a few solutions and diving into the depths of the standard library to try to understand why this might be happening, so I generally agree that if we can understand exactly why this causes a memory leak, that would be great. However, this is a ticking time bomb in our production systems too: we don't have complete visibility or an audit of all code paths within our systems to know whether they will generate very large traces, and there are no good workarounds. So while I normally agree that we should get to the bottom of it, I can't come up with a good reason why we'd need to wait for the Close call.
I don't mean to be too aggressive here though - I'm just personally worried about this issue, and my numerous hours trialing different solutions have not panned out.
A few additional interesting, but not conclusive, findings. Contrary to @nik-andreev's statement above regarding …
I found that not using a channel in waitClose has solved the memory issue (but no clue why):
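The diff from this comment did not survive extraction. As a rough illustration only - not the actual change, and the struct layout here is a guess - "not using a channel" could look something like replacing a `closed` channel with a `sync.Once`/`sync.WaitGroup` pair:

```go
package sketch

import (
	"bytes"
	"sync"
)

// payload is a stand-in for the tracer's payload type; only the pieces
// relevant to the Close/waitClose interplay are sketched here.
type payload struct {
	buf       bytes.Buffer
	closeOnce sync.Once
	closed    sync.WaitGroup // counted to 1 when the payload is (re)created
}

func newPayload() *payload {
	p := &payload{}
	p.closed.Add(1)
	return p
}

// Close is called by the HTTP transport when it is done with the body.
func (p *payload) Close() error {
	p.closeOnce.Do(p.closed.Done)
	return nil
}

// waitClose blocks until Close has been called at least once.
func (p *payload) waitClose() {
	p.closed.Wait()
}
```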
I'd invite folks to test it and see if this fixes the memory issue (if you can replicate it) using the snippet @nik-andreev provided. Additionally, buffer.Reset not releasing the memory seems consistent with the way bytes.Buffer's Reset works.
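That observation matches the documented behaviour: `bytes.Buffer.Reset` empties the buffer but retains the underlying storage for future writes, so a 50MB backing array stays allocated as long as the buffer itself is reachable. A small standalone demonstration:

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	var buf bytes.Buffer
	buf.Write(make([]byte, 50<<20)) // simulate a ~50MB serialized trace

	buf.Reset() // empties the buffer, but the 50MB backing array is kept
	fmt.Println("cap after Reset:", buf.Cap())

	buf = bytes.Buffer{} // dropping the buffer value releases the array to the GC
	fmt.Println("cap after reassignment:", buf.Cap())
}
```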
I feel very unconfident in this solution, but somehow this "works on my machine".
Issues like this are very difficult to solve without a heap explorer. Nothing in our code is holding the reference, and it's difficult to trace the standard library's HTTP client code to detect what might be holding it. Like @gbbr said, we don't want to start changing documented behavior (waiting for the Close call). However, resetting the payload buffer on …
I'm not sure I understand this rationale - what is the need to wait for the Close? If that were common and/or good practice, wouldn't the standard library take this into account rather than wait for it to complete asynchronously? I also think that a good Close() would consider reading the bytes.Buffer in its entirety, since we seem to hit this problem mostly with 413 (payload-too-large) errors, which might mean that the buffer is never completely read out over the network.
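To make the "read the buffer out in its entirety" idea concrete, a hedged sketch (hypothetical type, not a proposed patch) of a Close that drains and drops whatever the transport never read, for example after a 413:

```go
package sketch

import (
	"bytes"
	"io"
)

// payload is a minimal stand-in for the tracer's payload type.
type payload struct {
	buf bytes.Buffer
}

func (p *payload) Read(b []byte) (int, error) { return p.buf.Read(b) }

// Close drains any bytes the transport did not consume (e.g. when the agent
// rejects the request with 413 before reading the whole body) and then drops
// the backing array so it can be garbage collected.
func (p *payload) Close() error {
	io.Copy(io.Discard, &p.buf)
	p.buf = bytes.Buffer{}
	return nil
}
```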
I think this makes reasonable sense and is a good solution for now. Who makes this decision, or at what point do we consider that a fix for this problem? I mostly don't want this issue to hang around, since it looks like getting to the bottom of it is quite challenging.
Aside from the other short term solution, could someone try this solution out and see if it works? I am aware that it is functionally similar to …
After adding print lines as well, I still could not find any other place that was a sender/receiver for this channel, and yet it seems the Go GC is not comfortable evicting the payload that contains it. If there is a sender/receiver that I'm not aware of, it would make sense - the goroutine is synchronously blocked, which means that the stack (with …
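As a self-contained illustration of the general mechanism being described (nothing dd-trace-go specific): a goroutine blocked on a channel receive keeps everything reachable from its stack alive, including any large buffer captured by the closure.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	big := make([]byte, 50<<20) // stand-in for a ~50MB serialized trace
	done := make(chan struct{})

	go func() {
		<-done     // blocks forever here, since nobody ever closes done
		_ = big[0] // big is referenced after the receive, so it stays live
	}()

	time.Sleep(time.Second)
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapInuse after GC: %d MB\n", m.HeapInuse>>20) // still ~50MB
}
```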
Sorry @gbbr, what documentation are you referring to with this comment?
@shannontan-bolt: From https://github.com/golang/go/blob/go1.16/src/net/http/client.go#L134-L138:
I guess we are not reusing the body anymore, so we can remove the call? What do you think, @knusbaum? Based on the documentation it seems we'd still be fine. The only possible explanation I have (a complete guess) is that the transport (be it the standard library's or one custom to some package in the program) either keeps a longer-than-usual reference to the request or does not close the body. Either of these would cause the initial buffer to keep piling up in memory, or the "wait close" call to stall. However, this is pure speculation.
One further note: the goroutine in which we are sending the payload is not synchronised and has no waiting mechanism or counters. These goroutines could easily grow in number and end up blocking, and we wouldn't know. @shannontan-bolt have you managed to extract a blocking profile? We might be surprised by what we see 🙂 We should likely at least add a statsd counter to these goroutines.
I took the script and added …
But I don't really know what I'm looking for out of the produced files - any help would be appreciated. |
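In case it's useful, this is roughly how the goroutine and block profiles can be produced with the standard library (hypothetical helper, arbitrary file names); the resulting files can then be inspected with `go tool pprof -http=:8080 block.pprof`:

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	// Must be enabled early, before the blocking events we want to sample.
	runtime.SetBlockProfileRate(1)

	// ... run the reproduction workload here ...
	time.Sleep(30 * time.Second)

	dump("goroutine", "goroutine.pprof")
	dump("block", "block.pprof")
}

// dump writes the named runtime profile to a file.
func dump(profile, path string) {
	f, err := os.Create(path)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := pprof.Lookup(profile).WriteTo(f, 0); err != nil {
		panic(err)
	}
}
```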
@shannontan-bolt thanks for doing that! Could you save it to a file and send it to us? You can use any support means you prefer from here: https://www.datadoghq.com/support/. Simply attach the file, mention this issue and myself and/or Kyle, and it should reach us. In the meantime, the PR stays up and will likely resolve the issue based on what I'm reading in this thread. Additionally, is there any chance you could try out the PR in your program and confirm that it resolves your problem?
@gbbr, FWIW the PR resolves the problem in my test environment. Thanks for the speedy response & PR!
Excellent! Thanks for your patience and support with this!
Yeah, thanks everyone for the assistance and patience on this. When we get this merged I'll put out a patch release, since this seems to be a serious issue for multiple users.
I've looked at the profile, here's an overview of what's in it: …
I think all of these are fine and active goroutines in the system, which should be blocking at times. Then, we have a few small ones: …
While most of the stuff on the profile makes sense to show up, these two also made it to the top, so I'm sure that the removal of …
Since …
What
Using the v1.31.1 version of `dd-trace-go` I sometimes see leaked serialized trace content consuming very large amounts of memory.
Replication
The issue can be demonstrated by running this little script, which constructs an artificially bloated trace:
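The original snippet is missing from this copy of the issue; as an illustrative reconstruction only (not the author's exact script), something along these lines produces a single artificially bloated trace and then reports heap usage:

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"time"

	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

func main() {
	tracer.Start()
	defer tracer.Stop()

	bigTag := strings.Repeat("x", 1<<20) // ~1MB per tag value
	span := tracer.StartSpan("bloated.operation")
	for i := 0; i < 100; i++ { // ~100MB of tag data in a single span
		span.SetTag(fmt.Sprintf("key.%d", i), bigTag)
	}
	span.Finish()

	// Give the tracer time to flush, then observe heap usage.
	time.Sleep(10 * time.Second)
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapInuse: %d MB\n", m.HeapInuse>>20)
}
```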
The resulting output with a large enough trace looks like this: …
Some experiments
Commenting out https://github.com/DataDog/dd-trace-go/blob/v1/ddtrace/tracer/transport.go#L168 seems to solve the problem. I'm not entirely sure why, but it seems like the default http client/transport can hold a reference to the `payload` if the server (datadog-agent) responds before the payload is `Close`d.
A client-side workaround appears to be to periodically call `tracer.Start(WithHTTPClient())`, which seems to recycle the cached `httpClient`/transport and the related traces.
It seems like `payload.Close` and `payload.waitClose` were introduced in #475 in order to fix a race condition which doesn't occur anymore since the `payload` object is no longer reused thanks to #549. Maybe #475 could be safely reverted now?
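For completeness, a hedged sketch of the client-side workaround described above (the interval and timeout are arbitrary); each `tracer.Start` call replaces the running tracer with a fresh one:

```go
package main

import (
	"net/http"
	"time"

	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

func main() {
	tracer.Start(tracer.WithHTTPClient(&http.Client{Timeout: 10 * time.Second}))
	defer tracer.Stop()

	// Periodically restart the tracer with a fresh HTTP client so that any
	// transport still referencing a large payload is dropped.
	go func() {
		for range time.Tick(10 * time.Minute) {
			tracer.Start(tracer.WithHTTPClient(&http.Client{Timeout: 10 * time.Second}))
		}
	}()

	select {} // stand-in for the application's real work
}
```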