
ddtrace/tracer: possible memory leak with very large traces (50MB+) #971

Closed
nik-andreev opened this issue Jul 30, 2021 · 40 comments · Fixed by #976
Labels: bug (unintended behavior that has to be fixed), core

Comments

nik-andreev commented Jul 30, 2021

What

Using the v1.31.1 version of dd-trace-go I sometimes see leaked serialized trace content consuming very large amounts of memory:

[screenshot of memory profile omitted]

Replication
The issue can be demonstrated by running this little script, which constructs an artificially bloated trace:

package main

import (
	"context"
	"fmt"
	"os"
	"runtime"
	"time"

	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

func main() {
	os.Setenv("DD_TRACE_STARTUP_LOGS", "false")
	tracer.Start()

	go printHeapSize()

	ctx := context.Background()
	largeTag := make([]byte, 1000*1000)

	//times, _ := strconv.Atoi(os.Args[1])
	times := 40 // large enough to crash it, 30 is okay?
	span, _ := tracer.StartSpanFromContext(ctx, "operation")
	for i := 0; i < times; i++ {
		span.SetTag(fmt.Sprintf("tag %d", i), largeTag)
	}
	span.Finish()

	fmt.Println("sleeping")
	time.Sleep(5 * time.Minute)
}

func printHeapSize() {
	var stats runtime.MemStats
	for {
		runtime.GC()
		runtime.ReadMemStats(&stats)
		fmt.Println("Heap Size:", stats.HeapAlloc)
		time.Sleep(10 * time.Second)
	}
}

The resulting output with a large enough trace looks like this:

$ go run main.go
Heap Size: 7018216
sleeping
Heap Size: 131870176
Heap Size: 131842760
Heap Size: 131842752
Heap Size: 131842752
Heap Size: 131842752
Heap Size: 131842760
2021/07/30 11:25:45 Datadog Tracer v1.31.1 ERROR: lost 1 traces: payload-too-large
 (Status: Request Entity Too Large) (occurred: 30 Jul 21 11:24 UTC)
Heap Size: 131842888
Heap Size: 131842888
Heap Size: 131842768
Heap Size: 131842768
Heap Size: 131842768
Heap Size: 131842768
Heap Size: 131842768
Heap Size: 131842768
Heap Size: 131842768
Heap Size: 131842768
Heap Size: 131842768

Some experiments
Commenting out https://github.com/DataDog/dd-trace-go/blob/v1/ddtrace/tracer/transport.go#L168 seems to solve the problem. I'm not entirely sure why, but it seems like the default HTTP client/transport can hold a reference to the payload if the server (datadog-agent) responds before the payload is closed.

A client-side workaround appears to be to periodically call tracer.Start(WithHTTPClient(...)), which seems to recycle the cached HTTP client/transport and the traces it still references.

It seems like payload.Close and payload.waitClose were introduced in #475 in order to fix a race condition which doesn't occur anymore since the payload object is no longer reused thanks to #549. Maybe #475 could be safely reverted now?
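
For reference, a rough sketch of the periodic-restart workaround mentioned above; the package name, interval, and client timeout are arbitrary placeholders, not recommendations:

package example

import (
	"net/http"
	"time"

	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

// recycleTracerClient periodically restarts the tracer with a fresh
// *http.Client so the previous transport, and anything it may still
// reference, becomes collectable.
func recycleTracerClient(interval time.Duration) {
	for range time.Tick(interval) {
		tracer.Start(tracer.WithHTTPClient(&http.Client{Timeout: 10 * time.Second}))
	}
}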


gbbr commented Aug 2, 2021

The reason waitClose is there is because the net/http transport may continue reading from the request body even after (*http.Client).Do returns, causing a race condition when the payload is reused by someone else from the pool. That race is still possible and will always be possible as long as the pool is there. This is described clearly and accurately in the PR #549.

About the issue at hand, it needs more investigation.
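
As a minimal sketch of the pattern being discussed (wait for the transport's Close before reusing the buffer), assuming a wrapper type and helper that are not part of dd-trace-go:

package example

import (
	"bytes"
	"io"
	"net/http"
)

// closeNotifyingBody wraps a request body and signals once the transport
// has called Close on it.
type closeNotifyingBody struct {
	io.Reader
	closed chan struct{}
}

func (b *closeNotifyingBody) Close() error {
	select {
	case b.closed <- struct{}{}:
	default: // ignore repeated Close calls
	}
	return nil
}

// sendAndWait illustrates why waitClose exists: (*http.Client).Do can return
// while the transport is still reading the body, so we block on the Close
// signal before buf may safely be reused (e.g. returned to a pool).
func sendAndWait(c *http.Client, url string, buf *bytes.Buffer) error {
	body := &closeNotifyingBody{Reader: buf, closed: make(chan struct{}, 1)}
	req, err := http.NewRequest("POST", url, body)
	if err != nil {
		return err
	}
	resp, err := c.Do(req)
	if err == nil {
		resp.Body.Close()
	}
	<-body.closed // the transport is done with buf only after this point
	return err
}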


nik-andreev commented Aug 2, 2021

The reason waitClose is there is because the net/http transport may continue reading from the request body even after (*http.Client).Do returns, causing a race condition when the payload is reused by someone else from the pool. That race is still possible and will always be possible as long as the pool is there. This is described clearly and accurately in the PR #549.

About the issue at hand, it needs more investigation.

I'm probably missing something, but where is the payload reused? It seems like it is discarded and replaced with a fresh object the moment it is scheduled for flushing:

// flush will push any currently buffered traces to the server.
func (h *agentTraceWriter) flush() {
	if h.payload.itemCount() == 0 {
		return
	}
	h.wg.Add(1)
	h.climit <- struct{}{}
	go func(p *payload) {
		defer func(start time.Time) {
			<-h.climit
			h.wg.Done()
			h.config.statsd.Timing("datadog.tracer.flush_duration", time.Since(start), nil, 1)
		}(time.Now())
		size, count := p.size(), p.itemCount()
		log.Debug("Sending payload: size: %d traces: %d\n", size, count)
		rc, err := h.config.transport.send(p)
		if err != nil {
			h.config.statsd.Count("datadog.tracer.traces_dropped", int64(count), []string{"reason:send_failed"}, 1)
			log.Error("lost %d traces: %v", count, err)
		} else {
			h.config.statsd.Count("datadog.tracer.flush_bytes", int64(size), nil, 1)
			h.config.statsd.Count("datadog.tracer.flush_traces", int64(count), nil, 1)
			if err := h.prioritySampling.readRatesJSON(rc); err != nil {
				h.config.statsd.Incr("datadog.tracer.decode_error", nil, 1)
			}
		}
	}(h.payload)
	h.payload = newPayload()
}


gbbr commented Aug 2, 2021

Oh my! Indeed you are right. My bad.


sjmtan commented Aug 4, 2021

@nik-andreev - I wasn't the only one! #948 (comment)


sjmtan commented Aug 4, 2021

@gbbr, any possibility of a fix for this?

@nik-andreev
Author

@nik-andreev - I wasn't the only one! #948 (comment)

Yep, sorry for missing your comment on #948. FWIW, our workaround was to find the "large" traces and fix them so that they are smaller.


knusbaum commented Aug 4, 2021

I'm OK with removing the p.waitClose() since it looks like it works around an issue we don't have anymore.

I'm not sure why removing that line reduces the memory consumption. waitClose was just there to prevent reusing the payload until the transport was done with it. Even though we don't reuse the payload, the transport will still hang onto the payload as long as it wants, keeping it from being collected. As soon as the transport is finished and the payload is closed, it should be collectable as garbage.

It would be nice to understand this better.
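
One way to test that directly is a finalizer-based reachability check. A rough, timing-sensitive sketch, with a local httptest server standing in for the agent's 413 response and a made-up body type (this is not dd-trace-go code):

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"runtime"
	"time"
)

// bigBody is a plain io.Reader (deliberately not *bytes.Buffer), so net/http
// cannot snapshot it via GetBody; any retained reference keeps the slice alive.
type bigBody struct{ r *bytes.Reader }

func (b *bigBody) Read(p []byte) (int, error) { return b.r.Read(p) }

func main() {
	// Reject the request immediately, roughly mimicking the agent's 413.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "too large", http.StatusRequestEntityTooLarge)
	}))
	defer srv.Close()

	body := &bigBody{r: bytes.NewReader(make([]byte, 50<<20))} // ~50MB
	collected := make(chan struct{})
	runtime.SetFinalizer(body, func(*bigBody) { close(collected) })

	req, _ := http.NewRequest("POST", srv.URL, body)
	resp, err := http.DefaultClient.Do(req)
	if err == nil {
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	req, body = nil, nil // drop our own references; only the transport's remain

	for i := 0; i < 10; i++ {
		runtime.GC()
		select {
		case <-collected:
			fmt.Println("request body was collected")
			return
		case <-time.After(500 * time.Millisecond):
		}
	}
	fmt.Println("request body still reachable: something is holding a reference")
}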


sjmtan commented Aug 4, 2021

@nik-andreev - I wasn't the only one! #948 (comment)

Yep, sorry for missing your comment on #948. FWIW, our workaround was to find the "large" traces and fix them so that they are smaller.

Ah no worries! I'm just thankful that you tried something else out too. I was thinking about that, but per @knusbaum's comment, I couldn't figure out why that releases the memory. After all, Go's HTTP client implementation should call Close on the request body, shouldn't it?

@nik-andreev
Author

I spent some time going through net/http but also couldn't spot the bug. Using the script from the description and some tweaks, I found that it doesn't really matter what is in p.waitClose: as long as some method on p is called there, it hits an edge case where some of the transport internals keep a reference to the payload. This happens only if the request is too large and httpClient.Do returns before the payload content has been fully read.


knusbaum commented Aug 4, 2021

@nik-andreev I've seen other spurious reports of the http client leaking, but nothing concrete.

Can you tell me which version of Go you are using?


knusbaum commented Aug 4, 2021

It looks like if the agent responds, but not with parseable JSON, the trace client can forget to close the response body:

if err := json.NewDecoder(rc).Decode(&payload); err != nil {
	return err
}
rc.Close()

I'm not sure if this ever happens in practice. Do you ever see the metric datadog.tracer.decode_error > 0?

h.config.statsd.Incr("datadog.tracer.decode_error", nil, 1)
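
For what it's worth, the usual fix for that pattern is to hoist the Close into a defer; a sketch (the struct below is a stand-in, not the tracer's real rates type):

package example

import (
	"encoding/json"
	"io"
)

// readRates decodes the agent's rates response and always closes the body,
// even when decoding fails.
func readRates(rc io.ReadCloser) error {
	defer rc.Close()
	var payload struct {
		RateByService map[string]float64 `json:"rate_by_service"`
	}
	return json.NewDecoder(rc).Decode(&payload)
}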

@nik-andreev
Author

@nik-andreev I've seen other spurious reports of the http client leaking, but nothing concrete.

Can you tell me which version of Go you are using?

$ go version
go version go1.16.6 linux/amd64

I'm curious whether you can reproduce in your environment with the script that I pasted.


nik-andreev commented Aug 4, 2021

I'm not sure if this ever happens in practice. Do you ever see the metric datadog.tracer.decode_error > 0?

Nope, don't have this one:
[screenshot of metrics query omitted]


knusbaum commented Aug 4, 2021

Thanks for the response.
Running your test locally, I get this, which does not look problematic to me:

go version go1.16.6 darwin/amd64
...
Heap Size: 5948512
sleeping
Heap Size: 4879208
Heap Size: 4874184
Heap Size: 4874200
Heap Size: 4874296
Heap Size: 4874312
Heap Size: 4874312
2021/08/04 16:32:54 Datadog Tracer v1.31.1 ERROR: lost 1 traces: payload-too-large
 (Status: Request Entity Too Large) (occurred: 04 Aug 21 16:31 CDT)
Heap Size: 4879376
Heap Size: 4879312
Heap Size: 4879312
Heap Size: 4879312
Heap Size: 4879312
Heap Size: 4879600
Heap Size: 4879896
Heap Size: 4879992
Heap Size: 4879992
Heap Size: 4879992
Heap Size: 4879992
Heap Size: 4879992
Heap Size: 4879992
Heap Size: 4879992
Heap Size: 4879992

@nik-andreev
Author

Thanks for the response.
Running your test locally, I get this, which does not look problematic to me:
[output quoted above]

Yep, that seems normal. Maybe try increasing times so that the payload size and the corresponding delay are larger? I saw some variability there (I was running the test on a shared machine).


sjmtan commented Aug 4, 2021

It looks like if the agent responds, but not with parseable JSON, the trace client can forget to close the response body: [snippet quoted above]

I'm not sure if this ever happens in practice. Do you ever see the metric datadog.tracer.decode_error > 0?

Isn't the issue with the request body not being closed? Based on the memory profile, the memory is allocated in the payload that stores the traces.


knusbaum commented Aug 5, 2021

@nik-andreev
I tried increasing times to 80 and see hundreds of MB of memory just sitting there.

I've been digging in the code, but can't figure out what is holding the reference.
Adding a p.reset() to the case here seems to fix it, and looks like the cleanest solution to me:
https://github.com/DataDog/dd-trace-go/blob/v1/ddtrace/tracer/payload.go#L125

Can you confirm?


knusbaum commented Aug 5, 2021

@shannontan-bolt

Isn't the issue with the request body not being closed? Based on the memory profile, the allocation of the memory is in the payload that stores traces.

Yes, the issue is with the request body not being closed, but we suspect some HTTP machinery is holding onto the request, which in turn holds onto the payload, since the tracer keeps no references to the payload once it has been sent. Closing the request incorrectly could cause it to linger.


nik-andreev commented Aug 5, 2021

@nik-andreev
I tried increasing times to 80 and see hundreds of MB of memory just sitting there.

I've been digging in the code, but can't figure out what is holding the reference.
Adding a p.reset() to the case here seems to fix it, and looks like the cleanest solution to me:
https://github.com/DataDog/dd-trace-go/blob/v1/ddtrace/tracer/payload.go#L125

Can you confirm?

Calling just p.reset() didn't help but manually resetting the buffer seems to work for me:

// Close implements io.Closer
func (p *payload) Close() error {
	select {
	case p.closed <- struct{}{}:
		p.buf = bytes.Buffer{}
	default:
		// ignore subsequent Close calls
	}
	return nil
}

It seems like Reset on the buffer is not enough:

Reset resets the buffer to be empty, but it retains the underlying storage for use by future writes. Reset is the same as Truncate(0).

This makes sense to me: before, the payload was reused, so keeping the storage around was okay. The payload is not reused anymore, so the reset could be more aggressive.
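
A tiny standalone check of that difference between Reset and assigning a fresh bytes.Buffer:

package main

import (
	"bytes"
	"fmt"
)

func main() {
	var b bytes.Buffer
	b.Write(make([]byte, 1<<20)) // grow the buffer to ~1MB

	b.Reset()
	fmt.Println("cap after Reset:", b.Cap()) // storage is retained

	b = bytes.Buffer{}
	fmt.Println("cap after zero value:", b.Cap()) // 0: storage released to the GC
}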


sjmtan commented Aug 6, 2021

Heap Size: 5955048
sleeping
Heap Size: 130948136
Heap Size: 130936176
Heap Size: 130936112
Heap Size: 130936136
Heap Size: 130936232
Heap Size: 130936232
2021/08/06 09:35:13 Datadog Tracer v1.31.1 ERROR: lost 1 traces: payload-too-large
 (Status: Request Entity Too Large) (occurred: 06 Aug 21 09:34 EDT)
Heap Size: 130940848
Heap Size: 130940592
Heap Size: 130940592
Heap Size: 130940480
Heap Size: 130940576
Heap Size: 130940576
Heap Size: 130940680
Heap Size: 130940680
Heap Size: 130940680
Heap Size: 130940776
Heap Size: 130940784
Heap Size: 130940784
Heap Size: 130940792
Heap Size: 130940792
Heap Size: 130940792
Heap Size: 130940888
Heap Size: 130940888
Heap Size: 130940888
Heap Size: 130940888
Heap Size: 130940888
Heap Size: 130940888
Heap Size: 130940984
Heap Size: 130941080
Heap Size: 130941080

Datadog v1.31.1


sjmtan commented Aug 6, 2021

Figured I'd get this running so I can jump into the debugging myself :)


sjmtan commented Aug 6, 2021

I did a few rounds of debugging myself (sorry if this was already known):

  • I found that closing channels didn't help
  • waitClose does actually wait for the channel, and Close is run asynchronously from client.Do(...)
  • the only other usage (I see) of the channel is for reset, which is only used within tests. Is there a strong case for why we can't remove it?
  • Also confirmed that making waitClose just empty (no channel consumption) removes the memory leak
  • Also tried bumping the channel to unbuffered, and that did not solve the problem (since we only care about the first occurrence)
Heap Size: 6016944
sleeping
Sending payload: size: 80000829 traces: 1
waitClose
close
close done
waitClose done
flush done
Heap Size: 130948232
Heap Size: 130936296


gbbr commented Aug 9, 2021

@shannontan-bolt:

Is there a strong case for why we can't remove it?

The only reason for it to be there is to follow documentation and wait for the Close call. I'd prefer keeping it for correctness. But as @knusbaum also said, if we can understand more exactly what's causing the memory leak and that results in us removing this call, I have no problem with that.

I don't think it's a good idea to make code changes without understanding what we are doing. Not knowing why this mechanism uses more memory in your specific use case is not a great argument for removing something that is there to avoid a (presently unlikely) race condition.

@nik-andreev:
Perhaps the reference to this object is kept in memory longer by the HTTP transport implementation and that's why resetting the buffer releases the memory back to the GC. In which case, doing that on Close would make sense. But we'd need proof that's the case.


sjmtan commented Aug 9, 2021

I've spent quite a few hours trialing a few solutions and diving into the depths of the standard library to try to understand why this might be happening - so I generally agree that if we can understand exactly why this causes a memory leak, it'd be great to do.

However, this is a ticking time bomb in our production systems too - we don't have complete visibility or an audit of all code paths within our systems to know whether they will generate very large traces either. And there are no good workarounds.

So while I normally agree that we should get to the bottom of it, I can't come up with a good reason why we'd need to wait for the Close function to complete, and at some point, defusing a ticking time bomb ought to be worth trading away some certainty?

I don't mean to be too aggressive here - I'm just personally worried about this issue, and my numerous hours trialing different solutions to the problem have not paid off.


sjmtan commented Aug 9, 2021

A few additional interesting, but not conclusive, findings. Contrary to @nik-andreev's statement above:

it doesn't really matter what is in p.waitClose - as long as another function is called p.anything it hits some edge case where some of the transport internals need a reference to payload.

I found that not using a channel in waitClose solves the memory issue (though I have no clue why):

// Close implements io.Closer
func (p *payload) Close() error {
	p.closedBool = true

	//p.closed <- struct{}{}
	//close(p.closed)
	return nil
}

// waitClose blocks until the first Close call occurs since the payload
// was constructed or the last reset happened.
func (p *payload) waitClose() string {
	//<-p.closed
	//print(p.buf.Len())
	//return ""

	for {
		if p.closedBool {
			fmt.Println("boom")
			return ""
		}
	}
}

I'd invite folks to test it with the snippet @nik-andreev provided and see if this fixes the memory issue (if you can replicate it). Additionally, buffer.Reset not helping is consistent with how bytes.Buffer's Reset is documented to work:

// Reset resets the buffer to be empty,
// but it retains the underlying storage for use by future writes.
// Reset is the same as Truncate(0).
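
As an aside, polling a plain bool from another goroutine is a data race (go test -race would flag it), which could itself explain odd behaviour; a race-free variant of the flag experiment would use sync/atomic. The type and field names here are illustrative, not the real payload struct:

package example

import (
	"runtime"
	"sync/atomic"
)

// payloadFlag is an illustrative stand-in for the real payload type.
type payloadFlag struct {
	closed uint32
}

// Close marks the payload as closed without using a channel.
func (p *payloadFlag) Close() error {
	atomic.StoreUint32(&p.closed, 1)
	return nil
}

// waitClose spins (yielding to the scheduler) until Close has been called.
func (p *payloadFlag) waitClose() {
	for atomic.LoadUint32(&p.closed) == 0 {
		runtime.Gosched()
	}
}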


sjmtan commented Aug 9, 2021

I feel very unconfident in this solution, but somehow this "works on my machine".

// waitClose blocks until the first Close call occurs since the payload
// was constructed or the last reset happened.
func (p *payload) waitClose() string {
	for {
		select {
		case <-p.closed:
			return ""
		default:
		}
	}
}

@knusbaum
Contributor

Issues like this are very difficult to solve without a heap explorer. Nothing in our code is holding the reference, and it's difficult to trace the standard library's http client code to detect what might be holding the reference.

Like @gbbr said, we don't want to start changing documented behavior (waiting for the payload to be closed) without knowing why it affects the issue.

However, resetting the payload buffer on Close is a sensible implementation of payload that doesn't violate any existing behavior, and we have an understanding of why this change would improve the memory issue. Because of that, I would prefer to pursue that avenue rather than the waitClose avenue.


sjmtan commented Aug 10, 2021

start changing documented behavior (waiting for the payload to be closed) without knowing why it affects the issue.

I'm not sure I understand this rationale - what is the need to wait for the Close? If that were common and/or good practice, wouldn't the standard library take this into account rather than completing it asynchronously? I also think that a good Close() would consider reading the bytes.Buffer out in its entirety, since we seem to hit this problem mostly with 413 (payload-too-large) errors, which might mean that the buffer is never completely read out over the network.

resetting the payload buffer on Close is a sensible implementation of payload that doesn't violate any existing behavior, and we have an understanding of why this change would improve the memory issue.

I think this makes reasonable sense and is a good solution for now. Who makes this decision, and at what point do we consider that a fix for this problem? I mostly don't want this issue to hang around, since it looks like getting to the bottom of it is quite challenging.

I feel very unconfident in this solution, but somehow this "works on my machine".

// waitClose blocks until the first Close call occurs since the payload
// was constructed or the last reset happened.
func (p *payload) waitClose() string {
	for {
		select {
		case <-p.closed:
			return ""
		default:
		}
	}
}

Aside from the other short-term solution, could someone try this solution out and see if it works? I am aware that this is functionally similar to <-p.closed, but my memory reporting shows wildly different statistics with this (as in, successful garbage collection). One possibility I explored was creating a separate channel within httpTransport and passing it in as p.closed rather than constructing it within payload. That still did not solve it without the for { select { ... } } drain of the channel.

After adding print lines as well, I still could not find any other sender/receiver for this channel, and yet it seems the Go GC is not comfortable evicting the payload that contains it. If there is a sender/receiver that I'm not aware of, it would make sense: the goroutine is synchronously blocked, which means its stack (with p included) would be parked on the channel's sendq or recvq, and that would prevent eviction.
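
A small standalone demonstration of that last point, i.e. that a goroutine parked on a channel keeps everything reachable from its stack alive:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	big := make([]byte, 50<<20) // ~50MB
	done := make(chan struct{})

	go func(b []byte) {
		<-done // parked on the channel; b stays reachable the whole time
		fmt.Println("released", len(b), "bytes")
	}(big)
	big = nil

	var m runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&m)
	fmt.Println("heap while the goroutine is parked:", m.HeapAlloc)

	close(done)
	time.Sleep(100 * time.Millisecond) // let the goroutine exit
	runtime.GC()
	runtime.ReadMemStats(&m)
	fmt.Println("heap after the goroutine exits:", m.HeapAlloc)
}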


sjmtan commented Aug 10, 2021

follow documentation and wait for the Close call. I'd prefer keeping it for correctness.

Sorry @gbbr, what documentation are you referring to with this comment?


gbbr commented Aug 10, 2021

@shannontan-bolt:

From https://github.com/golang/go/blob/go1.16/src/net/http/client.go#L134-L138:

// [...] callers wanting to reuse the body for subsequent requests
// must arrange to wait for the Close call before doing so.

I guess we are not reusing the body anymore, so we can remove the call? What do you think @knusbaum? Based on the documentation it seems we'd still be fine.

The only possible (complete guess) explanation I have is that the transport (be it the standard library or one custom to some package in the program) either keeps a longer-than-usual reference to the request or does not close the body. Either of these would cause the initial buffer to keep piling up in memory, or the "wait close" call to stall. However, this is pure speculation.

@gbbr gbbr changed the title Possible memory leak with very large traces (50MB+) ddtrace/tracer: possible memory leak with very large traces (50MB+) Aug 10, 2021
@gbbr gbbr added bug unintended behavior that has to be fixed core labels Aug 10, 2021

gbbr commented Aug 10, 2021

I created a proposal PR (#976 ☝🏻). If you could, please try it out and let me know whether that solution works for you. In the meantime, let's wait to hear @knusbaum's thoughts on it too.


gbbr commented Aug 10, 2021

One further note: the goroutine in which we send the payload is not synchronised and has no waiting mechanism or counters. These goroutines could easily grow in number and end up blocking, and we wouldn't know. @shannontan-bolt, have you managed to extract a blocking profile? We might be surprised by what we see 🙂

We should likely add at least a statsd counter to these goroutines.
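
A sketch of what such a counter could look like; the interface, metric name, and wrapper here are made up for illustration and are not the tracer's actual code:

package example

import "sync/atomic"

// statsdGauger is the minimal interface assumed here; the datadog-go client's
// Gauge method has this signature.
type statsdGauger interface {
	Gauge(name string, value float64, tags []string, rate float64) error
}

var inflightFlushes int64

// trackedFlush wraps a flush call with an in-flight gauge so a growing number
// of blocked flush goroutines would show up in the metrics.
func trackedFlush(s statsdGauger, flush func()) {
	s.Gauge("datadog.tracer.flush_inflight", float64(atomic.AddInt64(&inflightFlushes, 1)), nil, 1)
	defer func() {
		s.Gauge("datadog.tracer.flush_inflight", float64(atomic.AddInt64(&inflightFlushes, -1)), nil, 1)
	}()
	flush()
}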


sjmtan commented Aug 10, 2021

I took the script and added the following (this also needs the "log" and "net/http" imports, plus the blank import _ "net/http/pprof" to register the debug endpoints):

	go func() { log.Fatal(http.ListenAndServe(":4000", nil)) }()
	runtime.SetBlockProfileRate(1)

and then ran:

go tool pprof http://localhost:4000/debug/pprof/block\?seconds\=30

But I don't really know what I'm looking for out of the produced files - any help would be appreciated.
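
One option, assuming the goal is a file that can be shared, is to write the profile from inside the program with runtime/pprof instead of fetching it over HTTP; a sketch:

package example

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// dumpBlockProfile enables block profiling, waits while the program does its
// work, then writes the profile to path so it can be attached to a ticket or
// opened later with go tool pprof.
func dumpBlockProfile(path string, wait time.Duration) {
	runtime.SetBlockProfileRate(1)
	time.Sleep(wait)

	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.Lookup("block").WriteTo(f, 0); err != nil {
		log.Fatal(err)
	}
}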


gbbr commented Aug 11, 2021

@shannontan-bolt thanks for doing that! Could you save it to a file and send it to us? You can use any support means you prefer from here: https://www.datadoghq.com/support/. Simply attach the file, mention this issue and myself and/or Kyle and it should reach us.

In the meantime, the PR stays up and will likely resolve the issue based on what I'm reading in this thread. Additionally, is there any chance you could try out the PR in your program and confirm that it resolves your problem?

@nik-andreev
Author

@gbbr, FWIW the PR resolves the problem in my test environment. Thanks for the speedy response & PR!


sjmtan commented Aug 11, 2021

Yep - it fixes the problem. I'll submit the file via support.

Thanks @gbbr and @knusbaum!


gbbr commented Aug 11, 2021

Excellent! Thanks for your patience and support with this!

@knusbaum
Contributor

Yeah, thanks everyone for the assistance and patience on this. When we get this merged I'll put out a patch release since this seems to be a serious issue for multiple users.


gbbr commented Aug 12, 2021

I've looked at the profile, here's an overview of what's in it:

  • 32% Sending dogstatsd metrics
  • 17% Serving delta profiles by the profiler
  • 15% Main worker routine
  • 11% Concentrator/APM stats flushing - this should arguably be off when disabled, although probably no impact
  • 11% Health metrics goroutine

I think all of these are fine and active goroutines in the system, which should be blocking at times. Then, we have a few small ones:

  • 0.19% payload flushing goroutine
  • 0.29% waitClose

While most of what shows up on the profile makes sense, these two also made it to the top, so I'm fairly sure that removing waitClose should provide further improvement beyond fixing this leak.

@gbbr gbbr closed this as completed in #976 Aug 13, 2021
@knusbaum
Contributor

Since v1.33.0 has just been tagged and released, I don't see a need to put up patch releases for previous versions.
