Fiber spikes from normal query observers can permanently hose Node process #7747
When using Promises, a callback passed to .then() will sometimes pass through multiple .then() calls. If we aren't careful, we may end up wrapping a callback multiple times, resulting in a callback that creates a fiber which immediately creates another fiber and so on. This is especially bad because in this chain situation of fibers-creating-fibers, none of the fibers complete before others are created, thus they force expansion of the fiber pool. This, in turn, runs into a v8 bug in which each fiber created makes fiber-switching permanently slower, due to use of a linked list. See: meteor/meteor#7747 This change simply adds a field to the wrapped callbacks that marks them as not needing further wrapping. Fixes meteor#10
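Roughly, the idea of that patch is to mark a callback the first time it gets wrapped so that passing it through several `.then()` calls never wraps it again. A minimal sketch of that idea follows; this is not the actual meteor-promise code, and the `_alreadyFiberWrapped` field and `wrapCallbackInFiber` name are made up for illustration:

```js
const Fiber = require("fibers");

// Hypothetical illustration of the meteor/promise#11 idea, not the real patch.
function wrapCallbackInFiber(callback) {
  // If this callback was already wrapped, reuse it rather than wrapping again,
  // which would otherwise create fibers that immediately create more fibers.
  if (callback._alreadyFiberWrapped) return callback;

  const wrapped = function (...args) {
    return Fiber(() => callback(...args)).run();
  };
  wrapped._alreadyFiberWrapped = true;
  return wrapped;
}
```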
This patch to meteor-promise fixes our problems: I would appreciate it if this could either rapidly find its way into a Meteor point release, or if you could instruct me on how to force Meteor to use this patch in our build. This bug is causing repeated production outages, and although I have hotfixed it for now by editing the code in place on the server, that's obviously not a sustainable solution. Thanks!
Currently there is no direct way to use this patch (maybe), since the Meteor package directly depends on that package.
I've run our app locally just to inspect the fiber count, and after logging in as two clients and loading the data via our subscriptions, I was up to a fiber count of 81. I don't know what our count looks like in production yet, but based on my initial local testing, I believe it's very likely over 120. This is a very interesting find, as we've struggled with memory leaks and CPU cycling for a while. I know that Fibers was updated from 1.0.1 to 1.0.5 in Meteor 1.0.4. As a side note, I was not, however, able to reproduce the issue with the repro. I was alternating the setting and unsetting, but perhaps not often enough?
Unfortunately, the repro is inconsistent on my end as well. It's somehow related to load, so I'd suggest trying to up the number of records.
I think I can reliably reproduce the situation with https://github.com/jparyani/meteor-fiber-repro
This works like a charm every time for me. The takeaway seems to be that this bug may be occurring around pub/subs. You could try replacing the Widgets collection with a fake custom collection without Mongo backing to try to eliminate more suspects.
I applied the meteor/promise#11 solution but I can still create a fiber spike on demand.
I can confirm that the repro still exhibits this behavior with [email protected].
Fixes a bug where .then callbacks could be wrapped multiple times: meteor/promise#11
@thorjarhun Indeed, meteor/promise#11 is not expected to fix this specific way of creating fiber spikes. It did, however, fix a different cause of fiber spikes, which turned out to be the one we were actually facing in production.
This is a very severe bug.
While we think resolving this issue would be a great addition to the Meteor project, we're going to close it for now due to inactivity (see the bug issue lifespan section of Meteor's triage doc for more information). If anyone comes across this issue in the future and is interested in working on resolving it, please let us know by posting here and we'll consider re-opening this issue. Thanks!
I'm using Meteor 1.6.0.1 and having the same problems; adding this comment to follow this issue.
Same; we seem to be facing this issue in production.
This has been picked back up in issue #9796 -- see there for the latest discussion.
I have this issue in 1.4, 1.5, and 1.6.
We've discovered that any time the server has more than 120 concurrent fibers in flight, each new fiber created permanently leaks its v8-level thread state and permanently makes every future fiber switch slower.
We have observed processes in production stuck spending 65% or more of their CPU time just in fiber-switching code. It only takes a few thousand leaked fibers to start seeing severe performance problems.
The root cause of this problem is that v8's green-threading code (which node-fibers is based on) is designed assuming a fixed-size thread pool with a handful of threads (again, referring to green threads, not OS threads). v8 stores some information about these threads in a linked list, and it performs lookups on this linked list during every thread switch. Hence, switching threads is O(n) in the number of threads. The v8 team has indicated that they are not interested in fixing this problem as they consider anything other than a small fixed thread pool to be out-of-scope.
The "120" number comes from node-fibers' attempt to work around the issue. It maintains a fiber pool with a default size of 120. As long as you stay under 120 concurrent fibers, the v8-level threading state will be reused and so won't leak. But if you exceed the pool size, additional threads leak when the fiber completes.
It is possible to increase node-fibers' pool size by setting `Fiber.poolSize`. Setting it to a very large number effectively gives you an adaptive pool that scales to the maximum number of concurrent fibers the process creates. I would recommend that Meteor do this automatically, as any app that exceeds the default pool size will almost certainly run into severe problems quickly (and any app that does not exceed the pool size will not be affected by increasing the size).

However, while increasing the pool size can avoid a gradual leak, an app can still experience problems if at some point it actually creates a few thousand concurrent fibers.
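For reference, raising the pool size at process startup is a one-line change. A sketch, assuming the `fibers` module required here is the same instance Meteor itself loads; the value is an arbitrary example:

```js
// Raise node-fibers' pool size well above any realistic fiber concurrency so
// completed fibers are returned to the pool instead of leaking v8 thread state.
const Fiber = require("fibers");
Fiber.poolSize = 100000; // arbitrary example value, far above the default of 120
```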
Unfortunately, it is fairly easy to cause Meteor to allocate lots of concurrent fibers in response to a database query. For example, imagine that we are subscribed to this query:
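The original snippet isn't reproduced in this excerpt; the following is an illustrative reconstruction, borrowing the `Widgets` collection name from the repro linked below and a hypothetical `hidden` flag:

```js
// Hypothetical publication: clients subscribe to all widgets that are not hidden.
Meteor.publish("visibleWidgets", function () {
  return Widgets.find({ hidden: false });
});
```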
Now, elsewhere, we do:
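Again, an illustrative reconstruction rather than the original code:

```js
// Hypothetical bulk update that un-hides a large number of documents at once.
Widgets.update(
  { hidden: true },
  { $set: { hidden: false } },
  { multi: true }
);
```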
Let's say that this query changes 5000 documents to be un-hidden, thus causing our subscription to observe 5000 results added. This will create ~5000 concurrent fibers. The Node process is now permanently hosed. It will burn excessive CPU until it is killed.
We have an example app showing a simple query causing a fiber explosion here: https://github.com/jparyani/meteor-fiber-repro
Note that we only started observing this problem when we upgraded to Meteor 1.4. However, we haven't actually tracked down what query is causing our fiber explosions yet, so it's entirely possible that it had nothing to do with the Meteor 1.4 update. (In fact, we don't even know if our production issues are tied to subscriptions at all; we just observe that this is one possible way to cause such a problem.)
Related issues on other projects:
https://bugs.chromium.org/p/v8/issues/detail?id=5338
https://bugs.chromium.org/p/v8/issues/detail?id=3777
laverdet/node-fibers#305
It seems like Meteor could solve this by maintaining its own fixed-size fiber pool, holding tasks in a queue when no fibers are available to execute them.
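A rough sketch of that idea (my own illustration, not an existing Meteor API; the cap and function names are made up):

```js
const Fiber = require("fibers");

// Hypothetical cap, kept below node-fibers' pool size so fibers get reused.
const MAX_CONCURRENT_FIBERS = 100;
let runningFibers = 0;
const pendingTasks = [];

function runInPooledFiber(task) {
  if (runningFibers >= MAX_CONCURRENT_FIBERS) {
    pendingTasks.push(task); // defer until a fiber slot frees up
    return;
  }
  runningFibers++;
  Fiber(function () {
    try {
      task();
    } finally {
      runningFibers--;
      const next = pendingTasks.shift();
      if (next) runInPooledFiber(next); // start the next queued task
    }
  }).run();
}
```

This caps the number of concurrent fibers regardless of how many results an observer delivers at once, at the cost of delaying the excess work.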
Thoughts? Has this issue been observed before? Is there a recommended work-around?