Fiber spikes from normal query observers can permanently hose Node process #7747
When using Promises, a callback passed to .then() will sometimes pass through multiple .then() calls. If we aren't careful, we may end up wrapping a callback multiple times, resulting in a callback that creates a fiber which immediately creates another fiber and so on. This is especially bad because in this chain situation of fibers-creating-fibers, none of the fibers complete before others are created, thus they force expansion of the fiber pool. This, in turn, runs into a v8 bug in which each fiber created makes fiber-switching permanently slower, due to use of a linked list. See: meteor/meteor#7747 This change simply adds a field to the wrapped callbacks that marks them as not needing further wrapping. Fixes meteor#10
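Roughly, the idea of that patch is to mark a callback the first time it gets wrapped so that passing it through several `.then()` calls never wraps it again. A minimal sketch of that idea follows; this is not the actual meteor-promise code, and the `_alreadyFiberWrapped` field and `wrapCallbackInFiber` name are made up for illustration:

```js
const Fiber = require("fibers");

// Hypothetical illustration of the meteor/promise#11 idea, not the real patch.
function wrapCallbackInFiber(callback) {
  // If this callback was already wrapped, reuse it rather than wrapping again,
  // which would otherwise create fibers that immediately create more fibers.
  if (callback._alreadyFiberWrapped) return callback;

  const wrapped = function (...args) {
    return Fiber(() => callback(...args)).run();
  };
  wrapped._alreadyFiberWrapped = true;
  return wrapped;
}
```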
This patch to meteor-promise fixes our problems: I would appreciate it if this could either rapidly find its way into a Meteor point release, or if you could instruct me on how to force Meteor to use this patch in our build. This bug is causing repeated production outages, and although I have hotfixed it for now by editing the code in place on the server, that's obviously not a sustainable solution. Thanks!
Currently there is no direct way to use this patch (maybe), since the Meteor package directly depends on that package.
I've run our app locally just to inspect the fiber count, and after logging in as two clients and loading the data via our subscriptions, I was up to a fiber count of 81. I don't know what our count looks like in production yet, but based on my initial local testing, I believe it's very likely over 120. This is a very interesting find, as we've struggled with memory leaks and CPU cycling for a while. I know that Fibers was updated from 1.0.1 to 1.0.5 in Meteor 1.0.4. As a side note, I was not, however, able to reproduce the issue with the repro. I was alternating the setting and unsetting, but perhaps not often enough?
Unfortunately, the repro is inconsistent on my end as well. It's somehow related to load, so I'd suggest trying to up the number of records.
I think I can reliably reproduce the situation with https://github.com/jparyani/meteor-fiber-repro
This works like a charm every time for me. The takeaway seems to be that this bug may be occurring around pub/subs. You could try replacing the Widgets collection with a fake custom collection without Mongo backing to try to eliminate more suspects.
I applied the meteor/promise#11 solution but I can still create a fiber spike on demand.
I can confirm that the repro still exhibits this behavior with [email protected].
Fixes a bug where .then callbacks could be wrapped multiple times: meteor/promise#11
@thorjarhun Indeed, meteor/promise#11 is not expected to fix this specific way of creating fiber spikes. It did, however, fix a different cause of fiber spikes, which turned out to be the one we were actually facing in production.
This is a very severe bug.
While we think resolving this issue would be a great addition to the Meteor project, we're going to close it for now due to inactivity (see the bug issue lifespan section of Meteor's triage doc for more information). If anyone comes across this issue in the future and is interested in working on resolving it, please let us know by posting here and we'll consider re-opening this issue. Thanks!
I'm using Meteor 1.6.0.1 and having the same problems; adding this comment to follow this issue.
Same; we seem to be facing this issue in production.
This has been picked back up in issue #9796 -- see there for the latest discussion.
I have this issue in 1.4, 1.5, and 1.6.
We've discovered that any time the server has more than 120 concurrent fibers in flight, each new fiber created permanently leaks its v8-level thread state and permanently makes every future fiber switch slower.
We have observed processes in production stuck spending 65% or more of their CPU time just in fiber-switching code. It only takes a few thousand leaked fibers to start seeing severe performance problems.
The root cause of this problem is that v8's green-threading code (which node-fibers is based on) is designed assuming a fixed-size thread pool with a handful of threads (again, referring to green threads, not OS threads). v8 stores some information about these threads in a linked list, and it performs lookups on this linked list during every thread switch. Hence, switching threads is O(n) in the number of threads. The v8 team has indicated that they are not interested in fixing this problem as they consider anything other than a small fixed thread pool to be out-of-scope.
The "120" number comes from node-fibers' attempt to work around the issue. It maintains a fiber pool with a default size of 120. As long as you stay under 120 concurrent fibers, the v8-level threading state will be reused and so won't leak. But if you exceed the pool size, additional threads leak when the fiber completes.
It is possible to increase node-fibers' pool size by setting `Fiber.poolSize`. Setting it to a very large number effectively gives you an adaptive pool that scales to the maximum number of concurrent fibers the process creates. I would recommend that Meteor do this automatically, as any app that exceeds the default pool size will almost certainly run into severe problems quickly (and any app that does not exceed the pool size will not be affected by increasing the size).

However, while increasing the pool size can avoid a gradual leak, an app can still experience problems if at some point it actually creates a few thousand concurrent fibers.
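For reference, raising the pool size at process startup is a one-line change. A sketch, assuming the `fibers` module required here is the same instance Meteor itself loads; the value is an arbitrary example:

```js
// Raise node-fibers' pool size well above any realistic fiber concurrency so
// completed fibers are returned to the pool instead of leaking v8 thread state.
const Fiber = require("fibers");
Fiber.poolSize = 100000; // arbitrary example value, far above the default of 120
```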
Unfortunately, it is fairly easy to cause Meteor to allocate lots of concurrent fibers in response to a database query. For example, imagine that we are subscribed to this query:
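The original snippet isn't reproduced in this excerpt; the following is an illustrative reconstruction, borrowing the `Widgets` collection name from the repro linked below and a hypothetical `hidden` flag:

```js
// Hypothetical publication: clients subscribe to all widgets that are not hidden.
Meteor.publish("visibleWidgets", function () {
  return Widgets.find({ hidden: false });
});
```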
Now, elsewhere, we do:
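Again, an illustrative reconstruction rather than the original code:

```js
// Hypothetical bulk update that un-hides a large number of documents at once.
Widgets.update(
  { hidden: true },
  { $set: { hidden: false } },
  { multi: true }
);
```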
Let's say that this query changes 5000 documents to be un-hidden, thus causing our subscription to observe 5000 results added. This will create ~5000 concurrent fibers. The Node process is now permanently hosed. It will burn excessive CPU until it is killed.
We have an example app showing a simple query causing a fiber explosion here: https://github.com/jparyani/meteor-fiber-repro
Note that we only started observing this problem when we upgraded to Meteor 1.4. However, we haven't actually tracked down what query is causing our fiber explosions yet, so it's entirely possible that it had nothing to do with the Meteor 1.4 update. (In fact, we don't even know if our production issues are tied to subscriptions at all; we just observe that this is one possible way to cause such a problem.)
Related issues on other projects:
https://bugs.chromium.org/p/v8/issues/detail?id=5338
https://bugs.chromium.org/p/v8/issues/detail?id=3777
laverdet/node-fibers#305
It seems like Meteor could solve this by maintaining its own fixed-size fiber pool, holding tasks in a queue when no fibers are available to execute them.
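A rough sketch of that idea (my own illustration, not an existing Meteor API; the cap and function names are made up):

```js
const Fiber = require("fibers");

// Hypothetical cap, kept below node-fibers' pool size so fibers get reused.
const MAX_CONCURRENT_FIBERS = 100;
let runningFibers = 0;
const pendingTasks = [];

function runInPooledFiber(task) {
  if (runningFibers >= MAX_CONCURRENT_FIBERS) {
    pendingTasks.push(task); // defer until a fiber slot frees up
    return;
  }
  runningFibers++;
  Fiber(function () {
    try {
      task();
    } finally {
      runningFibers--;
      const next = pendingTasks.shift();
      if (next) runInPooledFiber(next); // start the next queued task
    }
  }).run();
}
```

This caps the number of concurrent fibers regardless of how many results an observer delivers at once, at the cost of delaying the excess work.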
Thoughts? Has this issue been observed before? Is there a recommended work-around?