
[fix] File transport under stress hangs logger #1318

Merged · 11 commits · May 22, 2018
Conversation

@mempf (Contributor) commented May 20, 2018

I noticed in one of our projects using winston3 that all our logging would stop after a few large (sometimes very large) log messages. The file transport was the cause; it seemed like it was not draining correctly. Upon investigation I found that the drain mechanism seemed flawed: it used a drain counter and did not reset it to zero when the drain event occurred.

Additionally, the file transport would attempt to write to the stream even after receiving the signal to stop and wait for a drain event.

I have tested this against the examples provided by @juanda99 and @kramer65 mentioned in #1144, and both seem to work, although the @kramer65 code had to be slightly modified so the while(true) loop gives the event loop some time to breathe.

I know this transport has been particularly tricky to get right and I hope this helps. Unfortunately, as it stands now, winston3 rc5 can still get the file transport stuck, which causes all logging to stop. I am trying to create an isolated test case, although my attempts so far have not gone well.
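
For context, the backpressure contract at issue can be sketched like this: a simplified illustration with a boolean drain flag and a small queue, assuming a plain fs write stream rather than winston's actual transport code.

'use strict';

const fs = require('fs');

// Simplified sketch of the drain handling this PR moves toward: stop writing
// when write() returns false, queue chunks, and resume once per 'drain' event
// with the flag reset. This is an illustration, not winston's file transport.
const stream = fs.createWriteStream('sketch.log');
const queue = [];
let draining = false;

function write(chunk) {
  if (draining) {
    queue.push(chunk);
    return;
  }
  if (stream.write(chunk) === false) {
    draining = true;
    stream.once('drain', () => {
      draining = false;
      // Flush whatever queued up while we were waiting for the drain.
      while (queue.length && !draining) {
        write(queue.shift());
      }
    });
  }
}

for (let i = 0; i < 1000; i++) {
  write('x'.repeat(8192) + '\n');
}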

@mempf closed this May 20, 2018
@mempf changed the title from "Fix accidentally removed drain event handler" to "Fix accidentally removed drain event handler (incomplete)" May 20, 2018
@mempf changed the title from "Fix accidentally removed drain event handler (incomplete)" to "[fix] File transport under stress hangs logger" May 21, 2018
@mempf reopened this May 21, 2018
@mempf (Contributor, Author) commented May 21, 2018

Here is an example that, run against winston master, should freeze fairly quickly; it is an extreme case.

https://pastebin.com/cZAikMTf
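
(The pastebin is not reproduced here. As a rough idea of the shape of such an extreme example, a tight synchronous burst of large messages through the File transport is enough to trigger the hang on master; the sizes and counts below are illustrative guesses, not the pastebin values.)

'use strict';

const winston = require('winston');

const logger = winston.createLogger({
  transports: [new winston.transports.File({ filename: 'stress.log' })]
});

// A burst of large messages with no pause for the stream to drain.
for (let i = 0; i < 10000; i++) {
  logger.info('x'.repeat(65536));
}
logger.info('if the transport hangs, this final line never reaches the file');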

@indexzero (Member) commented

First: thank you for your PR @mempf – fixing this scenario once and for all is high on the project's list of priorities.

Would it be possible for you to write a unit test verifying this fix works appropriately? Given the importance of this fix, it's prudent to prevent future regressions.

@ChrisAlderson (Member) left a comment

I've played around with the PR and it seems to be working. However, I think the fix can be much simpler.
As for the tests: I've noticed this problem occurs on master when logging messages with a length of, or longer than, 16384. I've created an MWE below which should NOT work on master but does work with your patch, @mempf, so the tests can be based around this.

'use strict';

const fs = require('fs');
const path = require('path');
const os = require('os');
const winston = require('winston');

const filename = path.join(__dirname, 'combined.log');

try {
  fs.unlinkSync(filename); // eslint-disable-line no-sync
} catch (ex) { } // eslint-disable-line no-empty

const logger = winston.createLogger({
  //
  // Custom format to only print out the message
  //
  format: winston.format.printf(info => info.message),
  transports: [
    new winston.transports.File({
      filename,
      //
      // By default winston already sets the
      // eol this way, but this is for the demo.
      //
      eol: os.EOL
    })
  ]
});

setInterval(() => {
  logger.log({
    level: 'info',
    //
    // Creates a message which is just under the `highWaterMark` limit.
    // See: https://nodejs.org/api/stream.html#stream_constructor_new_stream_writable_options
    // This should NOT work on `master`, but does work with @mempf's patch.
    //
    message: 'a'.repeat(16384 - os.EOL.length - 1)
  });
}, 100);
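
For anyone wondering where 16384 comes from: it is the default highWaterMark of a Node writable stream (16 KiB), so each write in the example above lands one byte under the limit. The arithmetic below assumes os.EOL is '\n':

const os = require('os');

// Assuming os.EOL === '\n' (length 1):
//   message length: 16384 - 1 - 1 = 16382 characters
//   bytes written:  16382 + 1 (EOL) = 16383, one byte under the default
//   writable-stream highWaterMark of 16384.
console.log(16384 - os.EOL.length - 1 + os.EOL.length); // 16383 on Linux/macOS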

@@ -89,7 +88,7 @@ module.exports = class File extends TransportStream {
   this._size = 0;
   this._pendingSize = 0;
   this._created = 0;
-  this._drains = 0;
+  this._drain = false;
   this._next = noop;

Member:

this._next is never used, so it can be removed. The noop function on L25 could also be removed if you give the log method on L104 a default callback parameter instead.
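
A rough sketch of what that suggestion could look like (simplified signature, not the actual file.js code):

// Before (sketch): a module-level no-op used as the fallback callback.
//   const noop = () => {};
//   log(info, callback) { callback = callback || noop; ... }
//
// After (sketch): a default parameter removes the need for noop and this._next.
class FileSketch {
  log(info, callback = () => {}) {
    // ... format `info` and write it to the stream ...
    callback();
    return this;
  }
}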

});
return;
}

const written = this._stream.write(output, logged.bind(this));
if (written === false) {

Member:

Move to L175.

@@ -143,9 +142,21 @@ module.exports = class File extends TransportStream {
this._endStream(() => this._rotateFile());
}

if (this._drain) {

Member:

This seems unnecessary. With the example you've provided in the PR I haven't been able to get into this block of code, but feel free to prove me wrong.

Contributor (Author):

I am trying all my test cases and you seem to be right; this might be unnecessary. I haven't investigated in depth whether the log function of the file transport can be called again after we've been told to wait for a drain event. This check was to ensure we don't call .write on the stream again until it's ready, but if something higher up already prevents that, making the check unnecessary, then it can be removed.

Member:

The _writev method of winstonjs/winston-transport seems to call the log method multiple times. I've managed to get into the if-statement by replacing the setInterval part of the script I provided in my previous comment with the following:

const infos = new Array(10).fill().map(() => ({
  level: 'info',
  //
  // Creates a message which, together with the EOL, reaches the `highWaterMark` limit exactly.
  // See: https://nodejs.org/api/stream.html#stream_constructor_new_stream_writable_options
  // This should NOT work on `master`, but does work with @mempf's patch.
  //
  message: 'a'.repeat(16385 - os.EOL.length - 1)
}));
infos.forEach(info => logger.log(info));

Contributor (Author):

So synchronous logging calls could still overload the stream buffer if nothing prevents it, then.

@indexzero (Member) commented

Thanks for adding the test. If @ChrisAlderson is 👍 then this should be good to go.

@ChrisAlderson (Member) commented

Looks great! Thanks for the contribution. We could add one more test for the synchronous logging; I have it below. You can add it, @mempf, or we can add it post-merge.

  it('should handle a high volume of large writes synchronous', function (done) {
    const logger = winston.createLogger({
      transports: [new winston.transports.File({
        filename: logPath
      })]
    });

    const counters = {
      write: 0,
      read: 0
    };

    const msgs = new Array(10).fill().map(() => ({
      counter: ++counters.write,
      message: 'a'.repeat(16384 - os.EOL.length - 1)
    }));
    msgs.forEach(msg => logger.info(msg));

    setTimeout(function () {
      helpers.tryRead(logPath)
        .on('error', function (err) {
          assume(err).false();
          done();
        })
        .pipe(split())
        .on('data', function (d) {
          const json = JSON.parse(d);
          assume(json.level).equal('info');
          assume(json.message).equal('a'.repeat(16384 - os.EOL.length - 1));
          assume(json.counter).equal(++counters.read);
        })
        .on('end', function () {
          assume(counters.write).equal(counters.read);
          done();
        });
    }, 10000);
  });

@mempf (Contributor, Author) commented May 22, 2018

Added your synchronous test, nice catch with that one.

@ChrisAlderson (Member) commented May 22, 2018

Looks good to merge @indexzero.

@indexzero indexzero merged commit 5c6c7b8 into winstonjs:master May 22, 2018
@kramer65 commented

@mempf - Thanks for the fix. Unfortunately, it still doesn't work for us. You say this, however:

although the @kramer65 code had to be slightly modified so the while(true) loop gives the event loop some time to breathe

Can you share with us how you modified my code? We tried this in our production app, and although it doesn't stop the logging within a second like my example code does, it does make it hang within about 15 minutes. We've got a very high log throughput, so having logging hang is detrimental to understanding and managing our application (which handles financial decisions).

Also: would you know of a workaround for this? We might, for example, queue Winston logging altogether using something like kue. Would that be an option, or do you see problems with that?
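
(For reference, the kind of queueing being asked about might look roughly like the sketch below, using kue's createQueue/create/process API; whether it actually helps, or just moves the backlog into Redis, is exactly the open question.)

'use strict';

const kue = require('kue');
const winston = require('winston');

const queue = kue.createQueue();
const logger = winston.createLogger({
  transports: [new winston.transports.File({ filename: 'combined.log' })]
});

// Producer: push log entries into the queue instead of logging directly.
function queuedLog(level, message) {
  queue.create('log', { level, message }).save();
}

// Consumer: drain the queue into winston at whatever rate it can sustain.
queue.process('log', (job, done) => {
  logger.log({ level: job.data.level, message: job.data.message });
  done();
});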

@mempf (Contributor, Author) commented Jun 14, 2018

@kramer65 This whole issue is about the file transport's write buffer getting backed up because the OS is not writing it to disk fast enough. If you sustain logging beyond what can be written to disk for long enough, I imagine something in Node will break, maybe even run out of memory. Is it possible to monitor the RAM/heap usage of your application and see if it grows out of control?
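
A quick way to check that is to sample process.memoryUsage() on a timer and watch whether rss/heapUsed keep climbing while the log volume is high (just a monitoring sketch, nothing winston-specific):

// Sample memory every 10 seconds; a steadily climbing rss/heapUsed while the
// file transport is backed up would point at unbounded buffering.
setInterval(() => {
  const { rss, heapUsed } = process.memoryUsage();
  console.error(
    `rss=${(rss / 1024 / 1024).toFixed(1)}MB heapUsed=${(heapUsed / 1024 / 1024).toFixed(1)}MB`
  );
}, 10000);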

If I recall correctly, the modified version of your example simply exits the while loop eventually. Since Node is single-threaded, a while(true) loop will run forever, never giving other code a chance to run.
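
The modification meant here was along these lines (a sketch of the idea, not the exact change): let each iteration hand control back to the event loop so timers and stream 'drain' events can fire.

'use strict';

const winston = require('winston');

const logger = winston.createLogger({
  transports: [new winston.transports.File({ filename: 'loop.log' })]
});
const bigMessage = 'x'.repeat(16384);

// Instead of `while (true) { logger.info(bigMessage); }`, which never yields:
function logForever() {
  logger.info(bigMessage);
  setImmediate(logForever); // yield to the event loop between iterations
}
logForever();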

Edit: I am also curious what behavior you see in your production use case with winston 2.x, and whether this is a regression in winston 3 or has always been a problematic use case.

Mizumaki pushed a commit to Mizumaki/winston that referenced this pull request Jun 11, 2020
* [fix] Add drain handler back into file transport

This looks like it was removed (by accident?) in
40c1ce4

* [fix] Remove extra drain event

* Prevent this._drains from going negative

* Refactor drain mechanism to fix file transport logging under stress

* Removed early return when write returns false

* Greatly simplify drain mechanism based on PR feedback and testing

* Revert "Greatly simplify drain mechanism based on PR feedback and testing"

This reverts commit 7ff756b.

* Changes and cleanup based on PR feedback

* Add test case that should cover the original reason for this change

This test should fail on current winston 'master' and succeed on this PR

* Test the content of the message as well just in case

* Add synchronous version of the file transport large message stress test