
[fix] File transport under stress hangs logger #1318

Merged · 11 commits · May 22, 2018
Conversation

@mempf (Contributor) commented May 20, 2018

I noticed in one of our projects using winston3 that all our logging would stop after a few large (sometimes very large) log messages. The file transport was the cause; it seemed like it was not draining correctly. Upon investigation I found that the drain mechanism seemed flawed: it used a drain counter and did not reset it to zero when the drain event occurred.

Additionally, the file transport would attempt to write to the stream even after receiving the signal to stop and wait for a drain event.

I have tested this against the examples provided by @juanda99 and @kramer65 mentioned in #1144, and both seem to work, although the @kramer65 code had to be slightly modified so the while(true) loop gives the event loop some time to breathe.

I know this transport has been particularly tricky to get right and I hope this helps. Unfortunately, as it stands now, winston3 rc5 can still get the file transport stuck, which causes all logging to stop. I am trying to create an isolated test case, although my attempts so far have not gone well.
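
For context, the backpressure contract at issue can be sketched like this: a simplified illustration with a boolean drain flag and a small queue, assuming a plain fs write stream rather than winston's actual transport code.

'use strict';

const fs = require('fs');

// Simplified sketch of the drain handling this PR moves toward: stop writing
// when write() returns false, queue chunks, and resume once per 'drain' event
// with the flag reset. This is an illustration, not winston's file transport.
const stream = fs.createWriteStream('sketch.log');
const queue = [];
let draining = false;

function write(chunk) {
  if (draining) {
    queue.push(chunk);
    return;
  }
  if (stream.write(chunk) === false) {
    draining = true;
    stream.once('drain', () => {
      draining = false;
      // Flush whatever queued up while we were waiting for the drain.
      while (queue.length && !draining) {
        write(queue.shift());
      }
    });
  }
}

for (let i = 0; i < 1000; i++) {
  write('x'.repeat(8192) + '\n');
}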

@mempf closed this May 20, 2018
@mempf changed the title from "Fix accidentally removed drain event handler" to "Fix accidentally removed drain event handler (incomplete)" May 20, 2018
@mempf changed the title from "Fix accidentally removed drain event handler (incomplete)" to "[fix] File transport under stress hangs logger" May 21, 2018
@mempf reopened this May 21, 2018
@mempf (Contributor, Author) commented May 21, 2018

Here is an example that, run against winston master, should freeze fairly quickly; it is an extreme case.

https://pastebin.com/cZAikMTf
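
(The pastebin is not reproduced here. As a rough idea of the shape of such an extreme example, a tight synchronous burst of large messages through the File transport is enough to trigger the hang on master; the sizes and counts below are illustrative guesses, not the pastebin values.)

'use strict';

const winston = require('winston');

const logger = winston.createLogger({
  transports: [new winston.transports.File({ filename: 'stress.log' })]
});

// A burst of large messages with no pause for the stream to drain.
for (let i = 0; i < 10000; i++) {
  logger.info('x'.repeat(65536));
}
logger.info('if the transport hangs, this final line never reaches the file');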

@indexzero (Member) commented

First: thank you for your PR @mempf – fixing this scenario once and for all is high on the project's list of priorities.

Would it be possible for you to write a unit test verifying this fix works appropriately? Given the importance of this fix, it's prudent to prevent future regressions.

@ChrisAlderson (Member) left a comment

I've played around with the PR and it seems to be working. However, I think the fix can be much simpler.
As for the tests: I've noticed this problem occurs on master when logging messages with a length of, or longer than, 16384. I've created an MWE below which should NOT work on master but does work with your patch, @mempf, so the tests can be based around this.

'use strict';

const fs = require('fs');
const path = require('path');
const os = require('os');
const winston = require('winston');

const filename = path.join(__dirname, 'combined.log');

try {
  fs.unlinkSync(filename); // eslint-disable-line no-sync
} catch (ex) { } // eslint-disable-line no-empty

const logger = winston.createLogger({
  //
  // Custom format to only print out the message
  //
  format: winston.format.printf(info => info.message),
  transports: [
    new winston.transports.File({
      filename,
      //
      // By default winston already sets the
      // eol this way, but this is for the demo.
      //
      eol: os.EOL
    })
  ]
});

setInterval(() => {
  logger.log({
    level: 'info',
    //
    // Creates a message which is just under the `highWaterMark` limit.
    // See: https://nodejs.org/api/stream.html#stream_constructor_new_stream_writable_options
    // This should NOT work on `master`, but does work with @mempf's patch.
    //
    message: 'a'.repeat(16384 - os.EOL.length - 1)
  });
}, 100);
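
For anyone wondering where 16384 comes from: it is the default highWaterMark of a Node writable stream (16 KiB), so each write in the example above lands one byte under the limit. The arithmetic below assumes os.EOL is '\n':

const os = require('os');

// Assuming os.EOL === '\n' (length 1):
//   message length: 16384 - 1 - 1 = 16382 characters
//   bytes written:  16382 + 1 (EOL) = 16383, one byte under the default
//   writable-stream highWaterMark of 16384.
console.log(16384 - os.EOL.length - 1 + os.EOL.length); // 16383 on Linux/macOS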

@@ -89,7 +88,7 @@ module.exports = class File extends TransportStream {
   this._size = 0;
   this._pendingSize = 0;
   this._created = 0;
-  this._drains = 0;
+  this._drain = false;
   this._next = noop;

Member:

this._next is never used, so it can be removed. The noop function on L25 could also be removed if you give the log method on L104 a default callback parameter instead.
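
A rough sketch of what that suggestion could look like (simplified signature, not the actual file.js code):

// Before (sketch): a module-level no-op used as the fallback callback.
//   const noop = () => {};
//   log(info, callback) { callback = callback || noop; ... }
//
// After (sketch): a default parameter removes the need for noop and this._next.
class FileSketch {
  log(info, callback = () => {}) {
    // ... format `info` and write it to the stream ...
    callback();
    return this;
  }
}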

});
return;
}

const written = this._stream.write(output, logged.bind(this));
if (written === false) {

Member:

Move to L175.

@@ -143,9 +142,21 @@ module.exports = class File extends TransportStream {
this._endStream(() => this._rotateFile());
}

if (this._drain) {

Member:

This seems unnecessary. With the example you've provided in the PR I haven't been able to get into this block of code, but feel free to prove me wrong.

Contributor (Author):

I am trying all my test cases and you seem to be right; this might be unnecessary. I haven't investigated in depth whether the log function of the file transport can be called again after we've been told to wait for a drain event. This check was to ensure we don't call .write on the stream again until it's ready, but if something higher up already prevents that, making the check unnecessary, then it can be removed.

Member:

The _writev method of winstonjs/winston-transport seems to call the log method multiple times. I've managed to get into the if-statement by replacing the setInterval part of the script I provided in my previous comment with the following:

const infos = new Array(10).fill().map(() => ({
  level: 'info',
  //
  // Creates a message which, together with the EOL, reaches the `highWaterMark` limit exactly.
  // See: https://nodejs.org/api/stream.html#stream_constructor_new_stream_writable_options
  // This should NOT work on `master`, but does work with @mempf's patch.
  //
  message: 'a'.repeat(16385 - os.EOL.length - 1)
}));
infos.forEach(info => logger.log(info));

Contributor (Author):

So synchronous logging calls could still overload the stream buffer if nothing prevents it, then.

@indexzero (Member) commented

Thanks for adding the test. If @ChrisAlderson is 👍 then this should be good to go.

@ChrisAlderson (Member) commented

Looks great! Thanks for the contribution. We could add one more test for the synchronous logging; I have it below. You can add it, @mempf, or we can add it post-merge.

  it('should handle a high volume of large writes synchronous', function (done) {
    const logger = winston.createLogger({
      transports: [new winston.transports.File({
        filename: logPath
      })]
    });

    const counters = {
      write: 0,
      read: 0
    };

    const msgs = new Array(10).fill().map(() => ({
      counter: ++counters.write,
      message: 'a'.repeat(16384 - os.EOL.length - 1)
    }));
    msgs.forEach(msg => logger.info(msg));

    setTimeout(function () {
      helpers.tryRead(logPath)
        .on('error', function (err) {
          assume(err).false();
          done();
        })
        .pipe(split())
        .on('data', function (d) {
          const json = JSON.parse(d);
          assume(json.level).equal('info');
          assume(json.message).equal('a'.repeat(16384 - os.EOL.length - 1));
          assume(json.counter).equal(++counters.read);
        })
        .on('end', function () {
          assume(counters.write).equal(counters.read);
          done();
        });
    }, 10000);
  });

@mempf (Contributor, Author) commented May 22, 2018

Added your synchronous test, nice catch with that one.

@ChrisAlderson (Member) commented May 22, 2018

Looks good to merge @indexzero.

@indexzero indexzero merged commit 5c6c7b8 into winstonjs:master May 22, 2018
@kramer65 commented

@mempf - Thanks for the fix. Unfortunately, it still doesn't work for us. You say this, however:

although the @kramer65 code had to be slightly modified so the while(true) loop gives the event loop some time to breathe

Can you share with us how you modified my code? We tried this in our production app, and although it doesn't stop the logging within a second like my example code does, it does make it hang within about 15 minutes. We've got a very high log throughput, so having logging hang is detrimental to understanding and managing our application (which handles financial decisions).

Also: would you know of a workaround for this? We might, for example, queue Winston logging altogether using something like kue. Would that be an option, or do you see problems with that?
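
(For reference, the kind of queueing being asked about might look roughly like the sketch below, using kue's createQueue/create/process API; whether it actually helps, or just moves the backlog into Redis, is exactly the open question.)

'use strict';

const kue = require('kue');
const winston = require('winston');

const queue = kue.createQueue();
const logger = winston.createLogger({
  transports: [new winston.transports.File({ filename: 'combined.log' })]
});

// Producer: push log entries into the queue instead of logging directly.
function queuedLog(level, message) {
  queue.create('log', { level, message }).save();
}

// Consumer: drain the queue into winston at whatever rate it can sustain.
queue.process('log', (job, done) => {
  logger.log({ level: job.data.level, message: job.data.message });
  done();
});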

@mempf (Contributor, Author) commented Jun 14, 2018

@kramer65 This whole issue is about the file transport's write buffer getting backed up because the OS is not writing it to disk fast enough. If you sustain logging beyond what can be written to disk for long enough, I imagine something in Node will break, maybe even run out of memory. Is it possible to monitor the RAM/heap usage of your application and see if it grows out of control?
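
A quick way to check that is to sample process.memoryUsage() on a timer and watch whether rss/heapUsed keep climbing while the log volume is high (just a monitoring sketch, nothing winston-specific):

// Sample memory every 10 seconds; a steadily climbing rss/heapUsed while the
// file transport is backed up would point at unbounded buffering.
setInterval(() => {
  const { rss, heapUsed } = process.memoryUsage();
  console.error(
    `rss=${(rss / 1024 / 1024).toFixed(1)}MB heapUsed=${(heapUsed / 1024 / 1024).toFixed(1)}MB`
  );
}, 10000);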

If I recall correctly, the modified version of your example simply exits the while loop eventually. Since Node is single-threaded, a while(true) loop will run forever, never giving other code a chance to run.
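
The modification meant here was along these lines (a sketch of the idea, not the exact change): let each iteration hand control back to the event loop so timers and stream 'drain' events can fire.

'use strict';

const winston = require('winston');

const logger = winston.createLogger({
  transports: [new winston.transports.File({ filename: 'loop.log' })]
});
const bigMessage = 'x'.repeat(16384);

// Instead of `while (true) { logger.info(bigMessage); }`, which never yields:
function logForever() {
  logger.info(bigMessage);
  setImmediate(logForever); // yield to the event loop between iterations
}
logForever();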

Edit: I am also curious what behavior you see in your production use case with winston 2.x, and whether this is a regression in winston 3 or has always been a problematic use case.

Mizumaki pushed a commit to Mizumaki/winston that referenced this pull request Jun 11, 2020
* [fix] Add drain handler back into file transport

This looks like it was removed (by accident?) in
40c1ce4

* [fix] Remove extra drain event

* Prevent this._drains from going negative

* Refactor drain mechanism to fix file transport logging under stress

* Removed early return when write returns false

* Greatly simplify drain mechanism based on PR feedback and testing

* Revert "Greatly simplify drain mechanism based on PR feedback and testing"

This reverts commit 7ff756b.

* Changes and cleanup based on PR feedback

* Add test case that should cover the original reason for this change

This test should fail on current winston 'master' and succeed on this PR

* Test the content of the message as well just in case

* Add synchronous version of the file transport large message stress test