
Ensure that the millRun goroutine terminates when Close is called. #100

Open · wants to merge 3 commits into base: v2.0
Conversation

howbazaar

Currently the millRun goroutine leaks. This is very noticeable if
a Logger is constructed periodically, used, and then closed.

This change ensures that the millCh channel is closed if it exists.

Existing log rotation tests cover duplicate Close calls.

@howbazaar
Author

I notice that Travis CI has failures. Interestingly, I suspected this could happen while reading the tests: there is no synchronisation between the goroutine running millRunOnce and the tests, so a test triggers the mill but doesn't wait for the milling to complete before it checks the files.

Do you have any preference on how you'd like to see the synchronisation added?

@natefinch
Owner

No preference, do whatever you feel is best and most straightforward. I'm not too picky. I really need to make a v3 of this thing and clean some of this stuff up.

@howbazaar
Author

Wow, adding synchronisation and determinism has proved to be way more complicated than I thought.

@howbazaar
Author

FYI, here is the comment I wrote while evaluating the race in TestCleanupExistingBackups:

       // There is a race here between the first mill request, which happens
       // when openExistingOrNew starts, and the mill that happens as a part
       // of rotate, which gets called due to the size written into the log file.
       // If the first mill managed to process the old log file before the mill
       // request due to rotate, then we get a notification and may get three
       // files back, as the rotate call creates an additional log file. The
       // second mill call (due to rotate) will remove that additional file.
       // If there is a second call, we need to call waitForNotify or the Close
       // method blocks.

While fixing the tests I noticed that the original patch closed the millRun goroutine in the wrong place. I didn't realise that the `close` method was called internally as well as part of the `rotate` method. The closing of the mill signalling channel is now done in the `Close` method.

There were a bunch of race errors detected, mostly around the updating of the time, and the `fakeFS`. Synchronisation is added to these.

All of the `time.After` calls have been removed from the tests and the execution time has gone from 700ms to 7ms on my machine.

Two different notify channels were added to the `Logger` internals. These are only ever set in the tests, and no notification is done for any normal `Logger` use. In order to avoid spurious errors the `Close` method needed to wait for the `millRun` goroutine to complete, otherwise there was potential for the `millRunOnce` method to return errors. I temporarily added a panic to that method while testing. I use a wait group to wait for the goroutine to be complete. However due to the way the `sync.WaitGroup` works, you can't reuse them, so we have a pointer to a `sync.WaitGroup`.
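The shutdown approach described above can be sketched as follows. This is a minimal, hypothetical stand-in for lumberjack's Logger (the `miniLogger` type and its fields are illustrative, not the library's actual code), showing why a pointer to a `sync.WaitGroup` is used: the group is re-created each time the goroutine is started, because a `WaitGroup` cannot safely be reused once `Wait` has returned.

```go
package main

import (
	"fmt"
	"sync"
)

// miniLogger is a simplified sketch of a logger with a background
// "mill" goroutine that compresses/removes old log files on demand.
type miniLogger struct {
	mu     sync.Mutex
	millCh chan struct{}
	wg     *sync.WaitGroup
}

// mill lazily starts the millRun goroutine and signals it, without
// blocking if a request is already pending.
func (l *miniLogger) mill() {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.millCh == nil {
		l.millCh = make(chan struct{}, 1)
		l.wg = &sync.WaitGroup{} // fresh WaitGroup per goroutine lifetime
		l.wg.Add(1)
		go l.millRun()
	}
	select {
	case l.millCh <- struct{}{}:
	default: // a mill request is already queued
	}
}

// millRun drains mill requests until the channel is closed.
func (l *miniLogger) millRun() {
	defer l.wg.Done()
	for range l.millCh {
		// compress/remove old log files here
	}
}

// Close shuts the mill channel and waits for millRun to finish,
// so the goroutine cannot leak. Duplicate Close calls are no-ops.
func (l *miniLogger) Close() {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.millCh != nil {
		close(l.millCh) // lets millRun's range loop exit
		l.wg.Wait()     // block until the goroutine has finished
		l.millCh = nil
		l.wg = nil
	}
}

func main() {
	l := &miniLogger{}
	l.mill()
	l.Close()
	fmt.Println("millRun terminated:", l.millCh == nil)
}
```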

This patch does introduce a change in behaviour, as evidenced by the deleted test. Effectively I was left with two choices: allow the compression of existing old log files as part of writing to a new logger (which wouldn't rotate the files yet), or have a race in `TestCleanupExistingBackups`. That test failure was intermittent, due to the race. I decided on determinism, as the likelihood of having old uncompressed files around that needed to be compressed was small.
@howbazaar force-pushed the fix-mill-once-goroutine-leak branch from 756caf0 to 6dba581 on March 27, 2020 03:11
@howbazaar
Author

OK, after discussion with @4a6f656c, I realised that we really do need the rotate-on-resume functionality.

To make the test deterministic, I changed it to not write enough to trigger a rotate. This way we always just have one notification.

Contributor

@4a6f656c left a comment


Generally looks good - some minor issues and cleanup suggestions inline.

@@ -165,7 +175,16 @@ func (l *Logger) Write(p []byte) (n int, err error) {
 func (l *Logger) Close() error {
 	l.mu.Lock()
 	defer l.mu.Unlock()
-	return l.close()
+	if err := l.close(); err != nil {
+		return err
Contributor


This should not return here - if openExistingOrNew gets called, it will call l.mill and l.millCh will become non-nil. If the l.close call then fails (for example, Close fails on the underlying file descriptor), the millCh will not be closed.

Either the close error can be preserved and returned later, or you may be able to close the millCh first.

	case <-notify:
		// All good.
	case <-time.After(2 * time.Second):
		fmt.Println("logger notify not signalled")
Contributor


Logging test failures to stdout is less than ideal - use t.Fatal.

@@ -814,3 +815,13 @@ func exists(path string, t testing.TB) {
 	_, err := os.Stat(path)
 	assertUp(err == nil, t, 1, "expected file to exist, but got error from os.Stat: %v", err)
 }
 
+func waitForNotify(notify <-chan struct{}, t testing.TB) {
+	select {
Contributor


This should call t.Helper so that the calls to t are more useful.
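The two review suggestions (fail via the testing API rather than stdout, and call t.Helper) might combine into something like this sketch. To keep it runnable outside `go test`, a tiny `failer` interface stands in for `testing.TB`; in the real test file the parameter would be `testing.TB` and the first statement would be `t.Helper()` so failures are attributed to the calling line.

```go
package main

import (
	"fmt"
	"time"
)

// failer is a stand-in for the subset of testing.TB used here, so the
// sketch runs outside `go test`. In a real test: t.Helper(); t.Fatal(...).
type failer interface {
	Fatal(args ...interface{})
}

// waitForNotify blocks until notify is signalled, failing the test
// (instead of printing to stdout) if nothing arrives in time.
func waitForNotify(notify <-chan struct{}, t failer) {
	select {
	case <-notify:
		// All good.
	case <-time.After(2 * time.Second):
		t.Fatal("logger notify not signalled")
	}
}

// fakeT records whether Fatal was called, for demonstration purposes.
type fakeT struct{ failed bool }

func (f *fakeT) Fatal(args ...interface{}) { f.failed = true }

func main() {
	notify := make(chan struct{}, 1)
	notify <- struct{}{} // simulate the mill signalling completion
	ft := &fakeT{}
	waitForNotify(notify, ft)
	fmt.Println("failed:", ft.failed)
}
```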

	// It is safe to check the millCh here as we are inside the mutex lock.
	if l.millCh == nil {
		l.millCh = make(chan struct{}, 1)
		l.wg = &sync.WaitGroup{}
Contributor


Nit: A sync.WaitGroup seems like overkill here - you only ever have one waiter, so I'd probably just use a millShutdownCh channel instead.
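The reviewer's channel-based alternative could be sketched like this (a hypothetical, simplified logger; `shutdownLogger` and its fields are illustrative). Since `Close` is the only waiter, closing a dedicated `millShutdownCh` from the goroutine replaces the `*sync.WaitGroup` entirely, and a closed channel can also be safely received from more than once.

```go
package main

import "fmt"

// shutdownLogger sketches the millShutdownCh approach: one channel to
// request work, one closed by the goroutine itself on exit.
type shutdownLogger struct {
	millCh         chan struct{}
	millShutdownCh chan struct{}
}

// startMill launches the background goroutine that drains mill requests.
func (l *shutdownLogger) startMill() {
	l.millCh = make(chan struct{}, 1)
	l.millShutdownCh = make(chan struct{})
	go func() {
		defer close(l.millShutdownCh) // signals that the goroutine exited
		for range l.millCh {
			// compress/remove old log files here
		}
	}()
}

// Close stops the goroutine and waits for it to finish, without a WaitGroup.
func (l *shutdownLogger) Close() {
	if l.millCh != nil {
		close(l.millCh)
		<-l.millShutdownCh // wait for the goroutine to finish
		l.millCh = nil
	}
}

func main() {
	l := &shutdownLogger{}
	l.startMill()
	l.Close()
	fmt.Println("shut down cleanly:", l.millCh == nil)
}
```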

 	return err
 }
 
 // millRun runs in a goroutine to manage post-rotation compression and removal
 // of old log files.
-func (l *Logger) millRun() {
-	for _ = range l.millCh {
+func (l *Logger) millRun(ch <-chan struct{}) {
Contributor


Unless I'm missing something, there should be no need to pass the channel as an argument - we should be still able to use l.millCh directly.

	Filename:         filename,
	MaxBackups:       1,
	MaxSize:          100, // megabytes
	notifyCompressed: notify,
Contributor


This could initialise inline and call waitForNotify with l.notifyCompressed (which seems more readable/self-documenting).

-	// we need to wait a little bit since the files get compressed on a different
-	// goroutine.
-	<-time.After(10 * time.Millisecond)
+	waitForNotify(notify, t)
Contributor


i.e. waitForNotify(l.notifyCompressed, t)

	Filename:      filename,
	MaxSize:       10,
	MaxBackups:    1,
	notifyRemoved: notify,
Contributor


Same as earlier - consider initialising inline and calling waitForNotify(l.notifyRemoved, t) below.

@haykbaluyan

@howbazaar sorry for bothering you, but when can we expect this to be merged?

@hloeung

hloeung commented Apr 5, 2020

@howbazaar sorry for bothering you, but when can we expect this to be merged?

One more for @natefinch

@borud

borud commented Jan 14, 2022

Any progress on this?

@DellCurry

Will this PR (or #57 / #80, which address the same bug) be merged? I think this is a significant bug for many Go developers, since zap officially recommends this repo for rotating log files.

@SimonRichardson

This is now superseded by #211

8 participants