-
Notifications
You must be signed in to change notification settings - Fork 620
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
connection shutdown, channel close data race #196
base: master
Are you sure you want to change the base?
Conversation
ba1648b
to
bc44d43
Compare
Please notice too that I had to remove the locking around Additionally, since adding |
d785718
to
d465039
Compare
3faff45
to
7bac62b
Compare
ea536fa
to
f997802
Compare
dfd04b5
to
f997802
Compare
I am sorry, it took me a while to figure it out there was a deadlock in the use of pipes with go1.1. The last commit should work around it. |
Looks reasonable at first approximation, thank you. |
Note that this PR also seems to include #197. |
Please rebase this against master. |
@@ -141,7 +141,11 @@ func (me *Channel) open() error { | |||
// Performs a request/response call for when the message is not NoWait and is | |||
// specified as Synchronous. | |||
func (me *Channel) call(req message, res ...message) error { | |||
if err := me.send(me, req); err != nil { | |||
me.m.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a separate mutex for sending stuff, me.sendM
. I think we should use it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original code is very misleading.
Please note that m.send
starts as sendOpen
and then switches to sendClosed
:
https://github.com/imkira/amqp/blob/0563adcdb2a14308873fce26b14d3637dc2ac685/channel.go#L85
https://github.com/imkira/amqp/blob/0563adcdb2a14308873fce26b14d3637dc2ac685/channel.go#L104
Please, also note that each of those 2 functions use the mutex you specified for actually sending data.
What I am doing here is to protect a race condition that could happen if we used me.send directly.
That would collide with the following modification:
https://github.com/imkira/amqp/blob/0563adcdb2a14308873fce26b14d3637dc2ac685/channel.go#L104
connection.go
Outdated
@@ -578,6 +577,8 @@ func (me *Connection) openChannel() (*Channel, error) { | |||
// this connection. | |||
func (me *Connection) closeChannel(ch *Channel, e *Error) { | |||
ch.shutdown(e) | |||
me.m.Lock() | |||
defer me.m.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't the intent be clearer if we explicitly unlocked after releaseChannel
? or are we after a particular failure that should be guarded against with defer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't see any problem with that.
I will change to that then.
Yes. I left a comment regarding that. Please check: |
0563adc
to
85f0c79
Compare
Just committed one of the changes you asked and the rebase against master. |
Note that #197, when merged, broke CI in ways that I cannot reproduce in other environments. |
if err := me.call( | ||
&confirmSelect{Nowait: noWait}, | ||
&confirmSelectOk{}, | ||
); err != nil { | ||
return err | ||
} | ||
|
||
me.m.Lock() | ||
me.confirming = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bet advised - the me.confirming
is also accessed:
https://github.com/streadway/amqp/blob/master/channel.go#L276
https://github.com/streadway/amqp/blob/master/channel.go#L285
Without any sort of mutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing out. I fixed it in c6a5ab1
85f0c79
to
eeb226c
Compare
Please note that travis is failing because it is detecting a data race in connection.go that was not fixed by #210 against which I handled the merged conflict by using that version (therefore removing my original code). https://api.travis-ci.org/jobs/179778933/log.txt?deansi=true My original code, I believe fixes this by protecting the whole shutdown. |
@imkira Any advice? |
ff6f6b5
to
93127f4
Compare
shared_test.go
Outdated
|
||
func (me *logIO) logf(format string, args ...interface{}) { | ||
me.m.Lock() | ||
me.t.Logf(format, args...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still fails - according to https://api.travis-ci.org/jobs/179844976/log.txt?deansi=true. On 1,6, but still.
This one is quite illusive, isn't it? 😆
42635f8
to
114eca9
Compare
client test code was opening connections in separate goroutines but not waiting their termination. This is particularly dangerous if any calls are made (asynchronously) to testing.*.Log* after the test function is over. Not only the behavior is undefined, it was causing a data race in go1.6 which is fixed in go1.7 by the following commit: golang/go@5c83e65
114eca9
to
524f95a
Compare
@DmitriyMV OK, so I ended up removing all #210 code which was just a subset of the problems I addressed originally with this PR. The "illusiveness" of the bugs in here are, in my opinion, a reflection of the lack of I added 524f95a to try to fix some of these problems. With this commit I am synchronizing
Please let me know what you think. By the way, I am getting the following error sometimes on travis https://api.travis-ci.org/jobs/179929218/log.txt?deansi=true:
Do you have any idea why this may be happening? |
Sorry, no. I spent last two days, trying to figure out, why this problem occurs (at random times too). My current thought is that we get func (me *Channel) dispatch(msg message) {
switch m := msg.(type) {
case *channelClose:
me.connection.closeChannel(me, newError(m.ReplyCode, m.ReplyText))
me.send(me, &channelCloseOk{})
... I'm not familiar with AMQP protocol tho. And because of that I'm left to guesses. |
There is no |
@michaelklishin what if we are immediately trying to open channel after we had closed it, with the same channel id? Can you describe a EDIT: Clarification. |
@DmitriyMV I thought the reusing of the channel could be a good explanation for what is happening, but when you look at The error message received from the server mentions
It implies a "invalid sequence of frames". What if we are not waiting enough time between creating a channel and publishing to it? I looked at the code but couldn't find a reason for it, though I found this case block maybe "too inclusive": Lines 298 to 299 in 1b88538
|
But the other tests do not fail, so either they don't generate enough Open/Close channel sequences or it has something to do with situation where channel got closed by the RabbitMQ due to error. @imkira I'm open to ideas of how to create a test which will show if we actually send wrong sequence of frames, without provoking an error from server in the first place (like |
Hey folks, I'm posting this on behalf of the core team. As you have noticed, this client hasn't seen a lot of activity recently. Because this client has a long tradition of "no breaking public API changes", certain We would like to thank @streadway Team RabbitMQ has adopted a "hard fork" of this client What do we mean by "hard fork" and what does it mean for you? The entire history of the project What does change is that this new fork will accept reasonable breaking API changes according If your PR hasn't been accepted or reviewed, you are welcome to re-submit it for Note that it is a high season for holidays in some parts of the world, so we may be slower Thank you for using RabbitMQ and contributing to this client. On behalf of the RabbitMQ core team, |
I am running tests with
go test -v -race -cpu=1,2,4
and, I am not sure but, I think I found the following data race (commitb4f3ceab0337f013208d31348b578d83c0064744
):It appears that
Connection.shutdown
is accessingme.channels
to close them (goroutine 80) without any kind of mutex locking, while concurrently getting aConnection.releaseChannel
(viachannel.Close
) which accessesme.channels
too (using the connection lock though).After trying to fix this by protecting
Connection.shutdown
I noticed I started having another race condition due toChannel.call
accessingme.send
whileChannel.shutdown
changes it tosendClosed
.Seriously, I don't know if this the correct fix.
I wrote a simple test that causes at least one of these issues to happen, and enabled
-race
for travis builds.There is another data race related to IO handling in
shared_test.go
or whatever.I didn't understand exactly what the problem was, but it looked like a test code-only problem, and since it is out of the scope of this PR I leave it open for you to take a look first.