FindProvidersAsync misses a provider #750
Comments
The scope of the lock is too long unfortunately, and it may be held for an arbitrary time period. So we need a better solution.
On Sun, Oct 17, 2021, 15:31 Nikos Fotiou wrote:
TL;DR: There is a race condition that causes FindProvidersAsync to miss a provider.
I was experimenting with testground using a lossless network and TCP. In my experiment I am using 200 nodes, each providing a record. Then each node executes FindProvidersAsync for all 200 records. In many runs, 3-4 out of the 40000 requests were not successful, but this shouldn't happen.
Through trial and error I figured out that this is caused here:
https://github.com/libp2p/go-libp2p-kad-dht/blob/5a8437e5aece37e5b4912f8e00dceb1f68f60cba/routing.go#L533
ps.TryAdd(prov.ID) increases the size of ps, and this is a sufficient condition for this function to return true:
https://github.com/libp2p/go-libp2p-kad-dht/blob/5a8437e5aece37e5b4912f8e00dceb1f68f60cba/routing.go#L559
However, this causes followUpCtx to be cancelled, hence the code jumps to
https://github.com/libp2p/go-libp2p-kad-dht/blob/5a8437e5aece37e5b4912f8e00dceb1f68f60cba/routing.go#L537
If this happens fast enough, the provider may not be returned, since the send happens on the previous line:
https://github.com/libp2p/go-libp2p-kad-dht/blob/5a8437e5aece37e5b4912f8e00dceb1f68f60cba/routing.go#L536
I (probably) solved this problem by using a lock. So lines 533-541 in my code look like this:

peersetMutex.Lock()
if ps.TryAdd(prov.ID) {
    logger.Debugf("using provider: %s", prov)
    select {
    case peerOut <- *prov:
        sentprov = 1
    case <-ctx.Done():
        logger.Debug("context timed out sending more providers")
        return nil, ctx.Err()
    }
}
peersetMutex.Unlock()
And then I modified the function in line 559 as follows:
func() bool {
peersetMutex.Lock()
    output := !findAll && ps.Size() >= count
peersetMutex.Unlock()
return output
}
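
To make the race easier to see in isolation, here is a minimal standalone sketch of the pattern described above (hypothetical names, not the dht code): the stop condition cancels the context as soon as the set reaches the target size, and that cancellation races with the pending send, so the value is sometimes dropped.

package main

import (
    "context"
    "fmt"
)

func main() {
    const want = 1
    out := make(chan int) // unbuffered, like an output channel with no buffer
    seen := make(map[int]struct{})
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    go func() {
        defer close(out)
        for _, p := range []int{42} {
            if _, dup := seen[p]; dup {
                continue
            }
            seen[p] = struct{}{}
            // The stop condition only looks at the set size, so it can fire
            // (and cancel ctx) before the matching send below has happened.
            if len(seen) >= want {
                cancel()
            }
            select {
            case out <- p:
            case <-ctx.Done():
                // If this branch wins the race, the value is silently dropped.
                return
            }
        }
    }()

    n := 0
    for range out {
        n++
    }
    fmt.Printf("received %d of %d values\n", n, want) // sometimes prints 0 of 1
}

Because both select cases can be ready at the same time, Go picks between them pseudo-randomly; that is the same failure mode as the 3-4 misses in 40000 requests reported above, just at a much higher rate.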
In order to determine whether the bug is there, I used a variable that counted the records sent through peerOut. So the validation code looked similar to the following:

if ps.TryAdd(prov.ID) {
    logger.Debugf("using provider: %s", prov)
    select {
    case peerOut <- *prov:
        returnedItems++
    case <-ctx.Done():
        logger.Debug("context timed out sending more providers")
        if returnedItems < ps.Size() {
            logger.Debug("!!!Something strange happened!!!")
        }
        return nil, ctx.Err()
    }
}

Maybe try something like that, i.e., add somewhere a condition that checks how many providers we have sent back before closing the channel.
@aschmahmann thoughts on the matter?
@vyzo without looking too deeply at this, I'd say that a context cancellation race could occur here. There's a tradeoff around whether the channel is allowed to back up and hold a goroutine open due to backpressure, or whether there's a maximal amount of waiting time (for example, using locking). This doesn't seem like a high-priority item, but a fix that doesn't involve an excessive amount of locking seems reasonable. One fix that comes to mind is adding a condition in the cancel case that checks why it was cancelled and, if it was due to a Stop condition, deciding when to abort. We'd have to decide what the abort semantics are, though.
WDYT?
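
For illustration only, here is one possible shape of that "check why it was cancelled" idea. The helper name sendProvider, the callerCtx/followUpCtx parameter split, and the error handling are assumptions made for this sketch, not existing dht code; it assumes followUpCtx is derived from callerCtx and is the context the Stop condition cancels.

// Sketch only: hypothetical helper, not part of go-libp2p-kad-dht.
// Assumes "context" and the peer package (peer.AddrInfo) are imported.
func sendProvider(callerCtx, followUpCtx context.Context, out chan<- peer.AddrInfo, p peer.AddrInfo) error {
    select {
    case out <- p:
        return nil
    case <-followUpCtx.Done():
        if callerCtx.Err() != nil {
            // The caller itself went away; drop the provider.
            return callerCtx.Err()
        }
        // followUpCtx was cancelled by the Stop condition, not the caller.
        // This provider was already counted in the peer set, so deliver it
        // before shutting down, bounded by the caller's context.
        select {
        case out <- p:
            return nil
        case <-callerCtx.Done():
            return callerCtx.Err()
        }
    }
}

With something like this, the abort decision reduces to how long the final send is allowed to wait, which is exactly the semantics question raised above.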
I ran with line 533 (in 5a8437e) modified as follows:

if ps.TryAdd(prov.ID) {
    peerOut <- *prov
}

which removes the race condition, and then re-ran my experiments. @aschmahmann WRT your point above: there is the context included in the FindProvidersAsync call (line 108 in 5a8437e), and, at least in our experiments, the race condition appears approximately once every 1000 requests.
I don't think we should hang the goroutine -- immediately sending down the channel seems like the sanest option.
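
One way to square "never drop a value" with "don't hang the goroutine" is to give the output channel enough buffer for the values we intend to deliver. The sketch below is an illustration of that general pattern under that assumption, not the project's fix: with a buffer sized to the requested count, the producer can complete its sends even if the consumer is slow, whereas an unconditional send on an unbuffered channel would block until the consumer reads.

package main

import "fmt"

func main() {
    const count = 3
    // Buffer sized to the number of values we intend to deliver, so the
    // producing goroutine can finish its sends without waiting on the reader.
    out := make(chan int, count)

    go func() {
        defer close(out)
        for p := 1; p <= count; p++ {
            out <- p // never drops a value, and never blocks within the buffer
        }
    }()

    for p := range out {
        fmt.Println("got provider", p)
    }
}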