Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] targetcli commands hanging #178

Open
karibertils opened this issue Dec 28, 2020 · 8 comments
Open

[BUG] targetcli commands hanging #178

karibertils opened this issue Dec 28, 2020 · 8 comments

Comments

@karibertils
Copy link

I have 80 targets, each with 2 block devices using zfs clones from snapshot.

The issue starts usually after 3-7 days. When it starts, various commands hang indefinently.

Right now I'm running /iscsi> delete iqn.2020-01.is.gz:192-168-101-163 and it's frozen indefinently.

logs show

[361892.210559] INFO: task targetcli:3414923 blocked for more than 362 seconds.
[361892.219399]       Tainted: P           O      5.4.78-2-pve #1                          
[361892.226796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[361892.236489] targetcli       D    0 3414923 3406117 0x00004000                          
[361892.243839] Call Trace:                    
[361892.247510]  __schedule+0x2e6/0x6f0                                                    
[361892.252312]  ? signal_wake_up_state+0x19/0x30
[361892.258079]  schedule+0x33/0xa0                                                        
[361892.262472]  schedule_timeout+0x205/0x330  
[361892.267868]  wait_for_completion+0xb7/0x140                                            
[361892.273442]  ? wake_up_q+0x80/0x80         
[361892.278144]  iscsit_reset_np_thread+0xb4/0xe0 [iscsi_target_mod]                       
[361892.285765]  iscsit_tpg_del_network_portal+0xed/0x190 [iscsi_target_mod]
[361892.294186]  lio_target_call_delnpfromtpg+0x30/0x90 [iscsi_target_mod]                 
[361892.302411]  target_fabric_np_base_release+0x31/0x40 [target_core_mod]
[361892.310633]  config_item_release+0x62/0xd0                                             
[361892.316042]  config_item_put+0x29/0x30     
[361892.321044]  configfs_rmdir+0x1bd/0x300                                                
[361892.326155]  vfs_rmdir+0x81/0x180          
[361892.330658]  do_rmdir+0x1b6/0x210                                                      
[361892.335137]  __x64_sys_rmdir+0x17/0x20     
[361892.340166]  do_syscall_64+0x57/0x190                                                  
[361892.345599]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[361892.352031] RIP: 0033:0x7f77999cb027                                                   
[361892.356797] Code: Bad RIP value.           
[361892.361159] RSP: 002b:00007ffd828179c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000054     
[361892.370411] RAX: ffffffffffffffda RBX: 00000000ffffff9c RCX: 00007f77999cb027
[361892.379189] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f7796820e50          
[361892.387956] RBP: 0000000001952f30 R08: 0000000000000000 R09: 0000000000761090
[361892.396716] R10: 000000000076108f R11: 0000000000000246 R12: 00007f77992ed408          
[361892.405426] R13: 00007f7797995d60 R14: 00000000005ff420 R15: 0000000001952f30

Running Proxmox 6.3 with kernel 5.4.78-2-pve. Using targetcli-fb version 2.1.48-2

@maurizio-lombardi
Copy link
Collaborator

Hello, thanks for the report.
This looks like a kernel bug in the target driver, I will look at it soon

@maurizio-lombardi
Copy link
Collaborator

@karibertils Can you please post a link to the entire dmesg output?

@karibertils
Copy link
Author

@maurizio-lombardi Here is the log for the whole day when issue occurred last time. kernel.2020.12.28.log

There were some PC's trying to attach non existing targets at the time. Don't know if that could be related to the issue ?

I have been rebooting every other day since to avoid the issue. I can skip reboots to gather new logs if that helps. Maybe enable some debug mode also ?

@karibertils
Copy link
Author

karibertils commented Mar 4, 2021

No news on this bug ?

I have tried also Ubuntu Server 20.10. Using kernel 5.8.0-44-generic and targetcli-fb 2.1.53 and same thing happens there also.

@maurizio-lombardi
Copy link
Collaborator

@karibertils Hi, I am trying to reproduce it.

"Using kernel 5.8.0-44-generic and targetcli-fb 2.1.53 and same thing happens there also."

I am sure the bug is still present in the latest upstream kernel.
Question: it happens when a number of initiators try to connect to a non-existing target and in the meanwhile you delete another target via targetcli? Correct?

@karibertils
Copy link
Author

@maurizio-lombardi No I believe it happens regardless of initiators connecting to non-existing targets or not.

When the issue starts every targetcli command hangs. Doing targetcli ls or even just targetcli to open the cli interface hangs.

I am network booting 80 PC's. All of them have 1 target which has 2 LUN's. LUN 0 is almost never changed. But LUN 1 is removed and re-added every time the PC's boot.

example:

targetcli '/iscsi/iqn.2020-01.is.gz:192-168-101-152/tpg1/luns/ delete lun=1'
targetcli '/backstores/block/ delete games-192-168-101-152'
zfs destroy -R speedy/games/192-168-101-152
zfs destroy -R speedy/games/master@192-168-101-152
zfs snapshot speedy/games/master@192-168-101-152
zfs clone speedy/games/master@192-168-101-152 speedy/games/192-168-101-152
targetcli '/backstores/block/ create games-192-168-101-152 /dev/zvol/speedy/games/192-168-101-152'
targetcli '/iscsi/iqn.2020-01.is.gz:192-168-101-152/tpg1/luns create lun=1 /backstores/block/games-192-168-101-152'

We did previously remove and re-add LUN 1 once every 24 hours. And the issue happened with similar frequency then. It can happen after running for 1-9 days. Usually it takes 3+ days though.

@maurizio-lombardi
Copy link
Collaborator

Hmm, I asked because of the following backtrace

Dec 28 17:17:16 rocky kernel: [361650.549542] INFO: task targetcli:3414923 blocked for more than 120 seconds.
Dec 28 17:17:16 rocky kernel: [361650.558420] Tainted: P O 5.4.78-2-pve #1
Dec 28 17:17:16 rocky kernel: [361650.566177] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 28 17:17:16 rocky kernel: [361650.575912] targetcli D 0 3414923 3406117 0x00004000
Dec 28 17:17:16 rocky kernel: [361650.583247] Call Trace:
Dec 28 17:17:16 rocky kernel: [361650.587097] __schedule+0x2e6/0x6f0
Dec 28 17:17:16 rocky kernel: [361650.591884] ? signal_wake_up_state+0x19/0x30
Dec 28 17:17:16 rocky kernel: [361650.597637] schedule+0x33/0xa0
Dec 28 17:17:16 rocky kernel: [361650.602027] schedule_timeout+0x205/0x330
Dec 28 17:17:16 rocky kernel: [361650.607478] wait_for_completion+0xb7/0x140
Dec 28 17:17:16 rocky kernel: [361650.613097] ? wake_up_q+0x80/0x80
Dec 28 17:17:16 rocky kernel: [361650.617818] iscsit_reset_np_thread+0xb4/0xe0 [iscsi_target_mod]
Dec 28 17:17:16 rocky kernel: [361650.625519] iscsit_tpg_del_network_portal+0xed/0x190 [iscsi_target_mod]
Dec 28 17:17:16 rocky kernel: [361650.634118] lio_target_call_delnpfromtpg+0x30/0x90 [iscsi_target_mod]

that makes me think that there is a race condition somewhere, causing a problem with the refcounting
against the tpg_np structure so the iscsit_reset_np_thread() kernel function gets stuck in wait_for_completion()

Note that lio_target_call_delnpfromtpg() is called when you execute a command like "targetcli iscsi/ delete iqn...."

If you have new dmesg logs please post them here, they might help

@karibertils
Copy link
Author

Ok that sounds plausible. But we did previously try few times to make sure there were no connections to non-existant targets. And the issue did persist. But I guess there have always been at least a few attempts.

We only delete targets few times a day. But the boot script deletes&readds the LUN's every boot so there's more stress on that path.

Here is a recent kernel log.
targetcli-hanging.log

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants