[BUG] targetcli commands hanging #178

karibertils · 2020-12-28T17:30:28Z

I have 80 targets, each with 2 block devices using zfs clones from snapshot.

The issue starts usually after 3-7 days. When it starts, various commands hang indefinently.

Right now I'm running /iscsi> delete iqn.2020-01.is.gz:192-168-101-163 and it's frozen indefinently.

logs show

[361892.210559] INFO: task targetcli:3414923 blocked for more than 362 seconds.
[361892.219399]       Tainted: P           O      5.4.78-2-pve #1                          
[361892.226796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[361892.236489] targetcli       D    0 3414923 3406117 0x00004000                          
[361892.243839] Call Trace:                    
[361892.247510]  __schedule+0x2e6/0x6f0                                                    
[361892.252312]  ? signal_wake_up_state+0x19/0x30
[361892.258079]  schedule+0x33/0xa0                                                        
[361892.262472]  schedule_timeout+0x205/0x330  
[361892.267868]  wait_for_completion+0xb7/0x140                                            
[361892.273442]  ? wake_up_q+0x80/0x80         
[361892.278144]  iscsit_reset_np_thread+0xb4/0xe0 [iscsi_target_mod]                       
[361892.285765]  iscsit_tpg_del_network_portal+0xed/0x190 [iscsi_target_mod]
[361892.294186]  lio_target_call_delnpfromtpg+0x30/0x90 [iscsi_target_mod]                 
[361892.302411]  target_fabric_np_base_release+0x31/0x40 [target_core_mod]
[361892.310633]  config_item_release+0x62/0xd0                                             
[361892.316042]  config_item_put+0x29/0x30     
[361892.321044]  configfs_rmdir+0x1bd/0x300                                                
[361892.326155]  vfs_rmdir+0x81/0x180          
[361892.330658]  do_rmdir+0x1b6/0x210                                                      
[361892.335137]  __x64_sys_rmdir+0x17/0x20     
[361892.340166]  do_syscall_64+0x57/0x190                                                  
[361892.345599]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[361892.352031] RIP: 0033:0x7f77999cb027                                                   
[361892.356797] Code: Bad RIP value.           
[361892.361159] RSP: 002b:00007ffd828179c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000054     
[361892.370411] RAX: ffffffffffffffda RBX: 00000000ffffff9c RCX: 00007f77999cb027
[361892.379189] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f7796820e50          
[361892.387956] RBP: 0000000001952f30 R08: 0000000000000000 R09: 0000000000761090
[361892.396716] R10: 000000000076108f R11: 0000000000000246 R12: 00007f77992ed408          
[361892.405426] R13: 00007f7797995d60 R14: 00000000005ff420 R15: 0000000001952f30

Running Proxmox 6.3 with kernel 5.4.78-2-pve. Using targetcli-fb version 2.1.48-2

The text was updated successfully, but these errors were encountered:

maurizio-lombardi · 2021-01-02T10:57:49Z

Hello, thanks for the report.
This looks like a kernel bug in the target driver, I will look at it soon

maurizio-lombardi · 2021-01-04T12:58:38Z

@karibertils Can you please post a link to the entire dmesg output?

karibertils · 2021-01-04T13:48:06Z

@maurizio-lombardi Here is the log for the whole day when issue occurred last time. kernel.2020.12.28.log

There were some PC's trying to attach non existing targets at the time. Don't know if that could be related to the issue ?

I have been rebooting every other day since to avoid the issue. I can skip reboots to gather new logs if that helps. Maybe enable some debug mode also ?

karibertils · 2021-03-04T17:40:12Z

No news on this bug ?

I have tried also Ubuntu Server 20.10. Using kernel 5.8.0-44-generic and targetcli-fb 2.1.53 and same thing happens there also.

maurizio-lombardi · 2021-03-05T14:54:09Z

@karibertils Hi, I am trying to reproduce it.

"Using kernel 5.8.0-44-generic and targetcli-fb 2.1.53 and same thing happens there also."

I am sure the bug is still present in the latest upstream kernel.
Question: it happens when a number of initiators try to connect to a non-existing target and in the meanwhile you delete another target via targetcli? Correct?

karibertils · 2021-03-05T15:25:06Z

@maurizio-lombardi No I believe it happens regardless of initiators connecting to non-existing targets or not.

When the issue starts every targetcli command hangs. Doing targetcli ls or even just targetcli to open the cli interface hangs.

I am network booting 80 PC's. All of them have 1 target which has 2 LUN's. LUN 0 is almost never changed. But LUN 1 is removed and re-added every time the PC's boot.

example:

targetcli '/iscsi/iqn.2020-01.is.gz:192-168-101-152/tpg1/luns/ delete lun=1'
targetcli '/backstores/block/ delete games-192-168-101-152'
zfs destroy -R speedy/games/192-168-101-152
zfs destroy -R speedy/games/master@192-168-101-152
zfs snapshot speedy/games/master@192-168-101-152
zfs clone speedy/games/master@192-168-101-152 speedy/games/192-168-101-152
targetcli '/backstores/block/ create games-192-168-101-152 /dev/zvol/speedy/games/192-168-101-152'
targetcli '/iscsi/iqn.2020-01.is.gz:192-168-101-152/tpg1/luns create lun=1 /backstores/block/games-192-168-101-152'

We did previously remove and re-add LUN 1 once every 24 hours. And the issue happened with similar frequency then. It can happen after running for 1-9 days. Usually it takes 3+ days though.

maurizio-lombardi · 2021-03-05T15:35:35Z

Hmm, I asked because of the following backtrace

Dec 28 17:17:16 rocky kernel: [361650.549542] INFO: task targetcli:3414923 blocked for more than 120 seconds.
Dec 28 17:17:16 rocky kernel: [361650.558420] Tainted: P O 5.4.78-2-pve #1
Dec 28 17:17:16 rocky kernel: [361650.566177] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 28 17:17:16 rocky kernel: [361650.575912] targetcli D 0 3414923 3406117 0x00004000
Dec 28 17:17:16 rocky kernel: [361650.583247] Call Trace:
Dec 28 17:17:16 rocky kernel: [361650.587097] __schedule+0x2e6/0x6f0
Dec 28 17:17:16 rocky kernel: [361650.591884] ? signal_wake_up_state+0x19/0x30
Dec 28 17:17:16 rocky kernel: [361650.597637] schedule+0x33/0xa0
Dec 28 17:17:16 rocky kernel: [361650.602027] schedule_timeout+0x205/0x330
Dec 28 17:17:16 rocky kernel: [361650.607478] wait_for_completion+0xb7/0x140
Dec 28 17:17:16 rocky kernel: [361650.613097] ? wake_up_q+0x80/0x80
Dec 28 17:17:16 rocky kernel: [361650.617818] iscsit_reset_np_thread+0xb4/0xe0 [iscsi_target_mod]
Dec 28 17:17:16 rocky kernel: [361650.625519] iscsit_tpg_del_network_portal+0xed/0x190 [iscsi_target_mod]
Dec 28 17:17:16 rocky kernel: [361650.634118] lio_target_call_delnpfromtpg+0x30/0x90 [iscsi_target_mod]

that makes me think that there is a race condition somewhere, causing a problem with the refcounting
against the tpg_np structure so the iscsit_reset_np_thread() kernel function gets stuck in wait_for_completion()

Note that lio_target_call_delnpfromtpg() is called when you execute a command like "targetcli iscsi/ delete iqn...."

If you have new dmesg logs please post them here, they might help

karibertils · 2021-03-05T16:06:33Z

Ok that sounds plausible. But we did previously try few times to make sure there were no connections to non-existant targets. And the issue did persist. But I guess there have always been at least a few attempts.

We only delete targets few times a day. But the boot script deletes&readds the LUN's every boot so there's more stress on that path.

Here is a recent kernel log.
targetcli-hanging.log

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] targetcli commands hanging #178

[BUG] targetcli commands hanging #178

karibertils commented Dec 28, 2020

maurizio-lombardi commented Jan 2, 2021

maurizio-lombardi commented Jan 4, 2021

karibertils commented Jan 4, 2021

karibertils commented Mar 4, 2021 •

edited

Loading

maurizio-lombardi commented Mar 5, 2021

karibertils commented Mar 5, 2021

maurizio-lombardi commented Mar 5, 2021

karibertils commented Mar 5, 2021

[BUG] targetcli commands hanging #178

[BUG] targetcli commands hanging #178

Comments

karibertils commented Dec 28, 2020

maurizio-lombardi commented Jan 2, 2021

maurizio-lombardi commented Jan 4, 2021

karibertils commented Jan 4, 2021

karibertils commented Mar 4, 2021 • edited Loading

maurizio-lombardi commented Mar 5, 2021

karibertils commented Mar 5, 2021

maurizio-lombardi commented Mar 5, 2021

karibertils commented Mar 5, 2021

karibertils commented Mar 4, 2021 •

edited

Loading