This post is about a USB SATA adapter, that seemed to work, but actually didn’t in the end. I hope that this is helpful for others, that observe the same or similar behavior of their adapters.

Symptoms

I was using my USB SATA adapter as always. I used it quiet a while already, I guess some years. Actually I looked it up - I bought it in 2020. If I remember correctly, I had a bit of hassle to get it working reliably. I’m not exactly sure about the reasons, but in hindsight this might have been a sign already. I ordered a USB 3 SATA adapter for a SSD drive I already had. The USB SATA adapter should be USB 3 for speed reasons. It had a USB C plug, so I order a USB C to A adapter as well. I made the experience, that this combo worked best when I directly plugged it into my laptop. When I plugged it into my USB 3 hub (a Dell D3100 docking station), the SATA adapter worked, but it was very slow. At that time, I thought: well, I’ve often read, that sometimes hubs and devices are somehow incompatible and didn’t bother further. I had a working solution - directly plug it in without a extra hub.

But recently, it stopped working somehow: The SATA adapter seemed to stop responding. My use case for this external SSD is, that I have here my storage for my virtual Windows environment, that I need from time to time. I’m used to it, that this virtual Windows is very slow - but since I switched from a USB 2.0 HDD to this USB 3 SATA SSD solution, it was actually usable. But now, the VM seemed to freeze. And what’s worse: After a while, I noticed that my linux system hat very high load, greater then 20 or so. And eventually, the my linux system also started to freeze. Oh dear. Nothing is responding anymore. Only way out here seemed to be to do a hard power cycle.

Now I knew, I had a serious problem. But I didn’t knew, what exactly the problem was. It seemed to be reproducible: It happened, when I tried to access a specific file in Windows inside the VM. A file I just downloaded…

Ok, looking at the system logs to get some clue.

Syslog excerpts

After a reboot, I inspected /var/log/syslog for any clues, in the hope the system was still able to write something down. And, well, maybe I found a kernel bug?

Here’s the syslog in full detail - in case someone searches for the same errors:

2023-05-20T20:40:56.387094+02:00 adangel kernel: [138743.648098] usb 2-1: reset SuperSpeed USB device number 28 using xhci_hcd

So, this reset message I’ve seen often with this adapter. And this happens more often, if I plug it in via a hub. This is the reason, why the speed is slower when the adapter is used together with a hub. But the root cause? I don’t know.

2023-05-20T20:40:56.407029+02:00 adangel kernel: [138743.669124] sd 9:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s
2023-05-20T20:40:56.407063+02:00 adangel kernel: [138743.669139] sd 9:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 1f 8c fe b0 00 01 00 00
2023-05-20T20:40:56.407346+02:00 adangel kernel: [138743.669146] blk_print_req_error: 89 callbacks suppressed
2023-05-20T20:40:56.407694+02:00 adangel kernel: [138743.669151] I/O error, dev sdb, sector 529333936 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 2
2023-05-20T20:41:27.091131+02:00 adangel kernel: [138774.351993] usb 2-1: reset SuperSpeed USB device number 28 using xhci_hcd
2023-05-20T20:41:27.115087+02:00 adangel kernel: [138774.377122] sd 9:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s
2023-05-20T20:41:27.115153+02:00 adangel kernel: [138774.377155] sd 9:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 1f 8c ff b0 00 01 00 00
2023-05-20T20:41:27.115165+02:00 adangel kernel: [138774.377170] I/O error, dev sdb, sector 529334192 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 2
2023-05-20T20:41:59.863160+02:00 adangel kernel: [138807.123856] usb 2-1: reset SuperSpeed USB device number 28 using xhci_hcd
2023-05-20T20:41:59.883064+02:00 adangel kernel: [138807.144590] sd 9:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=31s
2023-05-20T20:41:59.883147+02:00 adangel kernel: [138807.144613] sd 9:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 1f 8d 00 b0 00 00 80 00
2023-05-20T20:41:59.883155+02:00 adangel kernel: [138807.144622] I/O error, dev sdb, sector 529334448 op 0x0:(READ) flags 0x80700 phys_seg 16 prio class 2
2023-05-20T20:42:30.583123+02:00 adangel kernel: [138837.843502] usb 2-1: reset SuperSpeed USB device number 28 using xhci_hcd
2023-05-20T20:42:30.604155+02:00 adangel kernel: [138837.865031] sd 9:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=30s
2023-05-20T20:42:30.604207+02:00 adangel kernel: [138837.865062] sd 9:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 1f 8d 05 30 00 01 00 00
2023-05-20T20:42:30.604218+02:00 adangel kernel: [138837.865075] I/O error, dev sdb, sector 529335600 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 2
2023-05-20T20:43:01.298975+02:00 adangel kernel: [138868.558915] usb 2-1: reset SuperSpeed USB device number 28 using xhci_hcd
2023-05-20T20:43:01.298975+02:00 adangel kernel: [138868.558915] usb 2-1: reset SuperSpeed USB device number 28 using xhci_hcd

Here are some I/O errors. The USB SATA adapter (“sdb”) is telling, that it can’t read a specific sector. I’m using a a SSD drive, so that’s very strange.

Later in the logs, we see some stack traces and some hung tasks:

2023-05-20T20:44:29.203531+02:00 adangel kernel: [138956.462041] INFO: task kcompactd0:44 blocked for more than 120 seconds.
2023-05-20T20:44:29.203553+02:00 adangel kernel: [138956.462051]       Tainted: G          I        6.1.0-9-amd64 #1 Debian 6.1.27-1
2023-05-20T20:44:29.203556+02:00 adangel kernel: [138956.462055] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2023-05-20T20:44:29.203557+02:00 adangel kernel: [138956.462058] task:kcompactd0      state:D stack:0     pid:44    ppid:2      flags:0x00004000
2023-05-20T20:44:29.203558+02:00 adangel kernel: [138956.462064] Call Trace:
2023-05-20T20:44:29.203559+02:00 adangel kernel: [138956.462067]  <TASK>
2023-05-20T20:44:29.203559+02:00 adangel kernel: [138956.462072]  __schedule+0x351/0xa20
2023-05-20T20:44:29.203977+02:00 adangel kernel: [138956.462082]  schedule+0x5d/0xe0
2023-05-20T20:44:29.203985+02:00 adangel kernel: [138956.462086]  io_schedule+0x42/0x70
2023-05-20T20:44:29.203986+02:00 adangel kernel: [138956.462089]  folio_wait_bit_common+0x13d/0x350
2023-05-20T20:44:29.203987+02:00 adangel kernel: [138956.462096]  ? filemap_alloc_folio+0x100/0x100
2023-05-20T20:44:29.204100+02:00 adangel kernel: [138956.462101]  migrate_pages+0xdf1/0x1480
2023-05-20T20:44:29.204103+02:00 adangel kernel: [138956.462106]  ? isolate_freepages_block+0x410/0x410
2023-05-20T20:44:29.204105+02:00 adangel kernel: [138956.462111]  ? release_freepages+0xc0/0xc0
2023-05-20T20:44:29.204105+02:00 adangel kernel: [138956.462117]  ? do_pages_stat+0x360/0x360
2023-05-20T20:44:29.204215+02:00 adangel kernel: [138956.462123]  compact_zone+0x97e/0xdb0
2023-05-20T20:44:29.204218+02:00 adangel kernel: [138956.462130]  ? finish_task_switch.isra.0+0x9b/0x300
2023-05-20T20:44:29.204219+02:00 adangel kernel: [138956.462136]  proactive_compact_node+0x87/0xc0
2023-05-20T20:44:29.204220+02:00 adangel kernel: [138956.462144]  kcompactd+0x34c/0x420
2023-05-20T20:44:29.204221+02:00 adangel kernel: [138956.462149]  ? dequeue_task_stop+0x70/0x70
2023-05-20T20:44:29.204222+02:00 adangel kernel: [138956.462154]  ? kcompactd_do_work+0x2a0/0x2a0
2023-05-20T20:44:29.204223+02:00 adangel kernel: [138956.462158]  kthread+0xe6/0x110
2023-05-20T20:44:29.204224+02:00 adangel kernel: [138956.462162]  ? kthread_complete_and_exit+0x20/0x20
2023-05-20T20:44:29.204224+02:00 adangel kernel: [138956.462167]  ret_from_fork+0x1f/0x30
2023-05-20T20:44:29.204226+02:00 adangel kernel: [138956.462175]  </TASK>
2023-05-20T20:44:29.204226+02:00 adangel kernel: [138956.462422] INFO: task jbd2/sdb1-8:2078173 blocked for more than 120 seconds.
2023-05-20T20:44:29.204227+02:00 adangel kernel: [138956.462428]       Tainted: G          I        6.1.0-9-amd64 #1 Debian 6.1.27-1
2023-05-20T20:44:29.204228+02:00 adangel kernel: [138956.462431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2023-05-20T20:44:29.204229+02:00 adangel kernel: [138956.462433] task:jbd2/sdb1-8     state:D stack:0     pid:2078173 ppid:2      flags:0x00004000
2023-05-20T20:44:29.204230+02:00 adangel kernel: [138956.462439] Call Trace:
2023-05-20T20:44:29.204231+02:00 adangel kernel: [138956.462440]  <TASK>
2023-05-20T20:44:29.204232+02:00 adangel kernel: [138956.462444]  __schedule+0x351/0xa20
2023-05-20T20:44:29.204232+02:00 adangel kernel: [138956.462450]  schedule+0x5d/0xe0
2023-05-20T20:44:29.204233+02:00 adangel kernel: [138956.462453]  io_schedule+0x42/0x70
2023-05-20T20:44:29.204234+02:00 adangel kernel: [138956.462457]  blk_mq_get_tag+0x11a/0x2a0
2023-05-20T20:44:29.204235+02:00 adangel kernel: [138956.462462]  ? dequeue_task_stop+0x70/0x70
2023-05-20T20:44:29.204530+02:00 adangel kernel: [138956.462467]  __blk_mq_alloc_requests+0x191/0x2e0
2023-05-20T20:44:29.204551+02:00 adangel kernel: [138956.462474]  blk_mq_submit_bio+0x2fe/0x580
2023-05-20T20:44:29.204553+02:00 adangel kernel: [138956.462480]  submit_bio_noacct_nocheck+0x32e/0x370
2023-05-20T20:44:29.204553+02:00 adangel kernel: [138956.462484]  ? submit_bio_noacct+0x79/0x4a0
2023-05-20T20:44:29.204554+02:00 adangel kernel: [138956.462490]  jbd2_journal_commit_transaction+0xdb3/0x1a70 [jbd2]
2023-05-20T20:44:29.204555+02:00 adangel kernel: [138956.462516]  ? _raw_spin_unlock+0x15/0x30
2023-05-20T20:44:29.204556+02:00 adangel kernel: [138956.462520]  ? finish_task_switch.isra.0+0x9b/0x300
2023-05-20T20:44:29.204557+02:00 adangel kernel: [138956.462524]  ? __switch_to+0x106/0x410
2023-05-20T20:44:29.204558+02:00 adangel kernel: [138956.462532]  kjournald2+0xa9/0x280 [jbd2]
2023-05-20T20:44:29.204559+02:00 adangel kernel: [138956.462556]  ? dequeue_task_stop+0x70/0x70
2023-05-20T20:44:29.204559+02:00 adangel kernel: [138956.462561]  ? jbd2_fc_wait_bufs+0xa0/0xa0 [jbd2]
2023-05-20T20:44:29.204560+02:00 adangel kernel: [138956.462580]  kthread+0xe6/0x110
2023-05-20T20:44:29.204561+02:00 adangel kernel: [138956.462584]  ? kthread_complete_and_exit+0x20/0x20
2023-05-20T20:44:29.204562+02:00 adangel kernel: [138956.462589]  ret_from_fork+0x1f/0x30
2023-05-20T20:44:29.204563+02:00 adangel kernel: [138956.462597]  </TASK>
2023-05-20T20:44:29.204563+02:00 adangel kernel: [138956.462600] INFO: task worker:2572098 blocked for more than 120 seconds.
2023-05-20T20:44:29.204714+02:00 adangel kernel: [138956.462604]       Tainted: G          I        6.1.0-9-amd64 #1 Debian 6.1.27-1
2023-05-20T20:44:29.204717+02:00 adangel kernel: [138956.462607] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2023-05-20T20:44:29.204718+02:00 adangel kernel: [138956.462609] task:worker          state:D stack:0     pid:2572098 ppid:1      flags:0x00004002
2023-05-20T20:44:29.204719+02:00 adangel kernel: [138956.462614] Call Trace:
2023-05-20T20:44:29.204720+02:00 adangel kernel: [138956.462615]  <TASK>
2023-05-20T20:44:29.204721+02:00 adangel kernel: [138956.462617]  __schedule+0x351/0xa20
2023-05-20T20:44:29.204721+02:00 adangel kernel: [138956.462623]  schedule+0x5d/0xe0
2023-05-20T20:44:29.204722+02:00 adangel kernel: [138956.462626]  io_schedule+0x42/0x70
2023-05-20T20:44:29.204723+02:00 adangel kernel: [138956.462629]  blk_mq_get_tag+0x11a/0x2a0
2023-05-20T20:44:29.204724+02:00 adangel kernel: [138956.462633]  ? blk_rq_merge_ok+0x89/0xa0
2023-05-20T20:44:29.204730+02:00 adangel kernel: [138956.462638]  ? dequeue_task_stop+0x70/0x70
2023-05-20T20:44:29.204731+02:00 adangel kernel: [138956.462642]  __blk_mq_alloc_requests+0x191/0x2e0
2023-05-20T20:44:29.204732+02:00 adangel kernel: [138956.462648]  ? _raw_spin_unlock+0x15/0x30
2023-05-20T20:44:29.204733+02:00 adangel kernel: [138956.462653]  blk_mq_submit_bio+0x2fe/0x580
2023-05-20T20:44:29.204734+02:00 adangel kernel: [138956.462658]  submit_bio_noacct_nocheck+0x32e/0x370
2023-05-20T20:44:29.204735+02:00 adangel kernel: [138956.462662]  ? bio_add_page+0x5b/0x90
2023-05-20T20:44:29.204735+02:00 adangel kernel: [138956.462667]  ext4_mpage_readpages+0x53b/0x870 [ext4]
2023-05-20T20:44:29.204736+02:00 adangel kernel: [138956.462733]  ? ext4_invalidate_folio+0x90/0x90 [ext4]
2023-05-20T20:44:29.204912+02:00 adangel kernel: [138956.462786]  filemap_read_folio+0x3d/0xd0
2023-05-20T20:44:29.204915+02:00 adangel kernel: [138956.462792]  filemap_get_pages+0x487/0x630
2023-05-20T20:44:29.204920+02:00 adangel kernel: [138956.462799]  ? copy_page_to_iter+0x7c/0x1e0
2023-05-20T20:44:29.204921+02:00 adangel kernel: [138956.462804]  filemap_read+0xd2/0x340
2023-05-20T20:44:29.204922+02:00 adangel kernel: [138956.462812]  do_iter_readv_writev+0x139/0x150
2023-05-20T20:44:29.204923+02:00 adangel kernel: [138956.462820]  do_iter_read+0xe8/0x1e0
2023-05-20T20:44:29.204924+02:00 adangel kernel: [138956.462824]  vfs_readv+0xa7/0xe0
2023-05-20T20:44:29.204924+02:00 adangel kernel: [138956.462833]  __x64_sys_preadv+0xb5/0x100
2023-05-20T20:44:29.204925+02:00 adangel kernel: [138956.462838]  do_syscall_64+0x58/0xc0
2023-05-20T20:44:29.204926+02:00 adangel kernel: [138956.462842]  ? do_syscall_64+0x67/0xc0
2023-05-20T20:44:29.204927+02:00 adangel kernel: [138956.462845]  ? syscall_exit_to_user_mode+0x17/0x40
2023-05-20T20:44:29.204927+02:00 adangel kernel: [138956.462849]  ? do_syscall_64+0x67/0xc0
2023-05-20T20:44:29.204928+02:00 adangel kernel: [138956.462852]  ? exit_to_user_mode_prepare+0x147/0x1d0
2023-05-20T20:44:29.205110+02:00 adangel kernel: [138956.462857]  ? syscall_exit_to_user_mode+0x17/0x40
2023-05-20T20:44:29.205113+02:00 adangel kernel: [138956.462861]  ? do_syscall_64+0x67/0xc0
2023-05-20T20:44:29.205114+02:00 adangel kernel: [138956.462864]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
2023-05-20T20:44:29.205115+02:00 adangel kernel: [138956.462869] RIP: 0033:0x7ff17d950cd8
2023-05-20T20:44:29.205115+02:00 adangel kernel: [138956.462873] RSP: 002b:00007ff0255e9600 EFLAGS: 00000246 ORIG_RAX: 0000000000000127
2023-05-20T20:44:29.205116+02:00 adangel kernel: [138956.462877] RAX: ffffffffffffffda RBX: 00007ff1745ff830 RCX: 00007ff17d950cd8
2023-05-20T20:44:29.205117+02:00 adangel kernel: [138956.462880] RDX: 0000000000000002 RSI: 00007ff064076f70 RDI: 0000000000000010
2023-05-20T20:44:29.205122+02:00 adangel kernel: [138956.462883] RBP: 000056471abdb0e0 R08: 0000000000000000 R09: 0000000000000000
2023-05-20T20:44:29.205123+02:00 adangel kernel: [138956.462885] R10: 00000003dc706000 R11: 0000000000000246 R12: 000056471abdb0f0
2023-05-20T20:44:29.205124+02:00 adangel kernel: [138956.462888] R13: 0000564719517b38 R14: 000056471abdb158 R15: 00007ff024dea000
2023-05-20T20:44:29.205125+02:00 adangel kernel: [138956.462894]  </TASK>
2023-05-20T20:44:29.205126+02:00 adangel kernel: [138956.462895] INFO: task worker:2577604 blocked for more than 120 seconds.
2023-05-20T20:44:29.205127+02:00 adangel kernel: [138956.462899]       Tainted: G          I        6.1.0-9-amd64 #1 Debian 6.1.27-1
2023-05-20T20:44:29.205127+02:00 adangel kernel: [138956.462903] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2023-05-20T20:44:29.205128+02:00 adangel kernel: [138956.462905] task:worker          state:D stack:0     pid:2577604 ppid:1      flags:0x00000002
2023-05-20T20:44:29.205129+02:00 adangel kernel: [138956.462910] Call Trace:
2023-05-20T20:44:29.205130+02:00 adangel kernel: [138956.462911]  <TASK>
2023-05-20T20:44:29.205131+02:00 adangel kernel: [138956.462914]  __schedule+0x351/0xa20
2023-05-20T20:44:29.205132+02:00 adangel kernel: [138956.462919]  schedule+0x5d/0xe0
2023-05-20T20:44:29.205133+02:00 adangel kernel: [138956.462922]  io_schedule+0x42/0x70
2023-05-20T20:44:29.205133+02:00 adangel kernel: [138956.462925]  blk_mq_get_tag+0x11a/0x2a0
2023-05-20T20:44:29.205134+02:00 adangel kernel: [138956.462929]  ? blk_rq_merge_ok+0x89/0xa0
2023-05-20T20:44:29.205318+02:00 adangel kernel: [138956.462934]  ? dequeue_task_stop+0x70/0x70
2023-05-20T20:44:29.205321+02:00 adangel kernel: [138956.462939]  __blk_mq_alloc_requests+0x191/0x2e0
2023-05-20T20:44:29.205322+02:00 adangel kernel: [138956.462944]  ? _raw_spin_unlock+0x15/0x30
2023-05-20T20:44:29.205323+02:00 adangel kernel: [138956.462949]  blk_mq_submit_bio+0x2fe/0x580
2023-05-20T20:44:29.205324+02:00 adangel kernel: [138956.462955]  submit_bio_noacct_nocheck+0x32e/0x370
2023-05-20T20:44:29.205325+02:00 adangel kernel: [138956.462959]  ? bio_add_page+0x5b/0x90
2023-05-20T20:44:29.205329+02:00 adangel kernel: [138956.462964]  ext4_mpage_readpages+0x53b/0x870 [ext4]
2023-05-20T20:44:29.205330+02:00 adangel kernel: [138956.463023]  read_pages+0x85/0x260
2023-05-20T20:44:29.205331+02:00 adangel kernel: [138956.463026]  ? folio_add_lru+0x70/0xd0
2023-05-20T20:44:29.205331+02:00 adangel kernel: [138956.463030]  page_cache_ra_unbounded+0x129/0x180
2023-05-20T20:44:29.205332+02:00 adangel kernel: [138956.463035]  filemap_get_pages+0x10a/0x630
2023-05-20T20:44:29.205333+02:00 adangel kernel: [138956.463040]  ? _raw_spin_unlock_irqrestore+0x23/0x40
2023-05-20T20:44:29.205334+02:00 adangel kernel: [138956.463044]  ? hrtimer_try_to_cancel+0x78/0x110
2023-05-20T20:44:29.205335+02:00 adangel kernel: [138956.463052]  filemap_read+0xd2/0x340
2023-05-20T20:44:29.205903+02:00 adangel kernel: [138956.463060]  vfs_read+0x239/0x310
2023-05-20T20:44:29.205907+02:00 adangel kernel: [138956.463067]  __x64_sys_pread64+0x94/0xc0
2023-05-20T20:44:29.205908+02:00 adangel kernel: [138956.463072]  do_syscall_64+0x58/0xc0
2023-05-20T20:44:29.205913+02:00 adangel kernel: [138956.463076]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
2023-05-20T20:44:29.205914+02:00 adangel kernel: [138956.463081] RIP: 0033:0x7ff17d9491a7
2023-05-20T20:44:29.205915+02:00 adangel kernel: [138956.463083] RSP: 002b:00007ff0272f15e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
2023-05-20T20:44:29.205915+02:00 adangel kernel: [138956.463087] RAX: ffffffffffffffda RBX: 00007ff0559dd830 RCX: 00007ff17d9491a7
2023-05-20T20:44:29.205916+02:00 adangel kernel: [138956.463089] RDX: 0000000000001000 RSI: 00007ff0e119c000 RDI: 0000000000000010
2023-05-20T20:44:29.205917+02:00 adangel kernel: [138956.463092] RBP: 00007ff0e119c000 R08: 0000000000000000 R09: 000056471abdb184
2023-05-20T20:44:29.205921+02:00 adangel kernel: [138956.463094] R10: 00000004935bf000 R11: 0000000000000293 R12: 0000000000000000
2023-05-20T20:44:29.205922+02:00 adangel kernel: [138956.463096] R13: 0000564719517b38 R14: 000056471abdb158 R15: 00007ff026af2000
2023-05-20T20:44:29.205923+02:00 adangel kernel: [138956.463102]  </TASK>
2023-05-20T20:44:29.205924+02:00 adangel kernel: [138956.463115] INFO: task kworker/u8:0:2553334 blocked for more than 120 seconds.
2023-05-20T20:44:29.205925+02:00 adangel kernel: [138956.463118]       Tainted: G          I        6.1.0-9-amd64 #1 Debian 6.1.27-1
2023-05-20T20:44:29.205925+02:00 adangel kernel: [138956.463121] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2023-05-20T20:44:29.205926+02:00 adangel kernel: [138956.463123] task:kworker/u8:0    state:D stack:0     pid:2553334 ppid:2      flags:0x00004000
2023-05-20T20:44:29.206113+02:00 adangel kernel: [138956.463128] Workqueue: writeback wb_workfn (flush-8:16)
2023-05-20T20:44:29.206116+02:00 adangel kernel: [138956.463135] Call Trace:
2023-05-20T20:44:29.206117+02:00 adangel kernel: [138956.463136]  <TASK>
2023-05-20T20:44:29.206118+02:00 adangel kernel: [138956.463139]  __schedule+0x351/0xa20
2023-05-20T20:44:29.206119+02:00 adangel kernel: [138956.463144]  schedule+0x5d/0xe0
2023-05-20T20:44:29.206119+02:00 adangel kernel: [138956.463147]  io_schedule+0x42/0x70
2023-05-20T20:44:29.206120+02:00 adangel kernel: [138956.463151]  bit_wait_io+0xd/0x60
2023-05-20T20:44:29.206121+02:00 adangel kernel: [138956.463154]  __wait_on_bit+0x48/0x140
2023-05-20T20:44:29.206121+02:00 adangel kernel: [138956.463158]  ? bit_wait+0x60/0x60
2023-05-20T20:44:29.206156+02:00 adangel kernel: [138956.463163]  out_of_line_wait_on_bit+0x91/0xb0
2023-05-20T20:44:29.206158+02:00 adangel kernel: [138956.463167]  ? sugov_init+0x350/0x350
2023-05-20T20:44:29.206158+02:00 adangel kernel: [138956.463172]  do_get_write_access+0x244/0x3e0 [jbd2]
2023-05-20T20:44:29.206159+02:00 adangel kernel: [138956.463194]  jbd2_journal_get_write_access+0x5b/0x80 [jbd2]
2023-05-20T20:44:29.207169+02:00 adangel kernel: [138956.463212]  __ext4_journal_get_write_access+0x81/0x180 [ext4]
2023-05-20T20:44:29.207200+02:00 adangel kernel: [138956.463268]  ext4_reserve_inode_write+0x5d/0xb0 [ext4]
2023-05-20T20:44:29.207201+02:00 adangel kernel: [138956.463328]  __ext4_mark_inode_dirty+0x74/0x240 [ext4]
2023-05-20T20:44:29.207202+02:00 adangel kernel: [138956.463382]  ? __ext4_journal_start_sb+0x105/0x120 [ext4]
2023-05-20T20:44:29.207203+02:00 adangel kernel: [138956.463431]  ext4_dirty_inode+0x56/0x80 [ext4]
2023-05-20T20:44:29.207203+02:00 adangel kernel: [138956.463482]  __mark_inode_dirty+0x53/0x380
2023-05-20T20:44:29.207204+02:00 adangel kernel: [138956.463487]  ext4_da_update_reserve_space+0x125/0x180 [ext4]
2023-05-20T20:44:29.207205+02:00 adangel kernel: [138956.463538]  ext4_ext_map_blocks+0xba7/0x19d0 [ext4]
2023-05-20T20:44:29.207206+02:00 adangel kernel: [138956.463589]  ? find_get_pages_range_tag+0x1a9/0x200
2023-05-20T20:44:29.207207+02:00 adangel kernel: [138956.463598]  ext4_map_blocks+0x24d/0x620 [ext4]
2023-05-20T20:44:29.207207+02:00 adangel kernel: [138956.463650]  ext4_writepages+0x822/0x1120 [ext4]
2023-05-20T20:44:29.207208+02:00 adangel kernel: [138956.463704]  ? ext4_inode_csum+0x20b/0x290 [ext4]
2023-05-20T20:44:29.207209+02:00 adangel kernel: [138956.463763]  ? xts_crypt+0x123/0x340 [aesni_intel]
2023-05-20T20:44:29.207210+02:00 adangel kernel: [138956.463789]  do_writepages+0xbd/0x1c0
2023-05-20T20:44:29.207211+02:00 adangel kernel: [138956.463795]  ? blkcg_rstat_flush+0xa4/0x110
2023-05-20T20:44:29.207211+02:00 adangel kernel: [138956.463800]  __writeback_single_inode+0x3d/0x360
2023-05-20T20:44:29.207212+02:00 adangel kernel: [138956.463803]  ? fprop_fraction_percpu+0x2b/0x80
2023-05-20T20:44:29.207213+02:00 adangel kernel: [138956.463809]  writeback_sb_inodes+0x1ed/0x4a0
2023-05-20T20:44:29.207214+02:00 adangel kernel: [138956.463817]  __writeback_inodes_wb+0x4c/0xf0
2023-05-20T20:44:29.207214+02:00 adangel kernel: [138956.463821]  wb_writeback+0x204/0x2f0
2023-05-20T20:44:29.207215+02:00 adangel kernel: [138956.463826]  wb_workfn+0x350/0x4f0
2023-05-20T20:44:29.207216+02:00 adangel kernel: [138956.463833]  process_one_work+0x1c4/0x380
2023-05-20T20:44:29.207216+02:00 adangel kernel: [138956.463840]  worker_thread+0x4d/0x380
2023-05-20T20:44:29.207217+02:00 adangel kernel: [138956.463844]  ? _raw_spin_lock_irqsave+0x23/0x50
2023-05-20T20:44:29.207218+02:00 adangel kernel: [138956.463849]  ? rescuer_thread+0x3a0/0x3a0
2023-05-20T20:44:29.207219+02:00 adangel kernel: [138956.463853]  kthread+0xe6/0x110
2023-05-20T20:44:29.207219+02:00 adangel kernel: [138956.463857]  ? kthread_complete_and_exit+0x20/0x20
2023-05-20T20:44:29.207220+02:00 adangel kernel: [138956.463861]  ret_from_fork+0x1f/0x30
2023-05-20T20:44:29.207221+02:00 adangel kernel: [138956.463870]  </TASK>
2023-05-20T20:44:29.843115+02:00 adangel kernel: [138957.102640] usb 2-1: reset SuperSpeed USB device number 28 using xhci_hcd
2023-05-20T20:44:36.115099+02:00 adangel kernel: [138963.374181] usb-storage 2-1:1.0: Error in queuecommand_lck: us->srb = 00000000fe80b03c
2023-05-20T20:44:36.234969+02:00 adangel kernel: [138963.493985] usb-storage 2-1:1.0: Error in queuecommand_lck: us->srb = 00000000fe80b03c
2023-05-20T20:44:36.291133+02:00 adangel kernel: [138963.550063] usb-storage 2-1:1.0: Error in queuecommand_lck: us->srb = 00000000fe80b03c
2023-05-20T20:44:36.346971+02:00 adangel kernel: [138963.605980] usb-storage 2-1:1.0: Error in queuecommand_lck: us->srb = 00000000fe80b03c
2023-05-20T20:44:36.407069+02:00 adangel kernel: [138963.666227] usb-storage 2-1:1.0: Error in queuecommand_lck: us->srb = 00000000fe80b03c
2023-05-20T20:44:36.431098+02:00 adangel kernel: [138963.693809] usb 2-1: Device not responding to setup address.
2023-05-20T20:44:36.463152+02:00 adangel kernel: [138963.722102] usb-storage 2-1:1.0: Error in queuecommand_lck: us->srb = 00000000fe80b03c
2023-05-20T20:44:36.519148+02:00 adangel kernel: [138963.778113] usb-storage 2-1:1.0: Error in queuecommand_lck: us->srb = 00000000fe80b03c
2023-05-20T20:44:36.583032+02:00 adangel kernel: [138963.841999] usb-storage 2-1:1.0: Error in queuecommand_lck: us->srb = 00000000fe80b03c
2023-05-20T20:44:36.642973+02:00 adangel kernel: [138963.905252] usb 2-1: Device not responding to setup address.
2023-05-20T20:44:36.650962+02:00 adangel kernel: [138963.909970] usb-storage 2-1:1.0: Error in queuecommand_lck: us->srb = 00000000fe80b03c
2023-05-20T20:44:36.719061+02:00 adangel kernel: [138963.978140] usb-storage 2-1:1.0: Error in queuecommand_lck: us->srb = 00000000fe80b03c
2023-05-20T20:44:36.775111+02:00 adangel kernel: [138964.034050] usb-storage 2-1:1.0: Error in queuecommand_lck: us->srb = 00000000fe80b03c
2023-05-20T20:44:36.831059+02:00 adangel kernel: [138964.090068] usb-storage 2-1:1.0: Error in queuecommand_lck: us->srb = 00000000fe80b03c
2023-05-20T20:44:36.851052+02:00 adangel kernel: [138964.110063] usb 2-1: device not accepting address 28, error -71
2023-05-20T20:44:36.887043+02:00 adangel kernel: [138964.146116] usb-storage 2-1:1.0: Error in queuecommand_lck: us->srb = 00000000fe80b03c
2023-05-20T20:44:36.947137+02:00 adangel kernel: [138964.206143] usb-storage 2-1:1.0: Error in queuecommand_lck: us->srb = 00000000fe80b03c
2023-05-20T20:44:37.010994+02:00 adangel kernel: [138964.269978] usb-storage 2-1:1.0: Error in queuecommand_lck: us->srb = 00000000fe80b03c
2023-05-20T20:44:37.054965+02:00 adangel kernel: [138964.314015] usb 2-1: reset SuperSpeed USB device number 28 using xhci_hcd
2023-05-20T20:44:37.071082+02:00 adangel kernel: [138964.330072] usb-storage 2-1:1.0: Error in queuecommand_lck: us->srb = 00000000fe80b03c

There multiple different tasks that hang. This probably adds up to the high load value I’ve observed. These tasks are kernel level tasks and supposed to be always runnable, but somehow they are blocked now. In the stack traces it’s somewhere in the io scheduler, originating from ext4 or so.

Now on to the potential kernel bug:

2023-05-20T20:44:38.293416+02:00 adangel kernel: [138964.335567] list_add corruption. prev->next should be next (ffff8fd9b2489808), but was ffff8fd80fbe72c8. (prev=ffff8fd80fbe72c8).
2023-05-20T20:44:38.293433+02:00 adangel kernel: [138964.335596] ------------[ cut here ]------------
2023-05-20T20:44:38.293436+02:00 adangel kernel: [138964.335600] kernel BUG at lib/list_debug.c:30!
2023-05-20T20:44:38.293437+02:00 adangel kernel: [138964.335614] invalid opcode: 0000 [#1] PREEMPT SMP PTI
2023-05-20T20:44:38.293443+02:00 adangel kernel: [138964.335624] CPU: 1 PID: 58 Comm: kworker/1:1H Tainted: G          I        6.1.0-9-amd64 #1  Debian 6.1.27-1
2023-05-20T20:44:38.293445+02:00 adangel kernel: [138964.335636] Hardware name: Dell Inc. XPS 13 9350/09JHRY, BIOS 1.12.2 12/15/2019
2023-05-20T20:44:38.293446+02:00 adangel kernel: [138964.335642] Workqueue: kblockd blk_mq_requeue_work
2023-05-20T20:44:38.293448+02:00 adangel kernel: [138964.335658] RIP: 0010:__list_add_valid.cold+0x3a/0x5b
2023-05-20T20:44:38.293449+02:00 adangel kernel: [138964.335675] Code: f2 48 89 c1 48 89 fe 48 c7 c7 20 79 b5 ad e8 c1 a6 fe ff 0f 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 c8 78 b5 ad e8 aa a6 fe ff <0f> 0b 48 89 c1 48 c7 c7 70 78 b5 ad e8 99 a6 fe ff 0f 0b 48 c7 c7
2023-05-20T20:44:38.293450+02:00 adangel kernel: [138964.335683] RSP: 0018:ffff9fd9403bbe10 EFLAGS: 00010246
2023-05-20T20:44:38.293451+02:00 adangel kernel: [138964.335693] RAX: 0000000000000075 RBX: ffff8fd80fbe7280 RCX: 0000000000000000
2023-05-20T20:44:38.293452+02:00 adangel kernel: [138964.335702] RDX: 0000000000000000 RSI: ffffffffadb3fa66 RDI: 00000000ffffffff
2023-05-20T20:44:38.293452+02:00 adangel kernel: [138964.335708] RBP: ffff8fd80fbe72c8 R08: 0000000000000000 R09: ffff9fd9403bbca8
2023-05-20T20:44:38.293453+02:00 adangel kernel: [138964.335715] R10: 0000000000000003 R11: ffffffffae2cda30 R12: ffff8fd80fbe72c8
2023-05-20T20:44:38.293455+02:00 adangel kernel: [138964.335721] R13: ffff8fd9b2489808 R14: ffff8fd9b2489800 R15: 0000000000000000
2023-05-20T20:44:38.293456+02:00 adangel kernel: [138964.335727] FS:  0000000000000000(0000) GS:ffff8fdb6ec80000(0000) knlGS:0000000000000000
2023-05-20T20:44:38.293458+02:00 adangel kernel: [138964.335742] CR2: 00007f6bc695accc CR3: 0000000299c6a005 CR4: 00000000003726e0
2023-05-20T20:44:38.293460+02:00 adangel kernel: [138964.335750] Call Trace:
2023-05-20T20:44:38.293461+02:00 adangel kernel: [138964.335756]  <TASK>
2023-05-20T20:44:38.293462+02:00 adangel kernel: [138964.335761]  blk_mq_request_bypass_insert+0x84/0xc0
2023-05-20T20:44:38.293463+02:00 adangel kernel: [138964.335776]  blk_mq_requeue_work+0xa2/0x180
2023-05-20T20:44:38.293464+02:00 adangel kernel: [138964.335788]  process_one_work+0x1c4/0x380
2023-05-20T20:44:38.293466+02:00 adangel kernel: [138964.335806]  worker_thread+0x4d/0x380
2023-05-20T20:44:38.293467+02:00 adangel kernel: [138964.335817]  ? _raw_spin_lock_irqsave+0x23/0x50
2023-05-20T20:44:38.293469+02:00 adangel kernel: [138964.335832]  ? rescuer_thread+0x3a0/0x3a0
2023-05-20T20:44:38.293470+02:00 adangel kernel: [138964.335844]  kthread+0xe6/0x110
2023-05-20T20:44:38.293471+02:00 adangel kernel: [138964.335854]  ? kthread_complete_and_exit+0x20/0x20
2023-05-20T20:44:38.293472+02:00 adangel kernel: [138964.335867]  ret_from_fork+0x1f/0x30
2023-05-20T20:44:38.293473+02:00 adangel kernel: [138964.335888]  </TASK>
2023-05-20T20:44:38.293475+02:00 adangel kernel: [138964.335892] Modules linked in: tun sd_mod sg uas usb_storage ctr ccm nf_conntrack_netlink xfrm_user xfrm_algo br_netfilter snd_seq_dummy snd_hrtimer rfcomm snd_seq uinput xt_CHECKSUM xt_MASQUERADE nft_chain_nat nf_nat bridge stp llc cmac algif_hash algif_skcipher af_alg overlay qrtr ip6t_REJECT nf_reject_ipv6 xt_hl bnep ip6_tables ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog xt_comment nft_limit snd_sof_pci_intel_skl xt_limit snd_sof_intel_hda_common xt_addrtype xt_tcpudp soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp xt_conntrack nf_conntrack nf_defrag_ipv6 snd_sof mei_hdcp nf_defrag_ipv4 snd_sof_utils nft_compat soundwire_bus intel_rapl_msr nf_tables nfnetlink binfmt_misc snd_soc_skl snd_hda_codec_hdmi x86_pkg_temp_thermal snd_soc_hdac_hda dell_laptop intel_powerclamp snd_hda_ext_core coretemp snd_soc_sst_ipc nls_ascii snd_soc_sst_dsp snd_soc_acpi_intel_match nls_cp437 snd_ctl_led iwlmvm kvm_intel btusb btrtl
2023-05-20T20:44:38.293871+02:00 adangel kernel: [138964.336090]  vfat btbcm dell_smm_hwmon snd_soc_acpi snd_hda_codec_realtek btintel fat mac80211 kvm btmtk snd_hda_codec_generic snd_soc_core libarc4 irqbypass snd_usb_audio bluetooth ledtrig_audio snd_usbmidi_lib snd_compress rapl iTCO_wdt intel_pmc_bxt iwlwifi intel_cstate snd_hda_intel iTCO_vendor_support dell_wmi jitterentropy_rng snd_rawmidi snd_intel_dspcfg snd_intel_sdw_acpi dell_smbios dcdbas intel_uncore wmi_bmof snd_hda_codec pcspkr watchdog intel_wmi_thunderbolt dell_wmi_descriptor uvcvideo cfg80211 snd_hda_core videobuf2_vmalloc snd_seq_device snd_hwdep drbg videobuf2_memops videobuf2_v4l2 ansi_cprng snd_pcm videobuf2_common snd_timer ecdh_generic videodev rfkill processor_thermal_device_pci_legacy ecc mc snd mei_me soundcore processor_thermal_device processor_thermal_rfim processor_thermal_mbox processor_thermal_rapl mei intel_rapl_common intel_pch_thermal intel_xhci_usb_role_switch roles intel_soc_dts_iosf joydev int3403_thermal int340x_thermal_zone intel_pmc_core
2023-05-20T20:44:38.293874+02:00 adangel kernel: [138964.336269]  int3400_thermal intel_hid ac acpi_thermal_rel sparse_keymap acpi_pad cdc_mbim cdc_wdm hid_multitouch evdev serio_raw msr parport_pc ppdev lp parport fuse loop efi_pstore configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic zstd_compress dm_crypt dm_mod efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod hid_logitech_hidpp hid_logitech_dj cdc_ncm cdc_ether usbnet mii i915 drm_buddy usbhid i2c_algo_bit crc32_pclmul nvme crc32c_intel drm_display_helper ghash_clmulni_intel cec hid_generic nvme_core sha512_ssse3 ahci rc_core libahci t10_pi sha512_generic ttm libata crc64_rocksoft_generic rtsx_pci_sdmmc crc64_rocksoft crc_t10dif drm_kms_helper mmc_core crct10dif_generic aesni_intel crypto_simd xhci_pci i2c_hid_acpi xhci_hcd scsi_mod i2c_hid cryptd i2c_i801 intel_lpss_pci crct10dif_pclmul crc64 psmouse usbcore drm rtsx_pci crct10dif_common i2c_smbus
2023-05-20T20:44:38.293875+02:00 adangel kernel: [138964.336505]  intel_lpss idma64 scsi_common usb_common fan hid battery video wmi button
2023-05-20T20:44:38.293877+02:00 adangel kernel: [138964.336538] ---[ end trace 0000000000000000 ]---
2023-05-20T20:44:38.293878+02:00 adangel kernel: [138965.547929] RIP: 0010:__list_add_valid.cold+0x3a/0x5b
2023-05-20T20:44:38.293879+02:00 adangel kernel: [138965.547954] Code: f2 48 89 c1 48 89 fe 48 c7 c7 20 79 b5 ad e8 c1 a6 fe ff 0f 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 c8 78 b5 ad e8 aa a6 fe ff <0f> 0b 48 89 c1 48 c7 c7 70 78 b5 ad e8 99 a6 fe ff 0f 0b 48 c7 c7
2023-05-20T20:44:38.293880+02:00 adangel kernel: [138965.547957] RSP: 0018:ffff9fd9403bbe10 EFLAGS: 00010246
2023-05-20T20:44:38.293881+02:00 adangel kernel: [138965.547960] RAX: 0000000000000075 RBX: ffff8fd80fbe7280 RCX: 0000000000000000
2023-05-20T20:44:38.293882+02:00 adangel kernel: [138965.547962] RDX: 0000000000000000 RSI: ffffffffadb3fa66 RDI: 00000000ffffffff
2023-05-20T20:44:38.293883+02:00 adangel kernel: [138965.547964] RBP: ffff8fd80fbe72c8 R08: 0000000000000000 R09: ffff9fd9403bbca8
2023-05-20T20:44:38.293883+02:00 adangel kernel: [138965.547966] R10: 0000000000000003 R11: ffffffffae2cda30 R12: ffff8fd80fbe72c8
2023-05-20T20:44:38.293885+02:00 adangel kernel: [138965.547968] R13: ffff8fd9b2489808 R14: ffff8fd9b2489800 R15: 0000000000000000
2023-05-20T20:44:38.293886+02:00 adangel kernel: [138965.547970] FS:  0000000000000000(0000) GS:ffff8fdb6ec80000(0000) knlGS:0000000000000000
2023-05-20T20:44:38.293886+02:00 adangel kernel: [138965.547972] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2023-05-20T20:44:38.293887+02:00 adangel kernel: [138965.547974] CR2: 00007f6bc695accc CR3: 0000000299c6a005 CR4: 00000000003726e0
2023-05-20T20:44:38.293888+02:00 adangel kernel: [138965.547976] note: kworker/1:1H[58] exited with preempt_count 1
2023-05-20T20:44:54.547672+02:00 adangel kernel: [138981.805830] i915 0000:00:02.0: [drm] *ERROR* [CRTC:72:pipe B] cleanup_done timed out
2023-05-20T20:44:58.079136+02:00 adangel kernel: [138985.341790] rcu: INFO: rcu_preempt self-detected stall on CPU
2023-05-20T20:44:58.079149+02:00 adangel kernel: [138985.341796] rcu:   3-....: (4711 ticks this GP) idle=ad74/1/0x4000000000000000 softirq=37761468/37761468 fqs=2120
2023-05-20T20:44:58.079151+02:00 adangel kernel: [138985.341802]        (t=5251 jiffies g=30808309 q=261925 ncpus=4)
2023-05-20T20:44:58.079153+02:00 adangel kernel: [138985.341805] CPU: 3 PID: 141 Comm: kworker/3:1H Tainted: G      D   I        6.1.0-9-amd64 #1  Debian 6.1.27-1
2023-05-20T20:44:58.079154+02:00 adangel kernel: [138985.341808] Hardware name: Dell Inc. XPS 13 9350/09JHRY, BIOS 1.12.2 12/15/2019
2023-05-20T20:44:58.079155+02:00 adangel kernel: [138985.341810] Workqueue: kblockd blk_mq_run_work_fn
2023-05-20T20:44:58.079166+02:00 adangel kernel: [138985.341833] RIP: 0010:native_queued_spin_lock_slowpath+0x6b/0x2b0

It seems, with all these hanging tasks, some list got corrupted…

Also note: “self detected stall on CPU 3…”. Really serious.

Since the filesystem was not unmounted cleanly, it will try to recover before mounting. This might produce again errors after reboot, when plugging in the adapter with the drive. But for me, it luckily didn’t hit these specific blocks, so it didn’t get stuck automatically.

Sometimes, I’ve also seen this error:

2023-05-20T21:44:27.928010+02:00 adangel kernel: [ 2653.393863] i915 0000:00:02.0: [drm] *ERROR* [CRTC:72:pipe B] cleanup_done timed out
2023-05-20T21:44:38.167002+02:00 adangel kernel: [ 2663.633778] i915 0000:00:02.0: [drm] *ERROR* [CRTC:72:pipe B] cleanup_done timed out
2023-05-20T21:44:48.407864+02:00 adangel kernel: [ 2673.873695] i915 0000:00:02.0: [drm] *ERROR* [CRTC:72:pipe B] cleanup_done timed out
2023-05-20T21:44:54.579186+02:00 adangel kernel: [ 2680.049594] watchdog: BUG: soft lockup - CPU#3 stuck for 104s! [kworker/3:1H:138]

I guess, the graphics card (i915) is also complaining. And note the watchdog message: CPU#3 stuck for 104s!

SMART

Ok, back to the first errors: I/O error while reading a sector from a SSD? That sounds unlikely. Using a SSD, you should almost never see such an error - the SSD would already automatically replace any bad sectors and you wouldn’t see such errors at this level. Looking at the SMART values, there are zero (I mean literally “0”) reallocated sectors.

This gives a hint, that this I/O error doesn’t come from the SSD itself…

I also started the self tests on the drive, but these self-tests said, the SSD is in good shape.

Repair attempts

That made me think: Maybe for some reason, some sectors are not accessible. Let’s try to identify those and work around them. Shouldn’t be that hard. I just want to have this damn thing working again, so that I can use it and install wonderful Windows updates. And after that go on with what I actually wanted to do under Windows (like reproducing a software bug involving file paths or line endings).

The error seems to be, that we have bad blocks on the device. So, let’s try to find bad blocks.

badblocks

The badblock command is usually used while formatting a partition - the mkfs command can also check, whether all blocks are readable and/or writable and if not collect the bad blocks in a list to avoid using them and loosing data.

While not designed to be used that way, badblocks can also be run standalone.

So I started of checking the whole drive, splitting the range into halfs until I identified the blocks, that cause trouble.

I eventually found out a very important trick: Whenever I hit the troubling sectors, I couldn’t CTRL+C the command anymore and I also couldn’t kill it. After a while, the whole system would start to hang and I needed to power cycle. To make it faster, I used the SysRq keys to just reboot.

But I found a better solution, that doesn’t involve rebooting: Whenever I hit these troubling blocks and the badblocks command couldn’t be interrupted anymore, I simple unplug the USB SATA adapter. That apparently stopped the bad blocks command, reset the hung kernel tasks and everything is fine. The system kept stable. That allowed me to do even faster test cycles. However, I figured this out very late.

Anyway - I figured out, that the bad blocks are between 66166400 and 70000000. This is about 1.5MiB (block size 4096).

Important is, that you use the same blocksize as the file system, you are using. For me, the block size is 4096.

Now I have a command, that can reproduce the bad behavior:

badblocks -s -v -b 4096 /dev/sda1 70000000 66166400

ext4 magics

When you have identified bad blocks and these blocks are already in use, that you would need to delete maybe the whole file. But which file is bad? debugfs to the rescue:

You can map a block number to inode:

debugfs -R icheck <blocknr> /dev/sda1

And the inode number to a file:

debugfs -R ncheck <inode> /dev/sda1

Unfortunately, this is the disk image file for my virtual windows. So, I need to take care that I don’t loose this file, otherwise I need to reinstall Windows again (and install all updates again) which can take ages.

The strategy with ext4 is that: You add the badblocks to the list and then run fsck.ext4 and let it repair the filesystem. The program will then try to read the bad blocks and copy the data, if it can read it, to other good blocks, so that we don’t need to delete the file. However, this doesn’t work for me: As soon as fsck.ext4 would try to read from the bad blocks, it will get stuck. The problem is, that we don’t get a I/O error at this block, the adapter freezes. And so, we freeze fsck.ext4 as well.

So, I used a little trick: I used another (slow) adapter, so that fsck.ext4 can successfully copy the bad blocks around. That way I didn’t need to manually copy the good parts of the file somewhere else.

With tune2fs -l you can see the current list of bad blocks. This was empty for me.

With seq 66166400 6999999 > badblocklist you can generate a file which lists all the bad blocks and is a input for fsck.ext4.

With fsck.ext4 -l badblocklist -fv ... you can update the badblocks list. -l will preserve any existing badblocks and will just add additional blocks from the given file.

But after that, it still didn’t work in a way, that the virtual Windows would have been usable. It was still slow and I noticed still many errors in the syslog.

Research

Doing research, what the problem might be. I found USB3.0 to SATA adapter causes problems with similar error messages. This issue pointed me to a raspberry pi forum STICKY: If you have a Raspberry Pi 4 and are getting bad speeds transferring data to/from USB3.0 SSDs, read this.

Here I learned:

  • There is UAS - USB Attached SATA. Which is the fasted mode - because then not the old USB mass storage interface is used. The host can send multiple read commands and doesn’t need to wait for each. The device can then execute it in any order. So, in the end, it is more parallel.
  • There are quirks for some USB SATA adapter, since they don’t work reliable. Ok, the first idea was - I should enable the quirk to disable UAS for my adapter. But while trying to figure out the command to issue for that (see kernel-parameters.txt), I figured, the quirk is already enabled and my adapter doesn’t use UAS.

When plugging in the adapter, you see the following kernel messages:

2023-05-21T10:14:56.101641+02:00 adangel kernel: [ 3180.711975] usb 2-1: new SuperSpeed USB device number 3 using xhci_hcd
2023-05-21T10:14:56.121645+02:00 adangel kernel: [ 3180.732423] usb 2-1: New USB device found, idVendor=2537, idProduct=1068, bcdDevice= 1.00
2023-05-21T10:14:56.121656+02:00 adangel kernel: [ 3180.732429] usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
2023-05-21T10:14:56.121658+02:00 adangel kernel: [ 3180.732432] usb 2-1: Product: NS1068
2023-05-21T10:14:56.121659+02:00 adangel kernel: [ 3180.732434] usb 2-1: Manufacturer: Norelsys
2023-05-21T10:14:56.121660+02:00 adangel kernel: [ 3180.732436] usb 2-1: SerialNumber: 0123456789ABCDE
2023-05-21T10:14:56.125647+02:00 adangel kernel: [ 3180.737514] usb 2-1: UAS is ignored for this device, using usb-storage instead
2023-05-21T10:14:56.125664+02:00 adangel kernel: [ 3180.737518] usb-storage 2-1:1.0: USB Mass Storage device detected
2023-05-21T10:14:56.125665+02:00 adangel kernel: [ 3180.737632] usb-storage 2-1:1.0: Quirks match for vid 2537 pid 1068: 800000

“UAS is ignored for this device, using usb-storage instead” and “Quirks match for vid 2537 pid 1068: 800000”.

Ok, looking at the source, indeed, my device is already quirked:

/* Reported-by: Icenowy Zheng <icenowy@aosc.io> */
UNUSUAL_DEV(0x2537, 0x1068, 0x0000, 0x9999,
		"Norelsys",
		"NS1068X",
		USB_SC_DEVICE, USB_PR_DEVICE, NULL,
		US_FL_IGNORE_UAS),

(from kernel source: drivers/usb/storage/unusual_uas.h)

So, I guess, there is nothing more to do here. Without that quirk, the adapter would probably not work at all?

More research showed this: How to test SSDs or NVMe for badblocks? And yeah, SSD should really require this procedure. For NVMe devices (e.g. my internal SSD, not the external SATA) there are also additional commands, explained in Using NVMe Command Line Tools to Check NVMe Flash Health.

Solution

What’s the solution for me now? Enough research and time spent on this, I need a new adapter. I’ve chosen an adapter from BENFEI, that is claimed to support UAS. See the Amazon Link BENFEI USB 3.0 for details.

This adapter simply worked without problems. I also tried it with my USB hub - and guess what: no errors. It just works!

Tests

I’ve done some more experiments, checking the logs and connections with the old and the new adapter.

With lsusb --tree you can not only find out, where exactly your device is plugged in, it will also show which speed it uses and which driver, e.g. whether usb_storage or uas is used.

bad adapter

Bad adapter with USB-C-to-A adapter
Adapter's plate

$ lsusb
Bus 002 Device 028: ID 2537:1068 Norelsys NS1068/NS1068X SATA Bridge Controller

Plugged in with the only “good” port of docking station (dell d3100):

$ lsusb -t
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/6p, 5000M
    |__ Port 2: Dev 2, If 0, Class=Hub, Driver=hub/4p, 5000M
        |__ Port 2: Dev 4, If 0, Class=Hub, Driver=hub/4p, 5000M
            |__ Port 2: Dev 28, If 0, Class=Mass Storage, Driver=usb-storage, 5000M

It is using superspeed (5000M). The logs doesn’t show any error upon plugging in.

Now testing it at a “bad” port of docking station (dell d3100):

Bus 002 Device 029: ID 2537:1068 Norelsys NS1068/NS1068X SATA Bridge Controller
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/6p, 5000M
    |__ Port 2: Dev 2, If 0, Class=Hub, Driver=hub/4p, 5000M
        |__ Port 2: Dev 4, If 0, Class=Hub, Driver=hub/4p, 5000M
            |__ Port 1: Dev 29, If 0, Class=Mass Storage, Driver=usb-storage, 5000M

Still superspeed, but already errors while mounting:

2023-05-23T19:53:35.759662+02:00 adangel kernel: [35440.754652] sd 8:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
2023-05-23T19:53:35.759680+02:00 adangel kernel: [35440.754659] sd 8:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 00 00 01 08 00 00 f8 00
2023-05-23T19:53:35.759683+02:00 adangel kernel: [35440.754661] I/O error, dev sda, sector 264 op 0x0:(READ) flags 0x80700 phys_seg 31 prio class 2
2023-05-23T19:53:36.391721+02:00 adangel kernel: [35441.385318] usb 2-2.2.1: reset SuperSpeed USB device number 29 using xhci_hcd
2023-05-23T19:53:36.411666+02:00 adangel kernel: [35441.405846] sd 8:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
2023-05-23T19:53:36.411684+02:00 adangel kernel: [35441.405854] sd 8:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 00 00 02 08 00 01 f8 00
2023-05-23T19:53:36.411686+02:00 adangel kernel: [35441.405856] I/O error, dev sda, sector 520 op 0x0:(READ) flags 0x80700 phys_seg 63 prio class 2
2023-05-23T19:53:37.231785+02:00 adangel kernel: [35442.225503] usb 2-2.2.1: reset SuperSpeed USB device number 29 using xhci_hcd
2023-05-23T19:53:37.251719+02:00 adangel kernel: [35442.246105] sd 8:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
2023-05-23T19:53:37.251753+02:00 adangel kernel: [35442.246122] sd 8:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 00 00 08 a0 00 00 c8 00
2023-05-23T19:53:37.251756+02:00 adangel kernel: [35442.246129] I/O error, dev sda, sector 2208 op 0x0:(READ) flags 0x80700 phys_seg 25 prio class 2
2023-05-23T19:53:37.855832+02:00 adangel kernel: [35442.849227] usb 2-2.2.1: reset SuperSpeed USB device number 29 using xhci_hcd
2023-05-23T19:53:37.875725+02:00 adangel kernel: [35442.870007] sd 8:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s
2023-05-23T19:53:37.875739+02:00 adangel kernel: [35442.870013] sd 8:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 00 00 09 68 00 00 80 00
2023-05-23T19:53:37.875741+02:00 adangel kernel: [35442.870015] I/O error, dev sda, sector 2408 op 0x0:(READ) flags 0x80700 phys_seg 16 prio class 2

And many more error. Eventually the drive mounts successfully. While unmounting, again some errors are logged.

Now directly at notebook port:

Bus 002 Device 030: ID 2537:1068 Norelsys NS1068/NS1068X SATA Bridge Controller
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/6p, 5000M
    |__ Port 1: Dev 30, If 0, Class=Mass Storage, Driver=usb-storage, 5000M

One more test - adapter directly at the usb c port without usb-c-to-a adapter - in case the USB-C-to-A-adapter is the culprit:

Bus 004 Device 002: ID 2537:1068 Norelsys NS1068/NS1068X SATA Bridge Controller
/:  Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 10000M
    |__ Port 1: Dev 2, If 0, Class=Mass Storage, Driver=usb-storage, 5000M

While the port would support 10000M, the adapter doesn’t, only superspeed. But: The error is still reproducible. So, it has nothing to do with any USB hub or port. When I read the above mentioned block, the adapter freezes.

slow adapter

This is an old USB-SATA adapter for USB 2.0 only. So it is very slow, but helps out for testing now.

$ lsusb
Bus 001 Device 015: ID 152d:2338 JMicron Technology Corp. / JMicron USA Technology Corp. JM20337 Hi-Speed USB to SATA & PATA Combo Bridge

$ lsusb -t
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/12p, 480M
    |__ Port 2: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M
        |__ Port 2: Dev 4, If 0, Class=Hub, Driver=hub/4p, 480M
            |__ Port 2: Dev 15, If 0, Class=Mass Storage, Driver=usb-storage, 480M

As expected, it is only USB 2.0 speed (480M). There are no errors in the log, even on the “bad” port of the hub. So, the problem can’t be the SSD.

new adapter

New adapter
New adapter's plate

$ lsusb
Bus 002 Device 031: ID 174c:0193 ASMedia Technology Inc. USB 3.0 Destop HD EP0 Product string

Directly connected to the notebook port:

$ lsusb -t
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/6p, 5000M
    |__ Port 1: Dev 31, If 0, Class=Mass Storage, Driver=uas, 5000M

Also notice the driver: it is “uas”. The other adapters where using here “usb-storage”.

At the hub port:

/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/6p, 5000M
    |__ Port 2: Dev 2, If 0, Class=Hub, Driver=hub/4p, 5000M
        |__ Port 2: Dev 4, If 0, Class=Hub, Driver=hub/4p, 5000M
            |__ Port 2: Dev 32, If 0, Class=Mass Storage, Driver=uas, 5000M

No errors in the log while mounting.

At the “bad” hub port:

/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/6p, 5000M
    |__ Port 2: Dev 2, If 0, Class=Hub, Driver=hub/4p, 5000M
        |__ Port 2: Dev 4, If 0, Class=Hub, Driver=hub/4p, 5000M
            |__ Port 1: Dev 33, If 0, Class=Mass Storage, Driver=uas, 5000M

Again no errors in the log while mounting.

Just to be sure, running badblocks again over the whole drive:

sudo badblocks -b 4096 -s -v /dev/sda1

vmstat 1 shows around 300MB/s - so this is the expected speed. iotop shows the same.

No problems, no errors in the log. Works like a charm.

bad adapter under Windows

I wonder, whether this adapter works under Windows. Or whether it also fails there. I used hddscan, a simple tool to read your hard drive. As I use ext4 as the filesystem, Windows of course won’t understand it. So I need a lower level tool to read the drive.

I ran some tests. First I ran “VERIFY” (read data without transfer). And it worked without a problem. But I guess, this test didn’t read at all anything. The task manager didn’t show any IO transfers. Not sure what this test even does.

The next test I started was “READ” (read data to the host). And there the problem also appeared. It got stuck, slowed down and so on. So, it definitely has some flaws.

reproduction

As mentiond earlier, I could identify the bad blocks between 66166400 and 70000000. This is about 1.5MiB (block size 4096). So, executing the following command reproducibly freezes the USB SATA adapter:

badblocks -s -v -b 4096 /dev/sda1 70000000 66166400

And this is perfectly reproducible even in slow connection. I plugged in the adapter into a USB 2.0 hub:

Bus 001 Device 017: ID 2537:1068 Norelsys NS1068/NS1068X SATA Bridge Controller
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/12p, 480M
    |__ Port 2: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M
        |__ Port 2: Dev 4, If 0, Class=Hub, Driver=hub/4p, 480M
            |__ Port 3: Dev 6, If 0, Class=Hub, Driver=hub/4p, 480M
                |__ Port 3: Dev 9, If 0, Class=Hub, Driver=hub/4p, 480M
                    |__ Port 2: Dev 17, If 0, Class=Mass Storage, Driver=usb-storage, 480M

It still freezes/hangs with above badblocks command.

There are other techniques to force a slow speed on USB devices, see https://bbs.archlinux.org/viewtopic.php?id=172005 or https://www.systutorials.com/how-to-force-a-usb-3-0-port-to-work-in-usb-2-0-mode-in-linux/. However, this doesn’t work for me, because I have a USB 3.0 only host (xhci_hcd).

But the USB2.0 Hub showed already that’s the problem is still present with 480M.

Conclusion

It seems, that my adapter is indeed flawed and it is not just a “problem” of Linux not supporting a device. I wonder, what actually happens - whether this is a firmware bug. And whether the adapter is broken by design or whether I just have a broken device - and a new one would work. I can’t tell that - I’ve only seen one device.

But the fact, that the Linux developers had to enable some quirks for this device, means that there is something not working according to the specs. Interestingly, Windows seems to use the UAS mode - at least it was shown in the device manager. So, maybe the Windows drivers is less picky with the specifications.