Opened 7 years ago
Last modified 5 years ago
#71 assigned defect
FSH_DOVOLIO blocks on I/O errors with SSD's or USB flash sticks
Reported by: | Valery V. Sedletski | Owned by: | Valery V. Sedletski |
---|---|---|---|
Priority: | major | Milestone: | Future |
Component: | IFS | Version: | |
Severity: | high | Keywords: | hang block disk error I/O usb ssd |
Cc: |
Description
This problem is reported in #67 with SSD drives. I created a separate ticket because ticket #67 deals with another problem.
So, it looks like FSH_DOVOLIO hangs/blocks in the disk driver (USBMSD.ADD/OS2AHCI.ADD/DANIS506.ADD) when an I/O error occurs. This can happen on both READ and WRITE requests. No hard error is reported to the HARDERR.EXE daemon, and no error is returned by FSH_DOVOLIO. Currently, FSH_DOVOLIO is called with the ACKNOWLEDGE flag. I also tried adding ABORT/RETRY/FAIL flags instead, or a flag to report an error to HARDERR.EXE, but this did not help much. FSH_DOVOLIO just blocks in the disk driver. This can happen at boot time or at shutdown time, when the FSInfo sector is read or updated, so the boot or shutdown process just hangs. I have also had such problems with some USB flash sticks: it just hangs when writing a large amount of data (> 200 MB) to the flash stick. Also, if I remember correctly, I observed the same with DANIS506.ADD and dying SATA or IDE drives. These were I/O errors, because I heard beeps from DaniS on the PC speaker. DaniS usually beeps on I/O errors (mostly when bad sectors are hit).
With SSD drives, it can occur even without bad sectors, just from temporary READ/WRITE errors. The user who reported this checked his SSDs with special software and found no bad sectors. When he replaced his SSD with another one, the errors disappeared at first and everything booted OK. But later he observed read errors on the newer SSD too. The same is observed with USB drives on my machine. I have a USB controller on my motherboard, and I began getting hangs when writing large amounts of data to my flash stick. This occurred more frequently on some flash sticks and less frequently on others, and very rarely with the same flash sticks on other machines. So, I concluded that my USB controller is dying and causing many I/O errors (just as happened with my dying IDE drives). My machine is 12 years old, so this seemed plausible. So, I bought a separate PCI board with an additional USB controller. I had far fewer I/O errors on the external USB controller, even with the same flash sticks that produced many errors on the integrated USB controller. But recently, I observed an I/O error on the external USB controller too.
So, some SSDs, flash sticks, or USB controllers cause more or fewer I/O errors. These are not necessarily bad blocks; they may be temporary errors. I tried playing with the FSH_DOVOLIO flags, and this did not help much. Everything suggests that the blocking occurs in a disk driver. Maybe this can be fixed on the disk driver side? The question remains open.
PS: So far, such errors have not been observed on JFS drives. JFS does not use FSH_DOVOLIO; it calls the driver's strat2/strat3 routine directly. HPFS386 seems to behave like JFS does. So, apart from FAT32.IFS, this could in principle be observed on HPFS, but so far it has not been.
Change History (3)
comment:2 by , 7 years ago
2Lars: Yes, OS2DASD reissuing I/O requests in a loop is possible. But why, then, don't I have such problems with HPFS drives? HPFS is a 16-bit IFS too and should work the same way fat32.ifs does. The only difference seems to be that fat32.ifs currently works with strat1 only, whereas hpfs.ifs mostly works via strat2. (Ko Myung Hun removed the strat2 calls, for some reason.)
comment:3 by , 5 years ago
Owner: | set to |
---|---|
Status: | new → assigned |
The ADD specification clearly says that an ADD shall NEVER block (in its IORB entry point at least, no matter what IORB command). That means, an ADD shall NEVER EVER call DevHelp_ProcBlock from any place of its IORB entry point.
An ADD driver will always return directly to its requester (which is the DMD) but of course, it might be necessary to queue the IORB request (if it cannot be performed right away). If an IORB request finally finishes, the ADD driver calls the Notification Entrypoint inside the DMD (if the DMD specified one on the original IORB request).
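To illustrate the rule above, here is a minimal sketch in plain C of a non-blocking entry point: it only queues the request and returns, and the requester is notified later when the request completes. This is a simulation, not code from any of the drivers discussed. `Request`, `Queue`, `add_entry`, `add_complete_next` and `demo_notify` are hypothetical names and do not correspond to the real IORB structures or DevHelp calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the real IORB layout. */
typedef struct Request Request;
typedef void (*NotifyFn)(Request *req);

struct Request {
    int      command;  /* illustrative command code */
    int      status;   /* 0 = pending, nonzero = final result */
    NotifyFn notify;   /* completion callback from the requester, may be NULL */
    Request *next;     /* link in the driver's pending queue */
};

typedef struct {
    Request *head;
    Request *tail;
} Queue;

/* Entry point: enqueue and return at once. It never waits for hardware. */
void add_entry(Queue *q, Request *req)
{
    req->status = 0;
    req->next = NULL;
    if (q->tail)
        q->tail->next = req;
    else
        q->head = req;
    q->tail = req;
}

/* Called later (in a real driver, typically from interrupt handling):
 * finish the oldest queued request and notify its requester.
 * Returns 1 if a request was completed, 0 if the queue was empty. */
int add_complete_next(Queue *q, int result)
{
    Request *req = q->head;

    assert(result != 0);            /* a completed request needs a final status */
    if (req == NULL)
        return 0;
    q->head = req->next;
    if (q->head == NULL)
        q->tail = NULL;
    req->next = NULL;
    req->status = result;
    if (req->notify)
        req->notify(req);
    return 1;
}

/* Demo callback: counts how many completions were delivered. */
int notify_count = 0;
void demo_notify(Request *req)
{
    (void)req;
    notify_count++;
}
```

In this sketch the caller of `add_entry` gets control back immediately with the request still pending; the final status only becomes visible through the callback invoked by `add_complete_next`.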
I can say that USBMSD.ADD follows these rules, and I am fairly sure that DANIS506.ADD does too. Given that, I find it highly unlikely that OS2AHCI.ADD violates them.
Now, the other possible cause of failure would be an ADD getting into an infinite loop. However, since you mention 3 different ADD drivers, I consider it highly unlikely that all 3 have an infinite loop in their IORB entry point processing.
That makes me believe that it is the DMD (OS2DASD.DMD for the most part) that blocks.
What I did observe is that the DMD (OS2DASD.DMD) occasionally called into USBMSD.ADD multiple times in a row to reissue a request if it was "not happy with the ADD response" (the error code that the ADD reported in the DMD notification routine). Maybe there is an error condition that will lead to an endless repetition of a request by the DMD.
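The suspected failure mode can be sketched as follows. This is only an illustration of how a reissue loop behaves with and without an attempt limit; it is not OS2DASD.DMD code, and `issue_with_retry_limit`, `FlakyDevice` and `flaky_issue` are hypothetical names. A requester that reissues a failing request without any limit never returns if the error persists, whereas a capped loop eventually gives up and can report the error upward.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request function: returns 0 on success, nonzero on error. */
typedef int (*IssueFn)(void *ctx);

enum { RETRY_OK = 0, RETRY_GAVE_UP = -1 };

/* Reissue the request until it succeeds or max_attempts is reached.
 * Without the max_attempts check, a persistent error would make this
 * loop spin forever, which looks like a hang to the caller. */
int issue_with_retry_limit(IssueFn issue, void *ctx,
                           int max_attempts, int *attempts_out)
{
    int attempts = 0;
    int rc = -1;

    assert(issue != NULL);
    while (attempts < max_attempts) {
        attempts++;
        rc = issue(ctx);
        if (rc == 0)
            break;
    }
    if (attempts_out)
        *attempts_out = attempts;
    return rc == 0 ? RETRY_OK : RETRY_GAVE_UP;
}

/* Demo device: fails its first fail_count calls, then succeeds. */
typedef struct {
    int fail_count;
    int calls;
} FlakyDevice;

int flaky_issue(void *ctx)
{
    FlakyDevice *d = (FlakyDevice *)ctx;
    d->calls++;
    return d->calls <= d->fail_count ? 1 : 0;
}
```

With a device that fails a couple of times (a temporary error), the capped loop succeeds after a few reissues; with a device that keeps failing, it stops at the limit instead of repeating the request endlessly.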