Opened 6 years ago

Last modified 6 years ago

#38 new comment

After some time, disk stops working

Reported by: BlondeGuy
Owned by:
Priority: Feedback Pending
Milestone:
Component: driver
Version: 1.32
Keywords:
Cc:

Description

The eCS system is using OS2AHCI version 1.32. It was put into production last fall, and it has failed three times. The system runs well for at least three weeks, possibly longer, before failing. When it fails the system appears to be locked up. After a reboot, our application log files are zero length.

I need help to debug this problem. The disk may have been partitioned with OS2AHCI version 1.29. How can I detect if the disk geometry is a problem? It is hard to debug a problem that occurs so infrequently on a production machine.

There is a possibility that this isn't an AHCI bug, but something else that makes it impossible for the program to write to the disk. Any idea on how to tell if this is even an AHCI bug?
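The zero-length log files are the only evidence that survives each failure, so it may help to list them right after a reboot, before the application recreates them. The following is a minimal sketch, not part of the original report: the function name `find_empty_files` and the directory passed to it are hypothetical, and it assumes a Python interpreter is available on the machine. The modification times it prints are only a hint about when each file was last touched, not proof of when writes stopped.

```python
import os
from datetime import datetime


def find_empty_files(root):
    """Return a sorted list of (path, mtime) for every zero-length file under root."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except OSError:
                # File vanished or is unreadable; skip it rather than abort the scan.
                continue
            if st.st_size == 0:
                stamp = datetime.fromtimestamp(st.st_mtime).isoformat(timespec="seconds")
                empty.append((full, stamp))
    return sorted(empty)
```

Calling `find_empty_files` on the application's log directory (whatever path that is on the production system) and saving the result after each incident would show whether the same files are affected every time.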

Change History (6)

comment:1 Changed 6 years ago by David Azarewicz

Priority changed from major to Feedback Pending

What exactly does it mean that "the system appears to be locked up"? No cursor movement, no system activity, hardware reset required? or can you just reboot?

The first question that needs to be answered is "what is the system doing when it stops?". A system dump would probably be the best way to tell.

Why do you suspect OS2AHCI? Are there other symptoms not mentioned in this ticket?

It is extremely unlikely this has anything to do with disk geometries.

comment:2 Changed 6 years ago by BlondeGuy

I have to do this second hand, but certainly NetBIOS stops. It's good that disk geometry is likely not an issue. The 12 serial ports stop. Likely disk activity stops, but since the files are zero length afterward, it's hard to tell.

Things like cursor movement and hardware reset are more difficult to tell when it's a remote system. I will try to find out, but in the meantime, I may want to move this issue to an eCS support issue. Are we pretty sure disk geometry would not lead to a system that worked 24/7 for three weeks, then stopped writing to the disk?

I think a system dump is out of the question if the disk doesn't function. Where would the system dump go?

If debugging is needed, I'll need to convince people to buy another computer to do that with.

The first time we had this problem, the SSD was corrupted to the point of not working at all. I could not wipe it, even with DFSee. Even moving it to a different machine did nothing to make the first SSD available. That's why we suspect the AHCI driver, and then with the note about disk geometry, it looked like we might be on to something.

My second guess is that some system resource is exhausted, and a disk write can no longer be done. That will be a lot harder to track down.
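One way to narrow down when disk writes stop, without a second machine, might be a small heartbeat process that appends a timestamp to a file and forces it to disk on every beat. The sketch below is an illustration only and was not part of the original comment. The function names, the file name, and the interval are hypothetical, and it assumes that `os.fsync` on this platform's Python port really does push data through the file system cache to the device. If that holds, the last line in the heartbeat file after a failure gives an upper bound on how long writes had been failing.

```python
import os
import time
from datetime import datetime, timezone


def write_heartbeat(path, note=""):
    """Append one timestamped line and force it to disk before returning."""
    line = (datetime.now(timezone.utc).isoformat() + " " + note).rstrip() + "\n"
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line)
        fh.flush()             # push Python's buffer to the operating system
        os.fsync(fh.fileno())  # ask the OS to commit the data to the device
    return line


def run(path, interval_s=60.0, beats=None):
    """Write a heartbeat every interval_s seconds; beats=None means run forever."""
    count = 0
    while beats is None or count < beats:
        write_heartbeat(path, "beat=%d" % count)
        count += 1
        time.sleep(interval_s)
```

For example, `run("heartbeat.log", interval_s=60)` would leave one line per minute. If the heartbeat file is also zero length after a lockup, that would suggest the data never reached the disk even with an explicit sync, which points below the application level.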

comment:3 in reply to:  2 Changed 6 years ago by David Azarewicz

Replying to BlondeGuy:

I have to do this second hand, but certainly NetBIOS stops. It's good that disk geometry is likely not an issue. The 12 serial ports stop. Likely disk activity stops, but since the files are zero length afterward, it's hard to tell.

It sure sounds like a file system problem. What file system is it (JFS?) and what version of the IFS are you using?

Things like cursor movement and hardware reset are more difficult to tell when it's a remote system. I will try to find out, but in the meantime, I may want to move this issue to an eCS support issue. Are we pretty sure disk geometry would not lead to a system that worked 24/7 for three weeks, then stopped writing to the disk?

Yes. Geometry problems tend to manifest as partitioning problems. The geometry problems with 1.29 were limited to some problematic BIOSes. You can be pretty sure you don't have the problem from 1.29 if DFSee doesn't report any multiple DLAT sector problems. The problems caused by versions prior to 1.27 were MUCH worse.

If the system is locked solid, then a system dump can't be done. Otherwise, if the system is still responsive, a system dump can be done. That is the reason for the questions.

I think a system dump is out of the question if the disk doesn't function. Where would the system dump go?

If the disk *hardware* is not working then yes, it won't work. Otherwise, if the disk *hardware* still functions, a system dump will work. You just need a partition set up for it and the right settings in config.sys. The partition can even be on a different disk.

If debugging is needed, I'll need to convince people to buy another computer to do that with.

A system dump should tell us everything we need to know.

The first time we had this problem, the SSD was corrupted to the point of not working at all. I could not wipe it, even with DFSee. Even moving it to a different machine did nothing to make the first SSD available. That's why we suspect the AHCI driver, and then with the note about disk geometry, it looked like we might be on to something.

That sounds like an SSD hardware failure.

My second guess is that some system resource is exhausted, and a disk write can no longer be done. That will be a lot harder to track down.

That sounds logical. Or an SMP race condition. Still a system dump should tell all.

comment:4 Changed 6 years ago by BlondeGuy

The file system is JFS version 1.9.5. They got back to me about the locked up state. The keyboard and mouse clicks don't work, but the mouse moves. They rebooted by holding down the power button.

As far as SMP, I have MAXCPU=1 to prevent that. I will look into creating a system dump.

I have opened eCS bug 3660 for this issue. If you like you can close this bug, since it appears unlikely to be an AHCI bug.

http://bugs.ecomstation.nl/view.php?id=3660

comment:5 in reply to:  4 Changed 6 years ago by David Azarewicz

Replying to BlondeGuy:

The file system is JFS version 1.9.5. They got back to me about the locked up state. The keyboard and mouse clicks don't work, but the mouse moves. They rebooted by holding down the power button.

This is not a good sign. If you can't reboot with Ctrl-Alt-Del then you probably cannot get a system dump with Ctrl-Alt-F10-F10.

As far as SMP, I have MAXCPU=1 to prevent that. I will look into creating a system dump.

Good. That eliminates SMP race conditions.

I have opened eCS bug 3660 for this issue. If you like you can close this bug, since it appears unlikely to be an AHCI bug.

AHCI is still not ruled out.

comment:6 Changed 6 years ago by David Azarewicz

I would recommend running a full chkdsk on the disk to see if there are any file system errors. Use "chkdsk /f /o".
