Opened 17 months ago

Last modified 17 months ago

#74 new defect

ECS system failing to complete boot process

Reported by: landb Owned by:
Priority: major Milestone: Future
Component: IFS Version:
Severity: medium Keywords:
Cc:

Description

Using the lastest release (R357) the ECS22 boot process stopped apparently after completing the processing of CONFIG.SYS with a black screen.

I did the following experiments

ECS22: R913 booted OK ECS22: R316 booted OK ECS22 R357 black screen ECS22 debug nucleus booted OK apart from a trap1 (responded "g")

[I recall that at the end of the experiments for ticket 67 R349 also did not compelete the boot except in debug mode)

ECS22/OS4: R357 black screen ECS22 OS4 Debug: R357 booted OK

ARCAOS: R349 booted OK.

Difficult to know what trace or other information I can add in the circumstances.

Attachments (12)

x01.txt (23.5 KB) - added by landb 17 months ago.
config.ecs (9.8 KB) - added by landb 17 months ago.
config.arc (7.8 KB) - added by landb 17 months ago.
x02.txt (12.3 KB) - added by landb 17 months ago.
x03.txt (15.4 KB) - added by landb 17 months ago.
x04.txt (17.3 KB) - added by landb 17 months ago.
x05.txt (44.6 KB) - added by landb 17 months ago.
x06.txt (45.0 KB) - added by landb 17 months ago.
x07.txt (58.0 KB) - added by landb 17 months ago.
x08.txt (45.9 KB) - added by landb 17 months ago.
x09.txt (65.6 KB) - added by landb 17 months ago.
x10.txt (55.3 KB) - added by landb 17 months ago.

Download all attachments as: .zip

Change History (57)

comment:1 Changed 17 months ago by valerius

Black screen on starting PM? Black screen == no messages? Maybe, it is just common accident PM hang on SMP system? Did you tried to boot again? Sometimes it won't boot first, but after some number of tries, it boots ok.

I tried eCS 2.2 with kernels 14.106 and 14.104 (retail SMP ones), it boots ok. I see no problems on my machines.

Did you tried to boot with /monitor switch enabled, and look into COM port? What message appears last? Do you have loop.add in config.sys?

comment:2 Changed 17 months ago by landb

1) yes, black screen == no messages

2) repeated boots repeat the effect

3) My nucleus is 14.104a (the (a) means the COPY command bug is fixed)

4) I will try with /monitor and see what messages there are

5) I do not have loop.add

comment:3 Changed 17 months ago by landb

More experiments seem to show it is a timing issue. With /monitor the system completes the boot (slowly!) without the boot process stops.

With /monitor on I see a fairly long delay when the Z: RAM setup using QSINIT is being initialised but on the other hand when I have removed QSSETUP.CMD altogether I still bet the same result (booting with /montior completes booting without stops).

I assume loop.add will add a delay; please suggest what might be a good value.

comment:4 Changed 17 months ago by landb

I should clarify. When I said "stopping with a black screen" I actually meant that the system stops with the last message output by CONFIG.SYS processing on a black background. So the system is failing to start the WPS screen.

The last message that I see (currently) is the one from FAT32 saying that the Unicode table for code 850 has been initialised.

comment:5 Changed 17 months ago by valerius

That is, probably, cachef32.exe starting "f32mount.exe /a", and f32mount has been hanged, for some reason. I start f32mount synchronously, so it can pause CONFIG.SYS processing. Now I changed it to asynchronous starting, so it should not stop booting. Please try ftp://osfree.org/upload/fat32/test/fat32-0.10-r358-os2.zip and see. It's strange, of course, that it stops only with 14.104a retail kernel, and without the /monitor switch. Actually, some timing issues.

Loop.add is a new driver for mounting disk images to a drive letter. It does not cause any delays. Cachef32.exe now has two threads polling both fat32.ifs and loop.add, so I suspected that it blocks in loop.add (I needed to isolate problems inside loop.add or fat32.ifs)

comment:6 Changed 17 months ago by landb

R358 does not seem to help. It still stops in the same place when booting normally and boots fine with /monitor

comment:7 Changed 17 months ago by valerius

Last edited 17 months ago by valerius (previous) (diff)

comment:8 Changed 17 months ago by landb

I confirm that I used the test one.

comment:9 Changed 17 months ago by valerius

Hm, strange. Does r356 also hang at the same place? What is a last good version?

comment:10 Changed 17 months ago by landb

I know that r316 works OK. I no longer have other old versions. please give me a pointer to download.

comment:11 Changed 17 months ago by valerius

ftp://osfree.org/upload/fat32/ contains builds archive. If some version is missing, I can compile it for you, if you ask me. I suspect that this problem appeared between r356 and r357, so you need to try r356 first.

comment:12 Changed 17 months ago by landb

I have done the binary search with the result that version R326 boots OK and version R329 does not. R327/328 are not in the archive.

comment:13 Changed 17 months ago by valerius

OK, I added r327 and r328 to the archive.

comment:14 Changed 17 months ago by landb

R327 boots correctly and R328 does not.

comment:15 Changed 17 months ago by valerius

Ok, so, the "guilty" revision is r328, where I first added the disk image mounting feature. I added a thread to CACHEF32.EXE polling the IFS for READ/WRITE/OPEN/CLOSE commands and executing them via QEMUIMG.DLL.

Does it boot successfully if you comment CACHEF32.EXE out? By the chance, didn't you forgot to put QEMUIMG.DLL in \os2\dll? It will not start without it.

PS: Is there any harderrors in popuplog.os2 regarding cachef32.exe during te boot?

Last edited 17 months ago by valerius (previous) (diff)

comment:16 Changed 17 months ago by landb

I tried commenting out CACHEF32.EXE early in this process and it made no difference (though I havent tried it with R328).

I switch versions by

<bootdrive>: unzip <version zip file>

so everything should go in the right place.

Looking at the ZIPs, recent versions have DLL\QEMUIMG.DLL but R328/9 does not. However if any version put it there it would have stayed. I looked at POPUPLOG early in the process and there was nothing relevant (latest entry 6 May)

comment:17 Changed 17 months ago by valerius

I tried commenting out CACHEF32.EXE early in this process and it made no difference (though I havent tried it with R328).

So, it's probably fat32.ifs hangs

Looking at the ZIPs, recent versions have DLL\QEMUIMG.DLL but R328/9 does not.

Yes, this is because these versions have qemuimg.lib linked statically with cachef32.exe.

I looked at POPUPLOG early in the process and there was nothing relevant (latest entry 6 May)

Good. I suspected that it's cachef32.exe trapped and stopped the boot, but it's not true. Hm, no idea what else we could try, because, as you said, it does not stop booting if "/monitor" is specified on the command line. Otherwise, I'd suggest to boot with "/monitor" and look into the COM port, what it will write last.

comment:18 Changed 17 months ago by valerius

Could you give me config.sys of your eCS 2.2? And from ArcaOS, to compare?

comment:19 Changed 17 months ago by landb

I booted R358(test) with /monitor

There was a VERY long wait at the point where without /monitor the boot hangs forever. I captured the trace buffer as x01 at the point where it was waiting.

Changed 17 months ago by landb

comment:20 Changed 17 months ago by landb

config.ecs and config.arc attached

Changed 17 months ago by landb

Changed 17 months ago by landb

comment:21 Changed 17 months ago by landb

An obviousd major diffeence ECS/ARCAOS is that for ECSI am using QSINIT and in particular to set up a FAT volume (Z:) from RAM.

I have experimented with removing that and the boot still stopped at the ususal place.

comment:22 Changed 17 months ago by valerius

There was a VERY long wait at the point where without /monitor the boot hangs forever. I captured the trace buffer as x01 at the point where it was waiting.

So, on which line it has been a long wait, #475? or #473, or another?

Which process has been quit on line #476? cachef32.exe or f32mon.exe, or another?

Last edited 17 months ago by valerius (previous) (diff)

comment:23 Changed 17 months ago by landb

It was waiting with the whole of the trace I attached displayed (so after line 476).

I have no idea which process....

comment:24 Changed 17 months ago by valerius

Probably, it was cachef32.exe has been quit or crashed, and this unblocked the wait on line #475. No idea why waiting on line #475 (which is normal) should stop the system booting.

Did you checked, if cachef32.exe is running, after it booted to desktop? (what does "pstat /c" show?)

Are you sure that it stops booting with cachef32.exe commented out? Because if cachef32.exe didn't started, then no "GET_REQ" fsctl function is called, so no blocking in this function, an nothing can stop the boot process. As far as I can see, at line #475 cachef32.exe is blocked in "GET_REQ" fsctl (function 0x8014). Without cachef32.exe, this function is not called at all, so no block.

Last edited 17 months ago by valerius (previous) (diff)

comment:25 Changed 17 months ago by landb

I decided to repeat the experiment.

Running the test version of R358, with CACHEF32 REMmed out.

1) normal booting stopped (for over 1 minute)

2) booting with /monitor *also stopped*

3) I checked that the CONFIG was correct and repeated it with the same result (boot stops)

I have two traces X02 and X03 which are slightly different (and different to X01). They both show processes exiting at 10 second intervals after the system went quiet.

Changed 17 months ago by landb

Changed 17 months ago by landb

comment:26 Changed 17 months ago by valerius

Trace #02 contains stop on opening disk G: in DASD mode, trace #03 -- mounting disk G:, both stopping the boot. I have no idea why it could be stopped. Trace #01 is completely different -- it stops on GET_REQ ioctl.

What is the disk G:? Is it SSD again?

What is the difference in #02 and #03? Could you get a repeatable result?

Last edited 17 months ago by valerius (previous) (diff)

comment:27 Changed 17 months ago by landb

All my disks are SSD. I have no reason to doubt they are working fine as nothing else is showing any errors. I have 3 SSDs in the system and could try moving G to a different one.

I have no idea why x02 and x03 are different - I can try again and see what happens. Do you know why there are 10 second intervals? Is that a timer?

comment:28 Changed 17 months ago by valerius

Also I know that SSD, as any flash memory can be weared out. And I also know that OS/2 disk drivers lack TRIM command related to SSD's. So, you easily can get on a weared-out sector, and can hang on it, if tried to read it.

I have absolutely no idea why it stops and why 10 seconds interval. Processes exited at 10 sec interval? I need to know, which processes it is, otherwise no idea. Is something written on a COM port console when they're exiting? OS/4 kernels show when .sym files are loaded or closed. So maybe, these files have .sym files and we will see.

May be, its's f32chk.exe started for checking disk. It can be seen by loading .sym files.

Last edited 17 months ago by valerius (previous) (diff)

comment:29 Changed 17 months ago by landb

i have moved the G: partition to a different SSD and same problem. X04 is the trace.

Will try again with the OS4 system.

Changed 17 months ago by landb

Changed 17 months ago by landb

comment:30 Changed 17 months ago by landb

OS4 shows the same symptom. I have attached the log as X05

comment:31 Changed 17 months ago by valerius

As far as I can see, in all four cases, your disk G: has been lost, mounted for the second time, and previous, lost, instance accessed again (unmount or DASD open operation), and got stuck, because that instance is dead. Need to think, what I can do about it. Could you also try bl1.zip from the "test" subdirectory? (ftp://osfree.org/upload/fat32/test/bl1.zip), over the previous r358-test archive?

Last edited 17 months ago by valerius (previous) (diff)

Changed 17 months ago by landb

comment:32 Changed 17 months ago by landb

did that. same stop. log is x06

comment:33 Changed 17 months ago by valerius

Now please try bl2.zip from the same place.

Changed 17 months ago by landb

comment:34 Changed 17 months ago by landb

Done. log is x07

comment:35 Changed 17 months ago by valerius

Now please, try bl3.zip.

Changed 17 months ago by landb

comment:36 Changed 17 months ago by landb

Done. As you can see that ended with a trap D. log is x08

comment:37 Changed 17 months ago by valerius

Now test bl4.zip

Changed 17 months ago by landb

comment:38 Changed 17 months ago by landb

That one worked: successful boot with OS4. log in x09

I will now try without /monitor and with ECS kernel

comment:39 Changed 17 months ago by landb

yes, now boots fine in normal ECS running.

THANKS!!

comment:40 Changed 17 months ago by valerius

Good, so this now works. It appears that your disk gets mounted, then it disappears (becomes uncertain), and gets mounted again, another VPB (Volume Parameter Block) gets created. Then FS_MOUNT detects that it is a dup VPB, and finds a previous one. And the new one is removed as a dup, unmount is called. So, this unmount hanged, and for being not hanged, I need to return ERROR_VOLUME_NOT_MOUNTED. So, I did it and now it's working. I committed the new revision, r359, it is available at the builds archive.

comment:41 Changed 17 months ago by landb

With R359, while boot completes successfully when booting from ECS there is a wait of nearly a minute at the point where it previously hung. This doesnt happen when booting from ArcaOS

Any idea of why this might be and what I could do about it?

comment:42 Changed 17 months ago by landb

Hum..... I restored the call to CACHEF32 which was REMmed as part of the experimentation and now there is no wait during the boot process.

comment:43 Changed 17 months ago by landb

I commented out CACHEF32 again and the boot process again had a long wait. I tried setting /monitor and there was no wait. Any useful diagnostics I can do?

comment:44 Changed 17 months ago by valerius

If you cannot take a log of the wait with /monitor enabled, then I fear that there is no means for any other diagnostics. I'm out of ideas atm.

Changed 17 months ago by landb

comment:45 Changed 17 months ago by landb

OK - I caught a long wait. log is x10. boot is with /monitor and OS4 (to give you maximum data in the log)

Note: See TracTickets for help on using tickets.