Opened 6 years ago
Closed 5 years ago
#74 closed defect (fixed)
ECS system failing to complete boot process
Reported by: | Barry Landy | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | Future |
Component: | IFS | Version: | |
Severity: | medium | Keywords: | |
Cc: |
Description
Using the lastest release (R357) the ECS22 boot process stopped apparently after completing the processing of CONFIG.SYS with a black screen.
I did the following experiments
ECS22: R913 booted OK ECS22: R316 booted OK ECS22 R357 black screen ECS22 debug nucleus booted OK apart from a trap1 (responded "g")
[I recall that at the end of the experiments for ticket 67 R349 also did not compelete the boot except in debug mode)
ECS22/OS4: R357 black screen ECS22 OS4 Debug: R357 booted OK
ARCAOS: R349 booted OK.
Difficult to know what trace or other information I can add in the circumstances.
Attachments (12)
Change History (58)
comment:1 by , 6 years ago
comment:2 by , 6 years ago
1) yes, black screen == no messages
2) repeated boots repeat the effect
3) My nucleus is 14.104a (the (a) means the COPY command bug is fixed)
4) I will try with /monitor and see what messages there are
5) I do not have loop.add
comment:3 by , 6 years ago
More experiments seem to show it is a timing issue. With /monitor the system completes the boot (slowly!) without the boot process stops.
With /monitor on I see a fairly long delay when the Z: RAM setup using QSINIT is being initialised but on the other hand when I have removed QSSETUP.CMD altogether I still bet the same result (booting with /montior completes booting without stops).
I assume loop.add will add a delay; please suggest what might be a good value.
comment:4 by , 6 years ago
I should clarify. When I said "stopping with a black screen" I actually meant that the system stops with the last message output by CONFIG.SYS processing on a black background. So the system is failing to start the WPS screen.
The last message that I see (currently) is the one from FAT32 saying that the Unicode table for code 850 has been initialised.
comment:5 by , 6 years ago
That is, probably, cachef32.exe starting "f32mount.exe /a", and f32mount has been hanged, for some reason. I start f32mount synchronously, so it can pause CONFIG.SYS processing. Now I changed it to asynchronous starting, so it should not stop booting. Please try ftp://osfree.org/upload/fat32/test/fat32-0.10-r358-os2.zip and see. It's strange, of course, that it stops only with 14.104a retail kernel, and without the /monitor switch. Actually, some timing issues.
Loop.add is a new driver for mounting disk images to a drive letter. It does not cause any delays. Cachef32.exe now has two threads polling both fat32.ifs and loop.add, so I suspected that it blocks in loop.add (I needed to isolate problems inside loop.add or fat32.ifs)
comment:6 by , 6 years ago
R358 does not seem to help. It still stops in the same place when booting normally and boots fine with /monitor
comment:7 by , 6 years ago
r358 or the test I provided in "test" subdir? It is newer than r358.
ftp://osfree.org/upload/fat32/test/fat32-0.10-r358-os2.zip, and not the ftp://osfree.org/upload/fat32/fat32-0.10-r358-os2.zip
comment:9 by , 6 years ago
Hm, strange. Does r356 also hang at the same place? What is a last good version?
comment:10 by , 6 years ago
I know that r316 works OK. I no longer have other old versions. please give me a pointer to download.
comment:11 by , 6 years ago
ftp://osfree.org/upload/fat32/ contains builds archive. If some version is missing, I can compile it for you, if you ask me. I suspect that this problem appeared between r356 and r357, so you need to try r356 first.
comment:12 by , 6 years ago
I have done the binary search with the result that version R326 boots OK and version R329 does not. R327/328 are not in the archive.
comment:15 by , 6 years ago
Ok, so, the "guilty" revision is r328, where I first added the disk image mounting feature. I added a thread to CACHEF32.EXE polling the IFS for READ/WRITE/OPEN/CLOSE commands and executing them via QEMUIMG.DLL.
Does it boot successfully if you comment CACHEF32.EXE out? By the chance, didn't you forgot to put QEMUIMG.DLL in \os2\dll? It will not start without it.
PS: Is there any harderrors in popuplog.os2 regarding cachef32.exe during te boot?
comment:16 by , 6 years ago
I tried commenting out CACHEF32.EXE early in this process and it made no difference (though I havent tried it with R328).
I switch versions by
<bootdrive>: unzip <version zip file>
so everything should go in the right place.
Looking at the ZIPs, recent versions have DLL\QEMUIMG.DLL but R328/9 does not. However if any version put it there it would have stayed. I looked at POPUPLOG early in the process and there was nothing relevant (latest entry 6 May)
comment:17 by , 6 years ago
I tried commenting out CACHEF32.EXE early in this process and it made no difference (though I havent tried it with R328).
So, it's probably fat32.ifs hangs
Looking at the ZIPs, recent versions have DLL\QEMUIMG.DLL but R328/9 does not.
Yes, this is because these versions have qemuimg.lib linked statically with cachef32.exe.
I looked at POPUPLOG early in the process and there was nothing relevant (latest entry 6 May)
Good. I suspected that it's cachef32.exe trapped and stopped the boot, but it's not true. Hm, no idea what else we could try, because, as you said, it does not stop booting if "/monitor" is specified on the command line. Otherwise, I'd suggest to boot with "/monitor" and look into the COM port, what it will write last.
comment:18 by , 6 years ago
Could you give me config.sys of your eCS 2.2? And from ArcaOS, to compare?
comment:19 by , 6 years ago
I booted R358(test) with /monitor
There was a VERY long wait at the point where without /monitor the boot hangs forever. I captured the trace buffer as x01 at the point where it was waiting.
by , 6 years ago
by , 6 years ago
Attachment: | config.ecs added |
---|
by , 6 years ago
Attachment: | config.arc added |
---|
comment:21 by , 6 years ago
An obviousd major diffeence ECS/ARCAOS is that for ECSI am using QSINIT and in particular to set up a FAT volume (Z:) from RAM.
I have experimented with removing that and the boot still stopped at the ususal place.
comment:22 by , 6 years ago
There was a VERY long wait at the point where without /monitor the boot hangs forever. I captured the trace buffer as x01 at the point where it was waiting.
So, on which line it has been a long wait, #475? or #473, or another?
Which process has been quit on line #476? cachef32.exe or f32mon.exe, or another?
comment:23 by , 6 years ago
It was waiting with the whole of the trace I attached displayed (so after line 476).
I have no idea which process....
comment:24 by , 6 years ago
Probably, it was cachef32.exe has been quit or crashed, and this unblocked the wait on line #475. No idea why waiting on line #475 (which is normal) should stop the system booting.
Did you checked, if cachef32.exe is running, after it booted to desktop? (what does "pstat /c" show?)
Are you sure that it stops booting with cachef32.exe commented out? Because if cachef32.exe didn't started, then no "GET_REQ" fsctl function is called, so no blocking in this function, an nothing can stop the boot process. As far as I can see, at line #475 cachef32.exe is blocked in "GET_REQ" fsctl (function 0x8014). Without cachef32.exe, this function is not called at all, so no block.
comment:25 by , 6 years ago
I decided to repeat the experiment.
Running the test version of R358, with CACHEF32 REMmed out.
1) normal booting stopped (for over 1 minute)
2) booting with /monitor *also stopped*
3) I checked that the CONFIG was correct and repeated it with the same result (boot stops)
I have two traces X02 and X03 which are slightly different (and different to X01). They both show processes exiting at 10 second intervals after the system went quiet.
by , 6 years ago
by , 6 years ago
comment:26 by , 6 years ago
Trace #02 contains stop on opening disk G: in DASD mode, trace #03 -- mounting disk G:, both stopping the boot. I have no idea why it could be stopped. Trace #01 is completely different -- it stops on GET_REQ ioctl.
What is the disk G:? Is it SSD again?
What is the difference in #02 and #03? Could you get a repeatable result?
comment:27 by , 6 years ago
All my disks are SSD. I have no reason to doubt they are working fine as nothing else is showing any errors. I have 3 SSDs in the system and could try moving G to a different one.
I have no idea why x02 and x03 are different - I can try again and see what happens. Do you know why there are 10 second intervals? Is that a timer?
comment:28 by , 6 years ago
Also I know that SSD, as any flash memory can be weared out. And I also know that OS/2 disk drivers lack TRIM command related to SSD's. So, you easily can get on a weared-out sector, and can hang on it, if tried to read it.
I have absolutely no idea why it stops and why 10 seconds interval. Processes exited at 10 sec interval? I need to know, which processes it is, otherwise no idea. Is something written on a COM port console when they're exiting? OS/4 kernels show when .sym files are loaded or closed. So maybe, these files have .sym files and we will see.
May be, its's f32chk.exe started for checking disk. It can be seen by loading .sym files.
comment:29 by , 6 years ago
i have moved the G: partition to a different SSD and same problem. X04 is the trace.
Will try again with the OS4 system.
by , 6 years ago
by , 6 years ago
comment:31 by , 6 years ago
As far as I can see, in all four cases, your disk G: has been lost, mounted for the second time, and previous, lost, instance accessed again (unmount or DASD open operation), and got stuck, because that instance is dead. Need to think, what I can do about it. Could you also try bl1.zip from the "test" subdirectory? (ftp://osfree.org/upload/fat32/test/bl1.zip), over the previous r358-test archive?
by , 6 years ago
by , 6 years ago
by , 6 years ago
by , 6 years ago
comment:38 by , 6 years ago
That one worked: successful boot with OS4. log in x09
I will now try without /monitor and with ECS kernel
comment:40 by , 6 years ago
Good, so this now works. It appears that your disk gets mounted, then it disappears (becomes uncertain), and gets mounted again, another VPB (Volume Parameter Block) gets created. Then FS_MOUNT detects that it is a dup VPB, and finds a previous one. And the new one is removed as a dup, unmount is called. So, this unmount hanged, and for being not hanged, I need to return ERROR_VOLUME_NOT_MOUNTED. So, I did it and now it's working. I committed the new revision, r359, it is available at the builds archive.
comment:41 by , 6 years ago
With R359, while boot completes successfully when booting from ECS there is a wait of nearly a minute at the point where it previously hung. This doesnt happen when booting from ArcaOS
Any idea of why this might be and what I could do about it?
comment:42 by , 6 years ago
Hum..... I restored the call to CACHEF32 which was REMmed as part of the experimentation and now there is no wait during the boot process.
comment:43 by , 6 years ago
I commented out CACHEF32 again and the boot process again had a long wait. I tried setting /monitor and there was no wait. Any useful diagnostics I can do?
comment:44 by , 6 years ago
If you cannot take a log of the wait with /monitor enabled, then I fear that there is no means for any other diagnostics. I'm out of ideas atm.
by , 6 years ago
comment:45 by , 6 years ago
OK - I caught a long wait. log is x10. boot is with /monitor and OS4 (to give you maximum data in the log)
comment:46 by , 5 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
It seems that the issue got solved. I'm closing this ticket. Regards
Black screen on starting PM? Black screen == no messages? Maybe, it is just common accident PM hang on SMP system? Did you tried to boot again? Sometimes it won't boot first, but after some number of tries, it boots ok.
I tried eCS 2.2 with kernels 14.106 and 14.104 (retail SMP ones), it boots ok. I see no problems on my machines.
Did you tried to boot with /monitor switch enabled, and look into COM port? What message appears last? Do you have loop.add in config.sys?