Opened 7 years ago

Closed 5 years ago

#74 closed defect (fixed)

ECS system failing to complete boot process

Reported by: Barry Landy Owned by:
Priority: major Milestone: Future
Component: IFS Version:
Severity: medium Keywords:
Cc:

Description

Using the lastest release (R357) the ECS22 boot process stopped apparently after completing the processing of CONFIG.SYS with a black screen.

I did the following experiments

ECS22: R913 booted OK ECS22: R316 booted OK ECS22 R357 black screen ECS22 debug nucleus booted OK apart from a trap1 (responded "g")

[I recall that at the end of the experiments for ticket 67 R349 also did not compelete the boot except in debug mode)

ECS22/OS4: R357 black screen ECS22 OS4 Debug: R357 booted OK

ARCAOS: R349 booted OK.

Difficult to know what trace or other information I can add in the circumstances.

Attachments (12)

x01.txt (23.5 KB ) - added by Barry Landy 6 years ago.
config.ecs (9.8 KB ) - added by Barry Landy 6 years ago.
config.arc (7.8 KB ) - added by Barry Landy 6 years ago.
x02.txt (12.3 KB ) - added by Barry Landy 6 years ago.
x03.txt (15.4 KB ) - added by Barry Landy 6 years ago.
x04.txt (17.3 KB ) - added by Barry Landy 6 years ago.
x05.txt (44.6 KB ) - added by Barry Landy 6 years ago.
x06.txt (45.0 KB ) - added by Barry Landy 6 years ago.
x07.txt (58.0 KB ) - added by Barry Landy 6 years ago.
x08.txt (45.9 KB ) - added by Barry Landy 6 years ago.
x09.txt (65.6 KB ) - added by Barry Landy 6 years ago.
x10.txt (55.3 KB ) - added by Barry Landy 6 years ago.

Download all attachments as: .zip

Change History (58)

comment:1 by Valery V. Sedletski, 7 years ago

Black screen on starting PM? Black screen == no messages? Maybe, it is just common accident PM hang on SMP system? Did you tried to boot again? Sometimes it won't boot first, but after some number of tries, it boots ok.

I tried eCS 2.2 with kernels 14.106 and 14.104 (retail SMP ones), it boots ok. I see no problems on my machines.

Did you tried to boot with /monitor switch enabled, and look into COM port? What message appears last? Do you have loop.add in config.sys?

comment:2 by Barry Landy, 6 years ago

1) yes, black screen == no messages

2) repeated boots repeat the effect

3) My nucleus is 14.104a (the (a) means the COPY command bug is fixed)

4) I will try with /monitor and see what messages there are

5) I do not have loop.add

comment:3 by Barry Landy, 6 years ago

More experiments seem to show it is a timing issue. With /monitor the system completes the boot (slowly!) without the boot process stops.

With /monitor on I see a fairly long delay when the Z: RAM setup using QSINIT is being initialised but on the other hand when I have removed QSSETUP.CMD altogether I still bet the same result (booting with /montior completes booting without stops).

I assume loop.add will add a delay; please suggest what might be a good value.

comment:4 by Barry Landy, 6 years ago

I should clarify. When I said "stopping with a black screen" I actually meant that the system stops with the last message output by CONFIG.SYS processing on a black background. So the system is failing to start the WPS screen.

The last message that I see (currently) is the one from FAT32 saying that the Unicode table for code 850 has been initialised.

comment:5 by Valery V. Sedletski, 6 years ago

That is, probably, cachef32.exe starting "f32mount.exe /a", and f32mount has been hanged, for some reason. I start f32mount synchronously, so it can pause CONFIG.SYS processing. Now I changed it to asynchronous starting, so it should not stop booting. Please try ftp://osfree.org/upload/fat32/test/fat32-0.10-r358-os2.zip and see. It's strange, of course, that it stops only with 14.104a retail kernel, and without the /monitor switch. Actually, some timing issues.

Loop.add is a new driver for mounting disk images to a drive letter. It does not cause any delays. Cachef32.exe now has two threads polling both fat32.ifs and loop.add, so I suspected that it blocks in loop.add (I needed to isolate problems inside loop.add or fat32.ifs)

comment:6 by Barry Landy, 6 years ago

R358 does not seem to help. It still stops in the same place when booting normally and boots fine with /monitor

comment:7 by Valery V. Sedletski, 6 years ago

Last edited 6 years ago by Valery V. Sedletski (previous) (diff)

comment:8 by Barry Landy, 6 years ago

I confirm that I used the test one.

comment:9 by Valery V. Sedletski, 6 years ago

Hm, strange. Does r356 also hang at the same place? What is a last good version?

comment:10 by Barry Landy, 6 years ago

I know that r316 works OK. I no longer have other old versions. please give me a pointer to download.

comment:11 by Valery V. Sedletski, 6 years ago

ftp://osfree.org/upload/fat32/ contains builds archive. If some version is missing, I can compile it for you, if you ask me. I suspect that this problem appeared between r356 and r357, so you need to try r356 first.

comment:12 by Barry Landy, 6 years ago

I have done the binary search with the result that version R326 boots OK and version R329 does not. R327/328 are not in the archive.

comment:13 by Valery V. Sedletski, 6 years ago

OK, I added r327 and r328 to the archive.

comment:14 by Barry Landy, 6 years ago

R327 boots correctly and R328 does not.

comment:15 by Valery V. Sedletski, 6 years ago

Ok, so, the "guilty" revision is r328, where I first added the disk image mounting feature. I added a thread to CACHEF32.EXE polling the IFS for READ/WRITE/OPEN/CLOSE commands and executing them via QEMUIMG.DLL.

Does it boot successfully if you comment CACHEF32.EXE out? By the chance, didn't you forgot to put QEMUIMG.DLL in \os2\dll? It will not start without it.

PS: Is there any harderrors in popuplog.os2 regarding cachef32.exe during te boot?

Last edited 6 years ago by Valery V. Sedletski (previous) (diff)

comment:16 by Barry Landy, 6 years ago

I tried commenting out CACHEF32.EXE early in this process and it made no difference (though I havent tried it with R328).

I switch versions by

<bootdrive>: unzip <version zip file>

so everything should go in the right place.

Looking at the ZIPs, recent versions have DLL\QEMUIMG.DLL but R328/9 does not. However if any version put it there it would have stayed. I looked at POPUPLOG early in the process and there was nothing relevant (latest entry 6 May)

comment:17 by Valery V. Sedletski, 6 years ago

I tried commenting out CACHEF32.EXE early in this process and it made no difference (though I havent tried it with R328).

So, it's probably fat32.ifs hangs

Looking at the ZIPs, recent versions have DLL\QEMUIMG.DLL but R328/9 does not.

Yes, this is because these versions have qemuimg.lib linked statically with cachef32.exe.

I looked at POPUPLOG early in the process and there was nothing relevant (latest entry 6 May)

Good. I suspected that it's cachef32.exe trapped and stopped the boot, but it's not true. Hm, no idea what else we could try, because, as you said, it does not stop booting if "/monitor" is specified on the command line. Otherwise, I'd suggest to boot with "/monitor" and look into the COM port, what it will write last.

comment:18 by Valery V. Sedletski, 6 years ago

Could you give me config.sys of your eCS 2.2? And from ArcaOS, to compare?

comment:19 by Barry Landy, 6 years ago

I booted R358(test) with /monitor

There was a VERY long wait at the point where without /monitor the boot hangs forever. I captured the trace buffer as x01 at the point where it was waiting.

by Barry Landy, 6 years ago

Attachment: x01.txt added

comment:20 by Barry Landy, 6 years ago

config.ecs and config.arc attached

by Barry Landy, 6 years ago

Attachment: config.ecs added

by Barry Landy, 6 years ago

Attachment: config.arc added

comment:21 by Barry Landy, 6 years ago

An obviousd major diffeence ECS/ARCAOS is that for ECSI am using QSINIT and in particular to set up a FAT volume (Z:) from RAM.

I have experimented with removing that and the boot still stopped at the ususal place.

comment:22 by Valery V. Sedletski, 6 years ago

There was a VERY long wait at the point where without /monitor the boot hangs forever. I captured the trace buffer as x01 at the point where it was waiting.

So, on which line it has been a long wait, #475? or #473, or another?

Which process has been quit on line #476? cachef32.exe or f32mon.exe, or another?

Last edited 6 years ago by Valery V. Sedletski (previous) (diff)

comment:23 by Barry Landy, 6 years ago

It was waiting with the whole of the trace I attached displayed (so after line 476).

I have no idea which process....

comment:24 by Valery V. Sedletski, 6 years ago

Probably, it was cachef32.exe has been quit or crashed, and this unblocked the wait on line #475. No idea why waiting on line #475 (which is normal) should stop the system booting.

Did you checked, if cachef32.exe is in the process list, after it booted?

Are you sure that it hangs with cachef32.exe commented out? Because if cachef32.exe didn't started, then no "GET_REQ" fsctl function is called, so no blocking in this function, an nothing can stop the boot process. As far as I can see, at line #475 cachef32.exe is blocked in "GET_REQ" fsctl (function 0x8014). Without cachef32.exe, this function is not called at all, so no block.

Version 1, edited 6 years ago by Valery V. Sedletski (previous) (next) (diff)

comment:25 by Barry Landy, 6 years ago

I decided to repeat the experiment.

Running the test version of R358, with CACHEF32 REMmed out.

1) normal booting stopped (for over 1 minute)

2) booting with /monitor *also stopped*

3) I checked that the CONFIG was correct and repeated it with the same result (boot stops)

I have two traces X02 and X03 which are slightly different (and different to X01). They both show processes exiting at 10 second intervals after the system went quiet.

by Barry Landy, 6 years ago

Attachment: x02.txt added

by Barry Landy, 6 years ago

Attachment: x03.txt added

comment:26 by Valery V. Sedletski, 6 years ago

Trace #02 contains stop on opening disk G: in DASD mode, trace #03 -- mounting disk G:, both stopping the boot. I have no idea why it could be stopped. Trace #01 is completely different -- it stops on GET_REQ ioctl.

What is the disk G:? Is it SSD again?

What is the difference in #02 and #03? Could you get a repeatable result?

Last edited 6 years ago by Valery V. Sedletski (previous) (diff)

comment:27 by Barry Landy, 6 years ago

All my disks are SSD. I have no reason to doubt they are working fine as nothing else is showing any errors. I have 3 SSDs in the system and could try moving G to a different one.

I have no idea why x02 and x03 are different - I can try again and see what happens. Do you know why there are 10 second intervals? Is that a timer?

comment:28 by Valery V. Sedletski, 6 years ago

Also I know that SSD, as any flash memory can be weared out. And I also know that OS/2 disk drivers lack TRIM command related to SSD's. So, you easily can get on a weared-out sector, and can hang on it, if tried to read it.

I have absolutely no idea why it stops and why 10 seconds interval. Processes exited at 10 sec interval? I need to know, which processes it is, otherwise no idea. Is something written on a COM port console when they're exiting? OS/4 kernels show when .sym files are loaded or closed. So maybe, these files have .sym files and we will see.

May be, its's f32chk.exe started for checking disk. It can be seen by loading .sym files.

Last edited 6 years ago by Valery V. Sedletski (previous) (diff)

comment:29 by Barry Landy, 6 years ago

i have moved the G: partition to a different SSD and same problem. X04 is the trace.

Will try again with the OS4 system.

by Barry Landy, 6 years ago

Attachment: x04.txt added

by Barry Landy, 6 years ago

Attachment: x05.txt added

comment:30 by Barry Landy, 6 years ago

OS4 shows the same symptom. I have attached the log as X05

comment:31 by Valery V. Sedletski, 6 years ago

As far as I can see, in all four cases, your disk G: has been lost, mounted for the second time, and previous, lost, instance accessed again (unmount or DASD open operation), and got stuck, because that instance is dead. Need to think, what I can do about it. Could you also try bl1.zip from the "test" subdirectory? (ftp://osfree.org/upload/fat32/test/bl1.zip), over the previous r358-test archive?

Last edited 6 years ago by Valery V. Sedletski (previous) (diff)

by Barry Landy, 6 years ago

Attachment: x06.txt added

comment:32 by Barry Landy, 6 years ago

did that. same stop. log is x06

comment:33 by Valery V. Sedletski, 6 years ago

Now please try bl2.zip from the same place.

by Barry Landy, 6 years ago

Attachment: x07.txt added

comment:34 by Barry Landy, 6 years ago

Done. log is x07

comment:35 by Valery V. Sedletski, 6 years ago

Now please, try bl3.zip.

by Barry Landy, 6 years ago

Attachment: x08.txt added

comment:36 by Barry Landy, 6 years ago

Done. As you can see that ended with a trap D. log is x08

comment:37 by Valery V. Sedletski, 6 years ago

Now test bl4.zip

by Barry Landy, 6 years ago

Attachment: x09.txt added

comment:38 by Barry Landy, 6 years ago

That one worked: successful boot with OS4. log in x09

I will now try without /monitor and with ECS kernel

comment:39 by Barry Landy, 6 years ago

yes, now boots fine in normal ECS running.

THANKS!!

comment:40 by Valery V. Sedletski, 6 years ago

Good, so this now works. It appears that your disk gets mounted, then it disappears (becomes uncertain), and gets mounted again, another VPB (Volume Parameter Block) gets created. Then FS_MOUNT detects that it is a dup VPB, and finds a previous one. And the new one is removed as a dup, unmount is called. So, this unmount hanged, and for being not hanged, I need to return ERROR_VOLUME_NOT_MOUNTED. So, I did it and now it's working. I committed the new revision, r359, it is available at the builds archive.

comment:41 by Barry Landy, 6 years ago

With R359, while boot completes successfully when booting from ECS there is a wait of nearly a minute at the point where it previously hung. This doesnt happen when booting from ArcaOS

Any idea of why this might be and what I could do about it?

comment:42 by Barry Landy, 6 years ago

Hum..... I restored the call to CACHEF32 which was REMmed as part of the experimentation and now there is no wait during the boot process.

comment:43 by Barry Landy, 6 years ago

I commented out CACHEF32 again and the boot process again had a long wait. I tried setting /monitor and there was no wait. Any useful diagnostics I can do?

comment:44 by Valery V. Sedletski, 6 years ago

If you cannot take a log of the wait with /monitor enabled, then I fear that there is no means for any other diagnostics. I'm out of ideas atm.

by Barry Landy, 6 years ago

Attachment: x10.txt added

comment:45 by Barry Landy, 6 years ago

OK - I caught a long wait. log is x10. boot is with /monitor and OS4 (to give you maximum data in the log)

comment:46 by martini, 5 years ago

Resolution: fixed
Status: newclosed

It seems that the issue got solved. I'm closing this ticket. Regards

Note: See TracTickets for help on using tickets.