Opened 14 years ago
Closed 14 years ago
#28 closed defect (fixed)
smp hangs
Reported by: | Silvan Scherrer | Owned by: | |
---|---|---|---|
Priority: | blocker | Milestone: | odinized java |
Component: | odin | Version: | |
Severity: | Keywords: | ||
Cc: |
Description
odin tends to hang the active process when a odinized app is terminated.
it may be due to DosExitList() hangs.
Attachments (8)
Change History (25)
comment:1 by , 14 years ago
Milestone: | → odinized flash |
---|
comment:2 by , 14 years ago
Milestone: | odinized flash → odinized java |
---|
comment:3 by , 14 years ago
comment:4 by , 14 years ago
Note that PROCDUMP creates a PDUMP file which CHKDSK fixes after reboot. But apparently this file is not complete and cannot be explored with PMDF.EXE, unfortunately.
comment:5 by , 14 years ago
After getting a kernel fix for PMDF.EXE, I may force the dump of the hung process:
Current slot number: 00a0 Slot Pid Ppid Csid Ord Sta Pri pTSD pPTDA pTCB Disp SG Name *00a0# 004b 004a 004b 0001 blk 0500 f8fde000 fe634730 f935cc98 0eb0 15 THREADS eax=00000000 ebx=0000efc0 ecx=00150053 edx=00000000 esi=0011ff58 edi=297a0000 eip=1ffca29d esp=0015eb34 ebp=0015ec60 iopl=0 -- -- -- nv up ei pl nz ac po cy cs=005b ss=0053 ds=0053 es=0053 fs=150b gs=0000 cr2=00000000 cr3=00000000 p=** 005b:1ffca29d 83c410 add esp,+10 ............ Ring three stack for all slots Ring three stack for slot a0: THREADS Base Ptr Parameters Caller -------- ---------------- ------ %0015ec60 0000 0000 0000 0000 ee20 0015 ec78 0015 DOSCALL1 _xcptGetRegistrationHead + 6 %0015ed28 0000 0000 0000 0000 ee20 0015 0053 0000 DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 151 %0015ee4c 0000 ffff 0000 ee60 0015 ee84 0015 0001 Unknown ff48 0011 0000 0000 %0011ff48 0001 0000 0000 0000 000a 004b 000b 004b LIBC064X ___dup2 + 57 %0011ff68 0000 0000 0000 0000 0000 0000 0000 0000 LIBC064X __std_waitid + 1f %0011ff88 0000 0000 0000 0000 ffa8 0011 d754 1df2 LIBC064X ___ctordtorTerm1 + 5d %0011ffa8 0001 0000 0000 ffff ffff 17d0 1df2 ffc5 Unknown ffe0 0011 002a 0001 %0011ffe0 0000 0000 ce38 1ffe 1593 0000 0000 0000 LIBC064X __std_strxfrm + 4b Ring three stack for slot a6: THREADS Base Ptr Parameters Caller -------- ---------------- ------ %035bec50 070c 15b4 ffff ffff ef34 035b edf8 035b WGSS50 WGSS50_2000 + 9c6 %035bec7c 1d8f 0012 6800 0000 0000 0001 0000 ec9c Unknown eccc 035b 9c5b 1d8f %035beccc ee60 035b 0f78 030f ee84 035b ed24 035b DOSCALL1 _xcptR3ExceptionDispatcher + 88 %035bed2c ee60 035b ee84 035b 0053 0000 fe94 030e DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 64 %035bee4c 0000 ffff 0000 ee60 035b ee84 035b 0002 Unknown fe88 030e 0000 0000 Ring three stack for slot a7: THREADS Base Ptr Parameters Caller -------- ---------------- ------ %035cec50 070c 15b4 ffff ffff ef34 035c edf8 035c WGSS50 WGSS50_2000 + 9c6 %035cec7c 1d8f 0014 6800 0000 0000 0001 0000 ec9c Unknown eccc 035c 9c5b 1d8f %035ceccc ee60 035c 0f78 0321 ee84 035c ed24 035c DOSCALL1 _xcptR3ExceptionDispatcher + 88 %035ced2c ee60 035c ee84 035c 0053 0000 fe94 0320 DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 64 %035cee4c 0000 ffff 0000 ee60 035c ee84 035c 0002 Unknown fe88 0320 0000 0000 Ring three stack for slot a8: THREADS Base Ptr Parameters Caller -------- ---------------- ------ %035dec50 070c 15b4 ffff ffff ef34 035d edf8 035d WGSS50 WGSS50_2000 + 9c6 %035dec7c 1d8f 0016 6800 0000 0000 0001 0000 ec9c Unknown eccc 035d 9c5b 1d8f %035deccc ee60 035d 0f78 0334 ee84 035d ed24 035d DOSCALL1 _xcptR3ExceptionDispatcher + 88 %035ded2c ee60 035d ee84 035d 0053 0000 fe94 0333 DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 64 %035dee4c 0000 ffff 0000 ee60 035d ee84 035d 0002 Unknown fe88 0333 0000 0000 Ring three stack for slot a9: THREADS Base Ptr Parameters Caller -------- ---------------- ------ %035eec50 070c 15b4 ffff ffff ef34 035e edf8 035e WGSS50 WGSS50_2000 + 9c6 %035eec7c 1d8f 0018 6800 0000 0000 0001 0000 ec9c Unknown eccc 035e 9c5b 1d8f %035eeccc ee60 035e 0f78 0346 ee84 035e ed24 035e DOSCALL1 _xcptR3ExceptionDispatcher + 88 %035eed2c ee60 035e ee84 035e 0053 0000 fe94 0345 DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 64 %035eee4c 0000 ffff 0000 ee60 035e ee84 035e 0002 Unknown fe88 0345 0000 0000 Ring three stack for slot aa: THREADS Base Ptr Parameters Caller -------- ---------------- ------ %035fec50 070c 15b4 ffff ffff ef34 035f edf8 035f WGSS50 WGSS50_2000 + 9c6 %035fec7c 1d8f 001a 6800 0000 0000 0001 0000 ec9c Unknown eccc 035f 9c5b 1d8f %035feccc ee60 035f 0f78 0359 ee84 035f ed24 035f DOSCALL1 _xcptR3ExceptionDispatcher + 88 %035fed2c ee60 035f ee84 035f 0053 0000 fe94 0358 DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 64 %035fee4c 0000 ffff 0000 ee60 035f ee84 035f 0002 Unknown fe88 0358 0000 0000 # Checking call gates for all slots Slot THREADS a0: is in a call to: %1ffc6fb4 DOSCALL1 DOS32EXCEPTIONCALLBACK Slot THREADS a6: is in a call to: %1ffc08e8 DOSCALL1 DOS32WAITEVENTSEM Slot THREADS a7: is in a call to: %1ffc08e8 DOSCALL1 DOS32WAITEVENTSEM Slot THREADS a8: is in a call to: %1ffc08e8 DOSCALL1 DOS32WAITEVENTSEM Slot THREADS a9: is in a call to: %1ffc08e8 DOSCALL1 DOS32WAITEVENTSEM Slot THREADS aa: is in a call to: %1ffc08e8 DOSCALL1 DOS32WAITEVENTSEM
And there is also some more trace information if you select each thread individually (which seems to include the kernel entry points):
for thread a0: Ring Three Stack Trace Location Address Symbol Description -------- ------- ------ ----------- %15eb46 %00120040 OS2KRNL _demVDMStartDoneSem + 13 %15eb56 %00120000 OS2KRNL _aseldemTemp %15eb96 %001201cc OS2KRNL _apiop + 3d4 %15ebe2 1800:0015 InitProcessor + 11 %15ec16 %0012164c OS2KRNL _apvdmbp + 2c8 %15ec64 %1ffc0000 DOSCALL1 DOS32GETINFOBLOCKS + 15 %15eca0 %1ffca2a8 DOSCALL1 _xcptExecuteUserExceptionHandler + 46 %15eca4 %1ffc1e59 DOSCALL1 DOS32PSETPATHINFO + 1a3 %15ecac %1ffca2a8 DOSCALL1 _xcptExecuteUserExceptionHandler + 65 %15ecc0 %1ffca330 DOSCALL1 _xcptR3ExceptionHandler %15ecdc %1ffc9e8c DOSCALL1 _xcptR3ExceptionDispatcher + 10e %15ed14 %1ffc0000 DOSCALL1 DOS32GETINFOBLOCKS + 15 %15ed2c %1ffc9d18 DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 151 %15ed4c %1e8d6014 PMMERGE CharRect32 + 5be %15ee04 %1ffc0049 DOSCALL1 DOS32QUERYSYSINFO + a %15ee0c %1ffc77d4 DOSCALL1 HT32_Exit + 5 %15ee2c ffef:0000e3c0 DOSCALL1 DOSWRITE + 21 %15ee3a 9807:0000115f DOSCALL1 segsize + d8c %15ee6c ffef:0000e3c0 DOSCALL1 DOSWRITE + 21 %15eea0 %fff4d252 _VMFreeMemOd + 2b2 Releasing Memory %15eec8 %fff6725a _VMFreeMem + 1ea Releasing Memory %15eef0 %ffec7c2c OS2KRNL _ahvmSys %15ef22 ffef:0000e3c0 DOSCALL1 DOSWRITE + 21 for thread aa: Ring Three Stack Trace Location Address Symbol Description -------- ------- ------ ----------- %35fec54 %1d8fad88 WGSS50 WGSS50_2000 + 9c6 %35feca0 %1ffca2a8 DOSCALL1 _xcptExecuteUserExceptionHandler + 46 %35fecc0 %1ffca330 DOSCALL1 _xcptR3ExceptionHandler %35fecd0 %1ffc9e8c DOSCALL1 _xcptR3ExceptionDispatcher + 88 %35fece4 %1ffca330 DOSCALL1 _xcptR3ExceptionHandler %35fed30 %1ffc9d18 DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 64 %35fed60 %fff4d252 _VMFreeMemOd + 2b2 Releasing Memory %35fed88 %fff6725a _VMFreeMem + 1ea Releasing Memory %35fedb0 %ffec7c2c OS2KRNL _ahvmSys %35fede6 ffef:0000e3c0 DOSCALL1 DOSWRITE + 21 %35fee6c ffef:0000e3c0 DOSCALL1 DOSWRITE + 21 %35fee9c %fff4d252 _VMFreeMemOd + 2b2 Releasing Memory %35feec4 %fff6725a _VMFreeMem + 1ea Releasing Memory %35feeec %ffec7c2c OS2KRNL _ahvmSys %35fef22 ffef:0000e3c0 DOSCALL1 DOSWRITE + 21
comment:6 by , 14 years ago
I made some tests that, together with the output above, show that the hang happens not in the exit list handler as I previously thought but rather somewhere in the exception handler when handling the XCPT_[ASYNC_]PROCESS_TERMINATE exception (which is sent when the process terminates).
I don't know though how accurate these traces are.
comment:7 by , 14 years ago
What we may also say after further analysis of the dump file (which I'm attaching here just for reference) is that all non-main threads are blocked by the kernel waiting on the same event semaphore while thread 1 is blocked on something else.
The last call it makes is DOS32EXCEPTIONCALLBACK which then causes a wait for some condition...
by , 14 years ago
Attachment: | PDUMP.000.threads.dosexit.part1.rar added |
---|
by , 14 years ago
Attachment: | PDUMP.000.threads.dosexit.part2.rar added |
---|
comment:8 by , 14 years ago
I found something really interesting: installing an own exception handler at the beginning of main() completely fixes the exit hang issue in the test case (tried to start/stop the test 100 times from the loop, also tried intensive Ctrl-C). Looks like my guess that WGSS somehow messes up the exception chain is correct.
I will try to apply this workaround to Odin itself and see how it behaves.
comment:9 by , 14 years ago
Note also that marking the testcase's EXE file as MPUNSAFE completely solves the hang problem too, even w/o the additional exception handler trick.
I now start getting an impression that structured exception handling is not thread-safe on OS/2 in SMP mode. This would explain both WGSS hangs at termination and JVM crashes -- JVM uses memory mapped files which Odin implements using exceptions too. For example it could be that memory is being accessed the second time before the exception handler triggered by the first access attempt commits it. And it seems that the MPUNSAFE flag solves this exception problem (by serializing exceptions?).
comment:10 by , 14 years ago
Not that exception handling itself is unsafe (that would be rather strange and should've caused problems all around the system, not only in Odin) but the way WGSS/Odin handle these exceptions is unsafe -- without realizing the fact that on SMP two threads *can* actually run in parallel. So, in case of JVM, there are probably two threads accessing an uncommitted memory block, and the second one is doing this right after the first one did but before the first one commits it, and it doesn't cause a commit itself because it somehow thinks the first one has already did that. It's just a guess so far.
comment:11 by , 14 years ago
JFYI. As I found in toolsref.inf, MARKEXE makes sure that only one thread of a process gets executed at any time (on any of the available CPUs). Among other things, this means that ops such as INC and CMPXCHG without the LOCK prefix will still provide atomicity as if they were executed on a single CPU (which was frequently used in old applications for implementing lightweight semaphores and mutexes).
So it is now more and more clear what's wrong in Odin/WGSS.
comment:12 by , 14 years ago
Unfortunately this dumb workaround doesn't work for JAVA.EXE: there, we already install an OS/2 exception handler on top of the chain before running main() (through RegisterLxExe()) so it basically does the very same thing as I do in the test case. Seems that it only works in the test case because the added dumb exception handler somehow changes timings while this doesn't happen in case of JAVA.EXE.
I also got a dump of the hung JAVA.EXE. In general, the structure of the hang is the same as with my test case, but there is some difference in the stack trace of the main thread:
Base Ptr Parameters Caller -------- ---------------- ------ %0081f5a0 0000 0000 0000 0000 f760 0081 f5b8 0081 DOSCALL1 _xcptGetRegistrationHead + 6 %0081f668 0000 0000 0000 0000 f760 0081 0053 aaaa DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 151 %0081f78c 0000 ffff 0000 f7a0 0081 f7c4 0081 0002 Unknown f938 0081 0000 0000 %0081f938 0001 0000 f96c 0081 0000 0000 ffff ffff WGSS50 WGSS50_O32_WaitForSingleObject + 2e %0081f990 1d92 82c0 009d 0154 2000 ffff ffff a7d0 Unknown f9a8 0081 91bd 1d92 %0081f9a8 1d8e 82c0 009d 0154 2000 ffff ffff 0007 Unknown f9f8 0081 e794 1d8e %0081f9f8 1d8c 0007 0000 ffff ffff fa1c 0081 dcd4 Unknown fa08 0081 1cb9 1d8c %0081fa08 1d93 0007 0000 ffff ffff aaaa aaaa fa7c Unknown fa1c 0081 dcd4 1d93 %0081fa1c 0001 0007 0000 ffff ffff 0ad0 0001 fe0c Unknown fa7c 0081 6308 0001 %0081fa7c 0001 0ad0 0001 0000 0000 0000 0000 fe0c Unknown fe5c 0081 23f5 0001 %0081fe5c 0001 0002 0000 691c 200f 0000 0000 aaaa Unknown fe7c 0081 401c 0001 %0081fe7c 1d91 4000 210f 0000 0000 9a40 009d 0001 Unknown feb4 0081 3bb1 1d91 %0081feb4 1d91 9720 009d 4000 0001 0000 0000 ff18 Unknown fee8 0081 3930 1d91 %0081fee8 0001 4000 0001 0000 0000 ff18 0081 0001 Unknown ff08 0081 526f 0001 %0081ff08 0001 ffe0 0081 0012 0001 0004 0000 ff2c Unknown ffe0 0081 0012 0001 %0081ffe0 0003 0000 ce38 1ffe 15dd 0000 0000 0000 LIBC064X __std_strxfrm + 4b Ring Three Stack Trace Location Address Symbol Description -------- ------- ------ ----------- %81f59c %1d83ad88 WGSS50 WGSS50_2000 + dbb %81f5a4 %1ffc0049 DOSCALL1 DOS32QUERYSYSINFO + 38 %81f5e0 %1ffca2a8 DOSCALL1 _xcptExecuteUserExceptionHandler + 46 %81f5e4 %1ffc1e59 DOSCALL1 DOS32PSETPATHINFO + 1a3 %81f5ec %1ffca2a8 DOSCALL1 _xcptExecuteUserExceptionHandler + 65 %81f600 %1ffca330 DOSCALL1 _xcptR3ExceptionHandler %81f61c %1ffc9e8c DOSCALL1 _xcptR3ExceptionDispatcher + 10e %81f654 %1ffc0049 DOSCALL1 DOS32QUERYSYSINFO + 38 %81f66c %1ffc9d18 DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 151 %81f6a0 %fff4d252 _VMFreeMemOd + 2b2 Releasing Memory %81f6c8 %fff6725a _VMFreeMem + 1ea Releasing Memory %81f6f0 %ffec7c2c OS2KRNL _ahvmSys %81f76c %1d83ad88 WGSS50 WGSS50_2000 + dbb %81f7ac %1d83ad88 WGSS50 WGSS50_2000 + dbb %81f7dc %fff4d252 _VMFreeMemOd + 2b2 Releasing Memory %81f804 %fff6725a _VMFreeMem + 1ea Releasing Memory %81f82c %ffec7c2c OS2KRNL _ahvmSys %81f860 %1d83ad88 WGSS50 WGSS50_2000 + dbb
The difference is that in case of Java, Thread 1 fires off another "main" thread and enters the wait state (through WaitForSingleObject()). So the process gets terminated by another thread and Thread 1 gets XCPT_ASYNC_PROCESS_TERMINATE as a result while being in the wait state. Then the situation looks similar to the test case again.
Note that now we may see that the kernel is calling WGSS50_2000 (If I don't misinterpret the second part of the trace) which seems to be an exception handler inside WGSS50.DLL.
by , 14 years ago
Attachment: | PDUMP.000.java.part1.rar added |
---|
by , 14 years ago
Attachment: | PDUMP.000.java.part2.rar added |
---|
comment:13 by , 14 years ago
Attached fresh THREADS.EXE DosExit hang dumps, with the SYSVM option (and with all non-system .SYM files) as requested by Steven.
by , 14 years ago
Attachment: | PDUMP.000.threads.dosexit.sysvm.part01.rar added |
---|
by , 14 years ago
Attachment: | PDUMP.000.threads.dosexit.sysvm.part02.rar added |
---|
by , 14 years ago
Attachment: | PDUMP.000.threads.dosexit.sysvm.part03.rar added |
---|
by , 14 years ago
Attachment: | PDUMP.000.threads.dosexit.sysvm.part04.rar added |
---|
comment:14 by , 14 years ago
With the help of Steve, the following is now evident regarding the state of the threads:
- thread 1 has sent XCPT_ASYNC_PROCESS_TERMINATE to all remaining threads and is waiting for them to handle this exception and die. This is normal and works as designed.
- All threads in their XCPT_ASYNC_PROCESS_TERMINATE handlers are waiting for some critical section (in Win32 terms) through DosEnterCriticalSection() ODINCRT API.
- This critical section is owned by a thread that has already died without leaving it, so that threads from step 2. are never going to be resumed and therefore thread 1 will never get control back.
Most likely the dead thread just ended in response to XCPT_ASYNC_PROCESS_TERMINATE which happened to happen while it was holding the critical section's semaphore.
I will now follow the code path and think on how this can be prevented.
comment:15 by , 14 years ago
JFYI, the test case (its main thread) does the following:
- Starts 10 threads, each of them sleeps for some hundred ms and exits.
- Waits 2 seconds and exits the process. Some threads are still running by that time.
I added some code to make sure the critical section is not locked when the thread owning it terminates.
What I now see is that when the first thread among the still running ones gets XCPT_ASYNC_PROCESS_TERMINATE twice, the second time as a nested call. In some cases such nested call leaves the critical section locked and all other threads hang. I don't yet know about the source of the second nested XCPT_ASYNC_PROCESS_TERMINATE. Tracing the code.
comment:16 by , 14 years ago
After performing the deep dump file analysis, I found that it's DosCloseEventSem() which throws the nested XCPT_ASYNC_PROCESS_TERMINATE. Wonderful. Why on earth would it do that?
This DosCloseEventSem() call is made from the original WGSS50 exception handler under the critical section lock hence it gets never released.
comment:17 by , 14 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
As of r21626 (and especially after r21620), hangs at process termination should have gone. I performed some excessive testing (running a terminating app in a loop 100 times) against the release version of Odin and can not see hangs any more.
While being in that part of code, I also made unwinding of Win32 exception handlers work -- this was completely disabled in Odin although application may rely on these handlers as they are a common place to cleanup resources (which includes releasing locks etc) when the thread gets unexpectedly terminated by the user or by the system.
I also had to completely review the SEH support code as I found some bugs in it (since unwinding the Win32 exception handlers was not done before these bugs would never be triggered).
Note that the process termination may still hang on SMP if there are lots of threads and you use the debug version of Odin with enabled logging. My analysis shows that the hangs happen within the ODIN (VAC) C runtime libraries deep inside the printf() function while handling some unwind collision. This probably needs some more investigation one day but not relevant so far and there are more important tasks.
Also, there is at least one known crash at process termination, even with the release Odin libraries: the routine in JVM.DLL that prints the stack trace sometimes triggers invalid memory access. I created http://svn.netlabs.org/java/ticket/97 for this particular issue.
The next thing to do is checking if the MMF implementation is SMP-safe. See http://svn.netlabs.org/java/ticket/96 for details. I may want to create an Odin ticket for it as well.
I slightly changed the threads testcase by adding a delay in r21608, so it's now relatively easy to reproduce the hang on SMP: just press Ctrl-Break before all threads terminate normally. In 5 attempts out of 10 you will get a hang (and the ExitThread1 process state, according to kTaskMgr). This proves the assumption that the hang happens when there are multiple active threads and the control goes to thread 1 as a result of either of the signals that are always handled by thread 1 (e.g. normal process exit request or Ctrl-Break request).
Attaching the debugger to the hung process never succeeds.
I'm now trying to use PROCDUMP to force the process dump but it traps the kernel so far (which causes a very long hard disk check afterwards, so keep that in mind).