Opened 14 years ago

Closed 14 years ago

#28 closed defect (fixed)

smp hangs

Reported by: Silvan Scherrer Owned by:
Priority: blocker Milestone: odinized java
Component: odin Version:
Severity: Keywords:
Cc:

Description

odin tends to hang the active process when a odinized app is terminated.

it may be due to DosExitList() hangs.

Attachments (8)

PDUMP.000.threads.dosexit.part1.rar (200.0 KB ) - added by dmik 14 years ago.
PDUMP.000.threads.dosexit.part2.rar (102.5 KB ) - added by dmik 14 years ago.
PDUMP.000.java.part1.rar (200.0 KB ) - added by dmik 14 years ago.
PDUMP.000.java.part2.rar (179.3 KB ) - added by dmik 14 years ago.
PDUMP.000.threads.dosexit.sysvm.part01.rar (200.0 KB ) - added by dmik 14 years ago.
PDUMP.000.threads.dosexit.sysvm.part02.rar (200.0 KB ) - added by dmik 14 years ago.
PDUMP.000.threads.dosexit.sysvm.part03.rar (200.0 KB ) - added by dmik 14 years ago.
PDUMP.000.threads.dosexit.sysvm.part04.rar (122.9 KB ) - added by dmik 14 years ago.

Download all attachments as: .zip

Change History (25)

comment:1 by Silvan Scherrer, 14 years ago

Milestone: odinized flash

comment:2 by Silvan Scherrer, 14 years ago

Milestone: odinized flashodinized java

comment:3 by dmik, 14 years ago

I slightly changed the threads testcase by adding a delay in r21608, so it's now relatively easy to reproduce the hang on SMP: just press Ctrl-Break before all threads terminate normally. In 5 attempts out of 10 you will get a hang (and the ExitThread1 process state, according to kTaskMgr). This proves the assumption that the hang happens when there are multiple active threads and the control goes to thread 1 as a result of either of the signals that are always handled by thread 1 (e.g. normal process exit request or Ctrl-Break request).

Attaching the debugger to the hung process never succeeds.

I'm now trying to use PROCDUMP to force the process dump but it traps the kernel so far (which causes a very long hard disk check afterwards, so keep that in mind).

comment:4 by dmik, 14 years ago

Note that PROCDUMP creates a PDUMP file which CHKDSK fixes after reboot. But apparently this file is not complete and cannot be explored with PMDF.EXE, unfortunately.

comment:5 by dmik, 14 years ago

After getting a kernel fix for PMDF.EXE, I may force the dump of the hung process:

Current slot number: 00a0 
 
 Slot  Pid  Ppid Csid Ord  Sta Pri  pTSD     pPTDA    pTCB     Disp SG Name 
*00a0# 004b 004a 004b 0001 blk 0500 f8fde000 fe634730 f935cc98 0eb0 15 THREADS 
 
eax=00000000 ebx=0000efc0 ecx=00150053 edx=00000000 esi=0011ff58 edi=297a0000 
eip=1ffca29d esp=0015eb34 ebp=0015ec60 iopl=0 -- -- -- nv up ei pl nz ac po cy 
cs=005b ss=0053 ds=0053 es=0053 fs=150b gs=0000 cr2=00000000 cr3=00000000 p=** 
005b:1ffca29d 83c410         add       esp,+10 

............
Ring three stack for all slots

Ring three stack for slot   a0: THREADS 
  Base Ptr   Parameters                                Caller    
  --------   ----------------                          ------    

  %0015ec60  0000 0000 0000 0000 ee20 0015 ec78 0015   DOSCALL1 _xcptGetRegistrationHead + 6
  %0015ed28  0000 0000 0000 0000 ee20 0015 0053 0000   DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 151
  %0015ee4c  0000 ffff 0000 ee60 0015 ee84 0015 0001   Unknown ff48 0011 0000 0000 
  %0011ff48  0001 0000 0000 0000 000a 004b 000b 004b   LIBC064X ___dup2 + 57
  %0011ff68  0000 0000 0000 0000 0000 0000 0000 0000   LIBC064X __std_waitid + 1f
  %0011ff88  0000 0000 0000 0000 ffa8 0011 d754 1df2   LIBC064X ___ctordtorTerm1 + 5d
  %0011ffa8  0001 0000 0000 ffff ffff 17d0 1df2 ffc5   Unknown ffe0 0011 002a 0001 
  %0011ffe0  0000 0000 ce38 1ffe 1593 0000 0000 0000   LIBC064X __std_strxfrm + 4b
Ring three stack for slot   a6: THREADS 
  Base Ptr   Parameters                                Caller    
  --------   ----------------                          ------    

  %035bec50  070c 15b4 ffff ffff ef34 035b edf8 035b   WGSS50 WGSS50_2000 + 9c6
  %035bec7c  1d8f 0012 6800 0000 0000 0001 0000 ec9c   Unknown eccc 035b 9c5b 1d8f 
  %035beccc  ee60 035b 0f78 030f ee84 035b ed24 035b   DOSCALL1 _xcptR3ExceptionDispatcher + 88
  %035bed2c  ee60 035b ee84 035b 0053 0000 fe94 030e   DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 64
  %035bee4c  0000 ffff 0000 ee60 035b ee84 035b 0002   Unknown fe88 030e 0000 0000 
Ring three stack for slot   a7: THREADS 
  Base Ptr   Parameters                                Caller    
  --------   ----------------                          ------    

  %035cec50  070c 15b4 ffff ffff ef34 035c edf8 035c   WGSS50 WGSS50_2000 + 9c6
  %035cec7c  1d8f 0014 6800 0000 0000 0001 0000 ec9c   Unknown eccc 035c 9c5b 1d8f 
  %035ceccc  ee60 035c 0f78 0321 ee84 035c ed24 035c   DOSCALL1 _xcptR3ExceptionDispatcher + 88
  %035ced2c  ee60 035c ee84 035c 0053 0000 fe94 0320   DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 64
  %035cee4c  0000 ffff 0000 ee60 035c ee84 035c 0002   Unknown fe88 0320 0000 0000 
Ring three stack for slot   a8: THREADS 
  Base Ptr   Parameters                                Caller    
  --------   ----------------                          ------    

  %035dec50  070c 15b4 ffff ffff ef34 035d edf8 035d   WGSS50 WGSS50_2000 + 9c6
  %035dec7c  1d8f 0016 6800 0000 0000 0001 0000 ec9c   Unknown eccc 035d 9c5b 1d8f 
  %035deccc  ee60 035d 0f78 0334 ee84 035d ed24 035d   DOSCALL1 _xcptR3ExceptionDispatcher + 88
  %035ded2c  ee60 035d ee84 035d 0053 0000 fe94 0333   DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 64
  %035dee4c  0000 ffff 0000 ee60 035d ee84 035d 0002   Unknown fe88 0333 0000 0000 
Ring three stack for slot   a9: THREADS 
  Base Ptr   Parameters                                Caller    
  --------   ----------------                          ------    

  %035eec50  070c 15b4 ffff ffff ef34 035e edf8 035e   WGSS50 WGSS50_2000 + 9c6
  %035eec7c  1d8f 0018 6800 0000 0000 0001 0000 ec9c   Unknown eccc 035e 9c5b 1d8f 
  %035eeccc  ee60 035e 0f78 0346 ee84 035e ed24 035e   DOSCALL1 _xcptR3ExceptionDispatcher + 88
  %035eed2c  ee60 035e ee84 035e 0053 0000 fe94 0345   DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 64
  %035eee4c  0000 ffff 0000 ee60 035e ee84 035e 0002   Unknown fe88 0345 0000 0000 
Ring three stack for slot   aa: THREADS 
  Base Ptr   Parameters                                Caller    
  --------   ----------------                          ------    

  %035fec50  070c 15b4 ffff ffff ef34 035f edf8 035f   WGSS50 WGSS50_2000 + 9c6
  %035fec7c  1d8f 001a 6800 0000 0000 0001 0000 ec9c   Unknown eccc 035f 9c5b 1d8f 
  %035feccc  ee60 035f 0f78 0359 ee84 035f ed24 035f   DOSCALL1 _xcptR3ExceptionDispatcher + 88
  %035fed2c  ee60 035f ee84 035f 0053 0000 fe94 0358   DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 64
  %035fee4c  0000 ffff 0000 ee60 035f ee84 035f 0002   Unknown fe88 0358 0000 0000 

#            Checking call gates for all slots
Slot THREADS    a0: 
 is in a call to: 
%1ffc6fb4 DOSCALL1 DOS32EXCEPTIONCALLBACK  
Slot THREADS    a6: 
 is in a call to: 
%1ffc08e8 DOSCALL1 DOS32WAITEVENTSEM  
Slot THREADS    a7: 
 is in a call to: 
%1ffc08e8 DOSCALL1 DOS32WAITEVENTSEM  
Slot THREADS    a8: 
 is in a call to: 
%1ffc08e8 DOSCALL1 DOS32WAITEVENTSEM  
Slot THREADS    a9: 
 is in a call to: 
%1ffc08e8 DOSCALL1 DOS32WAITEVENTSEM  
Slot THREADS    aa: 
 is in a call to: 
%1ffc08e8 DOSCALL1 DOS32WAITEVENTSEM  

And there is also some more trace information if you select each thread individually (which seems to include the kernel entry points):

for thread a0:

                               Ring Three Stack Trace
 Location    Address       Symbol                         Description
 --------    -------       ------                         -----------
 %15eb46   %00120040 OS2KRNL _demVDMStartDoneSem + 13
 %15eb56   %00120000 OS2KRNL _aseldemTemp
 %15eb96   %001201cc OS2KRNL _apiop + 3d4
 %15ebe2   1800:0015     InitProcessor + 11            
 %15ec16   %0012164c OS2KRNL _apvdmbp + 2c8
 %15ec64   %1ffc0000 DOSCALL1 DOS32GETINFOBLOCKS + 15
 %15eca0   %1ffca2a8 DOSCALL1 _xcptExecuteUserExceptionHandler + 46
 %15eca4   %1ffc1e59 DOSCALL1 DOS32PSETPATHINFO + 1a3
 %15ecac   %1ffca2a8 DOSCALL1 _xcptExecuteUserExceptionHandler + 65
 %15ecc0   %1ffca330 DOSCALL1 _xcptR3ExceptionHandler
 %15ecdc   %1ffc9e8c DOSCALL1 _xcptR3ExceptionDispatcher + 10e
 %15ed14   %1ffc0000 DOSCALL1 DOS32GETINFOBLOCKS + 15
 %15ed2c   %1ffc9d18 DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 151
 %15ed4c   %1e8d6014 PMMERGE CharRect32 + 5be
 %15ee04   %1ffc0049 DOSCALL1 DOS32QUERYSYSINFO + a
 %15ee0c   %1ffc77d4 DOSCALL1 HT32_Exit + 5
 %15ee2c   ffef:0000e3c0 DOSCALL1 DOSWRITE + 21  
 %15ee3a   9807:0000115f DOSCALL1 segsize + d8c  
 %15ee6c   ffef:0000e3c0 DOSCALL1 DOSWRITE + 21  
 %15eea0   %fff4d252     _VMFreeMemOd + 2b2             Releasing Memory
 %15eec8   %fff6725a     _VMFreeMem + 1ea               Releasing Memory
 %15eef0   %ffec7c2c OS2KRNL _ahvmSys
 %15ef22   ffef:0000e3c0 DOSCALL1 DOSWRITE + 21  

for thread aa:

                               Ring Three Stack Trace
 Location    Address       Symbol                         Description
 --------    -------       ------                         -----------
 %35fec54   %1d8fad88 WGSS50 WGSS50_2000 + 9c6
 %35feca0   %1ffca2a8 DOSCALL1 _xcptExecuteUserExceptionHandler + 46
 %35fecc0   %1ffca330 DOSCALL1 _xcptR3ExceptionHandler
 %35fecd0   %1ffc9e8c DOSCALL1 _xcptR3ExceptionDispatcher + 88
 %35fece4   %1ffca330 DOSCALL1 _xcptR3ExceptionHandler
 %35fed30   %1ffc9d18 DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 64
 %35fed60   %fff4d252     _VMFreeMemOd + 2b2             Releasing Memory
 %35fed88   %fff6725a     _VMFreeMem + 1ea               Releasing Memory
 %35fedb0   %ffec7c2c OS2KRNL _ahvmSys
 %35fede6   ffef:0000e3c0 DOSCALL1 DOSWRITE + 21  
 %35fee6c   ffef:0000e3c0 DOSCALL1 DOSWRITE + 21  
 %35fee9c   %fff4d252     _VMFreeMemOd + 2b2             Releasing Memory
 %35feec4   %fff6725a     _VMFreeMem + 1ea               Releasing Memory
 %35feeec   %ffec7c2c OS2KRNL _ahvmSys
 %35fef22   ffef:0000e3c0 DOSCALL1 DOSWRITE + 21  

comment:6 by dmik, 14 years ago

I made some tests that, together with the output above, show that the hang happens not in the exit list handler as I previously thought but rather somewhere in the exception handler when handling the XCPT_[ASYNC_]PROCESS_TERMINATE exception (which is sent when the process terminates).

I don't know though how accurate these traces are.

comment:7 by dmik, 14 years ago

What we may also say after further analysis of the dump file (which I'm attaching here just for reference) is that all non-main threads are blocked by the kernel waiting on the same event semaphore while thread 1 is blocked on something else.

The last call it makes is DOS32EXCEPTIONCALLBACK which then causes a wait for some condition...

comment:8 by dmik, 14 years ago

I found something really interesting: installing an own exception handler at the beginning of main() completely fixes the exit hang issue in the test case (tried to start/stop the test 100 times from the loop, also tried intensive Ctrl-C). Looks like my guess that WGSS somehow messes up the exception chain is correct.

I will try to apply this workaround to Odin itself and see how it behaves.

comment:9 by dmik, 14 years ago

Note also that marking the testcase's EXE file as MPUNSAFE completely solves the hang problem too, even w/o the additional exception handler trick.

I now start getting an impression that structured exception handling is not thread-safe on OS/2 in SMP mode. This would explain both WGSS hangs at termination and JVM crashes -- JVM uses memory mapped files which Odin implements using exceptions too. For example it could be that memory is being accessed the second time before the exception handler triggered by the first access attempt commits it. And it seems that the MPUNSAFE flag solves this exception problem (by serializing exceptions?).

comment:10 by dmik, 14 years ago

Not that exception handling itself is unsafe (that would be rather strange and should've caused problems all around the system, not only in Odin) but the way WGSS/Odin handle these exceptions is unsafe -- without realizing the fact that on SMP two threads *can* actually run in parallel. So, in case of JVM, there are probably two threads accessing an uncommitted memory block, and the second one is doing this right after the first one did but before the first one commits it, and it doesn't cause a commit itself because it somehow thinks the first one has already did that. It's just a guess so far.

comment:11 by dmik, 14 years ago

JFYI. As I found in toolsref.inf, MARKEXE makes sure that only one thread of a process gets executed at any time (on any of the available CPUs). Among other things, this means that ops such as INC and CMPXCHG without the LOCK prefix will still provide atomicity as if they were executed on a single CPU (which was frequently used in old applications for implementing lightweight semaphores and mutexes).

So it is now more and more clear what's wrong in Odin/WGSS.

comment:12 by dmik, 14 years ago

Unfortunately this dumb workaround doesn't work for JAVA.EXE: there, we already install an OS/2 exception handler on top of the chain before running main() (through RegisterLxExe()) so it basically does the very same thing as I do in the test case. Seems that it only works in the test case because the added dumb exception handler somehow changes timings while this doesn't happen in case of JAVA.EXE.

I also got a dump of the hung JAVA.EXE. In general, the structure of the hang is the same as with my test case, but there is some difference in the stack trace of the main thread:

  Base Ptr   Parameters                                Caller    
  --------   ----------------                          ------    

  %0081f5a0  0000 0000 0000 0000 f760 0081 f5b8 0081   DOSCALL1 _xcptGetRegistrationHead + 6
  %0081f668  0000 0000 0000 0000 f760 0081 0053 aaaa   DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 151
  %0081f78c  0000 ffff 0000 f7a0 0081 f7c4 0081 0002   Unknown f938 0081 0000 0000 
  %0081f938  0001 0000 f96c 0081 0000 0000 ffff ffff   WGSS50 WGSS50_O32_WaitForSingleObject + 2e
  %0081f990  1d92 82c0 009d 0154 2000 ffff ffff a7d0   Unknown f9a8 0081 91bd 1d92 
  %0081f9a8  1d8e 82c0 009d 0154 2000 ffff ffff 0007   Unknown f9f8 0081 e794 1d8e 
  %0081f9f8  1d8c 0007 0000 ffff ffff fa1c 0081 dcd4   Unknown fa08 0081 1cb9 1d8c 
  %0081fa08  1d93 0007 0000 ffff ffff aaaa aaaa fa7c   Unknown fa1c 0081 dcd4 1d93 
  %0081fa1c  0001 0007 0000 ffff ffff 0ad0 0001 fe0c   Unknown fa7c 0081 6308 0001 
  %0081fa7c  0001 0ad0 0001 0000 0000 0000 0000 fe0c   Unknown fe5c 0081 23f5 0001 
  %0081fe5c  0001 0002 0000 691c 200f 0000 0000 aaaa   Unknown fe7c 0081 401c 0001 
  %0081fe7c  1d91 4000 210f 0000 0000 9a40 009d 0001   Unknown feb4 0081 3bb1 1d91 
  %0081feb4  1d91 9720 009d 4000 0001 0000 0000 ff18   Unknown fee8 0081 3930 1d91 
  %0081fee8  0001 4000 0001 0000 0000 ff18 0081 0001   Unknown ff08 0081 526f 0001 
  %0081ff08  0001 ffe0 0081 0012 0001 0004 0000 ff2c   Unknown ffe0 0081 0012 0001 
  %0081ffe0  0003 0000 ce38 1ffe 15dd 0000 0000 0000   LIBC064X __std_strxfrm + 4b
                               Ring Three Stack Trace
 Location    Address       Symbol                         Description
 --------    -------       ------                         -----------
 %81f59c   %1d83ad88 WGSS50 WGSS50_2000 + dbb
 %81f5a4   %1ffc0049 DOSCALL1 DOS32QUERYSYSINFO + 38
 %81f5e0   %1ffca2a8 DOSCALL1 _xcptExecuteUserExceptionHandler + 46
 %81f5e4   %1ffc1e59 DOSCALL1 DOS32PSETPATHINFO + 1a3
 %81f5ec   %1ffca2a8 DOSCALL1 _xcptExecuteUserExceptionHandler + 65
 %81f600   %1ffca330 DOSCALL1 _xcptR3ExceptionHandler
 %81f61c   %1ffc9e8c DOSCALL1 _xcptR3ExceptionDispatcher + 10e
 %81f654   %1ffc0049 DOSCALL1 DOS32QUERYSYSINFO + 38
 %81f66c   %1ffc9d18 DOSCALL1 DOS32R3EXCEPTIONDISPATCHER + 151
 %81f6a0   %fff4d252     _VMFreeMemOd + 2b2             Releasing Memory
 %81f6c8   %fff6725a     _VMFreeMem + 1ea               Releasing Memory
 %81f6f0   %ffec7c2c OS2KRNL _ahvmSys
 %81f76c   %1d83ad88 WGSS50 WGSS50_2000 + dbb
 %81f7ac   %1d83ad88 WGSS50 WGSS50_2000 + dbb
 %81f7dc   %fff4d252     _VMFreeMemOd + 2b2             Releasing Memory
 %81f804   %fff6725a     _VMFreeMem + 1ea               Releasing Memory
 %81f82c   %ffec7c2c OS2KRNL _ahvmSys
 %81f860   %1d83ad88 WGSS50 WGSS50_2000 + dbb

The difference is that in case of Java, Thread 1 fires off another "main" thread and enters the wait state (through WaitForSingleObject()). So the process gets terminated by another thread and Thread 1 gets XCPT_ASYNC_PROCESS_TERMINATE as a result while being in the wait state. Then the situation looks similar to the test case again.

Note that now we may see that the kernel is calling WGSS50_2000 (If I don't misinterpret the second part of the trace) which seems to be an exception handler inside WGSS50.DLL.

by dmik, 14 years ago

Attachment: PDUMP.000.java.part1.rar added

by dmik, 14 years ago

Attachment: PDUMP.000.java.part2.rar added

comment:13 by dmik, 14 years ago

Attached fresh THREADS.EXE DosExit hang dumps, with the SYSVM option (and with all non-system .SYM files) as requested by Steven.

comment:14 by dmik, 14 years ago

With the help of Steve, the following is now evident regarding the state of the threads:

  1. thread 1 has sent XCPT_ASYNC_PROCESS_TERMINATE to all remaining threads and is waiting for them to handle this exception and die. This is normal and works as designed.
  1. All threads in their XCPT_ASYNC_PROCESS_TERMINATE handlers are waiting for some critical section (in Win32 terms) through DosEnterCriticalSection() ODINCRT API.
  1. This critical section is owned by a thread that has already died without leaving it, so that threads from step 2. are never going to be resumed and therefore thread 1 will never get control back.

Most likely the dead thread just ended in response to XCPT_ASYNC_PROCESS_TERMINATE which happened to happen while it was holding the critical section's semaphore.

I will now follow the code path and think on how this can be prevented.

comment:15 by dmik, 14 years ago

JFYI, the test case (its main thread) does the following:

  1. Starts 10 threads, each of them sleeps for some hundred ms and exits.
  2. Waits 2 seconds and exits the process. Some threads are still running by that time.

I added some code to make sure the critical section is not locked when the thread owning it terminates.

What I now see is that when the first thread among the still running ones gets XCPT_ASYNC_PROCESS_TERMINATE twice, the second time as a nested call. In some cases such nested call leaves the critical section locked and all other threads hang. I don't yet know about the source of the second nested XCPT_ASYNC_PROCESS_TERMINATE. Tracing the code.

comment:16 by dmik, 14 years ago

After performing the deep dump file analysis, I found that it's DosCloseEventSem() which throws the nested XCPT_ASYNC_PROCESS_TERMINATE. Wonderful. Why on earth would it do that?

This DosCloseEventSem() call is made from the original WGSS50 exception handler under the critical section lock hence it gets never released.

comment:17 by dmik, 14 years ago

Resolution: fixed
Status: newclosed

As of r21626 (and especially after r21620), hangs at process termination should have gone. I performed some excessive testing (running a terminating app in a loop 100 times) against the release version of Odin and can not see hangs any more.

While being in that part of code, I also made unwinding of Win32 exception handlers work -- this was completely disabled in Odin although application may rely on these handlers as they are a common place to cleanup resources (which includes releasing locks etc) when the thread gets unexpectedly terminated by the user or by the system.

I also had to completely review the SEH support code as I found some bugs in it (since unwinding the Win32 exception handlers was not done before these bugs would never be triggered).

Note that the process termination may still hang on SMP if there are lots of threads and you use the debug version of Odin with enabled logging. My analysis shows that the hangs happen within the ODIN (VAC) C runtime libraries deep inside the printf() function while handling some unwind collision. This probably needs some more investigation one day but not relevant so far and there are more important tasks.

Also, there is at least one known crash at process termination, even with the release Odin libraries: the routine in JVM.DLL that prints the stack trace sometimes triggers invalid memory access. I created http://svn.netlabs.org/java/ticket/97 for this particular issue.

The next thing to do is checking if the MMF implementation is SMP-safe. See http://svn.netlabs.org/java/ticket/96 for details. I may want to create an Odin ticket for it as well.

Note: See TracTickets for help on using tickets.