Opened 9 years ago

Closed 9 years ago

#37 closed defect (fixed)

Recursion in exception handler

Reported by: dmik Owned by:
Priority: major Milestone: odinized java
Component: odin Version:
Severity: Keywords:


There is an annoying recursion in the Odin exception handler: it calls DosExit() while processing the XCPT_PROCESS_TERMINATE and XCPT_ASYNC_PROCESS_TERMINATE exceptions, which are themselves the result of DosExit().

This recursion in particular leads to a big POPUPLOG.OS2 file (filled up with 0xC0010001 and 0xC0010002 exception records) at Java process termination (either normal or abnormal).

This needs to be solved as it indicates an error in the exception processing logic.
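The recursion pattern suspected in this description can be illustrated with a small, portable simulation. This is only a sketch under stated assumptions, not the Odin code: every sketch_ name is hypothetical, no real OS/2 API is called, the depth limit stands in for running out of stack, and the pairing of each numeric code with each exception name is assumed from the codes seen in POPUPLOG.OS2.

```c
#include <assert.h>

/* Codes the ticket reports in POPUPLOG.OS2; pairing each value with each
   name is an assumption of this sketch. */
#define SKETCH_XCPT_PROCESS_TERMINATE       0xC0010001UL
#define SKETCH_XCPT_ASYNC_PROCESS_TERMINATE 0xC0010002UL
#define SKETCH_XCPT_OTHER                   0x00000001UL /* hypothetical non-termination code */
#define SKETCH_DEPTH_LIMIT                  64           /* stands in for stack exhaustion */

static int sketch_depth;      /* current handler nesting level */
static int sketch_max_depth;  /* deepest nesting level observed */

static void sketch_exit(int guarded);

static int sketch_is_termination(unsigned long code)
{
    return code == SKETCH_XCPT_PROCESS_TERMINATE ||
           code == SKETCH_XCPT_ASYNC_PROCESS_TERMINATE;
}

/* Simulated exception handler. Unguarded, it always requests process exit,
   even when the exception it is handling was itself caused by an exit. */
static void sketch_handler(unsigned long code, int guarded)
{
    int skip_exit;

    sketch_depth++;
    if (sketch_depth > sketch_max_depth)
        sketch_max_depth = sketch_depth;

    /* Guard: termination is already in progress, do not request it again. */
    skip_exit = guarded && sketch_is_termination(code);
    if (!skip_exit && sketch_depth < SKETCH_DEPTH_LIMIT)
        sketch_exit(guarded);

    sketch_depth--;
}

/* Simulated exit request: it delivers a termination exception to the handler. */
static void sketch_exit(int guarded)
{
    sketch_handler(SKETCH_XCPT_PROCESS_TERMINATE, guarded);
}

/* Runs one simulation and returns the deepest handler nesting reached. */
int sketch_run(unsigned long first_code, int guarded)
{
    sketch_depth = 0;
    sketch_max_depth = 0;
    sketch_handler(first_code, guarded);
    assert(sketch_depth == 0); /* every simulated invocation has returned */
    return sketch_max_depth;
}
```

With these definitions, sketch_run(SKETCH_XCPT_OTHER, 0) nests until it hits the depth limit, while sketch_run(SKETCH_XCPT_OTHER, 1) stops at depth 2.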

Change History (5)

comment:1 Changed 9 years ago by dmik

I've just got a good test case demonstrating this problem. I found it while playing with Java ticket #129, which I need to resolve to make SmartSVN 6.6 work, because SmartSVN 6.5 running under Innotek Java 1.4.2 is very unstable on SMP, which is very annoying.

I will temporarily switch from Qt to this defect (and to Java #129 afterwards).

comment:2 Changed 9 years ago by dmik

It seems to enter the recursion loop due to DosExit() calls from OS2RtlUnwind() which is in turn called by our SEH handler at process termination (during the unwind phase performed on the OS/2 exception chain). Investigating.

comment:3 Changed 9 years ago by dmik

Testing showed that the problem is not DosExit() itself. The problem is that applications crash inside OS2RtlUnwind() when trying to unwind the Win32 exception handlers in response to the unwind procedure initiated by OS/2 at process termination (either normal or abnormal, it doesn't matter). This repeats recursively until the stack space is exhausted. The crash happens when walking through the exception chain, i.e. when accessing the previous exception frame through the Prev member of the current frame, because the value contained there is garbage. Since the frames are located on the stack, this means that the stack is corrupted.
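To make the failure mode concrete, here is a minimal sketch of a chain walk that checks each frame pointer before dereferencing it. It is not the OS2RtlUnwind() implementation: the sketch_frame layout, the end-of-chain marker and the assumption that outer frames live at higher stack addresses are all hypothetical, and only the Prev member name comes from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical frame layout; only the Prev member name comes from the ticket. */
typedef struct sketch_frame {
    struct sketch_frame *Prev;  /* next outer frame in the chain */
    void *Handler;              /* placeholder, unused in this sketch */
} sketch_frame;

/* Hypothetical end-of-chain marker. */
#define SKETCH_END_OF_CHAIN ((sketch_frame *)(uintptr_t)-1)

/* Walks the chain starting at head and returns the number of frames visited,
   or -1 as soon as a frame pointer does not look like a valid frame inside
   the stack range [stack_lo, stack_hi). */
int sketch_walk_chain(const sketch_frame *head,
                      const void *stack_lo, const void *stack_hi)
{
    uintptr_t lo = (uintptr_t)stack_lo;
    uintptr_t hi = (uintptr_t)stack_hi;
    const sketch_frame *f = head;
    int count = 0;

    assert(head != NULL);
    if (hi < lo || hi - lo < sizeof(sketch_frame))
        return -1; /* range cannot hold even one frame */

    while (f != SKETCH_END_OF_CHAIN) {
        uintptr_t addr = (uintptr_t)f;

        /* The frame must lie entirely inside the stack and be pointer-aligned. */
        if (addr < lo || addr > hi - sizeof(sketch_frame) ||
            addr % sizeof(void *) != 0)
            return -1;

        /* Assumed here: the stack grows down, so each outer frame sits at a
           higher address. This also rules out cycles in the chain. */
        if (f->Prev != SKETCH_END_OF_CHAIN && (uintptr_t)f->Prev <= addr)
            return -1;

        count++;
        f = f->Prev;
    }
    return count;
}
```

A walker like this would stop with an error instead of faulting on a garbage Prev value, though it would only mask the underlying stack corruption rather than explain it.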

There are several interesting facts:

  1. The crashes never happen under the UNI kernel.
  2. The simple test case (testapp/exceptions/seh/crash.c) does not crash even on SMP.
  3. On SMP, we crash regardless of whether we do unwinding from OS2ExceptionHandler2ndLevel() (which is located above all possible Win32 exception frames) or from individual Win32 exception handlers themselves. On UNI both cases work.

This shows that stack corruption only happens on SMP and only under heavy load, when there are lots of threads (several dozen, as in the case of Java) and they are all being unwound at once. This in turn suggests that the SMP kernel does not handle thread stacks well when performing the unwind procedure.

Given that unwinding of Win32 exception handlers is currently only used to make setjmp()/longjmp() work across __try {} boundaries and that this functionality is neither complete (see #15) nor really necessary for our current Odin-based projects compiled from source code (like OpenJDK), I will disable this unwinding.

Since the netlabs ldap server is now down, I can't commit the necessary changes ATM. I will do that tomorrow.

comment:4 Changed 9 years ago by dmik

Note that may be actually related to the same SMP issue we are facing here -- i.e. to the stack (or registers) being corrupted during exception handling when it happens on many threads at the very same time.

comment:5 Changed 9 years ago by dmik

Resolution: fixed
Status: new → closed

Committed the change in r21662. In my tests, I see no recursive crashes and no giant POPUPLOG.OS2 any more.
