Opened 14 years ago

Closed 14 years ago

#96 closed defect (fixed)

SMP: Frequent crashes

Reported by: dmik
Owned by:
Priority: major
Milestone: GA
Component: general
Version: 1.6.0-b22 RC2
Severity: highest
Keywords:
Cc:

Description

Currently, the JVM crashes every now and then if you attempt to run something more or less complex (e.g. SmartSVN) on an SMP machine.

Here is a typical report:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x1d5338d2, pid=971, tid=63635470
#
# JRE version: 6.0-b22
# Java VM: OpenJDK Client VM (19.0-b09 mixed mode os2-x86 )
# Problematic frame:
# V  [JVM+0x3d38d2]
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#

(the rest is attached).

Attachments (1)

hs_err_pid971.log (20.3 KB ) - added by dmik 14 years ago.

Change History (16)

by dmik, 14 years ago

Attachment: hs_err_pid971.log added

comment:1 by dmik, 14 years ago

This seems to always happen when reading bytes from a memory mapped file. My guess is that the implementation of MMF in Odin is not SMP safe: probably, the code that deals with committing requested memory in the exception handler is not aware that it may be interrupted in the middle on a real SMP machine and leave something in an inconsistent state. So, Odin needs to be investigated for this first.

comment:2 by dmik, 14 years ago

Milestone: EnhancedGA

comment:3 by dmik, 14 years ago

I found an interesting thing. In HotSpot, there are two techniques for serializing access to memory which stores Java thread states between threads (in particular, to make sure that reads and writes are not reordered when running on SMP):

  • Using membars.
  • Using HotSpot's own synchronization routine.

The second routine uses a special serialization page in memory as follows. Every thread writes to a dedicated cell in this page each time it changes its state. When the VM thread (that does dispatching) is about to analyze thread states, it calls a special function that temporarily sets the page protection flags to RO and then back to RW. AFAIU, this is supposed to force the CPUs to flush caches to memory so that subsequent reads will return up-to-date data. This is the default method of serialization -- according to the comments in the sources, it is much more efficient than issuing membar instructions after each thread state change.

My findings show that crashes happen in the code that implements this method. Threads may still attempt to write to this serialization page when it is set to RO. The exception handlers for this case are set up so that the write operation is retried after the dispatcher thread restores the RW mode. Something doesn't work right here on OS/2 and the application crashes instead of just retrying.
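A minimal sketch of the serialization-page scheme described above, written against POSIX mmap/mprotect rather than the OS/2 or Win32 calls the port actually uses. All function and variable names here are hypothetical and are not taken from the HotSpot sources. The retrying exception handler is deliberately left out, so a writer must not touch the page while it is read-only.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-in for the serialization page; not HotSpot code. */
static void *serialize_base = NULL;
static size_t serialize_size = 0;

/* Map one read/write page that threads write to on every state change. */
int serialize_page_init(void) {
    long sz = sysconf(_SC_PAGESIZE);
    if (sz <= 0) return -1;
    void *p = mmap(NULL, (size_t)sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    serialize_base = p;
    serialize_size = (size_t)sz;
    return 0;
}

/* Each thread gets its own cell, selected by an index (assumption). */
static volatile int *cell_for(size_t thread_index) {
    assert(serialize_base != NULL);
    size_t cells = serialize_size / sizeof(int);
    return (volatile int *)serialize_base + (thread_index % cells);
}

/* Mutator side: record the new state. This faults if the page is RO. */
void write_serialize_cell(size_t thread_index, int state) {
    *cell_for(thread_index) = state;
}

int read_serialize_cell(size_t thread_index) {
    return *cell_for(thread_index);
}

/* VM-thread side: flip RW -> RO -> RW. Per the ticket, the protection
   change is what is expected to make other CPUs' writes visible. */
int serialize_thread_states(void) {
    assert(serialize_base != NULL);
    if (mprotect(serialize_base, serialize_size, PROT_READ) != 0) return -1;
    if (mprotect(serialize_base, serialize_size, PROT_READ | PROT_WRITE) != 0) return -1;
    return 0;
}
```

In a real multi-threaded setup a write can land between the two mprotect calls; that is the access violation the exception handler is expected to absorb by retrying.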

An indirect proof of that is the fact that when I force membar mode (using the -XX:+UseMembar command line option to java.exe) applications look quite stable on SMP. At least I don't see crashes.

This all means that these crashes are not actually related to Odin and its MMF (this synchronization approach just happens to use the same exception-based technique for retrying the operation) and that we have a possible solution for the problem which is much better than setting the MPUNSAFE flag on executables (since it doesn't force sticking to a single CPU).

However, I will still try to fix the synchronization method used by default as there is no reason to doubt that it is more efficient than plain membars.

I also asked reporters to test the -XX:+UseMembar switch with the applications where they have problems.

comment:4 by dmik, 14 years ago

It's much better with +UseMembar, but I get some new crashes in the release build (of course, not in the debug). Something near Unsafe_CompareAndSwapInt.

comment:5 by dmik, 14 years ago

I guess I found the cause of the second crash. Java uses the __try exception handler's address stored in fs:[0] as a base for very fast access to the Thread object (stored on the same stack some bytes away).

However, under some circumstances, while being executed in JVM-generated code, fs:[0] gets zeroed. This breaks both the exception handling and the Thread object access. I somehow need to find out who's zeroing it.

comment:6 by dmik, 14 years ago

Looks like some code screws the ECX register and that breaks the restoration of the exception chain after __try/__except. Will check this.

comment:7 by dmik, 14 years ago

Recent news: the ECX register is not a problem. What I see is that at certain places (for example, after calling SetWin32TIB in the block of asm code in __except that removes the exception handler set by __try) some wrong value gets loaded into fs:[0] (not necessarily zero). fs:[0] should store the pointer to the current exception handler's record installed by __try and this is vital for removing the exception handler because this record stores the address of the previous exception handler which needs to be restored when removing the current one.
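The role of the handler record can be illustrated with a small model. This is only a sketch: the record layout is simplified, a plain global stands in for the per-thread slot the ticket calls fs:[0], and every name is hypothetical. The real unlink code simply loads the previous pointer from whatever record fs:[0] points at; the model reports a mismatch instead of following a bad pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical registration record. */
typedef struct ExcRecord {
    struct ExcRecord *prev;  /* handler that was active before this one */
    const char *name;        /* label, for illustration only */
} ExcRecord;

/* Stands in for the per-thread chain head that the ticket calls fs:[0]. */
static ExcRecord *chain_head = NULL;

/* Conceptually what entering __try does: link a record in front. */
void exc_push(ExcRecord *rec, const char *name) {
    assert(rec != NULL);
    rec->name = name;
    rec->prev = chain_head;
    chain_head = rec;
}

/* Conceptually what leaving __try/__except does: restore the previous
   handler. Returns -1 if the head no longer points at rec, i.e. the
   situation described above where a wrong value was loaded. */
int exc_pop(ExcRecord *rec) {
    assert(rec != NULL);
    if (chain_head != rec) return -1;
    chain_head = rec->prev;
    return 0;
}

ExcRecord *exc_head(void) { return chain_head; }

/* Test helper that simulates the corruption. */
void exc_corrupt_head(ExcRecord *bogus) { chain_head = bogus; }
```

Once the head is overwritten, the previous-handler pointer stored in the record can no longer be reached through it, which is why removal of the handler breaks.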

I have no idea why SetWin32TIB corrupts fs:[0] -- I see no code in there that could do that. The problem is that I can't trace the execution of the call which causes the corruption, because the corruption can be detected only after the call and there are hundreds of executions of this block of code before it, so I simply can't step through them all...

Will continue searching.

comment:8 by dmik, 14 years ago

I got the above problem fixed, finally, see r21633 (in Odin SVN). Even though I couldn't find the exact code that trashes fs:[0], tests show that the problem is gone. I could use SmartSVN 6.6 for committing r21633 :) And for browsing other complex projects like Qt which trigger a lot of exceptions caught in __try/__except.

My assumption about killing the contents of fs:[0] is that in some rare cases the OS/2 kernel doesn't load 0x150b (the thread data segment used in OS/2) to FS when the OS/2 system exception is generated, and therefore it writes the new exception handler data (used for some internal processing) to the Win32 thread data segment (which we abuse FS for when working in Win32 compatibility mode). I couldn't easily prove that, but as long as it works now I don't see the need to spend more time on it.

Note that all this is about working in -XX:+UseMembar mode. W/o this option (i.e. when a pseudo membar is used instead of the real one), it still crashes at the same location as before. This is just a different issue and I'm going to look at it now.

comment:9 by dmik, 14 years ago

With no UseMembar, it crashes during VM state transition (i.e. while calling SafepointSynchronize::block() in ThreadStateTransition::transition_and_fence()). One place where it happens is Monitor::wait (through creation of an instance of ThreadBlockInVM which is derived from ThreadStateTransition).

As you see, SafepointSynchronize::block() happens right after the code that implements the remote membar technique involved in no UseMembar mode, i.e. the InterfaceSupport::serialize_memory() call. This call uses a __try/__except block around the os::write_memory_serialize_page() call which writes to the synchronization page and may trigger the access violation exception when this page is switched to RO and back to RW (by os::serialize_thread_states()).

The code in ThreadStateTransition::transition_and_fence() (actually, the code in Monitor::wait(), since the former gets inlined into it) uses the ESI register to hold the Thread pointer within the method. When os::write_memory_serialize_page() throws an exception, the filter expression in InterfaceSupport::serialize_memory() (os::win32::serialize_fault_filter()) instructs the handler to continue execution from the same place, as eventually RO memory will become RW again and the write will succeed. However, when it succeeds, the contents of ESI (supposed to hold the Thread pointer) turn out not to be the same as they were before the exception was thrown. As a result, a wrong Thread pointer is passed on to SafepointSynchronize::block() which crashes when trying to access the Thread object through this pointer (that points to an uncommitted memory block).

So my assumption is that the __try/__except exception handler does not preserve all register contents when the exception is resolved with the "continue execution" result.

As it is extremely difficult to debug it "in place" (in the JVM code mentioned above) due to slow recompilation times (most involved code is headers) and very frequent OS/2 hangs when dealing with exceptions and debug output to the console and with the problems in the OS/2 debugger itself (which both result in the infamous Panorama driver's banded freeze problem), I have to recreate an emulation of this case in a simple test case and debug it there.

comment:10 by dmik, 14 years ago

I created a single-threaded test case involving Odin but not involving Java and couldn't reproduce the problem.

Then I replaced our __try/__except block with the pure OS/2 exception handler in the Java sources. And the problem still persists: ESI contents is sometimes not preserved after the exception is handled and the execution continues from the same place. The ESI corruption seems to happen when many threads start changing their state (java<->native etc) at the same time at a very frequent rate (hundreds of times per second AFAIR). Both of these facts, connected, suggest that this may be a problem in the SMP kernel. I cannot think of anything else ATM because the current code triggering the bug within Java is very simple:

  1. save ESI contents;
  2. install OS/2 exception handler;
  3. initiate remote membar (just write a dword to a special page);
  4. remove OS/2 exception handler;
  5. compare ESI contents;

Step 3 may cause an access violation exception because the special page is temporarily RO. The exception handler in this case just waits until it becomes RW again and continues execution of the interrupted code (so that the write attempt in step 3 is retried).

Most of the time it works correctly (i.e. ESI in steps 1 and 5 is the same). However, at some point (as I already assumed, when many threads changing their states are fired up), ESI gets garbage after step 3 when the latter causes an exception and then gets retried. Note that there may be several successive exceptions (that don't corrupt the register) prior to corruption. This is why I think it's somehow related to the number of threads and the CPU load.

comment:11 by dmik, 14 years ago

I removed switching to the Win32 context (and abusing FS) in the new __try/__except implementation but after a full rebuild I experience a couple of strange crashes (XCPT_GUARD_PAGE_VIOLATION) which should not happen (I recall I saw them before, at the very beginning). I recall that there were a couple of other places in Java code where the FS register was used directly. I need to find these places and fix them in Java code.

comment:12 by dmik, 14 years ago

Hmm, getting rid of FS abuse (WRT SEH) turns out to be tricky. The thing is that there are some OS/2 exceptions (XCPT_GUARD_PAGE_VIOLATION is one of them) that need to be processed by Odin before the Win32 application code gets executed (this includes application exception handlers) and it worked fine in FS abuse mode since the Odin (OS/2) and application (Win32) exception handler chains were different due to different FSes.

However now, since we have a single exception handler chain, the application handler (installed by __try()) kicks in first when XCPT_GUARD_PAGE_VIOLATION happens. In this case, it's the Java exception handler which doesn't specifically handle this exception and simply attempts to write its details to hs_err_xxx.log. However, XCPT_GUARD_PAGE_VIOLATION is a critical one: Odin uses it to commit additional stack pages of the thread as the stack usage grows and the application hits the last committed page. When doing its print stuff, Java allocates many bytes on the stack and eventually hits the guard stack page again, and this eventually reaches some other exception handler (i.e. in LIBC) that kills the application. Sometimes it just seems to happen recursively until the stack space ends and then the application crashes with XCPT_BAD_STACK.

I will try to solve this by calling the Odin exception handler before the SEH exception handler.

comment:13 by dmik, 14 years ago

This approach didn't quite work. As practice shows, the XCPT_GUARD_PAGE_VIOLATION exception for committing new stack pages is actually handled by the OS/2 kernel itself (after it tries *all* exception handlers in the chain and finds no one to handle it) and therefore we can't call it manually.

I will try a different approach: sort these exceptions out in the Java exception handlers. The main problem is that Java handlers treat any system exception they can't handle as a fatal error and abort the application. I will see what can be done there.
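A rough model of the dispatch order described above, and of the difference between a handler that treats every unknown exception as fatal and one that passes the guard-page exception on. The enum values, handler names and the dispatch function are all hypothetical; they are not the OS/2 or HotSpot definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types. */
typedef enum { EXC_ACCESS_VIOLATION, EXC_GUARD_PAGE } ExcKind;
typedef enum { CONTINUE_SEARCH, CONTINUE_EXECUTION, FATAL } Verdict;
typedef Verdict (*Handler)(ExcKind kind);

/* Behaviour described in the ticket: anything the handler does not
   understand is treated as a fatal error. */
Verdict java_handler_strict(ExcKind kind) {
    if (kind == EXC_ACCESS_VIOLATION) return CONTINUE_EXECUTION;
    return FATAL;
}

/* Proposed direction: let guard-page exceptions fall through. */
Verdict java_handler_tolerant(ExcKind kind) {
    if (kind == EXC_GUARD_PAGE) return CONTINUE_SEARCH;
    return java_handler_strict(kind);
}

/* Walk the chain from the innermost handler outwards; if nobody handles
   the exception, the kernel default applies: commit the stack page for a
   guard-page exception, terminate otherwise. */
Verdict dispatch(const Handler *chain, size_t count, ExcKind kind) {
    assert(chain != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        Verdict v = chain[i](kind);
        if (v != CONTINUE_SEARCH) return v;
    }
    return kind == EXC_GUARD_PAGE ? CONTINUE_EXECUTION : FATAL;
}
```

With the strict handler first in the chain, a guard-page exception never reaches the kernel default; with the tolerant one it does, and execution continues after the stack page is committed.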

comment:14 by dmik, 14 years ago

Okay, I cannot find a quick solution that would work. What I see now is that sometimes the generated Hotspot code calls some arbitrary location in memory and crashes. I cannot easily get to the failing code with the debugger because the debugger is just too crappy (i.e. it hangs *very* frequently when stepping through exception handlers, it cannot disassemble memory from DATA segments which prevents me from seeing the assembly generated by Hotspot, and so on).

It looks like the whole exception handling algorithm needs to be redesigned for OS/2 if we go the way of eliminating the FS register abuse. This isn't a task that can be done quickly and we have spent too much time on the issue already.

I think the best option so far is to fall back to the -XX:+UseMembar option (and make it the default) and move on to other issues and not delay GA any more. In UseMembar mode, JDK is quite stable on my SMP machine (tried many applications). According to JDK devs, there is some "substantial" performance drop due to membars (and this was the reason for removing them), but that's certainly better than crashing from time to time on SMP. See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5075546 (and related bugs) for more details about UseMembar and performance.

I've prepared a test build to give it some testing before the final decision. It is available here (note that UseMembar is the default in this build so no need to specify it on the command line):

ftp://ftp.dmik.org/tmp/j/j.zip
ftp://ftp.dmik.org/tmp/j/o.wpi

Note that there is only a client JVM in this build.

If it goes well, we will release it that way and address the membar problem later. So far, I have several guesses about the reason for the crash (this mostly applies to the case when we abuse FS by maintaining two separate exception chains, the OS/2 and Win32 one):

  1. Our __try/__except implementation is a "hack" in the first place -- there is no way to implement it properly w/o the low level support from the compiler side (this means that this feature should be implemented in the compiler and this is how it is done in MSVC and Watcom).
  2. The OS/2 SMP kernel doesn't like frequent exceptions triggered by the remote membar (mprotect) technique (used instead of membars when no UseMembar is specified) and somehow screws up its handling which results in corrupted contents of some registers when we return from the exception handler.

Note that other parts of the code still use our __try/__except implementation and therefore in theory they can also result in crashes (if my assumptions that it is the reason are correct). But again, my tests don't show any crashes when using it together with the UseMembar approach.

comment:15 by dmik, 14 years ago

Resolution: fixed
Status: new → closed

In r297, I applied the temporary workaround that forces the UseMembar option. I also created #118 to track the issues with the mprotect-based membar scheme on SMP when we decide to come back to this problem later.
