Opened 7 years ago

Closed 7 years ago

#160 closed defect (fixed)

Attempt to protect stack guard pages failed

Reported by: dmik Owned by:
Priority: major Milestone: GA2
Component: general Version: 1.6.0-b22 GA
Severity: medium Keywords:
Cc: java6@…

Description

While starting and stopping SmartSVN 12 times in a row, I got this message on the console twice, and each time it was followed by a JVM crash.

The crash log is attached.

Attachments (1)

hs_err_pid232.log (21.4 KB) - added by dmik 7 years ago.

Change History (10)

Changed 7 years ago by dmik

comment:1 Changed 7 years ago by dmik

The message is the same as in #146 BTW. I don't know if it indicates the same problem though. Seems to be worth investigating.

comment:2 Changed 7 years ago by yoda

These 'guard page' failures show up about half the time OpenFire crashes too.
Today I even saw a similar failure to 'deallocate guard pages'.

comment:3 Changed 7 years ago by yoda

  • Cc java6@… added

comment:4 Changed 7 years ago by dmik

This warning message is emitted by the JVM when it fails to initially mark the special yellow zone at the bottom of the stack of a newly created (or attached) thread with the "guard page" attribute. Although the Odin sources (around VirtualProtect) have a warning that this attribute is not supported by Odin, OS/2 has the same concept and it seems to work -- at least, you can mark/unmark memory pages as PAG_GUARD with DosSetMem() w/o any error.

However, sometimes DosSetMem() fails and this is when the warning is shown. I can't say exactly what the JVM does before it fails since I can't reproduce the crash any longer (tried to start SmartSVN a zillion times). But since this warning is only emitted during the initial markup, it must happen while the JVM is creating a new thread. A subsequent crash right after this warning also suggests that something went really wrong.

As far as I understand, the yellow zone mechanism works as follows in the Win32 JVM (this is just a guess). First of all, the OS uses guard pages on its own to implement the automatic stack growth on demand. The top page on the stack is committed and a page right below it is marked as a "guard page". When the application accesses the guard page, EXCEPTION_GUARD_PAGE_VIOLATION is raised and this causes the default OS handler to remove the "guard page" status, commit the page below together with marking it as a guard page and continue execution. If, however, the OS fails to allocate and mark the new guard page, it throws the infamous EXCEPTION_STACK_OVERFLOW exception.

The JVM kicks in here by marking some pages at the bottom of the stack as "guard pages" upfront. As I understand it, this causes the default OS handler to generate EXCEPTION_STACK_OVERFLOW when an attempt is made to access the page right above the yellow zone, because it won't be able to mark the page below it as a guard page (since it is already marked as such by the JVM). When this exception is caught by the JVM, it throws the Java StackOverflowError to let the Java code handle the situation (e.g. free the stack if possible or at least shut itself down gracefully).

This technique allows the JVM to have as many pages as it wants in this yellow safety zone instead of the 1-2 pages provided by the OS when it works on its own. More pages in this zone are necessary because the shutdown code triggered by the StackOverflowError handler at the Java level may need much more stack memory to do its work and terminate the application than the OS would have provided.
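To make the "bottom of the stack" wording concrete, here is a minimal sketch of the address arithmetic only. It assumes a downward-growing stack with the yellow zone occupying the lowest pages, and it ignores any other zones HotSpot may place there. The names `yellow_zone` and `in_yellow_zone` are made up for illustration and are not taken from the JVM or Odin sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open address range [start, end). */
struct zone {
    uintptr_t start;
    uintptr_t end;
};

/* Hypothetical helper: the yellow zone is assumed to be the lowest
 * `yellow_pages` pages of the stack area that begins at `stack_low`. */
struct zone yellow_zone(uintptr_t stack_low, size_t page_size, size_t yellow_pages)
{
    struct zone z;
    assert(page_size > 0);
    z.start = stack_low;
    z.end = stack_low + (uintptr_t)(yellow_pages * page_size);
    return z;
}

/* Returns 1 if `addr` falls inside the zone, 0 otherwise. */
int in_yellow_zone(struct zone z, uintptr_t addr)
{
    return addr >= z.start && addr < z.end;
}
```

With a stack area starting at 0x100000, 4 KB pages and three yellow pages, the zone covers 0x100000 up to (but not including) 0x103000; a fault at 0x102FFF is inside it, one at 0x103000 is not.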

Now I'm going to check if my assumptions regarding the Win32 behavior are correct and if OS/2 behaves the same way as well. BTW, the "PAG_GUARD not currently supported" remark in Odin could mean just that but I want to make sure myself.

If OS/2 lacks this functionality, we will disable guard pages altogether in the JVM. There is already such a possibility and it is used on Win95/Win98 where guard pages behave differently (most likely, EXCEPTION_STACK_OVERFLOW isn't thrown if the page is already a guard page).

comment:5 Changed 7 years ago by dmik

Windows tests show that it works a bit differently. For the memory area used as a thread stack, the OS doesn't let the application see EXCEPTION_GUARD_PAGE_VIOLATION; it always handles it on its own (to implement the dynamic stack commitment procedure). Instead, if you mark some page in the stack area (which is beyond the current committed limit) as PAGE_GUARD, you will get EXCEPTION_STACK_OVERFLOW when trying to access this page (it seems to be completely equivalent to EXCEPTION_GUARD_PAGE_VIOLATION in the sense that it resets PAGE_GUARD).

Running the Windows test under Odin shows that none of this works on OS/2. Now I have to find out whether it's an Odin problem or an OS/2 one.

comment:6 Changed 7 years ago by dmik

I found that the PAG_GUARD mechanism works well on OS/2 but it has some differences which are not properly mapped to Windows. In particular, on OS/2 you always get XCPT_GUARD_PAGE_VIOLATION when you access a page marked as PAG_GUARD, even if this page is within the current thread's stack. On Windows, you get EXCEPTION_STACK_OVERFLOW (see above).

Also, the AUTOCOMMIT technique implemented in Odin for the virtual memory managed by Win32 APIs seems to suppress the XCPT_GUARD_PAGE_VIOLATION exception completely. This needs some more investigation.

For these two reasons, the "guard page" technique can't work properly in OpenJDK ATM (as it doesn't see the needed exceptions) and this may be a reason for the warning and crash from the description (and other similar problems). I will try to implement the missing bits of the functionality in Odin; that shouldn't be difficult.
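The first of the two gaps can be pictured as a small translation step. The sketch below is only an illustration of the behavior observed in the tests above, not the actual Odin fix: a guard page fault inside the current thread's stack is reported the Windows way, as a stack overflow, and any other guard page fault stays a guard page violation. The enum values and the `translate_guard_fault` function are hypothetical stand-ins, not the real OS/2 or Win32 exception codes.

```c
#include <assert.h>
#include <stdint.h>

/* Symbolic stand-ins; the numeric values are NOT the real exception codes. */
enum sketch_xcpt {
    SK_WIN_GUARD_PAGE_VIOLATION, /* stands for EXCEPTION_GUARD_PAGE_VIOLATION */
    SK_WIN_STACK_OVERFLOW        /* stands for EXCEPTION_STACK_OVERFLOW */
};

/* Half-open range [low, high) covering the current thread's stack. */
struct stack_range {
    uintptr_t low;
    uintptr_t high;
};

/* Hypothetical mapping for an OS/2 XCPT_GUARD_PAGE_VIOLATION at `fault_addr`:
 * inside the thread stack, a Win32 app expects a stack overflow instead. */
enum sketch_xcpt translate_guard_fault(uintptr_t fault_addr, struct stack_range stack)
{
    assert(stack.low < stack.high);
    if (fault_addr >= stack.low && fault_addr < stack.high)
        return SK_WIN_STACK_OVERFLOW;
    return SK_WIN_GUARD_PAGE_VIOLATION;
}
```

The AUTOCOMMIT suppression is a separate issue and is not modeled here.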

comment:7 Changed 7 years ago by dmik

I also found some strange things in the CreateThread implementation regarding the stack size parameter. According to MSDN, the thread stack size value is interpreted as follows:

  1. 0 = take the stack size from the PE header (by default, it is 1 MB of reserved memory, two pages of which are initially committed).
  2. N = take the stack size from the PE header (by default, 1 MB) but commit N bytes. If N is greater than the stack size in the PE header, the stack will be increased up to N.
  3. N and STACK_SIZE_PARAM_IS_A_RESERVATION = reserve N bytes for the stack and commit two pages.

On OS/2, the stack size parameter in DosCreateThread() always means the total stack size. By default, 3 pages of this area will be committed and the rest will be reserved. If you pass the STACK_COMMITTED flag, all memory will be pre-committed. As you see, it's a bit different from Windows.
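The three Win32 cases listed above can be written down as a small function. This is a sketch of the rules exactly as summarized in this comment, assuming 4 KB pages and ignoring any rounding or allocation granularity the real OS applies. `SKETCH_IS_RESERVATION` is a made-up stand-in for STACK_SIZE_PARAM_IS_A_RESERVATION, and `win32_stack_layout` is a hypothetical name, not an Odin or Win32 function.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE      4096u /* assumed page size */
#define SKETCH_IS_RESERVATION 0x1u  /* stand-in for STACK_SIZE_PARAM_IS_A_RESERVATION */

struct stack_layout {
    size_t reserve; /* total address space set aside for the stack */
    size_t commit;  /* bytes committed up front */
};

/* What a Win32 app expects from CreateThread, per the three cases above.
 * `pe_default` is the stack size from the PE header (1 MB by default). */
struct stack_layout win32_stack_layout(size_t n, unsigned flags, size_t pe_default)
{
    struct stack_layout s;
    assert(pe_default > 0);
    if (n == 0) {                                /* case 1 */
        s.reserve = pe_default;
        s.commit = 2 * SKETCH_PAGE_SIZE;
    } else if (flags & SKETCH_IS_RESERVATION) {  /* case 3 */
        s.reserve = n;
        s.commit = 2 * SKETCH_PAGE_SIZE;
    } else {                                     /* case 2 */
        s.reserve = n > pe_default ? n : pe_default;
        s.commit = n;
    }
    return s;
}
```

Since DosCreateThread() takes the total size, the value to pass on OS/2 would correspond to `reserve`, not to the raw N from the caller. For example, N = 64 KB without the flag gives a 1 MB reserve with 64 KB committed, whereas passing 64 KB straight through yields a 64 KB stack.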

Odin simply passes the value from CreateThread to DosCreateThread and doesn't analyze the STACK_SIZE_PARAM_IS_A_RESERVATION flag at all, so the application may not get what it expects, which in turn may be another reason for crashes. In particular, if a Win32 app passes the value of N as a stack size and N is smaller than 1 MB, this means that the stack will be 1 MB and N bytes of it will be pre-committed. On OS/2 it means that the TOTAL stack size is N. Of course, if the application relies on a particular stack size, it will eventually end up with a stack overflow condition. Odin sources contain a "dirty" hack that seems to temporarily work around this problem:

    // @@@PH Note: with debug code enabled, ODIN might request more stack space!
    //SvL: Also need more stack in release build (RealPlayer 7 sometimes runs
    //     out of stack
    if (cbStack > 0) {
        cbStack <<= 1;     // double stack
    }
    else {
        cbStack = (WinExe) ? WinExe->getDefaultStackSize() : 0x100000; // per default 1MB stack per thread
    }

So, they simply double what the application requested. But obviously, doubling doesn't necessarily increase the stack size to 1 MB as the app could assume.
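A quick arithmetic check of that point, as a sketch: `odin_doubled_stack` mirrors the quoted snippet (with the executable's default passed in as a plain parameter), and `win32_expected_reserve` encodes the assumption that a Win32 app requesting N bytes without the reservation flag expects at least the PE header default. Both function names are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the quoted hack: double a non-zero request, else use the default. */
size_t odin_doubled_stack(size_t cb_stack, size_t exe_default)
{
    return cb_stack > 0 ? cb_stack << 1 : exe_default;
}

/* Assumed Win32 expectation when the reservation flag is not passed:
 * the larger of the request and the PE header default. */
size_t win32_expected_reserve(size_t cb_stack, size_t exe_default)
{
    return cb_stack > exe_default ? cb_stack : exe_default;
}

/* Returns 1 if the doubled size covers the expected reserve, 0 otherwise. */
int doubling_is_enough(size_t cb_stack, size_t exe_default)
{
    assert(exe_default > 0);
    return odin_doubled_stack(cb_stack, exe_default)
           >= win32_expected_reserve(cb_stack, exe_default);
}
```

With a 1 MB default, a 64 KB request becomes 128 KB, which is well short of 1 MB; only requests of 512 KB or more end up at or above the default after doubling.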

All this stuff needs to be fixed as well.

comment:8 Changed 7 years ago by dmik

I created Odin tickets for both issues (http://svn.netlabs.org/odin32/ticket/76 and http://svn.netlabs.org/odin32/ticket/77) and fixed them locally. Unfortunately, many Java apps crash in the release build and all of them hang in the debug build (the debug build hang is definitely related to logging). For this reason, I can't commit the fixes.

Flash and test apps work well though.

I will try to find some time tomorrow to look at the problem with fresh eyes, but if I fail, we will have to postpone the resolution of these problems and release GA2 with the "guard page" feature turned off.

comment:9 Changed 7 years ago by dmik

  • Resolution set to fixed
  • Status changed from new to closed

Both problems are fixed within their respective Odin tickets.

The original problem doesn't show up any more. Let's hope it has gone completely.