Opened 10 years ago

Last modified 10 years ago

#230 new defect

IcedTea-Web: build version 1.5.1

Reported by: dmik Owned by:
Priority: major Milestone: Enhanced
Component: icedtea-web Version:
Severity: medium Keywords:
Cc: steve53@…

Description

In r429 trunk was updated to version 1.5.1 from vendor. It doesn't currently build due to a number of changes incompatible with 1.3. This ticket is to address it.

Change History (54)

comment:1 by dmik, 10 years ago

In short, 1.5 went pretty far away from 1.3 configuration-wise. Now the plugin uses various settings files (normally located in the user directory) and it takes things like the JRE path from these files. This made my previous OS/2-specific hacks in this area not working any more. I think we should generally respect the new logic but with some additions. I.e. we will also read the settings files but by default we will use auto-detection for things like JRE instead of always using the hardcoded path (which still may be in use later though, once we have JRE in RPM).

I also think that we should try to build the plugin using its configure instead of our own KMK files (because now we have all the tools like libtool/autoconf/etc). The plugin now comes with a lot of unit tests etc. and it's a pain to translate everything to KMK.

comment:2 by dmik, 10 years ago

Type: taskdefect

Made it build in r431 using the KMK way (to make sure OS/2 code builds per se). Will perform a quick test now.

The configure-based build is next.

comment:3 by dmik, 10 years ago

Type: defecttask

comment:4 by dmik, 10 years ago

A quick test shows that there are some problems with the Java executable path detection. Need to debug it.

Version 0, edited 10 years ago by dmik (next)

comment:5 by dmik, 10 years ago

Type: taskdefect

I fixed the Java executable problem in r433, that was a silly typo. It starts now but something goes wrong with the pipe, the Java side of the plugin says the following and terminates its work:

Unable to write to PIPE. APPLETVIEWER exiting

Not clear why it happens still. This needs some deeper debugging of the applet viewer logic.

The interesting thing is that when dom.ipc.plugins.enabled is set to false in Mozilla (i.e. when the plugin is run in the same process as Firefox), the failure of the applet viewer (running inside the java.ex process) causes firefox.exe to terminate (and java.exe keeps running in a semi-hang state requiring a kill). It seems that despite the workaround I added to prevent early process exit on pipe read error (i.e. signal (SIGPIPE, SIG_IGN)), firefox still exits when the plugin it can't read from its end of the pipe. Anyway, this problem comes from glib (used by the plugin for IPC with java.exe that actually runs the applet). I will need to build the debug version of glib in order to investigate that.

comment:6 by dmik, 10 years ago

The (relatively) new glib version 2.33.12 is built but running across it causes the following error:

(process:27537): GLib-WARNING **: poll(2) failed due to: unknown error (22).

Looks like there is some regression compared to glib 2.25.15 that we used before. Perhaps, the wrong pipe implementation is now on. Will investigate.

comment:7 by dmik, 10 years ago

When I run in with IPC enabled, I get a 100% CPU hang followed by this:

###!!! [Parent][RPCChannel] Error: Channel timeout: cannot send/recv

RWSCLI08: RwsExitListProc - termination code= 0
WARNING: pipe error: Broken pipe: file D:/Coding/mozilla/master/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 712

Also some thing to check.

comment:8 by dmik, 10 years ago

Both problems seem to come from a change in glib where they introduced a new primitive, GWakeup. It needs porting to OS/2. See http://trac.netlabs.org/ports/ticket/14#comment:8 for details.

comment:9 by dmik, 10 years ago

GWakeup was ported to OS/2, the problem described in comment:6 has disappeared and now we again have what we had with glib 2.25.15 (described in comment:5). The problem from comment:7 is something different (happens with the old glib as well). I will focus on the comment:5 problem now. IPC is for later (I have some ideas there too though — the problems may be related to our g_spawn implementation).

comment:10 by dmik, 10 years ago

Still not clear what's wrong with the pipe the applet viewer code is trying to write to. The pipe just seems to be broken from the very beginning. Any attempt to write to it from the side of applet viewer (the JAVA process that the plugin starts in order to run Java on behalf of Firefox) causes an immediate failure. From the Java side everything seems to be fine: it just gets a file descriptor and initializes the output stream with it. Something is not right with the pipe descriptor itself. It's either g_spawn (which relies on file descriptor inheritance from LIBC) or g_io_channel_unix_new used by the plugin to monitor the descriptor activity. It's not clear though how it worked before.

comment:11 by dmik, 10 years ago

Now I guess that the pipe gets broken not because of Java or glib but because Firefox unexpectedly exits when it sends or receives something from the applet. This immediately makes the pipe invalid and causes any read or write operation on it at the other end (i.e. within the Java.exe process) to fail. This is proven by the fact that sometimes besides getting the write error I see log messages indicating that the applet viewer reads null from the other pipe (used by the plugin to send messages to it). Null would have been returned when the pipe gets closed.

I see the following two messages sent from the plugin to the applet:

  PIPE: plugin wrote(?): instance 1 handle 1744830466 width 500 height 150 tag https://www.java.com/en/download/installed.jsp?detect=jre code;JavaDetection.class;height;150;width;500;codebase;https://www.java.com/en/download/;image;/im/download/verify_anim.gif;centerimage;true;boxborder;false;jnlp_href;JavaDetection_applet.jnlp;

and

  PIPE: plugin wrote(?): instance 1 reference -2 GetJavaObject

The applet viewer receives both, starts to handle them and then can't write the answer to the pipe (and can't read the new messages from the other pipe as I already described above).

I need to better learn the surrounding code and add more debugging to see where exactly it stops working. The existing logging is not enough.

comment:12 by dmik, 10 years ago

What about why JAVA.EXE keeps hanging around after Firefox terminates, I did some investigation and came to a conclusion that this is caused by a delayed System.exit processing as I describe in #231. In case of the plugin, the applet viewer calls System.exit right after it fails to read from or write to the pipes it has been given by the plugin for communication. However, since some other thread in JAVA.EXE is being constantly busy with something else (perhaps, reading from the broken pipe in a tight loop at some other place and hence giving 100% CPU load), this exit is never processed (you have to kill the process).

comment:13 by dmik, 10 years ago

When I run the debug version of Firefox from the debugger, I see the "bad stack" exception on Thread 1. No proper call stack by that time of course. This could also explain why no EXCEPTQ report is generated.

Last edited 10 years ago by dmik (previous) (diff)

comment:14 by dmik, 10 years ago

Some new details: Thread 1 actually gets XCPT_ASYNC_PROC_TERMINATE exception — just in the middle of the loop that is waiting for an answer from the applet viewer. This means that some other thread causes the process to terminate (either due to an unhanded exception or due to an explicit DosExit call).

comment:15 by dmik, 10 years ago

Okay, the g_child_watch_add call was the culprit. It sets up some complex processing of the SIGCHLD signal and apparently on OS/2 this causes the process to terminate whenever its child terminates. The thing is that the plugin first starts java.exe -version to check that it works and this child process terminates very quickly. Then the applet viewer is started but by that time SIGCHLD from the first child reaches the firefox process and somehow causes it to terminate. This is a glib problem, I will investigate it tomorrow.

After temporarily disabling the g_child_watch_add call, I get it run further (no pipe read/write errors). However it seems to hang forever in the java process trying to create the Java plugin's "script object" for Firefox. Needs investigation too.

It's not clear to me how this g_child_watch_add thingy worked before — it's there since IcedTea Web v 1.1.2 (i.e. the one we started with). Glib also doesn't seem to be the guilty one — I tried it with the old version and it showed the same behavior. Then it must be something else (libc065?), I'm not going to find that out now.

comment:16 by dmik, 10 years ago

The last thing I see in the logs before it hangs now is the following:

plugin_in_pipe_callback
Using NetX panel
{height=150, centerimage=true, boxborder=false, width=500, jnlp_href=JavaDetection_applet.jnlp, code=JavaDetection.class, image=/im/download/verify_anim.gif, co
debase=https://www.java.com/en/download/}
X and Y are: 500 150
Framing net.sourceforge.jnlp.NetxPanel[panel0,0,0,0x0,invalid,layout=java.awt.FlowLayout]
  PIPE: plugin read: instance 1 status starting applet...
plugin_in_pipe_callback: setting status    starting applet...
plugin_in_pipe_callback return

And after a while it ends up with the following:

Error: Timed out when waiting for response to instance 1 reference -2 GetJavaObject
Error: Unable to fetch applet instance id from Java side.

I.e. the Java process (applet viewer) notifies the plugin that it is about to start the applet and then does nothing simply eating all CPU. Something goes wrong on the Java side. Needs more logging there.

Last edited 10 years ago by dmik (previous) (diff)

comment:17 by dmik, 10 years ago

The problem relies in the code that creates a NetxPanel object. The factory method doing this simply never returns. AFAIR this involves creating the native window (through Odin) so something may not work there as expected with the new plugin code. The debug build of Odin may give us a hint.

comment:18 by dmik, 10 years ago

Odin logs don't show anything unusual. More logging in NetxPanel is needed.

comment:19 by dmik, 10 years ago

As I see, a call to the sun.awt.windows.WEmbeddedFrame constructor never returns. WEmbeddedFrame is what is used in our Odin-based OpenJDK to wrap native PM windows (native PM window handle -> odin32 Win32-like window handle emulation -> WEmbeddedFrame). We've had problems with WEmbeddedFrame before though it is known to work with Firefox 10 and the very same OpenJDK / Odin implementation of this wrapping. Needs checking (on the OpenJDK side which means one more piece of software in debugging mode in the pipeline...).

comment:20 by dmik, 10 years ago

I successfully built OpenJDK with more logging however I got a very strange problem with this build: it can't start a new thread that is used by the plugin for some AWT initialization. The Thread.start method works, but Thread.join never returns and the first statement of the new thread's run() method is never executed. As a result, the AWT code with custom debugging is never called. The stock OpenJDK (which should be the very same source) doesn't have this problem but it also doesn't have extra custom debugging... I will try to substitute some files.

Last edited 10 years ago by dmik (previous) (diff)

comment:21 by dmik, 10 years ago

Okay, manual replacing jawtos2.dll and rt.jar made it use my extra debugging (strange, but will investigate that later). Now I need to analyze the results.

comment:22 by dmik, 10 years ago

It turns out that Java is hanging in an attempt to enter the synchronized (getTreeLock()) in EmbeddedFrame.addNotify() when creating an embedded frame for the plugin. This is very strange since Java must not have any problems with synchronizing primitives (at least it hasn't ever shown them before). My current guess is that some other thread in Java is doing something so hard that there is no chance for the waiting thread to ever get control... Will look in that direction.

comment:23 by dmik, 10 years ago

I have a strong suspicion now that somehow 100% CPU load blocks thread scheduling in Java. And it's not actually clear where this 100% CPU load comes from. Must be sockets. Need to find that out.

comment:24 by dmik, 10 years ago

I've checked several file read and socket-related places in Java and Odin — it all looks clear. So something else is spinning. Will have to check all threads in the plugin code (both C and Java sides).

comment:25 by dmik, 10 years ago

Okay, I have finally tracked down this CPU load problem. Turns out that this is a very dirty plugin code: they do a nop-like (i.e. a very tight) while loop in one thread while waiting for a status change from another thread. This while loop gives us 100% CPU load which in turn prevents other, newly created threads from being executed in our JVM implementation. Here is what happens, in more detail:

  1. Plugin applet viewer's consumer thread in JAVA.EXE (started by the Firefox part of the plugin, i.e. the DLL) reads a message from the Plugin DLL.
  2. It gets a "new plugin instance' message and starts a worker thread to process this request.
  3. Then, almost immediately following the first message, a new message arrives from the DLL: get an instance object (to interact with it from Firefox.
  4. The consumer thread starts a second worker to process the second request.
  5. The second worker finds out that the instance is not yet created (since the first worker hasn't completed its job yet) and enters this awful while loop.
  6. The first worker goes on with instance creation and comes to a point it creates a new thread whose only role is to create an object of the WEmbeddedFrame class and then starts to indefinitely wait on this thread's completion.
  7. However, because the second worker is spinning like hell in a while loop, the new thread does never get a chance to start its execution.
  8. As a result, the instance is never created, the state is never changed to what the second worker expects.

All in all, this problem has a very simple and proper fix: just give up time slices to the system. Did that in r440. It goes much further now but then Java.exe exits before it serves the applet needs. I need to analyze the logs.

Last edited 10 years ago by dmik (previous) (diff)

comment:26 by dmik, 10 years ago

Some thoughts on why it could work before. While just the very same code is IcedTea-Web 1.3 which we released before and which used to work, it seems that it was much less likely that Firefox would issue the second request (to get the object) immediately after the first request (to initialize the instance) so it was less likely that two workers would overlap in time. Now something has changed in Firefox and we get this. Also, CPU load on a particular system could affect it greatly which would explain why we might have seen similar deadlocks from time to time with 1.3 as well.

Also, this dirty code may work fine on Linux due to differences in how thread execution is scheduled by the kernel. Or may be their Firefox behaves differently. There are lots of options, actually.

Last edited 10 years ago by dmik (previous) (diff)

comment:27 by dmik, 10 years ago

Okay, I found where it stops next. The Java Test applet (https://www.java.com/en/download/JavaDetection_applet.jnlp) declares that there are some non-codebase resources used by the applet and, for the sake of security, a GUI dialog window has to be shown to ask for an explicit permission from the user to run this applet (this is new to IcedTea 1.5). It gets stuck while attempting to show this dialog. It is done using some additional thread and this somehow doesn't want to work on OS/2 (nothing happens and the calling code is waiting forever). Needs a closer look.

If we disable these security checks (by putting deployment.manifest.attributes.check=false to %HOME%/.config/icedtea-web/deployment.properties), it goes further, i.e. successfully instantiates the applet, creates a Java Object instance for it and passes that to the browser. However later it causes a hard system hang that requires a reboot. I suppose that the Firefox process crashes or unexpectedly terminates (not enough logging to say for sure). Will concentrate on this problem now.

comment:28 by dmik, 10 years ago

The strange thing is that after adding some more logging at some stage I spotted that the test page worked (it showed what version of Java I have) and then a hard hang again. So the plugin works per se but something screws up the browser.

comment:29 by dmik, 10 years ago

The hard hang has something to do with SMP. Leaving only one CPU active with ACPI.PSD /MAXCPU=1 makes the hang go away. The test plugin page works well, however applets that involve painting (like these ones: www.walter-fendt.de/ph14ru/) show nothing (just a distorted area of the screen). This used to work back then. I will try to fix it in single CPU mode and then we will deal with SMP (note that on this machine I have 4 cores; I guess it would work better on 2 cores meaning that it would hang not that often).

comment:30 by dmik, 10 years ago

I forgot to mention that sometimes I also see hard hangs in /MAXCPU=1 mode. So it must be some race condition which just becomes more obvious with 4 cores.

comment:31 by dmik, 10 years ago

One thing which I noticed when it hangs is that the Java process is spinning like hell in the Odin code on some thread. According to the logs, it basically calls WaitForSingleObject over and over again in a very tight loop and it returns immediately because the state of the object is signaled. Sometimes it also crashes in PMMERGE.DLL (always the same address), here's the info:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x1e93b25a, pid=833, tid=54591505
#
# JRE version: 6.0_27
# Java VM: OpenJDK Client VM (20.0-b12 mixed mode os2-x86 )
# Problematic frame:
# C  [PMMERGE+0x310b25a]
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00c83000):  JavaThread "AWT-Windows" daemon [_thread_in_native, id=54591505, stack(0x040f0000,0x04200000)]

siginfo: ExceptionCode=0xc0000005, writing address 0x00000008

Registers:
EAX=0x00000000, EBX=0x00000001, ECX=0x1b260000, EDX=0x13e8af20
ESP=0x041eeb34, EBP=0x041eeb3c, ESI=0x00000000, EDI=0x8000015d
EIP=0x1e93b25a, EFLAGS=0x00010246

Top of Stack: (sp=0x041eeb34)
0x041eeb34:   00000000 8000015d 041eeba4 1e94c090
0x041eeb44:   00000000 00000000 00000000 00000001
0x041eeb54:   00000001 00081001 1e90aa27 00000000
0x041eeb64:   8000015d 00000000 040e0000 8000015d
0x041eeb74:   8000015d 00000000 1e9097b4 8000015d
0x041eeb84:   00000001 8000015d 9c5db620 00000011
0x041eeb94:   00000011 00000341 00000341 00000000
0x041eeba4:   041eebf4 1e939ae6 8000015d 1b1fb36e 

Instructions: (pc=0x1e93b25a)
0x1e93b23a:   ff 75 e4 6a 01 e8 9c fd ff ff 83 c4 08 5e 5f c9
0x1e93b24a:   c3 00 55 8b ec 57 83 7d 08 00 56 75 0d 8b 45 10
0x1e93b25a:   8b 40 08 89 45 08 eb 0e 8b c0 8b 45 08 8b 00 89
0x1e93b26a:   45 08 0b c0 74 1c 8b 75 0c 8b 7d 08 57 e8 64 c9 


Register to memory mapping:

EAX=0x00000000 is an unknown value
EBX=0x00000001 is an unknown value
ECX=0x1b260000 is an unknown value
EDX=0x13e8af20 is an unknown value
ESP=0x041eeb34 is pointing into the stack for thread: 0x00c83000
EBP=0x041eeb3c is pointing into the stack for thread: 0x00c83000
ESI=0x00000000 is an unknown value
EDI=0x8000015d is an unknown value


Stack: [0x040f0000,0x04200000],  sp=0x041eeb34,  free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [PMMERGE+0x310b25a]
C  [PMMERGE+0x311c090]
C  [PMMERGE+0x3109ae6]
C  [ESTLRL11+0x3e69]
C  [USER32+0xec29]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  0x2115e7b0

[error occurred during error reporting (printing Java stack), id 0xc0000005]

comment:32 by dmik, 10 years ago

The call in USER32.DLL appears to be Win32FrameWindowProc. Really not much. This is a huge procedure that calls back PM many times.

comment:33 by dmik, 10 years ago

Short report to clarify things out a bit after unsuccessfully trying a bunch of things to shed light on that. Currently we are facing 3 issues:

  1. firefox.exe exits early, after calling NPP_GetValue(NPPVpluginScriptableNPObject) in response to which the Java plugin returns a java object wrapped in NPObject. I can't detect why firefox.exe exits yet. The exit seems to be normal, no crash, nothing. I tried to add more logging to track why it exits, but unsuccessful so far. Needs deeper code path investigation.
  2. java.exe hangs dong an endless spin (WaitForSingleObject on one thread and ResetEvent on another). As the hang is very hard, it's impossible to attach a debugger to it or force a proc dump. Dunno how to solve this other than add printf to every WaitForSingleObject usage in Java to see where it is called from. Requires some more time as there are quite a few usages of this call.
  3. What is described https://github.com/bitwiseworks/mozilla-os2/issues/7#issuecomment-27122697. As the first two problems kick in quite often, it's hard to work on that.

And sometimes it works. And when I use the debug build of Java and the debug build of Firefox it works all the time (except that sometimes I get case 2). Sigh...

comment:34 by dmik, 10 years ago

What for case 1, I collected some dumps but that's not much in them because the stack frame structure seems to be corrupt (as usual). I can track it down to as far as PMMERGE.ABNORMALPROCESSEXIT on thread 1. This proves the process decides to unexpectedly bail out but not more. Okay, then we go the slow plan B route by adding printf to the relevant parts of Mozilla code to find where it gives up.

comment:35 by dmik, 10 years ago

Okay, found where things go weird: it's nsObjectLoadingContent::SetupProtoChain. This function sets up the prototype chain of the JS object corresponding to the plugin instance and something goes wrong there sometimes causing the wrong code to be executed. JS machinery is a bit vague for me ATM so I need to debug even further. This code involves thread creation and event processing and my guess is that all these asynchronous things sometimes happen in the wrong order so the code doesn't expect it and blows.

comment:36 by dmik, 10 years ago

In fact, what happens is the following: according to my debugging and further testing, nsObjectLoadingContent::SetupProtoChain works fine (by successfully calling into the plugin DLL to obtain the JSObject instance) and returns to the caller. And the exact place of the crash is this line: https://github.com/bitwiseworks/mozilla-os2/blob/master/js/src/vm/Interpreter.cpp#L3227. According to the debugger, the value of regs.fp() is 0x3 which isn't a valid address. As a result, a call to isYielding() causes a SIGSEGV (I found that out by single stepping in IPMD through many assembler lines — why it doesn't pop up upon normal debugger run is another story, perhaps some of the exception handlers f*cks up, but that takes really long to single step through such complex code).

I need now to step-by-step it again to see what code breaks EBX value.

Last edited 10 years ago by dmik (previous) (diff)

comment:37 by dmik, 10 years ago

Strange, I'm now getting various assertions like this one:

Assertion failure: uint32_t(regs.pc - script->code) < script->length, at D:/Coding/mozilla/master/js/src/vm/Interpreter.cpp:3117

like if something went wrong with the JS stack frames. It shouldn't be here as normally JS works. I must be missing something obvious.

comment:38 by dmik, 10 years ago

Okay, the regs structure seems to be completely wrong by the time when the JS interpreter finishes interpreting the code that creates the plugin instance. In particular the PC value is completely out of range of the script code. At least some hint to the direction of the search.

comment:39 by dmik, 10 years ago

And another finding is that it seems that there is some JS exception pending at the end of the command processing, at least JSContext::isExceptionPending() tells us so. I will try to see what exception is that.

comment:40 by dmik, 10 years ago

Unfortunately, it is not EBX/stack corruption. The debugger was leading me the wrong way, it looked corrupted because I made a jump after INT 3 instead of letting it normally unwind the stack.

The code that is experiencing problems is a nested call into js::RunScript (which eventually invokes the JS interpreter, also in a nested way). While experimenting more in the debugger in manual mode trying to get to the initial RunScript call, I got this assertion:

###!!! ASSERTION: Recursive GetService!: 'Error', file D:/Coding/mozilla/master/xpcom/components/nsComponentManager.cpp, line 1415

I need to check that because it may be related.

comment:41 by dmik, 10 years ago

While working on this I checked the EXCEPTQ problem that Dave reported here https://groups.google.com/forum/#!topic/mozilla.dev.ports.os2/RIi1lj9XEcA and found out that it doesn't work indeed. It is not necessary to fully revert the commit as Dave mentions — according to my tests, it's enough to just disable the FPU exception handler that gets installed to fix FPU control word problems. Once done, EXCEPQ reports are back (but Mozilla breaks due to wrong FPU CW values regarding exception masking, of course). This is strange. The FPU exception handler looks pretty normal. The only explanation I see for now is that the EXCEPTQ exception handler f*cks up when it's not the top one. I need to report this to Steven.

I would really like to get EXCEPTQ reports back because it may save me a lot of time when debugging this JS madness.

comment:42 by dmik, 10 years ago

Well, the simple test case shows that EXCEPTQ works fine being followed by the FPU exception handler. So it must be Mozilla that messes it up some how. I need to find why. It may be related to the plugin problems we see.

comment:43 by dmik, 10 years ago

Okay, I found the reason why EXCEPTQ fails in firefox. We install two exception handlers in a row basically like this (with C++ magic unwrapped):

EXCEPTIONREGISTRATIONRECORD e1;
Install_EXCEPTQ_handler (&e1);
EXCEPTIONREGISTRATIONRECORD e2;
Install_FPU_handler (&e2);

I.e. the FPU handler is installed on top of the chain so that it is called first when an exception happens (to get a chance to handle FPU CW mess and retry execution). The EXCEPTQ handler comes after it. However, this assumes that the variable e2 is allocated below e1 on stack (i.e. it has a smaller address) — this is a strict requirement of the OS/2 kernel regarding structured system exceptions (somehow related to unwinding).

However, in case of Firefox it turns out that the order of e1 and e2 variables is quite the opposite: e2 has a greater address than e1. This screws up the kernel code handling the exception chain so that it throws another, nested exception (XCPT_BAD_STACK) before the second exception handler in chain (i.e. the EXCEPTQ one) gets a chance to execute. Or may be the kernel throws XCPT_BAD_STACK deliberately, when it finds that the order of exception registration records is wrong. Anyway, this nested exception simply aborts the process.

In my test program the order was right so it all worked. It was not difficult then to compare compile options and find out that the one that breaks the order is -O2 (i.e. the optimizer). The order is also broken in -O1 and -O3 modes. The only mode when it is right is -O0. I need to check which optimization option, exactly, controls this and what the standard says about that — to understand if it's designed or our compiler is broken.

Anyway, it seems that it's not a good idea to rely on the compiler when it comes to the order of exception registration records. I will redo the code to make sure the order is always right. And then we'll see how well Firefox will behave.

comment:44 by dmik, 10 years ago

Building Firefox in -O0 mode helped here (I had to fix one IPC source to make it build in this mode https://github.com/bitwiseworks/mozilla-os2/commit/26bcf3155598a7d066bd40f3a1b6e275c9a3bf03) — the EXCEPTQ handler is now properly called in the test case (that simulates a SIGSEGV), perfect.

Regarding the matter of the issue. The C standard doesn't specify how the variables are ordered on the stack (it doesn't even specify that a stack should exist at all, frankly saying). This means that the compiler may choose whatever order it wants. This is not suitable for us. Some ways to ensure the proper stack order include:

  • Use a function nested into another function (this guarantees a new stack frame is created at a lower address). This will add another entity and a bunch of ifdefs.
  • Use alloca() to allocate a single block of stack memory for both registration records and map this block to variables manually. This will kill the elegancy of our C++ wrappers that install exception handlers.

I don't like both of them and will think a bit more.

comment:45 by dmik, 10 years ago

And btw. I'm almost sure this problem is the reason for spontaneous crashes in JS code I've been recently fighting with: most likely our FPU exception handler kicks in (as it should — it's there mostly to fix JS floating point machinery) and then it screws as described above. Too bad these troubles are not seen in the debugger — it takes so many time to find them by hand.

comment:46 by dmik, 10 years ago

EXCEPTQ trap records are back in Firefox now, including optimized builds. See https://github.com/bitwiseworks/mozilla-os2/commit/d8d4c53c0000c66b64630310d229ede6ad77b99e for details.

The Verify Java version page now seems to work well. I will try more complex cases tomorrow.

comment:47 by dmik, 10 years ago

No, this fix doesn't seem actually fix the original problem. I just got it crash again. But this time with the trap report that says not so much though:

 Call Stack
______________________________________________________________________

   EBP     Address    Module     Obj:Offset    Nearest Public Symbol
 --------  ---------  --------  -------------  -----------------------
 Trap  ->  1E98DBD3   PMMERGE   0004:0015DBD3 
 0012F840  1E938F80   PMMERGE   0004:00108F80 
 201F97C4  00000001   Invalid address: 00000001

However, there are some interesting labels on stack to check (like XPConvert.cpp#1189). Code around this line tells us that it is trying to convert the JSObject to XPC exception (and as I mentioned earlier, at a time of the crash the exception flag is set for the crashing main thread).

comment:48 by dmik, 10 years ago

XPConvert.cpp was a false route but another chain of symbols on the stack (from .TRP) has led me to the crash point: it's one of the WinPeekMsg calls (in nsAppShell::ProcessNextNativeEvent). That explains why the crash is in PMMERGE (the crash is access violation when reading from memory inside UnlinkSendListSms called by ReceiveMessage called by SendMsg, as the dump file tells us). This kind of surprises me since I see nothing wrong with these calls.

comment:49 by dmik, 10 years ago

In particular, this call

WinPeekMsg((HAB)0, &qmsg, NULL, WM_CHAR, WM_VIOCHAR, PM_REMOVE)

causes PM to crash with c0000005. qmsg is just a struct on the stack so I really wonder what it is. I have to get the stack trace to see what context ProcessNativeEvent is called.

BTW, the PM gets screwed up after that crash most of the time. It either hangs or becomes unusable in the sense that a new Firefox process can't even start (hangs somewhere in PM init).

comment:50 by dmik, 10 years ago

After disabling the above call, the plugin seems to work most of the time but now it crashes at exit on the Java side like that (again PMMERGE.DLL):

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x1e93b25a, pid=6616, tid=433586193
#
# JRE version: 6.0_27
# Java VM: OpenJDK Client VM (20.0-b12 mixed mode os2-x86 )
# Problematic frame:
# C  [PMMERGE+0x310b25a]
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00c8b000):  JavaThread "AWT-Windows" daemon [_thread_in_native, id=433586193, stack(0x040f0000,0x04200000)]

siginfo: ExceptionCode=0xc0000005, writing address 0x00000008

Registers:
EAX=0x00000000, EBX=0x00000001, ECX=0x1b260000, EDX=0x13e8af20
ESP=0x041eebb4, EBP=0x041eebbc, ESI=0x00000000, EDI=0x800001ec
EIP=0x1e93b25a, EFLAGS=0x00010246

Top of Stack: (sp=0x041eebb4)
0x041eebb4:   00000000 800001ec 041eec24 1e94c090
0x041eebc4:   00000000 00000000 00000000 00000001
0x041eebd4:   00000001 00081001 1e90aa27 00000000
0x041eebe4:   800001ec 00000000 040e0000 8000019e
0x041eebf4:   800001ec 00000000 1e9097b4 800001ec
0x041eec04:   00000001 800001ec 9c5d04a0 00000011
0x041eec14:   00000011 000019d8 000019d8 00000000
0x041eec24:   041eec74 1e939ae6 800001ec 1b2100a6 

Instructions: (pc=0x1e93b25a)
0x1e93b23a:   ff 75 e4 6a 01 e8 9c fd ff ff 83 c4 08 5e 5f c9
0x1e93b24a:   c3 00 55 8b ec 57 83 7d 08 00 56 75 0d 8b 45 10
0x1e93b25a:   8b 40 08 89 45 08 eb 0e 8b c0 8b 45 08 8b 00 89
0x1e93b26a:   45 08 0b c0 74 1c 8b 75 0c 8b 7d 08 57 e8 64 c9 


Register to memory mapping:

EAX=0x00000000 is an unknown value
EBX=0x00000001 is an unknown value
ECX=0x1b260000 is an unknown value
EDX=0x13e8af20 is an unknown value
ESP=0x041eebb4 is pointing into the stack for thread: 0x00c8b000
EBP=0x041eebbc is pointing into the stack for thread: 0x00c8b000
ESI=0x00000000 is an unknown value
EDI=0x800001ec is an unknown value


Stack: [0x040f0000,0x04200000],  sp=0x041eebb4,  free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [PMMERGE+0x310b25a]
C  [PMMERGE+0x311c090]
C  [PMMERGE+0x3109ae6]
C  [ESTLRL11+0x3e69]
C  [USER32+0xec29]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  0x2115e7d0

[error occurred during error reporting (printing Java stack), id 0xc0000005]

comment:51 by dmik, 10 years ago

Just got another trap. It's so interesting that I'm posting it complete: http://pastebin.com/By6jULch. This seems to be the complete stack trace of the PMMERGE crash. XUL.DLL gives away control by calling the previous frame window procedure of the Firefox window and this procedure resides in ESTLRL11.DLL which then calls PMMERGE which then crashes. Very interesting. As you may see the Java trap report also contains a reference to ESTLR11.DLL at the very same address (which in turn means it's the very same window procedure of the frame window — but called from Odin's USER32.DLL, most likely from the OS/2 wrapper for the the emulated Win32 window). I will try to disable eStyler somehow to see if it helps.

The window message being processed is rather harmless WM_FOCUSCHANGE, telling us that the Firefox window is losing focus.

Last edited 10 years ago by dmik (previous) (diff)

comment:52 by dmik, 10 years ago

Important to mention that these recent hangs/traps happen with more complex Java applets. The simple version checker one seems to work quite stable. And these traps have something to do with the Java side of things as I see in the logs. But the existing logging doesn't tell me the reason. It is just hanging for very long and then the appletveiwer process (java-side) just terminates, sometimes with the Java trap file, sometimes — not.

Another unpleasant thing is that when java.exe terminates unexpectedly, this completely freezes Firefox. It's been waiting forever on the main thread completely blocking any user activity. This is due to bad design of both the Firefox plugin subsystem and the IcedTea plugin itself (which is very low quality code). Firefox cured this by moving all plugin interaction to a separate process (OOP). If I switch to OOP I believe Firefox will remain working but this will leave multiple hung plugin-container.exe processes per each plugin instance (not acceptable for a GP release). And besides, it will not solve appletveiwer.exe crashes anyway. We first need to fix these. Then we will sort out Firefox freezes in case of plugin problems.

Last edited 10 years ago by dmik (previous) (diff)

comment:53 by Andreas Buchinger, 10 years ago

Maybe totally unrelated but a few days ago I was hunting for a trap in pmmerge.dll too. Trap was while calling WinDefDlgProc with WM_ADJUSTFRAMEPOS message. WM_ADJUSTFRAMEPOS is not handled by my application and parameters are not changed by me so I see no reason why pmmerge can trap.

There was a reference to estlr11.dll in the trap file too.

For me the trap was very simply to reproduce - clicking on Max/Min buttons of DataSeeker. The trap does not occur anymore since removing optimization option from ICC (vac3.65).

My best guesses were some timing issues or estyler problem. PM SMP issues mentioned by Holger came to mind too.

Again maybe totally unrelated but your pmmerge trap sounds so familiar although I got it with another tool chain.

comment:54 by Steven Levine, 10 years ago

Cc: steve53@… added
Note: See TracTickets for help on using tickets.