Opened 10 years ago
Last modified 10 years ago
#230 new defect
IcedTea-Web: build version 1.5.1
Reported by: | dmik | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | Enhanced |
Component: | icedtea-web | Version: | |
Severity: | medium | Keywords: | |
Cc: | steve53@… |
Description
In r429 trunk was updated to version 1.5.1 from vendor. It doesn't currently build due to a number of changes incompatible with 1.3. This ticket is to address it.
Change History (54)
comment:1 by , 10 years ago
comment:2 by , 10 years ago
Type: | task → defect |
---|
Made it build in r431 using the KMK way (to make sure OS/2 code builds per se). Will perform a quick test now.
The configure-based build is next.
comment:3 by , 10 years ago
Type: | defect → task |
---|
comment:4 by , 10 years ago
A quick test shows that there are some problems with the Java executable path detection. Need to debug it.
And also RegisterClass from Odin32 fails for some reason. It should not. I need a debug version of Odin. Long time no see.
comment:5 by , 10 years ago
Type: | task → defect |
---|
I fixed the Java executable problem in r433, that was a silly typo. It starts now but something goes wrong with the pipe, the Java side of the plugin says the following and terminates its work:
Unable to write to PIPE. APPLETVIEWER exiting
Not clear why it happens still. This needs some deeper debugging of the applet viewer logic.
The interesting thing is that when dom.ipc.plugins.enabled
is set to false
in Mozilla (i.e. when the plugin is run in the same process as Firefox), the failure of the applet viewer (running inside the java.ex process) causes firefox.exe to terminate (and java.exe keeps running in a semi-hang state requiring a kill). It seems that despite the workaround I added to prevent early process exit on pipe read error (i.e. signal (SIGPIPE, SIG_IGN)
), firefox still exits when the plugin it can't read from its end of the pipe. Anyway, this problem comes from glib (used by the plugin for IPC with java.exe that actually runs the applet). I will need to build the debug version of glib in order to investigate that.
comment:6 by , 10 years ago
The (relatively) new glib version 2.33.12 is built but running across it causes the following error:
(process:27537): GLib-WARNING **: poll(2) failed due to: unknown error (22).
Looks like there is some regression compared to glib 2.25.15 that we used before. Perhaps, the wrong pipe implementation is now on. Will investigate.
comment:7 by , 10 years ago
When I run in with IPC enabled, I get a 100% CPU hang followed by this:
###!!! [Parent][RPCChannel] Error: Channel timeout: cannot send/recv RWSCLI08: RwsExitListProc - termination code= 0 WARNING: pipe error: Broken pipe: file D:/Coding/mozilla/master/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 712
Also some thing to check.
comment:8 by , 10 years ago
Both problems seem to come from a change in glib where they introduced a new primitive, GWakeup. It needs porting to OS/2. See http://trac.netlabs.org/ports/ticket/14#comment:8 for details.
comment:9 by , 10 years ago
GWakeup was ported to OS/2, the problem described in comment:6 has disappeared and now we again have what we had with glib 2.25.15 (described in comment:5). The problem from comment:7 is something different (happens with the old glib as well). I will focus on the comment:5 problem now. IPC is for later (I have some ideas there too though — the problems may be related to our g_spawn implementation).
comment:10 by , 10 years ago
Still not clear what's wrong with the pipe the applet viewer code is trying to write to. The pipe just seems to be broken from the very beginning. Any attempt to write to it from the side of applet viewer (the JAVA process that the plugin starts in order to run Java on behalf of Firefox) causes an immediate failure. From the Java side everything seems to be fine: it just gets a file descriptor and initializes the output stream with it. Something is not right with the pipe descriptor itself. It's either g_spawn
(which relies on file descriptor inheritance from LIBC) or g_io_channel_unix_new
used by the plugin to monitor the descriptor activity. It's not clear though how it worked before.
comment:11 by , 10 years ago
Now I guess that the pipe gets broken not because of Java or glib but because Firefox unexpectedly exits when it sends or receives something from the applet. This immediately makes the pipe invalid and causes any read or write operation on it at the other end (i.e. within the Java.exe process) to fail. This is proven by the fact that sometimes besides getting the write error I see log messages indicating that the applet viewer reads null from the other pipe (used by the plugin to send messages to it). Null would have been returned when the pipe gets closed.
I see the following two messages sent from the plugin to the applet:
PIPE: plugin wrote(?): instance 1 handle 1744830466 width 500 height 150 tag https://www.java.com/en/download/installed.jsp?detect=jre code;JavaDetection.class;height;150;width;500;codebase;https://www.java.com/en/download/;image;/im/download/verify_anim.gif;centerimage;true;boxborder;false;jnlp_href;JavaDetection_applet.jnlp;
and
PIPE: plugin wrote(?): instance 1 reference -2 GetJavaObject
The applet viewer receives both, starts to handle them and then can't write the answer to the pipe (and can't read the new messages from the other pipe as I already described above).
I need to better learn the surrounding code and add more debugging to see where exactly it stops working. The existing logging is not enough.
comment:12 by , 10 years ago
What about why JAVA.EXE keeps hanging around after Firefox terminates, I did some investigation and came to a conclusion that this is caused by a delayed System.exit
processing as I describe in #231. In case of the plugin, the applet viewer calls System.exit
right after it fails to read from or write to the pipes it has been given by the plugin for communication. However, since some other thread in JAVA.EXE is being constantly busy with something else (perhaps, reading from the broken pipe in a tight loop at some other place and hence giving 100% CPU load), this exit is never processed (you have to kill the process).
comment:13 by , 10 years ago
When I run the debug version of Firefox from the debugger, I see the "bad stack" exception on Thread 1. No proper call stack by that time of course. This could also explain why no EXCEPTQ report is generated.
comment:14 by , 10 years ago
Some new details: Thread 1 actually gets XCPT_ASYNC_PROC_TERMINATE exception — just in the middle of the loop that is waiting for an answer from the applet viewer. This means that some other thread causes the process to terminate (either due to an unhanded exception or due to an explicit DosExit call).
comment:15 by , 10 years ago
Okay, the g_child_watch_add
call was the culprit. It sets up some complex processing of the SIGCHLD
signal and apparently on OS/2 this causes the process to terminate whenever its child terminates. The thing is that the plugin first starts java.exe -version
to check that it works and this child process terminates very quickly. Then the applet viewer is started but by that time SIGCHLD
from the first child reaches the firefox process and somehow causes it to terminate. This is a glib problem, I will investigate it tomorrow.
After temporarily disabling the g_child_watch_add
call, I get it run further (no pipe read/write errors). However it seems to hang forever in the java process trying to create the Java plugin's "script object" for Firefox. Needs investigation too.
It's not clear to me how this g_child_watch_add
thingy worked before — it's there since IcedTea Web v 1.1.2 (i.e. the one we started with). Glib also doesn't seem to be the guilty one — I tried it with the old version and it showed the same behavior. Then it must be something else (libc065?), I'm not going to find that out now.
comment:16 by , 10 years ago
The last thing I see in the logs before it hangs now is the following:
plugin_in_pipe_callback Using NetX panel {height=150, centerimage=true, boxborder=false, width=500, jnlp_href=JavaDetection_applet.jnlp, code=JavaDetection.class, image=/im/download/verify_anim.gif, co debase=https://www.java.com/en/download/} X and Y are: 500 150 Framing net.sourceforge.jnlp.NetxPanel[panel0,0,0,0x0,invalid,layout=java.awt.FlowLayout] PIPE: plugin read: instance 1 status starting applet... plugin_in_pipe_callback: setting status starting applet... plugin_in_pipe_callback return
And after a while it ends up with the following:
Error: Timed out when waiting for response to instance 1 reference -2 GetJavaObject Error: Unable to fetch applet instance id from Java s6ide.
I.e. the Java process (applet viewer) notifies the plugin that it is about to start the applet and then does nothing simply eating all CPU. Something goes wrong on the Java side. Needs more logging there.
comment:17 by , 10 years ago
The problem relies in the code that creates a NetxPanel
object. The factory method doing this simply never returns. AFAIR this involves creating the native window (through Odin) so something may not work there as expected with the new plugin code. The debug build of Odin may give us a hint.
comment:18 by , 10 years ago
Odin logs don't show anything unusual. More logging in NetxPanel is needed.
comment:19 by , 10 years ago
As I see, a call to the sun.awt.windows.WEmbeddedFrame
constructor never returns. WEmbeddedFrame
is what is used in our Odin-based OpenJDK to wrap native PM windows (native PM window handle -> odin32 Win32-like window handle emulation -> WEmbeddedFrame). We've had problems with WEmbeddedFrame before though it is known to work with Firefox 10 and the very same OpenJDK / Odin implementation of this wrapping. Needs checking (on the OpenJDK side which means one more piece of software in debugging mode in the pipeline...).
comment:20 by , 10 years ago
I successfully built OpenJDK with more logging however I got a very strange problem with this build: it can't start a new thread that is used by the plugin for some AWT initialization. The Thread.start method works, but Thread.join never returns and the first statement of the new thread's run() method is never executed. As a result, the AWT code with custom debugging is never called. The stock OpenJDK (which should be the very same source) doesn't have this problem but it also doesn't have extra custom debugging... I will try to substitute some files.
comment:21 by , 10 years ago
Okay, manual replacing jawtos2.dll and rt.jar made it use my extra debugging (strange, but will investigate that later). Now I need to analyze the results.
comment:22 by , 10 years ago
It turns out that Java is hanging in an attempt to enter the synchronized (getTreeLock())
in EmbeddedFrame.addNotify()
when creating an embedded frame for the plugin. This is very strange since Java must not have any problems with synchronizing primitives (at least it hasn't ever shown them before). My current guess is that some other thread in Java is doing something so hard that there is no chance for the waiting thread to ever get control... Will look in that direction.
comment:23 by , 10 years ago
I have a strong suspicion now that somehow 100% CPU load blocks thread scheduling in Java. And it's not actually clear where this 100% CPU load comes from. Must be sockets. Need to find that out.
comment:24 by , 10 years ago
I've checked several file read and socket-related places in Java and Odin — it all looks clear. So something else is spinning. Will have to check all threads in the plugin code (both C and Java sides).
comment:25 by , 10 years ago
Okay, I have finally tracked down this CPU load problem. Turns out that this is a very dirty plugin code: they do a nop-like (i.e. a very tight) while
loop in one thread while waiting for a status change from another thread. This while loop gives us 100% CPU load which in turn prevents other, newly created threads from being executed in our JVM implementation. Here is what happens, in more detail:
- Plugin applet viewer's consumer thread in JAVA.EXE (started by the Firefox part of the plugin, i.e. the DLL) reads a message from the Plugin DLL.
- It gets a "new plugin instance' message and starts a worker thread to process this request.
- Then, almost immediately following the first message, a new message arrives from the DLL: get an instance object (to interact with it from Firefox.
- The consumer thread starts a second worker to process the second request.
- The second worker finds out that the instance is not yet created (since the first worker hasn't completed its job yet) and enters this awful while loop.
- The first worker goes on with instance creation and comes to a point it creates a new thread whose only role is to create an object of the
WEmbeddedFrame
class and then starts to indefinitely wait on this thread's completion. - However, because the second worker is spinning like hell in a while loop, the new thread does never get a chance to start its execution.
- As a result, the instance is never created, the state is never changed to what the second worker expects.
All in all, this problem has a very simple and proper fix: just give up time slices to the system. Did that in r440. It goes much further now but then Java.exe exits before it serves the applet needs. I need to analyze the logs.
comment:26 by , 10 years ago
Some thoughts on why it could work before. While just the very same code is IcedTea-Web 1.3 which we released before and which used to work, it seems that it was much less likely that Firefox would issue the second request (to get the object) immediately after the first request (to initialize the instance) so it was less likely that two workers would overlap in time. Now something has changed in Firefox and we get this. Also, CPU load on a particular system could affect it greatly which would explain why we might have seen similar deadlocks from time to time with 1.3 as well.
Also, this dirty code may work fine on Linux due to differences in how thread execution is scheduled by the kernel. Or may be their Firefox behaves differently. There are lots of options, actually.
comment:27 by , 10 years ago
Okay, I found where it stops next. The Java Test applet (https://www.java.com/en/download/JavaDetection_applet.jnlp
) declares that there are some non-codebase resources used by the applet and, for the sake of security, a GUI dialog window has to be shown to ask for an explicit permission from the user to run this applet (this is new to IcedTea 1.5). It gets stuck while attempting to show this dialog. It is done using some additional thread and this somehow doesn't want to work on OS/2 (nothing happens and the calling code is waiting forever). Needs a closer look.
If we disable these security checks (by putting deployment.manifest.attributes.check=false
to %HOME%/.config/icedtea-web/deployment.properties
), it goes further, i.e. successfully instantiates the applet, creates a Java Object instance for it and passes that to the browser. However later it causes a hard system hang that requires a reboot. I suppose that the Firefox process crashes or unexpectedly terminates (not enough logging to say for sure). Will concentrate on this problem now.
comment:28 by , 10 years ago
The strange thing is that after adding some more logging at some stage I spotted that the test page worked (it showed what version of Java I have) and then a hard hang again. So the plugin works per se but something screws up the browser.
comment:29 by , 10 years ago
The hard hang has something to do with SMP. Leaving only one CPU active with ACPI.PSD /MAXCPU=1
makes the hang go away. The test plugin page works well, however applets that involve painting (like these ones: www.walter-fendt.de/ph14ru/) show nothing (just a distorted area of the screen). This used to work back then. I will try to fix it in single CPU mode and then we will deal with SMP (note that on this machine I have 4 cores; I guess it would work better on 2 cores meaning that it would hang not that often).
comment:30 by , 10 years ago
I forgot to mention that sometimes I also see hard hangs in /MAXCPU=1
mode. So it must be some race condition which just becomes more obvious with 4 cores.
comment:31 by , 10 years ago
One thing which I noticed when it hangs is that the Java process is spinning like hell in the Odin code on some thread. According to the logs, it basically calls WaitForSingleObject
over and over again in a very tight loop and it returns immediately because the state of the object is signaled. Sometimes it also crashes in PMMERGE.DLL (always the same address), here's the info:
# # A fatal error has been detected by the Java Runtime Environment: # # EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x1e93b25a, pid=833, tid=54591505 # # JRE version: 6.0_27 # Java VM: OpenJDK Client VM (20.0-b12 mixed mode os2-x86 ) # Problematic frame: # C [PMMERGE+0x310b25a] # # If you would like to submit a bug report, please visit: # http://java.sun.com/webapps/bugreport/crash.jsp # The crash happened outside the Java Virtual Machine in native code. # See problematic frame for where to report the bug. # --------------- T H R E A D --------------- Current thread (0x00c83000): JavaThread "AWT-Windows" daemon [_thread_in_native, id=54591505, stack(0x040f0000,0x04200000)] siginfo: ExceptionCode=0xc0000005, writing address 0x00000008 Registers: EAX=0x00000000, EBX=0x00000001, ECX=0x1b260000, EDX=0x13e8af20 ESP=0x041eeb34, EBP=0x041eeb3c, ESI=0x00000000, EDI=0x8000015d EIP=0x1e93b25a, EFLAGS=0x00010246 Top of Stack: (sp=0x041eeb34) 0x041eeb34: 00000000 8000015d 041eeba4 1e94c090 0x041eeb44: 00000000 00000000 00000000 00000001 0x041eeb54: 00000001 00081001 1e90aa27 00000000 0x041eeb64: 8000015d 00000000 040e0000 8000015d 0x041eeb74: 8000015d 00000000 1e9097b4 8000015d 0x041eeb84: 00000001 8000015d 9c5db620 00000011 0x041eeb94: 00000011 00000341 00000341 00000000 0x041eeba4: 041eebf4 1e939ae6 8000015d 1b1fb36e Instructions: (pc=0x1e93b25a) 0x1e93b23a: ff 75 e4 6a 01 e8 9c fd ff ff 83 c4 08 5e 5f c9 0x1e93b24a: c3 00 55 8b ec 57 83 7d 08 00 56 75 0d 8b 45 10 0x1e93b25a: 8b 40 08 89 45 08 eb 0e 8b c0 8b 45 08 8b 00 89 0x1e93b26a: 45 08 0b c0 74 1c 8b 75 0c 8b 7d 08 57 e8 64 c9 Register to memory mapping: EAX=0x00000000 is an unknown value EBX=0x00000001 is an unknown value ECX=0x1b260000 is an unknown value EDX=0x13e8af20 is an unknown value ESP=0x041eeb34 is pointing into the stack for thread: 0x00c83000 EBP=0x041eeb3c is pointing into the stack for thread: 0x00c83000 ESI=0x00000000 is an unknown value EDI=0x8000015d is an unknown value Stack: [0x040f0000,0x04200000], sp=0x041eeb34, free space=1018k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) C [PMMERGE+0x310b25a] C [PMMERGE+0x311c090] C [PMMERGE+0x3109ae6] C [ESTLRL11+0x3e69] C [USER32+0xec29] Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) j 0x2115e7b0 [error occurred during error reporting (printing Java stack), id 0xc0000005]
comment:32 by , 10 years ago
The call in USER32.DLL appears to be Win32FrameWindowProc
. Really not much. This is a huge procedure that calls back PM many times.
comment:33 by , 10 years ago
Short report to clarify things out a bit after unsuccessfully trying a bunch of things to shed light on that. Currently we are facing 3 issues:
- firefox.exe exits early, after calling
NPP_GetValue(NPPVpluginScriptableNPObject)
in response to which the Java plugin returns a java object wrapped inNPObject
. I can't detect why firefox.exe exits yet. The exit seems to be normal, no crash, nothing. I tried to add more logging to track why it exits, but unsuccessful so far. Needs deeper code path investigation. - java.exe hangs dong an endless spin (
WaitForSingleObject
on one thread andResetEvent
on another). As the hang is very hard, it's impossible to attach a debugger to it or force a proc dump. Dunno how to solve this other than add printf to everyWaitForSingleObject
usage in Java to see where it is called from. Requires some more time as there are quite a few usages of this call. - What is described https://github.com/bitwiseworks/mozilla-os2/issues/7#issuecomment-27122697. As the first two problems kick in quite often, it's hard to work on that.
And sometimes it works. And when I use the debug build of Java and the debug build of Firefox it works all the time (except that sometimes I get case 2). Sigh...
comment:34 by , 10 years ago
What for case 1, I collected some dumps but that's not much in them because the stack frame structure seems to be corrupt (as usual). I can track it down to as far as PMMERGE.ABNORMALPROCESSEXIT on thread 1. This proves the process decides to unexpectedly bail out but not more. Okay, then we go the slow plan B route by adding printf to the relevant parts of Mozilla code to find where it gives up.
comment:35 by , 10 years ago
Okay, found where things go weird: it's nsObjectLoadingContent::SetupProtoChain
. This function sets up the prototype chain of the JS object corresponding to the plugin instance and something goes wrong there sometimes causing the wrong code to be executed. JS machinery is a bit vague for me ATM so I need to debug even further. This code involves thread creation and event processing and my guess is that all these asynchronous things sometimes happen in the wrong order so the code doesn't expect it and blows.
comment:36 by , 10 years ago
In fact, what happens is the following: according to my debugging and further testing, nsObjectLoadingContent::SetupProtoChain
works fine (by successfully calling into the plugin DLL to obtain the JSObject instance) and returns to the caller. And the exact place of the crash is this line: https://github.com/bitwiseworks/mozilla-os2/blob/master/js/src/vm/Interpreter.cpp#L3227. According to the debugger, the value of regs.fp()
is 0x3 which isn't a valid address. As a result, a call to isYielding()
causes a SIGSEGV (I found that out by single stepping in IPMD through many assembler lines — why it doesn't pop up upon normal debugger run is another story, perhaps some of the exception handlers f*cks up, but that takes really long to single step through such complex code).
I need now to step-by-step it again to see what code breaks EBX value.
comment:37 by , 10 years ago
Strange, I'm now getting various assertions like this one:
Assertion failure: uint32_t(regs.pc - script->code) < script->length, at D:/Coding/mozilla/master/js/src/vm/Interpreter.cpp:3117
like if something went wrong with the JS stack frames. It shouldn't be here as normally JS works. I must be missing something obvious.
comment:38 by , 10 years ago
Okay, the regs
structure seems to be completely wrong by the time when the JS interpreter finishes interpreting the code that creates the plugin instance. In particular the PC value is completely out of range of the script code. At least some hint to the direction of the search.
comment:39 by , 10 years ago
And another finding is that it seems that there is some JS exception pending at the end of the command processing, at least JSContext::isExceptionPending()
tells us so. I will try to see what exception is that.
comment:40 by , 10 years ago
Unfortunately, it is not EBX/stack corruption. The debugger was leading me the wrong way, it looked corrupted because I made a jump after INT 3 instead of letting it normally unwind the stack.
The code that is experiencing problems is a nested call into js::RunScript (which eventually invokes the JS interpreter, also in a nested way). While experimenting more in the debugger in manual mode trying to get to the initial RunScript call, I got this assertion:
###!!! ASSERTION: Recursive GetService!: 'Error', file D:/Coding/mozilla/master/xpcom/components/nsComponentManager.cpp, line 1415
I need to check that because it may be related.
comment:41 by , 10 years ago
While working on this I checked the EXCEPTQ problem that Dave reported here https://groups.google.com/forum/#!topic/mozilla.dev.ports.os2/RIi1lj9XEcA and found out that it doesn't work indeed. It is not necessary to fully revert the commit as Dave mentions — according to my tests, it's enough to just disable the FPU exception handler that gets installed to fix FPU control word problems. Once done, EXCEPQ reports are back (but Mozilla breaks due to wrong FPU CW values regarding exception masking, of course). This is strange. The FPU exception handler looks pretty normal. The only explanation I see for now is that the EXCEPTQ exception handler f*cks up when it's not the top one. I need to report this to Steven.
I would really like to get EXCEPTQ reports back because it may save me a lot of time when debugging this JS madness.
comment:42 by , 10 years ago
Well, the simple test case shows that EXCEPTQ works fine being followed by the FPU exception handler. So it must be Mozilla that messes it up some how. I need to find why. It may be related to the plugin problems we see.
comment:43 by , 10 years ago
Okay, I found the reason why EXCEPTQ fails in firefox. We install two exception handlers in a row basically like this (with C++ magic unwrapped):
EXCEPTIONREGISTRATIONRECORD e1; Install_EXCEPTQ_handler (&e1); EXCEPTIONREGISTRATIONRECORD e2; Install_FPU_handler (&e2);
I.e. the FPU handler is installed on top of the chain so that it is called first when an exception happens (to get a chance to handle FPU CW mess and retry execution). The EXCEPTQ handler comes after it. However, this assumes that the variable e2
is allocated below e1
on stack (i.e. it has a smaller address) — this is a strict requirement of the OS/2 kernel regarding structured system exceptions (somehow related to unwinding).
However, in case of Firefox it turns out that the order of e1
and e2
variables is quite the opposite: e2
has a greater address than e1
. This screws up the kernel code handling the exception chain so that it throws another, nested exception (XCPT_BAD_STACK) before the second exception handler in chain (i.e. the EXCEPTQ one) gets a chance to execute. Or may be the kernel throws XCPT_BAD_STACK deliberately, when it finds that the order of exception registration records is wrong. Anyway, this nested exception simply aborts the process.
In my test program the order was right so it all worked. It was not difficult then to compare compile options and find out that the one that breaks the order is -O2
(i.e. the optimizer). The order is also broken in -O1
and -O3
modes. The only mode when it is right is -O0
. I need to check which optimization option, exactly, controls this and what the standard says about that — to understand if it's designed or our compiler is broken.
Anyway, it seems that it's not a good idea to rely on the compiler when it comes to the order of exception registration records. I will redo the code to make sure the order is always right. And then we'll see how well Firefox will behave.
comment:44 by , 10 years ago
Building Firefox in -O0 mode helped here (I had to fix one IPC source to make it build in this mode https://github.com/bitwiseworks/mozilla-os2/commit/26bcf3155598a7d066bd40f3a1b6e275c9a3bf03) — the EXCEPTQ handler is now properly called in the test case (that simulates a SIGSEGV), perfect.
Regarding the matter of the issue. The C standard doesn't specify how the variables are ordered on the stack (it doesn't even specify that a stack should exist at all, frankly saying). This means that the compiler may choose whatever order it wants. This is not suitable for us. Some ways to ensure the proper stack order include:
- Use a function nested into another function (this guarantees a new stack frame is created at a lower address). This will add another entity and a bunch of ifdefs.
- Use
alloca()
to allocate a single block of stack memory for both registration records and map this block to variables manually. This will kill the elegancy of our C++ wrappers that install exception handlers.
I don't like both of them and will think a bit more.
comment:45 by , 10 years ago
And btw. I'm almost sure this problem is the reason for spontaneous crashes in JS code I've been recently fighting with: most likely our FPU exception handler kicks in (as it should — it's there mostly to fix JS floating point machinery) and then it screws as described above. Too bad these troubles are not seen in the debugger — it takes so many time to find them by hand.
comment:46 by , 10 years ago
EXCEPTQ trap records are back in Firefox now, including optimized builds. See https://github.com/bitwiseworks/mozilla-os2/commit/d8d4c53c0000c66b64630310d229ede6ad77b99e for details.
The Verify Java version page now seems to work well. I will try more complex cases tomorrow.
comment:47 by , 10 years ago
No, this fix doesn't seem actually fix the original problem. I just got it crash again. But this time with the trap report that says not so much though:
Call Stack ______________________________________________________________________ EBP Address Module Obj:Offset Nearest Public Symbol -------- --------- -------- ------------- ----------------------- Trap -> 1E98DBD3 PMMERGE 0004:0015DBD3 0012F840 1E938F80 PMMERGE 0004:00108F80 201F97C4 00000001 Invalid address: 00000001
However, there are some interesting labels on stack to check (like XPConvert.cpp#1189). Code around this line tells us that it is trying to convert the JSObject to XPC exception (and as I mentioned earlier, at a time of the crash the exception flag is set for the crashing main thread).
comment:48 by , 10 years ago
XPConvert.cpp was a false route but another chain of symbols on the stack (from .TRP) has led me to the crash point: it's one of the WinPeekMsg
calls (in nsAppShell::ProcessNextNativeEvent
). That explains why the crash is in PMMERGE (the crash is access violation when reading from memory inside UnlinkSendListSms
called by ReceiveMessage
called by SendMsg
, as the dump file tells us). This kind of surprises me since I see nothing wrong with these calls.
comment:49 by , 10 years ago
In particular, this call
WinPeekMsg((HAB)0, &qmsg, NULL, WM_CHAR, WM_VIOCHAR, PM_REMOVE)
causes PM to crash with c0000005. qmsg
is just a struct on the stack so I really wonder what it is. I have to get the stack trace to see what context ProcessNativeEvent
is called.
BTW, the PM gets screwed up after that crash most of the time. It either hangs or becomes unusable in the sense that a new Firefox process can't even start (hangs somewhere in PM init).
comment:50 by , 10 years ago
After disabling the above call, the plugin seems to work most of the time but now it crashes at exit on the Java side like that (again PMMERGE.DLL):
# # A fatal error has been detected by the Java Runtime Environment: # # EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x1e93b25a, pid=6616, tid=433586193 # # JRE version: 6.0_27 # Java VM: OpenJDK Client VM (20.0-b12 mixed mode os2-x86 ) # Problematic frame: # C [PMMERGE+0x310b25a] # # If you would like to submit a bug report, please visit: # http://java.sun.com/webapps/bugreport/crash.jsp # The crash happened outside the Java Virtual Machine in native code. # See problematic frame for where to report the bug. # --------------- T H R E A D --------------- Current thread (0x00c8b000): JavaThread "AWT-Windows" daemon [_thread_in_native, id=433586193, stack(0x040f0000,0x04200000)] siginfo: ExceptionCode=0xc0000005, writing address 0x00000008 Registers: EAX=0x00000000, EBX=0x00000001, ECX=0x1b260000, EDX=0x13e8af20 ESP=0x041eebb4, EBP=0x041eebbc, ESI=0x00000000, EDI=0x800001ec EIP=0x1e93b25a, EFLAGS=0x00010246 Top of Stack: (sp=0x041eebb4) 0x041eebb4: 00000000 800001ec 041eec24 1e94c090 0x041eebc4: 00000000 00000000 00000000 00000001 0x041eebd4: 00000001 00081001 1e90aa27 00000000 0x041eebe4: 800001ec 00000000 040e0000 8000019e 0x041eebf4: 800001ec 00000000 1e9097b4 800001ec 0x041eec04: 00000001 800001ec 9c5d04a0 00000011 0x041eec14: 00000011 000019d8 000019d8 00000000 0x041eec24: 041eec74 1e939ae6 800001ec 1b2100a6 Instructions: (pc=0x1e93b25a) 0x1e93b23a: ff 75 e4 6a 01 e8 9c fd ff ff 83 c4 08 5e 5f c9 0x1e93b24a: c3 00 55 8b ec 57 83 7d 08 00 56 75 0d 8b 45 10 0x1e93b25a: 8b 40 08 89 45 08 eb 0e 8b c0 8b 45 08 8b 00 89 0x1e93b26a: 45 08 0b c0 74 1c 8b 75 0c 8b 7d 08 57 e8 64 c9 Register to memory mapping: EAX=0x00000000 is an unknown value EBX=0x00000001 is an unknown value ECX=0x1b260000 is an unknown value EDX=0x13e8af20 is an unknown value ESP=0x041eebb4 is pointing into the stack for thread: 0x00c8b000 EBP=0x041eebbc is pointing into the stack for thread: 0x00c8b000 ESI=0x00000000 is an unknown value EDI=0x800001ec is an unknown value Stack: [0x040f0000,0x04200000], sp=0x041eebb4, free space=1018k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) C [PMMERGE+0x310b25a] C [PMMERGE+0x311c090] C [PMMERGE+0x3109ae6] C [ESTLRL11+0x3e69] C [USER32+0xec29] Java frames: (J=compiled Java code, j=interpreted, Vv=VM code) j 0x2115e7d0 [error occurred during error reporting (printing Java stack), id 0xc0000005]
comment:51 by , 10 years ago
Just got another trap. It's so interesting that I'm posting it complete: http://pastebin.com/By6jULch. This seems to be the complete stack trace of the PMMERGE crash. XUL.DLL gives away control by calling the previous frame window procedure of the Firefox window and this procedure resides in ESTLRL11.DLL which then calls PMMERGE which then crashes. Very interesting. As you may see the Java trap report also contains a reference to ESTLR11.DLL at the very same address (which in turn means it's the very same window procedure of the frame window — but called from Odin's USER32.DLL, most likely from the OS/2 wrapper for the the emulated Win32 window). I will try to disable eStyler somehow to see if it helps.
The window message being processed is rather harmless WM_FOCUSCHANGE, telling us that the Firefox window is losing focus.
comment:52 by , 10 years ago
Important to mention that these recent hangs/traps happen with more complex Java applets. The simple version checker one seems to work quite stable. And these traps have something to do with the Java side of things as I see in the logs. But the existing logging doesn't tell me the reason. It is just hanging for very long and then the appletveiwer process (java-side) just terminates, sometimes with the Java trap file, sometimes — not.
Another unpleasant thing is that when java.exe terminates unexpectedly, this completely freezes Firefox. It's been waiting forever on the main thread completely blocking any user activity. This is due to bad design of both the Firefox plugin subsystem and the IcedTea plugin itself (which is very low quality code). Firefox cured this by moving all plugin interaction to a separate process (OOP). If I switch to OOP I believe Firefox will remain working but this will leave multiple hung plugin-container.exe processes per each plugin instance (not acceptable for a GP release). And besides, it will not solve appletveiwer.exe crashes anyway. We first need to fix these. Then we will sort out Firefox freezes in case of plugin problems.
comment:53 by , 10 years ago
Maybe totally unrelated but a few days ago I was hunting for a trap in pmmerge.dll too. Trap was while calling WinDefDlgProc with WM_ADJUSTFRAMEPOS message. WM_ADJUSTFRAMEPOS is not handled by my application and parameters are not changed by me so I see no reason why pmmerge can trap.
There was a reference to estlr11.dll in the trap file too.
For me the trap was very simply to reproduce - clicking on Max/Min buttons of DataSeeker. The trap does not occur anymore since removing optimization option from ICC (vac3.65).
My best guesses were some timing issues or estyler problem. PM SMP issues mentioned by Holger came to mind too.
Again maybe totally unrelated but your pmmerge trap sounds so familiar although I got it with another tool chain.
comment:54 by , 10 years ago
Cc: | added |
---|
In short, 1.5 went pretty far away from 1.3 configuration-wise. Now the plugin uses various settings files (normally located in the user directory) and it takes things like the JRE path from these files. This made my previous OS/2-specific hacks in this area not working any more. I think we should generally respect the new logic but with some additions. I.e. we will also read the settings files but by default we will use auto-detection for things like JRE instead of always using the hardcoded path (which still may be in use later though, once we have JRE in RPM).
I also think that we should try to build the plugin using its configure instead of our own KMK files (because now we have all the tools like libtool/autoconf/etc). The plugin now comes with a lot of unit tests etc. and it's a pain to translate everything to KMK.