Opened 13 years ago

Closed 13 years ago

Last modified 13 years ago

#90 closed defect (fixed)

Openfire crashes.

Reported by: Yoda_Java6
Owned by:
Priority: major
Milestone: GA
Component: general
Version: 1.6.0-b22 WSE
Severity: medium
Keywords: openfire
Cc:

Description

When I try to start Openfire, it very often crashes
in the same place, but if I try enough times, it starts
and works OK.
However, at random times, at the moment a client
logs in, it crashes in the same place again.

As this is a server app that is supposed to run 24/7,
this is very unfortunate.

Attachments (2)

Crash010.zip (6.2 KB) - added by Yoda_Java6 13 years ago.
Crash028.zip (1.3 KB) - added by Yoda_Java6 13 years ago.

Change History (24)

Changed 13 years ago by Yoda_Java6

Attachment: Crash010.zip added

comment:1 Changed 13 years ago by dmik

Milestone: RC2 → GA

It looks like the cause of the crash is the same one you are experiencing in #87. I will look at it for GA as well. Also, please post the exact download link for the Openfire version you use.

comment:2 Changed 13 years ago by dmik

Priority: critical → Feedback Pending

Please check it with RC2 and attach new logs if the problem still persists.

comment:3 Changed 13 years ago by Yoda_Java6

I have installed RC2 on the server. Initial tests suggest
that it is more stable: no crashes so far.

Please give me another week or so to test it,
and I'll report back even if nothing has happened.

comment:4 Changed 13 years ago by Yoda_Java6

Priority: Feedback Pending → major
Version: 1.6.0-b19 RC → 1.6.0-b22 RC2

It runs much better with version RC2, and I haven't seen anything
equal to the former crashes, but just had a new one.

Crash28 attached.

Changed 13 years ago by Yoda_Java6

Attachment: Crash028.zip added

comment:5 Changed 13 years ago by dmik

Priority: major → Feedback Pending

This crash is in exactly the same location as in #87: java_lang_Throwable::fill_in_stack_trace(Handle throwable, TRAPS). Please also try to collect PROCDUMP data.

comment:6 Changed 13 years ago by Yoda_Java6

Priority: Feedback Pending → major

Just did, but the upload limits are too small.

comment:7 Changed 13 years ago by dmik

Priority: major → Feedback Pending

Please try the latest WSE release together with the -XX:+UseMembar command line option for java.exe to see if it fixes crashes for you.

comment:9 Changed 13 years ago by Yoda_Java6

Priority: Feedback Pending → critical
Version: 1.6.0-b22 RC2 → 1.6.0-b22 WSE

The WSE release crashes Openfire every time I start it.
Crash061.zip sent to your ftp.

comment:10 Changed 13 years ago by dmik

Priority: critical → major
Severity: high → medium

Thank you. Please don't change the priorities; that is not the user's task.

comment:11 Changed 13 years ago by dmik

Priority: major → Feedback Pending

The links to the new test build are available here: ticket:96#comment:14.

Please try it in both SMP mode (PSD=ACPI.PSD /SMP in config.sys) and UNI mode (just PSD=ACPI.PSD).


comment:12 Changed 13 years ago by Yoda_Java6

Priority: Feedback Pending → major

I'm testing the test build from #96.
It runs again, whereas the WSE build crashed every time at startup.
It still crashes at startup sometimes, like earlier builds did.
This is running in SMP mode on a real SMP server.
The crash dumps have been sent to your ftp.

I tested an unconfigured version of OpenFire on my SMP laptop.
It showed the same crashes in SMP mode.
There I could also try it in UNI mode, and it looks like it didn't
crash in this mode.

Looks like there are still issues, and SMP is (part of) the problem.

comment:13 Changed 13 years ago by dmik

Yoda, your latest dumps you uploaded later yesterday (earlier today :) are still not what I need (it's still the second crash, not the first one). Forgot to use -XX:+UserOSErrorReporting again?

Anyway, please use one more option when running Openfire: -XX:+InterceptOSException. This will disable the JVM error report (and the second crash) completely. As an indication that you did it right, you should now not get hs_err*.log files at all when it crashes -- i.e., only one or two PDUMPs and a POPUPLOG.OS2 entry.

I'm waiting for the new dumps from you.

comment:14 in reply to:  13 Changed 13 years ago by Yoda_Java6

Replying to dmik:

> Yoda, your latest dumps you uploaded later yesterday (earlier today :) are still not what I need (it's still the second crash, not the first one). Forgot to use -XX:+UserOSErrorReporting again?

No, but I use -XX:UseOSErrorReporting - I assumed you just misspelled it.
Check jdebug.log in zip - it contains cmd line for start of java...

> Anyway, please use one more option when running Openfire: -XX:+InterceptOSException.

When I try this, I get:

java -client -XX:+UseOSErrorReporting -XX:+InterceptOSException -jar H:\OpenFireTest\lib\startup.jar
Unrecognized VM option '+InterceptOSException'
Could not create the Java virtual machine.

Not sure what the right option should be, then.

comment:15 Changed 13 years ago by dmik

Indeed, InterceptOSException is disabled in production builds.

WRT UseOSErrorReporting. Yes I misspelled it above, sorry. Anyway, I've just checked that it only works as expected if the JVM error reporter doesn't crash itself...

So, I created a test version of the client JVM.DLL for you where the reporter is disabled: ftp://ftp.dmik.org/tmp/j/jvm_no_report.zip. Using this DLL, you will surely get the PDUMP of the first crash, and this is what I need.

comment:16 Changed 13 years ago by dmik

With the test version of JVM.DLL we could get the right dump. The call stack of the crashed thread looks like this:

odincrt.dll:_ufree
odincrt.dll:free
odincrt.dll:odin_free
... (there may be some more frames here)
wsock32.dll:WSAAsyncSelectWorker
wsock32.dll:WSAEventSelect

This seems to happen when the underlying application attempts to open a new connection.

comment:17 Changed 13 years ago by dmik

Judging from the parameters to WSAEventSelect() on stack, it's clear that this one is the call made from java_sun_nio_ch_IOUtil_configureBlocking().

comment:18 Changed 13 years ago by dmik

Probably, it's the classic problem of free() being called twice for some memory block (presumably, ASYNCTHREADPARM or VSemaphore). I can't name the exact object since the DUMP is somewhat incomplete (it lacks the disassembly for all but the last procedure, and the stack frame may not be complete). If you happen to get a more detailed PDUMP with the debug version, it should help.

Anyway, I certainly see at least two problems with the current WSOCK32 code:

  1. RemoveFromQueue() unnecessarily clears the ASYNCTHREADPARM structure after removing it from the list. This makes it impossible to delete the VSemaphore object afterwards (since the pointer to it is cleared inside RemoveFromQueue()), which in turn leaks memory (luckily, VSemaphore is just 8 bytes long, so it doesn't leak fast enough to be clearly noticed at runtime).
  2. The WSAEventSelect(s, hEventObject, 0) call, which is intended to just de-associate the socket from the event object (and should do nothing if there is no association), still creates an auxiliary thread as if it were an association call (one where the last parameter is not zero). This thread exits just a moment after its creation (because the check for zero is the first thing it does), but there is no need for it to be started at all.
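
To illustrate the two points above, here is a minimal C sketch. It is not the Odin WSOCK32 code: `vsem`, `async_parm`, `enqueue()`, `remove_from_queue()`, `destroy_parm()` and `select_events()` are hypothetical stand-ins, and starting the worker thread is reduced to a return value. It only shows the intended shape: unlinking leaves the entry intact so that its semaphore can still be freed, and a zero event mask returns before any worker thread would be started.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the real VSemaphore and ASYNCTHREADPARM differ. */
typedef struct vsem { int posted; } vsem;

typedef struct async_parm {
    struct async_parm *next;
    vsem *sem;          /* owned by this entry, freed in destroy_parm() */
    long events;        /* requested network event mask */
} async_parm;

async_parm *queue_head = NULL;
int sems_freed = 0;     /* bookkeeping for this example only */

/* Allocate an entry plus its semaphore and link it into the queue. */
async_parm *enqueue(long events) {
    async_parm *p = calloc(1, sizeof *p);
    assert(p != NULL);
    p->sem = calloc(1, sizeof *p->sem);
    assert(p->sem != NULL);
    p->events = events;
    p->next = queue_head;
    queue_head = p;
    return p;
}

/* Point 1: only unlink; do not zero the entry, so p->sem stays reachable. */
void remove_from_queue(async_parm *p) {
    async_parm **pp = &queue_head;
    while (*pp != NULL && *pp != p)
        pp = &(*pp)->next;
    if (*pp == p)
        *pp = p->next;
    p->next = NULL;
}

/* Unlink, then release the semaphore and the entry itself. */
void destroy_parm(async_parm *p) {
    remove_from_queue(p);
    free(p->sem);
    sems_freed++;
    free(p);
}

/* Point 2: a zero mask only cancels an existing association (if any).
 * Returns 1 if a worker thread would have to be started, 0 otherwise. */
int select_events(async_parm *assoc, long events) {
    if (events == 0) {
        if (assoc != NULL)
            destroy_parm(assoc);
        return 0;
    }
    return 1;
}
```

With this shape, `select_events(NULL, 0)` is a no-op, and `select_events(p, 0)` frees both the entry and its semaphore without ever reaching the thread-start path.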

I fixed both problems in the test build of wsock32.dll (on my ftp, as usual) so please test if it fixes the original issue.

Since the crashing code path is definitely the one that involves the cancellation call mentioned in 2. above, one of my guesses regarding the cause of the original crash is that _beginthread() from odincrt (which is actually the renamed VAC runtime) returns a failure on SMP when the thread runs for only a very short time (although the code assumes that this function always succeeds if the thread was actually started). This would cause the double free() call mentioned above: one from the block of code handling the thread creation failure, and the other one from the thread that has actually been started. This is a blind guess though, since the DUMP is not complete and since I don't have the sources of the VAC runtime to check how _beginthread() is implemented.

comment:19 Changed 13 years ago by dmik

It actually seems that _beginthread() (whose return value is a TID) happens to return zero, and this is the reason for the double deletion: zero is interpreted as an error by WSAAsyncSelectWorker(), which in response deletes the ASYNCTHREADPARM structure it created. However, the thread was actually started, completed its task immediately (for the reason described in 2. above) and also attempted to delete the same ASYNCTHREADPARM structure (which is considered to be owned by the started thread if starting succeeds).

The question is whether zero is indeed a valid TID in OS/2, or whether _beginthread() starts the thread but returns an incorrect TID. When I get more logs from Yoda, I hope I will be able to tell for sure which one is the case.

So far, I fixed both problems 1. and 2. in revisions 21656 and 21657 in Odin. They were not the ones leading to the crash, but they created conditions for the zero TID issue to appear.

The need for the zero TID fix is to be defined after clarifying the question above.

comment:20 Changed 13 years ago by dmik

BTW, it follows indirectly from CPREF that zero is not a valid TID: some APIs (like DosSetPriority()) assume that TID=0 means the current thread. So it must be either a VAC runtime bug or an OS/2 SMP kernel bug.

comment:21 Changed 13 years ago by dmik

Resolution: fixed
Status: new → closed

Since several bugs have already been fixed under this ticket, I'm closing it; its title is too general. To prevent it from growing indefinitely, new tickets should be created for new issues.

Yoda, thanks for testing, and remember that I'm still waiting for logs from you from the last test debug build. You may inform me here or in Jabber.

comment:22 in reply to:  21 Changed 13 years ago by Yoda_Java6

Replying to dmik:

> Yoda, thanks for testing, and remember that I'm still waiting for logs from you from the last test debug build. You may inform me here or in Jabber.

I think I was finally able to recreate that crash with the debug build.
Look at your ftp: there is a huge zip file :-)
