Opened 9 years ago

Closed 8 years ago

Last modified 8 years ago

#90 closed defect (fixed)

Openfire crashes.

Reported by: yoda
Owned by:
Priority: major
Milestone: GA
Component: general
Version: 1.6.0-b22 WSE
Severity: medium
Keywords: openfire
Cc:

Description

When I try to start Openfire, it very often crashes
in the same place, but after enough attempts it starts
and works OK.
However, at random times, at the moment a client
logs in, it crashes in the same place again.

As this is a server app that is supposed to run 24/7,
this is very unfortunate.

Attachments (2)

Crash010.zip (6.2 KB) - added by yoda 9 years ago.
Crash028.zip (1.3 KB) - added by yoda 9 years ago.

Change History (24)

Changed 9 years ago by yoda

comment:1 Changed 9 years ago by dmik

  • Milestone changed from RC2 to GA

It looks like the cause of the crash is the same as the one you are experiencing in #87. I will look at it for GA as well. Also, please specify the exact download link for the Openfire version you use.

comment:2 Changed 9 years ago by dmik

  • Priority changed from critical to Feedback Pending

Please check it with RC2 and attach new logs if the problem still persists.

comment:3 Changed 9 years ago by yoda

I have installed RC2 on the server. Initial tests suggest
that it is more stable: no crashes so far.

Please give me another week or so to test it,
and I'll report back if nothing has happened.

comment:4 Changed 9 years ago by yoda

  • Priority changed from Feedback Pending to major
  • Version changed from 1.6.0-b19 RC to 1.6.0-b22 RC2

It runs much better with version RC2, and I haven't seen anything
like the former crashes, but I just had a new one.

Crash028.zip attached.

Changed 9 years ago by yoda

comment:5 Changed 8 years ago by dmik

  • Priority changed from major to Feedback Pending

This crash is at exactly the same location as in #87: java_lang_Throwable::fill_in_stack_trace(Handle throwable, TRAPS). Please also try to collect PROCDUMP data.

comment:6 Changed 8 years ago by yoda

  • Priority changed from Feedback Pending to major

Just did, but the upload limits are too small.

comment:7 Changed 8 years ago by dmik

  • Priority changed from major to Feedback Pending

Please try the latest WSE release together with the -XX:+UseMembar command line option for java.exe to see if it fixes crashes for you.

comment:9 Changed 8 years ago by yoda

  • Priority changed from Feedback Pending to critical
  • Version changed from 1.6.0-b22 RC2 to 1.6.0-b22 WSE

The WSE release crashes Openfire every time I start it.
Crash061.zip sent to your ftp.

comment:10 Changed 8 years ago by dmik

  • Priority changed from critical to major
  • Severity changed from high to medium

Thank you. Please don't change the priorities; that is not the user's task.

comment:11 Changed 8 years ago by dmik

  • Priority changed from major to Feedback Pending

The links to the new test build are available here: ticket:96#comment:14.

Please try it in both SMP mode (PSD=ACPI.PSD /SMP in config.sys) and UNI mode (just PSD=ACPI.PSD).


comment:12 Changed 8 years ago by yoda

  • Priority changed from Feedback Pending to major

I'm testing the test build from #96.
It can run again, whereas the WSE build crashed every time at startup.
It still crashes at startup sometimes, like earlier builds did.
This is running in SMP mode on a real SMP server.
The crash dumps have been sent to your ftp.

I tested an unconfigured version of Openfire on my SMP laptop.
It showed the same crashes in SMP mode.
Here I could try it in UNI mode, and it looks like it didn't
crash in this mode.

Looks like there are still issues, and SMP is (part of) the problem.

comment:13 follow-up: Changed 8 years ago by dmik

Yoda, your latest dumps you uploaded later yesterday (earlier today :) are still not what I need (it's still the second crash, not the first one). Forgot to use -XX:+UserOSErrorReporting again?

Anyway, please use one more option when running Openfire: -XX:+InterceptOSException. This will disable the JVM error report (and the second crash) completely. As an indication that you did it right, you should now not get hs_err*.log files at all when it crashes -- i.e., only one or two PDUMPs and a POPUPLOG.OS2 entry.

I'm waiting for the new dumps from you.

comment:14 in reply to: ↑ 13 Changed 8 years ago by yoda

Replying to dmik:

Yoda, your latest dumps you uploaded later yesterday (earlier today :) are still not what I need (it's still the second crash, not the first one). Forgot to use -XX:+UserOSErrorReporting again?

No, but I use -XX:+UseOSErrorReporting - I assumed you just misspelled it.
Check jdebug.log in the zip - it contains the command line used to start java...

Anyway, please use one more option when running Openfire: -XX:+InterceptOSException.

When I try this, I get:

java -client -XX:+UseOSErrorReporting -XX:+InterceptOSException -jar H:\OpenFireTest\lib\startup.jar
Unrecognized VM option '+InterceptOSException'
Could not create the Java virtual machine.

Not sure what the right option should be.

comment:15 Changed 8 years ago by dmik

Indeed, InterceptOSException is disabled in production builds.

WRT UseOSErrorReporting: yes, I misspelled it above, sorry. Anyway, I've just checked that it only works as expected if the JVM error reporter doesn't crash itself...

So, I created a test version of the client JVM.DLL for you where the reporter is disabled: ftp://ftp.dmik.org/tmp/j/jvm_no_report.zip. Using this DLL, you will surely get the PDUMP of the first crash and this is what I need.

comment:16 Changed 8 years ago by dmik

With the test version of JVM.DLL we could get the right dump. The call stack of the crashed thread looks like this:

odincrt.dll:_ufree
odincrt.dll:free
odincrt.dll:odin_free
... (there may be some more frames here)
wsock32.dll:WSAAsyncSelectWorker
wsock32.dll:WSAEventSelect

This seems to happen when the underlying application attempts to open a new connection.

comment:17 Changed 8 years ago by dmik

Judging from the parameters to WSAEventSelect() on stack, it's clear that this one is the call made from java_sun_nio_ch_IOUtil_configureBlocking().

comment:18 Changed 8 years ago by dmik

Probably, it's the classic problem of free() being called twice for the same memory block (presumably, ASYNCTHREADPARM or VSemaphore). I can't name the exact object since the DUMP is somehow incomplete (it lacks the disassembly for all but the last procedure, and the stack frame may be incomplete). If you happen to get a more detailed PDUMP with the debug version, it should help.

Anyway, I certainly see at least two problems with the current WSOCK32 code:

  1. RemoveFromQueue() unnecessarily clears the ASYNCTHREADPARM structure after removing it from the list. This makes it impossible to delete the VSemaphore object afterwards (since the pointer to it is cleared inside RemoveFromQueue()), which in turn leaks memory (luckily, VSemaphore is just 8 bytes long, so the leak is too slow to be clearly noticed at runtime).
  2. The WSAEventSelect(s, hEventObject, 0) call, which is intended to just de-associate the socket from the event object (and should do nothing if there is no association), still creates an auxiliary thread as if it were an association call (where the last parameter is not zero). This thread exits just a moment after its creation (because the check for zero is the first thing it does), but still, there is no need for it to be started at all.
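
To illustrate the first item in the list above, here is a minimal sketch in portable C. It is not the Odin source: AsyncParm, Semaphore, remove_and_clear(), remove_only() and leaked_after() are hypothetical stand-ins, and the only point is to show how zeroing the structure while unlinking it hides the semaphore pointer from the later cleanup code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for VSemaphore and ASYNCTHREADPARM. */
typedef struct Semaphore { int count; int waiters; } Semaphore;

typedef struct AsyncParm {
    struct AsyncParm *next;
    Semaphore        *sem;
} AsyncParm;

static AsyncParm *queue_head = NULL;
static int live_semaphores = 0;   /* semaphores allocated but not yet freed */

/* Allocate a parameter block with its semaphore and put it on the queue. */
static AsyncParm *parm_create(void) {
    AsyncParm *p = calloc(1, sizeof *p);
    if (!p) return NULL;
    p->sem = calloc(1, sizeof *p->sem);
    if (!p->sem) { free(p); return NULL; }
    live_semaphores++;
    p->next = queue_head;
    queue_head = p;
    return p;
}

/* Unlink p from the queue without touching its other fields. */
static void unlink_parm(AsyncParm *p) {
    AsyncParm **pp = &queue_head;
    while (*pp && *pp != p) pp = &(*pp)->next;
    if (*pp) *pp = p->next;
}

/* Problematic variant: unlinking also wipes the structure, so p->sem is lost. */
static void remove_and_clear(AsyncParm *p) {
    unlink_parm(p);
    memset(p, 0, sizeof *p);
}

/* Alternative variant: unlinking leaves p->sem intact for the cleanup code. */
static void remove_only(AsyncParm *p) {
    unlink_parm(p);
    p->next = NULL;
}

/* Cleanup can only free the semaphore if the pointer is still there. */
static void parm_destroy(AsyncParm *p) {
    if (p->sem) { free(p->sem); live_semaphores--; }
    free(p);
}

/* Returns how many semaphores are still allocated after one create/remove/destroy. */
int leaked_after(int use_clearing_remove) {
    live_semaphores = 0;
    AsyncParm *p = parm_create();
    if (!p) return -1;
    if (use_clearing_remove) remove_and_clear(p); else remove_only(p);
    parm_destroy(p);
    return live_semaphores;
}
```

In this sketch leaked_after(1) leaves one semaphore allocated per call, while leaked_after(0) frees everything.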

I fixed both problems in the test build of wsock32.dll (on my ftp, as usual) so please test if it fixes the original issue.
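
For the second item, a rough sketch of the idea (again not the actual WSOCK32 code) is to handle the zero-events case before any worker thread is created. Everything here is an assumption made for illustration: the fixed-size table, event_select(), start_worker(), is_associated() and worker_count() are hypothetical names, and the "thread" is only a counter.

```c
#include <stddef.h>

#define MAX_ASSOC 16

/* Hypothetical record of a socket/event association. */
typedef struct {
    int   in_use;
    int   socket;
    void *event;
    long  events;
} Assoc;

static Assoc table[MAX_ASSOC];
static int threads_started = 0;

static Assoc *find_assoc(int s) {
    for (size_t i = 0; i < MAX_ASSOC; i++)
        if (table[i].in_use && table[i].socket == s) return &table[i];
    return NULL;
}

/* Stand-in for creating the auxiliary thread; it only counts the calls. */
static void start_worker(Assoc *a) {
    (void)a;
    threads_started++;
}

/* events == 0 means "cancel the association"; no worker is started for it. */
int event_select(int s, void *event, long events) {
    Assoc *a = find_assoc(s);

    if (events == 0) {
        if (a) a->in_use = 0;   /* drop the association if there is one */
        return 0;               /* nothing else to do, and no thread needed */
    }

    if (!a) {
        for (size_t i = 0; i < MAX_ASSOC && !a; i++)
            if (!table[i].in_use) a = &table[i];
        if (!a) return -1;      /* table full */
    }
    a->in_use = 1;
    a->socket = s;
    a->event  = event;
    a->events = events;
    start_worker(a);
    return 0;
}

int is_associated(int s) { return find_assoc(s) != NULL; }
int worker_count(void)   { return threads_started; }
```

With this shape, a cancellation call on a socket that has no association returns immediately, and one that has an association only removes it.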

Since the crashing code path is definitely the one that involves the cancellation call mentioned in 2. above, one of my guesses regarding the reason for the original crash is that _beginthread() from odincrt (which is actually the renamed VAC runtime) returns a failure on SMP in cases where the thread execution lasts for a very short amount of time (though it is assumed that this method always succeeds if the thread was started successfully). This would cause the mentioned double free() call: one from the block of code handling the thread creation failure, and the other one from the thread that has actually been started. This is a blind guess though, since the DUMP is not complete and since I don't have the sources of the VAC runtime to check how _beginthread() is implemented.

comment:19 Changed 8 years ago by dmik

It actually seems that _beginthread() (whose return value is a TID) happens to return zero and this is the reason for the double deletion: zero is interpreted as an error by WSAAsyncSelectWorker() and in response it deletes the ASYNCTHREADPARM structure it created. However, the thread was actually started, completed the task immediately (due to the reason described in 2. above) and also attempted to delete the same ASYNCTHREADPARM structure (which is considered to be owned by the started thread if starting succeeds).

The question is whether zero is indeed a valid TID in OS/2, or whether it's just a failure of _beginthread(), which starts the thread but returns an incorrect TID. When I get more logs from Yoda, I hope I will be able to tell for sure which one is the case.
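
The ownership problem described above can be modelled deterministically without real threads. The following is only a sketch under stated assumptions: fake_begin_thread() runs the worker to completion before returning (imitating a thread that finishes immediately), release_parm() counts deletions instead of calling free(), and START_FAILED is a hypothetical failure sentinel chosen for this example. It is not taken from the Odin or VAC runtime sources.

```c
#include <stddef.h>

#define START_FAILED (-1)   /* hypothetical failure value used by this sketch */

/* Stand-in for ASYNCTHREADPARM; it only counts how often it was "deleted". */
typedef struct { int releases; } Parm;

static void release_parm(Parm *p) { p->releases++; }   /* models free() */

/* The worker owns the parameter block once started and deletes it when done. */
static void worker(Parm *p) { release_parm(p); }

/*
 * Fake thread starter: unless asked to fail, it runs the worker to completion
 * before returning and then reports whatever TID the caller asked for.
 */
static int fake_begin_thread(void (*fn)(Parm *), Parm *p, int reported_tid) {
    if (reported_tid == START_FAILED) return START_FAILED;  /* worker never ran */
    fn(p);
    return reported_tid;
}

/* Creator that treats a zero TID as a start failure, as described above. */
int releases_zero_is_error(int reported_tid) {
    Parm p = { 0 };
    int tid = fake_begin_thread(worker, &p, reported_tid);
    if (tid == 0) release_parm(&p);   /* deletes a block the worker already deleted */
    return p.releases;
}

/* Creator that only reacts to the dedicated failure sentinel. */
int releases_sentinel_is_error(int reported_tid) {
    Parm p = { 0 };
    int tid = fake_begin_thread(worker, &p, reported_tid);
    if (tid == START_FAILED) release_parm(&p);   /* worker never took ownership */
    return p.releases;
}
```

In this model a reported TID of zero makes the first creator delete the block twice, while the second creator deletes it exactly once in every case. Whether a sentinel check like this would be the right fix in the real code depends on the open question of what _beginthread() actually returns.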

So far, I fixed both problems 1. and 2. in revisions 21656 and 21657 in Odin. They were not the ones leading to the crash, but they created conditions for the zero TID issue to appear.

The need for the zero TID fix is to be defined after clarifying the question above.

comment:20 Changed 8 years ago by dmik

BTW, from CPREF it indirectly follows that zero is not a valid TID: some APIs (like DosSetPriority()) assume that TID=0 means the current thread. So it must be either a VAC runtime or OS/2 SMP kernel bug.

comment:21 follow-up: Changed 8 years ago by dmik

  • Resolution set to fixed
  • Status changed from new to closed

Since several bugs have already been fixed within this defect, that is enough for this ticket. Its name is too general. To prevent this one from growing infinitely, new tickets should be created for new issues.

Yoda, thanks for testing and remember that I'm still waiting for logs from you from the last test debug build. You may inform me here or in Jabber.

comment:22 in reply to: ↑ 21 Changed 8 years ago by yoda

Replying to dmik:

Yoda, thanks for testing and remember that I'm still waiting for logs from you from the last test debug build. You may inform me here or in Jabber.

I think I was finally able to recreate that crash with the debug build.
Look at your ftp - huge zip file :-)
