Opened 13 years ago

Closed 11 years ago

#133 closed defect (wontfix)

Openfire crashes when a client connects.

Reported by: Yoda_Java6 Owned by:
Priority: critical Milestone:
Component: general Version: 1.6.0 Build 27 GA4
Severity: high Keywords:
Cc: Yoda

Description

Sometimes Openfire crashes when a client connects.
Logs with dumps sent to your ftp.

Attachments (7)

Crash353.zip (2.1 KB) - added by Yoda_Java6 12 years ago.
Crash354.zip (1.8 KB) - added by Yoda_Java6 12 years ago.
Crash355.zip (2.1 KB) - added by Yoda_Java6 12 years ago.
Crash386.zip (1.8 KB) - added by Yoda_Java6 12 years ago.
Crash387.zip (2.2 KB) - added by Yoda_Java6 12 years ago.
pstat.zip (13.1 KB) - added by Yoda_Java6 12 years ago.
Crash435.zip (2.0 KB) - added by Yoda_Java6 12 years ago.

Change History (68)

comment:1 Changed 13 years ago by Yoda_Java6

test

Changed 12 years ago by Yoda_Java6

Attachment: Crash353.zip added

Changed 12 years ago by Yoda_Java6

Attachment: Crash354.zip added

Changed 12 years ago by Yoda_Java6

Attachment: Crash355.zip added

comment:2 Changed 12 years ago by Yoda_Java6

Priority: changed from major to blocker
Severity: changed from medium to highest

Tried with the new Odin 0.8.1.
Now it crashes every time a client connects.
3 crashes attached; Pdumps available on request.

comment:3 Changed 12 years ago by Silvan Scherrer

Priority: changed from blocker to major
Severity: changed from highest to high

Did you try with the latest 0.8.3? And I hope you have a clean libc environment, which means only libc064 installed and not a mix of libc064 and libc063.

comment:4 Changed 12 years ago by Yoda_Java6

I just updated to Odin 0.8.3 (from 0.6xxx, which had suddenly been stable for a month).
The environment is clean; I updated with the latest GCC and libc WPI packs.
After that, it is impossible to connect to the Openfire server: it crashes on every
connect from a client.

Two different crashes. Basically, I still see the same with the Jeti/2 client on another PC
(the Odin ticket you closed even though it was never fixed).

Uploading info; Pdumps available on request.

Changed 12 years ago by Yoda_Java6

Attachment: Crash386.zip added

Changed 12 years ago by Yoda_Java6

Attachment: Crash387.zip added

comment:5 Changed 12 years ago by Silvan Scherrer

Also be sure to update libc064 with the DLL from the zip mentioned in http://svn.netlabs.org/libc/ticket/255
and make sure you don't have any libc063.dll around other than the one from the libc064 package.

Last edited 12 years ago by Silvan Scherrer

comment:6 Changed 12 years ago by Yoda_Java6

Installed the new DLL; it makes no difference. It still crashes immediately
when a client connects.

LIBPATH checked for dupes; no libc* dupes found.

comment:7 Changed 12 years ago by Silvan Scherrer

I'm not talking about dupes here; I'm talking about a mixed libc. If you have a libc063.dll bigger than 153.4 KB, you have a mixed libc installation.
It's just to make sure we aren't chasing a phantom.

comment:8 Changed 12 years ago by Yoda_Java6

No, all libc0xx DLLs are from the WPI package, except the later test version of libc064.dll.

comment:9 Changed 12 years ago by Yoda_Java6

Besides, one of the crashes existed in the older VAC builds of Odin; the crashes were just more random then. See ticket #146 for another report of the same problem.

comment:10 Changed 12 years ago by dmik

I have no idea what is wrong on your side. I've just installed a fresh copy of OpenFire 3.7.1, set it up with the browser, created two users and can now flawlessly have them chatting with each other through my OpenFire install on OS/2 (both clients are Mac iChat). So it must be a problem in your env (DLL mix-up).

Please try a clean install yourself and then try to do the following:

  1. Start OpenFire but don't let it crash (i.e. don't connect to it if it crashes at connect).
  2. Do pstat /a >log.
  3. Send me the resulting log file.

comment:11 Changed 12 years ago by dmik

What I see though is that the Java process running OpenFire hangs at exitlist if I Ctrl-C it. I will look at this issue.

comment:12 Changed 12 years ago by dmik

I tried Ctrl-C several times with the new OpenFire version and can't make it hang any more...

What I can see in Odin, though, since we switched it to GCC is that pressing Ctrl-C in debug builds may lead to the infamous "LIBC panic" message. My quick tests showed that this happens if the thread which is currently writing to the log file is terminated (by the Ctrl-C handler) while holding the fmutex used to serialize file access in LIBC, and this is why it complains so loudly. In some cases this could lead to the exit thread hanging. The next time I get this, I will try to fix it in LIBC. Clearly, it should not panic in such cases.

This problem may be the reason for the OpenFire hang as well. I will test some more.

comment:13 Changed 12 years ago by dmik

OK, I can reproduce the hang with the debug version of Odin/Java (if I try to break it before it fully starts). Got some logs, will investigate further.

comment:14 Changed 12 years ago by dmik

I created a separate defect #159 for the Ctrl-C issue as it has nothing to do with the original topic.

comment:15 in reply to: 11 Changed 12 years ago by Yoda_Java6

Replying to dmik:

> What I see though is that the Java process running OpenFire hangs at exitlist if I Ctrl-C
> it. I will look at this issue.

I have seen that many, many times before too, but at that time the same happened with a
lot of other Java apps as well.

Lately I haven't stopped it at all, so I don't know the current state.

comment:16 in reply to: 10 Changed 12 years ago by Yoda_Java6

Replying to dmik:

> I have no idea what is wrong on your side. I've just installed a fresh copy of OpenFire
> 3.7.1, set it up with the browser, created two users and can now flawlessly have them
> chatting with each other through my OpenFire install on OS/2 (both clients are Mac iChat).

I'm still using 3.6.4 for many reasons; one of them is not to change the environment during
testing/searching for these bugs.

> So it must be a problem in your env (DLL mix-up).

That has already been checked and rechecked. I even loaded Theseus and checked
the paths of all loaded DLLs. Nothing wrong there.

> Please try a clean install yourself and then try to do the following:

I can try to install the latest version in another path and see if I can export/import
my settings to the new version. That will take a while, though.

comment:17 in reply to: 10 Changed 12 years ago by Yoda_Java6

Replying to dmik:

>   1. Start OpenFire but don't let it crash (i.e. don't connect to it if it crashes at connect).
>   2. Do pstat /a >log.
>   3. Send me the resulting log file.

Pstat.zip attached.

Changed 12 years ago by Yoda_Java6

Attachment: pstat.zip added

comment:18 Changed 12 years ago by dmik

I don't see anything suspicious in the logs. OpenFire still works like a charm here.

comment:19 Changed 12 years ago by Yoda_Java6

Well, I managed to update it to 3.7.1.
Now it does not crash on every connect anymore.
However, it still happens now and then, and still
when a client connects; it is now more like it was
with the VAC builds.

comment:20 Changed 12 years ago by dmik

Priority: changed from major to Feedback Pending
Severity: changed from high to medium

Please check the new builds of Odin 0.8.4 and OpenJDK b24 GA2.

I can only repeat that the problem doesn't show up here.

comment:21 Changed 12 years ago by Yoda_Java6

Priority: changed from Feedback Pending to critical
Severity: changed from medium to high

Java and Odin updated.
The problem is the same: as soon as a client connects,
OF crashes. The crash seems identical to the crash in Jeti/2.
I again had to switch back to the VAC-based Odin to make it run.

Crash attached; PDUMP available on request.

Changed 12 years ago by Yoda_Java6

Attachment: Crash435.zip added

comment:22 Changed 12 years ago by Silvan Scherrer

Milestone: NextGA4

comment:23 Changed 12 years ago by Yoda_Java6

With the latest test builds (jd2/hd2/od2/iphlpapi), OpenFire 3.7.1 always
crashes at startup:
ftp://ftp.warpspeed.dk/pdumps/Crash503.zip

comment:24 Changed 12 years ago by dmik

What if you try it with JD3/HD3 (and with updated IPHLPAPI.DLL + OD2)? Still the same crash?

comment:25 Changed 12 years ago by Yoda_Java6

Same crash with jd3/hd3/od2/ipi
ftp://ftp.warpspeed.dk/pdumps/Crash504.zip

comment:26 Changed 12 years ago by dmik

Version: changed from 1.6.0-b22 GA1 to 1.6.0-b24 GA2

These problems need to be analyzed further. Moving it to GA4.

Last edited 12 years ago by dmik

comment:27 Changed 12 years ago by Yoda_Java6

Version: changed from 1.6.0-b24 GA2 to 1.6.0 Build 25 GA3

I tested the new Java and Odin.
After a few bad starts (crashes) it managed to run, and then worked for 2 days.
I then had to shut it down using Ctrl-C (remotely).
That took the whole server down :-(

comment:28 Changed 12 years ago by dmik

Please try this DLL ftp://ftp.netlabs.org/pub/odin/test/j4.zip (replace the one in the /bin/client dir) and report if it works now.

comment:29 Changed 12 years ago by Yoda_Java6

A single test was OK. I'll do some further tests later.

comment:30 Changed 12 years ago by dmik

Resolution: set to fixed
Status: changed from new to closed

Closing. Feel free to create a new ticket if needed later.

comment:31 Changed 12 years ago by Yoda_Java6

Resolution: fixed deleted
Status: changed from closed to reopened

As I feared, it needed more testing to be sure.
It just crashed again when a client connected, just as
originally reported.

It again shows a guard page problem!

ftp://ftp.warpspeed.dk/pdumps/Crash522.zip

comment:32 Changed 12 years ago by dmik

Unfortunately, the ZIP is bad. Please attach another one.

From what I see now, the crash is in java_lang_Throwable::fill_in_stack_trace_of_preallocated_backtrace(). I have no idea what it does so far. It indeed looks like a very old crash which I'm unable to reproduce locally.

comment:33 Changed 12 years ago by Yoda_Java6

There was nothing wrong with the ZIP, but sometimes the FTP server (or IP stack) corrupts access.
Server rebooted.

The crash is not something that just happens when Openfire loads,
but rather when the server has been idle for some time and a client connects (again).

Retested the FTP download of the ZIP; it should be OK now.

comment:34 Changed 12 years ago by dmik

Judging from the code paths that lead to the call to fill_in_stack_trace_of_preallocated_backtrace(), this must be the "out of memory" condition. Analyzing the dumps shows that it crashes when attempting to create a Java exception object for "out of memory" (which is a bug, of course, that needs to be fixed). Given that I have plenty of RAM, that would explain why I can't easily reproduce the problem.

I will try to create a test case that exhausts Java heap somehow to see if it helps reproduce this crash.
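A minimal sketch of such a test case, under the assumption that the goal is only to drive the JVM into its out-of-memory error path over and over. `OomProbe` and `triggerOom` are hypothetical names. Asking for a `long[]` of `Integer.MAX_VALUE` elements makes HotSpot throw `OutOfMemoryError` ("Requested array size exceeds VM limit") before any heap is touched, so it behaves the same on a machine with plenty of RAM; whether this OS/2 build routes that error through `fill_in_stack_trace_of_preallocated_backtrace()` is an assumption, not something established in this ticket. It avoids lambdas so it compiles on Java 6.

```java
public class OomProbe {

    // Hypothetical probe: forces OutOfMemoryError `rounds` times and counts
    // how many were caught. HotSpot rejects an array longer than its internal
    // limit up front, so no real heap is consumed by this request.
    public static int triggerOom(int rounds) {
        int caught = 0;
        for (int i = 0; i < rounds; i++) {
            try {
                long[] huge = new long[Integer.MAX_VALUE];
                // Only reached if the VM somehow satisfies the request.
                System.out.println("Unexpectedly allocated " + huge.length + " longs");
            } catch (OutOfMemoryError e) {
                caught++;
                // getStackTrace() returns whatever backtrace the VM recorded.
                // HotSpot keeps only a small pool of preallocated errors with
                // room for a backtrace, so later rounds may report depth 0.
                StackTraceElement[] trace = e.getStackTrace();
                System.out.println("OOM #" + caught + ": " + e.getMessage()
                        + " (stack depth " + trace.length + ")");
            }
        }
        return caught;
    }

    public static void main(String[] args) {
        int rounds = args.length > 0 ? Integer.parseInt(args[0]) : 10;
        System.out.println("Caught " + triggerOom(rounds) + " of " + rounds);
    }
}
```

Run it as e.g. `java OomProbe 20`; a healthy VM prints one line per caught error and never crashes.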

comment:35 Changed 12 years ago by Yoda_Java6

I doubt that it is a _real_ out of memory condition.
The server has 1 GB RAM and generally 500 MB free.
OF reports that 10 MB of 247 MB is used.

OTOH, I still sometimes see this with the crashes:
OpenJDK Client VM warning: Attempt to protect stack guard pages failed.

comment:36 Changed 12 years ago by dmik

Can't reproduce the condition so far. Will try to overload the PC with other apps.

Regarding your last comment, Java uses a lot of shared memory (which is not the same as the total amount of memory you have), so that may be the issue.

comment:37 Changed 12 years ago by Yoda_Java6

FYI:
E:\Utils>qsysinfo.exe

QSV_MAX = 31
QSV_MAX_PATH_LENGTH = 260
QSV_MAX_TEXT_SESSIONS = 16
QSV_MAX_PM_SESSIONS = 16
QSV_MAX_VDM_SESSIONS = 1025
QSV_BOOT_DRIVE = 3
QSV_DYN_PRI_VARIATION = 1
QSV_MAX_WAIT = 1
QSV_MIN_SLICE = 32
QSV_MAX_SLICE = 32
QSV_PAGE_SIZE = 4096
QSV_VERSION_MAJOR = 20
QSV_VERSION_MINOR = 45
QSV_VERSION_REVISION = 0
QSV_MS_COUNT = 1245409108
QSV_TIME_LOW = 1345462467
QSV_TIME_HIGH = 0
QSV_TOTPHYSMEM = 1073704960
QSV_TOTRESMEM = 123203584
QSV_TOTAVAILMEM = 1339817984
QSV_MAXPRMEM = 317390848
QSV_MAXSHMEM = 143261696
QSV_TIMER_INTERVAL = 310
QSV_MAX_COMP_LENGTH = 255
QSV_FOREGROUND_FS_SESSION = 17
QSV_FOREGROUND_PROCESS = 18777
QSV_NUM_PROCESSORS = 2
QSV_MAXHPRMEM = 469762048
QSV_MAXHSHMEM = 102703104
QSV_MAXPROCESSES = 1025
QSV_VIRTUALADDRESSLIMIT = 1024
QSV_INT10ENABLED = 1

comment:38 Changed 12 years ago by Yoda_Java6

The crash happens less often after the latest hotfix, but it still happens
randomly at connect time. OF is not exactly overloaded here, as I am
currently the only user. My client does, however, also use MUCs (local and
remote) and gateways (ICQ and IRC).

comment:39 Changed 12 years ago by dmik

Still, this smells like a memory management issue. However, I'm unable to reproduce the problem locally even after loading lots of apps and exhausting the Java heap / stack.
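For the stack side of that experiment, one plausible sketch (not the procedure actually used here) starts several threads and lets each recurse until `StackOverflowError`. The assumption is that every new Java thread gets its own guard pages and that each overflow runs into them, which is the area behind the "Attempt to protect stack guard pages failed" warning; nothing confirms it reproduces this particular crash. `StackProbe`, `overflowDepth` and `run` are hypothetical names, and the code sticks to Java 6 syntax.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class StackProbe {

    // Recurses until the thread hits its stack guard pages and the JVM
    // throws StackOverflowError; depth[0] counts the frames entered.
    private static void recurse(int[] depth) {
        depth[0]++;
        recurse(depth);
    }

    // Overflows the current thread's stack once and returns the depth reached.
    static int overflowDepth() {
        int[] depth = {0};
        try {
            recurse(depth);
        } catch (StackOverflowError expected) {
            // Expected: the guard pages stopped the runaway recursion.
        }
        return depth[0];
    }

    // Starts `threads` threads (each with its own stack and guard pages),
    // lets every one overflow, and returns how many recovered and reported.
    public static int run(int threads) {
        final AtomicInteger recovered = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    if (overflowDepth() > 0) {
                        recovered.incrementAndGet();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                break;
            }
        }
        return recovered.get();
    }

    public static void main(String[] args) {
        int threads = args.length > 0 ? Integer.parseInt(args[0]) : 50;
        System.out.println(run(threads) + " of " + threads
                + " threads recovered from StackOverflowError");
    }
}
```

On a healthy VM every thread recovers; a guard page problem would be expected to show up as the warning above or as a crash instead of a clean count.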

comment:40 Changed 12 years ago by Yoda_Java6

Sure, the guard page warnings should indicate that.
But it is still the only Java app I run, and connecting the client
makes OF report that memory use only goes from 10 to 15 MB in this situation,
so it is hardly a real out of memory situation, but rather a memory
handling problem.

Anyway, here are the latest 2 fresh crashes, though they may only show
the same as the others:

ftp://ftp.warpspeed.dk/pdumps/Crash537.zip
ftp://ftp.warpspeed.dk/pdumps/Crash538.zip

comment:41 Changed 12 years ago by dmik

Yes, these are the same.

My current findings: the code wants to fill in the stack trace of the Throwable object (e.g. when Java has thrown an exception), but it fails to walk the Java call stack because the last Java frame pointer (SP value) associated with the current thread is zero.

Further, the JVM crashes a second time when trying to find out which Java object the value from the register (pointing into the Java heap) represents. Somehow the Klass reference is also zero.

I'm building the debug version of Java for Yoda to test. I guess a few assertions should be triggered in the debug build, which may show something interesting.

comment:42 Changed 12 years ago by Yoda_Java6

Sure, I'll test it. But where do I get it? :-)

comment:45 Changed 12 years ago by Yoda_Java6

BTW, do I need to use the Odin debug build too?

comment:46 Changed 12 years ago by dmik

The only interesting thing I see there is this assertion:

#  Internal Error (d:/Coding/javaos2/openjdk/hotspot/src/share/vm/runtime/handles.cpp:46), pid=31763, tid=2081619996
#  assert(SharedSkipVerify || obj->is_oop()) failed: sanity check

This may be a reason for the crashes. I will check that.

Yes, please try the debug version of Odin too.

comment:47 Changed 12 years ago by dmik

Also please try it with

set JAVA_TOOL_OPTIONS=-XX:+SharedSkipVerify %JAVA_TOOL_OPTIONS%

to make sure the assertion you see is suppressed. This may give us other failing assertions.

comment:48 in reply to: 47 Changed 12 years ago by Yoda_Java6

Replying to dmik:

> Also please try it with
>
> set JAVA_TOOL_OPTIONS=-XX:+SharedSkipVerify %JAVA_TOOL_OPTIONS%
>
> to make sure the assertion you see is suppressed. This may give us other failing assertions.

Added that option.
ftp://ftp.warpspeed.dk/pdumps/Crash548.zip

comment:49 Changed 12 years ago by Yoda_Java6

Included the Odin debug version. It crashed on startup:
ftp://ftp.warpspeed.dk/pdumps/Crash549.zip (Huge, includes several pdumps)

comment:50 Changed 12 years ago by Yoda_Java6

Hmm, it crashes that way during startup with debug Odin, so that
is not really useful. Back to retail.

comment:51 Changed 11 years ago by dmik

I've tried the fresh build of b27 (it's been running just now) and it works here. Please test once I release it (tomorrow) and report back. I will move this ticket to the next milestone anyway.

comment:52 Changed 11 years ago by dmik

And BTW, someone told me somewhere that OpenFire can't be stopped with Ctrl-Break now. With the latest Odin (to be released tomorrow as well) I don't see any problems with that.

comment:53 Changed 11 years ago by Yoda_Java6

Yes, that has been a problem with the last few Odins again.
It especially happens if you try to Ctrl-C when there are
active connections to clients and other servers.

comment:54 Changed 11 years ago by Yoda_Java6

Trying b27 and Odin 088

At first, it is pretty stable. I ran it for some time and connected several times
with no problem, but then it crashed again when I connected:

ftp://ftp.warpspeed.dk/pdumps/Crash572.zip

comment:55 Changed 11 years ago by Yoda_Java6

Version: changed from 1.6.0 Build 25 GA3 to 1.6.0 Build 27 GA4

comment:56 Changed 11 years ago by dmik

Milestone: NextGA5

comment:58 Changed 11 years ago by dmik

I can't download the above ZIPs; it tells me "Connection refused."

comment:59 Changed 11 years ago by Yoda_Java6

There are 3 major reasons for that:

1) My FTPD is currently highly unstable (bad version); it crashes several times a day,
and I currently do not have an autostart script for it.
2) Your BOSS informed me 2 days ago that no further work would be done on
any of my tickets...
3) Because of 2), yesterday I cleaned out the whole PDUMP dir on my FTP server, as it
used a huge amount of GB, which was now of no use.

comment:60 Changed 11 years ago by herwigb

Thank you for your nice words. I will take them into account when it comes to fixing tickets for you.

As you see, Silvan NEVER said that your tickets were NOT going to be fixed (they would have been closed in that case); you were jumping to conclusions...

dmik needs the zips in order to fix them; if he had no intention to work on them, he would not have asked. There is NO point telling one person from bitwise works what another bitwise works person said, because we KNOW what the other said.

Without zips the ticket is possibly going to be closed with status "feedback pending"...

comment:61 Changed 11 years ago by Yoda_Java6

Milestone: GA6
Resolution: set to wontfix
Status: changed from reopened to closed

Reporter switched to *nix; the app works perfectly there, so I can't test it here anymore.
