Opened 8 years ago

Closed 6 years ago

#133 closed defect (wontfix)

Openfire crashes when a client connects.

Reported by: yoda Owned by:
Priority: critical Milestone:
Component: general Version: 1.6.0 Build 27 GA4
Severity: high Keywords:
Cc: Yoda

Description

Sometimes Openfire crashes when a client connects.
Logs with dumps sent to your ftp.

Attachments (7)

Crash353.zip (2.1 KB) - added by yoda 8 years ago.
Crash354.zip (1.8 KB) - added by yoda 8 years ago.
Crash355.zip (2.1 KB) - added by yoda 8 years ago.
Crash386.zip (1.8 KB) - added by yoda 8 years ago.
Crash387.zip (2.2 KB) - added by yoda 8 years ago.
pstat.zip (13.1 KB) - added by yoda 7 years ago.
Crash435.zip (2.0 KB) - added by yoda 7 years ago.

Change History (68)

comment:1 Changed 8 years ago by yoda

test

comment:2 Changed 8 years ago by yoda

  • Priority changed from major to blocker
  • Severity changed from medium to highest

Tried with new Odin 0.8.1
Now it crashes every time a client connects.
3 crashes attached - Pdumps avail on req

comment:3 Changed 8 years ago by diver

  • Priority changed from blocker to major
  • Severity changed from highest to high

Did you try with the latest 0.8.3? I also hope you have a clean libc environment, meaning only libc064 installed and not a mix of libc064 and libc063.

comment:4 Changed 8 years ago by yoda

I just updated to Odin 0.8.3 (from 0.6xxx, which was suddenly stable for a month).
Although the env was clean, I updated with the latest gcc and libc WPI packs.
After that, it is impossible to connect to the Openfire server - it crashes on every
connect from a client.

2 different crashes - basically I still see the same with the Jeti/2 client on another PC.
( The Odin ticket you closed even though never fixed ).

Uploading info - Pdumps available on req

comment:5 Changed 8 years ago by diver

Also be sure you update libc064 with the DLL from the zip mentioned in http://svn.netlabs.org/libc/ticket/255
and make sure you don't have a libc063.dll around other than the one from the libc064 package.

comment:6 Changed 8 years ago by yoda

Installed the new DLL - it makes no difference. Still crashes immediately,
when a client connects.

Libpath checked for dupes, no libc* dupes found.

comment:7 Changed 8 years ago by diver

I'm not talking about dupes here. I'm talking about a mixed libc. If you have a libc063.dll bigger than 153.4 KB, you have a mixed libc installation.
It's just to make sure we don't chase a phantom.
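
The size check described in this comment can be sketched as a small Java program. This is only an illustration, not a tool used in the ticket: the class name `LibcMixCheck`, the constant and the method names are hypothetical, and it assumes that "153.4kb" means 153.4 * 1024 bytes. The directories to scan (for example the LIBPATH entries) are passed in by hand.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: flags any libc063.dll that is larger than the size
// quoted in the comment above for the copy shipped with the libc064 package.
final class LibcMixCheck {
    // Assumption: "153.4kb" is read as 153.4 * 1024 bytes.
    static final long PACKAGED_MAX_BYTES = Math.round(153.4 * 1024);

    // True if a libc063.dll of this size is bigger than the quoted limit.
    static boolean looksMixed(long sizeInBytes) {
        return sizeInBytes > PACKAGED_MAX_BYTES;
    }

    // Checks each directory for libc063.dll and reports its size.
    static List<String> scan(List<File> dirs) {
        List<String> report = new ArrayList<String>();
        for (File dir : dirs) {
            File dll = new File(dir, "libc063.dll");
            if (dll.isFile()) {
                long size = dll.length();
                String verdict = looksMixed(size)
                        ? "larger than expected, possible mixed libc install"
                        : "size within the quoted limit";
                report.add(dll.getPath() + ": " + size + " bytes, " + verdict);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        List<File> dirs = new ArrayList<File>();
        for (String arg : args) {
            dirs.add(new File(arg));
        }
        for (String line : scan(dirs)) {
            System.out.println(line);
        }
    }
}
```

Run it with the directories to check as arguments. A directory without a libc063.dll produces no output line.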

comment:8 Changed 8 years ago by yoda

No, all Libc0xx are from the WPI package - except the l8r test version of libc064.dll

comment:9 Changed 8 years ago by yoda

Besides, one of the crashes existed in the older VAC builds of Odin - they were just more random then. See ticket #146 for another report of the same problem.

comment:10 follow-ups: Changed 7 years ago by dmik

I have no idea what is wrong on your side. I've just installed a fresh copy of OpenFire 3.7.1, set it up with the browser, created two users and can now flawlessly have them chatting with each other through my OpenFire install on OS/2 (both clients are Mac iChat). So it must be a problem in your env (DLL mix-up).

Please try a clean install yourself and then try to do the following:

  1. Start OpenFire but don't let it crash (i.e. don't connect to it if it crashes at connect).
  2. Do pstat /a >log.
  3. Send me the resulting log file.

comment:11 follow-up: Changed 7 years ago by dmik

What I see though is that the Java process running OpenFire hangs at exitlist if I Ctrl-C it. I will look at this issue.

comment:12 Changed 7 years ago by dmik

I tried Ctrl-C several times with the new OpenFire version and can't make it hang any more...

What I can see in Odin though, since we switched it to GCC, is that pressing Ctrl-C in debug builds may lead to the infamous "LIBC panic" message. My quick tests showed that this happens if the thread which is currently writing to the log file is terminated (by the Ctrl-C handler) while holding the fmutex used to serialize file access in LIBC, and this is why it complains so loudly. In some cases this could lead to the exit thread hanging. The next time I get this, I will try to fix it in LIBC. Clearly, it should not panic in such cases.

This problem may be the reason for the OpenFire hang as well. I will test some more.

comment:13 Changed 7 years ago by dmik

Ok, I can reproduce the hang with the debug version of Odin/Java (if I try to break it before it fully starts). Got some logs, will investigate further.

comment:14 Changed 7 years ago by dmik

I created a separate defect #159 for the Ctrl-C issue as it has nothing to do with the original topic.

comment:15 in reply to: ↑ 11 Changed 7 years ago by yoda

Replying to dmik:

What I see though is that the Java process running OpenFire hangs at exitlist if I Ctrl-C
it. I will look at this issue.

I have seen that many many times previously too - but at that time, same happened for a
lot of other Java apps too.

Lately, I have never stopped it - so don't know current state.

comment:16 in reply to: ↑ 10 Changed 7 years ago by yoda

Replying to dmik:

I have no idea what is wrong on your side. I've just installed a fresh copy of OpenFire
3.7.1, set it up with the browser, created two users and can now flawlessly have them
chatting with each other through my OpenFire install on OS/2 (both clients are Mac iChat).

I'm still using 3.6.4 for many reasons - one of them was not to change env during
testing/searching for these bugz.

So it must be a problem in your env (DLL mix-up).

That has already been checked and rechecked. I even loaded Theseus and checked
the paths of all loaded DLLs. Nothing wrong on that part.

Please try a clean install yourself and then try to do the following:

I can try to install the latest in another path, and see if I can export/import
my settings to the new version. That will take a while, though.

comment:17 in reply to: ↑ 10 Changed 7 years ago by yoda

Replying to dmik:

  1. Start OpenFire but don't let it crash (i.e. don't connect to it if it crashes at connect).
  2. Do pstat /a >log.
  3. Send me the resulting log file.

Pstat.zip attached

comment:18 Changed 7 years ago by dmik

I don't see anything suspicious in the logs. OpenFire still works like a charm here.

comment:19 Changed 7 years ago by yoda

Well, I managed to update it to 3.7.1
Now, it does not crash on every connect anymore.
However, it still happens now and then - and still
when a client connects - it is now more like it was
with the VAC builds.

comment:20 Changed 7 years ago by dmik

  • Priority changed from major to Feedback Pending
  • Severity changed from high to medium

Please check the new builds of Odin 0.8.4 and OpenJDK b24 GA2.

I can repeat again, that the problem doesn't show up here.

comment:21 Changed 7 years ago by yoda

  • Priority changed from Feedback Pending to critical
  • Severity changed from medium to high

Java and Odin updated.
Problem is the same - as soon as a client connects,
OF crashes. The crash seems identical to the crash in Jeti/2.
I again had to switch back to VAC based Odin, to make it run.

Crash attached - PDUMP available on req.

comment:22 Changed 7 years ago by diver

  • Milestone changed from Next to GA4

comment:23 Changed 7 years ago by yoda

With the latest test builds - jd2/hd2/od2/iphlpapi - OpenFire 3.7.1 always
crashes at startup:
ftp://ftp.warpspeed.dk/pdumps/Crash503.zip

comment:24 Changed 7 years ago by dmik

What if you try it with JD3/HD3 (and with updated IPHLPAPI.DLL + OD2)? Still the same crash?

comment:25 Changed 7 years ago by yoda

Same crash with jd3/hd3/od2/ipi
ftp://ftp.warpspeed.dk/pdumps/Crash504.zip

comment:26 Changed 7 years ago by dmik

  • Version changed from 1.6.0-b22 GA to 1.6.0-b24 GA2

These problems need to be analyzed further. Moving it to GA4.

comment:27 Changed 7 years ago by yoda

  • Version changed from 1.6.0-b24 GA2 to 1.6.0 Build 25 GA3

I tested the new Java and Odin.
After a few bad starts (crash) it managed to run, and then worked for 2 days.
I then had to shut it down - using ctrl-C (remotely)
That took the whole server down :-(

comment:28 Changed 7 years ago by dmik

Please try this DLL ftp://ftp.netlabs.org/pub/odin/test/j4.zip (replace the one in the /bin/client dir) and report if it works now.

comment:29 Changed 7 years ago by yoda

A single test was ok. I'll make some further tests l8r.

comment:30 Changed 7 years ago by dmik

  • Resolution set to fixed
  • Status changed from new to closed

Closing. Feel free to create a new ticket if needed later.

comment:31 Changed 7 years ago by yoda

  • Resolution fixed deleted
  • Status changed from closed to reopened

As I feared, it needed more testing to be sure.
It just again crashed, when a client connects, just as
originally reported.

It does again show a guard page problem !

ftp://ftp.warpspeed.dk/pdumps/Crash522.zip

comment:32 Changed 7 years ago by dmik

Unfortunately, the ZIP is bad. Please attach another one.

From what I see now, the crash is in java_lang_Throwable::fill_in_stack_trace_of_preallocated_backtrace(). Have no idea what it does so far. It indeed looks like a very old crash which I'm unable to reproduce locally.

comment:33 Changed 7 years ago by yoda

There was nothing wrong with the ZIP, but sometimes the FTP server (or IP stack) corrupts access.
Server rebooted.

The crash is not something that just happens at load of Openfire,
but more when server has been idle for some time, and a client connects (again).

Retested ftp download of ZIP - it should be OK now.

comment:34 Changed 7 years ago by dmik

Judging from the code paths that lead to the call to fill_in_stack_trace_of_preallocated_backtrace(), this must be the "out of memory" condition. Analyzing the dumps shows that it crashes when attempting to create a Java exception object for "out of memory" (which is a bug of course that needs to be fixed). Given that I have plenty of RAM, it would explain why I can't easily reproduce the problem.

I will try to create a test case that exhausts Java heap somehow to see if it helps reproduce this crash.
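
A generic way to exhaust the Java heap from a test program is sketched below. This is not dmik's actual test case, which is not shown in the ticket; the class `HeapExhaust` and its method are hypothetical, and whether this drives the JVM through the same code path as the crash is not established here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: keep allocating fixed-size chunks and hold on to them
// so the garbage collector cannot reclaim them, until the heap runs out or a
// chunk limit is reached.
final class HeapExhaust {
    // Returns the number of chunks retained. A result smaller than maxChunks
    // means an OutOfMemoryError was raised before the limit was reached.
    static int allocateUntilOom(int chunkBytes, int maxChunks) {
        List<byte[]> retained = new ArrayList<byte[]>();
        try {
            while (retained.size() < maxChunks) {
                retained.add(new byte[chunkBytes]);
            }
            return retained.size();
        } catch (OutOfMemoryError e) {
            int count = retained.size();
            retained.clear(); // release the memory before reporting
            System.err.println("OutOfMemoryError after " + count + " chunks: " + e.getMessage());
            return count;
        }
    }

    public static void main(String[] args) {
        // 1 MiB chunks, no practical limit: run this with a small heap.
        int chunks = allocateUntilOom(1 << 20, Integer.MAX_VALUE);
        System.out.println("Retained " + chunks + " chunks of 1 MiB");
    }
}
```

Running it with a small maximum heap (for example `java -Xmx32m HeapExhaust`) keeps the experiment short and avoids pushing the whole machine into swapping.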

comment:35 Changed 7 years ago by yoda

I doubt that it is a _real_ out of memory condition.
Server has 1GB RAM, and generally 500MB free.
OF reports 10MB of 247MB is used.

OTOH, I still see these sometimes with the crashes:
OpenJDK Client VM warning: Attempt to protect stack guard pages failed.

comment:36 Changed 7 years ago by dmik

Can't reproduce the condition so far. Will try to overload the PC with other apps.

Regarding your last comment, Java uses a lot of shared memory (which is not the same as the total amount of memory you have), so that may be the issue.

comment:37 Changed 7 years ago by yoda

FYI:
E:\Utils>qsysinfo.exe
QSV_MAX = 31
QSV_MAX_PATH_LENGTH = 260
QSV_MAX_TEXT_SESSIONS = 16
QSV_MAX_PM_SESSIONS = 16
QSV_MAX_VDM_SESSIONS = 1025
QSV_BOOT_DRIVE = 3
QSV_DYN_PRI_VARIATION = 1
QSV_MAX_WAIT = 1
QSV_MIN_SLICE = 32
QSV_MAX_SLICE = 32
QSV_PAGE_SIZE = 4096
QSV_VERSION_MAJOR = 20
QSV_VERSION_MINOR = 45
QSV_VERSION_REVISION = 0
QSV_MS_COUNT = 1245409108
QSV_TIME_LOW = 1345462467
QSV_TIME_HIGH = 0
QSV_TOTPHYSMEM = 1073704960
QSV_TOTRESMEM = 123203584
QSV_TOTAVAILMEM = 1339817984
QSV_MAXPRMEM = 317390848
QSV_MAXSHMEM = 143261696
QSV_TIMER_INTERVAL = 310
QSV_MAX_COMP_LENGTH = 255
QSV_FOREGROUND_FS_SESSION = 17
QSV_FOREGROUND_PROCESS = 18777
QSV_NUM_PROCESSORS = 2
QSV_MAXHPRMEM = 469762048
QSV_MAXHSHMEM = 102703104
QSV_MAXPROCESSES = 1025
QSV_VIRTUALADDRESSLIMIT = 1024
QSV_INT10ENABLED = 1
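
To read the memory figures above against the shared-memory remark in comment:36, it helps to convert them to MiB. The sketch below does only that arithmetic on values copied from the listing; the class `QsvMem` is hypothetical. The interpretation of the names (private versus shared arena, with the `H` variants referring to high memory) is my understanding of the OS/2 DosQuerySysInfo documentation and is not stated in the ticket.

```java
// Hypothetical sketch: converts the byte counts reported by qsysinfo above
// into MiB so the private and shared arena limits are easier to compare.
final class QsvMem {
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Values copied verbatim from the qsysinfo listing in this comment.
        String[] names = {
            "QSV_TOTPHYSMEM", "QSV_MAXPRMEM", "QSV_MAXSHMEM",
            "QSV_MAXHPRMEM", "QSV_MAXHSHMEM"
        };
        long[] values = {
            1073704960L, 317390848L, 143261696L,
            469762048L, 102703104L
        };
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-16s %11d bytes = %7.1f MiB%n",
                    names[i], values[i], toMiB(values[i]));
        }
    }
}
```

On these numbers the shared arena limits come to roughly 137 MiB (QSV_MAXSHMEM) and 98 MiB (QSV_MAXHSHMEM), far below the 1 GB of physical RAM.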

comment:38 Changed 7 years ago by yoda

The crash happens less often after the latest hotfix - but still happens
randomly at connect time. OF is not exactly overloaded here, as I am
currently the only user. My client does, however, also use MUCs (local and
remote) and gateways (ICQ and IRC).

comment:39 Changed 7 years ago by dmik

Still, this smells like a memory management issue. However, I'm unable to reproduce the problem locally even after loading lots of apps and exhausting the Java heap / stack.

comment:40 Changed 7 years ago by yoda

Sure, the guard page warnings should indicate that.
But it is still the only Java app I run - and connecting the client
makes OF say that memory use only goes from 10 to 15MB in this situation;
so it is hardly any real out of memory situation - but more a memory
handling problem.

Anyway, here are the latest 2 fresh crashes; but they may only show
the same as the others:

ftp://ftp.warpspeed.dk/pdumps/Crash537.zip
ftp://ftp.warpspeed.dk/pdumps/Crash538.zip

comment:41 Changed 7 years ago by dmik

Yes these are the same.

My current findings. The code wants to fill in the stack trace to the Throwable object (e.g. when Java has thrown an exception) but it fails to walk the Java call stack because the last Java frame pointer (SP value) associated with the current thread is zero.

Further, JVM crashes for the second time when trying to find out what Java object the value from the register (pointing to the Java heap) represents. Somehow the Klass reference is also zero.

I'm building the debug version of Java for yoda to test. I guess a few assertions should be triggered in the debug build, which may show something interesting.

comment:42 Changed 7 years ago by yoda

Sure, I'll test it. But where do I get it ? :-)

comment:45 Changed 7 years ago by yoda

BTW, do I need to use Odin Debug too ?

comment:46 Changed 7 years ago by dmik

The only interesting thing I see there is this assertion:

#  Internal Error (d:/Coding/javaos2/openjdk/hotspot/src/share/vm/runtime/handles.cpp:46), pid=31763, tid=2081619996
#  assert(SharedSkipVerify || obj->is_oop()) failed: sanity check

This may be a reason for the crashes. I will check that.

Yes, please also try the debug version of Odin too.

comment:47 follow-up: Changed 7 years ago by dmik

Also please try it with

set JAVA_TOOL_OPTIONS=-XX:+SharedSkipVerify %JAVA_TOOL_OPTIONS%

to make sure the assertion you see is suppressed. This may give us other failing assertions.
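
One way to confirm that the option actually reached the JVM is to print the environment variable and the arguments the running JVM reports. The sketch below is an illustration only: the class `ToolOptionsCheck` and the helper `hasFlag` are hypothetical, and it assumes that options picked up from JAVA_TOOL_OPTIONS show up in the list returned by `RuntimeMXBean.getInputArguments()`.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical sketch: reports whether a given -XX flag is present in
// JAVA_TOOL_OPTIONS and prints the arguments the running JVM was started with.
final class ToolOptionsCheck {
    // True if the whitespace-separated option string contains the exact flag.
    static boolean hasFlag(String options, String flag) {
        if (options == null) {
            return false;
        }
        for (String token : options.trim().split("\\s+")) {
            if (token.equals(flag)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String flag = "-XX:+SharedSkipVerify";
        String env = System.getenv("JAVA_TOOL_OPTIONS");
        System.out.println("JAVA_TOOL_OPTIONS = " + env);
        System.out.println("Flag present in environment: " + hasFlag(env, flag));

        // Assumption: options taken from JAVA_TOOL_OPTIONS are listed here too.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM input arguments: " + jvmArgs);
    }
}
```

Run it from the same session in which the `set` command was issued, with the same Java build that runs OpenFire.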

comment:48 in reply to: ↑ 47 Changed 7 years ago by yoda

Replying to dmik:

Also please try it with

set JAVA_TOOL_OPTIONS=-XX:+SharedSkipVerify %JAVA_TOOL_OPTIONS%

to make sure the assertion you see is suppressed. This may give us other failing assertions.

Added that option.
ftp://ftp.warpspeed.dk/pdumps/Crash548.zip

comment:49 Changed 7 years ago by yoda

Included Odin debug version. It crashed on startup:
ftp://ftp.warpspeed.dk/pdumps/Crash549.zip (Huge, includes several pdumps)

comment:50 Changed 7 years ago by yoda

Hmm, it crashes that way during startup with debug Odin, so that
is not really useful. Back to retail.

comment:51 Changed 7 years ago by dmik

I've tried the fresh build of b27 (it's been running just now) and it works here. Please test once I release it (tomorrow) and report back. I will move this ticket to the next milestone anyway.

comment:52 Changed 7 years ago by dmik

And btw someone told me somewhere that OpenFire can't be stopped with Ctrl-Break now. With the latest Odin (to be released tomorrow as well) I don't see any problems with that.

comment:53 Changed 7 years ago by yoda

Yes, that has been a problem with the last few Odins again.
It especially happens, if you try to ctrl-C, when there are
active connections to clients and other servers.

comment:54 Changed 7 years ago by yoda

Trying b27 and Odin 088

At first, it is pretty stable. I did run it for some time, and connected several times
with no problem - but then it again crashed, when I connected:

ftp://ftp.warpspeed.dk/pdumps/Crash572.zip

comment:55 Changed 7 years ago by yoda

  • Version changed from 1.6.0 Build 25 GA3 to 1.6.0 Build 27 GA4

comment:56 Changed 7 years ago by dmik

  • Milestone changed from Next to GA5

comment:58 Changed 7 years ago by dmik

I can't download the above zips; it tells me "Connection refused."

comment:59 Changed 7 years ago by yoda

There are 3 major reasons for that:

1) My FTPD is currently highly unstable (bad version) and crashes several times a day,
and I currently do not have an autostart script for it.

2) Your BOSS informed me 2 days ago that no further work would be done on
any of my tickets ...

3) Because of 2), yesterday I cleaned out the whole PDUMP dir on my ftp server, as it
used a huge amount of GB, which was now of no use.

comment:60 Changed 7 years ago by herwigb

Thank you for your nice words. I will take them into account when it comes to fixing tickets for you.

As you see, Silvan NEVER said that your tickets were NOT going to be fixed (they would have been closed in this case), you were jumping to conclusions...

dmik needs the zips in order to fix them, if he had no intention to work on them, he would not have asked. There is NO point telling one person from bitwise works what another bitwise works person said, because we KNOW what the other said.

Without zips the ticket is possibly going to be closed with status "feedback pending"...

comment:61 Changed 6 years ago by yoda

  • Milestone GA6 deleted
  • Resolution set to wontfix
  • Status changed from reopened to closed

Reporter switched to *nix - app works perfectly there - can't test anymore here.
