Opened 13 years ago
Closed 12 years ago
#133 closed defect (wontfix)
Openfire crashes when a client connects.
Reported by: | Yoda_Java6 | Owned by: | |
---|---|---|---|
Priority: | critical | Milestone: | |
Component: | general | Version: | 1.6.0 Build 27 GA4 |
Severity: | high | Keywords: | |
Cc: | Yoda |
Description
Sometimes Openfire crashes when a client connects.
Logs with dumps sent to your ftp.
Attachments (7)
Change History (68)
comment:1 by , 13 years ago
by , 13 years ago
Attachment: | Crash353.zip added |
---|
by , 13 years ago
Attachment: | Crash354.zip added |
---|
by , 13 years ago
Attachment: | Crash355.zip added |
---|
comment:2 by , 13 years ago
Priority: | major → blocker |
---|---|
Severity: | medium → highest |
Tried with new Odin 0.8.1
Now it crashes every time a client connects.
3 crashes attached - Pdumps avail on req
comment:3 by , 13 years ago
Priority: | blocker → major |
---|---|
Severity: | highest → high |
did you try with latest 0.8.3? and i hope you have a clean libc environment, which means only libc064 installed and not a mix of libc064 and libc063
comment:4 by , 13 years ago
I just updated to Odin 0.8.3 (from 0.6xxx which was suddenly stable for a month).
Although clean env - I updated with latest gcc and libc wpi packs.
After that, impossible to connect to Openfire server - it crashes on every
connect from client.
2 different crashes - basically I still see same with Jeti/2 client on another PC.
( The Odin ticket you closed even though never fixed ).
Uploading info - Pdumps available on req
by , 13 years ago
Attachment: | Crash386.zip added |
---|
by , 13 years ago
Attachment: | Crash387.zip added |
---|
comment:5 by , 13 years ago
also be sure you update libc064 with the dll from the zip mentioned in http://svn.netlabs.org/libc/ticket/255
and make sure you don't have a libc063.dll other than that from libc064 package around.
comment:6 by , 13 years ago
Installed the new DLL - it makes no difference. Still crashes immediately,
when a client connects.
Libpath checked for dupes, no libc* dupes found.
comment:7 by , 13 years ago
i'm not talking about dups here. i'm talking about mixed libc. so if you have a libc063.dll bigger than 153.4kb you have a mixed libc installation.
it's just to make sure we don't search a phantom.
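The size check described above can be sketched as a small Java program. This is only an illustration, not a tool used in this ticket: the class name `MixedLibcCheck`, passing the LIBPATH directories as a semicolon-separated argument, and reading "153.4kb" as 153.4 × 1024 bytes are all assumptions.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class MixedLibcCheck {
    // Threshold quoted in the ticket (153.4 KB); treating 1 KB as 1024 bytes is an assumption.
    static final long MAX_EXPECTED_BYTES = (long) (153.4 * 1024);

    // True if a libc063.dll of this size is larger than the threshold.
    static boolean looksMixed(long sizeBytes) {
        return sizeBytes > MAX_EXPECTED_BYTES;
    }

    // Returns every oversized libc063.dll found in the given semicolon-separated directories.
    static List<File> findSuspects(String libpath) {
        List<File> suspects = new ArrayList<File>();
        for (String dir : libpath.split(";")) {
            String trimmed = dir.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            File dll = new File(trimmed, "libc063.dll");
            if (dll.isFile() && looksMixed(dll.length())) {
                suspects.add(dll);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Pass the LIBPATH value from CONFIG.SYS as the first argument.
        String libpath = args.length > 0 ? args[0] : "";
        for (File dll : findSuspects(libpath)) {
            System.out.println("Possible mixed libc: " + dll + " (" + dll.length() + " bytes)");
        }
    }
}
```

An empty result only means no oversized libc063.dll was found in the listed directories; it does not rule out other DLL mix-ups.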
comment:8 by , 13 years ago
No, all Libc0xx are from the WPI package - except the later test version of libc064.dll
comment:9 by , 13 years ago
Besides, one of the crashes existed in the older VAC builds of Odin - they were just more random then. See ticket #146 for another report of the same problem.
follow-ups: 16 17 comment:10 by , 13 years ago
I have no idea what is wrong on your side. I've just installed a fresh copy of OpenFire 3.7.1, set it up with the browser, created two users and can now flawlessly have them chatting with each other through my OpenFire install on OS/2 (both clients are Mac iChat). So it must be a problem in your env (DLL mix-up).
Please try a clean install yourself and then try to do the following:
- Start OpenFire but don't let it crash (i.e. don't connect to it if it crashes at connect).
- Do pstat /a >log.
- Send me the resulting log file.
follow-up: 15 comment:11 by , 13 years ago
What I see though is that the Java process running OpenFire hangs at exitlist if I Ctrl-C it. I will look at this issue.
comment:12 by , 13 years ago
I tried Ctrl-C several times with the new OpenFire version and can't make it hang any more...
What I can see in Odin though since we switched it to GCC is that pressing Ctrl-C in debug builds may lead to the infamous "LIBC panic" message. My quick tests showed that this happens if the thread which is currently writing to the log file is terminated (by the Ctrl-C handler) while holding the fmutex used to serialize file access in LIBC, and this is why it complains so loudly. In some cases this could lead to the exit thread hanging. The next time I get this, I will try to fix it in LIBC. Clearly, it should not panic in such cases.
This problem may be the reason for the OpenFire hang as well. I will test some more.
comment:13 by , 13 years ago
Ok, I can reproduce the hang with the debug version of Odin/Java (if I try to break it before it fully starts). Got some logs, will investigate further.
comment:14 by , 13 years ago
I created a separate defect #159 for the Ctrl-C issue as it has nothing to do with the original topic.
comment:15 by , 13 years ago
Replying to dmik:
What I see though is that the Java process running OpenFire hangs at exitlist if I Ctrl-C
it. I will look at this issue.
I have seen that many many times previously too - but at that time, same happened for a
lot of other Java apps too.
Lately, I have never stopped it - so don't know current state.
comment:16 by , 13 years ago
Replying to dmik:
I have no idea what is wrong on your side. I've just installed a fresh copy of OpenFire
3.7.1, set it up with the browser, created two users and can now flawlessly have them
chatting with each other through my OpenFire install on OS/2 (both clients are Mac iChat).
I'm still using 3.6.4 for many reasons - one of them was not to change env during
testing/searching for these bugz.
So it must be a problem in your env (DLL mix-up).
That has already been checked and rechecked. I even loaded Theseus and checked
the paths of all loaded DLLs. Nothing wrong on that part.
Please try a clean install yourself and then try to do the following:
I can try to install the latest in another path, and see if I can export/import
my settings to the new version. That will take a while, though.
comment:17 by , 13 years ago
by , 13 years ago
comment:18 by , 13 years ago
I don't see anything suspicious in the logs. OpenFire still works like a charm here.
comment:19 by , 13 years ago
Well, I managed to update it to 3.7.1
Now, it does not crash on every connect anymore.
However, it still happens now and then - and still
when a client connects - it is now more like it was
with the VAC builds.
comment:20 by , 13 years ago
Priority: | major → Feedback Pending |
---|---|
Severity: | high → medium |
Please check the new builds of Odin 0.8.4 and OpenJDK b24 GA2.
I can repeat again, that the problem doesn't show up here.
comment:21 by , 13 years ago
Priority: | Feedback Pending → critical |
---|---|
Severity: | medium → high |
Java and Odin updated.
Problem is the same - as soon as a client connects,
OF crashes. The crash seems identical to the crash in Jeti/2
I again had to switch back to VAC based Odin, to make it run.
Crash attached - PDUMP available on req.
by , 13 years ago
Attachment: | Crash435.zip added |
---|
comment:22 by , 12 years ago
Milestone: | Next → GA4 |
---|
comment:23 by , 12 years ago
With the latest test builds - jd2/hd2/od2/ipihlpapi - OpenFire 3.7.1 always
crashes at startup:
ftp://ftp.warpspeed.dk/pdumps/Crash503.zip
comment:24 by , 12 years ago
What if you try it with JD3/HD3 (and with updated IPHLPAPI.DLL + OD2)? Still the same crash?
comment:25 by , 12 years ago
Same crash with jd3/hd3/od2/ipi
ftp://ftp.warpspeed.dk/pdumps/Crash504.zip
comment:26 by , 12 years ago
Version: | 1.6.0-b22 GA → 1.6.0-b24 GA2 |
---|
These problems need to be analyzed further. Moving it to GA4.
comment:27 by , 12 years ago
Version: | 1.6.0-b24 GA2 → 1.6.0 Build 25 GA3 |
---|
I tested the new Java and Odin.
After a few bad starts (crash) it managed to run, and then worked for 2 days.
I then had to shut it down - using ctrl-C (remotely)
That took the whole server down :-(
comment:28 by , 12 years ago
Please try this DLL ftp://ftp.netlabs.org/pub/odin/test/j4.zip (replace the one in the /bin/client dir) and report if it works now.
comment:30 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Closing. Feel free to create a new ticket if needed later.
comment:31 by , 12 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
As I feared, it needed more testing to be sure.
It just again crashed, when a client connects, just as
originally reported.
It does again show a guard page problem !
comment:32 by , 12 years ago
Unfortunately, the ZIP is bad. Please attach another one.
From what I see now, the crash is in java_lang_Throwable::fill_in_stack_trace_of_preallocated_backtrace(). Have no idea what it does so far. It indeed looks like a very old crash which I'm unable to reproduce locally.
comment:33 by , 12 years ago
There was nothing wrong with the ZIP, but sometimes the FTP server (or IP stack) corrupts access.
Server rebooted.
The crash is not something that just happens at load of Openfire,
but more when server has been idle for some time, and a client connects (again).
Retested ftp download of ZIP - it should be OK now.
comment:34 by , 12 years ago
Judging from the code paths that lead to the call to fill_in_stack_trace_of_preallocated_backtrace(), this must be the "out of memory" condition. Analyzing the dumps shows that it crashes when attempting to create a Java exception object for "out of memory" (which is a bug of course that needs to be fixed). Given that I have plenty of RAM, it would explain why I can't easily reproduce the problem.
I will try to create a test case that exhausts Java heap somehow to see if it helps reproduce this crash.
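A test case of the kind mentioned above could look roughly like the following sketch. It is not the test actually used in this ticket; the class name `HeapExhaust` and its parameters are hypothetical, and there is no guarantee that it triggers the same code path as the reported crash.

```java
import java.util.ArrayList;
import java.util.List;

class HeapExhaust {
    /**
     * Allocates chunkBytes-sized arrays until either the JVM throws
     * OutOfMemoryError or capBytes have been allocated.
     * Returns true if an OutOfMemoryError was provoked and caught.
     */
    static boolean provokeOutOfMemory(int chunkBytes, long capBytes) {
        List<byte[]> hog = new ArrayList<byte[]>();
        long allocated = 0;
        try {
            while (allocated < capBytes) {
                hog.add(new byte[chunkBytes]); // keep a reference so the GC cannot reclaim it
                allocated += chunkBytes;
            }
            return false; // cap reached before the heap ran out
        } catch (OutOfMemoryError e) {
            hog.clear();       // release the memory so the JVM can continue
            e.getStackTrace(); // touch the stack trace of the error object
            return true;
        }
    }

    public static void main(String[] args) {
        // Intended to be run with a small heap, e.g. java -Xmx16m HeapExhaust
        for (int round = 1; round <= 5; round++) {
            boolean oom = provokeOutOfMemory(1024 * 1024, Long.MAX_VALUE);
            System.out.println("round " + round + ": OutOfMemoryError caught = " + oom);
        }
    }
}
```

The cap parameter exists only so the method can be called safely without filling the heap; in `main` it is effectively disabled.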
comment:35 by , 12 years ago
I doubt that it is a _real_ out of memory condition.
Server has 1GB RAM, and generally 500MB free.
OF reports 10MB of 247MB is used.
OTOH, I still see these sometimes with the crashes:
OpenJDK Client VM warning: Attempt to protect stack guard pages failed.
comment:36 by , 12 years ago
Can't reproduce the condition so far. Will try to overload the PC with other apps.
Regarding your last comment, Java uses a lot of shared memory (which is not the same as the total amount of memory you have), so that may be the issue.
comment:37 by , 12 years ago
FYI:
E:\Utils>qsysinfo.exe
QSV_MAX = 31
QSV_MAX_PATH_LENGTH = 260
QSV_MAX_TEXT_SESSIONS = 16
QSV_MAX_PM_SESSIONS = 16
QSV_MAX_VDM_SESSIONS = 1025
QSV_BOOT_DRIVE = 3
QSV_DYN_PRI_VARIATION = 1
QSV_MAX_WAIT = 1
QSV_MIN_SLICE = 32
QSV_MAX_SLICE = 32
QSV_PAGE_SIZE = 4096
QSV_VERSION_MAJOR = 20
QSV_VERSION_MINOR = 45
QSV_VERSION_REVISION = 0
QSV_MS_COUNT = 1245409108
QSV_TIME_LOW = 1345462467
QSV_TIME_HIGH = 0
QSV_TOTPHYSMEM = 1073704960
QSV_TOTRESMEM = 123203584
QSV_TOTAVAILMEM = 1339817984
QSV_MAXPRMEM = 317390848
QSV_MAXSHMEM = 143261696
QSV_TIMER_INTERVAL = 310
QSV_MAX_COMP_LENGTH = 255
QSV_FOREGROUND_FS_SESSION = 17
QSV_FOREGROUND_PROCESS = 18777
QSV_NUM_PROCESSORS = 2
QSV_MAXHPRMEM = 469762048
QSV_MAXHSHMEM = 102703104
QSV_MAXPROCESSES = 1025
QSV_VIRTUALADDRESSLIMIT = 1024
QSV_INT10ENABLED = 1
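To compare the memory figures in this listing more easily, the byte values can be converted to MiB. The following is a minimal sketch, assuming the listing is available as plain text in the `NAME = value` form shown above; the class name `QsvReport` is hypothetical, and the comment on what the QSV_MAX* values mean is based on the usual reading of those constants, not on anything stated in this ticket.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class QsvReport {
    // Parses "NAME = value" lines; lines that do not match are skipped.
    static Map<String, Long> parse(String text) {
        Map<String, Long> values = new LinkedHashMap<String, Long>();
        for (String line : text.split("\\r?\\n")) {
            String[] parts = line.split("=");
            if (parts.length != 2) {
                continue;
            }
            try {
                values.put(parts[0].trim(), Long.parseLong(parts[1].trim()));
            } catch (NumberFormatException ignored) {
                // not a numeric entry, skip it
            }
        }
        return values;
    }

    // Converts a byte count to MiB (1 MiB = 1024 * 1024 bytes).
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Memory-related values copied from the listing above.
        // Assumption: the QSV_MAX* entries are advisory limits on what can
        // still be allocated in the private/shared arenas, not totals.
        String sample =
              "QSV_TOTPHYSMEM = 1073704960\n"
            + "QSV_MAXPRMEM = 317390848\n"
            + "QSV_MAXSHMEM = 143261696\n"
            + "QSV_MAXHPRMEM = 469762048\n"
            + "QSV_MAXHSHMEM = 102703104\n";
        for (Map.Entry<String, Long> e : parse(sample).entrySet()) {
            System.out.println(String.format("%-16s %10.1f MiB", e.getKey(), toMiB(e.getValue())));
        }
    }
}
```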
comment:38 by , 12 years ago
The crash happens less often after the latest hotfix - but still happens
randomly at connect time. OF is not exactly overloaded here, as I am
currently the only user. My client does however also use MUC's (local and
remote) and gateways (ICQ and IRC).
comment:39 by , 12 years ago
Still, this smells like a memory management issue. However, I'm unable to reproduce the problem locally even after loading lots of apps and exhausting the Java heap / stack.
comment:40 by , 12 years ago
Sure, the guard page warnings should indicate that.
But it is still the only Java app I run - and connecting the client
makes OF say that memory use only goes from 10 to 15MB in this situation;
so it is hardly any real out of memory situation - but more a memory
handling problem.
Anyway, here are the latest 2 fresh crashes; but they may only show
the same as the others:
ftp://ftp.warpspeed.dk/pdumps/Crash537.zip
ftp://ftp.warpspeed.dk/pdumps/Crash538.zip
comment:41 by , 12 years ago
Yes these are the same.
My current findings. The code wants to fill in the stack trace to the Throwable object (e.g. when Java has thrown an exception) but it fails to walk the Java call stack because the last Java frame pointer (SP value) associated with the current thread is zero.
Further, JVM crashes for the second time when trying to find out what Java object the value from the register (pointing to the Java heap) represents. Somehow the Klass reference is also zero.
I'm building the debug version of Java for yoda to test. I guess a few assertions should be triggered in the debug build which may show something interesting.
comment:44 by , 12 years ago
comment:46 by , 12 years ago
The only interesting thing I see there is this assertion:
# Internal Error (d:/Coding/javaos2/openjdk/hotspot/src/share/vm/runtime/handles.cpp:46), pid=31763, tid=2081619996
# assert(SharedSkipVerify || obj->is_oop()) failed: sanity check
This may be a reason for the crashes. I will check that.
Yes, please also try the debug version of Odin too.
follow-up: 48 comment:47 by , 12 years ago
Also please try it with
set JAVA_TOOL_OPTIONS=-XX:+SharedSkipVerify %JAVA_TOOL_OPTIONS%
to make sure the assertion you see is suppressed. This may give us other failing assertions.
comment:48 by , 12 years ago
Replying to dmik:
Also please try it with
set JAVA_TOOL_OPTIONS=-XX:+SharedSkipVerify %JAVA_TOOL_OPTIONS%
to make sure the assertion you see is suppressed. This may give us other failing assertions.
Added that option.
ftp://ftp.warpspeed.dk/pdumps/Crash548.zip
comment:49 by , 12 years ago
Included Odin debug version. It crashed on startup:
ftp://ftp.warpspeed.dk/pdumps/Crash549.zip (Huge, includes several pdumps)
comment:50 by , 12 years ago
Hmm, it crashes that way during startup with debug Odin, so that
is not really useful. Back to retail.
comment:51 by , 12 years ago
I've tried the fresh build of b27 (it's been running just now) and it works here. Please test once I release it (tomorrow) and report back. I will move this ticket to the next milestone anyway.
comment:52 by , 12 years ago
And btw someone told me somewhere that OpenFire can't be stopped with Ctrl-Break now. With the latest Odin (to be released tomorrow as well) I don't see any problems with that.
comment:53 by , 12 years ago
Yes, that has been a problem with the last few Odins again.
It especially happens, if you try to ctrl-C, when there are
active connections to clients and other servers.
comment:54 by , 12 years ago
Trying b27 and Odin 088
At first, it is pretty stable. I did run it for some time, and connected several times
with no problem - but then it again crashed, when I connected:
comment:56 by , 12 years ago
Milestone: | Next → GA5 |
---|
comment:57 by , 12 years ago
comment:59 by , 12 years ago
There are 3 major reasons for that:
1) My FTPD is currently highly unstable (bad version), that crashes several times a day,
and I currently do not have an autostart script for it.
2) Your BOSS informed me 2 days ago, that there would not be done any further work on
any of my tickets ...
3) Because of 2) I yesterday cleaned out the whole PDUMP dir on my ftp server, as it
used a huge amount of GB, which were now of no use.
comment:60 by , 12 years ago
Thank you for your nice words. I will take them into account when it comes to fixing tickets for you.
As you see, Silvan NEVER said that your tickets were NOT going to be fixed (they would have been closed in this case), you were jumping to conclusions...
dmik needs the zips in order to fix them, if he had no intention to work on them, he would not have asked. There is NO point telling one person from bitwise works what another bitwise works person said, because we KNOW what the other said.
Without zips the ticket is possibly going to be closed with status "feedback pending"...
comment:61 by , 12 years ago
Milestone: | GA6 |
---|---|
Resolution: | → wontfix |
Status: | reopened → closed |
Reporter switched to *nix - app works perfectly there - can't test anymore here.