Opened 9 years ago
Closed 8 years ago
#113 closed defect (fixed)
dash: nasm configure crashes
Reported by: | dmik | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | dash | Version: | |
Severity: | medium | Keywords: | |
Cc: |
Description
If you run nasm configure from under dash, you will get the following at the end:
configure: creating ./config.status config.status: creating Makefile config.status: creating rdoff/Makefile config.status: creating doc/Makefile config.status: creating config.h config.status: executing default-1 commands D:/Coding/ports/nasm/test-dash/configure: 45: D:/Coding/ports/nasm/test-dash/configure: Cannot fork LIBC PANIC!! fmutex deadlock: Owner died! 0x1af4c5b0: Owner=0xec370001 Self=0xf3980001 fs=0x3 flags=0x0 hev=0x00010006 Desc="LIBC Thread DB Mutex" pid=0xf398 ppid=0xec27 tid=0x0001 slot=0x00b2 pri=0x0200 mc=0x0001 ps=0x0010 C:\USR\BIN\SH.EXE Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it. _fmutex operation failed: LIBC Heap request Killed by SIGABRT pid=0xf398 ppid=0xec27 tid=0x0001 slot=0x00b2 pri=0x0200 mc=0x0002 ps=0x0017 C:\USR\BIN\SH.EXE Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.
This has something to do with forking (dash has some additional code there compared to ash).
Attachments (2)
Change History (17)
comment:1 by , 9 years ago
comment:2 by , 9 years ago
The fun thing that only the release build of dash fails, the debug build works...
comment:3 by , 9 years ago
Note that it also crashes when running Makefiles (while calling the install
program for instance). It's some generic problem. However, a quick look doesn't reveal any problems. A deeper investigation is needed.
comment:4 by , 9 years ago
This issue is really floating. I tried to enable trace debugging in dash, also tried manual printf, also tried libc with logging — in all cases the problem just vanishes and all works as designed. Sometimes it re-appears when I remove all this additional debugging, sometimes — not. The problem seems to be related to the parent/child fork callback execution during fork()
. In particular, the "Thread DB Mutex" gets requested in threadForkParent1
and released in threadForkCompletion
. However, somehow the child process already ends by that time w/o the mutex release and the next attempt to lock it in the parent process fails with the above assertion (it's just my guess, since I'm unable to reproduce it with logging enabled as I already mentioned). The child process seems to have run all code it was intended to, so this is a process termination problem as I see it now (which is of course not nice as it breaks outer scripts, e.g. rpmbuild).
I made the "Cannot fork" message little bit more verbose in r1659 (it prints errno text now) and also added turning on interrupts (snooped it in ash sources which dash is based upon). Let's see how it behaves for others (will make an RPM for it after also updating the code base to version 0.5.9 released in June).
comment:5 by , 8 years ago
I got very good libc debug logs from Elbert and they indicate that the error comes from one of the functions that the parent initiates in the child process. This function returns ENOMEM and this causes the assertion to be triggered in the loop that handles packets from the parent. I'm trying to find out which function returns ENOMEM. It's not clear from the logs which function it is. According to the logs, it's certainly called from a pfnAtFork
callback run byforkBthProcessModules
and its is address missing from libc.map. It is also pretty clear that pfnAtFork
in this case is not the default _atfork_callback
used by EXEs and DLLs generated by gcc. So most likely we have some 3rd party DLL that provides its own pfnAtFork
implementation and this implementation requests to call a function within the child process via pfnInvoke
that fails with ENOMEM and causes an assertion in the child (which then causes Cannot fork
).
I have to extend logging in LIBC to make it more clear.
by , 8 years ago
Attachment: | libc_0c73.log added |
---|
by , 8 years ago
Attachment: | libc_0c74.log added |
---|
comment:6 by , 8 years ago
This turns out to be a double LIBC issue (one is the normal LIBC and the other one is LIBC with logging loaded in LIBPATHSTRICT=T mode). I can easily reproduce this crash locally when using two different LIBC DLLs because both get registered as modules and both try to initialize the same things (and the second attempt fails with ENOMEM). What I don't fully understand is how another LIBC version appears among the registered DLLs - most likely, it is dragged in by other (already loaded) DLLs that are not affected by LIBPATHSTRICT=T). Sigh. We have to go back to the non logging LIBC version.
comment:7 by , 8 years ago
Elbert sent me the new logs and now I see that dash does a series of fork
calls that work then it does exec
and that exec
fails in the child because the hidden kLIBC argument that libc passes on to its children started with exec
doesn't have a flags byte preceding it. This may or may not be a reason for a dash failure but it apparently shadows the failure I'm currently investigating - because it triggers an assertion that is inactive in the normal builds.
comment:8 by , 8 years ago
The above problem is fixed in http://trac.netlabs.org/libc/ticket/362. Elbert produced a different set of logs for me (with this exec/spawn
assertion disabled) and these revealed another assertion, this time in fork code:
0017516f 01 06 0019 Asrt: Assertion Failed!!! 0017516f 01 06 0019 Asrt: Function: forkBthBufferWait 0017516f 01 06 0019 Asrt: File: D:/Coding/libc/libc-0.6/src/emx/src/lib/sys/libcfork.c 0017516f 01 06 0019 Asrt: Line: 2406 0017516f 01 06 0019 Asrt: Expr: 0 0017516f 01 06 0019 Asrt: DosRequestMutexSem(0x800100d0, 30000) -> 640
This I believe is what weI see in normal release builds of dash, i.e. Cannot fork (Invalid argument)
. There is a lot of logs and I need to sort them out.
comment:9 by , 8 years ago
According to the logs, the parent process is waiting form the child to complete handling of the fork buffer (execute necessary callbacks etc) in forkBthBufferWait
but the timeout occurs and that aborts the parent. The child in its turn is hanging forever in fhForkChild1(64)
waiting for LIBC Thread DB Mutex
(which is never released for some reason, not within the parent's timeout of 30s at least).
comment:10 by , 8 years ago
There is this code in __fmutex_release_internal_must_complete
(called from fhForkCHild1
to release the global file handle data mutex):
/* This is a hack to catch lost poke signals. */ if (!ulNesting) { __LIBC_PTHREAD pThrd = __libc_threadCurrentNoAuto(); if (pThrd && pThrd->fSigBeingPoked) __libc_Back_signalLostPoke(); }
And it gets broken right in this "hack" section, while calling __libc_Back_signalLostPoke
(it's a lot of stuff going on there).
comment:11 by , 8 years ago
Well, __libc_Back_signalLostPoke
eventually requests ownership of LIBC Thread DB Mutex
but it is already owned by the thread so it panics with in the deadlock detection due to a recursive mutex but this child's panic isn't seen in the console for some reason and we instead see the panic when it tries to lock this mutex for some other need again (perhaps during termination) but the thread is market as gone already.
comment:12 by , 8 years ago
BTW, a totally non-informative EINVAL (errno 22) returned by fork
seems to be the result of __libc_native2errno
not knowing about ERROR_TIMEOUT (640). Shame.
comment:13 by , 8 years ago
I made LIBC return ETIMEDOUT on ERROR_TIMEOUT and latest testing shows that this is exactly the case. However the deadlock detector barfs not on the recursion as I guessed above but on the dead owner. And now I know exactly what's going on and why it doesn't happen all the time.
The fhForkChild1
callback does its part of the job for setting up inherited file handles in the child process and then releases the LIBC SYS Filehandle Mutex
. This mutex is a "must complete" one and when the must complete section is left at mutex release, there is a check for handling lost poke signals (the one marked as a "hack" above). If there are no lost signals, then nothing is done and all works smoothly. However if there is some signal being poked at that time, then __libc_Back_signalLostPoke
gets called in the child context. This call, among other things, results into enumerating all child's threads to perform some signal work. The enumeration function locks the LIBC Thread DB Mutex
at its beginning. However, the mutex data at this stage is still a raw copy of the parent's LIBC Thread DB Mutex
in the locked state (i.e. it's state its set to owned and the owner is the parent process). The attempt to lock it in the child process results into a wait cycle since the code thinks the mutex is owned. But in the child process this mutex is not actually owned so there is no-one to release it and this wait cycle lasts forever.
The parent, in turn, waits for the child to complete its part of the job (including a call to fhForkChild1
with a timeout (30 seconds) and if this timeout expires, it returns ETIMEOUT (used to be EINVAL) to the caller and this error eventually pops up as the fork
result in the application code. Hence the Cannot fork
message in dash.
The LIBC PANIC message also has its explanation. When the parent terminates with the ETIMEOUT error, the child is still waiting for its LIBC Thread DB Mutex
(in short intervals of 3s). And upon the next timeout it detects that the process marked as the owner (i.e. the parent) is unexpectedly dead and aborts with the panic message.
There is a special fork callback, threadForkCompletion
, that is supposed to clean up this mutex in the child process after the fork operation is complete by marking it as unowned and resetting the owner to 0. However, this callback is not yet called by the time fhForkChild1
does its job (its execution is scheduled to be run at the very end of the fork call) and hence the garbage in the mutex data.
The reason why it doesn't always fail is because there isn't always a lost signal and in such case no Thread DB Mutex
is accessed and therefore no problems. May be it's the way how dash sets up signals or something else which explains why it only happens with dash - this needs to be checked. But anyway, it's a clear bug in the LIBC code that needs to be fixed. We should detect such a situation and don't attempt to catch lost poke signals. I will come up with a solution shortly.
comment:14 by , 8 years ago
Further investigation revealed the following. The thing is that not only the mutex gets a copy of parent data but all global static objects as well, of course. This includes the LIBC thread data, and in particular the gfPreAllocThrd
static struct that stores data for the main thread. As it's copied over from the parent to the child as well, its fSigBeingPoked
member contains what was set in the parent. And if the parent has some lost pending signals when a new fork happens, this flag is set to 1. So 1 will be set in the child process too even though there are no any lost signals in the child since it hasn't reached it's execution point yet.
In case of dash, the parent sometimes receives SIGCHLD from a previous fork when doing a new one and this SIGCHLD sets fSigBeingPoked
to 1 and causes __libc_Back_signalLostPoke
to be called in the child that leads to the above failure.
And it's not only about fSigBeingPoked
, there are other members in the LIBC thread struct and they will also contain values from the parent which may not be accurate. This actually relates to all thread database If I read the code correctly. We need a way to reset it in the child (via a fork child callback, similar to fhForkChild1
).
comment:15 by , 8 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
This was fixed in LIBC, the patch is here: http://trac.netlabs.org/libc/ticket/363. Now dash seems to work flawlessly.
Just got a similar crash when configuring ffmpeg as a test for #104 (that part worked well):