Opened 8 years ago

Closed 8 years ago

#113 closed defect (fixed)

dash: nasm configure crashes

Reported by: dmik Owned by:
Priority: major Milestone:
Component: dash Version:
Severity: medium Keywords:
Cc:

Description

If you run nasm configure from under dash, you will get the following at the end:

configure: creating ./config.status
config.status: creating Makefile
config.status: creating rdoff/Makefile
config.status: creating doc/Makefile
config.status: creating config.h
config.status: executing default-1 commands
D:/Coding/ports/nasm/test-dash/configure: 45: D:/Coding/ports/nasm/test-dash/configure: Cannot fork

LIBC PANIC!!
fmutex deadlock: Owner died!
0x1af4c5b0: Owner=0xec370001 Self=0xf3980001 fs=0x3 flags=0x0 hev=0x00010006
            Desc="LIBC Thread DB Mutex"
pid=0xf398 ppid=0xec27 tid=0x0001 slot=0x00b2 pri=0x0200 mc=0x0001 ps=0x0010
C:\USR\BIN\SH.EXE
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.

_fmutex operation failed:  LIBC Heap request

Killed by SIGABRT
pid=0xf398 ppid=0xec27 tid=0x0001 slot=0x00b2 pri=0x0200 mc=0x0002 ps=0x0017
C:\USR\BIN\SH.EXE
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.

This has something to do with forking (dash has some additional code there compared to ash).

Attachments (2)

libc_0c73.log (168.5 KB ) - added by dmik 8 years ago.
libc_0c74.log (10.4 KB ) - added by dmik 8 years ago.

Download all attachments as: .zip

Change History (17)

comment:1 by dmik, 8 years ago

Just got a similar crash when configuring ffmpeg as a test for #104 (that part worked well):

License: GPL version 2 or later
Creating config.mak, config.h, and doc/config.texi...
D:/Coding/ports/ffmpeg/master/configure: 1: eval: Cannot fork

LIBC PANIC!!
fmutex deadlock: Owner died!
0x1af4c5b0: Owner=0x71320001 Self=0x71340001 fs=0x3 flags=0x0 hev=0x00010006
            Desc="LIBC Thread DB Mutex"
pid=0x7134 ppid=0x5be4 tid=0x0001 slot=0x00a1 pri=0x0200 mc=0x0001 ps=0x0010
C:\USR\BIN\SH.EXE
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.

_fmutex operation failed:  LIBC Heap request
 
Killed by SIGABRT
pid=0x7134 ppid=0x5be4 tid=0x0001 slot=0x00a1 pri=0x0200 mc=0x0002 ps=0x0017
C:\USR\BIN\SH.EXE
Process dumping was disabled, use DUMPPROC / PROCDUMP to enable it.

comment:2 by dmik, 8 years ago

The fun thing that only the release build of dash fails, the debug build works...

comment:3 by dmik, 8 years ago

Note that it also crashes when running Makefiles (while calling the install program for instance). It's some generic problem. However, a quick look doesn't reveal any problems. A deeper investigation is needed.

comment:4 by dmik, 8 years ago

This issue is really floating. I tried to enable trace debugging in dash, also tried manual printf, also tried libc with logging — in all cases the problem just vanishes and all works as designed. Sometimes it re-appears when I remove all this additional debugging, sometimes — not. The problem seems to be related to the parent/child fork callback execution during fork(). In particular, the "Thread DB Mutex" gets requested in threadForkParent1 and released in threadForkCompletion. However, somehow the child process already ends by that time w/o the mutex release and the next attempt to lock it in the parent process fails with the above assertion (it's just my guess, since I'm unable to reproduce it with logging enabled as I already mentioned). The child process seems to have run all code it was intended to, so this is a process termination problem as I see it now (which is of course not nice as it breaks outer scripts, e.g. rpmbuild).

I made the "Cannot fork" message little bit more verbose in r1659 (it prints errno text now) and also added turning on interrupts (snooped it in ash sources which dash is based upon). Let's see how it behaves for others (will make an RPM for it after also updating the code base to version 0.5.9 released in June).

comment:5 by dmik, 8 years ago

I got very good libc debug logs from Elbert and they indicate that the error comes from one of the functions that the parent initiates in the child process. This function returns ENOMEM and this causes the assertion to be triggered in the loop that handles packets from the parent. I'm trying to find out which function returns ENOMEM. It's not clear from the logs which function it is. According to the logs, it's certainly called from a pfnAtFork callback run byforkBthProcessModules and its is address missing from libc.map. It is also pretty clear that pfnAtFork in this case is not the default _atfork_callback used by EXEs and DLLs generated by gcc. So most likely we have some 3rd party DLL that provides its own pfnAtFork implementation and this implementation requests to call a function within the child process via pfnInvoke that fails with ENOMEM and causes an assertion in the child (which then causes Cannot fork).

I have to extend logging in LIBC to make it more clear.

by dmik, 8 years ago

Attachment: libc_0c73.log added

by dmik, 8 years ago

Attachment: libc_0c74.log added

comment:6 by dmik, 8 years ago

This turns out to be a double LIBC issue (one is the normal LIBC and the other one is LIBC with logging loaded in LIBPATHSTRICT=T mode). I can easily reproduce this crash locally when using two different LIBC DLLs because both get registered as modules and both try to initialize the same things (and the second attempt fails with ENOMEM). What I don't fully understand is how another LIBC version appears among the registered DLLs - most likely, it is dragged in by other (already loaded) DLLs that are not affected by LIBPATHSTRICT=T). Sigh. We have to go back to the non logging LIBC version.

comment:7 by dmik, 8 years ago

Elbert sent me the new logs and now I see that dash does a series of fork calls that work then it does exec and that exec fails in the child because the hidden kLIBC argument that libc passes on to its children started with exec doesn't have a flags byte preceding it. This may or may not be a reason for a dash failure but it apparently shadows the failure I'm currently investigating - because it triggers an assertion that is inactive in the normal builds.

comment:8 by dmik, 8 years ago

The above problem is fixed in http://trac.netlabs.org/libc/ticket/362. Elbert produced a different set of logs for me (with this exec/spawn assertion disabled) and these revealed another assertion, this time in fork code:

0017516f 01 06 0019 Asrt: Assertion Failed!!!
0017516f 01 06 0019 Asrt: Function: forkBthBufferWait
0017516f 01 06 0019 Asrt: File:     D:/Coding/libc/libc-0.6/src/emx/src/lib/sys/libcfork.c
0017516f 01 06 0019 Asrt: Line:     2406
0017516f 01 06 0019 Asrt: Expr:     0
0017516f 01 06 0019 Asrt: DosRequestMutexSem(0x800100d0, 30000) -> 640

This I believe is what weI see in normal release builds of dash, i.e. Cannot fork (Invalid argument). There is a lot of logs and I need to sort them out.

comment:9 by dmik, 8 years ago

According to the logs, the parent process is waiting form the child to complete handling of the fork buffer (execute necessary callbacks etc) in forkBthBufferWait but the timeout occurs and that aborts the parent. The child in its turn is hanging forever in fhForkChild1(64) waiting for LIBC Thread DB Mutex (which is never released for some reason, not within the parent's timeout of 30s at least).

comment:10 by dmik, 8 years ago

There is this code in __fmutex_release_internal_must_complete (called from fhForkCHild1 to release the global file handle data mutex):

    /* This is a hack to catch lost poke signals. */
    if (!ulNesting)
    {
        __LIBC_PTHREAD pThrd = __libc_threadCurrentNoAuto();
        if (pThrd && pThrd->fSigBeingPoked)
            __libc_Back_signalLostPoke();
    }

And it gets broken right in this "hack" section, while calling __libc_Back_signalLostPoke (it's a lot of stuff going on there).

comment:11 by dmik, 8 years ago

Well, __libc_Back_signalLostPoke eventually requests ownership of LIBC Thread DB Mutex but it is already owned by the thread so it panics with in the deadlock detection due to a recursive mutex but this child's panic isn't seen in the console for some reason and we instead see the panic when it tries to lock this mutex for some other need again (perhaps during termination) but the thread is market as gone already.

comment:12 by dmik, 8 years ago

BTW, a totally non-informative EINVAL (errno 22) returned by fork seems to be the result of __libc_native2errno not knowing about ERROR_TIMEOUT (640). Shame.

Last edited 8 years ago by dmik (previous) (diff)

comment:13 by dmik, 8 years ago

I made LIBC return ETIMEDOUT on ERROR_TIMEOUT and latest testing shows that this is exactly the case. However the deadlock detector barfs not on the recursion as I guessed above but on the dead owner. And now I know exactly what's going on and why it doesn't happen all the time.

The fhForkChild1 callback does its part of the job for setting up inherited file handles in the child process and then releases the LIBC SYS Filehandle Mutex. This mutex is a "must complete" one and when the must complete section is left at mutex release, there is a check for handling lost poke signals (the one marked as a "hack" above). If there are no lost signals, then nothing is done and all works smoothly. However if there is some signal being poked at that time, then __libc_Back_signalLostPoke gets called in the child context. This call, among other things, results into enumerating all child's threads to perform some signal work. The enumeration function locks the LIBC Thread DB Mutex at its beginning. However, the mutex data at this stage is still a raw copy of the parent's LIBC Thread DB Mutex in the locked state (i.e. it's state its set to owned and the owner is the parent process). The attempt to lock it in the child process results into a wait cycle since the code thinks the mutex is owned. But in the child process this mutex is not actually owned so there is no-one to release it and this wait cycle lasts forever.

The parent, in turn, waits for the child to complete its part of the job (including a call to fhForkChild1 with a timeout (30 seconds) and if this timeout expires, it returns ETIMEOUT (used to be EINVAL) to the caller and this error eventually pops up as the fork result in the application code. Hence the Cannot fork message in dash.

The LIBC PANIC message also has its explanation. When the parent terminates with the ETIMEOUT error, the child is still waiting for its LIBC Thread DB Mutex (in short intervals of 3s). And upon the next timeout it detects that the process marked as the owner (i.e. the parent) is unexpectedly dead and aborts with the panic message.

There is a special fork callback, threadForkCompletion, that is supposed to clean up this mutex in the child process after the fork operation is complete by marking it as unowned and resetting the owner to 0. However, this callback is not yet called by the time fhForkChild1 does its job (its execution is scheduled to be run at the very end of the fork call) and hence the garbage in the mutex data.

The reason why it doesn't always fail is because there isn't always a lost signal and in such case no Thread DB Mutex is accessed and therefore no problems. May be it's the way how dash sets up signals or something else which explains why it only happens with dash - this needs to be checked. But anyway, it's a clear bug in the LIBC code that needs to be fixed. We should detect such a situation and don't attempt to catch lost poke signals. I will come up with a solution shortly.

Last edited 8 years ago by dmik (previous) (diff)

comment:14 by dmik, 8 years ago

Further investigation revealed the following. The thing is that not only the mutex gets a copy of parent data but all global static objects as well, of course. This includes the LIBC thread data, and in particular the gfPreAllocThrd static struct that stores data for the main thread. As it's copied over from the parent to the child as well, its fSigBeingPoked member contains what was set in the parent. And if the parent has some lost pending signals when a new fork happens, this flag is set to 1. So 1 will be set in the child process too even though there are no any lost signals in the child since it hasn't reached it's execution point yet.

In case of dash, the parent sometimes receives SIGCHLD from a previous fork when doing a new one and this SIGCHLD sets fSigBeingPoked to 1 and causes __libc_Back_signalLostPoke to be called in the child that leads to the above failure.

And it's not only about fSigBeingPoked, there are other members in the LIBC thread struct and they will also contain values from the parent which may not be accurate. This actually relates to all thread database If I read the code correctly. We need a way to reset it in the child (via a fork child callback, similar to fhForkChild1).

comment:15 by dmik, 8 years ago

Resolution: fixed
Status: newclosed

This was fixed in LIBC, the patch is here: http://trac.netlabs.org/libc/ticket/363. Now dash seems to work flawlessly.

Note: See TracTickets for help on using tickets.