Opened 10 years ago

Closed 10 years ago

#72 closed defect (fixed)

QProcess cannot start applications of other types directly

Reported by: dmik Owned by: dmik
Priority: minor Milestone: Qt Beta 3
Component: QtCore Version: 4.5.1 Beta 1
Severity: Keywords:
Cc:

Description

An attempt to start a PM application from a console Qt application will just fail because DosExecPgm? (used by spawnvpe() we use in QProcess) cannot start applications of different type.

One of the possible solutions is to start such applications indirectly through cmd.exe ("cmd.exe /c app.exe") which knows how to create new sessions of different type.

Change History (31)

comment:1 Changed 10 years ago by dmik

Not a blocker, since we haven't decided about its importance for real apps yet.

comment:2 Changed 10 years ago by diver

as the app type is easy to find out i would go the cmd /c app.exe way.
if we fix this, then we have to fix the redirect problem as well.

comment:3 Changed 10 years ago by dmik

  • Owner set to dmik
  • Status changed from new to accepted

comment:4 Changed 10 years ago by dmik

  • Priority changed from major to blocker

Making this a blocker; there are many obvious cases when it may be necessary to do at least "cmd /c".

Besides, I'll track other issues in this defect. For example, I discovered that there is no "process terminated" notification after a successful startup when using non-blocking asynchronous approach (i.e. through signals rather than through QProcess::waitFor*() functions). The reason is that nobody checks for process termination in this case. On Win32, they wait on the handle of the process; on Unix, they use a dedicated pipe + SIGCHLD + QSocketNotifier on that pipe to detect the death or termination.

In Qt3 for OS/2 we used a 100ms timer for that. We will probably have to do the same in Qt4 as well since I'm not sure kLIBC delivers SIGCHLD. Note that using DosWaitChild?() would require a dedicated thread which I'd also want to avoid.

comment:5 Changed 10 years ago by dmik

I've wrote a sample program that just writes "test" to stdout and nothing more. When I run this program through QProcess, it sometimes fails to write to stdout (which is redirected to a local socket by QProcess) with the following error:

Socket operation on non-socket
]}}
And the parent Qt process doesn't get anything in this case.

This doesn't always happen, sometimes the text is successfully written by the test program and successfully read by the Qt parent process. This suggests that it can be a timing issue but it makes it even worse. No ideas so far.

comment:6 Changed 10 years ago by dmik

I've wrote a sample program that just writes "test" to stdout and nothing more. When I run this program through QProcess, it sometimes fails to write to stdout (which is redirected to a local socket by QProcess) with the following error:

Socket operation on non-socket

And the parent Qt process doesn't get anything in this case.

This doesn't always happen, sometimes the text is successfully written by the test program and successfully read by the Qt parent process. This suggests that it can be a timing issue but it makes it even worse. No ideas so far.

comment:7 Changed 10 years ago by dmik

SIGCHLD seems to be supported by kLIBC BTW. So we may use it regardless of whether we will end up using native pipes or not.

comment:8 Changed 10 years ago by diver

i'm not sure if this is related, but on samba we had to use ioctl() to set blocking/non-blocking sockets. as libc seems to have a bug in there.

see: http://svn.netlabs.org/samba/changeset/170 in the util.c

comment:9 Changed 10 years ago by dmik

Thanks for the hint, but according to my tests (see #70), the blocked state of the file descriptor doesn't influence the results.

comment:10 Changed 10 years ago by dmik

Implemented the detection of the process termination in r190. I had to use a dedicated thread for that since it's not safe to call complex functions (e.g. QMetaObject::invokeMethod()) from a signal handler because of the reentrancy. This thread will also become useful for native pipes once we decide to use them.

comment:11 Changed 10 years ago by dmik

Finally, I managed to build a custom libc064x build so that I can now properly test the internals.

comment:12 Changed 10 years ago by dmik

In TCP/IP 4.21 (TCPIP32.DLL, referred to as TCP/IP V4.4 in kLIBC headers), there is no such definition as socket inheritance. To make passing sockets from the parent to the child possible, they had to make all socket IDs global. By default, all sockets opened by the process are automatically closed in the exit handler. To avoid this, ownership must be explicitly passed to the child using removesocketfromlist() and accepted by the child using addsockettolist().

I've got a suspicion that the race is between these two calls.

comment:13 Changed 10 years ago by dmik

Ok, I found it. A race in kLIBC indeed. Here is what we get when everything goes fine (indented is the child process):

// parent: created a new socket (fd=5), refcnt=1/1 (process/global)
0229ed5f 01 03 0018 Entr 0000 __libc_spmSocketNew: iSocket=2751
0229ed5f 01 03 0018 Leav 0000 __libc_spmSocketNew (0 ms): ret 0 (0x0)

// created a duplicate with fd=1, refcnt=2/2
0229edaa 01 03 0018 Entr 0000 __libc_spmSocketRef: iSocket=2751
0229edaa 01 03 0018 Leav 0000 __libc_spmSocketRef (0 ms): ret 131074 (0x20002)

    // child: inherited the socket through fd=1, refcnt=1/3, called addsockettolist()
    0229ee36 01 02 0018 Entr 0000 __libc_spmSocketRef: iSocket=2751
    0229ee36 01 02 0018 Leav 0000 __libc_spmSocketRef (0 ms): ret 65539 (0x10003)

// parent: closed fd=1, refcnt=1/2
0229eed0 01 03 0018 Entr 0000 __libc_spmSocketDeref: iSocket=2751
0229eed0 01 03 0018 Leav 0000 __libc_spmSocketDeref (0 ms): ret 65538 (0x10002)

// closed fd=5, refcnt=0/1 (global isn't 0, so removesocketfromlist() is called to prevent real closure
0229eed2 01 03 0018 Entr 0000 __libc_spmSocketDeref: iSocket=2751
0229eed2 01 03 0018 Leav 0000 __libc_spmSocketDeref (0 ms): ret 1 (0x1)

    // child: closed fd=1, refcnt=0/0, perform real closure
    0229efb3 01 05 0018 Entr 0000 __libc_spmSocketDeref: iSocket=2751
    0229efb3 01 05 0018 Leav 0000 __libc_spmSocketDeref (0 ms): ret 0 (0x0)

And here is a log of the failed case:

// parent: created a new socket (fd=10), refcnt=1/1 (process/global)
01c99f11 01 03 0018 Entr 0002 __libc_spmSocketNew: iSocket=2581
01c99f11 01 03 0018 Leav 0002 __libc_spmSocketNew (0 ms): ret 0 (0x0)

// created a duplicate with fd=1, refcnt=2/2
01c99f1d 01 03 0018 Entr 0002 __libc_spmSocketRef: iSocket=2581
01c99f1d 01 03 0018 Leav 0002 __libc_spmSocketRef (0 ms): ret 131074 (0x20002)

// *** here the fun part starts:
// closed fd=1, refcnt=1/1
01c99f6d 01 03 0018 Entr 0002 __libc_spmSocketDeref: iSocket=2581
01c99f6d 01 03 0018 Leav 0002 __libc_spmSocketDeref (0 ms): ret 65537 (0x10001)

// closed fd=10, refcnt=0/0, the socket is really closed before child gets a chance to reference it!
01c99f6d 01 03 0018 Entr 0002 __libc_spmSocketDeref: iSocket=2581
01c99f6d 01 03 0018 Leav 0002 __libc_spmSocketDeref (0 ms): ret 0 (0x0)

    // and here is the late child trying to inherit a non-existent socket
    01c9a023 01 02 0018 Entr 0000 __libc_spmSocketRef: iSocket=2581
    01c9a023 01 02 0018 Leav 0000 __libc_spmSocketRef (0 ms): ret 65537 (0x10001)

    // then it tries to close its fd=1 and asserts:
    01c9ace8 01 04 001b Asrt: Assertion Failed!!!
    01c9ace8 01 04 001b Asrt: Function: __libc_tcpipCleanup44
    01c9ace8 01 04 001b Asrt: File:     D:/Coding/libc-0.6/src/emx/src/lib/sys/tcpipver.c
    01c9ace8 01 04 001b Asrt: Line:     1376
    01c9ace8 01 04 001b Asrt: Expr:     !rc
    01c9ace8 01 04 001b Asrt: soclose(2581) -> rc=-1 sock_errno()->10038

    01c9ace8 01 05 0018 Entr 0026 __libc_spmSocketDeref: iSocket=2581
    01c9ace8 01 05 0018 Leav 0026 __libc_spmSocketDeref (0 ms): ret 0 (0x0)

So, apparently it sometimes happens that the child is too slow to fire up and increase the global reference count on sockets it inherits before the parent decides the baby has been started and closes them on its side.

I believe I saw some code that waits for child initialization of the kLIBC part (including inheritance and stuff) in case of a successful DosExecPgm? call but either it is not working well or I'm mistaken. I'll look at it again.

comment:14 Changed 10 years ago by dmik

BTW, there is a different strategy regarding socket reference counting: the spawn* code can increase the global counter by one for all inherited sockets (those that don't have FD_CLOEXEC set) *before* issuing DosExecPgm?, decrease it back on failure or leave this task for the child in case if DosExecPgm? succeeds.

comment:15 Changed 10 years ago by dmik

libc_spmWaitForChildToBecomeAlive() waits for maximum 130 ms until the LIBC child has done termination; if it hasn't and 130 ms is expired, spawn* simply returns. According to the logs, both in the successful and in the failed case spawn* timed out, but luckily, once the child was fast enough to increase socket reference counters before the parent decreases them. In the second case, the wait is aborted even before the LIBC init routine gets control in the child.

I suppose Knut had some reasons for setting such a small timeout but they at least don't count in the case of a virtualized hardware.

comment:16 Changed 10 years ago by dmik

By termination I mean initialization of course. And no, the formatting is not meant too.

comment:17 Changed 10 years ago by dmik

Okay, increasing the LIBC child timeout from 130 to 1300 ms (and any child timeout from 8 to 800 ms) solves the race condition in my virtualized environment so that I don't get "Socket operation on non-socket" any more.

I also found a defect in the socket select thread in Qt (QEventDispatcherPM): it would keep posting "socket ready" messages to the main thread non-stop, w/o actually giving it a chance to execute and process them. As a result, weird things could happen (EPIPE followed by unexpected child process termination and so on). This defect is fixed now (by r196).

In my environment, i.e. with patched LIBC and r196, all tests now work including the smplayer/mplayer pair (I had to patch core.cpp a bit more to get rid of invalid options and also selected "Run mplayer in its own window" since Qt integration didn't work (didn't investigate this).

Silvan, please try the new code. It should work with the normal LIBC DLL most of the time. If you experience some oddities, tell me and I'll give you the patched one for testing.

Regarding the timeouts, I'll try to ask Knut.

comment:18 Changed 10 years ago by diver

i just built qt new and tryed smplayer/mplayer again. for me it still doesn't work.
it gets now further than it used to. i get all messages about the file and also the start playing message. and usually i see 1-2 seconds of the movie. but then nothing works anymore. i have the feeling that the fast processing of stdout lines still has some timing issues. as mplayer sends a lot of those when playing a movie.

the qt integration for smplayer needs a patch, which i did not send you afaik.

comment:19 Changed 10 years ago by dmik

Could you be more verbose plz? Do you use the "Run mplayer in its own window" mode? What do you mean by "nothing works anymore"?

comment:20 Changed 10 years ago by diver

as you already know all works, when changed to PM app instead of console app.

comment:21 Changed 10 years ago by diver

  • Milestone changed from Qt GA to Qt Beta 3

comment:22 Changed 10 years ago by dmik

Now I think that going the 'cmd /c' way is not really good since it will return a PID of CMD.exe instead of the PID of the actual application. This is definitely against the programmer's expectations and makes things such as custom IPC between the parent and the child difficult or impossible (because there will be no trusted child identification).

It seems that we will have to implement P_SESSION in kLIBC as well to overcome this limitation. This, as well as #71, relates to http://svn.netlabs.org/libc/ticket/121.

comment:23 Changed 10 years ago by diver

as it's not possible to get the rigfht PID with the cmd /c way i guess a fixed/changed libc is really needed. could you do the patch? and apply to the libc ticket 121? and we probably need to ship a custom libc as long as the patch is not included.

comment:24 Changed 10 years ago by dmik

On the second thought, It actually may be possible to find a child PID of the started PID. (DosQuerySysInfo? or how it is called).

What frightens me in case of patching LIBC is that I don't know how tightly spawn() mechanics relies on the parent-child relationship (inheritance, signal handling, waitpid() functionality): the thing is that DosStartSession?() doesn't provide this kind of relationship between the main and the started executable. I'll read the EMX docs and LIBC sources closer.

comment:25 Changed 10 years ago by dmik

Some results of hacking kLIBC with DosStartSession?. Handle inheritance works in principle (why shouldn't it), but I found the following difficulties:

  • DosStartSession? returns control to the caller much later than DosExecPgm?, which means that the child has already passed the LIBC/SPM init routines by that time and it didn't recognize that it was our child (because we didn't update the shared structure with its PID used by it as a key). A particular result of that is that doesn't call the inheritance routines. As a temporary workaround, I added some sleep to the SPM init code which is of course not good and needs a better solution.
  • SIGCHLD doesn't get generated. We may need a helper thread that waits for the system queue and emulates this signal when it sees the child's death.

comment:26 Changed 10 years ago by dmik

All of the above looks like doable but I'll probably have to delay Beta 3 for 1-2 days. We'll see tomorrow.

comment:27 Changed 10 years ago by diver

delay of beta3 is not a big problem. better have a more feature rich Qt :) and this fix also help others using libc.

comment:28 Changed 10 years ago by diver

  • Milestone changed from Qt Beta 3 to Qt Enhanced
  • Priority changed from blocker to minor

as it's not a wide used scenario to start a PM app from a console app we move it to Qt enhanced.

comment:29 Changed 10 years ago by dmik

  • Milestone changed from Qt Enhanced to Qt Beta 3

comment:30 Changed 10 years ago by dmik

Indeed, the situation is such that we only have problems when starting PM applications from non-PM applications (DosExecPgm?() returns NO_ERROR but the application isn't actually started). On the contrary, starting non-PM (at least WINDOWCOMPAT) apps from pm apps works well with DosExecPgm?() (although totally against the documentation).

Since the case of starting PM apps from non-PM apps is qiute a rare in practice, I think it isn't worth a complete kLIBC patch at the moment.

Besides, in r217 I implemented the workarounds both for starting PM apps from non-PM apps (in the form of "cmd /c") and for detaching programs (#71) using "cmd /c detach". This still returns an incorrect PID (pid of cmd.exe) but that's ok for the present time.

I created #85 for tracking the kLIBC patch and further improvemets of QProcess.

comment:31 Changed 10 years ago by dmik

  • Resolution set to fixed
  • Status changed from accepted to closed
Note: See TracTickets for help on using tickets.