Opened 15 years ago

Closed 15 years ago

#72 closed defect (fixed)

QProcess cannot start applications of other types directly

Reported by: Dmitry A. Kuminov Owned by: Dmitry A. Kuminov
Priority: minor Milestone: Qt Beta 3
Component: QtCore Version: 4.5.1 Beta 1
Severity: Keywords:
Cc:

Description

An attempt to start a PM application from a console Qt application will just fail because DosExecPgm? (used by spawnvpe() we use in QProcess) cannot start applications of different type.

One of the possible solutions is to start such applications indirectly through cmd.exe ("cmd.exe /c app.exe") which knows how to create new sessions of different type.

Change History (31)

comment:1 Changed 15 years ago by Dmitry A. Kuminov

Not a blocker, since we haven't decided about its importance for real apps yet.

comment:2 Changed 15 years ago by Silvan Scherrer

as the app type is easy to find out i would go the cmd /c app.exe way.
if we fix this, then we have to fix the redirect problem as well.

comment:3 Changed 15 years ago by Dmitry A. Kuminov

Owner: set to Dmitry A. Kuminov
Status: newaccepted

comment:4 Changed 15 years ago by Dmitry A. Kuminov

Priority: majorblocker

Making this a blocker; there are many obvious cases when it may be necessary to do at least "cmd /c".

Besides, I'll track other issues in this defect. For example, I discovered that there is no "process terminated" notification after a successful startup when using non-blocking asynchronous approach (i.e. through signals rather than through QProcess::waitFor*() functions). The reason is that nobody checks for process termination in this case. On Win32, they wait on the handle of the process; on Unix, they use a dedicated pipe + SIGCHLD + QSocketNotifier on that pipe to detect the death or termination.

In Qt3 for OS/2 we used a 100ms timer for that. We will probably have to do the same in Qt4 as well since I'm not sure kLIBC delivers SIGCHLD. Note that using DosWaitChild?() would require a dedicated thread which I'd also want to avoid.

comment:5 Changed 15 years ago by Dmitry A. Kuminov

I've wrote a sample program that just writes "test" to stdout and nothing more. When I run this program through QProcess, it sometimes fails to write to stdout (which is redirected to a local socket by QProcess) with the following error:

Socket operation on non-socket
]}}
And the parent Qt process doesn't get anything in this case.

This doesn't always happen, sometimes the text is successfully written by the test program and successfully read by the Qt parent process. This suggests that it can be a timing issue but it makes it even worse. No ideas so far.

comment:6 Changed 15 years ago by Dmitry A. Kuminov

I've wrote a sample program that just writes "test" to stdout and nothing more. When I run this program through QProcess, it sometimes fails to write to stdout (which is redirected to a local socket by QProcess) with the following error:

Socket operation on non-socket

And the parent Qt process doesn't get anything in this case.

This doesn't always happen, sometimes the text is successfully written by the test program and successfully read by the Qt parent process. This suggests that it can be a timing issue but it makes it even worse. No ideas so far.

comment:7 Changed 15 years ago by Dmitry A. Kuminov

SIGCHLD seems to be supported by kLIBC BTW. So we may use it regardless of whether we will end up using native pipes or not.

comment:8 Changed 15 years ago by Silvan Scherrer

i'm not sure if this is related, but on samba we had to use ioctl() to set blocking/non-blocking sockets. as libc seems to have a bug in there.

see: http://svn.netlabs.org/samba/changeset/170 in the util.c

comment:9 Changed 15 years ago by Dmitry A. Kuminov

Thanks for the hint, but according to my tests (see #70), the blocked state of the file descriptor doesn't influence the results.

comment:10 Changed 15 years ago by Dmitry A. Kuminov

Implemented the detection of the process termination in r190. I had to use a dedicated thread for that since it's not safe to call complex functions (e.g. QMetaObject::invokeMethod()) from a signal handler because of the reentrancy. This thread will also become useful for native pipes once we decide to use them.

comment:11 Changed 15 years ago by Dmitry A. Kuminov

Finally, I managed to build a custom libc064x build so that I can now properly test the internals.

comment:12 Changed 15 years ago by Dmitry A. Kuminov

In TCP/IP 4.21 (TCPIP32.DLL, referred to as TCP/IP V4.4 in kLIBC headers), there is no such definition as socket inheritance. To make passing sockets from the parent to the child possible, they had to make all socket IDs global. By default, all sockets opened by the process are automatically closed in the exit handler. To avoid this, ownership must be explicitly passed to the child using removesocketfromlist() and accepted by the child using addsockettolist().

I've got a suspicion that the race is between these two calls.

comment:13 Changed 15 years ago by Dmitry A. Kuminov

Ok, I found it. A race in kLIBC indeed. Here is what we get when everything goes fine (indented is the child process):

// parent: created a new socket (fd=5), refcnt=1/1 (process/global)
0229ed5f 01 03 0018 Entr 0000 __libc_spmSocketNew: iSocket=2751
0229ed5f 01 03 0018 Leav 0000 __libc_spmSocketNew (0 ms): ret 0 (0x0)

// created a duplicate with fd=1, refcnt=2/2
0229edaa 01 03 0018 Entr 0000 __libc_spmSocketRef: iSocket=2751
0229edaa 01 03 0018 Leav 0000 __libc_spmSocketRef (0 ms): ret 131074 (0x20002)

    // child: inherited the socket through fd=1, refcnt=1/3, called addsockettolist()
    0229ee36 01 02 0018 Entr 0000 __libc_spmSocketRef: iSocket=2751
    0229ee36 01 02 0018 Leav 0000 __libc_spmSocketRef (0 ms): ret 65539 (0x10003)

// parent: closed fd=1, refcnt=1/2
0229eed0 01 03 0018 Entr 0000 __libc_spmSocketDeref: iSocket=2751
0229eed0 01 03 0018 Leav 0000 __libc_spmSocketDeref (0 ms): ret 65538 (0x10002)

// closed fd=5, refcnt=0/1 (global isn't 0, so removesocketfromlist() is called to prevent real closure
0229eed2 01 03 0018 Entr 0000 __libc_spmSocketDeref: iSocket=2751
0229eed2 01 03 0018 Leav 0000 __libc_spmSocketDeref (0 ms): ret 1 (0x1)

    // child: closed fd=1, refcnt=0/0, perform real closure
    0229efb3 01 05 0018 Entr 0000 __libc_spmSocketDeref: iSocket=2751
    0229efb3 01 05 0018 Leav 0000 __libc_spmSocketDeref (0 ms): ret 0 (0x0)

And here is a log of the failed case:

// parent: created a new socket (fd=10), refcnt=1/1 (process/global)
01c99f11 01 03 0018 Entr 0002 __libc_spmSocketNew: iSocket=2581
01c99f11 01 03 0018 Leav 0002 __libc_spmSocketNew (0 ms): ret 0 (0x0)

// created a duplicate with fd=1, refcnt=2/2
01c99f1d 01 03 0018 Entr 0002 __libc_spmSocketRef: iSocket=2581
01c99f1d 01 03 0018 Leav 0002 __libc_spmSocketRef (0 ms): ret 131074 (0x20002)

// *** here the fun part starts:
// closed fd=1, refcnt=1/1
01c99f6d 01 03 0018 Entr 0002 __libc_spmSocketDeref: iSocket=2581
01c99f6d 01 03 0018 Leav 0002 __libc_spmSocketDeref (0 ms): ret 65537 (0x10001)

// closed fd=10, refcnt=0/0, the socket is really closed before child gets a chance to reference it!
01c99f6d 01 03 0018 Entr 0002 __libc_spmSocketDeref: iSocket=2581
01c99f6d 01 03 0018 Leav 0002 __libc_spmSocketDeref (0 ms): ret 0 (0x0)

    // and here is the late child trying to inherit a non-existent socket
    01c9a023 01 02 0018 Entr 0000 __libc_spmSocketRef: iSocket=2581
    01c9a023 01 02 0018 Leav 0000 __libc_spmSocketRef (0 ms): ret 65537 (0x10001)

    // then it tries to close its fd=1 and asserts:
    01c9ace8 01 04 001b Asrt: Assertion Failed!!!
    01c9ace8 01 04 001b Asrt: Function: __libc_tcpipCleanup44
    01c9ace8 01 04 001b Asrt: File:     D:/Coding/libc-0.6/src/emx/src/lib/sys/tcpipver.c
    01c9ace8 01 04 001b Asrt: Line:     1376
    01c9ace8 01 04 001b Asrt: Expr:     !rc
    01c9ace8 01 04 001b Asrt: soclose(2581) -> rc=-1 sock_errno()->10038

    01c9ace8 01 05 0018 Entr 0026 __libc_spmSocketDeref: iSocket=2581
    01c9ace8 01 05 0018 Leav 0026 __libc_spmSocketDeref (0 ms): ret 0 (0x0)

So, apparently it sometimes happens that the child is too slow to fire up and increase the global reference count on sockets it inherits before the parent decides the baby has been started and closes them on its side.

I believe I saw some code that waits for child initialization of the kLIBC part (including inheritance and stuff) in case of a successful DosExecPgm? call but either it is not working well or I'm mistaken. I'll look at it again.

comment:14 Changed 15 years ago by Dmitry A. Kuminov

BTW, there is a different strategy regarding socket reference counting: the spawn* code can increase the global counter by one for all inherited sockets (those that don't have FD_CLOEXEC set) *before* issuing DosExecPgm?, decrease it back on failure or leave this task for the child in case if DosExecPgm? succeeds.

comment:15 Changed 15 years ago by Dmitry A. Kuminov

libc_spmWaitForChildToBecomeAlive() waits for maximum 130 ms until the LIBC child has done termination; if it hasn't and 130 ms is expired, spawn* simply returns. According to the logs, both in the successful and in the failed case spawn* timed out, but luckily, once the child was fast enough to increase socket reference counters before the parent decreases them. In the second case, the wait is aborted even before the LIBC init routine gets control in the child.

I suppose Knut had some reasons for setting such a small timeout but they at least don't count in the case of a virtualized hardware.

comment:16 Changed 15 years ago by Dmitry A. Kuminov

By termination I mean initialization of course. And no, the formatting is not meant too.

comment:17 Changed 15 years ago by Dmitry A. Kuminov

Okay, increasing the LIBC child timeout from 130 to 1300 ms (and any child timeout from 8 to 800 ms) solves the race condition in my virtualized environment so that I don't get "Socket operation on non-socket" any more.

I also found a defect in the socket select thread in Qt (QEventDispatcherPM): it would keep posting "socket ready" messages to the main thread non-stop, w/o actually giving it a chance to execute and process them. As a result, weird things could happen (EPIPE followed by unexpected child process termination and so on). This defect is fixed now (by r196).

In my environment, i.e. with patched LIBC and r196, all tests now work including the smplayer/mplayer pair (I had to patch core.cpp a bit more to get rid of invalid options and also selected "Run mplayer in its own window" since Qt integration didn't work (didn't investigate this).

Silvan, please try the new code. It should work with the normal LIBC DLL most of the time. If you experience some oddities, tell me and I'll give you the patched one for testing.

Regarding the timeouts, I'll try to ask Knut.

comment:18 Changed 15 years ago by Silvan Scherrer

i just built qt new and tryed smplayer/mplayer again. for me it still doesn't work.
it gets now further than it used to. i get all messages about the file and also the start playing message. and usually i see 1-2 seconds of the movie. but then nothing works anymore. i have the feeling that the fast processing of stdout lines still has some timing issues. as mplayer sends a lot of those when playing a movie.

the qt integration for smplayer needs a patch, which i did not send you afaik.

comment:19 Changed 15 years ago by Dmitry A. Kuminov

Could you be more verbose plz? Do you use the "Run mplayer in its own window" mode? What do you mean by "nothing works anymore"?

comment:20 Changed 15 years ago by Silvan Scherrer

as you already know all works, when changed to PM app instead of console app.

comment:21 Changed 15 years ago by Silvan Scherrer

Milestone: Qt GAQt Beta 3

comment:22 Changed 15 years ago by Dmitry A. Kuminov

Now I think that going the 'cmd /c' way is not really good since it will return a PID of CMD.exe instead of the PID of the actual application. This is definitely against the programmer's expectations and makes things such as custom IPC between the parent and the child difficult or impossible (because there will be no trusted child identification).

It seems that we will have to implement P_SESSION in kLIBC as well to overcome this limitation. This, as well as #71, relates to http://svn.netlabs.org/libc/ticket/121.

comment:23 Changed 15 years ago by Silvan Scherrer

as it's not possible to get the rigfht PID with the cmd /c way i guess a fixed/changed libc is really needed. could you do the patch? and apply to the libc ticket 121? and we probably need to ship a custom libc as long as the patch is not included.

comment:24 Changed 15 years ago by Dmitry A. Kuminov

On the second thought, It actually may be possible to find a child PID of the started PID. (DosQuerySysInfo? or how it is called).

What frightens me in case of patching LIBC is that I don't know how tightly spawn() mechanics relies on the parent-child relationship (inheritance, signal handling, waitpid() functionality): the thing is that DosStartSession?() doesn't provide this kind of relationship between the main and the started executable. I'll read the EMX docs and LIBC sources closer.

comment:25 Changed 15 years ago by Dmitry A. Kuminov

Some results of hacking kLIBC with DosStartSession?. Handle inheritance works in principle (why shouldn't it), but I found the following difficulties:

  • DosStartSession? returns control to the caller much later than DosExecPgm?, which means that the child has already passed the LIBC/SPM init routines by that time and it didn't recognize that it was our child (because we didn't update the shared structure with its PID used by it as a key). A particular result of that is that doesn't call the inheritance routines. As a temporary workaround, I added some sleep to the SPM init code which is of course not good and needs a better solution.
  • SIGCHLD doesn't get generated. We may need a helper thread that waits for the system queue and emulates this signal when it sees the child's death.

comment:26 Changed 15 years ago by Dmitry A. Kuminov

All of the above looks like doable but I'll probably have to delay Beta 3 for 1-2 days. We'll see tomorrow.

comment:27 Changed 15 years ago by Silvan Scherrer

delay of beta3 is not a big problem. better have a more feature rich Qt :) and this fix also help others using libc.

comment:28 Changed 15 years ago by Silvan Scherrer

Milestone: Qt Beta 3Qt Enhanced
Priority: blockerminor

as it's not a wide used scenario to start a PM app from a console app we move it to Qt enhanced.

comment:29 Changed 15 years ago by Dmitry A. Kuminov

Milestone: Qt EnhancedQt Beta 3

comment:30 Changed 15 years ago by Dmitry A. Kuminov

Indeed, the situation is such that we only have problems when starting PM applications from non-PM applications (DosExecPgm?() returns NO_ERROR but the application isn't actually started). On the contrary, starting non-PM (at least WINDOWCOMPAT) apps from pm apps works well with DosExecPgm?() (although totally against the documentation).

Since the case of starting PM apps from non-PM apps is qiute a rare in practice, I think it isn't worth a complete kLIBC patch at the moment.

Besides, in r217 I implemented the workarounds both for starting PM apps from non-PM apps (in the form of "cmd /c") and for detaching programs (#71) using "cmd /c detach". This still returns an incorrect PID (pid of cmd.exe) but that's ok for the present time.

I created #85 for tracking the kLIBC patch and further improvemets of QProcess.

comment:31 Changed 15 years ago by Dmitry A. Kuminov

Resolution: fixed
Status: acceptedclosed
Note: See TracTickets for help on using tickets.