Opened 15 years ago
Closed 15 years ago
#72 closed defect (fixed)
QProcess cannot start applications of other types directly
Reported by: | Dmitry A. Kuminov | Owned by: | Dmitry A. Kuminov |
---|---|---|---|
Priority: | minor | Milestone: | Qt Beta 3 |
Component: | QtCore | Version: | 4.5.1 Beta 1 |
Severity: | Keywords: | ||
Cc: |
Description
An attempt to start a PM application from a console Qt application will just fail because DosExecPgm (used by spawnvpe() we use in QProcess) cannot start applications of different type.
One of the possible solutions is to start such applications indirectly through cmd.exe ("cmd.exe /c app.exe") which knows how to create new sessions of different type.
Change History (31)
comment:1 by , 15 years ago
comment:2 by , 15 years ago
as the app type is easy to find out i would go the cmd /c app.exe way.
if we fix this, then we have to fix the redirect problem as well.
comment:3 by , 15 years ago
Owner: | set to |
---|---|
Status: | new → accepted |
comment:4 by , 15 years ago
Priority: | major → blocker |
---|
Making this a blocker; there are many obvious cases when it may be necessary to do at least "cmd /c".
Besides, I'll track other issues in this defect. For example, I discovered that there is no "process terminated" notification after a successful startup when using non-blocking asynchronous approach (i.e. through signals rather than through QProcess::waitFor*() functions). The reason is that nobody checks for process termination in this case. On Win32, they wait on the handle of the process; on Unix, they use a dedicated pipe + SIGCHLD + QSocketNotifier on that pipe to detect the death or termination.
In Qt3 for OS/2 we used a 100ms timer for that. We will probably have to do the same in Qt4 as well since I'm not sure kLIBC delivers SIGCHLD. Note that using DosWaitChild() would require a dedicated thread which I'd also want to avoid.
comment:5 by , 15 years ago
I've wrote a sample program that just writes "test" to stdout and nothing more. When I run this program through QProcess, it sometimes fails to write to stdout (which is redirected to a local socket by QProcess) with the following error:
Socket operation on non-socket ]}} And the parent Qt process doesn't get anything in this case. This doesn't always happen, sometimes the text is successfully written by the test program and successfully read by the Qt parent process. This suggests that it can be a timing issue but it makes it even worse. No ideas so far.
comment:6 by , 15 years ago
I've wrote a sample program that just writes "test" to stdout and nothing more. When I run this program through QProcess, it sometimes fails to write to stdout (which is redirected to a local socket by QProcess) with the following error:
Socket operation on non-socket
And the parent Qt process doesn't get anything in this case.
This doesn't always happen, sometimes the text is successfully written by the test program and successfully read by the Qt parent process. This suggests that it can be a timing issue but it makes it even worse. No ideas so far.
comment:7 by , 15 years ago
SIGCHLD seems to be supported by kLIBC BTW. So we may use it regardless of whether we will end up using native pipes or not.
comment:8 by , 15 years ago
i'm not sure if this is related, but on samba we had to use ioctl() to set blocking/non-blocking sockets. as libc seems to have a bug in there.
see: http://svn.netlabs.org/samba/changeset/170 in the util.c
comment:9 by , 15 years ago
Thanks for the hint, but according to my tests (see #70), the blocked state of the file descriptor doesn't influence the results.
comment:10 by , 15 years ago
Implemented the detection of the process termination in r190. I had to use a dedicated thread for that since it's not safe to call complex functions (e.g. QMetaObject::invokeMethod()) from a signal handler because of the reentrancy. This thread will also become useful for native pipes once we decide to use them.
comment:11 by , 15 years ago
Finally, I managed to build a custom libc064x build so that I can now properly test the internals.
comment:12 by , 15 years ago
In TCP/IP 4.21 (TCPIP32.DLL, referred to as TCP/IP V4.4 in kLIBC headers), there is no such definition as socket inheritance. To make passing sockets from the parent to the child possible, they had to make all socket IDs global. By default, all sockets opened by the process are automatically closed in the exit handler. To avoid this, ownership must be explicitly passed to the child using removesocketfromlist() and accepted by the child using addsockettolist().
I've got a suspicion that the race is between these two calls.
comment:13 by , 15 years ago
Ok, I found it. A race in kLIBC indeed. Here is what we get when everything goes fine (indented is the child process):
// parent: created a new socket (fd=5), refcnt=1/1 (process/global) 0229ed5f 01 03 0018 Entr 0000 __libc_spmSocketNew: iSocket=2751 0229ed5f 01 03 0018 Leav 0000 __libc_spmSocketNew (0 ms): ret 0 (0x0) // created a duplicate with fd=1, refcnt=2/2 0229edaa 01 03 0018 Entr 0000 __libc_spmSocketRef: iSocket=2751 0229edaa 01 03 0018 Leav 0000 __libc_spmSocketRef (0 ms): ret 131074 (0x20002) // child: inherited the socket through fd=1, refcnt=1/3, called addsockettolist() 0229ee36 01 02 0018 Entr 0000 __libc_spmSocketRef: iSocket=2751 0229ee36 01 02 0018 Leav 0000 __libc_spmSocketRef (0 ms): ret 65539 (0x10003) // parent: closed fd=1, refcnt=1/2 0229eed0 01 03 0018 Entr 0000 __libc_spmSocketDeref: iSocket=2751 0229eed0 01 03 0018 Leav 0000 __libc_spmSocketDeref (0 ms): ret 65538 (0x10002) // closed fd=5, refcnt=0/1 (global isn't 0, so removesocketfromlist() is called to prevent real closure 0229eed2 01 03 0018 Entr 0000 __libc_spmSocketDeref: iSocket=2751 0229eed2 01 03 0018 Leav 0000 __libc_spmSocketDeref (0 ms): ret 1 (0x1) // child: closed fd=1, refcnt=0/0, perform real closure 0229efb3 01 05 0018 Entr 0000 __libc_spmSocketDeref: iSocket=2751 0229efb3 01 05 0018 Leav 0000 __libc_spmSocketDeref (0 ms): ret 0 (0x0)
And here is a log of the failed case:
// parent: created a new socket (fd=10), refcnt=1/1 (process/global) 01c99f11 01 03 0018 Entr 0002 __libc_spmSocketNew: iSocket=2581 01c99f11 01 03 0018 Leav 0002 __libc_spmSocketNew (0 ms): ret 0 (0x0) // created a duplicate with fd=1, refcnt=2/2 01c99f1d 01 03 0018 Entr 0002 __libc_spmSocketRef: iSocket=2581 01c99f1d 01 03 0018 Leav 0002 __libc_spmSocketRef (0 ms): ret 131074 (0x20002) // *** here the fun part starts: // closed fd=1, refcnt=1/1 01c99f6d 01 03 0018 Entr 0002 __libc_spmSocketDeref: iSocket=2581 01c99f6d 01 03 0018 Leav 0002 __libc_spmSocketDeref (0 ms): ret 65537 (0x10001) // closed fd=10, refcnt=0/0, the socket is really closed before child gets a chance to reference it! 01c99f6d 01 03 0018 Entr 0002 __libc_spmSocketDeref: iSocket=2581 01c99f6d 01 03 0018 Leav 0002 __libc_spmSocketDeref (0 ms): ret 0 (0x0) // and here is the late child trying to inherit a non-existent socket 01c9a023 01 02 0018 Entr 0000 __libc_spmSocketRef: iSocket=2581 01c9a023 01 02 0018 Leav 0000 __libc_spmSocketRef (0 ms): ret 65537 (0x10001) // then it tries to close its fd=1 and asserts: 01c9ace8 01 04 001b Asrt: Assertion Failed!!! 01c9ace8 01 04 001b Asrt: Function: __libc_tcpipCleanup44 01c9ace8 01 04 001b Asrt: File: D:/Coding/libc-0.6/src/emx/src/lib/sys/tcpipver.c 01c9ace8 01 04 001b Asrt: Line: 1376 01c9ace8 01 04 001b Asrt: Expr: !rc 01c9ace8 01 04 001b Asrt: soclose(2581) -> rc=-1 sock_errno()->10038 01c9ace8 01 05 0018 Entr 0026 __libc_spmSocketDeref: iSocket=2581 01c9ace8 01 05 0018 Leav 0026 __libc_spmSocketDeref (0 ms): ret 0 (0x0)
So, apparently it sometimes happens that the child is too slow to fire up and increase the global reference count on sockets it inherits before the parent decides the baby has been started and closes them on its side.
I believe I saw some code that waits for child initialization of the kLIBC part (including inheritance and stuff) in case of a successful DosExecPgm call but either it is not working well or I'm mistaken. I'll look at it again.
comment:14 by , 15 years ago
BTW, there is a different strategy regarding socket reference counting: the spawn* code can increase the global counter by one for all inherited sockets (those that don't have FD_CLOEXEC set) *before* issuing DosExecPgm, decrease it back on failure or leave this task for the child in case if DosExecPgm succeeds.
comment:15 by , 15 years ago
libc_spmWaitForChildToBecomeAlive() waits for maximum 130 ms until the LIBC child has done termination; if it hasn't and 130 ms is expired, spawn* simply returns. According to the logs, both in the successful and in the failed case spawn* timed out, but luckily, once the child was fast enough to increase socket reference counters before the parent decreases them. In the second case, the wait is aborted even before the LIBC init routine gets control in the child.
I suppose Knut had some reasons for setting such a small timeout but they at least don't count in the case of a virtualized hardware.
comment:16 by , 15 years ago
By termination I mean initialization of course. And no, the formatting is not meant too.
comment:17 by , 15 years ago
Okay, increasing the LIBC child timeout from 130 to 1300 ms (and any child timeout from 8 to 800 ms) solves the race condition in my virtualized environment so that I don't get "Socket operation on non-socket" any more.
I also found a defect in the socket select thread in Qt (QEventDispatcherPM): it would keep posting "socket ready" messages to the main thread non-stop, w/o actually giving it a chance to execute and process them. As a result, weird things could happen (EPIPE followed by unexpected child process termination and so on). This defect is fixed now (by r196).
In my environment, i.e. with patched LIBC and r196, all tests now work including the smplayer/mplayer pair (I had to patch core.cpp a bit more to get rid of invalid options and also selected "Run mplayer in its own window" since Qt integration didn't work (didn't investigate this).
Silvan, please try the new code. It should work with the normal LIBC DLL most of the time. If you experience some oddities, tell me and I'll give you the patched one for testing.
Regarding the timeouts, I'll try to ask Knut.
comment:18 by , 15 years ago
i just built qt new and tryed smplayer/mplayer again. for me it still doesn't work.
it gets now further than it used to. i get all messages about the file and also the start playing message. and usually i see 1-2 seconds of the movie. but then nothing works anymore. i have the feeling that the fast processing of stdout lines still has some timing issues. as mplayer sends a lot of those when playing a movie.
the qt integration for smplayer needs a patch, which i did not send you afaik.
comment:19 by , 15 years ago
Could you be more verbose plz? Do you use the "Run mplayer in its own window" mode? What do you mean by "nothing works anymore"?
comment:20 by , 15 years ago
as you already know all works, when changed to PM app instead of console app.
comment:21 by , 15 years ago
Milestone: | Qt GA → Qt Beta 3 |
---|
comment:22 by , 15 years ago
Now I think that going the 'cmd /c' way is not really good since it will return a PID of CMD.exe instead of the PID of the actual application. This is definitely against the programmer's expectations and makes things such as custom IPC between the parent and the child difficult or impossible (because there will be no trusted child identification).
It seems that we will have to implement P_SESSION in kLIBC as well to overcome this limitation. This, as well as #71, relates to http://svn.netlabs.org/libc/ticket/121.
comment:23 by , 15 years ago
as it's not possible to get the rigfht PID with the cmd /c way i guess a fixed/changed libc is really needed. could you do the patch? and apply to the libc ticket 121? and we probably need to ship a custom libc as long as the patch is not included.
comment:24 by , 15 years ago
On the second thought, It actually may be possible to find a child PID of the started PID. (DosQuerySysInfo or how it is called).
What frightens me in case of patching LIBC is that I don't know how tightly spawn() mechanics relies on the parent-child relationship (inheritance, signal handling, waitpid() functionality): the thing is that DosStartSession() doesn't provide this kind of relationship between the main and the started executable. I'll read the EMX docs and LIBC sources closer.
comment:25 by , 15 years ago
Some results of hacking kLIBC with DosStartSession. Handle inheritance works in principle (why shouldn't it), but I found the following difficulties:
- DosStartSession returns control to the caller much later than DosExecPgm, which means that the child has already passed the LIBC/SPM init routines by that time and it didn't recognize that it was our child (because we didn't update the shared structure with its PID used by it as a key). A particular result of that is that doesn't call the inheritance routines. As a temporary workaround, I added some sleep to the SPM init code which is of course not good and needs a better solution.
- We need to add a new wait()/waitpid() implementation for children started with DosStartSession() which uses the system queue (DosCreateQueue).
- SIGCHLD doesn't get generated. We may need a helper thread that waits for the system queue and emulates this signal when it sees the child's death.
comment:26 by , 15 years ago
All of the above looks like doable but I'll probably have to delay Beta 3 for 1-2 days. We'll see tomorrow.
comment:27 by , 15 years ago
delay of beta3 is not a big problem. better have a more feature rich Qt :) and this fix also help others using libc.
comment:28 by , 15 years ago
Milestone: | Qt Beta 3 → Qt Enhanced |
---|---|
Priority: | blocker → minor |
as it's not a wide used scenario to start a PM app from a console app we move it to Qt enhanced.
comment:29 by , 15 years ago
Milestone: | Qt Enhanced → Qt Beta 3 |
---|
comment:30 by , 15 years ago
Indeed, the situation is such that we only have problems when starting PM applications from non-PM applications (DosExecPgm() returns NO_ERROR but the application isn't actually started). On the contrary, starting non-PM (at least WINDOWCOMPAT) apps from pm apps works well with DosExecPgm() (although totally against the documentation).
Since the case of starting PM apps from non-PM apps is qiute a rare in practice, I think it isn't worth a complete kLIBC patch at the moment.
Besides, in r217 I implemented the workarounds both for starting PM apps from non-PM apps (in the form of "cmd /c") and for detaching programs (#71) using "cmd /c detach". This still returns an incorrect PID (pid of cmd.exe) but that's ok for the present time.
I created #85 for tracking the kLIBC patch and further improvemets of QProcess.
comment:31 by , 15 years ago
Resolution: | → fixed |
---|---|
Status: | accepted → closed |
Not a blocker, since we haven't decided about its importance for real apps yet.