Opened 6 years ago

Closed 6 years ago

Last modified 6 years ago

#280 closed defect (fixed)

python: Make subprocess.Popen thread-safe

Reported by: dmik Owned by:
Priority: major Milestone:
Component: rpm Version:
Severity: low Keywords:
Cc: steve53@…

Description

In #275 subprocess.Popen was switched from fork to spawn. However, the new implementation is not thread safe. In order to implement piping of stdin/out/err in the started child the dup() technique is used where the parent stdin/out/err are temporarily associated with pipe handles and then reverted back after the child inherits them.

This is not thread safe as another parent thread might want to use stdin/out/err while they are redirected and the results may be quite unpredictable. One example is https://github.com/bitwiseworks/mozilla-os2/issues/247#issuecomment-346183854.

This needs to be fixed.

Change History (13)

comment:1 Changed 6 years ago by dmik

The issue is not that easy. In order to avoid racing here, the app should somehow guarantee that there is no access to handles 0, 1, 2 from other threads when dup() for them is called. There are many places in Python where such an I/O is done (just search for e.g. fwrite). While we may provide a wrapper for LIBC fwrite/write which provides locking , this still looks fragile and an overkill (that will have negative performance impact).

While it may look like providing a special implementation of spawn in C that receives handles to be assigned to std I/O of the child and calls PyGILState_Ensure() before manipulating it with dup() is an option, it's not. Upon return, PyGILState_Ensure() guarantees that the calling thread holds the Python's Global Interpreter Lock which means that no other Python thread may execute before this lock is released. Yes, in Python only one thread may do something at a time but there are two exceptions: this lock is released before doing any I/O and also some custom libraries may release it for performance reasons. Given that in our case it's all about I/O, calling PyGILState_Ensure() will not protect us from other threads doing I/O.

So the only option here besides finer granularity locks in write/fwrite is the infamous DosEnterCritSec for the duration of dup(); spawn(P_NOWAIT); dup(). For that, I need to export this function to Python.

comment:2 Changed 6 years ago by dmik

BTW, although Windows has the same spawn logic as OS/2, such a problem doesn't exist there because it was solved inside their CreateProcess that may take handles to be assigned as std I/O in the child. And while we can do something similar by providing a special implementation of spawn (that would effectively do dup(); spawn(P_NOWAIT); dup()) in e.g. LIBCx and export this function to Python instead of exposing DosEnterCritSec, I'm not convinced so far that it's worth doing as there might be other program logic besides dup; spawn; dup that needs protection.

Last edited 6 years ago by dmik (previous) (diff)

comment:3 Changed 6 years ago by dmik

BTW2, LIBC popen has exactly the same vulnerability. And more over, making popen or subprocess.Popen thread-safe will not solve the general problem. Because there may be other *processes* which are connected to the parent process std I/O via pipes (for whose this parent process is a child) so that they can write to these handles at arbitrary points in time and fiddling with them will give the same strange effects. It looks like http://trac.netlabs.org/ports/ticket/143 (and http://trac.netlabs.org/ports/ticket/157) may be consequences of such a inter-process race.

There is no universal protection from these inter-process scenarios because DosEnterCritSec has process scope, not global scope. Such a process-safety may only be guaranteed on case by case basis by providing some sort of IPC synchronization between processes. All in all, this looks like an OS/2 kernel flaw. It should have provided an atomic way of redirecting std I/O the way Win does using CreateProcess.

This question of process safety is beyond the scope of this ticket though, so I will leave it out.

comment:4 Changed 6 years ago by KO Myung-Hun

Just out of curiosity, it's possible that a parent process is affected due to dup() in a child process ? If correct programs, it'll back old std ios up first. Because of this, dup()ing newly created pipes to std ios cannot affect on pipes of a parent process. Any idea ?

comment:5 Changed 6 years ago by dmik

@komh good question but perhaps you miss my point. I will illustrate it with a scheme. Let's consider this connection where Proc2 is a child of Proc1 and OUT1 is some arbitrary file handle:

Proc1 OUT <-------------| PIPE1 |-------------> Proc2 STDIN

Now, let's assume that Proc2 does this:

1. Create a new pipe PIPE2
2. DUP to save STDIN connection to PIPE1
3. DUP to connect STDIN to PIPE2
4. SPAWN Proc3
5. DUP to restore STDIN connection to PIPE1

Now, if Proc1 tries to write to its OUT handle somewhere in between steps 3 and 5, it's likely that we will screw up DosWrite the way it happens in multi-threaded Python. Yes, this problem is different from that other case where output loss happens because the process writes to STDOUT which is temporarily redirected to a child's pipe since in this case data is still written to the right pipe (PIPE1 in the example above) even if the other end of the pipe is assigned a different handle in the child. However, I won't be surprised that there are bugs in DosWrite that may cause not only crashes but also data loss when it happens while the other end of the pipe is manipulated with DosDupHandle. I don't have any proof of that though, only a theoretical thought.

In other words, the inter-process case should work in theory but I assume it might not due to bugs in DosWrite (which I observe in Python). As opposed to that, the Python case should not work w/o synchronization even in theory because this happens within the same process where file handles are shared between all threads.

I hope I answered your question.

Last edited 6 years ago by dmik (previous) (diff)

comment:6 Changed 6 years ago by KO Myung-Hun

I know, DosDupHandle?() generates another handle pointing to the same device without modifying the device itself. For example, after step 2 it will be like:

Proc1 OUT <-----| PIPE1 |-----> Proc2 STDIN(handle 0)
                        +-----> Proc2 saved STDIN

And after step 3:

Proc1 OUT <-----| PIPE1 |-----> Proc2 saved STDIN
Proc2 OUT <-----| PIPE2 |-----> Proc3 STDIN(handle 0)

As a result, writing to OUT handle in Proc1 has no problems at all. Because the other end of a pipe has never broken.

If you say that there may be a bug in OS/2 kernel while processing this, I have no more words. But in theory, I think, there are no problems in inter-process work.

However, threads in the same process may encounter problems unless lock is used. For example,

1. thread1 writes to STDIN
2. thread2 dup() a pipe to STDIN
3. thread1 write to STDIN
4. thread2 restore STDIN

In this case, thread1 may write to other handles in step 1 and step 3 although it always write to the same handle, STDIN(handle 0).

comment:7 Changed 6 years ago by dmik

Yes, there are bugs in DOSCALLS otherwise it wouldn't crash during unserialized concurrent access. As for the rest, you just repeated my words so we are on the same page which is a good sign.

Anyway, DosEnterCritSec? seems to solve the problem in case of concurrent thread access within one process. For now, I did that by exposing DosEnterCritSec? to Python but I dislike this idea (it's too dangerous and one may easily create deadlocks with it). Instead, I will add a spawn enhancement to LIBCX that takes an additional array of 3 file descriptors that will become new child's STDIN/STDOUT/STDERR and hide all the implementation details (including DosEnterCritSec) on the C level to guarantee that the duration of the critical section is as minimal as possible. Stay tuned.

comment:8 Changed 6 years ago by KO Myung-Hun

Right. I agree with you about thread-unsafe problems but not for inter-process problems.

And spawn with 3 FDs is a good approach. And if possible, I would like to say to set FD_CLOEXEC to other FDs than new child's stdin/out/err and to restore them after spawning. Because inherited FDs is the cause of a hang when pipe redirection.

Thanks for your hard work as always.

comment:9 Changed 6 years ago by dmik

FD_CLOEXEC is a must when redirecting, true. The subprocess.Popen code does that on its own (right in Python) but moving this code to the spawn extension is a good hint, thanks. I will also add a flag that will instruct it to prevent inheritance of *any* handles from parent (they have something similar in CreateProcess on Win32 too).

comment:10 Changed 6 years ago by Steven Levine

Cc: steve53@… added

comment:11 Changed 6 years ago by dmik

I implemented spawn2 in LIBCx that does stdio redirection as discussed above (all the dup work and something else). However, I also came to a final conclusion that using DosEnterCritSec is plain wrong in this scenario, see https://github.com/bitwiseworks/libcx/issues/49#issuecomment-352119076 for details.

However, I was able to successfully use the fact that Python executes only one native C function at a time (unless it enables other thread execution with Py_BEGIN_ALLOW_THREADS). Given that all dup work is now hidden in spawn2, just making sure that spawn2 doesn't allow other threads to be executed, is enough to make Popen fully thread-safe. And I was finally able to run all 30k of JIT tests.

A commit will follow.

comment:12 Changed 6 years ago by dmik

As it turned out in https://github.com/bitwiseworks/libcx/issues/49#issuecomment-352119076, DosEnterCritSec is not enough and even wrong. I implemented spawn2 using the wrapper approach and it's rock solid per se, also when calling from Python. However, the heavy JIT test suite test case still fails in multi-Core mode with an assertion in LIBC:

LIBC fatal error - streams: unlock, 0 count

Killed by SIGABRT
pid=0xecda ppid=0x0161 tid=0x000f slot=0x00a5 pri=0x0200 mc=0x0001 ps=0x0010
C:\USR\BIN\PYTHON.EXE

This doesn't look directly related to my latest spawn2 implementation (it doesn't touch kLIBC streams in any way and I couldn't find any other proof of any relation). My current thought is that there are more things which are currently not thread safe, either in kLIBC itself or in Python.

Anyway, as I said above, this is irrelevant to spawn2 according to the current data I have. So I will just commit the related Python changes.

comment:13 Changed 6 years ago by dmik

Resolution: fixed
Status: newclosed

Committed in r1281. BTW, when I added Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS around spawn2, the LIBC fatal error had gone. Was able to run all 30k tests — only one hung indefinitely, will report in https://github.com/bitwiseworks/mozilla-os2/issues/247 in more detail.

Last edited 6 years ago by dmik (previous) (diff)
Note: See TracTickets for help on using tickets.