Opened 6 years ago

Closed 6 years ago

#276 closed defect (fixed)

python: thread start hangs

Reported by: dmik
Owned by:
Priority: major
Milestone:
Component: python
Version:
Severity: low
Keywords:
Cc:

Description

Sometimes python hangs when attempting to start helper threads. See http://trac.netlabs.org/rpm/ticket/275#comment:6 for details.

Change History (4)

comment:1 Changed 6 years ago by dmik

The test case is this application:

import subprocess
print subprocess.Popen (['cmd', '/c', 'ver'], stdout = subprocess.PIPE, stderr = subprocess.PIPE).communicate()[0]

Popen.communicate() basically creates two threads to read the child process output and then waits until these threads end. I can attach a debugger to the application when it hangs, and this is what I see:

Thread 1 (which calls communicate()):

 Function                         | Part           
----------------------------------+----------------
 __fmutex_request_internal        | FMUTEX.C       
 _fmutex_request                  | UCALLOC.C      
 _ucalloc                         | UCALLOC.C      
 _std_calloc                      | CALLOC.OBJ     
 _pthread_cond_broadcast          | pthr01.dll:1   
 _pthread_mutex_lock              | pthr01.dll:1   
 _pthread_detach                  | pthr01.dll:1   
 PyThread_start_new_thread        | THREAD.C       
 thread_PyThread_start_new_thread | THREADMODULE.C 
 PyEval_EvalFrameEx               | CEVAL.C        
 PyEval_EvalFrameEx               | CEVAL.C        
 PyEval_EvalFrameEx               | CEVAL.C        
 PyEval_EvalCodeEx                | CEVAL.C        
 PyEval_EvalFrameEx               | CEVAL.C        
 PyEval_EvalCodeEx                | CEVAL.C        
 PyEval_EvalCode                  | CEVAL.C        
 run_mod                          | PYTHONRUN.C    
 PyRun_FileExFlags                | PYTHONRUN.C    
 PyRun_SimpleFileExFlags          | PYTHONRUN.C    
 Py_Main                          | MAIN.C         
 0x0001008B                       | python.exe:1   
 0x00010042                       | python.exe:1   
 ___init_app                      | MAIN.OBJ       
 ___init_app                      | APPINIT.OBJ    

Thread 2 (one of the worker threads created by python; it reads the child stdout):

 Function             | Part           
----------------------+----------------
 __std_bzero          | CCLR6LXJ.S     
 _um_alloc_no_lock    | IALLOC.C       
 _ucalloc             | UCALLOC.C      
 _std_calloc          | CALLOC.OBJ     
 _TlsSetValue         | pthr01.dll:1   
 _pthread_setspecific | pthr01.dll:1   
 _pthread_exit        | pthr01.dll:1   
 threadWrapper        | BEGINTHR.OBJ   
 0x1FFECE33           | DOSCALL1.DLL:4 

It looks like the worker thread gets stuck in __std_bzero while allocating zero-filled memory from the heap. Since this allocation holds the heap mutex lock, Thread 1 is unable to acquire it when doing its own allocation and gets stuck too. It's unclear so far why __std_bzero does not return.
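
For reference, the thread-based reading that communicate() performs, as described above, can be approximated by the sketch below. This is not the actual subprocess code; the names read_all and communicate_with_threads are made up for illustration, and the child command is replaced with a portable one that runs the current interpreter.

```python
import subprocess
import sys
import threading


def read_all(pipe, sink):
    # Worker body: drain one pipe until EOF and store what was read.
    sink.append(pipe.read())
    pipe.close()


def communicate_with_threads(args):
    # Hypothetical stand-in for Popen.communicate(): one reader thread per pipe.
    child = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = [], []
    workers = [
        threading.Thread(target=read_all, args=(child.stdout, out)),
        threading.Thread(target=read_all, args=(child.stderr, err)),
    ]
    for worker in workers:
        # Starting a helper thread is the step that hangs in this ticket.
        worker.start()
    for worker in workers:
        worker.join()
    child.wait()
    return out[0], err[0]


if __name__ == "__main__":
    stdout_data, stderr_data = communicate_with_threads(
        [sys.executable, "-c", "print('hello')"]
    )
    print(stdout_data)
```

The main thread starts a second worker while the first one may already be running or exiting, which is the overlap visible in the two stack traces.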

comment:2 Changed 6 years ago by dmik

I overlooked the state of Thread 2 in the debugger, which was Critical. This means the thread was blocked by another thread that had entered a critical section with DosEnterCritSec. A closer look at the pthread code made the cause obvious: _pthread_detach on Thread 1 calls DosEnterCritSec and then requests the LIBC heap's fast mutex in calloc(). If that mutex is owned by another thread at that moment (Thread 2 in this case, which took the same heap mutex through its own calloc() call), a deadlock is guaranteed: the other thread never gets a chance to run and release the lock. This would also explain why __std_bzero appeared not to return: Thread 2 was simply suspended inside it.
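
The condition can be written down as a tiny model. This is only a simplified sketch of the reasoning above, not OS/2 or pthread code: can_progress and is_deadlocked are hypothetical helpers, and the model assumes that a thread inside a critical section prevents every other thread of the process from running.

```python
def can_progress(tid, wants_mutex, mutex_owner, crit_owner):
    # A thread cannot run while another thread is inside the critical section.
    if crit_owner not in (None, tid):
        return False
    # A thread requesting the heap mutex blocks while someone else owns it.
    if wants_mutex and mutex_owner not in (None, tid):
        return False
    return True


def is_deadlocked(mutex_owner, crit_owner, wants_mutex_by_thread):
    # Deadlock in this model: no thread at all is able to make progress.
    return not any(
        can_progress(tid, wants, mutex_owner, crit_owner)
        for tid, wants in wants_mutex_by_thread.items()
    )


if __name__ == "__main__":
    # State from this ticket: Thread 2 owns the heap mutex, Thread 1 is
    # inside the critical section and requests the same mutex.
    print(is_deadlocked(2, 1, {1: True, 2: False}))
    # Same contention without the critical section: Thread 2 can run on.
    print(is_deadlocked(2, None, {1: True, 2: False}))
```

In the model, plain contention on the heap mutex resolves itself because the owner keeps running; it only becomes a deadlock once the waiting thread is also the one inside the critical section.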

Note that deadlocks involving DosEnterCritSec are known to cause system freezes and produce unkillable zombie processes. Since pthread is used in many of the ports we maintain, this dangerous code may be reached in many situations, so I wonder how many freezes and hangs will go away once it is fixed. In any case, this deserves a separate ticket in Ports.

comment:3 Changed 6 years ago by dmik

I created http://trac.netlabs.org/ports/ticket/175 for the pthread problem.

comment:4 Changed 6 years ago by dmik

Resolution: fixed
Status: new → closed

The problem has completely gone away with the fix for http://trac.netlabs.org/ports/ticket/175. Closing.
