Opened 18 years ago

Last modified 18 years ago

#30 assigned defect

QProcess's readStdout()/readStderr() is slow

Reported by: froloff
Owned by: dmik
Priority: normal
Milestone: qt-os2-3.3.1-rc07
Component: kernel
Version: 3.3.1-rc06
Severity: normal
Keywords: QProcess redirected output pipe
Cc:

Description

The Qt-based photoproc application (http://sourceforge.net/project/showfiles.php?group_id=98019) uses the QProcess class to launch a third-party text-mode tool. Everything looks OK and the tool starts, but nothing happens, as if data from the tool's stdout doesn't reach the main application.

Change History (20)

comment:1 by dmik, 18 years ago

Can you please try the source:trunk/examples/process example and play with it (by specifying that tool's executable instead of uic.exe) in order to narrow the problem down?

comment:2 by froloff, 18 years ago

I did some investigation. It seems that data is passed through the stdout redirected to a pipe, but at a dramatically low speed of ~1 KB/s or even less.

comment:3 by froloff, 18 years ago

I made a test case: http://froloff.homeip.net/utils/process-testcase.zip. dcraw, producing a 48 MB PNM file with redirected output (see do-direct.cmd), takes only a few seconds on my home PC, while launching it with process.exe produces only a few KB of the destination file within the first minutes.

comment:4 by dmik, 18 years ago

I played with your test case and confirm the slowness. Please try to increase the pipe size at http://svn.netlabs.org/qt3/browser/trunk/src/kernel/qprocess_pm.cpp#L61 to 65520 (the maximum value supported on my system) to see how it improves the performance. Here the improvement is quite noticeable (though it isn't that close to running do-direct.cmd).
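The change amounts to bumping a single constant there -- roughly like this (a sketch only, assuming a plain define named PIPE_SIZE, as referred to below):

// src/kernel/qprocess_pm.cpp (around the line linked above)
// assumption: the constant is a plain define named PIPE_SIZE
#define PIPE_SIZE 65520   // 65520 is the maximum pipe buffer size supported here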

Anyway, it's not a good idea to read big files using readStdout()/readStderr() of QProcess -- that's not how it was designed to be used (at least on Win32 and OS/2). The main purpose of redirecting QProcess output is to communicate with command line tools (to create front-ends and such).

When redirecting output, QProcess uses a 100 ms timer to monitor the pipes and can be used only on the main (GUI) thread. We could theoretically use dedicated threads instead of the timer to read from the pipes (which would result in reading the whole output stream into memory if the GUI is stuck or not fast enough to process the readyStdout()/readyStderr() signals), but I see no benefit in it given that you can easily redirect big output to a temporary file and process it afterwards. This will definitely be the fastest method. So please also consider patching the photoproc sources as a good alternative.
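For photoproc, a minimal sketch of that alternative with the Qt 3 QProcess API could look like the following (the shell invocation, output file name and the slot are made-up examples, not actual photoproc code):

#include <qprocess.h>

// Sketch: run the tool through the shell and let the shell redirect stdout
// to a temporary file, then process that file from a processExited() slot.
void runToolToTempFile(QObject *receiver)
{
    QProcess *proc = new QProcess(receiver);
    proc->addArgument("cmd.exe");
    proc->addArgument("/c");
    proc->addArgument("dcraw -4 -c -w IMG_0664.CR2 > dcraw-out.pnm");
    QObject::connect(proc, SIGNAL(processExited()),
                     receiver, SLOT(loadTempFile()));   // made-up slot
    proc->start();
}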

P.S. I just wonder how do-direct.cmd could produce a 48 MB file on your system in a few seconds. What PC/HDD do you have? On my PD 920 (2.8 GHz) with an ATA133 HDD, it takes ~1 minute to produce this file. I can't believe your system can be 60 times faster than mine...

comment:5 by froloff, 18 years ago

Generally speaking, dumping results to stdout or reading from stdin is normal practice on Unix. It avoids intermediate temporary files and speeds up processing of data that doesn't need random access. Besides, you constantly have to keep track of those temporary files so they don't pile up, delete them, and so on, which is not very convenient, and they also take up quite a bit of space. So it would be nice to at least reach disk write speed (in theory, data exchange through a pipe should be several times faster).

Regarding the speed of do-direct.cmd, I just ran it again: it took 12 seconds. The CPU is an AMD 64 4000+, the disk is not too fast (20-30 MB/s on reads), but there is 4 GB of RAM ;)

comment:6 by dmik, 18 years ago

Sure, it's clear that this approach comes from Unix :) and its advantages are clear too. Still, try increasing PIPE_SIZE first, and then we'll see. I'm quite reluctant to create 2 threads per QProcess, and there is no other way to raise the speed further. By the way, I should look at how it is done in Qt/X11.

With the current scheme (100 ms poll x 65K buffer) we get 650 KB per second at most, provided the GUI fetches everything in time. The poll timer can't be reduced much, because then the CPU would be loaded unnecessarily during idle periods (when the process isn't actually outputting anything).

Regarding the do-direct.cmd performance: it's probably the memory. And your AMD is faster, too.

comment:7 by dmik, 18 years ago

Keywords: QProcess redirected output pipe added
Milestone: qt-os2-3.3.1-rc07
Summary: QProcess → QProcess's readStdout()/readStderr() is slow
Version: 3.3.1-rc06

Finally, I have recalled why I used named pipes there -- just in case I came to the conclusion that the polling technique is unacceptably slow in some applications. Named pipes allow associating a semaphore with them (using DosSetNPipeSem()), which gives us an opportunity to wait on all QProcess pipes on a single worker thread using DosWaitMuxWaitSem().
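Roughly, that plan looks like this (only a sketch of the OS/2 calls involved, with error handling, cleanup and multiple pipes omitted; it is not the actual QProcess code):

#define INCL_DOS
#include <os2.h>

// Sketch: attach an event semaphore to each named pipe and wait for all of
// them on one worker thread through a single muxwait semaphore.
static HEV  hevStdout;   // one event semaphore per monitored pipe
static HMUX hmuxPipes;

void setupPipeWait(HPIPE hpipeStdout)
{
    DosCreateEventSem(NULL, &hevStdout, DC_SEM_SHARED, FALSE);
    DosSetNPipeSem(hpipeStdout, (HSEM)hevStdout, 0 /* key identifying the pipe */);

    SEMRECORD rec = { (HSEM)hevStdout, 0 /* ulUser returned by the wait */ };
    DosCreateMuxWaitSem(NULL, &hmuxPipes, 1, &rec, DCMW_WAIT_ANY);
}

void pipeMonitorThread(void *)
{
    ULONG ulUser;
    while (DosWaitMuxWaitSem(hmuxPipes, SEM_INDEFINITE_WAIT, &ulUser) == NO_ERROR) {
        // ulUser tells which pipe was signalled: read it (DosRead) and/or
        // notify the GUI thread, then reset the event semaphore.
    }
}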

We'll see how the polling technique behaves with 32 ms polling and 65K pipe buffers and then decide whether it is worth rewriting this part of QProcess to use semaphores and a dedicated thread.

comment:8 by dmik, 18 years ago

Status: new → assigned

I've done some more testing by copying a 28 MB file:

  • using the copy command: ~1ms
  • using QProcess::readStdout() + fwrite() + 1ms polling + 65K buffers: ~12ms
  • using QProcess::readStdout() + fwrite() + 0ms polling + 65K buffers: ~5ms

0ms (special zero pseudo-timer) polling means that the GUI thread is constantly reading from the pipe, suspending only if there are messages on the event queue (which is checked after reading the whole piece of data available, usually equal to the buffer size or a bit more). This is the top speed possible, but we get a quite noticeable decrease in GUI response time: for example, it won't process mouse press events immediately, but with some delay (up to 1 second or so).
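For illustration, the 0ms mode behaves roughly like this (a sketch with assumed helper names, not the actual qprocess_pm.cpp code):

#define INCL_WIN
#include <os2.h>

extern bool readPipeChunk();        // assumed helper: reads whatever the pipe has, false if empty
extern void notifyReadyStdout();    // assumed helper: emits the readyStdout() signal

// Sketch of the "0 ms pseudo-timer": keep draining the pipe on the GUI
// thread and yield to the event loop only when PM messages are waiting.
void drainPipeUntilMessagesPending(HAB hab)
{
    QMSG qmsg;
    while (readPipeChunk()) {
        notifyReadyStdout();
        if (WinPeekMsg(hab, &qmsg, NULLHANDLE, 0, 0, PM_NOREMOVE))
            break;   // something is in the queue -- let the GUI process it
    }
}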

I've thought more about using a dedicated thread for monitoring/reading from the pipes. Although it seems attractive, it will not help much. The matter is that QProcess is *not* thread safe by design, so all its methods must be called on the GUI thread (so it's impossible to read data on a dedicated thread and process it there as well, to completely free the GUI from this task). Below are both scenarios; a sketch of the consumer-side slot they share follows them.

A. Single-threaded

  1. The GUI thread checks the pipe for new data every NNN ms.
  2. When data is available, the GUI thread reads it to an internal buffer (pipe->mem1) and calls readyStdout().
  3. readyStdout() slots called on the GUI thread make a private copy of read data using readStdout() (mem1->mem2).

B. Multi-threaded

  1. A worker thread waits for new data in the pipe, GUI does nothing.
  2. When new data is available, the worker thread reads data to its internal buffer (pipe->mem1) and posts a message to the GUI thread.
  3. GUI thread gets the message and calls readyStdout().
  4. readyStdout() slots called on the GUI thread make a private copy of the read data using readStdout(), which internally uses a mutex to protect multithreaded access (mem1->mem2).
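In both scenarios the application side is the same: a slot that fetches the buffered data via readStdout() (the mem1->mem2 copy). A minimal Qt 3 sketch, assuming the stock readyReadStdout() signal name and made-up class/slot names:

#include <qobject.h>
#include <qprocess.h>

// Sketch: the slot runs on the GUI thread and copies out whatever QProcess
// has buffered so far; what is done with 'chunk' afterwards is up to the app.
class OutputCollector : public QObject
{
    Q_OBJECT
public:
    OutputCollector(QProcess *proc, QObject *parent = 0)
        : QObject(parent), m_proc(proc)
    {
        connect(m_proc, SIGNAL(readyReadStdout()), this, SLOT(fetchStdout()));
    }
public slots:
    void fetchStdout()
    {
        QByteArray chunk = m_proc->readStdout();   // the mem1 -> mem2 copy
        qDebug("got %d bytes from the process", (int)chunk.size());
    }
private:
    QProcess *m_proc;
};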

As you can see, in method B, the GUI will be informed by posting messages to its message queue. In fact, at its best, this will be equivalent to using a minimal (1ms) polling interval in terms of speed (which is far from being comparable to the regular 'copy' command). So, method B loses in speed to method A when the latter uses 0ms polling, and they are roughly equal when it uses 1ms polling.

Where method B can slightly win is GUI response time -- in method B, the GUI doesn't have to constantly poll the pipe. However, method B actually benefits only when there is no data available (when there is, both methods will give the same response time because the GUI will be constantly busy with either 1ms timer messages or messages from the worker thread).

On the other hand, using a 100ms timer in method A will actually not slow down the GUI significantly (the end user won't notice it at all). So, if we implement a progressive polling interval scheme (start with 100ms and decrease down to 1ms as long as data is immediately available, return back to 100ms when it is not), it will make method A pretty much equal to method B, with the only difference that it will react to newly arriving data with some (100ms) delay. Though, I believe this delay is not that essential for real-life applications of QProcess.

The above makes me think that we should not rewrite the code and introduce a worker thread but use progressive polling intervals instead.
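A minimal sketch of the progressive interval idea, assuming the polling is driven by a QTimer and a hypothetical readAvailableData() helper that reports whether anything was actually read:

#include <qobject.h>
#include <qtimer.h>

// Sketch only: speed the poll timer up while data keeps arriving and relax
// it back to 100 ms as soon as the pipe runs dry.
class PipeMonitor : public QObject
{
    Q_OBJECT
public slots:
    void pollPipe();
private:
    bool readAvailableData();   // assumed: reads pending pipe data, true if any
    QTimer *m_timer;            // fires pollPipe(); m_interval starts at 100
    int m_interval;
};

void PipeMonitor::pollPipe()    // connected to m_timer's timeout() signal
{
    if (readAvailableData())
        m_interval = QMAX(1, m_interval / 2);   // data flowing: poll faster
    else
        m_interval = 100;                       // idle: back to relaxed polling
    m_timer->changeInterval(m_interval);
}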

comment:9 by dmik, 18 years ago

Actually, method B has one more advantage over method A. In A, both mem copy ops (pipe->mem1 and mem1->mem2) happen one after another on the main GUI thread (usually at the time when the poll event signals there is data). In B, pipe->mem1 is done on the worker thread, while mem1->mem2 happens on the GUI thread some time later. This is unlikely to improve performance, but may improve GUI response time a bit. Still not sure it is worth it.

comment:10 by dmik, 18 years ago

Just for the record. On Win32 (Qt 3.3.5), the same test on a 28 MB file takes:

  • using the copy command: ~1 sec
  • using QProcess::readStdout() + fwrite() + 1ms polling + 65K buffers: ~11 min

That's because Win32 uses 1K buffers as far as I remember.

So, whichever solution we take, it won't be portable which is too bad.

froloff, your comments?

comment:11 by dmik, 18 years ago

Oops, in the comment regarding Qt/2 speed comparisons, the numbers are seconds of course, not milliseconds.

comment:12 by dmik, 18 years ago

And in the comment regarding Qt/win32 comparisons, it's 100ms polling + 1K buffers.

comment:13 by froloff, 18 years ago

Well, I still don't understand the problem with the thread implementation. If polling the pipe from the GUI is not a good way, let's do the full pipe processing on an additional thread (including reading the data), and the GUI will just check a flag telling whether data from the pipe is ready or not. In Unix, all events, including GUI ones, are multiplexed through a single select() call. We can't do that, but we can emulate it very easily: we may have a number of event waiters, each posting the appropriate semaphore, and a single event processor controlled with a muxwait semaphore.

About the pipes themselves. The following example demonstrates that the pipe IPC method doesn't add any delay compared with writing directly to disk, as the command

popen.exe "dcraw -4 -c -w IMG_0664.CR2"

gives exactly the same timing result as do-direct.cmd:


#include <stdio.h>

int main(int argc, char *argv[])
{
    FILE *fp, *g;
    int l;
    int status;
    char buf[2048];

    /* run the command given on the command line and read its stdout */
    fp = popen(argv[1], "rb");
    if (fp == NULL)
        /* Handle error */;

    g = fopen("dest.bin", "wb");
    if (g == NULL)
        /* Handle error */;

    /* copy everything from the pipe to dest.bin */
    while ((l = fread(buf, 1, 2048, fp)) > 0)
        fwrite(buf, 1, l, g);

    status = pclose(fp);
    if (status == -1) {
        /* Error reported by pclose() */
    } else {
        /* Use macros described under wait() to inspect `status' in order
           to determine success/failure of the command executed by popen() */
    }

    fclose(g);
    return 0;
}


comment:14 by dmik, 18 years ago

The problem with the thread implementation is that I'm afraid it will not improve speed compared to 0/1ms polling + 65K buffers, because all communication will still be possible on the GUI thread only. The speed demonstrated in your snippet above is not possible in principle, because the GUI cannot read (copy) data constantly -- it has to interrupt itself to process other messages. Is that clear? :)

Please try the current implementation with 0ms polling + 65K buffers (and tell me the results) -- this is the fastest possible way. 1ms polling + 65K buffers results are also interesting.

And you forgot about the portability again. Do you want this app to be portable (in particular, to Win32)? If yes, you have to rely on the current QProcess implementation (=> use temporary files). If no, you don't need QProcess at all and can do piping yourself, completely on a separate thread.

comment:15 by dmik, 18 years ago

Here is a performance summary of starting QProcess("cmd /c type 28_MB_file.txt") and reading its output using the readyStdout() signal. Tests were run several times so that everything is cached.

method                               size     time    speed       signal every  notes
65520 B buf + 0 ms timer             27,4 MB  120 ms  228.3 MB/s  0,27 ms       100% CPU load
65520 B buf + 0 ms timer + fwrite    27,4 MB  3,3 s   8,3 MB/s    7,5 ms        100% CPU load
65520 B buf + 1 ms timer             27,4 MB  8,6 s   3,19 MB/s   19,6 ms       ~5% CPU load
65520 B buf + 1 ms timer + fwrite    27,4 MB  11 s    2,49 MB/s   25,1 ms       ~5% CPU load
32768 B buf + 1 ms timer             27,4 MB  18,4 s  1,49 MB/s   21 ms         ~5% CPU load
32768 B buf + 1 ms timer + fwrite    27,4 MB  20,8 s  1,32 MB/s   23,7 ms       ~5% CPU load

comment:16 by dmik, 18 years ago

Here are the results of the test with a dedicated thread using popen from the snippet above and posting events to the GUI when new data is available:

method                 size     time    speed      signal every
65520 B buf            27,4 MB  285 ms  96,1 MB/s  0,65 ms
65520 B buf + fwrite   27,4 MB  3.5 s   7,8 MB/s   7,98 ms

As you can see, the values are pretty comparable to the single-threaded 0 ms timer method shown above. The dedicated-thread variant is a bit slower because some time is spent on thread synchronization. But the problem is that it also gives the same 100% CPU load (because the GUI is constantly busy either saving the file when fwrite is used, or just processing "data ready" notifications when it is not), which in turn makes the GUI almost unusable, just as with 0 ms timers.

comment:17 by dmik, 18 years ago

Well, the GUI thread is not actually that busy when fwrite is not done, because "data ready" event processing is very fast in that case, so nothing stops it from serving other (system and input) events in time.

So the real problem when reading is that fast is that the GUI is constantly busy processing data. A correct application would immediately pass the read data to some other non-GUI thread for processing, but I'm afraid that's not the case in real applications using QProcess for that purpose (Dmitry, can you tell me how it is done in photoproc?).

So, the method with 65K buffers and 1 ms timers is of course 3-4 times slower but it's safer, because:

  1. the GUI thread gets notifications not faster than every ~20ms (the system timer limitation);
  2. according to the QProcess implementation, the maximum amount of data readStdout() can return is twice the pipe buffer size, i.e. 128K, which can usually be processed quickly.

This means that GUI will remain responsive even if data is processed on the GUI thread in some quick manner (like writing to a file). Other methods don't give us such a guarantee.

comment:18 by dmik, 18 years ago

I guess I've found the best solution. I improved the 'dedicated thread' method by:

  1. Using PM messages (WM_SEM3) instead of QT events.
  2. Limiting the maximum pending (read but not fetched) data size to some particular value (i.e. to double the pipe buffer size).

Item 1 ensures that normal PM messages are regularly processed by the GUI thread (Qt event processing is done in such a way that PM messages are not fetched until all Qt events are delivered, which is why using Qt events gives a 100% CPU load). Item 2 prevents the pending data buffer from growing without limit (in case the GUI doesn't read out the pending data regularly).
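For illustration, items 1 and 2 boil down to something like this on the monitor thread (a sketch with assumed helper names, window handle and limit; it is not the actual patch):

#define INCL_WIN
#include <os2.h>

extern void  waitForPipeData();              // assumed: blocks on the pipe semaphore
extern ULONG pendingSize();                  // assumed: bytes read but not yet fetched
extern bool  readChunkToPendingBuffer();     // assumed: pipe -> pending buffer, false if empty

static const ULONG MAX_PENDING = 2 * 65520;  // item 2: twice the pipe buffer size

// Sketch: read while the pending buffer stays under the limit, then poke the
// GUI thread with a plain PM message so its normal message loop keeps running.
void monitorLoop(HWND hwndGui)
{
    for (;;) {
        waitForPipeData();
        while (pendingSize() < MAX_PENDING && readChunkToPendingBuffer())
            ;                                // fill the pending (mem1) buffer
        WinPostMsg(hwndGui, WM_SEM3, MPVOID, MPVOID);   // item 1: wake the GUI
    }
}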

Here are the results:

method                 size      time    speed       signal every
65520 B buf            27,4 MB   240 ms  114,4 MB/s  0,55 ms
65520 B buf + fwrite   27,4 MB   3,6 s   7.6 MB/s    8,2 ms
65520 B buf            109,6 MB  2,7 s   40.58 MB/s  1,5 ms
65520 B buf + fwrite   109,6 MB  26,5 s  4.1 MB/s    15,1 ms

Note that the CPU load is just 15-25% and the GUI remains fully operable in all four cases. If, for example, you output the pending data block size to a scrolled window on every read operation, you can even smoothly scroll that window (using the mouse wheel) without any noticeable slowdown! Even more, for some (yet unknown) reason, testcase 4 (the biggest and slowest one) then even works faster, i.e. ~16 (!) seconds instead of 26,5. I did a binary comparison of the source and resulting files (to make sure no read blocks are skipped) -- they are identical.

Also note that removing the maximum pending data size limit has very little effect on performance: testcase 4 completes in 25,5 seconds (i.e. a 1 second difference). On the other hand, the unlimited buffer can grow to several megabytes if the GUI is not processing data fast enough, which sometimes leads to short-term GUI freezes (stopped scrolling, etc.). This proves that having a limit is a very useful thing.

So, I'm going to reimplement the current QProcess implementation to use semaphores and a dedicated thread. I just hope it will not take too long :)

comment:19 by froloff, 18 years ago

Well, I applied the latest changes (revision 135) and compared the speeds.

I can confirm that the new implementation improves the situation dramatically.

Loading a RAW file into the photoproc application from dcraw output now takes less than 2 seconds, including the dcraw processing time. This is several times faster than it was when dcraw wrote data directly to disk (~12 sec) and when we took data from dcraw output with the old QProcess implementation (worse than 8 seconds, with POLL_TIME set to 1).

As I mentioned before, in our case the HDD will always be the bottleneck and sending data via a pipe is always many times faster (if you have enough memory, of course :) ). This also explains why the 3rd testcase differs so much in speed from the 1st one (40 MB/s is the usual raw read speed from the disk).

comment:20 by dmik, 18 years ago

Hmmm, it's strange, but my last comment from Oct 10 has somehow got lost :( I explained some details of the implementation I've applied. Anyway, I'll just mention that the results are pretty much the same as in the last table (for a dedicated thread), which is not that bad taking into account that QProcess as a whole is a bit more complex than the dedicated-thread testcase. One significant difference from the testcase is that I don't read out pipe data on the monitor thread, but leave it in the system pipe buffer and only post a message to the GUI thread that some data is ready (which acts like the "natural" buffer size limit I talked about above).

Another interesting observation I made when testing the latest QProcess implementation is related to the fourth (heaviest) testcase from the last table. I found that if I put a line into the text edit box every time there is new data to read (which produces constant text output), *and* scroll this text edit box (using the mouse wheel) back and forth all the time during the process execution, the execution time is almost two times shorter (not longer, as one might expect because of the extra actions such as GUI redraws etc.): the testcase completes in 15 seconds instead of 27! I think this may be explained by the fact that the GUI thread doesn't have to go to the sleep state in WinGetMsg() because of the constantly arriving mouse and WM_PAINT messages it needs to process. If that is the reason, then reading some data into a temporary buffer on the monitor thread (to let the process write some more data instead of waiting until the GUI thread wakes up to pick up the "DATA READY" message and free the pipe buffer) may give the same effect. I'll try it when I have some more time.

Anyway, thank you for the feedback.
