Opened 15 years ago

Closed 15 years ago

Last modified 15 years ago

#101 closed enhancement (fixed)

Use DIVE to speed up blitting in QRasterWindowSurface

Reported by: Dmitry A. Kuminov Owned by:
Priority: major Milestone: Qt 4.6.2
Component: QtGui Version: 4.5.1 Beta 3
Severity: highest Keywords:
Cc: pmw

Description

We may want to try to use DIVE to speed up flushing the widget's raster buffer to screen in QRasterWindowSurface::flush().

Currently we use GpiDrawBits() for this purpose which is not very optimal since the image bits aren't likely to go directly to the video memory in this case.

We also may provide a custom QWindowSurface that allocates widget buffer directly in video memory (I recall DIVE should allow this).

Attachments (3)

blit.cpp (3.0 KB ) - added by Dmitry A. Kuminov 15 years ago.
diveblit.zip (9.0 KB ) - added by rudi 15 years ago.
2nd upload, fixed a small clipping error
diveblit2.zip (12.9 KB ) - added by rudi 15 years ago.

Change History (71)

comment:1 by Dmitry A. Kuminov, 15 years ago

We currently use the world transformation matrix to flip the image buffer along the y axis from the top-oriented Qt coordinate space to the bottom-oriented PM coordinate space.

In r308, I tried to mirror the image buffer manually before calling GpiDrawBits() and it seems to provide slightly faster performance.
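
For illustration, manual mirroring of a 32-bit buffer can be sketched as below. This is only a sketch under stated assumptions: `flipRowsInPlace` is a hypothetical helper working on a tightly packed pixel vector, not the code committed in r308 and not Qt's `QImage::mirrored()`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: flip a tightly packed 32bpp image vertically, in place,
// so that the top row (Qt orientation) becomes the bottom row (PM orientation).
void flipRowsInPlace(std::vector<std::uint32_t> &pixels,
                     std::size_t width, std::size_t height)
{
    assert(pixels.size() == width * height);
    // Walk from both ends towards the middle, swapping whole rows.
    for (std::size_t top = 0, bottom = height; top + 1 < bottom; ++top) {
        --bottom;
        std::swap_ranges(pixels.begin() + top * width,
                         pixels.begin() + (top + 1) * width,
                         pixels.begin() + bottom * width);
    }
}
```

Every pixel of the buffer is touched once per flush, which is the extra cost the timings in this comment are trying to measure.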

Here are the results with the world matrix:

D:>qt4 debug\textedit.exe
*** qt_total_blit_ms 62
*** qt_total_blit_pixels 2921346
*** speed 47118 pixels/ms

D:>qt4 debug\textedit.exe
*** qt_total_blit_ms 69
*** qt_total_blit_pixels 2921346
*** speed 42338 pixels/ms

D:>qt4 debug\textedit.exe
*** qt_total_blit_ms 67
*** qt_total_blit_pixels 2921346
*** speed 43602 pixels/ms

And here are four results using manual image mirroring:

D:>qt4 debug\textedit.exe
*** qt_total_blit_ms 43
*** qt_total_blit_pixels 2921688
*** speed 67946 pixels/ms

D:>qt4 debug\textedit.exe
*** qt_total_blit_ms 58
*** qt_total_blit_pixels 2921688
*** speed 50373 pixels/ms

D:>qt4 debug\textedit.exe
*** qt_total_blit_ms 65
*** qt_total_blit_pixels 2921688
*** speed 44949 pixels/ms

D:>qt4 debug\textedit.exe
*** qt_total_blit_ms 60
*** qt_total_blit_pixels 2921346
*** speed 48689 pixels/ms

To do the test, I repeatedly ran the textedit demo application and pressed Alt, X (maximize), Alt, R (restore), Alt, C (Close) with a ~half second delay between commands.

For timing, I used the millisecond counter from the per-process Local Info Segment and I executed this in a VM so the results may not be perfect (especially the second case has very big differences, though subsequent calls show that the speed is around 50000 pixels/ms).
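
The speed figure in the logs above appears to be the total pixel count divided by the total time with integer truncation (for example 2921346 / 62 gives 47118). A minimal sketch of such an accumulator follows; `BlitStats` and its members are hypothetical names, not the counters used in the Qt source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical accumulator for the two counters printed in the logs.
struct BlitStats {
    std::uint64_t totalPixels = 0;
    std::uint64_t totalMs = 0;

    // Record one blit of `pixels` pixels that took `elapsedMs` milliseconds.
    void addBlit(std::uint64_t pixels, std::uint64_t elapsedMs)
    {
        assert(totalPixels + pixels >= totalPixels); // guard against overflow
        totalPixels += pixels;
        totalMs += elapsedMs;
    }

    // Truncated average speed in pixels per millisecond; 0 if no time elapsed.
    std::uint64_t pixelsPerMs() const
    {
        return totalMs ? totalPixels / totalMs : 0;
    }
};
```

With a millisecond counter, short runs are dominated by timer granularity, which is one reason the individual results above scatter so much.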

Anyway, I committed the manual image mirroring version (second case). Silvan, please provide some feedback.

It would be great if you could test this as well, on real hardware. To get counters, you need to do the following:

  1. Place #define QT_LOG_BLITSPEED to the top of source:/trunk/src/gui/painting/qwindowsurface_raster.cpp and source:/trunk/src/gui/kernel/qapplication_pm.cpp.
  2. Change #if 1 to #if 0 for matrix based flipping and vice versa for manual flipping at line 227 here: source:/trunk/src/gui/painting/qwindowsurface_raster.cpp#L227.

Later, I'll write a more proper test that draws a lot of graphics in a loop and doesn't require operator's action which should give less errors in the results.

comment:2 by Silvan Scherrer, 15 years ago

it's strange here.
i see exactly the opposite. with #if 1 at L227 i get around 22000 pixel/ms.
with #if 0 at L227 i get around 49000 pixel/ms

comment:3 by Dmitry A. Kuminov, 15 years ago

Okay, I will have to write a proper test case in order to collect more error-free results and then decide what option we should choose.

comment:4 by pmw, 15 years ago

Cc: pmw added

Not sure if I understand correctly what you are trying to do. But do you know about GpiEnableYInversion()? I think that can be used instead of manually flipping buffers around. Like this:

   LONG before = GpiQueryYInversion(hps);
   GpiEnableYInversion(hps, bi.cy-1); /* where bi is of type BITMAPINFO2 */
   ...
   GpiDrawBits();
   ...
   GpiEnableYInversion(hps, before);

There are a few more details at http://wiki.netlabs.org/index.php/GpiEnableYInversion().
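
The arithmetic behind such an inversion can be sketched without any PM calls. The types and function names below are hypothetical, and the inclusive-exclusive rectangle convention is an assumption made for the sketch; the only thing taken from the snippet above is that the inversion value is the height minus one.

```cpp
#include <cassert>

// Hypothetical types: a top-left-origin rectangle (Qt style) and a
// bottom-left-origin rectangle with exclusive right/top edges (assumed here).
struct TopDownRect  { long x, y, width, height; };
struct BottomUpRect { long xLeft, yBottom, xRight, yTop; };

// Map a single scan line index from top-down to bottom-up space.
long flipY(long y, long surfaceHeight)
{
    assert(y >= 0 && y < surfaceHeight);
    return surfaceHeight - 1 - y;
}

// Map a whole rectangle; the bottom edge of the top-down rectangle
// becomes the (inclusive) yBottom of the bottom-up one.
BottomUpRect flipRect(const TopDownRect &r, long surfaceHeight)
{
    BottomUpRect out;
    out.xLeft   = r.x;
    out.xRight  = r.x + r.width;
    out.yBottom = surfaceHeight - (r.y + r.height);
    out.yTop    = surfaceHeight - r.y;
    return out;
}
```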

comment:5 by Dmitry A. Kuminov, 15 years ago

No, I hear about this function for the first time. Thank you, I always suspected there should be such a trick. I will do detailed measurement to see what method is faster.

I regret that I didn't know about this function a few years ago -- I had to do computations to manually flip the Y axis in dozens places in the Qt source code (both Qt3 and Qt4).

BTW, the link you gave contains nothing ATM. I found the information in the Google cache though.

comment:6 by Dmitry A. Kuminov, 15 years ago

Note for myself: replace QImage::mirrored() calls in source:/trunk/src/gui/image/qpixmap_pm.cpp with GpiEnableYInversion() too once we prove that it's faster.

comment:7 by Silvan Scherrer, 15 years ago

Severity: highest

eventually it wouldn't hurt to see if this gives some more speed.

comment:8 by Silvan Scherrer, 15 years ago

Milestone: Qt Enhanced → Qt GA

moved to Qt4 GA as it's worth the effort

comment:9 by Dmitry A. Kuminov, 15 years ago

In r435, I added the testcase that constantly draws random rectangles in a window causing the blit operation (copying the rectangle of pixels from the offscreen buffer to the presentation space of the window) to happen when processing WM_PAINT events.

The test calculates the roundtrip speed of copy-offscreen-buffer-to-window operations, which means that besides the time necessary for copying pixels, this also includes the time necessary to fill 2D rectangles in the offscreen buffer, deliver paint and timer messages and so on.

To get the speed of the copy operation only, use the QT_LOG_BLITSPEED define as described above.

Note that to get the most realistic numbers suitable for comparison, let the test run until the average speed printed in the window title stabilizes (i.e. stops growing).

comment:10 by Dmitry A. Kuminov, 15 years ago

Implemented the GpiEnableYInversion method in r437. Note that in order to switch between three different methods (QImage mirror, GpiEnableYInversion and GPI Matrix) you need only to assign a correct value to QT_BITMAP_MIRROR in qwindowsurface_raster.cpp, line 75.

I performed local testing and though I cannot give you the numbers (in the virtualized environment they are highly unstable and make no sense for others but me) I got a strong feeling that GpiEnableYInversion (method 2) does just the same GPI Matrix transformation internally as I do in method 3. So both 2 and 3 show comparable results and both are slower than the good old QImage mirror (method 1).

Silvan and others, I'm really interested in seeing your statistics on real hardware in all three modes (both roundtrip and copy only speeds; they are printed to stderr if you do everything right).

comment:11 by Dmitry A. Kuminov, 15 years ago

JFYI. This annoying GPI mirroring problem (the PC desktop is not a drawing board, after all!) steals 25-30% of performance :( If you remove mirroring in method 1 by replacing

QImage img = d->image->image.mirrored();

with

const QImage &img = d->image->image;

you will immediately get the mentioned performance boost...

comment:12 by Silvan Scherrer, 15 years ago

does this mean the effort did not bring anything?
i try to test the 3 scenario after x-mas

comment:13 by Dmitry A. Kuminov, 15 years ago

Yes, it basically means this. Sure, when you have time.

comment:14 by Dmitry A. Kuminov, 15 years ago

Milestone: Qt GA → Qt Enhanced

DIVE is certainly not for the GA.

comment:15 by Silvan Scherrer, 15 years ago

here are some testresults
Qt4 4.5 GA

Stats:

total pixels blitted : 1373095041
total time, ms : 68763
average roundtrip speed, px/ms : 19968

Qt4 4.6.2 default

Stats:

total pixels blitted : 1145961698
total time, ms : 55113
average roundtrip speed, px/ms : 20792

Qt4 4.6.2 #define QT_BITMAP_MIRROR 2

Stats:

total pixels blitted : 1889664774
total time, ms : 64887
average roundtrip speed, px/ms : 29122

Qt4 4.6.2 #define QT_BITMAP_MIRROR 3

Stats:

total pixels blitted : 1599437680
total time, ms : 52079
average roundtrip speed, px/ms : 30711

comment:16 by Dmitry A. Kuminov, 15 years ago

Thank you Silvan. Attaching a patched blit.cpp capable of building on Qt3.

by Dmitry A. Kuminov, 15 years ago

Attachment: blit.cpp added

comment:17 by Silvan Scherrer, 15 years ago

with Qt3

Stats:

total pixels blitted : 718862409
total time, ms : 20323
average roundtrip speed, px/ms : 35371

comment:18 by Dmitry A. Kuminov, 15 years ago

Type: defect → enhancement

comment:19 by rudi, 15 years ago

Silvan asked me to do some tests as well, but I get completely weird results:

These are the stats using the recent 4.6.2 trunk (default mode, source code downloaded this morning from the SVN):

Stats:

total pixels blitted : 126260005
total time, ms : 38524
average roundtrip speed, px/ms : 3277

And this is what I get when running the Qt3 version (source two comments above):

Stats:

total pixels blitted : 1916609352
total time, ms : 24584
average roundtrip speed, px/ms : 77961

Not just by the numbers, but also visually the Qt3 version is much faster. This is a P4 2.4GHz, Intel 845 integrated graphics, SNAP 3.1.8.

comment:20 by Dmitry A. Kuminov, 15 years ago

Yes, your results are kind of strange since they are close to what I have under VirtualBox (~2000 px/ms in Qt4 and ~70000 in Qt3). Qt3 paints rectangles using GPI primitives directly in the window PS (which usually means directly in video memory) while Qt4 draws everything to an off-screen image buffer which is then copied to the window PS. Taking into account that regular memory operations in the VM are slower than on the real hardware due to extra page handling while drawing a rectangle into the guest video memory means drawing it directly to the host window's PC, this explains why the difference is so big under VirtualBox.

While your Qt3 value is still comparable to Silvan's by the order of magnitude, your Qt4 value is definitely not. Can you please repeat the Qt4 tests several times and also try the 4.6.2 trunk switched to QT_BITMAP_MIRROR 2 and 3 instead of the default please?

comment:21 by rudi, 15 years ago

O.K, the other build has completed now, so here are the full results (P4 2,4Ghz, Intel I845 integrated graphics):

4.6.2 + QT_BITMAP_MIRROR 1

Stats:

total pixels blitted : 132725433
total time, ms : 40606
average roundtrip speed, px/ms : 3268

4.6.2 + QT_BITMAP_MIRROR 2

Stats:

total pixels blitted : 587069252
total time, ms : 51660
average roundtrip speed, px/ms : 11364

4.6.2 + QT_BITMAP_MIRROR 3

Stats:

total pixels blitted : 532857305
total time, ms : 45050
average roundtrip speed, px/ms : 11828

3.3.1

Stats:

total pixels blitted : 3902955420
total time, ms : 50596
average roundtrip speed, px/ms : 77139

I ran the same tests on two other machines (Athlon 2000 XP, ATI Radeon 9600, Athlon 2100 XP, NVIDIA integrated Geforce2 compatible graphics). Both exposed very similar results. In all cases QT_BITMAP_MIRROR 3 was the fastest for Qt4, but was outperformed dramatically by Qt3.

BTW, all the machines are non-virtual and equipped with graphics adapters that are fully supported by SNAP 3.1.8 (which was used in all cases).

comment:22 by rudi, 15 years ago

Just to add: I forgot to run the tests on my notebook, which is a little bit more contemporary than these desktops. Result: the numbers are bigger, but the trend is the same. The manual flip gains a bit more in this case, but is still half the speed of the other methods. Qt3 is still far ahead with about 106000 px/ms.

comment:23 by rudi, 15 years ago

A second note: I compiled blit.cpp on Windows using Qt4.5.2/mingw. The result for the very same machine as above (just booted to W2K instead of Warp4):

Stats:

total pixels blitted : 3736592378
total time, ms : 35451
average roundtrip speed, px/ms : 105401

comment:24 by Dmitry A. Kuminov, 15 years ago

Okay, thank you! This is good evidence that DIVE is highly wanted.

Regarding the difference between Qt3 and Qt4 -- as I previously said, it's due to using GPI primitives rather than bitmap blitting. As you say, your cards are fully supported by Snap which means they have hardware acceleration for lines, rectangles, fills and so on. In Silvan's case (who uses Panorama), most likely, this level of acceleration is missing for his card, hence the lower numbers. On the other side, panorama seems to have got really good acceleration for blitting operations on his hardware which may explain this 10 times difference...

comment:25 by rudi, 15 years ago

I have to admit that the computer I did the initial tests on was performing extremely well on W2K, but the others scored at 65000..85000px/ms as well.

BTW, it looks like GpiDrawBits() is slow no matter if it has to flip the image or not. To check that I omitted the flipping (which caused Qt apps look a bit funny ;-). But the speed gain (vs. YInversion or Matrix) was only a few percent...

Some thoughts about DIVE:

Using a DIVE buffer in video memory is certainly something we don't want to do. That is because VRAM is optimized for output and usually has very poor read performance. This is why Panorama employs a shadow buffer in main memory to do the actual drawing on and blits that to VRAM when necessary.

Now, DIVE allows associating a buffer physically located in main memory with a DIVE buffer. That would avoid the problem above, but AFAIK, DIVE does not support source buffers in a 32 bits per pixel format. At least it didn't when I tried that several years ago... That would mean we have to convert our 32bpp Qt drawing surface to a 24bpp format. Given the results from the tests with image.mirrored(), I would think that this will be a huge performance killer as well. Also it has to be noted that for some strange reason DIVE only accepts a maximum of 50 visible rects for clipping. This can cause strange results if someone moves a shaped window (maybe you remember that funny program that drew a "window" in the form of a Japanese cartoon character) over a DIVE app. Also there seem to be a few buggy DIVE/PM/display driver version(s) around, which will not correctly redraw some areas of the screen when a DIVE app is active.
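
To make the cost of such a 32bpp to 24bpp conversion concrete, here is a minimal sketch. It assumes tightly packed rows with no scan line padding and a source byte order of B, G, R, pad; `packBgr4ToBgr3` is a hypothetical name and not part of DIVE or Qt.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: drop the fourth (padding) byte of every 32bpp pixel.
// Assumes tightly packed rows; real 24bpp surfaces often pad each scan line.
std::vector<std::uint8_t> packBgr4ToBgr3(const std::vector<std::uint8_t> &src,
                                         std::size_t width, std::size_t height)
{
    assert(src.size() == width * height * 4);
    std::vector<std::uint8_t> dst(width * height * 3);
    for (std::size_t i = 0; i < width * height; ++i) {
        dst[i * 3 + 0] = src[i * 4 + 0]; // blue
        dst[i * 3 + 1] = src[i * 4 + 1]; // green
        dst[i * 3 + 2] = src[i * 4 + 2]; // red
    }
    return dst;
}
```

Like mirroring, this reads and writes every pixel once per blit, which is why it is expected to cost a comparable amount of time.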

Of course we could also try to avoid DIVE blitters completely and do everything ourself. This would probably be the fastest way (given that our code is well optimized), but certainly the most complex one. In this case we would obtain a pointer to the frame buffer and blit the data directly. However, then we are also responsible for clipping and color space conversion. Maybe Qt already contains code for this - as they support embedded devices.
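
The clipping part of a do-it-yourself blitter boils down to intersecting the rectangle to be flushed with each visible rectangle of the window. A rough sketch, with hypothetical `Rect`, `intersect` and `clipToVisible` names and exclusive right/bottom edges assumed:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical rectangle type; right and bottom are exclusive.
struct Rect { int left, top, right, bottom; };

// Compute the overlap of a and b; returns false if they do not overlap.
bool intersect(const Rect &a, const Rect &b, Rect &out)
{
    out.left   = std::max(a.left, b.left);
    out.top    = std::max(a.top, b.top);
    out.right  = std::min(a.right, b.right);
    out.bottom = std::min(a.bottom, b.bottom);
    return out.left < out.right && out.top < out.bottom;
}

// Return the parts of `blit` that fall inside any of the visible rectangles.
std::vector<Rect> clipToVisible(const Rect &blit, const std::vector<Rect> &visible)
{
    assert(blit.left <= blit.right && blit.top <= blit.bottom);
    std::vector<Rect> result;
    for (const Rect &v : visible) {
        Rect r;
        if (intersect(blit, v, r))
            result.push_back(r);
    }
    return result;
}
```

Each resulting rectangle would then be copied scan line by scan line into the frame buffer, with color conversion applied per pixel where the screen format differs.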

comment:26 by Dmitry A. Kuminov, 15 years ago

Thank you for the information about DIVE.

However, what we need is exactly to write out the off-screen buffer from the main memory to the window context; we don't need to read from the video memory and don't store any data there. Regarding 24 vs 32 bit, Qt in fact uses 24-bit data padded with 0 to 32 bits. If DIVE can't handle this we're in trouble indeed.

What about DIVE bugs, we'll see. If the situation is really that bad nowadays, I will provide an env. var to select the rendering method dynamically.

BTW, commenting out the mirror stage here gives some significant benefit, but mine is the virtualized hardware so doesn't really matter.

comment:27 by rudi, 15 years ago

However, what we need is exactly to write out the off-screen buffer from the main memory to the window context; we don't need to read from the video memory and don't store any data there.

Hmm, but isn't Qt drawing in that buffer, which would cause reads to take place depending on the method of combining background and foreground ? In this situation I would not recommend to use a buffer in off-screen video memory (i.e. on the graphics card).

BTW, commenting out the mirror stage here gives some significant benefit, but mine is the virtualized hardware so doesn't really matter.

It's interesting that your environment (what is that, btw ?) behaves opposite to what Silvan and I experience on real hardware on which the "Matrix" approach is the fastest.

in reply to:  27 comment:28 by Dmitry A. Kuminov, 15 years ago

Replying to rudi:

Hmm, but isn't Qt drawing in that buffer, which would cause reads to take place depending on the method of combining background and foreground ? In this situation I would not recommend to use a buffer in off-screen video memory (i.e. on the graphics card).

Qt draws to the off-screen buffer in the *main* memory (not on the graphics card) and just blits it to whatever the video driver thinks the best. Since with DIVE we are guaranteed to blit directly into the video memory (which is fast for writes), it should be faster than passing through internal GPI buffers PM maintains for windows. This is how I understand the theory of DIVE at least.

comment:29 by rudi, 15 years ago

DIVE blitters always copy a source DIVE buffer to a destination DIVE buffer. The destination is usually (but not necessarily) the visible screen. The source buffer can be allocated by DIVE (in this case it may end up in an invisible area of the VRAM on the card) or it can be associated with an area in main memory. So in theory we could either let DIVE allocate the buffer and pass the pointer to Qt as drawing surface or we can let Qt allocate a drawing surface and create a DIVE buffer instance that maps to the same memory location. Only in the latter case we are sure that the source buffer for the DIVE blit (which is Qt's drawing surface) is not located in invisible VRAM on the card.

BTW, I dug out some ancient test program and it shows that DIVE blitters do not support 32bpp source buffers. Trying to let DIVE allocate one results in DIVE_ERR_SOURCE_FORMAT. Trying to associate one will succeed in the first place (?), but will crash when actually trying to blit. So my memory was right about that.

comment:30 by rudi, 15 years ago

I did some initial tests with direct frame buffer writes. Even though the first results don't look very encouraging, I'll try to implement a "QT_BITMAP_MIRROR 4" option based on that...

comment:31 by rudi, 15 years ago

Hi Dmik,

as promised here is my (now somewhat better) working implementation of direct blitting. For initial testing, I put most of the new stuff into diveblit.cpp, which is included by qwindowsurface_raster.cpp. QtGui now needs to be linked against MMPM2. Especially on 16bit target screens the performance gain is noticeable (30..50%). For 32bit not really. Of course, the code leaves room for optimization (MMX/SSE), but I guess that will speed up things by 10..15% at most. Also note that 24bit target screens are not yet supported.

Since we are still slower than Windows by a magnitude, I guess they are using hardware acceleration within the display driver...
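
For a 16-bit target, direct blitting has to convert every pixel while copying. The sketch below shows one plausible per-pixel conversion to R565 (5 bits red, 6 bits green, 5 bits blue); it assumes source pixels laid out as 0xXXRRGGBB, as in Qt's 32-bit RGB image format, and it is not taken from diveblit.cpp. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Convert one 0xXXRRGGBB pixel to R565 by keeping the top bits of each channel.
inline std::uint16_t toR565(std::uint32_t xrgb)
{
    const std::uint32_t r = (xrgb >> 16) & 0xFF;
    const std::uint32_t g = (xrgb >> 8) & 0xFF;
    const std::uint32_t b = xrgb & 0xFF;
    return static_cast<std::uint16_t>(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

// Convert `count` pixels of one scan line from 32bpp to 16bpp.
void convertScanlineToR565(const std::uint32_t *src, std::uint16_t *dst, std::size_t count)
{
    assert(count == 0 || (src != nullptr && dst != nullptr));
    for (std::size_t i = 0; i < count; ++i)
        dst[i] = toR565(src[i]);
}
```

The destination scan line is half the size of the source one, so fewer bytes have to be written to video memory per pixel than on a 32-bit screen.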

comment:32 by rudi, 15 years ago

Here is the result for the very same machine I used before
(speed was 11828 with QT_BITMAP_MIRROR = 3)

DiveCaps.fScreenDirect: 1
DiveCaps.fBankSwitched: 0
DiveCaps.ulWidth: 1920
DiveCaps.ulHeight: 1080
DiveCaps.ulDepth: 16
DiveCaps.ulScanLineBytes: 3840
DiveCaps.fccColorEncoding: R565
Stats:

total pixels blitted : 1356648576
total time, ms : 55578
average roundtrip speed, px/ms : 24409

by rudi, 15 years ago

Attachment: diveblit.zip added

2nd upload, fixed a small clipping error

comment:33 by rudi, 15 years ago

I've done some more work on the code. The new version should support the screen formats BGR4, RGB4, BGR3, RGB3, R565, R555 and R664. It loads DIVE.DLL at runtime, so linking against MMPM2.LIB is no longer required. I've also included a copy of DIVE.H.

Note that the code doesn't run anymore on an i386 CPU. At least a 486 is required. Together with the options described in ticket #152, I now get an average speed of 25057 px/ms.

by rudi, 15 years ago

Attachment: diveblit2.zip added

comment:34 by Silvan Scherrer, 15 years ago

i did the blit test with BITMAP_MIRROR 4 as well. note i did not change the options like suggested in ticket:152

Stats:

total pixels blitted : 1597239277
total time, ms : 36831
average roundtrip speed, px/ms : 43366

comment:35 by Silvan Scherrer, 15 years ago

Milestone: Qt Enhanced → Qt 4.6.2

comment:36 by Dmitry A. Kuminov, 15 years ago

Rudi, thank you for your work, the performance boost looks really significant, according to your and Silvan's reports. I'm reviewing your changes right now. The first remark: please use the coding conventions of the surrounding code which in particular means *no* tabs, and it would also be nice to attach .diffs here (rather than the complete source files) since it's always easier to review and apply.

comment:37 by Dmitry A. Kuminov, 15 years ago

Okay, you selected the direct frame buffer way which causes you to perform both the clipping and color conversion manually while copying scan lines. You also don't synchronize with PM (using WM_VRNDISABLED/WM_VRNENABLED) so no surprise that the current implementation has various negative side effects. For example, here (in VBox), the window contents gets distorted under the mouse pointer if you move it while the window is drawing itself. Silvan gets more dramatic effects like the window contents staying at the previous location when moving the window and so on.

You also create a new Dive instance each time a window rectangle needs to be drawn; this doesn't look optimal to me.

So, what I will try to do is:

  1. Use DiveBlitImage for blitting. This will give us automatic color conversion from the 32 bit QImage buffer to whatever color format the screen currently is (which will also cover the exotic banked cases and so on). I think this conversion won't be any slower than the manual route because the conversion operations are trivial. It may even be faster because it's theoretically possible that it is supported in the hardware. Also, using DiveBlitImage will save us from blitting rectangle by rectangle (which is another potential place for optimization inside the driver). Anyway, the test results will show us what is faster.
  2. Properly handle WM_VRNDISABLED/WM_VRNENABLED to make sure blitting is done in sync with PM. This should fix all the problems that cause screen distortion.

comment:38 by rudi, 15 years ago

it would also be nice to attach .diffs here (rather than the complete source files) since it's always easier to review and apply.

Since I'm not a Unix guy I need some advice on what diff tool to use and how...

You also create a new Dive instance each time a window rectangle needs to be drawn, this looks like not very optimal to me.

That is not the case. It's created on the first blit and then re-used. This is because the instance cannot be created in the (static) constructor and I have not found a better, non-intrusive place.

You also don't synchronize with PM (using WM_VRNDISABLED/WM_VRNENABLED)

Aren't we always acting in response to a WM_PAINT (i.e. synchronously) ? I was under the assumption that Dive(De-)AcquireFrameBuffer() would be sufficient in this case.
But reacting on WM_VRNDISABLED/WM_VRNENABLED is certainly good because it would avoid the (most likely expensive) query for visible rects during each blit.

For example, here (in VBox), the window contents gets distorted under the mouse pointer if you move it while the window is drawing itself.

This happens on all systems with software mouse pointers. It can be avoided by turning off the pointer before the blit and turning it on again right after. Silvan already has a quickly hacked fix that does this. However, it might introduce flicker (even on systems with hardware mouse pointer) and will slow down things a bit.

Silvan gets more dramatic effects like the window contents staying at the previous location when moving the window and so on.

This is most likely a result of the fact that the direct frame buffer access is bypassing Panorama's shadow buffer. Here it works perfectly (with SNAP and GENGRADD).

comment:39 by Dmitry A. Kuminov, 15 years ago

Ah right, hDive is created once. Overlooked it, sorry (was too late yesterday).

And no, not all draws go through WM_PAINT in Qt4 (and according to my understanding of Dive, even in this case handling of WM_VRNDISABLED/WM_VRNENABLED would be still necessary for proper synchronization).

As for distortions, I still think that they are caused by the fact we don't do WM_VRNDISABLED/WM_VRNENABLED. Note that there are certain cases where you must not access the frame buffer (e.g. between receipt of a WM_VRNDISABLED message and a WM_VRNENABLED, according to MMAPG). I also have a couple of Dive apps that don't demonstrate distortion under the mouse pointer and they certainly don't turn the mouse off and on (i.e. there is no flicker). I don't have their source code though so can't tell exactly how they achieve that.

comment:40 by rudi, 15 years ago

There are not so terribly many DIVE applications around that do direct frame buffer writes. Most of them use DiveBlitImage in a dedicated blitter thread...

A software mouse pointer (which is IMHO a relic from the stone age of computing) will always flicker if you blit at a sufficiently high rate. You can observe that with the "SHOW" example that comes with the VAC compiler. Interestingly this happens not only when the pointer is directly over the window, but also when it is close enough that - considering the current pointer size and hotspot - an overlap may happen.

I have not looked deep enough at Qt to get an idea how to direct WM_VRNDISABLED/WM_VRNENABLED to our blitting object. It might even get more complicated in case of multiple windows (toplevel widgets). Do you have any ideas ?

@Silvan:

I heard that Panorama has an option in "System setup/System" that allows to disable the shadow buffer. Could you do that and report the results.

comment:41 by Dmitry A. Kuminov, 15 years ago

I know how to do WM_VRNDISABLED/WM_VRNENABLED integration and currently working on that. I will also implement the DiveBlitImage() way and then we will compare (and we will also be able to see if handling these messages helps us to get rid of distortions when doing direct frame buffer access).

comment:42 by Silvan Scherrer, 15 years ago

@Rudi, i changed it to shadow buffer disabled. then i'm able to take screenshots with the SHOW example. with shadow buffer enabled i only get a black screenshot.

comment:43 by Dmitry A. Kuminov, 15 years ago

DiveBlitImage(buf,screen) crashes with SIGSEGV here; still can't find why.

comment:44 by Dmitry A. Kuminov, 15 years ago

Okay, I found the reason. DiveBlitImage() crashes if you set either fccSrcColorFormat or fccDstColorFormat in SETUP_BLITTER to something unsupported. Now I see what Rudi meant in comment 25 about unsupported 32-bit source buffer -- it seems like this call only supports LUT8/R565/BGR3 formats (corresponding to 8/16/24 bit screen color depths).

However, it looks like both these parameters are actually irrelevant for DiveBlitImage() -- and this is not a big surprise, since we already define the color format of the source buffer when we create it and we always use the screen as the destination buffer whose color format is also known. So, simply providing FOURCC_SCRN to both fields makes DiveBlitImage() happy here.

Note that my source buffer is actually BGR4 (which corresponds to the 32-bit QImage used as a paint device in Qt) while the screen is BGR3, and blitting seems to work. At least I see the correct window contents and colors (I need to do some further tests to be completely sure).

comment:45 by Dmitry A. Kuminov, 15 years ago

I also must say that there's no noticeable flicker with this way. One may see a single mouse flicker once in a while (place the mouse pointer to the border of the blit window to see it better) but the very same flicker occurs when using the normal GpiDrawBits() way.

The speed of the DiveBlitImage() is the same as with the direct frame buffer access here but testing on the real hardware is required of course. I will commit the code tomorrow (there are still some small issues to solve).

comment:46 by Dmitry A. Kuminov, 15 years ago

Implemented the DiveBlitImage() approach in r705:707. Note that this approach is the default now (and corresponds to the "-graphicssystem native" command line option). If Dive is not available (or an error happens when allocating a Dive instance), the old GpiDrawBits() approach is used. If you want to manually switch to the old approach, you should use the "-graphicssystem raster" command line option with the given Qt4 GUI application.

Note that I create a new Dive instance per every top-level window. This is done so to minimize the number of DiveSetupBlitter() calls (which is a good practice according to the MMAPG recommendation). According to dive.h, the maximum number of Dive instances is 64 (but I don't know if that's per process or per system). I will try to write a test to find this out but in any case the application should work even if the maximum is reached because the old GpiDrawBits() approach will be automatically used for exceeding windows.
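
The per-window fallback described above can be pictured as a small bookkeeping layer. The following is only a sketch: `DiveInstancePool` and `BlitPath` are hypothetical names, no DIVE calls are made, and the limit is passed in because it is not yet known whether the maximum of 64 from dive.h is per process or per system.

```cpp
#include <cassert>

// Which blitting path a top-level window ends up using.
enum class BlitPath { Dive, GpiDrawBits };

// Hypothetical bookkeeping for a capped number of Dive instances.
class DiveInstancePool {
public:
    explicit DiveInstancePool(int maxInstances) : m_max(maxInstances), m_used(0) {}

    // Called when a top-level window is created: hand out a Dive slot if one
    // is left, otherwise fall back to the old GpiDrawBits() path.
    BlitPath acquireForWindow()
    {
        if (m_used < m_max) {
            ++m_used;
            return BlitPath::Dive;
        }
        return BlitPath::GpiDrawBits;
    }

    // Called when the window is destroyed; only Dive windows hold a slot.
    void release(BlitPath path)
    {
        if (path == BlitPath::Dive) {
            assert(m_used > 0);
            --m_used;
        }
    }

private:
    int m_max;
    int m_used;
};
```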

Note also that for the old approach I changed the mirroring mode to the fastest one (matrix transformation) in r708.

Please test and post the new results here.

Note that there is still a small problem: if you try to move a window of another process over the Qt window you will see distortions. For some reason, PM doesn't notify the Qt application about the visible region changes with WM_VRNENABLED in such cases. I will figure it out tomorrow.

comment:47 by Dmitry A. Kuminov, 15 years ago

Okay, you should never trust the ffVisRgnChanged parameter of the WM_VRNENABLED message. This means that, according to my tests, when the top level window is obscured by top level windows of other processes, PM still sends WM_VRNENABLED to it when the visible region changes (e.g. when the overlapping windows are moved over it back and forth) but with ffVisRgnChanged erroneously set to FALSE. I changed the code to ignore ffVisRgnChanged in r710 and the distortions seem to have gone.
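
The resulting rule can be sketched as a tiny state holder: blitting is blocked between WM_VRNDISABLED and WM_VRNENABLED, and the visible region is treated as stale after every WM_VRNENABLED regardless of the flag. `VisibleRegionTracker` and its members are hypothetical names; this is not the code from r710 and it makes no PM calls.

```cpp
#include <cassert>

// Hypothetical per-window state driven by the two PM notifications.
struct VisibleRegionTracker {
    bool blitAllowed = false; // false between WM_VRNDISABLED and WM_VRNENABLED
    bool regionStale = true;  // true when the visible rects must be re-queried

    // WM_VRNDISABLED: the frame buffer must not be touched from now on.
    void onVrnDisabled() { blitAllowed = false; }

    // WM_VRNENABLED: deliberately ignore the "region changed" flag, since it
    // was observed to be FALSE even when the region did change.
    void onVrnEnabled(bool visRgnChangedFlag)
    {
        (void)visRgnChangedFlag;
        regionStale = true;
        blitAllowed = true;
    }

    // Call after the visible rects have been re-read and the blitter set up.
    void regionQueried() { regionStale = false; }
};
```

The price of ignoring the flag is an extra visible region query on each notification, traded for correct output when other processes' windows move over the Qt window.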

comment:48 by Silvan Scherrer, 15 years ago

i get the following stats now:

Stats:

total pixels blitted : 2427103994
total time, ms : 60031
average roundtrip speed, px/ms : 40430

comment:49 by Dmitry A. Kuminov, 15 years ago

Silvan, I installed Panorama and luckily I get the same behavior as you: when moving the Qt window its contents get distorted. The same happens when I move another window over it. Investigating.

comment:50 by Dmitry A. Kuminov, 15 years ago

My guess is that Panorama assumes that all windows always draw to the associated micro presentation space (as returned by WinGetPS() or WinBeginPaint() for the given window) and use the image drawn into this PS for painting the window contents when moving it and in some other cases. Since we never paint to this PS in DIVE mode, it contains garbage (usually just the image of what is behind the window).

comment:51 by Dmitry A. Kuminov, 15 years ago

We can't do anything here except paint to the window's PS but doing so means the old non-Dive approach. We should try to contact Panorama developers and explain the situation to them. To my understanding, they somehow need to sync the area where Dive paints go and the micro PS in the driver. It should probably be simply the same memory block.

Until then, I will have to disable DIVE by default and introduce a QT_PM_DIVE environment variable that will turn it on.
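
An opt-in switch of this kind might look like the sketch below. The accepted values are not specified in this ticket, so treating any non-empty value as "on" is purely an assumption; `diveRequested` and `diveEnabled` are hypothetical names.

```cpp
#include <cassert>
#include <cstdlib>

// Assumption: any non-empty value turns DIVE on; unset or empty leaves it off.
bool diveRequested(const char *value)
{
    return value != nullptr && value[0] != '\0';
}

// Read the QT_PM_DIVE environment variable named in the comment above.
bool diveEnabled()
{
    return diveRequested(std::getenv("QT_PM_DIVE"));
}
```

Splitting the decision from the `std::getenv()` call keeps the rule easy to check in isolation and to change once the real semantics are settled.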

comment:52 by rudi, 15 years ago

I tested it as well. As expected, it fails on all systems where the screen is not in BGR4 mode. This is due to the wrong blitter setup. I guess, your system is using BGR4 as well (not BGR3 as you wrote earlier and which had really surprised me). In any case fccSrcColorFormat on DiveSetupBlitter *MUST* match the format specified on DiveAllocImageBuffer. As you found out yourself, DIVE won't accept BRG4 as source format. This is why I did the direct frame buffer write which also avoids some other ugly DIVE side effects (artifacts when moving the window, moving a "shape window" crashing the app, only one dive instance per app).

About the mouse pointer in direct mode: I still highly doubt that this is caused by not honoring WM_VRN(EN/DIS)ABLED. Since you now have implemented handlers for those messages, you can check if any of them get called while moving the mouse pointer over a Qt menu. I guess not. What happens is that in "soft-pointer" mode the rectangle covered by the pointer is cached somewhere. When the pointer is moved, the content of this cached area is restored before the pointer is drawn at the new position. Unfortunately, this pointer cache is not invalidated by Dive(De-)aquireImageBuffer, which is causing the distortion. Most likely IBM implemented some hack that avoids this when using DiveBlitImage.

About Panorama: This is the same issue as with my code. What happens is that screen access via DIVE (no matter if direct or via DiveBlitImage) bypasses Panorama's shadow buffer. Note that Silvan couldn't make a screen shot of the SHOW.EXE DIVE example app. This is not so much a problem for a video-player type of application. But we are misusing DIVE for GPI! So in case of a window being moved or scrolled, the display driver might get instructed to perform a screen-to-screen copy. Panorama will then use the shadow buffer as source. However, the data we blitted via DIVE is not there. In theory Panorama should have re-synced its shadow buffer on DiveDeaquireImageBuffer (which is done by DiveBlitImage internally). But this doesn't seem to be implemented. Probably because of the performance loss introduced by the video RAM readback.

I'm quite confident that the distortion when moving Qt windows will go away when you disable the shadow buffer. However, this will kill Panorama's performance completely.

comment:53 by Dmitry A. Kuminov, 15 years ago

Rudi, thank you for your help.

I have now tested the DiveBlitImage() way in different screen modes too. I confirm that it doesn't work (it doesn't fail here though, it simply interprets the source buffer wrongly). And my screen mode was indeed BGR4 when I tested it the first time, so this is where my confusion regarding fccSrcColorFormat comes from (I assumed that it's BGR3/24bit for some reason). Another source of confusion is that fccSrcColorFormat is ambiguous and it doesn't actually need to match the format of the source buffer -- if different, it simply causes DiveBlitImage() to misinterpret the buffer (by overriding the format specified during source buffer creation). BTW, the SHOW example doesn't work in 32 bit mode here, which indicates that even at IBM they lacked a clear understanding of the API.

Anyway, I think I found an acceptable solution for working in DiveBlitImage() mode. The thing is that Qt can actually use any pixel format as a window surface, not necessarily BGR4. So, in r712 I implemented the following logic:

  • If we can create a QImage that matches the screen FOURCC directly (currently, it's R555, R565, RGB3 and BGR4) we use this way. It basically means that no color conversion is done when blitting (should be *almost* as fast as the direct framebuffer access).
  • Otherwise, we use a 24-bit QImage and FOURCC_RGB3 for the buffer. This seems to work well even in indexed 256- and 16-color modes (which are not covered by your direct framebuffer implementation). I estimate it also to be quite comparable to the direct frame buffer access.
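The two cases above can be sketched as a small lookup. This is only an illustration in plain C++, not the code from r712: the `ScreenFourcc` and `BufferLayout` enums, the `SurfaceChoice` struct and the `chooseSurface()` function are hypothetical names that stand in for the real FOURCC constants and QImage formats. The point is that the buffer layout and the source format handed to the blitter are always chosen together, so they cannot disagree.

```cpp
// Hypothetical stand-ins for the screen FOURCC codes named in this comment.
enum class ScreenFourcc { R555, R565, RGB3, BGR4, Other };

// Hypothetical labels for the pixel layout of the window surface buffer.
enum class BufferLayout { Bits15, Bits16, Bits24, Bits32 };

// The buffer layout and the source format passed to the blitter are kept
// in one struct so that they are always selected as a pair.
struct SurfaceChoice {
    BufferLayout layout;        // layout of the off-screen buffer
    ScreenFourcc sourceFourcc;  // format reported to the blitter
    bool convertsOnBlit;        // true if the blit must convert colors
};

SurfaceChoice chooseSurface(ScreenFourcc screen)
{
    switch (screen) {
    // Screen formats the buffer can match directly: no conversion on blit.
    case ScreenFourcc::R555: return { BufferLayout::Bits15, ScreenFourcc::R555, false };
    case ScreenFourcc::R565: return { BufferLayout::Bits16, ScreenFourcc::R565, false };
    case ScreenFourcc::RGB3: return { BufferLayout::Bits24, ScreenFourcc::RGB3, false };
    case ScreenFourcc::BGR4: return { BufferLayout::Bits32, ScreenFourcc::BGR4, false };
    case ScreenFourcc::Other: break;
    }
    // Fallback for everything else (for example indexed modes):
    // a 24-bit buffer described as RGB3, converted during the blit.
    return { BufferLayout::Bits24, ScreenFourcc::RGB3, true };
}
```

For a screen format that is not in the direct list, `chooseSurface(ScreenFourcc::Other)` returns the 24-bit fallback with `convertsOnBlit` set.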

Please check if r712 works for you in various (non-32 bit) screen modes.

The next step is integrating your direct framebuffer implementation to the QPMDiveWindowSurface class.

And yes, the mouse issue has nothing to do with WM_VRNENABLED/WM_VRNDISABLED but in fact it's DiveBlitImage() that does the trick somehow.

comment:54 by rudi, 15 years ago

Unfortunately I'm out of office for the next few weeks and won't be able to do any testing. I'll do so when I'm back. BTW, I asked on the Panorama list about the issue. Maybe they come up with some ideas...

comment:55 by Dmitry A. Kuminov, 15 years ago

I integrated the direct framebuffer access mode in r713. I also introduced the following environment variables:

  • QT_PM_NO_DIVE -- Disables DIVE completely, which is equivalent to "-graphicssystem raster".
  • QT_PM_DIVE=BLIT|FB -- Enables DIVE. If the value is BLIT, the DiveBlitImage() approach is used (slower but no mouse pointer problems). If the value is FB, the framebuffer approach is used (7-10% faster, but creates artifacts on the screen under the mouse pointer). Any other value turns DIVE off (same as QT_PM_NO_DIVE).
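A minimal sketch of how these two variables could be interpreted, in plain C++. It is not the code from r713: `DiveMode`, `parseDiveMode()` and `diveModeFromEnvironment()` are hypothetical names. Two things are assumptions here, since the comment does not specify them: QT_PM_NO_DIVE wins when both variables are set, and the values are compared case-sensitively. What happens when neither variable is set is left to the caller (`Unspecified`).

```cpp
#include <cstdlib>
#include <string>

// Hypothetical mode type; Unspecified means neither variable was set.
enum class DiveMode { Unspecified, Off, Blit, FrameBuffer };

// noDive and dive are the raw variable values, or nullptr when unset.
DiveMode parseDiveMode(const char *noDive, const char *dive)
{
    if (noDive)                      // assumption: NO_DIVE takes precedence
        return DiveMode::Off;
    if (!dive)
        return DiveMode::Unspecified;
    const std::string value(dive);   // assumption: case-sensitive match
    if (value == "BLIT")
        return DiveMode::Blit;
    if (value == "FB")
        return DiveMode::FrameBuffer;
    return DiveMode::Off;            // any other value turns DIVE off
}

// Reads the process environment and applies the rules above.
DiveMode diveModeFromEnvironment()
{
    return parseDiveMode(std::getenv("QT_PM_NO_DIVE"),
                         std::getenv("QT_PM_DIVE"));
}
```

Keeping the parsing in a function that takes the two strings as parameters makes the rules easy to check without touching the real environment.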

comment:56 by Dmitry A. Kuminov, 15 years ago

Everyone is encouraged to test. I need at least two tests with SNAP and Panorama. Note that it's important to test in different screen color modes (at least, the ones specified below). Also note that you should run the BLIT example several times in a row and take the average number. The numbers are also necessary to announce the percentage of the performance boost in the 4.6.2 readme.

Here is a template table for the results:

driver/mode          QT_PM_NO_DIVE  QT_PM_DIVE=BLIT  QT_PM_DIVE=FB
SNAP / 15bit (32K)
SNAP / 16bit (64K)
SNAP / 24bit (16M)
SNAP / 32bit (16M)
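For filling in the table, the averaging over several runs and the percentage boost can be computed as follows. This is a small helper sketch in plain C++; `averageSpeed()` and `boostPercent()` are hypothetical names, and the inputs are assumed to be the "speed" figures in pixels/ms printed by the BLIT example.

```cpp
#include <numeric>
#include <vector>

// Arithmetic mean of the speed figures (pixels/ms) from several runs.
// Returns 0.0 for an empty list instead of dividing by zero.
double averageSpeed(const std::vector<double> &runs)
{
    if (runs.empty())
        return 0.0;
    return std::accumulate(runs.begin(), runs.end(), 0.0) / runs.size();
}

// Relative gain of `candidate` over `baseline`, in percent.
// A positive result means the candidate mode is faster.
double boostPercent(double baseline, double candidate)
{
    return (candidate - baseline) / baseline * 100.0;
}
```

The baseline would be the averaged QT_PM_NO_DIVE figure and the candidate the averaged figure for one of the QT_PM_DIVE modes on the same driver and color depth.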

comment:57 by Dmitry A. Kuminov, 15 years ago

In r714, I added a workaround that prevents screen corruption under the software mouse pointer. This mode is activated by setting QT_PM_DIVE=FBSWM. I simply hide the mouse pointer when it's close to the repainted area and it really looks acceptable. At least, under SNAP/VBox, I can't see a big difference compared to QT_PM_DIVE=BLIT in terms of the mouse pointer flicker.

Silvan and others, please test the QT_PM_DIVE=FBSWM mode too: there should be a slight performance penalty when the mouse pointer is directly over the area being repainted, and I wonder how big it is.
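The "pointer is close to the repainted area" check can be sketched as a rectangle test. This is an illustration in plain C++ under stated assumptions, not the r714 code: `Rect`, `intersects()`, `inflated()` and `shouldHidePointer()` are hypothetical names, rectangles are assumed to have exclusive right/bottom edges, and the `margin` that defines "close" is an invented parameter. The calls that actually hide and restore the pointer are left out.

```cpp
// Hypothetical rectangle type; right and bottom are exclusive.
struct Rect {
    int left;
    int top;
    int right;
    int bottom;
};

// True if the two rectangles share at least one pixel.
bool intersects(const Rect &a, const Rect &b)
{
    return a.left < b.right && b.left < a.right &&
           a.top < b.bottom && b.top < a.bottom;
}

// Returns r grown by margin pixels on every side.
Rect inflated(const Rect &r, int margin)
{
    return Rect{ r.left - margin, r.top - margin,
                 r.right + margin, r.bottom + margin };
}

// The pointer should be hidden for the duration of the framebuffer write
// if its bounding box touches the repaint area or comes within margin
// pixels of it.
bool shouldHidePointer(const Rect &repaint, const Rect &pointer, int margin)
{
    return intersects(inflated(repaint, margin), pointer);
}
```

The cost of this check is a handful of integer comparisons per flush; the penalty mentioned above would come from hiding and redrawing the pointer itself, which only happens when the test returns true.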

comment:58 by Silvan Scherrer, 15 years ago

driver/mode           NO_DIVE  DIVE=BLIT  DIVE=FB  DIVE=FBSWM
Panorama/32bit (16M)  30258    41744      47460    44459

comment:59 by herwigb, 15 years ago

driver/mode       QT_PM_NO_DIVE  QT_PM_DIVE=BLIT  QT_PM_DIVE=FB  QT_PM_DIVE=FBSWM
SNAP/16bit (64K)  8983           11691            12205          12023

comment:60 by Dmitry A. Kuminov, 15 years ago

Herwig, thank you for the information, could you please measure the speed of the 4.5.1 GA release, for comparison?

comment:61 by herwigb, 15 years ago

driver/mode       QT_PM_NO_DIVE
SNAP/16bit (64K)  3203

comment:62 by Dmitry A. Kuminov, 15 years ago

Wow, the performance boost is just amazing in your case. Are you sure about your numbers?

comment:63 by Dmitry A. Kuminov, 15 years ago

VirtualBox is constantly crashing here. I have to wait half an hour each time after that while it checks the HPFS drive... Maybe my PC is overheating, but this really slows me down. And of course, this happens when I'm busy with cooking the release :(

comment:64 by Dmitry A. Kuminov, 15 years ago

Note that in r735 I added detection of the Panorama video driver: if neither QT_PM_DIVE nor QT_PM_NO_DIVE is set, then, if Panorama is detected, DIVE is disabled and otherwise the FBSWM mode is set by default.
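The default selection described here can be sketched as follows. This is plain C++ for illustration, not the r735 code: `DiveMode`, `DiveRequest` and `effectiveDiveMode()` are hypothetical names, the Panorama detection itself is reduced to a boolean input, and the precedence of QT_PM_NO_DIVE over QT_PM_DIVE when both are set is an assumption.

```cpp
// Hypothetical mode type covering the modes named in this ticket.
enum class DiveMode { Off, Blit, FrameBuffer, FrameBufferSoftMouse };

// What the user asked for through the environment.
struct DiveRequest {
    bool noDiveSet;      // QT_PM_NO_DIVE is present
    bool diveSet;        // QT_PM_DIVE is present
    DiveMode requested;  // parsed QT_PM_DIVE value, used only if diveSet
};

DiveMode effectiveDiveMode(const DiveRequest &request, bool panoramaDetected)
{
    if (request.noDiveSet)           // assumption: NO_DIVE takes precedence
        return DiveMode::Off;
    if (request.diveSet)             // an explicit choice is always honored
        return request.requested;
    // Neither variable is set: pick the default for the detected driver.
    return panoramaDetected ? DiveMode::Off
                            : DiveMode::FrameBufferSoftMouse;
}
```

With this shape, a Panorama user can still opt in to DIVE by setting QT_PM_DIVE explicitly, since detection only affects the default.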

comment:65 by Dmitry A. Kuminov, 15 years ago

Resolution: fixed
Status: new → closed

Enough for this defect.

comment:66 by rudi, 15 years ago

A bit late, but...

This is the machine I used for the initial tests earlier in this thread:

driver/mode       NO_DIVE  DIVE=BLIT  DIVE=FB  DIVE=FBSWM
SNAP/16bit (64K)  12284    27858      30699    30229
SNAP/32bit (16M)  13233    24368      24811    25185

And this is a much slower machine, which supports 24bit modes. See ticket #159 for problems and necessary changes to get this (partially) working:

driver/mode       NO_DIVE  DIVE=BLIT  DIVE=FB  DIVE=FBSWM
SNAP/16bit (64K)  5337     13163      17424    17519
SNAP/24bit (16M)  5958     crash      14207    14011
SNAP/32bit (16M)  7277     12094      13473    13332

I could not test the 15bit mode because even though SNAP claims to support it, I was not able to get it activated on any of my machines.

comment:67 by rudi, 15 years ago

I was not looking at the right place: the above machine does support 15bpp. So I can confirm that this mode does work as well.

driver/mode       NO_DIVE  DIVE=BLIT  DIVE=FB  DIVE=FBSWM
SNAP/15bit (32K)  5428     13200      17470    17432


comment:68 by Dmitry A. Kuminov, 15 years ago

Great! Thank you for testing.
