Opened 13 years ago

Closed 13 years ago

#8 closed defect (fixed)

Native Command Queueing Slows Access

Reported by: rlwalsh Owned by: cjm
Priority: major Milestone:
Component: driver Version:
Keywords: Cc:

Description

When I build Firefox, I usually have gmake run two processes in parallel ('-j2') and it takes 38 minutes. With NCQ enabled, it takes 45 minutes. I just tried a build running four processes ('-j4'). Without NCQ, it took 33 minutes; with NCQ, it was 39.5 minutes.

I frequently use the "deltree" util to delete my entire build tree which has 100,000(?) files in several thousand directories. With NCQ, it is noticeably slower (almost painfully slow).

The system:

  • Intel DG43GT board with a "G43 Express" chipset including an 82801JIB I/O controller ("ICH10")
  • Core2 Quad @ 2.5gnz with VT (Q8300 SLGUR)
  • 2gb memory
  • Seagate ST3500418AS 500gb drive using JFS with a 64mb cache
  • eCS 2.0 using os2ahci.add copied from my 2.1 installation

Attachments (1)

PAPERINO-20111005-acpi-3.19.8.log (60.7 KB ) - added by ydario 13 years ago.
PCI dump and other HW info

Download all attachments as: .zip

Change History (13)

comment:1 by cjm, 13 years ago

Sorry for the delay -- there was a communication problem on my end ;)

In theory, NCQ could also slow down operations due to the protocol overhead between adapter and device, especially in single-thread workloads, but the differences you describe seem out of proportion and you are using multiple processes/threads so I would actually expect an improvement.

Could you try and turn on level-2 debugging for the driver (options "/c:0 /d /d"), then run two tests, one with NCQ turned on, the other one with NCQ turned off.

The tests should be somewhat comparable, thus copying a single large file to NUL right after booting would be the best as the same sectors and I/O commands should be involved. While running this test, please log the debug output into a file. This requires two command prompt windows:

Window 1:
copy os2ahci$ debug.log

Window 2:
copy <large_file> NUL

Leave the copy running for about a minute, then press Ctrl+C in window 1 to stop the debug logging and save the file. Change the NCQ option in config.sys, reboot and repeat the test.

Thanks,
--Christian

comment:2 by cjm, 13 years ago

Owner: changed from somebody to cjm
Status: newaccepted

comment:3 by cjm, 13 years ago

Forgot to mention: You will need version 1.18 to allow userland tracing (i.e. "copy os2ahci$ debug.log"). Previous versions can only write the debug log to a COM port (default being COM1) and that would be too slow to show useful timing information. Besides, there was no timing information before 1.18...

Last edited 13 years ago by cjm (previous) (diff)

comment:4 by ydario, 13 years ago

On my notebook it is even worse, without NCQ I can copy a file from JFS to JFS (different partition, same disk) at around 15MB/s, with NCQ the same operation is done at 2MB/s.

I followed instructions to produce logging, but the file is too big for attachments (max 256kb), please get it here:

http://web.os2power.com/download/ahcilog.zip

I'm using build 1.18 from mensys.

comment:5 by cjm, 13 years ago

The NCQ log shows that there's almost always a timer tick between the time a command is issued and the completion interrupt. Timer interrupts are around 18/s so we're talking about a long time between command issue and completion.

There's a feature in AHCI which allows collecting completion events in order to have a single interrupt for servicing multiple completions which comes with an artificial interrupt delay but this feature is not used by OS2AHCI. I could imagine the following reasons for this behavior:

  • The adapter switches to delayed interrupt signaling when using NCQ. I would need to check this; so far, I haben't heard about such a "feature"
  • The device itself introduces an artificial delay processing NCQ requests. Same thing: I would need to investigate.

Would it be possible to specify exactly which AHCI controller and which hard disk is involved?

Thanks,
--Christian

Version 0, edited 13 years ago by cjm (next)

comment:6 by cjm, 13 years ago

Minor update: The interrupt delay feature, called Command Completion Coalescing, would signal interrupts on a non-existing port number and would definitely have to be both enabled and supported by the driver. Since the interrupts are still coming in on the correct port, we can discard that part of the theory.

This leaves the device itself. Or a controller problem; there were indications that, for example, certain Intel chipsets (645 if I recall correctly) might incur substantial performance drops with NCQ enabled. Or, of course, some screwup inside the driver.

I just noticed the time stamps are actually not always 55ms apart, thus the timer tick is not 18 times per second on this system; looking at the millisecond counter printed in the log, the situation looks a bit different as there's around 8ms between each command completion. Here's an excerpt from the log with some clutter removed (the first 7 commands, all queued up, then interrupts coming in to mark completion of each command):

# first the seven write requests, all queued up in the same timer
# tick (i.e. not more than a few milliseconds, probably less)
[81763] ahci_write(0.0.0, 150191932, 8)
[81763] issuing command on slot 23
[81763] ahci_write(0.0.0, 150217668, 8)
[81763] issuing command on slot 24
[81763] ahci_write(0.0.0, 150217708, 8)
[81763] issuing command on slot 25
[81763] ahci_write(0.0.0, 150447924, 8)
[81763] issuing command on slot 26
[81763] ahci_write(0.0.0, 250871412, 8)
[81763] issuing command on slot 27
[81763] ahci_write(0.0.0, 53201374, 8)
[81763] issuing command on slot 28
[81763] ahci_write(0.0.0, 54174350, 8)
[81763] issuing command on slot 29

# first interrupt, around 24ms later
[81787] port interrupt for adapter #0, port #0, stack frame 15e8:ffaa
[81787] [ncq_cmds]: active_cmds = 0x3f000000, done_mask = 0x00800000

# second interrupt, around 7ms later
[81794] port interrupt for adapter #0, port #0, stack frame 15e8:ffaa
[81794] [ncq_cmds]: active_cmds = 0x3e000000, done_mask = 0x01000000

# third interrupt, another 8ms later
[81802] port interrupt for adapter #0, port #0, stack frame 15e8:ffaa
[81802] [ncq_cmds]: active_cmds = 0x3c000000, done_mask = 0x02000000

# and so on...
[81806] port interrupt for adapter #0, port #0, stack frame 15e8:ffaa
[81806] [ncq_cmds]: active_cmds = 0x38000000, done_mask = 0x04000000

[81818] port interrupt for adapter #0, port #0, stack frame 15e8:ffaa
[81818] [ncq_cmds]: active_cmds = 0x30000000, done_mask = 0x08000000

[81832] port interrupt for adapter #0, port #0, stack frame 15e8:ffaa
[81832] [ncq_cmds]: active_cmds = 0x20000000, done_mask = 0x10000000

[81843] port interrupt for adapter #0, port #0, stack frame 15e8:ffaa
[81843] [ncq_cmds]: active_cmds = 0x00000000, done_mask = 0x20000000

This gives us 24ms until the first command completes, then around 8ms between each completion. The I/Os in question are not really sequential but not that far apart, either, thus I would expect an average seek time less than, say, 3ms; 24ms + 7*8ms for 7 I/O requests means around 12ms per command. Since we have 8 sectors per write request, this would roughly give us 300KB/s. Ok, the math is quite a bit off the 2MB/s you got but that's only a small subsection of the log and it already shows a substantial issue with NCQ processing.

In the non-NCQ log, completion interrupts usually arrive only a few milliseconds after issuing the command, often even in the same millisecond. This looks more like the average seeks times I would expect, plus there's probably a write cache on the HD as well.

Now it would be interesting to see whether this pattern is consistent across different HW combinations. I'll verify NCQ performance on and AMD790 desktop box with a regular 3.5" hard drive and get back.

Thanks,
--Christian

by ydario, 13 years ago

PCI dump and other HW info

comment:7 by ydario, 13 years ago

I have also ubuntu installed on this notebook, maybe it could help investigating HW configuration. Let me know what do you eventually need.

comment:8 by cjm, 13 years ago

Thanks for the information. So you're using an ICH9 chipset which should be recent enough to rule out any hardware-related quirks.

In the meantime, I have received benchmark reports from Mensys with NCQ turned on and off and it seems as if the HW write cache plays a role in all this. OS2AHCI currently enables/disables the HW write cache as requested by the upstream I/O command only when NCQ is turned on simply because there's a flag in the corresponding NCQ ATA command which allows us to do so. Non-NCQ ATA commands don't seem to have this flag (although I must admit I didn't look very hard -- I'll do so shortly).

This adds a new theory to this defect: If only OS2AHCI with NCQ enabled respects the "HW cache disable" flag, the performance loss may be a side effect of otherwise correct behavior. The most simple test for this theory will be to add a command line flag to always enable the HW write cache regardless of the upstream I/O command flags...

I've added the flag '/F' to OS2AHCI version 1.21 which will force the write cache to be enabled regardless of the upstream I/O flags. I will upload the source ASAP and initiate a build of version 1.21. Once the binary build of version 1.21 is available fow download, please test NCQ performance with the command line flag "/F".

comment:9 by ydario, 13 years ago

Now /n /f is a lot faster!

Running rpmbuild -ba git.spec:

no opt: 3m 26.850000s
/f: 3m 26.570000s
/n: 8m 56.150000s
/n /f: 3m 27.290000s

so build test is as fast as no opt version.

Also the overall build speed is better: with 1.18 the same build was around 4m 10s, but I changed some other thinks (binutils, newer acpi).
Maybe I'll get some time to retest with 1.18 driver and current software setup.

But the biggest suprise has been in file copy: from 16MB/s of .18 build, to 26MB/s with /n /f

So it seems that /n alone is disabling caching at some level.

comment:10 by cjm, 13 years ago

Hmmm... that's both good and bad news. Good news because we know what's causing the problem and bad news because the slow behavior is actually the correct behavior -- if the OS tells a driver not to buffer a write request in the cache but instead write it directly to the disk, the driver should do so. That existing drivers haven't done so doesn't make this a lot better although one could argue that this has worked for many years without complaints about corrupt databases...

I guess one test that we should run at some point would be to compare performance on HPFS. JFS is a journaling file system, thus it will attempt to flush the journal cache quite frequently and this would be a non-buffered write operation. Maybe we should implement some mechanism that detects non-buffered requests and still buffers them but sets up a timer which will explicitly flush the cache a few milliseconds after the last non-buffered I/O. I'll have a look at the JFS code to get a better idea how this is intended to work (I seem to remember it's open source).

comment:11 by cjm, 13 years ago

I just went through the JFS and OS2DASD code and the situation looks like this:

  • OS2DASD sets the XIO_DISABLE_HW_WRITE_CACHE flag in the IORB if the upstream does not explicitly set the 'RW_Flags' to 2
  • JFS doesn't set 'RW_Flags' at all and ignores the HW write cache across the board. It can't sync, either, because there's no support for flushing HW caches in the DASD interface.

This means if we follow the specification, JFS writes with NCQ enabled (and without /F) will always go directly to the disk, bypassing the the HW write cache and causing a substantial performance loss. Based on the information I got so far, I'm coming to the conclusion that OS2AHCI is the first driver that actually supports the XIO_DISABLE_HW_WRITE_CACHE flag...

I'll leave the option /F in the code -- it was initially supposed to be a test option -- and leave it to the users to decide whether they want to use write caches with NCQ. BTW, OS2AHCI explicitly flushes the HW caches whenever suspending or shutting down so there should be no data loss as long as there's no power failure. But I wouldn't want to run a database like that....

I'll close the defect as resolved.

Thanks,
--Christian

comment:12 by cjm, 13 years ago

Resolution: fixed
Status: acceptedclosed
Note: See TracTickets for help on using tickets.