Opened 10 years ago

Last modified 10 years ago

#44 accepted defect

TCP/IP stops working after a few minutes.

Reported by: BlondeGuy
Owned by: David Azarewicz
Priority: Feedback Pending
Component: r8169
Version: 1.0.0
Keywords:
Cc: steve53@…

Description

The computer is a new Lenovo ThinkCentre M32 thin client. Networking, including the Samba client included in eCS 2.2 Beta II (Dec 30, 2013), works for a few minutes. I was able to run testlog and copy the files to the server. After a few minutes, the network stops working. Even ping does not work.

Attachments (9)

THUNDERBIRD-20140901-R8169-1.00.0-ThinkCentreM32.log (25.2 KB) - added by BlondeGuy 10 years ago.
Testlog taken while networking was working
R8169_trace_trap.jpg (90.7 KB) - added by BlondeGuy 10 years ago.
digital photo of trap screen resulting from trace lines in config.sys
R8169_network_stopped_formatted.ftf (13.7 KB) - added by BlondeGuy 10 years ago.
Formatted trace dump after network failure
ThinkCentre M32 after kvm switch.zip (88.6 KB) - added by BlondeGuy 10 years ago.
kvm attached to usb port 5
ThinkCentre M32 after kvm switch 2.zip (80.8 KB) - added by BlondeGuy 10 years ago.
After network stops, using new usb driver distribution with trace files.
ThinkCentre M32 after network stopped with ACPI 3_22_5.zip (87.6 KB) - added by BlondeGuy 10 years ago.
Formatted trace of USB after network stopped
THUNDERBIRD-20140923-acpi-3.22.05.zip (95.1 KB) - added by BlondeGuy 10 years ago.
testlog acpi after kvm switch
ThinkCentre M32 after network stopped with nw.ftf (148.3 KB) - added by BlondeGuy 10 years ago.
Trace right after switching KVM
trace2.zip (147.4 KB) - added by BlondeGuy 10 years ago.
Trace with both USB and network

Change History (41)

Changed 10 years ago by BlondeGuy

Testlog taken while networking was working

comment:1 Changed 10 years ago by David Azarewicz

Owner: set to David Azarewicz
Priority: major → Feedback Pending
Status: new → accepted

Interesting. I have never seen anything like this. However, I do see a CONFIG.SYS ordering violation. I'm surprised you don't get a trap, but memory corruption can take many forms. The rule is that all MAC drivers must come after all protocol drivers and after all IFS drivers. You have an IFS driver loaded after the MAC driver. Fix this and see if your problem goes away.
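As an illustration of the ordering rule above, here is a minimal sketch of a CONFIG.SYS checker. The classification is an assumption made for this example, not something MPTS is known to do: `IFS=` lines count as IFS drivers, and `DEVICE=` lines are classified by whether their path contains `\IBMCOM\PROTOCOL\` or `\IBMCOM\MACS\`. The sample lines are illustrative, and `EXAMPLE.IFS` is a hypothetical file name.

```python
import re


def classify(line):
    """Classify a CONFIG.SYS line as 'ifs', 'protocol', 'mac', or None.

    The path-based rules below are assumptions made for this sketch.
    """
    text = line.strip().upper()
    if text.startswith("REM"):
        return None  # commented-out statement
    if re.match(r"IFS\s*=", text):
        return "ifs"
    if re.match(r"DEVICE\s*=", text):
        if "\\IBMCOM\\MACS\\" in text:
            return "mac"
        if "\\IBMCOM\\PROTOCOL\\" in text:
            return "protocol"
    return None


def ordering_violations(config_text):
    """Return (line_no, kind) for each IFS or protocol driver loaded after a MAC driver."""
    violations = []
    mac_seen = False
    for line_no, line in enumerate(config_text.splitlines(), start=1):
        kind = classify(line)
        if kind == "mac":
            mac_seen = True
        elif kind in ("ifs", "protocol") and mac_seen:
            violations.append((line_no, kind))
    return violations


# Illustrative CONFIG.SYS fragment; the last line breaks the rule.
sample = "\n".join([
    r"IFS=C:\OS2\JFS.IFS",
    r"DEVICE=C:\IBMCOM\PROTOCOL\NETBEUI.OS2",
    r"DEVICE=C:\IBMCOM\MACS\R8169.OS2",
    r"IFS=C:\OS2\EXAMPLE.IFS",
])
print(ordering_violations(sample))  # [(4, 'ifs')]
```

An empty list means no IFS or protocol driver was found after the first MAC driver, under the assumed classification.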

comment:2 Changed 10 years ago by BlondeGuy

I never knew that about IFS vs. MAC drivers. I changed the ordering of the drivers, but the problem remains. In addition, I have found that I can now cause networking to stop working just by switching the KVM from the ThinkCentre to some other machine, then back.

This causes the keyboard and mouse to disconnect, then reconnect. After that, ping no longer gives any response, and Samba no longer works.

comment:3 in reply to:  2 Changed 10 years ago by David Azarewicz

Replying to BlondeGuy:

I never knew that about IFS vs. MAC drivers.

Yes, this has always been a requirement. You may or may not notice a problem with some configurations. The IBM MPTS program and Alex's new replacement both enforce this requirement. Unfortunately, some other installers do not.

I changed the ordering of the drivers, but the problem remains. In addition, I have found that I can now cause networking to stop working just by switching the KVM from the ThinkCentre to some other machine, then back.

So, this is a new problem that did not exist before? This is very suspicious.

I'm assuming USB keyboard and mouse, correct?

Which USB drivers are you using?

You keep mentioning only TCP/IP traffic stops. Does your NETBIOS traffic also stop? I was assuming that ALL network traffic stops. Please verify that my assumption is correct.

I would like to see what the R8169 driver is doing after your network traffic stops. Please install the trace version of the driver. Then enable tracing *with wrapping*:

TRACEBUF=512 /M=WRAP,QUEUED,NODTI /D=ALL
TRACE=ON 248

Wait for the failure, then capture a formatted trace and attach it to this ticket.

comment:4 Changed 10 years ago by BlondeGuy

First, I had not thought about NetBIOS yet. When the driver first loads, I can do Net Start Req, and it works as expected. After TCP/IP stops working, I cannot start the requestor. If the requestor is already started, I can still do Net View. I will have a second NetBIOS computer set up in a few days.

Second, I was using USB keyboard and mouse with a KVM to share with the other computers. I switched this out for a PS/2 mouse and keyboard. Only the monitor is left connected to the KVM. Even with this setup, I can switch away, then switch back and TCP/IP (and apparently NetBIOS) stop working. Only the monitor is shared at this point, through a VGA cable that is proprietary to the IOGear KVM.

This is not a new problem caused by reordering Config.Sys. I had overlooked the connection between switching away and back using the KVM and the network stopping. I do not understand how they could be connected, but it is repeatable. In a way it is handy because I can cause the failure in an easily repeatable way.

Third, I installed the trace driver and rebooted, and there was no change in behavior. But when I add the two lines to config.sys and reboot, I get a trap at boot time. A digital photo of the trap is attached.

Fourth, it may not matter what USB drivers I am using, but they are the drivers from the eCS 2.2 beta II disk.

Build Level Display Facility Version 6.12.675 Sep 25 2001
(C) Copyright IBM Corporation 1993-2001
Signature: @#D Azarewicz:11.05#@##1## 2 Oct 2013 13:45:08 DAZAR1 ::::::@@USB EHCI compliant Driver for eCS (c) 2013 D Azarewicz
Vendor: D Azarewicz
Revision: 11.05
Date/Time: 2 Oct 2013 13:45:08
Build Machine: DAZAR1
File Version: 11.5
Description: USB EHCI compliant Driver for eCS (c) 2013 D Azarewicz

Thanks again for your help. I hope there is a simple way to get the trace for the network driver.

Changed 10 years ago by BlondeGuy

Attachment: R8169_trace_trap.jpg added

digital photo of trap screen resulting from trace lines in config.sys

comment:5 Changed 10 years ago by David Azarewicz

I assume you moved the DEVICE=C:\IBMCOM\MACS\R8169.OS2 statement to be the *last* DEVICE= statement in your CONFIG.SYS, correct?

A trap dump would be the next step. Is that possible?

comment:6 Changed 10 years ago by BlondeGuy

I corrupted the installation, and had to begin again. This install easily duplicates the problems I saw before. To be clearer, when the machine is connected to my KVM, then the act of switching away from the computer, then back, causes the network to stop working. Both TCP/IP and NetBIOS seem to stop working when this happens. I'll know more about NetBIOS when a second computer arrives.

To make sure that networking is fine apart from the KVM, I got a separate monitor, keyboard and mouse, and ran a network test for 24 hours. I saw no problem in this case.

To verify, the DEVICE=C:\IBMCOM\MACS\R8169.OS2 statement is the last DEVICE= statement in Config.Sys.

And with these instructions in Config.Sys, I still get a trap.

TRACEBUF=512 /M=WRAP,QUEUED,NODTI /D=ALL
TRACE=ON 248

I will try to get a trap dump.

comment:7 Changed 10 years ago by BlondeGuy

I set the system up to collect a trapdump, but when I press Ctrl-Alt-NumLock-NumLock, the system reboots without collecting a dump. This is a system that has 2 GB of memory, but eCS recognizes only 512 MB. I cannot collect a trapdump at this time.

comment:8 Changed 10 years ago by David Azarewicz

Ok, apparently there is a known problem with the trace facility when you set it up in CONFIG.SYS. So, remove the trace lines from your CONFIG.SYS and boot the system. Then after the network has stopped working:

  1. Type: TRACE ON /D:ALL /B:512 /M:W,Q,NDTI
  2. Type: TRACE ON 248
  3. Type: TRACEFMT and save a formatted trace to a file.

Then attach that formatted trace to this ticket.

Make sure you still are using the trace version of the R8169 driver.

Thanks.

Changed 10 years ago by BlondeGuy

Formatted trace dump after network failure

comment:9 Changed 10 years ago by David Azarewicz

I tried sending you a new driver to test; however, the e-mail keeps being returned with "Recipient's mailbox is full, message returned to sender."

comment:10 Changed 10 years ago by BlondeGuy

My mailbox filled up at 9:22 this morning. It's empty now. Please try again.

comment:11 Changed 10 years ago by BlondeGuy

Thanks, got it. I installed the test version of the USBEHCD.SYS driver, which identifies itself as 11.08, but it acts exactly the same as the 11.05 driver. So, when I switch away and back using the KVM, the network stops functioning.

comment:12 Changed 10 years ago by David Azarewicz

Then I will need a trace of that driver. You can use the same commands as for tracing the R8169 driver except use 226 for the EHC driver.

Type: TRACE ON /D:ALL /B:512 /M:W,Q,NDTI
Type: TRACE ON 226
Type: TRACEFMT and save a formatted trace to a file.

Save the formatted trace and attach it to this ticket.
Thanks.

comment:13 Changed 10 years ago by David Azarewicz

Also, if possible, please try plugging the KVM into the other USB controller and see if that makes any difference.

The problem is that the NIC driver stops getting interrupts. I assumed that this was because the EHCI_0 and the R8169 shared interrupt 16 and the EHCI driver was doing something wrong. That still is likely the case. The other EHCI controller (EHCI_1) is on a different interrupt so switching to that one would add some information to the puzzle.

comment:14 Changed 10 years ago by BlondeGuy

Good idea. The USB ports are numbered from 1 to 6. I had the KVM plugged into port 5, and it behaves as described above. But when it is plugged into port 1, the behavior is different.

Now when I switch the KVM away and back, the network stops and the USB no longer functions.

Changed 10 years ago by BlondeGuy

kvm attached to usb port 5

comment:15 Changed 10 years ago by BlondeGuy

The "ThinkCentre M32 after kvm switch" trace file is attached. This is with the KVM on port 5, so the keyboard and mouse are working after the KVM switch.

comment:16 Changed 10 years ago by David Azarewicz

The trace points I was looking for are not in the file you attached. Which version of the USB stack are you using? If it is some older version, that would explain why I can't find the data I need. Please use only the last build I sent you for all testing. My current theory is that this is a USB issue. I don't want to waste time debugging old code.

comment:17 Changed 10 years ago by David Azarewicz

Also the TFF files seem to be missing. It seems that I didn't put them in the warpin package. I'll send you a new build.

Changed 10 years ago by BlondeGuy

After network stops, using new usb driver distribution with trace files.

comment:18 Changed 10 years ago by BlondeGuy

Installed new USB driver and collected trace files. See ThinkCentre M32 after kvm switch 2.ftf.

comment:19 Changed 10 years ago by David Azarewicz

This is a puzzling problem and I need more information. I still cannot figure out what is happening on that system.

I need to know which port is connected to which controller. Probably the easiest way to determine this, which will also provide additional information, is to plug the KVM into port 5, switch away and back, then run acpistat a few times. Look at interrupts 16 and 23. Which one is counting up?

Also, can you recover from the stuck network? Does unplugging and replugging the KVM do anything? How about if it is plugged into port 1? Does the USB start working again if you replug it? Does any other USB device work, like another mouse or keyboard?

Also, please install the debug PSD from the new package I sent you. Add the /WRAP switch to the PSD= line in the config.sys. Reboot and do a "testlog acpi" after the network stops working and attach it to this ticket.

Thanks.

Changed 10 years ago by BlondeGuy

Formatted trace of USB after network stopped

comment:20 Changed 10 years ago by BlondeGuy

With the KVM (mouse and keyboard) plugged into port 5, I ran acpistat, and interrupt 16 goes up very slowly (by 5 or 6), and interrupt 23 goes up very quickly (by hundreds).

After switching the KVM once, interrupt 16 never changes at all, while interrupt 23 changes about the same as before. Unplugging and plugging the KVM still beeps, but the network does not recover.

With the KVM in port 1, USB and network are stopped. Unplugging and plugging the KVM or extra USB keyboard on port 1 does nothing, but both work if plugged into port 5. The network does not restart in any case.

Look for the new testlog acpi attachment. (There is also a new USB trace, which I took before fully reading your instructions.)
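The comparison described above (run acpistat twice and see which interrupt counts still move) can be sketched as follows. This is only an illustration: the counts are made-up numbers typed in by hand, not parsed acpistat output, and `stalled_irqs` is a hypothetical helper name.

```python
def stalled_irqs(before, after):
    """Return the IRQ numbers whose count did not increase between two samples."""
    return sorted(irq for irq in before if after.get(irq, 0) <= before[irq])


# Hypothetical counts read off two acpistat runs after switching the KVM.
sample_1 = {16: 1200, 23: 45000}
sample_2 = {16: 1200, 23: 45350}

print(stalled_irqs(sample_1, sample_2))  # [16]
```

An IRQ that stays in the stalled list across several samples while traffic should be flowing matches the observation that interrupt 16 never changes after the switch.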

Changed 10 years ago by BlondeGuy

testlog acpi after kvm switch

comment:21 Changed 10 years ago by David Azarewicz

Thank you, that was very useful information. I now know what is happening; now I just need to figure out why.

For the R8169 trace, do you remember if you enabled tracing before or after you switched the KVM away? I need to see a trace that was tracing during the switch. If you didn't do it this way, please do it and attach the formatted trace file:

  1. Plug the KVM into port 5.
  2. Reboot the system so everything is working
  3. Type: TRACE ON /D:ALL /B:512 /M:NW,Q,NDTI
  4. Type: TRACE ON 248
  5. Switch the KVM away and back
  6. Type: TRACEFMT and save a formatted trace to a file.

Note that I changed the parameters for #3 to NW (no wrap).

Changed 10 years ago by BlondeGuy

Trace right after switching KVM

comment:22 Changed 10 years ago by BlondeGuy

I collected the trace file, but it is smaller than the others.

A new behavior showed up while I was tracing. The first time I switched the KVM, it did not kill the network. I have never seen it do that before. A second switch of the KVM did kill the network, and then I took the trace.

I tried it again (without tracing), and the KVM killed the network the first time.

comment:23 Changed 10 years ago by David Azarewicz

I am having trouble figuring out what is causing the problem and how the USB is interfering with the NIC. Perhaps a mixed trace of both the USB and NIC at the same time may show some interaction. When you get a chance, please try to capture the following trace (I used wrapping because the USB generates so much trace information and I want to see the end):

  1. Plug the KVM into port 5.
  2. Reboot the system so everything is working
  3. Type: TRACE ON /D:ALL /B:512 /M:W,Q,NDTI
  4. Type: TRACE ON 226,248
  5. Switch the KVM away and back
  6. Verify that the network is stopped.
  7. Type: TRACEFMT and save a formatted trace to a file. Then attach it to this ticket.

Thanks.

Changed 10 years ago by BlondeGuy

Attachment: trace2.zip added

Trace with both USB and network

comment:24 Changed 10 years ago by BlondeGuy

The mixed trace is called trace2.zip, attached.

comment:25 Changed 10 years ago by David Azarewicz

Thank you. That information was very helpful.

Please do this test to see if I have correctly identified the problem:

Go to the Software Downloads section of my website and get the pcicfgwr driver. Here is a direct download link: http://88watts.net/download/pcicfgwr-0.01.zip

Unzip the package and put the pcicfgwr.sys driver in the root directory of your boot disk.

Add this line to your CONFIG.SYS:

BASEDEV=pcicfgwr.sys 0:2:0 W4=0x0407

Reboot and let me know if the problem goes away. If not, please attach the output of the PCI command after running this test.

comment:26 Changed 10 years ago by BlondeGuy

With the USB plugged into port 5, I added the BASEDEV as you wrote it above. I can now switch the KVM over and over without stopping the network.

With the USB plugged into port 1, I rebooted, and if I switch away and back, the network is fine, but the USB mouse and keyboard do not work. If I switch away and back again, the eCenter disappears, and the mouse and keyboard are working again.

I tried removing the USB widget from the eCenter and rebooting. Now the KVM is working fine and the network still works after switching.

comment:27 Changed 10 years ago by David Azarewicz

Priority: Feedback Pending → major

Thank you. This confirms what the problem is.

The problem is that the video hardware is generating spurious interrupts when you switch away and back. Technically, this is a hardware defect (or a BIOS defect). Hardware is supposed to power up with interrupts disabled, which is obviously not the case here. There is no driver for this device and since there is no driver it obviously has not enabled the interrupt, so this interrupt should not be occurring. The pcicfgwr.sys command that I gave you disables the interrupt generating capability of the video hardware at the PCI interface level. This is an OK workaround for now. At this point I don't know how else to handle this situation. There is no driver or other software that is broken that can be fixed. I also wonder if this is a one-off defect in the hardware, or a design flaw in that series of systems.
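To make the workaround value concrete, here is a small sketch that decodes 0x0407 as a PCI Command register value. It assumes that `W4=0x0407` means "write the 16-bit word 0x0407 at configuration offset 4" and that `0:2:0` is the bus:device:function of the video device; both readings are inferred from the description in this ticket, not from pcicfgwr documentation. The bit names come from the standard PCI Command register layout, and only the bits set in this value are listed.

```python
# Standard PCI Command register bits (configuration offset 0x04).
# Only the bits relevant to 0x0407 are listed here.
COMMAND_BITS = {
    0: "I/O Space Enable",
    1: "Memory Space Enable",
    2: "Bus Master Enable",
    10: "Interrupt Disable",
}


def decode_command(value):
    """Return the names of the listed Command register bits set in value."""
    return [name for bit, name in sorted(COMMAND_BITS.items()) if value & (1 << bit)]


for name in decode_command(0x0407):
    print(name)
# I/O Space Enable
# Memory Space Enable
# Bus Master Enable
# Interrupt Disable
```

Under these assumptions, the write keeps the device's I/O, memory, and bus-master access enabled and sets bit 10, which stops the device from asserting its legacy interrupt line.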

In case it wasn't clear already, this issue has nothing to do with the NIC hardware or driver, or the USB hardware or drivers.

The USB and eCenter issue when plugged into port 1 is probably not related.

comment:28 Changed 10 years ago by BlondeGuy

I'm going to revert to the latest released drivers, and test the PCICFGWR.SYS fix. If that works, I'd guess I can put the computer into service.

I'll have the computer until around Warpstock if you think of anything else worth looking at.

Thanks for your help.

comment:29 in reply to:  28 Changed 10 years ago by David Azarewicz

Replying to BlondeGuy:

I'm going to revert to the latest released drivers, and test the PCICFGWR.SYS fix. If that works, I'd guess I can put the computer into service.

Yes, that is what I expected you would do.

The pcicfgwr.sys driver simply does its business and exits. In this case it simply writes one word to one PCI register and exits. So it doesn't hang around using resources or anything. I wrote that driver a few years ago to handle oddball systems where the BIOS didn't set things up correctly. I never expected to need it to do something like this. I have never seen any system where hardware has a live interrupt enabled with no software to handle it.

The earlier you put it in the config.sys the earlier that interrupt gets disabled. It would be OK to make it the first BASEDEV.

I'll have the computer until around Warpstock if you think of anything else worth looking at.

Yes, I'm still thinking about it.

comment:30 Changed 10 years ago by David Azarewicz

Priority: major → Feedback Pending

Please check your BIOS settings for something similar to "Allocate IRQ for PCI VGA" and make sure it is off. It might be in Advanced PCI PNP or somewhere similar. Also make sure the BIOS is set to a non-PNP OS. If your BIOS has this IRQ allocation setting and you can turn it off, that would be a better fix than the pcicfgwr.sys thing (if it works).

comment:31 Changed 10 years ago by Steven Levine

Cc: steve53@… added

comment:32 Changed 10 years ago by BlondeGuy

I've been through the BIOS settings, and I don't see anything that helps. Thanks for the suggestion.
