Opened 15 years ago

Closed 12 years ago

#476 closed defect (fixed)

Acpi 3.18 crashed with acpideamon installed

Reported by: ecsnl Owned by: eco
Priority: blocker Milestone: Release version 3.19
Component: ACPI PSD Version: 3.17
Keywords: Cc:

Description

Pasha, on Friday the 5th of March we worked on a crash of my Thinkpad T60 triggered by ACPI events. Pressing the power button does not crash the system. But pressing the blue ThinkVantage button, opening or closing the screen, or pressing the volume buttons all crashed the system (kernel trap). I also showed you this via a VNC session with the kernel debugger.

You pointed out on IRC that this might be caused by Lars's updated Panorama. I even went back to the standard GRADD from IBM and checked that the GRADD.SYS from IBM was loaded. Creating any of the above listed ACPI events still caused the system to trap.

ACPI build level:

Signature: @#netlabs dot org:3.18#@##1## 1 Mar 2010 06:19:57 pasha:::: 0::@@ ACPI core PSD Driver. (c) netlabs.org 2005-2009
Vendor: netlabs dot org
Revision: 3.18
Date/Time: 1 Mar 2010 06:19:57
Build Machine: pasha
File Version: 3.18
Description: ACPI core PSD Driver. (c) netlabs.org 2005-2009

It is loaded with the switches /APIC /SMP /VBE, using SMP kernel 104a.

After our IRC chat I have at least been able to find a way around the trap. If I *don't* load acpideamon.exe the system does not trap, and I can then press the volume button. The trap is also gone when the system is loaded with Panorama and ACPIDEAMON.EXE is not loaded.

I don't have logs right now but you also have a T60 so you should be able to test this yourself.

Change History (11)

comment:1 by ecsnl, 15 years ago

As an additional note: my system does trap if I choose to power it off via ACPI. I get the same trap I get when I press the volume button.

Pasha, you typed that you have a T60, so you should be able to reproduce this problem.

comment:2 by ecsnl, 15 years ago

After more testing it was found that when a full screen VIO session was open and a debug build of ACPI.PSD was used, it would show additional information when the volume button was pressed. An ACPI event normally looks like this:

EC queue data 1D EC arm "_Q1D" N:4B Cur:4B

Beg Exec EC 75

End Exec EC 76 queue

When the hang occurred (after pressing a button) it looked something like this (the lines may not be in the right order):

EC queue data 1D EC arm "_Q1D" N:4B Cur:4B ErrInt ESR:40 CPU0 <-! ErrInt ReceiveIllegal:3 <-!

Beg Exec EC 75

End Exec EC 76 queue
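The ESR:40 value in the failing trace can be decoded by hand. Below is a minimal sketch, assuming the bit names given for the local APIC Error Status Register in the Intel manual; the function name `esr_describe` is made up for illustration and is not part of ACPI.PSD. With this layout, 0x40 is bit 6, which agrees with the ReceiveIllegal text in the trace.

```c
#include <stdint.h>
#include <stdio.h>

/* Names of ESR bits 0..7 as described in the Intel manual.
 * Bits 0..3 are only implemented on older processors. */
static const char *const esr_bit_names[8] = {
    "Send Checksum Error",
    "Receive Checksum Error",
    "Send Accept Error",
    "Receive Accept Error",
    "Redirectable IPI",
    "Send Illegal Vector",
    "Received Illegal Vector",
    "Illegal Register Address"
};

/* Hypothetical helper: writes a comma-separated list of the error
 * bits set in 'esr' into 'buf'. Stops early if 'buf' is too small. */
void esr_describe(uint32_t esr, char *buf, size_t len)
{
    size_t used = 0;

    if (len == 0)
        return;
    buf[0] = '\0';

    for (int bit = 0; bit < 8; bit++) {
        if (!(esr & (1u << bit)))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? ", " : "", esr_bit_names[bit]);
        if (n < 0 || (size_t)n >= len - used)
            break;              /* output truncated, stop here */
        used += (size_t)n;
    }
}
```

Calling `esr_describe(0x40, buf, sizeof buf)` leaves "Received Illegal Vector" in `buf`.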

comment:3 by ecsnl, 15 years ago

I commented back to Pasha that the last build of ACPI 3.18 from December (which I could find on the FTP server) worked. The builds from the 1st and 2nd of March trapped.

The last PSD from yesterday, with build date 9th of March, also had the same problem. The new build he sent me today, 11th of March, fixed the problem. From IRC with Pasha today:

Roderick Now at boot no problem
Roderick Its when acpideamon is loaded
Roderick You modified something in the build between March 2
Roderick and this new build that fixes it
Roderick because the trap is gone!
[Pasha] it is very bad :(
Roderick ?
[Pasha] I try explain it in acpi-dev
[Pasha] it is very CPU specifics
Roderick So is there now chance other system has problem again ?
Roderick Now that my system works ?
[Pasha] Now I need recheck all my test machines ;-) I afraid, that I set this for some of AMD system....
Roderick But please send an email to acpi-dev so we can document this!!!
Roderick Its good to have some notes in the bugtracker
[Pasha] first I update svn with comment in this problem
[Pasha] AcpiIRQ.c line 730
[Pasha] ACPI318APIC1STCALL.ZIP  29192803.02.10 3:32
[Pasha] in your ftp

comment:4 by ecsnl, 15 years ago

Pasha, you said you needed to test all your systems because of this modification. Let me know if you made any changes so I can check again!

comment:5 by ecsnl, 15 years ago

To document this critical ESR issue, I will copy the emails here.

comment:6 by ecsnl, 15 years ago

Steve posted to acpi-dev

Hi guys,

Pasha made a comment in one of the tickets that the Intel docs do not describe how to manage the ESR.

Section 10.6.3 of volume 3A of the Intel 64 and IA-32 Architectures Software Developer's Manual states

<snip> The ESR is a write/read register. A write (of any value) to the ESR must be done just prior to reading the ESR to update the register. This initial write causes the ESR contents to be updated with the latest error status. Back-to-back writes clear the ESR register. After an error bit is set in the register, it remains set until the register is cleared. </snip>

These are available from

http://www.intel.com/products/processor/manuals/

The above does not match how the PSD is currently managing the ESR. It appears that 2 writes are required to clear the ESR. I recommend that we review the PSD code and ensure it is correct for both AMD and Intel CPUs.

Also, the code at

http://lxr.linux.no/#linux+v2.6.33/arch/x86/kernel/apic/apic.c#L1107

might be useful. It implies we need minor but significant changes to the ESR management logic.

Regards,

Steven

comment:7 by ecsnl, 15 years ago

Pasha's reply:

On Thu, 11 Mar 2010 09:05:56 -0800, Steven Levine wrote:

Hi,

Pasha made a comment in one of the tickets that the Intel docs do not describe how to manage the ESR. Section 10.6.3 of volume 3A of the Intel 64 and IA-32 Architectures Software Developer's Manual states

<snip> The ESR is a write/read register. A write (of any value) to the ESR must be done just prior to reading the ESR to update the register. This initial write causes the ESR contents to be updated with the latest error status. Back-to-back writes clear the ESR register. After an error bit is set in the register, it remains set until the register is cleared. </snip>

AMD:

7.8.14 Error Status Register This register must be written to trigger an update before it can be read. Each write causes the internal error state to be loaded into this register, clearing the internal error state. A second write before another error occurs causes this register to be cleared.

I read it as:

  • HW write to register
  • Each HW write clear status, mean previous status
  • What is a mean "second write" is undefine

Intel:

The ESR is a write/read register. A write (of any value) to the ESR must be done just prior to reading the ESR to update the register. This initial write causes the ESR contents to be updated with the latest error status. Back-to-back writes clear the ESR register.

From here I don't understand who and where write to register. Idioms "Back-to-back" I try find in "NTS's dictionary of american slang" - finding, as I understand "it is series of write". Questions, how many? 1, 2 , 65535, 4Gb e.t.c.

Current handler of ESR work w/o questions ~3 year. I can see this only in AMD VIA chipset. Current problem was in init ESR.

Your, Pavel

comment:8 by ecsnl, 15 years ago

Pasha's reply:

Hi

Both source say as QU, about ReceiveIllegal, SendIllegal. So I can send IPI to vector 0x0-0xf and give this error. So I can test handler at both, Intel and AMD.

Your, Pavel

comment:9 by ecsnl, 15 years ago

Last reply:

In <auto-000000429144@…>, on 03/11/10 at 11:06 PM, "Pavel Shtemenko" <pasha@…> said:

Hi Pasha,

  • HW write to register
  • Each HW write clear status, mean previous status
  • What is a mean "second write" is undefine

The specs could be worded a bit better IMO. The second write will force another status update, but since the assumption is that no new errors have occurred since the last status update, the result will be that the register is cleared.

From here I don't understand who and where write to register. Idioms "Back-to-back" I try find in "NTS's dictionary of american slang" - finding, as I understand "it is series of write". Questions, how many? 1, 2 , 65535, 4Gb e.t.c.

Back-to-back generally means two writes without an intervening read. It's bad language for a spec. I also suspect it is not the only way to clear the register. A write-read-write cycle is going to clear the register too, unless a new error has been recorded.
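The reading of the specs described above can be tried out with a small software model. This is only a sketch under the assumption that the AMD wording is taken literally (each write loads the internal error state into the register and clears the internal state); the `EsrModel` type and its functions are hypothetical and do not touch real hardware.

```c
#include <stdint.h>

/* Hypothetical software model of the ESR, not real hardware access. */
typedef struct {
    uint32_t internal;  /* errors recorded since the last write */
    uint32_t reg;       /* value a read of the ESR returns */
} EsrModel;

/* An error is detected: it accumulates in the internal state only. */
void esr_model_error(EsrModel *m, uint32_t bits)
{
    m->internal |= bits;
}

/* A write (of any value) loads the internal state into the register
 * and clears the internal state. */
void esr_model_write(EsrModel *m)
{
    m->reg = m->internal;
    m->internal = 0;
}

/* A read returns whatever the last write latched. */
uint32_t esr_model_read(const EsrModel *m)
{
    return m->reg;
}
```

In this model both sequences end with the register at zero: write, write (back-to-back) and write, read, write. If a new error is recorded between the two writes, the second write latches that new error instead of clearing the register.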

Notice that the linux code does 4 writes for some older Intel 32-bit chipsets and references the errata.

Current handler of ESR work w/o questions ~3 year. I can see this only in AMD VIA chipset. Current problem was in init ESR.

Chipsets are going to vary because the specs are complex. All we can do is handle the special cases as we discover them. One thing you could do to make it easier to spot these kinds of failures in the future is to do the initial clear in a loop like

do forever
  write esr
  read esr
  if 0 quitloop
  if tried 4 times
    complain and quit
end

You could even do something similar in the ESR interrupt handler.
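The loop suggested above could look roughly like this in C. It is a sketch, not the PSD's actual code: the `EsrAccess` accessors, the `FakeEsr` stand-in and the function names are all hypothetical, and the limit of 4 tries is taken from the pseudocode.

```c
#include <stdint.h>
#include <stdio.h>

#define ESR_CLEAR_MAX_TRIES 4

/* Hypothetical accessors so the loop can run against real hardware
 * or against a fake register. */
typedef struct {
    void     (*write)(void *ctx);   /* write any value to the ESR */
    uint32_t (*read)(void *ctx);    /* read the ESR */
    void      *ctx;
} EsrAccess;

/* Write/read the ESR until it reads zero. Returns the number of
 * rounds needed, or -1 after complaining if it never cleared. */
int esr_clear_checked(const EsrAccess *esr)
{
    uint32_t last = 0;

    for (int tries = 1; tries <= ESR_CLEAR_MAX_TRIES; tries++) {
        esr->write(esr->ctx);
        last = esr->read(esr->ctx);
        if (last == 0)
            return tries;
    }
    fprintf(stderr, "ESR not clear after %d tries, last value %#x\n",
            ESR_CLEAR_MAX_TRIES, (unsigned)last);
    return -1;
}

/* Fake ESR for trying the loop off-target. With 'stuck' set, the
 * error never goes away, which simulates a misbehaving chipset. */
typedef struct {
    uint32_t pending;
    uint32_t reg;
    int      stuck;
} FakeEsr;

void fake_esr_write(void *ctx)
{
    FakeEsr *f = ctx;
    f->reg = f->pending;
    if (!f->stuck)
        f->pending = 0;
}

uint32_t fake_esr_read(void *ctx)
{
    return ((FakeEsr *)ctx)->reg;
}
```

With a fake register holding one pending error, the loop returns after two rounds. With a stuck error it gives up after four rounds, which is the kind of early warning the loop is meant to provide.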

Steven

comment:10 by ecsnl, 15 years ago

Pasha, what is your reply to Steve's email? Please comment in Trac so we have this crucial part of ACPI documented.

comment:11 by David Azarewicz, 12 years ago

Resolution: fixed
Status: new → closed

This ticket is extremely old and for an unsupported version. The section of code this ticket relates to has been completely rewritten in the current version and works much differently. The problem reported cannot occur in the current version so I am closing this ticket. If you have problems with the current version please open a new ticket.
