Tuesday, February 14, 2012

Netra 240 Faulty Processor Replacement


A friend of mine has a Sun Netra 240 server. A couple of days ago he called me about an amber LED that had lit up on its front panel. I connected my laptop to its serial management port to collect some logs from the system controller (sc) and the OS. Some of the output is below.

sc> showenvironment
=============== Environmental Status ===============
--------------------------------------------------------------------------------
System Temperatures (Temperatures in Celsius):
--------------------------------------------------------------------------------
Sensor         Status    Temp LowHard LowSoft LowWarn HighWarn HighSoft HighHard
--------------------------------------------------------------------------------
MB.P0.T_CORE    WARNING   118     --      --      --     115      125      127
MB.P1.T_CORE    OK         82     --      --      --     115      125      127
MB.T_ENC        OK         25    -11      -9      -7      57       60       63

The showenvironment output at the system controller (sc) shows a temperature warning for processor 0: its core sensor (MB.P0.T_CORE) reads 118 C, which is above the 115 C HighWarn threshold.
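If you have a saved console log, the same check can be scripted. Below is a minimal Python sketch, assuming the temperature rows look exactly like the ones above (nine whitespace-separated columns); the function name over_high_warn is my own and not part of any Sun tool.

```python
# Sample rows copied from the showenvironment output above.
SAMPLE = """\
Sensor         Status    Temp LowHard LowSoft LowWarn HighWarn HighSoft HighHard
MB.P0.T_CORE    WARNING   118     --      --      --     115      125      127
MB.P1.T_CORE    OK         82     --      --      --     115      125      127
MB.T_ENC        OK         25    -11      -9      -7      57       60       63
"""


def over_high_warn(text):
    """Return (sensor, temp, high_warn) for every row at or above HighWarn."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Expected columns: sensor, status, temp, 3 low and 3 high thresholds.
        if len(fields) != 9:
            continue
        try:
            temp = int(fields[2])
            high_warn = int(fields[6])
        except ValueError:
            continue  # header row, or a threshold shown as "--"
        if temp >= high_warn:
            hits.append((fields[0], temp, high_warn))
    return hits


if __name__ == "__main__":
    for sensor, temp, limit in over_high_warn(SAMPLE):
        print(f"{sensor}: {temp} C (HighWarn is {limit} C)")
```

On the sample above, only MB.P0.T_CORE is reported.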

# prtdiag -v
System Configuration: Sun Microsystems  sun4u Netra 240
System clock frequency: 167 MHZ
Memory size: 2GB        

==================================== CPUs ====================================
               E$          CPU                    CPU
CPU  Freq      Size        Implementation         Mask    Status      Location
---  --------  ----------  ---------------------  -----   ------      --------
0    1503 MHz  1MB         SUNW,UltraSPARC-IIIi    3.4    faulted     MB/P0
1    1503 MHz  1MB         SUNW,UltraSPARC-IIIi    3.4    on-line     MB/P1


Temperature sensors:
-----------------------------------------
Location       Sensor              Status
-----------------------------------------
MB/P0          T_CORE              warning (118C)
MB/P1          T_CORE              okay 
MB             T_ENC               okay 
PS0            FF_OT               okay
PS1            FF_OT               okay

The prtdiag output at the OS level also shows CPU 0 (MB/P0) as faulted, and the temperature sensors section shows the same warning for P0.
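To pick out CPUs that are not on-line from a captured prtdiag listing, something like the following Python sketch could work. It assumes the CPU rows have the same column layout as the output above; the regular expression and the name cpus_not_online are illustrative only.

```python
import re

# CPU rows copied from the prtdiag -v output above.
SAMPLE = """\
0    1503 MHz  1MB         SUNW,UltraSPARC-IIIi    3.4    faulted     MB/P0
1    1503 MHz  1MB         SUNW,UltraSPARC-IIIi    3.4    on-line     MB/P1
"""

# Columns: CPU id, frequency, "MHz", E$ size, implementation, mask,
# status, location. Only id, status and location are captured.
CPU_ROW = re.compile(
    r"^\s*(\d+)\s+\d+\s+MHz\s+\S+\s+\S+\s+\S+\s+(\S+)\s+(\S+)\s*$"
)


def cpus_not_online(text):
    """Return (cpu_id, status, location) for CPUs whose status is not on-line."""
    bad = []
    for line in text.splitlines():
        match = CPU_ROW.match(line)
        if match and match.group(2) != "on-line":
            bad.append((int(match.group(1)), match.group(2), match.group(3)))
    return bad


if __name__ == "__main__":
    print(cpus_not_online(SAMPLE))
```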

# dmesg | grep -i err
Jan ** **:**:** fmd: [ID 441519 daemon.error] SUNW-MSG-ID: SUN4U-8000-9R, TYPE: Fault, VER: 1, SEVERITY: Major

The system messages also contain a fault report from fmd related to the processor, with the severity shown as Major.
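The fmd line carries its details as comma-separated "KEY: value" pairs, so it is easy to split into fields when going through a long messages file. A small Python sketch, assuming the line has the same shape as the one above (parse_fmd_line is a made-up helper name):

```python
# The fmd line from dmesg above (the date was masked in the original log).
LINE = (
    "Jan ** **:**:** fmd: [ID 441519 daemon.error] "
    "SUNW-MSG-ID: SUN4U-8000-9R, TYPE: Fault, VER: 1, SEVERITY: Major"
)


def parse_fmd_line(line):
    """Split the 'KEY: value' pairs that start at SUNW-MSG-ID into a dict."""
    start = line.find("SUNW-MSG-ID:")
    if start == -1:
        return None  # not an fmd fault message
    fields = {}
    for part in line[start:].split(","):
        key, sep, value = part.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


if __name__ == "__main__":
    info = parse_fmd_line(LINE)
    print(info["SUNW-MSG-ID"], info["SEVERITY"])
```

On Solaris 10 the fmadm faulty command lists the resources that fmd currently considers faulty, which is another way to cross-check the same fault.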

So we had confirmed that there was a problem with processor 0. It could have been caused by dust accumulating in the heat sink, or by a problem with the processor 0 heat-sink fan. But on cross-checking, we found that the fans associated with processor 0 (MB/P0/F0 and MB/P0/F1) were both reported as okay.

Fan Status:
-------------------------------------------
Location             Sensor          Status
-------------------------------------------
F2                   RS              okay
F3                   RS              okay
MB/P0/F0             RS              okay
MB/P0/F1             RS              okay
PS0                  FF_FAN          okay         
PS1                  FF_FAN          okay         

Fan status in prtdiag output.
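The fan check can be scripted in the same way. The Python sketch below assumes three whitespace-separated columns per row, as in the table above, and treats any status other than "okay" as a problem; failed_fans is a hypothetical helper name.

```python
# Fan rows copied from the prtdiag -v output above.
SAMPLE = """\
Location             Sensor          Status
F2                   RS              okay
F3                   RS              okay
MB/P0/F0             RS              okay
MB/P0/F1             RS              okay
PS0                  FF_FAN          okay
PS1                  FF_FAN          okay
"""


def failed_fans(text):
    """Return (location, sensor, status) for fan rows whose status is not okay."""
    bad = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[0] == "Location":
            continue  # separator, blank line, or the header row
        if fields[2] != "okay":
            bad.append(tuple(fields))
    return bad


if __name__ == "__main__":
    print(failed_fans(SAMPLE) or "all fans okay")
```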

So we ruled out the fans. Dust in the heat sink was still a possibility, but we would only find that out after opening the server. In the meantime we decided to order a new processor. After checking the documentation and the SunSolve website, we found that the processor cannot be ordered on its own. So we looked up the system board part number from the showfru output at the sc prompt. We could also have got it from the OS using prtfru. Below is the output of showfru.

sc> showfru
FRU_PROM at MB.SEEPROM
SEGMENT: SD
/ManR
/ManR/UNIX_Timestamp32:      MON SEP 08 13:54:22 2008

/ManR/Description:           FRUID,M'BD,2X1.5GHZ,ROHS,R06
/ManR/Manufacture Location:  Shunde,China
/ManR/Sun Part No:           3753484
/ManR/Sun Serial No:         1X0F5J
/ManR/Vendor:                Mitac International
/ManR/Initial HW Dash Level: 03
/ManR/Initial HW Rev Level:  50
/ManR/Shortname:             MOTHERBOARD
/SpecPartNo:                 885-0963-03

We ordered Sun part no. 3753484, which is the system board.
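If you need to collect part numbers from several machines, the showfru text is simple to parse. Here is a minimal Python sketch, assuming each record is a "/path/name: value" line as in the output above; fru_fields is my own helper name, and only the last path component is kept as the key.

```python
# A few lines copied from the showfru output above.
SAMPLE = """\
FRU_PROM at MB.SEEPROM
SEGMENT: SD
/ManR
/ManR/Description:           FRUID,M'BD,2X1.5GHZ,ROHS,R06
/ManR/Sun Part No:           3753484
/ManR/Sun Serial No:         1X0F5J
/ManR/Shortname:             MOTHERBOARD
/SpecPartNo:                 885-0963-03
"""


def fru_fields(text):
    """Map the last path component of each '/path: value' line to its value."""
    fields = {}
    for line in text.splitlines():
        if not line.startswith("/"):
            continue  # skip the FRU_PROM and SEGMENT header lines
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.rsplit("/", 1)[-1]] = value.strip()
    return fields


if __name__ == "__main__":
    fru = fru_fields(SAMPLE)
    print(fru["Shortname"], fru["Sun Part No"])
```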
After we received the system board, we shut down the server and opened it to have a look at its internals. Opening the server is not that easy, so I strongly recommend following the Sun documentation. We found that the heat-sink paste on processor 0 had completely disappeared, which had let the processor run at a high temperature for a long time and eventually made it faulty.
So we took processor 0 off the new board and used it to replace the faulty one on the server's system board.
After closing the lid and powering the server on again, we found that all the fault LEDs had cleared. The warnings in the showenvironment output at the sc prompt and in the prtdiag output had disappeared as well.
