
XenServer 6 Hint #1

I want to share this hint with you about my experience with XenServer 6 Technical Preview…

My goal was to install XenServer 6 on some HP WS460C workstations and BL460 servers in an HP C7000 enclosure. The workstations and servers were good to go, but then the XenServer 6 TP installer hit its first issue: the server rebooted when the installer tried to initialize the hardware, then started all over again and again and again… “the bunny method”. Would I ever get my XenServer 6 installed, or were my expectations too high? I looked towards the cloud, and I verified that the currently released XenServer 5.6 SP2 had no problem installing on the same hardware. With a little help from my friends, the solution turned out to be simple but weird. I hope Citrix will fix this in the final release of XenServer 6 when it ships.
* Update 13/10-2011: this is not fixed in XenServer 6.

To get the XenServer 6 installer up and running, do the following:
Go to “Power Management Options” >> “Advanced power management options” >> “Minimum Proc Idle power state” and set it to “No C-State”.

This is a known issue on HP, Dell, and IBM hardware, and maybe others.

Citrix article CTX127395, “Hosts Become Unresponsive with XenServer 5.6”, explains how this happens; the same applies to XenServer 6.

Symptoms

Random server lockups

XenServer hosts intermittently and without any apparent reason become completely unresponsive and lose network connectivity, serial console access and local console access. To recover, a hard reboot is required in most circumstances. Examination of logs and traces typically fails to provide an explanation on why the issue occurred.

Hosts that are experiencing this issue might display the following log entries, which indicate long idle periods leading up to the system freeze:

Oct 11 15:02:58 HOSTNAME -- MARK --
Oct 11 15:22:58 HOSTNAME -- MARK --
Oct 11 17:54:06 HOSTNAME syslogd 1.4.1: restart.
Oct 11 17:54:06 HOSTNAME kernel: klogd 1.4.1, log source = /proc/kmsg started.

The freeze shows up as the time gap between 15:22 and the reboot at 17:54. Had the server still been alive during that gap, it would have continued to log -- MARK -- every 20 minutes while there was no other system activity.
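The gap check described above can be scripted. The following is only a minimal sketch in Python, not something taken from the Citrix article: the function name `find_log_gaps`, the 25-minute threshold, and the year passed in (classic syslog timestamps carry no year) are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def find_log_gaps(lines, year, max_gap=timedelta(minutes=25)):
    """Return (before, after) timestamp pairs that lie further apart than max_gap."""
    stamps = []
    for line in lines:
        # Classic syslog lines start with "Mon DD HH:MM:SS" (15 characters).
        # The year is not logged, so the caller has to supply it.
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # line does not start with a timestamp, skip it
        stamps.append(stamp)
    # Compare each timestamp with the one that follows it.
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > max_gap]


# The log excerpt from above; the year is a placeholder.
sample = [
    "Oct 11 15:02:58 HOSTNAME -- MARK --",
    "Oct 11 15:22:58 HOSTNAME -- MARK --",
    "Oct 11 17:54:06 HOSTNAME syslogd 1.4.1: restart.",
    "Oct 11 17:54:06 HOSTNAME kernel: klogd 1.4.1, log source = /proc/kmsg started.",
]

for before, after in find_log_gaps(sample, year=2010):
    print(f"no log entries between {before:%H:%M:%S} and {after:%H:%M:%S}")
```

The threshold sits a little above the 20-minute MARK interval so that a healthy but idle host does not trigger a false alarm.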

The frequency of occurrences has varied from multiple times per day to once every 10 days or so. Servers where all CPU cores are kept constantly busy might never encounter the issue, because no core ever goes into a deep C-state.

Other issues

In addition to the total system freezes, C-states have been linked to a number of less severe problems, such as erratic network performance, bus resets on storage adapters, randomly crashing processes, and unexplained low system performance.

Cause

The Nehalem CPU architecture introduced new power-saving features in the form of deep C-states, which CPU cores can enter during idle periods, essentially allowing a partial power-down of the CPU. Unfortunately, processors, like software, have bugs. Rather than call them bugs, Intel prefers to call them “errata”, and there are a number of known CPU errata in both the Nehalem and Westmere CPU cores with regard to the new C-state features. In particular, under certain conditions C3/C6 state transitions can lead to system freezes and other erratic behavior. As software platforms have been updated to support these new power-saving features, the real-life results of these errata have gradually come to light, and problems have been observed on every software platform that has implemented support for the deep C-states (see “External links” at the bottom of this document). Older operating systems, including versions of XenServer prior to 5.6, were not affected because they did not yet make use of the new C-state features and so never encountered these CPU problems.

Citrix engineering is still investigating the exact mechanisms by which the errata lead to the observed symptoms. Currently, the errata most strongly believed to be the cause of the full system hangs are as follows:

For Nehalem:

75xx – BA80
55xx – AAK120
35xx – AAM108
34xx – AAO67

For Westmere:

56xx – BD59
36xx – AAY54
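As an illustration, the two lists above can be turned into a small lookup. This is only a sketch: the function name `suspect_erratum` is hypothetical, and the assumption that the model number appears as one letter followed by four digits (for example "X5650") in a `/proc/cpuinfo`-style model string may not hold for every part.

```python
import re

# Erratum IDs per Xeon series, copied from the lists above.
SUSPECT_ERRATA = {
    "75": "BA80", "55": "AAK120", "35": "AAM108", "34": "AAO67",  # Nehalem
    "56": "BD59", "36": "AAY54",                                  # Westmere
}


def suspect_erratum(model_name):
    """Return (series, erratum) for a Xeon model string in the lists, else None."""
    if "Xeon" not in model_name:
        return None
    # Assumption: the model number looks like "X5650" or "E5520",
    # that is, one capital letter followed by exactly four digits.
    match = re.search(r"\b[A-Z](\d{2})\d{2}\b", model_name)
    if match is None:
        return None
    series = match.group(1)
    erratum = SUSPECT_ERRATA.get(series)
    return (series + "xx", erratum) if erratum else None


# Example model string in the style of /proc/cpuinfo.
print(suspect_erratum("Intel(R) Xeon(R) CPU X5650 @ 2.67GHz"))
```

A result of `None` only means the series is not in the lists above; it says nothing about whether that CPU has C-state errata of its own.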

Detailed descriptions of the CPU errata can be found on Intel’s tech docs site, at the following links. The relevant documents are under the “Specification Updates” heading.

http://www.intel.com/p/en_US/products/server/processor/xeon3000/technical-documents/

http://www.intel.com/p/en_US/products/server/processor/xeon5000/technical-documents/

http://www.intel.com/p/en_US/products/server/processor/xeon7000/technical-documents/

Resolution

Go into the BIOS menu and make the following changes:
Set C-states to Disabled.
Set Turbo Mode to Disabled.

If your server BIOS has power management options that leave power management to the BIOS rather than the operating system, such as Dell’s Active Power Controller mode, also disable this and set the power management options to OS Control.
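After changing the BIOS settings, it can be useful to see which idle states the operating system still reports. The sketch below reads the standard Linux cpuidle sysfs directory; the function name `list_idle_states` is hypothetical. It assumes a plain Linux host: on a XenServer host the hypervisor manages the idle states, so this directory in the control domain may be missing or may not reflect what the hardware does.

```python
from pathlib import Path


def list_idle_states(cpu_dir="/sys/devices/system/cpu/cpu0"):
    """Return the idle-state names Linux exposes for one CPU, in state order."""
    idle_dir = Path(cpu_dir) / "cpuidle"
    if not idle_dir.is_dir():
        # No cpuidle interface: older kernel, or idle states handled elsewhere.
        return []
    # Entries are named state0, state1, ...; sort them numerically.
    states = sorted(idle_dir.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    return [(p / "name").read_text().strip() for p in states]


print(list_idle_states())
```

If deep states such as C3 or C6 still show up after C-states were disabled in the BIOS, that suggests the setting did not take effect or the operating system is using its own idle driver.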

[Screenshots in the original: Dell and HP BIOS screens showing the C-state settings.]

Note: Some manufacturers are now shipping their servers with C-states turned off in BIOS, by default. IBM appears to have done this from the start, and Fujitsu reportedly has C-states turned off in their latest BIOS revisions.

Status

The core problem is faults in the hardware implementation of the new C-state features in this generation of CPUs. Citrix is investigating what software workarounds can be put in place to avoid these issues, but recommends that on current affected hardware, C-states should remain disabled in the BIOS until Intel can provide a CPU microcode update that makes C-states behave as designed.

More Information

Affected hardware:

Potentially any server based on the Nehalem and Westmere architectures. This includes

Nehalem:

Xeon 75xx, 55xx, 35xx, 34xx

Westmere:

Xeon 56xx, 36xx

A comprehensive table of CPUs based on the Nehalem and Westmere cores can be found here:

http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)

Special note on Broadcom 5709 / 5716 NICs

Servers with the Broadcom 5709 and 5716 adapters have proven to be particularly susceptible to the system freeze issue. One suspected link is that the Broadcom driver included on the XenServer 5.6 installation disk contains a known defect (RHBZ 511368), which leads to a lost interrupt vector and subsequent loss of network connectivity when the server is under load. This, in turn, can leave the server and its virtual machines idle for lack of network requests, which causes the CPUs to enter the problematic C-states, with a resulting hang. Servers with these Broadcom NICs thereby become susceptible to freezes under both low and high system load.

Disabling C-states prevents the Broadcom driver bug from causing hangs, but Citrix nonetheless strongly recommends that customers with Broadcom cards contact Citrix Technical Support to receive an updated driver disk.
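To find out whether a host has one of these adapters, the device list can be scanned for the chip names. This is a rough sketch: the function name `find_affected_nics` is hypothetical, the sample lines are illustrative rather than copied from a real host, and it assumes the chips show up as "BCM5709" or "BCM5716" in `lspci`-style output.

```python
import re

# Chip names as they are assumed to appear in lspci-style device descriptions.
AFFECTED_NIC = re.compile(r"BCM57(09|16)")


def find_affected_nics(lspci_lines):
    """Return the device lines that mention a Broadcom 5709 or 5716 adapter."""
    return [
        line.strip()
        for line in lspci_lines
        if "Broadcom" in line and AFFECTED_NIC.search(line)
    ]


# Illustrative input, not taken from a real machine.
sample_lspci = [
    "03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)",
    "05:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)",
]

for device in find_affected_nics(sample_lspci):
    print(device)
```

On a live system, the input lines could come from running `lspci` and splitting its output by line.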

External links for further reading:

Dell – http://lists.us.dell.com/pipermail/linux-poweredge/2010-May/042280.html

Microsoft – http://support.microsoft.com/kb/2000977

IBM – http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=MIGR-5083648

IBM – http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=MIGR-5085841&brandind=5000008

VMware – http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1028656

Oracle – http://forums.oracle.com/forums/thread.jspa?threadID=1924462&start=15&tstart=0

Red Hat – https://access.redhat.com/kb/docs/DOC-26837 (available for Red Hat subscribers)
