So recently I encountered a very weird phenomenon at one of my customers, and we had a very hard time determining the root cause of the issue.
My customer buys his servers in sets of 12 (one rack). All servers are equipped with a dual-port Fibre Channel Host Bus Adapter (HBA). Each port is connected to a different fabric (TOP & BOT fabric).
One of the racks, freshly installed a few weeks before the maintenance weekend in which we performed a storage, SAN switch & server upgrade, was causing a whole bunch of issues. Yet all switches were in a healthy state, no errors were visible in the error log, and all ports had successfully performed a Fabric Login (FLOGI).
Our customer uses HP ProLiant DLxxx G7 servers with a combination of QLogic and Emulex Fibre Channel cards. The Fibre Channel switches are HP-branded: HP Brocade 8/40 SAN switch.
As a first step, we verified the port configuration:
- Fixed speed? Yes, 8G.
- Fillword? Configured with mode 3 (aa-then-ia: attempts hardware arbff-arbff (mode 1) first. If the attempt fails to go into active state, this command executes software idle-arb (mode 2). Mode 3 is preferable to modes 1 and 2 as it captures more cases.)
and of course the port statistics (and in more detail, the port errors). Here I noticed the numbers were very static, which means the port was online in the fabric, but no counters (not even the error counters) were increasing. I came to the conclusion the port was not being used in the fabric, even though it was zoned with a storage array and a disk had been presented.
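The "static counters" check can be sketched as a comparison of two snapshots taken a short time apart. This is only a rough illustration: the counter names and values below are hypothetical placeholders, not real switch output, and how you collect the snapshots is up to you.

```python
def changed_counters(before, after):
    """Return the counters whose values differ between two snapshots,
    mapped to the size of the change."""
    return {
        name: after[name] - before[name]
        for name in before
        if name in after and after[name] != before[name]
    }


# Hypothetical counter names and values, for illustration only.
snapshot_1 = {"frames_tx": 1200, "frames_rx": 1180, "crc_err": 0, "enc_out": 0}
snapshot_2 = {"frames_tx": 1200, "frames_rx": 1180, "crc_err": 0, "enc_out": 0}

delta = changed_counters(snapshot_1, snapshot_2)
if delta:
    print(f"counters moving: {delta}")
else:
    print("no counter changed: the port is online but carries no traffic")
```

An empty result for a port that should be serving a presented disk is the symptom described above: logged in, but not actually used.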
We also verified the OS versions together with the firmware revisions, but came to the conclusion that all of these were identical! So that was a no-go either.
We noticed the disk came online temporarily on the VMware ESX hypervisor (1 HBA is connected), but instantaneously disappeared, resulting in dead or unusable paths.
See screenshot below to illustrate the situation.
I redefined all configurations on the switches (delete zone, delete alias and redefine everything) without result.
I redefined the portconfig (portdisable – portcfgdefault – portcfgspeed – portcfgfillword) without result.
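For reference, the port reset sequence could be written out per port roughly like this. It is a sketch under assumptions: the argument order (port, then speed or fillword mode) and the closing `portenable` are my additions and are not spelled out above, so check the command reference for your FOS version before running anything. The port number is a hypothetical example.

```python
def port_reset_commands(port, speed_gbps=8, fillword_mode=3):
    """Build the ordered list of CLI commands to reset one switch port.

    Assumption: each command takes the port number first, followed by
    the value to set. The final portenable is not mentioned in the post
    and is added here so the port comes back online.
    """
    return [
        f"portdisable {port}",
        f"portcfgdefault {port}",
        f"portcfgspeed {port} {speed_gbps}",
        f"portcfgfillword {port} {fillword_mode}",
        f"portenable {port}",
    ]


# Port 5 is a hypothetical example port.
for command in port_reset_commands(5):
    print(command)
```

Generating the commands instead of typing them keeps the order identical across the 12 servers of a rack.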
When connecting the server to a different switch within the same fabric, the link came up and the disk became visible! When connecting it back to the old port, the server saw nothing. So eventually I started documenting all connections on the switch to check whether the issue was limited to one rack, and yes it was! (In the end it was Murphy's law, but we’ll come to this.)
Pretty weird stuff isn’t it?
We gave HP a description of the situation and provided our test plan/scenario together with the supportsaves of the affected switches. Eventually the root cause was found. (& yes, it took several weeks.. but when I look at the issue, that seems acceptable).
The issue manifests in a fabric when a switch gets upgraded and the unit's name server becomes inactive during the HAreboot. If at that specific moment, during the outage, a device polls the name server as part of a login or logout process, some routes can fail and remain offline even after a successful device login (FLOGI).
If you want to determine whether your switches have the same issue, you can follow these steps:
- Open an SSH session to the switch and log in with the root account. Default passwords can be found here.
- Once logged on, execute the following command: rtedebug staticnodes show.
- In normal situations, these numbers should be equal to zero (0).
In order to resolve this issue, we analyzed two resolution processes.
Resolution process 1:
- Disconnect ISLs between sw002 and sw001 and reboot sw002.
- After sw002A is fully up and all servers are logged in, connect one ISL (port0 TOP002A to port 15), check I/O and access, then connect the second ISL.
Resolution process 2:
- Disable the ISL ports on the switch
- Run the command “rtedebug staticnodes delete”
- Re-enable the ISL ports that were disabled in step 1
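Resolution process 2 is all about order, so here is a small sketch that writes the three steps out as one command list. Assumptions on my side: `portdisable` and `portenable` are used to disable and enable the ISL ports (the steps above don't name the commands), and the port numbers are hypothetical.

```python
def staticnodes_cleanup_plan(isl_ports):
    """Return the ordered commands for resolution process 2:
    disable the ISL ports, delete the static nodes, re-enable the ports."""
    ports = list(isl_ports)
    if not ports:
        raise ValueError("at least one ISL port is required")

    # Assumption: portdisable/portenable are the commands used for the ISLs.
    plan = [f"portdisable {port}" for port in ports]
    plan.append("rtedebug staticnodes delete")
    plan.extend(f"portenable {port}" for port in ports)
    return plan


# Ports 0 and 15 are hypothetical example ISL ports.
for step in staticnodes_cleanup_plan([0, 15]):
    print(step)
```

The only point of the helper is that the delete can never end up before the ISLs are down or after they are back up.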
HP informed me the issue should be resolved in the latest FOS firmware (>7.2.x). The upgrade from 7.0.x to 7.1.1a is vulnerable to this issue.
Update: in the meantime I executed resolution process 1 and it resolved the issue; I was able to connect to the storage again using all paths. However, after a thorough discussion with HP support, we concluded that a SAN switch firmware update is required to avoid the issue in the future. It’s better to prevent in advance than to fix afterwards..