When Disks Fail, Sort Of!

January 17, 2012

Another interesting one for folks.  This is the first time I’ve seen this, as most of my storage comes off of my SAN.  We started getting alerts for virtual machines, and after pulling up a VM console, noticed it was black.  Tried to reset the VM, but the reset task just sat ‘In Progress’ without getting any further.

Logged onto the host and ran ‘vim-cmd vmsvc/getallvms’ to get a listing of all the current VMs, and got an error: ‘Failed to login: SSL Exception: The SSL handshake timed out’.  That sounds like a hostd issue, so I attempted to restart the management services on my ESXi 4.1 box with ‘/sbin/services.sh restart’.  All looked ok, until the script stopped at the ‘USB Arbitrator’ section.  Uh-oh.  Stopping there typically points to a disk/storage issue.

Started going through the logs, and sure enough, there were errors trying to hit a certain path, as well as some new entries:

vmkernel: 0:00:04:30.418 cpu5:4101)<6>megasas_service_aen[8]: aen received

vmkernel: 0:00:04:30.418 cpu1:4409)<6>megasas_hotplug_work[8]: event code 0x0071

vmkernel: 0:00:04:30.461 cpu1:4409)<6>megasas_hotplug_work[8]: aen registered

vmkernel: 0:00:05:28.470 cpu10:9067)VSCSI: 2245: handle 8196(vscsi0:2):Reset request on FSS handle 3474254 (2 outstanding commands)

vmkernel: 0:00:05:28.470 cpu3:4215)VSCSI: 2519: handle 8196(vscsi0:2):Reset [Retries: 0/0]

vmkernel: 0:00:05:28.470 cpu3:4215)megasas: ABORT sn 73067 cmd=0x28 retries=0 tmo=0

vmkernel: 0:00:05:28.470 cpu3:4215)<5>0 :: megasas: RESET sn 73067 cmd=28 retries=0
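If you'd rather not eyeball the log for these, here is a rough Python sketch that tallies the same kinds of entries.  The patterns are taken only from the lines quoted above, so other megasas driver versions may word things differently, and the function and pattern names are made up for illustration.

```python
import re
from collections import Counter

# Patterns derived only from the log lines quoted in this post;
# treat them as assumptions, not a complete list of megasas messages.
PATTERNS = {
    "aen_received": re.compile(r"megasas_service_aen\[\d+\]: aen received"),
    "abort": re.compile(r"megasas: ABORT sn \d+"),
    "reset": re.compile(r"megasas: RESET sn \d+"),
    "vscsi_reset": re.compile(r"VSCSI: \d+: handle \d+\(\S+\):Reset request"),
}


def summarize_storage_events(lines):
    """Count how many lines match each storage-trouble pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "vmkernel: 0:00:04:30.418 cpu5:4101)<6>megasas_service_aen[8]: aen received",
        "vmkernel: 0:00:05:28.470 cpu10:9067)VSCSI: 2245: handle 8196(vscsi0:2):"
        "Reset request on FSS handle 3474254 (2 outstanding commands)",
        "vmkernel: 0:00:05:28.470 cpu3:4215)megasas: ABORT sn 73067 cmd=0x28 retries=0 tmo=0",
        "vmkernel: 0:00:05:28.470 cpu3:4215)<5>0 :: megasas: RESET sn 73067 cmd=28 retries=0",
    ]
    for name, count in sorted(summarize_storage_events(sample).items()):
        print(name, count)
```

In practice you would feed it the lines of a log file copied off the host instead of the sample list.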

Looking more and more like a disk error.  Since these are all local disks, it’s time to run through some diags.  Ok, trying to shut things down and reboot doesn’t work.  Have to hard power-down the host.  Boot into my controller, and all looks ok.  Array is optimal, and no events fired.  That’s strange.  Continuing on to the diags, and everything proceeds normally.  Strange again.  Time to test the disks individually, and boom, there’s an error with disk 1 of the array.  So, a disk failed, but the controller didn’t recognize it, so the hot spare didn’t kick in.  Isn’t it wonderful that we have so many tools for monitoring just about everything?  But when things don’t report correctly, how do you monitor that?
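One partial answer is to not trust a single source: compare what the controller claims against independent evidence, such as aborts and resets showing up in the vmkernel log.  A minimal Python sketch of that idea follows.  The function name, the "optimal" state string, and the threshold are all assumptions for illustration, and how you actually read the controller state depends on your vendor's tooling.

```python
def silent_failure_suspected(controller_state, abort_count, reset_count, threshold=1):
    """Return True when the controller claims health but the logs disagree.

    controller_state: whatever status string the controller reports
                      (assumed here to be "optimal" when healthy).
    abort_count, reset_count: command aborts/resets counted in the logs.
    threshold: hypothetical minimum number of log errors worth alerting on.
    """
    reports_healthy = controller_state.strip().lower() == "optimal"
    log_errors = abort_count + reset_count
    return reports_healthy and log_errors >= threshold


if __name__ == "__main__":
    # Roughly the situation in this post: array "optimal", yet aborts and resets logged.
    print(silent_failure_suspected("Optimal", abort_count=1, reset_count=1))
```

It won't tell you which disk is bad, only that the controller's story and the host's story don't match, which is the cue to go test disks individually.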

Oh well, suppose I’ll have to pull this disk out manually and let the controller detect that to pull in the hot spare.

Guess that’s why I’m still employed, and will be for some time.

-KjB
