Started troubleshooting a few issues with our View 4.6 environment and generally cleaning things up, and ran into an issue that I found rather annoying to fix.
We started receiving provisioning errors from the connection brokers. Provisioning desktops would fail, and desktops would fail to power on. The ones that were on were not always shown as running. So, I used my handy client to connect to vCenter, and noticed tasks failing with claims that the host was in a disconnected state. Checking the host directly showed it perfectly happy.
Tried to connect to a few vm consoles on the host, which would fail with an MKS error citing a login failure. Hmm, that was strange. Logging into the host showed no login-related errors in any of the local system logs.
Since this was not going well, I figured I’d try the usual suspects, and restart the management agents. So, I logged into the troubled host and ran ‘/sbin/services.sh restart’. That produced another volley of errors, showing .lock file problems and an inability to lock /etc/vmware/esx.conf. That was a bit disconcerting. Ran stop on the services instead to check things out, ‘/sbin/services.sh stop’. Ran it a few times before things looked like they stopped properly.
Once stopped, ran ‘/sbin/services.sh start’, and saw things starting up properly, but eventually failing to start the HA agent, aam, and timing out. So, hostd at this point was running, but neither HA nor the vCenter agent, vpx, was running. So, the host stayed disconnected from vCenter.
Well, sounds like the vCenter agent was hosed. No problem, I guess the vpx agent will need to be reinstalled. So, I forced a connect operation from vCenter, saw that it was reinstalling the agent, and the host was connected again. Yay, all was good. Not really. Quite a few vm’s were now coming up as invalid or orphaned. Clicking on the host and some vm’s produced more errors with references to a problem with the host’s ‘/etc/vmware/hostd/vmInventory.xml’ file. Great, the fun just doesn’t stop.
Having seen a similar issue before, I put the host in maintenance mode, and then took it out of maintenance mode, and saw a couple of those vm’s return to normal, which was good. But still others were now in an invalid state. At this time, keep in mind that the host still has the vm processes running, and the vm’s themselves are working over RDP, so the environment, however unstable, is still working, for the most part. So, I don’t want to do anything drastic that would actually cause more problems. I couldn’t manage or see the real state of the vm’s, but they were at least available, to some degree.
So, now the host is available again, but the vm’s are invalid and show as distributed to other hosts, while their processes are still running on the original host.
So, having experienced another similar issue on the server side, I knew I could remove the vm’s from vCenter’s inventory, and add them back in. Great, that worked for a couple of them, but they are now showing up as powered off, even though they are running. Next thing was to browse the datastore they actually lived on, right-click the vmx file, and add them back to inventory. Again, for those that allowed this option, all worked, and they were back, but they were still showing that they were powered off.
So, I went to the console and cli, and ran through the process manually, running the commands on the hosts that were actually still running the processes.
So, I logged on to each host in the cluster, and ran
ps |grep <vmname>
If that command did not produce a result, move on to the next host. When you do receive something similar to the output below, you know you have found the host where the vm is actually running, which, most of the time, should be the host that was disconnected.
33173485 vmm0:desktop189
40375832 43697470 mks:desktop189 /bin/vmx
40396313 43697470 vcpu-0:desktop189 /bin/vmx
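One catch with a plain grep is that it matches substrings, so searching for desktop18 would also hit desktop189. Here is a minimal sketch of a stricter check, assuming the ps output looks like the sample above (world names in the form type:vmname) and that awk is available in the host's shell. The function name vm_running_here is made up for illustration.

```shell
# Hypothetical helper (not a VMware tool): reads `ps` output on stdin and
# succeeds only if some field is exactly "<something>:<vmname>",
# e.g. vmm0:desktop189 or vcpu-0:desktop189.
vm_running_here() {
    # $1 = exact VM name to look for
    awk -v vm="$1" '
        {
            for (i = 1; i <= NF; i++) {
                n = split($i, parts, ":")
                if (n == 2 && parts[2] == vm) { found = 1 }
            }
        }
        END { exit(found ? 0 : 1) }
    '
}

# Intended use on each host in the cluster:
#   ps | vm_running_here desktop189 && echo "desktop189 is running here"
```

The exit status is the only result, so it drops straight into an && or an if when checking host after host.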
Now, let’s figure out what datastore this vm is living on. If you haven’t removed the vm from inventory yet, then you can just click on the vm name, and find it. Otherwise, you’ll have to hunt around and check. Once you’ve found it, run the below command to register the vm on the host manually. This was very handy, as the gui client did not give me the option to add all of the vm’s back into inventory.
vim-cmd solo/registervm /vmfs/volumes/<datastore_name>/<vmname>/<vmname>.vmx <vmname>
From my output above, I ran
vim-cmd solo/registervm /vmfs/volumes/datastore1/desktop189/desktop189.vmx desktop189
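With more than a handful of vm's to re-register, typing each path by hand gets old. A small sketch that only prints the register command for a given datastore and vm name, so it can be reviewed before anything is run. It assumes the vmx file sits at /vmfs/volumes/<datastore_name>/<vmname>/<vmname>.vmx, following the template above; print_register_cmd is a made-up name, not part of vim-cmd.

```shell
# Hypothetical helper: prints (does not run) the register command for one VM.
# Assumes the layout /vmfs/volumes/<datastore_name>/<vmname>/<vmname>.vmx.
print_register_cmd() {
    datastore="$1"   # datastore name as it appears under /vmfs/volumes
    vm="$2"          # VM name, also used as the folder and .vmx file name
    printf 'vim-cmd solo/registervm /vmfs/volumes/%s/%s/%s.vmx %s\n' \
        "$datastore" "$vm" "$vm" "$vm"
}

print_register_cmd datastore1 desktop189
```

For the example vm this prints vim-cmd solo/registervm /vmfs/volumes/datastore1/desktop189/desktop189.vmx desktop189. If a vm's folder or vmx file is named differently from the vm itself, the printed path will be wrong, so check it before pasting.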
Those vm’s that I managed to remove earlier were now added back into the host’s inventory, and in the proper state (running). I simply had to move them back into the proper resource pool. Even better, those vm’s which I had not removed from inventory earlier, and had left in the invalid state, were now returned to proper status. Awesome!
So, a little long-winded, and I’m not sure that all of those steps are necessary, but that’s how I got my environment running.
Your mileage will most certainly vary.
-KjB




