This week I had a VM that couldn’t perform a vMotion or a snapshot task. Attempting either task results in an error message as shown in the picture below:
One of the first things I did was check the log of that particular VM for any errors. This is very simple to do from an SSH session to one of the ESXi hosts. Go into the VM folder and use this command to check the log for any error events.
grep error vmware.log
As you can see, there are no useful error messages that could explain the failed tasks. I also checked the log for any snapshot activity:
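The exact search is not shown here, so the following is only a minimal sketch of how such a check could look, assuming the keyword "snapshot" and the default log name vmware.log. The helper name search_vm_log is hypothetical and not part of ESXi.

```shell
# Sketch: case-insensitive keyword search in a VM log.
# search_vm_log is a hypothetical helper; the log file defaults to
# vmware.log in the current directory (the VM folder).
search_vm_log() {
    pattern="$1"
    logfile="${2:-vmware.log}"
    grep -i -e "$pattern" "$logfile"
}

# Example, run from inside the VM folder:
#   search_vm_log snapshot
```

The second argument is optional, so the same helper can also be pointed at a rotated log such as vmware-1.log if one exists in the folder.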
The funny part is that I could only find snapshot activities that completed successfully. The failed snapshot attempts are not even logged in the VMware log.
Let’s have a look at the hostd log, filtered on the VM name, to see if we can find any useful log entries. We can do that with the next command:
grep VMName /var/log/hostd.log
The resolution for this is to restart the management services on the host where the VM resides. There is a knowledge base article available from VMware: url.
According to the knowledge base article from VMware, the hostd and vpxa services need to be restarted from an SSH or console session, and restarting them will not cause any downtime for the running VMs. Since our environment has sufficient resources, we will still migrate all the VMs (except for the problem VM) off the host by using affinity rules. This lowers the risk of any issues that may arise.
Create one VM affinity group that contains all VMs except the problem VM. Create a Host affinity group that contains only the problem host (the one that hosts the problem VM). Now we can create an anti-affinity rule with the two groups to make sure that only the problem VM will run on the host. Select the "Should not run on hosts in group" option in the specification of the rule.
We are now ready to restart the services on the problem host. The issue occurred in a vCD-enabled environment, and I had never restarted the management services in this type of environment before. I can confirm to you guys that it is possible; vCD is just an orchestration tool on top of the vSphere environment. Run the next command to restart the management services:
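The command itself is missing from this post, so what follows is a hedged sketch rather than the exact line I ran. On ESXi the two agents are normally restarted with /etc/init.d/hostd restart followed by /etc/init.d/vpxa restart. The wrapper function restart_mgmt_agents and the INIT_DIR override are hypothetical names added here so the sequence stops if a script is missing.

```shell
# Sketch: restart the ESXi management agents, hostd first, then vpxa.
# INIT_DIR is a hypothetical override for dry runs; on a real ESXi host
# it stays at the default /etc/init.d.
INIT_DIR="${INIT_DIR:-/etc/init.d}"

restart_mgmt_agents() {
    for svc in hostd vpxa; do
        # Stop early if the init script is not where we expect it.
        if [ ! -x "$INIT_DIR/$svc" ]; then
            echo "missing init script: $INIT_DIR/$svc" >&2
            return 1
        fi
        "$INIT_DIR/$svc" restart || return 1
    done
}

# Example, run in an SSH or console session on the problem host:
#   restart_mgmt_agents
```

Expect the host to show as disconnected in vCenter for a short time while vpxa comes back up.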
Restarting the management services fixed the failed vMotion and snapshot tasks.
Remove the newly created affinity rule and groups to put the host back in production.