Detaching a resource pool from PvDC failed in vCloud Director 9.7
Detaching a resource pool failed in vCloud Director with the following error message: [ID] ValidationException DELETE_HUB_CONTAINING_VMS. This week I had to troubleshoot a vCloud Director task that failed while detaching a resource pool from a provider vDC. The vCloud Director GUI reported the error shown above. In this article, I will explain the steps I took to fix the issue.
Investigating in the vCloud layer
The error message is quite clear: the vCloud Director task reports that a VM is blocking the detachment of the resource pool. Opening the failed task in the Recent Tasks window brings up a popup window that contains the Job ID. One of the first things I did was check the log files on the vCloud Director cells over SSH to see if they held any additional information.
Open an SSH session to each of the vCloud Director cells and browse to the log folder.
Open the vcloud-container-debug.log file with vim, press the / key, enter the Job ID you want to search for, and press Enter.
vim vcloud-container-debug.log
/JobID
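The interactive vim search can also be scripted, which is convenient when the same Job ID has to be looked up on several cells. The following is a minimal Python sketch, assuming the log is plain text and that the Job ID appears literally on the lines of interest. The function names `find_task_lines` and `search_log` are illustrative and not part of any vCloud Director tooling.

```python
from typing import Iterable, Iterator, List, Tuple


def find_task_lines(lines: Iterable[str], task_id: str) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every line that mentions task_id."""
    for number, line in enumerate(lines, start=1):
        if task_id in line:
            yield number, line.rstrip("\n")


def search_log(path: str, task_id: str) -> List[Tuple[int, str]]:
    """Search one log file; undecodable bytes are replaced, not fatal."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return list(find_task_lines(handle, task_id))
```

Calling `search_log("vcloud-container-debug.log", "<Job ID>")` from the log folder returns the matching lines together with their line numbers, so they can be opened directly in an editor afterwards.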
The log excerpt below contains the error message with some additional information: Cannot remove compute hub moref://2abe16cb-1030-4ca1-a94c-aec9dba8e386/ResourcePool#resgroup-256749 containing VMs. Evacuate all VMs on this hub and try again.
2020-02-24 12:53:04,218 | ERROR | task-service-activity-pool-4360 | ComputeHubSetImpl | Cannot remove compute hub moref://2abe16cb-1030-4ca1-a94c-aec9dba8e386/ResourcePool#resgroup-256749 containing VMs. Evacuate all VMs on this hub and try again. | requestId=f1dde410-e3c6-4ce7-94a5-964b4fcbc08b,request=POST https://vcloud-url.com/cloud/amfsecure,requestTime=1582542542758,remoteAddress=10.0.10.241:38704,userAgent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...,accept=*/* method=rclService.updateResourcePoolSet vcd=3080855e-47b0-4c40-9df3-3d744e66e3c2,task=7d5ceb92-024c-4ea3-bc98-c22c8a643fa6 activity=(com.vmware.vcloud.backendbase.management.system.TaskActivity,urn:uuid:5a2dd78c-69bf-4620-9773-2ba5726df4ff)
com.vmware.vcloud.fabric.compute.ValidationException: ValidationException DELETE_HUB_CONTAINING_VMS
    at com.vmware.vcloud.fabric.compute.chs.ComputeHubSetImpl.validateRemoveHubs(ComputeHubSetImpl.java:223)
    at com.vmware.vcloud.fabric.compute.chs.ComputeHubSetImpl.removeHubs(ComputeHubSetImpl.java:251)
    at com.vmware.vcloud.rcl.impl.RclServiceImpl.updateResourcePoolSetTaskLRRelated(RclServiceImpl.java:2370)
    at com.vmware.vcloud.rcl.impl.RclServiceImpl.updateResourcePoolSetTask(RclServiceImpl.java:2281)
    at com.vmware.vcloud.rcl.impl.RclServiceImpl.executeTask(RclServiceImpl.java:1937)
    at com.vmware.vcloud.backendbase.management.system.TaskActivity$ExecutePhase$1.doInSecurityContext(TaskActivity.java:652)
    at com.vmware.vcloud.backendbase.management.system.TaskActivity$ExecutePhase$1.doInSecurityContext(TaskActivity.java:647)
    at com.vmware.vcloud.backendbase.management.system.SecurityContextTemplate.executeForOrgAndUser(SecurityContextTemplate.java:43)
    at com.vmware.vcloud.backendbase.management.system.TaskActivity$ExecutePhase.execute(TaskActivity.java:654)
    at com.vmware.vcloud.backendbase.management.system.TaskActivity$ExecutePhase.invokeInner(TaskActivity.java:550)
    at com.vmware.vcloud.backendbase.management.system.TaskActivity$TaskActivityBasePhase.invoke(TaskActivity.java:301)
    at com.vmware.vcloud.activity.executors.ActivityRunner.runPhase(ActivityRunner.java:175)
    at com.vmware.vcloud.activity.executors.ActivityRunner.run(ActivityRunner.java:112)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
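The moref URI in the log entry above carries the managed object ID of the resource pool (resgroup-256749), which is handy for locating the pool in vCenter. As a rough sketch, the URI can be pulled apart with a regular expression. The pattern is derived only from the single log line shown here, and the field name `server_id` reflects my assumption that the first component identifies the vCenter Server; `HubRef` and `parse_hub_ref` are illustrative names.

```python
import re
from typing import NamedTuple, Optional

# Shape taken from the one log line above:
# moref://<36-character ID>/<object type>#<managed object ID>
MOREF_PATTERN = re.compile(
    r"moref://(?P<server_id>[0-9a-fA-F-]{36})/(?P<obj_type>\w+)#(?P<moref>[\w-]+)"
)


class HubRef(NamedTuple):
    server_id: str  # assumed to identify the vCenter Server
    obj_type: str   # e.g. ResourcePool
    moref: str      # e.g. resgroup-256749


def parse_hub_ref(log_line: str) -> Optional[HubRef]:
    """Return the first moref URI found in log_line, or None if absent."""
    match = MOREF_PATTERN.search(log_line)
    if match is None:
        return None
    return HubRef(match["server_id"], match["obj_type"], match["moref"])


line = (
    "Cannot remove compute hub "
    "moref://2abe16cb-1030-4ca1-a94c-aec9dba8e386/ResourcePool#resgroup-256749 "
    "containing VMs. Evacuate all VMs on this hub and try again."
)
hub = parse_hub_ref(line)
print(hub)
```

For the line above, `hub.moref` is the resource pool ID that can be compared against the resource pools in vCenter.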
Before we started the detach task in vCloud Director, we had already migrated all virtual machines, Edges and vApp networks from the old resource pool to the new resource pool in the same provider vDC. The only VM still located in the old resource pool was the DrsShellVM. This VM is created automatically by vCloud Director when affinity rules are created.
Investigating in the database (SQL) layer
Please note that the queries in this part do not change anything in the database. We only query the database to see whether any stale objects are blocking us from detaching the resource pool from the provider vDC. Open Microsoft SQL Server Management Studio and use the queries below to find the computehub ID of the resource pool that is causing the error.
select * from computehub_set_computehub;
select * from prov_vdc_logical_resource where lr_type='COMPUTE_HUB_SET';
select * from prov_vdc;
In the output, we can match the id from the prov_vdc table with the prov_vdc_id from the prov_vdc_logical_resource table. We can also match the fo_handle_id from the prov_vdc_logical_resource table with two computehub_set_id records from the computehub_set_computehub table.
Note: there are two records with the same computehub_set_id. The one that we are going to detach is not set as primary in the provider vDC, so in my case we need to select the record that has a 0 in the is_primary column.
With the computehub ID we have found, we can now check whether there are any stale objects within that resource pool. To do so, query all computevm records in the database that match the computehub ID:
select * from computevm where computehub_id=0x4CA840418D984098BD2017A7E76EB3EC
The next query tells us which vApp the VM is part of. Use the id from the previous output as shown below:
select * from vapp_vm where cvm_id=0xCF69F208A32341749AD964E1DF55B2D2
In my case the output was empty: no vApp VM references this computevm record. This means that it is a stale computevm object in the database.
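The two lookups above (computevm by computehub_id, then vapp_vm by cvm_id) can be combined into one read-only anti-join that lists every computevm row on the hub with no matching vapp_vm row. The sketch below runs that query against an in-memory SQLite stand-in, not against the real vCloud Director database. Only the tables and columns named in this article are modelled, the second VM ID is made up, and `find_unreferenced_vms` is an illustrative name. Whether an unreferenced row is really stale still needs to be judged case by case.

```python
import sqlite3

# In-memory stand-in: only the columns mentioned in the article.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE computevm (id BLOB, computehub_id BLOB);
    CREATE TABLE vapp_vm   (cvm_id BLOB);
    """
)

HUB = bytes.fromhex("4CA840418D984098BD2017A7E76EB3EC")
STALE_VM = bytes.fromhex("CF69F208A32341749AD964E1DF55B2D2")
LIVE_VM = bytes.fromhex("00000000000000000000000000000001")  # made-up ID

conn.executemany(
    "INSERT INTO computevm VALUES (?, ?)",
    [(STALE_VM, HUB), (LIVE_VM, HUB)],
)
conn.execute("INSERT INTO vapp_vm VALUES (?)", (LIVE_VM,))

# LEFT JOIN keeps every computevm row; rows without a vapp_vm match
# come back with v.cvm_id set to NULL and are the ones we keep.
UNREFERENCED_QUERY = """
    SELECT c.id
    FROM computevm AS c
    LEFT JOIN vapp_vm AS v ON v.cvm_id = c.id
    WHERE c.computehub_id = ?
      AND v.cvm_id IS NULL
"""


def find_unreferenced_vms(connection, hub_id):
    """Return ids of computevm rows on hub_id that no vapp_vm row references."""
    rows = connection.execute(UNREFERENCED_QUERY, (hub_id,)).fetchall()
    return [row[0] for row in rows]


unreferenced = find_unreferenced_vms(conn, HUB)
print([vm_id.hex().upper() for vm_id in unreferenced])
```

The query text itself is plain SQL; on Microsoft SQL Server the parameter would be replaced by the 0x... literal used in the queries above.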
The resolution is to remove the stale computevm object from the vCloud database. Create a valid backup of the vCloud database before making any changes to it.
Note: If you are not comfortable making changes in the SQL database, please open a support ticket with VMware so that the support team can do this for you.
The stale object can be deleted with the simple query shown below. In the query I specify the computehub_id and the id from the previous output.
delete from computevm where computehub_id=0x4CA840418D984098BD2017A7E76EB3EC and id=0xCF69F208A32341749AD964E1DF55B2D2
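Because this delete is a manual change to a production database, it can be worth guarding it so that it only commits when exactly one row matched. The following sketch shows that pattern on an in-memory SQLite stand-in rather than on the real database. The table holds only the two columns used in the query above, the second row is invented for the demonstration, and `delete_stale_vm` is an illustrative name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE computevm (id BLOB, computehub_id BLOB)")

HUB = bytes.fromhex("4CA840418D984098BD2017A7E76EB3EC")
STALE_VM = bytes.fromhex("CF69F208A32341749AD964E1DF55B2D2")
OTHER_HUB = bytes.fromhex("00000000000000000000000000000002")  # made-up ID
OTHER_VM = bytes.fromhex("00000000000000000000000000000003")   # made-up ID

conn.executemany(
    "INSERT INTO computevm VALUES (?, ?)",
    [(STALE_VM, HUB), (OTHER_VM, OTHER_HUB)],
)
conn.commit()


def delete_stale_vm(connection, hub_id, vm_id):
    """Delete one computevm row; roll back unless exactly one row matched."""
    cursor = connection.cursor()
    try:
        cursor.execute(
            "DELETE FROM computevm WHERE computehub_id = ? AND id = ?",
            (hub_id, vm_id),
        )
        if cursor.rowcount != 1:
            raise RuntimeError(f"expected 1 row, matched {cursor.rowcount}")
        connection.commit()
    except Exception:
        # Undo the delete so an unexpected match count changes nothing.
        connection.rollback()
        raise


delete_stale_vm(conn, HUB, STALE_VM)
remaining = conn.execute("SELECT id FROM computevm").fetchall()
print(len(remaining))
```

On SQL Server the same idea can be expressed with an explicit transaction and a check on @@ROWCOUNT before committing.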
Detaching a resource pool in vCloud Director 9.7
After deleting the stale computevm object from the vCloud database, I tried to detach the resource pool again, and this time it succeeded.
To detach a resource pool, log in to https://vcloud-url.com/provider and click Cloud resources in the hamburger menu. Select the provider vDC you want to edit and click Resource Pools. Select the resource pool you want to detach and click Detach.
By removing the stale object from the vCloud database, I was able to detach the resource pool from the provider vDC. I wanted to detach it because I was decommissioning this resource pool (cluster). The last thing I had to do was manually remove the DrsShellVM.