Recently I stumbled on an issue where the cluster shared volumes on a Hyper-V cluster were put in redirected access. This doesn’t happen by itself, though. Here is how the issue appeared and how it was resolved.
I support a 5-node Hyper-V cluster which has 5 cluster shared volumes (CSVs) and a quorum disk. It all began with a minor interruption in the SAN storage service. Even though the Storage team didn’t detect any issues, the interruption was noticed by all servers that had LUNs connected to that storage, so the Hyper-V cluster was not the only one affected. After the storage issue was fixed we noticed several alerts in SCOM related to the cluster’s nodes. The description of the alerts was the following:
Cluster Shared Volume ‘Volume1’ (‘Volume1’) is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network through the node that owns the volume. This may result in degraded performance. If redirected access is turned on for this volume, please turn it off. If redirected access is turned off, please troubleshoot this node’s connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished.
There was a separate alert for every CSV on the cluster. At first I tried to return one of the CSVs to its normal state by going to the Failover Cluster Manager console -> Cluster Shared Volumes -> right click on one of the volumes -> More Actions -> Turn off redirected access for this Cluster shared volume.
This didn’t work out. The command started executing but eventually timed out, so I cancelled it. I then searched in Bing for more information about the problem and found the following article:
The article stated clearly that this issue is caused by a storage connectivity problem. After a closer investigation I noticed that on one of the nodes in the cluster the LUNs were not present in the Disk Management console. Because that node had lost its disk configuration, the cluster was not fully healthy and, to preserve its integrity, forced itself to work in redirected access mode. Thanks to that, all the virtual machines on the cluster were still up and running.
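The comparison I did by eye in Disk Management can also be sketched in a few lines of Python. This is only an illustration, assuming the list of disks visible on each node has already been collected somehow (by hand or by a script of your own); the node names, LUN labels and the `find_missing_luns` helper are all hypothetical and not part of any Microsoft tool.

```python
from typing import Dict, List, Set


def find_missing_luns(visible: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return, for each node, the LUNs that some other node sees but it does not."""
    # Assumption: every LUN seen by at least one node should be seen by all nodes.
    expected: Set[str] = set().union(*visible.values())
    missing: Dict[str, List[str]] = {}
    for node, luns in visible.items():
        gap = sorted(expected - luns)
        if gap:
            missing[node] = gap
    return missing


# Hypothetical inventory: 5 CSV LUNs plus the quorum disk, with one node
# that has lost its disk configuration entirely.
all_luns = {"CSV1", "CSV2", "CSV3", "CSV4", "CSV5", "Quorum"}
inventory = {
    "NODE1": set(all_luns),
    "NODE2": set(all_luns),
    "NODE3": set(all_luns),
    "NODE4": set(all_luns),
    "NODE5": set(),  # the faulty node: no LUNs shown in Disk Management
}

print(find_missing_luns(inventory))
```

Any node that shows up in the result is the one to look at first, since it no longer has direct access to the storage.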
In this situation I had two choices to resolve the issue:
- Restart the server and see whether the disk configuration returns
- Add the LUNs to the server again
I decided to go with the first option because it was easier to carry out and I could always fall back on option 2 if option 1 was unsuccessful. I put the faulty node in maintenance mode in Virtual Machine Manager and in Operations Manager. All virtual machines were migrated off the node and I restarted the server. After the server was up and running again, the configuration in Disk Management was back and none of the CSVs were in redirected access mode any longer. I stopped maintenance mode in VMM and the node was back in the cluster.
I have a suspicion about why exactly this node lost its disk configuration during the storage service interruption: of all 5 nodes in the cluster, it was the only one with different HBA cards than the other four. But of course this would never have happened if the storage service hadn’t had issues that day.
P.S. The screenshot was copied from the mentioned article.