SCVMM VM Network dependency for Hyper-V replica

I encountered a weird issue today with SCVMM. I can't provide screenshots since I'd have to white out the object names, which would make them pointless.

I wanted to delete a VM Network; however, when I looked at its dependencies, I saw a Virtual Machine listed.

When I looked at the Virtual Machine itself, all of its network configuration was tied to a different VM Network, and for a while I really couldn't figure out where this reference was coming from.

Then I realized that this VM has a Hyper-V replica. SCVMM will show the replica, but you cannot open its properties in the GUI. You can, however, query it with PowerShell; I was able to do so with this:

 

$VmName = "vm1"
$VM = Get-SCVirtualMachine -Name $VMName -VMHost "destination hyper-v host"
$NIC = Get-SCVirtualNetworkAdapter -VM $VM | Out-Gridview -PassThru -Title 'Select VMs vNIC'

This allowed me to view the properties of the NIC attached to my replica. SCVMM stores its own network information (separate from the Hyper-V network config) in its database, and it appears the original configuration was changed on my source VM, but that change never made its way into the replica's record.

I was able to update this by running the following two PowerShell lines:

# Pop up an Out-GridView to choose the VM network you want to attach the NIC to
$VMNetwork = Get-SCVMNetwork | Out-GridView -PassThru -Title 'Select VM Network'
Set-SCVirtualNetworkAdapter -VirtualNetworkAdapter $VM.VirtualNetworkAdapters[($NIC.SlotID)] -VMNetwork $VMNetwork

Now, I did receive an error from that last Set command: "VMM failed to allocate a MAC address to network adapter". However, after re-running Get-SCVirtualNetworkAdapter I could confirm the Logical Network and VM Network were updated appropriately, and my dependency was no longer there.

I validated that this did not appear to affect the health of Hyper-V replication either.

Update – after performing the actions above, the dependency on the VM Network was still there when I tried to delete it (even though it no longer appeared in the GUI). At the risk of breaking replication, I decided to try updating the VMM database directly rather than spend more time figuring out the proper method.

I figured my Hyper-V replica VM's NIC was still tied to the VM Network in the database, despite the command I used above, so I used SQL Server Management Studio to query the tables like this:

-- Find VM Network ID
DECLARE @var VARCHAR(MAX)
SELECT @var = id FROM [VirtualManagerDB].[dbo].[tbl_NetMan_VMNetwork] WHERE Name = 'vm network name'
-- Use output from previous command to find matching VM nics
SELECT * FROM tbl_WLC_VNetworkAdapter WHERE (vmnetworkid = @var)

This produced two rows, one for my source VM and the other for my replica. I updated the VMSubnetID and VMNetworkID columns on my replica to match the source VM.

This allowed me to successfully delete the VM Network.

 

Hyper-V replica health notifications

I set up Hyper-V Replica between two clusters in two offices and needed a way to keep track of replica health. To accomplish this, I used a PowerShell script running from a utility server via a scheduled task.

The script runs Get-VMReplication on each cluster node and sends an email if any VM's replication is found in a Warning or Critical state.

One issue I needed to solve was the permissions required to run this from a utility server. I’m sure there are many other (and better) ways to accomplish this such as group managed service accounts, but there are certain limitations in my environment.

First I created a standard user account in AD to act as a service account. Then, running as the SYSTEM account on my utility server, I used the ConvertFrom-SecureString cmdlet as demonstrated in this article to produce a file for building a credential object in my PowerShell script. To run PowerShell as SYSTEM I used "psexec -s -i powershell.exe".
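That one-time step looked roughly like this; the output path matches the script below, and this is only a sketch:

# Run from a SYSTEM PowerShell session (psexec -s -i powershell.exe) so DPAPI
# protects the string in the same context the scheduled task will later decrypt it
Read-Host -Prompt "Service account password" -AsSecureString | ConvertFrom-SecureString | Out-File C:\scripts\MaintenanceChecks\SecureString.txt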

Then I created a scheduled task, set to run as SYSTEM whether logged on or not, at 12-hour intervals, with the action of running the script below:

 

# Rebuild the credential from the DPAPI-protected password file created earlier
$password = Get-Content C:\scripts\MaintenanceChecks\SecureString.txt | ConvertTo-SecureString
$Credential = New-Object System.Management.Automation.PSCredential("svc.clusterhealth@domain.com",$password)
 
$Servers = @("ClusterHost1","ClusterHost2","ClusterHost3","ClusterHost4")
 
Foreach ($server in $Servers)
{
	Invoke-Command -credential $credential -computername $server -scriptblock {
 
		$resultsWarn = Get-VMReplication | Where-Object {$_.Health -like "Warning"}
		if ($resultsWarn)
		{
			$smtpServer = "relay.domain.com"
			$port = "25"
			$message = New-Object System.Net.Mail.MailMessage
			$message.From = "fromaddress@domain.com"
			$message.Sender = "fromaddress@domain.com"
			$message.To.Add( "toaddress@domain.com" )
			$message.Subject = "Replication Health Warning - $($Using:server)"
			$message.IsBodyHtml = $true
			$message.Body = "Replication health was found to be in warning state for the following VMs:  $($resultsWarn.Name)"
			$Client = New-Object System.Net.Mail.SmtpClient( $smtpServer , $port )
			$Client.Send( $message )
		}
 
		$resultsCrit = Get-VMReplication | Where-Object {$_.Health -like "Critical"}
		if ($resultsCrit)
		{
			$smtpServer = "relay.domain.com"
			$port = "25"
			$message = New-Object System.Net.Mail.MailMessage
			$message.From = "fromaddress@domain.com"
			$message.Sender = "fromaddress@domain.com"
			$message.To.Add( "toaddress@domain.com" )
			$message.Subject = "Replication Health Critical - $($Using:server)"
			$message.IsBodyHtml = $true
			$message.Body = "Replication health was found to be in CRITICAL state for the following VMs: $($resultsCrit.Name)"
			$Client = New-Object System.Net.Mail.SmtpClient( $smtpServer , $port )
			$Client.Send( $message )
		}
	}
}

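For completeness, the scheduled task itself could also be registered from PowerShell rather than the GUI; this is just a sketch, and the task name and script path are assumptions on my part:

# Hypothetical example: register the health check script as a SYSTEM task repeating every 12 hours
$action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-ExecutionPolicy Bypass -File C:\scripts\MaintenanceChecks\ReplicaHealthCheck.ps1"
$trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Hours 12) -RepetitionDuration (New-TimeSpan -Days 3650)
Register-ScheduledTask -TaskName "Hyper-V Replica Health" -Action $action -Trigger $trigger -User "SYSTEM" -RunLevel Highest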

Server 2016 VM freeze up

I recently deployed a couple of Server 2016 virtual machines in my environment and have been having an issue with them freezing up after periods of inactivity. Symptoms include the VM being inaccessible on the network, locked up at the Hyper-V console (i.e. unresponsive to Ctrl+Alt+Del), and not responding to any shutdown commands (from Hyper-V or the command line).

I initially thought this might be an incompatibility with the hypervisor, since it is still Server 2012 R2, or perhaps a missing hotfix/update, but after some research this doesn't seem to be the case.

Yesterday I finally hit on a lead in this discussion thread, which indicates the problem originates from the page file being placed on a separate VHDX.

It turns out this is exactly how I configure my VMs, so that if I ever decide to protect a VM or enable Hyper-V Replica on it, I can exclude the page file disk.

The resolution for this particular issue is to:

  • Right click Start Menu
  • Choose “System”
  • Click “Advanced System Settings”
  • Under Startup and Recovery, click “Settings”
  • Change the Write debugging information dropdown to “None”
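If you'd rather script this, my understanding is that the "Write debugging information" dropdown maps to the CrashDumpEnabled registry value (0 = none), so something like the following should be equivalent:

# Set "Write debugging information" to None (takes effect after a reboot)
Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\CrashControl" -Name CrashDumpEnabled -Value 0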

The implications of this setting are that if Windows crashes due to unexpected failure, it will not create a memory dump file. More detail can be found here.

Commvault and Hyper-V – my experience

While it is quite simple to find many people talking about Backup Exec and Veeam online, it is much more difficult to find anecdotal experience with Commvault Simpana. Having recently been part of an implementation at my company, I thought I would share my own opinions, particularly as they relate to a Hyper-V environment. These are my personal views and do not represent my employer in any way.

Summary

In short, if you run a Hyper-V environment, I cannot recommend Commvault Simpana in any capacity. You would be much better served investigating Veeam or Altaro instead, especially since those products are dedicated to virtualization support and have a track record of excellent word-of-mouth.

Simpana V10

Initially I deployed on Simpana V10 SP12. Having completed the required (and outrageously expensive) Commvault training, I had a good understanding of how the system worked and how to implement it.

Overall, getting started went smoothly, however it wasn’t very long before I began encountering issues. Here are some of the things I’ve found with V10:

Lack of Change Block Tracking

Somehow during the investigation phase, no one at my company (including myself!) thought to investigate whether Commvault will do Change Block Tracking (CBT) for Hyper-V. It turns out that in V10, it does not. This came as a very large surprise when I went to back up my 6TB virtual machine and the incremental required ~30 hours to complete.

Following an investigation with Commvault support, it was determined that a CRC process is run against every block of the VM to assess whether it has changed or not. I made certain optimizations to ensure this process was as fast as possible, such as changing my EqualLogic MPIO policy to Round Robin and ensuring the EQL was using all four NICs with no dedicated management NIC.

By working through some of the other issues below, I was able to mostly mitigate the slow CRC process in my environment but it was a major challenge.

Cluster Shared Volume Owner node

Windows Server 2012 did away with the concept of “Redirected Mode” for backup in a Hyper-V environment, but I don’t think Commvault got the message. While my CSV didn’t go into an actual Redirected Mode, it turns out that only the first Node specified within the Commvault Virtual Client for Hyper-V will stream the backup data, regardless of the owner of the CSV being worked on.

What this means is that in my 2-node cluster, the CRC read process occurred between my cluster hosts over the single 1GbE cluster communication network, rather than happening directly on the node that owned the CSV. This was a huge bottleneck and absolutely killed performance in the cluster.

The solution was to create a pre-job PowerShell script that moves CSV ownership to one node in the cluster, which is set as the proxy for the Commvault backup. Not ideal, especially as Windows Server automatically re-balances CSV ownership as of Server 2012.
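The pre-job script boils down to something like this; the proxy node name is a placeholder and this is only a sketch of the approach:

# Move any CSV not already owned by the designated backup proxy node onto it
$ProxyNode = "ClusterHost1"
Get-ClusterSharedVolume | Where-Object { $_.OwnerNode.Name -ne $ProxyNode } | Move-ClusterSharedVolume -Node $ProxyNode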

Multiple Subclients

To fully saturate my iSCSI connections, I had to split my large VMs into smaller ones. The recommendation I received from Commvault was to not have a VM larger than 2TB. I found that regardless of how many data readers and network streams I configured on the subclient, only a single iSCSI connection was utilized. Once I changed MPIO to Round Robin, all iSCSI connections were utilized but not fully saturated by one subclient.

Now I have four subclients of roughly 2TB each running concurrently. Splitting the VMs took some major effort in re-configuring our file servers, but thankfully we use DFS namespaces to hide the actual server names, so the change was largely invisible to our users.

VSS issues

Previously, using Microsoft DPM or Backup Exec, I never experienced VSS issues from the Hyper-V hosts or the guests. With Simpana, out of 7 jobs running nightly, at least one fails with some kind of VSS error. Whether it's "writer is in a transient state" or just errors taking the snapshot in the first place, it is a regular occurrence. I have mitigated some of these issues by ensuring that all guest drives, including the PageFile VHDX volumes I've created per VM, have more than 15% free space.
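When one of these jobs fails, a quick way to see which writer is unhappy on the host or guest is to list the VSS writers and their states from an elevated prompt:

# Show each VSS writer and its current state/last error
vssadmin list writers | Select-String -Pattern "Writer name", "State", "Last error"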

Still, for a top-tier product I would not expect as many errors to occur especially when the environment was fully stable prior to Simpana.

Lack of VSS Hardware Support

I would LOVE to use my EqualLogic hardware VSS provider, but it is not supported, and I have found zero indications that progress is being made in supporting additional VSS hardware providers.  I actually tried it out and the backup was successfully completed, however there were numerous errors on the Hyper-V node and since it is an unsupported platform I cannot use it in production.

Simpana V11

Now Version 11 SP2 has been released, and there are two crucial improvements that it is supposed to provide for Hyper-V:

Change Block Tracking

A 3rd party file system driver has been implemented for CBT in Hyper-V environments. After initial implementation (which requires a new full backup) it seemed to work quite effectively; my four large VMs, which had each been taking 7-9 hours, now required 40-80 minutes for an incremental.

However, on the second weekend a subclient job crashed, put a CSV into Redirected Mode, and hard-locked a cluster node when I tried to return the CSV to normal. Since then, CBT has been failing on at least 50% of my subclients, even after having Commvault support perform a reset on it.

At this point I am not very trusting of such a new feature.

CSV Owner recognition

V11 was supposed to introduce new algorithms for CSV owner identification, allowing all Cluster nodes to act as coordinators for backup of subclients. While this mostly works, there are still odd quirks (that I haven’t dug into deeply yet) such as a weekend job last night that was again saturating my cluster communication network between nodes and effectively locking up every VM running on the cluster. I think I’m still safer moving the CSV owner before every job right now.

 

Hyper-V NIC Team Networking issue

I encountered an issue with a Hyper-V virtual machine recently that had me very confused. I still don’t have a great resolution to it but at least a functional workaround.

At a site I have a Server 2012 R2 Hyper-V host (Host), a Server 2012 R2 file server guest (FileServer) and a Server 2012 R2 domain controller (DC).

The Host has two NICs in a single switch-independent, address-hash team. This team is bound to the vSwitch, which has management OS access enabled. We wanted a NIC team to provide network redundancy in case a cable was disconnected, since this network is provided by the site owner rather than my company.
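For reference, that host configuration is roughly equivalent to the following (team and adapter names are placeholders):

# Switch-independent team using Address Hash (TransportPorts) load balancing
New-NetLbfoTeam -Name "VMTeam" -TeamMembers "NIC1","NIC2" -TeamingMode SwitchIndependent -LoadBalancingAlgorithm TransportPorts
# External vSwitch bound to the team, shared with the management OS
New-VMSwitch -Name "External" -NetAdapterName "VMTeam" -AllowManagementOS $true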

Host and FileServer could reach DC, but nothing else could. DC could reach Host and FileServer, but not even its own default gateway.

This immediately sounded like a misconfigured virtual switch on Host, as it appeared DC could only reach internal traffic. But I confirmed the vSwitch was set to "External", and if that were the problem, FileServer would presumably have been affected as well, but it was not.

I tried disabling VMQ on the Host NICs, as well as Large Send Offload, since both of those features have been known to cause problems, but that did not resolve the issue.
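That was along these lines, again with placeholder adapter names:

# Disable VMQ and Large Send Offload on the physical team members
Get-NetAdapter -Name "NIC1","NIC2" | Disable-NetAdapterVmq
Get-NetAdapter -Name "NIC1","NIC2" | Disable-NetAdapterLso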

I tried changing the teaming algorithm to Hyper-V Port and Dynamic, but that didn’t resolve the issue either.
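For example (team name is a placeholder):

# Change the team's load balancing algorithm to Hyper-V Port (or Dynamic)
Set-NetLbfoTeam -Name "VMTeam" -LoadBalancingAlgorithm HyperVPort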

Then I decided to put one of the NICs into a standby state in the team. This flipped the accessibility between the VMs: all of a sudden nothing external could reach FileServer, but DC came online to external traffic.
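Putting a member into standby can be done like this (adapter name is a placeholder):

# Put one physical team member into standby
Set-NetLbfoTeamMember -Name "NIC2" -AdministrativeMode Standby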

I tried changing which NIC was in standby but that still left me with one VM that had no connectivity.

My assumption at this point is that the issue is being caused by port security on the network switches, something we know the site owner uses. I suspect the team presents the same MAC addresses across multiple switch ports, which port security doesn't like, so it blocks traffic from one side of the team. Because of how traffic is balanced across the team, this leaves one of the VMs in an inaccessible state.

Unfortunately we do not have control over this network or the ability to implement LACP, and so I’ve had to remove the NIC teaming and go back to segregated NICs for management and VM access.