Server 2016 VM freeze up

I recently deployed a couple of Server 2016 virtual machines in my environment, and have been having an issue with them freezing up after periods of inactivity. When this happens the VM is inaccessible on the network, locked up at the Hyper-V console (i.e. unresponsive to the Ctrl+Alt+Del command), and does not respond to any shutdown commands (from Hyper-V or the command line).

I initially thought this might be an incompatibility with the hypervisor, as it is still Server 2012 R2, or perhaps a missing hotfix/update, but after some research this doesn’t seem to be the case.

Yesterday I finally hit on a lead from this discussion thread, which indicates the problem originates from the Pagefile being located on a separate VHDX.

Turns out this is exactly how I configure my VMs, so that if I ever decide to protect the VM or replicate it with Hyper-V Replica, I can exclude the Pagefile disk.

The resolution for this particular issue is to:

  • Right click Start Menu
  • Choose “System”
  • Click “Advanced System Settings”
  • Under Startup and Recovery, click “Settings”
  • Change the Write debugging information dropdown to “None”

The implications of this setting are that if Windows crashes due to unexpected failure, it will not create a memory dump file. More detail can be found here.
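For anyone who needs to make this change on several VMs, the same setting can be scripted. The sketch below is a minimal Python example and rests on an assumption: that the "Write debugging information" dropdown maps to the CrashDumpEnabled registry value under the CrashControl key, with 0 meaning no dump. The function name is hypothetical, it has to be run elevated inside the guest, and a restart may be needed before the change takes effect.

```python
import sys

# Assumption: this is the key behind the "Write debugging information" dropdown.
CRASH_CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control\CrashControl"


def disable_memory_dump() -> None:
    """Set CrashDumpEnabled to 0 (no memory dump). Windows only, needs admin rights."""
    import winreg  # imported here because the module only exists on Windows

    with winreg.OpenKey(
        winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL_KEY, 0, winreg.KEY_SET_VALUE
    ) as key:
        winreg.SetValueEx(key, "CrashDumpEnabled", 0, winreg.REG_DWORD, 0)


if __name__ == "__main__" and sys.platform == "win32":
    disable_memory_dump()
```

The same trade-off applies as with the GUI change: with this value set, a crash will not leave a dump file behind for troubleshooting.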

Commvault and Hyper-V – my experience

While it is quite simple to find many people talking about Backup Exec and Veeam online, it is much more difficult to find anecdotal experience with Commvault Simpana. Having recently been part of an implementation in my company, I thought I would share my own opinions, particularly as it relates to a Hyper-V environment.  These are my personal views and do not represent my employer in any way.

Summary

In short, if you run a Hyper-V environment, I cannot recommend considering Commvault Simpana in any capacity. You would be much better served investigating Veeam or Altaro instead, especially since those products are dedicated to virtualization support and have a track record of excellent word-of-mouth.

Simpana V10

Initially I deployed on Simpana V10 SP12. Having completed the required (and outrageously expensive) Commvault training, I had a good understanding of how the system worked and how to implement it.

Overall, getting started went smoothly, but it wasn’t very long before I began encountering issues. Here are some of the things I’ve found with V10:

Lack of Change Block Tracking

Somehow during the investigation phase, no one at my company (including myself!) thought to investigate whether Commvault will do Change Block Tracking (CBT) for Hyper-V. It turns out that in V10 it does not. This came as a very large surprise when I went to back up my 6TB virtual machine and the incremental required ~30 hours to complete.

Following an investigation with Commvault support, it was determined that a CRC process is done for every single bit in the VM, to assess whether it has changed or not. There were certain optimizations I made to ensure that this process was as fast as possible, such as changing my EqualLogic MPIO to Round Robin and ensuring the EQL was using 4xNIC with no dedicated management NIC.

By working through some of the other issues below, I was able to mostly mitigate the slow CRC process in my environment, but it was a major challenge.
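To illustrate why an incremental without CBT is so slow, here is a minimal Python sketch of checksum-based change detection. It is not Commvault's actual implementation; the block size, function names and sample data are all hypothetical. The point is that every block has to be read and checksummed just to find out that most of them did not change, whereas CBT records the changed blocks as writes happen.

```python
import zlib

BLOCK_SIZE = 4  # hypothetical and tiny, only so the example is easy to follow


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return one CRC32 per fixed-size block. Note that this reads ALL of the data."""
    return [
        zlib.crc32(data[offset:offset + block_size])
        for offset in range(0, len(data), block_size)
    ]


def changed_blocks(previous: list[int], current_data: bytes,
                   block_size: int = BLOCK_SIZE) -> list[int]:
    """Return indexes of blocks whose checksum differs from the previous backup."""
    current = block_checksums(current_data, block_size)
    return [
        index
        for index, crc in enumerate(current)
        if index >= len(previous) or previous[index] != crc
    ]


baseline = block_checksums(b"AAAABBBBCCCCDDDD")      # state at the last backup
print(changed_blocks(baseline, b"AAAAXXXXCCCCDDDD"))  # prints [1]
```

Only one block changed, but all four were read to discover that, which is the same pattern that makes a 6TB incremental take nearly as long as a full.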

Cluster Shared Volume Owner node

Windows Server 2012 did away with the concept of “Redirected Mode” for backup in a Hyper-V environment, but I don’t think Commvault got the message. While my CSV didn’t go into an actual Redirected Mode, it turns out that only the first Node specified within the Commvault Virtual Client for Hyper-V will stream the backup data, regardless of the owner of the CSV being worked on.

What this means is that in my 2-node cluster, the CRC read process occurred between my cluster hosts on the 1x1GbE network for cluster communication, rather than happening directly on the node that owned the CSV. This was a huge bottleneck and absolutely killed performance in the cluster.
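Some back-of-the-envelope arithmetic shows how hard that link limits things. The Python sketch below uses the 6TB VM size from earlier and assumes decimal units and the theoretical line rate with no protocol overhead, so real-world numbers will be worse. The function name is hypothetical.

```python
def transfer_hours(size_tb: float, link_gbps: float) -> float:
    """Hours needed to move size_tb terabytes over a link of link_gbps gigabits/s."""
    size_bytes = size_tb * 10**12
    bytes_per_second = link_gbps * 10**9 / 8  # 1 Gbit/s is 125 MB/s at best
    return size_bytes / bytes_per_second / 3600


print(round(transfer_hours(6, 1), 1))  # 13.3 hours over a single 1GbE link
print(round(transfer_hours(6, 4), 1))  # 3.3 hours if four 1Gb paths were fully used
```

Even in the best case, reading 6TB across a single 1GbE cluster network takes more than 13 hours, and that link is also carrying normal cluster traffic at the same time.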

The solution was to create a Pre-Job PowerShell script that moved the CSV ownership to one node in the Cluster, which was set as the proxy for the Commvault backup. This is not ideal, especially as Windows Server will automatically re-balance the CSV owner as of Server 2012.
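The pre-job step itself was a PowerShell script, which is not reproduced here. As a rough sketch of the idea, the Python below builds and runs a call to the Move-ClusterSharedVolume cmdlet from the FailoverClusters module. The CSV name "Cluster Disk 1" and node name "HV01" are placeholders, and the helper name is hypothetical.

```python
import subprocess
import sys


def build_move_csv_command(csv_name: str, target_node: str) -> list[str]:
    """Build a powershell.exe invocation that moves one CSV to target_node."""
    # Inside a PowerShell single-quoted string, a single quote is escaped by doubling it.
    name = csv_name.replace("'", "''")
    node = target_node.replace("'", "''")
    script = (
        "Import-Module FailoverClusters; "
        f"Move-ClusterSharedVolume -Name '{name}' -Node '{node}'"
    )
    return ["powershell.exe", "-NoProfile", "-NonInteractive", "-Command", script]


if __name__ == "__main__" and sys.platform == "win32":
    # Placeholder names; substitute the real CSV and the node acting as backup proxy.
    subprocess.run(build_move_csv_command("Cluster Disk 1", "HV01"), check=True)
```

Because of the automatic re-balancing mentioned above, something like this has to run before every job rather than once.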

Multiple Subclients

To fully saturate my iSCSI connections, I had to split my large VMs into smaller ones. The recommendation I received from Commvault was to not have a VM larger than 2TB. I found that regardless of how many data readers and network streams I configured on the subclient, only a single iSCSI connection was utilized. Once I changed MPIO to Round Robin, all iSCSI connections were utilized but not fully saturated by one subclient.

Now I have 4 subclients, at ~2TB each, running concurrently. This caused some major effort in re-configuring our File Server(s) but thankfully we’re using DFS namespaces to obfuscate the actual server names and it was fairly invisible to our users.

VSS issues

Previously, when using Microsoft DPM or Backup Exec, I never experienced VSS issues from the Hyper-V hosts or the guests. With Simpana, out of 7 jobs running nightly, at least one is failing with some kind of VSS error. Whether this is “writer is in a transient state” or just errors getting the snap in the first place, it is a regular occurrence. I have mitigated some of these issues by ensuring that all guest drives have more than 15% free space on them, including the PageFile.VHDX volumes I’ve created per VM.

Still, for a top-tier product I would not expect this many errors to occur, especially when the environment was fully stable prior to Simpana.

Lack of VSS Hardware Support

I would LOVE to use my EqualLogic hardware VSS provider, but it is not supported, and I have found zero indications that progress is being made in supporting additional VSS hardware providers. I actually tried it out and the backup completed successfully. However, there were numerous errors on the Hyper-V node, and since it is an unsupported platform I cannot use it in production.

Simpana V11

Now Version 11 SP2 has been released, and there are two crucial improvements that it is supposed to provide for Hyper-V:

Change Block Tracking

A 3rd party file system driver has been implemented for CBT in Hyper-V environments. After initial implementation (which requires a new Full backup) it seemed to work quite effectively; my 4 large VMs, each of which had been taking 7-9 hours, now required 40-80 minutes for an incremental.

However, the second weekend something happened where a subclient job crashed, put a CSV into Redirected Mode, and hard-locked a cluster node when I tried to return the CSV to normal. Since then, CBT has been failing on at least 50% of my subclients, even after having Commvault support perform a reset on it.

At this point I am not very trusting of such a new feature.

CSV Owner recognition

V11 was supposed to introduce new algorithms for CSV owner identification, allowing all Cluster nodes to act as coordinators for backup of subclients. While this mostly works, there are still odd quirks (that I haven’t dug into deeply yet) such as a weekend job last night that was again saturating my cluster communication network between nodes and effectively locking up every VM running on the cluster. I think I’m still safer moving the CSV owner before every job right now.

 

Hyper-V NIC Team Networking issue

I encountered an issue with a Hyper-V virtual machine recently that had me very confused. I still don’t have a great resolution to it but at least a functional workaround.

At a site I have a Server 2012 R2 Hyper-V host (Host), a Server 2012 R2 file server guest (FileServer) and a Server 2012 R2 domain controller (DC).

The Host has two NICs in a single switch-independent address hash team. This Team is set as the source of the vSwitch which has management capabilities enabled. We wanted to use a NIC team to provide network redundancy in case a cable was disconnected as this network is being provided by the site owner rather than my company.

Host and FileServer could reach DC, but nothing else could. DC could reach Host and FileServer, but not even its own default gateway.

This immediately sounded like a mis-configured virtual switch on Host, as it appeared DC could only access internal traffic. But I confirmed the vSwitch was set to “External”, and if that were the cause, FileServer would presumably have been affected as well, which it was not.

I tried disabling VMQ on the Host NICs, as well as Large Send Offload, since both those features have been known to cause problems, but that did not resolve the issue either.

I tried changing the teaming algorithm to Hyper-V Port and Dynamic, but that didn’t resolve the issue either.

Then I decided to put one of the NICs into a Standby state in the team. This caused the accessibility to switch between the VMs; all of a sudden nothing external could reach my file server, but the DC came online to external traffic.

I tried changing which NIC was in standby but that still left me with one VM that had no connectivity.

My assumption at this point is that this issue is being caused by Port Security on the network switches; something that we are aware the site owner is doing. I suppose that the Team presents a single MAC address across multiple ports, which the port security doesn’t like and so it blocks traffic from one side of that team. Because of how traffic is balanced across the team it leaves one of the VMs in an inaccessible state.

Unfortunately we do not have control over this network or the ability to implement LACP, and so I’ve had to remove the NIC teaming and go back to segregated NICs for management and VM access.

 

Migrate Mindtouch to Hyper-V

My Mindtouch Core wiki VM was originally running on VMware Server a long time ago. I needed to migrate this to Hyper-V so that I could decommission my use of VMware.

I originally wrote this post more than 2 years ago, but am publishing it now in case someone finds it useful.

 

Used vmdk2vhd to convert the disk to a VHD file.

After transferring and booting, it failed.

Used these instructions to assist in fixing: http://itproctology.blogspot.ca/2009/04/migrating-debian-from-vmware-esx-to.html

mount -t ext3 /dev/hda1 /root

vi /root/etc/fstab

change sda1 to hda1

vi /root/boot/grub/menu.lst

change sda1 to hda1
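The two manual edits above amount to the same substitution in both files. As a small sketch of it, assuming the device name appears as a whole word (so that something like sda10 is left alone), the Python below does the replacement on text. The function name and the sample lines are hypothetical, not copied from my actual files.

```python
import re


def rename_device(text: str, old: str = "sda1", new: str = "hda1") -> str:
    """Replace whole-word occurrences of the old device name with the new one."""
    return re.sub(rf"\b{re.escape(old)}\b", new, text)


# Hypothetical examples of an fstab line and a GRUB menu.lst kernel line.
print(rename_device("/dev/sda1 / ext3 defaults 0 1"))
# prints: /dev/hda1 / ext3 defaults 0 1
print(rename_device("kernel /boot/vmlinuz root=/dev/sda1 ro"))
# prints: kernel /boot/vmlinuz root=/dev/hda1 ro
```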

Then added a Legacy Network adapter

Then followed these instructions to install Hyper-V integration services

http://www.r2x2.com/install-hyper-v-integration-services-on-debian-5-x/

DPM 2012 R2 and the downsides

I’ve been using DPM 2012 R2 for a few months now, having replaced Symantec Backup Exec 2010 due to growing data sizes and increased struggles with tape rotations.

However, I’ve found a number of deficiencies with DPM that make me wish we were able to implement something like Veeam instead.

Here’s a short summary of what I need DPM to do better:

  • No deduplication support!
  • Disk volume based system leaves ‘islands of storage’ unusable and inefficient
    • Prevents disk from being shared for other backup purposes such as Hyper-V replication
  • Lack of long-term disk backups
    • Our TechNet reading has shown that since DPM uses VSS it can only take a maximum of 64 snapshots for a protected resource. We’re currently unsure if this applies to VMs as a protected resource
  • Poor visibility into DPM running operations
    • No clarity on what the data transfer represents
    • No information on compression ratios
    • No transfer speed indicators
  • No easy way to see status of data across all protected sources
    • No dashboards or easy summaries.
    • Many clicks to drill down into each protection group
  • Poor configurability on logging
    • Email notifications are very chatty, or non-existent without much middle ground.
    • No escalation methods or schedules
  • No automated test restore capabilities or scheduling
  • Limited Reporting
    • Only 6 reports out of the box, and must use SQL Reporting Services to build anything new (which I am adept with, but that’s beside the point)
  • Tape Library support seems cumbersome, and compression isn’t working despite being reported as running
  • No built in VM replication technology for Disaster Recovery scenarios
  • Very low community knowledge or support
    • For example, trying to find information on tape compression is impossible; no one online is talking about DPM and how it’s used.
  • No central console for viewing multiple backup source/destination pairs