Commvault and Hyper-V – my experience

While it is easy to find people talking about Backup Exec and Veeam online, it is much harder to find anecdotal experience with Commvault Simpana. Having recently been part of an implementation at my company, I thought I would share my own opinions, particularly as they relate to a Hyper-V environment. These are my personal views and do not represent my employer in any way.

Summary

In short, if you run a Hyper-V environment, I cannot recommend Commvault Simpana in any capacity. You would be much better served investigating Veeam or Altaro instead, especially since those products are dedicated to virtualization support and have a track record of excellent word-of-mouth.

Simpana V10

Initially I deployed on Simpana V10 SP12. Having completed the required (and outrageously expensive) Commvault training, I had a good understanding of how the system worked and how to implement.

Overall, getting started went smoothly, but it wasn’t long before I began encountering issues. Here are some of the things I’ve found with V10:

Lack of Change Block Tracking

Somehow during the investigation phase, no one at my company (including myself!) thought to check whether Commvault does Change Block Tracking (CBT) for Hyper-V. It turns out that in V10, it does not. This came as a very large surprise when I went to back up my 6TB virtual machine and the incremental required ~30 hours to complete.

Following an investigation with Commvault support, it was determined that a CRC process is done for every single bit in the VM, to assess whether it has changed or not. There were certain optimizations I made to ensure that this process was as fast as possible, such as changing my EqualLogic MPIO to Round Robin and ensuring the EQL was using 4xNIC with no dedicated management NIC.
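To illustrate why a checksum comparison is so much slower than true CBT, here is a minimal Python sketch of checksum-based change detection. It is not Commvault's implementation; the block size, function names, and sample data are all made up for illustration. The point is that every block has to be read and checksummed on every run, even when almost nothing has changed.

```python
import zlib

BLOCK_SIZE = 4  # tiny block size for demonstration; real tools use far larger blocks


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return one CRC32 per fixed-size block of data."""
    return [
        zlib.crc32(data[offset:offset + block_size])
        for offset in range(0, len(data), block_size)
    ]


def changed_blocks(previous: list[int], current: list[int]) -> list[int]:
    """Return indexes of blocks whose checksum differs from the previous run."""
    changed = []
    for index, crc in enumerate(current):
        # A block is "changed" if it is new or its checksum no longer matches.
        if index >= len(previous) or previous[index] != crc:
            changed.append(index)
    return changed


baseline = block_checksums(b"AAAABBBBCCCCDDDD")
tonight = block_checksums(b"AAAABBBBCCXCDDDD")  # one byte changed in block 2
print(changed_blocks(baseline, tonight))  # prints [2]
```

Only one block needs to be backed up here, but finding it required reading all four. A CBT driver avoids that full read by recording which blocks were written as the writes happen.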

By working through some of the other issues below, I was able to mostly mitigate the slow CRC process in my environment but it was a major challenge.

Cluster Shared Volume Owner node

Windows Server 2012 did away with the concept of “Redirected Mode” for backup in a Hyper-V environment, but I don’t think Commvault got the message. While my CSV didn’t go into an actual Redirected Mode, it turns out that only the first Node specified within the Commvault Virtual Client for Hyper-V will stream the backup data, regardless of the owner of the CSV being worked on.

What this means is that in my 2-node cluster, the CRC read process occurred between my cluster hosts on the 1x1Gbe network for cluster communication, rather than happening directly on the node that owned the CSV. This was a huge bottleneck and absolutely killed performance in the cluster.
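A rough back-of-the-envelope calculation shows how hard a single 1GbE link caps this. The Python sketch below assumes decimal terabytes and a perfectly efficient link with no protocol overhead, so it is a best-case lower bound rather than a measurement from my environment.

```python
def min_transfer_hours(size_tb: float, link_gbit_per_s: float) -> float:
    """Best-case hours to move size_tb (decimal TB) over a link, ignoring overhead."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbit_per_s * 1e9)
    return seconds / 3600


# Reading a full 6 TB VM across a single 1 Gbit/s cluster network link
print(round(min_transfer_hours(6, 1), 1))  # prints 13.3
```

So even in the ideal case, pulling every block of a 6TB VM over that link takes more than 13 hours, and the link is unavailable for normal cluster traffic the whole time.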

The solution was to create a Pre-Job PowerShell script that moved the CSV ownership to one node in the cluster, which was set as the proxy for the Commvault backup. This is not ideal, especially as Windows Server automatically re-balances CSV ownership as of Server 2012.
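My actual script was plain PowerShell and is not reproduced here. As a sketch of the same idea, the Python wrapper below builds (and optionally runs) a PowerShell command using the Get-ClusterSharedVolume and Move-ClusterSharedVolume cmdlets from the FailoverClusters module. The node name HV-NODE1 and the function names are hypothetical, and the dry-run default means nothing is executed unless you ask for it.

```python
import subprocess


def build_move_csv_command(target_node: str) -> list[str]:
    """Build a PowerShell invocation that moves every CSV to target_node."""
    # Double any single quotes so the node name stays inside its PowerShell string literal.
    safe_node = target_node.replace("'", "''")
    script = (
        "Import-Module FailoverClusters; "
        f"Get-ClusterSharedVolume | Move-ClusterSharedVolume -Node '{safe_node}'"
    )
    return ["powershell.exe", "-NoProfile", "-Command", script]


def move_csvs(target_node: str, dry_run: bool = True) -> list[str]:
    """Return the command; only execute it when dry_run is False."""
    command = build_move_csv_command(target_node)
    if not dry_run:
        # check=True raises CalledProcessError if PowerShell exits non-zero
        subprocess.run(command, check=True)
    return command


# HV-NODE1 is a hypothetical node name standing in for the backup proxy node
print(move_csvs("HV-NODE1")[-1])
```

Run with dry_run=False on a cluster node, this would move ownership of all CSVs to the named node before the backup job starts.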

Multiple Subclients

To fully saturate my iSCSI connections, I had to split my large VMs into smaller ones. The recommendation I received from Commvault was to not have a VM larger than 2TB. I found that regardless of how many data readers and network streams I configured on the subclient, only a single iSCSI connection was utilized. Once I changed MPIO to Round Robin, all iSCSI connections were utilized but not fully saturated by one subclient.

Now I have 4 subclients, at ~2TB each, running concurrently. This caused some major effort in re-configuring our File Server(s) but thankfully we’re using DFS namespaces to obfuscate the actual server names and it was fairly invisible to our users.

VSS issues

Previously, using Microsoft DPM or Backup Exec, I never experienced VSS issues from the Hyper-V hosts or the guests. With Simpana, out of 7 jobs running nightly, at least one fails with some kind of VSS error. Whether it is “writer is in a transient state” or just an error getting the snap in the first place, it is a regular occurrence. I have mitigated some of these issues by ensuring that all guest drives have more than 15% free space on them, including the PageFile.VHDX volumes I’ve created per VM.

Still, for a top-tier product I would not expect this many errors, especially when the environment was fully stable prior to Simpana.
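The 15% free-space rule mentioned above is easy to check from inside a guest. The Python sketch below uses shutil.disk_usage; the function names are my own for this example, and the threshold simply mirrors the figure I used, not any number published by Commvault.

```python
import shutil

MIN_FREE_PERCENT = 15.0  # threshold taken from the text above


def free_percent(total_bytes: int, free_bytes: int) -> float:
    """Return free space as a percentage of the volume size."""
    return 100.0 * free_bytes / total_bytes


def low_space_volumes(paths, min_free=MIN_FREE_PERCENT, usage=shutil.disk_usage):
    """Return (path, free %) for every volume at or below the threshold."""
    low = []
    for path in paths:
        stats = usage(path)
        pct = free_percent(stats.total, stats.free)
        if pct <= min_free:
            low.append((path, round(pct, 1)))
    return low


# Check the volume holding the current directory; pass drive roots in practice
print(low_space_volumes(["."]))
```

Inside a Windows guest you would pass the drive roots (for example "C:\\" and the page file volume) instead of ".".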

Lack of VSS Hardware Support

I would LOVE to use my EqualLogic hardware VSS provider, but it is not supported, and I have found zero indications that progress is being made in supporting additional VSS hardware providers. I actually tried it out and the backup completed successfully, but there were numerous errors on the Hyper-V node, and since it is an unsupported platform I cannot use it in production.

Simpana V11

Now Version 11 SP2 has been released, and there are two crucial improvements that it is supposed to provide for Hyper-V:

Change Block Tracking

A 3rd party file system driver has been implemented for CBT in Hyper-V environments. After initial implementation (which requires a new Full backup) it seemed to work quite effectively; my 4 large VMs each taking 7-9 hours now required 40-80 minutes for an incremental.

However, the second weekend something happened where a subclient job crashed, put a CSV into Redirected Mode, and hard-locked a cluster node when I tried to return the CSV to normal. Since then, CBT has been failing on at least 50% of my subclients, even after having Commvault support perform a reset on it.

At this point I am not very trusting of such a new feature.

CSV Owner recognition

V11 was supposed to introduce new algorithms for CSV owner identification, allowing all Cluster nodes to act as coordinators for backup of subclients. While this mostly works, there are still odd quirks (that I haven’t dug into deeply yet) such as a weekend job last night that was again saturating my cluster communication network between nodes and effectively locking up every VM running on the cluster. I think I’m still safer moving the CSV owner before every job right now.

 

Locked files on SMB share

I’ve been experiencing an issue with files becoming read-only locked on a Server 2012 R2 file share, typically across the WAN.

Usually once per day at least, we would have a user report that a file had been marked read-only on the file server in an unexpected way.

Here are some of the instances that have occurred:

  • A person is working on a drawing over a period of an hour or two, attempts to save the drawing they’ve had open for a while, and receives “file is read only”
  • A person goes to open a drawing, gets warning it’s read-only, but the user who previously had it open closed it minutes or hours ago.
  • A person goes to open a drawing, gets warning it’s read-only, but the user mentioned in the warning has not touched the drawing since the last restart (perhaps “recent files” holding it open?)

95% of these issues were related to AutoCAD .dwg files, but it occasionally happened to Excel files too.

I used handle.exe from sysinternals to verify that the file was actually opened by a process (acad.exe) and it consistently was; there was just no explanation for why or how this process was opening or holding open the file handle without user interaction or knowledge.

 

I finally traced this to a series of registry changes that were being pushed out as ‘optimizations’ for SMB, which had been recommended here: https://msdn.microsoft.com/en-us/library/dn567661%28v=vs.85%29.aspx#clients

Primarily, we had defined:

HKLM\System\CurrentControlSet\Services\LanmanWorkstation\Parameters
Key                           New Value   Original Value
FileInfoCacheEntriesMax       32768       64
DirectoryCacheEntriesMax      4096        16
FileNotFoundCacheEntriesMax   32768       128

When I reverted these values back to original, the reported issues universally stopped according to my users.
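To spot machines that still carry the changed values, something like the following Python sketch could be run on a client. It compares the three settings against the original values from the table above; the function names are invented for this example, and the registry read uses the standard winreg module, so that part only works on Windows.

```python
import sys

PARAMETERS_KEY = r"System\CurrentControlSet\Services\LanmanWorkstation\Parameters"

# "Original" values as listed in the table above
ORIGINAL_VALUES = {
    "FileInfoCacheEntriesMax": 64,
    "DirectoryCacheEntriesMax": 16,
    "FileNotFoundCacheEntriesMax": 128,
}


def find_overrides(current: dict) -> dict:
    """Return {name: (current, original)} for values that differ from the originals."""
    return {
        name: (current[name], original)
        for name, original in ORIGINAL_VALUES.items()
        if name in current and current[name] != original
    }


def read_current_values() -> dict:
    """Read the three values from the local registry (Windows only)."""
    import winreg  # only available on Windows

    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, PARAMETERS_KEY) as key:
        for name in ORIGINAL_VALUES:
            try:
                values[name], _value_type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                pass  # value not set explicitly in the registry
    return values


if sys.platform == "win32":
    print(find_overrides(read_current_values()))
```

An empty result means none of the three values differ from the originals on that machine.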

This just speaks to the increased need for change tracking in my organization; it would have been relatively simple to correlate the first reported instance with a set of changes made in the same time frame. Implementing that system is easier said than done, however.

EqualLogic SAN HQ and VC++ issue

When installing Dell EqualLogic SAN HQ 3.10 on a new server, I ran into an issue where the installer looked like it needed to deliver a message about the VC++, but nothing appeared.

I tried installing various versions of the VC++ Redistributable, until I finally hit the right one:

Microsoft Visual C++ 2008 Redistributable – x64 9.0.30729.17

Download here.

Once this was installed, my installation of SAN HQ proceeded normally.

 

Dell 2162ds KVM Network Connect Error

I have a Dell 2162ds KVM switch in my server room for out-of-band management instead of individual DRAC cards.

I recently went to use this, but the Java connection produced an error of “Network Connect Error”.

Luckily someone already found a workaround for this issue here.

Here’s how to get it to work:

  1. Run Notepad as Administrator
  2. Navigate to C:\Program Files (x86)\Java\jre1.8.0_65\lib\security
  3. Open the file “java.security”
  4. Find the line that looks like this: jdk.tls.disabledAlgorithms=SSLv3, RC4, DH keySize < 768
  5. Remove this text from that line: “, DH keySize < 768”
  6. Save the text file

Now your KVM session should start properly.
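If you need to repeat this edit on several machines, the same change can be scripted. The Python sketch below only shows the text transformation on a sample string; the function name is made up, and reading and writing the real file (with administrator rights, at whatever path your Java version uses) is left out on purpose.

```python
def relax_dh_restriction(text: str) -> str:
    """Remove ', DH keySize < 768' from the jdk.tls.disabledAlgorithms line only."""
    fixed_lines = []
    for line in text.splitlines(keepends=True):
        # Leave every other line, including comments, untouched
        if line.startswith("jdk.tls.disabledAlgorithms="):
            line = line.replace(", DH keySize < 768", "")
        fixed_lines.append(line)
    return "".join(fixed_lines)


sample = "# comment\njdk.tls.disabledAlgorithms=SSLv3, RC4, DH keySize < 768\n"
print(relax_dh_restriction(sample), end="")
```

The second line of the output becomes jdk.tls.disabledAlgorithms=SSLv3, RC4, matching the manual edit in step 5.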

Network up but DNS mysteriously broken

I was recently troubleshooting a computer for a family member, where they reported “I can’t access the Internet” and the resolution was something I’ve never seen before.

This was a laptop with both an Ethernet and Wifi connection. They were both set to DHCP with dynamic DNS, and IPCONFIG displayed the correct information.

I could ping to 8.8.8.8 confirming network connectivity, and an NSLookup found my gateway acting as a DNS server which could properly resolve external names.

However, as soon as any browser attempted to access a DNS name, it failed. Chrome gave a “DNS_PROBE_FINISHED_NXDOMAIN” error, and IE simply stated “Page could not be found”.
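One way to narrow down this kind of split is to test name resolution through the operating system's own resolver, which is the path browsers and most applications take. My understanding is that nslookup performs its own queries directly against the DNS server, so it can succeed while the system resolver is broken. The Python sketch below is an illustration only; the function name is made up, and the injectable resolve parameter exists just so the logic can be exercised without a network.

```python
import socket


def system_resolver_works(hostname: str, resolve=socket.getaddrinfo) -> bool:
    """Return True if the OS resolver returns at least one address for hostname."""
    try:
        # getaddrinfo goes through the operating system's name resolution
        return len(resolve(hostname, 443)) > 0
    except socket.gaierror:
        # Raised when the system resolver cannot resolve the name
        return False
```

Calling system_resolver_works("example.com") on an affected machine would be expected to return False even though nslookup resolves the same name, which points at the local resolver rather than the network or the DNS server.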

I checked the Hosts file for malicious entries, ensured no proxy was enabled within IE, and verified the routing table was all normal.

I ran ComboFix and GMER to look for rootkits, and started the computer in Safe Mode with Networking but none of these resolved the issue.

Finally I decided to install WireShark and run ProcessMon while the browser connection was made, in an attempt to see where these requests were going.

When trying to run WireShark after the install though, it gave an error about a missing “dnsapi.dll” file. I verified the file was in the proper location (c:\windows\system32), but on a hunch decided to refresh it from SFC with this command:

sfc /scanfile=c:\windows\system32\dnsapi.dll

The output confirmed a corrupted file was replaced, and then I rebooted Windows. Once it came back up, all external browsing worked!

I suspect that some malware had gotten onto this machine and modified the dnsapi.dll file, but at some point had been partly removed.

This one left me confused for a while, so hopefully this helps anyone else coming across the issue.