EqualLogic DPM Hyper-V network tuning

I’m in the process of configuring DPM to back up my Hyper-V environment, which resides on an EqualLogic PS6500ES SAN.

While doing this, I hit an issue: the DPM consistency check for a 3TB VM locked up every other VM on my cluster because of high write latencies. During this period I couldn’t even get useful stats out of the EqualLogic, because SANHQ wouldn’t communicate with it and the Group Manager live sessions would fail to initialize.

 

After some investigation, I did the following on all my Hyper-V hosts and my DPM host:

– Disabled “Large Send Offload” for every NIC

– Set “Receive Side Scaling” queues to 8

– Disabled the Nagle algorithm for iSCSI NICs (http://social.technet.microsoft.com/wiki/contents/articles/7636.iscsi-and-the-nagle-algorithm.aspx)

– Updated Broadcom firmware and drivers
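
As a rough illustration of the first three changes, here is a minimal Python sketch that only builds the PowerShell commands for one adapter; it does not run anything. It assumes the hosts have the Server 2012 (or later) NetAdapter cmdlets, that the RSS queue setting maps to `Set-NetAdapterRss -NumberOfReceiveQueues`, and that Nagle is disabled via the `TcpAckFrequency` and `TcpNoDelay` registry values as in the linked TechNet article. The adapter name and interface GUID are hypothetical placeholders, and `tuning_commands` is my own helper name.

```python
from typing import List

# Registry key that holds per-interface TCP settings (one subkey per interface GUID).
TCPIP_IF_KEY = r"HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces"


def tuning_commands(adapter: str, iscsi_interface_guid: str = "", rss_queues: int = 8) -> List[str]:
    """Return PowerShell command strings for one NIC (hypothetical helper).

    adapter: NIC name as shown by Get-NetAdapter (assumption: e.g. "iSCSI1").
    iscsi_interface_guid: set only for iSCSI NICs; enables the Nagle-related values.
    """
    cmds = [
        # Disable Large Send Offload on the adapter.
        f'Disable-NetAdapterLso -Name "{adapter}"',
        # Set the number of RSS receive queues.
        f'Set-NetAdapterRss -Name "{adapter}" -NumberOfReceiveQueues {rss_queues}',
    ]
    if iscsi_interface_guid:
        key = f"{TCPIP_IF_KEY}\\{iscsi_interface_guid}"
        # Both values set to 1 effectively disable Nagle/delayed ACK for the interface.
        for value_name in ("TcpAckFrequency", "TcpNoDelay"):
            cmds.append(
                f'New-ItemProperty -Path "{key}" -Name {value_name} '
                f"-Value 1 -PropertyType DWord -Force"
            )
    return cmds


if __name__ == "__main__":
    # Placeholder GUID; the real one comes from the adapter's InterfaceGuid property.
    for line in tuning_commands("iSCSI1", "{00000000-0000-0000-0000-000000000000}"):
        print(line)
```

Printing the commands rather than executing them makes it easy to review the list per host before pasting it into an elevated PowerShell session.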

 

Following these changes, I still see very high write latency on my backup datastore volume, but the other volumes operate perfectly.

Server 2012 R2 Upgrade and BSOD

I’m currently in the process of upgrading a standalone Server 2012 machine running Hyper-V to Server 2012 R2.

Due to resource constraints, I’m performing an in-place upgrade, despite this server residing 800km away from me. Thank goodness for iDRAC Enterprise.

 

However, during the “Getting Devices Ready” stage of the upgrade, I received a Blue Screen of Death with the error message:

whea_uncorrectable_error

 

After hitting this BSOD twice, the upgrade process failed and reverted to Server 2012. I was unable to find a log file describing what occurred in any more detail, and was worried that I would be stuck on Server 2012.

Thankfully, I discovered a log file on the iDRAC with the following message:

A bus fatal error was detected on a component at slot 1.

This jogged my memory: we have a USB3 PCI-E card installed for pre-seeding an external drive with backup info.

I used the BIOS setup (Integrated Devices > Slot Disablement) to disable Slot 1, and then retried the upgrade with fingers crossed.

Success!

 

Hyper-V 2012 migration to R2

A co-worker and I just completed an upgrade of our 2-node Server 2012 Hyper-V cluster to a 3-node Server 2012 R2 cluster, and it went very smoothly.

I’ve been looking forward to some of the improvements in Hyper-V 2012 R2, in addition to a 3rd node, which is going to be the basis for our Citrix XenApp implementation (with an NVIDIA GRID K1 GPU).

I’ve posted before about my Hyper-V implementation, which used iSCSI over direct connections rather than through switches, since I only had 2 hosts.

For this most recent upgrade I needed to add a 3rd host, which meant a real iSCSI SAN. Here’s the network design I moved forward with:

[Diagram: Server 2012 R2 Network Design]

 

This time I actually checked compatibility of my hardware before proceeding, and found no issues to be concerned about.

The upgrade process is described below. It includes the extra steps required when 1) renaming hosts that are in use with the MD3220i, and 2) converting to a switched iSCSI SAN instead of direct connect:

Before maintenance window

  • Install redundant switches in the rack (I used PowerConnect 5548s)
  • Live Migrate VMs from Server1 to Server2
  • Remove Server1 from Cluster membership (Evict Node)
  • Wipe and reinstall Windows Server 2012 R2 on Server1
  • Configure Server1 with new iSCSI configuration as documented
  • Re-cable iSCSI NIC ports to redundant switches
  • Create new Failover Cluster on Server1
  • From Server1 run “Copy Cluster Roles” wizard (previously known as “Cluster Migration Wizard”)
    • This will copy VM configuration, CSV info and cluster networks to the new cluster

Within maintenance window

  • When ready to cut over:
    • Power down the VMs on Server2.
    • Take the CSVs on the original cluster offline
    • Power down Server2
  • Remap the host mappings for each server in Modular Disk Storage Manager (MDSM) to “unused iSCSI initiator” after renaming the host, otherwise you won’t find any available iSCSI disks
  • Reconfigure iSCSI port IP addresses for MD3220i controllers
  • Add host to MDSM (for new 3rd node)
  • Configure iSCSI Storage on Server1 (followed this helpful guide)
  • On Server1, bring the CSVs online
  • Start VMs on Server1, ensure they’re online and working properly

 

At this point I had a fully functioning, single-node cluster within Server 2012 R2. With the right planning you can do this with 5-15 minutes of downtime for your VMs.

Next I added the second node:

  • Evict Server2 from Old Cluster, effectively killing it.
  • Wipe and reinstall Windows Server 2012 R2 on Server2
  • Configure Server2 with new iSCSI configuration as documented
  • Re-cable iSCSI NICs to redundant switches
  • Join Server2 to cluster membership
  • Re-allocate VMs to Server2 to share the load

I still had to reset the preferred node and failover options on each VM.

Adding the 3rd node followed the exact same process. The Cluster Validation Wizard gave a few errors about the processors not being the exact same model; however, I had no concerns there, as the new node simply has a newer-generation Intel Xeon.

 

The remaining task for me is to upgrade the Integration Services on each of my VMs, which will require a reboot, so I’m holding off for now.

Technology is awesome (and how I can’t afford it all)

Lately I have been doing some planning for budget season, and thinking about the medium-term future and where I’d like to take my infrastructure.

A big part of this is storage, and my company is in a bit of an odd place in that we’re growing so fast we need to add to our MD3220i SAN, but the MD3220i itself has an expiring warranty in December 2015. I feel like it would be a waste of money to add a disk shelf in 2014 to just have it go unused by 2015.

To address this I began with my Dell team, and had a product specialist in the office today to go over their mid-size and enterprise storage products: EqualLogic and Compellent. He did an excellent job of making clear the advantages of a ‘frameless’ storage infrastructure over a ‘framed’ one like we’re in now.

Since then (only a few hours ago really) my mind has just been buzzing with all the possibilities and projects that this meeting has kickstarted.

In the form of one long run-on sentence:

If we upgrade our storage next year to an EqualLogic we can utilize the storage tiering to reduce rack space and power use while maintaining performance and increasing capacity, while at the same time decommissioning old hardware (our MD3000) and re-using our slightly old hardware (MD3220i) for purposes such as backup and disaster recovery, which we’re looking at something like AppAssure or Veeam or Unitrends to handle as long as we have the appropriate disk space, which needs to be shared with Hyper-V Replica for DR purposes, because I’m severely lacking in that area right now which is dangerous but can be solved with a multi-tier backup and DR plan of having storage on the LAN AND offsite with replication of the backup database and Hyper-V Replica but this requires a cluster upgrade to Server 2012 R2, which would be nice anyways because then I can do live VHDX expansion to avoid having to disrupt my file server because the less off-hours maintenance I have to do the better so that I can use my time doing things like analyzing performance benefits and presenting to the Executive why we need to do all this stuff RIGHT NOW.

 

Server 2012 Storage Spaces and Hot Spares

I had previously blogged about my SC847 backup storage array, and how I’m contemplating using Windows Server 2012 Storage Spaces to manage the storage in a redundant way.

Yesterday I began setting that up, and it was very easy to configure. My only complaint through the process is that there isn’t much information on what the implications of your choices are (such as Simple, Mirror or Parity virtual disks) when you’re actually making that choice.
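
To make one of those implications concrete, here is a minimal Python sketch of the approximate usable capacity for each layout. It assumes a two-way mirror and single parity, ignores metadata and slab-allocation overhead, and treats the parity column count as equal to the disk count unless specified; the disk counts and sizes are hypothetical and `usable_capacity_tb` is my own helper name, not part of any Microsoft tooling.

```python
from typing import Optional


def usable_capacity_tb(disk_count: int, disk_size_tb: float, layout: str,
                       columns: Optional[int] = None) -> float:
    """Approximate usable capacity of a virtual disk spanning the whole pool."""
    raw = disk_count * disk_size_tb
    if layout == "simple":
        # Striping only: all raw space is usable, but no disk failure is tolerated.
        return raw
    if layout == "mirror":
        # Two-way mirror: every block is stored twice.
        return raw / 2
    if layout == "parity":
        # Single parity: one column's worth of each stripe holds parity.
        cols = columns if columns is not None else disk_count
        if cols < 3:
            raise ValueError("single parity needs at least 3 columns")
        return raw * (cols - 1) / cols
    raise ValueError(f"unknown layout: {layout}")


if __name__ == "__main__":
    # Hypothetical pool: 8 disks of 4 TB each.
    for layout in ("simple", "mirror", "parity"):
        print(layout, usable_capacity_tb(8, 4, layout))
```

The trade-off this shows is the usual one: Simple gives the most space with no redundancy, Mirror costs half the raw space, and Parity sits in between at the price of slower writes.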

Since this storage was just being set up, I decided to familiarize myself with the failover functions of the storage spaces.

To test this, I set up a storage space and virtual disk, and then removed a hard drive to see what would happen. What I observed was that both the storage space and virtual disk went into an alert state, and the physical disk list showed the removed disk as “Lost Communication”.

I wiped away all of the configuration and recreated it, this time with a hot spare. When I performed the same test of pulling a hard drive, I expected the hot spare to take over immediately and the parity on my virtual disk to be rebuilt, but this didn’t happen.

Right-clicking the virtual disk and choosing “Repair” did force the virtual disk to utilize the hot spare though.

 

 

While attempting to figure out the intended behavior, I came across this blog post by Baris Eris detailing the hot spare operation in depth. I won’t repeat everything here; instead I highly recommend you read what Baris has written as it is excellent.

One thing I will note is that I also had to use the PowerShell command to switch the re-connected disk to a hot spare, but after doing that the red LED on my SC847 stayed lit until I power cycled the unit.
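
For the documentation mentioned below, the manual steps can be written down roughly as follows. This is a minimal Python sketch that only assembles the PowerShell commands as text. It assumes the standard Storage module cmdlets (`Get-PhysicalDisk`, `Repair-VirtualDisk`, `Set-PhysicalDisk -Usage HotSpare`) are what you would use here, which may differ from the exact commands in Baris’s post; the disk names are hypothetical and `hot_spare_commands` is my own helper name.

```python
from typing import List


def hot_spare_commands(virtual_disk: str, returned_disk: str) -> List[str]:
    """Return the manual recovery commands as strings (nothing is executed)."""
    return [
        # 1. Check which physical disk is unhealthy and how each disk is used.
        "Get-PhysicalDisk | Select-Object FriendlyName, OperationalStatus, Usage",
        # 2. Force the virtual disk to rebuild onto the hot spare.
        f'Repair-VirtualDisk -FriendlyName "{virtual_disk}"',
        # 3. Mark the re-connected (or replacement) disk as the new hot spare.
        f'Set-PhysicalDisk -FriendlyName "{returned_disk}" -Usage HotSpare',
    ]


if __name__ == "__main__":
    # Hypothetical names; use the FriendlyName values reported on your own server.
    for line in hot_spare_commands("BackupVD", "PhysicalDisk5"):
        print(line)
```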

My conclusion is that the hot spare behavior in Storage Spaces is workable, as long as documentation is in place so that staff understand how it works and when manual intervention is necessary.