Status in 2015

I feel like adding to my blog. I haven’t felt this way in a long time, and my not-so-recent posts make that evident: I made only 7 posts in all of 2014.

Part of this is because 2014 was incredibly busy at work, and exhausting at home. Continued responsibilities with my company being acquired, many new projects to juggle and a large amount of overtime all contributed to a lack of desire to write.

The other part is that I have been spending a large amount of time ‘managing’: project managing, department managing, and managing IT within a global company. Since I don’t have as much technical material to write about, I’m left with the learnings and challenges of management. However, I’m not particularly good at writing, not like my wife, and have found it difficult to share experiences or thoughts on this matter.

 

Things are finally slowing down at work, though. Or perhaps they aren’t slowing so much as I’m planning better and setting deliverables that are much more achievable. Because of this, I again feel the desire to share and to document.

 

I’m hopeful that in 2015 I can be more consistent in this, and continue to organize my work in such a way as to not be overwhelmed.

DPM synchronization failure on Secondary Server

I now have an environment of Microsoft Data Protection Manager 2012 R2 set up as a replacement for Backup Exec 2010.

Despite lacking some features, it has been performing quite well. However, I recently started receiving email notifications of errors relating to my secondary DPM server.

The primary DPM server is in the head office and backs up Hyper-V VMs from my main cluster. The secondary DPM server is in a branch office 300 km away and backs up the primary DPM server.

 

I started receiving errors like the following from the Secondary server to individual resources on the primary:

Synchronization for replica of \Online\servername(servername.clustername) on PrimaryDPM failed because the replica is not in a valid state or is in an inactive state. (ID 30300 Details: VssError:The writer experienced a non-transient error. If the backup process is retried, the error is likely to reoccur. (0x800423F4))
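When a lot of these alert emails arrive, the two pieces worth pulling out are the DPM alert ID and the VSS result code. The following is only a sketch of how that could be done; `parse_dpm_alert`, the regular expression, and the small lookup table are hypothetical helpers for illustration and are not part of DPM. The name listed for 0x800423F4 is the VSS constant for a non-retryable writer error, which matches the wording of the alert.

```python
import re

# Small lookup of VSS writer result codes (a subset, for illustration only).
VSS_CODES = {
    0x800423F3: "VSS_E_WRITERERROR_RETRYABLE",
    0x800423F4: "VSS_E_WRITERERROR_NONRETRYABLE",
}

# Matches "(ID <number> ... (0x<8 hex digits>)" even when the alert text
# is wrapped across several lines, as it is in the notification email.
ALERT_PATTERN = re.compile(
    r"\(ID\s+(?P<id>\d+)\b.*?\((?P<hresult>0x[0-9A-Fa-f]{8})\)",
    re.DOTALL,
)


def parse_dpm_alert(text):
    """Return the alert ID and VSS code found in an alert body, or None."""
    match = ALERT_PATTERN.search(text)
    if match is None:
        return None
    hresult = int(match.group("hresult"), 16)
    return {
        "dpm_id": int(match.group("id")),
        "hresult": hresult,
        "vss_name": VSS_CODES.get(hresult, "unknown"),
    }


sample = (
    "Synchronization for replica of \\Online\\servername(servername.clustername) "
    "on PrimaryDPM failed because the replica is not in a valid state or is in "
    "an inactive state. (ID 30300 Details: VssError:The writer experienced a "
    "non-transient error.  If the backup process is retried,\n"
    "the error is likely to reoccur.\n (0x800423F4))"
)
print(parse_dpm_alert(sample))
```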

 

Every time I tried to perform a consistency check on these resources, it would begin and then end within 30 seconds.

To be honest I didn’t have a lot of time to troubleshoot this one. I tried restarting both DPM servers as well as the Hyper-V host and VM itself, and none of that seemed to have an impact.

At some point I noticed that the resources giving the errors on the Secondary server hadn’t had a recovery point on the Primary server in quite some time.

I forced an Express Full Backup of the VMs on the Primary server and allowed it to complete (successfully). I then initiated a consistency check on the Secondary server protected resources, and it too completed successfully!
What still confuses me is why I didn’t receive alerts from my Primary DPM server that recovery points were being missed.
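Since the Primary server stayed quiet about the missed recovery points, an independent check for stale recovery points might have caught this earlier. Below is a minimal sketch of that idea. It assumes the latest recovery point time for each protected resource has already been exported from DPM by some means; the function name, the two-day threshold, and the VM names and dates are all made up for illustration.

```python
from datetime import datetime, timedelta


def find_stale_sources(last_recovery_points, now, max_age=timedelta(days=2)):
    """Return the names of resources whose newest recovery point is too old.

    last_recovery_points maps a resource name to the datetime of its latest
    recovery point, or to None if it has never had one.
    """
    stale = []
    for name, last_point in sorted(last_recovery_points.items()):
        if last_point is None or now - last_point > max_age:
            stale.append(name)
    return stale


# Hypothetical data: one healthy VM, one that stopped getting recovery
# points weeks ago, and one that never had any.
now = datetime(2015, 1, 20, 8, 0)
points = {
    "vm-app01": datetime(2015, 1, 19, 22, 0),
    "vm-file01": datetime(2014, 12, 30, 22, 0),
    "vm-sql01": None,
}
print(find_stale_sources(points, now))  # ['vm-file01', 'vm-sql01']
```

Anything this returns on the Primary server is a candidate for a forced Express Full Backup before running the consistency check on the Secondary.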

HP MSM AP Static Provisioning

I’m currently setting up a controlled WiFi network to adhere to my parent company’s standards. We’re using the HP MSM760 controller with MSM460 access points.

I had everything set up and tested within my head office environment; however, I ran into an issue when I moved the APs to a branch office on a different subnet.

Every AP that I moved registered with the MSM controller in Australia rather than the one in Canada, where I am. After some reading of the manual, I determined that this happened because controller discovery works in this order:

  • UDP broadcast
  • DHCP options
  • DNS lookup (to cnsrv1)

Because the Australian controller predates mine, they had already set up and used the DNS name “cnsrv1”. Since my APs could no longer reach a controller through UDP broadcast on the new subnet, they resolved that DNS name and registered with the Australian controller instead.
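The fallback behaviour described above can be modelled in a few lines. This is only a sketch of the order as I read it from the manual, not the AP firmware’s actual logic; the function names and the addresses (taken from the documentation-only IP ranges) are hypothetical.

```python
import socket

# Discovery methods in the order the AP tries them.
DISCOVERY_ORDER = ("udp_broadcast", "dhcp_option", "dns_lookup")


def pick_controller(results):
    """Return (method, address) for the first method that found a controller.

    results maps each discovery method to a controller address, or to None
    if that method found nothing. Returns None if every method failed.
    """
    for method in DISCOVERY_ORDER:
        address = results.get(method)
        if address:
            return method, address
    return None


def resolve_default_name(name="cnsrv1"):
    """Look up what the default controller name resolves to, or None."""
    try:
        return socket.gethostbyname(name)
    except OSError:
        return None


# Head office: the broadcast reaches the local controller first.
head_office = {"udp_broadcast": "192.0.2.10", "dns_lookup": "198.51.100.10"}
# Branch office: no broadcast or DHCP option, so DNS decides.
branch_office = {"udp_broadcast": None, "dhcp_option": None,
                 "dns_lookup": "198.51.100.10"}

print(pick_controller(head_office))
print(pick_controller(branch_office))
```

Running `resolve_default_name()` from a machine on the branch subnet before moving any APs would show which controller the DNS step is going to hand them to.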

 

To move my APs back to my controller I had to do the following:

From the Australian controller, I changed the AP to Autonomous mode:

[Screenshot: switch_to_autonomous]

 

Then I checked my DHCP server for the current IP address of the AP, because it changed after switching to autonomous mode.

Following that, I logged onto the web interface of the AP.

Then I used Maintenance > System > Provision to enter the static provisioning settings:

[Screenshot: provision]

I enabled discovery, enabled discovery by IP, entered the IP address of my Canadian controller, and clicked Save:

[Screenshot: provision2]

Then, from the left side of the screen, I clicked Restart to confirm the static provisioning:

[Screenshot: provision3]

When the AP came back up, it registered on my Canada controller and all is good!

EqualLogic DPM Hyper-V network tuning

I’m in the process of configuring DPM to back up my Hyper-V environment, which resides on an EqualLogic PS6500ES SAN.

It was during this that I encountered an issue with the DPM consistency check for a 3TB VM locking up every other VM on my cluster, due to high write latencies. During this period I couldn’t even get useful stats out of the EqualLogic because SANHQ wouldn’t communicate with it and the Group Manager live sessions would fail to initialize.

 

After some investigation, I did the following on all my Hyper-V hosts and my DPM host:

– Disabled “Large Send Offload” for every NIC

– Set “Receive Side Scaling” queues to 8

– Disabled the Nagle algorithm for iSCSI NICs (http://social.technet.microsoft.com/wiki/contents/articles/7636.iscsi-and-the-nagle-algorithm.aspx)

– Updated the Broadcom firmware and drivers

 

Following these changes, I still see very high write latency on my backup datastore volume, but the other volumes operate perfectly.
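Of the changes listed above, disabling Nagle is the least self-explanatory. I made the change at the OS level on the iSCSI NICs, not in code, but the same idea exists per socket as the standard `TCP_NODELAY` option: small writes are sent immediately instead of being held back and coalesced. The sketch below only illustrates that option on a single socket; the function name is made up, and it does not change any host-wide settings.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 turns Nagle off for this one socket only.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_low_latency_socket()
# A non-zero value means Nagle is disabled on this socket.
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
sock.close()
```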

 

 

 

Server 2012 R2 Upgrade and BSOD

I’m currently in the process of upgrading a standalone Server 2012 machine running Hyper-V to Server 2012 R2.

Due to resource constraints, I’m performing an in-place upgrade, despite this server residing 800 km away from me. Thank goodness for iDRAC Enterprise.

 

However, during the “Getting Devices Ready” phase of the upgrade, I received a Blue Screen of Death with the error message:

WHEA_UNCORRECTABLE_ERROR

 

After it hit this BSOD twice, the upgrade process failed and reverted to Server 2012. I was unable to find a log file describing what had occurred in any more detail, and was worried that I would be stuck on Server 2012.

Thankfully, I discovered a log file on the iDRAC with the following message:

A bus fatal error was detected on a component at slot 1.

This jogged my memory: we have a USB3 PCI-E card installed for pre-seeding an external drive with backup data.

I used the BIOS setup (Integrated Devices > Slot Disablement) to disable Slot 1, and then retried the upgrade with fingers crossed.

Success!