Server 2012 Storage Spaces and Hot Spares

I had previously blogged about my SC847 backup storage array, and how I’m contemplating using Windows Server 2012 Storage Spaces to manage the storage in a redundant way.

Yesterday I began setting that up, and it was very easy to configure. My only complaint with the process is that there isn't much information about the implications of your choices (such as Simple, Mirror, or Parity virtual disks) at the point where you're actually making them.

Since this storage was just being set up, I decided to familiarize myself with the failover functions of the storage spaces.

As part of this process I set up a storage space and virtual disk, and then removed a hard drive to see what would happen. Both the storage space and the virtual disk went into an alert state, and the physical disk list showed the removed disk as "Lost Communication".

I wiped away the configuration and recreated it, this time with a hot spare. When I performed the same test of pulling a hard drive, I expected the hot spare to immediately take over and the virtual disk to rebuild its parity, but this didn't happen.

Right-clicking the virtual disk and choosing "Repair" did force the virtual disk to utilize the hot spare, though.
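If you prefer PowerShell, the same repair can be kicked off from there as well; this is just a minimal sketch, and the friendly name below is a placeholder for whatever your virtual disk is called:

Get-VirtualDisk | Where-Object OperationalStatus -ne "OK"     # list any degraded virtual disks
Repair-VirtualDisk -FriendlyName "VDisk01"                    # "VDisk01" is a placeholder name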

 

 

While attempting to figure out the intended behavior, I came across this blog post by Baris Eris detailing the hot spare operation in depth. I won’t repeat everything here; instead I highly recommend you read what Baris has written as it is excellent.

One thing I will note is that I also had to use a PowerShell command to switch the re-connected disk back to a hot spare, and even after doing that the red LED on my SC847 stayed lit until I power cycled the unit.
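For anyone looking for it, the command is along these lines (a sketch only; the disk friendly name is a placeholder, so check Get-PhysicalDisk first):

Get-PhysicalDisk | Format-Table FriendlyName, Usage, OperationalStatus    # find the re-connected disk
Set-PhysicalDisk -FriendlyName "PhysicalDisk5" -Usage HotSpare            # placeholder name; marks it as a hot spare again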

The end result for me is that hot spares in Storage Spaces will work, as long as documentation is in place so that staff understand how they behave and when manual intervention is necessary.

 

Intranet site not accessible externally by domain member

A strange issue popped up recently with one of my internal sites. To be honest, I'm not quite sure what changed, as this site hadn't experienced the problem in the title until just recently.

The problem is as follows:

  • A domain-joined computer is within the company LAN, and accesses intranet.company.com without issue.
  • A non-domain joined computer (such as my personal computer) is able to access intranet.company.com externally.
  • The domain-joined computer travels outside the LAN and is then unable to access intranet.company.com.

 

At first I thought this was a problem with my reverse proxy, but after extensive troubleshooting I had ruled it out. Once I realized domain membership was a factor in connectivity, I knew the network firewall wasn’t the issue either. I suspected it had something to do with Internet Explorer’s categorization and rules around Internet/Intranet/Trusted Sites.

 

Eventually I stumbled upon this serverfault article, which led me to the solution. I needed to use the adsutil.vbs script to set the authentication on the affected directory to "NTLM" instead of the default "Negotiate,NTLM". As in that question, I am using IE8 and IIS 6.

 

To use adsutil.vbs, I did the following:

Opened a command prompt, and navigated to:

C:\Inetpub\AdminScripts

Then I opened IIS and took note of the site ID for the affected site:
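If you'd rather get the site ID from the command line, adsutil can also enumerate the sites; the numeric IDs show up among the child paths under W3SVC:

cscript adsutil.vbs ENUM /P W3SVC         (lists child paths such as W3SVC/1, W3SVC/2, etc.)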

 

 

Then I checked on the authentication value with my affected site ID inserted into the command:

cscript adsutil.vbs GET W3SVC/14548430/Root/NTAuthenticationProviders

And after verifying it was the default, I changed it:

cscript adsutil.vbs SET W3SVC/14548430/Root/NTAuthenticationProviders "NTLM"

After this, my domain-joined computers were accessing it properly once again.

Windows Server 2012 Windows Update Error 0x80240440

I have begun setting up a new server for a branch office, and have decided to use Windows Server 2012 on it; thanks Software Assurance! This way I can utilize the new Hyper-V features when I’m ready, as well as virtualize a domain controller properly.

 

However, I ran into a problem with Windows Update on both the Host and Guest running Server 2012. Windows Update reported an error:

 

 

The Windows Update log located at %windir%\WindowsUpdate.log reported this:

+++++++++++  PT: Synchronizing server updates  +++++++++++
  + ServiceId = {9482F4B4-E343-43B6-B170-9A65BC822C77}, Server URL = https://fe1.update.microsoft.com/v6/ClientWebService/client.asmx
WARNING: Nws Failure: errorCode=0x803d0014
WARNING: Original error code: 0x80072efe
WARNING: There was an error communicating with the endpoint at 'https://fe1.update.microsoft.com/v6/ClientWebService/client.asmx'.
WARNING: There was an error sending the HTTP request.
WARNING: The connection with the remote endpoint was terminated.
WARNING: The connection with the server was terminated abnormally
WARNING: Web service call failed with hr = 80240440.
WARNING: Current service auth scheme='None'.
WARNING: Proxy List used: '(null)', Bypass List used: '(null)', Last Proxy used: '(null)', Last auth Schemes used: 'None'.
FATAL: OnCallFailure(hrCall, m_error) failed with hr=0x80240440
WARNING: PTError: 0x80240440
WARNING: SyncUpdates_WithRecovery failed.: 0x80240440
WARNING: Sync of Updates: 0x80240440
WARNING: SyncServerUpdatesInternal failed: 0x80240440
WARNING: Failed to synchronize, error = 0x80240440
WARNING: Exit code = 0x80240440
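
If you want to watch the log while re-trying the update check, PowerShell 3.0 (included with Server 2012) can tail it:

Get-Content "$env:windir\WindowsUpdate.log" -Tail 50 -Wait    # follow the log as new lines are written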

 

At first I thought this might be related to the "Trusted Sites" zone within Internet Explorer. I have mine set through GPO, so I added "https://*.update.microsoft.com" to that GPO and then did a "gpupdate /force", but the error remained.
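If you want to confirm the GPO entry actually reached the client, the Site to Zone Assignment List entries should land in the registry under the policy key below (it may be under HKLM rather than HKCU depending on whether the setting is computer or user based; a data value of 2 means Trusted Sites):

Get-ItemProperty "HKCU:\Software\Policies\Microsoft\Windows\CurrentVersion\Internet Settings\ZoneMapKey"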

 

Then I thought to look at my SonicWall NSA 2400. We have Application Control enabled, and it has been known to cause strange network connectivity issues when you least expect it, so checking there has become part of my default troubleshooting.

Unsurprisingly, this turned out to be the problem. The strange thing is that the App Control rule blocking the traffic isn't visible in the list of applications; I only found it through the logging.

Navigate to the App Control settings page and use "Lookup Signature" to look up signature #6:

 

Click on the pencil icon, and you’ll see this screen:

 

It turns out the rule "Non-SSL Traffic over SSL port" was blocking this Windows Update traffic.

Setting the Block option to Disabled for this rule allows Windows Update to work properly.

 

 

Fixing a DFSR connection problem

What a week; it actually started over the weekend with a DFSR database crash on a spoke server in my topology. This happened on a replication group that is 1.6 TB in size, so the volume check takes quite a long time.

During this time replication hung, so I decided to restart the DFSR service on our hub server. Unfortunately the restart failed and the service hung at "stopping", so I killed the dfsrs.exe process and then started the service again.
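For reference, "killed the process and started the service" amounts to something like this from an elevated prompt (DFSR is the short service name for DFS Replication):

taskkill /F /IM dfsrs.exe       (force-kill the hung DFS Replication process)
net start dfsr                  (start the DFS Replication service again)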

At this point it tried to repair the DFSR database but failed, so it went into initial replication. Initial replication on a 1.6 TB replication group is the stuff of nightmares. Compounding the problem, the hub server crashed the next night (which I haven't had time to look into yet), and the process basically had to start over.

That was three days ago. Initial replication has now finished, but there is a backlog of 10,000 files going to two of the spoke servers. That backlog didn't appear to be moving, and investigating the DFS Replication section of the Event Log revealed:

The DFS Replication service encountered an error communicating with partner SW3020 for replication group swg.ca\files\jobs. 
The service will retry the connection periodically. 
Additional Information: 
Error: 9032 (The connection is shutting down)
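
As an aside, the backlog itself can be watched from the command line with something like the following; the replicated folder name ("jobs") is a guess on my part and the member names are examples, so substitute your own:

dfsrdiag backlog /rgname:"swg.ca\files\jobs" /rfname:jobs /sendingmember:hubserver.domain.com /receivingmember:SW3020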

 

The steps I took to fix this error:

  • On the hub server, I deleted the individual connections from the hub to the spoke servers for this specific replication group
  • From a domain controller in the hub site, I ran this to ensure those changes reached the branch sites sooner:
    repadmin /syncall /e /A /P
  • Then I re-created the connections for each spoke and re-ran the repadmin command.

Following that, both spoke servers showed this in the DFS Replication event log:

The DFS Replication service failed to communicate with partner SW3020 for replication group swg.ca\files\jobs. The partner did not recognize the connection or the replication group configuration. 
The service will retry the connection periodically. 
Additional Information: 
Error: 9026 (The connection is invalid)

 

So from each spoke server, I ran the following:

dfsrdiag pollad /v /member:hubserver.domain.com         (Replication partner)
dfsrdiag pollad /v /member:hub_site_dc.domain.com      (Domain Controller in hub site)

 

Shortly thereafter I saw this in the logs:

The DFS Replication service successfully established an inbound connection with partner SW3020 for replication group swg.ca\files\jobs.

 

Replication traffic is now flowing properly. All that's left is to deal with the more than 500 conflict files this whole ordeal generated.
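For what it's worth, DFSR moves conflicting files into the hidden DfsrPrivate\ConflictAndDeleted folder under each replicated folder, so a rough count can be pulled with something like this (the path is a placeholder for your replicated folder):

$rf = "D:\Jobs"                                                          # placeholder path to the replicated folder
(Get-ChildItem -Force "$rf\DfsrPrivate\ConflictAndDeleted" | Measure-Object).Count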

 

Using Quick Storage Migration for VHDs with DFSR data

As shown in my last post, I recently added some storage to our SAN, and will be moving existing VHD files from our Hyper-V cluster to this new storage.

The unique thing about this is that these VHDs contain data that is being served with Microsoft DFS and replicated with DFSR. Hopefully word is spreading about the backup requirements for DFSR data stores, which are quite specific, especially when it comes to snapshots, because of the multi-master database DFSR uses. With that in mind, I was a little concerned about Quick Storage Migration (QSM), so I started digging.

I eventually came across this blog post that went into detail about how QSM works; it mentions that QSM takes a snapshot of the VM and creates a differencing disk, and eventually the snapshot is merged and the VM is restarted from its saved state. At this point I was still concerned about my DFS data, so I sent an email to the wonderful and always helpful AskDS blog seeking clarification.

 

Here’s the response from Ned Pyle:

My presumption is that this is safe because – from what I can glean – this feature never appears to roll back time to an earlier state as part of its differencing process.

He then went the extra mile and contacted internal Microsoft peers closely related to the QSM feature, who responded:

That’s correct, we don’t revert the machine state to an earlier time. A differencing disk is created to keep track of the changes while the parent vhd is being copied. Once the VHD is copied, the differencing disk is merged into the parent.

 

Based on that I performed a QSM of my 1.6 TB VHD yesterday. It took 12.5 hours to complete, but in the end it was fully successful, with no negative repercussions.

Something interesting to note is that I had to manually move a different 350 GB VHD file to the new storage first, rather than using QSM, since I was out of space on the original storage to create the AVHD differencing disk. I shut down the VM, transferred the VHD (which took about an hour), re-pathed it within the VM settings, and turned the VM back on.
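The transfer itself was nothing fancy; with the VM shut down, a plain file copy is all it takes, for example with robocopy (the paths and file name here are placeholders):

robocopy "C:\ClusterStorage\Volume1\VMName" "C:\ClusterStorage\Volume2\VMName" VMName.vhd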

Following this I received DFSR error 2212: "The DFS Replication service has detected an unexpected shutdown on volume F:". I'm not sure why this occurred, and since I only did the one transfer I can't verify that it wasn't related to some other operation or a bad shutdown.
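One way to confirm the replicated folders on that volume came back to a healthy state after the 2212 event is to query the DFSR WMI provider; a state of 4 means normal, 2 is initial sync, and 5 is in error:

wmic /namespace:\\root\microsoftdfs path dfsrreplicatedfolderinfo get replicatedfoldername,state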