Azure Application Gateway through NSG

I’m testing some things with Azure Application Gateway this week, and ran into a problem after tightening a network security group (NSG) to restrict virtual network traffic between subnets and peered VNETs.

Here’s the test layout:


The NSG applied to the “sub-clt-test” subnet has a default inbound rule allowing VirtualNetwork traffic on any port. VirtualNetwork is a service tag, and according to Microsoft it includes all subnets within a VNET as well as peered VNETs. In my diagram, that means there’s effectively an “allow all” rule between “vnet-edge” and “vnet-client”. Not ideal.

I created a new rule on the NSG for “sub-clt-test” to deny all HTTP traffic into the subnet, and then added the following rules intending to allow the Application Gateway to communicate with its backend pool targets:

Source                            Destination   Port
10.8.48.6                         Any           80
AzureLoadBalancer (Service Tag)   Any           80

The IP address listed there is the frontend private IP of my Application Gateway, within the subnet 10.8.48.0/24. I added the second rule during testing in case the Microsoft service tag included the Application Gateway within its dynamic range.

What I discovered is that this configuration broke the Application Gateway’s ability to communicate with the backend targets. The health probes reported the backends as unhealthy and suggested reviewing the NSG.

I knew this wasn’t due to outbound restrictions on any of my NSGs, because as soon as I removed the inbound port 80 deny on this subnet, it began functioning again.

I removed the Deny rule, and then installed Wireshark on the backend web server to see which IP was actually making the connection.

I discovered that while the frontend private IP was listed as 10.8.48.6, the connections to the backend pool were actually coming from 10.8.48.4 and 10.8.48.5. The frustrating part is that I couldn’t find any explanation for this behavior in Microsoft documentation. I know that the Application Gateway requires its own subnet that isn’t shared with other resources, but there are no references to the reserved IPs that traffic would be coming from.

Since the subnet is effectively reserved for this resource, it’s easy enough to modify my NSG to allow the range, but I felt like I was missing something obvious as to why these IP addresses were being used for the connections.

Halfway through writing this post, though, I came across this blog post by RoudyBob with a bit of insight: each instance of the Application Gateway uses an IP from its assigned subnet, and it is these instance IPs that communicate with your backend targets.
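With that in mind, allowing the whole Application Gateway subnet as the source covers whichever instance IPs happen to be in use. Here’s a minimal sketch using the AzureRM module, where the NSG and resource group names are placeholders for my own:

# "nsg-clt-test" and "rg-appgw-poc" are placeholder names for my NSG and resource group
$nsg = Get-AzureRmNetworkSecurityGroup -Name "nsg-clt-test" -ResourceGroupName "rg-appgw-poc"

# Allow the entire Application Gateway subnet to reach the backend pool on port 80,
# since the instance IPs (10.8.48.4, 10.8.48.5, ...) are drawn from this range.
# The priority must be a lower number (higher precedence) than the existing deny rule.
Add-AzureRmNetworkSecurityRuleConfig -NetworkSecurityGroup $nsg `
    -Name "Allow-AppGw-Subnet-HTTP" -Direction Inbound -Access Allow -Priority 200 `
    -Protocol Tcp -SourceAddressPrefix "10.8.48.0/24" -SourcePortRange * `
    -DestinationAddressPrefix * -DestinationPortRange 80

Set-AzureRmNetworkSecurityGroup -NetworkSecurityGroup $nsg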

ASR Deployment Planner throughput test failure

I’m preparing an environment for the Azure Site Recovery Deployment Planner tool, and ran into a problem with one of the tests. With this tool, you can run a few different tests independently:

  • GetVMList – generate a list of VMs from specified hosts
  • StartProfiling – run a profile job on the generated list of VMs, over a specified period of time
  • GenerateReport – generate output report of results based on dataset collected in the “StartProfiling” job
  • GetThroughput – run an upload test to an Azure Storage Account to measure the viable throughput from your environment (optionally done as part of the “StartProfiling” job too; an example invocation is shown after this list)
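For reference, running the throughput test on its own looks roughly like the following. This is a sketch from my notes, so verify the exact flag names against the Deployment Planner documentation for your version; the directory, storage account name, and key are placeholders:

# Flags per my reading of the ASR Deployment Planner documentation; values are placeholders
.\ASRDeploymentPlanner.exe -Operation GetThroughput `
    -Directory "E:\ASRDP-ProfiledData" `
    -StorageAccountName "asrdppocstorage" `
    -StorageAccountKey "<storage-account-key>"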

When I ran the StartProfiling job, I encountered a bunch of errors at the end indicating that the problem was with the throughput test. I then ran that test independently, and hit the following errors:

Output not in a definite format

UploadTest.exe has stopped working


I couldn’t find any other mentions of this error online, but I went back to the tool requirements and realized that I had missed a prerequisite.

Once I installed the Microsoft Visual C++ Redistributable for Visual Studio 2012, as identified in the ASR documentation, the throughput test succeeded without further errors.

Azure Site Recovery and Backups

I’m working on a specific test case of Azure Site Recovery and came across an error, which identified a gap in my knowledge of ASR and Hyper-V Replica.

I have ASR configured to replicate a group of VMs at 5-minute intervals. My initial replication policy for this proof of concept was configured to retain recovery points for 2 hours, with app-consistent snapshots every hour.
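For reference, a policy with these settings can be created with the AzureRM Site Recovery cmdlets. This is a rough sketch: the parameter names reflect the AzureRM.SiteRecovery module I had at the time and may differ between module versions, and $storageAccountId is a placeholder for the target storage account resource ID.

# Sketch only: parameter names may vary by module version; $storageAccountId is a placeholder
$policyParams = @{
    Name                                          = "POC-Replication-Policy"
    ReplicationProvider                           = "HyperVReplicaAzure"
    ReplicationFrequencyInSeconds                 = 300   # replicate every 5 minutes
    RecoveryPoints                                = 2     # retain recovery points for 2 hours
    ApplicationConsistentSnapshotFrequencyInHours = 1     # hourly app-consistent snapshots (0 = crash-consistent only)
    RecoveryAzureStorageAccountId                 = $storageAccountId
}
New-AzureRmSiteRecoveryPolicy @policyParams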

In practice, what I have seen for selectable recovery points is one every 5 minutes going back to the latest application-consistent recovery point, plus any additional app-consistent recovery points within the configured retention window (2 hours):

Because the underlying mechanism is Hyper-V Replica, this corresponds to the options for Recovery Points visible in Hyper-V Manager:

Hyper-V performs the .HRL file replication to Azure every 5 minutes as configured, but it also uses the Hyper-V integration components to trigger in-guest VSS for the application-consistent snapshot at 1-hour intervals. This means the general RPO is up to 5 minutes, but the application-consistent RPO is 1 hour.


In addition to replication, I am backing up a VM with Quest Rapid Recovery. The test was to ensure that the two protection methods (disaster recovery and backup) do not conflict with each other. Rapid Recovery runs an incremental snapshot every 20 minutes, and for roughly 40% of them the following VSS errors appear in the Application event log:

Volume Shadow Copy Service error: The I/O writes cannot be flushed during the shadow copy creation period

Volume Shadow Copy Service error: Unexpected error DeviceIoControl(\\?\Volume{a9dca4cb). hr = 0x80070016, The device does not recognize the command.


Quest has a KB article about this issue, which says to disable the Hyper-V backup integration component in order to avoid a timing conflict when the host uses the Volume Shadow Copy requestor service. The problem is that disabling it prevents ASR from getting an application-consistent snapshot of the virtual machine, which it begins to throw warnings about after a few missed intervals:

These problems make sense, though: every hour when Hyper-V attempts an application-consistent snapshot using VSS, Rapid Recovery finds the writer in use and times out waiting for it. There isn’t a way to configure when within the hour Hyper-V takes its snapshot, but I’ve begun tweaking my Rapid Recovery schedule to avoid rounded intervals like :00 or :10, using :03 or :23 instead, in an attempt to avoid conflicts with the VSS timing. So far this hasn’t been as effective as I’d hoped.

The other alternative is to disable application-consistent snapshots if they aren’t needed. If the workload is just flat files or an application that doesn’t natively tie into VSS, the best you can expect is a crash-consistent snapshot, and you should configure your ASR replication policy accordingly by setting the app-consistent snapshot frequency to 0. You can still retain multiple hours of recovery points this way; they’ll just all be crash-consistent.


Azure Site Recovery setup errors

While setting up an Azure Site Recovery proof of concept, I ran into errors: first when associating the replication policy, and then afterwards when updating the authentication service.

The background: SCVMM connected to a Server 2012 R2 Hyper-V cluster, replicating to Azure. During the final steps of the “Prepare Infrastructure” phase, you need to associate a replication policy. This failed at the following step:

The text of the error was:

Error ID
10003
Error Message
Protection couldn't be configured for cloud/site POC-ASR.
Provider error
Provider error code: 31408

Provider error message:

	Failed to fetch the version of Microsoft Azure Recovery Services Agent installed on the Hyper-V host server . Error: An internal error has occurred trying to contact the  server: : .

WinRM: URL: [http://:5985], Verb: [INVOKE], Method: [GetStringValue], Resource: [http://schemas.microsoft.com/wbem/wsman/1/wmi/root/cimv2/StdRegProv]

Check that WS-Management service is installed and running on server .

Provider error possible causes:
	It is possible that Registry provider of WMI is corrupted.

Provider error recommended action:
	Build the repository using MOF compiler and retry the operation.

This occurred right before I got pulled away by other items, so I didn’t troubleshoot it right away. When I came back to the Azure Portal (in a fresh session), a surprising new message greeted me on the Recovery Services vault blade:

This was very odd, since I had just installed the latest version of the Site Recovery provider on my VMM host, as well as the MARS agent on my Hyper-V hosts. But when I clicked “Update Now”, it listed my VMM host and displayed a new “Update Authentication Service” button.

This errored out almost immediately:

Error ID
635
Error Message
Updating authentication service information for server -  failed.
Provider error
Provider error code: 31437

Provider error message:

	Failed to fetch the version of Microsoft Azure Site Recovery Agent installed on the Hyper-V host(s) '' as the host is not reachable.

Provider error possible causes:
	
      1. Windows Management Instrumentation service crashed.
      2. Windows Remote Management (WinRM) service is not running.
      3. Required services may not be running on the Hyper-V host(s)''.
  
Provider error recommended action:
	
      Ensure that
      1. A firewall is not blocking HTTPS/HTTPS traffic on the Hyper-V host.
      2. If the server is running windows Server 2008 R2, ensure that KB 982293 is installed on it. Refer to https://aka.ms/kblink982293 for more details.
      3. The Hyper-V Virtual Machine Management service is running.
      4. Ensure that the Windows Management Instrumentation service is running on the Hyper-V host(s).
      5. Ensure that the Windows Remote Management (WinRM) service is running on the Hyper-V host(s).
      6. Verify that CredSSP authentication is enabled on the service configuration of the Hyper-V host(s). To enable the CredSSP on the service configuration, run the following command on the Hyper-V host, from an elevated command line: winrm set winrm/config/service/auth @{CredSSP="true"}.
      7. The Provider version running on the server is up-to-date. Download and install the latest Microsoft Azure Site Recovery Provider.
      8. If the error persists, retry the operation and contact support.
    

I validated all of the items in this list, checked the referenced articles, and ensured WMF was updated to 5.1, all to no avail.

I finally stumbled upon this post on the Microsoft forums, where a check was done against WMI for the “StdRegProv” object mentioned in the original error from the replication policy. It turns out this was my problem too! When I ran the WMI query, it returned the error “Exception calling GetStringValue: Provider not found” on 3 of my 4 Hyper-V hosts:

# HKEY_LOCAL_MACHINE hive constant for the StdRegProv registry provider
$hklm = 2147483650
# Registry key and value the ASR provider reads to find the MARS agent version
$key = "Software\Microsoft\Windows\CurrentVersion\Uninstall\Windows Azure Backup"
$value = "DisplayVersion"
# Query the WMI registry provider in root\cimv2 (the namespace referenced in the error)
$wmi = Get-WmiObject -List "StdRegProv" -Namespace root\cimv2
($wmi.GetStringValue($hklm,$key,$value)).sValue
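Since I needed to test this on each node of the cluster, a quick way to run the same check everywhere at once is with PowerShell remoting; the host names here are placeholders for my cluster nodes:

# Host names are placeholders; run the same StdRegProv check on every Hyper-V node
$hvHosts = "hv-node1", "hv-node2", "hv-node3", "hv-node4"
Invoke-Command -ComputerName $hvHosts -ScriptBlock {
    $hklm  = 2147483650
    $key   = "Software\Microsoft\Windows\CurrentVersion\Uninstall\Windows Azure Backup"
    $value = "DisplayVersion"
    $wmi = Get-WmiObject -List "StdRegProv" -Namespace root\cimv2
    ($wmi.GetStringValue($hklm, $key, $value)).sValue
}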

I ran the mofcomp command below on the affected hosts, and when I then re-ran the last line of the query above ($wmi.GetStringValue), it returned a value instead of an error.

# Recompile the MOF that registers the StdRegProv registry provider
cd c:\windows\system32\wbem
mofcomp regevent.mof

Following this, the “Update Authentication Service” job completed successfully, and I was able to associate my replication policy without further problems.


Get Inner Error from Azure RM command

Today I’m working on an ARM template to deploy some resources into an Azure subscription. After building my JSON files and prepping parameters, I used the “Test-AzureRmResourceGroupDeployment” cmdlet to validate my template.
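The validation call looked something like this, where the resource group and file names are placeholders for my own:

# Resource group and file names are placeholders for my deployment
Test-AzureRmResourceGroupDeployment -ResourceGroupName "rg-arm-test" `
    -TemplateFile ".\azuredeploy.json" `
    -TemplateParameterFile ".\azuredeploy.parameters.json"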

This failed with the error:

Code    : InvalidTemplateDeployment
Message : The template deployment 'e76887a9' is not valid according to the validation
          procedure. The tracking id is '6df0fffb'. See inner errors for details. Please
          see https://aka.ms/arm-deploy for usage details.
Details : {Microsoft.Azure.Commands.ResourceManager.Cmdlets.SdkModels.PSResourceManagerError}

I found that decidedly unhelpful, but there is an effective way to get at the actual error message.

To retrieve the error details, use the following cmdlet, where the CorrelationID equals the tracking ID mentioned in the error.

Get-AzureRmLog -CorrelationId 6df0fffb -DetailedOutput

This will produce output which you can investigate and determine where the error lies.
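If there is a lot of activity in that window, the inner error tends to live in the statusMessage property of the failed entries. Here’s a quick way to pull just that out; the property names are based on what I saw in my output and may differ for other operation types:

# Extract the statusMessage JSON from each log entry and expand the inner error details.
# 'statusMessage' is the property name I saw in my output; it may vary by operation type.
Get-AzureRmLog -CorrelationId 6df0fffb -DetailedOutput |
    ForEach-Object { $_.Properties.Content["statusMessage"] } |
    Where-Object { $_ } |
    ForEach-Object { ($_ | ConvertFrom-Json).error.details }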

In my case, I needed to open a core quota increase request with Azure support, as my subscription had reached its limit.