Simple disk performance test on Linux

I recently had a need to benchmark some Azure VMs, as I’m looking at Azure Storage Fuse and want to evaluate where bottlenecks in performance might be. I’m mostly saving this as a reference so I can come back to it again in the future.

The ‘fio’ tool can be used to perform this benchmarking; I originally found it via an Ars Technica article.

First, I need to create a job file that fio will use. I created a few different ones for different scenarios, mostly varying the disk target via the directory attribute of each reader.

Gist of ini files

# Read test from P20
[global]
size=1g
direct=1
iodepth=8
ioengine=libaio
bs=512k

[reader1]
rw=randread
directory=/tmp
[reader2]
rw=randread
directory=/tmp
[reader3]
rw=randread
directory=/tmp
[reader4]
rw=randread
directory=/tmp

I saved this file to /tmp/fioread.ini. This gives me a test that reads a 1 GB file per reader with a 512 KB block size and an I/O depth of 8 in a random pattern. Four readers are configured to maximize throughput.

Then, I called fio with this:

fio --runtime 60 --time_based /tmp/fioread.ini

This runs the test for 60 seconds, and the --time_based flag ensures it runs for the full duration even if the whole dataset has already been processed (it loops back around and begins again).

When I start the command, fio lays out the files defined in the .ini file.

Since I have 4 readers, it produces 4 x 1 GB files in /tmp.

Then it immediately goes into the test, and after 60 seconds, produces results.

Each reader will have its own block of stats, with a summary at the bottom.

In this case, I’ve gotten 139 MB/s reading from a P4 (30 GB) disk on an E8s_v3. The disk has a throughput of 25 MB/s but can burst to 170 MB/s, and the E8s_v3 will sustain 128 MB/s on a cached disk (which I have read/write cache on). I’m not exactly sure yet why I was able to exceed 128 MB/s; perhaps some of the reads were already coming from cache after the layout, but not all of them?

I suspect that if I set caching to None on this OS disk, I’d see closer to the 170 MB/s burst, because the E8s_v3 can sustain 192 MB/s on an uncached disk.
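If I go ahead and test that, the host caching change can be made with Az PowerShell along these lines (a quick sketch – the resource group and VM names are placeholders, and the VM will likely need a restart for the caching change to apply):

# Placeholder names – substitute your own resource group and VM
$vm = Get-AzVM -ResourceGroupName "benchmark-rg" -Name "bench-vm01"

# Switch host caching on the OS disk to None, then push the update
$vm.StorageProfile.OsDisk.Caching = "None"
Update-AzVM -ResourceGroupName "benchmark-rg" -VM $vm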

Understanding Azure Outbound Internet and Load Balancer

I deepened my understanding of Azure outbound Internet connections today. I’ve been working on setting up an outbound proxy using Squid, and had been struggling with connectivity. I want my outbound proxy to be highly available, so I thought, “I’ll put it into a Load Balancer, and I want the Standard SKU.” As soon as I did this, my proxy stopped working.

I banged my head against that problem for an hour or two, and then came back to it another day, read the docs, and had a breakthrough in understanding.

Setup

Here’s the resource layout I’m working with:

I have a bunch of servers in a subnet. They use WPAD DNS resolution to direct Internet traffic to the private IP address of a load balancer sitting in a different subnet.

Two Linux VMs in an availability set are the back-end of my load balancer. I use an inbound rule for port 3128 to receive the proxy traffic.

The outbound line to the Internet in that diagram is what I assumed would happen: I’ve read through “Outbound Connections in Azure” a few times and thought I had a handle on what would occur.

The Problem

It didn’t work! As soon as I put my Squid VMs into the load balancer, they stopped being able to proxy traffic. So I thought, “let’s put it in a Basic load balancer, it’d be cheaper anyway.” And that worked! At that point I didn’t really understand what was going on, and knew I needed to come back to it after some sleep.

The Discovery

Let’s take a closer look at the Microsoft doc I referenced above. It describes three outbound scenarios:

  1. VM with an Instance Level Public IP address (with or without Load Balancer)
     • Not my scenario – I don’t have a Public IP address attached to my Squid VMs
  2. Public Load Balancer associated with a VM (no Public IP address on the instance)
     • Not my scenario – I’m using an Internal load balancer, not a Public one
  3. Standalone VM (no Load Balancer, no Public IP address)
     • This is what I thought the behavior would default to – use a dynamic public IP from the Azure pool and SNAT outbound traffic

In the detail of Scenario 3, the first sentence says “In this scenario, the VM is not part of a public Load Balancer pool (and not part of an internal Standard Load Balancer pool) and does not have a Public IP address assigned to it.”

Well, I am part of an Internal Standard Load Balancer pool; where does that leave me?

The blue box just below makes it very clear:

Scenario #3 is NOT a fallback option when using an Internal load balancer of Standard SKU.

Later on in the document, this is made more explicit:

"When using an internal Standard Load Balancer, outbound NAT is not available until outbound connectivity has been explicitly declared."

Scenario #3 IS the fallback option when using an Internal Basic load balancer, which I think was the source of my assumption. This is why I get Internet connectivity when I place my Squid VMs into the backend of a Basic load balancer. The major downside is that I cannot control the IP address my proxy communicates outbound on, which is a firm requirement for me.

 

I verified this problem and the scenario #1 resolution. From my VM, I used a ‘wget’ command while operating under my original design – it timed out.

Then I added a public IP address to one VM and performed the same test – this time it succeeded.

The Solution

As the doc describes, there are a few solutions, although at the time of this writing it is missing one:

  • add Public IP to each VM
  • add a public load balancer, create outbound rules, and add the squid as backend to it
  • add NAT gateway to the subnet

add Public IP to each VM

I can create a public IP address for each VM and associate it.

This will give me static IP addresses, but I have to do it for each VM in the back-end.
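As a rough Az PowerShell sketch of that association (the resource group, vnet, subnet, and NIC names below are placeholders, not my actual environment):

# Placeholder names throughout
$pip = New-AzPublicIpAddress -ResourceGroupName "proxy-rg" -Name "squid01-pip" `
    -Location "eastus" -Sku Standard -AllocationMethod Static

$vnet   = Get-AzVirtualNetwork -ResourceGroupName "network-rg" -Name "core-vnet"
$subnet = Get-AzVirtualNetworkSubnetConfig -Name "proxy-subnet" -VirtualNetwork $vnet
$nic    = Get-AzNetworkInterface -ResourceGroupName "proxy-rg" -Name "squid01-nic"

# Attach the public IP to the NIC's ip configuration and commit the change
$nic | Set-AzNetworkInterfaceIpConfig -Name "ipconfig1" -Subnet $subnet -PublicIpAddress $pip | Out-Null
$nic | Set-AzNetworkInterface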

Each static Standard public IP is $0.005/hour ($3.65 USD/mo), so $7.30 USD/mo for my two VMs.

Let’s say I am transferring 100 GB/mo outbound across the two VMs. Bandwidth is billed on 95 GB (the first 5 GB are free) at $0.087/GB, so $8.265 USD/mo.

Total cost = $15.565 USD / month

add a public load balancer, create outbound rules, and add the squid as backend to it

It turns out you can have the same VMs be backends to BOTH a public and an internal Standard load balancer at the same time.

This gives some flexibility, because now you can scale out in the backend, as long as you pay attention to the scaling guidelines from Microsoft and avoid SNAT port exhaustion.

You apply only an outbound rule (and you must explicitly create one) so that the public load balancer acts purely for outbound traffic, not inbound.
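As a rough Az PowerShell sketch of what adding an outbound rule to an existing public Standard load balancer could look like (names are placeholders, and I’m not showing the creation of the load balancer, frontend, or backend pool here):

# Placeholder names – assumes a public Standard LB already exists with the Squid VMs in its backend pool
$lb = Get-AzLoadBalancer -ResourceGroupName "proxy-rg" -Name "proxy-outbound-lb"

# SNAT all outbound traffic from the backend pool through the LB's public frontend IP
Add-AzLoadBalancerOutboundRuleConfig -LoadBalancer $lb -Name "squid-outbound" `
    -Protocol All `
    -FrontendIpConfiguration $lb.FrontendIpConfigurations[0] `
    -BackendAddressPool $lb.BackendAddressPools[0] `
    -IdleTimeoutInMinutes 4

# Commit the change to the load balancer
Set-AzLoadBalancer -LoadBalancer $lb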

Single public IP cost of $3.65 USD/mo still applies.

A Standard load balancer with a small number of rules (the first 5 rules are covered by the base rate) processing 100 GB of traffic is $18.75 USD/mo.

We still need to pay the bandwidth charge since we have traffic egressing to the Internet: $8.265 USD / mo

Total cost: $30.665 USD / month ($3.65 + $18.75 + $8.265)

add NAT gateway to the subnet

A new feature of Azure that recently went GA is NAT Gateway (or Virtual Network NAT). This provides outbound connectivity for an entire subnet, rather than at a VM level.

Important to note: a NAT Gateway CANNOT be linked to a subnet that contains any Basic SKU public IP addresses or Basic load balancers.

This is simpler than the outbound public load balancer option: you still attach a public IP address, but you do not need to manage individual outbound rules and backend pools.

If there are other resources in the subnet, they too can take advantage of the NAT Gateway and a static public IP address.
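As a sketch of what setting this up could look like in Az PowerShell (all names and the address prefix are placeholders, and the exact subnet-association syntax depends on your Az.Network version – check the NAT gateway docs):

# Placeholder names – associate a new NAT gateway with an existing subnet
$pip = New-AzPublicIpAddress -ResourceGroupName "proxy-rg" -Name "natgw-pip" `
    -Location "eastus" -Sku Standard -AllocationMethod Static

$natGw = New-AzNatGateway -ResourceGroupName "proxy-rg" -Name "proxy-natgw" `
    -Location "eastus" -Sku Standard -PublicIpAddress $pip

# Point the subnet at the NAT gateway and save the virtual network
$vnet = Get-AzVirtualNetwork -ResourceGroupName "network-rg" -Name "core-vnet"
Set-AzVirtualNetworkSubnetConfig -VirtualNetwork $vnet -Name "proxy-subnet" `
    -AddressPrefix "10.0.1.0/24" -NatGateway $natGw | Out-Null
$vnet | Set-AzVirtualNetwork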

Single public IP cost of $3.65 USD/mo still applies.

It doesn’t seem like the NAT Gateway is integrated into the Pricing Calculator yet, but the Docs page has a section on Pricing:

  • Resource hours $0.045/hour * 730 = $32.85 USD / mo
  • Data processed $0.045/GB * 100GB = $4.5 USD / mo

We still need to pay the bandwidth charge since we have traffic egressing to the Internet: $8.265 USD / mo

Total cost: $49.265 USD / month

 

Azure Pipeline for Automation Runbooks

Azure Automation runbooks are very useful – particularly for scheduled or repeated operations. One downside I have observed is that they are very often disconnected from proper version control.

Microsoft has a solution for this: source control integration

You can connect a GitHub/Azure Repo to an Automation Account, which uses resources like RunAs accounts, personal access tokens (from your repo), and automation jobs to perform the sync.

I began evaluating this for the purpose of importing, publishing, and scheduling some runbooks; however, there were two key reasons I felt a different direction was better suited:

  1. Flexibility – I could only sync a folder named “runbooks”, and only import and publish them. What if I only wanted a certain selection of runbooks? What if I wanted to segregate runbooks between different Automation Accounts? What if I wanted to manage my schedule definitions as code too? Sure, there are solutions to these problems, but then they become workarounds or add-ons to the source control integration.
  2. Consistency – other areas where version-controlled code is released to an environment typically use continuous deployment tools; why would I treat my runbooks any differently? If I’m already going to use Azure DevOps pipelines for those areas, I want to keep that consistency going.

There are a few different ways that runbook management could be accomplished in an Azure DevOps pipeline: Azure PowerShell, ARM templates, Azure CLI.

For simplicity and speed of development, I chose to use Azure PowerShell within a YAML pipeline definition.

My use-case here is taking my Update Management scripts and making them run automatically on a schedule. Based on that, I have the following structure in my repository:

GitHub link for code

PowerShell (Folder)
    Update Management (Folder)
        azure-pipelines.yaml
        Set-UpdateManagementSchedule.ps1
        Get-AzCachedAccessToken.ps1
        New-UpdateManagementSchedule.ps1
        Publish-AARunbookFromDevOps.ps1
    New-EmailDelivery.ps1

Set-UpdateManagementSchedule.ps1 defines some time zones, and then iterates over them calling New-UpdateManagementSchedule.ps1 for each. At the end, it uses New-EmailDelivery.ps1 to send results by email.

New-UpdateManagementSchedule.ps1 calls Get-AzCachedAccessToken.ps1 in order to use the existing Az PowerShell context to create a bearer token and authenticate against the Azure REST API with it (I’m using the Update Management endpoints there).
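As an aside, the same idea can now be done with Get-AzAccessToken from the Az.Accounts module instead of a custom helper – something like this sketch (the IDs, names, and API version are illustrative placeholders, not the exact Update Management call my script makes):

# Grab a bearer token from the existing Az PowerShell context
$token   = (Get-AzAccessToken -ResourceUrl "https://management.azure.com/").Token
$headers = @{ Authorization = "Bearer $token" }

# Placeholder identifiers for an example ARM REST call
$subscriptionId = "00000000-0000-0000-0000-000000000000"
$resourceGroup  = "automation-rg"
$automationAcct = "my-automation-account"

$uri = "https://management.azure.com/subscriptions/$subscriptionId/resourceGroups/$resourceGroup" +
       "/providers/Microsoft.Automation/automationAccounts/$automationAcct" +
       "/softwareUpdateConfigurations?api-version=2019-06-01"

Invoke-RestMethod -Uri $uri -Method Get -Headers $headers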

Publish-AARunbookFromDevOps.ps1 is my script that gets called within the DevOps pipeline, to import and publish the runbooks to the automation account.

I want Set-UpdateManagementSchedule.ps1 to run once per month, because it calculates the appropriate timing of Update Management schedules. I don’t want any of the other scripts to have a schedule at all, since they’re just called as needed rather than on a schedule.

In order to make this all happen, I created a DevOps pipeline, stored within the “azure-pipelines.yaml” file.

The concept of storing the pipeline as code within YAML is very advantageous; however, it is difficult to simply write from scratch. Often I will create a dummy pipeline in the GUI, using the Task Assistant (see below) and the Azure DevOps task docs, in order to understand the syntax and components of what I’m trying to achieve.

I can then take that YAML that has been progressively built, store it in my repository, and create a fresh pipeline by importing it.

In this case, my pipeline starts with a trigger definition, for which I learned and used the path-based rules that Azure Repos supports:

trigger:
  branches:
    include:
    - master
  paths:
    include:
    - PowerShell/UpdateManagement/*

This ensures the pipeline only triggers when a file within the UpdateManagement folder is modified, rather than on changes anywhere in my repository.

Using the AzurePowerShell task, I can trust that authentication to Azure will be handled appropriately as long as I supply a service connection (shown as the “azureSubscription” property below).

- task: AzurePowerShell@5
  inputs:
    azureSubscription: 'automation-rg' #This is the devops service connection name
    ErrorActionPreference: 'Stop'
    FailOnStandardError: true
    ScriptType: 'FilePath'
    ScriptPath: './PowerShell/UpdateManagement/Publish-AARunbookFromDevOps.ps1'
    ScriptArguments:
      -runbookname Get-AzCachedAccessToken `
      -runbookfile ./PowerShell/UpdateManagement/Get-AzCachedAccessToken.ps1 #`
      #-scheduleName $(userPassword)
    azurePowerShellVersion: 'LatestVersion'
  displayName: Deploy AA Runbook - Get-AzCachedAccessToken

For each Runbook that I want to handle in my pipeline, I’m calling the “Publish-AARunbookFromDevOps.ps1” script and supplying arguments related to that specific runbook. I wanted to break it out this way to give me the most flexibility to handle runbooks independently.

The script being called is fairly static right now; I’m making a bunch of hard-coded assumptions, such as “always overwrite the runbook if it exists rather than doing a comparison check”, “every import needs a publish”, and “only one kind of schedule is going to be used”.
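The core of a script like that might look something like the following sketch – this isn’t my exact Publish-AARunbookFromDevOps.ps1, and the defaults and schedule handling are illustrative assumptions:

param (
    [Parameter(Mandatory = $true)][string]$runbookname,
    [Parameter(Mandatory = $true)][string]$runbookfile,
    [string]$scheduleName,
    # Placeholder defaults – in practice these would come from pipeline variables
    [string]$resourceGroupName = "automation-rg",
    [string]$automationAccountName = "my-automation-account"
)

# Always overwrite the runbook if it exists, and publish it as part of the import
Import-AzAutomationRunbook -ResourceGroupName $resourceGroupName `
    -AutomationAccountName $automationAccountName `
    -Name $runbookname -Path $runbookfile `
    -Type PowerShell -Published -Force

# Only one kind of schedule: a simple monthly cadence, created and linked when a name is supplied
if ($scheduleName) {
    New-AzAutomationSchedule -ResourceGroupName $resourceGroupName `
        -AutomationAccountName $automationAccountName `
        -Name $scheduleName -StartTime (Get-Date).AddDays(1) -MonthInterval 1
    Register-AzAutomationScheduledRunbook -ResourceGroupName $resourceGroupName `
        -AutomationAccountName $automationAccountName `
        -RunbookName $runbookname -ScheduleName $scheduleName
}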

Limiting the schedule to my hard-coded values reduced the time-to-deploy this in my environment, because I didn’t have a need right now for further manipulation. This idea is related to “YAGNI” (“You aren’t gonna need it”), which I recently read about from Martin Fowler.

When the pipeline runs, it performs all its tasks in about 3 minutes 30 seconds.

I can see that it has created a schedule in my Automation Account and linked it to the runbook.

Check out the README.md from the GitHub repo for details on re-using this code for yourself.

 

Terraform AzureRM provider 2.0 upgrade

Today I needed to upgrade a set of Terraform configuration to the AzureRM 2.0 provider (technically 2.9.0 as of this writing). My need is primarily to get some bug fixes regarding Application Gateway and SSL certificates, but I knew I’d need to move sooner or later, as new resources and properties are only being developed for this new major version.

This post will outline my experience and some issues I faced; they’ll be very specific to my configuration, but the process may be helpful for others.

To start, I made sure I had the following:

  • A solid source-control version that I could roll back to
  • A snapshot of my state file (sitting in an Azure storage account)

Then, within my ‘terraform’ block where I specify the backend and required versions, I updated the value for my provider version:

terraform {
  backend "azurerm" {
  }
  required_version = "~> 0.12.16"
  required_providers {
    azurerm = "~> 2.9.0"
  }
}

Next, I ran the ‘terraform init’ command with the upgrade switch; plus some other parameters at command line because my backend is in Azure Storage:

terraform init `
    -backend-config="storage_account_name=$storage_account" `
    -backend-config="container_name=$containerName" `
    -backend-config="access_key=$accountKey" `
    -backend-config="key=prod.terraform.tfstate" `
    -upgrade

Since my required_providers setting specified AzureRM at ~> 2.9.0, the upgrade took place.


Now I could run a “terraform validate” and expect to get some syntax errors. I could have combed through the full upgrade guide and preemptively modified my code, but I found relying upon validate to call out file names and line numbers easier.

Below are some of the things I found I needed to change – I had to run “terraform validate” multiple times to catch them all:

Add a “features” block to the provider:

provider "azurerm" {
  subscription_id = var.subscription
  client_id       = "service principal id"
  client_secret   = "service principal secret"
  tenant_id       = "tenant id"
  features {}
}

  • Remove “network_security_group_id” references from subnets
  • Change “address_prefix” to “address_prefixes” on subnets, with the value supplied as a list
  • Virtual Machine Extension drops the “resource_group”, “location”, and “virtual_machine_name” properties
  • Virtual Machine Extension requires a “virtual_machine_id” property instead
  • Storage Account no longer has the “enable_advanced_threat_protection” property
  • Storage Container no longer has the “resource_group_name” property

Finally, the resources for Azure Backup VM policy and protection have been renamed – this is outlined in the upgrade guide (direct link).

It was this last one that caused the most problems. Before I had replaced it, my “terraform validate” was crashing on a fatal panic:

Looking in the crash.log, I eventually found an error on line 2572:

2020/05/13 20:30:55 [ERROR] AttachSchemaTransformer: No resource schema available for azurerm_recovery_services_protected_vm.rsv-protect-rapid7console

This reminded me of the resource rename in the upgrade guide, and I modified it accordingly.

Now, “terraform validate” is successful, yay!


Not so fast though – my next move of “terraform plan” failed:

Error: no schema available for azurerm_recovery_services_protected_vm.rsv-protect-rapid7console while reading state; this is a bug in Terraform and should be reported

I knew this was because there was still a reference in my state file, so my first thought was to try a “terraform state mv” command, to update the reference:

 
terraform state mv old_resource_type.resource_name new_resource_type.resource_name
terraform state mv azurerm_recovery_services_protection_policy_vm.ccsharedeus-mgmt-backuppolicy azurerm_backup_policy_vm.ccsharedeus-mgmt-backuppolicy

Of course, it was too much to hope that would work; it errored out:

Cannot move
azurerm_recovery_services_protection_policy_vm.ccsharedeus-mgmt-backuppolicy
to azurerm_backup_policy_vm.ccsharedeus-mgmt-backuppolicy: resource types
don't match.

I couldn’t find anything else online about converting a pre-existing terraform state to the 2.0 provider with resources changing like this. And from past experience I knew that Azure Backup didn’t like deleting and re-creating VM protection and policies, so I didn’t want to try a “terraform taint” on the resource.

I decided to take a risk and modify my state file directly (I had confirmed my snapshot first!).

Connecting to my blob storage container, I downloaded a copy of the state file, and replaced all references of the old resource type with the new resource type.
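For reference, those steps could also be scripted with the Az storage cmdlets – roughly like this sketch (the container and blob names mirror my backend-config values, the renamed resource types are from the upgrade guide, and the variables are assumed to already be set):

# Storage context built from the same values used for backend-config
$ctx = New-AzStorageContext -StorageAccountName $storage_account -StorageAccountKey $accountKey

# Download the current state file
Get-AzStorageBlobContent -Context $ctx -Container $containerName `
    -Blob "prod.terraform.tfstate" -Destination ".\prod.terraform.tfstate" -Force

# Swap the old resource type names for the new ones
(Get-Content ".\prod.terraform.tfstate" -Raw) `
    -replace "azurerm_recovery_services_protection_policy_vm", "azurerm_backup_policy_vm" `
    -replace "azurerm_recovery_services_protected_vm", "azurerm_backup_protected_vm" |
    Set-Content ".\prod.terraform.tfstate"

# Upload the modified state back over the original blob
Set-AzStorageBlobContent -Context $ctx -Container $containerName `
    -Blob "prod.terraform.tfstate" -File ".\prod.terraform.tfstate" -Force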

After replacing the text, I uploaded my modified file back to the blob container, and re-ran “terraform plan”.

This worked! The plan ran successfully, and showed no further changes required in my infrastructure.

Terraform dynamic blocks in resources (Azure Backup Retention Policy)

This post will give an example of using Terraform dynamic blocks within an Azure resource. Typically this is done when you need multiple instances of a nested block within a resource, for example multiple “http_listener” within an Azure Application Gateway.

Here I’m using an Azure Backup retention policy, which as a Terraform resource (note I’m using the resource names from prior to the AzureRM 2.0 provider, which are no longer supported) can have blocks for multiple different retention levels. In this case we only want one of each type of nested block (retention_daily, retention_weekly, etc.); however, there are cases where we want zero instances of a block.
I’m sure there would be a way to simplify this code with a list(map(string)) variable, so that the individual nested blocks I’ve identified here aren’t necessary; however, I haven’t yet spent the time to make that simplification.

A single-file example of this code can be found on my GitHub repo here.

 

First, create a Map variable containing the desired policy values:

variable "default_rsv_retention_policy" {
  type = map(string)
  default = {
      retention_daily_count = 14
      retention_weekly_count = 0
      #retention_weekly_weekdays = ["Sunday"]
      retention_monthly_count = 0
      #retention_monthly_weekdays = ["Sunday"]
      #retention_monthly_weeks = [] #["First", "Last"]
      retention_yearly_count = 0
      #retention_yearly_weekdays = ["Sunday"]
      #retention_yearly_weeks = [] #["First", "Last"]
      #retention_yearly_months = [] #["January"]
    }
}

In the example above, this policy will retain 14 daily backups, and nothing else.

If you wanted a more typical grandfather-father-son scenario, you could retain weekly backups taken every Sunday for 52 weeks, and monthly backups for 36 months taken on the first Sunday of each month:

variable "default_rsv_retention_policy" {
  type = map(string)
  default = {
      retention_daily_count = 14
      retention_weekly_count = 52
      retention_weekly_weekdays = ["Sunday"]
      retention_monthly_count = 36
      retention_monthly_weekdays = ["Sunday"]
      retention_monthly_weeks = ["First"] #["First", "Last"]
      retention_yearly_count = 0
      #retention_yearly_weekdays = ["Sunday"]
      #retention_yearly_weeks = [] #["First", "Last"]
      #retention_yearly_months = [] #["January"]
    }
}

Now we’re going to use this map variable within dynamic blocks. Check the GitHub link above for the full file, as I’m only going to insert the dynamic blocks here for explanation. Each of these blocks would go inside the “azurerm_backup_policy_vm” resource.

 

First up is the daily retention – in my case, I’m assuming there will ALWAYS be a daily value, and thus I use the variable directly.

#Assume we will always have daily retention
  retention_daily {
    count = var.default_rsv_retention_policy["retention_daily_count"]
  }

Next we will add in the “retention_weekly” block:

dynamic "retention_weekly" {
    for_each = var.default_rsv_retention_policy["retention_weekly_count"] > 0 ? [1] : []
    content {
      count  = var.default_rsv_retention_policy["retention_weekly_count"]
      weekdays = var.default_rsv_retention_policy["retention_weekly_weekdays"]
    }
  }

The “for_each” is using conditional logic, in this format: condition ? true_val : false_val
It is evaluated as: if the “retention_weekly_count” value of our map variable is greater than zero, then provide one content block, else provide zero content blocks.

From our first example of the variable I provided, since the weekly count is 0, this content block would then not appear in our terraform resource.

In the second example, we did provide a value greater than zero (52) and we also specified a weekday. This is what gets inserted into the content block as the property names and values.

 

Similarly, the monthly retention would look like this:

dynamic "retention_monthly" {
    for_each = var.default_rsv_retention_policy["retention_monthly_count"] > 0 ? [1] : []
    content {
      count  = var.default_rsv_retention_policy["retention_monthly_count"]
      weekdays = var.default_rsv_retention_policy["retention_monthly_weekdays"]
      weeks    = var.default_rsv_retention_policy["retention_monthly_weeks"]
    }
  }

Because of the nature of this Azure resource, there is an additional content property named “weeks”, which signifies the week of the month to take the backup in (we said “First”).
Using dynamic blocks is an effective way to parameterize your Terraform code – even in the Azure Application Gateway example I gave earlier, that resource has many other nested blocks this would be useful for, like “disabled_rule_group” within the WAF configuration, “backend_http_settings”, or “probe”. I would expect the same to be true for other Azure load balancers or Front Door configurations.