VM Images – Immutable builds and Azure resources

I’ve begun poking around with custom Azure managed images, using Packer as an on-premises build source. The goal is to use the same image within an on-premises test environment and in an Azure production environment, and take a small step towards immutable infrastructure.

There are lots of interesting questions around this topic that may have unique answers in different environments. Here are a few thoughts on what I see where I am right now:

  • Should I use Packer to build a traditional Hyper-V image, convert it to VHD and upload it to Azure, or directly use the Packer builder for Azure?
    • Because I have the on-prem resources, am building upon a pre-existing framework where local-only images are built, and need to build images for both Hyper-V and Azure, I decided to keep it consistent rather than split the build chain off to a whole new builder.
  • Should I just use the Azure Image Builder to streamline the process?
    • Maybe eventually – again, I wanted to incrementally build upon the successes of existing on-premises deployments. The Image Builder service is very intriguing, and would be a next logical step once it leaves Preview.
  • Why do images need to be built for on-premises in the first place? Why not native Azure resources?
    • While deploying into Azure would offer a lot more flexibility and scale, there are still many reasons to maintain a local presence for resources. It mostly boils down to finances: CapEx investments have already been made on-premises, whereas Azure would bring new OpEx costs, and the systems for constraining size, resource lifetime, and ultimately cost overruns are not yet in place.
  • Why build images, and not do configuration management after deploy?
    • I go back and forth on this question frequently, and I have seen much conversation about it. Like most of IT, “it depends”. Using customized images that are version controlled provides the infrastructure ability to shift-left, and ensure quality is in the build repeatably and consistently. What if you’re doing post-deployment config management, and DNS isn’t available, or a service crashes halfway through, or any number of other things that can go wrong? Now there is a delay in the availability of that resource you’ve deployed, effort consumed to resolve the problem, and a lack of confidence in its quality.
    • Immutable builds do not natively solve the problem of configuration drift post-deployment, and this is one of the big gaps that I see trying to take traditional IaaS and fit it into a more modern profile. The ‘answer’ is to monitor drift and re-build from source (cattle not pets) when it is detected, but not everyone is working with modern micro-services running in containers orchestrated centrally to achieve this. Instead, there may be an intermediary step where immutable image builds are used, along with configuration management post-deployment to watch for drift.

 

Once I got to a place where an image is ready, I began poring over the Microsoft Docs on managed images and Shared Image Gallery, prior to testing.

I intended on a distribution flow something like this:

Packer drops VHD in Blob storage -> Create Managed Image -> Use Shared Image Gallery definition -> Create Image Version

The documentation left me with a few unanswered questions, which I’ve outlined here:

  • What if I remove the original blob, can I still use the image? Yes, you can continue to deploy the managed image without the source blob
  • What if the blob gets replaced, does it update the image? No, any future deployments of the image will continue to be delivered as when the image was created
  • What if I remove the source managed image, can I still use the Shared Image Gallery definition version?
    • According to the guidance from Microsoft: Yes, but if you plan on adding replica regions, do not delete the source managed image. The source managed image is needed for replicating the image version to additional regions.
  • What if I update the source managed image, does it update the Shared Image Gallery definition version?  Mostly no, similar to the blob-to-image relationship, if you update the source managed image, the version in a SIG definition doesn’t update. What I need to test is what happens if you replace the source managed image, and then replicate an image definition version to a new region – will it contain the updates in the image?

Here are a few other important design discoveries I’ve made along the way:

  • A VM from an Azure Managed Image can only be deployed within the same Region and Subscription as the Image (i.e. if you want to re-use the image across multiple regions or subscriptions, you’ll have to create additional images to suit)
  • A VM from a Shared Image Gallery definition CAN be deployed outside a subscription, from any region it is replicated, as long as the authentication mechanism performing the VM deployment has RBAC over the Shared Image Gallery resource
  • Microsoft says “as a best practice, we encourage you to keep the resource group, shared image gallery, image definition, and image version in the same location.” I can confirm that if the resource group you place the Shared Image Gallery in is in a different Location than the SIG itself or the image definition, there are no barriers to creating those resources or a VM from them
  • Terraform AzureRM provider support (as of today at least) has limitations in managing Shared Image components:
    • You cannot set properties for VM Generation on the image definitions
    • Resource removal does not respect the dependencies between an image version, definition, and gallery

 

At the end of the day, I’ve come up with the following flow which builds and delivers my images:

Scott Hanselman Podcast Episode 719

I recently listened to and really enjoyed this podcast episode from Scott Hanselman: “Myself. It’s not weird at all”.

In it, he touches on such a wide range of concepts: discipline and consistency, doing what you love, “Yes, and” instead of “no” as a redirect. He talks about deliberate practice, imposter syndrome, mindfulness, commitment to what is important, and making time.

Some choice quotes that really resonated with me:

  • “Most people live their lives accepting the defaults. If you are deliberate about installing something, you hit custom.”
    • Curiosity has been a major component to my learning, both professionally and personally. I see an “advanced” window on an installer wizard, and I can’t help but click it just to find out what’s hidden there. I feel the same way outside of work too – be curious about why I think the way I think, believe what I believe; be curious about how things work in a physical sense and a conceptual sense.
  • “By simply being mindful, you cannot stop but improving”
    • This thought gives me encouragement that even small intentional actions can have a positive impact. Even if I’m not feeling particularly capable, even if I’m not crushing a project or task, even if I haven’t been dedicated to deep learning, I can still be on an upward trend by being aware, considerate, and mindful of my actions and those around me.
  • “Yes there are a thousand little things where someone can make me question who I am, and am I good; but then I go back and look at what I do, and what I have made and say “no, I did that and I’m going to give myself credit”.”
    • Oftentimes for me, this is as much about giving myself permission to be proud of what I have accomplished and to reflect on it a little, rather than discounting my efforts for a variety of reasons that my mind is surely able to conjure up.
  • “Why do I feel like an imposter talking to this person? The reason is because of the road not taken; This person did something I did not do, and I’m intimidated because they stuck with it when I did not.”
    • I frequently question and doubt some of my decisions, particularly when I see people have success on alternate paths. I need to remind myself that their success in no way invalidates my path – if I am confident in who I am and what I do, I can celebrate their success rather than wonder what could have been.

Highly recommended listen.

Job I want to have

I was looking through my Google Drive recently, doing some cleanup and pruning. I came across a document I had created in June 2016, called “Job I want to have”.

I don’t remember creating this document at all. Its contents are a job posting for an “Infrastructure Technology Analyst”, without any kind of reference to the original company.

Here’s a snippet of what it looked like:

In June 2016 I was feeling stagnant; lack of motivation, lack of direction. I looked at this posting and thought that it was a huge stretch, and that it may be so difficult to actually achieve enough skill to be able to fill a position like this.

Now I’m reflecting on this, and realize that I have this job – I do all of these things right now, and it didn’t take a monumental effort. It wasn’t hours and hours of study time, or money for certifications and courses. I’m not saying I didn’t have to work hard to learn, or that it was random chance that put me here. It was certainly time spent learning, but by doing; by embracing the challenges as I faced them and learning how to solve them with the focus of a goal in mind.

What it really required was for me to step outside of where I was comfortable, embrace the fear of uncertainty, and try. Try something new and something different; try a chance that the grass could actually be greener.

I’m glad I came across this because I needed a refresh in my mind of what my goal was and understanding that I have achieved it. I needed a reminder that the core of what I’m doing now is still fun and drives me to have the kind of career I want to have.

Perhaps it’s nearing time to set my sights on something a little scary again.

Azure NSG discovery

While deploying some resources into an Azure virtual network whose subnets have network security groups (NSG) applied, I discovered something I didn’t previously know. It makes sense in the context of how Azure applies NSG rules, but it doesn’t align with a traditional understanding of firewall ACLs across a subnet.

Communication within subnet

If you apply a Deny rule with a lower priority number than the default 65000 “Allow Vnet inbound” rule (lower numbers are evaluated first, so the Deny takes precedence), it will also deny resources within that subnet from communicating with each other.

I discovered this while applying a “Deny inbound” rule in order to restrict lateral movement between subnets, not intending to restrict traffic within a subnet.

For example, I have a “management” subnet, with an NSG applied. Inside this subnet is an AD domain controller, and a member server. I apply a Deny rule for any source, after my “allow incoming” rules have been applied to let other subnets talk to this domain controller.

Now I find that my domain controller cannot reach my member server, despite it residing within the same subnet.

While I do not want to allow service tag “VirtualNetwork” incoming access (again, to restrict lateral movement), I do want “everything inside this subnet can talk to everything inside this subnet”. As such I had to create a specific rule for this behavior.
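To illustrate the behavior, here is a minimal sketch in Python that simulates first-match evaluation by priority number. It is not Azure code: it only matches on source address (real NSG rules also consider destination, port, and protocol), and the address ranges, priorities, and rule names other than the default 65000 rule are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    priority: int  # lower number is evaluated first
    name: str
    source: str    # CIDR prefix; "0.0.0.0/0" stands in for "Any"
    action: str    # "Allow" or "Deny"


def evaluate(rules, source_ip):
    """Return (action, rule name) for the first rule that matches source_ip."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if ip_address(source_ip) in ip_network(rule.source):
            return rule.action, rule.name
    return "Deny", "no-match"


# Hypothetical address plan for the example.
MGMT_SUBNET = "10.0.1.0/24"  # the "management" subnet
APP_SUBNET = "10.0.2.0/24"   # another subnet allowed to reach management
VNET = "10.0.0.0/16"         # stand-in for the VirtualNetwork service tag

rules = [
    Rule(100, "allow-app-subnet", APP_SUBNET, "Allow"),
    Rule(4000, "deny-all-inbound", "0.0.0.0/0", "Deny"),
    Rule(65000, "AllowVnetInBound", VNET, "Allow"),
]

domain_controller = "10.0.1.10"  # source inside the management subnet

# The Deny rule matches before the default 65000 rule is ever reached.
before = evaluate(rules, domain_controller)

# Adding an explicit intra-subnet Allow ahead of the Deny restores traffic.
fixed = rules + [Rule(200, "allow-intra-subnet", MGMT_SUBNET, "Allow")]
after = evaluate(fixed, domain_controller)

print(before)
print(after)
```

The explicit rule uses the subnet's own prefix as the source rather than the “VirtualNetwork” tag, so other subnets still fall through to the Deny.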

Azure IaaS Deny outbound considerations

As a general practice, outbound Internet access should be denied except for approved destinations. This is referenced in NIST 800-41 as a “deny by default” posture.

Achieving this within Azure Infrastructure as a Service in a practical and economical way without breaking a large amount of services is quite difficult at the moment.

If Outbound Internet is fully denied, some of the commonly used services of Azure will cease to work:

  • Azure Backup
  • Log Analytics
  • Azure State Configuration (DSC)
  • Azure Update Management
  • Azure Security Center
  • Windows Update

Some of these are not as difficult to solve – Service Tags on NSG rules can allow Azure services where they have been defined by Microsoft. As of Ignite 2018 in late September, there are new service tags covering entire regions, or all of Azure (“AzureCloud”). This means you can allow most of those services above to function and still deny general Internet outbound.

Additional Service Tags for Windows Update, or custom definitions are supposed to be coming in the future, but this doesn’t fully resolve the problem.

What if your application has a GIS component, and it needs to reach *.arcgisonline.com? What if your users have a legitimate reason to access a particular website? It isn’t good enough to just resolve that IP address one time and add it to an NSG.

What is really needed is a method to allow access to a fully qualified domain name (FQDN), particularly with wildcard support.

Here are some possible solutions:

Implement a 3rd Party network virtual appliance (NVA)

This is the most common response that I see recommended to the outbound problem. Unfortunately, it is really expensive, and overkill if you’re only addressing this one particular problem. One has to consider high availability of the resources, as well as management of them, since you’re just adding more IaaS into your environment, which is what we’re all trying to get away from when we’re using the cloud, isn’t it?

Some vendors may not support wildcard FQDN in their ACLs (Barracuda CloudGen last I checked), which means you can’t support things like Windows Update where no published IP list exists.

If the implementation is anything like SonicWALL’s method, it will have difficulty being reliable – this relies upon the SonicWALL using the same DNS server as the client (calling it ‘sanctioned’) which may or may not be true in your Azure environment with the use of Azure DNS or external providers.

Implement Azure Firewall

Azure Firewall is new on the scene and released to General Availability as of late September 2018. It supports the use of FQDN references in application rules, and while I haven’t personally tested it, the example deployment template is shown to allow an outbound rule to *microsoft.com.

Confusingly, their documentation states that FQDN tags can’t be custom created, but I believe this just references groups of FQDN, not individual items.

Azure Firewall solves the problem of deploying more IaaS, and it’s natively highly available. However, it again isn’t cheap: at $1.25/hour USD, it is a high price to pay for just this one feature.

Wait until FQDN support exists in an NSG rule

It has been noted on the Microsoft feedback site that FQDN support in NSG rules is a roadmap item, but since this hasn’t received the “Planned” designation yet, I expect it is very far down the roadmap, particularly considering this feature is available in the Azure Firewall.

Build something custom – Azure Function or runbook which resolves DNS and adds it to your NSG

I’ve toyed with the idea of building a custom Azure Function or Automation runbook which can resolve a record and add it to an NSG. I’ll have a post on the Function side of this coming soon that describes how it would work, and the limitations that made me discard the idea.

Realistically, this isn’t a long-term viable solution as it doesn’t solve the wildcard problem.
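As a rough illustration of the idea (not the Function I built), the sketch below resolves a name with Python's standard `socket.getaddrinfo` and turns the result into a rule description. The dictionary layout, the rule naming, and the `build_outbound_rule` helper are assumptions made for the example, not an Azure API, and a stub resolver with documentation-range addresses is used so the example runs without network access.

```python
import socket


def resolve_ipv4(fqdn, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses for fqdn."""
    infos = resolver(fqdn, 443, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def build_outbound_rule(fqdn, priority, resolver=socket.getaddrinfo):
    """Describe an outbound Allow rule for the addresses behind fqdn.

    The returned dict is a hypothetical shape for illustration only.
    """
    if "*" in fqdn:
        # DNS cannot enumerate every host under a wildcard, so there is
        # nothing to resolve: this is the limitation described above.
        raise ValueError("wildcard names cannot be resolved: " + fqdn)
    return {
        "name": "allow-out-" + fqdn.replace(".", "-"),
        "priority": priority,
        "direction": "Outbound",
        "access": "Allow",
        "destination_port_range": "443",
        "destination_address_prefixes": resolve_ipv4(fqdn, resolver),
    }


def stub_resolver(host, port, family, socktype):
    """Stand-in for socket.getaddrinfo returning fixed example addresses."""
    addresses = ["203.0.113.5", "203.0.113.10", "203.0.113.5"]
    return [(family, socktype, 6, "", (addr, port)) for addr in addresses]


rule = build_outbound_rule("example.com", 300, resolver=stub_resolver)
print(rule["name"], rule["destination_address_prefixes"])
```

Even for non-wildcard names, the resolved addresses are only a snapshot, so something like this would need to re-run on a schedule and reconcile the rule each time.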

Utilize an outbound transparent proxy server

This method involves trusting some other source to proxy your outbound traffic and depending on that source, gives a large amount of flexibility to achieve the outbound denial without breaking your services.

This could be an IaaS resource running Squid or WinGate (a product I’m currently testing for this purpose), or it could be an external 3rd party service like zScaler which specializes in access control of this nature.

To make this work, your proxy must be able to be identified by some kind of static IP to allow it through the NSG, but after that the whitelisting could happen within the proxy service itself.
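The kind of check a proxy would perform can be sketched in a few lines of Python using the standard `fnmatch` module. This is only an illustration of wildcard FQDN matching, not the configuration syntax of Squid, WinGate, or zScaler; the allowlist entries other than `*.arcgisonline.com` and the `is_allowed` helper are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical allowlist; only *.arcgisonline.com comes from the text above.
ALLOWED_PATTERNS = [
    "*.arcgisonline.com",
    "www.example.com",
]


def is_allowed(host, patterns=ALLOWED_PATTERNS):
    """Return True if host matches any allowlisted FQDN pattern."""
    # Normalise case and drop a trailing root dot before matching.
    host = host.lower().rstrip(".")
    return any(fnmatchcase(host, pattern.lower()) for pattern in patterns)


for candidate in (
    "services.arcgisonline.com",
    "evilarcgisonline.com",
    "arcgisonline.com.example.net",
):
    print(candidate, is_allowed(candidate))
```

Because the pattern includes the leading dot, a look-alike domain that merely ends in the same characters does not match, and neither does the bare domain itself unless it is listed separately.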

I see this as the most viable method of solving the problem until either FQDN support exists for NSG, or Azure Firewall pricing comes down with competition from 3rd party vendors.