
Using Terraform for implementing Azure VM Disaster Recovery


Written by Freddy Ayala

Azure offers an end-to-end backup and disaster recovery solution that’s simple, secure, scalable, and cost-effective—and can be integrated with on-premises data protection solutions. 

In this article we will walk through an example of implementing Virtual Machine disaster recovery with Azure Site Recovery VM replication, using Terraform.

Site Recovery helps ensure business continuity by keeping business apps and workloads running during outages. Site Recovery replicates workloads running on physical and virtual machines (VMs) from a primary site to a secondary location. When an outage occurs at your primary site, you fail over to the secondary location and access apps from there. After the primary location is running again, you can fail back to it.

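Our use case is to replicate a VM with Site Recovery to a different region, so that in case of a failure we can fail over to a secondary VM and keep serving the application under the same name. Keep in mind that Azure public IP addresses are regional resources: the primary public IP cannot be attached to a VM in another region, so the failed-over VM needs its own public IP in the secondary region, with DNS (for example Azure Traffic Manager) providing a stable FQDN across both.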

The advantage of Site Recovery is that the secondary VM is not running (it is only created at failover), so we do not pay for its compute resources, only for the replicated storage, the traffic to the secondary region, and the Site Recovery per-instance fee.

This example assumes that a VNet and a subnet already exist in the primary region.
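If you still need to create them, a minimal sketch of that prerequisite network could look like the following. The names `vnet-app-primary` and `snet-backend` and the CIDR ranges are assumptions; whatever you choose must match `var.vm.vnet_name` and `var.vm.snet_name`, which the VM code reads through the `azurerm_subnet` data source. The sketch uses `address_prefixes`, the list form accepted by current azurerm releases; older 2.x releases also accept the singular `address_prefix` used later in this article.

```hcl
# Hypothetical prerequisite network in the primary region.
# Names and CIDR ranges are placeholders: align them with var.vm.vnet_name / var.vm.snet_name.
resource "azurerm_virtual_network" "primary" {
  name                = "vnet-app-primary"
  location            = var.location
  resource_group_name = var.resource_group
  address_space       = ["10.0.0.0/16"]
}

resource "azurerm_subnet" "primary_backend" {
  name                 = "snet-backend"
  resource_group_name  = var.resource_group
  virtual_network_name = azurerm_virtual_network.primary.name
  address_prefixes     = ["10.0.1.0/24"]
}
```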

First, we deploy the virtual machine with a public IP, a network interface, and a managed data disk:


data "azurerm_subnet" "snet-backend" {
  depends_on           = [var.subnets]
  name                 = var.vm.snet_name
  virtual_network_name = var.vm.vnet_name
  resource_group_name  = var.resource_group
}

resource "azurerm_public_ip" "pip-vm-app" {
  name                    = "pip-app"
  location                = var.location
  resource_group_name     = var.resource_group
  allocation_method       = "Static"
  idle_timeout_in_minutes = 30
  domain_name_label       = var.vm.fqdn

  tags = {
    environment = "test"
  }
}

resource "azurerm_network_interface" "main" {
  name                = var.vm.nic_name
  location            = var.location
  resource_group_name = var.resource_group

  ip_configuration {
    name                          = "testconfiguration1"
    subnet_id                     = data.azurerm_subnet.snet-backend.id
    private_ip_address_allocation = "Dynamic"
    public_ip_address_id          = azurerm_public_ip.pip-vm-app.id
  }
}

resource "azurerm_virtual_machine" "main" {
  name                  = var.vm.name
  location              = var.location
  resource_group_name   = var.resource_group
  network_interface_ids = [azurerm_network_interface.main.id]
  vm_size               = var.vm.size

  # Uncomment this line to delete the OS disk automatically when deleting the VM
  # delete_os_disk_on_termination = true

  # Uncomment this line to delete the data disks automatically when deleting the VM
  # delete_data_disks_on_termination = true

  storage_image_reference {
    publisher = var.vm.storage_image_reference.publisher
    offer     = var.vm.storage_image_reference.offer
    sku       = var.vm.storage_image_reference.sku
    version   = var.vm.storage_image_reference.version
  }

  storage_os_disk {
    name              = "disk-${var.vm.name}-os"
    caching           = var.vm.storage_os_disk.caching
    create_option     = var.vm.storage_os_disk.create_option
    managed_disk_type = var.vm.storage_os_disk.managed_disk_type
  }

  os_profile {
    computer_name  = var.vm.os_profile.computer_name
    admin_username = var.vm.os_profile.admin_username
    admin_password = var.vm.os_profile.admin_password
    #custom_data    = file(var.vm.os_profile.custom_data)
  }

  os_profile_linux_config {
    disable_password_authentication = false
  }

  boot_diagnostics {
    enabled     = true
    storage_uri = "https://${var.storage_account.name}.blob.core.windows.net"
  }
  tags = var.tags
}


resource "azurerm_managed_disk" "disk-data-app" {
  name                 = "disk-${azurerm_virtual_machine.main.name}-data"
  location             = var.location
  resource_group_name  = var.resource_group
  storage_account_type = "StandardSSD_LRS"
  create_option        = "Empty"
  disk_size_gb         = var.vm.storage_data_disk.disk_size_gb
}

resource "azurerm_virtual_machine_data_disk_attachment" "example" {
  managed_disk_id    = azurerm_managed_disk.disk-data-app.id
  virtual_machine_id = azurerm_virtual_machine.main.id
  lun                = var.vm.storage_data_disk.lun
  caching            = "ReadWrite"
}
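The VM code reads many attributes from a single `var.vm` object. The article does not show its declaration, so here is a hedged sketch inferred only from the attributes referenced above; the exact types are assumptions, and you can skip it if you already declare this variable.

```hcl
# Hypothetical declaration of the "vm" input variable, inferred from the
# attributes the VM code reads (var.vm.name, var.vm.size, var.vm.os_profile.*, ...).
variable "vm" {
  type = object({
    name      = string
    size      = string
    fqdn      = string # used as the public IP domain_name_label, not a full FQDN
    nic_name  = string
    vnet_name = string
    snet_name = string
    storage_image_reference = object({
      publisher = string
      offer     = string
      sku       = string
      version   = string
    })
    storage_os_disk = object({
      caching           = string
      create_option     = string # e.g. "FromImage" when building from a marketplace image
      managed_disk_type = string # e.g. "StandardSSD_LRS"
    })
    storage_data_disk = object({
      disk_size_gb = number
      lun          = number
    })
    os_profile = object({
      computer_name  = string
      admin_username = string
      admin_password = string
    })
  })
}
```

Since `admin_password` travels inside this object, consider moving it into its own variable marked `sensitive = true` so it is redacted from plan output.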

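Once the VM is deployed, we deploy a Recovery Services vault to use the Site Recovery service. The vault is created in the secondary region, since Site Recovery requires it to be outside the source region: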


data "azurerm_resource_group" "secondary" {
  name = var.resource_group_secondary
}

data "azurerm_resource_group" "primary" {
  name = var.resource_group
}


resource "azurerm_recovery_services_vault" "vault" {
  name                = "rv-app-${var.region_secondary}-${var.environment}"
  location            = var.location_secondary
  resource_group_name = data.azurerm_resource_group.secondary.name
  sku                 = "Standard"
}
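The vault and every Site Recovery object that follows live in the secondary resource group, which the `data "azurerm_resource_group" "secondary"` lookup expects to exist already. If it does not, one option (a sketch, not part of the original setup) is to let Terraform create it:

```hcl
# Hypothetical alternative to the data source: create the secondary
# resource group in the secondary region instead of reading an existing one.
resource "azurerm_resource_group" "secondary" {
  name     = var.resource_group_secondary
  location = var.location_secondary
}
```

If you go this way, replace the data source and change references such as `data.azurerm_resource_group.secondary.name` and `.id` to `azurerm_resource_group.secondary.name` and `.id`.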

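Next, we deploy a Site Recovery fabric and a protection container for each region. A fabric represents an Azure region inside the vault, and its protection container holds the protected items: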


resource "azurerm_site_recovery_fabric" "primary" {
  name                = "primary-fabric"
  resource_group_name = data.azurerm_resource_group.secondary.name
  recovery_vault_name = azurerm_recovery_services_vault.vault.name
  location            = var.location # must be the region of the source VM
}

resource "azurerm_site_recovery_fabric" "secondary" {
  name                = "secondary-fabric"
  resource_group_name = data.azurerm_resource_group.secondary.name
  recovery_vault_name = azurerm_recovery_services_vault.vault.name
  location            = var.location_secondary
}

resource "azurerm_site_recovery_protection_container" "primary" {
  name                 = "primary-protection-container"
  resource_group_name  = data.azurerm_resource_group.secondary.name
  recovery_vault_name  = azurerm_recovery_services_vault.vault.name
  recovery_fabric_name = azurerm_site_recovery_fabric.primary.name
}

resource "azurerm_site_recovery_protection_container" "secondary" {
  name                 = "secondary-protection-container"
  resource_group_name  = data.azurerm_resource_group.secondary.name
  recovery_vault_name  = azurerm_recovery_services_vault.vault.name
  recovery_fabric_name = azurerm_site_recovery_fabric.secondary.name
}

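We then define a replication policy that keeps recovery points for 24 hours and takes an application-consistent snapshot every 4 hours (both arguments are expressed in minutes):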

resource "azurerm_site_recovery_replication_policy" "policy" {
  name                                                 = "policy"
  resource_group_name                                  = data.azurerm_resource_group.secondary.name
  recovery_vault_name                                  = azurerm_recovery_services_vault.vault.name
  recovery_point_retention_in_minutes                  = 24 * 60
  application_consistent_snapshot_frequency_in_minutes = 4 * 60
}
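To make the units harder to get wrong, the same policy can be written with hour-based locals. This is a drop-in replacement for the block above, not an addition (it keeps the `policy` resource address); the local names are hypothetical.

```hcl
# Hypothetical locals expressing the policy in hours; the provider arguments take minutes.
locals {
  recovery_point_retention_hours = 24 # keep recovery points for one day
  app_consistent_snapshot_hours  = 4  # one app-consistent snapshot every 4 hours
}

resource "azurerm_site_recovery_replication_policy" "policy" {
  name                                                 = "policy"
  resource_group_name                                  = data.azurerm_resource_group.secondary.name
  recovery_vault_name                                  = azurerm_recovery_services_vault.vault.name
  recovery_point_retention_in_minutes                  = local.recovery_point_retention_hours * 60 # 1440
  application_consistent_snapshot_frequency_in_minutes = local.app_consistent_snapshot_hours * 60  # 240
}
```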

We then map the source (primary) protection container to the target (secondary) one, using this replication policy:

resource "azurerm_site_recovery_protection_container_mapping" "container-mapping" {
  name                                      = "container-mapping"
  resource_group_name                       = data.azurerm_resource_group.secondary.name
  recovery_vault_name                       = azurerm_recovery_services_vault.vault.name
  recovery_fabric_name                      = azurerm_site_recovery_fabric.primary.name
  recovery_source_protection_container_name = azurerm_site_recovery_protection_container.primary.name
  recovery_target_protection_container_id   = azurerm_site_recovery_protection_container.secondary.id
  recovery_replication_policy_id            = azurerm_site_recovery_replication_policy.policy.id
}
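This mapping only covers the primary-to-secondary direction. After a failover, re-protecting the VM and failing back needs the opposite direction as well. A hedged sketch of declaring that reverse mapping up front is shown below; the resource name is hypothetical, and the re-protect and failback operations themselves are still run from the portal, Azure CLI, or PowerShell, outside Terraform's state.

```hcl
# Hypothetical reverse mapping (secondary -> primary) to prepare re-protection and failback.
resource "azurerm_site_recovery_protection_container_mapping" "container-mapping-reverse" {
  name                                      = "container-mapping-reverse"
  resource_group_name                       = data.azurerm_resource_group.secondary.name
  recovery_vault_name                       = azurerm_recovery_services_vault.vault.name
  recovery_fabric_name                      = azurerm_site_recovery_fabric.secondary.name
  recovery_source_protection_container_name = azurerm_site_recovery_protection_container.secondary.name
  recovery_target_protection_container_id   = azurerm_site_recovery_protection_container.primary.id
  recovery_replication_policy_id            = azurerm_site_recovery_replication_policy.policy.id
}
```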

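Now we deploy the remaining infrastructure: a cache (staging) storage account in the primary region, where Site Recovery stages changes from the source disks before sending them to the target region; a VNet and subnet in the secondary region, where our main virtual machine will be replicated; and a second VNet used to test the failover.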


resource "random_string" "lower" {
  length  = 4
  upper   = false
  lower   = true
  number  = true
  special = false
}

resource "azurerm_storage_account" "primary" {
  name                     = "prireccache${random_string.lower.result}"
  location                 = var.location
  resource_group_name      = var.resource_group
  account_tier             = "Standard"
  account_replication_type = "LRS"
}


resource "azurerm_virtual_network" "secondary" {
  name                = "vnet-app"
  resource_group_name = data.azurerm_resource_group.secondary.name
  address_space       = var.vnet.address_space
  location            = var.location_secondary
}

resource "azurerm_subnet" "secondary" {
  name                 = "snet-backend"
  resource_group_name  = data.azurerm_resource_group.secondary.name
  virtual_network_name = azurerm_virtual_network.secondary.name
  address_prefix       = var.vnet.subnets[0].address_prefix
}


resource "azurerm_virtual_network" "test-failover" {
  name                = "vnet-app-test-failover"
  resource_group_name = data.azurerm_resource_group.secondary.name
  address_space       = [var.vnet.address_space_failover_test[0]]
  location            = var.location_secondary
}

resource "azurerm_subnet" "test-failover" {
  name                 = var.vnet.subnets_failover_test[0].name
  resource_group_name  = data.azurerm_resource_group.secondary.name
  virtual_network_name = azurerm_virtual_network.test-failover.name
  address_prefix       = var.vnet.subnets_failover_test[0].address_prefix
}

resource "azurerm_network_interface" "vm" {
  name                = "vm-nic"
  location            = var.location_secondary
  resource_group_name = data.azurerm_resource_group.secondary.name

  ip_configuration {
    name                          = "nic-vm-app-01"
    subnet_id                     = azurerm_subnet.secondary.id
    private_ip_address_allocation = "Dynamic"
  }
}
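The provider's own example for `azurerm_site_recovery_replicated_vm` also declares a network mapping, which tells Site Recovery which secondary VNet corresponds to the primary one. A sketch adapted to this setup follows; the `data.azurerm_virtual_network.primary` lookup is an assumption that reuses `var.vm.vnet_name`.

```hcl
# Hypothetical lookup of the existing primary VNet (the one the source VM uses).
data "azurerm_virtual_network" "primary" {
  name                = var.vm.vnet_name
  resource_group_name = var.resource_group
}

# Maps the primary VNet to the secondary VNet for failed-over VMs.
resource "azurerm_site_recovery_network_mapping" "network-mapping" {
  name                        = "network-mapping"
  resource_group_name         = data.azurerm_resource_group.secondary.name
  recovery_vault_name         = azurerm_recovery_services_vault.vault.name
  source_recovery_fabric_name = azurerm_site_recovery_fabric.primary.name
  target_recovery_fabric_name = azurerm_site_recovery_fabric.secondary.name
  source_network_id           = data.azurerm_virtual_network.primary.id
  target_network_id           = azurerm_virtual_network.secondary.id
}
```

If you add it, also make the replicated VM depend on it (through `depends_on`) so replication is not enabled before the mapping exists.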

Finally, we deploy the replicated VM, which enables replication of the source VM and its OS and data disks into the secondary region:


resource "azurerm_site_recovery_replicated_vm" "vm-replication" {
  name                                      = "vm-replication"
  resource_group_name                       = data.azurerm_resource_group.secondary.name
  recovery_vault_name                       = azurerm_recovery_services_vault.vault.name
  source_recovery_fabric_name               = azurerm_site_recovery_fabric.primary.name
  source_vm_id                              = azurerm_virtual_machine.main.id
  recovery_replication_policy_id            = azurerm_site_recovery_replication_policy.policy.id
  source_recovery_protection_container_name = azurerm_site_recovery_protection_container.primary.name

  target_resource_group_id                = data.azurerm_resource_group.secondary.id
  target_recovery_fabric_id               = azurerm_site_recovery_fabric.secondary.id
  target_recovery_protection_container_id = azurerm_site_recovery_protection_container.secondary.id
  target_network_id                       = azurerm_virtual_network.secondary.id

  # OS disk, staged through the cache storage account in the primary region
  managed_disk {
    disk_id                    = azurerm_virtual_machine.main.storage_os_disk[0].managed_disk_id
    staging_storage_account_id = azurerm_storage_account.primary.id
    target_resource_group_id   = data.azurerm_resource_group.secondary.id
    target_disk_type           = var.vm.storage_os_disk.managed_disk_type
    target_replica_disk_type   = var.vm.storage_os_disk.managed_disk_type
  }

  # Data disk
  managed_disk {
    disk_id                    = azurerm_managed_disk.disk-data-app.id
    staging_storage_account_id = azurerm_storage_account.primary.id
    target_resource_group_id   = data.azurerm_resource_group.secondary.id
    target_disk_type           = "StandardSSD_LRS"
    target_replica_disk_type   = "StandardSSD_LRS"
  }

  network_interface {
    source_network_interface_id = azurerm_network_interface.main.id
    # Subnet of the secondary VNet that the failed-over NIC joins.
    # target_static_ip expects a private IP address string (e.g. "10.1.1.10"),
    # not a public IP resource ID, so it is omitted here.
    target_subnet_name = azurerm_subnet.secondary.name
  }

  # Replication can only be enabled once the container mapping exists
  # and the data disk is actually attached to the source VM.
  depends_on = [
    azurerm_site_recovery_protection_container_mapping.container-mapping,
    azurerm_virtual_machine_data_disk_attachment.example,
  ]
}
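Because public IPs are regional, the failed-over VM needs its own public IP in the secondary region. A hedged sketch is below; the resource and IP names are hypothetical. Note that `domain_name_label` produces a region-specific FQDN (`<label>.<region>.cloudapp.azure.com`), so a stable client-facing name still requires DNS on top, for example a Traffic Manager profile or a CNAME you control.

```hcl
# Hypothetical public IP in the secondary region for the failed-over VM.
# The primary "pip-app" cannot be reused here because public IPs are regional.
resource "azurerm_public_ip" "pip-vm-app-secondary" {
  name                = "pip-app-secondary"
  location            = var.location_secondary
  resource_group_name = data.azurerm_resource_group.secondary.name
  allocation_method   = "Static"
  domain_name_label   = var.vm.fqdn # resulting FQDN contains the secondary region name
}
```

Recent azurerm releases accept a `recovery_public_ip_address_id` argument in the `network_interface` block of `azurerm_site_recovery_replicated_vm` to attach such an IP on failover; check the documentation for your provider version before relying on it.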

Once the resources are deployed, we can check the replication status in the Recovery Services vault in the Azure Portal and run a test failover.

And voilà! Your VM is now replicated to a second region. In the case of a regional outage you can perform a manual failover to it, and this failover can be automated using Azure Monitor and an Automation Account.
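The configuration above protects a single VM. One way to protect several (a sketch, not the author's method) is to drive `azurerm_site_recovery_replicated_vm` with `for_each` over a map of VMs and generate one `managed_disk` block per data disk with a `dynamic` block. The `replicated_vms` variable and its fields are hypothetical; the vault, fabrics, containers, policy, cache account, and secondary network are reused from above.

```hcl
# Hypothetical input: one entry per VM to protect, keyed by a short name.
variable "replicated_vms" {
  type = map(object({
    vm_id          = string
    nic_id         = string
    os_disk_id     = string
    os_disk_type   = string
    data_disk_ids  = list(string)
    data_disk_type = string
  }))
}

resource "azurerm_site_recovery_replicated_vm" "multi" {
  for_each = var.replicated_vms

  name                                      = "vm-replication-${each.key}"
  resource_group_name                       = data.azurerm_resource_group.secondary.name
  recovery_vault_name                       = azurerm_recovery_services_vault.vault.name
  source_recovery_fabric_name               = azurerm_site_recovery_fabric.primary.name
  source_vm_id                              = each.value.vm_id
  recovery_replication_policy_id            = azurerm_site_recovery_replication_policy.policy.id
  source_recovery_protection_container_name = azurerm_site_recovery_protection_container.primary.name

  target_resource_group_id                = data.azurerm_resource_group.secondary.id
  target_recovery_fabric_id               = azurerm_site_recovery_fabric.secondary.id
  target_recovery_protection_container_id = azurerm_site_recovery_protection_container.secondary.id
  target_network_id                       = azurerm_virtual_network.secondary.id

  # OS disk of this VM
  managed_disk {
    disk_id                    = each.value.os_disk_id
    staging_storage_account_id = azurerm_storage_account.primary.id
    target_resource_group_id   = data.azurerm_resource_group.secondary.id
    target_disk_type           = each.value.os_disk_type
    target_replica_disk_type   = each.value.os_disk_type
  }

  # One managed_disk block per data disk of this VM
  dynamic "managed_disk" {
    for_each = each.value.data_disk_ids
    content {
      disk_id                    = managed_disk.value
      staging_storage_account_id = azurerm_storage_account.primary.id
      target_resource_group_id   = data.azurerm_resource_group.secondary.id
      target_disk_type           = each.value.data_disk_type
      target_replica_disk_type   = each.value.data_disk_type
    }
  }

  network_interface {
    source_network_interface_id = each.value.nic_id
    target_subnet_name          = azurerm_subnet.secondary.name
  }

  depends_on = [azurerm_site_recovery_protection_container_mapping.container-mapping]
}
```

Keeping one map entry per VM means adding or removing a protected VM is a change to the variable only, without copying the resource block.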

Happy Terraforming!

4 Replies to “Using Terraform for implementing Azure VM Disaster Recovery”

  1. Regarding using the same public IP for ASR – Have you tested this? My understanding is that Azure public IPs are regionally tied, and cannot be moved to another region (or associated with resources in another region), and that you’d need to define a new PIP in the DR region. An Azure Traffic Manager can be used to make this pretty seamless.

  2. How can we configure DR for multiple VMs? The code above is for a single VM.

  3. After a failover would terraform still run or have an issue with the state? How can changes be done to the DR environment after the failover (e.g. let’s say after failover I see that in DR I need a larger machine)?
    And how can I reverse ASR for the failback through terraform without messing up the state?
