Backup and restore cluster

How to back up and restore your cluster

1 - Backup cluster

How to back up your EKS Anywhere cluster

We strongly advise performing regular backups of all your EKS Anywhere clusters. This ensures that you always have an up-to-date cluster state available for restoration in case a cluster experiences issues or becomes unrecoverable. This document outlines the steps for creating the two types of backups required for the EKS Anywhere cluster restore process.

Etcd backup

For optimal cluster maintenance, it is crucial to perform regular etcd backups on all your EKS Anywhere management and workload clusters. Always take an etcd backup before performing an upgrade so it can be used to restore the cluster to a previous state in the event of a cluster upgrade failure. To create an etcd backup for your cluster, follow the guidelines provided in the External etcd backup and restore section.

Cluster API backup

Because cluster failures most often follow unsuccessful cluster upgrades, EKS Anywhere automatically backs up the Cluster API objects before each management or workload cluster upgrade. For a management cluster, it captures the state of both the management cluster and its workload clusters if all the clusters are in the ready state. If one of the workload clusters is not ready, EKS Anywhere makes a best effort to back up the management cluster itself. For a workload cluster, it captures the state of the workload cluster’s Cluster API objects. These backups are stored in the management cluster folder on the Admin machine, where the upgrade command was initiated. For example, after executing a cluster upgrade command on mgmt-cluster, a backup folder is generated with the naming convention mgmt-cluster-backup-${timestamp}:

├── mgmt-cluster-backup-2023-10-11T02_55_56 <------ Folder with a backup of the CAPI objects 
├── mgmt-cluster-eks-a-cluster.kubeconfig
├── mgmt-cluster-eks-a-cluster.yaml
└── generated

For a workload cluster, a backup folder is generated with the naming convention wkld-cluster-backup-${timestamp} under the mgmt-cluster directory:

├── wkld-cluster-backup-2023-10-11T02_55_56 <------ Folder with a backup of the CAPI objects 
├── mgmt-cluster-eks-a-cluster.kubeconfig
├── mgmt-cluster-eks-a-cluster.yaml
└── generated
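Because several backup folders can accumulate over successive upgrades, it can help to pick the newest one programmatically. The following is a minimal sketch, assuming the folder naming convention shown above; the function name latest_backup_dir is hypothetical and not part of the EKS Anywhere CLI.

```shell
# latest_backup_dir CLUSTER_DIR CLUSTER_NAME  (hypothetical helper)
# Prints the path of the newest "<cluster-name>-backup-<timestamp>" folder
# inside CLUSTER_DIR, or returns non-zero if no such folder exists.
latest_backup_dir() (
    cluster_dir=$1
    cluster_name=$2
    latest=""
    # Timestamps such as 2023-10-11T02_55_56 sort chronologically under
    # plain lexical ordering, which is the order in which globs expand.
    LC_ALL=C
    for dir in "$cluster_dir/$cluster_name"-backup-*/; do
        [ -d "$dir" ] && latest=${dir%/}
    done
    [ -n "$latest" ] || return 1
    printf '%s\n' "$latest"
)

# Example (commented out so the snippet has no side effects):
# CLUSTER_STATE_BACKUP_LATEST_PATH=$(latest_backup_dir mgmt-cluster mgmt-cluster)
```

The function runs in a subshell, so its variables and the locale override do not leak into the calling shell.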

Although the likelihood of a cluster failure occurring without any associated cluster upgrade operation is relatively low, it is still recommended to manually back up these Cluster API objects on a routine basis. For example, to create a Cluster API backup of a cluster:


# Substitute the EKS Anywhere release version with whatever CLI version you are using
BUNDLE_MANIFEST_URL=$(curl -s | yq ".spec.releases[] | select(.version==\"$EKSA_RELEASE_VERSION\").bundleManifestUrl")
CLI_TOOLS_IMAGE=$(curl -s $BUNDLE_MANIFEST_URL | yq ".spec.versionsBundles[0].eksa.cliTools.uri")

docker run -i --network host -w $(pwd) -v $(pwd):/$(pwd) --entrypoint clusterctl ${CLI_TOOLS_IMAGE} move \
        --namespace eksa-system \
        --kubeconfig $MGMT_CLUSTER_KUBECONFIG \
        --to-directory ${BACKUP_DIRECTORY}

This saves the Cluster API objects of the management cluster mgmt with all its workload clusters, to a local directory under the backup-mgmt folder.
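The backup command above depends on several shell variables being set beforehand. As a hedged sketch, a small preflight check such as the following can catch an unset variable before clusterctl runs; require_vars is a hypothetical helper name, and the variable names are the ones used in the command above.

```shell
# require_vars NAME...  (hypothetical helper)
# Returns non-zero and reports each listed variable that is unset or empty.
require_vars() (
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            printf 'missing required variable: %s\n' "$name" >&2
            missing=1
        fi
    done
    return "$missing"
)

# Example (commented out so the snippet has no side effects):
# require_vars EKSA_RELEASE_VERSION MGMT_CLUSTER_KUBECONFIG BACKUP_DIRECTORY || exit 1
```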

2 - Restore cluster

How to restore your EKS Anywhere cluster from backup

An EKS Anywhere cluster can end up in an unrecoverable state for reasons such as a failed cluster upgrade, underlying infrastructure problems, or network issues, any of which can make the cluster inaccessible through conventional means. This document outlines the steps for restoring a failed cluster from backups in these situations.


Always back up your EKS Anywhere cluster. Refer to the Backup cluster section and make sure you have up-to-date etcd and Cluster API backups at hand.

Restore a management cluster

Because an EKS Anywhere management cluster contains its own management components plus those of all the workload clusters it manages, restoring it can be more complicated than restoring all the objects from the etcd backup. Specifically, all the core EKS Anywhere and Cluster API custom resources that manage the lifecycle (provisioning, upgrading, operating, etc.) of the management cluster and its workload clusters are stored in the management cluster. This includes the resources for all the supporting infrastructure, such as virtual machines, networks, and load balancers. After a failed cluster upgrade, for example, the infrastructure components may have changed since the etcd backup was taken. Because the backup does not contain the new state of the half-upgraded cluster, simply restoring it can create virtual machine UUID and IP mismatches, leaving EKS Anywhere unable to heal the cluster.

Depending on whether the infrastructure components have changed since the etcd backup was taken (for example, if machines were rolled out and recreated with new IP addresses), a different strategy is needed to restore the management cluster.
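The choice between the two strategies described in the following subsections can be summarized as a small decision function. This is only an illustrative sketch: the function name and the strategy labels it prints are hypothetical, and the yes/no answers are supplied by the operator rather than detected automatically.

```shell
# choose_restore_strategy ACCESSIBLE INFRA_CHANGED  (hypothetical helper)
# Both arguments are "yes" or "no".
choose_restore_strategy() (
    accessible=$1
    infra_changed=$2
    if [ "$accessible" = yes ] && [ "$infra_changed" = no ]; then
        # API server reachable and machines unchanged since the etcd backup
        echo "restore-from-etcd-backup"
    else
        # Cluster unreachable, or machines changed since the etcd backup
        echo "migrate-to-new-management-cluster"
    fi
)
```

For example, `choose_restore_strategy yes no` selects the etcd restore path, while any other combination selects the migration path.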

Cluster accessible and the infrastructure components not changed after etcd backup was taken

If the management cluster is still accessible through the API server, and the underlying infrastructure layer (nodes, machines, VMs, etc.) has not changed since the etcd backup was taken, follow the External etcd backup and restore section to restore the management cluster itself from the backup.

Cluster not accessible or infrastructure components changed after etcd backup was taken

If the cluster is no longer accessible by any means, or the infrastructure machines have changed since the etcd backup was taken, restoring the management cluster from the outdated etcd backup will not work. Instead, you need to create a new management cluster and migrate all the EKS Anywhere resources of the old workload clusters to it, so that the new management cluster takes over ownership of the existing workload clusters. Below is an example of migrating a failed management cluster mgmt-old with its workload clusters w01 and w02 to a new management cluster mgmt-new:

  1. Create a new management cluster to which you will be migrating your workload clusters later.

    You can define a cluster config similar to your old management cluster, and run cluster creation of the new management cluster with the exact same EKS Anywhere version used to create the old management cluster.

    If the original management cluster still exists with old infrastructure running, you need to create a new management cluster with a different cluster name to avoid conflict.

    eksctl anywhere create cluster -f mgmt-new.yaml
  2. Move the custom resources of all the workload clusters to the new management cluster created above.

    Using the vSphere provider as an example, we are moving the Cluster API custom resources of the workload clusters, such as vspherevms, vspheremachines and machines, from the old management cluster to the new management cluster created in the step above. By using the --filter-cluster flag with the clusterctl move command, we target only the custom resources from the workload clusters.

    # Use the same cluster name if the newly created management cluster has the same cluster name as the old one
    # Substitute the workspace path with the workspace you are using
    # Retrieve the path of the Cluster API backup folder that is automatically generated during the cluster upgrade
    # This folder contains all the resources that represent the cluster state of the old management cluster along with its workload clusters
    # Substitute the EKS Anywhere release version with the EKS Anywhere version of the original management cluster
    BUNDLE_MANIFEST_URL=$(curl -s | yq ".spec.releases[] | select(.version==\"$EKSA_RELEASE_VERSION\").bundleManifestUrl")
    CLI_TOOLS_IMAGE=$(curl -s $BUNDLE_MANIFEST_URL | yq ".spec.versionsBundles[0].eksa.cliTools.uri")
    # The clusterctl move command needs to be executed for each workload cluster.
    # It will only move the workload cluster resources from the EKS Anywhere backup to the new management cluster.
    # If you have multiple workload clusters, you have to run the command for each cluster as shown below.
    # Move workload cluster w01 resources to the new management cluster mgmt-new
    docker run -i --network host -w $(pwd) -v $(pwd):/$(pwd) --entrypoint clusterctl ${CLI_TOOLS_IMAGE} move \
        --namespace eksa-system \
        --filter-cluster ${WORKLOAD_CLUSTER_1} \
        --from-directory ${CLUSTER_STATE_BACKUP_LATEST_PATH} \
        --to-kubeconfig ${MGMT_CLUSTER_NEW_KUBECONFIG}
    # Move workload cluster w02 resources to the new management cluster mgmt-new
    docker run -i --network host -w $(pwd) -v $(pwd):/$(pwd) --entrypoint clusterctl ${CLI_TOOLS_IMAGE} move \
        --namespace eksa-system \
        --filter-cluster ${WORKLOAD_CLUSTER_2} \
        --from-directory ${CLUSTER_STATE_BACKUP_LATEST_PATH} \
        --to-kubeconfig ${MGMT_CLUSTER_NEW_KUBECONFIG}
  3. (Optional) Update the cluster config file of the workload clusters if the new management cluster has a different cluster name than the original management cluster.

    You can skip this step if the new management cluster has the same cluster name as the old management cluster.

    # workload cluster w01
    kind: Cluster
    metadata:
      name: w01
      namespace: default
    spec:
      managementCluster:
        name: mgmt-new # This needs to be updated with the new management cluster name.
    # workload cluster w02
    kind: Cluster
    metadata:
      name: w02
      namespace: default
    spec:
      managementCluster:
        name: mgmt-new # This needs to be updated with the new management cluster name.

    Apart from the managementCluster field you updated above, make sure that all other fields in the workload cluster configs stay the same as the workload cluster resources were when the old management cluster failed.

  4. Apply the updated cluster config of each workload cluster in the new management cluster.

    kubectl apply -f w01/w01-eks-a-cluster.yaml --kubeconfig ${MGMT_CLUSTER_NEW_KUBECONFIG}
    kubectl apply -f w02/w02-eks-a-cluster.yaml --kubeconfig ${MGMT_CLUSTER_NEW_KUBECONFIG}
  5. Validate all clusters are in the desired state.

    kubectl get clusters -n default -o custom-columns="NAME:.metadata.name,READY:.status.conditions[?(@.type=='Ready')].status" --kubeconfig ${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig
    NAME       READY
    mgmt-new   True
    w01        True
    w02        True
    kubectl get -n eksa-system --kubeconfig ${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig
    NAME       PHASE         AGE
    mgmt-new   Provisioned   11h   
    w01        Provisioned   11h   
    w02        Provisioned   11h 
    kubectl get kcp -n eksa-system  --kubeconfig ${MGMT_CLUSTER_NEW}/${MGMT_CLUSTER_NEW}-eks-a-cluster.kubeconfig
    mgmt-new   mgmt-new   true          true                   2          2       2                       11h   v1.27.1-eks-1-27-4
    w01        w01        true          true                   2          2       2                       11h   v1.27.1-eks-1-27-4
    w02        w02        true          true                   2          2       2                       11h   v1.27.1-eks-1-27-4
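Step 2 above repeats the same clusterctl move invocation for every workload cluster. With more than a couple of clusters, a loop can generate the commands for review before anything is executed. The sketch below only prints the clusterctl arguments and runs nothing; the function name print_move_commands is hypothetical, and the docker wrapper from step 2 is omitted for brevity.

```shell
# print_move_commands BACKUP_PATH NEW_KUBECONFIG CLUSTER...  (hypothetical helper)
# Prints one "clusterctl move" command line per workload cluster (dry run only).
print_move_commands() (
    backup_path=$1
    new_kubeconfig=$2
    shift 2
    for cluster in "$@"; do
        printf 'clusterctl move --namespace eksa-system --filter-cluster %s --from-directory %s --to-kubeconfig %s\n' \
            "$cluster" "$backup_path" "$new_kubeconfig"
    done
)

# Example (commented out so the snippet has no side effects):
# print_move_commands "${CLUSTER_STATE_BACKUP_LATEST_PATH}" "${MGMT_CLUSTER_NEW_KUBECONFIG}" w01 w02
```

Printing the commands first makes it easy to confirm that every workload cluster is covered exactly once before the move is run against the new management cluster.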

Restore a workload cluster

Cluster accessible and the infrastructure components not changed after etcd backup was taken

As with a management cluster whose infrastructure components have not changed, follow the External etcd backup and restore section to restore the workload cluster itself from the backup.

Cluster not accessible or infrastructure components changed after etcd backup was taken

If the workload cluster is still accessible, but the infrastructure machines have changed since the etcd backup was taken, you can still try restoring the cluster itself from the etcd backup. Doing so is risky, however: it can cause the node names, IPs and other infrastructure configurations to revert to a state that is no longer valid. Restoring etcd effectively takes a cluster back in time, and all clients will experience a conflicting, parallel history. This can impact the behavior of watching components such as Kubernetes controller managers, the EKS Anywhere cluster controller manager, and Cluster API controller managers.

If the original workload cluster becomes inaccessible or cannot be restored to a healthy state from an outdated etcd backup, a new workload cluster needs to be created. This new cluster should be managed by the same management cluster that managed the original. You must then restore your workload applications to this new cluster from the etcd backup of the original. This ensures the management cluster retains control, with all data from the old cluster intact. Below is an example of applying the etcd backup etcd-snapshot-w01.db from a failed workload cluster w01 to a new cluster w02:

  1. Create a new workload cluster to which you will be migrating your workloads and applications from the original failed workload cluster.

    You can define a cluster config similar to your old workload cluster, with a different cluster name (if the old workload cluster still exists), and run cluster creation of the new workload cluster with the exact same EKS Anywhere version used to create the old workload cluster.

    export MGMT_CLUSTER="mgmt"
    export MGMT_CLUSTER_KUBECONFIG=${MGMT_CLUSTER}/${MGMT_CLUSTER}-eks-a-cluster.kubeconfig
    eksctl anywhere create cluster -f w02.yaml --kubeconfig $MGMT_CLUSTER_KUBECONFIG
  2. Save the config map objects of the new workload cluster to a file.

    Save a copy of the new workload cluster’s cluster-info, kube-proxy and kubeadm-config config map objects before the restore. This is necessary as the etcd restore will override the config maps above with the metadata information (certificates, endpoint, etc.) from the old cluster.

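    export WORKLOAD_CLUSTER_NAME="w02"
    # Adjust this path if your workload cluster kubeconfig is stored elsewhere
    export WORKLOAD_CLUSTER_KUBECONFIG=${WORKLOAD_CLUSTER_NAME}/${WORKLOAD_CLUSTER_NAME}-eks-a-cluster.kubeconfig
    cat <<EOF >> w02-cm.yaml
    $(kubectl get -n kube-public cm cluster-info -oyaml --kubeconfig $WORKLOAD_CLUSTER_KUBECONFIG)
    ---
    $(kubectl get -n kube-system cm kube-proxy -oyaml --kubeconfig $WORKLOAD_CLUSTER_KUBECONFIG)
    ---
    $(kubectl get -n kube-system cm kubeadm-config -oyaml --kubeconfig $WORKLOAD_CLUSTER_KUBECONFIG)
    EOF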

    Manually remove the creationTimestamp, resourceVersion, and uid fields from the config map objects, so that you can later run kubectl apply against this file without errors.

  3. Follow the External etcd backup and restore section to restore the old workload cluster’s etcd backup etcd-snapshot-w01.db onto the new workload cluster w02. Use the restore process that matches the OS family of your nodes.

    You might notice that after restoring the original etcd backup to the new workload cluster w02, all the nodes go to the NotReady state with node names changed to have the prefix w01-*. This is because restoring etcd effectively applies the node data from the original cluster, which causes a conflicting history and can impact the behavior of watching components such as Kubelets and Kubernetes controller managers.

    kubectl get nodes --kubeconfig $WORKLOAD_CLUSTER_KUBECONFIG
    NAME                              STATUS     ROLES           AGE     VERSION
    w01-bbtdd                         NotReady   control-plane   3d23h   v1.27.3-eks-6f07bbc
    w01-md-0-66dbcfb56cxng8lc-8ppv5   NotReady   <none>          3d23h   v1.27.3-eks-6f07bbc
  4. Restart Kubelet of the control plane and worker nodes of the new workload cluster after the restore.

    In some cases, Kubelet on the node restarts automatically and the node becomes ready. In other cases, you need to manually restart Kubelet on all the control plane and worker nodes to bring the nodes back to the ready state. Kubelet registers the node itself with the API server, which then updates etcd with the correct node data of the new workload cluster w02.

    # Ubuntu or RHEL: SSH into the control plane and worker nodes. You must do this for each node.
    ssh -i ${SSH_KEY} ${SSH_USERNAME}@<node IP>
    sudo su
    systemctl restart kubelet
    # Bottlerocket: SSH into the control plane and worker nodes. You must do this for each node.
    ssh -i ${SSH_KEY} ${SSH_USERNAME}@<node IP>
    apiclient exec admin bash
    systemctl restart kubelet
  5. Add the label back to all the control plane nodes.

    kubectl label nodes <control-plane-node-name> --kubeconfig $WORKLOAD_CLUSTER_KUBECONFIG
  6. Remove the legacy nodes (if any).

    kubectl get nodes --kubeconfig $WORKLOAD_CLUSTER_KUBECONFIG
    NAME                              STATUS     ROLES           AGE     VERSION
    w01-bbtdd                         NotReady   control-plane   3d23h   v1.27.3-eks-6f07bbc
    w01-md-0-66dbcfb56cxng8lc-8ppv5   NotReady   <none>          3d23h   v1.27.3-eks-6f07bbc
    w02-fcbm2j                        Ready      control-plane   91m     v1.27.3-eks-6f07bbc
    w02-md-0-b7cc67cd4xd86jf-4c9ktp   Ready      <none>          73m     v1.27.3-eks-6f07bbc
    kubectl delete node w01-bbtdd --kubeconfig $WORKLOAD_CLUSTER_KUBECONFIG
    kubectl delete node w01-md-0-66dbcfb56cxng8lc-8ppv5 --kubeconfig $WORKLOAD_CLUSTER_KUBECONFIG
  7. Re-apply the original config map objects of the workload cluster.

    Re-apply the cluster-info, kube-proxy and kubeadm-config config map objects saved in step 2 to the workload cluster w02.

    kubectl apply -f w02-cm.yaml --kubeconfig $WORKLOAD_CLUSTER_KUBECONFIG
  8. Validate the nodes are in ready state.

    kubectl get nodes --kubeconfig $WORKLOAD_CLUSTER_KUBECONFIG
    NAME                              STATUS   ROLES           AGE     VERSION
    w02-djshz                         Ready    control-plane   9m7s    v1.27.3-eks-6f07bbc
    w02-md-0-6bbc8dd6d4xbgcjh-wfmb6   Ready    <none>          3m55s   v1.27.3-eks-6f07bbc
  9. Restart the system pods to ensure that they use the config maps you re-applied in step 7.

    kubectl rollout restart ds kube-proxy -n kube-system --kubeconfig $WORKLOAD_CLUSTER_KUBECONFIG
  10. Unpause the cluster reconcilers.

    kubectl annotate $WORKLOAD_CLUSTER_NAME --kubeconfig=$MGMT_CLUSTER_KUBECONFIG
    kubectl patch $WORKLOAD_CLUSTER_NAME --type merge -p '{"spec":{"paused": false}}' -n eksa-system --kubeconfig=$MGMT_CLUSTER_KUBECONFIG
  11. Roll out and restart all the machine objects so that the workload cluster has a clean state.

    # Substitute the EKS Anywhere release version with whatever CLI version you are using
    BUNDLE_MANIFEST_URL=$(curl -s | yq ".spec.releases[] | select(.version==\"$EKSA_RELEASE_VERSION\").bundleManifestUrl")
    CLI_TOOLS_IMAGE=$(curl -s $BUNDLE_MANIFEST_URL | yq ".spec.versionsBundles[0].eksa.cliTools.uri")
    # Rollout restart all the control plane machines
    docker run -i --network host -w $(pwd) -v $(pwd):/$(pwd) --entrypoint clusterctl ${CLI_TOOLS_IMAGE} alpha rollout restart kubeadmcontrolplane/${WORKLOAD_CLUSTER_NAME} -n eksa-system --kubeconfig ${MGMT_CLUSTER_KUBECONFIG}
    # Rollout restart all the worker machines
    # You need to repeat the command below for each worker node group
    docker run -i --network host -w $(pwd) -v $(pwd):/$(pwd) --entrypoint clusterctl ${CLI_TOOLS_IMAGE} alpha rollout restart machinedeployment/${WORKLOAD_CLUSTER_NAME}-md-0 -n eksa-system --kubeconfig ${MGMT_CLUSTER_KUBECONFIG}
  12. Validate the new workload cluster is in the desired state.

    kubectl get clusters -n default -o custom-columns="NAME:.metadata.name,READY:.status.conditions[?(@.type=='Ready')].status" --kubeconfig $MGMT_CLUSTER_KUBECONFIG
    NAME   READY
    mgmt   True
    w02    True
    kubectl get -n eksa-system --kubeconfig $MGMT_CLUSTER_KUBECONFIG
    NAME   PHASE         AGE
    mgmt   Provisioned   11h   
    w02    Provisioned   11h 
    kubectl get kcp -n eksa-system  --kubeconfig $MGMT_CLUSTER_KUBECONFIG
    mgmt   mgmt       true          true                   2          2       2                       11h   v1.27.1-eks-1-27-4
    w02    w02        true          true                   2          2       2                       11h   v1.27.1-eks-1-27-4
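Step 2 above asks you to remove the creationTimestamp, resourceVersion, and uid fields from the saved config maps by hand. As an alternative, the following minimal sketch filters them out with awk. It assumes the two-space indentation that kubectl -oyaml produces for metadata fields, and the function name strip_server_fields is hypothetical.

```shell
# strip_server_fields  (hypothetical helper)
# Reads YAML on stdin and drops the server-populated creationTimestamp,
# resourceVersion, and uid keys, but only directly under a top-level
# "metadata:" block. Identically named keys elsewhere are left untouched.
strip_server_fields() {
    awk '
        # Any unindented line starts a new top-level key (or a "---" separator)
        /^[^ ]/ { in_meta = ($0 ~ /^metadata:/) }
        in_meta && /^  (creationTimestamp|resourceVersion|uid):/ { next }
        { print }
    '
}

# Example (commented out so the snippet has no side effects):
# strip_server_fields < w02-cm.yaml > w02-cm.clean.yaml
```

Writing the result to a separate file keeps the unmodified copy available for comparison before you run kubectl apply.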