Cluster troubleshooting
This guide covers EKS Anywhere troubleshooting. It is divided into the following sections:
- General troubleshooting
- Bare Metal Troubleshooting
- vSphere Troubleshooting
- Snow Troubleshooting
- Nutanix Troubleshooting
You may want to search this document for a fragment of the error you are seeing.
Troubleshooting knowledge articles
More information on troubleshooting EKS Anywhere is available from the AWS Knowledge Center including:
- How do I check EKS Anywhere cluster component logs on primary and worker nodes for BottleRocket, Ubuntu, or Redhat?
- How can I troubleshoot etcdadm controller issues in EKS Anywhere?
- How do I return an EKS Anywhere cluster to a working state when the cluster upgrade fails?
- How do I clean up EKS Anywhere resources without eksctl when cluster creation fails?
- How does EKS Anywhere cluster bootstrapping work?
General troubleshooting
Increase eksctl anywhere output
If you’re having trouble running eksctl anywhere
you may get more verbose output with the -v 6
option. The highest level of verbosity is -v 9
and the default level of logging is level equivalent to -v 0
.
Cannot run Docker commands
The EKS Anywhere binary requires access to run Docker commands without using sudo
.
If you’re using a Linux distribution you will need to be using Docker 20.x.x and your user needs to be part of the Docker group.
To add your user to the Docker group you can use.
sudo usermod -a -G docker $USER
Now you need to log out and back in to get the new group permissions.
Minimum requirements for Docker version have not been met
Error: failed to validate Docker: minimum requirements for Docker version have not been met. Install Docker version 20.x.x or above
Ensure you are running Docker 20.x.x for example:
% docker --version
Docker version 20.10.6, build 370c289
Minimum requirements for Docker version have not been met on macOS
Error: EKS Anywhere does not support Docker Desktop versions between 4.3.0 and 4.4.1 on macOS
Error: EKS Anywhere requires Docker Desktop to be configured to use CGroups v1. Please set `deprecatedCgroupv1:true` in your `~/Library/Group\\ Containers/group.com.docker/settings.json` file
Ensure you are running Docker Desktop 4.4.2 or newer and, if you are running EKS Anywhere v0.15 or earlier, have set "deprecatedCgroupv1": true
in your settings.json file
% defaults read /Applications/Docker.app/Contents/Info.plist CFBundleShortVersionString
4.42
% docker info --format '{{json .CgroupVersion}}'
"1"
For EKS Anywhere v0.15 and earlier, cgroups v2 is not supported in Ubuntu 21.10+ and 22.04
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"
For EKS Anywhere v0.15 and earlier, if you are using Ubuntu it is recommended to use Ubuntu 20.04 for the Administrative Machine. This is because the EKS Anywhere Bootstrap cluster for those versions requires cgroups v1. Since Ubuntu 21.10 cgroups v2 is enabled by default. You can use Ubuntu 21.10 and 22.04 for the Administrative machine if you configure Ubuntu to use cgroups v1 instead. This is not an issue if you are using macOS for your Administrative machine.
To verify cgroups version
% docker info | grep Cgroup
Cgroup Driver: cgroupfs
Cgroup Version: 2
To use cgroups v1 you need to sudo and edit /etc/default/grub to set GRUB_CMDLINE_LINUX to “systemd.unified_cgroup_hierarchy=0” and reboot.
%sudo <editor> /etc/default/grub
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"
sudo update-grub
sudo reboot now
Then verify you are using cgroups v1.
% docker info | grep Cgroup
Cgroup Driver: cgroupfs
Cgroup Version: 1
Pod errors due to too many open files
The bootstrap or EKS Anywhere Docker cluster pods show error: too many open files
.
This may be caused by running out of inotify
resources. Resource limits are defined by fs.inotify.max_user_watches
and fs.inotify.max_user_instances
system variables. For example, in Ubuntu these default to 8192 and 128 respectively, which is not enough to create a cluster with many nodes.
To increase these limits temporarily run the following commands on the host:
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512
To make the changes persistent, edit the file /etc/sysctl.conf
and add these lines:
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 512
Reference: https://kind.sigs.k8s.io/docs/user/known-issues/#pod-errors-due-to-too-many-open-files
ECR access denied
Error: failed to create cluster: unable to initialize executables: failed to setup eks-a dependencies: Error response from daemon: pull access denied for public.ecr.aws/***/cli-tools, repository does not exist or may require 'docker login': denied: Your authorization token has expired. Reauthenticate and try again.
All images needed for EKS Anywhere are public and do not need authentication. Old cached credentials could trigger this error. Remove cached credentials by running:
docker logout public.ecr.aws
Error unmarshaling JSON: while decoding JSON: json: unknown field “spec”
Error: loading config file "cluster.yaml": error unmarshaling JSON: while decoding JSON: json: unknown field "spec"
Use eksctl anywhere create cluster -f cluster.yaml
instead of eksctl create cluster -f cluster.yaml
to create an EKS Anywhere cluster.
Error: old cluster config file exists under my-cluster, please use a different clusterName to proceed
Error: old cluster config file exists under my-cluster, please use a different clusterName to proceed
The my-cluster
directory already exists in the current directory.
Either use a different cluster name or move the directory.
At least one WorkerNodeGroupConfiguration must not have NoExecute and/or NoSchedule taints
Error: the cluster config file provided is invalid: at least one WorkerNodeGroupConfiguration must not have NoExecute and/or NoSchedule taints
An EKS Anywhere management cluster requires at least one schedulable worker node group to run cluster administration components. Both NoExecute
and NoSchedule
taints must be absent from the workerNodeGroup for it to be considered schedulable. This validation was removed for workload clusters from v0.19.0 onwards, so now it only applies to management clusters.
To remedy, remove NoExecute
and NoSchedule
taints from at least one WorkerNodeGroupConfiguration on your management cluster.
Invalid configuration example:
# Invalid workerNodeGroupConfiguration
workerNodeGroupConfigurations: # List of node groups you can define for workers
- count: 1
name: md-0
taints: # NoSchedule taint applied to md-0, not schedulable
- key: "key1"
value: "value1"
effect: "NoSchedule"
- count: 1
name: md-1
taints: # NoExecute taint applied to md-1, not schedulable
- key: "key2"
value: "value2"
effect: "NoExecute"
Valid configuration example:
# Valid workerNodeGroupConfiguration
workerNodeGroupConfigurations: # List of node groups you can define for workers
- count: 1
name: md-0
taints: # NoSchedule taint applied to md-0, not schedulable
- key: "key1"
value: "value1"
effect: "NoSchedule"
- count: 1
name: md-1 # md-1 has no NoSchedule/NoExecute taints applied, is schedulable
Memory or disk resource problem
There are various disk and memory issues on the Admin machine that can cause problems. Make sure:
- Docker is configured with enough memory.
- The system-wide Docker memory configuration provides enough RAM for the bootstrap cluster.
- You do not have unneeded KinD clusters by running
kind get clusters
. - You should delete unneeded clusters with
kind delete cluster --name <cluster-name>
. If you do not havekind
installed, you can install it from https://kind.sigs.k8s.io/ or usedocker ps
to see the KinD clusters and rundocker stop
to stop the cluster. - You do not have any unneeded Docker containers running with
docker ps
. - Terminate any unneeded Docker containers.
- Make sure Docker isn’t out of disk resources.
- If you don’t have any other Docker containers running you may want to run
docker system prune
to clean up disk space.
You may want to restart Docker.
To restart Docker on Ubuntu sudo systemctl restart docker
.
Waiting for cert-manager to be available… Error: timed out waiting for the condition
Failed to create cluster {"error": "error initializing capi resources in cluster: error executing init: Fetching providers\nInstalling cert-manager Version=\"v1.1.0\"\nWaiting for cert-manager to be available...\nError: timed out waiting for the condition\n"}
This is likely a memory or disk resource problem . You can also try using techniques from Generic cluster unavailable .
NTP Time sync issue
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/watchers/endpoint_slice.go:91: Failed to watch *v1beta1.EndpointSlice: failed to list *v1beta1.EndpointSlice: Unauthorized" subsys=k8s
You might notice authorization errors if the timestamps on your EKS Anywhere control plane nodes and worker nodes are out-of-sync. Please ensure that all the nodes are configured with the same healthy NTP servers to avoid out-of-sync issues.
Error running bootstrapper cmd: error joining as worker: Error waiting for worker join files: Kubeadm join kubelet-start killed after timeout
You might also notice that the joining of nodes will fail if your Admin machine differs in time compared to your nodes. Make sure to check the server time matches between the two as well.
The connection to the server localhost:8080 was refused
Performing provider setup and validations
Creating new bootstrap cluster
Installing cluster-api providers on bootstrap cluster
Error initializing capi in bootstrap cluster {"error": "error waiting for capi-kubeadm-control-plane-controller-manager in namespace capi-kubeadm-control-plane-system: error executing wait: The connection to the server localhost:8080 was refused - did you specify the right host or port?\n"}
Failed to create cluster {"error": "error waiting for capi-kubeadm-control-plane-controller-manager in namespace capi-kubeadm-control-plane-system: error executing wait: The connection to the server localhost:8080 was refused - did you specify the right host or port?\n"}
Initializing Cluster API components on the bootstrap cluster fails. This is likely a memory or disk resource problem .
Generic cluster unavailable
Troubleshoot more by inspecting bootstrap cluster or workload cluster (depending on the stage of the failure) using kubectl commands.
kubectl get pods -A --kubeconfig=<kubeconfig>
kubectl get nodes -A --kubeconfig=<kubeconfig>
kubectl logs <podname> -n <namespace> --kubeconfig=<kubeconfig>
....
Kubectl commands return dial tcp: i/o timeout
If you are unable to run kubectl commands on a cluster due to timeout errors, then it is possible that the server endpoint in the kubeconfig does not match the control plane’s endpoint in the infrastructure provider due to kube-vip failing to allocate a virtual IP address to the cluster. If the endpoints do not match, you can ssh into the control plane node to gather logs instead. The kubelet logs can be obtained by running journalctl -u kubelet.service --no-pager
. It may also be helpful to look at kube-vip logs, which can be found in the /var/log/pods/kube-system_kube-vip-*
directory.
Verify Cluster Certificates are valid
If kubectl commands are not working due to timeout issues, then it may also be helpful to verify that certificates on the etcd and control plane nodes have not expired.
SSH to one of your etcd nodes and view the etcd container logs in /var/log/containers
.
View the control plane certificates by SSHing into one of your control plane nodes and run the following commands to view the validity of the /var/lib/kubeadm certificates and see their expiration dates.
sudo kubeadm certs check-expiration
# you would be in the admin container when you ssh to the Bottlerocket machine
# open a root shell
sudo sheltie
# pull the image
IMAGE_ID=$(apiclient get | apiclient exec admin jq -r '.settings["host-containers"]["kubeadm-bootstrap"].source')
# ctr is the containerd cli.
# For more information, see https://github.com/projectatomic/containerd/blob/master/docs/cli.md
ctr image pull ${IMAGE_ID}
# you may see missing etcd certs error, which is expected if you have external etcd nodes
ctr run \
--mount type=bind,src=/var/lib/kubeadm,dst=/var/lib/kubeadm,options=rbind:rw \
--mount type=bind,src=/var/lib/kubeadm,dst=/etc/kubernetes,options=rbind:rw \
--rm \
${IMAGE_ID} tmp-certs-check \
/opt/bin/kubeadm certs check-expiration
EKS Anywhere typically renews certificates when upgrading a cluster. However, if a cluster has not been upgraded for over a year, then it is necessary to manually renew these certificates. Please see Certificate rotation to manually rotate expired certificates.
Bootstrap cluster fails to come up
If your bootstrap cluster has problems you may get detailed logs by looking at the files created under the ${CLUSTER_NAME}/logs
folder. The capv-controller-manager log file will surface issues with vsphere specific configuration while the capi-controller-manager log file might surface other generic issues with the cluster configuration passed in.
You may also access the logs from your bootstrap cluster directly as below:
export KUBECONFIG=${PWD}/${CLUSTER_NAME}/generated/${CLUSTER_NAME}.kind.kubeconfig
kubectl logs -f -n capv-system -l control-plane="controller-manager" -c manager
It is also useful to start a shell session on the Docker container running the bootstrap cluster by running docker ps
and then docker exec -it <container-id> bash
the kind container.
Bootstrap cluster fails to come up: node(s) already exist for a cluster with the name
During create
and delete
CLI, EKS Anywhere tries to create a temporary KinD bootstrap cluster with the name ${CLUSTER_NAME}-eks-a-cluster
on the Admin machine. This operation can fail with below error:
Error: creating bootstrap cluster: executing create/delete cluster: ERROR: failed to create/delete cluster: node(s) already exist for a cluster with the name \"cluster-name\"
This indicates that the cluster creation or deletion fails because a bootstrap cluster of the same name already exists. If you are sure the cluster is not being used, you can manually delete the old cluster:
docker ps | grep "${CLUSTER_NAME}-eks-a-cluster-control-plane" | awk '{ print $1 }' | xargs docker rm -f
Once the old KinD bootstrap cluster is deleted, you can rerun the eksctl anywhere create
or eksctl anywhere delete
command again.
Cluster upgrade fails with management components on bootstrap cluster
Important
KinD cluster is no longer used during upgrade of management cluster from v0.18.0 onwards.For eksctl anywhere
version older than v0.18.0
, if a cluster upgrade of a management (or self managed) cluster fails or is halted in the middle, you may be left in a
state where the management resources (CAPI) are still on the KinD bootstrap cluster on the Admin machine. Right now, you will have to
manually move the management resources from the KinD cluster back to the management cluster.
First create a backup:
CLUSTER_NAME=squid
KINDKUBE=${CLUSTER_NAME}/generated/${CLUSTER_NAME}.kind.kubeconfig
MGMTKUBE=${CLUSTER_NAME}/${CLUSTER_NAME}-eks-a-cluster.kubeconfig
DIRECTORY=backup
# Substitute the version with whatever version you are using
CONTAINER=public.ecr.aws/eks-anywhere/cli-tools:v0.12.0-eks-a-19
rm -rf ${DIRECTORY}
mkdir ${DIRECTORY}
docker run -i --network host -w $(pwd) -v /var/run/docker.sock:/var/run/docker.sock -v $(pwd):/$(pwd) --entrypoint clusterctl ${CONTAINER} move \
--namespace eksa-system \
--kubeconfig $KINDKUBE \
--to-directory ${DIRECTORY}
#After the backup, move the management cluster back
docker run -i --network host -w $(pwd) -v /var/run/docker.sock:/var/run/docker.sock -v $(pwd):/$(pwd) --entrypoint clusterctl ${CONTAINER} move \
--to-kubeconfig $MGMTKUBE \
--namespace eksa-system \
--kubeconfig $KINDKUBE
Before you delete your bootstrap KinD cluster, verify there are no import custom resources left on it:
kubectl get crds | grep eks | while read crd rol
do
echo $crd
kubectl get $crd -A
done
Upgrade command stuck on Waiting for external etcd to be ready
There can be a few reasons if the upgrade command is stuck on waiting for external etcd for workload cluster to be ready
and eventually times out.
First, check the underlying infrastructure to see if any new etcd machines are created.
No new etcd machines are created in the infrastrure provider
If no etcd machines are created, check the etcdadm-bootstrap-provider-controller-manager
, etcdadm-controller-controller-manager
, capi-controller-manager
, capi-kubeadm-control-plane-controller-manager
, and other Cluster API controller logs using the commands mentioned in Generic cluster unavailable
section.
No IP assigned to the new etcd machine
Refer to no IP assigned to a VM section.
Machines are unhealthy after restoring CAPI control from the backup
When an EKS Anywhere management cluster loses its CAPI resources, you need use the backup stored in the
$ kubectl get machines -n eksa-system
NAME PHASE
cluster1-etcd-2md6m Failed
cluster1-etcd-s8qs5 Failed
cluster1-etcd-vqs6w Failed
cluster1-md-0-75584b4fccxfgs86-pfk22 Failed
cluster1-x6rzc Failed
This can happen when the new machines got rolled out after the backup was taken, thus the backup objects being applied do not have the latest machine configurations. When applying an outdated CAPI backup, the provider machine object is out of sync with the actual infrastructure VM.
etcdadm join failed in etcd machine bootstrap log
When an etcd machine object is not in ready state, you can ssh
into the etcd node and check the etcd bootstrap log:
cat /var/log/cloud-init.log
sudo sheltie
journalctl _COMM=host-ctr
If the bootstrap log indicates that the etcadm join operation fail, this can mean that the etcd cluster contains unhealthy member(s). Looking at the log details you can further troubleshoot from below sections.
New etcd machine cannot find the existing etcd cluster members
The edcdadm log shows error that the new etcd machine cannot connect to the existing etcd cluster members. This means the etcdadm-init
secret is outdated. To update it, run
kubectl edit secrets <cluster-name>-etcd-init -n eksa-system
and make sure the new etcd machine IP is included in the secret.
New etcd machine cannot join the cluster due to loss of quorum
An etcd cluster needs a majority of nodes, a quorum, to agree on updates to the cluster state. For a cluster with n members, quorum is (n/2)+1. If etcd could not automatically recover and restore quorum, it is possible that there was an unhealthy or broken VM (e.g. a VM without IP assigned) that is still a member of the etcd cluster. You need to find and manually remove the unhealthy member.
To achieve that, first ssh
into the etcd node and use etcdctl member list
command to detect the unhealthy member.
sudo etcdctl --cacert=/etc/etcd/pki/ca.crt --cert=/etc/etcd/pki/etcdctl-etcd-client.crt --key=/etc/etcd/pki/etcdctl-etcd-client.key member list
ETCD_CONTAINER_ID=$(ctr -n k8s.io c ls | grep -w "etcd-io" | cut -d " " -f1)
ctr -n k8s.io t exec -t --exec-id etcd ${ETCD_CONTAINER_ID} etcdctl \
--cacert=/var/lib/etcd/pki/ca.crt \
--cert=/var/lib/etcd/pki/server.crt \
--key=/var/lib/etcd/pki/server.key \
member list
After identifying the unhealthy member, use etcdctl member remove
to remove it from the cluster.
sudo etcdctl --cacert=/etc/etcd/pki/ca.crt --cert=/etc/etcd/pki/etcdctl-etcd-client.crt --key=/etc/etcd/pki/etcdctl-etcd-client.key member remove ${UNHEALTHY_MEMBER_ID}
ctr -n k8s.io t exec -t --exec-id etcd ${ETCD_CONTAINER_ID} etcdctl \
--cacert=/var/lib/etcd/pki/ca.crt \
--cert=/var/lib/etcd/pki/server.crt \
--key=/var/lib/etcd/pki/server.key \
member remove ${UNHEALTHY_MEMBER_ID}
Provider machine does not point to the underlying VMs
Follow the VM restore process in provider-specific section.
Nodes cycling due to insufficient pod CIDRs
Kubernetes controller manager allocates a dedicated CIDR block per node for pod IPs from within clusterNetwork.pods.cidrBlocks
. The size of this node CIDR block defaults to /24 and can be adjusted at cluster creation using the optional cidrMaskSize field
.
Since each node requires a CIDR block, the maximum number of nodes in a cluster is limited to the number of non-overlapping subnets of cidrMaskSize
that fit in the pods CIDR block. For example, for a pod CIDR block mask of /18
and a node CIDR mask size of /22
, a maximum of 16 nodes can be proivisioned since there are 16 subnets of size /22
in the overall /18
block. If more nodes are created than the clusterNetwork.pods.cidrBlocks
can accomodate, kube-controller-manager
will not be able to allocate a CIDR block to the extra nodes.
This can cause nodes to become NotReady
with the following sympotoms:
kubectl describe node <unhealthy node>
indicatesCIDRNotAvailable
events.- kube-controller-manager log displays:
Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
- Kubelet log on unhealthy node contains
NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
If more nodes need to be provisioned, either the clusterNetwork.pods.cidrBlocks
must be expanded or the node-cidr-mask-size
should be reduced.
.
Machine health check shows “Remediation is not allowed”
Sometimes a cluster node is crashed but machine health check does not start the proper remediation process to recreate the failed machine. For example, if a worker node is crashed, and running kubectl get mhc ${CLUSTER_NAME}-md-0-worker-unhealthy -n eksa-system -oyaml
shows status message below:
Remediation is not allowed, the number of not started or unhealthy machines exceeds maxUnhealthy
EKS Anywhere sets the machine health check’s MaxUnhealthy
of the workers in a worker node group to 40%. This means any further remediation is only allowed if at most 40% of the worker machines selected by “selector” are not healthy. If more than 40% of the worker machines are in unhealthy state, the remediation will not be triggered.
For example, if you create an EKS Anywhere cluster with 2 worker nodes in the same worker node group, and one of the worker node is down. The machine health check will not remediate the failed machine because the actual unhealthy machines (50%) in the worker node group already exceeds the maximum percentage of the unhealthy machine (40%) allowed. As a result, the failed machine will not be replaced with new healthy machine and your cluster will be left with single worker node. In this case, we recommend you to scale up the number of worker nodes, for example, to 4. Once the 2 more worker nodes are up and running, it brings the total unhealthy worker machines to 25% which is below the 40% limit. This will trigger the machine health check remediation which replace the unhealthy machine with new one.
Etcd machines with false NodeHealthy
condition due to WaitingForNodeRef
When inspecting the Machine
CRs, etcd machines might appear as Running
but containing a false NodeHealthy
condition, with a WaitingForNodeRef
reason. This is a purely cosmetic issue that has no impact in the health of your cluster. This has been fixed in more recent versions of EKS-A, so this condition won’t be displayed anymore in etcd machines.
Status:
Addresses:
Address: 144.47.85.93
Type: ExternalIP
Bootstrap Ready: true
Conditions:
Last Transition Time: 2023-05-15T23:13:01Z
Status: True
Type: Ready
Last Transition Time: 2023-05-15T23:12:09Z
Status: True
Type: BootstrapReady
Last Transition Time: 2023-05-15T23:13:01Z
Status: True
Type: InfrastructureReady
Last Transition Time: 2023-05-15T23:12:09Z
Reason: WaitingForNodeRef
Severity: Info
Status: False
Type: NodeHealthy
Infrastructure Ready: true
Last Updated: 2023-05-15T23:13:01Z
Observed Generation: 3
Phase: Running
Machine gets stuck at Provisioned
state without providerID
set
The VM can be created in the provider infrastructure with proper IP assigned but running kubectl get machines -n eksa-system
indicates that the machine is in Provisioned
state and never gets to Running
.
Check the CAPI controller manager log with k logs -f -n capi-system -l control-plane="controller-manager" --tail -1
:
E0218 03:41:53.126751 1 machine_controller_noderef.go:152] controllers/Machine "msg"="Failed to parse ProviderID" "error"="providerID is empty" "providerID"={} "node"="test-cluster-6ffd74bd5b-khxzr"
E0218 03:42:09.155577 1 machine_controller.go:685] controllers/Machine "msg"="Unable to retrieve machine from node" "error"="no matching Machine" "node"="test-cluster-6ffd74bd5b-khxzr"
When inspecting the CAPI machine
object, you may find out that the Node.Spec.ProviderID
is not set.
This can happen when the workload environment does not have proper network access to the underlying provider infrastructure. For example in vSphere, without the network access to vCenter endpoint, the vsphere-cloud-controller-manager
in the workload cluster cannot set node’s providerID, thus the machine will never get to Running
state, blocking cluster provisioning from continuing.
To fix it, make sure to validate the network/firewall settings from the workload cluster to the infrastructure provider environment. Read through the Requirements
page, especially around the networking requirements in each provider before retrying the cluster provisioning:
- Requirements for EKS Anywhere on VMware vSphere
- Network Requirements for EKS Anywhere on Bare Metal
- Requirements for EKS Anywhere on CloudStack
- Prerequisite Checklist for EKS Anywhere on Snow
- Requirements for EKS Anywhere on Nutanix Cloud Infrastructure
Labeling nodes with reserved labels such as node-role.kubernetes.io
fails with kubeadm error during bootstrap
If cluster creation or upgrade fails to complete successfully and kubelet throws an error similar to the one below, please refer to this section. The cluster spec for EKS Anywhere create or upgrade should look like:
.
.
controlPlaneConfiguration:
count: 2
endpoint:
host: "192.168.x.x"
labels:
"node-role.kubernetes.io/control-plane": "cp"
workerNodeGroupConfigurations:
- count: 2
labels:
"node-role.kubernetes.io/worker": "worker"
.
.
If your cluster spec looks like the above one for either the control plane configuration and/or worker node configuration, you might run into the below kubelet error:
unknown 'kubernetes.io' or 'k8s.io' labels specified with --node-labels: [node-role.kubernetes.io/worker].
--node-labels in the 'kubernetes.io' namespace must begin with an allowed prefix (kubelet.kubernetes.io, node.kubernetes.io) or be in the specifically allowed set (beta.kubernetes.io/arch, beta.kubernetes.io/instance-type, beta.kubernetes.io/os, failure-domain.beta.kubernetes.io/region, failure-domain.beta.kubernetes.io/zone, kubernetes.io/arch, kubernetes.io/hostname, kubernetes.io/os, node.kubernetes.io/instance-type, topology.kubernetes.io/region, topology.kubernetes.io/zone)
Self-assigning node labels such as node-role.kubernetes.io
using the kubelet --node-labels
flag is not possible due to a security measure imposed by the NodeRestriction admission controller that kubeadm enables by default.
Assigning such labels to nodes can be done after the bootstrap process has completed:
kubectl label nodes <name> node-role.kubernetes.io/worker=""
For convenience, here are example one-liners to do this post-installation:
# For Kubernetes 1.19 (kubeadm 1.19 sets only the node-role.kubernetes.io/master label)
kubectl get nodes --no-headers -l '!node-role.kubernetes.io/master' -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}' | xargs -I{} kubectl label node {} node-role.kubernetes.io/worker=''
# For Kubernetes >= 1.20 (kubeadm >= 1.20 sets the node-role.kubernetes.io/control-plane label)
kubectl get nodes --no-headers -l '!node-role.kubernetes.io/control-plane' -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}' | xargs -I{} kubectl label node {} node-role.kubernetes.io/worker=''
No static pods is running in container runtime on control plane nodes
When the kubelet systemd service is running on a control plane node and you notice that no static pods are running, this could be due to a certificate issue. Check the kubelet service logs. If you see error messages like the ones below, renew the kubelet-client-current.pem certificate by following these steps :
part of the existing bootstrap client certificate in /etc/kubernetes/kubelet/kubeconfig
Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-current.pem
"Failed to connect to apiserver" err="Get \"https://10.150.223.2:6443/healthz?timeout=1s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
Troubleshooting DNS issues in EKSA cluster
DNS resolution is managed by coreDNS, a DNS server deployed as a pod within the cluster, it helps resolve service names to corresponding IP Addresses. If DNS resolution fails, applications maybe unable to connect to services or external endpoints.
Common errors related to DNS issues:
- Pods are unable to resolve the DNS names of other services.
Error: Get http://example-service: no such host
- Delay in DNS resolution leading to timeouts and slow application responses.
[ERROR] plugin/errors: 2 example-service.default.svc.cluster.local. A: read udp 10.0.0.2:59857->10.96.0.10:53: i/o timeout
- Incorrect DNS server can cause issues in cluster lifecycle workflows like create and update.
DNS Debugging
Verify CoreDNS deployment
-
Check if CoreDNS pods are running
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
Expected Output:
NAME READY STATUS RESTARTS AGE ... coredns-7b96bf9f76-5hsxb 1/1 Running 0 1h coredns-7b96bf9f76-mvmmt 1/1 Running 0 1h ...
-
Check for errors in the DNS pod
kubectl logs --namespace=kube-system -l k8s-app=kube-dns
See if there are any suspicious or unexpected messages in the logs.
-
Verify CoreDNS configuration and ensure it matches with cluster requirement, also verify if forward plugin points to correct DNS servers
kubectl get configmap coredns -n kube-system -o yaml
Example configmap output
apiVersion: v1 kind: ConfigMap metadata: name: coredns namespace: kube-system data: Corefile: | .:53 { errors health kubernetes cluster.local in-addr.arpa ip6.arpa { pods insecure upstream fallthrough in-addr.arpa ip6.arpa } prometheus :9153 forward . /etc/resolv.conf cache 30 loop reload loadbalance }
Test DNS resolution from the cluster
- Create a debug pod
kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
- Once that Pod is running, you can exec nslookup in that environment.
kubectl exec -i -t dnsutils -- nslookup <service-name>
- If the DNS server address is missing, check pod’s
/etc/resolv.conf
kubectl exec -ti dnsutils -- cat /etc/resolv.conf
Correcting Incorrect DNS Resolution
If DNS queries return incorrect or outdated IP addresses
-
Ensure that endpoints for service are correct and up-to-date.
kubectl get endpoints <service-name>
-
If CoreDNS is caching DNS responses, ensure the cache TTL is appropriate for your environment. You can edit the cache TTL in CoreDNS ConfigMap.
kubectl -n kube-system edit configmap coredns
Restart the CoreDNS deployment
kubectl rollout restart deployment coredns -n kube-system
Bare Metal troubleshooting
Creating new workload cluster hangs or fails
Cluster creation appears to be hung waiting for the Control Plane to be ready. If the CLI is hung on this message for over 30 mins, something likely failed during the OS provisioning:
Waiting for Control Plane to be ready
Or if cluster creation times out on this step and fails with the following messages:
Support bundle archive created {"path": "support-bundle-2022-06-28T00_41_24.tar.gz"}
Analyzing support bundle {"bundle": "CLUSTER_NAME/generated/bootstrap-cluster-2022-06-28T00:41:24Z-bundle.yaml", "archive": "support-bundle-2022-06-28T00_41_24.tar.gz"}
Analysis output generated {"path": "CLUSTER_NAME/generated/bootstrap-cluster-2022-06-28T00:43:40Z-analysis.yaml"}
collecting workload cluster diagnostics
Error: waiting for workload cluster control plane to be ready: executing wait: error: timed out waiting for the condition on clusters/CLUSTER_NAME
In either of those cases, the following steps can help you determine the problem:
-
Export the kind cluster’s kubeconfig file:
export KUBECONFIG=${PWD}/${CLUSTER_NAME}/generated/${CLUSTER_NAME}.kind.kubeconfig
-
If you have provided BMC information:
-
Check all of the machines that the EKS Anywhere CLI has picked up from the pool of hardware in the CSV file:
kubectl get machines.bmc -A
-
Check if those nodes are powered on. If any of those nodes are not powered on after a while then it could be possible that BMC credentials are invalid. You can verify it by checking the logs:
kubectl get tasks.bmc -n eksa-system kubectl get tasks.bmc <bmc-name> -n eksa-system -o yaml
Validate BMC credentials are correct if a connection error is observed on the
tasks.bmc
resource. Note that “IPMI over LAN” or “Redfish” must be enabled in the BMC configuration for thetasks.bmc
resource to communicate successfully. -
-
If the machine is powered on but you see linuxkit is not running, then Tinkerbell failed to serve the node via iPXE. In this case, you would want to:
-
Check the Boots service logs from the machine where you are running the CLI to see if it received and/or responded to the request:
docker logs boots
-
Confirm no other DHCP service responded to the request and check for any errors in the BMC console. Other DHCP servers on the network can result in race conditions and should be avoided by configuring the other server to block all MAC addresses and exclude all IP addresses used by EKS Anywhere.
-
-
If you see
Welcome to LinuxKit
, click enter in the BMC console to access the LinuxKit terminal. Run the following commands to check if the tink-worker container is running.docker ps -a docker logs <container-id>
-
If the machine has already started provisioning the OS and it’s in irrecoverable state, get the workflow of the provisioning/provisioned machine using:
kubectl get workflows -n eksa-system kubectl describe workflow/<workflow-name> -n eksa-system
Check all the actions and their status to determine if all actions have been executed successfully or not. If the stream-image has action failed, it’s likely due to a timeout or network related issue. You can also provide your own
image_url
by specifyingosImageURL
under datacenter spec.
vSphere troubleshooting
EKSA_VSPHERE_USERNAME is not set or is empty
❌ Validation failed {"validation": "vsphere Provider setup is valid", "error": "failed setup and validations: EKSA_VSPHERE_USERNAME is not set or is empty", "remediation": ""}
Two environment variables need to be set and exported in your environment to create clusters successfully. Be sure to use single quotes around your user name and password to avoid shell manipulation of these values.
export EKSA_VSPHERE_USERNAME='<vSphere-username>'
export EKSA_VSPHERE_PASSWORD='<vSphere-password>'
vSphere authentication failed
❌ Validation failed {"validation": "vsphere Provider setup is valid", "error": "error validating vCenter setup: vSphere authentication failed: govc: ServerFaultCode: Cannot complete login due to an incorrect user name or password.\n", "remediation": ""}
Error: failed to create cluster: validations failed
Two environment variables need to be set and exported in your environment to create clusters successfully. Be sure to use single quotes around your user name and password to avoid shell manipulation of these values.
export EKSA_VSPHERE_USERNAME='<vSphere-username>'
export EKSA_VSPHERE_PASSWORD='<vSphere-password>'
Issues detected with selected template
Issues detected with selected template. Details: - -1:-1:VALUE_ILLEGAL: No supported hardware versions among [vmx-15]; supported: [vmx-04, vmx-07, vmx-08, vmx-09, vmx-10, vmx-11, vmx-12, vmx-13].
Our upstream dependency on CAPV makes it a requirement that you use vSphere 6.7 update 3 or newer. Make sure your ESXi hosts are also up to date.
Waiting for external etcd to be ready
2022-01-19T15:56:57.734Z V3 Waiting for external etcd to be ready {"cluster": "mgmt"}
Debug this problem using techniques from Generic cluster unavailable .
Timed out waiting for the condition on deployments/capv-controller-manager
Failed to create cluster {"error": "error initializing capi in bootstrap cluster: error waiting for capv-controller-manager in namespace capv-system: error executing wait: error: timed out waiting for the condition on deployments/capv-controller-manager\n"}
Debug this problem using techniques from Generic cluster unavailable .
Timed out waiting for the condition on clusters/
Failed to create cluster {"error": "error waiting for workload cluster control plane to be ready: error executing wait: error: timed out waiting for the condition on clusters/test-cluster\n"}
This can be an issue with the number of control plane and worker node replicas defined in your cluster yaml file. Try to start off with a smaller number (3 or 5 is recommended for control plane) in order to bring up the cluster.
This error can also occur because your vCenter server is using self-signed certificates and you have insecure
set to true in the generated cluster yaml.
To check if this is the case, run the commands below:
export KUBECONFIG=${PWD}/${CLUSTER_NAME}/generated/${CLUSTER_NAME}.kind.kubeconfig
kubectl get machines
If all the machines are in Provisioning
phase, this is most likely the issue.
To resolve the issue, set insecure
to false
and thumbprint
to the TLS thumbprint of your vCenter server in the cluster yaml and try again.
"msg"="discovered IP address"
The aforementioned log message can also appear with an address value of the control plane in either of the ${CLUSTER_NAME}/logs/capv-controller-manager.log file or the capv-controller-manager pod log which can be extracted with the following command,
export KUBECONFIG=${PWD}/${CLUSTER_NAME}/generated/${CLUSTER_NAME}.kind.kubeconfig
kubectl logs -f -n capv-system -l control-plane="controller-manager" -c manager
Make sure you are choosing a control plane ip in your network range that does not conflict with other VMs.
Generic cluster unavailable
The first thing to look at is: were virtual machines created on your target provider? In the case of vSphere, you should see some VMs in your folder and they should be up. Check the console and if you see:
[FAILED] Failed to start Wait for Network to be Configured.
Make sure your DHCP server is up and working.
For more troubleshooting tips, see the CAPV Troubleshooting guide.
Machine objects stuck on provisioning state
NOTE: This issue has been resolved with EKS-Anywhere v0.20.x
There is a known issue where connection is lost to vCenter and machine provisioning stops working. For more context, see the related Github issue in upstream CAPV repo that is tracking this.
If the machine objects are stuck on provisioning state, first check the corresponding vSphere VM object. If this resource is missing the status, it is likely the CAPV session issue.
To resolve, restart CAPV pod:
kubectl rollout restart deployment -n capv-system capv-controller-manager
Workload VM is created on vSphere but can not power on
A similar issue is the VM does power on but does not show any logs on the console and does not have any IPs assigned.
This issue can occur if the resourcePool
that the VM uses does not have enough CPU or memory resources to run a VM.
To resolve this issue, increase the CPU and/or memory reservations or limits for the resourcePool.
Workload VMs start but Kubernetes not working properly
If the workload VMs start, but Kubernetes does not start or is not working properly, you may want to log onto the VMs and check the logs there.
If Kubernetes is at least partially working, you may use kubectl
to get the IPs of the nodes:
kubectl get nodes -o=custom-columns="NAME:.metadata.name,IP:.status.addresses[2].address"
If Kubernetes is not working at all, you can get the IPs of the VMs from vCenter or using govc
.
When you get the external IP you can ssh
into the nodes using the private ssh key associated with the public ssh key you provided in your cluster configuration:
ssh -i <ssh-private-key> <ssh-username>@<external-IP>
Create command stuck on Creating new workload cluster
There can be a few reasons that the create command is stuck on Creating new workload cluster
for over 30 min.
First, check the vSphere UI to see if any workload VM are created.
No node VMs are created in vSphere
If no VMs are created, check the capi-controller-manager
, capv-controller-manager
and capi-kubeadm-control-plane-controller-manager
logs using the commands mentioned in Generic cluster unavailable
section.
No IP assigned to a VM
If a VM is created, check to see if it has an IPv4 IP assigned. For example, in BottleRocket machine boot logs, you might see Failed to read current IP data
.
If there are no IPv4 IPs assigned to VMs, this is most likely because you don’t have a DHCP server configured for the network
configured in the cluster config yaml, OR there are not enough IP addresses available in the DHCP pool to assign to the VMs. Ensure that you have a DHCP server running with enough IP addresses to create a cluster
before running the create or upgrade command again.
To confirm this is a DHCP issue, you could create a new VM in the same network to validate if an IPv4 IP is assigned correctly.
Control Plane IP in clusterconfig is not present on any Control Plane VM
If there are any IPv4 IPs assigned, check if one of the VMs have the controlPlane IP specified in Cluster.spec.controlPlaneConfiguration.endpoint.host
in the clusterconfig yaml.
If this IP is not present on any control plane VM, make sure the network
has access to the following endpoints:
- vCenter endpoint (must be accessible to EKS Anywhere clusters)
- public.ecr.aws
- anywhere-assets.eks.amazonaws.com (to download the EKS Anywhere binaries, manifests and OVAs)
- distro.eks.amazonaws.com (to download EKS Distro binaries and manifests)
- d2glxqk2uabbnd.cloudfront.net (for EKS Anywhere and EKS Distro ECR container images)
- api.ecr.us-west-2.amazonaws.com (for EKS Anywhere package authentication matching your region)
- d5l0dvt14r5h8.cloudfront.net (for EKS Anywhere package ECR container images)
- api.github.com (only if GitOps is enabled)
- sts.amazonaws.com (only if AWS IAM Authenticator is enabled)
Failed to connect to <vSphere-FQDN>: connection refused
on vsphere-cloud-controller-manager
If the IPv4 IPs are assigned to the VM and you have the workload kubeconfig under <cluster-name>/<cluster-name>-eks-a-cluster.kubeconfig
, you can use it to check vsphere-cloud-controller-manager
logs.
kubectl logs -n kube-system vsphere-cloud-controller-manager-<xxxxx> --kubeconfig <cluster-name>/<cluster-name>-eks-a-cluster.kubeconfig
If you see the message below in the logs, it means your cluster nodes do not have access to vSphere, which is required for the cluster to get to a ready state.
Failed to connect to <vSphere-FQDN>: connection refused
In this case, you need to enable inbound traffic from your cluster nodes on your vCenter’s management network.
Cluster Deletion Fails
If cluster deletion fails, you may need to manually delete the VMs associated with the cluster.
The VMs should be named with the cluster name.
You can power off and delete from disk using the vCenter web user interface.
You may also use govc
:
govc find -type VirtualMachine --name '<cluster-name>*'
This will give you a list of virtual machines that should be associated with your cluster. For each of the VMs you want to delete run:
VM_NAME=vm-to-destroy
govc vm.power -off -force $VM_NAME
govc object.destroy $VM_NAME
Restore VM for machine in vSphere
When a CAPI machine is in not ready state, it is possible that the machine object does not match the underlying VM in vSphere. Instead, the machine is pointing to a VM that does not exist anymore.
In order to solve this, you need to update the vspheremachine
and vspherevm
objects to point to the new VM running in vSphere. This requires getting the UUID
for the new VM (using govc
command), and using it to update the biosUUID
and providerID
fields. Notice that this would create a mismatch between the CAPI resources names (vspheremachine
and vspherevm
) and the actual VM in vSphere. However, that is only an aesthetic inconvenience, since the mapping is done based on UUID
and not name. Once the objects are updated, a machine rollout would replace the VM with a new one, removing the name mismatch problem. You also need to cleanup the error message and error reason from the statuses to resume the reconciliation of these objects. With that done, the machine objects should move to Running
state.
vCenter VM Deletion Causes Cluster Stuck in Deleting State with Orphaned Objects
When a VM in vCenter has been deleted, however, your cluster is stuck in a deleting state with the corresponding the CAPI objects left behind, the user has to manually delete a few related resources to recover. The resources of interest related to the VM are the following: machine
, vspheremachine
, vspherevm
, and node
.
To clean up the resources in order to recover from this deleting state:
-
Delete the the corresponding CAPI
machine
object.kubectl delete machine -n eksa-system <machine-name>
-
If that is stuck, it means that the
capi-controller-manager
is unable to remove the finalizer from the object, so the user needs force delete the object.You can do this by editing the machine object using
kubectl
and removing the finalizers on it.kubectl edit machine -n eksa-system <machine-name>`
Look for the finalizers field in the metadata section of the resource. remove the finalizers. Save your changes.
metadata: finalizers: - finalizer.example.com
Remove the finalizers field under metada and save your changes. After, the object should be deleted.
-
If the
capv-controller-manager
is unable to to clean up thevspheremachine
andvspherevm
objects, repeat the steps above for those orphaned objects that are related to the deleted VM. -
You may also need to manually delete the corresponding orphaned
node
object usingkubectl
.kubectl delete node <node-name>
BottleRocket OS Etcd endpoints have inadvertent IP change
If a cluster becomes inaccessible after etcd VMs IP address change but Control Plane is still reachable,,
ssh
into one of the control plane node, check kube-api server container logs in /var/log/containers
.
If the api server is having this error where the etcd endpoint is not reachable, the issue can be resolved by 2 ways.
2024-09-06T00:48:14.965194255Z stderr F W0906 00:48:14.965078 1 logging.go:59] [core] [Channel #5 SubChannel #6] grpc: addrConn.createTransport failed to connect to {Addr: "198.18.134.22:2379", ServerName: "198.18.134.22", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 198.18.134.22:2379: operation was canceled
If you have an access to DHCP server and can reboot VMs manually, create DHCP reservations for the etcd VMs using their original IP addresses. Then, reboot the etcd nodes to ensure they obtain the correct IPs.
If you do not have an access to the DHCP server or cannot reboot VMs manually, perform below steps to manually assign old IPs to VMs.
-
ssh
into etcd VM using a new IP. -
Find the correct old IP associated with this VM by looking at the etcd container log file located under
/var/log/containers
. -
Run below commands to manually change the new IP to the old IP. You will loose connection to this VM when you delete the new IP from the interface.
sudo sheltie # old IP will get added as a secondary IP on eth0 interface ip addr change <old-ip/cidr> dev eth0 # add a default ip route in this VM using the old IP as the source ip route add default via <router-ip> dev eth0 proto dhcp src <old-ip> # setup the interface so it will promote the secondary IP address when we remove the first sysctl net.ipv4.conf.eth0.promote_secondaries=1 # remove the new IP address and confirm ip addr del <new-ip/cidr> dev eth0
-
ssh
into the VM using old IP, confirm the IP has been changed by runningip a
command. -
Perform the above steps for all etcd machines.
The cluster should be accessible again in some time.
NOTE: Make sure to roll out new ETCD VMs based on DHCP lease expiration time to avoid this issue in future.
Troubleshooting GitOps integration
Cluster creation failure leaves outdated cluster configuration in GitHub.com repository
Failed cluster creation can sometimes leave behind cluster configuration files committed to your GitHub.com repository.
Make sure to delete these configuration files before you re-try eksctl anywhere create cluster
.
If these configuration files are not deleted, GitOps installation will fail but cluster creation will continue.
They’ll generally be located under the directory
clusters/$CLUSTER_NAME
if you used the default path in your flux
gitops
config.
Delete the entire directory named $CLUSTER_NAME.
Cluster creation failure leaves empty GitHub.com repository
Failed cluster creation can sometimes leave behind a completely empty GitHub.com repository. This can cause the GitOps installation to fail if you re-try the creation of a cluster which uses this repository. If cluster creation failure leaves behind an empty github repository, please manually delete the created GitHub.com repository before attempting cluster creation again.
Changes not syncing to cluster
Please remember that the only fields currently supported for GitOps are:
Cluster
Cluster.workerNodeGroupConfigurations.count
Cluster.workerNodeGroupConfigurations.machineGroupRef.name
Worker Nodes
VsphereMachineConfig.diskGiB
VsphereMachineConfig.numCPUs
VsphereMachineConfig.memoryMiB
VsphereMachineConfig.template
VsphereMachineConfig.datastore
VsphereMachineConfig.folder
VsphereMachineConfig.resourcePool
If you’ve changed these fields and they’re not syncing to the cluster as you’d expect,
check out the logs of the pod in the source-controller
deployment in the flux-system
namespaces.
If flux
is having a problem connecting to your GitHub repository the problem will be logged here.
$ kubectl get pods -n flux-system
NAME READY STATUS RESTARTS AGE
helm-controller-7d644b8547-k8wfs 1/1 Running 0 4h15m
kustomize-controller-7cf5875f54-hs2bt 1/1 Running 0 4h15m
notification-controller-776f7d68f4-v22kp 1/1 Running 0 4h15m
source-controller-7c4555748d-7c7zb 1/1 Running 0 4h15m
$ kubectl logs source-controller-7c4555748d-7c7zb -n flux-system
A well behaved flux pod will simply log the ongoing reconciliation process, like so:
{"level":"info","ts":"2021-07-01T19:58:51.076Z","logger":"controller.gitrepository","msg":"Reconciliation finished in 902.725344ms, next run in 1m0s","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system"}
{"level":"info","ts":"2021-07-01T19:59:52.012Z","logger":"controller.gitrepository","msg":"Reconciliation finished in 935.016754ms, next run in 1m0s","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system"}
{"level":"info","ts":"2021-07-01T20:00:52.982Z","logger":"controller.gitrepository","msg":"Reconciliation finished in 970.03174ms, next run in 1m0s","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system"}
If there are issues connecting to GitHub, you’ll instead see exceptions in the source-controller
log stream.
For example, if the deploy key used by flux
has been deleted, you’d see something like this:
{"level":"error","ts":"2021-07-01T20:04:56.335Z","logger":"controller.gitrepository","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://git@github.com/youruser/gitops-vsphere-test', error: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain"}
Other ways to troubleshoot GitOps integration
If you’re still having problems after deleting any empty EKS Anywhere created GitHub repositories and looking at the source-controller
logs.
You can look for additional issues by checking out the deployments in the flux-system
and eksa-system
namespaces and ensure they’re running and their log streams are free from exceptions.
$ kubectl get deployments -n flux-system
NAME READY UP-TO-DATE AVAILABLE AGE
helm-controller 1/1 1 1 4h13m
kustomize-controller 1/1 1 1 4h13m
notification-controller 1/1 1 1 4h13m
source-controller 1/1 1 1 4h13m
$ kubectl get deployments -n eksa-system
NAME READY UP-TO-DATE AVAILABLE AGE
eksa-controller-manager 1/1 1 1 4h13m
Snow troubleshooting
Device outage
These are some conditions that can cause a device outage:
- Intentional outage (a planned power outage or an outage when moving devices, for example).
- Unintentional outage (a subset of devices or all devices are rebooted, or experiencing network disconnections from the LAN, which make device offline or isolated from the cluster).
NOTE: If all Snowball Edge devices are moved to a different place and connected to a different local network, make sure you use the same subnet, netmask, and gateway for your network configuration. After moving, devices and all node instances need to maintain the original IP addresses. Then, follow the recover cluster procedure to get your cluster up and running again. Otherwise, it might be impossible to resume the cluster.
To recover a cluster
If there is a subset of devices or all devices experience an outage, see Downloading and Installing the Snowball Edge client to get the Snowball Edge client and then follow these steps:
-
Reboot and unlock all affected devices manually.
// use reboot-device command to reboot device, this may take several minutes $ path-to-snowballEdge_CLIENT reboot-device --endpoint https://snowball-ip --manifest-file path-to-manifest-file --unlock-code unlock-code // use describe-device command to check the status of device $ path-to-snowballEdge_CLIENT describe-device --endpoint https://snowball-ip --manifest-file path-to-manifest-file --unlock-code unlock-code // when the State in the output of describe-device is LOCKED, run unlock-device $ path-to-snowballEdge_CLIENT unlock-device --endpoint https://snowball-ip --manifest-file path-to-manifest-file --unlock-code unlock-code // use describe-device command to check the status of device until device is unlocked $ path-to-snowballEdge_CLIENT describe-device --endpoint https://snowball-ip --manifest-file path-to-manifest-file --unlock-code unlock-code
-
Get all instance IDs that were part of the cluster by looking up the impacted device IP in the PROVIDERID column.
$ kubectl get machines -A --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig NAMESPACE NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION eksa-system machine-name-1 cluster-name node-name-1 aws-snow:///192.168.1.39/s.i-8319d8c75d54a32cc Running 82s v1.24.9-eks-1-24-7 eksa-system machine-name-2 cluster-name node-name-2 aws-snow:///192.168.1.39/s.i-8d7d3679a1713e403 Running 82s v1.24.9-eks-1-24-7 eksa-system machine-name-3 cluster-name node-name-3 aws-snow:///192.168.1.231/s.i-8201c356fb369c37f Running 81s v1.24.9-eks-1-24-7 eksa-system machine-name-4 cluster-name node-name-4 aws-snow:///192.168.1.39/s.i-88597731b5a4a9044 Running 81s v1.24.9-eks-1-24-7 eksa-system machine-name-5 cluster-name node-name-5 aws-snow:///192.168.1.77/s.i-822f0f46267ad4c6e Running 81s v1.24.9-eks-1-24-7
-
Start all instances on the impacted devices as soon as possible.
$ aws ec2 start-instances --instance-id instance-id-1 instance-id-2 ... --endpoint http://snowball-ip:6078 --profile profile-name
-
Check the balance status of the current cluster after the cluster is ready again.
$ kubectl get machines -A --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig
-
Check if you have unstacked etcd machines.
-
If you have unstacked etcd machines, check the provision of unstacked etcd machines. You can find the device IP in the
PROVIDERID
column.- If there are more than 1 unstacked etcd machines provisioned on the same device and there are devices with no unstacked etcd machine, you need to rebalance unstacked etcd nodes. Follow the rebalance nodes procedure to rebalance your unstacked etcd nodes in order to recover high availability.
- If you have your etcd nodes evenly distributed with 1 device having at most 1 etcd node, you are done with the recovery.
-
If you don’t have unstacked etcd machines, check the provision of control plane machines. You can find the device IP in
PROVIDERID
column.- If there are more than 1 control plan machines provisioned on the same device and there are devices with no control plane machine, you need to rebalance control plane nodes. Follow the rebalance nodes procedure to rebalance your control plane nodes in order to recover high availability.
- If you have your control plane nodes evenly distributed with 1 device having at most 1 control plane node, you are done with the recovery.
-
How to rebalance nodes
-
Confirm the machines you want to delete and get their node name from the NODENAME column.
You can determine which machines need to be deleted by referring to the AGE column. The newly-generated machines have short AGE. Delete those new etcd/control plane machine nodes which are not the only etcd/control plane machine nodes on their devices.
-
Cordon each node so no further workloads are scheduled to run on it.
$ kubectl cordon node-name --ignore-daemonsets --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig
-
Drain machine nodes of all current workloads.
$ kubectl drain node-name --ignore-daemonsets --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig
-
Delete machine node.
$ kubectl delete node node-name --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig
-
Repeat this process until etcd/control plane machine nodes are evenly provisioned.
Device replacement
There might be some reasons which can require device replacement:
- When a subset of devices are determined to be broken and you want to join a new device into current cluster.
- When a subset of devices are offline and come back with a new device IP.
To upgrade a cluster with new devices:
-
Add new certificates to the certificate file and new credentials to the credential file.
-
Change the device list in your cluster yaml configuration file and use the
eksctl anywhere upgrade cluster
command.$ eksctl anywhere upgrade cluster -f eks-a-cluster.yaml
Node outage
Unintentional instance outage
When an instance is in exception status (for example, terminated/stopped for some reason), it will be discovered automatically by Amazon EKS Anywhere and there will be a new replacement instance node created after 5 minutes. The new node will be provisioned to devices based on even provision strategy. In this case, the new node will be provisioned to a device with the fewest number of machines of the same type. Sometimes, more than one device will have the same number of machines of this type. Thus, we cannot guarantee it will be provisioned on the original device.
Intentional node replacement
If you want to replace an unhealthy node which didn’t get detected by Amazon EKS Anywhere automatically, you can follow these steps.
NOTE: Do not delete all worker machine nodes or control plane nodes or etcd nodes at the same time. Make sure you delete machine nodes one by one.
-
Cordon nodes so no further workloads are scheduled to run on it.
$ kubectl cordon node-name --ignore-daemonsets --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig
-
Drain machine nodes of all current workloads.
$ kubectl drain node-name --ignore-daemonsets --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig
-
Delete machine nodes.
$ kubectl delete node node-name --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig
-
New nodes will be provisioned automatically. You can check the provision result with the get machines command.
$ kubectl get machines -A --kubeconfig=cluster-name/cluster-name-eks-a-cluster.kubeconfig
Cluster Deletion Fails
If your Amazon EKS Anywhere cluster creation failed and the eksctl anywhere delete cluster -f eksa-cluster.yaml
command cannot be run successfully, manually delete a few resources before trying the command again. Run the following commands from the computer on which you set up the AWS configuration and have the Snowball Edge Client installed
. If you are using multiple Snowball Edge devices, run these commands on each.
// get the list of instance ids that are created for Amazon EKS Anywhere cluster,
// that can be identified by cluster name in the tag of the output
$ aws ec2 describe-instances --endpoint http://snowball-ip:8008 --profile profile-name
// the next two commands are for deleting DNI, this needs to be done before deleting instance
$ PATH_TO_Snowball_Edge_CLIENT/bin/snowballEdge describe-direct-network-interfaces --endpoint https://snowball-ip --manifest-file path-to-manifest-file --unlock-code unlock-code
// DNI arn can be found in the output of last command, which is associated with the specific instance id you get from describe-instances
$ PATH_TO_Snowball_Edge_CLIENT/bin/snowballEdge delete-direct-network-interface --direct-network-interface-arn DNI-ARN --endpoint https://snowball-ip --manifest-file path-to-manifest-file --unlock-code unlock-code
// delete instance
$ aws ec2 terminate-instances --instance-id instance-id-1,instance-id-2 --endpoint http://snowball-ip:8008 --profile profile-name
Generate a log file from the Snowball Edge device
You can also generate a log file from the Snowball Edge device for AWS Support. See AWS Snowball Edge Logs in this guide.
Nutanix troubleshooting
Error creating Nutanix client
Error: error creating nutanix client: username, password and endpoint are required
Verify if the required environment variables are set before creating the clusters:
export EKSA_NUTANIX_USERNAME="<Nutanix-username>"
export EKSA_NUTANIX_PASSWORD="<Nutanix-password>"
Also, make sure the spec.endpoint
is correctly configured in the NutanixDatacenterConfig
. The value of the spec.endpoint
should be the IP or FQDN of Prism Central.
x509: certificate signed by unknown authority
Failure of the nutanix Provider setup is valid
validation with the x509: certificate signed by unknown authority
message indicates the certificate of the Prism Central endpoint is not trusted.
In case Prism Central is configured with self-signed certificates, it is recommended to configure the additionalTrustBundle
in the NutanixDatacenterConfig
. More information can be found here
.