Virtual machine health checks

You can configure virtual machine (VM) health checks by defining readiness and liveness probes in the VirtualMachine resource.

About readiness and liveness probes

Use readiness and liveness probes to detect and handle unhealthy virtual machines (VMs). You can include one or more probes in the specification of the VM so that traffic does not reach a VM that is not ready for it, and so that a new VM is created when a VM becomes unresponsive.

A readiness probe determines whether a VM is ready to accept service requests. If the probe fails, the VM is removed from the list of available endpoints until the VM is ready.

A liveness probe determines whether a VM is responsive. If the probe fails, the VM is deleted and a new VM is created to restore responsiveness.

You can configure readiness and liveness probes by setting the spec.readinessProbe and the spec.livenessProbe fields of the VirtualMachine object. These fields support the following tests:

HTTP GET

The probe determines the health of the VM by using a web hook, that is, by sending an HTTP GET request to the VM. The test is successful if the HTTP response code is in the range 200 to 399. You can use an HTTP GET test with applications that return HTTP status codes when they are completely initialized.

TCP socket

The probe attempts to open a socket to the VM. The VM is considered healthy only if the probe can establish a connection. You can use a TCP socket test with applications that do not start listening until initialization is complete.

Guest agent ping

The probe uses the guest-ping command to determine if the QEMU guest agent is running on the virtual machine.
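
The probe types above can be combined in one VM. The following fragment is a minimal sketch, not a complete manifest. It assumes that, as in upstream KubeVirt, the probe fields belong to the VM instance template (spec.template.spec) of a full VirtualMachine manifest. The VM name, port, and path are hypothetical.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: example-vm                 # hypothetical name
spec:
  running: true
  template:
    spec:
      # domain, volumes, and networks omitted for brevity
      readinessProbe:              # gates traffic until the port accepts connections
        tcpSocket:
          port: 1500               # assumed application port
        initialDelaySeconds: 120
        periodSeconds: 20
        timeoutSeconds: 10
      livenessProbe:               # detects a VM that has stopped responding
        httpGet:
          port: 1500
          path: /healthz           # assumed health endpoint
        initialDelaySeconds: 120
        periodSeconds: 20
        timeoutSeconds: 10
```

Using a readiness probe and a liveness probe together separates "not ready for traffic yet" from "needs to be replaced", which avoids recreating a VM that is only slow to start.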

Defining an HTTP readiness probe

Define an HTTP readiness probe by setting the spec.readinessProbe.httpGet field of the virtual machine (VM) configuration.

Procedure

Include details of the readiness probe in the VM configuration file.

Sample readiness probe with an HTTP GET test

# ...
spec:
  readinessProbe:
    httpGet: (1)
      port: 1500 (2)
      path: /healthz (3)
      httpHeaders:
      - name: Custom-Header
        value: Awesome
    initialDelaySeconds: 120 (4)
    periodSeconds: 20 (5)
    timeoutSeconds: 10 (6)
    failureThreshold: 3 (7)
    successThreshold: 3 (8)
# ...

1	The HTTP GET request to perform to connect to the VM.
2	The port of the VM that the probe queries. In the above example, the probe queries port 1500.
3	The path to access on the HTTP server. In the above example, if the handler for the server’s `/healthz` path returns a success code, the VM is considered to be healthy. If the handler returns a failure code, the VM is removed from the list of available endpoints.
4	The time, in seconds, after the VM starts before the readiness probe is initiated.
5	The delay, in seconds, between performing probes. The default delay is 10 seconds. This value must be greater than `timeoutSeconds`.
6	The number of seconds of inactivity after which the probe times out and the VM is assumed to have failed. The default value is 1. This value must be lower than `periodSeconds`.
7	The number of times that the probe is allowed to fail. The default is 3. After the specified number of attempts, the pod is marked `Unready`.
8	The number of times that the probe must report success, after a failure, to be considered successful. The default is 1.

Create the VM by running the following command:
```
$ oc create -f <file_name>.yaml
```

Defining a TCP readiness probe

Define a TCP readiness probe by setting the spec.readinessProbe.tcpSocket field of the virtual machine (VM) configuration.

Procedure

Include details of the TCP readiness probe in the VM configuration file.

Sample readiness probe with a TCP socket test

# ...
spec:
  readinessProbe:
    initialDelaySeconds: 120 (1)
    periodSeconds: 20 (2)
    tcpSocket: (3)
      port: 1500 (4)
    timeoutSeconds: 10 (5)
# ...

1	The time, in seconds, after the VM starts before the readiness probe is initiated.
2	The delay, in seconds, between performing probes. The default delay is 10 seconds. This value must be greater than `timeoutSeconds`.
3	The TCP action to perform.
4	The port of the VM that the probe queries.
5	The number of seconds of inactivity after which the probe times out and the VM is assumed to have failed. The default value is 1. This value must be lower than `periodSeconds`.

Create the VM by running the following command:
```
$ oc create -f <file_name>.yaml
```

Defining an HTTP liveness probe

Define an HTTP liveness probe by setting the spec.livenessProbe.httpGet field of the virtual machine (VM) configuration. You can define both HTTP and TCP tests for liveness probes in the same way as readiness probes. This procedure configures a sample liveness probe with an HTTP GET test.
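
For comparison, a TCP variant of the liveness probe might look like the following sketch. It reuses the field values from the TCP readiness sample, and port 1500 is an assumed application port.

```yaml
# ...
spec:
  livenessProbe:
    tcpSocket:
      port: 1500               # assumed port that the application listens on
    initialDelaySeconds: 120
    periodSeconds: 20
    timeoutSeconds: 10
# ...
```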

Procedure

Include details of the HTTP liveness probe in the VM configuration file.

Sample liveness probe with an HTTP GET test

# ...
spec:
  livenessProbe:
    initialDelaySeconds: 120 (1)
    periodSeconds: 20 (2)
    httpGet: (3)
      port: 1500 (4)
      path: /healthz (5)
      httpHeaders:
      - name: Custom-Header
        value: Awesome
    timeoutSeconds: 10 (6)
# ...

1	The time, in seconds, after the VM starts before the liveness probe is initiated.
2	The delay, in seconds, between performing probes. The default delay is 10 seconds. This value must be greater than `timeoutSeconds`.
3	The HTTP GET request to perform to connect to the VM.
4	The port of the VM that the probe queries. In the above example, the probe queries port 1500. The VM installs and runs a minimal HTTP server on port 1500 via cloud-init.
5	The path to access on the HTTP server. In the above example, if the handler for the server’s `/healthz` path returns a success code, the VM is considered to be healthy. If the handler returns a failure code, the VM is deleted and a new VM is created.
6	The number of seconds of inactivity after which the probe times out and the VM is assumed to have failed. The default value is 1. This value must be lower than `periodSeconds`.

Create the VM by running the following command:
```
$ oc create -f <file_name>.yaml
```
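
The liveness sample relies on an HTTP server that the VM starts on port 1500 through cloud-init. The following fragment is one possible way to do that, offered as an untested sketch. It assumes a guest image that includes cloud-init, systemd, and /usr/bin/python3. The disk name, the unit name, and the /srv/health directory are hypothetical.

```yaml
# ...
spec:
  template:
    spec:
      domain:
        devices:
          disks:
          - name: cloudinitdisk        # hypothetical disk name
            disk:
              bus: virtio
      volumes:
      - name: cloudinitdisk
        cloudInitNoCloud:
          userData: |-
            #cloud-config
            bootcmd:
              # Serve an empty healthz file so that GET /healthz returns 200.
              - ["mkdir", "-p", "/srv/health"]
              - ["touch", "/srv/health/healthz"]
              - ["systemd-run", "--unit=healthz", "/usr/bin/python3", "-m", "http.server", "1500", "--directory", "/srv/health"]
# ...
```

The commands are placed in bootcmd, which runs on every boot, so that the server is started again after the VM restarts.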

Defining a watchdog

You can define a watchdog to monitor the health of the guest operating system by performing the following steps:

Configure a watchdog device for the virtual machine (VM).
Install the watchdog agent on the guest.

The watchdog device monitors the agent and performs one of the following actions if the guest operating system is unresponsive:

- poweroff: The VM powers down immediately. If spec.running is set to true or spec.runStrategy is not set to manual, then the VM reboots.
- reset: The VM reboots in place and the guest operating system cannot react.
  The reboot time might cause liveness probes to time out. If cluster-level protections detect a failed liveness probe, the VM might be forcibly rescheduled, increasing the reboot time.
- shutdown: The VM gracefully powers down by stopping all services.

Watchdog is not available for Windows VMs.
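
Because a watchdog-triggered reboot can outlast a liveness probe window, one option is to give the liveness probe more slack on VMs that use a watchdog. The values below are illustrative assumptions, not recommendations. With periodSeconds set to 20 and failureThreshold set to 6, the probe tolerates roughly two minutes of failed checks before the VM is treated as unresponsive.

```yaml
# ...
spec:
  livenessProbe:
    httpGet:
      port: 1500               # assumed application port
      path: /healthz           # assumed health endpoint
    initialDelaySeconds: 120
    periodSeconds: 20
    timeoutSeconds: 10
    failureThreshold: 6        # about 6 x 20 = 120 s of failures tolerated (illustrative value)
# ...
```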

Configuring a watchdog device for the virtual machine

You can configure a watchdog device for the virtual machine (VM) by adding it to the VM manifest.

Prerequisites

The VM must have kernel support for an i6300esb watchdog device. Fedora images support i6300esb.

Procedure

Create a YAML file with the following contents:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: vm2-rhel84-watchdog
  name: <vm-name>
spec:
  running: false
  template:
    metadata:
      labels:
        kubevirt.io/vm: vm2-rhel84-watchdog
    spec:
      domain:
        devices:
          watchdog:
            name: <watchdog>
            i6300esb:
              action: "poweroff" (1)
# ...

1	Specify `poweroff`, `reset`, or `shutdown`.

The example above configures the i6300esb watchdog device on a RHEL8 VM with the poweroff action and exposes the device as /dev/watchdog.

This device can now be used by the watchdog binary.

Apply the YAML file to your cluster by running the following command:
```
$ oc apply -f <file_name>.yaml
```

Verification

This procedure is provided for testing watchdog functionality only and must not be run on production machines.

Log in to the VM and run the following command to verify that the VM is connected to the watchdog device:
```
$ lspci | grep watchdog -i
```
Run one of the following commands to confirm the watchdog is active:
- Trigger a kernel panic:
```
# echo c > /proc/sysrq-trigger
```
- Stop the watchdog service:
```
# pkill -9 watchdog
```

Installing the watchdog agent on the guest

Install the watchdog agent on the guest and start the watchdog service.

Procedure

Log in to the virtual machine as the root user.
Install the watchdog package and its dependencies:
```
# yum install watchdog
```
Uncomment the following line in the /etc/watchdog.conf file by removing the leading # character, and save the changes:
```
#watchdog-device = /dev/watchdog
```
Enable the watchdog service to start on boot and start it immediately:
```
# systemctl enable --now watchdog.service
```
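
The manual steps above could also be automated at first boot with cloud-init. The following user data is a sketch that assumes a guest image with cloud-init and the default /etc/watchdog.conf file shipped by the watchdog package. It has not been validated against a specific image.

```yaml
#cloud-config
packages:
  - watchdog
runcmd:
  # Uncomment the watchdog-device line shipped in the default configuration.
  - ["sed", "-i", "s|^#watchdog-device|watchdog-device|", "/etc/watchdog.conf"]
  # Start the service now and on every subsequent boot.
  - ["systemctl", "enable", "--now", "watchdog.service"]
```

You would pass this user data to the VM through a cloud-init volume, for example cloudInitNoCloud, in the VM manifest.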

Defining a guest agent ping probe

Define a guest agent ping probe by setting the spec.readinessProbe.guestAgentPing field of the virtual machine (VM) configuration.

The guest agent ping probe is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Prerequisites

The QEMU guest agent must be installed and enabled on the virtual machine.

Procedure

Include details of the guest agent ping probe in the VM configuration file. For example:

Sample guest agent ping probe

# ...
spec:
  readinessProbe:
    guestAgentPing: {} (1)
    initialDelaySeconds: 120 (2)
    periodSeconds: 20 (3)
    timeoutSeconds: 10 (4)
    failureThreshold: 3 (5)
    successThreshold: 3 (6)
# ...

1	The guest agent ping probe to connect to the VM.
2	Optional: The time, in seconds, after the VM starts before the guest agent probe is initiated.
3	Optional: The delay, in seconds, between performing probes. The default delay is 10 seconds. This value must be greater than `timeoutSeconds`.
4	Optional: The number of seconds of inactivity after which the probe times out and the VM is assumed to have failed. The default value is 1. This value must be lower than `periodSeconds`.
5	Optional: The number of times that the probe is allowed to fail. The default is 3. After the specified number of attempts, the pod is marked `Unready`.
6	Optional: The number of times that the probe must report success, after a failure, to be considered successful. The default is 1.

Create the VM by running the following command:
```
$ oc create -f <file_name>.yaml
```
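
The prerequisite that the QEMU guest agent is installed and enabled can be met in several ways. One option, sketched here under the assumption that the guest image includes cloud-init and provides a qemu-guest-agent package, is to install and start the agent at first boot:

```yaml
#cloud-config
packages:
  - qemu-guest-agent
runcmd:
  # Start the agent now and on every subsequent boot.
  - ["systemctl", "enable", "--now", "qemu-guest-agent"]
```

Because the agent is installed during first boot, the initialDelaySeconds value of the probe should leave enough time for the installation to finish.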

Additional resources

Monitoring application health by using health checks