Troubleshooting operating system issues
OKD runs on FCOS. You can follow these procedures to troubleshoot problems related to the operating system.
Investigating kernel crashes
Enabling kdump
The kdump
service, included in kexec-tools
, provides a crash-dumping mechanism. You can use this service to save the contents of the system’s memory for later analysis.
The kdump
service is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see https://access.redhat.com/support/offerings/techpreview/.
FCOS ships with kexec-tools
, but manual configuration is required to enable kdump
.
Procedure
Perform the following steps to enable kdump
on FCOS.
To reserve memory for the crash kernel during the first kernel booting, provide kernel arguments by entering the following command:
# rpm-ostree kargs --append='crashkernel=256M'
Optional: To write the crash dump over the network or to some other location, rather than to the default local
/var/crash
location, edit the/etc/kdump.conf
configuration file.Network dumps are required when using LUKS.
kdump
does not support local crash dumps on LUKS-encrypted devices.For details on configuring the
kdump
service, see the comments in/etc/sysconfig/kdump
,/etc/kdump.conf
, and thekdump.conf
manual page.Enable the
kdump
systemd service.# systemctl enable kdump.service
Reboot your system.
# systemctl reboot
Ensure that
kdump
has loaded a crash kernel by checking that thekdump.service
has started and exited successfully and thatcat /sys/kernel/kexec_crash_loaded
prints1
.
Enabling kdump on day-1
The kdump
service is intended to be enabled per node to debug kernel problems. Because there are costs to having kdump
enabled, and these costs accumulate with each additional kdump
-enabled node, it is recommended that kdump
only be enabled on each node as needed. Potential costs of enabling kdump
on each node include:
Less available RAM due to memory being reserved for the crash kernel.
Node unavailability while the kernel is dumping the core.
Additional storage space being used to store the crash dumps.
Not being production-ready because the
kdump
service is in Technology Preview.
If you are aware of the downsides and trade-offs of having the kdump
service enabled, it is possible to enable kdump
in a cluster-wide fashion. Although machine-specific machine configs are not yet supported, you can perform the previous steps through a systemd
unit in a MachineConfig
object on day-1 and have kdump enabled on all nodes in the cluster. You can create a MachineConfig
object and inject that object into the set of manifest files used by Ignition during cluster setup. See “Customizing nodes” in the Installing → Installation configuration section for more information and examples on how to use Ignition configs.
Procedure
Create a MachineConfig
object for cluster-wide configuration:
Create a Butane config file,
99-worker-kdump.bu
, that configures and enables kdump:variant: openshift
version: 4.10.0
metadata:
name: 99-worker-kdump (1)
labels:
machineconfiguration.openshift.io/role: worker (1)
openshift:
kernel_arguments: (2)
- crashkernel=256M
storage:
files:
- path: /etc/kdump.conf (3)
mode: 0644
overwrite: true
contents:
inline: |
path /var/crash
core_collector makedumpfile -l --message-level 7 -d 31
- path: /etc/sysconfig/kdump (4)
mode: 0644
overwrite: true
contents:
inline: |
KDUMP_COMMANDLINE_REMOVE="hugepages hugepagesz slub_debug quiet log_buf_len swiotlb"
KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd hest_disable"
KEXEC_ARGS="-s"
KDUMP_IMG="vmlinuz"
systemd:
units:
- name: kdump.service
enabled: true
1 Replace worker
withmaster
in both locations when creating aMachineConfig
object for control plane nodes.2 Provide kernel arguments to reserve memory for the crash kernel. You can add other kernel arguments if necessary. 3 If you want to change the contents of /etc/kdump.conf
from the default, include this section and modify theinline
subsection accordingly.4 If you want to change the contents of /etc/sysconfig/kdump
from the default, include this section and modify theinline
subsection accordingly.Use Butane to generate a machine config YAML file,
99-worker-kdump.yaml
, containing the configuration to be delivered to the nodes:$ butane 99-worker-kdump.bu -o 99-worker-kdump.yaml
Put the YAML file into manifests during cluster setup. You can also create this
MachineConfig
object after cluster setup with the YAML file:$ oc create -f ./99-worker-kdump.yaml
Testing the kdump configuration
See the Capturing the Dump section in the Fedora documentation for kdump
.
Analyzing a core dump
See the Dump Analysis section in the Fedora documentation for kdump
.
Additional resources
kdump.conf(5) — a manual page for the
/etc/kdump.conf
configuration file containing the full documentation of available optionskexec(8) — a manual page for
kexec
Red Hat Knowledgebase article regarding
kexec
andkdump
.