07-4. Supporting Nvidia GPUs
CUDA is NVIDIA's parallel computing framework, programming model and APIs; at the time of writing the latest version is 9.2.
This document first covers installing and verifying the cuda 9.2 packages, which include the development toolkit (ToolKit: binaries, header files, libraries, sample programs, etc.) and the GPU driver.
Once cuda is installed the kernel can recognize and use the GPU, but nvidia-container-runtime still needs to be installed so that containers can consume GPU resources through the nvidia runtime it provides.
Finally, the kubelet configuration parameters must be adjusted to enable GPU acceleration.
Install the kernel headers and development packages
When the cuda package is installed, it uses the host's kernel headers and libraries to compile a driver matching the running kernel on the fly, so these two packages must be installed first.
If you are running the kernel-lt kernel (installed as an upgrade from the elrepo repository), use:
sudo yum install kernel-lt-devel-$(uname -r) kernel-lt-headers-$(uname -r)
Otherwise, install the packages from the official repositories:
sudo yum install epel-release
sudo yum install dkms
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
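A quick sanity check (an extra step, not part of the original flow) is to confirm that the installed devel and header packages match the running kernel; for kernel-lt substitute kernel-lt-devel and kernel-lt-headers:
uname -r
rpm -q kernel-devel-$(uname -r) kernel-headers-$(uname -r)
Both packages should be reported as installed for exactly the version printed by uname -r.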
Install cuda 9.2
Download the offline installer packages from the Nvidia website.
The offline installer consists of two parts:
- Base package: Base Installer
- Patch package: Patch 1 (Released Aug 6, 2018)
[work@m7-demo-136003 ~]$ ls -l /tmp/cuda-repo-rhel7-9-2-*
-rw-r--r-- 1 work work 192984471 Aug 14 20:47 /tmp/cuda-repo-rhel7-9-2-148-local-patch-1-1.0-1.x86_64.rpm # patch package
-rw-r--r-- 1 work work 1718223108 Aug 14 20:45 /tmp/cuda-repo-rhel7-9-2-local-9.2.148-1.x86_64.rpm # base package
If you are running the kernel-lt kernel, disable the elrepo repository first, otherwise yum reports RPM version conflicts:
sudo mv /etc/yum.repos.d/elrepo.repo{,.bak}
Install the two downloaded packages together:
sudo rpm -i cuda-repo-rhel7-9-2-local-9.2.148-1.x86_64.rpm cuda-repo-rhel7-9-2-148-local-patch-1-1.0-1.x86_64.rpm
sudo yum clean all
sudo yum install cuda
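After yum install cuda finishes, a quick check (assuming the driver compiled successfully against the running kernel) is to load the module and confirm it is present; if this fails, a reboot may be required first, for example when the nouveau driver is still loaded:
sudo nvidia-modprobe
lsmod | grep ^nvidia
If elrepo was disabled above, it can be restored afterwards with sudo mv /etc/yum.repos.d/elrepo.repo{.bak,}.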
Configure cuda
Add the cuda binary directory to PATH:
export PATH=/usr/local/cuda-9.2/bin${PATH:+:${PATH}}
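To make the PATH change survive new login shells, one option (an assumption, not part of the original steps) is to drop it into /etc/profile.d:
sudo tee /etc/profile.d/cuda.sh <<'EOF'
export PATH=/usr/local/cuda-9.2/bin${PATH:+:${PATH}}
EOF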
Confirm that the /etc/ld.so.conf.d/ directory contains the cuda-9-2.conf file:
[root@m7-demo-136001 deviceQuery]# cat /etc/ld.so.conf.d/cuda-9-2.conf
/usr/local/cuda-9.2/targets/x86_64-linux/lib
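With PATH and the linker configuration in place, a minimal sanity check (assuming the default install prefix /usr/local/cuda-9.2):
sudo ldconfig
which nvcc
nvcc --version
which nvcc should print /usr/local/cuda-9.2/bin/nvcc and nvcc --version should report release 9.2.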
Load the nvidia driver automatically at boot:
sudo tee -a /etc/rc.local <<EOF
/bin/nvidia-modprobe
EOF
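Note that on CentOS 7 the rc-local service only executes /etc/rc.d/rc.local (which /etc/rc.local links to) when the file is executable, so as an extra step make sure it is:
sudo chmod +x /etc/rc.d/rc.local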
Copy the relevant library files into the /usr/cuda_files directory:
cat > generate_cuda_files.sh <<"EOF"
#!/bin/sh
TARGET_DIR="/usr/cuda_files"
rm -rf ${TARGET_DIR} || :
mkdir -p $TARGET_DIR
echo "Try install cuda dependent libraries to $TARGET_DIR"
INFRASTRUCTURE="x86-64"
LIBCUDA_PATH=$(ldconfig -p | grep libcuda | grep $INFRASTRUCTURE | awk '{print $4}' | tail -n 1)
if [ -z "$LIBCUDA_PATH" ]; then
echo "Cannot find libcuda.so in your environment"
exit 1
else
echo "Find libcuda at $LIBCUDA_PATH"
fi
LIBNVIDIALOADER_PATH=$(ldd $LIBCUDA_PATH | grep libnvidia | awk '{print $3}' | tail -n 1)
LIBCUDA_DIR=$(dirname $LIBCUDA_PATH)
LIBNVIDIALOADER_DIR=$(dirname $LIBNVIDIALOADER_PATH)
echo "Copy library libcuda* from $LIBCUDA_DIR to $TARGET_DIR..."
cp -v $LIBCUDA_DIR/libcuda* $TARGET_DIR
echo "Copy library libnvidia* from $LIBNVIDIALOADER_DIR to $TARGET_DIR..."
cp -v $LIBNVIDIALOADER_DIR/libnvidia* $TARGET_DIR
#echo "Validate by run basic tensorflow docker image"
#DOCKER_IMAGE="docker02:35000/operator-repository/train-tensorflow-gpu:release-3.1.2"
#DEV_MOUNT="--device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 -v $TARGET_DIR:/usr/cuda_files"
#TEST_COMMAND="import tensorflow; tensorflow.Session()"
##docker pull $DOCKER_IMAGE
#if [ $? -ne 0 ]; then
# echo "Fail to pull image $DOCKER_IMAGE, maybe need to login (docker02: user=testuser passwd=testpassword)"
# exit 1
#fi
#
#docker run --rm -it $DEV_MOUNT $DOCKER_IMAGE python -c "$TEST_COMMAND"
#if [ $? -eq 0 ]; then
# echo "Test docker image success"
#else
# echo "Test docker image failed!"
# exit 1
#fi
EOF
bash generate_cuda_files.sh
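After the script runs, the target directory should contain the driver's user-space libraries; the exact file names depend on the installed driver version (396.44 in this environment):
ls -l /usr/cuda_files/
Expect files such as libcuda.so, libcuda.so.1, libcuda.so.396.44 and libnvidia-*.so.396.44.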
Verify cuda
Check the nvidia driver version:
[root@m7-demo-136001 ~]# cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 396.44 Wed Jul 11 16:51:49 PDT 2018
GCC version: gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC)
Build the sample programs under the /usr/local/cuda-9.2/samples directory and confirm they execute correctly; deviceQuery is used here as an example:
[root@m7-demo-136001 ~]# cd /usr/local/cuda-9.2/samples/1_Utilities/deviceQuery
[root@m7-demo-136001 deviceQuery]# make clean && make
[root@m7-demo-136001 deviceQuery]# ./deviceQuery
Output:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 4 CUDA Capable device(s)
Device 0: "GeForce GTX 1080 Ti"
CUDA Driver Version / Runtime Version 9.2 / 9.2
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 11178 MBytes (11721506816 bytes)
(28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1582 MHz (1.58 GHz)
Memory Clock rate: 5505 Mhz
Memory Bus Width: 352-bit
L2 Cache Size: 2883584 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0
Compute Mode:
Get the GPU information:
[root@m7-demo-136001 deviceQuery]# sudo nvidia-smi
Output:
Mon Aug 20 12:09:31 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44 Driver Version: 396.44 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:02:00.0 Off | N/A |
| 23% 40C P0 58W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:03:00.0 Off | N/A |
| 23% 40C P0 58W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 108... Off | 00000000:83:00.0 Off | N/A |
| 23% 34C P0 57W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 108... Off | 00000000:84:00.0 Off | N/A |
| 23% 32C P0 57W / 250W | 0MiB / 11178MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Install nvidia-container-runtime
Download the yum repository configuration file:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | \
sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo
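Optionally confirm that the nvidia repositories were added (the downloaded repo file defines several repos, including libnvidia-container and nvidia-container-runtime):
yum repolist enabled | grep -i nvidia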
Install the nvidia docker runtime:
sudo yum install -y nvidia-container-runtime
Add the nvidia runtime configuration (the runtimes section) to the /etc/docker/daemon.json file. The complete daemon.json file looks like this:
{
"registry-mirrors": ["https://docker.mirrors.ustc.edu.cn", "https://hub-mirror.c.163.com"],
"insecure-registries": ["docker02:35000"],
"max-concurrent-downloads": 20,
"live-restore": true,
"max-concurrent-uploads": 10,
"debug": true,
"data-root": "/mnt/disk0/docker/data",
"exec-root": "/mnt/disk0/docker/exec",
"log-opts": {
"max-size": "100m",
"max-file": "5"
},
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
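daemon.json must be valid JSON, otherwise dockerd will reject the configuration; one quick way to check the syntax (assuming python is available on the host) is:
python -m json.tool /etc/docker/daemon.json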
Reload the dockerd process so that the configuration above takes effect and the nvidia runtime is registered with dockerd:
sudo pkill -SIGHUP dockerd
Confirm that the nvidia runtime was registered successfully:
docker info|grep Runtimes
Output:
Runtimes: nvidia runc
- The output contains nvidia.
Get the GPU information:
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
- The cuda image tag (9.0-base here) must be compatible with the installed cuda package version, otherwise it will not work.
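If you prefer not to pass --runtime=nvidia on every docker run, the nvidia-container-runtime documentation referenced at the end of this page also describes making nvidia the default runtime; this is optional and not required by the steps above. It amounts to adding one top-level key to /etc/docker/daemon.json and reloading or restarting dockerd again:
"default-runtime": "nvidia"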
Adjust the kubelet configuration parameters and restart the service
Confirm that the kubelet configuration includes --feature-gates=Accelerators=true:
grep gate /etc/systemd/system/kubelet.service
Output:
--feature-gates=RotateKubeletClientCertificate=true,RotateKubeletServerCertificate=true,Accelerators=true,DevicePlugins=true
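If the flag is missing, add it to the kubelet unit file and reload systemd before restarting. A hypothetical excerpt (the kubelet binary path and the surrounding flags are placeholders for whatever this environment uses):
ExecStart=/opt/k8s/bin/kubelet \
  --feature-gates=RotateKubeletClientCertificate=true,RotateKubeletServerCertificate=true,Accelerators=true,DevicePlugins=true \
  ...
sudo systemctl daemon-reload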
Restart the kubelet service:
sudo systemctl restart kubelet
Check the node information and make sure the alpha.kubernetes.io/nvidia-gpu value is not 0:
kubectl describe node|grep alpha.kubernetes.io/nvidia-gpu
Output:
alpha.kubernetes.io/nvidia-gpu: 4
alpha.kubernetes.io/nvidia-gpu: 4
alpha.kubernetes.io/nvidia-gpu: 1
alpha.kubernetes.io/nvidia-gpu: 1
alpha.kubernetes.io/nvidia-gpu: 1
alpha.kubernetes.io/nvidia-gpu: 1
If the alpha.kubernetes.io/nvidia-gpu value is 0, troubleshoot as follows:
Run the command /usr/bin/nvidia-modprobe && modprobe nvidia_uvm, then make sure the lsmod output looks like the following:
# lsmod |grep ^nvidia
nvidia_uvm 778240 0
nvidia_drm 45056 0
nvidia_modeset 1089536 1 nvidia_drm
nvidia 14036992 14 nvidia_modeset,nvidia_uvm
Confirm that the nvidia-uvm and nvidia-uvm-tools device files exist under /dev/; if not, create them manually with the following commands:
# sudo mknod -m 666 /dev/nvidia-uvm c $(grep nvidia-uvm /proc/devices | awk '{print $1}') 1
# sudo mknod -m 666 /dev/nvidia-uvm-tools c $(grep nvidia-uvm /proc/devices | awk '{print $1}') 1
Then confirm again that the /dev directory contains the following device files:
# ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 Nov 14 14:23 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Nov 14 14:23 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Nov 14 14:39 /dev/nvidia-modeset
crw-rw-rw- 1 root root 240, 1 Nov 14 14:41 /dev/nvidia-uvm
crw-rw-rw- 1 root root 240, 1 Nov 14 14:41 /dev/nvidia-uvm-tools
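Once the nodes report a non-zero alpha.kubernetes.io/nvidia-gpu capacity, a minimal sketch of a pod that requests the alpha GPU resource can serve as a final check. The image, command and the hostPath mount of /usr/cuda_files are illustrative assumptions; a real workload would also need LD_LIBRARY_PATH to include the mounted driver libraries:
cat <<'EOF' | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-resource-test
spec:
  restartPolicy: Never
  containers:
  - name: test
    image: nvidia/cuda:9.0-base
    # only proves the pod is scheduled onto a GPU node and can see the host driver libraries
    command: ["ls", "/usr/cuda_files"]
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: nvidia-libs
      mountPath: /usr/cuda_files
      readOnly: true
  volumes:
  - name: nvidia-libs
    hostPath:
      path: /usr/cuda_files
EOF
kubectl logs gpu-resource-test should then list the copied libcuda* and libnvidia* files.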
References
https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup