07-4. Nvidia GPU Support

CUDA is Nvidia's parallel computing framework and programming-model APIs; at the time of writing the latest release is 9.2.

This document first covers installing and verifying the CUDA 9.2 packages, which include the development toolkit (ToolKit: binaries, header files, libraries, sample programs, etc.) and the GPU driver.

Once CUDA is installed the kernel can recognize and use the GPUs, but nvidia-container-runtime still needs to be installed so that containers can access GPU resources through the nvidia runtime it provides.

Finally, the kubelet configuration parameters must be adjusted to enable GPU acceleration.

Install kernel headers and development packages

When the CUDA package is installed, it uses the host's kernel headers and development files to compile a driver matching the running kernel on the spot, so these two packages must be installed first.

If you are running the kernel-lt kernel (upgraded from the elrepo repository), use:

  sudo yum install kernel-lt-devel-$(uname -r) kernel-lt-headers-$(uname -r)

Otherwise, install the packages from the official repositories:

  sudo yum install epel-release
  sudo yum install dkms
  sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
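
As an optional sanity check (not part of the original steps), confirm that the installed devel/headers packages match the running kernel; a mismatch is a common cause of driver build failures:

  # The package versions should line up with the running kernel.
  uname -r
  rpm -q kernel-devel-$(uname -r) kernel-headers-$(uname -r)   # or the kernel-lt-* variants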

Install CUDA 9.2

Download the offline installer packages from the Nvidia website.

The offline installer consists of two parts:

  • Base package: Base Installer
  • Patch package: Patch 1 (Released Aug 6, 2018)

The downloaded files on the host:

  [work@m7-demo-136003 ~]$ ls -l /tmp/cuda-repo-rhel7-9-2-*
  -rw-r--r-- 1 work work  192984471 Aug 14 20:47 /tmp/cuda-repo-rhel7-9-2-148-local-patch-1-1.0-1.x86_64.rpm  # patch package
  -rw-r--r-- 1 work work 1718223108 Aug 14 20:45 /tmp/cuda-repo-rhel7-9-2-local-9.2.148-1.x86_64.rpm          # base package

If you are using the kernel-lt kernel, disable the elrepo repository first; otherwise RPM will report version conflicts:

  sudo mv /etc/yum.repos.d/elrepo.repo{,.bak}

Install both downloaded packages at the same time:

  sudo rpm -i cuda-repo-rhel7-9-2-local-9.2.148-1.x86_64.rpm cuda-repo-rhel7-9-2-148-local-patch-1-1.0-1.x86_64.rpm
  sudo yum clean all
  sudo yum install cuda
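
If you moved the elrepo repo file aside earlier, restore it once the CUDA installation has finished:

  sudo mv /etc/yum.repos.d/elrepo.repo{.bak,}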

Official installation guide: https://developer.download.nvidia.com/compute/cuda/9.2/Prod2/docs/sidebar/CUDA_Installation_Guide_Linux.pdf

Configure CUDA

Add the CUDA binary directory to PATH:

  export PATH=/usr/local/cuda-9.2/bin${PATH:+:${PATH}}
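
The export above only affects the current shell. To make it persistent you could, for example, drop it into a profile script (the file name cuda-9-2.sh below is just an illustrative choice):

  sudo tee /etc/profile.d/cuda-9-2.sh <<'EOF'
  export PATH=/usr/local/cuda-9.2/bin${PATH:+:${PATH}}
  EOF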

Confirm that the /etc/ld.so.conf.d/ directory contains the cuda-9-2.conf file:

  [root@m7-demo-136001 deviceQuery]# cat /etc/ld.so.conf.d/cuda-9-2.conf
  /usr/local/cuda-9.2/targets/x86_64-linux/lib
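
If the file is present, refreshing the dynamic linker cache (an extra verification step, not in the original procedure) makes the CUDA libraries resolvable right away:

  sudo ldconfig
  ldconfig -p | grep libcudart    # the CUDA runtime library should now be listed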

Load the nvidia driver automatically at boot:

  sudo tee /etc/rc.local <<EOF
  /bin/nvidia-modprobe
  EOF
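
On CentOS 7 the rc.local script is only executed at boot when it is marked executable, so it may also be necessary to run (an extra step, not in the original text):

  sudo chmod +x /etc/rc.d/rc.local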

Copy the relevant library files into the /usr/cuda_files directory:

  cat > generate_cuda_files.sh <<"EOF"
  #!/bin/sh
  # Collect the driver-side CUDA libraries into a single directory so that
  # containers can mount them from the host.
  TARGET_DIR="/usr/cuda_files"
  rm -rf "${TARGET_DIR}" || :
  mkdir -p "${TARGET_DIR}"
  echo "Try to install cuda dependent libraries to ${TARGET_DIR}"
  INFRASTRUCTURE="x86-64"
  # Locate libcuda.so through the dynamic linker cache.
  LIBCUDA_PATH=$(ldconfig -p | grep libcuda | grep "${INFRASTRUCTURE}" | awk '{print $4}' | tail -n 1)
  if [ -z "${LIBCUDA_PATH}" ]; then
    echo "Cannot find libcuda.so in your environment"
    exit 1
  else
    echo "Found libcuda at ${LIBCUDA_PATH}"
  fi
  # libcuda.so depends on the libnvidia-* loader libraries; find their directory.
  LIBNVIDIALOADER_PATH=$(ldd "${LIBCUDA_PATH}" | grep libnvidia | awk '{print $3}' | tail -n 1)
  LIBCUDA_DIR=$(dirname "${LIBCUDA_PATH}")
  LIBNVIDIALOADER_DIR=$(dirname "${LIBNVIDIALOADER_PATH}")
  echo "Copy library libcuda* from ${LIBCUDA_DIR} to ${TARGET_DIR}..."
  cp -v "${LIBCUDA_DIR}"/libcuda* "${TARGET_DIR}"
  echo "Copy library libnvidia* from ${LIBNVIDIALOADER_DIR} to ${TARGET_DIR}..."
  cp -v "${LIBNVIDIALOADER_DIR}"/libnvidia* "${TARGET_DIR}"
  #echo "Validate by running a basic tensorflow docker image"
  #DOCKER_IMAGE="docker02:35000/operator-repository/train-tensorflow-gpu:release-3.1.2"
  #DEV_MOUNT="--device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 -v $TARGET_DIR:/usr/cuda_files"
  #TEST_COMMAND="import tensorflow; tensorflow.Session()"
  ##docker pull $DOCKER_IMAGE
  #if [ $? -ne 0 ]; then
  #  echo "Fail to pull image $DOCKER_IMAGE, maybe need to login (docker02: user=testuser passwd=testpassword)"
  #  exit 1
  #fi
  #
  #docker run --rm -it $DEV_MOUNT $DOCKER_IMAGE python -c "$TEST_COMMAND"
  #if [ $? -eq 0 ]; then
  #  echo "Test docker image succeeded"
  #else
  #  echo "Test docker image failed!"
  #  exit 1
  #fi
  EOF
  bash generate_cuda_files.sh
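
After the script finishes, a quick look at the target directory confirms the libraries were copied:

  ls -l /usr/cuda_files/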

Verify CUDA

Check the nvidia driver version:

  [root@m7-demo-136001 ~]# cat /proc/driver/nvidia/version
  NVRM version: NVIDIA UNIX x86_64 Kernel Module  396.44  Wed Jul 11 16:51:49 PDT 2018
  GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-28) (GCC)

Build the sample programs under /usr/local/cuda-9.2/samples and confirm they run correctly; deviceQuery is used as the example here:

  [root@m7-demo-136001 ~]# cd /usr/local/cuda-9.2/samples/1_Utilities/deviceQuery
  [root@m7-demo-136001 deviceQuery]# make clean && make
  [root@m7-demo-136001 deviceQuery]# ./deviceQuery

Output:

  ./deviceQuery Starting...
  CUDA Device Query (Runtime API) version (CUDART static linking)
  Detected 4 CUDA Capable device(s)
  Device 0: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          9.2 / 9.2
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11178 MBytes (11721506816 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1582 MHz (1.58 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
  Compute Mode:

Get GPU information:

  [root@m7-demo-136001 deviceQuery]# sudo nvidia-smi

Output:

  Mon Aug 20 12:09:31 2018
  +-----------------------------------------------------------------------------+
  | NVIDIA-SMI 396.44 Driver Version: 396.44 |
  |-------------------------------+----------------------+----------------------+
  | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
  | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
  |===============================+======================+======================|
  | 0 GeForce GTX 108... Off | 00000000:02:00.0 Off | N/A |
  | 23% 40C P0 58W / 250W | 0MiB / 11178MiB | 0% Default |
  +-------------------------------+----------------------+----------------------+
  | 1 GeForce GTX 108... Off | 00000000:03:00.0 Off | N/A |
  | 23% 40C P0 58W / 250W | 0MiB / 11178MiB | 0% Default |
  +-------------------------------+----------------------+----------------------+
  | 2 GeForce GTX 108... Off | 00000000:83:00.0 Off | N/A |
  | 23% 34C P0 57W / 250W | 0MiB / 11178MiB | 0% Default |
  +-------------------------------+----------------------+----------------------+
  | 3 GeForce GTX 108... Off | 00000000:84:00.0 Off | N/A |
  | 23% 32C P0 57W / 250W | 0MiB / 11178MiB | 3% Default |
  +-------------------------------+----------------------+----------------------+

  +-----------------------------------------------------------------------------+
  | Processes: GPU Memory |
  | GPU PID Type Process name Usage |
  |=============================================================================|
  | No running processes found |
  +-----------------------------------------------------------------------------+

Install nvidia-container-runtime

Download the yum repository configuration file:

  distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
  curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | \
    sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo

Install the nvidia docker runtime:

  sudo yum install -y nvidia-container-runtime
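
Optionally verify that the runtime binary landed where the daemon.json below expects it:

  rpm -q nvidia-container-runtime
  which nvidia-container-runtime    # expected: /usr/bin/nvidia-container-runtime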

Add the nvidia runtime configuration (the runtimes section) to /etc/docker/daemon.json; the complete daemon.json looks like this:

  {
    "registry-mirrors": ["https://docker.mirrors.ustc.edu.cn", "https://hub-mirror.c.163.com"],
    "insecure-registries": ["docker02:35000"],
    "max-concurrent-downloads": 20,
    "live-restore": true,
    "max-concurrent-uploads": 10,
    "debug": true,
    "data-root": "/mnt/disk0/docker/data",
    "exec-root": "/mnt/disk0/docker/exec",
    "log-opts": {
      "max-size": "100m",
      "max-file": "5"
    },
    "runtimes": {
      "nvidia": {
        "path": "/usr/bin/nvidia-container-runtime",
        "runtimeArgs": []
      }
    }
  }
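
Optionally, to let every container use the nvidia runtime without passing --runtime=nvidia each time, you can also set it as the default runtime; this is not required by the steps in this document, and a dockerd restart may be needed for it to take effect. The relevant fragment of daemon.json would be:

  {
    "default-runtime": "nvidia",
    "runtimes": {
      "nvidia": {
        "path": "/usr/bin/nvidia-container-runtime",
        "runtimeArgs": []
      }
    }
  }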

Reload the dockerd process to apply the configuration above and register the nvidia runtime with dockerd:

  sudo pkill -SIGHUP dockerd

Confirm that the nvidia runtime was registered successfully:

  docker info|grep Runtimes

Output:

  Runtimes: nvidia runc

  • The output contains nvidia, i.e. the nvidia runtime has been registered.

Get GPU information from a container:

  docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

  • The CUDA version in the image tag (9.0-base here) must be compatible with the installed CUDA/driver version, otherwise the container will not run correctly.

Adjust the kubelet configuration and restart the service

Confirm that the kubelet startup parameters include --feature-gates=Accelerators=true:

  grep gate /etc/systemd/system/kubelet.service

Output:

  --feature-gates=RotateKubeletClientCertificate=true,RotateKubeletServerCertificate=true,Accelerators=true,DevicePlugins=true

Restart the kubelet service:

  sudo systemctl restart kubelet
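
After the restart, a quick status check confirms the service came back up cleanly:

  sudo systemctl status kubelet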

Check the node information and make sure the alpha.kubernetes.io/nvidia-gpu value is not 0:

  kubectl describe node|grep alpha.kubernetes.io/nvidia-gpu

Output:

  alpha.kubernetes.io/nvidia-gpu: 4
  alpha.kubernetes.io/nvidia-gpu: 4
  alpha.kubernetes.io/nvidia-gpu: 1
  alpha.kubernetes.io/nvidia-gpu: 1
  alpha.kubernetes.io/nvidia-gpu: 1
  alpha.kubernetes.io/nvidia-gpu: 1

If the alpha.kubernetes.io/nvidia-gpu value is 0, troubleshoot as follows:

  1. Run /usr/bin/nvidia-modprobe && modprobe nvidia_uvm, then make sure the lsmod output looks like the following:

    # lsmod |grep ^nvidia
    nvidia_uvm            778240  0
    nvidia_drm             45056  0
    nvidia_modeset       1089536  1 nvidia_drm
    nvidia              14036992  14 nvidia_modeset,nvidia_uvm
  2. Confirm that the nvidia-uvm and nvidia-uvm-tools device files exist under /dev/; if they are missing, create them manually with the following commands:

    # sudo mknod -m 666 /dev/nvidia-uvm c $(grep nvidia-uvm /proc/devices | awk '{print $1}') 1
    # sudo mknod -m 666 /dev/nvidia-uvm-tools c $(grep nvidia-uvm /proc/devices | awk '{print $1}') 1

Then confirm again that the /dev directory contains the following device files:

  # ls -l /dev/nvidia*
  crw-rw-rw- 1 root root 195,   0 Nov 14 14:23 /dev/nvidia0
  crw-rw-rw- 1 root root 195, 255 Nov 14 14:23 /dev/nvidiactl
  crw-rw-rw- 1 root root 195, 254 Nov 14 14:39 /dev/nvidia-modeset
  crw-rw-rw- 1 root root 240,   1 Nov 14 14:41 /dev/nvidia-uvm
  crw-rw-rw- 1 root root 240,   1 Nov 14 14:41 /dev/nvidia-uvm-tools
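
Once the node reports a non-zero GPU count, workloads can request GPUs through the alpha.kubernetes.io/nvidia-gpu resource. The manifest below is only an illustrative sketch (the Pod name and command are made up; depending on the image, you may also need to mount the driver libraries collected in /usr/cuda_files into the container via a hostPath volume):

  kubectl apply -f - <<'EOF'
  apiVersion: v1
  kind: Pod
  metadata:
    name: gpu-test                                  # hypothetical name
  spec:
    restartPolicy: Never
    containers:
    - name: cuda
      image: nvidia/cuda:9.0-base
      command: ["sh", "-c", "ls -l /dev/nvidia*"]   # just checks that a GPU device was exposed
      resources:
        limits:
          alpha.kubernetes.io/nvidia-gpu: 1
  EOF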

References

https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup