Preface
Artificial intelligence has advanced rapidly over the past two years, and making full use of hardware while keeping applications efficient has become an important topic. Unikernels, as lightweight operating systems, have drawn attention for their efficiency and security.
This article describes how to integrate an existing AI application into a Nanos unikernel and provide CUDA support via the Nvidia driver.
Introduction
In the article Getting Started with Nanos Unikernel I introduced several unikernel solutions. From my research, Nanos is so far the only one that supports Nvidia GPU drivers, which makes it a strong foundation for integrating deep-learning applications into a unikernel.
Nanos mounts the Nvidia driver into the kernel as a klib. Klibs can be understood as Nanos's plugin mechanism, providing optional extra functionality on top of the kernel.
The Nanos Nvidia driver lives at https://github.com/nanovms/gpu-nvidia. It is essentially Nvidia's open-source driver, modified to fit the Nanos kernel; the current driver version is 535.113.01.
Nanos currently supports GPU integration on two platforms: Google Cloud (GCP) and local machines. This article focuses on local integration. Before starting, make sure your local machine has at least one Nvidia GPU supported by the Nvidia open-source driver and that GPU passthrough is enabled; see my earlier article PVE 8.2 GPU Passthrough for reference.
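As a quick sanity check (a minimal sketch, assuming a Linux guest with pciutils installed), you can confirm the passed-through GPU is actually visible inside the VM before going any further:

```bash
# Inside the guest: the passed-through Nvidia GPU should show up as a PCI device
lspci -nn | grep -i nvidia
```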
Compiling the Klib
There are currently two ways to get the gpu_nvidia klib: compile it manually, or use the klib that the Nanos project builds automatically every night.
The officially built klib can be downloaded from https://storage.googleapis.com/nanos/release/nightly/gpu-nvidia-x86_64.tar.gz.
To compile it manually, follow these steps:
- Clone the Nanos kernel repository and build it (a sketch follows this list).
- Clone the Nanos Nvidia driver repository and build it; the NANOS_DIR parameter must point to the nanos directory from the previous step. The build output is located at kernel-open/_out/Nanos_x86_64/gpu_nvidia.

```bash
git clone https://github.com/nanovms/gpu-nvidia
cd gpu-nvidia
make NANOS_DIR=/root/nanos
```
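For the first step, a minimal sketch (assuming a standard Linux build environment; see the nanos repository's README for the exact toolchain requirements):

```bash
# Clone and build the Nanos kernel; the resulting checkout is
# what NANOS_DIR points to in the driver build above
git clone https://github.com/nanovms/nanos
cd nanos
make
```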
Building the Nanos Application
Here we use the two simplest applications from the CUDA Samples, bandwidthTest and deviceQuery, as tests.
- Create the project directory

```bash
mkdir cuda-samples-nanos && cd cuda-samples-nanos
```
- Integrate the klib
Either the officially built klib or your own build works here; this example uses the official one. The unpacked archive contains a gpu_nvidia file along with nvidia/535.113.01/gsp_ga10x.bin and nvidia/535.113.01/gsp_tu10x.bin. The .bin files are GPU System Processor (GSP) firmware: ga10x targets Ampere-architecture GPUs and tu10x targets Turing-architecture GPUs, so keep one or both depending on your card's architecture (a quick way to check follows below).

```bash
wget https://storage.googleapis.com/nanos/release/nightly/gpu-nvidia-x86_64.tar.gz
tar -vxf gpu-nvidia-x86_64.tar.gz && rm -rf gpu-nvidia-x86_64.tar.gz
mkdir klibs
mv gpu_nvidia klibs/
```
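One hedged way to check the architecture, assuming the host Nvidia driver is installed (it is installed anyway in the dependency step below); otherwise lspci from the passthrough check above also shows the model:

```bash
# Prints the GPU model name, e.g. "NVIDIA GeForce RTX 4060"
nvidia-smi --query-gpu=name --format=csv,noheader
```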
- Compile cuda-samples
bandwidthTest and deviceQuery can be obtained by selecting the CUDA Demo Suite when installing CUDA (they then live in the samples directory under the CUDA installation path), or you can build them yourself. The manual build is shown here; the build output lands in cuda-samples/bin/x86_64/linux/release/, and the binaries are then copied into the project directory (see the sketch after this block).

```bash
git clone https://github.com/NVIDIA/cuda-samples
cd cuda-samples/Samples/1_Utilities/bandwidthTest/
make
cd ../deviceQuery/
make
```
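A sketch of the copy step; the paths assume cuda-samples was cloned directly inside cuda-samples-nanos as above:

```bash
# From Samples/1_Utilities/deviceQuery/ back to the cuda-samples root,
# then copy both binaries up into the project root
cd ../../..
cp bin/x86_64/linux/release/bandwidthTest ..
cp bin/x86_64/linux/release/deviceQuery ..
```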
- Prepare dynamic library dependencies
The two programs only need libcuda.so.1. Install the same driver version, 535.113.01, on the host, after which the library can be found in /usr/lib/x86_64-linux-gnu/.

```bash
mkdir -p usr/lib
cp /usr/lib/x86_64-linux-gnu/libcuda.so.1 ./usr/lib
```
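As an optional check (a sketch; output varies by system), ldd confirms that besides the usual glibc libraries the binaries pull in nothing but libcuda.so.1:

```bash
# Any "not found" entry other than libcuda.so.1 means another
# library would also have to be copied into usr/lib
ldd ./deviceQuery
ldd ./bandwidthTest
```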
- Edit the configuration file
Create a config.json and fill in the following content:

```json
{
  "KlibDir": "./klibs",
  "Klibs": ["gpu_nvidia"],
  "Dirs": ["nvidia", "usr"],
  "RunConfig": {
    "GPUs": 1
  }
}
```
Here KlibDir sets the directory containing klibs, Klibs specifies gpu_nvidia as the klib to load, the Dirs parameter maps the nvidia and usr directories into the root of the Nanos filesystem, and RunConfig.GPUs sets the number of GPUs to pass through.
- Check
The project directory should now have the following structure; verify that nothing is missing:

```
.
├── bandwidthTest
├── config.json
├── deviceQuery
├── klibs
│   └── gpu_nvidia
├── nvidia
│   ├── 535.113.01
│   │   ├── gsp_ga10x.bin
│   │   └── gsp_tu10x.bin
│   └── LICENSE
└── usr
    └── lib
        └── libcuda.so.1
```
Running
Run the ELF files with ops run; the -c flag specifies the configuration file and the -n flag runs against the nightly release.
```
# ops run deviceQuery -c config.json -n
running local instance
booting /root/.ops/images/deviceQuery ...
en1: assigned 10.0.2.15
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 535.113.01 Release Build (circleci@027ee46c5f57) Fri Aug 16 02:11:27 AM UTC 2024
Loaded the UVM driver, major device number 0.
deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 4060"
CUDA Driver Version / Runtime Version 12.2 / 12.2
CUDA Capability Major/Minor version number: 8.9
Total amount of global memory: 7734 MBytes (8109293568 bytes)
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
(24) Multiprocessors, (128) CUDA Cores/MP: 3072 CUDA Cores
GPU Max Clock rate: 2505 MHz (2.50 GHz)
Memory Clock rate: 8501 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 25165824 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 4
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 12.2, NumDevs = 1, Device0 = NVIDIA GeForce RTX 4060
Result = PASS
```
```
# ops run bandwidthTest -c config.json -n
running local instance
booting /root/.ops/images/bandwidthTest ...
en1: assigned 10.0.2.15
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 535.113.01 Release Build (circleci@027ee46c5f57) Fri Aug 16 02:11:27 AM UTC 2024
Loaded the UVM driver, major device number 0.
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: NVIDIA GeForce RTX 4060
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12905.6
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 13203.8
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 232091.7
en1: assigned FE80::4466:88FF:FE1F:2F9
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
```