Nvidia GPU Operator
Prerequisites
•
Open-source Kubernetes cluster (deployed with kubespray)
•
GPUs attached to each node
Adding a GPU node to the cluster does not mean users can immediately consume GPUs.
If no driver is present it must be installed, the container runtime must be made able to use the GPU,
and a Device Plugin must be deployed so that Pods in the Kubernetes cluster can request GPU resources.
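The gap is easy to see before installing anything: a freshly joined GPU node advertises no nvidia.com/gpu resource. A quick check (the node name is a placeholder):
kubectl describe node <gpu-node> | grep -A 7 Capacity
# Until the device plugin runs, no nvidia.com/gpu entry appears under Capacity/Allocatable.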
Register the Nvidia GPU Operator Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
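To confirm the repo was registered, a quick search should list the chart:
helm search repo nvidia/gpu-operator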
Pull the chart and its default settings
helm pull nvidia/gpu-operator --untar
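--untar unpacks the chart into a local gpu-operator/ directory, which is what the tree below lists. If you only want to edit settings, the default values can also be dumped on their own:
helm show values nvidia/gpu-operator > values.yaml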
Inspect the chart components
.
├── Chart.lock
├── Chart.yaml
├── charts
│   └── node-feature-discovery
│       ├── Chart.yaml
│       ├── README.md
│       ├── crds
│       │   └── nfd-api-crds.yaml
│       ├── templates
│       │   ├── _helpers.tpl
│       │   ├── cert-manager-certs.yaml
│       │   ├── cert-manager-issuer.yaml
│       │   ├── clusterrole.yaml
│       │   ├── clusterrolebinding.yaml
│       │   ├── master.yaml
│       │   ├── nfd-gc.yaml
│       │   ├── nfd-master-conf.yaml
│       │   ├── nfd-topologyupdater-conf.yaml
│       │   ├── nfd-worker-conf.yaml
│       │   ├── prometheus.yaml
│       │   ├── role.yaml
│       │   ├── rolebinding.yaml
│       │   ├── service.yaml
│       │   ├── serviceaccount.yaml
│       │   ├── topologyupdater-crds.yaml
│       │   ├── topologyupdater.yaml
│       │   └── worker.yaml
│       └── values.yaml
├── crds
│   ├── nvidia.com_clusterpolicies_crd.yaml
│   └── nvidia.com_nvidiadrivers.yaml
├── templates
│   ├── _helpers.tpl
│   ├── cleanup_crd.yaml
│   ├── clusterpolicy.yaml
│   ├── nodefeaturerules.yaml
│   ├── nvidiadriver.yaml
│   ├── operator.yaml
│   ├── plugin_config.yaml
│   ├── podsecuritypolicy.yaml
│   ├── readonlyfs_scc.openshift.yaml
│   ├── role.yaml
│   ├── rolebinding.yaml
│   ├── serviceaccount.yaml
│   └── upgrade_crd.yaml
└── values.yaml
How the components work, grouped by role
GPU-Operator
├── Status & information discovery
│   ├── Node Feature Discovery Master (Deployment)
│   ├── Node Feature Discovery Worker (DaemonSet)
│   └── GPU Feature Discovery (DaemonSet)
├── Driver installation
│   └── Nvidia Driver Installer (DaemonSet)
├── GPU resource allocation
│   └── Nvidia Device Plugin (DaemonSet)
├── GPU usage in containers
│   └── Nvidia Container Toolkit (DaemonSet)
├── Setup validation
│   └── GPU Operator Validator (DaemonSet)
└── Metrics
    └── DCGM Exporter (DaemonSet)
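Once deployed, the discovery components surface their findings as node labels, which makes for a convenient smoke test. A hedged check (recent operator releases apply nvidia.com/gpu.present via an NFD rule; adjust the label if your version differs):
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl get node <gpu-node> --show-labels | tr ',' '\n' | grep nvidia.com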
Caution
•
If the GPU driver and container runtime are already installed on the nodes, the corresponding components must be disabled in values.yaml (you can verify what is already present with the checks below)
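To check what is already present, run these on the GPU node itself (the containerd config path assumes a default kubespray setup):
nvidia-smi                                  # succeeds only if a driver is installed
grep -i nvidia /etc/containerd/config.toml  # an nvidia runtime entry means the toolkit is configured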
vi values.yaml
driver:
  enabled: false ## must be false if the NVIDIA driver is already installed
  nvidiaDriverCRD:
    enabled: false
    deployDefaultCR: true
    driverType: gpu
    nodeSelector: {}
  useOpenKernelModules: false
  # use pre-compiled packages for NVIDIA driver installation.
  # only supported as a tech-preview feature on ubuntu22.04 kernels.
  usePrecompiled: false
  repository: nvcr.io/nvidia
  image: driver
  version: "550.54.14" ## if the driver is preinstalled, this must match the host driver version
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  startupProbe:
    initialDelaySeconds: 60
    periodSeconds: 10
    # nvidia-smi can take longer than 30s in some cases
    # ensure enough timeout is set
    timeoutSeconds: 60
    failureThreshold: 120
  rdma:
    enabled: false
    useHostMofed: false
  upgradePolicy:
    # global switch for automatic upgrade feature
    # if set to false all other options are ignored
    autoUpgrade: true
    # how many nodes can be upgraded in parallel
    # 0 means no limit, all nodes will be upgraded in parallel
    maxParallelUpgrades: 1
    # maximum number of nodes with the driver installed, that can be unavailable during
    # the upgrade. Value can be an absolute number (ex: 5) or
    # a percentage of total nodes at the start of upgrade (ex:
    # 10%). Absolute number is calculated from percentage by rounding
    # up. By default, a fixed value of 25% is used.
    maxUnavailable: 25%
    # options for waiting on pod(job) completions
    waitForCompletion:
      timeoutSeconds: 0
      podSelector: ""
    # options for gpu pod deletion
    gpuPodDeletion:
      force: false
      timeoutSeconds: 300
      deleteEmptyDir: false
    # options for node drain (`kubectl drain`) before the driver reload
    # this is required only if default GPU pod deletions done by the operator
    # are not sufficient to re-install the driver
    drain:
      enable: false
      force: false
      podSelector: ""
      # It's recommended to set a timeout to avoid an infinite drain in case non-fatal errors keep happening on retries
      timeoutSeconds: 300
      deleteEmptyDir: false
  manager:
    image: k8s-driver-manager
    repository: nvcr.io/nvidia/cloud-native
    version: v0.6.5
    imagePullPolicy: IfNotPresent
    env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      - name: ENABLE_AUTO_DRAIN
        value: "false"
      - name: DRAIN_USE_FORCE
        value: "false"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: "0s"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "false"
  env: []
  resources: {}
  # Private mirror repository configuration
  repoConfig:
    configMapName: ""
  # custom ssl key/certificate configuration
  certConfig:
    name: ""
  # vGPU licensing configuration
  licensingConfig:
    configMapName: ""
    nlsEnabled: true
  # vGPU topology daemon configuration
  virtualTopology:
    config: ""
  # kernel module configuration for NVIDIA driver
  kernelModuleConfig:
    name: ""
toolkit:
  enabled: false ## must be false if the NVIDIA container runtime (toolkit) is already installed
  repository: nvcr.io/nvidia/k8s
  image: container-toolkit
  version: v1.14.6-ubuntu20.04
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  resources: {}
  installDir: "/usr/local/nvidia"
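With the flags above adjusted, the chart can be installed from the untarred directory; since the edits were made in the chart's own values.yaml, no extra -f flag is needed (the release and namespace names here are illustrative):
helm install gpu-operator ./gpu-operator -n gpu-operator --create-namespace
kubectl get pods -n gpu-operator   # all components should reach Running/Completed
A minimal end-to-end test, assuming any CUDA base image reachable from the cluster (the tag below is only an example):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.3.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
kubectl logs gpu-test   # should print the nvidia-smi table once the Pod completes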