

Nvidia GPU Operator

Prerequisites

Based on open-source Kubernetes (kubespray)
A GPU attached to each node
Adding GPU nodes to the cluster does not mean users can immediately use the GPUs. If the driver is missing it has to be installed, the container runtime has to be configured so containers can access the GPU (the NVIDIA Container Toolkit), and a Device Plugin has to be deployed so Pods in the Kubernetes cluster can request GPU resources.
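Once those pieces are in place, Pods consume GPUs through the nvidia.com/gpu extended resource advertised by the Device Plugin. A minimal smoke-test sketch, assuming the operator components are already running (the Pod name and CUDA image tag below are only illustrative, adjust them to what is available):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test          # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.3.1-base-ubuntu22.04   # adjust to an available tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1       # request one GPU from the device plugin
EOF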
Registering the Nvidia GPU Operator Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
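A quick sanity check that the repository was registered and the chart is visible (the version list will vary depending on when this is run):

helm repo list
helm search repo nvidia/gpu-operator --versions | head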
Pulling the chart to edit its configuration
helm pull nvidia/gpu-operator --untar
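If only the default configuration is needed for review, helm show values dumps it without untarring the whole chart; this is simply an alternative to editing the values.yaml inside the pulled chart:

helm show values nvidia/gpu-operator > values.yaml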
Checking the components
.
├── Chart.lock
├── Chart.yaml
├── charts
│   └── node-feature-discovery
│       ├── Chart.yaml
│       ├── README.md
│       ├── crds
│       │   └── nfd-api-crds.yaml
│       ├── templates
│       │   ├── _helpers.tpl
│       │   ├── cert-manager-certs.yaml
│       │   ├── cert-manager-issuer.yaml
│       │   ├── clusterrole.yaml
│       │   ├── clusterrolebinding.yaml
│       │   ├── master.yaml
│       │   ├── nfd-gc.yaml
│       │   ├── nfd-master-conf.yaml
│       │   ├── nfd-topologyupdater-conf.yaml
│       │   ├── nfd-worker-conf.yaml
│       │   ├── prometheus.yaml
│       │   ├── role.yaml
│       │   ├── rolebinding.yaml
│       │   ├── service.yaml
│       │   ├── serviceaccount.yaml
│       │   ├── topologyupdater-crds.yaml
│       │   ├── topologyupdater.yaml
│       │   └── worker.yaml
│       └── values.yaml
├── crds
│   ├── nvidia.com_clusterpolicies_crd.yaml
│   └── nvidia.com_nvidiadrivers.yaml
├── templates
│   ├── _helpers.tpl
│   ├── cleanup_crd.yaml
│   ├── clusterpolicy.yaml
│   ├── nodefeaturerules.yaml
│   ├── nvidiadriver.yaml
│   ├── operator.yaml
│   ├── plugin_config.yaml
│   ├── podsecuritypolicy.yaml
│   ├── readonlyfs_scc.openshift.yaml
│   ├── role.yaml
│   ├── rolebinding.yaml
│   ├── serviceaccount.yaml
│   └── upgrade_crd.yaml
└── values.yaml
How the components work, by role
GPU-Operator
├── Node status and feature discovery
│   ├── Node Feature Discovery Master (Deployment)
│   ├── Node Feature Discovery Worker (DaemonSet)
│   └── GPU Feature Discovery (DaemonSet)
├── Driver installation
│   └── Nvidia Driver Installer (DaemonSet)
├── GPU resource allocation
│   └── Nvidia Device Plugin (DaemonSet)
├── GPU access from containers
│   └── Nvidia Container Toolkit (DaemonSet)
├── Setup validation
│   └── GPU Operator Validator (DaemonSet)
└── Metrics
    └── DCGM Exporter (DaemonSet)
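After installation, each of the roles above maps to a Deployment or DaemonSet. Assuming the release was deployed into a namespace named gpu-operator (adjust to your own), the components can be checked with:

kubectl get deployment -n gpu-operator
kubectl get daemonset -n gpu-operator
kubectl get pods -n gpu-operator -o wide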

Caution

If the GPU driver and container runtime are already installed on the node
vi values.yaml

driver:
  enabled: false   ## must be false if the NVIDIA driver is already installed on the node
  nvidiaDriverCRD:
    enabled: false
    deployDefaultCR: true
    driverType: gpu
    nodeSelector: {}
  useOpenKernelModules: false
  # use pre-compiled packages for NVIDIA driver installation.
  # only supported as a tech-preview feature on ubuntu22.04 kernels.
  usePrecompiled: false
  repository: nvcr.io/nvidia
  image: driver
  version: "550.54.14"   ## if the driver is already installed, make sure this version matches
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  startupProbe:
    initialDelaySeconds: 60
    periodSeconds: 10
    # nvidia-smi can take longer than 30s in some cases
    # ensure enough timeout is set
    timeoutSeconds: 60
    failureThreshold: 120
  rdma:
    enabled: false
    useHostMofed: false
  upgradePolicy:
    # global switch for automatic upgrade feature
    # if set to false all other options are ignored
    autoUpgrade: true
    # how many nodes can be upgraded in parallel
    # 0 means no limit, all nodes will be upgraded in parallel
    maxParallelUpgrades: 1
    # maximum number of nodes with the driver installed, that can be unavailable during
    # the upgrade. Value can be an absolute number (ex: 5) or
    # a percentage of total nodes at the start of upgrade (ex: 10%).
    # Absolute number is calculated from percentage by rounding up.
    # By default, a fixed value of 25% is used.
    maxUnavailable: 25%
    # options for waiting on pod(job) completions
    waitForCompletion:
      timeoutSeconds: 0
      podSelector: ""
    # options for gpu pod deletion
    gpuPodDeletion:
      force: false
      timeoutSeconds: 300
      deleteEmptyDir: false
    # options for node drain (`kubectl drain`) before the driver reload
    # this is required only if default GPU pod deletions done by the operator
    # are not sufficient to re-install the driver
    drain:
      enable: false
      force: false
      podSelector: ""
      # It's recommended to set a timeout to avoid infinite drain in case non-fatal error keeps happening on retries
      timeoutSeconds: 300
      deleteEmptyDir: false
  manager:
    image: k8s-driver-manager
    repository: nvcr.io/nvidia/cloud-native
    version: v0.6.5
    imagePullPolicy: IfNotPresent
    env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      - name: ENABLE_AUTO_DRAIN
        value: "false"
      - name: DRAIN_USE_FORCE
        value: "false"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: "0s"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "false"
  env: []
  resources: {}
  # Private mirror repository configuration
  repoConfig:
    configMapName: ""
  # custom ssl key/certificate configuration
  certConfig:
    name: ""
  # vGPU licensing configuration
  licensingConfig:
    configMapName: ""
    nlsEnabled: true
  # vGPU topology daemon configuration
  virtualTopology:
    config: ""
  # kernel module configuration for NVIDIA driver
  kernelModuleConfig:
    name: ""

toolkit:
  enabled: false   ## must be false if the NVIDIA container runtime (container toolkit) is already installed
  repository: nvcr.io/nvidia/k8s
  image: container-toolkit
  version: v1.14.6-ubuntu20.04
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  resources: {}
  installDir: "/usr/local/nvidia"
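With driver.enabled and toolkit.enabled set to false in the pulled chart's values.yaml, the operator can then be installed from the local chart directory. This is a minimal sketch; the release name and namespace below are only examples:

helm install gpu-operator ./gpu-operator \
  -n gpu-operator --create-namespace \
  --wait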