

Nvidia GPU Operator

Prerequisites

Based on open-source Kubernetes (kubespray)
A GPU attached to each node
Adding GPU nodes to the cluster does not mean users can immediately use the GPUs. If the driver is missing it has to be installed, the container runtime has to be configured so containers can access the GPU (the NVIDIA Container Toolkit), and a Device Plugin has to be deployed so Pods in the Kubernetes cluster can request GPU resources.
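Once those pieces are in place, Pods consume GPUs through the nvidia.com/gpu extended resource advertised by the Device Plugin. A minimal smoke-test sketch, assuming the operator components are already running (the Pod name and CUDA image tag below are only illustrative, adjust them to what is available):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test          # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.3.1-base-ubuntu22.04   # adjust to an available tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1       # request one GPU from the device plugin
EOF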
Registering the Nvidia GPU Operator Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
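A quick sanity check that the repository was registered and the chart is visible (the version list will vary depending on when this is run):

helm repo list
helm search repo nvidia/gpu-operator --versions | head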
Pulling the chart to edit its configuration
helm pull nvidia/gpu-operator --untar
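If only the default configuration is needed for review, helm show values dumps it without untarring the whole chart; this is simply an alternative to editing the values.yaml inside the pulled chart:

helm show values nvidia/gpu-operator > values.yaml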
Checking the components
.
├── Chart.lock
├── Chart.yaml
├── charts
│   └── node-feature-discovery
│       ├── Chart.yaml
│       ├── README.md
│       ├── crds
│       │   └── nfd-api-crds.yaml
│       ├── templates
│       │   ├── _helpers.tpl
│       │   ├── cert-manager-certs.yaml
│       │   ├── cert-manager-issuer.yaml
│       │   ├── clusterrole.yaml
│       │   ├── clusterrolebinding.yaml
│       │   ├── master.yaml
│       │   ├── nfd-gc.yaml
│       │   ├── nfd-master-conf.yaml
│       │   ├── nfd-topologyupdater-conf.yaml
│       │   ├── nfd-worker-conf.yaml
│       │   ├── prometheus.yaml
│       │   ├── role.yaml
│       │   ├── rolebinding.yaml
│       │   ├── service.yaml
│       │   ├── serviceaccount.yaml
│       │   ├── topologyupdater-crds.yaml
│       │   ├── topologyupdater.yaml
│       │   └── worker.yaml
│       └── values.yaml
├── crds
│   ├── nvidia.com_clusterpolicies_crd.yaml
│   └── nvidia.com_nvidiadrivers.yaml
├── templates
│   ├── _helpers.tpl
│   ├── cleanup_crd.yaml
│   ├── clusterpolicy.yaml
│   ├── nodefeaturerules.yaml
│   ├── nvidiadriver.yaml
│   ├── operator.yaml
│   ├── plugin_config.yaml
│   ├── podsecuritypolicy.yaml
│   ├── readonlyfs_scc.openshift.yaml
│   ├── role.yaml
│   ├── rolebinding.yaml
│   ├── serviceaccount.yaml
│   └── upgrade_crd.yaml
└── values.yaml
How the components work, by role
GPU-Operator
├── Node status and feature discovery
│   ├── Node Feature Discovery Master (Deployment)
│   ├── Node Feature Discovery Worker (DaemonSet)
│   └── GPU Feature Discovery (DaemonSet)
├── Driver installation
│   └── Nvidia Driver Installer (DaemonSet)
├── GPU resource allocation
│   └── Nvidia Device Plugin (DaemonSet)
├── GPU access from containers
│   └── Nvidia Container Toolkit (DaemonSet)
├── Setup validation
│   └── GPU Operator Validator (DaemonSet)
└── Metrics
    └── DCGM Exporter (DaemonSet)
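After installation, each of the roles above maps to a Deployment or DaemonSet. Assuming the release was deployed into a namespace named gpu-operator (adjust to your own), the components can be checked with:

kubectl get deployment -n gpu-operator
kubectl get daemonset -n gpu-operator
kubectl get pods -n gpu-operator -o wide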

Caution

If the GPU driver and container runtime are already installed on the node
vi values.yaml

driver:
  enabled: false   ## must be false if the NVIDIA driver is already installed on the node
  nvidiaDriverCRD:
    enabled: false
    deployDefaultCR: true
    driverType: gpu
    nodeSelector: {}
  useOpenKernelModules: false
  # use pre-compiled packages for NVIDIA driver installation.
  # only supported as a tech-preview feature on ubuntu22.04 kernels.
  usePrecompiled: false
  repository: nvcr.io/nvidia
  image: driver
  version: "550.54.14"   ## if the driver is already installed, make sure this version matches
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  startupProbe:
    initialDelaySeconds: 60
    periodSeconds: 10
    # nvidia-smi can take longer than 30s in some cases
    # ensure enough timeout is set
    timeoutSeconds: 60
    failureThreshold: 120
  rdma:
    enabled: false
    useHostMofed: false
  upgradePolicy:
    # global switch for automatic upgrade feature
    # if set to false all other options are ignored
    autoUpgrade: true
    # how many nodes can be upgraded in parallel
    # 0 means no limit, all nodes will be upgraded in parallel
    maxParallelUpgrades: 1
    # maximum number of nodes with the driver installed, that can be unavailable during
    # the upgrade. Value can be an absolute number (ex: 5) or
    # a percentage of total nodes at the start of upgrade (ex: 10%).
    # Absolute number is calculated from percentage by rounding up.
    # By default, a fixed value of 25% is used.
    maxUnavailable: 25%
    # options for waiting on pod(job) completions
    waitForCompletion:
      timeoutSeconds: 0
      podSelector: ""
    # options for gpu pod deletion
    gpuPodDeletion:
      force: false
      timeoutSeconds: 300
      deleteEmptyDir: false
    # options for node drain (`kubectl drain`) before the driver reload
    # this is required only if default GPU pod deletions done by the operator
    # are not sufficient to re-install the driver
    drain:
      enable: false
      force: false
      podSelector: ""
      # It's recommended to set a timeout to avoid infinite drain in case non-fatal error keeps happening on retries
      timeoutSeconds: 300
      deleteEmptyDir: false
  manager:
    image: k8s-driver-manager
    repository: nvcr.io/nvidia/cloud-native
    version: v0.6.5
    imagePullPolicy: IfNotPresent
    env:
      - name: ENABLE_GPU_POD_EVICTION
        value: "true"
      - name: ENABLE_AUTO_DRAIN
        value: "false"
      - name: DRAIN_USE_FORCE
        value: "false"
      - name: DRAIN_POD_SELECTOR_LABEL
        value: ""
      - name: DRAIN_TIMEOUT_SECONDS
        value: "0s"
      - name: DRAIN_DELETE_EMPTYDIR_DATA
        value: "false"
  env: []
  resources: {}
  # Private mirror repository configuration
  repoConfig:
    configMapName: ""
  # custom ssl key/certificate configuration
  certConfig:
    name: ""
  # vGPU licensing configuration
  licensingConfig:
    configMapName: ""
    nlsEnabled: true
  # vGPU topology daemon configuration
  virtualTopology:
    config: ""
  # kernel module configuration for NVIDIA driver
  kernelModuleConfig:
    name: ""

toolkit:
  enabled: false   ## must be false if the NVIDIA container runtime (container toolkit) is already installed
  repository: nvcr.io/nvidia/k8s
  image: container-toolkit
  version: v1.14.6-ubuntu20.04
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  resources: {}
  installDir: "/usr/local/nvidia"
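With driver.enabled and toolkit.enabled set to false in the pulled chart's values.yaml, the operator can then be installed from the local chart directory. This is a minimal sketch; the release name and namespace below are only examples:

helm install gpu-operator ./gpu-operator \
  -n gpu-operator --create-namespace \
  --wait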