欧美bbbwbbbw肥妇,免费乱码人妻系列日韩,一级黄片

nvidia-smi命令詳解和一些高階技巧講解

 更新時(shí)間:2023年01月10日 16:19:37   作者:Chaos_Wang_  
一般情況下用的比較多的就是nvidia-smi的命令,其實(shí)掌握了這一個(gè)命令也就能夠覆蓋絕大多數(shù)場景了,但是本質(zhì)求真務(wù)實(shí)的態(tài)度,本文調(diào)研了相關(guān)資料,整理了一些比較常用的nvidia-smi命令的其他用法,感興趣的朋友跟隨小編一起看看吧

nvidia-smi命令詳解

1. nvidia-smi命令介紹

在深度學(xué)習(xí)等場景中,nvidia-smi命令是我們經(jīng)常接觸到的一個(gè)命令,用來查看GPU的占用情況,可以說是一個(gè)必須要學(xué)會的命令了,普通用戶一般用的比較多的就是nvidia-smi的命令,其實(shí)掌握了這一個(gè)命令也就能夠覆蓋絕大多數(shù)場景了,但是本質(zhì)求真務(wù)實(shí)的態(tài)度,本文調(diào)研了相關(guān)資料,整理了一些比較常用的nvidia-smi命令的其他用法。

2. nvidia-smi 支持的 GPU

NVIDIA 的 SMI 工具基本上支持自 2011 年以來發(fā)布的任何 NVIDIA GPU。其中包括來自 Fermi 和更高架構(gòu)系列(Kepler、Maxwell、Pascal、Volta , Turing, Ampere等)的 Tesla、Quadro 和 GeForce 設(shè)備。
支持的產(chǎn)品包括:
特斯拉:S1070、S2050、C1060、C2050/70、M2050/70/90、X2070/90、K10、K20、K20X、K40、K80、M40、P40、P100、V100、A100、H100。
Quadro:4000、5000、6000、7000、M2070-Q、K 系列、M 系列、P 系列、RTX 系列
GeForce:不同級別的支持,可用的指標(biāo)少于 Tesla 和 Quadro 產(chǎn)品。

3. 桌面顯卡和服務(wù)器顯卡均支持的命令

(1)nvidia-smi

用戶名@主機(jī)名:~# nvidia-smi
Mon Dec  5 08:48:49 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.172.01   Driver Version: 450.172.01   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:26:00.0 Off |                    0 |
| N/A   22C    P0    54W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:2C:00.0 Off |                    0 |
| N/A   25C    P0    53W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:66:00.0 Off |                    0 |
| N/A   25C    P0    50W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:6B:00.0 Off |                    0 |
| N/A   23C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      On   | 00000000:A2:00.0 Off |                    0 |
| N/A   23C    P0    57W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      On   | 00000000:A7:00.0 Off |                    0 |
| N/A   25C    P0    52W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

下面詳細(xì)介紹一下每個(gè)部分分別表示什么:

下面是上圖的一些具體介紹,比較簡單的我就不列舉了,只列舉一些不常見的:

持久模式:persistence mode 能夠讓 GPU 更快響應(yīng)任務(wù),待機(jī)功耗增加。
        關(guān)閉 persistence mode 同樣能夠啟動任務(wù)。
        持續(xù)模式雖然耗能大,但是在新的GPU應(yīng)用啟動時(shí),花費(fèi)的時(shí)間更少
風(fēng)扇轉(zhuǎn)速:主動散熱的顯卡一般會有這個(gè)參數(shù),服務(wù)器顯卡一般是被動散熱,這個(gè)參數(shù)顯示N/A。
        從0到100%之間變動,這個(gè)速度是計(jì)算機(jī)期望的風(fēng)扇轉(zhuǎn)速,實(shí)際情況下如果風(fēng)扇堵轉(zhuǎn),可能打不到顯示的轉(zhuǎn)速。
        有的設(shè)備不會返回轉(zhuǎn)速,因?yàn)樗灰蕾囷L(fēng)扇冷卻而是通過其他外設(shè)保持低溫(比如有些實(shí)驗(yàn)室的服務(wù)器是常年放在空調(diào)房間里的)。 
溫度:單位是攝氏度。 
性能狀態(tài):從P0到P12,P0表示最大性能,P12表示狀態(tài)最小性能。 
Disp.A:Display Active,表示GPU的顯示是否初始化。 
ECC糾錯(cuò):這個(gè)只有近幾年的顯卡才具有這個(gè)功能,老版顯卡不具備這個(gè)功能。 
MIG:Multi-Instance GPU,多實(shí)例顯卡技術(shù),支持將一張顯卡劃分成多張顯卡使用,目前只支持安培架構(gòu)顯卡。
新的多實(shí)例GPU (MIG)特性允許GPU(從NVIDIA安培架構(gòu)開始)被安全地劃分為多達(dá)7個(gè)獨(dú)立的GPU實(shí)例。
用于CUDA應(yīng)用,為多個(gè)用戶提供獨(dú)立的GPU資源,以實(shí)現(xiàn)最佳的GPU利用率。
對于GPU計(jì)算能力未完全飽和的工作負(fù)載,該特性尤其有益,因此用戶可能希望并行運(yùn)行不同的工作負(fù)載,以最大化利用率。

開啟MIG的顯卡使用nvidia-smi命令如下所示:

Mon Dec  5 16:48:56 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.172.01   Driver Version: 450.172.01   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:26:00.0 Off |                    0 |
| N/A   22C    P0    54W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM4-40GB      On   | 00000000:2C:00.0 Off |                    0 |
| N/A   24C    P0    53W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  A100-SXM4-40GB      On   | 00000000:66:00.0 Off |                    0 |
| N/A   25C    P0    50W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  A100-SXM4-40GB      On   | 00000000:6B:00.0 Off |                    0 |
| N/A   23C    P0    51W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  A100-SXM4-40GB      On   | 00000000:A2:00.0 Off |                    0 |
| N/A   23C    P0    56W / 400W |   3692MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  A100-SXM4-40GB      On   | 00000000:A7:00.0 Off |                    0 |
| N/A   25C    P0    52W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  A100-SXM4-40GB      On   | 00000000:E1:00.0 Off |                   On |
| N/A   24C    P0    43W / 400W |     22MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   7  A100-SXM4-40GB      On   | 00000000:E7:00.0 Off |                   On |
| N/A   22C    P0    52W / 400W |  19831MiB / 40537MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  6    1   0   0  |     11MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  6    2   0   1  |     11MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  7    1   0   0  |     11MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  7    2   0   1  |  19820MiB / 20096MiB | 42      0 |  3   0    2    0    0 |
|                  |      4MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    4   N/A  N/A     22888      C   python                           3689MiB |
|    7    2    0      37648      C   ...vs/tf2.4-py3.8/bin/python    19805MiB |
+-----------------------------------------------------------------------------+

如上圖我將GPU 6和7兩張顯卡使用MIG技術(shù)虛擬化成4張顯卡,分別是GPU6:0,GPU6:1,GPU7:0,GPU7:1
每張卡的顯存為20GB,MIG技術(shù)支持將A100顯卡虛擬化成7張顯卡,具體介紹我會在之后的博客中更新。

(2)nvidia-smi -h

輸入 nvidia-smi -h 可查看該命令的幫助手冊,如下所示:

用戶名@主機(jī)名:$ nvidia-smi -h
NVIDIA System Management Interface -- v450.172.01

NVSMI provides monitoring information for Tesla and select Quadro devices.
The data is presented in either a plain text or an XML format, via stdout or a file.
NVSMI also provides several management operations for changing the device state.

Note that the functionality of NVSMI is exposed through the NVML C-based
library. See the NVIDIA developer website for more information about NVML.
Python wrappers to NVML are also available.  The output of NVSMI is
not guaranteed to be backwards compatible; NVML and the bindings are backwards
compatible.

http://developer.nvidia.com/nvidia-management-library-nvml/
http://pypi.python.org/pypi/nvidia-ml-py/
Supported products:
- Full Support
    - All Tesla products, starting with the Kepler architecture
    - All Quadro products, starting with the Kepler architecture
    - All GRID products, starting with the Kepler architecture
    - GeForce Titan products, starting with the Kepler architecture
- Limited Support
    - All Geforce products, starting with the Kepler architecture
nvidia-smi [OPTION1 [ARG1]] [OPTION2 [ARG2]] ...

    -h,   --help                Print usage information and exit.

  LIST OPTIONS:

    -L,   --list-gpus           Display a list of GPUs connected to the system.

    -B,   --list-blacklist-gpus Display a list of blacklisted GPUs in the system.

  SUMMARY OPTIONS:

    <no arguments>              Show a summary of GPUs connected to the system.

    [plus any of]

    -i,   --id=                 Target a specific GPU.
    -f,   --filename=           Log to a specified file, rather than to stdout.
    -l,   --loop=               Probe until Ctrl+C at specified second interval.

  QUERY OPTIONS:

    -q,   --query               Display GPU or Unit info.

    [plus any of]

    -u,   --unit                Show unit, rather than GPU, attributes.
    -i,   --id=                 Target a specific GPU or Unit.
    -f,   --filename=           Log to a specified file, rather than to stdout.
    -x,   --xml-format          Produce XML output.
          --dtd                 When showing xml output, embed DTD.
    -d,   --display=            Display only selected information: MEMORY,
                                    UTILIZATION, ECC, TEMPERATURE, POWER, CLOCK,
                                    COMPUTE, PIDS, PERFORMANCE, SUPPORTED_CLOCKS,
                                    PAGE_RETIREMENT, ACCOUNTING, ENCODER_STATS,
                                    FBC_STATS, ROW_REMAPPER
                                Flags can be combined with comma e.g. ECC,POWER.
                                Sampling data with max/min/avg is also returned 
                                for POWER, UTILIZATION and CLOCK display types.
                                Doesn't work with -u or -x flags.
    -l,   --loop=               Probe until Ctrl+C at specified second interval.

    -lms, --loop-ms=            Probe until Ctrl+C at specified millisecond interval.

  SELECTIVE QUERY OPTIONS:

    Allows the caller to pass an explicit list of properties to query.

    [one of]

    --query-gpu=                Information about GPU.
                                Call --help-query-gpu for more info.
    --query-supported-clocks=   List of supported clocks.
                                Call --help-query-supported-clocks for more info.
    --query-compute-apps=       List of currently active compute processes.
                                Call --help-query-compute-apps for more info.
    --query-accounted-apps=     List of accounted compute processes.
                                Call --help-query-accounted-apps for more info.
                                This query is not supported on vGPU host.
    --query-retired-pages=      List of device memory pages that have been retired.
                                Call --help-query-retired-pages for more info.
    --query-remapped-rows=      Information about remapped rows.
                                Call --help-query-remapped-rows for more info.

    [mandatory]

    --format=                   Comma separated list of format options:
                                  csv - comma separated values (MANDATORY)
                                  noheader - skip the first line with column headers
                                  nounits - don't print units for numerical
                                             values

    [plus any of]

    -i,   --id=                 Target a specific GPU or Unit.
    -f,   --filename=           Log to a specified file, rather than to stdout.
    -l,   --loop=               Probe until Ctrl+C at specified second interval.
    -lms, --loop-ms=            Probe until Ctrl+C at specified millisecond interval.

  DEVICE MODIFICATION OPTIONS:

    [any one of]

    -pm,  --persistence-mode=   Set persistence mode: 0/DISABLED, 1/ENABLED
    -e,   --ecc-config=         Toggle ECC support: 0/DISABLED, 1/ENABLED
    -p,   --reset-ecc-errors=   Reset ECC error counts: 0/VOLATILE, 1/AGGREGATE
    -c,   --compute-mode=       Set MODE for compute applications:
                                0/DEFAULT, 1/EXCLUSIVE_PROCESS,
                                2/PROHIBITED
          --gom=                Set GPU Operation Mode:
                                    0/ALL_ON, 1/COMPUTE, 2/LOW_DP
    -r    --gpu-reset           Trigger reset of the GPU.
                                Can be used to reset the GPU HW state in situations
                                that would otherwise require a machine reboot.
                                Typically useful if a double bit ECC error has
                                occurred.
                                Reset operations are not guarenteed to work in
                                all cases and should be used with caution.
    -vm   --virt-mode=          Switch GPU Virtualization Mode:
                                Sets GPU virtualization mode to 3/VGPU or 4/VSGA
                                Virtualization mode of a GPU can only be set when
                                it is running on a hypervisor.
    -lgc  --lock-gpu-clocks=    Specifies <minGpuClock,maxGpuClock> clocks as a
                                    pair (e.g. 1500,1500) that defines the range 
                                    of desired locked GPU clock speed in MHz.
                                    Setting this will supercede application clocks
                                    and take effect regardless if an app is running.
                                    Input can also be a singular desired clock value
                                    (e.g. <GpuClockValue>).
    -rgc  --reset-gpu-clocks
                                Resets the Gpu clocks to the default values.
    -ac   --applications-clocks= Specifies <memory,graphics> clocks as a
                                    pair (e.g. 2000,800) that defines GPU's
                                    speed in MHz while running applications on a GPU.
    -rac  --reset-applications-clocks
                                Resets the applications clocks to the default values.
    -acp  --applications-clocks-permission=
                                Toggles permission requirements for -ac and -rac commands:
                                0/UNRESTRICTED, 1/RESTRICTED
    -pl   --power-limit=        Specifies maximum power management limit in watts.
    -cc   --cuda-clocks=        Overrides or restores default CUDA clocks.
                                In override mode, GPU clocks higher frequencies when running CUDA applications.
                                Only on supported devices starting from the Volta series.
                                Requires administrator privileges.
                                0/RESTORE_DEFAULT, 1/OVERRIDE
    -am   --accounting-mode=    Enable or disable Accounting Mode: 0/DISABLED, 1/ENABLED
    -caa  --clear-accounted-apps
                                Clears all the accounted PIDs in the buffer.
          --auto-boost-default= Set the default auto boost policy to 0/DISABLED
                                or 1/ENABLED, enforcing the change only after the
                                last boost client has exited.
          --auto-boost-permission=
                                Allow non-admin/root control over auto boost mode:
                                0/UNRESTRICTED, 1/RESTRICTED
    -mig  --multi-instance-gpu= Enable or disable Multi Instance GPU: 0/DISABLED, 1/ENABLED
                                Requires root.

   [plus optional]

    -i,   --id=                 Target a specific GPU.
    -eow, --error-on-warning    Return a non-zero error for warnings.

  UNIT MODIFICATION OPTIONS:

    -t,   --toggle-led=         Set Unit LED state: 0/GREEN, 1/AMBER

   [plus optional]

    -i,   --id=                 Target a specific Unit.

  SHOW DTD OPTIONS:

          --dtd                 Print device DTD and exit.

     [plus optional]

    -f,   --filename=           Log to a specified file, rather than to stdout.
    -u,   --unit                Show unit, rather than device, DTD.

    --debug=                    Log encrypted debug information to a specified file. 

 STATISTICS: (EXPERIMENTAL)
    stats                       Displays device statistics. "nvidia-smi stats -h" for more information.

 Device Monitoring:
    dmon                        Displays device stats in scrolling format.
                                "nvidia-smi dmon -h" for more information.

    daemon                      Runs in background and monitor devices as a daemon process.
                                This is an experimental feature. Not supported on Windows baremetal
                                "nvidia-smi daemon -h" for more information.

    replay                      Used to replay/extract the persistent stats generated by daemon.
                                This is an experimental feature.
                                "nvidia-smi replay -h" for more information.

 Process Monitoring:
    pmon                        Displays process stats in scrolling format.
                                "nvidia-smi pmon -h" for more information.

 TOPOLOGY:
    topo                        Displays device/system topology. "nvidia-smi topo -h" for more information.

 DRAIN STATES:
    drain                       Displays/modifies GPU drain states for power idling. "nvidia-smi drain -h" for more information.

 NVLINK:
    nvlink                      Displays device nvlink information. "nvidia-smi nvlink -h" for more information.

 CLOCKS:
    clocks                      Control and query clock information. "nvidia-smi clocks -h" for more information.

 ENCODER SESSIONS:
    encodersessions             Displays device encoder sessions information. "nvidia-smi encodersessions -h" for more information.

 FBC SESSIONS:
    fbcsessions                 Displays device FBC sessions information. "nvidia-smi fbcsessions -h" for more information.

 GRID vGPU:
    vgpu                        Displays vGPU information. "nvidia-smi vgpu -h" for more information.

 MIG:
    mig                         Provides controls for MIG management. "nvidia-smi mig -h" for more information.

Please see the nvidia-smi(1) manual page for more detailed information.

(3)nvidia-smi -L

輸入 nvidia-smi -L 可以列出所有的GPU設(shè)備及其UUID,如下所示:

用戶名@主機(jī)名:$ nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-9f2df045-8650-7b2e-d442-cc0d7ba0150d)
GPU 1: A100-SXM4-40GB (UUID: GPU-ad117600-5d0e-557d-c81d-7c4d60555eaa)
GPU 2: A100-SXM4-40GB (UUID: GPU-087ca21d-0c14-66d0-3869-59ff65a48d58)
GPU 3: A100-SXM4-40GB (UUID: GPU-eb9943a0-d78e-4220-fdb2-e51f2b490064)
GPU 4: A100-SXM4-40GB (UUID: GPU-8302a1f7-a8be-753f-eb71-6f60c5de1703)
GPU 5: A100-SXM4-40GB (UUID: GPU-ee1be011-0d98-3d6f-8c89-b55c28966c63)
GPU 6: A100-SXM4-40GB (UUID: GPU-8b3c7f5b-3fb4-22c3-69c9-4dd6a9f31586)
  MIG 3g.20gb Device 0: (UUID: MIG-GPU-8b3c7f5b-3fb4-22c3-69c9-4dd6a9f31586/1/0)
  MIG 3g.20gb Device 1: (UUID: MIG-GPU-8b3c7f5b-3fb4-22c3-69c9-4dd6a9f31586/2/0)
GPU 7: A100-SXM4-40GB (UUID: GPU-fab40b5f-c286-603f-8909-cdb73e5ffd6c)
  MIG 3g.20gb Device 0: (UUID: MIG-GPU-fab40b5f-c286-603f-8909-cdb73e5ffd6c/1/0)
  MIG 3g.20gb Device 1: (UUID: MIG-GPU-fab40b5f-c286-603f-8909-cdb73e5ffd6c/2/0)

(4)nvidia-smi -q

輸入 nvidia-smi -q 可以列出所有GPU設(shè)備的詳細(xì)信息。如果只想列出某一GPU的詳細(xì)信息,可使用 -i 選項(xiàng)指定,如下圖所示:

用戶名@主機(jī)名:$ nvidia-smi -q -i 0
==============NVSMI LOG==============

Timestamp                                 : Mon Dec  5 17:31:45 2022
Driver Version                            : 450.172.01
CUDA Version                              : 11.0

Attached GPUs                             : 8
GPU 00000000:26:00.0
    Product Name                          : A100-SXM4-40GB
    Product Brand                         : Tesla
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1564720004631
    GPU UUID                              : GPU-9f2df045-8650-7b2e-d442-cc0d7ba0150d
    Minor Number                          : 2
    VBIOS Version                         : 92.00.19.00.10
    MultiGPU Board                        : No
    Board ID                              : 0x2600
    GPU Part Number                       : 692-2G506-0200-002
    Inforom Version
        Image Version                     : G506.0200.00.04
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x26
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x20B010DE
        Bus Id                            : 00000000:26:00.0
        Sub System Id                     : 0x134F10DE
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 40537 MiB
        Used                              : 0 MiB
        Free                              : 40537 MiB
    BAR1 Memory Usage
        Total                             : 65536 MiB
        Used                              : 29 MiB
        Free                              : 65507 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 640 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 26 C
        GPU Shutdown Temp                 : 92 C
        GPU Slowdown Temp                 : 89 C
        GPU Max Operating Temp            : 85 C
        Memory Current Temp               : 38 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 55.79 W
        Power Limit                       : 400.00 W
        Default Power Limit               : 400.00 W
        Enforced Power Limit              : 400.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 400.00 W
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 1215 MHz
        Video                             : 585 MHz
    Applications Clocks
        Graphics                          : 1095 MHz
        Memory                            : 1215 MHz
    Default Applications Clocks
        Graphics                          : 1095 MHz
        Memory                            : 1215 MHz
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1215 MHz
        Video                             : 1290 MHz
    Max Customer Boost Clocks
        Graphics                          : 1410 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

(5)nvidia-smi -l [second]

輸入 nvidia-smi -l [second] 后會每隔 second 秒刷新一次面板。監(jiān)控GPU利用率通常會選擇每隔1秒刷新一次,例如:

nvidia-smi -l 2

(6)nvidia-smi -pm

在 Linux 上,您可以將 GPU 設(shè)置為持久模式,以保持加載 NVIDIA 驅(qū)動程序,即使沒有應(yīng)用程序正在訪問這些卡。 當(dāng)您運(yùn)行一系列短作業(yè)時(shí),這特別有用。 持久模式在每個(gè)空閑 GPU 上使用更多瓦特,但可以防止每次啟動 GPU 應(yīng)用程序時(shí)出現(xiàn)相當(dāng)長的延遲。 如果您為 GPU 分配了特定的時(shí)鐘速度或功率限制(因?yàn)樾遁d NVIDIA 驅(qū)動程序時(shí)這些更改會丟失),這也是必要的。 通過運(yùn)行以下命令在所有 GPU 上啟用持久性模式:

nvidia-smi -pm 1

也可以指定開啟某個(gè)顯卡的持久模式:

nvidia-smi -pm 1 -i 0

(7)nvidia-smi dmon

以 1 秒的更新間隔監(jiān)控整體 GPU 使用情況

(8)nvidia-smi pmon

以 1 秒的更新間隔監(jiān)控每個(gè)進(jìn)程的 GPU 使用情況:

(9)…

4. 服務(wù)器顯卡才支持的命令

(1)使用 nvidia-smi 查看系統(tǒng)/GPU 拓?fù)浜?NVLink

要正確利用更高級的 NVIDIA GPU 功能(例如 GPU Direct),正確配置系統(tǒng)拓?fù)渲陵P(guān)重要。 拓?fù)涫侵父鞣N系統(tǒng)設(shè)備(GPU、InfiniBand HCA、存儲控制器等)如何相互連接以及如何連接到系統(tǒng)的 CPU。 某些拓?fù)漕愋蜁档托阅苌踔翆?dǎo)致某些功能不可用。 為了幫助解決此類問題,nvidia-smi 支持系統(tǒng)拓?fù)浜瓦B接查詢:

nvidia-smi topo --matrix              #查看系統(tǒng)/GPU 拓?fù)?
nvidia-smi nvlink --status            #查詢 NVLink 連接本身以確保狀態(tài)、功能和運(yùn)行狀況。
nvidia-smi nvlink --capabilities      #查詢 NVLink 連接本身以確保狀態(tài)、功能和運(yùn)行狀況。

(2)顯卡分片技術(shù)

詳見之后的更新

還有其他有用命令歡迎扔到評論區(qū)

參考文獻(xiàn)

[1] nvidia-smi詳解 https://blog.csdn.net/kunhe0512/article/details/126265050
[2] nvidia-smi常用選項(xiàng)匯總 https://raelum.blog.csdn.net/article/details/126914188
[3] nvidia-smi 命令詳解 https://blog.csdn.net/m0_60721514/article/details/125241141
[4] 多實(shí)例顯卡指南 https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html

到此這篇關(guān)于nvidia-smi命令詳解和一些高階技巧介紹的文章就介紹到這了,更多相關(guān)nvidia-smi命令內(nèi)容請搜索腳本之家以前的文章或繼續(xù)瀏覽下面的相關(guān)文章希望大家以后多多支持腳本之家!

相關(guān)文章

最新評論