
A Brief Look at TensorFlow 2's GPU Memory Allocation Strategy

 Updated: 2021-08-12 16:07:50   Author: 無風(fēng)聽海
This article looks at how TensorFlow 2 allocates GPU memory. The sample code is explained in detail and should be a useful reference for anyone interested in the topic.

1. Origin of the Problem

The exception stack below shows that the BLAS handle failed to initialize and that the exception occurred while executing MatMul. It is fairly safe to conclude that the dataset was too large and the GPU ran out of memory.

2021-08-10 16:38:04.917501: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-08-10 16:38:04.960048: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-08-10 16:38:04.986898: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-08-10 16:38:04.992366: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2021-08-10 16:38:04.992389: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "/home/mango/PycharmProjects/DeepLearing/minist_conv.py", line 32, in <module>
    model.fit(train_images, train_labels, epochs=5, batch_size=64)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/keras/engine/training.py", line 1183, in fit
    tmp_logs = self.train_function(iterator)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/function.py", line 3023, in __call__
    return graph_function._call_flat(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/function.py", line 1960, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/function.py", line 591, in call
    outputs = execute.execute(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError:  Blas xGEMM launch failed : a.shape=[1,64,576], b.shape=[1,576,64], m=64, n=64, k=576
  [[node sequential/dense/MatMul (defined at home/mango/PycharmProjects/DeepLearing/minist_conv.py:32) ]] [Op:__inference_train_function_993]

Function call stack:
train_function
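
To give some context, the traceback points at line 32 of minist_conv.py, a model.fit call on a small convolutional MNIST model. The original script is not reproduced in this article; the sketch below is only a guess at what it roughly looks like, with the layer sizes inferred from the shapes a.shape=[1,64,576] and b.shape=[1,576,64] in the error message, so treat every detail as an assumption rather than the actual script.

# Hypothetical reconstruction, NOT the original minist_conv.py.
# 576 = 3*3*64, i.e. the flattened conv output feeding Dense(64),
# which matches the Blas xGEMM shapes in the traceback.
import tensorflow as tf
from tensorflow.keras import layers, models

(train_images, train_labels), _ = tf.keras.datasets.mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1)).astype("float32") / 255.0

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# This is the call that fails with CUBLAS_STATUS_NOT_INITIALIZED above.
model.fit(train_images, train_labels, epochs=5, batch_size=64)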

2. Development Environment

mango@mango-ubuntu:~$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jul_14_19:41:19_PDT_2021
Cuda compilation tools, release 11.4, V11.4.100
Build cuda_11.4.r11.4/compiler.30188945_0

mango@mango-ubuntu:~$ tail -n 10 /usr/include/cudnn_version.h
#ifndef CUDNN_VERSION_H_
#define CUDNN_VERSION_H_

#define CUDNN_MAJOR 8
#define CUDNN_MINOR 2
#define CUDNN_PATCHLEVEL 2

#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#endif /* CUDNN_VERSION_H */

mango@mango-ubuntu:~$ python3 --version
Python 3.9.5

mango@mango-ubuntu:~$ nvidia-smi
Tue Aug 10 19:57:58 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   54C    P0    N/A /  N/A |    329MiB /  2002MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1818      G   /usr/lib/xorg/Xorg                186MiB |
|    0   N/A  N/A      2002      G   /usr/bin/gnome-shell               45MiB |
|    0   N/A  N/A      3435      G   ...AAAAAAAAA= --shared-files       75MiB |
|    0   N/A  N/A      6016      G   python3                            13MiB |
+-----------------------------------------------------------------------------+

 

mango@mango-ubuntu:~$ python3
Python 3.9.5 (default, May 11 2021, 08:20:37) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2021-08-10 18:33:05.917520: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
>>> tf.__version__
'2.5.0'
>>> 

3. TensorFlow's GPU Memory Allocation Strategy

By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. This is done to more efficiently use the relatively precious GPU memory resources on the devices by reducing memory fragmentation.


In some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as is needed by the process. TensorFlow provides two methods to control this.


The first option is to turn on memory growth by calling tf.config.experimental.set_memory_growth, which attempts to allocate only as much GPU memory as needed for the runtime allocations: it starts out allocating very little memory, and as the program gets run and more GPU memory is needed, the GPU memory region is extended for the TensorFlow process. Memory is not released since it can lead to memory fragmentation. To turn on memory growth for a specific GPU, use the following code prior to allocating any tensors or executing any ops.


import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

Another way to enable this option is to set the environmental variable TF_FORCE_GPU_ALLOW_GROWTH to true. This configuration is platform specific.

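For instance, instead of exporting the variable in the shell before launching the training script, it can also be set from Python, as long as this happens before TensorFlow initializes the GPU (a small sketch; the rest of the training code is assumed to follow):

import os
# Must take effect before the GPU is initialized, so set it before importing TensorFlow.
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

import tensorflow as tf
# ... build and train the model as usual ...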

The second method is to configure a virtual GPU device with tf.config.experimental.set_virtual_device_configuration and set a hard limit on the total memory to allocate on the GPU.

This is useful if you want to truly bound the amount of GPU memory available to the TensorFlow process. This is common practice for local development when the GPU is shared with other applications such as a workstation GUI.


gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
  try:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

4. Problem Analysis and Verification

From the TensorFlow documentation analysed above, the default behaviour is to claim all of the GPU memory, but TensorFlow offers two ways to control the allocation strategy more flexibly.

We can simply enable on-demand (dynamic) GPU memory growth:

import tensorflow as tf
physical_gpus = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_gpus[0], True)

The following command shows that GPU memory usage peaks at 697 MiB during training:

mango@mango-ubuntu:~$ while true; do nvidia-smi; sleep 0.2; done;
Tue Aug 10 20:30:58 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   58C    P0    N/A /  N/A |   1026MiB /  2002MiB |     72%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1818      G   /usr/lib/xorg/Xorg                186MiB |
|    0   N/A  N/A      2002      G   /usr/bin/gnome-shell               45MiB |
|    0   N/A  N/A      3435      G   ...AAAAAAAAA= --shared-files       73MiB |
|    0   N/A  N/A      6016      G   python3                            13MiB |
|    0   N/A  N/A     13829      C   /usr/bin/python3.9                697MiB |
+-----------------------------------------------------------------------------+

We can also cap GPU memory usage at 1024 MiB:

import tensorflow as tf
physical_gpus = tf.config.list_physical_devices('GPU')
tf.config.set_logical_device_configuration(physical_gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])

Again, the same command shows that GPU memory usage peaks at 1455 MiB during training:

mango@mango-ubuntu:~$ while true; do nvidia-smi; sleep 0.2; done;
Tue Aug 10 20:31:24 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   58C    P0    N/A /  N/A |   1784MiB /  2002MiB |     74%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1818      G   /usr/lib/xorg/Xorg                186MiB |
|    0   N/A  N/A      2002      G   /usr/bin/gnome-shell               46MiB |
|    0   N/A  N/A      3435      G   ...AAAAAAAAA= --shared-files       72MiB |
|    0   N/A  N/A      6016      G   python3                            13MiB |
|    0   N/A  N/A     13570      C   /usr/bin/python3.9               1455MiB |
+-----------------------------------------------------------------------------+

5. Analysis of the GPU Allocation Strategies

From the tests in Section 4 we can conclude the following:

  • The default strategy claims all of the GPU memory and never releases it during execution; if the training data is fairly large, it is easy to run out of memory.
  • With a hard memory limit, the observed usage is larger than the configured value. This probably depends on the model and on the complexity of the operations used during training, so a suitable value has to be chosen for the specific workload (see the monitoring sketch after this list). Note that setting the limit too high will still trigger the error, whereas setting it too low only makes execution slower.
  • On-demand growth looks like a reasonable middle ground and probably the better option; it is unclear why TensorFlow does not make it the default. I will leave this as an open question and come back to it if I learn more.
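
To help choose a sensible memory_limit, one option (my own addition, not something measured above) is to ask TensorFlow for its peak allocation through tf.config.experimental.get_memory_info, which as far as I know is available in TensorFlow 2.5:

import tensorflow as tf

# Enable on-demand growth, train as usual, then read the peak allocation
# to estimate how large a memory_limit the workload actually needs.
physical_gpus = tf.config.list_physical_devices('GPU')
if physical_gpus:
    tf.config.experimental.set_memory_growth(physical_gpus[0], True)

# ... build and fit the model here ...

info = tf.config.experimental.get_memory_info('GPU:0')
print('current: %d MiB, peak: %d MiB'
      % (info['current'] // 2**20, info['peak'] // 2**20))

Note that nvidia-smi will always report more than this figure, because the CUDA context and the cuBLAS/cuDNN workspaces are not tracked by TensorFlow's allocator; that is likely part of why the observed 1455MiB is larger than the configured 1024MiB limit.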

6. Extensions

Simulating multiple GPUs with a single GPU

When the local development environment has only one GPU but we need to write multi-GPU programs that will later be trained on a workstation, TensorFlow provides a convenient feature that lets us create several simulated GPUs locally, making it much easier to debug multi-GPU code. The following code creates two virtual GPUs, each with 1 GB of memory, on top of the physical device GPU:0.

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Create 2 virtual GPUs with 1GB memory each
  try:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024),
         tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

Data parallelism across multiple GPUs

Using tf.distribute.Strategy, the model is copied to each GPU and the training data is split into batches that are executed on the different GPUs, achieving data parallelism.

tf.debugging.set_log_device_placement(True)
gpus = tf.config.list_logical_devices('GPU')
strategy = tf.distribute.MirroredStrategy(gpus)
with strategy.scope():
  inputs = tf.keras.layers.Input(shape=(1,))
  predictions = tf.keras.layers.Dense(1)(inputs)
  model = tf.keras.models.Model(inputs=inputs, outputs=predictions)
  model.compile(loss='mse',
                optimizer=tf.keras.optimizers.SGD(learning_rate=0.2))
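
To actually drive the two logical GPUs created above, the model can be fitted on a small synthetic dataset; the numbers below are made up purely for illustration:

# Tiny synthetic regression dataset, just enough to run a few
# data-parallel training steps across the mirrored replicas.
import numpy as np
x = np.random.random((100, 1)).astype('float32')
y = 3.0 * x + 2.0
model.fit(x, y, epochs=5, batch_size=10)

With tf.debugging.set_log_device_placement(True) enabled, the log should show the operations being placed on both logical GPUs.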

This concludes this brief look at TensorFlow 2's GPU memory allocation strategy. For more on TensorFlow 2 GPU memory allocation, please search 腳本之家's earlier articles, and I hope you will continue to support 腳本之家!
