Low GPU utilization when training PyTorch models
Preface
My GPU setup is 2 × RTX 2080Ti. While monitoring GPU usage during a recent training run, I noticed that although both cards were in use, the utilization of each card was very unstable and the two seemed to be used alternately. Training is very slow in this situation.
Below is the process I went through to solve this problem.
1. CPU and memory usage
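As a quick point of reference, CPU and memory load can also be sampled from Python. The sketch below is a minimal example assuming the third-party psutil package (pip install psutil), which is not part of the original post.

import time
import psutil  # assumed third-party dependency: pip install psutil

# Print system-wide CPU and memory usage once per second while training runs.
for _ in range(10):
    cpu = psutil.cpu_percent(interval=1)   # CPU usage (%) over the last second
    mem = psutil.virtual_memory()          # system-wide memory statistics
    print(f"CPU {cpu:5.1f}%   RAM {mem.percent:5.1f}% "
          f"({mem.used / 1024**3:.1f} GiB / {mem.total / 1024**3:.1f} GiB)")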
2. Checking GPU usage with a Linux command
watch -n 1 nvidia-smi
During the prediction stage the model runs on GPU 0, but even then utilization is only 51%.
During training both cards are used at the same time, yet utilization is still low; the highest value I captured was only about 60%.
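Besides watching nvidia-smi in a terminal, per-GPU utilization can also be sampled from Python. The sketch below is an assumed alternative (not something the original workflow used), relying on the NVIDIA Management Library bindings pynvml from the nvidia-ml-py package.

import time
import pynvml  # assumed dependency: pip install nvidia-ml-py

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

# Sample GPU core utilization once per second, one entry per card.
for _ in range(10):
    rates = [pynvml.nvmlDeviceGetUtilizationRates(h) for h in handles]
    print("   ".join(f"GPU{i}: {r.gpu:3d}%" for i, r in enumerate(rates)))
    time.sleep(1)

pynvml.nvmlShutdown()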
3. A fix found in the PyTorch documentation
data.DataLoader(dataset: Dataset[T_co], batch_size: Optional[int] = 1,
                shuffle: bool = False, sampler: Optional[Sampler[int]] = None,
                batch_sampler: Optional[Sampler[Sequence[int]]] = None,
                num_workers: int = 0, collate_fn: _collate_fn_t = None,
                pin_memory: bool = False, drop_last: bool = False,
                timeout: float = 0, worker_init_fn: _worker_init_fn_t = None,
                multiprocessing_context=None, generator=None, *,
                prefetch_factor: int = 2, persistent_workers: bool = False)
These are the class's constructor parameters; the two settings relevant to this article are called out below.
Below is the class's docstring:
class DataLoader(Generic[T_co]):
    r"""Data loader. Combines a dataset and a sampler, and provides an iterable over
    the given dataset.

    The :class:`~torch.utils.data.DataLoader` supports both map-style and
    iterable-style datasets with single- or multi-process loading, customizing
    loading order and optional automatic batching (collation) and memory pinning.

    See :py:mod:`torch.utils.data` documentation page for more details.

    Args:
        dataset (Dataset): dataset from which to load the data.
        batch_size (int, optional): how many samples per batch to load
            (default: ``1``).
        shuffle (bool, optional): set to ``True`` to have the data reshuffled
            at every epoch (default: ``False``).
        sampler (Sampler or Iterable, optional): defines the strategy to draw
            samples from the dataset. Can be any ``Iterable`` with ``__len__``
            implemented. If specified, :attr:`shuffle` must not be specified.
        batch_sampler (Sampler or Iterable, optional): like :attr:`sampler`, but
            returns a batch of indices at a time. Mutually exclusive with
            :attr:`batch_size`, :attr:`shuffle`, :attr:`sampler`,
            and :attr:`drop_last`.
        num_workers (int, optional): how many subprocesses to use for data
            loading. ``0`` means that the data will be loaded in the main process.
            (default: ``0``)
        collate_fn (callable, optional): merges a list of samples to form a
            mini-batch of Tensor(s). Used when using batched loading from a
            map-style dataset.
        pin_memory (bool, optional): If ``True``, the data loader will copy Tensors
            into CUDA pinned memory before returning them. If your data elements
            are a custom type, or your :attr:`collate_fn` returns a batch that is
            a custom type, see the example below.
        drop_last (bool, optional): set to ``True`` to drop the last incomplete batch,
            if the dataset size is not divisible by the batch size. If ``False`` and
            the size of dataset is not divisible by the batch size, then the last batch
            will be smaller. (default: ``False``)
        timeout (numeric, optional): if positive, the timeout value for collecting a batch
            from workers. Should always be non-negative. (default: ``0``)
        worker_init_fn (callable, optional): If not ``None``, this will be called on each
            worker subprocess with the worker id (an int in ``[0, num_workers - 1]``) as
            input, after seeding and before data loading. (default: ``None``)
        prefetch_factor (int, optional, keyword-only arg): Number of samples loaded
            in advance by each worker. ``2`` means there will be a total of
            2 * num_workers samples prefetched across all workers. (default: ``2``)
        persistent_workers (bool, optional): If ``True``, the data loader will not shutdown
            the worker processes after a dataset has been consumed once. This allows to
            maintain the workers `Dataset` instances alive. (default: ``False``)

    .. warning:: If the ``spawn`` start method is used, :attr:`worker_init_fn`
                 cannot be an unpicklable object, e.g., a lambda function. See
                 :ref:`multiprocessing-best-practices` on more details related
                 to multiprocessing in PyTorch.

    .. warning:: ``len(dataloader)`` heuristic is based on the length of the sampler used.
                 When :attr:`dataset` is an :class:`~torch.utils.data.IterableDataset`,
                 it instead returns an estimate based on ``len(dataset) / batch_size``, with proper
                 rounding depending on :attr:`drop_last`, regardless of multi-process loading
                 configurations. This represents the best guess PyTorch can make because PyTorch
                 trusts user :attr:`dataset` code in correctly handling multi-process
                 loading to avoid duplicate data.

                 However, if sharding results in multiple workers having incomplete last batches,
                 this estimate can still be inaccurate, because (1) an otherwise complete batch can
                 be broken into multiple ones and (2) more than one batch worth of samples can be
                 dropped when :attr:`drop_last` is set. Unfortunately, PyTorch can not detect such
                 cases in general.

                 See `Dataset Types`_ for more details on these two types of datasets and how
                 :class:`~torch.utils.data.IterableDataset` interacts with
                 `Multi-process data loading`_.

    .. warning:: See :ref:`reproducibility`, and :ref:`dataloader-workers-random-seed`, and
                 :ref:`data-loading-randomness` notes for random seed related questions.
    """
Two of these parameters turn out to be key:
num_workers (int, optional): how many subprocesses to use for data loading. ``0`` means that the data will be loaded in the main process. (default: ``0``)
pin_memory (bool, optional): If ``True``, the data loader will copy Tensors into CUDA pinned memory before returning them. If your data elements are a custom type, or your :attr:`collate_fn` returns a batch that is a custom type, see the example below.
Setting num_workers = 4 and pin_memory = True brought the utilization right up!
With only num_workers enabled:
With both num_workers and pin_memory enabled:
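For illustration, here is a minimal sketch of the change; the dataset and batch size are placeholders rather than the original code. With pin_memory=True, it also makes sense to pass non_blocking=True when moving batches to the GPU so the copy can overlap with computation.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the real training data.
dataset = TensorDataset(torch.randn(1000, 3, 64, 64),
                        torch.randint(0, 10, (1000,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,     # load batches in 4 worker subprocesses instead of the main process
    pin_memory=True,   # return batches in page-locked memory for faster host-to-GPU copies
)

device = torch.device("cuda:0")
for images, labels in loader:
    # non_blocking=True lets the transfer overlap with GPU work because the
    # source tensors already live in pinned memory.
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward pass, loss, backward pass, optimizer step ...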
Summary
The above is my personal experience; I hope it can serve as a reference, and I also hope everyone will continue to support 腳本之家.