Fixing degraded results in multi-GPU training with PyTorch DistributedDataParallel

Updated: 2021-06-03 09:07:57  Author: 啥哈哈哈

This article describes how to fix degraded results when training on multiple GPUs with PyTorch DistributedDataParallel. I hope it is a useful reference; if anything is wrong or incomplete, corrections are welcome.

Configuring data shuffling under DDP

With DDP you must pass a sampler to the DataLoader: torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0, drop_last=False). shuffle defaults to True, but note how PyTorch's DistributedSampler implements it:

    def __iter__(self) -> Iterator[T_co]:
        if self.shuffle:
            # deterministically shuffle based on epoch and seed
            g = torch.Generator()
            g.manual_seed(self.seed + self.epoch)
            indices = torch.randperm(len(self.dataset), generator=g).tolist()  # type: ignore
        else:
            indices = list(range(len(self.dataset)))  # type: ignore

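The seed used to generate the random indices depends on the current epoch, so you must call set_epoch manually during training to get a real shuffle. Without that call, self.epoch keeps its initial value and every epoch iterates over the data in the same order: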

for epoch in range(start_epoch, n_epochs):
    if is_distributed:
        sampler.set_epoch(epoch)
    train(loader)
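The effect of set_epoch can be checked without starting any processes. The following is a minimal sketch, assuming a toy dataset of 100 items and passing num_replicas and rank explicitly (both values are illustrative) so that no process group is required:

```python
from torch.utils.data.distributed import DistributedSampler

# Toy dataset: the sampler only needs an object with __len__
dataset = list(range(100))

# num_replicas and rank are given explicitly, so torch.distributed
# does not have to be initialized for this experiment
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True, seed=0)

first_pass = list(sampler)
second_pass = list(sampler)      # set_epoch was never called, so epoch is still 0

sampler.set_epoch(1)
after_set_epoch = list(sampler)  # the generator is now seeded with seed + 1

print("same order without set_epoch:", first_pass == second_pass)
print("order changed after set_epoch:", first_pass != after_set_epoch)
```

Each rank receives half of the indices here, and the order only changes once set_epoch is called with a new value.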

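Why results get worse when DDP increases the batch size

With DDP each process draws its own batch, so the effective batch size is the per-GPU batch size multiplied by the number of processes.

Large batch size:

Theoretical advantage:

The influence of noise in the data may shrink, which may make it easier to approach the optimum.

Drawbacks and problems:

It lowers the variance of the gradient. (In theory, for convex optimization problems, lower gradient variance yields better optimization. In practice, however, Keskar et al. showed that increasing the batch size leads to worse generalization.)

For non-convex optimization problems, the loss function has multiple local optima. With a small batch size, the gradient noise can help training escape a local optimum, whereas with a large batch size training may settle into a local optimum and stay there.

Solutions:

Increase the learning_rate. This can cause a problem of its own: using a very large learning_rate from the start of training may keep the model from converging (https://arxiv.org/abs/1609.04836).

Use warmup (https://arxiv.org/abs/1706.02677).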
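One common way to pick the larger learning rate is the linear scaling rule described in the second paper linked above: when the global batch size grows by a factor of k, multiply the learning rate by k. The sketch below assumes that rule; the function name scaled_lr and all the numbers are illustrative, not taken from this article.

```python
def scaled_lr(base_lr, base_batch_size, per_gpu_batch_size, world_size):
    """Linear scaling rule: lr grows in proportion to the global batch size."""
    # Under DDP every process loads per_gpu_batch_size samples per step
    global_batch_size = per_gpu_batch_size * world_size
    return base_lr * global_batch_size / base_batch_size


# Hypothetical setup: lr 0.1 was tuned for a single-GPU batch of 256
single_gpu = scaled_lr(base_lr=0.1, base_batch_size=256, per_gpu_batch_size=256, world_size=1)
eight_gpus = scaled_lr(base_lr=0.1, base_batch_size=256, per_gpu_batch_size=256, world_size=8)

print(single_gpu, eight_gpus)
```

The scaled value is the target learning rate; it is usually reached through warmup rather than used from the first step.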

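warmup

Using a very large learning_rate at the start of training can keep training from converging. The idea of warmup is to use a small learning rate early on, raise it gradually as training proceeds until it reaches the base learning_rate, and then continue training with some other decay schedule (for example CosineAnnealingLR).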

# copy from https://github.com/ildoonet/pytorch-gradual-warmup-lr/blob/master/warmup_scheduler/scheduler.py
from torch.optim.lr_scheduler import _LRScheduler
from torch.optim.lr_scheduler import ReduceLROnPlateau
class GradualWarmupScheduler(_LRScheduler):
    """ Gradually warm-up(increasing) learning rate in optimizer.
    Proposed in 'Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour'.
    Args:
        optimizer (Optimizer): Wrapped optimizer.
        multiplier: target learning rate = base lr * multiplier if multiplier > 1.0. if multiplier = 1.0, lr starts from 0 and ends up with the base_lr.
        total_epoch: target learning rate is reached at total_epoch, gradually
        after_scheduler: after total_epoch, use this scheduler (e.g. ReduceLROnPlateau)
    """
    def __init__(self, optimizer, multiplier, total_epoch, after_scheduler=None):
        self.multiplier = multiplier
        if self.multiplier < 1.:
            raise ValueError('multiplier should be greater than or equal to 1.')
        self.total_epoch = total_epoch
        self.after_scheduler = after_scheduler
        self.finished = False
        super(GradualWarmupScheduler, self).__init__(optimizer)
    def get_lr(self):
        if self.last_epoch > self.total_epoch:
            if self.after_scheduler:
                if not self.finished:
                    self.after_scheduler.base_lrs = [base_lr * self.multiplier for base_lr in self.base_lrs]
                    self.finished = True
                return self.after_scheduler.get_last_lr()
            return [base_lr * self.multiplier for base_lr in self.base_lrs]
        if self.multiplier == 1.0:
            return [base_lr * (float(self.last_epoch) / self.total_epoch) for base_lr in self.base_lrs]
        else:
            return [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]
    def step_ReduceLROnPlateau(self, metrics, epoch=None):
        if epoch is None:
            epoch = self.last_epoch + 1
        self.last_epoch = epoch if epoch != 0 else 1  # ReduceLROnPlateau is called at the end of epoch, whereas others are called at beginning
        if self.last_epoch <= self.total_epoch:
            warmup_lr = [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]
            for param_group, lr in zip(self.optimizer.param_groups, warmup_lr):
                param_group['lr'] = lr
        else:
            if epoch is None:
                self.after_scheduler.step(metrics, None)
            else:
                self.after_scheduler.step(metrics, epoch - self.total_epoch)
    def step(self, epoch=None, metrics=None):
        if type(self.after_scheduler) != ReduceLROnPlateau:
            if self.finished and self.after_scheduler:
                if epoch is None:
                    self.after_scheduler.step(None)
                else:
                    self.after_scheduler.step(epoch - self.total_epoch)
                self._last_lr = self.after_scheduler.get_last_lr()
            else:
                return super(GradualWarmupScheduler, self).step(epoch)
        else:
            self.step_ReduceLROnPlateau(metrics, epoch)
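For comparison, the same warmup-then-decay shape can be sketched with PyTorch's built-in LambdaLR instead of the GradualWarmupScheduler class above. This is an alternative under stated assumptions, not the code from the linked repository: the function warmup_cosine, the 5 warmup epochs, the 50 total epochs, and the base learning rate 0.1 are all made up for illustration.

```python
import math

import torch
from torch.optim.lr_scheduler import LambdaLR


def warmup_cosine(epoch, warmup_epochs=5, total_epochs=50):
    """Multiplier applied to the base lr: linear ramp, then cosine decay."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine)

lrs = []
for epoch in range(50):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()   # stands in for one epoch of training
    scheduler.step()   # advance the schedule once per epoch

print([round(lr, 4) for lr in lrs[:7]])
```

The learning rate climbs linearly to the base value over the first five epochs and then decays along a cosine curve.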

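Pitfalls of distributed multi-GPU training with DistributedDataParallel

Over the past few days I set out to learn multi-GPU training. I expected it to be easy, but there were plenty of pitfalls that I had to work through one step at a time. Distributed training generally falls into two types: a single machine with multiple GPUs, and multiple machines with multiple GPUs.

There are two main ways to implement it:

1. DataParallel: parameter-server style, with one GPU acting as the reducer. It is also extremely simple to use: one line of code.

DataParallel is based on the parameter-server algorithm and suffers from fairly severe load imbalance. With a large model (for example bert-large), the reducer GPU can use 3-4 GB more memory than the others.

2. DistributedDataParallel: the newer DDP, which is what the official documentation recommends. It uses the all-reduce algorithm. It was designed mainly for multi-machine multi-GPU training, but it works on a single machine too.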

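Why train in a distributed fashion?

Multiple GPUs can be used, so training runs faster overall.

A larger batch size becomes available.

In some cases distributed training achieves better results.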

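The rest of this article covers the following parts:

Single machine, multiple GPUs with DataParallel (the most common and simplest)

Single machine, multiple GPUs with DistributedDataParallel (more advanced); multiple machines, multiple GPUs with DistributedDataParallel (most advanced)

How to launch training

Saving and loading models

Notes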

I. Single machine, multiple GPUs (DataParallel)

import torch
from torch.nn import DataParallel

device = torch.device("cuda")
# or: device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = MyModel()  # your own model
model = model.to(device)
model = DataParallel(model)
# or: model = DataParallel(model, device_ids=[0, 1, 2, 3])

This is simple: you only need to add one line, model = DataParallel(model).

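II. Multiple machines or a single machine, multiple GPUs (DistributedDataParallel)

I recommend reading the Notes section at the end of this article before changing your code, to avoid baffling bugs. Modify the training code as follows.

opt.local_rank must be parsed earlier in the script; see the Notes section at the end of this article.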

    import torch
    import torch.distributed as dist
    from torch.utils.data.distributed import DistributedSampler

    # Initialize the process group (opt.local_rank is parsed from the command line)
    dist_backend = 'nccl'
    print('args.local_rank: ', opt.local_rank)
    torch.cuda.set_device(opt.local_rank)
    dist.init_process_group(backend=dist_backend)

    # Move the model to this process's GPU first, then wrap it
    device = torch.device('cuda', opt.local_rank)
    model = yourModel().to(device)  # your own model
    if torch.cuda.device_count() > 1:
        print("Let's use", torch.cuda.device_count(), "GPUs!")
        model = torch.nn.parallel.DistributedDataParallel(model,
                                                          device_ids=[opt.local_rank],
                                                          output_device=opt.local_rank)
    # Your own data-loading code
    dataset = ListDataset(train_path, augment=True, multiscale=opt.multiscale_training, img_size=opt.img_size, normalized_labels=True)
    # rank must be the global rank; local_rank only equals it on a single node
    datasampler = DistributedSampler(dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank())

    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=opt.batch_size,  # batch size per process, not the global batch size
        shuffle=False,              # shuffling is handled by the sampler
        num_workers=opt.n_cpu,
        pin_memory=True,
        collate_fn=dataset.collate_fn,
        sampler=datasampler
    )  # compared with the original DataLoader call, only the sampler argument is new
 
 
.....
 
During training, move each batch to the GPU:
      imgs = imgs.to(device)
      targets = targets.to(device)
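How the sampler splits the data between processes can be inspected on a single CPU by building one sampler per rank by hand. This is only a sketch: the 16-element dataset, world_size = 2, and per_gpu_batch = 4 are assumptions chosen to keep the output short, and shuffle=False is used so the split is easy to read.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(16))
world_size = 2      # assumed number of processes (one per GPU)
per_gpu_batch = 4   # plays the role of opt.batch_size

seen = []
for rank in range(world_size):
    # In real training each process builds only its own sampler
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=False)
    loader = DataLoader(dataset, batch_size=per_gpu_batch, sampler=sampler)
    indices = [int(x) for (batch,) in loader for x in batch]
    seen.append(indices)
    print(f"rank {rank}: {len(loader)} batches, samples {indices}")

# Each optimizer step consumes one batch from every process
effective_batch = per_gpu_batch * world_size
print("effective batch size:", effective_batch)
```

Every sample is visited by exactly one rank per epoch, and the effective batch size is twice the value passed to the DataLoader, which is why the learning rate may need to be revisited after switching to DDP.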

III. How to launch training

1. DataParallel

Just run training as usual:

python3 train.py

2. DistributedDataParallel

Training must be launched through torch.distributed.launch. The usual case is a single node:

CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 train.py

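Here CUDA_VISIBLE_DEVICES selects which GPUs to use by index, and --nproc_per_node is the number of processes to start on each node, one per GPU; normally you use as many as you have GPUs.

Multiple nodes:

python3 -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=0 --master_addr=MASTER_ADDR --master_port=MASTER_PORT train.py
# two nodes; this is the command for node 0 (MASTER_ADDR and MASTER_PORT are placeholders for node 0's address and a free port)

If training starts successfully, a few lines of information are printed, one set per GPU, as shown in the figure below: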

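IV. Saving and loading models

Methods a and b below come in pairs: if you save with a, load with a.

1. Saving

a. Save the parameters only

torch.save(model.module.state_dict(), path)

b. Save the parameters together with the network

torch.save(model.module, path)

2. Loading

a. Loading pretrained weights for multi-GPU training: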

model = Yourmodel()
if opt.pretrained_weights:
    if opt.pretrained_weights.endswith(".pth"):
        model.load_state_dict(torch.load(opt.pretrained_weights))
    else:
        model.load_darknet_weights(opt.pretrained_weights)

To load the weights on a single GPU, pass map_location so the checkpoint is read onto a device that exists on the loading machine ('cuda:0' here). A checkpoint remembers the GPU it was saved from, so one saved from GPU 1 and loaded without map_location on a one-GPU machine fails with "RuntimeError: Attempting to deserialize object on CUDA device 1 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device". Adjust the device to your own setup:

model = Yourmodel()
if opt.pretrained_weights:
    if opt.pretrained_weights.endswith(".pth"):
        model.load_state_dict(torch.load(opt.pretrained_weights, map_location="cuda:0"))
    else:
        model.load_darknet_weights(opt.pretrained_weights)

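b. Loading the whole model on a single GPU:

Here too, specify the device to load the model onto.

model = torch.load(opt.weights_path, map_location="cuda:0")

I have not yet managed to load a pretrained model on multiple GPUs using method b.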
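A related situation is a checkpoint that was saved from the wrapper itself (model.state_dict() instead of model.module.state_dict()), in which case every key carries a "module." prefix. The sketch below shows one way to handle that; strip_module_prefix is a hypothetical helper, nn.DataParallel stands in for DDP because both expose the inner model as .module, and an in-memory buffer replaces the checkpoint file.

```python
import io
from collections import OrderedDict

import torch
import torch.nn as nn


def strip_module_prefix(state_dict):
    """Return a copy of state_dict with a leading 'module.' removed from each key."""
    prefix = "module."
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


net = nn.Linear(4, 2)
wrapped = nn.DataParallel(net)  # stand-in for DistributedDataParallel

# Saving the wrapper's state_dict prefixes every key with "module."
buffer = io.BytesIO()
torch.save(wrapped.state_dict(), buffer)
buffer.seek(0)

# map_location="cpu" makes the load independent of the GPU used for saving
loaded = torch.load(buffer, map_location="cpu")
print(list(loaded.keys()))

fresh = nn.Linear(4, 2)
fresh.load_state_dict(strip_module_prefix(loaded))
```

Saving model.module.state_dict() as in method a avoids the prefix in the first place; the helper is only needed for checkpoints that were already written from the wrapper.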

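V. Notes

1. Add .module after model

After building the network, wrap it with one of the parallel wrappers and move the model and its parameters to the GPU. Note that to modify a submodule or read an attribute of the model, you must go through .module; otherwise an error is raised. For example:

model.img_size must be changed to model.module.img_size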
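When the same code has to run both with and without a wrapper, a small helper can hide the difference. The following is a sketch only: TinyNet and unwrap are hypothetical names, and nn.DataParallel is used as the wrapper so the example runs without starting a process group.

```python
import torch.nn as nn


class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_size = 416  # custom attribute, like model.img_size in the text
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)


def unwrap(model):
    """Return the underlying model whether or not it is wrapped."""
    if isinstance(model, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
        return model.module
    return model


plain = TinyNet()
wrapped = nn.DataParallel(plain)

# The wrapper does not forward custom attributes of the inner model
print(hasattr(wrapped, "img_size"))
print(unwrap(wrapped).img_size, unwrap(plain).img_size)
```

Calling unwrap(model).img_size works in both cases, so the rest of the training script does not need to know how the model was wrapped.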

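2. Issues with .cuda or .to(device)

device is something you set yourself. If .cuda raises an error, switch to the corresponding .to(device) for each of the following:

model (e.g. model.to(device))

input (e.g. input = input.to(device); since PyTorch 0.4 tensors no longer need to be wrapped in Variable)

target (handled the same way as input)

nn.CrossEntropyLoss() (e.g. criterion = nn.CrossEntropyLoss().to(device))

3. The args.local_rank argument

When training is started through torch.distributed.launch, the launcher passes a --local_rank argument to each process, so the training script must parse it. The process's global rank can also be obtained with torch.distributed.get_rank(); the two values coincide only on a single node.

parser.add_argument("--local_rank", type=int, default=-1, help="local rank of this process on its node, set by torch.distributed.launch")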
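Depending on the PyTorch version and launcher, the local rank may arrive as --local_rank, as --local-rank, or through the LOCAL_RANK environment variable (torchrun, or torch.distributed.launch with --use_env). A parser that accepts all three could look like the sketch below; parse_local_rank is a hypothetical helper name, and -1 is kept as the "not launched in distributed mode" default from the line above.

```python
import argparse
import os


def parse_local_rank(argv=None):
    """Read the local rank from the command line, falling back to LOCAL_RANK."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--local_rank", "--local-rank",
        type=int,
        default=int(os.environ.get("LOCAL_RANK", -1)),
        help="local rank of this process on its node",
    )
    # parse_known_args ignores the script's other options in this sketch
    args, _unknown = parser.parse_known_args(argv)
    return args.local_rank


print(parse_local_rank(["--local_rank=3"]))
print(parse_local_rank(["--local-rank", "2"]))
```

With no argument and no LOCAL_RANK variable the function returns -1, which the training script can treat as single-process mode.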

The above is my personal experience; I hope it gives you a useful reference.
