Fixing inconsistent tensor devices inside PyTorch model code

Updated: 2023-07-04 16:34:38  Author: _LvP

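This article describes the problem of tensors ending up on different devices inside PyTorch model code and shows how to fix it, first for a tensor created inside forward and then for multi-GPU training with torch.nn.DataParallel.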

Inconsistent tensor devices inside PyTorch model code

I wrote the following code to process two tensors. It is called from within the forward function and creates a new tensor that is used to compute an index mask:

import torch

# todo(liang): spatial exchange
def compute_sim_and_swap(t1, t2, threshold=0.7):
    n, c, h, w = t1.shape
    sim = torch.nn.functional.cosine_similarity(t1, t2, dim=1)  # n, h, w
    sim = sim.unsqueeze(0)  # 1, n, h, w
    expand_tensor = sim.clone()
    # Build a tensor with matching dimensions by concatenation
    for _ in range(c - 1):  # c, n, h, w after the loop
        sim = torch.cat([sim, expand_tensor], dim=0)
    sim = sim.permute(1, 0, 2, 3)  # n, c, h, w
    # Boolean mask: positions below threshold become True and are candidates for swapping
    mask = sim < threshold
    # Note: torch.rand allocates on the CPU by default
    indices = torch.rand(mask.shape) < 0.5
    t1[mask & indices], t2[mask & indices] = t2[mask & indices], t1[mask & indices]
    return t1, t2

This code raised the following error:

File "xxx/network.py", line 347, in compute_sim_and_swap
t1[mask&indices], t2[mask&indices] = t2[mask&indices], t1[mask&indices]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

The fix is to put the tensors involved in the mask computation on the same device:

device = mask.device
indices = indices.to(device)
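As a minimal sketch of the same fix applied to the whole function, the random mask can be created directly on the device of `mask`, so no later `.to(device)` call is needed. The function name `compute_sim_and_swap_same_device`, the use of `expand_as` in place of the concatenation loop, and the choice to return copies rather than modify the inputs in place are assumptions made for illustration; they are not taken from the original code.

```python
import torch
import torch.nn.functional as F


def compute_sim_and_swap_same_device(t1, t2, threshold=0.7):
    """Swap low-similarity positions between t1 and t2 (hypothetical rewrite)."""
    # Cosine similarity over the channel dimension: shape (n, h, w).
    sim = F.cosine_similarity(t1, t2, dim=1)
    # Broadcast back to (n, c, h, w) without the concatenation loop.
    sim = sim.unsqueeze(1).expand_as(t1)
    mask = sim < threshold
    # Create the random mask directly on the same device as `mask`.
    indices = torch.rand(mask.shape, device=mask.device) < 0.5
    swap = mask & indices
    # Work on copies so the caller's tensors are not modified in place.
    out1, out2 = t1.clone(), t2.clone()
    out1[swap], out2[swap] = t2[swap], t1[swap]
    return out1, out2
```

Because every tensor involved in the indexing is derived from `t1` and `t2` or allocated with `device=mask.device`, the function should behave the same whether the inputs live on the CPU or on a GPU.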

PyTorch multi-GPU training with torch.nn.DataParallel: tensors on different GPUs

During multi-GPU training I ran into the following errors:

1. Expected tensor for argument 1 'input' to have the same device as tensor for argument 2 'weight'; but device 0 does not equal 1
2. RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

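There are many possible causes, but they all come down to the model, the inputs, and the model's internal parameters not being on the same GPU. I was debugging the RandLA-Net PyTorch source and wanted to train on two GPUs. I am recording the fix here as a pointer for anyone who hits the same problem. Debugging showed that the error mainly occurs where data is concatenated: one tensor is on GPU 0 and the other is on GPU 1. The relevant code is: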

return torch.cat((
            self.mlp(concat),
            features.expand(B, -1, N, K)
        ), dim=-3)

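In the code above, self.mlp(concat) and features.expand(B, -1, N, K) must be on the same GPU. During multi-GPU computation, features (here an input variable) may be placed on any GPU. So before concatenating, read the device of features, move concat to that device, and then concatenate: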

device = features.device
concat = concat.to(device)
return torch.cat((
            self.mlp(concat),
            features.expand(B, -1, N, K)
        ), dim=-3)
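To make the pattern easier to try in isolation, here is a small self-contained sketch. The class name `ConcatBlock`, the 1x1 convolution standing in for `self.mlp`, and the tensor shapes are assumptions chosen for illustration; they are not the RandLA-Net implementation.

```python
import torch
import torch.nn as nn


class ConcatBlock(nn.Module):
    """Hypothetical stand-in for the block that concatenates features."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # 1x1 convolution as a placeholder for the original `self.mlp`.
        self.mlp = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, concat, features):
        # concat: (B, in_channels, N, K); features: (B, C, N, 1)
        B, _, N, K = concat.shape
        # Follow the device of the input tensor instead of a fixed device.
        concat = concat.to(features.device)
        return torch.cat(
            (self.mlp(concat), features.expand(B, -1, N, K)),
            dim=-3,
        )


block = ConcatBlock(4, 8)
out = block(torch.randn(2, 4, 5, 3), torch.randn(2, 6, 5, 1))
```

The sketch runs on the CPU as written. On a machine with several GPUs, the same module could be wrapped with `nn.DataParallel(block.cuda())`; each replica then receives its own slice of the inputs, and the `.to(features.device)` call keeps both operands of `torch.cat` on that replica's device.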

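By default the source uses a fixed value for device, which causes errors in multi-GPU training. There are several other places in the code where data is merged, and they can be changed along the same lines. In addition, because device is hard-coded, a trained model must also run inference on the corresponding GPU: a model trained on cuda:0 fails if inference runs on cuda:1. The source can be adapted in the same way to fix this. If you have trouble making the changes, send me a private message and I can share the modified source.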
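One possible way to avoid tying a checkpoint to a particular GPU is to choose the device at run time and pass `map_location` to `torch.load`. The sketch below is an assumption about how this could be done, not the change made in the RandLA-Net source: `nn.Linear` is a placeholder for the trained network, and an in-memory buffer stands in for the checkpoint file.

```python
import io

import torch
import torch.nn as nn

# Pick the device at run time instead of hard-coding "cuda:0".
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2).to(device)  # placeholder for the trained network

# Save the weights (an in-memory buffer stands in for a checkpoint file).
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# map_location remaps the stored tensors to the chosen inference device.
state_dict = torch.load(buffer, map_location=device)
restored = nn.Linear(4, 2)
restored.load_state_dict(state_dict)
restored.to(device)

# Inputs are created on the same device as the restored model.
x = torch.randn(3, 4, device=device)
y = restored(x)
```

With this arrangement the same script can load the weights on cuda:0, cuda:1, or the CPU by changing only the value of `device`.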

