A Detailed Guide to Using DataLoader in PyTorch

Updated: 2023-11-04 09:34:02  Author: 驚瑟

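This article explains how to use DataLoader in PyTorch. Building your own dataloader is the first step of model training, so the article covers DataLoader and the classes related to it. Readers who need this may find it a useful reference.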

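The DataLoader class has one required parameter, dataset, so before building your own dataloader you first need to define your own Dataset class. Here is a rough overview of what the two classes do: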

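Dataset: the actual "data set". Once you tell it where the data lives (at initialization), you can fetch samples from it by index, much as you would from a sequence. A subclass must override __len__() and __getitem__().
DataLoader: the data loader. After a few parameters are set, it loads data according to those rules; for example, with batch_size set, each iteration yields batch_size samples. It behaves much like a generator.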
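
To make the division of labour concrete, here is a minimal sketch. SquaresDataset is a hypothetical toy class invented for illustration (it is not part of PyTorch); it only shows that Dataset answers "how many samples?" and "what is sample idx?", while DataLoader groups those samples into batches.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i * i)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # Number of samples in the dataset
        return self.n

    def __getitem__(self, idx):
        # Return the sample stored at position idx
        x = torch.tensor(float(idx))
        return x, x * x


dataset = SquaresDataset(5)
loader = DataLoader(dataset, batch_size=2)

# Each iteration yields one batch; the last batch may be smaller
batches = [(x.tolist(), y.tolist()) for x, y in loader]
for xs, ys in batches:
    print(xs, ys)
```

With 5 samples and batch_size=2, the loader yields three batches, the last of which holds a single sample.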

You may wonder: writing your own data-loading utility does not seem all that "hard", so why go to the trouble of subclassing PyTorch's classes and loading data by its rules?

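In short:

When the data set is large, loading it in a single process is slow.
Loading everything at once takes up a lot of memory (which is why DataLoader works like a generator and loads lazily).
Training usually requires preprocessing or data augmentation beforehand; PyTorch's data-loading classes already provide the scaffolding for this, so you do not have to reinvent the wheel.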
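
The lazy-loading point can be checked with a small sketch. CountingDataset is a hypothetical class made up for this illustration; it records every index that is requested, so we can see that nothing is read until a batch is actually asked for. The sketch keeps num_workers=0 (the default, loading in the main process) so the counter stays visible; setting num_workers to a positive number is how DataLoader spreads loading across worker processes.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class CountingDataset(Dataset):
    """Hypothetical dataset that records which indices have been loaded."""

    def __init__(self, n):
        self.n = n
        self.loaded = []

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        self.loaded.append(idx)  # remember that this sample was fetched
        return torch.tensor(idx)


counting = CountingDataset(100)
# num_workers=0 loads in the main process, so self.loaded is updated in place
lazy_loader = DataLoader(counting, batch_size=4, num_workers=0)

loaded_before = list(counting.loaded)   # nothing has been fetched yet
first_batch = next(iter(lazy_loader))   # fetches exactly one batch
loaded_after = list(counting.loaded)

print(loaded_before, first_batch.tolist(), loaded_after)
```

Only the four samples of the first batch are fetched; the other 96 are never touched unless iteration continues.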

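Usage

Two steps:

Define your own Dataset class. Specifically:
- Tell it where to read the data from, and resize the data to a uniform shape (think about why this is needed).
- Override __len__() and __getitem__(). In __getitem__, decide which data you want and return it.
Pass an instance of your Dataset to DataLoader, set the parameters you want, and you have your own dataloader.

Below is a simple example that loads the images in a directory together with their label: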
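
One likely answer to the "why a uniform shape" question: by default DataLoader stacks the samples of a batch into a single tensor, and stacking only works when every sample has the same shape. The sketch below illustrates this under stated assumptions: RaggedDataset is a hypothetical class, its "images" are zero tensors, and a simple crop stands in for cv2.resize.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class RaggedDataset(Dataset):
    """Hypothetical dataset whose samples have different shapes."""

    def __init__(self, sizes, target=None):
        self.sizes = sizes    # side length of each fake square "image"
        self.target = target  # if set, every sample is cropped to this side

    def __len__(self):
        return len(self.sizes)

    def __getitem__(self, idx):
        img = torch.zeros(self.sizes[idx], self.sizes[idx], 3)
        if self.target is not None:
            # Crop as a stand-in for cv2.resize: force a uniform shape
            img = img[: self.target, : self.target]
        return img


sizes = [8, 6]

try:
    next(iter(DataLoader(RaggedDataset(sizes), batch_size=2)))
    stacked_ok = True
except RuntimeError:
    # Default collation cannot stack tensors of mismatched shapes
    stacked_ok = False

uniform = next(iter(DataLoader(RaggedDataset(sizes, target=4), batch_size=2)))
print(stacked_ok, tuple(uniform.shape))
```

If samples genuinely need different shapes, DataLoader also accepts a custom collate_fn, but resizing inside the Dataset is the simpler route for image classification.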

import os
import numpy as np

from torch.utils.data.dataset import Dataset
from torch.utils.data.dataloader import DataLoader
import cv2

# Your Data Path
img_dir = '/home/jyz/Downloads/classify_example/val/駿馬/'
anno_file = '/home/jyz/Downloads/classify_example/val/label.txt'


class MyDataset(Dataset):
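    def __init__(self, img_dir, anno_file, imgsz=(640, 640)):
        self.img_dir = img_dir
        self.anno_file = anno_file
        self.imgsz = imgsz  # (width, height) passed to cv2.resize
        # Sort so that the sample order is the same on every run
        self.img_namelst = sorted(os.listdir(self.img_dir))
        # The first line of the annotation file is used as the label for
        # every image in this directory; read it once, not on every access
        with open(self.anno_file, 'r') as f:
            self.label = f.readline().strip()

    # must be overridden: number of samples
    def __len__(self):
        return len(self.img_namelst)

    # must be overridden: return the sample at position idx
    def __getitem__(self, idx):
        # Use self.img_dir rather than the module-level img_dir
        img_path = os.path.join(self.img_dir, self.img_namelst[idx])
        img = cv2.imread(img_path)
        if img is None:  # cv2.imread returns None when the file is unreadable
            raise FileNotFoundError(f'Could not read image: {img_path}')
        img = cv2.resize(img, self.imgsz)
        return img, self.label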


dataset = MyDataset(img_dir, anno_file)
dataloader = DataLoader(dataset=dataset, batch_size=2)

# display
for img_batch, label_batch in dataloader:
    img_batch = img_batch.numpy()
    print(img_batch.shape)
    # img = np.concatenate(img_batch, axis=0)
    if img_batch.shape[0] == 2:
        img = np.hstack((img_batch[0], img_batch[1]))
    else:
        img = np.squeeze(img_batch, axis=0)  # last batch holds a single image: drop the batch dimension
    print(img.shape)
    cv2.imshow(label_batch[0], img)
    cv2.waitKey(0)

The code above loads two images at a time; the result looks like this:

[Image: the two images of a batch displayed side by side in one window]
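
The display loop has to special-case a final batch that holds only one image. One alternative, sketched here under the assumption that discarding the leftover sample is acceptable, is DataLoader's drop_last parameter. TensorDataset with five integers stands in for five resized images.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Five fake samples standing in for five resized images
data = TensorDataset(torch.arange(5))

# Default behaviour: the last batch is kept even though it is incomplete
keep_last = [batch[0].tolist() for batch in DataLoader(data, batch_size=2)]

# drop_last=True discards the final incomplete batch
no_last = [
    batch[0].tolist()
    for batch in DataLoader(data, batch_size=2, drop_last=True)
]

print(keep_last)
print(no_last)
```

With drop_last=True every batch has exactly batch_size samples, so the shape check in the display loop would no longer be needed, at the cost of never showing the leftover image.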