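How to split a dataset into training, validation, and test sets with Python
Splitting a dataset into training, validation, and test sets with Python
Splitting the dataset
Splitting a dataset into only a training set and a validation set is not enough. The validation set is used to tune the model, so a separate test set is needed for the final evaluation. The previous post did not set one aside, so this post redoes the split with all three subsets.
Here is the code: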
# split_data.py
# Split the flower_data dataset into flower_datas with a train:val:test ratio of 6:2:2
import os
import random
from shutil import copy2

# Source directory (expects one subfolder per class)
file_path = r"D:/other/ClassicalModel/other/flower_data"
# Destination directory
new_file_path = r"D:/other/ClassicalModel/other/flower_datas"
# Split ratio 6:2:2
split_rate = [0.6, 0.2, 0.2]
print("Starting...")
print("Ratio= {}:{}:{}".format(round(split_rate[0] * 10), round(split_rate[1] * 10), round(split_rate[2] * 10)))
# Treat each subfolder of the source directory as one class
class_names = [d for d in os.listdir(file_path) if os.path.isdir(os.path.join(file_path, d))]
# Create <destination>/<split>/<class> folders; exist_ok=True skips folders that already exist
split_names = ['train', 'val', 'test']
for split_name in split_names:
    for class_name in class_names:
        os.makedirs(os.path.join(new_file_path, split_name, class_name), exist_ok=True)
# Split each class by the given ratio and copy the images
for class_name in class_names:
    current_class_data_path = os.path.join(file_path, class_name)
    current_all_data = os.listdir(current_class_data_path)
    current_data_length = len(current_all_data)
    current_data_index_list = list(range(current_data_length))
    random.shuffle(current_data_index_list)
    train_path = os.path.join(new_file_path, 'train', class_name)
    val_path = os.path.join(new_file_path, 'val', class_name)
    test_path = os.path.join(new_file_path, 'test', class_name)
    # Positions below train_stop_flag go to train, positions below val_stop_flag go to val, the rest to test
    train_stop_flag = current_data_length * split_rate[0]
    val_stop_flag = current_data_length * (split_rate[0] + split_rate[1])
    train_num = 0
    val_num = 0
    test_num = 0
    for current_idx, i in enumerate(current_data_index_list):
        src_img_path = os.path.join(current_class_data_path, current_all_data[i])
        # Use "<" rather than "<=": positions start at 0, so "<=" would shift one extra image into train and val
        if current_idx < train_stop_flag:
            copy2(src_img_path, train_path)
            train_num += 1
        elif current_idx < val_stop_flag:
            copy2(src_img_path, val_path)
            val_num += 1
        else:
            copy2(src_img_path, test_path)
            test_num += 1
    print("<{}> has {} pictures,train:val:test={}:{}:{}".format(class_name, current_data_length, train_num, val_num,
                                                                test_num))
print("Done")
Output: one summary line per class (class name, image count, and the train:val:test counts), followed by Done.
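To double-check the result, the files in the generated folders can be counted after the script finishes. The following is only a minimal sketch, not part of the original script: the helper name count_split_files is hypothetical, and it assumes the destination uses the split/class layout created above.

```python
import os


def count_split_files(root, split_names=("train", "val", "test")):
    """Return {split: {class_name: file_count}} for a root/split/class layout.

    Hypothetical helper for illustration; splits that do not exist are skipped.
    """
    counts = {}
    for split_name in split_names:
        split_path = os.path.join(root, split_name)
        if not os.path.isdir(split_path):
            continue
        counts[split_name] = {}
        for class_name in sorted(os.listdir(split_path)):
            class_path = os.path.join(split_path, class_name)
            if os.path.isdir(class_path):
                counts[split_name][class_name] = len(os.listdir(class_path))
    return counts


if __name__ == "__main__":
    # Same destination path as in the script above; prints {} if it does not exist
    print(count_split_files(r"D:/other/ClassicalModel/other/flower_datas"))
```

If a class shows more files than expected, the destination most likely still contains copies from an earlier run.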

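Note:
Only file_path (the source folder) and new_file_path (the newly generated folder) need to be changed.
After that, adjust split_rate if a different ratio is needed.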
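Before changing split_rate, it can help to see how many images of a class each subset would receive. The sketch below is an illustration under stated assumptions: the helper split_counts is hypothetical, it checks that the three ratios sum to 1, and it rounds the cut points, so for class sizes that do not divide evenly its counts may differ by one from what the script above copies.

```python
def split_counts(n_items, split_rate):
    """Return [train, val, test] counts that always add up to n_items.

    Hypothetical helper for illustration; cut points are rounded to integers.
    """
    if len(split_rate) != 3 or abs(sum(split_rate) - 1.0) > 1e-9:
        raise ValueError("split_rate must contain three ratios that sum to 1")
    train_stop = round(n_items * split_rate[0])
    val_stop = round(n_items * (split_rate[0] + split_rate[1]))
    # The test set takes whatever remains, so no image is lost to rounding
    return [train_stop, val_stop - train_stop, n_items - val_stop]


if __name__ == "__main__":
    print(split_counts(10, [0.6, 0.2, 0.2]))
```

Letting the test set absorb the remainder keeps the three counts consistent with the class size even when the ratios do not divide it evenly.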
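Automatically splitting training and test sets with Python
When training a deep learning model, we usually need to split the data into a training set and a test set. If the dataset is large, doing this manually takes far too long.
Without further ado, here is the code (Python).
Code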
# -*- coding: utf-8 -*-
import os
import random
import shutil
import time


def copyFile(fileDir, origion_path1, save_dir, save_dir1, train_rate):
    """Randomly split the images in fileDir and copy each image together with its .xml annotation."""
    image_list = os.listdir(fileDir)  # image file names
    image_number = len(image_list)
    train_number = int(image_number * train_rate)
    train_sample = random.sample(image_list, train_number)  # randomly pick train_rate of the images
    test_sample = list(set(image_list) - set(train_sample))
    sample = [train_sample, test_sample]
    # Copy the images and annotations to the target folders
    for k in range(len(save_dir)):
        # exist_ok=True also covers the case where only one of the two folders exists
        os.makedirs(save_dir[k], exist_ok=True)
        os.makedirs(save_dir1[k], exist_ok=True)
        for name in sample[k]:
            # splitext keeps file names that contain extra dots intact
            name1 = os.path.splitext(name)[0] + '.xml'
            shutil.copy(os.path.join(fileDir, name), os.path.join(save_dir[k], name))
            shutil.copy(os.path.join(origion_path1, name1), os.path.join(save_dir1[k], name1))


if __name__ == '__main__':
    time_start = time.time()
    # Original dataset paths
    origion_path = './JPEGImages/'
    origion_path1 = './Annotations/'
    # Output paths
    save_train_dir = './train/JPEGImages/'
    save_test_dir = './test/JPEGImages/'
    save_train_dir1 = './train/Annotations/'
    save_test_dir1 = './test/Annotations/'
    save_dir = [save_train_dir, save_test_dir]
    save_dir1 = [save_train_dir1, save_test_dir1]
    # Training-set ratio
    train_rate = 0.75
    # Split the image folder once. Calling copyFile once per file, as a loop over
    # os.listdir(origion_path) would, redoes the random split each time and mixes train and test.
    copyFile(origion_path, origion_path1, save_dir, save_dir1, train_rate)
    print('Split finished!')
    time_end = time.time()
    print('---------------')
    print('Splitting into training and test sets took %s seconds!' % (time_end - time_start))
1. What to change
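- origion_path: image folder
- origion_path1: xml annotation folder
- train_rate: training-set ratio
2. Running deal.py produces
- train-img: training-set images (./train/JPEGImages/)
- train-xml: training-set xml annotations (./train/Annotations/)
- test-img: test-set images (./test/JPEGImages/)
- test-xml: test-set xml annotations (./test/Annotations/)
3. train_rate can be adjusted as needed; a train:test ratio of 3:1 is typical.
Note: the split is random on every run. Before running the script again, move or rename the folders from the previous run. Otherwise the new split is copied on top of the old one in the same four folders, and the training and test sets end up overlapping.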
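If the same split should come out on every run, one option is to seed the random number generator. This is a minimal sketch and an assumption on my part, not something the script above does: the helper split_names and its seed parameter are hypothetical names introduced only for illustration.

```python
import random


def split_names(names, train_rate=0.75, seed=0):
    """Return (train, test) lists; the same seed always yields the same split.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Sort first so the result does not depend on the order os.listdir returns
    shuffled = sorted(names)
    rng.shuffle(shuffled)
    train_number = int(len(shuffled) * train_rate)
    return shuffled[:train_number], shuffled[train_number:]


if __name__ == "__main__":
    demo = ["img%d.jpg" % i for i in range(8)]
    train, test = split_names(demo)
    # The two lists never share a file name
    print(len(train), len(test), set(train).isdisjoint(test))
```

Slicing one shuffled list guarantees that the two subsets are disjoint, and a fixed seed means a re-run copies the same files to the same folders instead of mixing two different splits.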
Summary
The above is based on my personal experience; I hope it serves as a useful reference.

