Using L2 regularization in TensorFlow to reduce overfitting
How L2 regularization works:
How overfitting arises: as the loss decreases and the model fits the data (the sloped line in the figure), different mini-batches make the red curve fluctuate heavily. The low points of that curve in the figure are the overfitting: the red curve dips below the true black line, which means worse generalization.
So to reduce overfitting we need to damp this fluctuation, and shrinking the magnitude of the weights w accomplishes exactly that.
How training with L2 regularization works: add to the loss the sum of the squared weights w, scaled by a coefficient λ. Training then suppresses the weight values; smaller (absolute) values of w mean a less complex model, a smoother curve, and less overfitting (Occam's razor). The reference formula is written out below:
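(The original figure is not reproduced here; written out, the regularized objective that the code in this post builds is the standard form

    total_loss = cross_entropy + λ · Σ_i w_i² / 2

where λ is the weight-decay factor, called wd in the code below, and the factor of 1/2 comes from TensorFlow's tf.nn.l2_loss, as discussed near the end of the post.)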
(Regularization does not stop you from fitting the curve, and not every parameter gets suppressed blindly. In practice this is a dynamic process, a tug-of-war between the main loss (cross_entropy) and the L2 loss: training pulls the weights toward values that fit the data while the regularizer keeps pulling them back toward zero, and the two effects balance out. Irrelevant w_i shrink smaller and smaller yet stay a little above zero (that little bit above zero is exactly L2's trade-off: better than discarding them outright), while useful w_i are kept in a "moderate" range, fitting the data while generalizing better. I won't belabor the theory and derivations.)
So why can't L1 do the same job? Mainly because L1 has a side effect that makes it a poor fit for this scenario.
L1 replaces the square of w_i in the L2 formula with the absolute value |w_i|. Because of this mathematical property, the weights are shrunk unevenly: some w_i stay large while others are driven to zero, which yields a sparse solution and amounts to feature selection. Why is L1's weight decay more lopsided than L2's? It is quite intuitive: to lower the penalty, moving w1 from 0.1 to 0 and moving w2 from 1.0 to 0.9 look exactly the same to the optimizer under L1. With squares, however, the former saves only 0.01 - 0 = 0.01 while the latter saves 1 - 0.81 = 0.19, so shrinking w2 is clearly the better deal. The figure below shows this best: the axes are w1 and w2 and the contours are loss values. In the left plot the solution lands at w1 = 0, w2 = max(w2), a classic sparse solution that discards w1, while in the right plot the solution balances w1 and w2. Discarding w1 means that where you could have fit a curve you now only get a straight line: overfitting drops, but so does the model's fitting (expressive) capacity.
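A minimal numeric sketch of the argument above (plain Python with NumPy, separate from the TensorFlow code later in this post): shrinking a small weight and a large weight by the same step saves the same amount under an L1 penalty, but very different amounts under L2.

import numpy as np

w = np.array([0.1, 1.0])   # a small weight and a large weight
step = 0.1                 # shrink either one by the same amount

def l1(w):
    return np.sum(np.abs(w))

def l2(w):
    return np.sum(w ** 2)

for i, name in enumerate(['w1 (small)', 'w2 (large)']):
    w_new = w.copy()
    w_new[i] -= step
    print(name,
          'L1 saving:', round(l1(w) - l1(w_new), 4),   # 0.1 either way
          'L2 saving:', round(l2(w) - l2(w_new), 4))   # 0.01 vs 0.19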
L1 and L2 also go by the names Lasso and ridge, and the two are easy to mix up: it is tempting to think ridge regression must be L1 because a "ridge" sounds sharp, but the picture that ridge actually corresponds to is the one shown here, and "mountain ridge" is really the better reading: a ridge is a curve that descends slowly and smoothly.
Training
We train an MNIST classifier and compare plain cross_entropy against total_loss, which adds the L2 regularization term.
Because MNIST is not a complex task to begin with, stacking too many CONV layers before the FC layers makes the model so strong that the gap becomes hard to see. To show the effect of L2 regularization I keep only one CONV layer (note that FC1's input is h_pool1, bypassing conv2); the two-conv version serves as a control group.
The first 1000 training samples serve as the validation set, and the first 1000 test samples as the test set.
Code overview: a basic CONV + FC network that predicts labels for the images, measured and trained with cross_entropy.
Each weight that needs regularization is passed through tf.nn.l2_loss, and both the cross_entropy and the L2 losses are thrown into the collection 'losses'.
wd is simply the λ from the formula above: the larger wd is, the heavier the penalty, the less the overfitting, and the weaker the fitting capacity, so it should be neither too large nor too small. Many people simply default it to 0.004, which is usually fine since it is an empirical value, but in my experience it is not a fixed constant, especially once you customize the loss function. If your weighted cross-entropy becomes ten times larger than before and wd stays unchanged, wd is effectively only a tenth of what it was, i.e. 0.0004. Just as switching the loss from reduce_mean to reduce_sum rescales the gradients, many things have to be changed in step (a small numeric sketch after the snippet below illustrates this scaling).
var = tf.Variable(initial)
weight_decay = tf.multiply(tf.nn.l2_loss(var), wd, name='weight_loss')  # penalize the trained variable, not the initializer tensor
tf.add_to_collection('losses', weight_decay)
tf.add_to_collection('losses', cross_entropy)
total_loss = tf.add_n(tf.get_collection('losses')) then gathers all the losses; training on total_loss implements exactly the formula shown in the first figure.
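A small sketch of the scaling point made above (the numbers are purely illustrative, not taken from this post's experiments): if the data term of the loss is summed instead of averaged over a batch of 100, it grows roughly 100x, and wd has to grow with it to keep the same balance between fitting and regularization.

import numpy as np

# Illustrative numbers only: per-example cross-entropy around 0.5, batch of 100,
# and some weight matrix whose tf.nn.l2_loss value is about 3.0.
per_example_loss = np.full(100, 0.5)
l2_term = 3.0
wd = 0.004

mean_loss = per_example_loss.mean()   # ~0.5  (reduce_mean-style data term)
sum_loss = per_example_loss.sum()     # ~50   (reduce_sum-style data term, 100x larger)

print('regularizer/data ratio with reduce_mean:', wd * l2_term / mean_loss)         # ~0.024
print('regularizer/data ratio with reduce_sum :', wd * l2_term / sum_loss)          # ~0.00024, wd is effectively 100x weaker
print('same ratio restored by rescaling wd    :', (wd * 100) * l2_term / sum_loss)  # ~0.024 again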
The complete code:
from __future__ import print_function
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# number 1 to 10 data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

def compute_accuracy(v_xs, v_ys):
    global prediction
    y_pre = sess.run(prediction, feed_dict={xs: v_xs, keep_prob: 1})
    correct_prediction = tf.equal(tf.argmax(y_pre, 1), tf.argmax(v_ys, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    # result = sess.run(accuracy, feed_dict={xs: v_xs, ys: v_ys, keep_prob: 1})
    result = sess.run(accuracy)
    return result

def weight_variable(shape, wd):
    initial = tf.truncated_normal(shape, stddev=0.1)
    var = tf.Variable(initial)
    if wd is not None:
        print('wd is not none!!!!!!!')
        # penalize the trained variable itself, not the initializer tensor,
        # so the L2 term actually constrains the weights during training
        weight_decay = tf.multiply(tf.nn.l2_loss(var), wd, name='weight_loss')
        tf.add_to_collection('losses', weight_decay)
    return var

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    # stride [1, x_movement, y_movement, 1]
    # Must have strides[0] = strides[3] = 1
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    # stride [1, x_movement, y_movement, 1]
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

# define placeholder for inputs to network
xs = tf.placeholder(tf.float32, [None, 784]) / 255.  # 28x28
ys = tf.placeholder(tf.float32, [None, 10])
keep_prob = tf.placeholder(tf.float32)
x_image = tf.reshape(xs, [-1, 28, 28, 1])
# print(x_image.shape)  # [n_samples, 28, 28, 1]

## conv1 layer ##
W_conv1 = weight_variable([5, 5, 1, 32], 0.)   # patch 5x5, in size 1, out size 32
b_conv1 = bias_variable([32])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)  # output size 28x28x32
h_pool1 = max_pool_2x2(h_conv1)                           # output size 14x14x32

## conv2 layer ##
W_conv2 = weight_variable([5, 5, 32, 64], 0.)  # patch 5x5, in size 32, out size 64
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)  # output size 14x14x64
h_pool2 = max_pool_2x2(h_conv2)                           # output size 7x7x64

#############################################################################
## fc1 layer ##
W_fc1 = weight_variable([14*14*32, 1024], wd=0.)    # do not use conv2
# W_fc1 = weight_variable([7*7*64, 1024], wd=0.00)  # use conv2
b_fc1 = bias_variable([1024])
# [n_samples, 7, 7, 64] ->> [n_samples, 7*7*64]
h_pool2_flat = tf.reshape(h_pool1, [-1, 14*14*32])  # do not use conv2
# h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])  # use conv2
###############################################################################
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

## fc2 layer ##
W_fc2 = weight_variable([1024, 10], wd=0.)
b_fc2 = bias_variable([10])
prediction = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

# the error between prediction and real data
cross_entropy = tf.reduce_mean(-tf.reduce_sum(ys * tf.log(prediction), reduction_indices=[1]))  # loss
tf.add_to_collection('losses', cross_entropy)
total_loss = tf.add_n(tf.get_collection('losses'))
print(total_loss)

train_op = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
train_op_with_l2_norm = tf.train.AdamOptimizer(1e-4).minimize(total_loss)

sess = tf.Session()
# important step
# tf.initialize_all_variables() no long valid from
# 2017-03-02 if using tensorflow >= 0.12
if int((tf.__version__).split('.')[1]) < 12 and int((tf.__version__).split('.')[0]) < 1:
    init = tf.initialize_all_variables()
else:
    init = tf.global_variables_initializer()
sess.run(init)

for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 1})
    # sess.run(train_op_with_l2_norm, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 1})
    # sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 0.5})  # dropout
    if i % 100 == 0:
        print('train accuracy', compute_accuracy(
            mnist.train.images[:1000], mnist.train.labels[:1000]))
        print('test accuracy', compute_accuracy(
            mnist.test.images[:1000], mnist.test.labels[:1000]))
The training runs follow.
No dropout, no L2 regularization, 1000 training steps:
weight_variable([1024, 10], wd = 0.)
Train accuracy clearly beats test accuracy at every step (often by around 0.01): overfitting!
train accuracy 0.094  test accuracy 0.089
train accuracy 0.892  test accuracy 0.874
train accuracy 0.91   test accuracy 0.893
train accuracy 0.925  test accuracy 0.925
train accuracy 0.945  test accuracy 0.935
train accuracy 0.954  test accuracy 0.944
train accuracy 0.961  test accuracy 0.951
train accuracy 0.965  test accuracy 0.955
train accuracy 0.964  test accuracy 0.959
train accuracy 0.962  test accuracy 0.956
No dropout, L2 regularization on the FC layers with a weight-decay factor of 0.004, 1000 training steps:
weight_variable([1024, 10], wd = 0.004)
Overfitting is noticeably reduced, and at times the test set even beats the training set (given the size of the validation set, this only shows the rough trend).
train accuracy 0.107  test accuracy 0.145
train accuracy 0.876  test accuracy 0.861
train accuracy 0.91   test accuracy 0.909
train accuracy 0.923  test accuracy 0.919
train accuracy 0.931  test accuracy 0.927
train accuracy 0.936  test accuracy 0.939
train accuracy 0.956  test accuracy 0.949
train accuracy 0.958  test accuracy 0.954
train accuracy 0.947  test accuracy 0.95
train accuracy 0.947  test accuracy 0.953
Control group: no L2 regularization, dropout only. Overfitting is also reduced.
W_fc1 = weight_variable([14*14*32, 1024], wd=0.)
W_fc2 = weight_variable([1024, 10], wd=0.)
sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 0.5})  # dropout

train accuracy 0.132  test accuracy 0.104
train accuracy 0.869  test accuracy 0.859
train accuracy 0.898  test accuracy 0.889
train accuracy 0.917  test accuracy 0.906
train accuracy 0.923  test accuracy 0.917
train accuracy 0.928  test accuracy 0.925
train accuracy 0.938  test accuracy 0.94
train accuracy 0.94   test accuracy 0.942
train accuracy 0.947  test accuracy 0.941
train accuracy 0.944  test accuracy 0.947
Control group: two conv layers. Overfitting is not pronounced to begin with, so the results are omitted.
A second way to write it: everything in one expression
There is no fundamental difference; it merely skips the step of gathering losses from the collection, at the cost of a longer, less readable expression.
loss = tf.reduce_mean(tf.square(y_ - y)) \
       + tf.contrib.layers.l2_regularizer(lambda_)(w1) \
       + tf.contrib.layers.l2_regularizer(lambda_)(w2) \
       + ...  # one term per weight tensor; lambda_ is the decay factor (lambda itself is a reserved word in Python)
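As a side note, here is a hedged sketch of a related TF 1.x pattern (not the one used in this post): attach the regularizer to each variable when it is created and let TensorFlow gather the penalty terms into a collection for you, instead of typing one term per weight. All names and shapes below are illustrative.

import tensorflow as tf

lambda_ = 0.004  # weight-decay factor, same role as wd above (illustrative value)

x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])

# The regularizer passed to tf.get_variable adds lambda_ * sum(w**2) / 2 for this
# variable to the tf.GraphKeys.REGULARIZATION_LOSSES collection automatically.
w = tf.get_variable('w', shape=[784, 10],
                    regularizer=tf.contrib.layers.l2_regularizer(lambda_))
b = tf.get_variable('b', shape=[10], initializer=tf.zeros_initializer())

logits = tf.matmul(x, w) + b
data_loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))

# Sum the automatically collected penalties, much like the 'losses' collection above.
reg_loss = tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
total_loss = data_loss + reg_loss
train_op = tf.train.AdamOptimizer(1e-4).minimize(total_loss)

This keeps each weight's decay factor next to the variable that owns it, which is essentially the same bookkeeping idea as the 'losses' collection trick used earlier.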
Let's check what the regularizer ops compute on their own (I won't repeat the code that adds them to the loss, it is too long; just substitute into the code above):
import tensorflow as tf

CONST_SCALE = 0.5
w = tf.constant([[5.0, -2.0], [-3.0, 1.0]])

with tf.Session() as sess:
    print(sess.run(tf.abs(w)))
    print('preprocessing:', sess.run(tf.reduce_sum(tf.abs(w))))
    print('manual computation:', sess.run(tf.reduce_sum(tf.abs(w)) * CONST_SCALE))
    print('l1_regularizer:', sess.run(tf.contrib.layers.l1_regularizer(CONST_SCALE)(w)))  # 11 * CONST_SCALE

    print(sess.run(w**2))
    print(sess.run(tf.reduce_sum(w**2)))
    print('preprocessing:', sess.run(tf.reduce_sum(w**2) / 2))  # default
    print('manual computation:', sess.run(tf.reduce_sum(w**2) / 2 * CONST_SCALE))
    print('l2_regularizer:', sess.run(tf.contrib.layers.l2_regularizer(CONST_SCALE)(w)))  # 19.5 * CONST_SCALE

-------------------------------
[[5. 2.]
 [3. 1.]]
preprocessing: 11.0
manual computation: 5.5
l1_regularizer: 5.5
[[25.  4.]
 [ 9.  1.]]
39.0
preprocessing: 19.5
manual computation: 9.75
l2_regularizer: 9.75
Note: the L2 regularizer's "preprocessing" step is the sum of squares divided by 2. The 1/2 is a convenience factor: differentiating w² brings down a factor of 2, so with the 1/2 the gradient comes out clean as d/dw (λ·w²/2) = λ·w. With or without the constant, optimization proceeds the same way, since minimizing a and minimizing 10a are the same training objective. And if the balance between the regularizer and the main loss needs changing, that is what the decay coefficient is for.
In practice, for a complex system, writing the formula out by hand is less convenient than dropping the base loss and every regularization term into a collection, especially since you may well want different decay coefficients for different weights, which becomes very tedious as a single expression.
Similar techniques include batch normalization and dropout; they all have a "noise-injection" flavor and all help prevent overfitting to some degree. But L1 and L2 regularization should not be called "L1 norm" and "L2 norm". A norm is a way of measuring distance, like the absolute value or squared distance; it is not a regularization. "L1 regularization" and "L2 regularization" are best understood as regularization built on the L1 and L2 norms.
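For reference, the standard definitions (not from this post's figures):

    ||w||_1 = Σ_i |w_i|          (L1 norm: sum of absolute values)
    ||w||_2 = sqrt(Σ_i w_i²)     (L2 norm: Euclidean length)

Note that the L2 penalty used throughout this post is λ·Σ_i w_i²/2, i.e. the squared L2 norm (with TensorFlow's extra 1/2), not the norm itself.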
That is all for this post on using L2 regularization in TensorFlow to reduce overfitting. I hope it serves as a useful reference, and thank you for supporting 腳本之家.