Using L2 regularization in TensorFlow to reduce overfitting
How L2 regularization works:
How overfitting arises: as the loss decreases and the model fits the data (the sloped line in the figure), different mini-batches make the red curve fluctuate heavily. The low points of that curve in the figure are the overfitting: the red curve dips below the true black line, which means worse generalization.
So to reduce overfitting we need to damp this fluctuation, and shrinking the magnitude of the weights w accomplishes exactly that.
How training with L2 regularization works: add to the loss the sum of the squared weights w, scaled by a coefficient λ. Training then suppresses the weight values; smaller (absolute) values of w mean a less complex model, a smoother curve, and less overfitting (Occam's razor). The reference formula is written out below:
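(The original figure is not reproduced here; written out, the regularized objective that the code in this post builds is the standard form

    total_loss = cross_entropy + λ · Σ_i w_i² / 2

where λ is the weight-decay factor, called wd in the code below, and the factor of 1/2 comes from TensorFlow's tf.nn.l2_loss, as discussed near the end of the post.)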
(Regularization does not stop you from fitting the curve, and not every parameter gets suppressed blindly. In practice this is a dynamic process, a tug-of-war between the main loss (cross_entropy) and the L2 loss: training pulls the weights toward values that fit the data while the regularizer keeps pulling them back toward zero, and the two effects balance out. Irrelevant w_i shrink smaller and smaller yet stay a little above zero (that little bit above zero is exactly L2's trade-off: better than discarding them outright), while useful w_i are kept in a "moderate" range, fitting the data while generalizing better. I won't belabor the theory and derivations.)
So why can't L1 do the same job? Mainly because L1 has a side effect that makes it a poor fit for this scenario.
L1 replaces the square of w_i in the L2 formula with the absolute value |w_i|. Because of this mathematical property, the weights are shrunk unevenly: some w_i stay large while others are driven to zero, which yields a sparse solution and amounts to feature selection. Why is L1's weight decay more lopsided than L2's? It is quite intuitive: to lower the penalty, moving w1 from 0.1 to 0 and moving w2 from 1.0 to 0.9 look exactly the same to the optimizer under L1. With squares, however, the former saves only 0.01 - 0 = 0.01 while the latter saves 1 - 0.81 = 0.19, so shrinking w2 is clearly the better deal. The figure below shows this best: the axes are w1 and w2 and the contours are loss values. In the left plot the solution lands at w1 = 0, w2 = max(w2), a classic sparse solution that discards w1, while in the right plot the solution balances w1 and w2. Discarding w1 means that where you could have fit a curve you now only get a straight line: overfitting drops, but so does the model's fitting (expressive) capacity.
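A minimal numeric sketch of the argument above (plain Python with NumPy, separate from the TensorFlow code later in this post): shrinking a small weight and a large weight by the same step saves the same amount under an L1 penalty, but very different amounts under L2.

import numpy as np

w = np.array([0.1, 1.0])   # a small weight and a large weight
step = 0.1                 # shrink either one by the same amount

def l1(w):
    return np.sum(np.abs(w))

def l2(w):
    return np.sum(w ** 2)

for i, name in enumerate(['w1 (small)', 'w2 (large)']):
    w_new = w.copy()
    w_new[i] -= step
    print(name,
          'L1 saving:', round(l1(w) - l1(w_new), 4),   # 0.1 either way
          'L2 saving:', round(l2(w) - l2(w_new), 4))   # 0.01 vs 0.19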
L1 and L2 also go by the names Lasso and ridge, and the two are easy to mix up: it is tempting to think ridge regression must be L1 because a "ridge" sounds sharp, but the picture that ridge actually corresponds to is the one shown here, and "mountain ridge" is really the better reading: a ridge is a curve that descends slowly and smoothly.
Training
We train an MNIST classifier and compare plain cross_entropy against total_loss, which adds the L2 regularization term.
Because MNIST is not a complex task to begin with, stacking too many CONV layers before the FC layers makes the model so strong that the gap becomes hard to see. To show the effect of L2 regularization I keep only one CONV layer (note that FC1's input is h_pool1, bypassing conv2); the two-conv version serves as a control group.
The first 1000 training samples serve as the validation set, and the first 1000 test samples as the test set.
Code overview: a basic CONV + FC network that predicts labels for the images, measured and trained with cross_entropy.
Each weight that needs regularization is passed through tf.nn.l2_loss, and both the cross_entropy and the L2 losses are thrown into the collection 'losses'.
wd is simply the λ from the formula above: the larger wd is, the heavier the penalty, the less the overfitting, and the weaker the fitting capacity, so it should be neither too large nor too small. Many people simply default it to 0.004, which is usually fine since it is an empirical value, but in my experience it is not a fixed constant, especially once you customize the loss function. If your weighted cross-entropy becomes ten times larger than before and wd stays unchanged, wd is effectively only a tenth of what it was, i.e. 0.0004. Just as switching the loss from reduce_mean to reduce_sum rescales the gradients, many things have to be changed in step (a small numeric sketch after the snippet below illustrates this scaling).
var = tf.Variable(initial)
weight_decay = tf.multiply(tf.nn.l2_loss(var), wd, name='weight_loss')  # penalize the trained variable, not the initializer tensor
tf.add_to_collection('losses', weight_decay)
tf.add_to_collection('losses', cross_entropy)
total_loss = tf.add_n(tf.get_collection('losses')) then gathers all the losses; training on total_loss implements exactly the formula shown in the first figure.
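A small sketch of the scaling point made above (the numbers are purely illustrative, not taken from this post's experiments): if the data term of the loss is summed instead of averaged over a batch of 100, it grows roughly 100x, and wd has to grow with it to keep the same balance between fitting and regularization.

import numpy as np

# Illustrative numbers only: per-example cross-entropy around 0.5, batch of 100,
# and some weight matrix whose tf.nn.l2_loss value is about 3.0.
per_example_loss = np.full(100, 0.5)
l2_term = 3.0
wd = 0.004

mean_loss = per_example_loss.mean()   # ~0.5  (reduce_mean-style data term)
sum_loss = per_example_loss.sum()     # ~50   (reduce_sum-style data term, 100x larger)

print('regularizer/data ratio with reduce_mean:', wd * l2_term / mean_loss)         # ~0.024
print('regularizer/data ratio with reduce_sum :', wd * l2_term / sum_loss)          # ~0.00024, wd is effectively 100x weaker
print('same ratio restored by rescaling wd    :', (wd * 100) * l2_term / sum_loss)  # ~0.024 again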
The complete code:
from __future__ import print_function
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# number 1 to 10 data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

def compute_accuracy(v_xs, v_ys):
    global prediction
    y_pre = sess.run(prediction, feed_dict={xs: v_xs, keep_prob: 1})
    correct_prediction = tf.equal(tf.argmax(y_pre, 1), tf.argmax(v_ys, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    # result = sess.run(accuracy, feed_dict={xs: v_xs, ys: v_ys, keep_prob: 1})
    result = sess.run(accuracy)
    return result

def weight_variable(shape, wd):
    initial = tf.truncated_normal(shape, stddev=0.1)
    var = tf.Variable(initial)
    if wd is not None:
        print('wd is not none!!!!!!!')
        # penalize the trained variable itself, not the initializer tensor,
        # so the L2 term actually constrains the weights during training
        weight_decay = tf.multiply(tf.nn.l2_loss(var), wd, name='weight_loss')
        tf.add_to_collection('losses', weight_decay)
    return var

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    # stride [1, x_movement, y_movement, 1]
    # Must have strides[0] = strides[3] = 1
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    # stride [1, x_movement, y_movement, 1]
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

# define placeholder for inputs to network
xs = tf.placeholder(tf.float32, [None, 784]) / 255.  # 28x28
ys = tf.placeholder(tf.float32, [None, 10])
keep_prob = tf.placeholder(tf.float32)
x_image = tf.reshape(xs, [-1, 28, 28, 1])
# print(x_image.shape)  # [n_samples, 28, 28, 1]

## conv1 layer ##
W_conv1 = weight_variable([5, 5, 1, 32], 0.)   # patch 5x5, in size 1, out size 32
b_conv1 = bias_variable([32])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)  # output size 28x28x32
h_pool1 = max_pool_2x2(h_conv1)                           # output size 14x14x32

## conv2 layer ##
W_conv2 = weight_variable([5, 5, 32, 64], 0.)  # patch 5x5, in size 32, out size 64
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)  # output size 14x14x64
h_pool2 = max_pool_2x2(h_conv2)                           # output size 7x7x64

#############################################################################
## fc1 layer ##
W_fc1 = weight_variable([14*14*32, 1024], wd=0.)    # do not use conv2
# W_fc1 = weight_variable([7*7*64, 1024], wd=0.00)  # use conv2
b_fc1 = bias_variable([1024])
# [n_samples, 7, 7, 64] ->> [n_samples, 7*7*64]
h_pool2_flat = tf.reshape(h_pool1, [-1, 14*14*32])  # do not use conv2
# h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])  # use conv2
###############################################################################
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

## fc2 layer ##
W_fc2 = weight_variable([1024, 10], wd=0.)
b_fc2 = bias_variable([10])
prediction = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

# the error between prediction and real data
cross_entropy = tf.reduce_mean(-tf.reduce_sum(ys * tf.log(prediction), reduction_indices=[1]))  # loss
tf.add_to_collection('losses', cross_entropy)
total_loss = tf.add_n(tf.get_collection('losses'))
print(total_loss)

train_op = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
train_op_with_l2_norm = tf.train.AdamOptimizer(1e-4).minimize(total_loss)

sess = tf.Session()
# important step
# tf.initialize_all_variables() no long valid from
# 2017-03-02 if using tensorflow >= 0.12
if int((tf.__version__).split('.')[1]) < 12 and int((tf.__version__).split('.')[0]) < 1:
    init = tf.initialize_all_variables()
else:
    init = tf.global_variables_initializer()
sess.run(init)

for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 1})
    # sess.run(train_op_with_l2_norm, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 1})
    # sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 0.5})  # dropout
    if i % 100 == 0:
        print('train accuracy', compute_accuracy(
            mnist.train.images[:1000], mnist.train.labels[:1000]))
        print('test accuracy', compute_accuracy(
            mnist.test.images[:1000], mnist.test.labels[:1000]))
The training runs follow.
No dropout, no L2 regularization, 1000 training steps:
weight_variable([1024, 10], wd = 0.)
Train accuracy clearly beats test accuracy at every step (often by around 0.01): overfitting!
train accuracy 0.094  test accuracy 0.089
train accuracy 0.892  test accuracy 0.874
train accuracy 0.91   test accuracy 0.893
train accuracy 0.925  test accuracy 0.925
train accuracy 0.945  test accuracy 0.935
train accuracy 0.954  test accuracy 0.944
train accuracy 0.961  test accuracy 0.951
train accuracy 0.965  test accuracy 0.955
train accuracy 0.964  test accuracy 0.959
train accuracy 0.962  test accuracy 0.956
No dropout, L2 regularization on the FC layers with a weight-decay factor of 0.004, 1000 training steps:
weight_variable([1024, 10], wd = 0.004)
Overfitting is noticeably reduced, and at times the test set even beats the training set (given the size of the validation set, this only shows the rough trend).
train accuracy 0.107  test accuracy 0.145
train accuracy 0.876  test accuracy 0.861
train accuracy 0.91   test accuracy 0.909
train accuracy 0.923  test accuracy 0.919
train accuracy 0.931  test accuracy 0.927
train accuracy 0.936  test accuracy 0.939
train accuracy 0.956  test accuracy 0.949
train accuracy 0.958  test accuracy 0.954
train accuracy 0.947  test accuracy 0.95
train accuracy 0.947  test accuracy 0.953
Control group: no L2 regularization, dropout only. Overfitting is also reduced.
W_fc1 = weight_variable([14*14*32, 1024], wd=0.)
W_fc2 = weight_variable([1024, 10], wd=0.)
sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 0.5})  # dropout

train accuracy 0.132  test accuracy 0.104
train accuracy 0.869  test accuracy 0.859
train accuracy 0.898  test accuracy 0.889
train accuracy 0.917  test accuracy 0.906
train accuracy 0.923  test accuracy 0.917
train accuracy 0.928  test accuracy 0.925
train accuracy 0.938  test accuracy 0.94
train accuracy 0.94   test accuracy 0.942
train accuracy 0.947  test accuracy 0.941
train accuracy 0.944  test accuracy 0.947
Control group: two conv layers. Overfitting is not pronounced to begin with, so the results are omitted.
A second way to write it: everything in one expression
There is no fundamental difference; it merely skips the step of gathering losses from the collection, at the cost of a longer, less readable expression.
loss = tf.reduce_mean(tf.square(y_ - y)) \
       + tf.contrib.layers.l2_regularizer(lambda_)(w1) \
       + tf.contrib.layers.l2_regularizer(lambda_)(w2) \
       + ...  # one term per weight tensor; lambda_ is the decay factor (lambda itself is a reserved word in Python)
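As a side note, here is a hedged sketch of a related TF 1.x pattern (not the one used in this post): attach the regularizer to each variable when it is created and let TensorFlow gather the penalty terms into a collection for you, instead of typing one term per weight. All names and shapes below are illustrative.

import tensorflow as tf

lambda_ = 0.004  # weight-decay factor, same role as wd above (illustrative value)

x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])

# The regularizer passed to tf.get_variable adds lambda_ * sum(w**2) / 2 for this
# variable to the tf.GraphKeys.REGULARIZATION_LOSSES collection automatically.
w = tf.get_variable('w', shape=[784, 10],
                    regularizer=tf.contrib.layers.l2_regularizer(lambda_))
b = tf.get_variable('b', shape=[10], initializer=tf.zeros_initializer())

logits = tf.matmul(x, w) + b
data_loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))

# Sum the automatically collected penalties, much like the 'losses' collection above.
reg_loss = tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
total_loss = data_loss + reg_loss
train_op = tf.train.AdamOptimizer(1e-4).minimize(total_loss)

This keeps each weight's decay factor next to the variable that owns it, which is essentially the same bookkeeping idea as the 'losses' collection trick used earlier.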
Let's check what the regularizer ops compute on their own (I won't repeat the code that adds them to the loss, it is too long; just substitute into the code above):
import tensorflow as tf

CONST_SCALE = 0.5
w = tf.constant([[5.0, -2.0], [-3.0, 1.0]])

with tf.Session() as sess:
    print(sess.run(tf.abs(w)))
    print('preprocessing:', sess.run(tf.reduce_sum(tf.abs(w))))
    print('manual computation:', sess.run(tf.reduce_sum(tf.abs(w)) * CONST_SCALE))
    print('l1_regularizer:', sess.run(tf.contrib.layers.l1_regularizer(CONST_SCALE)(w)))  # 11 * CONST_SCALE

    print(sess.run(w**2))
    print(sess.run(tf.reduce_sum(w**2)))
    print('preprocessing:', sess.run(tf.reduce_sum(w**2) / 2))  # default
    print('manual computation:', sess.run(tf.reduce_sum(w**2) / 2 * CONST_SCALE))
    print('l2_regularizer:', sess.run(tf.contrib.layers.l2_regularizer(CONST_SCALE)(w)))  # 19.5 * CONST_SCALE

-------------------------------
[[5. 2.]
 [3. 1.]]
preprocessing: 11.0
manual computation: 5.5
l1_regularizer: 5.5
[[25.  4.]
 [ 9.  1.]]
39.0
preprocessing: 19.5
manual computation: 9.75
l2_regularizer: 9.75
Note: the L2 regularizer's "preprocessing" step is the sum of squares divided by 2. The 1/2 is a convenience factor: differentiating w² brings down a factor of 2, so with the 1/2 the gradient comes out clean as d/dw (λ·w²/2) = λ·w. With or without the constant, optimization proceeds the same way, since minimizing a and minimizing 10a are the same training objective. And if the balance between the regularizer and the main loss needs changing, that is what the decay coefficient is for.
In practice, for a complex system, writing the formula out by hand is less convenient than dropping the base loss and every regularization term into a collection, especially since you may well want different decay coefficients for different weights, which becomes very tedious as a single expression.
Similar techniques include batch normalization and dropout; they all have a "noise-injection" flavor and all help prevent overfitting to some degree. But L1 and L2 regularization should not be called "L1 norm" and "L2 norm". A norm is a way of measuring distance, like the absolute value or squared distance; it is not a regularization. "L1 regularization" and "L2 regularization" are best understood as regularization built on the L1 and L2 norms.
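For reference, the standard definitions (not from this post's figures):

    ||w||_1 = Σ_i |w_i|          (L1 norm: sum of absolute values)
    ||w||_2 = sqrt(Σ_i w_i²)     (L2 norm: Euclidean length)

Note that the L2 penalty used throughout this post is λ·Σ_i w_i²/2, i.e. the squared L2 norm (with TensorFlow's extra 1/2), not the norm itself.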
That is all for this post on using L2 regularization in TensorFlow to reduce overfitting. I hope it serves as a useful reference, and thank you for supporting 腳本之家.