A Detailed Look at the Common Nonlinear Activation Functions in YOLO v4

Updated: 2021-05-12 11:06:12. Author: 滿船清夢壓星河HK

This article walks through the nonlinear activation functions commonly discussed alongside YOLO v4, with formulas, pros and cons, and a reference implementation for each.

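YOLO v4 uses the Mish activation function.
The activation functions mentioned in the YOLO v4 paper are ReLU, Leaky ReLU, PReLU, ReLU6, SELU, Swish, and Mish.
Of these, PReLU and SELU are harder to train, and ReLU6 is designed specifically for quantized networks.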

[Figure: how an activation function is applied]

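1. Saturating Activation Functions

1.1 Sigmoid

Function:

$$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{e^{x} + 1}, \qquad \sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$

[Figure: the Sigmoid function and its derivative]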

優(yōu)點：

是一個便于求導的平滑函數(shù)；
能壓縮數(shù)據(jù)，使輸出保證在 [ 0 , 1 ] [0,1] [0,1]之間（相當于對輸出做了歸一化），保證數(shù)據(jù)幅度不會有問題；
(有上下界)適合用于前向傳播，但是不利于反向傳播。

缺點：

容易出現(xiàn)梯度消失(gradient vanishing)，不利于權(quán)重更新；
不是0均值（zero-centered）的，這會導致后層的神經(jīng)元的輸入是非0均值的信號，這會對梯度產(chǎn)生影響。以 f=sigmoid(wx+b)為例，假設輸入均為正數(shù)（或負數(shù)），那么對w的導數(shù)總是正數(shù)（或負數(shù)），這樣在反向傳播過程中要么都往正方向更新，要么都往負方向更新，導致有一種捆綁效果，使得收斂緩慢。
指數(shù)運算，相對耗時。
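
The two gradient-related drawbacks of Sigmoid can be checked numerically. The following is a minimal sketch using only the standard library; the sample inputs, weights, and the depth of 10 layers are illustrative assumptions, not values from the article.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 and decays quickly once the unit saturates.
peak_grad = sigmoid_grad(0.0)
saturated_grad = sigmoid_grad(10.0)

# Best-case gradient factor through 10 stacked sigmoids (weights ignored):
# each layer contributes at most the peak derivative.
depth = 10  # illustrative assumption
best_case_factor = peak_grad ** depth

# Same-sign effect: for f = sigmoid(w.x + b) with all-positive inputs,
# df/dw_i = sigmoid'(w.x + b) * x_i, so every component has the same sign.
x_in = [0.5, 1.2, 3.0]   # hypothetical all-positive inputs
w = [0.1, -0.4, 0.2]     # hypothetical weights
b = 0.05
pre = sum(wi * xi for wi, xi in zip(w, x_in)) + b
grad_w = [sigmoid_grad(pre) * xi for xi in x_in]

print(peak_grad, saturated_grad, best_case_factor)
print(grad_w)
```

Because the derivative never exceeds its value at zero, stacking many Sigmoid layers shrinks the gradient multiplicatively, which is the vanishing-gradient behavior described above.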

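1.2 Hard-Sigmoid

The hard-Sigmoid function is a piecewise-linear approximation of the Sigmoid activation function.

Function:

$$\text{hard-sigmoid}(x) = \max\!\left(0,\ \min\!\left(1,\ \frac{2x + 5}{10}\right)\right)$$

[Figure: hard-Sigmoid compared with Sigmoid]

[Figure: the hard-Sigmoid function and its derivative]

Advantages:

As the formula and the curve show, it is cheaper to compute because it involves no exponential, which speeds up training.

Disadvantages:

Its first derivative is exactly zero outside the linear region, which can cause dead neurons or very slow learning.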
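
To get a feel for how close the piecewise-linear approximation is, the sketch below compares hard-Sigmoid with Sigmoid on a grid. It assumes the clipped form (2x + 5) / 10 used in the implementation section of this article; the grid range of [-10, 10] is an arbitrary choice.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hard_sigmoid(x):
    # Clip the line (2x + 5) / 10 to the interval [0, 1].
    return max(0.0, min(1.0, (2.0 * x + 5.0) / 10.0))

def hard_sigmoid_grad(x):
    # Slope 1/5 inside the linear region, exactly zero outside it.
    f = (2.0 * x + 5.0) / 10.0
    return 0.2 if 0.0 < f < 1.0 else 0.0

# Largest absolute gap between the two functions on a grid over [-10, 10].
xs = [i / 100.0 for i in range(-1000, 1001)]
max_gap = max(abs(hard_sigmoid(x) - sigmoid(x)) for x in xs)

print(max_gap)
print(hard_sigmoid_grad(0.0), hard_sigmoid_grad(3.0))
```

The gap is largest near the corners of the clipped line, and the gradient is identically zero beyond them, which is the source of the dead-neuron risk noted above.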

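1.3 Tanh (Hyperbolic Tangent)

Function:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad \tanh'(x) = 1 - \tanh^{2}(x)$$

[Figure: the Tanh function and its derivative]

Advantages:

It fixes Sigmoid's lack of zero-centering.
It squashes its input so that the output always lies in $(-1, 1)$ (effectively normalizing the output), which keeps activation magnitudes under control (bounded above and below).

Disadvantages:

It is still prone to vanishing gradients, which hampers weight updates.
It requires exponentials, which are relatively expensive to compute.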
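
The relationship between Tanh and Sigmoid, and the difference in zero-centering, can be illustrated with a few lines of standard-library Python. This is a small sketch; the symmetric grid over [-5, 5] is an assumption made for the demonstration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

xs = [i / 10.0 for i in range(-50, 51)]  # symmetric grid around 0

# Tanh is a rescaled, shifted Sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
identity_err = max(abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) for x in xs)

# On symmetric inputs, Tanh averages to 0 while Sigmoid averages to 0.5.
mean_tanh = sum(math.tanh(x) for x in xs) / len(xs)
mean_sigmoid = sum(sigmoid(x) for x in xs) / len(xs)

# The derivative 1 - tanh(x)^2 reaches 1 at the origin but still
# collapses toward 0 once the unit saturates.
tanh_grad_0 = 1.0 - math.tanh(0.0) ** 2
tanh_grad_5 = 1.0 - math.tanh(5.0) ** 2

print(identity_err, mean_tanh, mean_sigmoid)
print(tanh_grad_0, tanh_grad_5)
```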

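2. Non-Saturating Activation Functions

2.1 ReLU (Rectified Linear Unit)

Function:

$$f(x) = \max(0, x)$$

[Figure: the ReLU function and its derivative]

Advantages:

ReLU converges faster than Sigmoid and Tanh.
For positive inputs it avoids the vanishing-gradient problem, which suits backpropagation.
It is cheap to compute and needs no exponential.

Disadvantages:

ReLU's output is not zero-centered.
ReLU does not compress the magnitude of the data, so magnitudes keep growing as the number of layers increases (bounded below, unbounded above).
Dead ReLU problem (dying neurons): when x is negative the gradient is 0, so these neurons may never be activated and the corresponding parameters may never be updated (for negative inputs the function suffers from vanishing gradients).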
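
The dead ReLU problem can be reproduced with a single neuron. The sketch below is a toy setup, not part of the article: the data, the initial weight and bias, the squared-error loss, and the learning rate are all hypothetical, chosen so that the pre-activation is negative for every input.

```python
def relu(z):
    return z if z > 0 else 0.0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

# Hypothetical training data and a badly initialized neuron:
# w * x + b is negative for every input below.
inputs = [0.2, 0.5, 0.9]
targets = [1.0, 1.0, 1.0]
w, b = 0.5, -2.0
lr = 0.1

for _ in range(100):
    grad_w = grad_b = 0.0
    for x, t in zip(inputs, targets):
        z = w * x + b
        # d/dz of (relu(z) - t)^2; the relu_grad factor is 0 for z < 0.
        dz = 2.0 * (relu(z) - t) * relu_grad(z)
        grad_w += dz * x
        grad_b += dz
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # parameters after 100 gradient steps
```

Every gradient is multiplied by a zero derivative, so the parameters never move even though the loss is far from its minimum.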

2.2 ReLU6 (ReLU Capped at 6)

Function:

$$f(x) = \min\bigl(\max(0, x),\ 6\bigr)$$

[Figure: ReLU compared with ReLU6]

[Figure: the ReLU6 function and its derivative]

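2.3 Leaky ReLU

Function (with a fixed constant $a > 1$, so the negative-side slope is $1/a$):

$$f(x) = \begin{cases} x, & x \ge 0 \\ \dfrac{x}{a}, & x < 0 \end{cases}$$

[Figure: ReLU compared with Leaky ReLU]

[Figure: the Leaky ReLU function and its derivative]

Advantages:

It resolves the dead ReLU problem described above: the negative region keeps a small nonzero gradient instead of a zero one.

In theory Leaky ReLU is better than ReLU, but in practice this does not always hold.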
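
The difference between ReLU and Leaky ReLU on a negative pre-activation comes down to one factor in the chain rule. The sketch below assumes the x / a parameterization with a = 100 used in the implementation section; the pre-activation and the upstream gradient are made-up values.

```python
def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, a=100.0):
    # Negative inputs are scaled by 1/a instead of being zeroed.
    return z if z >= 0 else z / a

def leaky_relu_grad(z, a=100.0):
    return 1.0 if z >= 0 else 1.0 / a

z = -1.55         # hypothetical negative pre-activation
upstream = -2.0   # hypothetical gradient arriving from the loss

grad_relu = upstream * relu_grad(z)         # blocked completely
grad_leaky = upstream * leaky_relu_grad(z)  # small but nonzero

print(grad_relu, grad_leaky)
```

The leaked gradient is small, so recovery can be slow, which is one reason the theoretical advantage does not always show up in practice.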

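2.4 PReLU (Parametric ReLU)

Function:

$$f(x) = \begin{cases} x, & x \ge 0 \\ \dfrac{x}{a}, & x < 0 \end{cases}$$

Note: the form is the same as Leaky ReLU. The difference is that the negative-side slope is a learnable parameter updated during training rather than a fixed constant.

[Figure: the PReLU function]

Advantages:

It avoids the dead ReLU problem.
Compared with ELU, it does not suffer from vanishing gradients for negative inputs.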

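2.5 ELU (Exponential Linear Unit)

Function:

$$f(x) = \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases}$$

[Figure: the ELU function and its derivative ($\alpha = 1.5$)]

Advantages:

It has all the advantages of ReLU without the dead ReLU problem (dying neurons).
Its output is approximately zero-centered: the mean output is close to 0.
By reducing the effect of bias shift, it brings the normal gradient closer to the natural gradient, which speeds up learning by pushing the mean activation toward 0.

Disadvantages:

It costs more to compute.

In theory ELU is better than ReLU, but on real data this does not always hold.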
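
The claim that ELU's mean output sits closer to zero than ReLU's can be checked empirically. This is a minimal sketch under assumptions: inputs are drawn from a standard normal distribution with a fixed seed, and alpha = 1.5 is taken from the plotting value used in this article.

```python
import math
import random

def relu(x):
    return max(0.0, x)

def elu(x, alpha=1.5):
    # Identity for positive inputs, smooth exponential tail for the rest.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

mean_relu = sum(relu(x) for x in samples) / len(samples)
mean_elu = sum(elu(x) for x in samples) / len(samples)

print(mean_relu, mean_elu)
```

ReLU discards the negative half of the distribution, so its mean is pulled upward; ELU's negative outputs offset most of that shift.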

2.6 SELU

SELU is ELU with an additional scale parameter $\lambda$, where $\lambda > 1$.

Function:

$$f(x) = \lambda \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases}$$

[Figure: ELU compared with SELU ($\alpha = 1.5$, $\lambda = 2$)]

[Figure: the SELU function and its derivative ($\alpha = 1.5$, $\lambda = 2$)]
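
SELU is best known for its self-normalizing behavior: with suitably chosen constants, standard-normal inputs produce outputs whose mean stays near 0 and variance near 1. The sketch below checks this empirically. The constants are the ones published in the SELU paper, not the plotting values ($\alpha = 1.5$, $\lambda = 2$) used in this article, and the sample size and seed are arbitrary assumptions.

```python
import math
import random

# Constants from the SELU paper (an assumption relative to this article,
# which plots with alpha = 1.5 and lambda = 2 instead).
LAMBDA = 1.0507009873554805
ALPHA = 1.6732632423543772

def selu(x, lamb=LAMBDA, alpha=ALPHA):
    return lamb * x if x > 0 else lamb * alpha * (math.exp(x) - 1.0)

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
ys = [selu(x) for x in xs]

mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / len(ys)

print(mean_y, var_y)
```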

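Advantages:

Earlier activation functions such as ReLU, PReLU, and ELU have a gentle slope on the negative half-axis, so when the variance of the activations is too large they shrink the gradient and keep it from exploding, but on the positive half-axis their slope is simply fixed at 1. SELU's slope on the positive half-axis is greater than 1, so it can enlarge the variance when it is too small while also preventing vanishing gradients. The activation function therefore has a fixed point: in a deep network, the output of every layer has mean 0 and variance 1.

2.7 Swish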

Function:

$$f(x) = x \cdot \sigma(\beta x) = \frac{x}{1 + e^{-\beta x}}$$

[Figure: the Swish function for $\beta = 0.1$, $\beta = 1$, $\beta = 10$]

[Figure: the gradient of Swish for $\beta = 0.1$, $\beta = 1$, $\beta = 10$]

Advantages:

For x > 0 it likewise has no vanishing gradient, and for x < 0 neurons do not die as they can with ReLU.
Unlike ReLU, the derivative of Swish is not constant, which is also an advantage.
Swish is differentiable everywhere, continuous, and smooth.
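
These properties of Swish are easy to probe numerically. The following sketch evaluates the function and its analytic gradient at a few hand-picked points; the specific points and the values of beta are illustrative assumptions.

```python
import math

def sigmoid(x):
    # Overflow-safe logistic sigmoid.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

def swish_grad(x, beta=1.0):
    s = sigmoid(beta * x)
    # Product rule: d/dx [x * s(beta x)] = s + beta * x * s * (1 - s)
    return s + beta * x * s * (1.0 - s)

# The gradient is not constant on the positive side...
g_pos_1, g_pos_3 = swish_grad(1.0), swish_grad(3.0)
# ...and it does not vanish for moderately negative inputs.
g_neg = swish_grad(-1.0)

# Limiting cases of beta: beta = 0 gives x / 2, large beta approaches ReLU.
half_linear = swish(2.0, beta=0.0)
relu_like_pos = swish(2.0, beta=50.0)
relu_like_neg = swish(-2.0, beta=50.0)

print(g_pos_1, g_pos_3, g_neg)
print(half_linear, relu_like_pos, relu_like_neg)
```

The parameter beta therefore interpolates between a scaled linear function and ReLU.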

Disadvantages:

It is expensive to compute: Sigmoid is already costly to evaluate, and Swish costs more still.

2.8 Hard-Swish

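"Hard" means the curve is less smooth overall, as both figures below show.

Function:

$$\text{hard-swish}(x) = x \cdot \frac{\text{ReLU6}(x + 3)}{6}$$

[Figure: hard-Swish compared with Swish ($\beta = 1$)]

[Figure: the gradients of hard-Swish and Swish ($\beta = 1$) compared]

Advantages:

hard-Swish approximately matches the behavior of Swish.
It also reduces Swish's heavy computational cost: in quantized mode, ReLU6 is far cheaper to evaluate than Sigmoid.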
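
How closely hard-Swish tracks Swish can be measured directly. The sketch below assumes the x * ReLU6(x + 3) / 6 form from the implementation section and compares it with Swish at beta = 1 on an arbitrarily chosen grid over [-10, 10].

```python
import math

def relu6(x):
    return min(max(0.0, x), 6.0)

def hard_swish(x):
    # Replaces the sigmoid in Swish with the piecewise-linear ReLU6(x + 3) / 6.
    return x * relu6(x + 3.0) / 6.0

def swish(x):
    return x / (1.0 + math.exp(-x))

xs = [i / 100.0 for i in range(-1000, 1001)]
max_gap = max(abs(hard_swish(x) - swish(x)) for x in xs)

print(max_gap)
print(hard_swish(-3.0), hard_swish(3.0))
```

The approximation needs only a comparison, an addition, and a multiplication per element, which is what makes it attractive for quantized or mobile inference.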

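2.9 Mish

Paper:

https://arxiv.org/pdf/1908.08681.pdf

At the time of writing this was one of the most recent papers on improved activation functions, and Mish is widely used in YOLO v4. The reported improvement is 0.494% over Swish and 1.671% over ReLU.

Mish function:

$$\text{Mish}(x) = x \cdot \tanh\bigl(\text{softplus}(x)\bigr) = x \cdot \tanh\bigl(\ln(1 + e^{x})\bigr)$$

[Figure: Mish compared with Swish ($\beta = 1$)]

[Figure: the derivatives of Mish and Swish ($\beta = 1$) compared]

Why Mish performs better:

Being unbounded above (positive values can grow arbitrarily large) avoids the saturation caused by capping. In theory, allowing slightly negative values gives better gradient flow than a hard zero boundary as in ReLU.
Finally, and perhaps most importantly, the current thinking is that a smooth activation function lets information propagate better deep into the network, which yields better accuracy and generalization. Mish is extremely smooth at almost every point on its curve.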
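
The shape described here (unbounded above, slightly negative below) can be confirmed with a short scan. This is a sketch using the standard definition x * tanh(softplus(x)); the overflow-safe softplus and the grid over [-10, 10] are choices made for this example.

```python
import math

def softplus(x):
    # Overflow-safe ln(1 + e^x).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    return x * math.tanh(softplus(x))

# Scan a grid to locate the small negative dip of the curve.
xs = [i / 1000.0 for i in range(-10000, 10001)]
min_val, argmin = min((mish(x), x) for x in xs)

# For large positive inputs Mish behaves like the identity.
large_in = 20.0
large_out = mish(large_in)

print(min_val, argmin)
print(large_out)
```

The function dips only slightly below zero before returning toward zero for very negative inputs, so it lets a little negative signal through without being unbounded below.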

3. Python Implementation (NumPy)

import matplotlib.pyplot as plt
import numpy as np

class ActivateFunc():
    def __init__(self, x, b=None, lamb=None, alpha=None, a=None):
        super(ActivateFunc, self).__init__()
        self.x = x
        self.b = b
        self.lamb = lamb
        self.alpha = alpha
        self.a = a

    def Sigmoid(self):
        y = np.exp(self.x) / (np.exp(self.x) + 1)
        y_grad = y*(1-y)
        return [y, y_grad]

    def Hard_Sigmoid(self):
        f = (2 * self.x + 5) / 10
        y = np.clip(f, 0, 1)  # clamp the linear segment to [0, 1]
        y_grad = np.where(f > 0, np.where(f >= 1, 0, 1 / 5), 0)
        return [y, y_grad]

    def Tanh(self):
        y = np.tanh(self.x)
        y_grad = 1 - y * y
        return [y, y_grad]

    def ReLU(self):
        y = np.where(self.x < 0, 0, self.x)
        y_grad = np.where(self.x < 0, 0, 1)
        return [y, y_grad]

    def ReLU6(self):
        y = np.clip(self.x, 0, 6)  # ReLU capped at 6
        y_grad = np.where(self.x > 6, 0, np.where(self.x < 0, 0, 1))
        return [y, y_grad]

    def LeakyReLU(self):   # requires a > 1 to be set; the negative-side slope is 1/a
        y = np.where(self.x < 0, self.x / self.a, self.x)
        y_grad = np.where(self.x < 0, 1 / self.a, 1)
        return [y, y_grad]

    def PReLU(self):    # requires a > 1 to be set; same forward form as LeakyReLU (in training, a would be learned)
        y = np.where(self.x < 0, self.x / self.a, self.x)
        y_grad = np.where(self.x < 0, 1 / self.a, 1)
        return [y, y_grad]

    def ELU(self): # alpha is a constant and must be set
        y = np.where(self.x > 0, self.x, self.alpha * (np.exp(self.x) - 1))
        y_grad = np.where(self.x > 0, 1, self.alpha * np.exp(self.x))
        return [y, y_grad]

    def SELU(self):  # lamb > 1; both lamb and alpha must be set
        y = np.where(self.x > 0, self.lamb * self.x, self.lamb * self.alpha * (np.exp(self.x) - 1))
        y_grad = np.where(self.x > 0, self.lamb * 1, self.lamb * self.alpha * np.exp(self.x))
        return [y, y_grad]

    def Swish(self): # b is a constant and must be set
        y = self.x * (np.exp(self.b*self.x) / (np.exp(self.b*self.x) + 1))
        y_grad = np.exp(self.b*self.x)/(1+np.exp(self.b*self.x)) + self.x * (self.b*np.exp(self.b*self.x) / ((1+np.exp(self.b*self.x))*(1+np.exp(self.b*self.x))))
        return [y, y_grad]

    def Hard_Swish(self):
        f = self.x + 3
        relu6 = np.clip(f, 0, 6)  # ReLU6 applied to x + 3
        relu6_grad = np.where(f > 6, 0, np.where(f < 0, 0, 1))
        y = self.x * relu6 / 6
        y_grad = relu6 / 6 + self.x * relu6_grad / 6
        return [y, y_grad]

    def Mish(self):
        f = 1 + np.exp(self.x)  # use the instance's x rather than the global x
        y = self.x * ((f*f-1) / (f*f+1))
        y_grad = (f*f-1) / (f*f+1) + self.x*(4*f*(f-1)) / ((f*f+1)*(f*f+1))
        return [y, y_grad]

def PlotActiFunc(x, y, title):
    plt.grid(which='minor', alpha=0.2)
    plt.grid(which='major', alpha=0.5)
    plt.plot(x, y)
    plt.title(title)
    plt.show()

def PlotMultiFunc(x, y):
    plt.grid(which='minor', alpha=0.2)
    plt.grid(which='major', alpha=0.5)
    plt.plot(x, y)

if __name__ == '__main__':
    x = np.arange(-10, 10, 0.01)
    activateFunc = ActivateFunc(x)
    activateFunc.a = 100
    activateFunc.b= 1
    activateFunc.alpha = 1.5
    activateFunc.lamb = 2

    plt.figure(1)
    PlotMultiFunc(x, activateFunc.Sigmoid()[0])
    PlotMultiFunc(x, activateFunc.Hard_Sigmoid()[0])
    PlotMultiFunc(x, activateFunc.Tanh()[0])
    PlotMultiFunc(x, activateFunc.ReLU()[0])
    PlotMultiFunc(x, activateFunc.ReLU6()[0])
    PlotMultiFunc(x, activateFunc.LeakyReLU()[0])
    PlotMultiFunc(x, activateFunc.ELU()[0])
    PlotMultiFunc(x, activateFunc.SELU()[0])
    PlotMultiFunc(x, activateFunc.Swish()[0])
    PlotMultiFunc(x, activateFunc.Hard_Swish()[0])
    PlotMultiFunc(x, activateFunc.Mish()[0])

    plt.legend(['Sigmoid', 'Hard_Sigmoid', 'Tanh', 'ReLU', 'ReLU6', 'LeakyReLU',
                'ELU', 'SELU', 'Swish', 'Hard_Swish', 'Mish'])
    plt.show()
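
One way to gain confidence in the hand-derived gradients above is a finite-difference check. The sketch below re-implements the Swish and Mish formulas from the listing as scalar functions using only the standard library; the helper `numeric_grad`, the step size, and the test points are assumptions introduced for this example.

```python
import math

def swish(x, b=1.0):
    e = math.exp(b * x)
    return x * e / (e + 1.0)

def swish_grad(x, b=1.0):
    e = math.exp(b * x)
    return e / (1.0 + e) + x * b * e / ((1.0 + e) ** 2)

def mish(x):
    f = 1.0 + math.exp(x)
    return x * (f * f - 1.0) / (f * f + 1.0)

def mish_grad(x):
    f = 1.0 + math.exp(x)
    return (f * f - 1.0) / (f * f + 1.0) + x * 4.0 * f * (f - 1.0) / ((f * f + 1.0) ** 2)

def numeric_grad(fn, x, h=1e-5):
    """Central-difference estimate of fn'(x) (hypothetical helper)."""
    return (fn(x + h) - fn(x - h)) / (2.0 * h)

points = [-3.0, -1.0, -0.1, 0.5, 2.0, 4.0]
max_err_swish = max(abs(swish_grad(x) - numeric_grad(swish, x)) for x in points)
max_err_mish = max(abs(mish_grad(x) - numeric_grad(mish, x)) for x in points)

print(max_err_swish, max_err_mish)
```

If an analytic gradient were wrong, the error would be on the order of the gradient itself rather than close to machine precision.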