Go standard library net/http vs. fasthttp: a scenario-based comparison of server-side performance
1. Background
When newcomers start learning Go, right after writing the classic "hello, world" program they may be eager to try out Go's powerful standard library — for example, by writing a fully functional web server like the one below in just a few lines of code:
// from https://tip.golang.org/pkg/net/http/#example_ListenAndServe
package main

import (
	"io"
	"log"
	"net/http"
)

func main() {
	helloHandler := func(w http.ResponseWriter, req *http.Request) {
		io.WriteString(w, "Hello, world!\n")
	}

	http.HandleFunc("/hello", helloHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
Go's net/http package is a fairly balanced, general-purpose implementation that meets the needs of well over 90% of the scenarios most gophers face, and it has the following advantages:
- it is a standard library package, so no third-party dependencies are required;
- it conforms well to the HTTP specification;
- it delivers relatively high performance out of the box, without any tuning;
- it supports HTTP proxies;
- it supports HTTPS;
- it supports HTTP/2 seamlessly.
However, precisely because net/http is such a "balanced" general-purpose implementation, it may not be fast enough in domains with strict performance requirements, and it offers little room for tuning. That is when we turn our attention to third-party HTTP server frameworks.
Among those third-party frameworks, fasthttp — a framework that lives up to its name — is the one most often mentioned and adopted. Its official site claims roughly ten times the performance of net/http (based on go test benchmark results).
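For readers unfamiliar with what "go test benchmark" refers to: it is Go's built-in microbenchmarking facility. Purely as an illustration of that facility (this is not the benchmark code behind fasthttp's 10x claim, which lives in the fasthttp repository; the names helloHandler and BenchmarkHelloHandler are made up here), a handler-level microbenchmark for the net/http side could look like this, saved as a _test.go file and run with go test -bench=. -benchmem:

// Illustrative only: a go test microbenchmark of an http.HandlerFunc using
// httptest. It is not fasthttp's official benchmark suite.
package hello

import (
	"io"
	"net/http"
	"net/http/httptest"
	"testing"
)

func helloHandler(w http.ResponseWriter, req *http.Request) {
	io.WriteString(w, "Hello, world!\n")
}

func BenchmarkHelloHandler(b *testing.B) {
	req := httptest.NewRequest(http.MethodGet, "/hello", nil)
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		rec := httptest.NewRecorder()
		helloHandler(rec, req)
	}
}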
fasthttp applies many performance-optimization best practices, especially around reusing memory objects: it makes heavy use of sync.Pool to reduce the pressure on Go's GC.
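To make that reuse pattern concrete, here is a minimal sketch (illustrative only, not fasthttp's actual code; bufPool and handle are made-up names): a handler borrows a buffer from a sync.Pool instead of allocating a fresh one per request, and returns it when done.

// A simplified illustration of the sync.Pool reuse pattern fasthttp relies on.
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers instead of allocating a new one per request.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func handle(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()      // must reset before reuse
		bufPool.Put(buf) // return the buffer to the pool
	}()
	fmt.Fprintf(buf, "Hello, %s!", name)
	return buf.String()
}

func main() {
	fmt.Println(handle("Go"))
}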
那么在真實(shí)環(huán)境中,到底fasthttp能比net/http快多少呢?恰好手里有兩臺(tái)性能還不錯(cuò)的服務(wù)器可用,在本文中我們就在這個(gè)真實(shí)環(huán)境下看看他們的實(shí)際性能。
2. Performance testing
We implement two almost "zero business logic" test targets, one with net/http and one with fasthttp:
- nethttp:
// github.com/bigwhite/experiments/blob/master/http-benchmark/nethttp/main.go
package main

import (
	_ "expvar"
	"log"
	"net/http"
	_ "net/http/pprof"
	"runtime"
	"time"
)

func main() {
	go func() {
		for {
			log.Println("current number of goroutines:", runtime.NumGoroutine())
			time.Sleep(time.Second)
		}
	}()

	http.Handle("/", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("Hello, Go!"))
	}))

	log.Fatal(http.ListenAndServe(":8080", nil))
}
- fasthttp:
// github.com/bigwhite/experiments/blob/master/http-benchmark/fasthttp/main.go
package main

import (
	_ "expvar"
	"fmt"
	"log"
	"net/http"
	_ "net/http/pprof"
	"runtime"
	"time"

	"github.com/valyala/fasthttp"
)

type HelloGoHandler struct {
}

func fastHTTPHandler(ctx *fasthttp.RequestCtx) {
	fmt.Fprintln(ctx, "Hello, Go!")
}

func main() {
	go func() {
		http.ListenAndServe(":6060", nil)
	}()

	go func() {
		for {
			log.Println("current number of goroutines:", runtime.NumGoroutine())
			time.Sleep(time.Second)
		}
	}()

	s := &fasthttp.Server{
		Handler: fastHTTPHandler,
	}

	s.ListenAndServe(":8081")
}
The client that applies load to the targets is built on the hey HTTP load-testing tool. To make it easy to adjust the load level, we wrap hey in the shell script below (it only runs on Linux):
# github.com/bigwhite/experiments/blob/master/http-benchmark/client/http_client_load.sh
# ./http_client_load.sh 3 10000 10 GET http://10.10.195.181:8080

echo "$0 task_num count_per_hey conn_per_hey method url"
task_num=$1
count_per_hey=$2
conn_per_hey=$3
method=$4
url=$5

start=$(date +%s%N)
for((i=1; i<=$task_num; i++)); do
{
    tm=$(date +%T.%N)
    echo "$tm: task $i start"
    hey -n $count_per_hey -c $conn_per_hey -m $method $url > hey_$i.log
    tm=$(date +%T.%N)
    echo "$tm: task $i done"
} &
done
wait
end=$(date +%s%N)

count=$(( $task_num * $count_per_hey ))
runtime_ns=$(( $end - $start ))
runtime=`echo "scale=2; $runtime_ns / 1000000000" | bc`
echo "runtime: "$runtime
speed=`echo "scale=2; $count / $runtime" | bc`
echo "speed: "$speed
An example run of the script looks like this:
bash http_client_load.sh 8 1000000 200 GET http://10.10.195.134:8080
http_client_load.sh task_num count_per_hey conn_per_hey method url
16:58:09.146948690: task 1 start
16:58:09.147235080: task 2 start
16:58:09.147290430: task 3 start
16:58:09.147740230: task 4 start
16:58:09.147896010: task 5 start
16:58:09.148314900: task 6 start
16:58:09.148446030: task 7 start
16:58:09.148930840: task 8 start
16:58:45.001080740: task 3 done
16:58:45.241903500: task 8 done
16:58:45.261501940: task 1 done
16:58:50.032383770: task 4 done
16:58:50.985076450: task 7 done
16:58:51.269099430: task 5 done
16:58:52.008164010: task 6 done
16:58:52.166402430: task 2 done
runtime: 43.02
speed: 185960.01
As the arguments show, the script starts 8 tasks in parallel (each task runs one hey); each task opens 200 concurrent connections to http://10.10.195.134:8080 and sends 1,000,000 HTTP GET requests in total.
We use two servers, one hosting the test targets and one hosting the load-generation script:
- server hosting the test targets: 10.10.195.181 (bare metal, Intel x86-64 CPU, 40 cores, 128GB RAM, CentOS 7.6)
$ cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
Stepping:              4
CPU MHz:               800.000
CPU max MHz:           2201.0000
CPU min MHz:           800.0000
BogoMIPS:              4400.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              14080K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke spec_ctrl intel_stibp flush_l1d
- server hosting the load tool: 10.10.195.133 (bare metal, Kunpeng arm64 CPU, 96 cores, 80GB RAM, CentOS 7.9)
# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (AltArch)

# lscpu
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    1
Core(s) per socket:    48
Socket(s):             2
NUMA node(s):          4
Model:                 0
CPU max MHz:           2600.0000
CPU min MHz:           200.0000
BogoMIPS:              200.00
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              49152K
NUMA node0 CPU(s):     0-23
NUMA node1 CPU(s):     24-47
NUMA node2 CPU(s):     48-71
NUMA node3 CPU(s):     72-95
Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
On the host running the test targets, I monitored resource usage (CPU load in particular) with dstat (dstat -tcdngym), watched memstats with expvarmon (memory usage stayed low since there is no business logic), and used go tool pprof to rank where the target programs spent their resources.
The table below was compiled from several test runs:
Figure: test data
3. A brief analysis of the results
Because of the specific scenario, the precision of the test tool and script, and the load-testing environment, the results above have their limitations, but they do reflect the real performance trend of the two targets. Under the same load, fasthttp does not deliver ten times the performance of net/http; in this particular scenario it does not even reach twice the performance: in the test cases where CPU usage on the target host approached 70%, fasthttp was only about 30% to 70% faster than net/http.
So why does fasthttp fall short of expectations? To answer that, we have to look at how net/http and fasthttp are each implemented. Let's start with a schematic of how net/http works:
Figure: schematic of how net/http works
As a server, the http package works in a very simple way: after it accepts a connection (conn), it hands the conn off to a worker goroutine, and that goroutine stays around until the conn's life cycle ends, i.e. until the connection is closed.
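Stripped of net/http's actual types and error handling, that "one goroutine per connection" model looks roughly like the sketch below (illustrative only; the port and the toy response are made up, and a real server does full HTTP parsing instead of scanning for a blank line):

// A stripped-down sketch of the "one goroutine per connection" model that
// net/http's server follows; this is not net/http's actual code.
package main

import (
	"bufio"
	"log"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		// One goroutine per connection; it lives until the connection closes.
		go func(c net.Conn) {
			defer c.Close()
			r := bufio.NewReader(c)
			for {
				// Read up to the blank line that ends the request headers
				// (a real server parses the full HTTP request here).
				line, err := r.ReadString('\n')
				if err != nil {
					return
				}
				if line != "\r\n" && line != "\n" {
					continue
				}
				c.Write([]byte("HTTP/1.1 200 OK\r\nContent-Length: 10\r\n\r\nHello, Go!"))
			}
		}(conn)
	}
}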
Below is a schematic of how fasthttp works:
Figure: schematic of how fasthttp works
fasthttp, by contrast, is built around a mechanism whose goal is to reuse goroutines as much as possible rather than create a new one each time. After fasthttp's Server accepts a conn, it tries to take a channel out of the ready slice of its workerpool; each such channel corresponds one-to-one to a worker goroutine. Once a channel has been taken, the accepted conn is written into it, and the worker goroutine at the other end handles all reads and writes on that conn. When it is finished with the conn, the worker goroutine does not exit; instead it puts its channel back into the workerpool's ready slice, waiting to be taken out again.
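The sketch below captures the shape of that mechanism. It is a simplification for illustration only, not fasthttp's real workerPool code: the pool, getCh, worker and Serve names are made up here, and idle-worker cleanup as well as the concurrency cap are omitted.

// A simplified channel-per-worker goroutine pool in the spirit of fasthttp's
// workerPool.
package main

import (
	"log"
	"net"
	"sync"
)

type pool struct {
	mu    sync.Mutex
	ready []chan net.Conn // each channel is owned by exactly one worker goroutine
	serve func(net.Conn)  // what a worker does with a connection
}

// getCh takes a worker's channel from the ready slice, starting a new worker
// only when no idle worker is available.
func (p *pool) getCh() chan net.Conn {
	p.mu.Lock()
	defer p.mu.Unlock()
	if n := len(p.ready); n > 0 {
		ch := p.ready[n-1]
		p.ready = p.ready[:n-1]
		return ch
	}
	ch := make(chan net.Conn, 1)
	go p.worker(ch) // a new worker; it will be reused for later connections
	return ch
}

func (p *pool) worker(ch chan net.Conn) {
	for c := range ch {
		p.serve(c) // handle everything on this conn until it is done
		c.Close()
		p.mu.Lock()
		p.ready = append(p.ready, ch) // put our channel back into the ready slice
		p.mu.Unlock()
	}
}

// Serve accepts connections and hands each one to a (possibly reused) worker.
func (p *pool) Serve(ln net.Listener) error {
	for {
		c, err := ln.Accept()
		if err != nil {
			return err
		}
		p.getCh() <- c
	}
}

func main() {
	ln, err := net.Listen("tcp", ":8082")
	if err != nil {
		log.Fatal(err)
	}
	p := &pool{serve: func(c net.Conn) {
		c.Write([]byte("HTTP/1.1 200 OK\r\nContent-Length: 10\r\n\r\nHello, Go!"))
	}}
	log.Fatal(p.Serve(ln))
}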
fasthttp's goroutine-reuse strategy is well intentioned, but in this test scenario its effect is limited, as the results show: under the same client concurrency and load, net/http uses roughly as many goroutines as fasthttp. This is a consequence of the test model: in our test, the hey in each task opens a fixed number of keep-alive (long-lived) connections to the target and then issues "saturating" requests on each of them. As a result, once a goroutine in fasthttp's workerpool picks up a conn, it can only go back into the pool after communication on that conn has finished, and the conn is not closed until the test ends. A scenario like this effectively makes fasthttp "degenerate" into net/http's model, and it inherits net/http's "flaw" as well: once the number of goroutines grows large, the overhead of the go runtime's own scheduling becomes non-negligible and can even exceed the share of resources spent on the business logic itself. Below are fasthttp's CPU profiles with 200, 8000 and 16000 long-lived connections:
200 long-lived connections:

(pprof) top -cum
Showing nodes accounting for 88.17s, 55.35% of 159.30s total
Dropped 150 nodes (cum <= 0.80s)
Showing top 10 nodes out of 60
      flat  flat%   sum%        cum   cum%
     0.46s  0.29%  0.29%    101.46s 63.69%  github.com/valyala/fasthttp.(*Server).serveConn
         0     0%  0.29%    101.46s 63.69%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
         0     0%  0.29%    101.46s 63.69%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.04s 0.025%  0.31%     89.46s 56.16%  internal/poll.ignoringEINTRIO (inline)
    87.38s 54.85% 55.17%     89.27s 56.04%  syscall.Syscall
     0.12s 0.075% 55.24%     60.39s 37.91%  bufio.(*Writer).Flush
         0     0% 55.24%     60.22s 37.80%  net.(*conn).Write
     0.08s  0.05% 55.29%     60.21s 37.80%  net.(*netFD).Write
     0.09s 0.056% 55.35%     60.12s 37.74%  internal/poll.(*FD).Write
         0     0% 55.35%     59.86s 37.58%  syscall.Write (inline)
(pprof)

8000 long-lived connections:

(pprof) top -cum
Showing nodes accounting for 108.51s, 54.46% of 199.23s total
Dropped 204 nodes (cum <= 1s)
Showing top 10 nodes out of 66
      flat  flat%   sum%        cum   cum%
         0     0%     0%    119.11s 59.79%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
         0     0%     0%    119.11s 59.79%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.69s  0.35%  0.35%    119.05s 59.76%  github.com/valyala/fasthttp.(*Server).serveConn
     0.04s  0.02%  0.37%    104.22s 52.31%  internal/poll.ignoringEINTRIO (inline)
   101.58s 50.99% 51.35%    103.95s 52.18%  syscall.Syscall
     0.10s  0.05% 51.40%     79.95s 40.13%  runtime.mcall
     0.06s  0.03% 51.43%     79.85s 40.08%  runtime.park_m
     0.23s  0.12% 51.55%     79.30s 39.80%  runtime.schedule
     5.67s  2.85% 54.39%     77.47s 38.88%  runtime.findrunnable
     0.14s  0.07% 54.46%     68.96s 34.61%  bufio.(*Writer).Flush

16000 long-lived connections:

(pprof) top -cum
Showing nodes accounting for 239.60s, 87.07% of 275.17s total
Dropped 190 nodes (cum <= 1.38s)
Showing top 10 nodes out of 46
      flat  flat%   sum%        cum   cum%
     0.04s 0.015% 0.015%    153.38s 55.74%  runtime.mcall
     0.01s 0.0036% 0.018%   153.34s 55.73%  runtime.park_m
     0.12s 0.044% 0.062%       153s 55.60%  runtime.schedule
     0.66s  0.24%   0.3%    152.66s 55.48%  runtime.findrunnable
     0.15s 0.055%  0.36%    127.53s 46.35%  runtime.netpoll
   127.04s 46.17% 46.52%    127.04s 46.17%  runtime.epollwait
         0     0% 46.52%       121s 43.97%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
         0     0% 46.52%       121s 43.97%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.41s  0.15% 46.67%    120.18s 43.67%  github.com/valyala/fasthttp.(*Server).serveConn
   111.17s 40.40% 87.07%    111.99s 40.70%  syscall.Syscall
(pprof)
Comparing these profiles, we can see that as the number of long-lived connections grows (i.e. as the number of goroutines in the workerpool grows), the share taken by go runtime scheduling rises steadily; at 16000 connections the runtime scheduling functions already occupy the top four spots.
4. Possible optimizations
The results above show that fasthttp's model is not well suited to scenarios where connections, once established, carry a continuous stream of "saturating" requests; it is better suited to short-lived connections, or to long-lived connections without sustained saturating load. Only in the latter kinds of scenarios can its goroutine-reuse model really pay off.
Yet even after "degenerating" into the net/http model, fasthttp still performs slightly better than net/http. Why? The gains come mainly from fasthttp's optimization tricks at the memory-allocation level, such as its heavy use of sync.Pool and its avoidance of conversions between []byte and string.
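As an illustration of the second trick, here is a hedged sketch of the kind of zero-copy []byte/string conversion such frameworks rely on. This is not fasthttp's code: the b2s and s2b names are made up, the version below requires Go 1.20+ (unsafe.String/unsafe.SliceData/unsafe.StringData), and it is only safe when the underlying bytes are never modified afterwards.

// Zero-copy conversions between []byte and string (read-only use only).
package main

import (
	"fmt"
	"unsafe"
)

// b2s converts a []byte to a string without copying the underlying bytes.
func b2s(b []byte) string {
	return unsafe.String(unsafe.SliceData(b), len(b))
}

// s2b converts a string to a []byte without copying; the result must be
// treated as read-only.
func s2b(s string) []byte {
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

func main() {
	b := []byte("Hello, Go!")
	fmt.Println(b2s(b))
	fmt.Println(len(s2b("Hello, Go!")))
}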
那么,在持續(xù)“飽和”請(qǐng)求的場(chǎng)景下,如何讓fasthttp workerpool中g(shù)oroutine的數(shù)量不會(huì)因conn的增多而線性增長(zhǎng)呢?fasthttp官方?jīng)]有給出答案,但一條可以考慮的路徑是使用os的多路復(fù)用(linux上的實(shí)現(xiàn)為epoll),即go runtime netpoll使用的那套機(jī)制。在多路復(fù)用的機(jī)制下,這樣可以讓每個(gè)workerpool中的goroutine處理同時(shí)處理多個(gè)連接,這樣我們可以根據(jù)業(yè)務(wù)規(guī)模選擇workerpool池的大小,而不是像目前這樣幾乎是任意增長(zhǎng)goroutine的數(shù)量。當(dāng)然,在用戶層面引入epoll也可能會(huì)帶來系統(tǒng)調(diào)用占比的增多以及響應(yīng)延遲增大等問題。至于該路徑是否可行,還是要看具體實(shí)現(xiàn)和測(cè)試結(jié)果。
Note: the Concurrency field of fasthttp.Server can be used to limit the number of goroutines in the workerpool that handle requests concurrently, but because each goroutine handles only one connection, setting Concurrency too low may cause fasthttp to refuse subsequent connections. That is why fasthttp's default Concurrency is:
const DefaultConcurrency = 256 * 1024
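For completeness, here is a minimal sketch of setting that field explicitly; the value below is purely illustrative, and as noted above, too small a cap will cause connections beyond it to be rejected.

// Explicitly capping fasthttp's worker concurrency (illustrative value).
package main

import (
	"fmt"
	"log"

	"github.com/valyala/fasthttp"
)

func main() {
	s := &fasthttp.Server{
		Handler: func(ctx *fasthttp.RequestCtx) {
			fmt.Fprint(ctx, "Hello, Go!")
		},
		// Illustrative only; the default is fasthttp.DefaultConcurrency (256 * 1024).
		Concurrency: 16 * 1024,
	}
	log.Fatal(s.ListenAndServe(":8081"))
}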
This concludes our look at the server-side performance of the Go standard library's net/http versus fasthttp.