Golang sync.Pool Source Code Analysis

Updated: 2023-05-29 11:48:56  Author: 胡大海

Pool is a collection of temporary objects that are kept for later use so they can be reused. This article walks through the sync.Pool source code in detail.

Practical Usage

Pool is a collection of temporary objects that are kept for later use so they can be reused. Its purpose is to relieve the GC pressure caused by frequent object creation. Many open-source projects use it, for example bolt and gin.

Below is a set of benchmarks comparing code with and without Pool, in both non-concurrent and concurrent scenarios:

package pool
import (
	"io/ioutil"
	"sync"
	"testing"
)
type Data [1024]byte
// Create the object directly
func BenchmarkWithoutPool(t *testing.B) {
	for i := 0; i < t.N; i++ {
		var data Data
		ioutil.Discard.Write(data[:])
	}
}
// Reuse objects through a Pool
func BenchmarkWithPool(t *testing.B) {
	pool := &sync.Pool{
		// If no object is available, New is called to create one
		New: func() interface{} {
			return &Data{}
		},
	}
	for i := 0; i < t.N; i++ {
		// Get
		data := pool.Get().(*Data)
		// Use
		ioutil.Discard.Write(data[:])
		// Put
		pool.Put(data)
	}
}
// Create objects directly, concurrently
func BenchmarkWithoutPoolConncurrency(t *testing.B) {
	t.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			var data Data
			ioutil.Discard.Write(data[:])
		}
	})
}
// Reuse objects through a Pool, concurrently
func BenchmarkWithPoolConncurrency(t *testing.B) {
	pool := &sync.Pool{
		// If no object is available, New is called to create one
		New: func() interface{} {
			return &Data{}
		},
	}
	t.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			// Get
			data := pool.Get().(*Data)
			// Use
			ioutil.Discard.Write(data[:])
			// Put
			pool.Put(data)
		}
	})
}

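The results are shown below. With or without concurrency, sync.Pool clearly outperforms direct allocation in both speed and memory: it is about 10x faster in the single-goroutine case (14.41 vs 148.1 ns/op) and about 36x faster in the parallel case (4.245 vs 153.3 ns/op), with 0 allocations per operation instead of 1.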

goos: darwin
goarch: amd64
pkg: leetcode/pool
cpu: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
BenchmarkWithoutPool-8                   7346660               148.1 ns/op          1024 B/op          1 allocs/op
BenchmarkWithPool-8                     80391398                14.41 ns/op            0 B/op          0 allocs/op
BenchmarkWithoutPoolConncurrency-8       7893248               153.3 ns/op          1024 B/op          1 allocs/op
BenchmarkWithPoolConncurrency-8         363329767                4.245 ns/op           0 B/op          0 allocs/op
PASS
ok      leetcode/pool   6.590s
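
The benchmark above reuses a fixed-size array and never needs to clear it. For objects that carry state, the usual pattern is to reset the object before reusing it. The following is a minimal sketch of that pattern, assuming a pooled *bytes.Buffer; the names bufPool and render are hypothetical and exist only for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out *bytes.Buffer values; the name is illustrative.
var bufPool = sync.Pool{
	// New is called when Get finds no reusable object.
	New: func() any { return new(bytes.Buffer) },
}

// render is a hypothetical helper that builds a greeting in a pooled buffer.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // clear whatever a previous user left behind
	defer bufPool.Put(buf) // hand the buffer back when done

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies the bytes, so the result outlives the buffer
}

func main() {
	fmt.Println(render("gopher"))
	fmt.Println(render("pool"))
}
```

Resetting on Get (rather than on Put) means a buffer obtained from the pool is always clean, whether it came from New or from an earlier Put.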

Implementation

The basic structure of Pool is as follows:

type Pool struct {
	noCopy noCopy
	local     unsafe.Pointer // local fixed-size per-P pool, actual type is [P]poolLocal
	localSize uintptr        // size of the local array
	victim     unsafe.Pointer // local from previous cycle
	victimSize uintptr        // size of victims array
	// New optionally specifies a function to generate
	// a value when Get would otherwise return nil.
	// It may not be changed concurrently with calls to Get.
	New func() any
}

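The most important field is local, which points to an array with one element per P; each P's id is the index of its element. To make efficient use of multiple CPU cores, each element is padded with pad; the details are covered in the CacheLine section below.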

// Local per-P Pool appendix.
type poolLocalInternal struct {
	private any       // Can be used only by the respective P.
	shared  poolChain // Local P can pushHead/popHead; any P can popTail.
}
type poolLocal struct {
	poolLocalInternal
	// Prevents false sharing on widespread platforms with
	// 128 mod (cache line size) = 0 .
	pad [128 - unsafe.Sizeof(poolLocalInternal{})%128]byte
}
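
The pad expression rounds the struct size up to a multiple of 128 bytes. The sketch below checks that arithmetic with a stand-in type. The types inner and padded are hypothetical mirrors written for this example, not the runtime's own types, and the concrete sizes depend on the platform's pointer width.

```go
package main

import (
	"fmt"
	"unsafe"
)

// inner mirrors the shape of poolLocalInternal: one interface value plus
// two pointers (standing in for poolChain's head and tail).
type inner struct {
	private    any
	head, tail *int
}

// padded appends enough bytes to round the struct up to a multiple of 128.
type padded struct {
	inner
	pad [128 - unsafe.Sizeof(inner{})%128]byte
}

// sizes reports the size of the unpadded and the padded struct.
func sizes() (innerSize, paddedSize uintptr) {
	return unsafe.Sizeof(inner{}), unsafe.Sizeof(padded{})
}

func main() {
	in, pd := sizes()
	fmt.Println("inner:", in, "padded:", pd, "multiple of 128:", pd%128 == 0)

	// Adjacent slice elements are one padded struct apart.
	locals := make([]padded, 4)
	stride := uintptr(unsafe.Pointer(&locals[1])) - uintptr(unsafe.Pointer(&locals[0]))
	fmt.Println("stride between elements:", stride)
}
```

Because each element is a whole multiple of 128 bytes, two neighboring elements are never closer than 128 bytes, so on hardware with 64-byte or 128-byte cache lines they do not share a line.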

CacheLine

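The CPU cache copies data from memory in units of a cache line, so adjacent data may end up in the same cache line. If that data is used by several cores, the hardware must spend considerable effort keeping each core's cached copy coherent. When one core modifies data in a cache line, the copies of that line held by other cores are invalidated, and threads on those cores must reload the line before they can access it, even if they touch a different variable in it. This effect is known as false sharing.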

Below is a set of benchmarks that atomically update an object's Age field under concurrency:

import (
	"sync/atomic"
	"testing"
	"unsafe"
)
type StudentWithCacheLine struct {
	Age uint32
	_   [128 - unsafe.Sizeof(uint32(0))%128]byte
}
// Concurrently update Age with padding
func BenchmarkWithCacheLine(b *testing.B) {
	count := 10
	students := make([]StudentWithCacheLine, count)
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			for j := 0; j < count; j++ {
				atomic.AddUint32(&students[j].Age, 1)
			}
		}
	})
}
type StudentWithoutCacheLine struct {
	Age uint32
}
// Concurrently update Age without padding
func BenchmarkWithoutCacheLine(b *testing.B) {
	count := 10
	students := make([]StudentWithoutCacheLine, count)
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			for j := 0; j < count; j++ {
				atomic.AddUint32(&students[j].Age, 1)
			}
		}
	})
}

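StudentWithCacheLine is padded so that each Age in the slice sits on a different cache line. StudentWithoutCacheLine leaves Age as it is, so the ten 4-byte Age fields occupy 40 contiguous bytes and share a cache line (two at most). The results below show that atomic operations on the padded Age are roughly twice as fast as on the unpadded one (70.18 vs 133.8 ns/op).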

goos: darwin
goarch: amd64
pkg: leetcode/pool
cpu: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
BenchmarkWithCacheLine-8        17277380                70.18 ns/op            0 B/op          0 allocs/op
BenchmarkWithoutCacheLine-8      8916874               133.8 ns/op             0 B/op          0 allocs/op

Producer-Consumer Model

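Pool's performance comes not only from cache-line padding, which avoids contention between cores, but also from a producer-consumer model built on the GMP scheduling model, which reduces contention further: each P has its own poolLocalInternal. Taking the more complex Get path as an example, a value is fetched as follows: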

// Get selects an arbitrary item from the Pool, removes it from the
// Pool, and returns it to the caller.
// Get may choose to ignore the pool and treat it as empty.
// Callers should not assume any relation between values passed to Put and
// the values returned by Get.
//
// If Get would otherwise return nil and p.New is non-nil, Get returns
// the result of calling p.New.
func (p *Pool) Get() any {
	if race.Enabled {
		race.Disable()
	}
	// 1. Find the poolLocalInternal of the P the current goroutine is running on, along with that P's id
	l, pid := p.pin()
	x := l.private
	l.private = nil
	// 2. If private is empty, fall back to shared
	if x == nil {
		// Try to pop the head of the local shard. We prefer
		// the head over the tail for temporal locality of
		// reuse.
		// 3. Try to pop a value from the head of shared
		x, _ = l.shared.popHead()
		// 4. If still empty, try another P's poolLocalInternal or the victim cache
		if x == nil {
			x = p.getSlow(pid)
		}
	}
	runtime_procUnpin()
	if race.Enabled {
		race.Enable()
		if x != nil {
			race.Acquire(poolRaceAddr(x))
		}
	}
	// 5. If still empty, create a new value with New
	if x == nil && p.New != nil {
		x = p.New()
	}
	return x
}

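First look in the poolLocalInternal of the P the current goroutine is running on: private first, then shared.
Check whether private holds a value. Access is single-threaded from each P's point of view, so reading private needs no lock, only a simple nil check. If there is a value, return it; if it is empty, move on to shared.
Check whether shared holds a value. shared is a doubly-linked list whose nodes are ring buffers (circular arrays); each newly added ring buffer is twice the size of the previous one.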

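A goroutine is both a consumer that takes values from its current P and a producer that stores values on its current P. The methods used in these two cases are popHead for taking a value and pushHead for storing one. Both operate on the head.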

// poolChain is a dynamically-sized version of poolDequeue.
//
// This is implemented as a doubly-linked list queue of poolDequeues
// where each dequeue is double the size of the previous one. Once a
// dequeue fills up, this allocates a new one and only ever pushes to
// the latest dequeue. Pops happen from the other end of the list and
// once a dequeue is exhausted, it gets removed from the list.
type poolChain struct {
	// head is the poolDequeue to push to. This is only accessed
	// by the producer, so doesn't need to be synchronized.
	head *poolChainElt
	// tail is the poolDequeue to popTail from. This is accessed
	// by consumers, so reads and writes must be atomic.
	tail *poolChainElt
}
type poolChainElt struct {
	poolDequeue
	// next and prev link to the adjacent poolChainElts in this
	// poolChain.
	//
	// next is written atomically by the producer and read
	// atomically by the consumer. It only transitions from nil to
	// non-nil.
	//
	// prev is written atomically by the consumer and read
	// atomically by the producer. It only transitions from
	// non-nil to nil.
	next, prev *poolChainElt
}
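
The doubling behavior described in the comment above can be sketched with a simplified, single-goroutine model. This is an illustration under stated assumptions, not the runtime's implementation: the real poolDequeue packs head and tail into one atomically updated word, while the hypothetical ring and chain types here use plain integers, and initSize is an assumed starting capacity.

```go
package main

import "fmt"

// ring is a fixed-size buffer, a single-goroutine stand-in for poolDequeue.
// head and tail only ever grow; the slot index is the counter modulo len(vals).
type ring struct {
	vals       []any
	head, tail int
	next, prev *ring // next points toward the chain head, prev toward the tail
}

func (r *ring) pushHead(v any) bool {
	if r.head-r.tail == len(r.vals) {
		return false // full
	}
	r.vals[r.head%len(r.vals)] = v
	r.head++
	return true
}

func (r *ring) popHead() (any, bool) {
	if r.head == r.tail {
		return nil, false
	}
	r.head--
	return r.take(r.head), true
}

func (r *ring) popTail() (any, bool) {
	if r.head == r.tail {
		return nil, false
	}
	r.tail++
	return r.take(r.tail - 1), true
}

// take removes and returns the value stored for counter position pos.
func (r *ring) take(pos int) any {
	i := pos % len(r.vals)
	v := r.vals[i]
	r.vals[i] = nil
	return v
}

// chain links rings; each new ring is twice as large as the previous one.
type chain struct{ head, tail *ring }

const initSize = 8 // assumed starting capacity for this sketch

func (c *chain) pushHead(v any) {
	if c.head == nil {
		c.head = &ring{vals: make([]any, initSize)}
		c.tail = c.head
	}
	if c.head.pushHead(v) {
		return
	}
	// The head ring is full: link a ring of double the size and push there.
	bigger := &ring{vals: make([]any, 2*len(c.head.vals)), prev: c.head}
	c.head.next = bigger
	c.head = bigger
	bigger.pushHead(v)
}

func (c *chain) popHead() (any, bool) {
	for r := c.head; r != nil; r = r.prev {
		if v, ok := r.popHead(); ok {
			return v, true
		}
	}
	return nil, false
}

func (c *chain) popTail() (any, bool) {
	for r := c.tail; r != nil; r = r.next {
		if v, ok := r.popTail(); ok {
			return v, true
		}
		if r.next == nil {
			break // the only ring left is empty
		}
		c.tail = r.next // drop the exhausted ring from the list
		r.next.prev = nil
	}
	return nil, false
}

func main() {
	var c chain
	for i := 0; i < 10; i++ {
		c.pushHead(i)
	}
	h, _ := c.popHead()
	t, _ := c.popTail()
	fmt.Println(h, t, len(c.head.vals)) // prints: 9 0 16
}
```

Pushing ten values fills the first ring of 8 and spills into a second ring of 16. popHead returns the newest value and popTail the oldest, which is why the owner of the list and a goroutine stealing from it rarely touch the same slot.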

If both private and shared are empty, Get tries to take a value from another P's poolLocalInternal.

Here the goroutine acts purely as a consumer and uses popTail, which takes data from the tail.

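If the other poolLocalInternal values are empty as well, Get falls back to victim, which holds the data left over from before the last GC cycle.
If nothing is found anywhere, the only option is to create a new value with New.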
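
The lookup order described above can be summarized with a toy, single-goroutine model. It is a sketch under stated assumptions rather than the runtime's code: toyPool, local, put, get, and gc are hypothetical names, shared is a plain slice instead of a lock-free chain, the P id is passed in explicitly, and gc stands in for the cleanup that runs at the start of a garbage collection.

```go
package main

import "fmt"

// local is a toy stand-in for poolLocalInternal. shared is a plain slice:
// index 0 plays the tail and the last element plays the head.
type local struct {
	private any
	shared  []any
}

func (l *local) popHead() any {
	if len(l.shared) == 0 {
		return nil
	}
	v := l.shared[len(l.shared)-1]
	l.shared = l.shared[:len(l.shared)-1]
	return v
}

func (l *local) popTail() any {
	if len(l.shared) == 0 {
		return nil
	}
	v := l.shared[0]
	l.shared = l.shared[1:]
	return v
}

// toyPool keeps one local per P, plus the previous cycle's locals as victim.
type toyPool struct {
	locals, victim []local
	New            func() any
}

func newToyPool(numP int, newFn func() any) *toyPool {
	return &toyPool{locals: make([]local, numP), victim: make([]local, numP), New: newFn}
}

// put stores v for the given P: private first, then the head of shared.
func (p *toyPool) put(pid int, v any) {
	l := &p.locals[pid]
	if l.private == nil {
		l.private = v
		return
	}
	l.shared = append(l.shared, v)
}

// get returns a value and a label naming where it was found.
func (p *toyPool) get(pid int) (any, string) {
	l := &p.locals[pid]
	if v := l.private; v != nil {
		l.private = nil
		return v, "private"
	}
	if v := l.popHead(); v != nil {
		return v, "own shared head"
	}
	n := len(p.locals)
	for i := 1; i < n; i++ { // steal from the tail of the other Ps
		if v := p.locals[(pid+i)%n].popTail(); v != nil {
			return v, "other P's shared tail"
		}
	}
	if v := p.victim[pid].private; v != nil {
		p.victim[pid].private = nil
		return v, "victim"
	}
	for i := 0; i < n; i++ {
		if v := p.victim[(pid+i)%n].popTail(); v != nil {
			return v, "victim"
		}
	}
	if p.New != nil {
		return p.New(), "New"
	}
	return nil, "empty"
}

// gc models one GC cycle: current locals become the victim, the old victim is dropped.
func (p *toyPool) gc() {
	p.victim = p.locals
	p.locals = make([]local, len(p.victim))
}

func main() {
	p := newToyPool(2, func() any { return "fresh" })
	p.put(0, "a")
	p.put(0, "b")
	p.put(0, "c")
	fmt.Println(p.get(1)) // b other P's shared tail
	fmt.Println(p.get(0)) // a private
	fmt.Println(p.get(0)) // c own shared head
	p.put(0, "d")
	p.gc()
	fmt.Println(p.get(0)) // d victim
	p.gc()
	fmt.Println(p.get(0)) // fresh New
}
```

In this model a value survives one gc call in victim and is dropped by the second, which is the sense in which victim holds data carried across a GC.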

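This model reduces data contention and keeps CAS operations efficient. The goroutines on a single P run one at a time, so there is no contention among them. When a goroutine takes a value, it first takes from the head of the list on its own P. Only when that list is empty does it try to take from the tail of another P's list. In other words, contention can only arise when one goroutine pops from or pushes to the head of a list while another goroutine pops from its tail, and such conflicts are unlikely.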