快捷導(dǎo)航

Python?虛擬機(jī)集合set實(shí)現(xiàn)原理及源碼解析

更新時(shí)間：2023年03月21日 08:55:38 作者：一無(wú)是處的研究僧

這篇文章主要為大家介紹了Python?虛擬機(jī)集合set實(shí)現(xiàn)原理及源碼解析，有需要的朋友可以借鑒參考下，希望能夠有所幫助，祝大家多多進(jìn)步，早日升職加薪

深入理解 Python 虛擬機(jī)：集合（set）的實(shí)現(xiàn)原理及源碼剖析

在本篇文章當(dāng)中主要給大家介紹在 cpython 虛擬機(jī)當(dāng)中的集合 set 的實(shí)現(xiàn)原理（哈希表）以及對(duì)應(yīng)的源代碼分析。

數(shù)據(jù)結(jié)構(gòu)介紹

typedef struct {
    PyObject_HEAD
    Py_ssize_t fill;            /* Number active and dummy entries*/
    Py_ssize_t used;            /* Number active entries */
    /* The table contains mask + 1 slots, and that's a power of 2.
     * We store the mask instead of the size because the mask is more
     * frequently needed.
     */
    Py_ssize_t mask;
    /* The table points to a fixed-size smalltable for small tables
     * or to additional malloc'ed memory for bigger tables.
     * The table pointer is never NULL which saves us from repeated
     * runtime null-tests.
     */
    setentry *table;
    Py_hash_t hash;             /* Only used by frozenset objects */
    Py_ssize_t finger;          /* Search finger for pop() */
    setentry smalltable[PySet_MINSIZE]; // #define PySet_MINSIZE 8
    PyObject *weakreflist;      /* List of weak references */
} PySetObject;
typedef struct {
    PyObject *key;
    Py_hash_t hash;             /* Cached hash code of the key */
} setentry;
static PyObject _dummy_struct;
#define dummy (&amp;_dummy_struct)

上面的數(shù)據(jù)結(jié)果用圖示如下圖所示：

上面各個(gè)字段的含義如下所示：

dummy entries ：如果在哈希表當(dāng)中的數(shù)組原來(lái)有一個(gè)數(shù)據(jù)，如果我們刪除這個(gè) entry 的時(shí)候，對(duì)應(yīng)的位置就會(huì)被賦值成 dummy，與 dummy 有關(guān)的定義在上面的代碼當(dāng)中已經(jīng)給出，dummy 對(duì)象的哈希值等于 -1。
明白 dummy 的含義之后，fill 和 used 這兩個(gè)字段的含義就比較容易理解了，used 就是數(shù)組當(dāng)中真實(shí)有效的對(duì)象的個(gè)數(shù)，fill 還需要加上 dummy 對(duì)象的個(gè)數(shù)。
mask，數(shù)組的長(zhǎng)度等于 2n2^n2n，mask 的值等于 2n−12^n - 12n−1 。
table，實(shí)際保存 entry 對(duì)象的數(shù)組。
hash，這個(gè)值對(duì) frozenset 有用，保存計(jì)算出來(lái)的哈希值。如果你的數(shù)組很大的話，計(jì)算哈希值其實(shí)也是一個(gè)比較大的開(kāi)銷，因此可以將計(jì)算出來(lái)的哈希值保存下來(lái)，以便下一次求的時(shí)候可以將哈希值直接返回，這也印證了在 python 當(dāng)中為什么只有 immutable 對(duì)象才能夠放入到集合和字典當(dāng)中，因?yàn)楣Ｖ涤?jì)算一次保存下來(lái)了，如果再加入對(duì)象對(duì)象的哈希值也會(huì)變化，這樣做就會(huì)發(fā)生錯(cuò)誤了。
finger，主要是用于記錄下一個(gè)開(kāi)始尋找被刪除對(duì)象的下標(biāo)。
smalltable，默認(rèn)的小數(shù)組，cpython 設(shè)置的一半的集合對(duì)象不會(huì)超過(guò)這個(gè)大?。?），因此在申請(qǐng)一個(gè)集合對(duì)象的時(shí)候直接就申請(qǐng)了這個(gè)小數(shù)組的內(nèi)存大小。
weakrelist，這個(gè)字段主要和垃圾回收有關(guān)，這里暫時(shí)不進(jìn)行詳細(xì)說(shuō)明。

創(chuàng)建集合對(duì)象

首先先了解一下創(chuàng)建一個(gè)集合對(duì)象的過(guò)程，和前面其他的對(duì)象是一樣的，首先先申請(qǐng)內(nèi)存空間，然后進(jìn)行相關(guān)的初始化操作。

這個(gè)函數(shù)有兩個(gè)參數(shù)，使用第一個(gè)參數(shù)申請(qǐng)內(nèi)存空間，然后后面一個(gè)參數(shù)如果不為 NULL 而且是一個(gè)可迭代對(duì)象的話，就將這里面的對(duì)象加入到集合當(dāng)中。

static PyObject *
make_new_set(PyTypeObject *type, PyObject *iterable)
{
    PySetObject *so = NULL;
    /* create PySetObject structure */
    so = (PySetObject *)type->tp_alloc(type, 0);
    if (so == NULL)
        return NULL;
    // 集合當(dāng)中目前沒(méi)有任何對(duì)象，因此 fill 和 used 都是 0
    so->fill = 0;
    so->used = 0;
    // 初始化哈希表當(dāng)中的數(shù)組長(zhǎng)度為 PySet_MINSIZE 因此 mask = PySet_MINSIZE - 1
    so->mask = PySet_MINSIZE - 1;
    // 讓 table 指向存儲(chǔ) entry 的數(shù)組
    so->table = so->smalltable;
    // 將哈希值設(shè)置成 -1 表示還沒(méi)有進(jìn)行計(jì)算
    so->hash = -1;
    so->finger = 0;
    so->weakreflist = NULL;
    // 如果 iterable 不等于 NULL 則需要將它指向的對(duì)象當(dāng)中所有的元素加入到集合當(dāng)中
    if (iterable != NULL) {
        // 調(diào)用函數(shù) set_update_internal 將對(duì)象 iterable 當(dāng)中的元素加入到集合當(dāng)中
        if (set_update_internal(so, iterable)) {
            Py_DECREF(so);
            return NULL;
        }
    }
    return (PyObject *)so;
}

往集合當(dāng)中加入數(shù)據(jù)

首先我們先大致理清楚往集合當(dāng)中插入數(shù)據(jù)的流程：

首先根據(jù)對(duì)象的哈希值，計(jì)算需要將對(duì)象放在哪個(gè)位置，也就是對(duì)應(yīng)數(shù)組的下標(biāo)。
查看對(duì)應(yīng)下標(biāo)的位置是否存在對(duì)象，如果不存在對(duì)象則將數(shù)據(jù)保存在對(duì)應(yīng)下標(biāo)的位置。
如果對(duì)應(yīng)的位置存在對(duì)象，則查看是否和當(dāng)前要插入的對(duì)象相等，則返回。
如果不相等，則使用類似于線性探測(cè)的方式去尋找下一個(gè)要插入的位置（具體的實(shí)現(xiàn)可以查看相關(guān)代碼，具體的操作為線性探測(cè)法 + 開(kāi)放地址法）。

static PyObject *
set_add(PySetObject *so, PyObject *key)
{
    if (set_add_key(so, key))
        return NULL;
    Py_RETURN_NONE;
}
static int
set_add_key(PySetObject *so, PyObject *key)
{
    setentry entry;
    Py_hash_t hash;
    // 這里就查看一下是否是字符串，如果是字符串直接拿到哈希值
    if (!PyUnicode_CheckExact(key) ||
        (hash = ((PyASCIIObject *) key)->hash) == -1) {
      	// 如果不是字符串則需要調(diào)用對(duì)象自己的哈希函數(shù)求得對(duì)應(yīng)的哈希值
        hash = PyObject_Hash(key);
        if (hash == -1)
            return -1;
    }
    // 創(chuàng)建一個(gè) entry 對(duì)象將這個(gè)對(duì)象加入到哈希表當(dāng)中
    entry.key = key;
    entry.hash = hash;
    return set_add_entry(so, &entry);
}
static int
set_add_entry(PySetObject *so, setentry *entry)
{
    Py_ssize_t n_used;
    PyObject *key = entry->key;
    Py_hash_t hash = entry->hash;
    assert(so->fill <= so->mask);  /* at least one empty slot */
    n_used = so->used;
    Py_INCREF(key);
    // 調(diào)用函數(shù) set_insert_key 將對(duì)象插入到數(shù)組當(dāng)中
    if (set_insert_key(so, key, hash)) {
        Py_DECREF(key);
        return -1;
    }
    // 這里就是哈希表的核心的擴(kuò)容機(jī)制
    if (!(so->used > n_used && so->fill*3 >= (so->mask+1)*2))
        return 0;
    // 這是擴(kuò)容大小的邏輯
    return set_table_resize(so, so->used>50000 ? so->used*2 : so->used*4);
}
static int
set_insert_key(PySetObject *so, PyObject *key, Py_hash_t hash)
{
    setentry *entry;
    // set_lookkey 這個(gè)函數(shù)便是插入的核心的邏輯的實(shí)現(xiàn)對(duì)應(yīng)的實(shí)現(xiàn)函數(shù)在下方
    entry = set_lookkey(so, key, hash);
    if (entry == NULL)
        return -1;
    if (entry->key == NULL) {
        /* UNUSED */
        entry->key = key;
        entry->hash = hash;
        so->fill++;
        so->used++;
    } else if (entry->key == dummy) {
        /* DUMMY */
        entry->key = key;
        entry->hash = hash;
        so->used++;
    } else {
        /* ACTIVE */
        Py_DECREF(key);
    }
    return 0;
}
// 下面的代碼就是在執(zhí)行我們?cè)谇懊嫠劦降倪壿?，直到找到相同?key 或者空位置才退出 while 循環(huán)
static setentry *
set_lookkey(PySetObject *so, PyObject *key, Py_hash_t hash)
{
    setentry *table = so->table;
    setentry *freeslot = NULL;
    setentry *entry;
    size_t perturb = hash;
    size_t mask = so->mask;
    size_t i = (size_t)hash & mask; /* Unsigned for defined overflow behavior */
    size_t j;
    int cmp;
    entry = &table[i];
    if (entry->key == NULL)
        return entry;
    while (1) {
        if (entry->hash == hash) {
            PyObject *startkey = entry->key;
            /* startkey cannot be a dummy because the dummy hash field is -1 */
            assert(startkey != dummy);
            if (startkey == key)
                return entry;
            if (PyUnicode_CheckExact(startkey)
                && PyUnicode_CheckExact(key)
                && unicode_eq(startkey, key))
                return entry;
            Py_INCREF(startkey);
            // returning -1 for error, 0 for false, 1 for true
            cmp = PyObject_RichCompareBool(startkey, key, Py_EQ);
            Py_DECREF(startkey);
            if (cmp < 0)                                          /* unlikely */
                return NULL;
            if (table != so->table || entry->key != startkey)     /* unlikely */
                return set_lookkey(so, key, hash);
            if (cmp > 0)                                          /* likely */
                return entry;
            mask = so->mask;                 /* help avoid a register spill */
        }
        if (entry->hash == -1 && freeslot == NULL)
            freeslot = entry;
        if (i + LINEAR_PROBES <= mask) {
            for (j = 0 ; j < LINEAR_PROBES ; j++) {
                entry++;
                if (entry->key == NULL)
                    goto found_null;
                if (entry->hash == hash) {
                    PyObject *startkey = entry->key;
                    assert(startkey != dummy);
                    if (startkey == key)
                        return entry;
                    if (PyUnicode_CheckExact(startkey)
                        && PyUnicode_CheckExact(key)
                        && unicode_eq(startkey, key))
                        return entry;
                    Py_INCREF(startkey);
                    // returning -1 for error, 0 for false, 1 for true
                    cmp = PyObject_RichCompareBool(startkey, key, Py_EQ);
                    Py_DECREF(startkey);
                    if (cmp < 0)
                        return NULL;
                    if (table != so->table || entry->key != startkey)
                        return set_lookkey(so, key, hash);
                    if (cmp > 0)
                        return entry;
                    mask = so->mask;
                }
                if (entry->hash == -1 && freeslot == NULL)
                    freeslot = entry;
            }
        }
        perturb >>= PERTURB_SHIFT; // #define PERTURB_SHIFT 5
        i = (i * 5 + 1 + perturb) & mask;
        entry = &table[i];
        if (entry->key == NULL)
            goto found_null;
    }
  found_null:
    return freeslot == NULL ? entry : freeslot;
}

哈希表數(shù)組擴(kuò)容

在 cpython 當(dāng)中對(duì)于給哈希表數(shù)組擴(kuò)容的操作，很多情況下都是用下面這行代碼，從下面的代碼來(lái)看對(duì)應(yīng)擴(kuò)容后數(shù)組的大小并不簡(jiǎn)單，當(dāng)你的哈希表當(dāng)中的元素個(gè)數(shù)大于 50000 時(shí)，新數(shù)組的大小是原數(shù)組的兩倍，而如果你哈希表當(dāng)中的元素個(gè)數(shù)小于等于 50000，那么久擴(kuò)大為原來(lái)長(zhǎng)度的四倍，這個(gè)主要是怕后面如果繼續(xù)擴(kuò)大四倍的話，可能會(huì)浪費(fèi)很多內(nèi)存空間。

set_table_resize(so, so-&gt;used&gt;50000 ? so-&gt;used*2 : so-&gt;used*4);

首先需要了解一下擴(kuò)容機(jī)制，當(dāng)哈希表需要擴(kuò)容的時(shí)候，主要有以下兩個(gè)步驟：

創(chuàng)建新的數(shù)組，用于存儲(chǔ)哈希表的鍵。
遍歷原來(lái)的哈希表，將原來(lái)哈希表當(dāng)中的數(shù)據(jù)加入到新的申請(qǐng)的數(shù)組當(dāng)中。

這里需要注意的是因?yàn)閿?shù)組的長(zhǎng)度發(fā)生了變化，但是 key 的哈希值卻沒(méi)有發(fā)生變化，因此在新的數(shù)組當(dāng)中數(shù)據(jù)對(duì)應(yīng)的下標(biāo)位置也會(huì)發(fā)生變化，因此需重新將所有的對(duì)象重新進(jìn)行一次插入操作，下面的整個(gè)操作相對(duì)來(lái)說(shuō)比較簡(jiǎn)單，這里不再進(jìn)行說(shuō)明了。

static int
set_table_resize(PySetObject *so, Py_ssize_t minused)
{
    Py_ssize_t newsize;
    setentry *oldtable, *newtable, *entry;
    Py_ssize_t oldfill = so->fill;
    Py_ssize_t oldused = so->used;
    int is_oldtable_malloced;
    setentry small_copy[PySet_MINSIZE];
    assert(minused >= 0);
    /* Find the smallest table size > minused. */
    /* XXX speed-up with intrinsics */
    for (newsize = PySet_MINSIZE;
         newsize <= minused && newsize > 0;
         newsize <<= 1)
        ;
    if (newsize <= 0) {
        PyErr_NoMemory();
        return -1;
    }
    /* Get space for a new table. */
    oldtable = so->table;
    assert(oldtable != NULL);
    is_oldtable_malloced = oldtable != so->smalltable;
    if (newsize == PySet_MINSIZE) {
        /* A large table is shrinking, or we can't get any smaller. */
        newtable = so->smalltable;
        if (newtable == oldtable) {
            if (so->fill == so->used) {
                /* No dummies, so no point doing anything. */
                return 0;
            }
            /* We're not going to resize it, but rebuild the
               table anyway to purge old dummy entries.
               Subtle:  This is *necessary* if fill==size,
               as set_lookkey needs at least one virgin slot to
               terminate failing searches.  If fill < size, it's
               merely desirable, as dummies slow searches. */
            assert(so->fill > so->used);
            memcpy(small_copy, oldtable, sizeof(small_copy));
            oldtable = small_copy;
        }
    }
    else {
        newtable = PyMem_NEW(setentry, newsize);
        if (newtable == NULL) {
            PyErr_NoMemory();
            return -1;
        }
    }
    /* Make the set empty, using the new table. */
    assert(newtable != oldtable);
    memset(newtable, 0, sizeof(setentry) * newsize);
    so->fill = 0;
    so->used = 0;
    so->mask = newsize - 1;
    so->table = newtable;
    /* Copy the data over; this is refcount-neutral for active entries;
       dummy entries aren't copied over, of course */
    if (oldfill == oldused) {
        for (entry = oldtable; oldused > 0; entry++) {
            if (entry->key != NULL) {
                oldused--;
                set_insert_clean(so, entry->key, entry->hash);
            }
        }
    } else {
        for (entry = oldtable; oldused > 0; entry++) {
            if (entry->key != NULL && entry->key != dummy) {
                oldused--;
                set_insert_clean(so, entry->key, entry->hash);
            }
        }
    }
    if (is_oldtable_malloced)
        PyMem_DEL(oldtable);
    return 0;
}
static void
set_insert_clean(PySetObject *so, PyObject *key, Py_hash_t hash)
{
    setentry *table = so->table;
    setentry *entry;
    size_t perturb = hash;
    size_t mask = (size_t)so->mask;
    size_t i = (size_t)hash & mask;
    size_t j;
    // #define LINEAR_PROBES 9
    while (1) {
        entry = &table[i];
        if (entry->key == NULL)
            goto found_null;
        if (i + LINEAR_PROBES <= mask) {
            for (j = 0; j < LINEAR_PROBES; j++) {
                entry++;
                if (entry->key == NULL)
                    goto found_null;
            }
        }
        perturb >>= PERTURB_SHIFT;
        i = (i * 5 + 1 + perturb) & mask;
    }
  found_null:
    entry->key = key;
    entry->hash = hash;
    so->fill++;
    so->used++;
}

從集合當(dāng)中刪除元素 pop

從集合當(dāng)中刪除元素的代碼如下所示：

static PyObject *
set_pop(PySetObject *so)
{
    /* Make sure the search finger is in bounds */
    Py_ssize_t i = so->finger & so->mask;
    setentry *entry;
    PyObject *key;
    assert (PyAnySet_Check(so));
    if (so->used == 0) {
        PyErr_SetString(PyExc_KeyError, "pop from an empty set");
        return NULL;
    }
    while ((entry = &so->table[i])->key == NULL || entry->key==dummy) {
        i++;
        if (i > so->mask)
            i = 0;
    }
    key = entry->key;
    entry->key = dummy;
    entry->hash = -1;
    so->used--;
    so->finger = i + 1;         /* next place to start */
    return key;
}

上面的代碼相對(duì)來(lái)說(shuō)也比較清晰，從 finger 開(kāi)始尋找存在的元素，并且刪除他。我們?cè)谇懊嫣岬竭^(guò)，當(dāng)一個(gè)元素被刪除之后他會(huì)被賦值成 dummy 而且哈希值為 -1 。

總結(jié)

在本篇文章當(dāng)中主要給大家簡(jiǎn)要介紹了一下在 cpython 當(dāng)中的集合對(duì)象是如何實(shí)現(xiàn)的，主要是介紹了一些核心的數(shù)據(jù)結(jié)構(gòu)和 cpython 當(dāng)中具體的哈希表的實(shí)現(xiàn)原理，在 cpython 內(nèi)部是使用線性探測(cè)法和開(kāi)放地址法兩種方法去解決哈希沖突的，同時(shí) cpython 哈希表的擴(kuò)容方式比價(jià)有意思，在哈希表當(dāng)中的元素個(gè)數(shù)小于 50000 時(shí)，擴(kuò)容的時(shí)候，擴(kuò)容大小為原來(lái)的四倍，當(dāng)大于 50000 時(shí)，擴(kuò)容的大小為原來(lái)的兩倍，這個(gè)主要是因?yàn)榕潞竺嫒绻麛U(kuò)容太大沒(méi)有使用非常浪費(fèi)內(nèi)存空間。

本篇文章是深入理解 python 虛擬機(jī)系列文章之一，文章地址：github.com/Chang-LeHun…

更多精彩內(nèi)容合集可訪問(wèn)項(xiàng)目：github.com/Chang-LeHun…

以上就是Python 虛擬機(jī)集合set實(shí)現(xiàn)原理及源碼解析的詳細(xì)內(nèi)容，更多關(guān)于Python 虛擬機(jī)set集合的資料請(qǐng)關(guān)注腳本之家其它相關(guān)文章！

您可能感興趣的文章: