哈希表实现原理

哈希表是一种保存键值对（key-value）的数据结构。哈希表优点在于，它能以 O(1) 的复杂度快速查询数据

源码对应文件：

dict.c
dict.h

源码及原理

结构设计

/* This is our hash table structure. Every dictionary has two of this as we
 * implement incremental rehashing, for the old to the new table. */
typedef struct dictht {
    dictEntry **table; // 哈希表数组
    unsigned long size; // 哈希表大小
    unsigned long sizemask; // 哈希表大小掩码，用于计算索引
    unsigned long used; // 哈希表中已使用的节点数
} dictht;

哈希表底层就是一个数组，数组的每个元素是一个指向哈希节点的指针

对于每个元素节点dictEntry

typedef struct dictEntry { // 定义一个结构体 dictEntry
    void *key; // 键指针
    union { // 联合体，用于存储不同类型的值
        void *val; // 值指针
        uint64_t u64; // 无符号64位整数
        int64_t s64; // 有符号64位整数
        double d; // 双精度浮点数
    } v; // 联合体变量 v
    struct dictEntry *next; // 指向下一个字典条目的指针
} dictEntry; // 结构体类型定义为 dictEntry

void *key：这是一个指向键的指针。键可以是任何类型的数据，通过指针来存储。
union：这是一个联合体，用于存储不同类型的值。联合体中的成员可以是void *val（值指针）、uint64_t u64（无符号 64 位整数）、int64_t s64（有符号 64 位整数）或double d（双精度浮点数）。联合体的特点是在同一时刻只能存储其中的一个成员。
struct dictEntry *next：这是一个指向下一个字典条目的指针，用于构建链表结构，解决哈希冲突。

使用联合体的好处是节省内存空间，因为当「值」是整数或浮点数时，就可以将值的数据内嵌在 dictEntry 结构里，无需再用一个指针指向实际的值，从而节省了内存空间。

哈希冲突

哈希表实际上是一个数组，数组里多每一个元素就是一个哈希桶。当一个键值对的键经过 Hash 函数计算后得到哈希值，再将(哈希值 % 哈希表大小)取模计算，得到的结果值就是该 key-value 对应的数组元素位置，也就是第几个哈希桶。

如果说有两个元素哈希值相同，被分配到了同一个桶中，那么就是哈希冲突。

链式哈希

每个哈希表节点都有一个 next 指针，用于指向下一个哈希表节点，因此多个哈希表节点可以用 next 指针构成一个单项链表，被分配到同一个哈希桶上的多个节点可以用这个单项链表连接起来，这样就解决了哈希冲突。

随着单项链表长度的增加，耗时也会增加，因为链表查询是 $O(n)$ ，要解决这个问题，就需要rehash

rehash触发条件

两个：

负载因子大于1，且没有执行bgsave或者bgrewiteaof，也没有RDB快照和AOF重写。
负载因子大于5，说明哈希冲突严重，强制rehash

rehash

typedef struct dict {
    dictType *type; // 字典类型
    void *privdata; // 私有数据
    dictht ht[2]; // 哈希表，包含两个哈希表用于渐进式rehash
    long rehashidx; /* rehashing not in progress if rehashidx == -1 */
    unsigned long iterators; /* number of iterators currently running */
} dict;

Redis定义的这个dict结构体中有两个哈希表

Redis 的哈希（Hash）在存储元素数量增加或负载因子超过一定阈值时，会进行 rehash 操作，以保证哈希表的性能和效率。rehash 操作主要包括以下步骤：

给另一个哈希表分配空间，一般为两倍
迁移元素：将原哈希表中的元素重新分配到新的哈希表中，根据重新计算的哈希值找到对应的槽位，并将元素放入其中。
将第一个哈希表空间释放，并将第二个哈希表设置为第一个。原本的第一个哈希表为空白，为下次哈希做准备

如果在这个过程中，第一个哈希表空间很大，那么会涉及大量数据拷贝，造成Redis阻塞

渐进式hash

迁移的工作不再是一次性迁移完成，而是分多次迁移。步骤如下：

给第二个哈希表开空间
在rehash期间，每次哈希表元素增删改，除了操作，还会将哈希表1中索引位置所有kv迁移到哈希表2
随着客户端请求操作变多，最终在某个时间将第一个哈希表全部迁移到第二个哈希表，完成操作。

在此期间会有两个哈希表，增删改查会在两个哈希表上进行。

查找：先在第一个找，找不到去第二个找
新增，直接新增到第二个，这样保证第一个哈希表在不断减少

/* Performs N steps of incremental rehashing. Returns 1 if there are still
 * keys to move from the old to the new hash table, otherwise 0 is returned.
 *
 * Note that a rehashing step consists in moving a bucket (that may have more
 * than one key as we use chaining) from the old to the new hash table, however
 * since part of the hash table may be composed of empty spaces, it is not
 * guaranteed that this function will rehash even a single bucket, since it
 * will visit at max N*10 empty buckets in total, otherwise the amount of
 * work it does would be unbound and the function may block for a long time. */
/* 执行N步增量rehash。如果仍有键需要从旧哈希表移动到新哈希表，则返回1，否则返回0。
 *
 * 注意，rehash步骤包括将一个桶（可能有多个键，因为我们使用链表）从旧哈希表移动到新哈希表，
 * 然而，由于部分哈希表可能由空白空间组成，因此不能保证此函数会rehash一个桶，
 * 因为它最多会访问N*10个空桶，否则它执行的工作量将是无限的，函数可能会阻塞很长时间。 */
int dictRehash(dict *d, int n) { // 定义函数dictRehash，参数为字典指针d和整数n
    int empty_visits = n*10; /* Max number of empty buckets to visit. */ // 定义变量empty_visits，最大空桶访问次数为n*10
    unsigned long s0 = d->ht[0].size; // 获取哈希表0的大小
    unsigned long s1 = d->ht[1].size; // 获取哈希表1的大小
    if (dict_can_resize == DICT_RESIZE_FORBID || !dictIsRehashing(d)) return 0; // 如果禁止调整大小或不在rehash中，返回0
    if (dict_can_resize == DICT_RESIZE_AVOID &&  // 如果调整大小设置为避免，并且满足以下条件
        ((s1 > s0 && s1 / s0 < dict_force_resize_ratio) || // 如果哈希表1大于哈希表0且比例小于强制调整大小比例
         (s1 < s0 && s0 / s1 < dict_force_resize_ratio))) // 或者哈希表0大于哈希表1且比例小于强制调整大小比例
    {
        return 0; // 返回0
    }

    while(n-- && d->ht[0].used != 0) { // 当n大于0且哈希表0中有元素时
        dictEntry *de, *nextde; // 定义字典条目指针de和nextde

        /* Note that rehashidx can't overflow as we are sure there are more
         * elements because ht[0].used != 0 */
        /* 注意rehashidx不会溢出，因为我们确定有更多元素，因为ht[0].used != 0 */
        assert(d->ht[0].size > (unsigned long)d->rehashidx); // 断言哈希表0的大小大于rehash索引
        while(d->ht[0].table[d->rehashidx] == NULL) { // 当哈希表0的当前rehash索引处为空时
            d->rehashidx++; // 增加rehash索引
            if (--empty_visits == 0) return 1; // 如果空桶访问次数用完，返回1
        }
        de = d->ht[0].table[d->rehashidx]; // 获取当前rehash索引处的字典条目
        /* Move all the keys in this bucket from the old to the new hash HT */
        /* 将此桶中的所有键从旧哈希表移动到新哈希表 */
        while(de) { // 当字典条目不为空时
            uint64_t h; // 定义64位无符号整数h

            nextde = de->next; // 获取下一个字典条目
            /* Get the index in the new hash table */
            /* 获取新哈希表中的索引 */
            h = dictHashKey(d, de->key) & d->ht[1].sizemask; // 计算新哈希表中的索引
            de->next = d->ht[1].table[h]; // 将当前字典条目的下一个指针指向新哈希表中的对应位置
            d->ht[1].table[h] = de; // 将新哈希表中的对应位置指向当前字典条目
            d->ht[0].used--; // 减少哈希表0的已使用节点数
            d->ht[1].used++; // 增加哈希表1的已使用节点数
            de = nextde; // 将当前字典条目指向下一个字典条目
        }
        d->ht[0].table[d->rehashidx] = NULL; // 将哈希表0的当前rehash索引处置为空
        d->rehashidx++; // 增加rehash索引
    }

    /* Check if we already rehashed the whole table... */
    /* 检查我们是否已经rehash了整个表... */
    if (d->ht[0].used == 0) { // 如果哈希表0中没有已使用节点
        zfree(d->ht[0].table); // 释放哈希表0的内存
        d->ht[0] = d->ht[1]; // 将哈希表1赋值给哈希表0
        _dictReset(&d->ht[1]); // 重置哈希表1
        d->rehashidx = -1; // 将rehash索引置为-1
        return 0; // 返回0
    }

    /* More to rehash... */
    /* 还有更多需要rehash... */
    return 1; // 返回1
}

参考资料

小林Coding 哈希表

源码及原理​

结构设计​

哈希冲突​

链式哈希​

rehash触发条件​

rehash​

渐进式hash​

参考资料​