0%

Redis源码剖析之压缩列表

以下涉及到的源码均为redis5.0-rc3版本的代码【点击查看官方源码

压缩列表(ziplist)

  1. 压缩列表是一个特殊编码的列表,为节约内存而开发。
  2. 压缩列表可以保存字符串和整型值,其中整数存放时仍是整数,而不会变成字符。
  3. 压缩列表允许push和pop操作在时间复杂度为O(1)的情况下完成。但是每个操作都可能需要重新分配ziplist使用的内存,所以实际的复杂性受ziplist使用内存量及其变动的影响。
  4. 压缩列表在redis中是列表键和哈希键的底层实现之一。
  5. 由于压缩列表本就为节约内存而开发,且其改动的操作都可能会重新分配entry的内存,导致操作的时间复杂度增加,所以为了性能上不受影响,其适用于存储小整数值和较短的字符串。

数据结构

压缩列表

压缩列表的数据结构是在ziplist.c文件中注释说明中这样描述的:

1
<zlbytes> <zltail> <zllen> <entry> <entry> ... <entry> <zlend>

其中各部分的含义如下:

  • zlbytes
    值类型为uint32_t,占用4个字节。是一个无符号整型值,保存着整个压缩列表占用的字节数(包含自己占用的4个字节)。此值的作用在于当对ziplist的内存进行重新分配的时候直接使用该值,而无需进行遍历整个ziplist;
  • zltail
    值类型为uint32_t,占用4个字节。列表最后一个节点的地址偏移量,这样就可以使得进行pop操作的时候无需遍历整个ziplist;
  • zllen
    值类型为uint16_t,占用2个字节。记录整个ziplist中节点的总数,当其大小大于2^16-2的时候,zllen=2^16-1,此时,便需要遍历整个ziplist才能确定节点的数量;
  • entry
    ziplist的节点
  • zlend
    值类型为uint8_t,占用1个字节。值为255,表示ziplist结束的位置,当一个字节被设置值为255时即表示后续不会再有节点;

下面用图来举例说明:
redis-3-1

压缩列表节点

压缩列表节点的组成一般情况下如下所示:

1
<prevlen> <encoding> <entry-data>

然而,如果节点存储的是小整型时,节点就不会有这个数据,因为真正的值可以用encoding的值表示,然后节点的组成又会如下所示:

1
<prevlen> <encoding>

由上可见,每个entry节点都会有前缀信息,且其前缀都由两方面信息组成:prvlen、encoding。

prvlen

  1. 代表前一个节点占用的字节长度,方便ziplist从后往前遍历。如p1代表压缩列表地址,p2代表最后一个节点地址,p3代表倒数第二个节点地址,由此可见:
1
2
3
p2 = p1 + zltail
p3 = p2 - p2->prvlen
p(pre) = p3 - p3->prvlen
  1. 如果长度小于254字节,prvlen则占用1个字节。

    1
    <prevlen from 0 to 253> <encoding> <entry>
  2. 如果长度大于254字节,prvlen则占用5个字节长度,第一个字节代值为254(FE),后面的四个字节用于保存前一个节点的长度。

    1
    0xFE <4 bytes unsigned little endian prevlen> <encoding> <entry>

encoding

  1. 表示当前节点的编码,可以是integer或者string。
  2. 当其节点存储的是一个string时,encoding的第一个字节的前2个bit(00、01、10)代表了ecoding的类型,其他的bit则表示string的长度。
  3. 当encoding的第一个字节的前2个bit为11时,代表entry节点存储的是整数,且后续的2个bit则可以用来表示integer的值(即节点存储小整数的时候)

下面直接贴身源码中的例子,来举例当类型不同时的encoding的样子。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
//---------------字符串类型---------------//
* |00pppppp| - 1 byte
* String value with length less than or equal to 63 bytes (6 bits).
* "pppppp" represents the unsigned 6 bit length.
* |01pppppp|qqqqqqqq| - 2 bytes
* String value with length less than or equal to 16383 bytes (14 bits).
* IMPORTANT: The 14 bit number is stored in big endian.
* |10000000|qqqqqqqq|rrrrrrrr|ssssssss|tttttttt| - 5 bytes
* String value with length greater than or equal to 16384 bytes.
* Only the 4 bytes following the first byte represents the length
* up to 32^2-1. The 6 lower bits of the first byte are not used and
* are set to zero.
* IMPORTANT: The 32 bit number is stored in big endian.
//-----------------整数类型------------//
* |11000000| - 3 bytes
* Integer encoded as int16_t (2 bytes).
* |11010000| - 5 bytes
* Integer encoded as int32_t (4 bytes).
* |11100000| - 9 bytes
* Integer encoded as int64_t (8 bytes).
* |11110000| - 4 bytes
* Integer encoded as 24 bit signed (3 bytes).
* |11111110| - 2 bytes
* Integer encoded as 8 bit signed (1 byte).
* |1111xxxx| - (with xxxx between 0000 and 1101) immediate 4 bit integer.
* Unsigned integer from 0 to 12. The encoded value is actually from
* 1 to 13 because 0000 and 1111 can not be used, so 1 should be
* subtracted from the encoded 4 bit value to obtain the right value.
* |11111111| - End of ziplist special entry.

压缩列表存储样例

这里用源码中的两个例子来具体表示压缩列表的存储情况,如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
//-----------------整数类型------------//
* The following is a ziplist containing the two elements representing
* the strings "2" and "5". It is composed of 15 bytes, that we visually
* split into sections:
*
* [0f 00 00 00] [0c 00 00 00] [02 00] [00 f3] [02 f6] [ff]
* | | | | | |
* zlbytes zltail entries "2" "5" end
*
* The first 4 bytes represent the number 15, that is the number of bytes
* the whole ziplist is composed of. The second 4 bytes are the offset
* at which the last ziplist entry is found, that is 12, in fact the
* last entry, that is "5", is at offset 12 inside the ziplist.
* The next 16 bit integer represents the number of elements inside the
* ziplist, its value is 2 since there are just two elements inside.
* Finally "00 f3" is the first entry representing the number 2. It is
* composed of the previous entry length, which is zero because this is
* our first entry, and the byte F3 which corresponds to the encoding
* |1111xxxx| with xxxx between 0001 and 1101. We need to remove the "F"
* higher order bits 1111, and subtract 1 from the "3", so the entry value
* is "2". The next entry has a prevlen of 02, since the first entry is
* composed of exactly two bytes. The entry itself, F6, is encoded exactly
* like the first entry, and 6-1 = 5, so the value of the entry is 5.
* Finally the special entry FF signals the end of the ziplist.
*
//-----------------字符串类型------------//
* Adding another element to the above string with the value "Hello World"
* allows us to show how the ziplist encodes small strings. We'll just show
* the hex dump of the entry itself. Imagine the bytes as following the
* entry that stores "5" in the ziplist above:
*
* [02] [0b] [48 65 6c 6c 6f 20 57 6f 72 6c 64]
*
* The first byte, 02, is the length of the previous entry. The next
* byte represents the encoding in the pattern |00pppppp| that means
* that the entry is a string of length <pppppp>, so 0B means that
* an 11 bytes string follows. From the third byte (48) to the last (64)
* there are just the ASCII characters for "Hello World".

API

在ziplist.h头文件中给出了如下api:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
//创建压缩列表
unsigned char *ziplistNew(void);
//合并两个压缩列表,将second追加在first后面
unsigned char *ziplistMerge(unsigned char **first, unsigned char **second);
//新加节点
unsigned char *ziplistPush(unsigned char *zl, unsigned char *s, unsigned int slen, int where);
//返回给定索引的节点
unsigned char *ziplistIndex(unsigned char *zl, int index);
//下一个节点
unsigned char *ziplistNext(unsigned char *zl, unsigned char *p);
//前一个节点
unsigned char *ziplistPrev(unsigned char *zl, unsigned char *p);
//获取节点中的值
unsigned int ziplistGet(unsigned char *p, unsigned char **sval, unsigned int *slen, long long *lval);
//插入节点
unsigned char *ziplistInsert(unsigned char *zl, unsigned char *p, unsigned char *s, unsigned int slen);
//删除节点
unsigned char *ziplistDelete(unsigned char *zl, unsigned char **p);
//删除多个节点
unsigned char *ziplistDeleteRange(unsigned char *zl, int index, unsigned int num);
//比较节点值大小
unsigned int ziplistCompare(unsigned char *p, unsigned char *s, unsigned int slen);
//查找
unsigned char *ziplistFind(unsigned char *p, unsigned char *vstr, unsigned int vlen, unsigned int skip);
//获取节点个数
unsigned int ziplistLen(unsigned char *zl);
//压缩列表暂用总字节数
size_t ziplistBlobLen(unsigned char *zl);
//格式化打印压缩列表信息
void ziplistRepr(unsigned char *zl);

下面将只对节点的定义,以及节点的插入和删除操作的源码做介绍,其余的若想要了解可去浏览源码。而为什么要单独只介绍删除和插入两个操作呢?因为其涉及到压缩列表的一个影响性能的重要操作——空间扩展

entry的定义

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
typedef struct zlentry {
//prevrawlensize占用的字节情况
unsigned int prevrawlensize;
//前一个节点的长度
unsigned int prevrawlen;
//节点的长度(string有1、2、5字节情况,integer只有单字节的情况)
unsigned int lensize;
//节点占用字节数
unsigned int len;
//prevrawlensize + lensize
unsigned int headersize;
//编码类型:ZIP_STR_* or ZIP_INT_*
unsigned char encoding;
//指向节点起始的指针
unsigned char *p;
} zlentry;

删除操作

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
//删除元素为num的节点
unsigned char *__ziplistDelete(unsigned char *zl, unsigned char *p, unsigned int num) {
unsigned int i, totlen, deleted = 0;
size_t offset;
int nextdiff = 0;
zlentry first, tail;

zipEntry(p, &first);
for (i = 0; p[0] != ZIP_END && i < num; i++) {
p += zipRawEntryLength(p);
deleted++;
}

//计算需要删除的空间
totlen = p-first.p;
if (totlen > 0) {
if (p[0] != ZIP_END) {
nextdiff = zipPrevLenByteDiff(p,first.prevrawlen);

p -= nextdiff;
zipStorePrevEntryLength(p,first.prevrawlen);

ZIPLIST_TAIL_OFFSET(zl) =
intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))-totlen);

zipEntry(p, &tail);
if (p[tail.headersize+tail.len] != ZIP_END) {
ZIPLIST_TAIL_OFFSET(zl) =
intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+nextdiff);
}
//移动后面的节点
memmove(first.p,p,
intrev32ifbe(ZIPLIST_BYTES(zl))-(p-zl)-1);
} else {
//尾节点删除,不需要进行空间扩展
ZIPLIST_TAIL_OFFSET(zl) =
intrev32ifbe((first.p-zl)-first.prevrawlen);
}

//重新计算尾节点的偏移量和计算节点总字节数
offset = first.p-zl;
zl = ziplistResize(zl, intrev32ifbe(ZIPLIST_BYTES(zl))-totlen+nextdiff);
ZIPLIST_INCR_LENGTH(zl,-deleted);
p = zl+offset;

if (nextdiff != 0)
//进行空间扩展更新
zl = __ziplistCascadeUpdate(zl,p);
}
return zl;
}

插入操作

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
//插入节点p
unsigned char *__ziplistInsert(unsigned char *zl, unsigned char *p, unsigned char *s, unsigned int slen) {
size_t curlen = intrev32ifbe(ZIPLIST_BYTES(zl)), reqlen;
unsigned int prevlensize, prevlen = 0;
size_t offset;
int nextdiff = 0;
unsigned char encoding = 0;
long long value = 123456789;
zlentry tail;

//查找要插入节点的前一个节点
if (p[0] != ZIP_END) {
ZIP_DECODE_PREVLEN(p, prevlensize, prevlen);
} else {
unsigned char *ptail = ZIPLIST_ENTRY_TAIL(zl);
if (ptail[0] != ZIP_END) {
prevlen = zipRawEntryLength(ptail);
}
}

//判断entry是否要encoding
if (zipTryEncoding(s,slen,&value,&encoding)) {
//设置int类型
reqlen = zipIntSize(encoding);
} else {
reqlen = slen;
}

reqlen += zipStorePrevEntryLength(NULL,prevlen);
reqlen += zipStoreEntryEncoding(NULL,encoding,slen);

//当插入的节点不是在最后的时候,我们需要确保下一个节点有足够的内存来保存当前节点的长度
int forcelarge = 0;
nextdiff = (p[0] != ZIP_END) ? zipPrevLenByteDiff(p,reqlen) : 0;
if (nextdiff == -4 && reqlen < 4) {
nextdiff = 0;
forcelarge = 1;
}

/* Store offset because a realloc may change the address of zl. */
offset = p-zl;
zl = ziplistResize(zl,curlen+reqlen+nextdiff);
p = zl+offset;

//当不是插入到最后的时候
if (p[0] != ZIP_END) {
/* Subtract one because of the ZIP_END bytes */
memmove(p+reqlen,p-nextdiff,curlen-offset-1+nextdiff);

/* Encode this entry's raw length in the next entry. */
if (forcelarge)
zipStorePrevEntryLengthLarge(p+reqlen,reqlen);
else
zipStorePrevEntryLength(p+reqlen,reqlen);

/* Update offset for tail */
ZIPLIST_TAIL_OFFSET(zl) =
intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+reqlen);

//当要插入位置后面有多个节点时
zipEntry(p+reqlen, &tail);
if (p[reqlen+tail.headersize+tail.len] != ZIP_END) {
ZIPLIST_TAIL_OFFSET(zl) =
intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+nextdiff);
}
} else {
//当是插入到最后的时候
ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(p-zl);
}

//判断是否进行空间扩展
if (nextdiff != 0) {
offset = p-zl;
zl = __ziplistCascadeUpdate(zl,p+reqlen);
p = zl+offset;
}

//写入节点
p += zipStorePrevEntryLength(p,prevlen);
p += zipStoreEntryEncoding(p,encoding,slen);
if (ZIP_IS_STR(encoding)) {
memcpy(p,s,slen);
} else {
zipSaveInteger(p,value,encoding);
}
ZIPLIST_INCR_LENGTH(zl,1);
return zl;
}

空间扩展

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
/* When an entry is inserted, we need to set the prevlen field of the next
* entry to equal the length of the inserted entry. It can occur that this
* length cannot be encoded in 1 byte and the next entry needs to be grow
* a bit larger to hold the 5-byte encoded prevlen. This can be done for free,
* because this only happens when an entry is already being inserted (which
* causes a realloc and memmove). However, encoding the prevlen may require
* that this entry is grown as well. This effect may cascade throughout
* the ziplist when there are consecutive entries with a size close to
* ZIP_BIG_PREVLEN, so we need to check that the prevlen can be encoded in
* every consecutive entry.
*
* Note that this effect can also happen in reverse, where the bytes required
* to encode the prevlen field can shrink. This effect is deliberately ignored,
* because it can cause a "flapping" effect where a chain prevlen fields is
* first grown and then shrunk again after consecutive inserts. Rather, the
* field is allowed to stay larger than necessary, because a large prevlen
* field implies the ziplist is holding large entries anyway.
*
* The pointer "p" points to the first entry that does NOT need to be
* updated, i.e. consecutive fields MAY need an update. */

unsigned char *__ziplistCascadeUpdate(unsigned char *zl, unsigned char *p) {
size_t curlen = intrev32ifbe(ZIPLIST_BYTES(zl)), rawlen, rawlensize;
size_t offset, noffset, extra;
unsigned char *np;
zlentry cur, next;

while (p[0] != ZIP_END) {
zipEntry(p, &cur);
rawlen = cur.headersize + cur.len;
rawlensize = zipStorePrevEntryLength(NULL,rawlen);

//当以及时最后一个节点的时候,结束操作
if (p[rawlen] == ZIP_END) break;
zipEntry(p+rawlen, &next);

//当prvlen内存不需要改变的时候,结束操作
if (next.prevrawlen == rawlen) break;

if (next.prevrawlensize < rawlensize) {
offset = p-zl;
extra = rawlensize-next.prevrawlensize;
zl = ziplistResize(zl,curlen+extra);
p = zl+offset;

np = p+rawlen;
noffset = np-zl;

if ((zl+intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))) != np) {
ZIPLIST_TAIL_OFFSET(zl) =
intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+extra);
}

memmove(np+rawlensize,
np+next.prevrawlensize,
curlen-noffset-next.prevrawlensize-1);
zipStorePrevEntryLength(np,rawlen);

p += rawlen;
curlen += extra;
} else {
if (next.prevrawlensize > rawlensize) {
zipStorePrevEntryLengthLarge(p+rawlen,rawlen);
} else {
zipStorePrevEntryLength(p+rawlen,rawlen);
}
break;
}
}
return zl;
}