Information Retrieval: Implementing and Evaluating Search Engines

Addenda for Chapter 6: Index Compression

Group VarInt

In Chapter 6, we discussed the byte-aligned vByte method as an example of an index compression technique with high decoding performance. After we had written the chapter,
Jeff Dean gave a keynote talk at WSDM 2009 in which he discussed a few compression algorithms used by Google, including the byte-aligned Group VarInt method.
Group VarInt is named after VarInt (variable-size integer coding), Google's version of the vByte algorithm. vByte's decoding routine is very efficient, but often suffers from branch misprediction. Consider the vByte representation of the postings list L = ⟨80, 400, 431, 686⟩:
Δ(L)   =   ⟨80, 320, 31, 255⟩
vByte(L)   =   01010000 11000000 00000010 00011111 11111111 00000001
(where the continuation bits are underlined). When working through the encoded postings sequence, the decoder needs to inspect every single continuation bit and, depending on whether it is 1 or 0, either continue with the current posting or skip to the next. This means there is a branch instruction, and thus the possibility of a branch misprediction, for every byte in the encoded postings sequence.
Group VarInt addresses this problem in two steps:
  1. Replacing the continuation bits with a 2-bit selector value that appears before the body of the encoded posting. This allows us to use simple arithmetic (or a lookup table) to determine how many bits to consume for the current posting.
  2. Combining four consecutive postings into a group for better space utilization and alignment of the selecotr values (or to reduce the number of table lookups in case a lookup table is used).
For the same postings list as before, the Group VarInt representation is:
GroupVarInt(L)  =  00010000 0101000 01000000 00000001 00011111 11111111
(the selector bits corresponding to each encoded posting are shown in the same color as the respective posting). For example, the second Δ-value, 320, can be encoded using 2 bytes, so its selector value is 01.
Publicly available implementations of Group VarInt typically use a precomputed lookup table to determine the position of each posting in the encoded data chunk and the bit mask to apply when extracting it. In our own experiments, we found that a little bit shifting works just as well.
The following decoder implementation assumes a little-endian architecture that supports unaligned memory access (e.g., Intel/AMD); on other architectures, the decoding operations may be slightly more complicated.
void DecodeGroupVarInt(const byte* compressed, int size, uint32_t* uncompressed) {
  const uint32_t MASK[4] = { 0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF };
  const byte* limit = compressed + size;
  uint32_t current_value = 0;
  while (compressed != limit) {
    const uint32_t selector = *compressed++;
    const uint32_t selector1 = (selector & 3);
    current_value += *((uint32_t*)(compressed)) & MASK[selector1];
    *uncompressed++ = current_value;
    compressed += selector1 + 1;
    const uint32_t selector2 = ((selector >> 2) & 3);
    current_value += *((uint32_t*)(compressed)) & MASK[selector2];
    *uncompressed++ = current_value;
    compressed += selector2 + 1;
    const uint32_t selector3 = ((selector >> 4) & 3);
    current_value += *((uint32_t*)(compressed)) & MASK[selector3];
    *uncompressed++ = current_value;
    compressed += selector3 + 1;
    const uint32_t selector4 = (selector >> 6);
    current_value += *((uint32_t*)(compressed)) & MASK[selector4];
    *uncompressed++ = current_value;
    compressed += selector4 + 1;
If the number of postings in the compressed list is not a multiple of 4, then special-case handling may be necessary for the final 1–3 postings in the list — for instance, by encoding those postings using vByte.
Decoding Performance
The following table is identical with Table 6.9 (page 213) in the book, except for the addition of Group VarInt:
Compression Decoding       Cumulative Overhead
Compression Method       (bits per docid)       (ns per docid) (decoding + disk I/O)
Gamma4.567.7313.98 ns
Golomb3.7810.8516.03 ns
Rice3.816.4611.68 ns
LLRUN3.897.0412.37 ns
Interpolative3.8726.5031.80 ns
vByte8.111.3912.50 ns
Simple-94.912.769.49 ns
Group VarInt10.451.0914.84 ns
Perhaps surprisingly, the performance difference between vByte and Group VarInt in the above table is very small. The reason for this is that the vast majority of Δ-values in the docid index for GOV2 require 7 bits or less, in which case vByte incurs no branch mispredictions. To evaluate the performance difference between vByte and Group VarInt on an index in which the Δ-values vary over a larger range, we repeated the experiment from Table 6.9 for a schema-independent index instead of a docid index:
Compression Decoding       Cumulative Overhead
Compression Method       (bits per position)       (ns per position) (decoding + disk I/O)
Gamma14.0912.8132.11 ns
Golomb12.1412.7429.37 ns
Rice12.417.3224.32 ns
LLRUN10.078.8022.60 ns
Interpolative10.3431.6345.80 ns
vByte12.034.3420.82 ns
Simple-913.144.2122.21 ns
Group VarInt13.101.9019.85 ns
Compared to the docid index, the decoding overhead for vByte increases by 3.0 ns per list element (+221%). The decoding overhead for Group VarInt, on the other hand, grows by only 0.8 ns per list element (+76%). Thus, while vByte may be an appropriate compression method for docid indices, since it is almost as fast as Group VarInt and slightly more compact, it is not ideal for schema-independent postings lists, where Group VarInt delivers much better decoding performance. If postings lists are stored on disk, one may still argue that vByte is the superior compression method, as it achieves better compression rates. However, if the index is kept in memory, then the factor-2.3 difference in decoding performance makes Group VarInt the clear winner.