final class IndexedDISI extends DocIdSetIterator
DocIdSetIterator
which can return
the index of the current document, i.e. the ordinal of the current document
among the list of documents that this iterator can return. This is useful
to implement sparse doc values by only having to encode values for documents
that actually have a value.
Implementation-wise, this DocIdSetIterator
is inspired of
roaring bitmaps
and encodes ranges of 65536
documents independently and picks between 3 encodings depending on the
density of the range:
ALL
if the range contains 65536 documents exactly,
DENSE
if the range contains 4096 documents or more; in that
case documents are stored in a bit set,
SPARSE
otherwise, and the lower 16 bits of the doc IDs are
stored in a short
.
Only ranges that contain at least one value are encoded.
This implementation uses 6 bytes per document in the worst-case, which happens
in the case that all ranges contain exactly one document.
To avoid O(n) lookup time complexity, with n being the number of documents, two lookup
tables are used: A lookup table for block offset and index, and a rank structure
for DENSE block index lookups.
The lookup table is an array of int
-pairs, with a pair for each block. It allows for
direct jumping to the block, as opposed to iteration from the current position and forward
one block at a time.
Each int-pair entry consists of 2 logical parts:
The first 32 bit int holds the index (number of set bits in the blocks) up to just before the
wanted block. The maximum number of set bits is the maximum number of documents, which is < 2^31.
The next int holds the offset in bytes into the underlying slice. As there is a maximum of 2^16
blocks, it follows that the maximum size of any block must not exceed 2^15 bytes to avoid
overflow (2^16 bytes if the int is treated as unsigned). This is currently the case, with the
largest block being DENSE and using 2^13 + 36 bytes.
The cache overhead is numDocs/1024 bytes.
Note: There are 4 types of blocks: ALL, DENSE, SPARSE and non-existing (0 set bits).
In the case of non-existing blocks, the entry in the lookup table has index equal to the
previous entry and offset equal to the next non-empty block.
The block lookup table is stored at the end of the total block structure.
The rank structure for DENSE blocks is an array of byte-pairs with an entry for each
sub-block (default 512 bits) out of the 65536 bits in the outer DENSE block.
Each rank-entry states the number of set bits within the block up to the bit before the
bit positioned at the start of the sub-block.
Note that that the rank entry of the first sub-block is always 0 and that the last entry can
at most be 65536-2 = 65634 and thus will always fit into an byte-pair of 16 bits.
The rank structure for a given DENSE block is stored at the beginning of the DENSE block.
This ensures locality and keeps logistics simple.
Modifier and Type | Class and Description |
---|---|
(package private) static class |
IndexedDISI.Method |
Modifier and Type | Field and Description |
---|---|
(package private) int |
block |
private static int |
BLOCK_SIZE |
(package private) long |
blockEnd |
(package private) long |
cost |
static byte |
DEFAULT_DENSE_RANK_POWER |
private static int |
DENSE_BLOCK_LONGS |
(package private) long |
denseBitmapOffset |
(package private) int |
denseOrigoIndex |
(package private) byte |
denseRankPower |
(package private) byte[] |
denseRankTable |
(package private) int |
doc |
(package private) boolean |
exists |
(package private) int |
gap |
(package private) int |
index |
(package private) RandomAccessInput |
jumpTable |
(package private) int |
jumpTableEntryCount |
(package private) static int |
MAX_ARRAY_LENGTH |
(package private) IndexedDISI.Method |
method |
(package private) int |
nextBlockIndex |
(package private) int |
numberOfOnes |
(package private) IndexInput |
slice
The slice that stores the
DocIdSetIterator . |
(package private) long |
word |
(package private) int |
wordIndex |
NO_MORE_DOCS
Constructor and Description |
---|
IndexedDISI(IndexInput in,
long offset,
long length,
int jumpTableEntryCount,
byte denseRankPower,
long cost)
This constructor always creates a new blockSlice and a new jumpTable from in, to ensure that operations are
independent from the caller.
|
IndexedDISI(IndexInput blockSlice,
RandomAccessInput jumpTable,
int jumpTableEntryCount,
byte denseRankPower,
long cost)
This constructor allows to pass the slice and jumpTable directly in case it helps reuse.
|
Modifier and Type | Method and Description |
---|---|
private static int[] |
addJumps(int[] jumps,
long offset,
int index,
int startBlock,
int endBlock) |
int |
advance(int target)
Advances to the first beyond the current whose document number is greater
than or equal to target, and returns the document number itself.
|
private void |
advanceBlock(int targetBlock) |
boolean |
advanceExact(int target) |
long |
cost()
Returns the estimated cost of this
DocIdSetIterator . |
static IndexInput |
createBlockSlice(IndexInput slice,
java.lang.String sliceDescription,
long offset,
long length,
int jumpTableEntryCount)
Helper method for using
IndexedDISI(IndexInput, RandomAccessInput, int, byte, long) . |
static RandomAccessInput |
createJumpTable(IndexInput slice,
long offset,
long length,
int jumpTableEntryCount)
Helper method for using
IndexedDISI(IndexInput, RandomAccessInput, int, byte, long) . |
private static byte[] |
createRank(FixedBitSet buffer,
byte denseRankPower) |
int |
docID()
Returns the following:
-1 if DocIdSetIterator.nextDoc() or
DocIdSetIterator.advance(int) were not called yet. |
private static void |
flush(int block,
FixedBitSet buffer,
int cardinality,
byte denseRankPower,
IndexOutput out) |
private static short |
flushBlockJumps(int[] jumps,
int blockCount,
IndexOutput out,
long origo) |
int |
index() |
int |
nextDoc()
Advances to the next document in the set and returns the doc it is
currently on, or
DocIdSetIterator.NO_MORE_DOCS if there are no more docs in the
set.NOTE: after the iterator has exhausted you should not call this method, as it may result in unpredicted behavior. |
private static void |
rankSkip(IndexedDISI disi,
int targetInBlock)
If the distance between the current position and the target is > 8 words, the rank cache will
be used to guarantee a worst-case of 1 rank-lookup and 7 word-read-and-count-bits operations.
|
private void |
readBlockHeader() |
(package private) static short |
writeBitSet(DocIdSetIterator it,
IndexOutput out)
Writes the docIDs from it to out, in logical blocks, one for each 65536 docIDs in monotonically increasing
gap-less order.
|
(package private) static short |
writeBitSet(DocIdSetIterator it,
IndexOutput out,
byte denseRankPower)
Writes the docIDs from it to out, in logical blocks, one for each 65536 docIDs in monotonically
increasing gap-less order.
|
all, empty, range, slowAdvance
private static final int BLOCK_SIZE
private static final int DENSE_BLOCK_LONGS
public static final byte DEFAULT_DENSE_RANK_POWER
static final int MAX_ARRAY_LENGTH
final IndexInput slice
DocIdSetIterator
.final int jumpTableEntryCount
final byte denseRankPower
final RandomAccessInput jumpTable
final byte[] denseRankTable
final long cost
int block
long blockEnd
long denseBitmapOffset
int nextBlockIndex
IndexedDISI.Method method
int doc
int index
boolean exists
long word
int wordIndex
int numberOfOnes
int denseOrigoIndex
int gap
IndexedDISI(IndexInput in, long offset, long length, int jumpTableEntryCount, byte denseRankPower, long cost) throws java.io.IOException
IndexedDISI(IndexInput, RandomAccessInput, int, byte, long)
for re-use of blockSlice and jumpTable.in
- backing data.offset
- starting offset for blocks in the backing data.length
- the number of bytes holding blocks and jump-table in the backing data.jumpTableEntryCount
- the number of blocks covered by the jump-table.
This must match the number returned by writeBitSet(DocIdSetIterator, IndexOutput, byte)
.denseRankPower
- the number of docIDs covered by each rank entry in DENSE blocks, expressed as 2^denseRankPower
.
This must match the power given in writeBitSet(DocIdSetIterator, IndexOutput, byte)
cost
- normally the number of logical docIDs.java.io.IOException
IndexedDISI(IndexInput blockSlice, RandomAccessInput jumpTable, int jumpTableEntryCount, byte denseRankPower, long cost) throws java.io.IOException
blockSlice
- data blocks, normally created by createBlockSlice(org.apache.lucene.store.IndexInput, java.lang.String, long, long, int)
.jumpTable
- table holding jump-data for block-skips, normally created by createJumpTable(org.apache.lucene.store.IndexInput, long, long, int)
.jumpTableEntryCount
- the number of blocks covered by the jump-table.
This must match the number returned by writeBitSet(DocIdSetIterator, IndexOutput, byte)
.denseRankPower
- the number of docIDs covered by each rank entry in DENSE blocks, expressed as 2^denseRankPower
.
This must match the power given in writeBitSet(DocIdSetIterator, IndexOutput, byte)
cost
- normally the number of logical docIDs.java.io.IOException
private static void flush(int block, FixedBitSet buffer, int cardinality, byte denseRankPower, IndexOutput out) throws java.io.IOException
java.io.IOException
private static byte[] createRank(FixedBitSet buffer, byte denseRankPower)
static short writeBitSet(DocIdSetIterator it, IndexOutput out) throws java.io.IOException
DEFAULT_DENSE_RANK_POWER
of 9 (every 512 docIDs / 8 longs).
The caller must keep track of the number of jump-table entries (returned by this method) as well as the
denseRankPower (9 for this method) and provide them when constructing an IndexedDISI for reading.it
- the document IDs.out
- destination for the blocks.java.io.IOException
- if there was an error writing to out.static short writeBitSet(DocIdSetIterator it, IndexOutput out, byte denseRankPower) throws java.io.IOException
it
- the document IDs.out
- destination for the blocks.denseRankPower
- for IndexedDISI.Method.DENSE
blocks, a rank will be written every 2^denseRankPower
docIDs.
Values < 7 (every 128 docIDs) or > 15 (every 32768 docIDs) disables DENSE rank.
Recommended values are 8-12: Every 256-4096 docIDs or 4-64 longs.
DEFAULT_DENSE_RANK_POWER
is 9: Every 512 docIDs.
This should be stored in meta and used when creating an instance of IndexedDISI.java.io.IOException
- if there was an error writing to out.private static int[] addJumps(int[] jumps, long offset, int index, int startBlock, int endBlock)
private static short flushBlockJumps(int[] jumps, int blockCount, IndexOutput out, long origo) throws java.io.IOException
java.io.IOException
public static IndexInput createBlockSlice(IndexInput slice, java.lang.String sliceDescription, long offset, long length, int jumpTableEntryCount) throws java.io.IOException
IndexedDISI(IndexInput, RandomAccessInput, int, byte, long)
.
Creates a disiSlice for the IndexedDISI data blocks, without the jump-table.slice
- backing data, holding both blocks and jump-table.sliceDescription
- human readable slice designation.offset
- relative to the backing data.length
- full length of the IndexedDISI, including blocks and jump-table data.jumpTableEntryCount
- the number of blocks covered by the jump-table.java.io.IOException
- if a RandomAccessInput could not be created from slice.public static RandomAccessInput createJumpTable(IndexInput slice, long offset, long length, int jumpTableEntryCount) throws java.io.IOException
IndexedDISI(IndexInput, RandomAccessInput, int, byte, long)
.
Creates a RandomAccessInput covering only the jump-table data or null.slice
- backing data, holding both blocks and jump-table.offset
- relative to the backing data.length
- full length of the IndexedDISI, including blocks and jump-table data.jumpTableEntryCount
- the number of blocks covered by the jump-table.java.io.IOException
- if a RandomAccessInput could not be created from slice.public int docID()
DocIdSetIterator
-1
if DocIdSetIterator.nextDoc()
or
DocIdSetIterator.advance(int)
were not called yet.
DocIdSetIterator.NO_MORE_DOCS
if the iterator has exhausted.
docID
in class DocIdSetIterator
public int advance(int target) throws java.io.IOException
DocIdSetIterator
DocIdSetIterator.NO_MORE_DOCS
if target
is greater than the highest document number in the set.
The behavior of this method is undefined when called with
target ≤ current
, or after the iterator has exhausted.
Both cases may result in unpredicted behavior.
When target > current
it behaves as if written:
int advance(int target) { int doc; while ((doc = nextDoc()) < target) { } return doc; }Some implementations are considerably more efficient than that.
NOTE: this method may be called with DocIdSetIterator.NO_MORE_DOCS
for
efficiency by some Scorers. If your implementation cannot efficiently
determine that it should exhaust, it is recommended that you check for that
value in each call to this method.
advance
in class DocIdSetIterator
java.io.IOException
public boolean advanceExact(int target) throws java.io.IOException
java.io.IOException
private void advanceBlock(int targetBlock) throws java.io.IOException
java.io.IOException
private void readBlockHeader() throws java.io.IOException
java.io.IOException
public int nextDoc() throws java.io.IOException
DocIdSetIterator
DocIdSetIterator.NO_MORE_DOCS
if there are no more docs in the
set.nextDoc
in class DocIdSetIterator
java.io.IOException
public int index()
public long cost()
DocIdSetIterator
DocIdSetIterator
.
This is generally an upper bound of the number of documents this iterator might match, but may be a rough heuristic, hardcoded value, or otherwise completely inaccurate.
cost
in class DocIdSetIterator
private static void rankSkip(IndexedDISI disi, int targetInBlock) throws java.io.IOException
disi
- standard DISI.targetInBlock
- lower 16 bits of the targetjava.io.IOException
- if a DISI seek failed.