public class Builder<T>
extends java.lang.Object
NOTE: The algorithm is described at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
The parameterized type T is the output type. See the
subclasses of Outputs
.
FSTs larger than 2.1GB are now possible (as of Lucene 4.2). FSTs containing more than 2.1B nodes are also now possible, however they cannot be packed.
Modifier and Type | Class and Description |
---|---|
static class |
Builder.Arc<T>
Expert: holds a pending (seen but not yet serialized) arc.
|
(package private) static class |
Builder.CompiledNode |
(package private) static class |
Builder.FixedLengthArcsBuffer
Reusable buffer for building nodes with fixed length arcs (binary search or direct addressing).
|
(package private) static interface |
Builder.Node |
static class |
Builder.UnCompiledNode<T>
Expert: holds a pending (seen but not yet serialized) Node.
|
Modifier and Type | Field and Description |
---|---|
(package private) boolean |
allowFixedLengthArcs |
(package private) long |
arcCount |
(package private) long |
binarySearchNodeCount |
(package private) BytesStore |
bytes |
private NodeHash<T> |
dedupHash |
(package private) static float |
DIRECT_ADDRESSING_MAX_OVERSIZING_FACTOR
Default oversizing factor used to decide whether to encode a node with direct addressing or binary search.
|
(package private) long |
directAddressingExpansionCredit |
(package private) float |
directAddressingMaxOversizingFactor |
(package private) long |
directAddressingNodeCount |
private boolean |
doShareNonSingletonNodes |
(package private) Builder.FixedLengthArcsBuffer |
fixedLengthArcsBuffer |
private Builder.UnCompiledNode<T>[] |
frontier |
(package private) FST<T> |
fst |
(package private) long |
lastFrozenNode |
private IntsRefBuilder |
lastInput |
private int |
minSuffixCount1 |
private int |
minSuffixCount2 |
private T |
NO_OUTPUT |
(package private) long |
nodeCount |
(package private) int[] |
numBytesPerArc |
(package private) int[] |
numLabelBytesPerArc |
private int |
shareMaxTailLength |
Constructor and Description |
---|
Builder(FST.INPUT_TYPE inputType,
int minSuffixCount1,
int minSuffixCount2,
boolean doShareSuffix,
boolean doShareNonSingletonNodes,
int shareMaxTailLength,
Outputs<T> outputs,
boolean allowFixedLengthArcs,
int bytesPageBits)
Instantiates an FST/FSA builder with all the possible tuning and construction
tweaks.
|
Builder(FST.INPUT_TYPE inputType,
Outputs<T> outputs)
Instantiates an FST/FSA builder without any pruning.
|
Modifier and Type | Method and Description |
---|---|
void |
add(IntsRef input,
T output)
Add the next input/output pair.
|
private void |
compileAllTargets(Builder.UnCompiledNode<T> node,
int tailLength) |
private Builder.CompiledNode |
compileNode(Builder.UnCompiledNode<T> nodeIn,
int tailLength) |
FST<T> |
finish()
Returns final FST.
|
private void |
freezeTail(int prefixLenPlus1) |
long |
fstRamBytesUsed() |
long |
getArcCount() |
float |
getDirectAddressingMaxOversizingFactor() |
long |
getMappedStateCount() |
long |
getNodeCount() |
long |
getTermCount() |
Builder<T> |
setDirectAddressingMaxOversizingFactor(float factor)
Overrides the default the maximum oversizing of fixed array allowed to enable direct addressing
of arcs instead of binary search.
|
private boolean |
validOutput(T output) |
static final float DIRECT_ADDRESSING_MAX_OVERSIZING_FACTOR
This factor does not determine whether to encode a node with a list of variable length arcs or with
fixed length arcs. It only determines the effective encoding of a node that is already known to be
encoded with fixed length arcs.
See FST.shouldExpandNodeWithFixedLengthArcs()
and FST.shouldExpandNodeWithDirectAddressing()
.
For English words we measured 217K nodes, only 3.27% nodes are encoded with fixed length arcs, and 99.99% of them with direct addressing. Overall FST memory reduced by 1.67%.
For worst case we measured 168K nodes, 50% of them are encoded with fixed length arcs, and 14% of them with direct encoding. Overall FST memory reduced by 0.8%.
Use TestFstDirectAddressing.main()
and TestFstDirectAddressing.testWorstCaseForDirectAddressing()
to evaluate a change.
private final T NO_OUTPUT
private final int minSuffixCount1
private final int minSuffixCount2
private final boolean doShareNonSingletonNodes
private final int shareMaxTailLength
private final IntsRefBuilder lastInput
private Builder.UnCompiledNode<T>[] frontier
long lastFrozenNode
int[] numBytesPerArc
int[] numLabelBytesPerArc
final Builder.FixedLengthArcsBuffer fixedLengthArcsBuffer
long arcCount
long nodeCount
long binarySearchNodeCount
long directAddressingNodeCount
boolean allowFixedLengthArcs
float directAddressingMaxOversizingFactor
long directAddressingExpansionCredit
BytesStore bytes
public Builder(FST.INPUT_TYPE inputType, Outputs<T> outputs)
Builder(FST.INPUT_TYPE, int, int, boolean, boolean, int, Outputs, boolean, int)
with
pruning options turned off.public Builder(FST.INPUT_TYPE inputType, int minSuffixCount1, int minSuffixCount2, boolean doShareSuffix, boolean doShareNonSingletonNodes, int shareMaxTailLength, Outputs<T> outputs, boolean allowFixedLengthArcs, int bytesPageBits)
inputType
- The input type (transition labels). Can be anything from FST.INPUT_TYPE
enumeration. Shorter types will consume less memory. Strings (character sequences) are
represented as FST.INPUT_TYPE.BYTE4
(full unicode codepoints).minSuffixCount1
- If pruning the input graph during construction, this threshold is used for telling
if a node is kept or pruned. If transition_count(node) >= minSuffixCount1, the node
is kept.minSuffixCount2
- (Note: only Mike McCandless knows what this one is really doing...)doShareSuffix
- If true
, the shared suffixes will be compacted into unique paths.
This requires an additional RAM-intensive hash map for lookups in memory. Setting this parameter to
false
creates a single suffix path for all input sequences. This will result in a larger
FST, but requires substantially less memory and CPU during building.doShareNonSingletonNodes
- Only used if doShareSuffix is true. Set this to
true to ensure FST is fully minimal, at cost of more
CPU and more RAM during building.shareMaxTailLength
- Only used if doShareSuffix is true. Set this to
Integer.MAX_VALUE to ensure FST is fully minimal, at cost of more
CPU and more RAM during building.outputs
- The output type for each input sequence. Applies only if building an FST. For
FSA, use NoOutputs.getSingleton()
and NoOutputs.getNoOutput()
as the
singleton output object.allowFixedLengthArcs
- Pass false to disable the fixed length arc optimization (binary search or
direct addressing) while building the FST; this will make the resulting FST smaller but slower to
traverse.bytesPageBits
- How many bits wide to make each
byte[] block in the BytesStore; if you know the FST
will be large then make this larger. For example 15
bits = 32768 byte pages.public Builder<T> setDirectAddressingMaxOversizingFactor(float factor)
Setting this factor to a negative value (e.g. -1) effectively disables direct addressing, only binary search nodes will be created.
DIRECT_ADDRESSING_MAX_OVERSIZING_FACTOR
public float getDirectAddressingMaxOversizingFactor()
public long getTermCount()
public long getNodeCount()
public long getArcCount()
public long getMappedStateCount()
private Builder.CompiledNode compileNode(Builder.UnCompiledNode<T> nodeIn, int tailLength) throws java.io.IOException
java.io.IOException
private void freezeTail(int prefixLenPlus1) throws java.io.IOException
java.io.IOException
public void add(IntsRef input, T output) throws java.io.IOException
IntsRef.compareTo(org.apache.lucene.util.IntsRef)
. It's also OK to add the same
input twice in a row with different outputs, as long
as Outputs
implements the Outputs.merge(T, T)
method. Note that input is fully consumed after this
method is returned (so caller is free to reuse), but
output is not. So if your outputs are changeable (eg
ByteSequenceOutputs
or IntSequenceOutputs
) then you cannot reuse across
calls.java.io.IOException
private boolean validOutput(T output)
public FST<T> finish() throws java.io.IOException
java.io.IOException
private void compileAllTargets(Builder.UnCompiledNode<T> node, int tailLength) throws java.io.IOException
java.io.IOException
public long fstRamBytesUsed()