support.dawg module

This module implements an FST/FSA writer and reader. An FST (Finite State Transducer) stores a directed acyclic graph with values associated with the leaves. Common elements of the values are pushed inside the tree. An FST that does not store values is a regular FSA.

The format of the leaf values is pluggable using subclasses of the Values class.

Whoosh uses these structures to store a directed acyclic word graph (DAWG) for use in (at least) spell checking.

Graph nodes

class whoosh.support.dawg.Node(owner, address, accept=False)

A slow but easier-to-use wrapper for FSA/DAWGs. Translates the low-level arc-based interface of GraphReader into Node objects with methods to follow edges.

class whoosh.support.dawg.UnionNode(a, b)

Makes two graphs appear to be the union of the two graphs.

class whoosh.support.dawg.IntersectionNode(a, b)

Makes two graphs appear to be the intersection of the two graphs.

Cursor interface

class whoosh.support.dawg.BaseCursor

Base class for a cursor-type object for navigating an FST/word graph, represented by a GraphReader object.

>>> cur = GraphReader(dawgfile).cursor()
>>> for key in cur.follow():
...   print(repr(key))

The cursor “rests” on arcs in the FSA/FST graph, rather than nodes.

accept()

Returns True if the current arc leads to an accept state (the end of a valid key).

at_last_arc()

Returns True if the current arc is the last outgoing arc from the previous node.

find_path(path)

Follows the labels in the given path, starting at the current position.

flatten()

Yields the keys in the graph, starting at the current position.

flatten_v()

Yields (key, value) tuples in an FST, starting at the current position.

follow()

Follows the current arc.

is_active()

Returns True if this cursor is still active, that is it has not read past the last arc in the graph.

label()

Returns the label bytes of the current arc.

next_arc()

Moves to the next outgoing arc from the previous node.

peek_key()

Returns a sequence of label bytes representing the next closest key in the graph.

peek_key_bytes()

Returns the next closest key in the graph as a single bytes object.

peek_key_string()

Returns the next closest key in the graph as a decoded unicode string.

prefix()

Returns a sequence of the label bytes for the path from the root to the current arc.

prefix_bytes()

Returns the label bytes for the path from the root to the current arc as a single joined bytes object.

prefix_string()

Returns the labels of the path from the root to the current arc as a decoded unicode string.

skip_to(key)

Moves the cursor to the path represented by the given key bytes.

stopped()

Returns True if the current arc leads to a stop state.

switch_to(label)

Switch to the sibling arc with the given label bytes.

value()

Returns the value at the current arc, if reading an FST.

class whoosh.support.dawg.Cursor(graph, root=None, stack=None)
class whoosh.support.dawg.Arc(label=None, target=None, value=None, accept=False, acceptval=None)

Represents a directed arc between two nodes in an FSA/FST graph.

The lastarc attribute is True if this is the last outgoing arc from the previous node.

:param label:The label bytes for this arc. For a word graph, this will
be a character.
Parameters:
  • target – The address of the node at the endpoint of this arc.
  • value – The inner FST value at the endpoint of this arc.
  • accept – Whether the endpoint of this arc is an accept state (eg the end of a valid word).
  • acceptval – If the endpoint of this arc is an accept state, the final FST value for that accepted state.

IO classes

class whoosh.support.dawg.GraphWriter(dbfile, vtype=None, merge=None)

Writes an FSA/FST graph to disk.

Call insert(key) to insert keys into the graph. You must insert keys in sorted order. Call close() to finish the graph and close the file.

>>> gw = GraphWriter(my_file)
>>> gw.insert("alfa")
>>> gw.insert("bravo")
>>> gw.insert("charlie")
>>> gw.close()

The graph writer can write separate graphs for multiple fields. Use start_field(name) and finish_field() to separate fields.

>>> gw = GraphWriter(my_file)
>>> gw.start_field("content")
>>> gw.insert("alfalfa")
>>> gw.insert("apple")
>>> gw.finish_field()
>>> gw.start_field("title")
>>> gw.insert("artichoke")
>>> gw.finish_field()
>>> gw.close()
Parameters:
  • dbfile – the file to write to.
  • vtype – a Values class to use for storing values. This is only necessary if you will be storing values for the keys.
  • merge – a function that takes two values and returns a single value. This is called if you insert two identical keys with values.
close()

Finishes the current graph and closes the underlying file.

finish_field()

Finishes the graph for the current field.

insert(key, value=None)

Inserts the given key into the graph.

Parameters:
  • key – a sequence of bytes objects, a bytes object, or a string.
  • value – an optional value to encode in the graph along with the key. If the writer was not instantiated with a value type, passing a value here will raise an error.
start_field(fieldname)

Starts a new graph for the given field.

class whoosh.support.dawg.BaseGraphReader
class whoosh.support.dawg.GraphReader(dbfile, rootname=None, vtype=None, filebase=0)

Utility functions

whoosh.support.dawg.to_labels(key)

Takes a string and returns a list of bytestrings, suitable for use as a key or path in an FSA/FST graph.

whoosh.support.dawg.within(graph, text, k=1, prefix=0, address=None)

Yields a series of keys in the given graph within k edit distance of text. If prefix is greater than 0, all keys must match the first prefix characters of text.

FST Value classes

class whoosh.support.dawg.Values

Base for classes the describe how to encode and decode FST values.

static add(prefix, v)

Adds the given prefix (the result of a call to common()) to the given value.

static common(v1, v2)

Returns the “common” part of the two values, for whatever “common” means for this class. For example, a string implementation would return the common shared prefix, for an int implementation it would return the minimum of the two numbers.

If there is no common part, this method should return None.

static is_valid(v)

Returns True if v is a valid object that can be stored by this class.

static read(dbfile)

Reads a value from the given file.

classmethod skip(dbfile)

Skips over a value in the given file.

static subtract(v, prefix)

Subtracts the “common” part (the prefix) from the given value.

static to_bytes(v)

Returns a str (Python 2.x) or bytes (Python 3) representation of the given value. This is used for calculating node digests, so it should be unique but fast to calculate, and does not have to be parseable.

static write(dbfile, v)

Writes value v to a file.

class whoosh.support.dawg.IntValues

Stores integer values in an FST.

class whoosh.support.dawg.SequenceValues

Abstract base class for value types that store sequences.

class whoosh.support.dawg.BytesValues

Stores bytes objects (str in Python 2.x) in an FST.

class whoosh.support.dawg.ArrayValues(typecode)

Stores array.array objects in an FST.

class whoosh.support.dawg.IntListValues

Stores lists of positive, increasing integers (that is, lists of integers where each number is >= 0 and each number is greater than or equal to the number that precedes it) in an FST.

Table Of Contents

Previous topic

support.charset module

Next topic

support.levenshtein module

This Page