zorba::Tokenizer Class Reference

A Tokenizer breaks a string into a stream of word tokens.

#include <zorba/tokenizer.h>

Classes

class  Callback
 A Callback is called once per token.
struct  Properties
 Various properties of this Tokenizer.
struct  State
 A State contains inter-Tokenizer state: currently the token, sentence, and paragraph numbers.

Public Types

typedef std::unique_ptr< Tokenizer, internal::ztd::destroy_delete< Tokenizer > >  ptr
typedef unsigned size_type

Public Member Functions

virtual void destroy () const =0
 Destroys this Tokenizer.
virtual void properties (Properties *result) const =0
 Gets the Properties of this Tokenizer.
State & state ()
 Gets this Tokenizer's associated State.
State const & state () const
 Gets this Tokenizer's associated State.
void tokenize_node (Item const &node, locale::iso639_1::type lang, Callback &callback)
 Tokenizes the given node.
virtual void tokenize_string (char const *utf8_s, size_type utf8_len, locale::iso639_1::type lang, bool wildcards, Callback &callback, Item const *item=0)=0
 Tokenizes the given string.

Protected Member Functions

bool find_lang_attribute (Item const &element, locale::iso639_1::type *lang)
 Given an element, finds its xml:lang attribute, if any, and gets its value.
virtual void item (Item const &item, bool entering)
 This member-function is called whenever an item that is being tokenized is entered or exited.
virtual void tokenize_node_impl (Item const &node, locale::iso639_1::type lang, Callback &callback, bool tokenize_acp)
 Tokenizes the given node and all of its child nodes, if any.
 Tokenizer (State &state)
 Constructs a Tokenizer.
virtual ~Tokenizer ()=0
 Destroys a Tokenizer.

Detailed Description

A Tokenizer breaks a string into a stream of word tokens.

Each token is assigned a token, sentence, and paragraph number.

A Tokenizer determines word and sentence boundaries automatically, but must be told when to increment the paragraph number.
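
The numbering scheme described above can be illustrated with a toy example. This is only a sketch: `ToyToken`, `ToyState`, and `toy_tokenize` are hypothetical stand-ins, not the types declared in `<zorba/tokenizer.h>`, and the boundary rules (split on whitespace; `.`, `!`, or `?` ends a sentence) are assumptions chosen for brevity. The point it shows is that token and sentence numbers advance on their own, while the paragraph number changes only when the caller increments it.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-ins; NOT the types declared in <zorba/tokenizer.h>.
struct ToyToken {
  std::string text;
  unsigned token_no;
  unsigned sent_no;
  unsigned para_no;
};

// Counters that outlive a single call, in the spirit of Tokenizer::State.
struct ToyState {
  unsigned token = 0;
  unsigned sent = 1;
  unsigned para = 1;
};

// Splits on whitespace; '.', '!' or '?' ends a sentence (assumed rules).
// The paragraph number is never changed here: the caller must increment it.
std::vector<ToyToken> toy_tokenize(std::string const &s, ToyState &st) {
  std::vector<ToyToken> out;
  std::string word;
  auto flush = [&]() {
    if (!word.empty()) {
      out.push_back({word, st.token++, st.sent, st.para});
      word.clear();
    }
  };
  for (char c : s) {
    if (std::isspace(static_cast<unsigned char>(c))) {
      flush();
    } else if (c == '.' || c == '!' || c == '?') {
      flush();
      ++st.sent;  // sentence boundary detected automatically
    } else {
      word += c;
    }
  }
  flush();
  return out;
}
```

A caller that knows it has reached a new paragraph would write `++st.para;` between two calls to `toy_tokenize`; tokens from the second call then carry the new paragraph number while the token counter continues where it left off.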

Definition at line 41 of file tokenizer.h.


Member Typedef Documentation

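typedef std::unique_ptr<Tokenizer, internal::ztd::destroy_delete<Tokenizer> > zorba::Tokenizer::ptr

Definition at line 44 of file tokenizer.h.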

typedef unsigned zorba::Tokenizer::size_type

Definition at line 46 of file tokenizer.h.


Constructor & Destructor Documentation

zorba::Tokenizer::Tokenizer ( State &  state )
inline, protected

Constructs a Tokenizer.

Parameters:
state  The State to use.

Definition at line 262 of file tokenizer.h.

virtual zorba::Tokenizer::~Tokenizer ( )
protected, pure virtual

Destroys a Tokenizer.


Member Function Documentation

virtual void zorba::Tokenizer::destroy ( ) const
pure virtual

Destroys this Tokenizer.

This function is called by Zorba when the Tokenizer is no longer needed.

If your TokenizerProvider dynamically allocates Tokenizer objects, then the implementation can simply be (and usually is) delete this.

If your TokenizerProvider returns a pointer to a static Tokenizer object, then the implementation should do nothing.
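
The two ownership cases above can be sketched as follows. Everything here is a hypothetical stand-in: `ToyTokenizer`, `toy_destroy_delete`, `HeapTokenizer`, and `StaticTokenizer` are not Zorba declarations, and the assumption that a `destroy_delete`-style deleter simply calls `destroy()` is inferred from its name and from the `ptr` typedef, not confirmed by this page.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in mirroring the shape of the documented interface;
// NOT the declaration from <zorba/tokenizer.h>.
struct ToyTokenizer {
  virtual void destroy() const = 0;
protected:
  virtual ~ToyTokenizer() {}
};

// Assumed behaviour of a destroy_delete-style deleter: call destroy()
// rather than delete, so the object decides how it is released.
template<class T>
struct toy_destroy_delete {
  void operator()(T *p) const { if (p) p->destroy(); }
};

typedef std::unique_ptr<ToyTokenizer, toy_destroy_delete<ToyTokenizer> > toy_ptr;

// Case 1: the provider allocates with new, so destroy() is "delete this".
struct HeapTokenizer : ToyTokenizer {
  explicit HeapTokenizer(int *live) : live_(live) { ++*live_; }
  void destroy() const override { delete this; }
private:
  ~HeapTokenizer() override { --*live_; }  // only reachable through destroy()
  int *live_;                              // counts live instances for the demo
};

// Case 2: the provider hands out a static object, so destroy() does nothing.
struct StaticTokenizer : ToyTokenizer {
  void destroy() const override {}
  static StaticTokenizer &instance() {
    static StaticTokenizer t;
    return t;
  }
};
```

With this arrangement, `toy_ptr p(new HeapTokenizer(&count));` frees the object when `p` goes out of scope, whereas `toy_ptr q(&StaticTokenizer::instance());` leaves the static object untouched. Calling code is identical in both cases, which is the reason for routing destruction through `destroy()`.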

bool zorba::Tokenizer::find_lang_attribute ( Item const &  element,
locale::iso639_1::type *  lang
)
protected

Given an element, finds its xml:lang attribute, if any, and gets its value.

Parameters:
element  The element to check.
lang  A pointer to where to put the found language, if any.
Returns:
Returns true only if an xml:lang attribute is found and the value is a known language.

virtual void zorba::Tokenizer::item ( Item const &  item,
bool  entering
)
protected, virtual

This member function is called whenever an item that is being tokenized is entered or exited.

Parameters:
item  The item being entered or exited.
entering  If true, the item is being entered; if false, the item is being exited.

virtual void zorba::Tokenizer::properties ( Properties *  result ) const
pure virtual

Gets the Properties of this Tokenizer.

Parameters:
result  The Properties to populate.

Tokenizer::State & zorba::Tokenizer::state ( )
inline

Gets this Tokenizer's associated State.

Returns:
Returns said State.

Definition at line 265 of file tokenizer.h.

Tokenizer::State const & zorba::Tokenizer::state ( ) const
inline

Gets this Tokenizer's associated State.

Returns:
Returns said State.

Definition at line 269 of file tokenizer.h.

void zorba::Tokenizer::tokenize_node ( Item const &  node,
locale::iso639_1::type  lang,
Callback &  callback
)
inline

Tokenizes the given node.

Parameters:
node  The node to tokenize.
lang  The default language to use.
callback  The Callback to call once per token.

Definition at line 273 of file tokenizer.h.

References tokenize_node_impl().

virtual void zorba::Tokenizer::tokenize_node_impl ( Item const &  node,
locale::iso639_1::type  lang,
Callback &  callback,
bool  tokenize_acp
)
protected, virtual

Tokenizes the given node and all of its child nodes, if any.

For each node, this function must call the item() member function of both this Tokenizer and the Callback twice: once on entrance and once on exit.

Parameters:
node  The node to tokenize.
lang  The default language to use.
callback  The Callback to call once per token.
tokenize_acp  If true, additionally tokenize all attribute, comment, and processing-instruction nodes encountered; if false, skip them.
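
The enter/exit requirement can be sketched with a small recursive walk. `ToyNode`, `ToyCallback`, and `ToyTokenizer` are hypothetical stand-ins rather than Zorba's `Item`, `Callback`, and `Tokenizer`, and the `lang` and `tokenize_acp` parameters are omitted to keep the traversal visible. The order in which the Tokenizer and the Callback are notified is an assumption; this page only requires that each be notified once on entrance and once on exit.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins; NOT Zorba's Item, Callback, or Tokenizer.
struct ToyNode {
  std::string name;
  std::vector<ToyNode> children;
};

struct ToyCallback {
  std::vector<std::string> log;
  void item(ToyNode const &n, bool entering) {
    log.push_back((entering ? "cb+" : "cb-") + n.name);
  }
};

struct ToyTokenizer {
  std::vector<std::string> log;

  void item(ToyNode const &n, bool entering) {
    log.push_back((entering ? "tk+" : "tk-") + n.name);
  }

  // Notify both parties on entrance, recurse into children,
  // then notify both parties on exit.
  void tokenize_node_impl(ToyNode const &n, ToyCallback &cb) {
    item(n, true);
    cb.item(n, true);
    for (ToyNode const &child : n.children)
      tokenize_node_impl(child, cb);
    cb.item(n, false);
    item(n, false);
  }
};
```

For a node `a` with a single child `b`, both logs record the sequence enter `a`, enter `b`, exit `b`, exit `a`, so every entrance is matched by an exit in properly nested order.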

Referenced by tokenize_node().

virtual void zorba::Tokenizer::tokenize_string ( char const *  utf8_s,
size_type  utf8_len,
locale::iso639_1::type  lang,
bool  wildcards,
Callback &  callback,
Item const *  item = 0
)
pure virtual

Tokenizes the given string.

Parameters:
utf8_s  The UTF-8 string to tokenize. It need not be null-terminated.
utf8_len  The number of bytes in the string to be tokenized.
lang  The language of the string.
wildcards  If true, allows XQuery wildcard syntax characters to be part of tokens.
callback  The Callback to call once per token.
item  The Item this string is from, if any.
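
A minimal sketch of how an implementation might honor the `utf8_len` and `wildcards` parameters follows. The names `ToyCallback`, `ToyCallback::token`, `is_token_byte`, and `toy_tokenize_string` are hypothetical; the real Callback interface is documented on its own page and is not reproduced here. The set of characters treated as wildcard syntax (`.`, `*`, `?`, `+`) is a simplifying assumption, and the `lang` and `item` parameters are left out.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

typedef unsigned size_type;

// Hypothetical stand-in for a per-token callback; NOT zorba::Tokenizer::Callback.
struct ToyCallback {
  std::vector<std::string> tokens;
  void token(char const *s, size_type len) {
    tokens.push_back(std::string(s, len));
  }
};

// True if byte c may appear inside a token. Bytes >= 0x80 belong to
// multi-byte UTF-8 sequences and are kept together with their neighbours.
static bool is_token_byte(unsigned char c, bool wildcards) {
  if (c >= 0x80 || std::isalnum(c))
    return true;
  // Assumed wildcard character set; a simplification for illustration.
  return wildcards && (c == '.' || c == '*' || c == '?' || c == '+');
}

// Scans exactly utf8_len bytes; never relies on a terminating '\0'.
void toy_tokenize_string(char const *utf8_s, size_type utf8_len,
                         bool wildcards, ToyCallback &cb) {
  assert(utf8_s != nullptr || utf8_len == 0);
  size_type start = 0;
  bool in_token = false;
  for (size_type i = 0; i < utf8_len; ++i) {
    bool tok = is_token_byte(static_cast<unsigned char>(utf8_s[i]), wildcards);
    if (tok && !in_token) {
      start = i;
      in_token = true;
    } else if (!tok && in_token) {
      cb.token(utf8_s + start, i - start);
      in_token = false;
    }
  }
  if (in_token)
    cb.token(utf8_s + start, utf8_len - start);  // token running to the end
}
```

Passing a length shorter than the underlying buffer restricts tokenization to that prefix. With `wildcards` set to false, the input `wild.* card` yields the tokens `wild` and `card`; with it set to true, the first token becomes `wild.*`.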

The documentation for this class was generated from the following file: