A Tokenizer breaks a string into a stream of word tokens.
#include <zorba/tokenizer.h>
Classes | |
class | Callback |
A Callback is called once per token. | |
struct | Properties |
Various properties of this Tokenizer. | |
struct | State |
A State contains inter-Tokenizer state: currently the token, sentence, and paragraph numbers. |
Public Types | |
typedef std::unique_ptr < Tokenizer, internal::ztd::destroy_delete < Tokenizer > > | ptr |
typedef unsigned | size_type |
Public Member Functions | |
virtual void | destroy () const =0 |
Destroys this Tokenizer. | |
virtual void | properties (Properties *result) const =0 |
Gets the Properties of this Tokenizer. | |
State & | state () |
Gets this Tokenizer's associated State. | |
State const & | state () const |
Gets this Tokenizer's associated State. | |
void | tokenize_node (Item const &node, locale::iso639_1::type lang, Callback &callback) |
Tokenizes the given node. | |
virtual void | tokenize_string (char const *utf8_s, size_type utf8_len, locale::iso639_1::type lang, bool wildcards, Callback &callback, Item const *item=0)=0 |
Tokenizes the given string. |
Protected Member Functions | |
bool | find_lang_attribute (Item const &element, locale::iso639_1::type *lang) |
Given an element, finds its xml:lang attribute, if any, and gets its value. | |
virtual void | item (Item const &item, bool entering) |
This member-function is called whenever an item that is being tokenized is entered or exited. | |
virtual void | tokenize_node_impl (Item const &node, locale::iso639_1::type lang, Callback &callback, bool tokenize_acp) |
Tokenizes the given node and all of its child nodes, if any. | |
Tokenizer (State &state) | |
Constructs a Tokenizer. | |
virtual | ~Tokenizer ()=0 |
Destroys a Tokenizer. |
A Tokenizer breaks a string into a stream of word tokens.
Each token is assigned a token, sentence, and paragraph number.
A Tokenizer determines word and sentence boundaries automatically, but must be told when to increment the paragraph number.
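The numbering scheme described above can be illustrated with a minimal, self-contained sketch. It does not use the real Zorba headers: SketchState, SketchToken, and sketch_tokenize are hypothetical stand-ins, the State field names are assumed, and the word and sentence rules (split on whitespace; a trailing '.', '!' or '?' ends a sentence) are simplifications chosen only for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for zorba::Tokenizer::State (field names are assumed).
struct SketchState {
  unsigned token = 0;  // number of the next token
  unsigned sent = 0;   // current sentence number
  unsigned para = 0;   // current paragraph number
};

// One emitted token together with the numbers assigned to it.
struct SketchToken {
  std::string text;
  unsigned token_no, sent_no, para_no;
};

// Splits on whitespace. A word ending in '.', '!' or '?' closes its sentence.
// The paragraph number is only read here; the caller must increment it.
inline void sketch_tokenize(std::string const &s, SketchState &state,
                            std::vector<SketchToken> &out) {
  std::size_t i = 0;
  while (i < s.size()) {
    while (i < s.size() && std::isspace(static_cast<unsigned char>(s[i]))) ++i;
    std::size_t const start = i;
    while (i < s.size() && !std::isspace(static_cast<unsigned char>(s[i]))) ++i;
    if (i == start) break;  // only trailing whitespace was left
    std::string word = s.substr(start, i - start);
    bool const ends_sentence =
        word.back() == '.' || word.back() == '!' || word.back() == '?';
    if (ends_sentence) word.pop_back();
    if (!word.empty())
      out.push_back({word, state.token++, state.sent, state.para});
    if (ends_sentence) ++state.sent;
  }
}
```

Because the state lives outside the function, two successive calls continue the same token and sentence numbering, and the caller decides between calls whether a new paragraph has started by incrementing the paragraph number itself.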
Definition at line 41 of file tokenizer.h.
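typedef std::unique_ptr<Tokenizer, internal::ztd::destroy_delete<Tokenizer> > zorba::Tokenizer::ptr |
Definition at line 44 of file tokenizer.h.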
typedef unsigned zorba::Tokenizer::size_type |
Definition at line 46 of file tokenizer.h.
|
inline, protected |
Constructs a Tokenizer.
state | the State to use. |
Definition at line 262 of file tokenizer.h.
|
protected, pure virtual |
Destroys a Tokenizer.
|
pure virtual |
Destroys this Tokenizer.
This function is called by Zorba when the Tokenizer is no longer needed.
If your TokenizerProvider dynamically allocates Tokenizer objects, then the implementation can simply be (and usually is) delete this.
If your TokenizerProvider returns a pointer to a static Tokenizer object, then the implementation should do nothing.
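The two destroy() strategies can be sketched as follows. This is a self-contained illustration, not the Zorba implementation: SketchTokenizer, SketchDestroyDelete, HeapTokenizer, StaticTokenizer, and get_static_tokenizer are hypothetical names, and the deleter only mimics what a destroy()-calling deleter such as the one named in the ptr typedef might do.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the Tokenizer interface; only destroy() is modelled.
class SketchTokenizer {
public:
  virtual void destroy() const = 0;
protected:
  virtual ~SketchTokenizer() {}
};

// Deleter that calls destroy() instead of delete (an assumption about how a
// destroy()-based deleter could behave; the real one is not shown in this page).
struct SketchDestroyDelete {
  void operator()(SketchTokenizer const *t) const {
    if (t) t->destroy();
  }
};

typedef std::unique_ptr<SketchTokenizer, SketchDestroyDelete> SketchPtr;

// Case 1: dynamically allocated tokenizer; destroy() frees the object.
class HeapTokenizer : public SketchTokenizer {
public:
  explicit HeapTokenizer(int *live) : live_(live) { ++*live_; }
  void destroy() const override { delete this; }
private:
  ~HeapTokenizer() override { --*live_; }  // private: heap allocation only
  int *live_;  // counts live instances so the effect of destroy() is observable
};

// Case 2: a single static tokenizer; destroy() deliberately does nothing.
class StaticTokenizer : public SketchTokenizer {
public:
  void destroy() const override {}
  ~StaticTokenizer() override {}
};

inline SketchTokenizer *get_static_tokenizer() {
  static StaticTokenizer instance;  // lives until program exit
  return &instance;
}
```

With this arrangement the owner of the smart pointer never needs to know which allocation strategy the provider chose: releasing the pointer calls destroy() in both cases, and only the heap-allocated object is actually freed.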
|
protected |
Given an element, finds its xml:lang attribute, if any, and gets its value.
element | The element to check. |
lang | A pointer to where to put the found language, if any. |
Returns true only if an xml:lang attribute is found and the value is a known language.
|
protected, virtual |
This member-function is called whenever an item that is being tokenized is entered or exited.
item | The item being entered or exited. |
entering | If true, the item is being entered; if false, the item is being exited. |
|
pure virtual |
Gets the Properties of this Tokenizer.
result | The Properties to populate. |
|
inline |
Gets this Tokenizer's associated State.
Definition at line 265 of file tokenizer.h.
|
inline |
Gets this Tokenizer's associated State.
Definition at line 269 of file tokenizer.h.
|
inline |
Tokenizes the given node.
node | The node to tokenize. |
lang | The default language to use. |
callback | The Callback to call once per token. |
Definition at line 273 of file tokenizer.h.
References tokenize_node_impl().
|
protected, virtual |
Tokenizes the given node and all of its child nodes, if any.
For each node, this function must call the item() member function of both this Tokenizer and the Callback twice: once on entrance and once on exit.
node | The node to tokenize. |
lang | The default language to use. |
callback | The Callback to call per token. |
tokenize_acp | If true, additionally tokenize all attribute, comment, and processing-instruction nodes encountered; if false, skip them. |
Referenced by tokenize_node().
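The enter/exit requirement on tokenize_node_impl() can be sketched with a small recursive traversal. Everything here is a hypothetical stand-in rather than the Zorba API: SketchNode replaces Item, the language and tokenize_acp parameters are omitted, and the order in which the Tokenizer and the Callback are notified is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a node Item: a name plus child nodes.
struct SketchNode {
  std::string name;
  std::vector<SketchNode> children;
};

// Hypothetical stand-in for Callback; records each item() notification.
struct SketchCallback {
  std::vector<std::string> events;
  void item(SketchNode const &n, bool entering) {
    events.push_back((entering ? "cb+" : "cb-") + n.name);
  }
};

// Hypothetical stand-in for Tokenizer; only the traversal contract is modelled.
class SketchTokenizer {
public:
  virtual ~SketchTokenizer() {}
  std::vector<std::string> events;

  void tokenize_node(SketchNode const &node, SketchCallback &callback) {
    tokenize_node_impl(node, callback);
  }

protected:
  virtual void item(SketchNode const &n, bool entering) {
    events.push_back((entering ? "tk+" : "tk-") + n.name);
  }

  virtual void tokenize_node_impl(SketchNode const &node,
                                  SketchCallback &callback) {
    // Entrance: notify this tokenizer and the callback.
    item(node, true);
    callback.item(node, true);
    // A real implementation would tokenize the node's own text here.
    for (SketchNode const &child : node.children)
      tokenize_node_impl(child, callback);
    // Exit: notify both again, after all children have been processed.
    item(node, false);
    callback.item(node, false);
  }
};
```

For a root node with one child, both event lists contain four entries in properly nested order (enter root, enter child, exit child, exit root), which is what lets an implementation track state such as the current language or paragraph per element.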
|
pure virtual |
Tokenizes the given string.
utf8_s | The UTF-8 string to tokenize. It need not be null-terminated. |
utf8_len | The number of bytes in the string to be tokenized. |
lang | The language of the string. |
wildcards | If true, allows XQuery wildcard syntax characters to be part of tokens. |
callback | The Callback to call once per token. |
item | The Item this string is from, if any. |
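The utf8_s, utf8_len, and wildcards parameters can be illustrated with a simplified, self-contained sketch. It is not the Zorba implementation: sketch_tokenize_string and sketch_is_wildcard_char are hypothetical, tokens are returned in a vector rather than passed to a Callback, the lang and item parameters are omitted, and the set of characters treated as wildcard syntax ('.', '*', '?', '+') is an assumption made for this example only.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Characters treated as wildcard syntax in this sketch only (an assumption).
inline bool sketch_is_wildcard_char(char c) {
  return c == '.' || c == '*' || c == '?' || c == '+';
}

// Tokenizes exactly utf8_len bytes starting at utf8_s; the buffer need not be
// null-terminated, so the loop never looks for a terminating '\0'.
inline std::vector<std::string> sketch_tokenize_string(char const *utf8_s,
                                                       unsigned utf8_len,
                                                       bool wildcards) {
  std::vector<std::string> tokens;
  std::string cur;
  for (unsigned i = 0; i < utf8_len; ++i) {
    unsigned char const c = static_cast<unsigned char>(utf8_s[i]);
    // Bytes >= 0x80 belong to multi-byte UTF-8 sequences; keep them in the word.
    bool const word_char =
        c >= 0x80 || std::isalnum(c) ||
        (wildcards && sketch_is_wildcard_char(static_cast<char>(c)));
    if (word_char) {
      cur += static_cast<char>(c);
    } else if (!cur.empty()) {
      tokens.push_back(cur);
      cur.clear();
    }
  }
  if (!cur.empty()) tokens.push_back(cur);
  return tokens;
}
```

Passing the length in bytes rather than in characters means a caller can tokenize a slice of a larger buffer without copying it, and the same input yields different tokens depending on whether wildcard characters are allowed to be part of a token.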