zorba::Tokenizer Class Reference

A Tokenizer breaks a string into a stream of word tokens.

#include <zorba/tokenizer.h>

Classes

class  Callback
 A Callback is called once per token.
struct  Numbers
 A Numbers contains the current token, sentence, and paragraph numbers.

Public Types

enum  ElementTraceOptions { trace_none = 0x0, trace_begin = 0x1, trace_end = 0x2 }
 Trace options for XML elements combined via bitwise-or.
typedef std::unique_ptr< Tokenizer, internal::ztd::destroy_delete< Tokenizer > > ptr
typedef unsigned size_type

Public Member Functions

virtual void destroy () const =0
 Destroys this Tokenizer.
virtual void element (Item const &qname, int trace_options)
 This function is called whenever an XML element is entered during tokenization.
Numbers & numbers ()
 Gets this Tokenizer's associated Numbers.
Numbers const & numbers () const
 Gets this Tokenizer's associated Numbers.
virtual void tokenize (char const *utf8_s, size_type utf8_len, locale::iso639_1::type lang, bool wildcards, Callback &callback, void *payload=0)=0
 Tokenizes the given string.
int trace_options () const
 Gets the trace options.

Protected Member Functions

 Tokenizer (Numbers &numbers, int trace_options=trace_none)
 Constructs a Tokenizer.
virtual ~Tokenizer ()=0
 Destroys a Tokenizer.

Detailed Description

A Tokenizer breaks a string into a stream of word tokens.

Each token is assigned a token, sentence, and paragraph number.

A Tokenizer determines word and sentence boundaries automatically, but must be told when to increment the paragraph number.


Member Typedef Documentation

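typedef std::unique_ptr< Tokenizer, internal::ztd::destroy_delete< Tokenizer > > zorba::Tokenizer::ptr

Definition at line 42 of file tokenizer.h.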

typedef unsigned zorba::Tokenizer::size_type

Definition at line 44 of file tokenizer.h.


Member Enumeration Documentation

enum zorba::Tokenizer::ElementTraceOptions

Trace options for XML elements combined via bitwise-or.

Enumerator:
trace_none 

Trace no elements.

trace_begin 

Trace the beginning of elements.

trace_end 

Trace the ending of elements.

Definition at line 111 of file tokenizer.h.
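
Because the options are combined via bitwise-or, a single int can carry any mix of them. The following is a minimal sketch of combining and testing the flags. The enum here is a local stand-in that mirrors the enumerators above, and has_option and trace_both are hypothetical helpers, not part of the Zorba API.

```cpp
#include <cassert>

// Local stand-in mirroring the enumerators documented above; real code
// would use zorba::Tokenizer::ElementTraceOptions from <zorba/tokenizer.h>.
enum ElementTraceOptions {
  trace_none  = 0x0,
  trace_begin = 0x1,
  trace_end   = 0x2
};

// Hypothetical helper: true if `options` includes `flag`.
inline bool has_option(int options, ElementTraceOptions flag) {
  // trace_none is 0, so masking with it can never succeed; compare
  // against trace_none directly instead of calling this helper.
  assert(flag != trace_none);
  return (options & flag) != 0;
}

// Hypothetical helper: request tracing of both element begin and end.
inline int trace_both() {
  return trace_begin | trace_end;
}
```

A value built this way is the kind of argument the protected constructor's trace_options parameter expects.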


Constructor & Destructor Documentation

zorba::Tokenizer::Tokenizer ( Numbers &  numbers,
int  trace_options = trace_none 
) [protected]

Constructs a Tokenizer.

Parameters:
numbers        The Numbers to use.
trace_options  The bitwise-or of the available trace options, if any.

virtual zorba::Tokenizer::~Tokenizer ( ) [protected, pure virtual]

Destroys a Tokenizer.


Member Function Documentation

virtual void zorba::Tokenizer::destroy ( ) const [pure virtual]

Destroys this Tokenizer.

This function is called by Zorba when the Tokenizer is no longer needed.

If your TokenizerProvider dynamically allocates Tokenizer objects, then the implementation can be (and usually is) simply: delete this

If your TokenizerProvider returns a pointer to a static Tokenizer object, then the implementation should do nothing.
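
The two cases above can be sketched as follows. This is an illustration under stated assumptions, not Zorba code: TokenizerLike is a stand-in for the destroy() part of the interface, DestroyDeleter only imitates the idea behind the ptr typedef (whose real deleter is not shown on this page), and the live-object counter exists only to make the effect observable.

```cpp
#include <cassert>
#include <memory>

// Stand-in for the destroy() part of the documented interface; real code
// would derive from zorba::Tokenizer instead.
class TokenizerLike {
public:
  virtual void destroy() const = 0;
protected:
  virtual ~TokenizerLike() {}
};

// Hypothetical counter so the effect of destroy() can be observed.
static int g_live_heap_tokenizers = 0;

// Case 1: the provider allocates each tokenizer with new.
class HeapTokenizer : public TokenizerLike {
public:
  HeapTokenizer() { ++g_live_heap_tokenizers; }
  void destroy() const override { delete this; }
private:
  // Private destructor: objects can only be released through destroy().
  ~HeapTokenizer() override { --g_live_heap_tokenizers; }
};

// Case 2: the provider hands out a pointer to one static instance.
class StaticTokenizer : public TokenizerLike {
public:
  void destroy() const override { /* nothing to free */ }
  static StaticTokenizer& instance() {
    static StaticTokenizer t;
    return t;
  }
};

// Assumed deleter that calls destroy() rather than delete.
struct DestroyDeleter {
  void operator()(TokenizerLike const* t) const { if (t) t->destroy(); }
};
typedef std::unique_ptr<TokenizerLike, DestroyDeleter> TokenizerPtr;

// Returns the number of live heap tokenizers after an owning pointer
// has gone out of scope.
inline int live_after_scope() {
  {
    TokenizerPtr p(new HeapTokenizer);
    assert(g_live_heap_tokenizers == 1);
  }
  return g_live_heap_tokenizers;
}
```

With a deleter of this kind, the same smart pointer type works for both cases, because the object itself decides what releasing it means.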

virtual void zorba::Tokenizer::element ( Item const &  qname,
int  trace_options 
) [virtual]

This function is called whenever an XML element is entered during tokenization.

Note that this function is called only if trace_options() returns non-zero.

Parameters:
qname          The element's QName.
trace_options  The bitwise-or of the trace option(s) in effect for a particular call.
See also:
trace_options()
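
One plausible use of element() is to advance the paragraph number when a paragraph-like element begins. The sketch below assumes several things that this page does not confirm: that Numbers has a para field, that the element name can be compared as a plain string (the real parameter is an Item holding a QName), and that "p" is the element of interest. ParagraphTracker is a hypothetical stand-in, not a real Tokenizer subclass.

```cpp
#include <cassert>
#include <string>

enum ElementTraceOptions { trace_none = 0x0, trace_begin = 0x1, trace_end = 0x2 };

// Assumed layout: the page says Numbers holds token, sentence, and
// paragraph numbers but does not list its fields.
struct Numbers {
  unsigned token;
  unsigned sent;
  unsigned para;
  Numbers() : token(0), sent(0), para(0) {}
};

// Hypothetical stand-in for a Tokenizer subclass that traces elements.
class ParagraphTracker {
public:
  ParagraphTracker(Numbers& numbers, int trace_options)
    : numbers_(numbers), trace_options_(trace_options) {}

  int trace_options() const { return trace_options_; }

  // Bumps the paragraph number when a <p> element begins.
  void element(std::string const& local_name, int options) {
    // Per the note above, element() is called only when tracing is enabled.
    assert(trace_options_ != trace_none);
    if ((options & trace_begin) != 0 && local_name == "p")
      ++numbers_.para;
  }

private:
  Numbers& numbers_;
  int trace_options_;
};
```

Checking the per-call options with a mask, rather than with ==, keeps the test correct if a call ever carries more than one option.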

Tokenizer::Numbers & zorba::Tokenizer::numbers ( ) [inline]

Gets this Tokenizer's associated Numbers.

Returns:
Returns said Numbers.

Definition at line 193 of file tokenizer.h.

Tokenizer::Numbers const & zorba::Tokenizer::numbers ( ) const [inline]

Gets this Tokenizer's associated Numbers.

Returns:
Returns said Numbers.

Definition at line 197 of file tokenizer.h.

virtual void zorba::Tokenizer::tokenize ( char const *  utf8_s,
size_type  utf8_len,
locale::iso639_1::type  lang,
bool  wildcards,
Callback &  callback,
void *  payload = 0 
) [pure virtual]

Tokenizes the given string.

Parameters:
utf8_s     The UTF-8 string to tokenize. It need not be null-terminated.
utf8_len   The number of bytes in the string to be tokenized.
lang       The language of the string.
wildcards  If true, allows XQuery wildcard syntax characters to be part of tokens.
callback   The Callback to call once per token.
payload    Optional user-defined data.
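
The core contract (walk utf8_len bytes without relying on a terminator, and call the callback once per token) can be illustrated with a deliberately simple whitespace splitter. This is a sketch only: CollectingCallback and its token() signature are hypothetical, since the real Callback interface is not shown on this page, and the lang, wildcards, and payload parameters are omitted.

```cpp
#include <cassert>
#include <string>
#include <vector>

typedef unsigned size_type;

// Hypothetical stand-in for Tokenizer::Callback; it just stores tokens.
struct CollectingCallback {
  std::vector<std::string> tokens;
  void token(char const* utf8_s, size_type utf8_len, size_type token_no) {
    (void)token_no;  // a real callback would likely record this
    tokens.push_back(std::string(utf8_s, utf8_len));
  }
};

// ASCII whitespace only; bytes of multi-byte UTF-8 sequences never match.
inline bool is_space(char c) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// Splits on whitespace and reports each token once. A real tokenizer
// would also track sentence boundaries and honor the language.
inline void tokenize_ws(char const* utf8_s, size_type utf8_len,
                        CollectingCallback& callback) {
  assert(utf8_s != nullptr || utf8_len == 0);
  size_type token_no = 0;
  size_type i = 0;
  while (i < utf8_len) {
    // Skip whitespace between tokens.
    while (i < utf8_len && is_space(utf8_s[i])) ++i;
    size_type const start = i;
    // Consume one token, never reading past utf8_len.
    while (i < utf8_len && !is_space(utf8_s[i])) ++i;
    if (i > start)
      callback.token(utf8_s + start, i - start, token_no++);
  }
}
```

Because every read is bounded by utf8_len, the function can be handed a slice of a larger buffer, which is what a string that "need not be null-terminated" implies.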

int zorba::Tokenizer::trace_options ( ) const [inline]

Gets the trace options.

If the value is trace_none, then the paragraph number will be incremented upon entering an XML element; if the value is anything other than trace_none, then the tokenizer assumes responsibility for incrementing the paragraph number.

Returns:
Returns said options.

Definition at line 125 of file tokenizer.h.


The documentation for this class was generated from the following file: