Class Ferret::Analysis::RegExpTokenizer
In: ext/r_analysis.c
Parent: Ferret::Analysis::TokenStream

A tokenizer that recognizes tokens based on a regular expression passed to the constructor. Most tokenizers you are likely to need can be built with this class.

Example

Below is a simple implementation of a LetterTokenizer using a RegExpTokenizer. Here a token is a sequence of one or more alphabetic characters, and any run of non-alphabetic characters separates one token from the next.

  # of course you would add more than just é
  RegExpTokenizer.new(input, /[[:alpha:]é]+/)

  "Dave's résumé, at http://www.davebalmain.com/ 1234"
    => ["Dave", "s", "résumé", "at", "http", "www", "davebalmain", "com"]
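The matching rule in the example above can be approximated in plain Ruby with String#scan, without loading Ferret. This is only a sketch of which substrings become token texts. It is not how RegExpTokenizer is implemented, it does not produce Ferret token objects, and the names LETTER_RE and letter_tokens are hypothetical.

```ruby
# Hypothetical helper, not part of Ferret: collects every maximal run of
# characters matched by the regexp, mimicking the token texts shown above.
LETTER_RE = /[[:alpha:]é]+/

def letter_tokens(input, regexp = LETTER_RE)
  # String#scan returns all non-overlapping matches, left to right.
  input.scan(regexp)
end

p letter_tokens("Dave's résumé, at http://www.davebalmain.com/ 1234")
```

Note that the digits "1234" and all punctuation are dropped because they never match the pattern.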

Methods

new   text   text=  

Constants

REGEXP = rtoken_re

Public Class methods

Create a new tokenizer based on a regular expression.

input:  text to tokenize
regexp: regular expression used to recognize tokens in the input

Public Instance methods

Get the text being tokenized by the tokenizer.

Set the text to be tokenized by the tokenizer. The tokenizer gets reset to tokenize the text from the beginning.
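The reset behaviour of text= can be illustrated with a small stand-in written in plain Ruby. This is a sketch under assumptions, not Ferret's implementation: the class SketchTokenizer is hypothetical, it is built on the standard library's StringScanner, and its next method returns bare strings rather than token objects.

```ruby
require 'strscan'

# Hypothetical stand-in for a regexp-based tokenizer (not Ferret code).
class SketchTokenizer
  def initialize(input, regexp)
    @regexp = regexp
    self.text = input
  end

  # Get the text being tokenized.
  def text
    @scanner.string
  end

  # Set new text; a fresh scanner starts again from the beginning.
  def text=(input)
    @scanner = StringScanner.new(input)
  end

  # Return the next matching substring, or nil when the input is exhausted.
  def next
    @scanner.scan_until(@regexp) && @scanner.matched
  end
end

tokenizer = SketchTokenizer.new("one two", /[[:alpha:]]+/)
puts tokenizer.next            # first token of the original text
tokenizer.text = "three four"  # replaces the text and resets the position
puts tokenizer.next            # first token of the new text
```

Replacing the scanner in text= is what makes tokenizing restart from the beginning, regardless of how far the previous text had been consumed.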
