Class | Ferret::Analysis::RegExpTokenizer |
In: | ext/r_analysis.c |
Parent: | Ferret::Analysis::TokenStream |
A tokenizer that recognizes tokens based on a regular expression passed to the constructor. Each match of the expression against the input becomes one token, so most tokenizers can be built with this class.
Below is an example of a simple implementation of a LetterTokenizer using a RegExpTokenizer. Here a token is a run of one or more alphabetic characters; any non-alphabetic characters between runs act as separators and are discarded.
  # of course you would add more than just é
  RegExpTokenizer.new(input, /[[:alpha:]é]+/)

  "Dave's résumé, at http://www.davebalmain.com/ 1234"
    => ["Dave", "s", "résumé", "at", "http", "www", "davebalmain", "com"]
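The matching behaviour in the example above can be tried without Ferret installed. The following is a minimal sketch in plain Ruby, assuming that the token texts correspond to the successive non-overlapping matches of the pattern. It uses String#scan rather than RegExpTokenizer itself, and the names LETTER_PATTERN and letter_tokens are hypothetical helpers, not part of Ferret. It returns only the token text, not the offsets a real TokenStream would carry.

```ruby
# Hypothetical stand-in for the example above; this is NOT Ferret's API.
# It collects every non-overlapping match of the pattern, which is assumed
# to mirror the token texts a regexp-driven tokenizer would emit.
LETTER_PATTERN = /[[:alpha:]é]+/

# Returns an array of token strings found in +input+.
# The pattern should contain no capture groups, otherwise String#scan
# returns arrays of group captures instead of whole matches.
def letter_tokens(input, pattern = LETTER_PATTERN)
  input.scan(pattern)
end

tokens = letter_tokens("Dave's résumé, at http://www.davebalmain.com/ 1234")
p tokens
```

The apostrophe, punctuation, and digits never match the pattern, so "Dave's" splits into two tokens and "1234" yields none. Swapping in a different pattern, for example /\S+/, would give whitespace-style tokenization under the same assumption.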