A tokenizer that recognizes tokens based on a regular expression passed to the constructor. Most tokenizers you are likely to need can be built with this class.
Below is an example of a simple implementation of a LetterTokenizer using a RegExpTokenizer. In this case a token is a sequence of alphabetic characters separated by one or more non-alphabetic characters.
  # of course you would add more than just é
  RegExpTokenizer.new(input, /[[:alpha:]é]+/)

  "Dave's résumé, at http://www.davebalmain.com/ 1234"
    => ["Dave", "s", "résumé", "at", "http", "www", "davebalmain", "com"]
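For completeness, here is a sketch of pulling tokens out of the stream by hand. It assumes the class is exposed as Ferret::Analysis::RegExpTokenizer and that tokens are read with next (which returns nil once the input is exhausted), the usual way Ferret token streams are consumed; treat the exact names as assumptions rather than a definitive API.

  require 'ferret'
  include Ferret::Analysis

  input = "Dave's résumé, at http://www.davebalmain.com/ 1234"
  tz = RegExpTokenizer.new(input, /[[:alpha:]é]+/)

  tokens = []
  while (token = tz.next)      # next returns nil at the end of the input
    tokens << token.text
  end
  tokens #=> ["Dave", "s", "résumé", "at", "http", "www", "davebalmain", "com"]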
Create a new tokenizer based on a regular expression. The constructor takes the text to tokenize and the regular expression used to recognize tokens in that input.
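Because the token pattern is just a regular expression, other tokenizers fall out of the same constructor by swapping the pattern. A small illustrative sketch (the regexps below are examples, not part of the library):

  text = "foo bar-baz 42"

  # whitespace tokenizer: a token is a maximal run of non-space characters
  ws_tokenizer  = RegExpTokenizer.new(text, /\S+/)

  # number tokenizer: a token is a run of digits
  num_tokenizer = RegExpTokenizer.new(text, /\d+/)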
static VALUE
frb_rets_init(int argc, VALUE *argv, VALUE self)
{
    VALUE rtext, regex, proc;
    TokenStream *ts;

    /* one required argument (the text), one optional argument (the regexp)
     * and an optional block */
    rb_scan_args(argc, argv, "11&", &rtext, &regex, &proc);

    ts = rets_new(rtext, regex, proc);
    Frt_Wrap_Struct(self, &frb_rets_mark, &frb_rets_free, ts);
    object_add(ts, self);
    return self;
}
Get the text being tokenized by the tokenizer.
static VALUE
frb_rets_get_text(VALUE self)
{
    TokenStream *ts;
    GET_TS(ts, self);
    return RETS(ts)->rtext;
}
Set the text to be tokenized by the tokenizer. The tokenizer gets reset to tokenize the text from the beginning.
static VALUE
frb_rets_set_text(VALUE self, VALUE rtext)
{
    TokenStream *ts;
    GET_TS(ts, self);

    /* keep a reference to the new text so it is not garbage collected */
    rb_hash_aset(object_space, ((VALUE)ts)|1, rtext);
    StringValue(rtext);
    RETS(ts)->rtext = rtext;
    /* rewind so tokenizing restarts at the beginning of the new text */
    RETS(ts)->curr_ind = 0;

    return rtext;
}
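The getter and setter above back the Ruby-level text reader and writer. A hedged sketch of the behaviour described, assuming they are exposed as text and text= on the tokenizer:

  tz = RegExpTokenizer.new("one two", /[[:alpha:]]+/)
  tz.next.text          #=> "one"
  tz.text               #=> "one two"

  tz.text = "three four" # replaces the input and rewinds the stream
  tz.next.text          #=> "three"  (tokenizing restarts from the beginning)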