org.gjt.xpp.impl.tokenizer

Class Tokenizer


public class Tokenizer
extends java.lang.Object

Simple XML Tokenizer (SXT) performs input stream tokenizing.
Author:
Aleksander Slominski

Field Summary

static byte
ATTR_CHARACTERS
static byte
ATTR_CONTENT
static byte
ATTR_NAME
static byte
CDSECT
static byte
CHARACTERS
static byte
CHAR_REF
static byte
COMMENT
static byte
CONTENT
static byte
DOCTYPE
static byte
EMPTY_ELEMENT
static byte
END_DOCUMENT
static byte
ENTITY_REF
static byte
ETAG_NAME
protected static int
LOOKUP_MAX
protected static char
LOOKUP_MAX_CHAR
static byte
PI
static byte
STAG_END
static byte
STAG_NAME
char[]
buf
protected static boolean[]
lookupNameChar
protected static boolean[]
lookupNameStartChar
int
nsColonCount
boolean
paramNotifyAttValue
boolean
paramNotifyCDSect
boolean
paramNotifyCharRef
boolean
paramNotifyCharacters
boolean
paramNotifyComment
boolean
paramNotifyDoctype
boolean
paramNotifyEntityRef
boolean
paramNotifyPI
boolean
parsedContent
This flag decides which buffer will be used to retrieve content for the current token.
char[]
pc
This is the buffer for parsed content, such as the actual value of an entity ('&lt;' in buf, but '<' in pc).
int
pcEnd
int
pcStart
Range [pcStart, pcEnd) defines part of pc that is content of current token iff parsedContent == true
int
pos
Position of the next char that will be read from the buffer.
int
posEnd
int
posNsColon
int
posStart
Range [posStart, posEnd) defines part of buf that is content of current token iff parsedContent == false
boolean
seenContent

Constructor Summary

Tokenizer()

Method Summary

int
getBufferShrinkOffset()
int
getColumnNumber()
int
getHardLimit()
int
getLineNumber()
String
getPosDesc()
Return a string describing the current position of the parser as text 'at line %d (row) and column %d (column) [seen %s...]'.
int
getSoftLimit()
boolean
isAllowedMixedContent()
boolean
isBufferShrinkable()
protected boolean
isNameChar(char ch)
protected boolean
isNameStartChar(char ch)
protected boolean
isS(char ch)
Determine if ch is whitespace ([3] S)
byte
next()
Return the next recognized token, or END_DOCUMENT if there is no more input.
void
reset()
void
setAllowedMixedContent(boolean enable)
Set support for mixed content.
void
setBufferShrinkable(boolean shrinkable)
void
setHardLimit(int value)
Set hard limit on internal buffer size.
void
setInput(Reader r)
Reset tokenizer state and set new input source
void
setInput(char[] data)
Reset tokenizer state and set new input source
void
setInput(char[] data, int off, int len)
void
setNotifyAll(boolean enable)
Set notification of all XML content tokens: Characters, Comment, CDSect, Doctype, PI, EntityRef, CharRef and AttValue (tokens for STag, ETag and Attribute are always sent).
void
setParseContent(boolean enable)
Allow reporting parsed content for element content and attribute content (no need to deal with low level tokens such as in setNotifyAll).
void
setSoftLimit(int value)
Set soft limit on internal buffer size.

Field Details

ATTR_CHARACTERS

public static final byte ATTR_CHARACTERS
Field Value:
124

ATTR_CONTENT

public static final byte ATTR_CONTENT
Field Value:
127

ATTR_NAME

public static final byte ATTR_NAME
Field Value:
122

CDSECT

public static final byte CDSECT
Field Value:
30

CHARACTERS

public static final byte CHARACTERS
Field Value:
20

CHAR_REF

public static final byte CHAR_REF
Field Value:
75

COMMENT

public static final byte COMMENT
Field Value:
40

CONTENT

public static final byte CONTENT
Field Value:
10

DOCTYPE

public static final byte DOCTYPE
Field Value:
50

EMPTY_ELEMENT

public static final byte EMPTY_ELEMENT
Field Value:
111

END_DOCUMENT

public static final byte END_DOCUMENT
Field Value:
2

ENTITY_REF

public static final byte ENTITY_REF
Field Value:
70

ETAG_NAME

public static final byte ETAG_NAME
Field Value:
110

LOOKUP_MAX

protected static final int LOOKUP_MAX
Field Value:
1024

LOOKUP_MAX_CHAR

protected static final char LOOKUP_MAX_CHAR
Field Value:
'\u0400'

PI

public static final byte PI
Field Value:
60

STAG_END

public static final byte STAG_END
Field Value:
112

STAG_NAME

public static final byte STAG_NAME
Field Value:
120

buf

public char[] buf

lookupNameChar

protected static boolean[] lookupNameChar

lookupNameStartChar

protected static boolean[] lookupNameStartChar

nsColonCount

public int nsColonCount

paramNotifyAttValue

public boolean paramNotifyAttValue

paramNotifyCDSect

public boolean paramNotifyCDSect

paramNotifyCharRef

public boolean paramNotifyCharRef

paramNotifyCharacters

public boolean paramNotifyCharacters

paramNotifyComment

public boolean paramNotifyComment

paramNotifyDoctype

public boolean paramNotifyDoctype

paramNotifyEntityRef

public boolean paramNotifyEntityRef

paramNotifyPI

public boolean paramNotifyPI

parsedContent

public boolean parsedContent
This flag decides which buffer will be used to retrieve content for the current token. If true, use pc and [pcStart, pcEnd); if false, use buf and [posStart, posEnd).
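The selection rule for parsedContent can be sketched as a small helper. This is only an illustration: TokenView and currentText() are hypothetical names, and the class is a stand-in that mirrors the public fields documented here rather than the Tokenizer itself.

```java
// Hypothetical stand-in mirroring the public Tokenizer fields described on this page.
final class TokenView {
    char[] buf;            // raw input buffer
    int posStart, posEnd;  // [posStart, posEnd) in buf
    char[] pc;             // parsed content buffer
    int pcStart, pcEnd;    // [pcStart, pcEnd) in pc
    boolean parsedContent; // true: read from pc, false: read from buf

    // Returns the current token text from whichever buffer the flag selects.
    String currentText() {
        return parsedContent
            ? new String(pc, pcStart, pcEnd - pcStart)
            : new String(buf, posStart, posEnd - posStart);
    }

    public static void main(String[] args) {
        TokenView t = new TokenView();
        t.buf = "<a>1 &lt; 2</a>".toCharArray();
        t.posStart = 3;   // start of the raw element content in buf
        t.posEnd = 11;    // one past its last character
        t.pc = "1 < 2".toCharArray();
        t.pcStart = 0;
        t.pcEnd = 5;

        t.parsedContent = false;
        System.out.println(t.currentText()); // raw text, entity still escaped
        t.parsedContent = true;
        System.out.println(t.currentText()); // parsed text, entity resolved
    }
}
```

With a real Tokenizer instance the same expression would presumably be applied to its public fields after each call to next().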

pc

public char[] pc
This is the buffer for parsed content, such as the actual value of an entity ('&lt;' in buf, but '<' in pc).

pcEnd

public int pcEnd

pcStart

public int pcStart
Range [pcStart, pcEnd) defines part of pc that is content of current token iff parsedContent == true

pos

public int pos
Position of the next char that will be read from the buffer.

posEnd

public int posEnd

posNsColon

public int posNsColon

posStart

public int posStart
Range [posStart, posEnd) defines part of buf that is content of current token iff parsedContent == false

seenContent

public boolean seenContent

Constructor Details

Tokenizer

public Tokenizer()

Method Details

getBufferShrinkOffset

public int getBufferShrinkOffset()

getColumnNumber

public int getColumnNumber()

getHardLimit

public int getHardLimit()

getLineNumber

public int getLineNumber()

getPosDesc

public String getPosDesc()
Return a string describing the current position of the parser as text 'at line %d (row) and column %d (column) [seen %s...]'.

getSoftLimit

public int getSoftLimit()

isAllowedMixedContent

public boolean isAllowedMixedContent()

isBufferShrinkable

public boolean isBufferShrinkable()

isNameChar

protected boolean isNameChar(char ch)

isNameStartChar

protected boolean isNameStartChar(char ch)
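This page does not say how LOOKUP_MAX, LOOKUP_MAX_CHAR and the lookup arrays are used. One plausible reading is a precomputed table for code points below LOOKUP_MAX, with a slower check for everything else. The sketch below illustrates that pattern only; NameCharLookup and slowIsNameStartChar are hypothetical, and the character rule is a deliberate simplification, not the XML name grammar or the library's code.

```java
// Hypothetical illustration of a table-driven character test.
final class NameCharLookup {
    static final int LOOKUP_MAX = 0x400;                    // 1024, as in the field details
    static final char LOOKUP_MAX_CHAR = (char) LOOKUP_MAX;  // '\u0400'
    static final boolean[] lookupNameStartChar = new boolean[LOOKUP_MAX];

    static {
        // Fill the table once so the common case becomes a single array read.
        for (char ch = 0; ch < LOOKUP_MAX_CHAR; ch++) {
            lookupNameStartChar[ch] = slowIsNameStartChar(ch);
        }
    }

    // Simplified rule (assumption): letters, '_' and ':' may start a name.
    static boolean slowIsNameStartChar(char ch) {
        return Character.isLetter(ch) || ch == '_' || ch == ':';
    }

    // Fast path for low code points, slow path for the rest.
    static boolean isNameStartChar(char ch) {
        return ch < LOOKUP_MAX_CHAR ? lookupNameStartChar[ch] : slowIsNameStartChar(ch);
    }

    public static void main(String[] args) {
        System.out.println(isNameStartChar('a'));
        System.out.println(isNameStartChar('1'));
    }
}
```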

isS

protected boolean isS(char ch)
Determine if ch is whitespace ([3] S)
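For reference, production [3] of the XML 1.0 specification is S ::= (#x20 | #x9 | #xD | #xA)+. A minimal stand-alone sketch of a per-character test for that production follows; XmlWhitespace is a hypothetical class name and this is not the library's source.

```java
// Hypothetical helper showing the four characters allowed by XML production [3] S.
final class XmlWhitespace {
    static boolean isS(char ch) {
        return ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n';
    }

    public static void main(String[] args) {
        System.out.println(isS(' '));
        // Form feed is whitespace for Character.isWhitespace but not for XML.
        System.out.println(isS('\f') + " vs " + Character.isWhitespace('\f'));
    }
}
```

Note that this set is narrower than Character.isWhitespace, which also accepts characters such as form feed.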

next

public byte next()
            throws TokenizerException,
                   IOException
Return the next recognized token, or END_DOCUMENT if there is no more input.

This is a simple automaton (in pseudo-code):

 byte next() {
    while(state != END_DOCUMENT) {
      ch = more();  // read character from input
      state = func(ch, state); // do transition
      if(state is accepting)
        return state;  // return token to caller
    }
    return END_DOCUMENT;  // no more input
 }
 

For speed (and simplicity?) it uses a few helper procedures such as readName() or isS().
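The loop shape in the pseudo-code can be made concrete with a toy automaton. The sketch below is an assumption-laden illustration, not the SXT implementation: ToyTokenizer is a hypothetical class that only recognizes start tags, end tags and character content, borrows the token values listed in the field details, and exposes the token text through posStart and posEnd in the same spirit as the real fields.

```java
// Toy state machine: read a char, do a transition, return when a token is complete.
final class ToyTokenizer {
    static final byte END_DOCUMENT = 2;
    static final byte CONTENT = 10;
    static final byte ETAG_NAME = 110;
    static final byte STAG_NAME = 120;

    private enum State { START, IN_TEXT, TAG_OPEN, IN_STAG, IN_ETAG }

    final char[] buf;
    int pos;               // next char to read
    int posStart, posEnd;  // [posStart, posEnd) is the text of the current token

    ToyTokenizer(char[] data) { this.buf = data; }

    byte next() {
        State state = State.START;
        while (pos < buf.length) {
            char ch = buf[pos++];                       // read character from input
            switch (state) {                            // do transition
                case START:
                    if (ch == '<') { state = State.TAG_OPEN; }
                    else { posStart = pos - 1; state = State.IN_TEXT; }
                    break;
                case IN_TEXT:
                    if (ch == '<') { pos--; posEnd = pos; return CONTENT; } // push '<' back
                    break;
                case TAG_OPEN:
                    if (ch == '/') { posStart = pos; state = State.IN_ETAG; }
                    else { pos--; posStart = pos; state = State.IN_STAG; }  // re-read as name
                    break;
                case IN_STAG:
                    if (ch == '>') { posEnd = pos - 1; return STAG_NAME; }
                    break;
                case IN_ETAG:
                    if (ch == '>') { posEnd = pos - 1; return ETAG_NAME; }
                    break;
            }
        }
        if (state == State.IN_TEXT) { posEnd = pos; return CONTENT; }
        if (state != State.START) {
            throw new IllegalStateException("unexpected end of input inside a tag");
        }
        return END_DOCUMENT;
    }

    String text() { return new String(buf, posStart, posEnd - posStart); }

    public static void main(String[] args) {
        ToyTokenizer t = new ToyTokenizer("<a>hi</a>".toCharArray());
        for (byte token = t.next(); token != END_DOCUMENT; token = t.next()) {
            System.out.println(token + " " + t.text());
        }
    }
}
```

The real tokenizer handles far more (attributes, comments, entities, buffer refills); the point here is only the read, transition, return-on-accept cycle.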


reset

public void reset()

setAllowedMixedContent

public void setAllowedMixedContent(boolean enable)
Set support for mixed content. If mixed content is disabled, the tokenizer will do its best to ensure that no element has a mixed content model; ignorable whitespace will also not be reported as element content.

setBufferShrinkable

public void setBufferShrinkable(boolean shrinkable)
            throws TokenizerException

setHardLimit

public void setHardLimit(int value)
            throws TokenizerException
Set hard limit on internal buffer size. If input (such as element content) is bigger than the hard limit, the tokenizer will throw XmlTokenizerBufferOverflowException.

setInput

public void setInput(Reader r)
Reset tokenizer state and set new input source

setInput

public void setInput(char[] data)
Reset tokenizer state and set new input source

setInput

public void setInput(char[] data,
                     int off,
                     int len)

setNotifyAll

public void setNotifyAll(boolean enable)
Set notification of all XML content tokens: Characters, Comment, CDSect, Doctype, PI, EntityRef, CharRef and AttValue (tokens for STag, ETag and Attribute are always sent).

setParseContent

public void setParseContent(boolean enable)
Allow reporting parsed content for element content and attribute content (no need to deal with low level tokens such as in setNotifyAll).

setSoftLimit

public void setSoftLimit(int value)
            throws TokenizerException
Set soft limit on internal buffer size. This is the suggested size that the tokenizer will try to keep.
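How the soft limit interacts with the hard limit described for setHardLimit is not spelled out here. One possible policy is sketched below purely as an assumption: the buffer starts at the soft limit, grows only when a token needs more room, fails once the hard limit would be exceeded, and shrinks back when it can. LimitedBuffer and its methods are hypothetical, and IllegalStateException stands in for the library's overflow exception.

```java
import java.util.Arrays;

// Hypothetical buffer policy with a preferred (soft) and a maximum (hard) size.
final class LimitedBuffer {
    private char[] buf;
    private final int softLimit; // size the buffer tries to stay at
    private final int hardLimit; // absolute maximum size

    LimitedBuffer(int softLimit, int hardLimit) {
        if (softLimit <= 0 || hardLimit < softLimit) {
            throw new IllegalArgumentException("need 0 < softLimit <= hardLimit");
        }
        this.softLimit = softLimit;
        this.hardLimit = hardLimit;
        this.buf = new char[softLimit];
    }

    // Makes room for `needed` chars, growing past the soft limit only when required.
    void ensureCapacity(int needed) {
        if (needed > hardLimit) {
            throw new IllegalStateException(
                "buffer overflow: " + needed + " exceeds hard limit " + hardLimit);
        }
        if (needed > buf.length) {
            int newSize = Math.min(hardLimit, Math.max(needed, buf.length * 2));
            buf = Arrays.copyOf(buf, newSize);
        }
    }

    // Returns to the soft limit once the live data fits in it again.
    void shrinkIfPossible(int used) {
        if (buf.length > softLimit && used <= softLimit) {
            buf = Arrays.copyOf(buf, softLimit);
        }
    }

    int capacity() { return buf.length; }

    public static void main(String[] args) {
        LimitedBuffer b = new LimitedBuffer(4, 16);
        b.ensureCapacity(10);
        System.out.println(b.capacity());
        b.shrinkIfPossible(2);
        System.out.println(b.capacity());
    }
}
```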

Copyright (c) 2003 IU Extreme! Lab http://www.extreme.indiana.edu/ All Rights Reserved.

Note: this package is deprecated in favor of XPP3, which implements the XmlPull API.