public class BoilerpipeHTMLParser extends org.apache.xerces.parsers.AbstractSAXParser implements BoilerpipeDocumentSource
A SAX parser used by BoilerpipeSAXInput. The parser uses CyberNeko to parse HTML content.

Fields inherited from class org.apache.xerces.parsers.AbstractSAXParser: ALLOW_UE_AND_NOTATION_EVENTS, DECLARATION_HANDLER, DOM_NODE, fContentHandler, fDeclaredAttrs, fDeclHandler, fDocumentHandler, fDTDHandler, fLexicalHandler, fLexicalHandlerParameterEntities, fNamespaceContext, fNamespacePrefixes, fNamespaces, fParseInProgress, fQName, fResolveDTDURIs, fStandalone, fUseEntityResolver2, fVersion, fXMLNSURIs, LEXICAL_HANDLER, NAMESPACES, STRING_INTERNING

Fields inherited from class org.apache.xerces.parsers.AbstractXMLDocumentParser: fDocumentSource, fDTDContentModelSource, fDTDSource, fInDTD

Fields inherited from class org.apache.xerces.parsers.XMLParser: ENTITY_RESOLVER, ERROR_HANDLER, fConfiguration

| Modifier | Constructor | Description |
|---|---|---|
| | BoilerpipeHTMLParser() | Constructs a BoilerpipeHTMLParser using a default HTML content handler. |
| | BoilerpipeHTMLParser(BoilerpipeHTMLContentHandler contentHandler) | Constructs a BoilerpipeHTMLParser using the given BoilerpipeHTMLContentHandler. |
| protected | BoilerpipeHTMLParser(boolean ignore) | |

| Modifier and Type | Method | Description |
|---|---|---|
| void | setContentHandler(BoilerpipeHTMLContentHandler contentHandler) | |
| void | setContentHandler(org.xml.sax.ContentHandler contentHandler) | |
| TextDocument | toTextDocument() | Returns a TextDocument containing the extracted TextBlocks. |

Methods inherited from class org.apache.xerces.parsers.AbstractSAXParser: attributeDecl, characters, comment, doctypeDecl, elementDecl, endCDATA, endDocument, endDTD, endElement, endExternalSubset, endGeneralEntity, endNamespaceMapping, endParameterEntity, externalEntityDecl, getAttributePSVI, getAttributePSVIByName, getContentHandler, getDeclHandler, getDTDHandler, getElementPSVI, getEntityResolver, getErrorHandler, getFeature, getLexicalHandler, getProperty, ignorableWhitespace, internalEntityDecl, notationDecl, parse, parse, processingInstruction, reset, setDeclHandler, setDocumentHandler, setDTDHandler, setEntityResolver, setErrorHandler, setFeature, setLexicalHandler, setLocale, setProperty, startCDATA, startDocument, startElement, startExternalSubset, startGeneralEntity, startNamespaceMapping, startParameterEntity, unparsedEntityDecl, xmlDecl

Methods inherited from class org.apache.xerces.parsers.AbstractXMLDocumentParser: any, element, empty, emptyElement, endAttlist, endConditional, endContentModel, endGroup, getDocumentSource, getDTDContentModelSource, getDTDSource, ignoredCharacters, occurrence, pcdata, separator, setDocumentSource, setDTDContentModelSource, setDTDSource, startAttlist, startConditional, startContentModel, startDTD, startGroup, textDecl

public BoilerpipeHTMLParser()
Constructs a BoilerpipeHTMLParser using a default HTML content handler.

public BoilerpipeHTMLParser(BoilerpipeHTMLContentHandler contentHandler)
Constructs a BoilerpipeHTMLParser using the given BoilerpipeHTMLContentHandler.
Parameters: contentHandler

protected BoilerpipeHTMLParser(boolean ignore)

public void setContentHandler(BoilerpipeHTMLContentHandler contentHandler)

public void setContentHandler(org.xml.sax.ContentHandler contentHandler)
Specified by: setContentHandler in interface org.xml.sax.XMLReader
Overrides: setContentHandler in class org.apache.xerces.parsers.AbstractSAXParser

public TextDocument toTextDocument()
Returns a TextDocument containing the extracted TextBlocks. NOTE: Only call this after AbstractSAXParser.parse(org.xml.sax.InputSource).
Specified by: toTextDocument in interface BoilerpipeDocumentSource
Returns: the TextDocument
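
The NOTE on toTextDocument() describes a parse-then-collect call order: the content handler accumulates text while parse runs, and the result is read only afterwards. The sketch below illustrates that order using only the JDK's SAX API. It is not boilerpipe code: CollectingHandler, toBlocks, and parseBlocks are hypothetical names, and the JDK's XML parser stands in for CyberNeko, so the input here must be well-formed XML rather than arbitrary HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in: gathers text blocks during parsing and exposes them afterwards. */
final class CollectingHandler extends DefaultHandler {
    private final List<String> blocks = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private boolean finished = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flush(); // an element boundary ends the current text block
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        current.append(ch, start, length); // text may arrive in several chunks
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();
    }

    @Override
    public void endDocument() {
        flush();
        finished = true; // only now is the collected result complete
    }

    private void flush() {
        String text = current.toString().trim();
        if (!text.isEmpty()) {
            blocks.add(text);
        }
        current.setLength(0);
    }

    /** Mirrors the call-order rule: reading the result before parsing is an error. */
    List<String> toBlocks() {
        if (!finished) {
            throw new IllegalStateException("call parse(...) before toBlocks()");
        }
        return blocks;
    }

    /** Parses well-formed markup, then reads the collected blocks. */
    static List<String> parseBlocks(String xhtml) {
        try {
            CollectingHandler handler = new CollectingHandler();
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xhtml))); // step 1: parse
            return handler.toBlocks();                              // step 2: collect
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }
}
```

With BoilerpipeHTMLParser the same two steps would presumably be a call to parse(org.xml.sax.InputSource) followed by toTextDocument(), as the NOTE states. The sketch throws when the order is reversed, which is a design choice of the stand-in; the source does not say how the real class behaves in that case.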