public final class CommonExtractors
extends java.lang.Object
Provides quick access to common BoilerpipeExtractors.

| Modifier and Type | Field | Description |
|---|---|---|
| static ArticleExtractor | ARTICLE_EXTRACTOR | Works very well for most types of Article-like HTML. |
| static CanolaExtractor | CANOLA_EXTRACTOR | Trained on krdwrd Canola (different definition of "boilerplate"). |
| static DefaultExtractor | DEFAULT_EXTRACTOR | Usually worse than ArticleExtractor, but simpler/no heuristics. |
| static KeepEverythingExtractor | KEEP_EVERYTHING_EXTRACTOR | Dummy Extractor; should return the input text. |
| static LargestContentExtractor | LARGEST_CONTENT_EXTRACTOR | Like DefaultExtractor, but keeps the largest text block only. |

public static final ArticleExtractor ARTICLE_EXTRACTOR
Works very well for most types of Article-like HTML.

public static final DefaultExtractor DEFAULT_EXTRACTOR
Usually worse than ArticleExtractor, but simpler/no heuristics.

public static final LargestContentExtractor LARGEST_CONTENT_EXTRACTOR
Like DefaultExtractor, but keeps the largest text block only.

public static final CanolaExtractor CANOLA_EXTRACTOR
Trained on krdwrd Canola (different definition of "boilerplate").

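public static final KeepEverythingExtractor KEEP_EVERYTHING_EXTRACTOR
Dummy Extractor; should return the input text. Use this to double-check whether your problem lies within a particular BoilerpipeExtractor, or somewhere else.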
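
The double-checking idea behind a keep-everything extractor can be sketched without the library. The example below is a minimal, self-contained illustration: `TextExtractor`, `KEEP_EVERYTHING`, `LARGEST_BLOCK`, and `diagnose` are hypothetical stand-ins invented for this sketch, not boilerpipe classes, and the "largest block" filter is a toy that only splits on blank lines. It shows how comparing a filtering extractor against a pass-through baseline can indicate whether missing text was dropped by the extractor or was never present in its input.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical stand-ins for illustration only; these are not boilerpipe classes. */
final class ExtractorCheck {

    /** Minimal stand-in for an extractor: text in, text out. */
    interface TextExtractor {
        String getText(String input);
    }

    /** Stand-in for a keep-everything extractor: returns the input unchanged. */
    static final TextExtractor KEEP_EVERYTHING = input -> input;

    /** Toy "largest block only" filter: keeps the longest blank-line-separated block. */
    static final TextExtractor LARGEST_BLOCK = input ->
            Arrays.stream(input.split("\\n\\s*\\n"))
                  .map(String::trim)
                  .max(Comparator.comparingInt(String::length))
                  .orElse("");

    /** Reports whether an expected snippet survives, and if not, where it was lost. */
    static String diagnose(TextExtractor extractor, String input, String expected) {
        if (extractor.getText(input).contains(expected)) {
            return "ok";
        }
        // If even the pass-through baseline lacks the snippet,
        // the extractor's filtering is not what removed it.
        return KEEP_EVERYTHING.getText(input).contains(expected)
                ? "lost in extractor"
                : "lost elsewhere";
    }

    public static void main(String[] args) {
        String input = "Home | About\n\n"
                + "A long article paragraph with the actual content.\n\n"
                + "Footer";
        System.out.println(diagnose(LARGEST_BLOCK, input, "actual content"));
        System.out.println(diagnose(LARGEST_BLOCK, input, "Footer"));
        System.out.println(diagnose(LARGEST_BLOCK, input, "Sidebar"));
    }
}
```

With the real library, the same comparison would presumably be made by running the input through KEEP_EVERYTHING_EXTRACTOR and through the extractor under suspicion, then comparing the two results.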