public final class CustomAnalyzer extends Analyzer

A general-purpose Analyzer that can be created with a builder-style API. Under the hood it uses the factory classes TokenizerFactory, TokenFilterFactory, and CharFilterFactory.
You can create an instance of this Analyzer using the builder:

    Analyzer ana = CustomAnalyzer.builder(Paths.get("/path/to/config/dir"))
        .withTokenizer(StandardTokenizerFactory.class)
        .addTokenFilter(StandardFilterFactory.class)
        .addTokenFilter(LowerCaseFilterFactory.class)
        .addTokenFilter(StopFilterFactory.class, "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
        .build();

The parameters passed to components are also used by Apache Solr and are documented on their corresponding factory classes. Refer to documentation of subclasses of TokenizerFactory, TokenFilterFactory, and CharFilterFactory.
You can also use the SPI names (as defined by the ServiceLoader interface):

    Analyzer ana = CustomAnalyzer.builder(Paths.get("/path/to/config/dir"))
        .withTokenizer("standard")
        .addTokenFilter("standard")
        .addTokenFilter("lowercase")
        .addTokenFilter("stop", "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
        .build();
The list of names to be used for components can be looked up through TokenizerFactory.availableTokenizers(), TokenFilterFactory.availableTokenFilters(), and CharFilterFactory.availableCharFilters().
You can create conditional branches in the analyzer by using CustomAnalyzer.Builder.when(String, String...) and CustomAnalyzer.Builder.whenTerm(Predicate):

    Analyzer ana = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .addTokenFilter("lowercase")
        .whenTerm(t -> t.length() > 10)
            .addTokenFilter("reversestring")
        .endwhen()
        .build();
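A hedged usage sketch showing how to consume tokens from an analyzer built this way, using the standard TokenStream workflow (the field name "body" and the sample text are illustrative only):

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class InspectTokens {
    public static void main(String[] args) throws IOException {
        // Build a simple chain: standard tokenizer + lowercase filter.
        CustomAnalyzer ana = CustomAnalyzer.builder()
            .withTokenizer("standard")
            .addTokenFilter("lowercase")
            .build();
        // Standard consumer contract: reset(), incrementToken() loop, end(), close().
        try (TokenStream ts = ana.tokenStream("body", "Hello Custom Analyzers")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }
        ana.close();
    }
}
```

The try-with-resources block closes the TokenStream, which is required before the analyzer can hand out a stream for the same field again.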
Modifier and Type | Class and Description |
---|---|
static class | CustomAnalyzer.Builder: Builder for CustomAnalyzer. |
static class | CustomAnalyzer.ConditionBuilder: Factory class for a ConditionalTokenFilter. |
Nested classes inherited from class Analyzer: Analyzer.ReuseStrategy, Analyzer.TokenStreamComponents
Modifier and Type | Field and Description |
---|---|
private CharFilterFactory[] | charFilters |
private java.lang.Integer | offsetGap |
private java.lang.Integer | posIncGap |
private TokenFilterFactory[] | tokenFilters |
private TokenizerFactory | tokenizer |
Fields inherited from class Analyzer: GLOBAL_REUSE_STRATEGY, PER_FIELD_REUSE_STRATEGY
Constructor and Description |
---|
CustomAnalyzer(Version defaultMatchVersion, CharFilterFactory[] charFilters, TokenizerFactory tokenizer, TokenFilterFactory[] tokenFilters, java.lang.Integer posIncGap, java.lang.Integer offsetGap) |
Modifier and Type | Method and Description |
---|---|
static CustomAnalyzer.Builder | builder(): Returns a builder for custom analyzers that loads all resources from Lucene's classloader. |
static CustomAnalyzer.Builder | builder(java.nio.file.Path configDir): Returns a builder for custom analyzers that loads all resources from the given file system base directory. |
static CustomAnalyzer.Builder | builder(ResourceLoader loader): Returns a builder for custom analyzers that loads all resources using the given ResourceLoader. |
protected Analyzer.TokenStreamComponents | createComponents(java.lang.String fieldName): Creates a new Analyzer.TokenStreamComponents instance for this analyzer. |
java.util.List<CharFilterFactory> | getCharFilterFactories(): Returns the list of char filters that are used in this analyzer. |
int | getOffsetGap(java.lang.String fieldName): Just like Analyzer.getPositionIncrementGap(java.lang.String), except for Token offsets instead. |
int | getPositionIncrementGap(java.lang.String fieldName): Invoked before indexing an IndexableField instance if terms have already been added to that field. |
java.util.List<TokenFilterFactory> | getTokenFilterFactories(): Returns the list of token filters that are used in this analyzer. |
TokenizerFactory | getTokenizerFactory(): Returns the tokenizer that is used in this analyzer. |
protected java.io.Reader | initReader(java.lang.String fieldName, java.io.Reader reader): Override this if you want to add a CharFilter chain. |
protected java.io.Reader | initReaderForNormalization(java.lang.String fieldName, java.io.Reader reader): Wrap the given Reader with CharFilters that make sense for normalization. |
protected TokenStream | normalize(java.lang.String fieldName, TokenStream in): Wrap the given TokenStream in order to apply normalization filters. |
java.lang.String | toString() |
Methods inherited from class Analyzer: attributeFactory, close, getReuseStrategy, getVersion, normalize, setVersion, tokenStream, tokenStream
private final CharFilterFactory[] charFilters
private final TokenizerFactory tokenizer
private final TokenFilterFactory[] tokenFilters
private final java.lang.Integer posIncGap
private final java.lang.Integer offsetGap
CustomAnalyzer(Version defaultMatchVersion, CharFilterFactory[] charFilters, TokenizerFactory tokenizer, TokenFilterFactory[] tokenFilters, java.lang.Integer posIncGap, java.lang.Integer offsetGap)
public static CustomAnalyzer.Builder builder()
public static CustomAnalyzer.Builder builder(java.nio.file.Path configDir)
public static CustomAnalyzer.Builder builder(ResourceLoader loader)
Returns a builder for custom analyzers that loads all resources using the given ResourceLoader.

protected java.io.Reader initReader(java.lang.String fieldName, java.io.Reader reader)
Description copied from class: Analyzer
Override this if you want to add a CharFilter chain. The default implementation returns reader unchanged.
Overrides: initReader in class Analyzer
Parameters:
fieldName - IndexableField name being indexed
reader - original Reader

protected java.io.Reader initReaderForNormalization(java.lang.String fieldName, java.io.Reader reader)
Description copied from class: Analyzer
Wrap the given Reader with CharFilters that make sense for normalization. This is typically a subset of the CharFilters that are applied in Analyzer.initReader(String, Reader). This is used by Analyzer.normalize(String, String).
Overrides: initReaderForNormalization in class Analyzer
protected Analyzer.TokenStreamComponents createComponents(java.lang.String fieldName)
Description copied from class: Analyzer
Creates a new Analyzer.TokenStreamComponents instance for this analyzer.
Specified by: createComponents in class Analyzer
Parameters:
fieldName - the name of the fields content passed to the Analyzer.TokenStreamComponents sink as a reader
Returns:
the Analyzer.TokenStreamComponents for this analyzer

protected TokenStream normalize(java.lang.String fieldName, TokenStream in)
Description copied from class: Analyzer
Wrap the given TokenStream in order to apply normalization filters. The default implementation returns the TokenStream as-is. This is used by Analyzer.normalize(String, String).

public int getPositionIncrementGap(java.lang.String fieldName)
Description copied from class: Analyzer
Invoked before indexing an IndexableField instance if terms have already been added to that field.
Overrides: getPositionIncrementGap in class Analyzer
Parameters:
fieldName - IndexableField name being indexed
Returns:
position increment gap, added to the next token emitted from Analyzer.tokenStream(String,Reader). This value must be >= 0.

public int getOffsetGap(java.lang.String fieldName)
Description copied from class: Analyzer
Just like Analyzer.getPositionIncrementGap(java.lang.String), except for Token offsets instead. By default this returns 1. This method is only called if the field produced at least one token for indexing.
Overrides: getOffsetGap in class Analyzer
Parameters:
fieldName - the field just indexed
Returns:
offset gap, added to the next token emitted from Analyzer.tokenStream(String,Reader). This value must be >= 0.

public java.util.List<CharFilterFactory> getCharFilterFactories()
public TokenizerFactory getTokenizerFactory()
public java.util.List<TokenFilterFactory> getTokenFilterFactories()
public java.lang.String toString()
Overrides: toString in class java.lang.Object
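Because CustomAnalyzer.Builder.build() returns a CustomAnalyzer (not just an Analyzer), the getters documented above can be used directly to inspect the configured chain. A short sketch, assuming lucene-analysis-common on the classpath:

```java
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class InspectChain {
    public static void main(String[] args) throws Exception {
        // build() returns CustomAnalyzer, so the introspection getters are available.
        CustomAnalyzer ana = CustomAnalyzer.builder()
            .withTokenizer("standard")
            .addTokenFilter("lowercase")
            .build();
        System.out.println("tokenizer: " + ana.getTokenizerFactory());
        for (TokenFilterFactory f : ana.getTokenFilterFactories()) {
            System.out.println("filter:    " + f);
        }
        ana.close();
    }
}
```

This can be handy for logging the effective analysis chain of an index at startup.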