public class GermanStemmer
extends java.lang.Object
The algorithm is based on the report "A Fast and Simple Stemming Algorithm for German Words" by Jörg Caumanns (joerg.caumanns at isst.fhg.de).
Modifier and Type | Field and Description |
---|---|
private static java.util.Locale | locale |
private java.lang.StringBuilder | sb - Buffer for the terms while stemming them. |
private int | substCount - Number of characters removed by substitute() while stemming. |
Constructor and Description |
---|
GermanStemmer() |
Modifier and Type | Method and Description |
---|---|
private boolean | isStemmable(java.lang.String term) - Checks if a term can be stemmed. |
private void | optimize(java.lang.StringBuilder buffer) - Does some optimizations on the term. |
private void | removeParticleDenotion(java.lang.StringBuilder buffer) - Removes a particle denotation ("ge") from a term. |
private void | resubstitute(java.lang.StringBuilder buffer) - Undoes the changes made by substitute(). |
protected java.lang.String | stem(java.lang.String term) - Stems the given term to a unique discriminator. |
private void | strip(java.lang.StringBuilder buffer) - Performs suffix stripping (stemming) on the current term. |
private void | substitute(java.lang.StringBuilder buffer) - Performs substitutions on the term to reduce overstemming: umlauts are replaced by their corresponding vowel (äöü -> aou) and "ß" by "ss"; the second character of a pair of equal characters is replaced by an asterisk (?? -> ?*); some common character combinations are replaced by a token (sch/ch/ei/ie/ig/st -> $/§/%/&/#/!). |
private java.lang.StringBuilder sb
private int substCount
private static final java.util.Locale locale
protected java.lang.String stem(java.lang.String term)
term - The term that should be stemmed.
private boolean isStemmable(java.lang.String term)
private void strip(java.lang.StringBuilder buffer)
private void optimize(java.lang.StringBuilder buffer)
private void removeParticleDenotion(java.lang.StringBuilder buffer)
private void substitute(java.lang.StringBuilder buffer)
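Performs substitutions on the term to reduce overstemming:
- Substitute umlauts with their corresponding vowel: äöü -> aou; "ß" is substituted by "ss"
- Substitute the second character of a pair of equal characters with an asterisk: ?? -> ?*
- Substitute some common character combinations with a token: sch/ch/ei/ie/ig/st -> $/§/%/&/#/!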
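The three substitution rules listed above can be illustrated with a small standalone sketch. This is not the class's actual implementation: the class name SubstitutionSketch, the String-in/String-out signature, and the order in which the rules are applied (umlauts first, then doubled characters, then tokens) are assumptions made for illustration, and the sketch does not maintain substCount.

```java
// Illustrative sketch only; names and rule ordering are assumptions,
// not the code of GermanStemmer.substitute().
final class SubstitutionSketch {
    // Character combinations and their tokens, in the order given in the
    // description. "sch" is listed before "ch" so the longer match wins.
    private static final String[] COMBOS = {"sch", "ch", "ei", "ie", "ig", "st"};
    private static final char[] TOKENS = {'$', '\u00a7', '%', '&', '#', '!'};

    static String substitute(String term) {
        // Rule 1: replace umlauts by their base vowel and sharp s by "ss".
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char ch = term.charAt(i);
            switch (ch) {
                case '\u00e4': sb.append('a'); break;   // a-umlaut
                case '\u00f6': sb.append('o'); break;   // o-umlaut
                case '\u00fc': sb.append('u'); break;   // u-umlaut
                case '\u00df': sb.append("ss"); break;  // sharp s
                default: sb.append(ch);
            }
        }
        // Rule 2: replace the second character of an equal pair by '*'.
        for (int i = 1; i < sb.length(); i++) {
            if (sb.charAt(i) == sb.charAt(i - 1)) {
                sb.setCharAt(i, '*');
            }
        }
        // Rule 3: replace common character combinations by a single token.
        String s = sb.toString();
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int match = -1;
            for (int k = 0; k < COMBOS.length && match < 0; k++) {
                if (s.startsWith(COMBOS[k], i)) {
                    match = k;
                }
            }
            if (match >= 0) {
                out.append(TOKENS[match]);
                i += COMBOS[match].length();
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(substitute("sch\u00f6n"));   // prints "$on"
        System.out.println(substitute("stra\u00dfe"));  // prints "!ras*e"
    }
}
```

Masking "sch" or "st" as a single token keeps a later suffix-stripping step from cutting into the middle of such a combination, which is presumably how the substitutions reduce overstemming.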
private void resubstitute(java.lang.StringBuilder buffer)
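As a rough illustration of what undoing the substitutions could look like, the sketch below expands each token back to its character combination and each asterisk back to a copy of the preceding character. The class name ResubstitutionSketch and the String-based signature are hypothetical, and the sketch assumes that umlaut replacement is not reversed, since the base vowel alone does not indicate whether the original letter was an umlaut.

```java
// Illustrative sketch only; not the code of GermanStemmer.resubstitute().
final class ResubstitutionSketch {
    // Tokens and the character combinations they stand for, index-aligned.
    private static final String TOKENS = "$\u00a7%&#!";
    private static final String[] COMBOS = {"sch", "ch", "ei", "ie", "ig", "st"};

    static String resubstitute(String masked) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < masked.length(); i++) {
            char ch = masked.charAt(i);
            int token = TOKENS.indexOf(ch);
            if (ch == '*' && out.length() > 0) {
                // An asterisk stands for a repeat of the preceding character.
                out.append(out.charAt(out.length() - 1));
            } else if (token >= 0) {
                // A token stands for a multi-character combination.
                out.append(COMBOS[token]);
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(resubstitute("!ras*e"));  // prints "strasse"
        System.out.println(resubstitute("l&be"));    // prints "liebe"
    }
}
```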