public class DFISimilarity extends SimilarityBase
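DFI is both parameter-free and non-parametric:
- parameter-free: it does not require any parameter tuning or training.
- non-parametric: it does not make any assumptions about word frequency distributions on document collections.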
It is highly recommended not to remove stopwords (very common terms such as the, of, and, to, a, in, for, is, on, that) with this similarity.
For more information see: A nonparametric term weighting method for information retrieval based on measuring the divergence from independence
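The stopword recommendation above is easier to see with numbers. The following is a minimal, Lucene-free sketch of how a divergence-from-independence score can behave; it is not the library's code. The class `DfiSketch`, its method names, the +1 smoothing, and the choice of the standardized measure are assumptions made for illustration, and the query boost is left out.

```java
/** Simplified sketch of DFI-style scoring; all names here are hypothetical. */
final class DfiSketch {

    /** Term frequency expected in a document if term and document were independent. */
    static double expectedFreq(long totalTermFreq, long numFieldTokens, double docLen) {
        // The +1 smoothing on both counts is an assumption of this sketch.
        return (totalTermFreq + 1) * docLen / (numFieldTokens + 1);
    }

    /** Standardized divergence of the observed frequency from the expected one. */
    static double standardized(double freq, double expected) {
        return (freq - expected) / Math.sqrt(expected);
    }

    /** Returns 0 when the term is no more frequent than independence predicts. */
    static double score(double freq, long totalTermFreq, long numFieldTokens, double docLen) {
        double expected = expectedFreq(totalTermFreq, numFieldTokens, docLen);
        if (freq <= expected) {
            return 0.0;
        }
        double measure = standardized(freq, expected);
        return Math.log(measure + 1) / Math.log(2); // log2(measure + 1)
    }

    public static void main(String[] args) {
        // Stopword-like term: 50,000 occurrences in 1,000,000 tokens, 5 in a 100-token document.
        System.out.println(score(5, 50_000, 1_000_000, 100));
        // Rare term: 20 occurrences in the collection, 5 of them in the same document.
        System.out.println(score(5, 20, 1_000_000, 100));
    }
}
```

Under these assumptions, a very common term that occurs about as often as chance predicts contributes nothing to the score, while a rare term concentrated in one document scores well above zero. That is why stopwords tend to be handled by the model itself and do not need to be filtered out beforehand.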
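Nested classes inherited from class SimilarityBase: SimilarityBase.BasicSimScorer
Nested classes inherited from class Similarity: Similarity.SimScorer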
Modifier and Type | Field and Description |
---|---|
private Independence | independence |

Fields inherited from class SimilarityBase: discountOverlaps
Constructor and Description |
---|
DFISimilarity(Independence independenceMeasure) Create DFI with the specified divergence from independence measure. |
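The constructor takes the divergence measure as a pluggable component. A rough, Lucene-free sketch of that design is shown below. The class `IndependenceMeasuresSketch` and its constants are hypothetical names, and the three formulas (saturated, standardized, chi-squared) are the usual textbook forms, assumed here and not taken from this page.

```java
import java.util.function.DoubleBinaryOperator;

/** Hypothetical stand-in for a similarity that delegates to a pluggable measure. */
final class IndependenceMeasuresSketch {

    // Each measure maps (observed freq, expected freq) to a divergence value.
    static final DoubleBinaryOperator SATURATED =
        (freq, expected) -> (freq - expected) / expected;
    static final DoubleBinaryOperator STANDARDIZED =
        (freq, expected) -> (freq - expected) / Math.sqrt(expected);
    static final DoubleBinaryOperator CHI_SQUARED =
        (freq, expected) -> (freq - expected) * (freq - expected) / expected;

    private final DoubleBinaryOperator measure;

    /** Mirrors the idea of passing the measure in at construction time. */
    IndependenceMeasuresSketch(DoubleBinaryOperator measure) {
        this.measure = measure;
    }

    double divergence(double freq, double expected) {
        return measure.applyAsDouble(freq, expected);
    }

    public static void main(String[] args) {
        IndependenceMeasuresSketch sketch = new IndependenceMeasuresSketch(CHI_SQUARED);
        System.out.println(sketch.divergence(6, 4));
    }
}
```

Fixing the measure at construction keeps the scoring path free of branching on the measure type, and swapping measures only changes the object that is passed in.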
Modifier and Type | Method and Description |
---|---|
protected Explanation | explain(BasicStats stats, Explanation freq, double docLen) Explains the score. |
Independence | getIndependence() Returns the measure of independence. |
protected double | score(BasicStats stats, double freq, double docLen) Scores the document doc. |
java.lang.String | toString() Subclasses must override this method to return the name of the Similarity and preferably the values of parameters (if any) as well. |
Methods inherited from class SimilarityBase: computeNorm, explain, fillBasicStats, getDiscountOverlaps, log2, newStats, scorer, setDiscountOverlaps
private final Independence independence
public DFISimilarity(Independence independenceMeasure)
Create DFI with the specified divergence from independence measure.
Parameters:
independenceMeasure - measure of divergence from independence
protected double score(BasicStats stats, double freq, double docLen)
Description copied from class: SimilarityBase
Scores the document doc. Subclasses must apply their scoring formula in this method.
Overrides:
score in class SimilarityBase
Parameters:
stats - the corpus level statistics.
freq - the term frequency.
docLen - the document length.
public Independence getIndependence()
Returns the measure of independence.
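protected Explanation explain(BasicStats stats, Explanation freq, double docLen)
Description copied from class: SimilarityBase
Explains the score. The explanation attaches the score (computed via the SimilarityBase.score(BasicStats, double, double) method) and the explanation for the term frequency. Subclasses content with this format may add additional details in SimilarityBase.explain(List, BasicStats, double, double).
Overrides:
explain in class SimilarityBase
Parameters:
stats - the corpus level statistics.
freq - the term frequency and its explanation.
docLen - the document length.
public java.lang.String toString()
Description copied from class: SimilarityBase
Subclasses must override this method to return the name of the Similarity and preferably the values of parameters (if any) as well.
Overrides:
toString in class SimilarityBase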