org.apache.lucene.index.memory
public class MemoryIndex extends Object
float score = search(String text, Query query).
Each instance can hold at most one Lucene "document", with a document containing
zero or more "fields", each field having a name and a fulltext value. The
fulltext value is tokenized (split and transformed) into zero or more index terms
(aka words) on addField(), according to the policy implemented by an Analyzer.
For example, Lucene analyzers can split on whitespace, normalize to lower case
for case insensitivity, ignore common terms with little discriminatory value
such as "he", "in", "and" (stop words), reduce the terms to their natural
linguistic root form such as "fishing" being reduced to "fish" (stemming),
resolve synonyms/inflexions/thesauri (upon indexing and/or querying), etc.
For details, see Lucene Analyzer Intro.
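The analysis steps listed above (whitespace splitting, lower-casing, stop word removal, stemming) can be pictured with a small plain-Java sketch. `ToyAnalyzer`, its stop word list and its suffix-stripping rule are hypothetical stand-ins invented for illustration; they are not Lucene classes and do not reproduce the behaviour of any real Analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an Analyzer; not part of Lucene.
class ToyAnalyzer {
    // Tiny illustrative stop word list (an assumption, not Lucene's list).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("he", "in", "and", "of", "the"));

    // Turns a fulltext value into a list of index terms.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {   // split on whitespace
            String term = raw.toLowerCase(Locale.ROOT);   // normalize to lower case
            if (term.isEmpty() || STOP_WORDS.contains(term)) {
                continue;                                 // drop blanks and stop words
            }
            terms.add(stem(term));                        // reduce to a root form
        }
        return terms;
    }

    // Deliberately naive suffix stripping; real stemmers are far more careful.
    static String stem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        return term;
    }

    public static void main(String[] args) {
        // prints [went, fish, alaska, canada]
        System.out.println(analyze("He went Fishing in Alaska and Canada"));
    }
}
```

Only the resulting terms would be handed to an index; which of these steps run, and in what order, is the policy that the chosen Analyzer implements.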
Arbitrary Lucene queries can be run against this class - see Lucene Query Syntax as well as Query Parser Rules. Note that a Lucene query selects on the field names and associated (indexed) tokenized terms, not on the original fulltext(s) - the latter are not stored but rather thrown away immediately after tokenization.
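To illustrate why a query sees only field names and tokenized terms, here is a minimal sketch of a single-document term store. `ToySingleDocIndex` is a hypothetical class made up for this example; it is an assumption about the general idea, not a description of how MemoryIndex stores its data.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-document term store; not Lucene's data structure.
class ToySingleDocIndex {
    // field name -> (term -> number of occurrences)
    private final Map<String, Map<String, Integer>> fields = new HashMap<>();

    // Records already-analyzed terms; the original text is never retained.
    void addField(String fieldName, List<String> terms) {
        Map<String, Integer> counts =
                fields.computeIfAbsent(fieldName, k -> new HashMap<>());
        for (String term : terms) {
            counts.merge(term, 1, Integer::sum);
        }
    }

    // Number of times the exact term occurs in the field (0 if absent).
    int termFrequency(String fieldName, String term) {
        Map<String, Integer> counts = fields.get(fieldName);
        if (counts == null) {
            return 0;
        }
        return counts.getOrDefault(term, 0);
    }

    public static void main(String[] args) {
        ToySingleDocIndex index = new ToySingleDocIndex();
        // Terms as a lower-casing analyzer might emit them for "Tales of James".
        index.addField("author", Arrays.asList("tales", "of", "james"));
        System.out.println(index.termFrequency("author", "james"));  // prints 1
        System.out.println(index.termFrequency("author", "James"));  // prints 0: original casing was not kept
        System.out.println(index.termFrequency("content", "james")); // prints 0: different field
    }
}
```

In this sketch a lookup for "James" fails because only the analyzed term "james" was recorded, which is why the query side normally runs through the same analyzer as the indexing side.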
For some interesting background information on search technology, see Bob Wyman's Prospective Search, Jim Gray's A Call to Arms - Custom subscriptions, and Tim Bray's On Search, the Series.
Example usage:

    Analyzer analyzer = PatternAnalyzer.DEFAULT_ANALYZER;
    //Analyzer analyzer = new SimpleAnalyzer();
    MemoryIndex index = new MemoryIndex();
    index.addField("content", "Readings about Salmons and other select Alaska fishing Manuals", analyzer);
    index.addField("author", "Tales of James", analyzer);
    float score = index.search(QueryParser.parse("+author:james +salmon~ +fish* manual~", "content", analyzer));
    if (score > 0.0f) {
        System.out.println("it's a match");
    } else {
        System.out.println("no match found");
    }
    System.out.println("indexData=" + index.toString());

Example XQuery usage:

    (: An XQuery that finds all books authored by James that have something to do with "salmon fishing manuals", sorted by relevance :)
    declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
    declare variable $query := "+salmon~ +fish* manual~"; (: any arbitrary Lucene query can go here :)

    for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
    let $score := lucene:match($book/abstract, $query)
    order by $score descending
    return $book

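An instance is not thread-safe. If it is shared between threads, guard all access with an idiom such as:

    MemoryIndex index = ...
    synchronized (index) {
        // read and/or write index (i.e. add fields and/or query)
    }
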
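The locking idiom above can be exercised with a small self-contained sketch. `UnsafeTermList` is a hypothetical stand-in for any object that is not safe for unsynchronized concurrent use; it is not MemoryIndex, and the thread and iteration counts are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a non-thread-safe index; hypothetical, not MemoryIndex itself.
class UnsafeTermList {
    private final List<String> terms = new ArrayList<>();
    void add(String term) { terms.add(term); }
    int size() { return terms.size(); }
}

class SynchronizedAccessDemo {
    // Several threads write to one shared instance, each holding its monitor.
    static int fillConcurrently(int threads, int perThread) {
        final UnsafeTermList index = new UnsafeTermList();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    synchronized (index) {      // lock on the shared instance
                        index.add("term" + i);
                    }
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();                  // wait until all writers finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        synchronized (index) {                  // reads take the same lock
            return index.size();
        }
    }

    public static void main(String[] args) {
        System.out.println(fillConcurrently(4, 1000)); // prints 4000
    }
}
```

Every reader and writer must lock on the same object for this to work; a single unsynchronized access elsewhere would defeat the guard.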
This class performs very well for very small texts (e.g. 10 chars)
as well as for large texts (e.g. 10 MB) and everything in between.
Typically, it is about 10-100 times faster than RAMDirectory.
Note that RAMDirectory has particularly large efficiency overheads
for small to medium sized texts, both in time and space.
Indexing a field with N tokens takes O(N) in the best case, and O(N log N) in the worst
case. Memory consumption is probably larger than for RAMDirectory.
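One way bounds of this shape can arise is sketched below. This is an assumption for illustration, not a description of this class's internals: collecting token positions in a hash map costs expected O(N), and sorting the T distinct terms (useful for prefix queries such as fish*) adds O(T log T), which reaches O(N log N) when every token is distinct. `ToyTermDictionary` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical term dictionary; illustrates the cost model only.
class ToyTermDictionary {
    private final Map<String, List<Integer>> positions = new HashMap<>();
    private String[] sortedTerms;   // built lazily; null means stale
    private int nextPosition = 0;

    // Expected O(N) for N tokens: one hash map update per token.
    void addTokens(List<String> tokens) {
        for (String token : tokens) {
            positions.computeIfAbsent(token, k -> new ArrayList<>()).add(nextPosition++);
        }
        sortedTerms = null;         // sorted view must be rebuilt on next use
    }

    List<Integer> positionsOf(String term) {
        List<Integer> found = positions.get(term);
        return found == null ? Collections.<Integer>emptyList() : found;
    }

    // O(T log T) for T distinct terms, paid only when a sorted view is needed.
    private String[] sortedTerms() {
        if (sortedTerms == null) {
            sortedTerms = positions.keySet().toArray(new String[0]);
            Arrays.sort(sortedTerms);
        }
        return sortedTerms;
    }

    // Terms starting with the prefix, located by binary search.
    List<String> termsWithPrefix(String prefix) {
        String[] terms = sortedTerms();
        int i = Arrays.binarySearch(terms, prefix);
        if (i < 0) {
            i = -i - 1;             // insertion point: first term >= prefix
        }
        List<String> result = new ArrayList<>();
        while (i < terms.length && terms[i].startsWith(prefix)) {
            result.add(terms[i++]);
        }
        return result;
    }

    public static void main(String[] args) {
        ToyTermDictionary dict = new ToyTermDictionary();
        dict.addTokens(Arrays.asList(
                "readings", "about", "salmons", "fishing", "fish", "manuals", "fishing"));
        System.out.println(dict.termsWithPrefix("fish")); // prints [fish, fishing]
        System.out.println(dict.positionsOf("fishing"));  // prints [3, 6]
    }
}
```

Deferring the sort until a query needs it keeps plain term lookups at the O(N) cost of the indexing pass.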
If you're curious about the whereabouts of bottlenecks, run Java 1.5 with the non-perturbing '-server -agentlib:hprof=cpu=samples,depth=10' flags, then study the trace log and correlate its hotspot trailer with its call stack headers (see hprof tracing).
Constructor Summary

Constructor | Description |
---|---|
MemoryIndex() | Constructs an empty instance. |
Method Summary

Return type | Method | Description |
---|---|---|
void | addField(String fieldName, String text, Analyzer analyzer) | Convenience method; tokenizes the given field text and adds the resulting terms to the index. Equivalent to adding a tokenized, indexed, termVectorStored, unstored, non-keyword Lucene org.apache.lucene.document.Field. |
void | addField(String fieldName, TokenStream stream) | Iterates over the given token stream and adds the resulting terms to the index. Equivalent to adding a tokenized, indexed, termVectorStored, unstored Lucene org.apache.lucene.document.Field. |
IndexSearcher | createSearcher() | Creates and returns a searcher that can be used to execute arbitrary Lucene queries and to collect the resulting query results as hits. |
int | getMemorySize() | Returns a reasonable approximation of the main memory [bytes] consumed by this instance. |
TokenStream | keywordTokenStream(Collection keywords) | Convenience method; creates and returns a token stream that generates a token for each keyword in the given collection, "as is", without any transforming text analysis. |
float | search(Query query) | Convenience method that efficiently returns the relevance score by matching this index against the given Lucene query expression. |
String | toString() | Returns a String representation of the index data for debugging purposes. |
addField(String fieldName, String text, Analyzer analyzer)
Parameters:
fieldName - a name to be associated with the text
text - the text to tokenize and index
analyzer - the analyzer to use for tokenization

addField(String fieldName, TokenStream stream)
KeywordTokenizer or similar utilities.
Parameters:
fieldName - a name to be associated with the text
stream - the token stream to retrieve tokens from

createSearcher()
Returns: a searcher

getMemorySize()
Returns: the main memory consumption

keywordTokenStream(Collection keywords)
Parameters:
keywords - the keywords to generate tokens for
Returns: the corresponding token stream

search(Query query)
Parameters:
query - an arbitrary Lucene query to run against this index
Returns: the relevance score of the matchmaking; a number in the range [0.0 .. 1.0], with 0.0 indicating no match. The higher the number, the better the match.
See Also: QueryParser.parse(String, String, Analyzer)

toString()
Returns: the string representation