The IndexWriter is the class used to add documents to an index. You can also delete documents from the index using this class. The indexing process is highly customizable and the IndexWriter has the following parameters:
dir: This is a Ferret::Store::Directory object. You should either pass a :dir or a :path when creating an index.
path: A string representing the path to the index directory. If you are creating the index for the first time, the directory will be created if it's missing. You should not choose a directory which contains other files, as they could be overwritten. To protect against this, set :create_if_missing to false.
create_if_missing: Default: true. Create the index if no index is found in the specified directory. Otherwise, use the existing index.
create: Default: false. Creates the index even if one already exists, which means any existing index will be deleted. It is probably better to use the :create_if_missing option so that the index is only created the first time, when it doesn't exist.
field_infos: Default: Ferret::Index::FieldInfos.new. The FieldInfos object to use when creating a new index if :create_if_missing or :create is set to true. If an existing index is opened then this parameter is ignored.
analyzer: Default: Ferret::Analysis::StandardAnalyzer. Sets the default analyzer for the index. This is used by both the IndexWriter and the QueryParser to tokenize the input.
chunk_size: Default: 0x100000 or 1 MB. Memory performance tuning parameter. Sets the default size of the chunks of memory malloced for use during indexing. You can usually leave this parameter as is.
max_buffer_memory: Default: 0x1000000 or 16 MB. Memory performance tuning parameter. Sets the amount of memory to be used by the indexing process. Set to a larger value to increase indexing speed. Note that this only includes memory used by the indexing process, not the rest of your Ruby application.
term_index_interval: Default: 128. The skip interval between terms in the term dictionary. A smaller value may increase search performance while also increasing memory usage and negatively impacting indexing performance.
doc_skip_interval: Default: 16. The skip interval for document numbers in the index. As with :term_index_interval you have a trade-off. A smaller number may increase search performance while also increasing memory usage and negatively impacting indexing performance.
merge_factor: Default: 10. This must never be less than 2. Specifies the number of segments of a certain size that must exist before they are merged. A larger value will improve indexing performance while slowing search performance.
max_buffered_docs: Default: 10000. The maximum number of documents that may be stored in memory before being written to the index. If you have a lot of memory and are indexing a large number of small documents (like products in a product database, for example) you may want to set this to a much higher number (like Ferret::FIX_INT_MAX). If you are worried about your application crashing in the middle of indexing, you might set this to a smaller number so that the index is committed more often. This is like having an auto-save in a word processor application.
max_merge_docs: Set this value to limit the number of documents that go into a single segment. Use this to avoid extremely long merge times during indexing, which can make your application seem unresponsive. This is only necessary for very large indexes (millions of documents).
max_field_length: Default: 10000. The maximum number of terms added to a single field. This can be useful to protect the indexer when indexing documents from the web, for example. Usually the most important terms will occur early on in a document, so you can often safely ignore the terms in a field after a certain number of them. If you wanted to speed up indexing and save space in your index, you may only want to index the first 1000 terms in a field. On the other hand, if you want to be more thorough and you are indexing documents from your file-system, you may set this parameter to Ferret::FIX_INT_MAX.
use_compound_file: Default: true. Uses a compound file to store the index. This prevents an error being raised for having too many files open at the same time. Performance is better if this is set to false.
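The defaults listed above can be collected into a plain options hash. The sketch below is not Ferret code: DEFAULT_WRITER_OPTIONS and check_writer_options are hypothetical names used only for illustration, and the values are the ones stated in the list. It shows what the hexadecimal sizes work out to and enforces the one hard limit the list mentions, that :merge_factor must never be less than 2.

```ruby
# Hypothetical summary of the documented defaults; not part of Ferret.
DEFAULT_WRITER_OPTIONS = {
  :create_if_missing   => true,
  :create              => false,
  :chunk_size          => 0x100000,   # 1 MB
  :max_buffer_memory   => 0x1000000,  # 16 MB
  :term_index_interval => 128,
  :doc_skip_interval   => 16,
  :merge_factor        => 10,
  :max_buffered_docs   => 10000,
  :max_field_length    => 10000,
  :use_compound_file   => true
}.freeze

# Hypothetical helper: merges user options over the defaults and rejects
# a :merge_factor below 2.
def check_writer_options(options = {})
  merged = DEFAULT_WRITER_OPTIONS.merge(options)
  if merged[:merge_factor] < 2
    raise ArgumentError, ":merge_factor must never be less than 2"
  end
  merged
end

opts = check_writer_options(:path => "/path/to/index", :merge_factor => 20)
puts opts[:chunk_size]          # 1048576
puts opts[:max_buffer_memory]   # 16777216
puts opts[:merge_factor]        # 20
```

A hash like the one returned here has the same shape as the options hash passed to IndexWriter.new, although the real constructor applies its own defaults internally.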
Both IndexReader and IndexWriter allow you to delete documents. Use the IndexReader to delete documents by document id and the IndexWriter to delete documents by term, as explained here. It is preferable to delete documents from an index using IndexWriter for performance reasons. To delete documents using the IndexWriter you should give each document in the index a unique ID. If you are indexing documents from the file-system, this unique ID will be the full file path. If you are indexing documents from a database, you should use the primary key as the ID field. You can then use the delete method to delete the document referenced by the ID. For example:
index_writer.delete(:id, "/path/to/indexed/file")
Create a new IndexWriter. You should either pass a path or a directory to this constructor. For example, here are three ways you can create an IndexWriter:

dir = RAMDirectory.new()
iw = IndexWriter.new(:dir => dir)

dir = FSDirectory.new("/path/to/index")
iw = IndexWriter.new(:dir => dir)

iw = IndexWriter.new(:path => "/path/to/index")
See IndexWriter for more options.
static VALUE frb_iw_init(int argc, VALUE *argv, VALUE self)
{
    VALUE roptions, rval;
    bool create = false;
    bool create_if_missing = true;
    Store *store = NULL;
    Analyzer *analyzer = NULL;
    IndexWriter *volatile iw = NULL;
    Config config = default_config;

    rb_scan_args(argc, argv, "01", &roptions);
    if (argc > 0) {
        Check_Type(roptions, T_HASH);

        if ((rval = rb_hash_aref(roptions, sym_dir)) != Qnil) {
            Check_Type(rval, T_DATA);
            store = DATA_PTR(rval);
        } else if ((rval = rb_hash_aref(roptions, sym_path)) != Qnil) {
            StringValue(rval);
            frb_create_dir(rval);
            store = open_fs_store(rs2s(rval));
            DEREF(store);
        }

        /* Let ruby's garbage collector handle the closing of the store
           if (!close_dir) {
               close_dir = RTEST(rb_hash_aref(roptions, sym_close_dir));
           }
         */
        /* use_compound_file defaults to true */
        config.use_compound_file =
            (rb_hash_aref(roptions, sym_use_compound_file) == Qfalse)
            ? false
            : true;

        if ((rval = rb_hash_aref(roptions, sym_analyzer)) != Qnil) {
            analyzer = frb_get_cwrapped_analyzer(rval);
        }

        create = RTEST(rb_hash_aref(roptions, sym_create));
        if ((rval = rb_hash_aref(roptions, sym_create_if_missing)) != Qnil) {
            create_if_missing = RTEST(rval);
        }
        SET_INT_ATTR(chunk_size);
        SET_INT_ATTR(max_buffer_memory);
        SET_INT_ATTR(index_interval);
        SET_INT_ATTR(skip_interval);
        SET_INT_ATTR(merge_factor);
        SET_INT_ATTR(max_buffered_docs);
        SET_INT_ATTR(max_merge_docs);
        SET_INT_ATTR(max_field_length);
    }
    if (NULL == store) {
        store = open_ram_store();
        DEREF(store);
    }
    if (!create && create_if_missing && !store->exists(store, "segments")) {
        create = true;
    }
    if (create) {
        FieldInfos *fis;
        if ((rval = rb_hash_aref(roptions, sym_field_infos)) != Qnil) {
            Data_Get_Struct(rval, FieldInfos, fis);
            index_create(store, fis);
        } else {
            fis = fis_new(STORE_YES, INDEX_YES,
                          TERM_VECTOR_WITH_POSITIONS_OFFSETS);
            index_create(store, fis);
            fis_deref(fis);
        }
    }

    iw = iw_open(store, analyzer, &config);

    Frt_Wrap_Struct(self, &frb_iw_mark, &frb_iw_free, iw);

    if (rb_block_given_p()) {
        rb_yield(self);
        frb_iw_close(self);
        return Qnil;
    } else {
        return self;
    }
}
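Whether a new index gets created depends on three inputs in the source above: :create, :create_if_missing, and whether a "segments" file already exists in the store. The following is a minimal Ruby restatement of that condition. The helper name create_index? is hypothetical and is not a Ferret method.

```ruby
# Mirrors the condition in frb_iw_init:
#   if (!create && create_if_missing && !store->exists(store, "segments"))
#       create = true;
# create_index? is a hypothetical name used only for this illustration.
def create_index?(create: false, create_if_missing: true, index_exists: false)
  return true if create
  create_if_missing && !index_exists
end

puts create_index?(index_exists: false)                            # true
puts create_index?(index_exists: true)                             # false
puts create_index?(create: true, index_exists: true)               # true
puts create_index?(create_if_missing: false, index_exists: false)  # false
```

The third case is the destructive one: with :create set, an existing index is replaced regardless of the other two inputs.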
Add a document to the index. See Document. A document can also be a simple hash object.
static VALUE frb_iw_add_doc(VALUE self, VALUE rdoc)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    Document *doc = frb_get_doc(rdoc);
    iw_add_doc(iw, doc);
    doc_destroy(doc);
    return self;
}
Use this method to merge other indexes into the one being written by IndexWriter. This is useful for parallel indexing. You can have several indexing processes running in parallel, possibly even on different machines. Then you can finish by merging all of the indexes into a single index.
static VALUE frb_iw_add_readers(VALUE self, VALUE rreaders)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    int i;
    IndexReader **irs;
    Check_Type(rreaders, T_ARRAY);

    irs = ALLOC_N(IndexReader *, RARRAY_LEN(rreaders));
    i = RARRAY_LEN(rreaders);
    while (i-- > 0) {
        IndexReader *ir;
        Data_Get_Struct(RARRAY_PTR(rreaders)[i], IndexReader, ir);
        irs[i] = ir;
    }
    iw_add_readers(iw, irs, RARRAY_LEN(rreaders));
    free(irs);
    return self;
}
Get the Analyzer for this IndexWriter. This is useful if you need to use the same analyzer in a QueryParser.
static VALUE frb_iw_get_analyzer(VALUE self)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    return frb_get_analyzer(iw->analyzer);
}
Set the Analyzer for this IndexWriter. This is useful if you need to change the analyzer for a special document. It is risky though as the same analyzer will be used for all documents during search.
static VALUE frb_iw_set_analyzer(VALUE self, VALUE ranalyzer)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);

    a_deref(iw->analyzer);
    iw->analyzer = frb_get_cwrapped_analyzer(ranalyzer);
    return ranalyzer;
}
Return the current value of #chunk_size
static VALUE frb_iw_get_chunk_size(VALUE self)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    return INT2FIX(iw->config.chunk_size);
}
Set the #chunk_size parameter
static VALUE frb_iw_set_chunk_size(VALUE self, VALUE rval)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    iw->config.chunk_size = FIX2INT(rval);
    return rval;
}
Close the IndexWriter. This will close and free all resources used exclusively by the index writer. The garbage collector will do this automatically if not called explicitly.
static VALUE frb_iw_close(VALUE self)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    Frt_Unwrap_Struct(self);
    iw_close(iw);
    return Qnil;
}
Explicitly commit any changes to the index that are still held in memory. You should call this method if you want to read the latest state of the index with an IndexReader.
static VALUE frb_iw_commit(VALUE self)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    iw_commit(iw);
    return self;
}
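The relationship between buffered documents and an explicit commit can be pictured with a toy model. This is a pure-Ruby illustration and not Ferret's implementation: ToyBufferedWriter and its methods are hypothetical, and the only behaviour modelled is that documents sit in memory until either the buffer limit is reached or commit is called.

```ruby
# Toy model of buffered writes; an illustration only, not Ferret code.
class ToyBufferedWriter
  attr_reader :committed, :buffer

  def initialize(max_buffered_docs)
    @max_buffered_docs = max_buffered_docs
    @buffer = []      # documents held only in memory
    @committed = []   # documents that have been written out
  end

  # Buffers a document and writes the buffer out once it is full.
  def add_document(doc)
    @buffer << doc
    commit if @buffer.size >= @max_buffered_docs
    self
  end

  # Moves everything still in memory into the committed list.
  def commit
    @committed.concat(@buffer)
    @buffer.clear
    self
  end
end

writer = ToyBufferedWriter.new(3)
4.times { |i| writer.add_document(:id => i) }
puts writer.committed.size   # 3
puts writer.buffer.size      # 1
writer.commit
puts writer.committed.size   # 4
```

In this model the fourth document is invisible until commit is called, which is the situation the explicit commit method is meant to resolve.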
Delete all documents in the index that contain the given term or terms in the given field. You should usually have a unique document id which you use with this method, rather than deleting all documents with the word "the" in them. There are of course exceptions to this rule. For example, you may want to delete all documents with the term "viagra" when deleting spam.
static VALUE frb_iw_delete(VALUE self, VALUE rfield, VALUE rterm)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    if (TYPE(rterm) == T_ARRAY) {
        const int term_cnt = RARRAY_LEN(rterm);
        int i;
        char **terms = ALLOC_N(char *, term_cnt);
        for (i = 0; i < term_cnt; i++) {
            terms[i] = StringValuePtr(RARRAY_PTR(rterm)[i]);
        }
        iw_delete_terms(iw, frb_field(rfield), terms, term_cnt);
        free(terms);
    } else {
        iw_delete_term(iw, frb_field(rfield), StringValuePtr(rterm));
    }
    return self;
}
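The source above accepts either a single term or an Array of terms. The sketch below models that behaviour in plain Ruby, assuming the ID field holds a single term per document so that matching reduces to equality. ToyIndex is a hypothetical class written for this illustration and is not part of Ferret.

```ruby
# Toy in-memory index; a pure-Ruby model of delete-by-term, not Ferret code.
class ToyIndex
  attr_reader :docs

  def initialize
    @docs = []
  end

  def add_document(doc)
    @docs << doc
    self
  end

  # Removes every document whose field equals the term, or any of the
  # terms when an Array is given (the C code above branches the same way).
  def delete(field, term_or_terms)
    terms = Array(term_or_terms).map(&:to_s)
    @docs.reject! { |doc| terms.include?(doc[field].to_s) }
    self
  end
end

index = ToyIndex.new
index.add_document(:id => "/docs/a.txt", :content => "first")
index.add_document(:id => "/docs/b.txt", :content => "second")
index.add_document(:id => "/docs/c.txt", :content => "third")

index.delete(:id, "/docs/a.txt")
index.delete(:id, ["/docs/b.txt", "/docs/missing.txt"])
puts index.docs.map { |doc| doc[:id] }.inspect   # ["/docs/c.txt"]
```

Terms that match nothing, such as "/docs/missing.txt" here, are simply ignored in this model.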
Returns the number of documents in the Index. Note that deletions won't be taken into account until the IndexWriter has been committed.
static VALUE frb_iw_get_doc_count(VALUE self)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    return INT2FIX(iw_doc_count(iw));
}
Return the current value of #doc_skip_interval
static VALUE frb_iw_get_skip_interval(VALUE self)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    return INT2FIX(iw->config.skip_interval);
}
Set the #doc_skip_interval parameter
static VALUE frb_iw_set_skip_interval(VALUE self, VALUE rval)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    iw->config.skip_interval = FIX2INT(rval);
    return rval;
}
Get the FieldInfos object for this IndexWriter. This is useful if you need to dynamically add new fields to the index with specific properties.
static VALUE frb_iw_field_infos(VALUE self)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    return frb_get_field_infos(iw->fis);
}
Return the current value of #max_buffer_memory
static VALUE frb_iw_get_max_buffer_memory(VALUE self)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    return INT2FIX(iw->config.max_buffer_memory);
}
Set the #max_buffer_memory parameter
static VALUE frb_iw_set_max_buffer_memory(VALUE self, VALUE rval)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    iw->config.max_buffer_memory = FIX2INT(rval);
    return rval;
}
Return the current value of #max_buffered_docs
static VALUE frb_iw_get_max_buffered_docs(VALUE self)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    return INT2FIX(iw->config.max_buffered_docs);
}
Set the #max_buffered_docs parameter
static VALUE frb_iw_set_max_buffered_docs(VALUE self, VALUE rval)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    iw->config.max_buffered_docs = FIX2INT(rval);
    return rval;
}
Return the current value of #max_field_length
static VALUE frb_iw_get_max_field_length(VALUE self)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    return INT2FIX(iw->config.max_field_length);
}
Set the #max_field_length parameter
static VALUE frb_iw_set_max_field_length(VALUE self, VALUE rval)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    iw->config.max_field_length = FIX2INT(rval);
    return rval;
}
Return the current value of #max_merge_docs
static VALUE frb_iw_get_max_merge_docs(VALUE self)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    return INT2FIX(iw->config.max_merge_docs);
}
Set the #max_merge_docs parameter
static VALUE frb_iw_set_max_merge_docs(VALUE self, VALUE rval)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    iw->config.max_merge_docs = FIX2INT(rval);
    return rval;
}
Return the current value of #merge_factor
static VALUE frb_iw_get_merge_factor(VALUE self)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    return INT2FIX(iw->config.merge_factor);
}
Set the #merge_factor parameter
static VALUE frb_iw_set_merge_factor(VALUE self, VALUE rval)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    iw->config.merge_factor = FIX2INT(rval);
    return rval;
}
Optimize the index for searching. This commits any unwritten data to the index and optimizes the index into a single segment to improve search performance. This is an expensive operation and should not be called too often. The best time to call this is at the end of a long batch indexing process. Note that calling the optimize method does not in any way affect indexing speed (except for the time taken to complete the optimization process).
static VALUE frb_iw_optimize(VALUE self)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    iw_optimize(iw);
    return self;
}
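To see why an index can end up with several segments before optimize is called, consider a deliberately simplified model of merging. It assumes that each flush adds one segment and that segments merge whenever merge_factor of them share the same level. This is an assumption made for illustration, not a description of Ferret's actual merge policy, and segments_after is a hypothetical helper.

```ruby
# Simplified model of segment merging; a sketch under stated assumptions,
# not Ferret's actual merge policy. Each flush adds one level-0 segment,
# and whenever merge_factor segments share a level they merge into one
# segment on the next level.
def segments_after(flushes, merge_factor)
  raise ArgumentError, "merge_factor must be at least 2" if merge_factor < 2
  levels = Hash.new(0)   # level => number of segments at that level
  flushes.times do
    levels[0] += 1
    level = 0
    while levels[level] >= merge_factor
      levels[level] -= merge_factor
      levels[level + 1] += 1
      level += 1
    end
  end
  levels.values.sum
end

puts segments_after(25, 2)    # 3
puts segments_after(25, 10)   # 7
```

Under this model a larger merge factor leaves more segments behind after the same number of flushes, which matches the documented trade-off between indexing and search performance. Optimizing collapses whatever remains into a single segment.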
Return the current value of #term_index_interval
static VALUE frb_iw_get_index_interval(VALUE self)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    return INT2FIX(iw->config.index_interval);
}
Set the #term_index_interval parameter
static VALUE frb_iw_set_index_interval(VALUE self, VALUE rval)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    iw->config.index_interval = FIX2INT(rval);
    return rval;
}
Return the current value of #use_compound_file
static VALUE frb_iw_get_use_compound_file(VALUE self)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    return iw->config.use_compound_file ? Qtrue : Qfalse;
}
Set the #use_compound_file parameter
static VALUE frb_iw_set_use_compound_file(VALUE self, VALUE rval)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    iw->config.use_compound_file = RTEST(rval);
    return rval;
}
Returns the current version of the index writer.
static VALUE frb_iw_version(VALUE self)
{
    IndexWriter *iw = (IndexWriter *)DATA_PTR(self);
    return ULL2NUM(iw->sis->version);
}