Lucene Update Index
Dec 07, 2013
I'm developing a desktop search engine in Visual Basic 9 (VS2008) using Lucene.NET (v2.0). I use the following code to initialize the IndexWriter:

    Private writer As IndexWriter
    writer = New IndexWriter(indexDirectory, New StandardAnalyzer, False)
    writer.SetUseCompoundFile(True)

If I select the same document folder (containing files to be indexed) twice, two different entries for each file in that document folder are created in the index. I want the IndexWriter to discard any files that are already present in the index. What should I do to ensure this?
As Steve mentioned, you need to use an instance of IndexReader and call its DeleteDocuments method. DeleteDocuments accepts either an instance of a Term object or Lucene's internal id of the document (it is generally not recommended to use the internal id as it can and will change as Lucene merges segments). The best way is to use a unique identifier that you've stored in the index specific to your application.
For example, in an index of patients in a doctor's office, if you had a field called "patientid" you could create a term and pass that as an argument to DeleteDocuments. See the following example (sorry, C#):

    int patientID = 12;
    IndexReader indexReader = IndexReader.Open(indexDirectory);
    // Term values are strings, so the numeric id is converted first.
    indexReader.DeleteDocuments(new Term("patientid", patientID.ToString()));
    // Closing the reader writes the pending deletion to the index.
    indexReader.Close();

Then you could add the patient record again with an instance of IndexWriter. I learned a lot from this article. Hope this helps.
There are many out-of-date examples out there on deleting with an id field. The code below will work with Lucene.NET 2.4. It's not necessary to open an IndexReader if you're already using an IndexWriter, or to access IndexSearcher.Reader. You can use IndexWriter.DeleteDocuments(Term), but the tricky part is making sure you've stored your id field correctly in the first place. Be sure to use Field.Index.NOT_ANALYZED as the index setting on your id field when storing the document. This indexes the field without tokenizing it, which is very important, and none of the other Field.Index values will work when used this way:

    IndexWriter writer = new IndexWriter("MyIndexFolder", new StandardAnalyzer());
    var doc = new Document();
    var idField = new Field("id", "MyItemId", Field.Store.YES, Field.Index.NOT_ANALYZED);
    doc.Add(idField);
    writer.AddDocument(doc);
    writer.Commit();

Now you can easily delete or update the document using the same writer:

    Term idTerm = new Term("id", "MyItemId");
    writer.DeleteDocuments(idTerm);
    writer.Commit();
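To see why the index setting on the id field matters, here is a minimal sketch, assuming Lucene.NET 2.9.x (type and method names differ in 3.x and later). The class name IdFieldSketch and the method RemainingAfterDelete are made up for illustration, and the in-memory RAMDirectory only keeps the example self-contained. It indexes one document, deletes by the exact term, and reports how many documents are left.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Hypothetical helper class, not part of Lucene.NET.
public static class IdFieldSketch
{
    // Indexes one document whose "id" field uses the given index setting,
    // deletes by the exact term, and returns how many documents remain.
    public static int RemainingAfterDelete(Field.Index idIndexSetting)
    {
        var directory = new RAMDirectory(); // in-memory index, assumption for the demo
        var writer = new IndexWriter(directory, new StandardAnalyzer(), true,
                                     IndexWriter.MaxFieldLength.UNLIMITED);

        var doc = new Document();
        doc.Add(new Field("id", "MyItemId", Field.Store.YES, idIndexSetting));
        writer.AddDocument(doc);
        writer.Commit();

        // The term text must match the indexed term exactly, including case.
        writer.DeleteDocuments(new Term("id", "MyItemId"));
        writer.Commit();

        int remaining = writer.NumDocs(); // excludes committed deletions
        writer.Close();
        return remaining;
    }

    public static void Main()
    {
        // Untokenized id: the term matches and the document is deleted.
        Console.WriteLine(RemainingAfterDelete(Field.Index.NOT_ANALYZED));
        // Analyzed id: StandardAnalyzer lowercases it to "myitemid",
        // so the term "MyItemId" matches nothing and the document stays.
        Console.WriteLine(RemainingAfterDelete(Field.Index.ANALYZED));
    }
}
```

With an analyzed id field the delete silently does nothing, which is the usual reason duplicate entries survive an attempted delete-and-re-add.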
One option is of course to remove a document and then add the updated version of the document. Alternatively, you can use the UpdateDocument method of the IndexWriter class:

    writer.UpdateDocument(new Term("patientid", document.Get("patientid")), document);

This of course requires you to have a mechanism by which you can locate the document you want to update ("patientid" in this example).
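Applied to the original question (indexing the same folder twice), UpdateDocument could be keyed on the file path. This is a rough sketch assuming Lucene.NET 2.9.x. The "path" and "contents" field names, the sample paths, and the UpdateSketch class are illustrative assumptions, and a real indexer would read the file text instead of the placeholder string used here.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Hypothetical helper class, not part of Lucene.NET.
public static class UpdateSketch
{
    // Adds one document per path, replacing any document already stored under that path.
    public static void IndexFiles(IndexWriter writer, string[] paths)
    {
        foreach (string path in paths)
        {
            var doc = new Document();
            // The key field is stored untokenized so the exact term matches later.
            doc.Add(new Field("path", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
            // Placeholder body text; a real indexer would read the file here.
            doc.Add(new Field("contents", "text of " + path, Field.Store.NO, Field.Index.ANALYZED));
            // Deletes every document containing this term, then adds the new one.
            writer.UpdateDocument(new Term("path", path), doc);
        }
        writer.Commit();
    }

    // Indexes the same two files twice and returns the resulting document count.
    public static int IndexFolderTwice()
    {
        var directory = new RAMDirectory(); // in-memory index, assumption for the demo
        var writer = new IndexWriter(directory, new StandardAnalyzer(), true,
                                     IndexWriter.MaxFieldLength.UNLIMITED);
        string[] paths = { @"C:\docs\a.txt", @"C:\docs\b.txt" };

        IndexFiles(writer, paths);
        IndexFiles(writer, paths); // second pass replaces instead of duplicating

        int count = writer.NumDocs();
        writer.Close();
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(IndexFolderTwice()); // 2, one document per file
    }
}
```

Because UpdateDocument behaves like a plain add when no document matches the term, the same code path serves both the first indexing run and every later re-index.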
We would like to implement a Lucene-based full-text search for an 80 GB database. We are new to Lucene, and the current plan is to achieve this by storing the index terms in a flat file.
1) For indexing, is it wise to choose Zend Lucene (in terms of performance, stability and usage)? Research shows that Zend/PHP Lucene is much slower! So, would it be better to use Java Lucene (or Solr) for the indexing and Zend Framework for querying the search results?
2) In case of insert/update/delete in the records, how do we handle the re-indexing?
Hi Jaggy,
1) We also use Zend Lucene in our product for full-text searches. The searches are fast and reliable, and you can customize the search the way you need it.
2) Inserts are also fast and easy to do.
Updates are not possible: you have to delete the old entry from the search index (and thus you need a unique, bijective id to identify the record in the Lucene index) and then re-add the new/updated content. Deletes can be very time-consuming, as there seems to be a bug in the Zend Lucene implementation. The reliable method is very slow but works on a consistent database. The method preferred by the developers is fast but does not work in all cases.
We will submit a ticket after we have made a test case for that. The advantage of PHP/Lucene is that you have more control over your searches and elements, but you might also use the Java version, depending on the interface you need from it.