Apache Lucene is a popular open source text search engine that can be easily embedded in applications that need search functionality. Lucene is not a full-blown, out-of-the-box web site search engine or crawler. Instead, as you will soon see, Lucene exposes a small API for creating and searching indexes. In this first part of the series, I will show how Lucene can be used to create indexes.
Before any searches can be performed on large amounts of data, it is essential to convert the data into a format that is easy to look up. This conversion process is called indexing (much like building a book index). Indexes created by Lucene contain a collection of documents and are usually stored as a set of files on the file system. A Lucene document is itself a sequence of name-value pairs called fields. The strings indexed for a field are referred to as terms.
Let's say we are writing an employee search application that allows employees to look up each other's information. The first step in the process is indexing the employee information. For the sake of simplicity, let's assume that the employee information is available as a list of Employee objects. Here is a prototype method for creating a Lucene index (using version 2.3.2 of the Lucene API):
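To make the book-index analogy concrete, here is a minimal sketch of the idea behind a search index, written in plain Java with no Lucene dependency. The class name InvertedIndexSketch and its methods are hypothetical names of my own, and the in-memory map is only an illustration of the term-to-document lookup; it is not how Lucene actually lays out its index files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy index: maps each term to the ids of the documents that contain it.
// Illustration only; this is not Lucene's data structure or file format.
class InvertedIndexSketch {
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Splits the text on whitespace, lowercases each piece, and records the document id.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> ids = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Avoid recording the same document twice for a repeated term.
            if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                ids.add(docId);
            }
        }
    }

    // Returns the ids of the documents containing the term, or an empty list.
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new ArrayList<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "Jane Doe");
        index.add(1, "John Smith");
        index.add(2, "Mike Test");
        index.add(3, "Judy Test");
        System.out.println(index.lookup("test"));  // prints [2, 3]
        System.out.println(index.lookup("smith")); // prints [1]
    }
}
```

The point of the sketch is that a lookup goes straight from a term to the matching documents instead of scanning every document, which is what makes searching large amounts of data practical.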
public void createIndex() throws Exception
{
    // Create a writer
    IndexWriter writer = new IndexWriter("c:/lucene/index/", new SimpleAnalyzer());
    // Add documents to the index
    addDocuments(writer);
    // Lucene recommends calling optimize upon completion of indexing
    writer.optimize();
    // clean up
    writer.close();
}
IndexWriter is the heart of Lucene indexing. It creates a new index and exposes an API for adding documents to it. The first parameter to the constructor is the file system path where Lucene should store the index files. Before Lucene can index text, the text needs to be broken down into tokens, which is done by an Analyzer. Out of the box, Lucene provides a variety of analyzers such as SimpleAnalyzer, StandardAnalyzer and StopAnalyzer. An analyzer is specified as the second parameter to the writer constructor.
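To give a feel for what an analyzer does, here is a rough approximation of SimpleAnalyzer's behavior in plain Java: it breaks text at non-letter characters and lowercases the resulting tokens. The class SimpleTokenizerSketch and its tokenize method are hypothetical names for this illustration and are not part of the Lucene API; the real analyzer works on token streams rather than lists.

```java
import java.util.ArrayList;
import java.util.List;

// Approximates SimpleAnalyzer: tokens are maximal runs of letters, lowercased.
// Illustration only; not Lucene code.
class SimpleTokenizerSketch {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                // Letters extend the current token.
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Any non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Jane Doe-Smith, Ext. 42")); // prints [jane, doe, smith, ext]
    }
}
```

Note that the digits disappear entirely, which is worth keeping in mind when choosing an analyzer for fields that hold numbers.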
The next step in the process is adding the documents to the index. Here is a prototype implementation:
public void addDocuments(IndexWriter writer) throws Exception
{
    for(Employee e : employeeList)
    {
        // Create a document
        Document document = new Document();
        // Add fields to the document
        document.add(new Field("firstName", e.getFirstName(), Field.Store.YES, Field.Index.TOKENIZED));
        document.add(new Field("lastName", e.getLastName(), Field.Store.YES, Field.Index.TOKENIZED));
        document.add(new Field("phoneNumber", e.getPhoneNumber(), Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Add the document to the index; without this call nothing is written
        writer.addDocument(document);
    }
}
In the above code, the first two parameters to the Field constructor specify the field name and the field value, and the last two parameters provide metadata on how the field should be stored and indexed. When storing a field we have three options:
Field.Store.YES – Original value is stored in the index
Field.Store.COMPRESS – Original value is stored in the index in a compressed form
Field.Store.NO – Field value is not stored in the index
Similarly, when indexing, we have three options:
Field.Index.NO – Field value is not indexed (useful for data like primary keys)
Field.Index.TOKENIZED – Field value is analyzed and indexed (commonly used option)
Field.Index.UN_TOKENIZED – Field value is not analyzed but indexed (useful for indexing “keywords” or data such as phone numbers)
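The difference between the three indexing options is easiest to see by looking at the terms each one would produce. The sketch below is plain Java with no Lucene dependency; the IndexMode enum is a hypothetical stand-in for Field.Index, and the letter-based tokenizer only approximates SimpleAnalyzer, so treat the output as an illustration rather than a dump of a real index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Models which terms end up searchable under each indexing option.
// Illustration only; not Lucene code.
class IndexModeSketch {
    // Hypothetical stand-in for Field.Index.NO, TOKENIZED and UN_TOKENIZED.
    enum IndexMode { NO, TOKENIZED, UN_TOKENIZED }

    static List<String> termsFor(String value, IndexMode mode) {
        switch (mode) {
            case NO:
                // Not indexed: no terms, so the field cannot be searched.
                return Collections.emptyList();
            case UN_TOKENIZED:
                // Indexed as a single term, exactly as given.
                return Collections.singletonList(value);
            default:
                // Analyzed: split at non-letters and lowercased.
                return letterTokens(value);
        }
    }

    private static List<String> letterTokens(String value) {
        List<String> tokens = new ArrayList<>();
        for (String piece : value.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(termsFor("Jane Doe", IndexMode.TOKENIZED));        // prints [jane, doe]
        System.out.println(termsFor("123-456-8910", IndexMode.TOKENIZED));    // prints []
        System.out.println(termsFor("123-456-8910", IndexMode.UN_TOKENIZED)); // prints [123-456-8910]
        System.out.println(termsFor("123-456-8910", IndexMode.NO));           // prints []
    }
}
```

Under these assumptions, a phone number run through a letter-based analyzer would leave nothing to search on, which is why the example indexes it un-tokenized.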
Putting it all together:
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class EmployeeIndexer {
    // Path to the index directory
    private static final String INDEX_DIRECTORY = "c:/lucene/index";
    private List<Employee> employeeList = new ArrayList<Employee>();

    public EmployeeIndexer() {
        employeeList.add(new Employee("Jane", "Doe", "123-456-8910"));
        employeeList.add(new Employee("John", "Smith", "123-456-8910"));
        employeeList.add(new Employee("Mike", "Test", "123-456-8910"));
        employeeList.add(new Employee("Judy", "Test", "123-456-8910"));
    }

    public void createIndex() throws Exception {
        // Create a writer
        IndexWriter writer = new IndexWriter(INDEX_DIRECTORY, new SimpleAnalyzer());
        // Add documents to the index
        addDocuments(writer);
        // Lucene recommends calling optimize upon completion of indexing
        writer.optimize();
        // clean up
        writer.close();
    }

    public void addDocuments(IndexWriter writer) throws Exception {
        for(Employee e : employeeList) {
            // Create a document
            Document document = new Document();
            // Add fields to the document
            document.add(new Field("firstName", e.getFirstName(), Field.Store.YES, Field.Index.TOKENIZED));
            document.add(new Field("lastName", e.getLastName(), Field.Store.YES, Field.Index.TOKENIZED));
            document.add(new Field("phoneNumber", e.getPhoneNumber(), Field.Store.YES, Field.Index.UN_TOKENIZED));
            // Add the document to the index; without this call nothing is written
            writer.addDocument(document);
        }
    }

    public class Employee {
        private String firstName;
        private String lastName;
        private String phoneNumber;

        public Employee(String firstName, String lastName, String phoneNumber) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.phoneNumber = phoneNumber;
        }

        public String getFirstName() {
            return firstName;
        }

        public void setFirstName(String firstName) {
            this.firstName = firstName;
        }

        public String getLastName() {
            return lastName;
        }

        public void setLastName(String lastName) {
            this.lastName = lastName;
        }

        public String getPhoneNumber() {
            return phoneNumber;
        }

        public void setPhoneNumber(String phoneNumber) {
            this.phoneNumber = phoneNumber;
        }
    }
}