Random Thoughts

Getting started with Lucene – Part 2

September 11th, 2008 Balaji Varanasi No comments

In this post I will highlight some of Lucene’s search functionality. Refer to part one of this series for creating indexes using Lucene.

Searching in Lucene involves submitting a search query to the IndexSearcher class. The IndexSearcher executes this query against an index and returns search results (hits). Here is a prototype implementation:

public Hits searchIndex(Query q) throws Exception
  {
    IndexSearcher searcher = new IndexSearcher("c:/lucene/index");
    return searcher.search(q);
  }

The IndexSearcher constructor takes the path to the index it needs to search against. The IndexSearcher class is thread safe and Lucene API recommends opening and using one IndexSearcher for all searches.

The Query class is an abstract class that encapsulates a user input. The simplest way to generate a concrete query is to use the QueryParser class. The following code generates a query for all the employees whose first name is Judy:

QueryParser parser = new QueryParser("firstName", new SimpleAnalyzer());
Query query = parser.parse("Judy");
Hits hits = searchIndex(query);

The first parameter to the QueryParser is the field name in the document against which the query is being made. For better results, the analyzer passed as the second parameter should be of the same type that is used while creating indexes.

The Hits class encapsulates search results. Hits can be easily iterated over to get to the “interesting” stuff:

for(int i = 0; i < hits.length(); i++)
    {
      Document d = hits.doc(i);
      System.out.println(d.getField("firstName").stringValue());
    }

QueryParser does a good job at interpreting user entered search expressions. If developers find limitations using QueryParser, Lucene provides a nice API to programmatically generate and combine queries. Let’s say a user wants to find all the Employees with first name Judy and last name Test:

Query fnQuery = new TermQuery(new Term("firstName", "Judy"));
    Query lnQuery = new TermQuery(new Term("lastName", "Test"));
    BooleanQuery query = new BooleanQuery();
    query.add(fnQuery, BooleanClause.Occur.MUST);
    query.add(lnQuery, BooleanClause.Occur.MUST);
    // Notice we are not analyzing user entered input before executing search
    Hits hits = searchIndex(query);

By default the returned search results are ordered by decreasing relevance. This however can be easily changed using overloaded search methods in IndexSearcher. The following code sorts the results of the above query on first name field:

Sort sort = new Sort("firstName");
Hits sortedHits = indexSearcher.search(query, sort);

An important thing to remember is that fields used for sorting must not be tokenized. Otherwise you will run into this exception: “there are more terms than documents in field “XXXXXX”, but it’s impossible to sort on tokenized fields”.

Categories: Getting Started Tags: Lucene, Search

Getting started with Lucene – Part 1

September 8th, 2008 Balaji Varanasi No comments

Apache Lucene is a popular open source text search engine that can be easily embedded in applications needing search functionality. Lucene is not a full blown, out of box web site search engine or crawler. Instead as you will soon see, Lucene exposes a small API to create and search indexes. In this first part of the series, I will show how Lucene can be used to create indexes.

Before any searches can be performed on large amounts of data, it is essential to convert the data into a easy to lookup format. This conversion process is called Indexing (much like a book index). Indexes created by Lucene contain a collection of documents and are usually stored as a list of files on the file system. A Lucene document itself is a sequence of name-value pairs called fields. The strings in a field are referred as terms.

Let say we are writing an employee search application that allows employees lookup each others information. The first step in the process is indexing the employee information. For the sake of simplicity, let’s assume that the employee information is available as a list of Employee objects. Here is the prototype method for creating a Lucene index (using the 2.3.2 version of Lucene API):

public void createIndex() throws Exception
  {
    // Create a writer
    IndexWriter writer = new IndexWriter("c:/lucene/index/", new SimpleAnalyzer());

    // Add documents to the index
    addDocuments(writer);

    // Lucene recommends calling optimize upon completion of indexing
    writer.optimize();
    // clean up
    writer.close();
  }

IndexWriter is the heart to Lucene indexing. It creates a new index and exposes API to add documents to the index. The first parameter to the constructor is the file system path where Lucene needs to store the index files. Before Lucene can index text, the text needs to be broken down in to tokens which is done via an Analyzer. Lucene out of box provides a variety of analyzers such as SimpleAnalyzer, StandardAnalyzer, StopAnalyzer etc. An anlyzer is specified as the second parameter to the writer constructor.

The next step in the process is adding the documents to the index. Here is a prototype implementation:

  public void addDocuments(IndexWriter writer) throws Exception
  {
    for(Employee e : employeeList)
    {
      // Create a document
      Document document = new Document();
      // Add fields to the document
      document.add(new Field("firstName", e.getFirstName(), Field.Store.YES, Field.Index.TOKENIZED));
      document.add(new Field("lastName", e.getLastName(), Field.Store.YES, Field.Index.TOKENIZED));
      document.add(new Field("phoneNumber", e.getPhoneNumber(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    }
  }

In the above code, the first two parameters to the Field specify the field name, field value and the last two parameters provide metadata on how the field needs to be stored and indexed. When storing a field we have three options:
          Field.Store.YES – Original value is stored in the index
          Field.Store.COMPRESS – Original value is stored in the index in a compressed form
          Field.Store.NO – Field value is not stored in the index

Similarly, when indexing, we have couple options:
          Field.Index.NO – Field value is not indexed (useful for data like primary keys)
          Field.Index.TOKENIZED – Field value is analyzed and indexed (commonly used option)
          Field.Index.UN_TOKENIZED – Field value is not analyzed but indexed (useful for indexing “keywords” or data such as phone numbers)

Putting it all together:

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class EmployeeIndexer {

  // Path to the index directory
  private static final String INDEX_DIRECTORY = "c:/lucene/index";

  private List<Employee> employeeList = new ArrayList<Employee>();

  public EmployeeIndexer() {
    employeeList.add(new Employee("Jane", "Doe", "123-456-8910"));
    employeeList.add(new Employee("John", "Smith", "123-456-8910"));
    employeeList.add(new Employee("Mike", "Test", "123-456-8910"));
    employeeList.add(new Employee("Judy", "Test", "123-456-8910"));
  }

  public void createIndex() throws Exception {
    // Create a writer
    IndexWriter writer = new IndexWriter(INDEX_DIRECTORY, new SimpleAnalyzer());

    // Add documents to the index
    addDocuments(writer);

    // Lucene recommends calling optimize upon completion of indexing
    writer.optimize();
    // clean up
    writer.close();
  }

  public void addDocuments(IndexWriter writer) throws Exception {
    for(Employee e : employeeList) {
      // Create a document
      Document document = new Document();
      // Add fields to the document
      document.add(new Field("firstName", e.getFirstName(), Field.Store.YES, Field.Index.TOKENIZED));
      document.add(new Field("lastName", e.getLastName(), Field.Store.YES, Field.Index.TOKENIZED));
      document.add(new Field("phoneNumber", e.getPhoneNumber(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    }
  }

  public class Employee {
    private String firstName;
    private String lastName;
    private String phoneNumber;

    public Employee(String firstName, String lastName, String phoneNumber) {
      this.firstName = firstName;
      this.lastName = lastName;
      this.phoneNumber = phoneNumber;
    }

    public String getFirstName() {
      return firstName;
    }
    public void setFirstName(String firstName) {
      this.firstName = firstName;
    }
    public String getLastName() {
      return lastName;
    }
    public void setLastName(String lastName) {
      this.lastName = lastName;
    }
    public String getPhoneNumber() {
      return phoneNumber;
    }
    public void setPhoneNumber(String phoneNumber) {
      this.phoneNumber = phoneNumber;
    }
  }
}

Categories: Getting Started Tags: Indexing, Lucene

Random Thoughts

Archive

Getting started with Lucene – Part 2

Getting started with Lucene – Part 1

My Book

Preparing for a JEE interview?

Random Posts

Categories

Archives

Cluster Maps

Random Thoughts

Archive

Getting started with Lucene – Part 2

Getting started with Lucene – Part 1

My Book

Preparing for a JEE interview?

Random Posts

Tag Cloud

Categories

Archives

Cluster Maps