Indexing and Searching with Apache Lucene

We start with what Apache Lucene is, then walk through a basic example of indexing documents and searching them.

1. Introduction

Lucene is a high-performance, scalable information retrieval (IR) library. IR refers to the process of searching for documents, information within documents, or metadata about documents. Lucene lets you add searching capabilities to your application. [ref. Lucene in Action, Second Edition, which covers Apache Lucene v3.0]

The main reason for Lucene's popularity is its simplicity. You don't need in-depth knowledge of the indexing and searching process to get started; learning the handful of classes that do the indexing and searching is enough. At the time of writing, the latest released version is 4.7, while books are available only for v3.0.

Important note

Lucene is not a ready-to-use application like a file-search program, web crawler or search engine. It is a software library with which you can build your own search applications or libraries. Many frameworks are built on top of the Lucene Core API for searching.

Libraries used

JDK 1.7
lucene-core-4.7.2.jar
lucene-queryparser-4.7.2.jar
lucene-demo-4.7.2.jar
lucene-analyzers-common-4.7.2.jar

2. Indexing with Apache Lucene

Let's start with indexing documents using Apache Lucene. We have created a utility class named IndexingHelper to create the index.

2.1. IndexerTest is a demo class to index data using Apache Lucene.

/**
 * @author Gaurav Rai Mazra
 */
public class IndexerTest {
 
    public static void main(String[] args) throws Exception {
        String indexDir = "index";
        String dataDir = "dir";

        long start = System.currentTimeMillis();
        final IndexingHelper indexHelper = new IndexingHelper(indexDir);
        int numIndexed;

        try {
            numIndexed = indexHelper.index(dataDir, new TextFilesFilter());
        }
        finally {
            indexHelper.close();
        }

        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took " + (end - start) + " milliseconds");
    }
}

2.2. TextFilesFilter class accepts only files whose names end with the .txt extension.

import java.io.File;
import java.io.FileFilter;

// class filters only .txt files for indexing
class TextFilesFilter implements FileFilter {
    @Override
    public boolean accept(File pathname) {
        return pathname.getName().toLowerCase().endsWith(".txt");
    }
}
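
As a quick check of the filter on its own, the following sketch lists a directory through File.listFiles(FileFilter) and collects the accepted names. The TextFilterDemo class and its acceptedNames helper are hypothetical names introduced only for illustration, and the filter is repeated so that the snippet compiles by itself.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Same filter as above, repeated so the sketch is self-contained
class TextFilesFilter implements FileFilter {
    @Override
    public boolean accept(File pathname) {
        return pathname.getName().toLowerCase().endsWith(".txt");
    }
}

// Hypothetical helper class, not part of the indexing example
class TextFilterDemo {

    // Returns the sorted names of the entries in dir that the filter accepts
    static List<String> acceptedNames(File dir) {
        List<String> names = new ArrayList<String>();
        File[] files = dir.listFiles(new TextFilesFilter());
        if (files == null) {
            // listFiles returns null when dir does not exist or is not a directory
            return names;
        }
        for (File f : files) {
            names.add(f.getName());
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // "dir" is the data directory name used in IndexerTest
        File dir = new File(args.length > 0 ? args[0] : "dir");
        System.out.println(acceptedNames(dir));
    }
}
```

Note that the filter looks only at the name, so a directory called notes.txt would also be accepted; IndexingHelper rejects directories with a separate check before indexing.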

2.3. IndexingHelper class indexes the files using the Apache Lucene APIs.

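import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * @author Gaurav Rai Mazra
 */
public class IndexingHelper {
    // IndexWriter creates and maintains the index on disk
    private IndexWriter indexWriter;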
 
    public IndexingHelper(String indexDir) throws Exception {
        //To represent actual directory
        Directory directory = FSDirectory.open(new File(indexDir));
        //Holds configuration required in creation of IndexWriter object
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
        indexWriter = new IndexWriter(directory, indexWriterConfig);
    }
 
    public void close() throws IOException {
        indexWriter.close();
    }
 
    // exposed method to index files
    public int index(String dataDir, FileFilter fileFilter) throws Exception {
        File[] files = new File(dataDir).listFiles();
        if (files == null) {
            // listFiles() returns null when dataDir is missing or is not a directory
            throw new IOException("Cannot list files in: " + dataDir);
        }
        for (File f : files) {
            if (!f.isDirectory() && !f.isHidden() && f.exists() && f.canRead() && (fileFilter == null || fileFilter.accept(f))) {
                indexFile(f);
            }
        }

        return indexWriter.numDocs();
    }
 
    private void indexFile(File f) throws Exception {
        Document doc = getDocument(f);
        indexWriter.addDocument(doc);
    }

    private Document getDocument(File f) throws Exception {
        // Document is the unit that IndexWriter and IndexReader use to store and retrieve indexed data
        Document document = new Document();
        document.add(new TextField("contents", new FileReader(f)));
        document.add(new StringField("filename", f.getName(), Field.Store.YES));
        document.add(new StringField("fullpath", f.getCanonicalPath(), Field.Store.YES));
        return document;
    }
}

3. Explanation on Indexing

In the IndexingHelper class, we have used the following classes from the Apache Lucene library to index .txt files.

3.1. IndexWriter class.
3.2. IndexWriterConfig class.
3.3. Directory class.
3.4. FSDirectory class.
3.5. Analyzer class.
3.6. Document class.
3.7. Field class.

3.1. IndexWriter class: It is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, removes and updates documents in the index. It has one public constructor, which takes a Directory object and an IndexWriterConfig object as parameters.

It exposes several methods for adding Document objects to the index, such as addDocument and addDocuments.

It also exposes methods for deleting documents from the index, as well as informative methods like numDocs(), which returns the number of documents in the index, including documents that have not been flushed to disk yet. Documents whose deletion is still buffered keep being counted until the deletes are applied, for example by a commit.

3.2. IndexWriterConfig class: It holds the configuration required to create an IndexWriter object. Its public constructor takes two parameters. The first is a value of the Version enum, i.e. the Lucene version to stay compatible with. The second is an Analyzer object; Analyzer is itself an abstract class with many concrete implementations such as WhitespaceAnalyzer and StandardAnalyzer, and it is used in the analysis process to turn text into tokens.

3.3. Directory class: The Directory class represents the location of a Lucene index. It is an abstract class and has many different concrete implementations.

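3.4. FSDirectory class: No single Directory implementation is best for every platform. FSDirectory is an abstract class whose static open method picks the most suitable concrete file-system implementation for the environment you are running on, so prefer FSDirectory.open over instantiating a specific implementation yourself.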

3.5. Analyzer class: Before any text is indexed, it is passed to an Analyzer, which extracts the tokens that should be indexed from the text and discards the rest.
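
To make the idea concrete, here is a deliberately simplified sketch of what analysis does to a piece of text: split it into tokens, lowercase them, and drop common stop words. It is plain Java and not Lucene's StandardAnalyzer, whose tokenization rules are considerably more elaborate; the class name ToyAnalyzer and its stop word list are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an analyzer: NOT a Lucene class
class ToyAnalyzer {

    // Assumed, tiny stop word list for the sketch
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    // Splits text on runs of non-letter, non-digit characters,
    // lowercases each piece and drops stop words
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [direwolf, sigil, house, stark]
        System.out.println(tokenize("The direwolf is the sigil of House Stark."));
    }
}
```

Because the query string is analyzed as well, a search for "direwolf" can match text that contained "Direwolf" in the original file, provided the same kind of analyzer is used when indexing and when searching.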

3.6. Document class: It represents a collection of fields. It is the chunk of data that we want to index and make retrievable at a later time.

3.7. Field class: Each document has one or more fields. Each field has a name and a corresponding value. Many of the Field class's constructors are deprecated, so it is preferable to use one of its specialised subclasses such as IntField, LongField, FloatField, DoubleField, BinaryDocValuesField, NumericDocValuesField, SortedDocValuesField, StringField, TextField or StoredField.

4. Searching with Apache Lucene

Let's now search the indexed documents with Apache Lucene. We have created a utility class SearcherTest for this.

4.1. SearcherTest is a demo class for searching using Apache Lucene.

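import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * @author Gaurav Rai Mazra
 */
public class SearcherTest {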

    public static void main(String[] args) throws IOException, ParseException {
        String indexDir = "index";
        String q = "direwolf";

        search(indexDir, q);
    }

    // Search the Lucene index
    private static void search(String indexDir, String q) throws IOException, ParseException {
        // open the directory that holds the index
        Directory directory = FSDirectory.open(new File(indexDir));
        // open a reader over the index
        IndexReader indexReader = DirectoryReader.open(directory);
        try {
            // create the IndexSearcher
            IndexSearcher is = new IndexSearcher(indexReader);
            // create the analyzer; it should match the one used at index time
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
            // create the query parser, searching the "contents" field by default
            QueryParser queryParser = new QueryParser(Version.LUCENE_47, "contents", analyzer);
            // parse the query string
            Query query = queryParser.parse(q);

            //Query query1 = new TermQuery(new Term("contents", q));

            long start = System.currentTimeMillis();
            // run the query and keep the 10 best hits
            TopDocs hits = is.search(query, 10);
            long end = System.currentTimeMillis();

            System.out.println("Found " + hits.totalHits + " document(s) in " + (end - start) + " milliseconds");
            for (ScoreDoc scoreDoc : hits.scoreDocs) {
                Document document = is.doc(scoreDoc.doc);
                System.out.println(document.get("fullpath"));
            }
        } finally {
            // release the reader and the directory
            indexReader.close();
            directory.close();
        }
    }
}

5. Explanation on Searching

5.1. IndexReader class.
5.2. IndexSearcher class.
5.3. QueryParser class.
5.4. Query class.
5.5. TopDocs class.

5.1. IndexReader class: This is an abstract class providing an interface for accessing an index. To obtain a concrete implementation, use the helper class DirectoryReader: its static open method takes a Directory reference and returns a reader that can be used as an IndexReader.

5.2. IndexSearcher class: IndexSearcher is used to search the data indexed by IndexWriter. You can think of IndexSearcher as a class which opens the index in read-only mode. It needs an IndexReader instance to be created, and it has methods for searching and for fetching documents.

5.3. QueryParser class: This class parses a query string, using the given analyzer and default field, and produces a Query object from it.

5.4. Query class: It is an abstract class that represents the query used in searching. It has many concrete subclasses such as TermQuery, BooleanQuery and PhraseQuery. It also offers several utility methods, one of which is setBoost(float).

5.5. TopDocs class: It represents the hits returned by the search method of IndexSearcher. It has one public constructor which takes three parameters: totalHits of type int, scoreDocs of type ScoreDoc[] and maxScore of type float. Each ScoreDoc contains the score and the document id of a hit.
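
The relationship between totalHits and scoreDocs is easy to misread: a call like search(query, 10) returns at most ten ScoreDoc entries, best score first, while totalHits counts every matching document. The following plain-Java sketch mimics that shape with hypothetical ToyHit and ToyTopDocs classes. It does not use Lucene and says nothing about how Lucene computes scores; returning NaN as maxScore for an empty result is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for ScoreDoc: a document id with its score
class ToyHit {
    final int doc;
    final float score;

    ToyHit(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

// Hypothetical stand-in for TopDocs
class ToyTopDocs {
    final int totalHits;          // number of all matching documents
    final List<ToyHit> scoreDocs; // only the best n of them
    final float maxScore;         // highest score, NaN when nothing matched

    ToyTopDocs(int totalHits, List<ToyHit> scoreDocs, float maxScore) {
        this.totalHits = totalHits;
        this.scoreDocs = scoreDocs;
        this.maxScore = maxScore;
    }

    // Keeps the n highest-scoring hits out of all matches
    static ToyTopDocs topN(List<ToyHit> allMatches, int n) {
        List<ToyHit> sorted = new ArrayList<ToyHit>(allMatches);
        Collections.sort(sorted, new Comparator<ToyHit>() {
            @Override
            public int compare(ToyHit a, ToyHit b) {
                return Float.compare(b.score, a.score); // descending by score
            }
        });
        int keep = Math.min(Math.max(n, 0), sorted.size());
        List<ToyHit> best = new ArrayList<ToyHit>(sorted.subList(0, keep));
        float maxScore = best.isEmpty() ? Float.NaN : best.get(0).score;
        return new ToyTopDocs(allMatches.size(), best, maxScore);
    }

    public static void main(String[] args) {
        List<ToyHit> all = new ArrayList<ToyHit>();
        all.add(new ToyHit(0, 0.2f));
        all.add(new ToyHit(1, 0.9f));
        all.add(new ToyHit(2, 0.5f));
        ToyTopDocs top = topN(all, 2);
        // prints "3 matches, 2 returned"
        System.out.println(top.totalHits + " matches, " + top.scoreDocs.size() + " returned");
    }
}
```

In SearcherTest this is why the "Found ... document(s)" line can report more documents than the number of paths printed after it.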

Tags: Apache Lucene, Document indexing, Document searching, Building search applications, Java
