We will start with what Apache Lucene is and then walk through a basic example of how to index documents and search them with it.
Lucene is a high-performance, scalable information retrieval (IR) library. IR refers to the process of searching for documents, information within documents, or metadata about documents. Lucene lets you add searching capabilities to your application. [ref. Lucene in Action, Second Edition, which covers Apache Lucene v3.0]
The main reason for Lucene's popularity is its simplicity. You don't need in-depth knowledge of the indexing and searching process to get started with Lucene; you can begin by learning a handful of classes that do the actual indexing and searching. At the time of writing, the latest released version is 4.7, while books are available only for v3.0.
Important note
Lucene is not a ready-to-use application like a file-search program, a web crawler, or a search engine. It is a software toolkit, or library, with which you can build your own search applications or libraries. Many search frameworks are built on top of the Lucene Core API.
Let's start with indexing documents using Apache Lucene. We have created a utility class named IndexingHelper to create the index.
2.1. IndexerTest is a demo class that indexes data using Apache Lucene.
/**
 * @author Gaurav Rai Mazra
 */
public class IndexerTest {
    public static void main(String[] args) throws Exception {
        String indexDir = "index"; // directory where the index is written
        String dataDir = "dir";    // directory containing the files to index
        long start = System.currentTimeMillis();
        final IndexingHelper indexHelper = new IndexingHelper(indexDir);
        int numIndexed;
        try {
            numIndexed = indexHelper.index(dataDir, new TextFilesFilter());
        } finally {
            // always close the writer so the index is committed and released
            indexHelper.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took " + (end - start) + " milliseconds");
    }
}
2.2. TextFilesFilter is a class that accepts only files whose names end with the .txt extension.
import java.io.File;
import java.io.FileFilter;

// Accepts only .txt files for indexing (case-insensitive match on the name)
class TextFilesFilter implements FileFilter {
    @Override
    public boolean accept(File pathname) {
        return pathname.getName().toLowerCase().endsWith(".txt");
    }
}
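To see the filter in isolation, the sketch below passes the same rule to File.listFiles(FileFilter) from the JDK, without involving Lucene. The class name FilterDemo, the constant TXT_ONLY, and the helper countTextFiles are hypothetical names introduced only for this illustration. Note that the filter looks at the name only, so a directory called notes.txt would also be accepted; IndexingHelper rules directories out separately.

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical demo class; not part of the original example.
class FilterDemo {
    // Same rule as TextFilesFilter: accept names ending in .txt, ignoring case.
    static final FileFilter TXT_ONLY = new FileFilter() {
        @Override
        public boolean accept(File pathname) {
            return pathname.getName().toLowerCase().endsWith(".txt");
        }
    };

    // Counts the entries of dir accepted by the filter; 0 if dir cannot be listed.
    static int countTextFiles(File dir) {
        File[] matches = dir.listFiles(TXT_ONLY);
        return matches == null ? 0 : matches.length;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory with two text files and one other file.
        File dir = Files.createTempDirectory("filter-demo").toFile();
        new File(dir, "notes.txt").createNewFile();
        new File(dir, "README.TXT").createNewFile();
        new File(dir, "image.png").createNewFile();
        System.out.println(countTextFiles(dir) + " text file(s) accepted");
    }
}
```

listFiles returns null when the path is not a readable directory, which is why the helper checks for null before reading the length.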
2.3. IndexingHelper is the class that indexes the files using the Apache Lucene APIs.
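import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * @author Gaurav Rai Mazra
 */
public class IndexingHelper {
    // class which actually creates and maintains the index on disk
    private IndexWriter indexWriter;

    public IndexingHelper(String indexDir) throws Exception {
        // represents the actual directory that holds the index
        Directory directory = FSDirectory.open(new File(indexDir));
        // holds the configuration required to create the IndexWriter object
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
        indexWriter = new IndexWriter(directory, indexWriterConfig);
    }

    public void close() throws IOException {
        indexWriter.close();
    }

    // exposed method to index files
    public int index(String dataDir, FileFilter fileFilter) throws Exception {
        File[] files = new File(dataDir).listFiles();
        // listFiles() returns null if dataDir is not a readable directory
        if (files == null) {
            throw new IOException("Cannot list files in " + dataDir);
        }
        for (File f : files) {
            if (!f.isDirectory() && !f.isHidden() && f.exists() && f.canRead() && (fileFilter == null || fileFilter.accept(f))) {
                indexFile(f);
            }
        }
        return indexWriter.numDocs();
    }

    private void indexFile(File f) throws Exception {
        Document doc = getDocument(f);
        indexWriter.addDocument(doc);
    }

    private Document getDocument(File f) throws Exception {
        // class used by Lucene's IndexWriter and IndexReader to store and retrieve indexed data
        Document document = new Document();
        // file contents are tokenized and indexed, but not stored
        document.add(new TextField("contents", new FileReader(f)));
        // name and path are indexed as single tokens and stored for retrieval
        document.add(new StringField("filename", f.getName(), Field.Store.YES));
        document.add(new StringField("fullpath", f.getCanonicalPath(), Field.Store.YES));
        return document;
    }
}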
In the IndexingHelper class, we used the following classes from the Apache Lucene library to index .txt files: IndexWriter, IndexWriterConfig, Directory, FSDirectory, Analyzer, Document, and Field.
3.1. IndexWriter class: It is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, removes, and updates documents in the index. It has one public constructor, which takes a Directory object and an IndexWriterConfig object as parameters.
This class exposes several methods for adding Document objects to the index.
It also exposes methods for deleting documents from the index, as well as informative methods such as numDocs(), which returns the number of documents in the index; deleted documents are still counted if their deletions have not yet been flushed to the index files.
3.2. IndexWriterConfig class: It holds the configuration required to create an IndexWriter object. It has one public constructor, which takes two parameters. The first is a value of the Version enum, that is, the Lucene version to match for compatibility. The second is an Analyzer object; Analyzer is itself an abstract class with many implementations, such as WhitespaceAnalyzer and StandardAnalyzer, which break text into tokens during the analysis process.
3.3. Directory class: The Directory class represents the location of a Lucene index. It is an abstract class with many concrete implementations.
3.4. FSDirectory class: No single Directory implementation is best for every platform. Hence, use the abstract FSDirectory class, whose open method (as called in IndexingHelper) picks the best concrete implementation available for your environment.
3.5. Analyzer class: Before any text is indexed, it is passed to an Analyzer, which extracts the tokens that should be indexed and discards the rest.
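As a rough illustration of what "extracting tokens and discarding the rest" can mean, here is a toy analyzer written with the JDK only. It is not Lucene's StandardAnalyzer and does not use the Lucene API; the class name ToyAnalyzer, the method analyze, and the tiny stop-word list are assumptions made up for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an analyzer; this is NOT Lucene's StandardAnalyzer.
class ToyAnalyzer {
    // A tiny, made-up stop-word list for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "the", "of", "is"));

    // Lowercases the text, splits it on anything that is not a letter or digit,
    // and drops empty strings and stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [direwolf, sigil, house, stark]
        System.out.println(analyze("The Direwolf is a sigil of House Stark."));
    }
}
```

Real analyzers are built from a tokenizer plus a chain of token filters, so the exact tokens they produce depend on which analyzer you pass to IndexWriterConfig.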
3.6. Document class: It represents a collection of fields. It is a chunk of data that we want to index and make retrievable at a later time.
3.7. Field class: Each document has one or more fields. Each field has a name and a corresponding value. Most of the Field class's methods are deprecated. It is preferable to use one of the existing Field subclasses, such as IntField, LongField, FloatField, DoubleField, BinaryDocValuesField, NumericDocValuesField, SortedDocValuesField, StringField, TextField, or StoredField.
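To make the relationship between documents, fields, and an index more concrete, the following JDK-only sketch models a document as a map of field names to values and builds a small inverted index (term to document ids) for the "contents" field. It is a simplified mental model, not how Lucene stores data internally, and the names ToyIndex, addDocument, and filenamesContaining are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy model of documents, fields and an index; NOT Lucene's internal format.
class ToyIndex {
    // Stored fields of each document: field name -> field value.
    private final List<Map<String, String>> documents = new ArrayList<Map<String, String>>();
    // Inverted index for "contents": term -> ids of the documents containing it.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<String, TreeSet<Integer>>();

    // Adds one document and returns its id (its position in the list).
    int addDocument(String filename, String contents) {
        Map<String, String> doc = new LinkedHashMap<String, String>();
        doc.put("filename", filename); // stored whole, like the StringField in IndexingHelper
        int docId = documents.size();
        documents.add(doc);
        // "contents" is indexed term by term but not stored, like the TextField built from a Reader.
        for (String term : contents.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            TreeSet<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(term, ids);
            }
            ids.add(docId);
        }
        return docId;
    }

    // Returns the stored filenames of all documents whose contents contain the term.
    List<String> filenamesContaining(String term) {
        List<String> result = new ArrayList<String>();
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        if (ids != null) {
            for (int id : ids) {
                result.add(documents.get(id).get("filename"));
            }
        }
        return result;
    }

    int numDocs() {
        return documents.size();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("ghost.txt", "Ghost is a direwolf");
        index.addDocument("words.txt", "Winter is coming");
        System.out.println(index.filenamesContaining("direwolf"));
    }
}
```

The split between what is indexed (searchable terms) and what is stored (values you can read back from a hit) is the same distinction that Field.Store.YES controls in the IndexingHelper listing.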
Now let's move on to searching documents with Apache Lucene. We have created a utility class, SearcherTest, for this.
4.1. SearcherTest is a demo class for searching using Apache Lucene.
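import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * @author Gaurav Rai Mazra
 */
public class SearcherTest {
    public static void main(String[] args) throws IOException, ParseException {
        String indexDir = "index";
        String q = "direwolf";
        search(indexDir, q);
    }

    // Search in the Lucene index
    private static void search(String indexDir, String q) throws IOException, ParseException {
        // get the directory to search in
        Directory directory = FSDirectory.open(new File(indexDir));
        // get a reader for that directory
        IndexReader indexReader = DirectoryReader.open(directory);
        try {
            // create the index searcher
            IndexSearcher is = new IndexSearcher(indexReader);
            // create the analyzer used to analyze the query text
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
            // create the query parser with "contents" as the default field
            QueryParser queryParser = new QueryParser(Version.LUCENE_47, "contents", analyzer);
            // parse the query string
            Query query = queryParser.parse(q);
            //Query query1 = new TermQuery(new Term("contents", q));
            long start = System.currentTimeMillis();
            // run the query, keeping at most the 10 best hits
            TopDocs hits = is.search(query, 10);
            long end = System.currentTimeMillis();
            System.out.println("Found " + hits.totalHits + " document(s) in " + (end - start) + " milliseconds");
            for (ScoreDoc scoreDoc : hits.scoreDocs) {
                // load the stored fields of the matching document
                Document document = is.doc(scoreDoc.doc);
                System.out.println(document.get("fullpath"));
            }
        } finally {
            // release the index files held by the reader
            indexReader.close();
        }
    }
}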
In the SearcherTest class, we used the following classes from the Apache Lucene library: IndexReader, IndexSearcher, QueryParser, Query, and TopDocs.
5.1. IndexReader class: This is an abstract class that provides an interface for accessing an index. To obtain a concrete implementation, use the helper class DirectoryReader: calling its open method with a Directory reference returns an IndexReader object.
5.2. IndexSearcher class: IndexSearcher is used to search the data indexed by IndexWriter. You can think of IndexSearcher as a class that opens the index in read-only mode. It requires an IndexReader instance to be constructed, and it has methods for searching and for retrieving documents.
5.3. QueryParser class: This class parses a query string and generates a Query object from it.
5.4. Query class: It is an abstract class that represents the query used in searching. It has many concrete subclasses, such as TermQuery, BooleanQuery, and PhraseQuery. It contains several utility methods, one of which is setBoost(float).
5.5. TopDocs class: It represents the hits returned by the search method of IndexSearcher. It has one public constructor, which takes three parameters: totalHits of type int, scoreDocs, an array of type ScoreDoc, and maxScore of type float. A ScoreDoc contains the score and the document ID of a document.
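The shape of a search result, a total hit count plus only the best n scored hits, can be sketched with plain JDK classes. The code below is a toy model and not the Lucene implementation: ToyTopDocs, ToyScoreDoc, and topN are hypothetical names, and using NaN as the maxScore of an empty result is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Toy versions of the result classes; the real ones live in org.apache.lucene.search.
class ToyTopDocs {
    // One hit: a document id and its score.
    static class ToyScoreDoc {
        final int doc;
        final float score;

        ToyScoreDoc(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    final int totalHits;               // number of all matching documents
    final List<ToyScoreDoc> scoreDocs; // only the best n hits, highest score first
    final float maxScore;              // best score, or NaN when nothing matched (sketch-only choice)

    ToyTopDocs(int totalHits, List<ToyScoreDoc> scoreDocs, float maxScore) {
        this.totalHits = totalHits;
        this.scoreDocs = scoreDocs;
        this.maxScore = maxScore;
    }

    // Keeps the n highest-scoring hits out of all matches.
    static ToyTopDocs topN(List<ToyScoreDoc> allMatches, int n) {
        List<ToyScoreDoc> sorted = new ArrayList<ToyScoreDoc>(allMatches);
        Collections.sort(sorted, new Comparator<ToyScoreDoc>() {
            @Override
            public int compare(ToyScoreDoc a, ToyScoreDoc b) {
                int byScore = Float.compare(b.score, a.score); // descending score
                return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc); // ties: lower id first
            }
        });
        int keep = Math.max(0, Math.min(n, sorted.size()));
        List<ToyScoreDoc> top = new ArrayList<ToyScoreDoc>(sorted.subList(0, keep));
        float maxScore = top.isEmpty() ? Float.NaN : top.get(0).score;
        return new ToyTopDocs(allMatches.size(), top, maxScore);
    }

    public static void main(String[] args) {
        List<ToyScoreDoc> matches = new ArrayList<ToyScoreDoc>();
        matches.add(new ToyScoreDoc(0, 0.2f));
        matches.add(new ToyScoreDoc(1, 0.9f));
        matches.add(new ToyScoreDoc(2, 0.5f));
        ToyTopDocs top = topN(matches, 2);
        System.out.println(top.totalHits + " hit(s), best document id " + top.scoreDocs.get(0).doc);
    }
}
```

This mirrors the call is.search(query, 10) in SearcherTest: totalHits can be larger than the length of scoreDocs, because only the requested number of best hits is returned.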