Lucene - Quick Guide
  • Date: 2024-12-22


Lucene - Overview

Lucene is a simple yet powerful Java-based search library. It can be used in any application to add search capability. Lucene is an open-source, scalable, high-performance library that can index and search virtually any kind of text. It provides the two core operations required by any search application: indexing and searching.

How Does a Search Application Work?

A search application performs all or a few of the following operations −

Step Title & Description
1 Acquire Raw Content

The first step of any search application is to collect the target content on which the search is to be conducted.

2 Build the Document

The next step is to build the document(s) from the raw content in a form that the search application can understand and interpret easily.

3 Analyze the Document

Before indexing starts, the document is analyzed to determine which parts of the text are candidates for indexing.

4 Index the Document

Once documents are built and analyzed, the next step is to index them so that a document can be retrieved based on certain keys instead of its entire content. An index is similar to the index at the end of a book, where common words are listed with their page numbers so that they can be found quickly without reading the complete book.

5 User Interface for Search

Once a database of indexes is ready, the application can run searches. To let a user search, the application must provide a means, such as a user interface, where the user can enter text and start the search process.

6 Build Query

Once a user makes a request to search for a text, the application should prepare a Query object from that text, which can be used to query the index database for the relevant details.

7 Search Query

Using the Query object, the index database is checked to get the relevant details and the matching documents.

8 Render Results

Once the result is received, the application should decide how to show it to the user through the user interface, for example how much information to show at first glance.

Apart from these basic operations, a search application can also provide an administration user interface that helps administrators control the level of search based on user profiles. Analytics of search results is another important and advanced aspect of any search application.
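The book-index analogy in step 4 can be made concrete with a toy inverted index. The sketch below is plain Java and does not use Lucene; the class name ToyInvertedIndex and its methods are hypothetical, and a real Lucene index is far more elaborate. It only illustrates the idea of mapping each word to the documents that contain it, so that a lookup does not have to scan every document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ToyInvertedIndex {

   // word -> sorted ids of the documents that contain it
   private final Map<String, Set<Integer>> postings = new HashMap<>();
   private final List<String> documents = new ArrayList<>();

   // Lower-case the text, split it on non-alphanumerics and index each word.
   // Returns the id assigned to the document.
   public int add(String text) {
      int docId = documents.size();
      documents.add(text);
      for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
         if (!word.isEmpty()) {
            postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
         }
      }
      return docId;
   }

   // Look the word up in the index instead of scanning every document.
   public Set<Integer> search(String word) {
      return postings.getOrDefault(word.toLowerCase(Locale.ROOT), new TreeSet<>());
   }

   public static void main(String[] args) {
      ToyInvertedIndex index = new ToyInvertedIndex();
      index.add("Lucene is a search library");
      index.add("A library of books");
      System.out.println(index.search("library")); // ids of both documents
      System.out.println(index.search("lucene"));  // id of the first document only
   }
}
```

Building the map corresponds roughly to steps 3 and 4 above, and the search method to step 7; the remaining steps are left to the surrounding application.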

Lucene's Role in Search Application

Lucene plays a role in steps 2 to 7 mentioned above and provides classes to do the required operations. In a nutshell, Lucene is the heart of a search application and provides the vital operations pertaining to indexing and searching. Acquiring content and displaying results is left to the application.

In the next chapter, we will build a simple search application using the Lucene search library.

Lucene - Environment Setup

This chapter will guide you on how to prepare a development environment to start your work with Lucene. It also shows how to set up the JDK and Eclipse on your machine before you set up the Lucene libraries −

Step 1 - Java Development Kit (JDK) Setup

You can download the latest version of the JDK from Oracle's Java site: Java SE Downloads. You will find instructions for installing the JDK in the downloaded files; follow the given instructions to install and configure the setup. Finally, set the PATH and JAVA_HOME environment variables to refer to the directory that contains java and javac, typically java_install_dir/bin and java_install_dir respectively.

If you are running Windows and installed the JDK in C:\jdk1.6.0_15, you would have to put the following lines in your C:\autoexec.bat file.

set PATH=C:\jdk1.6.0_15\bin;%PATH%
set JAVA_HOME=C:\jdk1.6.0_15

Alternatively, on Windows NT/2000/XP, you could also right-click on My Computer, select Properties, then Advanced, then Environment Variables. Then, you would update the PATH value and press the OK button.

On Unix (Solaris, Linux, etc.), if the JDK is installed in /usr/local/jdk1.6.0_15 and you use the C shell, you would put the following into your .cshrc file.

setenv PATH /usr/local/jdk1.6.0_15/bin:$PATH
setenv JAVA_HOME /usr/local/jdk1.6.0_15

Alternatively, if you use an Integrated Development Environment (IDE) like Borland JBuilder, Eclipse, IntelliJ IDEA, or Sun ONE Studio, compile and run a simple program to confirm that the IDE knows where you installed Java; otherwise, do the proper setup as described in the IDE's documentation.
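A "simple program" for this check could look like the following sketch. The class name JavaSetupCheck is hypothetical; it only prints which Java runtime is actually being used and whether JAVA_HOME is visible to the process, which is usually enough to spot a misconfigured PATH or IDE.

```java
public class JavaSetupCheck {

   // Returns a one-line summary of the running JVM and the JAVA_HOME variable.
   public static String describe() {
      String javaHome = System.getenv("JAVA_HOME"); // null when the variable is not set
      return "java.version=" + System.getProperty("java.version")
         + ", java.home=" + System.getProperty("java.home")
         + ", JAVA_HOME=" + (javaHome == null ? "(not set)" : javaHome);
   }

   public static void main(String[] args) {
      System.out.println(describe());
   }
}
```

If the reported java.home does not point into the JDK you installed, the IDE or the PATH is picking up a different Java installation.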

Step 2 - Eclipse IDE Setup

All the examples in this tutorial have been written using the Eclipse IDE, so I would suggest that you have the latest version of Eclipse installed on your machine.

To install the Eclipse IDE, download the latest Eclipse binaries from https://www.eclipse.org/downloads/. Once you have downloaded the installation, unpack the binary distribution into a convenient location, for example C:\eclipse on Windows or /usr/local/eclipse on Linux/Unix, and finally set the PATH variable appropriately.

On a Windows machine, Eclipse can be started by executing the following command, or you can simply double-click eclipse.exe −

C:\eclipse\eclipse.exe

On a Unix (Solaris, Linux, etc.) machine, Eclipse can be started by executing the following command −

$ /usr/local/eclipse/eclipse

After a successful startup, it should display the following result −

[Image: Eclipse home page]

Step 3 - Setup Lucene Framework Libraries

If the startup is successful, then you can proceed to set up your Lucene framework. Following are the simple steps to download and install the framework on your machine.

https://archive.apache.org/dist/lucene/java/3.6.2/

    Decide whether you want to install Lucene on Windows or Unix, and then proceed to the next step to download the .zip file for Windows or the .tgz file for Unix.

    Download the suitable version of the Lucene binaries from https://archive.apache.org/dist/lucene/java/.

    At the time of writing this tutorial, I downloaded lucene-3.6.2.zip on my Windows machine; when you unzip the downloaded file, it gives you the following directory structure inside C:\lucene-3.6.2.

[Image: Lucene directories]

You will find all the Lucene libraries in the directory C:\lucene-3.6.2. Make sure you set your CLASSPATH variable to include this directory properly; otherwise, you will face problems while running your application. If you are using Eclipse, it is not required to set CLASSPATH, because all the settings will be done through Eclipse.

Once you are done with this last step, you are ready to proceed to your first Lucene example, which you will see in the next chapter.

Lucene - First Application

In this chapter, we will learn the actual programming with the Lucene framework. Before you start writing your first example, make sure that you have set up your Lucene environment properly, as explained in the Lucene - Environment Setup chapter. A working knowledge of the Eclipse IDE is recommended.

Let us now proceed by writing a simple search application which will print the number of search results found. We'll also see the list of indexes created during this process.

Step 1 - Create Java Project

The first step is to create a simple Java project using the Eclipse IDE. Follow the option File -> New -> Project and finally select the Java Project wizard from the wizard list. Now name your project LuceneFirstApplication using the wizard window as follows −

[Image: Create Project wizard]

Once your project is created successfully, you will have the following content in your Project Explorer −

[Image: LuceneFirstApplication directories]

Step 2 - Add Required Libraries

Let us now add the Lucene core library to our project. To do this, right-click on your project name LuceneFirstApplication and then follow this option in the context menu: Build Path -> Configure Build Path, to display the Java Build Path window as follows −

[Image: Java Build Path]

Now use the Add External JARs button available under the Libraries tab to add the following core JAR from the Lucene installation directory −

    lucene-core-3.6.2

Step 3 - Create Source Files

Let us now create the actual source files under the LuceneFirstApplication project. First we need to create a package called com.tutorialspoint.lucene. To do this, right-click on src in the Package Explorer section and follow the option New -> Package.

Next we will create LuceneTester.java and other java classes under the com.tutorialspoint.lucene package.

LuceneConstants.java

This class is used to provide various constants to be used across the sample application.

package com.tutorialspoint.lucene;

public class LuceneConstants {
   public static final String CONTENTS = "contents";
   public static final String FILE_NAME = "filename";
   public static final String FILE_PATH = "filepath";
   public static final int MAX_SEARCH = 10;
}

TextFileFilter.java

This class is used as a .txt file filter.

package com.tutorialspoint.lucene;

import java.io.File;
import java.io.FileFilter;

public class TextFileFilter implements FileFilter {

   @Override
   public boolean accept(File pathname) {
      return pathname.getName().toLowerCase().endsWith(".txt");
   }
}

Indexer.java

This class is used to index the raw data so that we can make it searchable using the Lucene library.

package com.tutorialspoint.lucene;

import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {

   private IndexWriter writer;

   public Indexer(String indexDirectoryPath) throws IOException {
      //this directory will contain the indexes
      Directory indexDirectory = 
         FSDirectory.open(new File(indexDirectoryPath));

      //create the indexer
      writer = new IndexWriter(indexDirectory, 
         new StandardAnalyzer(Version.LUCENE_36),true, 
         IndexWriter.MaxFieldLength.UNLIMITED);
   }

   public void close() throws CorruptIndexException, IOException {
      writer.close();
   }

   private Document getDocument(File file) throws IOException {
      Document document = new Document();

      //index file contents
      Field contentField = new Field(LuceneConstants.CONTENTS, new FileReader(file));
      //index file name
      Field fileNameField = new Field(LuceneConstants.FILE_NAME,
         file.getName(),Field.Store.YES,Field.Index.NOT_ANALYZED);
      //index file path
      Field filePathField = new Field(LuceneConstants.FILE_PATH,
         file.getCanonicalPath(),Field.Store.YES,Field.Index.NOT_ANALYZED);

      document.add(contentField);
      document.add(fileNameField);
      document.add(filePathField);

      return document;
   }   

   private void indexFile(File file) throws IOException {
      System.out.println("Indexing "+file.getCanonicalPath());
      Document document = getDocument(file);
      writer.addDocument(document);
   }

   public int createIndex(String dataDirPath, FileFilter filter) 
      throws IOException {
      //get all files in the data directory
      File[] files = new File(dataDirPath).listFiles();

      for (File file : files) {
         if(!file.isDirectory()
            && !file.isHidden()
            && file.exists()
            && file.canRead()
            && filter.accept(file)
         ){
            indexFile(file);
         }
      }
      return writer.numDocs();
   }
}

Searcher.java

This class is used to search the indexes created by the Indexer for the requested content.

package com.tutorialspoint.lucene;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Searcher {

   IndexSearcher indexSearcher;
   QueryParser queryParser;
   Query query;

   public Searcher(String indexDirectoryPath) 
      throws IOException {
      Directory indexDirectory = 
         FSDirectory.open(new File(indexDirectoryPath));
      indexSearcher = new IndexSearcher(indexDirectory);
      queryParser = new QueryParser(Version.LUCENE_36,
         LuceneConstants.CONTENTS,
         new StandardAnalyzer(Version.LUCENE_36));
   }
   
   public TopDocs search(String searchQuery) 
      throws IOException, ParseException {
      query = queryParser.parse(searchQuery);
      return indexSearcher.search(query, LuceneConstants.MAX_SEARCH);
   }

   public Document getDocument(ScoreDoc scoreDoc) 
      throws CorruptIndexException, IOException {
      return indexSearcher.doc(scoreDoc.doc);	
   }

   public void close() throws IOException {
      indexSearcher.close();
   }
}

LuceneTester.java

This class is used to test the indexing and search capability of the Lucene library.

package com.tutorialspoint.lucene;

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class LuceneTester {

   //backslashes must be escaped inside Java string literals
   String indexDir = "E:\\Lucene\\Index";
   String dataDir = "E:\\Lucene\\Data";
   Indexer indexer;
   Searcher searcher;

   public static void main(String[] args) {
      LuceneTester tester;
      try {
         tester = new LuceneTester();
         tester.createIndex();
         tester.search("Mohan");
      } catch (IOException e) {
         e.printStackTrace();
      } catch (ParseException e) {
         e.printStackTrace();
      }
   }

   private void createIndex() throws IOException {
      indexer = new Indexer(indexDir);
      int numIndexed;
      long startTime = System.currentTimeMillis();
      numIndexed = indexer.createIndex(dataDir, new TextFileFilter());
      long endTime = System.currentTimeMillis();
      indexer.close();
      System.out.println(numIndexed+" File indexed, time taken: "
         +(endTime-startTime)+" ms");		
   }

   private void search(String searchQuery) throws IOException, ParseException {
      searcher = new Searcher(indexDir);
      long startTime = System.currentTimeMillis();
      TopDocs hits = searcher.search(searchQuery);
      long endTime = System.currentTimeMillis();
   
      System.out.println(hits.totalHits +
         " documents found. Time :" + (endTime - startTime));
      for(ScoreDoc scoreDoc : hits.scoreDocs) {
         Document doc = searcher.getDocument(scoreDoc);
            System.out.println("File: "
            + doc.get(LuceneConstants.FILE_PATH));
      }
      searcher.close();
   }
}

Step 4 - Data & Index directory creation

We have used 10 text files, record1.txt to record10.txt, containing names and other details of students, and put them in the directory E:\Lucene\Data as test data. An index directory should be created at E:\Lucene\Index. After running this program, you can see the list of index files created in that folder.
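If you do not have such files at hand, a small helper can generate stand-ins. The following is only a sketch: the class name SampleDataGenerator is hypothetical, the file contents are placeholders rather than the student records used in this tutorial, and it writes to a temporary directory instead of E:\Lucene\Data. It also counts the .txt files in the same way a .txt file filter would select them.

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

public class SampleDataGenerator {

   // Writes record1.txt .. record<count>.txt with placeholder text into dataDir.
   public static void generate(Path dataDir, int count) {
      try {
         Files.createDirectories(dataDir);
         for (int i = 1; i <= count; i++) {
            String text = "Placeholder student record number " + i + "\n";
            Files.write(dataDir.resolve("record" + i + ".txt"),
               text.getBytes(StandardCharsets.UTF_8));
         }
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
   }

   // Creates a fresh temporary directory and fills it with sample files.
   public static Path generateInTempDir(int count) {
      try {
         Path dir = Files.createTempDirectory("lucene-data");
         generate(dir, count);
         return dir;
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
   }

   // Counts the regular files whose names end in .txt.
   public static int countTextFiles(Path dataDir) {
      FileFilter txtOnly = f -> f.isFile()
         && f.getName().toLowerCase(Locale.ROOT).endsWith(".txt");
      File[] files = dataDir.toFile().listFiles(txtOnly);
      return files == null ? 0 : files.length; // null if dataDir is not a directory
   }

   public static void main(String[] args) {
      Path dir = generateInTempDir(10);
      System.out.println(countTextFiles(dir) + " text files in " + dir);
   }
}
```

To use the generated files with the tester class, point its dataDir field at the printed directory.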

Step 5 - Running the program

Once you are done with the creation of the source, the raw data, the data directory and the index directory, you are ready to compile and run your program. To do this, keep the LuceneTester.java file tab active and use either the Run option available in the Eclipse IDE or Ctrl + F11 to compile and run your LuceneTester application. If the application runs successfully, it will print the following message in the Eclipse IDE's console −

Indexing E:\Lucene\Data\record1.txt
Indexing E:\Lucene\Data\record10.txt
Indexing E:\Lucene\Data\record2.txt
Indexing E:\Lucene\Data\record3.txt
Indexing E:\Lucene\Data\record4.txt
Indexing E:\Lucene\Data\record5.txt
Indexing E:\Lucene\Data\record6.txt
Indexing E:\Lucene\Data\record7.txt
Indexing E:\Lucene\Data\record8.txt
Indexing E:\Lucene\Data\record9.txt
10 File indexed, time taken: 109 ms
1 documents found. Time :0
File: E:\Lucene\Data\record4.txt

Once you've run the program successfully, you will have the following content in your index directory −

[Image: Lucene index directory]

Lucene - Indexing Classes

The indexing process is one of the core functionalities provided by Lucene. The following diagram illustrates the indexing process and the use of classes. IndexWriter is the most important and the core component of the indexing process.

[Image: Indexing process]

We add Document(s) containing Field(s) to the IndexWriter, which analyzes the Document(s) using the Analyzer, then creates, opens or edits indexes as required and stores or updates them in a Directory. IndexWriter is used to create or update indexes; it is not used to read them.

Indexing Classes

Following is a list of commonly-used classes during the indexing process.

S.No. Class & Description
1 IndexWriter

This class acts as a core component which creates/updates indexes during the indexing process.

2 Directory

This class represents the storage location of the indexes.

3 Analyzer

This class is responsible for analyzing a document and extracting the tokens/words from the text which is to be indexed. Without analysis, the IndexWriter cannot create an index.

4 Document

This class represents a virtual document with Fields, where a Field is an object which can contain the physical document's contents, its metadata and so on. The Analyzer can understand only a Document.

5 Field

This is the lowest unit or the starting point of the indexing process. It represents a key-value pair in which the key identifies the value to be indexed. For example, a field used to represent the contents of a document would have the key "contents", and its value may contain part or all of the text or numeric content of the document. Lucene can index only text or numeric content.

Lucene - Searching Classes

Searching is again one of the core functionalities provided by Lucene. Its flow is similar to that of the indexing process. A basic Lucene search can be made using the following classes, which can also be termed the foundation classes for all search-related operations.

Searching Classes

Following is a list of commonly-used classes during the searching process.

S.No. Class & Description
1 IndexSearcher

This class acts as a core component which reads/searches the indexes created by the indexing process. It takes a Directory instance pointing to the location containing the indexes.

2 Term

This class is the lowest unit of searching. It is similar to Field in the indexing process.

3 Query

Query is an abstract class which contains various utility methods and is the parent of all types of queries that Lucene uses during the search process.

4 TermQuery

TermQuery is the most commonly-used query object and is the foundation of many complex queries that Lucene can make use of.

5 TopDocs

TopDocs points to the top N search results which match the search criteria. It is a simple container of pointers to the documents which are the output of a search.

Lucene - Indexing Process

The indexing process is one of the core functionalities provided by Lucene. The following diagram illustrates the indexing process and the use of classes. IndexWriter is the most important and the core component of the indexing process.

[Image: Indexing process]

We add Document(s) containing Field(s) to the IndexWriter, which analyzes the Document(s) using the Analyzer, then creates, opens or edits indexes as required and stores or updates them in a Directory. IndexWriter is used to create or update indexes; it is not used to read them.

Now we'll show you a step-by-step process to help you understand the indexing process using a basic example.

Create a document

    Create a method to get a Lucene document from a text file.

    Create various types of fields, which are key-value pairs with the field name as the key and the content to be indexed as the value.

    Set each field to be analyzed or not. In our case, only the contents field is analyzed, because it can contain words such as a, am, are and an, which are not required in search operations.

    Add the newly created fields to the document object and return it to the caller.

private Document getDocument(File file) throws IOException {
   Document document = new Document();
   
   //index file contents
   Field contentField = new Field(LuceneConstants.CONTENTS, 
      new FileReader(file));
   
   //index file name
   Field fileNameField = new Field(LuceneConstants.FILE_NAME,
      file.getName(),
      Field.Store.YES,Field.Index.NOT_ANALYZED);
   
   //index file path
   Field filePathField = new Field(LuceneConstants.FILE_PATH,
      file.getCanonicalPath(),
      Field.Store.YES,Field.Index.NOT_ANALYZED);

   document.add(contentField);
   document.add(fileNameField);
   document.add(filePathField);

   return document;
}   
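The remark about words such as a, am, are and an refers to what analysis does to a field's text. As a rough illustration only, the plain-Java sketch below lower-cases text, splits it into words and drops a few stop words. It is not Lucene's StandardAnalyzer; the class name ToyAnalyzer and its stop-word list are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ToyAnalyzer {

   // Illustrative stop-word list; not the list that Lucene ships with.
   private static final Set<String> STOP_WORDS =
      new HashSet<>(Arrays.asList("a", "am", "an", "are", "is", "the"));

   // Lower-case the text, split it into words and drop the stop words.
   public static List<String> analyze(String text) {
      List<String> tokens = new ArrayList<>();
      for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
         if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
            tokens.add(word);
         }
      }
      return tokens;
   }

   public static void main(String[] args) {
      System.out.println(analyze("Mohan is a student")); // [mohan, student]
      System.out.println(analyze("record1.txt"));        // [record1, txt]
   }
}
```

The second call hints at why the file name and file path fields above are marked NOT_ANALYZED: running a value such as record1.txt through this kind of analysis would break it into pieces, whereas an unanalyzed field is indexed as a single term.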

Create an IndexWriter

The IndexWriter class acts as a core component which creates/updates indexes during the indexing process. Follow these steps to create an IndexWriter −

Step 1 − Create an object of IndexWriter.

Step 2 − Create a Lucene directory which points to the location where the indexes are to be stored.

Step 3 − Initialize the IndexWriter object with the index directory, a standard analyzer having version information, and other required/optional parameters.

private IndexWriter writer;

public Indexer(String indexDirectoryPath) throws IOException {
   //this directory will contain the indexes
   Directory indexDirectory = 
      FSDirectory.open(new File(indexDirectoryPath));
   
   //create the indexer
   writer = new IndexWriter(indexDirectory, 
      new StandardAnalyzer(Version.LUCENE_36),true,
      IndexWriter.MaxFieldLength.UNLIMITED);
}

Start Indexing Process

The following program shows how to start an indexing process −

private void indexFile(File file) throws IOException {
   System.out.println("Indexing "+file.getCanonicalPath());
   Document document = getDocument(file);
   writer.addDocument(document);
}

Example Application

To test the indexing process, we need to create a Lucene test application.

Step Description
1 Create a project named LuceneFirstApplication under the package com.tutorialspoint.lucene, as explained in the Lucene - First Application chapter. You can also reuse the project created in that chapter as is to understand the indexing process.

2 Create LuceneConstants.java, TextFileFilter.java and Indexer.java as explained in the Lucene - First Application chapter. Keep the rest of the files unchanged.

3 Create LuceneTester.java as mentioned below.

4 Clean and build the application to make sure the business logic is working as per the requirements.

LuceneConstants.java

This class is used to provide various constants to be used across the sample application.

package com.tutorialspoint.lucene;

public class LuceneConstants {
   public static final String CONTENTS = "contents";
   public static final String FILE_NAME = "filename";
   public static final String FILE_PATH = "filepath";
   public static final int MAX_SEARCH = 10;
}

TextFileFilter.java

This class is used as a .txt file filter.

package com.tutorialspoint.lucene;

import java.io.File;
import java.io.FileFilter;

public class TextFileFilter implements FileFilter {

   @Override
   public boolean accept(File pathname) {
      return pathname.getName().toLowerCase().endsWith(".txt");
   }
}

Indexer.java

This class is used to index the raw data so that we can make it searchable using the Lucene library.

package com.tutorialspoint.lucene;

import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {

   private IndexWriter writer;

   public Indexer(String indexDirectoryPath) throws IOException {
      //this directory will contain the indexes
      Directory indexDirectory = 
         FSDirectory.open(new File(indexDirectoryPath));

      //create the indexer
      writer = new IndexWriter(indexDirectory, 
         new StandardAnalyzer(Version.LUCENE_36),true,
         IndexWriter.MaxFieldLength.UNLIMITED);
   }

   public void close() throws CorruptIndexException, IOException {
      writer.close();
   }

   private Document getDocument(File file) throws IOException {
      Document document = new Document();

      //index file contents
      Field contentField = new Field(LuceneConstants.CONTENTS, 
         new FileReader(file));
      
      //index file name
      Field fileNameField = new Field(LuceneConstants.FILE_NAME,
         file.getName(),
         Field.Store.YES,Field.Index.NOT_ANALYZED);
      
      //index file path
      Field filePathField = new Field(LuceneConstants.FILE_PATH,
         file.getCanonicalPath(),
         Field.Store.YES,Field.Index.NOT_ANALYZED);

      document.add(contentField);
      document.add(fileNameField);
      document.add(filePathField);

      return document;
   }   

   private void indexFile(File file) throws IOException {
      System.out.println("Indexing "+file.getCanonicalPath());
      Document document = getDocument(file);
      writer.addDocument(document);
   }

   public int createIndex(String dataDirPath, FileFilter filter) 
      throws IOException {
      //get all files in the data directory
      File[] files = new File(dataDirPath).listFiles();

      for (File file : files) {
         if(!file.isDirectory()
            && !file.isHidden()
            && file.exists()
            && file.canRead()
            && filter.accept(file)
         ){
            indexFile(file);
         }
      }
      return writer.numDocs();
   }
}

LuceneTester.java

This class is used to test the indexing capability of the Lucene library.

package com.tutorialspoint.lucene;

import java.io.IOException;

public class LuceneTester {

   //backslashes must be escaped inside Java string literals
   String indexDir = "E:\\Lucene\\Index";
   String dataDir = "E:\\Lucene\\Data";
   Indexer indexer;
   
   public static void main(String[] args) {
      LuceneTester tester;
      try {
         tester = new LuceneTester();
         tester.createIndex();
      } catch (IOException e) {
         e.printStackTrace();
      } 
   }

   private void createIndex() throws IOException {
      indexer = new Indexer(indexDir);
      int numIndexed;
      long startTime = System.currentTimeMillis();
      numIndexed = indexer.createIndex(dataDir, new TextFileFilter());
      long endTime = System.currentTimeMillis();
      indexer.close();
      System.out.println(numIndexed+" File indexed, time taken: "
         +(endTime-startTime)+" ms");		
   }
}

Data & Index Directory Creation

We have used 10 text files, record1.txt to record10.txt, containing names and other details of students, and put them in the directory E:\Lucene\Data as test data. An index directory should be created at E:\Lucene\Index. After running this program, you can see the list of index files created in that folder.

Running the Program

Once you are done with the creation of the source, the raw data, the data directory and the index directory, you can proceed to compile and run your program. To do this, keep the LuceneTester.java file tab active and use either the Run option available in the Eclipse IDE or Ctrl + F11 to compile and run your LuceneTester application. If your application runs successfully, it will print the following message in the Eclipse IDE's console −

Indexing E:\Lucene\Data\record1.txt
Indexing E:\Lucene\Data\record10.txt
Indexing E:\Lucene\Data\record2.txt
Indexing E:\Lucene\Data\record3.txt
Indexing E:\Lucene\Data\record4.txt
Indexing E:\Lucene\Data\record5.txt
Indexing E:\Lucene\Data\record6.txt
Indexing E:\Lucene\Data\record7.txt
Indexing E:\Lucene\Data\record8.txt
Indexing E:\Lucene\Data\record9.txt
10 File indexed, time taken: 109 ms

Once you've run the program successfully, you will have the following content in your index directory −

[Image: Lucene index directory]

Lucene - Indexing Operations

In this chapter, we'll discuss the four major operations of indexing. These operations are useful at various times and are used throughout a search application.

Indexing Operations

Following is a list of commonly-used operations during the indexing process.

S.No. Operation & Description
1 Add Document

This operation is used in the initial stage of the indexing process to create the indexes on newly available content.

2 Update Document

This operation is used to update indexes to reflect changes in updated content. It is similar to recreating the index for that content.

3 Delete Document

This operation is used to update indexes to exclude documents which no longer need to be indexed or searched.

4 Field Options

Field options specify or control the ways in which the contents of a field are made searchable.
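To make the first three operations concrete, here is a toy in-memory store in plain Java. It is a sketch under stated assumptions, not Lucene code: the class ToyDocumentStore and its methods are hypothetical, documents are plain strings identified by a key such as a file name, and an update is modelled as a delete followed by an add. Field options are not covered by this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ToyDocumentStore {

   // key (for example a file name) -> document text
   private final Map<String, String> docs = new LinkedHashMap<>();

   // Add: make new content available under its key.
   public void addDocument(String key, String text) {
      docs.put(key, text);
   }

   // Update: modelled here as a delete followed by an add.
   public void updateDocument(String key, String text) {
      deleteDocument(key);
      addDocument(key, text);
   }

   // Delete: returns true if a document with that key existed.
   public boolean deleteDocument(String key) {
      return docs.remove(key) != null;
   }

   public int numDocs() {
      return docs.size();
   }

   public String get(String key) {
      return docs.get(key);
   }

   public static void main(String[] args) {
      ToyDocumentStore store = new ToyDocumentStore();
      store.addDocument("record1.txt", "first version");
      store.addDocument("record2.txt", "another record");
      store.updateDocument("record1.txt", "second version");
      System.out.println(store.numDocs() + " documents, record1.txt = "
         + store.get("record1.txt"));
      store.deleteDocument("record2.txt");
      System.out.println(store.numDocs() + " document left");
   }
}
```

In this model an update never changes the number of documents for an existing key, which is why it can be described as re-indexing that one document.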

Lucene - Search Operation

The process of searching is one of the core functionalities provided by Lucene. The following diagram illustrates the process and its use. IndexSearcher is one of the core components of the searching process.

[Image: Searching process]

We first create a Directory containing the indexes and pass it to an IndexSearcher, which opens the Directory using an IndexReader. Then we create a Query with a Term and run the search by passing the Query to the IndexSearcher. The IndexSearcher returns a TopDocs object, which contains the search details along with the document ID(s) of the Document(s) that matched.

We will now show you a step-wise approach to help you understand the searching process using a basic example.

Create a QueryParser

The QueryParser class parses user-entered input into a query that Lucene understands. Follow these steps to create a QueryParser −

Step 1 − Create an object of QueryParser.

Step 2 − Initialize the QueryParser object with a standard analyzer having version information and the name of the default field against which the query is to be run.

QueryParser queryParser;

public Searcher(String indexDirectoryPath) throws IOException {

   queryParser = new QueryParser(Version.LUCENE_36,
      LuceneConstants.CONTENTS,
      new StandardAnalyzer(Version.LUCENE_36));
}

Create an IndexSearcher

The IndexSearcher class acts as a core component which searches the indexes created during the indexing process. Follow these steps to create an IndexSearcher −

Step 1 − Create an object of IndexSearcher.

Step 2 − Create a Lucene directory which points to the location where the indexes are stored.

Step 3 − Initialize the IndexSearcher object with the index directory.

IndexSearcher indexSearcher;

public Searcher(String indexDirectoryPath) throws IOException {
   Directory indexDirectory = 
      FSDirectory.open(new File(indexDirectoryPath));
   indexSearcher = new IndexSearcher(indexDirectory);
}

Make a Search

Follow these steps to make a search −

Step 1 − Create a Query object by parsing the search expression through the QueryParser.

Step 2 − Make the search by calling the IndexSearcher.search() method.

Query query;

public TopDocs search(String searchQuery) throws IOException, ParseException {
   query = queryParser.parse(searchQuery);
   return indexSearcher.search(query, LuceneConstants.MAX_SEARCH);
}
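The second argument to search() caps how many hits are returned (LuceneConstants.MAX_SEARCH in this chapter). As a rough, library-free sketch of that idea, the hypothetical helper below keeps only the n highest scores from a list of candidate scores using a bounded min-heap. It is an illustration under that assumption, not Lucene's own collector code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper showing how "top n" selection can work (not Lucene code).
final class TopN {
   private TopN() {}

   // Returns the n largest scores in descending order.
   static List<Float> topScores(List<Float> scores, int n) {
      if (n <= 0) {
         return new ArrayList<>();
      }
      // min-heap: the smallest score currently kept sits on top
      PriorityQueue<Float> heap = new PriorityQueue<>(n);
      for (Float score : scores) {
         if (heap.size() < n) {
            heap.add(score);
         } else if (score > heap.peek()) {
            // the new score beats the weakest kept score, so replace it
            heap.poll();
            heap.add(score);
         }
      }
      List<Float> result = new ArrayList<>(heap);
      result.sort(Collections.reverseOrder());
      return result;
   }
}
```

With n larger than the number of candidates, every score is returned, which mirrors a search that finds fewer matches than the requested maximum.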

Get the Document

The following method shows how to retrieve a matching document using the document ID stored in a ScoreDoc.

public Document getDocument(ScoreDoc scoreDoc) 
   throws CorruptIndexException, IOException {
   return indexSearcher.doc(scoreDoc.doc);
}

Close IndexSearcher

The following method shows how to close the IndexSearcher once searching is finished.

public void close() throws IOException {
   indexSearcher.close();
}

Example Application

Let us create a test Lucene application to test the searching process.

Step Description
1 Create a project named LuceneFirstApplication under the package com.tutorialspoint.lucene as explained in the Lucene - First Application chapter. You can also reuse the project created in that chapter as is to understand the searching process.
2 Create LuceneConstants.java, TextFileFilter.java and Searcher.java as explained in the Lucene - First Application chapter. Keep the rest of the files unchanged.
3 Create LuceneTester.java as mentioned below.
4 Clean and build the application to make sure the business logic is working as per the requirements.

LuceneConstants.java

This class provides various constants to be used across the sample application.

package com.tutorialspoint.lucene;

public class LuceneConstants {
   public static final String CONTENTS = "contents";
   public static final String FILE_NAME = "filename";
   public static final String FILE_PATH = "filepath";
   public static final int MAX_SEARCH = 10;
}

TextFileFilter.java

This class is used as a .txt file filter.

package com.tutorialspoint.lucene;

import java.io.File;
import java.io.FileFilter;

public class TextFileFilter implements FileFilter {

   @Override
   public boolean accept(File pathname) {
      return pathname.getName().toLowerCase().endsWith(".txt");
   }
}

Searcher.java

This class reads the indexes built from the raw data and searches them using the Lucene library.

package com.tutorialspoint.lucene;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Searcher {

   IndexSearcher indexSearcher;
   QueryParser queryParser;
   Query query;

   public Searcher(String indexDirectoryPath) throws IOException {
      //open the directory that holds the index
      Directory indexDirectory = 
         FSDirectory.open(new File(indexDirectoryPath));
      indexSearcher = new IndexSearcher(indexDirectory);
      //parse queries against the contents field
      queryParser = new QueryParser(Version.LUCENE_36,
         LuceneConstants.CONTENTS,
         new StandardAnalyzer(Version.LUCENE_36));
   }

   public TopDocs search(String searchQuery) 
      throws IOException, ParseException {
      query = queryParser.parse(searchQuery);
      //return at most MAX_SEARCH hits
      return indexSearcher.search(query, LuceneConstants.MAX_SEARCH);
   }

   public Document getDocument(ScoreDoc scoreDoc) 
      throws CorruptIndexException, IOException {
      return indexSearcher.doc(scoreDoc.doc);
   }

   public void close() throws IOException {
      indexSearcher.close();
   }
}

LuceneTester.java

This class is used to test the searching capability of the Lucene library.

package com.tutorialspoint.lucene;

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class LuceneTester {

   //backslashes must be escaped inside Java string literals
   String indexDir = "E:\\Lucene\\Index";
   String dataDir = "E:\\Lucene\\Data";
   Searcher searcher;

   public static void main(String[] args) {
      LuceneTester tester;
      try {
         tester = new LuceneTester();
         tester.search("Mohan");
      } catch (IOException e) {
         e.printStackTrace();
      } catch (ParseException e) {
         e.printStackTrace();
      }
   }

   private void search(String searchQuery) throws IOException, ParseException {
      searcher = new Searcher(indexDir);
      long startTime = System.currentTimeMillis();
      TopDocs hits = searcher.search(searchQuery);
      long endTime = System.currentTimeMillis();

      System.out.println(hits.totalHits +
         " documents found. Time :" + (endTime - startTime) +" ms");
      //print the stored file path of every hit
      for(ScoreDoc scoreDoc : hits.scoreDocs) {
         Document doc = searcher.getDocument(scoreDoc);
         System.out.println("File: "+ doc.get(LuceneConstants.FILE_PATH));
      }
      searcher.close();
   }
}

Data & Index Directory Creation

We have used 10 text files, named record1.txt to record10.txt, containing names and other details of students, and put them in the directory E:\Lucene\Data. An index directory should be created at E:\Lucene\Index. After running the indexing program from the chapter Lucene - Indexing Process, you can see the list of index files created in that folder.
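When these Windows paths are written as Java string literals, each backslash has to be escaped, otherwise the code does not compile. The small sketch below shows the escaped form of the index path used in this chapter; the class name PathLiterals and the join() helper are hypothetical and only there for illustration.

```java
// Hypothetical illustration of writing Windows paths as Java string literals.
final class PathLiterals {
   private PathLiterals() {}

   // "\\" in source code is a single backslash character in the resulting string
   static final String INDEX_DIR = "E:\\Lucene\\Index";

   // Builds a path from its segments with an explicit separator.
   static String join(String separator, String... parts) {
      return String.join(separator, parts);
   }
}
```

Printing PathLiterals.INDEX_DIR shows the path with single backslashes, and join("\\", "E:", "Lucene", "Data") builds the data directory path the same way.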

Running the Program

Once you are done with the creation of the source, the raw data, the data directory, the index directory and the indexes, you can proceed by compiling and running your program. To do this, keep the LuceneTester.java file tab active and use either the Run option available in the Eclipse IDE or press Ctrl + F11 to compile and run your LuceneTester application. If your application runs successfully, it will print the following message in the Eclipse IDE's console −

1 documents found. Time :29 ms
File: E:\Lucene\Data\record4.txt

Lucene - Query Programming

As we saw in the previous chapter, Lucene - Search Operation, Lucene uses IndexSearcher to make searches, and it takes the Query object created by QueryParser as input. In this chapter, we discuss the various types of Query objects and the different ways to create them programmatically. Creating different types of Query objects gives you control over the kind of search to be made.

Consider the case of Advanced Search, provided by many applications, where users are given multiple options to narrow down the search results. With Query programming, we can achieve the same very easily.
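Conceptually, narrowing results with several options comes down to combining the sets of documents matched by each option. The sketch below is library-free and uses hypothetical names (ResultSets, and, or, andNot); it illustrates only the set logic behind AND, OR and NOT, not Lucene's Query classes.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of combining result sets of document IDs.
final class ResultSets {
   private ResultSets() {}

   // Documents matched by both criteria (AND).
   static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
      Set<Integer> out = new TreeSet<>(a);
      out.retainAll(b);
      return out;
   }

   // Documents matched by at least one criterion (OR).
   static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
      Set<Integer> out = new TreeSet<>(a);
      out.addAll(b);
      return out;
   }

   // Documents matched by the first criterion but not the second (NOT).
   static Set<Integer> andNot(Set<Integer> a, Set<Integer> b) {
      Set<Integer> out = new TreeSet<>(a);
      out.removeAll(b);
      return out;
   }
}
```

Each method copies its first argument before modifying it, so the input sets stay unchanged and can be reused for further combinations.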

Following is the list of Query types that we'll discuss in due course.

S.No. Class & Description
1 TermQuery

TermQuery is the most commonly used query object. It matches documents that contain a specific term in a given field and is the foundation of many complex queries.

2 TermRangeQuery

TermRangeQuery is used when a range of textual terms is to be searched.

3 PrefixQuery

PrefixQuery is used to match documents containing terms that start with a specified string.

4 BooleanQuery

BooleanQuery is used to search for documents that result from combining multiple queries using the AND, OR or NOT operators.

5 PhraseQuery

PhraseQuery is used to search for documents which contain a particular sequence of terms.

6 WildcardQuery

WildcardQuery is used to search for documents using wildcards like * for any character sequence and ? for a single character.

7 FuzzyQuery

FuzzyQuery is used to search for documents using a fuzzy implementation, that is, an approximate search based on the edit distance algorithm.

8 MatchAllDocsQuery

MatchAllDocsQuery, as the name suggests, matches all the documents.
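The matching rules behind the prefix, wildcard and fuzzy entries in the list above can be sketched without any library. The class TermMatching below is hypothetical and only illustrates the rules on single terms; it is not Lucene's implementation, and it does not reproduce how FuzzyQuery turns an edit distance into a similarity threshold.

```java
// Hypothetical illustration of prefix, wildcard and edit-distance matching.
final class TermMatching {
   private TermMatching() {}

   // Prefix rule: the term starts with the given string.
   static boolean matchesPrefix(String term, String prefix) {
      return term.startsWith(prefix);
   }

   // Wildcard rule: '*' matches any character sequence (including an empty
   // one), '?' matches exactly one character.
   static boolean matchesWildcard(String term, String pattern) {
      return matchFrom(term, 0, pattern, 0);
   }

   private static boolean matchFrom(String t, int i, String p, int j) {
      if (j == p.length()) {
         return i == t.length();
      }
      char c = p.charAt(j);
      if (c == '*') {
         // let '*' absorb zero or more of the remaining characters
         for (int k = i; k <= t.length(); k++) {
            if (matchFrom(t, k, p, j + 1)) {
               return true;
            }
         }
         return false;
      }
      if (i < t.length() && (c == '?' || c == t.charAt(i))) {
         return matchFrom(t, i + 1, p, j + 1);
      }
      return false;
   }

   // Levenshtein edit distance: the minimum number of single-character
   // insertions, deletions and substitutions needed to turn a into b.
   static int editDistance(String a, String b) {
      int[] prev = new int[b.length() + 1];
      int[] curr = new int[b.length() + 1];
      for (int j = 0; j <= b.length(); j++) {
         prev[j] = j;
      }
      for (int i = 1; i <= a.length(); i++) {
         curr[0] = i;
         for (int j = 1; j <= b.length(); j++) {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
            curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
               prev[j - 1] + cost);
         }
         int[] tmp = prev;
         prev = curr;
         curr = tmp;
      }
      return prev[b.length()];
   }
}
```

For instance, the pattern record?.txt matches record3.txt but not record10.txt, and the search text cord3.txt is two edits away from record3.txt, which is why an approximate search can still find that file name.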

Lucene - Analysis

In one of our previous chapters, we saw that Lucene uses IndexWriter to analyze the Document(s) using the Analyzer and then creates, opens or edits indexes as required. In this chapter, we discuss the various types of Analyzer objects and other relevant objects which are used during the analysis process. Understanding the analysis process and how analyzers work will give you great insight into how Lucene indexes documents.

Following is the list of objects that we'll discuss in due course.

S.No. Class & Description
1 Token

Token represents a piece of text or a word in a document along with its metadata (position, start offset, end offset, token type and position increment).

2 TokenStream

TokenStream is the output of the analysis process and consists of a series of tokens. It is an abstract class.

3 Analyzer

This is the abstract base class for every type of Analyzer.

4 WhitespaceAnalyzer

This analyzer splits the text in a document based on whitespace.

5 SimpleAnalyzer

This analyzer splits the text in a document based on non-letter characters and puts the text in lowercase.

6 StopAnalyzer

This analyzer works just like the SimpleAnalyzer and also removes common words like "a", "an", "the", etc.

7 StandardAnalyzer

This is the most sophisticated analyzer and is capable of handling names, email addresses, etc. It lowercases each token and removes common words and punctuation, if any.
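The difference between the whitespace, simple and stop behaviours listed above is easiest to see on one input string. The following is a library-free sketch with hypothetical names (ToyAnalyzers and its methods); the three-word stop list is an assumption made for the example, not Lucene's default stop set, and the real analyzers produce token streams with offsets rather than plain lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of three tokenizing strategies (not Lucene code).
final class ToyAnalyzers {
   private ToyAnalyzers() {}

   // small illustrative stop-word list (an assumption for this example)
   private static final Set<String> STOP_WORDS =
      new HashSet<>(Arrays.asList("a", "an", "the"));

   // Whitespace style: split on whitespace only, keep case and punctuation.
   static List<String> whitespace(String text) {
      return nonEmpty(text.split("\\s+"));
   }

   // Simple style: split on non-letter characters and lowercase the text.
   static List<String> simple(String text) {
      return nonEmpty(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"));
   }

   // Stop style: same as simple, then drop the stop words.
   static List<String> stop(String text) {
      List<String> tokens = simple(text);
      tokens.removeIf(STOP_WORDS::contains);
      return tokens;
   }

   // Copies the non-empty pieces into a mutable list.
   private static List<String> nonEmpty(String[] parts) {
      List<String> out = new ArrayList<>();
      for (String part : parts) {
         if (!part.isEmpty()) {
            out.add(part);
         }
      }
      return out;
   }
}
```

Running the three methods on a sentence such as "The Quick-Brown fox, an XY2 test" shows how punctuation, digits, letter case and common words are treated differently by each strategy.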

Lucene - Sorting

In this chapter, we look at the order in which Lucene returns search results by default and how that order can be changed as required.

Sorting by Relevance

This is the default sorting mode used by Lucene. Lucene returns results with the most relevant hit at the top.

private void sortUsingRelevance(String searchQuery)
   throws IOException, ParseException {
   searcher = new Searcher(indexDir);
   long startTime = System.currentTimeMillis();
   
   //create a term to search file name
   Term term = new Term(LuceneConstants.FILE_NAME, searchQuery);
   //create the fuzzy query object
   Query query = new FuzzyQuery(term);
   //compute a score for each hit even though an explicit Sort is used
   searcher.setDefaultFieldSortScoring(true, false);
   //do the search
   TopDocs hits = searcher.search(query,Sort.RELEVANCE);
   long endTime = System.currentTimeMillis();

   System.out.println(hits.totalHits +
      " documents found. Time :" + (endTime - startTime) + "ms");
   for(ScoreDoc scoreDoc : hits.scoreDocs) {
      Document doc = searcher.getDocument(scoreDoc);
      System.out.print("Score: "+ scoreDoc.score + " ");
      System.out.println("File: "+ doc.get(LuceneConstants.FILE_PATH));
   }
   searcher.close();
}

Sorting by IndexOrder

This sorting mode returns results in index order. Here, the first document indexed is shown first in the search results.

private void sortUsingIndex(String searchQuery)
   throws IOException, ParseException {
   searcher = new Searcher(indexDir);
   long startTime = System.currentTimeMillis();
   
   //create a term to search file name
   Term term = new Term(LuceneConstants.FILE_NAME, searchQuery);
   //create the fuzzy query object
   Query query = new FuzzyQuery(term);
   //compute a score for each hit even though an explicit Sort is used
   searcher.setDefaultFieldSortScoring(true, false);
   //do the search
   TopDocs hits = searcher.search(query,Sort.INDEXORDER);
   long endTime = System.currentTimeMillis();

   System.out.println(hits.totalHits +
      " documents found. Time :" + (endTime - startTime) + "ms");
   for(ScoreDoc scoreDoc : hits.scoreDocs) {
      Document doc = searcher.getDocument(scoreDoc);
      System.out.print("Score: "+ scoreDoc.score + " ");
      System.out.println("File: "+ doc.get(LuceneConstants.FILE_PATH));
   }
   searcher.close();
}
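The two orderings above can be pictured as two comparators over the same hits. The sketch below is library-free; HitOrdering and its nested Hit class are hypothetical stand-ins for a document ID plus score, and breaking score ties by the lower document ID is an assumption chosen for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of relevance order versus index order.
final class HitOrdering {
   private HitOrdering() {}

   // Stand-in for a search hit: a document ID and its score.
   static final class Hit {
      final int docId;
      final float score;

      Hit(int docId, float score) {
         this.docId = docId;
         this.score = score;
      }
   }

   // Relevance order: highest score first, ties broken by lower document ID.
   static List<Hit> byRelevance(List<Hit> hits) {
      List<Hit> sorted = new ArrayList<>(hits);
      sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed()
         .thenComparingInt((Hit h) -> h.docId));
      return sorted;
   }

   // Index order: ascending document ID, regardless of score.
   static List<Hit> byIndexOrder(List<Hit> hits) {
      List<Hit> sorted = new ArrayList<>(hits);
      sorted.sort(Comparator.comparingInt((Hit h) -> h.docId));
      return sorted;
   }
}
```

Both methods sort a copy, so the same list of hits can be passed to each one to compare the resulting orders side by side.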

Example Application

Let us create a test Lucene application to test the sorting process.

Step Description
1 Create a project named LuceneFirstApplication under the package com.tutorialspoint.lucene as explained in the Lucene - First Application chapter. You can also reuse the project created in that chapter as is to understand the sorting process.
2 Create LuceneConstants.java and Searcher.java as explained in the Lucene - First Application chapter. Keep the rest of the files unchanged.
3 Create LuceneTester.java as mentioned below.
4 Clean and build the application to make sure the business logic is working as per the requirements.

LuceneConstants.java

This class provides various constants to be used across the sample application.

package com.tutorialspoint.lucene;

public class LuceneConstants {
   public static final String CONTENTS = "contents";
   public static final String FILE_NAME = "filename";
   public static final String FILE_PATH = "filepath";
   public static final int MAX_SEARCH = 10;
}

Searcher.java

This class reads the indexes built from the raw data and searches them using the Lucene library.

package com.tutorialspoint.lucene;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Searcher {

   IndexSearcher indexSearcher;
   QueryParser queryParser;
   Query query;

   public Searcher(String indexDirectoryPath) throws IOException {
      //open the directory that holds the index
      Directory indexDirectory 
         = FSDirectory.open(new File(indexDirectoryPath));
      indexSearcher = new IndexSearcher(indexDirectory);
      queryParser = new QueryParser(Version.LUCENE_36,
         LuceneConstants.CONTENTS,
         new StandardAnalyzer(Version.LUCENE_36));
   }

   //parse the search text and return the top hits
   public TopDocs search(String searchQuery) 
      throws IOException, ParseException {
      query = queryParser.parse(searchQuery);
      return indexSearcher.search(query, LuceneConstants.MAX_SEARCH);
   }

   //run an already built query
   public TopDocs search(Query query) 
      throws IOException, ParseException {
      return indexSearcher.search(query, LuceneConstants.MAX_SEARCH);
   }

   //run an already built query with an explicit sort order
   public TopDocs search(Query query, Sort sort) 
      throws IOException, ParseException {
      return indexSearcher.search(query, 
         LuceneConstants.MAX_SEARCH, sort);
   }

   public void setDefaultFieldSortScoring(boolean doTrackScores, 
      boolean doMaxScores) {
      indexSearcher.setDefaultFieldSortScoring(
         doTrackScores, doMaxScores);
   }

   public Document getDocument(ScoreDoc scoreDoc) 
      throws CorruptIndexException, IOException {
      return indexSearcher.doc(scoreDoc.doc);
   }

   public void close() throws IOException {
      indexSearcher.close();
   }
}

LuceneTester.java

This class is used to test the sorting capability of the Lucene library.

package com.tutorialspoint.lucene;

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;

public class LuceneTester {

   //backslashes must be escaped inside Java string literals
   String indexDir = "E:\\Lucene\\Index";
   String dataDir = "E:\\Lucene\\Data";
   Indexer indexer;
   Searcher searcher;

   public static void main(String[] args) {
      LuceneTester tester;
      try {
          tester = new LuceneTester();
          tester.sortUsingRelevance("cord3.txt");
          tester.sortUsingIndex("cord3.txt");
      } catch (IOException e) {
          e.printStackTrace();
      } catch (ParseException e) {
          e.printStackTrace();
      }		
   }

   private void sortUsingRelevance(String searchQuery)
      throws IOException, ParseException {
      searcher = new Searcher(indexDir);
      long startTime = System.currentTimeMillis();
      
      //create a term to search file name
      Term term = new Term(LuceneConstants.FILE_NAME, searchQuery);
      //create the fuzzy query object
      Query query = new FuzzyQuery(term);
      //compute a score for each hit even though an explicit Sort is used
      searcher.setDefaultFieldSortScoring(true, false);
      //do the search
      TopDocs hits = searcher.search(query,Sort.RELEVANCE);
      long endTime = System.currentTimeMillis();

      System.out.println(hits.totalHits +
         " documents found. Time :" + (endTime - startTime) + "ms");
      for(ScoreDoc scoreDoc : hits.scoreDocs) {
         Document doc = searcher.getDocument(scoreDoc);
         System.out.print("Score: "+ scoreDoc.score + " ");
         System.out.println("File: "+ doc.get(LuceneConstants.FILE_PATH));
      }
      searcher.close();
   }

   private void sortUsingIndex(String searchQuery)
      throws IOException, ParseException {
      searcher = new Searcher(indexDir);
      long startTime = System.currentTimeMillis();
      //create a term to search file name
      Term term = new Term(LuceneConstants.FILE_NAME, searchQuery);
      //create the fuzzy query object
      Query query = new FuzzyQuery(term);
      //compute a score for each hit even though an explicit Sort is used
      searcher.setDefaultFieldSortScoring(true, false);
      //do the search
      TopDocs hits = searcher.search(query,Sort.INDEXORDER);
      long endTime = System.currentTimeMillis();

      System.out.println(hits.totalHits +
         " documents found. Time :" + (endTime - startTime) + "ms");
      for(ScoreDoc scoreDoc : hits.scoreDocs) {
         Document doc = searcher.getDocument(scoreDoc);
         System.out.print("Score: "+ scoreDoc.score + " ");
         System.out.println("File: "+ doc.get(LuceneConstants.FILE_PATH));
      }
      searcher.close();
   }
}

Data & Index Directory Creation

We have used 10 text files, named record1.txt to record10.txt, containing names and other details of students, and put them in the directory E:\Lucene\Data. An index directory should be created at E:\Lucene\Index. After running the indexing program from the chapter Lucene - Indexing Process, you can see the list of index files created in that folder.

Running the Program

Once you are done with the creation of the source, the raw data, the data directory, the index directory and the indexes, you can compile and run your program. To do this, keep the LuceneTester.java file tab active and use either the Run option available in the Eclipse IDE or press Ctrl + F11 to compile and run your LuceneTester application. If your application runs successfully, it will print the following message in the Eclipse IDE's console −

10 documents found. Time :31ms
Score: 1.3179655 File: E:\Lucene\Data\record3.txt
Score: 0.790779 File: E:\Lucene\Data\record1.txt
Score: 0.790779 File: E:\Lucene\Data\record2.txt
Score: 0.790779 File: E:\Lucene\Data\record4.txt
Score: 0.790779 File: E:\Lucene\Data\record5.txt
Score: 0.790779 File: E:\Lucene\Data\record6.txt
Score: 0.790779 File: E:\Lucene\Data\record7.txt
Score: 0.790779 File: E:\Lucene\Data\record8.txt
Score: 0.790779 File: E:\Lucene\Data\record9.txt
Score: 0.2635932 File: E:\Lucene\Data\record10.txt
10 documents found. Time :0ms
Score: 0.790779 File: E:\Lucene\Data\record1.txt
Score: 0.2635932 File: E:\Lucene\Data\record10.txt
Score: 0.790779 File: E:\Lucene\Data\record2.txt
Score: 1.3179655 File: E:\Lucene\Data\record3.txt
Score: 0.790779 File: E:\Lucene\Data\record4.txt
Score: 0.790779 File: E:\Lucene\Data\record5.txt
Score: 0.790779 File: E:\Lucene\Data\record6.txt
Score: 0.790779 File: E:\Lucene\Data\record7.txt
Score: 0.790779 File: E:\Lucene\Data\record8.txt
Score: 0.790779 File: E:\Lucene\Data\record9.txt