English 中文(简体)
Mahout - Classification
  • 时间:2024-12-22

Mahout - Classification


Previous Page Next Page  

What is Classification?

Classification is a machine learning technique that uses known data to determine how the new data should be classified into a set of existing categories. For example,

    iTunes apppcation uses classification to prepare playpsts.

    Mail service providers such as Yahoo! and Gmail use this technique to decide whether a new mail should be classified as a spam. The categorization algorithm trains itself by analyzing user habits of marking certain mails as spams. Based on that, the classifier decides whether a future mail should be deposited in your inbox or in the spams folder.

How Classification Works

While classifying a given set of data, the classifier system performs the following actions:

    Initially a new data model is prepared using any of the learning algorithms.

    Then the prepared data model is tested.

    Thereafter, this data model is used to evaluate the new data and to determine its class.

Classification Works

Apppcations of Classification

    Credit card fraud detection - The Classification mechanism is used to predict credit card frauds. Using historical information of previous frauds, the classifier can predict which future transactions may turn into frauds.

    Spam e-mails - Depending on the characteristics of previous spam mails, the classifier determines whether a newly encountered e-mail should be sent to the spam folder.

Naive Bayes Classifier

Mahout uses the Naive Bayes classifier algorithm. It uses two implementations:

    Distributed Naive Bayes classification

    Complementary Naive Bayes classification

Naive Bayes is a simple technique for constructing classifiers. It is not a single algorithm for training such classifiers, but a family of algorithms. A Bayes classifier constructs models to classify problem instances. These classifications are made using the available data.

An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification.

For some types of probabipty models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting.

Despite its oversimppfied assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations.

Procedure of Classification

The following steps are to be followed to implement Classification:

    Generate example data

    Create sequence files from data

    Convert sequence files to vectors

    Train the vectors

    Test the vectors

Step1: Generate Example Data

Generate or download the data to be classified. For example, you can get the 20 newsgroups example data from the following pnk: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

Create a directory for storing input data. Download the example as shown below.

$ mkdir classification_example
$ cd classification_example
$tar xzvf 20news-bydate.tar.gz
wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz 

Step 2: Create Sequence Files

Create sequence file from the example using seqdirectory utipty. The syntax to generate sequence is given below:

mahout seqdirectory -i <input file path> -o <output directory>

Step 3: Convert Sequence Files to Vectors

Create vector files from sequence files using seq2parse utipty. The options of seq2parse utipty are given below:

$MAHOUT_HOME/bin/mahout seq2sparse
--analyzerName (-a) analyzerName  The class name of the analyzer
--chunkSize (-chunk) chunkSize    The chunkSize in MegaBytes.
--output (-o) output              The directory pathname for o/p
--input (-i) input                Path to job input directory. 

Step 4: Train the Vectors

Train the generated vectors using the trainnb utipty. The options to use trainnb utipty are given below:

mahout trainnb
 -i ${PATH_TO_TFIDF_VECTORS}
 -el
 -o ${PATH_TO_MODEL}/model
 -p ${PATH_TO_MODEL}/labepndex
 -ow
 -c

Step 5: Test the Vectors

Test the vectors using testnb utipty. The options to use testnb utipty are given below:

mahout testnb
 -i ${PATH_TO_TFIDF_TEST_VECTORS}
 -m ${PATH_TO_MODEL}/model
 -l ${PATH_TO_MODEL}/labepndex
 -ow
 -o ${PATH_TO_OUTPUT}
 -c
 -seq
Advertisements