Mahout - Classification
What is Classification?
Classification is a machine learning technique that uses known data to determine how new data should be assigned to a set of existing categories. For example:
The iTunes application uses classification to prepare playlists.
Mail service providers such as Yahoo! and Gmail use this technique to decide whether a new mail should be classified as spam. The categorization algorithm trains itself by analyzing which mails users mark as spam. Based on that, the classifier decides whether a future mail should be delivered to your inbox or to the spam folder.
How Classification Works
While classifying a given set of data, the classifier system performs the following actions:
Initially a new data model is prepared using any of the learning algorithms.
Then the prepared data model is tested.
Thereafter, this data model is used to evaluate the new data and to determine its class.
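The three actions above can be sketched in a few lines of Python. This is only a minimal illustration, not how Mahout works internally: the word-voting "learning algorithm", the function names, and the toy mails are all assumptions made for this example.

```python
from collections import Counter, defaultdict

def train(examples):
    """Action 1: prepare a model (word -> label counts) from labeled examples."""
    model = defaultdict(Counter)
    for text, label in examples:
        for word in text.lower().split():
            model[word][label] += 1
    return model

def classify(model, text, default="unknown"):
    """Action 3: every known word votes for the labels it was seen with."""
    votes = Counter()
    for word in text.lower().split():
        votes.update(model.get(word, {}))
    return votes.most_common(1)[0][0] if votes else default

def accuracy(model, examples):
    """Action 2: fraction of held-out examples the model labels correctly."""
    correct = sum(classify(model, text) == label for text, label in examples)
    return correct / len(examples)

# Hypothetical toy data: labeled mails split into a training and a test set.
training_set = [
    ("win a free prize now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project status report", "ham"),
]
test_set = [
    ("free prize offer", "spam"),
    ("monday project meeting", "ham"),
]

model = train(training_set)                # prepare the model
test_accuracy = accuracy(model, test_set)  # test the model
new_label = classify(model, "claim your free money")  # classify new data
print(test_accuracy, new_label)
```

Keeping the test set separate from the training set matters: testing on mails the model has already seen would overstate how well it handles new data.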
Applications of Classification
Credit card fraud detection - Classification is used to predict credit card fraud. Using historical information about previous fraud cases, the classifier can predict which future transactions are likely to be fraudulent.
Spam e-mail filtering - Based on the characteristics of previous spam mails, the classifier determines whether a newly received e-mail should be sent to the spam folder.
Naive Bayes Classifier
Mahout implements the Naive Bayes classifier algorithm. It provides two implementations:
Distributed Naive Bayes classification
Complementary Naive Bayes classification
Naive Bayes is a simple technique for constructing classifiers. It is not a single algorithm for training such classifiers, but a family of algorithms. A Bayes classifier constructs models to classify problem instances. These classifications are made using the available data.
An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification.
For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting.
Despite their oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations.
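To make the idea concrete, here is a minimal sketch of a multinomial naive Bayes text classifier in plain Python. It is a textbook variant with add-one (Laplace) smoothing, not Mahout's distributed or complementary implementation; the function names and the toy training data are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def log_scores(model, text, alpha=1.0):
    """Return log P(label) + sum of log P(word | label) for each label."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        denominator = total_words + alpha * len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + alpha) / denominator)
        scores[label] = score
    return scores

def predict(model, text):
    """Pick the label with the highest log score."""
    scores = log_scores(model, text)
    return max(scores, key=scores.get)

# Hypothetical toy training data.
examples = [
    ("free prize money", "spam"),
    ("free offer", "spam"),
    ("meeting on monday", "ham"),
    ("monday status report", "ham"),
]
nb_model = train_naive_bayes(examples)
print(predict(nb_model, "free money"))
print(predict(nb_model, "monday report"))
```

The "naive" part is the assumption that words are independent given the label, which is why the per-word log probabilities can simply be added. Working in log space avoids numeric underflow when many small probabilities are multiplied.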
Procedure of Classification
Follow these steps to implement classification:
Generate example data
Create sequence files from data
Convert sequence files to vectors
Train the vectors
Test the vectors
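As a rough sketch, steps 2 to 5 can be strung together by a small driver that builds the Mahout command lines. The flags mirror the invocations given in the individual steps of this tutorial (with the label index passed via -li); the function name and every path are hypothetical placeholders. The sketch only prints the commands (a dry run) and does not call Mahout.

```python
import shlex

def build_pipeline(input_dir, seq_dir, vectors_dir, train_vectors,
                   test_vectors, model_dir, output_dir):
    """Return the Mahout command lines for steps 2 to 5 as argument lists."""
    return [
        # Step 2: raw text files -> sequence files
        ["mahout", "seqdirectory", "-i", input_dir, "-o", seq_dir],
        # Step 3: sequence files -> sparse vectors
        ["mahout", "seq2sparse", "-i", seq_dir, "-o", vectors_dir],
        # Step 4: train a model on the training vectors
        ["mahout", "trainnb", "-i", train_vectors, "-el",
         "-o", f"{model_dir}/model", "-li", f"{model_dir}/labelindex",
         "-ow", "-c"],
        # Step 5: test the model on the test vectors
        ["mahout", "testnb", "-i", test_vectors,
         "-m", f"{model_dir}/model", "-l", f"{model_dir}/labelindex",
         "-ow", "-o", output_dir, "-c", "-seq"],
    ]

# All paths below are hypothetical; adjust them to your own layout.
commands = build_pipeline(
    input_dir="work/raw",
    seq_dir="work/seq",
    vectors_dir="work/vectors",
    train_vectors="work/train-vectors",
    test_vectors="work/test-vectors",
    model_dir="work/nb",
    output_dir="work/results",
)
for command in commands:
    # Dry run: print each command instead of executing it.
    print(shlex.join(command))
```

To actually run the steps, each argument list could be passed to subprocess.run with check=True so that a failing step stops the pipeline. How the vectors are divided into a training and a test set is not covered by this sketch.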
Step 1: Generate Example Data
Generate or download the data to be classified. For example, you can get the 20 newsgroups example data from the following link:
http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
Create a directory for storing input data, then download and extract the example as shown below.
$ mkdir classification_example
$ cd classification_example
$ wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
$ tar xzvf 20news-bydate.tar.gz
Step 2: Create Sequence Files
Create sequence files from the example data using the seqdirectory utility. The syntax to generate sequence files is given below:
mahout seqdirectory -i <input file path> -o <output directory>
Step 3: Convert Sequence Files to Vectors
Create vector files from sequence files using the seq2sparse utility. Some of the options of the seq2sparse utility are given below:
$MAHOUT_HOME/bin/mahout seq2sparse
--analyzerName (-a) analyzerName    The class name of the analyzer
--chunkSize (-chunk) chunkSize      The chunk size in megabytes
--output (-o) output                The directory pathname for output
--input (-i) input                  Path to job input directory
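The training step in this tutorial consumes TF-IDF vectors. To show what such a vector looks like conceptually, here is a minimal sketch that uses the textbook weight tf * log(N / df). The exact weighting formula Mahout applies may differ, and the function name and sample documents are assumptions made for this example.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Map each document to a sparse {term: weight} dictionary."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors

# Hypothetical sample documents.
documents = ["the cat sat", "the dog sat down", "the bird flew"]
vectors = tfidf_vectors(documents)
for vector in vectors:
    print(vector)
```

A term that occurs in every document (here "the") gets weight 0, while a term that is rare across the collection gets a higher weight. This is what makes TF-IDF vectors more useful for classification than raw word counts.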
Step 4: Train the Vectors
Train the generated vectors using the trainnb utility. A typical invocation of the trainnb utility is given below:
mahout trainnb -i ${PATH_TO_TFIDF_VECTORS} -el -o ${PATH_TO_MODEL}/model -li ${PATH_TO_MODEL}/labelindex -ow -c
Step 5: Test the Vectors
Test the vectors using the testnb utility. A typical invocation of the testnb utility is given below:
mahout testnb -i ${PATH_TO_TFIDF_TEST_VECTORS} -m ${PATH_TO_MODEL}/model -l ${PATH_TO_MODEL}/labelindex -ow -o ${PATH_TO_OUTPUT} -c -seq
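Testing a classifier comes down to comparing predicted labels with the known labels of the test data. The sketch below shows how such a comparison can be summarized as a confusion matrix, an accuracy figure, and a per-label recall. It illustrates the idea only and does not reproduce Mahout's output format; the function names and the sample labels are assumptions made for this example.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(matrix):
    """Fraction of test items whose predicted label equals the actual label."""
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / sum(matrix.values())

def recall(matrix, label):
    """Fraction of items with this actual label that were predicted correctly."""
    total = sum(n for (a, _), n in matrix.items() if a == label)
    return matrix[(label, label)] / total

# Hypothetical test results: known labels and the classifier's predictions.
actual = ["spam", "spam", "ham", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "spam"]

matrix = confusion_matrix(actual, predicted)
for (a, p), n in sorted(matrix.items()):
    print(f"actual={a} predicted={p} count={n}")
print("accuracy:", accuracy(matrix))
print("recall for spam:", recall(matrix, "spam"))
```

The off-diagonal entries (actual label differs from predicted label) show which classes the model confuses with each other, which a single accuracy number hides.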