Stemming & Lemmatization
What is Stemming?
Stemming is a technique for extracting the base form of a word by removing its affixes, much like cutting the branches of a tree down to the stem. For example, the stem of the words eating, eats, and eaten is eat.
Search engines use stemming when indexing words. Rather than storing every form of a word, a search engine can store only the stems. This reduces the size of the index and can improve retrieval accuracy, because a query then matches every form of a word.
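The indexing idea can be sketched in a few lines of plain Python. This is only an illustration: toy_stem is a hypothetical, deliberately crude suffix stripper (not one of the NLTK stemmers), and the three documents are made-up sample data.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical crude stemmer: strip one of a few common suffixes."""
    word = word.lower()
    for suffix in ("ing", "en", "s"):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


docs = {1: "she eats apples", 2: "he was eating", 3: "they have eaten"}
index = build_index(docs)

# One index entry ("eat") now covers all three surface forms.
print(sorted(index[toy_stem("eating")]))  # [1, 2, 3]
print(len(index))  # 7 stems instead of 9 distinct words
```

A query for any form of the word is stemmed the same way before the lookup, so it reaches all three documents through a single entry.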
Various Stemming algorithms
In NLTK, all the stemmers covered next implement the StemmerI interface, which defines the stem() method.
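As a rough sketch of what the shared interface means in practice, the snippet below treats two built-in stemmers and a custom one uniformly. SuffixSStemmer is a hypothetical class written only for this illustration; it is not part of NLTK.

```python
from nltk.stem import LancasterStemmer, PorterStemmer
from nltk.stem.api import StemmerI


class SuffixSStemmer(StemmerI):
    """Hypothetical stemmer for illustration: strips one trailing 's'."""

    def stem(self, token):
        if token.endswith("s") and len(token) > 3:
            return token[:-1]
        return token


# Every stemmer exposes the same stem() method, so they are interchangeable.
for stemmer in (PorterStemmer(), LancasterStemmer(), SuffixSStemmer()):
    assert isinstance(stemmer, StemmerI)
    print(type(stemmer).__name__, stemmer.stem("eats"))
```

Because code that calls stem() does not need to know which class it received, swapping one stemmer for another is a one-line change.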
Porter stemming algorithm
It is one of the most common stemming algorithms, designed to remove and replace well-known suffixes of English words.
PorterStemmer class
NLTK provides the PorterStemmer class, which makes it easy to apply the Porter stemming algorithm to a word. The class knows several regular word forms and suffixes, which it uses to transform the input word into a final stem. The resulting stem is often a shorter word with the same root meaning. Let us see an example −
First, we need to import the natural language toolkit (nltk).
import nltk
Now, import the PorterStemmer class to implement the Porter Stemmer algorithm.
from nltk.stem import PorterStemmer
Next, create an instance of Porter Stemmer class as follows −
word_stemmer = PorterStemmer()
Now, input the word you want to stem.
word_stemmer.stem('writing')
Output
'write'
word_stemmer.stem('eating')
Output
'eat'
Complete implementation example
import nltk
from nltk.stem import PorterStemmer
word_stemmer = PorterStemmer()
word_stemmer.stem('writing')
Output
'write'
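In practice a stemmer is usually applied to many words at once. The following is a minimal sketch that reuses the words from this section; the variable names words and stems are chosen only for this illustration.

```python
from nltk.stem import PorterStemmer

word_stemmer = PorterStemmer()

# Stem each word and keep the mapping from surface form to stem.
words = ["writing", "eating", "eats"]
stems = {word: word_stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The stemmer instance is created once and reused for every word, since it keeps no per-word state.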
Lancaster stemming algorithm
It was developed at Lancaster University and is another very common stemming algorithm.
LancasterStemmer class
NLTK provides the LancasterStemmer class, which makes it easy to apply the Lancaster stemming algorithm to a word. Let us see an example −
First, we need to import the natural language toolkit (nltk).
import nltk
Now, import the LancasterStemmer class to implement the Lancaster Stemmer algorithm.
from nltk.stem import LancasterStemmer
Next, create an instance of LancasterStemmer class as follows −
Lanc_stemmer = LancasterStemmer()
Now, input the word you want to stem.
Lanc_stemmer.stem('eats')
Output
'eat'
Complete implementation example
import nltk
from nltk.stem import LancasterStemmer
Lanc_stemmer = LancasterStemmer()
Lanc_stemmer.stem('eats')
Output
'eat'
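The Lancaster stemmer is generally regarded as more aggressive than the Porter stemmer, so the two can disagree on the same input. The sketch below prints both results side by side so you can compare them on your own word list; the sample words are chosen only for illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Collect (word, Porter stem, Lancaster stem) for a few sample words.
sample_words = ["eats", "maximum", "presumably"]
comparison = [(w, porter.stem(w), lancaster.stem(w)) for w in sample_words]

for word, porter_stem, lancaster_stem in comparison:
    print(f"{word:12} porter={porter_stem:12} lancaster={lancaster_stem}")
```

Which stemmer is preferable depends on the task: a more aggressive stemmer merges more word forms, at the cost of sometimes merging unrelated words.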
Regular Expression stemming algorithm
With the help of this stemming algorithm, we can construct our own stemmer.
RegexpStemmer class
NLTK provides the RegexpStemmer class, which makes it easy to build a regular-expression stemmer. It takes a single regular expression and removes any part of the word that matches it, such as a prefix or suffix. Let us see an example −
First, we need to import the natural language toolkit (nltk).
import nltk
Now, import the RegexpStemmer class to implement the Regular Expression Stemmer algorithm.
from nltk.stem import RegexpStemmer
Next, create an instance of the RegexpStemmer class, passing the suffix or prefix you want to remove from the word −
Reg_stemmer = RegexpStemmer('ing')
Now, input the word you want to stem.
Reg_stemmer.stem('eating')
Output
'eat'
Reg_stemmer.stem('ingeat')
Output
'eat'
Reg_stemmer.stem('eats')
Output
'eats'
The word eats does not contain ing, so it is returned unchanged.
Complete implementation example
import nltk
from nltk.stem import RegexpStemmer
Reg_stemmer = RegexpStemmer('ing')
Reg_stemmer.stem('ingeat')
Output
'eat'
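Because the pattern 'ing' is not anchored, it matches anywhere in the word, which can remove more than intended. A hedged sketch of a safer setup follows: it anchors each suffix with $ and uses the min parameter of RegexpStemmer so that very short words are left alone. The particular pattern is only an example, not a recommended rule set.

```python
from nltk.stem import RegexpStemmer

# Unanchored: every occurrence of "ing" is removed, wherever it appears.
unanchored = RegexpStemmer('ing')
print(unanchored.stem('singing'))  # 's'

# Anchored suffixes only; words shorter than 4 characters are not stemmed.
anchored = RegexpStemmer('ing$|s$|e$|able$', min=4)
print(anchored.stem('cars'))       # 'car'
print(anchored.stem('was'))        # 'was' (shorter than min)
print(anchored.stem('advisable'))  # 'advis'
```

Anchoring with $ restricts the match to the end of the word, which is usually what a suffix stripper should do.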
Snowball stemming algorithm
Snowball is another widely used family of stemming algorithms, with stemmers for several languages.
SnowballStemmer class
NLTK provides the SnowballStemmer class, which makes it easy to apply the Snowball stemming algorithms. It supports English and more than a dozen other languages; the exact list depends on the installed NLTK version and is available as SnowballStemmer.languages. To use this stemming class, create an instance with the name of the language you are using and then call its stem() method. Let us see an example −
First, we need to import the natural language toolkit (nltk).
import nltk
Now, import the SnowballStemmer class to implement the Snowball Stemmer algorithm.
from nltk.stem import SnowballStemmer
Let us see the languages it supports −
SnowballStemmer.languages
Output
('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
Next, create an instance of the SnowballStemmer class with the language you want to use. Here, we are creating a stemmer for the French language.
French_stemmer = SnowballStemmer('french')
Now, call the stem() method and input the word you want to stem.
French_stemmer.stem('Bonjoura')
Output
'bonjour'
Complete implementation example
import nltk
from nltk.stem import SnowballStemmer
French_stemmer = SnowballStemmer('french')
French_stemmer.stem('Bonjoura')
Output
'bonjour'
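The language name passed to the constructor selects the algorithm, and an unsupported name raises a ValueError. The sketch below compares the 'english' and 'porter' entries on one word and shows the error case; the name 'klingon' is just a made-up unsupported language.

```python
from nltk.stem import SnowballStemmer

# "english" is the Snowball English stemmer; "porter" is the original Porter one.
english_stemmer = SnowballStemmer('english')
porter_stemmer = SnowballStemmer('porter')

print(english_stemmer.stem('generously'))  # 'generous'
print(porter_stemmer.stem('generously'))   # 'gener'

# Check the supported languages before constructing a stemmer.
print('french' in SnowballStemmer.languages)  # True

try:
    SnowballStemmer('klingon')  # made-up name, not a supported language
    error_message = None
except ValueError as error:
    error_message = str(error)
print(error_message)
```

Checking SnowballStemmer.languages first is a simple way to avoid the exception when the language name comes from user input.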
What is Lemmatization?
Lemmatization is similar to stemming. The output of lemmatization is called a 'lemma', which is a root word rather than a root stem, the output of stemming. After lemmatization, we get a valid word that means the same thing.
NLTK provides the WordNetLemmatizer class, which is a thin wrapper around the wordnet corpus. This class uses the morphy() function of the WordNet CorpusReader class to find a lemma. Let us understand it with an example −
Example
First, we need to import the natural language toolkit (nltk).
import nltk
Now, import the WordNetLemmatizer class to implement the lemmatization technique.
from nltk.stem import WordNetLemmatizer
Next, create an instance of WordNetLemmatizer class.
lemmatizer = WordNetLemmatizer()
Now, call the lemmatize() method and pass the word whose lemma you want to find.
lemmatizer.lemmatize('eating')
Output
'eating'
lemmatizer.lemmatize('books')
Output
'book'
Complete implementation example
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('books')
Output
'book'
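To see why 'eating' comes back unchanged while 'books' becomes 'book', it helps to look at the general idea behind a morphy-style lookup: try the word itself, then try suffix rewrites, and accept only candidates that exist in the dictionary for the given part of speech. The following plain-Python sketch is a simplification under stated assumptions: the rule lists are a small subset in the spirit of WordNet's rules, TOY_VOCAB is a hypothetical miniature dictionary standing in for WordNet, and toy_lemmatize is not an NLTK function.

```python
# (suffix, replacement) rules, a small subset for illustration only.
RULES = {
    "n": [("s", ""), ("ses", "s"), ("ves", "f"), ("xes", "x"), ("ies", "y")],
    "v": [("s", ""), ("ies", "y"), ("es", "e"), ("es", ""),
          ("ed", "e"), ("ed", ""), ("ing", "e"), ("ing", "")],
}

# Hypothetical miniature dictionary standing in for WordNet.
TOY_VOCAB = {
    "n": {"book", "belief", "eating"},
    "v": {"eat", "believe", "book"},
}


def toy_lemmatize(word, pos="n"):
    """Return the first dictionary word reachable from `word`, else `word`."""
    vocab = TOY_VOCAB[pos]
    if word in vocab:  # the word is already a valid entry for this pos
        return word
    for suffix, replacement in RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in vocab:
                return candidate
    return word  # nothing valid found: give the input back unchanged


print(toy_lemmatize("books"))        # 'book'
print(toy_lemmatize("eating"))       # 'eating' (a valid noun as it stands)
print(toy_lemmatize("eating", "v"))  # 'eat'
print(toy_lemmatize("believes"))     # 'belief'
```

The real lemmatize() method also accepts a pos argument that defaults to noun, which is why 'eating' is returned unchanged above unless it is treated as a verb.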
Difference between Stemming & Lemmatization
Let us understand the difference between stemming and lemmatization with the help of the following example −
import nltk
from nltk.stem import PorterStemmer
word_stemmer = PorterStemmer()
word_stemmer.stem('believes')
Output
'believ'
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('believes')
Output
'belief'
The output of the two programs shows the major difference between stemming and lemmatization. The PorterStemmer class chops the 'es' off the word, leaving believ, which is not a word. The WordNetLemmatizer class, on the other hand, finds a valid word, belief. In simple terms, stemming looks only at the form of the word, whereas lemmatization looks at its meaning. This means that lemmatization returns a valid word whenever the input is found in WordNet; otherwise the input is returned unchanged.