Stemming & Lemmatization

What is Stemming?

Stemming is a technique used to extract the base form of words by removing affixes from them. It is just like cutting down the branches of a tree to its stems. For example, the stem of the words eating, eats, eaten is eat.

Search engines use stemming for indexing words. Rather than storing all forms of a word, a search engine can store only the stems. In this way, stemming reduces the size of the index and helps a query match documents that contain any form of the word.
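To make the indexing idea concrete, here is a minimal sketch of an inverted index keyed by stems. The toy_stem helper, its suffix list and the three sample documents are assumptions invented for this illustration; it is not one of the NLTK algorithms, and a real system would use a proper stemmer such as those covered below.

```python
from collections import defaultdict

# Hypothetical toy stemmer: strips a few common English suffixes.
# It is NOT one of the NLTK algorithms, just enough for the demo.
def toy_stem(word):
    word = word.lower()
    for suffix in ("ing", "en", "s"):
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index

docs = {
    1: "she eats early",
    2: "we were eating late",
    3: "he has eaten already",
}

raw_index = build_index(docs, str.lower)   # one entry per surface form
stem_index = build_index(docs, toy_stem)   # one entry per stem

print(len(raw_index), len(stem_index))
print(sorted(stem_index[toy_stem("eating")]))
```

With these sample documents the raw index holds 11 entries and the stemmed index 9, and a query for "eating" reaches all three documents because eats, eating and eaten share the entry "eat".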

Various Stemming algorithms

In NLTK, every stemmer we are going to cover next implements the StemmerI interface, which declares the stem() method. The classes discussed here are PorterStemmer, LancasterStemmer, RegexpStemmer and SnowballStemmer.

Porter stemming algorithm

It is one of the most common stemming algorithms and is designed to remove and replace well-known suffixes of English words.

PorterStemmer class

NLTK has the PorterStemmer class, with which we can easily apply the Porter stemming algorithm to the word we want to stem. This class knows several regular word forms and suffixes, which it uses to transform the input word into a final stem. The resulting stem is often a shorter word with the same root meaning. Let us see an example:

First, we need to import the natural language toolkit(nltk).


import nltk

Now, import the PorterStemmer class to implement the Porter Stemmer algorithm.


from nltk.stem import PorterStemmer

Next, create an instance of Porter Stemmer class as follows −


word_stemmer = PorterStemmer()

Now, pass the word you want to stem to the stem() method.

word_stemmer.stem('writing')

Output

'write'

word_stemmer.stem('eating')

Output

'eat'

Complete implementation example


import nltk
from nltk.stem import PorterStemmer
word_stemmer = PorterStemmer()
print(word_stemmer.stem('writing'))

Output

write

Lancaster stemming algorithm

It was developed at Lancaster University and is another very common stemming algorithm.

LancasterStemmer class

NLTK has the LancasterStemmer class, which implements the Lancaster stemming algorithm for the word we want to stem. Let us see an example:

First, we need to import the natural language toolkit(nltk).


import nltk

Now, import the LancasterStemmer class to implement the Lancaster Stemmer algorithm.


from nltk.stem import LancasterStemmer

Next, create an instance of LancasterStemmer class as follows −


Lanc_stemmer = LancasterStemmer()

Now, pass the word you want to stem to the stem() method.

Lanc_stemmer.stem('eats')

Output

'eat'

Complete implementation example


import nltk
from nltk.stem import LancasterStemmer
Lanc_stemmer = LancasterStemmer()
print(Lanc_stemmer.stem('eats'))

Output

eat
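Because PorterStemmer and LancasterStemmer expose the same stem() method, they can be run side by side on the same words. The sketch below does that; the word list is an arbitrary choice for illustration. Lancaster is generally regarded as the more aggressive of the two, so its stems tend to be shorter.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

# Both classes implement the same stem() method, so they can be
# driven by the same loop.
stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
}

# Arbitrary sample words chosen for this illustration.
words = ["eating", "writing", "maximum", "presumably"]

results = {
    name: [stemmer.stem(word) for word in words]
    for name, stemmer in stemmers.items()
}

for name, stems in results.items():
    print(name, stems)
```

Comparing the two lists on a sample of your own vocabulary is a quick way to judge which algorithm is the better fit before committing to one.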

Regular Expression stemming algorithm

With the help of this stemming algorithm, we can construct our own stemmer.

RegexpStemmer class

NLTK has the RegexpStemmer class, which implements a regular-expression stemmer. It takes a single regular expression and removes every part of the word that matches it, so the same class can strip a prefix, a suffix, or both. Let us see an example:

First, we need to import the natural language toolkit(nltk).


import nltk

Now, import the RegexpStemmer class to implement the Regular Expression Stemmer algorithm.


from nltk.stem import RegexpStemmer

Next, create an instance of the RegexpStemmer class and provide the suffix or prefix you want to remove from the word as follows:

Reg_stemmer = RegexpStemmer('ing')

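Now, pass the word you want to stem to the stem() method. Because the pattern is not anchored, 'ing' is removed wherever it occurs, and a word that does not contain it is returned unchanged.

Reg_stemmer.stem('eating')

Output

'eat'

Reg_stemmer.stem('ingeat')

Output

'eat'

Reg_stemmer.stem('eats')

Output

'eats'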

Complete implementation example


import nltk
from nltk.stem import RegexpStemmer
Reg_stemmer = RegexpStemmer('ing')
print(Reg_stemmer.stem('ingeat'))

Output

eat
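A bare pattern such as 'ing' matches anywhere in the word, which can remove more than intended. The sketch below contrasts it with an anchored pattern that only strips trailing suffixes, and uses the min argument of RegexpStemmer so that very short words are left alone. The suffix list and sample words are illustrative choices, not a recommended rule set.

```python
from nltk.stem import RegexpStemmer

# Unanchored pattern: every occurrence of "ing" is removed,
# wherever it appears in the word.
loose = RegexpStemmer("ing")

# Anchored alternatives: strip only a trailing "ing", "s", "e" or "able",
# and leave words shorter than 4 characters untouched (min=4).
strict = RegexpStemmer("ing$|s$|e$|able$", min=4)

for word in ["singing", "eating", "cars", "was", "advisable"]:
    print(word, loose.stem(word), strict.stem(word))
```

For "singing" the loose stemmer leaves only "s", while the strict one returns "sing"; anchoring with $ is usually what you want when the goal is suffix stripping.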

Snowball stemming algorithm

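Snowball, created by Martin Porter, is a family of stemming algorithms that covers many languages. Its English stemmer is an improved version of the original Porter algorithm.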

SnowballStemmer class

NLTK has the SnowballStemmer class, which implements the Snowball stemmers. As the language list further down shows, it covers English, 14 other languages, and the original Porter algorithm. To use this stemming class, we create an instance with the name of the language we are using and then call the stem() method. Let us see an example:

First, we need to import the natural language toolkit(nltk).


import nltk

Now, import the SnowballStemmer class to implement the Snowball Stemmer algorithm.


from nltk.stem import SnowballStemmer

Let us see the languages it supports −


SnowballStemmer.languages

Output


(
    'arabic',
    'danish',
    'dutch',
    'english',
    'finnish',
    'french',
    'german',
    'hungarian',
    'italian',
    'norwegian',
    'porter',
    'portuguese',
    'romanian',
    'russian',
    'spanish',
    'swedish'
)

Next, create an instance of the SnowballStemmer class with the language you want to use. Here, we create a stemmer for the French language.

French_stemmer = SnowballStemmer('french')

Now, call the stem() method with the word you want to stem.

French_stemmer.stem('Bonjoura')

Output

'bonjour'

Complete implementation example


import nltk
from nltk.stem import SnowballStemmer
French_stemmer = SnowballStemmer('french')
print(French_stemmer.stem('Bonjoura'))

Output

bonjour
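The same class also gives access to the English Snowball stemmer and, under the name 'porter', to the original Porter algorithm, which makes it easy to compare the two. The sketch below does so for one word and also shows a membership check against SnowballStemmer.languages; the word and the unsupported language name are arbitrary choices for illustration.

```python
from nltk.stem import SnowballStemmer

# "english" selects the Snowball English stemmer, while "porter"
# selects the original Porter algorithm through the same class.
english_stemmer = SnowballStemmer("english")
porter_stemmer = SnowballStemmer("porter")

word = "generously"
english_stem = english_stemmer.stem(word)
porter_stem = porter_stemmer.stem(word)
print(english_stem, porter_stem)

# Check a language name before creating a stemmer for it.
requested = "klingon"  # hypothetical, unsupported language name
is_supported = requested in SnowballStemmer.languages
print(is_supported)
```

Here the English Snowball stemmer keeps "generous" while the Porter variant cuts further, down to "gener". Checking the name first avoids the ValueError that SnowballStemmer raises for a language it does not support.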

What is Lemmatization?

Lemmatization is like stemming, but its output, called the 'lemma', is a root word rather than a root stem. After lemmatization, we get a valid word that means the same thing.

NLTK provides the WordNetLemmatizer class, which is a thin wrapper around the WordNet corpus. This class uses the morphy() function of the WordNet CorpusReader class to find a lemma, so the WordNet data must be installed first, for example with nltk.download('wordnet'). Let us understand it with an example:

Example

First, we need to import the natural language toolkit(nltk).


import nltk

Now, import the WordNetLemmatizer class to implement the lemmatization technique.


from nltk.stem import WordNetLemmatizer

Next, create an instance of WordNetLemmatizer class.


lemmatizer = WordNetLemmatizer()

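Now, call the lemmatize() method with the word whose lemma you want to find. The word is treated as a noun unless a part of speech is given, which is why 'eating' comes back unchanged below; calling lemmatize('eating', pos='v') would return 'eat'.

lemmatizer.lemmatize('eating')

Output

'eating'

lemmatizer.lemmatize('books')

Output

'book'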

Complete implementation example


import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('books'))

Output

book

Difference between Stemming & Lemmatization

Let us understand the difference between Stemming and Lemmatization with the help of the following example −


import nltk
from nltk.stem import PorterStemmer
word_stemmer = PorterStemmer()
print(word_stemmer.stem('believes'))

Output

believ


import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('believes'))

Output

belief

The output of both programs shows the major difference between stemming and lemmatization. The PorterStemmer class simply chops the 'es' off the word, leaving a stem that is not a real word. The WordNetLemmatizer class, on the other hand, finds a valid word. In simple terms, stemming only looks at the form of the word, whereas lemmatization looks at its meaning. As a result, the lemmatizer returns a valid dictionary word whenever WordNet recognizes the input, and returns the input unchanged otherwise.
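The contrast between looking at form and looking at meaning can be sketched without NLTK. The toy code below is an assumption-laden imitation, loosely modeled on the rule-plus-lookup idea: the tiny lexicon, the rule list and both helper functions are hypothetical and far simpler than what PorterStemmer or WordNetLemmatizer actually do.

```python
# Toy lexicon standing in for WordNet's noun entries (an assumption
# for this sketch; the real lexicon is far larger).
NOUN_LEXICON = {"belief", "book", "wolf", "box"}

# Suffix rewrite rules in (old_suffix, new_suffix) form, tried in order.
NOUN_RULES = [("ves", "f"), ("xes", "x"), ("s", "")]

def toy_stem(word):
    """Form-only approach: strip a trailing 'es' or 's', no lookup."""
    for suffix in ("es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Lookup-based approach: accept a candidate only if it is a known word."""
    if word in NOUN_LEXICON:
        return word
    for old, new in NOUN_RULES:
        if word.endswith(old):
            candidate = word[: -len(old)] + new
            if candidate in NOUN_LEXICON:
                return candidate
    # Unknown words come back unchanged.
    return word

for w in ["believes", "books", "wolves", "eating"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

For "believes" the form-only function produces "believ", while the lookup-based one returns "belief" because that candidate is in the lexicon. A word the lexicon does not cover, such as "eating" here, is returned as is.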
