Python - Stemming and Lemmatization
In Natural Language Processing we often come across situations where two or more words share a common root. For example, the three words agreed, agreeing and agreeable have the same root word, agree. A search involving any of these words should treat them as the same word, namely the root word. So it becomes essential to link all such words to their root word. The NLTK library has methods to do this linking and give the output showing the root word.
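To make the idea of linking words to a shared root concrete, here is a minimal sketch of a toy suffix-stripping stemmer. It is not the Porter algorithm and not part of NLTK: the rule list and the names `RULES`, `naive_stem` and `group_by_stem` are hypothetical, chosen only so that the three example words above collapse to agree.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
# Each rule is (suffix to remove, replacement); the list is hand-picked and hypothetical.
RULES = (("eed", "ee"), ("able", ""), ("ing", ""), ("ed", ""), ("s", ""))


def naive_stem(word):
    """Apply the first matching rule, keeping at least three leading characters."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word


def group_by_stem(words):
    """Map each stem to the list of original words that reduce to it."""
    groups = {}
    for w in words:
        groups.setdefault(naive_stem(w), []).append(w)
    return groups


groups = group_by_stem(["agreed", "agreeing", "agreeable", "agree"])
print(groups)
```

A search index keyed by such stems would return the same entries for any of the four words. A real stemmer uses a far larger and more careful rule set, which is why a library implementation is preferable in practice.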
The below program uses the Porter Stemming Algorithm for stemming.
import nltk
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"

# First, split the sentence into word tokens.
# (word_tokenize needs NLTK's Punkt tokenizer data, installed once via nltk.download.)
nltk_tokens = nltk.word_tokenize(word_data)

# Next, find the stem of each token.
for w in nltk_tokens:
    print("Actual: %s Stem: %s" % (w, porter_stemmer.stem(w)))
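When we execute the above code, it produces the following result. Recent NLTK releases lowercase each word before stemming, so the first line may read "Stem: it" instead.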
Actual: It Stem: It
Actual: originated Stem: origin
Actual: from Stem: from
Actual: the Stem: the
Actual: idea Stem: idea
Actual: that Stem: that
Actual: there Stem: there
Actual: are Stem: are
Actual: readers Stem: reader
Actual: who Stem: who
Actual: prefer Stem: prefer
Actual: learning Stem: learn
Actual: new Stem: new
Actual: skills Stem: skill
Actual: from Stem: from
Actual: the Stem: the
Actual: comforts Stem: comfort
Actual: of Stem: of
Actual: their Stem: their
Actual: drawing Stem: draw
Actual: rooms Stem: room
Lemmatization is similar to stemming, but it brings context to the words. It goes a step further by mapping each word to its dictionary form (the lemma), so the result is always a real word rather than a truncated fragment. For example, it maps cars to car and, when told the word is an adjective, better to good. In the below program we use the WordNet lexical database for lemmatization.
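import nltk
from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer needs NLTK's WordNet data, installed once via nltk.download.
wordnet_lemmatizer = WordNetLemmatizer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"

# Split the sentence into word tokens.
nltk_tokens = nltk.word_tokenize(word_data)

# Look up the lemma of each token (treated as a noun by default).
for w in nltk_tokens:
    print("Actual: %s Lemma: %s" % (w, wordnet_lemmatizer.lemmatize(w)))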
When we execute the above code, it produces the following result.
Actual: It Lemma: It
Actual: originated Lemma: originated
Actual: from Lemma: from
Actual: the Lemma: the
Actual: idea Lemma: idea
Actual: that Lemma: that
Actual: there Lemma: there
Actual: are Lemma: are
Actual: readers Lemma: reader
Actual: who Lemma: who
Actual: prefer Lemma: prefer
Actual: learning Lemma: learning
Actual: new Lemma: new
Actual: skills Lemma: skill
Actual: from Lemma: from
Actual: the Lemma: the
Actual: comforts Lemma: comfort
Actual: of Lemma: of
Actual: their Lemma: their
Actual: drawing Lemma: drawing
Actual: rooms Lemma: room
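In this result originated, learning and drawing come back unchanged, while the plural nouns are reduced. The reason is that the lemmatizer treats every word as a noun unless a part of speech is supplied. The sketch below illustrates that dependence with a tiny hand-written lookup table. The table and the names `LEMMA_TABLE` and `toy_lemmatize` are hypothetical stand-ins for illustration, not WordNet data or NLTK code.

```python
# Hypothetical lemma table keyed by (word, part of speech); NOT WordNet data.
# "n" marks a noun reading and "v" a verb reading.
LEMMA_TABLE = {
    ("learning", "v"): "learn",
    ("drawing", "v"): "draw",
    ("originated", "v"): "originate",
    ("readers", "n"): "reader",
    ("skills", "n"): "skill",
    ("rooms", "n"): "room",
}


def toy_lemmatize(word, pos="n"):
    """Return the table entry for (word, pos), or the word unchanged if absent."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


# Print each word with its noun-reading lemma and its verb-reading lemma.
for word in ["learning", "drawing", "readers"]:
    print(word, toy_lemmatize(word), toy_lemmatize(word, pos="v"))
```

With NLTK itself, the analogous step is to pass the part of speech explicitly, for example `wordnet_lemmatizer.lemmatize(w, pos="v")` for verbs.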