Natural Language Toolkit - Word Replacement
Stemming and lemmatization can be considered a kind of linguistic compression. In the same sense, word replacement can be thought of as text normalization or error correction.
But why do we need word replacement? Tokenization, for example, has trouble with contractions (like can't, won't, etc.). To handle such issues we need word replacement; for example, we can replace contractions with their expanded forms.
Word replacement using regular expression
First, we are going to replace words that match a regular expression. For this we need a basic understanding of regular expressions as well as the Python re module. In the example below, we will replace contractions with their expanded forms (e.g. "can't" will be replaced with "cannot") using regular expressions.
Example
First, import the necessary package re to work with regular expressions.
import re
Next, define the replacement patterns of your choice as follows −
R_patterns = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"i'm", "i am"),
    (r"(\w+)'ll", r"\g<1> will"),
    (r"(\w+)n't", r"\g<1> not"),
    (r"(\w+)'ve", r"\g<1> have"),
    (r"(\w+)'s", r"\g<1> is"),
    (r"(\w+)'re", r"\g<1> are"),
]
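The `\g<1>` in the replacement strings refers back to whatever the first group `(\w+)` matched. A minimal sketch of a single generic pattern applied on its own (the sample sentence is only illustrative):

```python
import re

# Illustrative only: one generic contraction pattern, used in isolation.
pattern = re.compile(r"(\w+)'ll")

# \g<1> re-inserts the text captured by the first group, (\w+).
expanded = pattern.sub(r"\g<1> will", "they'll say we'll wait")
print(expanded)  # they will say we will wait
```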
Now, create a class that can be used for replacing words −
class REReplacer(object):
    def __init__(self, patterns=R_patterns):
        # Compile every pattern once so it can be reused on each call.
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        # Apply the patterns in list order; specific ones come first.
        for (pattern, repl) in self.patterns:
            s = pattern.sub(repl, s)
        return s
Save this Python program (say, repRE.py) so that it can be imported. You can then import the REReplacer class whenever you want to replace words, as shown below.
from repRE import REReplacer
rep_word = REReplacer()
rep_word.replace("I won't do it")
Output: I will not do it
rep_word.replace("I can't do it")
Output: I cannot do it
Complete implementation example
import re

R_patterns = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"i'm", "i am"),
    (r"(\w+)'ll", r"\g<1> will"),
    (r"(\w+)n't", r"\g<1> not"),
    (r"(\w+)'ve", r"\g<1> have"),
    (r"(\w+)'s", r"\g<1> is"),
    (r"(\w+)'re", r"\g<1> are"),
]

class REReplacer(object):
    def __init__(self, patterns=R_patterns):
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            s = pattern.sub(repl, s)
        return s
Once you have saved the above program, you can import the class and use it as follows −
from repRE import REReplacer
rep_word = REReplacer()
rep_word.replace("I won't do it")
Output
I will not do it
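The order of the patterns matters: the specific rule for "won't" has to run before the generic `(\w+)n't` rule, otherwise the generic rule fires first and leaves "wo not". The following is a small sketch of that effect; the helper name `expand` and the two shortened pattern lists are illustrative and not part of the class above.

```python
import re

def expand(text, patterns):
    """Apply each (regex, replacement) pair to text, in list order."""
    for regex, repl in patterns:
        text = re.sub(regex, repl, text)
    return text

# Hypothetical two-rule lists, differing only in order.
specific_first = [(r"won't", "will not"), (r"(\w+)n't", r"\g<1> not")]
generic_first = list(reversed(specific_first))

good = expand("I won't go", specific_first)
bad = expand("I won't go", generic_first)
print(good)  # I will not go
print(bad)   # I wo not go
```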
Replacement before text processing
One of the common practices in natural language processing (NLP) is to clean up the text before processing it. For this purpose we can use the REReplacer class created in the previous example as a preliminary step before text processing, i.e. tokenization.
Example
from nltk.tokenize import word_tokenize
from repRE import REReplacer
rep_word = REReplacer()
word_tokenize("I won't be able to do this now")
Output: ['I', 'wo', "n't", 'be', 'able', 'to', 'do', 'this', 'now']
word_tokenize(rep_word.replace("I won't be able to do this now"))
Output: ['I', 'will', 'not', 'be', 'able', 'to', 'do', 'this', 'now']
The output above shows the difference: without replacement the tokenizer splits "won't" into the awkward tokens wo and n't, whereas after regular expression replacement it yields will and not.
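The same clean-then-tokenize idea can be wrapped in one function. The sketch below is an assumption-laden illustration: `clean_and_tokenize` and the shortened `PATTERNS` list are hypothetical names, and the default tokenizer is a plain whitespace split standing in for `word_tokenize` so that the snippet runs without NLTK data.

```python
import re

# Shortened, illustrative pattern list (specific rules before the generic one).
PATTERNS = [
    (re.compile(r"won't"), "will not"),
    (re.compile(r"can't"), "cannot"),
    (re.compile(r"(\w+)n't"), r"\g<1> not"),
]

def expand_contractions(text):
    for regex, repl in PATTERNS:
        text = regex.sub(repl, text)
    return text

def clean_and_tokenize(text, tokenize=str.split):
    """Expand contractions, then hand the cleaned text to a tokenizer."""
    return tokenize(expand_contractions(text))

tokens = clean_and_tokenize("I won't be able to do this now")
print(tokens)
# ['I', 'will', 'not', 'be', 'able', 'to', 'do', 'this', 'now']
```

Passing `tokenize=word_tokenize` instead of the default would plug the NLTK tokenizer into the same pipeline.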
Removal of repeating characters
Are we strictly grammatical in our everyday language? No, we are not. For example, sometimes we write 'Hiiiiiiiiiiii Mohan' in order to emphasize the word 'Hi'. But a computer system does not know that 'Hiiiiiiiiiiii' is a variation of the word 'Hi'. In the example below, we will create a class named Rep_word_removal which can be used for removing the repeating characters.
Example
First, import the re package to work with regular expressions, and wordnet from NLTK to check whether a word exists.
import re
from nltk.corpus import wordnet
Now, create a class that can be used for removing the repeating characters −
class Rep_word_removal(object):
    def __init__(self):
        # Match a character that is immediately followed by itself.
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        # Keep everything except one copy of the doubled character.
        self.repl = r'\1\2\3'

    def replace(self, word):
        # Stop as soon as WordNet recognizes the word.
        if wordnet.synsets(word):
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word
Save this Python program (say, removalrepeat.py) so that it can be imported. You can then import the Rep_word_removal class whenever you want to remove repeating characters, as shown below.
from removalrepeat import Rep_word_removal
rep_word = Rep_word_removal()
rep_word.replace("Hiiiiiiiiiiiiiiiiiiiii")
Output: Hi
rep_word.replace("Hellooooooooooooooo")
Output: Hello
Complete implementation example
import re
from nltk.corpus import wordnet

class Rep_word_removal(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'

    def replace(self, word):
        if wordnet.synsets(word):
            return word
        replace_word = self.repeat_regexp.sub(self.repl, word)
        if replace_word != word:
            return self.replace(replace_word)
        else:
            return replace_word
Once you have saved the above program, you can import the class and use it as follows −
from removalrepeat import Rep_word_removal
rep_word = Rep_word_removal()
rep_word.replace("Hiiiiiiiiiiiiiiiiiiiii")
Output
Hi
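To follow the recursion without installing the WordNet corpus, the lookup can be swapped for a small set of known words. This is only a sketch under that assumption: `KNOWN_WORDS` and `remove_repeats` are hypothetical names, and the tiny vocabulary stands in for `wordnet.synsets`.

```python
import re

REPEAT = re.compile(r"(\w*)(\w)\2(\w*)")

# Hypothetical stand-in for WordNet: a tiny hand-written vocabulary.
KNOWN_WORDS = {"hi", "hello", "book"}

def remove_repeats(word, vocabulary=KNOWN_WORDS):
    # Stop as soon as the word is recognized, so "book" keeps its double "o".
    if word.lower() in vocabulary:
        return word
    # Drop one copy of one doubled character.
    shorter = REPEAT.sub(r"\1\2\3", word)
    # No change means no doubled character is left.
    if shorter == word:
        return word
    return remove_repeats(shorter, vocabulary)

print(remove_repeats("Hiiiiii"))    # Hi
print(remove_repeats("Hellooooo"))  # Hello
print(remove_repeats("boooook"))    # book
```

Checking the vocabulary before stripping is what keeps legitimate double letters: each pass removes a single character, and the recursion stops at the first form the vocabulary accepts.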