Natural Language Toolkit - Tokenizing Text
What is Tokenizing?
It may be defined as the process of breaking up a piece of text into smaller parts, such as sentences and words. These smaller parts are called tokens. For example, a word is a token in a sentence, and a sentence is a token in a paragraph.
As we know, NLP is used to build applications such as sentiment analysis, QA systems, language translation, smart chatbots, voice systems, etc. In order to build them, it becomes vital to understand the patterns in the text. The tokens mentioned above are very useful in finding and understanding these patterns. We can consider tokenization the base step for other recipes such as stemming and lemmatization.
NLTK package
nltk.tokenize is the package provided by the NLTK module to achieve the process of tokenization.
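Note that some of the tokenizers in this package (for example word_tokenize and sent_tokenize) rely on pre-trained Punkt models that are distributed as separate NLTK data; if they are not already installed, a one-time download is needed. A minimal sketch −
import nltk

# Download the Punkt models used by word_tokenize/sent_tokenize
# (only needed once per environment)
nltk.download('punkt')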
Tokenizing sentences into words
Splitting a sentence into words, or creating a list of words from a string, is an essential part of every text processing activity. Let us understand it with the help of various functions/modules provided by the nltk.tokenize package.
word_tokenize module
The word_tokenize module is used for basic word tokenization. The following example will use this module to split a sentence into words.
Example
import nltk
from nltk.tokenize import word_tokenize
word_tokenize('Tutorialspoint.com provides high quality technical tutorials for free.')
Output
['Tutorialspoint.com', 'provides', 'high', 'quality', 'technical', 'tutorials', 'for', 'free', '.']
TreebankWordTokenizer Class
The word_tokenize module used above is basically a wrapper function that calls the tokenize() method of a TreebankWordTokenizer class instance. It will give the same output as we get while using the word_tokenize() module for splitting sentences into words. Let us see the same example implemented with this class −
Example
First, we need to import the natural language toolkit (nltk).
import nltk
Now, import the TreebankWordTokenizer class to implement the word tokenizer algorithm −
from nltk.tokenize import TreebankWordTokenizer
Next, create an instance of TreebankWordTokenizer class as follows −
tokenizer_wrd = TreebankWordTokenizer()
Now, input the sentence you want to convert to tokens −
tokenizer_wrd.tokenize('Tutorialspoint.com provides high quality technical tutorials for free.')
Output
['Tutorialspoint.com', 'provides', 'high', 'quality', 'technical', 'tutorials', 'for', 'free', '.']
Complete implementation example
Let us see the complete implementation example below
import nltk
from nltk.tokenize import TreebankWordTokenizer
tokenizer_wrd = TreebankWordTokenizer()
tokenizer_wrd.tokenize('Tutorialspoint.com provides high quality technical tutorials for free.')
Output
['Tutorialspoint.com', 'provides', 'high', 'quality', 'technical', 'tutorials', 'for', 'free', '.']
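As a quick sanity check, we can compare the wrapper function with the class directly. This is only an illustrative sketch: for a simple single sentence like this one the two should agree, although recent NLTK versions back word_tokenize() with a slightly improved Treebank-style tokenizer, so minor differences are possible on other inputs.
import nltk
from nltk.tokenize import word_tokenize, TreebankWordTokenizer

text = 'Tutorialspoint.com provides high quality technical tutorials for free.'

# Tokenize the same sentence with the wrapper function and with the class instance
wrapper_tokens = word_tokenize(text)
class_tokens = TreebankWordTokenizer().tokenize(text)

print(wrapper_tokens == class_tokens)   # expected: True for this simple sentence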
The most significant convention of this tokenizer is that it separates contractions. For example, if we use the word_tokenize() module on a contraction, it will give the output as follows −
Example
import nltk
from nltk.tokenize import word_tokenize
word_tokenize("won't")
Output
['wo', "n't"]
Such a convention of splitting a contraction into tokens like wo and n't, which comes from TreebankWordTokenizer, may be unacceptable. That is why we have two alternative word tokenizers, namely PunktWordTokenizer and WordPunctTokenizer.
WordPunctTokenizer Class
An alternative word tokenizer that splits all punctuation into separate tokens. Let us understand it with the following simple example −
Example
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize("I can't allow you to go home early")
Output
['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'early']
Tokenizing text into sentences
In this section we are going to split text/paragraphs into sentences. NLTK provides the sent_tokenize module for this purpose.
Why is it needed?
An obvious question that comes to mind is: when we already have a word tokenizer, why do we need a sentence tokenizer, i.e. why do we need to tokenize text into sentences? Suppose we need to count the average number of words per sentence; how can we do this? To accomplish this task, we need both sentence tokenization and word tokenization (a sketch of this calculation follows the example below).
Let us understand the difference between the sentence and word tokenizers with the help of the following simple example −
Example
import nltk
from nltk.tokenize import sent_tokenize
text = "Let us understand the difference between sentence & word tokenizer. It is going to be a simple example."
sent_tokenize(text)
Output
[ "Let us understand the difference between sentence & word tokenizer.", It is going to be a simple example. ]
Tokenizing using regular expressions
If you feel that the output of the word tokenizer is unacceptable and you want complete control over how to tokenize the text, regular expressions can be used while tokenizing. NLTK provides the RegexpTokenizer class to achieve this.
Let us understand the concept with the help of two examples below.
In the first example, we will use a regular expression that matches alphanumeric characters plus single quotes, so that contractions like "won't" are not split.
Example 1
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")
tokenizer.tokenize("won't is a contraction.")
tokenizer.tokenize("can't is a contraction.")
Output
["won t", is , a , contraction ] ["can t", is , a , contraction ]
In the second example, we will use a regular expression to tokenize on whitespace.
Example 2
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps = True)
tokenizer.tokenize("won't is a contraction.")
Output
["won t", is , a , contraction ]
From the above output, we can see that the punctuation remains in the tokens. The parameter gaps = True means the pattern is used to identify the gaps to tokenize on. On the other hand, if we use the gaps = False parameter, then the pattern is used to identify the tokens themselves, as can be seen in the following example −
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps = False)
tokenizer.tokenize("won't is a contraction.")
Output
[' ', ' ', ' ']
With gaps = False the pattern matches the tokens themselves, so the only tokens found are the runs of whitespace, which is not a useful tokenization of this sentence.
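RegexpTokenizer can also be aimed at sentence boundaries rather than words. The following is only a rough sketch, assuming every sentence ends with '.', '!' or '?' followed by whitespace; unlike sent_tokenize, it will not handle abbreviations or other tricky cases.
from nltk.tokenize import RegexpTokenizer

# Split on whitespace that follows sentence-ending punctuation (gaps = True)
sent_regex_tokenizer = RegexpTokenizer(r'(?<=[.!?])\s+', gaps = True)
sent_regex_tokenizer.tokenize("won't is a contraction. It is still one token here!")
# expected: ["won't is a contraction.", 'It is still one token here!']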