Training Tokenizer & Filtering Stopwords
Why train our own sentence tokenizer?
This is a very important question: if NLTK already provides a default sentence tokenizer, why would we need to train our own? The answer lies in the quality of the default tokenizer, which is a general-purpose tokenizer. Although it works very well, it may not be a good choice for nonstandard text, which our text perhaps is, or for text with unique formatting. To tokenize such text and get the best results, we should train our own sentence tokenizer.
Implementation Example
For this example, we will be using the webtext corpus. The text file we are going to use from this corpus has its text formatted as dialogs, as shown below −
Guy: How old are you? Hipster girl: You know, I never answer that question. Because to me, it's about how mature you are, you know? I mean, a fourteen year old could be more mature than a twenty-five year old, right? I'm sorry, I just never answer that question. Guy: But, uh, you're older than eighteen, right? Hipster girl: Oh, yeah.
We have saved this text file with the name training_tokenizer.txt. NLTK provides a class named PunktSentenceTokenizer with the help of which we can train on raw text to produce a custom sentence tokenizer. We can get raw text either by reading in a file or from an NLTK corpus using the raw() method.
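If the text lives in a plain file outside any NLTK corpus, the raw text can also be read directly (a minimal sketch, assuming a file named training_tokenizer.txt in the current working directory) −

with open('training_tokenizer.txt', encoding='utf8') as f:   # file name is an assumption for illustration
   text = f.read()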
Let us see the example below to get more insight into it −
First, import the PunktSentenceTokenizer class from the nltk.tokenize package −
from nltk.tokenize import PunktSentenceTokenizer
Now, import the webtext corpus from the nltk.corpus package −
from nltk.corpus import webtext
Next, by using the raw() method, get the raw text from the training_tokenizer.txt file as follows −
text = webtext.raw('C://Users/Leekha/training_tokenizer.txt')
Now, create an instance of PunktSentenceTokenizer and print the tokenized sentences from the text file as follows −
sent_tokenizer = PunktSentenceTokenizer(text)
sents_1 = sent_tokenizer.tokenize(text)
print(sents_1[0])
Output
White guy: So, do you have any plans for this evening?

print(sents_1[1])

Output

Asian girl: Yeah, being angry!

print(sents_1[670])

Output

Guy: A hundred bucks?

print(sents_1[675])

Output

Girl: But you already have a Big Mac...
Complete implementation example
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext
text = webtext.raw('C://Users/Leekha/training_tokenizer.txt')
sent_tokenizer = PunktSentenceTokenizer(text)
sents_1 = sent_tokenizer.tokenize(text)
print(sents_1[0])
Output
White guy: So, do you have any plans for this evening?
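Training happens each time the script runs. If retraining on a large corpus becomes costly, the trained tokenizer can be saved with Python's pickle module and reloaded later (a sketch; the pickle file name dialog_tokenizer.pickle is an assumption for illustration) −

import pickle
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext

text = webtext.raw('C://Users/Leekha/training_tokenizer.txt')
sent_tokenizer = PunktSentenceTokenizer(text)

with open('dialog_tokenizer.pickle', 'wb') as f:   # save the trained tokenizer
   pickle.dump(sent_tokenizer, f)

with open('dialog_tokenizer.pickle', 'rb') as f:   # reload it without retraining
   restored_tokenizer = pickle.load(f)
print(restored_tokenizer.tokenize(text)[0])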
To understand the difference between NLTK’s default sentence tokenizer and our own trained sentence tokenizer, let us tokenize the same file with the default sentence tokenizer, i.e. sent_tokenize().
from nltk.tokenize import sent_tokenize
from nltk.corpus import webtext
text = webtext.raw('C://Users/Leekha/training_tokenizer.txt')
sents_2 = sent_tokenize(text)
print(sents_2[0])

Output

White guy: So, do you have any plans for this evening?

print(sents_2[675])

Output

Hobo: Y'know what I'd do if I was rich?
The difference in the outputs shows why it can be useful to train our own sentence tokenizer.
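To see exactly where the two tokenizers disagree, we can tokenize the same text both ways and report the first divergence (a brief sketch, reusing the file path from the examples above) −

from nltk.tokenize import PunktSentenceTokenizer, sent_tokenize
from nltk.corpus import webtext

text = webtext.raw('C://Users/Leekha/training_tokenizer.txt')
trained_sents = PunktSentenceTokenizer(text).tokenize(text)   # custom tokenizer
default_sents = sent_tokenize(text)                           # default tokenizer
print(len(trained_sents), len(default_sents))   # the sentence counts usually differ
for i, (a, b) in enumerate(zip(trained_sents, default_sents)):
   if a != b:   # first position where the two tokenizations diverge
      print('Sentence', i)
      print('Trained:', a)
      print('Default:', b)
      break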
What are stopwords?
Stopwords are common words that are present in text but do not contribute to the meaning of a sentence. Such words are not at all important for the purpose of information retrieval or natural language processing. The most common stopwords are ‘the’ and ‘a’.
NLTK stopwords corpus
Actually, the Natural Language Toolkit comes with a stopword corpus containing word lists for many languages. Let us understand its usage with the help of the following example −
First, import the stopwords corpus from the nltk.corpus package −
from nltk.corpus import stopwords
Now, we will be using the stopwords for the English language −
english_stops = set(stopwords.words('english'))
words = ['I', 'am', 'a', 'writer']
[word for word in words if word not in english_stops]
Output
['I', 'writer']
Complete implementation example
from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))
words = ['I', 'am', 'a', 'writer']
[word for word in words if word not in english_stops]
Output
['I', 'writer']
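Note that ‘I’ survives the filter because the entries in NLTK’s stopword lists are lowercase and the membership test is case-sensitive. Lowercasing each token before the test makes the filter case-insensitive (a minimal sketch) −

from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))
words = ['I', 'am', 'a', 'writer']
print([word for word in words if word.lower() not in english_stops])   # 'I' now matches the stopword 'i'

This prints ['writer'].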
Finding the complete list of supported languages
With the help of the following Python script, we can also find the complete list of languages supported by the NLTK stopwords corpus −
from nltk.corpus import stopwords
stopwords.fileids()
Output
['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']
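Any of the file ids above can be passed to stopwords.words() to obtain that language’s stopword list, for example German (a brief sketch) −

from nltk.corpus import stopwords
german_stops = set(stopwords.words('german'))   # any file id from the list above works here
print(len(german_stops))          # size of the German stopword list
print(sorted(german_stops)[:5])   # a small sample of the entries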