Gensim - Creating a bag of words (BoW) Corpus
We have understood how to create a dictionary from a list of documents and from text files (from a single file as well as from more than one). Now, in this section, we will create a bag-of-words (BoW) corpus. It is one of the most important objects we need to familiarise ourselves with in order to work with Gensim. Basically, a BoW corpus contains the id of each word along with its frequency in every document.
Creating a BoW Corpus
As discussed, in Gensim, the corpus contains the word id and its frequency in every document. We can create a BoW corpus from a simple list of documents or from text files. All we need to do is pass the tokenised list of words to the method named Dictionary.doc2bow(). So first, let’s start by creating a BoW corpus from a simple list of documents.
From a Simple List of Sentences
In the following example, we will create a BoW corpus from a simple list containing three sentences.
First, we need to import all the necessary packages as follows −
import gensim
import pprint
from gensim import corpora
from gensim.utils import simple_preprocess
Now provide the list containing the sentences. We have three sentences in our list −
doc_list = [
   "Hello, how are you?",
   "How do you do?",
   "Hey what are you doing? yes you What are you doing?"
]
Next, tokenise the sentences as follows −
doc_tokenized = [simple_preprocess(doc) for doc in doc_list]
Create an object of corpora.Dictionary() as follows −
dictionary = corpora.Dictionary()
Now pass these tokenised sentences to the dictionary.doc2bow() method as follows. With allow_update=True, the dictionary is updated with any new tokens it encounters −
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]
At last, we can print the Bag of Words corpus −
print(BoW_corpus)
Output
[ [(0, 1), (1, 1), (2, 1), (3, 1)], [(2, 1), (3, 1), (4, 2)], [(0, 2), (3, 3), (5, 2), (6, 1), (7, 2), (8, 1)] ]
The above output shows that the word with id=0 appears once in the first document (because we got (0, 1) in the output), and so on.
The above output is not really human readable. We can convert these ids to words, but for this we need our dictionary to do the conversion, as follows −
id_words = [[(dictionary[id], count) for id, count in line] for line in BoW_corpus]
print(id_words)
Output
[ [('are', 1), ('hello', 1), ('how', 1), ('you', 1)], [('how', 1), ('you', 1), ('do', 2)], [('are', 2), ('you', 3), ('doing', 2), ('hey', 1), ('what', 2), ('yes', 1)] ]
Now the above output is somewhat human readable.
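If we only want to check which id was assigned to which token, we can also inspect the dictionary’s token2id mapping directly; the following sketch assumes the dictionary built above −
# token2id maps each token to the integer id used in the BoW corpus
print(dictionary.token2id)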
Complete Implementation Example
import gensim
import pprint
from gensim import corpora
from gensim.utils import simple_preprocess

doc_list = [
   "Hello, how are you?",
   "How do you do?",
   "Hey what are you doing? yes you What are you doing?"
]
doc_tokenized = [simple_preprocess(doc) for doc in doc_list]
dictionary = corpora.Dictionary()
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]
print(BoW_corpus)
id_words = [[(dictionary[id], count) for id, count in line] for line in BoW_corpus]
print(id_words)
From a Text File
In the following example, we will be creating a BoW corpus from a text file. For this, we have saved the document, used in the previous example, in the text file named doc.txt.
Gensim will read the file line by line and process one line at a time by using simple_preprocess. In this way, it doesn’t need to load the complete file into memory all at once.
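To make that streaming behaviour explicit, here is a minimal sketch of a corpus class that yields one BoW vector per line as the file is read; the class name BoWTextCorpus is our own illustrative choice and not part of the Gensim API −
from gensim import corpora
from gensim.utils import simple_preprocess

class BoWTextCorpus:
   # Illustrative streaming corpus: only one line of the file is held in memory at a time
   def __init__(self, path, dictionary):
      self.path = path
      self.dictionary = dictionary
   def __iter__(self):
      for line in open(self.path, encoding='utf-8'):
         # tokenise the line and convert it to a BoW vector on the fly
         yield self.dictionary.doc2bow(simple_preprocess(line), allow_update=True)

dictionary = corpora.Dictionary()
stream_corpus = BoWTextCorpus('doc.txt', dictionary)
for vector in stream_corpus:
   print(vector)
An object of such a class can be iterated over just like the BoW_corpus list below, or passed directly to corpora.MmCorpus.serialize(). The tutorial’s own code that follows simply builds the whole list at once with a list comprehension.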
Implementation Example
First, import the required and necessary packages as follows −
import gensim
from gensim import corpora
from pprint import pprint
from gensim.utils import simple_preprocess
from smart_open import smart_open
import os
Next, the following lines of code will read the documents from doc.txt and tokenise them −
doc_tokenized = [
   simple_preprocess(line, deacc=True) for line in open('doc.txt', encoding='utf-8')
]
dictionary = corpora.Dictionary()
Now we need to pass these tokenised words into the dictionary.doc2bow() method (as we did in the previous example) −
BoW_corpus = [
   dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized
]
print(BoW_corpus)
Output
[ [(9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)], [ (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1) ], [ (23, 2), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1) ], [(3, 1), (18, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1)], [ (18, 1), (27, 1), (31, 2), (32, 1), (38, 1), (41, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1) ] ]
The doc.txt file has the following content −
CNTK formerly known as Computational Network Toolkit is a free easy-to-use open-source commercial-grade toolkit that enable us to train deep learning algorithms to learn like the human brain.
You can find its free tutorial on tutorialspoint.com also provide best technical tutorials on technologies like AI deep learning machine learning for free.
Complete Implementation Example
import gensim
from gensim import corpora
from pprint import pprint
from gensim.utils import simple_preprocess
from smart_open import smart_open
import os

doc_tokenized = [
   simple_preprocess(line, deacc=True) for line in open('doc.txt', encoding='utf-8')
]
dictionary = corpora.Dictionary()
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]
print(BoW_corpus)
Saving and Loading a Gensim Corpus
We can save the corpus with the help of the following script −
corpora.MmCorpus.serialize('/Users/Desktop/BoW_corpus.mm', BoW_corpus)
# Provide the path and the name of the corpus. The corpus is named BoW_corpus and we saved it in Matrix Market format.
Similarly, we can load the saved corpus by using the following script −
corpus_load = corpora.MmCorpus('/Users/Desktop/BoW_corpus.mm')
for line in corpus_load:
   print(line)
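The dictionary is needed to map the ids in a reloaded corpus back to words, so it is usually saved alongside the corpus. A minimal sketch, assuming the dictionary built above and an illustrative file path −
# Save the dictionary next to the corpus and load it back later
dictionary.save('/Users/Desktop/BoW_corpus.mm.dict')
dictionary_load = corpora.Dictionary.load('/Users/Desktop/BoW_corpus.mm.dict')
print(dictionary_load.token2id)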