Creating a bag of words (BoW) Corpus
  • Date: 2024-12-27


We have understood how to create a dictionary from a list of documents and from text files (from one file as well as from more than one). Now, in this section, we will create a bag-of-words (BoW) corpus. It is one of the most important objects we need to familiarise ourselves with in order to work with Gensim. Basically, it is the corpus that contains each word's id and its frequency in each document.

Creating a BoW Corpus

As discussed, in Gensim, the corpus contains each word's id and its frequency in every document. We can create a BoW corpus from a simple list of documents and from text files. What we need to do is pass the tokenised list of words to the Dictionary.doc2bow() method. So first, let's start by creating a BoW corpus using a simple list of documents.

From a Simple List of Sentences

In the following example, we will create a BoW corpus from a simple list containing three sentences.

First, we need to import all the necessary packages as follows −


import gensim
import pprint
from gensim import corpora
from gensim.utils import simple_preprocess

Now provide the list containing the sentences. We have three sentences in our list −


doc_list = [
   "Hello, how are you?", "How do you do?", 
   "Hey what are you doing? yes you What are you doing?"
]

Next, tokenise the sentences as follows −


doc_tokenized = [simple_preprocess(doc) for doc in doc_list]

Create a corpora.Dictionary() object as follows −


dictionary = corpora.Dictionary()

Now pass these tokenised sentences to the dictionary.doc2bow() method as follows −


BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]

At last, we can print the bag-of-words corpus −


print(BoW_corpus)

Output


[
   [(0, 1), (1, 1), (2, 1), (3, 1)], 
   [(2, 1), (3, 1), (4, 2)], [(0, 2), (3, 3), (5, 2), (6, 1), (7, 2), (8, 1)]
]

The above output shows that the word with id=0 appears once in the first document (because we have got (0, 1) in the output), and so on.

The above output is hardly readable for humans. We can convert these ids to words, but for this we need our dictionary to do the conversion, as follows −


id_words = [[(dictionary[id], count) for id, count in line] for line in BoW_corpus]
print(id_words)

Output


[
   [('are', 1), ('hello', 1), ('how', 1), ('you', 1)], 
   [('how', 1), ('you', 1), ('do', 2)], 
   [('are', 2), ('you', 3), ('doing', 2), ('hey', 1), ('what', 2), ('yes', 1)]
]

Now the above output is human-readable.

Complete Implementation Example


import gensim
import pprint
from gensim import corpora
from gensim.utils import simple_preprocess
doc_list = [
   "Hello, how are you?", "How do you do?", 
   "Hey what are you doing? yes you What are you doing?"
]
doc_tokenized = [simple_preprocess(doc) for doc in doc_list]
dictionary = corpora.Dictionary()
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]
print(BoW_corpus)
id_words = [[(dictionary[id], count) for id, count in line] for line in BoW_corpus]
print(id_words)

From a Text File

In the following example, we will create a BoW corpus from a text file. For this, we have saved the document used in the previous example in a text file named doc.txt.

Gensim will read the file line by line and process one line at a time using simple_preprocess. This way, it doesn't need to load the complete file into memory all at once.

Implementation Example

First, import the required and necessary packages as follows −


import gensim
from gensim import corpora
from pprint import pprint
from gensim.utils import simple_preprocess

Next, the following lines of code will read the documents from doc.txt and tokenise them −


doc_tokenized = [
   simple_preprocess(line, deacc=True) for line in open('doc.txt', encoding='utf-8')
]
dictionary = corpora.Dictionary()

Now we need to pass these tokenised words to the dictionary.doc2bow() method (as we did in the previous example) −


BoW_corpus = [
   dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized
]
print(BoW_corpus)

Output


[
   [(9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)], 
   [
      (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), 
      (22, 1), (23, 1), (24, 1)
   ], 
   [
      (23, 2), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), 
      (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1)
   ], 
   [(3, 1), (18, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1)], 
   [
      (18, 1), (27, 1), (31, 2), (32, 1), (38, 1), (41, 1), (43, 1), 
      (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1)
   ]
]

The doc.txt file has the following content −

CNTK formerly known as Computational Network Toolkit is a free easy-to-use open-source commercial-grade toolkit that enable us to train deep learning algorithms to learn like the human brain.

You can find its free tutorial on tutorialspoint.com also provide best technical tutorials on technologies like AI deep learning machine learning for free.

Complete Implementation Example


import gensim
from gensim import corpora
from pprint import pprint
from gensim.utils import simple_preprocess
doc_tokenized = [
   simple_preprocess(line, deacc=True) for line in open('doc.txt', encoding='utf-8')
]
dictionary = corpora.Dictionary()
BoW_corpus = [dictionary.doc2bow(doc, allow_update=True) for doc in doc_tokenized]
print(BoW_corpus)

Saving and Loading a Gensim Corpus

We can save the corpus with the help of the following script −


corpora.MmCorpus.serialize('/Users/Desktop/BoW_corpus.mm', BoW_corpus)

# Provide the path and the name of the corpus. The name of the corpus is BoW_corpus, and we saved it in Matrix Market format.

Similarly, we can load the saved corpus by using the following script −


corpus_load = corpora.MmCorpus('/Users/Desktop/BoW_corpus.mm')
for line in corpus_load:
   print(line)