Gensim - Documents & Corpus
Here, we shall learn about the core concepts of Gensim, with the main focus on documents and the corpus.
Core Concepts of Gensim
Following are the core concepts and terms that are needed to understand and use Gensim −
Document − It refers to some text.
Corpus − It refers to a collection of documents.
Vector − A mathematical representation of a document is called a vector.
Model − It refers to an algorithm used for transforming vectors from one representation to another.
What is Document?
As discussed above, a document refers to some text. More precisely, it is an object of the text sequence type, which is known as ‘str’ in Python 3. For example, in Gensim, a document can be anything such as −
Short tweet of 140 characters
Single paragraph, i.e. article or research paper abstract
News article
Book
Novel
Thesis
Text Sequence
A text sequence type is commonly known as ‘str’ in Python 3. In Python, textual data is handled with strings, or more specifically ‘str’ objects. Strings are basically immutable sequences of Unicode code points and can be written in the following ways −
Single quotes − For example, ‘Hi! How are you?’. It allows us to embed double quotes also. For example, ‘Hi! “How” are you?’
Double quotes − For example, "Hi! How are you?". It allows us to embed single quotes also. For example, "Hi! 'How' are you?"
Triple quotes − It can have either three single quotes like '''Hi! How are you?''' or three double quotes like """Hi! How are you?"""
All the whitespace is included in the string literal.
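The three quoting styles can be tried directly in the Python interpreter; the strings below are purely illustrative −

```python
# Single quotes can embed double quotes without escaping.
s1 = 'Hi! "How" are you?'

# Double quotes can embed single quotes without escaping.
s2 = "Hi! 'How' are you?"

# Triple quotes may span multiple lines; all whitespace is kept in the literal.
s3 = """Hi!
How are you?"""

print(s1, s2, s3, sep="\n")
```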
Example
Following is an example of a Document in Gensim −
Document = "Tutorialspoint.com is the biggest online tutorials library and it's all free also"
What is Corpus?
A corpus may be defined as a large and structured set of machine-readable texts produced in a natural communicative setting. In Gensim, a collection of document objects is called a corpus. The plural of corpus is corpora.
Role of Corpus in Gensim
A corpus in Gensim serves the following two roles −
Serves as Input for Training a Model
The very first and important role a corpus plays in Gensim is as an input for training a model. In order to initialize the model’s internal parameters, during training, the model looks for some common themes and topics in the training corpus. As discussed above, Gensim focuses on unsupervised models, hence it doesn’t require any kind of human intervention.
Serves as Topic Extractor
Once the model is trained, it can be used to extract topics from the new documents. Here, the new documents are the ones that are not used in the training phase.
Example
The corpus can include all the tweets by a particular person, a list of all the articles of a newspaper, or all the research papers on a particular topic, etc.
Collecting Corpus
Following is an example of a small corpus which contains 5 documents. Here, every document is a string consisting of a single sentence.
t_corpus = [
   "A survey of user opinion of computer system response time",
   "Relation of user perceived response time to error measurement",
   "The generation of random binary unordered trees",
   "The intersection graph of paths in trees",
   "Graph minors IV Widths of trees and well quasi ordering",
]
Preprocessing Collecting Corpus
Once we collect the corpus, a few preprocessing steps should be taken to keep the corpus simple. We can simply remove some commonly used English words like ‘the’. We can also remove words that occur only once in the corpus.
For example, the following Python script is used to lowercase each document, split it on white space and filter out the stop words −
Example
import pprint
t_corpus = [
   "A survey of user opinion of computer system response time",
   "Relation of user perceived response time to error measurement",
   "The generation of random binary unordered trees",
   "The intersection graph of paths in trees",
   "Graph minors IV Widths of trees and well quasi ordering",
]
stoplist = set('for a of the and to in'.split(' '))
processed_corpus = [
   [word for word in document.lower().split() if word not in stoplist]
   for document in t_corpus
]
pprint.pprint(processed_corpus)
Output
[['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering']]
Effective Preprocessing
Gensim also provides a function for more effective preprocessing of the corpus. With such preprocessing, we can convert a document into a list of lowercase tokens. We can also ignore tokens that are too short or too long. Such a function is gensim.utils.simple_preprocess(doc, deacc=False, min_len=2, max_len=15).
gensim.utils.simple_preprocess() function
Gensim provides this function to convert a document into a list of lowercase tokens and also to ignore tokens that are too short or too long. It has the following parameters −
doc(str)
It refers to the input document on which preprocessing should be applied.
deacc(bool, optional)
This parameter is used to remove the accent marks from tokens. It uses deaccent() to do this.
min_len(int, optional)
With the help of this parameter, we can set the minimum length of a token. Tokens shorter than the defined length will be discarded.
max_len(int, optional)
With the help of this parameter, we can set the maximum length of a token. Tokens longer than the defined length will be discarded.
The output of this function would be the tokens extracted from the input document.