Gensim - Creating LDA Topic Model
This chapter explains how to create a Latent Dirichlet Allocation (LDA) topic model in Gensim.
Automatically extracting information about topics from large volumes of text is one of the primary applications of NLP (natural language processing). Such text could come from hotel reviews, tweets, Facebook posts, feeds from any other social media channel, movie reviews, news stories, user feedback, e-mails, etc.
In this digital era, knowing what people and customers are talking about, and understanding their opinions and problems, can be highly valuable for businesses, political campaigns and administrators. But is it possible to manually read through such large volumes of text and extract the topics from them?
No, it’s not. It requires an automatic algorithm that can read through these large volumes of text documents and extract the topics discussed in them.
Role of LDA
LDA’s approach to topic modeling is to classify the text in a document into particular topics. Modeled with Dirichlet distributions, LDA builds −
A topics-per-document model, and
A words-per-topic model
Once the documents are provided to the LDA algorithm, it rearranges the following in order to obtain a good composition of the topic-keyword distribution −
The topic distribution within each document, and
The keyword distribution within each topic
While processing, some of the assumptions made by LDA are −
Every document is modeled as a multinomial distribution over topics.
Every topic is modeled as a multinomial distribution over words.
We have to choose the right corpus of data because LDA assumes that each chunk of text contains related words.
LDA also assumes that the documents are produced from a mixture of topics.
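The assumptions above can be sketched as a small generative process. This is only an illustration written with the Python standard library, not Gensim's implementation; the toy vocabulary, the two topics and the Dirichlet parameter 0.5 are all assumptions chosen for the example −

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document under LDA's assumptions."""
    # Each document gets its own mixture over topics.
    doc_topics = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(range(len(doc_topics)), weights=doc_topics)[0]
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return doc_topics, words

rng = random.Random(100)
vocab = ["engine", "car", "disk", "cpu"]  # toy vocabulary (assumption)
# Two hypothetical topics, each a distribution over the vocabulary.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(2)]
doc_topics, words = generate_document(topic_word, vocab, [0.5, 0.5], 8, rng)
print(doc_topics)  # the document's mixture of topics
print(words)       # words drawn from that mixture
```

Training an LDA model works in the opposite direction: given only the words, it estimates the topic mixtures and the word distributions that could have produced them.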
Implementation with Gensim
Here, we are going to use LDA (Latent Dirichlet Allocation) to extract the naturally discussed topics from a dataset.
Loading Data Set
The dataset we are going to use is ‘20 Newsgroups’, which contains thousands of newsgroup posts on 20 different topics. It is available through the scikit-learn datasets module. We can download it with the following Python script −
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
Let’s look at a few sample posts with the following script −
newsgroups_train.data[:4]
["From: lerxst@wam.umd.edu (where's my thing) Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15 I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks, - IL ---- brought to you by your neighborhood Lerxst ---- ",
"From: guykuo@carson.u.washington.edu (Guy Kuo) Subject: SI Clock Poll - Final Call Summary: Final call for SI clock reports Keywords: SI,acceleration,clock,upgrade Article-I.D.: shelley.1qvfo9INNc3s Organization: University of Washington Lines: 11 NNTP-Posting-Host: carson.u.washington.edu A fair number of brave souls who upgraded their SI clock oscillator have shared their experiences for this poll. Please send a brief message detailing your experiences with the procedure. Top speed attained, CPU rated speed, add on cards and adapters, heat sinks, hour of usage per day, floppy disk functionality with 800 and 1.4 m floppies are especially requested. I will be summarizing in the next two days, so please add to the network knowledge base if you have done the clock upgrade and haven't answered this poll. Thanks. Guy Kuo <guykuo@u.washington.edu> ",
"From: twillis@ec.ecn.purdue.edu (Thomas E Willis) Subject: PB questions... Organization: Purdue University Engineering Computer Network Distribution: usa Lines: 36 well folks, my mac plus finally gave up the ghost this weekend after starting life as a 512k way back in 1985. sooo, i'm in the market for a new machine a bit sooner than i intended to be... i'm looking into picking up a powerbook 160 or maybe 180 and have a bunch of questions that (hopefully) somebody can answer: * does anybody know any dirt on when the next round of powerbook introductions are expected? i'd heard the 185c was supposed to make an appearence \"this summer\" but haven't heard anymore on it - and since i don't have access to macleak, i was wondering if anybody out there had more info... * has anybody heard rumors about price drops to the powerbook line like the ones the duo's just went through recently? * what's the impression of the display on the 180? i could probably swing a 180 if i got the 80Mb disk rather than the 120, but i don't really have a feel for how much \"better\" the display is (yea, it looks great in the store, but is that all \"wow\" or is it really that good?). could i solicit some opinions of people who use the 160 and 180 day-to-day on if its worth taking the disk size and money hit to get the active display? (i realize this is a real subjective question, but i've only played around with the machines in a computer store breifly and figured the opinions of somebody who actually uses the machine daily might prove helpful). * how well does hellcats perform? ;) thanks a bunch in advance for any info - if you could email, i'll post a summary (news reading time is at a premium with finals just around the corner... : ( ) -- Tom Willis \ twillis@ecn.purdue.edu \ Purdue Electrical Engineering --------------------------------------------------------------------------- \"Convictions are more dangerous enemies of truth than lies.\" - F. W. Nietzsche ",
"From: jgreen@amber (Joe Green) Subject: Re: Weitek P9000 ? Organization: Harris Computer Systems Division Lines: 14 Distribution: world NNTP-Posting-Host: amber.ssd.csd.harris.com X-Newsreader: TIN [version 1.1 PL9] Robert J.C. Kyanko (rob@rjck.UUCP) wrote: >abraxis@iastate.edu writes in article <abraxis.734340159@class1.iastate.edu >: > > Anyone know about the Weitek P9000 graphics chip? > As far as the low-level stuff goes, it looks pretty nice. It's got this > quadrilateral fill command that requires just the four points. Do you have Weitek's address/phone number? I'd like to get some information about this chip. -- Joe Green Harris Corporation jgreen@csd.harris.com Computer Systems Division \"The only thing that really scares me is a person with no sense of humor.\" -- Jonathan Winters "]
Prerequisite
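We need the stop words list from NLTK and the English model from spaCy. Both can be loaded as follows −
import nltk
import spacy
nltk.download('stopwords')
# The spaCy model must be installed first: python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])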
Importing Necessary Packages
In order to build the LDA model, we need to import the following packages −
import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim  # renamed to pyLDAvis.gensim_models in pyLDAvis 3.x
import matplotlib.pyplot as plt
Preparing Stopwords
Now, we need to import the stop words and extend them with a few words that are common in newsgroup posts −
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
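Because stopwords.words() returns a list, every membership check against it scans the whole list. Converting the list to a set is one possible speed-up for large corpora. A minimal sketch, using a small hypothetical stop word list in place of NLTK's −

```python
# Hypothetical stand-in for stopwords.words('english'); NLTK's real list is longer.
stop_words = ["the", "is", "a", "of"]
stop_words.extend(["from", "subject", "re", "edu", "use"])

# A set gives constant-time membership tests on average.
stop_word_set = set(stop_words)

tokens = ["from", "the", "engine", "of", "a", "car"]
kept = [t for t in tokens if t not in stop_word_set]
print(kept)  # ['engine', 'car']
```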
Clean up the Text
Now, with the help of Gensim’s simple_preprocess() we need to tokenise each document into a list of words. This also removes punctuation and other unnecessary characters. In order to do this, we will create a function named sent_to_words() −
def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True also strips accent marks from the tokens
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))
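The clean-up idea can be illustrated with the standard library alone. In the sketch below, clean_text() strips e-mail addresses, collapses whitespace and removes single quotes, while rough_tokenize() is a simplified stand-in written for this example and not Gensim's simple_preprocess(); both function names are assumptions −

```python
import re

def clean_text(text):
    """Remove e-mail addresses, collapse whitespace, and drop single quotes."""
    text = re.sub(r"\S*@\S*\s?", "", text)  # e-mail addresses
    text = re.sub(r"\s+", " ", text)        # newlines and repeated spaces
    text = re.sub(r"'", "", text)           # single quotes
    return text.strip()

def rough_tokenize(text, min_len=2, max_len=15):
    """Lowercase alphabetic tokens only; a simplified stand-in for simple_preprocess()."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

sample = "From: lerxst@wam.umd.edu\nSubject: WHAT car is this!?\n\nIt's a 2-door sports car."
cleaned = clean_text(sample)
tokens = rough_tokenize(cleaned)
print(cleaned)  # From: Subject: WHAT car is this!? Its a 2-door sports car.
print(tokens)
```

Note that very short tokens such as "a" are dropped, which mirrors the default minimum token length of 2 in simple_preprocess().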
Building Bigram & Trigram Models
Bigrams are two words that frequently occur together in the documents, and trigrams are three words that frequently occur together. With the help of Gensim’s Phrases model, we can detect both −
# Higher threshold means fewer phrases are formed.
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
# Phraser is a lighter, faster wrapper around a trained Phrases model.
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
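To give a feel for how min_count and threshold interact, here is a plain-Python sketch of a phrase score of the form (count(a, b) - min_count) * vocabulary size / (count(a) * count(b)), which, to our understanding, is the idea behind Gensim's default scoring. It is a simplification (the vocabulary size here counts single words only) and the toy documents are assumptions −

```python
from collections import Counter

def bigram_scores(docs, min_count=1):
    """Score adjacent pairs: (count(a,b) - min_count) * vocab_size / (count(a) * count(b))."""
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }

docs = [
    ["new", "york", "is", "big"],
    ["new", "york", "has", "parks"],
    ["big", "parks", "in", "new", "york"],
]
scores = bigram_scores(docs, min_count=1)
best = max(scores, key=scores.get)
print(best, scores[best])  # the pair that co-occurs far more often than chance
```

A pair is joined into a single token only when its score exceeds the threshold, so a high value such as 100 keeps only strongly associated pairs.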
Filter out Stopwords
Next, we need to filter out the stop words. Along with that, we will also create functions to make bigrams and trigrams, and to lemmatise the text −
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    # Keep only the lemmas of tokens whose part of speech is allowed.
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
Building Dictionary & Corpus for Topic Model
We now need to build the dictionary & corpus. We did it in the previous examples as well −
id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
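What the dictionary and the bag-of-words corpus contain can be sketched in plain Python. This is not Gensim's implementation (the ids Gensim assigns may differ), and build_vocabulary() and to_bow() are hypothetical helper names used only for this illustration −

```python
from collections import Counter

def build_vocabulary(texts):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(text, token2id):
    """Return sorted (token_id, count) pairs for tokens present in the vocabulary."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

texts = [["car", "engine", "car"], ["disk", "cpu", "disk", "car"]]
token2id = build_vocabulary(texts)
corpus = [to_bow(text, token2id) for text in texts]
print(token2id)  # {'car': 0, 'engine': 1, 'disk': 2, 'cpu': 3}
print(corpus)    # [[(0, 2), (1, 1)], [(0, 1), (2, 2), (3, 1)]]
```

Each document becomes a list of (word id, frequency) pairs, so word order is discarded and only counts remain.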
Building LDA Topic Model
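We have already implemented everything that is required to train the LDA model. Now, it is time to build the LDA topic model. For our implementation example, it can be done with the following lines of code −
lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=20,        # number of topics to extract
    random_state=100,     # fixed seed for reproducible results
    update_every=1,       # update the model after every chunk
    chunksize=100,        # number of documents per training chunk
    passes=10,            # number of passes over the whole corpus
    alpha='auto',         # learn the document-topic prior from the data
    per_word_topics=True
)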
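The interplay of chunksize, update_every and passes can be made concrete with some arithmetic. The sketch below assumes single-worker online training in which the model is updated once every update_every chunks; updates_per_run() is a hypothetical helper, and the figure of 11314 documents is the size of the 20 Newsgroups training split before any filtering −

```python
import math

def updates_per_run(num_docs, chunksize, update_every, passes):
    """Estimate the number of model updates (assumes update_every >= 1, one worker)."""
    docs_per_update = min(num_docs, chunksize * update_every)
    updates_per_pass = math.ceil(num_docs / docs_per_update)
    return updates_per_pass, updates_per_pass * passes

per_pass, total = updates_per_run(num_docs=11314, chunksize=100, update_every=1, passes=10)
print(per_pass, total)  # 114 1140
```

Larger chunks mean fewer but better-informed updates, while more passes improve convergence at the cost of training time.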
Implementation Example
Let’s see the complete implementation example to build the LDA topic model −
import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim  # renamed to pyLDAvis.gensim_models in pyLDAvis 3.x
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from sklearn.datasets import fetch_20newsgroups

stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]  # remove e-mail addresses
data = [re.sub(r'\s+', ' ', sent) for sent in data]  # collapse whitespace
data = [re.sub(r"'", '', sent) for sent in data]  # remove single quotes

def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))
print(data_words[:4])  # the tokenized documents, before stop word removal

bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
print(data_lemmatized[:4])  # the lemmatized data

id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus[:4])  # the corpus we created above
# the words with their frequencies
pprint([[(id2word[token_id], freq) for token_id, freq in cp] for cp in corpus[:4]])

lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=20,
    random_state=100,
    update_every=1,
    chunksize=100,
    passes=10,
    alpha='auto',
    per_word_topics=True
)
We can now use the LDA model created above to get the topics and to compute the model perplexity.
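When reading perplexity figures, note that Gensim's lda_model.log_perplexity(corpus) returns a per-word likelihood bound rather than the perplexity itself. To our understanding, the perplexity estimate Gensim logs is 2 raised to the negative of that bound. A minimal sketch of the conversion, with a made-up bound value used purely for illustration −

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) into a perplexity estimate."""
    return 2 ** (-per_word_bound)

# Illustrative value only, not a result from the model trained above.
bound = -8.0
print(perplexity_from_bound(bound))  # 256.0
```

A bound closer to zero gives a lower perplexity, which generally indicates a better fit to the evaluated documents.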