Gensim - Creating LSI & HDP Topic Model
  • Date: 2024-09-17


This chapter covers creating Latent Semantic Indexing (LSI) and Hierarchical Dirichlet Process (HDP) topic models with Gensim.

The topic modeling algorithm that was first implemented in Gensim, along with Latent Dirichlet Allocation (LDA), is Latent Semantic Indexing (LSI). It is also called Latent Semantic Analysis (LSA). It was patented in 1988 by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum, and Lynn Streeter.

In this section we set up our LSI model. This is done in the same way as setting up an LDA model. We need to import the LSI model from gensim.models.

Role of LSI

LSI is a technique in natural language processing (NLP), especially in distributional semantics. It analyses the relationship between a set of documents and the terms these documents contain. To do so, it constructs a matrix of word counts per document from a large piece of text.

Once the matrix is constructed, the LSI model uses a mathematical technique called singular value decomposition (SVD) to reduce the number of rows while preserving the similarity structure among the columns.

In the matrix, the rows represent unique words and the columns represent documents. LSI is based on the distributional hypothesis, i.e. it assumes that words that are close in meaning will occur in similar kinds of text.
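The idea described above can be sketched with plain NumPy. This is only an illustration on a made-up term-document matrix (the words, counts, and the choice of k = 2 are assumptions for the example), not how Gensim implements LSI internally:

```python
import numpy as np

# Toy term-document matrix: rows are words, columns are documents d0..d3.
# d0 and d1 are about cars, d2 and d3 are about computers.
X = np.array([
    [2, 1, 0, 0],   # car
    [1, 2, 0, 0],   # engine
    [1, 1, 0, 0],   # wheel
    [0, 0, 3, 1],   # disk
    [0, 0, 1, 3],   # cpu
    [0, 0, 1, 1],   # memory
], dtype=float)

# Singular value decomposition: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent dimensions ("topics") to keep
# Each document (column of X) becomes a k-dimensional vector.
doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T   # shape: (n_docs, k)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(doc_vectors.shape)
print(cosine(doc_vectors[0], doc_vectors[1]))  # same subject: close to 1
print(cosine(doc_vectors[0], doc_vectors[2]))  # different subject: close to 0
```

Six word rows are reduced to two latent dimensions, yet documents on the same subject remain similar to each other in the reduced space.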

Implementation with Gensim

Here, we use LSI (Latent Semantic Indexing) to extract the naturally discussed topics from the dataset.

Loading Data Set

The dataset we are going to use is the '20 Newsgroups' dataset, which contains thousands of newsgroup posts spread across 20 topics. It is available among the scikit-learn datasets. We can download it with the following Python script −


from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

Let's look at some sample posts with the help of the following script −


newsgroups_train.data[:4]
["From: lerxst@wam.umd.edu (where's my thing)
Subject: 
WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: 
University of Maryland, College Park
Lines: 15

 
I was wondering if anyone out there could enlighten me on this car 
I saw
the other day. It was a 2-door sports car,
looked to be from the late 60s/
early 70s. It was called a Bricklin. 
The doors were really small. In addition,
the front bumper was separate from 
the rest of the body. This is 
all I know. If anyone can tellme a model name, 
engine specs, years
of production, where this car is made, history, or 
whatever info you
have on this funky looking car, 
please e-mail.

Thanks,
- IL
 ---- brought to you by your neighborhood 
Lerxst ----




",

"From: guykuo@carson.u.washington.edu (Guy Kuo)
Subject: 
SI Clock Poll - Final Call
Summary: Final call for SI clock reports
Keywords: 
SI,acceleration,clock,upgrade
Article-I.D.: shelley.1qvfo9INNc3s
Organization: 
University of Washington
Lines: 11
NNTP-Posting-Host: carson.u.washington.edu

A 
fair number of brave souls who upgraded their SI clock oscillator have
shared their 
experiences for this poll. Please send a brief message detailing
your experiences with 
the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat 
sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies 
are especially requested.

I will be summarizing in the next two days, so please add 
to the network
knowledge base if you have done the clock upgrade and haven't answered 
this
poll. Thanks.

Guy Kuo <guykuo@u.washington.edu>
",

"From: twillis@ec.ecn.purdue.edu (Thomas E Willis)
Subject: 
PB questions...
Organization: Purdue University Engineering Computer 
Network
Distribution: usa
Lines: 36

well folks, my mac plus finally gave up the 
ghost this weekend after
starting life as a 512k way back in 1985. sooo, i'm in the 
market for a
new machine a bit sooner than i intended to be...

i'm looking into 
picking up a powerbook 160 or maybe 180 and have a bunch
of questions that (hopefully) 
somebody can answer:

* does anybody know any dirt on when the next round of 
powerbook
introductions are expected? i'd heard the 185c was supposed to make 
an
appearence "this summer" but haven't heard anymore on it - and since i
don't 
have access to macleak, i was wondering if anybody out there had
more info...

* has 
anybody heard rumors about price drops to the powerbook line like the
ones the duo's 
just went through recently?

* what's the impression of the display on the 180? i 
could probably swing
a 180 if i got the 80Mb disk rather than the 120, but i don't 
really have
a feel for how much "better" the display is (yea, it looks great in 
the
store, but is that all "wow" or is it really that good?). could i solicit
some 
opinions of people who use the 160 and 180 day-to-day on if its worth
taking the disk 
size and money hit to get the active display? (i realize
this is a real subjective 
question, but i've only played around with the
machines in a computer store breifly 
and figured the opinions of somebody
who actually uses the machine daily might prove 
helpful).

* how well does hellcats perform? ;)

thanks a bunch in advance for any 
info - if you could email, i'll post a
summary (news reading time is at a premium 
with finals just around the
corner... :( )
--
Tom Willis \ twillis@ecn.purdue.edu 
\ Purdue Electrical 
Engineering
---------------------------------------------------------------------------
"Convictions are more dangerous enemies of truth than lies." - F. W.
Nietzsche
",

"From: jgreen@amber (Joe Green)
Subject: Re: Weitek P9000 ?
Organization: Harris 
Computer Systems Division
Lines: 14
Distribution: world
NNTP-Posting-Host: 
amber.ssd.csd.harris.com
X-Newsreader: TIN [version 1.1 PL9]

Robert J.C. Kyanko 
(rob@rjck.UUCP) wrote:
 > abraxis@iastate.edu writes in article <
abraxis.734340159@class1.iastate.edu>:
> > Anyone know about the Weitek P9000 
graphics chip?
 > As far as the low-level stuff goes, it looks pretty nice. It's 
got this
 > quadrilateral fill command that requires just the four
points.

Do you have Weitek's address/phone number? I'd like to get some 
information
about this chip.

--
Joe Green				Harris 
Corporation
jgreen@csd.harris.com			Computer Systems Division
"The only thing that 
really scares me is a person with no sense of humor."
						-- Jonathan 
Winters
"]

Prerequisite

We need the stopwords from NLTK and the English model from spaCy. Both can be loaded as follows −


import nltk
import spacy
nltk.download('stopwords')
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])

Importing Necessary Packages

In order to build the LSI model, we need to import the following packages −


import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import matplotlib.pyplot as plt

Preparing Stopwords

Now we need to import the Stopwords and use them −


from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

Clean Up the Text

Now, with the help of Gensim's simple_preprocess(), we tokenise each sentence into a list of words. This also removes punctuation and unnecessary characters. To do this, we create a function named sent_to_words() −


def sent_to_words(sentences):
   for sentence in sentences:
      yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
data_words = list(sent_to_words(data))
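To make the effect of this step concrete, here is a rough standard-library approximation of the tokenisation. The helpers `deaccent` and `rough_preprocess` are hypothetical stand-ins written for this illustration; Gensim's own tokeniser may differ in detail, and the length bounds of 2 and 15 are assumed to mirror its defaults:

```python
import re
import unicodedata

def deaccent(text):
    """Strip accent marks, e.g. 'café' becomes 'cafe'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, drop accents, keep alphabetic tokens within the length bounds."""
    text = deaccent(text.lower())
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    return [t for t in tokens if min_len <= len(t) <= max_len]

sample = "I was wondering: WHAT car is this!? A 2-door café racer."
print(rough_preprocess(sample))
# ['was', 'wondering', 'what', 'car', 'is', 'this', 'door', 'cafe', 'racer']
```

Punctuation, digits, and one-letter tokens disappear, which is why the tokenised posts look much cleaner than the raw text.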

Building Bigram & Trigram Models

Bigrams are two words that frequently occur together in a document, and trigrams are three words that frequently occur together. With the help of Gensim's Phrases model, we can detect both −


bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
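The intuition behind min_count and threshold can be sketched in pure Python. The scoring formula below is, to the best of my understanding, the collocation score of Mikolov et al. that Phrases uses by default; treat that as an assumption. The toy sentences and the `phrase_score` helper are invented for this illustration:

```python
from collections import Counter

sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "city"],
    ["a", "new", "car"],
    ["big", "city"],
]

# Count single words and adjacent word pairs across all sentences.
unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
vocab_size = len(unigrams)

def phrase_score(a, b, min_count=1):
    """High when a and b co-occur more often than their
    individual frequencies would suggest."""
    return (bigrams[(a, b)] - min_count) / unigrams[a] / unigrams[b] * vocab_size

print(phrase_score("new", "york"))  # 1.5
print(phrase_score("new", "car"))   # 0.0
```

A pair is joined into a single token such as new_york only when its score exceeds the threshold, so a higher threshold yields fewer phrases.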

Filter out Stopwords

Next, we need to filter out the stopwords. Along with that, we also create functions to make bigrams and trigrams and to perform lemmatisation −


def remove_stopwords(texts):
   return [[word for word in simple_preprocess(str(doc)) 
   if word not in stop_words] for doc in texts]
def make_bigrams(texts):
   return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
   return [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
   texts_out = []
   for sent in texts:
      doc = nlp(" ".join(sent))
      texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
   return texts_out

Building Dictionary & Corpus for Topic Model

We now need to build the dictionary & corpus. We did it in the previous examples as well −


id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
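To show what the resulting bag-of-words corpus looks like, here is a minimal pure-Python sketch of the same idea. The names `token2id` and `to_bow` and the toy texts are hypothetical, and the order in which ids are assigned is an arbitrary choice here that need not match Gensim's Dictionary:

```python
from collections import Counter

toy_texts = [
    ["car", "engine", "car", "door"],
    ["disk", "cpu", "disk"],
    ["car", "cpu"],
]

# Assign an integer id to each unique token, in order of first appearance.
token2id = {}
for text in toy_texts:
    for token in text:
        token2id.setdefault(token, len(token2id))

def to_bow(text):
    """Return sorted (token_id, count) pairs for one tokenised document."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

toy_corpus = [to_bow(text) for text in toy_texts]
print(token2id)    # {'car': 0, 'engine': 1, 'door': 2, 'disk': 3, 'cpu': 4}
print(toy_corpus)  # [[(0, 2), (1, 1), (2, 1)], [(3, 2), (4, 1)], [(0, 1), (4, 1)]]
```

Each document is reduced to word ids and their counts; word order is discarded, which is all a topic model of this kind needs.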

Building LSI Topic Model

We have now implemented everything required to train the LSI model, so it is time to build it. For our example, this can be done with the following lines of code −


lsi_model = gensim.models.lsimodel.LsiModel(
   corpus=corpus, id2word=id2word, num_topics=20,chunksize=100
)

Implementation Example

Let's see the complete implementation example to build the LSI topic model −


import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data] #remove e-mail addresses
data = [re.sub(r'\s+', ' ', sent) for sent in data] #collapse runs of whitespace
data = [re.sub(r"'", "", sent) for sent in data] #remove single quotes
def sent_to_words(sentences):
   for sentence in sentences:
      yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
data_words = list(sent_to_words(data))
print(data_words[:4]) #it will print the tokenised data before stopword removal
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
def remove_stopwords(texts):
   return [[word for word in simple_preprocess(str(doc)) 
   if word not in stop_words] for doc in texts]
def make_bigrams(texts):
   return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
   return [trigram_mod[bigram_mod[doc]] for doc in texts]
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
   texts_out = []
   for sent in texts:
      doc = nlp(" ".join(sent))
      texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
   return texts_out
data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
data_lemmatized = lemmatization(
   data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']
)
print(data_lemmatized[:4]) #it will print the lemmatized data.
id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus[:4]) #it will print the corpus we created above.
pprint([[(id2word[id], freq) for id, freq in cp] for cp in corpus[:4]])
#it will print the words with their frequencies.
lsi_model = gensim.models.lsimodel.LsiModel(
   corpus=corpus, id2word=id2word, num_topics=20,chunksize=100
)

We can now use the above created LSI model to get the topics.

Viewing Topics in LSI Model

The LSI model (lsi_model) we have created above can be used to view the topics from the documents. It can be done with the help of following script −


pprint(lsi_model.print_topics())
doc_lsi = lsi_model[corpus]

Output


[
   (0,
    '1.000*"ax" + 0.001*"_" + 0.000*"tm" + 0.000*"part" + 0.000*"line" + '
    '0.000*"biz" + 0.000*"mbs" + 0.000*"end" + 0.000*"fax" + 0.000*"mb"'),
   (1,
    '0.239*"say" + 0.222*"file" + 0.189*"go" + 0.171*"know" + 0.169*"people" + '
    '0.147*"make" + 0.140*"use" + 0.135*"also" + 0.133*"see" + 0.123*"think"')
]

Hierarchical Dirichlet Process (HDP)

Topic models such as LDA and LSI help in summarising and organising large archives of text that cannot be analysed by hand. Apart from LDA and LSI, another powerful topic model in Gensim is HDP (Hierarchical Dirichlet Process). It is a mixed-membership model for unsupervised analysis of grouped data. Unlike LDA (its finite counterpart), HDP infers the number of topics from the data.
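One way to build intuition for how a Dirichlet process lets the number of groups grow with the data is the Chinese restaurant process. The simulation below is only a hedged illustration of that idea, with a hypothetical helper name and an assumed concentration parameter alpha = 1.0; it is not the inference algorithm Gensim's HdpModel runs:

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Seat customers one by one and return the list of table sizes."""
    rng = random.Random(seed)
    tables = []
    for n in range(n_customers):
        # Open a new table with probability alpha / (n + alpha); otherwise
        # join an existing table with probability proportional to its size.
        r = rng.uniform(0, n + alpha)
        if r < alpha or not tables:
            tables.append(1)
            continue
        r -= alpha
        for i, size in enumerate(tables):
            if r < size:
                tables[i] += 1
                break
            r -= size
        else:
            tables[-1] += 1  # guard against floating-point edge cases
    return tables

for n in (10, 100, 1000):
    sizes = chinese_restaurant_process(n, alpha=1.0)
    print(n, "customers ->", len(sizes), "tables")
```

The number of tables is never fixed in advance; it is an outcome of the process, in the same spirit as HDP not requiring a preset number of topics.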

Implementation With Gensim

To implement HDP in Gensim, we need to train the HDP topic model on a corpus and a dictionary (as we did in the examples above for the LDA and LSI topic models). The model is available as gensim.models.HdpModel. Here too we apply the HDP topic model to the 20 Newsgroups data, and the steps are the same.

Using our corpus and dictionary (created in the examples above for the LSI and LDA models), we can create the HdpModel as follows −


Hdp_model = gensim.models.hdpmodel.HdpModel(corpus=corpus, id2word=id2word)

Viewing Topics in HDP Model

The HDP model (Hdp_model) can be used to view the topics from the documents. It can be done with the help of following script −


pprint(Hdp_model.print_topics())

Output


[
   (0,
    '0.009*line + 0.009*write + 0.006*say + 0.006*article + 0.006*know + '
    '0.006*people + 0.005*make + 0.005*go + 0.005*think + 0.005*be'),
   (1,
    '0.016*line + 0.011*write + 0.008*article + 0.008*organization + 0.006*know '
    '+ 0.006*host + 0.006*be + 0.005*get + 0.005*use + 0.005*say'),
   (2,
    '0.810*ax + 0.001*_ + 0.000*tm + 0.000*part + 0.000*mb + 0.000*line + '
    '0.000*biz + 0.000*end + 0.000*wwiz + 0.000*fax'),
   (3,
    '0.015*line + 0.008*write + 0.007*organization + 0.006*host + 0.006*know + '
    '0.006*article + 0.005*use + 0.005*thank + 0.004*get + 0.004*problem'),
   (4,
    '0.004*line + 0.003*write + 0.002*believe + 0.002*think + 0.002*article + '
    '0.002*belief + 0.002*say + 0.002*see + 0.002*look + 0.002*organization'),
   (5,
    '0.005*line + 0.003*write + 0.003*organization + 0.002*article + 0.002*time '
    '+ 0.002*host + 0.002*get + 0.002*look + 0.002*say + 0.001*number'),
   (6,
    '0.003*line + 0.002*say + 0.002*write + 0.002*go + 0.002*gun + 0.002*get + '
    '0.002*organization + 0.002*bill + 0.002*article + 0.002*state'),
   (7,
    '0.003*line + 0.002*write + 0.002*article + 0.002*organization + 0.001*none '
    '+ 0.001*know + 0.001*say + 0.001*people + 0.001*host + 0.001*new'),
   (8,
    '0.004*line + 0.002*write + 0.002*get + 0.002*team + 0.002*organization + '
    '0.002*go + 0.002*think + 0.002*know + 0.002*article + 0.001*well'),
   (9,
    '0.004*line + 0.002*organization + 0.002*write + 0.001*be + 0.001*host + '
    '0.001*article + 0.001*thank + 0.001*use + 0.001*work + 0.001*run'),
   (10,
    '0.002*line + 0.001*game + 0.001*write + 0.001*get + 0.001*know + '
    '0.001*thing + 0.001*think + 0.001*article + 0.001*help + 0.001*turn'),
   (11,
    '0.002*line + 0.001*write + 0.001*game + 0.001*organization + 0.001*say + '
    '0.001*host + 0.001*give + 0.001*run + 0.001*article + 0.001*get'),
   (12,
    '0.002*line + 0.001*write + 0.001*know + 0.001*time + 0.001*article + '
    '0.001*get + 0.001*think + 0.001*organization + 0.001*scope + 0.001*make'),
   (13,
    '0.002*line + 0.002*write + 0.001*article + 0.001*organization + 0.001*make '
    '+ 0.001*know + 0.001*see + 0.001*get + 0.001*host + 0.001*really'),
   (14,
    '0.002*write + 0.002*line + 0.002*know + 0.001*think + 0.001*say + '
    '0.001*article + 0.001*argument + 0.001*even + 0.001*card + 0.001*be'),
   (15,
    '0.001*article + 0.001*line + 0.001*make + 0.001*write + 0.001*know + '
    '0.001*say + 0.001*exist + 0.001*get + 0.001*purpose + 0.001*organization'),
   (16,
    '0.002*line + 0.001*write + 0.001*article + 0.001*insurance + 0.001*go + '
    '0.001*be + 0.001*host + 0.001*say + 0.001*organization + 0.001*part'),
   (17,
    '0.001*line + 0.001*get + 0.001*hit + 0.001*go + 0.001*write + 0.001*say + '
    '0.001*know + 0.001*drug + 0.001*see + 0.001*need'),
   (18,
    '0.002*option + 0.001*line + 0.001*flight + 0.001*power + 0.001*software + '
    '0.001*write + 0.001*add + 0.001*people + 0.001*organization + 0.001*module'),
   (19,
    '0.001*shuttle + 0.001*line + 0.001*roll + 0.001*attitude + 0.001*maneuver + '
    '0.001*mission + 0.001*also + 0.001*orbit + 0.001*produce + 0.001*frequency')
]