Gensim - Creating LDA Mallet Model
This chapter explains what a Latent Dirichlet Allocation (LDA) Mallet model is and how to create one in Gensim.
In the previous section we implemented an LDA model and extracted topics from the documents of the 20Newsgroups dataset. That was Gensim's built-in version of the LDA algorithm. Gensim also provides a wrapper for Mallet's LDA, which often yields better-quality topics. Here, we are going to apply Mallet's LDA to the example we have already implemented.
What is LDA Mallet Model?
Mallet, an open-source toolkit, was written by Andrew McCallum. It is a Java-based package used for NLP, document classification, clustering, topic modeling, and many other machine learning applications to text. It provides the Mallet Topic Modeling toolkit, which contains efficient, sampling-based implementations of LDA as well as hierarchical LDA.
Mallet 2.0 is the current release of MALLET, the Java topic modeling toolkit. Before we start using it with Gensim for LDA, we must download the mallet-2.0.8.zip package to our system and unzip it. Once it is installed and unzipped, set the environment variable %MALLET_HOME% to point to the MALLET directory, either manually or with the code we will provide while implementing LDA with Mallet next.
Gensim Wrapper
Gensim provides a Python wrapper for Mallet's Latent Dirichlet Allocation (LDA). The class is gensim.models.wrappers.LdaMallet. This module, which uses collapsed Gibbs sampling from MALLET, allows estimation of an LDA model from a training corpus as well as inference of topic distributions on new, unseen documents.
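To build intuition for what "collapsed Gibbs sampling" means here, the following is a minimal, self-contained sketch of a collapsed Gibbs sampler for LDA. It is illustrative only: MALLET's implementation is far more optimized, and all names in this sketch are our own, not MALLET's or Gensim's.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    # Random initial topic assignment for every word token
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    ndk = [[0] * num_topics for _ in docs]               # doc-topic counts
    nkw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    nk = [0] * num_topics                                # total words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Conditional p(topic | everything else), up to a constant
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return nkw  # the topic-word counts define the learned topics
```

The key idea ("collapsed") is that the topic and document distributions are integrated out, so the sampler only tracks count tables and resamples each word's topic assignment in turn.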
Implementation Example
We will run LDA Mallet on the corpus of our previously built LDA model and check the difference in performance by calculating the coherence score.
Providing Path to Mallet File
Before applying the Mallet LDA model to the corpus built in the previous example, we must update the environment variables and provide the path to the Mallet file. It can be done with the help of the following code −
import os
from gensim.models.wrappers import LdaMallet

os.environ.update({'MALLET_HOME': r'C:/mallet-2.0.8/'})
# Update this path as per the location of the Mallet directory on your system.
mallet_path = r'C:/mallet-2.0.8/bin/mallet'
# Update this path as per the location of the Mallet directory on your system.
Once we have provided the path to the Mallet file, we can use it on the corpus. It can be done with the help of the ldamallet.show_topics() function as follows −
ldamallet = gensim.models.wrappers.LdaMallet(
   mallet_path, corpus=corpus, num_topics=20, id2word=id2word
)
pprint(ldamallet.show_topics(formatted=False))
Output
[
 (4, [('gun', 0.024546225966016102), ('law', 0.02181426826996709), ('state', 0.017633545129043606), ('people', 0.017612848479831116), ('case', 0.011341763768445888), ('crime', 0.010596684396796159), ('weapon', 0.00985160502514643), ('person', 0.008671896020034356), ('firearm', 0.00838214293105946), ('police', 0.008257963035784506)]),
 (9, [('make', 0.02147966482730431), ('people', 0.021377478029838543), ('work', 0.018557122419783363), ('money', 0.016676885346413244), ('year', 0.015982015123646026), ('job', 0.012221540976905783), ('pay', 0.010239117106069897), ('time', 0.008910688739014919), ('school', 0.0079092581238504), ('support', 0.007357449417535254)]),
 (14, [('power', 0.018428398507941996), ('line', 0.013784244460364121), ('high', 0.01183271164249895), ('work', 0.011560979224821522), ('ground', 0.010770484918850819), ('current', 0.010745781971789235), ('wire', 0.008399002000938712), ('low', 0.008053160742076529), ('water', 0.006966231071366814), ('run', 0.006892122230182061)]),
 (0, [('people', 0.025218349201353372), ('kill', 0.01500904870564167), ('child', 0.013612400660948935), ('armenian', 0.010307655991816822), ('woman', 0.010287984892595798), ('start', 0.01003226060272248), ('day', 0.00967818081674404), ('happen', 0.009383114328428673), ('leave', 0.009383114328428673), ('fire', 0.009009363443229208)]),
 (1, [('file', 0.030686386604212003), ('program', 0.02227713642901929), ('window', 0.01945561169918489), ('set', 0.015914874783314277), ('line', 0.013831003577619592), ('display', 0.013794120901412606), ('application', 0.012576992586582082), ('entry', 0.009275993066056873), ('change', 0.00872275292295209), ('color', 0.008612104894331132)]),
 (12, [('line', 0.07153810971508515), ('buy', 0.02975597944523662), ('organization', 0.026877236406682988), ('host', 0.025451316957679788), ('price', 0.025182275552207485), ('sell', 0.02461728860071565), ('mail', 0.02192687454599263), ('good', 0.018967419085797303), ('sale', 0.017998870026097017), ('send', 0.013694207538540181)]),
 (11, [('thing', 0.04901329901329901), ('good', 0.0376018876018876), ('make', 0.03393393393393394), ('time', 0.03326898326898327), ('bad', 0.02664092664092664), ('happen', 0.017696267696267698), ('hear', 0.015615615615615615), ('problem', 0.015465465465465466), ('back', 0.015143715143715144), ('lot', 0.01495066495066495)]),
 (18, [('space', 0.020626317374284855), ('launch', 0.00965716006366413), ('system', 0.008560244332602057), ('project', 0.008173097603991913), ('time', 0.008108573149223556), ('cost', 0.007764442723792318), ('year', 0.0076784101174345075), ('earth', 0.007484836753129436), ('base', 0.0067535595990880545), ('large', 0.006689035144319697)]),
 (5, [('government', 0.01918437232469453), ('people', 0.01461203206475212), ('state', 0.011207097828624796), ('country', 0.010214802708381975), ('israeli', 0.010039691804809714), ('war', 0.009436532025838587), ('force', 0.00858043427504086), ('attack', 0.008424780138532182), ('land', 0.0076659662230523775), ('world', 0.0075103120865437)]),
 (2, [('car', 0.041091194044470564), ('bike', 0.015598981291017729), ('ride', 0.011019688510138114), ('drive', 0.010627877363110981), ('engine', 0.009403467528651191), ('speed', 0.008081104907434616), ('turn', 0.007738270153785875), ('back', 0.007738270153785875), ('front', 0.007468899990204721), ('big', 0.007370947203447938)])
]
Evaluating Performance
Now we can also evaluate its performance by calculating the coherence score as follows (texts, i.e. data_lemmatized, and id2word come from the previous example) −
from gensim.models import CoherenceModel

coherence_model_ldamallet = CoherenceModel(
   model=ldamallet, texts=data_lemmatized, dictionary=id2word, coherence='c_v'
)
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)
Output
Coherence Score: 0.5842762900901401
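As a rough intuition for what a coherence score rewards, here is a toy, self-contained sketch in the spirit of the UMass coherence measure. This is not Gensim's c_v implementation; the function name and the tiny corpus are purely illustrative. A topic whose top words frequently co-occur in documents scores higher than one whose top words never appear together.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs, eps=1.0):
    """Toy UMass-style coherence: rewards topic words that co-occur in documents."""
    def doc_count(*words):
        # Number of documents containing all of the given words
        return sum(1 for d in docs if all(w in d for w in words))
    score = 0.0
    for wi, wj in combinations(top_words, 2):
        # log of (co-occurrence + smoothing) over occurrences of the higher-ranked word
        score += math.log((doc_count(wi, wj) + eps) / doc_count(wi))
    return score

# Tiny corpus: each document is a set of tokens
docs = [{'gun', 'law', 'state'}, {'gun', 'law'}, {'car', 'bike'}, {'car', 'bike', 'ride'}]
print(umass_coherence(['gun', 'law'], docs))   # words that co-occur -> higher score
print(umass_coherence(['gun', 'bike'], docs))  # words that never co-occur -> lower score
```

Gensim's c_v measure is more sophisticated (it uses a sliding window and cosine similarity of word vectors built from co-occurrence statistics), but the underlying question is the same: do the top words of each topic actually belong together in real documents?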