Gensim - Doc2Vec Model
The Doc2Vec model, as opposed to the Word2Vec model, creates a vectorised representation of a group of words taken collectively as a single unit. It does not simply average the vectors of the words in the sentence.
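To see what a simple average loses, the following sketch averages the word vectors of two sentences that contain the same words in a different order. It uses plain Python and made-up two-dimensional word vectors, which are purely illustrative and not produced by any model. Both sentences end up with the same vector, whereas Doc2Vec learns a separate vector for each tagged document during training.

```python
# Hypothetical toy word vectors, chosen by hand for illustration only.
toy_word_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [0.5, 0.5],
}

def average_vector(words):
    """Return the element-wise mean of the toy vectors of the given words."""
    dims = len(next(iter(toy_word_vectors.values())))
    totals = [0.0] * dims
    for word in words:
        for k, value in enumerate(toy_word_vectors[word]):
            totals[k] += value
    return [total / len(words) for total in totals]

sentence_a = ["dog", "bites", "man"]
sentence_b = ["man", "bites", "dog"]

avg_a = average_vector(sentence_a)
avg_b = average_vector(sentence_b)

# Averaging ignores word order, so both sentences map to the same vector.
print(avg_a == avg_b)  # prints True
```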
Creating Document Vectors Using Doc2Vec
Here, to create document vectors using Doc2Vec, we will use the text8 dataset, which can be downloaded through gensim.downloader.
Downloading the Dataset
We can download the text8 dataset by using the following commands −
import gensim
import gensim.downloader as api
dataset = api.load("text8")
data = [d for d in dataset]
It will take some time to download the text8 dataset.
Train the Doc2Vec
In order to train the model, we need tagged documents, which can be created by using models.doc2vec.TaggedDocument() as follows −
def tagged_document(list_of_list_of_words):
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i])
data_for_training = list(tagged_document(data))
We can print the first tagged document of the training data as follows −
print(data_for_training[:1])
Output
[TaggedDocument(words=['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing', 'interpretations', 'of', 'what', 'this', 'means', 'anarchism', 'also', 'refers', 'to', 'related', 'social', 'movements', 'that', 'advocate', 'the', 'elimination', 'of', 'authoritarian', 'institutions', 'particularly', 'the', 'state', 'the', 'word', 'anarchy', 'as', 'most', 'anarchists', 'use', 'it', 'does', 'not', 'imply', 'chaos', 'nihilism', 'or', 'anomie', 'but', 'rather', 'a', 'harmonious', 'anti', 'authoritarian', 'society', 'in', 'place', 'of', 'what', 'are', 'regarded', 'as', 'authoritarian', 'political', 'structures', 'and', 'coercive', 'economic', 'institutions', 'anarchists', 'advocate', 'social', 'relations', 'based', 'upon', 'voluntary', 'association', 'of', 'autonomous', 'individuals', 'mutual', 'aid', 'and', 'self', 'governance', 'while', 'anarchism', 'is', 'most', 'easily', 'defined', 'by', 'what', 'it', 'is', 'against', 'anarchists', 'also', 'offer', 'positive', 'visions', 'of', 'what', 'they', 'believe', 'to', 'be', 'a', 'truly', 'free', 'society', 'however', 'ideas', 'about', 'how', 'an', 'anarchist', 'society', 'might', 'work', 'vary', 'considerably', 'especially', 'with', 'respect', 'to', 'economics', 'there', 'is', 'also', 'disagreement', 'about', 'how', 'a', 'free', 'society', 'might', 'be', 'brought', 'about', 'origins', 'and', 'predecessors', 'kropotkin', 'and', 'others', 'argue', 'that', 'before', 'recorded', 'history', 'human', 'society', 'was', 'organized', 'on', 'anarchist', 'principles', 'most', 'anthropologists', 'follow', 'kropotkin', 'and', 'engels', 'in', 'believing', 'that', 'hunter', 'gatherer', 'bands', 'were', 'egalitarian', 'and', 'lacked', 'division', 'of', 'labour', 'accumulated', 'wealth', 'or', 'decreed', 'law', 'and', 'had', 'equal', 'access', 'to', 'resources', 'william', 'godwin', 'anarchists', 'including', 'the', 'the', 'anarchy', 'organisation', 'and', 'rothbard', 'find', 'anarchist', 'attitudes', 'in', 'taoism', 'from', 'ancient', 'china', 'kropotkin', 'found', 'similar', 'ideas', 'in', 'stoic', 'zeno', 'of', 'citium', 'according', 'to', 'kropotkin', 'zeno', 'repudiated', 'the', 'omnipotence', 'of', 'the', 'state', 'its', 'intervention', 'and', 'regimentation', 'and', 'proclaimed', 'the', 'sovereignty', 'of', 'the', 'moral', 'law', 'of', 'the', 'individual', 'the', 'anabaptists', 'of', 'one', 'six', 'th', 'century', 'europe', 'are', 'sometimes', 'considered', 'to', 'be', 'religious', 'forerunners', 'of', 'modern', 'anarchism', 'bertrand', 'russell', 'in', 'his', 'history', 'of', 'western', 'philosophy', 'writes', 'that', 'the', 'anabaptists', 'repudiated', 'all', 'law', 'since', 'they', 'held', 'that', 'the', 'good', 'man', 'will', 'be', 'guided', 'at', 'every', 'moment', 'by', 'the', 'holy', 'spirit', 'from', 'this', 'premise', 'they', 'arrive', 'at', 'communism', 'the', 'diggers', 'or', 'true', 'levellers', 'were', 'an', 'early', 'communistic', 'movement', (truncated…)
Initialise the Model
Once the training data is prepared, we need to initialise the model. It can be done as follows −
model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)
Now, build the vocabulary as follows −
model.build_vocab(data_for_training)
Now, let’s train the Doc2Vec model as follows −
model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)
Analysing the Output
Finally, we can analyse the output by using model.infer_vector() as follows −
print(model.infer_vector(['violent', 'means', 'to', 'destroy', 'the', 'organization']))
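The inferred vector is a NumPy array, so two such vectors can be compared with cosine similarity. The helper below is a minimal sketch. The name cosine_similarity and the short stand-in vectors are assumptions made for illustration; in practice the inputs would be the arrays returned by model.infer_vector().

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-in vectors; replace with model.infer_vector(...) results.
vec_a = np.array([1.0, 2.0, 0.0])
vec_b = np.array([2.0, 4.0, 0.0])  # same direction as vec_a
vec_c = np.array([0.0, 0.0, 3.0])  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate unrelated directions.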
Complete Implementation Example
import gensim
import gensim.downloader as api
# Download (on first use) and load the text8 dataset
dataset = api.load("text8")
data = [d for d in dataset]
# Wrap each document in a TaggedDocument, tagged with its index
def tagged_document(list_of_list_of_words):
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i])
data_for_training = list(tagged_document(data))
print(data_for_training[:1])
# Initialise the model, build the vocabulary and train
model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)
model.build_vocab(data_for_training)
model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)
# Infer a vector for a new list of words
print(model.infer_vector(['violent', 'means', 'to', 'destroy', 'the', 'organization']))
Output
[ -0.2556166 0.4829361 0.17081228 0.10879577 0.12525807 0.10077011 -0.21383236 0.19294572 0.11864349 -0.03227958 -0.02207291 -0.7108424 0.07165232 0.24221905 -0.2924459 -0.03543589 0.21840079 -0.1274817 0.05455418 -0.28968817 -0.29146606 0.32885507 0.14689675 -0.06913587 -0.35173815 0.09340707 -0.3803535 -0.04030455 -0.10004586 0.22192696 0.2384828 -0.29779273 0.19236489 -0.25727913 0.09140676 0.01265439 0.08077634 -0.06902497 -0.07175519 -0.22583418 -0.21653089 0.00347822 -0.34096122 -0.06176808 0.22885063 -0.37295452 -0.08222228 -0.03148199 -0.06487323 0.11387568 ]