Gensim - Doc2Vec Model

The Doc2Vec model, in contrast to the Word2Vec model, creates a vectorised representation of a group of words taken collectively as a single unit, such as a sentence, paragraph, or document. It does not simply average the vectors of the words in the sentence.
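To make the limitation of a simple average concrete, here is a minimal sketch. The two-dimensional vectors and the helper average_vector are made up for illustration and are not part of Gensim. The sketch only shows that averaging gives the same vector to two sentences that use the same words in a different order, whereas Doc2Vec learns a separate vector for each document during training.

```python
# Toy 2-dimensional word vectors; the values are invented for illustration.
toy_word_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [1.0, 1.0],
}


def average_vector(words, word_vectors):
    """Return the component-wise mean of the vectors of the given words."""
    dims = len(next(iter(word_vectors.values())))
    totals = [0.0] * dims
    for word in words:
        for k, value in enumerate(word_vectors[word]):
            totals[k] += value
    return [total / len(words) for total in totals]


first = average_vector(["dog", "bites", "man"], toy_word_vectors)
second = average_vector(["man", "bites", "dog"], toy_word_vectors)
print(first == second)  # True: the average ignores word order
```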

Creating Document Vectors Using Doc2Vec

Here, to create document vectors using Doc2Vec, we will use the text8 dataset, which can be downloaded through gensim.downloader.

Downloading the Dataset

We can download the text8 dataset by using the following commands −


import gensim
import gensim.downloader as api
dataset = api.load("text8")
data = [d for d in dataset]

It will take some time to download the text8 dataset.

Train the Doc2Vec

In order to train the model, we need tagged documents, which can be created by using models.doc2vec.TaggedDocument() as follows −


def tagged_document(list_of_list_of_words):
   for i, list_of_words in enumerate(list_of_list_of_words):
      yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i])
data_for_training = list(tagged_document(data))

We can print the first tagged document of the training data as follows −


print(data_for_training[:1])

Output


[TaggedDocument(words=['anarchism', 'originated', 'as', 'a', 'term', 'of',
 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals',
 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution',
 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution',
 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative',
 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent',
 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it',
 'has', 'also', 'been', 'taken', 'up', 'as', 'a', 'positive', 'label', 'by',
 'self', 'defined',
 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the',
 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism',
 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that',
 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished',
 'although', 'there', 'are', 'differing', 'interpretations', 'of',
 'what', 'this', 'means', 'anarchism', 'also', 'refers', 'to',
 'related', 'social', 'movements', 'that', 'advocate', 'the',
 'elimination', 'of', 'authoritarian', 'institutions', 'particularly',
 'the', 'state', 'the', 'word', 'anarchy', 'as', 'most', 'anarchists',
 'use', 'it', 'does', 'not', 'imply', 'chaos', 'nihilism', 'or', 'anomie',
 'but', 'rather', 'a', 'harmonious', 'anti', 'authoritarian', 'society',
 'in', 'place', 'of', 'what', 'are', 'regarded', 'as', 'authoritarian',
 'political', 'structures', 'and', 'coercive', 'economic', 'institutions',
 'anarchists', 'advocate', 'social', 'relations', 'based', 'upon', 'voluntary',
 'association', 'of', 'autonomous', 'individuals', 'mutual', 'aid', 'and',
 'self', 'governance', 'while', 'anarchism', 'is', 'most', 'easily', 'defined',
 'by', 'what', 'it', 'is', 'against', 'anarchists', 'also', 'offer',
 'positive', 'visions', 'of', 'what', 'they', 'believe', 'to', 'be', 'a',
 'truly', 'free', 'society', 'however', 'ideas', 'about', 'how', 'an', 'anarchist',
 'society', 'might', 'work', 'vary', 'considerably', 'especially', 'with',
 'respect', 'to', 'economics', 'there', 'is', 'also', 'disagreement', 'about',
 'how', 'a', 'free', 'society', 'might', 'be', 'brought', 'about', 'origins',
 'and', 'predecessors', 'kropotkin', 'and', 'others', 'argue', 'that', 'before',
 'recorded', 'history', 'human', 'society', 'was', 'organized', 'on', 'anarchist',
 'principles', 'most', 'anthropologists', 'follow', 'kropotkin', 'and', 'engels',
 'in', 'believing', 'that', 'hunter', 'gatherer', 'bands', 'were', 'egalitarian',
 'and', 'lacked', 'division', 'of', 'labour', 'accumulated', 'wealth', 'or', 'decreed',
 'law', 'and', 'had', 'equal', 'access', 'to', 'resources', 'william', 'godwin',
 'anarchists', 'including', 'the', 'the', 'anarchy', 'organisation', 'and', 'rothbard',
 'find', 'anarchist', 'attitudes', 'in', 'taoism', 'from', 'ancient', 'china',
 'kropotkin', 'found', 'similar', 'ideas', 'in', 'stoic', 'zeno', 'of', 'citium',
 'according', 'to', 'kropotkin', 'zeno', 'repudiated', 'the', 'omnipotence', 'of',
 'the', 'state', 'its', 'intervention', 'and', 'regimentation', 'and', 'proclaimed',
 'the', 'sovereignty', 'of', 'the', 'moral', 'law', 'of', 'the', 'individual', 'the',
 'anabaptists', 'of', 'one', 'six', 'th', 'century', 'europe', 'are', 'sometimes',
 'considered', 'to', 'be', 'religious', 'forerunners', 'of', 'modern', 'anarchism',
 'bertrand', 'russell', 'in', 'his', 'history', 'of', 'western', 'philosophy',
 'writes', 'that', 'the', 'anabaptists', 'repudiated', 'all', 'law', 'since',
 'they', 'held', 'that', 'the', 'good', 'man', 'will', 'be', 'guided', 'at',
 'every', 'moment', 'by', 'the', 'holy', 'spirit', 'from', 'this', 'premise',
 'they', 'arrive', 'at', 'communism', 'the', 'diggers', 'or', 'true', 'levellers',
 'were', 'an', 'early', 'communistic', 'movement',
(truncated…)
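The shape of a tagged document, a list of words paired with a list of tags, can be explored without downloading anything. The sketch below uses a hypothetical stand-in, ToyTaggedDocument, built with collections.namedtuple so that it runs without Gensim; the names toy_tagged_documents and toy_data are likewise made up for illustration. It mirrors the generator used above, where the position of each document becomes its tag.

```python
from collections import namedtuple

# Hypothetical stand-in that mirrors the (words, tags) shape of TaggedDocument.
ToyTaggedDocument = namedtuple("ToyTaggedDocument", ["words", "tags"])


def toy_tagged_documents(list_of_list_of_words):
    """Yield one tagged document per token list, tagged with its index."""
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield ToyTaggedDocument(list_of_words, [i])


toy_data = [["anarchism", "originated", "as"], ["a", "term", "of", "abuse"]]
toy_training_data = list(toy_tagged_documents(toy_data))

# Each entry exposes its tokens and its tag list by name.
print(toy_training_data[1].words)  # ['a', 'term', 'of', 'abuse']
print(toy_training_data[1].tags)   # [1]
```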

Initialise the Model

Once the training data is prepared, we need to initialise the model. It can be done as follows −


model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)

Now, build the vocabulary as follows −


model.build_vocab(data_for_training)
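The min_count=2 setting means that words whose total frequency in the corpus is below 2 are ignored when the vocabulary is built. The following sketch imitates that filtering step in plain Python. It illustrates the idea only and is not Gensim's internal implementation; prune_vocabulary and toy_documents are hypothetical names.

```python
from collections import Counter


def prune_vocabulary(documents, min_count):
    """Return the set of words that occur at least min_count times overall."""
    counts = Counter(word for words in documents for word in words)
    return {word for word, count in counts.items() if count >= min_count}


toy_documents = [["the", "state", "the", "law"], ["the", "law", "anarchy"]]
kept = prune_vocabulary(toy_documents, min_count=2)
print(sorted(kept))  # ['law', 'the']
```

Raising min_count shrinks the vocabulary and removes rare words, at the cost of discarding whatever information those words carried.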

Now, let’s train the Doc2Vec model as follows −


model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)

Analysing the Output

Finally, we can analyse the output by using model.infer_vector(), which takes a list of tokens and returns a vector for that new document, as follows −


print(model.infer_vector(['violent', 'means', 'to', 'destroy', 'the', 'organization']))
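An inferred vector is hard to interpret on its own, so in practice it is usually compared with other vectors. A common measure is cosine similarity. The helper below, cosine_similarity, is a hypothetical name and a plain-Python sketch, not a Gensim function; it accepts any two equal-length sequences of numbers, including the arrays returned by infer_vector().

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Values close to 1 suggest that two documents were placed near each other in the vector space, and values close to 0 suggest they are unrelated.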

Complete Implementation Example


import gensim
import gensim.downloader as api

# Download (on first use) and load the text8 corpus as a list of token lists
dataset = api.load("text8")
data = [d for d in dataset]

# Wrap each token list in a TaggedDocument, using its index as the tag
def tagged_document(list_of_list_of_words):
   for i, list_of_words in enumerate(list_of_list_of_words):
      yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i])

data_for_training = list(tagged_document(data))
print(data_for_training[:1])

# Initialise the model, build the vocabulary, and train
model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)
model.build_vocab(data_for_training)
model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for a new, unseen list of tokens
print(model.infer_vector(['violent', 'means', 'to', 'destroy', 'the', 'organization']))

Output


[
   -0.2556166 0.4829361 0.17081228 0.10879577 0.12525807 0.10077011
   -0.21383236 0.19294572 0.11864349 -0.03227958 -0.02207291 -0.7108424
   0.07165232 0.24221905 -0.2924459 -0.03543589 0.21840079 -0.1274817
   0.05455418 -0.28968817 -0.29146606 0.32885507 0.14689675 -0.06913587
   -0.35173815 0.09340707 -0.3803535 -0.04030455 -0.10004586 0.22192696
   0.2384828 -0.29779273 0.19236489 -0.25727913 0.09140676 0.01265439
   0.08077634 -0.06902497 -0.07175519 -0.22583418 -0.21653089 0.00347822
   -0.34096122 -0.06176808 0.22885063 -0.37295452 -0.08222228 -0.03148199
   -0.06487323 0.11387568
]
