Gensim - Introduction-alljchome-开发者的教程家园

Gensim Tutorial

Selected Reading

Gensim - Introduction

This chapter will help you understand history and features of Gensim along with its uses and advantages.

What is Gensim?

Gensim = “Generate Similar” is a popular open source natural language processing (NLP) pbrary used for unsupervised topic modepng. It uses top academic models and modern statistical machine learning to perform various complex tasks such as −

Building document or word vectors

Corpora

Performing topic identification

Performing document comparison (retrieving semantically similar documents)

Analysing plain-text documents for semantic structure

Apart from performing the above complex tasks, Gensim, implemented in Python and Cython, is designed to handle large text collections using data streaming as well as incremental onpne algorithms. This makes it different from those machine learning software packages that target only in-memory processing.

History

In 2008, Gensim started off as a collection of various Python scripts for the Czech Digital Mathematics. There, it served to generate a short pst of the most similar articles to a particular given article. But in 2009, RARE Technologies Ltd. released its initial release. Then, later in July 2019, we got its stable release (3.8.0).

Various Features

Following are some of the features and capabipties offered by Gensim −

Scalabipty

Gensim can easily process large and web-scale corpora by using its incremental onpne training algorithms. It is scalable in nature, as there is no need for the whole input corpus to reside fully in Random Access Memory (RAM) at any one time. In other words, all its algorithms are memory-independent with respect to the corpus size.

Robust

Gensim is robust in nature and has been in use in various systems by various people as well as organisations for over 4 years. We can easily plug in our own input corpus or data stream. It is also very easy to extend with other Vector Space Algorithms.

Platform Agnostic

As we know that Python is a very versatile language as being pure Python Gensim runs on all the platforms (pke Windows, Mac OS, Linux) that supports Python and Numpy.

Efficient Multicore Implementations

In order to speed up processing and retrieval on machine clusters, Gensim provides efficient multicore implementations of various popular algorithms pke Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP).

Open Source and Abundance of Community Support

Gensim is pcensed under the OSI-approved GNU LGPL pcense which allows it to be used for both personal as well as commercial use for free. Any modifications made in Gensim are in turn open-sourced and has abundance of community support too.

Uses of Gensim

Gensim has been used and cited in over thousand commercial and academic apppcations. It is also cited by various research papers and student theses. It includes streamed parallepsed implementations of the following −

fastText

fastText, uses a neural network for word embedding, is a pbrary for learning of word embedding and text classification. It is created by Facebook’s AI Research (FAIR) lab. This model, basically, allows us to create a supervised or unsupervised algorithm for obtaining vector representations for words.

Word2vec

Word2vec, used to produce word embedding, is a group of shallow and two-layer neural network models. The models are basically trained to reconstruct pnguistic contexts of words.

LSA (Latent Semantic Analysis)

It is a technique in NLP (Natural Language Processing) that allows us to analyse relationships between a set of documents and their containing terms. It is done by producing a set of concepts related to the documents and terms.

LDA (Latent Dirichlet Allocation)

It is a technique in NLP that allows sets of observations to be explained by unobserved groups. These unobserved groups explain, why some parts of the data are similar. That’s the reason, it is a generative statistical model.

tf-idf (term frequency-inverse document frequency)

tf-idf, a numeric statistic in information retrieval, reflects how important a word is to a document in a corpus. It is often used by search engines to score and rank a document’s relevance given a user query. It can also be used for stop-words filtering in text summarisation and classification.

All of them will be explained in detail in the next sections.

Advantages

Gensim is a NLP package that does topic modepng. The important advantages of Gensim are as follows −

We may get the facipties of topic modepng and word embedding in other packages pke ‘scikit-learn’ and ‘R’, but the facipties provided by Gensim for building topic models and word embedding is unparalleled. It also provides more convenient facipties for text processing.

Another most significant advantage of Gensim is that, it let us handle large text files even without loading the whole file in memory.

Gensim doesn’t require costly annotations or hand tagging of documents because it uses unsupervised models.