spaCy - Introduction
  • Date: 2024-12-27


In this chapter, we will look at the features, extensions and visualisers of spaCy. A feature comparison is also provided, which will help readers analyse the functionalities provided by spaCy as compared to Natural Language Toolkit (NLTK) and CoreNLP. Here, NLP refers to Natural Language Processing.

What is spaCy?

spaCy, developed by the software developers Matthew Honnibal and Ines Montani, is an open-source software library for advanced NLP. It is written in Python and Cython (a C extension of Python designed mainly to give C-like performance to Python programs).

spaCy is a relatively new framework, but it is one of the most powerful and advanced libraries used to implement NLP.

Features

Some of the features of spaCy that make it popular are explained below −

Fast − spaCy is specially designed to be as fast as possible.

Accurate − spaCy's implementation of its labelled dependency parser makes it one of the most accurate frameworks of its kind (within 1% of the best available).

Batteries included − The batteries included in spaCy are as follows −

    Index preserving tokenization.

    “Alpha tokenization” support for more than 50 languages.

    Part-of-speech tagging.

    Pre-trained word vectors.

    Built-in, easy-to-use visualizers for named entities and syntax.

    Text classification.

Extensible − You can easily use spaCy with other existing tools like TensorFlow, Gensim, scikit-learn, etc.

Deep learning integration − It has Thinc, a deep learning framework designed for NLP tasks.
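
The index preserving tokenization listed above can be tried without downloading a trained model. The following is a minimal sketch, assuming spaCy is installed; the sample sentence is made up for illustration, and a blank English pipeline (tokenizer only) is used instead of a pre-trained one.

```python
import spacy

# A blank English pipeline contains only the tokenizer,
# so no trained model has to be downloaded.
nlp = spacy.blank("en")

text = "Hello, world! spaCy is fast."  # illustrative sample text
doc = nlp(text)

# token.idx is the character offset of the token in the original string.
offsets = [(token.text, token.idx) for token in doc]
print(offsets)

# Each token can be sliced back out of the original text using its offset.
slices_match = all(
    text[token.idx : token.idx + len(token.text)] == token.text for token in doc
)

# Joining every token with its trailing whitespace rebuilds the input.
reconstructed = "".join(token.text_with_ws for token in doc)
print(slices_match, reconstructed == text)
```

Because each token remembers where it came from, annotations produced later in a pipeline can be mapped back to exact positions in the source text.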

Extensions and visualisers

Some of the easy-to-use extensions and visualisers that come with spaCy and are free, open-source libraries are listed below −

Thinc − It is a Machine Learning (ML) library optimised for Central Processing Unit (CPU) usage. It is also designed for deep learning with text input and NLP tasks.

sense2vec − This library is for computing word similarities. It is based on Word2vec.

displaCy − It is an open-source dependency parse tree visualiser. It is built with JavaScript, CSS (Cascading Style Sheets), and SVG (Scalable Vector Graphics).

displaCy ENT − It is a built-in named entity visualiser that comes with spaCy. It is built with JavaScript and CSS. It lets users check their model’s predictions in the browser.
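
As a rough illustration of the named entity visualiser, the sketch below uses the displacy module bundled with the spaCy Python package to produce HTML markup. It assumes spaCy is installed; the sentence and the hand-assigned PERSON labels are illustrative stand-ins for what a trained model would predict.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Blank pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")
doc = nlp("Matthew Honnibal and Ines Montani developed spaCy.")

# Hand-labelled entities (illustrative only); a trained pipeline
# would normally fill doc.ents with its own predictions.
doc.ents = [
    Span(doc, 0, 2, label="PERSON"),  # tokens 0-1: "Matthew Honnibal"
    Span(doc, 3, 5, label="PERSON"),  # tokens 3-4: "Ines Montani"
]

# With jupyter=False, render() returns the markup as a string
# instead of trying to display it in a notebook.
html = displacy.render(doc, style="ent", page=False, jupyter=False)
print(html)
```

The returned string can be saved to an .html file and opened in a browser. Passing style="dep" instead draws a dependency tree, but that requires a pipeline with a parser.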

Feature Comparison

The following table shows the comparison of the functionalities provided by spaCy, NLTK, and CoreNLP −

Features                  spaCy   NLTK   CoreNLP
Python API                Yes     Yes    No
Easy installation         Yes     Yes    Yes
Multi-language Support    Yes     Yes    Yes
Integrated word vectors   Yes     No     No
Tokenization              Yes     Yes    Yes
Part-of-speech tagging    Yes     Yes    Yes
Sentence segmentation     Yes     Yes    Yes
Dependency parsing        Yes     No     Yes
Entity Recognition        Yes     Yes    Yes
Entity linking            Yes     No     No
Coreference Resolution    No      No     Yes
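
Two of the rows above, tokenization and sentence segmentation, can be tried in a few lines. The following is a minimal sketch, assuming spaCy v3 or later (where built-in components are added by name) and using the rule-based sentencizer rather than a trained model; the sample text is made up for illustration.

```python
import spacy

# Assumption: spaCy v3+, where add_pipe() accepts a component name.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based splitter, no trained model needed

doc = nlp("spaCy is fast. It is written in Python and Cython.")

# Tokenization: every word and punctuation mark becomes a token.
tokens = [token.text for token in doc]

# Sentence segmentation: doc.sents yields one span per sentence.
sentences = [sent.text for sent in doc.sents]

print(tokens)
print(sentences)
```

The other rows, such as part-of-speech tagging, dependency parsing and entity recognition, need a trained pipeline to be loaded instead of a blank one.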

Benchmarks

spaCy's syntactic parser has been reported as the fastest in the world, with accuracy within 1% of the best available system.

The following table shows the parsing accuracy of spaCy alongside other systems −

System       Year   Language            Accuracy (%)
spaCy v2.x   2017   Python and Cython   92.6
spaCy v1.x   2015   Python and Cython   91.8
ClearNLP     2015   Java                91.7
CoreNLP      2015   Java                89.6
MATE         2015   Java                92.5
Turbo        2015   C++                 92.4