spaCy - Architecture
  • Date: 2024-10-18

This chapter describes the central data structures in spaCy and explains the main objects and their roles.

Data Structures

The central data structures in spaCy are as follows −

    Doc − This is one of the most important objects in spaCy’s architecture and owns the sequence of tokens along with all their annotations.

    Vocab − This is the other central data structure in spaCy. It owns a set of look-up tables that make common information available across documents.

Centralising strings, word vectors, and lexical attributes in this way saves memory, because spaCy avoids storing multiple copies of the same data.
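The sharing described above can be seen in a minimal sketch. It assumes spaCy v3 and uses spacy.blank("en"), which needs no downloaded model; the variable names are illustrative only −

```python
import spacy

# A blank English pipeline: tokenizer and vocab only, no trained model
nlp = spacy.blank("en")

doc_a = nlp("I like coffee")
doc_b = nlp("Coffee is nice")

# Every Doc created by the same pipeline points at the same Vocab object
shared = doc_a.vocab is doc_b.vocab is nlp.vocab
print(shared)

# Strings are stored once in the Vocab's string store and referenced by hash
coffee_hash = nlp.vocab.strings["coffee"]
coffee_text = nlp.vocab.strings[coffee_hash]
print(coffee_hash, coffee_text)

# A token keeps only the hash of its text, not its own copy of the string
print(doc_a[2].orth == coffee_hash)
```

Because tokens hold hashes rather than strings, the text of a word that appears in many documents is kept in memory only once.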

Objects and their role

The objects in spaCy along with their role and an example are explained below −


A Span is a slice of a Doc object, which we discussed above. We can create a Span by slicing the Doc with the following syntax −

doc[start : end]


An example of span is given below −

import spacy
import en_core_web_sm
nlp_example = en_core_web_sm.load()
my_doc = nlp_example("This is my first example.")
# Take tokens 1 to 5; the end index 6 is exclusive
span = my_doc[1:6]
print(span.text)


is my first example.
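As a further illustration, the sketch below inspects a few attributes of a Span. It assumes spaCy v3 and uses a blank English pipeline (spacy.blank("en")) instead of en_core_web_sm so that no model download is needed; the names are illustrative −

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("This is my first example.")

# Slicing a Doc returns a Span; the end index is exclusive
span = doc[1:4]

print(span.text)             # text covered by the span
print(span.start, span.end)  # token offsets into the parent Doc
print(len(span))             # number of tokens in the span

# The span is a view onto the parent Doc, not a copy of it
print(span.doc is doc)

# Indexing a span yields Token objects
first_token_text = span[0].text
print(first_token_text)
```

A Span stores only its start and end offsets plus a reference to the Doc, which is why creating one is cheap.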


As the name suggests, a Token represents an individual token such as a word, punctuation mark, whitespace, or symbol.


An example of token is stated below −

import spacy
import en_core_web_sm
nlp_example = en_core_web_sm.load()
my_doc = nlp_example("This is my first example.")
# Indexing a Doc with a single integer returns a Token
token = my_doc[4]
print(token.text)  # example
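# A short sketch of some Token attributes, assuming spaCy v3. It uses a blank
# English pipeline so that no model download is needed; names are illustrative.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("This is my first example.")

token = doc[4]
print(token.text)      # the token's verbatim text
print(token.i)         # index of the token within the Doc
print(token.idx)       # character offset of the token in the original text
print(token.is_alpha)  # True when the token consists of alphabetic characters

# The final token is the full stop, which is flagged as punctuation
last = doc[5]
print(last.text, last.is_punct)

# Trailing whitespace is recorded per token, so the text can be rebuilt exactly
rebuilt = "".join(t.text + t.whitespace_ for t in doc)
print(rebuilt == doc.text)
```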




As the name suggests, the Tokenizer class segments text into words, punctuation marks, and so on.


This example will create a blank tokenizer with just the English vocab −

from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
nlp_lang = English()
# Only the vocab is passed, so no prefix, suffix, or infix rules are set
blank_tokenizer = Tokenizer(nlp_lang.vocab)


<spacy.tokenizer.Tokenizer at 0x26506efc480>
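To show what "blank" means in practice, the following sketch compares a tokenizer built from the vocab alone with the default tokenizer of the English class. It assumes spaCy v3; the sample text and variable names are illustrative −

```python
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

nlp = English()
text = "Hello, world!"

# Built from the vocab alone: no punctuation rules, so it splits on whitespace only
blank_tokenizer = Tokenizer(nlp.vocab)
blank_tokens = [t.text for t in blank_tokenizer(text)]

# The tokenizer that ships with English also splits off punctuation
default_tokens = [t.text for t in nlp.tokenizer(text)]

print(blank_tokens)
print(default_tokens)
```

The blank tokenizer leaves punctuation attached to the neighbouring word, while the default English tokenizer separates it into its own tokens.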


The Language class is a text-processing pipeline. We usually load it once per process and pass the instance around the application. An instance of this class is created when we call spacy.load().

It contains the following −

    Shared vocabulary

    Language data

    Optional model data loaded from a model package

    Processing pipeline containing components such as tagger or parser.
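The parts listed above can be inspected on a Language instance. The sketch below assumes spaCy v3, where add_pipe takes a component name as a string, and uses a blank English pipeline with the rule-based sentencizer rather than a trained model package −

```python
import spacy

# Language data for English, with no trained components loaded
nlp = spacy.blank("en")

print(nlp.lang)        # language code of the pipeline
print(nlp.pipe_names)  # processing pipeline, empty for a blank pipeline

# Add a rule-based component that sets sentence boundaries
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

# Calling the Language object runs the tokenizer and then each component
doc = nlp("This is my first example. Here is a second one.")
sentences = [sent.text for sent in doc.sents]
print(sentences)

# The Doc shares the vocabulary owned by the Language object
print(doc.vocab is nlp.vocab)
```

A pipeline loaded with spacy.load() exposes the same attributes, with components such as a tagger or parser listed in pipe_names when the model package provides them.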


This example initialises a generic Language object and then an English Language object −

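from spacy.vocab import Vocab
from spacy.language import Language
from spacy.lang.en import English
# A generic Language object with an empty vocabulary
nlp_lang = Language(Vocab())
# The English subclass, which carries English language data
nlp_lang = English()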


When you run the code, you will see the following output −

<spacy.lang.en.English at 0x26503773cf8>