spaCy - Architecture

This chapter describes the central data structures in spaCy and explains its main objects and their roles.

Data Structures

The central data structures in spaCy are as follows −

Doc − This is one of the most important objects in spaCy’s architecture and owns the sequence of tokens along with all their annotations.

Vocab − This is the other central data structure of spaCy. It owns a set of look-up tables that make common information available across documents.

By centralising strings, word vectors, and lexical attributes in the Vocab, spaCy saves memory because it avoids storing multiple copies of the data.
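As a rough illustration of these shared look-up tables, the sketch below processes two texts with one pipeline and checks that both documents point at the same Vocab and that a string maps to a single hash. It assumes spacy.blank("en"), a tokenizer-only English pipeline that needs no model package; the variable names are illustrative only.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), so no model package is required
nlp = spacy.blank("en")

doc_a = nlp("I like coffee")
doc_b = nlp("They sell coffee")

# Both documents share the Vocab owned by the pipeline
shared_vocab = doc_a.vocab is doc_b.vocab

# The string store maps a string to a hash and the hash back to the string
coffee_hash = nlp.vocab.strings["coffee"]
coffee_text = nlp.vocab.strings[coffee_hash]

# Tokens store the hash instead of keeping their own copy of the string
same_hash = doc_a[2].orth == doc_b[2].orth == coffee_hash

print(shared_vocab, coffee_text, same_hash)
```

Because the string is stored once and referenced by hash, adding more documents that contain the same word does not add more copies of it.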

Objects and their role

The objects in spaCy along with their role and an example are explained below −

Span

It is a slice of a Doc object, which we discussed above. We can create a Span object by slicing a Doc with the following syntax −


doc[start : end]

Example

An example of span is given below −


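import spacy

# Load the small English pipeline (the en_core_web_sm package must be installed)
nlp_example = spacy.load("en_core_web_sm")
my_doc = nlp_example("This is my first example.")
# Take tokens 1 to 5; the end index 6 is exclusive
span = my_doc[1:6]
print(span)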

Output


is my first example.
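To make the slicing behaviour more concrete, here is a small sketch that inspects a Span's boundaries and its link back to the parent Doc. It assumes a blank English pipeline from spacy.blank("en") so that no model package is needed; the names doc and span are illustrative.

```python
import spacy

# Assumption: a blank English pipeline is enough, since only tokenization is needed
nlp = spacy.blank("en")
doc = nlp("This is my first example.")

# Slice indices refer to tokens, not characters; the end index is exclusive
span = doc[1:4]

print(span.text)                        # text covered by the span
print(span.start, span.end)             # token offsets within the parent Doc
print(len(span))                        # number of tokens in the span
print(span.doc is doc)                  # a Span is a view onto its Doc, not a copy
print([token.text for token in span])   # iterating a Span yields Token objects
```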

Token

As the name suggests, it represents an individual token such as a word, punctuation mark, whitespace, symbol, etc.

Example

An example of token is stated below −


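import spacy

# Load the small English pipeline (the en_core_web_sm package must be installed)
nlp_example = spacy.load("en_core_web_sm")
my_doc = nlp_example("This is my first example.")
# Index the Doc to get the fifth token (indices start at 0)
token = my_doc[4]
print(token)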

Output


example
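A Token also carries annotations about itself. The sketch below reads a few lexical attributes that are available without a trained model; it assumes a blank English pipeline from spacy.blank("en"), and the names word and mark are illustrative.

```python
import spacy

# Assumption: a blank English pipeline, so only tokenizer-level attributes are set
nlp = spacy.blank("en")
doc = nlp("This is my first example.")

word = doc[4]   # the token "example"
mark = doc[5]   # the final full stop

print(word.text, word.i, word.idx)     # text, token index, character offset
print(word.is_alpha, word.is_punct)    # lexical flags of the word
print(mark.text, mark.is_punct)        # the full stop is flagged as punctuation
# Trailing whitespace is stored on the token, so the original text can be rebuilt
print(repr(doc[0].whitespace_), repr(word.whitespace_))
```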

Tokenizer

As the name suggests, the Tokenizer class segments text into words, punctuation marks, etc.

Example

This example creates a blank tokenizer with just the English vocab −


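from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
# Create an English Language object to obtain its vocab
nlp_lang = English()
# Build a tokenizer that has the vocab but no punctuation or exception rules
blank_tokenizer = Tokenizer(nlp_lang.vocab)
blank_tokenizer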

Output


<spacy.tokenizer.Tokenizer at 0x26506efc480>
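To see what "blank" means in practice, the following sketch compares a tokenizer built from the vocab alone with the default tokenizer of the English class. It is only an illustration: the sample text and the names blank_tokens and default_tokens are assumptions, not part of the original example.

```python
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

nlp = English()
text = "This is my first example."

# Tokenizer with only a vocab: no prefix, suffix, or exception rules,
# so the text is split on whitespace alone
blank_tokenizer = Tokenizer(nlp.vocab)
blank_tokens = [t.text for t in blank_tokenizer(text)]

# Default English tokenizer: also applies the language's punctuation rules
default_tokens = [t.text for t in nlp.tokenizer(text)]

print(blank_tokens)
print(default_tokens)
```

The blank tokenizer leaves the full stop attached to the last word, while the default English tokenizer separates it into its own token.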

Language

It is a text-processing pipeline. We usually load it once per process and pass the instance around the application. An instance of this class is created when we call spacy.load().

It contains the following −

Shared vocabulary

Language data

Optional model data loaded from a model package

Processing pipeline containing components such as the tagger or parser.

Example

This example initialises an English Language object −


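from spacy.vocab import Vocab
from spacy.language import Language
# Generic Language object built on an empty Vocab
nlp_lang = Language(Vocab())
from spacy.lang.en import English
# English subclass, which carries the English language data
nlp_lang = English()
nlp_lang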

Output

When you run the code, you will see the following output −


<spacy.lang.en.English at 0x26503773cf8>
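The parts of a Language object listed above can be inspected directly. The sketch below is a minimal illustration that assumes a blank English pipeline from spacy.blank("en"), which has a shared vocabulary and language data but no model data and no pipeline components; a pipeline loaded from a model package would typically list components such as a tagger or parser instead.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Assumption: a blank pipeline, so no model package is needed
nlp = spacy.blank("en")

is_language = isinstance(nlp, Language)   # English is a subclass of Language
lang_code = nlp.lang                      # language code of the pipeline
components = nlp.pipe_names               # names of the pipeline components

# Calling the pipeline on text returns a Doc that uses the pipeline's Vocab
doc = nlp("This is my first example.")
is_doc = isinstance(doc, Doc)
shares_vocab = doc.vocab is nlp.vocab

print(is_language, lang_code, components, is_doc, shares_vocab)
```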
