spaCy - Container Token Class
This chapter describes the Token class in spaCy, along with its attributes, methods, and classmethods.
Token Class
As discussed previously, the Token class represents an individual token such as a word, punctuation mark, whitespace, symbol, etc.
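As a minimal sketch of how Token objects are obtained in practice, the snippet below tokenizes a short sentence and indexes into the resulting Doc. It assumes spaCy is installed and uses a blank English pipeline (`spacy.blank("en")`), so no trained model is needed; the sample sentence is only an illustration.

```python
import spacy
from spacy.tokens import Token

# A blank pipeline contains only the tokenizer (no tagger, parser or NER).
nlp = spacy.blank("en")
doc = nlp("Hello, world! It costs $5.")

# Iterating over a Doc yields Token objects; indexing returns a single Token.
tokens = [token.text for token in doc]
first = doc[0]

print(tokens)
print(isinstance(first, Token))

# Tokenization is non-destructive: joining text_with_ws restores the input.
restored = "".join(token.text_with_ws for token in doc)
print(restored == doc.text)
```

Tokens are views into their parent Doc, so they are normally obtained by indexing or iterating rather than constructed directly.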
Attributes
The table below explains its attributes −
NAME | TYPE | DESCRIPTION |
---|---|---|
doc | Doc | It represents the parent document. |
sent | Span | Introduced in version 2.0.12, it represents the sentence span that this token is a part of. |
text | unicode | It is the verbatim Unicode text content. |
text_with_ws | unicode | It represents the text content, with a trailing space character (if present). |
whitespace_ | unicode | As the name implies, it is the trailing space character (if present). |
orth | int | It is the ID of the verbatim Unicode text content. |
orth_ | unicode | It is the verbatim Unicode text content, identical to Token.text. It exists mostly for consistency with the other attributes. |
vocab | Vocab | This attribute represents the vocab object of the parent Doc. |
tensor | ndarray | Introduced in version 2.1.7, it represents the token’s slice of the parent Doc’s tensor. |
head | Token | It is the syntactic parent of this token. |
left_edge | Token | As the name implies, it is the leftmost token of this token’s syntactic descendants. |
right_edge | Token | As the name implies, it is the rightmost token of this token’s syntactic descendants. |
i | int | It is the index of the token within the parent document. |
ent_type | int | It is the named entity type. |
ent_type_ | unicode | It is the named entity type. |
ent_iob | int | It is the IOB code of named entity tag. Here, 3 = the token begins an entity, 2 = it is outside an entity, 1 = it is inside an entity, and 0 = no entity tag is set. |
ent_iob_ | unicode | It is the IOB code of named entity tag. “B” = the token begins an entity, “I” = it is inside an entity, “O” = it is outside an entity, and "" = no entity tag is set. |
ent_kb_id | int | Introduced in version 2.2, represents the knowledge base ID that refers to the named entity this token is a part of. |
ent_kb_id_ | unicode | Introduced in version 2.2, represents the knowledge base ID that refers to the named entity this token is a part of. |
ent_id | int | It is the ID of the entity the token is an instance of (if any). This attribute is currently not used, but could potentially be used for coreference resolution. |
ent_id_ | unicode | It is the ID of the entity the token is an instance of (if any). This attribute is currently not used, but could potentially be used for coreference resolution. |
lemma | int | It is the base form of the token, with no inflectional suffixes. |
lemma_ | unicode | It is the base form of the token, with no inflectional suffixes. |
norm | int | This attribute represents the token’s norm. |
norm_ | unicode | This attribute represents the token’s norm. |
lower | int | As the name implies, it is the lowercase form of the token. |
lower_ | unicode | It is the lowercase form of the token text, equivalent to Token.text.lower(). |
shape | int | It is a transform of the token’s string that shows its orthographic features. |
shape_ | unicode | It is a transform of the token’s string that shows its orthographic features. |
prefix | int | It is the hash value of a length-N substring from the start of the token. The default value is N=1. |
prefix_ | unicode | It is a length-N substring from the start of the token. The default value is N=1. |
suffix | int | It is the hash value of a length-N substring from the end of the token. The default value is N=3. |
suffix_ | unicode | It is a length-N substring from the end of the token. The default value is N=3. |
is_alpha | bool | This attribute indicates whether the token consists of alphabetic characters. It is equivalent to token.text.isalpha(). |
is_ascii | bool | This attribute indicates whether the token consists of ASCII characters. It is equivalent to all(ord(c) < 128 for c in token.text). |
is_digit | bool | This attribute indicates whether the token consists of digits. It is equivalent to token.text.isdigit(). |
is_lower | bool | This attribute indicates whether the token is in lowercase. It is equivalent to token.text.islower(). |
is_upper | bool | This attribute indicates whether the token is in uppercase. It is equivalent to token.text.isupper(). |
is_title | bool | This attribute indicates whether the token is in titlecase. It is equivalent to token.text.istitle(). |
is_punct | bool | This attribute indicates whether the token is punctuation. |
is_left_punct | bool | This attribute indicates whether the token is a left punctuation mark, e.g. ( |
is_right_punct | bool | This attribute indicates whether the token is a right punctuation mark, e.g. ) |
is_space | bool | This attribute indicates whether the token consists of whitespace characters. It is equivalent to token.text.isspace(). |
is_bracket | bool | This attribute indicates whether the token is a bracket. |
is_quote | bool | This attribute indicates whether the token is a quotation mark. |
is_currency | bool | Introduced in version 2.0.8, this attribute indicates whether the token is a currency symbol. |
like_url | bool | This attribute indicates whether the token resembles a URL. |
like_num | bool | This attribute indicates whether the token represents a number. |
like_email | bool | This attribute indicates whether the token resembles an email address. |
is_oov | bool | This attribute indicates whether the token is out-of-vocabulary, i.e. whether it lacks a word vector. |
is_stop | bool | This attribute indicates whether the token is part of a “stop list”. |
pos | int | It represents the coarse-grained part-of-speech from the Universal POS tag set. |
pos_ | unicode | It represents the coarse-grained part-of-speech from the Universal POS tag set. |
tag | int | It represents the fine-grained part-of-speech. |
tag_ | unicode | It represents the fine-grained part-of-speech. |
dep | int | This attribute represents the syntactic dependency relation. |
dep_ | unicode | This attribute represents the syntactic dependency relation. |
lang | int | This attribute represents the language of the parent document’s vocabulary. |
lang_ | unicode | This attribute represents the language of the parent document’s vocabulary. |
prob | float | It is the smoothed log probability estimate of the token’s word type. |
idx | int | It is the character offset of the token within the parent document. |
sentiment | float | It represents a scalar value that indicates the positivity or negativity of the token. |
lex_id | int | It represents the sequential ID of the token’s lexical type, which is used to index into tables. |
rank | int | It represents the sequential ID of the token’s lexical type, which is used to index into tables. |
cluster | int | It is the Brown cluster ID. |
_ | Underscore | It represents the user space for adding custom attribute extensions. |
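The following sketch reads a handful of the attributes listed above. It assumes spaCy is installed and uses a blank English pipeline, so only the tokenizer-level (lexical) attributes carry information; attributes such as pos_, tag_, dep_ and the ent_* family are filled in only when a trained pipeline with the corresponding components is loaded. The sample sentence is illustrative.

```python
import spacy

# Blank pipeline: tokenizer only, so lexical attributes are available but
# attributes set by a tagger, parser or entity recognizer stay unset.
nlp = spacy.blank("en")
doc = nlp("Apple paid $5 in 2020.")

for token in doc:
    print(
        token.i,                   # index of the token in the Doc
        token.idx,                 # character offset in the Doc text
        repr(token.text_with_ws),  # text plus trailing whitespace, if any
        token.shape_,              # orthographic shape
        token.is_alpha,
        token.like_num,
        token.is_currency,
    )

apple = doc[0]

# Attributes without a trailing underscore are integer IDs; the underscore
# variants are the corresponding strings, resolved through the StringStore.
same_text = nlp.vocab.strings[apple.orth] == apple.orth_ == apple.text
print(apple.lower_, apple.prefix_, apple.suffix_, same_text)
```

The pairing of an integer attribute with its underscore-suffixed string form (orth and orth_, lower and lower_, and so on) holds for most rows in the table.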
Methods
Following are the methods used in Token class −
Sr.No. | Method & Description |
---|---|
1 | Token.__init__: It is used to construct a Token object. |
2 | Token.similarity: It is used to compute a semantic similarity estimate. |
3 | Token.check_flag: It is used to check the value of a Boolean flag. |
4 | Token.__len__: It is used to calculate the number of Unicode characters in the token. |
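The sketch below exercises Token.similarity, Token.check_flag and Token.__len__ (through len()). Similarity relies on word vectors, which a blank pipeline does not ship with, so the example registers three tiny hand-made vectors via Vocab.reset_vectors and Vocab.set_vector. These toy vectors are an assumption made purely for illustration; real vectors normally come from a trained pipeline.

```python
import numpy
import spacy
from spacy.attrs import IS_ALPHA

nlp = spacy.blank("en")

# Toy 3-dimensional vectors (illustrative values, not real embeddings).
nlp.vocab.reset_vectors(width=3)
nlp.vocab.set_vector("cat", numpy.asarray([1.0, 0.5, 0.0], dtype="float32"))
nlp.vocab.set_vector("dog", numpy.asarray([1.0, 0.4, 0.1], dtype="float32"))
nlp.vocab.set_vector("car", numpy.asarray([0.0, 0.2, 1.0], dtype="float32"))

doc = nlp("cat dog car 42")
cat, dog, car, number = doc

# similarity() returns the cosine similarity of the two token vectors.
cat_dog = cat.similarity(dog)
cat_car = cat.similarity(car)
print(cat_dog > cat_car)

# check_flag() takes an attribute ID from spacy.attrs.
print(cat.check_flag(IS_ALPHA), number.check_flag(IS_ALPHA))

# len() gives the number of Unicode characters in the token.
print(len(cat))
```

With these toy vectors, "cat" and "dog" point in nearly the same direction while "car" does not, which is why the first comparison holds.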
ClassMethods
Following are the classmethods used in Token class −
Sr.No. | Classmethod & Description |
---|---|
1 | Token.set_extension: It defines a custom attribute on the Token. |
2 | Token.get_extension: It will look up a previously registered extension by name. |
3 | Token.has_extension: It will check whether an extension has been registered on the Token class or not. |
4 | Token.remove_extension: It will remove a previously registered extension on the Token class. |
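A minimal sketch of the extension classmethods follows. The extension names `is_fruit` and `note`, the FRUITS set and the helper function are hypothetical and chosen only for illustration; `force=True` is passed so the snippet can be re-run without raising an error about an already registered name.

```python
import spacy
from spacy.tokens import Token

FRUITS = {"apple", "pear", "banana"}  # hypothetical lookup set


def is_fruit(token):
    """Getter for the hypothetical 'is_fruit' extension."""
    return token.lower_ in FRUITS


# Register a computed (getter) extension and a writable one with a default.
Token.set_extension("is_fruit", getter=is_fruit, force=True)
Token.set_extension("note", default=None, force=True)

nlp = spacy.blank("en")
doc = nlp("I have an apple")

# Custom attributes live under the `_` namespace of each token.
print(doc[3]._.is_fruit, doc[0]._.is_fruit)
doc[0]._.note = "pronoun"
print(doc[0]._.note, doc[1]._.note)

# get_extension returns a (default, method, getter, setter) tuple.
default, method, getter, setter = Token.get_extension("is_fruit")
print(getter is is_fruit)

# remove_extension unregisters the attribute from the Token class.
Token.remove_extension("is_fruit")
Token.remove_extension("note")
still_registered = Token.has_extension("is_fruit")
print(still_registered)
```

Extensions are registered on the Token class itself, so they apply to every token in every Doc for the lifetime of the process, not just to one pipeline.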