spaCy - Container Token Class

This chapter describes the Token class in spaCy.

Token Class

As discussed previously, the Token class represents an individual token such as a word, punctuation mark, whitespace, symbol, etc.
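As a minimal sketch of what this means in practice, the snippet below tokenizes a sample sentence and confirms that every element of the resulting Doc is a Token. It assumes spaCy is installed and uses a blank English pipeline (tokenizer only), so no trained model is needed; the sentence itself is only an illustration.

```python
import spacy
from spacy.tokens import Token

# Blank English pipeline: only the tokenizer, no trained components.
nlp = spacy.blank("en")
doc = nlp("Is this a token? Yes.")

# Iterating over a Doc yields Token objects, including punctuation.
token_texts = [token.text for token in doc]
all_tokens = all(isinstance(token, Token) for token in doc)

print(token_texts)
print(all_tokens)
```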

Attributes

The table below explains its attributes −

NAME	TYPE	DESCRIPTION
doc	Doc	It represents the parent document.
sent	Span	Introduced in version 2.0.12, it represents the sentence span that this token is a part of.
text	unicode	It is the verbatim Unicode text content.
text_with_ws	unicode	It represents the text content, with the trailing space character (if present).
whitespace_	unicode	As the name implies, it is the trailing space character (if present).
orth	int	It is the ID of the verbatim Unicode text content.
orth_	unicode	It is the verbatim Unicode text content, identical to Token.text. It exists mostly for consistency with the other attributes.
vocab	Vocab	This attribute represents the vocab object of the parent Doc.
tensor	ndarray	Introduced in version 2.1.7, it represents the token’s slice of the parent Doc’s tensor.
head	Token	It is the syntactic parent of this token.
left_edge	Token	As the name implies, it is the leftmost token of this token’s syntactic descendants.
right_edge	Token	As the name implies, it is the rightmost token of this token’s syntactic descendants.
i	int	It is the index of the token within the parent document.
ent_type	int	It is the named entity type, as an integer ID.
ent_type_	unicode	It is the named entity type, as a string.
ent_iob	int	It is the IOB code of the named entity tag. Here, 3 = the token begins an entity, 2 = it is outside an entity, 1 = it is inside an entity, and 0 = no entity tag is set.
ent_iob_	unicode	It is the IOB code of the named entity tag. “B” = the token begins an entity, “I” = it is inside an entity, “O” = it is outside an entity, and "" = no entity tag is set.
ent_kb_id	int	Introduced in version 2.2, it represents the knowledge base ID, as an integer, that refers to the named entity this token is a part of.
ent_kb_id_	unicode	Introduced in version 2.2, it represents the knowledge base ID, as a string, that refers to the named entity this token is a part of.
ent_id	int	It is the integer ID of the entity the token is an instance of (if any). This attribute is currently not used, but could potentially be used for coreference resolution.
ent_id_	unicode	It is the string ID of the entity the token is an instance of (if any). This attribute is currently not used, but could potentially be used for coreference resolution.
lemma	int	It is the ID of the base form of the token, having no inflectional suffixes.
lemma_	unicode	It is the base form of the token, having no inflectional suffixes.
norm	int	This attribute represents the ID of the token’s norm.
norm_	unicode	This attribute represents the token’s norm, as a string.
lower	int	As the name implies, it is the ID of the lowercase form of the token.
lower_	unicode	It is the lowercase form of the token text, equivalent to Token.text.lower().
shape	int	It is the ID of a transform of the token’s string that shows its orthographic features.
shape_	unicode	It is a transform of the token’s string that shows its orthographic features.
prefix	int	It is the hash value of a length-N substring from the start of the token. The default value is N=1.
prefix_	unicode	It is a length-N substring from the start of the token. The default value is N=1.
suffix	int	It is the hash value of a length-N substring from the end of the token. The default value is N=3.
suffix_	unicode	It is a length-N substring from the end of the token. The default value is N=3.
is_alpha	bool	This attribute indicates whether the token consists of alphabetic characters. It is equivalent to token.text.isalpha().
is_ascii	bool	This attribute indicates whether the token consists of ASCII characters. It is equivalent to all(ord(c) < 128 for c in token.text).
is_digit	bool	This attribute indicates whether the token consists of digits. It is equivalent to token.text.isdigit().
is_lower	bool	This attribute indicates whether the token is in lowercase. It is equivalent to token.text.islower().
is_upper	bool	This attribute indicates whether the token is in uppercase. It is equivalent to token.text.isupper().
is_title	bool	This attribute indicates whether the token is in titlecase. It is equivalent to token.text.istitle().
is_punct	bool	This attribute indicates whether the token is punctuation.
is_left_punct	bool	This attribute indicates whether the token is a left punctuation mark, e.g. (.
is_right_punct	bool	This attribute indicates whether the token is a right punctuation mark, e.g. ).
is_space	bool	This attribute indicates whether the token consists of whitespace characters. It is equivalent to token.text.isspace().
is_bracket	bool	This attribute indicates whether the token is a bracket.
is_quote	bool	This attribute indicates whether the token is a quotation mark.
is_currency	bool	Introduced in version 2.0.8, this attribute indicates whether the token is a currency symbol.
like_url	bool	This attribute indicates whether the token resembles a URL.
like_num	bool	This attribute indicates whether the token represents a number, e.g. "10.9", "10" or "ten".
like_email	bool	This attribute indicates whether the token resembles an email address.
is_oov	bool	This attribute indicates whether the token is out-of-vocabulary, i.e. whether it lacks a word vector.
is_stop	bool	This attribute indicates whether the token is part of a “stop list”.
pos	int	It represents the coarse-grained part-of-speech from the Universal POS tag set.
pos_	unicode	It represents the coarse-grained part-of-speech from the Universal POS tag set.
tag	int	It represents the fine-grained part-of-speech.
tag_	unicode	It represents the fine-grained part-of-speech.
dep	int	This attribute represents the syntactic dependency relation.
dep_	unicode	This attribute represents the syntactic dependency relation.
lang	int	This attribute represents the language of the parent document’s vocabulary.
lang_	unicode	This attribute represents the language of the parent document’s vocabulary.
prob	float	It is the smoothed log probability estimate of the token’s word type.
idx	int	It is the character offset of the token within the parent document.
sentiment	float	It represents a scalar value that indicates the positivity or negativity of the token.
lex_id	int	It represents the sequential ID of the token’s lexical type which is used to index into tables.
rank	int	It represents the sequential ID of the token’s lexical type which is used to index into tables.
cluster	int	It is the Brown cluster ID.
_	Underscore	It represents the user space for adding custom attribute extensions.
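The sketch below reads a handful of these attributes from a sample sentence. It assumes spaCy is installed and uses a blank English pipeline, so only tokenizer-level and lexical attributes are shown; attributes such as pos_, dep_, head or ent_type_ need a trained pipeline and are left out. The sentence and the variable names are illustrative only.

```python
import spacy

# Blank English pipeline: tokenizer and lexical attributes only.
nlp = spacy.blank("en")
text = "Apple paid $1 billion."
doc = nlp(text)

# Collect a few attributes per token.
rows = []
for token in doc:
    rows.append({
        "i": token.i,                      # index within the Doc
        "idx": token.idx,                  # character offset within the text
        "text": token.text,
        "whitespace_": token.whitespace_,  # trailing space, if any
        "shape_": token.shape_,
        "prefix_": token.prefix_,
        "suffix_": token.suffix_,
        "is_alpha": token.is_alpha,
        "is_currency": token.is_currency,
        "like_num": token.like_num,
    })

for row in rows:
    print(row)

# Attributes without a trailing underscore are integer IDs; the
# vocabulary's string store maps an ID back to its string.
first = doc[0]
orth_roundtrip = nlp.vocab.strings[first.orth]

# Joining text_with_ws for every token reproduces the original text.
rebuilt = "".join(token.text_with_ws for token in doc)
```

The underscore suffix is the pattern to remember: `first.orth` is an integer, while `first.orth_` is the string it stands for.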

Methods

The following are the methods of the Token class −

Sr.No.	Method & Description
1	Token.__init__ It is used to construct a Token object.
2	Token.similarity It is used to compute a semantic similarity estimate.
3	Token.check_flag It is used to check the value of a Boolean flag.
4	Token.__len__ It is used to calculate the number of Unicode characters in the token.
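A short sketch of two of these methods follows, again assuming spaCy is installed and using a blank English pipeline. Token.similarity is not called here because it relies on word vectors, which a blank pipeline does not provide. The sample sentence is an assumption made for illustration.

```python
import spacy
from spacy.attrs import IS_ALPHA

nlp = spacy.blank("en")
doc = nlp("Hello, world!")

# Token.__len__: number of Unicode characters in each token.
lengths = [len(token) for token in doc]

# Token.check_flag: look up a Boolean flag by its attribute ID.
alpha_flags = [token.check_flag(IS_ALPHA) for token in doc]

for token, length, is_alpha in zip(doc, lengths, alpha_flags):
    print(token.text, length, is_alpha)
```

For the built-in flags, check_flag(IS_ALPHA) gives the same answer as the is_alpha attribute; the method form is mainly useful when the flag ID is only known at run time.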

ClassMethods

The following are the classmethods of the Token class −

Sr.No.	Classmethod & Description
1	Token.set_extension It defines a custom attribute on the Token.
2	Token.get_extension It will look up a previously registered extension by name.
3	Token.has_extension It will check whether an extension has been registered on the Token class or not.
4	Token.remove_extension It will remove a previously registered extension on the Token class.
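The four classmethods can be sketched together as below, assuming spaCy is installed. The extension name is_fruit, the FRUITS set and the sample sentence are hypothetical and exist only for this illustration; force=True is passed so the snippet can be re-run without an "already registered" error.

```python
import spacy
from spacy.tokens import Token

# Hypothetical lookup set used only for this example.
FRUITS = {"apple", "pear"}

# Register a custom attribute computed by a getter function.
Token.set_extension(
    "is_fruit",
    getter=lambda token: token.lower_ in FRUITS,
    force=True,
)

nlp = spacy.blank("en")
doc = nlp("I ate an apple")

# Custom attributes live in the token._ namespace.
fruit_flags = [token._.is_fruit for token in doc]
print(fruit_flags)

# has_extension reports whether the name is registered.
registered = Token.has_extension("is_fruit")

# get_extension returns a (default, method, getter, setter) tuple.
default, method, getter, setter = Token.get_extension("is_fruit")

# remove_extension unregisters the attribute again.
Token.remove_extension("is_fruit")
still_registered = Token.has_extension("is_fruit")
```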
