vocab
-
class Vocab(counter=None, max_size=None, min_freq=1, token_to_idx=None, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)
Bases: object
Vocab maps between text tokens and ids.
- Parameters
counter (collections.Counter, optional) – A Counter instance holding the tokens and their frequencies. Its keys are indexed in order of sorted frequency to construct the mapping. If None, token_to_idx must be provided as the mapping. Default: None.
max_size (int, optional) – Max size of vocab, not including special tokens. Default: None.
min_freq (int) – Ignore tokens whose frequencies are less than min_freq. Default: 1.
token_to_idx (dict, optional) – A dict specifying the mapping between tokens and indices to be used. If provided, the token-index mapping is adjusted according to it. If None, counter must be provided. Default: None.
unk_token (str) – Special token for unknown tokens, such as ‘<unk>’. May be None if not needed. Default: None.
pad_token (str) – Special token for padding, such as ‘<pad>’. May be None if not needed. Default: None.
bos_token (str) – Special token marking the beginning of a sequence, such as ‘<bos>’. May be None if not needed. Default: None.
eos_token (str) – Special token marking the end of a sequence, such as ‘<eos>’. May be None if not needed. Default: None.
**kwargs (dict) – Keyword arguments ending with _token. They can be used to specify further special tokens, each of which is exposed as an attribute of the vocabulary and associated with an index.
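To make the interaction between counter, min_freq and max_size concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the helper name `build_mapping`, the choice to index special tokens first, and the alphabetical tie-break between equally frequent tokens are all assumptions made for this example.

```python
from collections import Counter


def build_mapping(counter, max_size=None, min_freq=1, special_tokens=()):
    """Hypothetical helper: index tokens by descending frequency."""
    # Assumption: special tokens are indexed first, in the order given.
    token_to_idx = {tok: i for i, tok in enumerate(special_tokens)}
    # Sort by frequency (descending), then alphabetically to break ties.
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = 0
    for token, freq in ranked:
        if freq < min_freq:
            break  # every remaining token is at least as rare
        if max_size is not None and kept >= max_size:
            break  # max_size does not count special tokens
        if token not in token_to_idx:
            token_to_idx[token] = len(token_to_idx)
            kept += 1
    return token_to_idx


counter = Counter(["a", "b", "a", "c", "a", "b"])
mapping = build_mapping(counter, min_freq=2, special_tokens=("<unk>", "<pad>"))
```

Here "c" occurs only once, so min_freq=2 drops it, while the two special tokens are kept regardless of frequency.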
-
to_tokens(indices)
Map the input indices to a list of tokens.
- Parameters
indices (list|tuple|int) – Input indices for mapping.
- Returns
obtained token(s).
- Return type
list|str
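The list|str return type suggests that a single index yields a single token while a sequence yields a list. A small sketch of that behavior, using a hypothetical free function rather than the real method:

```python
def to_tokens(idx_to_token, indices):
    """Hypothetical stand-in for the lookup described above."""
    if isinstance(indices, int):
        return idx_to_token[indices]  # single index -> single token
    return [idx_to_token[i] for i in indices]  # list/tuple -> list of tokens


idx_to_token = {0: "<unk>", 1: "hello", 2: "world"}
single = to_tokens(idx_to_token, 1)
several = to_tokens(idx_to_token, (2, 1))
```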
-
to_indices(tokens)
Map the input tokens to indices.
- Parameters
tokens (list|tuple, optional) – Input tokens for mapping.
- Returns
Obtained index or list of indices.
- Return type
list|int
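The reverse lookup can be sketched the same way. The fallback to the index of unk_token for tokens missing from the mapping is an assumption based on the description of unk_token, not confirmed behavior, and the function below is a hypothetical stand-in.

```python
def to_indices(token_to_idx, tokens, unk_token=None):
    """Hypothetical stand-in: map tokens to indices."""

    def lookup(token):
        if token in token_to_idx:
            return token_to_idx[token]
        # Assumption: unseen tokens fall back to the index of unk_token.
        if unk_token is None:
            raise KeyError(token)
        return token_to_idx[unk_token]

    if isinstance(tokens, (list, tuple)):
        return [lookup(t) for t in tokens]
    return lookup(tokens)  # a single token yields a single index


token_to_idx = {"<unk>": 0, "hello": 1, "world": 2}
ids = to_indices(token_to_idx, ["hello", "there"], unk_token="<unk>")
```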
-
property idx_to_token
Return the index-to-token dict.
-
property token_to_idx
Return the token-to-index dict.
-
to_json(path=None)
Summarize the vocab as a JSON string. If path is given, the JSON string is also saved to that file.
- Parameters
path – The path at which to save the JSON string. If None, the JSON is not saved. Default: None.
- Returns
JSON string.
- Return type
str
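The "return the string, optionally also write it" pattern can be sketched with the standard json module. The fields stored in the summary and the function below are assumptions for illustration; the actual JSON layout is not specified here.

```python
import json
import os
import tempfile


def to_json(token_to_idx, unk_token=None, path=None):
    """Hypothetical stand-in: serialize a vocab summary to a JSON string."""
    # Assumption: the summary holds the mapping and the unknown token.
    json_str = json.dumps({"token_to_idx": token_to_idx, "unk_token": unk_token})
    if path is not None:
        with open(path, "w", encoding="utf-8") as f:
            f.write(json_str)
    return json_str  # returned whether or not it was saved


with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "vocab.json")
    json_str = to_json({"<unk>": 0, "hi": 1}, unk_token="<unk>", path=json_path)
    with open(json_path, encoding="utf-8") as f:
        saved = f.read()
```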
-
classmethod from_json(json_str)
Load a vocab from a JSON string or JSON file.
- Parameters
json_str (str) – JSON string, or path of a file containing the JSON string.
- Returns
Vocab generated from the information contained in the JSON string.
- Return type
Vocab
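Since json_str may be either the JSON itself or a file path, the loader has to tell the two apart. One plausible way, shown here as a sketch with a hypothetical helper, is to check whether the argument names an existing file; how the library actually decides is not stated above.

```python
import json
import os


def load_vocab_dict(json_str):
    """Hypothetical helper: accept a JSON string or a path to a JSON file."""
    # Assumption: an argument naming an existing file is treated as a path.
    if os.path.isfile(json_str):
        with open(json_str, encoding="utf-8") as f:
            json_str = f.read()
    return json.loads(json_str)


loaded = load_vocab_dict('{"token_to_idx": {"<unk>": 0, "hi": 1}, "unk_token": "<unk>"}')
```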
-
classmethod from_dict(token_to_idx, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)
Generate a vocab from a dict.
- Parameters
token_to_idx (dict) – A dict describing the mapping from tokens to indices.
unk_token (str) – Special token for unknown tokens. May be None if not needed. Default: None.
pad_token (str) – Special token for padding. May be None if not needed. Default: None.
bos_token (str) – Special token marking the beginning of a sequence. May be None if not needed. Default: None.
eos_token (str) – Special token marking the end of a sequence. May be None if not needed. Default: None.
**kwargs (dict) – Keyword arguments ending with _token. They can be used to specify further special tokens, each of which is exposed as an attribute of the vocabulary and associated with an index.
- Returns
Vocab generated from the given dict and special tokens.
- Return type
Vocab
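The handling of keyword arguments ending with _token can be illustrated with a much-reduced stand-in class. `TinyVocab` is hypothetical, and assigning the next free index to a special token that is missing from the dict is an assumption of this sketch.

```python
class TinyVocab:
    """Hypothetical, much-reduced stand-in for Vocab."""

    def __init__(self, token_to_idx, **kwargs):
        self.token_to_idx = dict(token_to_idx)
        for name, token in kwargs.items():
            # Only keyword arguments ending with "_token" name special tokens.
            if not name.endswith("_token") or token is None:
                continue
            # Assumption: a special token missing from the dict gets the next free index.
            self.token_to_idx.setdefault(token, len(self.token_to_idx))
            # Expose the special token as an attribute, e.g. vocab.mask_token.
            setattr(self, name, token)

    @classmethod
    def from_dict(cls, token_to_idx, **kwargs):
        return cls(token_to_idx, **kwargs)


vocab = TinyVocab.from_dict({"<unk>": 0, "hi": 1}, unk_token="<unk>", mask_token="<mask>")
```

After this call, `vocab.mask_token` holds the extra special token and `vocab.token_to_idx` associates it with an index.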
-
static build_vocab(iterator, max_size=None, min_freq=1, token_to_idx=None, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)
Build a vocab according to the given iterator and other information. Iterates over iterator to construct a Counter, then builds the vocab from it as __init__ does.
- Parameters
iterator (collections.Iterable) – Iterator of tokens. Each element should be a list of tokens if a word-level vocab is needed.
max_size (int, optional) – Max size of vocab, not including special tokens. Default: None.
min_freq (int) – Ignore tokens whose frequencies are less than min_freq. Default: 1.
token_to_idx (dict, optional) – A dict specifying the mapping between tokens and indices to be used. If provided, the token-index mapping is adjusted according to it. If None, counter must be provided. Default: None.
unk_token (str) – Special token for unknown tokens, such as ‘<unk>’. May be None if not needed. Default: None.
pad_token (str) – Special token for padding, such as ‘<pad>’. May be None if not needed. Default: None.
bos_token (str) – Special token marking the beginning of a sequence, such as ‘<bos>’. May be None if not needed. Default: None.
eos_token (str) – Special token marking the end of a sequence, such as ‘<eos>’. May be None if not needed. Default: None.
**kwargs (dict) – Keyword arguments ending with _token. They can be used to specify further special tokens, each of which is exposed as an attribute of the vocabulary and associated with an index.
- Returns
Vocab generated from the given iterator and other information.
- Return type
Vocab
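The counting step can be sketched with collections.Counter. The helper name `count_tokens` is hypothetical; it only illustrates why each element should be a list of tokens for a word-level vocab.

```python
from collections import Counter


def count_tokens(iterator):
    """Hypothetical helper: accumulate token frequencies over an iterator."""
    counts = Counter()
    for tokens in iterator:
        # Counter.update counts each item of the iterable it is given.
        counts.update(tokens)
    return counts


corpus = [["the", "cat"], ["the", "dog"]]
word_counts = count_tokens(corpus)
char_counts = count_tokens(["the", "cat"])  # plain strings are counted per character
```

Passing bare strings instead of token lists makes the counter operate on characters, which is the distinction the iterator description draws.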
-
static load_vocabulary(filepath, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)
Instantiate a Vocab from a file, preserving all tokens, by using Vocab.from_dict. The file contains one token per line, and the line number is the index of the corresponding token.
- Parameters
filepath (str) – Path of the file from which to construct the vocabulary.
unk_token (str) – Special token for unknown tokens. May be None if not needed. Default: None.
pad_token (str) – Special token for padding. May be None if not needed. Default: None.
bos_token (str) – Special token marking the beginning of a sequence. May be None if not needed. Default: None.
eos_token (str) – Special token marking the end of a sequence. May be None if not needed. Default: None.
**kwargs (dict) – Keyword arguments for Vocab.from_dict.
- Returns
An instance of Vocab.
- Return type
Vocab
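The file format described above (one token per line, line number as index) can be read into a dict suitable for Vocab.from_dict roughly as follows. The helper `read_token_file` is hypothetical, and counting lines from zero is an assumption of this sketch.

```python
import os
import tempfile


def read_token_file(filepath):
    """Hypothetical helper: one token per line, line number as index."""
    token_to_idx = {}
    with open(filepath, encoding="utf-8") as f:
        # Assumption: line numbers, and therefore indices, start at 0.
        for index, line in enumerate(f):
            token_to_idx[line.rstrip("\n")] = index
    return token_to_idx


with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("<unk>\nhello\nworld\n")
    file_mapping = read_token_file(vocab_path)
```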