vocab

class Vocab(counter=None, max_size=None, min_freq=1, token_to_idx=None, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)[source]

Bases: object

Vocab maps between text tokens and ids.

Parameters
  • counter (collections.Counter, optional) – A Counter instance describing the tokens and their frequencies. Its keys are indexed according to the order of frequency sorting to construct the mapping. If None, token_to_idx must be provided as the mapping. Default: None.

  • max_size (int, optional) – Max size of vocab, not including special tokens. Default: None.

  • min_freq (int) – Ignore tokens whose frequencies are less than min_freq. Default: 1.

  • token_to_idx (dict, optional) – A dict specifying the mapping between tokens and indices to be used. If provided, the token and index mapping is adjusted according to it. If None, counter must be provided. Default: None.

  • unk_token (str) – Special token for unknown tokens, e.g. ‘<unk>’. May be None if not needed. Default: None.

  • pad_token (str) – Special token for padding, e.g. ‘<pad>’. May be None if not needed. Default: None.

  • bos_token (str) – Special token marking the beginning of a sequence, e.g. ‘<bos>’. May be None if not needed. Default: None.

  • eos_token (str) – Special token marking the end of a sequence, e.g. ‘<eos>’. May be None if not needed. Default: None.

  • **kwargs (dict) – Keyword arguments ending with _token. They can be used to specify further special tokens, each of which is exposed as an attribute of the vocabulary and associated with an index.
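The interaction of counter, min_freq, max_size and the special tokens can be illustrated with a small standalone sketch. It does not call Vocab; index_from_counter is a hypothetical helper, and both the placement of special tokens at the lowest indices and the alphabetical tie-breaking are assumptions made for the example, not documented behavior.

```python
from collections import Counter


def index_from_counter(counter, max_size=None, min_freq=1, special_tokens=()):
    """Hypothetical helper: build a token -> index dict from token counts."""
    # Assumption: special tokens take the lowest indices, in the order given.
    token_to_idx = {tok: i for i, tok in enumerate(special_tokens)}
    # Sort by descending frequency; ties are broken alphabetically (an assumption).
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = 0
    for token, freq in ranked:
        if freq < min_freq:
            break  # ranked is frequency-sorted, so every remaining token is rarer
        if max_size is not None and kept >= max_size:
            break  # max_size does not count special tokens
        if token in token_to_idx:
            continue  # already present as a special token
        token_to_idx[token] = len(token_to_idx)
        kept += 1
    return token_to_idx


counts = Counter("the cat sat on the mat the end".split())
mapping = index_from_counter(
    counts, max_size=3, min_freq=1, special_tokens=("<pad>", "<unk>")
)
print(mapping)
```

With max_size=3, only the three highest-ranked regular tokens are kept in addition to the two special tokens.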

to_tokens(indices)[source]

Map the input indices to a list of tokens.

Parameters
  • indices (list|tuple|int) – Input indices for mapping.

Returns

obtained token(s).

Return type

list|str

to_indices(tokens)[source]

Map the input tokens to indices.

Parameters
  • tokens (list|tuple, optional) – Input tokens for mapping.

Returns

The obtained list of indices.

Return type

list|int
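As a rough illustration of the two lookups, the sketch below mimics to_indices and to_tokens with plain dicts. The functions are hypothetical stand-ins, not the library code; in particular, falling back to the unk_token index for unseen tokens is an assumption.

```python
token_to_idx = {"<unk>": 0, "hello": 1, "world": 2}
idx_to_token = {idx: tok for tok, idx in token_to_idx.items()}


def to_indices(tokens, unk_token="<unk>"):
    """Hypothetical stand-in: map a token or a list/tuple of tokens to indices."""
    # Assumption: tokens missing from the mapping fall back to the unk_token index.
    unk_idx = token_to_idx[unk_token]
    if isinstance(tokens, (list, tuple)):
        return [token_to_idx.get(tok, unk_idx) for tok in tokens]
    return token_to_idx.get(tokens, unk_idx)


def to_tokens(indices):
    """Hypothetical stand-in: map an index or a list/tuple of indices to tokens."""
    if isinstance(indices, (list, tuple)):
        return [idx_to_token[i] for i in indices]
    return idx_to_token[indices]


print(to_indices(["hello", "world", "missing"]))
print(to_tokens([1, 2]))
```

A sequence input yields a list and a scalar input yields a single value, which matches the list|str and list|int return types listed above.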

property idx_to_token

Return index-token dict

property token_to_idx

Return token-index dict

to_json(path=None)[source]

Summarize the vocab's information as a JSON string. If path is given, the JSON string is also saved to that file.

Parameters
  • path (str, optional) – The path to save the JSON string to. If None, the JSON is not saved. Default: None.

Returns

JSON string.

Return type

str

classmethod from_json(json_str)[source]

Load a vocab from a JSON string or a JSON file.

Parameters
  • json_str (str) – A JSON string, or the path of a file containing the JSON string.

Returns

vocab generated from information contained in JSON string.

Return type

Vocab
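The save and load round trip described by to_json and from_json might look like the following standalone sketch. vocab_to_json and vocab_from_json are hypothetical helpers, and the JSON layout used here is an assumption for illustration only; the format actually written by Vocab is not specified in this page.

```python
import json
import os
import tempfile


def vocab_to_json(token_to_idx, unk_token=None, path=None):
    """Hypothetical helper: serialize a mapping and optionally save it."""
    # Assumption: this payload layout is illustrative, not the library's format.
    json_str = json.dumps({"token_to_idx": token_to_idx, "unk_token": unk_token})
    if path is not None:
        with open(path, "w", encoding="utf-8") as f:
            f.write(json_str)
    return json_str


def vocab_from_json(json_str):
    """Hypothetical helper: accept either a JSON string or a path to a JSON file."""
    if os.path.isfile(json_str):
        with open(json_str, "r", encoding="utf-8") as f:
            json_str = f.read()
    return json.loads(json_str)


mapping = {"<unk>": 0, "hello": 1}
with tempfile.TemporaryDirectory() as tmp:
    saved_path = os.path.join(tmp, "vocab.json")
    as_string = vocab_to_json(mapping, unk_token="<unk>", path=saved_path)
    from_string = vocab_from_json(as_string)
    from_file = vocab_from_json(saved_path)
```

Both loading paths yield the same data, which is the property a caller relies on when passing either a string or a file path.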

classmethod from_dict(token_to_idx, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)[source]

Generate a vocab from a dict.

Parameters
  • token_to_idx (dict) – A dict describing the mapping between tokens and indices.

  • unk_token (str) – Special token for unknown tokens. May be None if not needed. Default: None.

  • pad_token (str) – Special token for padding. May be None if not needed. Default: None.

  • bos_token (str) – Special token marking the beginning of a sequence. May be None if not needed. Default: None.

  • eos_token (str) – Special token marking the end of a sequence. May be None if not needed. Default: None.

  • **kwargs (dict) – Keyword arguments ending with _token. They can be used to specify further special tokens, each of which is exposed as an attribute of the vocabulary and associated with an index.

Returns

vocab generated from the given dict and special tokens.

Return type

Vocab

static build_vocab(iterator, max_size=None, min_freq=1, token_to_idx=None, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)[source]

Build a vocab from the given iterator and other information. Iterates over the iterator to construct a Counter; the remaining arguments are handled as in __init__.

Parameters
  • iterator (collections.Iterable) – Iterator of tokens. Each element should be a list of tokens if a word-level vocab is needed.

  • max_size (int, optional) – Max size of vocab, not including special tokens. Default: None.

  • min_freq (int) – Ignore tokens whose frequencies are less than min_freq. Default: 1.

  • token_to_idx (dict, optional) – A dict specifying the mapping between tokens and indices to be used. If provided, the token and index mapping is adjusted according to it. If None, counter must be provided. Default: None.

  • unk_token (str) – Special token for unknown tokens, e.g. ‘<unk>’. May be None if not needed. Default: None.

  • pad_token (str) – Special token for padding, e.g. ‘<pad>’. May be None if not needed. Default: None.

  • bos_token (str) – Special token marking the beginning of a sequence, e.g. ‘<bos>’. May be None if not needed. Default: None.

  • eos_token (str) – Special token marking the end of a sequence, e.g. ‘<eos>’. May be None if not needed. Default: None.

  • **kwargs (dict) – Keyword arguments ending with _token. They can be used to specify further special tokens, each of which is exposed as an attribute of the vocabulary and associated with an index.

Returns

The vocab generated from the given iterator and other information.

Return type

Vocab
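The counting step that build_vocab performs can be sketched with the standard library alone. count_tokens is a hypothetical helper rather than the library's code; it shows why each element should be a list of tokens when a word-level vocab is wanted.

```python
from collections import Counter


def count_tokens(iterator):
    """Hypothetical helper: accumulate token frequencies over an iterator."""
    counter = Counter()
    for tokens in iterator:
        # Counter.update counts each item of the iterable it receives.
        counter.update(tokens)
    return counter


# Word level: each element is a list of tokens.
word_counts = count_tokens(iter([["a", "b", "a"], ["b", "a", "c"]]))

# Passing plain strings instead counts individual characters.
char_counts = count_tokens(iter(["ab", "a"]))

print(word_counts)
print(char_counts)
```

The resulting Counter is what the counter argument of the constructor expects.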

static load_vocabulary(filepath, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)[source]

Instantiate a Vocab from a file, preserving all of its tokens, by using Vocab.from_dict. The file contains one token per line, and the line number is the index of the corresponding token.

Parameters
  • filepath (str) – Path of the file to construct the vocabulary from.

  • unk_token (str) – Special token for unknown tokens. May be None if not needed. Default: None.

  • pad_token (str) – Special token for padding. May be None if not needed. Default: None.

  • bos_token (str) – Special token marking the beginning of a sequence. May be None if not needed. Default: None.

  • eos_token (str) – Special token marking the end of a sequence. May be None if not needed. Default: None.

  • **kwargs (dict) – Keyword arguments for Vocab.from_dict.

Returns

An instance of Vocab.

Return type

Vocab
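The file format expected by load_vocabulary (one token per line, with the line number as index) can be read into a dict as sketched below. read_token_file is a hypothetical helper, and counting lines from zero is an assumption; the resulting dict is the kind of mapping that Vocab.from_dict accepts.

```python
import os
import tempfile


def read_token_file(filepath):
    """Hypothetical helper: map each line's token to its line number."""
    token_to_idx = {}
    with open(filepath, "r", encoding="utf-8") as f:
        # Assumption: line numbers, and therefore indices, start at 0.
        for index, line in enumerate(f):
            token_to_idx[line.rstrip("\n")] = index
    return token_to_idx


with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("<pad>\n<unk>\nhello\nworld\n")
    loaded = read_token_file(vocab_path)

print(loaded)
```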