tokenizer

class GPT2Tokenizer(vocab_file, merges_file, errors='replace', special_tokens=None, max_len=None, do_lower_case=True)[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

set_special_tokens(special_tokens)[source]

Add a list of additional tokens to the encoder. The additional tokens are assigned ids starting immediately after the last index of the current vocabulary, in the order they appear in the special_tokens list.
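The indexing behavior can be sketched with a plain dict standing in for the real vocabulary (a toy illustration, not the PaddleNLP internals):

```python
# Toy sketch: added special tokens receive ids starting right after
# the current last vocabulary index, in list order.
vocab = {"hello": 0, "world": 1, "!": 2}

def set_special_tokens(vocab, special_tokens):
    start = len(vocab)  # first free index after the current vocab
    for offset, token in enumerate(special_tokens):
        vocab[token] = start + offset
    return vocab

set_special_tokens(vocab, ["<bos>", "<eos>"])
print(vocab["<bos>"], vocab["<eos>"])  # 3 4
```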

tokenize(text)[source]

Tokenize a string.

convert_tokens_to_ids(tokens)[source]

Converts a sequence of tokens into ids using the vocab.

convert_ids_to_tokens(ids, skip_special_tokens=False)[source]

Converts a single index or a sequence of indices (integers) into a token or a sequence of tokens (str) using the vocabulary.

Parameters

skip_special_tokens (bool, optional) – Whether to skip decoding special tokens (self.all_special_tokens). Default: False
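The round trip between these two methods can be sketched with a toy vocabulary (illustrative only; the real GPT-2 vocab is loaded from vocab_file):

```python
# Toy vocabulary illustrating token/id conversion in both directions.
vocab = {"Hello": 0, "Ġworld": 1, "<|endoftext|>": 2}
inv_vocab = {i: t for t, i in vocab.items()}
special_tokens = {"<|endoftext|>"}

def convert_tokens_to_ids(tokens):
    return [vocab[t] for t in tokens]

def convert_ids_to_tokens(ids, skip_special_tokens=False):
    tokens = [inv_vocab[i] for i in ids]
    if skip_special_tokens:
        # Drop tokens in self.all_special_tokens (here: a toy set).
        tokens = [t for t in tokens if t not in special_tokens]
    return tokens

ids = convert_tokens_to_ids(["Hello", "Ġworld", "<|endoftext|>"])
print(ids)  # [0, 1, 2]
print(convert_ids_to_tokens(ids, skip_special_tokens=True))  # ['Hello', 'Ġworld']
```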

encode(text, fn=None)[source]

Returns a dictionary containing the encoded sequence or sequence pair and additional information: the mask for sequence classification and the overflowing elements if a max_seq_len is specified.

Parameters
  • text (str, List[str] or List[int]) – The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids method)

  • text_pair (str, List[str] or List[int], optional, defaults to None) – Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids method)

  • max_seq_len (int, optional, defaults to 512) – If set to a number, will limit the total sequence returned so that it has a maximum length. If there are overflowing tokens, those will be added to the returned dictionary

  • pad_to_max_seq_len (bool, optional, defaults to False) – If set to True, the returned sequences will be padded according to the model’s padding side and padding index, up to their max length. If no max length is specified, the padding is done up to the model’s max length.

  • truncation_strategy (str, optional, defaults to longest_first) –

    String selected in the following options:

    • 'longest_first' (default): Iteratively removes one token from the longest sequence at each step until the total length is under max_seq_len (when there is a pair of input sequences)

    • 'only_first': Only truncate the first sequence

    • 'only_second': Only truncate the second sequence

    • 'do_not_truncate': Does not truncate (raises an error if the input sequence is longer than max_seq_len)

  • return_position_ids (bool, optional, defaults to False) – Whether to return token position ids.

  • return_token_type_ids (bool, optional, defaults to True) – Whether to return token type IDs.

  • return_attention_mask (bool, optional, defaults to False) – Whether to return the attention mask.

  • return_length (bool, optional, defaults to False) – If set to True, the resulting dictionary will include the length of the encoded inputs.

  • return_overflowing_tokens (bool, optional, defaults to False) – Set to True to return overflowing token information.

  • return_special_tokens_mask (bool, optional, defaults to False) – Set to True to return special tokens mask information.

Returns

A Dictionary of shape:

{
    input_ids: list[int],
    position_ids: list[int] if return_position_ids is True
    token_type_ids: list[int] if return_token_type_ids is True (default)
    attention_mask: list[int] if return_attention_mask is True
    seq_len: int if return_length is True
    overflowing_tokens: list[int] if a ``max_seq_len`` is specified and return_overflowing_tokens is True
    num_truncated_tokens: int if a ``max_seq_len`` is specified and return_overflowing_tokens is True
    special_tokens_mask: list[int] if return_special_tokens_mask is True
}

With the fields:

  • input_ids: list of token ids to be fed to a model

  • position_ids: list of token position ids to be fed to a model

  • token_type_ids: list of token type ids to be fed to a model

  • attention_mask: list of indices specifying which tokens should be attended to by the model

  • seq_len: the input_ids length

  • overflowing_tokens: list of overflowing tokens if a max length is specified.

  • num_truncated_tokens: number of truncated tokens if a max_seq_len is specified

  • special_tokens_mask: list of [0, 1], with 0 specifying special added tokens and 1 specifying sequence tokens.
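The 'longest_first' strategy described above can be sketched as a small self-contained function (illustrative logic only; the real encode() additionally handles special tokens, padding, and the various masks):

```python
# Sketch of 'longest_first' truncation: remove one token from the
# longer sequence at each step until the pair fits under max_seq_len.
def truncate_longest_first(ids, pair_ids, max_seq_len):
    overflowing = []
    while len(ids) + len(pair_ids) > max_seq_len:
        if len(ids) >= len(pair_ids):
            overflowing.append(ids.pop())
        else:
            overflowing.append(pair_ids.pop())
    return ids, pair_ids, overflowing

ids, pair, over = truncate_longest_first(list(range(6)), list(range(3)), 7)
print(len(ids), len(pair), len(over))  # 4 3 2
```

The two removed tokens would be reported back as overflowing_tokens, and their count as num_truncated_tokens.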

save_resources(save_directory)[source]

Save tokenizer related resources to files under save_directory.

Parameters

save_directory (str) – Directory to save files into.

class GPT2ChineseTokenizer(vocab_file, model_file, do_lower_case=True, max_len=512, bod_id='<bod>', eod_id='<eod>', max_length=None)[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

Constructs a GPT2 Chinese tokenizer. It uses a basic tokenizer to do punctuation splitting, lower casing and so on, then applies a WordPiece tokenizer to split words into subwords.
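The WordPiece step can be sketched as a greedy longest-match-first lookup (toy vocabulary below; the real tokenizer loads its vocabulary from model_file):

```python
# Greedy longest-match-first WordPiece sketch (illustrative only).
def wordpiece(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces get a prefix
            if piece in vocab:
                cur = piece
                break
            end -= 1  # shrink the candidate until it is in the vocab
        if cur is None:
            return [unk]  # no prefix matched: whole word is unknown
        tokens.append(cur)
        start = end
    return tokens

vocab = {"un", "##aff", "##able"}
print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
```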

tokenize(text)[source]

Tokenize a string.

encode(text)[source]

Returns a dictionary containing the encoded sequence or sequence pair and additional information: the mask for sequence classification and the overflowing elements if a max_seq_len is specified.

Parameters
  • text (str, List[str] or List[int]) – The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids method)

  • text_pair (str, List[str] or List[int], optional, defaults to None) – Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids method)

  • max_seq_len (int, optional, defaults to 512) – If set to a number, will limit the total sequence returned so that it has a maximum length. If there are overflowing tokens, those will be added to the returned dictionary

  • pad_to_max_seq_len (bool, optional, defaults to False) – If set to True, the returned sequences will be padded according to the model’s padding side and padding index, up to their max length. If no max length is specified, the padding is done up to the model’s max length.

  • truncation_strategy (str, optional, defaults to longest_first) –

    String selected in the following options:

    • 'longest_first' (default): Iteratively removes one token from the longest sequence at each step until the total length is under max_seq_len (when there is a pair of input sequences)

    • 'only_first': Only truncate the first sequence

    • 'only_second': Only truncate the second sequence

    • 'do_not_truncate': Does not truncate (raises an error if the input sequence is longer than max_seq_len)

  • return_position_ids (bool, optional, defaults to False) – Whether to return token position ids.

  • return_token_type_ids (bool, optional, defaults to True) – Whether to return token type IDs.

  • return_attention_mask (bool, optional, defaults to False) – Whether to return the attention mask.

  • return_length (bool, optional, defaults to False) – If set to True, the resulting dictionary will include the length of the encoded inputs.

  • return_overflowing_tokens (bool, optional, defaults to False) – Set to True to return overflowing token information.

  • return_special_tokens_mask (bool, optional, defaults to False) – Set to True to return special tokens mask information.

Returns

A Dictionary of shape:

{
    input_ids: list[int],
    position_ids: list[int] if return_position_ids is True
    token_type_ids: list[int] if return_token_type_ids is True (default)
    attention_mask: list[int] if return_attention_mask is True
    seq_len: int if return_length is True
    overflowing_tokens: list[int] if a ``max_seq_len`` is specified and return_overflowing_tokens is True
    num_truncated_tokens: int if a ``max_seq_len`` is specified and return_overflowing_tokens is True
    special_tokens_mask: list[int] if return_special_tokens_mask is True
}

With the fields:

  • input_ids: list of token ids to be fed to a model

  • position_ids: list of token position ids to be fed to a model

  • token_type_ids: list of token type ids to be fed to a model

  • attention_mask: list of indices specifying which tokens should be attended to by the model

  • seq_len: the input_ids length

  • overflowing_tokens: list of overflowing tokens if a max length is specified.

  • num_truncated_tokens: number of truncated tokens if a max_seq_len is specified

  • special_tokens_mask: list of [0, 1], with 0 specifying special added tokens and 1 specifying sequence tokens.
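The pad_to_max_seq_len behavior and its relation to attention_mask can be sketched as follows (illustrative; the real pad token id and padding side come from the tokenizer/model configuration):

```python
# Sketch of right-side padding up to max_seq_len, with the matching
# attention mask (1 = real token, 0 = padding). Illustrative only.
def pad(input_ids, max_seq_len, pad_id=0):
    n_pad = max_seq_len - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    return input_ids + [pad_id] * n_pad, attention_mask

ids, mask = pad([5, 6, 7], 6)
print(ids)   # [5, 6, 7, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0]
```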

convert_tokens_to_ids(text)[source]

Converts a sequence of tokens into ids using the vocab. The tokenizer should have the vocab attribute.

Parameters

tokens (list[str]) – List of tokens.

Returns

Converted id list.

Return type

list

convert_ids_to_tokens(tokens)[source]

Converts a single index or a sequence of indices (integers) into a token or a sequence of tokens (str) using the vocabulary.

Parameters

skip_special_tokens (bool, optional) – Whether to skip decoding special tokens (self.all_special_tokens). Default: False

save_resources(save_directory)[source]

Save tokenizer related resources to files under save_directory.

Parameters

save_directory (str) – Directory to save files into.