tokenizer

class UnifiedTransformerTokenizer(vocab_file, sentencepiece_model_file, do_lower_case=False, unk_token='[UNK]', pad_token='[PAD]', cls_token='[CLS]', sep_token='[SEP]', mask_token='[MASK]', chitchat_token='[CHAT]', knowledge_token='[KNOW]', recommend_token='[RECO]', special_tokens_file='')

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

property vocab_size

Returns: The size of the vocabulary.
Return type: int

preprocess_text(inputs, remove_space=True, lower=False)

Preprocesses the input text by removing extra spaces and normalizing it.

clean_text(text)

Performs invalid character removal and whitespace cleanup on text.

encode_pieces(spm_model, text, return_unicode=True, sample=False)

Turns sentences into word pieces.

tokenize(text)

End-to-end tokenization for UnifiedTransformer models.

Parameters: text (str) – The text to be tokenized.

Returns: A list of strings representing the converted tokens.
Return type: list

merge_subword(tokens)

Merges subword tokens.

convert_tokens_to_string(tokens, keep_space=True)

Converts a sequence of tokens (a list of strings) into a single string. Because WordPiece introduces __ to concatenate subwords, __ is also removed during conversion.

Parameters: tokens (list) – A list of strings representing the tokens to be converted.

Returns: Converted string from tokens.
Return type: str

convert_ids_to_string(ids, keep_space=True)

Converts ids to a string.

num_special_tokens_to_add(pair=False)

Returns the number of added tokens when encoding a sequence with special tokens.

Note: This encodes inputs and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.

Parameters: pair (bool, optional) – If True, returns the number of tokens added for a sequence pair; if False, returns the number added for a single sequence. Default False.
Returns: Number of tokens added to sequences.
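The "encode and count" approach described in the note can be sketched in plain Python. This is an illustration only, not the library's implementation: the `CLS_ID` and `SEP_ID` values and both helper names are hypothetical, and the layouts are the `[CLS] X [SEP]` and `[CLS] A [SEP] B [SEP]` formats documented for build_inputs_with_special_tokens.

```python
CLS_ID, SEP_ID = 1, 2  # hypothetical special-token IDs, for illustration only


def encode_with_special_tokens(ids_0, ids_1=None):
    """Lay out IDs as [CLS] X [SEP] or [CLS] A [SEP] B [SEP]."""
    out = [CLS_ID] + list(ids_0) + [SEP_ID]
    if ids_1 is not None:
        out += list(ids_1) + [SEP_ID]
    return out


def count_added_tokens(pair=False):
    # Encode empty inputs; whatever remains are the special tokens.
    return len(encode_with_special_tokens([], [] if pair else None))


single_count = count_added_tokens(pair=False)
pair_count = count_added_tokens(pair=True)
```

Because every call performs a full encode, a sketch like this also shows why the result is better computed once and cached than recomputed per batch.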

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)

Builds model inputs from a sequence or a pair of sequences by concatenating them and adding special tokens. A UnifiedTransformer sequence has the following format:

- single sequence: ``[CLS] X [SEP]``
- pair of sequences: ``[CLS] A [SEP] B [SEP]``

Parameters

token_ids_0 (list) – List of IDs to which the special tokens will be added.
token_ids_1 (list, optional) – Optional second list of IDs for sequence pairs. Default None.

Returns

List of input_ids with the appropriate special tokens.

Return type

list
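A minimal sketch of the two layouts above, assuming hypothetical IDs for `[CLS]` and `[SEP]` (real IDs come from the vocabulary file). The function name `build_inputs_sketch` is illustrative and is not part of the library.

```python
CLS_ID, SEP_ID = 1, 2  # hypothetical special-token IDs


def build_inputs_sketch(token_ids_0, token_ids_1=None):
    """Concatenate IDs following the documented format."""
    # single sequence: [CLS] X [SEP]
    inputs = [CLS_ID] + list(token_ids_0) + [SEP_ID]
    if token_ids_1 is not None:
        # pair of sequences: [CLS] A [SEP] B [SEP]
        inputs += list(token_ids_1) + [SEP_ID]
    return inputs


single = build_inputs_sketch([10, 11, 12])
pair = build_inputs_sketch([10, 11], [20, 21])
```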

build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)

Builds an offset mapping from a single offset mapping or a pair of offset mappings by concatenating them and adding the offsets of special tokens. A UnifiedTransformer offset_mapping has the following format:

- single sequence: ``(0,0) X (0,0)``
- pair of sequences: ``(0,0) A (0,0) B (0,0)``

Parameters

offset_mapping_0 (list) – List of char offsets to which the special tokens will be added.
offset_mapping_1 (list, optional) – Optional second list of char offsets for offset mapping pairs. Default None.

Returns

List of char offsets with the appropriate offsets of special tokens.

Return type

list
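The offset layout can be sketched the same way. This is an illustrative reading of the format above, not the library's code: each special token is given the placeholder offset `(0, 0)`, and the name `build_offsets_sketch` is hypothetical.

```python
SPECIAL_OFFSET = (0, 0)  # placeholder offset used for special tokens


def build_offsets_sketch(offset_mapping_0, offset_mapping_1=None):
    """Wrap character offsets with (0, 0) entries for special tokens."""
    # single sequence: (0,0) X (0,0)
    offsets = [SPECIAL_OFFSET] + list(offset_mapping_0) + [SPECIAL_OFFSET]
    if offset_mapping_1 is not None:
        # pair of sequences: (0,0) A (0,0) B (0,0)
        offsets += list(offset_mapping_1) + [SPECIAL_OFFSET]
    return offsets


single_offsets = build_offsets_sketch([(0, 2), (2, 5)])
pair_offsets = build_offsets_sketch([(0, 2)], [(0, 3), (3, 4)])
```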

create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)

Creates the token_type_ids from the two sequences passed for the model.

A UnifiedTransformer sequence token_type_ids has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |

If token_ids_1 is None, this method only returns the first portion (0s).

Parameters

token_ids_0 (list) – List of IDs.
token_ids_1 (list, optional) – Optional second list of IDs for sequence pairs. Default None.

Returns

List of token_type_id according to the given sequence(s).

Return type

list
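A hedged sketch of the 0/1 segment pattern follows. It assumes that `[CLS]` and the first `[SEP]` are counted with the first sequence and the trailing `[SEP]` with the second; the documentation above does not state how special tokens are assigned, so treat that convention and the name `token_type_ids_sketch` as assumptions.

```python
def token_type_ids_sketch(token_ids_0, token_ids_1=None):
    """Return 0s for the first segment and 1s for the second."""
    # ASSUMPTION: [CLS] and the first [SEP] belong to segment 0.
    first_segment = [0] * (len(token_ids_0) + 2)
    if token_ids_1 is None:
        return first_segment
    # ASSUMPTION: the trailing [SEP] belongs to segment 1.
    return first_segment + [1] * (len(token_ids_1) + 1)


single_types = token_type_ids_sketch([10, 11])
pair_types = token_type_ids_sketch([10, 11], [20, 21])
```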

get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)

Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.

Parameters

token_ids_0 (list) – List of IDs.
token_ids_1 (list, optional) – Optional second list of IDs for sequence pairs. Default None.
already_has_special_tokens (bool, optional) – Whether or not the token list is already formatted with special tokens for the model. Default False.

Returns

A list of integers, each either 0 or 1: 1 for a special token, 0 for a sequence token.

Return type

list
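As an illustration of the returned mask, the sketch below marks special-token positions with 1 under the `[CLS] A [SEP] B [SEP]` layout. The ID values and the function name are hypothetical, and the `already_has_special_tokens=True` branch shows one plausible reading (flag any ID that equals a special-token ID) rather than the library's exact behavior.

```python
CLS_ID, SEP_ID = 1, 2  # hypothetical special-token IDs


def special_tokens_mask_sketch(token_ids_0, token_ids_1=None,
                               already_has_special_tokens=False):
    """Return 1 at special-token positions and 0 elsewhere."""
    if already_has_special_tokens:
        # ASSUMPTION: the input is already formatted, so flag IDs in place.
        return [1 if t in (CLS_ID, SEP_ID) else 0 for t in token_ids_0]
    # [CLS] A [SEP]
    mask = [1] + [0] * len(token_ids_0) + [1]
    if token_ids_1 is not None:
        # ... B [SEP]
        mask += [0] * len(token_ids_1) + [1]
    return mask


mask_single = special_tokens_mask_sketch([10, 11, 12])
mask_pair = special_tokens_mask_sketch([10, 11], [20])
mask_formatted = special_tokens_mask_sketch(
    [1, 10, 11, 2], already_has_special_tokens=True)
```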

save_resources(save_directory)

Saves tokenizer-related resources to files under save_directory.

Parameters: save_directory (str) – Directory to save files into.

static load_vocabulary(filepath, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)

Instantiates a Vocab from a file, preserving all tokens, by using Vocab.from_dict. The file contains one token and its index per line, separated by ‘ ‘.

Parameters

filepath (str) – Path of the file from which to construct the vocabulary.
unk_token (str) – Special token for unknown tokens. If not needed, it can be None. Default: None.
pad_token (str) – Special token for padding. If not needed, it can be None. Default: None.
bos_token (str) – Special token for the beginning of a sequence. If not needed, it can be None. Default: None.
eos_token (str) – Special token for the end of a sequence. If not needed, it can be None. Default: None.
**kwargs (dict) – Keyword arguments for Vocab.from_dict.

Returns

An instance of Vocab.

Return type

Vocab
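The file format described above (one token and its index per line) can be read with a few lines of plain Python. This sketch builds an ordinary dict as a stand-in for a Vocab instance and assumes the separator is whitespace (a tab in the sample data); the helper name `read_token_index_lines` and the sample tokens are hypothetical.

```python
import io


def read_token_index_lines(lines):
    """Parse 'token<separator>index' lines into a token-to-index dict."""
    token_to_idx = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # ASSUMPTION: the index is the last whitespace-separated field.
        token, index = line.rsplit(maxsplit=1)
        token_to_idx[token] = int(index)
    return token_to_idx


# In-memory sample standing in for a vocabulary file.
sample_file = io.StringIO("[PAD]\t0\n[CLS]\t1\n[SEP]\t2\nhello\t3\n")
vocab = read_token_index_lines(sample_file)
```

A dict of this shape is the kind of input a `from_dict`-style constructor would consume, which is why every token in the file is preserved with its original index.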

dialogue_encode(history, response=None, knowledge=None, task_type='chitchat', max_seq_len=512, max_response_len=128, max_knowledge_len=128, return_position_ids=True, return_token_type_ids=True, return_attention_mask=True, return_length=False, add_start_token_as_response=False, pad_to_max_seq_len=False, return_tensors=False)

Main method for encoding a single-turn or multi-turn dialogue conversation. It returns a dictionary containing the encoded sequence and other related information that meets the input format requirements of the UnifiedTransformer model. See details at https://github.com/PaddlePaddle/Knover/tree/luge-dialogue/luge-dialogue

Parameters

history (str|list|tuple) – The history of dialogue conversation. It is an utterance or list of utterances to be encoded. Each utterance is a string.
response (str, optional) – The response of dialogue conversation. It should be set when training the model. It should not be set when running inference. Default None.
knowledge (str, optional) – The knowledge information of dialogue conversation. It should be set if the task_type is “knowledge” or “recommend”. Default None.
task_type (str, optional) – The type of dialogue conversation. It is one of “chitchat”, “knowledge” and “recommend”. They represent the chitchat dialogue, knowledge grounded dialogue and conversational recommendation respectively. Default “chitchat”.
max_seq_len (int, optional) – The maximum encoded sequence length. Default 512.
max_response_len (int, optional) – The maximum encoded sequence length of the input response. Default 128.
max_knowledge_len (int, optional) – The maximum encoded sequence length of the input knowledge. Default 128.
return_position_ids (bool, optional) – Whether to return the position_ids. Default True.
return_token_type_ids (bool, optional) – Whether to return the token_type_ids. Default True.
return_attention_mask (bool, optional) – Whether to return the attention_mask. Default True.
return_length (bool, optional) – Whether to return the length of the encoded sequence. Default False.
add_start_token_as_response (bool, optional) – Whether to add the special token [CLS] at the end of the sequence as the beginning of the response when running inference, forcing the model to start generating the response sequence. Default False.
pad_to_max_seq_len (bool, optional) – Whether to pad the returned sequences to the max_seq_len. Note that, in this method, returned sequences will be padded on the left. Default False.
return_tensors (bool, optional) – Whether to convert the returned sequences to Tensor. Default False.
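The left-padding behavior noted for pad_to_max_seq_len can be sketched in isolation. This is not the library's code: `PAD_ID` is a hypothetical padding ID, `left_pad` is an illustrative helper, and only the input IDs are padded here, whereas the real method may pad the other returned sequences as well.

```python
PAD_ID = 0  # hypothetical padding-token ID


def left_pad(ids, max_seq_len, pad_id=PAD_ID):
    """Pad a sequence on the left so the content stays at the end."""
    n_pad = max(0, max_seq_len - len(ids))
    return [pad_id] * n_pad + list(ids)


encoded = [1, 10, 11, 2]  # e.g. [CLS] X [SEP] with hypothetical IDs
padded = left_pad(encoded, max_seq_len=8)
```

Padding on the left keeps the last real token at the final position, which suits generation: the model continues from the end of the sequence rather than from a run of padding tokens.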