tokenizer
-
class
UnifiedTransformerTokenizer
(vocab_file, sentencepiece_model_file, do_lower_case=False, unk_token='[UNK]', pad_token='[PAD]', cls_token='[CLS]', sep_token='[SEP]', mask_token='[MASK]', chitchat_token='[CHAT]', knowledge_token='[KNOW]', recommend_token='[RECO]', special_tokens_file='')
Bases:
paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
-
property
vocab_size
Returns the size of the vocabulary.
- Returns
The size of the vocabulary.
- Return type
int
-
preprocess_text
(inputs, remove_space=True, lower=False)
Preprocesses the input text by removing extra spaces and normalizing it.
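The effect of the remove_space and lower flags can be pictured with a small standalone sketch. This is not the library's implementation: the helper name preprocess_text_sketch is hypothetical, and the real method may apply further normalization (for example Unicode normalization) that is not shown here.

```python
def preprocess_text_sketch(inputs: str, remove_space: bool = True, lower: bool = False) -> str:
    """Illustrative only: collapse whitespace runs and optionally lowercase."""
    outputs = inputs
    if remove_space:
        # str.split() with no arguments splits on any run of whitespace and
        # drops leading and trailing whitespace, so joining with a single
        # space removes the extra spaces.
        outputs = " ".join(outputs.split())
    if lower:
        outputs = outputs.lower()
    return outputs


print(preprocess_text_sketch("  Hello   World \n", lower=True))  # hello world
```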
-
encode_pieces
(spm_model, text, return_unicode=True, sample=False)
Turns sentences into word pieces.
-
tokenize
(text)
End-to-end tokenization for UnifiedTransformer models.
- Parameters
text (str) – The text to be tokenized.
- Returns
A list of strings representing the converted tokens.
- Return type
list
-
convert_tokens_to_string
(tokens, keep_space=True)
Converts a sequence of tokens (a list of strings) into a single string. Because WordPiece introduces __ to join subwords, __ is also removed during conversion.
- Parameters
tokens (list) – A list of strings representing the tokens to be converted.
- Returns
Converted string from tokens.
- Return type
str
-
num_special_tokens_to_add
(pair=False)
Returns the number of tokens added when encoding a sequence with special tokens.
Note: This encodes inputs and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.
- Parameters
pair (bool, optional) – Returns the number of added tokens in the case of a sequence pair if set to True, returns the number of added tokens in the case of a single sequence if set to False. Default False.
- Returns
Number of tokens added to sequences.
-
build_inputs_with_special_tokens
(token_ids_0, token_ids_1=None)
Builds model inputs from a sequence or a pair of sequences by concatenating them and adding special tokens. A UnifiedTransformer sequence has the following format:
- single sequence: [CLS] X [SEP]
- pair of sequences: [CLS] A [SEP] B [SEP]
- Parameters
token_ids_0 (list) – List of IDs to which the special tokens will be added.
token_ids_1 (list, optional) – Optional second list of IDs for sequence pairs. Default None.
- Returns
List of input_ids with the appropriate special tokens.
- Return type
list
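The two layouts described for build_inputs_with_special_tokens can be sketched in plain Python. This is a minimal illustration, not the library code: the ids CLS_ID and SEP_ID and the helper name build_inputs_sketch are hypothetical, and in practice the ids come from the tokenizer's vocabulary.

```python
from typing import List, Optional

# Hypothetical ids chosen for illustration; real ids come from the vocabulary.
CLS_ID = 1
SEP_ID = 2


def build_inputs_sketch(token_ids_0: List[int],
                        token_ids_1: Optional[List[int]] = None) -> List[int]:
    """Illustrative only: [CLS] X [SEP] or [CLS] A [SEP] B [SEP]."""
    if token_ids_1 is None:
        return [CLS_ID] + token_ids_0 + [SEP_ID]
    return [CLS_ID] + token_ids_0 + [SEP_ID] + token_ids_1 + [SEP_ID]


print(build_inputs_sketch([10, 11]))        # [1, 10, 11, 2]
print(build_inputs_sketch([10, 11], [20]))  # [1, 10, 11, 2, 20, 2]
```

Under this layout a single sequence gains two special tokens and a pair gains three, which is the count that num_special_tokens_to_add reports.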
-
build_offset_mapping_with_special_tokens
(offset_mapping_0, offset_mapping_1=None)
Builds an offset map from a single offset map or a pair of offset maps by concatenating them and adding the offsets of special tokens. A UnifiedTransformer offset_mapping has the following format:
- single sequence: (0,0) X (0,0)
- pair of sequences: (0,0) A (0,0) B (0,0)
- Parameters
offset_mapping_0 (list) – List of char offsets to which the special tokens will be added.
offset_mapping_1 (list, optional) – Optional second list of char offsets for offset mapping pairs. Default None.
- Returns
List of char offsets with the appropriate offsets of special tokens.
- Return type
list
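The offset layout can be illustrated the same way. The sketch below assumes each special token is represented by the placeholder offset (0, 0), as in the format shown for this method; the helper name build_offset_mapping_sketch is hypothetical.

```python
from typing import List, Optional, Tuple

Offset = Tuple[int, int]


def build_offset_mapping_sketch(offset_mapping_0: List[Offset],
                                offset_mapping_1: Optional[List[Offset]] = None) -> List[Offset]:
    """Illustrative only: wrap char offsets with (0, 0) for each special token."""
    special = [(0, 0)]
    if offset_mapping_1 is None:
        return special + offset_mapping_0 + special
    return special + offset_mapping_0 + special + offset_mapping_1 + special


print(build_offset_mapping_sketch([(0, 2), (2, 5)]))
# [(0, 0), (0, 2), (2, 5), (0, 0)]
```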
-
create_token_type_ids_from_sequences
(token_ids_0, token_ids_1=None)
Creates the token_type_ids from the two sequences passed to the model.
A UnifiedTransformer sequence token_type_ids has the following format:
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |
If token_ids_1 is None, this method only returns the first portion (0s).
- Parameters
token_ids_0 (list) – List of IDs.
token_ids_1 (list, optional) – Optional second list of IDs for sequence pairs. Default None
- Returns
List of token_type_id according to the given sequence(s).
- Return type
list
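A small sketch of the 0/1 layout follows. It assumes, by analogy with the [CLS] A [SEP] B [SEP] input format, that [CLS] and the first [SEP] are counted with the first sequence and the final [SEP] with the second; that split is an assumption, not something this page states. The helper name token_type_ids_sketch is hypothetical.

```python
from typing import List, Optional


def token_type_ids_sketch(token_ids_0: List[int],
                          token_ids_1: Optional[List[int]] = None) -> List[int]:
    """Illustrative only: 0s for the first segment, 1s for the second."""
    # Assumed: [CLS] + A + [SEP] all belong to the first segment.
    first = [0] * (len(token_ids_0) + 2)
    if token_ids_1 is None:
        return first
    # Assumed: B + [SEP] belong to the second segment.
    return first + [1] * (len(token_ids_1) + 1)


print(token_type_ids_sketch([10, 11], [20]))  # [0, 0, 0, 0, 1, 1]
```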
-
get_special_tokens_mask
(token_ids_0, token_ids_1=None, already_has_special_tokens=False)
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer's prepare_for_model method.
- Parameters
token_ids_0 (list) – List of IDs.
token_ids_1 (list, optional) – Optional second list of IDs for sequence pairs. Default None.
already_has_special_tokens (bool, optional) – Whether or not the token list is already formatted with special tokens for the model. Default False.
- Returns
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
- Return type
list
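For the default case (already_has_special_tokens=False), the mask can be sketched from the [CLS] A [SEP] B [SEP] layout: every position where a special token would be inserted gets 1, every original token gets 0. The helper name special_tokens_mask_sketch is hypothetical, and the already_has_special_tokens=True branch is deliberately left out.

```python
from typing import List, Optional


def special_tokens_mask_sketch(token_ids_0: List[int],
                               token_ids_1: Optional[List[int]] = None) -> List[int]:
    """Illustrative only: 1 marks a special-token position, 0 a sequence token."""
    mask = [1] + [0] * len(token_ids_0) + [1]  # [CLS] A [SEP]
    if token_ids_1 is not None:
        mask += [0] * len(token_ids_1) + [1]   # B [SEP]
    return mask


print(special_tokens_mask_sketch([10, 11], [20]))  # [1, 0, 0, 1, 0, 1]
```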
-
save_resources
(save_directory)
Saves tokenizer-related resources to files under save_directory.
- Parameters
save_directory (str) – Directory to save files into.
-
static
load_vocabulary
(filepath, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)
Instantiates an instance of Vocab from a file, reserving all tokens, by using Vocab.from_dict. The file contains one token and the index of that token per line, separated by ‘ ‘.
- Parameters
filepath (str) – Path of the file used to construct the vocabulary.
unk_token (str) – Special token for unknown tokens. If not needed, it can be None. Default: None.
pad_token (str) – Special token for padding. If not needed, it can be None. Default: None.
bos_token (str) – Special token for the bos token. If not needed, it can be None. Default: None.
eos_token (str) – Special token for the eos token. If not needed, it can be None. Default: None.
**kwargs (dict) – Keyword arguments for Vocab.from_dict.
- Returns
An instance of Vocab.
- Return type
Vocab
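The "one token and its index per line" file format can be pictured with a standalone parser. This is only a sketch under stated assumptions: the separator is taken to be any whitespace (the exact separator is not recoverable from this page), the helper name load_token_to_idx is hypothetical, and it returns a plain dict rather than a Vocab. Per the description above, the real method passes such a mapping on to Vocab.from_dict.

```python
import io
from typing import Dict, Iterable


def load_token_to_idx(lines: Iterable[str]) -> Dict[str, int]:
    """Illustrative only: parse 'token<sep>index' lines into a dict."""
    token_to_idx = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Assumption: token and index are separated by whitespace; split from
        # the right so the last field is always the index.
        token, index = line.rsplit(maxsplit=1)
        token_to_idx[token] = int(index)
    return token_to_idx


# An in-memory stand-in for a vocabulary file.
sample = io.StringIO("[PAD] 0\n[UNK] 1\nhello 2\n")
vocab = load_token_to_idx(sample)
print(vocab["hello"])  # 2
```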
-
dialogue_encode
(history, response=None, knowledge=None, task_type='chitchat', max_seq_len=512, max_response_len=128, max_knowledge_len=128, return_position_ids=True, return_token_type_ids=True, return_attention_mask=True, return_length=False, add_start_token_as_response=False, pad_to_max_seq_len=False, return_tensors=False)
Main method for encoding a single-turn or multi-turn dialogue conversation. It returns a dictionary containing the encoded sequence and other related information that meets the input format requirements of the UnifiedTransformer model. For details, see https://github.com/PaddlePaddle/Knover/tree/luge-dialogue/luge-dialogue
- Parameters
history (str|list|tuple) – The history of dialogue conversation. It is an utterance or list of utterances to be encoded. Each utterance is a string.
response (str, optional) – The response of dialogue conversation. It should be set when training the model. It should not be set when running inference. Default None.
knowledge (str, optional) – The knowledge information of dialogue conversation. It should be set if the task_type is “knowledge” or “recommend”. Default None.
task_type (str, optional) – The type of dialogue conversation. It is one of “chitchat”, “knowledge” and “recommend”. They represent the chitchat dialogue, knowledge grounded dialogue and conversational recommendation respectively. Default “chitchat”.
max_seq_len (int, optional) – The maximum encoded sequence length. Default 512.
max_response_len (int, optional) – The maximum encoded sequence length of the input response. Default 128.
max_knowledge_len (int, optional) – The maximum encoded sequence length of the input knowledge. Default 128.
return_position_ids (bool, optional) – Whether to return the position_ids. Default True.
return_token_type_ids (bool, optional) – Whether to return the token_type_ids. Default True.
return_attention_mask (bool, optional) – Whether to return the attention_mask. Default True.
return_length (bool, optional) – Whether to return the length of the encoded sequence. Default False.
add_start_token_as_response (bool, optional) – Whether to add the special token [CLS] at the end of the sequence as the beginning of the response when running inference, to force the model to start generating the response sequence. Default False.
pad_to_max_seq_len (bool, optional) – Whether to pad the returned sequences to max_seq_len. Note that, in this method, returned sequences are padded on the left. Default False.
return_tensors (bool, optional) – Whether to convert the returned sequences to Tensor. Default False.
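The left-padding behavior noted for pad_to_max_seq_len can be sketched as follows. This is a minimal illustration rather than the library's code: the helper name pad_left and the pad id 0 are hypothetical, and the real method also has to pad the other returned fields consistently.

```python
from typing import List


def pad_left(ids: List[int], max_seq_len: int, pad_id: int = 0) -> List[int]:
    """Illustrative only: prepend pad_id until the list reaches max_seq_len."""
    missing = max_seq_len - len(ids)
    if missing <= 0:
        return list(ids)  # already long enough; this sketch does not truncate
    return [pad_id] * missing + list(ids)


print(pad_left([5, 6, 7], 5))  # [0, 0, 5, 6, 7]
```

Padding on the left keeps the most recent tokens at the right end of every sequence in a batch, which is a common choice when a model continues generating from the last position.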
-
property