tokenizer

class ErnieCtmTokenizer(vocab_file, do_lower_case=True, do_basic_tokenize=True, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token_template='[CLS{}]', summary_num=1, mask_token='[MASK]', **kwargs)[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

Construct an ERNIE-CTM tokenizer. It uses a basic tokenizer for punctuation splitting, lower casing and so on, followed by a WordPiece tokenizer that splits tokens into subwords.

Parameters
  • vocab_file (str) – File containing the vocabulary.

  • do_lower_case (bool, optional, defaults to True) – Whether or not to lowercase the input when tokenizing.

  • do_basic_tokenize (bool, optional, defaults to True) – Whether or not to do basic tokenization before WordPiece.

  • unk_token (str, optional, defaults to "[UNK]") – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to this token instead.
  • sep_token (str, optional, defaults to "[SEP]") – The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.

  • pad_token (str, optional, defaults to "[PAD]") – The token used for padding, for example when batching sequences of different lengths.

  • cls_token_template (str, optional, defaults to "[CLS{}]") – The template for the summary tokens; the {} placeholder is filled with each summary index, producing [CLS0], [CLS1], and so on.

  • summary_num (int, optional, defaults to 1) – The number of summary placeholder tokens used in the ERNIE-CTM model to capture global sentence features from multiple perspectives.

  • mask_token (str, optional, defaults to "[MASK]") – The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.

  • strip_accents (bool, optional) – Whether or not to strip all accents. If this option is not specified, it is determined by the value of lowercase (as in the original BERT).
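The relationship between cls_token_template and summary_num can be illustrated with a short sketch (plain Python, independent of the tokenizer class itself): the template is filled with each summary index to produce the summary tokens that are prepended to every sequence.

```python
# Sketch: how the summary (CLS) tokens follow from the documented defaults.
# This is an illustration, not the library's internal code.
cls_token_template = "[CLS{}]"
summary_num = 2  # e.g. two summary placeholders

summary_tokens = [cls_token_template.format(i) for i in range(summary_num)]
print(summary_tokens)  # ['[CLS0]', '[CLS1]']
```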

convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (strings) into a single string.

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An ERNIE-CTM sequence has the following format:

  • single sequence: [CLS0][CLS1]… X [SEP]

  • pair of sequences: [CLS0][CLS1]… A [SEP] B [SEP]

Parameters
  • token_ids_0 (List[int]) – List of IDs to which the special tokens will be added.

  • token_ids_1 (List[int], optional, defaults to None) – Optional second list of IDs for sequence pairs.

Returns

List[int] – List of input IDs with the appropriate special tokens.
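With the default summary_num of 1, the concatenation logic above can be sketched in plain Python using made-up token IDs (the real method uses IDs from the tokenizer's vocabulary):

```python
# Sketch of the special-token layout. The IDs 0 ([CLS0]) and 1 ([SEP])
# are hypothetical; real IDs come from the vocab file.
def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None,
                                     cls_ids=(0,), sep_id=1):
    out = list(cls_ids) + list(token_ids_0) + [sep_id]
    if token_ids_1 is not None:
        out += list(token_ids_1) + [sep_id]
    return out

print(build_inputs_with_special_tokens([7, 8]))       # [0, 7, 8, 1]
print(build_inputs_with_special_tokens([7, 8], [9]))  # [0, 7, 8, 1, 9, 1]
```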

get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.

Parameters
  • token_ids_0 (List[int]) – List of IDs.

  • token_ids_1 (List[int], optional, defaults to None) – Optional second list of IDs for sequence pairs.

  • already_has_special_tokens (bool, optional, defaults to False) – Whether or not the token list is already formatted with special tokens for the model.

Returns

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Return type

List[int]
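For sequences without pre-existing special tokens, the mask can be sketched as follows (a simplification assuming the default summary_num of 1; the real method also handles the already_has_special_tokens case):

```python
# Sketch of the special-tokens mask: 1 marks a special token
# ([CLS0]/[SEP]), 0 marks an ordinary sequence token.
def get_special_tokens_mask(token_ids_0, token_ids_1=None, summary_num=1):
    mask = [1] * summary_num + [0] * len(token_ids_0) + [1]
    if token_ids_1 is not None:
        mask += [0] * len(token_ids_1) + [1]
    return mask

print(get_special_tokens_mask([7, 8]))       # [1, 0, 0, 1]
print(get_special_tokens_mask([7, 8], [9]))  # [1, 0, 0, 1, 0, 1]
```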

create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)[source]

Create a mask from the two sequences passed to be used in a sequence-pair classification task. An ERNIE-CTM sequence pair mask (the same format as BERT) looks as follows:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |

If token_ids_1 is None, this method only returns the first portion of the mask (0s).

Parameters
  • token_ids_0 (List[int]) – List of IDs.

  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

Returns

List of token type IDs according to the given sequence(s).

Return type

List[int]
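Following the mask diagram above, the token type IDs can be sketched in plain Python (assuming the default summary_num of 1, so segment 0 covers [CLS0], the first sequence, and its [SEP]):

```python
# Sketch of token type IDs: 0 for the first segment (summary tokens +
# first sequence + first [SEP]), 1 for the second sequence + its [SEP].
def create_token_type_ids(token_ids_0, token_ids_1=None, summary_num=1):
    first = [0] * (summary_num + len(token_ids_0) + 1)
    if token_ids_1 is None:
        return first
    return first + [1] * (len(token_ids_1) + 1)

print(create_token_type_ids([7, 8]))       # [0, 0, 0, 0]
print(create_token_type_ids([7, 8], [9]))  # [0, 0, 0, 0, 1, 1]
```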

tokenize(text, **kwargs)[source]

Performs basic tokenization of a piece of text. Chinese characters are tokenized individually, so the string is converted directly into a list of single-character tokens.
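The character-level behaviour for Chinese text can be approximated as follows (a simplification: the real tokenizer also applies lower casing, punctuation splitting, and WordPiece to non-Chinese spans):

```python
# Sketch: split CJK characters into individual tokens while keeping
# whitespace-separated non-Chinese words intact.
def naive_tokenize(text):
    tokens, word = [], []
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
            if word:
                tokens.append("".join(word)); word = []
            tokens.append(ch)
        elif ch.isspace():
            if word:
                tokens.append("".join(word)); word = []
        else:
            word.append(ch)
    if word:
        tokens.append("".join(word))
    return tokens

print(naive_tokenize("百度ernie模型"))  # ['百', '度', 'ernie', '模', '型']
```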