tokenizer

class ErnieCtmTokenizer(vocab_file, do_lower_case=True, do_basic_tokenize=True, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token_template='[CLS{}]', summary_num=1, mask_token='[MASK]', **kwargs)[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

Construct an ERNIE-CTM tokenizer. It uses a basic tokenizer for punctuation splitting, lower casing and so on, followed by a WordPiece tokenizer that splits tokens into subwords.

Parameters
  • vocab_file (str) – File containing the vocabulary.

  • do_lower_case (bool, optional, defaults to True) – Whether or not to lowercase the input when tokenizing.

  • do_basic_tokenize (bool, optional, defaults to True) – Whether or not to do basic tokenization before WordPiece.

  • unk_token (str, optional, defaults to "[UNK]") – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to this token instead.
  • sep_token (str, optional, defaults to "[SEP]") – The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.

  • pad_token (str, optional, defaults to "[PAD]") – The token used for padding, for example when batching sequences of different lengths.

  • cls_token_template (str, optional, defaults to "[CLS{}]") – The template for the summary tokens; the {} placeholder is filled with each summary index, producing [CLS0], [CLS1], and so on.

  • summary_num (int, optional, defaults to 1) – The number of summary placeholder tokens used in the ERNIE-CTM model to capture global sentence features from multiple perspectives.

  • mask_token (str, optional, defaults to "[MASK]") – The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.

  • strip_accents (bool, optional) – Whether or not to strip all accents. If this option is not specified, it is determined by the value of lowercase (as in the original BERT).
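The relationship between cls_token_template and summary_num can be illustrated with a short sketch (plain Python, independent of the tokenizer class itself): the template is filled with each summary index to produce the summary tokens that are prepended to every sequence.

```python
# Sketch: how the summary (CLS) tokens follow from the documented defaults.
# This is an illustration, not the library's internal code.
cls_token_template = "[CLS{}]"
summary_num = 2  # e.g. two summary placeholders

summary_tokens = [cls_token_template.format(i) for i in range(summary_num)]
print(summary_tokens)  # ['[CLS0]', '[CLS1]']
```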

convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (strings) into a single string.

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An ERNIE-CTM sequence has the following format:

  • single sequence: [CLS0][CLS1]… X [SEP]

  • pair of sequences: [CLS0][CLS1]… A [SEP] B [SEP]

Parameters
  • token_ids_0 (List[int]) – List of IDs to which the special tokens will be added.

  • token_ids_1 (List[int], optional, defaults to None) – Optional second list of IDs for sequence pairs.

Returns

List[int] – List of input IDs with the appropriate special tokens.
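With the default summary_num of 1, the concatenation logic above can be sketched in plain Python using made-up token IDs (the real method uses IDs from the tokenizer's vocabulary):

```python
# Sketch of the special-token layout. The IDs 0 ([CLS0]) and 1 ([SEP])
# are hypothetical; real IDs come from the vocab file.
def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None,
                                     cls_ids=(0,), sep_id=1):
    out = list(cls_ids) + list(token_ids_0) + [sep_id]
    if token_ids_1 is not None:
        out += list(token_ids_1) + [sep_id]
    return out

print(build_inputs_with_special_tokens([7, 8]))       # [0, 7, 8, 1]
print(build_inputs_with_special_tokens([7, 8], [9]))  # [0, 7, 8, 1, 9, 1]
```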

get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.

Parameters
  • token_ids_0 (List[int]) – List of IDs.

  • token_ids_1 (List[int], optional, defaults to None) – Optional second list of IDs for sequence pairs.

  • already_has_special_tokens (bool, optional, defaults to False) – Whether or not the token list is already formatted with special tokens for the model.

Returns

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Return type

List[int]
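For sequences without pre-existing special tokens, the mask can be sketched as follows (a simplification assuming the default summary_num of 1; the real method also handles the already_has_special_tokens case):

```python
# Sketch of the special-tokens mask: 1 marks a special token
# ([CLS0]/[SEP]), 0 marks an ordinary sequence token.
def get_special_tokens_mask(token_ids_0, token_ids_1=None, summary_num=1):
    mask = [1] * summary_num + [0] * len(token_ids_0) + [1]
    if token_ids_1 is not None:
        mask += [0] * len(token_ids_1) + [1]
    return mask

print(get_special_tokens_mask([7, 8]))       # [1, 0, 0, 1]
print(get_special_tokens_mask([7, 8], [9]))  # [1, 0, 0, 1, 0, 1]
```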

create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)[source]

Create a mask from the two sequences passed to be used in a sequence-pair classification task. An ERNIE-CTM sequence pair mask (the same format as BERT) looks as follows:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |

If token_ids_1 is None, this method only returns the first portion of the mask (0s).

Parameters
  • token_ids_0 (List[int]) – List of IDs.

  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

Returns

List of token type IDs according to the given sequence(s).

Return type

List[int]
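Following the mask diagram above, the token type IDs can be sketched in plain Python (assuming the default summary_num of 1, so segment 0 covers [CLS0], the first sequence, and its [SEP]):

```python
# Sketch of token type IDs: 0 for the first segment (summary tokens +
# first sequence + first [SEP]), 1 for the second sequence + its [SEP].
def create_token_type_ids(token_ids_0, token_ids_1=None, summary_num=1):
    first = [0] * (summary_num + len(token_ids_0) + 1)
    if token_ids_1 is None:
        return first
    return first + [1] * (len(token_ids_1) + 1)

print(create_token_type_ids([7, 8]))       # [0, 0, 0, 0]
print(create_token_type_ids([7, 8], [9]))  # [0, 0, 0, 0, 1, 1]
```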

tokenize(text, **kwargs)[source]

Performs basic tokenization of a piece of text. Chinese characters are tokenized individually, so the string is converted directly into a list of single-character tokens.
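The character-level behaviour for Chinese text can be approximated as follows (a simplification: the real tokenizer also applies lower casing, punctuation splitting, and WordPiece to non-Chinese spans):

```python
# Sketch: split CJK characters into individual tokens while keeping
# whitespace-separated non-Chinese words intact.
def naive_tokenize(text):
    tokens, word = [], []
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
            if word:
                tokens.append("".join(word)); word = []
            tokens.append(ch)
        elif ch.isspace():
            if word:
                tokens.append("".join(word)); word = []
        else:
            word.append(ch)
    if word:
        tokens.append("".join(word))
    return tokens

print(naive_tokenize("百度ernie模型"))  # ['百', '度', 'ernie', '模', '型']
```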