tokenizer
-
class ErnieCtmTokenizer(vocab_file, do_lower_case=True, do_basic_tokenize=True, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token_template='[CLS{}]', summary_num=1, mask_token='[MASK]', **kwargs)
Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
Construct an ERNIE-CTM tokenizer. It first runs a basic tokenizer (punctuation splitting, lowercasing and so on) and then a WordPiece tokenizer to split the result into subwords.
- Parameters
vocab_file (str) – File containing the vocabulary.
do_lower_case (bool, optional, defaults to True) – Whether or not to lowercase the input when tokenizing.
do_basic_tokenize (bool, optional, defaults to True) – Whether or not to do basic tokenization before WordPiece.
unk_token (str, optional, defaults to "[UNK]") – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to this token instead.
sep_token (str, optional, defaults to "[SEP]") – The separator token, used when building a sequence from multiple sequences, e.g. two sequences for sequence classification, or a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
pad_token (str, optional, defaults to "[PAD]") – The token used for padding, for example when batching sequences of different lengths.
cls_token_template (str, optional, defaults to "[CLS{}]") – The template for the summary tokens. It is filled with an index to produce one token per summary placeholder, e.g. [CLS0], [CLS1].
summary_num (int, optional, defaults to 1) – The number of summary placeholders used in the ERNIE-CTM model, meant to capture global sentence features from multiple aspects.
mask_token (str, optional, defaults to "[MASK]") – The token used for masking values. This is the token used when training this model with masked language modeling, and the token the model will try to predict.
strip_accents (bool, optional) – Whether or not to strip all accents. If this option is not specified, it is determined by the value of lowercase (as in the original BERT).
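As a rough illustration of how cls_token_template and summary_num relate, the sketch below expands the template into one token per placeholder. It assumes the template is filled with a zero-based index via str.format, which agrees with the [CLS0][CLS1] layout documented for build_inputs_with_special_tokens. The helper name summary_tokens is hypothetical and not part of the library.

```python
def summary_tokens(cls_token_template="[CLS{}]", summary_num=1):
    """Expand the template into one summary token per placeholder (hypothetical helper)."""
    # Assumption: the index is zero-based and inserted with str.format.
    return [cls_token_template.format(i) for i in range(summary_num)]


print(summary_tokens())               # ['[CLS0]']
print(summary_tokens(summary_num=2))  # ['[CLS0]', '[CLS1]']
```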
-
convert_tokens_to_string(tokens)
Converts a sequence of tokens (strings) into a single string.
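A minimal sketch of what such a conversion typically looks like for a WordPiece vocabulary, assuming the common convention that continuation pieces carry a ## prefix. The function name join_wordpieces is hypothetical, and the code is an illustration rather than the ErnieCtmTokenizer source.

```python
def join_wordpieces(tokens):
    """Join WordPiece tokens into one string (illustrative, not the library code)."""
    # Assumption: continuation pieces start with "##" and attach to the previous token.
    return " ".join(tokens).replace(" ##", "").strip()


print(join_wordpieces(["token", "##izer", "works"]))  # tokenizer works
```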
-
build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating them and adding special tokens. An ERNIE-CTM sequence has the following format:
single sequence: [CLS0][CLS1]… X [SEP]
pair of sequences: [CLS0][CLS1]… A [SEP] B [SEP]
- Parameters
token_ids_0 (typing.List[int]) – List of IDs to which the special tokens will be added.
- Keyword Arguments
token_ids_1 (typing.Optional[typing.List[int]], defaults to None) – Optional second list of IDs for sequence pairs.
- Returns
typing.List[int] – List of input IDs with the appropriate special tokens.
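The layout above can be sketched in plain Python. The IDs used for the summary tokens and [SEP] are made up for illustration, and build_inputs_sketch is a hypothetical stand-in, not the method itself.

```python
CLS_IDS = (1, 2)  # hypothetical IDs for [CLS0], [CLS1]
SEP_ID = 3        # hypothetical ID for [SEP]


def build_inputs_sketch(token_ids_0, token_ids_1=None, cls_ids=CLS_IDS, sep_id=SEP_ID):
    """Prepend the summary tokens and close each sequence with [SEP]."""
    ids = list(cls_ids) + list(token_ids_0) + [sep_id]
    if token_ids_1 is not None:
        # Second sequence of a pair, followed by its own [SEP].
        ids += list(token_ids_1) + [sep_id]
    return ids


print(build_inputs_sketch([10, 11]))        # [1, 2, 10, 11, 3]
print(build_inputs_sketch([10, 11], [20]))  # [1, 2, 10, 11, 3, 20, 3]
```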
-
get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer's prepare_for_model method.
- Parameters
token_ids_0 (typing.List[int]) – List of IDs.
- Keyword Arguments
token_ids_1 (typing.Optional[typing.List[int]], defaults to None) – Optional second list of IDs for sequence pairs.
already_has_special_tokens (bool, defaults to False) – Whether or not the token list is already formatted with special tokens for the model.
- Returns
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
- Return type
typing.List[int]
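A hedged sketch of the mask for the case where no special tokens have been added yet. It assumes the mask mirrors the layout documented for build_inputs_with_special_tokens, with one leading 1 per summary token. Whether the real method emits exactly that is not confirmed here, and special_tokens_mask_sketch and num_cls are hypothetical names.

```python
def special_tokens_mask_sketch(token_ids_0, token_ids_1=None, num_cls=2):
    """Return 1 for special-token positions and 0 for sequence tokens."""
    # Assumption: num_cls summary tokens come first, then the sequence, then [SEP].
    mask = [1] * num_cls + [0] * len(token_ids_0) + [1]
    if token_ids_1 is not None:
        # Second sequence of a pair, closed by another [SEP].
        mask += [0] * len(token_ids_1) + [1]
    return mask


print(special_tokens_mask_sketch([10, 11]))        # [1, 1, 0, 0, 1]
print(special_tokens_mask_sketch([10, 11], [20]))  # [1, 1, 0, 0, 1, 0, 1]
```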
-
create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)
Create a mask from the two sequences passed to be used in a sequence-pair classification task. A BERT sequence pair mask has the following format:
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |
If token_ids_1 is None, this method only returns the first portion of the mask (0s).
- Parameters
token_ids_0 (List[int]) – List of IDs.
token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.
- Returns
List of token type IDs according to the given sequence(s).
- Return type
List[int]
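The 0/1 mask described above can be sketched as follows. The sketch assumes, as in BERT, that the leading special tokens and the first [SEP] count toward the first segment and the closing [SEP] toward the second. The names token_type_ids_sketch and num_cls are hypothetical, and the number of leading special tokens the real method counts is not stated in this reference.

```python
def token_type_ids_sketch(token_ids_0, token_ids_1=None, num_cls=1):
    """Return 0 for first-segment positions and 1 for second-segment positions."""
    # Assumption: leading special tokens and the first [SEP] belong to segment 0.
    type_ids = [0] * (num_cls + len(token_ids_0) + 1)
    if token_ids_1 is not None:
        # The second sequence and its closing [SEP] belong to segment 1.
        type_ids += [1] * (len(token_ids_1) + 1)
    return type_ids


print(token_type_ids_sketch([10, 11]))        # [0, 0, 0, 0]
print(token_type_ids_sketch([10, 11], [20]))  # [0, 0, 0, 0, 1, 1]
```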