tokenizer
class BigBirdTokenizer(sentencepiece_model_file, do_lower_case=True, encoding='utf8', unk_token='<unk>', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]')[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

Constructs a BigBird tokenizer. It uses a basic tokenizer to do punctuation splitting, lower casing and so on, and then a WordPiece tokenizer to split text into subwords.

- Parameters
sentencepiece_model_file (str) – File path of the vocabulary.
do_lower_case (bool) – Whether to strip accents and convert the text to lower case. If you use a BigBird pretrained model, set this to False for the cased model and True otherwise. Default: True.
unk_token (str) – The special token for unknown words. Default: "<unk>".
sep_token (str) – The special token for separator. Default: "[SEP]".
pad_token (str) – The special token for padding. Default: "[PAD]".
cls_token (str) – The special token for classification. Default: "[CLS]".
mask_token (str) – The special token for masking. Default: "[MASK]".
Examples:
property vocab_size

Returns the size of the vocabulary.

- Returns
The size of the vocabulary.
- Return type
int
convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (a list of strings) into a single string. Since WordPiece uses ## to mark subwords, the ## markers are also removed when converting.

- Parameters
tokens (list) – A list of strings representing the tokens to be converted.
- Returns
Converted string from tokens.
- Return type
str
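The ## handling can be illustrated with a plain-Python sketch (a simplified stand-in for the actual method, not the library code):

```python
def convert_tokens_to_string(tokens):
    """Join WordPiece tokens, gluing '##'-prefixed subwords onto the
    previous token by removing the continuation marker."""
    return " ".join(tokens).replace(" ##", "").strip()

# '##eer' is a continuation of 'puppet', so the two are joined.
print(convert_tokens_to_string(["he", "was", "a", "puppet", "##eer"]))
# -> "he was a puppeteer"
```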
num_special_tokens_to_add(pair=False)[source]

Returns the number of tokens added when encoding a sequence with special tokens.
Note
This encodes inputs and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.
- Parameters
pair (bool) – If set to True, returns the number of tokens added in the case of a sequence pair; if set to False, the number added in the case of a single sequence. Default: False.
- Returns
Number of special tokens added to sequences.
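The encode-and-count idea from the note above can be sketched in plain Python (placeholder token IDs, not the library implementation):

```python
def num_special_tokens_to_add(pair=False):
    """Count special tokens by building a dummy encoded sequence and
    diffing its length against the raw input length."""
    dummy = [0]  # one placeholder content token
    # Templates from the docs: [CLS] X [SEP] and [CLS] A [SEP] B [SEP],
    # using hypothetical IDs cls=101, sep=102.
    if pair:
        encoded = [101] + dummy + [102] + dummy + [102]
        return len(encoded) - 2 * len(dummy)
    encoded = [101] + dummy + [102]
    return len(encoded) - len(dummy)

print(num_special_tokens_to_add(False))  # -> 2  ([CLS] and [SEP])
print(num_special_tokens_to_add(True))   # -> 3  ([CLS] and two [SEP])
```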
build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]

Builds model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens.

A BigBird sequence has the following format:

- single sequence: ``[CLS] X [SEP]``
- pair of sequences: ``[CLS] A [SEP] B [SEP]``
- Parameters
token_ids_0 (List[int]) – List of IDs to which the special tokens will be added.
token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.
- Returns
List of input IDs with the appropriate special tokens.
- Return type
List[int]
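The two layouts can be sketched directly in plain Python (placeholder [CLS]/[SEP] IDs chosen for illustration; the real tokenizer looks them up in its vocabulary):

```python
def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None,
                                     cls_id=101, sep_id=102):
    """Wrap one or two ID sequences in the documented special-token layout."""
    if token_ids_1 is None:
        # single sequence: [CLS] X [SEP]
        return [cls_id] + token_ids_0 + [sep_id]
    # pair of sequences: [CLS] A [SEP] B [SEP]
    return [cls_id] + token_ids_0 + [sep_id] + token_ids_1 + [sep_id]

print(build_inputs_with_special_tokens([7, 8]))       # -> [101, 7, 8, 102]
print(build_inputs_with_special_tokens([7, 8], [9]))  # -> [101, 7, 8, 102, 9, 102]
```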