tokenizer

class BigBirdTokenizer(sentencepiece_model_file, do_lower_case=True, encoding='utf8', unk_token='<unk>', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]')[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

Constructs a BigBird tokenizer. It uses a basic tokenizer to do punctuation splitting, lower casing and so on, then uses a WordPiece tokenizer to split words into subwords.

Parameters
  • sentencepiece_model_file (str) – File path of the vocabulary (the SentencePiece model file).

  • do_lower_case (bool) – Whether the text strips accents and converts to lower case. If you use a BigBird pretrained model, this is set to False for cased models and True otherwise. Default: True.

  • unk_token (str) – The special token for unknown words. Default: “<unk>”.

  • sep_token (str) – The special token for the separator. Default: “[SEP]”.

  • pad_token (str) – The special token for padding. Default: “[PAD]”.

  • cls_token (str) – The special token for sequence classification. Default: “[CLS]”.

  • mask_token (str) – The special token for masking. Default: “[MASK]”.

Examples:
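A minimal usage sketch; ‘bigbird-base-uncased’ is a pretrained tokenizer name assumed here for illustration:

    from paddlenlp.transformers import BigBirdTokenizer

    # Load a pretrained tokenizer by name (name assumed for illustration).
    tokenizer = BigBirdTokenizer.from_pretrained('bigbird-base-uncased')
    tokens = tokenizer.tokenize('He was a puppeteer')
    print(tokens)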

property vocab_size

Returns the size of the vocabulary.

Returns

The size of the vocabulary.

Return type

int

convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (a list of strings) into a single string. Since WordPiece introduces ## to join subwords, the ## markers are also removed during conversion.

Parameters

tokens (list) – A list of strings representing the tokens to be converted.

Returns

Converted string from tokens.

Return type

str
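
A short round-trip sketch, reusing the tokenizer from the example above:

    tokens = tokenizer.tokenize('He was a puppeteer')
    text = tokenizer.convert_tokens_to_string(tokens)
    print(text)  # subword markers are stripped in the joined string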

encode(text, max_seq_len=None, max_pred_len=None, masked_lm_prob=0.15)[source]
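
encode carries no docstring here; judging from the signature, it prepares masked-language-model pretraining inputs. A call sketch with illustrative argument values (the return structure is implementation-specific and not documented in this section):

    # All values below are illustrative, not recommended settings.
    outputs = tokenizer.encode(
        'An input paragraph for BigBird pretraining.',
        max_seq_len=128,       # target sequence length
        max_pred_len=20,       # cap on masked-LM predictions
        masked_lm_prob=0.15,   # fraction of tokens proposed for masking
    )
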
num_special_tokens_to_add(pair=False)[source]

Returns the number of added tokens when encoding a sequence with special tokens.

Note

This encodes inputs and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.

Parameters

pair (bool) – If set to True, returns the number of special tokens added for a sequence pair; if set to False, the number added for a single sequence. Default: False.

Returns

The number of special tokens added to the sequence(s).
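
A sketch of the typical use: budgeting how many content tokens fit once special tokens are accounted for (max_seq_len is an assumed application-level limit):

    max_seq_len = 512
    # For a pair, [CLS] A [SEP] B [SEP] consumes three special-token slots.
    content_budget = max_seq_len - tokenizer.num_special_tokens_to_add(pair=True)
    print(content_budget)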

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens.

A BigBird sequence has the same format as a BERT sequence:

- single sequence: ``[CLS] X [SEP]``
- pair of sequences: ``[CLS] A [SEP] B [SEP]``
Parameters
  • token_ids_0 (List[int]) – List of IDs to which the special tokens will be added.

  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

Returns

List of input IDs with the appropriate special tokens.

Return type

List[int]
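
A minimal sketch, assuming the token-ID lists come from convert_tokens_to_ids on the same tokenizer:

    ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize('First sentence.'))
    ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize('And a second one.'))

    # Single sequence: [CLS] A [SEP]
    single_input = tokenizer.build_inputs_with_special_tokens(ids_a)

    # Pair of sequences: [CLS] A [SEP] B [SEP]
    pair_input = tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)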