tokenizer

class BasicTokenizer(do_lower_case=True)[source]

Bases: object

Runs basic tokenization (punctuation splitting, lower casing, etc.).

Parameters

do_lower_case (bool) – Whether to strip accents and convert the text to lower case. If you use a BERT pretrained model, this is set to False for the cased model and True otherwise. Default: True.

tokenize(text)[source]

Tokenizes a piece of text using the basic tokenizer.

Parameters

text (str) – A piece of text to tokenize.

Returns

A list of tokens.

Return type

list(str)
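
A minimal sketch of running the basic tokenizer on its own, assuming BasicTokenizer is importable from paddlenlp.transformers (the exact output is illustrative):

from paddlenlp.transformers import BasicTokenizer

basic_tokenizer = BasicTokenizer(do_lower_case=True)
tokens = basic_tokenizer.tokenize('He was a puppeteer!')
# expected: ['he', 'was', 'a', 'puppeteer', '!']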

class BertTokenizer(vocab_file, do_lower_case=True, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]')[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

Constructs a BERT tokenizer. It uses a basic tokenizer to do punctuation splitting, lower casing and so on, and follows a WordPiece tokenizer to tokenize into subwords.

Parameters
  • vocab_file (str) – File path of the vocabulary.

  • do_lower_case (bool) – Whether to strip accents and convert the text to lower case. If you use a BERT pretrained model, this is set to False for the cased model and True otherwise. Default: True.

  • unk_token (str) – The special token for unknown words. Default: “[UNK]”.

  • sep_token (str) – The special token for the separator. Default: “[SEP]”.

  • pad_token (str) – The special token for padding. Default: “[PAD]”.

  • cls_token (str) – The special classification token. Default: “[CLS]”.

  • mask_token (str) – The special token for masking. Default: “[MASK]”.

Examples
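
A minimal construction sketch, assuming the pretrained weights name 'bert-base-uncased' can be resolved by from_pretrained (inherited from PretrainedTokenizer); the later sketches on this page reuse this tokenizer instance:

from paddlenlp.transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')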

property vocab_size

Returns the size of the vocabulary.

Returns

The size of the vocabulary.

Return type

int
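
A quick check, reusing the tokenizer from the Examples sketch (the exact number depends on the vocabulary file):

tokenizer.vocab_size
# e.g. 30522 for 'bert-base-uncased'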

tokenize(text)[source]

End-to-end tokenization for BERT models.

Parameters

text (str) – The text to be tokenized.

Returns

A list of strings representing the converted tokens.

Return type

list
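
A short sketch of end-to-end tokenization, reusing the tokenizer from the Examples sketch (the subword split shown is illustrative):

tokens = tokenizer.tokenize('He was a puppeteer')
# expected: ['he', 'was', 'a', 'puppet', '##eer']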

convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (a list of strings) into a single string. Since WordPiece introduces ## to mark subwords, the ## prefixes are also removed when converting.

Parameters

tokens (list) – A list of strings representing the tokens to be converted.

Returns

Converted string from tokens.

Return type

str
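
A round-trip sketch, continuing from the tokenize sketch above:

tokens = tokenizer.tokenize('He was a puppeteer')
text = tokenizer.convert_tokens_to_string(tokens)
# expected: 'he was a puppeteer' (lower-cased, ## prefixes removed)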

num_special_tokens_to_add(pair=False)[source]

Returns the number of added tokens when encoding a sequence with special tokens.

Note

This encodes inputs and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.

Parameters

pair (bool, optional) – Whether to count the special tokens added for a sequence pair (True) or for a single sequence (False). Default: False.

Returns

Number of tokens added to sequences
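
A small sketch of the expected counts for the BERT format, reusing the tokenizer from the Examples sketch:

tokenizer.num_special_tokens_to_add(pair=False)
# expected: 2, for [CLS] and [SEP]
tokenizer.num_special_tokens_to_add(pair=True)
# expected: 3, for [CLS], [SEP] and [SEP]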

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]

Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens.

A BERT sequence has the following format:

- single sequence: ``[CLS] X [SEP]``
- pair of sequences: ``[CLS] A [SEP] B [SEP]``
Parameters
  • token_ids_0 (List[int]) – List of IDs to which the special tokens will be added.

  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

Returns

List of input_id with the appropriate special tokens.

Return type

List[int]
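
A sketch with hypothetical input texts, reusing the tokenizer from the Examples sketch and its inherited convert_tokens_to_ids:

ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize('hello world'))
ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize('how are you'))
single = tokenizer.build_inputs_with_special_tokens(ids_a)
pair = tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)
# single corresponds to [CLS] hello world [SEP]
# pair corresponds to [CLS] hello world [SEP] how are you [SEP]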

build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)[source]

Builds an offset map from a pair of offset maps by concatenating them and adding the offsets of special tokens.

A BERT offset_mapping has the following format:

- single sequence: ``(0,0) X (0,0)``
- pair of sequences: ``(0,0) A (0,0) B (0,0)``
Parameters
  • offset_mapping_0 (List[tuple]) – List of char offsets to which the special tokens will be added.

  • offset_mapping_1 (List[tuple], optional) – Optional second list of char offsets for offset mapping pairs.

Returns

List of char offsets with the appropriate offsets of special tokens.

Return type

List[tuple]
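
A sketch with hypothetical character offsets for the two words of a single sequence:

offsets = [(0, 5), (6, 11)]
tokenizer.build_offset_mapping_with_special_tokens(offsets)
# expected: [(0, 0), (0, 5), (6, 11), (0, 0)], where (0, 0) marks [CLS] and [SEP]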

create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)[source]

Create a mask from the two sequences passed to be used in a sequence-pair classification task.

A BERT sequence pair mask has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |

If token_ids_1 is None, this method only returns the first portion of the mask (0s).

Parameters
  • token_ids_0 (List[int]) – List of IDs.

  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

Returns

List of token_type_ids according to the given sequence(s).

Return type

List[int]
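
A sketch with hypothetical id lists, showing the 0/1 layout described above:

ids_a = [10, 11, 12]
ids_b = [20, 21]
tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b)
# expected: [0, 0, 0, 0, 0, 1, 1, 1]
# i.e. [CLS] + A + [SEP] get 0s; B + [SEP] get 1s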

get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]

Builds a mask identifying the special tokens in a sequence (or sequence pair). By default the input token lists are assumed to have no special tokens added yet; this method is called when adding special tokens using the tokenizer encode methods.

Parameters
  • token_ids_0 (List[int]) – List of ids of the first sequence.

  • token_ids_1 (List[int], optional) – List of ids of the second sequence.

  • already_has_special_tokens (bool, optional) – Whether or not the token list is already formatted with special tokens for the model. Defaults to False.

Returns

The list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Return type

List[int]
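
A sketch on an already-encoded sequence, reusing the hypothetical ids and the tokenizer from the sketches above:

ids = tokenizer.build_inputs_with_special_tokens([10, 11, 12])
tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
# expected: [1, 0, 0, 0, 1], where 1 marks [CLS]/[SEP] and 0 marks sequence tokens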

class WordpieceTokenizer(vocab, unk_token, max_input_chars_per_word=100)[source]

Bases: object

Runs WordPiece tokenization.

Parameters
  • vocab (Vocab|dict) – Vocab of the WordPiece tokenizer.

  • unk_token (str) – A specific token to replace all unknown tokens.

  • max_input_chars_per_word (int, optional) – If a word's length is more than max_input_chars_per_word, it is treated as an unknown word. Default: 100.

tokenize(text)[source]

Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.

Parameters

text – A single token or whitespace separated tokens. This should have already been passed through BasicTokenizer.

Returns

A list of wordpiece tokens.

Return type

list (str)

Example

input = "unaffable"
output = ["un", "##aff", "##able"]
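
A runnable sketch with a toy vocabulary (the dict below is hypothetical, built only to make the example self-contained; it assumes WordpieceTokenizer is importable from paddlenlp.transformers like the classes above):

from paddlenlp.transformers import WordpieceTokenizer

vocab = {'un': 0, '##aff': 1, '##able': 2, '[UNK]': 3}
wordpiece_tokenizer = WordpieceTokenizer(vocab=vocab, unk_token='[UNK]')
wordpiece_tokenizer.tokenize('unaffable')
# expected: ['un', '##aff', '##able']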