tokenizer
class BasicTokenizer(do_lower_case=True)

Bases: object

Runs basic tokenization (punctuation splitting, lower casing, etc.).

- Parameters
  do_lower_case (bool) – Whether to strip accents and convert the text to lower case. If you use a BERT pretrained model, set this to False for cased models and to True otherwise. Default: True.
class BertTokenizer(vocab_file, do_lower_case=True, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]')

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

Constructs a BERT tokenizer. It uses a basic tokenizer to do punctuation splitting, lower casing and so on, then applies a WordPiece tokenizer to split words into subwords.

- Parameters
  vocab_file (str) – File path of the vocabulary.
  do_lower_case (bool) – Whether to strip accents and convert the text to lower case. If you use a BERT pretrained model, set this to False for cased models and to True otherwise. Default: True.
  unk_token (str) – The special token for unknown words. Default: "[UNK]".
  sep_token (str) – The special token for separation. Default: "[SEP]".
  pad_token (str) – The special token for padding. Default: "[PAD]".
  cls_token (str) – The special token for classification. Default: "[CLS]".
  mask_token (str) – The special token for masking. Default: "[MASK]".
Examples
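A minimal usage sketch; it assumes the 'bert-base-uncased' pretrained vocabulary can be downloaded (any other built-in BERT vocabulary works the same way), and the tokens shown are what that vocabulary would be expected to produce:

    from paddlenlp.transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    tokens = tokenizer.tokenize('He was a puppeteer')
    # ['he', 'was', 'a', 'puppet', '##eer']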
property vocab_size

Returns the size of the vocabulary.

- Returns
  The size of the vocabulary.
- Return type
  int
tokenize(text)

End-to-end tokenization for BERT models.

- Parameters
  text (str) – The text to be tokenized.
- Returns
  A list of strings representing the converted tokens.
- Return type
  list
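For instance, with the uncased tokenizer from the example above, punctuation is split off and out-of-vocabulary words are broken into ##-prefixed subwords (the output shown is what the bert-base-uncased vocabulary would be expected to produce):

    tokenizer.tokenize("He's a puppeteer!")
    # ['he', "'", 's', 'a', 'puppet', '##eer', '!']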
convert_tokens_to_string(tokens)

Converts a sequence of tokens (list of strings) into a single string. Since WordPiece introduces ## to mark subwords, the ## is also removed when converting.

- Parameters
  tokens (list) – A list of strings representing tokens to be converted.
- Returns
  Converted string from tokens.
- Return type
  str
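This is the rough inverse of tokenize: the ## markers are stripped and the pieces joined with spaces. A short sketch, reusing the tokenizer from the example above:

    tokens = ['he', 'was', 'a', 'puppet', '##eer']
    tokenizer.convert_tokens_to_string(tokens)
    # 'he was a puppeteer'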
num_special_tokens_to_add(pair=False)

Returns the number of tokens added when encoding a sequence with special tokens.

Note
This encodes an input and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.

- Parameters
  pair (bool) – If True, returns the number of tokens added for a sequence pair; if False, for a single sequence.
- Returns
  Number of special tokens added to sequences.
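For a BERT tokenizer this should simply count the [CLS]/[SEP] tokens described under build_inputs_with_special_tokens below; a sketch with the tokenizer from the example above:

    tokenizer.num_special_tokens_to_add(pair=False)  # 2 -> [CLS] ... [SEP]
    tokenizer.num_special_tokens_to_add(pair=True)   # 3 -> [CLS] ... [SEP] ... [SEP]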
build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens.

A BERT sequence has the following format:

- single sequence: ``[CLS] X [SEP]``
- pair of sequences: ``[CLS] A [SEP] B [SEP]``

- Parameters
  token_ids_0 (List[int]) – List of IDs to which the special tokens will be added.
  token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.
- Returns
  List of input IDs with the appropriate special tokens.
- Return type
  List[int]
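A sketch with toy token IDs (the values 5, 6, 7 are placeholders, not real vocabulary entries; cls_id and sep_id stand for the IDs of [CLS] and [SEP] in the loaded vocabulary):

    ids_a = [5, 6]
    ids_b = [7]
    tokenizer.build_inputs_with_special_tokens(ids_a)
    # [cls_id, 5, 6, sep_id]
    tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)
    # [cls_id, 5, 6, sep_id, 7, sep_id]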
build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)

Build an offset map from a pair of offset maps by concatenating them and adding the offsets of special tokens.

A BERT offset_mapping has the following format:

- single sequence: ``(0,0) X (0,0)``
- pair of sequences: ``(0,0) A (0,0) B (0,0)``

- Parameters
  offset_mapping_0 (List[tuple]) – List of char offsets to which the special tokens will be added.
  offset_mapping_1 (List[tuple], optional) – Optional second list of char offsets for offset mapping pairs.
- Returns
  List of char offsets with the appropriate offsets of special tokens.
- Return type
  List[tuple]
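Special tokens such as [CLS] and [SEP] correspond to no span of the original text, so they receive the (0, 0) offset. A sketch with a single one-token sequence:

    offsets = [(0, 5)]   # char span of one token in the original text
    tokenizer.build_offset_mapping_with_special_tokens(offsets)
    # [(0, 0), (0, 5), (0, 0)]   -> [CLS], the token, [SEP]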
create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)

Create a mask from the two sequences passed, to be used in a sequence-pair classification task.

A BERT sequence pair mask has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence      | second sequence |

If token_ids_1 is None, this method only returns the first portion of the mask (0s).

- Parameters
  token_ids_0 (List[int]) – List of IDs.
  token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.
- Returns
  List of token_type_ids according to the given sequence(s).
- Return type
  List[int]
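Following the format above, the mask covers the special tokens as well: len(token_ids_0) + 2 zeros for [CLS] A [SEP], then len(token_ids_1) + 1 ones for B [SEP]. A sketch with toy IDs:

    tokenizer.create_token_type_ids_from_sequences([11, 12])
    # [0, 0, 0, 0]
    tokenizer.create_token_type_ids_from_sequences([11, 12], [21, 22, 23])
    # [0, 0, 0, 0, 1, 1, 1, 1]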
get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)

Retrieves the special tokens mask for a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer's encode methods.

- Parameters
  token_ids_0 (List[int]) – List of IDs of the first sequence.
  token_ids_1 (List[int], optional) – List of IDs of the second sequence.
  already_has_special_tokens (bool, optional) – Whether or not the token list is already formatted with special tokens for the model. Defaults to False.
- Returns
  A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
- Return type
  List[int]
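A sketch with toy IDs, marking where [CLS] and [SEP] would be placed once the sequences are encoded:

    tokenizer.get_special_tokens_mask([11, 12])
    # [1, 0, 0, 1]
    tokenizer.get_special_tokens_mask([11, 12], [21, 22])
    # [1, 0, 0, 1, 0, 0, 1]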
class WordpieceTokenizer(vocab, unk_token, max_input_chars_per_word=100)

Bases: object

Runs WordPiece tokenization.

- Parameters
  vocab (Vocab|dict) – Vocab of the WordPiece tokenizer.
  unk_token (str) – A specific token to replace all unknown tokens.
  max_input_chars_per_word (int) – If a word's length exceeds max_input_chars_per_word, it is treated as an unknown word. Default: 100.
tokenize(text)

Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.

- Parameters
  text (str) – A single token or whitespace-separated tokens. This should have already been passed through BasicTokenizer.
- Returns
  A list of wordpiece tokens.
- Return type
  list (str)

Example

    input = "unaffable"
    output = ["un", "##aff", "##able"]
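The greedy longest-match-first behaviour can be reproduced with a toy vocabulary. The import path is an assumption (WordpieceTokenizer may also live in paddlenlp.transformers.tokenizer_utils, depending on the PaddleNLP version):

    from paddlenlp.transformers import WordpieceTokenizer

    # A toy dict vocabulary; real vocabularies come from a BERT vocab file.
    vocab = {"un": 0, "##aff": 1, "##able": 2, "[UNK]": 3}
    wp = WordpieceTokenizer(vocab=vocab, unk_token="[UNK]")

    wp.tokenize("unaffable")   # ['un', '##aff', '##able']
    wp.tokenize("paddle")      # ['[UNK]'] - no piece matches, so unk_token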