tokenizer
Tokenization class for XLNet model.
-
class XLNetTokenizer(vocab_file, do_lower_case=False, remove_space=True, keep_accents=False, bos_token='<s>', eos_token='</s>', unk_token='<unk>', sep_token='<sep>', pad_token='<pad>', cls_token='<cls>', mask_token='<mask>', additional_special_tokens=['<eop>', '<eod>'])
Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
Constructs an XLNet tokenizer. Based on SentencePiece.
- Parameters
vocab_file (str) – SentencePiece file (ends with .spm) that contains the vocabulary necessary to instantiate a tokenizer.
do_lower_case (bool, optional) – Whether to lowercase the input when tokenizing. Defaults to False, so the input is not lowercased.
remove_space (bool, optional) – Whether to strip the text when tokenizing. Defaults to True, so excess spaces before and after the string are removed.
keep_accents (bool, optional) – Whether to keep accents when tokenizing. Defaults to False, so accents are not kept.
bos_token (str, optional) – The beginning of sequence token that was used during pretraining. Defaults to "<s>".
eos_token (str, optional) – The end of sequence token. Defaults to "</s>".
unk_token (str, optional) – The unknown token. A token that is not in the vocabulary is mapped to unk_token so that it can be converted to an ID. Defaults to "<unk>".
sep_token (str, optional) – The separator token. Defaults to "<sep>".
pad_token (str, optional) – The token used for padding. Defaults to "<pad>".
cls_token (str, optional) – The classifier token, which is used for sequence classification. It is the last token of the sequence when built with special tokens. Defaults to "<cls>".
mask_token (str, optional) – The token used for masking values. In the masked language modeling task, this is the token that replaces masked positions and that the model tries to predict. Defaults to "<mask>".
additional_special_tokens (List[str], optional) – Additional special tokens used by the tokenizer. Defaults to ["<eop>", "<eod>"].
-
sp_model
The SentencePiece processor that is used for every conversion (string, tokens and IDs).
- Type
SentencePieceProcessor
-
tokenize(text)
End-to-end tokenization for XLNet models.
- Parameters
text (str) – The text to be tokenized.
- Returns
A list of strings representing the converted tokens.
- Return type
List[str]
-
convert_tokens_to_ids(tokens)
Converts a token (or a sequence of tokens) to a single integer id (or a sequence of ids), using the vocabulary.
- Parameters
tokens (str or List[str]) – One or several token(s) to convert to token id(s).
- Returns
The token id or list of token ids or tuple of token ids.
- Return type
int or List[int] or tuple(int)
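The str versus List[str] behaviour described above can be illustrated with a small sketch. It does not call the library: TOY_VOCAB and its ids are hypothetical stand-ins for the vocabulary loaded from the SentencePiece file, and the fallback to the unknown-token id follows the unk_token description in the class parameters.

```python
from typing import Dict, List, Union

# Hypothetical toy vocabulary; a real one comes from the SentencePiece file.
TOY_VOCAB: Dict[str, int] = {"<unk>": 0, "▁Hello": 10, "▁world": 11}
UNK_ID = TOY_VOCAB["<unk>"]


def convert_tokens_to_ids(tokens: Union[str, List[str]]) -> Union[int, List[int]]:
    """Map one token to one id, or a list of tokens to a list of ids."""
    if isinstance(tokens, str):
        # Single token in, single id out.
        return TOY_VOCAB.get(tokens, UNK_ID)
    # Sequence in, list out; out-of-vocabulary tokens fall back to UNK_ID.
    return [TOY_VOCAB.get(token, UNK_ID) for token in tokens]


single_id = convert_tokens_to_ids("▁Hello")
id_list = convert_tokens_to_ids(["▁Hello", "▁world", "▁xyz"])
print(single_id, id_list)
```

Here "▁xyz" is not in the toy vocabulary, so it maps to the id of "<unk>".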
-
convert_ids_to_tokens(ids, skip_special_tokens=False)
Converts a single index or a sequence of indices to a token or a sequence of tokens, using the vocabulary and added tokens.
- Parameters
ids (int or List[int]) – The token id (or token ids) to be converted to token(s).
skip_special_tokens (bool, optional) – Whether or not to remove special tokens in the decoding. Defaults to False, so special tokens are not removed.
- Returns
The decoded token(s).
- Return type
str or List[str]
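A minimal sketch of the effect of skip_special_tokens, assuming a hypothetical id-to-token table (TOY_ID_TO_TOKEN) and a hypothetical set of special ids (SPECIAL_IDS). It illustrates the documented behaviour only and is not the library's implementation.

```python
from typing import Dict, List, Union

# Hypothetical id-to-token table and special ids, for illustration only.
TOY_ID_TO_TOKEN: Dict[int, str] = {3: "<cls>", 4: "<sep>", 10: "▁Hello", 11: "▁world"}
SPECIAL_IDS = {3, 4}


def convert_ids_to_tokens(
    ids: Union[int, List[int]], skip_special_tokens: bool = False
) -> Union[str, List[str]]:
    """Map one id to one token, or a list of ids to a list of tokens."""
    if isinstance(ids, int):
        return TOY_ID_TO_TOKEN[ids]
    tokens = []
    for index in ids:
        # Drop special tokens only when the caller asks for it.
        if skip_special_tokens and index in SPECIAL_IDS:
            continue
        tokens.append(TOY_ID_TO_TOKEN[index])
    return tokens


with_special = convert_ids_to_tokens([10, 11, 4, 3])
without_special = convert_ids_to_tokens([10, 11, 4, 3], skip_special_tokens=True)
print(with_special, without_special)
```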
-
convert_tokens_to_string(tokens)
Converts a sequence of tokens (strings for sub-words) into a single string.
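As a rough illustration, SentencePiece marks word boundaries with the "▁" character (U+2581), so sub-word pieces can be joined and the marker turned back into spaces. The function below is a sketch under that assumption, not necessarily what the library does internally.

```python
from typing import List

SPIECE_UNDERLINE = "\u2581"  # the "▁" word-boundary marker used by SentencePiece


def convert_tokens_to_string(tokens: List[str]) -> str:
    """Join sub-word pieces and restore spaces at word boundaries."""
    return "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()


text = convert_tokens_to_string(["▁Hello", "▁wor", "ld", "!"])
print(text)
```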
-
num_special_tokens_to_add(pair=False)
Returns the number of added tokens when encoding a sequence with special tokens.
Note
This encodes inputs and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.
- Parameters
pair (bool, optional) – Whether the input is a sequence pair or a single sequence. Defaults to False, meaning the input is a single sequence.
- Returns
Number of tokens added to sequences.
- Return type
int
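The note above says the count is obtained by encoding inputs and checking how many tokens were added. A sketch of that idea: build the inputs for empty sequences and take the length. The ids SEP_ID and CLS_ID are hypothetical placeholders, and build_inputs is a local stand-in for the XLNet layout in which "<sep>" closes each sequence and "<cls>" ends the input.

```python
from typing import List, Optional

SEP_ID = 4  # hypothetical id for "<sep>"
CLS_ID = 3  # hypothetical id for "<cls>"


def build_inputs(ids_0: List[int], ids_1: Optional[List[int]] = None) -> List[int]:
    """Stand-in for the XLNet layout: X <sep> <cls> or A <sep> B <sep> <cls>."""
    if ids_1 is None:
        return ids_0 + [SEP_ID, CLS_ID]
    return ids_0 + [SEP_ID] + ids_1 + [SEP_ID, CLS_ID]


def num_special_tokens_to_add(pair: bool = False) -> int:
    """Count special tokens by building inputs from empty sequences."""
    return len(build_inputs([], [] if pair else None))


single_count = num_special_tokens_to_add()
pair_count = num_special_tokens_to_add(pair=True)
print(single_count, pair_count)
```

Because the result depends only on pair, it can be computed once and reused rather than called per batch.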
-
build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)
Builds model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An XLNet sequence has the following format:
single sequence: X <sep> <cls>
pair of sequences: A <sep> B <sep> <cls>
- Parameters
token_ids_0 (List[int]) – List of IDs for the first sequence.
token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs. Defaults to None.
- Returns
List of input IDs with the appropriate special tokens.
- Return type
List[int]
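The two layouts can be written out as a short sketch. SEP_ID and CLS_ID are hypothetical placeholder ids; in practice they come from the loaded vocabulary. The code follows the documented format and is not taken from the library source.

```python
from typing import List, Optional

SEP_ID = 4  # hypothetical id for "<sep>"
CLS_ID = 3  # hypothetical id for "<cls>"


def build_inputs_with_special_tokens(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Append <sep> after each sequence and <cls> at the very end."""
    if token_ids_1 is None:
        # single sequence: X <sep> <cls>
        return token_ids_0 + [SEP_ID, CLS_ID]
    # pair of sequences: A <sep> B <sep> <cls>
    return token_ids_0 + [SEP_ID] + token_ids_1 + [SEP_ID, CLS_ID]


single_inputs = build_inputs_with_special_tokens([10, 11])
pair_inputs = build_inputs_with_special_tokens([10, 11], [20])
print(single_inputs, pair_inputs)
```

Unlike BERT-style layouts, the classifier token sits at the end of the input rather than at the start.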
-
build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)
Builds an offset mapping from a single offset mapping or a pair of offset mappings by concatenating them and adding the offsets of special tokens.
An XLNet offset_mapping has the following format:
single sequence: X (0,0) (0,0)
pair of sequences: A (0,0) B (0,0) (0,0)
- Parameters
offset_mapping_0 (List[tuple]) – List of char offsets to which the special tokens will be added.
offset_mapping_1 (List[tuple], optional) – Optional second list of char offsets for offset mapping pairs. Defaults to None.
- Returns
List of char offsets with the appropriate offsets of special tokens.
- Return type
List[tuple]
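A sketch of the layout above: every special token receives the placeholder offset (0, 0), in the same positions where "<sep>" and "<cls>" appear in the input ids. The function mirrors the documented format and is illustrative only.

```python
from typing import List, Optional, Tuple

Offset = Tuple[int, int]
SPECIAL_OFFSET: Offset = (0, 0)  # placeholder offset for <sep> and <cls>


def build_offset_mapping_with_special_tokens(
    offset_mapping_0: List[Offset], offset_mapping_1: Optional[List[Offset]] = None
) -> List[Offset]:
    """Insert (0, 0) wherever a special token is added to the input."""
    if offset_mapping_1 is None:
        # single sequence: X (0,0) (0,0)
        return offset_mapping_0 + [SPECIAL_OFFSET, SPECIAL_OFFSET]
    # pair of sequences: A (0,0) B (0,0) (0,0)
    return (
        offset_mapping_0
        + [SPECIAL_OFFSET]
        + offset_mapping_1
        + [SPECIAL_OFFSET, SPECIAL_OFFSET]
    )


single_offsets = build_offset_mapping_with_special_tokens([(0, 5), (5, 11)])
pair_offsets = build_offset_mapping_with_special_tokens([(0, 5)], [(0, 3)])
print(single_offsets, pair_offsets)
```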
-
get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)
Creates a special tokens mask from the input sequences. This method is called when adding special tokens using the tokenizer's encode method.
- Parameters
token_ids_0 (List[int]) – List of IDs for the first sequence.
token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs. Defaults to None.
already_has_special_tokens (bool, optional) – Whether or not the token list is already formatted with special tokens for the model. Defaults to False.
- Returns
A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
- Return type
List[int]
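A sketch of how such a mask lines up with the XLNet layout (X <sep> <cls>, or A <sep> B <sep> <cls>). SEP_ID and CLS_ID are hypothetical placeholder ids, and the handling of already_has_special_tokens shown here (flagging any id equal to one of those two) is an assumption made for illustration.

```python
from typing import List, Optional

SEP_ID = 4  # hypothetical id for "<sep>"
CLS_ID = 3  # hypothetical id for "<cls>"


def get_special_tokens_mask(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    already_has_special_tokens: bool = False,
) -> List[int]:
    """Return 1 at special-token positions and 0 at sequence-token positions."""
    if already_has_special_tokens:
        # Assumption: the input already contains <sep>/<cls>; flag them by id.
        return [1 if i in (SEP_ID, CLS_ID) else 0 for i in token_ids_0]
    if token_ids_1 is None:
        # X <sep> <cls>
        return [0] * len(token_ids_0) + [1, 1]
    # A <sep> B <sep> <cls>
    return [0] * len(token_ids_0) + [1] + [0] * len(token_ids_1) + [1, 1]


single_mask = get_special_tokens_mask([10, 11])
pair_mask = get_special_tokens_mask([10, 11], [20])
formatted_mask = get_special_tokens_mask([10, 11, 4, 3], already_has_special_tokens=True)
print(single_mask, pair_mask, formatted_mask)
```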
-
create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)
Creates a mask from the input sequences. An XLNet sequence pair mask has the following format:
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 2
| first sequence      | second sequence |
0 stands for the segment id of first segment tokens,
1 stands for the segment id of second segment tokens,
2 stands for the segment id of cls_token.
- Parameters
token_ids_0 (List[int]) – List of IDs for the first sequence.
token_ids_1 (List[int], optional) – Optional second list of IDs for the sequence pair. Defaults to None.
- Returns
List of token type IDs according to the given sequence(s).
- Return type
List[int]
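The segment pattern can be sketched as follows. The pair case follows the diagram above, with each "<sep>" counted in the segment it closes and the final "<cls>" given segment id 2. The single-sequence case is an assumption that extends the same pattern. CLS_SEGMENT_ID is a name introduced here for illustration.

```python
from typing import List, Optional

CLS_SEGMENT_ID = 2  # segment id reserved for the trailing <cls> token


def create_token_type_ids_from_sequences(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Assign 0 to the first segment, 1 to the second, and 2 to <cls>."""
    # "+ 1" accounts for the <sep> token that closes each sequence.
    first = [0] * (len(token_ids_0) + 1)
    if token_ids_1 is None:
        # Assumed single-sequence layout: X <sep> <cls>
        return first + [CLS_SEGMENT_ID]
    second = [1] * (len(token_ids_1) + 1)
    return first + second + [CLS_SEGMENT_ID]


single_types = create_token_type_ids_from_sequences([10, 11])
pair_types = create_token_type_ids_from_sequences([10, 11], [20])
print(single_types, pair_types)
```

The returned list has the same length as the output of build_inputs_with_special_tokens for the same arguments, so the two can be passed to the model side by side.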