tokenizer
Tokenization class for the XLNet model.
-
class XLNetTokenizer(vocab_file, do_lower_case=False, remove_space=True, keep_accents=False, bos_token='<s>', eos_token='</s>', unk_token='<unk>', sep_token='<sep>', pad_token='<pad>', cls_token='<cls>', mask_token='<mask>', additional_special_tokens=['<eop>', '<eod>'])
Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
Constructs an XLNet tokenizer. Based on SentencePiece.
- Parameters
vocab_file (str) – The SentencePiece file (ends with .spm) that contains the vocabulary necessary to instantiate a tokenizer.
do_lower_case (bool, optional) – Whether to lowercase the input when tokenizing. Defaults to False, in which case the input is not lowercased.
remove_space (bool, optional) – Whether to strip the text when tokenizing. Defaults to True, in which case excess spaces before and after the string are removed.
keep_accents (bool, optional) – Whether to keep accents when tokenizing. Defaults to False, in which case accents are not kept.
bos_token (str, optional) – The beginning-of-sequence token that was used during pretraining. Defaults to "<s>".
eos_token (str, optional) – The end-of-sequence token. Defaults to "</s>".
unk_token (str, optional) – The unknown token. A token that is not in the vocabulary is set to unk_token in order to be converted to an ID. Defaults to "<unk>".
sep_token (str, optional) – The separator token. Defaults to "<sep>".
pad_token (str, optional) – The token used for padding. Defaults to "<pad>".
cls_token (str, optional) – The classifier token, used for sequence classification. It is the last token of the sequence when built with special tokens. Defaults to "<cls>".
mask_token (str, optional) – The token used for masking values. In the masked language modeling task, this is the token that the model will try to predict. Defaults to "<mask>".
additional_special_tokens (List[str], optional) – Additional special tokens used by the tokenizer. Defaults to ["<eop>", "<eod>"].
-
sp_model
The SentencePiece processor that is used for every conversion (string, tokens and IDs).
- Type
SentencePieceProcessor
-
tokenize(text)
End-to-end tokenization for XLNet models.
- Parameters
text (str) – The text to be tokenized.
- Returns
A list of strings representing the converted tokens.
- Return type
List[str]
-
convert_tokens_to_ids(tokens)
Converts a token (or a sequence of tokens) to a single integer id (or a sequence of ids), using the vocabulary.
- Parameters
tokens (str or List[str]) – One or several token(s) to convert to token id(s).
- Returns
The token id, or a list or tuple of token ids.
- Return type
int or List[int] or tuple(int)
-
convert_ids_to_tokens(ids, skip_special_tokens=False)
Converts a single index or a sequence of indices to a token or a sequence of tokens, using the vocabulary and added tokens.
- Parameters
ids (int or List[int]) – The token id (or token ids) to be converted to token(s).
skip_special_tokens (bool, optional) – Whether or not to remove special tokens in the decoding. Defaults to False, in which case special tokens are not removed.
- Returns
The decoded token(s).
- Return type
str or List[str]
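The two conversion methods above are inverse lookups over the vocabulary. The following is a minimal sketch of that behaviour with a toy vocabulary, not the library's implementation: the tokens, IDs, `TOY_VOCAB`, and `SPECIAL_IDS` are all made up for illustration, and the fallback to `<unk>` for unknown tokens is taken from the `unk_token` description.

```python
# Toy vocabulary; tokens and IDs are illustrative, not the real XLNet vocabulary.
TOY_VOCAB = {"<unk>": 0, "<cls>": 3, "<sep>": 4, "▁Hello": 10, "▁world": 11}
TOY_IDS = {i: t for t, i in TOY_VOCAB.items()}
SPECIAL_IDS = {0, 3, 4}  # hypothetical IDs of the special tokens above


def convert_tokens_to_ids(tokens):
    """Map one token or a list of tokens to IDs; unknown tokens fall back to <unk>."""
    unk_id = TOY_VOCAB["<unk>"]
    if isinstance(tokens, str):
        return TOY_VOCAB.get(tokens, unk_id)
    return [TOY_VOCAB.get(t, unk_id) for t in tokens]


def convert_ids_to_tokens(ids, skip_special_tokens=False):
    """Map one ID or a list of IDs back to tokens, optionally dropping special tokens."""
    if isinstance(ids, int):
        return TOY_IDS[ids]
    return [TOY_IDS[i] for i in ids if not (skip_special_tokens and i in SPECIAL_IDS)]


ids = convert_tokens_to_ids(["▁Hello", "▁world", "<sep>", "<cls>"])
print(ids)                                                   # [10, 11, 4, 3]
print(convert_ids_to_tokens(ids, skip_special_tokens=True))  # ['▁Hello', '▁world']
print(convert_tokens_to_ids("▁missing"))                     # 0, the <unk> ID
```

A single string in gives a single ID out, and a list in gives a list out, mirroring the return types listed above.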
-
convert_tokens_to_string(tokens)
Converts a sequence of tokens (strings for sub-words) into a single string.
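SentencePiece marks the start of each word with the "▁" character (U+2581), so sub-word pieces can be merged back into text by joining them and turning that marker into a space. The sketch below assumes this is roughly what the method does; `pieces_to_string` is a hypothetical stand-in, not the library function.

```python
SPIECE_UNDERLINE = "\u2581"  # "▁", the SentencePiece word-boundary marker


def pieces_to_string(tokens):
    """Join sub-word pieces and turn word-boundary markers back into spaces."""
    return "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()


print(pieces_to_string(["▁Hello", "▁token", "izer"]))  # Hello tokenizer
```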
-
num_special_tokens_to_add(pair=False)
Returns the number of added tokens when encoding a sequence with special tokens.
Note
This encodes inputs and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.
- Parameters
pair (bool, optional) – Whether the sequence is a sequence pair or a single sequence. Defaults to False, in which case the input is a single sequence.
- Returns
Number of tokens added to sequences.
- Return type
int
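As the note says, the count is obtained by building inputs and measuring what was added. A minimal sketch of that idea, assuming the XLNet layouts `X <sep> <cls>` and `A <sep> B <sep> <cls>`; the helper `_build` and the ID values are hypothetical.

```python
SEP_ID, CLS_ID = 4, 3  # hypothetical special-token IDs


def _build(token_ids_0, token_ids_1=None):
    """Hypothetical helper: append <sep>/<cls> in the XLNet layout."""
    if token_ids_1 is None:
        return token_ids_0 + [SEP_ID, CLS_ID]
    return token_ids_0 + [SEP_ID] + token_ids_1 + [SEP_ID, CLS_ID]


def num_special_tokens_to_add(pair=False):
    """Count the added tokens by building inputs from empty sequences."""
    return len(_build([], [] if pair else None))


print(num_special_tokens_to_add())           # 2
print(num_special_tokens_to_add(pair=True))  # 3
```

Because the result depends only on `pair`, it can be computed once outside the training loop and reused.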
-
build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)
Builds model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An XLNet sequence has the following format:
single sequence:
X <sep> <cls>
pair of sequences:
A <sep> B <sep> <cls>
- Parameters
token_ids_0 (List[int]) – List of IDs for the first sequence.
token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs. Defaults to None.
- Returns
List of input IDs with the appropriate special tokens.
- Return type
List[int]
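The two layouts above amount to plain list concatenation. The following sketch reproduces them as a standalone function; the values of `SEP_ID` and `CLS_ID` are placeholders, since the real IDs come from the loaded vocabulary.

```python
SEP_ID, CLS_ID = 4, 3  # placeholder IDs for <sep> and <cls>


def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    """Return X <sep> <cls>, or A <sep> B <sep> <cls> for a pair."""
    sep, cls = [SEP_ID], [CLS_ID]
    if token_ids_1 is None:
        return token_ids_0 + sep + cls
    return token_ids_0 + sep + token_ids_1 + sep + cls


print(build_inputs_with_special_tokens([10, 11]))        # [10, 11, 4, 3]
print(build_inputs_with_special_tokens([10, 11], [12]))  # [10, 11, 4, 12, 4, 3]
```

Unlike BERT-style models, the classifier token sits at the end of the sequence, so a classification head would read the last position rather than the first.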
-
build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)
Builds an offset mapping from a single offset mapping or a pair of offset mappings by concatenating them and adding the offsets of special tokens.
An XLNet offset_mapping has the following format:
single sequence:
X (0,0) (0,0)
pair of sequences:
A (0,0) B (0,0) (0,0)
- Parameters
offset_mapping_0 (List[tuple]) – List of char offsets to which the special tokens will be added.
offset_mapping_1 (List[tuple], optional) – Optional second list of char offsets for offset mapping pairs. Defaults to None.
- Returns
List of char offsets with the appropriate offsets of special tokens.
- Return type
List[tuple]
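Special tokens do not correspond to any span of the input text, so each one receives the placeholder offset `(0, 0)`. A minimal sketch of the format above, with made-up character offsets for the strings "Hello world" and "Hi":

```python
def build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None):
    """Insert a (0, 0) placeholder wherever <sep> or <cls> is added."""
    special = [(0, 0)]
    if offset_mapping_1 is None:
        return offset_mapping_0 + special + special
    return offset_mapping_0 + special + offset_mapping_1 + special + special


offsets_a = [(0, 5), (6, 11)]  # illustrative offsets for "Hello world"
offsets_b = [(0, 2)]           # illustrative offsets for "Hi"

print(build_offset_mapping_with_special_tokens(offsets_a))
# [(0, 5), (6, 11), (0, 0), (0, 0)]
print(build_offset_mapping_with_special_tokens(offsets_a, offsets_b))
# [(0, 5), (6, 11), (0, 0), (0, 2), (0, 0), (0, 0)]
```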
-
get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)
Creates a special tokens mask from the input sequences. This method is called when adding special tokens using the tokenizer encode method.
- Parameters
token_ids_0 (List[int]) – List of IDs for the first sequence.
token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs. Defaults to None.
already_has_special_tokens (bool, optional) – Whether or not the token list is already formatted with special tokens for the model. Defaults to False.
- Returns
A list of integers, each either 0 or 1: 1 for a special token, 0 for a sequence token.
- Return type
List[int]
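The mask follows the same layout as the built inputs: a 1 wherever `<sep>` or `<cls>` sits, and a 0 elsewhere. The sketch below assumes that layout; `SPECIAL_IDS` is a hypothetical set of placeholder IDs, and the handling of `already_has_special_tokens` is an assumption about how such a flag would be honoured.

```python
SPECIAL_IDS = {4, 3}  # placeholder IDs for <sep> and <cls>


def get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False):
    """Return 1 at special-token positions and 0 at ordinary token positions."""
    if already_has_special_tokens:
        # The input already contains <sep>/<cls>, so mark them by ID.
        return [1 if i in SPECIAL_IDS else 0 for i in token_ids_0]
    if token_ids_1 is None:
        return [0] * len(token_ids_0) + [1, 1]
    return [0] * len(token_ids_0) + [1] + [0] * len(token_ids_1) + [1, 1]


print(get_special_tokens_mask([10, 11]))        # [0, 0, 1, 1]
print(get_special_tokens_mask([10, 11], [12]))  # [0, 0, 1, 0, 1, 1]
print(get_special_tokens_mask([10, 11, 4, 3], already_has_special_tokens=True))  # [0, 0, 1, 1]
```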
-
create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)
Creates a mask from the input sequences. An XLNet sequence pair mask has the following format:
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 2
| first sequence     | second sequence |
0 stands for the segment id of first segment tokens,
1 stands for the segment id of second segment tokens,
2 stands for the segment id of cls_token.
- Parameters
token_ids_0 (List[int]) – List of IDs for the first sequence.
token_ids_1 (List[int], optional) – Optional second list of IDs for the sequence pair. Defaults to None.
- Returns
List of token type IDs according to the given sequence(s).
- Return type
List[int]
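The segment layout above can be sketched as follows. This assumes that each `<sep>` takes the segment id of the sequence it follows, which the diagram suggests but does not state outright; `CLS_SEGMENT_ID` is a name chosen here for illustration.

```python
CLS_SEGMENT_ID = 2  # segment id reserved for cls_token, per the legend above


def create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None):
    """Return 0s for the first segment, 1s for the second, and 2 for <cls>."""
    first = [0] * (len(token_ids_0) + 1)  # +1 for the trailing <sep>
    if token_ids_1 is None:
        return first + [CLS_SEGMENT_ID]
    second = [1] * (len(token_ids_1) + 1)  # +1 for the trailing <sep>
    return first + second + [CLS_SEGMENT_ID]


print(create_token_type_ids_from_sequences([10, 11]))        # [0, 0, 0, 2]
print(create_token_type_ids_from_sequences([10, 11], [12]))  # [0, 0, 0, 1, 1, 2]
```

The result has the same length as the inputs built with special tokens, so the two lists can be fed to the model side by side.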