tokenizer

Tokenization class for XLNet model.

class XLNetTokenizer(vocab_file, do_lower_case=False, remove_space=True, keep_accents=False, bos_token='<s>', eos_token='</s>', unk_token='<unk>', sep_token='<sep>', pad_token='<pad>', cls_token='<cls>', mask_token='<mask>', additional_special_tokens=['<eop>', '<eod>'])[source]

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

Constructs an XLNet tokenizer. Based on SentencePiece.

Parameters
  • vocab_file (str) – SentencePiece file (ends with .spm) that contains the vocabulary necessary to instantiate a tokenizer.

  • do_lower_case (bool, optional) – Whether to lowercase the input when tokenizing. Defaults to False, meaning the input is not lowercased.

  • remove_space (bool, optional) – Whether to remove excess spaces before and after the string when tokenizing. Defaults to True.

  • keep_accents (bool, optional) – Whether to keep accents when tokenizing. Defaults to False, meaning accents are removed.

  • bos_token (str, optional) – The beginning of sequence token that was used during pretraining. Defaults to "<s>".

  • eos_token (str, optional) – The end of sequence token. Defaults to "</s>".

  • unk_token (str, optional) – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to this token instead. Defaults to "<unk>".

  • sep_token (str, optional) – The separator token. Defaults to "<sep>".

  • pad_token (str, optional) – The token used for padding. Defaults to "<pad>".

  • cls_token (str, optional) – The classifier token which is used when doing sequence classification. It is the last token of the sequence when built with special tokens. Defaults to "<cls>".

  • mask_token (str, optional) – The token used for masking values. In the masked language modeling task, this is the token used and which the model will try to predict. Defaults to "<mask>".

  • additional_special_tokens (List[str], optional) – Additional special tokens used by the tokenizer. Defaults to ["<eop>", "<eod>"].

sp_model

The SentencePiece processor that is used for every conversion (string, tokens and IDs).

Type

SentencePieceProcessor

tokenize(text)[source]

End-to-end tokenization for XLNet models.

Parameters

text (str) – The text to be tokenized.

Returns

A list of strings representing the converted tokens.

Return type

List[str]

convert_tokens_to_ids(tokens)[source]

Converts a token (or a sequence of tokens) to a single integer id (or a sequence of ids), using the vocabulary.

Parameters

tokens (str or List[str]) – One or several token(s) to convert to token id(s).

Returns

The token id or list of token ids or tuple of token ids.

Return type

int or List[int] or tuple(int)

convert_ids_to_tokens(ids, skip_special_tokens=False)[source]

Converts a single index or a sequence of indices to a token or a sequence of tokens, using the vocabulary and added tokens.

Parameters
  • ids (int or List[int]) – The token id (or token ids) to be converted to token(s).

  • skip_special_tokens (bool, optional) – Whether or not to remove special tokens in the decoding. Defaults to False, meaning special tokens are kept.

Returns

The decoded token(s).

Return type

str or List[str]

convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (strings for sub-words) into a single string.

num_special_tokens_to_add(pair=False)[source]

Returns the number of added tokens when encoding a sequence with special tokens.

Note

This encodes inputs and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.

Parameters

pair (bool, optional) – Whether the sequence is a sequence pair or a single sequence. Defaults to False, meaning the input is a single sequence.

Returns

Number of tokens added to sequences.

Return type

int
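The counts follow directly from the sequence formats used by build_inputs_with_special_tokens below: a single sequence gains <sep> and <cls> (two tokens), while a pair gains two <sep> plus <cls> (three tokens). A minimal sketch of that arithmetic, independent of the library:

```python
def num_special_tokens_to_add(pair=False):
    # Single sequence: X <sep> <cls>          -> 2 special tokens added
    # Sequence pair:   A <sep> B <sep> <cls>  -> 3 special tokens added
    return 3 if pair else 2
```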

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]

Builds model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An XLNet sequence has the following format:

  • single sequence: X <sep> <cls>

  • pair of sequences: A <sep> B <sep> <cls>

Parameters
  • token_ids_0 (List[int]) – List of IDs for the first sequence.

  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs. Defaults to None.

Returns

List of input IDs with the appropriate special tokens.

Return type

List[int]
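The concatenation logic above can be sketched in plain Python. The ids for <sep> and <cls> below are hypothetical placeholders; in practice they come from the tokenizer's vocabulary.

```python
SEP_ID, CLS_ID = 4, 3  # hypothetical ids for <sep> and <cls>

def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None):
    sep, cls = [SEP_ID], [CLS_ID]
    if token_ids_1 is None:
        # single sequence: X <sep> <cls>
        return token_ids_0 + sep + cls
    # pair of sequences: A <sep> B <sep> <cls>
    return token_ids_0 + sep + token_ids_1 + sep + cls
```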

build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None)[source]

Builds an offset map from a pair of offset maps by concatenating and adding the offsets of special tokens.

An XLNet offset_mapping has the following format:

  • single sequence: X (0,0) (0,0)

  • pair of sequences: A (0,0) B (0,0) (0,0)

Parameters
  • offset_mapping_0 (List[tuple]) – List of char offsets to which the special tokens will be added.

  • offset_mapping_1 (List[tuple], optional) – Optional second list of char offsets for offset mapping pairs. Defaults to None.

Returns

List of char offsets with the appropriate offsets of special tokens.

Return type

List[tuple]
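Mirroring the formats above, a sketch of the offset concatenation (each special token contributes a (0, 0) offset, since it does not map back to the original text):

```python
def build_offset_mapping_with_special_tokens(offset_mapping_0, offset_mapping_1=None):
    if offset_mapping_1 is None:
        # single sequence: X (0,0) (0,0)
        return offset_mapping_0 + [(0, 0), (0, 0)]
    # pair of sequences: A (0,0) B (0,0) (0,0)
    return offset_mapping_0 + [(0, 0)] + offset_mapping_1 + [(0, 0), (0, 0)]
```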

get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)[source]

Creates a special tokens mask from the input sequences. This method is called when adding special tokens using the tokenizer encode method.

Parameters
  • token_ids_0 (List[int]) – List of IDs for the first sequence.

  • token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs. Defaults to None.

  • already_has_special_tokens (bool, optional) – Whether or not the token list is already formatted with special tokens for the model. Defaults to False.

Returns

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Return type

List[int]
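For the default case (already_has_special_tokens=False), the mask simply marks the positions where <sep> and <cls> are inserted by build_inputs_with_special_tokens. A sketch of that case, ignoring the already_has_special_tokens branch:

```python
def get_special_tokens_mask(token_ids_0, token_ids_1=None):
    if token_ids_1 is None:
        # X <sep> <cls> -> sequence tokens are 0, trailing <sep> <cls> are 1
        return [0] * len(token_ids_0) + [1, 1]
    # A <sep> B <sep> <cls> -> the middle 1 marks the first <sep>
    return [0] * len(token_ids_0) + [1] + [0] * len(token_ids_1) + [1, 1]
```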

create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)[source]

Creates a token type ID mask from the input sequences. An XLNet sequence pair mask has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 2
| first sequence    | second sequence |
  • 0 stands for the segment id of first segment tokens,

  • 1 stands for the segment id of second segment tokens,

  • 2 stands for the segment id of cls_token.

Parameters
  • token_ids_0 (List[int]) – List of IDs for the first sequence.

  • token_ids_1 (List[int], optional) – Optional second list of IDs for the sequence pair. Defaults to None.

Returns

List of token type IDs according to the given sequence(s).

Return type

List[int]
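The segment-id layout above (0 for the first segment including its <sep>, 1 for the second segment including its <sep>, 2 for the final <cls>) can be sketched as:

```python
def create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None):
    if token_ids_1 is None:
        # X <sep> -> segment 0, <cls> -> segment 2
        return [0] * (len(token_ids_0) + 1) + [2]
    return ([0] * (len(token_ids_0) + 1)   # A <sep> -> segment 0
            + [1] * (len(token_ids_1) + 1)  # B <sep> -> segment 1
            + [2])                          # <cls>  -> segment 2
```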

save_resources(save_directory)[source]

Saves tokenizer related resources to files under save_directory.

Parameters

save_directory (str) – Directory to save files into.