tokenizer¶

class GPT2Tokenizer(vocab_file, merges_file, errors='replace', special_tokens=None, max_len=None, do_lower_case=True)[source]¶

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer
set_special_tokens(special_tokens)[source]¶

Add a list of additional tokens to the encoder. The additional tokens are indexed starting from the last index of the current vocabulary, in the order of the special_tokens list.
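The indexing rule above can be illustrated with a minimal stdlib sketch (not PaddleNLP's actual implementation; the base vocabulary and token names are made up for illustration):

```python
# Hypothetical base vocabulary; the real tokenizer loads it from vocab_file.
vocab = {"hello": 0, "world": 1}

def set_special_tokens(vocab, special_tokens):
    """Index special tokens starting from len(vocab), in list order."""
    return {**vocab, **{tok: len(vocab) + i for i, tok in enumerate(special_tokens)}}

extended = set_special_tokens(vocab, ["<bos>", "<eos>"])
# "<bos>" is appended at index 2, "<eos>" at index 3
```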
convert_ids_to_tokens(ids, skip_special_tokens=False)[source]¶

Converts a single index or a sequence of indices (integers) into a token or a sequence of tokens (str) by using the vocabulary.

Parameters
    skip_special_tokens – Whether to skip decoding special tokens (self.all_special_tokens). Default: False.
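A minimal sketch of the lookup behavior, assuming a toy id-to-token table (the real tokenizer resolves ids against its trained BPE vocabulary):

```python
# Hypothetical vocabulary for illustration only.
ids_to_tokens = {0: "hello", 1: "world", 2: "<eod>"}
special_tokens = {"<eod>"}

def convert_ids_to_tokens(ids, skip_special_tokens=False):
    """Map a single index or a sequence of indices to token strings."""
    if isinstance(ids, int):
        return ids_to_tokens[ids]
    tokens = [ids_to_tokens[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in special_tokens]
    return tokens
```

Passing a single int returns a single token string, while a list of ids returns a list of tokens, with special tokens dropped when skip_special_tokens is True.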
encode(text, fn=None)[source]¶

Returns a dictionary containing the encoded sequence or sequence pair and additional information: the mask for sequence classification and the overflowing elements if a max_seq_len is specified.

Parameters
    text (str, List[str] or List[int]) – The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids method).
    text_pair (str, List[str] or List[int], optional, defaults to None) – Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids method).
    max_seq_len (int, optional, defaults to 512) – If set to a number, will limit the total sequence returned so that it has a maximum length. If there are overflowing tokens, those will be added to the returned dictionary.
    pad_to_max_seq_len (bool, optional, defaults to False) – If set to True, the returned sequences will be padded according to the model's padding side and padding index, up to their max length. If no max length is specified, the padding is done up to the model's max length.
    truncation_strategy (str, optional, defaults to 'longest_first') – String selected in the following options:
        'longest_first' (default): Iteratively reduce the inputs sequence until the input is under max_seq_len, removing one token at a time from the longest sequence (when there is a pair of input sequences).
        'only_first': Only truncate the first sequence.
        'only_second': Only truncate the second sequence.
        'do_not_truncate': Does not truncate (raises an error if the input sequence is longer than max_seq_len).
    return_position_ids (bool, optional, defaults to False) – Whether to return tokens position ids.
    return_token_type_ids (bool, optional, defaults to True) – Whether to return token type IDs.
    return_attention_mask (bool, optional, defaults to False) – Whether to return the attention mask.
    return_length (bool, optional, defaults to False) – If set, the resulting dictionary will include the length of each encoded input.
    return_overflowing_tokens (bool, optional, defaults to False) – Whether to return overflowing token information.
    return_special_tokens_mask (bool, optional, defaults to False) – Whether to return special tokens mask information.
Returns
    A dictionary of shape:

    {
        input_ids: list[int],
        position_ids: list[int] if return_position_ids is True,
        token_type_ids: list[int] if return_token_type_ids is True (default),
        attention_mask: list[int] if return_attention_mask is True,
        seq_len: int if return_length is True,
        overflowing_tokens: list[int] if a max_seq_len is specified and return_overflowing_tokens is True,
        num_truncated_tokens: int if a max_seq_len is specified and return_overflowing_tokens is True,
        special_tokens_mask: list[int] if return_special_tokens_mask is True
    }

    With the fields:

    input_ids: list of token ids to be fed to a model
    position_ids: list of token position ids to be fed to a model
    token_type_ids: list of token type ids to be fed to a model
    attention_mask: list of indices specifying which tokens should be attended to by the model
    length: the input_ids length
    overflowing_tokens: list of overflowing tokens if a max length is specified
    num_truncated_tokens: number of overflowing tokens if a max_seq_len is specified
    special_tokens_mask: list of [0, 1], with 0 specifying special added tokens and 1 specifying sequence tokens
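The 'longest_first' truncation strategy described above can be sketched with plain Python (a simplified illustration, not PaddleNLP's actual implementation; real truncation also accounts for the special tokens added around the sequences):

```python
def truncate_longest_first(ids, pair_ids, max_seq_len):
    """Drop one token at a time from whichever sequence is currently
    longer, until the combined length fits under max_seq_len.
    Returns the truncated sequences plus the overflowing tokens."""
    ids, pair_ids = list(ids), list(pair_ids)
    overflow = []
    while len(ids) + len(pair_ids) > max_seq_len:
        if len(ids) >= len(pair_ids):
            overflow.append(ids.pop())
        else:
            overflow.append(pair_ids.pop())
    return ids, pair_ids, overflow
```

For a pair of sequences of lengths 5 and 2 with max_seq_len=5, only the first (longer) sequence loses tokens, and the removed tokens come back in the overflow list.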
class GPT2ChineseTokenizer(vocab_file, model_file, do_lower_case=True, max_len=512, bod_id='<bod>', eod_id='<eod>', max_length=None)[source]¶

Bases: paddlenlp.transformers.tokenizer_utils.PretrainedTokenizer

Constructs a GPT2 Chinese tokenizer. It uses a basic tokenizer to do punctuation splitting, lower casing and so on, and follows a WordPiece tokenizer to tokenize as subwords.
encode(text)[source]¶

Returns a dictionary containing the encoded sequence or sequence pair and additional information: the mask for sequence classification and the overflowing elements if a max_seq_len is specified.

Parameters
    text (str, List[str] or List[int]) – The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids method).
    text_pair (str, List[str] or List[int], optional, defaults to None) – Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids method).
    max_seq_len (int, optional, defaults to 512) – If set to a number, will limit the total sequence returned so that it has a maximum length. If there are overflowing tokens, those will be added to the returned dictionary.
    pad_to_max_seq_len (bool, optional, defaults to False) – If set to True, the returned sequences will be padded according to the model's padding side and padding index, up to their max length. If no max length is specified, the padding is done up to the model's max length.
    truncation_strategy (str, optional, defaults to 'longest_first') – String selected in the following options:
        'longest_first' (default): Iteratively reduce the inputs sequence until the input is under max_seq_len, removing one token at a time from the longest sequence (when there is a pair of input sequences).
        'only_first': Only truncate the first sequence.
        'only_second': Only truncate the second sequence.
        'do_not_truncate': Does not truncate (raises an error if the input sequence is longer than max_seq_len).
    return_position_ids (bool, optional, defaults to False) – Whether to return tokens position ids.
    return_token_type_ids (bool, optional, defaults to True) – Whether to return token type IDs.
    return_attention_mask (bool, optional, defaults to False) – Whether to return the attention mask.
    return_length (bool, optional, defaults to False) – If set, the resulting dictionary will include the length of each encoded input.
    return_overflowing_tokens (bool, optional, defaults to False) – Whether to return overflowing token information.
    return_special_tokens_mask (bool, optional, defaults to False) – Whether to return special tokens mask information.
Returns
    A dictionary of shape:

    {
        input_ids: list[int],
        position_ids: list[int] if return_position_ids is True,
        token_type_ids: list[int] if return_token_type_ids is True (default),
        attention_mask: list[int] if return_attention_mask is True,
        seq_len: int if return_length is True,
        overflowing_tokens: list[int] if a max_seq_len is specified and return_overflowing_tokens is True,
        num_truncated_tokens: int if a max_seq_len is specified and return_overflowing_tokens is True,
        special_tokens_mask: list[int] if return_special_tokens_mask is True
    }

    With the fields:

    input_ids: list of token ids to be fed to a model
    position_ids: list of token position ids to be fed to a model
    token_type_ids: list of token type ids to be fed to a model
    attention_mask: list of indices specifying which tokens should be attended to by the model
    length: the input_ids length
    overflowing_tokens: list of overflowing tokens if a max length is specified
    num_truncated_tokens: number of overflowing tokens if a max_seq_len is specified
    special_tokens_mask: list of [0, 1], with 0 specifying special added tokens and 1 specifying sequence tokens
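The pad_to_max_seq_len behavior and the attention_mask field work together: real tokens are marked 1 and padding positions 0. A minimal sketch, assuming right-side padding and a hypothetical pad id of 0 (the real padding side and index come from the model):

```python
def pad_sequence(input_ids, max_seq_len, pad_id=0):
    """Right-pad input_ids to max_seq_len and build the matching
    attention mask: 1 for real tokens, 0 for padding."""
    pad_len = max_seq_len - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * pad_len
    return input_ids + [pad_id] * pad_len, attention_mask
```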
convert_tokens_to_ids(text)[source]¶

Converts a sequence of tokens into ids using the vocab. The tokenizer should have the vocab attribute.

Parameters
    tokens (list(str)) – List of tokens.

Returns
    Converted id list.

Return type
    list
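The token-to-id direction is the inverse of convert_ids_to_tokens. A toy sketch, with a hypothetical dict standing in for the tokenizer's vocab attribute:

```python
# Hypothetical vocab; the real tokenizer loads it from vocab_file.
vocab = {"hello": 0, "world": 1}

def convert_tokens_to_ids(tokens):
    """Look each token up in the vocab; return the converted id list."""
    return [vocab[tok] for tok in tokens]
```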