distinct

class Distinct(n_size=2, trans_func=None, name='distinct')
Bases: paddle.metric.metrics.Metric
Distinct is a metric that evaluates the diversity of generated text by computing the ratio of distinct n-grams to the total number of n-grams. The higher the score, the more diverse the text. For details, see https://arxiv.org/abs/1510.03055
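The computation described above can be sketched in plain Python. This is a minimal illustration and not the library's code: the helper name `distinct_n` is hypothetical, and it assumes the score is the number of unique n-grams divided by the total number of n-grams, with 0.0 returned when the input is too short to form any n-gram.

```python
def distinct_n(tokens, n=2):
    """Ratio of unique n-grams to total n-grams in one token list.

    Hypothetical helper for illustration; not part of paddlenlp.
    """
    # Slide a window of size n over the tokens.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Assumption: too-short inputs score 0.0.
        return 0.0
    return len(set(ngrams)) / len(ngrams)


cand = ["The", "cat", "The", "cat", "on", "the", "mat"]
# 6 bigrams in total, 5 of them unique ("The cat" occurs twice).
print(distinct_n(cand, n=2))
```

Note that the comparison is case-sensitive in this sketch, so "The" and "the" count as different tokens.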
Distinct can be used either as a paddle.metric.Metric class or as an ordinary class. When Distinct is used as a paddle.metric.Metric class, a function is needed that transforms the network output into a string list. Note that the Distinct computed here is different from the Distinct calculated in prediction; it is only for observation during training and evaluation.

Parameters
- trans_func (callable, optional) – Transforms the network output into a string list. Default: None. When Distinct is used as a paddle.metric.Metric class, trans_func must be provided. Note that the input of trans_func is a numpy array.
- n_size (int, optional) – The n-gram size used by the Distinct metric. Default: 2.
- name (str, optional) – Name of the paddle.metric.Metric instance. Default: "distinct".
Examples

1. Using as a general evaluation object.

.. code-block:: python

    from paddlenlp.metrics import Distinct

    distinct = Distinct()
    cand = ["The", "cat", "The", "cat", "on", "the", "mat"]
    distinct.add_inst(cand)
    print(distinct.score())  # 0.8333333333333334

2. Using as an instance of paddle.metric.Metric.

.. code-block:: python

    import numpy as np
    from functools import partial
    import paddle
    from paddlenlp.transformers import BertTokenizer
    from paddlenlp.metrics import Distinct

    def trans_func(logits, tokenizer):
        '''Transform the network output `logits` to string list.'''
        # [batch_size, seq_len]
        token_ids = np.argmax(logits, axis=-1).tolist()
        cand_list = []
        for ids in token_ids:
            tokens = tokenizer.convert_ids_to_tokens(ids)
            strings = tokenizer.convert_tokens_to_string(tokens)
            cand_list.append(strings.split())
        return cand_list

    paddle.seed(2021)
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    distinct = Distinct(trans_func=partial(trans_func, tokenizer=tokenizer))
    batch_size, seq_len, vocab_size = 4, 16, tokenizer.vocab_size
    logits = paddle.rand([batch_size, seq_len, vocab_size])
    distinct.update(logits.numpy())
    print(distinct.accumulate())  # 1.0
update(output, *args)
Update the metric states. This method first uses trans_func to process output and obtain the list of tokenized candidate sentences, then calls add_inst on each candidate in turn.
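The flow of update described above can be sketched without Paddle installed. Everything here is an illustrative assumption rather than the library's implementation: `DistinctSketch`, `toy_trans_func`, the toy `vocab`, and the attribute names `count` and `seen` are all hypothetical. Only the two steps (apply trans_func, then call add_inst per candidate) come from the description.

```python
import numpy as np


class DistinctSketch:
    """Hypothetical stand-in for illustration; not the paddlenlp class."""

    def __init__(self, trans_func, n_size=2):
        self.trans_func = trans_func
        self.n_size = n_size
        self.count = 0      # total n-grams seen so far
        self.seen = set()   # unique n-grams seen so far

    def add_inst(self, cand):
        # Register every n-gram of one tokenized candidate.
        for i in range(len(cand) - self.n_size + 1):
            self.seen.add(tuple(cand[i:i + self.n_size]))
            self.count += 1

    def update(self, output, *args):
        # Step 1: convert raw network output into tokenized candidates.
        # Step 2: feed the candidates to add_inst one by one.
        for cand in self.trans_func(output):
            self.add_inst(cand)

    def accumulate(self):
        return len(self.seen) / self.count if self.count else 0.0


vocab = ["the", "cat", "sat", "mat"]  # toy vocabulary (assumption)


def toy_trans_func(logits):
    # [batch_size, seq_len, vocab_size] -> [batch_size, seq_len] token ids
    ids = np.argmax(logits, axis=-1)
    return [[vocab[i] for i in row] for row in ids]


# One-hot "logits" that decode to: the cat the cat / the cat sat mat
logits = np.eye(4)[np.array([[0, 1, 0, 1], [0, 1, 2, 3]])]
metric = DistinctSketch(trans_func=toy_trans_func)
metric.update(logits)
# 6 bigrams across the batch, 4 of them unique.
print(metric.accumulate())
```

Because the state lives in the running count and the set of seen n-grams, calling update on several batches accumulates one corpus-level score rather than an average of per-batch scores, at least in this sketch.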