distinct
class Distinct(n_size=2, trans_func=None, name='distinct')

Bases: paddle.metric.metrics.Metric

Distinct is an algorithm for evaluating the textual diversity of generated text by counting distinct n-grams. The larger the number of distinct n-grams, the higher the diversity of the text. See details at https://arxiv.org/abs/1510.03055

Distinct can be used as a paddle.metric.Metric class or as an ordinary class. When Distinct is used as a paddle.metric.Metric class, a function is needed that transforms the network output into a string list. Note that the Distinct computed here differs from the Distinct calculated in prediction; it is intended only for observation during training and evaluation.

Parameters
trans_func (callable, optional) – trans_func transforms the network output into a string list. Default: None. When Distinct is used as a paddle.metric.Metric class, trans_func must be provided. Note that the input of trans_func is a numpy array.

n_size (int, optional) – Size of the n-grams used by the Distinct metric. Default: 2.

name (str, optional) – Name of the paddle.metric.Metric instance. Default: "distinct".
Examples
1. Using as a general evaluation object.

.. code-block:: python

    from paddlenlp.metrics import Distinct

    distinct = Distinct()
    cand = ["The", "cat", "The", "cat", "on", "the", "mat"]
    distinct.add_inst(cand)
    print(distinct.score())  # 0.8333333333333334
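To see where a value like 0.8333 can come from, here is a minimal pure-Python sketch. It assumes the score is the number of unique n-grams divided by the total number of n-grams, which is consistent with the output above but is not taken from the PaddleNLP source. The helper name `distinct_n` is hypothetical and not part of the library.

```python
def distinct_n(tokens, n_size=2):
    """Return the fraction of n-grams in `tokens` that are unique.

    Hypothetical helper for illustration; assumes score = unique / total.
    """
    # Slide a window of length n_size over the token list.
    ngrams = [tuple(tokens[i:i + n_size])
              for i in range(len(tokens) - n_size + 1)]
    if not ngrams:
        # Too few tokens to form a single n-gram.
        return 0.0
    return len(set(ngrams)) / len(ngrams)


cand = ["The", "cat", "The", "cat", "on", "the", "mat"]
# Bigrams: 6 in total, 5 unique ("The cat" occurs twice).
print(distinct_n(cand))            # 0.8333333333333334
# Unigrams: 7 in total, 5 unique ("The" and "the" differ by case).
print(distinct_n(cand, n_size=1))  # 0.7142857142857143
```

Under this assumption the comparison is case-sensitive, so "The" and "the" count as different tokens.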
2. Using as an instance of paddle.metric.Metric.

.. code-block:: python

    import numpy as np
    from functools import partial
    import paddle
    from paddlenlp.transformers import BertTokenizer
    from paddlenlp.metrics import Distinct

    def trans_func(logits, tokenizer):
        '''Transform the network output `logits` to string list.'''
        # [batch_size, seq_len]
        token_ids = np.argmax(logits, axis=-1).tolist()
        cand_list = []
        for ids in token_ids:
            tokens = tokenizer.convert_ids_to_tokens(ids)
            strings = tokenizer.convert_tokens_to_string(tokens)
            cand_list.append(strings.split())
        return cand_list

    paddle.seed(2021)
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    distinct = Distinct(trans_func=partial(trans_func, tokenizer=tokenizer))
    batch_size, seq_len, vocab_size = 4, 16, tokenizer.vocab_size
    logits = paddle.rand([batch_size, seq_len, vocab_size])
    distinct.update(logits.numpy())
    print(distinct.accumulate())  # 1.0
update(output, *args)

Update the metric states. This method first uses trans_func to process output into a list of tokenized candidate sentences, then calls add_inst on each candidate in turn.
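The flow described for update can be sketched without Paddle as follows. `DistinctSketch`, its attributes, the toy `vocab`, and `ids_to_tokens` are all hypothetical stand-ins written for illustration; they mirror the documented behaviour (apply trans_func, then call add_inst per candidate) and assume the score is unique n-grams over total n-grams. This is not the library's implementation.

```python
class DistinctSketch:
    """Hypothetical stand-in for illustration, not PaddleNLP's Distinct."""

    def __init__(self, n_size=2, trans_func=None):
        self.n_size = n_size
        self.trans_func = trans_func
        self.seen = set()   # unique n-grams observed so far
        self.total = 0      # total n-grams observed so far

    def add_inst(self, cand):
        # Record every n-gram of one tokenized candidate sentence.
        for i in range(len(cand) - self.n_size + 1):
            self.seen.add(tuple(cand[i:i + self.n_size]))
            self.total += 1

    def update(self, output, *args):
        if self.trans_func is None:
            raise ValueError("trans_func is required when calling update()")
        # trans_func turns raw output into a list of token lists.
        for cand in self.trans_func(output):
            self.add_inst(cand)

    def score(self):
        return len(self.seen) / self.total if self.total else 0.0


# Toy vocabulary and transform (assumptions for this example only).
vocab = ["the", "cat", "sat", "on", "mat"]

def ids_to_tokens(output):
    return [[vocab[i] for i in row] for row in output]

metric = DistinctSketch(trans_func=ids_to_tokens)
metric.update([[0, 1, 2], [0, 1, 3]])
# Bigrams: (the, cat), (cat, sat), (the, cat), (cat, on): 3 unique of 4.
print(metric.score())  # 0.75
```

Keeping the counts on the instance lets repeated update calls accumulate over batches before the score is read.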