distinct

class Distinct(n_size=2, trans_func=None, name='distinct')[source]

Bases: paddle.metric.metrics.Metric

Distinct is an algorithm for evaluating the textual diversity of generated text by calculating the number of distinct n-grams. The larger the number of distinct n-grams, the higher the diversity of the text. See details at https://arxiv.org/abs/1510.03055

Distinct can be used as a paddle.metric.Metric class or as an ordinary class. When Distinct is used as a paddle.metric.Metric class, a function is needed that transforms the network output into a string list. Note that the Distinct calculated here is different from the Distinct calculated on predictions; it is only for observation during training and evaluation.
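
For intuition, here is a minimal sketch of how a distinct-n value is typically computed (unique n-grams divided by total n-grams); the distinct_n helper below is hypothetical and not part of the PaddleNLP API:

def distinct_n(tokens, n=2):
    # Hypothetical helper: ratio of unique n-grams to total n-grams in one candidate.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

cand = ["The", "cat", "The", "cat", "on", "the", "mat"]
print(distinct_n(cand, n=2))  # 0.8333... (5 unique bigrams out of 6)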

Parameters
  • trans_func (callable, optional) – A function that transforms the network output into a string list. Default: None. When Distinct is used as a paddle.metric.Metric class, trans_func must be provided. Please note that the input of trans_func is a numpy array.

  • n_size (int, optional) – The size of the n-grams for the Distinct metric. Default: 2.

  • name (str, optional) – Name of the paddle.metric.Metric instance. Default: 'distinct'.

Examples

1. Using as a general evaluation object.

from paddlenlp.metrics import Distinct

distinct = Distinct()
cand = ["The", "cat", "The", "cat", "on", "the", "mat"]
distinct.add_inst(cand)
print(distinct.score()) # 0.8333333333333334

2. Using as an instance of paddle.metric.Metric.

import numpy as np
from functools import partial
import paddle
from paddlenlp.transformers import BertTokenizer
from paddlenlp.metrics import Distinct

def trans_func(logits, tokenizer):
    '''Transform the network output `logits` to string list.'''
    # [batch_size, seq_len]
    token_ids = np.argmax(logits, axis=-1).tolist()
    cand_list = []
    for ids in token_ids:
        tokens = tokenizer.convert_ids_to_tokens(ids)
        strings = tokenizer.convert_tokens_to_string(tokens)
        cand_list.append(strings.split())
    return cand_list

paddle.seed(2021)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
distinct = Distinct(trans_func=partial(trans_func, tokenizer=tokenizer))
batch_size, seq_len, vocab_size = 4, 16, tokenizer.vocab_size
logits = paddle.rand([batch_size, seq_len, vocab_size])
distinct.update(logits.numpy())
print(distinct.accumulate()) # 1.0
update(output, *args)[source]

Update the metric states. This method first uses trans_func to transform the output into the tokenized candidate sentence list, and then calls add_inst on each candidate in the list.
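
Conceptually, a call to update(output) behaves like applying trans_func and then add_inst for every candidate. The following is a hedged sketch of that equivalence; the toy trans_func below is made up for illustration and is not the transform you would use with a real tokenizer:

import numpy as np
from paddlenlp.metrics import Distinct

def trans_func(logits):
    # Toy transform: argmax over the vocab axis, token ids rendered as strings.
    token_ids = np.argmax(logits, axis=-1).tolist()
    return [[str(i) for i in ids] for ids in token_ids]

logits = np.random.rand(4, 16, 50)  # fake network output [batch_size, seq_len, vocab_size]

metric_a = Distinct(trans_func=trans_func)
metric_a.update(logits)             # trans_func + add_inst per candidate, in one call

metric_b = Distinct()               # doing the same two steps by hand
for cand in trans_func(logits):
    metric_b.add_inst(cand)

print(metric_a.accumulate() == metric_b.accumulate())  # expected: True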

add_inst(cand)[source]

Update the states based on the candidate.

Parameters

cand (list) – Tokenized candidate sentence generated by the model.
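
Since the states accumulate across calls, add_inst can be invoked once per generated sentence. A small sketch (the token lists are made up, and the expected value assumes n-gram counts are pooled over all added candidates, as described above):

from paddlenlp.metrics import Distinct

distinct = Distinct(n_size=2)
distinct.add_inst(["the", "cat", "sat"])  # bigrams: "the cat", "cat sat"
distinct.add_inst(["the", "cat", "ran"])  # adds "the cat" (already seen) and "cat ran"
print(distinct.score())                   # 3 unique bigrams / 4 total = 0.75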

reset()[source]

Reset the states and result.

accumulate()[source]

Calculate the final distinct metric.

name()[source]

Returns the metric name.
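
Putting the methods together, a hedged sketch of a typical evaluation pass; eval_batches and my_trans_func are placeholders, not part of the API:

# Hypothetical evaluation loop; `eval_batches` yields numpy arrays of shape
# [batch_size, seq_len, vocab_size] and `my_trans_func` is a user-defined transform.
distinct = Distinct(n_size=2, trans_func=my_trans_func)
distinct.reset()                               # clear the accumulated n-gram states
for logits in eval_batches:
    distinct.update(logits)                    # trans_func + add_inst per candidate
print(distinct.name(), distinct.accumulate())  # e.g. "distinct" and the final distinct-2 value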