dataset

class MapDataset(data, **kwargs)[source]

Bases: paddle.fluid.dataloader.dataset.Dataset

Wraps a dataset-like object as an instance of Dataset and equips it with map and other utility methods. All non-magic methods of the raw object are also accessible.

Parameters

data (list|Dataset) – A dataset-like object. It can be a list or a subclass of Dataset.

filter(fn)[source]

Filters samples with the given function and updates this dataset with the samples that remain.

Parameters

fn (callable) – A filter function that takes a sample as input and returns a boolean. Samples for which it returns False are discarded.
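As a rough illustration of these filtering semantics, the sketch below uses a plain Python list in place of a dataset. The predicate keep_long is a hypothetical example, and the list comprehension is only a stand-in for the library's internals, not its actual implementation.

```python
# Stand-in for a dataset: a plain list of samples (an assumption for illustration).
samples = [
    {"text": "good", "label": 1},
    {"text": "a much longer review", "label": 0},
    {"text": "fine", "label": 1},
]


def keep_long(sample):
    """Hypothetical filter function: keep samples whose text exceeds 4 characters."""
    return len(sample["text"]) > 4


# Samples for which the filter function returns False are discarded.
filtered = [s for s in samples if keep_long(s)]
print(filtered)
```

With the class documented here, the corresponding call would be dataset.filter(keep_long).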

shard(num_shards=None, index=None)[source]

Updates this dataset to keep only the samples whose indices modulo num_shards equal index.

Parameters
  • num_shards (int, optional) – An integer representing the number of data shards. If None, num_shards is the number of trainers. Default: None.

  • index (int, optional) – An integer representing the index of the current shard. If None, index is the current trainer rank id. Default: None.
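The sharding rule can be illustrated with a short sketch. It assumes the usual round-robin scheme, in which the sample at position i goes to shard i % num_shards. shard_indices is a hypothetical helper and not part of the library.

```python
def shard_indices(num_samples, num_shards, index):
    """Hypothetical helper: indices of the samples that belong to shard `index`."""
    return [i for i in range(num_samples) if i % num_shards == index]


samples = ["s0", "s1", "s2", "s3", "s4", "s5", "s6"]
num_shards = 3

# Build every shard to show that the shards are disjoint and cover all samples.
shards = [
    [samples[i] for i in shard_indices(len(samples), num_shards, index)]
    for index in range(num_shards)
]

for index, shard in enumerate(shards):
    print(index, shard)
```

In distributed training each trainer would keep only the shard matching its own rank, so no sample is processed by two trainers.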

map(fn, lazy=True, batched=False)[source]

Applies a function to the dataset to transform and update every sample.

Parameters
  • fn (callable) – The transformation to perform. It receives a single sample as its argument if batched is False; otherwise it receives all examples.

  • lazy (bool, optional) – If True, transformations are delayed and performed on demand. Otherwise, all samples are transformed at once. Note that if fn is stochastic, lazy should be True, or you will get the same result in every epoch. Default: True.

  • batched (bool, optional) – If True, the transformation takes all examples as input and returns a collection of transformed examples. Note that if set to True, the lazy option is ignored. Default: False.
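The difference between lazy, eager, and batched mapping can be sketched with a small stand-in class. TinyMapDataset and add_noise are hypothetical names used only for illustration; the real MapDataset may be implemented differently.

```python
import random


class TinyMapDataset:
    """Hypothetical stand-in for a map-style dataset; not the library class."""

    def __init__(self, data):
        self.data = list(data)
        self._transforms = []  # functions applied on access when lazy

    def map(self, fn, lazy=True, batched=False):
        if batched:
            # fn receives all examples at once; the lazy flag is ignored.
            self.data = list(fn(self.data))
        elif lazy:
            # Defer: remember fn and apply it every time a sample is read.
            self._transforms.append(fn)
        else:
            # Eager: transform every sample once, right now.
            self.data = [fn(x) for x in self.data]
        return self

    def __getitem__(self, idx):
        sample = self.data[idx]
        for fn in self._transforms:
            sample = fn(sample)
        return sample

    def __len__(self):
        return len(self.data)


calls = {"n": 0}


def add_noise(x):
    """Stochastic transform that also counts how often it is invoked."""
    calls["n"] += 1
    return x + random.random()


# Eager: the noise is drawn once, so every epoch sees identical samples.
eager = TinyMapDataset([1, 2, 3]).map(add_noise, lazy=False)
eager_epoch1 = [eager[i] for i in range(len(eager))]
eager_epoch2 = [eager[i] for i in range(len(eager))]

# Lazy: add_noise runs again on every access, so each epoch draws fresh noise.
lazy = TinyMapDataset([1, 2, 3]).map(add_noise, lazy=True)
lazy_epoch1 = [lazy[i] for i in range(len(lazy))]
lazy_epoch2 = [lazy[i] for i in range(len(lazy))]

# Batched: the function sees the whole collection in a single call.
batched = TinyMapDataset([1, 2, 3]).map(lambda xs: [x * 10 for x in xs], batched=True)
batched_values = [batched[i] for i in range(len(batched))]
```

In this sketch the eager dataset calls add_noise three times in total, while the lazy one calls it on every read. That is why a stochastic fn, such as random augmentation, only varies between epochs when it is applied lazily.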

class DatasetBuilder(lazy=None, name=None, **config)[source]

Bases: object

A base class for all DatasetBuilders. It provides a read() function that turns a data file into a MapDataset or IterDataset.

Subclasses should implement _get_data() to download the data file and _read() to read the data file into an iterable of examples.

read(filename, split='train')[source]

Returns a dataset containing all the examples that can be read from the file path. If self.lazy is False, this eagerly reads all instances from self._read() and returns a MapDataset. If self.lazy is True, this returns an IterDataset, which internally relies on the generator created from self._read() to lazily produce examples. In this case your implementation of _read() must also be lazy (that is, it must not load all examples into memory at once).
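The eager and lazy paths of read() can be sketched as follows. TinyBuilder is a hypothetical stand-in, not the library's DatasetBuilder. To stay self-contained it reads from an in-memory list of tab-separated lines instead of a file path, and it returns a plain list or generator where the real class would return a MapDataset or IterDataset.

```python
import types


class TinyBuilder:
    """Hypothetical stand-in for a DatasetBuilder subclass; not the library class."""

    def __init__(self, lazy=False):
        self.lazy = lazy

    def _read(self, lines):
        # A generator: yields one example at a time instead of building a list,
        # so it never holds the whole dataset in memory.
        for line in lines:
            text, label = line.rsplit("\t", 1)
            yield {"text": text, "label": int(label)}

    def read(self, lines):
        if self.lazy:
            # Lazy path: hand back the generator; examples are produced on demand.
            return self._read(lines)
        # Eager path: materialise every example immediately.
        return list(self._read(lines))


raw_lines = ["great movie\t1", "too slow\t0"]

eager_examples = TinyBuilder(lazy=False).read(raw_lines)
lazy_examples = TinyBuilder(lazy=True).read(raw_lines)

first_lazy = next(lazy_examples)  # only now is the first line parsed
```

Writing _read() as a generator serves both modes: the eager path simply exhausts it into a list, while the lazy path consumes it one example at a time.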

get_labels()[source]

Returns the list of class labels of the dataset, if specified.

get_vocab()[source]

Returns the vocab file path of the dataset, if specified.

class IterDataset(data, **kwargs)[source]

Bases: paddle.fluid.dataloader.dataset.IterableDataset

Wraps a dataset-like object as an instance of Dataset and equips it with map and other utility methods. All non-magic methods of the raw object are also accessible.

Parameters

data (Iterable|Dataset) – A dataset-like object. It can be an Iterable or a subclass of Dataset.

filter(fn)[source]

Filters samples with the given function and updates this dataset with the samples that remain.

Parameters

fn (callable) – A filter function that takes a sample as input and returns a boolean. Samples for which it returns False are discarded.

shard(num_shards=None, index=None)[source]

Updates this dataset to keep only the samples whose indices modulo num_shards equal index.

Parameters
  • num_shards (int, optional) – An integer representing the number of data shards. If None, num_shards is the number of trainers. Default: None.

  • index (int, optional) – An integer representing the index of the current shard. If None, index is the current trainer rank id. Default: None.

map(fn)[source]

Applies a function to the dataset to transform and update every sample.

Parameters

fn (callable) – The transformation to perform. It receives a single sample as its argument.
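For an iterable dataset, a transformation can only be applied as samples stream past. The sketch below shows that idea with plain generators. lazy_map and stream are hypothetical names, and the real IterDataset may organise this differently.

```python
def lazy_map(samples, fn):
    """Hypothetical helper: apply fn to each sample only when it is requested."""
    for sample in samples:
        yield fn(sample)


def stream():
    """Stand-in for an iterable data source that is never fully held in memory."""
    for i in range(5):
        yield {"id": i, "text": f"Sample {i}"}


def lowercase_text(sample):
    # Return a new dict so that the source sample is left untouched.
    return {**sample, "text": sample["text"].lower()}


mapped = lazy_map(stream(), lowercase_text)

first = next(mapped)  # only the first sample has been produced and transformed
rest = list(mapped)   # the remaining samples are transformed as they are consumed
```

Because nothing is materialised up front, memory use stays constant regardless of how many samples the source yields.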