abstract_dataloader.ext.sample

Dataset sampling, including a low discrepancy subset sampler.

Dataset sampling is implemented using a SampledDataset, which transparently wraps an existing Dataset.

abstract_dataloader.ext.sample.SampledDataset

Bases: Dataset[TSample], Generic[TSample]

Dataset wrapper which only exposes a subset of values.

The sampling mode can be one of:

random: Uniform random sampling with replacement, using np.random.default_rng and the supplied seed; if seed is a float, it is converted into an integer by multiplying by len(dataset) and truncating.
ld: Low discrepancy sampling; see sample_ld.
uniform: Uniformly spaced sampling, with linspace(0, len(dataset) - 1, samples), truncated to integer indices.
Callable: A callable which takes the length of the underlying dataset and returns an array of indices to sample from it; in this mode the returned array alone determines the subset.
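As a rough illustration of the modes above (excluding ld), the following sketch computes the index arrays directly with NumPy, following the constructor source shown on this page. The values of n, samples, and seed are arbitrary, and every_other is a hypothetical custom sampler, not part of the library.

```python
import numpy as np

n, samples, seed = 10, 4, 0  # illustrative values only

# "uniform": evenly spaced positions over [0, n - 1], truncated to integers.
uniform_idx = np.linspace(0, n - 1, samples, dtype=np.int64)

# "random": indices drawn with replacement from a seeded generator,
# so the same index may appear more than once.
random_idx = np.random.default_rng(seed).choice(n, size=samples, replace=True)


def every_other(total: int) -> np.ndarray:
    """Hypothetical custom sampler: keep every second index."""
    return np.arange(0, total, 2)


# Callable mode: the callable receives the dataset length.
custom_idx = every_other(n).astype(np.int64)

print(uniform_idx)  # [0 3 6 9]
print(custom_idx)   # [0 2 4 6 8]
```

A callable such as every_other would be passed as the mode argument; note that it decides the subset size itself.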

Info

This SampledDataset is fully ADL-compliant, and acts as a passthrough to an ADL-compliant Dataset: if the input dataset is a Dataset[Sample], then the wrapped dataset is also a Dataset[Sample].

Type Parameters

Sample: dataset sample type.

Parameters:

Name	Type	Description	Default
`dataset`	`Dataset[TSample]`	underlying dataset.	required
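`samples`	`int \| float`	target number of samples; a `float` is interpreted as a fraction of `len(dataset)`.	required
`seed`	`int \| float`	sampler seed; in `random` mode, a `float` seed is multiplied by `len(dataset)` and truncated to an integer.	`0`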
`mode`	`Literal['ld', 'uniform', 'random'] \| Callable[[int], Integer[ndarray, N]]`	sampling mode.	`'ld'`
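The float handling for samples (and for seed in random mode) can be sketched as below. resolve_count is a hypothetical helper written for this illustration; it mirrors the isinstance check and int() conversion in the constructor source, but it is not a function provided by the library.

```python
from typing import Union


def resolve_count(value: Union[int, float], dataset_len: int) -> int:
    """Hypothetical helper: scale a float by the dataset length."""
    if isinstance(value, float):
        # int() truncates toward zero rather than rounding.
        return int(value * dataset_len)
    return value


print(resolve_count(0.25, 1000))  # 250
print(resolve_count(100, 1000))   # 100
print(resolve_count(1.0, 1000))   # 1000
print(resolve_count(1, 1000))     # 1
```

Under this logic the type of the argument matters: 1.0 selects a subset as large as the whole dataset, while 1 selects a single sample.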

Source code in src/abstract_dataloader/ext/sample.py

class SampledDataset(spec.Dataset[TSample], Generic[TSample]):
    """Dataset wrapper which only exposes a subset of values.

    The sampling `mode` can be one of:

    - `random`: Uniform random sampling with replacement, using
      `np.random.default_rng` and the supplied seed; if `seed` is a `float`,
      it is converted into an integer by multiplying by `len(dataset)` and
      truncating.
    - `ld`: Low discrepancy sampling; see [`sample_ld`][^.].
    - `uniform`: Uniformly spaced sampling, with
      `linspace(0, len(dataset) - 1, samples)`.
    - `Callable`: A callable which takes the length of the underlying
      dataset, and returns an array of indices to sample from the dataset.

    !!! info

        This `SampledDataset` is fully ADL-compliant, and acts as a passthrough
        to an ADL-compliant [`Dataset`][abstract_dataloader.spec.]: if the
        input dataset is a `Dataset[Sample]`, then the wrapped dataset is also
        a `Dataset[Sample]`.

    Type Parameters:
        `Sample`: dataset sample type.

    Args:
        dataset: underlying dataset.
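        samples: target number of samples; a `float` is interpreted as a
            fraction of `len(dataset)`.
        seed: sampler seed; a `float` is scaled by `len(dataset)` in
            `random` mode.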
        mode: sampling mode.
    """

    def __init__(
        self, dataset: spec.Dataset[TSample], samples: int | float,
        seed: int | float = 0,
        mode: Literal["ld", "uniform", "random"]
            | Callable[[int], Integer[np.ndarray, "N"]] = "ld"
    ) -> None:
        self.dataset = dataset

        if isinstance(samples, float):
            samples = int(samples * len(dataset))

        if mode == "ld":
            self.subset = sample_ld(len(dataset), samples=samples, seed=seed)
        elif mode == "random":
            if isinstance(seed, float):
                seed = int(seed * len(dataset))
            self.subset = np.random.default_rng(seed).choice(
                len(dataset), size=samples, replace=True)
        elif mode == "uniform":
            self.subset = np.linspace(
                0, len(dataset) - 1, samples, dtype=np.int64)
        else:  # Callable
            self.subset = mode(len(dataset)).astype(np.int64)

    def __getitem__(self, index: int | np.integer) -> TSample:
        """Fetch item from this dataset by global index."""
        return self.dataset[self.subset[index]]

    def __len__(self) -> int:
        """Total number of samples in this dataset."""
        return self.subset.shape[0]

__getitem__

__getitem__(index: int | integer) -> TSample

Fetch item from this dataset by global index.

Source code in src/abstract_dataloader/ext/sample.py

def __getitem__(self, index: int | np.integer) -> TSample:
    """Fetch item from this dataset by global index."""
    return self.dataset[self.subset[index]]
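The lookup above is a two-step index translation: the index first selects an entry of the subset array, and that entry then indexes the wrapped dataset. A minimal sketch, using a plain list as a stand-in for the dataset and a hand-written subset array (data, subset, and fetch are illustrative names, not library objects):

```python
import numpy as np

data = ["a", "b", "c", "d", "e", "f"]          # stand-in for the wrapped dataset
subset = np.array([0, 2, 5], dtype=np.int64)   # stand-in for self.subset


def fetch(index: int) -> str:
    """Translate a subset position into a dataset index, then look it up."""
    return data[subset[index]]


print(fetch(0))  # a
print(fetch(1))  # c
print(fetch(2))  # f
```

Because the translation goes through a NumPy array, a negative index such as -1 resolves against the subset, not against the wrapped dataset.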

__len__

__len__() -> int

Total number of samples in this dataset.

Source code in src/abstract_dataloader/ext/sample.py

def __len__(self) -> int:
    """Total number of samples in this dataset."""
    return self.subset.shape[0]