abstract_dataloader.ext.sample
Dataset sampling, including a low discrepancy subset sampler.
Dataset sampling is implemented using a SampledDataset, which transparently wraps an existing Dataset.
abstract_dataloader.ext.sample.SampledDataset
Bases: Dataset[TSample], Generic[TSample]
Dataset wrapper which only exposes a subset of values.
The sampling mode can be one of:
- random: Uniform random sampling, with np.random.default_rng and the supplied seed; if seed is a float, it is converted into an integer by multiplying by len(dataset) and rounding.
- ld: Low discrepancy sampling; see sample_ld.
- uniform: Uniformly spaced sampling, with linspace(0, n, samples).
- Callable: A callable which takes the total number of samples, and returns an array of indices to sample from the dataset.
Info
This SampledDataset is fully ADL-compliant, and acts as a passthrough to an ADL-compliant Dataset: if the input dataset is a Dataset[Sample], then the wrapped dataset is also a Dataset[Sample].
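The wrapping idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the library's implementation: SubsetView is a hypothetical stand-in for a subset wrapper, and it takes an explicit list of indices rather than a sampling mode.

```python
from typing import Generic, Sequence, TypeVar

T = TypeVar("T")


class SubsetView(Generic[T]):
    """Hypothetical subset wrapper; not the library's SampledDataset."""

    def __init__(self, dataset: Sequence[T], indices: Sequence[int]) -> None:
        self.dataset = dataset
        # Positions in the underlying dataset that this view exposes.
        self.indices = list(indices)

    def __len__(self) -> int:
        # The view is only as long as the chosen subset.
        return len(self.indices)

    def __getitem__(self, index: int) -> T:
        # Translate the view index to an index in the underlying dataset.
        return self.dataset[self.indices[index]]


letters = ["a", "b", "c", "d", "e", "f"]
view = SubsetView(letters, [0, 4, 2])
subset = [view[i] for i in range(len(view))]
```

Because the view returns exactly what the underlying dataset returns, the sample type is unchanged, which is the passthrough property described above.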
Type Parameters
Sample: dataset sample type.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Dataset[TSample] | underlying dataset. | required |
samples | int \| float | target number of samples. | required |
seed | int \| float | sampler seed. | 0 |
mode | Literal['ld', 'uniform', 'random'] \| Callable[[int], Integer[ndarray, N]] | sampling mode. | 'ld' |
Source code in src/abstract_dataloader/ext/sample.py
__getitem__
abstract_dataloader.ext.sample.sample_ld
sample_ld(
    total: int,
    samples: float | int,
    seed: float | int = 0,
    alpha: float | int = 2 / (sqrt(5) + 1),
) -> Int64[ndarray, samples]
Compute deterministic low-discrepancy subset mask.
Uses a simple alpha * n % 1 formulation, described here, with a modification to work with integer samples:
- For a given total, find the integer closest to total * alpha which is co-prime with the total. Use this as the step size.
- Then, stepping through i * step (mod total) for i = 1...total is guaranteed to visit each index up to total exactly once, since the step size is co-prime with total.
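The two steps above can be sketched in plain Python. This is a minimal illustration, not the library's code: coprime_step and ld_indices are hypothetical names, the tie-breaking order when two candidates are equally close is an arbitrary choice, the seed is assumed to act as an additive starting offset, and a list is returned instead of an ndarray.

```python
from math import gcd, sqrt


def coprime_step(total: int, alpha: float) -> int:
    """Integer closest to total * alpha that is co-prime with total."""
    base = round(total * alpha)
    offset = 0
    while True:
        # Search outward from base: base, base + 1, base - 1, base + 2, ...
        for candidate in (base + offset, base - offset):
            if candidate >= 1 and gcd(candidate, total) == 1:
                return candidate
        offset += 1


def ld_indices(
    total: int, samples: int, seed: int = 0,
    alpha: float = 2 / (sqrt(5) + 1),
) -> list[int]:
    """Walk the index range with a fixed co-prime step, starting at seed."""
    step = coprime_step(total, alpha)
    return [(seed + i * step) % total for i in range(samples)]


step = coprime_step(10, 2 / (sqrt(5) + 1))
first_five = ld_indices(10, 5)
full_cycle = ld_indices(10, 10)
```

With total = 10, the closest integer to 10 * alpha is 6, which shares a factor with 10, so the search settles on 7. Requesting all 10 samples then yields every index exactly once, in a scattered order.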
Note
The default alpha = 1 / phi, where phi is the golden ratio (1 + sqrt(5)) / 2, has strong low-discrepancy sampling properties; due to the quantized nature of this function, the discrepancy may be larger when total is small.
Tip
Each of the parameters (samples, seed, alpha) can be specified as a float in [0, 1], and a proportion of the total will be used instead. For example, if seed = 0.7 and total = 100, then seed = 70 will be used.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
total | int | total number of samples to sample from, i.e. maximum index. | required |
samples | float \| int | number of samples to generate. Should be less than | required |
seed | float \| int | initial offset for the sampling sequence. Can leave this at | 0 |
alpha | float \| int | step size in the sequence; the default value is the inverse golden ratio | 2 / (sqrt(5) + 1) |
Returns:
Type | Description |
---|---|
Int64[ndarray, samples] | Array, in mixed order, of |