# Abstract Dataloader Specifications

The implementations here provide "duck type" protocol definitions of key data loading primitives. To implement the specification, users simply need to "fill in" the methods described here for the types they wish to implement.
## Type Parameters

ADL specification protocol types are defined as generics, which are parameterized by other types. These type parameters are documented by a "Type Parameters" section where applicable.

## Composition Rules

ADL protocols which can be composed together are documented by a "Composition Rules" section.
## abstract_dataloader.spec.Metadata

Bases: `Protocol`

Sensor metadata.
All sensor metadata is expected to be held in memory during training, so great effort should be taken to minimize its memory usage. Any additional information which is not strictly necessary for book-keeping, or which takes more than negligible space, should be loaded as data instead.
**Note:** This can be a `@dataclass`, `typing.NamedTuple`, or a fully custom type; it just has to expose a `timestamps` attribute.
Attributes:

| Name | Type | Description |
|---|---|---|
| `timestamps` | `Float[ndarray, N]` | Measurement timestamps, in seconds. Nominally in epoch time; must be consistent within each trace (but not necessarily across traces). |
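For instance, a minimal sketch of a conforming metadata type (the `SimpleMetadata` name is illustrative):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SimpleMetadata:
    """Hypothetical metadata type: satisfies `Metadata` simply by
    exposing a `timestamps` attribute."""

    timestamps: np.ndarray  # measurement timestamps, in seconds
```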
## abstract_dataloader.spec.Sensor

Bases: `Protocol`, `Generic[TSample, TMetadata]`

A sensor, consisting of a synchronous time-series of measurements.

This protocol is parameterized by generic `TSample` and `TMetadata` types, which can encode the expected data type of this sensor. For example:
```python
class Point2D(TypedDict):
    x: float
    y: float

def point_transform(point_sensor: Sensor[Point2D, Any]) -> T:
    ...
```
This encodes an argument, `point_sensor`, which is expected to be a sensor that reads data with type `Point2D`, but does not specify a metadata type.
**Type Parameters**

- `TSample`: sample data type which this `Sensor` returns. As a convention, we suggest returning "batched" data by default, i.e. with a leading singleton axis.
- `TMetadata`: metadata type associated with this sensor; must implement `Metadata`.
Attributes:

| Name | Type | Description |
|---|---|---|
| `metadata` | `TMetadata` | Sensor metadata, including timestamp information. |
### `__getitem__`
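As an illustration, a minimal sketch of a conforming sensor, reusing `Point2D` from the example above and the illustrative `SimpleMetadata` type from the `Metadata` section:

```python
import numpy as np


class PointSensor:
    """Hypothetical `Sensor[Point2D, SimpleMetadata]` implementation."""

    def __init__(self, timestamps: np.ndarray) -> None:
        self.metadata = SimpleMetadata(timestamps=timestamps)

    def __getitem__(self, index: int) -> Point2D:
        # A real implementation would load the `index`-th measurement
        # from storage; a constant stands in here.
        return Point2D(x=0.0, y=0.0)
```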
## abstract_dataloader.spec.Synchronization

Bases: `Protocol`

Synchronization protocol for asynchronous time-series.

Defines a rule for creating matching sensor index tuples which correspond to some kind of global index.
### `__call__`
Apply synchronization protocol.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `timestamps` | `dict[str, Float[ndarray, _N]]` | Sensor timestamps. Each key denotes a different sensor name, and the value denotes the timestamps for that sensor. | *required* |

Returns:

| Type | Description |
|---|---|
| `dict[str, Integer[ndarray, M]]` | A dictionary, where keys correspond to each sensor, and values correspond to the indices which map global indices to sensor indices, i.e. the `i`-th synchronized sample for each sensor is found at sensor index `indices[sensor][i]`. |
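For example, a minimal sketch of a nearest-neighbor synchronization rule (the `NearestSync` name is illustrative, not part of the spec):

```python
import numpy as np


class NearestSync:
    """Hypothetical `Synchronization`: pick one sensor as the reference
    clock, and match every sensor to the reference timestamps by
    nearest neighbor."""

    def __init__(self, reference: str) -> None:
        self.reference = reference

    def __call__(
        self, timestamps: dict[str, np.ndarray],
    ) -> dict[str, np.ndarray]:
        ref = timestamps[self.reference]
        return {k: self._nearest(ref, v) for k, v in timestamps.items()}

    @staticmethod
    def _nearest(ref: np.ndarray, times: np.ndarray) -> np.ndarray:
        # Nearest index in (sorted) `times` for each reference
        # timestamp; assumes each sensor has at least two measurements.
        right = np.clip(np.searchsorted(times, ref), 1, len(times) - 1)
        left = right - 1
        return np.where(ref - times[left] <= times[right] - ref, left, right)
```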
## abstract_dataloader.spec.Trace

Bases: `Protocol`, `Generic[TSample]`

A trace, consisting of multiple simultaneously-recording sensors.

This protocol is parameterized by a generic `TSample` type, which can encode the expected data type of this trace.

**Type Parameters**

- `TSample`: sample data type which this `Trace` returns. As a convention, we suggest returning "batched" data by default, i.e. with a leading singleton axis.
### `__getitem__`

Get item from global index (or fetch a sensor by name).

**Info:** For the user's convenience, traces can be indexed by a `str` sensor name, returning that `Sensor`. While we are generally wary of requiring "quality of life" features, we include this since a simple `isinstance(index, str)` check suffices to implement this feature.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `index` | `int \| integer \| str` | Sample index, or sensor name. | *required* |

Returns:

| Type | Description |
|---|---|
| `TSample \| Sensor` | Loaded sample, if `index` is an integer sample index. |
| `TSample \| Sensor` | The corresponding `Sensor`, if `index` is a `str` sensor name. |
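A minimal sketch of this dispatch pattern (the `SimpleTrace` name and its `sensors`/`indices` attributes are illustrative):

```python
from typing import Any

import numpy as np


class SimpleTrace:
    """Hypothetical trace: dispatches `str` indices to sensors, and
    integer indices to synchronized samples."""

    def __init__(
        self, sensors: dict[str, Any], indices: dict[str, np.ndarray],
    ) -> None:
        self.sensors = sensors
        self.indices = indices  # e.g. produced by a Synchronization

    def __getitem__(self, index: int | np.integer | str) -> Any:
        if isinstance(index, str):
            # Quality-of-life feature: fetch a sensor by name.
            return self.sensors[index]
        # Translate the global index into per-sensor indices, then load.
        return {
            name: sensor[self.indices[name][index]]
            for name, sensor in self.sensors.items()
        }
```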
## abstract_dataloader.spec.Dataset

Bases: `Protocol`, `Generic[TSample]`

A dataset, consisting of multiple traces concatenated together.

Due to the type signatures, a `Trace` is actually a subtype of `Dataset`. This means that a dataset which implements a collection of traces can also take a collection of datasets!

**Type Parameters**

- `TSample`: sample data type which this `Dataset` returns. As a convention, we suggest returning "batched" data by default, i.e. with a leading singleton axis.
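For example, assuming a hypothetical `SimpleDataset` which concatenates any collection of `Dataset`s (and some traces `trace_a`, `trace_b`, `trace_c`):

```python
# Since `Trace[TSample]` is itself a `Dataset[TSample]`, datasets nest:
flat = SimpleDataset([trace_a, trace_b, trace_c])
nested = SimpleDataset([SimpleDataset([trace_a, trace_b]), trace_c])
# Both expose the same concatenated samples.
```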
## abstract_dataloader.spec.Transform

Bases: `Protocol`, `Generic[TRaw, TTransformed]`

Sample or batch-wise transform.

**Note:** This protocol is a suggestively-named equivalent to `Callable[[TRaw], TTransformed]` or `Callable[[Any], Any]`.

**Composition Rules**

- `Transform`s can be freely composed, as long as each transform's `TTransformed` matches the next transform's `TRaw`; this composition is implemented by `abstract.Transform`.
- Composed `Transform`s result in another `Transform`, as sketched below.
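A minimal sketch of this rule (the `compose` helper is illustrative; the library's own sequential composition is `abstract.Transform`):

```python
from typing import Callable, TypeVar

TRaw = TypeVar("TRaw")
TMid = TypeVar("TMid")
TTransformed = TypeVar("TTransformed")


def compose(
    f: Callable[[TRaw], TMid],
    g: Callable[[TMid], TTransformed],
) -> Callable[[TRaw], TTransformed]:
    """Hypothetical helper: since a `Transform` is equivalent to a
    callable, two compatible transforms compose into another."""
    return lambda data: g(f(data))
```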
**Type Parameters**

- `TRaw`: input data type.
- `TTransformed`: output data type.
### `__call__`

Apply transform to a single sample.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `TRaw` | A single `TRaw` data sample. | *required* |

Returns:

| Type | Description |
|---|---|
| `TTransformed` | A single `TTransformed` data sample. |
## abstract_dataloader.spec.Collate

Bases: `Protocol`, `Generic[TTransformed, TCollated]`

Data collation.

**Note:** This protocol is equivalent to `Callable[[Sequence[TTransformed]], TCollated]`. `Collate` can also be viewed as a special case of `Transform`, where the input type `TRaw` must be a `Sequence[...]`.

**Composition Rules**

- `Collate` can only be composed in parallel, and can never be sequentially composed.

**Type Parameters**

- `TTransformed`: input data type.
- `TCollated`: output data type.
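For instance, a minimal sketch of a collate function for dict-of-array samples (the `StackCollate` name is illustrative):

```python
from typing import Sequence

import numpy as np


class StackCollate:
    """Hypothetical `Collate`: stack per-sample arrays along a new
    leading batch axis."""

    def __call__(
        self, data: Sequence[dict[str, np.ndarray]],
    ) -> dict[str, np.ndarray]:
        # Assumes all samples share the same keys and array shapes.
        return {
            key: np.stack([sample[key] for sample in data])
            for key in data[0]
        }
```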
## abstract_dataloader.spec.Pipeline

Bases: `Protocol`, `Generic[TRaw, TTransformed, TCollated, TProcessed]`

Dataloader transform pipeline.

This protocol is parameterized by four type variables which encode the different data formats at each stage in the pipeline. This forms a `Raw -> Transformed -> Collated -> Processed` pipeline with three transforms:

- `sample`: a sample-to-sample transform; can be sequentially assembled from one or more `Transform`s.
- `collate`: a list-of-samples-to-batch transform; can use exactly one `Collate`.
- `batch`: a batch-to-batch transform; can be sequentially assembled from one or more `Transform`s.

**Composition Rules**

- A full `Pipeline` can be sequentially pre-composed and/or post-composed with one or more `Transform`s; this is implemented by `generic.ComposedPipeline`.
- `Pipeline`s can always be composed in parallel; this is implemented by `generic.ParallelPipelines`, with a pytorch `nn.Module`-compatible version in `torch.ParallelPipelines`.

**Type Parameters**

- `TRaw`: input data format.
- `TTransformed`: data after the first `sample` step.
- `TCollated`: data after the second `collate` step.
- `TProcessed`: output data format.
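Putting the stages together, data nominally flows through a pipeline `p` as follows (a usage sketch; `raw_samples` is illustrative):

```python
transformed = [p.sample(raw) for raw in raw_samples]  # TRaw -> TTransformed
collated = p.collate(transformed)                     # -> TCollated
processed = p.batch(collated)                         # -> TProcessed
```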
### `batch`

Transform data batch.

- Operates on a batch of data, nominally on the GPU side of a dataloader.
- This method is both sequentially and parallel composable.

**Implementation as `torch.nn.Module`**

If these `Transform`s require GPU state, it may be helpful to implement this stage as a `torch.nn.Module`. In this case, `batch` should redirect to `__call__`, which in turn redirects to `nn.Module.forward` in order to handle any registered pytorch hooks.
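A minimal sketch of this pattern (the `GPUBatchStage` name and identity `forward` are illustrative):

```python
import torch
from torch import nn


class GPUBatchStage(nn.Module):
    """Hypothetical batch stage holding GPU state as a `nn.Module`."""

    def forward(self, data: torch.Tensor) -> torch.Tensor:
        return data  # real batch processing would go here

    def batch(self, data: torch.Tensor) -> torch.Tensor:
        # Redirect through `__call__` so registered hooks are handled.
        return self(data)
```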
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `TCollated` | A `TCollated` batch of data. | *required* |

Returns:

| Type | Description |
|---|---|
| `TProcessed` | The `TProcessed` batch of data. |
### `collate`

`collate(data: Sequence[TTransformed]) -> TCollated`

Collate a list of data samples into a GPU-ready batch.

- Operates on the CPU side of the dataloader, and is responsible for aggregating individual samples into a batch (but not for transferring them to the GPU).
- Analogous to the `collate_fn` of a pytorch dataloader.
- This method is not sequentially composable.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `Sequence[TTransformed]` | A sequence of `TTransformed` data samples. | *required* |

Returns:

| Type | Description |
|---|---|
| `TCollated` | A `TCollated` batch of data. |
### `sample`

Transform single samples.

- Operates on single samples, nominally on the CPU side of a dataloader.
- This method is both sequentially and parallel composable.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `TRaw` | A single `TRaw` data sample. | *required* |

Returns:

| Type | Description |
|---|---|
| `TTransformed` | A single `TTransformed` data sample. |