# Using an Abstract Dataloader
A fully implemented Abstract Dataloader (ADL) compliant system should consist of a collection of modular components implementing sensor data reading, time synchronization, trace & dataset handling, followed by data preprocessing.

In this tutorial, we cover how to use these components if you are given implementations for some or all of them. We split this into data loading and processing, which are connected only by a trivial interface: the output of the `Dataset` should be the input of the `Pipeline`.
## Dataset
At a minimum, any ADL-compliant dataloader should include one or more `Sensor` implementations. These may be accompanied by custom `Synchronization`, `Trace`, and `Dataset` implementations.
### Sensor
Sensors must implement the `Sensor` specification:

- Sensors have a `metadata: spec.Metadata` attribute, which contains a `timestamps: Float64[np.ndarray, "N"]` attribute.
- Each sensor has a `__getitem__` which can be used to read data by index, and a `__len__`.
Warning

The ADL does not prescribe a standardized API for initializing a `Sensor`, since this is highly implementation-specific.
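As a concrete sketch, reading from a hypothetical `LidarSensor` (the class name and constructor arguments are illustrative only, not part of the ADL or any particular implementation):

```python
# Hypothetical ADL-compliant sensor; the constructor is implementation-specific.
lidar = LidarSensor("/data/trace_0001/lidar")

n_frames = len(lidar)                   # __len__: number of measurements
first = lidar[0]                        # __getitem__: read a measurement by index
timestamps = lidar.metadata.timestamps  # Float64[np.ndarray, "N"] timestamps
```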
### Trace
Once you have initialized a collection of sensors which all correspond to simultaneously-recorded sensor data, assemble them into a `Trace`:

- `Trace` implementations supplied by an ADL-compliant dataloader may have their own initialization methods.
- A trace has a `__getitem__` which reads data from each sensor corresponding to some global index assigned by a `Synchronization` policy, and a `__len__`.
In the case that a data loading library does not provide a custom `Trace` implementation, `abstract_dataloader.abstract.Trace` can be used instead as a simple, no-frills baseline implementation.
Info

A `Trace` implementation should take (and `abstract.Trace` does take) a `Synchronization` policy as an argument. While a particular ADL-compliant dataloader implementation may provide custom `Synchronization` policies, a few generic implementations are included with `abstract_dataloader.generic`:
| Class | Description |
| --- | --- |
| `Empty` | A no-op for initializing a trace without any synchronization (i.e., just as a container of sensors). |
| `Nearest` | Find the nearest measurement for each sensor relative to the reference sensor's measurements. |
| `Next` | Find the next measurement for each sensor relative to the reference sensor's measurements. |
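Putting these together, a minimal sketch of assembling a trace from two already-initialized sensors (the `lidar` from above and a hypothetical `camera`). The constructor arguments shown for `abstract.Trace` and `generic.Nearest` are assumptions for illustration; consult the API reference for the exact signatures:

```python
from abstract_dataloader import abstract, generic

# Sketch: synchronize both sensors to the lidar's timestamps using
# nearest-neighbor matching, and wrap them in the baseline Trace.
# Argument names are assumed, not prescribed by the ADL.
trace = abstract.Trace(
    sensors={"lidar": lidar, "camera": camera},
    sync=generic.Nearest(reference="lidar"))

n_samples = len(trace)  # number of synchronized (global) indices
sample = trace[0]       # one time-aligned observation from each sensor
```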
### Dataset
Finally, once a collection of `Trace` objects is initialized, combine them into a `Dataset`:

- As with `Trace`, `Dataset` implementations supplied by an ADL-compliant dataloader may have their own initialization methods.
- Similarly, datasets have a `__getitem__` which reads data from each sensor and a `__len__`.
In the case that a data loading library does not provide a custom `Dataset` implementation, `abstract_dataloader.abstract.Dataset` can also be used instead.
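For example, combining traces with the baseline implementation; the assumption here is that `abstract.Dataset` is constructed from a sequence of traces:

```python
from abstract_dataloader import abstract

# Sketch: combine per-recording traces into a single dataset. The constructor
# argument (a sequence of traces) is an assumption for illustration.
dataset = abstract.Dataset([trace_0001, trace_0002, trace_0003])

total = len(dataset)  # total samples across all traces
sample = dataset[42]  # the global index is mapped to the owning trace
```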
Tip

If your use case does not require combining multiple traces into a dataset, you can directly use a `Trace` as a `Dataset`: the `Trace` protocol subclasses the `Dataset` protocol, so every interface required by `Dataset` is already provided by a `Trace`.
## Pipeline
Dataloaders may or may not come with a data preprocessing `Pipeline` which you are expected to use. Since data loaders (`Dataset`) and processing pipelines (`Pipeline`) are modular and freely composable as long as they share the same data types, it's also possible that a pipeline is distributed separately from the data loader(s) it is compatible with.
### Use Out-of-the-Box
If a library comes with a complete, ready-to-use `Pipeline`, then all that remains is to apply the pipeline:
```python
def apply_pipeline(indices, dataset, pipeline):
    # Load raw samples by index, apply the per-sample transform, collate the
    # transformed samples into a batch, then apply batch-wise processing.
    raw = [dataset[i] for i in indices]
    transformed = [pipeline.transform(x) for x in raw]
    collated = pipeline.collate(transformed)
    processed = pipeline.batch(collated)
    return processed
```
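For example, a hypothetical call which processes the first four samples as a single batch:

```python
# Assumes `dataset` and `pipeline` have already been constructed as above.
batch = apply_pipeline([0, 1, 2, 3], dataset, pipeline)
```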
Tip

In practice, preprocessing should be integrated with the data loading and training loop so that it is properly pipelined and parallelized! Assuming you are using pytorch, the following building blocks may be helpful:
- Use `TransformedDataset`, which provides a pytorch map-style dataset with a pipeline's `.transform` applied.
- Pass `Pipeline.collate` as the `collate_fn` for a pytorch `DataLoader` (see the sketch below).
- If you are using pytorch lightning, `ext.lightning.ADLDataModule` can also handle `.transform`, `.collate`, and a number of other common data marshalling steps for you.
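A minimal sketch of wiring up the first two building blocks; the import path and constructor signature of `TransformedDataset` are assumptions and may differ in your version of the library:

```python
from torch.utils.data import DataLoader

# Assumed import path for the pytorch extras; adjust to your installation.
from abstract_dataloader.torch import TransformedDataset

# Wrap the ADL dataset so pipeline.transform is applied to each sample
# inside the DataLoader workers (constructor signature assumed).
train_set = TransformedDataset(dataset, transform=pipeline.transform)

loader = DataLoader(
    train_set, batch_size=32, num_workers=4,
    collate_fn=pipeline.collate)  # collate samples into batches

for samples in loader:
    batch = pipeline.batch(samples)  # batch-wise processing, e.g. on the GPU
    ...
```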
### Assemble from Components
If you don't have a complete `Pipeline` implementation but instead have separate components, you can use `abstract.Pipeline` to assemble a `transform: spec.Transform`, `collate: spec.Collate`, and `batch: spec.Transform` into a single pipeline.
Tip

Both `transform` and `batch` are optional and will fall back to the identity function if not provided. Only `collate` is required.
Info

For pytorch users, a reference collate function is provided as `abstract_dataloader.torch.Collate`, which can handle common pytorch tensor operations and data structures.
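Putting the pieces together, a sketch of assembling a pipeline from standalone components. Here `CropAndScale` and `RandomBatchAugment` are hypothetical `spec.Transform` implementations, the keyword arguments follow the parameter names listed above, and whether `Collate` is instantiated or passed directly depends on the library's API (it is instantiated here for illustration):

```python
from abstract_dataloader import abstract
from abstract_dataloader.torch import Collate

# Hypothetical per-sample and per-batch transforms; only collate is required,
# and omitted stages fall back to the identity function.
pipeline = abstract.Pipeline(
    transform=CropAndScale(size=(128, 128)),
    collate=Collate(),
    batch=RandomBatchAugment())
```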