# Using the Abstract Dataloader
**Recommendations**

- Define your components using `abstract` base classes, and specify input components using `spec` interfaces.
- Set up a type system, and provide your types to the ADL `spec` and `generic` components as type parameters (see the sketch below).
- Use a static type checker and a runtime type checker.
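
For example, a project might declare its own data types and then supply them to the ADL protocols as type parameters. The `LidarData` and `LidarMetadata` definitions below are purely illustrative placeholders, not part of the ADL:

```python
from dataclasses import dataclass
from typing import TypedDict

import numpy as np
from jaxtyping import Float64

from abstract_dataloader import spec


class LidarData(TypedDict):
    """Hypothetical sample type: a point cloud with per-point intensity."""
    points: Float64[np.ndarray, "N 3"]
    intensity: Float64[np.ndarray, "N"]


@dataclass
class LidarMetadata:
    """Hypothetical metadata type: capture timestamps for each sample."""
    timestamps: Float64[np.ndarray, "N"]


# The project's types are supplied to ADL protocols as type parameters.
LidarSensor = spec.Sensor[LidarData, LidarMetadata]
```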
## Using ADL Components
Code using ADL-compliant components (which may itself be another ADL-compliant component) should use the exports provided in the `spec` as type annotations. When combined with static type checkers such as mypy or pyright and runtime (dynamic) type checkers such as beartype and typeguard, this allows code to be verified for some degree of type safety and spec compliance.
```python
from dataclasses import dataclass
from typing import Any

from abstract_dataloader.spec import Sensor


@dataclass
class Foo:
    x: float


def example(adl_sensor: Sensor[Foo, Any], idx: int) -> float:
    x1 = adl_sensor[idx].x
    # pyright: x1: float
    x2 = adl_sensor[x1]
    # pyright: argument of type "float" cannot be assigned to parameter
    # "index" of type "int | Integer[Any]"...
    return x2
```
**Spec Verification via Type Checking**

The `abstract_dataloader` module does not include any runtime type checking; downstream modules should enable runtime type checking if desired, for example with install hooks such as `beartype_this_package()` or jaxtyping's `install_import_hook(...)`.
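
As a minimal sketch (assuming a downstream package named `my_dataloaders`, which is purely illustrative), the beartype hook can be installed from the package's own `__init__.py`:

```python
# In my_dataloaders/__init__.py: type-check every function in the
# package with beartype.
from beartype.claw import beartype_this_package

beartype_this_package()
```

Alternatively, jaxtyping's import hook (which also checks array shape and dtype annotations) can be installed before importing the package to be checked:

```python
from jaxtyping import install_import_hook

# Applies beartype to my_dataloaders as it is imported.
with install_import_hook("my_dataloaders", "beartype.beartype"):
    import my_dataloaders
```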
## Implementing ADL Components
There are three ways to implement ADL-compliant components, in order of preference:

- Use the abstract base classes provided in `abstract_dataloader.abstract`.
- Explicitly implement the protocols described in `abstract_dataloader.spec`.
- Implement the protocols, using the specifications as guidance, with neither `abstract_dataloader.abstract` nor `abstract_dataloader.spec` classes as base classes.
### Using the Abstract Base Classes
The `abstract_dataloader.abstract` submodule provides abstract base class implementations of the applicable specifications, with required methods annotated, and methods which can be written in terms of others "polyfilled" by default. This is the fastest way to get started:
```python
import os

import numpy as np
from abstract_dataloader import abstract


class Lidar(abstract.Sensor[LidarData, LidarMetadata]):
    def __init__(
        self, metadata: LidarMetadata, path: str, name: str = "sensor"
    ) -> None:
        self.path = path
        super().__init__(metadata=metadata, name=name)

    def __getitem__(self, batch: int | np.integer) -> LidarData:
        blob_path = os.path.join(self.path, self.name, f"{batch:06}.npz")
        # .npz archives are binary, so open the blob in binary mode.
        with open(blob_path, "rb") as f:
            return dict(np.load(f))
```
**Prefer `spec` to `abstract` as type bounds**

While this example does not take any ADL components as inputs, ADL-compliant implementations which do take components as inputs should use `spec` types as annotations instead of `abstract` base classes. This preserves modularity: if an `abstract` base class is used as an input type annotation, this prevents the use of other compatible implementations which do not use that base class.
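
For example, a minimal sketch of a consuming component (the `TransformedSensor` wrapper is illustrative, and `LidarData` / `LidarMetadata` are placeholder types):

```python
from typing import Callable

import numpy as np
from abstract_dataloader import spec


class TransformedSensor:
    """Hypothetical wrapper which applies a transform to each sample."""

    def __init__(
        self,
        # Annotated with the spec protocol, so any compliant sensor is
        # accepted, whether or not it inherits from abstract.Sensor.
        sensor: spec.Sensor[LidarData, LidarMetadata],
        transform: Callable[[LidarData], LidarData],
    ) -> None:
        self.sensor = sensor
        self.transform = transform

    def __getitem__(self, index: int | np.integer) -> LidarData:
        return self.transform(self.sensor[index])
```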
### Explicitly Implementing an ADL Component
Implementations do not necessarily need to use the provided abstract base classes, and can instead use the specifications themselves as base classes. This explicitly implements the defined protocols, which allows static type checkers to provide some degree of verification that the specification has been implemented.
**Example**

The provided `generic.Next` explicitly implements `spec.Synchronization`:
```python
import numpy as np
from jaxtyping import Float64, UInt32

from abstract_dataloader import spec


class Next(spec.Synchronization):
    def __init__(self, reference: str) -> None:
        self.reference = reference

    def __call__(
        self, timestamps: dict[str, Float64[np.ndarray, "_N"]]
    ) -> dict[str, UInt32[np.ndarray, "M"]]:
        # Clip the reference timestamps to the time window observed by
        # all sensors.
        ref_time_all = timestamps[self.reference]
        start_time = max(t[0] for t in timestamps.values())
        end_time = min(t[-1] for t in timestamps.values())
        start_idx = np.searchsorted(ref_time_all, start_time)
        end_idx = np.searchsorted(ref_time_all, end_time)
        ref_time = ref_time_all[start_idx:end_idx]
        # For each sensor, take the index of the next sample at or after
        # each reference timestamp.
        return {
            k: np.searchsorted(v, ref_time).astype(np.uint32)
            for k, v in timestamps.items()}
```
### Implicitly Implementing an ADL Component
Since the abstract dataloader is based entirely on structural subtyping ("static duck typing"), implementing an ADL-compliant component does not actually require the abstract dataloader!

- Simply implementing the protocols described by the ADL, using the same types (or with more general inputs and more specific outputs), automatically allows for ADL compatibility with downstream components which use only the ADL spec and correctly apply ADL spec type-checking (see the sketch below).
- Since the types only describe behavior, parts or all of the `spec` can be copied into other modules; these copies then become totally interchangeable with the originals!
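
For instance, the following class never imports `abstract_dataloader`, yet type-checks wherever a `spec.Synchronization` is expected; the `Previous` name and its matching rule (take the most recent sample at or before each reference timestamp) are illustrative, not part of the ADL:

```python
import numpy as np
from jaxtyping import Float64, UInt32


class Previous:
    """Match each sensor to the most recent sample at or before each
    reference timestamp. Note: no abstract_dataloader import required."""

    def __init__(self, reference: str) -> None:
        self.reference = reference

    def __call__(
        self, timestamps: dict[str, Float64[np.ndarray, "_N"]]
    ) -> dict[str, UInt32[np.ndarray, "M"]]:
        ref = timestamps[self.reference]
        # Index of the last sample with timestamp <= each reference time,
        # clamped to the valid index range for each sensor.
        return {
            k: np.clip(
                np.searchsorted(v, ref, side="right") - 1, 0, len(v) - 1
            ).astype(np.uint32)
            for k, v in timestamps.items()}
```

Since `spec.Synchronization` only describes structure, a static type checker accepts `Previous(...)` anywhere that type is required, with no inheritance relationship at all.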
## Extending Protocols
When extending the abstract dataloader specifications with additional domain-specific features, keep in mind that your implementation must be usable wherever an ADL-compliant dataloader is used. This is known as the Liskov Substitution Principle (and is the error static type checkers will report if you break that compatibility). The key implications are as follows:
Preconditions cannot be strengthened in the subtype.

- Implementations may not impose stricter requirements on the input types; for example, where `__getitem__` is expected to allow `index: int | np.integer`, implementations may not require more specific types, e.g. `np.int32`.
- This also implies that additional arguments must be optional and have defaults, so that they are not required to be set when called (see the sketch below).
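
As a rough sketch (the `StrictLidar` and `CachedLidar` classes, and the `refresh` flag, are illustrative extensions of the `Lidar` example above):

```python
class StrictLidar(Lidar):
    # Not LSP-compliant: the precondition is strengthened, since callers
    # holding a spec.Sensor may pass any int | np.integer index; static
    # type checkers report this as an incompatible override.
    def __getitem__(self, batch: np.int32) -> LidarData:
        return super().__getitem__(batch)


class CachedLidar(Lidar):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self._cache: dict[int, LidarData] = {}

    # Compliant: the index type is unchanged, and the additional argument
    # is optional with a default.
    def __getitem__(
        self, batch: int | np.integer, refresh: bool = False
    ) -> LidarData:
        if refresh or int(batch) not in self._cache:
            self._cache[int(batch)] = super().__getitem__(batch)
        return self._cache[int(batch)]
```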
Postconditions cannot be weakened in the subtype.

- Return types cannot be more general than in the specifications. In general, this should never be an issue, since the return types provided in the abstract dataloader specifications are extremely general.
- However, the generic types provided should be followed; for example, when `Sensor.__getitem__` returns a certain `Sample`, `Sensor.stream` should also return that same type.
**Example**

The `abstract.Sensor` provides a `stream` implementation with an additional `batch` argument that modifies the return type, while still remaining fully compliant with the `Sensor` specification.
```python
@overload
def stream(self, batch: None = None) -> Iterator[TSample]: ...
@overload
def stream(self, batch: int) -> Iterator[list[TSample]]: ...
def stream(
    self, batch: int | None = None
) -> Iterator[TSample | list[TSample]]:
    if batch is None:
        for i in range(len(self)):
            yield self[i]
    else:
        for i in range(len(self) // batch):
            yield [self[j] for j in range(i * batch, (i + 1) * batch)]
```
This is accomplished using an `@overload` with a default that falls back to the specification (see the usage sketch after this list):

- If `.stream()` is used only as described by the spec, `batch=None` matches the first overload, which returns an `Iterator[TSample]`, just as in the spec.
- If `.stream()` is passed a `batch=...` argument, the second overload now matches, instead returning an `Iterator[list[TSample]]`.
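
For example (continuing with the illustrative `Lidar` sensor from above), both call patterns type-check against the specification:

```python
sensor: Lidar = ...  # constructed as in the earlier example

# Matches the spec: Iterator[LidarData].
for sample in sensor.stream():
    ...

# Uses the extended overload: Iterator[list[LidarData]].
for chunk in sensor.stream(batch=8):
    ...
```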