Data Transforms

TL;DR

A Transform is any data transformation. Collate is a special transform which aggregates a Sequence of data into a batch. Combining a CPU-side Transform, a Collate function, and a GPU-side Transform yields a data Pipeline.

Transforms

Since data processing steps can vary wildly between domains, the lowest common denominator for describing a data transformation is simply a Callable[[TRaw], TTransformed]: a callable which takes an input data type and converts it to some other data type. We provide this as the suggestively named Transform protocol:

from typing import Generic, Protocol, TypeVar

TRaw = TypeVar("TRaw", contravariant=True)
TTransformed = TypeVar("TTransformed", covariant=True)


class Transform(Protocol, Generic[TRaw, TTransformed]):
    """Sample or batch-wise transform."""

    def __call__(self, data: TRaw) -> TTransformed:
        """Apply transform to a single sample."""
        ...
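Since Transform is a structural protocol, any callable with a matching signature satisfies it. As a minimal sketch (the function below is purely illustrative, not part of the library):

```python
# A hypothetical Transform[str, list[float]]: parse a comma-separated
# line of numbers into a list of floats.
def parse_line(data: str) -> list[float]:
    return [float(x) for x in data.split(",")]


print(parse_line("1.0,2.5,3.0"))  # [1.0, 2.5, 3.0]
```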

Composition Rules

Transforms can be sequentially composed, as long as the output type of the first is a subtype of the input of the second, e.g.

Transform[T2, T3] (.) Transform[T1, T2] = Transform[T1, T3].
This simple rule is implemented in abstract.Transform.
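As a sketch of this rule (the compose helper here is illustrative; the library's actual mechanism lives in abstract.Transform):

```python
from typing import Callable, TypeVar

T1 = TypeVar("T1")
T2 = TypeVar("T2")
T3 = TypeVar("T3")


def compose(f: Callable[[T1], T2], g: Callable[[T2], T3]) -> Callable[[T1], T3]:
    """Sequentially compose two transforms: (g . f)(x) = g(f(x))."""
    def composed(data: T1) -> T3:
        return g(f(data))
    return composed


# Transform[str, int] composed with Transform[int, float]:
str_to_float = compose(len, float)
print(str_to_float("abcd"))  # 4.0
```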

Batching and Collation

While clearly universal, the "all data processing is composed callables" view is too vague to be helpful for organizing data transforms. To categorize data transforms more precisely, we turn to analyzing batched operation.

From a code portability standpoint, we can classify all transforms by whether they accept and/or produce batched data. This implies that there are four possible types of transforms: single-sample to single-sample, single-sample to batch, batch to batch, and batch to single-sample.

Question

Are there any use cases for batch to single-sample transforms? I'm not aware of any, though perhaps there are some edge cases out there.

Of the first three (which are commonly used), the single-sample to batch transform stands out. We define it as a narrower type than the generic transform, which we refer to as Collate; it is analogous to the collate_fn of a pytorch dataloader:

from collections.abc import Sequence
from typing import Generic, Protocol, TypeVar

TTransformed = TypeVar("TTransformed", contravariant=True)
TCollated = TypeVar("TCollated", covariant=True)


class Collate(Protocol, Generic[TTransformed, TCollated]):
    """Data collation."""

    def __call__(self, data: Sequence[TTransformed]) -> TCollated:
        """Collate a set of samples."""
        ...
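As a sketch of a Collate implementation (illustrative only, loosely modeled on pytorch's default_collate but restricted to python scalars):

```python
from collections.abc import Sequence

# Hypothetical Collate[dict[str, float], dict[str, list[float]]]:
# aggregate per-sample dicts into a single dict of lists, keyed
# identically to the input samples.
def collate_dicts(data: Sequence[dict[str, float]]) -> dict[str, list[float]]:
    keys = data[0].keys()
    return {k: [sample[k] for sample in data] for k in keys}


batch = collate_dicts([{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}])
print(batch)  # {'x': [1.0, 3.0], 'y': [2.0, 4.0]}
```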

Composition Rules

Collate cannot be sequentially composed, since samples can only be aggregated into a batch once. However, like all transforms, Collate implementations can be composed "in parallel" by routing different parts of a data structure to different Collate implementations.
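A sketch of such "parallel" composition (the route_collate helper is hypothetical, not a library API): each field of a per-sample dict is routed to its own collate function.

```python
from collections.abc import Sequence
from typing import Callable


def route_collate(
    routes: dict[str, Callable[[Sequence[object]], object]],
) -> Callable[[Sequence[dict[str, object]]], dict[str, object]]:
    """Build a Collate that applies a per-key collate to each field."""
    def collate(data: Sequence[dict[str, object]]) -> dict[str, object]:
        return {k: fn([sample[k] for sample in data]) for k, fn in routes.items()}
    return collate


# "x" fields are gathered into a list; "ids" fields into a tuple.
collate = route_collate({"x": list, "ids": tuple})
batch = collate([{"x": 1, "ids": 0}, {"x": 2, "ids": 1}])
print(batch)  # {'x': [1, 2], 'ids': (0, 1)}
```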

Pipelines

A typical data processing pipeline consists of a CPU-side transform, a batching function, and a GPU-side transform. We formalize this using a Pipeline, which collects these three components together into a generically typed container:

Pipeline
       .sample       .collate      .batch
┌──────┐      ┌──────┐    . ┌─────┐      ┌─────┐
│Sample├─────►│Sample├──┐ . │     │      │     │
└──────┘      └──────┘  │ . │     │      │     │
┌──────┐      ┌──────┐  │ . │     │      │     │
│Sample├─────►│Sample├──┤ . │     │      │     │
└──────┘      └──────┘  ├──►│Batch├─────►│Batch│
  ...           ...     │ . │     │      │     │
                        │ . │     │      │     │
┌──────┐      ┌──────┐  │ . │     │      │     │
│Sample├─────►│Sample├──┘ . │     │      │     │
└──────┘      └──────┘    . └─────┘      └─────┘
                     CPU◄───►GPU
  • Pipeline.sample: apply some transform to a single sample, returning another single sample. This represents most common dataloader operations, e.g. data augmentations, point cloud processing, and nominally occurs on the CPU.
  • Pipeline.collate: combine multiple samples into a "batch" which facilitates vectorized processing.
  • Pipeline.batch: this step operates solely on batched data and captures expensive, GPU-accelerated preprocessing steps; there is no sharp line where GPU preprocessing ends and a GPU-based model begins.
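The three stages above can be sketched as a generically typed container (an illustrative reconstruction, assuming Pipeline stores its stages as plain callables; the toy float/list/sum stages stand in for real CPU transforms, collation, and GPU preprocessing):

```python
from collections.abc import Sequence
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

TRaw = TypeVar("TRaw")
TTransformed = TypeVar("TTransformed")
TCollated = TypeVar("TCollated")
TProcessed = TypeVar("TProcessed")


@dataclass
class Pipeline(Generic[TRaw, TTransformed, TCollated, TProcessed]):
    """Container for the .sample -> .collate -> .batch stages."""

    sample: Callable[[TRaw], TTransformed]
    collate: Callable[[Sequence[TTransformed]], TCollated]
    batch: Callable[[TCollated], TProcessed]

    def __call__(self, data: Sequence[TRaw]) -> TProcessed:
        # Per-sample transform, then collation, then batched transform.
        return self.batch(self.collate([self.sample(x) for x in data]))


# Toy example: parse strings per-sample, gather into a list, then
# reduce the "batch" with a batched operation.
pipe = Pipeline(sample=float, collate=list, batch=sum)
print(pipe(["1.5", "2.5"]))  # 4.0
```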

Note

Since the only distinction between sample-to-sample and batch-to-batch transforms is the definition of a sample and a batch (which are inherently domain- and implementation-specific), we use the same generic Transform type for both Pipeline.sample and Pipeline.batch. Pipeline.collate, by contrast, accepts the distinct Collate type.

Implementations may also be generic over .sample and .batch, for example by using only overloaded operators, by operating on both pytorch cpu and cuda tensors, or by switching between numpy and pytorch backends. As such, we leave distinguishing .sample and .batch transforms up to the user.

Composition

Since they contain a Collate step, Pipelines may not be sequentially composed. Again, like all transforms, they can be composed "in parallel" by routing different parts of a data structure to different Pipelines.

A Pipeline may also be extended with additional Transforms, composed before its .sample stage or after its .batch stage.