🗃️ Data Model
Overview
In this section you will learn about the main data abstractions upon which Pipewine is built, how to interact with them, and how to extend them according to your needs.
The Pipewine data model is composed of three main abstractions:
- Dataset - A Sequence of Sample instances, where "sequence" means an ordered collection that supports indexing, slicing and iteration.
- Sample - A Mapping of strings to Item instances, where "mapping" means a set of key-value pairs that supports indexing and iteration.
- Item - An object that has access to the underlying data unit, e.g. images, text, structured metadata, numpy arrays, and whatever serializable object you may want to include in your dataset.
In addition, there are some lower-level components that are detailed later on. You can disregard them for now:
- Parser - Defines how an item should encode/decode the associated data.
- Reader - Defines how an item should access data stored elsewhere.
Dataset
Dataset is the highest-level container and manages the following information:
- How many samples it contains
- In which order
It provides methods to access individual samples or slices of the dataset, just like Python's built-in slicing.
Note
A Dataset is an immutable Python Sequence, supporting all its methods.
All Dataset objects are generic, meaning that they can be hinted with information about the type of samples they contain. This is especially useful if you are using a static type checker.
Example
Example usage of a Dataset object:
# Given a Dataset of MySample's
dataset: Dataset[MySample]
# Supports len
number_of_samples = len(dataset)
# Supports indexing
sample_0 = dataset[0] # The type checker infers the type: MySample
sample_51 = dataset[51] # The type checker infers the type: MySample
# Supports slicing
sub_dataset = dataset[10:20] # The type checker infers the type: Dataset[MySample]
# Supports iteration
for sample in dataset:
    ...
By default Pipewine provides two implementations of the Dataset interface:
- ListDataset
- LazyDataset
ListDataset
A ListDataset is basically a wrapper around a Python list, such that, whenever indexed, the result is immediately available.
To achieve this, it has two fundamental requirements:
- All samples must be known at creation time.
- All samples must be always loaded into memory.
Due to these limitations, it's rarely used in the built-in operations, since the lazy alternative LazyDataset combined with caching provides a better trade-off, but it may be handy when:
- The number of samples is small.
- Samples are lightweight (i.e. no images, 3d data, huge tensors etc...)
Example
Example of how to construct a ListDataset:
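A minimal sketch, assuming ListDataset simply wraps a pre-built list of samples (check the API reference for the exact constructor signature):
# sample_0, sample_1, sample_2 are existing Sample objects (creation omitted)
samples: list[MySample] = [sample_0, sample_1, sample_2]
dataset = ListDataset(samples)
# Indexing returns the stored sample immediately, no computation involved
first = dataset[0]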
Time complexity (N = number of samples):
- Creation - O(N) (including the construction of the list)
- Length - O(1)
- Indexing - O(1)
- Slicing - O(N)
LazyDataset
The smarter alternative is LazyDataset, a type of Dataset that defers the computation of samples as late as possible: when using a LazyDataset, samples are created only when it is indexed, using a user-defined function passed at creation time.
This has some implications:
- Samples are not required to be known at creation time, meaning that you can create a LazyDataset in virtually zero time.
- Samples are not required to be kept loaded into memory the whole time, meaning that the memory required by LazyDataset is constant.
- Constant-time slicing.
- The computational cost shifts to the indexing part, which now carries the burden of creating and returning samples.
Example
Let's see an example of how to create and use a LazyDataset:
# Define a function that creates samples from an integer index.
def get_sample_fn(idx: int) -> Sample:
    print(f"Called with index: {idx}")
    sample = ...  # Omitted
    return sample
# Create a LazyDataset of length 10
dataset = LazyDataset(10, get_sample_fn)
# Do some indexing
sample_0 = dataset[0] # Prints 'Called with index: 0'
sample_1 = dataset[1] # Prints 'Called with index: 1'
sample_2 = dataset[2] # Prints 'Called with index: 2'
# Indexing the same sample multiple times calls the function multiple times
sample_1 = dataset[1] # Prints 'Called with index: 1'
Warning
What if my function is very expensive to compute? Is LazyDataset going to call it every time the dataset is indexed?
Yes, but that can be avoided by using Caches, which are not managed by the LazyDataset class.
Time complexity (N = number of samples):
- Creation - O(1)
- Length - O(1)
- Indexing - Depends on get_sample_fn and index_fn.
- Slicing - O(1)
Sample
Sample is a mapping-like container of Item objects. If datasets were tables (as in a SQL database), samples would be individual rows. Contrary to samples in a dataset, items in a sample do not have any ordering relationship and, instead of being indexed with an integer, they are indexed by key.
Note
A Sample is an immutable Python Mapping, supporting all its methods.
Example
Let's see an example of how to use a Sample object as a Python mapping:
# Given a Sample (let's not worry about its creation)
sample: Sample
# Get the number of items inside the sample
number_of_items = len(sample)
# Retrieve an item named "image".
# This does not return the actual image, but merely an Item that has access to it.
# This will be explained in detail later.
item_image = sample["image"]
# Retrieve an item named "metadata"
item_metadata = sample["metadata"]
# Iterate on all keys
for key in sample.keys():
    ...
# Iterate on all items
for item in sample.values():
    ...
# Iterate on all key-item pairs
for key, item in sample.items():
    ...
In addition to all Mapping methods, Sample provides a set of utility methods to create modified copies (samples are immutable) where new items are added, removed or have their content replaced by new values.
Example
Example showing how to manipulate Sample objects using utility methods:
# Given a Sample (let's not worry about its creation)
sample: Sample
# Add/Replace the item named "image" with another item
new_sample = sample.with_item("image", new_image_item)
# Add/Replace multiple items at once
new_sample = sample.with_items(image=new_image_item, metadata=new_metadata_item)
# Replace the contents of the item named "image" with new data
new_sample = sample.with_value("image", np.array([[[...]]]))
# Replace the contents of multiple items at once
new_sample = sample.with_values(image=np.array([[[...]]]), metadata={"foo": 42})
# Remove one or more items
new_sample = sample.without("image")
new_sample = sample.without("image", "metadata")
# Remove everything but one or more items
new_sample = sample.with_only("image")
new_sample = sample.with_only("image", "metadata")
# Rename items
new_sample = sample.remap({"image": "THE_IMAGE", "metadata": "THE_METADATA"})
In contrast with Datasets, Pipewine does not offer a lazy version of samples, meaning that all items are always kept in memory. Usually, you want to keep the number of items per sample bounded by a small constant.
Pipewine provides two main Sample implementations that differ in the way they handle typing information.
TypelessSample
The most basic type of Sample is TypelessSample, akin to the old Pipelime Sample.
This class is basically a wrapper around a dictionary of items of unknown type.
When using TypelessSample it's your responsibility to know the type of each item, meaning that when you access an item you have to cast it to the expected type.
With the old Pipelime, this quickly became a problem and led to the creation of the Entity and Action classes, which provide a type-safe alternative but unfortunately integrate poorly with the rest of the library, failing to completely remove the need for casts or type: ignore directives.
Example
The type-checker fails to infer the type of the retrieved item:
sample = TypelessSample(**dictionary_of_items)
# When accessing the "image" item, the type checker cannot possibly know that the
# item named "image" is an item that accesses an image represented by a numpy array.
image_item = sample["image"]
# Thus the need for casting (type-unsafe)
image_data = cast(np.ndarray, image_item())
Despite this limitation, TypelessSample lets you use Pipewine in a quick-and-dirty way, enabling faster experimentation without worrying about type safety.
Example
At any moment, you can convert any Sample into a TypelessSample, dropping all typing information, by calling the typeless method:
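A minimal sketch (any existing Sample instance works, regardless of how it was created):
# Given any sample
sample: Sample
# Drop all typing information
typeless_sample = sample.typeless()
# Items are still accessible by key, but the type-checker no longer knows their types
item = typeless_sample["image"]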
TypedSample
TypedSample is the type-safe alternative for samples. It allows you to construct samples that retain information on the type of each item contained within them, making your static type-checker happy.
TypedSample on its own does not do anything: to use it, you need to define a class that declares the names and types of the items. This process is very similar to defining a Python dataclass, with minimal boilerplate.
What you get in return:
- No need for t.cast or type: ignore directives that make your code cluttered and error-prone.
- The type-checker will complain when something is wrong with the way you use your TypedSample, effectively preventing many potential bugs.
- Intellisense automatically suggests field and method names for auto-completion.
- Want to rename an item? Any modern IDE is able to quickly rename all occurrences of a TypedSample field without breaking anything.
Example
Example creation and usage of a custom TypedSample:
class MySample(TypedSample):
    image_left: Item[np.ndarray]
    image_right: Item[np.ndarray]
    category: Item[str]

my_sample = MySample(
    image_left=image_left_item,
    image_right=image_right_item,
    category=category_item,
)
image_left_item = my_sample.image_left # Type-checker infers type Item[np.ndarray]
image_left_item = my_sample["image_left"] # Equivalent, but type-unsafe
image_right_item = my_sample.image_right # Type-checker infers type Item[np.ndarray]
image_right_item = my_sample["image_right"] # Equivalent, but type-unsafe
category_item = my_sample.category # Type-checker infers type Item[str]
category_item = my_sample["category"] # Equivalent, but type-unsafe
Warning
Beware of naming conflicts when using TypedSample. You should avoid item names that conflict with the methods of the Sample class.
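For instance, a hypothetical definition like the following would shadow an inherited Sample method:
class BadSample(TypedSample):
    keys: Item[list]  # Avoid: "keys" clashes with the Mapping method Sample.keys()

class GoodSample(TypedSample):
    key_list: Item[list]  # A non-conflicting name is safer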
Item
Item objects represent a single serializable unit of data. They are not the data itself; instead, they only have access to the underlying data.
Items do not implement any specific Python abstract type, since they are at the lowest level of the hierarchy and do not need to manage any collection of objects.
All items can be provided with typing information about the type of the data they have access to. This enables the type-checker to automatically infer the type of the data when accessed.
All Item objects have a Parser inside of them, an object that is responsible for encoding/decoding the data when reading or writing. These Parser objects are detailed later on.
Furthermore, items can be flagged as "shared", enabling Pipewine to perform some optimizations when reading/writing them, while essentially leaving their behavior unchanged.
Example
Example usage of an Item:
# Given an item that accesses a string
item: Item[str]
# Get the actual data by calling the item ()
actual_data = item()
# Create a copy of the item with data replaced by something else
new_item = item.with_value("new_string")
# Get the parser of the item
parser = item.parser
# Create a copy of the item with another parser
new_item = item.with_parser(new_parser)
# Get the sharedness of the item
is_shared = item.is_shared
# Set the item as shared/unshared
new_item = item.with_sharedness(True)
Pipewine provides three Item variants that differ in the way data is accessed or stored.
MemoryItem
MemoryItem instances are items that directly contain the data they are associated with. Accessing the data is immediate, as it is always loaded in memory and ready to be returned.
Tip
Use MemoryItem to contain "new" data that is the result of a computation, e.g. the output of a complex DL model.
Example
To create a MemoryItem, you just need to pass the data as-is and the Parser object:
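A minimal sketch, assuming the constructor takes the value followed by the parser, as described above:
# Wrap freshly computed data in an item, together with a parser describing its format
prediction = np.zeros((64, 64), dtype=np.uint8)
item = MemoryItem(prediction, NumpyNpyParser())
# Accessing the data is immediate, no I/O involved
data = item()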
StoredItem
StoredItem instances are items that point to external data stored elsewhere. Upon calling the item, the data is read from the storage, parsed and returned.
StoredItem objects use both Parser and Reader objects to retrieve the data. A Reader is an object that exposes a read method that returns data as bytes.
Currently, Pipewine provides a Reader for locally available files called LocalFileReader, which essentially just does open(path, "rb").read().
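As a hypothetical sketch, a custom reader only needs to provide a read method returning bytes (the actual Reader base class may define additional details):
class InMemoryReader(Reader):
    """Hypothetical reader that serves bytes already held in memory."""

    def __init__(self, payload: bytes) -> None:
        self._payload = payload

    def read(self) -> bytes:
        # Return the raw bytes; a StoredItem would then pass them to its Parser
        return self._payload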
Tip
Use StoredItem to contain data that is yet to be loaded. E.g. when creating a dataset that reads from a DB, do not perform all the loading upfront; use StoredItem to lazily load the data only when requested.
Example
To create a StoredItem, you need to pass a Reader and a Parser:
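For instance, mirroring the reader/parser pair used in the CachedItem example below:
# Point the item at a file on disk: nothing is read at construction time
reader = LocalFileReader(Path("/path/to/image.bmp"))
item = StoredItem(reader, BmpParser())
# Only now is the file read, parsed and returned
image = item()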
Warning
Contrary to old Pipelime items, StoredItem objects do not offer any kind of automatic caching mechanism: if you retrieve the data multiple times, you will perform a full read each time.
To counteract this, you need to use Pipewine cache operations.
CachedItem
CachedItem objects are items that offer a caching mechanism to avoid performing expensive read operations multiple times when the underlying data is left unchanged.
To create a CachedItem, you just need to pass an Item of your choice to the CachedItem constructor.
Example
Example usage of a CachedItem:
# Suppose we have an item that reads a high resolution BMP image from an old HDD.
reader = LocalFileReader(Path("/extremely/large/file.bmp"))
item = StoredItem(reader, BmpParser())
# Reading data takes ages, and does not get faster if done multiple times.
data1 = item() # Slow
data2 = item() # Slow
data3 = item() # Slow
# With CachedItem, we can memoize the data after the first access, making subsequent
# accesses immediate
cached_item = CachedItem(item)
data1 = cached_item() # Slow
data2 = cached_item() # Fast
data3 = cached_item() # Fast
Parser
Pipewine Parser objects are responsible for implementing the serialization/deserialization functions for data:
- parse transforms bytes into Python objects of your choice.
- dump transforms Python objects into bytes.
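A minimal round-trip sketch, assuming one of the built-in parsers listed below can be constructed with no arguments:
parser = JSONParser()
# dump: Python object -> bytes
raw = parser.dump({"foo": 42})
# parse: bytes -> Python object
obj = parser.parse(raw)  # {"foo": 42} again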
Built-in Parsers
Pipewine has some built-in parsers for commonly used data encodings:
- PickleParser: de/serializes data using Pickle, a binary protocol that can be used to de/serialize most Python objects. Key pros/cons:
  - ✅ pickle can efficiently serialize pretty much any Python object.
  - ❌ pickle is not secure: you can end up executing malicious code when reading data.
  - ❌ pickle only works with Python, preventing interoperability with other systems.
  - ❌ There are no guarantees that pickle data written today can be correctly read by future Python interpreters.
- JSONParser and YAMLParser: de/serialize data using JSON or YAML, two popular human-readable data serialization languages that support tree-like structures of data that strongly resemble Python builtin types.
  - ✅ Both JSON and YAML are interoperable with many existing systems.
  - ✅ Both JSON and YAML are standard formats that guarantee backward compatibility.
  - ⚠️ JSON and YAML only support a limited set of types such as int, float, str, bool, dict, list.
  - ✅ JSONParser and YAMLParser interoperate with pydantic BaseModel objects, automatically calling pydantic parsing, validation and dumping when reading/writing.
  - ❌ Both JSON and YAML trade efficiency off for human readability. You may want to use different formats when dealing with large data that you don't care to manually read.
- NumpyNpyParser: de/serializes numpy arrays into binary files.
  - ✅ Great for dealing with numpy arrays of arbitrary shape and type.
  - ❌ Only works with Python and Numpy.
  - ❌ Does not apply any compression to data, resulting in very large files.
- TiffParser: de/serializes numpy arrays into TIFF files.
  - ✅ Great for dealing with numpy arrays of arbitrary shape and type.
  - ✅ Produces files that can be read outside of Python.
  - ✅ Applies zlib lossless compression to reduce the file size.
- BmpParser: de/serializes numpy arrays into BMP files.
  - ⚠️ Only supports grayscale, RGB and RGBA uint8 images.
  - ❌ Does not apply any compression to data, resulting in very large files.
  - ✅ Fast de/serialization.
  - ✅ Lossless.
- PngParser: de/serializes numpy arrays into PNG files.
  - ⚠️ Only supports grayscale, RGB and RGBA uint8 images.
  - ✅ Produces smaller files due to image compression.
  - ❌ Slow de/serialization.
  - ✅ Lossless.
- JpegParser: de/serializes numpy arrays into JPEG files.
  - ⚠️ Only supports grayscale and RGB uint8 images.
  - ✅ Produces very small files due to image compression.
  - ✅ Fast de/serialization.
  - ❌ Lossy.
Custom Parsers
With Pipewine you are not limited to using the built-in Parsers: you can implement your own and use it seamlessly as if it were provided by the library.
Example
Let's create a TrimeshParser that is able to handle 3D meshes using the popular Trimesh library:
import io
from collections.abc import Iterable

import trimesh as tm

class TrimeshParser(Parser[tm.Trimesh]):
    def parse(self, data: bytes) -> tm.Trimesh:
        # Create a binary buffer with the binary data
        buffer = io.BytesIO(data)
        # Read the buffer and let trimesh load the 3D mesh object
        return tm.load(buffer, file_type="obj")

    def dump(self, data: tm.Trimesh) -> bytes:
        # Create an empty buffer
        buffer = io.BytesIO()
        # Export the mesh to the buffer
        data.export(buffer, file_type="obj")
        # Return the contents of the buffer (getvalue avoids having to rewind it first)
        return buffer.getvalue()

    @classmethod
    def extensions(cls) -> Iterable[str]:
        # This tells Pipewine that it can automatically use this parser whenever a
        # file with the .obj extension is found and needs to be parsed.
        return ["obj"]
Immutability
All data model types are immutable. Their inner state is hidden in private fields and methods and should never be modified in-place. Instead, they provide public methods that return copies with altered values, leaving the original object intact.
Immutability is a design decision inherited from the old Pipelime: thanks to it, we can be certain that every object is always in a correct state, since it cannot possibly change, and this prevents many issues when the same function is run multiple times, possibly in non-deterministic order.
Example
Let's say you have a sample containing an item named image with an RGB image. You want to resize the image, reducing the resolution to 50% of the original size.
To change the image in a sample, you need to create a new sample in which the image item contains the resized image.
def half_res(image: np.ndarray) -> np.ndarray:
    # Some code that downscales an image by 50%
    ...
# Read the image (more details later)
image = sample["image"]()
# Downscale the image
half_image = half_res(image)
# Create a new sample with the new (downscaled) image
new_sample = sample.with_value("image", half_image)
At the end of the snippet above, the sample variable will still contain the original full-size image. Instead, new_sample will contain the new resized image.
There are only two exceptions to this immutability rule:
- Caches: They need to change their state to save time when the result of a computation is already known. Since all other data is immutable, caches never need to be invalidated.
- Inner data: While all Pipewine data objects are immutable, this may not be true for the data contained within them. If your item contains mutable objects, you are able to modify them in-place. But never do that!
Python, unlike other languages, has no mechanism to enforce read-only access to an object; the only way to do so would be to perform a deep copy whenever an object is accessed, but that would be a complete disaster performance-wise.
So, when dealing with mutable data structures inside your items, make sure you either:
- Access the data without applying changes.
- Create a deep copy of the data before applying in-place changes.
Danger
Never do this!
# Read the image from a sample
image = sample["image"]() # <-- We need to call the item () to retrieve its content
# Apply some in-place changes to image
image += 1
image *= 0.9
image += 1
# Create a new sample with the new image
new_sample = sample.with_value("image", image)
The modified image will be present both in the old and new sample, violating the immutability rule.
Success
Do this instead!
# Read the image from a sample
image = sample["image"]() # <-- We need to call the item () to retrieve its content
# Create a copy of the image with modified data
image = image + 1
# Since image is now a copy of the original data, you can apply all
# the in-place changes you like.
image *= 0.9 # Perfectly safe
image += 1 # Perfectly safe
# Create a new sample with the new image
new_sample = sample.with_value("image", image)
The modified image will be present only in the new sample.