In the previous tutorials, we introduced the process of executing workloads on the DeepSquare grid using simple fire-and-forget examples. However, data persistence becomes vital when dealing with tasks such as machine learning training, where you need to input training data and extract model checkpoints.
This section will guide you through the process of integrating a dataset into your workflow and retrieving the resulting data.
## Data Persistence in DeepSquare
Let's dive into a practical example involving basic machine learning. We'll integrate a dataset into our workflow and extract checkpoints from our process.
### Utilizing S3 and HTTP for Data Transport
DeepSquare does not inherently store data; the control and management of your data are entirely under your purview. However, DeepSquare requires access to your data sources in order to fetch inputs and push outputs.
DeepSquare provides support for both S3 and HTTP data sources. In this guide, we'll utilize HTTP to fetch a publicly available dataset and S3 to store our output data.
We'll work with the CIFAR-10 dataset, available for download at https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz. CIFAR-10 is a collection of 60,000 32x32 color images categorized into 10 classes, with 6,000 images per class. The dataset is divided into five training batches (50,000 images) and one test batch (10,000 images).
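If you want to inspect the data locally before sending anything to the grid, note that each file in the python archive is a pickled dictionary of raw image rows and labels. Below is a minimal sketch, assuming you have downloaded and extracted `cifar-10-python.tar.gz` into your working directory:

```python
# Minimal local inspection of one CIFAR-10 batch (assumes the archive has been
# extracted to ./cifar-10-batches-py, the directory name inside the tarball).
import pickle

with open("cifar-10-batches-py/data_batch_1", "rb") as f:
    batch = pickle.load(f, encoding="bytes")

images = batch[b"data"]    # uint8 array of shape (10000, 3072): flattened 32x32x3 images
labels = batch[b"labels"]  # list of 10,000 class indices in the range 0-9
print(images.shape, len(labels))
```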
### The ML model
In this tutorial, we'll train a Deep Layer Aggregation (DLA) model. The source code for the model is shown below, and the container image is available at `ghcr.io/deepsquare-io/cifar-10-example:latest`.
#### Source code
```python
import torch
import torch.nn.functional as F
from torch import nn


class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, in_planes, planes, stride=1):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(
            in_planes, planes, kernel_size=3, stride=stride, padding=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(
            planes, planes, kernel_size=3, stride=1, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(planes)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != self.expansion * planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(
                    in_planes,
                    self.expansion * planes,
                    kernel_size=1,
                    stride=stride,
                    bias=False,
                ),
                nn.BatchNorm2d(self.expansion * planes),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)
        out = F.relu(out)
        return out


class Root(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=1):
        super(Root, self).__init__()
        self.conv = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size,
            stride=1,
            padding=(kernel_size - 1) // 2,
            bias=False,
        )
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, xs: list[torch.Tensor]):
        x = torch.cat(xs, 1)
        out = F.relu(self.bn(self.conv(x)))
        return out


class Tree(nn.Module):
    def __init__(self, block, in_channels, out_channels, level=1, stride=1):
        super(Tree, self).__init__()
        self.root = Root(2 * out_channels, out_channels)
        if level == 1:
            self.left_tree = block(in_channels, out_channels, stride=stride)
            self.right_tree = block(out_channels, out_channels, stride=1)
        else:
            self.left_tree = Tree(
                block, in_channels, out_channels, level=level - 1, stride=stride
            )
            self.right_tree = Tree(
                block, out_channels, out_channels, level=level - 1, stride=1
            )

    def forward(self, x):
        out1 = self.left_tree(x)
        out2 = self.right_tree(out1)
        out = self.root([out1, out2])
        return out


class SimpleDLA(nn.Module):
    def __init__(self, block=BasicBlock, num_classes=10):
        super(SimpleDLA, self).__init__()
        self.base = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(16),
            nn.ReLU(True),
        )
        self.layer1 = nn.Sequential(
            nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(16),
            nn.ReLU(True),
        )
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU(True),
        )
        self.layer3 = Tree(block, 32, 64, level=1, stride=1)
        self.layer4 = Tree(block, 64, 128, level=2, stride=2)
        self.layer5 = Tree(block, 128, 256, level=2, stride=2)
        self.layer6 = Tree(block, 256, 512, level=1, stride=2)
        self.linear = nn.Linear(512, num_classes)

    def forward(self, x):
        out = self.base(x)
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = self.layer4(out)
        out = self.layer5(out)
        out = self.layer6(out)
        out = F.avg_pool2d(out, 4)
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        return out
```
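To sanity-check the model definition locally, outside of DeepSquare, you can push a random CIFAR-sized batch through it. This assumes the classes above are in scope (for example, pasted into the same script); it is only a quick shape check:

```python
# Quick shape check: four random 32x32 RGB images should yield one logit per CIFAR-10 class.
import torch

model = SimpleDLA()                        # uses the class defined above
logits = model(torch.randn(4, 3, 32, 32))  # batch of 4 fake CIFAR images
print(logits.shape)                        # expected: torch.Size([4, 10])
```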
### Setting up an S3 bucket
S3 (Simple Storage Service) is a preferred method for data storage, particularly when you want to keep control of your files. It's an object storage solution that stores data in an unstructured format, allowing it to scale by spreading data across multiple devices.
If you're inclined to self-host your own S3 server, consider using Garage for small to medium-scale operations, or MinIO for medium to large-scale applications.
Alternatively, if setting up your own storage infrastructure sounds daunting, there are convenient hosted options such as Exoscale Simple Object Storage (approx. $0.020 per GB/month) or Google Cloud Storage (approx. $0.023 per GB/month).
Once your S3 server is deployed and your bucket is created, retrieve the API access keys for your storage. These keys are what allow DeepSquare to interact with your storage space; you can sanity-check them with the short script after the table below.
| Key | Description | Example |
|---|---|---|
| Region | The region of the S3 object storage. | `us-east-2` |
| Endpoint URL | The URL to the S3 API, which should start with `https`. | `https://s3.us-east-2.amazonaws.com` |
| Bucket URL | The S3 bucket URL used to fetch data. | `s3://my-bucket` |
| Path | The path relative to the bucket root. | `/my-directory` |
| Access Key ID | The access key ID of the API access. | Varies with the host: starts with `AKIA` (Amazon) or `EXO` (Exoscale) |
| Secret Access Key | The password of the API access. | `***` |
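Before wiring these values into a workflow, it's worth checking that they actually grant access to your bucket. The sketch below uses `boto3`, which is not required by DeepSquare itself; the region, endpoint, and bucket names match the Exoscale example used later in this tutorial, and the credentials are placeholders you must replace:

```python
# Hypothetical credential check against an S3-compatible bucket (replace the placeholders).
import boto3

s3 = boto3.client(
    "s3",
    region_name="at-vie-1",                      # Region
    endpoint_url="https://sos-at-vie-1.exo.io",  # Endpoint URL
    aws_access_key_id="EXO...",                  # Access Key ID (placeholder)
    aws_secret_access_key="***",                 # Secret Access Key (placeholder)
)

# Write a tiny object, then list the bucket to confirm read/write access.
s3.put_object(Bucket="cifar10", Key="connectivity-test.txt", Body=b"hello")
for obj in s3.list_objects_v2(Bucket="cifar10").get("Contents", []):
    print(obj["Key"], obj["Size"])
```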
### Writing the workflow file
```yaml
resources:
  tasks: 1
  gpus: 2
  cpusPerTask: 8
  memPerCpu: 2048

enableLogging: true

input:
  http:
    url: https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz

output:
  s3:
    region: at-vie-1
    bucketUrl: s3://cifar10
    path: '/'
    accessKeyId: EXO***
    secretAccessKey: '***'
    endpointUrl: https://sos-at-vie-1.exo.io

## Enable periodic sync with output.
continuousOutputSync: true

steps:
  - name: train
    resources:
      gpusPerTask: 2
    run:
      command: /.venv/bin/python3 main.py --checkpoint_out=$DEEPSQUARE_OUTPUT/ckpt.pth --dataset=$DEEPSQUARE_INPUT/
      container:
        image: deepsquare-io/cifar-10-example:latest
        registry: ghcr.io
        workDir: '/app'
```
The `input` option will automatically unpack the archive.

The `main.py` script writes a `ckpt.pth` checkpoint file as it trains. With `continuousOutputSync`, the checkpoint is uploaded each time the file is updated, so you won't lose progress and can resume training later with `--checkpoint_in=$DEEPSQUARE_INPUT/ckpt.pth`. Resuming does, however, require an S3 source as `input`. One solution would be to use an `init` container:
```yaml
resources:
  tasks: 1
  gpus: 2
  cpusPerTask: 8
  memPerCpu: 2048

enableLogging: true

input:
  s3:
    region: at-vie-1
    bucketUrl: s3://cifar10
    path: '/'
    accessKeyId: EXO284cde16bdbe4195b8fc4763
    secretAccessKey: KYReUpY-8ipfAvO5wlYpd7Uq-IkadN9ac535H-C1mbI
    endpointUrl: https://sos-at-vie-1.exo.io

output:
  s3:
    region: at-vie-1
    bucketUrl: s3://cifar10
    path: '/'
    accessKeyId: EXO284cde16bdbe4195b8fc4763
    secretAccessKey: KYReUpY-8ipfAvO5wlYpd7Uq-IkadN9ac535H-C1mbI
    endpointUrl: https://sos-at-vie-1.exo.io

## Enable periodic sync with output.
continuousOutputSync: true

steps:
  - name: download dataset
    run:
      command: curl -fsSL https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz -o $STORAGE_PATH/cifar-10-python.tar.gz; tar -C $STORAGE_PATH -xvzf $STORAGE_PATH/cifar-10-python.tar.gz; ls -lah $STORAGE_PATH
      container:
        image: curlimages/curl:latest
        registry: registry-1.docker.io
  - name: train
    resources:
      gpusPerTask: 2
    run:
      command: '/.venv/bin/python3 main.py --checkpoint_in=$DEEPSQUARE_INPUT/ckpt.pth --checkpoint_out=$DEEPSQUARE_OUTPUT/ckpt.pth --dataset=$STORAGE_PATH/'
      container:
        image: deepsquare-io/cifar-10-example:latest
        registry: ghcr.io
        workDir: '/app'
```
The `download dataset` step writes into `$STORAGE_PATH`, the job's scratch space shared between steps, so the extracted dataset is visible to the `train` step. (Since we use the PyTorch `DataLoader`, we could also have downloaded the dataset directly from the code.)
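For reference, here is how checkpoint handling inside the training script can line up with the `--checkpoint_in` and `--checkpoint_out` flags used above. This is a hypothetical sketch, not the actual `main.py` shipped in the container image; the module name, saved fields, and training loop body are placeholders:

```python
# Hypothetical checkpoint handling compatible with the flags used in the workflows above.
# The real main.py in ghcr.io/deepsquare-io/cifar-10-example may differ.
import argparse
import os

import torch

from dla import SimpleDLA  # assumes the model code above is saved as dla.py (hypothetical)

parser = argparse.ArgumentParser()
parser.add_argument("--checkpoint_in", default=None)
parser.add_argument("--checkpoint_out", required=True)
parser.add_argument("--dataset", required=True)
args = parser.parse_args()

model = SimpleDLA()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
start_epoch = 0

# Resume from a previous run if a checkpoint was provided and has been synced to the input.
if args.checkpoint_in and os.path.exists(args.checkpoint_in):
    state = torch.load(args.checkpoint_in, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 200):
    # ... train one epoch on the data found under args.dataset ...
    # Save after every epoch so continuousOutputSync can upload the latest file.
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "epoch": epoch},
        args.checkpoint_out,
    )
```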
## Next steps
Great job on mastering storage management with DeepSquare! You've successfully imported and exported data in a job.
Next, we will dive into the process of containerizing an application. This segment will cover:
- The concept of application containers.
- Building and packaging applications with Docker, Podman, or Buildah.
- Local testing of your containerized application.
- Exporting your application to a registry.