Distributed block storage
Data in containers is ephemeral: as soon as a pod is stopped, crashes, or gets rescheduled to another node, all of its data is gone. While this is fine for stateless applications such as the Kubernetes Dashboard (which obtains its data from persistent sources running outside of the container), persisting data becomes a hard requirement as soon as we deploy databases on our cluster.
Kubernetes supports various types of volumes that can be attached to pods. Only a few of these match our requirements. There's the hostPath type, which simply maps a path on the host to a path inside the container, but this won't work because we don't know on which node a pod will be scheduled.
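For illustration only, this is roughly what a hostPath volume looks like in a pod spec (the names and paths are made up); whatever the container writes would stay behind on whichever node the pod happened to run on:

# hypothetical example, not part of our setup
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-example
spec:
  containers:
    - name: app
      image: nginx
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      hostPath:
        path: /mnt/data # tied to a single node's filesystem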
Persistent volumes
Kubernetes has a concept for separating storage management from cluster management. To provide a layer of abstraction around storage, two resource types are deeply integrated into Kubernetes: PersistentVolume and PersistentVolumeClaim. When running on a well-integrated platform such as GCE, AWS or Azure, it's really easy to attach a persistent volume to a pod by creating a persistent volume claim. Unfortunately, we don't have access to such solutions.
Our cluster consists of multiple nodes and we need the ability to attach persistent volumes to any pod running on any node. There are a couple of projects and companies emerging around the idea of providing hyper-converged storage solutions. Some of their services are running as pods within Kubernetes itself, which is certainly the perfect way of managing storage on a small cluster such as ours.
Choosing a solution
Currently there are a couple of interesting solutions matching our criteria, but they all have their downsides:
- Rook.io is an open source project based on Ceph. It looks promising, but it's still pretty much in alpha state and seriously lacking in documentation.
- gluster-kubernetes is an open source project built around GlusterFS and Heketi. Setup seems tedious at this point, requiring a topology of the cluster to be provided in JSON format.
- Portworx is a commercial project that offers a free variant of their proprietary software, providing great documentation and tooling.
Even though we would definitely prefer using open source software, Portworx offers the best solution currently available. Setup is simple, and deployment and operation are transparent. It launches just a single pod per instance, whereas the others create a whole bunch of pods and sidecars. Things might change in the future, but for now we're going to settle on Portworx.
Deploying Portworx
As we run only a three-node cluster, we're going to deploy PX on all three of them using a DaemonSet with master toleration. The official documentation states that PX should be deployed manually on each host using docker run. The main reason behind this statement is probably PX's need for mounting volumes in shared mode (e.g. -v /host/path:/container/path:shared). Officially, the Kubernetes pod specification doesn't support this flag, but we can work around this by appending :shared to the mount path in the pod spec.
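As a rough sketch of that workaround, a shared host mount in the DaemonSet's pod spec could look like this (container, volume and path names are illustrative, not taken from the actual manifest):

# fragment of a pod/daemonset spec, illustrative only
containers:
  - name: portworx
    volumeMounts:
      - name: varlibosd
        mountPath: /var/lib/osd:shared # ":shared" appended as described above
volumes:
  - name: varlibosd
    hostPath:
      path: /var/lib/osd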
Before deploying the Portworx DaemonSet we need to provide a raw, unformatted block device that will be used for storage on each host. These can either be attached volumes or local loopback devices. On Scaleway, the volume on which the operating system is installed is called /dev/vda; an additionally attached volume will be available as /dev/vdb. On DigitalOcean, things work a little differently: attached volumes are referenced with something like /dev/disk/by-id/scsi-0DO_Volume_<VOLUME_NAME>.
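One way to double-check which block devices exist on a host and whether they are still unformatted (lsblk ships with most distributions):

# list block devices and any filesystems on them; an empty FSTYPE means unformatted
lsblk -f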
Make sure to edit the DaemonSet manifest listed below and replace the value of the PX_STORAGE_DEVICE env variable with a block device available in your environment. The resulting manifests turn out pretty lean for such a seemingly complex service.
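For orientation, the relevant part of the DaemonSet's container spec could look roughly like this (the manifest path and device value are just examples and have to match your environment):

# fragment of the Portworx DaemonSet manifest, illustrative
env:
  - name: PX_STORAGE_DEVICE
    value: /dev/vdb # replace with your raw, unformatted block device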
It’s worth mentioning that the storage class manifest contains a few important parameters:
# storage/storageclass.yml
apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: portworx
provisioner: kubernetes.io/portworx-volume
parameters:
  repl: "2"           # replication factor
  snap_interval: "0"  # turn off automatic snapshots
  io_priority: "high"
Further parameters are listed in the Portworx storage class documentation.
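Assuming the manifest is stored at the path given in its header comment, it can be applied and verified like any other resource:

# create the storage class
kubectl apply -f storage/storageclass.yml
# confirm it has been registered
kubectl get storageclass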
In order to operate on the storage cluster, use the pxctl control tool (documentation) via one of the Portworx containers. Here are some examples:
# show status summary
kubectl exec -it portworx-storage-wp797 -- /opt/pwx/bin/pxctl status
# list volumes in the cluster
kubectl exec -it portworx-storage-wp797 -- /opt/pwx/bin/pxctl volume list
# show cluster wide alerts
kubectl exec -it portworx-storage-wp797 -- /opt/pwx/bin/pxctl cluster alerts
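The pod name (portworx-storage-wp797 above) is specific to each cluster. One way to look up the Portworx pod names in yours, assuming the DaemonSet's pods carry "portworx" in their names:

# list pods and filter for the Portworx instances
kubectl get pods -o wide | grep portworx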
Consuming storage
The storage class we created can be consumed with a persistent volume claim:
# minio/pvc.yml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio-persistent-storage
spec:
  storageClassName: portworx
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
This will create a persistent volume claim called minio-persistent-storage and dynamically provision a volume with a capacity of 5 Gi behind it. Please note that there's currently a bug in PX related to ReadWriteMany volume claims (bug report). Volumes can only be claimed in ReadWriteOnce access mode for the time being.
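After applying the claim, kubectl shows whether it has been bound to a dynamically provisioned volume:

# the STATUS column should read "Bound" once provisioning succeeded
kubectl get pvc minio-persistent-storage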
In this example we’re deploying Minio, an Amazon S3 compatible object storage server, to create and mount a persistent volume:
- minio/deployment.yml
- minio/ingress.yml
- minio/secret.yml (MINIO_ACCESS_KEY: admin / MINIO_SECRET_KEY: admin.minio.secret.key)
- minio/service.yml
- minio/pvc.yml
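Assuming the manifests are kept in a minio/ directory as listed above, they can be applied in one go:

# create secret, deployment, service, ingress and persistent volume claim
kubectl apply -f minio/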
The volume related configuration is buried in the deployment manifest:
# from minio/deployment.yml
containers:
  - name: minio
    volumeMounts:
      - name: data
        mountPath: /data
    # ...
volumes:
  - name: data
    persistentVolumeClaim:
      claimName: minio-persistent-storage
The minio-persistent-storage volume will live as long as the persistent volume claim is not deleted (e.g. kubectl delete -f minio/pvc.yml). The Minio pod itself can be deleted, updated or rescheduled without data loss.
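A quick way to convince yourself of this is to delete the running Minio pod and watch the deployment recreate it with the same data (the label selector below is an assumption about how the deployment manifest tags its pods):

# delete the running Minio pod; the deployment will schedule a replacement
kubectl delete pod -l app=minio  # assumes the pods are labeled app=minio
# the new pod mounts the same persistent volume, so previously uploaded objects are still there
kubectl get pods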