Linstor: networked storage without the complexity

With a step-by-step guide to deployment under Ubuntu 18.04

When running your own infrastructure, persistent storage has always been a problem for virtual machines and stateful containers.

On the one hand, you have local storage — which means your VMs and containers cannot move without copying their entire disk image.

On the other hand, you have network-attached storage (SANs and NASes), which are expensive and a critical point of failure. And on the third hand, you have distributed storage systems like Ceph and Longhorn, which can be complex to deploy and expand, and very hard to diagnose when they don’t perform as expected.

In this article I’d like to introduce another option: Linstor. I’ll give a brief overview, ending with the exact steps to create a simple demo storage cluster under Ubuntu 18.04.

Linstor overview

Linstor comes from Linbit, the company that gave us DRBD. DRBD is the “Distributed Replicated Block Device”, and provides RAID-1 style mirroring across a network. It has been around for many years, and is well battle-hardened. Out of the box, you can configure DRBD to replicate a drive or partition on one server to another server, and you can access that device via a kernel device node, /dev/drbdXXXX. Configuration is done dynamically via a CLI tool, or persistently in a config file.

Linstor in essence gives you a management layer on top of DRBD. When you ask Linstor to create a storage volume, it creates an LVM volume (or ZFS zvol) on two or more nodes, and configures DRBD replication between them. The volume is then accessible from any of the nodes — even ones which don’t contain a local replica, because they can still access it over the network.

As a result, each volume is accessible at the same /dev/drbdXXXX path across your entire cluster. It’s like having a SAN, but without the cost or the single point of failure. Virtual machines can live-migrate, and containers can attach persistent volumes. Plugins are provided for Proxmox VE, Kubernetes and others.

Crucially, Linstor is just a management layer: it does not sit in the storage I/O data path. So even if Linstor crashes or is restarted, all your ongoing disk I/O continues unaffected.

Benefits of Linstor

  • It configures DRBD for you, across multiple nodes, running on top of either LVM or zpools.
  • Nodes can access storage over the network using the DRBD protocol. You can have a “diskless” client node, or your client can be one of the replicas.
  • It uses DRBD 9, which allows up to 32 replicas of each volume. You can add replicas to a volume, or remove them, dynamically and with no downtime.
  • Compared to Ceph, it’s much easier to debug. You can use simple tools like drbdsetup status and lvs to see what’s going on.
  • If the controller goes down, it doesn’t affect any running DRBD instances.
  • It’s free and open-source.

Downsides of Linstor

  • The satellite component of Linstor, which runs on every storage node, is written in Java. So is the controller component: the default database is H2, although others are supported. Much as I dislike Java, by treating it as a black box I think I can live with this. The CLI is written in Python, and the Proxmox plugin in Perl, so you don’t need Java anywhere else.
  • Since DRBD version 9 is required, you may need to install (or build) an updated kernel module. Linbit provide a number of suitable pre-built modules, but using them introduces some extra work with UEFI secure boot that I’ll explain in detail below. This may become less of an issue in future¹.
  • If you want access to official, supported binaries or docker images then you need to pay. However there is a free (unsupported) PPA for Ubuntu, and you can easily build your own docker images from the Dockerfile.
  • A single volume cannot be larger than the storage capacity offered by a single node.
  • Linstor only provides block storage. If you want object storage then this is not what you’re looking for. You could run Minio to give an object front-end, just as you could run an NFS or Samba server to make a NAS, but it won’t scale once you get into the multi-terabyte arena.

Alternatives to Linstor

The closest I’ve come to this in the past is Ganeti, a free virtualization framework originating from Google. Its unique feature was that it could create DRBD-replicated volumes, giving a degree of VM resilience and mobility without a SAN. However, it was limited in that each volume could only replicate to one other node, so you could only live-migrate a VM to that particular node. It also has an obscure command line interface.

Google have now handed over the project to the community, which is enthusiastic but very small.

Linstor proof-of-concept

Here is how I set up a proof-of-concept deployment using four Ubuntu 18.04 virtual machines as storage nodes. Note: for this test environment, I decided to make node1 be both the controller node and a storage node: Linbit call this a “combined” node.

I used lxd to run the VMs. lxd started as a containerisation service built into Ubuntu, but as of lxd 4.0 has gained the ability to launch fully-fledged VMs. It also has a guest agent which allows the host to execute commands in the guest using “lxc exec …”. This makes it very amenable to scripting.

If your host is Ubuntu 18.04, then you’ll need to install lxd 4.0+ as a snap.

If you want to replicate these steps using a different virtualization environment, you can — but you’ll need to replace lxc exec <node> … with ssh’ing into the VM, or typing commands directly on the relevant VM console.

for n in node{1..4}; do
    lxc launch images:ubuntu/18.04/cloud "$n" --vm
done
lxc list # wait until all four have picked up an IP address

lxd’s defaults for VMs are 1GB RAM and 10GB disk, which are fine for this purpose.

On my host, I have configured lxd’s default network so that machines pick up a DHCP address from 192.0.2.0/24.

The next step is to install the DRBD and Linstor packages from Linbit’s PPA. From the host, I can do this inside all four VMs with a short bash loop:

for n in node{1..4}; do
    lxc exec "$n" -- bash -c 'add-apt-repository -y ppa:linbit/linbit-drbd9-stack &&
        apt-get update &&
        apt-get install -y --no-install-recommends \
            drbd-dkms drbd-utils lvm2 linstor-satellite linstor-client'
done

This is the point where the UEFI secure boot nonsense kicks in, because the DRBD9 kernel module from the Linbit repository hasn’t been signed by Ubuntu.

For each VM, I got the following message displayed:

Your system has UEFI Secure Boot enabled.

UEFI Secure Boot requires additional configuration to work with third-party drivers.

The system will assist you in configuring UEFI Secure Boot. To permit the use of third-party drivers, a new Machine-Owner Key (MOK) has been generated. This key now needs to be enrolled in your system’s firmware.

To ensure that this change is being made by you as an authorized user, and not by an attacker, you must choose a password now and then confirm the change after reboot using the same password, in both the “Enroll MOK” and “Change Secure Boot state” menus that will be presented to you when this system reboots.

If you proceed but do not confirm the password upon reboot, Ubuntu will still be able to boot on your system but any hardware that requires third-party drivers to work correctly may not be usable.

For the secure boot password you are forced to pick a string between 8 and 16 characters. I used “abcd1234”. You’ll need this in the next step.

This is the painful part. One by one, restart each of the VMs and attach to its console so you can see the messages it generates as it boots:

lxc stop node1
lxc start --console node1

It will stop with a menu option like this:

Perform MOK management

- Continue boot
- Enroll MOK
- Enroll key from disk
- Enroll hash from disk

Select “Enroll MOK”, “Continue”, “Yes”. You’ll then be prompted for the secure boot password you selected before (e.g. “abcd1234”). Finally, select “Reboot”.

Once this has been done, Linbit’s DRBD9 module should become loadable. You can check it like this:

lxc exec node1 -- modprobe drbd

If no error is displayed, then everything is good. Repeat for all four nodes.
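Once the MOK has been enrolled on every VM, the same check can be looped from the host. This is only a sketch: the function names are hypothetical, and it assumes that modinfo -F version reports the version of the module that modprobe picks up.

```shell
# Load the drbd module inside one VM and print the version modinfo reports.
drbd_version() {   # usage: drbd_version NODE
    lxc exec "$1" -- sh -c 'modprobe drbd && modinfo -F version drbd'
}

# Report, per node, whether a 9.x module is in place.
check_all_nodes() {
    local n v
    for n in node1 node2 node3 node4; do
        v=$(drbd_version "$n") || v=""
        case "$v" in
            9.*) echo "$n: ok ($v)" ;;
            *)   echo "$n: drbd 9 module not loadable" ;;
        esac
    done
}

# Example: check_all_nodes
```

Any node reported as not loadable most likely still needs its MOK enrolment completed at the console.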

Install the controller on node1 only (the flag --no-install-recommends prevents unnecessary dependencies from being installed):

lxc exec node1 -- apt-get install linstor-controller \
--no-install-recommends

Each storage node needs an LVM volume group to allocate space from, and it’s easiest just to add another virtual disk for this, separate from the boot disk.

Under lxd, this is how to create and attach a new disk for the first node:

lxc storage volume create default node1b --type block
lxc stop node1
lxc config device add node1 node1b disk pool=default source=node1b
lxc start node1

Repeat for node2, node3 and node4 (adjusting “node1” and “node1b” as appropriate). The new disk appears as /dev/sdb in each VM.
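The repetition can be wrapped in a small function on the host. This sketch reuses exactly the four lxc commands shown above; only the function name attach_disk is made up.

```shell
# Create a block volume named NODEb in the default pool and attach it to NODE.
attach_disk() {   # usage: attach_disk NODE
    local n=$1
    lxc storage volume create default "${n}b" --type block
    lxc stop "$n"
    lxc config device add "$n" "${n}b" disk pool=default source="${n}b"
    lxc start "$n"
}

# Example: for n in node2 node3 node4; do attach_disk "$n"; done
```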

On each node, we need to add this new disk into an LVM volume group which I’ve called “vg_ssd”. This is what Linstor will allocate space from.

for n in node{1..4}; do lxc exec "$n" -- vgcreate vg_ssd /dev/sdb; done

Make a note of the IP addresses for each node from lxc list, and substitute the correct ones below.
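If you would rather not copy addresses by hand, the host can generate the node-registration commands for you. This is a rough sketch, assuming that lxc list can emit CSV with the name and IPv4 columns (-c n4) and that each row looks like "node1,192.0.2.124 (enp5s0)"; the function name is hypothetical.

```shell
# Print one "linstor node create" line per VM, using the first IPv4 address
# that lxc list reports. node1 is marked as the combined node.
print_node_commands() {
    lxc list --format csv -c n4 |
        awk -F, '$1 ~ /^node[1-4]$/ && $2 != "" {
            gsub(/"/, "", $2)          # drop CSV quoting, if any
            split($2, addr, " ")       # "IP (iface)" -> IP
            extra = ($1 == "node1") ? " --node-type combined" : ""
            print "linstor node create " $1 " " addr[1] extra
        }'
}

# Example: print_node_commands
```

The output is meant to be reviewed and then pasted into the shell on the controller node, not piped there blindly.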

The cluster configuration is performed on the controller node only. We will start the controller, and join the storage nodes. Remember that node1 is both controller and storage, a “combined” node.

lxc shell node1
# now we have a shell in the node1 VM; the following
# commands are run on node1
systemctl enable --now linstor-controller
linstor node list # shows an empty list
linstor node create node1 192.0.2.124 --node-type combined
linstor node create node2 192.0.2.120
linstor node create node3 192.0.2.58
linstor node create node4 192.0.2.60
linstor node list

The node listing should show all four nodes in the Online state, with node1 listed as a combined node and the others as satellites.

Still on the controller node, create a storage pool called “pool_ssd” which contains the “vg_ssd” volume group from each node:

linstor storage-pool create lvm node1 pool_ssd vg_ssd
linstor storage-pool create lvm node2 pool_ssd vg_ssd
linstor storage-pool create lvm node3 pool_ssd vg_ssd
linstor storage-pool create lvm node4 pool_ssd vg_ssd
linstor storage-pool list

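Now create a resource group and a volume group (this is a Linstor volume group, which is not the same thing as the LVM volume group created earlier):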

linstor resource-group create my_ssd_group \
--storage-pool pool_ssd --place-count 2
linstor volume-group create my_ssd_group

Finally, spawn a new resource with a single 2GiB volume from this resource group:

linstor resource-group spawn-resources my_ssd_group my_ssd_res 2G

Check the results of this operation (for example with linstor resource list):

You can see that the volume “my_ssd_res” has been created as a pair of LVM volumes, replicated between node1 and node2; you can also see the progress of DRBD synchronization (the initial mirroring).

Here are some other status checks you can do:

(If you are familiar with drbd8, then note that in drbd9 /proc/drbd no longer contains status)
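In DRBD 9 the status comes from drbdsetup status instead, and it can be polled to wait for the initial mirroring to finish. This is a minimal sketch: it assumes the status text marks an unfinished replica with disk:Inconsistent, peer-disk:Inconsistent or a replication:Sync... state, and the function names are hypothetical.

```shell
# Succeeds while any replica of the resource is still synchronizing.
sync_in_progress() {   # usage: sync_in_progress RESOURCE
    drbdsetup status "$1" | grep -Eq 'disk:Inconsistent|replication:Sync'
}

# Poll every five seconds until no synchronization is reported.
wait_for_sync() {      # usage: wait_for_sync RESOURCE
    while sync_in_progress "$1"; do
        sleep 5
    done
    echo "$1: no synchronization in progress"
}

# Example (on a node that participates in the resource): wait_for_sync my_ssd_res
```

Note that the loop also ends if drbdsetup prints nothing at all, for instance because the resource name is wrong, so check the name first.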

Look carefully, and you can see that node3 has been included as a “diskless” peer. This is a clever bit of DRBD configuration by Linstor: with three nodes participating (even though one doesn’t store data), there will still be a quorum if one node fails. It avoids the “split brain” which could occur if there were only two nodes and they lost contact with each other.
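The arithmetic behind this can be shown with a simplified majority rule. Treat it as an illustration only: DRBD’s real quorum logic has more cases than this, and the function name has_quorum is made up.

```shell
# Simplified majority rule: a partition keeps quorum when it can see
# more than half of the nodes participating in the resource
# (counting itself).
has_quorum() {   # usage: has_quorum REACHABLE TOTAL
    [ $(( 2 * $1 )) -gt "$2" ]
}

for pair in "2 3" "1 3" "1 2"; do
    set -- $pair
    if has_quorum "$1" "$2"; then
        echo "$1 of $2 nodes reachable: quorum"
    else
        echo "$1 of $2 nodes reachable: no quorum"
    fi
done
```

Only the 2-of-3 case keeps quorum. With just two participants, the survivor of a failure sees 1 of 2, which is not a majority, and that is why the third, diskless participant matters.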

This volume is now available as a block device from any of the three participating nodes, even node3 which doesn’t have a local replica:

root@node3:~# ls -l /dev/drbd*
brw-rw---- 1 root disk 147, 1000 Nov 4 12:26 /dev/drbd1000

/dev/drbd:
total 0
drwxr-xr-x 3 root root 60 Nov 4 12:26 by-res

For example: I can create a filesystem from node3:

root@node3:~# mkfs.ext4 /dev/drbd1000
...
root@node3:~# mount /dev/drbd1000 /mnt
root@node3:~# echo "world" >/mnt/hello
root@node3:~# umount /dev/drbd1000

And then I can mount it from another node:

root@node1:~# mount /dev/drbd1000 /mnt
root@node1:~# cat /mnt/hello
world
root@node1:~# umount /dev/drbd1000

There’s no need to switch primary/secondary roles by hand: DRBD 9 promotes a node to primary automatically when the device is opened, so it “just works”.

BEWARE: just like a SAN, it’s very important that you don’t have the same filesystem mounted on two nodes simultaneously, or you’ll end up with severe data corruption. Therefore it’s critical that you unmount the filesystem on one node before mounting it on another.
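DRBD reports a node as Primary while it has the device open read-write, so a simple sanity check before mounting is possible. This is a sketch with hypothetical function names; it is not race-free, and it is no substitute for proper cluster management.

```shell
# Succeeds if any node (local or peer) currently holds the resource as Primary.
in_use() {   # usage: in_use RESOURCE
    drbdsetup status "$1" | grep -q 'role:Primary'
}

# Mount only when no node reports the Primary role.
careful_mount() {   # usage: careful_mount RESOURCE DEVICE MOUNTPOINT
    if in_use "$1"; then
        echo "$1 is Primary somewhere; unmount it there first" >&2
        return 1
    fi
    mount "$2" "$3"
}

# Example: careful_mount my_ssd_res /dev/drbd1000 /mnt
```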

This volume is available on node1, node2 and node3. What if we want to access it from node4? This is done by adding node4 as another “diskless” participant:
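As a sketch of that step, run on the controller node: it assumes the client’s resource create subcommand accepts a --diskless flag (newer client releases may spell it --drbd-diskless), and the wrapper name add_diskless is hypothetical.

```shell
# Attach NODE to RESOURCE as a network-only client with no local replica.
add_diskless() {   # usage: add_diskless NODE RESOURCE
    linstor resource create "$1" "$2" --diskless
}

# Example: add_diskless node4 my_ssd_res
```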

And now the device can be accessed from node4 as well:

root@node4:~# mount /dev/drbd1000 /mnt
root@node4:~# cat /mnt/hello
world
root@node4:~# umount /mnt
root@node4:~#

Aside: you can see all the details of the inter-node TCP connections using drbdsetup show. On the diskless nodes, you’ll just see two connections to the nodes where data is present.

More detail is also available from linstor volume list:

(Note the VolNr. A single resource can contain multiple volumes — the first one is volume 0).

It’s incredibly easy to toggle a node into being another replica. For example, to promote node3 into a third replica of this volume:

node3 is now in the process of replicating the data. Then to remove the replica from node1:
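Both steps can be sketched with the client’s resource toggle-disk subcommand, run on the controller node. The wrapper names are hypothetical, and the pool name is the pool_ssd created earlier.

```shell
# Turn a diskless participant into a full replica backed by POOL.
add_replica() {    # usage: add_replica NODE RESOURCE POOL
    linstor resource toggle-disk "$1" "$2" --storage-pool "$3"
}

# Turn a replica back into a diskless participant, freeing its local storage.
drop_replica() {   # usage: drop_replica NODE RESOURCE
    linstor resource toggle-disk "$1" "$2" --diskless
}

# Example:
#   add_replica  node3 my_ssd_res pool_ssd
#   drop_replica node1 my_ssd_res
```

It is prudent to let the new replica finish synchronizing before dropping an existing one, so that the number of up-to-date copies never falls below what you intended.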

That’s all of Ganeti’s instance disk move, migrate and failover actions in a single command!

There is also a very nice utility called drbdtop which displays a drbd status summary. It’s not apt-packaged, but is a single binary download that you can copy into /usr/local/bin.

It shows a table of all resources, with check marks or crosses indicating the current replication status and health.

Linstor can do over-provisioning using thin LVM and thin ZFS volumes, which also enable snapshots. I have not tested these features.

Conclusion

Linstor is an extremely interesting entrant to the storage arena. It provides the robustness of DRBD replication and networked storage access, without the cost of a SAN or the complexity of a full-blown distributed storage system. (Linbit have also published their own comparison with Ceph.)

These examples show how to perform a basic test installation, but for full details, see the manual. In production you may also want to consider having a standby controller, or even an HA controller.

Finally: although I used lxd to run the VMs for this test, lxd itself can’t yet use Linstor for container and VM storage. If someone could write that, it would be awesome :-)

¹Whilst Ubuntu 20.04 has drbd 9.x utilities, I don’t think the corresponding module has been integrated into the kernel yet.