Longhorn – A Kubernetes-Native Filesystem

vegard.blog.engen.priv.no

58 points by jandeboevrie 7 days ago

dpedu 3 days ago

Kubernetes CSI drivers are surprisingly easy to write. You basically just have to implement a number of gRPC procedures that manipulate your system's storage as the Kubernetes control plane calls them. I wrote one that uses file-level syncing between hosts using Syncthing to "fake" network volumes.

https://kubernetes-csi.github.io/docs/developing.html

There are 4 gRPCs listed in the overview, that literally is all you need.
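To give a rough feel for the "implement a few procedures" idea above, here is a toy sketch in Python. It is not a real gRPC server or a working driver: `ToyNodeService` and its method signatures are hypothetical, and only the RPC names `NodePublishVolume` and `NodeUnpublishVolume` are borrowed from the CSI spec. Which four RPCs the linked overview lists is not reproduced here. A real driver would bind-mount the volume at the target path instead of recording it in a dict.

```python
import os
import tempfile


class ToyNodeService:
    """Toy stand-in for a CSI Node service: plain methods, no gRPC, no mounts."""

    def __init__(self, backing_root):
        # Directory that holds one sub-directory per volume (assumption).
        self.backing_root = backing_root
        # target_path -> volume_id, standing in for real mount state.
        self.published = {}

    def node_publish_volume(self, volume_id, target_path):
        """Mirrors the idea of NodePublishVolume; safe to call repeatedly."""
        source = os.path.join(self.backing_root, volume_id)
        os.makedirs(source, exist_ok=True)
        existing = self.published.get(target_path)
        if existing is not None and existing != volume_id:
            raise ValueError(f"{target_path} already holds volume {existing}")
        self.published[target_path] = volume_id
        return source

    def node_unpublish_volume(self, target_path):
        """Mirrors the idea of NodeUnpublishVolume; a repeat call is a no-op."""
        self.published.pop(target_path, None)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        svc = ToyNodeService(root)
        print(svc.node_publish_volume("vol-1", "/mnt/demo"))
        svc.node_unpublish_volume("/mnt/demo")
```

Both methods tolerate being called twice with the same arguments, since the control plane may retry calls.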

cmeacham98 3 days ago

I tried longhorn on my homelab cluster. I'll admit it's possible that I did something wrong, but I managed to somehow get it into a state where it seemed my volumes got permanently corrupted. At the very least I couldn't figure out how to get my volumes working again.

When restoring from backup I went with Rook (which is a wrapper on ceph) instead and it's been much more stable, even able to recover (albeit with some manual intervention needed) from a total node hardware failure.

nerdjon 3 days ago

It is interesting seeing this article come up, since just yesterday I set up Longhorn in my homelab cluster. I needed better performance for some tasks than NFS was providing, so I set up a RAID on my r630 and tried it out.
So far things are running well, but I can't shake the fear that I am in for a rude awakening and will lose everything. I have backups, but the recovery will be painful if I have to do it.
I will have to take a look at rook since I am not quite committed enough yet (only moved over 2 things) to switch.
- master_crab 3 days ago
  
  If the information is truly important push it off to a database or NAS. I use rook at home but really only for long lived app data (config files, etc). Anything truly important (media, files, etc) is served from an NFS attached to the cluster.
- cortesoft 3 days ago
  
  I have a small 4 node home cluster, and longhorn works great... on smaller volumes.
  I have a 15TB volume for video storage, and it can't complete any replica rebuilds. It always fails at some point and then tries to restart.
  
  nerdjon 2 days ago
  
  That is good to know then, I am really just using this for smaller volumes. My media is sitting at about the same size yours is, and instead of using PVCs I just have it mounting a straight NFS share specifically for that to avoid any issues there.
  I think I am likely keeping most of my storage set up with a storage class that uses my NFS as storage. But Longhorn will be used for the things that need to be faster, like the databases. I moved Jellyfin over to Longhorn and it went from being borderline unusable while metadata was grabbed to actually working well.
  I can't imagine my biggest volume being more than 100GB, and even that is likely a major overestimation on my part.

devn0ll 3 days ago

As an Enterprise user of Rancher, we had long discussions with Suse about Longhorn. And we are not using it.

You need a separate storage LAN, and a seriously beefy one at that, to use Longhorn. But even 25Gbit was not enough to keep volumes from being corrupted.

When rebuilds take too long, longhorn fails, crashes, hangs, etc, etc.

We will never make the mistake of using Longhorn again.

coopreme 3 days ago

Go with Ceph… a little more of a learning curve but overall better.

remram 3 days ago

Be aware of its security flaws -- https://github.com/longhorn/longhorn/issues/1983

Allowing anyone to delete all your data is not great. When I found this I gave up on Longhorn and installed Ceph.

dilyevsky 3 days ago

Anyone know what the story is with NVMe-oF/SPDK support these days? A couple of years ago Mayastor/OpenEBS was running laps around Longhorn on every performance metric, big time. Not sure if anything changed there...

studmuffin650 3 days ago

Where I work, we primarily use Ceph as a K8s-native filesystem. Though we still use OpenEBS for block store and are actively watching OpenEBS Mayastor.

__turbobrew__ 3 days ago

I looked into Mayastor and the NVMe-oF stuff is interesting, but it is so so so far behind Ceph when it comes to stability and features. Once Ceph has the next generation Crimson OSD with SeaStore, I believe it should close a lot of the performance gaps with Ceph.
- dilyevsky 3 days ago
  
  > Once ceph has the next generation crimson OSD with seastore I believe it should close a lot of the performance gaps with ceph.
  only been in development for what like 5 years at this point? =) i have no horse in this race but seems to me openebs will close the gap sooner.
  
  __turbobrew__ 2 days ago
  
  soon™

scubbo 3 days ago

(Copied from [0] when this was posted to lobste.rs) Longhorn was nothing but trouble for me. Issues with mount paths, uneven allocation of volumes, orphaned undeletable data taking up space. It’s entirely possible that this was a skill issue, but still - never touching it again. Democratic-csi [1] has been a breath of fresh air by comparison.

[0] https://lobste.rs/s/vmardk/longhorn_kubernetes_native_filesy...
[1] https://github.com/democratic-csi/democratic-csi

positisop 3 days ago

Longhorn is a poorly implemented distributed storage layer. You are better off with Ceph.

willbeddow 3 days ago

have not used longhorn, but we are currently in the process of migrating off of ceph after an extremely painful relationship with it. Ceph has fundamental design flaws (like the way it handles subtree pinning) that, IMO, make more modern distributed filesystems very useful. SeaweedFS is also cool, and for high performance use cases, weka is expensive but good.
- q3k 3 days ago
  
  That sounds more like a CephFS issue than a Ceph issue.
  (a lot of us distrust distributed 'POSIX-like' filesystems for good reasons)
- __turbobrew__ 3 days ago
  
  Are there any distributed POSIX filesystems which don’t suck? I think part of the issue is that a POSIX-compliant filesystem just doesn’t scale, and you are just seeing that?
  
  scheme271 3 days ago
  
  I think Lustre works fairly well. At the very least, it's used in a lot of HPC centers to handle large filesystems that get hammered by lots of nodes concurrently. It's open source, so nominally free, although getting a support contract from a specialized consulting firm might be pricey.
  
  latchkey 3 days ago
  
  https://www.reddit.com/r/AMD_Stock/comments/1nd078i/scaleup_...
  You're going to have to open the image and then go to the third image. I thought it was interesting that OCI pegs Lustre at 8Gb/s and their high performance FS at much higher than that... 20-80.
  
  scheme271 2 days ago
  
  That's 8Gb/s per TB of storage. The bandwidth is going to scale up as you add OSTs and OSSs. The OCI FS maxes at 80Gb/s per mount target.
  
  huntaub 3 days ago
  
  Basically, we are building this at Archil (https://archil.com). The reason these things are generally super expensive is that it’s incredibly hard to build.
  
  willbeddow 3 days ago
  
  weka seems to Just Work from our tests so far, even under pretty extreme load with hundreds of mounts on different machines, lots of small files, etc... Unfortunately it's ungodly expensive.

yupyupyups 3 days ago

I've heard Ceph is expensive to run. But maybe that's not true?
- keeperofdakeys 3 days ago
  
  Ceph overheads aren't that large for a small cluster, but they grow as you add more hosts, drives, and more storage. Probably the main gotcha is that you're (ideally) writing your data three times on different machines, which is going to lead to a large overhead compared with local storage.
  Most resource requirements for Ceph assume you're going for a decently sized cluster, not something homelab sized.
- jauntywundrkind 3 days ago
  
  I'm only just wading in, after years of intent. I don't feel like Ceph is particularly demanding. It does want a decent amount of RAM: 1GB each for monitor, manager, and metadata, up to 16GB total for larger clusters, according to the docs. But then each disk's OSD defaults to 4GB, which can add up fast!! And some users can use more. 10GbE is recommended and more is better here, but that seems not unique to Ceph: syncing storage will want bandwidth. https://docs.ceph.com/en/octopus/start/hardware-recommendati...
  
  westurner 3 days ago
  
  This from 2023 says: https://www.redhat.com/en/blog/ceph-cluster-single-machine :
  > All you need is a machine, virtual or physical, with two CPU cores, 4GB RAM, and at least two or three disks (plus one disk for the operating system).
  
  xyzzy123 3 days ago
  
  For me it was the RAM for the OSDs, 1GB per 1TB but ideally more for SSDs...
- master_crab 3 days ago
  
  It’s going to do a good job saturating your lan maintaining quorum on the data.
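
The figures quoted in this subthread (three copies of the data, about 1GB of RAM each for monitor, manager, and metadata daemons, and a 4GB default per OSD) can be turned into back-of-the-envelope arithmetic. The sketch below is only that: the function names are made up for illustration, it assumes a single instance of each daemon, and it ignores everything else a real cluster needs.

```python
def usable_capacity_tb(raw_tb, replicas=3):
    """Usable space when every object is stored `replicas` times."""
    return raw_tb / replicas


def estimated_ram_gb(num_osds, osd_gb=4, daemon_gb=1, daemons=3):
    """RAM for monitor + manager + metadata daemons plus one OSD per disk.

    Defaults follow the numbers quoted in the thread; one instance of each
    daemon is an assumption made to keep the example small.
    """
    return daemons * daemon_gb + num_osds * osd_gb


if __name__ == "__main__":
    # Hypothetical homelab: 3 hosts x 2 disks x 4TB each = 24TB raw, 6 OSDs.
    print(usable_capacity_tb(24))  # 8.0
    print(estimated_ram_gb(6))     # 27
```

For that hypothetical six-disk setup, 24TB of raw disk yields about 8TB usable, and the OSDs account for 24GB of the 27GB RAM estimate, which matches the point that per-disk memory adds up fast.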

johntash 2 days ago

For homelab uses, I've been enjoying Linstor/Piraeus a lot more than Longhorn lately. Fewer issues overall so far, and simpler.

yamapikarya 3 days ago

i am using NFS and i think it's pretty simple and just works

philsnow 2 days ago

It's simple enough, and I moved from Longhorn to NFS for my homelab as well, but I bristle at needing to have the same unix UIDs everywhere that wants to mount or serve an NFS volume. It seems like a huge layering violation.
I "just" want to expose storage over the network (I don't really care about the protocol, NFS would be fine) with a pre-shared secret or something like that.
edit: NFS really goes poorly when containers want to chown things, now I need to have a 'postgres' UID that's the same everywhere?
- yamapikarya 2 days ago
  
  not really sure about permission things, but basically it just dumps all your data inside the server and many applications are accessing it. i think it really depends on your application

d3Xt3r 6 days ago

Longhorn was the codename for Windows Vista... so not a great choice of a name (IMO).

onionisafruit 3 days ago

Longhorn is a fine name, and it doesn't matter if somebody else used it 20+ years ago
- selfhoster11 2 days ago
  
  That is false.
  Sincerely, a lover of Gemini (the protocol, and the AI) and Gopher (the protocol, and not the language).
- weinzierl 3 days ago
  
  By that logic Titanic would be a fine name too.
  
  ofrzeta 3 days ago
  
  https://www.titanic-magazin.de
  
  NewJazz 3 days ago
  
  Hmm, maybe just shorten to Titan?
  
  esafak 3 days ago
  
  Just don't use it to name a database.
  
  bigstrat2003 3 days ago
  
  I mean, I think it would be. Superstition about naming is silly.
- fineallaround 3 days ago
  
  [flagged]
  
  privatelypublic 3 days ago
  
  Even complaining about Vista raises eyebrows. It had two huge issues: overactive UAC, and Microsoft handing "Vista Certified" to basically anybody who asked. (Frequently to machines that would barely run XP pre-SP1.)
  Most of the complaints can be reduced to one of those.
  Yes- I hand wave away a lot of other things: because they were required for a huge step towards a decently secure and stable OS.
  
  samplatt 3 days ago
  
  >a huge step towards a decently secure and stable OS
  It absolutely was an important (and required) step towards a more secure and stable OS. What it was not, though, was a secure and stable OS.
  Windows ME was the same. A required step on the path towards something better, and ALSO something that had the "Windows XX-ready" badge slapped on anything that asked. But no one is lining up to try Vista again apart from technical challenges.
  
  privatelypublic 3 days ago
  
  ME is... not comparable? There are no security boundaries ME could implement: it was still DOS and FAT32.
  The list of changes Vista made was never going to go off without a hitch. When you put new boundaries in place in the kernel, and a driver violates them because it was recompiled rather than updated to handle the separation and the errors it raises, there's no choice but to kernel panic.
  Compatibility Shims were introduced for userland changes.
  Despite the hate, DWM handled the most frequent crashes: graphics.
  Microsoft is STILL working on pulling graphics code out of the kernel and into userland.

Delphiza 2 days ago

I agree. You have to be a certain age to remember that a big part of Microsoft "Longhorn" was WinFS (Windows File System), which was intended to completely rework storage into a relational file system (or object-oriented depending on your view). "Longhorn" was supposed to do away with NTFS and failed miserably at that objective. I believe that WinFS delayed things considerably and eventually didn't ship with Vista.
Microsoft Longhorn's failure to be the next big thing was largely due to the bad implementation of a storage subsystem. The result was Windows Vista, which was derided as a bad OS (at least until Windows 8). Due to that history, I would not name any file system 'Longhorn'. It may not be the same as naming a cruise ship 'Titanic', but you wouldn't name it 'Iceberg' either.

gdbsjjdn 3 days ago

I thought this was going to be about Vista and how some of the FS stuff that got cut was prescient. "This old thing that didn't work was ahead of its time" is a whole genre of post (e.g. Itanium).

antod 3 days ago

Could've been worse, e.g. Cairo or Blackcomb.

tracker1 3 days ago

I remembered the Windows Vista reference as soon as I saw the name. That said, I don't think it's a big deal.

pjmlp 3 days ago

Indeed, does it use .NET in its implementation, or are they already rewriting it into COM?
