Postmortem: Longhorn Snapshot Deadlock
Introduction
This is my first public postmortem. I've read plenty of them from companies like Cloudflare and AWS, and I always found them genuinely useful — not just as technical references, but as honest accounts of how things go wrong in real systems run by real people.
So when we had a bad day with our storage layer, I figured I'd write one too. Maybe it saves someone else from spending hours debugging the same thing. And honestly, writing it down helps me make sure I actually understand what happened — and won't forget it.
Long story short, our self-managed MinIO object storage became completely unavailable when its underlying Longhorn volume entered an infinite attach/detach loop. The root cause was a Longhorn snapshot chain that had grown to 252 entries — two over the engine's hard limit of 250 — preventing the storage engine from starting.
This post is a detailed account of what happened, why it was difficult to fix, and the steps we've taken to prevent it from happening again.
Background: How Longhorn Volumes and Snapshots Work
To understand what went wrong, it helps to understand how Longhorn manages storage.
Longhorn is a distributed block storage system for Kubernetes. Each volume has three components:
- Volume (CRD) — the orchestration and metadata layer
- Engine — a process that runs on the node where the volume is attached, handling all read/write operations
- Replicas — three copies of the actual data, one per node
Snapshots in Longhorn are stored as sparse .img files on each replica node. They form a linked chain through parent pointers in .img.meta files:
volume-head.img ← current live data (newest)
↑ parent
volume-snap-N.img ← most recent snapshot
↑ parent
...
volume-snap-1.img ← base snapshot (oldest, no parent)
Every time a recurring job runs, it creates a new snapshot by freezing the current state and creating a new volume-head. The old head becomes a snapshot in the chain.
When a volume is attached, the Longhorn engine walks the entire snapshot chain before it can start. This chain walk is where the hard limit matters.
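That chain walk is simple to picture in code. Here is a minimal Python sketch of it, assuming the on-disk layout described above (each `.img` has a sibling `.img.meta` JSON file with a `Parent` key); the directory layout and head file name are taken from the examples in this post, not from the real engine:

```python
import json
from pathlib import Path

# Mirrors longhorn-engine's MaximumTotalSnapshotCount
MAXIMUM_TOTAL_SNAPSHOT_COUNT = 250

def chain_length(replica_dir: Path, head: str = "volume-head.img") -> int:
    """Walk from the head backwards through Parent pointers, counting nodes."""
    count, current = 0, head
    while current:
        count += 1
        meta = json.loads((replica_dir / (current + ".meta")).read_text())
        current = meta.get("Parent", "")  # base snapshot has no parent
    return count

def open_live_chain(replica_dir: Path) -> None:
    """Refuse to start when the chain is too long, like openLiveChain() does."""
    n = chain_length(replica_dir)
    if n > MAXIMUM_TOTAL_SNAPSHOT_COUNT:
        raise RuntimeError(f"live chain is too long: {n}")
```

The point to notice is that the walk happens before the engine serves a single byte, so a too-long chain blocks attachment entirely.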
Root Cause
Our cluster had a recurring backup job configured with task: backup-force-create, running daily at 1AM with retain: 7.
Looking at the Longhorn source code, the difference between backup and backup-force-create is subtle and easy to misunderstand:
RecurringJobTypeBackup = RecurringJobType("backup") // periodically create snapshots then do backups
RecurringJobTypeBackupForceCreate = RecurringJobType("backup-force-create") // periodically create snapshots then do backups even if old snapshots cleanup failed
Both task types create snapshots and perform backups. Neither of them handles snapshot cleanup on their own. The only difference is that backup-force-create will proceed even if a previous snapshot cleanup attempt had failed, while backup would skip the run in that case.
Snapshot cleanup is an entirely separate concern, handled by the snapshot-cleanup task type. This was the actual gap in our setup — we had a snapshot-cleanup recurring job, but it was assigned only to the all group. Our MinIO volume was labeled with a custom group. The cleanup job silently never ran on it.
Over 111 days, snapshots accumulated:
Daily backup job → 1 new snapshot per day
Cleanup job → never ran on the volume (wrong group assignment)
111 days later → 252 snapshots
Longhorn's engine has a hard limit of 250 snapshots, defined in the source code of longhorn-engine:
const (
MaximumTotalSnapshotCount = 250
)
This limit is enforced during engine startup in openLiveChain():
func (r *Replica) openLiveChain() error {
chain, err := r.Chain()
if err != nil {
return err
}
if len(chain) > types.MaximumTotalSnapshotCount {
return fmt.Errorf("live chain is too long: %v", len(chain))
}
// ...
}
The Chain() function walks from volume-head backwards through parent pointers, counting each node. With 252 snapshots, the chain length exceeded 250 and the engine aborted immediately on every attach attempt.
The exact error in the logs:
failed to open replica <replica-ip>:10400 from remote:
rpc error: code = Unknown desc = live chain is too long: 252
This caused the volume to enter an infinite loop: attach attempted → engine fails → volume detaches → repeat.
Why Was This Difficult to Fix?
The straightforward fix — delete some snapshots — turned out to be a problem due to several compounding factors.
The snapshot deletion deadlock
Deleting snapshot CRDs (snapshots.longhorn.io) via kubectl does not delete the underlying snapshot data on disk. The Longhorn controller immediately recreates the CRD objects from the actual .img files it detects on the replica nodes. This means CRD deletion is effectively a no-op.
The correct way to purge snapshots is through Longhorn's purge API, which instructs the engine to delete the .img files from disk and then clean up the CRDs. But this requires the volume to be attached and the engine to be running. The engine won't start because there are too many snapshots:
Engine won't start → can't purge snapshots via API
Can't purge snapshots → engine won't start
Finalizers blocked kubectl force deletion
Snapshot CRDs have a longhorn.io finalizer. When we attempted to remove it, the Longhorn controller re-added it faster than we could remove it — it's a reconciliation loop running continuously. Since longhorn-manager runs as a DaemonSet, it couldn't be scaled down to pause reconciliation.
The backup restore path was blocked
As an alternative recovery path, we attempted to restore the volume from a recent backup stored on our on-prem MinIO backup server. All three restore replicas failed with:
AWS Error: RequestTimeTooSkewed The difference between the request time
and the server's time is too large.
The backup server's clock was approximately 7 minutes and 35 seconds ahead of our cluster nodes. MinIO (which implements the S3 API) rejected requests outside its time skew tolerance. Unfortunately, access to the backup server was restricted, so correcting its clock meant going through support, which would take time we didn't have.
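In hindsight, this skew was trivially detectable before we ever needed a restore. A minimal sketch of a preflight check, assuming the backup endpoint returns a standard HTTP Date header (the fetch shown in the trailing comment is illustrative):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def skew_seconds(server_date_header: str, now: datetime) -> float:
    """Absolute difference between a server's HTTP Date header and local time."""
    server_time = parsedate_to_datetime(server_date_header)
    return abs((server_time - now).total_seconds())

# A periodic check could fetch the header from the backup endpoint, e.g.:
#   with urllib.request.urlopen(backup_endpoint) as resp:
#       skew = skew_seconds(resp.headers["Date"], datetime.now(timezone.utc))
# and alert when skew approaches the server's configured tolerance.
```

AWS documents a 15-minute default tolerance for RequestTimeTooSkewed; whatever tolerance our MinIO instance was effectively running with, a check like this would have flagged the drift long before the outage.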
How We Fixed It
With both the purge path and the restore path blocked, the only remaining option was to directly modify the snapshot chain on disk.
Identifying safe snapshots to delete
Longhorn snapshot .img files are sparse — they only store the delta (changed blocks) from the previous snapshot. A snapshot with zero actual disk usage contains no unique data and can be removed without data loss.
We scanned the replica directory for size-0 snapshots:
for f in /data/longhorn/replicas/<volume>-<replica-id>/volume-snap-*.img; do
# du reports allocated blocks rather than apparent size, so a fully
# sparse snapshot (one holding no unique data) shows up as 0
size=$(du -s "$f" | awk '{print $1}')
if [ "$size" = "0" ]; then
echo "$f"
fi
done
We found multiple size-0 snapshots and identified three consecutive ones in the chain:
...→ snap-A → snap-B (0 bytes) → snap-C (0 bytes) → snap-D (0 bytes) → snap-E → ...
"Consecutive" is the key requirement here. We can safely remove a sequence of snapshots from the chain as long as we relink the chain by updating the parent pointer of the first surviving node to skip over the deleted ones.
Verifying consistency across all three replica nodes
Before touching any files, we confirmed the same three snapshots existed on all three replica nodes with identical parent pointers and zero actual disk usage. Replicas must remain consistent — deleting snapshots from one node but not others would cause Longhorn to detect an inconsistency during the next attach.
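This verification can also be scripted. A minimal sketch, reusing the `.img.meta` JSON layout shown earlier (the helper names are mine):

```python
import json
from pathlib import Path

def parent_of(replica_dir: Path, snap: str) -> str:
    """Read the Parent pointer from a snapshot's .img.meta file."""
    meta = json.loads((replica_dir / f"{snap}.img.meta").read_text())
    return meta.get("Parent", "")

def consistent_across_replicas(replica_dirs: list, snaps: list) -> bool:
    """True only if every candidate snapshot carries the identical Parent
    pointer on every replica node."""
    for snap in snaps:
        parents = {parent_of(d, snap) for d in replica_dirs}
        if len(parents) != 1:
            return False
    return True
```

Only when this check passed on all three nodes did we proceed to modify anything.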
Performing the surgery
On each replica node:
Step 1: Backup the affected metadata files
mkdir -p /tmp/longhorn-backup
cp /data/longhorn/replicas/<volume>-<replica-id>/volume-snap-<snap-E>.img.meta /tmp/longhorn-backup/
Step 2: Update the parent pointer to skip the deleted snapshots
python3 -c "
import json
f = '/data/longhorn/replicas/<volume>-<replica-id>/volume-snap-<snap-E>.img.meta'
with open(f) as fp:
data = json.load(fp)
data['Parent'] = 'volume-snap-<snap-A>.img'
with open(f, 'w') as fp:
json.dump(data, fp, separators=(',', ':'))
"
Step 3: Delete the three size-0 snapshot files
rm volume-snap-<snap-B>.img
rm volume-snap-<snap-B>.img.meta
rm volume-snap-<snap-C>.img
rm volume-snap-<snap-C>.img.meta
rm volume-snap-<snap-D>.img
rm volume-snap-<snap-D>.img.meta
After completing the operation on the first node, Longhorn automatically purged additional orphaned snapshots, reducing the count from 252 to 229. The chain was valid again. The engine started successfully, the volume attached, and MinIO came back online.
The third replica, which was still partway through the cleanup when the volume reattached, ended up missing from the volume and had to be rebuilt. Longhorn automatically initiated replica rebuilding and restored full redundancy within a few hours.
Why Wasn't This Caught Earlier?
No alerting on snapshot count
We had no monitoring configured to alert when snapshot counts approached the 250 limit. The TooManySnapshots condition appeared in the volume status as early as January 16 — over a month before the outage — but no one was watching for it.
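Even without a full Prometheus setup, a crude watchdog is a few lines of Python. A sketch, assuming each `snapshots.longhorn.io` object names its volume in `spec.volume` (verify this against your Longhorn version before relying on it):

```python
import json
import subprocess
from collections import Counter

HARD_LIMIT = 250
WARN_AT = 200  # alert well before the engine's hard limit

def counts_from_items(items: list) -> Counter:
    """Tally snapshot CRDs per volume (spec.volume is assumed to name it)."""
    counts = Counter()
    for item in items:
        counts[item["spec"]["volume"]] += 1
    return counts

def volumes_near_limit() -> list:
    """Fetch snapshot CRDs via kubectl and flag volumes approaching the limit."""
    out = subprocess.run(
        ["kubectl", "get", "snapshots.longhorn.io", "-n", "longhorn-system",
         "-o", "json"],
        check=True, capture_output=True, text=True).stdout
    counts = counts_from_items(json.loads(out)["items"])
    return [(vol, n) for vol, n in counts.items() if n >= WARN_AT]
```

Run on a schedule and wired to any notification channel, this would have surfaced the problem months before the chain hit 250.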
Misunderstood behavior of backup-force-create
The backup-force-create task type is often assumed to handle snapshot cleanup as part of its run. It doesn't — and neither does backup. Both task types only create snapshots and perform backups. The difference between them is purely about failure tolerance: backup skips the run if a previous cleanup failed, while backup-force-create proceeds regardless.
Snapshot cleanup is handled entirely by the separate snapshot-cleanup task type. This separation was not well understood when the recurring jobs were originally configured.
Group assignment mismatch went unnoticed
The snapshot-cleanup job was assigned to the all group. The MinIO volume was in a different custom group. This mismatch meant the cleanup job silently never ran on the volume it needed to protect. There was no indication of this in normal operations.
To understand why this is so easy to miss, it helps to look at how Longhorn actually resolves which volumes a recurring job should run on. The mechanism is entirely label-based. Volumes are tagged with labels in this format:
recurring-job-group.longhorn.io/<group-name>=enabled
When a recurring job runs, the volume selection logic lives in app/recurring_job.go in the recurringJob() function. It first queries for volumes directly assigned to the job by name, then iterates over each group in spec.groups:
volumes, err := getVolumesBySelector(types.LonghornLabelRecurringJob, jobName, namespace, lhClient)
filteredVolumes := []string{}
filterVolumesForJob(allowDetached, volumes, &filteredVolumes)
for _, jobGroup := range jobGroups {
volumes, err := getVolumesBySelector(types.LonghornLabelRecurringJobGroup, jobGroup, namespace, lhClient)
filterVolumesForJob(allowDetached, volumes, &filteredVolumes)
}
logger.Infof("Found %v volumes with recurring job %v", len(filteredVolumes), jobName)
If a volume doesn't carry the label the job is querying for, getVolumesBySelector returns zero results and the job moves on silently — no warning, no error, no indication that it skipped anything. The only trace is the Found 0 volumes log line, which is printed at info level but easy to overlook in normal operations.
In our case, the volume had this label:
recurring-job-group.longhorn.io/data=enabled
But the cleanup job only queried for:
recurring-job-group.longhorn.io/all=enabled
The query returned zero volumes. The job reported success. This happened every week for 111 days without anyone noticing.
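The selection rule is simple enough to mirror in a sanity-check script. A simplified sketch of it (this ignores the direct per-job label, `recurring-job.longhorn.io/<name>`, for brevity):

```python
GROUP_LABEL = "recurring-job-group.longhorn.io/{}"

def job_selects_volume(job_groups: list, volume_labels: dict) -> bool:
    """A recurring job matches a volume when any of the job's groups appears
    as an enabled recurring-job-group label on that volume."""
    return any(
        volume_labels.get(GROUP_LABEL.format(group)) == "enabled"
        for group in job_groups
    )
```

Cross-checking every volume's labels against every cleanup job's groups with a function like this, as part of CI or a periodic audit, would have caught our mismatch on day one.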
No verified restore path
We had backups, but had never tested restoring from them in this environment. The clock skew issue on the backup server was unknown until we needed to use it. A restore drill would have caught this.
Remediation
Immediate fixes applied
Fix the snapshot-cleanup job group assignment
The snapshot-cleanup recurring job now covers both the all group and our custom volume group:
kubectl patch recurringjob <cleanup-job-name> -n longhorn-system \
--type merge -p '{"spec":{"groups":["all","<your-group>"],"cron":"0 2 * * *"}}'
We also changed the schedule from weekly to daily. Weekly cleanup, against a daily backup job that adds a new snapshot every run, leaves too little margin.