☕ It Started Like Any Normal Day
Coffee. Tickets. The usual background noise of enterprise support work at Wipro for Dell EMC. You know the drill — log in, check the queue, sip tea (or coffee if you're still pretending to be productive).
Then my colleague picked up a case from one of India's top banks: 16TB of backup data for a critical application, missing.
Not failed. Not corrupted. Not "partially unavailable."
Just… gone.
I remember the energy shift in the room. That specific kind of silence when someone says something that shouldn't be possible.
Because here's the thing about Dell EMC NetWorker — it doesn't randomly wake up one morning, look at 16TB of banking data, and think "you know what, today I delete." That's just not how it works. But try explaining that to a bank that's already halfway through drafting escalation emails.
🏦 The Setup (aka "This Should Never Fail")
Before we dive into the chaos, let me explain what was supposed to happen. The bank had a carefully designed backup workflow for one of their critical applications — one that couldn't be taken offline (no cold backups allowed). Whenever the application team confirmed low usage, backup admins would trigger this OnDemand workflow.
(Diagram: Application → OnDemand backup workflow)
Basically, the kind of setup that auditors love and engineers quietly fear touching. And this workflow had been running perfectly for months. Which is exactly why this incident made zero sense.
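For context, on NetWorker 9 and later this kind of on-demand workflow is usually kicked off from the server CLI with nsrpolicy. A minimal sketch — the policy and workflow names here are invented, and exact syntax varies by version, so check your command reference:

```bash
# Manually trigger the on-demand backup once the application team
# confirms low usage. "Bank_Critical_App" / "OnDemand_Full" are
# hypothetical names, not the customer's real configuration.
nsrpolicy start -p "Bank_Critical_App" -w "OnDemand_Full"
```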
🚨 The Weird Part Nobody Could Explain
During what should have been a routine backup trigger, the admin noticed something deeply wrong: NetWorker was relabelling a tape that still held savesets under a 7-year retention policy — effectively queueing it up to be overwritten.
The admin panicked and stopped the job immediately. Smart move.
Let me explain why "relabelled" is such a big deal, because this is where the tape library logic matters:
- Tapes don't get relabelled, automatically or manually, unless the last saveset on that tape has expired. This is default, non-negotiable behavior.
- If there are any free tapes available for a workflow/pool, NetWorker will use those instead of relabelling an old tape.
- If a tape IS being relabelled → the system believes the data on it has already expired. But that data was supposed to have 7 years of retention.
So either NetWorker had a serious bug (nightmare scenario for everyone), or something had caused the saveset to expire prematurely. My colleague did the initial analysis and found… nothing. No trace. No clear reason. Clean logs saying backup completed successfully.
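You can see this logic for yourself in the media database. Here's a hedged mminfo sketch — the volume name is invented — that lists every saveset on a tape along with its retention expiry:

```bash
# Show savesets on a given volume with their retention expiry dates.
# A tape is only eligible for relabelling once every ssretent date
# here has passed. "BANK_POOL.001" is a hypothetical volume label.
mminfo -avot -q "volume=BANK_POOL.001" \
       -r "volume,ssid,savetime,ssretent,sumflags"
```

In our case, a query like this should have shown seven years of runway left on that tape.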
The case was already escalating. The customer was ready to go nuclear. And the data situation was ugly: the relabel had been caught mid-way, leaving the saveset on that tape incomplete — technically still on media, but invisible to any standard recovery path.
😤 I Took Over the Case
Not because I had answers. I absolutely did not. But because nobody else did either, and someone had to own it.
I set two goals, in priority order:
- Recover the data — whatever could still be pulled off that tape.
- Find the root cause — because "unknown" was not an answer a bank would accept.
Here's where most engineers would look at the incomplete saveset and give up. An incomplete backup in standard recovery scenarios is like a key with half its teeth missing — it technically exists, but it won't open anything useful.
Most tools won't touch it. Most procedures don't cover it.
But then I remembered something I'd mostly ignored for years:
💡 UASM — The Tool I Underestimated
UASM (Universal Access Storage Module) is a low-level NetWorker save/recover module. Driven through scanner, it gives you raw, direct access to backup data on media — bypassing the client file index and the saveset completeness checks that normal recovery depends on.
Think of it as the emergency crowbar of NetWorker. When the front door is locked, the key is broken, and the windows are shut — UASM finds a way in through the wall.
The plan I came up with was unusual but logical: pull the physical tape back, load it into the library, scan it to re-register whatever the media database would still accept, and then use UASM to rip the raw data straight off the media — completeness checks be damned.
We executed. Tape retrieved, loaded, scanned. UASM command fired from the storage node CLI. And then… we waited.
No dramatic "problem solved" moment. Just steady, slow progress. Chunk by chunk. File by file.
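For the record, the command we fired was shaped roughly like this. Treat it as a sketch, not a recipe: the device path, saveset ID, and relocation mapping are placeholders, and flags vary across NetWorker versions (check the scanner and uasm man pages before trying this):

```bash
# Read the saveset directly off tape with scanner and pipe it through
# uasm, bypassing the client file index entirely.
#   -S : the saveset ID to extract (placeholder value below)
#   -x : hand the data stream to the named command (uasm)
#   -r : recover mode; -v : verbose; -m : relocate to a staging path
scanner -S 1234567890 /dev/nst0 \
        -x uasm -rv -m /prod/appdata=/restore/staging
```

Slow and noisy — and not something you point at production paths without that -m relocation.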
That should have been the end of it. Close the case, write the RCA as "unknown cause," and move on. But I've never been comfortable with "unknown."
🔍 Going Back in Time (No, Really)
I reached out to the L3 SME team. Their suggestion: go back in time.
My first reaction was: what? Go to the past? Are you writing a script for Doctor Strange?
But they weren't being philosophical. They were talking about Bootstrap Backups.
A Bootstrap is a special save set that NetWorker creates every day, containing:
- 📂 Media Database — knows about every saveset, tape, and retention policy
- ⚙️ Resource Files — all configuration data
- 🔐 Auth Service Data — user and permission info
It's essentially a daily snapshot of NetWorker's brain. And since it runs every day, it gives you a historical timeline of how the system thought things looked at different points in time.
The idea was brilliant once it clicked: if a saveset was deleted from the current media database, we can't see it now. But if we restore bootstrap backups from before the deletion, we can see it in the historical database. Then we replay day by day until it disappears — and that pinpoints exactly when and why it expired.
We pulled a full month of bootstrap backups, restored them in an isolated folder, compressed them, and brought them to a lab environment. Carefully. Methodically. Day by day.
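Mechanically, the hunt looked something like the sketch below. The supported route for bootstrap recovery is mmrecov (pre-9) or the nsrdr wizard (9+); this is just the shape of the idea, with placeholder ssids and paths:

```bash
# 1) List the daily bootstrap savesets recorded in the media database.
mminfo -B

# 2) Pull a given day's bootstrap out as a saveset recovery into an
#    isolated staging directory for offline inspection in the lab.
#    The ssid and destination path below are placeholders.
recover -S 4294967290 -d /staging/bootstraps/day-01
```

Repeat for each day in the window, compare what each day's media database believed about the saveset, and watch for the day its retention suddenly changes.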
And we saw it. Clear as day. The savesets were valid — then suddenly expired — days before the tape incident.
Something changed the retention.
🎭 The Audit Log Confession
We got on a session with the customer and pulled the audit logs.
And there it was.
Their intention? Extend the retention period.
What they actually selected? "Expire Immediately."
One dropdown. One confident click. 16TB scheduled for deletion.
No bug. No corruption. No mystery. Just a UI option that looks innocuous until it isn't.
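If you ever need to run the same hunt, the audit trail is a .raw log on the server that you render into readable text. A hedged sketch — the path and file name vary by install:

```bash
# Render the NetWorker security audit log and look for retention or
# expiration changes around the incident window. The log path below
# is a typical default, not a guarantee — check your audit log resource.
nsr_render_log /nsr/logs/security/networker_sec_audit.raw \
    | grep -iE "retention|expir"
```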
The same thing had happened to multiple savesets, not just this one. So we spent the next few weeks loading every affected tape, running scanner against all the savesets, and manually resetting retention back to the correct compliance period.
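The cleanup loop per tape looked roughly like this — placeholder device, ssid/cloneid, and date, and note that changing expiration this way is version-dependent (verify against your nsrmm man page):

```bash
# Re-register the tape's savesets into the media database
# (-m rebuilds media database entries only, no client file index).
scanner -m /dev/nst0

# Push a saveset's expiration back out to the compliance period.
# The ssid/cloneid and date below are placeholders.
nsrmm -S 1234567890/1358113519 -e "01/15/2026"
```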
🧠 What This Case Taught Me
Backup systems don't fail dramatically. They fail quietly — when a small mistake slips through and nobody notices until it's too late. No alarm, no error email, no red screen. Just a checkbox that meant something different than what you thought.
NetWorker didn't malfunction. It followed instructions. A human made the decision; the system did exactly what it was told.
And ever since that case, whenever someone says "we didn't change anything" — I stop believing them immediately. Not because people lie, but because sometimes they genuinely don't remember that one dropdown three weeks ago that felt routine.
Always check the audit logs. Always.
This whole process took around 45 days to complete. One of the best experiences in my career at Dell — equal parts terrifying, exhausting, and deeply satisfying.
If you've had a similar incident, or if this raised questions about your own backup setup, feel free to drop a comment below. I try to respond to everything.
💬 Discussion
Have a question, a similar story, or spotted something I missed? Drop it below.
The bootstrap time-travel trick is genuinely underrated — pulling historical media databases to trace exactly when a policy change happened. Most engineers don't even know you can do it.