Dell EMC NetWorker Backup & Recovery RCA War Story

💀 The Day 16TB Vanished — And Everyone Looked at Me

A banking client. A missing backup. A tape that wasn't supposed to move. And 45 days that changed how I think about "bulletproof" systems.

Shibu A · 12 min read · May 2025
Part 1

☕ It Started Like Any Normal Day

Coffee. Tickets. The usual background noise of enterprise support work at Wipro for Dell EMC. You know the drill — log in, check queue, sip tea (or coffee if you're still pretending to be productive).

Then my colleague picked up a case from one of India's top banks.

🚨
"16TB of backup data is missing."
Not failed. Not corrupted. Not "partially unavailable."
Just… gone.

I remember the energy shift in the room. That specific kind of silence when someone says something that shouldn't be possible.

Because here's the thing about Dell EMC NetWorker — it doesn't randomly wake up one morning, look at 16TB of banking data, and think "you know what, today I delete." That's just not how it works. But try explaining that to a bank that's already halfway through drafting escalation emails.

🏦 The Setup (aka "This Should Never Fail")

Before we dive into the chaos, let me explain what was supposed to happen. The bank had a carefully designed backup workflow for one of their critical applications — one that couldn't be taken offline (no cold backups allowed). Whenever the application team confirmed low usage, backup admins would trigger this OnDemand workflow.

📐 Backup Workflow Architecture

🏛️ Critical Banking Application (hot backup only)
   ↓ triggers the OnDemand Workflow
💿 Action 1: Backup → Data Domain, 2 months retention
📼 Action 2: Clone → LTO-8 Tape Library, 7 years retention

⚠️ Bank compliance: Tapes can NEVER be reused. Once full → ejected → cold storage.

Basically, the kind of setup that auditors love and engineers quietly fear touching. And this workflow had been running perfectly for months. Which is exactly why this incident made zero sense.
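Side note: this two-copy layout is visible in the media database, where the backup and its clone share one saveset ID (ssid) but carry different clone IDs, volumes, and retention dates. A minimal sketch of how an admin might confirm both copies; the client name and timeframe are hypothetical:

```
# Minimal sketch, hypothetical client name and timeframe.
# Each row is one copy of a saveset: the Data Domain copy and the
# tape clone share an ssid but differ in cloneid, volume, and pool.
mminfo -avot -q "client=bank-app01,savetime>last month" \
       -r "ssid,cloneid,volume,pool,ssretent,sumsize"
```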

Part 2

🚨 The Weird Part Nobody Could Explain

During what should have been a routine backup trigger, the admin noticed something deeply wrong:

📼
A tape belonging to this workflow was being relabelled — and new clone data was being written on top of it.

The admin panicked and stopped the job immediately. Smart move.

Let me explain why "relabelled" is such a big deal, because this is where the tape library logic matters:

🧠 How Tape Relabelling Works in NetWorker
Rule 1

Tapes don't get relabelled automatically or manually unless the last saveset on that tape has expired. This is default, non-negotiable behavior.

Rule 2

If there are any free tapes available for a workflow/pool, NetWorker will use those instead of relabelling an old tape.

Therefore

If a tape IS being relabelled → the system believes the data on it has already expired. But that data was supposed to have 7 years of retention.
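This logic is checkable from the CLI. A quick sketch (volume name is hypothetical) of what you would query before trusting a relabel: if any saveset on the tape still shows a retention date in the future, that tape should not be touched.

```
# Hypothetical volume name. Any saveset here with a future ssretent
# date means the tape is NOT eligible for relabelling.
mminfo -avot -q "volume=BANKP0012" -r "ssid,name,savetime,ssretent,ssflags"
```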

So either NetWorker had a serious bug (nightmare scenario for everyone), or something had caused the saveset to expire prematurely. My colleague did the initial analysis and found… nothing. No trace. No clear reason. Clean logs saying backup completed successfully.

The case was already escalating. Customer was ready to go nuclear. And the data situation looked like this:

📊 Damage Assessment

💿 Data Domain copy: GONE, expired long ago
📼 Tape 1 (~15TB): SAFE, ejected to cold storage
📼 Tape 2 (~1TB): OVERWRITTEN, relabelled
The backup spanned 2 tapes. Tape 2 held the "tail" of the saveset — the last ~1TB. Without it, the entire backup is technically incomplete.
Part 3

😤 I Took Over the Case

Not because I had answers. I absolutely did not. But because nobody else did either, and someone had to own it.

I set two goals, in priority order:

🥇 Primary: Recover as much data as possible
🥈 Secondary: Find the root cause — because "we don't know" is not an acceptable answer for a bank

Here's where most engineers would look at the incomplete saveset and give up. An incomplete backup in standard recovery scenarios is like a key with half its teeth missing — it technically exists, but it won't open anything useful.

Most tools won't touch it. Most procedures don't cover it.

But then I remembered something I'd mostly ignored for years:

Part 4

💡 UASM — The Tool I Underestimated

🤯
Until this case, I thought UASM was just a niche utility for controlling backup/clone actions. I was very, very wrong.
🔧 What is UASM?

UASM (uasm) is NetWorker's universal Application Specific Module: the low-level module that reads and writes saveset data streams. Driven directly from the command line, typically fed by scanner, it gives you raw access to backup data on media, bypassing the client file index and the normal saveset completeness checks.

Think of it as the emergency crowbar of NetWorker. When the front door is locked, the key is broken, and the windows are shut — UASM finds a way in through the wall.

The plan I came up with was unusual but logical:

🛠️ Recovery Plan — Operation: Get The Data Back

1. Retrieve Tape 1 from cold storage — the one with ~15TB
2. Load without mounting in NetWorker, run scanner to rebuild the saveset index
3. Run UASM from the storage node CLI to extract the data directly, ignoring the missing tail
4. Dump recovered data to the storage node's local disk (needed ~15TB free space)
5. Re-backup to tape using a temp workflow, so data is properly protected again
🍀
Rare moment of luck: The customer's storage node had about 20TB of free space available. For once, the universe was cooperating.
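In CLI terms, steps 2 and 3 boil down to something like the sketch below. The device path, ssid, and target directory are placeholders, and exact options vary by NetWorker release, so treat this as the shape of the operation rather than a copy-paste recipe.

```
# Step 2: rebuild media database entries from the tape without a
# normal mount, so NetWorker learns what savesets live on it.
scanner -m /dev/nst0

# Step 3: read one saveset straight off the tape and pipe it through
# uasm in recover mode; -m relocates the recovered tree onto the
# storage node's local disk.
scanner -S 1234567890 /dev/nst0 -x uasm -rv -m /bankapp=/restore/bankapp
```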

We executed. Tape retrieved, loaded, scanned. UASM command fired from the storage node CLI. And then… we waited.

No dramatic "problem solved" moment. Just steady, slow progress. Chunk by chunk. File by file.

Result: ~15TB successfully recovered. Not the full 16TB — the 1TB tail on Tape 2 was gone forever — but no longer a disaster. The critical data was back.

That should have been the end of it. Close the case, write the RCA as "unknown cause," and move on. But I've never been comfortable with "unknown."

Part 5

🔍 Going Back in Time (No, Really)

I reached out to the L3 SME team. Their suggestion:

"Go to the past. Find when the data actually expired."

My first reaction: what? Go to the past? Are you writing a script for Doctor Strange?

But they weren't being philosophical. They were talking about Bootstrap Backups.

🧠 What is a NetWorker Bootstrap?

A Bootstrap is a special save set that NetWorker creates every day, containing:

  • 📂 Media Database — knows about every saveset, tape, and retention policy
  • ⚙️ Resource Files — all configuration data
  • 🔐 Auth Service Data — user and permission info

It's essentially a daily snapshot of NetWorker's brain. And since it runs every day, it gives you a historical timeline of how the system thought things looked at different points in time.

The idea was brilliant once it clicked: if a saveset was deleted from the current media database, we can't see it now. But if we restore bootstrap backups from before the deletion, we can see it in the historical database. Then we replay day by day until it disappears — and that pinpoints exactly when and why it expired.
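Mechanically, that means locating the daily bootstrap savesets and restoring each day's media database into an isolated lab environment. A rough sketch, assuming a recent NetWorker release:

```
# List the bootstrap savesets from recent weeks: date, ssid,
# file/record position, and the volume each one lives on.
mminfo -B

# Restore the media database and resource files from a chosen
# bootstrap. nsrdr is the disaster recovery wizard (mmrecov on
# older releases); only ever run this against a lab server.
nsrdr
```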

🔬 The "Time Machine" Investigation
Day of Backup
Backup + Clone completed. 16TB written. Retention: 7 years. ✅
Days Later
Bootstrap backups confirm: savesets present, retention intact. All normal.
⚠️ The Day Everything Changed
Savesets suddenly marked as expired. Way too early. Something triggered this.
Next Day
Expiration job runs. Data purged from Data Domain. Tapes become eligible for reuse.
Next Backup Trigger
Tape gets relabelled. Clone overwrites 1TB tail. Admin notices. Chaos begins. 💀

We pulled a full month of bootstrap backups, restored them in an isolated folder, compressed them, and brought them to a lab environment. Carefully. Methodically. Day by day.

And we saw it. Clear as day. The savesets were valid — then suddenly expired — days before the tape incident.

Something changed the retention.
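The check itself was simple once each day's database was loaded: run the same query against every restored copy and watch the retention value move. The lab server name and ssid below are placeholders.

```
# Repeat against each day's restored media database. The day ssretent
# jumps from "7 years out" to a date already in the past is the day
# the retention was changed.
mminfo -s lab-nsr01 -q "ssid=1234567890" -r "ssid,cloneid,ssretent,ssflags"
```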

Part 6

🎭 The Audit Log Confession

We got on a session with the customer and pulled the audit logs.

And there it was.

😱
The customer had manually modified the retention on those savesets.

Their intention? Extend the retention period.
What they actually selected? "Expire Immediately."

One dropdown. One confident click. 16TB scheduled for deletion.

No bug. No corruption. No mystery. Just a UI option that looks innocuous until it isn't.

🔗 The Complete Chain of Events

👆 Admin selects "Expire Immediately" instead of extending retention
⏱️ NetWorker marks savesets as expired in media database
🗑️ Next day: expiration job runs, data deleted from Data Domain
📼 Tapes become eligible for reuse (all savesets expired)
💀 Next clone operation relabels tape, overwrites data
NetWorker executed instructions perfectly. The instruction was wrong.

The same thing had happened to multiple savesets, not just this one. So we spent the next few weeks loading every affected tape, running scanner against all the savesets, and manually resetting retention back to the correct compliance period.
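For each affected saveset, the fix was essentially the reverse of the mistake: make sure the saveset is back in the media database, then push its retention back out to the compliance period. A sketch with placeholder ssid and date; on some releases the browse time has to be adjusted alongside the retention.

```
# Re-register the tape's savesets in the media database if needed.
scanner -m /dev/nst0

# Reset the saveset's retention to the correct compliance date.
nsrmm -S 1234567890 -e "12/31/2031"
```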

45 days to fully resolve · 15TB of data recovered · 1 wrong dropdown choice · 0 bugs in NetWorker
Takeaway

🧠 What This Case Taught Me

Backup systems don't fail dramatically. They fail quietly, when a small mistake slips through and nobody notices until it's too late. No alarm, no error email, no red screen. Just a dropdown option that meant something different from what you thought.

💬
NetWorker didn't delete 16TB of data.
It followed instructions. A human made the decision. The system did exactly what it was told.

And ever since that case, whenever someone says "we didn't change anything" — I stop believing them immediately. Not because people lie, but because sometimes they genuinely don't remember that one dropdown three weeks ago that felt routine.

Always check the audit logs. Always.

This whole process took around 45 days to complete. One of the best experiences in my career at Dell — equal parts terrifying, exhausting, and deeply satisfying.

If you've had a similar incident, or if this raised questions about your own backup setup, feel free to drop a comment below. I try to respond to everything.

Shibu A
Support Engineer & Infrastructure Admin with 5+ years handling enterprise backup systems, storage, and Linux environments. Currently based in Doha, Qatar.

💬 Discussion

Have a question, a similar story, or spotted something I missed? Drop it below.

👤 Shibu A (Author)

The bootstrap time-travel trick is genuinely underrated. Pulling historical media databases to trace when a policy change happened is something most engineers don't even know you can do.