SCSI Hell. My worst day in IT
Posted by New-Assumption-3106@reddit | talesfromtechsupport | 81 comments
This was possibly 15 years ago. My biggest client, an accountancy practice, had as their main server a Compaq ProLiant with 6 non-hot-pluggable SCSI drive bays. Four of the bays were occupied by a RAID5 array. They wanted more disk space and we decided to put two more big drives in and create another mirrored volume.
Easy. Right?
Downtime during production hours was a complete no-no, so I got in there super-early, like 06:00, and shut the server down gracefully. I popped the two new drives in their caddies into the box and powered it up. SCSI drives take a while to start and you have to wait for each drive to spin up in sequence and get verified. All six spin up, then the RAID controller announces "No Logical Drives".
What The Actual Fuck?
I powered it off and removed the new drives. Power on. Same message.
Power off. Reseat the four drives. Power up. Nope.
The array is gone. Called a mate who worked in a fully Compaq data centre and he and his colleagues simply could not believe it, but there it was.
So that's 25 fee-earning accountants unable to process any billable hours until the server is back. I presented the facts to the owner, who was thankfully understanding, then took the box away, reinstalled the OS and started the restore from backup. The restore took hours and was the most nerve-wracking experience of my life, but boy was I relieved when it restarted and booted up to the domain admin login.
I put the new drives back in and they worked. No idea to this day what went wrong. I can only assume a firmware bug.
Full report to the client & they claimed lost production time on their insurance, so a happy ending.
HurryAcceptable9242@reddit
I'm having a PTSD moment reading this story ... Post Traumatic SCSI Disorder.
Seriously, that brought back some bad memories of some very stressful times.
New-Assumption-3106@reddit (OP)
I wasn't expanding the array. I was adding a new mirrored pair as a new volume
HurryAcceptable9242@reddit
Roger that. I was thinking "add to" in terms of messing with a machine that had what was apparently a stable array. I learned through hard experiences that it would be best to not touch a stable machine. And living in dread of having to actually do a recovery like that. My hat's off to you sir.
MrDolomite@reddit
There is a reason that SCSI is a four letter word.
Fraerie@reddit
I swear SCSI was always black magic.
grauenwolf@reddit
Seems like it was intentional from the start.
KnottaBiggins@reddit
Probably why they came up with something for RS-232 to replace the words "Radio Shack" - that erstwhile serial interface was originally designed for the TRS-80.
EruditeLegume@reddit
TIL - Thanks! :)
grauenwolf@reddit
That's really cool.
dustojnikhummer@reddit
FUCK SCSI?
zvekl@reddit
LUN always made me giggle. LUN sounds like lun-pa (sorry), a Taiwanese slang term for ball sack, and my god I hated and loved my Quantum 105MB SCSI drive.
KnottaBiggins@reddit
Nah, too scuzzy.
johnklos@reddit
This isn't SCSI hell... SCSI works well. It's RAID controller hell.
GeePee29@reddit
Old story. Mid 90s. I had to change all the drives in a SCSI array for larger ones. Saturday job. Me and another guy.
So check the backups are good. We power down and change out all the drives.
Power up and all looks good. We start the raid build process. All looking good.
We know it is going to take a while, so a leisurely walk down to the local shops to buy some lunch for later on. Leisurely walk back. We've been gone about 45 mins. Check the RAID initialization process. 3% complete!!! What!!!! 3%
We stick around and wait till it ticks over to 4% and then calculate it is not going to finish for about 18 hours.
So we go home and then have to come back on the Sunday to finish the job. Which we did and it took ages. I logged about 18 hours over that weekend.
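For anyone curious, that kind of estimate is just linear extrapolation from the rebuild's progress rate. A minimal Python sketch of the same arithmetic, using illustrative numbers rather than the controller's actual figures:

    from datetime import timedelta

    def estimate_remaining(pct_done: float, elapsed_minutes: float) -> timedelta:
        """Naive linear ETA: assumes the rebuild keeps the pace seen so far."""
        if pct_done <= 0:
            raise ValueError("need some progress to extrapolate from")
        total_minutes = elapsed_minutes * 100.0 / pct_done
        return timedelta(minutes=total_minutes - elapsed_minutes)

    # Roughly the situation above: ~3% done after ~45 minutes of initialization.
    # Timing the 3%->4% tick and extrapolating from that rate is how you end up
    # with a "come back tomorrow" figure in the high teens of hours.
    print(estimate_remaining(3.0, 45.0))  # roughly a day left at that pace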
Two days later the array crashed and took everything with it. Firmware. Kinda my fault. I should have checked this in advance and flashed it. The other guy was no help at all despite being more experienced than me.
And then, of course, the finance manager whinges about the overtime cost.
Not my only SCSI horror story but definitely the worst.
SuDragon2k3@reddit
It appears you forgot the appropriate sacrifices and weren't using the correct runes, written in blood, on the inside of the case.
phazedout1971@reddit
I've never worked in a server room I didn't leave blood in, those mounting rails are sharp
williamconley@reddit
We have found (my GF and I), over 20 years, that her blood sacrifice will work. If she is the one who works on the down server/workstation/system/whatever and she slips and cuts herself, we won't open that machine again for a decade. Then she'll see the bloodstain and remember. Especially since in most of those cases I had tried and failed to get the system online, then had to talk to the client on the phone (so she jumped in while I was on the phone and I didn't DARE to try to insert myself back into the equation).
b1ackfa1c0n@reddit
My dad worked on the credit card imprinter and server for Bank of America in the mid 80's. He swore that it never worked right unless a tech cut themselves on a sharp piece of metal once a month or so.
Side note, there was some serious security around that. You had to walk through a double door man trap and the guards openly carried shotguns in the building.
phazedout1971@reddit
We used to refer to those servers as the "Compaq Unreliant". Had 3 out of 4 with random hardware failures, including a failed motherboard out of the box.
djtodd242@reddit
We had our server equipment turned off for the roll over on Y2K. Shut it all down on Dec 31, and went back on Jan 2 to turn it all back on.
Had a couple of drives that wouldn't spin up anymore. I actually had to use the "fix" we had for drives developing Seagate disease in the late 80s to early 90s: one person holds the drive, and as the power is turned on you jerk the drive horizontally once, and the jolt is enough to give an old drive motor a small push start.
I got the data off those drives before I did anything else.
SeanBZA@reddit
Ever freeze a Seagate to get that last clone off it before it warmed up and stopped responding? Took it in for RMA, and they deliberately kept the office at around 12C, with a receptionist wearing a jersey on a 30C-plus day. First run of Seatools it passed, so I asked them to do it again and closed the clamshell around the drive. It hit 35C and slowed to a crawl, at 40C it stopped reading sectors, then stopped responding to the bus, and I got a new drive. A whole 2GB of storage, so I cloned the original 700MB partition back on, restarted Netware, and kept the new drive as a bootable spare to restore onto. Then later on, as admin, I used all that extra space and made a second volume that I could use.
Backup is easy if you can copy the needed directories over, so you only have a 10-minute logout window at lunchtime and can do the moving to tape at leisure.
RamblingReflections@reddit
Wow, memory unlocked. I thought my first IT boss was taking the piss the first time he told me to take a drive, put it in a snap-lock bag, throw it in the freezer for a few hours, and then immediately try to get it to spin up and pull the data off it before it warmed up.
Once he explained that the cold minutely shrinks the motor, potentially reducing whatever was causing the friction that stopped the disks spinning up (long enough for data recovery, anyway), I was a bit less sceptical. I was worried about moisture damage, but he assured me the drive was dead anyway and this was a last-ditch effort.
In that era it worked about half the time I tried it. Only once attempted it for a server, and it actually worked long enough to save us a whole weekend of restoring from backup tapes, thank god. Geeze, this would have been 20-odd years ago!
RedFive1976@reddit
I think I've gotten the freezer trick to work a couple of times, somewhat less than 50% for me. I once resurrected a RAID5 array for data recovery by replacing the controller board on the failed drive. Found an identical SCSI drive on eBay, bought it and swapped the circuit board over. Worked well enough to get all the data off.
New-Assumption-3106@reddit (OP)
I've done this multiple times with around a 50% success rate. Freeze the drive for a few hours then if it spins grab the data
djtodd242@reddit
I have heard of this, but I've never experienced it "in the field".
lucky_ducker@reddit
Late 2000s my company relied on a single HP server with two SCSI drives in RAID 1. That server ate SCSI drives for breakfast; no joke, we had a drive fail about every 18 months. It was so regular that I would put reminders on my calendar to be ready to come in early and swap in a new drive.
At one juncture we had trouble tracking down an identical spare; when we finally did find a source we bought four of them. Yes, over time we used them all up.
In 2018 we retired that server and virtualized (I know, late to the party, this was a non-profit). We bought a ProLiant server with 16 hot-swappable drive bays and set up RAID 10. We also bought two hot spares. When I retired in 2024 the two spares were still in their boxes.
wild_dog@reddit
I'm gonna nitpick here: if the spares were still in the boxes, they are cold spares. Hot spares are already installed and powered but not in use, so that upon failure/degradation the array can automatically and instantly start rebuilding onto the hot spare.
Gadgetman_1@reddit
A non-hotpluggable drive bay in a 'no downtime' environment?
Someone must have been a bit too fond of counting paperclips...
My organisation didn't have any 24/7 requirements on our administrative network back then, but yeah, we had hot-plug drives all the way. The only exception ever was the HP ML110 servers we got years later for temp office work. Hot-plug just meant that our job was so much easier. No shutdowns and restarts, no messing with SCSI IDs or anything.
Can't remember how it was back then, if the RAID config was stored in the controller or what, but I suspect the controller. And I bet there was a small battery on it... that probably died a couple of years earlier...
More recent HP/Compaq RAID controllers don't hold this; it's all stored on the drives themselves.
Some still have batteries, but that's only for the write-back cache.
Never had an issue with firmware when adding disks, but I always upgraded them to match the rest of the same model. And I only used drives from the OEM. Sure, the disk may have been a Toshiba or WD or whatever, but if it was going into a RAID, we only bought OEM drives with their custom firmware.
Dansiman@reddit
No production downtime.
Strazdas1@reddit
Probably set up ages ago, before the company expanded into having no downtime. When it's you and one accountant friend it may technically be "no downtime", but in practice there are plenty of times you can take it down. When it's 25+ people on the same system it's a bit different.
SeanBZA@reddit
Non-profit, so likely the server was a donation as well, well used.
12stringPlayer@reddit
I used to keep a small rubber chicken in my toolbox to wave over a particular server that would be very unhappy any time you had to do anything with the SCSI chain.
When we had to move that server to a new data center one day, my PFY couldn't find the terminator, though he insisted he'd packed it. He scrambled to find one nearby (and succeeded), and once we started using the new terminator, we never had another problem with that system again.
Things were weird in the olden days.
New-Assumption-3106@reddit (OP)
They sure were. Trying to jumper 4 IDE drives to work together...
RamblingReflections@reddit
Wow, setting jumpers on hard drives and motherboards. That's taking me back. I was talking to a younger tech the other day who was having an issue getting a new build to recognise 2 drives, and I was explaining how we used to have to use jumpers to set master and slave drives, and they had to be in the correct position on the IDE cable too, or things wouldn't boot. It was one of those basic things that you naturally checked first when someone would say, "I just replaced the secondary drive on my PC and now it won't boot". 9/10 times the replacement drive was set to master (or had no jumper at all because it was assumed it would be on a single IDE cable, standard for off-the-shelf machines), and the most time-consuming part of the job was locating the little tweezers I used to move or place the jumper.
I tried explaining motherboard jumpers to adjust the IRQ as well, and his eyes kinda glazed over like mine used to when my grandpa started talking about the good old days. I felt old (grandpa old, and I'm not even the right gender for that!) and shut up, told him I was almost certain his issue wasn't related to jumpers but to check the cabling, and left him to it.
New-Assumption-3106@reddit (OP)
Tweezers!
I had some delicate needle-nose pliers just for that task
Fuck, I'm old
Dansiman@reddit
I always just used my fingernails. ¯\_(ツ)_/¯
dazcon5@reddit
Then running QEMM a couple dozen times to get all the drivers to load and run properly.
TMQMO@reddit
Even better if the sound card and modem were on the same card. (Thank you, Packard Bell.)
SeanBZA@reddit
Motherboard with integrated serial and parallel, and they only offered 2 IO port locations and 2 IRQs, and you needed to have either none or both, with a serial mouse. Then try to put in an HGC card with a built-in parallel port as well.
Jonathan_the_Nerd@reddit
Suppressed memory unlocked. You owe me for another therapy session.
SeanBZA@reddit
Going to bet you had a passive terminator as the old one, and the newer one was active. Have had that before: took the passive one, with its nearly 1A power draw, replaced it with an active one, and the termination current went down, plus the bus error rate dropped to zero. All that SCSI chain did was drive an HP ScanJet, a Zip drive and an Arcus scanner. Both the HP and the Arcus loved to throw lamp-fail warnings, but on the Arcus replacing the lamp was cheap; even buying them direct from Arcus was $10 each, unlike HP, where you spent $200 on a service, plus 2 weeks in transit, for them to replace the entire scan platform, as the lamp is glued into it. On the Arcus it was 10 minutes of work, plus cleaning the underside glass, and clipping in a standard Philips lamp, which you could get over the counter for $2, though the Arcus versions, despite being made by Philips, did last so much longer, probably because they were all non-Alto types and so had the full 50mg of mercury dosed in them.
desertwanderrr@reddit
I feared anything to do with early SCSI! I once had to pull an all-nighter restoring from 8-inch floppy disks for a client; that's how old I am. It was a SCSI failure that prompted the work.
Caithus63@reddit
Only thing worse was IBM Microchannel SCSI
djfdhigkgfIaruflg@reddit
I've heard legends about those
Caithus63@reddit
More like horror stories
Honest_Day_3244@reddit
I'd hear that story
dazed63@reddit
I can confirm that.
4me2knowit@reddit
As someone who once moved a database on 150 8inch floppies I share your pain
NightGod@reddit
My first job we did our backups on 8-inch floppies, and our redundancy was that one person would take home a huge laundry bag of the previous week's floppies in cartridges, while the current week's stayed in the server room next to our System/36.
The real fun was the time a hard drive crashed in the S/36. IBM replaced the drive and, after we did the restore, we learned that our backups did NOT include user accounts, so our manager got stuck inputting them all manually. Fortunately, it was a pretty small company, ~200ish users, so it wasn't too horrible for her.
They were definitely added to the backup process after that, tho
deaxes@reddit
How was the termination setup? The little I know of SCSI comes from growing up in a Mac household. One thing that was drilled into me was that you needed to terminate the last drive and only the last drive.
pawwoll@reddit
Who calls himself New-Assumption-3106????
BOT
New-Assumption-3106@reddit (OP)
Nope, just a randomly assigned username done by Reddit
pawwoll@reddit
hmm oki
but u still sus
ThunderDwn@reddit
You forgot to sacrifice the goats.
Or didn't sacrifice sufficient goats.
SCSI requires blood sacrifice to work. Goats are preferred, but you could have cut off your own finger and used that, although I always found that to be a little excessive myself; I've known guys who swear it works, though.
GolfballDM@reddit
During a time of frustration with a project I was working on many years ago, I remember filing a PO with the project head for a pair of sacrificial goats and a sacrificial virgin.
jobblejosh@reddit
Or you sacrificed too many goats.
Whatever it was, it wasn't the right number. Which will be different to every previous amount, and there shall be no way of predicting the right amount.
A good technician knows the amount of goats to sacrifice based solely on the vibes of the job.
RamblingReflections@reddit
The colour of their coats is important too. To cover your bases thoroughly, the more colours the goat's coat has, the better the chances that your offering would be deemed correct. None of those plain, single-coloured coats for SCSI. Not if you value your data!!!
freddyboomboom67@reddit
I didn't see you mention your sacrifice.
From SCSI Is Not Magic:
"SCSI is *NOT* magic. There are *fundamental technical reasons* why you have to sacrifice a young goat to your SCSI chain every now and then."
Geminii27@reddit
I was so glad I never had to work on SCSI gear at the time. I heard so many horror stories.
SeanBZA@reddit
Still have a SCSI card, though I think it will be in the next e-waste lot.
georgiomoorlord@reddit
At least you weren't sacked for it. Boss seemed to understand it was an old box anyway
XkF21WNJ@reddit
Sacked? They did nothing wrong! They restored the server from backup in a few hours, that's exemplary.
Ahayzo@reddit
Since when does that protect us lol
Strazdas1@reddit
Yeah, that never stopped manglement.
Jofarin@reddit
Maybe there's an obvious answer, but why wasn't this scheduled right after production ended?
New-Assumption-3106@reddit (OP)
I was younger & less wise
Neue_Ziel@reddit
I was in the Navy, minding a network of confidential information, and the servers all had Raid 5 and tape backups.
It was a PM to swap tapes every other day. Everything was cool until the calls came in that the tag-out software wasn't allowing people to log in, or that documents couldn't be accessed.
Turns out the electricians had tagged out the cooling system for the server room, and all the SCSI drives shat the bed on overtemp.
It was an oven in that room, and I tracked down the electricians to remove the tag under orders from my division officer.
I pulled all the drives we had as spares, and all the ones that supply had in stock and began to restore from tape backups for the rest of the day. 6 servers, 8 drives apiece.
That sucked.
The tag-out book had warnings added about not taking out the cooling system, to prevent this issue happening again.
pocketpc_@reddit
every LOTO procedure I've ever used or written has "check with the owner of the system" as step 1 for a reason lol
dragzo0o0@reddit
Had exactly that issue - probably around the same time or a little earlier. Was the array controller card. Luckily for us, I'd done the work on a Friday night and HP were able to get a replacement for us over the weekend. Was a bit of downtime Monday morning but only about 30 minutes for the site.
coyote_den@reddit
System Can't See It
saracor@reddit
Did you bring the chicken? I mean, SCSI demanded sacrifices back in the day. It was magic and no one could understand it.
I remember dealing with many of those Compaq systems and their SCSI setups. Each system seemed to have its own way of doing things.
Don't get me started on the DEC Alphas either. I am so glad we're well beyond all this now.
FuglyLookingGuy@reddit
That's why whenever an upgrade needed to be made, I always scheduled it to start at 6pm Friday.
That gives you the whole weekend to regret your life choices.
MartyFarrell@reddit
I preferred Saturday morning after a full backup :-)
New-Assumption-3106@reddit (OP)
Yep. This incident helped me learn that lesson
nspitzer@reddit
Bet you didn't run the Compaq firmware upgrade ROMPaq first to ensure the SCSI card firmware was up to date, as well as ensuring the SCSI drivers were up to date. It's been a few years, but Compaq was notorious for out-of-date firmware on the SCSI cards causing issues.
New-Assumption-3106@reddit (OP)
I bet you're right!
greenonetwo@reddit
So you couldn't re-create the array but not initialize it? That is what I did on some HP gear around then. Got the array back with all the data.
Immortal_Tuttle@reddit
My first thought. Did exactly that in 2006, after someone decided to shut down the main machine without checking that the hot spare had actually become a spare-part donor at that stage. He then proceeded to remove the drives. Luckily he was stacking them one on top of another, so it was a matter of LOFI (last out, first in) to put them back together.
tmofee@reddit
For years my father had an old Xeon server that was running Server 2000. One time he turned the machine off to move it to another room and the power supply thought "I need a bit of a rest now". I managed to migrate our domain to 365 email (and thank Christ! Some of the spam we used to cop) and used the box for QuickBooks until the hard discs finally died. Dad was afraid of any power issues after that.
asp174@reddit
That RAID 5 with 4 drives was probably slow as hell. Controllers didn't have that much cache, and array sizes other than 3, 5, 9, 17, etc. had a hefty read-before-write penalty.
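The read-before-write pain is the classic RAID 5 small-write penalty: updating a single block means reading the old data and the old parity, recomputing parity with XOR, and writing both back, so roughly four disk I/Os per logical write unless the controller can cache and coalesce a full stripe. A minimal sketch of the parity maths, with block sizes and helper names that are illustrative only:

    def xor_blocks(a: bytes, b: bytes) -> bytes:
        """Byte-wise XOR of two equal-length blocks."""
        return bytes(x ^ y for x, y in zip(a, b))

    def raid5_small_write(old_data: bytes, old_parity: bytes, new_data: bytes):
        """Read-modify-write for a single-block update on RAID 5.

        Two reads (old data, old parity) plus two writes (new data, new parity):
        about 4 disk I/Os for one logical write, versus 1 on a plain mirror.
        """
        # new_parity = old_parity XOR old_data XOR new_data
        new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
        return new_data, new_parity

    # Tiny demo with 4-byte "blocks":
    d_old, p_old, d_new = b"\x01\x02\x03\x04", b"\xff\x00\xff\x00", b"\x11\x22\x33\x44"
    _, p_new = raid5_small_write(d_old, p_old, d_new)
    print(p_new.hex())  # updated parity, without touching the other data disks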
peterdeg@reddit
Damn SCSI. Had a whole lot of IBM Server 500s I was upgrading.
Step one was to clone SCSI disk 0 to a new disk set as SCSI ID 6.
0 was the primary disk.
One damn machine had a different version of firmware. With it, 6 was the primary.
So, at 2am, I cloned the new blank disk onto the existing one. That was a long night.