Don't trust the brochure. Or the manual. Or anything really.

Posted by SuperTechnoDunce@reddit | talesfromtechsupport | 53 comments

(Or: I discover why the BOFH hates engineers)

I was reminded recently of an elusive problem I'd tracked down in some of our nicer gear, as I started setting it up today for a new event.

In commercial AV, we have two important signal sources beyond just video itself: sync and timecode.

Sync is fairly self-explanatory; it is a signal dating back to the days of the Marconi-EMI television, which sets the refresh rate of your device. If you send the same sync signal to everything, it all refreshes at the same time - cameras, displays, switchers, et cetera - and you eliminate artifacts that you'd normally see when filming displays, as well as other nasty bits like screen tearing and rolling when switching sources during a live event.

Timecode, conversely, is a clock signal embedded in the recording itself storing an exact time, divided by hours/minutes/seconds/frames (well, fields if we're being pedantic, but that's beside the point). It is used by editors in post-production to line up all the various audio and video sources - a modern substitute for the classic slate clap (which is still used as a backup by most large productions).

When sync or timecode go missing (or have any kind of problem, really) people pull their hair out. Usually not me - I'm too busy setting my pants on fire and running around trying to fix the issue. What follows is a tale of one such issue...

The control room we use for our primary productions is a pretty nice system - some of the gear is temperamental on startup, admittedly, but once it's up and running it's set. One of the pieces of that control room is a set of external recording boxes - these are our primary record source, with a backup recorder in case of failure.

Except... On day two of a major event, we discovered a small problem. The timecode didn't match between the units. Which, of course, meant that the video for each camera had to be lined up manually before editing. And because it was a recorded live event, we didn't have the option to do a slate clap before each recording.

Now - if the offset between the two boxes had been consistent, we simply could have measured it, and then informed the editors of the offset. Suddenly our issue would become a minor nuisance instead of a major problem requiring hours of extra work to manually align footage. But the offset was anything but consistent; sometimes it was three frames, sometimes five, sometimes ten.
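For anyone wondering what "measuring the offset" looks like in practice, here is a rough sketch. It assumes 30 fps non-drop-frame timecode in HH:MM:SS:FF form (the post doesn't state the frame rate), and the function names and sample readouts are made up for illustration.

```python
FPS = 30  # assumed frame rate (non-drop-frame); not stated in the post


def timecode_to_frames(tc: str, fps: int = FPS) -> int:
    """Convert 'HH:MM:SS:FF' into an absolute frame count."""
    hours, minutes, seconds, frames = (int(part) for part in tc.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * fps + frames


def offsets(pairs, fps: int = FPS):
    """Frame offset (second unit minus first unit) for each pair of readouts."""
    return [timecode_to_frames(b, fps) - timecode_to_frames(a, fps) for a, b in pairs]


def constant_offset(pairs, fps: int = FPS):
    """Return the single offset if every pair agrees, otherwise None."""
    seen = set(offsets(pairs, fps))
    return seen.pop() if len(seen) == 1 else None


# Hypothetical readouts photographed off the two recorders at different moments.
steady = [("10:00:00:00", "10:00:00:03"), ("10:05:00:10", "10:05:00:13")]
drifting = [("10:00:00:00", "10:00:00:03"), ("10:05:00:10", "10:05:00:15")]

print(constant_offset(steady))    # one number the editors can apply everywhere
print(constant_offset(drifting))  # None: every clip has to be aligned by hand
```

With a steady offset the editors get a single correction to apply to every clip. When the function returns None, you are in the situation described above.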

I and the other techs working on the event were stumped. We'd confirmed both units were getting timecode. Signal paths were properly terminated or left unterminated as required. Oscilloscope readings of the sync and timecode signals looked good. But what about the units themselves? In a moment of desperation I took a high-shutter-speed picture of the two displays, each showing their timecode. And the readouts didn't match... what the hell?

Restarting the units fixed the problem. Lovely. That fix lasted about 24 hours... and the recordings were once again out of sync, and our chief editor would still have been pulling his hair out if he had any in the first place.

W. T. F.

I took another long look at our signal path for timecode the next day. The unit-to-unit latency made zero sense. The timecode passed from the generator directly to the first unit, and then was looped through to the seco... Oh wait.

Fuck.

Anyone who knows anything about hardware design knows that a loop-through is a physical piece of copper, and that the device providing the loop-through simply copies the signal with some sort of high-impedance repeater (i.e. an op-amp). Everybody knows that, especially engineers who design this sort of gear... right?

Apparently engineers who design this sort of gear do not understand basic electronics principles or the concept of redundancy. The "loop-through" turned out to be a software repeater, which added random amounts of delay. Not only that, but thanks to being a software repeater it doesn't function if the unit dies - meaning that if that unit craters, anything downstream of it loses timecode as well.

Aaaaaaaaaaallllll because some idiot engineer didn't understand why op-amps were invented in the first place, or the basics of RS232 or any other bus-based signal for that matter or... You get the picture.

The problem was summarily fixed after a short period of finagling with our rack's cable salad, rewiring the second recorder box directly to our timecode generator instead of the first unit's not-a-loop-through output.

If I were a less forgiving man, I'd be booking a meeting with that engineer in my archives room, and rewiring the halon hold switch...

[-]

NumerableElk@reddit

The title is like my life's mantra.

[-]

concordchris@reddit

Oh gods… it’s been many, many (so many…) years since I’ve seen a reference to “BOFH”! I used to follow that religiously… he was so spot-on about so many things! Thanks for reviving the memories!!!

[-]

jamoche_2@reddit

This sounds like an infinitely more benign variant of the Therac-25 (radiation therapy machine) bug. Prior versions had hardware means of blocking the condition that would cause over-radiation. Someone in the design process thought software blocks would work just fine.

They didn't. People died.

[-]

Top_Box_8952@reddit

Software blocks never work. You need physical lockouts and mechanical fail-safe systems.

Because some bean cruncher will override the software.

[-]

JeffTheNth@reddit

wasn't that due to the technicians doing things faster than the engineers testing and causing the radiation to be unchecked or something ... When testing, it was fine because the people testing weren't doing it every day. When in use by the actual technicians, zappo...?

[-]

SuperTechnoDunce@reddit (OP)

It was actually a rollover error - when an 8-bit counter (incremented every time the machine was used) rolled over to zero, the machine wouldn't attempt its safety checks at all.

[-]

db48x@reddit

There were several related problems with the Therac-25.

The first was that if you filled out all of the fields it would set a flag to indicate that you had entered all of the data, then start rotating the machine to bring the equipment into the correct alignment for the treatment you had entered, which took about 8 seconds. If, during those 8 seconds, you used the arrow keys to go back to a previous field and change it then the new data would not be used to reconfigure the machine because the flag was already set. But the new data would be used to display the treatment plan to the operator for confirmation. And the operator had no direct view of the machine because they were in a separate room behind radiation shielding. So there was no way to notice that the machine was in the wrong configuration for the treatment displayed on the monitor.
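A toy sketch of that first problem, as I understand it from the description above. This is not the real Therac-25 code; the class, attribute names and plan labels are all hypothetical.

```python
class Console:
    """Toy model: the screen and the machine are fed from the same data entry."""

    def __init__(self):
        self.displayed = None        # what the operator sees and confirms
        self.machine_config = None   # what the hardware is actually set up for
        self.entry_complete = False  # the "all fields entered" flag

    def enter(self, plan):
        # The display always reflects the latest edit...
        self.displayed = plan
        # ...but the machine is only configured while the flag is still clear.
        if not self.entry_complete:
            self.entry_complete = True
            self.machine_config = plan


console = Console()
console.enter("plan A")  # first complete entry: machine starts moving
console.enter("plan B")  # quick correction while the machine is still moving

print(console.displayed)       # plan B
print(console.machine_config)  # plan A
```

The operator confirms what is on screen, while the hardware is still set up for the earlier entry. Clearing the flag on any edit (or reconfiguring from the data actually confirmed) would close the gap in this toy version.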

The other issue is a bit more difficult to understand; the descriptions of it are always rather brief and I think that details have been lost in the retelling. There was a variable that was used both as a frame counter and as an error code. When setting up a treatment there is a step where the operator must manually adjust the aim of the beam and the position of various attenuating plates to match alignment marks on the patient’s body (just drawn on with a marker). The machine shines a simple light through the beam path so that it illuminates the patient in exactly the spot where the radiation beam will be, so this is easy for a human and hard for a machine. Once the beam is positioned the operator can go back to the other room and press a button to continue the treatment.

Anyway, while this is happening the display is being constantly updated with information read from sensors and the frame counter is being incremented. But when the button is pressed that same variable is read as an error code. If it is zero then there is no error, but any other value indicates some error condition. (This same convention is super common in code written in C today, like the Linux kernel.) Most of the time the frame counter, which was indeed an 8-bit value, was positive. This was read as an error, and the way to fix the error was to automatically reinitialize certain variables and then move the machine into the correct configuration. The counter would be reset to zero as part of this process, and everything would work fine.

But if the counter was zero, then all of that was skipped. After all, there was no error to correct for, so everything must already be in place. Instead of firing the x-rays at low power and through a series of filters and beam spreaders the x-ray beam would strike the patient directly, at full power.
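Here is a minimal sketch of the shared-variable problem as described above, assuming an 8-bit counter that also serves as the "something needs fixing" flag. Again, this illustrates the description rather than the actual machine code, and the function names are made up.

```python
def tick(counter: int) -> int:
    """Advance the 8-bit counter; 255 wraps around to 0."""
    return (counter + 1) & 0xFF


def setup_runs(counter: int) -> bool:
    """The same variable read as a flag: nonzero means 'reinitialize and reposition'."""
    return counter != 0


def skipped_presses(ticks: int):
    """Tick numbers at which a button press would skip the setup path entirely."""
    counter = 0
    skipped = []
    for n in range(1, ticks + 1):
        counter = tick(counter)
        if not setup_runs(counter):
            skipped.append(n)
    return skipped


print(skipped_presses(600))  # [256, 512]
```

In this model one press in every 256 lands on a zero and skips the setup path, which is why it would rarely show up in testing and then eventually show up in daily use.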

[-]

JeffTheNth@reddit

that's not very many in the grand scheme... especially for something used so much.
Modern machines and only using an 8-bit counter? The programmer should be shot.... From a cannon.... Into the sun...
(And if that was a physical counter, shame on them!)

[-]

jamoche_2@reddit

That too. It had two modes, high radiation with a plate, lower without the plate. If someone very skilled at it set it high by mistake and went back to fix it quickly, it didn't actually get fixed, but the plate didn't get put into position. The hardware check would've detected that.

There were a lot of things going wrong.

[-]

henke37@reddit

That was a race condition. This is a failure to maintain realtime performance.

[-]

jamoche_2@reddit

The race condition was only one of the problems. The other was thinking software was sufficient when it needed to be using hardware.

[-]

SuperTechnoDunce@reddit (OP)

You make a good point. The Therac-25 is a great example of OBOEs as well, but for some reason it never clicked that it ultimately should have just been a hardware failsafe.

[-]

acidblue811@reddit

This is why apprenticeships should be a thing for all engineering fields. Not an electrical or electronics engineer, but I've lost count of fresh grads making the same mistake over and over, only realizing why it wouldn't work when you point it out.

[-]

techforallseasons@reddit

Did you label ( OR REMOVE ) the BNC to inform future users of this design fault?

I would consider making the "loop-thru" unserviceable and adding a sharpie-label so the mistake doesn't cause pain to my future self.

[-]

marshogas@reddit

Young engineer without an older mentor. 'Hey, I found a way to do it simpler than that old stupid way'. And no one tells them why you do it the old way.

[-]

SuperTechnoDunce@reddit (OP)

How DARE you try and guilt trip me out of being angry1!!!!1!11!

In all seriousness, you make a good point. Looking back, my technical education consisted of a two-year "crash course" in everything from networking to sysadmin to embedded systems to hardware design and repair.

Things largely worked out fine for me, but I had years of experience in the field as a hobbyist before that point. A number of classmates could produce functional results, but the spaghetti under the hood always left a lot to be desired - there was no time for the instructors to teach the concept of first-principles design, or the idea of planning before sitting down and writing code or designing a PCB.

Thinking about your average fresh graduate from that program getting their hands on a design project like this... Well, it suddenly makes a lot more sense.

[-]

semboflorin@reddit

Here's a little bit to drive the nail in just a little further. Often times the one to blame isn't even the engineer who did the thing. Even if it was their own idea and implementation.

Company managers and directors have to keep a lean staff these days in order to maximize their company profit. Not only are they told to do so, they are encouraged with bonuses and other perks. So they get rid of that old expensive engineer that has been making them money and bring in a fresh recruit desperate for a job in this market for a lower pay. They may get some training from the old engineer on their way out but that's unlikely as that old engineer is pretty disgruntled at this point and can see the writing on the wall.

[-]

marshogas@reddit

You should still be mad. I would be talking to them about this as a major defect that renders their products disastrous to use. If they are able to provide an update, you will expect them to replace yours. If they cannot fix it, then it is your responsibility to inform the industry of this major shortcoming before it costs your industry more issues.

Understanding the failure of the individual doesn't excuse the failure of the company. They should have had supervisors or processes to prevent this. And because they didn't, how can you be sure that they won't do this with all their products or introduce new faults into their other models?

[-]

APiousCultist@reddit

Young engineer without an older mentor.

Ah yes, a ronin-gineer.

[-]

Rathmun@reddit

A wonderful example of why Chesterton's Fence isn't actually a fallacy. People like to call it an Appeal to Tradition when it's actually an admonition against Appeal to Neologism.

Some traditions are stupid, but assuming every tradition is stupid, is stupid.

[-]

grendus@reddit

Anyone who says Chesterton's Fence is a fallacy should not be listened to.

Chesterton's Fence simply says that you should know why the fence was built before you tear it down.

[-]

Rathmun@reddit

Exactly. Or at least that you should know if it still serves a purpose and what that purpose is. Sometimes the original purpose is lost to history, but a new purpose may still be valid.

[-]

JeffTheNth@reddit

...why reinvent the wheel, and then find out years later why they had a hole in the middle...

[-]

insomniacwhirlwind@reddit

“Young engineer without an older mentor”

Why did I just get a cold tremor slowly up my spine in a corkscrew fashion? October is the time for horror stories like this sentence….

[-]

thekr0w3@reddit

I’m a younger A/V technician, and I’m glad this post made full sense to me! A couple of years ago I would’ve been thrown for a loop with all these terms, but I was able to follow along and even thought of the same steps in roughly the same order.

I wonder if I can do the same thing for our system, setting their timecodes to sync correctly to cut down on editing time trying to align clips

[-]

SuperTechnoDunce@reddit (OP)

There's no reason you can't, assuming you've got a reasonably decent distribution network. Timecode can be sent as audio through a standard balanced XLR cable, or as video over 75 ohm cable. Add a few throwdown SDI-to-fiber converters and you can get it pretty much anywhere.

One thing to keep in mind is that (as far as I'm aware) there's no easy way to keep timecode synced with wireless cameras. Your best bet there is to freerun them - keep a spare timecode connection handy, and keep the cameras plugged in until just before the shoot. Then unplug them and hope their internal clocks are accurate enough that they only drift a frame or two.

It's entirely possible they make wireless timecode transmitters - I just don't have anything that up-to-date. Most of my tech still exclusively uses blackburst instead of trilevel sync, for example.

[-]

abz_eng@reddit

Have you looked into gPTP?

IEEE 802.1AS is an adaptation of PTP, called gPTP, for use with Audio Video Bridging

Jeff Geerling goes into it - which is where I learnt about it

The timing industry has many solutions for 'grandmaster' clocks, which take in highly accurate time from GPS, GNSS, or other atomic-clock-backed time sources, and distribute it to local networks with extreme precision—down to the nanosecond range—using PTP.

[-]

SuperTechnoDunce@reddit (OP)

I have. However, it's not a standard within commercial A/V as far as AV over coax or fibre is typically concerned.

It's commonly used to ensure accurate clocking in low-latency AVoIP networks like Dante and SMPTE 2110, but not to distribute timecode - it doesn't matter what time the timecode shows as long as it's consistent across all recorders.

[-]

Mr_ToDo@reddit

a modern substitute for the classic slate clap

That's what that's for? That... makes so much sense

[-]

pygmymetal@reddit

Ah, BOFH, that’s a name I’ve not heard in a long long time…

[-]

GreenEggPage@reddit

You can still read him on The Register.

[-]

NaoPb@reddit

Interesting read. May I ask why one would prefer "daisy-chaining" it from device to device instead of wiring up every device to a central point? Mind you, I know nothing about this stuff.

[-]

SuperTechnoDunce@reddit (OP)

Looping signals through devices can simplify your wiring (and therefore reduce cable salad and/or the need for cable management) pretty drastically. They can also save you from having to drop in a dedicated distribution amp, which preserves rack space.

Another good example of this would be getting signals between rooms or venues. You only have so many connectors on your patch box, so you may only be able to bring one copy of a signal through. If it's a location inside the venue, you may not even have the option of dropping in a DA if you don't have one without fans (too much noise mid-performance).

[-]

NaoPb@reddit

Ah, that makes sense. Thanks for taking the time to explain this to me.

[-]

MrArges@reddit

This is why I hate the idea of new requirements being tickets to change a product. Products should have a document that lists all the known requirements, learn something over the life that is a new required feature, add it to the doc. Make the ticket to reflect that that product does not do what the doc now requires.

Passthrough must have latency less than x and be resilient to failure of rest of device.

[-]

talexbatreddit@reddit

Good for you for drilling down to the reason for the problem.

And whatever dough-brained engineer thought, "Oh, I can do that in software!" should be given a bucket and told to go collect a bucket of steam. Oof.

[-]

J_Landers@reddit

Don't worry. They went to work in Boeing administration.

[-]

tgrantt@reddit

Oof

[-]

vaildin@reddit

Engineers should spend their entire afterlife using, repairing, maintaining and otherwise dealing with whatever it was they engineered.

Then it's up to them whether they go to heaven or hell.

[-]

WhenSummerIsGone@reddit

As a software engineer, my first reaction to your story is to blame management or the business, not engineering. Too many times I have said "if we do it like this it will cause problem X" and they told me to do it anyway, and yes it caused problem X.

I imagine someone thought it would be cheaper to do this in software instead of hardware and make that much more profit on each unit.

[-]

Norkas-Aradel@reddit

Sure hope it's not Hyperdeck you're referring to...

[-]

SuperTechnoDunce@reddit (OP)

Hah, nope. BlackMagic hardware seems to be really hit-or-miss these days - if you need something in a hurry, buy two. If not, their customer support will take care of you, but their RMA process is not lightning fast.

[-]

Stryker_One@reddit

Wait a minute, why even bother with an op-amp? They are not zero-delay devices. Why not just a wire?

[-]

SuperTechnoDunce@reddit (OP)

I threw together a basic schematic to demonstrate.

If you don't use an op-amp or other high-impedance device to pull signal off the bus, you're reducing your signal's propagation distance and defeating the point of a bus - you have no guarantee that it will work when some of the bus devices are powered down.

Yes, they add some latency, but that's on the order of nanoseconds. It's inconsequential when you're timing things to 1/30th or 1/60th of a second.
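To put rough numbers on that, here is a quick back-of-the-envelope sketch. The 100 ns buffer delay is an assumed ballpark figure, not a measurement from any real part, and the helper names are made up.

```python
def delay_in_frames(delay_seconds: float, fps: float) -> float:
    """Express a delay as a fraction of one frame at the given frame rate."""
    return delay_seconds * fps


def frames_to_seconds(frames: float, fps: float) -> float:
    """How long a given number of frames lasts at the given frame rate."""
    return frames / fps


ASSUMED_BUFFER_DELAY = 100e-9  # 100 ns, assumed for illustration only

for fps in (30, 60):
    fraction = delay_in_frames(ASSUMED_BUFFER_DELAY, fps)
    print(f"{fps} fps: buffer delay is {fraction:.1e} of a frame")

# For comparison, the 3 to 10 frame offsets from the story at 30 fps:
for frames in (3, 5, 10):
    print(f"{frames} frames = {frames_to_seconds(frames, 30) * 1000:.0f} ms")
```

Under that assumption the analog buffer costs a few millionths of a frame, while the offsets from the story are on the order of a hundred milliseconds or more.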

[-]

asp174@reddit

Because you don't want a faulty device downstream to pull the entire bus down.

[-]

JaschaE@reddit

From a long time as tinkerer and bludgeoner of problems I am currently moving into professional tech support (sys-admin) and am trying to gather best practices from more experienced people, like you. As such: Do you find it a necessary/helpful first step to set your pants on fire? And is this practice covered by some clothing allowance on the job?

[-]

commentsrnice2@reddit

I believe the fire to be a natural byproduct of rapid acceleration when traveling between points of failure during the diagnostic stage

[-]

JaschaE@reddit

I used to be a medic. There are no "emergencies" in tech that warrant running, unless you need to leave the blast radius.

[-]

commentsrnice2@reddit

It could also have been that cheap warming plate for your coffee mug that someone stupidly placed a disposable coffee cup on 🤷🏻‍♂️

[-]

JaschaE@reddit

That is a technical malfunction that turned into a medical emergency. As a general rule, medics here don't run either, as that increases the likelihood of wrecking yourself and all your gear before reaching the patient.

[-]

SeanBZA@reddit

Which is why I have a 16-output video distribution amplifier, which actually had 4 separate 16-channel outputs, but you can parallel them to get 32 outputs off each input. Funny enough, they work down to DC, so my use was for 2 16-section blocks to be used for stereo audio, and then the other section, bridged, was used to give 5 outputs per section of video. Power amplifiers from Opamp Labs, each in an octal socket, and, despite them supposedly not working past 8 MHz, they gave good video, with a gain of 2 for the termination to work. I did have to fix one, replacing the output transistors and bias diodes, after it failed. What a pain to depot the hard epoxy, but I did have a bottle of epoxy solvent that did the job, after a week of soaking.

[-]

Harry_Smutter@reddit

As soon as you mentioned it went from the first to the second unit I knew there was a delay there, haha. I went, "why not just tag each unit off the generator?" And then you did. Some timings simply should not be software defined.

[-]

PKZsarcasticMirror@reddit

LOVE the BOFH reference!!! (I've been reading his stuff since 'The Striped Irregular Bucket' when he was 'down under' Many, Many years ago...) Still reading the new stuff on "The Register"