For a brief period, the Windows kernel tried to deal with gamma rays

ingve | 318 points

Invalidating the caches is kind of a cringe-inducing approach to this (actual) problem. Especially in HPC, radiation-related single-event upsets have become a real problem. If you do the math, all the silicon area devoted to memory (DRAM, caches, registers) adds up, and what you've got is essentially a particle detector.

Compared to the effective volume of a purpose-designed one (ATLAS, CMS, Super-Kamiokande, etc.) it is rather small, but a particle detector nevertheless.

A couple of months / years ago, there was an article (also linked here on HN, IIRC) that did a few back of the envelope calculations regarding expected event rates. IIRC it was something on the order of 1 event per day per 10^12 transistors. (EDIT: not the one I thought of but blows the same horn: http://energysfe.ufsc.br/slides/Paolo-Rech-260917.pdf )
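
(Illustrative sketch, not part of the original comment: the rule of thumb above can be turned into a quick estimate. The rate is the from-memory figure quoted above, so treat it as an order-of-magnitude number. `expected_upsets_per_day` and `dram_transistors` are hypothetical helper names, and one transistor per DRAM bit is a simplifying assumption.)

```python
# Rule of thumb quoted in the comment (recalled from memory there):
# roughly 1 upset per day per 1e12 transistors.
EVENTS_PER_DAY_PER_1E12_TRANSISTORS = 1.0


def expected_upsets_per_day(
    transistors: float,
    rate: float = EVENTS_PER_DAY_PER_1E12_TRANSISTORS,
) -> float:
    """Scale the per-1e12-transistor rate linearly to a given transistor count."""
    return transistors / 1e12 * rate


def dram_transistors(gib: float) -> float:
    """Assumption: one transistor per bit (1T1C cell), ignoring peripheral logic."""
    return gib * 2**30 * 8


if __name__ == "__main__":
    per_node = expected_upsets_per_day(dram_transistors(64))
    print(f"64 GiB node: about {per_node:.2f} upsets per day")
    print(f"1000 such nodes: about {per_node * 1000:.0f} upsets per day")
```

Under these assumptions a single 64 GiB node sees an upset every couple of days, and a large cluster sees hundreds per day, which is why HPC sites care.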

Also radiation hardened software has been researched (and still is). Essentially the idea is to not only have redundant, error correcting memory, but also redundant, error correcting computation. NASA has some publications on that. e.g. https://ti.arc.nasa.gov/m/pub-archive/1075h/1075%20(Mehlitz)...

datenwolf | 5 years ago

If you don't believe in bit flips, try this!

http://dinaburg.org/bitsquatting.html

I did that for a bit on cloudfront.net and got dozens of them in a short amount of time.
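
(Illustrative sketch, not the code used for the experiment above: one way to list the domains that are a single bit flip away from a target name. `bitsquat_variants` is a hypothetical helper. As a simplification it leaves the TLD untouched and only keeps results that are still valid hostname characters.)

```python
import string

# Characters allowed inside a hostname label (DNS names are case-insensitive).
ALLOWED = set(string.ascii_lowercase + string.digits + "-")


def bitsquat_variants(domain: str) -> list[str]:
    """Return the domains that differ from `domain` by exactly one flipped bit.

    Simplification: only the part before the last dot is mutated.
    """
    name, dot, tld = domain.lower().rpartition(".")
    variants = set()
    for i, ch in enumerate(name):
        if ch == ".":
            continue  # leave label separators alone
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit)).lower()
            if flipped == ch or flipped not in ALLOWED:
                continue  # same name after case folding, or not a hostname character
            candidate = name[:i] + flipped + name[i + 1:]
            # A label may not start or end with a hyphen.
            if any(l.startswith("-") or l.endswith("-") for l in candidate.split(".")):
                continue
            variants.add(candidate + dot + tld)
    return sorted(variants)


if __name__ == "__main__":
    for variant in bitsquat_variants("cloudfront.net"):
        print(variant)
```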

spullara | 5 years ago

Embedded people deal with this all the time. One class of solutions involves a checker task running continuously, which verifies the integrity of the data structures, kind of like a poor man's ECC. Really important code generally does everything three times, so there's a tie breaker in case there's a temporary fault in code or memory. I've seen this done with macros in ways that result in pretty wild code, like running a computation three times, storing each of those results three times, and then comparing the resulting nine outputs three times. That was in a diving related application, so it's not crazy to do all that work over and over since it had to be right.
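
(Illustrative sketch of the "do everything three times" idea, not the macros described above. Real embedded code would be C with the copies kept in separate memory; this only shows the voting logic. `majority` and `run_triplicated` are hypothetical names.)

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def majority(a: T, b: T, c: T) -> T:
    """Return the value that at least two of the three inputs agree on."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("all three results disagree; no majority")


def run_triplicated(compute: Callable[[], T]) -> T:
    """Run the computation three times and vote, so one transient fault is outvoted."""
    return majority(compute(), compute(), compute())


if __name__ == "__main__":
    calls = []

    def flaky_depth_reading() -> int:
        # Simulate a transient fault: the second run returns a value with one bit flipped.
        calls.append(1)
        return 42 ^ 0x10 if len(calls) == 2 else 42

    print(run_triplicated(flaky_depth_reading))
```

The vote only masks a single bad result. If two runs are corrupted in the same way, the wrong value wins, which is presumably why the code described above also triplicates storage and comparison.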

Complex embedded systems like your cellphone's baseband processor usually just give up at some point and suicide a task or even the whole OS if they detect a problem. For a while I had a Qualcomm debugger attached to the internal cell modem I had in a netbook I was working on, and the baseband crashed all the time due to hardware faults. I thought I had a bad chip for a while until I realized it never happened when I left it in an underground parking lot.

trelliscoded | 5 years ago

A bunch of years ago Cisco had an issue with some RAM in a new switch model, I think it was in the 65xx. They were crashing randomly but only in certain places in the world. Cisco spent tons of money on this. No idea. They brought in a physics professor. The devices with the most issues were located in countries up near the Arctic Circle. Cosmic rays caused a bit flip in this particular set of RAM due to something in its design. Sorry for the light details, it's been years.

I also worked at a switch manufacturer. We had some ASICs from one of the big companies. Had crashes that we could not explain at all. We knew it was not us. Proved that bits were flipping in the switch ASIC. Turns out they had forgotten to spec low-alpha solder. Alpha particles will not go through your skin, but when the solder is layered right on to the chip... oops.

myrandomcomment | 5 years ago

Way back in the late 90s IBM had a problem with alpha-source-contaminated plastics in their SRAM chips. Those chips were used as caches in Sun SPARC processor modules. IBM told some customers, but not Sun. This caused random bit flips in the processor cache, leading to assorted failures and crashes in what were supposed to be reliable UNIX servers.

blattimwind | 5 years ago

So ... here's what I'm thinking, as a complete layman with respect to how radiation affects memory devices. RAM is DRAM, i.e. dynamic RAM. It has to get automatically refreshed relatively frequently.

So, maybe (again, me being a layman) what happens is that usually gamma rays hit a DRAM cell, but haven't imparted enough energy to cause a flip. A millisecond later the cell gets refreshed erasing what little influence the gamma ray had. No harm done. A flip would only occur if enough particles hit the cell within the refresh time frame. That's of course possible, but more rare.

Contrast this with processor cache. On-die cache is most likely SRAM, Static RAM. It doesn't get refreshed. So the slight voltage errors caused by gamma rays can slowly build up over time.

Perhaps this normally isn't an issue, because even though the cache is SRAM and doesn't get refreshed automatically, it'll get "refreshed" by virtue of being cache. i.e. as long as the processor is busy the cache is constantly getting re-written with new cache lines.

But that won't hold true when the processor is asleep. The cache will be sitting idle, making itself susceptible to accumulated charges. Thus the likelihood of a gamma flip is greatly increased.

All of that crude logic aside there's one caveat:

> The workaround was removed once the problem was fixed in microcode or in a later processor stepping.

So ... either everything I said is a load of bollocks and actually this was a processor bug that some CPU engineer mistook as gamma flips, or maybe my theory is correct and they changed the CPU to occasionally wake up and "refresh" its cache automatically.

The mystery remains...

fpgaminer | 5 years ago

I found this code as an intern at Microsoft. While the manufacturer is hidden in the post, I'll give you a clue: the company starts with "I" and ends with "ntel".

xpaulbettsx | 5 years ago

If you have a large enough fleet, and log your ECC errors, you have actually built a not-very-sensitive and very expensive scientific instrument: a cosmic ray detector. Physics is awesome.

dekhn | 5 years ago

To answer the question in the OP: yes, the processor cache might be more susceptible than RAM, if the RAM is ECC.

I've heard many stories about bit-flips causing serious problems at higher-elevation sites. Apparently a major installation at either NCAR or UCAR was delayed by a month fighting such problems. While I haven't actually confirmed any of these stories first hand, I've heard enough to believe that a little paranoia is justified.

notacoward | 5 years ago

Bit flips are real. I used to see them on my (admittedly low-end) webserver. E.g. there were occasional errors like "myOfunction not found". A quick glance at an ASCII table shows that the original function name "my_function" is indeed one bit flip away (0x4F vs 0x5F).
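
(Illustrative sketch: the ASCII-table check can be done mechanically by XOR-ing the two names byte by byte and counting the set bits. `differing_bits` is a hypothetical helper name.)

```python
def differing_bits(a: str, b: str) -> int:
    """Count the bits that differ between two equal-length ASCII strings."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(
        bin(x ^ y).count("1")
        for x, y in zip(a.encode("ascii"), b.encode("ascii"))
    )


if __name__ == "__main__":
    # '_' is 0x5F and 'O' is 0x4F, so they differ only in bit 4 (0x10).
    print(differing_bits("my_function", "myOfunction"))
```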

rubenbe | 5 years ago

When I was fresh out of college, I worked as a contractor for a prominent agricultural equipment manufacturer. I was responsible for building out the touch-screen interface for the radio (a Qt app). I was told by an engineer who worked for the equipment manufacturer that my application wasn't good enough because it needed to be able to operate correctly in the face of arbitrary bit flips "from lightning strikes". I kindly asked her to show me the requirements, which was sufficient to get her to relent, but that was still the wackiest request I've ever received.

weberc2 | 5 years ago

I once saw a postmortem where a server process mysteriously tried to delete all of its data (fortunately no actual data was lost). After much confusion, the conclusion was that a cosmic ray flipped a single bit in a register, making it point 8 bytes past the correct address in a C++ virtual function table. As a result, instead of calling UpdateRow(), the process executed DeleteTable().
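
(Illustrative sketch of how one flipped bit can move a pointer by exactly 8 bytes. The table layout and the other slot names here are invented for the example. Only UpdateRow() and DeleteTable() come from the story, and 8-byte slots assume a 64-bit target.)

```python
SLOT_SIZE = 8  # bytes per function pointer, assuming a 64-bit process

# Hypothetical virtual table layout; only the last two names appear in the story.
VTABLE = ["GetRow", "InsertRow", "UpdateRow", "DeleteTable"]


def flip_bit(value: int, bit: int) -> int:
    """Return `value` with the given bit inverted."""
    return value ^ (1 << bit)


def slot_at(offset: int) -> str:
    """Look up which method a byte offset into the table refers to."""
    return VTABLE[offset // SLOT_SIZE]


if __name__ == "__main__":
    offset = VTABLE.index("UpdateRow") * SLOT_SIZE  # 16 == 0b10000
    corrupted = flip_bit(offset, 3)                 # bit 3 is worth 8 bytes
    print(slot_at(offset), "->", slot_at(corrupted))
```

Flipping bit 3 adds 8 only when that bit was previously 0. If it had been 1, the same upset would have moved the pointer back one slot instead.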

Of course cosmic rays don't exactly leave a trace, so we will never know.

yongjik | 5 years ago

I've heard stories from the supercomputing folks about trying to put their machine rooms underneath parking structures, to get the added protection of layers and layers of concrete overhead.

epaulson | 5 years ago

Could it be that the manufacturer was aware of a bug and chose to circumvent it by using gamma rays as a pretext?

herogreen | 5 years ago

Microsoft allow commented-out code in their kernel?

kurtisc | 5 years ago

A pedantic point (good thing I'm on HN), but I wonder if they didn't actually mean muons and not gamma rays?

nategri | 5 years ago

Relevant: https://en.wikipedia.org/wiki/Timothy_C._May

May is most noted for having identified the cause of the "alpha particle problem", which was affecting the reliability of integrated circuits as device features reached a critical size where a single alpha particle could change the state of a stored value

nobrains | 5 years ago

For people concerned, discover the wonderful thing called 8T SRAM

baybal2 | 5 years ago

Maybe the cache should be made using Silicon on Sapphire for radiation hardening? But I am not sure if multi-process silicon fabrication is viable.

discoball | 5 years ago

It did not

effnorwood | 5 years ago