Sometimes it really IS not our fault
So we got an emergency support call during the Passover holiday, and as you can imagine, it was a strange one. Our investigation of the error basically boiled down to (skipping a lot of effort in between): “This can’t be happening.”
I hate this kind of answer, because it usually means that we are missing something. Usually that is a strange error code, some race condition, or just something odd about the environment.
While we were working the problem, the customer came back with, “Oh, we found the issue. A memory unit went rogue, and the firmware wasn’t able to catch it.” When they updated the firmware, it apparently caught it immediately.
So I guess we can close this support incident.
Comments
rouge/rogue - though a memory unit going red is equally nasty :-)
Paul, Thanks, fixed. That is my memory going bad :-)
You think that's a bad one? After lots of chasing, eventually blaming RAM but a few memory tests show nothing wrong, you get one that shows something wrong. You swap it out. Problem goes away. Until another day.
Bad RAM /socket/. Intermittently.
Okay so if this was a customer I'd have had the entire hardware swapped as soon as we knew it was hardware. But this was a personal workstation and dual CPU Athlon motherboards were too expensive for me to just buy another to check something!
Here is a nice reading, regarding byzantine failures. Really opened up my mind regarding Murphy's pranks regarding hardware. My favorite is slide #23 - sunlight causing the issue. https://c3.nasa.gov/dashlink/static/media/other/ObservedFailures1.html
Maayan, Good reading, indeed.