Microsoft server crash nearly causes 800-plane pile-up
The radio system shutdown, which lasted more than three hours, left 800 planes in the air without contact to air traffic control, and led to at least five cases where planes came too close to one another, according to comments by the Federal Aviation Administration reported in the LA Times and The New York Times. Air traffic controllers were reduced to using personal mobile phones to pass on warnings to controllers at other facilities, and watched close calls without being able to alert pilots, according to the LA Times report.
[ . . . ]The [Windows] servers are timed to shut down after 49.7 days of use in order to prevent a data overload, a union official told the LA Times. To avoid this automatic shutdown, technicians are required to restart the system manually every 30 days. An improperly trained employee failed to reset the system, leading it to shut down without warning, the official said.
The Windows systems were an “upgrade” to the original UNIX servers used at deployment, but evidently the testing or design didn’t include a 50+ day duty cycle[1].
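The 49.7-day figure is probably not arbitrary: it is how long a 32-bit counter of milliseconds takes to wrap around to zero. The article doesn't say that is the cause, so treat this as my assumption; the function names below are made up for illustration. A quick sketch of the arithmetic, plus the usual wrap-safe way to measure elapsed time:

```python
# Assumption (not stated in the article): the 49.7-day limit comes from an
# unsigned 32-bit counter that ticks once per millisecond.
MS_PER_DAY = 24 * 60 * 60 * 1000


def days_until_wrap(bits=32):
    """Days until an unsigned millisecond counter of the given width wraps to zero."""
    return 2 ** bits / MS_PER_DAY


def elapsed_ms(start, now, bits=32):
    """Elapsed ticks between two readings, correct across a single wraparound."""
    return (now - start) % (2 ** bits)


print(round(days_until_wrap(), 1))     # 49.7
print(elapsed_ms(2 ** 32 - 10, 5))     # 15: the counter wrapped between readings
```

Code that does the subtraction modulo the counter width survives the rollover; code that compares raw readings does not, which is one plausible reason to force a shutdown before the counter gets there.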
So the moral of our story is: if you want something done, do it yourself. Is there a purer definition of Open Source?
I doubt the prior system was based on an open source foundation (Solaris or AIX, perhaps) but I would bet there was much greater transparency into the workings of the base system with it than there is with Windows.
So what would have happened if there had been an accident attributable to this system failure? Who would be found negligent or culpable? In all honesty, I have to say MSFT wouldn’t be the one to take the fall here: the vendor/designer of the system that relied on Windows, despite its “warranty” (read: disclaimer) and with any awareness of its reliability in the field, is the guilty party. When I read phrases like “off-the-shelf” in the context of a system migration, I assume costs are a driving factor: in the context of safety, how much was LAX going to save? If it was a requirements-driven move, what were those requirements?
fn1. My FreeBSD-based server is at 58 days of uptime: though I would not equate the valuable work it does with air traffic control management, it doesn’t seem to have issues with uptimes in excess of 7 weeks.