o run for years between reboots. Unfortunately, few of those computers are PCs.
If mainframes, high-end servers, and embedded control systems can chug along for years without crashing, freezing, faulting, or otherwise refusing to function, then why can't PCs? Surprisingly, the answer has only partly to do with technology. The biggest reason why PCs are the most crash-prone computers ever built is that reliability has never been a high priority -- either for the industry or for users. Like a patient seeking treatment from a therapist, PCs must want to change.
"When a 2000-user mainframe crashes, you don't just reboot it and go on working," says Stephen Rochford, an experienced consultant in Colorado Springs, Colorado, who develops custom financial applications. "The customer demands to know why the system went down and wants the problem fixed. Most customers with PCs don't have that much clout."
Fortunately, there are signs that everyone is paying slightly more attention to the problem. Users are getting fed up with time-consuming crashes -- not to mention the complicated fixes that consume even more time -- but that's only one factor. For the PC industry, the prime motives seem to be self-defense and future aspirations.
With regard to self-defense: Vendors are struggling to control technical-support costs, while alternatives such as network computers (NCs) are making IT professionals more aware of the hidden expenses of PCs. With regard to future aspirations: The PC industry covets the prestige and lush profit margins of high-end servers and mainframes. But processing power alone does not a mainframe make. When the chips are down, high availability must be more than just a promise.
That's why the PC industry is working on solutions that should make crashes a little less frequent. We're starting to see OSes that upgrade themselves, applications that repair themselves, sensors that detect impending hardware failures, development tools that help programmers write cleaner code, and renewed interest in the time-tested technologies found in mainframes and mission-critical embedded systems. As a bonus, some of those improvements will make PCs easier to manage, too.
But don't celebrate yet -- it's hardly a revolution. Change is coming slowly, and PCs will remain the least reliable computers for years to come.
Why PCs Crash
Before examining the technical reasons why PCs crash, it's useful to analyze the psychology of PCs -- by far the biggest reason for their misbehavior. The fact is, PCs were born to be bad.
"The fundamental concept of the personal computer was to make trade-offs that guaranteed PCs would crash more often," declares Brian Croll, director of Solaris product marketing at Sun Microsystems. "The first PCs cut corners in ways that horrified computer scientists at the time, but the idea was to make a computer that was more affordable and more compact. Engineering is all about making trade-offs."
It's not that PC pioneers weren't interested in reliability. It's just that they were more interested in chopping computers down to size so that everybody could own one. They scrounged the cheapest possible parts to build the hardware, and they took dangerous shortcuts when writing the software.
For instance, to wring the most performance out of slow CPUs and a few kilobytes of RAM, early PCs ran the application program, the OS, and the device drivers in a common address space in main memory. OS developers didn't have much choice, because early CPUs had no concept of protected memory or a kernel mode to insulate the OS from programs running in user mode. In that shared, unprotected address space, anything could clobber anything else, so a nasty bug in any one component would usually bring down the whole system.
Ironically, though, the first PCs were fairly reliable, thanks to their utter simplicity. In the 1970s and early 1980s, system crashes generally weren't as common as they are today. (This is difficult to document, but almost everyone swears it's true.) The real trouble started when PCs grew more complex.
Consider the phenomenal growth in code size of a modern OS for PCs: Windows NT. The original version in 1992 contained 4 million lines of source code -- considered quite a lot at the time. NT 4.0, released in 1996, expanded to 16.5 million lines. NT 5.0, due this year, will balloon to an estimated 27 million to 30 million lines. That's roughly a sevenfold increase in only six years.
"People who build reliable systems don't radically change the system very often," says Sun's Croll. (Solaris is holding fairly steady at 7 million to 8 million lines of code.) "PCs tend to have boatloads of fresh, virgin, untested code. The sheer number of lines of code makes bugs more likely. The code you never write has no bugs."
Engineers who work with mainframes and critical embedded systems agree. "Having 15 million lines of code isn't as bad as having 15 million lines of new code," notes Wayman Thomas, director of mainframe solutions for Candle, which makes performance monitors and other software for large-scale servers and mainframes. (See the sidebars "Why Mainframes Rarely Crash" and "Embedded Reliability: Bet Your Life".)
However, Russ Madlener, Microsoft's desktop OS product manager, says that code expansion is manageable if developers expand their testing, too. He says the NT product group now has two testers for every programmer. "I wouldn't necessarily say that bugs grow at the same rate as code," he adds.
It's true that NT is more crash-resistant than Windows 95, a smaller OS that's been around a lot longer. And both crash less often than the Mac OS, which is older still. In this case, new technology compensates for NT's youth and girth. NT has more robust memory protection and rests on a modern kernel, while Windows 95 has more limited memory protection and totters on the remnants of MS-DOS and Windows 3.1. The Mac OS has virtually no memory protection and allows applications to multitask cooperatively in a shared address space -- a legacy of its origins in the early 1980s.
Still, it will be interesting to see how stable NT remains as it grows fatter. And grow fatter it will, because nearly everybody wants more features. Software vendors want more features because they need reasons to sell new products and upgrades. Chip makers and system vendors need reasons to sell bigger, faster computers. Computer magazines need new things to write about. Users seem to have an insatiable demand for more bells and whistles, whether they use them or not.
"The whole PC industry has come to resemble a beta-testing park," moans Pavle Bojkavski, a law student at the University of Amsterdam who's frustrated by the endless cycle of crashes, bug fixes, upgrades, and more crashes. "How about developing stable computers using older technology? Or am I missing a massive rise in the number of masochists globally who just love being punished?"
Although there are dozens of technical reasons why PCs crash, it all comes down to two basic traits: the growth spurt of complexity, which has no end in sight, and the low emphasis on reliability. Attempts to sell simplified computers (such as NCs) or scaled-down applications (such as Microsoft Write) typically meet with resistance in the marketplace. For many users, it seems the stakes aren't high enough yet.
"If you're using [Microsoft] Word and the system crashes, you lose a little work, but you don't lose a lot of money, and no one dies," explains Sun's Croll. "It's a worthwhile trade-off."
Causes Behind Crashes
You can sort the technical reasons for crashes into two broad categories: hardware problems and software problems.
Genuine hardware problems are much less common, but you can't ignore the possibility. One downside to the recent sharp drop in system prices (see "Disposable PCs," February) is that manufacturers are cutting corners more closely than ever before. Inexpensive PCs aren't necessarily shoddy PCs, but sometimes they are. (See the sidebar "It's a Hardware Problem!".)
Another cause of mysterious crashes, outright sabotage, is beyond the scope of this article. The dangers of viruses, worms, and Trojan horse programs are well documented, and it's really a security issue. And, of course, nefarious behavior isn't limited to software. In a study of 10,000 help-desk calls, analysts at Workgroup Technologies discovered that 10 calls in one month at one company came from users whose SIMMs had been stolen. A former CIO at a publishing company told BYTE that his employees frequently upgraded their systems by pilfering SIMMs from other employees' machines. (Robin Hood strikes again.)
Generally, though, when a computer crashes, it's the software that's failed. If it's an application, you stand to lose your unsaved work in that program, but a good OS should protect the memory partitions that other programs occupy. Sometimes, however, the crashed program triggers a cascade of software failures that brings down the entire system.
Then the only recourse is to reboot, sacrificing unsaved work in all open applications. And because neither the OS nor the applications get a chance to clean up after themselves -- by closing open files, deleting temporary files, flushing I/O channels, and so forth -- an abrupt reboot can leave debris on the hard disk or even scramble the disk. This leads to more instability, more crashes, and lost data.
So why do programs crash? Chiefly, there are two reasons: A condition arises that the program's designer didn't anticipate, so the program doesn't handle the condition; or the program anticipates the condition but then fails to handle it in an adequate manner.
In a perfect world, every program would handle every possible condition, or at least it would defer to another program that can handle it, such as the OS. But in the real world, programmers don't anticipate everything. Sometimes they deliberately ignore conditions that are less likely to happen -- perhaps in trade for smaller code, faster code, or meeting a deadline. In those cases, the OS is the court of last resort, the arbiter of disturbances that other programs can't resolve. "At the OS level, you've got to anticipate the unanticipated, as silly as that sounds," says Guru Rao, chief engineer for IBM's System/390 mainframes.
To deal with these dangers, programmers must wrap all critical operations in code that traps an error within a special subroutine. The subroutine tries to determine what caused the error and what should be done about it. Sometimes the program can quietly recover without the user's knowing that anything happened. In other cases, the program must display an error message asking the user what to do. If the error-handling code fails, or is missing altogether, the program crashes.
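The error-trapping pattern described above can be sketched in a few lines. The example below is a minimal illustration in Python, not code from any product mentioned in this article; the function name load_settings, the key=value file format, and the default values are all hypothetical. It shows both outcomes: a quiet recovery when the condition is harmless, and a message for the user when it is not.

```python
DEFAULTS = {"autosave": "on"}  # hypothetical fallback settings


def load_settings(path):
    """Read key=value settings, trapping the errors we can anticipate.

    Returns (settings, message). message is None when the program can
    recover quietly, or a string the program should show to the user.
    """
    try:
        with open(path, encoding="utf-8") as handle:
            lines = handle.read().splitlines()
    except FileNotFoundError:
        # Anticipated condition: no settings file yet. Recover quietly.
        return dict(DEFAULTS), None
    except OSError as error:
        # Anticipated but not fixable here: let the user decide what to do.
        return dict(DEFAULTS), f"Could not read {path}: {error}"

    settings = dict(DEFAULTS)
    for line in lines:
        key, separator, value = line.partition("=")
        if not separator:
            # Malformed line: report it instead of crashing on it.
            return dict(DEFAULTS), f"Bad line in {path}: {line!r}"
        settings[key.strip()] = value.strip()
    return settings, None
```

Any condition the handlers don't name (a file in an unexpected text encoding, for example) still escapes as an unhandled error, which is exactly the gap the paragraph above describes.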
Autopsy of a Crash
Crash is a vague term used to describe a number of misfortunes. Typically, a program that crashes is surprised by an exception, caught in an infinite loop, confused by a race condition, starved for resources, or corrupted by a memory violation.
Exceptions are run-time errors or interrupts that force a CPU to suspend normal program execution. (Java is a special case: The Java virtual machine [VM] checks for some run-time errors in software and can throw an exception without involving the hardware CPU.) For example, if a program tries to open a nonexistent data file, the CPU returns an exception that means "File not found." If the program's error-trapping code is poor or absent, the program gets confused.
That's when a good OS should intervene. It probably can't correct the problem behind the scenes, but it can at least display an error message: "File not found: Are you sure you inserted the right disk?" However, if the OS's error-handling code is deficient, more dominoes fall, and eventually the whole system crashes.
Sometimes a program gets stuck in an infinite loop. Due to an unexpected condition, the program executes the same block of code over and over. (Imagine a person so stupid that he or she follows literally the instructions on a shampoo bottle: "Lather. Rinse. Repeat.") To the user, a program stuck in an infinite loop appears to freeze or lock up. Actually, the program is running furiously.
Again, a good OS will intervene by allowing the user to safely stop the process. But the process schedulers in some OSes have trouble coping with this problem. In Windows 3.1 and the Mac OS, the schedulers work cooperatively, which means they depend on processes to cooperate with each other by not hogging all the CPU time. Windows 95 and NT, OS/2, Unix, Linux, and most other modern OSes allow a process to preempt another process.
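To make the cooperative model concrete, here is a toy round-robin scheduler in Python. It is only an illustration of the idea, not how Windows 3.1 or the Mac OS is implemented; the task and function names are invented for the example. Each task is a generator, and the scheduler regains control only when a task chooses to yield.

```python
from collections import deque


def polite_task(name, steps):
    """A task that hands control back after every unit of work."""
    for step in range(steps):
        yield f"{name}:{step}"


def greedy_task(name, steps):
    """A task that does all of its work before yielding even once."""
    done = [f"{name}:{step}" for step in range(steps)]
    yield ",".join(done)


def run_cooperatively(tasks):
    """Round-robin scheduler that can only switch when a task yields."""
    ready = deque(tasks)
    log = []
    while ready:
        task = ready.popleft()
        try:
            log.append(next(task))
        except StopIteration:
            continue  # task finished; drop it
        ready.append(task)  # back of the queue until its next turn
    return log
```

Running run_cooperatively([polite_task("A", 2), greedy_task("B", 3), polite_task("C", 2)]) shows the greedy task finishing all three of its steps in a single turn while the others wait. If a task looped forever without yielding, the call to next(task) would never return and nothing else would run again. A preemptive scheduler avoids that by interrupting tasks on its own schedule instead of waiting to be handed control.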
Race conditions are similar to infinite loops, except they're usually caused by something external to the program. Maybe the program is talking to an external device that isn't responding as quickly as the program expects -- or the program isn't responsive to the device. Either way, there's a failure to communicate. The software on each end is supposed to have time-out code to handle this condition, but sometimes the code isn't there or doesn't work properly.
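Time-out code of the kind mentioned above can be very small. The following Python sketch assumes a hypothetical device driver that delivers replies through a standard queue.Queue; the function name read_reply is made up for the example. The point is simply that the wait is bounded, so a silent device produces a reportable failure rather than a hang.

```python
import queue


def read_reply(device_replies, timeout_seconds):
    """Wait for a reply from a device, but only for a bounded time.

    device_replies is a queue.Queue that a (hypothetical) device driver
    fills with replies. Returns the reply, or None on time-out.
    """
    try:
        return device_replies.get(timeout=timeout_seconds)
    except queue.Empty:
        # The device never answered: report failure instead of hanging.
        return None
```

The caller still has to decide what a None result means (retry, warn the user, give up), but at least it gets the chance to decide.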
Resource starvation is another way to crash. Usually, the scarce resource is memory. A program asks the OS for some free memory; if the OS can't find enough memory at that moment, it denies the request.
Again, the program should anticipate this condition instead of going off and sulking, but sometimes it doesn't. If the program can't function without the expected resources, it may stop dead in its tracks without explaining why. To the user, the program appears to be frozen.
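As a rough illustration of anticipating a denied request, the Python sketch below uses a toy allocator with a fixed budget to stand in for the OS. The class MemoryPool and the function open_document are hypothetical names invented for this example; a real program would be checking the result of its platform's own allocation call.

```python
class MemoryPool:
    """Toy allocator with a fixed budget, standing in for the OS."""

    def __init__(self, total_bytes):
        self.free_bytes = total_bytes

    def allocate(self, size):
        """Return a buffer, or None when the request cannot be met."""
        if size > self.free_bytes:
            return None  # request denied
        self.free_bytes -= size
        return bytearray(size)


def open_document(pool, size):
    """Check the allocation result instead of assuming it succeeded."""
    buffer = pool.allocate(size)
    if buffer is None:
        return f"Not enough memory to open document ({size} bytes needed)"
    return f"Opened document in a {len(buffer)}-byte buffer"
```

With a 1000-byte pool, a first 600-byte request succeeds and a second one is refused with an explanation, rather than leaving the user staring at a frozen window.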
Even worse, the program may assume it got the memory it asked for. This typically leads to a memory violation. When a program tries to use memory it doesn't legitimately own, it either corrupts a piece of its own memory or attempts to access memory outside its partition.
What happens next largely depends on the strength of the OS's memory protection. A vigilant OS won't let a program misuse memory. When the program tries to access an illegal memory address, the CPU throws an exception. The OS catches the exception, notifies the user with an error message ("This program has attempted an illegal operation: invalid page fault"), and attempts to recover. If it can't, it either shuts down the program or lets the user put the program out of its misery.
Not every OS is so protective. When the OS doesn't block an illegal memory access, the errant program overwrites memory that it's using for something else, or it steals memory from another program. The resulting memory corruption usually sparks another round of exceptions that eventually leads to a crash.
Corruption also occurs when a program miscalculates how much memory it already has. For instance, a program might try to store some data in the nonexistent 101st element of a 100-element array. When the program overruns the array bounds, it overwrites another data structure. The next time the program reads the corrupted data structure, the CPU throws an exception. Wham! Another crash.
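The 101st-element overrun can be simulated in Python by modeling memory as one flat block of bytes. This is only a model: the layout (a 100-element array followed immediately by a 4-byte marker) and the function names are assumptions made for the example, and Python's own lists would refuse the bad index. The unchecked function mimics what code without bounds checking effectively does; the checked one shows the guard that prevents it.

```python
ARRAY_LENGTH = 100

# Flat "memory": a 100-element array followed directly by another
# data structure (here, a 4-byte marker the program relies on).
memory = bytearray(ARRAY_LENGTH) + bytearray(b"SAFE")


def store_unchecked(index, value):
    """Write with no bounds check on the array itself."""
    memory[index] = value


def store_checked(index, value):
    """Refuse any write that falls outside the array."""
    if not 0 <= index < ARRAY_LENGTH:
        raise IndexError(f"index {index} is outside the {ARRAY_LENGTH}-element array")
    memory[index] = value


def neighbor():
    """Return the data structure that sits just past the array."""
    return bytes(memory[ARRAY_LENGTH:])
```

Calling store_checked(100, 0) raises an error and leaves the neighboring marker intact. Calling store_unchecked(100, 0) succeeds silently and changes the first byte of the marker, and nothing goes visibly wrong until some later code reads it.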
Altered States
Modern PCs suffer from a whole other class of problems related to their state -- the sum total of all the information that defines the machine's status or condition. State information includes all the software installed on the hard disk, the configuration files, the control panel settings, the configurable data in the BIOS, and the user's preferences settings. It's everything that makes one system different from another system that has identical hardware.
Before PCs had hard drives, they were essentially stateless. They stored everything on floppy disks and tapes. Users and administrators never had to install, uninstall, or manage any software on the system. Because the state information was independent of the machine, it was almost impervious to any disaster that befell the machine. If a meteor destroyed your PC, you could replace it with another PC and get back to work immediately. There was nothing to reinstall or reconstruct. (Today, NCs attempt to recreate this pure statelessness by storing everything on a server.)
By contrast, modern PCs hoard an immense amount of state information that's constantly changing. Even when you're staring blankly at the screen, a brief flurry of disk activity might signal that your OS is modifying its registry settings in the background. Problems arise when a change of state knocks the system off balance. Usually this happens after the installation of some new software -- a new version of the OS, a new application, an updated device driver, or just about anything. Suddenly the system doesn't work like it used to. You are the victim of a software conflict that's often incredibly difficult to fix because you're not sure what changed or how to change it back.
Two of the biggest culprits are DLLs on Windows PCs and extensions on Macs. DLLs are code libraries that different programs can share. Extensions are programs that hook into the Mac OS during boot-up to modify the system's behavior or augment the capabilities of an application. Both types of components inflict ridiculous amounts of aggravation.
One common problem occurs when a software installer dumbly replaces a newer version of a component with an older version. The newly installed application works fine, but an existing application might start crashing. Users aren't sure whom to blame. Result: a series of frustrating tech-support calls.
Shouldn't the installer merely check a component's date stamp before replacing it? Alas, it's not always that simple. Sometimes the date stamp isn't definitive, or maybe it has changed. Windows allows an installer to query a DLL to discover its actual version number, which is safer. But even if every installer were this careful, version management is only one problem. "Some companies tend to change functions in a common DLL without telling everyone right away, and those changes can cause problems for existing programs," says Dave Galligher, product-development manager at Cougar Mountain Software, an accounting software vendor.
Programs expect their DLLs to contain functions that have a particular name, a particular list of calling parameters, and particular return values. But Windows has no standard mechanism for querying a DLL to confirm this information. A program that relies on a DLL function to return a 32-bit integer value could easily crash if a different version of the DLL returns a 64-bit-long integer instead.
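The version check that a careful installer performs can be sketched as follows. This Python example does not call any Windows API; it assumes the installer has already obtained both version numbers as dotted numeric strings with the same number of parts, and the function names are hypothetical. It also illustrates why versions should be compared as numbers rather than as text.

```python
def parse_version(text):
    """Turn a dotted version string such as '4.70.1155' into a tuple."""
    return tuple(int(part) for part in text.split("."))


def should_replace(installed_version, candidate_version):
    """Replace a shared component only if the candidate is strictly newer."""
    return parse_version(candidate_version) > parse_version(installed_version)
```

For example, should_replace("4.70.1155", "4.9.0") is False, because 9 is less than 70, even though a naive string comparison would rank "4.9.0" higher. As the paragraph above points out, a check like this says nothing about whether the functions inside the newer component still behave the way existing programs expect.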
The problem of managing a system's state has spawned a whole subindustry of utility programs and management tools: CleanSweep, Conflict Catcher, Extensions Manager, First Aid Deluxe, Norton Utilities, Oil Change, RealHelp, TuneUp, Uninstaller, and dozens more. OS vendors are rapidly adding new management features to their system software, too. It's all because today's PCs require more care and feeding than a barrel full of Tamagotchi Giga Pets.
It's also a classic example of accelerating complexity. Components such as DLLs were invented to reduce complexity; programs wouldn't grow so fast if they shared common code. But installers began splattering so many DLLs all over the hard disk that they created a new problem. That, in turn, spurs the industry to produce new management tools, utilities, and OS features -- still more complexity. It starkly demonstrates how difficult it will be to transform PCs into truly reliable systems.
"The highest management cost in an IT environment comes from managing PCs," says Steve Mann, vice president of product strategy for Computer Associates. "They're not very manageable, and they're not very standardized in terms of configurations."
The chore of managing PCs is directly related to reliability. In a survey of 1800 IT professionals at the Computer Associates world user conference in 1997, 70 percent of the respondents agreed that mainframes are more reliable than PC-based client/server systems. "It's only recently that administrators have begun demanding the same levels of manageability and reliability that they're used to with mainframes and large servers," says Mann.
Searching for Solutions
Any solution must start with the way developers write, test, and debug their source code. Beyond that, installers must do a better job of loading finished programs onto systems. Finally, the OSes and applications must work together to make PCs easier to manage.
At the risk of igniting a flame war, it's only logical to place a large portion of the blame where it belongs: on C and C++. "Writing in C or C++ is like running a chain saw with all the safety guards removed," says Bob Gray, senior director of consulting services for Virtual Solutions, a developer of custom industrial applications. "It's powerful, but it's easy to cut off your fingers."
Few, if any, languages make it so easy to write bad code. Of course, anyone can write bad code in any language, but C and C++ are famously unforgiving. The computer industry standardized on C/C++ for commercial software development over a decade ago, creating a mountain of buggy software that will haunt us for decades to come.
Diehards protest that the sparsity of C/C++ is what makes it so fast. But PC hardware is getting so fast anyway that it's time to refocus instead on reliability. In the years ahead, as old-but-indispensable C/C++ programs continue to crash, the excuse that C/C++ conserves every CPU cycle will seem quaint -- as quaint as coding the year in two digits instead of four, thus conserving 2 bytes of storage.
What's the alternative? Take your pick. All fourth-generation-language (4GL) tools are safer, including Delphi, PowerBuilder, TopSpeed, Smalltalk, and Visual Basic. Perhaps the best example of a modern language is Java. It contains numerous safeguards that stop many bugs before they happen (see the sidebar "Better Tools for Better Code").
Rushing development cycles to match "Internet years" is another source of trouble. "If you look at the industry today, we see six- or nine-month development cycles instead of 18-month cycles," says Gary Ulaner, group product manager for Quarterdeck's RealHelp. "There are also more programmers doing software development, and not all of them have the same level of discipline for quality assurance. The requirements of time to market and revenue often cause products to be shipped before they're ready."
One dubious solution is public beta testing. Time was, you had to be someone special to be a beta tester. Now anybody who has a computer, a modem, and a reckless disregard for system stability can test beta software. The novelty of being an insider who runs prerelease products (even if a million other people are doing the same thing) has made public betas a huge hit. But public betas are also responsible for spreading buggy code, leaving a wake of system crashes and trashed hard drives.
"Some people might not realize what beta means," says Virtual Solutions' Gray. "It's not just a trick way to get an early copy of a new product."
True, public betas expose fresh code to mass testing. But how many casual beta testers report unique bugs -- or any at all? How many of them bother to remove the buggy software (including all its hidden components) from their system after the final product ships? How many realize what they're doing to their systems?
Microsoft's Madlener defends the practice of public betas but acknowledges that developers and users should be more careful. "Of late, we've been reviewing the disclaimer messages that come with these beta products," he says. "They call for some responsibility on the part of the beta testers, too, so they don't install the beta on a system that's mission-critical."
The next step is software installation -- and installers need to get smarter. OS/2 Warp 4 has an integrated Feature Installer that makes sure the right files get saved in the right places without stepping on other components. It's not just for installing OS software, either; third-party developers can use it for applications. Unix package installers, which have been around a lot longer, do the same thing. There are also some good third-party installers, such as InstallShield for Windows and MindVision's Installer VISE for the Mac.
Madlener says Windows NT 5.0 will have a new Application Installer Service, which sounds a lot like OS/2's Feature Installer. It means that developers will no longer have to write their own setup code. Instead, NT 5.0 will execute a script that tells where each file goes. NT will arbitrate any DLL conflicts and keep a log of all new files and registry changes. According to Madlener, this will make it easier to cleanly uninstall the software or reinstall individual components.
Madlener says he doesn't know yet if other versions of Windows will get the installer, but he says Windows 98 will have a management tool called the System File Checker. This is a diagnostic program that checks system components and can reinstall missing or broken pieces. It also keeps a log that's a snapshot of the system's state, making it easier to reverse changes.
Automated Maintenance
An interesting but potentially hazardous solution to system maintenance is automatic updating. Few users or administrators have time to scour the Internet for the latest upgrades and patches. That has opened the door for utilities such as CyberMedia's Oil Change and Quarterdeck's TuneUp and RealHelp. They compare your system configuration to a database on the Web. Then they help you download and install any relevant updates. It's such a good idea that Microsoft is thinking about adding similar features to Windows.
But there's a danger: Every change of your system's state, no matter how minor, can potentially break some existing software. An older program might crash with a newer DLL or device driver, forcing you to upgrade that program as well. Sometimes this triggers a cascade of failures and fixes before the system returns to a stable state. Sometimes you reach a dead end in which no update for a broken program is available.
And inevitably, the upgrades consume more memory, disk space, and CPU resources, accelerating the day when your PC becomes obsolete.
The phenomenon of new software breaking old software is well known to software engineers. Alan Wood, senior engineer at Tandem Computer, says fixes to Tandem's NonStop Kernel typically break something else in the OS about 5 percent to 10 percent of the time. Tandem catches those problems with thorough regression testing. But it's hard to perform that kind of formal testing on PCs: Every PC is slightly different.
Utilities such as Oil Change and TuneUp recognize this hazard. They log every alteration and save replaced components in a compressed archive, so you can undo an installation. But there's still a chance you'll wade deep into a series of changes and won't be able to roll back the system.
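The log-and-undo idea can be illustrated with a short Python sketch. It is not how Oil Change or TuneUp actually works; the class InstallLog and its methods are hypothetical, and for brevity it keeps replaced files in memory rather than in a compressed archive. Each change records what was there before, so the changes can be reversed newest first.

```python
import os


class InstallLog:
    """Record each file change so an installation can be undone."""

    def __init__(self):
        # Each entry is (path, previous contents or None if the file was new).
        self.entries = []

    def install_file(self, path, new_contents):
        """Write new_contents to path, saving whatever was there before."""
        previous = None
        if os.path.exists(path):
            with open(path, "rb") as handle:
                previous = handle.read()
        self.entries.append((path, previous))
        with open(path, "wb") as handle:
            handle.write(new_contents)

    def roll_back(self):
        """Undo every logged change, newest first."""
        while self.entries:
            path, previous = self.entries.pop()
            if previous is None:
                os.remove(path)  # the file did not exist before
            else:
                with open(path, "wb") as handle:
                    handle.write(previous)
```

A log like this can only undo what it recorded. Changes made by other installers, or by programs that ran between installations, are invisible to it, which is one way a rollback can fall short.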
Applications can take some responsibility for system management, too. When a user launches Office 98 for the Mac, it performs a self-diagnostic. If it can't find any of its shared libraries -- perhaps the user mistakenly disabled a library with the Extensions Manager -- Office 98 installs a fresh copy from a compressed archive on the hard disk. It all happens invisibly, so the user won't even notice. Microsoft says future versions of Office for Windows will also be self-repairing.
The Essence of PCs
Of course, every new feature, management tool, OS upgrade, and utility program adds still more code and complexity to a system. Some experts think PCs won't stop crashing until everyone accepts the futility of "feature shock." In other words, the shortest path to stability is simplicity: simpler hardware, simpler software, simpler user interfaces. But this demands a whole new way of thinking, says Michael L. Dertouzos, director of the MIT Laboratory for Computer Science: "It's more difficult, a little bit like birth control."
He says the change, if it ever comes, could begin as a grass-roots rebellion. Someone will use the Web to distribute a leaner, meaner OS that circumvents the entrenched platforms. It'll be more stable, easier to use, and better understood.
It sounds a lot like what's happening today with Linux, or the early days of Mosaic. But Linux flunks the simplicity test, and Mosaic begat Navigator, which begat Communicator. Simple software doesn't stay simple for long.
At the other extreme is the NC concept: a stateless, simplified client designed for a wired world. But NCs sacrifice the crucial essence of a PC -- unlimited local control. Mainframes and critical embedded systems achieve their high reliability by sacrificing local control, too. For better or for worse, many users and IT professionals would rather crash than switch.
That's why the ultimate solution is a long way off. Realistically, developers will continue to write bigger programs that ship before they're ready. OSes will continue to grow more complicated. Users will continue to vote with their dollars for feature-laden software. Established platforms and applications will continue to overshadow radical alternatives. And PCs will continue to crash.
Where to Find
Candle
Santa Monica, CA
Phone: 310-829-5800
Internet: http://www.candle.com/