Software reality
Friday January 4, 2002 09:05

Software bugs are much easier to fix if the code lives in RAM instead of [E]EPROM or flash, and if software updates can be downloaded over the internet.

The CY3671 EZ-USB-FX development board now makes it possible to update peripheral microcontroller code from the internet. Thursday January 3, 2002 08:43

Break Points

Jack G. Ganssle

As Good As It Gets

HOW good does firmware have to be? How good can it be? Is our search for perfection, or near-perfection, an exercise in futility?

Complex systems are a recent phenomenon. Many of us remember the early transistor radios, which sported no more than a half-dozen active devices. Vacuum tube televisions, common into the ’70s, used 15 to 20 tubes, more or less equivalent to about the same number of transistors. In the ’40s, the ENIAC computer required 18,000 tubes, so many that technicians wheeled shopping carts of spares through the room, constantly replacing those that burned out. Though that sounds like a lot of active elements, even the 25-year-old Z80 chip used a quarter of that many transistors, in a die smaller than just one of the hundreds of thousands of resistors in the ENIAC.

Now, the Pentium 4 has 45 million transistors. A big memory chip might require a third of a billion. Intel predicts that, later this decade, their processors will have a billion transistors. I’d guess that even the simplest embedded system, such as an electronic greeting card, requires thousands of active elements.

Software has grown even faster. In 1975, 10,000 lines of assembly code was considered huge. Given the development tools of the day—paper tape, cassettes for mass storage, and crude teletypes for consoles—working on projects of this size was difficult. Today 10,000 lines of C is a small program. A cell phone might contain a million lines of C or C++. This is astonishing, considering the device’s small form factor and power requirements.

Another measure of software size is memory usage. The 256-byte (that’s not a typo) EPROMs of 1975 meant even a measly 4KB program used 16 devices. Clearly, even small embedded systems were quite pricey. Today? 128KB of flash is nothing, even for a tiny application. The switch from 8- to 16-bit processors, and then from 16- to 32-bitters, is driven more by address space requirements than raw horsepower.

So our systems are growing rapidly in both size and complexity. They’re also growing, I contend, in failure modes. Are we smart enough to build these huge applications correctly?

It’s hard to make even a simple application perfect; big ones will most likely never be faultless. As the software grows, its components inevitably become more interdependent. A change in one area impacts other sections, often profoundly. Sometimes this is due to poor design; often, it’s a necessary effect of system growth.

The hardware, too, is certainly a long way from perfect. Even mature processors usually come with an errata sheet, one that can rival the datasheet in size. The infamous Pentium divide bug was just one of many bugs. Even today, the Pentium 3’s errata sheet (renamed “specification update”) details 83 issues. Motorola documents nearly a hundred problems in the MPC555.

What is the current state of the reliability of embedded systems? No one knows. It’s an area devoid of research. Yet a lot of raw data is available, some of which suggests we’re not doing well.

The Mars Pathfinder mission succeeded beyond anyone’s dreams, despite a significant error. A priority inversion problem—noticed on Earth but attributed to a glitch and ignored—caused numerous hangs. A remote debug capability saved the mission. This is an instructive failure because it shows the importance of adding external hardware and/or software to deal with unanticipated software errors.

The August 15, 2001 issue of the Journal of the American Medical Association contained a study called “Recalls and Safety Alerts Involving Pacemakers and Implantable Cardioverter-Defibrillator Generators” by William H. Maisel and others. (Since these devices are implanted subcutaneously I can’t imagine how a recall works.) Surely designers of these devices are on the cutting edge of building the very best software. Yet between 1990 and 2000, firmware errors accounted for about 40% of the 523,000 devices recalled.

In the 10 years studied, we learned a lot about building better code. Tools have improved and the amount of real software engineering that takes place is much greater. Or so I thought. It turns out that the annual number of recalls increased between 1995 and 2000.

In defense of the pacemaker developers, they no doubt confront very complex problems. Interestingly, heart rhythms can be mathematically chaotic. A slight change in stimulus can cause the heartbeat to burst into quite unexpected randomness. And surely there’s a wide distribution of heart behavior in different patients.

Perhaps a new QA strategy is needed for these sorts of life-critical devices. What if the software engineer were someone with heart disease who had to use the latest widget before release to the general public?

A pilot friend tells me the 747 operator’s manual is a massive tome that describes everything one needs to know about the aircraft and its systems. He says that fully half of the book documents avionics (read: software) errors and workarounds.

The space shuttle’s software is a glass half-empty/half-full story. It’s probably the best code ever written, with an average error rate of about one per 400,000 lines. The cost? $1,000 per line. So, it is possible to write great code, but despite paying vast sums, perfection is still elusive. Like the 747, the stuff works “well enough,” which is perhaps all we can ever expect. Is this as good as it gets?

The human factor

We don’t build systems that live in isolation. They’re part of a complex web of systems, not the least of which is the human operator or user. When tools were simple, there weren’t so many failure modes. That’s not true anymore. Do you remember the U.S.S. Vincennes? She’s a U.S. Navy battle cruiser, equipped with the sophisticated Aegis radar system. In July of 1988 the cruiser shot down an Iranian airliner over the Persian Gulf, killing all 290 people on board. Apparently the system knew that the target wasn’t an incoming enemy warplane, but that fact was displayed on terminals that weren’t easy to see. So here’s a failure where the system worked as designed, but the human element created a terrible failure. Was the software perfect since it met the requirements?

Unfortunately, airliners have become common targets for warplanes. This past October, a Ukrainian missile apparently shot down a Sibir Tu-154 commercial jet, killing all 78 passengers and crew. While I write, the cause is unknown, or unpublished, but local officials claim the missile had been targeted on a nearby drone. It missed, flying 150 miles before hitting the jet. Software error? Human error?

The war in Afghanistan shows the perils of mixing men and machines. At least one smart bomb missed its target and landed on civilians. U.S. military sources say incorrect target data was entered. Maybe that means someone keyed in the wrong GPS coordinates. It’s easy to blame an individual for mis-typing, but doesn’t it make more sense to look at the system as a whole, including bomb and operator? Bombs connote serious safety-critical issues. Perhaps a better design would accept targeting parameters in a string that includes a checksum, rather like credit card numbers. A mis-keyed entry would be immediately detected by the machine.

It’s well known that airplanes are so automated that on occasion both pilots have slipped off into sleep as the craft flies itself. Actually, that doesn’t really bother me much, since the autopilot beeps when at the destination, presumably waking the crew. But, before leaving, the fliers enter the destination in latitude/longitude format into the computers. What if they make a mistake (as has happened)? Current practice requires pilot and co-pilot to check each other’s entries, which will certainly reduce the chance of failure. Why not use checksummed data instead and let the machine validate the data?
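The checksummed-entry idea can be sketched in a few lines. This is an illustration in Python for brevity, not any fielded system's format: the entry is assumed to be a string of coordinate digits followed by one Luhn check digit (the scheme credit card numbers use), and the function names are hypothetical.

```python
def luhn_check_digit(digits: str) -> int:
    """Return the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk right to left, doubling every other digit starting with the rightmost.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def add_check_digit(digits: str) -> str:
    """Append the check digit, producing the string the operator would key in."""
    return digits + str(luhn_check_digit(digits))


def is_valid_entry(entry: str) -> bool:
    """True if the last digit matches the Luhn check digit of the rest."""
    if len(entry) < 2 or not entry.isdigit():
        return False
    return luhn_check_digit(entry[:-1]) == int(entry[-1])


# Hypothetical coordinate digits, checksummed before being handed to the operator.
entry = add_check_digit("3508410668")
print(is_valid_entry(entry))
```

A Luhn digit catches every single mis-keyed digit and most swaps of adjacent digits. A real design might prefer a stronger code, but even this much lets the machine reject the entry instead of relying on a second person to spot it.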

Another U.S. vessel, the Yorktown, is part of the Navy’s “Smart Ship” initiative. Automating significant portions of the ship’s engineering (propulsion) reduces crew needs by 10% and saves some $2.8 million per year on this one ship. Yet the computers create new vulnerabilities. Reports suggest that an operator once entered an incorrect parameter that resulted in a divide-by-zero error. The entire network of Windows NT machines crashed. The Navy claims the ship was dead in the water for about three hours; other sources (www.gcn.com/archives/gcn/1998/july13/cov2.htm) claim it was towed into port for two days of system maintenance. Users are now trained to check their parameters more carefully. I can’t help but wonder what happens in the heat of battle, when these young sailors may be terrified, with smoke and fire perhaps raging. How careful will the checks be?
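The alternative to training users to be careful is to have the machine refuse the bad value at the point of entry. A minimal sketch, assuming a hypothetical operator field whose value is later used as a divisor; the text does not describe the Yorktown's actual software, so the names here are illustrative only.

```python
import math


def parse_divisor(text: str, name: str) -> float:
    """Parse an operator-entered value that will later be used as a divisor."""
    try:
        value = float(text)
    except ValueError:
        raise ValueError(f"{name}: {text!r} is not a number") from None
    # Reject zero, infinities, and NaN before they reach any arithmetic.
    if not math.isfinite(value) or value == 0.0:
        raise ValueError(f"{name}: must be a finite, non-zero number")
    return value


try:
    parse_divisor("0", "valve_ratio")   # hypothetical field name
except ValueError as err:
    print("entry rejected:", err)
```

Rejecting the entry keeps one typing mistake local to one screen, rather than letting it propagate into a calculation somewhere else on the network.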

Some readers may also shudder at the thought of Windows NT controlling a safety-critical system. I admire the Navy’s resolve to use a commercial product, but wonder if Windows, which is the target of many hackers’ wrath, might not itself create other vulnerabilities. Will the next war be won by the nation with the best hackers?

People behave in unpredictable ways, leading to failures in even the best system designs. As our devices grow more complex, their human engineering becomes ever more important. Yet all too often this is neglected in our pursuit of technical solutions.

Solutions?

I’m a passionate believer in the value of firmware standards, code inspections, and a number of other activities characteristic of disciplined development. It’s my experience that an ad hoc or a non-existent process generally leads to crummy products. Smaller systems can succeed from the dedication of a couple of overworked experts, but as things scale up in size heroics become less and less effective.

Yet it seems an awful lot of us don’t know about basic software engineering rules. When talking to groups I usually ask how many participants have (and use) rules about the maximum size of a function. A basic rule of software engineering is to limit routines to a page or less. Only rarely does anyone raise their hand. Most admit to huge blocks of code, sometimes thousands of lines. Often, this is a result of changes and revisions, of the code evolving over the course of time. Yet it’s a practice that inevitably leads to problems.

Methodologies haven’t solved the problem. Most are too big and too complex. I have hope for UML, which seems to offer a way to build products that integrates hardware and software, and that is an intrinsic part of development from design to implementation. But UML will fail if management won’t pay for extensive training, or resist the urge to toss the approach when panic reigns.

The FDA, FAA, and other agencies are becoming aware of the perils of poor software, and have guidelines that can improve development. Britain’s Motor Industry Software Reliability Association (MISRA) has guidelines for the safer use of C. They feel that we need to avoid certain constructs and use others in controlled ways to eliminate potential error sources. I agree.

I doubt, though, that any methodology or set of practices can, in the real world of schedule pressures and capricious management, lead to perfect products. The numbers tell the story. The very best use of code inspections, for example, will detect about 70% of the mistakes before testing begins. (However, inspections will find those errors very cheaply.) That suggests that testing must pick up the other 30%. Yet studies show that often testing checks only about 50% of the software!
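The arithmetic behind those figures can be sketched as follows. The 70% and 50% numbers come from the text; the starting count of 1,000 defects, and the simplification that testing finds every defect in the code it exercises and none in the code it never runs, are assumptions made only for illustration.

```python
def escaped_defects(total: float, inspection_yield: float, test_coverage: float) -> float:
    """Defects still present after inspection and then testing.

    Assumes defects are spread evenly over the code, and that testing
    removes all defects in the fraction of code it actually runs.
    """
    after_inspection = total * (1.0 - inspection_yield)
    return after_inspection * (1.0 - test_coverage)


remaining = escaped_defects(1000, inspection_yield=0.70, test_coverage=0.50)
print(f"about {remaining:.0f} of 1000 defects ship")
```

Under those generous assumptions roughly 15% of the original mistakes still reach the customer, which is the gap the paragraph above is pointing at.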

Sure, we can (and must) design better tests. We can, and should, use code coverage tools to ensure every execution path runs. These all lead to much better products, but not to perfection. Because all of the code is known to have run doesn’t mean that complex interactions between inputs won’t lead to bizarre outputs. As the number of decision paths increases, the difficulty of creating comprehensive tests skyrockets.

Perhaps the nature of engineering is that perfection itself is not really a goal. Products are as good as they have to be. Competition is a form of evolution that often leads to better quality. In the ‘70s Japanese automakers, who had practically no U.S. market share, started shipping cars that were reliable and cheap. They stunned Detroit, which was used to making a shoddy product that dealers improved and customers tolerated. Now the playing field has leveled, but at an unprecedented level of reliability.

Perfection may elude us, but we must find better ways to build our products. Wise developers spend their entire careers engaged in the search.

Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. Contact him at jack@ganssle.com.

Embedded Systems Programming January 2002

APD Out Millions For False Alarms

By JEFF JONES
Journal Staff Writer

Do the math and decide for yourself, says Albuquerque Police Department Lt. Larry Sonntag.

Should the APD continue to spend millions each year, and use thousands of hours of officers’ time, to check out every burglar alarm call when almost all of them turn out to be false? A study Sonntag recently completed as part of a college police-management course found that more than 99 percent of the 55,247 burglar alarm calls the APD handled in 2000 were false alarms. The calls cost the APD the equivalent of $3.2 million, the study adds, and it offers two possible solutions.

The first, Sonntag says, is to beef up enforcement of the city’s alarm law and make sure frequent false-alarm culprits are fined.

The second solution — and one Sonntag says is becoming a trend among other law agencies — is for police simply not to go to burglar alarms unless there’s more evidence beyond a jangling alarm.

Sonntag said earlier this week that the study, which APD is examining, is “talking about a change in the level of service we provide. It’s going to be up to the public — it’s a matter of priority.”

Nick Bakas, Albuquerque’s chief public safety officer, said the thought of police not going to every alarm call “goes against my grain.” But he said the APD doesn’t have the staffing to continue handling the onslaught of false alarms.

Sonntag decided to tackle the false-alarm problem for a research paper in a course he took last fall, and the study helped him earn a leadership award.

According to Sonntag, the study showed that 13 percent of the total police calls the APD handled in 2000 were burglar alarms. It estimates the alarms, which require two officers per call, burned up more than 54,000 hours of police time.

“It works out to 27 officers doing nothing but taking alarm calls full-time,” Sonntag explained. That’s essentially the equivalent of one entire shift at an APD substation.
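For readers who want to "do the math" as Sonntag suggests, the study's figures can be checked in a few lines. The call count, officers per call, hours, and cost are from the article; the 2,000-hour working year is an assumption (50 weeks of 40 hours) that is not stated in the article.

```python
ALARM_CALLS = 55_247             # burglar alarm calls in 2000 (from the study)
OFFICERS_PER_CALL = 2            # from the article
TOTAL_HOURS = 54_000             # the article says "more than 54,000 hours"
TOTAL_COST = 3_200_000           # dollars, from the study
HOURS_PER_OFFICER_YEAR = 2_000   # assumption: 50 weeks x 40 hours

full_time_officers = TOTAL_HOURS / HOURS_PER_OFFICER_YEAR
minutes_per_officer = TOTAL_HOURS / (ALARM_CALLS * OFFICERS_PER_CALL) * 60
cost_per_call = TOTAL_COST / ALARM_CALLS

print(f"{full_time_officers:.0f} full-time officers")
print(f"{minutes_per_officer:.0f} minutes per officer per call")
print(f"${cost_per_call:.0f} per call")
```

With that assumed working year the 27-officer figure follows directly, and each call works out to a little under half an hour for each of the two officers sent.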

Sonntag said the alarm ordinance doesn’t require police to go to unverified burglar alarms, but the APD’s internal policy does mandate it.

Verified alarms are those in which someone — say a neighbor, homeowner or alarm company representative — has spotted signs of a potential break-in.

Unverified burglar alarms are lower-priority calls, meaning officers go to them only after handling more pressing matters such as violent or in-progress crimes. Sonntag said it can sometimes take “15, 30 minutes or more” before an officer pulls up to a burglar-alarm call.

Sonntag said police in Las Vegas, Nev., and Salt Lake City no longer go to unverified burglar alarms. He said that according to authorities in Salt Lake City, residents are paying an extra $3 to $5 a month for their alarm service, and the services are sending security guards out to check on unverified alarms. In Albuquerque, “we’re a gift to their industry,” Sonntag said. Frank Keane, president of the New Mexico Burglar and Fire Alarm Association, said the responsibility for checking out a potentially dangerous alarm call should fall “squarely on the police department” — not on security guards with far less training. He warned that some less-reputable companies wouldn’t pick up the tab for security service, meaning no one would respond to some alarms.

He said the best way to solve the problem is by monitoring and fining alarm owners with repeat false alarms.

Keane said a 1998 revamp of the alarm law, which requires alarm owners to annually renew their permits for $10, was supposed to provide money for a computer system that would tell police who the troublemakers were. But the APD still must manually check its records to try to catch the troublemakers, and Keane said the permit money is going into the city’s general fund.

“You want to know where the money is? Filling potholes,” Keane said. “It’s a vicious circle, and it won’t change until there’s some hard, fast documentation.”

Albuquerque’s alarm ordinance allows the city to fine people $50 if they accrue five false alarms in a year. However, the APD acknowledges that many alarms aren’t registered and that many people with numerous false alarms aren’t being fined.

Marilyn Pascali, supervisor of the three-member APD Alarm Ordinance Unit, said she has roughly 20,000 alarms in her database. However, she said there could be 50,000 to 70,000 alarms in the city.

“When I see the money we’re not generating, the little hairs on the back of my neck stand up,” Pascali said.

She said the ordinance could be improved. For starters, she said, she can only fine people who have registered their alarms. The $50 fine also can only be levied once a year, meaning owners can have an unlimited number of false alarms and still pay only $50.

Sonntag said police in one Maryland county set up a graduated fine system that boosted the penalties for false alarms into the thousands of dollars. He said as a result, false-alarm calls there dropped by 40 percent.

Albuquerque Journal Wednesday January 2, 2002

The concept of visible documentation means that you have complete traceability to any level.

The first time csd heard of the concept was in the Boeing 767 project.

At last reported count, the 767 contained about 124 microcontrollers or microcomputers connected on an ARINC 429 serial network.

The 767, like most airplanes, is built by subcontractors.

Subcontractors were supplied a software standards document.

When the documents were received they were checked by Boeing personnel and placed on a shelf at Paine Field in Everett, WA.

Normally no one would ever look at the documents. Unless there is an accident.

The document visibility concept is:

Sure you can use a compiler for your project. But we want the source code for the compiler just in case an accident involves a compiler error.

Visible documentation gives you detailed documentation to any level you choose to descend. Wednesday November 15, 2000 09:56

Break Points
Jack G. Ganssle

Crash And Burn

Analyzing past failures is one of the best ways to prevent them from happening again. Here, Jack distills some lessons from high-profile disasters.

We're not terribly good at learning from our successes; smug with the satisfaction of a job well done, most of us proceed immediately to the next task at hand. It’s a shame that we can’t look at a successful project and then dig deeply into what happened, how, and when, to suck the educational content of the project dry.

Ah, but failures are indeed a different story. High-profile disasters inevitably produce an investigation, calls for Congress to “do something,” and, in the best of circumstances, a change in the way things are built so the accident is not repeated.

Isn’t it astonishing that airplane travel is so reliable? That we can zip around the sky at 600 knots, seven miles up, in an ineffably complex device created by flawed people? Perhaps aviation’s impressive safety record is a by-product of the way the industry manages failures. Every crash is investigated; each yields new training requirements, new design mods, or other system changes to eliminate or reduce the probability of such a disaster striking again.

Though crashes are rare, they do occur, so airliners carry expensive flight data recorders whose sole purpose is to produce post-accident clues to the safety board. What a shame that we firmware folks don’t have a similar attitude. Mostly we’re astonished when our systems break or a bug surfaces.

I hope that in the future we learn to write code proactively, expecting bugs and problems but finding or trapping them early, and leaving a trail of clues as to what went wrong.
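One common way to leave such a trail is a small fixed-size event log that always holds the most recent events, a firmware analogue of the flight data recorder. The sketch below is in Python for brevity; the class name, fields, and event codes are hypothetical, and in firmware the same idea would typically be a ring buffer in RAM that survives a reset.

```python
from collections import deque


class EventTrail:
    """Keeps only the most recent events, like a flight recorder's loop."""

    def __init__(self, capacity: int = 64):
        # A deque with maxlen silently drops the oldest entry when full.
        self._events = deque(maxlen=capacity)

    def record(self, tick: int, code: str, value: int = 0) -> None:
        self._events.append((tick, code, value))

    def dump(self) -> list:
        """Return the surviving events, oldest first, for post-mortem analysis."""
        return list(self._events)


trail = EventTrail(capacity=3)
for tick, code in enumerate(["BOOT", "SENSOR_OK", "LIMIT_EXCEEDED", "SAFE_MODE"]):
    trail.record(tick, code)
print(trail.dump())
```

The point of the fixed capacity is that logging never runs out of memory, so it can stay enabled in the shipped product, which is where the interesting failures happen.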

I believe we should examine disasters, our own and others, because so many embedded systems crash in similar ways. I collect embedded disaster stories not from morbid fascination but because I think they offer universal lessons. Here are a few that are instructive.

NEAR

On December 20, 1998, the Near Earth Asteroid Rendezvous spacecraft, after three years en route to 433 Eros, executed a main engine burn intended to place the vehicle in orbit about the asteroid. The planned 15-minute burn aborted almost immediately; firmware put the spacecraft into a safe mode, as planned in case of such a contingency. But then NEAR unexpectedly went silent. Twenty-seven hours later communications resumed, but ground controllers found that two-thirds of the mission’s fuel had been dumped.

Controllers spent a few days analyzing data to understand what happened, then initiated a series of burns that will ultimately lead to NEAR’s successful rendezvous with the asteroid. But two thirds of the spacecraft’s fuel had been dumped, using all of the mission’s reserves. Enough fuel was left—barely—to complete the original goals of the mission. But reduced fuel means things happen more slowly, so NEAR’s rendezvous with 433 Eros would happen 13 months later than planned.

Like so many system failures, a series of events, each not terribly critical, led to the fuel dump.

Immediately after the engine fired up for the planned 15-minute burn, accelerometers detected a lateral acceleration that exceeded a limit programmed into the firmware. This momentary under-one-second transient was not out of bounds for the mechanical configuration of the spacecraft. But the propulsion unit is cantilevered from the base of the spacecraft, creating a bending response that, according to the report, “was not appreciated.” [1] Quoting further, “In retrospect, the correct thing for the G&C software to have done would have been to ignore (blank out) the accelerometer readings during the brief transient period.” In other words, though the transient wasn’t anticipated, the software was too aggressive in qualifying accelerometer inputs.
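The "blank out" remedy the report describes amounts to ignoring limit violations for a short window after ignition. A minimal sketch of that idea, in Python for brevity; the class name, the one-second window, and the limit value are hypothetical and are not taken from NEAR's flight code.

```python
from dataclasses import dataclass


@dataclass
class LateralAccelMonitor:
    limit: float          # abort threshold (hypothetical units)
    blank_seconds: float  # window after ignition in which readings are ignored

    def should_abort(self, t_since_ignition: float, lateral_accel: float) -> bool:
        """Return True if this sample should abort the burn."""
        if t_since_ignition < self.blank_seconds:
            return False  # start-up transient: do not qualify readings yet
        return abs(lateral_accel) > self.limit


monitor = LateralAccelMonitor(limit=0.1, blank_seconds=1.0)
print(monitor.should_abort(0.2, 0.5))   # inside the blanking window
print(monitor.should_abort(2.0, 0.5))   # after the window, over the limit
```

The design question is the length of the window: too short and the transient still trips the abort, too long and a real fault early in the burn goes unnoticed.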

With the software figuring lateral movement exceeded a pre-programmed limit, it shut the motor down and put the spacecraft into a safe mode. The firmware used thrusters to rotate NEAR to an earth-safe attitude. Code then ran a script designed to change over from thrusters to reaction wheels (heavy spinning wheels that absorb or impart spin to the spacecraft) for attitude control. According to the report, “Due to insufficient review and testing of the clean-up script, the commands needed to make a graceful transition to attitude control using reaction wheels were missing.” Wow!

Excessive spacecraft momentum meant that the reaction wheels just weren’t up to the task of putting NEAR into the earth-safe mode. The firmware did try, for the programmed 300 seconds, but then gave up and started warming up thrusters, which offer much more kick than the momentum wheels. Now the only chance to save the spacecraft was to go to the lowest-level safe mode, “sun-safe,” where it spun slowly around an axis pointing towards the sun. This would keep the batteries charged till ground intervention could help out.

Seven minutes later an error in a data structure (that is, a parameter stored in the firmware) led to the system thinking a momentum wheel that was running at its maximum speed was stopped. A series of race conditions, exacerbated by low batteries, led to some 7,900 seconds of thruster firing over the course of many hours. Eventually NEAR did stabilize in sun-safe mode, though now missing 29kg of critical propellant.

So NEAR’s troubles stem ultimately from a transient due to an odd vibration mode—something the firmware design team could not have anticipated. This rather small transient revealed flaws in the firmware that, in large part, led to a near-catastrophe (pun intended).

The review board inspected some, but not all, of the system’s 80,000 lines of code (C, Ada, and assembly). They uncovered nine software bugs and eight data structure errors. Bugs included poorly designed exception handlers and critical variables that could be erroneously overwritten.

Hindsight is certainly a powerful microscope, especially when zooming in on a specific problem that causes a mishap. But I can’t help but wonder why the post-failure review board’s firmware review was so much more effective than those (if there were any) performed during original design. The report’s recommendation 1c insists that from now on all command scripts must be tested, especially those critical to spacecraft safety, including abort cases. Well, duh!

You’d think configuration management would be a no-brainer for a mission costing many megabucks. Turns out the flight software was version 1.11, but two different version 1.11s existed. The one that was not flying had the proper command script to handle the thruster to reaction wheel changeover. Astonished? I sure was. From the report: “Flight code was stored on a network server in an uncontrolled environment.” Version control is not rocket science!
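A version label alone cannot distinguish two builds that share it. One inexpensive safeguard, sketched here as an illustration and not as anything the NEAR team did, is to identify a build by its label plus a digest of the image contents. The function name and the byte strings are hypothetical.

```python
import hashlib


def build_id(version_label: str, image: bytes) -> str:
    """Combine the human-readable label with a digest of the actual image."""
    digest = hashlib.sha256(image).hexdigest()[:12]
    return f"{version_label}+{digest}"


# Two hypothetical images that both claim to be version 1.11.
with_script = build_id("1.11", b"flight code with changeover script")
without_script = build_id("1.11", b"flight code missing changeover script")
print(with_script != without_script)
```

Any difference in the image changes the digest, so "which 1.11 is flying?" becomes a question a tool can answer.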

Clementine

NEAR is by no means the only space probe to suffer from software issues. (Recent failed Mars missions come immediately to mind.) In 1994, another asteroid-rendezvous spacecraft experienced a somewhat similar failure. Clementine, which very successfully mapped much of the moon from lunar orbit, was supposed to autonomously rendezvous with near-Earth asteroid 1620 Geographos. A software error caused a series of events that depleted the supply of hydrazine propellant, leaving the spacecraft spinning and unable to complete its mission.

A sequencing error triggered an opening of valves for four of the vehicle’s 12 attitude control thrusters, using up all of the propellant. No fuel, no go.

Unfortunately, I’ve been unable to obtain more detailed information about the nature of the software error. However, there’s an enticing—and as yet unavailable—reference in the appendix of the NEAR report to a memo called “How Clementine Really Failed and What NEAR Can Learn.” Is it possible that NEAR’s software failure had been anticipated four years earlier?

NASA published a report on the mission that mentions the failure but does not delve into root causes. [2] Clementine was a technology demonstrator operated by the Ballistic Missile Defense Organization; NASA was a partner, not the main force behind the mission. The Clementine report, though short on firmware details, does delve into the human price of a schedule that’s too tight. A few quotes follow; there’s not much one can add!

“The tight time schedule forced swift decisions and lowered costs, but also took a human toll. The stringent budget and the firm limitations on reserves guaranteed that the mission would be relatively inexpensive, but surely reduced the mission’s capability, may have made it less cost-effective, and perhaps ultimately led to the loss of the spacecraft before the completion of the asteroid flyby component of the mission.”

“The mission operations phase of the Clementine project appears to have been as much a triumph of human dedication and motivation as that of deliberate organization. The inadequate schedule ... ensured that the spacecraft was launched without all of the software having been written and tested.”

“Further, the spacecraft performance was marred by numerous computer crashes. It is no surprise that the team was exhausted by the end of the lunar mapping phase.”

Ariane 5

In May 1998 I described the 1996 failure of Ariane 5 (“Disaster;” p. 113), the large launch vehicle that tumbled and destroyed itself 40 seconds after blast-off. Since then more information has come to my attention.

Shortly after launch, the Inertial Reference System (SRI, the apparently scrambled acronym a result of translation from French to English) detected an overflow when converting a 64-bit floating-point number to a 16-bit signed integer. An exception handler noted the problem and shut the SRI down. Due to the incredible expense of these missions (this maiden flight itself had two commercial spacecraft aboard, each valued at about $100 million) a back-up SRI stood ready to take over in case of the primary’s failure. SRI number two did indeed assume navigation responsibility, but it ran identical code, encountered the same error, and shut down as well.

Why did the overflow occur? This code had been ported from the much smaller Ariane 4. According to the report, “...it is important to note that it was jointly agreed [between project partners at several contractual levels] not to include the Ariane 5 trajectory data in the SRI requirements and specification.” [3] Clearly, a decision that doomed the project to failure. Here’s a case where the firmware was, in fact, perfect, if perfection is measured by how well the code meets the spec. Again, “The supplier of the SRI was only following the specification given it.”

As with NEAR, Ariane’s crash resulted from a series of coupled events rather than any single problem. The exception was largely a result of poor specification. But designers did realize that some variables might go out of range; in fact they specifically wrote code to monitor four of the seven critical variables. Why were three left exposed? An assumption was made that physical limits made it impossible for these three to overflow (an assumption that proved expensively faulty). Further, a target of 80% processor loading meant that checking all calculations would be prohibitively expensive.
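The conversion at the heart of the failure, a floating-point value narrowed to a 16-bit signed integer, can be guarded in two ways: refuse values that do not fit, or clamp them to the nearest limit. The sketch below shows both, in Python for brevity. It illustrates the general technique only; it does not describe the SRI's code, and which policy is appropriate depends on what the value is used for.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(x: float) -> int:
    """Convert to a 16-bit signed value, raising if x does not fit."""
    if x != x or not (INT16_MIN <= x <= INT16_MAX):  # x != x is true only for NaN
        raise OverflowError(f"{x!r} does not fit in a 16-bit signed integer")
    return int(x)  # truncates toward zero


def to_int16_saturating(x: float) -> int:
    """Convert, clamping out-of-range values to the nearest limit."""
    if x != x:
        raise ValueError("NaN has no integer value")
    return int(max(INT16_MIN, min(INT16_MAX, x)))


print(to_int16_checked(1234.7))
print(to_int16_saturating(40000.0))
```

Either check costs a couple of comparisons per conversion, which is the processor-loading trade-off the designers were weighing when they left three variables unprotected.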

But the exception itself didn’t cause Ariane’s crash. When both SRIs failed, they did so gracefully, and even returned diagnostic data to the vehicle’s main computer that indicated the flight data was invalid. But the main computer ignored the diagnostic bit, assumed the data was valid, and used this incorrect information to guide the vehicle.

As a result of trying to use bad data, the computer commanded the engine nozzles to hard-over deflection, resulting in the tumbling and destruction of the rocket.
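The missing step on the receiving side was simply to look at the validity flag before using the numbers. A minimal sketch of that check, in Python for brevity; the data layout, the names, and the policy of falling back to the last good sample are assumptions for illustration and are not taken from the Ariane 5 design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttitudeSample:
    valid: bool   # False when the sending unit reports a fault
    pitch: float
    yaw: float


def select_attitude(primary: AttitudeSample,
                    backup: AttitudeSample,
                    last_good: Optional[AttitudeSample]) -> Optional[AttitudeSample]:
    """Return the first sample flagged valid, else the last known good one."""
    for sample in (primary, backup):
        if sample.valid:
            return sample
    # Both units report a fault: never steer on their payload.
    return last_good


failed = AttitudeSample(valid=False, pitch=9999.0, yaw=-9999.0)  # diagnostic pattern
previous = AttitudeSample(valid=True, pitch=1.5, yaw=0.2)
print(select_attitude(failed, failed, previous))
```

What the vehicle should do once it is flying on stale data is a separate and much harder question, but it is at least a question the software gets to ask.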

To complicate the picture further, the floating-point operation that overflowed was not even a calculation required for normal flight operations. It was left-over code, a relic of the firmware’s Ariane 4 heritage, code that had meaning only before lift-off.

The review board also noted that, though testing of the SRI is hard, it’s quite possible and (gasp!) maybe even a good idea: “Had such a test been performed by the supplier or as part of the acceptance test the failure mechanism would have been exposed.”

To summarize: poorly tested code that should not have been running caused a floating point conversion error because the spec didn’t call for an understanding of real flight dynamics. In an effort to keep processor loading low, the variables involved weren’t monitored, though others were. Two redundant SRIs running the same code performed identically and shut down. The main computer ignored the SRI “bad data” bit and tried to fly using corrupt information.

Another interesting tidbit from the report: “...the view had been taken that software should be considered correct until it is shown to be at fault.” This is the rationale behind using identical code on redundant SRIs. It does beg the question of why sufficient testing to isolate those potential software faults was not performed.

Conclusion

My embedded disaster collection grows daily. I expect, as embedded systems become ever more pervasive, that no end is in sight to the firmware crisis we’ll all experience.

Several common threads run through many of these stories. The first is that of error handling. Look at Ariane: when the software failed, it properly set a diagnostic bit that meant “ignore this data.” Yet the main CPU blithely carried on, ignoring the error bit instead.
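
Honoring an error bit takes very little code. Here is a hedged sketch in C; the nav_sample_t layout, the select_attitude name, and the hold-last-good policy are all assumptions for illustration, and a real vehicle would need a policy chosen by its designers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical message from a sensor unit to the main computer. */
typedef struct {
    bool    valid;      /* false means "ignore this data"        */
    int16_t attitude;   /* meaningful only when valid is true    */
} nav_sample_t;

/* Return the attitude to steer by. A sample flagged invalid never reaches
 * the control law; the last known-good value is held instead. */
int16_t select_attitude(const nav_sample_t *sample, int16_t *last_good)
{
    if (sample->valid) {
        *last_good = sample->attitude;
    }
    return *last_good;
}
```

The point is structural: the data and its validity flag travel together, and the only path to the payload runs through the check.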

Inadequate testing, too, appears repeatedly as a theme in disasters. The NEAR team had simulators and prototypes, but these test platforms worked poorly. Their fidelity was suspect, leaving the engineers to wonder, when problems surfaced, if the element at fault was the simulator or the code. Ariane, too, had poor simulators and thus only partially tested software. On Clementine it appears that some code was not tested at all.

Interprocessor communications is a constant source of trouble. Though I’m a great believer in using multiple CPUs to reduce software complexity and workload, problems result when too much comm is required. NEAR’s computers ran into race conditions. Ariane’s error bit was disregarded.

Those who don’t learn from the past are sentenced to repeat it.

Those who cannot remember the past are condemned to repeat it.
George Santayana, 1905

Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. He founded two companies specializing in embedded systems. Contact him at jack@ganssle.com.

References

1. “The NEAR Rendezvous Burn Anomaly of December 1998.” Available at http://near.jhuapl.edu/anom/index.html
2. “Lessons Learned from the Clementine Mission,” NASA/CR report 97-207442.
3. “Ariane 5, Flight 501 Failure,” Report by the Inquiry Board. Available at rk.gsfc.nasa.gov:80/richcontent/Reports/Failure_reports/Ariane501.htm.

http://perso.respublica.fr/f4rtp/docgene.html

Embedded Systems Programming November 2000 http://www.embedded.com/

Success in high tech projects is enhanced by reading and taking the advice of seasoned experts Friday October 27, 2000 07:53

BREAK POINTS

Jack G. Ganssle

Momisms

Momism (mom iz’ em) n. 1. A brief statement of a principle passed maternally. 2. A tersely worded statement of an observation of truth: APHORISM.

The image of mom gently guiding her young ones down paths of righteousness, teaching them the basic elements of being civilized, helping with school work, is powerful indeed. Yet we still leave home ill-equipped for real life; college itself does but a poor job in preparing us for careers and adulthood. Perhaps mom should have taught us some more lessons. Here are a few thoughts.

Interrupts

Keep ISRs short. Debugging interrupt service routines is tough and, in some cases, almost impossible. Too often those expensive tools work poorly or not at all inside an ISR. Breakpoints fail because they operate at human speeds, while the interrupts come much faster. Single stepping, the old standby of many developers, just won’t work where interrupts arrive at any sort of reasonable rate. Single step in a section of code where interrupts are reenabled, and you’ll likely debug different instantiations of the ISR with each step. An emulator with trace will capture the service routine’s execution, but even the largest trace buffers fill quickly from loops and recursion.

A very wise friend taught me the fundamental rule of debugging ISRs: don’t. Keep the routine so short, so simple, that you can debug by inspection. A good rule of thumb is to limit ISRs to a dozen or so lines. Worst case, keep them shorter than a page. If the ISR really must do a lot of work, why not spawn a task that handles the complexity?
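
One common shape for a short ISR is sketched below in C. The names (uart_rx_isr, uart_rx_get) and the 16-entry queue are hypothetical, and the received byte is passed in as a parameter so the sketch runs on a host; on a real part the ISR would read it from the peripheral's data register. The ISR only queues the byte, and a task or main loop does the actual work.

```c
#include <stdbool.h>
#include <stdint.h>

#define RX_QUEUE_SIZE 16u            /* power of two so wrap is a cheap mask */

static volatile uint8_t rx_queue[RX_QUEUE_SIZE];
static volatile uint8_t rx_head;     /* written only by the ISR  */
static volatile uint8_t rx_tail;     /* written only by the task */

/* Hypothetical receive ISR: store the byte and leave. Short enough to
 * debug by inspection. A byte arriving when the queue is full is dropped. */
void uart_rx_isr(uint8_t received_byte)
{
    uint8_t next = (uint8_t)((rx_head + 1u) & (RX_QUEUE_SIZE - 1u));
    if (next != rx_tail) {
        rx_queue[rx_head] = received_byte;
        rx_head = next;
    }
}

/* Called from the task or main loop, where breakpoints and single
 * stepping work normally. Returns false when nothing is waiting. */
bool uart_rx_get(uint8_t *out)
{
    if (rx_tail == rx_head) {
        return false;
    }
    *out = rx_queue[rx_tail];
    rx_tail = (uint8_t)((rx_tail + 1u) & (RX_QUEUE_SIZE - 1u));
    return true;
}
```

Because each index has a single writer, this arrangement needs no interrupt masking on CPUs where a byte store is atomic. That is an assumption to verify for your processor.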

Avoid NMI. Non-maskable interrupt, also known as Trap, level 7, or any of a number of monikers, can’t be shut off, ever. Other interrupt inputs succumb to the “disable” instruction, and generally turn off automatically when a hardware-initiated interrupt occurs. Until you explicitly turn the interrupt back on, the unavoidable non-reentrant parts of the ISR are safe. An NMI handler, however, is never safe. Non-reentrant code will be destroyed if the interrupt recurs. Many CPUs use an edge sensitive input for this beast, so the slightest bit of noise can create multiple false NMIs over the course of a few microseconds. And debugging tools, like emulators, often couple small bits of spurious noise into the target system. Reserve NMI for one-time events like power failure or the apocalypse.

Fill unused vectors. Though a CPU might support hundreds of interrupt sources, each one defined by an entry in the dispatch table, we rarely use more than a handful. If you leave those unused dispatch table entries blank, any weird vectoring will crash the application horribly, leaving no trail of evidence to the root cause.

Why would spurious interrupts occur? Maybe the hardware is defective or glitchy—it’s a prototype during development, isn’t it? Perhaps you’ve misprogrammed one of the hundreds of registers inside of today’s too complex peripherals.

Better to fill all unused vectors with a pointer to a debug routine that either logs the erroneous interrupt or reaches a lurking breakpoint.
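
A host-runnable sketch of the idea in C follows. The 32-entry table, the dispatch function standing in for the hardware's vectoring, and all of the names are assumptions for illustration; on a real CPU the table lives where the architecture says it does and the hardware does the dispatching.

```c
#include <stddef.h>
#include <stdint.h>

#define VECTOR_COUNT 32u             /* assumed table size for this sketch */

typedef void (*isr_handler_t)(void);

static isr_handler_t     vector_table[VECTOR_COUNT];
static volatile uint32_t spurious_count;        /* stray interrupts seen   */
static volatile uint32_t last_spurious_vector;  /* which one fired last    */
static volatile uint32_t active_vector;         /* set by dispatch()       */
static volatile uint32_t timer_ticks;

/* Catch-all for every vector the application does not use: log and return.
 * On a target this is also a convenient home for a lurking breakpoint. */
static void unexpected_isr(void)
{
    spurious_count++;
    last_spurious_vector = active_vector;
}

/* Example of a legitimate handler. */
void timer_isr(void) { timer_ticks++; }

/* Point every entry at the catch-all; real handlers are installed after. */
void vectors_init(void)
{
    for (size_t i = 0; i < VECTOR_COUNT; i++) {
        vector_table[i] = unexpected_isr;
    }
}

void vectors_install(uint32_t vector, isr_handler_t handler)
{
    if (vector < VECTOR_COUNT && handler != NULL) {
        vector_table[vector] = handler;
    }
}

/* Stand-in for the hardware's vectoring, so the sketch runs on a host. */
void dispatch(uint32_t vector)
{
    if (vector < VECTOR_COUNT) {
        active_vector = vector;
        vector_table[vector]();
    }
}

uint32_t timer_tick_count(void) { return timer_ticks; }
uint32_t spurious_total(void)   { return spurious_count; }
uint32_t spurious_last(void)    { return last_spurious_vector; }
```

Initializing the whole table first and installing real handlers second means a forgotten vector ends up logged rather than blank.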

Listen—don’t interrupt others. You’ll learn far more listening than talking, and the listener never puts his foot in his mouth.

Bugs

Inspect rather than debug. Bottom line: code inspections find bugs some 20 times more efficiently than debugging by test. Inspect the code, design, specs, and all relevant design documents to find problems before writing/debugging/testing and then chucking a lot of expensive firmware. Inspections won’t find all of the problems, but a well-implemented inspection process wrings out 70% to 80% of the defects for a fraction of the cost.

Studies indicate that in many systems 50% of the code never gets tested. It’s difficult at best to devise test conditions for every error condition/exception handler, and for IFs nested five deep. Post-compile error rates run around 5% [1] (five bugs per 100 lines of code), so a small system with 10,000 lines of code starts out with some 500 bugs; if testing exercises only half of the code, 250 of them might still be lurking after it’s completely “tested.” Though devising better tests is surely a good idea, inspections will bring most of these hidden problems to light.

Inspect approximately 100 to 200 lines of code per hour. There’s a sweet spot at 150 lines of code per hour where inspections proceed very efficiently yet unveil most of the defects [2]. Monitor the inspection rate. These numbers, which come from data I’ve been accumulating for a number of years, suggest that inspections cost (assuming there’s no benefit!) approximately $2 per line of code, or roughly 10% of the usual $15 to $30 per line cost for most commercial firmware.

Track debugging time. It sure is fun to crank code. But in many organizations a large portion of a project’s time gets consumed in the debugging phase. This is a certain sign of dysfunctional development. It indicates that the developers either write the code carelessly or spend too little time on specification and design.

Measure bug rates. A few functions or modules typically exhibit most of the product’s defects. We’ve all been there. We’ve all worked on a function so complex, so poorly understood, and so badly coded that we’re terrified of opening it in the editor. Change a single character in a comment and the code stops working. Barry Boehm, the software estimation guru, has shown that these error-prone functions, which typically represent a small percentage of the total code base, contain 60% to 80% of the errors. More compellingly, his data indicate these problem-functions eat up four times more effort than their well-behaved brethren. It behooves us to take data to quantitatively figure out which are bad, and then toss bad code and start again.

Debug proactively. Face it, you’re going to have problems. The code will be far from perfect. Plan for bugs and instrument your code to find them quickly. Does your RTOS include a stack-overflow checker? Leave it enabled whilst debugging. Or seed the stack with a pattern, then stop the debugger from time to time to see if stacks are too big or too small.
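
Seeding the stack can be sketched in a few lines of C. The fill value and function names here are arbitrary choices for illustration, and the sketch assumes a stack that grows downward from the top of the array; reverse the scan for a CPU whose stack grows upward.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5A5A5A5u       /* arbitrary, easy to spot in a dump */

/* Paint the whole stack area with the fill pattern before the task runs. */
void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL;
    }
}

/* Count words never overwritten. Assumes the stack grows downward from
 * stack[words - 1], so the untouched words sit at the low end. */
size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t unused = 0;
    while (unused < words && stack[unused] == STACK_FILL) {
        unused++;
    }
    return unused;
}
```

A result of zero means the stack has been used to its limit and may well have overflowed. A large result after long, varied runs suggests the allocation is generous. The count can read slightly high if the deepest data written happens to equal the fill value.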

Why not fill unused ROM/flash with nasty instructions, like software interrupts, that vector off to a debug routine? When code crashes it often just wanders off, perhaps into your carefully seeded ROM area. The software interrupts and associated handler will capture the crash quickly, and in safety-critical systems can bring the system to a known harmless state.
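
As a rough illustration in C, here is how an image-building step might seed the unused space. The function name is hypothetical and the opcode is only an example: 0xCC is the one-byte INT3 breakpoint on x86, so substitute whatever single-byte trap or software interrupt your CPU offers. In practice a linker or programmer fill option may do the same job.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRAP_OPCODE 0xCCu   /* x86 INT3; substitute your CPU's one-byte trap */

/* Build a ROM image: trap opcodes everywhere, then the program on top.
 * Returns false, leaving the image untouched, if the code does not fit. */
bool rom_image_build(uint8_t *rom, size_t rom_size,
                     const uint8_t *code, size_t code_size)
{
    if (code_size > rom_size) {
        return false;
    }
    memset(rom, TRAP_OPCODE, rom_size);
    memcpy(rom, code, code_size);
    return true;
}
```

A one-byte opcode matters: a wandering program counter can land on any address, and a multi-byte trap instruction entered in the middle decodes as something else.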

You’d be surprised how many embedded systems, those that are “done” and shipping, access memory in bizarre ways. Writing to ROM. Reading from unused memory. Though perhaps harmless, these odd behaviors indicate lurking software problems. Consider setting up unused/extra chip selects to trigger on any errant memory access, or expand your PLD decode logic to signal such problems by means of an extra output. Code that behaves unexpectedly is flawed, even when the symptoms seem benign.

Clean your room. Bugs thrive in messy places.

Performance and size

Sometimes engineering costs more than faster hardware. If you’re building a million of something, production costs overwhelm development costs. That’s much less true for small production runs.

Never forget that one of the “production” costs is that of the amortized engineering. If a design decision adds a month to the project, at perhaps a cost of $20,000, then the product’s price must include this additional cost: $20,000 divided by the number of units made. Does an 8051 really make sense for your low-run application? Would a bigger CPU dramatically reduce development costs? Will shoe-horning bytes into an undersized code space eat weeks of expensive developers’ time?

A tiny CPU is, without question, the perfect choice for a huge range of applications. You just can’t beat them for minimizing PCB real estate, recurring costs, and power consumption. And I’ve long been a proponent of distributing small CPUs around a board to handle small chores like I/O processing. But do understand the very real costs of working in a confined address space with perhaps underpowered tools and languages. Make CPU tradeoffs that minimize total system cost, from engineering through production.

Overload a CPU at your peril. A 90% loaded processor doubles development time. At 95%, figure on tripling the effort. This is hardly surprising to our intuitive understanding of programming. At one point or another, we’ve all battled a performance-bound system by tuning every bit of code it contains, instead of the 20% that is typically responsible for most of the real-time problems. Margins minimize engineering costs by allowing us to be a little sloppy. They let us deliver a product which is not quite tuned to perfection, avoiding the extreme costs that entails.

Create a dynamic model of code size. Few of us really create a meaningful estimate of ROM/flash needs. Instead, we tend to ask for as much ROM as possible, or perhaps double the amount used on the previous project. It’s tough to estimate binary sizes when starting a large project.

But we can’t abdicate our responsibility to monitor code growth. Telling the boss a month before delivery, when the hardware design is cast in PCBs, that we need more ROM is a sure path to career stagnation.

It’s a simple matter to build a spreadsheet that lists all of the modules the system will contain with estimates of their size (in lines of code, function points, or any other reasonable measure). Edit in the real source line and object size of each module as it’s completed. Over time you’ll find a reasonable approximation to the number of bytes of code per line of C; have the model apply this to the as-yet-uncompleted portions of the code to predict final system size. Odds are you’ll spot ROM shortages early on, when there’s still time to take design action.
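
The same model fits in a few lines of C for those who would rather not open a spreadsheet. This is a sketch under assumptions: the module_t fields and the predict_code_bytes name are invented for illustration, and size is measured in source lines.

```c
#include <stddef.h>

/* One row of the hypothetical spreadsheet. */
typedef struct {
    const char *name;
    unsigned    estimated_lines;   /* up-front guess                  */
    unsigned    actual_lines;      /* 0 until the module is completed */
    unsigned    actual_bytes;      /* object size, 0 until completed  */
} module_t;

/* Predict final code size: real bytes for finished modules, plus estimated
 * lines times the bytes-per-line ratio observed so far for the rest.
 * Returns 0 if no module is finished yet (there is no ratio to apply). */
unsigned long predict_code_bytes(const module_t *modules, size_t count)
{
    unsigned long done_lines = 0, done_bytes = 0, pending_lines = 0;

    for (size_t i = 0; i < count; i++) {
        if (modules[i].actual_lines > 0) {
            done_lines += modules[i].actual_lines;
            done_bytes += modules[i].actual_bytes;
        } else {
            pending_lines += modules[i].estimated_lines;
        }
    }
    if (done_lines == 0) {
        return 0;
    }
    /* Round up: underestimating ROM is the expensive mistake. */
    return done_bytes
         + (pending_lines * done_bytes + done_lines - 1) / done_lines;
}
```

With made-up numbers, two finished modules totaling 750 lines and 6,000 bytes give 8 bytes per line, so an unwritten module estimated at 1,000 lines adds 8,000 bytes for a predicted total of 14,000.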

Size doesn’t matter. Be content with yourself and who you are.

Reuse and maintenance

Be realistic about reuse. Reuse is hard. Good rules of thumb: Before you can develop code for reuse you must have developed it at least three times. Before you can reap the benefits of reuse you must have reused it three times. One proposal for Reagan’s version of the Star Wars missile defense system, which was pegged at 100 million lines of code, was that every module had to have been used three times before being included in the system. Not a bad idea, especially for a system so difficult to test.

Avoid dependencies. Global variables are responsible for most of the evil in the world. A program infested with globals becomes non-maintainable, buggy, and a nightmare for all team members. Globals also make reuse all but impossible.

Embedded systems suffer from another dependency problem: code that talks to hardware. Encapsulate all I/O operations.
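
A small C sketch of what encapsulated I/O can look like. Everything specific here is hypothetical: the TARGET_BUILD macro, the register address 0x4000, and the LED bit position are placeholders, and the host build swaps in an ordinary variable so the module can be exercised without hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define LED_BIT 0x01u                    /* assumed bit position */

#ifdef TARGET_BUILD
#define LED_PORT (*(volatile uint8_t *)0x4000u)   /* hypothetical address */
#else
static volatile uint8_t led_port_sim;    /* host stand-in for the register */
#define LED_PORT led_port_sim
#endif

/* The rest of the program calls these and never touches LED_PORT. */
void led_set(bool on)
{
    if (on) {
        LED_PORT |= LED_BIT;
    } else {
        LED_PORT &= (uint8_t)~LED_BIT;
    }
}

bool led_is_on(void)
{
    return (LED_PORT & LED_BIT) != 0u;
}
```

When the hardware changes, only this file changes. The callers, and any tests written against led_set and led_is_on, carry over untouched.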

Self-documenting code does not exist. Long variable names do not self-documenting code make. Judicious name selection is just a part of good coding.

Comment aggressively. Any idiot can write code. Even teenaged hackers manage to crank out working software. Professionals create beautiful code that is crystal clear and a joy to maintain. Accurate, lucid comments are an important ingredient of well-written firmware. Code is nothing more than the computerese description of what’s going on; comments are the human description.

Use active voice. Capitalize using standard English rules. Check your spelling. Describe concisely the goes-intas and goes-outtas, as well as what happens and why. Some enlightened programmers write all of the comments first, and then fill in the C at their leisure. The hard part, after all, is creating an accurate, documented design. The code is nothing more than a simple translation of a good design into computer-lingo.

Keep compiles clean. Don’t come to the dinner table with dirty hands, and don’t deliver code reeking of unpleasant warnings.

Why do we tolerate warning messages from our compiler? Firmware lives forever. When someone else opens your code five years from now for an upgrade and finds hundreds of warnings scrolling off the screen, he’ll have no idea if the messages are expected or are an effect of the way he’s reinstalled the tools. Maintenance is an unavoidable aspect of the software development process; he who programs without maintenance in mind is an amateur.

Keep the code strictly ANSI compliant to minimize warnings and maximize portability. Segment unavoidable deviations from the standard to separate modules which document expected unusual compiler behaviors.

Encapsulate. The OOP folks chant “encapsulation, polymorphism and inheritance.” Of those three, encapsulation is the easiest and most powerful tool for building well-written, easy-to-understand code. It’s equally effective in assembly, C, or C++. Bind “methods” (code that accesses a device or data structure) with the data itself.

Floss. You’ll miss your teeth when they’re gone.

References

1. Gilb, Tom and Dorothy Graham. Software Inspection. Reading, MA: Addison-Wesley, 1993.

2. Wheeler, David A., Bill Brykczynski, and Reginald N. Meeson, eds. Software Inspection: An Industry Best Practice. Los Alamitos: IEEE Computer Society, 1996.

Embedded Systems Programming October 2000 http://www.embedded.com/

Studying, understanding and avoiding pitfalls is important in real-world projects Friday October 27, 2000 07:39

Software failure can lead to financial catastrophe

The aftermath of failed and abandoned software projects can cost your company millions. But what is causing this trend of botches and how can you avoid becoming an application failure statistic?

By Bruce Abbott,
For InfoWorld Test Center

WE SAT IN A GRAY cafe at 8 a.m., eating gray eggs and gray home fries. Our watery coffee wasn’t even strong enough to be gray. Since four o’clock the previous afternoon we had been trying, with little success, to install a relatively simple custom-built software program.

“I don’t know how I got into this situation,” the small-business entrepreneur said in hushed tones. He didn’t know if he should be telling me this.

“I’ve got customers I can’t bill because I can’t get their data. I’ve got people waiting to come online but since I can’t install the program, I can’t get any new cash flow. My business is on the line. My reputation is at stake.”

Software failure is a terrifying possibility that should scare the bejeezus out of any company looking to stay in business. Although the media inundates us every day with software success stories, news on the growing economy, and reports that the Internet has caused an information revolution that rivals the advent of Gutenberg’s printing press, we rarely hear the flip side of this fairy tale: Hundreds of thousands of software projects fail every year.

Of course, software companies and IT professionals are not eager to share their disaster stories with the public, which is one reason we don’t hear of these doomed efforts. But application failure happens with alarming regularity. In fact, The Wall Street Journal estimates that 50 percent of all corporate technology projects don’t meet expectations, and that 42 percent of software projects are abandoned before even getting off the ground.

This is scary stuff. Application failure can be cataclysmal. Consider NASA’s 1999 mission to launch a spacecraft into the orbit of Mars. Instead of reaching a cruising altitude around the planet, the spacecraft crashed into Mars, destroying itself and kissing $125 million goodbye.

Investigators blamed the crash on NASA’s failure to convert English rocket thrust measurements into newtons. Ouch. Maybe we shouldn’t feel so bad when our little Web sites flop. After all, it’s only our jobs and reputations at stake.

Complicating the problem

Considering all the money and time that goes into these projects, it seems peculiar that the application failure rate is so high. But developers and managers consistently make several mistakes that help explain this phenomenon.

First of all, software programs are often made more complex than they need to be. Programming is the art of taking complicated requirements and breaking them down into simple solutions. But sometimes the opposite occurs, and overly complex solutions are used to solve simple tasks.

This may be because the programmer does not fully understand the project requirements or is not familiar with the tools used to create simple solutions. Or perhaps the programmer just can’t think simply. Many programmers take deliberate coding joy rides, often using unintended side effects of certain language functions to perform crucial tasks. Repairing such a system is like removing a wall in a house of cards: Have fun playing 52-card pickup.

At times the dreaded Wizard of Oz syndrome strikes desperate developers in need of immediate resources. Let’s say a manager is seeking a Java programmer for a business-to-business application that is running behind schedule. In his blind desperation, he may hire a Cobol programmer, who although very competent, lacks the training necessary for the Java project at hand.

Just as the Wizard of Oz makes the Scarecrow a Doctor of Thinkology, so the manager anoints the Cobol programmer a Java programmer. Because he is clever and hard-working, the Cobol programmer may soon learn the Java syntax. But what the manager ends up with is “Jobol” and a project even further behind schedule.

Giving in to temptation

Philip Riles of Gemstone Associates, a Sacramento, Calif., company that builds educational software, believes that because software professionals usually come straight out of academia, they are not accustomed to “the real world of budgets and deadlines.”

Unlike construction workers and auto mechanics, software engineers often choose inventiveness over practicality. Because they are curious and forward-thinking, they often use the latest, most flashy solution where a more mature one might work better. Stephen Flowers, author of the book Software Failure: Management Failure, calls this “the lure of the leading edge.”

By contrast, the newest technology in the construction industry is the cordless drill. No wonder construction workers are better able to keep their minds on the job.

Being too agreeable

Overoptimism is a dangerous virus that crosses all party lines and one that feeds on itself. Salespeople like to sell it, customers like to believe it, project managers like to hear it. When a manager asks for an estimated date of completion, programmers, eager to please, often respond, “I can whip that out in a few hours,” when the answer ought to be, “It seems pretty complex and may take a long time to finish.”

Optimism makes everybody happy and promises big money — never mind that it causes false expectations, missed deadlines, untested programs, sloppy coding, and bug-ridden applications.

Testing is the No. 1 safeguard against software failure, especially against potent postproduction bugs. Yet testing always gets dropped when a project is under the gun. The project manager, in a rush to meet a promised deadline, is happy to assume that the programmers have conducted sufficient testing on their own. But of course the programmers, under the same pressure and already late on another unrealistic deadline, have had no time to test their software.

That’s OK, right? After all, nobody makes any money from testing (except testers) and customers never demand more testing over fewer features. With these truths in mind, programmers and managers would rather spend time churning out new code than wasting time testing finished software. And so the bugs take hold.

Flip-flopping requirements

Changing software requirements is a necessary evil in today’s evolving market. Business practices are shifting at such a rapid rate that software models must be very flexible to adapt. Of course, flexibility is good. It allows for some mistakes along the way and also for a fickle customer’s new ideas. But it is essential to have a pretty solid idea of your project’s requirements before you invest a lot of money in it.

Some developers are so confident in the flexibility of their software that they fail to define a sufficient number of specifications at a project’s outset.

This inevitably leads to costly design tune-ups and a convoluted end result that is difficult to test or is overly complex. The fewer changing requirements, the more direct the distance between your business objective and your application.

The truth is that software people love changing requirements. Ever drop by your car mechanic’s to replace an $8 hose and learn that what you really need is a new $1,700 transmission? Oh! And you’d better replace the clutch while your transmission is out.

Sure, your mechanic is shaking his head and clucking his tongue, but inside he’s jumping for joy. He can almost see that big-screen TV in his living room. Software people are the same way. Changing requirements mean more money.

Changing requirements also give developers a catchall excuse for failure. The application doesn’t work? Well, there were changing requirements. The project isn’t on time? A change in requirements held me up.

Yes, you ought to utilize today’s flexible software to accommodate design modifications. But try to manage your changes wisely and make sure your customer understands the implications of redefining requirements along the way.

Missing precious pieces

Successful projects require that all involved components and parties work together in perfect harmony. If even one of these components is missing — and it doesn’t have to be the most important or expensive one — the entire project will come to a screeching halt.

These “missing links” cause a bevy of software catastrophes every day. There are many reasons for the missing link. Vendors may go out of business or sell products that do not perform to expectation. Now the project is suddenly way over budget. Someone on your staff may be better at promising than producing results. Now you’re way past deadline. And of course Joe, who alone holds crucial information on the project, may get hit by a truck. Now your project is history.

To avoid these missing links, managers should use mature products from reliable vendors and make sure there is a strong knowledge overlap among team members.

Biting off too much

Experience tells us that project success is inversely proportional to project size. A study by the Standish Group reveals that projects costing less than $750,000 succeed 55 percent of the time, those in the $1 million to $2 million range have an 18 percent success rate, and those in the $5 million to $10 million range succeed only 7 percent of the time. The authors of AntiPatterns: Refactoring Software, Architectures, and Projects in Crisis boldly state that “extremely large projects are futile efforts.”

The Internal Revenue Service's failed $4 billion Tax Systems Modernization Program (TSM) of the 1990s is a glaring example of this tendency. TSM failed because it took a “big bang” approach in which the IRS tried to build a new and gigantic information system to integrate many disparate and out-of-date systems.

Having learned from this lesson, Arthur Gross, former CIO of the IRS, declared that the agency would from now on take “an evolutionary approach to modernization,” building on systems already in place.

Denying the negatives

In his book Death March, Ed Yourdon dismisses certain software projects as exercises in futility. These “death march” projects are identified as having at least one key factor 50 percent off reasonable norms. Either the schedule is 50 percent too short, the staff is 50 percent too small, the budget is 50 percent too meager, or the scope of features is 50 percent too large.

Many death march projects are caused by dysfunctional organizations in which certain groups won’t support each other, goals are unrealistic, and good old-fashioned stupidity reigns.

But perhaps the most significant contributing factor to these death marches is the software industry’s refusal to acknowledge its own problems. The Standish Group reports that because the computer industry has a tendency to cover up or rationalize its failures, it continues to make the same planning, programming, and management mistakes again and again.

This is very different from practices in other fields, such as construction and engineering, where every disaster is investigated and reported upon. And lucky for us. If bridges collapsed and elevators plunged as often as applications fail, none of us would be here to complain about it.

Bruce Abbott is a self-employed Java developer. Send him e-mail at [email protected].

INFOWORLD OCTOBER 2, 2000 http://www.infoworld.com/

The 767 microcontroller/computer network is contrasted to a PC's mainframe-sized Windows operating systems.

Microcontrollers/microcomputers will have a relatively easy time talking to Windows 98 and 2000 applications through WDM drivers.

Tuesday May 16, 2000 14:50

See how planes are built in the world’s largest building

Boeing plant one of Washington's most popular tourist destinations

By SHANA MCNALLY
The Associated Press

EVERETT, Wash. - The world’s largest building is not in New York, Chicago or even Kuala Lumpur. It’s right here, 30 miles north of Seattle, where many of the world’s planes are manufactured. Boeing Co.’s Everett manufacturing plant, which encloses 472 million cubic feet of space and covers more than 98 acres, dwarfs the famous tall buildings of the world in volume and has become one of the most popular tourist destinations in the state of Washington.

Every year, 140,000 visitors walk along the building’s many catwalks, looking down on the wide-body jets being assembled below. Roughly 66 percent of the tour’s annual visitors come from the United States and 13 percent from Europe.

It wasn’t initially meant for visitors. Construction on the facility began in 1966, when Boeing announced it would start making the 747, a jetliner capable of carrying nearly twice as many passengers as previous models.

But before the first 747 even rolled out the door, 13,000 people came to the plant to see the planes being made. Company officials decided to create a tour, which began in 1968.

In 1984 a 5,500-square-foot tour center was added to accommodate visitors.

Part of the appeal of the tour lies in Boeing’s name recognition.

Boeing jetliners take off once every 3.5 seconds or 24,600 times a day, flying more than 3 million people about 17.1 million miles every 24 hours.

And there’s the factory’s size.

The facility, which is 11 stories tall, is listed in the Guinness Book of Records as the world’s largest building by volume.

To get some perspective on the sheer size, 911 basketball courts or 74 American-style football fields would fit in the building. So would Disneyland — along with 12 acres of covered parking.

The building requires more than 1 million light fixtures and has an $18 million annual electricity bill. It was expanded by 45 percent in 1979 for production of the 767 and an additional 50 percent in 1990 for the 777.

Today 24,000 people on three shifts work at the Everett site, which also has five cafeterias, two cafes, 12 food plazas, a medical clinic, a fire department and a day-care center.

The 70-minute tour starts with a seven-minute film in the 100-seat theater. The movie takes visitors in fast motion through the 11-month production of an airplane from small parts fabrication through flying.

After the movie, visitors take a short bus ride to the factory, then walk down a set of steep stairs to the utility access tunnel under the factory floor. After a one-third-mile walk down the tunnel, an elevator takes visitors 35 feet above the factory floor.

Once on the metal catwalk, visitors can look down on the workers — tiny figures as seen from above — as they work the assembly line, pedal around on bicycles and tool around in golf carts.

Though only a small part of the production line is in view because of the building’s size, visitors get some insight into the skill and organization it takes to fit millions of parts together and make a plane fly.

The 747 has 6 million parts, and the 767 and 777 each have more than 3 million.

“It was neat, very interesting to see everything close up and you can’t conceive of the building’s size until you’re in the midst of it,” says Krissa Ross, 29, visiting from Dallas, Texas.

______________________

ASSEMBLY LINE: The forward portion of the fuselage of a Boeing 767 is lowered into position at Boeing's Everett, Wash., plant.
______________________

Visitors also get a good idea of the various stages of manufacturing as tour guides point out the seven stations planes are moved through: wing systems installation, the wing and body joined, cleaning and sealing of the wing and center fuel tanks, the final body joined and in the final stages: installation and testing.

Another impressive part of the factory is the overhead crane system used to move wings and fuselages. The 26 cranes, 90 feet above, cruise on 31 miles of networked tracks. Eighteen 747 and 767 cranes lift 34 tons, while eight more 777 cranes can each lift 40 tons.

The tour ends with a drive through the flight line, where planes are tested. The flight line includes: three to five days to paint in the paint hangar, a weigh-in to calculate fuel and passengers and two test flights totaling eight hours.

After all these steps, the visitor comes to learn, the customer can come pick up their plane — at a hefty price tag. The 747 starts at $167 million, the 767 at $89 million and the 777 at $137 million.

THE ASSOCIATED PRESS

Boeing Plant Tour

WHEN: Tours are 9 a.m., a.m., 11 a.m., 1 p.m., 2 p.m., and 3 p.m. Monday through Friday.

WHERE: 3003 W. Casino Road, Everett, Wash.

HOW MUCH: Admission is $5 for adults, $3 for seniors and $3 for children. Visitors must be at least 4 feet 2 inches tall. Contact - (800) 464-1476 or on the Internet, http://www.boeing.com/

Albuquerque Journal Sunday April 30, 2000

In the 1960s and 70s, code for the Boeing AWACS cost out to $100 per line. This work was done by Boeing Computer Services [BCS].

In 1979 BCS personnel were charged by management with looking into code cost on the Boeing 767 airplane.

At that time the 767 contained about 124 microcomputers/microcontrollers talking on an ARINC 429 serial bus.

Each of the micros contained several thousand lines of code.

BCS personnel hit the panic button on code cost for the 767.

But this was not to be a problem.

124 micros running several thousand lines of code each is much different from one computer [as in the AWACS] running one huge body of code.

How much do wdm drivers cost per line? Lots.

Both the learning curve and debugging time contribute to this cost.

When wdms fail one frequently gets the 'blue screen' or simply nothing happens. Wednesday May 3, 2000 08:43

Books for Coders and Thinkers

A confirmed bookworm, I’m always on the prowl for new and interesting books, both about technology and a myriad of other subjects. As a columnist I’m fortunate that folks help this search, often sending new titles my way for comment. An awful lot of the techie tomes are deadly dull; even if they include useful information, it’s often too hard to extract it from the fog of narcosis they induce.

Jean Labrosse’s new work MicroC/OS-II, The Real Time Kernel appeared this past November. This volume replaces Labrosse’s previous book on his RTOS, and is virtually a complete rewrite. In it he discusses µC/OS-II. Its predecessor, µC/OS, is famous as the most popular of the free RTOSes around.

With the new version, the licensing restrictions have been tightened up a bit. If you’re building a commercial for-profit product, a fee is now associated with the OS. This charge seems reasonable to me, as software costs money. I often wonder if free software isn’t some of the most expensive code around. Without some sort of effective—and always expensive—support organization, it’s awfully hard to do justice to complex code.

If you’re engaged on a project that already uses a commercial RTOS, this book may offer little of use. However, various surveys indicate that between 50% and 70% of all RTOSes used today are homebrewed; even more embedded projects make no use of an OS at all. There’s little doubt that the trend is strongly towards the use of these useful beasts.

There’s equally little doubt that rolling your own is, in nearly all circumstances, a mistake. A quick glance at µC/OS-II and a bit of math illustrates just how silly it is to build an OS from scratch. Labrosse’s code weighs in at about 4,000 lines of C, surely not huge by any measure. When our egos subjugate rational thought, many of us are tempted to think we could crank 4K lines of C in a week or two, but we can’t. After years of collecting cost data on embedded systems, I’ve found most commercial code costs around $15 to $30 per line, regardless of language. Labrosse’s µC/OS, then, represents as much as $120,000 of software engineering. Don’t believe me? Even if you divide the numbers by five, it’s clear that the cost of writing an RTOS is very high.
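The arithmetic behind that figure is easy to check. The sketch below is not from the column; it only multiplies the quoted 4,000 lines by the quoted $15 to $30 per line, then applies the "divide the numbers by five" discount. The function name cost_range and its parameters are made up for illustration.

```python
def cost_range(lines_of_code, low_per_line=15, high_per_line=30, discount=1):
    """Return (low, high) dollar estimates for a body of code.

    The per-line defaults are the $15 to $30 quoted in the column.
    discount is a divisor, so discount=5 means "divide the numbers by five".
    """
    low = lines_of_code * low_per_line / discount
    high = lines_of_code * high_per_line / discount
    return low, high


print(cost_range(4000))              # full estimate for 4,000 lines
print(cost_range(4000, discount=5))  # the skeptic's estimate
```

At the full rates the range is $60,000 to $120,000; divided by five it is still $12,000 to $24,000.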

Many other commercial RTOSes are available for a few kilobucks, with and without royalty fees. That’s essentially zero cost compared to creating your own. It’s true that µC/OS isn’t the most feature-rich of products; most commercial versions will offer more capabilities with correspondingly more code and equivalent cost.

Please forgive the rant: I’m constantly astounded that we parrot the principles of software reuse when we can’t even buy something as cleanly canned as the operating system. Some 80 vendors (see the Embedded Systems Buyer's Guide) offer different flavors of RTOSes for virtually any CPU, with an astonishing range of footprints and costs. Until we learn to buy every component possible—just as the hardware crowd does—software engineering is going to be a very expensive, error-ridden proposition. ....

Jack G. Ganssle, Embedded Systems Programming, January 1999 http://www.embedded.com/

A One Day Seminar May 3, 2000 In San Jose

Better Firmware...
Faster!

For Engineers and Programmers

Rodney Dangerfield would understand the life of an embedded developer: it’s pretty hard to get much respect from our friends and families, most of whom have no idea what we do, though they all use the wonderful products that result from our efforts. The field is surprisingly devoid of educational opportunities. Here’s one. Spend a day with Jack Ganssle learning new ways to get your embedded systems out the door fast without sacrificing quality.

We’ll cover technical issues - like how to write an embedded driver or isolate performance problems - and practical software process ideas, as well as how to manage your people and projects.

80% of all embedded systems are delivered late!
You’re not alone in waging the often impossible battle of getting a high quality product out on time. Learn ways to get products to market faster - without pulling “hero” all-nighters.

Do you get mired in debugging code?
New code generally has 50 to 100 bugs per thousand lines. The slowest way to find them is using a debugger. Learn better, proven, techniques that are up to 20 times more efficient.

Are hardware problems plaguing your efforts?
Learn new ways to create drivers for embedded peripherals. Study better approaches to ISR design. We’ll cover better diagnostics, improved tool use, and ways to prototype a system.

Did you know that...

... doubling the size of the code results in much more than twice the work? In this seminar you’ll learn ways unique to embedded systems to partition your firmware to keep complexity in line with code size.

... you can reduce bugs by a factor of ten before starting debugging using simple techniques that don’t require revolutionizing the engineering department? We’ll show you how.

... you can create a predictable real time design? This class will show you how to measure the system’s performance, manage reentrancy, and implement ISRs with the least amount of pain. You’ll even study real timing data for common C constructs on various CPUs.

... at 90% processor loading development time doubles? Learn simple approaches that keep loading low while simplifying the overall system design.

Course Outline

Languages
• C, C++ or Java?
• Learn the realities behind software reuse
• Managing embedded stack and heap issues

Partitioning Embedded Systems
• How to finish early using iterative development
• How to save time with an RTOS
• When to use hardware to help the software

Hardware
• Building faster peripheral drivers
• Understanding high speed signal problems
• How to probe SMT CPUs

Stamp out bugs!
• Learn bug management techniques
• Prevent defects with quick code inspections
• Study optimal use of bug-hunting tools
• Finding those hardware/software glitches

Overcoming deadline madness
• Negotiating realistic deadlines
• Improve your estimation skills
• What to do when a project runs late

Managing Real Time Issues
• How to design predictable real time code
• Preventing system performance debacles
• Dealing with reentrancy problems
• Managing interrupts and ISRs

How to learn from failures... and successes
• Learn to conduct an effective postmortem
• Closing the feedback loop
• The 7-step plan to firmware success

The Ganssle Group
http://www.ganssle.com/

Code size and bugs Monday April 3, 2000 08:06

The diagram [see Windows win32 driver model [wdm] overview Monday March 27, 2000 09:33] of wdm driver structure shows code can reside in

In ring 0 [98] in C++ 6.0, C, inline assembler, or masm 6.13 code
In ring 3 dll C++ 6.0, C, or masm 6.13 code
In ring 3 Visual Basic 6.0, C++, C 6.0 code

The general rule is that it's best to do the most coding possible at the highest level.

Partitioning a driver into code at these three levels has a further advantage because of better language support at the higher levels.

The number of bugs increases geometrically (or exponentially?) with the number of lines of code.

Partitioning the driver to run at all three levels has these added advantages:

three smaller programs rather than one large program
more powerful language support at the higher levels
three programming teams can work on a driver more or less independently
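Whether bug count grows geometrically or exponentially is left open above. As a rough illustration only, the sketch below assumes a power law, bugs = scale * lines ** exponent, with an exponent and scale picked out of thin air, and compares one large program against three equal parts. Nothing here is measured data, and the function names are hypothetical.

```python
def estimated_bugs(lines: float, exponent: float = 1.5, scale: float = 0.001) -> float:
    """Toy model: bugs grow as scale * lines ** exponent.

    Both constants are invented for illustration, not taken from any study.
    """
    return scale * lines ** exponent


def partition_benefit(total_lines: float, parts: int = 3, exponent: float = 1.5) -> float:
    """Ratio of bugs in one big program to bugs in `parts` equal smaller ones."""
    whole = estimated_bugs(total_lines, exponent)
    split = parts * estimated_bugs(total_lines / parts, exponent)
    return whole / split


# One 9,000-line driver versus three 3,000-line pieces under the assumed model.
print(partition_benefit(9000))
```

Under this assumed model the advantage works out to parts ** (exponent - 1), about 1.7 for three parts and an exponent of 1.5. With an exponent of 1.0 (linear growth), partitioning would make no difference to the bug count.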

DriverWorks 1.2, 1.5, and 2.0.0 build 473 size comparisons Thursday August 10, 2000 15:52

DriverWorks          Size                 Files   Folders
1.2                  44,417,024 bytes     789     86
1.5                  57,996,592 bytes     926     99
2.0.0 build 473      63,717,3762 bytes    1006    106

Code cost expectations Friday March 24, 2000 20:33

More on Wandering Code

Jack G. Ganssle

A recent column of mine “Wandering Pointers, Wandering Code” (November 1999, p. 21), produced a couple of interesting responses from readers. In that column, I suggested that pointers often aim at things they shouldn’t and code often incorrectly accesses memory regions.

A couple of readers suggested that C, while indeed leading to pointer problems when working with data structures, never runs amok. The feeling was that C runs in a pretty defined way, and as long as you call functions that exist, the code should never wander off.

Balderdash! Regardless of language, code crashes horribly for many reasons. What if the hardware produces an exception for which there is no handler? Using function pointers? That’s a golden opportunity to create wildly roaming program counters. Interrupts are a perennial source of crashes; don’t balance the stack properly, don’t protect resources correctly, and the code will vector off to never-never land. Bad hardware, a not uncommon threat in prototype work, can create bizarre happenings. And so, with respect, I suggest that wandering code is all too common a phenomenon. In fact, systems that are in production sometimes exhibit these, and other, related problems.

Over the years I’ve learned from bitter experience that a critical part of testing the reliability of extant code is to look for obvious flaws. When a project is finally “done,” or when someone asks me to look at a completed system, I’ll connect a logic analyzer and program it to find silly things like a write to ROM, or any read/write to unused address spaces. Though these effects don’t harm anything (ROM writes are surely futile), in my experience they’re symptoms of a lurking bug. No code should ever do anything unexpected or silly.
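The rule the logic analyzer is programmed with can be written down as a few lines of code. This is a minimal sketch, assuming a hypothetical memory map (the ROM and RAM ranges below are invented, not taken from any real system) and a captured trace of (address, is_write) pairs; the names Region, classify_access, and suspicious are made up for illustration.

```python
from typing import NamedTuple


class Region(NamedTuple):
    name: str
    start: int      # first address in the region (inclusive)
    end: int        # last address in the region (inclusive)
    writable: bool


# Hypothetical memory map, invented for illustration only.
MEMORY_MAP = [
    Region("ROM", 0x0000, 0x7FFF, writable=False),
    Region("RAM", 0x8000, 0xBFFF, writable=True),
]


def classify_access(address: int, is_write: bool) -> str:
    """Classify one bus cycle against the memory map."""
    for region in MEMORY_MAP:
        if region.start <= address <= region.end:
            if is_write and not region.writable:
                return "write to " + region.name
            return "ok"
    # Any read or write outside every mapped region is suspect.
    return "unused address space"


def suspicious(trace):
    """Reduce a trace of (address, is_write) pairs to the silly accesses."""
    flagged = []
    for address, is_write in trace:
        verdict = classify_access(address, is_write)
        if verdict != "ok":
            flagged.append((address, is_write, verdict))
    return flagged
```

Feeding it a trace such as [(0x0010, False), (0x0010, True), (0xF000, False)] flags the second cycle as a write to ROM and the third as an access to unused address space, the two conditions named above.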

A couple of years ago a reader told an interesting tale: searching out errant writes, he connected an analyzer to his system and quickly found the bug. Most of us would stop there, disconnect the tool, and continue traditional debugging. Instead, he left the analyzer connected for a number of weeks until completing debug. In that time he caught seven — count ‘em — similar problems that may very well have gone undetected. These were weird writes to code or unused address spaces, bugs with no immediately apparent symptoms.

What brilliant engineering! He identified a problem, then developed a continuous process to always find new instances of the issue. Without this approach the unit would have shipped with these bugs undetected.

I believe that one of the most insidious problems facing firmware engineers is these lurking bugs. Code that does things it shouldn’t is flawed, even if the effects seem benign.

With the best of intentions and the most lenient of schedules, bugs—especially insidious ones like errant memory accesses—slip through the test/debug phase simply because devising a test plan that checks the code’s every twist and turn is almost impossible. And the majority of firmware projects have only the most rudimentary of test plans.

Various studies suggest that up to half the code in a typical system never gets tested. Deeply nested ifs, odd exception/error conditions, and similar ills defy even well-designed test plans. A rule of thumb predicts that (uninspected) code has about a 5% error rate. This suggests that a 10,000-line project (not big by any measure) likely has 250 bugs poised to strike at the worst possible moment. ...

Embedded Systems Programming MARCH 2000 page 153
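The excerpt does not show how a 5% error rate on 10,000 lines becomes 250 bugs, since 5% of 10,000 is 500. One reading that reproduces the figure, and it is only an assumption, is that the half of the code that never gets tested hides half of the errors. A minimal sketch of that reading (lurking_bugs and its parameters are made-up names):

```python
def lurking_bugs(lines, error_percent=5, untested_percent=50):
    """Estimate bugs left in code that testing never reached.

    error_percent is the rule-of-thumb 5% error rate from the excerpt.
    untested_percent is the "up to half" of the code that is never tested.
    Combining the two this way is an assumption, not the column's stated method.
    """
    total_bugs = lines * error_percent / 100
    return total_bugs * untested_percent / 100


print(lurking_bugs(10_000))
```

For a 10,000-line project this gives 500 bugs in total, of which 250 sit in the untested half.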
