MILLENNIUM BUG
The Year 2000 problem (also known as the Y2K problem, the Millennium bug, the Y2K bug, or simply Y2K) was a problem in both digital (computer-related) and non-digital documentation and data storage, resulting from the practice of abbreviating a four-digit year to two digits.
In 1997, the British Standards Institution (BSI) developed a standard, DISC PD2000-1,[1] which defines "Year 2000 Conformity requirements" as four rules:
1. No valid date will cause any
interruption in operations.
2. Calculation of durations between, or the sequence of, pairs of dates will be correct whether or not the dates are in different centuries.
3. In all interfaces and in all storage, the century must be unambiguous, either specified or calculable by algorithm.
4. Year 2000 must be recognized as a leap year.
It identifies two problems
that may exist in many computer programs.
Firstly, the practice of
representing the year with two digits becomes problematic with logical error(s)
arising upon "rollover" from x99 to x00. This has caused some
date-related processing to operate incorrectly for dates and times on and after
1 January 2000, and on other critical dates which were billed "event horizons". Without corrective
action, long-working systems would break down when the "...97, 98, 99,
00..." ascending numbering assumption suddenly became invalid.
Secondly, some programmers had misunderstood the Gregorian calendar rule that years exactly divisible by 100 are not leap years unless they are also divisible by 400. Under that rule, the year 2000 was a leap year.
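The full rule is short enough to express directly in code; a Python illustration:

    def is_leap_year(year):
        """Gregorian rule: every fourth year is a leap year, except century
        years, which are leap years only when divisible by 400."""
        return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

    print(is_leap_year(1900))  # False: divisible by 100 but not by 400
    print(is_leap_year(2000))  # True: divisible by 400

Programs that implemented only the "divisible by 100" exception treated 29 February 2000 as an invalid date.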
Companies and organizations
worldwide checked, fixed, and upgraded their computer systems.
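The fixes took two broad forms: expanding stored years to four digits, or, where record layouts could not be changed, "windowing", in which a pivot year decides which century a two-digit year belongs to. A sketch of the windowing approach, with the pivot of 50 chosen arbitrarily for illustration:

    def expand_year(yy, pivot=50):
        """Windowing fix: map a two-digit year to a four-digit one.
        Values below the pivot are read as 20xx, the rest as 19xx.
        The pivot of 50 is an arbitrary illustrative choice."""
        return 2000 + yy if yy < pivot else 1900 + yy

    print(expand_year(99))  # 1999
    print(expand_year(0))   # 2000
    print(expand_year(50))  # 1950 -- the ambiguity is only postponed to the pivot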
The number of computer failures that occurred when the clocks rolled over into 2000 in spite of remedial work is not known, partly because of the reticence of organisations to report problems.[2] There is evidence of at least one date-related banking failure due to Y2K.[2] There were plenty of other Y2K problems, and the fact that none of the glitches caused major incidents is seen by some, such as the Director of the UN-backed International Y2K Co-operation Centre and the head of the UK's Taskforce 2000, as vindication of the Y2K preparation.[2] However, some questioned whether the relative absence of computer failures was the result of the preparation undertaken or whether the significance of the problem had been overstated.
The total cost of fixing the Millennium bug is estimated at over US$300 billion, spending considered vital by many to avoid disaster. It was a monumental task, and it is a tribute to the IT industry that, in the main, it resolved the problem in time. This is often overlooked.
A few computers were not corrected in time for 1st January 2000 – and this did cause a number of problems, before and after the turn of the millennium. For example, in the UK, as a result of the Millennium Bug, incorrect Down’s syndrome test results were sent to 154 pregnant women, resulting in two abortions. In Japan, radiation monitoring equipment failed at midnight 31 December 1999 – fortunately no one was hurt. Several websites, including the weather forecasting service in France, showed the wrong date.
DENVER AIRPORT BAGGAGE SYSTEM:
The airport's computerized
baggage system, which was supposed to reduce delays, shorten waiting times at
luggage carousels, and cut airline labor costs, was an unmitigated failure. An
airport opening originally scheduled for October 31, 1993, with a single system
for all three concourses turned into a February 28, 1995, opening with separate
systems for each concourse, with varying degrees of automation.
The system's $186 million
original construction costs grew by $1 million per day during months of
modifications and repairs. Incoming flights on the airport's B Concourse made
very limited use of the system, and only United, DIA's dominant airline, used
it for outgoing flights. The 40-year-old company responsible for the design of
the automated system, BAE Automated Systems of Carrollton, Texas, at one time responsible for
90% of the baggage systems in the United States, was acquired in 2002 by
G&T Conveyor Company, Inc.[13]
The automated baggage system never worked as designed, and in August 2005 it became public knowledge that United would abandon the system, a decision that would save it $1 million per month in maintenance costs.
Denver Airport had ambitious
plans to route passengers' bags to and from aircraft without significant
human intervention. The system was called the Denver International Airport
Baggage System (DIA ABS). It ran over budget by almost 30%, with an actual cost
of $250M vs. $195M planned, and completion was delayed 18 months. These
delays themselves are bad, but not disastrous. The problem was that the system
did not function as intended. The system itself was not a trivial undertaking
with 4,000 vehicles, 5.5 miles of conveyors and 22 miles of track. The design
failed in several respects – the carts were often unable to cope with sharp
corners in the track and loading bags directly from the aircraft failed.
The sensors to determine where bags were in the system were not reliable.
The design used a number of technologies that were untested. Whereas
the Sydney Opera House is an example of a
project with tremendously ambitious goals that simply ran over time and budget
until those goals were met, the Denver Airport baggage system stayed much
closer to duration and budget estimates, but the goals of the system were not
met. And unlike the FBI’s virtual case file project there was no
issue with vague goals, it’s just that the baggage system’s goals were
clear but unrealistic.
The baggage system was simply poorly designed and poorly tested; more recent, simple computer simulations have found problems with the system that the project itself was not able to catch until implementation.
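The kind of simulation referred to need not be elaborate. The following toy Python sketch, with all rates invented for illustration, shows how even a crude model of one loading point exposes an unstable queue: if empty carts arrive slightly more slowly than bags do, the backlog grows without bound.

    import random

    # Toy model of a single baggage loading point. The arrival and cart
    # rates are invented; the point is only that a simple simulation can
    # reveal a capacity problem before any track is built.
    random.seed(1)

    waiting_bags = 0
    max_backlog = 0
    for second in range(3600):                    # simulate one hour
        if random.random() < 0.30:                # a bag arrives (~0.3 per second)
            waiting_bags += 1
        if random.random() < 0.25 and waiting_bags > 0:
            waiting_bags -= 1                     # an empty cart takes one bag away
        max_backlog = max(max_backlog, waiting_bags)

    print("bags still waiting after one hour:", waiting_bags)
    print("worst backlog seen:", max_backlog)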
NATIONAL CANCER INSTITUTE PANAMA:
As software spreads from computers to the engines of automobiles to robots in factories to X-ray machines in hospitals, defects are no longer a problem to be managed. They have to be eliminated.
Victor Garcia considers himself
lucky to be alive. Three years ago, a combination of cancer and miscalculation
almost killed him.
The former distribution manager for fragrance maker Chanel
now can feel the hot Panamanian morning sun stream through his living-room
window. He can smell lunch cooking in the kitchen. He can sit in an armchair
surrounded by pictures of his six children and six grandchildren and talk to
his wife. Simple pleasures he almost lost following a software malfunction. In
November of 2000, Garcia and 27 other patients at the National Cancer Institute
in Panama were jolted with massive overdoses of gamma rays partly due to
limitations of the computer program that guided use of a radiation-therapy
machine.
In the 40 months that have
passed, 21 patients have died. While it's unclear how many of the patients
would have died of cancer anyway, the International Atomic Energy Agency (IAEA)
said in May 2001 that at least five of the deaths were probably from radiation
poisoning and at least 15 more patients risked developing "serious
complications" from radiation.
Garcia, being treated for prostate
cancer, survived but suffered damage to his intestines. He now has a colostomy.
"I am very lucky," he says, shaking his head in wonderment.
"That's what the [investigating] doctors from Houston told me. 'You are so
lucky.'"
The three Panamanian medical physicists who used the software
to figure out just how much radiation to apply to patients are scheduled to be
tried on May 18 in Panama City on charges of second-degree murder. Under
Panamanian law, they may be held responsible for "introducing changes into
the software" that led directly to the patients' deaths, according to
Special Superior Deputy Prosecutor Cristobal Arboleda.
The physicists, of
course, thought they were helping the patients. Having consulted a doctor at
the hospital and the software's manual, they thought they had figured out how
to place five radiation shields over each patient's body, instead of four, to
protect against possible overdoses. "I thought I was home free," one
of them, Olivia Saldaña, recalls now.
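Published accounts of the incident reported that the planning software derived treatment time from the digitized outline of the shielding blocks, and that entering the five blocks as a single composite outline could produce a different computed time depending on the direction in which the outline was drawn. The following toy Python sketch, with invented geometry and emphatically not Multidata's code, shows how a signed-area computation can change silently with drawing direction:

    def signed_area(points):
        """Shoelace formula: positive for counter-clockwise outlines,
        negative for clockwise ones."""
        total = 0.0
        for i in range(len(points)):
            x1, y1 = points[i]
            x2, y2 = points[(i + 1) % len(points)]
            total += x1 * y2 - x2 * y1
        return total / 2.0

    outer = [(0, 0), (10, 0), (10, 10), (0, 10)]   # field boundary, CCW: +100
    hole_cw = [(4, 4), (4, 6), (6, 6), (6, 4)]     # blocked region, CW:   -4
    hole_ccw = list(reversed(hole_cw))             # same shape, CCW:      +4

    # Traced opposite the outer boundary, the blocked region subtracts:
    print(signed_area(outer) + signed_area(hole_cw))    # 96.0 -- open area
    # Traced the same way as the boundary, it silently adds instead:
    print(signed_area(outer) + signed_area(hole_ccw))   # 104.0 -- wrong

A wrong open-field area feeds directly into the computed beam-on time, which is how a drawing convention can become an overdose.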
This is not a cautionary tale for
medical technicians, even though they can find themselves fighting to stay out
of jail if they misunderstand or misuse technology. This also is not a tale of
how human beings can be injured or worse by poorly designed or poorly explained
software, although there are plenty of examples to make the point. This is a
warning for any creator of computer programs: that software quality matters, that applications must be foolproof, and that, whether embedded in the engine of a car, a robotic arm in a factory or a healing device in a hospital, poorly deployed code can kill.
In this case, a St. Louis company, Multidata Systems
International, has found itself in and out of courts in two countries for much
of the past three years, fending off charges that its product is at fault in a
score of fatalities. The deaths occurred more than 2,000 miles from its home,
at an installation of a customer it claims it did not even know it still had, until the death toll began mounting.
Now Multidata may face judgments
that could damage, if not destroy, the company itself, if the firm is found
guilty and is forced to pay damages sought by the victims. No one can
accurately predict the amount Multidata would have to pay if the victims
succeed in suing in the U.S. So far the plaintiffs have failed. But each of the
28 victims could be entitled to as much as $500,000 to $1 million of
compensation for such factors as pain and suffering, lost wages and the number
and age of surviving dependents, according to Brian Kerley, a defense attorney
at a leading New York malpractice firm. Using those numbers, Multidata could be
facing total damages in the range of $14 million to $28 million. Multidata,
which is privately held, says it has about $2 million in annual sales and fewer
than 15 employees.
MARS CLIMATE ORBITER:
The Mars Climate Orbiter (formerly the Mars Surveyor '98 Orbiter) was a 338 kilogram (750 lb) robotic space probe launched by NASA on December 11, 1998 to study the Martian climate, atmosphere, and surface changes, and to act as the communications relay in the Mars Surveyor '98 program for the Mars Polar Lander. However, on September 23, 1999, communication with the spacecraft was lost as it went into orbital insertion, due to ground-based computer software which produced output in non-SI units of pound-force seconds (lbf·s) instead of the metric units of newton-seconds (N·s) specified in the contract between NASA and Lockheed. The spacecraft encountered Mars at a lower altitude than intended, causing it to enter the upper atmosphere and disintegrate.
Cause of failure
On November 10, 1999, the Mars
Climate Orbiter Mishap Investigation Board released a Phase I report, detailing
the suspected issues encountered with the loss of the spacecraft. Previously,
on September 8, 1999, Trajectory Correction Maneuver-4 was computed and then
executed on September 15, 1999. It was intended to place the spacecraft at an
optimal position for an orbital insertion maneuver that would bring the spacecraft
around Mars at an altitude of 226 kilometers on September 23, 1999. However,
during the week between TCM-4 and the orbital insertion maneuver, the
navigation team indicated the altitude may be much lower than intended at 150
to 170 kilometers. Twenty-four hours prior to orbital insertion, calculations
placed the orbiter at an altitude of 110 kilometers; 80 kilometers is the
minimum altitude that Mars Climate Orbiter was thought to be capable of
surviving during this maneuver. Final calculations placed the spacecraft in a
trajectory that would have taken the orbiter within 57 kilometers of the
surface where the spacecraft likely disintegrated because of atmospheric
stresses. The primary cause of this discrepancy was engineering error.
Specifically, the flight system software on the Mars Climate Orbiter was written to take thrust instructions in the metric unit newtons (N), while the software on the ground that generated those instructions used the imperial unit pound-force (lbf). This error has since been known as the "metric mixup", and NASA has taken care to avoid it in all missions since.
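The arithmetic of the mixup is small but compounds over a long cruise. A minimal Python sketch, with invented numbers and not the actual flight or ground software, of what happens when one side emits pound-force seconds and the other reads them as newton-seconds:

    LBF_S_TO_N_S = 4.448222   # one pound-force second, expressed in newton-seconds

    # Hypothetical impulse from one small thruster firing, as the ground
    # software reported it (a raw number that is actually in lbf*s):
    reported_impulse = 1.0

    correct_n_s = reported_impulse * LBF_S_TO_N_S   # what should have been used
    used_n_s = reported_impulse                     # the raw number, read as N*s

    print("thrust effect under-modeled by a factor of", round(correct_n_s / used_n_s, 3))
    # Each firing was modeled ~4.45x too weak, and the error accumulated
    # over months of small trajectory corrections.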
The discrepancy between
calculated and measured position, resulting in the discrepancy between desired
and actual orbit insertion altitude, had been noticed earlier by at least two
navigators, whose concerns were dismissed. A meeting of trajectory software
engineers, trajectory software operators (navigators), propulsion engineers,
and managers, was convened to consider the possibility of executing Trajectory
Correction Maneuver-5, which was in the schedule. Attendees of the meeting
recall an agreement to conduct TCM-5, but it was ultimately not done.
As part of the NASA Mars Surveyor Program, the Mars
Climate Orbiter was to orbit Mars and collect environmental and weather data.
But as the spacecraft approached its destination, telemetry signals fell
silent, and a $125 million mission failed.
The root cause identified by NASA was the failure to
convert between metric and English units. When the fatal error was detected,
Noel Hinners, vice-president for flight systems at Lockheed, the company that
built the spacecraft, said in disbelief, “It can’t be something that simple
that could cause this to happen.” But it was.
Apparently Lockheed had used pounds during the design of the engines, while the NASA scientists responsible for operation and flight thought the information was in metric units.
There were early signs during its flight that
something was wrong with the craft’s trajectory and an internal review later
confirmed that it may have been off course for months (Pollack, 1999) (Oberg,
1999). Project culture, however, required that engineers prove that something
was wrong rather than “prove that everything was right.” This difference in
perspective prevented the team from looking into the problem. Edward Weiler,
NASA associate administrator for space science, said, “The problem here was not
the error; it was the failure of NASA’s systems engineering, and the checks and
balances in our processes to detect the error” (Oberg, 1999).
The Mars Investigation Panel report identified several
contributing factors to the failure: the system engineering process did not
adequately address the transition from development (Lockheed) to operations
(NASA); inadequate communications between project elements; and inadequate
staffing and training.
Within a few months of the Orbiter failure, the Mars
Polar Lander, a related NASA project with a price tag of $165 million, suffered
the same fate. Its flight was uneventful until it began its landing approach.
Then, during its descent into the rough terrain of the polar cap, telemetry
signals fell silent. With no data to pinpoint the precise cause of failure, the
teams investigating the accident speculated that the vehicle’s descent engines
prematurely shut down. The speculation was that the engines quit when the Lander was 130 feet up; unable to slow its descent, the vehicle crashed into the surface of Mars at about 50 miles per hour. The inappropriate response of its engines was attributed to software glitches (Leary, 2000).
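The leading hypothesis in the failure reviews was that vibration from landing-leg deployment produced momentary spurious signals on the touchdown sensors, that the software latched those signals, and that when the descent logic later began honoring the sensors, at roughly 40 meters (about 130 feet), it commanded engine cutoff while the Lander was still falling. A hypothetical sketch of that pattern, not the actual flight code:

    # Hypothetical sketch of the suspected fault: a transient reading is
    # latched during leg deployment and later mistaken for touchdown.

    touchdown_latched = False

    # Leg deployment, still high above the surface: vibration produces one
    # spurious touchdown reading, which the software stores.
    spurious_reading = True
    if spurious_reading:
        touchdown_latched = True      # the bug: the transient is never cleared

    # Near 40 m, the descent logic begins honoring the latched flag:
    altitude_m = 40
    if altitude_m <= 40 and touchdown_latched:
        print("engine cutoff commanded -- lander still ~40 m above the surface")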
The prevailing culture at NASA of "Better, Faster and Cheaper," which defined the period when these projects were in development, has been highlighted many times as the contributing factor behind these failures.
Thomas Young, a former NASA official said that they were trying “to do too much
with too little.” He continued, “No one had a sense of how much trouble they
were actually in.” (Broad, 1999).
The prevailing culture was best expressed in an
internal memo written by a laboratory official at the Jet Propulsion Lab.
“There might have been some overconfidence, inadequate robustness in our
processes, designs or operations, inadequate modeling and simulation of
operations, and failure to heed early warnings.” (Oberg, 1999).
While the trajectory problem associated with the
Orbiter and the engine ignition problem associated with the Lander could be
characterized as technical, the Mars Climate Orbiter Failure Board (2000) said
that management failures were also to blame. They found that these projects
suffered from a lack of senior management involvement and too much reliance on
inexperienced project managers. The Board also criticized the strategy where
project managers in one organization were responsible for development
(Lockheed) and a separate organization (NASA) was responsible for operations
after launch.
Lessons Learned
If the orbiter did not launch on schedule, it would have to wait several months before its next opportunity to launch. With launch windows far apart, and with budgets unable to tolerate a substantial delay, managers were under pressure to meet the deadline; it was important not to “waste” the effort put into the project to that point. This suggests that decision makers fell into the “sunk cost” trap, a situation in which past expenditures of time and money continue to propel a project into the future even when evidence suggests that this would be unwise.
Selective perception explains why the engineers at the Jet Propulsion Lab, the design team, failed to coordinate with the operational
team at NASA. In large-scale complex projects, such as the Orbiter and Lander,
with countless activities, contractors, and suppliers, it is very possible that
teams may take a narrow view of their own activities. The risk is that the work
of one team may be incompatible with the work of another.
Conservatism, the failure to consider new data,
explains why engineers did not take action when they noticed that the
trajectory of the spacecraft was off. They even held a meeting in Denver to
address the issue, but it was never resolved. Even as the spacecraft approached its destination and data showed that it was drifting off course, controllers ignored the real data and assumed it was on course (Oberg, 1999).