MILLENNIUM BUG
The Year 2000 problem (also known as the Y2K problem, the Millennium bug, the Y2K bug, or simply Y2K) was a problem in both digital (computer-related) and non-digital documentation and data storage, resulting from the practice of abbreviating a four-digit year to two digits.
In 1997, the British Standards Institution (BSI) developed a standard, DISC PD2000-1,[1] which defines "Year 2000 Conformity requirements" as four rules:
1. No valid date will cause any
interruption in operations.
2. Calculation of durations between, or the sequence of, pairs of dates will be correct whether or not the dates are in different centuries.
3. In all interfaces and in all storage, the century must be unambiguous, either specified or calculable by algorithm.
4. Year 2000 must be recognized as a leap year.
It identifies two problems
that may exist in many computer programs.
Firstly, the practice of
representing the year with two digits becomes problematic with logical error(s)
arising upon "rollover" from x99 to x00. This has caused some
date-related processing to operate incorrectly for dates and times on and after
1 January 2000, and on other critical dates which were billed "event horizons". Without corrective
action, long-working systems would break down when the "...97, 98, 99,
00..." ascending numbering assumption suddenly became invalid.
Secondly, some programmers had misunderstood the Gregorian calendar rule that years exactly divisible by 100 are not leap years unless they are also divisible by 400. Under that rule, the year 2000 was a leap year.
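The full rule is short enough to express directly in code; a Python illustration:

    def is_leap_year(year):
        """Gregorian rule: every fourth year is a leap year, except century
        years, which are leap years only when divisible by 400."""
        return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

    print(is_leap_year(1900))  # False: divisible by 100 but not by 400
    print(is_leap_year(2000))  # True: divisible by 400

Programs that implemented only the "divisible by 100" exception treated 29 February 2000 as an invalid date.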
Companies and organizations
worldwide checked, fixed, and upgraded their computer systems.
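The fixes took two broad forms: expanding stored years to four digits, or, where record layouts could not be changed, "windowing", in which a pivot year decides which century a two-digit year belongs to. A sketch of the windowing approach, with the pivot of 50 chosen arbitrarily for illustration:

    def expand_year(yy, pivot=50):
        """Windowing fix: map a two-digit year to a four-digit one.
        Values below the pivot are read as 20xx, the rest as 19xx.
        The pivot of 50 is an arbitrary illustrative choice."""
        return 2000 + yy if yy < pivot else 1900 + yy

    print(expand_year(99))  # 1999
    print(expand_year(0))   # 2000
    print(expand_year(50))  # 1950 -- the ambiguity is only postponed to the pivot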
The number of computer failures that occurred when the clocks rolled over into 2000 in spite of remedial work is not known, partly because of the reticence of organisations to report problems.[2] There is evidence of at least one date-related banking failure due to Y2K.[2] There were plenty of other Y2K problems, and the fact that none of the glitches caused major incidents is seen by some, such as the Director of the UN-backed International Y2K Co-operation Centre and the head of the UK's Taskforce 2000, as vindication of the Y2K preparation.[2] However, some questioned whether the relative absence of computer failures was the result of the preparation undertaken or whether the significance of the problem had been overstated.
The total cost of fixing the Millennium bug is estimated at over US$300 billion, spending considered vital by many to avoid disaster. It was a monumental task, and it is a tribute to the IT industry that, in the main, it resolved the problem in time. This is often overlooked.
A few computers were not corrected in time for 1st January 2000 – and this did cause a number of problems, before and after the turn of the millennium. For example, in the UK, as a result of the Millennium Bug, incorrect Down’s syndrome test results were sent to 154 pregnant women, resulting in two abortions. In Japan, radiation monitoring equipment failed at midnight 31 December 1999 – fortunately no one was hurt. Several websites, including the weather forecasting service in France, showed the wrong date.
DENVER AIRPORT BAGGAGE SYSTEM:
The airport's computerized
baggage system, which was supposed to reduce delays, shorten waiting times at
luggage carousels, and cut airline labor costs, was an unmitigated failure. An
airport opening originally scheduled for October 31, 1993, with a single system
for all three concourses turned into a February 28, 1995, opening with separate
systems for each concourse, with varying degrees of automation.
The system's $186 million
original construction costs grew by $1 million per day during months of
modifications and repairs. Incoming flights on the airport's B Concourse made
very limited use of the system, and only United, DIA's dominant airline, used
it for outgoing flights. The 40-year-old company responsible for the design of
the automated system, BAE Automated Systems of Carrollton, Texas, at one time responsible for
90% of the baggage systems in the United States, was acquired in 2002 by
G&T Conveyor Company, Inc.[13]
The automated baggage system never worked as designed, and in August 2005 it became public knowledge that United would abandon the system, a decision that would save it $1 million per month in maintenance costs.
Denver Airport had ambitious
plans to route passengers' bags to and from aircraft without significant
human intervention. The system was called the Denver International Airport
Baggage System (DIA ABS). It ran over budget by almost 30%, with an actual cost
of $250M vs. $195M planned, and completion was delayed 18 months. These
delays themselves are bad, but not disastrous. The problem was that the system
did not function as intended. The system itself was not a trivial undertaking
with 4,000 vehicles, 5.5 miles of conveyors and 22 miles of track. The design
failed in several respects – the carts were often unable to cope with sharp
corners in the track and loading bags directly from the aircraft failed.
The sensors to determine where bags were in the system were not reliable.
The design used a number of technologies that were untested. Whereas
the Sydney Opera House is an example of a
project with tremendously ambitious goals that simply ran over time and budget
until those goals were met, the Denver Airport baggage system stayed much
closer to duration and budget estimates, but the goals of the system were not
met. And unlike the FBI’s virtual case file project there was no
issue with vague goals, it’s just that the baggage system’s goals were
clear but unrealistic.
The baggage system was simply poorly designed and poorly tested; more recent, simple computer simulations have found problems with the system that the project itself was not able to catch until implementation.
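The kind of simulation referred to need not be elaborate. The following toy Python sketch, with all rates invented for illustration, shows how even a crude model of one loading point exposes an unstable queue: if empty carts arrive slightly more slowly than bags do, the backlog grows without bound.

    import random

    # Toy model of a single baggage loading point. The arrival and cart
    # rates are invented; the point is only that a simple simulation can
    # reveal a capacity problem before any track is built.
    random.seed(1)

    waiting_bags = 0
    max_backlog = 0
    for second in range(3600):                    # simulate one hour
        if random.random() < 0.30:                # a bag arrives (~0.3 per second)
            waiting_bags += 1
        if random.random() < 0.25 and waiting_bags > 0:
            waiting_bags -= 1                     # an empty cart takes one bag away
        max_backlog = max(max_backlog, waiting_bags)

    print("bags still waiting after one hour:", waiting_bags)
    print("worst backlog seen:", max_backlog)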
NATIONAL CANCER INSTITUTE PANAMA:
As software spreads from computers to the engines of automobiles to robots in factories to X-ray machines in hospitals, defects are no longer a problem to be managed. They have to be eliminated.
Victor Garcia considers himself
lucky to be alive. Three years ago, a combination of cancer and miscalculation
almost killed him.
The former distribution manager for fragrance maker Chanel
now can feel the hot Panamanian morning sun stream through his living-room
window. He can smell lunch cooking in the kitchen. He can sit in an armchair
surrounded by pictures of his six children and six grandchildren and talk to
his wife. Simple pleasures he almost lost following a software malfunction. In
November of 2000, Garcia and 27 other patients at the National Cancer Institute
in Panama were jolted with massive overdoses of gamma rays partly due to
limitations of the computer program that guided use of a radiation-therapy
machine.
In the 40 months that have
passed, 21 patients have died. While it's unclear how many of the patients
would have died of cancer anyway, the International Atomic Energy Agency (IAEA)
said in May 2001 that at least five of the deaths were probably from radiation
poisoning and at least 15 more patients risked developing "serious
complications" from radiation.
Garcia, being treated for prostate
cancer, survived but suffered damage to his intestines. He now has a colostomy.
"I am very lucky," he says, shaking his head in wonderment.
"That's what the [investigating] doctors from Houston told me. 'You are so
lucky.'"
The three Panamanian medical physicists who used the software
to figure out just how much radiation to apply to patients are scheduled to be
tried on May 18 in Panama City on charges of second-degree murder. Under
Panamanian law, they may be held responsible for "introducing changes into
the software" that led directly to the patients' deaths, according to
Special Superior Deputy Prosecutor Cristobal Arboleda.
The physicists, of
course, thought they were helping the patients. Having consulted a doctor at
the hospital and the software's manual, they thought they had figured out how
to place five radiation shields over each patient's body, instead of four, to
protect against possible overdoses. "I thought I was home free," one
of them, Olivia Saldaña, recalls now.
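Published accounts of the incident reported that the planning software derived treatment time from the digitized outline of the shielding blocks, and that entering the five blocks as a single composite outline could produce a different computed time depending on the direction in which the outline was drawn. The following toy Python sketch, with invented geometry and emphatically not Multidata's code, shows how a signed-area computation can change silently with drawing direction:

    def signed_area(points):
        """Shoelace formula: positive for counter-clockwise outlines,
        negative for clockwise ones."""
        total = 0.0
        for i in range(len(points)):
            x1, y1 = points[i]
            x2, y2 = points[(i + 1) % len(points)]
            total += x1 * y2 - x2 * y1
        return total / 2.0

    outer = [(0, 0), (10, 0), (10, 10), (0, 10)]   # field boundary, CCW: +100
    hole_cw = [(4, 4), (4, 6), (6, 6), (6, 4)]     # blocked region, CW:   -4
    hole_ccw = list(reversed(hole_cw))             # same shape, CCW:      +4

    # Traced opposite the outer boundary, the blocked region subtracts:
    print(signed_area(outer) + signed_area(hole_cw))    # 96.0 -- open area
    # Traced the same way as the boundary, it silently adds instead:
    print(signed_area(outer) + signed_area(hole_ccw))   # 104.0 -- wrong

A wrong open-field area feeds directly into the computed beam-on time, which is how a drawing convention can become an overdose.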
This is not a cautionary tale for
medical technicians, even though they can find themselves fighting to stay out
of jail if they misunderstand or misuse technology. This also is not a tale of
how human beings can be injured or worse by poorly designed or poorly explained
software, although there are plenty of examples to make the point. This is a
warning for any creator of computer programs: that software quality matters, that applications must be foolproof, and that, whether embedded in the engine of a car, a robotic arm in a factory or a healing device in a hospital, poorly deployed code can kill.
In this case, a St. Louis company, Multidata Systems
International, has found itself in and out of courts in two countries for much
of the past three years, fending off charges that its product is at fault in a
score of fatalities. The deaths occurred more than 2,000 miles from its home,
at an installation of a customer it claims it did not even know it still had, until the death toll began mounting.
Now Multidata may face judgments
that could damage, if not destroy, the company itself, if the firm is found
guilty and is forced to pay damages sought by the victims. No one can
accurately predict the amount Multidata would have to pay if the victims
succeed in suing in the U.S. So far the plaintiffs have failed. But each of the
28 victims could be entitled to as much as $500,000 to $1 million of
compensation for such factors as pain and suffering, lost wages and the number
and age of surviving dependents, according to Brian Kerley, a defense attorney
at a leading New York malpractice firm. Using those numbers, Multidata could be
facing total damages in the range of $14 million to $28 million. Multidata,
which is privately held, says it has about $2 million in annual sales and fewer
than 15 employees.
MARS CLIMATE ORBITER:
The Mars Climate Orbiter (formerly the Mars Surveyor '98 Orbiter) was a 338 kilogram (750 lb) robotic space probe launched by NASA on December 11, 1998 to study the Martian climate, atmosphere, and surface changes, and to act as the communications relay in the Mars Surveyor '98 program for the Mars Polar Lander. However, on September 23, 1999, communication with the spacecraft was lost as it went into orbital insertion, due to ground-based computer software which produced output in non-SI units of pound-force seconds (lbf·s) instead of the metric units of newton-seconds (N·s) specified in the contract between NASA and Lockheed. The spacecraft encountered Mars at a lower altitude than intended, causing it to enter the upper atmosphere and disintegrate.
Cause of failure
On November 10, 1999, the Mars
Climate Orbiter Mishap Investigation Board released a Phase I report, detailing
the suspected issues encountered with the loss of the spacecraft. Previously,
on September 8, 1999, Trajectory Correction Maneuver-4 was computed and then
executed on September 15, 1999. It was intended to place the spacecraft at an
optimal position for an orbital insertion maneuver that would bring the spacecraft
around Mars at an altitude of 226 kilometers on September 23, 1999. However,
during the week between TCM-4 and the orbital insertion maneuver, the
navigation team indicated the altitude may be much lower than intended at 150
to 170 kilometers. Twenty-four hours prior to orbital insertion, calculations
placed the orbiter at an altitude of 110 kilometers; 80 kilometers is the
minimum altitude that Mars Climate Orbiter was thought to be capable of
surviving during this maneuver. Final calculations placed the spacecraft in a
trajectory that would have taken the orbiter within 57 kilometers of the
surface where the spacecraft likely disintegrated because of atmospheric
stresses. The primary cause of this discrepancy was engineering error.
Specifically, the flight system software on the Mars Climate Orbiter was written to take thrust instructions in the metric unit newtons (N), while the software on the ground that generated those instructions used the imperial unit pound-force (lbf). This error has since been known as the "metric mixup", and NASA has taken care to avoid it in all missions since.
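The arithmetic of the mixup is small but compounds over a long cruise. A minimal Python sketch, with invented numbers and not the actual flight or ground software, of what happens when one side emits pound-force seconds and the other reads them as newton-seconds:

    LBF_S_TO_N_S = 4.448222   # one pound-force second, expressed in newton-seconds

    # Hypothetical impulse from one small thruster firing, as the ground
    # software reported it (a raw number that is actually in lbf*s):
    reported_impulse = 1.0

    correct_n_s = reported_impulse * LBF_S_TO_N_S   # what should have been used
    used_n_s = reported_impulse                     # the raw number, read as N*s

    print("thrust effect under-modeled by a factor of", round(correct_n_s / used_n_s, 3))
    # Each firing was modeled ~4.45x too weak, and the error accumulated
    # over months of small trajectory corrections.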
The discrepancy between
calculated and measured position, resulting in the discrepancy between desired
and actual orbit insertion altitude, had been noticed earlier by at least two
navigators, whose concerns were dismissed. A meeting of trajectory software
engineers, trajectory software operators (navigators), propulsion engineers,
and managers, was convened to consider the possibility of executing Trajectory
Correction Maneuver-5, which was in the schedule. Attendees of the meeting
recall an agreement to conduct TCM-5, but it was ultimately not done.
As part of the NASA Mars Surveyor Program, the Mars
Climate Orbiter was to orbit Mars and collect environmental and weather data.
But as the spacecraft approached its destination, telemetry signals fell
silent, and a $125 million mission failed.
The root cause identified by NASA was the failure to
convert between metric and English units. When the fatal error was detected,
Noel Hinners, vice-president for flight systems at Lockheed, the company that
built the spacecraft, said in disbelief, “It can’t be something that simple
that could cause this to happen.” But it was.
Apparently Lockheed had used pounds during the design of the engines, while the NASA scientists responsible for operation and flight thought the information was in metric units.
There were early signs during its flight that
something was wrong with the craft’s trajectory and an internal review later
confirmed that it may have been off course for months (Pollack, 1999) (Oberg,
1999). Project culture, however, required that engineers prove that something
was wrong rather than “prove that everything was right.” This difference in
perspective prevented the team from looking into the problem. Edward Weiler,
NASA associate administrator for space science, said, “The problem here was not
the error; it was the failure of NASA’s systems engineering, and the checks and
balances in our processes to detect the error” (Oberg, 1999).
The Mars Investigation Panel report identified several
contributing factors to the failure: the system engineering process did not
adequately address the transition from development (Lockheed) to operations
(NASA); inadequate communications between project elements; and inadequate
staffing and training.
Within a few months of the Orbiter failure, the Mars
Polar Lander, a related NASA project with a price tag of $165 million, suffered
the same fate. Its flight was uneventful until it began its landing approach.
Then, during its descent into the rough terrain of the polar cap, telemetry
signals fell silent. With no data to pinpoint the precise cause of failure, the
teams investigating the accident speculated that the vehicle’s descent engines
prematurely shut down. The speculation was that the engines quit when the Lander was 130 feet up; unable to slow its descent, the vehicle crashed into the surface of Mars at about 50 miles per hour. The inappropriate response of its engines was attributed to software glitches (Leary, 2000).
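The leading hypothesis in the failure reviews was that vibration from landing-leg deployment produced momentary spurious signals on the touchdown sensors, that the software latched those signals, and that when the descent logic later began honoring the sensors, at roughly 40 meters (about 130 feet), it commanded engine cutoff while the Lander was still falling. A hypothetical sketch of that pattern, not the actual flight code:

    # Hypothetical sketch of the suspected fault: a transient reading is
    # latched during leg deployment and later mistaken for touchdown.

    touchdown_latched = False

    # Leg deployment, still high above the surface: vibration produces one
    # spurious touchdown reading, which the software stores.
    spurious_reading = True
    if spurious_reading:
        touchdown_latched = True      # the bug: the transient is never cleared

    # Near 40 m, the descent logic begins honoring the latched flag:
    altitude_m = 40
    if altitude_m <= 40 and touchdown_latched:
        print("engine cutoff commanded -- lander still ~40 m above the surface")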
The prevailing culture at NASA of "Better, Faster and Cheaper," which defined the period when these projects were in development, has been highlighted many times as the contributing factor behind these failures.
Thomas Young, a former NASA official said that they were trying “to do too much
with too little.” He continued, “No one had a sense of how much trouble they
were actually in.” (Broad, 1999).
The prevailing culture was best expressed in an
internal memo written by a laboratory official at the Jet Propulsion Lab.
“There might have been some overconfidence, inadequate robustness in our
processes, designs or operations, inadequate modeling and simulation of
operations, and failure to heed early warnings.” (Oberg, 1999).
While the trajectory problem associated with the
Orbiter and the engine ignition problem associated with the Lander could be
characterized as technical, the Mars Climate Orbiter Failure Board (2000) said
that management failures were also to blame. They found that these projects
suffered from a lack of senior management involvement and too much reliance on
inexperienced project managers. The Board also criticized the strategy where
project managers in one organization were responsible for development
(Lockheed) and a separate organization (NASA) was responsible for operations
after launch.
Lessons Learned
If the orbiter did not launch on schedule, it would have to wait several months before its next opportunity to launch. With launch windows far apart, and with budgets unable to tolerate a substantial delay, managers were under pressure to meet the deadline; it was important not to “waste” the effort put into the project to that point. This suggests that decision makers fell into the “sunk cost” trap, a situation in which past expenditures of time and money continue to propel a project into the future even when evidence suggests that this would be unwise.
Selective perception explains why the engineers at the Jet Propulsion Lab, the design team, failed to coordinate with the operational
team at NASA. In large-scale complex projects, such as the Orbiter and Lander,
with countless activities, contractors, and suppliers, it is very possible that
teams may take a narrow view of their own activities. The risk is that the work
of one team may be incompatible with the work of another.
Conservatism, the failure to consider new data,
explains why engineers did not take action when they noticed that the
trajectory of the spacecraft was off. They even held a meeting in Denver to
address the issue, but it was never resolved. Even as the spacecraft approached its destination and data showed that it was drifting off course, controllers ignored the real data and assumed it was on course (Oberg, 1999).