Part 1: "Reliance" and "distraction" effects in PTC automation

White Paper, 11/28/99

EXECUTIVE SUMMARY

This document was requested by T. Raslear of the Federal Railroad Administration
(FRA) on 3/3/99 of the PTC Human Factors Team in conjunction with ongoing
discussions of PTC standards. The charge was to investigate the "reliance
effect" and the "distraction effect," where definition and
focus were left to the authors.

With regard to future automation of railway systems, and in particular
with regard to the implementation of Positive Train Control (PTC), questions
have been raised about the possible propensity for a locomotive engineer
(LE) or conductor (C) to become over-reliant on automation and/or to become
distracted by the additional monitoring burdens required by the automation,
and for these effects to compromise the performance of their duties and
the safe and efficient operation of trains.

This white paper is organized by section as follows:

(1) First, details on the charge given to the authors by the FRA.

(2) Next, working definitions of terms "reliance effect" and
"distraction effect" and the issues surrounding them.

(3) Review of the general human factors literature regarding humans and
automation, and specifically the reliance and distraction phenomena - for
example in piloting aircraft, driving highway vehicles, operating nuclear
power plants and performing routine machine operation tasks. For each of
the reliance and distraction effects the relevance to PTC automation is
discussed.

(4) Details of the relation of reliance and distraction to operations
under PTC, along with implied recommendations. This section, the longest,
reviews the "open system" nature of the rail transportation system,
proposes a "human-centered" design philosophy for PTC, comments
on the relevance of the UK's Great Western accident of 1997, discusses which
kinds of distraction are particularly threatening, analyses the potential
levels of automation for PTC design, and recommends which level seems best
for safety.

(5) Classroom and simulator training for PTC.

(6) Conclusions.

The conclusions are:

(1) Over-reliance on (or not knowing how much to rely on) automation,
and added distraction of having to (or poor ability to) monitor automation,
are well known problems in the human factors literature, but there are few
easy remedies.

(2) Maintenance of the locomotive engineer's perceptual, decision-making
and control skills is considered mandatory.

(3) A PTC system should provide an auditory warning of appropriate hazards
and graphical information about stopping profiles from the given speed.
Otherwise it should allow for manual operation, unless certain limits are
exceeded, at which point automatic braking enforcement should go into effect.

(4) Failures of a PTC system should be announced by a clearly discernible
auditory alarm, and the type and time of failure recorded on the locomotive
event recorder.

(5) Special classroom and simulator training for PTC operation, including
failure scenarios, should be given to train crews.

1. Charge from the FRA

The original charge to the RSAC "Human Factors Team" dated
3/30/99 was as follows.

(1) "Investigate the 'Reliance Effect' on the non-fail safe systems.
Will the operator become reliant upon the overlay system and become less
attentive? If so, is it possible to estimate the effect on the safety of
railroad operations? Are there countermeasures or redesign alternatives
that warrant exploration?"

(2) "Investigate the 'Distraction Effect' associated with frequent or
complex requirements to interact with the system. Is this a legitimate concern?
To what extent? If it is a significant problem, is it possible to describe
tolerable limits for these interactions and redesign alternatives that warrant
exploration?"

The 9/8/99 Report of the Railroad Safety Advisory Committee to the Federal
Railroad Administrator (page xiii, item 5.c) reads: "Develop human
factors analysis methodology to project the response of crews and dispatchers
to changes brought about by 'overlay' type PTC technology, including possible
'reliance' or 'complacency' and 'distraction' effects (initiated 2nd quarter
1999). Apply methodology to candidate projects."

2. The Concepts of Reliance and Distraction

2.1 Purpose of PTC and PTS

PTC has been defined to have the following core features in the Railroad
Safety Advisory Committee's report to the Federal Railroad Administrator
"Implementation of Positive Train Control Systems" (RSAC, 1999:
vii, 16-17).

"(1) Prevent train-to-train collisions (positive train separation).

(2) Enforce speed restrictions, including civil engineering restrictions
(curves, bridges, etc.) and temporary slow orders.

(3) Provide protection for roadway workers and their equipment operating
under specific authorities."

It should be noted that Positive Train Separation (PTS) is included in
the core-feature definition of PTC. Consequently, PTS need not be mentioned
in discussion of PTC without a particular reason to do so.

2.2. Working definitions of "Reliance Effect" and "Designed Reliance" in PTC Automation

The "reliance effect" is taken to refer to the tendency of
the LE, C or train dispatcher to over-rely (rely more than the system designers
or managers intend) on automation such as PTC in performing work tasks,
particularly to the degree that the automation is deemed not to be fail-safe
by itself. Concepts closely related to "reliance" are "complacency"
and "over-trust."

Insofar as the system is intentionally designed, or the level of automation
is such, that the human operator is compelled or encouraged to
defer to the automation, we call that "designed reliance." In
Section 4.5 below we make specific recommendations in that regard. There
may be a thin line between intentional, designed-in reliance and unintentional
over-reliance, especially as understood by the human operator.

2.3. Definition of Distraction Effect in PTC Automation

The "distraction effect" is assumed to refer to the tendency
of the LE to be distracted from other duties by frequent or complex cognitive
interactions with the automation to plan and program its operation, monitor
its performance, detect and diagnose and stay aware of any abnormalities,
and rectify any abnormalities and ensure control. (Of course there are other
distractions from radio conversation or wayside events.) Associated with
"distraction" are the concepts of "mental workload,"
"attention deficit," and decrement in "situation awareness."

2.4. Levels of Automation

Insofar as reliance implies reliance on automation by design, it is sometimes
useful to consider levels of automation from none to full computerized automation.
The following scale (Sheridan, 1987) has been used in a variety of contexts:

1. The computer offers no assistance: the human must do it all.
2. The computer suggests alternative ways to do the task.
3. The computer selects one way to do the task, and
4. executes that suggestion if the human approves, or
5. allows the human a restricted time to veto before automatic execution, or
6. executes automatically, then necessarily informs the human, or
7. executes automatically, then informs the human only if asked.
8. The computer selects, executes, and ignores the human.

The tendency to move further along this scale has been a continuing trend
in recent years, and is most evident in the evolution of commercial aircraft.
It began with autopilot systems, then came navigation aids, then diagnostic
aids, collision and stall and ground proximity warnings, and finally the
integration of all these into the Flight Management System, a multi-purpose
computer system which oversees all functions and through which the pilot
flies the aircraft. Pilots now call themselves "flight managers."
Similar evolution is beginning to happen in highway vehicles, ships, factories,
chemical plants, power stations, and hospitals as well as trains. It is
commonly called "supervisory control" (see Sheridan, 1987, 1992).

3. Review of Reliance and Distraction Effects in the General Literature, and Their Relevance to PTC

In considering the experimental literature as well as practical experience
with automation in piloting aircraft, driving highway vehicles, operating
nuclear power plants and performing routine manufacturing tasks, one cannot
discuss reliance without discussing complacency and trust.

3.1. Reliance Effect in the General Literature

When machines or people demonstrate their reliability it is only natural
to depend on, indeed trust, them. Most of the technology around us works
well, and even though our life may depend upon it, we simply do not think
about it. Do we rely on the roofs over our heads or the buildings we are
in not to fall down? Do we trust our brakes to slow and stop our cars from
high speeds? Obviously we do - unless there are environmental circumstances
(e.g., earthquakes, very steep hills) which cause us to make closer observations,
or unless we receive unexpected signals (ominous noises, leaking oil, etc.).
To some degree reliance on trustworthy systems is proper behavior, since
we do not have time or attentional capacity to attend to and worry about
everything around us. Clearly, however, one can become reliant on automation,
trusting and complacent (insofar as the third term implies the first two)
to a degree greater than is justified by the small risks which may be involved
(where risk means probability of serious consequences times magnitude of
those consequences). There have been numerous studies of human reliance
on automation recently (see, e.g., Riley, 1994; Sheridan, 1992; Parasuraman
and Mouloua, 1994; Mouloua and Koonce, 1997).

Safety engineers have long worried about whether, if actions are taken
to make systems safer, operators will simply take advantage of that safety
margin to take correspondingly more risks, to the point where level of safety
remains constant. The technical term for this is "risk homeostasis."
The evidence from automotive vehicles is clear: as brakes, tires, handling
qualities and highways have improved, drivers drive faster. Are they driving
so fast that the safety improvements are nullified? Apparently not, for
mortality and morbidity rates per passenger mile have declined significantly
over the last 50 years (see National Highway Traffic Safety Administration
database). At the same time it can be said they are not as safe as they
would be if they continued to drive at the same speeds as they did 50 years
ago. So clearly in this context risk homeostasis, in the sense of behaving
so as to maintain constant risk, is a false premise. But, surely, drivers
are taking advantage of the technology to achieve greater performance while
maintaining acceptable risk, where what is acceptable is now significantly
safer than it was earlier. "Acceptable" is an important term in
understanding human behavior relative to risk. It is also a relative term
regarding danger to humans and property. What might be acceptable to persons
removed from a danger might not be to persons directly affected by such
danger.

The story with respect to risk homeostasis appears to be similar in other
aspects of driving and in other transportation contexts. Currently there
is worry that radar-based intelligent cruise control systems will lead drivers
to follow the lead car more closely, and that GPS-based air traffic displays
in the cockpit, heretofore not available to pilots (only the ground controllers
saw radar returns) will lead pilots to second-guess ground controllers and
take more chances.

"Trust" is a term which is relatively new in the human factors
literature but which is drawing much attention. The term can have different
subtle meanings, but usually it relates to the subjective expectation of
future performance. Muir and Moray (1996) showed that as automation errors
manifest themselves trust declines and monitoring behavior increases. Lee
and Moray (1992) showed that subjective trust is a significant determiner
of whether an operator will use an automatic controller or, given the choice,
will opt for manual control. They modeled subjective trust as a function
of overall automation performance, the seriousness of faults, and the
recency of faults. They also discuss the mounting evidence that a system
is less trusted if there are no clear indications about what it is doing
or about to do. Aircraft pilots, for example, frequently complain that they
cannot tell what the automation is thinking or will do next (Woods and Roth,
1988).

Should we worry that human supervisors of automation may become complacent?
Clearly this raises the further question of what is the optimum level of sampling
the displays and/or adjusting the control settings. Given the relative
costs of attending to the automation (less time available to attend to other
things) and not attending, plus some assumptions about the statistics of
how soon after a sample the automation is likely to become abnormal, one
can specify an optimal sampling rate (Sheridan, 1970). If the operator samples
at the optimal rate that of course does NOT mean that critical signals will
never be missed - they still occasionally will. Moray (1999) argues that
if the optimal rate is not specified one can never assert that there is
complacency (assuming it means sampling at less than the optimal rate).
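
As a rough illustration of how an optimal sampling rate can follow from such costs, the sketch below assumes a deliberately simple model that is not taken from Sheridan (1970): each look at the display has a fixed cost, faults arrive at a low constant rate, and an unnoticed fault accrues cost in proportion to the time until the next look. All function names and numbers are hypothetical.

```python
import math


def expected_cost_rate(interval, sample_cost, fault_rate, delay_cost):
    """Approximate expected cost per unit time when sampling every `interval`.

    Assumed model (illustrative only):
      - each sample costs `sample_cost` (attention taken from other duties);
      - faults arrive at a low constant rate `fault_rate` per unit time;
      - an undetected fault costs `delay_cost` per unit time until the next
        sample, which on average is interval / 2 away.
    """
    sampling_term = sample_cost / interval
    undetected_term = fault_rate * delay_cost * interval / 2.0
    return sampling_term + undetected_term


def optimal_interval(sample_cost, fault_rate, delay_cost):
    """Interval that minimizes expected_cost_rate (derivative set to zero)."""
    return math.sqrt(2.0 * sample_cost / (fault_rate * delay_cost))


if __name__ == "__main__":
    c, lam, k = 1.0, 0.01, 50.0  # hypothetical cost and rate values
    best = optimal_interval(c, lam, k)
    print(f"optimal interval: {best:.2f}")
    for t in (best / 2, best, best * 2):
        rate = expected_cost_rate(t, c, lam, k)
        print(f"interval {t:5.2f} -> cost rate {rate:.3f}")
```

Under these assumptions, sampling either more or less often than the computed interval raises the expected cost, and even at the optimum a fault still goes unnoticed for half an interval on average.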
A recent qualitative model by Moray, Inagaki and Itoh (1999) suggests that
in the absence of faults or disagreements with the decisions of the automation,
subjective trust asymptotes to a level just below the objective reliability,
which does not suggest complacency.

A concern with automated warning systems is that a very small percentage
of warnings truly indicate the condition to be avoided. This occurs because
the designer has set the sensitivity threshold such that false alarms occur
much more often than misses (the misses carrying a much more serious consequence),
which is rational based on the objective tradeoff between risks associated
with each.

Signal detection theory, the same analytic technique that design engineers
developed during World War II to decide how to make the optimal trade-off
between false alarms and misses, has by now been widely applied to measuring
how humans should or actually do make the trade-off (Swets and Pickett,
1982; Parasuraman et al., 1998). It requires knowledge of probability densities
for true positives (hits) and false positives (false alarms) as functions
of input signals or symptoms, or the equivalent relative operating characteristic
(ROC) curve - the cross-plot of probability of hit vs. probability of false
alarm. It has been shown that the human operator does not respond mechanically
and indifferently to these events. Indeed, the fact that the warning system
may "cry wolf" so often may lead the operator to lose confidence
in the automated warning system and come to respond slowly or even ignore
it (Getty et al., 1995).

Classical expected-value decision theory, from which signal detection
theory is derived, can also be used to make optimal decisions as to whether
one or another form of automatic fault detection system is better, or whether
the human is better (Sheridan and Parasuraman, 1999).

3.2. Operating Crew Reliance, Trust and Complacency with PTC

With regard to "risk homeostasis" there is some question as
to whether an LE or C would ever be motivated to "take advantage"
of the safety margin in a PTC system. This is because of an ever-present
electronic monitoring of their acts. The event recorder on locomotives should
be an interacting subsystem of PTC. Event recording should be of failures
in PTC and other automation as well as errors in human performance. The
overall PTC system will serve as a kind of event recorder, just as does
the present centralized train control (CTC) system. Thus any infraction
of the operating rules by the LE will meet with the normal disciplinary
procedures and penalties - all the more so given the teeth that FRA
certification and decertification put in the rules.

At present many computer workstations in ordinary business offices monitor
and record the nature of an employee's work tasks and the speed, accuracy,
and rules-compliance of employee performance. PTC has a similarly
comprehensive ability to monitor operator compliance with the rules electronically.
The on-locomotive computers are all the more effective in this monitoring
because they interface with other machine systems, which are usually
electronic and often computerized. Railroads have traditionally conducted,
and are required by FRA regulations to conduct, in-field efficiency tests
for operating employees. PTC has the capability of continuously testing
operating personnel.

It is generally true that in automated warning systems only a very small
percentage of warnings truly indicate the condition to be avoided; most
are false alarms. Nevertheless, in railroading danger signals are ordinarily
observed. We distinguish between false alarms that are not safety critical and those
that involve railroading's "danger (stop)" signals, and we
recognize how frequently such danger signals occur in operation. A non-safety
in-cab warning such as "hot engine" or "dynamic brake overload"
might not be heeded immediately, but not so a danger signal. First, the
danger signal (such as a red stop-and-proceed signal) is common in railroading.
Repeating these signals on a display in the cab does not necessarily make
them any different in their operating effect on personnel. Second, railroaders
do not lose confidence in a danger signal: it might be for real; it might
be an efficiency test; or it might be a false-alarm "wolf cry."
But all tend to be heeded, regardless.

We would have to hypothesize PTC-generated wolf cries of danger signals
that would overcome the particular culture of safety in railroading that
observes possible wolf cries as danger signals. For example, when two torpedoes
unexpectedly explode on the rail head and, from experiential knowledge,
the LE immediately reduces to and observes restricted speed, it does not
matter whether a MOW flagman forgot to pick them up at the end of the workday,
or he left them for a good, unanticipated, reason. This is not an argument
against a need for PTC. The LE or C could be incapacitated or distracted
when first confronted with a danger signal.

A falsely and reportedly overacting warning device for a danger signal,
such as an in-cab alarm, might not be heeded as much as one not giving false
signals. But, then, the railroad rules ordinarily call for removing such
failed components from service and operating under more restrictive rules
than previously.

3.3. Distraction Effect in the General Literature

The long accepted Yerkes-Dodson "law" in experimental psychology
refers to the notion that with very low attentional demand humans get bored
and drowsy and are not vigilant, while with very high attentional demand
people cannot take in all appropriate information. Performance is best in
a broad middle-range of attentional demand.

During World War II there was interest in the low end of this curve because
watches on ships and monitors of sonar in submarines and radar in aircraft
ground control stations found themselves scanning electronic displays over
long periods for signals which seldom occurred. The associated research
was identified with the term "vigilance", and the net result was
a variety of studies which showed that after about 30 minutes people's monitoring
performance declines significantly (Mackworth and Taylor, 1963). Associated
studies of operators performing visual inspection tasks on assembly lines
produced a similar result. It has been alleged that in one test of
a cola bottle washing inspection operation, a higher percentage of clean
bottles resulted when cockroaches were randomly added to bottles at the
start of the line.

Interest in the high-demand end of the curve peaked in the mid 1970s
when many new attentional demands were being placed on fighter aircraft
pilots, and military laboratories started research on "mental workload."
At that same time, in conjunction with the certification of the MD-80, pressures
from aircraft manufacturers and airlines to automate and allegedly justify
reducing the crew from three to two set off a dispute with the pilots. The
regulatory agency, the Federal Aviation Administration, turned to the human
factors community to observe commercial pilots and try to define mental
workload. After a flurry of research, four methods were evolved to define
and measure mental workload: physiological indices, secondary task measures,
subjective scaling, and task analysis (Moray, 1988). It should be noted
that physical workload is nowadays relatively easily measured by percent
of CO2 increase between inhaled and exhaled respiratory gas, but this physical
workload has no correlation with what is called mental workload.

The various physiological indices tested over the years include: heart
rate variability, particularly in the power spectrum at 0.1 Hz; galvanic
skin response (as in a lie-detector test); pupil diameter; the 300 msec
characteristics of the transient evoked response potential; and formant
(spectral) changes in the voice (frequencies rise under stress). Unfortunately
none of these measures has proven satisfactory for most requirements because
the measures have to be calibrated to the individual being measured and
because they usually require relatively long time samples - often longer
than the period over which one seeks to measure changes in mental workload.

The second measure of mental workload is the secondary task. It assumes
that a human monitor has a fixed workload capacity, and that by giving the
test subject some easily measurable additional task (such as performing
mental arithmetic or simple tasks of motor skill), along with specific instructions
to perform the secondary task only when time is NOT required to perform the
primary task, "spare capacity" can be measured. The assumption
is made that the worse the performance on the secondary task, the greater
is the primary task mental workload. This technique has been used successfully
in laboratory tests, but is usually impractical in real-world tasks such
as landing an aircraft since operators refuse to cooperate because of possible
compromise with safety.

A third method, subjective scaling, is not the design engineer's ideal,
simply because it is subjective rather than objective. Yet it is the method
most often used, and indeed is the method most frequently used to validate
the other methods. NASA has developed a subjective scale called TLX and
the U.S. Air Force a scale called SWAT (Williges and Wierwille, 1979). Multi-dimensional
subjective scales have been suggested, including for example fraction of
time busy (spare capacity), emotional stress, and problem complexity - the
idea being that these are orthogonal attributes of a situation (Sheridan
and Simpson, 1979).

The fourth method, task analysis, simply considers the number of items
to be attended to, the number of actions to be performed, etc. without regard
to the operator's actual performance or subjective sense of workload. This
method has been criticized as not really being about mental workload because
it neglects level of training or experience. A well trained or experienced
operator, after all, may have an easy time performing a task, i.e., with
insignificant mental workload, where a novice might be heavily loaded. However,
such task analysis is amenable to objectivity, for example through use of the Shannon
(1949) information measure H = average of log[1/p(x)] = sum over x of p(x) log[1/p(x)],
p(x) being the probability
of each different stimulus element (x) which must be attended to (or different
response element which must be executed). This provides an index of "difficulty"
or entropy (degree of uncertainty to be resolved). The problem lies in the
somewhat arbitrary classification of stimulus and response elements.

For simple tasks, the greater the mental workload and/or information
difficulty (entropy) H, the greater the operator's response time (Hick, 1952;
Fitts, 1954), in almost direct proportionality to H. For complex tasks there
may be great variability in response time. It is well established that human
response times follow a log normal probability density, meaning that no
response takes zero time, and the 95th percentile may be one or two orders
of magnitude greater than the median. Experiments with experienced nuclear
plant operators responding to simulated emergencies showed an almost perfect
fit to a log normal function (Sheridan, 1992). The long responses often
result from confusion about what problem is presented to the person and
what is the expected criterion for satisfactory response.

There have been numerous studies to determine whether operators are better
monitors or failure detectors when they are controlling a task manually
or when they are monitoring automation. Mostly these studies have shown
that performance capability (in terms of failure detection and response
recovery) declines when operators are monitors of automation and the automation
fails (Wiener and Curry, 1980; Desmond et al., 1998; Wickens, 1992). However,
at the extreme where the operator is so heavily loaded performing manual
operations that there is no attentional capacity remaining for failure detection,
automation may provide relief and improved capability to detect failures.

One problem with automation is that there may be very little to do for
long periods of monitoring, but suddenly and without warning, the automation
may fail and/or unexpected circumstances may arise, and the operator is
expected to get back into the control loop instantly to set matters straight.
Such workload transients are deemed to be more troublesome in many cases
than sustained periods of high workload, for the operator is unlikely to
be able to "wake up," figure out what is happening, and quickly
make the correct decision.

A currently popular term in aviation is "situation awareness."
The ideal is to have a maximum level of situation awareness. A means to test
situation awareness in a simulator experiment is to stop the simulation
abruptly and unexpectedly and ask the subject to recall certain stimuli
or response events (Endsley, 1995; Endsley and Kiris, 1995). Improvements
in graphic displays and decision aids have been suggested to enhance situation
awareness. Automation which is opaque to the user may well impede situation
awareness. However, it has been pointed out that to the extent that an operator
expends more mental effort on situation awareness, to that extent less spare
mental capacity, if we can accept that notion, remains for decision and
response execution (Sheridan, 1999).

3.4. Maintaining Performance in a Broad Middle-Range of Attentional Demand

Given the Yerkes-Dodson "law," that with very low attentional
demand humans do get bored and drowsy and are not vigilant, and with very
high attentional demand people cannot take in all appropriate information,
safety is clearly best in a broad middle-range of attentional demand. But
how do we assure this in PTC operations for the C and LE? The most effective
way to assure operation in the mid-range is by skills maintenance through
retention of most pre-PTC motor and cognitive work tasks, despite the "designed
in reliance" effect of PTC. A primarily manual operation of trains
by the LE and C, with a fully automated safety compliance backup, is therefore
necessary. This primarily manual operation should be at level 2
of the automation scale (the PTC suggests alternative ways to do the task)
or, perhaps, level 3 (the PTC selects one way to do the task). That is, the system
provides an audible warning in advance of a civil speed restriction (CSR),
a signal (in-cab or otherwise) change to a more restrictive indication,
or some other restriction of train movement. And the system also meets the
requirement of PTC in that the restrictions will be enforced by a sub-system
on board the locomotive at level 6 (the PTC executes automatically, then
necessarily informs the human and the event recorder). In all, the automation
scale level of 2 or 3 is what we strive for as normal PTC operation, but
level 6 must always be operable in the background as the safeguard.
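
The arrangement described above, advisory automation in normal running with level 6 enforcement in the background, can be sketched as follows. This is a minimal illustration only: the function name, the single permitted-speed input and the warning margin are hypothetical and do not describe any actual PTC design.

```python
from enum import IntEnum


class AutomationLevel(IntEnum):
    """Levels of the automation scale listed in Section 2.4."""
    NO_ASSISTANCE = 1
    SUGGESTS_ALTERNATIVES = 2
    SELECTS_ONE = 3
    EXECUTES_IF_APPROVED = 4
    EXECUTES_UNLESS_VETOED = 5
    EXECUTES_THEN_INFORMS = 6
    EXECUTES_INFORMS_IF_ASKED = 7
    IGNORES_HUMAN = 8


def ptc_response(speed_mph, permitted_mph, warn_margin_mph=5.0):
    """Return (level, action) for the current speed.

    Assumptions (hypothetical): `permitted_mph` is the highest speed allowed
    at this point by the braking profile to the next restriction, and
    `warn_margin_mph` sets how early the audible warning sounds.
    """
    if speed_mph > permitted_mph:
        # Background safeguard: enforce, then inform crew and event recorder.
        return (AutomationLevel.EXECUTES_THEN_INFORMS,
                "automatic braking; inform crew and log to event recorder")
    if speed_mph > permitted_mph - warn_margin_mph:
        # Normal mode: advise only, the engineer keeps manual control.
        return (AutomationLevel.SUGGESTS_ALTERNATIVES,
                "audible warning; manual operation continues")
    return (AutomationLevel.SUGGESTS_ALTERNATIVES,
            "no alert; manual operation continues")


if __name__ == "__main__":
    for speed in (40.0, 47.0, 52.0):
        level, action = ptc_response(speed, permitted_mph=50.0)
        print(f"{speed:4.0f} mph -> level {int(level)}: {action}")
```

In this sketch the engineer keeps manual control in both of the first two cases; only exceeding the permitted speed moves the system to level 6.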
Part 2 of the PTC White Paper will be published in the February 2000 issue of the Locomotive Engineer Newsletter.
A complete copy of the 23-page report can be found on the BLE webpage,
at http://www.ble.org/pr/news/ptcposition.pdf.
© 2000 Brotherhood of Locomotive Engineers