Applying Cybernetic Principles to Understand Speaking
What is speaking and how is it seen from a cybernetic point of
view? Speaking can be conceived of as a biological action which produces
an acoustical time-series we call speech. Numerous biological systems are
involved in this action, but it is possible to think of them collectively
as a single large bioacoustical system which produces a series of sound
units sometimes known as phonemes. The phonemes are ordered, organized
and distributed over a time sequence. As listeners we recognize them ordinarily
as speech. The speaking system is complex and must be controlled and coordinated
by some central agency. That control system we normally refer to as the
brain. But how does the brain control the speaking system?
Grant Fairbanks (1954) seems to have been the first person to
explicitly apply the cybernetic principles of feedback control to language
and human speech production. He created a basic feedback control model
in which he pointed out that "auditory monitoring" of one's own speech
is not just an ancillary listening function, but rather an integral part
of the speech control system. He proposed a model of speech as a servomechanism
in which the speech is controlled through feedback loops. Fairbanks's insights
into the application of cybernetic principles, so soon after Wiener's published
account, show his vision to be truly prophetic. He was, however, apparently
too far ahead of his contemporaries. His experiments with delayed auditory
feedback (Fairbanks, 1955) provided dramatic evidence and did create some
discussion. The monitor model presented by Krashen (1977) is in fact derived
from discussions about delayed auditory feedback experiments. Unfortunately,
various ramifications of cybernetic principles themselves have not been
followed up and studied by people within the language field and have therefore
not been fully understood by most members of the profession.
The task of exploring the effects of applying cybernetic principles
to human behavior fell primarily to people outside the field of language.
In a way, this is both predictable and regrettable. Motor behavior is
a relatively simple act, which can be examined relatively straightforwardly.
Language performance, on the other hand, involves a much larger portion of
"symbolic manipulation" prior to the public response portion of the language
performance. Hence it is more complex and more difficult to study. Nevertheless,
it is one of the most common human behaviors we know. To understand language
as a form of symbolic behavior from a cybernetic point of view would be
a major step forward for understanding human behavior in general. Regardless,
the other disciplines have provided some solid stepping stones for language
people to learn from.
A hierarchical aspect of the feedback model was introduced to
the field of psychology by a neurologist, Pribram, and a psychologist, Miller
(Miller et al., 1960). The model they introduced was called TOTE,
which stood for Test, Operate, Test, Exit. It used the Test as a feedback
sensor which compared the actual performance with an intended performance.
For example, if one is hammering a nail, one first checks to see if it
is still sticking out, then one Operates, or hammers the nail, then one
Tests to see if it is still sticking out, or flush. If the Test indicates
it is still sticking out, then one Operates, or hammers the nail until
it is flush, and then one Exits to another behavior or another nail. What
the authors pointed out, however, was that the Operation phase was not quite
so simple. The Operate phase itself contained a TOTE. For example, within
the operation "hammer", one must first Test whether the hammer is up or
down. If down, then the Operate must be to raise or lift the head of the
hammer. If up, then the Operate must be to lower or hit the nail with the
hammer. But one can go even further. Within the operation "lift" there
must also be a TOTE regarding the state of the various muscles involved.
It must Test whether they are relaxed, tensed, etc.
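The nesting described above can be sketched in a few lines of code. This is only an illustrative sketch, not Miller et al.'s own formalism: the names `tote` and `NailTask`, the nail height, and the distance driven per blow are all assumptions chosen to make the example concrete. The outer unit Tests whether the nail is flush; its Operate phase ("hammer") contains an inner unit that Tests whether the hammer is up and lifts it if it is not.

```python
def tote(test, operate, max_cycles=100):
    """Run one Test-Operate-Test-Exit unit.

    test() returns True once the intended state is reached (Exit);
    operate() is called once per cycle until then.
    Returns the number of Operate cycles that were needed.
    """
    cycles = 0
    while not test():                      # Test
        if cycles >= max_cycles:
            raise RuntimeError("intended state not reached")
        operate()                          # Operate, then Test again
        cycles += 1
    return cycles                          # Exit


class NailTask:
    """Hypothetical nail-hammering task; all numbers are illustrative."""

    def __init__(self, nail_height_mm=9, drive_per_blow_mm=3):
        self.nail_height_mm = nail_height_mm
        self.drive_per_blow_mm = drive_per_blow_mm
        self.hammer_up = False
        self.log = []

    def nail_flush(self):
        # Outer Test: directed at the environment (the nail).
        return self.nail_height_mm <= 0

    def lift(self):
        self.hammer_up = True
        self.log.append("lift")

    def strike(self):
        self.hammer_up = False
        self.nail_height_mm = max(0, self.nail_height_mm - self.drive_per_blow_mm)
        self.log.append("strike")

    def hammer(self):
        # Outer Operate contains an inner TOTE: Test the hammer position,
        # lift if it is down, then strike once it is up.
        tote(lambda: self.hammer_up, self.lift)
        self.strike()


task = NailTask()
blows = tote(task.nail_flush, task.hammer)
print(blows, task.log)
```

A further level (the state of the muscles within "lift") could be added in the same way, by giving `lift` its own inner `tote` call.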
The hierarchical TOTE model pointed to two major factors. One
was that in a hierarchy of feedback loops, some of the Test units could
be focused upon the environment external to the total system, i.e. the
person looking at the nail, but some of them, while external to the immediate
Operate system, were in fact inside the skin of the total system. In terms
of the speech system what is being indicated is that feedback loops exist
not only through the ear, but also through proprioceptive senses and even
within the brain itself. The second factor which became apparent through
the model was that time scales are not the same at all levels. The amount
of time it takes to adjust one's muscles must be much more rapid than the
time it takes to look at the nail. In language training, rate of processing
has generally been ignored as a factor, while a cybernetic approach would
put it in a central position. It was up to the physiologists to examine
this factor more completely.
Welford (1974) has pointed out some of the significance of including
reaction time as a critical factor in understanding behavior from a cybernetic
point of view.
A baseball player who is highly skilled at bat is just as subject to the law of reaction time as anyone. As a batter views the ball coming, he must make a prediction about its continued movement before he begins to move his bat. Once the bat is in motion, the movement continues even though he may see the ball dropping off and curving away from the position he predicted only a fraction of a second earlier. There is not sufficient time to change the movement of the bat, and the called strike only reconfirms the consequence of reaction time. Welford indicates that both the principle of prediction and that of ballistic continuity have far-reaching implications.
Prediction means that action cannot depend on any simple connection between stimulus and response, but must involve a more or less complex computation. To the extent that this is so, the stimulus-response and conditioned reflex approaches to performance are inadequate. The ballistic principle means that elementary units of action, if one may so speak legitimately, are essentially timed and phased sequences of muscular contractions and relaxations which are initiated as wholes. (p. 382)
The prediction referred to by Welford is similar to the "end,"
or "result" referred to by Powers earlier. The ballistic principle is similar
to Powers's "means" or "act". The prediction portion involves a more or
less complex computation of anticipated or expected events based upon prior
learning. As Bandura (1977) has pointed out, "There are certain regularities
in the succession or coexistence of most environmental events. Such uniformities
create expectation about what leads to what. Knowledge of conditional relations
thus enables one to predict with varying accuracy what is likely to happen
under given antecedent conditions." (p. 58) The brain is a very good statistical
machine. It even works for languages. But as Bandura points out, it develops
language expectancies not through just frequency counts, but through the
abstracting of rules based upon that statistical analysis. "Rather than
simply copying individual utterances, children learn sets of rules which
enable them to generate an almost infinite variety of new sentences that
they have never heard. It is abstract modeling, with its perceptual, cognitive,
and reproductive component processes, rather than simple verbal mimicry,
that is most germane to the development of generative grammar." (p. 174)
He then goes on to point out that, "During initial language learning
. . . they can acquire linguistic rules without engaging in any motor speech."
(p. 174)
The ballistic principle is also a type of computational process.
It involves a whole unit of movement. It is a type of "programmed" behavior.
In playing tennis, the unit is the whole stroke involving both the drawing
back and the driving forward of the racquet. In playing a musical instrument,
the unit is not the single note but the phrase or
arpeggio. In language usage, the ballistic unit is the "thought unit",
the phrase, which is generally about 3 seconds in length (Turner and Poppel,
1983, p. 296), not the word or the phoneme. In these and similar cases, a
complex computation is involved based not only upon the immediate stimulus
to action, but also future goals, past experiences and concurrent factors
such as posture at the moment of action. Welford points out that although
the repeated musical phrase or particular tennis stroke is in one sense
the same as those executed previously, in another sense, it is never the
same. This was the conclusion of Bartlett (1932) when in talking about
the stroke of an athlete in a skilled athletic game, he observed,
In speaking, we generate a new sound which has its own unique
characteristic depending upon the linguistic context, our emotional set,
the audience etc., every time we talk, even though we may be reciting the
same phrase we have said a dozen times before. The controlling feature
for both the ballistic computation and the prediction computation is feedback.
The ballistic computation takes into account feedback from both peripheral
senses such as eyes and touch, and from proprioceptive senses within the
muscular structure. They operate at different speeds and they are therefore
TOTE-type computations. The prediction computations take place even faster,
in the cerebral cortex itself. There are then feedback loops between
the ballistic consequences and the predictions computed. Most of these
feedback loops act much faster than our conscious understanding of them,
which is one of the reasons we tend to ignore them.
Welford proceeds to analyze the consequences of the two principles
of prediction and elementary units of actions with respect to time. Using
experimental research data from tracking experiments, he pointed out that
corrections made in tracking were not continuous, but rather " . . . corrections
were intermittent, as if the subject observed an error, made a correction
for it, observed again, made a further correction, and so on." (p. 383)
If, for example, a signal to react has been given, and during the reaction
time to it a further signal appears, response to the second signal is delayed
by an amount which suggests that the central processes required to deal
with it did not begin until the reaction time to the previous signal had
ended. Seemingly, monitoring the response occupies the central mechanisms
to an extent that precludes their dealing with fresh signals for action
until the monitoring is finished. Serial action such as in tracking or
in speaking, appears to involve an alternation between, on the one hand,
the observing of signals, and computing of responses to them, and on the
other hand, monitoring the response made. In other words, " . . . the speed
of serial action does not depend upon the time taken to execute movements,
but upon the time required to decide and monitor them." (p. 384)
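The delay to a second signal can be made concrete with a small worked sketch. It assumes a single central mechanism that handles one signal at a time and needs a fixed interval to decide on and monitor each response; the function name `response_latencies` and the figure of 250 ms are illustrative assumptions, not values taken from Welford. Times are in whole milliseconds.

```python
def response_latencies(signal_onsets_ms, central_time_ms):
    """Latency of each response when central processing is strictly serial.

    A signal that arrives while the central mechanism is still occupied
    with the previous one has to wait until that mechanism is free.
    """
    free_at = 0          # time at which the central mechanism is next free
    latencies = []
    for onset in signal_onsets_ms:
        start = max(onset, free_at)        # wait if the mechanism is busy
        free_at = start + central_time_ms  # decide on and monitor the response
        latencies.append(free_at - onset)
    return latencies


# The second signal arrives 100 ms after the first, inside its reaction
# time; the third arrives long after the mechanism has become free.
latencies = response_latencies([0, 100, 1000], central_time_ms=250)
print(latencies)  # [250, 400, 250]
```

The second response is delayed by exactly the 150 ms for which the central mechanism was still occupied, while the movement time itself plays no part in the calculation.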
Nuttin and Greenwald (1968) have made a distinction similar to
that made by Welford between prediction and ballistic, and Powers between
results and acts or ends and means.
Applying this analysis to the language using process, we can recognize
that the prediction or preparatory phase of speech behavior is the internal
generation of the phoneme time-series to be generated. These are then held
in a Test point for monitoring the executive or Operate stage of the performance
cycle. "In other words, a representation of the sound being spoken must
be made available to the perceptual centers for comparison with the auditory
signal that is produced during the articulation of the intended message.
If the two match, then the perceptual representation of the auditory signal
remains constant and stable." (Lackner, 1974, p. 901) If they do not match
as in the delayed auditory feedback experiments, then there occurs a
". . . slowing of speech rate, increased loudness of speaking, elevation
of pitch of the voice, and a blocking of the normal flow of words that
result in artificial stutter. Many errors of articulation appear, including
omissions, additions, and substitutions of syllables or words." (Smith
& Smith, 1965, p. 400). In simple terms we speak what we expect to
hear. The ability to hear, to know what to listen for, then comes first.
The building of the cognitive map of expectations is what the comprehension
approach is focused on. The speaking will be controlled through a feedback
loop process.
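The comparison between the expected and the heard signal can be sketched very roughly as follows. This is a toy illustration under strong assumptions: speech is reduced to a list of discrete units produced one per time step, the hypothetical function `count_mismatches` stands in for the perceptual comparison, and the delay is expressed in whole units rather than seconds.

```python
def count_mismatches(intended_units, delay_units=0):
    """Count time steps at which the unit heard differs from the unit expected.

    intended_units: the internally generated series of sound units.
    delay_units: how many time steps late the auditory feedback arrives.
    """
    # What reaches the ear at each time step: silence (None) during the
    # delay, then the speaker's own output, shifted later in time.
    heard = [None] * delay_units + list(intended_units)
    return sum(
        1
        for expected, actual in zip(intended_units, heard)
        if expected != actual
    )


utterance = ["s", "p", "ee", "ch"]
print(count_mismatches(utterance, delay_units=0))  # 0: feedback matches
print(count_mismatches(utterance, delay_units=1))  # 4: every step disagrees
```

With no delay the two series match and the comparison stays stable; with even one unit of delay every comparison fails, which is the situation the delayed auditory feedback experiments create.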
The prediction computation in language without a motor response
is what we could call "thinking" in the language. The prediction computation
is also critical in the process of listening, since we do not just listen
"passively", but actually generate (predict) what will be heard and then
monitor what is heard to compare it to what was expected. Learning to listen
then reduces to essentially learning the predictive computation rules. It
is, however, at this point that the cybernetic models for motor behavior
are somewhat oversimplified. While they are correct as far as they go,
they cannot adequately handle the complexity of symbolic prediction. A
full coverage of cybernetic understanding of symbolic behavior is beyond
the scope of this paper, but a brief sketch will illustrate the principal
aspects.
There are basically two aspects of language, the form and the
meaning. There are therefore three sets of computations which are necessary
to generate the prediction or what is to be heard portion of the performance.
We now have evidence that these computations take place in different
parts of the brain. The first set is the semantic associations. From the
general context of the situation, certain semantic associations are generated
in thinking. That is, certain expectancies about meaning are
generated. There is some evidence that this takes place primarily in the
frontal lobes of the brain. (Luria, 1973, p. 318) A second set of computations
is the symbolic encapsulation of these meanings into the associated word
or form in the particular language being used. These computations apparently
take place in the "listening" portion of the brain. "If the word is to
be spoken, the pattern is transmitted from Wernicke's area to Broca's area
where the articulatory form is aroused and passed on to the motor area
that controls the movement of the muscles of speech." (Geschwind, 1972,
p. 79) The third set is the computation of the form structure, that is, the
expectations regarding the sequencing of the words and the function words. This
we generally refer to as the grammatical rules, and it takes place separately.
"There thus appears to be a natural neurological separation between the
functions of processing sentence form and that of processing semantic representations.
This evidence can be taken as an encouraging sign that it is possible to
connect cognitive and neurological organization in the domain of language."
(Zurif, 1980, p. 311)
Critics of the comprehension approach point out that listening
is primarily a semantic activity and that one can listen and understand
without really knowing the syntactic or form rules, but these, they point
out, are crucial in speaking. These are redundant systems, and one could
learn just the "semantic" rules and "appear" to be a good listener. The
lack of syntactic computational rules would perhaps not show up until one
tried to speak.
While this fact has been used as an argument that speech is necessary
for the syntactical organization to be learned, the element of rate of processing
does not seem to be taken into account sufficiently. The feedback process
from the peripheral organs and from the proprioceptors is of a different
time scale from that of the feedback loop within the cerebral cortex. The
semantic and syntactical computations must be coordinated through feedback
at this cortical time scale, or it will not be fast enough to control the
rest of the system. While speaking practice may help focus the listener's
attention on certain critical syntactic points in later listening, the
speaking itself does not enhance the learning of these syntactic points
directly. This must be done at a rate of processing only available in
listening.
Osgood (1957) proposed a three level model, which could account
for the different feedback time scales necessary to operate fluent speech.
Turner and Poppel (1983) have examined in even greater detail
the different time scales. "Events separated by periods of time shorter
than three thousandths of a second are classified by the hearing system
as simultaneous . . . If the sounds are a little more than .003 sec. apart,
the subject will experience two sounds. However, he will not be able to
tell which of the two sounds came first . . . When two sounds are about
three hundredths of a second apart, a subject can experience sequence,
accurately . . . Once the temporal interval is above three tenths of a
second . . . [there] is enough time for a human subject to react to an
acoustic stimulus." (p. 294) The delayed auditory feedback experiments
created maximum disturbance when the delay was approximately .25 seconds.
This is approximately the time scale of the projection level in Osgood's
model. The integration level on the other hand which must deal with sequences
can operate approximately 10 times faster, and the representational level
100 times faster. On the other hand, it appears that in the actual operations,
"A human speaker will pause for a few milliseconds every three seconds
or so, and in that period decide on the precise syntax and lexicon of the
next three seconds." (p. 296) In other words, the prediction and ballistic
units appear to be about 3 second intervals. Similarly, "A listener will
absorb about three seconds of heard speech without pause or reflection,
then stop listening briefly in order to integrate and make sense of what
he has heard." (p. 296) Through the use of a short term memory "buffer",
the speech units are a) generated, b) monitored, and c) regenerated and remonitored
if necessary in order to make sense out of what was said.
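The time scales quoted above can be laid out as a small worked example. The thresholds of 3, 30 and 300 milliseconds come from the Turner and Poppel passage; treating them as sharp cut-offs is a simplifying assumption, as are the function name `classify_interval` and the use of 250 ms (the delay of maximum disturbance) as the projection-level figure from which the two faster levels are derived.

```python
def classify_interval(gap_ms):
    """Classify the gap between two sounds, using sharp cut-offs as an
    approximation of the ranges described by Turner and Poppel."""
    if gap_ms < 3:
        return "simultaneous"
    if gap_ms < 30:
        return "two sounds, order unknown"
    if gap_ms < 300:
        return "sequence perceived"
    return "time to react"


# Osgood's three levels, taking the projection level as roughly 250 ms
# and the other two as 10 and 100 times faster, as stated in the text.
PROJECTION_MS = 250
levels_ms = {
    "projection": PROJECTION_MS,
    "integration": PROJECTION_MS / 10,
    "representational": PROJECTION_MS / 100,
}

for name, duration in levels_ms.items():
    print(f"{name}: {duration} ms -> {classify_interval(duration)}")
```

On these assumptions the representational level (2.5 ms) falls below the threshold of perceived simultaneity, the integration level (25 ms) falls in the range where two sounds are heard without a reliable order, and only the projection level approaches the time needed to react.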
Those who criticize the delay of oral response approach do not
seem to understand that with appropriate listening guidance, both semantic
and syntactic rules could be learned. Listening exercises can be created
which focus on syntactical structure, just as listening tests have been
created to test for comprehension of Broca's aphasic patients. "Broca's
aphasics understand a sentence primarily by inferring what makes factual
sense from a sampling of the major lexical items of the sentence - its nouns
and verbs - independent of syntactic structure. When they can not make use
of semantic and pragmatic cues their comprehension fails." (Zurif, 1980,
p. 307) Listening exercises can be created which develop a predictive capacity
for both semantic and syntactic aspects of language. Oller (1972) has referred
to this latter capacity as "grammar of expectancy".
The issue is not whether some people might not learn the syntactic
rules by just listening, but determining how to create listening exercises
which do in fact do just that. We know for example that grammatical features
of speech are more informative and distinguishable when the semantic references
for the utterances are present than when they are absent. Young children,
for example, are aided in comprehending plural forms if they hear singular
and plural labels applied to single and multiple objects respectively.
The acquisition of syntactic language rules is greatly facilitated by pairing
linguistic modeling with perceptual references. This has been confirmed
by Brown (1976) in an experimental study.
This has also been the kind of listening practice advocated by
those such as James Asher (1969), who has strongly promoted the use of
the Total Physical Response Strategy. The same kind of listening practice
has been advocated by most of those who promote the delay of oral response.
Advocates of authentic listening, on the other hand, tend to ignore this
use of perceptual referents for syntactical differentiation. Many listening
exercises for adult second language learning also ignore this point. Many
of these listening exercises tend to involve the "general meaning" of the
passage. The consequence is that learners do not necessarily learn the syntactic
rules, and hence transfer to speaking is neither automatic nor complete.
The prediction principle, to be useful, must be understood more completely.
The cybernetic approach points out that if the feedback concept is to hold
for both the syntactic prediction aspect and the semantic prediction
aspect, then both must be developed through carefully designed listening
exercises.