ListenSpeak2

Applying Cybernetic Principles to Understand Speaking

What is speaking and how is it seen from a cybernetic point of view? Speaking can be conceived of as a biological action which produces an acoustical time-series we call speech. Numerous biological systems are involved in this action, but it is possible to think of them collectively as a single large bioacoustical system which produces a series of sound units sometimes known as phonemes. The phonemes are ordered, organized and distributed over a time sequence. As listeners we ordinarily recognize them as speech. The speaking system is complex and must be controlled and coordinated by some central agency. That control system we normally refer to as the brain. But how does the brain control the speaking system?
Grant Fairbanks (1954) seems to have been the first person to explicitly apply the cybernetic principles of feedback control to language and human speech production. He created a basic feedback control model in which he pointed out that "auditory monitoring" of one's own speech is not just an ancillary listening function, but rather an integral part of the speech control system. He proposed a model of speech as a servomechanism in which the speech is controlled through feedback loops. Fairbanks's insights into the application of cybernetic principles so soon after Wiener's published account show his vision to be truly prophetic. He was, however, apparently too far ahead of his contemporaries. His experiments with delayed auditory feedback (Fairbanks, 1955) provided dramatic evidence and did create some discussion. The monitor model presented by Krashen (1977) is in fact derived from discussions about delayed auditory feedback experiments. Unfortunately, the various ramifications of the cybernetic principles themselves have not been followed up and studied by people within the language field and have therefore not been fully understood by most members of the profession.
The task of exploring the effects of applying cybernetic principles to human behavior fell primarily to people outside the field of language. In a way, this is both predictable and regrettable. Motor behavior is a relatively simple act, which can be examined relatively straightforwardly. Language performance, on the other hand, involves a much larger portion of "symbolic manipulation" prior to the public response portion of the language performance. Hence it is more complex and more difficult to study. Nevertheless, it is one of the most common human behaviors we know. To understand language as a form of symbolic behavior from a cybernetic point of view would be a major step forward for understanding human behavior in general. Regardless, the other disciplines have provided some solid stepping stones for language people to learn from.
A hierarchical aspect of the feedback model was introduced to the field of psychology by a neurologist, Pribram, and a psychologist, Miller (Miller et al., 1960). The model they introduced was called TOTE, which stood for Test, Operate, Test, Exit. It used the Test as a feedback sensor which compared the actual performance with an intended performance. For example, if one is hammering a nail, one first Tests to see if it is still sticking out, then one Operates, or hammers the nail, then one Tests again to see if it is still sticking out or is flush. If the Test indicates it is still sticking out, then one Operates, or hammers the nail, until it is flush, and then one Exits to another behavior or another nail. What the authors pointed out, however, was that the Operate phase was not quite so simple. The Operate phase itself contained a TOTE. For example, within the operation "hammer", one must first Test whether the hammer is up or down. If down, then the Operate must be to raise or lift the head of the hammer. If up, then the Operate must be to lower the hammer and hit the nail. But one can go even further. Within the operation "lift" there must also be a TOTE regarding the state of the various muscles involved. It must Test whether they are relaxed, tensed, etc.
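The nested Test, Operate, Test, Exit cycle of the hammering example can be sketched in a few lines of Python. This is only an illustrative sketch and not the formalism of Miller et al.: the function names (tote, hammer_nail), the state dictionary, and the numbers (a nail three units high, driven one unit per blow) are all assumptions introduced here.

```python
def tote(test, operate, max_cycles=100):
    """Generic TOTE unit: Test, Operate, Test again, Exit once the Test passes.

    Returns the number of Operate phases that were needed.
    """
    cycles = 0
    while not test():              # Test: does the actual state match the intended one?
        if cycles >= max_cycles:
            raise RuntimeError("goal not reached")
        operate()                  # Operate: act to reduce the mismatch
        cycles += 1
    return cycles                  # Exit: control passes to another behavior


def hammer_nail(nail_height=3):
    """Hypothetical two-level hierarchy: the outer Operate phase contains its own TOTE."""
    state = {"nail": nail_height, "hammer_up": False, "log": []}

    def nail_is_flush():
        return state["nail"] == 0

    def lift():
        state["hammer_up"] = True
        state["log"].append("lift")

    def strike():
        state["hammer_up"] = False
        state["nail"] -= 1         # assumption: each blow drives the nail one unit
        state["log"].append("strike")

    def hammer():
        # Inner TOTE: Test whether the hammer is up; if it is down, Operate by lifting.
        tote(lambda: state["hammer_up"], lift)
        # With the hammer up, the matching Operate is to bring it down on the nail.
        strike()

    blows = tote(nail_is_flush, hammer)
    return blows, state["log"]
```

Calling hammer_nail(3) runs the outer loop three times, and each outer Operate first runs the inner lift loop before striking. A third level (the muscle states within "lift") could be added in the same way, by replacing the body of lift with another call to tote.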
The hierarchical TOTE model pointed to two major factors. One was that in a hierarchy of feedback loops, some of the Test units could be focused upon the environment external to the total system, i.e. the person looking at the nail, but some of them, while external to the immediate Operate system, were in fact inside the skin of the total system. In terms of the speech system, this indicates that feedback loops exist not only through the ear, but also through proprioceptive senses and even within the brain itself. The second factor which became apparent through the model was that time scales are not the same at all levels. The adjustment of one's muscles must be much more rapid than the act of looking at the nail. In language training, rate of processing has generally been ignored as a factor, while a cybernetic approach would put it in a central position. It was up to the physiologists to examine this factor more completely.
Welford (1974) has pointed out some of the significance of including reaction time as a critical factor in understanding behavior from a cybernetic point of view.

The fact that it takes time to react has two consequences which are not fully recognized as they deserve to be. First, in any developing situation, there must be some prediction of the trend of events because, if action is to be appropriate at the time it becomes effective, it must have been decided before the events it is designed to meet have occurred. Second, any action taken must be ballistic in the sense that an appreciable time must elapse before it can be modified by any fresh information or feedback from the action itself. (p. 382)

A baseball player who is highly skilled at bat is just as subject to the law of reaction time as anyone. As a batter views the ball coming, he must make a prediction about its continued movement before he begins to move his bat. Once the bat is in motion, the movement continues even though he may see the ball dropping off and curving away from the position he predicted only a fraction of a second earlier. There is not sufficient time to change the movement of the bat, and the resulting strike only reconfirms the consequence of reaction time. Welford indicates that both the principle of prediction and that of ballistic continuity have far-reaching implications.

Prediction means that action cannot depend on any simple connection between stimulus and response, but must involve a more or less complex computation. To the extent that this is so, the stimulus-response and conditioned reflex approaches to performance are inadequate. The ballistic principle means that elementary units of action, if one may so speak legitimately, are essentially timed and phased sequences of muscular contractions and relaxations which are initiated as wholes. (p. 382)
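The two consequences of reaction time can be illustrated with the batter's situation. The sketch below is an assumption-laden toy and not Welford's model: it treats the ball's height as a single number, has the batter extrapolate linearly, and uses hypothetical function names and values throughout.

```python
def predicted_position(position, velocity, lead_time):
    """Linear extrapolation: where the ball should be after lead_time seconds."""
    return position + velocity * lead_time


def swing_error(observations, true_position_at_contact, reaction_time, contact_time):
    """Error of a ballistic swing aimed from the last usable observation.

    observations: time-ordered list of (time, position, velocity) samples.
    Only samples taken at least reaction_time before contact can influence
    the swing; anything seen later arrives too late to modify it.
    """
    deadline = contact_time - reaction_time
    usable = [obs for obs in observations if obs[0] <= deadline]
    if not usable:
        raise ValueError("no observation early enough to act on")
    t, pos, vel = usable[-1]       # last sample that can still shape the action
    aim = predicted_position(pos, vel, contact_time - t)
    return aim - true_position_at_contact


# Hypothetical pitch: the ball looks level early on, then drops late.
samples = [(0.0, 1.0, 0.0), (0.1, 1.0, 0.0), (0.35, 0.9, -1.0)]
late_drop_error = swing_error(samples, 0.85, reaction_time=0.2, contact_time=0.4)
no_delay_error = swing_error(samples, 0.85, reaction_time=0.0, contact_time=0.4)
```

With a reaction time of 0.2 seconds the late drop is seen but cannot be used, so the swing misses high; with no reaction time at all the same observations would give an accurate swing. The difference between the two errors is entirely a consequence of the time it takes to react.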

The prediction referred to by Welford is similar to the "end" or "result" referred to by Powers earlier. The ballistic principle is similar to Powers's "means" or "act". The prediction portion involves a more or less complex computation of anticipated or expected events based upon prior learning. As Bandura (1977) has pointed out, "There are certain regularities in the succession or coexistence of most environmental events. Such uniformities create expectation about what leads to what. Knowledge of conditional relations thus enables one to predict with varying accuracy what is likely to happen under given antecedent conditions." (p. 58) The brain is a very good statistical machine, and this holds for language as well. But as Bandura points out, it develops language expectancies not just through frequency counts, but through the abstracting of rules based upon that statistical analysis. "Rather than simply copying individual utterances, children learn sets of rules which enable them to generate an almost infinite variety of new sentences that they have never heard. It is abstract modeling, with its perceptual, cognitive, and reproductive component processes, rather than simple verbal mimicry, that is most germane to the development of generative grammar." (p. 174) He then goes on to point out that, "During initial language learning . . . they can acquire linguistic rules without engaging in any motor speech." (p. 174)
The ballistic principle is also a type of computational process. It involves a whole unit of movement. It is a type of "programmed" behavior. In playing tennis, the unit is the whole stroke, involving both the drawing back and the driving forward of the racquet. In playing a musical instrument, the unit is not the single note but the phrase or arpeggio. In language usage, the ballistic unit is not the word or the phoneme but the "thought unit", the phrase, which is generally about 3 seconds in length (Turner and Poppel, 1983, p. 296). In these and similar cases, a complex computation is involved, based not only upon the immediate stimulus to action, but also upon future goals, past experiences and concurrent factors such as posture at the moment of action. Welford points out that although the repeated musical phrase or particular tennis stroke is in one sense the same as those executed previously, in another sense it is never the same. This was the conclusion of Bartlett (1932) when, in talking about the stroke of an athlete in a skilled athletic game, he observed,

We may fancy that we are repeating a series of movements learned a long time before from a textbook or from a teacher. But motion study shows that in fact we build afresh on a basis of the immediately preceding balance of postures and momentary needs of the game. Every time we make it, it has its own characteristics. (p. 204)

In speaking, we generate a new sound every time we talk, one which has its own unique characteristics depending upon the linguistic context, our emotional set, the audience, etc., even though we may be reciting the same phrase we have said a dozen times before. The controlling feature for both the ballistic computation and the prediction computation is feedback. The ballistic computation takes into account feedback both from peripheral senses such as sight and touch, and from proprioceptive senses within the muscular structure. These operate at different speeds, and they are therefore TOTE type computations. The prediction computations take place even faster, in the cerebral cortex itself. There are then feedback loops between the ballistic consequences and the predictions computed. Most of these feedback loops act much faster than our conscious awareness of them, which is one of the reasons we tend to ignore them.
Welford proceeds to analyze the consequences of the two principles of prediction and elementary units of action with respect to time. Using experimental data from tracking experiments, he points out that corrections made in tracking were not continuous, but rather " . . . corrections were intermittent, as if the subject observed an error, made a correction for it, observed again, made a further correction, and so on." (p. 383) If, for example, a signal to react has been given, and during the reaction time to it a further signal appears, the response to the second signal is delayed by an amount which suggests that the central processes required to deal with it did not begin until the reaction time to the previous signal had ended. Seemingly, monitoring the response occupies the central mechanisms to an extent that precludes their dealing with fresh signals for action until the monitoring is finished. Serial action, such as in tracking or in speaking, appears to involve an alternation between, on the one hand, the observing of signals and computing of responses to them, and on the other hand, monitoring the response made. In other words, " . . . the speed of serial action does not depend upon the time taken to execute movements, but upon the time required to decide and monitor them." (p. 384)
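The delay in responding to a second signal can be made concrete with a small sketch. It assumes, as a simplification, a single central channel that deals with one signal at a time and a fixed reaction time for every signal; the function name and the numbers are hypothetical.

```python
def response_times(signal_times, reaction_time):
    """Single-channel sketch: a signal is processed only once the channel is free.

    Returns the time at which the response to each signal is completed.
    """
    finished = []
    channel_free_at = 0.0
    for arrival in sorted(signal_times):
        start = max(arrival, channel_free_at)   # wait while the channel is busy
        done = start + reaction_time
        finished.append(done)
        channel_free_at = done
    return finished


# A second signal arriving 0.1 s after the first, with a 0.3 s reaction time.
crowded = response_times([0.0, 0.1], reaction_time=0.3)
# The same two signals spaced far enough apart not to interfere.
spaced = response_times([0.0, 1.0], reaction_time=0.3)
```

In the crowded case the second response is completed about 0.5 seconds after its signal rather than 0.3, because processing could not begin until the first reaction had ended. In the spaced case each response takes only the reaction time itself. Under this assumption the rate of serial action is set by the decision and monitoring time, not by how quickly the movements themselves are executed.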

Nuttin and Greenwald (1968) have made a distinction similar to that made by Welford between prediction and ballistic action, and by Powers between results and acts, or ends and means.

Another distinction that seems warranted is to separate the performance process into two phases, a preparatory phase and an executive phase, the latter of which corresponds to overt behavior . . . Corresponding to the preparatory-executive distinction, one may distinguish two categories of learned content: expectations (knowledge of 'what leads to what') and behavioral skills. In effect, it is one thing to learn how to walk - a behavioral skill - and quite another to know how to get from the room one is in to the exit of a building - a set of expectations about relations among environmental events, that is, a cognitive map . . . The acquisition of expectations may be described as cognitive learning. Expectations can be formalized as structures containing information about what will happen as a result of a given action operating on a given initial situation. (p. 127)

Applying this analysis to the language using process, we can recognize that the prediction or preparatory phase of speech behavior is the internal generation of the phoneme time-series to be produced. This series is then held at a Test point for monitoring the executive or Operate stage of the performance cycle. "In other words, a representation of the sound being spoken must be made available to the perceptual centers for comparison with the auditory signal that is produced during the articulation of the intended message. If the two match, then the perceptual representation of the auditory signal remains constant and stable." (Lackner, 1974, p. 901) If they do not match, as in the delayed auditory feedback experiments, then there occurs a " . . . slowing of speech rate, increased loudness of speaking, elevation of pitch of the voice, and a blocking of the normal flow of words that result in artificial stutter. Many errors of articulation appear, including omissions, additions, and substitutions of syllables or words." (Smith & Smith, 1965, p. 400) In simple terms, we speak what we expect to hear. The ability to hear, to know what to listen for, therefore comes first. The building of the cognitive map of expectations is what the comprehension approach is focused on. Speaking will then be controlled through a feedback loop process.
The prediction computation in language without a motor response is what we could call "thinking" in the language. The prediction computation is also critical in the process of listening, since we do not just listen "passively", but actually generate (predict) what will be heard and then monitor what is heard to compare it to what was expected. Learning to listen then reduces essentially to learning the predictive computation rules. It is, however, at this point that the cybernetic models for motor behavior are somewhat oversimplified. While they are correct as far as they go, they cannot adequately handle the complexity of symbolic prediction. A full coverage of the cybernetic understanding of symbolic behavior is beyond the scope of this paper, but a brief sketch will illustrate the principal aspects.
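The comparison at the Test point can be sketched as follows. This is a deliberately crude illustration under stated assumptions, not a model taken from Lackner or from Smith and Smith: speech is reduced to a list of discrete units, the delay is counted in whole units, and the function name and the silence marker are hypothetical.

```python
def feedback_mismatches(intended, delay_units=0, silence="_"):
    """Count the moments at which the unit heard differs from the unit expected.

    With delay_units == 0 the speaker hears each unit as it is produced.
    With a delay, the unit heard at step i is the one produced at step
    i - delay_units (or silence, if nothing has come back yet).
    """
    mismatches = 0
    for i, expected in enumerate(intended):
        j = i - delay_units
        heard = intended[j] if j >= 0 else silence
        if heard != expected:      # Test: does the auditory signal match the prediction?
            mismatches += 1
    return mismatches


normal = feedback_mismatches(["k", "a", "t"])                   # undelayed feedback
delayed = feedback_mismatches(["k", "a", "t"], delay_units=1)   # feedback one unit late
```

With undelayed feedback every unit heard matches the unit expected and the comparison stays stable. With the feedback shifted by a single unit, every comparison in this short example fails, which is the kind of persistent mismatch the delayed auditory feedback experiments impose on the speaker.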
There are basically two aspects of language, the form and the meaning. Three sets of computations are necessary to generate the prediction, that is, the "what is to be heard" portion of the performance: one for the meaning, one linking the meaning to word forms, and one for the sequencing of those forms.
We now have evidence that these computations take place in different parts of the brain. The first set is the semantic associations. From the general context of the situation, certain semantic associations are generated in thinking. That is, certain expectancies of a semantic nature are generated. There is some evidence that this takes place primarily in the frontal lobes of the brain (Luria, 1973, p. 318). A second set of computations is the symbolic encapsulation of these meanings into the associated word or form in the particular language being used. These computations apparently take place in the "listening" portion of the brain. "If the word is to be spoken, the pattern is transmitted from Wernicke's area to Broca's area where the articulatory form is aroused and passed on to the motor area that controls the movement of the muscles of speech." (Geschwind, 1972, p. 79) The third set is the computation of the form structure, that is, the expectations regarding the sequencing of the words and the function words. This we generally refer to as the grammatical rules, and it takes place separately. "There thus appears to be a natural neurological separation between the functions of processing sentence form and that of processing semantic representations. This evidence can be taken as an encouraging sign that it is possible to connect cognitive and neurological organization in the domain of language." (Zurif, 1980, p. 311)
Critics of the comprehension approach point out that listening is primarily a semantic activity and that one can listen and understand without really knowing the syntactic or form rules, which, they argue, are crucial in speaking. Semantics and syntax are redundant systems, and one could learn just the "semantic" rules and "appear" to be a good listener. The lack of syntactic computational rules would perhaps not show up until one tried to speak.
While this fact has been used as an argument that speech is necessary for the syntactical organization to be learned, the element of rate of processing does not seem to be taken sufficiently into account. The feedback process from the peripheral organs and from the proprioceptors operates on a different time scale from that of the feedback loop within the cerebral cortex. The semantic and syntactical computations must be coordinated through feedback at this cortical time scale, or the coordination will not be fast enough to control the rest of the system. While speaking practice may help focus the listener's attention on certain critical syntactic points in later listening, the speaking itself does not directly enhance the learning of these syntactic points. This must be done at a rate of processing available only in listening.
Osgood (1957) proposed a three-level model which could account for the different feedback time scales necessary to operate fluent speech.

The three levels of organization are assumed to apply to both sides of the behavioral equation, to both decoding and encoding: (1) a projection level of organization which relates both the receptor and muscle events to the brain via "wired-in" neural mechanisms; (2) an integration level, which organizes and sequences both incoming and outgoing neural events; and (3) a representation or cognitive level, which is at once the termination of decoding operations and the initiation of encoding operations. (p. 77)

Turner and Poppel (1983) have examined the different time scales in even greater detail. "Events separated by periods of time shorter than three thousandths of a second are classified by the hearing system as simultaneous . . . If the sounds are a little more than .003 sec. apart, the subject will experience two sounds. However, he will not be able to tell which of the two sounds came first . . . When two sounds are about three hundredths of a second apart, a subject can experience sequence, accurately . . . Once the temporal interval is above three tenths of a second . . . [there] is enough time for a human subject to react to an acoustic stimulus." (p. 294) The delayed auditory feedback experiments created maximum disturbance when the delay was approximately .25 seconds. This is approximately the time scale of the projection level in Osgood's model. The integration level, on the other hand, which must deal with sequences, can operate approximately 10 times faster, and the representational level 100 times faster. On the other hand, it appears that in the actual operations, "A human speaker will pause for a few milliseconds every three seconds or so, and in that period decide on the precise syntax and lexicon of the next three seconds." (p. 296) In other words, the prediction and ballistic units appear to span intervals of about 3 seconds. Similarly, "A listener will absorb about three seconds of heard speech without pause or reflection, then stop listening briefly in order to integrate and make sense of what he has heard." (p. 296) Through the use of a short term memory "buffer", the speech units are a) generated, b) monitored, and c) regenerated and remonitored if necessary in order to make sense out of what was said.
Those who criticize the delay of oral response approach do not seem to understand that with appropriate listening guidance, both semantic and syntactic rules could be learned. Listening exercises can be created which focus on syntactical structure, just as listening tests have been created to test the comprehension of Broca's aphasic patients. "Broca's aphasics understand a sentence primarily by inferring what makes factual sense from a sampling of the major lexical items of the sentence - its nouns and verbs - independent of syntactic structure. When they can not make use of semantic and pragmatic cues their comprehension fails." (Zurif, 1980, p. 307) Listening exercises can be created which develop a predictive capacity for both semantic and syntactic aspects of language. Oller (1972) has referred to this latter capacity as a "grammar of expectancy".
The issue is not whether some people might learn the syntactic rules just by listening, but how to create listening exercises which do in fact teach them. We know, for example, that grammatical features of speech are more informative and distinguishable when the semantic referents for the utterances are present than when they are absent. Young children, for example, are aided in comprehending plural forms if they hear singular and plural labels applied to single and multiple objects respectively. The acquisition of syntactic language rules is greatly facilitated by pairing linguistic modeling with perceptual referents. This has been confirmed by Brown (1976) in an experimental study.
This has also been the kind of listening practice advocated by those such as James Asher (1969), who has strongly promoted the use of the Total Physical Response Strategy. The same kind of listening practice has been advocated by most of those who promote the delay of oral response. Advocates of authentic listening, on the other hand, tend to ignore this use of perceptual referents for syntactical differentiation. Many listening exercises for adult second language learning also ignore this point. Many of these exercises tend to involve only the "general meaning" of the passage. The consequence is that learners do not necessarily learn the syntactic rules, and hence transfer to speaking is neither automatic nor complete. To be useful, the prediction principle must be understood more completely. The cybernetic approach points out that if the feedback concept is to hold for the syntactic prediction aspect as well as the semantic prediction aspect, then both must be developed through carefully designed listening exercises.
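The successive tenfold steps in these time scales can be laid out as a simple lookup. The sketch below only restates the approximate figures quoted from Turner and Poppel; treating them as sharp cut-off values is an assumption made for illustration, and the function name and labels are hypothetical.

```python
def classify_interval(seconds):
    """Label how two sounds separated by `seconds` would be experienced.

    The boundaries (0.003 s, 0.03 s, 0.3 s) are the approximate values quoted
    in the text, treated here as hard thresholds purely for illustration.
    """
    if seconds < 0.003:
        return "simultaneous"
    if seconds < 0.03:
        return "two sounds, order unclear"
    if seconds < 0.3:
        return "sequence perceived"
    return "time to react"


labels = [classify_interval(s) for s in (0.001, 0.01, 0.1, 0.5)]
```

Each boundary is ten times the one before it, which parallels the statement that the integration level operates roughly 10 times faster, and the representational level roughly 100 times faster, than the projection level.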
