Self-monitoring of speech serves to detect errors (slips of the tongue), but also to check whether an utterance matches the intended message. As already mentioned in the preceding sections, self-monitoring further plays an important role in speech acquisition and in the control of speech as a sensorimotor sequence: It ensures that the next speaking program can be executed only after the current one has been completed correctly. A further part of the self-monitoring of speech serves the online adjustment of speed, volume, pitch, and articulatory distinctness.
Speech errors are detected by comparing a speech unit (e.g., a word or phrase) just spoken with an expectation of its correct form, i.e., its correct sound sequence. A question important for our theme is: How can those expectations be generated? Levelt (1995) believed that a ‘phonetic plan’ produced in the Formulator (see figure) is the basis of those expectations. A variant of this approach is the ‘efference copy’ theory. An efference copy is the (theoretically claimed) projection of a motor plan onto the sensory system in which the movement controlled by that plan is perceived (about efference copies assumed in speech monitoring in the context of stuttering see, e.g., Beal et al., 2010; Brown et al., 2005).
However, the issue is how the copy of a motor plan can function as the basis of speech monitoring: How, for instance, should the confusion of two similar-sounding words (a frequently occurring error) be detected on the basis of the copy of a motor plan? This plan necessarily contains the wrong movement sequence; otherwise the error could not occur. How can a correct expectation be generated on the basis of a wrong motor plan? An error could be detected with the help of efference copies only if muscles moved in a way not contained in the motor plan – which may, e.g., be the case in Parkinson’s disease (read more about efference copies). But a theory claiming that the cause of stuttering is that muscles do what they want appears somewhat simplistic. Such a purely motor theory could hardly account for the variability of stuttering, for the specific properties of the core symptoms (e.g., their dependence on linguistic factors), and not least for the fact that only speaking is affected.
Finally, the question can be posed whether efference copies are needed at all for the self-monitoring of speech. When listening to the speech of someone else, humans are quite able to detect errors immediately without having any copy of the speaker’s plan. Syntax errors (phrase structure violations) in spoken sentences elicit responses in a listener’s brain measurable as event-related potentials (ERPs) after ca. 120 ms; semantic errors elicit ERPs after ca. 400 ms (Friederici, 1999). Obviously, a listener is able to generate an expectation of what a speaker is going to say and what it should sound like, and to compare this expectation with the perception, all in a very short time.
How can a listener generate these expectations? First, he or she intuitively knows from experience in what way the initial words of a sentence constrain a speaker’s options of how to continue: the more words of a sentence have already been spoken, the fewer syntactic and semantic options remain. That makes it easier for the listener to quickly generate expectations and to identify a perception not matching the expectation as a potential mistake. Simply put: It is the words already heard and the listener’s implicit knowledge of language that enable a listener’s brain to generate expectations of how the speaker must continue.
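The narrowing of options described above can be sketched as a toy model. The mini-corpus, the word lists, and the function names below are purely illustrative assumptions, not part of the theory itself; they merely show how the set of expected continuations shrinks as more words are heard:

```python
# Toy sketch: how the words already heard narrow the space of expected
# continuations. The mini-corpus and probabilities are purely illustrative.

from collections import defaultdict

corpus = [
    "the dog chased the cat",
    "the dog chased the ball",
    "the cat slept on the mat",
]

# Build a simple bigram table: word -> set of attested next words.
continuations = defaultdict(set)
for sentence in corpus:
    words = sentence.split()
    for w1, w2 in zip(words, words[1:]):
        continuations[w1].add(w2)

def expected_next(word):
    """Return the continuations a listener could expect after `word`."""
    return sorted(continuations[word])

print(expected_next("the"))     # several options remain open
print(expected_next("chased"))  # only one option: the constraint has tightened
```

Even this crude bigram table captures the point: after an unspecific word many continuations remain possible, while after a more constraining word the listener can form a much sharper expectation.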
However, it is not only the words already heard that enable us to predict the following ones. A listener often recognizes a familiar word after hearing only its initial portion, particularly if the word is embedded in a sentence context. The context facilitates the recognition of the word on the basis of a few initial sounds and the prediction of its phoneme sequence. These assumptions are in line with Astheimer and Sanders (2009, 2012), who found by means of auditory event-related potentials that both adults and preschool-aged children, when listening to connected speech, temporally modulate selective attention to preferentially process the initial portions of words. Halle and Stevens (1959) already developed a model describing how phoneme sequences can be predicted on the basis of a minimal information input and, in this way, words can be recognized. This ‘analysis-by-synthesis’ model was updated by Poeppel and Monahan (2011) (read more).
In summary, we can say that listening and the implicit knowledge of language together enable a person to detect errors in the speech of someone else. Assume now, following Levelt (1995), that the same mechanisms that let us detect errors in the speech of others also operate in the monitoring of one’s own speech. Then both components necessary for the self-monitoring of speech – the expectation of the correct form of a speech unit and the perception of the speech unit produced – depend on auditory feedback (read more). In monitoring one’s own speech, however, it may be somewhat easier to generate expectations of the correct forms than in monitoring the speech of someone else, since the speaker knows the intended message of his own speech. That may be the reason why errors in one’s own speech are sometimes detected more quickly than errors in the speech of others, particularly semantic errors.
We now have a rough scheme of normal speech production. Its main features are:
Speaking is the production of a sensorimotor sequence. A speech sequence is composed of speech units like phonemes, words, phrases, clauses, and breathing pauses.
Speaking is mainly feedforward-controlled by speaking programs, that is, by motor routines controlling the production of familiar speech units. Phoneme sequence, syllable structure, and linguistic stress are integrated in a speaking program.
Speaking is accompanied by an automatic feedback-based self-monitoring that interrupts the flow of speech when an error has been detected, in order to enable a correction.
Error detection works by comparing the speech unit just produced and perceived via the external auditory feedback with an expectation of the correct sound sequence of that unit.
The expectation of the correct sound sequence of a speech unit is generated on the basis of the auditory feedback of the initial portion of the unit, supported by the speaker’s knowledge of the intended message.
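The monitoring scheme summarized in the points above can be illustrated with a minimal sketch. Everything in it – the words, the pseudo-‘sound sequences’, and the function names – is an invented stand-in, chosen only to make the compare-and-interrupt logic concrete:

```python
# Toy sketch of the monitoring loop: each produced unit, perceived via
# auditory feedback, is compared with the expected sound sequence; a
# mismatch interrupts the flow so that a correction becomes possible.
# The words and "sound sequences" are illustrative stand-ins.

expected = {"hello": "h-e-l-o-u", "world": "w-e-r-l-d"}  # expectations of correct forms

def monitor(utterance):
    """Compare each perceived unit with its expected form; stop on mismatch."""
    for word, perceived in utterance:
        if expected.get(word) != perceived:
            return f"interrupt: error detected in '{word}'"
    return "utterance accepted"

# A slip of the tongue in the second word triggers an interruption:
print(monitor([("hello", "h-e-l-o-u"), ("world", "w-o-r-l-d")]))
# -> interrupt: error detected in 'world'
```

Note what the sketch deliberately leaves out: the expectation is simply given here, whereas in the model described above it must itself be generated online, from the initial portion of the unit and the intended message.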
These basic assumptions about normal speech production provide the framework for the theory of stuttering presented in the next chapter.
The last experiment shows that the crucial thing is not to produce an exact copy or prediction, but simply to discriminate self-elicited from non-self-elicited perceptions. In darkness, for example, it can be important to discriminate rustling caused by one’s own steps from rustling due to other causes. The discrimination may mainly be an effect of expectancy: Deliberately causing a stimulus, e.g., a sound – by speaking or by pushing a button – produces a high degree of expectancy; the stimulus does not come as a surprise. This assumption is supported by the finding that activity in the auditory cortical areas increases immediately when the auditory feedback of speech is altered unexpectedly, for example, when a person hears his own speech live via headphones and the playback frequency is suddenly altered (Tourville et al., 2008; Behroozmand & Larson, 2011). Chang et al. (2013) found that some sites in the posterior superior temporal gyrus showed suppressed responses to the normal auditory feedback of speech, whereas other sites showed enhanced responses when the auditory feedback was suddenly altered in pitch.
The crucial point, in my view, is that the expectations that seem to lead to attenuated responses in the cases mentioned above are generated on the basis of plans for the movements actually executed. Such an expectation, however, is useless in the self-monitoring of speech: the motor plan of a mistaken or misarticulated word is a wrong plan, and an expectation based on a wrong plan can only be a wrong expectation.
In a theoretical paper, Tian and Poeppel (2012) described a top-down simulation-estimation process in which internal somatosensory and auditory verbal perceptions (estimations) are generated on the basis of the motor simulation of speech. I think that is very important: It might be the basis of inner speech, i.e., of verbal thinking (see also Tian, Zarate, & Poeppel, 2016). The authors assumed that this top-down process is acquired by learning. They wrote: “... the movement of articulators can induce somatosensory feedback and subsequent auditory perception of one’s own speech. On the basis of the occurrence order (action first, then somatosensory activation, followed by auditory perception), an internal association can be established to link a particular movement trajectory of articulators with the specific somatosensory sensation, followed by a given auditory perception of speech.” (p. 5).
Although Tian and Poeppel assumed that the simulation-estimation process is learned, they link it with the efference copy theory: “The core presupposition is that the neural system can predict the perceptual consequences by internal simulating a copy of a planned action command (the efference copy).” (p. 2). But why should a copy of a planned action command be needed for a simulation? When a motor program, e.g., a speaking program, is activated, the associated somatosensory and auditory imagination may be co-activated. A copy of the motor program is not needed – quite apart from the question of how such a copy could be made in the brain.
Tian and Poeppel further assumed that these internal auditory perceptions based on motor simulation serve as predictions for the self-monitoring of speech: “This sequential estimation mechanism (motor plan → somatosensory estimation → auditory prediction/estimation) can derive detailed auditory predictions that are then compared with auditory feedback for self-monitoring and online control.” (p. 2). I think that’s wrong. As argued above, a prediction based on the motor plan just executed is unfit for error detection, because an error can only occur if the motor plan is incorrect. The same holds for the online adjustment of speech rate, volume, pitch, or articulation: We need an idea of the proper or intended behavior, and such an idea cannot be derived from a wrong motor plan.
A study on the hypothesized role of efference copies in stuttering has recently been published by Mock, Foundas, and Golob (2015). By means of MEG, they investigated whether speech preparation modulates auditory processing, via motor efference copy, differently in stutterers and normally fluent controls. For this purpose, they used a picture-naming task in which, first, a visual prime (a cue word in letters) and, 1.5 seconds later, the target (a picture to name) were presented. The cue words, in terms of meaning, matched the target picture on 90% of trials, so participants could prepare speaking in the time between cue and target presentation. During just this time of speech preparation, a short probe tone was presented, and the event-related brain response to the probe tone 100–400 ms after its onset was recorded.
The stutterers’ overall brain responses to the probe tone were significantly weaker than those of the controls. Since the authors had premised that the brain response to the probe tone tells something about efference copies, they concluded from this result that efference copies were smaller in the stutterers. However, I think there is an alternative explanation: Couldn’t it simply be that the stutterers, aware of the fact that speaking is not easy for them, were more focused on speech preparation than the controls, who had never experienced difficulty with speaking? In this case, the stutterers would have had less attention left for perceiving and processing the probe tone.
This would explain not only the result that the stutterers’ overall response amplitude was smaller, but also that the auditory N100 amplitude was greater on average in the stutterer group (Fig. 3A): Since they were more focused on speech preparation, the probe tone presented in every trial came as more of a surprise to the stutterers than to the controls. I think that’s a good and simple explanation for the interesting results of Mock, Foundas, and Golob (2015) without referring to the efference copy theory (note that the N100 is independent of attention and can be interpreted as a measure of unexpectedness; the later ERP components are attention-dependent).
The model describes how the brain can ‘guess’ and predict a word on the basis of only a few perceived sounds. It is assumed that a first vague prediction is updated step by step on the basis of additional sounds perceived in the meantime, and/or on the basis of the context (Poeppel & Monahan, 2011). By the way, the model explains our ability to understand the speech of someone speaking faultily, for instance, with a strong foreign accent. The ability to recognize a familiar word or phrase and to generate an expectation of its correct sound sequence on the basis of a few initial phonemes perceived seems to be a special case of a general ability which enables us, for example, to identify a familiar musical composition from its first few bars.
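The stepwise updating of a word prediction can be made concrete with a small sketch in the spirit of the analysis-by-synthesis idea described above. The lexicon and the ‘phoneme’ strings are invented for illustration and are no part of the model itself:

```python
# Toy sketch of incremental word prediction from initial sounds: the set of
# candidate words shrinks with every additional sound perceived, until a
# unique word can be recognized before it has been fully heard.
# The lexicon and the letter-based "phoneme" prefixes are illustrative.

lexicon = ["banana", "band", "bandage", "cat"]

def candidates(prefix):
    """Words still compatible with the initial sounds heard so far."""
    return [w for w in lexicon if w.startswith(prefix)]

# The prediction is updated step by step as more sounds come in:
print(candidates("ba"))     # ['banana', 'band', 'bandage']
print(candidates("band"))   # ['band', 'bandage']
print(candidates("banda"))  # ['bandage'] – unique before the word ends
```

In a less crude version, sentence context would additionally weight the candidates, which is why a word embedded in context can be recognized from even fewer initial sounds.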
Perhaps it appears somewhat strange when I claim that auditory feedback is the basis not only of the perception of one’s own speech, but also of the prediction of the correct forms (sound sequences) of spoken words or phrases. The thesis has to do with a basic problem of linguistics, namely the relation between language and thought (see, e.g., Carruthers & Boucher, 1998). The specific question in our context is: Do we, in spontaneous speech, know the formulation of our sentences before we have spoken them? Quite a few people will claim they do. But would they also claim to know their thoughts, i.e., their internally formulated sentences, before they have perceived them, i.e., heard them internally? Hardly. But if it is true that we become aware of our internally spoken sentences by hearing them internally – why should we believe that we become aware of our externally spoken sentences in any other way than by hearing them externally?
It makes little sense to say we would think before thinking in order to formulate our thoughts, and it makes just as little sense to say we would, in spontaneous speech, think before speaking in order to formulate our sentences. Spontaneous speech simply is thinking – only aloud. Clearly, unconscious brain processes precede the formulation of a sentence, in overt speech as well as in inner speech. But unconscious brain processes are not thoughts – thoughts are conscious. Therefore, in spontaneous speech, a speaker knows neither the formulation of a sentence nor the sound sequence of a word (in its specific inflectional form) before he has spoken and heard the sentence. All the speaker knows beforehand is the intended message – as far as the term knowledge is appropriate here; I think spontaneous speech is very often an immediate behavioral response to a situation, even without an awareness of an intended message: One word leads to another.