Connected speech can be regarded as a special, very complex kind of sequential behavior; that is, a behavior consisting of several consecutive steps (segments, submovements). In speaking, such steps are the production of a phoneme, of a syllable, of a word, or of a clause. There is a hierarchy of steps of different complexity in speaking: A syllable is a sequence of phonemes, a word is a sequence of syllables, and so on. We can also say: Speech is a sensorimotor sequence – a sequence of motor actions as well as a sequence of perceptions, with both aspects closely linked to one another.
In some kinds of sensorimotor sequence, e.g., in tying a bowknot, a certain number of steps must be made in a fixed order to accomplish a goal. Not so in speaking: The sensorimotor sequence producing a certain word is, indeed, largely invariable, similar to that of tying a bowknot, but the sequences of words in sentences are not. Their order is constrained only by syntactic rules, which allow a great variety of wordings. The result of the sequence of tying a bowknot, if successfully completed, is always a bowknot – the result of a speech sequence can be a sentence never said before.
An issue important for a theory of stuttering is: How does the brain control sensorimotor sequences such that they can be executed fluently? In the first half of the 20th century, experts believed that behavioral sequences were feedback-controlled; that is, the perception that the current step is complete triggers the start of the next step. The main argument against this position is that an organism needs some time from feedback perception to reaction – about 150 ms or more. Because of this reaction time, a purely feedback-controlled sequence simply could not be executed fluently. A further argument against purely feedback-based control is that sensorimotor sequences, once automatized, can be executed even if sensory feedback is interrupted. This is also true for speaking.
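The timing argument can be made concrete with a toy calculation (a sketch in Python; the 150 ms reaction time is the figure mentioned above, while the 200 ms syllable duration and the five-syllable sequence are assumed illustrative values):

```python
# Toy comparison: purely feedback-controlled vs. feedforward execution
# of a sequence of steps (e.g., syllables). Under feedback control,
# each step can start only after the completion of the previous step
# has been perceived, which costs one reaction time per transition.

def feedback_controlled_duration(step_ms, n_steps, reaction_ms=150):
    """Total time when every transition waits for feedback perception."""
    return n_steps * step_ms + (n_steps - 1) * reaction_ms

def feedforward_duration(step_ms, n_steps):
    """Total time when a program triggers the steps back to back."""
    return n_steps * step_ms

# Five syllables of 200 ms each (assumed values):
print(feedback_controlled_duration(200, 5))  # 1600 ms
print(feedforward_duration(200, 5))          # 1000 ms
```

For this five-syllable utterance, waiting for feedback at every transition adds 600 ms of dead time – enough to destroy fluency, which is the point of the argument above.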
It was Lashley (1951) who showed that sequential behavior cannot be attributed to sensory feedback alone. He proposed a hierarchical organization of ‘plans’ – today, we would rather say programs – allowing feedforward control of sensorimotor sequences. Meanwhile, Lashley’s position is regarded as confirmed (see, e.g., Rosenbaum et al., 2007). However, Lee (1951) and Stromsta (1959) demonstrated that a healthy person’s speech flow can be severely disrupted by manipulation of auditory feedback, and Kalveram and Jäncke (1989) showed that even a delay of auditory feedback (DAF) of 40 ms, which is below the threshold of conscious perception, influences the timing of speech. These facts indicate that speaking is not purely feedforward-controlled – obviously, auditory feedback plays some role.
By the way, not only auditory feedback, but also proprioceptive, tactile, or kinaesthetic feedback – feeling the movements of the vocal folds, jaw, tongue, and lips, feeling the contact of palate, tongue, and lips, etc. – might influence speech control (imagine how anesthesia of the tongue, lips, and palate would alter the quality of articulation!). I will not deal further with these kinds of sensory feedback here, since I do not assume that they are impaired or otherwise involved in stuttering.
Let us now consider how a sensorimotor sequence is learned. At least in the initial phase of learning, it is necessary to evaluate whether the current step has been successfully completed and the subsequent step can start. If an error has occurred, it can be repaired by repeating only the last step, so that, in the worst case, not the whole sequence must be repeated. In this way, a continuous self-monitoring develops, which ensures that the sequence is correctly executed and that errors are immediately repaired. The steps of a sequence and their succession are feedforward- (program-) controlled, but self-monitoring, by contrast, is an element of feedback-based control. During learning, self-monitoring is done consciously, with the effect that the sequence is not yet executed fluently, because short breaks are needed to evaluate every step. Later, after the sequence has become automatic, self-monitoring runs alongside and is unconscious. In tying a bowknot, you do not consciously monitor whether every step has been completed before starting the next one – but if something goes wrong, you notice it immediately, stop the movement, and do it again properly.
Both kinds of control are depicted in Figure 1. Above, a sequence in the initial period of learning is visualized. The single steps themselves are program-controlled, but the control of the sequence is strongly feedback-based: The success of every step is consciously perceived (arrow to the monitor), and only after this perception is the next step started, as a response to the perception (arrow from the monitor). If an error is detected, the affected step is repeated. Below, an already automatized sequence is depicted. The success of every step is monitored too, but this monitoring is automatic and largely unconscious. Each next step is started not as a reaction to the feedback, but by the program controlling the entire sequence (horizontal arrow, also symbolizing the time line). If an error is detected, the monitor stops the movement after a reaction time, and the affected step is correctly repeated before the execution of the sequence goes on.
In summary, we can say: In an automatized sensorimotor sequence, self-monitoring is a feedback-based element of control running alongside. It ensures that the next step can start only if the preceding step was correct and complete. The automatic, unconscious monitor intervenes only if an error has been detected, and only after a reaction time. It is, however, important to understand that the monitor is not a special mechanism for error detection, but first of all a part of the control system that allows learning the correct and fluent execution of a sensorimotor sequence.
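This control scheme can be sketched in a few lines of Python. The sketch is a toy model, not a claim about neural implementation: the planned step list plays the role of the program, `attempt` plays the role of the (hypothetical) monitor's success check, and the bowknot step names are invented for illustration.

```python
# Sketch of an automatized sequence with an alongside-running monitor:
# the program (the planned step list) drives execution in feedforward
# fashion; the monitor does not trigger the steps, it only intervenes
# when a step fails, and the failed step is repeated before the
# sequence continues.

def execute(planned_steps, attempt):
    """Run the planned sequence; attempt(step) returns True on success.
    Returns the trace of executed steps, including repairs."""
    trace = []
    for step in planned_steps:      # feedforward: order comes from the program
        while True:
            trace.append(step)
            if attempt(step):       # alongside monitor checks the outcome
                break               # success: the program simply continues
            # error detected: only the affected step is repeated

    return trace

# Hypothetical example: the first production of "loop" fails once.
failures = {("loop", 0)}            # (step, occurrence) that goes wrong
seen = {}
def attempt(step):
    key = (step, seen.get(step, 0))
    seen[step] = seen.get(step, 0) + 1
    return key not in failures

trace = execute(["cross", "loop", "pull"], attempt)
print(trace)  # ['cross', 'loop', 'loop', 'pull']
```

Note that the repair is local: only "loop" is repeated, exactly as described above for the bowknot – the sequence as a whole is never restarted.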
Looking at the brain, we find the cerebellum to be the structure crucial for the sequencing of motor programs. Hallett and Grafman (1997) assumed the cerebellum to be involved in the organization of sequences of both movement patterns and mental operations, and Molinari, Leggio, and Silveri (1997) suggested that the cerebellum operates as a “controller” mediating the sequential organization of the various subcomponents of complex cognitive tasks. Based on morphological and electrophysiological data, Braitenberg, Heck, and Sultan (1997) proposed that the cerebellar cortex acts as a “sequence-in/sequence-out operator” transforming input sequences of events into a coordinated output sequence. Ackermann, Mathiak, and Ivry (2004) applied this model to the domain of speech production and proposed that a left-fronto/right-cerebellar network subserves the ongoing sequencing of speech movements. Whereas the right cerebellar hemisphere seems to be responsible for sequencing, the left cerebellar hemisphere is involved in self-monitoring and error repair within motor sequences, and it seems to play a crucial role in the causation of stuttering (see Section 2.1).
Levelt’s Perceptual Loop Theory and his Main Interruption Rule, which are the subject of Section 1.3, describe a feedback-based monitoring running alongside the control of speech: An automatic and unconscious monitor continuously checks the steps of the speech sequence, i.e., the words, phrases, and clauses just spoken, and it interrupts the speech flow immediately if a step appears erroneous.
Lee (1951) showed that a delay of the auditory feedback of speech of about one syllable’s length (1/4–1/5 second) leads to speech disfluencies in normally fluent individuals. Repetitions and prolongations occur, but also fluctuations of speech tempo and formulation errors like incorrect or suddenly canceled sentences (see also Fairbanks & Guttman, 1958). These observations are today referred to as the ‘Lee effect’. Lee himself called them “artificial stutter”; however, they differ significantly from the disfluencies typical of real stuttering: They are not accompanied by muscular tension, and repetitions mostly occur at the end of words. Stromsta (1959) obtained short blockages of phonation in normally fluent speakers by phase-shifting the auditory feedback, which provided additional evidence that auditory feedback influences speech flow.
By means of altered auditory feedback (delayed, but also seemingly premature feedback) of 20–60 ms, which is below the threshold of conscious perception, Kalveram (1983) discovered that the duration of long stressed syllables is controlled via auditory feedback. He referred to this phenomenon as ‘audiophonatory coupling’ (see also Kalveram & Jäncke, 1989; Kalveram & Natke, 1998). The effect suggests a feedback-based online control of vowel duration in long stressed syllables, i.e., in syllables that are stressed by speaking them not only louder but also a little longer: Vowel offset depends on the auditory feedback of vowel onset. By contrast, the duration of short (stressed or unstressed) syllables was found to be not, or only slightly, influenced by the altered auditory feedback.
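The logic of audiophonatory coupling can be illustrated with a minimal sketch. The assumption, taken from the description above, is that in long stressed syllables the vowel offset is timed relative to the *auditorily perceived* vowel onset, so a feedback delay stretches the vowel, while short syllables are modeled as purely feedforward; the millisecond values are illustrative, not measured data.

```python
# Toy model of 'audiophonatory coupling' under delayed auditory
# feedback (DAF). Assumption: in long stressed syllables, the vowel
# offset is triggered a planned duration after the *perceived* vowel
# onset, which arrives feedback_delay_ms after the actual onset.
# Short syllables are treated as purely feedforward (unaffected).

def vowel_duration(planned_ms, feedback_delay_ms, long_stressed):
    if long_stressed:
        # offset = perceived onset + planned duration,
        # so the delay is added to the produced duration
        return planned_ms + feedback_delay_ms
    return planned_ms  # feedforward timing: delay has no effect

# DAF of 40 ms (below the threshold of conscious perception):
print(vowel_duration(180, 40, long_stressed=True))   # 220 ms
print(vowel_duration(80, 40, long_stressed=False))   # 80 ms
```

In this sketch, a subliminal 40 ms delay lengthens a long stressed vowel by exactly 40 ms while leaving short syllables untouched – the qualitative pattern Kalveram's experiments revealed.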
Levelt’s model might be the most noted and most widespread model of speech processing today. On the following pages, I will often refer to this model; therefore, it is briefly presented here. The figure shows a simplified version confined to speech production and the two feedback loops; it corresponds to Fig. 12.3 on page 470 in Levelt (1995).
The main advantage of the model is that it describes the relationship between speech production and auditory feedback. On the following pages, however, I will argue against some positions of this model, and I will revise it in part. Here, I only want to point to a general problem:
The investigation of speech errors (slips of the tongue) has played an important role in the development of psycholinguistic models of speech production (Levelt, 1999): A given model of speech production can be realistic only if frequently occurring types of speech errors are possible in that model. But what holds for speech errors should also hold for stuttering: A model of speech production can be realistic only if stuttering is possible within the framework of the model. Let us consider Levelt’s model thoroughly in this respect:
Formulation and the control of articulation are localized in two separate ‘encapsulated modules’ in the brain, referred to as Formulator and Articulator. In the Formulator, a ‘phonetic plan’ is generated. This plan is transferred to the Articulator, where it is converted into a sequence of commands for the executing muscles. Information is transferred only from the Formulator to the Articulator, not in the reverse direction. Now the question is: Which of the two modules is impaired in stuttering?
As a rule, stutterers are well able to formulate correct sentences; they have no difficulty thinking them or writing them down. Therefore, stuttering seems to be a disorder of articulation only – but the Articulator seems to be able to work well too: Stuttering, as a rule, does not occur when stutterers repeat single phonemes, single syllables, or single words, when they speak in chorus, or when they ‘shadow’ the speech of someone else. Moreover, some stutterers fluently recite poems, and actors who stutter in everyday situations speak their roles fluently from memory. In all these conditions, no self-generated formulation is required. What does that mean? Is the cause of stuttering, nevertheless, to be found in the Formulator? Or is the information transfer from Formulator to Articulator impaired in stuttering? That can hardly be the case, because stutterers always know exactly what they are going to say when they get stuck.
Obviously, neither the Formulator nor the Articulator seems to be responsible for stuttering, and an interaction between them is definitively excluded in Levelt’s model; it can, therefore, not cause stuttering either. There seems to be no place for stuttering in Levelt’s model of speech production. And there is a further problem: It is, at least, unclear what the person does and what the modules in the person’s brain do. If, in spontaneous speech, sentences were first formed by an unconscious internal Formulator (quasi behind the speaker’s back), subsequently monitored (before articulation, by an unconscious internal censor), and finally spoken under the control of an Articulator – who, in the framework of this model, is responsible for what the speaker says?
Instead, I assume that, in spontaneous speech, sentences are formulated by being articulated, and they are monitored automatically via the external feedback loop, i.e., by hearing. That means: In spontaneous speech, a speaker does not know the exact formulation of a sentence before the sentence has been spoken. Therefore, it is not surprising that spontaneous speech is usually not perfect, but contains errors and unfortunate wordings. Sometimes, we say the wrong thing, realize it, and correct the mistake. Our responsibility, in spontaneous speech, is not to formulate perfectly, but to realize our mistakes and to correct them in order to avoid misunderstanding.