Speech Production
by Eryk Walczak
Last reviewed: 22 February 2018. Last modified: 22 February 2018. DOI: 10.1093/obo/9780199772810-0217

Speech production is one of the most complex human activities. It involves coordinating numerous muscles and complex cognitive processes. The area of speech production is related to Articulatory Phonetics, Acoustic Phonetics, and Speech Perception, which all study various elements of language and are part of the broader field of Linguistics. Because of the interdisciplinary nature of the topic, it is usually studied on several levels: neurological, acoustic, motor, evolutionary, and developmental. Each of these levels has its own literature, but each of these elements is present in the vast majority of the speech production literature. A large body of relevant literature is covered in the Speech Perception entry, on which this bibliography builds. This entry covers general speech production mechanisms and speech disorders. Speech production in second language learners and bilinguals, however, has special features, which are described in the separate bibliography on Cross-Language Speech Perception and Production. Speech produces sounds, and sounds are the object of study of Phonology.

As mentioned in the introduction, speech production tends to be described in relation to acoustics, speech perception, neuroscience, and linguistics. Because of this interdisciplinarity, few published textbooks focus exclusively on speech production; Guenther 2016 and Levelt 1993 are the exceptions. The former has a stronger focus on the neuroscientific underpinnings of speech. Auditory neuroscience is also covered extensively by Schnupp, et al. 2011 and in the extensive textbook Hickok and Small 2015. Rosen and Howell 2011 is a textbook focusing on the signal processing and acoustics that any speech scientist needs to understand. Levelt 2013 takes a historical approach to psycholinguistics and also covers speech research.

Guenther, F. H. 2016. Neural control of speech. Cambridge, MA: MIT.

This textbook provides an overview of neural processes responsible for speech production. Large sections describe speech motor control, especially the DIVA model (co-authored by Guenther). It includes extensive coverage of behavioral and neuroimaging studies of speech as well as speech disorders and ties them together with a unifying theoretical framework.

Hickok, G., and S. L. Small. 2015. Neurobiology of language. London: Academic Press.

This voluminous textbook edited by Hickok and Small covers a wide range of topics related to the neurobiology of language. It includes a section devoted to speaking, covering the neurobiology of speech production, the motor control perspective, neuroimaging studies, and aphasia.

Levelt, W. J. M. 1993. Speaking: From intention to articulation. Cambridge, MA: MIT.

This seminal textbook is worth reading particularly for its detailed explanation of the author’s speech production model, which is part of his broader model of language. The book is slightly dated, having been published in 1993, but chapters 8–12 are especially relevant to readers interested in phonetic plans, articulation, and self-monitoring.

Levelt, W. J. M. 2013. A history of psycholinguistics: The pre-Chomskyan era. Oxford: Oxford University Press.

Levelt published another important book detailing the development of psycholinguistics. As its title suggests, it focuses on the early history of the discipline, so readers interested in historical research on speech will find an abundance of speech-related material here. It covers a wide range of psycholinguistic specializations.

Rosen, S., and P. Howell. 2011. Signals and systems for speech and hearing. 2d ed. Bingley, UK: Emerald.

Rosen and Howell provide a low-level explanation of speech signals and systems. The book includes informative charts explaining the basic acoustic and signal processing concepts useful for understanding speech science.

Schnupp, J., I. Nelken, and A. King. 2011. Auditory neuroscience: Making sense of sound. Cambridge, MA: MIT.

A general introduction to speech concepts with a main focus on neuroscience. The textbook is linked to a website that provides demonstrations of the described phenomena.



Hearing in Complex Environments

75 Speech Production

Learning Objectives

Understand the separate roles of respiration, phonation, and articulation.

Know the difference between a voiced and an unvoiced sound.

The field of phonetics studies the sounds of human speech. When we study speech sounds, we can consider them from two angles. Acoustic phonetics, in addition to being part of linguistics, is also a branch of physics. It’s concerned with the physical, acoustic properties of the sound waves that we produce. We’ll talk some about the acoustics of speech sounds, but we’re primarily interested in articulatory phonetics—that is, how we humans use our bodies to produce speech sounds.

Producing speech takes three mechanisms.

  • Respiration at the lungs
  • Phonation at the larynx
  • Articulation in the mouth

Let’s take a closer look:

  • Respiration (At the lungs): The first thing we need to produce sound is a source of energy. For human speech sounds, the air flowing from our lungs provides the energy.
  • Phonation (At the larynx): Secondly, we need a source of sound: air flowing from the lungs arrives at the larynx. Put your hand on the front of your throat and gently feel the bony part under your skin. That’s the front of your larynx. It’s not actually made of bone; it’s cartilage and muscle. This picture shows what the larynx looks like from the front.

[Fig. 7.8.3: The larynx shown from the front, with its various parts labelled.]

What you see in Fig. 7.8.3 is that the opening of the larynx can be covered by two triangle-shaped pieces of tissue. These are often called “vocal cords,” but they’re not really like cords or strings. A better name for them is vocal folds. The opening between the vocal folds is called the glottis.

Vocal Folds Experiment:

First I want you to say the word “uh-oh.” Now say it again, but stop half-way through (“uh-”). When you do that, you’ve closed your vocal folds by bringing them together. This stops the air flowing through your vocal tract. That little silence in the middle of “uh-oh” is called a glottal stop because the air is stopped completely when the vocal folds close off the glottis.

Now I want you to open your mouth and breathe out quietly, making a sound like “haaaaaaah.” When you do this, your vocal folds are open and the air is passing freely through the glottis. Now breathe out again and say “aaah,” as if the doctor is looking down your throat. To make that “aaaah” sound, you’re holding your vocal folds close together and vibrating them rapidly.

When we speak, we make some sounds with the vocal folds open and some with the vocal folds vibrating. Put your hand on the front of your larynx again and make a long “SSSSS” sound. Now switch and make a “ZZZZZ” sound. You can feel your larynx vibrate on “ZZZZZ” but not on “SSSSS.” That’s because [s] is a voiceless sound, made with the vocal folds held open, and [z] is a voiced sound, where we vibrate the vocal folds. Do it again and feel the difference between voiced and voiceless. Now take your hand off your larynx, plug your ears, and make the two sounds again. You can hear the difference between voiceless and voiced sounds inside your head.

  • Articulation (In the oral cavity): The oral cavity is the space in your mouth. The nasal cavity, as we know, is the space inside and behind your nose. And of course, we use our tongues, lips, teeth and jaws to articulate speech as well. In the next unit, we’ll look in more detail at how we use our articulators.


So, to sum it up, the three mechanisms that we use to produce speech are:

  • Respiration (At the lungs): Energy comes from the air supplied by the lungs.
  • Phonation (At the larynx): The vocal folds produce sound at the larynx.
  • Articulation (In the mouth): The sound is filtered, or shaped, by the articulators.

Wikipedia, Larynx URL: https://commons.wikimedia.org/wiki/File:Illu_larynx.jpg License: Public Domain

Introduction to Sensation and Perception Copyright © 2022 by Students of PSY 3031 and Edited by Dr. Cheryl Olman is licensed under a Creative Commons Attribution 4.0 International License , except where otherwise noted.


2.1 How Humans Produce Speech

Phonetics studies human speech. Speech is produced by bringing air from the lungs to the larynx (respiration), where the vocal folds may be held open to allow the air to pass through or may vibrate to make a sound (phonation). The airflow from the lungs is then shaped by the articulators in the mouth and nose (articulation).


Essentials of Linguistics Copyright © 2018 by Catherine Anderson is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License , except where otherwise noted.


Introduction to Speech Processing

2.2. Speech production and acoustic properties

2.2.1. Physiological speech production

2.2.1.1. Overview

When a person has the urge or intention to speak, her or his brain forms a sentence with the intended meaning and maps the sequence of words into physiological movements required to produce the corresponding sequence of speech sounds. The neural part of speech production is not discussed further here.

The physical activity begins by contracting the lungs, which pushes air out through the throat and the oral and nasal cavities. Airflow in itself is not audible as a sound; sound is an oscillation in air pressure. To obtain a sound, we therefore need to obstruct the airflow so that it produces an oscillation or turbulence. Oscillations are primarily produced when the vocal folds are tensioned appropriately. This produces voiced sounds, which are perhaps the most characteristic property of speech signals. Oscillations can also be produced by other parts of the speech production organs, such as letting the tongue oscillate against the teeth in a rolling /r/, or letting the uvula oscillate in the airflow, known as the uvular trill (viz. something like a guttural /r/). Such trills, with either the tongue or the uvula, should however not be confused with voiced sounds, which are always generated by oscillations of the vocal folds. Sounds without oscillations of the vocal folds are known as unvoiced sounds.

The most typical unvoiced sounds are caused by turbulence produced by static constrictions of the airflow in any part of the air spaces above the vocal folds (viz. the larynx, pharynx, and oral or nasal cavities). For example, by letting the tongue rest close to the teeth we obtain the consonant /s/, and by stopping and releasing the airflow by closing and opening the lips we obtain the consonant /p/. A further particular class of phonemes are the nasal consonants, where airflow through the mouth is stopped entirely or partially, such that most of the air flows through the nose.

2.2.1.2. The vocal folds

The vocal folds, also known as the vocal cords, are located in the throat and oscillate to produce voiced sounds. The opening between the vocal folds is known as the glottis. Correspondingly, the airspace between the vocal folds and the lungs is known as the subglottal area.

When the pressure below the glottis, known as the subglottal pressure, increases, it pushes the vocal folds open. When they are open, air rushes through. The return movement, which closes the vocal folds again, is mainly caused by the Venturi effect, a drop in air pressure between the vocal folds when air is flowing through them. As the vocal folds close, they eventually clash together. This sudden stop of airflow is the largest acoustic event in the vocal folds and is known as the glottal excitation.

In terms of airflow, the effect is that during the closed phase (when the vocal folds are closed) there is no airflow. At the beginning of the open phase (when the vocal folds are open), air starts to flow through the glottis, and the flow decreases again as the vocal folds close. However, due to the momentum of the air itself, the movement of air lags slightly behind the movement of the vocal folds. In other words, there is a phase difference between the vocal fold movement and the glottal airflow waveform.

The frequency of the vocal fold oscillation depends on three main factors: the amount of lengthwise tension in the vocal folds, the pressure difference above and below the vocal folds, and the length and mass of the vocal folds. Pressure and tension can be changed intentionally to change the frequency. The length and mass of the vocal folds are in turn correlated with the overall body size of the speaker, which explains why children and females have, on average, a higher pitch than male speakers.

Note that the frequency of the vocal folds refers to the actual physical phenomenon, whereas pitch refers to the perception of that frequency. There are many cases where the two differ; for example, resonances in the vocal tract can emphasise harmonics of the fundamental frequency such that a harmonic becomes louder than the fundamental and we perceive that harmonic as the fundamental. The perceived pitch is then the frequency of the harmonic instead of the fundamental.
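To make the distinction concrete, the following short Python sketch (an illustration added here, not part of the original text) estimates the fundamental frequency of a voiced-like frame from its autocorrelation; the synthetic frame, the sampling rate, and the 80–400 Hz search range are assumptions chosen for the demonstration, and real pitch trackers are considerably more elaborate.

```python
import numpy as np

def estimate_f0_autocorr(frame, fs, fmin=80.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of a voiced frame from the
    peak of its autocorrelation within an assumed lag range."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / fmax)                  # shortest lag considered
    lag_max = int(fs / fmin)                  # longest lag considered
    best_lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return fs / best_lag

# Synthetic "voiced" frame: a 120 Hz fundamental plus two of its harmonics.
fs = 16000
t = np.arange(0, 0.04, 1 / fs)
frame = (np.sin(2 * np.pi * 120 * t)
         + 0.6 * np.sin(2 * np.pi * 240 * t)
         + 0.3 * np.sin(2 * np.pi * 360 * t))
print(f"estimated fundamental frequency: {estimate_f0_autocorr(frame, fs):.1f} Hz")
```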

2.2.1.3. The vocal tract

The vocal tract, including the larynx, pharynx, and oral cavities, has a great effect on the timbre of the sound. Namely, the shape of the vocal tract determines the resonances and anti-resonances of the acoustic space, which boost and attenuate different frequencies of the sound. The shape is determined by a multitude of components, in particular the position of the jaw, lips, and tongue. The resonances are easily modified by the speaker and perceived by the listener, and they can thus be used in communication to convey information. Specifically, the acoustic features which differentiate vowels from each other are the frequencies of the resonances of the vocal tract, corresponding to specific places of articulation, primarily in terms of tongue position. Since the air can flow relatively unobstructed, vowel sounds tend to have high energy and loudness compared to consonants.
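Because vowel quality is carried mainly by the lowest resonances, a crude vowel-like sound can be synthesized by exciting a pair of resonators with a periodic pulse train. The Python sketch below is an added illustration of this source-filter idea (it is not from the original text); the formant frequencies and bandwidths are assumed values, roughly in the range reported for an /a/-like vowel.

```python
import numpy as np
from scipy.signal import lfilter

def resonator(x, freq_hz, bw_hz, fs):
    """Second-order all-pole resonator: a rough stand-in for one formant."""
    r = np.exp(-np.pi * bw_hz / fs)
    theta = 2 * np.pi * freq_hz / fs
    a = [1.0, -2.0 * r * np.cos(theta), r * r]
    return lfilter([1.0], a, x)

fs = 16000
f0 = 120                                    # fundamental frequency of the source
n = int(0.5 * fs)

# Source: an impulse train, a very crude stand-in for the glottal excitation.
source = np.zeros(n)
source[::fs // f0] = 1.0

# Filter: two resonances with assumed, roughly /a/-like values (F1 ~ 700 Hz, F2 ~ 1200 Hz).
vowel = resonator(resonator(source, 700, 90, fs), 1200, 110, fs)
vowel /= np.max(np.abs(vowel))              # normalize for listening or plotting
```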

In consonant sounds, there is a partial or full obstruction at some part of the vocal tract. For instance, fricative consonants are characterized by a narrow gap between the tongue and the front or top of the mouth, leading to hiss-like turbulent airflow. In plosives, the airflow in the vocal tract is temporarily fully obstructed. As an example, bilabial plosives are characterized by a temporary closure of the lips, which leads to an accumulation of air pressure in the vocal tract due to the sustained lung pressure. When the lips are opened, the accumulated air is released together with a short burst sound (plosion) that has impulse- and noise-like characteristics. Similarly to vowels, the place of the obstruction in the mouth (i.e., the place of articulation) affects the acoustic characteristics of the consonant sound by modifying the acoustic characteristics of the vocal tract. In addition, the manner of articulation is used to characterize different consonant sounds, as there are several ways to produce speech while the position of the primary obstruction remains the same (e.g., short taps and flaps, repeated trills, or the narrow constrictions of fricatives already mentioned).

In terms of vocal tract shape, a special class of consonants are the nasals, which are produced with the velum (a soft structure at the back top of the oral cavity) open, thereby allowing air to flow into the nasal cavity. When the velum is open, the vocal tract can be viewed as a shared tube from the larynx to the back of the mouth, after which the tract divides into two parallel branches consisting of the oral and nasal cavities. Coupling the nasal cavity to the vocal tract has a pronounced impact on the resonances and anti-resonances of the tract. Listeners commonly perceive this as nasalization of speech sounds.

[Figure: side view of the speech production organs. Image: Blausen.com staff (2014), “Medical gallery of Blausen Medical 2014,” WikiJournal of Medicine 1 (2), DOI:10.15347/wjm/2014.010, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=29294598]

[Figure: the vocal folds as seen from above.]

[Figure: the motion of the vocal folds seen from the front (or back).]

[Figure: organs in the mouth.]

The four images above are from Wikipedia.

2.2.2. Acoustic properties of speech signals

The most important acoustic features of a speech signal are (roughly speaking):

The resonances of the vocal tract, especially the two lowest resonances, known as the formants F1 and F2 (see figure below). The resonance structure can easily be examined by drawing an “envelope” above the spectrum, that is, a smooth line which passes just above the spectrum, as seen in the figure below. We thus obtain the spectral envelope, which characterizes the macro-shape of the spectrum of a speech signal and which is often used to model speech signals.
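One common way to obtain such an envelope is all-pole (linear-predictive) modelling, whose smooth frequency response traces the peaks of the spectrum. The Python sketch below is an added, minimal illustration (the synthetic frame, the model order of 12, and the simple peak read-out are assumptions), not code from the original text.

```python
import numpy as np

def levinson(r, order):
    """Levinson-Durbin recursion: autocorrelation r[0..order] -> predictor coefficients."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= (1.0 - k * k)
    return a

def lpc_envelope_db(frame, order=12, nfft=1024):
    """All-pole (LPC) spectral envelope of a windowed frame, in dB (up to a gain constant)."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = levinson(r, order)
    return -20.0 * np.log10(np.abs(np.fft.rfft(a, nfft)) + 1e-12)

# Toy frame: two damped sinusoids standing in for a vowel with resonances near 700 and 1200 Hz.
fs = 16000
t = np.arange(0, 0.03, 1 / fs)
frame = np.exp(-60 * t) * (np.sin(2 * np.pi * 700 * t) + 0.7 * np.sin(2 * np.pi * 1200 * t))
env = lpc_envelope_db(frame)
freqs = np.linspace(0, fs / 2, len(env))
print(f"envelope maximum near {freqs[np.argmax(env)]:.0f} Hz")   # close to the stronger resonance
```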

The fundamental frequency of a speech signal, or its absence, carries a lot of information. By definition, voiced and unvoiced phonemes are those with and without an oscillation in the vocal folds, respectively. Due to its prominence, we categorize phonemes according to whether they are voiced or unvoiced. The airflow which passes through the oscillating vocal folds will generally have a waveform which resembles a half-wave rectified sinusoid: airflow is zero when the vocal folds are closed (the closed phase), and during the open phase the waveform resembles (somewhat) the upper half of a sinusoid. The spectrum of this waveform therefore has the structure of a harmonic signal, that is, it has peaks at the fundamental frequency and its integer multiples (see figure below). In most languages, pitch does not differentiate between phonemes. However, in languages known as tonal languages, the shape of the pitch contour over time does bear semantic meaning (see Wikipedia: Tone (linguistics) for a nice sound sample). Pitch contours are, however, often used to encode emphasis in a sentence. Roughly speaking, exerting more physical effort on a phoneme raises its pitch and intensity, which is usually interpreted as emphasis; that is, the word (or phoneme) with emphasis is more important than the other words (or phonemes) in the sentence.
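The harmonic structure described above can be verified with a toy computation. The Python sketch below (added as an illustration; the 120 Hz fundamental and the idealized half-wave rectified pulse shape are assumptions) shows that essentially all of the spectral energy of such a waveform falls at integer multiples of the fundamental.

```python
import numpy as np

fs = 16000
f0 = 120                                    # assumed fundamental frequency
t = np.arange(0, 0.5, 1 / fs)               # 0.5 s, so 120 Hz falls exactly on an FFT bin

# Idealized voiced excitation: airflow resembling a half-wave rectified sinusoid.
flow = np.maximum(np.sin(2 * np.pi * f0 * t), 0.0)

spectrum = np.abs(np.fft.rfft(flow)) / len(flow)
freqs = np.fft.rfftfreq(len(flow), 1 / fs)

on_harmonic = (freqs % f0) == 0             # bins sitting exactly on integer multiples of f0
print("spectral energy on harmonics of f0:", np.sum(spectrum[on_harmonic] ** 2))
print("spectral energy everywhere else   :", np.sum(spectrum[~on_harmonic] ** 2))
# Essentially all energy lies at integer multiples of f0. (For this idealized pulse the
# odd harmonics above the fundamental happen to vanish, but they are still multiples of f0.)
```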

Signal amplitude or intensity over time is another important characteristic; in its crudest form it can be the difference between speech and silence (see also Voice activity detection (VAD)). Furthermore, some phonemes are characterized by their temporal structure, in particular stop (plosive) consonants, where airflow is stopped and subsequently released (e.g. /p/, /t/ and /k/). While the stop part is not prominently audible, it is the contrast of a silence before a burst of energy that characterizes these consonants.
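A crude speech/silence decision of the kind mentioned above can be made from short-time intensity alone. The Python sketch below is an added illustration (the frame length, the 15 dB threshold, and the noise-floor estimate are assumptions); practical voice activity detectors are far more robust.

```python
import numpy as np

def energy_vad(signal, fs, frame_ms=20, threshold_db=15.0):
    """Toy voice-activity detector: a frame counts as 'speech' if its short-time
    energy is more than threshold_db above a rough noise-floor estimate."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    floor_db = np.percentile(energy_db, 10)        # quietest frames approximate the noise floor
    return energy_db > floor_db + threshold_db     # True = speech-like frame

# Synthetic test: near-silence, then a louder tone, then near-silence again.
fs = 16000
quiet = 0.001 * np.random.randn(int(0.3 * fs))
loud = 0.5 * np.sin(2 * np.pi * 220 * np.arange(int(0.3 * fs)) / fs)
print(energy_vad(np.concatenate([quiet, loud, quiet]), fs).astype(int))
```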

[Figure: formants and f0.]

[Figure: the waveform of a sentence of speech, illustrating variations in amplitude and intensity.]

2.2.3. Physiological modelling

2.2.3.1. Vocal tract

Vowels are central to spoken communication, and vowels are determined by the shape of the vocal tract. Modelling the vocal tract is therefore of particular interest.

2.2.3.1.1. Simple models

The vocal tract is essentially a tube whose cross-section varies along its length. It has a 90-degree bend where the throat turns into the mouth, but the acoustic effect of that bend is minor and can be ignored in simple models. The tract has two pathways, through the oral and nasal cavities. The acoustic effect of the oral cavity dominates the output signal: roughly speaking, the oral cavity contributes resonances to the output sound, while the nasal cavity contributes mainly anti-resonances (dips or valleys) to the spectral envelope. Presence of energy is perceptually more important than absence of energy, so anti-resonances can be ignored in simple models.

A very simple model is thus a straight tube sub-divided into cylindrical segments of constant radius and equal length (see illustration below). If we further assume that the tube segments are lossless, then this tube is analytically equivalent to a linear predictor. This is a fantastic simplification in the sense that, from a physiologically motivated model, we obtain an analytically tractable model whose parameters we can readily estimate from observed signals. In fact, the temporal correlation of speech signals can be modelled very efficiently with linear predictors. This offers a very attractive connection between physiological and signal modelling. Unfortunately, it is not entirely accurate.

Though speech signals are very efficiently modelled by linear predictors, and linear predictors are analytically equivalent to tube models, linear predictors estimated from sound signals need not correspond to the tube which generated the sound. The mismatch between the shapes of the estimated and the real tubes has two primary causes:

First, estimation of linear predictive coefficients assumes that the excitation, viz. the glottal excitation, is uncorrelated (white noise). This is certainly an incorrect assumption. Though the periodic structure of the glottal excitation does not bias linear predictors much, glottal excitations are also dominated by low-frequency components, which do bias the linear predictor. The linear predictor cannot distinguish between features of the glottal excitation and contributions of the vocal tract, but models both indiscriminately. We also do not know the precise contribution of the glottal excitation, so it is hard to compensate for it.

Second, the analytical relationship between the coefficients of the linear predictor and the radii of the tube-model segments is highly non-linear and sensitive to estimation errors. Small errors in the predictor parameters can have large consequences for the shape of the tube model.

Still, since linear predictors model speech efficiently, they remain useful even if the connection to tube modelling is sensitive to errors. Linear prediction is particularly attractive because it gives computationally efficient algorithms.
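The correspondence between a linear predictor and a lossless tube can be made concrete: the Levinson-Durbin recursion yields reflection coefficients, and each reflection coefficient maps to an area ratio between neighbouring tube segments. The Python sketch below is an added illustration under those textbook assumptions (lossless, piecewise constant tube; one common sign convention for the reflection coefficients); as the text warns, the areas recovered from a real signal need not resemble the true vocal tract.

```python
import numpy as np

def reflection_coeffs(frame, order):
    """Levinson-Durbin recursion, returning the reflection (PARCOR) coefficients."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    ks = []
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= (1.0 - k * k)
        ks.append(k)
    return np.array(ks)

def tube_areas(ks, first_area=1.0):
    """Segment areas of an idealized lossless tube, using one common sign convention:
    k_i = (A_{i+1} - A_i) / (A_{i+1} + A_i)."""
    areas = [first_area]
    for k in ks:
        areas.append(areas[-1] * (1.0 + k) / (1.0 - k))
    return np.array(areas)

# Toy usage on a synthetic frame (a damped resonance standing in for a vowel frame).
fs = 16000
t = np.arange(0, 0.03, 1 / fs)
frame = np.exp(-80 * t) * np.sin(2 * np.pi * 700 * t)
ks = reflection_coeffs(frame, order=8)
print("reflection coefficients:", np.round(ks, 3))
print("relative tube areas    :", np.round(tube_areas(ks), 3))
```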

2.2.3.1.2. Advanced models

When more accurate modelling of the vocal tract is required, we have to re-evaluate our assumptions. With digital waveguides we can readily formulate models which incorporate a second pathway corresponding to the nasal tract. A starting point for such models is linear prediction, written as a delay-line with reflections corresponding to the interfaces between tube-segments. The nasal tract can then be introduced by adding a second delay line. Such models are computationally efficient in synthesis of sounds, but estimating their parameters from real sounds can be difficult.

Stepping up the accuracy, we arrive at full-blown physical modelling such as the finite-element method (FEM). Here, for example, the air volume of the vocal tract can be split into small interacting elements governed by fluid dynamics. The denser the mesh of elements, the more accurately the model corresponds to physical reality. Measuring and modelling the vocal tract with this method is involved and an art form of its own.

[Figure: illustration of a vocal-tract tube model consisting of piece-wise constant-radius tube segments.]

2.2.3.2. Glottal activity

To characterize the glottal flow, we define the events of a single glottal period as follows (illustrated in the figure below):

Opening and closing times (or instants) are the points in time where, respectively, the glottal folds open and close, and where the glottal flow starts and ends.

Open and closed phase are the periods during which the glottis is open and closed, respectively.

The lengths of time during which the glottis is open and closed are known, respectively, as the open time (OT) and the closed time (CT). Consequently, the period length is \(T = OT + CT\).

Opening and closing phases are the portions of the open phase during which the glottis is opening and closing, respectively.

The steepness of the closing phase is related to the “aggressiveness” of the pulse, that is, it relates to the tension of the glottal folds and is characterized by the (negative) peak of the glottal flow derivative.

All parameters describing a length in time are often further normalized by the period length \(T\).
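These timing definitions translate directly into simple measurements on a sampled flow waveform. The Python sketch below (an added illustration using an idealized, synthetic one-period pulse and a simple flow threshold; estimating these quantities from real signals is much harder) computes the open time, the closed time, and the resulting open quotient OT/T.

```python
import numpy as np

fs = 16000
f0 = 120                                       # assumed fundamental frequency
period = int(fs / f0)                          # samples in one glottal period

# Idealized single period of glottal flow: open (half-sinusoid shaped flow) for roughly
# the first 60 % of the period, closed (zero flow) for the remainder.
open_samples = int(0.6 * period)
flow = np.zeros(period)
flow[:open_samples] = np.sin(np.pi * np.arange(open_samples) / open_samples)

is_open = flow > 1e-3                          # nonzero flow -> glottis treated as open
OT = np.sum(is_open) / fs                      # open time (s)
CT = np.sum(~is_open) / fs                     # closed time (s)
T = period / fs                                # period length, T = OT + CT
print(f"OT = {1000 * OT:.2f} ms, CT = {1000 * CT:.2f} ms, open quotient OT/T = {OT / T:.2f}")
```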

As with modelling of the vocal tract, there is a range of models of different complexity for glottal activity:

Maximum-phase linear prediction: The most significant event in a single glottal flow pulse is its closing instant; the preceding waveform is smooth, but the closing event is abrupt. The waveform can thus be interpreted as the impulse response of an IIR filter turned backwards in time, which is also known as the impulse response of a maximum-phase linear predictor (the accompanying figure was generated with this method). The beauty of this method is that it is similar to vocal tract modelling with linear prediction, so the method is already familiar and its computational complexity is low. Observe, however, that maximum-phase filters are by definition unstable (not realizable), so the signal always has to be processed backwards, which complicates system design.

The Liljencrantz-Fant (LF) model is a classical model of the glottal flow, whose original form is a function of four parameters (defined in the original article). It is very useful and influential because it parametrizes the flow with a small number of easily understandable parameters. The compromise is that the parameters are not easily estimated from real signals, and that the model is based on anecdotal evidence of glottal flow shapes; if it were presented today, more evidence would be required for it to be widely accepted.

Mass-spring systems: the opposing glottal folds can be modelled as simple point masses connected by damped springs to fixed points. When subjected to the Venturi forces generated by the airflow, these masses can be brought to oscillate like the vocal folds. Such models are attractive because, again, their parameters have physical interpretations, but since the parameters are difficult to estimate from real-world data and the models oscillate only over a limited range of parameter values, their usefulness in practical applications is limited.

Finite-element methods (FEM) are again the ultimate method for accurate analysis, suitable for example in medical analysis, yet the computational complexity is prohibitively large for consumer applications.

[Figure: illustration of a glottal flow pulse, its derivative, and a sequence of glottal flow pulses.]

2.2.3.3. Lip radiation

Having travelled through the vocal tract, the air exits primarily through the mouth and to some extent through the nose. In leaving this tube, it enters the free field, where airflow in itself has little effect; recall that sounds are, instead, variations in air pressure. At the transition from the tube to the free field, variations in airflow become variations in air pressure.

The physics of this phenomenon is governed by fluid dynamics, an advanced topic, but heuristically we can imagine that variations in air pressure are related to variations in airflow. Thus, if we take the derivative of the airflow, we get an approximation of its effect on air pressure:

\( sound(t) \approx \frac{d}{dt}\, flow(t), \)

where \(t\) is time.

Often we deal with signals sampled at time indices \(n\), where the derivative can be further approximated by the first difference

\( sound(n) \approx g \left[ flow(n) - flow(n-1) \right], \)

where \(g > 0\) is a scalar gain coefficient.
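In discrete time, the first-difference approximation above amounts to a simple one-tap high-pass filter. The following Python sketch is an added illustration (the synthetic "flow" signal and the unit gain are assumptions): differencing removes the DC component of the flow and tilts the spectrum upward by roughly 6 dB per octave.

```python
import numpy as np

def lip_radiation(flow, g=1.0):
    """Approximate lip radiation: sound(n) = g * (flow(n) - flow(n-1))."""
    return g * np.diff(flow, prepend=flow[0])

# Toy glottal-flow-like input: a half-wave rectified 120 Hz sinusoid.
fs = 16000
t = np.arange(0, 0.05, 1 / fs)
flow = np.maximum(np.sin(2 * np.pi * 120 * t), 0.0)
sound = lip_radiation(flow)

# Differencing removes the DC offset of the flow and emphasizes higher frequencies.
print("mean of flow :", round(float(flow.mean()), 4))
print("mean of sound:", round(float(sound.mean()), 6))   # approximately zero
```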

  • European Union
  • Foreign Policy
  • Gender and Politics
  • Human Rights and Politics
  • Indian Politics
  • International Relations
  • International Organization (Politics)
  • International Political Economy
  • Irish Politics
  • Latin American Politics
  • Middle Eastern Politics
  • Political Theory
  • Political Methodology
  • Political Communication
  • Political Philosophy
  • Political Sociology
  • Political Behaviour
  • Political Economy
  • Political Institutions
  • Politics and Law
  • Public Administration
  • Public Policy
  • Quantitative Political Methodology
  • Regional Political Studies
  • Russian Politics
  • Security Studies
  • State and Local Government
  • UK Politics
  • US Politics
  • Browse content in Regional and Area Studies
  • African Studies
  • Asian Studies
  • East Asian Studies
  • Japanese Studies
  • Latin American Studies
  • Middle Eastern Studies
  • Native American Studies
  • Scottish Studies
  • Browse content in Research and Information
  • Research Methods
  • Browse content in Social Work
  • Addictions and Substance Misuse
  • Adoption and Fostering
  • Care of the Elderly
  • Child and Adolescent Social Work
  • Couple and Family Social Work
  • Developmental and Physical Disabilities Social Work
  • Direct Practice and Clinical Social Work
  • Emergency Services
  • Human Behaviour and the Social Environment
  • International and Global Issues in Social Work
  • Mental and Behavioural Health
  • Social Justice and Human Rights
  • Social Policy and Advocacy
  • Social Work and Crime and Justice
  • Social Work Macro Practice
  • Social Work Practice Settings
  • Social Work Research and Evidence-based Practice
  • Welfare and Benefit Systems
  • Browse content in Sociology
  • Childhood Studies
  • Community Development
  • Comparative and Historical Sociology
  • Economic Sociology
  • Gender and Sexuality
  • Gerontology and Ageing
  • Health, Illness, and Medicine
  • Marriage and the Family
  • Migration Studies
  • Occupations, Professions, and Work
  • Organizations
  • Population and Demography
  • Race and Ethnicity
  • Social Theory
  • Social Movements and Social Change
  • Social Research and Statistics
  • Social Stratification, Inequality, and Mobility
  • Sociology of Religion
  • Sociology of Education
  • Sport and Leisure
  • Urban and Rural Studies
  • Browse content in Warfare and Defence
  • Defence Strategy, Planning, and Research
  • Land Forces and Warfare
  • Military Administration
  • Military Life and Institutions
  • Naval Forces and Warfare
  • Other Warfare and Defence Issues
  • Peace Studies and Conflict Resolution
  • Weapons and Equipment

The Oxford Handbook of Psycholinguistics

29 Speech Production

Carol A. Fowler, Haskins Laboratories and Department of Psychology, University of Connecticut, and Department of Linguistics, Yale University.

  • Published: 18 September 2012

A theory of speech production provides an account of the means by which a planned sequence of language forms is implemented as vocal tract activity that gives rise to an audible, intelligible acoustic speech signal. Such an account must address several issues. Two central issues are considered in this article. One issue concerns the nature of language forms that ostensibly compose plans for utterances. Because of their role in making linguistic messages public, a straightforward idea is that language forms are themselves the public behaviors in which members of a language community engage when talking. By most accounts, however, the relation of phonological segments to actions of the vocal tract is not one of identity. Rather, phonological segments are mental categories with featural attributes. Another issue concerns what, at various levels of description, the talker aims to achieve. This article focuses on speech production, and considers language forms and plans for speaking, along with speakers' goals as acoustic targets or vocal tract gestures, the DIVA theory of speech production, the task dynamic model, coarticulation, and prosody.

Language forms provide the means by which language users can make an intended linguistic message available to other members of the language community. Necessarily, then, they have two distinct characteristics. On the one hand, they are linguistic entities, morphemes and phonological segments, that encode the talker's linguistic message. On the other hand, they either have physical properties themselves (e.g. Browman and Goldstein, 1986) or, by other accounts, they serve as an interface between the linguistic and physical domains of language use.

A theory of speech production provides an account of the means by which a planned sequence of language forms is implemented as vocal tract activity that gives rise to an audible, intelligible acoustic speech signal. 1 Such an account must address several issues. Two central issues are discussed here.

One issue concerns the nature of language forms that ostensibly compose plans for utterances. Because of their role in making linguistic messages public, a straightforward idea is that language forms are themselves the public behaviors in which members of a language community engage when talking. By most accounts, however, the relation of phonological segments to actions of the vocal tract is not one of identity. Rather, phonological segments are mental categories with featural attributes. We will consider reasons for this stance, relevant evidence, and an alternative theoretical perspective.

Another issue concerns what, at various levels of description, the talker aims to achieve (e.g. Levelt et al., 1999 ). In my discussion of this issue, I focus here on the lowest level of description—that is, on what talkers aim to make publicly available to listeners. A fundamental theoretical divide here concerns whether the aims are acoustic or articulatory. On the one hand, it is the acoustic signal that stimulates the listener's ears, and so one might expect talkers to aim for acoustic targets that point listeners toward the language forms that compose the talker's intended message. On the other hand, acoustic speech signals are produced by vocal tract actions. The speaker has to get the actions right to get the acoustic signal right.

Readers may wonder whether this is a “tempest in a teapot.” That is, why not suppose that talkers plan and control articulations that will get the signal right, so that in a sense both articulation and acoustics are controlled? Readers will see, however, that there are reasons why theorists typically choose one account or the other.

These issues are considered in turn in the following two sections.

29.1 Language forms and plans for speaking

By most accounts, as already noted, neither articulation nor the acoustic signal is presumed to implement phonological language forms transparently. Language forms are conceived of as abstract mental categories about which acoustic speech signals provide cues.

There are two quite different reasons for this point of view. One is that language forms are cognitive entities (e.g. Pierrehumbert, 1990 ). In particular, word forms are associated, presumably in the lexical memory of a language user, with word meanings. As such they constitute an important part of what a language user knows that permits him or her to produce and understand language. Moreover, word forms in the lexicons of languages exhibit systematic properties which can be captured by formal rules. There is some evidence that language users know these rules. For example, in English, voiceless stop consonants are aspirated in stressed syllable-initial position. That systematic property can be captured by a rule (Kenstowicz and Kisseberth, 1979 ).

Evidence that such a rule is part of a language user's competence is provided, for example, by foreign accents. When native English speakers produce words in a Romance language such as French, which has unaspirated stops where English has aspirated stops, they tend to aspirate the stops. Accordingly, the word pas, [pa] 2 in French, is pronounced [pʰa], as if the English speaker is applying the English rule to French words. A second source of evidence comes from spontaneous errors of speech production. Kenstowicz and Kisseberth (1979) report an error in which a speaker intended to produce tail spin, but instead said pail stin. In the intended utterance, /t/ in tail is aspirated; /p/ in spin is unaspirated. The authors report, however, that, in the error, appropriately for their new locations, /p/ was pronounced [pʰ] and /t/ was pronounced [t]. One account of this “accommodation” (but not the only one possible) is that the exchange of /t/ and /p/ occurred before the aspiration rule had been applied by the talker. When the aspiration rule was applied, /p/ was accommodated to its new context. 3
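
To see the ordering argument concretely, here is a deliberately toy sketch (not a serious phonological formalism): a segment exchange is applied to the planned words first, and an aspiration rule is applied afterwards, so each stop surfaces as aspirated or unaspirated according to its new position. Word-initial position stands in for stressed syllable-initial position, and the orthographic strings stand in for phonological forms.

```python
# Toy sketch of the rule-ordering account of accommodation: the /t/-/p/ exchange
# applies to the planned forms first; the aspiration rule applies afterwards.
def aspirate(word):
    """Aspirate a voiceless stop in word-initial position (toy aspiration rule)."""
    if word and word[0] in "ptk":
        return word[0] + "\u02b0" + word[1:]   # append a superscript h
    return word

def exchange(word1, word2, seg1, seg2):
    """Exchange one occurrence of two segments across planned words (a movement error)."""
    return word1.replace(seg1, seg2, 1), word2.replace(seg2, seg1, 1)

planned = ("tail", "spin")
erred = exchange(*planned, "t", "p")            # ('pail', 'stin')

print([aspirate(w) for w in planned])           # ['tʰail', 'spin']
print([aspirate(w) for w in erred])             # ['pʰail', 'stin']: accommodation
```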

A second reason to suppose that language forms exist only in the mind is coarticulation. Speakers temporally overlap the articulatory movements for successive consonants and vowels. This makes the movements associated with a given phonetic segment context-sensitive and lacking an obvious discrete segmental structure. Likewise, the acoustic signal which the movements produce is context-sensitive. Despite researchers' best efforts (e.g. Stevens and Blumstein, 1981 ) they have not uncovered invariant acoustic information for individual consonants and vowels. In addition, the acoustic signal, like the movements that produce it, lacks a phone-sized segmental structure.

This evidence notwithstanding, there are reasons to resist the idea that language forms reside only in the minds of language users. They are, as noted, the means that languages provide to make linguistic messages public. Successful recognition of language forms would seem more secure were the forms themselves public things.

Browman and Goldstein (e.g. 1986 ; 1992 ) have proposed that phonological language forms are gestures achieved by vocal tract synergies that create and release constrictions. They are both the actions of the vocal tract (properly described) that occur during speech and at the same time units of linguistic contrast. (“Contrast” means that a change in a gesture or gestural parameter can change the identity of a word. For example, the word hot can become tot by addition of a tongue tip constriction gesture; tot can become sot by a change in the tongue tip's constriction degree.)

From this perspective, phonetic gestures are cognitive in nature. That is, they are components of a language user's linguistic competence, and, as noted, they serve as units of contrast in the language. However, cognitive entities need not be covert (see e.g. Ryle, 1949). They can be psychologically meaningful actions, in this case of a language user. As for coarticulation, although it creates context sensitivity in articulatory movements, it does not make gestures context-sensitive. For example, lip closure for /b/, /p/, and /m/ occurs despite coarticulatory encroachment from vowels that affects jaw and lip motion.

There is some skepticism about whether Browman and Goldstein's “articulatory phonology” as just described goes far enough beyond articulatory phonetics. 4 This is in part because it does not yet provide an account of many of the phonological systematicities (e.g. vowel harmony in Hungarian, Turkish, and many other languages; but see Gafos and Benus, 2003) which exist across the lexicons of languages and which other theories of phonology capture by means of rules (e.g. Kenstowicz and Kisseberth, 1979) or constraints (Archangeli, 1997). However, the theory is well worth considering, because it is unique in proposing that language forms are public events.

Spontaneous errors of speech production have proved important sources of evidence about language planning units. These errors, produced by people who are capable of producing error-free tokens, appear to provide evidence both about the units of language that speakers plan to produce and about the domain over which they plan. Happily, the units which participate in errors have appeared to converge with units that linguistic analysis has identified as real units of the language. For example, words participate in errors as anticipations (e.g. sky is in the sky for intended sun is in the sky; this and other errors from Dell, 1986 ), perseverations ( class will be about discussing the class for intended class will be about discussing the test ), exchanges ( writing a mother to my letter for writing a letter to my mother ), and non-contextual substitutions ( pass the salt for pass the pepper ) . Consonants and vowels participate in the same kinds of error. Syllables do so only rarely; however, they serve as frames that constrain how consonants and vowels participate in errors. Onset consonants interact only with onset consonants; vowels interact with vowels; and, albeit rarely, coda consonants interact with coda consonants. Interacting segments tend to be featurally similar to one another. Moreover, when segments move, they tend to move to contexts which are featurally similar to the contexts in which they were planned to occur. Segments are anticipated over shorter distances than words (Garrett, 1980 ), suggesting that the planning domains for words and phonological segments are different.

Historically, most error corpora were collected by individuals who transcribed the errors that they heard. As noted, the errors tended to converge with linguists' view of language forms as cognitive, not physical entities (e.g. Pierrehumbert, 1990 ). As researchers moved error collection into the laboratory, however, it became clear that errors occur that are inaudible. Moreover, these errors violate constraints on errors that collectors had identified.

One constraint was that errors are categorical in nature. If, in production of Bob flew by Bligh Bay, the /l/ of Bligh were perseverated into the onset of Bay, producing Blay, the /l/ would be a fully audible production. However, electromyographic evidence revealed to Mowrey and MacKay (1990) that errors are gradient. Some produce an audible token of /l/; others do not, yet show activity of a lingual muscle indicating the occurrence of a small lingual (tongue) gesture for /l/.

A second constraint is that errors result in phonologically well-formed utterances. Not only do vowels interact only with other vowels in errors, and onsets with onsets and codas with codas, but also sequences of consonants in onsets and codas tend to be permissible in the speaker's language. Or so investigators thought before articulatory data were collected in the laboratory. Pouplier ( 2003a ; 2003b ) used a midsagittal electromagnetometer to collect articulator movement data as participants produced repetitions of pairs of words such as cop—top or sop—shop. Like Mowrey and MacKay ( 1990 ), she found errorful articulations (for example, intrusive tongue tip movement toward a /t/ articulation during cop ) in utterances that sounded error-free. In addition, however, she found that characteristically intrusions were not accompanied by reductions of the intended gesture. This meant that, in the foregoing example, constriction gestures for both /t/ and /k/ occurred in the onset of a syllable, a phonotactically impermissible cluster for her English speakers.

What do these findings imply for theories of speech production? For Pouplier and colleagues (Pouplier, 2003b ; Goldstein et al., forthcoming), planning units are intended sequences of vocal-tract gestures that are coordinated in the manner of coupled oscillators. In the literature on limb movements, it has been found that two modes of coordination are stable. Limbs (or hands or fingers) may be oscillated in phase or 180 degrees out of phase (so that extension of one limb occurs when the other limb is flexing). In tasks in which, for example (Kelso, 1984 ; see also Yamanishi et al., 1980 ), hands are oscillated about the wrist at increasing rates, in-phase movements remain stable; however, out-of-phase movements become unstable. Participants attempting to maintain out-of-phase movements slip into phase. Pouplier and colleagues suggest that findings of intrusive tongue tip gestures in the onset of cop and of intrusive tongue body gestures in top constitute a similar shift from a less to a more stable oscillation mode. When top—cop is repeated, syllable onsets /t/ and /k/ each occur once for each pair of rime (/ap/) productions giving a 1:2 coordination mode. When intrusive /t/ and /k/ gestures occur, the new coordination mode is 1:1; that is, the new onset is produced once for each one production of the syllable rime. A 1:1 coordination mode is more stable than a 1:2 mode.
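
The stability argument can be made concrete with the relative-phase dynamics used in the bimanual coordination literature. The following is a minimal sketch of that point, not Pouplier and colleagues' model of speech gestures: it integrates the Haken-Kelso-Bunz relative-phase equation, in which anti-phase coordination remains stable when the ratio b/a is large (slow movement) but collapses into the more stable in-phase mode when the ratio is small, as it is taken to be at faster rates. The parameter values are illustrative only.

```python
# Minimal sketch: Haken-Kelso-Bunz relative-phase dynamics,
#   d(phi)/dt = -a*sin(phi) - 2*b*sin(2*phi),
# where phi is the relative phase of two oscillating effectors.
import numpy as np

def settle_relative_phase(b_over_a, phi0=np.pi - 0.1, a=1.0, dt=0.01, steps=4000):
    """Integrate the relative-phase equation from phi0 and return the final phase."""
    b = b_over_a * a
    phi = phi0
    for _ in range(steps):
        phi += dt * (-a * np.sin(phi) - 2 * b * np.sin(2 * phi))
    return phi

# Slow rate (large b/a): anti-phase coordination stays near pi.
print(settle_relative_phase(b_over_a=1.0))   # ~3.14
# Fast rate (small b/a): anti-phase loses stability and slips to in-phase.
print(settle_relative_phase(b_over_a=0.1))   # ~0.0
```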

A question is what the findings of gradient, phonotactically impermissible errors imply about the interpretability of error analyses based on transcribed, rather than articulatory, corpora. Certainly these errors occur, and certainly they were missed in transcription corpora. However, does it follow that categorical consonant and vowel errors do not occur, and that planning units should be considered intended phonetic gestures (Pouplier) or even commands to muscles (Mowrey and MacKay), rather than the consonants and vowels of traditional phonetic analysis?

There are clearly categorical errors that occur at the level of whole words (recall writing a mother to my letter ) . It does not seem implausible, therefore, that categorical phonetic errors also occur. It may be appropriate (as in the model of Levelt et al., 1999 ) to imagine levels of speech planning, with consonants and vowels of traditional analyses serving as elements of plans at one level, giving way to planned gestures at another.

Findings that error corpora in some ways misrepresent the nature of spontaneous errors of speech production, however, have had the positive consequence that researchers have sought converging (or, as appropriate, diverging) evidence from experiments that elicit error-free speech. For example, Meyer (1991) found evidence for syllable constituents serving as “encoding” units in language production planning. Participants memorized sets of word pairs consisting of a prompt word produced by the experimenter and a response word produced as quickly as possible by the participant. Response words in a set were “homogeneous” if they shared one or more phonological segments; otherwise they were “heterogeneous.” Meyer found faster responses to words in homogeneous compared to heterogeneous sets if response words shared their initial consonant or initial syllable, but not if they shared the syllable rime (that is, the vowel and any following consonants). There was no further advantage over responses to heterogeneous words when the CV of a CVC syllable was shared in homogeneous sets as compared to when just the initial C was shared. Responses to words sharing the whole first syllable did, however, show an advantage over responses to words sharing only the initial consonant. These findings suggest, as errors do, that syllable constituents are among the planning units. They also suggest that encoding for production is a sequential “left-to-right” process.

Sevald et al. (1995) obtained evidence converging with the error data in suggesting that syllables serve as planning frames. They asked participants to repeat pairs of non-words (e.g. KIL KILPER or KIL KILPNER) in which the initial monosyllable either did or did not match the initial syllable of the disyllable. The task was to repeat the pair as many times as possible in four seconds. Mean syllable production time was less when the syllable structure matched. Remarkably, the advantage of matching syllable structure was no less when only syllable structure, but not syllable content, matched (e.g. KEM TILFER vs. KEM TILFNER). In the foregoing examples, it looks as if the advantage could be due to the fact that there were fewer phonetic segments to produce in the matching condition. However, there were other items in which the length advantage was reversed.

29.2 Speakers' goals as acoustic targets or vocal tract gestures

A next issue is how intended sequences of phonetic entities are planned to be implemented as actions or their consequences that are available to a listener. In principle, this issue is orthogonal to the one just considered about the nature of planned language forms. As just discussed, these forms are variously held to be covert, cognitive representations or public, albeit still cognitive, entities. Either view is compatible with proposals that, at the lowest level of description, talkers aim to achieve either acoustic or gestural targets. In the discussion below, therefore, the issue of whether language forms are covert or public in nature is set aside. It may be obvious, however, that, in fact, acoustic target theorists at least implicitly hold the former view and gesture theorists the latter.

Guenther et al. ( 1998 ) argue against gestural targets on several grounds and argue for acoustic targets. One ground for rejecting gestural targets, such as constriction location and degree, concerns the feedback information that speakers would need to implement the targets sufficiently accurately. To know whether or not a particular constriction has been achieved requires perceptual information. If, for example, an intended constriction is by the lips (as for /b/, /p/, or /m/), talkers can verify that the lips are closed from proprioceptive information for lip contact. However, Guenther et al. argue that, in particular for vowels, constrictions do not always involve contact by articulators, and therefore intended constrictions cannot be verified. In addition, they argue, to propose that talkers intend to achieve particular constrictions implies that talkers should not be able to compensate for experimental perturbations that prevent those constrictions from being achieved. However, some evidence suggests that they can. For example, Savariaux et al. ( 1995 ) had talkers produce vowels with a tube between their lips that prevented normal lip rounding for the vowel /u/. The acoustic effects of the lip tube could be compensated for by lowering the larynx (thereby enlarging the oral cavity by another means than rounding). Of the eleven participants, one compensated fully for the lip tube. Six others showed limited evidence of compensation.

A third argument for acoustic targets is provided by American English /r/. According to Guenther et al., /r/ is produced in very different ways by different speakers or even by the same speaker in different contexts. The different means of producing /r/ are acoustically very similar. One account for the articulatory variability, then, is that it is tolerated if the different means of production produce inaudibly different acoustic signals, the talker's production aim. Finally, Guenther et al. argue that ostensible evidence for constriction targets—that, for example, invariant constriction gestures occur for /b/ and other segments—need not be seen as evidence uniquely favoring gestural targets. Their model “DIVA” (originally “directions in orosensory space onto velocities of articulators”; described below) learns to achieve acoustic-perceptual targets, but nonetheless shows constriction invariance. However, there is also evidence favoring the alternative idea that talkers' goals are articulatory not acoustic. Moreover, the arguments of Guenther et al. favoring acoustic targets can be challenged.

Tremblay et al. ( 2003 ) applied mechanical perturbations to the jaw of talkers producing the word sequence see—at. The perturbation altered the motion path of the jaw, but had small and inaudible acoustic effects. Even though acoustic effects were inaudible, over repetitions, talkers compensated for the perturbations and showed after-effects when the perturbation was removed. Compensation also occurred in a silent speech condition, but not in a non-speech jaw movement condition. These results appear inconsistent with a hypothesis that speech targets are acoustic.

There is also a more natural speech example of preservation of inaudible articulations. In an investigation of an X-ray microbeam database, Browman and Goldstein ( 1991 ) found examples of utterances such as perfect memory in which transcription suggested deletion of the final /t/ of perfect. However, examination of the tongue tip gesture for the /t/ revealed its presence. Because of overlap from the bilabial gesture of /m/, however, acoustic consequences of the /t/ constriction gesture were absent or inaudible. As for the suggestion that constriction goals should be unverifiable by feedback when constricting articulators are not in contact with another structure, to my knowledge this is untested speculation.

As for the compensation found by Savariaux et al. (1995; see also Perkell et al., 1993), Guenther et al. do not remark that the compensation is markedly different from that associated with certain other perturbations in being, for most participants, either partial or absent. Compensations for a bite block (which prevents jaw movement) are immediate and nearly complete in production of vowels (e.g. Lindblom et al., 1979). Compensations for jaw and lip perturbations during speech (e.g. tugging the jaw down as it rises to close the lips for a /b/) are very short in latency and nearly complete (e.g. Kelso et al., 1984). These different patterns of compensation are not distinct in the DIVA model. However, they are in speakers. The difference may be understood as relating to the extent to which the laboratory perturbations mimic perturbations which occur naturally in speech production. When a speaker produces, say, /ba/ versus /bi/, coarticulation by the following low (/a/) or high (/i/) vowel will tug the jaw and lower lip down or up. Speakers have to compensate for that to get the lips shut for bilabial /b/. That routine compensation for coarticulation may underlie the fast and functional compensations which occur in the laboratory (Fowler and Saltzman, 1993). However, it is a rare perturbation outside the laboratory that prevents lip rounding. Accordingly, talkers may have no routines in place to compensate for the lip tube, and have to learn them. In a gestural theory, they have to learn to create a mirage: an acoustic signal that mimics the consequences of lip rounding.

As for /r/, ironically, it has turned out to be a poster child for both acoustic and articulatory theorists. Delattre and Freeman ( 1968 ), whom Guenther et al. cite as showing considerable variability in American English articulation of /r/, in fact remark that in every variant they observed there were two constrictions, one by the back of the tongue in the pharyngeal region and one by the tongue tip against the hard palate. (Delattre and Freeman were only looking at the tongue, and so did not remark on a third shared constriction, rounding by the lips.) Accordingly, whether one sees variability or invariance in /r/ articulations may depend on the level of description of the vocal tract configuration deemed relevant to talkers and listeners. In Browman and Goldstein's articulatory phonology (e.g. 1986 ; 1995 ), the relevant level is that of constriction locations and degrees, and those are invariant across the /r/ variants.

Focus on constrictions permits an understanding of a source of dialect variation in American English /r/ that is not illuminated by a proposal that acoustic targets are talkers' aims. Among consonants involving more than one constriction—for example, the nasal consonants (constrictions by lips, tongue tip or tongue body, and by the velum), the liquids /l/ (tongue tip and body) and /r/ (tongue body, tip, and lips), and the approximant /w/ (tongue body and lips)—a generalization holds regarding the phasing of the constriction gestures. Prevocalically, the gestures are achieved nearly simultaneously; postvocalically, the gesture with the more open (vowel-like) constriction degree leads (see research by Sproat and Fujimura, 1993; Krakow, 1989; 1993; Gick, 1999). This is consistent with the general tendency in syllables for the more sonorant (roughly, more vowel-like) consonants to be positioned closest to the vowel. (For example, the ordering in English is /tr/ before the vowel as in tray, but /rt/ after the vowel as in art.) Goldstein (pers. comm., 15 Aug. 2005) points out that, in two dialects of American English, one spoken in Brooklyn and one in New Orleans, talkers produce postvocalic consonants in such a way that, for example, bird sounds to listeners somewhat like boyd. This is understandable if talkers exaggerate the tendency for the open lip and tongue body constrictions to lead the tip constriction. Together, the lip and tongue body configurations create a vowel sound like /ɔ/ (in saw); by itself, the tip gesture is like /i/ (in see). Together, the set of gestures yields something resembling the diphthong /ɔɪ/ as in boy.

In short, there are arguments and there is evidence favoring both theoretical perspectives—that targets of speech production planning are acoustic or else are gestural. Deciding between the perspectives will require further research.

29.2.1 Theories of speech production

As noted, theories of speech production differ in their answer to the question of what talkers aim to achieve, and a fundamental difference is whether intended targets are acoustic or articulatory. Within acoustic theories, accounts can differ in the nature of acoustic targets; within articulatory theories, accounts can be that muscle lengths or muscle contractions are targets, that articulatory movements are targets, or that coordinated articulatory gestures are targets. I will review one acoustic and one articulatory account. I chose these accounts because they are the most fully developed theories within the acoustic and articulatory domains.

29.3 The DIVA theory of speech production

In this account (e.g. Guenther et al., 1998 ), targets of speaking are normalized acoustic signals reflecting resonances of the vocal tract (“formants”). The normalization transformations create formant values that are the same for men, women, and children even though acoustic reflections of formants are higher in frequency for women than for men and for young children than for women. Because formants characterize vowels and sonorant consonants but not (for example) stop or fricative consonants, the model is restricted to explanation of just those classes of phones.

Between approximately six and eight months of age, infants engage in vocal behavior called babbling in which they produce what sounds like sequences of CV syllables. In this way, in DIVA, the young model learns a mapping from articulator positions to normalized acoustic signals. Over learning, this mapping is inverted so that acoustic-perceptual targets can underlie control of articulatory movements. In the model, the perceived acoustic signal has three degrees of freedom (one per normalized formant). In contrast, the articulatory system has the seven degrees of freedom of Maeda's (1990) articulatory model. This difference in degrees of freedom means that the inverted mapping is one-to-many. Accordingly, a constraint is required to make the mapping determinate. Guenther et al. use a “postural relaxation” constraint whereby the articulators remain as close as possible to the centers of their ranges of motion. This constraint underlies the model's tendency to show near-invariance of constrictions despite having acoustic-perceptual rather than articulatory targets.
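
How a postural constraint can make the one-to-many inverse mapping determinate can be illustrated with a standard redundancy-resolution scheme from motor control. The sketch below is not DIVA's actual machinery: it assumes an arbitrary linear articulator-to-formant mapping (a random 3 × 7 Jacobian) and shows that a pseudoinverse correction of the three-dimensional formant error, plus motion in the mapping's null space toward the centers of the articulators' ranges, picks out a single articulator configuration.

```python
# Sketch (not DIVA's actual equations) of pseudoinverse control of a redundant
# articulatory system with a "postural relaxation" term in the null space.
import numpy as np

rng = np.random.default_rng(0)
J = rng.normal(size=(3, 7))        # toy Jacobian: 7 articulators -> 3 formants
J_pinv = np.linalg.pinv(J)         # pseudoinverse used to correct the acoustic error
q = rng.normal(size=7)             # current articulator configuration
q_neutral = np.zeros(7)            # centers of the articulators' ranges
formant_target = np.array([0.5, -0.2, 0.1])

for _ in range(200):
    formant_error = formant_target - J @ q
    dq_task = J_pinv @ formant_error                          # minimum-norm acoustic correction
    dq_posture = (np.eye(7) - J_pinv @ J) @ (q_neutral - q)   # null-space postural relaxation
    q += 0.1 * (dq_task + dq_posture)

print(np.round(J @ q, 3))      # matches the formant target
print(np.round(q, 3))          # articulators settle near the neutral posture
```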

In addition to that characteristic, the model compensates for perturbations; it does not, however, distinguish the perturbations that humans compensate for well from those they compensate for poorly.

29.4 The task dynamic model

Substantially influenced by the theorizing of Bernstein ( 1967 ), Turvey ( 1977 ) introduced a theory of action in which he proposed that the minimal meaningful units of action were produced by synergies or coordinative structures (Easton, 1972 ). These are transiently established coordinative relations among articulators—those of the vocal tract for speech—which achieve action goals. An example in speech is the organized relation among the jaw and the two lips that achieves bilabial constriction for English /b/, /p/, or /m/. That coordinative relation is not in place when speakers produce a constriction which does not include lip closure (e.g. Kelso et al., 1984 ). The coordinative relation underlies the ability of speakers to compensate for jaw or lip perturbations in the laboratory, and presumably to compensate for coarticulatory demands on articulators shared by temporally overlapping phones outside the laboratory.

Saltzman and colleagues (e.g. Saltzman and Kelso, 1987; Saltzman and Munhall, 1989; see also Turvey, 1990) proposed that synergies are usefully modeled as dynamical systems. Specifically, they suggested that speech gestures can be modeled as mass-spring systems with point attractor dynamics. In turn, those systems are characterized by equations that reflect how the systems' states change over time. Each vocal tract gesture is defined in terms of “tract variables.” Variables include lip protrusion (a constriction location) and lip aperture (a constriction degree). Appropriately parameterized, the variables achieve gestural goals. The tract variables have associated articulators (e.g. the jaw and the two lips) that constitute the synergy that achieves that gestural goal. In one version of the theory, a word is specified by a “gestural score” (Browman and Goldstein, 1986) which provides parameters for the relevant tract variables and the interval of time over which they should be active. In a more recent version (Saltzman et al., 2000), gestural scores are replaced by a central “clock” that regulates the timing of gesture activation. The clock's average “tick” rate determines the average rate of speaking. As we will see later, local clock slowing can mark the edges of prosodic domains.
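
As a concrete illustration of point-attractor dynamics for a single tract variable, the sketch below (with arbitrary parameter values, not the Saltzman-Munhall implementation) drives a lip-aperture variable toward a closure target with a critically damped mass-spring equation. A brief external pull applied mid-gesture on the perturbed run still leaves the variable settling near the same target, illustrating the equifinality that underlies compensation for perturbations.

```python
# Sketch of one task-dynamic tract variable: lip aperture driven to a target by
# a critically damped mass-spring system (a point attractor). Parameters are
# arbitrary illustrative values.
def lip_aperture_gesture(target=0.0, start=10.0, k=200.0, m=1.0,
                         dt=0.001, duration=0.5, perturb=False):
    b = 2 * (k * m) ** 0.5            # critical damping
    x, v = start, 0.0                 # aperture (mm) and its velocity
    for step in range(int(duration / dt)):
        force = -k * (x - target) - b * v
        if perturb and 0.15 <= step * dt < 0.20:
            force += 300.0            # brief external pull on the lips/jaw
        v += dt * force / m           # semi-implicit Euler integration
        x += dt * v
    return x

print(lip_aperture_gesture(perturb=False))  # settles near the 0 mm target
print(lip_aperture_gesture(perturb=True))   # also settles near 0 mm despite the pull
```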

These systems show the equifinality characteristic of real speakers which underlies their ability to compensate for perturbations. That is, although the parameters of the dynamical system for a gesture have context independent values, gestural goals are achieved in a context-dependent manner so that, for example, as in the research by Kelso et al. ( 1984 ), lip closure for /b/ is achieved by different contributions from the lips and jaw on perturbed and unperturbed trials. The model compensates for perturbations which speakers handle without learning, but not for those such as in the study by Savariaux et al., which speakers require learning to handle, if they handle them at all.

29.5 Coarticulation

A hallmark of speech production is coarticulation. Speakers talk very quickly, and talking involves rapid sequencing of the particulate atoms (Studdert-Kennedy, 1998 ) which constitute language forms. Although the atoms are discrete, their articulation is not. Much research on speech production has been conducted with an aim to understand coarticulation. Coarticulation is characterized either as context-sensitivity of production of language forms or as temporally overlapping production. It occurs in both an anticipatory and a carryover direction. In the word stew , for example, lip rounding from the vowel /u/ begins near the beginning of the /s/. In use , it carries over during /s/.

Thirty years ago, there were two classes of accounts of coarticulation. In one point of view (e.g. Daniloff and Hammarberg, 1973) coarticulation was seen as “feature spreading.” Consonants and vowels can be characterized by their featural attributes. For example, consonants can be described as being voiced or unvoiced, as having a particular place of articulation (e.g. bilabial for /b/, /p/, and /m/) and a particular manner of articulation (e.g. /b/ and /p/ are stops; /f/ is a fricative). Vowels are front, central, or back; high, mid, or low; and rounded or unrounded. Many features which characterize consonants and vowels are contrastive, in that changing a feature value changes the identity of a consonant or vowel and the identity of a word that they, in part, compose. For example, changing the feature of a consonant from voiced to unvoiced can change a consonant from /b/ to /p/ and a word from bat to pat. However, some features are not contrastive. Adding rounding to a consonant does not change its identity in English; adding nasalization to a vowel in English likewise does not change its identity.

In feature spreading accounts of coarticulation, non-contrastive features were proposed to spread in an anticipatory direction to any phone unspecified for the feature (i.e. for which the feature was non-contrastive). Accordingly, lip rounding should spread through any consonant preceding a rounded vowel; nasalization should spread through any vowel preceding a nasal consonant. Carryover coarticulation was seen as inertial. Articulators cannot stop on a dime. Accordingly lip rounding might continue during a segment following a rounded vowel. There was some supportive evidence for the feature spreading view of anticipatory coarticulation (Daniloff and Moll, 1968 ).

However, there was also disconfirming evidence. One was a persistent finding (e.g. Benguerel and Cowan, 1974 ) that indications of coarticulation did not neatly begin at phonetic segment edges, as they should if a feature had spread from one phone to another. A second kind of evidence consisted of reports of “troughs” (e.g. Gay, 1978 ; Boyce, 1990 ). These were findings that, for example, during a consonant string between two rounded vowels, the lips would reduce their rounding and lip muscle activity would reduce, inconsistent with an idea that a rounding feature had spread to consonants in the string.

A different general point of view was that coarticulation was “coproduction” (e.g. Fowler, 1977)—i.e. temporal overlap in the production of two or more phones. In this point of view, for example, rounding need not begin at the beginning of a consonant string preceding a rounded vowel, and a trough during a consonant string between two rounded vowels would be expected as the rounding gesture for the first vowel wound down and before rounding for the second vowel began. Bell-Berti and Harris (1981) proposed a specific account of coproduction, known as “frame” theory, in which anticipatory coarticulation began a fixed interval before the acoustically defined onset of a rounded vowel or nasal consonant.

For a while (Bladon and Al-Bamerni, 1982; Perkell and Chiang, 1986), there was the congenial suggestion that both theories might be right. Investigators sometimes found evidence that a rounding or nasalization gesture began at the start of the consonant string (for rounding) or vowel string (for nasalization) preceding a rounded vowel or nasal consonant. Then, at an invariant interval before the rounded or nasal phone, there was a rapid increase in rounding or nasalization, as predicted by frame theory. However, that evidence was contaminated by a confound (Perkell and Matthies, 1992). Bell-Berti and colleagues (e.g. Boyce et al., 1990; Gelfer et al., 1989) pointed out that some consonants are themselves associated with lip rounding (e.g. /s/). Similarly, vowels are associated with lower positions of the velum than are oral obstruents. Accordingly, assessing when anticipatory coarticulation of lip rounding or nasalization begins requires appropriate control utterances, so that lip rounding or velum lowering due to coarticulation can be distinguished from that due to characteristics of the phonetic segments in the coarticulatory domain. For lip rounding, for example, rounding during an utterance such as stew requires comparison with rounding during a control utterance such as stee, in which the rounded vowel is replaced by an unrounded vowel. Any lip rounding during the latter utterance indicates rounding associated with the consonant string, and needs to be subtracted from lip activity during stew. Likewise, velum movement during a CVₙN sequence (that is, an oral consonant followed by n vowels preceding a nasal consonant) needs to be compared to velum movement during a CVₙC sequence. When those comparisons are made, evidence for feature spreading evaporates.

Recently, two different coproduction theories have been distinguished (Lindblom et al., 2002). In the account proposed by Ohman (1966), vowels are produced continuously. In a VCV utterance, according to the account, speakers produce a diphthongal movement from the first to the second vowel. The consonant is superimposed on that diphthongal trajectory. In the alternative account (e.g. Fowler and Saltzman, 1993), gestures for consonants and vowels overlap temporally. Any vowel-to-vowel overlap is temporal overlap, not production of a diphthongal gesture.

Evidence favoring the view of Fowler and Saltzman is the same kind of evidence that disconfirmed feature spreading theory. As noted earlier, speakers show troughs in lip gestures in sequences of consonants that intervene between rounded vowels. They should not if vowels are produced as diphthongal tongue gestures, but they are expected to if vowels are produced as separate gestures that overlap temporally with consonantal gestures.

29.5.1 Coarticulation resistance

Coarticulation has been variously characterized as a source of distortion (e.g. Ohala, 1981 )—i.e. as a means by which articulation does not transparently implement essential phonological properties of consonants and vowels—or even as destructive of those properties (e.g. Hockett, 1955 ).

However, these characterizations overlook the finding of “coarticulation resistance”—an observation first made by Bladon and Al-Bamerni (1976), but developed largely by Recasens (e.g. 1984a; 1984b; 1985; see also Farnetani, 1990). This is the observation that phones resist coarticulatory overlap by neighbors to the extent that the neighbors would interfere with achievement of the phones' gestural goals. For example, Recasens (1984a) found decreasing vowel-to-vowel coarticulation in Catalan VCV sequences when the intervening consonant was one of the set /j/ (a dorso-palatal approximant), /ɲ/ (an alveolopalatal nasal), /ʎ/ (an alveolopalatal lateral), /n/ (an alveolar nasal). Across the set, the consonants make decreasing use of the tongue body to achieve their place of articulation. The tongue body is a major articulator in the production of vowels. Accordingly, it is likely that the decrease in vowel-to-vowel coarticulation across the consonant series occurs to prevent the vowels from interfering with achievement of the consonants' constriction location and degree. Recasens (1984b) found increasing vowel-to-consonant coarticulation in the same consonant series.

Compatible data from English can be seen in Figure 29.1. Figure 29.1a shows tongue body height data from a speaker of American English producing each of six consonants in the context of six following vowels (Fowler, 2005). During closure of three consonants (/b/, /v/, and /g/), there is a substantial shift in tongue body height depending on the following vowel. During closure of the other three consonants (/d/, /z/, and /ð/), there is considerably less. Figure 29.1b shows similar results for tongue dorsum fronting. /b/, /v/, and, perhaps surprisingly, /g/ show less resistance to coarticulation for this speaker of American English than do /d/, /z/, and /ð/. The results for /b/ and /v/ most likely reflect the fact that they are labial consonants. They do not use the tongue, and so coproduction by vowels does not interfere with achievement of their gestural goals. The results for /g/, the fronting results at least, may reflect the fact that there is no stop in American English close in place of articulation to /g/ that might be confused with it were /g/'s place of articulation to shift due to coarticulation by the vowels.

Figure 29.1 Tongue body height (a) and fronting (b) during production of three high and three low coarticulation-resistant consonants produced in the context of six following stressed vowels. Measures taken in mid consonant closure.

29.5.2 Other factors affecting coarticulation

Frame theory (Bell-Berti and Harris, 1981) suggests a fixed extent of anticipatory coarticulation, modulated perhaps by speaking rate. However, the picture is more complicated. Browman and Goldstein (1988) reported a difference in how consonants are phased to a tautosyllabic vowel depending on whether the consonants are in the syllable onset or in the coda. Consonants in the onset of American English syllables are phased so that the gestural midpoint of the consonants aligns with the vowel. In contrast, in the coda, the first consonant is phased invariantly with respect to the vowel regardless of the number of consonants in the coda.

For multi-gesture consonants, such as /l/ (Sproat and Fujimura, 1993 ), /r/, /w/ (Gick, 1999 ), and the nasal consonants (Krakow, 1989 ), the gestures are phased differently in the onset and coda. Whereas they are nearly simultaneous in the onset, the more open (more vowel-like) gestures precede in the coda. This latter phasing appears to respect the “sonority hierarchy” such that more vowel-like phones are closest to the vowel.

29.6 Prosody

There is more to producing speech than sequencing consonants and vowels. Speech has prosodic properties including an intonation contour, various temporal properties, and variations in articulatory “strength.”

Theorists (see Shattuck-Hufnagel and Turk, 1996 for a review) identify hierarchical prosodic domains, each marked in some way phonologically. Domains include intonational phrases, which constitute the domain of complete intonational contours, intermediate phrases marked by a major (“nuclear”) pitch accent and a tone at the phrase boundary, prosodic words (lexical words or a content word followed by a function word as in “call up”), feet (a strong syllable followed by zero or one weak syllables), and syllables. Larger prosodic domains often, but not always, set off syntactic phrases or clauses.

Intonation contours are patterns of variation in fundamental frequency consisting of high and low pitch accents, or accents that combine high and low (or low and high) pitch excursions, and boundary tones at intonational and intermediate phrase boundaries. Pitch accents in the contours serve to accent information that the speaker wants to focus attention on, perhaps because it is new information in the utterance or because the speaker wants to contrast that information with other information. A whole intonation contour expresses some kind of meaning. For example, intonation contours can distinguish yes/no questions from statements (e.g. So you are staying home this weekend?). Other contours can express surprise, disbelief, or other attitudes.

Because intonation contours reflect variation in fundamental frequency (f0), their production involves laryngeal control. This laryngeal control is coarticulated with other uses of the larynx, for example, to implement voicing or devoicing, intrinsic f0 (higher f0 for higher vowels), and tonal accompaniments of obstruent devoicing (a high tone on a vowel following an unvoiced obstruent).

Prosody is marked by other indications of phrasing. Prosodic domains from intonational phrases to prosodic words tend to be marked by final lengthening, pausing, and initial and final “strengthening.” These effects generally increase in magnitude with the “strength” of the prosodic boundary (where “strength” increases with height of a phrase in the prosodic hierarchy). Final lengthening is an increase in the duration of articulatory gestures and their acoustic consequences before a phrase boundary. Strengthening is a quite local increase in the magnitude of gestures at phrase edges (e.g. Fougeron and Keating, 1997 ). Less coarticulation occurs across stronger phrase boundaries, and accented vowels resist vowel-to-vowel coarticulation (Cho, 2004 ).

These marks of prosodic structure serve to demarcate informational units in an utterance. However, we need to ask: why these marks? Final lengthening and pausing are, perhaps, intuitive. Physical systems cannot stop on a dime, and if the larger prosodic domains involve articulatory stoppings and restartings, then we should expect to see slowing to a stop and, sometimes, pausing before restarting. However, why strengthening? Byrd and Saltzman (2003) provide an account of final lengthening and pausing that may also provide some insight into at least some of the occurrences of strengthening. They have extended the task dynamic model, described earlier, to produce the timing variation that characterizes phrasing in prosody. They do so by slowing the rate of time flow of the model's central clock at phrase boundaries. Clock slowing gives rise to longer and less overlapped gestures at phrase edges. The magnitude of slowing reflects the strength of a phrase boundary. Byrd and Saltzman conceive of the slowing as a gesture (a “π gesture”) that consists of an activation wave applied to any segmental gesture with which it overlaps temporally. π gestures span phrase boundaries, and therefore have effects at both edges of a phrase. Because one effect of clock slowing is less overlap of gestures, a consequence may be less truncation of gestures due to overlap, and so larger gestures.
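
A schematic way to see how boundary-linked clock slowing yields local lengthening is sketched below. It is an illustration of the idea rather than Byrd and Saltzman's implementation: gestural intervals are specified in "clock time," a π-gesture-like dip in clock rate (of arbitrary shape and size here) is centered on a phrase boundary, and real-time durations are obtained by integrating the slowed clock, so a gesture whose activation overlaps the boundary comes out longer than an otherwise identical phrase-medial gesture.

```python
# Schematic sketch of pi-gesture-style clock slowing at a phrase boundary.
import numpy as np

dt = 0.001
clock_time = np.arange(0.0, 2.0, dt)

# Clock rate dips from 1.0 to 0.5 around a phrase boundary placed at t = 1.0.
slowing = 0.5 * np.exp(-((clock_time - 1.0) ** 2) / (2 * 0.05 ** 2))
clock_rate = 1.0 - slowing

def real_time_duration(start, end):
    """Real-time duration of a gesture spanning [start, end) in clock time."""
    mask = (clock_time >= start) & (clock_time < end)
    # d(real time) = d(clock time) / clock_rate, so a slowed clock lengthens gestures.
    return float(np.sum(dt / clock_rate[mask]))

print(real_time_duration(0.30, 0.50))  # phrase-medial gesture: ~0.20 s
print(real_time_duration(0.90, 1.10))  # boundary-spanning gesture: noticeably longer
```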

Acknowledgments

Preparation of the manuscript was supported by NICHD grant HD-01994 and NIDCD grant DC-03782 to Haskins Laboratories.

Archangeli, D. ( 1997 ) Optimality theory: an introduction to linguistics in the 1990s. In D. Archangeli and D. T. Langendoen (eds), Optimality Theory: An Overview , pp. 1–32. Blackwell, Malden, MA.

Bell-Berti, F., and Harris, K. S. ( 1981 ) A temporal model of speech production.   Phonetica , 38: 9–20.

Benguerel, A., and Cowan, H. ( 1974 ) Coarticulation of upper lip protrusion in French.   Phonetica , 30: 41–55.

Bernstein, N. ( 1967 ) The Coordination and Regulation of Movement . Pergamon, London.

Bladon, A., and Al-Bamerni, A. ( 1976 ) Coarticulation resistance in English /l/.   Journal of Phonetics , 4: 137–50.

Bladon, A., and Al-Bamerni, A. ( 1982 ) One-stage and two-stage temporal patterns of coarticulation.   Journal of the Acoustical Society of America , 72: S104.

Boyce, S. ( 1990 ) Coarticulatory organization for lip rounding in Turkish and in English.   Journal of the Acoustical Society of America , 88: 2584–95.

Boyce, S., Krakow, R., Bell-Berti, F., and Gelfer, C. ( 1990 ) Converging sources of evidence for dissecting articulatory movements into gestures.   Journal of Phonetics , 18: 173–88.

Browman, C., and Goldstein, L. ( 1986 ) Towards an articulatory phonology.   Phonology Yearbook , 3: 219–52.

Browman, C., and Goldstein, L. ( 1988 ) Some notes on syllable structure in articulatory phonology.   Phonetica , 45: 140–55.

Browman, C., and Goldstein, L. ( 1991 ) Tiers in articulatory phonology, with some implications for casual speech. In J. Kingston and M. Beckman (eds), Papers in Laboratory Phonology , vol. 1: Between the Grammar and the Physics of Speech , pp. 341–76. Cambridge University Press, Cambridge.

Browman, C., and Goldstein, L. ( 1992 ) Articulatory phonology: an overview.   Phonetica , 49: 155–80.

Browman, C., and Goldstein, L. ( 1995 ) Dynamics and articulatory phonology. In R. Port and T. van Gelder (eds), Mind as Motion: Explorations in the Dynamics of Cognition , pp. 175–93. MIT Press, Cambridge, MA.

Byrd, D., and Saltzman, E. ( 2003 ) The elastic phrase: modeling the dynamics of boundary-adjacent lengthening.   Journal of Phonetics , 31: 149–80.

Cho, T. ( 2004 ) Prosodically conditioned strengthening and vowel-to-vowel coarticulation in English.   Journal of Phonetics , 32: 141–76.

Daniloff, R., and Hammarberg, R. ( 1973 ) On defining coarticulation.   Journal of Phonetics , 1: 239–48.

Daniloff, R., and Moll, K. ( 1968 ) Coarticulation of lip rounding.   Journal of Speech and Hearing Research , 11: 707–21.

Delattre, P., and Freeman, D. ( 1968 ) A dialect study of American r's by x-ray motion picture.   Linguistics , 44: 29–68.

Dell, G. ( 1986 ) A spreading-activation theory of retrieval in speech production.   Psychological Review , 93: 283–321.

Easton, T. ( 1972 ) On the normal use of reflexes.   American Scientist , 60: 591–9.

Farnetani, E. ( 1990 ) V-C-V lingual coarticulation and its spatiotemporal domain. In W. J. Hardcastle and A. Marchal (eds), Speech Production and Speech Modeling , pp. 93–130. Kluwer, The Netherlands.

Fougeron, C., and Keating, P. ( 1997 ) Articulatory strengthening at edges of prosodic domains.   Journal of the Acoustical Society of America , 101: 3728–40.

Fowler, C. A. ( 1977 ) Timing Control in Speech Production . Indiana University Linguistics Club, Bloomington.

Fowler, C. A. ( 2005 ) Parsing coarticulated speech: effects of coarticulation resistance.   Journal of Phonetics , 33: 195–213.

Fowler, C. A., and Saltzman, E. ( 1993 ) Coordination and coarticulation in speech production.   Language and Speech , 36: 171–95.

Gafos, A., and Benus, S. (2003) On neutral vowels in Hungarian. Paper presented at the 15th International Congress of Phonetic Sciences, Barcelona.

Garrett, M. ( 1980 ) Levels of processing in speech production. In B. Butterworth (ed.), Language Production , vol. 1: Speech and Talk , pp. 177–220. Academic Press, London.

Gay, T. ( 1978 ) Articulatory units: segments or syllables? In A. Bell and J. B. Hooper (eds), Syllables and Segments , pp. 121–31. North-Holland, Amsterdam.

Gelfer, C., Bell-Berti, F., and Harris, K. ( 1989 ) Determining the extent of coarticulation: effects of experimental design.   Journal of the Acoustical Society of America , 86: 2443–5.

Gick, B. (1999) The articulatory basis of syllable structure: a study of English glides and liquids. Ph.D. dissertation, Yale University.

Goldstein, L., Pouplier, M., Chen, L., Saltzman, E., and Byrd, D. ( forthcoming ) Action units slip in speech production errors.   Cognition .

Guenther, F., Hampson, M., and Johnson, D. ( 1998 ) A theoretical investigation of reference frames for the planning of speech.   Psychological Review , 105: 611–633.

Hockett, C. ( 1955 ) A Manual of Phonetics . Indiana University Press, Bloomington.

Kelso, J. A. S. ( 1984 ) Phase transitions and critical behavior in human bimanual coordination.   American Journal of Physiology , 246: 1000–1004.

Kelso, J. A. S., Tuller, B., Vatikiotis-Bateson, E., and Fowler, C. A. ( 1984 ) Functionally-specific articulatory cooperation following jaw perturbation during speech: evidence for coordinative structures.   Journal of Experimental Psychology: Human Perception and Performance , 10: 812–32.

Kenstowicz, M., and Kisseberth, C. ( 1979 ) Generative Phonology . Academic Press, New York.

Krakow, R. (1989) The articulatory organization of syllables: a kinematic analysis of labial and velar gestures. Ph.D. dissertation, Yale University.

Krakow, R. ( 1993 ) Nonsegmental influences on velum movement patterns: syllables, segments, stress and speaking rate. In M. Huffman, and R. Krakow (eds), Phonetics and Phonology , vol. 5: Nasals, Nasalization and the Velum , pp. 87–116. Academic Press, New York.

Levelt, W., Roelofs, A., and Meyer, A. ( 1999 ) A theory of lexical access in speech production.   Behavioral and Brain Sciences , 22: 1–38.

Lindblom, B., Lubker, J., and Gay, T. ( 1979 ) Formant frequencies of some fixed mandible vowels and a model of speech motor programming by predictive simulation.   Journal of Phonetics , 7: 147–61.

Lindblom, B., Sussman, H., Modaressi, G., and Burlingame, E. ( 2002 ) The trough effect in speech production: implications for speech motor programming.   Phonetica , 59: 245–62.

Maeda, S. ( 1990 ) Compensatory articulation during speech: evidence from the analysis and synthesis of vocal tract shapes using an articulatory model. In W. Hardcastle and A. Marchal (eds), Speech Production and Speech Modeling , pp. 131–49. Kluwer Academic, Boston, MA.

Meyer, A. ( 1991 ) The time course of phonological encoding in language production: phonological encoding inside a syllable.   Journal of Memory and Language , 30: 69–89.

Mowrey, R. and MacKay, I. ( 1990 ) Phonological primitives: electromyographic speech error evidence.   Journal of the Acoustical Society of America , 88: 1299–1312.

Ohala, J. ( 1981 ) The listener as a source of sound change. In C. Masek, R. Hendrick, R. Miller, and M. Mille (eds), Papers from the Parasession on Language and Behavior , pp. 178–203. Chicago Linguistics Society, Chicago.

Ohman, S. ( 1966 ) Coarticulation in VCV utterances: spectrographic measurements.   Journal of the Acoustical Society of America , 39: 151–68.

Perkell, J. and Chiang, C. ( 1986 ) Preliminary support for a ‘hybrid model’ of anticipatory coarticulation. In Proceedings of the 12th International Congress of Acoustics , pp. A3–A6.

Perkell, J. and Matthies, M. ( 1992 ) Temporal measures of labial coarticulation for the vowel /u/.   Journal of the Acoustical Society of America , 91: 2911–25.

Perkell, J., Matthies, M., Svirsky, M., and Jordan, M. ( 1993 ) Trading relations between tongue-body raising and lip rounding in production of the vowel /u/: a pilot ‘motor equivalence’ study.   Journal of the Acoustical Society of America , 93: 2948–61.

Pierrehumbert, J. ( 1990 ) Phonological and phonetic representations.   Journal of Phonetics , 18: 375–94.

Pouplier, M. (2003a) The dynamics of error. Paper presented at the 15th International Congress of Phonetic Sciences, Barcelona.

Pouplier, M. (2003b) Units of phonological encoding: empirical evidence. Ph.D. dissertation, Yale University.

Recasens, D. ( 1984 a) Vowel-to-vowel coarticulation in Catalan VCV sequences.   Journal of the Acoustical Society of America , 76: 1624–35.

Recasens, D. ( 1984 b) V-to-C coarticulation in Catalan VCV sequences: an articulatory and acoustical study.   Journal of Phonetics , 12: 61–73.

Recasens, D. ( 1985 ) Coarticulatory patterns and degrees of coarticulation resistance in Catalan CV sequences.   Language and Speech , 28: 97–114.

Recasens, D. ( 1987 ) An acoustic analysis of V-to-C and V-to-V coarticulatory effects in Catalan and Spanish VCV sequences.   Journal of Phonetics , 15: 299–312.

Ryle, G. ( 1949 ) The Concept of Mind . Barnes & Noble, New York.

Saltzman, E., and Kelso, J. A. S. ( 1987 ) Skilled action: a task-dynamic approach.   Psychological Review , 94: 84–106.

Saltzman, E., Lofqvist, A., and Mitra, S. ( 2000 ) ‘Clocks’ and ‘glue’: global timing and intergestural cohesion. In M. B. Broe and J. Pierrehumbert (eds), Papers in Laboratory Phonology , vol. 5: Acquisition and the Lexicon , pp. 88–101. Cambridge University Press, Cambridge.

Saltzman, E., and Munhall, K. ( 1989 ) A dynamical approach to gestural patterning in speech production.   Ecological Psychology , 1: 333–82.

Savariaux, C., Perrier, P., and Orliaguet, J. P. ( 1995 ) Compensation strategies for the perturbation of the rounded vowel [u] using a lip tube: a study of the control space in speech production.   Journal of the Acoustical Society of America , 98: 2428–42.

Sevald, C. A., Dell, G., and Cole, J. ( 1995 ) Syllable structure in speech production: are syllables chunks or schemas?   Journal of Memory and Language , 34: 807–20.

Shattuck-Hufnagel, S., and Turk, A. E. ( 1996 ) A prosody tutorial for investigators of auditory sentence processing.   Journal of Psycholinguistic Research , 25: 193–247.

Sproat, R., and Fujimura, O. ( 1993 ) Allophonic variation in English /l/ and its implications for phonetic implementation.   Journal of Phonetics , 21: 291–311.

Stevens, K., and Blumstein, S. ( 1981 ) The search for invariant correlates of phonetic features. In P Eimas and J Miller (eds), Perspectives on the Study of Speech , pp. 1–38. Erlbaum, Hillsdale, NJ.

Studdert-Kennedy, M. ( 1998 ) The particulate origins of language generativity: from syllable to gesture. In J. Hurford, M. Studdert-Kennedy, and C. Knight (eds), Approaches to the Evolution of Language , pp. 202–21. Cambridge University Press, Cambridge.

Tremblay, S., Shiller, D., and Ostry, D. ( 2003 ) Somatosensory basis of speech production.   Nature , 423: 866–9.

Turvey, M. T. ( 1977 ) Preliminaries to a theory of action with reference to vision. In R. Shaw and J. Bransford (eds), Perceiving, Acting and Knowing: Toward an Ecological Psychology , pp. 211–66. Erlbaum, Hillsdale, NJ.

Turvey, M. T. ( 1990 ) Coordination.   American Psychologist , 45: 938–53.

Yamanishi, J., Kawato, M., and Suzuki, R. ( 1980 ) Two coupled oscillators as a model for the coordinated finger tapping by both hands.   Biological Cybernetics , 37: 219–25.

By this definition, I intend to contrast the more comprehensive theories of language production with theories of speech production. A theory of language production (e.g. Levelt et al., 1999 ) offers an account of planning for and implementation of meaningful utterances. A theory of speech production concerns itself only with planning for and implementation of language forms.

Slashes (e.g. /p/) indicate phonological segments; square brackets (e.g. [p]) signify phonetic segments. The difference is one of abstractness. For example, the phonological segment /p/ is said to occur in two varieties: the aspirated phonetic segment [pʰ] and the unaspirated [p].

An alternative account, which does not implicate rule use, is that pail stin reflects a single feature or gesture error. From a featural standpoint, place of articulation features of /p/ and /t/ exchange, stranding the aspiration feature.

See articles in the 1992 special issue of the journal Phonetica devoted to a critical analysis of articulatory phonology.


6 Mechanism of Speech Production

Dr. Namrata Rathore Mahanta

Learning outcome:

This module shall introduce the learner to the various components and processes that are at work in the production of human speech. The learner will also be introduced to the application of speech mechanism in other domains such as medical sciences and technology. After reading the module the learner will be able to distinguish speech from other forms of human communication and will be able to describe in detail the stages and processes involved in the production of human speech.

Introduction: What is speech and why is it an academic discipline?

Speech is such a common aspect of human existence that its complexity is often overlooked in day-to-day life. Speech is the result of many interlinked, intricate processes that need to be performed with precision. Speech production is an area of interest not only for language learners, language teachers, and linguists but also for people working in varied domains of knowledge. The term ‘speech’ refers to the human ability to articulate thoughts in an audible form. It also refers to the formal, one-sided discourse delivered by an individual on a particular topic to be heard by an audience.

The history of human existence and enterprise reveals that ‘speech’ was an empowering act. Heroes and heroines in history used ‘speech’ in clever ways to negotiate structures of power and overcome oppression. At times when the written word was an attribute of the elite and noble classes, ‘speech’ was the vehicle which carried popular sentiments. In adverse times ‘speech’ was forbidden or regulated by authority. At such times poets and ordinary people sang their ‘speech’ in double-meaning poems in defiance of authority. In present times the debate on an individual’s ‘right to free speech’ is often raised in varied contexts. As an academic discipline, Speech Communication gained prominence in the 20th century and is taught in university departments across the globe. Departments of Speech Communication offer courses that engage with the speech interactions between people in the public and private domains, in live as well as technologically mediated situations.

However, the student who pursues a study of the ‘mechanism of speech production’ needs to focus primarily on the process of speech production. Therefore, the human brain and the physiological processes become the principal areas of investigation and research. Hence in this module ‘speech’ is delimited to the physiological processes which govern the production of different sounds. These involve the brain, the respiratory organs, and the organs in our neck and mouth. A thorough understanding of the mechanism of speech production has helped correct speech disorders, simulate speech through machines, and develop devices for people with speech-related needs. Needless to say, teachers of languages use this knowledge in the classroom in a variety of ways.

Speech and Language

In everyday parlance the terms ‘speech’ and ‘language’ are often used as synonyms. However, in academic use these two terms refer to two very different things. Speech is the ‘spoken’ and ‘heard’ form of language. Language is a complex system of reception and expression of ideas and thoughts in verbal, non-verbal and written forms. Language can exist without speech, but speech is meaningless without language. Language can exist in the mind in the form of a thought, or on paper/screen in its orthographic form; it can exist in a gesture or action in its non-verbal form; it can also exist in a certain way of looking, winking or nodding. Thus speech is only a part of the vast entity of language. It is the verbal form of language.

Over the years linguists have engaged themselves with the way in which speech and language exist within human beings. They have examined the processes by which language is acquired and learnt. The role of the individual human being, the role of the society or community, and the genetic or physiological attributes of human beings have all been investigated from time to time.

Ferdinand de Saussure, a Swiss linguist who laid the foundation for Structuralism, declared that language is imbibed by the individual within a society or community. His lectures delivered at the University of Geneva during 1906–1911 were later collected and published in 1916 as Cours de linguistique générale . Saussure studied the relationship between speech and the evolution of language. He described language as a system of signs which exists in a pattern or structure. Saussure described language using terms such as ‘ langue ’, ‘ parole ’ and ‘ langage ’. These terms are complex and cannot be directly translated, and it would be misleading to equate Saussure’s ‘ langage ’ with ‘language’. At an introductory stage, however, ‘ langage ’ can be described as the general human capacity for language, ‘ langue ’ as the shared system of signs of a particular speech community, and ‘ parole ’ as the individual, concrete act of speaking.

American linguist Avram Noam Chomsky argued that the human mind contains the innate source of language and declared that humans are born with a mind that is pre-programmed for language, i.e., humans are biologically programmed to use languages. Chomsky named this inherent human trait ‘Innate Language’. He introduced two other significant terms: ‘Competence’ and ‘Performance’.

‘Competence’ was described as the innate knowledge of language and ‘Performance’ as its actual use. Thus the concepts of ‘Innate Language’, ‘Language Competence’ and ‘Language Performance’ emerged, and language came to be accepted as a cognitive attribute of humans, while speech came to be accepted as one of the many forms of language communication.

In present times speech and language are seen as interdependent and complementary attributes of humans. Current research focuses on finding the inner connections between speech and language. Consequently, the term ‘Speech and Language’ is used in most application-based areas.

From Theory to Application

It is interesting to note that knowledge of the intricacies of the speech mechanism is used in many real-life applications apart from Language and Linguistics. A vibrant area in Speech and Language application is ‘Speech and Language Processing’. It is used in Computational Linguistics, Natural Language Processing, Speech Therapy, Speech Recognition and many more areas. It is used to simulate speech in robots. Vocoders and text-to-speech (TTS) systems also make use of the speech mechanism. In Medical Sciences it is used to design therapy modules for different speech and language disorders and to develop advanced devices for persons with auditory needs. In Criminology it is used to recognize the speech patterns of individuals and to identify manipulations in recorded speech. Speech processing is also used in Music and Telecommunication in a major way.

What is Speech Mechanism?

Speech mechanism is a function which starts in the brain and moves through the biological processes of respiration, phonation and articulation to produce sounds. These sounds are received and perceived through biological and neurological processes. The lungs are the primary organs involved in the respiratory stage, the larynx is involved in the phonation stage, and the organs in the mouth are involved in the articulatory stage.

The brain plays a very important role in speech. Research on the human brain has led to the identification of certain areas that are classically associated with speech. In 1861, French physician Pierre Paul Broca discovered that a particular portion of the frontal lobe governed speech production. This area has been named after him and is known as Broca’s area. Injury to this area is known to cause speech loss. In 1874, German neuropsychiatrist Carl Wernicke discovered that a particular area in the brain was responsible for speech comprehension and the remembrance of words and images. At a time when the brain was considered to be a single organ, Wernicke demonstrated that it did not function as a single organ but as a multi-pronged organ with distinctive functions interconnected by neural networks. His most important contribution was the discovery that brain function was dependent on these neural networks. Today it is widely accepted that the areas of the brain associated with speech are linked to each other through a complex network of neurons, and that this network is mostly established after birth, through life experience, over a period of time.

It has been observed that the chronology and patterning of these neural networks differ from individual to individual and also within the same individual with the passage of time or life experience. The formation of new networks outside the classically identified areas of speech has also been observed in people who have suffered brain injury at birth or through life experience. Although extensive efforts are being made to replicate or simulate the plasticity and creativity of the human brain, complete replication has not been achieved. Consequently, complete simulation of the human speech mechanism remains elusive.

 The organs of speech

In order to understand the speech mechanism one needs to identify the organs used to produce speech. It is interesting to note that each of these organs has a unique life-function to perform. Their presence in the human body is not for speech production but for other primary bodily functions. In addition to their primary physiological functions, these organs participate in the production of speech. Hence speech is said to be an ‘overlaid’ function of these organs. The organs of speech can be classified according to their position and function.

  • The respiratory organs consist of the lungs and the trachea. The lungs compress air and push it up the trachea.
  • The phonatory organ is the larynx. The larynx contains two membrane-like structures called vocal cords or vocal folds. The vocal folds can come together or move apart.
  • The articulatory organs consist of the lips, the teeth, the roof of the mouth, the tongue, and the oral and nasal cavities.

The respiratory process involves the movement of air. Through muscle action the air is compressed and pushed up from the lungs to pass through the respiratory tract: the trachea, the larynx, the pharynx, and the oral cavity, the nasal cavity, or both. While breathing in, the rib cage is expanded, the thoracic capacity is enlarged and lung volume is increased. Consequently, the air pressure in the lungs drops and air is drawn into the lungs. While breathing out, the rib cage is contracted, the thoracic capacity is diminished and lung volume is decreased. Consequently, the air pressure in the lungs exceeds the outside pressure and air is released from the lungs to equalize it. Robert Mannel explains the process through flowcharts and diagrammatic representations on his website (see References).
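
The pressure logic described above can be illustrated with a small, self-contained sketch. The volume and pressure values are invented for illustration; the point is only the direction of airflow that follows from the pressure difference.

```python
# Toy illustration of the respiratory pressure logic described above:
# enlarging the lung volume lowers lung pressure below atmospheric pressure,
# so air flows in; shrinking it raises pressure, so air flows out.

ATMOSPHERIC = 101.325  # kPa, approximate sea-level pressure

def lung_pressure(rest_pressure, rest_volume, new_volume):
    # Pressure times volume stays roughly constant for a fixed amount of gas.
    return rest_pressure * rest_volume / new_volume

def airflow(new_volume, rest_volume=3.0, rest_pressure=ATMOSPHERIC):
    p = lung_pressure(rest_pressure, rest_volume, new_volume)
    if p < ATMOSPHERIC:
        return f"lung pressure {p:.1f} kPa < atmospheric: air flows IN (inhalation)"
    if p > ATMOSPHERIC:
        return f"lung pressure {p:.1f} kPa > atmospheric: air flows OUT (exhalation)"
    return "pressures equal: no net airflow"

print(airflow(new_volume=3.5))  # rib cage expanded  -> inhalation
print(airflow(new_volume=2.6))  # rib cage contracted -> exhalation (the speech airstream)
```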

Once the air enters the pharynx, it can be expelled either through the oral passage, through the nasal passage, or through both, depending upon the position of the soft, movable part of the roof of the mouth known as the soft palate or velum.

Egressive and Ingressive Airstream: If the direction of the airstream is inward, it is termed an ‘ingressive airstream’. If the direction of the airstream is outward, it is an ‘egressive airstream’. Most languages of the world make use of a pulmonic egressive airstream. Ingressive airstreams are associated with the Scandinavian languages of Northern Europe. However, no language can claim to use exclusively ingressive or egressive airstreams. While most languages of the world use predominantly egressive airstreams, they are also known to use ingressive airstreams in different situations. For an extended list of uses of the ingressive mechanism you may visit Robert Eklund’s Ingressive Phonation and Speech page at www.ingressive.info .

An egressive process involves the outward expulsion of air; an ingressive process involves the inward intake of air. Egressive and ingressive airstreams can be pulmonic (involving the lungs) or non-pulmonic (involving other organs).

Non-Pulmonic Airstreams: There are many languages which make use of non-pulmonic airstreams. In these cases the airstream is initiated or manipulated in the pharyngeal cavity, the vocal tract, or the oral cavity rather than by the lungs alone. Three major non-pulmonic airstreams are:

In Ejectives, the air is trapped and compressed in the pharyngeal cavity by an obstruction in the mouth with simultaneous closure of the glottis. The larynx makes an upward movement which coincides with the removal of the obstruction causing the air to be released.

In Implosives, the air is trapped and compressed in the pharyngeal cavity by an obstruction in the mouth with simultaneous closure of the glottis. The larynx makes a downward movement which coincides with the removal of the obstruction causing the air to be sucked into the vocal tract.

In Clicks, a body of air is trapped in the oral cavity between a closure made by the back of the tongue against the velum and a second closure further forward in the mouth. The trapped air is rarefied by tongue movement, and the sudden release of the forward closure causes air to be sucked in, making a clicking sound. For a list of languages which use these airstream mechanisms you may visit https://community.dur.ac.uk/daniel.newman/phon10.pdf
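
The airstream mechanisms just described can be summarised compactly as combinations of an initiator and a direction. The sketch below is only a restatement of the prose as a lookup table; the wording of each entry is simplified.

```python
# Summary of the airstream mechanisms described above as a small lookup:
# (initiator, direction) -> mechanism and typical sound type. Illustrative only.

AIRSTREAMS = {
    ("pulmonic", "egressive"):   "lung air pushed out -- ordinary plosives, fricatives, vowels",
    ("glottalic", "egressive"):  "closed glottis raised, pharyngeal air compressed -- ejectives",
    ("glottalic", "ingressive"): "closed glottis lowered, air drawn into the vocal tract -- implosives",
    ("velaric", "ingressive"):   "air rarefied in the oral cavity, sucked in on release -- clicks",
}

def describe(initiator: str, direction: str) -> str:
    return AIRSTREAMS.get((initiator, direction),
                          "combination not among the mechanisms discussed here")

for (initiator, direction) in AIRSTREAMS:
    print(f"{initiator:9s} {direction:10s} -> {describe(initiator, direction)}")
```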

While the process of phonation occurs before the airstream enters the oral or nasal cavity, the quality of speech is also determined by the state of the pharynx. Any irregularity in the pharynx leads to modification in speech quality.

The Phonatory Process: Inside the larynx are two membrane-like structures or folds called the vocal cords. The space between them is called the glottis. The vocal folds can be moved varying distances apart. Robert Mannel describes five main positions of the vocal folds:

Voiceless: In this position the vocal folds are drawn far apart so that the airstream passes without any interference.

Breathy: The vocal folds are drawn loosely apart. The air passes, making a whisper-like sound.

Voiced: The vocal folds are drawn close together and are stretched. The air passes, making a vibrating sound.

Creaky: The vocal folds are drawn close and vibrate with maximum tension. Air passes, making a rough, creaky sound. This sound is called ‘vocal fry’ and its use is on the rise amongst young urban women. However, its sustained and habitual use can be harmful.

For more details on laryngeal positions you may visit Robert Mannel’s page- http://clas.mq.edu.au/speech/phonetics/phonetics/airstream_laryngeal/laryngeal.html

You may see a small clip on vocal fry by visiting the link: http://www.upworthy.com/what-is-vocal-fry-and-why-doesnt-anyone-care-when-men-talk-like-that

The Mouth: The mouth is the major site of the articulatory processes of speech production. It contains active articulators that can move and take different positions, such as the tongue, the lips, and the soft palate. There are also passive articulators that cannot move but combine with the active articulators to produce speech. The teeth, the teeth ridge (or alveolar ridge), and the hard palate are the passive articulators.

Amongst the active articulators, the tongue can take the maximum number of positions and combinations. Being an active muscle, its parts can be lowered or raised. The tongue is a major articulator in the production of vowel sounds. The position of the tongue determines the acoustics of the oral cavity during the articulation of vowel sounds. For the purpose of identifying and describing articulatory processes, the tongue has been classified on two parameters.

a. The part of the tongue that is raised during the articulation process. Three main parts of the tongue are identified: Front, Center and Back.

b. The height to which the tongue is raised during the articulation process (a small sketch combining both parameters is given after the list of positions below). There are four markers to classify this height:

  • Maximum height
  • Two-thirds of maximum height
  • One-third of maximum height
  • Minimum height

For the purpose of description the positions of the tongue are diagrammatically represented through the tongue quadrilateral.

  • Close: The maximum height is called the high or close position, because the gap between the tongue and the roof of the mouth is nearly closed.
  • High-Mid or Half-Close: Two-thirds of the maximum height is called the high-mid or half-close position.
  • Low-Mid or Half-Open: One-third of the maximum height is called the low-mid or half-open position.
  • Low or Open: The minimum height is called the low or open position. This permits the maximum gap between the tongue and the roof of the mouth.
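
As noted above, the two parameters can be combined to describe a vowel articulation. The sketch below is purely illustrative: the vowel symbols are rough, cardinal-like examples and are not drawn from this module.

```python
# Combining the two tongue parameters above: the part of the tongue raised
# (front / back) and the height to which it is raised. The example vowel
# symbols are rough illustrations only.

EXAMPLES = {
    ("close", "front"): "i",      ("close", "back"): "u",
    ("half-close", "front"): "e", ("half-close", "back"): "o",
    ("half-open", "front"): "ɛ",  ("half-open", "back"): "ɔ",
    ("open", "front"): "a",       ("open", "back"): "ɑ",
}

def describe_vowel(height: str, part: str) -> str:
    symbol = EXAMPLES.get((height, part), "-")
    return f"{height} {part} vowel (e.g. [{symbol}])"

print(describe_vowel("close", "front"))     # close front vowel (e.g. [i])
print(describe_vowel("half-open", "back"))  # half-open back vowel (e.g. [ɔ])
```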

The tongue also acts as an active articulator against the roof of the mouth to create obstructions in the oral cavity at a number of prominent positions.

Lips: The lips are two strong muscles. In speech production the movement of the upper lip is less than that of the lower lip. The lips can take different shapes: rounded, neutral or spread.

Teeth: The upper teeth are passive articulators.

The roof of the mouth:

The roof of the mouth has a hard portion and a soft portion which are fused seamlessly. The hard portion comprises the alveolar ridge and the hard palate; the soft portion comprises the velum and the uvula. The anterior part of the roof of the mouth is hard and immovable. It begins with the irregular surface called the alveolar ridge, which lies behind the upper teeth. The alveolar ridge is followed by the hard palate, which extends back over the centre of the tongue. The posterior part of the roof of the mouth is soft and movable. It lies after the hard palate and extends up to the small structure called the uvula.

The soft palate: It is movable and can take different positions during speech production.

  • Raised position: In the raised position the soft palate rests against the back wall of the pharynx. The nasal passage is fully blocked and air passes through the mouth.
  • Lowered position: In the lowered position the soft palate rests against the back part of the tongue in such a way that the oral passage is fully blocked and air passes through the nasal passage.
  • Partially lowered position: In the partially lowered position, the oral as well as the nasal passage is partially open. Pulmonic air passes through the mouth as well as the nose to create ‘nasalized’ sounds.

The hard palate lies between the alveolar ridge and the velum. It is a hard and immovable part of the roof of the mouth. It lies opposite the centre of the tongue and acts as a passive articulator against the tongue to produce sounds. Sounds produced with the involvement of the hard palate are called palatal sounds.

The alveolar ridge is the wavy part that lies just behind the upper teeth, opposite the front of the tongue. It acts as a passive articulator against the tongue to produce sounds. Sounds produced with the involvement of the alveolar ridge are called alveolar sounds. Some sounds are created with the involvement of the posterior region of the alveolar ridge; these are called post-alveolar sounds. Sometimes sounds are created with the involvement of the hindmost part of the alveolar ridge and the foremost part of the hard palate; such sounds are called palato-alveolar sounds.

Air stream mechanisms involved in speech production

The flow of air, or the airstream, is manipulated in a number of ways during the production of speech. This is done through the movement of the active articulators in the oral cavity or in the larynx. In this process the airstream plays a major role in the production of speech sounds, and it behaves according to air pressure. If the air pressure inside the mouth is greater than the pressure in the atmosphere, air will escape outward to create a balance. If the air pressure inside the mouth is lower than the pressure outside because of expansion of the oral or pharyngeal cavity, air will move inward into the mouth to create a balance. On the basis of the nature of the obstruction and the manner of its release, the following classification has been made:

Plosive: In this process there is full closure of the passage followed by a sudden release of air. The air is compressed, and when the articulators are suddenly separated the air in the mouth escapes with an explosive sound.

Affricate: In this process there is full closure of the passage followed by slow release of air.

Fricative: In this process the closure is not complete. The articulators come together to create a narrow passage. Air is compressed to pass through this narrow stricture so that it escapes with audible friction.

Nasal: The soft palate is lowered so that the oral passage is closed. Air passes through the nasal passage, creating nasal sounds. If the soft palate is only partially lowered, air passes simultaneously through the oral and nasal passages, creating the ‘nasalized’ versions of sounds.

Lateral: The obstruction in the mouth is such that the air is free to pass on both sides of the obstruction.

Glide: The position of the articulators changes during the articulation process. It begins with the articulators taking one position and then moving smoothly to another position.
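
The classification just described can be restated as a small decision sketch: given how the airstream is obstructed and how the obstruction is released, name the manner class. The feature labels used here are invented shorthand for the prose above, and the sketch covers only the categories discussed in this module.

```python
# A toy restatement of the obstruction/release classification described above.
# The `closure` and `release` labels are simplified, illustrative shorthand.

def manner(closure: str, release: str) -> str:
    if closure == "complete" and release == "sudden":
        return "plosive"
    if closure == "complete" and release == "slow":
        return "affricate"
    if closure == "narrow":
        return "fricative"      # air forced through a narrow stricture with friction
    if closure == "oral-blocked" and release == "nasal":
        return "nasal"
    if closure == "central" and release == "both-sides":
        return "lateral"
    if closure == "moving":
        return "glide"          # articulators move smoothly from one position to another
    return "not covered here"

print(manner("complete", "sudden"))    # plosive
print(manner("narrow", "continuous"))  # fricative
print(manner("oral-blocked", "nasal")) # nasal
```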

Speech mechanism is a complex process unique to humans. It involves the brain, the neural network, the respiratory organs, the larynx, the oral cavity, the nasal cavity and the organs in the mouth. Through speech production humans engage in verbal communication. Since the earliest times efforts have been made to comprehend the mechanism of speech. In 1791 Wolfgang von Kempelen built the first speech synthesizer. In the first few decades of the twentieth century scientific inventions such as the x-ray, the spectrograph, and voice recorders provided new tools for the study of the speech mechanism. In the later part of the twentieth century electronic innovations were followed by the digital revolution in technology. These developments have yielded new insights and given new direction to our knowledge of the human speech mechanism. In the digital world an understanding of the speech mechanism has led to new applications in speech synthesis. Speech mechanism studies in present times are divided into areas of super-specialization, each focusing intensively on a particular attribute of the speech mechanism.
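
One way to see how the phonatory and articulatory stages translate into digital speech synthesis is the textbook source-filter idea: a glottal pulse source (phonation) is passed through resonances that stand in for the vocal-tract shape (articulation). The sketch below is a generic illustration of that idea, not a description of any particular synthesizer; the formant frequencies for an [a]-like vowel are approximate values.

```python
# Minimal source-filter sketch: an impulse train at the fundamental frequency
# (a crude stand-in for vocal-fold pulses) is passed through two-pole
# resonators at approximate formant frequencies of an [a]-like vowel.

import numpy as np

def glottal_source(f0, duration, sr=16000):
    # Impulse train at the fundamental frequency.
    n = int(duration * sr)
    source = np.zeros(n)
    source[::int(sr / f0)] = 1.0
    return source

def resonator(signal, freq, bandwidth, sr=16000):
    # Two-pole digital resonance (a single formant), computed sample by sample.
    r = np.exp(-np.pi * bandwidth / sr)
    theta = 2 * np.pi * freq / sr
    a1, a2 = 2 * r * np.cos(theta), -r * r
    out = np.zeros_like(signal)
    for i in range(len(signal)):
        out[i] = signal[i]
        if i >= 1:
            out[i] += a1 * out[i - 1]
        if i >= 2:
            out[i] += a2 * out[i - 2]
    return out

sr = 16000
vowel = glottal_source(f0=120, duration=0.3, sr=sr)          # voiced source, ~120 Hz
for formant, bw in [(730, 90), (1090, 110), (2440, 160)]:     # rough formants of [a]
    vowel = resonator(vowel, formant, bw, sr)
vowel /= np.max(np.abs(vowel))                                 # normalise amplitude
print(f"{len(vowel)} samples of a crude [a]-like vowel synthesized.")
```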

References:

  • Chomsky, Noam. Aspects of the Theory of Syntax. 1965. Cambridge, MA: MIT Press, 2015.
  • Chomsky, Noam. Language and Mind. 3rd ed. New York: Cambridge University Press, 2006.
  • Eklund, Robert. www.ingressive.info. Web. Accessed on 5 March 2017.
  • Mannel, Robert. http://clas.mq.edu.au/speech/phonetics/phonetics/introduction/respiration.html. Web. Accessed on 5 March 2017.
  • Mannel, Robert. http://clas.mq.edu.au/speech/phonetics/phonetics/introduction/vocaltract_diagram.html. Web. Accessed on 5 March 2017.
  • Mannel, Robert. http://clas.mq.edu.au/speech/phonetics/phonetics/airstream_laryngeal/laryngeal.html. Web. Accessed on 5 March 2017.
  • Newman, Daniel. https://community.dur.ac.uk/daniel.newman/phon10.pdf. Web. Accessed on 5 March 2017.
  • Saussure, Ferdinand. Course in General Linguistics. Translated by Wade Baskin. Edited by Perry Meisel and Haun Saussy. New York: Columbia University Press, 2011.
  • Wilson, Robert Andrew and Frank C. Keil, eds. The MIT Encyclopedia of Cognitive Sciences. 1999. Cambridge, MA: MIT Press, 2001.

Nature and Perception of Speech Sounds


Jean-Claude Junqua (Speech Technology Laboratory, USA) and Jean-Paul Haton (CRIN - INRIA, France)

Part of the book series: The Kluwer International Series in Engineering and Computer Science (SECS, volume 341)


This chapter reviews the fundamentals of speech production, acoustics and phonetics of speech sounds as well as their time-frequency representation. Then, the basic structure of the auditory system and the main mechanisms influencing speech perception are briefly described. Throughout this chapter, we also emphasize the influence of noise on speech production and perception. By introducing basic characteristics of speech sounds and how they are produced and perceived, we intend to provide the essential knowledge needed to understand the following chapters.
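
As a pointer to the time-frequency representation mentioned in the abstract, the sketch below computes a basic spectrogram with a short-time Fourier analysis. It is a generic illustration, not the analysis method of the chapter itself, and the test signal is synthetic, standing in for a recorded utterance.

```python
# Minimal spectrogram sketch: slice a 1-D speech waveform into overlapping
# windows, apply a Hann window, and take the FFT magnitude of each frame,
# giving a (time frames x frequency bins) time-frequency representation.

import numpy as np

def spectrogram(signal, sr, frame_len=0.025, hop=0.010):
    n_frame = int(frame_len * sr)
    n_hop = int(hop * sr)
    window = np.hanning(n_frame)
    frames = []
    for start in range(0, len(signal) - n_frame + 1, n_hop):
        frame = signal[start:start + n_frame] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)  # shape: (num_frames, n_frame // 2 + 1)

sr = 16000
t = np.arange(0, 0.5, 1 / sr)
test = np.sin(2 * np.pi * (300 + 400 * t) * t)  # synthetic rising tone as a stand-in
spec = spectrogram(test, sr)
print(f"spectrogram shape: {spec.shape}  (frames x frequency bins)")
```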




About this chapter

Junqua, JC., Haton, JP. (1996). Nature and Perception of Speech Sounds. In: Robustness in Automatic Speech Recognition. The Kluwer International Series in Engineering and Computer Science, vol 341. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-1297-0_1

Psycholinguistics/Development of Speech Production


Introduction

Speech production is an important part of the way we communicate. We indicate intonation through stress and pitch while communicating our thoughts, ideas, requests or demands, and while maintaining grammatically correct sentences. However, we rarely consider how this ability develops. We know infants often begin producing one-word utterances, such as "mama," eventually move to two-word utterances, such as "gimme toy," and finally sound like an adult. However, the process itself involves development not only of the vocal sounds (phonology), but also of semantics (the meaning of words), morphology and syntax (rules and structure). How do children learn this complex ability? Considering that an infant goes from an inability to speak to two-word utterances within 2 years, the accelerated development pattern is incredible and deserves some attention. When we ponder children's speech production development more closely, we begin to ask more questions. How does a child who says "tree" for "three" eventually learn to correct him/herself? How does a child know "nana" (banana) is the yellow, boat-shaped fruit he/she enjoys eating? Why does a child call all four-legged animals "horsie"? Why does a child say "I goed to the kitchen"? What causes a child to learn words such as "doggie" before "hand"? This chapter will address these questions and focus on the four areas of speech development mentioned: phonology, semantics, morphology and syntax.

Prelinguistic Speech Development

Throughout infancy, vocalizations develop from automatic, reflexive vocalizations with no linguistic meaning to articulated words with meaning and intonation. In this section, we will examine the various stages an infant goes through while developing speech. In general, researchers seem to agree that as infants develop they increase their speech-like vocalizations and decrease their non-speech vocalizations (Nathani, Ertmer, & Stark) [1] . Many researchers (Oller [2] ; Stark, as cited in Nathani, Ertmer, & Stark) [1] have documented this development and suggest growth through the following five stages: reflexive vocalizations, cooing and laughing, vocal play (the expansion stage), canonical babbling and, finally, the integration stage.

Stage 1: Reflexive Vocalization


As newborns, infants make noises in response to their environment and current needs. These reflexive vocalizations may consist of crying or vegetative sounds such as grunting, burping, sneezing, and coughing (Oller) [2] . Although it is often thought that infants of this age do not show evidence of linguistic abilities, a recent study has found that newborns’ cries follow the melody of their surrounding language input (Mampe, Friederici, Christophe, & Wermke) [3] . They discovered that the French newborns’ pattern was a rising contour, where the melody of the cry rose slowly and then quickly decreased. In comparison, the German newborns’ cry pattern rose quickly and slowly decreased. These patterns matched the intonation patterns that are found in each of the respective spoken languages. Their findings suggest that perhaps infants’ vocalizations are not exclusively reflexive and may contain patterns of their native language.

Stage 2: Gooing, Cooing and Laughing

Between 2 and 4 months, infants begin to produce “cooing” and “gooing” to demonstrate their comfort states. These sounds may often take the form of vowel-like sounds such as “aah” or “oooh.” This stage is often associated with a happy infant as laughing and giggling begin and crying is reduced. Infants will also engage in more face-to-face interactions with their caregivers, smiling and attempting to make eye contact (Oller) [2] .

Stage 3: Vocal Play

From 4 to 6 months, infants will attempt to vary the sounds they can produce using their developing vocal apparatus. They show a desire to explore and develop new sounds, which may include yells, squeals, growls and whispers (Oller) [2] . Face-to-face interactions are still important at this stage as they promote the development of conversational abilities. Beebe, Alson, Jaffe et al. [4] found that even at this young age, infants’ vocal expressions show a “ dialogic structure ” - meaning that, during interactions with caregivers, infants were able to take turns vocalizing.

Stage 4: Canonical babbling

After 6 months, infants begin to make and combine sounds that are found in their native language, sometimes known as “well-formed syllables,” which are often replicated in their first words (Oller) [2] . During this stage, infants combine consonants and vowels and replicate them over and over - these are thus called reduplicated babble . For example, an infant may produce ‘ga-ga’ over and over. Eventually, infants will begin to string together multiple varied syllables, such as ‘gabamaga’, called variegated babble . Other times, infants will move right into the variegated babbling stage without evidence of reduplicated babbles (Oller) [2] . Early in this stage, infants do not produce these sounds for communicative purposes. As they move closer to pronouncing their first words, they may begin to use sounds for rudimentary communicative purposes (Oller) [2] .
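
The distinction between reduplicated and variegated babble can be made concrete with a toy classifier. The syllable splitting below is deliberately naive (fixed two-character consonant-vowel syllables) and the function is purely illustrative, using the example forms from the paragraph above.

```python
# Toy classifier for the babble types defined above: a sequence of CV
# syllables is "reduplicated" if the same syllable repeats, and "variegated"
# if the syllables vary.

def babble_type(utterance: str) -> str:
    syllables = [utterance[i:i + 2] for i in range(0, len(utterance), 2)]
    if len(syllables) < 2:
        return "too short to classify"
    if all(s == syllables[0] for s in syllables):
        return "reduplicated babble"
    return "variegated babble"

print(babble_type("gaga"))      # reduplicated babble
print(babble_type("gabamaga"))  # variegated babble
```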

Stage 5: Integration


In the final stage of prelinguistic speech, 10-month-old infants use intonation and stress patterns in their babbling syllables, imitating adult-like speech. This stage is sometimes known as conversational babble or gibberish because infants may also use gestures and eye movements which resemble conversations (Oller) [2] . Interestingly, they also seem to show acoustic differences in their vocalizations depending on the purpose of their communication. Papaeliou and Trevarthen [5] found that when infants were communicating for social purposes they used a higher pitch and were more expressive in their vocalizations and gestures than when exploring and investigating their surroundings. The transition from gibberish to real words is not obvious (Oller) [2] , as this stage often overlaps with the acquisition of an infant’s first words. These words begin when an infant understands that the sounds produced are associated with an object. During this stage, infants develop vocal motor schemes , the consistent production of certain consonants in a certain period of time. Keren-Portnoy and Marjorano’s [6] study showed that these vocal motor schemes play a significant part in the development of first words, as children who mastered them earlier produced words earlier. These consistent consonants were used in babble and vocal motor schemes, and would also be present in a child’s first words. Evidence that a child may understand the connection between context and sounds is shown when they make consistent sound patterns in certain contexts (Oller) [2] . For example, a child may begin to call his favorite toy “mub.” These phonetically consistent sound patterns, known as protowords or quasi-words , do not always reflect real words, but they are an important step towards achieving adult-like speech (Otomo [7] ; Oller [2] ). Infants may also use their proto-words to represent an entire sentence (Vetter) [8] . For example, the child may say “mub” but may be expressing “I want my toy”, “Give me back my toy”, “Where is my toy?”, etc.

Phonological Development

When a child explicitly pronounces their first word, they have understood the association between sounds and their meaning. Yet their pronunciation may be poor, they produce phonetic errors, and they have yet to produce all the sound combinations in their language. Researchers have come up with many theories about the patterns and rules children and infants use while developing their language. In this section, we will examine some frequent error patterns and basic rules children use to articulate words. We will also look at how phonological development can be enhanced.

Patterns of Speech

Depending on their personalities and individual development, infants develop their speech production slightly differently. Some children, productive learners , attempt any word regardless of proper pronunciation (Rabagaliati, Marcus, & Pylkkänen) [9] . Conservative learners (Rabagaliati, Marcus, & Pylkkänen) [9] are hesitant until they are confident in their pronunciation. Other differences include a preference to use nouns and name things versus use of language in a more social context (Bates et al., as cited in Smits-Bandstra) [10] . Although infants vary in their first words and the development of their phonology, researchers have extracted many similar patterns by examining the sound patterns found in their early language. For example, McIntosh and Dodd [11] examined these patterns in 2-year-olds and found that they were able to produce multiple phonemes but were lacking [ ʃ , θ , tʃ , dʒ , r ]. They were also able to produce complex syllables. Vowel errors also occurred, although consonant errors are much more prevalent. The development of phonemes continues throughout childhood and many are not completely developed until age 8 (Vetter) [8] .

Phonological Errors

As a child pronounces new words and phonemes, he/she may produce various errors that follow patterns. However, all errors will reduce with age (McIntosh & Dodd) [11] . Although each child does not necessarily produce the same errors, errors can typically be categorized into various groups. For example, there are multiple kinds of consonant errors. A cluster reduction involves reducing multiple consonants in a row (e.g. skate). Most often, a child will skip the first consonant (thus skate becomes kate), or they may leave out the second stop consonant ( consonant deletion - Wyllie-Smith, McLeod, & Ball) [12] (thus skate becomes sate). This type of error has been found by McIntosh and Dodd [11] . For words that have multiple syllables, a child may skip the unstressed syllable at the beginning of the word (e.g. potato becomes tato) or in the middle of a word (e.g. telephone becomes tephone) (Ganger & Brent) [13] . This omission may simply be due to the properties of unstressed syllables, as they are more difficult to perceive, and thus a child may simply lack attention to them. As a child grows more aware of the unstressed syllable, he/she may choose to insert a dummy syllable in place of the unstressed syllable to attempt to lengthen the utterance (Aoyama, Peters, & Winchester [14] ). For example, a child may say [ə hat] (‘ə hot’) (Clark, as cited in Smits-Bandstra) [10] . Replacement shows that the child understands that there should be some sound there, but the child has inserted the wrong one. Another common phonological error pattern is assimilation . A child may pronounce a word such that a phoneme within that word sounds more like another phoneme near it (McIntosh & Dodd) [11] . For example, a child may say “gug” instead of “bug”. This kind of error may also be seen with vowels and is common in 2-year-olds, but decreases with age (Newton) [15] .
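
Two of the error patterns above can be sketched as simple transformations. The rules below operate on ordinary spellings rather than real phonological representations, and the syllable divisions are supplied by hand; they are only meant to reproduce the examples given in the paragraph ("skate" becomes "kate", "potato" becomes "tato", "telephone" becomes "tephone").

```python
# Toy versions of two child error patterns described above:
# cluster reduction and (weak) unstressed-syllable deletion.

def cluster_reduction(word: str) -> str:
    # If the word begins with two consonants, drop the first one.
    vowels = "aeiou"
    if len(word) > 1 and word[0] not in vowels and word[1] not in vowels:
        return word[1:]
    return word

def weak_syllable_deletion(syllables: list[str], weak_index: int) -> str:
    # Drop the unstressed syllable at `weak_index` (word-initial or medial).
    return "".join(s for i, s in enumerate(syllables) if i != weak_index)

print(cluster_reduction("skate"))                      # kate
print(weak_syllable_deletion(["po", "ta", "to"], 0))   # tato
print(weak_syllable_deletion(["te", "le", "phone"], 1))  # tephone
```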


Factors Affecting Development of Phonology


As adequate phonology is an important aspect of effective communication, researchers are interested in factors that can enhance it. In a study by Goldstein and Schwade [16], it was found that interactions with caregivers provided opportunities for 8- to 10-month-old infants to increase their babbling of language sounds (consonant-vowel syllables and vowels). This study also found that infants were not simply imitating their caregivers' vocalizations, as they produced various phonological patterns and longer vocalizations. Thus, it would seem that social feedback from caregivers advances infants' phonological development. On the other hand, factors such as hearing impairment can negatively affect phonological development (Nicolaidis) [17]. A Greek population with hearing impairments was compared to a control group and was found to have a different pattern of phoneme pronunciation. Their pattern displayed substitutions (e.g., [x] for target /k/), distortions (e.g., of place of articulation), and epenthesis/cluster production (e.g., [ʃtʃ] or [jθ] for /s/).

Semantic Development

When children purposefully use words, they are trying to express a desire, a refusal, a label, or social communication (Ninio & Snow) [18]. As a child begins to understand that each word has a specific purpose, they will inevitably need to learn the meanings of many words. Their vocabulary will rapidly expand as they experience various social contexts, sing songs, practice routines, and receive direct instruction at school (Smits-Bandstra, 2006) [19]. In this section, we will examine children's first words, their vocabulary spurt, and what their semantic errors are like.

First Words

Many studies have analyzed the types of words found in early speech. Overall, children's first words are usually shorter in syllabic length, easier to pronounce, and frequent in everyday speech (Storkel, 2004) [20]. Whether early vocabularies have a noun bias tends to divide researchers. Some researchers argue that children's tendency to produce names for objects, people, and animals is sufficient evidence of such a bias (Gillette et al.) [21]. However, this bias may not be entirely accurate. Recently, Tardif [22] studied first words cross-culturally among English-, Cantonese-, and Mandarin-learning 8- to 16-month-old infants and found interesting differences. Although all children used terms for people, there was much variation between languages for animals and objects. This suggests that there may be language differences in which types of words children acquire first.

Vocabulary Spurt


Around the age of 18 months, many infants undergo a vocabulary spurt, or vocabulary explosion, in which they learn new words at an increasingly rapid rate (Smits-Bandstra [10]; Mitchell & McMurray, 2009 [23]). Before the onset of this spurt, the first 50 words a child learns are usually acquired at a gradual rate (Plunkett, as cited in Smits-Bandstra) [10]. After the spurt, some studies have found upwards of 20 words learned per week (Mitchell & McMurray) [23]. There has been much speculation about the process underlying the vocabulary spurt, and there are three main theories. First, it has been suggested that the vocabulary spurt results from the naming insight (Reznick & Goldfield) [24]. The naming insight is the realization that referents can be labeled, either out of context or in place of the object. Second, this period seems to coincide with Piaget's sensorimotor stage, in which children are expanding their understanding of categorizing concepts and objects; children would then need to expand their vocabulary to label those categories (Gopnik) [25]. Finally, it has been suggested that leveraged learning may facilitate the vocabulary explosion (Mitchell & McMurray) [23]. Learning begins slowly: one word is learned, which acts as leverage for learning the next word; then those two words can each facilitate learning a new word, and so on, so that learning becomes progressively easier. It is possible, however, that not all children experience a vocabulary spurt. Some researchers have tested whether there truly is an accelerated learning process. Interestingly, Ganger and Brent [13] used a mathematical model and found that only a minority of the infants studied fit the criteria for a growth spurt. Thus the growth spurt may not be as common as once believed.
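
To see how leveraged learning could produce an apparent spurt without any sudden insight, consider the toy simulation below. It is only an illustrative sketch, not Mitchell and McMurray's published model, and every parameter value in it is invented: each word already known adds a small boost to the probability of acquiring the next one, and cumulative growth accelerates on its own.

```python
# Toy sketch of leveraged learning (illustration only, not the published model):
# every known word slightly raises the chance of learning another word, so
# weekly gains accelerate even though the learning mechanism never changes.
import random

random.seed(1)
known = 5                  # hypothetical starting vocabulary
base_rate = 0.02           # baseline chance of learning an encountered word
boost_per_word = 0.001     # "leverage" contributed by each word already known
candidates_per_week = 80   # hypothetical number of new words encountered weekly

for week in range(1, 31):
    p_learn = min(1.0, base_rate + boost_per_word * known)
    new_words = sum(random.random() < p_learn for _ in range(candidates_per_week))
    known += new_words
    if week % 5 == 0:
        print(f"week {week:2d}: +{new_words:2d} words this week, {known:4d} known in total")
```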

Semantic Errors

Even after a child has developed a large vocabulary, errors are made in selecting words to convey the desired meaning. One type of improper word selection occurs when children invent a word (called lexical innovation). This is usually because they have not yet learned a word associated with the meaning they are trying to express, or they simply cannot retrieve it. Although made-up words are not real words, it is usually easy to figure out what the child means, and they are sometimes easier to remember than the conventional words (Clark, as cited in Swan) [26]. For example, a child may say "pourer" for "cup" (Clark, as cited in Swan) [26]. These lexical innovations show that the child is able to understand derivational morphology and use it creatively and productively (Swan) [26].

Sometimes children may use a word in an inappropriate context, either extending or restricting its use. For example, a child may say "doggie" while pointing to any four-legged animal; this is known as overextension and is most common in 1- to 2-year-olds (McGregor et al. [27]; Bloomquist [28]; Bowerman [29]; Jerger & Damian [30]). Other times, children may use a word in only one specific context; this is called underextension (McGregor et al. [27]; Bloomquist [28]; Bowerman [29]; Jerger & Damian [30]). For example, they may say "baba" only for their own bottle and not another infant's bottle. Semantic errors manifest themselves in naming tasks and provide an opportunity to examine how children might organize semantic representations. In McGregor et al.'s [27] picture-naming task with 3- to 5-year-olds, errors were most often related to functional or physical properties (e.g., saying "chair" for "saddle"). Why are such errors produced? McGregor et al. [27] proposed three reasons for these errors.

Grammatical and Morphological Development

As children develop larger lexicons, they begin to combine words into sentences that become progressively longer and more complex, demonstrating their syntactic development. Longer utterances provide evidence that children are reaching an important milestone: the beginning of morphosyntactic development (Aoyama et al.) [14]. Brown [31] developed a measure of syntactic growth called mean length of utterance (MLU). It is determined by recording or listening to a 30-minute sample of a child's speech, counting the number of meaningful morphemes (semantic roles – see chart below), and dividing by the number of utterances. Meaningful morphemes can be function words (e.g., "of"), content words (e.g., "cat"), or grammatical inflections (e.g., -s). An utterance corresponds to each separate thought conveyed; repetitions, filler words, recitations, titles, and compound words are each counted as single units. Brown described five stages of syntactic development: Stage I (MLU 1.0-2.0), Stage II (MLU 2.0-2.5), Stage III (MLU 2.5-3.0), Stage IV (MLU 3.0-3.5), and Stage V (MLU 3.5-4.0).

Chart: Semantic roles
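
To make the MLU calculation concrete, here is a minimal sketch in Python. It assumes each utterance has already been hand-segmented into morphemes following the counting conventions described above; the sample utterances and the function name are ours, invented purely for illustration.

```python
# Minimal MLU sketch: total morphemes divided by number of utterances.
# Each utterance is assumed to be pre-segmented into morphemes by hand
# (e.g., "doggies" -> ["doggie", "-s"]), applying Brown's counting rules.

def mean_length_of_utterance(utterances):
    """Return MLU for a list of morpheme-segmented utterances."""
    total_morphemes = sum(len(utterance) for utterance in utterances)
    return total_morphemes / len(utterances)

# Hypothetical sample drawn from a 30-minute recording.
sample = [
    ["more", "cookie"],                 # 2 morphemes
    ["mommy", "fix"],                   # 2 morphemes
    ["doggie", "-s", "run", "-ing"],    # 4 morphemes
]

print(f"MLU = {mean_length_of_utterance(sample):.2f}")
# 8 morphemes / 3 utterances = 2.67, which would fall in Brown's Stage III.
```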

What is this child's MLU?

Two-Word Utterances

Around the age of 18 months, children's utterances are usually two-word forms such as "want that," "mommy do," or "doll fall" (Vetter) [8]. In English, these forms are dominated by content words such as nouns, verbs, and adjectives and are restricted to concepts the child is learning in the sensorimotor stage suggested by Piaget (Brown) [31]. Thus, they express relations between objects, actions, and people. This type of speech is called telegraphic speech. During this developmental stage, children combine words to convey various meanings. They also display evidence of grammatical structure, with consistent word orders and inflections (Behrens & Gut [32]; Vetter [8]).

Once the child moves beyond Stage I, simple sentences begin to form and the child begins to use inflections and function words (Aoyama et al.) [14]. At this time, the child develops grammatical morphemes (Brown) [31], which are classified into 14 categories organized by order of acquisition (see chart below). These morphemes modify the meaning of the utterance, marking tense, plurality, possession, and so on. There are two theories for why this particular order occurs. The frequency hypothesis suggests that children acquire the morphemes they hear most frequently in adult speech. Brown argued against this theory by analyzing adult speech, in which articles were the most common word form, yet children did not acquire articles quickly. He suggested instead that linguistic complexity may account for the order of acquisition, with less complex morphemes acquired first. The complexity of a morpheme was determined by its semantics (meaning) and/or syntax (rules). In other words, a morpheme with only one meaning, such as the plural (-s), is easier to learn than the copula "is" (which encodes both number and the time at which the action occurs). Brown also suggested that for a child to have successfully mastered a grammatical morpheme, they must use it properly 90% of the time.
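
As a concrete reading of that 90% criterion, the short snippet below checks whether a morpheme would count as mastered given how often the child supplies it in obligatory contexts; the counts are invented for illustration.

```python
# Sketch of Brown's mastery criterion: a grammatical morpheme counts as
# mastered once it is supplied in at least 90% of obligatory contexts.
# The counts below are hypothetical.

def is_mastered(correct_uses, obligatory_contexts, threshold=0.90):
    """Return True if the morpheme is supplied often enough to count as mastered."""
    return correct_uses / obligatory_contexts >= threshold

print(is_mastered(46, 50))  # 0.92 -> True, criterion met
print(is_mastered(30, 50))  # 0.60 -> False, still developing
```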

Syntactic Errors

As children begin to develop more complex sentences, they must also learn to use grammar rules appropriately. This is difficult in English because of the prevalence of irregular forms. For example, a child may say, "I buyed my toy from the store." This is known as an overregularization error. The child has understood that there are syntactic patterns and rules to follow but overuses them, failing to realize that there are exceptions. In the previous example, the child applied the regular past tense rule (-ed) to an irregular verb. Why do these errors occur? It may be that the child does not have a complete understanding of the word's meaning and thus selects it incorrectly (Pinker et al.) [33]. Brooks et al. [34] suggested that these errors may be categorization errors: intransitive and transitive verbs appear in different contexts, so the child must learn that certain verbs appear only in certain contexts (Brooks) [34]. Interestingly, Hartshorne and Ullman [35] found a gender difference for overregularization errors. Girls were more than three times more likely than boys to produce overregularizations. They concluded that girls were more likely to overgeneralize associatively, whereas boys overgeneralized only through rule-governed methods. In other words, girls, who remember regular forms better than boys, quickly extended those forms to similar-sounding words (e.g., fold-folded, mold-molded, so they would say that hold becomes "holded"). Boys, on the other hand, use the regular rule when they have difficulty retrieving the irregular form (e.g., the past tense -ed added to the irregular verb run gives "runned") (Hartshorne & Ullman) [35].

Another common error committed by children is the omission of words from an utterance. These errors are especially prevalent in early speech production, which frequently lacks function words (Gerken, Landau, & Remez) [36]. For example, a child may say "dog eat bone," omitting the function words "the" and "a." This type of error has been studied frequently, and researchers have proposed three main theories to account for omissions. First, children may focus on words that have referents (Brown) [31]. For example, a child may focus on "car" or "ball" rather than "jump" or "happy." The second theory suggests children simply recognize content words more readily because they carry greater stress and emphasis (Brown) [31]. The final theory, suggested by Gerken [36], involves an immature production system: in their study, children could perceive function words and classify them into various syntactic categories, yet still omitted them from their speech production.

Summary

In this chapter, the development of speech production was examined in the areas of prelinguistics, phonology, semantics, syntax, and morphology. As an infant develops, their vocalizations undergo a transition from reflexive vocalizations to speech-like sounds and finally words. However, their linguistic development does not end there. Infants' underdeveloped speech apparatus restricts them from producing all phonemes properly, and thus they produce errors such as consonant cluster reduction, omission of syllables, and assimilation. At 18 months, many children seem to undergo a vocabulary spurt. Even with a larger vocabulary, children may overextend (calling a horse a doggie) or underextend (not calling the neighbor's dog "doggie") their words. When a child begins to combine words, they are developing syntax and morphology. Syntactic development is measured using mean length of utterance (MLU), which is categorized into five stages (Brown) [31]. After Stage II, children begin to use grammatical morphemes (e.g., -ed, -s, is), which encode tense, plurality, and so on. As in other areas of linguistic development, children also produce errors such as overregularization (e.g., "I buyed it") or omission (e.g., "dog eat bone"). In spite of children's early error patterns, they will eventually develop adult-like speech with few errors. Understanding and studying child language development is an important area of research, as it may give us insight into the underlying processes of language as well as how we might facilitate it or treat individuals with language difficulties.

Learning Exercise

1. Watch the video clips of a young boy, CC, provided below.

Video 1 Video 2 Video 3 Video 4 Video 5

2. The following is a transcription of conversations between a mother (*MOT) and a child (*CHI) from Brown's (1970) corpus. You can ignore the # symbol as it represents unintelligible utterances. Use the charts found in the section on " Grammatical and Morphological Development " to help answer this question.

  • Possessive morphemes ('s)
  • Present progressive (-ing)
  • MOT: let me see .
  • MOT: over here +...
  • MOT: you have tapioca on your finger .
  • CHI: tapioca finger .
  • MOT: here you go .
  • CHI: more cookie .
  • MOT: you have another cookie right on the table .
  • CHI: Mommy fix .
  • MOT: want me to fix it ?
  • MOT: alright .
  • MOT: bring it here .
  • CHI: bring it .
  • CHI: that Kathy .
  • MOT: yes # that's Kathy .
  • CHI: op(en) .
  • MOT: no # we'll leave the door shut .
  • CHI: why ?
  • MOT: because I want it shut .
  • CHI: Mommy .
  • MOT: I'll fix it once more and that's all .
  • CHI: Mommy telephone .
  • MOT: well # go and get your telephone .
  • MOT: yes # he gave you your telephone .
  • MOT: who are you calling # Eve ?
  • CHI: my telephone .
  • CHI: Kathy cry .
  • MOT: yes # Kathy was crying .
  • MOT: Kathy was unhappy .
  • MOT: what is that ?
  • CHI: letter .
  • MOT: Eve's letter .
  • CHI: Mommy letter .
  • MOT: there's Mommy's letter .
  • CHI: Eve letter .
  • CHI: a fly .
  • MOT: yes # a fly .
  • MOT: why don't you go in the room and kill a fly ?
  • MOT: you go in the room and kill a fly .
  • MOT: yes # you get a fly .
  • MOT: oh # what's that ?
  • MOT: I'm going to go in the basement # Eve .

3. Below are examples of children's speech. These children are displaying some characteristics of the terms we have covered in this chapter. The specific terms found in each video are provided. Find examples of these terms within their associated video, and indicate which type of development (phonological, semantic, or syntactic) is associated with each term.

5. The following are examples of children’s speech errors. Name the error and the type of development it is associated with (phonological, syntactic, morphological, or semantic). Can you explain why such an error occurs?

Learning Exercise Answers


References

  • ↑ 1.0 1.1 1.2 Nathani, S., Ertmer, D. J., & Stark, R. E. (2006). Assessing vocal development in infants and toddlers. Clinical linguistics & phonetics, 20(5), 351-69.
  • ↑ 2.00 2.01 2.02 2.03 2.04 2.05 2.06 2.07 2.08 2.09 2.10 2.11 Oller, D.K.,(2000). The Emergence of the Speech Capacity. NJ: Lawrence Erlbaum Associates, Inc.
  • ↑ Mampe, B., Friederici, A. D., Christophe, A., & Wermke, K. (2009). Newbornsʼ cry melody is shaped by their native language. Current biology : CB, 19(23), 1994-7.
  • ↑ Beebe, B., Alson, D., Jaffe, J., Feldstein, S., & Crown, C. (1988). Vocal congruence in mother-infant play. Journal of psycholinguistic research, 17(3), 245-59.
  • ↑ Papaeliou, C. F., & Trevarthen, C. (2006). Prelinguistic pitch patterns expressing “communication” and “apprehension.” Journal of Child Language, 33(01), 163.
  • ↑ Keren-Portnoy, T., Majorano, M., & Vihman, M. M. (2009). From phonetics to phonology: the emergence of first words in Italian. Journal of child language, 36(2), 235-67.
  • ↑ Otomo, K. (2001). Maternal responses to word approximations in Japanese childrenʼs transition to language. Journal of Child Language, 28(1), 29-57.
  • ↑ 8.0 8.1 8.2 8.3 Vetter, H. J. (1971). Theories of language acquisition. Journal of Psycholinguistic Research, 1(1), 31.
  • ↑ 9.0 9.1 Rabagliati, H., Marcus, G. F., & Pylkkänen, L. (2010). Shifting senses in lexical semantic development. Cognition, 117(1), 17-37. Elsevier B.V.
  • ↑ 10.0 10.1 10.2 10.3 Smits-bandstra, S. (2006). The Role of Segmentation in Lexical Acquisition in Children Rôle de la Segmentation Dans l’Acquisition du Lexique chez les Enfants. Audiology, 30(3), 182-191.
  • ↑ 11.0 11.1 11.2 11.3 McIntosh, B., & Dodd, B. J. (2008). Two-year-oldsʼ phonological acquisition: Normative data. International journal of speech-language pathology, 10(6), 460-9.
  • ↑ Wyllie-Smith, L., McLeod, S., & Ball, M. J. (2006). Typically developing and speech-impaired childrenʼs adherence to the sonority hypothesis. Clinical linguistics & phonetics, 20(4), 271-91.
  • ↑ 13.0 13.1 Ganger, J., & Brent, M. R. (2004). Reexamining the vocabulary spurt. Developmental psychology, 40(4), 621-32.
  • ↑ 14.0 14.1 14.2 Aoyama, K., Peters, A. M., & Winchester, K. S. (2010). Phonological changes during the transition from one-word to productive word combination. Journal of child language, 37(1), 145-57.
  • ↑ Newton, C., & Wells, B. (2002, July). Between-word junctures in early multi-word speech. Journal of Child Language.
  • ↑ Goldstein, M. H., & Schwade, J. a. (2008). Social feedback to infantsʼ babbling facilitates rapid phonological learning. Psychological science : a journal of the American Psychological Society / APS, 19(5), 515-23. doi: 10.1111/j.1467-9280.2008.02117.x.
  • ↑ Nicolaidis, K. (2004). Articulatory variability during consonant production by Greek speakers with hearing impairment: an electropalatographic study. Clinical linguistics & phonetics, 18(6-8), 419-32.
  • ↑ Ninio, A., & Snow, C. (1996). Pragmatic development. Boulder, CO: Westview Press.
  • ↑ Smits-bandstra, S. (2006). The Role of Segmentation in Lexical Acquisition in Children Rôle de la Segmentation Dans l’Acquisition du Lexique chez les Enfants. Audiology, 30(3), 182-191.
  • ↑ Storkel, H. L. (2004). Do children acquire dense neighborhoods? An investigation of similarity neighborhoods in lexical acquisition. Applied Psycholinguistics, 25(02), 201-221.
  • ↑ Gillette, J., Gleitman, H., Gleitman, L., & Lederer, a. (1999). Human simulations of vocabulary learning. Cognition, 73(2), 135-76.
  • ↑ Tardif, T., Fletcher, P., Liang, W., Zhang, Z., Kaciroti, N., & Marchman, V. a. (2008). Babyʼs first 10 words. Developmental psychology, 44(4), 929-38.
  • ↑ 23.0 23.1 23.2 Mitchell, C., & McMurray, B. (2009). On Leveraged Learning in Lexical Acquisition and Its Relationship to Acceleration. Cognitive Science, 33(8), 1503-1523.
  • ↑ Reznick, J. S., & Goldfield, B. a. (1992). Rapid change in lexical development in comprehension and production. Developmental Psychology, 28(3), 406-413.
  • ↑ Gopnik, A., & Meltzoff, A. (1987). The Development of Categorization in the Second Year and Its Relation to Other Cognitive and Linguistic Developments. Child Development, 58(6), 1523.
  • ↑ 26.0 26.1 26.2 Swan, D. W. (2000). How to build a lexicon: a case study of lexical errors and innovations. First Language, 20(59), 187-204.
  • ↑ 27.0 27.1 27.2 27.3 McGregor, K. K., Friedman, R. M., Reilly, R. M., & Newman, R. M. (2002). Semantic representation and naming in young children. Journal of speech, language, and hearing research : JSLHR, 45(2), 332-46.
  • ↑ 28.0 28.1 Bloomquist, J. (2007). Developmental trends in semantic acquisition: Evidence from over-extensions in child language. First Language, 27(4), 407-420.
  • ↑ 29.0 29.1 Bowerman, M. (1978). Systematizing semantic knowledge: Changes over time in the child's organization of word meaning. Child Development, 49, 977-987.
  • ↑ 30.0 30.1 Jerger, S., & Damian, M. F. (2005). Whatʼs in a name? Typicality and relatedness effects in children. Journal of experimental child psychology, 92(1), 46-75.
  • ↑ 31.0 31.1 31.2 31.3 31.4 31.5 Brown, R. (1973). A First Language: The Early Stages. Cambridge, MA: Harvard University Press.
  • ↑ Behrens, H., & Gut, U. (2005). The relationship between prosodic and syntactic organization in early multiword speech. Journal of Child Language, 32(1), 1-34.
  • ↑ Marcus, G. F., Pinker, S., Ullman, M., Hollander, M., Rosen, T. J., & Xu, F. (1992). Overregularization in language acquisition. Monographs of the Society for Research in Child Development, 57(4, Serial No. 228).
  • ↑ 34.0 34.1 Brooks, P. J., Tomasello, M., Dodson, K., & Lewis, L. B. (1999). Young Childrenʼs Overgeneralizations with Fixed Transitivity Verbs. Child Development, 70(6), 1325-1337. doi: 10.1111/1467-8624.00097.
  • ↑ 35.0 35.1 Hartshorne, J. K., & Ullman, M. T. (2006). Why girls say “holded” more than boys. Developmental science, 9(1), 21-32.
  • ↑ 36.0 36.1 Gerken, L., Landau, B., & Remez, R. E. (1990). Function morphemes in young children's speech perception and production. Developmental Psychology, 26(2), 204-216.





9.1 Evidence for Speech Production

Dinesh Ramoo

The evidence used by psycholinguists to understand speech production is varied and interesting. It includes speech errors, reaction-time experiments, neuroimaging, computational modelling, and analysis of patients with language disorders. Until recently, the most prominent body of evidence for understanding how we speak came from speech errors: spontaneous mistakes we sometimes make in casual speech. Ordinary speech is far from perfect, and we often notice ourselves slipping up. These slips of the tongue can be transcribed and analyzed for broad patterns. The most common method is to collect a large corpus of speech errors by recording all the errors one comes across in daily life.

Perhaps the most famous examples of this type of analysis are what are termed 'Freudian slips.' Freud (1901/1975) proposed that slips of the tongue were a way to understand repressed thoughts. According to his theories of the subconscious, certain thoughts may be too uncomfortable to be processed by the conscious mind and are repressed. However, these unconscious thoughts may sometimes surface in dreams and slips of the tongue. Even before Freud, Meringer and Mayer (1895) analysed slips of the tongue (although not in terms of psychoanalysis).

Speech errors can be categorized into a number of subsets in terms of the linguistic units or mechanisms involved. The linguistic units involved in speech errors can be phonemes, syllables, morphemes, words, or phrases. The mechanisms can involve the deletion, substitution, insertion, or blending of these units. Fromkin (1971, 1973) argued that the fact that these errors involve some definable linguistic unit establishes the mental existence of that unit at some level of speech production. We will consider these in more detail when discussing the various stages of speech production.

Speech error: an error in the production of speech.

Freudian slip: an unintentional speech error hypothesized by Sigmund Freud to indicate subconscious feelings.
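
One way to make this classification scheme concrete is to code each collected slip by the linguistic unit and the mechanism involved, as in the sketch below. The field names, category labels, and example slips are our own illustrations, not the coding format of Fromkin's or any other published corpus.

```python
# Illustrative sketch of coding speech errors by linguistic unit and mechanism.
# The example slips and labels are hypothetical, for demonstration only.
from collections import Counter
from dataclasses import dataclass

UNITS = {"phoneme", "syllable", "morpheme", "word", "phrase"}
MECHANISMS = {"deletion", "substitution", "insertion", "blend"}

@dataclass
class SpeechError:
    intended: str    # what the speaker meant to say
    produced: str    # what actually came out
    unit: str        # linguistic unit involved
    mechanism: str   # how that unit was disrupted

    def __post_init__(self):
        assert self.unit in UNITS and self.mechanism in MECHANISMS

corpus = [
    # hypothetical phoneme deletion: one sound dropped from a cluster
    SpeechError("split pea soup", "spit pea soup", "phoneme", "deletion"),
    # hypothetical word blend of two competing targets ("shout" / "yell")
    SpeechError("shout / yell", "shell", "word", "blend"),
]

# Tallying by unit gives the kind of broad pattern described in the text.
print(Counter(error.unit for error in corpus))
```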

9.1 Evidence for Speech Production Copyright © 2021 by Dinesh Ramoo is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.



Introduction Speech - A Step-by-Step Guide & Examples



Introduction speeches are all around us. Whenever we meet a new group of people in formal settings, we have to introduce ourselves. That’s what an introduction speech is all about.

When you're facing a formal audience, your ability to deliver a compelling introductory speech can make a lot of difference. With the correct approach, you can build credibility and connections.

In this blog, we'll take you through the steps to craft an impactful introduction speech. You’ll also get examples and valuable tips to ensure you leave a lasting impression.

So, let's dive in!


  • 1. What is an Introduction Speech? 
  • 2. How to Write an Introduction Speech?
  • 3. Introduction Speech Outline
  • 4. 7 Ways to Open an Introduction Speech
  • 5. Introduction Speech Example
  • 6. Introduction Speech Ideas
  • 7. Tips for Delivering the Best Introduction Speech

What is an Introduction Speech? 

An introduction speech, or introductory address, is a brief presentation at the beginning of an event or public speaking engagement. Its primary purpose is to establish a connection with the audience and to introduce yourself or the main speaker.

This type of speech is commonly used in a variety of situations, including:

  • Public Speaking: When you step onto a stage to address a large crowd, you start with an introduction to establish your presence and engage the audience.
  • Networking Events: When meeting new people in professional or social settings, an effective introduction speech can help you make a memorable first impression.
  • Formal Gatherings: From weddings to conferences, introductions set the tone for the event and create a warm and welcoming atmosphere.

In other words, an introduction speech is simply a way to introduce yourself to a crowd of people. 

How to Write an Introduction Speech?

Before you can deliver your speech, you need to prepare it. Writing the speech out helps you organize your ideas and present them effectively.

Here is how to introduce yourself in a speech.

  • Know Your Audience

Understanding your audience is crucial. Consider their interests, backgrounds, and expectations to tailor your introduction accordingly.

For instance, the audience members could be your colleagues, new classmates, or various guests depending on the occasion. Understanding your audience will help you decide what they are expecting from you as a speaker.

  • Start with a Hook

Begin with a captivating opening line that grabs your audience's attention. This could be a surprising fact, a relevant quote, or a thought-provoking question about yourself or the occasion.

  • Introduce Yourself

Introduce yourself to the audience. State your name, occupation, or other details relevant to the occasion, and clearly mention the reason for your speech. This builds your credibility and gives the audience a reason to stay with you.

  • Keep It Concise

So how long is an introduction speech?

Introduction speeches should be brief and to the point. Aim for around 1-2 minutes in most cases. Avoid overloading the introduction with excessive details.

  • Highlight Key Points

Mention the most important information that establishes the speaker's credibility or your own qualifications. Write down any relevant achievements, expertise, or credentials to include in your speech. Encourage the audience to connect with you using relatable anecdotes or common interests.

  • Rehearse and Edit

Practice your introduction speech to ensure it flows smoothly and stays within the time frame. Edit out any unnecessary information, ensuring it's concise and impactful.

  • Tailor for the Occasion

Adjust the tone and content of your introduction speech to match the formality and purpose of the event. What works for a business conference may not be suitable for a casual gathering.

Introduction Speech Outline

To assist you in creating a structured and effective introduction speech, here's a simple outline that you can follow:

Here is an example outline for a self-introduction speech.

Outline for Self-Introduction Speech

7 Ways to Open an Introduction Speech

You can start your introduction speech as most people do:

“Hello everyone, my name is _____. I will talk about _____. Thank you so much for having me. So first of all _______”

However, this is the fastest way to make your audience lose interest. Instead, you should start by captivating your audience’s interest. Here are 7 ways to do that:

  • Quote  

Start with a thought-provoking quote that relates to your topic or the occasion. For example: "Mahatma Gandhi once said, 'You must be the change you want to see in the world.'"

  • Anecdote or Story

Begin with a brief, relevant anecdote or story that draws the audience in. It could be a story about yourself or any catchy anecdote to begin the flow of your speech.

  • Rhetorical Question

Pose a rhetorical question to engage the audience's curiosity and involvement. For example, "Have you ever wondered what it would be like to travel back in time, to experience a moment in history?"

  • Statistic or Fact

Share a surprising statistic or interesting fact that underscores the significance of your speech. E.g. “Did you know that as of today, over 60% of the world's population has access to the internet?”

  • “What If” Scenario

Paint a vivid "What if" scenario that relates to your topic, sparking the audience's imagination and curiosity. For example, "What if I told you that a single decision today could change the course of your life forever?"

  • Ignite Imagination  

Encourage the audience to envision a scenario related to your topic. For instance, "Imagine a world where clean energy powers everything around us, reducing our carbon footprint to almost zero."

  • Moment of Silence

Start your introduction speech with a moment of silence, allowing the audience to focus and anticipate your message. This can be especially powerful in creating a sense of suspense and intrigue.

Introduction Speech Example

To help you understand how to put these ideas into practice, here are the introduction speech examples for different scenarios.

Introduction Speech Writing Sample

Short Introduction Speech Sample

Self Introduction Speech for College Students

Introduction Speech about Yourself

Student Presentation Introduction Speech Script

Teacher Introduction Speech

New Employee Self Introduction Speech

Introduction Speech for Chief Guest


Want to read examples for other kinds of speeches? Find the best speeches at our blog about speech examples !

Introduction Speech Ideas

Now that you’ve understood what an introduction speech is, you may want to write one of your own. What should you talk about?

The following are some ideas to start an introduction speech for a presentation, meeting, or social gathering in an engaging way. 

  • Personal Story: Share a brief personal story or an experience that has shaped you, introducing yourself on a deeper level.
  • Professional Background: Introduce yourself by highlighting your professional background, including your career achievements and expertise.
  • Hobby or Passion: Discuss a hobby or passion that you're enthusiastic about, offering insights into your interests and what drives you.
  • Volunteer Work: Introduce yourself by discussing your involvement in volunteer work or community service, demonstrating your commitment to making a difference.
  • Travel Adventures: Share anecdotes from your travel adventures, giving the audience a glimpse into your love for exploring new places and cultures.
  • Books or Literature: Provide an introduction related to a favorite book, author, or literary work, revealing your literary interests.
  • Achievements and Milestones: Highlight significant achievements and milestones in your life or career to introduce yourself with an impressive track record.
  • Cultural Heritage: Explore your cultural heritage and its influence on your identity, fostering a sense of cultural understanding.
  • Social or Environmental Cause: Discuss your dedication to a particular social or environmental cause, inviting the audience to join you in your mission.
  • Future Aspirations: Share your future goals and aspirations, offering a glimpse into what you hope to achieve in your personal or professional life.

You can deliver engaging speeches on all kinds of topics. Here is a list of entertaining speech topics to get inspiration.

Tips for Delivering the Best Introduction Speech

Now that you know how to write an effective introduction speech, let's focus on the delivery. The way you present your introduction is just as important as the content itself.

Here are some valuable tips to ensure you deliver a better introduction speech:

  • Maintain Eye Contact 

Make eye contact with the audience to establish a connection. This shows confidence and engages your listeners.

  • Use Appropriate Body Language 

Your body language should convey confidence and warmth. Stand or sit up straight, use open gestures, and avoid fidgeting.

  • Mind Your Pace

Speak at a moderate pace, avoiding rapid speech. A well-paced speech is easier to follow and more engaging.

  • Avoid Filler Words

Minimize the use of filler words such as "um," "uh," and "like." They can be distracting and detract from your message.

  • Be Enthusiastic

Convey enthusiasm about the topic or the speaker. Your energy can be contagious and inspire the audience's interest.

  • Practice, Practice, Practice

Rehearse your speech multiple times. Practice in front of a mirror, record yourself, or seek feedback from others.

  • Be Mindful of Time

Stay within the allocated time for your introduction. Going on too long risks boring the audience.

  • Engage the Audience

Encourage the audience's participation. You could do that by asking rhetorical questions, involving them in a brief activity, or sharing relatable anecdotes.

Mistakes to Avoid in an Introduction Speech

While crafting and delivering an introduction speech, it's important to be aware of common pitfalls that can diminish its effectiveness. Avoiding these mistakes will help you create a more engaging and memorable introduction. 

Here are some key mistakes to steer clear of:

  • Rambling On

One of the most common mistakes is making the introduction too long. Keep it concise and to the point. The purpose is to set the stage, not steal the spotlight.

  • Lack of Preparation

Failing to prepare adequately can lead to stumbling, awkward pauses, or losing your train of thought. Rehearse your introduction to build confidence.

  • Using Jargon or Complex Language

Avoid using technical jargon or complex language that may confuse the audience. Your introduction should be easily understood by everyone.

  • Being Too Generic

A generic or uninspiring introduction can set a lackluster tone. Ensure your introduction is tailored to the event and speaker, making it more engaging.

  • Using Inappropriate Humor

Be cautious with humor, as it can easily backfire. Avoid inappropriate or potentially offensive jokes that could alienate the audience.

  • Not Tailoring to the Occasion

An introduction should be tailored to the specific event's formality and purpose. A one-size-fits-all approach may not work in all situations.

To Conclude,

An introduction speech is more than just a formality. It's an opportunity to engage, inspire, and connect with your audience in a meaningful way. 

With the help of this blog, you're well-equipped to shine in various contexts. So, step onto that stage, speak confidently, and captivate your audience from the very first word.

Moreover, you’re not alone in your journey to becoming a confident introducer. If you ever need assistance in preparing your speech, let the experts help you out.

MyPerfectWords.com offers a custom essay service with experienced professionals who can craft tailored introductions, ensuring your speech makes a lasting impact.

Don't hesitate; hire our professional speech writing service to deliver top-quality speeches at your deadline!


Barbara P

Dr. Barbara is a highly experienced writer and author who holds a Ph.D. degree in public health from an Ivy League school. She has worked in the medical field for many years, conducting extensive research on various health topics. Her writing has been featured in several top-tier publications.




Speech perception and production

Elizabeth D. Casserly

1 Department of Linguistics, Speech Research Laboratory, Indiana University, Bloomington, IN 47405, USA

David B. Pisoni

2 Department of Psychological and Brain Sciences, Speech Research Laboratory, Cognitive Science Program, Indiana University, Bloomington, IN 47405, USA

Until recently, research in speech perception and speech production has largely focused on the search for psychological and phonetic evidence of discrete, abstract, context-free symbolic units corresponding to phonological segments or phonemes. Despite this common conceptual goal and intimately related objects of study, however, research in these two domains of speech communication has progressed more or less independently for more than 60 years. In this article, we present an overview of the foundational works and current trends in the two fields, specifically discussing the progress made in both lines of inquiry as well as the basic fundamental issues that neither has been able to resolve satisfactorily so far. We then discuss theoretical models and recent experimental evidence that point to the deep, pervasive connections between speech perception and production. We conclude that although research focusing on each domain individually has been vital in increasing our basic understanding of spoken language processing, the human capacity for speech communication is so complex that gaining a full understanding will not be possible until speech perception and production are conceptually reunited in a joint approach to problems shared by both modes.

Historically, language research focusing on the spoken (as opposed to written) word has been split into two distinct fields: speech perception and speech production. Psychologists and psycholinguists worked on problems of phoneme perception, whereas phoneticians examined and modeled articulation and speech acoustics. Despite their common goal of discovering the nature of the human capacity for spoken language communication, the two broad lines of inquiry have experienced limited mutual influence. The division has been partially practical, because methodologies and analysis are necessarily quite different when aimed at direct observation of overt behavior, as in speech production, or examination of hidden cognitive and neurological function, as in speech perception. Academic specialization has also played a part, since there is an overwhelming volume of knowledge available, but single researchers can only learn and use a small portion. In keeping with the goal of this series, however, we argue that the greatest prospects for progress in speech research over the next few years lie at the intersection of insights from research on speech perception and production, and in investigation of the inherent links between these two processes.

In this article, therefore, we will discuss the major theoretical and conceptual issues in research dedicated first to speech perception and then to speech production, as well as the successes and lingering problems in these domains. Then we will turn to several exciting new directions in experimental evidence and theoretical models which begin to close the gap between the two research areas by suggesting ways in which they may work together in everyday speech communication and by highlighting the inherent links between speaking and listening.

SPEECH PERCEPTION

Before the advent of modern signal processing technology, linguists and psychologists believed that speech perception was a fairly uncomplicated, straightforward process. Theoretical linguistics’ description of spoken language relied on the use of sequential strings of abstract, context-invariant segments, or phonemes, which provided the mechanism of contrast between lexical items (e.g., distinguishing pat from bat ). 1 , 2 The immense analytic success and relative ease of approaches using such symbolic structures led language researchers to believe that the physical implementation of speech would adhere to the segmental ‘linearity condition,’ so that the acoustics corresponding to consecutive phonemes would concatenate like an acoustic alphabet or a string of beads stretched out in time. If that were the case, perception of the linguistic message in spoken utterances would be a trivial matching process of acoustics to contrastive phonemes. 3

Understanding the true nature of the physical speech signal, however, has turned out to be far from easy. Early signal processing technologies, prior to the 1940s, could detect and display time-varying acoustic amplitudes in speech, resulting in the familiar waveform seen in Figure 1 . Phoneticians have long known that it is the component frequencies encoded within speech acoustics, and how they vary over time, that serve to distinguish one speech percept from another, but waveforms do not readily provide access to this key information. A major breakthrough came in 1946, when Ralph Potter and his colleagues at Bell Laboratories developed the speech spectrogram, a representation which uses the mathematical Fourier transform to uncover the strength of the speech signal hidden in the waveform amplitudes (as shown in Figure 1 ) at a wide range of possible component frequencies. 4 Each calculation finds the signal strength through the frequency spectrum of a small time window of the speech waveform; stringing the results of these time-window analyses together yields a speech spectrogram or voiceprint, representing the dynamic frequency characteristics of the spoken signal as it changes over time ( Figure 2 ).

Figure 1. Speech waveform of the words typical and yesteryear as produced by an adult male speaker, representing variations in amplitude over time. Vowels are generally the most resonant speech component, corresponding to the most extreme amplitude levels seen here. The identifying formant frequency information in the acoustics is not readily accessible from visual inspection of waveforms such as these.

Figure 2. A wide-band speech spectrogram of the same utterance as in Figure 1, showing the change in component frequencies over time. Frequency is represented along the y-axis and time on the x-axis. Darkness corresponds to greater signal strength at the corresponding frequency and time.
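
For readers who want to generate a display like Figure 2 themselves, the same short-time analysis can be run with standard scientific Python tools. The sketch below is only a minimal illustration under assumed settings; the file name and window parameters are placeholders, not those used for the published figures.

```python
# Minimal spectrogram sketch: slice the waveform into short overlapping
# windows, Fourier-transform each window, and plot signal strength as a
# function of time and frequency. File name and parameters are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, wave = wavfile.read("utterance.wav")   # hypothetical mono recording
if wave.ndim > 1:                            # keep a single channel if stereo
    wave = wave[:, 0]

freqs, times, power = spectrogram(
    wave.astype(float),
    fs=rate,
    window="hann",
    nperseg=int(0.005 * rate),    # ~5 ms windows give a wide-band display
    noverlap=int(0.004 * rate),   # heavy overlap for a smooth time axis
)

plt.pcolormesh(times, freqs, 10 * np.log10(power + 1e-12), shading="auto", cmap="gray_r")
plt.ylim(0, 5000)                 # speech formants mostly lie below 5 kHz
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Wide-band spectrogram")
plt.show()
```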

Phonemes—An Unexpected Lack of Evidence

As can be seen in Figure 2 , the content of a speech spectrogram does not visually correspond to the discrete segmental units listeners perceive in a straightforward manner. Although vowels stand out due to their relatively high amplitudes (darkness) and clear internal frequency structure, reflecting harmonic resonances or ‘formant frequencies’ in the vocal tract, their exact beginning and ending points are not immediately obvious to the eye. Even the seemingly clear-cut amplitude rises after stop consonant closures, such as for the [p] in typical , do not directly correlate with the beginning of a discrete vowel segment, since these acoustics simultaneously provide critical information about both the identity of the consonant and the following vowel. Demarcating consonant/vowel separation is even more difficult in the case of highly sonorant (or resonant) consonants such as [w] or [r].

The simple ‘acoustic alphabet’ view of speech received another setback in the 1950s, when Franklin Cooper of Haskins Laboratories reported his research group’s conclusion that acoustic signals composed of strictly serial, discrete units designed to correspond to phonemes or segments are actually impossible for listeners to process at speeds near those of normal speech perception. 5 No degree of signal simplicity, contrast between units, or user training with the context-free concatenation system could produce natural rates of speech perception for listeners. Therefore, the Haskins group concluded that speech must transmit information in parallel, through the contextual overlap observed in spectrograms of the physical signal. Speech does not look like a string of discrete, context-invariant acoustic segments, and in order for listeners to process its message as quickly as they do, it cannot be such a system. Instead, as Alvin Liberman proposed, speech is a ‘code,’ taking advantage of parallel transmission of phonetic content on a massive scale through co-articulation 3 (see section ‘Variation in Invariants,’ below).

As these discoveries came to light, the ‘speech perception problem’ began to appear increasingly insurmountable. On the one hand, phonological evidence (covered in more depth in the ‘Variation in Invariants’ section) implies that phonemes are a genuine property of linguistic systems. On the other hand, it has been shown that the acoustic speech signal does not directly correspond to phonological segments. How could a listener use such seemingly unhelpful acoustics to recover a speaker’s linguistic message? Hockett encapsulated early speech scientists’ bewilderment when he famously likened the speech perception task to that of the inspector in the following scenario:

Imagine a row of Easter eggs carried along a moving belt; the eggs are of various sizes, and variously colored, but not boiled. At a certain point, the belt carries the row of eggs between two rollers of a wringer, which quite effectively smash them and rub more or less into each other. The flow of eggs before the wringer represents the series of impulses from the phoneme source; the mess that emerges from the wringer represents the output of the speech transmitter. At a subsequent point, we have an inspector whose task it is to examine the passing mess and decide, on the basis of the broken and unbroken yolks, the variously spread out albumen, and the variously colored bits of shell, the nature of the flow of eggs which previously arrived at the wringer. (Ref 1 , p. 210)

For many years, researchers in the field of speech perception focused their efforts on trying to solve this enigma, believing that the heart of the speech perception problem lay in the seemingly impossible task of phoneme recognition—putting the Easter eggs back together.

Synthetic Speech and the Haskins Pattern Playback

Soon after the speech spectrogram enabled researchers to visualize the spectral content of speech acoustics and its changes over time, that knowledge was put to use in the development of technology able to generate speech synthetically. One of the early research synthesizers was the Pattern Playback ( Figure 3 , top panel), developed by scientists and engineers, including Cooper and Liberman, at Haskins Laboratories. 6 This device could take simplified sound spectrograms like those shown in Figure 3 and use the component frequency information to produce highly intelligible corresponding speech acoustics. Hand-painted spectrographic patterns ( Figure 3 , lower panel) allowed researchers tight experimental control over the content of this synthetic, very simplified Pattern Playback speech. By varying its frequency content and transitional changes over time, investigators were able to determine many of the specific aspects in spoken language which are essential to particular speech percepts, and many which are not. 3 , 6

Figure 3. Top panel: A diagram of the principles and components at work in the Haskins Pattern Playback speech synthesizer. (Reprinted with permission from Ref 68. Copyright 1951, National Academies of Science.) Lower panel: A series of hand-painted schematic spectrographic patterns used as input to the Haskins Pattern Playback in early research on perceptual ‘speech cues.’ (Reprinted with permission from Ref 69. Copyright 1957, American Institute of Physics.)

Perceptual experiments with the Haskins Pattern Playback and other speech synthesizers revealed, for example, the patterns of complex acoustics that signal the place of articulation of English stop consonants, such as [b], [t], and [k]. 3 For voiced stops ([b], [d], [g]), the transitions of the formant frequencies from silence into the following vowel largely determine the resulting percept. For voiceless stops ([p], [t], [k]), however, the acoustic frequency of the burst of air following the release of the consonant plays the largest role in identification. The experimental control gained from the Pattern Playback allowed researchers to alter and eliminate many aspects of naturally produced speech signals, discovering the identities of many such sufficient or necessary acoustic cues for a given speech percept. This early work attempted to pare speech down to its bare essentials, hoping to reveal the mechanisms of speech perception. Although largely successful in identifying perceptually crucial aspects of speech acoustics and greatly increasing our fundamental understanding of speech perception, these pioneering research efforts did not yield invariant, context-independent acoustic features corresponding to segments or phonemes. If anything, this research program suggested alternative bases for the communication of linguistic content. 7 , 8

Phoneme Perception—Positive Evidence

Some of the research conducted with the aim of understanding phoneme perception, however, did lead to results suggesting the reality of psychological particulate units such as phonemes. For instance, in some cases listeners show evidence of ‘perceptual constancy,’ or abstraction from signal variation to more generalized representations—possibly phonemes. Various types of such abstraction have been found in speech perception, but we will address two of the most influential here.

Categorical Perception Effects

Phoneme representations split potential acoustic continuums into discrete categories. The duration of aspiration occurring after the release of a stop consonant, for example, constitutes a potential continuum ranging from 0 ms, where vocalic resonance begins simultaneously with release of the stop, to an indefinitely long period between the stop release and the start of the following vowel. Yet stops in English falling along this continuum are split by native listeners into two functional groups—voiced [b], [d], [g] or voiceless [p], [t], [k]—based on the length of this ‘voice onset time.’ In general, this phenomenon is not so strange: perceptual categories often serve to break continuous variation into manageable chunks.

Speech categories appear to be unique in one respect, however: listeners are unable to reliably discriminate between two members of the same category. That is, although we may assign two different colors both to the category ‘red,’ we can easily distinguish between the two shades in most cases. When speech scientists give listeners stimuli varying along an acoustic continuum, however, their discrimination between different tokens of the same category (analogous to two shades of red) is very close to chance. 9 They are highly accurate at discriminating tokens spanning category boundaries, on the other hand. The combination of sharp category boundaries in listeners’ labeling of stimuli and their within-category insensitivity in discrimination, as shown in Figure 4, appears to be unique to human speech perception, and constitutes some of the strongest evidence in favor of robust segmental categories underlying speech perception.

Figure 4. Data for a single subject from a categorical perception experiment. The upper panel gives labeling or identification data for each step on a [b]/[g] place-of-articulation continuum. The lower graph gives this subject’s ABX discrimination data (filled circles) for the same stimuli with one step difference between pairs, as well as the predicted discrimination performance (open circles). Discrimination accuracy is high at category boundaries and low within categories, as predicted. (Reprinted with permission from Ref 9. Copyright 1957 American Psychological Association.)
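
The 'predicted discrimination' plotted as open circles in Figure 4 is derived from identification alone: in a strictly categorical model, two stimuli can be told apart only to the extent that they receive different labels. One common formulation for two-category ABX prediction is p_correct = 0.5 * (1 + (p1 - p2)^2), where p1 and p2 are the probabilities of labeling the two stimuli as, say, voiced. The sketch below uses invented labeling probabilities along a hypothetical voice-onset-time continuum to show how within-category pairs sit near chance while the boundary-straddling pair does not.

```python
# Sketch of the identification-based prediction for ABX discrimination.
# The labeling probabilities below are invented for illustration; they mimic
# a sharp voiced/voiceless boundary along a VOT continuum.

# P(labeled "voiced") for eight steps along a hypothetical VOT continuum.
p_voiced = [0.98, 0.97, 0.95, 0.85, 0.20, 0.05, 0.03, 0.02]

def predicted_abx(p1, p2):
    """Chance (0.5) plus a bonus that grows only when the two stimuli
    tend to receive different category labels."""
    return 0.5 * (1 + (p1 - p2) ** 2)

for step in range(len(p_voiced) - 1):
    accuracy = predicted_abx(p_voiced[step], p_voiced[step + 1])
    print(f"pair {step + 1}-{step + 2}: predicted ABX accuracy = {accuracy:.2f}")

# Within-category pairs stay near 0.50; the pair spanning the boundary
# (steps 4-5) is predicted to be discriminated well above chance (~0.71).
```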

According to this evidence, listeners sacrifice sensitivity to acoustic detail in order to make speech category distinctions more automatic and perhaps also less subject to the influence of variability. This type of category robustness is observed more strongly in the perception of consonants than vowels. Not coincidentally, as discussed briefly above and in more detail in the ‘Acoustic Phonetics’ section, below, the stop consonants which listeners have the most difficulty discriminating also prove to be the greatest challenge to define in terms of invariant acoustic cues. 10

Perceptual Constancy

Categorical perception effects are not the only case of such abstraction or perceptual constancy in speech perception; listeners also appear to ‘translate’ the speech they hear into more symbolic or idealized forms, encoding based on expectations of gender and accent. Niedzielski, for example, found that listeners identified recorded vowel stimuli differently when they were told that the original speaker was from their own versus another dialect group. 11 For these listeners, therefore, the mapping from physical speech characteristics to linguistic categories was not absolute, but mediated by some abstract conceptual unit. Johnson summarizes the results of a variety of studies showing similar behavior, 12 which corroborates the observation that, although indexical or ‘extra-linguistic’ information such as speaker gender, dialect, and speaking style are not inert in speech perception, more abstract linguistic units play a role in the process as well.

Far from being exotic, this type of ‘perceptual equivalence’ corresponds very well with language users’ intuitions about speech. Although listeners are aware that individuals often sound drastically different, the feeling remains that something holds constant across talkers and speech tokens. After all, cat is still cat no matter who says it. Given the signal variability and complexity observed in speech acoustics, such consistency certainly seems to imply the influence of some abstract unit in speech perception, possibly contrastive phonemes or segments.

Phoneme Perception—Shortcomings and Roadblocks

From the discussion above, it should be evident that speech perception research with the traditional focus on phoneme identification and discrimination has been unable either to confirm or deny the psychological reality of context-free symbolic units such as phonemes. Listeners’ insensitivity to stimulus differences within a linguistic category and their reference to an abstract ideal in identification support the cognitive role of such units, whereas synthetic speech manipulation has simultaneously demonstrated that linguistic percepts simply do not depend on invariant, context-free acoustic cues corresponding to segments. This paradoxical relationship between signal variance and perceptual invariance constitutes one of the fundamental issues in speech perception research.

Crucially, however, the research discussed until now focused exclusively on the phoneme as the locus of language users’ perceptual invariance. This approach stemmed from the assumption that speech perception can essentially be reduced to phoneme identification, relating yet again back to theoretical linguistics’ analysis of language as sequences of discrete, context-invariant units. Especially given the roadblocks and contradictions emerging in the field, however, speech scientists began to question the validity of those foundational assumptions. By attempting to control variability and isolate perceptual effects on the level of the phoneme, experimenters were asking listeners to perform tasks that bore little resemblance to typical speech communication. Interest in the field began to shift toward the influence of larger linguistic units such as words, phrases, and sentences and how speech perception processes are affected by them, if at all.

Beyond the Phoneme—Spoken Word Recognition Processes

Both new and revisited experimental evidence readily confirmed that the characteristics of word-level units do exert massive influence in speech perception. The lexical status (word vs non-word) of experimental stimuli, for example, biases listeners’ phoneme identification such that they hear more tokens as [d] in a dish / tish continuum, where the [d] percept creates a real word, than in a da / ta continuum where both perceptual options are non-words. 13 Moreover, research into listeners’ perception of spoken words has shown that there are many factors that play a major role in word recognition but almost never influence phoneme perception.

Perhaps the most fundamental of these factors is word frequency: how often a lexical item tends to be used. The more frequently listeners encounter a word over the course of their daily lives, the more quickly and accurately they are able to recognize it, and the better they are at remembering it in a recall task (e.g., Refs 14 , 15 ). High-frequency words are more robust in noisy listening conditions, and whenever listeners are unsure of what they have heard through such interference, they are more likely to report hearing a high-frequency lexical item than a low-frequency one. 16 In fact, the effects of lexical status mentioned above are actually only extreme cases of frequency effects; phonotactically legal non-words (i.e., non-words which seem as though they could be real words) are treated psychologically like real words with a frequency of zero. Like cockroaches, these so-called ‘frequency effects’ pop up everywhere in speech research.

The nature of a word’s ‘lexical neighborhood’ also plays a pervasive role in its recognition. If a word is highly similar to many other words, as cat is in English, then listeners will be slower and less accurate at identifying it, whereas a comparably high-frequency word with fewer ‘neighbors’ to compete with it will be recognized more easily. ‘Lexical hermits’ such as Episcopalian and chrysanthemum, therefore, are particularly easy to recognize despite their low frequencies (and long durations). As further evidence of frequency effects’ ubiquitous presence, however, the frequencies of a word’s neighbors also influence perception: a word with a dense neighborhood of high-frequency items is more difficult to recognize than a word with a dense neighborhood of relatively low-frequency items, which offer weaker competition. 17 , 18
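
Neighbors are conventionally defined as words differing from the target by a single phoneme substitution, addition, or deletion. The sketch below counts such neighbors over a toy lexicon; the word list and phonemic transcriptions are illustrative assumptions, not drawn from the cited studies.

```python
# Minimal sketch: counting a word's lexical neighbors, defined here as entries
# differing by exactly one phoneme substitution, addition, or deletion (i.e., a
# phoneme-level edit distance of 1). The toy lexicon is illustrative only.

def edit_distance(a: list, b: list) -> int:
    """Standard Levenshtein distance over phoneme sequences."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                        # deletion
                          d[i][j - 1] + 1,                        # addition
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
    return d[len(a)][len(b)]

lexicon = {
    "cat": ["k", "ae", "t"], "bat": ["b", "ae", "t"], "cut": ["k", "ah", "t"],
    "can": ["k", "ae", "n"], "scat": ["s", "k", "ae", "t"], "at": ["ae", "t"],
    "dog": ["d", "ao", "g"],
}

def neighbors(word: str) -> list:
    return [w for w in lexicon
            if w != word and edit_distance(lexicon[word], lexicon[w]) == 1]

print(neighbors("cat"))   # dense neighborhood in this toy lexicon
print(neighbors("dog"))   # a 'lexical hermit' here: no one-phoneme neighbors
```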

Particularly troublesome for abstractionist phoneme-based views of speech perception, however, was the discovery that the indexical properties of speech (see ‘Perceptual Constancy,’ above) also influence word recognition. Goldinger, for example, has shown that listeners are more accurate at word recall when they hear stimuli repeated by the same versus different talkers. 19 If speech perception were mediated only by linguistic abstractions, such ‘extra-linguistic’ detail should not be able to exert this influence. In fact, this and an increasing number of similar results (e.g., Ref 20 ) have caused many speech scientists to abandon traditional theories of phoneme-based linguistic representation altogether, instead positing that lexical items are composed of maximally detailed ‘episodic’ memory traces. 19 , 21

Conclusion—Speech Perception

Regardless of the success or failure of episodic representational theories, a host of new research questions remain open in speech perception. The variable signal/common percept paradox remains a fundamental issue: what accounts for the perceptual constancy across highly diverse contexts, speech styles and speakers? From a job interview in a quiet room to a reunion with an old friend at a cocktail party, from a southern belle to a Detroit body builder, what makes communication possible? Answers to these questions may lie in discovering the extent to which the speech perception processes tapped by experiments in word recognition and phoneme perception are related, and uncovering the nature of the neural substrates of language that allow adaptation to such diverse situations. Deeply connected to these issues, Goldinger, Johnson and others’ results have prompted us to wonder: what is the representational specificity of speech knowledge and how does it relate to perceptual constancy?

Although speech perception research over the last 60 years has made substantial progress in increasing our understanding of perceptual challenges and particularly the ways in which they are not solved by human listeners, it is clear that a great deal of work remains to be done before even this one aspect of speech communication is truly understood.

SPEECH PRODUCTION

Speech production research serves as the complement to the work on speech perception described above. Where investigations of speech perception are necessarily indirect, using listener response time latencies or recall accuracies to draw conclusions about underlying linguistic processing, research on speech production can be refreshingly direct. In typical production studies, speech scientists observe articulation or acoustics as they occur, then analyze this concrete evidence of the physical speech production process. Conversely, where speech perception studies give researchers exact experimental control over the nature of their stimuli and the inputs to a subject’s perceptual system, research on speech production affords far less experimental control: investigators must observe more or less passively while speakers do as they will in response to the experimental prompts.

Such fundamentally different experimental conditions, along with a focus on the opposite side of the perceptual coin, allow speech production research to ask different questions and draw different conclusions about spoken language use and speech communication. As we discuss below, in some ways this ‘divide and conquer’ approach has been very successful in expanding our understanding of speech as a whole. In other ways, however, it has met with many of the same roadblocks as its perceptual complement and similarly leaves many critical questions unanswered in the end.

A Different Approach

When the advent of the speech spectrogram made it obvious that the speech signal does not straightforwardly mirror phonemic units, researchers responded in different ways. Some, as discussed above, chose to question the perceptual source of phoneme intuitions, trying to define the acoustics necessary and sufficient for an identifiable speech percept. Others, however, began separate lines of work aiming to observe the behavior of speakers more directly. They wanted to know what made the speech signal as fluid and seamless as it appeared, whether the observed overlap and contextual dependence followed regular patterns or rules, and what evidence speakers might show in support of the reality of the phonemic units. In short, these speech scientists wanted to demystify the puzzling acoustics seen on spectrograms by investigating them in the context of their source.

The Continuing Search

It may seem odd, perhaps, that psychologists, speech scientists, engineers, phoneticians, and linguists were not ready to abandon the idea of phonemes as soon as it became apparent that the physical speech signal did not straightforwardly support their psychological reality. Dating back to Panini’s grammatical study of Sanskrit, however, the use of abstract units such as phonemes has provided enormous gains to linguistic and phonological analysis. Phonemic units appear to capture the domain of many phonological processes, for example, and their use enables linguists to make sense of the multitude of patterns and distributions of speech sounds across the world’s languages. It has even been argued 22 that their discrete, particulate nature underlies humanity’s immense potential for linguistic innovation, allowing us to make ‘infinite use of finite means.’ 23

Beyond these theoretical gains, phonemes were argued to be empirically supported by research on speech errors or ‘slips of the tongue,’ which appeared to operate over phonemic units. That is, the kinds of errors observed during speech production, such as anticipations (‘a leading list’), perseverations (‘pulled a pantrum’), reversals (‘heft lemisphere’), additions (‘moptimal number’), and deletions (‘chrysanthemum pants’ for ‘chrysanthemum plants’), appear to involve errors in the ordering and selection of whole segmental units, and always result in phonologically legal combinations; the domain of such errors is typically described as the segment. 24 Without evidence to the contrary, these errors seemed to provide evidence for speakers’ use of discrete phonological units.

Although there have been dissenters 25 and shifts in the conception of the units thought to underlie speech, abstract features or phoneme-like units of some type have remained prevalent in the literature. In light of the particulate nature of linguistic systems, the enhanced understanding gained with the assumption of segmental analysis, and the empirical evidence observed in speech planning errors, researchers were and are reluctant to give up the search for the basis of phonemic intuitions in physically observable speech production.

Acoustic Phonetics

One of the most fruitful lines of research into speech production focused on the acoustics of speech. This body of work, part of ‘Acoustic Phonetics,’ examines the speech signals speakers produce in great detail, searching for regularities, invariant properties, and simply a better understanding of the human speech capacity. Although the speech spectrograph did not immediately show the invariants researchers anticipated, they reasoned that such technology would also allow them to investigate the speech signal at an unprecedented level of scientific detail. Because speech acoustics are so complex, invariant cues corresponding to phonemes may be present, but difficult to pinpoint. 10 , 26

While psychologists and phoneticians in speech perception were generating and manipulating synthesized speech in an effort to discover the acoustic ‘speech cues,’ therefore, researchers in speech production refined signal processing techniques enabling them to analyze the content of naturally produced speech acoustics. Many phoneticians and engineers took on this problem, but perhaps none has been as tenacious and successful as Kenneth Stevens of MIT.

An electrical engineer by training, Stevens took the problem of phoneme-level invariant classification and downsized it, capitalizing on the phonological theories of Jakobson et al. 27 and Chomsky and Halle’s The Sound Pattern of English 28 which postulated linguistic units below the level of the phoneme called distinctive features. Binary values of universal features such as [sonorant], [continuant], and [high], these linguists argued, constituted the basis of phonemes. Stevens and his colleagues thought that invariant acoustic signals might correspond to distinctive features rather than phonemes. 10 , 26 Since phonemes often share features (e.g., /s/ and /z/ share specification for all distinctive features except [voice]), it would make sense that their acoustics are not as unique as might be expected from their contrastive linguistic function alone.

Stevens, therefore, began a thorough search for invariant feature correlates that continued until his retirement in 2007. He enjoyed several notable successes: many phonological features, it turns out, can be reliably specified by one or two signal characteristics or ‘acoustic landmarks.’ Phonological descriptors of vowel quality, such as [high] and [front], were found to correspond closely to the relative spacings between the first and second resonances of the vocal tract (or ‘formants’) during the production of sonorant vowel segments. 10
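
As a rough, hypothetical illustration of the kind of mapping involved, the sketch below assigns [high], [low], and [front] values to a vowel from its first two formant frequencies. The absolute Hz cutoffs are simplifying assumptions for a generic adult talker; the landmarks Stevens actually proposed are relational rather than fixed frequencies.

```python
# Rough sketch of mapping first and second formant frequencies (F1, F2, in Hz)
# to binary vowel features. The cutoffs are illustrative assumptions for a
# generic adult talker; actual feature correlates are relative and
# speaker-dependent, not absolute frequencies.

def vowel_features(f1_hz: float, f2_hz: float) -> dict:
    return {
        "high":  f1_hz < 400,    # low F1 ~ high tongue body (assumed cutoff)
        "low":   f1_hz > 650,    # high F1 ~ low tongue body (assumed cutoff)
        "front": f2_hz > 1700,   # high F2 ~ front tongue body (assumed cutoff)
    }

print(vowel_features(300, 2300))   # [i]-like values: high, front
print(vowel_features(700, 1100))   # [a]-like values: low, back
```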

Some features, however, remained more difficult to define acoustically. Specifically, the acoustics corresponding to consonant place of articulation seemed to depend heavily on context—the exact same burst of noise transitioning from a voiceless stop to a steady vowel might result from the lip closure of a [p] or the tongue-dorsum occlusion of a [k], depending on the vowel following the consonant. Equally problematic, the acoustics signaling the alveolar ridge closure location of coronal stop [t] are completely different before different vowels. 29 The articulation/acoustic mismatch, and the tendency for linguistic percepts to mirror articulation rather than acoustics, is represented in Figure 5 .

Figure 5. Observations from early perceptual speech cue studies. In the first case, two different acoustic signals (consonant/vowel formant frequency transitions) result in the same percept. In the latter case, identical acoustics (release burst at 1440 Hz) result in two different percepts, depending on the vocalic context. In both cases, however, perception reflects articulatory, rather than acoustic, contrast. (Adapted and reprinted with permission from Ref. 29. Copyright 1996 American Institute of Physics.)

Variation in Invariants

Why do listeners’ speech percepts show this dissociation from raw acoustic patterns? Perhaps the answer becomes more intuitive when we consider that even the most reliable acoustic invariants described by Stevens and his colleagues tend to be somewhat broad, dealing in relative distances between formant frequencies in vowels, relative abruptness of shifts in amplitude, and so on. This dependence on relative measures comes from two major sources: individual differences among talkers and contextual variation due to co-articulation. Individual speakers’ vocal tracts are shaped and sized differently, and therefore they resonate differently (just as differently sized and shaped bottles produce different sounds when air is blown over their necks), making the absolute formant frequencies corresponding to different vowels, for instance, impossible to generalize across individuals.

Perhaps more obviously problematic, though, is the second source: speech acoustics’ sensitivity to phonetic context. Not only do the acoustic cues for [p], [t], or [k] depend on the vowel following the stop closure, for example, but because the consonant and vowel are produced nearly simultaneously, the identity of the consonant reciprocally affects the acoustics of the vowel. Such co-articulatory effects are extremely robust, even operating across syllable and word boundaries. This extensive interdependence makes the possibility of identifying reliable invariance in the acoustic speech signal highly remote.

Although some researchers, such as Stevens, attempted to factor out or ‘normalize’ these co-articulatory effects, others believed that they are central to the functionality of speech communication. Liberman et al. at Haskins Laboratories pointed out that co-articulation of consonants and vowels allows the speech signal to transfer information in parallel, transmitting messages more quickly than it could if spoken language consisted of concatenated strings of context-free discrete units. 3 Co-articulation therefore enhances the efficiency of the system, rather than being a destructive or communication-hampering force. Partially as a result of this view, some speech scientists focused on articulation as a potential key to understanding the reliability of phonemic intuitions, rather than on its acoustic consequences. They developed the research program called ‘articulatory phonetics,’ aimed at the study of the visible and hidden movements of the speech organs.

Articulatory Phonetics

In many ways articulatory phonetics constitutes as much of an engineering challenge as a linguistic one. Because the majority of the vocal tract ‘machinery’ lies hidden from view (see Figure 6 ), direct observation of the mechanics of speech production requires technology, creativity, or both. And any potential solution to the problem of observation cannot disrupt natural articulation too extensively if its results are to be useful in understanding natural production of speech. The challenge, therefore, is to investigate aspects of speech articulation accurately and to a high level of detail, while keeping interference with the speaker’s normal production as minor as possible.

Figure 6. A sagittal view of the human vocal tract showing the main speech articulators as labeled. (Reprinted with permission from Ref. 70. Copyright 2001 Blackwell Publishers Inc.)

Various techniques have been developed that manage to satisfy these requirements, spanning from the broadly applicable to the highly specialized. Electromyography (EMG), for instance, allows researchers to measure directly the activity of muscles within the vocal tract during articulation via surface or inserted pin electrodes. 30 These recordings have broad applications in articulatory phonetics, from determining the relative timing of tongue movements during syllable production to measures of pulmonary function from activity in speakers’ diaphragms to examining tone production strategies via raising and lowering of speakers’ larynxes. EMG electrode placement can significantly impact articulation, however, which does impose limits on its use. More specialized techniques are typically still more disruptive of typical speech production, but interfere minimally with their particular investigational target. In transillumination of the glottis, for example, a bundle of fiberoptic lights is fed through a speaker’s nose until the light source is positioned just above their larynx. 31 A light-sensitive photocell is then placed on the neck just below the glottis to detect the amount of light passing through the vocal folds at any given moment, which correlates directly with the size of glottal opening over time. Although transillumination is clearly not an ideal method to study the majority of speech articulation, it nevertheless provides a highly accurate measure of various glottal states during speech production.

Perhaps the most currently celebrated articulatory phonetics methods are also the least disruptive to speakers’ natural articulation. Simply filming speech production in real-time via X-ray provided an excellent, complete view of unobstructed articulation, but for health and safety reasons can no longer be used to collect new data. 32 Methods such as X-ray microbeam and Electromagnetic Mid-Sagittal Articulometer (EMMA) tracking attempt to approximate that ‘X-ray vision’ by recording the movements of speech articulators in real-time through other means. The former uses a tiny stream of X-ray energy aimed at radio-opaque pellets attached to a speaker’s lips, teeth, and tongue to monitor the movements of the shadows created by the pellets as the speaker talks. The latter, EMMA, generates similar positional data for the speech organs by focusing alternating magnetic fields on a speaker and monitoring the voltage induced in small electromagnetic coils attached to a speaker’s articulators as they move through the fields during speech. Both methods track the movements of speech articulators despite their inaccessibility to visible light, providing reliable position-over-time data while minimally disrupting natural production. 33 , 34 However, comparison across subjects can be difficult due to inconsistent placement of tracking points from one subject to another and simple anatomical differences between subjects.

Ultrasound provides another, even less disruptive, articulatory phonetics technique that has been gaining popularity in recent years (e.g., Refs 35 , 36 ). Using portable machinery that does nothing more invasive than send sound waves through a speaker’s tissue and monitor their reflections, speech scientists can track movements of the tongue body, tongue root, and pharynx that even X-ray microbeam and EMMA cannot capture, as these articulators are otherwise almost completely inaccessible. By placing an ultrasound wand at the juncture of the head and neck below the jaw, images of the tongue from its root to its tip can be viewed in real-time during speech production, with virtually no interference to the speech act itself. The ultrasound signal cannot pass through air-filled cavities, however, making this method inappropriate for studies of precise place of articulation against the hard palate or of velum movements, for example, but these are areas in which X-ray microbeam and EMMA excel. The data recently captured using these techniques are beginning to give speech scientists a more complete picture of speech articulation than ever before.

Impact on the Search for Phonemes

Unfortunately for phoneme-based theories of speech production and planning, the results of recent articulatory studies of speech errors do not seem to paint a compatible picture. As discussed above, the categorical nature of speech errors has served as important support for the use of phonemic units in speech production. Goldstein, Pouplier, and their colleagues, however, used EMMA to track speakers’ production of errors in a repetition task similar to a tongue twister. Confirming earlier suspicions (e.g., Ref 25 ), they found that while speakers’ articulation sometimes followed a categorically ‘correct’ or ‘errorful’ gestural pattern, it was more frequently somewhere between two opposing articulations. In these cases, small ‘incorrect’ movements of the articulators would intrude upon the target speech gesture, both gestures would be executed simultaneously, or the errorful gesture would completely overshadow the target articulation. Only the last of these reliably resulted in the acoustic percept of a speech error. 37 As Goldstein and Pouplier point out, such non-categorical, gradient speech errors cannot constitute support for discrete phonemic units in speech planning.

Importantly, this finding was not isolated in the articulatory phonetics literature: speakers frequently appear to execute articulatory movements that do not result in any acoustic consequences. Specifically, X-ray microbeam tracking of speakers’ tongue tip, tongue dorsum, and lip closures during casual pronunciation of phrases such as perfect memory reveals that speakers raise their tongue tips for [t]-closure, despite the fact that the preceding [k] and following [m] typically obscure the acoustic realization of the [t] completely. 38 Although they could minimize their articulatory effort by not articulating the [t] where it will not be heard, speakers faithfully proceed with their complete articulation, even in casual speech.

Beyond the Phoneme

So far we have seen that, while technological, methodological and theoretical advances have enabled speech scientists to understand the speech signal and its physical production better than ever before, the underlying source of spoken language’s systematic nature remains largely mysterious. New research questions continue to be formulated, however, using results that were problematic under old hypotheses to motivate new theories and new approaches to the study of speech production.

The theory of ‘Articulatory Phonology’ stands as a prominent example; its proponents took the combination of gradient speech error data, speakers’ faithfulness to articulation despite varying means of acoustic transmission, and the lack of invariant acoustic speech cues as converging evidence that speech is composed of articulatory, rather than acoustic, fundamental units that contain explicit and detailed temporal structure. 8 , 38 Under this theory, linguistic invariants are underlyingly motor-based articulatory gestures which specify the degree and location of constrictions in the vocal tract and delineate a certain amount of time relative to other gestures for their execution. Constellations of these gestures, related in time, constitute syllables and words without reference to strictly sequential segmental or phonemic units. Speech perception, then, consists of determining the speech gestures and timing responsible for creating a received acoustic signal, possibly through extension of experiential mapping between the perceiver’s own gestures and their acoustic consequences, as in Liberman and Mattingly’s Motor Theory of Speech Perception 7 or Fowler’s Direct Realist approach. 39 Recent evidence from neuroscience may provide a biological mechanism for this process 40 (see ‘Neurobiological Evidence—Mirror Neurons,’ below).

And although researchers like Stevens continued to focus on speech acoustics as opposed to articulation, the separate lines of inquiry actually appear to be converging on the same fundamental solution to the invariance problem. The most recent instantiation of Stevens’ theory posits that some distinctive phonological features are represented by sets of redundant invariant acoustic cues, only a subset of which are necessary for recognition in any single token. As Stevens recently wrote, however, the distinction between this most recent feature-based account and theories of speech based on gestures may no longer be clear:

The acoustic cues that are used to identify the underlying distinctive features are cues that provide evidence for the gestures that produced the acoustic pattern. This view that a listener focuses on acoustic cues that provide evidence for articulatory gestures suggests a close link between the perceptually relevant aspects of the acoustic pattern for a distinctive feature in speech and the articulatory gestures that give rise to this pattern.
(Ref 10 , p. 142)

Just as in speech perception research, however, some speech production scientists are beginning to wonder whether the invariance question was even the right question to ask in the first place. In the spirit of Lindblom’s hyper-articulation and hypo-articulation theory 41 (see ‘Perception-Driven Adaptation in Speech Production,’ below), these researchers have begun investigating control and variability in production as a means of pinning down the nature of the underlying system. Speakers are asked to produce the same sentence in various contextual scenarios such that a target elicited word occurs as the main element of focus, as a carrier of stress, as a largely unstressed element, and as though a particular component of the word was misheard (e.g., in an exchange such as ‘Boy?’ ‘No, toy ’), while their articulation and speech acoustics are recorded. The data are then examined for regularities. If, for example, the relative lengths of onset consonants and following vowels remain constant across emphasized, focused, stressed, and unstressed conditions while the absolute closure and vocalic durations vary freely, that relative timing may be specified in the representation of syllables, whereas the absolute durations are evidently not subject to linguistic constraint. Research of this type seeks to determine which articulatory variables are under active, regular control and which (if any) are mere derivatives or side effects of deliberate actions. 42 – 44
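
A minimal sketch of the sort of bookkeeping such an analysis involves is given below: absolute onset and vowel durations are compared across conditions alongside the onset's proportional share of the syllable. All durations are invented placeholders, not data from the cited studies.

```python
# Sketch of the duration analysis described above: across elicitation
# conditions, do absolute onset/vowel durations vary while their relative
# proportions stay stable? All durations (ms) are invented placeholders.

measurements = {            # condition: (onset closure duration, vowel duration)
    "focused":    (95, 210),
    "stressed":   (80, 175),
    "unstressed": (55, 120),
}

for condition, (onset_ms, vowel_ms) in measurements.items():
    proportion = onset_ms / (onset_ms + vowel_ms)
    print(f"{condition:>10}: onset={onset_ms} ms, vowel={vowel_ms} ms, "
          f"onset share of syllable = {proportion:.2f}")

# If the onset's share stays roughly constant while absolute durations swing
# widely, that relative timing is a candidate for active linguistic control.
```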

Conclusion—Speech Production

Despite targeting a directly observable, overt linguistic behavior, speech production research has had no more success than its complement in speech perception at discovering decisive answers to the foundational questions of linguistic representational structure or the processes governing spoken language use. Due to the joint endeavors of acoustic and articulatory phonetics, our understanding of the nature of the acoustic speech signal and how it is produced has increased tremendously, and each new discovery points to new questions. If the basic units of speech are gesture-based, what methods and strategies do listeners use in order to perceive them from acoustics? Are there testable differences between acoustic and articulatory theories of representation? What aspects of speech production are under demonstrable active control, and how do the many components of the linguistic and biological systems work together across speakers and social and linguistic contexts? Although new lines of inquiry are promising, speech production research seems to have only begun to scratch the surface of the complexities of speech communication.

SPEECH PERCEPTION AND PRODUCTION LINKS

As Roger Moore recently pointed out, the nature of the standard scientific method is such that ‘it leads inevitably to greater and greater knowledge about smaller and smaller aspects of a problem’ (Ref. 45 , p. 419). Speech scientists followed good scientific practice when they effectively split the speech communication problem, one of the most complex behaviors of a highly complex species, into more manageable chunks. And the perceptual and productive aspects of speech each provided enough of a challenge, as we have seen, that researchers had plenty to work on without adding anything. Yet we have also seen that neither discipline on its own has been able to answer fundamental questions regarding linguistic knowledge, representation, and processing.

Although the scientific method serves to separate aspects of a phenomenon, the ultimate goal of any scientific enterprise is to unify individual discoveries, uncovering connections and regularities that were previously hidden. 46 One of the great scientific breakthroughs of the 19th century, for example, brought together the physics of electricity and magnetism, previously separate fields, and revealed them to be variations of the same basic underlying principles. Similarly, where research isolated to either speech perception or production has failed to find success, progress may lie in the unification of the disciplines. And unlike electricity and magnetism, the a priori connection between speech perception and speech production is clear: they are two sides of the same process, two links in Denes and Pinson’s famous ‘speech chain’. 47 Moreover, information theory demands that the signals generated in speech production match those received in perception, a criterion known as ‘signal parity’ which must be met for successful communication to take place; therefore, the two processes must at some point even deal in the same linguistic currency. 48

In this final section, we will discuss theories and experimental evidence that highlight the deep, inherent links between speech perception and production. Perhaps by bringing together the insights achieved within each separate line of inquiry, the recent evidence pointing to the nature of the connection between them, and several theories of how they may work together in speech communication, we can point to where the most exciting new research questions lie in the future.

Early Evidence—Audiovisual Speech Perception

Lurking behind the idea that speech perception and production may be viewed as parts of a unified speech communication process is the assumption that speech consists of more than just acoustic patterns and motor plans that happen to coincide. Rather, the currency of speech must somehow combine the domains of perception and production, satisfying the criterion of signal parity discussed above. Researchers such as Kluender, Diehl, and colleagues reject this assumption, taking a more ‘separatist’ stance in which speech is processed by listeners like any other acoustic signal, without input from or reference to complementary production systems. 49 , 50 Much of the research described in this section runs counter to such a ‘general auditory’ view, but none so directly challenges its fundamental assumptions as the phenomenon of audiovisual speech perception.

The typical view of speech, fully embraced thus far here, puts audition and acoustics at the fore. However, visual and other sensory cues also play important roles in the perception of a speech signal, augmenting or occasionally even overriding a listener’s auditory input. Very early in speech perception research, Sumby and Pollack showed that simply seeing a speaker’s face during communication in background noise can provide listeners with massive gains in speech intelligibility, with no change in the acoustic signal. 51 Similarly, it has been well documented that access to a speaker’s facial dynamics improves the accuracy and ease of speech perception for listeners with mild to more severe types of hearing loss 52 and even deaf listeners with cochlear implants. 53 , 54 Perhaps no phenomenon demonstrates this multimodal integration as clearly or has attracted more attention in the field than the effect reported by McGurk and MacDonald in 1976. 55 When listeners receive simultaneous, mismatching visual and auditory speech input, such as a face articulating the syllable ga paired with the acoustics for ba, they typically experience a unified percept da that appears to combine features of both signals while matching neither. In cases of a closer match—between visual va and auditory ba, for example—listeners tend to perceive va, adhering to the visual rather than auditory signal. The effect is robust even when listeners are aware of the mismatch, and has been observed with conflicting tactile rather than visual input 56 and with pre-lingual infants. 57 As these last cases show, the effect cannot be due to extensive experience linking visual and auditory speech information. Instead, the McGurk effect and the intelligibility benefits of audiovisual speech perception provide strong evidence for the inherently multimodal nature of speech processing, contrary to a ‘general auditory’ view. As a whole, the audiovisual speech perception evidence supports the assumptions which make possible the discussion of evidence for links between speech perception and production below.

Phonetic Convergence

Recent work by Pardo builds on the literature of linguistic ‘alignment’ to find further evidence of an active link between speech perception and production in ‘real-time,’ typical communicative tasks. She had pairs of speakers play a communication game called the ‘map task,’ where they must cooperate to copy a path marked on one speaker’s map to the other’s blank map without seeing one another. The speakers refer repeatedly to certain landmarks on the map, and Pardo examined their productions of these target words over time. She asked naive listeners to compare a word from one speaker at both the beginning and end of the game with a single recording of the same word said by the other speaker. Consistently across pairs, she found that the recordings from the end of the task were judged to be more similar than those from the beginning. Previous studies have shown that speakers may align in their patterns of intonation, 58 for example, but Pardo’s are the first results demonstrating such alignment at the phonetic level in an ecologically valid speech setting.

This ‘phonetic convergence’ phenomenon defies explanation unless the processes of speech perception and subsequent production are somehow linked within an individual. Otherwise, what a speaker hears his or her partner say could not affect subsequent productions. Further implications of the convergence phenomenon become apparent in light of the categorical perception literature described in ‘Categorical Perception Effects’ above. In those robust speech perception experiments, listeners appeared unable to reliably detect differences in the acoustic realization of particular segments. 9 Yet the convergence observed in Pardo’s work seems to operate at the sub-phonemic level, effecting subtle changes within linguistic categories (i.e., convergence results do not depend on whole-segment substitutions, but on much more fine-grained adjustments).

As Pardo’s results show, the examination of links between speech perception and production has already pointed toward new answers to some old questions. Perhaps we do not understand categorical perception effects as well as we thought—if the speech listeners hear can have these gradient within-category effects on their own speech production, then why is it that they cannot access these details in the discrimination tasks of classic categorical perception experiments? And what are the impacts of the answer for various representational theories of speech?

Perception-Driven Adaptation in Speech Production

Despite the typical separation between speech perception and production, the idea that the two processes interact or are coupled within individual speakers is not new. In 1990, Björn Lindblom introduced his ‘hyper-articulation and hypo-articulation’ (H&H) theory, which postulated that speakers’ production of speech is subject to two conflicting forces: economy of effort and communicative contrast. 41 The first pressures speech to be ‘hypo-articulated,’ with maximally reduced articulatory movements and maximal overlap between movements. In keeping with the theory’s roots in speech production research, this force stems from a speaker’s motor system. The contrasting pressure for communicative distinctiveness pushes speakers toward ‘hyper-articulated’ speech, executed so as to be maximally clear and intelligible, with minimal co-articulatory overlap. Crucially, this force stems from listener-oriented motivation. Circumstances that make listeners less likely to correctly perceive a speaker’s intended message—ranging from physical factors like presence of background noise, to psychological factors such as the lexical neighborhood density of a target word, to social factors such as a lack of shared background between the speaker and listener—cause speakers to hyper-articulate, expending greater articulatory effort to ensure transmission of their linguistic message.

For nearly a hundred years, speech scientists have known that physical conditions such as background noise affect speakers’ production. As Lane and Tranel neatly summarized, a series of experiments stemming from the work of Etienne Lombard in 1911 unequivocally showed that the presence of background noise causes speakers not only to raise the level of their speech relative to the amplitude of the noise, but also to alter their articulation style in ways similar to those predicted by H&H theory. 59 No matter the eventual status of H&H theory in all its facets, this ‘Lombard Speech’ effect empirically demonstrates a real and immediate link between what speakers are hearing and the speech they produce. As even this very early work demonstrates, speech production does not operate in a vacuum, free from the influences of its perceptual counterpart; the two processes are coupled and closely linked.

Much more recent experimental work has demonstrated that speakers’ perception of their own speech can be subject to direct manipulation, as opposed to the more passive introduction of noise used in inducing Lombard speech, and that the resulting changes in production are immediate and extremely powerful. In one experiment conducted by Houde and Jordan, for example, speakers repeatedly produced a target vowel [ɛ], as in bed, while hearing their speech only through headphones. The researchers ran the speech through a signal processing program which calculated the formant frequencies of the vowel and shifted them incrementally toward the frequencies characteristic of [æ], raising the first formant and lowering the second. Speakers were completely unaware of the real-time alteration of the acoustics corresponding to their speech production, but they incrementally shifted their articulation of [ɛ] to compensate for the researchers’ manipulation: they began producing lower first formants and higher second formants. This compensation was so dramatic that speakers who began by producing [ɛ] ended the experiment by saying vowels much closer to [i] (when heard outside the formant-shifting influence of the manipulation). 60
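
Conceptually, the manipulation amounts to a small per-trial perturbation applied to the talker's formants before the signal reaches the headphones. The sketch below shows only that incremental bookkeeping; the starting formant values, shift sizes, and compensation gain are invented, and the real-time signal processing used in the actual experiments is omitted.

```python
# Conceptual sketch of incremental feedback perturbation: on each trial the
# talker's measured formants are shifted a bit further toward [ae]-like values
# (F1 up, F2 down) before being played back. All numbers are invented for
# illustration; real-time formant estimation and resynthesis are omitted.

f1_produced, f2_produced = 580.0, 1800.0     # roughly [eh]-like start (assumed)
f1_step, f2_step = 4.0, -6.0                 # per-trial feedback shift in Hz (assumed)

for trial in range(1, 6):
    f1_heard = f1_produced + f1_step * trial   # what the speaker hears
    f2_heard = f2_produced + f2_step * trial
    print(f"trial {trial}: produced F1/F2 = {f1_produced:.0f}/{f2_produced:.0f} Hz, "
          f"heard F1/F2 = {f1_heard:.0f}/{f2_heard:.0f} Hz")
    # Compensation: speakers shift production opposite to the perceived shift,
    # lowering F1 and raising F2 on later trials (the gain is an assumption).
    f1_produced -= 0.5 * f1_step
    f2_produced -= 0.5 * f2_step
```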

Houde, Jordan, and other researchers working in this paradigm point out that such ‘sensorimotor adaptation’ phenomena demonstrate an extremely powerful and constantly active feedback system in operation during speech production. 61 , 62 Apparently, a speaker’s perception of his or her own speech plays a significant role in the planning and execution of future speech production.

The Role of Feedback—Modeling Spoken Language Use

In his influential theory of speech production planning and execution, Levelt makes explicit use of such perceptual feedback systems in production. 63 In contrast to Lindblom’s H&H theory, Levelt’s model (WEAVER++) was designed primarily to provide an account of how lexical items are selected from memory and translated into articulation, along with how failures in the system might result in typical speech errors. In Levelt’s model, speakers’ perception of their own speech allows them to monitor for errors and execute repairs. The model goes a step further, however, to posit another feedback loop entirely internal to the speaker, based on their experience with mappings between articulation and acoustics.

According to Levelt’s model, then, for any given utterance a speaker has several levels of verification and feedback. If, for example, a speaker decides to say the word day, the underlying representation of the lexical item is selected and prepared for articulation, presumably following the various steps of the model not directly relevant here. Once the articulation has been planned, the same ‘orders’ are sent both to the real speech organs and to a mental emulator or ‘synthesizer’ of the speaker’s vocal tract. This emulator generates the acoustics that would be expected from the articulatory instructions it received, based on the speaker’s past experience with the mapping. The expected acoustics feed back to the underlying representation of day to check for a match with remembered instances of the word. While this internal check proceeds, the articulators are actually executing their movements and generating acoustics. That signal enters the speaker’s auditory pathway, where the resulting speech percept feeds back to the same underlying representation, once again checking for a match.
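
A highly schematic sketch of the two loops described here appears below: an internal loop that checks the forward emulator's predicted acoustics against the stored form, and an external loop that checks the acoustics actually heard. Every function in the sketch is a hypothetical placeholder meant only to show the control flow, not a component of Levelt's actual model.

```python
# Highly schematic sketch of internal vs external feedback loops in production.
# All functions are hypothetical placeholders; only the control flow matters.

def plan_articulation(word):        # lexical selection + motor planning
    return f"<motor plan for '{word}'>"

def emulate_acoustics(plan):        # internal forward model ('synthesizer')
    return f"<predicted acoustics of {plan}>"

def execute_and_listen(plan):       # overt articulation + auditory feedback
    return f"<perceived acoustics of {plan}>"

def matches_target(acoustics, word):  # comparison against the stored form
    return word in acoustics           # toy stand-in for a real match check

def say(word):
    plan = plan_articulation(word)

    # Internal loop: fast check on predicted acoustics, before errors surface.
    if not matches_target(emulate_acoustics(plan), word):
        plan = plan_articulation(word)          # covert repair

    # External loop: slower check on what was actually heard.
    percept = execute_and_listen(plan)
    if not matches_target(percept, word):
        say(word)                               # overt repair / retry

say("day")
```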

Such a system may seem redundant, but each component has important properties. As Moore points out for his own model (see below), internal feedback loops of the type described in Levelt’s work allow speakers to repair errors much more quickly than reliance on external feedback would permit, which translates to significant evolutionary advantages. 45 Without external loops backing up the internal systems, however, speakers might miss changes to their speech imposed by physical conditions (e.g., noise). Certainly the adaptation observed in Houde and Jordan’s work demonstrates active external feedback control over speech production: only an external loop could catch the disparity between the acoustics a speaker actually perceives and his or her underlying representation. And indeed, similar feedback-reliant models have been proposed as the underpinnings of non-speech movements such as reaching. 64

As suggested above, Moore has recently proposed a model of speech communication that also incorporates multiple feedback loops, both internal and external. 45 His Predictive Sensorimotor Control and Emulation (PRESENCE) model goes far beyond the specifications of Levelt’s production model, however, to incorporate additional feedback loops that allow the speaker to emulate the listener’s emulation of the speaker , and active roles for traditionally ‘extra-linguistic’ systems such as the speaker’s affective or emotional state. In designing his model, Moore attempts to take the first step in what he argues is the necessary unification of not just research on speech perception and production, but the work related to speech in many other fields as well, such as neuroscience, automated speech recognition, text-to-speech synthesis, and biology, to name just a few. 45

Perhaps most fundamental to our discussion here, however, is the role of productive emulation or feedback during speech perception in the model. Where Levelt’s model deals primarily with speech production, Moore’s PRESENCE incorporates both speech perception and production, deliberately emphasizing their interdependence and mutually constraining relationship. According to his model, speech perception takes place with the aid of listener-internal emulation of the acoustic-to-articulatory mapping potentially responsible for the received signal. As Moore puts it, speech perception in his model is essentially a revisiting of the idea of ‘recognition-by-synthesis’ (e.g., Ref 65 ), whereas speech production is (as in Levelt) ‘synthesis by recognition.’

Neurobiological Evidence—Mirror Neurons

The experimental evidence we considered above suggests pervasive links between what listeners hear and the speech they produce. Conversational partners converge in their production of within-category phonetic detail, speakers alter their speech styles in adverse listening conditions, and manipulation of speakers’ acoustic feedback from their own speech can dramatically change the speech they produce in response. As we also considered, various theoretical models of spoken language use have been proposed to account for these phenomena and the observed perceptual and productive links. Until recently, however, very little neurobiological evidence supported these proposals. The idea of a speaker-internal vocal tract emulator, for instance, seemed highly implausible to many speech scientists; how would the brain possibly implement such a structure?

Cortical populations of newly discovered ‘mirror neurons,’ however, seem to provide a plausible neural substrate for proposals of direct, automatic, and pervasive links between speech perception and production. These neurons ‘mirror’ in the sense that they fire both when a person performs an action themselves and when they perceive someone else performing the same action, either visually or through some other (auditory, tactile) perceptual mode. Human mirror neuron populations appear to be clustered in several cortical areas, including the pre-frontal cortex, which is often implicated in behavioral inhibition and other executive function, and areas typically recognized as centers of speech processing, such as Broca’s area (for in-depth review of the literature and implications, see Ref 66 ).

Neurons which physically equate (or at least directly link) an actor’s production and perception of a specific action have definite implications for theories linking speech perception and production: they provide a potential biological mechanism. The internal feedback emulators hypothesized most recently by Levelt and Moore could potentially be realized in mirror neuron populations, which would emulate articulatory-to-acoustic mappings (and vice versa) via their mutual sensitivity to both processes and their connectivity to both sensory and motor areas. Regardless of their specific applicability to Levelt and Moore’s models, however, these neurons do appear to be active during speech perception, as one study using Transcranial Magnetic Stimulation (TMS) demonstrates elegantly. TMS allows researchers to temporarily attenuate or elevate the background activation of a specific brain area: attenuation induces a state similar to the damage caused by a stroke or lesion, whereas elevation makes even a slight increase in the area’s activity produce overt behavior whose consequences would not normally be observable. The latter, excitation technique was used by Fadiga and colleagues, who raised the background activity of specific motor areas controlling the tongue tip. When the ‘excited’ subjects then listened to speech containing consonants which curled the tongue upward, their tongues twitched correspondingly. 67 Perceiving the speech caused activation of the motor plans that would be used in producing the same speech—direct evidence of the link between speech perception and production.

Perception/Production Links—Conclusion

Clearly, the links between speech perception and production are inherent in our use of spoken language. They are active during typical speech perception (TMS mirror neuron study), are extremely powerful, automatic and rapid (sensorimotor adaptation), and influence even highly ecologically valid communication tasks (phonetic convergence). Spoken language processing, therefore, seems to represent a linking of sensory and motor control systems, as the pervasive effects of visual input on speech perception suggest. Indeed, speech perception cannot be just sensory interpretation and speech production cannot be just motor execution. Rather, both processes draw on common resources, using them in tandem to accomplish remarkable tasks such as generalization from talker to talker and acquiring new lexical items. As new information regarding these links comes to light, models such as Lindblom’s H&H, Levelt’s WEAVER++, and Moore’s PRESENCE will both develop greater reflection of the actual capabilities of language users (simultaneous speakers and listeners) and be subject to greater constraint in their hypotheses and mechanisms. And hopefully, theory and experimental evidence will converge to discover how speech perception and production interact in the highly complex act of vocal communication.

CONCLUSIONS

Despite the strong intuitions and theoretical traditions of linguists, psychologists, and speech scientists, spoken language does not appear to straightforwardly consist of linear sequences of discrete, idealized, abstract, context-free symbols such as phonemes or segments. This discovery raises further questions, however: how does speech convey equivalent information across talkers, dialects, and contexts? And how do language users mentally represent both the variability and constancy in the speech they hear?

New directions in research on speech perception include theories of exemplar-based representation of speech and experiments designed to discover the specificity, generalized application, and flexibility of listeners’ perceptual representations. Efforts to focus on more ecologically valid tasks such as spoken word recognition also promise fruitful progress in coming years, particularly those which provide tests of theoretical and computational models. In speech production, meanwhile, the apparent convergence of acoustic and articulatory theories of representation points to the emerging potential for exciting new lines of research combining their individual successes. At the same time, more and more speech scientists are turning their research efforts toward variability in speech, and what patterns of variation can reveal about speakers’ language-motivated control and linguistic knowledge.

Perhaps the greatest potential for progress and discovery, however, lies in continuing to explore the behavioral and neurobiological links between speech perception and production. Although made separate by practical and conventional scientific considerations, these two processes are inherently and intimately coupled, and it seems that we will never truly be able to understand the human capacity for spoken communication until they have been conceptually reunited.

Examples

Introduction Speech


Discover the art of crafting compelling introduction speeches through our comprehensive guide. Whether you’re a beginner or a seasoned speaker, our step-by-step approach simplifies the process. Explore a rich collection of speech examples , tailored to inspire and improve your public speaking skills. Master the nuances of delivering impactful introductions that captivate your audience, using our expertly curated speech examples as your roadmap to success.


A speech can be of any form and used for various functions. It can be a thank-you speech to show one’s gratitude or even an introduction speech to introduce a person (even oneself), product, company, or the like. In these examples, let’s look at different speech examples that seek to introduce.


What to Include in an Introduction Speech

An introduction speech may also work as a welcome speech. You introduce yourself to an audience and give them the gist of a meeting or program. This might include recognizing significant individuals or opening a brief discussion of a topic.

But of course, this depends entirely on what you are trying to introduce. You can also consult various speech templates to see what other information might be included in your speech.

How to Write an Introduction Speech?

In writing an introduction speech, it is wise to familiarize yourself with the flow of the program.

Think about what your goal is and how you can attain it. You need to capture the attention and interest of your listeners. If you are giving a speech to introduce the president of your company, be sure to make it grand. Share significant details that will impress the audience, since an introduction speech can also serve as an informative speech. Keep in mind that it is always best to start with an outline or draft so it will be easier for you to edit.


Tips on Writing an Introduction Speech

1. Keep it short. When you introduce yourself to a person you have just met, you do not tell them paragraphs of information that are not even relevant. You want to entice an audience, not bore them. A speech does not need to be lengthy to be good; a few wise words and a touch of class will be enough for your listeners.

2. Make an outline. Introductions are meant to give an audience a quick run-through of what they need to know. Create a speech outline that states the purpose of your speech and previews the main ideas to be discussed. This gives your audience a reason to listen.

3. Create an icebreaker. Speeches can be quite awkward, especially since they are usually formal occasions. Craft an opening that leaves a good impression, helps others feel comfortable in the environment, and makes them feel valued.

4. Read it out loud. The thing is, some things sound better in our heads than said aloud. It’s possible that your speech contains words that don’t sound good together or that could be interpreted differently than you intend.

How to Conclude an Introduction Speech

Just as an essay can be concluded in different ways, an introduction speech may end in various ways.

You can close it in a challenging, congratulatory, suggestive, or even inviting manner. It’s best to keep the ending as brief as possible so your listeners know you’re wrapping up. All you need to make sure of is that you don’t end your speech abruptly, leaving your audience hanging.

In the realm of public speaking, the introduction speech serves as a crucial gateway, opening the door to deeper engagement and understanding. Whether it’s for a corporate event, educational purpose, or a personal introduction, the essence of a good introduction speech lies in its ability to connect the speaker with the audience on a meaningful level. To further enhance your skills in crafting and delivering effective introduction speeches, exploring resources from esteemed institutions can be immensely beneficial. Websites like Harvard’s Public Speaking Resources offer a treasure trove of tips, techniques, and examples that can inspire and guide speakers to refine their approach.



Review Article

Published: 14 May 2024

The speech neuroprosthesis

Alexander B. Silva (ORCID: 0000-0003-0838-4136), Kaylo T. Littlejohn, Jessie R. Liu (ORCID: 0000-0001-9316-7624), David A. Moses & Edward F. Chang (ORCID: 0000-0003-2480-4700)

Nature Reviews Neuroscience (2024)


Subjects: Cognitive neuroscience, Neuroscience

Loss of speech after paralysis is devastating, but circumventing motor-pathway injury by directly decoding speech from intact cortical activity has the potential to restore natural communication and self-expression. Recent discoveries have defined how key features of speech production are facilitated by the coordinated activity of vocal-tract articulatory and motor-planning cortical representations. In this Review, we highlight such progress and how it has led to successful speech decoding, first in individuals implanted with intracranial electrodes for clinical epilepsy monitoring and subsequently in individuals with paralysis as part of early feasibility clinical trials to restore speech. We discuss high-spatiotemporal-resolution neural interfaces and the adaptation of state-of-the-art speech computational algorithms that have driven rapid and substantial progress in decoding neural activity into text, audible speech, and facial movements. Although restoring natural speech is a long-term goal, speech neuroprostheses already have performance levels that surpass communication rates offered by current assistive-communication technology. Given this accelerated rate of progress in the field, we propose key evaluation metrics for speed and accuracy, among others, to help standardize across studies. We finish by highlighting several directions to more fully explore the multidimensional feature space of speech and language, which will continue to accelerate progress towards a clinically viable speech neuroprosthesis.
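
The Review calls for standardized metrics of decoding speed and accuracy. As a rough illustration of two metrics commonly reported in this literature, word error rate (WER) and decoded words per minute (wpm), here is a minimal Python sketch; the function names and the word-level Levenshtein formulation are our own illustrative assumptions, not code from the article.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[-1][-1] / max(len(ref), 1)


def words_per_minute(n_decoded_words: int, elapsed_seconds: float) -> float:
    """Decoding speed: words produced per minute of decoding time."""
    return 60.0 * n_decoded_words / elapsed_seconds


print(word_error_rate("please call my family", "please call the family"))  # 0.25
print(words_per_minute(62, 60.0))  # 62.0
```

On the toy sentence pair above, one substituted word out of four gives a WER of 0.25.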



Acknowledgements

The authors are incredibly grateful to the many people who enrolled in the aforedescribed studies. A.B.S. was supported by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under award number F30DC021872. K.T.L. is supported by the National Science Foundation GRFP. J.R.L. and D.A.M. were supported by the National Institutes of Health grant U01 DC018671-01A1.

Author information

Authors and Affiliations

Department of Neurological Surgery, University of California, San Francisco, San Francisco, CA, USA

Alexander B. Silva, Kaylo T. Littlejohn, Jessie R. Liu, David A. Moses & Edward F. Chang

Weill Institute for Neuroscience, University of California, San Francisco, San Francisco, CA, USA

Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA

Kaylo T. Littlejohn


Contributions

E.F.C. and A.B.S. researched data for the article and contributed substantially to discussion of the content. All authors wrote the article and reviewed and/or edited the manuscript before submission.

Corresponding author

Correspondence to Edward F. Chang.

Ethics declarations

Competing interests

D.A.M., J.R.L. and E.F.C. are inventors on a pending provisional UCSF patent application that is relevant to the neural-decoding approaches surveyed in this work. E.F.C. is an inventor on patent application PCT/US2020/028926, D.A.M. and E.F.C. are inventors on patent application PCT/US2020/043706 and E.F.C. is an inventor on patent US9905239B2, which are broadly relevant to the neural-decoding approaches surveyed in this work. E.F.C. is a co-founder of Echo Neurotechnologies, LLC. All other authors declare no competing interests.

Peer review

Peer review information

Nature Reviews Neuroscience thanks Gregory Cogan, who co-reviewed with Suseendrakumar Duraivel; Marcel van Gerven; Christian Herff; and Cynthia Chestek for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Glossary

Speech-motor disorder referring to an inability to move the vocal-tract muscles to articulate speech.

A disorder of understanding or expressing language.

This is an instruction given to individuals with vocal-tract paralysis to attempt to speak as well as they can, even though the resulting attempt may not be intelligible.

A speech-synthesis approach that relies on matching neural activity with discrete units of a speech waveform that are then concatenated together.
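
For illustration only, the following hypothetical Python sketch shows the basic idea: each window of neural activity is matched to the stored unit with the closest neural template, and the corresponding waveform snippets are joined end to end. The data structures and the distance measure are assumptions made for this example, not the method of any particular study.

```python
import numpy as np

def concatenative_synthesis(neural_frames, unit_library):
    """neural_frames: array of shape (n_frames, n_channels).
    unit_library: list of dicts, each holding a neural 'template'
    of shape (n_channels,) and the speech 'waveform' snippet it maps to."""
    templates = np.stack([unit["template"] for unit in unit_library])
    snippets = []
    for frame in neural_frames:
        # Pick the stored unit whose neural template is closest to this frame
        idx = int(np.argmin(np.linalg.norm(templates - frame, axis=1)))
        snippets.append(unit_library[idx]["waveform"])
    # Join the selected waveform units into one output waveform
    return np.concatenate(snippets)
```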

The pathway through which motor commands from the cortex reach the muscles of the vocal tract. At a high level, cortical motor neurons send axons via the corticobulbar tract which terminate in cranial nerve nuclei in the brainstem. Second-order motor neurons in the cranial nerve nuclei then send axons, that bundle and form cranial nerves, to innervate the muscles of the vocal tract.

The preferred resonating frequencies of the vocal tract that are critical for forming different vowel sounds.
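
As a concrete, simplified illustration, formants are often estimated from a short voiced segment with linear predictive coding (LPC). The sketch below assumes the librosa library is available and follows the generic textbook recipe; it is not taken from the Review.

```python
import numpy as np
import librosa

def estimate_formants(y, sr, order=12):
    """Estimate formant frequencies (Hz) for a short voiced segment
    by finding the resonant roots of an LPC model of the waveform."""
    y = librosa.effects.preemphasis(y)     # flatten the spectral tilt
    a = librosa.lpc(y, order=order)        # LPC polynomial coefficients
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]      # keep one root per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)
    return np.sort(freqs[freqs > 90.0])    # drop near-DC roots
```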

Models that are trained to capture the statistical patterns of word occurrences in natural language.
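
A toy example of the idea (our own illustration, not from the article): a bigram model estimates the probability of the next word from counts of adjacent word pairs.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Estimate P(next word | current word) from word-pair counts."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}

lm = train_bigram_lm(["please call my family", "please call the nurse"])
print(lm[("call", "my")])  # 0.5: "call" is followed by "my" half the time
```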

This refers to a clinical condition in which a participant retains cognitive capacity but has limited voluntary motor function. Locked-in syndrome is a spectrum, ranging from fully locked-in states (no residual voluntary motor function) to partially locked-in states (some residual voluntary motor function such as head movements).

An attempt to move vocal-tract muscles without attempting to vocalize.

This area of the cortex is composed of the precentral and postcentral gyri, primarily responsible for motor control and sensation, respectively.

This is an instruction given to individuals with vocal-tract paralysis to attempt to speak the best they can, but without vocalizing.

The vocal-tract muscle groups that are important for producing (articulating) speech, including the lips, jaw, tongue and larynx.

The arrangement and structure of words to form coherent sentences.

An inability to contract and move the speech articulators, often caused by injury to descending motor-neuron tracts in the brainstem.

The law that generally proposes that the frequencies of items are inversely proportional to their ranks.
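
A quick self-contained check of this relation (an illustration, not from the article): if Zipf's law holds, word frequency multiplied by rank stays roughly constant across the most common words.

```python
from collections import Counter

def zipf_check(text, top=10):
    """Return (rank, word, count, count * rank) for the most frequent words.
    Under Zipf's law the last column is roughly constant."""
    ranked = Counter(text.lower().split()).most_common(top)
    return [(rank, word, count, count * rank)
            for rank, (word, count) in enumerate(ranked, start=1)]

sample = "the cat sat on the mat and the dog sat by the door"
for row in zipf_check(sample, top=5):
    print(row)
```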


About this article

Cite this article

Silva, A.B., Littlejohn, K.T., Liu, J.R. et al. The speech neuroprosthesis. Nat. Rev. Neurosci. (2024). https://doi.org/10.1038/s41583-024-00819-9

Accepted: 12 April 2024

Published: 14 May 2024

DOI: https://doi.org/10.1038/s41583-024-00819-9


How to Introduce a Guest Speaker (with Examples)

May 25, 2023


Introducing a guest speaker is an important responsibility that sets the stage for their presentation and creates an atmosphere of anticipation. A well-crafted introduction not only provides essential information about the speaker but also captivates the audience and builds excitement. In this article, we will explore the art of introducing a guest speaker and how to craft a brilliant introduction script.

From the best way to introduce a speaker to example speeches and tips for making a memorable impact, we will equip you with the tools to deliver introductions that engage, entertain, and leave a lasting impression.

What Is the Best Way to Introduce a Speaker?

Introducing a speaker effectively requires careful planning and consideration. Here are some key elements to keep in mind when introducing a guest speaker.

1. Research and gather information.

Before introducing the guest speaker, conduct thorough research to gather relevant information about their background, achievements, and expertise. This will help you create an introduction that is both personalized and impactful.

2. Establish credibility.

Highlight the speaker’s credentials and accomplishments to establish their credibility in the eyes of the audience. Share their relevant experience, expertise, and any notable achievements that are relevant to the topic of their presentation.

3. Create a connection.

Find a compelling way to establish a connection between the speaker and the audience. This can be through shared interests, experiences, or values. For example, if you’re introducing a guest speaker at your university who happens to be an alumnus of your school, make sure you draw attention to that in your introduction. Creating a relatable connection helps the audience connect with the speaker right from the start.

4. Build anticipation.

Engage the audience’s curiosity by giving a glimpse of what the speaker will later cover in their presentation. Tease key points, intriguing anecdotes, or unique perspectives that the audience can look forward to during the talk. This builds anticipation and captures attention for the following presentation.

5. Keep it concise and engaging.

Aim for a concise, but also interesting, introduction. Use short, direct sentences that convey information clearly. In addition, avoid lengthy biographies or unnecessary details that may lose the audience’s interest. Finally, craft your words carefully to maintain a lively and engaging tone.

How to Use AI to Practice Introducing a Guest Speaker

When it comes time to practice your guest speaker introduction speech, Yoodli, an AI-powered communication coach, becomes your invaluable practice partner. With Yoodli’s cutting-edge technology and generative AI, you can rehearse and refine your introduction in a virtual, judgement-free environment. Its personalized feedback helps you fine-tune your tone, pacing, and overall delivery, ensuring that you make a powerful impact when introducing a guest speaker.

A screenshot demonstrating how to use Yoodli to practice how to introduce a guest speaker.

Furthermore, Yoodli automatically generates a transcription of your speech and analyzes it for keywords. This means you can get a sense of how your audience might interpret your speech’s overall message and main points. With Yoodli’s assistance, you can gain confidence, practice your high-income skills (like storytelling, for example), and create an introduction that both captivates and energizes the audience.

Examples of How to Introduce a Guest Speaker

To illustrate the power of a great guest speaker introduction, let’s take a look at a couple of examples of how to introduce a guest speaker.

Example of a general introduction for a guest speaker

Good morning, all! Today, we have the privilege of being in the presence of a true visionary and leader in the field of environmental sustainability. Our guest speaker has dedicated her career to finding innovative solutions for a greener and more sustainable future. [Speaker’s name], the CEO of [organization/company name], has successfully spearheaded numerous initiatives that have had a profound impact on our environment. Under her leadership, the company has revolutionized the way we approach sustainability challenges, pushing boundaries and inspiring change. With over two decades of experience in environmental engineering, [Speaker’s name] has been at the forefront of designing groundbreaking technologies and implementing sustainable practices in industries ranging from renewable energy to waste management. Her expertise has earned her international recognition and multiple prestigious awards. But it’s not just her professional achievements that make her special. [Speaker’s name] is a passionate advocate for educating the next generation on the importance of environmental stewardship. Her engaging speaking style and ability to connect with audiences of all backgrounds make her an inspiration to many. Today, [Speaker’s name] will be sharing her insights on how we can create a more sustainable future through innovation and collaboration. Get ready to be inspired, challenged, and empowered to take action. Please join me in giving a warm welcome to the exceptional [Speaker’s name]!

This example highlights the speaker’s credentials, builds a connection, creates anticipation, and sets the stage for an engaging and informative presentation.

Example of a personal anecdote for a guest speaker introduction

“Picture this: It was a sunny afternoon in the heart of our city, and I found myself walking through the bustling streets, surrounded by the sound of honking cars and the hum of conversation. Amidst the chaos, I stumbled upon a small park nestled between towering buildings — a hidden oasis of greenery and serenity.

As I entered the park, I noticed a group of children huddled around a captivating woman who stood in front of a majestic oak tree. It was none other than our esteemed guest speaker, [Speaker’s name]. She was engaging the children in a lively discussion about the wonders of nature and the importance of preserving our environment.

What struck me most was the way [Speaker’s name] effortlessly connected with these young minds, sparking their curiosity and inspiring them to take action. I watched as she shared stories of her own childhood adventures exploring forests, climbing trees, and discovering the beauty of our natural world.

In that moment, I realized the profound impact [Speaker’s name] had on these children: instilling a deep love and respect for the environment. Her passion was contagious, and it reminded me of the power we all possess to make a difference, no matter how small.

From that day forward, I became an avid follower of [Speaker’s name]’s work. Her commitment to environmental stewardship and her ability to connect with people from all walks of life is truly remarkable. Today, we have the incredible honor of welcoming her to this stage to share her insights and inspire us all to join the movement for a greener and more sustainable future.

Please join me in giving a warm welcome to the extraordinary [Speaker’s name]!”

What Do You Say First When Introducing a Guest Speaker?

The first few sentences of a guest speaker introduction are crucial in capturing the audience’s attention and setting the tone for the entire introduction. Here are some effective opening lines to consider adding to your script when introducing a guest speaker.

1. Engage listeners with a thought-provoking question.

Start with a thought-provoking question related to the speaker’s topic or expertise. This immediately grabs the audience’s attention and encourages them to engage from the very first sentence. For example: “Have you ever wondered how a single individual can make a significant impact on global environmental issues?”

2. Begin with a captivating anecdote or story.

Introduce the speaker by sharing a captivating anecdote or story that relates to their work or accomplishments. This narrative approach instantly draws the audience in and also builds an emotional connection.

3. Use a powerful quote.

Start with a powerful quote that encapsulates the essence of the speaker’s message or expertise. Quotes are attention-grabbing and can also convey a sense of authority and relevance. For example: “As Albert Einstein once said, ‘We cannot solve our problems with the same thinking we used when we created them.'” You can find some powerful quotes from the best motivational speeches, too.

4. Make a bold statement.

Begin your script to introduce your guest speaker with a bold and impactful statement that immediately captures the audience’s attention. This statement should be concise yet intriguing, sparking curiosity as well as setting the stage for the speaker’s presentation. Attention-getters are perfect for this. For example: “Today, you’re about to witness a groundbreaking approach to tackling one of the most pressing challenges of our time: climate change.”

Remember, the opening lines of your script to introduce a guest speaker are the gateway to engaging the audience and setting the stage for a memorable presentation. Choose an approach that aligns with the speaker’s personality as well as the event’s atmosphere, and don’t be afraid to be creative and captivating.

The Main Takeaway

Giving an introduction for a guest speaker is an art that requires careful planning, research, and an understanding of the audience’s expectations. By following the principles discussed in this article and using examples as inspiration, you can deliver introductions that engage, entertain, and leave a lasting impression. Remember, the goal is to set the stage for the speaker’s presentation and create a sense of excitement and anticipation.

So, go ahead, embrace the power of a well-crafted introduction, and make every guest speaker’s presence an unforgettable experience for your audience.

Start practicing with Yoodli.

Getting better at speaking is getting easier. Record or upload a speech and let our AI Speech Coach analyze your speaking and give you feedback.
