Speech Production

  • Laura Docio-Fernandez (Department of Signal Theory and Communications, University of Vigo, Vigo, Spain)
  • Carmen García Mateo (Atlantic Research Center for Information and Communication Technologies, University of Vigo, Pontevedra, Spain)

Docio-Fernandez, L., García Mateo, C. (2015). Speech Production. In: Li, S.Z., Jain, A.K. (eds) Encyclopedia of Biometrics. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7488-4_199


Speech Production, by Eryk Walczak. Last reviewed: 17 April 2023. Last modified: 22 February 2018. DOI: 10.1093/obo/9780199772810-0217

Speech production is one of the most complex human activities. It involves coordinating numerous muscles and complex cognitive processes. The area of speech production is related to Articulatory Phonetics, Acoustic Phonetics, and Speech Perception, which all study various elements of language and are part of the broader field of Linguistics. Because of the interdisciplinary nature of the topic, it is usually studied on several levels: neurological, acoustic, motor, evolutionary, and developmental. Each of these levels has its own literature, but most work on speech production touches on all of them. A large body of relevant literature is covered in the Speech Perception entry, on which this bibliography builds. This entry covers general speech production mechanisms and speech disorders. Speech production in second language learners and bilinguals, however, has special features, which are described in the separate bibliography on Cross-Language Speech Perception and Production. Speech produces sounds, and sounds are the subject of Phonology.

As mentioned in the introduction, speech production tends to be described in relation to acoustics, speech perception, neuroscience, and linguistics. Because of this interdisciplinarity, few published textbooks focus exclusively on speech production; Guenther 2016 and Levelt 1993 are the exceptions, with the former placing a stronger focus on the neuroscientific underpinnings of speech. Auditory neuroscience is also covered extensively by Schnupp, et al. 2011 and in the comprehensive textbook Hickok and Small 2015. Rosen and Howell 2011 focuses on the signal processing and acoustics that any speech scientist needs to understand. Levelt 2013 offers a historical approach to psycholinguistics that also covers speech research.

Guenther, F. H. 2016. Neural control of speech. Cambridge, MA: MIT.

This textbook provides an overview of neural processes responsible for speech production. Large sections describe speech motor control, especially the DIVA model (co-authored by Guenther). It includes extensive coverage of behavioral and neuroimaging studies of speech as well as speech disorders and ties them together with a unifying theoretical framework.

Hickok, G., and S. L. Small. 2015. Neurobiology of language. London: Academic Press.

This voluminous textbook edited by Hickok and Small covers a wide range of topics related to neurobiology of language. It includes a section devoted to speaking which covers neurobiology of speech production, motor control perspective, neuroimaging studies, and aphasia.

Levelt, W. J. M. 1993. Speaking: From intention to articulation. Cambridge, MA: MIT.

This seminal textbook is worth reading particularly for its detailed explanation of the author’s speech model, which is part of his broader model of language. The book is slightly dated, having been released in 1993, but chapters 8–12 remain especially relevant to readers interested in phonetic plans, articulation, and self-monitoring.

Levelt, W. J. M. 2013. A history of psycholinguistics: The pre-Chomskyan era. Oxford: Oxford University Press.

Levelt published another important book detailing the development of psycholinguistics. As its title suggests, it focuses on the early history of the discipline, so readers interested in historical research on speech will find an abundance of speech-related material in it. It covers a wide range of psycholinguistic specializations.

Rosen, S., and P. Howell. 2011. Signals and systems for speech and hearing. 2d ed. Bingley, UK: Emerald.

Rosen and Howell provide a low-level explanation of speech signals and systems. The book includes informative charts explaining the basic acoustic and signal processing concepts useful for understanding speech science.

Schnupp, J., I. Nelken, and A. King. 2011. Auditory neuroscience: Making sense of sound. Cambridge, MA: MIT.

A general introduction to speech concepts with a main focus on neuroscience. The textbook is linked with a companion website that provides demonstrations of the described phenomena.


2.1 How Humans Produce Speech

Phonetics studies human speech. Speech is produced by bringing air from the lungs to the larynx (respiration), where the vocal folds may be held open to allow the air to pass through or may vibrate to make a sound (phonation). The airflow from the lungs is then shaped by the articulators in the mouth and nose (articulation).

Video script.

The field of phonetics studies the sounds of human speech. When we study speech sounds we can consider them from two angles. Acoustic phonetics, in addition to being part of linguistics, is also a branch of physics. It’s concerned with the physical, acoustic properties of the sound waves that we produce. We’ll talk some about the acoustics of speech sounds, but we’re primarily interested in articulatory phonetics, that is, how we humans use our bodies to produce speech sounds. Producing speech needs three mechanisms.

The first is a source of energy.  Anything that makes a sound needs a source of energy.  For human speech sounds, the air flowing from our lungs provides energy.

The second is a source of the sound: air flowing from the lungs arrives at the larynx. Put your hand on the front of your throat and gently feel the bony part under your skin. That’s the front of your larynx. It’s not actually made of bone; it’s cartilage and muscle. This picture shows what the larynx looks like from the front.

[Image: external view of the larynx]

This next picture is a view down a person’s throat.

[Image: cartilages of the larynx, seen from above]

What you see here is that the opening of the larynx can be covered by two triangle-shaped folds of tissue. These are often called “vocal cords” but they’re not really like cords or strings. A better name for them is vocal folds.

The opening between the vocal folds is called the glottis.

We can control our vocal folds to make a sound.  I want you to try this out so take a moment and close your door or make sure there’s no one around that you might disturb.

First I want you to say the word “uh-oh”. Now say it again, but stop half-way through, “Uh-”. When you do that, you’ve closed your vocal folds by bringing them together. This stops the air flowing through your vocal tract.  That little silence in the middle of “uh-oh” is called a glottal stop because the air is stopped completely when the vocal folds close off the glottis.

Now I want you to open your mouth and breathe out quietly, “haaaaaaah”. When you do this, your vocal folds are open and the air is passing freely through the glottis.

Now breathe out again and say “aaah”, as if the doctor is looking down your throat.  To make that “aaaah” sound, you’re holding your vocal folds close together and vibrating them rapidly.

When we speak, we make some sounds with vocal folds open, and some with vocal folds vibrating.  Put your hand on the front of your larynx again and make a long “SSSSS” sound.  Now switch and make a “ZZZZZ” sound. You can feel your larynx vibrate on “ZZZZZ” but not on “SSSSS”.  That’s because [s] is a voiceless sound, made with the vocal folds held open, and [z] is a voiced sound, where we vibrate the vocal folds.  Do it again and feel the difference between voiced and voiceless.

Now take your hand off your larynx and plug your ears and make the two sounds again with your ears plugged. You can hear the difference between voiceless and voiced sounds inside your head.
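If you want to check this computationally rather than by touch, the sketch below is a minimal Python example, assuming only NumPy and a mono waveform x sampled at sr Hz. It labels short frames as voiced or voiceless from their zero-crossing rate; the 0.3 threshold is an illustrative assumption, not a calibrated value.

    # Minimal sketch: distinguishing voiced [z]-like frames from voiceless
    # [s]-like frames in a waveform. Assumes a mono float signal x at sr Hz;
    # the 0.3 threshold is illustrative, not calibrated.
    import numpy as np

    def frame_signal(x, frame_len, hop):
        """Slice x into overlapping frames of length frame_len."""
        n = 1 + max(0, (len(x) - frame_len) // hop)
        return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

    def voiced_decision(x, sr, frame_ms=25, hop_ms=10):
        """Label each frame voiced/voiceless from its zero-crossing rate.

        Voiceless fricatives like [s] are noise-like and cross zero often;
        voiced sounds like [z] or vowels are dominated by low-frequency
        periodicity from the vibrating vocal folds, so they cross zero rarely.
        """
        frame_len = int(sr * frame_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        frames = frame_signal(x, frame_len, hop)
        zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
        return zcr < 0.3  # True = likely voiced, False = likely voiceless

    if __name__ == "__main__":
        sr = 16000
        t = np.arange(sr) / sr
        z_like = np.sin(2 * np.pi * 120 * t)                    # crude periodic "voiced" signal
        s_like = np.random.default_rng(0).standard_normal(sr)   # noise-like "voiceless" signal
        print(voiced_decision(z_like, sr).mean())  # close to 1.0
        print(voiced_decision(s_like, sr).mean())  # close to 0.0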

I said at the beginning that there are three crucial mechanisms involved in producing speech, and so far we’ve looked at only two:

  • Energy comes from the air supplied by the lungs.
  • The vocal folds produce sound at the larynx.
  • The sound is then filtered, or shaped, by the articulators.

The oral cavity is the space in your mouth. The nasal cavity, obviously, is the space inside and behind your nose. And of course, we use our tongues, lips, teeth and jaws to articulate speech as well.  In the next unit, we’ll look in more detail at how we use our articulators.

So to sum up, the three mechanisms that we use to produce speech are:

  • respiration at the lungs,
  • phonation at the larynx, and
  • articulation in the mouth.

Essentials of Linguistics Copyright © 2018 by Catherine Anderson is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, except where otherwise noted.

The Voice Foundation

Anatomy and Physiology of Voice Production: Understanding How Voice Is Produced

Larynx: Highly specialized structure atop the windpipe responsible for sound production, air passage during breathing, and protecting the airway during swallowing.

Vocal Folds (also called Vocal Cords): “Fold-like” soft tissue that is the main vibratory component of the voice box; comprised of a cover (epithelium and superficial lamina propria), vocal ligament (intermediate and deep laminae propria), and body (thyroarytenoid muscle).

Glottis (also called Rima Glottidis): Opening between the two vocal folds; the glottis opens during breathing and closes during swallowing and sound production.

Voice as We Know It = Voiced Sound + Resonance + Articulation

The “spoken word” results from three components of voice production: voiced sound, resonance, and articulation.

Voiced sound: The basic sound produced by vocal fold vibration is called “voiced sound.” This is frequently described as a “buzzy” sound. Voiced sound for singing differs significantly from voiced sound for speech.

Resonance: Voiced sound is amplified and modified by the vocal tract resonators (the throat, mouth cavity, and nasal passages). The resonators produce a person’s recognizable voice.

Articulation: The vocal tract articulators (the tongue, soft palate, and lips) modify the voiced sound. The articulators produce recognizable words.
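The “voiced sound + resonance” decomposition above is essentially the classic source-filter view of voice. The following is a minimal, illustrative Python sketch of that idea (not The Voice Foundation’s own model): a buzzy glottal pulse train stands in for voiced sound, and a few two-pole resonators stand in for vocal tract formants. The formant frequencies and bandwidths are assumed, roughly /a/-like values.

    # Minimal source-filter sketch of "voiced sound + resonance": a buzzy glottal
    # pulse train (the voiced source) is shaped by resonators approximating vocal
    # tract formants. Formant frequencies/bandwidths are illustrative, not measured.
    import numpy as np
    from scipy.signal import lfilter

    def glottal_source(f0, dur, sr):
        """Impulse train at the fundamental frequency: a crude 'buzzy' voiced source."""
        n = int(dur * sr)
        src = np.zeros(n)
        period = int(sr / f0)
        src[::period] = 1.0
        return src

    def formant_filter(x, sr, formants=((700, 130), (1200, 70), (2600, 160))):
        """Pass the source through cascaded two-pole resonators (one per formant)."""
        y = x
        for freq, bw in formants:
            r = np.exp(-np.pi * bw / sr)                  # pole radius from bandwidth
            theta = 2 * np.pi * freq / sr                 # pole angle from centre frequency
            a = [1.0, -2 * r * np.cos(theta), r ** 2]     # resonator denominator coefficients
            y = lfilter([1.0 - r], a, y)
        return y / (np.max(np.abs(y)) + 1e-12)            # normalize amplitude

    if __name__ == "__main__":
        sr = 16000
        buzz = glottal_source(f0=120, dur=0.5, sr=sr)      # ~male speaking pitch
        vowel = formant_filter(buzz, sr)                   # "articulated" /a/-like output
        print(vowel.shape, float(vowel.max()))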

Voice Depends on Vocal Fold Vibration and Resonance

Sound is produced when aerodynamic phenomena cause vocal folds to vibrate rapidly in a sequence of vibratory cycles with a speed of about:

  • 110 cycles per second or Hz (men) = lower pitch
  • 180 to 220 cycles per second (women) = medium pitch
  • 300 cycles per second (children) = higher pitch

Higher voice: increase in frequency of vocal fold vibration. Louder voice: increase in amplitude of vocal fold vibration.

Vibratory Cycle = Open + Close Phase

The vocal fold vibratory cycle has phases that include an orderly sequence of opening and closing the top and bottom of the vocal folds, letting short puffs of air through at high speed. Air pressure is converted into sound waves.

Not Like a Guitar String

Vocal folds vibrate when excited by aerodynamic phenomena; they are not plucked like a guitar string. Air pressure from the lungs controls the open phase. The passing air column creates a trailing “Bernoulli effect,” which controls the close phase.

Voice production involves a three-step process.

  • A column of air pressure is moved towards the vocal folds
  • Air is moved out of the lungs and towards the vocal folds by coordinated action of the diaphragm, abdominal muscles, chest muscles, and rib cage
  • Vocal folds are moved to midline by voice box muscles, nerves, and cartilages
  • Column of air pressure opens bottom of vocal folds
  • Column of air continues to move upwards, now towards the top of vocal folds, and opens the top
  • The low pressure created behind the fast-moving air column produces a “Bernoulli effect” which causes the bottom to close, followed by the top
  • Closure of the vocal folds cuts off the air column and releases a pulse of air
  • New cycle repeats
  • Loudness: Increased airflow “blows” the vocal folds wider apart, and they stay apart longer during a vibratory cycle, thus increasing the amplitude of the sound pressure wave
  • Pitch: An increase in the frequency of vocal fold vibration raises pitch (a short computational sketch relating both quantities to a recorded waveform follows this list)
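As a rough computational companion to the loudness and pitch points above, the sketch below estimates the two corresponding signal quantities from a short voiced frame: fundamental frequency by autocorrelation (the pitch correlate) and RMS amplitude (the loudness correlate). It assumes NumPy and a mono waveform; the 60–400 Hz search range is an assumption spanning typical adult and child speaking pitches.

    # Minimal sketch: fundamental frequency (pitch correlate) via autocorrelation
    # and RMS amplitude (loudness correlate) for one voiced frame.
    import numpy as np

    def estimate_f0(x, sr, fmin=60.0, fmax=400.0):
        """Estimate fundamental frequency of a voiced frame by autocorrelation."""
        x = x - np.mean(x)
        ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation, lags >= 0
        lo, hi = int(sr / fmax), int(sr / fmin)
        lag = lo + int(np.argmax(ac[lo:hi]))                # strongest periodic lag
        return sr / lag

    def rms_amplitude(x):
        """Root-mean-square amplitude, a simple correlate of loudness."""
        return float(np.sqrt(np.mean(np.square(x))))

    if __name__ == "__main__":
        sr = 16000
        t = np.arange(int(0.1 * sr)) / sr
        frame = 0.4 * np.sin(2 * np.pi * 200 * t)   # 200 Hz tone, amplitude 0.4
        print(round(estimate_f0(frame, sr), 1))     # ~200.0 Hz
        print(round(rms_amplitude(frame), 3))       # ~0.283 (0.4 / sqrt(2))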

In the closed position, maintained by muscle, the vocal folds open and close in a cyclical, ordered, and even manner (steps 1–10) as a column of air pressure from the lungs below flows through. This very rapid, ordered closing and opening produced by the column of air is referred to as the mucosal wave. The lower edge opens first (2–3), followed by the upper edge, letting air flow through (4–6). The air column that flows through creates a “Bernoulli effect,” which causes the lower edge to close (7–9) as the air escapes upwards. The escaping “puffs of air” (10) are converted to sound, which is then transformed into voice by the vocal tract resonators. Any change that affects this mucosal wave – stiffness of the vocal fold layers, weakness or failure of closure, or an imbalance between the right and left vocal folds from a lesion on one fold – causes voice problems. (For more information, see Anatomy: How Breakdowns Result in Voice Disorders.)

  • Vocal tract – resonators and articulators: The nose, pharynx, and mouth amplify and modify sound, allowing it to take on the distinctive qualities of voice. The way that voice is produced is analogous to the way that sound is produced by a trombone. The trombone player produces sound at the mouthpiece of the instrument with his lips vibrating from air that passes from the mouth. The vibration within the mouthpiece produces sound, which is then altered or “shaped” as it passes throughout the instrument. As the slide of the trombone is changed, the sound of the musical instrument is similarly changed.

Amazing Outcomes of Human Voice

The human voice can be modified in many ways. Consider the spectrum of sounds – whispering, speaking, orating, shouting – as well as the different sounds that are possible in different forms of vocal music, such as rock singing, gospel singing, and opera singing.

Key Factors for Normal Vocal Fold Vibration

To vibrate efficiently, vocal folds need to be:

At the midline or “closed”: Failure to move the vocal folds to the midline, or any lesion which prevents the vocal fold edges from meeting, allows air to escape and results in a breathy voice. Key players: muscles, cartilages, nerves.

Pliable: The natural “built-in” elasticity of vocal folds makes them pliable. The top, edge, and bottom of the vocal folds that meet in the midline and vibrate need to be pliable. Changes in vocal fold pliability, even if limited to just one region or “spot,” can cause voice disorders, as seen in vocal fold scarring. Key players: epithelium, superficial lamina propria.

“Just right” tension: Inability to adjust tension during singing can cause a failure to reach high notes or breaks in voice. Key players: muscle, nerve, cartilages.

“Just right” mass: Changes in the soft tissue bulk of the vocal folds – such as decrease or thinning, as in scarring, or increase or swelling, as in Reinke’s edema – produce many voice symptoms: hoarseness, altered voice pitch, effortful phonation, etc. (For more information, see Vocal Fold Scarring and Reinke’s Edema.) Key players: muscles, nerves, epithelium, superficial lamina propria.


Single-neuronal elements of speech production in humans

Arjun R. Khanna, William Muñoz, Young Joon Kim, Yoav Kfir, Angelique C. Paulk, Mohsen Jamali, Jing Cai, Martina L. Mustroph, Irene Caprara, Richard Hardstone, Mackenna Mejdell, Domokos Meszéna, Abigail Zuckerman, Jeffrey Schweitzer, Sydney Cash & Ziv M. Williams

Nature, volume 626, pages 603–610 (2024). Open access. Published: 31 January 2024.

Humans are capable of generating extraordinarily diverse articulatory movement combinations to produce meaningful speech. This ability to orchestrate specific phonetic sequences, and their syllabification and inflection over subsecond timescales allows us to produce thousands of word sounds and is a core component of language 1 , 2 . The fundamental cellular units and constructs by which we plan and produce words during speech, however, remain largely unknown. Here, using acute ultrahigh-density Neuropixels recordings capable of sampling across the cortical column in humans, we discover neurons in the language-dominant prefrontal cortex that encoded detailed information about the phonetic arrangement and composition of planned words during the production of natural speech. These neurons represented the specific order and structure of articulatory events before utterance and reflected the segmentation of phonetic sequences into distinct syllables. They also accurately predicted the phonetic, syllabic and morphological components of upcoming words and showed a temporally ordered dynamic. Collectively, we show how these mixtures of cells are broadly organized along the cortical column and how their activity patterns transition from articulation planning to production. We also demonstrate how these cells reliably track the detailed composition of consonant and vowel sounds during perception and how they distinguish processes specifically related to speaking from those related to listening. Together, these findings reveal a remarkably structured organization and encoding cascade of phonetic representations by prefrontal neurons in humans and demonstrate a cellular process that can support the production of speech.

Humans can produce a remarkably wide array of word sounds to convey specific meanings. To produce fluent speech, linguistic analyses suggest a structured succession of processes involved in planning the arrangement and structure of phonemes in individual words 1 , 2 . These processes are thought to occur rapidly during natural speech and to recruit prefrontal regions in parts of the broader language network known to be involved in word planning 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 and sentence construction 13 , 14 , 15 , 16 and which widely connect with downstream areas that play a role in their motor production 17 , 18 , 19 . Cortical surface recordings have also demonstrated that phonetic features may be regionally organized 20 and that they can be decoded from local-field activities across posterior prefrontal and premotor areas 21 , 22 , 23 , suggesting an underlying cortical structure. Understanding the basic cellular elements by which we plan and produce words during speech, however, has remained a significant challenge.

Although previous studies in animal models 24 , 25 , 26 and more recent investigation in humans 27 , 28 have offered an important understanding of how cells in primary motor areas relate to vocalization movements and the production of sound sequences such as song, they do not reveal the neuronal process by which humans construct individual words and by which we produce natural speech 29 . Further, although linguistic theory based on behavioural observations has suggested tightly coupled sublexical processes necessary for the coordination of articulators during word planning 30 , how specific phonetic sequences, their syllabification or inflection are precisely coded for by individual neurons remains undefined. Finally, whereas previous studies have revealed a large regional overlap in areas involved in articulation planning and production 31 , 32 , 33 , 34 , 35 , little is known about whether and how these linguistic process may be uniquely represented at a cellular scale 36 , what their cortical organization may be or how mechanisms specifically related to speech production and perception may differ.

Single-neuronal recordings have the potential to begin revealing some of the basic functional building blocks by which humans plan and produce words during speech and study these processes at spatiotemporal scales that have largely remained inaccessible 37 , 38 , 39 , 40 , 41 , 42 , 43 , 44 , 45 . Here, we used an opportunity to combine recently developed ultrahigh-density microelectrode arrays for acute intraoperative neuronal recordings, speech tracking and modelling approaches to begin addressing these questions.

Neuronal recordings during natural speech

Single-neuronal recordings were obtained from the language-dominant (left) prefrontal cortex in participants undergoing planned intraoperative neurophysiology (Fig. 1a ; section on ‘Acute intraoperative single-neuronal recordings’). These recordings were obtained from the posterior middle frontal gyrus 10 , 46 , 47 , 48 , 49 , 50 in a region known to be broadly involved in word planning 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 and sentence construction 13 , 14 , 15 , 16 and to connect with neighbouring motor areas shown to play a role in articulation 17 , 18 , 19 and lexical processing 51 , 52 , 53 (Extended Data Fig. 1a ). This region was traversed during recordings as part of planned neurosurgical care and roughly ranged in distribution from alongside anterior area 55b to 8a, with sites varying by approximately 10 mm (s.d.) across subjects (Extended Data Fig. 1b ; section on ‘Anatomical localization of recordings’). Moreover, the participants undergoing recordings were awake and thus able to perform language-based tasks (section on ‘Study participants’), together providing an extraordinarily rare opportunity to study the action potential (AP) dynamics of neurons during the production of natural speech.

Figure 1

a , Left, single-neuronal recordings were confirmed to localize to the posterior middle frontal gyrus of language-dominant prefrontal cortex in a region known to be involved in word planning and production (Extended Data Fig. 1a,b ); right, acute single-neuronal recordings were made using Neuropixels arrays (Extended Data Fig. 1c,d ); bottom, speech production task and controls (Extended Data Fig. 2a ). b , Example of phonetic groupings based on the planned places of articulation (Extended Data Table 1 ). c , A ten-dimensional feature space was constructed to provide a compositional representation of all phonemes per word. d , Peri-event time histograms were constructed by aligning the APs of each neuron to word onset at millisecond resolution. Data are presented as mean (line) values ± s.e.m. (shade). Inset, spike waveform morphology and scale bar (0.5 ms). e , Left, proportions of modulated neurons that selectively changed their activities to specific planned phonemes; right, tuning curve for a cell that was preferentially tuned to velar consonants. f , Average z -scored firing rates as a function of the Hamming distance between the preferred phonetic composition of the neuron (that producing largest change in activity) and all other phonetic combinations. Here, a Hamming distance of 0 indicates that the words had the same phonetic compositions, whereas a Hamming distance of 1 indicates that they differed by a single phoneme. Data are presented as mean (line) values ± s.e.m. (shade). g , Decoding performance for planned phonemes. The orange points provide the sampled distribution for the classifier’s ROC-AUC; n  = 50 random test/train splits; P  = 7.1 × 10 −18 , two-sided Mann–Whitney U -test. Data are presented as mean ± s.d.

To obtain acute recordings from individual cortical neurons and to reliably track their AP activities across the cortical column, we used ultrahigh-density, fully integrated linear silicon Neuropixels arrays that allowed for high throughput recordings from single cortical units 54 , 55 . To further obtain stable recordings, we developed custom-made software that registered and motion-corrected the AP activity of each unit and kept track of their position across the cortical column (Fig. 1a , right) 56 . Only well-isolated single units, with low relative neighbour noise and stable waveform morphologies consistent with that of neocortical neurons were used (Extended Data Fig. 1c,d ; section on ‘Acute intraoperative single-neuronal recordings’). Altogether, we obtained recordings from 272 putative neurons across five participants for an average of 54 ± 34 (s.d.) single units per participant (range 16–115 units).

Next, to study neuronal activities during the production of natural speech and to track their per word modulation, the participants performed a naturalistic speech production task that required them to articulate broadly varied words in a replicable manner (Extended Data Fig. 2a ) 57 . Here, the task required the participants to produce words that varied in phonetic, syllabic and morphosyntactic content and to provide them in a structured and reproducible format. It also required them to articulate the words independently of explicit phonetic cues (for example, from simply hearing and then repeating the same words) and to construct them de novo during natural speech. Extra controls were further used to evaluate for preceding word-related responses, sensory–perceptual effects and phonetic–acoustic properties as well as to evaluate the robustness and generalizability of neuronal activities (section on ‘Speech production task’).

Together, the participants produced 4,263 words for an average of 852.6 ± 273.5 (s.d.) words per participant (range 406–1,252 words). The words were transcribed using a semi-automated platform and aligned to AP activity at millisecond resolution (section on ‘Audio recordings and task synchronization’) 51 . All participants were English speakers and showed comparable word-production performances (Extended Data Fig. 2b ).

Representations of phonemes by neurons

To first examine the relation between single-neuronal activities and the specific speech organs involved 58 , 59 , we focused our initial analyses on the primary places of articulation 60 . The places of articulation describe the points where constrictions are made between an active and a passive articulator and are what largely give consonants their distinctive sounds. Thus, for example, whereas bilabial consonants (/p/ and /b/) involve the obstruction of airflow at the lips, velar consonants are articulated with the dorsum of the tongue placed against the soft palate (/k/ and /g/; Fig. 1b ). To further examine sounds produced without constriction, we also focused our initial analyses on vowels in relation to the relative height of the tongue (mid-low and high vowels). More phonetic groupings based on the manners of articulation (configuration and interaction of articulators) and primary cardinal vowels (combined positions of the tongue and lips) are described in Extended Data Table 1 .

Next, to provide a compositional phonetic representation of each word, we constructed a feature space on the basis of the constituent phonemes of each word (Fig. 1c , left). For instance, the words ‘like’ and ‘bike’ would be represented uniquely in vector space because they differ by a single phoneme (‘like’ contains alveolar /l/ whereas ‘bike’ contains bilabial /b/; Fig. 1c , right). The presence of a particular phoneme was therefore represented by a unitary value for its respective vector component, together yielding a vectoral representation of the constituent phonemes of each word (section on ‘Constructing a word feature space’). Generalized linear models (GLMs) were then used to quantify the degree to which variations in neuronal activity during planning could be explained by individual phonemes across all possible combinations of phonemes per word (section on ‘Single-neuronal analysis’).
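To make the feature-space construction concrete, the following is an illustrative Python sketch of this kind of binary phonetic vector and the Hamming distances between words. The small phoneme inventory and example words are stand-ins, not the authors' actual ten-dimensional feature set; note that in this toy per-phoneme encoding a single phoneme substitution changes two vector components.

    # Illustrative reconstruction of the kind of feature space described above:
    # each word becomes a binary vector marking which phonemes it contains.
    # The phoneme inventory here is a small stand-in, not the paper's feature set.
    import numpy as np

    PHONEMES = ["l", "b", "k", "ay", "d", "g", "ah"]          # toy inventory
    INDEX = {p: i for i, p in enumerate(PHONEMES)}

    def word_vector(phones):
        """Binary vector: 1 if the phoneme occurs anywhere in the word."""
        v = np.zeros(len(PHONEMES), dtype=int)
        for p in phones:
            v[INDEX[p]] = 1
        return v

    def hamming(u, v):
        """Number of feature components in which two words differ."""
        return int(np.sum(u != v))

    like = word_vector(["l", "ay", "k"])
    bike = word_vector(["b", "ay", "k"])
    dog = word_vector(["d", "ah", "g"])

    print(hamming(like, bike))   # 2 components differ: /l/ vs /b/, one phoneme substitution
    print(hamming(like, dog))    # larger distance: the words share no phonemes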

Overall, we find that the firing activities of many of the neurons (46.7%, n = 127 of 272 units) were explained by the constituent phonemes of the word before utterance (−500 to 0 ms; GLM likelihood ratio test, P < 0.01), meaning that their activity patterns were informative of the phonetic content of the word. Among these, the activities of 56 neurons (20.6% of the 272 units recorded) were further selectively tuned to the planned production of specific phonemes (two-sided Wald test for each GLM coefficient, P < 0.01, Bonferroni-corrected across all phoneme categories; Fig. 1d,e and Extended Data Figs. 2 and 3). Thus, for example, whereas certain neurons changed their firing rate when the upcoming words contained bilabial consonants (for example, /p/ or /b/), others changed their firing rate when they contained velar consonants. Of these neurons, most encoded information both about the planned places and manners of articulation (n = 37 or 66% overlap, two-sided hypergeometric test, P < 0.0001) or planned places of articulation and vowels (n = 27 or 48% overlap, two-sided hypergeometric test, P < 0.0001; Extended Data Fig. 4). Most also reflected the spectral properties of the articulated words on a phoneme-by-phoneme basis (64%, n = 36 of 56; two-sided hypergeometric test, P = 1.1 × 10 −10; Extended Data Fig. 5a,b); together providing detailed information about the upcoming phonemes before utterance.
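A hedged sketch of the style of analysis described here, using a Poisson GLM from statsmodels on synthetic data: spike counts in the planning window are regressed on binary phoneme features, with a likelihood-ratio test against an intercept-only model and per-coefficient Wald tests. This is illustrative only and is not the authors' code or exact model specification.

    # Sketch: Poisson GLM relating a neuron's pre-utterance spike count to binary
    # phoneme features, with a likelihood-ratio test against an intercept-only model.
    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import chi2

    rng = np.random.default_rng(1)
    n_words, n_features = 400, 10
    X = rng.integers(0, 2, size=(n_words, n_features))        # phoneme indicators per word
    true_beta = np.zeros(n_features)
    true_beta[3] = 0.8                                        # neuron "tuned" to feature 3
    rate = np.exp(0.5 + X @ true_beta)                        # expected firing rate per word
    y = rng.poisson(rate)                                     # spike counts in planning window

    full = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()
    null = sm.GLM(y, np.ones((n_words, 1)), family=sm.families.Poisson()).fit()

    lr_stat = 2 * (full.llf - null.llf)                       # likelihood-ratio statistic
    p_value = chi2.sf(lr_stat, df=n_features)                 # df = number of phoneme features
    print(f"LR test p = {p_value:.2g}")                        # small p: activity explained by phonemes
    print(full.params[1:])                                     # per-feature coefficients (Wald p-values in full.pvalues)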

Because we had a complete representation of the upcoming phonemes for each word, we could also quantify the degree to which neuronal activities reflected their specific combinations. For example, we could ask whether the activities of certain neurons not only reflected planned words with velar consonants but also words that contained the specific combination of both velar and labial consonants. By aligning the activity of each neuron to its preferred phonetic composition (that is, the specific combination of phonemes to which the neuron most strongly responded) and by calculating the Hamming distance between this and all other possible phonetic compositions across words (Fig. 1c , right; section on ‘Single-neuronal analysis’), we find that the relation between the vectoral distances across words and neuronal activity was significant (two-sided Spearman’s ρ  = −0.97, P  = 5.14 × 10 −7 ; Fig. 1f ). These neurons therefore seemed not only to encode specific planned phonemes but also their specific composition with upcoming words.

Finally, we asked whether the constituent phonemes of the word could be robustly decoded from the activity patterns of the neuronal population. Using multilabel decoders to classify the upcoming phonemes of words not used for model training (section on ‘Population modelling’), we find that the composition of phonemes could be predicted from neuronal activity with significant accuracy (receiver operating characteristic area under the curve; ROC-AUC = 0.75 ± 0.03 mean ± s.d. observed versus 0.48 ± 0.02 chance, P  < 0.001, two-sided Mann–Whitney U -test; Fig. 1g ). Similar findings were also made when examining the planned manners of articulation (AUC = 0.77 ± 0.03, P  < 0.001, two-sided Mann–Whitney U -test), primary cardinal vowels (AUC = 0.79 ± 0.04, P  < 0.001, two-sided Mann–Whitney U -test) and their spectral properties (AUC = 0.75 ± 0.03, P  < 0.001, two-sided Mann–Whitney U -test; Extended Data Fig. 5a , right). Taken together, these neurons therefore seemed to reliably predict the phonetic composition of the upcoming words before utterance.
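The population decoding step can be sketched as follows with scikit-learn on synthetic data: a one-vs-rest logistic-regression multilabel classifier predicts which phoneme features an upcoming word contains from the units' firing rates, scored by ROC-AUC over repeated train/test splits. The data-generation details and classifier choice are assumptions, not the paper's exact pipeline.

    # Sketch: multilabel decoding of phoneme features from pseudo-population
    # firing rates, scored by macro-averaged ROC-AUC over 50 train/test splits.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(2)
    n_words, n_units, n_features = 600, 120, 10
    Y = rng.integers(0, 2, size=(n_words, n_features))            # phoneme features per word
    W = rng.normal(0, 0.4, size=(n_features, n_units))            # assumed feature-to-rate weights
    rates = Y @ W + rng.normal(0, 1.0, size=(n_words, n_units))   # noisy "firing rates"

    aucs = []
    for split in range(50):                                       # 50 random test/train splits
        Xtr, Xte, Ytr, Yte = train_test_split(rates, Y, test_size=0.2, random_state=split)
        clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(Xtr, Ytr)
        scores = clf.predict_proba(Xte)                           # per-feature probabilities
        aucs.append(roc_auc_score(Yte, scores, average="macro"))
    print(f"ROC-AUC = {np.mean(aucs):.2f} ± {np.std(aucs):.2f}")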

Motoric and perceptual processes

Neurons that reflected the phonetic composition of the words during planning were largely distinct from those that reflected their composition during perception. It is possible, for instance, that similar response patterns could have been observed when simply hearing the words. Therefore, to test for this, we performed an extra ‘perception’ control in three of the participants whereby they listened to, rather than produced, the words (n = 126 recorded units; section on ‘Speech production task’). Here, we find that 29.3% (n = 37) of the neurons showed phonetic selectivity during listening (Extended Data Fig. 6a) and that their activities could be used to accurately predict the phonemes being heard (AUC = 0.70 ± 0.03 observed versus 0.48 ± 0.02 chance, P < 0.001, two-sided Mann–Whitney U-test; Extended Data Fig. 6b). We also find, however, that these cells were largely distinct from those that showed phonetic selectivity during planning (n = 10; 7.9% overlap) and that their activities were uninformative of the phonemic content of the words being planned (AUC = 0.48 ± 0.01, P = 0.99, two-sided Mann–Whitney U-test; Extended Data Fig. 6b). Similar findings were also made when replaying the participants’ own voices to them (‘playback’ control; 0% overlap in neurons); together suggesting that speaking and listening engaged largely distinct but complementary sets of cells in the neural population.

Given the above observations, we also examined whether the activities of the neurons could have been explained by the acoustic–phonetic properties of the preceding spoken words. For example, it is possible that the activities of the neurons may have partly reflected the phonetic composition of the previously articulated word or its motoric components. Thus, to test for this, we repeated our analyses but now excluded words in which the preceding articulated word contained the phoneme being decoded (section on ‘Single-neuronal analysis’) and find that decoding performance remained significant (AUC = 0.72 ± 0.1, P < 0.001, two-sided Mann–Whitney U-test). We also find that decoding performance remained significant when restricting the analysis window (−400 to 0 ms instead of −500 to 0 ms; AUC = 0.72 ± 0.1, P < 0.001, two-sided Mann–Whitney U-test) or shifting it closer to utterance (−300 to +200 ms, AUC = 0.76 ± 0.1, P < 0.001, two-sided Mann–Whitney U-test); indicating that these neurons coded for the phonetic composition of the upcoming words.

Syllabic and morphological features

To transform sets of consonants and vowels into words, the planned phonemes must also be arranged and segmented into distinct syllables 61 . For example, even though the words ‘casting’ and ‘stacking’ possess the same constituent phonemes, they are distinguished by their specific syllabic structure and order. Therefore, to examine whether neurons in the population may further reflect these sublexical features, we created an extra vector space based on the specific order and segmentation of phonemes (section on ‘Constructing a word feature space’). Here, focusing on the most common syllables to allow for tractable neuronal analysis (Extended Data Table 1 ), we find that the activities of 25.0% ( n  = 68 of 272) of the neurons reflected the presence of specific planned syllables (two-sided Wald test for each GLM coefficient, P  < 0.01, Bonferroni-corrected across all syllable categories; Fig. 2a,b ). Thus, whereas certain neurons may respond selectively to a velar-low-alveolar syllable, other neurons may respond selectively to an alveolar-low-velar syllable. Together, the neurons responded preferentially to specific syllables when tested across words (two-sided Spearman’s ρ  = −0.96, P  = 1.85 × 10 −6 ; Fig. 2c ) and accurately predicted their content (AUC = 0.67 ± 0.03 observed versus 0.50 ± 0.02 chance, P  < 0.001, two-sided Mann–Whitney U -test; Fig. 2d ); suggesting that these subsets of neurons encoded information about the syllables.

Figure 2

a , Peri-event time histograms were constructed by aligning the APs of each neuron to word onset. Data are presented as mean (line) values ± s.e.m. (shade). Examples of two representative neurons which selectively changed their activity to specific planned syllables. Inset, spike waveform morphology and scale bar (0.5 ms). b , Scatter plots of D 2 values (the degree to which specific features explained neuronal response, n  = 272 units) in relation to planned phonemes, syllables and morphemes. c , Average z -scored firing rates as a function of the Hamming distance between the preferred syllabic composition and all other compositions of the neuron. Data are presented as mean (line) values ± s.e.m. (shade). d , Decoding performance for planned syllables. The orange points provide the sampled distribution for the classifier’s ROC-AUC values ( n  = 50 random test/train splits; P  = 7.1 × 10 −18 two-sided Mann–Whitney U -test). Data are presented as mean ± s.d. e , To evaluate the selectivity of neurons to specific syllables, their activities were further compared for words that contained the preferred syllable of each neuron (that is, the syllable to which they responded most strongly; green) to (i) words that contained one or more of same individual phonemes but not necessarily their preferred syllable, (ii) words that contained different phonemes and syllables, (iii) words that contained the same phonemes but divided across different syllables and (iv) words that contained the same phonemes in a syllable but in different order (grey). Neuronal activities across all comparisons (to green points) were significant ( n  = 113; P  = 6.2 × 10 −20 , 8.8 × 10 −20 , 4.2 × 10 −20 and 1.4 × 10 −20 , for the comparisons above, respectively; two-sided Wilcoxon signed-rank test). Data are presented as mean (dot) values ± s.e.m.

Next, to confirm that these neurons were selectively tuned to specific syllables, we compared their activities for words that contained the preferred syllable of each neuron (for example, /d-iy/) to words that simply contained their constituent phonemes (for example, d or iy). Thus, for example, if these neurons reflected individual phonemes irrespective of their specific order, then we would observe no difference in response. On the basis of these comparisons, however, we find that the responses of the neurons to their preferred syllables were significantly greater than to their individual constituent phonemes (z-score difference 0.92 ± 0.04; two-sided Wilcoxon signed-rank test, P < 0.0001; Fig. 2e). We also tested words containing syllables with the same constituent phonemes but in which the phonemes were simply in a different order (for example, /g-ah-d/ versus /d-ah-g/) and again find that the neurons were preferentially tuned to specific syllables (z-score difference 0.99 ± 0.06; two-sided Wilcoxon signed-rank test, P < 1.0 × 10 −6; Fig. 2e). Then, we examined words that contained the same arrangements of phonemes but in which the phonemes themselves belonged to different syllables (for example, /r-oh-b/ versus r-oh/b-, accounting for prosodic emphasis) and similarly find that the neurons were preferentially tuned to specific syllables (z-score difference 1.01 ± 0.06; two-sided Wilcoxon signed-rank test, P < 0.0001; Fig. 2e). Therefore, rather than simply reflecting the phonetic composition of the upcoming words, these subsets of neurons encoded their specific segmentation and order in individual syllables.

Finally, we asked whether certain neurons may code for the inclusion of morphemes. Unlike phonemes, bound morphemes such as ‘–ed’ in ‘directed’ or ‘re–’ in ‘retry’ are capable of carrying specific meanings and are thus thought to be subserved by distinct neural mechanisms 62 , 63 . Therefore, to test for this, we also parsed each word on the basis of whether it contained a suffix or prefix (controlling for word length) and find that the activities of 11.4% ( n  = 31 of 272) of the neurons selectively changed for words that contained morphemes compared to those that did not (two-sided Wald test for each GLM coefficient, P  < 0.01, Bonferroni-corrected across morpheme categories; Extended Data Fig. 5c ). Moreover, neural activity across the population could be used to reliably predict the inclusion of morphemes before utterance (AUC = 0.76 ± 0.05 observed versus 0.52 ± 0.01 for shuffled data, P  < 0.001, two-sided Mann–Whitney U -test; Extended Data Fig. 5c ), together suggesting that the neurons coded for this sublexical feature.

Spatial distribution of neurons

Neurons that encoded information about the sublexical components of the upcoming words were broadly distributed across the cortex and cortical column depth. By tracking the location of each neuron in relation to the Neuropixels arrays, we find that there was a slightly higher preponderance of neurons that were tuned to phonemes (one-sided χ 2 test (2) = 0.7 and 5.2, P  > 0.05, for places and manners of articulation, respectively), syllables (one-sided χ 2 test (2) = 3.6, P  > 0.05) and morphemes (one-sided χ 2 test (2) = 4.9, P  > 0.05) at lower cortical depths, but that this difference was non-significant, suggesting a broad distribution (Extended Data Fig. 7 ). We also find, however, that the proportion of neurons that showed selectivity for phonemes increased as recordings were acquired more posteriorly along the rostral–caudal axis of the cortex (one-sided χ 2 test (4) = 45.9 and 52.2, P  < 0.01, for places and manners of articulation, respectively). Similar findings were also made for syllables and morphemes (one-sided χ 2 test (4) = 31.4 and 49.8, P  < 0.01, respectively; Extended Data Fig. 7 ); together suggesting a gradation of cellular representations, with caudal areas showing progressively higher proportions of selective neurons.

Collectively, the activities of these cell ensembles provided richly detailed information about the phonetic, syllabic and morphological components of upcoming words. Of the neurons that showed selectivity to any sublexical feature, 51% ( n  = 46 of 90 units) were significantly informative of more than one feature. Moreover, the selectivity of these neurons lay along a continuum and were closely correlated (two-sided test of Pearson’s correlation in D 2 across all sublexical feature comparisons, r  = 0.80, 0.51 and 0.37 for phonemes versus syllables, phonemes versus morphemes and syllables versus morphemes, respectively, all P  < 0.001; Fig. 2b ), with most cells exhibiting a mixture of representations for specific phonetic, syllabic or morphological features (two-sided Wilcoxon signed-rank test, P  < 0.0001). Figure 3a further illustrates this mixture of representations (Fig. 3a , left; t -distributed stochastic neighbour embedding (tSNE)) and their hierarchical structure (Fig. 3a , right; D 2 distribution), together revealing a detailed characterization of the phonetic, syllabic and morphological components of upcoming words at the level of the cell population.
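The population visualization mentioned here can be sketched with scikit-learn: each neuron is a point in a space of selectivity values (for example, a D² value per sublexical feature), embedded in two dimensions with t-SNE and grouped by agglomerative hierarchical clustering. The selectivity matrix below is random stand-in data and the parameter choices are assumptions, not the paper's settings.

    # Sketch: t-SNE embedding and agglomerative clustering of per-neuron
    # selectivity values for phonemes, syllables and morphemes (synthetic data).
    import numpy as np
    from sklearn.manifold import TSNE
    from sklearn.cluster import AgglomerativeClustering

    rng = np.random.default_rng(3)
    n_neurons = 272
    # columns: selectivity (D^2-like values) for phonemes, syllables, morphemes
    selectivity = np.abs(rng.normal(0, 0.05, size=(n_neurons, 3)))

    embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(selectivity)
    clusters = AgglomerativeClustering(n_clusters=3).fit_predict(selectivity)

    for k in range(3):
        print(f"cluster {k}: {np.sum(clusters == k)} neurons, "
              f"mean selectivity {selectivity[clusters == k].mean(axis=0).round(3)}")
    # `embedding` (n_neurons x 2) can then be scattered, colouring points by cluster
    # or by the dominant feature, in the spirit of the population figure.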

Figure 3

a , Left, response selectivity of neurons to specific word features (phonemes, syllables and morphemes) is visualized across the population using a tSNE procedure (that is, neurons with similar response characteristics were plotted in closer proximity). The hue of each point reflects the degree of selectivity to a particular sublexical feature whereas the size of each point reflects the degree to which those features explained neuronal response. Inset, the relative proportions of neurons showing selectivity and their overlap. Right, the D 2 metric (the degree to which specific features explained neuronal response) for each cell shown individually per feature. b , The relative degree to which the activities of the neurons were explained by the phonetic, syllabic and morphological features of the words ( D 2 metric) and their hierarchical structure (agglomerative hierarchical clustering). c , Distribution of peak decoding performances for phonemes, syllables and morphemes aligned to word utterance onset. Significant differences in peak decoding timings across sample distribution are labelled in brackets above ( n  = 50 random test/train splits; P  = 0.024, 0.002 and 0.002; pairwise, two-sided permutation tests of differences in medians for phonemes versus syllables, syllables versus morphemes and phonemes versus morphemes, respectively; Methods ). Data are presented as median (dot) values ± bootstrapped standard error of the median.

Temporal organization of representations

Given the above observations, we examined the temporal dynamic of neuronal activities during the production of speech. By tracking peak decoding in the period leading up to utterance onset (peak AUC; 50 model testing/training splits) 64 , we find these neural populations showed a consistent morphological–phonetic–syllabic dynamic in which decoding performance first peaked for morphemes. Peak decoding then followed for phonemes and syllables (Fig. 3b and Extended Data Fig. 8a,b ; section on ‘Population modelling’). Overall, decoding performance peaked for the morphological properties of words at −405 ± 67 ms before utterance, followed by peak decoding for phonemes at −195 ± 16 ms and syllables at −70 ± 62 ms (s.e.m.; Fig. 3b ). This temporal dynamic was highly unlikely to have been observed by chance (two-sided Kruskal–Wallis test, H  = 13.28, P  < 0.01) and was largely distinct from that observed during listening (two-sided Kruskal–Wallis test, H  = 14.75, P  < 0.001; Extended Data Fig. 6c ). The activities of these neurons therefore seemed to follow a consistent, temporally ordered morphological–phonetic–syllabic dynamic before utterance.

The activities of these neurons also followed a temporally structured transition from articulation planning to production. When comparing their activities before utterance onset (−500:0 ms) to those after (0:500 ms), we find that neurons which encoded information about the upcoming phonemes during planning encoded similar information during production ( P  < 0.001, Mann–Whitney U -test for phonemes and syllables; Fig. 4a ). Moreover, when using models that were originally trained on words before utterance onset to decode the properties of the articulated words during production (model-switch approach), we find that decoding accuracy for the phonetic, syllabic and morphological properties of the words all remained significant (AUC = 0.76 ± 0.02 versus 0.48 ± 0.03 chance, 0.65 ± 0.03 versus 0.51 ± 0.04 chance, 0.74 ± 0.06 versus 0.44 ± 0.07 chance, for phonemes, syllables and morphemes, respectively; P  < 0.001 for all, two-sided Mann–Whitney U -tests; Extended Data Fig. 8c ). Information about the sublexical features of words was therefore reliably represented during articulation planning and execution by the neuronal population.

Fig. 4

a , Top, the D 2 value of neuronal activity (the degree to which specific features explained neuronal response, n  = 272 units) during word planning (green) and production (orange) sorted across all population neurons. Middle, relationship between explanatory power ( D 2 ) of neuronal activity ( n  = 272 units) for phonemes (Spearman’s ρ  = 0.69), syllables (Spearman’s ρ  = 0.40) and morphemes (Spearman’s ρ  = 0.08) during planning and production ( P  = 1.3 × 10 −39 , P  = 6.6 × 10 −12 , P  = 0.18, respectively, two-sided test of Spearman rank-order correlation). Bottom, the D 2 metric for each cell during production per feature ( n  = 272 units). b , Top left, schematic illustration of speech planning (blue plane) and production (red plane) subspaces as traversed by a neuron for different phonemes (yellow arrows; Extended Data Fig. 9 ). Top right, subspace misalignment quantified by an alignment index (red) or Grassmannian chordal distance (red) compared to that expected from chance (grey), demonstrating that the subspaces occupied by the neural population ( n  = 272 units) during planning and production were distinct. Bottom, projection of neural population activity ( n  = 272 units) during word planning (blue) and production (red) onto the first three PCs for the planning (upper row) and production (lower row) subspaces.

Utilizing a dynamical systems approach to further allow for the unsupervised identification of functional subspaces (that is, wherein neural activity is embedded into a high-dimensional vector space; Fig. 4b, left; section on 'Dynamical system and subspace analysis') 31, 34, 65, 66, we find that the activities of the population were mostly low-dimensional, with more than 90% of the variance in neuronal activity being captured by its first four principal components (Fig. 4b, right). However, when tracking the dimensions in which the neural population evolved over time, we also find that the subspaces which defined neural activity during articulation planning and production were largely distinct. In particular, whereas the first five principal components captured 98.4% of the variance in the trajectory of the population during planning, they captured only 11.9% of the variance in the trajectory during articulation (two-sided permutation test, P < 0.0001; Fig. 4b, bottom and Extended Data Fig. 9). Together, these cell ensembles therefore seemed to occupy largely separate preparatory and motoric subspaces while also allowing for information about the phonetic, syllabic and morphological contents of the words to be stably represented during the production of speech.

Using Neuropixels probes to obtain acute, fine-scaled recordings from single neurons in the language-dominant prefrontal cortex 3 , 4 , 5 , 6 —in a region proposed to be involved in word planning 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 and production 13 , 14 , 15 , 16 —we find a strikingly detailed organization of phonetic representations at a cellular level. In particular, we find that the activities of many of the neurons closely mirrored the way in which the word sounds were produced, meaning that they reflected how individual planned phonemes were generated through specific articulators 58 , 59 . Moreover, rather than simply representing phonemes independently of their order or structure, many of the neurons coded for their composition in the upcoming words. They also reliably predicted the arrangement and segmentation of phonemes into distinct syllables, together suggesting a process that could allow the structure and order of articulatory events to be encoded at a cellular level.

Collectively, this putative mechanism supports the existence of context-general representations of classes of speech sounds that speakers use to construct different word forms. In contrast, coding of sequences of phonemes as syllables may represent a context-specific representation of these speech sounds in a particular segmental context. This combination of context-general and context-specific representation of speech sound classes, in turn, is supportive of many speech production models which suggest that speakers hold abstract representations of discrete phonological units in a context-general way and that, as part of speech planning, these units are organized into prosodic structures that are context-specific 1 , 30 . Although the present study does not reveal whether these representations may be stored in and retrieved from a mental syllabary 1 or are constructed from abstract phonology ad hoc, it lays a groundwork from which to begin exploring these possibilities at a cellular scale. It also expands on previous observations in animal models such as marmosets 67 , 68 , singing mice 69 and canaries 70 on the syllabic structure and sequence of vocalization processes, providing us with some of the earliest lines of evidence for the neuronal coding of vocal-motor plans.

Another interesting finding from these studies is the diversity of phonetic feature representations and their organization across cortical depth. Although our recordings sampled locally from relatively small columnar populations, most phonetic features could be reliably decoded from their collective activities. Such findings suggest that phonetic information necessary for constructing words may be fully represented in certain regions along the cortical column 10, 46, 47, 48, 49, 50. They also place these populations at a putative intersection for the shared coding of places and manners of articulation and demonstrate how these representations may be locally distributed. Such redundancy and accessibility of information in local cortical populations is consistent with that observed in animal models 31, 32, 33, 34, 35 and could serve to allow for the rapid orchestration of neuronal processes necessary for the real-time construction of words, especially during the production of natural speech. Our findings are also supportive of a putative 'mirror' system that could allow for the shared representation of phonetic features within the population when speaking and listening and for the real-time feedback of phonetic information by neurons during perception 23, 71.

A final notable observation from these studies is the temporal succession of neuronal encoding events. In particular, our findings are supportive of previous neurolinguistic theories suggesting closely coupled processes for coordinating planned articulatory events that ultimately produce words. These models, for example, suggest that the morphology of a word is probably retrieved before its phonologic code, as the exact phonology depends on the morphemes in the word form 1. They also suggest the later syllabification of planned phonemes, which would enable them to be sequentially arranged in a specific order (although different temporal orders have been suggested as well) 72. Here, our findings provide tentative support for a structured sublexical coding succession that could allow for the discretization of such information during articulation. Our findings also suggest (through dynamical systems modelling) a mechanism that, consistent with previous observations on motor planning and execution 31, 34, 65, 66, could enable information to occupy distinct functional subspaces 34, 73 and therefore allow for the rapid separation of neural processes necessary for the construction and articulation of words.

Taken together, these findings reveal a set of processes, and a framework, in the language-dominant prefrontal cortex for understanding how words may be constructed during natural speech at the single-neuronal level and for defining their fine-scale spatial and temporal dynamics. Given the robust decoding performances (especially in the absence of natural language processing-based predictions), it is interesting to speculate whether such prefrontal recordings could also be used for synthetic speech prostheses or for the augmentation of other emerging approaches 21, 22, 74 used in brain–machine interfaces. It is important to note, however, that the production of words also involves more complex processes, including semantic retrieval, the arrangement of words in sentences and prosody, which were not tested here. Moreover, future experiments will be required to investigate eloquent areas, such as the ventral premotor and superior posterior temporal cortices, that were not accessible with our present techniques. Here, this study provides a prospective platform by which to begin addressing these questions using a combination of ultrahigh-density microelectrode recordings, naturalistic speech tracking and acute real-time intraoperative neurophysiology to study human language at the cellular scale.

Study participants

All aspects of the study were carried out in strict accordance with and were approved by the Massachusetts General Brigham Institutional Review Board. Right-handed native English speakers undergoing awake microelectrode recording-guided deep brain stimulator implantation were screened for enrolment. Clinical consideration for surgery was made by a multidisciplinary team of neurosurgeons, neurologists and neuropsychologists. Operative planning was made independently by the surgical team and without consideration of study participation. Participants were enrolled only if: (1) the surgical plan was for awake microelectrode recording-guided placement, (2) they were at least 18 years of age, (3) they had intact language function with English fluency and (4) they were able to provide informed consent for study participation. Participation in the study was voluntary and all participants were informed that they were free to withdraw from the study at any time.

Acute intraoperative single-neuronal recordings

Single-neuronal prefrontal recordings using Neuropixels probes.

As part of deep brain stimulator implantation at our institution, participants are often awake and microelectrode recordings are used to optimize anatomical targeting of the deep brain structures 46. During these cases, the electrodes often traverse part of the posterior language-dominant prefrontal cortex 3, 4, 5, 6, in an area previously shown by imaging studies to be involved in word planning 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 and sentence construction 13, 14, 15, 16 and which broadly connects with premotor areas involved in their articulation 51, 52, 53 and lexical processing 17, 18, 19 (Extended Data Fig. 1a,b). All microelectrode entry points and placements were based purely on planned clinical targeting and were made independently of any study consideration.

Sterile Neuropixels probes (v.1.0-S, IMEC, ethylene oxide sterilized by BioSeal 54) together with a 3B2 IMEC headstage were attached to a cannula and a manipulator connected to a ROSA ONE Brain (Zimmer Biomet) robotic arm. Here, the probes were inserted into the cortical ribbon under direct robot navigational guidance through the implanted burr hole (Fig. 1a). The probes (width 70 µm; length 10 mm; thickness 100 µm) consisted of a total of 960 contact sites (384 preselected recording channels) laid out in a chequerboard pattern with approximately 25 µm centre-to-centre nearest-neighbour site spacing. The IMEC headstage was connected through a multiplexed cable to a PXIe acquisition module card (IMEC), installed into a PXIe chassis (PXIe-1071, National Instruments). Neuropixels recordings were performed using SpikeGLX (v.20201103 and v.20221012-phase30; http://billkarsh.github.io/SpikeGLX/ ) or OpenEphys (v.0.5.3.1 and v.0.6.0; https://open-ephys.org/ ) on a computer connected to the PXIe acquisition module, recording the action potential band (AP, band-pass filtered from 0.3 to 10 kHz) sampled at 30 kHz and a local-field potential band (LFP, band-pass filtered from 0.5 to 500 Hz) sampled at 2,500 Hz. Once putative units were identified, the Neuropixels probe was briefly held in position to confirm signal stability (we did not screen putative neurons for speech responsiveness). Further description of this recording approach can be found in refs. 54, 55. After single-neural recordings from the cortex were completed, the Neuropixels probe was removed and subcortical neuronal recordings and deep brain stimulator placement proceeded as planned.

Single-unit isolation

Single units were isolated in two main steps. First, to track the activities of putative neurons at high spatiotemporal resolution and to account for intraoperative cortical motion, we used the Decentralized Registration of Electrophysiology Data software (DREDge; https://github.com/evarol/DREDge ) and an interpolation approach ( https://github.com/williamunoz/InterpolationAfterDREDge ). Briefly, and as previously described 54, 55, 56, an automated protocol was used to track LFP voltages using a decentralized correlation technique that re-aligned the recording channels in relation to brain movements (Fig. 1a, right). Following this step, we then interpolated the AP band continuous voltage data using the DREDge motion estimate to allow the activities of the putative neurons to be stably tracked over time. Next, single units were isolated from the motion-corrected interpolated signal using Kilosort (v.1.0; https://github.com/cortex-lab/KiloSort ) followed by Phy for cluster curation (v.2.0a1; https://github.com/cortex-lab/phy ; Extended Data Fig. 1c,d). Here, units were selected on the basis of their waveform morphologies and separability in principal component space, their interspike interval profiles and the similarity of waveforms across contacts. Only well-isolated single units with mean firing rates ≥0.1 Hz were included. The range of units obtained from these recordings was 16–115 units per participant.

Audio recordings and task synchronization

For task synchronization, we used the TTL and audio outputs to send synchronization triggers through the SMA input to the IMEC PXIe acquisition module card. To provide additional synchronization, triggers were also recorded on a separate breakout analogue and digital input/output board (BNC2110, National Instruments) connected through a PXIe board (PXIe-6341 module, National Instruments).

Audio recordings were obtained at a 44 kHz sampling frequency using a TASCAM DR-40X four-channel/four-track portable audio recorder and USB interface with an adjustable microphone. These recordings were then sent to an analogue input of a NIDAQ board in the same PXIe acquisition module containing the IMEC PXIe board for high-fidelity temporal alignment with neuronal data. Synchronization of neuronal activity with behavioural events was performed through TTL triggers sent through a parallel port to both the IMEC PXIe board (the sync channel) and the analogue NIDAQ input, as well as through the parallel audio input into the analogue input channels on the NIDAQ board.

Audio recordings were annotated in semi-automated fashion (Audacity; v.2.3). Recorded audio for each word and sentence by the participants was analysed in Praat 75 and Audacity (v.2.3). Exact word and phoneme onsets and offsets were identified using the Montreal Forced Aligner (v.2.2; https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner ) 76 and confirmed with manual review of all annotated recordings. Together, these measures allowed for the millisecond-level alignment of neuronal activity with each produced word and phoneme.

Anatomical localization of recordings

Pre-operative high-resolution magnetic resonance imaging and postoperative head computerized tomography scans were coregistered using a combination of ROSA software (Zimmer Biomet; v.3.1.6.276), Mango (v.4.1; https://mangoviewer.com/download.html ) and FreeSurfer (v.7.4.1; https://surfer.nmr.mgh.harvard.edu/fswiki/DownloadAndInstall ) to reconstruct the cortical surface and identify the cortical location from which Neuropixels recordings were obtained 77, 78, 79, 80, 81. This registration allowed localization of the surgical areas that underlay the cortical sites of recording (Fig. 1a and Extended Data Fig. 1a) 54, 55, 56. These coordinates were then transformed into MNI space using the FieldTrip toolbox (v.20230602; https://www.fieldtriptoolbox.org/ ; Extended Data Fig. 1b) 82.

For depth calculation, we estimated the pial boundary of the recordings according to the sharp change in signal observed between channels implanted in the brain parenchyma and those outside the brain. We then referenced the depth of each single unit (based on its maximum waveform amplitude channel) in relation to this estimated pial boundary. Here, all units were classified on the basis of their depths relative to the pial boundary as superficial, middle or deep (Extended Data Fig. 7).

Speech production task

The participants performed a priming-based naturalistic speech production task 57 in which they were shown a scene on a screen depicting a scenario that had to be described in a specific order and format. For example, the participant might be given a scene of a boy and a girl playing with a balloon, or a scene of a dog chasing a cat. Together, these scenes required the participants to produce words that varied in phonetic, syllabic and morphosyntactic content. The scenes were also highlighted in a way that required the participants to produce the words in a structured format. Thus, for example, a scene might be highlighted in a way that required the participants to produce the sentence "The mouse was being chased by the cat" or in a way that required them to produce the sentence "The cat was chasing the mouse" (Extended Data Fig. 2a). Because the sentences had to be constructed de novo, the task also required the participants to produce the words without explicit phonetic cues (for example, from hearing and then repeating the word 'cat'). Taken together, this task therefore allowed neuronal activity to be examined whereby words (for example, 'cat'), rather than independent phonetic sounds (for example, /k/), were articulated and in which the words were produced during natural speech (for example, constructing the sentence "the dog chased the cat") rather than simply repeated (for example, hearing and then repeating the word 'cat').

Finally, to account for the potential contribution of sensory–perceptual responses, three of the participants also performed a ‘perception’ control in which they listened to words spoken to them. One of these participants further performed an auditory ‘playback’ control in which they listened to their own recorded voice. For this control, all words spoken by the participant were recorded using a high-fidelity microphone (Zoom ZUM-2 USM microphone) and then played back to them on a word-by-word level in randomized separate blocks.

Constructing a word feature space

To allow for single-neuronal analysis and to provide a compositional representation for each word, we grouped the constituent phonemes on the basis of the relative positions of articulatory organs associated with their production 60 . Here, for our primary analyses, we selected the places of articulation for consonants (for example, bilabial consonants) on the basis of established IPA categories defining the primary articulators involved in speech production. For consonants, phonemes were grouped on the basis of their places of articulation into glottal, velar, palatal, postalveolar, alveolar, dental, labiodental and bilabial. For vowels, we grouped phonemes on the basis of the relative height of the tongue with high vowels being produced with the tongue in a relatively high position and mid-low (that is, mid+low) vowels being produced with it in a lower position. Here, this grouping of phonemes is broadly referred to as ‘places of articulation’ together reflecting the main positions of articulatory organs and their combinations used to produce the words 58 , 59 . Finally, to allow for comparison and to test their generalizability, we examined the manners of articulation stop, fricative, affricate, nasal, liquid and glide for consonants which describe the nature of airflow restriction by various parts of the mouth and tongue. For vowels, we also evaluated the primary cardinal vowels i, e, ɛ, a, α, ɔ, o and u which are described, in combination, by the position of the tongue relative to the roof of the mouth, how far forward or back it lies and the relative positions of the lips 83 , 84 . A detailed summary of these phonetic groupings can be found in Extended Data Table 1 .

Phoneme feature space

To further evaluate the relationship between neuronal activity and the presence of specific constituent phonemes per word, the phonemes in each word were parsed according to their precise pronunciation provided by the English Lexicon Project (or the Longman Pronunciation Dictionary for American English where necessary) as described previously 85 . Thus, for example, the word ‘like’ (l-aɪ-k) would be parsed into a sequence of alveolar-mid-low-velar phonemes, whereas the word ‘bike’ (b-aɪ-k) would be parsed into a sequence of bilabial-mid-low-velar phonemes.

These constituent phonemes were then used to represent each word as a ten-dimensional vector in which the value in each position reflected the presence of each type of phoneme (Fig. 1c). For example, the word 'like', containing a sequence of alveolar-mid-low-velar phonemes, was represented by the vector [0 0 0 1 0 0 1 0 0 1], with each entry representing the number of the respective type of phoneme in the word. Together, such vectors representing all words defined a phonetic 'vector space'. Further analyses evaluating the precise arrangement of phonemes per word, as well as the goodness-of-fit and selectivity metrics used to evaluate single-neuronal responses to these phonemes and their specific combinations in words, are described further below.
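
For illustration only (the phoneme-to-group assignments and pronunciations below are hypothetical simplifications rather than the English Lexicon Project parsing used in the study), such a count vector might be constructed as follows:

```python
import numpy as np

# Ten phoneme groups used as dimensions of the word vector (ordering is illustrative).
GROUPS = ["glottal", "velar", "palatal", "postalveolar", "alveolar",
          "dental", "labiodental", "bilabial", "high", "mid-low"]

# Hypothetical (partial) mapping from ARPAbet-style phonemes to groups.
PHONEME_TO_GROUP = {
    "l": "alveolar", "k": "velar", "b": "bilabial", "g": "velar",
    "d": "alveolar", "t": "alveolar", "m": "bilabial",
    "iy": "high", "ih": "high", "ay": "mid-low", "ah": "mid-low", "ae": "mid-low",
}

def word_vector(phonemes):
    """Return a ten-dimensional count vector over phoneme groups for one word."""
    vec = np.zeros(len(GROUPS), dtype=int)
    for ph in phonemes:
        vec[GROUPS.index(PHONEME_TO_GROUP[ph])] += 1
    return vec

# 'like' (l-aɪ-k) -> one alveolar, one mid-low vowel and one velar phoneme.
print(word_vector(["l", "ay", "k"]))
# 'bike' (b-aɪ-k) -> one bilabial, one mid-low vowel and one velar phoneme.
print(word_vector(["b", "ay", "k"]))
```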

Syllabic feature space

Next, to evaluate the relationship between neuronal activity and the specific arrangement of phonemes in syllables, we parsed the constituent syllables for each word using American pronunciations provided in ref. 85. Thus, for example, 'back' would be defined as a labial-low-velar sequence. Here, to allow for neuronal analysis and to limit the combination of all possible syllables, we selected the ten most common syllable types. High and mid-low vowels were considered as syllables here only if they reflected syllables in themselves and were unbound from a consonant (for example, /ih/ in 'hesitate' or /ah-/ in 'adore'). Similar to the phoneme space, the syllables were then transformed into an n-dimensional binary vector in which the value in each dimension reflected the presence of specific syllables. Thus, for the n-dimensional representation of each word in this syllabic feature space, the value in each dimension could also be interpreted in relation to neuronal activity.

To account for the functional distinction between phonemes and morphemes 62 , 63 , we also parsed words into those that contained bound morphemes which were either prefixed (for example, ‘re–’) or suffixed (for example, ‘–ed’). Unlike phonemes, morphemes such as ‘–ed’ in ‘directed’ or ‘re–’ in ‘retry’ are the smallest linguistic units capable of carrying meaning and, therefore, accounting for their presence allowed their effect on neuronal responses to be further examined. To allow for neuronal analysis and to control for potential differences in neuronal activity due to word lengths, models also took into account the total number of phonemes per word.
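
Under the same logic, a single word's syllabic and morphological regressors can be sketched as one row of a design matrix; the syllable inventory and morpheme list below are placeholders rather than the study's actual ten most common syllables, and a single presence flag stands in for the bound-morpheme coding:

```python
import numpy as np

# Illustrative inventories; the study used the ten most common syllable types
# and any bound (prefix/suffix) morphemes rather than this placeholder set.
SYLLABLES = ["d-ah-g", "k-ae-t", "b-ae-k", "r-iy", "t-ed"]      # hypothetical
BOUND_MORPHEMES = ["re-", "-ed", "-ing", "-s"]                   # hypothetical

def word_features(syllables, morphemes, n_phonemes):
    """One regression row: syllable indicators, a bound-morpheme flag and word length."""
    syl = np.array([1 if s in syllables else 0 for s in SYLLABLES])
    has_bound_morpheme = int(any(m in BOUND_MORPHEMES for m in morphemes))
    return np.concatenate([syl, [has_bound_morpheme, n_phonemes]])

# 'directed', containing the bound suffix '-ed' and (here) eight phonemes.
print(word_features(["t-ed"], ["-ed"], n_phonemes=8))
```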

Spectral features

To evaluate the time-varying spectral features of the articulated phonemes on a phoneme-by-phoneme basis, we identified the occurrence of each phoneme using the Montreal Forced Aligner (v.2.2; https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner ). For pitch, we calculated the spectral power in ten log-spaced frequency bins from 200 to 5,000 Hz for each phoneme per word. For amplitude, we took the root-mean-square of the recorded waveform of each phoneme.
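
A minimal sketch of this spectral measurement, assuming a periodogram-based power estimate (the study's exact windowing is not specified here) and a synthetic audio segment:

```python
import numpy as np
from scipy.signal import periodogram

def spectral_features(segment, fs=44_000):
    """Power in ten log-spaced bands (200-5,000 Hz) plus RMS amplitude for one phoneme."""
    freqs, pxx = periodogram(segment, fs=fs)
    edges = np.logspace(np.log10(200), np.log10(5000), 11)   # ten log-spaced bins
    band_power = np.array([
        pxx[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in zip(edges[:-1], edges[1:])
    ])
    rms = np.sqrt(np.mean(segment ** 2))                      # root-mean-square amplitude
    return band_power, rms

# Example on a synthetic 100 ms segment.
t = np.arange(0, 0.1, 1 / 44_000)
segment = np.sin(2 * np.pi * 440 * t)
power, rms = spectral_features(segment)
print(power.shape, rms)
```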

Single-neuronal analysis

Evaluating the selectivity of single-neuronal responses.

To investigate the relationship between single-neuronal activity and specific word features, we used a regression analysis to determine the degree to which variation in neural activity could be explained by phonetic, syllabic or morphologic properties of spoken words 86 , 87 , 88 , 89 . For all analyses, neuronal activity was considered in relation to word utterance onset ( t  = 0) and taken as the mean spike count in the analysis window of interest (that is, −500 to 0 ms from word onset for word planning and 0 to +500 ms for word production). To limit the potential effects of preceding words on neuronal activity, words with planning periods that overlapped temporally were excluded from regression and selectivity analyses. For each neuron, we constructed a GLM that modelled the spike count rate as the realization of a Poisson process whose rate varied as a function of the linguistic (for example, phonetic, syllabic and morphologic) or acoustic features (for example, spectral power and root-mean-square amplitude) of the planned words.

Models were fit using the Python (v.3.9.17) library statsmodels (v.0.13.5) by iterative least-squares minimization of the Poisson negative log-likelihood function 86. To assess the goodness-of-fit of the models, we used both the Akaike information criterion ( \({\rm{AIC}}=2k-2{\rm{ln}}(L)\), where k is the number of estimated parameters and L is the maximized value of the likelihood function) and a generalization of the R 2 score for the exponential family of regression models that we refer to as D 2, whereby 87:

\({D}^{2}=\frac{K({\bf{y}},{{\boldsymbol{\mu }}}_{{\rm{restricted}}})-K({\bf{y}},{{\boldsymbol{\mu }}}_{{\rm{full}}})}{K({\bf{y}},{{\boldsymbol{\mu }}}_{{\rm{restricted}}})}\)

y is a vector of realized outcomes, μ is a vector of estimated means from a full (including all regressors) or restricted (without regressors of interest) model and \({K}({\bf{y}}\,,{\boldsymbol{\mu }})=2\bullet {\rm{llf}}({\bf{y}}\,;{\bf{y}})-2\bullet {\rm{llf}}({\boldsymbol{\mu }}\,;{\bf{y}})\) where \({\rm{llf}}({\boldsymbol{\mu }}\,;{\bf{y}})\) is the log-likelihood of the model and \({\rm{llf}}({\bf{y}}\,;{\bf{y}})\) is the log-likelihood of the saturated model. The D 2 value represents the proportion of reduction in uncertainty (measured by the Kullback–Leibler divergence) due to the inclusion of regressors. The statistical significance of model fit was evaluated using the likelihood ratio test compared with a model with all covariates except the regressors of interest (the task variables).
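
A minimal sketch of this fitting and model-comparison procedure with statsmodels, using simulated spike counts and a placeholder design matrix; the deviance reported by statsmodels is used as K(y, μ) when forming D 2:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)

# Simulated data: per-word spike counts and a word-feature design matrix (placeholders).
n_words, n_features = 200, 10
X_features = rng.integers(0, 2, size=(n_words, n_features))       # e.g. phoneme counts
X_nuisance = rng.normal(size=(n_words, 1))                         # e.g. word length
y = rng.poisson(lam=np.exp(0.5 + X_features @ rng.normal(0, 0.2, n_features)))

X_full = sm.add_constant(np.hstack([X_nuisance, X_features]))
X_restricted = sm.add_constant(X_nuisance)                         # without regressors of interest

full = sm.GLM(y, X_full, family=sm.families.Poisson()).fit()
restricted = sm.GLM(y, X_restricted, family=sm.families.Poisson()).fit()

# Deviance is 2*[llf(y; y) - llf(mu; y)], i.e. K(y, mu) in the notation above.
D2 = (restricted.deviance - full.deviance) / restricted.deviance

# Likelihood ratio test of the regressors of interest.
lr_stat = 2 * (full.llf - restricted.llf)
df = X_full.shape[1] - X_restricted.shape[1]
p_value = chi2.sf(lr_stat, df)
print(f"D2 = {D2:.3f}, LR p = {p_value:.3g}, AIC = {full.aic:.1f}")
```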

We characterized a neuron as selectively 'tuned' to a given word feature if the GLM of neuronal firing rates as a function of task variables for that feature exhibited a statistically significant model fit (likelihood ratio test with α set at 0.01). For neurons meeting this criterion, we also examined the point estimates and confidence intervals for each coefficient in the model. A vector of these coefficients (or, in our feature space, a vector of the signs of these coefficients) indicates a word with the combination of constituent elements expected to produce a maximal neuronal response. The multidimensional feature spaces also allowed us to define metrics that quantified the phonemic, syllabic or morphologic similarity between words. Here, we calculated the Hamming distance between the vector describing each word u and the vector of the signs of the regression coefficients that defines each neuron's maximal predicted response v, which is equal to the number of positions at which the corresponding values are different:

\(d({\bf{u}},{\bf{v}})=|\{i:{u}_{i}\ne {v}_{i}\}|\)

For each ‘tuned’ neuron, we compared the Z -scored firing rate elicited by each word as a function of the Hamming distance between the word and the ‘preferred word’ of the neuron to examine the ‘tuning’ characteristics of these neurons (Figs. 1f and  2c ). A Hamming distance of zero would therefore indicate that the words have phonetically identical compositions. Finally, to examine the relationship between neuronal activity and spectral features of each phoneme, we extracted the acoustic waveform for each phoneme and calculated the power in ten log-spaced spectral bands. We then constructed a ‘spectral vector’ representation for each word based on these ten values and fit a Poisson GLM of neuronal firing rates against these values. For amplitude analysis, we regressed neuronal firing rates against the root-mean-square amplitude of the waveform for each word.
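
Continuing with hypothetical values, the 'preferred word' and Hamming distance computations reduce to a few lines; binarizing the word vector before the comparison is one plausible reading of the procedure described above:

```python
import numpy as np

def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return int(np.sum(np.asarray(u) != np.asarray(v)))

# Hypothetical fitted coefficients for the ten word-feature dimensions.
coefs = np.array([0.3, -0.1, 0.0, 0.8, -0.4, 0.2, 0.6, -0.2, 0.1, 0.9])
preferred = (np.sign(coefs) > 0).astype(int)     # composition with maximal predicted rate

word_vec = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 1])   # 'like' from the example above
print(hamming((word_vec > 0).astype(int), preferred))
```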

Controlling for interdependency between phonetic and syllabic features

Three more word variations were used to examine the interdependency between phonetic and syllabic features. First, we compared firing rates for words containing specific syllables with words containing individual phonemes in that syllable but not the syllable itself (for example, simply /d/ in ‘god’ or ‘dog’). Second, we examined words containing syllables with the same constituent phonemes but in a different order (for example, /g-ah-d/ for ‘god’ versus /d-ah-g/ for ‘dog’). Thus, if neurons responded preferentially to specific syllables, then they should continue to respond to them preferentially even when comparing words that had the same arrangements of phonemes but in different or reverse order. Third, we examined words containing the same sequence of syllables but spanning a syllable boundary such that the cluster of phonemes did not constitute a syllable (that is, in the same syllable versus spanning across syllable boundaries).

Visualization of neuronal responses within the population

To allow for visualization of groupings of neurons with shared representational characteristics, we calculated the AIC and D 2 for the phoneme, syllable and morpheme models for each neuron and conducted a tSNE procedure which transformed these data into two dimensions such that neurons with similar feature representations are spatially closer together than those with dissimilar representations 90. We used the tSNE implementation in the scikit-learn Python module (v.1.3.0). In Fig. 3a left, a tSNE was fit on the AIC values for the phoneme, syllable and morpheme models for each neuron during the planning period with the following parameters: perplexity = 35, early exaggeration = 2 and using Euclidean distance as the metric. In Fig. 3a right and Fig. 4a bottom, a different tSNE was fit on the D 2 values for all planning and production models using the following parameters: perplexity = 10, early exaggeration = 10 and using a cosine distance metric. The resulting embeddings were mapped onto a grid of points according to a linear sum assignment algorithm between embeddings and grid points.
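
A sketch of the embedding step for the planning-period map, with simulated AIC values standing in for the per-neuron model fits and a simple grid-snapping step via linear sum assignment:

```python
import numpy as np
from sklearn.manifold import TSNE
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
aic_values = rng.normal(size=(272, 3))   # per-neuron AIC for phoneme/syllable/morpheme models

embedding = TSNE(n_components=2, perplexity=35, early_exaggeration=2,
                 metric="euclidean", random_state=0).fit_transform(aic_values)

# Snap embedded points onto a regular grid (one grid point per neuron).
side = int(np.ceil(np.sqrt(len(embedding))))
grid = np.array([(i, j) for i in range(side) for j in range(side)], dtype=float)[:len(embedding)]
grid = (grid - grid.mean(0)) / grid.std(0)
emb = (embedding - embedding.mean(0)) / embedding.std(0)
row, col = linear_sum_assignment(cdist(emb, grid))
snapped = grid[col]
print(snapped.shape)
```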

Population modelling

Modelling population activity.

To quantify the degree to which the neural population coded information about the planned phonemes, syllables and morphemes, we modelled the activity of the entire pseudopopulation of recorded neurons. To match trials across the different participants, we first labelled each word according to whether it contained the feature of interest and then matched words across subjects based on the features that were shared. Using this procedure, no trials or neural data were duplicated or upsampled, ensuring strict separation between training and testing sets during classifier training and subsequent evaluation.

For decoding, words were randomly split into training (75%) and testing (25%) trials across 50 iterations. A support vector machine (SVM) as implemented in the scikit-learn Python package (v.1.3.0) 91 was used to construct a hyperplane in n-dimensional space that optimally separates samples of different word features by solving the following minimization problem:

\(\mathop{\min }\limits_{w,b,{\boldsymbol{\zeta }}}\frac{1}{2}{w}^{T}w+C\mathop{\sum }\limits_{i=1}^{n}{\zeta }_{i}\)

subject to \({y}_{i}({w}^{T}\phi ({x}_{i})+b)\ge 1-{\zeta }_{i}\) and \({\zeta }_{i}\ge 0\) for all \(i\in \left\{1,\ldots ,n\right\}\), where w is the vector normal to the separating hyperplane in feature space, C is the regularization strength, ζ i is the distance of each point from the margin, y i is the class label of each sample and ϕ ( x i ) is the image of each datapoint in transformed feature space. A radial basis function kernel with coefficient γ  = 1/272 was applied. The penalty term C was optimized for each classifier using a cross-validation procedure nested in the training set.

A separate classifier was trained for each dimension in a task space (for example, separate classifiers for bilabial, dental and alveolar consonants) and scores for each of these classifiers were averaged to calculate an overall decoding score for that feature type. Each decoder was trained to predict whether the upcoming word contained an instance of a specific phoneme, syllable or morpheme arrangement. For phonemes, we used nine of the ten phoneme groups (there were insufficient instances of palatal consonants to train a classifier; Extended Data Table 1 ). For syllables, we used ten syllables taken from the most common syllables across the study vocabulary (Extended Data Table 1 ). For morpheme analysis, a single classifier was trained to predict the presence or absence of any bound morpheme in the upcoming word.

Finally, to assess performance, we scored classifiers using the area under the receiver operating characteristic curve (AUC-ROC). With this scoring metric, a classifier that always guesses the most common class (that is, an uninformative classifier) results in a score of 0.5 whereas a perfect classification results in a score of 1. The overall decoding score for a particular feature space was the mean score of the classifier for each dimension in the space. The entire procedure was repeated 50 times with random train/test splits. Summary statistics for these 50 iterations are presented in the main text.
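
The decoding pipeline for a single feature dimension might be sketched as follows, with simulated pseudopopulation rates and labels; the RBF kernel with γ = 1/272, the nested search over C and the AUC-ROC scoring follow the description above:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n_words, n_neurons = 300, 272
X = rng.normal(size=(n_words, n_neurons))             # per-word pseudopopulation rates (simulated)
y = rng.integers(0, 2, size=n_words)                   # e.g. word contains a bilabial consonant

scores = []
for split in range(50):                                # 50 random train/test splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=split)
    clf = GridSearchCV(                                # C optimized by nested cross-validation
        SVC(kernel="rbf", gamma=1.0 / n_neurons),
        param_grid={"C": [0.1, 1, 10, 100]}, cv=5)
    clf.fit(X_tr, y_tr)
    scores.append(roc_auc_score(y_te, clf.decision_function(X_te)))

print(f"mean AUC over splits: {np.mean(scores):.3f}")
```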

Model switching

Assessing decoder generalization across different experimental conditions provides a powerful method to evaluate the similarity of neuronal representations of information in different contexts 64 . To determine how neurons encoded the same word features but under different conditions, we trained SVM decoders using neuronal data during one condition (for example, word production) but tested the decoder using data from another (for example, no word production). Before decoder training or testing, trials were split into disjoint training and testing sets, from which the neuronal data were extracted in the epoch of interest. Thus, trials used to train the model were never used to test the model while testing either native decoder performance or decoder generalizability.
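
A minimal sketch of this model-switch evaluation on simulated data, training on planning-epoch activity and testing the unretrained decoder on production-epoch activity of held-out words:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n_words, n_neurons = 300, 272
y = rng.integers(0, 2, size=n_words)                           # feature label per word
X_planning = rng.normal(size=(n_words, n_neurons))             # -500:0 ms rates (simulated)
X_production = X_planning + rng.normal(scale=0.5, size=(n_words, n_neurons))  # 0:+500 ms rates

train, test = np.arange(0, 225), np.arange(225, n_words)       # disjoint word sets
clf = SVC(kernel="rbf", gamma=1.0 / n_neurons, C=1.0)
clf.fit(X_planning[train], y[train])                            # train on the planning epoch

# Test the same (unretrained) decoder on production-epoch activity of held-out words.
auc = roc_auc_score(y[test], clf.decision_function(X_production[test]))
print(f"model-switch AUC: {auc:.3f}")
```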

Modelling temporal dynamic

To further study the temporal dynamic of neuronal response, we trained decoders to predict the phonemes, syllables and morpheme arrangement for each word across successive time points before utterance 64 . For each neuron, we aligned all spikes to utterance onset, binned spikes into 5 ms windows and convolved with a Gaussian kernel with standard deviation of 25 ms to generate an estimated instantaneous firing rate at each point in time during word planning. For each time point, we evaluated the performance of decoders of phonemes, syllables and morphemes trained on these data over 50 random splits of training and testing trials. The distribution of times of peak decoding performance across the planning or perception period revealed the dynamic of information encoding by these neurons during word planning or perception and we then calculated the median peak decoding times for phonemes, syllables or morphemes.
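
A sketch of the rate estimation and peak-time readout, assuming spike times are given in seconds relative to utterance onset and using placeholder decoding scores per time bin:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

BIN = 0.005                                    # 5 ms bins
edges = np.arange(-0.5, 0.0 + BIN, BIN)        # planning window, aligned to utterance onset

def instantaneous_rate(spike_times):
    """Binned (5 ms) spike counts smoothed with a 25 ms s.d. Gaussian kernel (in Hz)."""
    counts, _ = np.histogram(spike_times, bins=edges)
    return gaussian_filter1d(counts / BIN, sigma=25e-3 / BIN)

# Example: one trial's spikes (seconds relative to utterance onset).
rate = instantaneous_rate(np.array([-0.42, -0.30, -0.29, -0.11, -0.05]))

# Given per-time-point decoding scores (e.g. mean AUC over 50 splits), the peak time is:
auc_by_time = np.random.default_rng(4).uniform(0.5, 0.8, size=len(edges) - 1)  # placeholder
peak_time_ms = (edges[:-1] + BIN / 2)[np.argmax(auc_by_time)] * 1000
print(rate.shape, peak_time_ms)
```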

Dynamical system and subspace analysis

To study the dimensionality of neuronal activity and to evaluate the functional subspaces occupied by the neuronal population, we used a dynamical systems approach that quantified the time-dependent changes in neural activity patterns 31. For the dynamical system analysis, activity for all words was averaged for each neuron to yield a single peri-event time projection (aligned to word onset), which allowed all neurons to be analysed together as a pseudopopulation. First, we calculated the instantaneous firing rates of the neurons that showed selectivity to any word feature (phonemes, syllables or morpheme arrangement) by binning spikes into 5 ms bins and convolving them with a Gaussian filter with a standard deviation of 50 ms. We used equal 500 ms windows set at −500 to 0 ms before utterance onset for the planning phase and 0 to 500 ms following utterance onset for the production phase to allow for comparison. These data were then standardized to zero mean and unit variance. Finally, the neural data were concatenated into a T × N matrix of sampled instantaneous firing rates for each of the N neurons at each of the T time points.

Together, these matrices represented the evolution of the system in N-dimensional space over time. A principal component analysis revealed a small set of five principal components (PCs) embedded in the full N-dimensional space that captured most of the variance in the data for each epoch (Fig. 4b). Projection of the data into this space yields a T × 5 matrix representing the evolution of the system in five-dimensional space over time. The columns of the N × 5 principal component matrix form an orthonormal basis for the five-dimensional subspace occupied by the system during each epoch.
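
The subspace estimation can be sketched as follows, assuming a T × N peri-event rate matrix per 500 ms epoch (simulated here):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
T, N = 100, 272                      # 100 five-ms bins per 500 ms epoch, N neurons

def epoch_subspace(rates, n_components=5):
    """Z-score each neuron, then return (T x k projection, N x k orthonormal basis)."""
    Z = StandardScaler().fit_transform(rates)            # T x N, zero mean / unit variance
    pca = PCA(n_components=n_components).fit(Z)
    return pca.transform(Z), pca.components_.T            # basis columns are orthonormal

planning_rates = rng.normal(size=(T, N))                   # placeholder pseudopopulation data
production_rates = rng.normal(size=(T, N))
proj_plan, basis_plan = epoch_subspace(planning_rates)
proj_prod, basis_prod = epoch_subspace(production_rates)
print(proj_plan.shape, basis_plan.shape)                   # (100, 5), (272, 5)
```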

Next, to quantify the relationship between these subspaces during planning and production, we took two approaches. First, we calculated the alignment index from ref. 66:

\(A=\frac{{\rm{Tr}}({D}_{{\rm{A}}}^{T}{C}_{{\rm{B}}}{D}_{{\rm{A}}})}{{\sum }_{i=1}^{5}{\sigma }_{{\rm{B}}}(i)}\)

where D A is the matrix defined by the orthonormal basis of subspace A, C B is the covariance of the neuronal data as it evolves in space B, \({\sigma }_{{\rm{B}}}(i)\) is the i th singular value of the covariance matrix C B and Tr(∙) is the matrix trace. The alignment index A ranges from 0 to 1 and quantifies the fraction of variance in space B recovered when the data are projected into space A. Higher values indicate that variance in the data is adequately captured by either subspace.
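
A sketch of this computation, with a random orthonormal basis and simulated epoch-B activity standing in for the planning and production data:

```python
import numpy as np

rng = np.random.default_rng(6)
N, d, T = 272, 5, 100

# Orthonormal basis of subspace A (e.g. planning) and data evolving in space B (e.g. production).
D_A = np.linalg.qr(rng.normal(size=(N, d)))[0]
X_B = rng.normal(size=(T, N))
C_B = np.cov(X_B, rowvar=False)                        # N x N covariance of epoch-B activity

def alignment_index(D_A, C_B, d):
    """Fraction of epoch-B variance captured when projected onto subspace A (0 to 1)."""
    captured = np.trace(D_A.T @ C_B @ D_A)
    top_d = np.sort(np.linalg.eigvalsh(C_B))[::-1][:d]  # largest d eigenvalues of C_B
    return captured / top_d.sum()

print(alignment_index(D_A, C_B, d))
```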

As discussed in ref. 66, subspace misalignment in the form of a low alignment index A can arise by chance when considering high-dimensional neuronal data because of the probability that two randomly selected sets of dimensions in high-dimensional space may not align well. Therefore, to further explore the degree to which our subspace misalignment was attributable to chance, we used a Monte Carlo analysis to generate random subspaces from data with the same covariance structure as the true (observed) data:

\(V={\rm{orth}}(U{S}^{1/2}v)\)

where V is a random subspace, U and S are the eigenvectors and eigenvalues of the covariance matrix of the observed data across all epochs being compared, v is a matrix of white noise and orth(∙) orthogonalizes the matrix. The alignment index A of the subspaces defined by the resulting basis vectors V was recalculated 1,000 times to generate a distribution of alignment index values A attributable to chance alone (compare Fig. 4b ).
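
The chance distribution can be generated as sketched below; each sampled basis would then be scored with the alignment index above:

```python
import numpy as np
from scipy.linalg import orth

rng = np.random.default_rng(7)
N, d, T = 272, 5, 100
X_all = rng.normal(size=(2 * T, N))                   # data concatenated across both epochs (simulated)
C = np.cov(X_all, rowvar=False)
evals, U = np.linalg.eigh(C)                           # eigenvalues/eigenvectors of the covariance
S_half = np.diag(np.sqrt(np.clip(evals, 0, None)))

def random_subspace():
    """Random N x d orthonormal basis with the same covariance structure as the data."""
    v = rng.normal(size=(N, d))                        # white noise
    return orth(U @ S_half @ v)

null_bases = [random_subspace() for _ in range(1000)]
print(null_bases[0].shape)                              # (272, 5)
```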

Finally, we calculated the projection error between each pair of subspaces on the basis of relationships between the three orthonormal bases (rather than a projection of the data into each of these subspaces). The set of all (linear) subspaces of dimension k < n embedded in an n-dimensional vector space V forms a manifold known as the Grassmannian, endowed with several metrics which can be used to quantify distances between two subspaces on the manifold. Thus, the subspaces (defined by the columns of an N × N′ matrix, where N′ is the number of selected principal components; five in our case) explored by the system during planning and production are points on the Grassmannian manifold of the full N-neuron dimensional vector space. Here, we used the Grassmannian chordal distance 92:

\(d(A,B)=\frac{1}{\sqrt{2}}{\parallel A{A}^{T}-B{B}^{T}\parallel }_{F}\)

where A and B are matrices whose columns are the orthonormal basis for their respective subspaces and \({\parallel \cdot \parallel }_{F}\) is the Frobenius norm. By normalizing this distance by the Frobenius norm of subspace A , we scale the distance metric from 0 to 1, where 0 indicates a subspace identical to A (that is, completely overlapping) and increasing values indicate greater misalignment from A . Random sampling of subspaces under the null hypothesis was repeated using the same procedure outlined above.
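
A sketch of this distance, assuming the projection-matrix (1/√2) form of the chordal metric so that, after normalization by the Frobenius norm of the basis, values range from 0 (identical subspaces) to 1 (orthogonal subspaces):

```python
import numpy as np

def chordal_distance(A, B):
    """Normalized Grassmannian chordal distance between subspaces spanned by the columns of A and B."""
    P_A, P_B = A @ A.T, B @ B.T                        # projection matrices onto each subspace
    dist = np.linalg.norm(P_A - P_B, "fro") / np.sqrt(2)
    return dist / np.linalg.norm(A, "fro")             # normalize by ||A||_F = sqrt(d)

rng = np.random.default_rng(8)
N, d = 272, 5
A = np.linalg.qr(rng.normal(size=(N, d)))[0]           # planning-subspace basis (placeholder)
B = np.linalg.qr(rng.normal(size=(N, d)))[0]           # production-subspace basis (placeholder)
print(chordal_distance(A, A), chordal_distance(A, B))  # 0 for identical subspaces
```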

Participant demographics

Across the participants, there was no statistically significant difference in word length based on sex (three-way analysis of variance, F (1,4257) = 1.78, P  = 0.18) or underlying diagnosis (essential tremor versus Parkinson’s disease; F (1,4257) = 0.45, P  = 0.50). Among subjects with Parkinson’s disease, there was a significant difference based on disease severity (both ON score and OFF score) with more advanced disease (higher scores) correlating with longer word lengths ( F (1,3295) = 145.8, P  = 7.1 × 10 −33 for ON score and F (1,3295) = 1,006.0, P  = 6.7 × 10 −193 for OFF score, P  < 0.001) and interword intervals ( F (1,3291) = 14.9, P  = 1.1 × 10 −4 for ON score and F (1,3291) = 31.8, P  = 1.9 × 10 −8 for OFF score). Modelling neuronal activities in relation to these interword intervals (bottom versus top quartile), decoding performances were slightly higher for longer compared to shorter delays (0.76 ± 0.01 versus 0.68 ± 0.01, P  < 0.001, two-sided Mann–Whitney U -test).

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

All the primary data supporting the main findings of this study are available online at https://doi.org/10.6084/m9.figshare.24720501 .  Source data are provided with this paper.

Code availability

All codes necessary for reproducing the main findings of this study are available online at https://doi.org/10.6084/m9.figshare.24720501 .

Levelt, W. J. M., Roelofs, A. & Meyer, A. S. A Theory of Lexical Access in Speech Production Vol. 22 (Cambridge Univ. Press, 1999).

Kazanina, N., Bowers, J. S. & Idsardi, W. Phonemes: lexical access and beyond. Psychon. Bull. Rev. 25 , 560–585 (2018).

Bohland, J. W. & Guenther, F. H. An fMRI investigation of syllable sequence production. NeuroImage 32 , 821–841 (2006).

Basilakos, A., Smith, K. G., Fillmore, P., Fridriksson, J. & Fedorenko, E. Functional characterization of the human speech articulation network. Cereb. Cortex 28 , 1816–1830 (2017).

Tourville, J. A., Nieto-Castañón, A., Heyne, M. & Guenther, F. H. Functional parcellation of the speech production cortex. J. Speech Lang. Hear. Res. 62 , 3055–3070 (2019).

Lee, D. K. et al. Neural encoding and production of functional morphemes in the posterior temporal lobe. Nat. Commun. 9 , 1877 (2018).

Glanz, O., Hader, M., Schulze-Bonhage, A., Auer, P. & Ball, T. A study of word complexity under conditions of non-experimental, natural overt speech production using ECoG. Front. Hum. Neurosci. 15 , 711886 (2021).

Yellapantula, S., Forseth, K., Tandon, N. & Aazhang, B. NetDI: methodology elucidating the role of power and dynamical brain network features that underpin word production. eNeuro 8 , ENEURO.0177-20.2020 (2020).

Hoffman, P. Reductions in prefrontal activation predict off-topic utterances during speech production. Nat. Commun. 10 , 515 (2019).

Glasser, M. F. et al. A multi-modal parcellation of human cerebral cortex. Nature 536 , 171–178 (2016).

Chang, E. F. et al. Pure apraxia of speech after resection based in the posterior middle frontal gyrus. Neurosurgery 87 , E383–E389 (2020).

Hazem, S. R. et al. Middle frontal gyrus and area 55b: perioperative mapping and language outcomes. Front. Neurol. 12 , 646075 (2021).

Fedorenko, E. et al. Neural correlate of the construction of sentence meaning. Proc. Natl Acad. Sci. USA 113 , E6256–E6262 (2016).

Nelson, M. J. et al. Neurophysiological dynamics of phrase-structure building during sentence processing. Proc. Natl Acad. Sci. USA 114 , E3669–E3678 (2017).

Walenski, M., Europa, E., Caplan, D. & Thompson, C. K. Neural networks for sentence comprehension and production: an ALE-based meta-analysis of neuroimaging studies. Hum. Brain Mapp. 40 , 2275–2304 (2019).

Elin, K. et al. A new functional magnetic resonance imaging localizer for preoperative language mapping using a sentence completion task: validity, choice of baseline condition and test–retest reliability. Front. Hum. Neurosci. 16 , 791577 (2022).

Duffau, H. et al. The role of dominant premotor cortex in language: a study using intraoperative functional mapping in awake patients. Neuroimage 20 , 1903–1914 (2003).

Ikeda, S. et al. Neural decoding of single vowels during covert articulation using electrocorticography. Front. Hum. Neurosci. 8 , 125 (2014).

Ghosh, S. S., Tourville, J. A. & Guenther, F. H. A neuroimaging study of premotor lateralization and cerebellar involvement in the production of phonemes and syllables. J. Speech Lang. Hear. Res. 51 , 1183–1202 (2008).

Bouchard, K. E., Mesgarani, N., Johnson, K. & Chang, E. F. Functional organization of human sensorimotor cortex for speech articulation. Nature 495 , 327–332 (2013).

Anumanchipalli, G. K., Chartier, J. & Chang, E. F. Speech synthesis from neural decoding of spoken sentences. Nature 568 , 493–498 (2019).

Moses, D. A. et al. Neuroprosthesis for decoding speech in a paralyzed person with anarthria. N. Engl. J. Med. 385 , 217–227 (2021).

Wang, R. et al. Distributed feedforward and feedback cortical processing supports human speech production. Proc. Natl Acad. Sci. USA 120 , e2300255120 (2023).

Coudé, G. et al. Neurons controlling voluntary vocalization in the Macaque ventral premotor cortex. PLoS ONE 6 , e26822 (2011).

Hahnloser, R. H. R., Kozhevnikov, A. A. & Fee, M. S. An ultra-sparse code underlies the generation of neural sequences in a songbird. Nature 419 , 65–70 (2002).

Aronov, D., Andalman, A. S. & Fee, M. S. A specialized forebrain circuit for vocal babbling in the juvenile songbird. Science 320 , 630–634 (2008).

Stavisky, S. D. et al. Neural ensemble dynamics in dorsal motor cortex during speech in people with paralysis. eLife 8 , e46015 (2019).

Tankus, A., Fried, I. & Shoham, S. Structured neuronal encoding and decoding of human speech features. Nat. Commun. 3 , 1015 (2012).

Basilakos, A., Smith, K. G., Fillmore, P., Fridriksson, J. & Fedorenko, E. Functional characterization of the human speech articulation network. Cereb. Cortex 28 , 1816–1830 (2018).

Keating, P. & Shattuck-Hufnagel, S. A prosodic view of word form encoding for speech production. UCLA Work. Pap. Phon. 101 , 112–156 (1989).

Vyas, S., Golub, M. D., Sussillo, D. & Shenoy, K. V. Computation through neural population dynamics. Ann. Rev. Neurosci. 43 , 249–275 (2020).

Churchland, M. M., Cunningham, J. P., Kaufman, M. T., Ryu, S. I. & Shenoy, K. V. Cortical preparatory activity: representation of movement or first cog in a dynamical machine? Neuron 68 , 387–400 (2010).

Shenoy, K. V., Sahani, M. & Churchland, M. M. Cortical control of arm movements: a dynamical systems perspective. Ann. Rev. Neurosci. 36 , 337–359 (2013).

Kaufman, M. T., Churchland, M. M., Ryu, S. I. & Shenoy, K. V. Cortical activity in the null space: permitting preparation without movement. Nat. Neurosci. 17 , 440–448 (2014).

Mante, V., Sussillo, D., Shenoy, K. V. & Newsome, W. T. Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature 503 , 78–84 (2013).

Vitevitch, M. S. & Luce, P. A. Phonological neighborhood effects in spoken word perception and production. Ann. Rev. Linguist. 2 , 75–94 (2016).

Jamali, M. et al. Dorsolateral prefrontal neurons mediate subjective decisions and their variation in humans. Nat. Neurosci. 22 , 1010–1020 (2019).

Mian, M. K. et al. Encoding of rules by neurons in the human dorsolateral prefrontal cortex. Cereb. Cortex 24 , 807–816 (2014).

Patel, S. R. et al. Studying task-related activity of individual neurons in the human brain. Nat. Protoc. 8 , 949–957 (2013).

Sheth, S. A. et al. Human dorsal anterior cingulate cortex neurons mediate ongoing behavioural adaptation. Nature 488 , 218–221 (2012).

Williams, Z. M., Bush, G., Rauch, S. L., Cosgrove, G. R. & Eskandar, E. N. Human anterior cingulate neurons and the integration of monetary reward with motor responses. Nat. Neurosci. 7 , 1370–1375 (2004).

Jang, A. I., Wittig, J. H. Jr., Inati, S. K. & Zaghloul, K. A. Human cortical neurons in the anterior temporal lobe reinstate spiking activity during verbal memory retrieval. Curr. Biol. 27 , 1700–1705 (2017).

Ponce, C. R. et al. Evolving images for visual neurons using a deep generative network reveals coding principles and neuronal preferences. Cell 177 , 999–1009 (2019).

Yoshor, D., Ghose, G. M., Bosking, W. H., Sun, P. & Maunsell, J. H. Spatial attention does not strongly modulate neuronal responses in early human visual cortex. J. Neurosci. 27 , 13205–13209 (2007).

Jamali, M. et al. Single-neuronal predictions of others’ beliefs in humans. Nature 591 , 610–614 (2021).

Hickok, G. & Poeppel, D. Dorsal and ventral streams: a framework for understanding aspects of the functional anatomy of language. Cognition 92 , 67–99 (2004).

Poologaindran, A., Lowe, S. R. & Sughrue, M. E. The cortical organization of language: distilling human connectome insights for supratentorial neurosurgery. J. Neurosurg. 134 , 1959–1966 (2020).

Genon, S. et al. The heterogeneity of the left dorsal premotor cortex evidenced by multimodal connectivity-based parcellation and functional characterization. Neuroimage 170 , 400–411 (2018).

Milton, C. K. et al. Parcellation-based anatomic model of the semantic network. Brain Behav. 11 , e02065 (2021).

Sun, H. et al. Functional segregation in the left premotor cortex in language processing: evidence from fMRI. J. Integr. Neurosci. 12 , 221–233 (2013).

Peeva, M. G. et al. Distinct representations of phonemes, syllables and supra-syllabic sequences in the speech production network. Neuroimage 50 , 626–638 (2010).

Paulk, A. C. et al. Large-scale neural recordings with single neuron resolution using Neuropixels probes in human cortex. Nat. Neurosci. 25 , 252–263 (2022).

Coughlin, B. et al. Modified Neuropixels probes for recording human neurophysiology in the operating room. Nat. Protoc. 18 , 2927–2953 (2023).

Windolf, C. et al. Robust online multiband drift estimation in electrophysiology data. In Proc. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1–5 (IEEE, Rhodes Island, 2023).

Mehri, A. & Jalaie, S. A systematic review on methods of evaluate sentence production deficits in agrammatic aphasia patients: validity and reliability issues. J. Res. Med. Sci. 19 , 885–898 (2014).

Abbott, L. F. & Sejnowski, T. J. Neural Codes and Distributed Representations: Foundations of Neural Computation (MIT, 1999).

Green, D. M. & Swets, J. A. Signal Detection Theory and Psychophysics (Wiley, 1966).

International Phonetic Association. Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet (Cambridge Univ. Press, 1999).

Indefrey, P. & Levelt, W. J. M. in The New Cognitive Neurosciences 2nd edn (ed. Gazzaniga, M. S.) 845–865 (MIT, 2000).

Slobin, D. I. Thinking for speaking. In Proc. 13th Annual Meeting of the Berkeley Linguistics Society (eds Aske, J. et al.) 435–445 (Berkeley Linguistics Society, 1987).

Pillon, A. Morpheme units in speech production: evidence from laboratory-induced verbal slips. Lang. Cogn. Proc. 13 , 465–498 (1998).

King, J. R. & Dehaene, S. Characterizing the dynamics of mental representations: the temporal generalization method. Trends Cogn. Sci. 18 , 203–210 (2014).

Machens, C. K., Romo, R. & Brody, C. D. Functional, but not anatomical, separation of “what” and “when” in prefrontal cortex. J. Neurosci. 30 , 350–360 (2010).

Elsayed, G. F., Lara, A. H., Kaufman, M. T., Churchland, M. M. & Cunningham, J. P. Reorganization between preparatory and movement population responses in motor cortex. Nat. Commun. 7 , 13239 (2016).

Roy, S., Zhao, L. & Wang, X. Distinct neural activities in premotor cortex during natural vocal behaviors in a New World primate, the Common Marmoset ( Callithrix jacchus ). J. Neurosci. 36 , 12168–12179 (2016).

Eliades, S. J. & Miller, C. T. Marmoset vocal communication: behavior and neurobiology. Dev. Neurobiol. 77 , 286–299 (2017).

Okobi, D. E. Jr, Banerjee, A., Matheson, A. M. M., Phelps, S. M. & Long, M. A. Motor cortical control of vocal interaction in neotropical singing mice. Science 363 , 983–988 (2019).

Cohen, Y. et al. Hidden neural states underlie canary song syntax. Nature 582 , 539–544 (2020).

Hickok, G. Computational neuroanatomy of speech production. Nat. Rev. Neurosci. 13 , 135–145 (2012).

Sahin, N. T., Pinker, S., Cash, S. S., Schomer, D. & Halgren, E. Sequential processing of lexical, grammatical and phonological information within Broca’s area. Science 326 , 445–449 (2009).

Russo, A. A. et al. Neural trajectories in the supplementary motor area and motor cortex exhibit distinct geometries, compatible with different classes of computation. Neuron 107 , 745–758 (2020).

Willett, F. R. et al. A high-performance speech neuroprosthesis. Nature 620 , 1031–1036 (2023).

Boersma, P. & Weenink, D. Praat: Doing Phonetics by Computer (2020); www.fon.hum.uva.nl/praat/ .

McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M. & Sonderegger, M. Montreal forced aligner: trainable text-speech alignment using kaldi. In Proc. Annual Conference of the International Speech Communication Association 498–502 (ISCA, 2017).

Lancaster, J. L. et al. Automated regional behavioral analysis for human brain images. Front. Neuroinform. 6 , 23 (2012).

Lancaster, J. L. et al. Automated analysis of fundamental features of brain structures. Neuroinformatics 9 , 371–380 (2011).

Fischl, B. & Dale, A. M. Measuring the thickness of the human cerebral cortex from magnetic resonance images. Proc. Natl Acad. Sci. USA 97 , 11050–11055 (2000).

Fischl, B., Liu, A. & Dale, A. M. Automated manifold surgery: constructing geometrically accurate and topologically correct models of the human cerebral cortex. IEEE Trans. Med. Imaging 20 , 70–80 (2001).

Reuter, M., Schmansky, N. J., Rosas, H. D. & Fischl, B. Within-subject template estimation for unbiased longitudinal image analysis. Neuroimage 61 , 1402–1418 (2012).

Oostenveld, R., Fries, P., Maris, E. & Schoffelen, J. M. FieldTrip: open source software for advanced analysis of MEG, EEG and invasive electrophysiological data. Comput. Intell. Neurosci. 2011 , 156869 (2011).

Noiray, A., Iskarous, K., Bolanos, L. & Whalen, D. Tongue–jaw synergy in vowel height production: evidence from American English. In 8th International Seminar on Speech Production (eds Sock, R. et al.) 81–84 (ISSP, 2008).

Flege, J. E., Fletcher, S. G., McCutcheon, M. J. & Smith, S. C. The physiological specification of American English vowels. Lang. Speech 29 , 361–388 (1986).

Wells, J. Longman Pronunciation Dictionary (Pearson, 2008).

Seabold, S. & Perktold, J. Statsmodels: econometric and statistical modeling with Python. In Proc. 9th Python in Science Conference (eds van der Walt, S. & Millman, J.) 92–96 (SCIPY, 2010).

Cameron, A. C. & Windmeijer, F. A. G. An R -squared measure of goodness of fit for some common nonlinear regression models. J. Econometr. 77 , 329–342 (1997).

Hamilton, L. S. & Huth, A. G. The revolution will not be controlled: natural stimuli in speech neuroscience. Lang. Cogn. Neurosci. 35 , 573–582 (2020).

Hamilton, L. S., Oganian, Y., Hall, J. & Chang, E. F. Parallel and distributed encoding of speech across human auditory cortex. Cell 184 , 4626–4639 (2021).

Van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9 , 2579–2605 (2008).

Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12 , 2825–2830 (2011).

Ye, K. & Lim, L.-H. Schubert varieties and distances between subspaces of different dimensions. SIAM J. Matrix Anal. Appl. 37 , 1176–1197 (2016).

Acknowledgements

We thank all the participants for their generosity and willingness to take part in the research. We also thank A. Turk and S. Hufnagel for their insightful comments and suggestions as well as D. J. Kellar, Y. Chou, A. Zhang, A. O’Donnell and B. Mash for their assistance and contributions to the intraoperative setup and feedback. Finally, we thank B. Coughlin, E. Trautmann, C. Windolf, E. Varol, D. Soper, S. Stavisky and K. Shenoy for their assistance in developing the data processing pipeline. A.R.K. and W.M. are supported by the NIH Neuroscience Resident Research Program R25NS065743, M.J. is supported by CIHR and Foundations of Human Behavior Initiative, A.C.P. is supported by UG3NS123723, Tiny Blue Dot Foundation and P50MH119467. J.C. is supported by American Association of University Women, S.S.C. is supported by R44MH125700 and Tiny Blue Dot Foundation and Z.M.W. is supported by R01DC019653 and U01NS121616.

Author information

These authors contributed equally: Arjun R. Khanna, William Muñoz, Young Joon Kim

These authors jointly supervised this work: Sydney Cash, Ziv M. Williams

Authors and Affiliations

Department of Neurosurgery, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA

Arjun R. Khanna, William Muñoz, Yoav Kfir, Mohsen Jamali, Jing Cai, Martina L. Mustroph, Irene Caprara, Mackenna Mejdell, Jeffrey Schweitzer & Ziv M. Williams

Harvard Medical School, Boston, MA, USA

Young Joon Kim & Abigail Zuckerman

Department of Neurology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA

Angelique C. Paulk, Richard Hardstone, Domokos Meszéna & Sydney Cash

Harvard-MIT Division of Health Sciences and Technology, Boston, MA, USA

Ziv M. Williams

Harvard Medical School, Program in Neuroscience, Boston, MA, USA

Contributions

A.R.K. and Y.J.K. performed the analyses. Z.M.W., J.S. and W.M. performed the intraoperative neuronal recordings. W.M., Y.J.K., A.C.P., R.H. and D.M. performed the data processing and neuronal alignments. W.M. performed the spike sorting. A.C.P. and W.M. reconstructed the recording locations. A.R.K., W.M., Y.J.K., Y.K., A.C.P., M.J., J.C., M.L.M., I.C. and D.M. performed the experiments. Y.K. and M.J. implemented the task. M.M. and A.Z. transcribed the speech signals. A.C.P., S.C. and Z.M.W. devised the intraoperative Neuropixels recording approach. A.R.K., W.M., Y.J.K., A.C.P., M.J., J.S. and S.C. edited the manuscript and Z.M.W. conceived and designed the study, wrote the manuscript and directed and supervised all aspects of the research.

Corresponding author

Correspondence to Ziv M. Williams .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Peer review

Peer review information.

Nature thanks Eyiyemisi Damisah, Yves Boubenec and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Single-unit isolations from the human prefrontal cortex using Neuropixels recordings.

a. Individual recording sites on a standardized 3D brain model (FreeSurfer), shown in side (top), zoomed-in oblique (inset) and top (bottom) views. Recordings lay across the posterior middle frontal gyrus of the language-dominant prefrontal cortex and roughly ranged in distribution from alongside anterior area 55b to 8a. b. Recording coordinates for the five participants are given in MNI space. c. Left, representative example of raw, motion-corrected action potential traces recorded across neighbouring channels over time. Right, an example of overlaid spike waveform morphologies and their distribution across neighbouring channels recorded from a Neuropixels array. d. Isolation metrics for the recorded population (n = 272 units) together with an example of spikes from four concomitantly recorded units (labelled red, blue, cyan and yellow) in principal component space.

Extended Data Fig. 2 Naturalistic speech production task performance and phonetic selectivity across neurons and participants.

a. A priming-based speech production task that provided participants with pictorial representations of naturalistic events, which had to be described verbally in a specific order. The task trial example is given here for illustrative purposes (created with BioRender.com). b. Mean word production times across participants and the standard deviation of the mean. The blue bars and dots represent performance for the five participants in whom recordings were acquired (n = 964, 1252, 406, 836, 805 words, respectively). The grey bar and dots represent a healthy control (n = 1534 words). c. Percentage of modulated neurons that responded selectively to specific planned phonemes across participants. All participants possessed neurons that responded to various phonetic features (one-sided χ2 = 10.7, 6.9, 7.4, 0.5 and 1.3, p = 0.22, 0.44, 0.49, 0.97, 0.86, for participants 1–5, respectively).

Extended Data Fig. 3 Examples of single-neuronal activities and their temporal dynamics.

a . Peri-event time histograms were constructed by aligning the action potentials of each neuron to word onset. Data are presented as mean (line) values ± standard error of the mean (shade). Examples of three representative neurons that selectively changed their activity to specific planned phonemes. Inset , spike waveform morphology and scale bar (0.5 ms). b . Peri-event time histogram and action potential raster for the same neurons above but now aligned to the onset of the articulated phonemes themselves. Data are presented as mean (line) values ± standard error of the mean (shade). c . Sankey diagram displaying the proportions of neurons (n = 56) that displayed a change in activity polarity (increases in orange and decreases in purple) from planning to production.

Extended Data Fig. 4 Generalizability of explanatory power across phonetic groupings for consonants and vowels.

a. Scatter plots of the model explanatory power (D2) for different phonetic groupings across the cell population (n = 272 units). Phonetic groupings were based on the planned (i) places of articulation of consonants and/or vowels, (ii) manners of articulation of consonants and (iii) primary cardinal vowels (Extended Data Table 1). Model D2 explanatory power was significantly correlated across all phonetic groupings (from top left to bottom right, p = 1.6×10−146, p = 2.8×10−70, p = 6.1×10−54, p = 1.4×10−57, p = 2.3×10−43 and p = 5.9×10−43, two-sided tests of Spearman rank-order correlations). Spearman's ρ are 0.96, 0.83, 0.77, respectively, for the left to right top panels and 0.78, 0.71, 0.71, respectively, for the left to right bottom panels (dashed regression lines). Among phoneme-selective neurons, the planned places of articulation provided the highest explanatory power (two-sided Wilcoxon signed-rank test of model D2 values, W = 716, p = 7.9×10−16) and the best model fits (two-sided Wilcoxon signed-rank test of AIC, W = 2255, p = 1.3×10−5) compared to manners of articulation. They also provided the highest explanatory power (two-sided Wilcoxon signed-rank test of model D2 values, W = 846, p = 9.7×10−15) and fits (two-sided Wilcoxon signed-rank test of AIC, W = 2088, p = 2.0×10−6) compared to vowels. b. Multidimensional scaling (MDS) representation of all neurons across phonetic groupings. Neurons with similar response characteristics are plotted closer together. The hue of each point reflects the degree of selectivity to specific phonetic features. Here, the colour scale for places of articulation is provided in red, manners of articulation in green and vowels in blue. The size of each point reflects the magnitude of the maximum explanatory power in relation to each cell's phonetic selectivity (maximum D2 for places of articulation of consonants and/or vowels, manners of articulation of consonants and primary cardinal vowels).

Extended Data Fig. 5 Explanatory power for the acoustic–phonetic properties of phonemes and neuronal tuning to morphemes.

a . Left , scatter plot of the D 2 explanatory power of neurons for planned phonemes and their observed spectral frequencies during articulation (n = 272 units; Spearman’s ρ = 0.75, p = 9.3×10 −50 , two-sided test of Spearman rank-order correlation). Right , decoding performances for the spectral frequency of phonemes (n = 50 random test/train splits; p = 7.1×10 −18 , two-sided Mann–Whitney U-test). Data are presented as mean values ± standard error of the mean. b . Venn diagrams of neurons that were modulated by phonemes during planning and those that were modulated by the spectral frequency (left) and amplitude (right) of the phonemes during articulation. c . Left , peri-event time histogram and raster for a representative neuron exhibiting selectivity to words that contained bound morphemes (for example, –ing , –ed ) compared to words that did not. Data are presented as mean (line) values ± standard error of the mean (shade). Inset , spike waveform morphology and scale bar (0.5 ms). Right , decoding performance distribution for morphemes (n = 50 random test/train splits; p = 1.0×10 −17 , two-sided Mann–Whitney U-test). Data are presented as mean values ± standard deviation.

Extended Data Fig. 6 Phonetic representations of words during speech perception and the comparison of speaking to listening.

a. Left, Venn diagrams of neurons that selectively changed their activity to specific phonemes during word planning (−500:0 ms from word utterance onset) and perception (0:500 ms from word utterance onset). Right, average z-scored firing rate for selective neurons during word planning (black) and perception (grey) as a function of the Hamming distance. Here, the Hamming distance was based on the neurons' preferred phonetic compositions during production and compared for the same neurons during perception. Data are presented as mean (line) values ± standard error of the mean (shade). b. Left, classifier decoding performances for selective neurons during word planning. The points provide the sampled distribution for the classifier's ROC-AUC values (black) compared to random chance (grey; n = 50 random test/train splits; p = 7.1×10−18, two-sided Mann–Whitney U-test). Middle, decoding performance for selective neurons during perception (n = 50 random test/train splits; p = 7.1×10−18, two-sided Mann–Whitney U-test). Right, word planning-perception model-switch decoding performances for selective neurons. Here, models were trained on neural data for specific phonemes during planning and then used to decode those same phonemes during perception (n = 50 random test/train splits; p > 0.05, two-sided Mann–Whitney U-test; Methods). The boundaries and midline of the boxplots represent the 25th and 75th percentiles and the median, respectively. c. Peak decoding performance for phonemes, syllables and morphemes as a function of time from perceived word onset. Peak decoding for morphemes was observed significantly later than for phonemes and syllables during perception (n = 50 random test/train splits; two-sided Kruskal–Wallis, H = 14.8, p = 0.00062). Data are presented here as median (dot) values ± bootstrapped standard error of the median.

Extended Data Fig. 7 Spatial distribution of representations based on cortical location and depth.

a . Relationship between recording location along the rostral–caudal axis of the prefrontal cortex and the proportion of neurons that displayed selectivity to specific phonemes, syllables and morphemes. Neurons that displayed selectivity were more likely to be found posteriorly (one-sided χ 2 test, p = 2.6×10 −9 , 3.0×10 −11 , 2.5×10 −6 , 3.9×10 −10 , for places of articulation, manners of articulation, syllables and morpheme, respectively). b . Relationship between recording depth along the cortical column and the proportion of neurons that display selectivity to specific phonemes, syllables and morphemes. Neurons that displayed selectivity were broadly distributed along the cortical column (one-sided χ 2 test, p > 0.05). Here, S indicates superficial, M middle and D deep.

Extended Data Fig. 8 Receiver operating characteristic curves across planned phonetic representations and decoding model-switching performances for word planning and production.

a. ROC-AUC curves for neurons across different phonemes, grouped by place of articulation, during planning (there were insufficient palatal consonants to allow for classification, so these are not displayed here). b. Average (solid line) and shuffled (dotted line) data across all phonemes. Data are presented as mean (line) values ± standard error of the mean (shade). c. Planning-production model-switch decoding performance sample distribution (n = 50 random test/train splits) for all selective neurons. Here, models were trained on neuronal data recorded during planning and then used to decode the same phoneme (left), syllable (middle) or morpheme (right) on neuronal data recorded during production. Slightly lower decoding performances were noted for syllables and morphemes when comparing word planning to production (p = 0.020 for the syllable comparison and p = 0.032 for the morpheme comparison, two-sided Mann–Whitney U-test). Data are presented as mean values ± standard deviation.

Extended Data Fig. 9 Example of phonetic representations in planning and production subspaces.

Modelled depiction of the neuronal population trajectory (bootstrap resampled) across averaged trials with (green) and without (grey) mid-low phonemes, projected into a plane within the “planning” subspace (y-axis) and a plane within the “production” subspace (z-axis). Projection planes within planning and production subspaces were chosen to enable visualization of trajectory divergence. Zero indicates word onset on the x-axis. Separation between the population trajectory during trials with and without mid-low phonemes is apparent in the planning subspace (y-axis) independently of the projection subspace (z-axis) because these subspaces are orthogonal. The orange plane indicates a hypothetical decision boundary learned by a classifier to separate neuronal activities between mid-low and non-mid-low trials. Because the classifier decision boundary is not constrained to lie within a particular subspace, classifier performance may therefore generalize across planning and production epochs, despite the near-orthogonality of these respective subspaces.

Supplementary information

Reporting Summary

Source Data Figs. 1–4

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Khanna, A.R., Muñoz, W., Kim, Y.J. et al. Single-neuronal elements of speech production in humans. Nature 626 , 603–610 (2024). https://doi.org/10.1038/s41586-023-06982-w

Download citation

Received : 22 June 2023

Accepted : 14 December 2023

Published : 31 January 2024

Issue Date : 15 February 2024

DOI : https://doi.org/10.1038/s41586-023-06982-w



The Source–Filter Theory of Speech

  • Isao Tokuda, Ritsumeikan University
  • https://doi.org/10.1093/acrefore/9780199384655.013.894
  • Published online: 29 November 2021

In the source-filter theory, the mechanism of speech production is described as a two-stage process: (a) The air flow coming from the lungs induces tissue vibrations of the vocal folds (i.e., two small muscular folds located in the larynx) and generates the “source” sound. Turbulent airflows are also created at the glottis or at the vocal tract to generate noisy sound sources. (b) Spectral structures of these source sounds are shaped by the vocal tract “filter.” Through the filtering process, frequency components corresponding to the vocal tract resonances are amplified, while the other frequency components are diminished. The source sound mainly characterizes the vocal pitch (i.e., fundamental frequency), while the filter forms the timbre. The source-filter theory provides a very accurate description of normal speech production and has been applied successfully to speech analysis, synthesis, and processing. Separate control of the source (phonation) and the filter (articulation) is advantageous for acoustic communications, especially for human language, which requires expression of various phonemes realized by a flexible maneuver of the vocal tract configuration. Based on this idea, the articulatory phonetics focuses on the positions of the vocal organs to describe the produced speech sounds.

The source-filter theory elucidates the mechanism of “resonance tuning,” that is, a specialized way of singing. To increase efficiency of the vocalization, soprano singers adjust the vocal tract filter to tune one of the resonances to the vocal pitch. Consequently, the main source sound is strongly amplified to produce a loud voice, which is well perceived in a large concert hall over the orchestra.

It should be noted that the source–filter theory is based upon the assumption that the source and the filter are independent from each other. Under certain conditions, the source and the filter interact with each other. The source sound is influenced by the vocal tract geometry and by the acoustic feedback from the vocal tract. Such source–filter interaction induces various voice instabilities, for example, sudden pitch jump, subharmonics, resonance, quenching, and chaos.

  • source–filter theory
  • speech production
  • vocal fold vibration
  • turbulent air flow
  • vocal tract acoustics
  • resonance tuning
  • source–filter interaction

1. Background

Human speech sounds are generated by a complex interaction of components of human anatomy. Most speech sounds begin with the respiratory system, which expels air from the lungs (figure 1 ). The air goes through the trachea and enters into the larynx, where two small muscular folds, called “vocal folds,” are located. As the vocal folds are brought together to form a narrow air passage, the airstream causes them to vibrate in a periodic manner (Titze, 2008 ). The vocal fold vibrations modulate the air pressure and produce a periodic sound. The produced sounds, when the vocal folds are vibrating, are called “voiced sounds,” while those in which the vocal folds do not vibrate are called “unvoiced sounds.” The air passages above the larynx are called the “vocal tract.” Turbulent air flows generated at constricted parts of the glottis or the vocal tract also contribute to aperiodic source sounds distributed over a wide range of frequencies. The shape of the vocal tract and consequently the positions of the articulators (i.e., jaw, tongue, velum, lips, mouth, teeth, and hard palate) provide a crucial factor to determine acoustical characteristics of the speech sounds. The state of the vocal folds, as well as the positions, shapes, and sizes of the articulators, changes over time to produce various phonetic sounds sequentially.

Figure 1. Concept of the source-filter theory. Airflow from the lung induces vocal fold vibrations, where glottal source sound is created. The vocal tract filter shapes the spectral structure of the source sound. The filtered speech sound is finally radiated from the mouth.

To systematically understand the mechanism of speech production, the source-filter theory divides such process into two stages (Chiba & Kajiyama, 1941 ; Fant, 1960 ) (see figure 1 ): (a) The air flow coming from the lungs induces tissue vibration of the vocal folds that generates the “source” sound. Turbulent noise sources are also created at constricted parts of the glottis or the vocal tract. (b) Spectral structures of these source sounds are shaped by the vocal tract “filter.” Through the filtering process, frequency components, which correspond to the resonances of the vocal tract, are amplified, while the other frequency components are diminished. The source sound characterizes mainly the vocal pitch, while the filter forms the overall spectral structure.

The source-filter theory provides a good approximation of normal human speech, under which the source sounds are only weakly influenced by the vocal tract filter, and has been applied successfully to speech analysis, synthesis, and processing (Atal & Schroeder, 1978 ; Markel & Gray, 2013 ). Independent control of the source (phonation) and the filter (articulation) is advantageous for acoustic communications with language, which requires expression of various phonemes with a flexible maneuver of the vocal tract configuration (Fitch, 2010 ; Lieberman, 1977 ).

2. Source-Filter Theory

There are four main types of sound sources that provide an acoustic input to the vocal tract filter: glottal source, aspiration source, frication source, and transient source (Stevens, 1999 , 2005 ).

The glottal source is generated by the vocal fold vibrations. The vocal folds are muscular folds located in the larynx. The opening space between the left and right vocal folds is called “glottal area.” When the vocal folds are closely located to each other, the airflow coming from the lungs can cause the vocal fold tissues to vibrate. With combined effects of pressure, airflow, tissue elasticity, and collision between the left and right vocal folds, the vocal folds give rise to vibrations, which periodically modulate acoustic air pressure at the glottis. The number of the periodic glottal vibrations per second is called “fundamental frequency ( f o )” and is expressed in Hz or cycles per second. In the spectral space, the glottal source sound determines the strengths of the fundamental frequency and its integer multiples (harmonics). The glottal wave provides sources for voiced sounds such as vowels (e.g., [a],[e],[i],[o],[u]), diphthongs (i.e., combinations of two vowel sounds), and voiced consonants (e.g., [b],[d],[ɡ],[v],[z],[ð],[ʒ],[ʤ], [h],[w],[n],[m],[r],[j],[ŋ],[l]).

In addition to the glottal source, noisy signals also serve as the sound sources for consonants. Here, air turbulence developed at constricted or obstructed parts of the airway contributes to random (aperiodic) pressure fluctuations over a wide range of frequencies. Among such noisy signals, the one generated through the glottis or immediately above the glottis is called “aspiration noise.” It is characterized by a strong burst of breath that accompanies either the release or the closure of some obstruents. “Frication noise,” on the other hand, is generated by forcing air through a supraglottal constriction created by placing two articulators close together (e.g., constrictions between lower lip and upper teeth, between back of the tongue and soft palate, and between side of the tongue and molars) (Shadle, 1985 , 1991 ). When an airway in the vocal tract is completely closed and then released, “transient noise” is generated. By forming a closure in the vocal tract, a pressure is built up in the mouth behind the closure. As the closure is released, a brief burst of turbulence is produced, which lasts for a few milliseconds.

Some speech sounds may involve more than one sound source. For instance, a voiced fricative combines the glottal source and the frication noise. A breathy voice may come from the glottal source and the aspiration noise, whereas voiceless fricatives can combine two noise sources generated at the glottis and at the supralaryngeal constriction. These sound sources are fed into the vocal-tract filter to create speech sounds.

In the source-filter theory, the vocal tract acts as an acoustic filter to modify the source sound. Through this acoustic filter, certain frequency components are passed to the output speech, while the others are attenuated. The characteristics of the filter depend upon the shape of the vocal tract. As a simple case, consider the acoustic characteristics of a uniform tube of length L = 17.5 cm, that is, a standard length for a male vocal tract (see figure 2). At one end, the tube is closed (as glottis), while, at the other end, it is open (as mouth). Inside the tube, longitudinal sound waves travel either toward the mouth or toward the glottis. The wave propagates by alternately compressing and expanding the air in the tube segments. By this compression/expansion, the air molecules are slightly displaced from their rest positions. Accordingly, the acoustic air pressure inside the tube changes in time, depending upon the longitudinal displacement of the air along the direction of the traveling wave. The profile of the acoustic air pressure inside the tube is determined by the traveling waves going to the mouth or to the glottis. What is formed here is the "standing wave," the peak amplitude profile of which does not move in space. The locations at which the absolute value of the amplitude is minimum are called "nodes," whereas the locations at which the absolute value of the amplitude is maximum are called "antinodes." Since the air molecules cannot vibrate much at the closed end of the tube, the closed end becomes a node. The open end of the tube, on the other hand, becomes an antinode, since the air molecules can move freely there. Various standing waves that satisfy these boundary conditions can be formed. In figure 2, the 1/4 (purple), 3/4 (green), and 5/4 (sky blue) waves indicate the first, second, and third resonances, respectively. Depending upon the number of nodes in the tube, the wavelengths of the standing waves are λ = 4L, 4L/3, and 4L/5. The corresponding frequencies are f = c/λ = 490, 1470, and 2450 Hz, where c = 343 m/s is the speed of sound. These resonant frequencies are called "formants" in phonetics.

Figure 2. Standing waves of a uniform tube. For a tube having one closed end (glottis) and one open end (mouth), only odd-numbered harmonics are available. The 1/4 (purple), 3/4 (green), and 5/4 (sky blue) waves correspond to the first, second, and third resonances ("1/4 wave" means that 1/4 of a one-cycle waveform is inside the tube).
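The quarter-wave relations above are easy to verify numerically. The following minimal sketch (not part of the original article; the function name and defaults are purely illustrative) computes the lowest resonances of a uniform tube closed at the glottis and open at the lips:

```python
# Resonances of a uniform tube closed at one end (glottis) and open at the
# other (mouth): f_n = (2n - 1) * c / (4L), i.e., only odd quarter-wavelength
# modes satisfy the boundary conditions described in the text.

def uniform_tube_formants(length_m, n_formants=3, c=343.0):
    """Return the first n_formants resonance frequencies (Hz) of the tube."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, n_formants + 1)]

print(uniform_tube_formants(0.175))  # approximately [490, 1470, 2450] Hz, as in the text
```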

Next, consider that a source sound is input to this acoustic tube. In the source sound (voiced source or noise, or both), acoustic energy is distributed in a broad range of frequencies. The source sound induces vibrations of the air column inside the tube and produces a sound wave in the external air as the output. The strength at which an input frequency is output from this acoustic filter depends upon the characteristics of the tube. If the input frequency component is close to one of the formants, the tube resonates with the input and propagates the corresponding vibration. Consequently, the frequency components near the formant frequencies are passed to the output at their full strength. If the input frequency component is far from any of these formants, however, the tube does not resonate with the input. Such frequency components are strongly attenuated and achieve only low oscillation amplitudes in the output. In this way, the acoustic tube, or the vocal tract, filters the source sound. This filtering process can be characterized by a transfer function, which describes dependence of the amplification ratio between the input and output acoustic signals on the frequency. Physically, the transfer function is determined by the shape of the vocal tract.

Finally, the sound wave is radiated from the lips of the mouth and the nose. Their radiation characteristics are also included in the vocal-tract transfer function.

2.3 Convolution of the Source and the Filter

Humans are able to control phonation (source generation) and articulation (filtering process) largely independently. The speech sounds are therefore considered as the response of the vocal-tract filter, into which a sound source is fed. To model such source-filter systems for speech production, the sound source, or excitation signal x(t), is often implemented as a periodic impulse train for voiced speech, while white noise is used as a source for unvoiced speech. If the vocal-tract configuration does not change in time, the vocal-tract filter becomes a linear time-invariant (LTI) system, and the output signal y(t) can be expressed by a convolution of the input signal x(t) and the impulse response of the system h(t) as

y(t) = x(t) * h(t),   (1)

where the asterisk denotes the convolution. Equation (1), which is described in the time domain, can also be expressed in the frequency domain as

Y(ω) = X(ω) H(ω).   (2)

The frequency domain formula states that the speech spectrum Y(ω) is modeled as a product of the source spectrum X(ω) and the spectrum of the vocal-tract filter H(ω). The spectrum of the vocal-tract filter H(ω) is represented by the product of the vocal-tract transfer function T(ω) and the radiation characteristics from the mouth and the nose R(ω), that is, H(ω) = T(ω) R(ω).
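The equivalence between the time-domain convolution (1) and the frequency-domain product (2) can be checked numerically with a small sketch. The source and impulse response below are placeholders chosen only for illustration (an impulse train at 100 Hz and a short damped cosine); they are not taken from the article.

```python
import numpy as np

fs, f0, n = 16000, 100, 1024            # sampling rate, source f_o, signal length

# Crude voiced source x(t): periodic impulse train at f_o = 100 Hz
x = np.zeros(n)
x[:: fs // f0] = 1.0

# Placeholder impulse response h(t) standing in for the vocal-tract filter
k = np.arange(64)
h = np.exp(-k / 8.0) * np.cos(2 * np.pi * 800 * k / fs)

# Time domain: y(t) = x(t) * h(t)
y_time = np.convolve(x, h)

# Frequency domain: Y(w) = X(w) H(w); zero-padding both spectra to the full
# output length makes the circular convolution implied by the FFT equal to
# the linear one
N = len(x) + len(h) - 1
y_freq = np.fft.irfft(np.fft.rfft(x, N) * np.fft.rfft(h, N), N)

print(np.allclose(y_time, y_freq))      # True: convolution <-> spectral product
```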

There exist several ways to estimate the vocal-tract filter H(ω). The most popular approach is inverse filtering, in which autoregressive parameters are estimated from an acoustic speech signal by the method of least squares (Atal & Schroeder, 1978; Markel & Gray, 2013). The transfer function can then be recovered from the estimated autoregressive parameters. In practice, however, inverse filtering is limited to non-nasalized or slightly nasalized vowels. An alternative approach is based upon measurement of the vocal tract shape. For a human subject, the cross-sectional area of the vocal tract can be measured by X-ray photography or magnetic resonance imaging (MRI). Once the area function of the vocal tract is obtained, the corresponding transfer function can be computed by the so-called transmission line model, which assumes one-dimensional plane-wave propagation inside the vocal tract (Sondhi & Schroeter, 1987; Story et al., 1996).
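As a rough illustration of the inverse-filtering idea, the sketch below estimates an all-pole (autoregressive) approximation of H(ω) from a speech frame using the autocorrelation method. This is a minimal sketch rather than any author's exact procedure: the function names and the model order of 12 are arbitrary choices, and off-the-shelf routines (for example, librosa.lpc) serve the same purpose.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame, order=12):
    """Autocorrelation-method linear prediction: solve the Yule-Walker
    equations R a = r for the predictor coefficients a[1..order]."""
    s = frame - np.mean(frame)
    r = np.correlate(s, s, mode="full")[len(s) - 1:]      # autocorrelation, lags 0, 1, ...
    return solve_toeplitz((r[:order], r[:order]), r[1:order + 1])

def allpole_spectrum(frame, order=12, n_freq=512):
    """|1 / A(e^{jw})| of the estimated all-pole filter; its peaks
    approximate the formant frequencies of the frame."""
    a = lpc_coefficients(frame, order)
    a_poly = np.concatenate(([1.0], -a))                  # A(z) = 1 - sum_k a_k z^-k
    w = np.linspace(0.0, np.pi, n_freq)
    A = np.exp(-1j * np.outer(w, np.arange(order + 1))) @ a_poly
    return 1.0 / np.abs(A)
```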

Figure 3. (a) Vocal tract area function for a male speaker’s vowel [a]. (b) Transfer function calculated from the area function of (a). (c) Power spectrum of the source sound generated from Liljencrants-Fant model. (d) Power spectrum of the speech signal generated from the source-filter theory.

As an example to illustrate the source-filter modeling, the vowel /a/ is synthesized in figure 3. The vocal tract area function of figure 3(a) was measured from a male subject by MRI (Story et al., 1996). By the transmission line model, the transfer function H(ω) is obtained as in figure 3(b). The first and the second formants are located at F1 = 805 Hz and F2 = 1205 Hz. By the inverse Fourier transform, the impulse response of the vocal tract system h(t) is derived. As a glottal source sound, the Liljencrants–Fant synthesis model (Fant et al., 1985) is utilized. The fundamental frequency is set to f o = 100 Hz, which gives rise to a sharp peak in the power spectrum in figure 3(c). Except for the peaks appearing at higher harmonics of f o, the spectral structure of the glottal source is rather flat. As shown in figure 3(d), convolution of the source signal with the vocal tract filter amplifies the higher harmonics of f o located close to the formants.
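A heavily simplified version of this example can be sketched in a few lines. Here an impulse train at f o = 100 Hz stands in for the Liljencrants–Fant glottal pulse, and two second-order resonators near F1 = 805 Hz and F2 = 1205 Hz stand in for the measured transfer function, so the result is only a caricature of figure 3, not a reproduction of it; the bandwidths and gains are arbitrary assumptions.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000
n = int(0.5 * fs)                       # half a second of signal

# Source: impulse train at f_o = 100 Hz (placeholder for an LF glottal pulse)
source = np.zeros(n)
source[:: fs // 100] = 1.0

def resonator(f_c, bw, fs):
    """b, a coefficients of a two-pole resonator at centre frequency f_c (Hz)."""
    r = np.exp(-np.pi * bw / fs)        # pole radius set by the bandwidth bw
    theta = 2.0 * np.pi * f_c / fs
    return [1.0 - r], [1.0, -2.0 * r * np.cos(theta), r ** 2]

# Cascade the two formant filters, a crude stand-in for H(w)
vowel = source
for f_c in (805.0, 1205.0):
    b, a = resonator(f_c, 80.0, fs)
    vowel = lfilter(b, a, vowel)
```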

Since the source-filter modeling captures the essence of speech production, it has been successfully applied to speech analysis, synthesis, and processing (Atal & Schroeder, 1978; Markel & Gray, 2013). It was Chiba and Kajiyama (1941) who first explained the mechanisms of speech production based on the concept of phonation (source) and articulation (filter). Their idea was combined with Fant's filter theory (Fant, 1960), which led to the "source-filter theory of vowel production" in the studies of speech production.

So far, the source-filter modeling has been applied only to the glottal source, in which the vocal fold vibrations provide the main source sounds. There are other sound sources, such as the frication noise. In the frication noise, air turbulence develops at constricted (or obstructed) parts of the airway. Such a random source also excites the resonances of the vocal tract in a similar manner as the glottal source (Stevens, 1999, 2005). Its marked difference from the glottal source is that the filter property is determined by the vocal tract shape downstream from the constriction (or obstruction). For instance, if the constriction is at the lips, there is no cavity downstream from the constriction, and therefore the acoustic source is radiated directly from the mouth opening with no filtering. When the constriction is upstream from the lips, the shape of the airway between the constriction and the lips determines the filter properties. It should also be noted that the turbulent source, generated at the constriction, depends sensitively on the three-dimensional geometry of the vocal tract. Therefore, the three-dimensional shape of the vocal tract (not the one-dimensional shape of the area function) should be taken into account to model the frication noise (Shadle, 1985, 1991).

3. Resonance Tuning

As an interesting application of the source-filter theory, "resonance tuning" (Sundberg, 1989) is illustrated. In female speech, the first and the second formants lie between 300 and 900 Hz and between 900 and 2,800 Hz, respectively. In soprano singing, the vocal pitch can reach these two ranges. To increase the efficiency of the vocalization at high f o, a soprano singer adjusts the shape of the vocal tract to tune the first or second resonance (R1 or R2) to the fundamental frequency f o. When one of the harmonics of f o coincides with a formant resonance, the resulting acoustic power (and musical success) is enhanced.

Figure 4. Resonance tuning. (a) The same transfer function as figure 3 (b). (b) Power spectrum of the source sound, whose fundamental frequency f o is tuned to the first resonance R 1 of the vocal tract. (c) Power spectrum of the speech signal generated from the source-filter theory. (d) Dependence of the amplification rate (i.e., power ratio between the output speech and the input source) on the fundamental frequency f o .

Figure 4 shows an example of the resonance tuning, in which the fundamental frequency is tuned to the first resonance R 1 of the vowel /a/ as f o = 805 Hz . As recognized in the output speech spectrum (figure 4 (c) ), the vocal tract filter strongly amplifies the fundamental frequency component of the vocal source, while the other harmonics are attenuated. Since only a single frequency component is emphasized, the output speech sounds like a pure tone. Figure 4 (d) shows dependence of the amplification ratio (i.e., the power ratio between the output speech and the input source) on the fundamental frequency f o . Indeed, the power of the output speech is maximized at the resonance tuning point of f o = 805 Hz . Without losing the source power, loud voices can be produced with less effort from the singers and, moreover, they are well perceived in a large concert hall over the orchestra (Joliveau et al., 2004 ).
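The amplification-ratio curve of figure 4(d) can be imitated qualitatively with the toy filter used above: sweep the fundamental frequency of an impulse-train source through a single resonance fixed at R1 = 805 Hz and measure the output-to-input power ratio. This is a minimal sketch under stated assumptions (an assumed 80 Hz bandwidth, an impulse train rather than a glottal pulse), so only the qualitative peak near f o = 805 Hz carries over, not the numbers in the figure.

```python
import numpy as np
from scipy.signal import lfilter

fs, dur = 16000, 0.5

# Single resonance fixed at R1 = 805 Hz with an assumed 80 Hz bandwidth
r = np.exp(-np.pi * 80.0 / fs)
theta = 2.0 * np.pi * 805.0 / fs
b, a = [1.0 - r], [1.0, -2.0 * r * np.cos(theta), r ** 2]

def amplification_ratio(f_o):
    """Output power / input power for an impulse-train source at f_o."""
    source = np.zeros(int(dur * fs))
    source[:: int(round(fs / f_o))] = 1.0
    out = lfilter(b, a, source)
    return np.sum(out ** 2) / np.sum(source ** 2)

for f_o in (200, 400, 805, 1200):
    print(f_o, round(amplification_ratio(f_o), 4))   # largest ratio when f_o sits on the resonance
```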

Despite the significant increase in loudness, comprehensibility is sacrificed. With a strong enhancement of the fundamental frequency f o , its higher harmonics are weakened considerably, making it difficult to perceive the formant structure (figure 4 (c) ). This explains why it is difficult to identify words sung in the high range by sopranos.

The resonance tuning discussed here has been based on the linear convolution of the source and the filter, which are assumed to be independent of each other. In reality, however, the source and the filter interact with each other. Depending upon the acoustic properties of the vocal tract, this interaction can facilitate the vocal fold oscillations and make the vocal source stronger. Consequently, the source-filter interaction can make the output speech sound even louder in addition to the linear resonance effect. Such interaction is explained in more detail in section 4.

It should be of interest to note that some animals such as songbirds and gibbons utilize the technique of resonance tuning in their vocalizations (Koda et al., 2012 ; Nowicki, 1987 ; Riede et al., 2006 ). It has been found through X-ray filming as well as via heliox experiments that these animals adjust the vocal tract resonance to track the fundamental frequency f o . This may facilitate the acoustic communication by increasing the loudness of their vocalization. Again, higher harmonic components, which are needed to emphasize the formants in human language communications, are suppressed. Whether the animals utilize formants information in their communications is under debate (Fitch, 2010 ; Lieberman, 1977 ) but, at least in this context, production of a loud sound is more advantageous for long-distance alarm calls and pure-tone singing of animals.

4. Source-Filter Interaction

The linear source–filter theory, under which speech is represented as a convolution of the source and the filter, is based upon the assumption that the vocal fold vibrations as well as the turbulent noise sources are only weakly influenced by the vocal tract. Such an assumption is, however, valid mostly for male adult speech. The actual process of speech production is nonlinear. The vocal fold oscillations are due to combined effects of pressure, airflow, tissue elasticity, and tissue collision. It is natural that such a complex system obeys nonlinear equations of motion. Aerodynamics inside the glottis and the vocal tract is also governed by nonlinear equations in a strict sense. Moreover, there exists a mutual interaction between the source and the filter (Flanagan, 1968; Lucero et al., 2012; Rothenberg, 1981; Titze, 2008; Titze & Alipour, 2006). First, the source sound, which is generated by the vocal folds, is influenced by the vocal tract, since the vocal tract determines the pressure above the vocal folds and thereby changes the aerodynamics of the glottal flow. As described in section 2.3, the turbulent source is also very sensitive to the vocal tract geometry. Second, the source sound, which then propagates through the vocal tract, is not only radiated from the mouth but is also partially reflected back to the glottis through the vocal tract. Such reflection can influence the vocal fold oscillations, especially when the fundamental frequency or one of its harmonics lies close to one of the vocal tract resonances, for instance, in singing. The strong acoustic feedback makes the interrelation between the source and the filter nonlinear and induces various voice instabilities, for example, sudden pitch jump, subharmonics, resonance, quenching, and chaos (Hatzikirou et al., 2006; Lucero et al., 2012; Migimatsu & Tokuda, 2019; Titze et al., 2008).

Figure 5. Example of a glissando singing. A male subject glided the fundamental frequency ( f o ) from 120 Hz to 350 Hz and then back. The first resonance ( R 1 = 270 Hz ) is indicated by a black bold line. The pitch jump occurred when f o crossed R 1 .

Figure 5 shows a spectrogram that demonstrates such a pitch jump. The horizontal axis represents time, while the vertical axis represents frequency, with the spectral power of the singing voice shown as intensity. In this recording, a male singer glided his pitch over a certain frequency range. Accordingly, the fundamental frequency increases from 120 Hz to 350 Hz and then decreases back to 120 Hz. Around 270 Hz, the fundamental frequency or one of its higher harmonics crosses one of the resonances of the vocal tract (black bold line of figure 5), and the pitch jumps abruptly. At such a frequency crossing point, acoustic reflection from the vocal tract to the vocal folds becomes very strong and non-negligible. The source-filter interaction has two aspects (Story et al., 2000). On the one hand, the vocal tract acoustics facilitates the vocal fold oscillations and contributes to the production of a loud vocal sound, as discussed for resonance tuning (section 3). On the other hand, the vocal tract acoustics can inhibit the vocal fold oscillations and consequently induce a voice instability. For instance, the vocal fold oscillation can stop suddenly or spontaneously jump to another fundamental frequency, as exemplified by the glissando singing of figure 5. To avoid such voice instabilities, singers must weaken the level of the acoustic coupling, possibly by adjusting the epilarynx, whenever the frequency crossing takes place (Lucero et al., 2012; Titze et al., 2008).

5. Conclusions

In summary, the source-filter theory has been described as a basic framework to model human speech production. The source is generated from the vocal fold oscillations and/or the turbulent airflows developed above the glottis. The vocal tract functions as a filter to modify the spectral structure of the source sounds. This filtering mechanism has been explained in terms of the resonances of the acoustical tube. Independence between the source and the filter is vital for language-based acoustic communication in humans, which requires flexible maneuvering of the vocal tract configuration to express various phonemes sequentially and smoothly (Fitch, 2010; Lieberman, 1977). As an application of the source-filter theory, resonance tuning has been explained as a technique utilized by soprano singers and some animals. Finally, the existence of the source-filter interaction has been described. It is inevitable that the source sound is aerodynamically influenced by the vocal tract, since they are located close to each other. Moreover, the acoustic pressure wave reflecting back from the vocal tract to the glottis influences the vocal fold oscillations and can induce various voice instabilities. The source-filter interaction may become strong when the fundamental frequency or one of its higher harmonics crosses one of the vocal tract resonances, for example, in singing.

Further Reading

  • Atal, B. S. , & Schroeder, M. (1978). Linear prediction analysis of speech based on a pole-zero representation. The Journal of the Acoustical Society of America , 64 (5), 1310–1318.
  • Chiba, T. , & Kajiyama, M. (1941). The vowel: Its nature and structure . Tokyo, Japan: Kaiseikan.
  • Fant, G. (1960). Acoustic theory of speech production . The Hague, The Netherlands: Mouton.
  • Lieberman, P. (1977). Speech physiology and acoustic phonetics: An introduction . New York: Macmillan.
  • Markel, J. D. , & Gray, A. J. (2013). Linear prediction of speech (Vol. 12). New York: Springer Science & Business Media.
  • Stevens, K. N. (1999). Acoustic phonetics . Cambridge, MA: MIT Press.
  • Sundberg, J. (1989). The science of singing voice . DeKalb, IL: Northern Illinois University Press.
  • Titze, I. R. (1994). Principles of voice production . Englewood Cliffs, NJ: Prentice Hall.
  • Titze, I. R. , & Alipour, F. (2006). The myoelastic aerodynamic theory of phonation . Iowa, IA: National Center for Voice and Speech.
  • Fant, G. , Liljencrants, J. , & Lin, Q. (1985). A four-parameter model of glottal flow. Speech Transmission Laboratory. Quarterly Progress and Status Report , 26 (4), 1–13.
  • Fitch, W. T. (2010). The evolution of language . Cambridge, UK: Cambridge University Press.
  • Flanagan, J. L. (1968). Source-system interaction in the vocal tract. Annals of the New York Academy of Sciences , 155 (1), 9–17.
  • Hatzikirou, H. , Fitch, W. T. , & Herzel, H. (2006). Voice instabilities due to source-tract interactions. Acta Acoustica United With Acoustica , 92 , 468–475.
  • Joliveau, E. , Smith, J. , & Wolfe, J. (2004). Acoustics: Tuning of vocal tract resonance by sopranos. Nature , 427 (6970), 116.
  • Koda, H. , Nishimura, T. , Tokuda, I. T. , Oyakawa, C. , Nihonmatsu, T. , & Masataka, N. (2012). Soprano singing in gibbons. American Journal of Physical Anthropology , 149 (3), 347–355.
  • Lucero, J. C. , Lourenço, K. G. , Hermant, N. , Van Hirtum, A. , & Pelorson, X. (2012). Effect of source–tract acoustical coupling on the oscillation onset of the vocal folds. The Journal of the Acoustical Society of America , 132 (1), 403–411.
  • Migimatsu, K. , & Tokuda, I. T. (2019). Experimental study on nonlinear source–filter interaction using synthetic vocal fold models. The Journal of the Acoustical Society of America , 146 (2), 983–997.
  • Nowicki, S. (1987). Vocal tract resonances in oscine bird sound production: Evidence from birdsongs in a helium atmosphere. Nature , 325 (6099), 53–55.
  • Riede, T. , Suthers, R. A. , Fletcher, N. H. , & Blevins, W. E. (2006). Songbirds tune their vocal tract to the fundamental frequency of their song. Proceedings of the National Academy of Sciences , 103 (14), 5543–5548.
  • Rothenberg, M. (1981). The voice source in singing. In J. Sundberg (Ed.), Research aspects on singing (pp. 15–33). Stockholm, Sweden: Royal Swedish Academy of Music.
  • Shadle, C. H. (1985). The acoustics of fricative consonants [Doctoral thesis]. Cambridge, MA: Massachusetts Institute of Technology, released as MIT-RLE Technical Report No. 506.
  • Shadle, C. H. (1991). The effect of geometry on source mechanisms of fricative consonants. Journal of Phonetics , 19 (3–4), 409–424.
  • Sondhi, M. , & Schroeter, J. (1987). A hybrid time-frequency domain articulatory speech synthesizer. IEEE Transactions on Acoustics, Speech, and Signal Processing , 35 (7), 955–967.
  • Stevens, K. N. (2005). The acoustic/articulatory interface. Acoustical Science and Technology , 26 (5), 410–417.
  • Story, B. H. , Laukkanen, A.M. , & Titze, I. R. (2000). Acoustic impedance of an artificially lengthened and constricted vocal tract. Journal of Voice , 14 (4), 455–469.
  • Story, B. H. , Titze, I. R. , & Hoffman, E. A. (1996). Vocal tract area functions from magnetic resonance imaging. The Journal of the Acoustical Society of America , 100 (1), 537–554.
  • Titze, I. R. (2008). Nonlinear source–filter coupling in phonation: Theory. The Journal of the Acoustical Society of America , 123 (5), 2733–2749.
  • Titze, I. , Riede, T. , & Popolo, P. (2008). Nonlinear source–filter coupling in phonation: Vocal exercises. The Journal of the Acoustical Society of America , 123 (4), 1902–1915.

Related Articles

  • Articulatory Phonetics
  • Child Phonology
  • Speech Perception in Phonetics
  • Direct Perception of Speech
  • Phonetics of Singing in Western Classical Style
  • Phonetics of Vowels
  • Phonetics of Consonants
  • Audiovisual Speech Perception and the McGurk Effect
  • The Motor Theory of Speech Perception
  • Articulatory Phonology
  • The Phonetics of Prosody
  • Tongue Muscle Anatomy: Architecture and Function



What Is a Speech Sound Disorder?


Speech sound disorder is a blanket term for a child's difficulty in learning, articulating, or using the sounds and sound patterns of their language. These difficulties are usually clear when compared with the communication abilities of other children in the same age group.

Speech developmental disorders may indicate challenges with motor speech. Here, a child experiences difficulty moving the muscles necessary for speech production. This child may also face reduced coordination when attempting to speak.

Speech sound disorders are recognized where speech patterns do not correspond with the movements/gestures made when speaking.  

Speech impairments are a common early childhood occurrence—an estimated 2% to 13% of children live with these difficulties. Children with these disorders may struggle with reading and writing. This can interfere with their expected academic performance. Speech sound disorders are often confused with language conditions such as specific language impairment (SLI).

This article will examine the distinguishing features of this disorder. It will also review factors responsible for speech challenges, and the different ways they can manifest. Lastly, we’ll cover different treatment methods that make managing this disorder possible.

Symptoms of Speech Sound Disorder

A speech sound disorder may manifest in different ways. This usually depends on the factors responsible for the challenge, or how extreme it is.

There are different patterns of error that may signal a speech sound disorder. These include:

  • Removing a sound from a word
  • Including a sound in a word
  • Replacing hard to pronounce sounds with an unsuitable alternative
  • Difficulty pronouncing the same sound in different words (e.g., "pig" and "kit")
  • Repeating sounds or words
  • Lengthening words
  • Pauses while speaking
  • Tension when producing sounds
  • Head jerks during speech
  • Blinking while speaking
  • Shame while speaking
  • Changes in voice pitch
  • Running out of breath while speaking

It’s important to note that children develop at different rates. This can reflect in the ease and ability to produce sounds. But where children repeatedly make sounds or statements that are difficult to understand, this could indicate a speech disorder.

Diagnosis of Speech Sound Disorders

For a correct diagnosis, a speech-language pathologist can determine whether or not a child has a speech-sound disorder.

This determination may be made in line with the requirements of the DSM-5 diagnostic criteria . These guidelines require that:

  • The child experience persistent difficulty with sound production (this affects communication and speech comprehension)
  • Symptoms of the disorder appear early during the child’s development stages
  • This disorder limits communication. It affects social interactions, academic achievements, and job performance.
  • The disorder is not caused by other conditions like a congenital disorder or an acquired condition like hearing loss . Hereditary disorders are, however, exempted. 

Causes of Speech Sound Disorders

There is no known cause of speech sound disorders. However, several risk factors may increase the odds of developing a speech challenge. These include:

  • Gender : Male children are more likely to develop a speech sound disorder
  • Family history : Children with family members living with speech disorders may acquire a similar challenge.
  • Socioeconomics : Being raised in a low socioeconomic environment may contribute to the development of speech and literacy challenges.
  • Pre- and post-natal challenges : Difficulties faced during pregnancy such as maternal infections and stressors may worsen the chances of speech disorders in a child. Likewise, delivery complications, premature birth, and low-birth-weight could lead to speech disorders.
  • Disabilities : Down syndrome, autism , and other disabilities may be linked to speech-sound disorders.
  • Physical challenges : Children with a cleft lip may experience speech sound difficulties.
  • Brain damage : These disorders may also be caused by an infection or trauma to a child’s brain . This is seen in conditions like cerebral palsy where the muscles affecting speech are injured.

Types of Speech Sound Disorders

By the time a child turns three, at least half of what they say should be properly understood. By ages four and five, most sounds should be pronounced correctly—although exceptions may arise when pronouncing "l", "s", "r", "v", and other similar sounds. By seven or eight, harder sounds should be properly pronounced.

A child with a speech sound disorder will continue to struggle to pronounce words, even past the expected age. Difficulty with speech patterns may signal one of the following speech sound disorders:

Disfluency

Disfluency refers to interruptions while speaking. Stuttering is the most common form of disfluency. It is recognized by recurring breaks in the free flow of speech. After the age of four, a child with disfluency will still repeat words or phrases while speaking. This child may include extra words or sounds when communicating—they may also make words longer by stressing syllables.

This disorder may cause tension while speaking. Other times, head jerking or blinking may be observed with disfluency. 

Children with this disorder often feel frustrated when speaking, and it may also cause embarrassment during interactions.

Articulation Disorder

When a child is unable to properly produce sounds, this may be caused by inexact placement, speed, pressure, or movement of the lips, tongue, or throat.

This usually signals an articulation disorder, where sounds like “r”, “l”, or “s” may be changed. In these cases, a child’s communication may be understood by only close family members.

Phonological Disorder

A phonological disorder is present where a child is unable to make the speech sounds expected of their age. Here, mistakes may be made when producing sounds. Other times, sounds like consonants may be omitted when speaking.  

Voice Disorder

Where a child is observed to have a raspy voice, this may be an early sign of a voice disorder. Other indicators include voice breaks, a change in pitch, or an excessively loud or soft voice.  

Children who run out of breath while speaking may also live with this disorder. Likewise, children with a voice disorder may sound very nasal, or may seem to have too little air coming out of the nose.

Childhood Apraxia of Speech

Childhood apraxia of speech occurs when a child lacks the proper motor skills for sound production. Children with this condition will find it difficult to plan and produce the movements of the tongue, lips, jaw, and palate required for speech.

Treatment of Speech Sound Disorders

Parents of children with speech sound disorders may feel at a loss for the next steps to take. To avoid putting further strain on the child, it’s important to avoid showing excessive concern.

Instead, listening patiently to their needs, letting them speak without completing their sentences, and showing usual love and care can go a long way.

For professional assistance, a speech-language pathologist can assist with improving a child’s communication. These pathologists will typically use oral motor exercises to enhance speech.

These oral exercises may also include nonspeech oral exercises such as blowing, oral massage and brushing, cheek puffing, and blowing a whistle.

Nonspeech oral exercises help to strengthen weak mouth muscles, and can help with learning the common ways of communicating.

Parents and children with speech sound disorders may also join support groups for information and assistance with the condition.

A Word From Verywell

It can be frustrating to witness a child's challenges with communication. But while it's understandable to long for typical communication from a child, the differences caused by speech disorders can be managed with the right care and supervision. Speaking to a speech therapist and showing love to children with speech disorders can be important first steps in overcoming these conditions.


By Elizabeth Plumptre, a freelance health and wellness writer.



20Q: Principles of Motor Learning and Intervention for Speech Sound Disorders

Carol Koch, EdD, CCC-SLP, ASHA Fellow, BCS-CL


From the Desk of Ann Kummer


Many speech-language pathologists report that, in working with children with speech sound disorders, they can achieve correct placement fairly easily. However, it takes a long time before the child begins to use the sound in everyday connected speech. Perhaps this problem can be improved with a more thorough understanding of the principles of motor learning.

Motor learning is a complex process that occurs in the brain in response to learning to perform a new motor sequence. Practice is a key component of motor learning. The motor skill or sequence must be repeated (e.g., practiced) until it can be performed consistently without conscious thought. 

Motor learning is needed to acquire all motor skills, such as playing a musical instrument, dancing, and playing sports. In the same way, it is needed to learn to produce speech sounds correctly. It is common knowledge that the more an individual practices a motor movement, the more proficient he/she becomes at that skill, and in a shorter amount of time. Therefore, therapy must be designed to achieve the largest number of correct productions possible in each session. Practice at home is also very important, even if the practice is limited to a few minutes each day.

I am a strong believer in the importance of intensive practice when possible, and frequent short practice sessions throughout the week in order to achieve carryover in the shortest amount of time. Therefore, I’m thrilled that Dr. Carol Koch submitted this article about the principles of motor learning and the importance of using these principles in speech therapy.

Carol Koch, EdD, CCC-SLP is a Professor at Samford University. Much of her clinical work has been in early intervention, with a focus on children with autism spectrum disorder and children with severe speech sound disorders, including childhood apraxia of speech. Her research and teaching interests have also encompassed early phonological development, speech sound disorders, and CAS. She has been honored as an ASHA Fellow and is a Board-Certified Specialist in Child Language. Recently, Dr. Koch published a textbook, Clinical Management of Speech Sound Disorders: A Case-Based Approach. She is also a co-author of the Contrast Cues for Speech and Literacy and the “Box of” set of cues for articulation therapy and the Box of /ɹ/ Facilitating Contexts and Screener through Bjorem Speech Publications.

This is a very interesting and important article for all speech-language pathologists who work with speech sound disorders. 

Now…read on, learn, and enjoy!

Ann W. Kummer, PhD, CCC-SLP, FASHA, 2017 ASHA Honors Contributing Editor 

Browse the complete collection of 20Q with Ann Kummer CEU articles at  www.speechpathology.com/20Q

Learning Outcomes

After this course, readers will be able to: 

  • Explain the principles of motor learning relevant to speech sound disorders
  • Provide knowledge of results and knowledge of performance feedback
  • Apply the principles of motor learning to speech sound intervention 


1. What are the principles of motor learning and why are they important?

As the traditional or motor-based approaches to the treatment of speech sound disorders specifically focus on the motor aspects of sound production, a basic understanding of motor learning is beneficial. The traditional approach emphasizes teaching the placement of the articulators and the motor movement patterns needed for speech sound production; speech sound production is, at its core, a motor-based skill.

Motor learning is a “set of processes associated with practice or experience leading to relatively permanent changes in the capability for movement” (Schmidt & Lee, 2005, p. 302). A learned motor skill results from two different levels of performance that are demonstrated during the acquisition and learning phase and the retention and transfer phase. During the acquisition and learning phase, motor performance is demonstrated through the establishment of the ability to execute the specific motor skill. This perspective emphasizes that acquisition is the product of practice. Retention and transfer reflect the level of learning that is considered the permanent change in the ability to demonstrate the skilled movements as measured by retention of the skill after the training and practice have been completed. The level of performance during the practice phase of motor learning does not predict retention and transfer of the skill (Maas et al., 2008).

Motor-based approaches have a long history in the treatment of speech sound disorders, yet the research is limited regarding the principles of motor learning and speech-motor learning. Maas and colleagues (2008) have examined the application of the basic principles with intact motor systems. This research can be applied to traditional motor-based interventions with children who demonstrate speech sound disorders.

Maas and colleagues (2008) have emphasized three areas of study in motor learning principles in which evidence supports the application to the intervention of speech sound disorders in children. The three areas are pertinent to the conditions of practice and include prepractice, principles of practice, and principles of feedback. It is important to utilize this structure in the implementation of motor-based articulation intervention.

Further, the principles of motor learning are applied differently depending on where the child’s articulation skills are along a continuum of motor skills development from acquisition to retention. Application of the principles of motor learning to speech production offers promising insight into optimizing treatment (Maas et al., 2014).


2. What motor learning principles are relevant to speech sound production and speech sound disorders?

While motor learning principles have not been extensively researched as applied to speech motor learning, evidence of their application can be found throughout scholarly resources addressing speech sound intervention. Research outside of our discipline provides evidence of the basic principles of motor learning that may be applied to intervention for children with speech sound disorders (Maas et al., 2008). Future research in extending the principles to speech motor learning may serve to further validate the motor-based or traditional approach for the intervention of speech sound disorders.

Motor learning principles related to the conditions and the structure of practice and those related to the nature of feedback are the most relevant to speech sound disorders and intervention. Motor learning principles associated with the conditions of practice include practice schedule, practice amount, practice variability, attentional focus, and target complexity. The motor learning principles associated with the nature of feedback include feedback type, feedback frequency, feedback timing, and feedback control (Bislick et al., 2012; Maas et al., 2008).

3. Explain the motor-learning principle of pre-practice.

In the area of motor learning, pre-practice refers to areas of consideration prior to beginning practice that can facilitate optimal outcomes (Bernthal et al., 2013; Maas et al., 2008; Yorkston, Beukelman et al., 2010). Prepractice activities are designed to prepare the learner for the therapy session (Schmidt & Lee, 2005). Important considerations for prepractice are motivation to learn, an understanding of the task, and level of stimulability for sound production errors (Maas et al., 2008).

Motivation can be enhanced by ensuring that the learner understands that therapy activities are designed to improve speech intelligibility and reduce communication breakdowns. Selection of functionally relevant targets (family names, favorite activities) with input from the learner may also increase motivation. Making sure the learner understands the task through offering an appropriate level of instruction (consider language skills), cues, and modeling is also useful for promoting motivation.

4. Explain the motor learning principles of practice conditions related to practice amount.

Learning any motor skills requires practice (Schmidt & Lee, 2005). Therapy must therefore provide adequate practice to learn the targeted behavior. The options for practice amount include a small number of practice trials or sessions or a large number of practice trials or sessions. What we know anecdotally is that the learner must have maximum opportunity to practice correct production of the target sounds. Therefore, the focus is on creating a sufficient number of production trials each session to facilitate the acquisition and retention of new skills. This can be accomplished by emphasizing production practice and minimizing the time spent providing reinforcements. The literature for nonspeech tasks suggests that small amounts of practice are beneficial for acquisition but that variability and larger amounts of practice are associated with improved retention of skills. Currently, there is no empirical evidence regarding speech practice amount with respect to speech motor learning.

5. Explain the motor learning principle of practice conditions related to practice distribution.

The next principle of practice to consider is how practice should be planned and distributed. Massed practice involves practicing targets many times over a short period, with a shorter time between sessions. Distributed practice refers to how a set amount of therapy practice is distributed over time, with more time between sessions. Maas and colleagues (2008) propose that many shorter treatment sessions produce a better outcome than fewer, longer sessions. Massed versus distributed practice may also have implications for motor learning. Massed practice appears to promote motor performance, the accuracy of speech sound production, or speech sound acquisition, whereas distributed practice has been shown to support retention and transfer, which implies that motor learning has resulted in a permanent change in a skill (Maas et al., 2008). The impact of practice amount on the effectiveness of massed versus distributed practice is also unknown.

6. Explain the motor learning principle for practice conditions of practice variability.

Practice variability refers to the variations in phonetic or motor sequences used as stimuli. Constant practice involves practicing the same target within the same context. For example, sessions that focus on the production of the phoneme /s/ in the initial word position. Variable practice involves practice on different sounds in different contexts. For example, sessions that focus on the phonemes /k, g, s, z/ in the initial word position and the final word position.

There is some evidence that constant practice is beneficial early on, during the acquisition phase. Further, motor learning appears to be promoted by words that have different movement sequences, different co-articulatory contexts, and different manners of production across the phoneme sequences (Maas et al., 2008; Yorkston et al., 2010).

7. What is the practice condition of practice schedule?

Practice schedule refers to either blocked or random practice. Blocked practice involves repeated production of the same stimuli during sessions or treatment phases. For example, treatment sessions that focus on production of /s/ before progressing to production of /z/. Random practice involves different targets practiced during the same session or phases of intervention. For example, a session that focuses on production practice of /f, v/. Maas and colleagues (2008) also propose that random presentation of stimuli or targets promotes the development of motor learning better than blocked practice. Therefore, research evidence suggests that random practice is more effective at facilitating motor learning, which results in production accuracy that is maintained in conversational speech.
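The distinction between blocked and random schedules can be made concrete with a short script. The Python sketch below simply generates both kinds of trial orders for a hypothetical target set and trial count; the targets and numbers are illustrative placeholders, not recommendations from this article.

import random

def blocked_schedule(targets, trials_per_target):
    # Blocked practice: all trials for one target before moving to the next.
    schedule = []
    for target in targets:
        schedule.extend([target] * trials_per_target)
    return schedule

def random_schedule(targets, trials_per_target, seed=None):
    # Random practice: the same trials, interleaved in random order.
    rng = random.Random(seed)
    schedule = [t for t in targets for _ in range(trials_per_target)]
    rng.shuffle(schedule)
    return schedule

if __name__ == "__main__":
    targets = ["/f/ initial", "/v/ initial"]   # hypothetical session targets
    print("Blocked:", blocked_schedule(targets, 5))
    print("Random: ", random_schedule(targets, 5, seed=1))

Variable practice (Question 6) corresponds to widening the targets list to different sounds, word positions, and contexts, while constant practice keeps it to a single target in a single context.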

8. Explain the practice condition related to attentional focus.

Attentional focus can be viewed as being either internal or external. Internal attentional focus is related to the focus on articulatory movements, such as a place of articulation. External attentional focus refers to the outcome or the effects of the movements, the acoustic signal, and the sound the child produces. The effect of attentional focus on motor learning for speech has not been explored. However, in the nonspeech motor domain, it appears that an external focus supports and promotes more automatic movement patterns and greater retention/learning of the skill than an internal focus (Maas et al., 2008).

9. What is the practice condition of target complexity?

Target complexity or movement complexity refers to the sounds and sound sequences selected for intervention targets. Simple targets are those target words that contain earlier acquired sounds and simple word shapes, such as plosives and CV syllables/words. Complex targets are target words that contain the more difficult, later emerging speech sounds and sound sequences, such as fricatives and CCVC words.

Emerging evidence suggests that complex movement patterns promote learning of simpler movement patterns, but the reverse is not supported by evidence. Therefore, it appears that targeting more complex items may be more efficient than targeting less complex items.

10. Explain the motor learning principles related to type of feedback.

Feedback allows the speech-language pathologist to provide information about the client’s performance. This type of augmented feedback has been shown to be effective in the facilitation of motor learning related to speech sound intervention (Schmidt & Lee, 2005; Wulf & Shea, 2004). Different types of augmented feedback have also been studied. Knowledge of results type of feedback provides information about whether the production was correct or incorrect. The clinician may say, “You produced the correct sound” or “That was good”. Knowledge of performance feedback provides more specific information about the nature of the production. The feedback addresses specifically what was correct or incorrect about the positioning of the articulators, the movement, or the manner of production.

For example, for a correct production, the clinician may say, “I really like how you made that sound in the back of your mouth”. Alternatively, for an incorrect production, the clinician might offer feedback such as, “That was a good try, but let’s try again with your tongue behind your teeth”. Knowledge of performance may be more beneficial during the speech sound acquisition stage when the child has not yet established an internal representation of the target sound (Newell et al., 1990). The specific performance feedback provides the client with more information about the nature of the production. For an inaccurate production, knowledge of performance feedback guides the client to the specific aspect of the production to be changed in order to achieve accurate production of the target. Knowledge of performance feedback facilitates performance during the acquisition stage of speech sound intervention. Conversely, feedback that reflects knowledge of results requires the child to determine the nature of the specific error. Maas et al. (2008) suggest that once the target skill is established, the nature of the feedback should change from knowledge of performance to knowledge of results. Thus, knowledge of results augmented feedback leads to enhanced retention/transfer of motor learning.

11. What are some additional examples of “knowledge of results” feedback?

Knowledge of results type of feedback focuses on the accuracy of the production. Here are a few examples:

  • “Great job!”
  • “That sounded great!”
  • “I heard you make the /f/ sound in that word!”

12. What are some additional examples of “knowledge of performance” feedback?

Knowledge of performance type of feedback reflects specific information about how the child produced the target sound. Here are a few examples:

  • “I really like how you kept your tongue behind your teeth for the /s/ sound.”
  • “Great job using your top teeth on your bottom lip to make the /f/ sound.”
  • “I heard you change from the /t/ to the /k/ sound when you said 'car'.”

13. Explain the principles of motor learning for how feedback is provided, specifically related to feedback frequency.

How feedback is provided is another consideration. Feedback frequency refers to how often feedback is provided. High-frequency feedback is given after every attempt at the production of the elicited target. Low-frequency feedback is provided after some, but not all, attempts at the production of the elicited target. During treatment sessions, clinicians may adjust feedback frequency according to a variety of schedules, such as providing feedback on 80%, 50%, 20%, or 0% of trials. It appears that high-frequency feedback is beneficial during the acquisition stage of speech-motor learning. Evidence from current practice suggests that quickly reducing the frequency of feedback may be more effective in facilitating the retention or transfer of speech-motor learning. Infrequent feedback provides the child the opportunity to monitor and evaluate their own performance (Lowe & Buchwald, 2017; Maas et al., 2008).

The impact of feedback frequency may also depend on other factors. Research evidence suggests that practice variability, attentional focus, complexity, and the learner’s skill level and ability to self-monitor or self-evaluate interact with feedback frequency and produce different results.
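As a rough illustration of how a feedback schedule might be operationalized, the sketch below marks which production trials receive augmented feedback at a chosen rate; the 50% rate and the 20-trial session are placeholders, not clinical recommendations.

import random

def feedback_schedule(n_trials, feedback_rate, seed=None):
    # Returns one True/False flag per trial: True means the clinician gives
    # augmented feedback, False leaves the trial for the child to self-evaluate.
    rng = random.Random(seed)
    n_feedback = round(n_trials * feedback_rate)
    flags = [True] * n_feedback + [False] * (n_trials - n_feedback)
    rng.shuffle(flags)
    return flags

if __name__ == "__main__":
    # Example: 20 trials with feedback on roughly half of them.
    print(feedback_schedule(20, 0.5, seed=2))

A feedback_rate of 1.0 reproduces high-frequency feedback on every trial; lowering the rate over time is one way to fade feedback as the child moves toward retention.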

14. Explain the principle of motor learning related to the timing of feedback.

The timing of feedback may also be a factor in motor speech learning (Bankson et al., 2013; Maas et al., 2008). Feedback may be either immediate or delayed. Feedback that is delivered with a slight delay may provide the client with the opportunity to self-evaluate the production. As previously stated, self-evaluation of speech sound production prior to augmented feedback may more effectively facilitate speech-motor learning. Feedback timing and the impact on performance may also be affected by attentional focus. A learner who struggles with awareness of articulatory placement may require immediate feedback that reflects knowledge of performance. Likewise, a learner who struggles with monitoring speech output may also require a higher frequency of immediate feedback since a delayed feedback situation may not result in self-evaluation of performance.

15. How do you choose stimuli following principles of motor learning?

This certainly is a complex question. And truthfully, there is no one “correct” answer. Clinicians must assess each client to determine which combination of practice conditions and how feedback is provided are optimal for that client. However, motor learning principles certainly can guide and inform these decisions. In addition, factors such as functionality or relevance of stimuli, stimulability, and target complexity are important considerations in intervention planning and the selection of intervention targets.

16. Are there stages or phases to intervention based on the principles of motor learning?

The facilitation of motor learning may follow a number of theoretical models that explain the process of speech-motor learning. Initially, the learner is introduced to the new skill, a new pattern of skilled movement that results in the correct production of a speech sound. Verbal instructions, demonstrations, and modeling are important elements in assisting the learner in producing the target sound. Frequent feedback and accurate production are also important during this phase of intervention.

As the learner refines the new skill, continued practice helps to establish retention of that skill. Feedback is faded as speech sound accuracy increases. Self-monitoring may also be utilized to maximize external focus on the results of the articulatory movements and resulting acoustic output.

Lastly, the learner advances from skill execution to the integration of the new motor skill, accurate speech sound production. This phase allows the learner to utilize the new skill effortlessly in many phonetic contexts and in many communication contexts.

17. How does feedback change throughout the course of intervention?

Based on the principles of motor learning, feedback type and frequency change throughout the course of intervention. As the child progresses from skill acquisition to skill retention, feedback frequency is reduced. The goal is for the child to begin to rely on their own external focus and self-monitoring to self-assess their own productions for accuracy. Further, as the child progresses from skill acquisition to skill retention, feedback type changes from knowledge of performance to knowledge of results.

18. How are feedback and conditions of practice structured for acquisition?

There is some limited evidence supporting certain feedback and practice conditions as more effective during the skill acquisition phase of intervention. Broadly, the following can be applied during the acquisition phase of speech sound intervention:

  • Small number of targets
  • Massed practice
  • Internal focus
  • Simple targets
  • Knowledge of performance feedback
  • High-frequency feedback
  • Immediate feedback

19. How are feedback and conditions of practice structured for retention?

There is some limited evidence supporting certain feedback and practice conditions as more effective during the skill retention phase of intervention (Bislick et al., 2012). Broadly, the following can be applied during the retention phase of speech sound intervention (both phases are summarized in the sketch after this list):

  • Large number of targets
  • Distributed practice
  • Variable practice
  • Random practice
  • External focus
  • Complex targets
  • Knowledge of results feedback
  • Low-frequency feedback
  • Delayed feedback
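For convenience, the acquisition-phase and retention-phase conditions from Questions 18 and 19 can be collected into a single lookup structure, for example to seed session-planning defaults in a hypothetical clinician tool. The field names below are invented for illustration; the values simply restate the two lists.

# Hypothetical summary of the practice and feedback conditions described in
# Questions 18 and 19; keys and values restate the article's lists.
MOTOR_LEARNING_CONDITIONS = {
    "acquisition": {
        "number_of_targets": "small",
        "practice_distribution": "massed",
        "attentional_focus": "internal",
        "target_complexity": "simple",
        "feedback_type": "knowledge of performance",
        "feedback_frequency": "high",
        "feedback_timing": "immediate",
    },
    "retention": {
        "number_of_targets": "large",
        "practice_distribution": "distributed",
        "practice_variability": "variable",
        "practice_schedule": "random",
        "attentional_focus": "external",
        "target_complexity": "complex",
        "feedback_type": "knowledge of results",
        "feedback_frequency": "low",
        "feedback_timing": "delayed",
    },
}

if __name__ == "__main__":
    for phase, conditions in MOTOR_LEARNING_CONDITIONS.items():
        print(phase.upper())
        for name, value in conditions.items():
            print(f"  {name}: {value}")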


Bernthal, J. E., Bankson, N. W., & Flipsen, P. (2017). Articulation and phonological disorders: Speech sound disorders in children. (8th Ed.). Pearson.

Bislick, L. P., Weir, P. C., Spencer, K., Kendall, D., & Yorkston, K. M. (2012). Do principles of motor learning enhance retention and transfer of speech skills? A systematic review. Aphasiology, 26 (5), 709-728.

Lowe, M. S., & Buchwald, A. (2017). The impact of feedback frequency on performance in a novel speech motor learning task. Journal of Speech, Language, and Hearing Research, 60 , 1712-1725.

Maas, E., Gildersleeve-Neumann, C., Jakielski, K. J., & Stoeckel, R. (2014). Motor-based intervention protocols in treatment of childhood apraxia of speech (CAS). Current Developmental Disorders Reports, 1, 197-206.

Maas, E., Robin, D. A., Austermann Hula, S. N., Freedman, S. E., Wulf, G., Ballard, K. J., & Schmidt, R. A. (2008). The principles of motor learning in treatment of motor speech disorders. American Journal of Speech-Language Pathology, 17, 277-298.

Schmidt, R. A., & Lee, T. D. (2005). Motor control and learning: A behavioral emphasis (4th Ed.). Human Kinetics.

Wulf, G., & Shea, C. H. (2004). Skill acquisition in sport. Routledge.

Yorkston, K. M., Beukelman, D. R., Strand, E. A., & Hakel, M. (2010). Management of motor speech disorders in children and adults (3rd Ed.). Pro-Ed.

Koch, C. (2023). 20Q: Principles of motor learning and intervention for speech sound disorders.  SpeechPathology.com . Article 20589. Available at www.speechpathology.com



Mechanics of human voice production and control

As the primary means of communication, voice plays an important role in daily life. Voice also conveys personal information such as social status, personal traits, and the emotional state of the speaker. Mechanically, voice production involves complex fluid-structure interaction within the glottis and its control by laryngeal muscle activation. An important goal of voice research is to establish a causal theory linking voice physiology and biomechanics to how speakers use and control voice to communicate meaning and personal information. Establishing such a causal theory has important implications for clinical voice management, voice training, and many speech technology applications. This paper provides a review of voice physiology and biomechanics, the physics of vocal fold vibration and sound production, and laryngeal muscular control of the fundamental frequency of voice, vocal intensity, and voice quality. Current efforts to develop mechanical and computational models of voice production are also critically reviewed. Finally, issues and future challenges in developing a causal theory of voice production and perception are discussed.

I. INTRODUCTION

In the broad sense, voice refers to the sound we produce to communicate meaning, ideas, opinions, etc. In the narrow sense, voice, as in this review, refers to sounds produced by vocal fold vibration, or voiced sounds. This is in contrast to unvoiced sounds which are produced without vocal fold vibration, e.g., fricatives which are produced by airflow through constrictions in the vocal tract, plosives produced by sudden release of a complete closure of the vocal tract, or other sound producing mechanisms such as whispering. For voiced sound production, vocal fold vibration modulates airflow through the glottis and produces sound (the voice source), which propagates through the vocal tract and is selectively amplified or attenuated at different frequencies. This selective modification of the voice source spectrum produces perceptible contrasts, which are used to convey different linguistic sounds and meaning. Although this selective modification is an important component of voice production, this review focuses on the voice source and its control within the larynx.

For effective communication of meaning, the voice source, as a carrier for the selective spectral modification by the vocal tract, contains harmonic energy across a large range of frequencies that spans at least the first few acoustic resonances of the vocal tract. In order to be heard over noise, such harmonic energy also has to be reasonably above the noise level within this frequency range, unless a breathy voice quality is desired. The voice source also contains important information of the pitch, loudness, prosody, and voice quality, which convey meaning (see Kreiman and Sidtis, 2011 , Chap. 8 for a review), biological information (e.g., size), and paralinguistic information (e.g., the speaker's social status, personal traits, and emotional state; Sundberg, 1987 ; Kreiman and Sidtis, 2011 ). For example, the same vowel may sound different when spoken by different people. Sometimes a simple “hello” is all it takes to recognize a familiar voice on the phone. People tend to use different voices to different speakers on different occasions, and it is often possible to tell if someone is happy or sad from the tone of their voice.

One of the important goals of voice research is to understand how the vocal system produces voice of different source characteristics and how people associate percepts to these characteristics. Establishing a cause-effect relationship between voice physiology and voice acoustics and perception will allow us to answer two essential questions in voice science and effective clinical care ( Kreiman et al. , 2014 ): when the output voice changes, what physiological alteration caused this change; if a change to voice physiology occurs, what change in perceived voice quality can be expected? Clinically, such knowledge would lead to the development of a physically based theory of voice production that is capable of better predicting voice outcomes of clinical management of voice disorders, thus improving both diagnosis and treatment. More generally, an understanding of this relationship could lead to a better understanding of the laryngeal adjustments that we use to change voice quality, adopt different speaking or singing styles, or convey personal information such as social status and emotion. Such understanding may also lead to the development of improved computer programs for synthesis of naturally sounding, speaker-specific speech of varying emotional percepts.

Understanding such a cause-effect relationship between voice physiology and production necessarily requires a multi-disciplinary effort. While voice production results from a complex fluid-structure-acoustic interaction process, which again depends on the geometry and material properties of the lungs, larynx, and the vocal tract, the end interest of voice is its acoustics and perception. Changes in voice physiology or physics that cannot be heard are not that interesting. On the other hand, the physiology and physics may impose constraints on the co-variations among fundamental frequency (F0), vocal intensity, and voice quality, and thus the way we use and control our voice. Thus, understanding voice production and voice control requires an integrated approach, in which physiology, vocal fold vibration, and acoustics are considered as a whole instead of disconnected components. Traditionally, the multi-disciplinary nature of voice production has led to a clear divide between research activities in voice production, voice perception, and their clinical or speech applications, with few studies attempting to link them together. Although much advancement has been made in understanding the physics of phonation, some misconceptions still exist in textbooks in otolaryngology and speech pathology. For example, the Bernoulli effect, which has been shown to play a minor role in phonation, is still considered an important factor in initiating and sustaining phonation in many textbooks and reviews. Tension and stiffness are often used interchangeably even though they have different physical meanings. The role of the thyroarytenoid muscle in regulating medial compression of the membranous vocal folds is often understated. On the other hand, research on voice production often focuses on the glottal flow and vocal fold vibration, but can benefit from a broader consideration of the acoustics of the produced voice and their implications for voice communication.

This paper provides a review on our current understanding of the cause-effect relation between voice physiology, voice production, and voice perception, with the hope that it will help better bridge research efforts in different aspects of voice studies. An overview of vocal fold physiology is presented in Sec. II , with an emphasis on laryngeal regulation of the geometry, mechanical properties, and position of the vocal folds. The physical mechanisms of self-sustained vocal fold vibration and sound generation are discussed in Sec. III , with a focus on the roles of various physical components and features in initiating phonation and affecting the produced acoustics. Some misconceptions of the voice production physics are also clarified. Section IV discusses the physiologic control of F0, vocal intensity, and voice quality. Section V reviews past and current efforts in developing mechanical and computational models of voice production. Issues and future challenges in establishing a causal theory of voice production and perception are discussed in Sec. VI .

II. VOCAL FOLD PHYSIOLOGY AND BIOMECHANICS

A. Vocal fold anatomy and biomechanics

The human vocal system includes the lungs and the lower airway that function to supply air pressure and airflow (a review of the mechanics of the subglottal system can be found in Hixon, 1987 ), the vocal folds whose vibration modulates the airflow and produces voice source, and the vocal tract that modifies the voice source and thus creates specific output sounds. The vocal folds are located in the larynx and form a constriction to the airway [Fig. 1(a) ]. Each vocal fold is about 11–15 mm long in adult women and 17–21 mm in men, and stretches across the larynx along the anterior-posterior direction, attaching anteriorly to the thyroid cartilage and posteriorly to the anterolateral surface of the arytenoid cartilages [Fig. 1(c) ]. Both the arytenoid [Fig. 1(d) ] and thyroid [Fig. 1(e) ] cartilages sit on top of the cricoid cartilage and interact with it through the cricoarytenoid joint and cricothyroid joint, respectively. The relative movement of these cartilages thus provides a means to adjust the geometry, mechanical properties, and position of the vocal folds, as further discussed below. The three-dimensional airspace between the two opposing vocal folds is the glottis. The glottis can be divided into a membranous portion, which includes the anterior portion of the glottis and extends from the anterior commissure to the vocal process of the arytenoid, and a cartilaginous portion, which is the posterior space between the arytenoid cartilages.

FIG. 1. (Color online) (a) Coronal view of the vocal folds and the airway; (b) histological structure of the vocal fold lamina propria in the coronal plane (image provided by Dr. Jennifer Long of UCLA); (c) superior view of the vocal folds, cartilaginous framework, and laryngeal muscles; (d) medial view of the cricoarytenoid joint formed between the arytenoid and cricoid cartilages; (e) posterolateral view of the cricothyroid joint formed by the thyroid and the cricoid cartilages. The arrows in (d) and (e) indicate direction of possible motions of the arytenoid and cricoid cartilages due to LCA and CT muscle activation, respectively.

The vocal folds are layered structures, consisting of an inner muscular layer (the thyroarytenoid muscle) with muscle fibers aligned primarily along the anterior-posterior direction, a soft tissue layer of the lamina propria, and an outermost epithelium layer [Figs. 1(a) and 1(b)]. The thyroarytenoid (TA) muscle is sometimes divided into a medial and a lateral bundle, with each bundle responsible for a certain vocal fold posturing function. However, such functional division is still a topic of debate (Zemlin, 1997). The lamina propria consists of the extracellular matrix (ECM) and interstitial substances. The two primary ECM proteins are the collagen and elastin fibers, which are aligned mostly along the length of the vocal folds in the anterior-posterior direction (Gray et al., 2000). Based on the density of the collagen and elastin fibers [Fig. 1(b)], the lamina propria can be divided into a superficial layer with limited and loose elastin and collagen fibers, an intermediate layer of dominantly elastin fibers, and a deep layer of mostly dense collagen fibers (Hirano and Kakita, 1985; Kutty and Webb, 2009). In comparison, the lamina propria (about 1 mm thick) is much thinner than the TA muscle.

Conceptually, the vocal fold is often simplified into a two-layer body-cover structure ( Hirano, 1974 ; Hirano and Kakita, 1985 ). The body layer includes the muscular layer and the deep layer of the lamina propria, and the cover layer includes the intermediate and superficial lamina propria and the epithelium layer. This body-cover concept of vocal fold structure will be adopted in the discussions below. Another grouping scheme divides the vocal fold into three layers. In addition to a body and a cover layer, the intermediate and deep layers of the lamina propria are grouped into a vocal ligament layer ( Hirano, 1975 ). It is hypothesized that this layered structure plays a functional role in phonation, with different combinations of mechanical properties in different layers leading to production of different voice source characteristics ( Hirano, 1974 ). However, because of lack of data of the mechanical properties in each vocal fold layer and how they vary at different conditions of laryngeal muscle activation, a definite understanding of the functional roles of each vocal fold layer is still missing.

The mechanical properties of the vocal folds have been quantified using various methods, including tensile tests (Hirano and Kakita, 1985; Zhang et al., 2006b; Kelleher et al., 2013a), shear rheometry (Chan and Titze, 1999; Chan and Rodriguez, 2008; Miri et al., 2012), indentation (Haji et al., 1992a, b; Tran et al., 1993; Chhetri et al., 2011), and a surface wave method (Kazemirad et al., 2014). These studies showed that the vocal folds exhibit a nonlinear, anisotropic, viscoelastic behavior. A typical stress-strain curve of the vocal folds under anterior-posterior tensile test is shown in Fig. 2. The slope of the curve, or stiffness, quantifies the extent to which the vocal folds resist deformation in response to an applied force. In general, after an initial linear range, the slope of the stress-strain curve (stiffness) increases gradually with further increase in the strain (Fig. 2), presumably due to the gradual engagement of the collagen fibers. Such nonlinear mechanical behavior provides a means to regulate vocal fold stiffness and tension through vocal fold elongation or shortening, which plays an important role in the control of the F0 or pitch of voice production. Typically, the stress is higher during loading than unloading, indicating a viscous behavior of the vocal folds. Due to the presence of the AP-aligned collagen, elastin, and muscle fibers, the vocal folds also exhibit anisotropic mechanical properties, stiffer along the AP direction than in the transverse plane. Experiments (Hirano and Kakita, 1985; Alipour and Vigmostad, 2012; Miri et al., 2012; Kelleher et al., 2013a) showed that the Young's modulus along the AP direction in the cover layer is more than 10 times (as high as 80 times in Kelleher et al., 2013a) larger than in the transverse plane. Stiffness anisotropy has been shown to facilitate medial-lateral motion of the vocal folds (Zhang, 2014) and complete glottal closure during phonation (Xuan and Zhang, 2014).

FIG. 2. Typical tensile stress-strain curve of the vocal fold along the anterior-posterior direction during loading and unloading at 1 Hz. The slope of the tangent line (dashed lines) to the stress-strain curve quantifies the tangent stiffness. The stress is typically higher during loading than unloading due to the viscous behavior of the vocal folds. The curve was obtained by averaging data over 30 cycles after a 10-cycle preconditioning.
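Because the tangent stiffness in Fig. 2 is simply the local slope of the stress-strain curve, it can be estimated numerically from tensile-test data. The short Python sketch below does this with a finite-difference gradient; the stress-strain samples are made-up placeholder values chosen only to mimic the qualitative soft-then-stiffening shape described above, not measured vocal fold data.

import numpy as np

# Placeholder stress-strain samples (strain dimensionless, stress in kPa).
strain = np.linspace(0.0, 0.4, 9)
stress = np.array([0.0, 1.0, 2.1, 3.4, 5.2, 8.0, 12.5, 19.0, 28.0])

# Tangent stiffness = d(stress)/d(strain), estimated by central differences.
tangent_stiffness = np.gradient(stress, strain)

for e, k in zip(strain, tangent_stiffness):
    print(f"strain = {e:.2f}   tangent stiffness ~ {k:5.1f} kPa")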

Accurate measurement of vocal fold mechanical properties at typical phonation conditions is challenging, due to both the small size of the vocal folds and the relatively high frequency of phonation. Although tensile tests and shear rheometry allow direct measurement of material moduli, the small sample size often leads to difficulties in mounting tissue samples to the testing equipment, thus raising concerns about accuracy. These two methods also require dissecting tissue samples from the vocal folds and the laryngeal framework, making in vivo measurement impossible. The indentation method is ideal for in vivo measurement and, because of the small size of indenters used, allows characterization of the spatial variation of mechanical properties of the vocal folds. However, it is limited to measurement of mechanical properties at conditions of small deformation. Although large indentation depths can be used, data interpretation becomes difficult, and thus the method is not suitable for assessment of the nonlinear mechanical properties of the vocal folds.

There has been some recent work toward understanding the contribution of individual ECM components to the macro-mechanical properties of the vocal folds and developing a structurally based constitutive model of the vocal folds (e.g., Chan et al. , 2001 ; Kelleher et al. , 2013b ; Miri et al. , 2013 ). The contribution of interstitial fluid to the viscoelastic properties of the vocal folds and vocal fold stress during vocal fold vibration and collision has also been investigated using a biphasic model of the vocal folds in which the vocal fold was modeled as a solid phase interacting with an interstitial fluid phase ( Zhang et al. , 2008 ; Tao et al. , 2009 , Tao et al. , 2010 ; Bhattacharya and Siegmund, 2013 ). This structurally based approach has the potential to predict vocal fold mechanical properties from the distribution of collagen and elastin fibers and interstitial fluids, which may provide new insights toward the differential mechanical properties between different vocal fold layers at different physiologic conditions.

B. Vocal fold posturing

Voice communication requires fine control and adjustment of pitch, loudness, and voice quality. Physiologically, such adjustments are made through laryngeal muscle activation, which stiffens, deforms, or repositions the vocal folds, thus controlling the geometry and mechanical properties of the vocal folds and glottal configuration.

One important posturing is adduction/abduction of the vocal folds, which is primarily achieved through motion of the arytenoid cartilages. Anatomical analysis and numerical simulations have shown that the cricoarytenoid joint allows the arytenoid cartilages to slide along and rotate about the long axis of the cricoid cartilage, but constrains arytenoid rotation about the short axis of the cricoid cartilage (Selbie et al., 1998; Hunter et al., 2004; Yin and Zhang, 2014). Activation of the lateral cricoarytenoid (LCA) muscles, which attach anteriorly to the cricoid cartilage and posteriorly to the arytenoid cartilages, induces mainly an inward rotation of the arytenoid about the cricoid cartilage in the coronal plane, and moves the posterior portion of the vocal folds toward the glottal midline. Activation of the interarytenoid (IA) muscles, which connect the posterior surfaces of the two arytenoids, slides and approximates the arytenoid cartilages [Fig. 1(c)], thus closing the cartilaginous glottis. Because both muscles act on the posterior portion of the vocal folds, combined action of the two muscles is able to completely close the posterior portion of the glottis, but is less effective in closing the mid-membranous glottis (Fig. 3; Choi et al., 1993; Chhetri et al., 2012; Yin and Zhang, 2014). Because of this inefficiency in mid-membranous approximation, LCA/IA muscle activation is unable to produce medial compression between the two vocal folds in the membranous portion, contrary to current understandings (Klatt and Klatt, 1990; Hixon et al., 2008). Complete closure and medial compression of the mid-membranous glottis require the activation of the TA muscle (Choi et al., 1993; Chhetri et al., 2012). The TA muscle forms the bulk of the vocal folds and stretches from the thyroid prominence to the anterolateral surface of the arytenoid cartilages (Fig. 1). Activation of the TA muscle produces a whole-body rotation of the vocal folds in the horizontal plane about the point of its anterior attachment to the thyroid cartilage toward the glottal midline (Yin and Zhang, 2014). This rotational motion is able to completely close the membranous glottis but often leaves a gap posteriorly (Fig. 3). Complete closure of both the membranous and cartilaginous glottis thus requires combined activation of the LCA/IA and TA muscles. The posterior cricoarytenoid (PCA) muscles are primarily responsible for opening the glottis but may also play a role in voice production at very high pitches, as discussed below.

FIG. 3. Activation of the LCA/IA muscles completely closes the posterior glottis but leaves a small gap in the membranous glottis, whereas TA activation completely closes the anterior glottis but leaves a gap at the posterior glottis. From unpublished stroboscopic recordings from the in vivo canine larynx experiments in Choi et al. (1993).

Vocal fold tension is regulated by elongating or shortening the vocal folds. Because of the nonlinear material properties of the vocal folds, changing vocal fold length also leads to changes in vocal fold stiffness, which otherwise would stay constant for linear materials. The two laryngeal muscles involved in regulating vocal fold length are the cricothyroid (CT) muscle and the TA muscle. The CT muscle consists of two bundles. The vertically oriented bundle, the pars recta, connects the anterior surface of the cricoid cartilage and the lower border of the thyroid lamina. Its contraction approximates the thyroid and cricoid cartilages anteriorly through a rotation about the cricothyroid joint. The other bundle, the pars oblique, is oriented upward and backward, connecting the anterior surface of the cricoid cartilage to the inferior cornu of the thyroid cartilage. Its contraction displaces the cricoid and arytenoid cartilages backwards (Stone and Nuttall, 1974), although the thyroid cartilage may also move forward slightly. Contraction of both bundles thus elongates the vocal folds and increases the stiffness and tension in both the body and cover layers of the vocal folds. In contrast, activation of the TA muscle, which forms the body layer of the vocal folds, increases the stiffness and tension in the body layer. Activation of the TA muscle, in addition to an initial effect of mid-membranous vocal fold approximation, also shortens the vocal folds, which decreases both the stiffness and tension in the cover layer (Hirano and Kakita, 1985; Yin and Zhang, 2013). One exception is when the tension in the vocal fold cover is already negative (i.e., under compression), in which case shortening the vocal folds further through TA activation decreases tension (i.e., increases the compression force) but may increase stiffness in the cover layer. Activation of the LCA/IA muscles generally does not change the vocal fold length much and thus has only a slight effect on vocal fold stiffness and tension (Chhetri et al., 2009; Yin and Zhang, 2014). However, activation of the LCA/IA muscles (and also the PCA muscles) does stabilize the arytenoid cartilage and prevent it from moving forward when the cricoid cartilage is pulled backward by CT muscle activation, thus facilitating extreme vocal fold elongation, particularly for high-pitch voice production. As noted above, due to the lack of reliable measurement methods, our understanding of how vocal fold stiffness and tension vary under different muscular activation conditions is limited.

Activation of the CT and TA muscles also changes the medial surface shape of the vocal folds and the glottal channel geometry. Specifically, TA muscle activation causes the inferior part of the medial surface to bulge out toward the glottal midline ( Hirano and Kakita, 1985 ; Hirano, 1988 ; Vahabzadeh-Hagh et al. , 2016 ), thus increasing the vertical thickness of the medial surface. In contrast, CT activation reduces this vertical thickness of the medial surface. Although many studies have investigated the prephonatory glottal shape (convergent, straight, or divergent) on phonation ( Titze, 1988a ; Titze et al. , 1995 ), a recent study showed that the glottal channel geometry remains largely straight under most conditions of laryngeal muscle activation ( Vahabzadeh-Hagh et al. , 2016 ).

III. PHYSICS OF VOICE PRODUCTION

A. Sound sources of voice production

The phonation process starts from the adduction of the vocal folds, which approximates the vocal folds to reduce or close the glottis. Contraction of the lungs initiates airflow and establishes pressure buildup below the glottis. When the subglottal pressure exceeds a certain threshold pressure, the vocal folds are excited into a self-sustained vibration. Vocal fold vibration in turn modulates the glottal airflow into a pulsating jet flow, which eventually develops into turbulent flow into the vocal tract.

In general, three major sound production mechanisms are involved in this process ( McGowan, 1988 ; Hofmans, 1998 ; Zhao et al. , 2002 ; Zhang et al. , 2002a ), including a monopole sound source due to volume of air displaced by vocal fold vibration, a dipole sound source due to the fluctuating force applied by the vocal folds to the airflow, and a quadrupole sound source due to turbulence developed immediately downstream of the glottal exit. When the false vocal folds are tightly adducted, an additional dipole source may arise as the glottal jet impinges onto the false vocal folds ( Zhang et al. , 2002b ). The monopole sound source is generally small considering that the vocal folds are nearly incompressible and thus the net volume flow displacement is small. The dipole source is generally considered as the dominant sound source and is responsible for the harmonic component of the produced sound. The quadrupole sound source is generally much weaker than the dipole source in magnitude, but it is responsible for broadband sound production at high frequencies.

For the harmonic component of the voice source, an equivalent monopole sound source can be defined at a plane just downstream of the region of major sound sources, with the source strength equal to the instantaneous pulsating glottal volume flow rate. In the source-filter theory of phonation (Fant, 1970), this monopole sound source is the input signal to the vocal tract, which acts as a filter and shapes the sound source spectrum into different sounds before they are radiated from the mouth into the open air as the voice we hear. Because of radiation from the mouth, the sound source is proportional to the time derivative of the glottal flow. Thus, in the voice literature, the time derivative of the glottal flow, instead of the glottal flow itself, is considered the voice source.

The phonation cycle is often divided into an open phase, in which the glottis opens (the opening phase) and closes (the closing phase), and a closed phase, in which the glottis is closed or maintains a minimum opening area when glottal closure is incomplete. The glottal flow increases and decreases during the open phase, and remains zero during the closed phase (or at a minimum for incomplete glottal closure) (Fig. 4). Compared to the glottal area waveform, the glottal flow waveform reaches its peak later in the cycle and is thus more skewed to the right. This skewing of the glottal flow waveform to the right is due to the acoustic mass in the glottis and the vocal tract (when the F0 is lower than a nearby vocal tract resonance frequency), which causes a delay in the increase in the glottal flow during the opening phase and a faster decay in the glottal flow during the closing phase (Rothenberg, 1981; Fant, 1982). Because of this waveform skewing to the right, the negative peak of the time derivative of the glottal flow in the closing phase is often much more dominant than the positive peak in the opening phase. The instant of the most negative peak is thus considered the point of main excitation of the vocal tract, and the corresponding negative peak, also referred to as the maximum flow declination rate (MFDR), is a major determinant of the peak amplitude of the produced voice. After the negative peak, the time derivative of the glottal flow waveform returns to zero as phonation enters the closed phase.
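The quantities just defined (open quotient and MFDR) can be illustrated with a small sketch; the asymmetric pulse below is a made-up waveform chosen only so that the closing phase is faster than the opening phase, not a model from the voice literature.

```python
import numpy as np

# Sketch of the open quotient and MFDR, using a "rise slowly, fall quickly"
# pulse as a stand-in for a real glottal flow waveform; all numbers are
# illustrative assumptions.
fs = 44100
F0 = 120.0
T = 1.0 / F0           # oscillation period
To = 0.6 * T           # open phase (open quotient To/T = 0.6)
Tp = 0.75 * To         # instant of peak flow within the open phase (right-skewed)

t = np.arange(0.0, T, 1.0 / fs)
Ug = np.zeros_like(t)
rise = t <= Tp
fall = (t > Tp) & (t < To)
Ug[rise] = 0.5 * (1 - np.cos(np.pi * t[rise] / Tp))                  # slow rise to 1
Ug[fall] = 0.5 * (1 + np.cos(np.pi * (t[fall] - Tp) / (To - Tp)))    # fast fall to 0

dUg = np.gradient(Ug, 1.0 / fs)
open_quotient = To / T
mfdr = dUg.min()                      # maximum flow declination rate (negative peak)
t_mfdr = t[dUg.argmin()]              # occurs in the closing phase, near closure

print(f"open quotient = {open_quotient:.2f}")
print(f"MFDR = {mfdr:.1f} (flow units/s), at t/T = {t_mfdr/T:.2f}")
print(f"|MFDR| exceeds the opening-phase peak: {abs(mfdr) > dUg.max()}")
```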

Fig. 4. (Color online) Typical glottal flow waveform and its time derivative (left) and their correspondence to the spectral slopes of the low-frequency and high-frequency portions of the voice source spectrum (right).

Much work has been done to directly link features of the glottal flow waveform to voice acoustics and potentially voice quality (e.g., Fant, 1979, 1982; Fant et al., 1985; Gobl and Chasaide, 2010). These studies showed that the low-frequency spectral shape (the first few harmonics) of the voice source is primarily determined by the relative duration of the open phase with respect to the oscillation period (To/T in Fig. 4, also referred to as the open quotient). A longer open phase often leads to a more dominant first harmonic (H1) in the low-frequency portion of the resulting voice source spectrum. For a given oscillation period, shortening the open phase causes most of the glottal flow change to occur within a duration (To) that is increasingly shorter than the period T. This leads to an energy boost in the low-frequency portion of the source spectrum that peaks around a frequency of 1/To. For a glottal flow waveform with a very short open phase, the second harmonic (H2) or even the fourth harmonic (H4) may become the most dominant harmonic. A voice source with a weak H1 relative to H2 or H4 is often associated with a pressed voice quality.
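As a rough numerical illustration of this open-quotient effect, the following sketch computes H1−H2 for a toy raised-cosine glottal pulse at several open quotients; the pulse shape is an assumption for illustration only (it is not the LF model or any waveform used in the cited studies), but it reproduces the qualitative trend of H1−H2 decreasing as the open quotient decreases.

```python
import numpy as np

# Sketch: effect of the open quotient on the low-frequency source spectrum
# (H1 vs H2). A raised-cosine pulse confined to the open phase is used as a
# toy glottal flow; all values are illustrative.
fs = 44100
F0 = 120.0
T = 1.0 / F0
N = int(round(fs * T))
t = np.arange(N) / fs

def h1_minus_h2(open_quotient):
    To = open_quotient * T
    Ug = np.where(t < To, 0.5 * (1 - np.cos(2 * np.pi * t / To)), 0.0)
    dUg = np.gradient(Ug, 1.0 / fs)       # voice source (time derivative)
    spec = np.abs(np.fft.rfft(dUg))       # one period per buffer: bin k = harmonic k
    H1, H2 = spec[1], spec[2]
    return 20 * np.log10(H1 / H2)

for oq in (0.7, 0.5, 0.3):
    print(f"open quotient {oq:.1f}: H1-H2 = {h1_minus_h2(oq):5.1f} dB")
```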

The spectral slope in the high-frequency range is primarily related to the degree of discontinuity in the time derivative of the glottal flow waveform. Due to the waveform skewing discussed earlier, the most dominant source of discontinuity often occurs around the instant of main excitation, when the time derivative of the glottal flow waveform returns from the negative peak to zero within a time scale of Ta (Fig. 4). For an abrupt glottal flow cutoff (Ta = 0), the time derivative of the glottal flow waveform has a strong discontinuity at the point of main excitation, which causes the voice source spectrum to decay asymptotically at a roll-off rate of −6 dB per octave toward high frequencies. Increasing Ta from zero leads to a gradual return from the negative peak to zero. When approximated by an exponential function, this gradual return functions as a low-pass filter, with a cutoff frequency around 1/Ta, and reduces the excitation of harmonics above that cutoff frequency. Thus, in the frequency range relevant to voice perception, increasing Ta often leads to reduced higher-order harmonic excitation. In the extreme case, when there is minimal vocal fold contact, the time derivative of the glottal flow waveform is so smooth that the voice source spectrum has only a few lower-order harmonics. Perceptually, strong excitation of higher-order harmonics is often associated with a bright sound quality, whereas a voice source with limited excitation of higher-order harmonics is often perceived as weak.
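One way to see why an exponential return phase acts as a low-pass filter is the following standard first-order filter argument (an idealization; the precise corner frequency depends on how the return phase is modeled):

```latex
% Replacing the abrupt flow cutoff (Ta = 0) by an exponential recovery with
% time constant Ta is equivalent to filtering the abrupt-cutoff source with
% h(t) = (1/Ta) exp(-t/Ta) for t > 0, whose frequency response is
\[
  H(f) \;=\; \frac{1}{1 + j\,2\pi f T_a}, \qquad
  |H(f)| \;=\; \frac{1}{\sqrt{1 + \left(2\pi f T_a\right)^{2}}},
\]
% a first-order low-pass whose corner frequency is of the order of 1/Ta
% (1/(2*pi*Ta) in this idealization); harmonics well above the corner are
% progressively attenuated, consistent with the reduced high-frequency
% excitation described above.
```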

Also of perceptual importance is the turbulence noise produced immediately downstream of the glottis. Although small in amplitude, the noise component plays an important role in voice quality perception, particularly for female voices, in which aspiration noise is more persistent than in male voices. While the noise component of voice is often modeled as white noise, its spectrum is often not flat and may exhibit different spectral shapes, depending on the glottal opening and flow rate as well as the vocal tract shape. Interaction between the spectral shapes and relative levels of the harmonic and noise energy in the voice source has been shown to influence the perception of voice quality (Kreiman and Gerratt, 2012).

It is worth noting that many of the source parameters are not independent of each other and often co-vary. How they co-vary under different voicing conditions, which is essential to natural speech synthesis, remains the focus of many studies (e.g., Sundberg and Hogset, 2001; Gobl and Chasaide, 2003; Patel et al., 2011).

B. Mechanisms of self-sustained vocal fold vibration

That vocal fold vibration results from a complex airflow-vocal fold interaction within the glottis rather than repetitive nerve stimulation of the larynx was first recognized by van den Berg (1958) . According to his myoelastic-aerodynamic theory of voice production, phonation starts from complete adduction of the vocal folds to close the glottis, which allows a buildup of the subglottal pressure. The vocal folds remain closed until the subglottal pressure is sufficiently high to push them apart, allowing air to escape and producing a negative (with respect to atmospheric pressure) intraglottal pressure due to the Bernoulli effect. This negative Bernoulli pressure and the elastic recoil pull the vocal folds back and close the glottis. The cycle then repeats, which leads to sustained vibration of the vocal folds.

While the myoelastic-aerodynamic theory correctly identifies the interaction between the vocal folds and airflow as the underlying mechanism of self-sustained vocal fold vibration, it does not explain how energy is transferred from airflow into the vocal folds to sustain this vibration. Traditionally, the negative intraglottal pressure is considered to play an important role in closing the glottis and sustaining vocal fold vibration. However, it is now understood that a negative intraglottal pressure is not a critical requirement for achieving self-sustained vocal fold vibration. Similarly, an alternatingly convergent-divergent glottal channel geometry during phonation has been considered a necessary condition that leads to net energy transfer from airflow into the vocal folds. We will show below that an alternatingly convergent-divergent glottal channel geometry does not always guarantee energy transfer or self-sustained vocal fold vibration.

For flow conditions typical of human phonation, the glottal flow can be reasonably described by Bernoulli's equation up to the point where the airflow separates from the glottal wall, often at the glottal exit where the airway suddenly expands. According to Bernoulli's equation, the flow pressure p at a location within the glottal channel with a time-varying cross-sectional area A is

p = P_sub − (P_sub − P_sup) (A_sep/A)²,  (1)

where P_sub and P_sup are the subglottal and supraglottal pressure, respectively, and A_sep is the time-varying glottal area at the flow separation location. For simplicity, we assume that the flow separates at the upper margin of the medial surface. To achieve a net energy transfer from airflow to the vocal folds over one cycle, the air pressure on the vocal fold surface has to be at least partially in-phase with the vocal fold velocity. Specifically, the intraglottal pressure needs to be higher in the opening phase than in the closing phase of vocal fold vibration, so that the airflow does more work on the vocal folds in the opening phase than the work the vocal folds do back on the airflow in the closing phase.

Theoretical analysis of the energy transfer between airflow and vocal folds (Ishizaka and Matsudaira, 1972; Titze, 1988a) showed that this pressure asymmetry can be achieved by a vertical phase difference in vocal fold surface motion (also referred to as a mucosal wave), i.e., different portions of the vocal fold surface do not necessarily move inward and outward together as a whole. This mechanism is illustrated in Fig. 5, the upper left of which shows the vocal fold surface shape in the coronal plane for six consecutive, equally spaced instants during one vibration cycle in the presence of a vertical phase difference. Instants 2 and 3 in solid lines are in the closing phase whereas instants 5 and 6 in dashed lines are in the opening phase. Consider, for example, energy transfer at the lower margin of the medial surface. Because of the vertical phase difference, the glottal channel has a different shape in the opening phase (dashed lines 5 and 6) from that in the closing phase (solid lines 3 and 2) when the lower margin of the medial surface crosses the same locations. In particular, when the lower margin of the medial surface leads the upper margin in phase, the glottal channel during opening (e.g., instant 6) is always more convergent [thus a smaller A_sep/A in Eq. (1)] or less divergent than that during closing (e.g., instant 2) for the same location of the lower margin, resulting in an air pressure [Eq. (1)] that is higher in the opening phase than in the closing phase (Fig. 5, top row). As a result, energy is transferred from the airflow into the vocal folds over one cycle, as indicated by a non-zero area enclosed by the aerodynamic force-vocal fold displacement curve in Fig. 5 (top right). The existence of a vertical phase difference in vocal fold surface motion is generally considered the primary mechanism of phonation onset.
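The pressure asymmetry argument can be illustrated by evaluating Eq. (1) for a hypothetical geometry in which the lower margin passes through the same opening while the channel is convergent during opening and divergent during closing; all numbers below are assumed for illustration and follow the simplifying assumption of flow separation at the upper margin.

```python
import numpy as np

# Illustration of the pressure asymmetry argument using Eq. (1).
# Hypothetical geometry: the lower margin of the medial surface passes through
# the same opening during glottal opening and closing, but the channel is
# convergent while opening and divergent while closing (vertical phase
# difference). Flow separation is assumed fixed at the upper margin, as in the
# simplification adopted in the text. All numbers are illustrative.
P_sub, P_sup = 800.0, 0.0          # subglottal and supraglottal pressure (Pa), assumed
A_lower = 0.4e-4                   # glottal area at the lower margin (m^2), assumed

def pressure_at_lower_margin(A_upper):
    A_sep = A_upper                # separation assumed at the upper margin
    return P_sub - (P_sub - P_sup) * (A_sep / A_lower) ** 2   # Eq. (1)

p_opening = pressure_at_lower_margin(A_upper=0.2e-4)   # convergent: A_upper < A_lower
p_closing = pressure_at_lower_margin(A_upper=0.6e-4)   # divergent:  A_upper > A_lower

print(f"p at lower margin, opening (convergent): {p_opening:8.1f} Pa")
print(f"p at lower margin, closing (divergent):  {p_closing:8.1f} Pa")
# Higher pressure while the fold moves outward than while it moves inward
# implies net energy transfer from the airflow into the vocal fold.
```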

Fig. 5. Two energy transfer mechanisms. Top row: the presence of a vertical phase difference leads to different medial surface shapes between glottal opening (dashed lines 5 and 6; upper left panel) and closing (solid lines 2 and 3) when the lower margin of the medial surface crosses the same locations, which leads to higher air pressure during glottal opening than closing and net energy transfer from airflow into vocal folds at the lower margin of the medial surface. Middle row: without a vertical phase difference, vocal fold vibration produces an alternatingly convergent-divergent but identical glottal channel geometry between glottal opening and closing (bottom left panel), thus zero energy transfer (middle row). Bottom row: without a vertical phase difference, air pressure asymmetry can be imposed by a negative damping mechanism.

In contrast, without a vertical phase difference, the vocal fold surface during opening (Fig. 5, bottom left; dashed lines 5 and 6) and closing (solid lines 3 and 2) would be identical when the lower margin crosses the same positions, for which Bernoulli's equation would predict symmetric flow pressure between the opening and closing phases, and zero net energy transfer over one cycle (Fig. 5, middle row). Under this condition, the pressure asymmetry between the opening and closing phases has to be provided by an external mechanism that directly imposes a phase difference between the intraglottal pressure and vocal fold movement. In the presence of such an external mechanism, the intraglottal pressure is no longer the same between opening and closing even when the glottal channel has the same shape as the vocal fold crosses the same locations, resulting in a net energy transfer over one cycle from airflow to the vocal folds (Fig. 5, bottom row). This energy transfer mechanism is often referred to as negative damping, because the intraglottal pressure depends on vocal fold velocity and appears in the system equations of vocal fold motion in a form similar to a damping force, except that energy is transferred to the vocal folds instead of being dissipated. Negative damping is the only energy transfer mechanism in a single degree-of-freedom system or when the entire medial surface moves in phase as a whole.
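A minimal sketch of the negative damping idea, using a single degree-of-freedom oscillator with an assumed velocity-proportional aerodynamic force (all parameter values are arbitrary illustrative assumptions, not a vocal fold model):

```python
import numpy as np

# Single degree-of-freedom sketch of negative damping: an intraglottal
# pressure component proportional to vocal fold velocity enters the equation
# of motion like a damping term of opposite sign.
m, k = 0.1e-3, 40.0        # effective mass (kg) and stiffness (N/m), assumed
c_struct = 0.01            # structural (dissipative) damping (N s/m), assumed

def growth_rate(c_aero):
    # m x'' + (c_struct - c_aero) x' + k x = 0
    # Roots of the characteristic polynomial m s^2 + c s + k = 0.
    c = c_struct - c_aero
    s = np.roots([m, c, k])
    return s.real.max()    # > 0 means the oscillation amplitude grows

for c_aero in (0.0, 0.005, 0.012):
    sigma = growth_rate(c_aero)
    status = "grows (self-sustained onset)" if sigma > 0 else "decays"
    print(f"aerodynamic negative damping {c_aero:.3f} N s/m: "
          f"growth rate {sigma:8.2f} 1/s -> {status}")
```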

In humans, a negative damping can be provided by an inertive vocal tract (Flanagan and Landgraf, 1968; Ishizaka and Matsudaira, 1972; Ishizaka and Flanagan, 1972) or a compliant subglottal system (Zhang et al., 2006a). Because the negative damping associated with acoustic loading is significant only for frequencies close to an acoustic resonance, phonation sustained by such negative damping alone always occurs at a frequency close to that acoustic resonance (Flanagan and Landgraf, 1968; Zhang et al., 2006a). Although there is no direct evidence of phonation sustained dominantly by acoustic loading in humans, instabilities in voice production (or voice breaks) have been reported when the fundamental frequency of vocal fold vibration approaches one of the vocal tract resonances (e.g., Titze et al., 2008). On the other hand, this entrainment of the phonation frequency to an acoustic resonance limits the degree of independent control of the voice source and of the spectral modification by the vocal tract, and is less desirable for effective speech communication. Considering that humans are capable of producing a large variety of voice types independent of vocal tract shapes, negative damping due to acoustic coupling to the sub- or supra-glottal acoustics is unlikely to be the primary mechanism of energy transfer in voice production. Indeed, excised larynges are able to vibrate without a vocal tract. Moreover, experiments have shown that in humans the vocal folds vibrate at a frequency close to an in vacuo vocal fold resonance (Kaneko et al., 1986; Ishizaka, 1988; Svec et al., 2000) instead of the acoustic resonances of the sub- and supra-glottal tracts, suggesting that phonation is essentially a resonance phenomenon of the vocal folds.

A negative damping can also be provided by glottal aerodynamics. For example, glottal flow acceleration and deceleration may cause the flow to separate at different locations between opening and closing even when the glottis has identical geometry. This is particularly the case for a divergent glottal channel geometry, which often results in asymmetric flow separation and pressure asymmetry between the glottal opening and closing phases (Park and Mongeau, 2007; Alipour and Scherer, 2004). The effect of this negative damping mechanism is expected to be small at phonation onset, when the vocal fold vibration amplitude and thus the flow unsteadiness are small and the glottal channel is less likely to be divergent. However, its contribution to energy transfer may increase with increasing vocal fold vibration amplitude and flow unsteadiness (Howe and McGowan, 2010). It is important to differentiate this asymmetric flow separation between glottal opening and closing due to unsteady flow effects from a quasi-steady asymmetric flow separation that is caused by asymmetry in the glottal channel geometry between opening and closing. In the latter case, because flow separation may occur at a more upstream location for a divergent glottal channel than for a convergent glottal channel, an asymmetric glottal channel geometry (e.g., a glottis opening convergent and closing divergent) may lead to asymmetric flow separation between glottal opening and closing. Compared to conditions of a fixed flow separation (i.e., flow separating at the same location during the entire cycle, as in Fig. 5), such geometry-induced asymmetric flow separation actually reduces the pressure asymmetry between glottal opening and closing [this can be shown using Eq. (1)] and thus weakens net energy transfer. In reality, these two types of asymmetric flow separation mechanisms (due to unsteady effects or to changes in glottal channel geometry) interact and can result in very complex flow separation patterns (Alipour and Scherer, 2004; Sciamarella and Le Quere, 2008; Sidlof et al., 2011), which may or may not enhance energy transfer.

From the discussion above it is clear that a negative Bernoulli pressure is not a critical requirement in either of the two mechanisms. Being proportional to vocal fold displacement rather than velocity, the negative Bernoulli pressure is not a negative damping and does not directly provide the required pressure asymmetry between glottal opening and closing. On the other hand, the existence of a vertical phase difference in vocal fold vibration is determined primarily by vocal fold properties (as discussed below), rather than by whether the intraglottal pressure is positive or negative during a certain phase of the oscillation cycle.

Although a vertical phase difference in vocal fold vibration leads to a time-varying glottal channel geometry, an alternatingly convergent-divergent glottal channel geometry does not guarantee self-sustained vocal fold vibration. For example, although the in-phase vocal fold motion in the bottom left of Fig. 5 (the entire medial surface moves in and out together) leads to an alternatingly convergent-divergent glottal geometry, the glottal geometry is identical between glottal opening and closing, and thus this motion is unable to produce net energy transfer into the vocal folds without a negative damping mechanism (Fig. 5, middle row). In other words, an alternatingly convergent-divergent glottal geometry is an effect, not a cause, of self-sustained vocal fold vibration. Theoretically, the glottis can maintain a convergent or divergent shape during the entire oscillation cycle and still self-oscillate, as observed in experiments using physical vocal fold models that maintained a divergent shape during most of the oscillation cycle (Zhang et al., 2006a).

C. Eigenmode synchronization and nonlinear dynamics

The above shows that net energy transfer from airflow into the vocal folds is possible in the presence of a vertical phase difference. But how is this vertical phase difference established, and what determines the vertical phase difference and the vocal fold vibration pattern? In voice production, vocal fold vibration with a vertical phase difference results from a process of eigenmode synchronization, in which two or more in vacuo eigenmodes of the vocal folds are synchronized to vibrate at the same frequency but with a phase difference (Ishizaka and Matsudaira, 1972; Ishizaka, 1981; Horacek and Svec, 2002; Zhang et al., 2007), in the same way that a travelling wave is formed by the superposition of two standing waves. An eigenmode, or resonance, is a pattern of motion of the system that is allowed by physical laws and the boundary constraints on the system. In general, for each mode, the vibration pattern is such that all parts of the system move either in phase or 180° out of phase, similar to a standing wave. Each eigenmode has an inherently distinct eigenfrequency (or resonance frequency) at which the eigenmode can be maximally excited. An example of eigenmodes often encountered in speech science is formants, which are peaks in the output voice spectrum due to excitation of the acoustic resonances of the vocal tract, with the formant frequencies dependent on vocal tract geometry. Figure 6 shows three typical eigenmodes of the vocal fold in the coronal plane. In Fig. 6, the thin line indicates the resting vocal fold surface shape, whereas the solid and dashed lines indicate extreme positions of the vocal fold when vibrating at the corresponding eigenmode, spaced 180° apart in a vibratory cycle. The first eigenmode shows an up-and-down motion in the vertical direction, which does not modulate the glottal airflow much. The second eigenmode has a dominantly in-phase medial-lateral motion along the medial surface, which does modulate the airflow. The third eigenmode also exhibits dominantly medial-lateral motion, but the upper portion of the medial surface vibrates 180° out of phase with the lower portion of the medial surface. Such out-of-phase motion as in the third eigenmode is essential to achieving vocal fold vibration with a large vertical phase difference, e.g., when synchronized with an in-phase eigenmode such as that in Fig. 6(b).
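The notion of eigenmodes with in-phase and out-of-phase patterns can be illustrated with a toy two-mass calculation (the masses and springs below are arbitrary assumptions, not a vocal fold model):

```python
import numpy as np

# Toy two-mass eigenmode calculation: two coupled point masses stand in for
# the lower and upper margins of the medial surface. The two mode shapes show
# the generic in-phase and out-of-phase patterns discussed above.
m = 0.1e-3                           # mass of each element (kg), assumed
k, kc = 30.0, 10.0                   # anchor and coupling stiffness (N/m), assumed

K = np.array([[k + kc, -kc],
              [-kc, k + kc]])        # stiffness matrix; mass matrix is m * identity

# Eigenvalue problem K v = omega^2 m v  ->  eigh(K / m), eigenvalues ascending
evals, evecs = np.linalg.eigh(K / m)
for lam, v in zip(evals, evecs.T):
    f = np.sqrt(lam) / (2 * np.pi)
    v = v / np.max(np.abs(v))
    pattern = "in-phase" if v[0] * v[1] > 0 else "out-of-phase"
    print(f"eigenfrequency {f:6.1f} Hz, mode shape {np.round(v, 2)} ({pattern})")
```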

Fig. 6. Typical vocal fold eigenmodes exhibiting (a) a dominantly superior-inferior motion, (b) a medial-lateral in-phase motion, and (c) a medial-lateral out-of-phase motion along the medial surface.

In the absence of airflow, the vocal fold in vacuo eigenmodes are generally neutral or damped, meaning that when excited they gradually decay in amplitude with time. When the vocal folds are subject to airflow, however, the vocal fold-airflow coupling modifies the eigenmodes and, under some conditions, synchronizes two eigenmodes to the same frequency (Fig. 7). Although vibration in each eigenmode by itself does not produce net energy transfer (Fig. 5, middle row), when two modes are synchronized at the same frequency but with a phase difference in time, the vibration velocity associated with one eigenmode [e.g., the eigenmode in Fig. 6(b)] will be at least partially in-phase with the pressure induced by the other eigenmode [e.g., the eigenmode in Fig. 6(c)], and this cross-mode pressure-velocity interaction will produce net energy transfer into the vocal folds (Ishizaka and Matsudaira, 1972; Zhang et al., 2007).
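The following sketch illustrates eigenmode synchronization as a linear stability calculation on a generic two-mode, flutter-type toy model; it reproduces the qualitative behavior of Fig. 7 (frequency coalescence and a growth rate turning positive at a threshold pressure), but its equations and parameter values are assumptions, not the models used in the cited studies.

```python
import numpy as np

# Two vocal fold modes with nearby natural frequencies, coupled through the
# flow by an antisymmetric (non-conservative) term that grows with the
# subglottal pressure Ps. As Ps increases, the two frequencies approach each
# other and coalesce, and one growth rate turns positive (cf. Fig. 7).
f1, f2 = 110.0, 130.0                    # in vacuo modal frequencies (Hz), assumed
w1, w2 = 2 * np.pi * f1, 2 * np.pi * f2
zeta = 0.02                              # modal damping ratio, assumed
beta = 150.0                             # flow-coupling coefficient (s^-2 per Pa), assumed

def eigenvalues(Ps):
    K = np.array([[w1 ** 2, -beta * Ps],
                  [beta * Ps, w2 ** 2]])          # antisymmetric flow coupling
    C = np.diag([2 * zeta * w1, 2 * zeta * w2])   # modal damping
    A = np.block([[np.zeros((2, 2)), np.eye(2)],
                  [-K, -C]])                      # first-order state matrix
    return np.linalg.eigvals(A)

for Ps in (0.0, 200.0, 400.0, 600.0, 800.0):      # subglottal pressure (Pa)
    s = eigenvalues(Ps)
    s = s[np.imag(s) > 0]                         # keep one of each conjugate pair
    freqs = np.sort(np.imag(s)) / (2 * np.pi)
    growth = np.max(np.real(s))
    print(f"Ps = {Ps:5.0f} Pa: frequencies = {np.round(freqs, 1)} Hz, "
          f"max growth rate = {growth:7.1f} 1/s")
```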

Fig. 7. A typical eigenmode synchronization pattern. The evolution of the first three eigenmodes is shown as a function of the subglottal pressure. As the subglottal pressure increases, the frequencies (top) of the second and third vocal fold eigenmodes gradually approach each other and, at a threshold subglottal pressure, synchronize to the same frequency. At the same time, the growth rate (bottom) of the second mode becomes positive, indicating the coupled airflow-vocal fold system becomes linearly unstable and phonation starts.

The minimum subglottal pressure required to synchronize two eigenmodes and initiate net energy transfer, or the phonation threshold pressure, is proportional to the frequency spacing between the two eigenmodes being synchronized and inversely proportional to the coupling strength between the two eigenmodes (Zhang, 2010):

P_th ∝ (ω_0,2 − ω_0,1) / β,  (2)

where ω_0,1 and ω_0,2 are the eigenfrequencies of the two in vacuo eigenmodes participating in the synchronization process and β is the coupling strength between the two eigenmodes. Thus, the closer the two eigenmodes are to each other in frequency, or the more strongly they are coupled, the less pressure is required to synchronize them. This is particularly the case in an anisotropic material such as the vocal folds, in which the AP stiffness is much larger than the stiffness in the transverse plane. Under such anisotropic stiffness conditions, the first few in vacuo vocal fold eigenfrequencies tend to cluster together and are much closer to each other than under isotropic stiffness conditions (Titze and Strong, 1975; Berry, 2001). Such clustering of eigenmodes makes it possible to initiate vocal fold vibration at very low subglottal pressures.

The coupling strength β between the two eigenmodes in Eq. (2) depends on the prephonatory glottal opening, with the coupling strength increasing with decreasing glottal opening (and thus a lowered phonation threshold pressure). In addition, the coupling strength also depends on the spatial similarity between the air pressure distribution over the vocal fold surface induced by one eigenmode and the surface velocity distribution of the other eigenmode (Zhang, 2010). In other words, the coupling strength β quantifies the cross-mode energy transfer efficiency between the eigenmodes that are being synchronized. The higher the degree of cross-mode pressure-velocity similarity, the better the two eigenmodes are coupled, and the less subglottal pressure is required to synchronize them.

In reality, the vocal folds have an infinite number of eigenmodes. Which eigenmodes are synchronized and eventually excited depends on the frequency spacing and relative coupling strength among different eigenmodes. Because vocal fold vibration depends on the eigenmodes that are eventually excited, changes in the eigenmode synchronization pattern often lead to changes in the F0, vocal fold vibration pattern, and the resulting voice quality. Previous studies have shown that a slight change in vocal fold properties such as stiffness or medial surface shape may cause phonation to occur at a different eigenmode, leading to a qualitatively different vocal fold vibration pattern and abrupt changes in F0 ( Tokuda et al. , 2007 ; Zhang, 2009 ). Eigenmode synchronization is not limited to two vocal fold eigenmodes, either. It may also occur between a vocal fold eigenmode and an eigenmode of the subglottal or supraglottal system. In this sense, the negative damping due to subglottal or supraglottal acoustic loading can be viewed as the result of synchronization between one of the vocal fold modes and one of the acoustic resonances.

The eigenmode synchronization discussed above corresponds to a 1:1 temporal synchronization of two eigenmodes. For a certain range of vocal fold conditions, e.g., when asymmetry (left-right or anterior-posterior) exists in the vocal system or when the vocal folds are strongly coupled with the sub- or supra-glottal acoustics, the two eigenmodes may synchronize not toward the same frequency but at a frequency ratio of 1:2, 1:3, etc., leading to subharmonics or biphonation (Ishizaka and Isshiki, 1976; Herzel, 1993; Herzel et al., 1994; Neubauer et al., 2001; Berry et al., 1994; Berry et al., 2006; Titze, 2008; Lucero et al., 2015). Temporal desynchronization of eigenmodes often leads to irregular or chaotic vocal fold vibration (Herzel et al., 1991; Berry et al., 1994; Berry et al., 2006; Steinecke and Herzel, 1995). Transition between different synchronization patterns, or bifurcation, often leads to a sudden change in the vocal fold vibration pattern and voice quality.

These studies show that the nonlinear interaction between vocal fold eigenmodes is a central feature of the phonation process, with different synchronization or desynchronization patterns producing a large variety of voice types. Thus, by changing the geometrical and biomechanical properties of the vocal folds, either through laryngeal muscle activation or through mechanical modification as in phonosurgery, we can select the eigenmodes and the eigenmode synchronization pattern to control or modify our voice, in the same way that we control speech formants by moving articulators in the vocal tract to modify the vocal tract acoustic resonances.

The concepts of eigenmodes and eigenmode synchronization are also useful for phonation modeling, because eigenmodes can be used as building blocks to construct more complex motion of the system. Often, only the first few eigenmodes are required for an adequate reconstruction of complex vocal fold vibrations (both regular and irregular; Herzel et al., 1994; Berry et al., 1994; Berry et al., 2006), which significantly reduces the degrees of freedom required in computational models of phonation.
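In practice, such low-order descriptions are often obtained by extracting empirical eigenmodes from simulated or measured vibration data, e.g., by proper orthogonal decomposition; the sketch below illustrates the idea on synthetic data (all signals and dimensions are made up for illustration and are not taken from the cited studies).

```python
import numpy as np

# Sketch: low-order modal reconstruction of a space-time vibration record via
# singular value decomposition (proper orthogonal decomposition). The "data"
# are synthetic: two spatial patterns oscillating at F0 and 2*F0 plus noise,
# standing in for a measured medial surface motion.
rng = np.random.default_rng(0)
nx, nt = 50, 400                      # spatial points and time samples, assumed
x = np.linspace(0, 1, nx)
t = np.linspace(0, 0.05, nt)          # 50 ms record, assumed

mode1 = np.sin(np.pi * x)             # smooth in-phase spatial pattern
mode2 = np.sin(2 * np.pi * x)         # spatial pattern with a phase reversal
F0 = 120.0
data = (np.outer(mode1, np.cos(2 * np.pi * F0 * t))
        + 0.4 * np.outer(mode2, np.cos(2 * np.pi * 2 * F0 * t + 0.7))
        + 0.02 * rng.standard_normal((nx, nt)))

U, s, Vt = np.linalg.svd(data, full_matrices=False)
energy = np.cumsum(s ** 2) / np.sum(s ** 2)
print("cumulative energy captured by first 1, 2, 3 modes:", np.round(energy[:3], 4))

# Reconstruction with only the first two empirical modes
rank2 = (U[:, :2] * s[:2]) @ Vt[:2, :]
err = np.linalg.norm(data - rank2) / np.linalg.norm(data)
print(f"relative error of a 2-mode reconstruction: {err:.3f}")
```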

D. Biomechanical requirements of glottal closure during phonation

An important feature of normal phonation is the complete closure of the membranous glottis during vibration, which is essential to the production of high-frequency harmonics. Incomplete closure of the membranous glottis, as often observed in pathological conditions, leads to voice production of a weak and/or breathy quality.

It is generally assumed that approximation of the vocal folds through arytenoid adduction is sufficient to achieve glottal closure during phonation, with the duration of glottal closure, or the closed quotient, increasing with increasing degree of vocal fold approximation. While a certain degree of vocal fold approximation is obviously required for glottal closure, there is evidence suggesting that other factors are also at play. For example, excised larynx experiments have shown that some larynges vibrate with incomplete glottal closure even when the arytenoids are tightly sutured together (Isshiki, 1989; Zhang, 2011). Similar incomplete glottal closure has also been observed in experiments using physical vocal fold models with isotropic material properties (Thomson et al., 2005; Zhang et al., 2006a). In these experiments, increasing the subglottal pressure increased the vocal fold vibration amplitude but often did not lead to improvement in the glottal closure pattern (Xuan and Zhang, 2014). These studies show that additional stiffness or geometry conditions are required to achieve complete membranous glottal closure.

Recent studies have started to provide some insight into these additional biomechanical conditions. Xuan and Zhang (2014) showed that embedding fibers along the anterior-posterior direction in otherwise isotropic models is able to improve glottal closure. With an additional thin, stiffer outermost layer simulating the epithelium, these physical models are able to vibrate with a considerable closed period. Interestingly, this improvement in the glottal closure pattern occurred only when the fibers were embedded at a location close to the vocal fold surface in the cover layer. Embedding fibers in the body layer did not improve the closure pattern at all. This suggests a possible functional role of the collagen and elastin fibers in the intermediate and deep layers of the lamina propria in facilitating glottal closure during vibration.

The difference in the glottal closure pattern between isotropic and anisotropic vocal folds could be due to many reasons. Compared to isotropic vocal folds, anisotropic vocal folds (or fiber-embedded models) are better able to maintain their adductory position against the subglottal pressure and are less likely to be pushed apart by air pressure (Zhang, 2011). In addition, embedding fibers along the AP direction may also enhance the medial-lateral motion, further facilitating glottal closure. Zhang (2014) showed that the first few in vacuo eigenmodes of isotropic vocal folds exhibit similar in-phase, up-and-down swing-like motion, with the medial-lateral and superior-inferior motions locked in a similar phase relationship. Synchronization of modes with similar vibration patterns necessarily leads to qualitatively the same vibration pattern, in this case an up-and-down swing-like motion, with vocal fold vibration dominantly along the superior-inferior direction, as observed in recent physical model experiments (Thomson et al., 2005; Zhang et al., 2006a). In contrast, for vocal folds with an AP stiffness much higher than the transverse stiffness, the first few in vacuo modes exhibit qualitatively distinct vibration patterns, and the medial-lateral and superior-inferior motions are no longer locked in a similar phase relationship. This makes it possible to strongly excite large medial-lateral motion without proportional excitation of the superior-inferior motion. As a result, anisotropic models exhibit large medial-lateral motion with a vertical phase difference along the medial surface. The improved capability to maintain the adductory position against the subglottal pressure and to vibrate with large medial-lateral motion may contribute to the improved glottal closure pattern observed in the experiment of Xuan and Zhang (2014).

Geometrically, a thin vocal fold has been shown to be easily pushed apart by the subglottal pressure ( Zhang, 2016a ). Although a thin anisotropic vocal fold vibrates with a dominantly medial-lateral motion, this is insufficient to overcome its inability to maintain position against the subglottal pressure. As a result, the glottis never completely closes during vibration, which leads to a relatively smooth glottal flow waveform and weak excitation of higher-order harmonics in the radiated output voice spectrum ( van den Berg, 1968 ; Zhang, 2016a ). Increasing vertical thickness of the medial surface allows the vocal fold to better resist the glottis-opening effect of the subglottal pressure, thus maintaining the adductory position and achieving complete glottal closure.

Once these additional stiffness and geometric conditions (i.e., a certain degree of stiffness anisotropy and a not-too-small vertical vocal fold thickness) are met, the duration of glottal closure can be regulated by varying the vertical phase difference in vocal fold motion along the medial surface. A non-zero vertical phase difference means that, when the lower margins of the medial surfaces start to open, the glottis remains closed until the upper margins also start to open. One important parameter affecting the vertical phase difference is the vertical thickness of the medial surface, or the degree of medial bulging in the inferior portion of the medial surface. Given the same conditions of vocal fold stiffness and vocal fold approximation, the vertical phase difference during vocal fold vibration increases with increasing vertical medial surface thickness (Fig. 8). Thus, the thicker the medial surface, the larger the vertical phase difference, and the longer the closed phase (Fig. 8; van den Berg, 1968; Alipour and Scherer, 2000; Zhang, 2016a). Similarly, the vertical phase difference, and thus the duration of glottal closure, can also be increased by reducing the elastic surface wave speed in the superior-inferior direction (Ishizaka and Flanagan, 1972; Story and Titze, 1995), which depends primarily on the stiffness in the transverse plane and to a lesser degree on the AP stiffness, or by increasing the body-cover stiffness ratio (Story and Titze, 1995; Zhang, 2009).

Fig. 8. (Color online) The closed quotient CQ and vertical phase difference VPD as a function of the medial surface thickness, the AP stiffness (G_ap), and the resting glottal angle (α). Reprinted with permission of ASA from Zhang (2016a).

Theoretically, the duration of glottal closure can be controlled by changing the ratio between the vocal fold equilibrium position (or the mean glottal opening) and the vocal fold vibration amplitude. Both stiffening the vocal folds and tightening vocal fold approximation are able to move the vocal fold equilibrium position toward glottal midline. However, such manipulations often simultaneously reduce the vibration amplitude. As a result, the overall effect on the duration of glottal closure is unclear. Zhang (2016a) showed that stiffening the vocal folds or increasing vocal fold approximation did not have much effect on the duration of glottal closure except around onset when these manipulations led to significant improvement in vocal fold contact.

E. Role of flow instabilities

Although a Bernoulli-based flow description is often used in phonation models, the real glottal flow is highly three-dimensional and much more complex. The intraglottal pressure distribution has been shown to be affected by the three-dimensionality of the glottal channel geometry (Scherer et al., 2001; Scherer et al., 2010; Mihaescu et al., 2010; Li et al., 2012). As the airflow exits the glottis and separates from the glottal wall, a jet forms downstream of the flow separation point, which leads to the development of shear layer instabilities, vortex roll-up, and eventually vortex shedding from the jet and transition into turbulence. The vortical structures in turn induce disturbances upstream, which may lead to an oscillating flow separation point, attachment of the jet to one side of the glottal wall instead of going straight, and possibly alternating jet flapping (Pelorson et al., 1994; Shinwari et al., 2003; Triep et al., 2005; Kucinschi et al., 2006; Erath and Plesniak, 2006; Neubauer et al., 2007; Zheng et al., 2009). Recent experiments and simulations have also shown that for a highly divergent glottis, airflow may separate inside the glottis, which leads to the formation and convection of intraglottal vortices (Mihaescu et al., 2010; Khosla et al., 2014; Oren et al., 2014).

Some of these flow features have been incorporated in phonation models (e.g., Liljencrants, 1991 ; Pelorson et al. , 1994 ; Kaburagi and Tanabe, 2009 ; Erath et al. , 2011 ; Howe and McGowan, 2013 ). Resolving other features, particularly the jet instability, vortices, and turbulence downstream of the glottis, demands significantly increased computational costs so that simulation of a few cycles of vocal fold vibration often takes days or months. On the other hand, the acoustic and perceptual relevance of these intraglottal and supraglottal flow structures has not been established. From the sound production point of view, these complex flow structures in the downstream glottal flow field are sound sources of quadrupole type (dipole type when obstacles are present in the pathway of airflow, e.g., tightly adducted false vocal folds). Due to the small length scales associated with the flow structures, these sound sources are broadband in nature and mostly at high frequencies (generally above 2 kHz), with an amplitude much smaller than the harmonic component of the voice source. Therefore, if the high-frequency component of voice is of interest, these flow features have to be accurately modeled, although the degree of accuracy required to achieve perceptual sufficiency has yet to be determined.

It has been postulated that the vortical structures may directly affect the near-field glottal fluid-structure interaction and thus vocal fold vibration and the harmonic component of the voice source. Once separated from the vocal fold walls, the glottal jet starts to develop jet instabilities and is therefore susceptible to downstream disturbances, especially when the glottis takes on a divergent shape. In this way, the unsteady supraglottal flow structures may interact with the boundary layer at the glottal exit and affect the flow separation point within the glottal channel ( Hirschberg et al. , 1996 ). Similarly, it has been hypothesized that intraglottal vortices can induce a local negative pressure on the medial surface of the vocal folds as the intraglottal vortices are convected downstream and thus may facilitate rapid glottal closure during voice production ( Khosla et al. , 2014 ; Oren et al. , 2014 ).

While there is no doubt that these complex flow features affect vocal fold vibration, the question remains how large an influence these vortical structures have on vocal fold vibration and the produced acoustics. For the flow conditions typical of voice production, many of the flow features or instabilities have time scales much different from that of vocal fold vibration. For example, vortex shedding at typical voice conditions occurs generally at frequencies above 1000 Hz (Zhang et al., 2004; Kucinschi et al., 2006). Considering that phonation is essentially a resonance phenomenon of the vocal folds (Sec. III B) and the mismatch between vocal fold resonance and the typical frequency scales of the vortical structures, it is questionable whether, compared to vocal fold inertia and elastic recoil, the pressure perturbations on the vocal fold surface due to intraglottal or supraglottal vortical structures are strong enough, or last long enough, to have a significant effect on voice production. Given a longitudinal shear modulus of the vocal fold of about 10 kPa and a shear strain of 0.2, the elastic recoil stress of the vocal fold is approximately 2000 Pa. The pressure perturbations induced by intraglottal or supraglottal vortices are expected to be much smaller than the subglottal pressure. Assuming an upper limit of about 20% of the subglottal pressure for the pressure perturbations (as induced by intraglottal vortices, Oren et al., 2014; in reality this number is expected to be much smaller at normal loudness conditions and even smaller for supraglottal vortices) and a subglottal pressure of 800 Pa (typical of normal speech production), the pressure perturbation on the vocal fold surface is about 160 Pa, which is much smaller than the elastic recoil stress. As for intraglottal vortices specifically, while a highly divergent glottal geometry is required to create them, their presence induces a suction force applied mainly on the superior portion of the medial surface and, if the vortices are strong enough, would reduce the divergence of the glottal channel. In other words, while intraglottal vortices cannot create the divergent geometry required for their own formation, their existence tends to eliminate that geometry.

There have been some recent studies toward quantifying the degree of influence of the vortical structures on phonation. In an excised larynx experiment without a vocal tract, it has been observed that the produced sound does not change much when a finger is placed very close to the glottal exit, which presumably would have significantly disturbed the supraglottal flow field. A more rigorous experiment was designed by Zhang and Neubauer (2010), who placed an anterior-posteriorly aligned cylinder in the supraglottal flow field, traversed it in the flow direction at different left-right locations, and observed the acoustic consequences. The hypothesis was that, if these supraglottal flow structures had a significant effect on vocal fold vibration and acoustics, disturbing them would lead to noticeable changes in the produced sound. However, the experiment found no significant changes in the sound except when the cylinder was positioned within the glottal channel.

The potential impact of intraglottal vortices on phonation has also been investigated numerically (Farahani and Zhang, 2014; Kettlewell, 2015). Because of the difficulty of removing intraglottal vortices without affecting other aspects of the glottal flow, the effect of the intraglottal vortices was modeled as a negative pressure superimposed on the flow pressure predicted by a base glottal flow model. In this way, the effect of the intraglottal vortices could be selectively activated or deactivated independently of the base flow so that its contribution to phonation could be investigated. These studies showed that intraglottal vortices have only small effects on vocal fold vibration and the glottal flow. Kettlewell (2015) further showed that the vortices are either not strong enough to induce significant pressure perturbations on the vocal fold surfaces or, if they are strong enough, they advect rapidly into the supraglottal region so that the induced pressure perturbations are too brief to overcome the inertia of the vocal fold tissue.

Although phonation models using simplified flow descriptions that neglect vortical flow structures are widely used and appear to compare qualitatively well with experiments (Pelorson et al., 1994; Zhang et al., 2002a; Ruty et al., 2007; Kaburagi and Tanabe, 2009), more systematic investigations are required to reach a definite conclusion regarding the relative importance of these flow structures to phonation and voice perception. This may be achieved by conducting parametric studies over a large range of conditions in which the relative strength of these vortical structures is known to vary significantly and observing the consequences for voice production. Such an improved understanding would facilitate the development of computationally efficient reduced-order models of phonation.

IV. BIOMECHANICS OF VOICE CONTROL

A. Fundamental frequency

In discussions of F0 control, an analogy is often made in the voice literature between phonation and the vibration of strings (e.g., Colton et al., 2011). The vibration frequency of a string is determined by its length, tension, and mass. By analogy, the F0 of voice production is also assumed to be determined by the length, tension, and mass of the vocal folds, with the mass interpreted as the mass of the vocal folds that is set into vibration. Specifically, F0 increases with increasing tension, decreasing mass, and decreasing vocal fold length. While the string analogy is conceptually simple and heuristically useful, some important features of the vocal folds are missing. Other than the vague definition of an effective mass, the string model, which implicitly assumes a cross-sectional dimension much smaller than the length, completely neglects the contribution of vocal fold stiffness to F0 control. Although stiffness and tension are often not differentiated in the voice literature, they have different physical meanings and represent two different mechanisms that resist deformation (Fig. 2). Stiffness is a property of the vocal fold and represents the elastic restoring force in response to deformation, whereas tension or stress describes the mechanical state of the vocal folds. The string analogy also neglects the effect of vocal fold contact, which introduces an additional stiffening effect.
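For reference, the string relation underlying this analogy is the standard ideal-string result (stated here in terms of stress; it is not a vocal fold model):

```latex
% Fundamental frequency of an ideal string of length L, tension T (a force),
% and mass per unit length \mu = \rho A (tissue density times cross-sectional
% area), equivalently written in terms of the tensile stress \sigma = T/A:
\[
  F_0 \;=\; \frac{1}{2L}\sqrt{\frac{T}{\mu}}
       \;=\; \frac{1}{2L}\sqrt{\frac{\sigma}{\rho}} ,
\]
% which contains no stiffness term: an ideal string has no bending or shear
% rigidity, which is why the analogy omits the stiffness contribution
% discussed above.
```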

Because phonation is essentially a resonance phenomenon of the vocal folds, the F0 is primarily determined by the frequencies of the vocal fold eigenmodes that are excited. In general, vocal fold eigenfrequencies depend on both vocal fold geometry, including length, depth, and thickness, and the stiffness and stress conditions of the vocal folds. Shorter vocal folds tend to have higher eigenfrequencies. Thus, because of their small vocal fold size, children tend to have the highest F0, followed by adult females and then adult males. Vocal fold eigenfrequencies also increase with increasing stiffness or stress (tension), both of which provide a restoring force that resists vocal fold deformation. Thus, stiffening or tensioning the vocal folds increases the F0 of the voice. In general, the effect of stiffness on vocal fold eigenfrequencies is more dominant than that of tension when the vocal fold is only slightly elongated or shortened, in which case the tension is small or even negative and the string model would underestimate F0 or fail to provide a prediction. As the vocal fold is further elongated and tension increases, the stiffness and tension become equally important in affecting vocal fold eigenfrequencies (Titze and Hunter, 2004; Yin and Zhang, 2013).

When vocal fold contact occurs during vibration, the vocal fold collision force appears as an additional restoring force (Ishizaka and Flanagan, 1972). Depending on the extent, depth of influence, and duration of vocal fold collision, this additional force can significantly increase the effective stiffness of the vocal folds and thus F0. Because the vocal fold contact pattern depends on the degree of vocal fold approximation, the subglottal pressure, and vocal fold stiffness and geometry, changes in any of these parameters may affect F0 by affecting vocal fold contact (van den Berg and Tan, 1959; Zhang, 2016a).

In humans, F0 can be increased by increasing either vocal fold eigenfrequencies or the extent and duration of vocal fold contact. Control of vocal fold eigenfrequencies is largely achieved by varying the stiffness and tension along the AP direction. Due to the nonlinear material properties of the vocal folds, both the AP stiffness and tension can be controlled by elongating or shortening the vocal folds through activation of the CT muscle. Although elongation increases vocal fold length, which by itself lowers F0, the effect of the accompanying increase in stiffness and tension on F0 appears to dominate that of the increasing length.

The effect of TA muscle activation on F0 control is a little more complex. In addition to shortening the vocal fold length, TA activation tensions and stiffens the body layer and decreases tension in the cover layer, but it may decrease or increase the cover stiffness (Yin and Zhang, 2013). Titze et al. (1988) showed that, depending on the depth of the body layer involved in vibration, increasing TA activation can either increase or decrease vocal fold eigenfrequencies. On the other hand, Yin and Zhang (2013) showed that for an elongated vocal fold, as is often the case in phonation, the overall effect of TA activation is to reduce vocal fold eigenfrequencies. Only for a slightly elongated or shortened vocal fold may TA activation increase vocal fold eigenfrequencies. In addition to its effect on vocal fold eigenfrequencies, TA activation increases the vertical thickness of the vocal folds and produces medial compression between the two folds, both of which increase the extent and duration of vocal fold contact and would lead to an increased F0 (Hirano et al., 1969). Because of these opposite effects on vocal fold eigenfrequencies and vocal fold contact, the overall effect of TA activation on F0 varies depending on the specific vocal fold conditions.

Increasing the subglottal pressure or activating the LCA/IA muscles by itself does not have much effect on vocal fold eigenfrequencies (Hirano and Kakita, 1985; Chhetri et al., 2009; Yin and Zhang, 2014). However, these adjustments often increase the extent and duration of vocal fold contact during vibration, particularly with increasing subglottal pressure, and thus lead to an increased F0 (Hirano et al., 1969; Ishizaka and Flanagan, 1972; Zhang, 2016a). Due to nonlinearity in vocal fold material properties, increased vibration amplitude at high subglottal pressures may lead to increased effective stiffness and tension, which may also increase F0 (van den Berg and Tan, 1959; Ishizaka and Flanagan, 1972; Titze, 1989). Ishizaka and Flanagan (1972) showed in their two-mass model that vocal fold contact and material nonlinearity combined can lead to an increase of about 40 Hz in F0 when the subglottal pressure is increased from about 200 to 800 Pa. In the continuum model of Zhang (2016a), which includes the effect of vocal fold contact but not vocal fold material nonlinearity, increasing the subglottal pressure alone can increase F0 by as much as 20 Hz/kPa.

B. Vocal intensity

Because voice is produced at the glottis, filtered by the vocal tract, and radiated from the mouth, an increase in vocal intensity can be achieved by either increasing the source intensity or enhancing the radiation efficiency. The source intensity is controlled primarily by the subglottal pressure, which increases the vibration amplitude and the negative peak or MFDR of the time derivative of the glottal flow. The subglottal pressure depends primarily on the alveolar pressure in the lungs, which is controlled by the respiratory muscles and the lung volume. In general, conditions of the laryngeal system have little effect on the establishment of the alveolar pressure and subglottal pressure ( Hixon, 1987 ; Finnegan et al. , 2000 ). However, an open glottis often results in a small glottal resistance and thus a considerable pressure drop in the lower airway and a reduced subglottal pressure. An open glottis also leads to a large glottal flow rate and a rapid decline in the lung volume, thus reducing the duration of speech between breaths and increasing the respiratory effort required in order to maintain a target subglottal pressure ( Zhang, 2016b ).

In the absence of a vocal tract, laryngeal adjustments, which control vocal fold stiffness, geometry, and position, do not have much effect on the source intensity, as shown in many studies using laryngeal, physical, or computational models of phonation (Tanaka and Tanabe, 1986; Titze, 1988b; Zhang, 2016a). In the experiment by Tanaka and Tanabe (1986), for a constant subglottal pressure, stimulation of the CT and LCA muscles had almost no effect on vocal intensity, whereas stimulation of the TA muscle slightly decreased vocal intensity. In an excised larynx experiment, Titze (1988b) found no dependence of vocal intensity on the glottal width. Similar secondary effects of laryngeal adjustments have also been observed in a recent computational study (Zhang, 2016a). Zhang (2016a) also showed that the effect of laryngeal adjustments may be important at subglottal pressures slightly above onset, in which case an increase in either the AP stiffness or vocal fold approximation may lead to improved vocal fold contact and glottal closure, significantly increasing the MFDR and thus the vocal intensity. However, these effects become less effective with increasing vocal intensity.

The effect of laryngeal adjustments on vocal intensity becomes a little more complicated in the presence of the vocal tract. Changing vocal tract shape by itself does not amplify the produced sound intensity because sound propagation in the vocal tract is a passive process. However, changes in vocal tract shape may provide a better impedance match between the glottis and the free space outside the mouth and thus improve efficiency of sound radiation from the mouth ( Titze and Sundberg, 1992 ). This is particularly the case for harmonics close to a formant, which are often amplified more than the first harmonic and may become the most energetic harmonic in the spectrum of the output voice. Thus, vocal intensity can be increased through laryngeal adjustments that increase excitation of harmonics close to the first formant of the vocal tract ( Fant, 1982 ; Sundberg, 1987 ) or by adjusting vocal tract shape to match one of the formants with one of the dominant harmonics in the source spectrum.

In humans, all three strategies (respiratory, laryngeal, and articulatory) are used to increase vocal intensity. When asked to produce an intensity sweep from soft to loud voice, one generally starts with a slightly breathy voice with a relatively open glottis, which requires the least laryngeal effort but is inefficient in voice production. From this starting position, vocal intensity can be increased by increasing either the subglottal pressure, which increases the vibration amplitude, or vocal fold adduction (approximation and/or thickening). For a soft voice with minimal vocal fold contact and minimal higher-order harmonic excitation, increasing vocal fold adduction is particularly efficient because it may significantly improve vocal fold contact, in both spatial extent and duration, thus significantly boosting the excitation of harmonics close to the first formant. In humans, at low to medium vocal intensity, increases in vocal intensity are often accompanied by simultaneous increases in the subglottal pressure and the glottal resistance (Isshiki, 1964; Holmberg et al., 1988; Stathopoulos and Sapienza, 1993). Because the pitch level did not change much in these experiments, the increase in glottal resistance was most likely due to tighter vocal fold approximation through LCA/IA activation. The duration of the closed phase is often observed to increase with increasing vocal intensity (Henrich et al., 2005), indicating increased vocal fold thickening or medial compression, which are primarily controlled by the TA muscle. Thus, it appears that both LCA/IA/TA muscle activation and subglottal pressure increases play a role in raising vocal intensity at low to medium intensities. At high vocal intensities, when further increases in vocal fold adduction become less effective (Hirano et al., 1969), vocal intensity increase appears to rely predominantly on increasing the subglottal pressure.

On the vocal tract side, Titze (2002) showed that the vocal intensity can be increased by matching a wide epilarynx with a lower glottal resistance or a narrow epilarynx with a higher glottal resistance. Tuning the first formant (e.g., by opening the mouth wider) to match the F0 is often used in soprano singing to maximize the vocal output (Joliveau et al., 2004). Because radiation efficiency can be improved through adjustments of either the vocal folds or the vocal tract, it is possible to improve radiation efficiency while still maintaining the desired pitch or articulation, whichever one wishes to achieve.

C. Voice quality

Voice quality generally refers to aspects of the voice other than pitch and loudness. Due to the subjective nature of voice quality perception, many different descriptions are used, and authors often disagree on the meanings of these descriptions ( Gerratt and Kreiman, 2001 ; Kreiman and Sidtis, 2011 ). This lack of a clear and consistent definition makes it difficult to study voice quality and to identify its physiological correlates and controls. Acoustically, voice quality is associated with the spectral amplitude and shape of the harmonic and noise components of the voice source, and their temporal variations. In the following we focus on physiological factors that are known to affect the voice spectrum and thus are potentially perceptually important.

One of the first systematic investigations of the physiological controls of voice quality was conducted by Isshiki (1989 , 1998) using excised larynges, in which regions of normal, breathy, and rough voice qualities were mapped out in the three-dimensional parameter space of the subglottal pressure, vocal fold stiffness, and prephonatory glottal opening area (Fig. 9). He showed that for a given vocal fold stiffness and prephonatory glottal opening area, increasing subglottal pressure led to voice production of a rough quality. This effect of the subglottal pressure could be counterbalanced by increasing vocal fold stiffness, which enlarged the region of normal voice in the parameter space of Fig. 9. Unfortunately, the details of this study, including the definition and manipulation of vocal fold stiffness and the perceptual evaluation of different voice qualities, are not fully available. The importance of the coordination between the subglottal pressure and laryngeal conditions was also demonstrated by van den Berg and Tan (1959) , who showed that although different vocal registers were observed, each register occurred in a certain range of laryngeal conditions and subglottal pressures. For example, for conditions of low longitudinal tension, a chest-like phonation was possible only for small airflow rates. At large values of the subglottal pressure, “it was impossible to obtain good sound production. The vocal folds were blown too wide apart…. The shape of the glottis became irregularly curved and this curving was propagated along the glottis.” Good voice production at large flow rates was possible only with thyroid cartilage compression, which imitates the effect of TA muscle activation. Irregular vocal fold vibration at high subglottal pressures has also been observed in physical model experiments (e.g., Xuan and Zhang, 2014 ). Irregular or chaotic vocal fold vibration at conditions of pressure-stiffness mismatch has also been reported in the numerical simulation of Berry et al. (1994) , which showed that while regular vocal fold vibration was observed for typical vocal fold stiffness conditions, irregular vocal fold vibration (e.g., subharmonic or chaotic vibration) was observed when the cover layer stiffness was significantly reduced while maintaining the same subglottal pressure.

FIG. 9. A three-dimensional map of normal (N), breathy (B), and rough (R) phonation in the parameter space of the prephonatory glottal area (Ag0), subglottal pressure (Ps), and vocal fold stiffness (k). Reprinted with permission of Springer from Isshiki (1989) .

The experiments of van den Berg and Tan (1959) and Isshiki (1989) also showed that weakly adducted vocal folds (weak LCA/IA/TA activation) often lead to vocal fold vibration with incomplete glottal closure during phonation. When the airflow is sufficiently high, the persistent glottal gap would lead to increased turbulent noise production and thus phonation of a breathy quality (Fig. 9). The incomplete glottal closure may occur in the membranous or the cartilaginous portion of the glottis. When the incomplete glottal closure is limited to the cartilaginous glottis, the resulting voice is breathy but may still have strong harmonics at high frequencies. When the incomplete glottal closure occurs in the membranous glottis, the reduced or slowed vocal fold contact would also reduce excitation of higher-order harmonics, resulting in a breathy and weak quality of the produced voice. When the vocal folds are sufficiently separated, the coupling between the two vocal folds may be weakened enough so that each vocal fold can vibrate at a different F0. This would lead to biphonation, or voice containing two distinct fundamental frequencies, resulting in a perception similar to that of the beat frequency phenomenon.

Compared to a breathy voice, a pressed voice is presumably produced with tight vocal fold approximation or even some degree of medial compression in the membranous portion between the two folds. A pressed voice is often characterized by a second harmonic that is stronger than the first harmonic, or a negative H1-H2, with a long period of glottal closure during vibration. Although a certain degree of vocal fold approximation and stiffness anisotropy is required to achieve vocal fold contact during phonation, the duration of glottal closure has been shown to be primarily determined by the vertical thickness of the vocal fold medial surface ( van den Berg, 1968 ; Zhang, 2016a ). Thus, although it is generally assumed that a pressed voice can be produced with tight arytenoid adduction through LCA/IA muscle activation, activation of the LCA/IA muscles alone is unable to achieve prephonatory medial compression in the membranous glottis or change the vertical thickness of the medial surface. Activation of the TA muscle appears to be essential in producing a change from a breathy to a pressed voice quality. A weakened TA muscle, as in aging or muscle atrophy, would lead to difficulties in producing a pressed voice or even sufficient glottal closure during phonation. On the other hand, strong TA muscle activation, as in, for example, spasmodic dysphonia, may lead to too tight a closure of the glottis and a rough voice quality ( Isshiki, 1989 ).
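Since H1-H2 is one of the few quality-related measures mentioned here that can be computed directly from a waveform, the sketch below estimates it from the magnitude spectrum; the two test waveforms are synthetic toy signals (a near-sinusoidal, breathy-like source and a pulse-like source with a dominant second harmonic), not physiologic data.

    import numpy as np

    def h1_minus_h2(signal, fs, f0):
        """H1-H2 in dB: level of the first harmonic minus the second,
        read off the magnitude spectrum for a known fundamental f0 (Hz)."""
        spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)

        def peak_db(target_hz):
            # search a band of +/- f0/2 around the target harmonic
            band = (freqs > target_hz - f0 / 2) & (freqs < target_hz + f0 / 2)
            return 20.0 * np.log10(spectrum[band].max() + 1e-12)

        return peak_db(f0) - peak_db(2 * f0)

    fs, f0 = 16000.0, 120.0
    t = np.arange(int(0.5 * fs)) / fs
    breathy_like = np.sin(2 * np.pi * f0 * t) + 0.05 * np.sin(4 * np.pi * f0 * t)
    pressed_like = 0.6 * np.sin(2 * np.pi * f0 * t) + 1.0 * np.sin(4 * np.pi * f0 * t)
    print("breathy-like H1-H2: %+.1f dB" % h1_minus_h2(breathy_like, fs, f0))  # positive
    print("pressed-like H1-H2: %+.1f dB" % h1_minus_h2(pressed_like, fs, f0))  # negative

A positive value corresponds to the breathy end of the continuum and a negative value to the pressed end, consistent with the description above.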

In humans, vocal fold stiffness, vocal fold approximation, and geometry are regulated by the same set of laryngeal muscles and thus often co-vary, which has long been considered one possible origin of vocal registers and their transitions ( van den Berg, 1968 ). Specifically, it has been hypothesized that changes in F0 are often accompanied by changes in the vertical thickness of the vocal fold medial surface, which lead to changes in the spectral characteristics of the produced voice. The medial surface thickness is primarily controlled by the CT and TA muscles, which also regulate vocal fold stiffness and vocal fold approximation. Activation of the CT muscle reduces the medial surface thickness, but also increases vocal fold stiffness and tension, and in some conditions increases the resting glottal opening ( van den Berg and Tan, 1959 ; van den Berg, 1968 ; Hirano and Kakita, 1985 ). Because the LCA/IA/TA muscles are innervated by the same nerve and often activated together, an increase in the medial surface thickness through TA muscle activation is often accompanied by increased vocal fold approximation ( Hirano and Kakita, 1985 ) and contact. Thus, if one attempts to increase F0 primarily by activation of the LCA/IA/TA muscles, the vocal folds are likely to have a large medial surface thickness and probably low AP stiffness, which will lead to a chest-like voice production, with large vertical phase difference along the medial surface, long closure of the glottis, small flow rate, and strong harmonic excitation. In the extreme case of strong TA activation, minimal CT activation, and very low subglottal pressure, the glottis can remain closed for most of the cycle, leading to a vocal fry-like voice production. In contrast, if one attempts to increase F0 by increasing CT activation alone, the vocal folds, with a small medial surface thickness, are likely to produce a falsetto-like voice production, with incomplete glottal closure and a nearly sinusoidal flow waveform, very high F0, and a limited number of harmonics.

V. MECHANICAL AND COMPUTER MODELS FOR VOICE APPLICATIONS

Voice applications generally fall into two major categories. In the clinic, simulation of voice production has the potential to predict outcomes of clinical management of voice disorders, including surgery and voice therapy. For such applications, the vocal fold geometry and material properties need to be represented accurately enough to match actual clinical treatment, and for this reason continuum models of the vocal folds are preferred over lumped-element models. Computational cost is not necessarily a primary concern in such applications, but it still has to remain practical. In contrast, for other applications, particularly in speech technology, the primary goal is to reproduce speech acoustics, or at least the perceptually relevant features of speech acoustics. Real-time capability is desired in these applications, whereas realistic representation of the underlying physics is often not necessary. In fact, most current speech synthesis systems treat speech purely as an acoustic signal and do not model the physics of speech production at all. However, models that take the underlying physics into consideration, at least to some degree, may hold the most promise for synthesizing natural-sounding, speaker-specific speech.

A. Mechanical vocal fold models

Early efforts on artificial speech production, dating back as early as the 18th century, focused on mechanically reproducing the speech production system. A detailed review can be found in Flanagan (1972) . The focus of these early efforts was generally on articulation in the vocal tract rather than the voice source, which is understandable considering that meaning is conveyed primarily through changes in articulation and that the voice production process was poorly understood at the time. The vibrating element in these mechanical models, either a vibrating reed or a slotted rubber sheet stretched over an opening, was only a rough approximation of the human vocal folds.

More sophisticated mechanical models have since been developed to better reproduce the three-dimensional layered structure of the vocal folds. A membrane (cover)-cushion (body) two-layer rubber vocal fold model was first developed by Smith (1956) . Similar mechanical models were later developed and used in voice production research (e.g., Isogai et al. , 1988 ; Kakita, 1988 ; Titze et al. , 1995 ; Thomson et al. , 2005 ; Ruty et al. , 2007 ; Drechsel and Thomson, 2008 ), using silicone or rubber materials or liquid-filled membranes. Recent studies ( Murray and Thomson, 2012 ; Xuan and Zhang, 2014 ) have also started to embed fibers into these models to simulate the anisotropic material properties that arise from the presence of collagen and elastin fibers in the vocal folds. A similar layered vocal fold model has been incorporated into a mechanical talking robot system ( Fukui et al. , 2005 ; Fukui et al. , 2007 ; Fukui et al. , 2008 ). The most recent version of the talking robot, Waseda Talker, includes mechanisms for the control of pitch and resting glottal opening, and is able to produce voice of modal, creaky, or breathy quality. Nevertheless, although a mechanical voice production system may find application in voice prostheses or humanoid robotic systems in the future, current mechanical models are still a long way from reproducing or even approaching humans' capability and flexibility in producing and controlling voice.

B. Formant synthesis and parametric voice source models

Compared to mechanically reproducing the physical process involved in speech production, it is easier to reproduce speech as an acoustic signal. This is particularly the case for speech synthesis. One approach adopted in most of the current speech synthesis systems is to concatenate segments of pre-recorded natural voice into new speech phrases or sentences. While relatively easy to implement, in order to achieve natural-sounding speech, this approach requires a large database of words spoken in different contexts, which makes it difficult to apply to personalized speech synthesis of varying emotional percepts.

Another approach is to reproduce only perceptually relevant acoustic features of speech, as in formant synthesis. The target acoustic features to be reproduced generally include the F0, sound amplitude, and formant frequencies and bandwidths. This approach gained popularity with the development of electrical synthesizers and later computer simulations, which allow flexible and accurate control of these acoustic features. Early formant-based synthesizers used simple sound sources, often a filtered impulse train for voiced sounds and white noise for unvoiced sounds. Research on the voice source (e.g., Fant, 1979 ; Fant et al. , 1985 ; Rothenberg et al. , 1971 ; Titze and Talkin, 1979 ) has led to the development of parametric voice source models in the time domain, which are capable of producing voice source waveforms of varying F0, amplitude, open quotient, and degree of abruptness of the glottal flow shutoff, and thus synthesis of different voice qualities.
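The exact parameterization of models such as the LF model ( Fant et al. , 1985 ) is more elaborate, but the sketch below, a simplified Rosenberg-style trigonometric pulse, shows how F0, amplitude, open quotient, and speed quotient enter a time-domain source model; all parameter names and default values here are illustrative.

    import numpy as np

    def glottal_pulse_train(f0, fs, dur, amp=1.0, open_quotient=0.6, speed_quotient=2.0):
        """Rosenberg-style glottal flow pulse train (illustrative).
        open_quotient: fraction of the period during which the glottis is open.
        speed_quotient: ratio of opening-phase to closing-phase duration."""
        n_period = int(round(fs / f0))
        n_open = int(round(open_quotient * n_period))
        n_rise = max(1, int(round(n_open * speed_quotient / (1.0 + speed_quotient))))
        n_fall = max(1, n_open - n_rise)

        pulse = np.zeros(n_period)
        i = np.arange(n_rise)
        pulse[:n_rise] = 0.5 * amp * (1.0 - np.cos(np.pi * i / n_rise))          # opening phase
        j = np.arange(n_fall)
        pulse[n_rise:n_rise + n_fall] = amp * np.cos(0.5 * np.pi * j / n_fall)   # closing phase
        # remaining samples stay at zero: the closed phase

        n_total = int(round(fs * dur))
        return np.tile(pulse, int(np.ceil(n_total / n_period)))[:n_total]

    # Example: a 0.5 s, 120 Hz source with a fairly abrupt glottal flow shutoff.
    source = glottal_pulse_train(f0=120.0, fs=16000.0, dur=0.5)

Filtering such a source through a set of vocal tract resonators (the formants) completes the formant synthesis picture described above.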

While parametric voice source models provide flexibility in source variations, synthetic speech generated by formant synthesis still suffers from limited naturalness. This limited naturalness may result from the primitive rules used in specifying dynamic controls of the voice source models ( Klatt, 1987 ). Also, the source model control parameters are not independent of each other and often co-vary during phonation. A challenge in formant synthesis is thus to specify voice source parameter combinations and their time variation patterns that may occur in realistic voice production of different voice qualities by different speakers. It is also possible that some perceptually important features are missing from time-domain voice source models ( Klatt, 1987 ). Human perception of voice characteristics is better described in the frequency domain, as the auditory system performs an approximation to Fourier analysis of the voice and of sound in general. While time-domain models have better correspondence to the physical events occurring during phonation (e.g., glottal opening and closing, and the closed phase), it is possible that some spectral details of perceptual importance are not captured in simple time-domain voice source models. For example, spectral details in the low and middle frequencies have been shown to be of considerable importance to naturalness judgment, but are difficult to represent in a time-domain source model ( Klatt, 1987 ). A recent study ( Kreiman et al. , 2015 ) showed that spectral-domain voice source models are able to create significantly better matches to natural voices than time-domain voice source models. Furthermore, because the voice source and the sub- and supra-glottal systems are independent in formant synthesis, interactions and co-variations between the vocal folds and the sub- and supra-glottal systems are by design not accounted for. All these factors may contribute to the limited naturalness of formant-synthesized speech.

C. Physically based computer models

An alternative approach to natural speech synthesis is to computationally model the voice production process based on physical principles. The control parameters would be the geometry and material properties of the vocal system or, more realistically, respiratory and laryngeal muscle activation. This approach avoids the need to specify consistent characteristics of either the voice source or the formants, thus allowing synthesis and modification of natural voice in a way intuitively similar to human voice production and control.

The first such computer model of voice production is the one-mass model by Flanagan and Landgraf (1968) , in which the vocal fold is modeled as a horizontally moving, single-degree-of-freedom mass-spring-damper system. This model is able to vibrate only in a restricted range of conditions, when the natural frequency of the mass-spring system is close to one of the acoustic resonances of the subglottal or supraglottal tracts. Ishizaka and Flanagan (1972) extended this model to a two-mass model in which the upper and lower parts of the vocal fold are modeled as two separate masses connected by an additional spring along the vertical direction. The two-mass model is able to vibrate with a vertical phase difference between the two masses, and thus able to vibrate independently of the acoustics of the sub- and supra-glottal tracts. Many variants of the two-mass model have since been developed. Titze (1973) developed a 16-mass model to better represent vocal fold motion along the anterior-posterior direction. To better represent the body-cover layered structure of the vocal folds, Story and Titze (1995) extended the two-mass model to a three-mass model, adding a lateral mass representing the inner muscular layer. Empirical rules have also been developed to relate the control parameters of the three-mass model to laryngeal muscle activation levels ( Titze and Story, 2002 ) so that voice production can be simulated with laryngeal muscle activity as input. Originally designed for speech synthesis purposes, these lumped-element models of voice production are computationally fast and ideal for real-time speech synthesis.
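To make the structure of these lumped-element models concrete, the sketch below integrates a single-degree-of-freedom, one-mass equation of motion driven by a crude Bernoulli-type pressure force; all parameter values are placeholders, and, as noted above, such a bare one-mass model only sustains oscillation when coupled to sub- or supraglottal acoustic loads, which are omitted here.

    import numpy as np

    # Structural sketch of a one-mass vocal fold model:
    #     m * x'' + c * x' + k * x = F_aero(x)
    # where x is the lateral displacement of the mass from its rest position.
    # All values below are placeholders for illustration, not measured data.
    M = 0.1e-3            # effective mass, kg
    K = 80.0              # effective stiffness, N/m (natural frequency ~140 Hz)
    C = 0.02              # damping, N*s/m
    PS = 800.0            # subglottal pressure, Pa
    X0 = 0.5e-3           # prephonatory glottal half-width, m
    AREA = 1.2e-2 * 3e-3  # medial surface area (length x thickness), m^2

    def aero_force(x):
        """Crude Bernoulli-type driving force on the medial surface: capped at
        the full subglottal pressure when the glottis is (nearly) closed and
        decreasing as the glottis opens and the intraglottal flow speeds up."""
        opening = max(X0 + x, 0.0)
        return PS * AREA * min(1.0, X0 / (opening + 1e-9))

    def simulate(dur=0.05, fs=100_000):
        dt, x, v = 1.0 / fs, 0.0, 0.0
        xs = np.empty(int(dur * fs))
        for n in range(xs.size):
            a = (aero_force(x) - C * v - K * x) / M   # Newton's second law
            v += a * dt                               # semi-implicit Euler step
            x += v * dt
            xs[n] = x
        return xs

    displacement = simulate()

Two-mass and body-cover three-mass models extend this same structure by adding masses, coupling springs, and rules for distributing the intraglottal pressure.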

A drawback of the lumped-element models of phonation is that the model control parameters cannot be directly measured or easily related to the anatomical structure or material properties of the vocal folds. Thus, these models are less useful in applications in which a realistic representation of voice physiology is required, as, for example, in the clinical management of voice disorders. To better understand the voice source and its control under different voicing conditions, more sophisticated computational models of the vocal folds based on continuum mechanics have been developed to study laryngeal muscle control of vocal fold geometry, stiffness, and tension, and how changes in these vocal fold properties affect the glottal fluid-structure interaction and the produced voice. One of the first such models is the finite-difference model by Titze and Talkin (1979) , which coupled a three-dimensional vocal fold model of linear elasticity with the one-dimensional glottal flow model of Ishizaka and Flanagan (1972) . In the past two decades, more refined phonation models using a two-dimensional or three-dimensional Navier-Stokes description of the glottal flow have been developed (e.g., Alipour et al. , 2000 ; Zhao et al. , 2002 ; Tao et al. , 2007 ; Luo et al. , 2009 ; Zheng et al. , 2009 ; Bhattacharya and Siegmund, 2013 ; Xue et al. , 2012 , 2014 ). Continuum models of laryngeal muscle activation have also been developed to model vocal fold posturing ( Hunter et al. , 2004 ; Gommel et al. , 2007 ; Yin and Zhang, 2013 , 2014 ). By directly modeling the voice production process, continuum models with realistic geometry and material properties in principle hold the most promise for reproducing natural human voice production. However, because the phonation process is highly nonlinear and involves large displacement and deformation of the vocal folds and complex glottal flow patterns, modeling this process in three dimensions is computationally very challenging and time-consuming. As a result, these computational studies are often limited to one or two specific aspects instead of the entire voice production process, and the acoustics of the produced voice, other than F0 and vocal intensity, are often not investigated. For practical applications, real-time or not, reduced-order models with significantly improved computational efficiency are required. Some reduced-order continuum models, with simplifications in both the glottal flow and vocal fold dynamics, have been developed and used in large-scale parametric studies of voice production (e.g., Titze and Talkin, 1979 ; Zhang, 2016a ), and they appear to produce qualitatively reasonable predictions. However, these simplifications have yet to be rigorously validated by experiment.
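As an example of the kind of flow simplification used in reduced-order models, the quasi-steady Bernoulli estimate below converts a transglottal pressure and a minimum glottal area into a volume flow; the numbers in the example are typical conversational values, not results from the studies cited above.

    import numpy as np

    RHO = 1.2  # air density, kg/m^3

    def bernoulli_flow(glottal_area_m2, p_sub_pa, p_supra_pa=0.0):
        """Quasi-steady Bernoulli estimate of glottal volume flow (m^3/s):
        the transglottal pressure drop accelerates air through the minimum
        glottal area; viscous losses and flow unsteadiness are neglected."""
        dp = max(p_sub_pa - p_supra_pa, 0.0)
        return glottal_area_m2 * np.sqrt(2.0 * dp / RHO)

    # A 0.05 cm^2 glottal opening driven by 800 Pa of subglottal pressure
    # yields a flow on the order of 200 mL/s (1 m^3/s = 1e6 mL/s).
    print(f"{bernoulli_flow(0.05e-4, 800.0) * 1e6:.0f} mL/s")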

VI. FUTURE CHALLENGES

We currently have a general understanding of the physical principles of voice production. Toward establishing a cause-effect theory of voice production, much remains to be learned about voice physiology and biomechanics. This includes the geometry and mechanical properties of the vocal folds, their variability across subjects and with sex and age, and how they vary across voicing conditions under laryngeal muscle activation. Even less is known about changes in vocal fold geometry and material properties in pathologic conditions. The surface conditions of the vocal folds and their mechanical properties have been shown to affect vocal fold vibration ( Dollinger et al. , 2014 ; Bhattacharya and Siegmund, 2015 ; Tse et al. , 2015 ), and thus need to be better quantified. While in vivo animal or human larynx models ( Moore and Berke, 1988 ; Chhetri et al. , 2012 ; Berke et al. , 2013 ) could provide such information, more reliable measurement methods are required to better quantify the viscoelastic properties of the vocal folds, vocal fold tension, and the geometry and movement of the inner vocal fold layers. While macro-mechanical properties are of interest, the development of vocal fold constitutive laws based on ECM distribution and interstitial fluids within the vocal folds would allow us to better understand how vocal fold mechanical properties change with prolonged vocal use, vocal fold injury, and wound healing, changes that are otherwise difficult to quantify.

While oversimplifying the vocal folds to a mass and a tension is of limited practical use, the opposite extreme of modeling every physiologic detail is not appealing either. With improved characterization and understanding of vocal fold properties, establishing a cause-effect relationship between voice physiology and production thus requires identifying which of these physiologic features are actually perceptually relevant, and under what conditions, through systematic parametric investigations. Such investigations will also facilitate the development of reduced-order computational models of phonation in which perceptually relevant physiologic features are sufficiently represented and features of minimal perceptual relevance are simplified. We discussed earlier that many of the complex supraglottal flow phenomena have questionable perceptual relevance. Similar relevance questions can be asked with regard to the geometry and mechanical properties of the vocal folds. For example, while the vocal folds exhibit complex viscoelastic properties, what are the main material properties that are required to reasonably predict vocal fold vibration and voice quality? Does each of the vocal fold layers, in particular the different layers of the lamina propria, have a functional role in determining the voice output or preventing vocal injury? Current vocal fold models often use a simplified vocal fold geometry. Could some geometric features of a realistic vocal fold that are not included in current models have an important role in affecting voice efficiency and voice quality? Because voice communication spans a large range of voice conditions (e.g., pitch, loudness, and voice quality), the perceptual relevance and adequacy of specific features (i.e., do changes in specific features lead to perceivable changes in voice?) should be investigated across a large number of voice conditions rather than a few selected conditions. While physiologic models of phonation allow better reproduction of realistic vocal fold conditions, computational models are more suitable for such systematic parametric investigations. Unfortunately, due to the high computational cost, current studies using continuum models are often limited to a few conditions. Thus, the establishment of cause-effect relationships and the development of reduced-order models are likely to be iterative processes, in which the models are gradually refined to include more of the physiologic details that prove relevant to the cause-effect relationship.

A causal theory of voice production would allow us to map out regions in the physiological parameter space that produce distinct vocal fold vibration patterns and voice qualities of interest (e.g., normal, breathy, and rough voices for clinical applications; different vocal registers for singing training), similar to the map described by Isshiki (1989 ; see also Fig. 9). Although the voice production system is quite complex, the control of voice should be both stable and simple, which is required for voice to be a robust and easily controlled means of communication. Understanding voice production in the framework of nonlinear dynamics and eigenmode interactions, and relating it to voice quality, may facilitate progress toward this goal. Toward practical clinical applications, such a voice map would help us understand what physiologic alteration caused a given voice change (the inverse problem), and what can be done to restore the voice to normal. Development of efficient and reliable tools addressing the inverse problem has important applications in the clinical diagnosis of voice disorders. Some methods already exist that solve the inverse problem in lumped-element models (e.g., Dollinger et al. , 2002 ; Hadwin et al. , 2016 ), and these can be extended to physiologically more realistic continuum models.
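A toy illustration of this inverse workflow, using least-squares fitting: the forward model below is a made-up algebraic stand-in that maps a stiffness-like and a pressure-like parameter to an F0 and a sound level, where in practice it would be a lumped-element or continuum phonation model; parameter names, scalings, and target values are all invented for the example.

    import numpy as np
    from scipy.optimize import least_squares

    def forward_model(params):
        """Made-up stand-in for a phonation model: maps control parameters to
        observable voice features (F0 in Hz, sound level in dB)."""
        stiffness, pressure = params
        f0 = 60.0 * np.sqrt(stiffness)                  # stiffer folds -> higher F0 (invented scaling)
        level = 70.0 + 6.0 * np.log2(pressure / 400.0)  # louder with higher pressure (invented scaling)
        return np.array([f0, level])

    measured = np.array([220.0, 78.0])                  # observed F0 (Hz) and level (dB), invented targets

    fit = least_squares(lambda p: forward_model(p) - measured,
                        x0=[1.0, 400.0],
                        bounds=([0.1, 100.0], [100.0, 3000.0]))
    print("estimated stiffness-like parameter:", round(fit.x[0], 2))
    print("estimated pressure-like parameter (Pa):", round(fit.x[1], 1))

Replacing the stand-in with a physiologically realistic forward model, and the two features with a richer set of acoustic and kinematic measurements, is what makes the real inverse problem challenging.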

Solving the inverse problem would also provide an indirect approach toward understanding the physiologic states that lead to percepts of different emotional states or communication of other personal traits, which are otherwise difficult to measure directly in live human beings. When extended to continuous speech production, this approach may also provide insights into the dynamic physiologic control of voice in running speech (e.g., time contours of the respiratory and laryngeal adjustments). Such information would facilitate the development of computer programs capable of natural-sounding, conversational speech synthesis, in which the time contours of control parameters may change with context, speaking style, or emotional state of the speaker.

ACKNOWLEDGMENTS

This study was supported by research Grant Nos. R01 DC011299 and R01 DC009229 from the National Institute on Deafness and Other Communication Disorders, the National Institutes of Health. The author would like to thank Dr. Liang Wu for assistance in preparing the MRI images in Fig. 1, Dr. Jennifer Long for providing the image in Fig. 1(b) , Dr. Gerald Berke for providing the stroboscopic recording from which Fig. 3 was generated, and Dr. Jody Kreiman, Dr. Bruce Gerratt, Dr. Ronald Scherer, and an anonymous reviewer for the helpful comments on an earlier version of this paper.
