
Losing her speech made her feel isolated from humanity.

Synonyms: communication, conversation, parley, parlance

He expresses himself better in speech than in writing.

We waited for some speech that would indicate her true feelings.

Synonyms: talk, mention, comment, asseveration, assertion, observation

a fiery speech.

Synonyms: discourse, talk

  • any single utterance of an actor in the course of a play, motion picture, etc.

Synonyms: patois, tongue

Your slovenly speech is holding back your career.

  • a field of study devoted to the theory and practice of oral communication.
  • Archaic. rumor.

to have speech with somebody

speech therapy

  • that which is spoken; utterance
  • a talk or address delivered to an audience
  • a person's characteristic manner of speaking
  • a national or regional language or dialect
  • Linguistics: another word for parole (see parole)


Other Words From

  • self-speech noun

Word History and Origins

Origin of speech 1

Synonym Study

Example Sentences

Kids are interacting with Alexas that can record their voice data and influence their speech and social development.

The attorney general delivered a controversial speech Wednesday.

For example, my company, Teknicks, is working with an online K-12 speech and occupational therapy provider.

Instead, it would give tech companies a powerful incentive to limit Brazilians’ freedom of speech at a time of political unrest.

However, the president did give a speech in Suresnes, France, the next day during a ceremony hosted by the American Battle Monuments Commission.

Those are troubling numbers, for unfettered speech is not incidental to a flourishing society.

There is no such thing as speech so hateful or offensive it somehow “justifies” or “legitimizes” the use of violence.

We need to recover and grow the idea that the proper answer to bad speech is more and better speech.

Tend to your own garden, to quote the great sage of free speech, Voltaire, and invite people to follow your example.

The simple, awful truth is that free speech has never been particularly popular in America.

Alessandro turned a grateful look on Ramona as he translated this speech, so in unison with Indian modes of thought and feeling.

And so this is why the clever performer cannot reproduce the effect of a speech of Demosthenes or Daniel Webster.

He said no more in words, but his little blue eyes had an eloquence that left nothing to mere speech.

After pondering over Mr. Blackbird's speech for a few moments he raised his head.

Albinia, I have refrained from speech as long as possible; but this is really too much!

Related Words

More About Speech

What is speech?

Speech is the ability to express thoughts and emotions through vocal sounds and gestures. The act of doing this is also known as speech.

Speech is something only humans are capable of doing and this ability has contributed greatly to humanity’s ability to develop civilization. Speech allows humans to communicate much more complex information than animals are able to.

Almost all animals make sounds or noises with the intent to communicate with each other, such as mating calls and yelps of danger. However, animals aren’t actually talking to each other. That is, they aren’t forming sentences or sharing complicated information. Instead, they are making simple noises that trigger another animal’s natural instincts.

While speech does involve making noises, there is a lot more going on than simple grunts and growls. First, humans' vocal machinery, such as our lungs, throat, vocal cords, and tongue, allows for a wide range of intricate sounds. Second, the human brain is incredibly complex, allowing humans to process vocal sounds and understand combinations of them as words and oral communication. The human brain is essential for speech. While chimpanzees and other apes have vocal organs similar to humans', their brains are much less advanced and they are unable to learn speech.

Where does speech come from?

The first records of the word speech come from before the year 900. It ultimately comes from the Old English word sprecan, meaning "to speak." Scientists debate the exact date when humanity first learned to speak, with estimates ranging from 50,000 to 2 million years ago.

Related to the concept of speech is the idea of language. A language is the collection of symbols, sounds, gestures, and anything else that a group of people use to communicate with each other, such as English, Swahili, and American Sign Language. Speech is actually using those things to orally communicate with someone else.

Did you know … ?

But what about birds that "talk"? Parrots in particular are famous for their ability to say human words and sentences, but birds are incapable of speech. What they are actually doing is learning common sounds that humans make and mimicking them, without understanding what the sounds they repeat mean.

What are real-life examples of speech?

Speech is essential to human communication.

Dutch is just enough like German that I can read text on signs and screens, but not enough that I can understand speech. — Clark Smith Cox III (@clarkcox) September 8, 2009
I can make squirrels so excited, I could almost swear they understand human speech! — Neil Oliver (@thecoastguy) July 20, 2020

What other words are related to speech?

  • communication
  • information

Quiz yourself!

True or False?

Humans are the only animals capable of speech.


Speech Production

  • Reference work entry
  • First Online: 01 January 2015
  • pp 1493–1498

  • Laura Docio-Fernandez
  • Carmen García Mateo



Author information

Authors and Affiliations

Department of Signal Theory and Communications, University of Vigo, Vigo, Spain

Laura Docio-Fernandez

Atlantic Research Center for Information and Communication Technologies, University of Vigo, Pontevedra, Spain

Carmen García Mateo


Editor information

Editors and Affiliations

Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China

Stan Z. Li

Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA

Anil K. Jain

Copyright information

© 2015 Springer Science+Business Media New York

About this entry

Cite this entry.

Docio-Fernandez, L., García Mateo, C. (2015). Speech Production. In: Li, S.Z., Jain, A.K. (eds) Encyclopedia of Biometrics. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7488-4_199


DOI: https://doi.org/10.1007/978-1-4899-7488-4_199

Published: 03 July 2015

Publisher Name: Springer, Boston, MA

Print ISBN: 978-1-4899-7487-7

Online ISBN: 978-1-4899-7488-4

eBook Packages: Computer Science, Reference Module Computer Science and Engineering



2.1 How Humans Produce Speech

Phonetics studies human speech. Speech is produced by bringing air from the lungs to the larynx (respiration), where the vocal folds may be held open to allow the air to pass through or may vibrate to make a sound (phonation). The airflow from the lungs is then shaped by the articulators in the mouth and nose (articulation).


Video script.

The field of phonetics studies the sounds of human speech. When we study speech sounds, we can consider them from two angles. Acoustic phonetics, in addition to being part of linguistics, is also a branch of physics; it's concerned with the physical, acoustic properties of the sound waves that we produce. We'll talk some about the acoustics of speech sounds, but we're primarily interested in articulatory phonetics, that is, how we humans use our bodies to produce speech sounds. Producing speech requires three mechanisms.

The first is a source of energy.  Anything that makes a sound needs a source of energy.  For human speech sounds, the air flowing from our lungs provides energy.

The second is a source of the sound:  air flowing from the lungs arrives at the larynx. Put your hand on the front of your throat and gently feel the bony part under your skin.  That’s the front of your larynx . It’s not actually made of bone; it’s cartilage and muscle. This picture shows what the larynx looks like from the front.

Larynx external

This next picture is a view down a person’s throat.

Cartilages of the Larynx

What you see here is that the opening of the larynx can be covered by two triangle-shaped pieces of skin.  These are often called “vocal cords” but they’re not really like cords or strings.  A better name for them is vocal folds .

The opening between the vocal folds is called the glottis .

We can control our vocal folds to make a sound.  I want you to try this out so take a moment and close your door or make sure there’s no one around that you might disturb.

First I want you to say the word “uh-oh”. Now say it again, but stop half-way through, “Uh-”. When you do that, you’ve closed your vocal folds by bringing them together. This stops the air flowing through your vocal tract.  That little silence in the middle of “uh-oh” is called a glottal stop because the air is stopped completely when the vocal folds close off the glottis.

Now I want you to open your mouth and breathe out quietly, “haaaaaaah”. When you do this, your vocal folds are open and the air is passing freely through the glottis.

Now breathe out again and say “aaah”, as if the doctor is looking down your throat.  To make that “aaaah” sound, you’re holding your vocal folds close together and vibrating them rapidly.

When we speak, we make some sounds with vocal folds open, and some with vocal folds vibrating.  Put your hand on the front of your larynx again and make a long “SSSSS” sound.  Now switch and make a “ZZZZZ” sound. You can feel your larynx vibrate on “ZZZZZ” but not on “SSSSS”.  That’s because [s] is a voiceless sound, made with the vocal folds held open, and [z] is a voiced sound, where we vibrate the vocal folds.  Do it again and feel the difference between voiced and voiceless.

Now take your hand off your larynx and plug your ears and make the two sounds again with your ears plugged. You can hear the difference between voiceless and voiced sounds inside your head.
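The voiced/voiceless contrast described above can also be verified computationally: vocal-fold vibration makes a voiced sound like [z] quasi-periodic, while a voiceless [s] is noise-like. The sketch below is an illustration added here, not part of the original lesson; it stands in for real speech with synthetic signals and measures periodicity with a simple autocorrelation test.

```python
# Toy illustration (not from the original lesson): telling a voiced sound
# like [z] apart from a voiceless one like [s] by measuring periodicity.
# We fake both sounds with synthetic signals.
import numpy as np

def periodicity(x: np.ndarray, min_lag: int = 40) -> float:
    """Peak of the normalized autocorrelation beyond `min_lag` samples.
    Close to 1 for periodic (voiced-like) signals, near 0 for noise."""
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    ac = ac / ac[0]
    return float(ac[min_lag:].max())

fs = 16000                                  # sampling rate, Hz
t = np.arange(int(0.05 * fs)) / fs          # 50 ms of signal
voiced = np.sin(2 * np.pi * 200 * t)        # 200 Hz buzz, stand-in for [z]
voiceless = np.random.default_rng(0).standard_normal(t.size)  # hiss, like [s]

print(periodicity(voiced))     # high (near 1): periodic, voiced
print(periodicity(voiceless))  # low: aperiodic, voiceless
```

Real voicing detectors in speech software work on short overlapping frames and are more robust, but the underlying idea is the same: look for repetition at a plausible glottal period.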

I said at the beginning that there are three crucial mechanisms involved in producing speech, and so far we’ve looked at only two:

  • Energy comes from the air supplied by the lungs.
  • The vocal folds produce sound at the larynx.
  • The sound is then filtered, or shaped, by the articulators .

The oral cavity is the space in your mouth. The nasal cavity, obviously, is the space inside and behind your nose. And of course, we use our tongues, lips, teeth and jaws to articulate speech as well.  In the next unit, we’ll look in more detail at how we use our articulators.

So to sum up, the three mechanisms that we use to produce speech are:

  • respiration at the lungs,
  • phonation at the larynx, and
  • articulation in the mouth.

Essentials of Linguistics Copyright © 2018 by Catherine Anderson is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License , except where otherwise noted.

Share This Book


13.7 Cosmos & Culture

When did human speech evolve?

Barbara J. King

About 1.75 million years ago, our human ancestors, the hominins (who you may remember as the hominids ), achieved a technological breakthrough. They began to craft stone hand axes (called Acheulean tools) in ways that required more planning and precision than had been used in earlier tool-making processes. Around the same time, these prehistoric people began to talk.

In other words, tool-making skills and language skills evolved together; our language, as well as our technology, has a long prehistory.


Language may have evolved in concert with tool making. (Sergey Lavrentev/iStockphoto.com)

That's the conclusion of research published last Friday by archaeologist Natalie Thais Uomini and psychologist Georg Friedrich Meyer in the journal PLOS ONE. Theirs is a provocative study that uses modern brain-imaging techniques to probe thorny questions of our distant past.

Asking when our ancestors first began to talk is a challenging query. It's quite different from seeking the origins of bipedalism, where skeletal material may reveal key clues, or the origins of technology or art, where artifacts may hold the answers. Speech organs don't fossilize, and it's challenging to link artifacts with the earliest speech.

This new paper, then, deserves some attention. I'd like to explain the context for why and how the researchers tackled the origins of speech, report their experimental findings and consider some critical responses to their conclusions.

Uomini and Meyer kick off their article with a key distinction: The earliest stone tools ( Oldowan tools) in the archaeological record of hominin activity are securely dated (so far at least) to 2.5 million years ago. By contrast, the timing for the origin of speech is hotly debated — unsurprisingly, given the challenges recounted above. Dates suggested range between two million and 50,000 years ago. That's a huge span and a clear motivator for more research.

Uomini and Meyer's approach — building on earlier formulations, such as those presented in an influential 1991 paper by Patricia Greenfield — was to measure patterns of brain activation in modern people as they demonstrated both linguistic and technological skills, which share what is called "the need for structured and hierarchical action plans." The authors decided to seek "direct evidence that both skills draw on common brain areas or result in common brain activation patterns."

In order to do this, Uomini and Meyer recruited 10 experienced flint knappers who were willing to go about their craft while wired up to an fTCD device — a functional transcranial Doppler ultrasonography machine that measures cerebral blood flow. Unlike the fMRI and PET techniques, the fTCD doesn't require that the person hold still during the scanning. In fact, it accommodates a great deal of movement.

The participants, while wired to the fTCD, were given two tasks: to make a hand ax in the tradition of ancient hominins (the technology task) and to think up, but not verbalize aloud, a list of words all starting with the same designated letter (the linguistic task). The tasks were interspersed with control periods (striking the core without making a tool, and sitting quietly, respectively). The researchers' prediction was as follows:

Individuals who show highly lateralized rapid blood flow changes for language should show a similar response during stone knapping.
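For readers curious what "lateralized blood flow changes" means quantitatively: fTCD studies commonly summarize them with a laterality index comparing left- and right-hemisphere flow velocities. The sketch below is purely hypothetical; the function name and all numbers are invented for illustration, and this is not the analysis pipeline from the Uomini and Meyer paper.

```python
# Hypothetical sketch of a laterality index (LI), the kind of summary
# statistic fTCD studies report: LI = (L - R) / (L + R) over task-related
# blood-flow velocities. Positive LI suggests left-lateralized activity.
# All values below are fabricated for illustration.
import numpy as np

def laterality_index(left, right):
    """Positive values indicate left-lateralized activity."""
    l, r = np.mean(left), np.mean(right)
    return float((l - r) / (l + r))

# Fabricated flow velocities (cm/s) during a word-generation task:
left_mca = [62.0, 63.5, 64.2, 63.8]    # left middle cerebral artery
right_mca = [58.1, 57.9, 58.4, 58.0]   # right middle cerebral artery

li = laterality_index(left_mca, right_mca)
print(li > 0)  # left-lateralized, as expected for a language task
```

The prediction quoted above amounts to saying that participants' LI during stone knapping should resemble their LI during the word task.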

And that is exactly what they found: "common cerebral blood flow lateralization signatures" in the participants, a finding "consistent with" a co-evolution of linguistic and skilled manual-motor skills. Uomini and Meyer then go for the big evolutionary conclusion:

Our results support the hypothesis that aspects of language might have emerged as early as 1.75 million years ago, with the start of Acheulean technology.

In my own theorizing on the evolution of language, I've always thought that an earlier rather than a later date for the origins of speech was likely.

The communication skills (both vocal and gestural) of our closest living relatives — chimpanzees, bonobos and gorillas — are complex, and presumably (though not definitely) indicate the evolutionary platform from which hominin linguistic skills evolved. But does the Uomini and Meyer approach relying on blood-flow patterns in modern people really help us learn about the past?

Writing in the AAAS publication Science , Michael Balter has reported on assessments of the new research by other scholars in the field. Most notable to me is archaeologist Thomas Wynn's concern that the fTCD technique measures blood flow to large areas of the brain but without as high a resolution as fMRI or PET.

I asked Iain Davidson , emeritus professor of archaeology at the University of New England in Australia, a person expert in matters of human evolution, language and tools (see him in action starting at 38:00 in this video ) for his thoughts on the new research. Davidson replied to me in an email message:

Of course a modern person aiming to make a hand ax does so with a plan and with real conceptual thought about how to proceed, and it might be a great relief to the new phrenologists of brain monitoring that their studies show this. But it does not tell us anything about how hand axes were made or what relationship that may have had with cognitive function when hominins had different brains and unknown need for plans or conceptualisation in making such tools.

I agree with Davidson. The methodology used in this research — carried out via the portable fTCD — may have a high coolness factor, but as far as giving us credible clues to hominin speech goes? I'd say no: the actions and blood-flow patterns of ten 21st century people can't get us there. Will we ever discover when our kind began to talk? That remains an open question.

Barbara's most recent book is How Animals Grieve. You can keep up with what she is thinking on Twitter: @bjkingape


The Classroom | Empowering Students in Their College Journey

The Parts of Human Speech Organs & Their Definitions

Types of Phonetics


Imagine being unable to respond to a spoken greeting. The ability to speak may not strike you as an important part of your day, but if it were taken away, you might find yourself unable to communicate not only basic information but also emotional responses like fear, confusion or anxiety. Although you may not give your speech organs much thought, they are integrally tied to how you function. From the lungs to the mouth, the organs of speech and their roles in sound production touch many aspects of your life.

Breathing and Speaking Connections

Looking at the speech mechanism and organs of speech begins with the vital lungs. The lungs are located in the chest cavity and expand and contract to push air out of the mouth. Simple airflow is not enough to produce speech. The airflow must be modified by other speech organs to be more than just respiration. When you exhale, air moves out of your lungs through your windpipe or trachea. At the top of the trachea is one of the other primary organs of speech: the larynx or voice box.

Vibrations of the Larynx

Three more parts of the speech mechanism are the larynx, the epiglottis and the vocal folds. The larynx is covered by a flap of tissue called the epiglottis. The epiglottis blocks the trachea to keep food from going into your lungs when you swallow. Across the larynx are two thin bands of tissue called the vocal folds or vocal cords. Depending on how the folds are positioned, air coming through the trachea makes them vibrate and buzz. These vibrations produce what is called a "voiced" or soft sound. Placing your fingertips over the Adam's apple, or larynx, at the front of your neck while humming makes it possible to feel the vocal folds vibrate.

Articulators of Speech

The inside of your mouth is also called the oral cavity and controls the shape of words. At the back of the oral cavity on the roof of the mouth is the soft palate or velum. When you pronounce oral sounds, such as "cat" or "bag," the velum is located in the up position to block air flow through the nasal cavity. When you pronounce nasal sounds, such as "can" or "mat," the velum drops down to allow air to pass through the nasal cavity. In front of the velum is the hard palate. Your tongue presses or taps against the hard palate when you pronounce certain words, such as "tiptoe." Developmental or physical issues related to speech organs that are articulators of speech can result in a need for speech therapy.

Teeth, Tongue and Lips

Say "Thank you." Feel how your tongue presses against the inside of your front teeth. The convex area directly behind your teeth is known as the teeth ridge. For the purposes of linguistics, the tongue is divided into three regions: the blade, front and back. The tip of the tongue, which touches the teeth ridge, is called the blade. The middle of the tongue, which lines up with the hard palate, is called the front of the tongue. Finally, beneath the soft palate is the back of the tongue. The final speech organ is the most visible and obvious: the lips. Your lips influence the shape of the sounds leaving the oral cavity. Each of these speech organs is important to the process of speech, articulation and expression through sound.



Carolyn Robbins began writing in 2006. Her work appears on various websites, covering topics including neuroscience, physiology, nutrition and fitness. Robbins graduated with a bachelor of science degree in biology and theology from Saint Vincent College.

J Acoust Soc Am

Mechanics of human voice production and control

As the primary means of communication, voice plays an important role in daily life. Voice also conveys personal information such as social status, personal traits, and the emotional state of the speaker. Mechanically, voice production involves complex fluid-structure interaction within the glottis and its control by laryngeal muscle activation. An important goal of voice research is to establish a causal theory linking voice physiology and biomechanics to how speakers use and control voice to communicate meaning and personal information. Establishing such a causal theory has important implications for clinical voice management, voice training, and many speech technology applications. This paper provides a review of voice physiology and biomechanics, the physics of vocal fold vibration and sound production, and laryngeal muscular control of the fundamental frequency of voice, vocal intensity, and voice quality. Current efforts to develop mechanical and computational models of voice production are also critically reviewed. Finally, issues and future challenges in developing a causal theory of voice production and perception are discussed.

I. INTRODUCTION

In the broad sense, voice refers to the sound we produce to communicate meaning, ideas, opinions, etc. In the narrow sense, voice, as in this review, refers to sounds produced by vocal fold vibration, or voiced sounds. This is in contrast to unvoiced sounds which are produced without vocal fold vibration, e.g., fricatives which are produced by airflow through constrictions in the vocal tract, plosives produced by sudden release of a complete closure of the vocal tract, or other sound producing mechanisms such as whispering. For voiced sound production, vocal fold vibration modulates airflow through the glottis and produces sound (the voice source), which propagates through the vocal tract and is selectively amplified or attenuated at different frequencies. This selective modification of the voice source spectrum produces perceptible contrasts, which are used to convey different linguistic sounds and meaning. Although this selective modification is an important component of voice production, this review focuses on the voice source and its control within the larynx.
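The source-filter picture in the paragraph above can be sketched numerically: a periodic source is generated, then shaped by resonances standing in for the vocal tract. The snippet below is a deliberately simplified illustration; the formant frequencies and bandwidths are assumed round numbers, not values taken from this review.

```python
# Simplified source-filter sketch (illustrative only; formant values are
# assumed, not from this review). A glottal-like impulse train (the voice
# source) is shaped by two second-order resonators standing in for the
# first two vocal-tract formants.
import numpy as np

fs = 16000                      # sampling rate (Hz)
f0 = 125                        # fundamental frequency of the source (Hz)
n = int(0.2 * fs)               # 200 ms of signal

# Voice source: one impulse per glottal cycle -> harmonics at multiples of F0.
source = np.zeros(n)
source[::fs // f0] = 1.0

def resonator(x, freq, bw, fs):
    """All-pole second-order resonator at `freq` Hz with bandwidth `bw` Hz."""
    r = np.exp(-np.pi * bw / fs)                    # pole radius from bandwidth
    a1 = -2.0 * r * np.cos(2.0 * np.pi * freq / fs)
    a2 = r * r
    y1 = y2 = 0.0
    out = np.empty_like(x)
    for i, xn in enumerate(x):                      # y[n] = x[n] - a1*y[n-1] - a2*y[n-2]
        yn = xn - a1 * y1 - a2 * y2
        out[i] = yn
        y2, y1 = y1, yn
    return out

# "Vocal tract": cascade of two formant resonators (roughly vowel-like).
voiced = resonator(source, 700, 130, fs)    # F1 ~ 700 Hz
voiced = resonator(voiced, 1100, 90, fs)    # F2 ~ 1100 Hz

# The output spectrum shows harmonics of F0 amplified near the formants.
spectrum = np.abs(np.fft.rfft(voiced))
freqs = np.fft.rfftfreq(n, 1.0 / fs)
```

Plotting `spectrum` against `freqs` shows exactly the selective amplification the review describes: harmonic lines spaced at F0, with peaks near the assumed formant frequencies and attenuation far from them.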

For effective communication of meaning, the voice source, as a carrier for the selective spectral modification by the vocal tract, contains harmonic energy across a large range of frequencies that spans at least the first few acoustic resonances of the vocal tract. In order to be heard over noise, such harmonic energy also has to be reasonably above the noise level within this frequency range, unless a breathy voice quality is desired. The voice source also contains important information of the pitch, loudness, prosody, and voice quality, which convey meaning (see Kreiman and Sidtis, 2011 , Chap. 8 for a review), biological information (e.g., size), and paralinguistic information (e.g., the speaker's social status, personal traits, and emotional state; Sundberg, 1987 ; Kreiman and Sidtis, 2011 ). For example, the same vowel may sound different when spoken by different people. Sometimes a simple “hello” is all it takes to recognize a familiar voice on the phone. People tend to use different voices to different speakers on different occasions, and it is often possible to tell if someone is happy or sad from the tone of their voice.

One of the important goals of voice research is to understand how the vocal system produces voice of different source characteristics and how people associate percepts with these characteristics. Establishing a cause-effect relationship between voice physiology and voice acoustics and perception will allow us to answer two essential questions in voice science and effective clinical care (Kreiman et al., 2014): when the output voice changes, what physiological alteration caused this change; and if a change to voice physiology occurs, what change in perceived voice quality can be expected? Clinically, such knowledge would lead to the development of a physically based theory of voice production that is capable of better predicting voice outcomes of clinical management of voice disorders, thus improving both diagnosis and treatment. More generally, an understanding of this relationship could lead to a better understanding of the laryngeal adjustments that we use to change voice quality, adopt different speaking or singing styles, or convey personal information such as social status and emotion. Such understanding may also lead to the development of improved computer programs for synthesis of natural-sounding, speaker-specific speech of varying emotional percepts.

Understanding such a cause-effect relationship between voice physiology and production necessarily requires a multi-disciplinary effort. While voice production results from a complex fluid-structure-acoustic interaction process, which in turn depends on the geometry and material properties of the lungs, larynx, and vocal tract, the end interest of voice is its acoustics and perception. Changes in voice physiology or physics that cannot be heard are not that interesting. On the other hand, the physiology and physics may impose constraints on the co-variations among fundamental frequency (F0), vocal intensity, and voice quality, and thus on the way we use and control our voice. Thus, understanding voice production and voice control requires an integrated approach, in which physiology, vocal fold vibration, and acoustics are considered as a whole instead of as disconnected components. Traditionally, the multi-disciplinary nature of voice production has led to a clear divide between research activities in voice production, voice perception, and their clinical or speech applications, with few studies attempting to link them together. Although much advancement has been made in understanding the physics of phonation, some misconceptions still persist in textbooks in otolaryngology and speech pathology. For example, the Bernoulli effect, which has been shown to play a minor role in phonation, is still considered an important factor in initiating and sustaining phonation in many textbooks and reviews. Tension and stiffness are often used interchangeably even though they have different physical meanings. The role of the thyroarytenoid muscle in regulating medial compression of the membranous vocal folds is often understated. On the other hand, research on voice production often focuses on the glottal flow and vocal fold vibration, but could benefit from a broader consideration of the acoustics of the produced voice and their implications for voice communication.

This paper provides a review on our current understanding of the cause-effect relation between voice physiology, voice production, and voice perception, with the hope that it will help better bridge research efforts in different aspects of voice studies. An overview of vocal fold physiology is presented in Sec. II , with an emphasis on laryngeal regulation of the geometry, mechanical properties, and position of the vocal folds. The physical mechanisms of self-sustained vocal fold vibration and sound generation are discussed in Sec. III , with a focus on the roles of various physical components and features in initiating phonation and affecting the produced acoustics. Some misconceptions of the voice production physics are also clarified. Section IV discusses the physiologic control of F0, vocal intensity, and voice quality. Section V reviews past and current efforts in developing mechanical and computational models of voice production. Issues and future challenges in establishing a causal theory of voice production and perception are discussed in Sec. VI .

II. VOCAL FOLD PHYSIOLOGY AND BIOMECHANICS

A. Vocal fold anatomy and biomechanics

The human vocal system includes the lungs and the lower airway, which supply air pressure and airflow (a review of the mechanics of the subglottal system can be found in Hixon, 1987), the vocal folds, whose vibration modulates the airflow and produces the voice source, and the vocal tract, which modifies the voice source and thus creates specific output sounds. The vocal folds are located in the larynx and form a constriction in the airway [Fig. 1(a)]. Each vocal fold is about 11–15 mm long in adult women and 17–21 mm in adult men, and stretches across the larynx along the anterior-posterior direction, attaching anteriorly to the thyroid cartilage and posteriorly to the anterolateral surface of the arytenoid cartilages [Fig. 1(c)]. Both the arytenoid [Fig. 1(d)] and thyroid [Fig. 1(e)] cartilages sit on top of the cricoid cartilage and interact with it through the cricoarytenoid and cricothyroid joints, respectively. The relative movement of these cartilages thus provides a means to adjust the geometry, mechanical properties, and position of the vocal folds, as further discussed below. The three-dimensional airspace between the two opposing vocal folds is the glottis. The glottis can be divided into a membranous portion, which includes the anterior portion of the glottis and extends from the anterior commissure to the vocal process of the arytenoid, and a cartilaginous portion, which is the posterior space between the arytenoid cartilages.

FIG. 1. (Color online) (a) Coronal view of the vocal folds and the airway; (b) histological structure of the vocal fold lamina propria in the coronal plane (image provided by Dr. Jennifer Long of UCLA); (c) superior view of the vocal folds, cartilaginous framework, and laryngeal muscles; (d) medial view of the cricoarytenoid joint formed between the arytenoid and cricoid cartilages; (e) posterolateral view of the cricothyroid joint formed by the thyroid and the cricoid cartilages. The arrows in (d) and (e) indicate directions of possible motion of the arytenoid and cricoid cartilages due to LCA and CT muscle activation, respectively.

The vocal folds are layered structures, consisting of an inner muscular layer (the thyroarytenoid muscle) with muscle fibers aligned primarily along the anterior-posterior direction, a soft tissue layer of the lamina propria, and an outermost epithelium layer [Figs. 1(a) and 1(b)]. The thyroarytenoid (TA) muscle is sometimes divided into a medial and a lateral bundle, with each bundle responsible for a certain vocal fold posturing function. However, such functional division is still a topic of debate (Zemlin, 1997). The lamina propria consists of the extracellular matrix (ECM) and interstitial substances. The two primary ECM proteins are collagen and elastin fibers, which are aligned mostly along the length of the vocal folds in the anterior-posterior direction (Gray et al., 2000). Based on the density of the collagen and elastin fibers [Fig. 1(b)], the lamina propria can be divided into a superficial layer with limited and loose elastin and collagen fibers, an intermediate layer of dominantly elastin fibers, and a deep layer of mostly dense collagen fibers (Hirano and Kakita, 1985; Kutty and Webb, 2009). In comparison, the lamina propria (about 1 mm thick) is much thinner than the TA muscle.

Conceptually, the vocal fold is often simplified into a two-layer body-cover structure (Hirano, 1974; Hirano and Kakita, 1985). The body layer includes the muscular layer and the deep layer of the lamina propria, and the cover layer includes the intermediate and superficial lamina propria and the epithelium layer. This body-cover concept of vocal fold structure will be adopted in the discussions below. Another grouping scheme divides the vocal fold into three layers: in addition to a body and a cover layer, the intermediate and deep layers of the lamina propria are grouped into a vocal ligament layer (Hirano, 1975). It is hypothesized that this layered structure plays a functional role in phonation, with different combinations of mechanical properties in different layers leading to production of different voice source characteristics (Hirano, 1974). However, because of a lack of data on the mechanical properties of each vocal fold layer and how they vary under different conditions of laryngeal muscle activation, a definitive understanding of the functional role of each vocal fold layer is still missing.

The mechanical properties of the vocal folds have been quantified using various methods, including tensile tests (Hirano and Kakita, 1985; Zhang et al., 2006b; Kelleher et al., 2013a), shear rheometry (Chan and Titze, 1999; Chan and Rodriguez, 2008; Miri et al., 2012), indentation (Haji et al., 1992a, b; Tran et al., 1993; Chhetri et al., 2011), and a surface wave method (Kazemirad et al., 2014). These studies showed that the vocal folds exhibit nonlinear, anisotropic, viscoelastic behavior. A typical stress-strain curve of the vocal folds under anterior-posterior tensile loading is shown in Fig. 2. The slope of the curve, or stiffness, quantifies the extent to which the vocal folds resist deformation in response to an applied force. In general, after an initial linear range, the slope of the stress-strain curve (stiffness) increases gradually with further increase in strain (Fig. 2), presumably due to the gradual engagement of the collagen fibers. Such nonlinear mechanical behavior provides a means to regulate vocal fold stiffness and tension through vocal fold elongation or shortening, which plays an important role in the control of the F0 or pitch of voice production. Typically, the stress is higher during loading than unloading, indicating viscous behavior of the vocal folds. Due to the presence of the AP-aligned collagen, elastin, and muscle fibers, the vocal folds also exhibit anisotropic mechanical properties, being stiffer along the AP direction than in the transverse plane. Experiments (Hirano and Kakita, 1985; Alipour and Vigmostad, 2012; Miri et al., 2012; Kelleher et al., 2013a) showed that the Young's modulus along the AP direction in the cover layer is more than 10 times (up to 80 times in Kelleher et al., 2013a) that in the transverse plane.
Stiffness anisotropy has been shown to facilitate medial-lateral motion of the vocal folds ( Zhang, 2014 ) and complete glottal closure during phonation ( Xuan and Zhang, 2014 ).
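The role of this nonlinearity in stiffness regulation can be sketched numerically. The exponential stress-strain law and its coefficients below are illustrative assumptions, not fitted vocal fold data; the point is only that elongation raises both the stress (tension) and the tangent stiffness, whereas for a linear material the stiffness would stay constant.

```python
import numpy as np

# Illustrative exponential stress-strain model: sigma = a * (exp(b*strain) - 1).
# The coefficients a, b are made-up values for demonstration only.
a, b = 2.0, 8.0  # kPa, dimensionless

def stress(strain):
    return a * (np.exp(b * strain) - 1.0)

def tangent_stiffness(strain):
    # Slope of the stress-strain curve: d(sigma)/d(strain) = a*b*exp(b*strain)
    return a * b * np.exp(b * strain)

# Tangent stiffness grows with elongation, so stretching the fold
# (e.g., via CT activation) raises both tension and stiffness.
for eps in (0.0, 0.1, 0.2, 0.3):
    print(f"strain={eps:.1f}  stress={stress(eps):6.1f} kPa  "
          f"stiffness={tangent_stiffness(eps):6.1f} kPa")
```

For a linear law, the same loop would print a constant stiffness at every strain, which is why vocal fold elongation is such an effective F0 control mechanism only in the presence of the nonlinearity.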

FIG. 2. Typical tensile stress-strain curve of the vocal fold along the anterior-posterior direction during loading and unloading at 1 Hz. The slope of the tangent line (dashed lines) to the stress-strain curve quantifies the tangent stiffness. The stress is typically higher during loading than unloading due to the viscous behavior of the vocal folds. The curve was obtained by averaging data over 30 cycles after a 10-cycle preconditioning.

Accurate measurement of vocal fold mechanical properties at typical phonation conditions is challenging, due to both the small size of the vocal folds and the relatively high frequency of phonation. Although tensile tests and shear rheometry allow direct measurement of material moduli, the small sample size often leads to difficulties in mounting tissue samples to the testing equipment, raising concerns about accuracy. These two methods also require dissecting tissue samples from the vocal folds and the laryngeal framework, making in vivo measurement impossible. The indentation method is ideal for in vivo measurement and, because of the small size of the indenters used, allows characterization of the spatial variation of mechanical properties of the vocal folds. However, it is limited to measurements at small deformations. Although large indentation depths can be used, data interpretation becomes difficult, and the method is thus not well suited to assessing the nonlinear mechanical properties of the vocal folds.

There has been some recent work toward understanding the contribution of individual ECM components to the macro-mechanical properties of the vocal folds and developing a structurally based constitutive model of the vocal folds (e.g., Chan et al. , 2001 ; Kelleher et al. , 2013b ; Miri et al. , 2013 ). The contribution of interstitial fluid to the viscoelastic properties of the vocal folds and vocal fold stress during vocal fold vibration and collision has also been investigated using a biphasic model of the vocal folds in which the vocal fold was modeled as a solid phase interacting with an interstitial fluid phase ( Zhang et al. , 2008 ; Tao et al. , 2009 , Tao et al. , 2010 ; Bhattacharya and Siegmund, 2013 ). This structurally based approach has the potential to predict vocal fold mechanical properties from the distribution of collagen and elastin fibers and interstitial fluids, which may provide new insights toward the differential mechanical properties between different vocal fold layers at different physiologic conditions.

B. Vocal fold posturing

Voice communication requires fine control and adjustment of pitch, loudness, and voice quality. Physiologically, such adjustments are made through laryngeal muscle activation, which stiffens, deforms, or repositions the vocal folds, thus controlling the geometry and mechanical properties of the vocal folds and glottal configuration.

One important posturing maneuver is adduction/abduction of the vocal folds, which is achieved primarily through motion of the arytenoid cartilages. Anatomical analysis and numerical simulations have shown that the cricoarytenoid joint allows the arytenoid cartilages to slide along and rotate about the long axis of the cricoid cartilage, but constrains arytenoid rotation about the short axis of the cricoid cartilage (Selbie et al., 1998; Hunter et al., 2004; Yin and Zhang, 2014). Activation of the lateral cricoarytenoid (LCA) muscles, which attach anteriorly to the cricoid cartilage and posteriorly to the arytenoid cartilages, induces mainly an inward rotation of the arytenoid about the cricoid cartilage in the coronal plane, moving the posterior portion of the vocal folds toward the glottal midline. Activation of the interarytenoid (IA) muscles, which connect the posterior surfaces of the two arytenoids, slides and approximates the arytenoid cartilages [Fig. 1(c)], thus closing the cartilaginous glottis. Because both muscles act on the posterior portion of the vocal folds, combined action of the two muscles is able to completely close the posterior portion of the glottis, but is less effective in closing the mid-membranous glottis (Fig. 3; Choi et al., 1993; Chhetri et al., 2012; Yin and Zhang, 2014). Because of this inefficiency in mid-membranous approximation, LCA/IA muscle activation is unable to produce medial compression between the two vocal folds in the membranous portion, contrary to common understanding (Klatt and Klatt, 1990; Hixon et al., 2008). Complete closure and medial compression of the mid-membranous glottis require activation of the TA muscle (Choi et al., 1993; Chhetri et al., 2012). The TA muscle forms the bulk of the vocal folds and stretches from the thyroid prominence to the anterolateral surface of the arytenoid cartilages (Fig. 1).
Activation of the TA muscle produces a whole-body rotation of the vocal folds in the horizontal plane, about the point of their anterior attachment to the thyroid cartilage, toward the glottal midline (Yin and Zhang, 2014). This rotational motion is able to completely close the membranous glottis but often leaves a gap posteriorly (Fig. 3). Complete closure of both the membranous and cartilaginous glottis thus requires combined activation of the LCA/IA and TA muscles. The posterior cricoarytenoid (PCA) muscles are primarily responsible for opening the glottis but may also play a role in the production of very high pitches, as discussed below.

FIG. 3. Activation of the LCA/IA muscles completely closes the posterior glottis but leaves a small gap in the membranous glottis, whereas TA activation completely closes the anterior glottis but leaves a gap at the posterior glottis. From unpublished stroboscopic recordings from the in vivo canine larynx experiments in Choi et al. (1993).

Vocal fold tension is regulated by elongating or shortening the vocal folds. Because of the nonlinear material properties of the vocal folds, changing vocal fold length also changes vocal fold stiffness, which would otherwise stay constant for a linear material. The two laryngeal muscles involved in regulating vocal fold length are the cricothyroid (CT) muscle and the TA muscle. The CT muscle consists of two bundles. The vertically oriented bundle, the pars recta, connects the anterior surface of the cricoid cartilage and the lower border of the thyroid lamina. Its contraction approximates the thyroid and cricoid cartilages anteriorly through a rotation about the cricothyroid joint. The other bundle, the pars oblique, is oriented upward and backward, connecting the anterior surface of the cricoid cartilage to the inferior cornu of the thyroid cartilage. Its contraction displaces the cricoid and arytenoid cartilages backward (Stone and Nuttall, 1974), although the thyroid cartilage may also move forward slightly. Contraction of both bundles thus elongates the vocal folds and increases the stiffness and tension in both the body and cover layers of the vocal folds. In contrast, activation of the TA muscle, which forms the body layer of the vocal folds, increases the stiffness and tension in the body layer. Activation of the TA muscle, in addition to its initial effect of mid-membranous vocal fold approximation, also shortens the vocal folds, which decreases both the stiffness and tension in the cover layer (Hirano and Kakita, 1985; Yin and Zhang, 2013). One exception is when the tension in the vocal fold cover is already negative (i.e., the cover is under compression), in which case further shortening the vocal folds through TA activation decreases tension further (i.e., increases the compressive force) but may increase stiffness in the cover layer.
Activation of the LCA/IA muscles generally does not change the vocal fold length much and thus has only a slight effect on vocal fold stiffness and tension ( Chhetri et al. , 2009 ; Yin and Zhang, 2014 ). However, activation of the LCA/IA muscles (and also the PCA muscles) does stabilize the arytenoid cartilage and prevent it from moving forward when the cricoid cartilage is pulled backward due to the effect of CT muscle activation, thus facilitating extreme vocal fold elongation, particularly for high-pitch voice production. As noted above, due to the lack of reliable measurement methods, our understanding of how vocal fold stiffness and tension vary at different muscular activation conditions is limited.

Activation of the CT and TA muscles also changes the medial surface shape of the vocal folds and the glottal channel geometry. Specifically, TA muscle activation causes the inferior part of the medial surface to bulge out toward the glottal midline (Hirano and Kakita, 1985; Hirano, 1988; Vahabzadeh-Hagh et al., 2016), thus increasing the vertical thickness of the medial surface. In contrast, CT activation reduces this vertical thickness of the medial surface. Although many studies have investigated the effect of the prephonatory glottal shape (convergent, straight, or divergent) on phonation (Titze, 1988a; Titze et al., 1995), a recent study showed that the glottal channel geometry remains largely straight under most conditions of laryngeal muscle activation (Vahabzadeh-Hagh et al., 2016).

III. PHYSICS OF VOICE PRODUCTION

A. Sound sources of voice production

The phonation process starts with adduction of the vocal folds, which approximates the vocal folds to reduce or close the glottis. Contraction of the lungs initiates airflow and builds up air pressure below the glottis. When the subglottal pressure exceeds a certain threshold pressure, the vocal folds are excited into self-sustained vibration. Vocal fold vibration in turn modulates the glottal airflow into a pulsating jet, which eventually develops into turbulent flow in the vocal tract.

In general, three major sound production mechanisms are involved in this process (McGowan, 1988; Hofmans, 1998; Zhao et al., 2002; Zhang et al., 2002a): a monopole sound source due to the volume of air displaced by vocal fold vibration, a dipole sound source due to the fluctuating force applied by the vocal folds to the airflow, and a quadrupole sound source due to turbulence developed immediately downstream of the glottal exit. When the false vocal folds are tightly adducted, an additional dipole source may arise as the glottal jet impinges on the false vocal folds (Zhang et al., 2002b). The monopole sound source is generally small, considering that the vocal folds are nearly incompressible and thus the net volume flow displacement is small. The dipole source is generally considered the dominant sound source and is responsible for the harmonic component of the produced sound. The quadrupole sound source is generally much weaker than the dipole source in magnitude, but it is responsible for broadband sound production at high frequencies.

For the harmonic component of the voice source, an equivalent monopole sound source can be defined at a plane just downstream of the region of major sound sources, with a source strength equal to the instantaneous pulsating glottal volume flow rate. In the source-filter theory of phonation (Fant, 1970), this monopole sound source is the input signal to the vocal tract, which acts as a filter and shapes the sound source spectrum into different sounds before they are radiated from the mouth as the voice we hear. Because of the radiation characteristics at the mouth, the radiated sound pressure is proportional to the time derivative of the glottal flow. Thus, in the voice literature, the time derivative of the glottal flow, instead of the glottal flow itself, is considered the voice source.

The phonation cycle is often divided into an open phase, in which the glottis opens (the opening phase) and closes (the closing phase), and a closed phase, in which the glottis is closed or, when glottal closure is incomplete, maintains a minimum opening area. The glottal flow increases and decreases during the open phase, and remains zero during the closed phase (or at a minimum for incomplete glottal closure) (Fig. 4). Compared to the glottal area waveform, the glottal flow waveform reaches its peak at a later time in the cycle, so that the glottal flow waveform is skewed to the right. This rightward skewing of the glottal flow waveform is due to the acoustic mass in the glottis and the vocal tract (when the F0 is lower than a nearby vocal tract resonance frequency), which causes a delay in the increase in the glottal flow during the opening phase, and a faster decay in the glottal flow during the closing phase (Rothenberg, 1981; Fant, 1982). Because of this rightward skewing, the negative peak of the time derivative of the glottal flow in the closing phase is often much more dominant than the positive peak in the opening phase. The instant of the most negative peak is thus considered the point of main excitation of the vocal tract, and the corresponding negative peak, also referred to as the maximum flow declination rate (MFDR), is a major determinant of the peak amplitude of the produced voice. After the negative peak, the time derivative of the glottal flow waveform returns to zero as phonation enters the closed phase.
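As a sketch of these waveform features, the following computes the MFDR for a hypothetical skewed glottal flow pulse; the half-cosine rise/fall shape, sampling rate, open quotient, and skew value are arbitrary illustration choices, not measured data.

```python
import numpy as np

fs = 44100            # sampling rate (Hz)
f0 = 120.0            # fundamental frequency (Hz)
T = 1.0 / f0
t = np.arange(0, T, 1.0 / fs)

# Hypothetical glottal flow pulse: open for To = 0.6*T, skewed to the right
# so the closing phase is shorter than the opening phase.
To, skew = 0.6 * T, 0.7          # open phase duration; peak at skew*To
flow = np.zeros_like(t)
open_idx = t < To
tp = skew * To                   # instant of peak flow within the open phase
rise = t[open_idx] <= tp
flow[open_idx] = np.where(
    rise,
    0.5 * (1 - np.cos(np.pi * t[open_idx] / tp)),                # opening
    0.5 * (1 + np.cos(np.pi * (t[open_idx] - tp) / (To - tp))),  # closing
)

dflow = np.gradient(flow, 1.0 / fs)   # time derivative of the glottal flow
mfdr = dflow.min()                    # maximum flow declination rate (negative peak)
print(f"open quotient = {To/T:.2f}, MFDR = {mfdr:.1f} (arb. units/s)")
```

Because the closing phase is shorter than the opening phase, the negative peak of the flow derivative exceeds the positive peak in magnitude and falls after the flow maximum, consistent with the point of main vocal tract excitation discussed above.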

FIG. 4. (Color online) Typical glottal flow waveform and its time derivative (left) and their correspondence to the spectral slopes of the low-frequency and high-frequency portions of the voice source spectrum (right).

Much work has been done to directly link features of the glottal flow waveform to voice acoustics and potentially voice quality (e.g., Fant, 1979, 1982; Fant et al., 1985; Gobl and Chasaide, 2010). These studies showed that the low-frequency spectral shape (the first few harmonics) of the voice source is primarily determined by the relative duration of the open phase with respect to the oscillation period (To/T in Fig. 4, also referred to as the open quotient). A longer open phase often leads to a more dominant first harmonic (H1) in the low-frequency portion of the resulting voice source spectrum. For a given oscillation period, shortening the open phase causes most of the glottal flow change to occur within a duration (To) that is increasingly shorter than the period T. This leads to an energy boost in the low-frequency portion of the source spectrum that peaks around a frequency of 1/To. For a glottal flow waveform with a very short open phase, the second harmonic (H2) or even the fourth harmonic (H4) may become the most dominant harmonic. A voice source with a weak H1 relative to H2 or H4 is often associated with a pressed voice quality.
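The effect of the open quotient on the low-frequency source spectrum can be checked numerically. The raised-cosine flow pulse below is a hypothetical waveform chosen only for illustration; its time derivative is analyzed with an FFT over one period to compare H1 and H2 levels at a large and a small open quotient.

```python
import numpy as np

f0, fs = 100.0, 100000.0
T = 1.0 / f0
N = int(fs * T)                 # samples in one period (fs/f0 = 1000)
t = np.arange(N) / fs

def harmonic_levels(oq):
    """dB levels of H1 and H2 for a hypothetical raised-cosine flow pulse
    whose open phase lasts To = oq * T."""
    To = oq * T
    # Time derivative of the pulse 0.5*(1 - cos(2*pi*t/To)) on [0, To):
    dflow = np.where(t < To, np.sin(2 * np.pi * t / To) / To, 0.0)
    spec = np.abs(np.fft.rfft(dflow)) / N
    h1, h2 = spec[1], spec[2]   # bins at f0 and 2*f0
    return 20 * np.log10(h1), 20 * np.log10(h2)

for oq in (0.9, 0.5):
    h1, h2 = harmonic_levels(oq)
    print(f"OQ={oq}: H1-H2 = {h1 - h2:5.1f} dB")
```

With the long open phase (OQ = 0.9) H1 dominates, while with the short open phase (OQ = 0.5) the energy boost near 1/To = 2·f0 makes H2 stronger than H1, in line with the association of a weak H1 with pressed voice.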

The spectral slope in the high-frequency range is primarily related to the degree of discontinuity in the time derivative of the glottal flow waveform. Due to the waveform skewing discussed earlier, the most dominant source of discontinuity often occurs around the instant of main excitation, when the time derivative of the glottal flow waveform returns from the negative peak to zero within a time scale of Ta (Fig. 4). For an abrupt glottal flow cutoff (Ta = 0), the time derivative of the glottal flow waveform has a strong discontinuity at the point of main excitation, which causes the voice source spectrum to decay asymptotically at a roll-off rate of −6 dB per octave toward high frequencies. Increasing Ta from zero leads to a gradual return from the negative peak to zero. When approximated by an exponential function, this gradual return acts as a low-pass filter, with a cutoff frequency around 1/Ta, and reduces the excitation of harmonics above the cutoff frequency 1/Ta. Thus, in the frequency range relevant to voice perception, increasing Ta often leads to reduced higher-order harmonic excitation. In the extreme case, when there is minimal vocal fold contact, the time derivative of the glottal flow waveform is so smooth that the voice source spectrum has only a few lower-order harmonics. Perceptually, strong excitation of higher-order harmonics is often associated with a bright sound quality, whereas a voice source with limited excitation of higher-order harmonics is often perceived as weak.
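A minimal numerical sketch of these two spectral-slope effects. It assumes an idealized 1/n harmonic spectrum for an abrupt cutoff and approximates the gradual return phase by a one-pole low-pass filter, whose corner frequency 1/(2πTa) stands in for the 1/Ta cutoff mentioned above; the Ta value is an arbitrary illustration choice.

```python
import numpy as np

f0 = 100.0                      # fundamental (Hz)
harmonics = np.arange(1, 65)    # harmonic numbers
freqs = harmonics * f0

# A waveform with a step discontinuity (abrupt flow cutoff, Ta = 0) has
# harmonic amplitudes falling as 1/n, i.e., -6 dB per octave.
spec_abrupt = 1.0 / harmonics

# A gradual exponential return with time constant Ta acts like a one-pole
# low-pass filter; here its corner frequency approximates the 1/Ta cutoff.
Ta = 0.5e-3                     # 0.5 ms return phase (assumed)
fc = 1.0 / (2 * np.pi * Ta)
spec_gradual = spec_abrupt / np.sqrt(1 + (freqs / fc) ** 2)

db = lambda x: 20 * np.log10(x)
# Slope per octave of the abrupt-cutoff source between harmonics 8 and 16:
slope = db(spec_abrupt[15]) - db(spec_abrupt[7])
print(f"abrupt-cutoff slope: {slope:.1f} dB/octave")
print(f"extra attenuation at harmonic 40: "
      f"{db(spec_gradual[39]) - db(spec_abrupt[39]):.1f} dB")
```

The abrupt-cutoff spectrum rolls off at −6 dB per octave at all frequencies, while increasing Ta mainly attenuates harmonics above the cutoff and leaves the lowest harmonics nearly untouched.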

Also of perceptual importance is the turbulence noise produced immediately downstream of the glottis. Although small in amplitude, the noise component plays an important role in voice quality perception, particularly for female voices, in which aspiration noise is more persistent than in male voices. While the noise component of the voice is often modeled as white noise, its spectrum is often not flat and may exhibit different spectral shapes, depending on the glottal opening and flow rate as well as the vocal tract shape. Interaction between the spectral shape and the relative levels of harmonic and noise energy in the voice source has been shown to influence the perception of voice quality (Kreiman and Gerratt, 2012).

It is worth noting that many of the source parameters are not independent of each other and often co-vary. How they co-vary under different voicing conditions, which is essential to natural speech synthesis, remains the focus of many studies (e.g., Sundberg and Hogset, 2001; Gobl and Chasaide, 2003; Patel et al., 2011).

B. Mechanisms of self-sustained vocal fold vibration

That vocal fold vibration results from a complex airflow-vocal fold interaction within the glottis rather than repetitive nerve stimulation of the larynx was first recognized by van den Berg (1958) . According to his myoelastic-aerodynamic theory of voice production, phonation starts from complete adduction of the vocal folds to close the glottis, which allows a buildup of the subglottal pressure. The vocal folds remain closed until the subglottal pressure is sufficiently high to push them apart, allowing air to escape and producing a negative (with respect to atmospheric pressure) intraglottal pressure due to the Bernoulli effect. This negative Bernoulli pressure and the elastic recoil pull the vocal folds back and close the glottis. The cycle then repeats, which leads to sustained vibration of the vocal folds.

While the myoelastic-aerodynamic theory correctly identifies the interaction between the vocal folds and airflow as the underlying mechanism of self-sustained vocal fold vibration, it does not explain how energy is transferred from airflow into the vocal folds to sustain this vibration. Traditionally, the negative intraglottal pressure is considered to play an important role in closing the glottis and sustaining vocal fold vibration. However, it is now understood that a negative intraglottal pressure is not a critical requirement for achieving self-sustained vocal fold vibration. Similarly, an alternatingly convergent-divergent glottal channel geometry during phonation has been considered a necessary condition that leads to net energy transfer from airflow into the vocal folds. We will show below that an alternatingly convergent-divergent glottal channel geometry does not always guarantee energy transfer or self-sustained vocal fold vibration.

For flow conditions typical of human phonation, the glottal flow can be reasonably described by Bernoulli's equation up to the point where the airflow separates from the glottal wall, often at the glottal exit where the airway suddenly expands. According to Bernoulli's equation, the flow pressure p at a location within the glottal channel with a time-varying cross-sectional area A is

p = P_sub − (P_sub − P_sup) (A_sep / A)²,  (1)

where P_sub and P_sup are the subglottal and supraglottal pressure, respectively, and A_sep is the time-varying glottal area at the flow separation location. For simplicity, we assume that the flow separates at the upper margin of the medial surface. To achieve net energy transfer from airflow to the vocal folds over one cycle, the air pressure on the vocal fold surface has to be at least partially in phase with the vocal fold velocity. Specifically, the intraglottal pressure needs to be higher in the opening phase than in the closing phase of vocal fold vibration, so that the airflow does more work on the vocal folds during opening than the vocal folds do back on the airflow during closing.
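A quick numerical check of this Bernoulli relation; the pressures and areas below are round illustrative numbers, not measurements. At the separation point the pressure recovers the supraglottal value, and a smaller area ratio A_sep/A (a more convergent channel at that location) gives a higher local pressure.

```python
# Bernoulli-based intraglottal pressure upstream of the flow separation point:
#   p = P_sub - (P_sub - P_sup) * (A_sep / A)**2
# Pressures and areas below are illustrative round numbers, not measurements.
P_sub, P_sup = 800.0, 0.0          # subglottal / supraglottal pressure (Pa)
A_sep = 0.05e-4                    # glottal area at the separation point (m^2)

def intraglottal_pressure(A):
    """Flow pressure at a glottal location with cross-sectional area A."""
    return P_sub - (P_sub - P_sup) * (A_sep / A) ** 2

# At the separation point the pressure equals the supraglottal pressure;
# where the channel is wider (smaller A_sep/A), the pressure is closer
# to the subglottal pressure.
print(intraglottal_pressure(A_sep))        # -> P_sup
print(intraglottal_pressure(2.0 * A_sep))  # closer to P_sub
```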

Theoretical analysis of the energy transfer between airflow and the vocal folds (Ishizaka and Matsudaira, 1972; Titze, 1988a) showed that this pressure asymmetry can be achieved by a vertical phase difference in vocal fold surface motion (also referred to as a mucosal wave), i.e., different portions of the vocal fold surface do not necessarily move inward and outward together as a whole. This mechanism is illustrated in Fig. 5, the upper left of which shows the vocal fold surface shape in the coronal plane at six consecutive, equally spaced instants during one vibration cycle in the presence of a vertical phase difference. Instants 2 and 3 (solid lines) are in the closing phase, whereas instants 5 and 6 (dashed lines) are in the opening phase. Consider, for example, energy transfer at the lower margin of the medial surface. Because of the vertical phase difference, the glottal channel has a different shape in the opening phase (dashed lines 5 and 6) from that in the closing phase (solid lines 3 and 2) when the lower margin of the medial surface crosses the same locations. In particular, when the lower margin of the medial surface leads the upper margin in phase, the glottal channel during opening (e.g., instant 6) is always more convergent [thus a smaller A_sep/A in Eq. (1)] or less divergent than during closing (e.g., instant 2) for the same location of the lower margin, resulting in an air pressure [Eq. (1)] that is higher in the opening phase than in the closing phase (Fig. 5, top row). As a result, net energy is transferred from the airflow into the vocal folds over one cycle, as indicated by the non-zero area enclosed by the aerodynamic force-vocal fold displacement curve in Fig. 5 (top right). The existence of a vertical phase difference in vocal fold surface motion is generally considered the primary mechanism of phonation onset.
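The requirement that the pressure be partially in phase with vocal fold velocity can be illustrated with a lumped single-point sketch of the force-displacement loop; the amplitudes, frequency, and sinusoidal forms are arbitrary illustration choices, not a vocal fold model.

```python
import numpy as np

# Net work done by the intraglottal pressure on the fold over one cycle:
#   W = closed-loop integral of p dx = integral of p(t) * dx/dt dt
# Toy single-point model: sinusoidal displacement x(t) and a pressure whose
# phase can lead the displacement; all amplitudes are illustrative.
T = 1.0 / 120.0                       # one cycle at 120 Hz
t = np.linspace(0.0, T, 20001)
omega = 2 * np.pi / T
X, dp = 1e-3, 200.0                   # vibration amplitude (m), pressure swing (Pa)

def net_work(phase):
    x = X * np.sin(omega * t)
    v = X * omega * np.cos(omega * t)
    p = dp * np.sin(omega * t + phase)
    return np.trapz(p * v, t)         # J per unit surface area

W_symmetric = net_work(0.0)           # pressure symmetric in opening/closing
W_leading = net_work(np.pi / 2)       # pressure higher during opening
print(W_symmetric, W_leading)
```

With zero phase difference the pressure is symmetric between opening and closing, the force-displacement loop encloses no area, and the net work vanishes; a pressure that leads the displacement by 90° yields positive work per cycle (analytically π·X·Δp for this toy model).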

FIG. 5. Two energy transfer mechanisms. Top row: the presence of a vertical phase difference leads to different medial surface shapes between glottal opening (dashed lines 5 and 6; upper left panel) and closing (solid lines 2 and 3) when the lower margin of the medial surface crosses the same locations, which leads to higher air pressure during glottal opening than closing and net energy transfer from airflow into the vocal folds at the lower margin of the medial surface. Middle row: without a vertical phase difference, vocal fold vibration produces an alternatingly convergent-divergent but identical glottal channel geometry between glottal opening and closing (bottom left panel), and thus zero net energy transfer. Bottom row: without a vertical phase difference, air pressure asymmetry can be imposed by a negative damping mechanism.

In contrast, without a vertical phase difference, the vocal fold surface during opening (Fig. 5, bottom left; dashed lines 5 and 6) and closing (solid lines 3 and 2) would be identical when the lower margin crosses the same positions, for which Bernoulli's equation would predict symmetric flow pressure between the opening and closing phases, and zero net energy transfer over one cycle (Fig. 5, middle row). Under this condition, the pressure asymmetry between the opening and closing phases has to be provided by an external mechanism that directly imposes a phase difference between the intraglottal pressure and vocal fold movement. In the presence of such an external mechanism, the intraglottal pressure is no longer the same between opening and closing even when the glottal channel has the same shape as the vocal fold crosses the same locations, resulting in a net energy transfer over one cycle from airflow to the vocal folds (Fig. 5, bottom row). This energy transfer mechanism is often referred to as negative damping, because the intraglottal pressure depends on vocal fold velocity and appears in the system equations of vocal fold motion in a form similar to a damping force, except that energy is transferred to the vocal folds instead of being dissipated. Negative damping is the only energy transfer mechanism in a single degree-of-freedom system or when the entire medial surface moves in phase as a whole.
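A minimal sketch of the negative-damping idea, using a single degree-of-freedom oscillator in which an aerodynamic term proportional to velocity opposes the structural damping; all parameter values are arbitrary illustration choices, not vocal fold measurements.

```python
# Single degree-of-freedom oscillator: m*x'' + (c - c_neg)*x' + k*x = 0.
# When the velocity-dependent "negative damping" gain c_neg exceeds the
# structural damping c, the effective damping becomes negative and a small
# perturbation grows into oscillation (phonation onset). All parameter
# values are illustrative only.
m, k, c = 1e-4, 40.0, 0.02        # mass (kg), stiffness (N/m), damping (N*s/m)

def growth_ratio(c_neg, dt=1e-5, n=20000):
    """Late-to-early amplitude ratio; > 1 means the oscillation grows."""
    x, v = 1e-6, 0.0               # small initial displacement
    amps = []
    for _ in range(n):             # semi-implicit Euler integration
        a = -((c - c_neg) * v + k * x) / m
        v += a * dt
        x += v * dt
        amps.append(abs(x))
    return max(amps[-n // 4:]) / max(amps[:n // 4])

print("below onset:", growth_ratio(c_neg=0.01))   # effective damping > 0: decays
print("above onset:", growth_ratio(c_neg=0.03))   # effective damping < 0: grows
```

The oscillation frequency in both cases stays near the oscillator's own natural frequency, which parallels the observation below that phonation frequency tracks a vocal fold resonance rather than an acoustic resonance.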

In humans, negative damping can be provided by an inertive vocal tract (Flanagan and Landgraf, 1968; Ishizaka and Matsudaira, 1972; Ishizaka and Flanagan, 1972) or a compliant subglottal system (Zhang et al., 2006a). Because the negative damping associated with acoustic loading is significant only for frequencies close to an acoustic resonance, phonation sustained by such negative damping alone always occurs at a frequency close to that acoustic resonance (Flanagan and Landgraf, 1968; Zhang et al., 2006a). Although there is no direct evidence of phonation sustained predominantly by acoustic loading in humans, instabilities in voice production (or voice breaks) have been reported when the fundamental frequency of vocal fold vibration approaches one of the vocal tract resonances (e.g., Titze et al., 2008). On the other hand, this entrainment of the phonation frequency to an acoustic resonance limits the degree of independent control of the voice source and the spectral modification by the vocal tract, and is less desirable for effective speech communication. Considering that humans are capable of producing a large variety of voice types independent of vocal tract shape, negative damping due to acoustic coupling to the sub- or supra-glottal acoustics is unlikely to be the primary mechanism of energy transfer in voice production. Indeed, excised larynges are able to vibrate without a vocal tract. Moreover, experiments have shown that in humans the vocal folds vibrate at a frequency close to an in vacuo vocal fold resonance (Kaneko et al., 1986; Ishizaka, 1988; Svec et al., 2000) instead of the acoustic resonances of the sub- and supra-glottal tracts, suggesting that phonation is essentially a resonance phenomenon of the vocal folds.

A negative damping can also be provided by glottal aerodynamics. For example, glottal flow acceleration and deceleration may cause the flow to separate at different locations between opening and closing even when the glottis has identical geometry. This is particularly the case for a divergent glottal channel geometry, which often results in asymmetric flow separation and pressure asymmetry between the glottal opening and closing phases ( Park and Mongeau, 2007 ; Alipour and Scherer, 2004 ). The effect of this negative damping mechanism is expected to be small at phonation onset, at which the vocal fold vibration amplitude, and thus flow unsteadiness, is small and the glottal channel is less likely to be divergent. However, its contribution to energy transfer may increase with increasing vocal fold vibration amplitude and flow unsteadiness ( Howe and McGowan, 2010 ). It is important to differentiate this asymmetric flow separation between glottal opening and closing due to unsteady flow effects from a quasi-steady asymmetric flow separation caused by asymmetry in the glottal channel geometry between opening and closing. In the latter case, because flow separation may occur at a more upstream location for a divergent glottal channel than for a convergent one, an asymmetric glottal channel geometry (e.g., a glottis opening convergent and closing divergent) may lead to asymmetric flow separation between glottal opening and closing. Compared to conditions of a fixed flow separation (i.e., flow separating at the same location during the entire cycle, as in Fig. 5), such geometry-induced asymmetric flow separation actually reduces the pressure asymmetry between glottal opening and closing [this can be shown using Eq. (1) ] and thus weakens net energy transfer.
In reality, these two types of asymmetric flow separation mechanisms (due to unsteady effects or changes in glottal channel geometry) interact and can result in very complex flow separation patterns ( Alipour and Scherer, 2004 ; Sciamarella and Le Quere, 2008 ; Sidlof et al. , 2011 ), which may or may not enhance energy transfer.

From the discussion above it is clear that a negative Bernoulli pressure is not a critical requirement in either one of the two mechanisms. Being proportional to vocal fold displacement, the negative Bernoulli pressure is not a negative damping and does not directly provide the required pressure asymmetry between glottal opening and closing. On the other hand, the existence of a vertical phase difference in vocal fold vibration is determined primarily by vocal fold properties (as discussed below), rather than whether the intraglottal pressure is positive or negative during a certain phase of the oscillation cycle.

Although a vertical phase difference in vocal fold vibration leads to a time-varying glottal channel geometry, an alternatingly convergent-divergent glottal channel geometry does not guarantee self-sustained vocal fold vibration. For example, although the in-phase vocal fold motion in the bottom left of Fig. 5 (the entire medial surface moves in and out together) leads to an alternatingly convergent-divergent glottal geometry, the glottal geometry is identical between glottal opening and closing, and thus this motion is unable to produce net energy transfer into the vocal folds without a negative damping mechanism (Fig. 5, middle row). In other words, an alternatingly convergent-divergent glottal geometry is an effect, not a cause, of self-sustained vocal fold vibration. Theoretically, the glottis can maintain a convergent or divergent shape during the entire oscillation cycle and yet still self-oscillate, as observed in experiments using physical vocal fold models that had a divergent shape during most of the oscillation cycle ( Zhang et al. , 2006a ).

C. Eigenmode synchronization and nonlinear dynamics

The above shows that net energy transfer from airflow into the vocal folds is possible in the presence of a vertical phase difference. But how is this vertical phase difference established, and what determines it and the vocal fold vibration pattern? In voice production, vocal fold vibration with a vertical phase difference results from a process of eigenmode synchronization, in which two or more in vacuo eigenmodes of the vocal folds are synchronized to vibrate at the same frequency but with a phase difference ( Ishizaka and Matsudaira, 1972 ; Ishizaka, 1981 ; Horacek and Svec, 2002 ; Zhang et al. , 2007 ), in the same way as a travelling wave is formed by the superposition of two standing waves. An eigenmode, or resonance, is a pattern of motion of the system that is allowed by physical laws and the boundary constraints of the system. In general, for each mode, the vibration pattern is such that all parts of the system move either in phase or 180° out of phase, similar to a standing wave. Each eigenmode has an inherently distinct eigenfrequency (or resonance frequency) at which the eigenmode can be maximally excited. A familiar example of eigenmodes in speech science is formants, which are peaks in the output voice spectra due to excitation of the acoustic resonances of the vocal tract, with the formant frequencies dependent on vocal tract geometry. Figure 6 shows three typical eigenmodes of the vocal fold in the coronal plane. In Fig. 6, the thin line indicates the resting vocal fold surface shape, whereas the solid and dashed lines indicate the extreme positions of the vocal fold when vibrating at the corresponding eigenmode, spaced 180° apart in a vibratory cycle. The first eigenmode shows an up-and-down motion in the vertical direction, which does not modulate glottal airflow much. The second eigenmode has a dominantly in-phase medial-lateral motion along the medial surface, which does modulate airflow.
The third eigenmode also exhibits dominantly medial-lateral motion, but the upper portion of the medial surface vibrates 180° out of phase with the lower portion of the medial surface. Such out-of-phase motion as in the third eigenmode is essential to achieving vocal fold vibration with a large vertical phase difference, e.g., when synchronized with an in-phase eigenmode as in Fig. 6(b) .
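The standing-wave analogy in the text can be verified directly: two standing waves a quarter-period apart in both space and time superpose into a travelling wave, just as two synchronized eigenmodes with a phase difference produce a wave-like motion along the medial surface. A minimal check of the identity:

```python
import math

# sin(kx)*cos(wt) + cos(kx)*sin(wt) = sin(kx + wt): two standing waves,
# 90 degrees apart in space and time, sum to a travelling wave.
def standing_sum(x, t, k=1.0, w=1.0):
    return math.sin(k * x) * math.cos(w * t) + math.cos(k * x) * math.sin(w * t)

def travelling(x, t, k=1.0, w=1.0):
    return math.sin(k * x + w * t)

# Check the identity on a grid covering one spatial and temporal period.
err = max(abs(standing_sum(0.1 * i, 0.1 * j) - travelling(0.1 * i, 0.1 * j))
          for i in range(63) for j in range(63))
print(err < 1e-12)  # True: the identity holds to machine precision
```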


Typical vocal fold eigenmodes exhibiting (a) a dominantly superior-inferior motion, (b) a medial-lateral in-phase motion, and (c) a medial-lateral out-of-phase motion along the medial surface.

In the absence of airflow, the vocal fold in vacuo eigenmodes are generally neutral or damped, meaning that when excited they gradually decay in amplitude with time. When the vocal folds are subject to airflow, however, the vocal fold-airflow coupling modifies the eigenmodes and, under some conditions, synchronizes two eigenmodes to the same frequency (Fig. 7). Although vibration in each eigenmode by itself does not produce net energy transfer (Fig. 5, middle row), when two modes are synchronized at the same frequency but with a phase difference in time, the vibration velocity associated with one eigenmode [e.g., the eigenmode in Fig. 6(b) ] will be at least partially in phase with the pressure induced by the other eigenmode [e.g., the eigenmode in Fig. 6(c) ], and this cross-mode pressure-velocity interaction will produce net energy transfer into the vocal folds ( Ishizaka and Matsudaira, 1972 ; Zhang et al. , 2007 ).


A typical eigenmode synchronization pattern. The evolution of the first three eigenmodes is shown as a function of the subglottal pressure. As the subglottal pressure increases, the frequencies (top) of the second and third vocal fold eigenmodes gradually approach each other and, at a threshold subglottal pressure, synchronize to the same frequency. At the same time, the growth rate (bottom) of the second mode becomes positive, indicating the coupled airflow-vocal fold system becomes linearly unstable and phonation starts.

The minimum subglottal pressure required to synchronize two eigenmodes and initiate net energy transfer, or the phonation threshold pressure, is proportional to the frequency spacing between the two eigenmodes being synchronized and inversely proportional to the coupling strength between them ( Zhang, 2010 ):

P_th ∝ (ω_0,2 − ω_0,1) / β,  (2)

where ω_0,1 and ω_0,2 are the eigenfrequencies of the two in vacuo eigenmodes participating in the synchronization process and β is the coupling strength between the two eigenmodes. Thus, the closer the two eigenmodes are to each other in frequency, or the more strongly they are coupled, the less pressure is required to synchronize them. This is particularly relevant for an anisotropic material such as the vocal folds, in which the AP stiffness is much larger than the stiffness in the transverse plane. Under such anisotropic stiffness conditions, the first few in vacuo vocal fold eigenfrequencies tend to cluster together and are much closer to each other than under isotropic stiffness conditions ( Titze and Strong, 1975 ; Berry, 2001 ). Such clustering of eigenmodes makes it possible to initiate vocal fold vibration at very low subglottal pressures.

The coupling strength β between the two eigenmodes in Eq. (2) depends on the prephonatory glottal opening, with the coupling strength increasing with decreasing glottal opening (thus lowered phonation threshold pressure). In addition, the coupling strength also depends on the spatial similarity between the air pressure distribution over the vocal fold surface induced by one eigenmode and vocal fold surface velocity of the other eigenmode ( Zhang, 2010 ). In other words, the coupling strength β quantifies the cross-mode energy transfer efficiency between the eigenmodes that are being synchronized. The higher the degree of cross-mode pressure-velocity similarity, the better the two eigenmodes are coupled, and the less subglottal pressure is required to synchronize them.
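The two dependencies of Eq. (2) can be sketched numerically. The proportionality form and the constant below are illustrative assumptions, not the published expression; the sketch only shows that the threshold falls as the eigenfrequencies approach each other or as the coupling strengthens:

```python
import math

def phonation_threshold(omega1, omega2, beta, c=1.0):
    """Illustrative reading of the threshold relation: pressure grows with
    eigenfrequency spacing and falls with coupling strength beta.
    The proportionality constant c is a hypothetical placeholder."""
    return c * abs(omega2 - omega1) / beta

# Closely clustered eigenmodes (anisotropic stiffness) synchronize at a
# lower pressure than widely spaced ones (isotropic stiffness), all else equal.
p_clustered = phonation_threshold(2 * math.pi * 110, 2 * math.pi * 120, beta=1.0)
p_spread = phonation_threshold(2 * math.pi * 110, 2 * math.pi * 180, beta=1.0)
print(p_clustered < p_spread)  # True

# Stronger coupling (e.g., a smaller prephonatory glottal opening)
# also lowers the threshold.
print(phonation_threshold(100.0, 120.0, beta=2.0)
      < phonation_threshold(100.0, 120.0, beta=1.0))  # True
```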

In reality, the vocal folds have an infinite number of eigenmodes. Which eigenmodes are synchronized and eventually excited depends on the frequency spacing and relative coupling strength among different eigenmodes. Because vocal fold vibration depends on the eigenmodes that are eventually excited, changes in the eigenmode synchronization pattern often lead to changes in the F0, vocal fold vibration pattern, and the resulting voice quality. Previous studies have shown that a slight change in vocal fold properties such as stiffness or medial surface shape may cause phonation to occur at a different eigenmode, leading to a qualitatively different vocal fold vibration pattern and abrupt changes in F0 ( Tokuda et al. , 2007 ; Zhang, 2009 ). Eigenmode synchronization is not limited to two vocal fold eigenmodes, either. It may also occur between a vocal fold eigenmode and an eigenmode of the subglottal or supraglottal system. In this sense, the negative damping due to subglottal or supraglottal acoustic loading can be viewed as the result of synchronization between one of the vocal fold modes and one of the acoustic resonances.

Eigenmode synchronization discussed above corresponds to a 1:1 temporal synchronization of two eigenmodes. For a certain range of vocal fold conditions, e.g., when asymmetry (left-right or anterior-posterior) exists in the vocal system or when the vocal folds are strongly coupled with the sub- or supra-glottal acoustics, synchronization may occur so that the two eigenmodes are synchronized not toward the same frequency, but at a frequency ratio of 1:2, 1:3, etc., leading to subharmonics or biphonation ( Ishizaka and Isshiki, 1976 ; Herzel, 1993 ; Herzel et al. , 1994 ; Neubauer et al. , 2001 ; Berry et al. , 1994 ; Berry et al. , 2006 ; Titze, 2008 ; Lucero et al. , 2015 ). Temporal desynchronization of eigenmodes often leads to irregular or chaotic vocal fold vibration ( Herzel et al. , 1991 ; Berry et al. , 1994 ; Berry et al. , 2006 ; Steinecke and Herzel, 1995 ). Transition between different synchronization patterns, or bifurcation, often leads to a sudden change in the vocal fold vibration pattern and voice quality.

These studies show that the nonlinear interaction between vocal fold eigenmodes is a central feature of the phonation process, with different synchronization or desynchronization patterns producing a large variety of voice types. Thus, by changing the geometrical and biomechanical properties of the vocal folds, either through laryngeal muscle activation or mechanical modification as in phonosurgery, we can select eigenmodes and eigenmode synchronization pattern to control or modify our voice, in the same way as we control speech formants by moving articulators in the vocal tract to modify vocal tract acoustic resonances.

The concept of eigenmode and eigenmode synchronization is also useful for phonation modeling, because eigenmodes can be used as building blocks to construct more complex motion of the system. Often, only the first few eigenmodes are required for adequate reconstruction of complex vocal fold vibrations (both regular and irregular; Herzel et al. , 1994 ; Berry et al. , 1994 ; Berry et al. , 2006 ), which would significantly reduce the degrees of freedom required in computational models of phonation.
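This reduced-order idea can be sketched with a proper orthogonal decomposition (numpy assumed; the two mode shapes and noise level below are invented for illustration, loosely echoing the mode shapes of Fig. 6):

```python
import numpy as np

# Hypothetical medial-surface motion: 200 time steps x 50 surface points,
# built from two dominant mode shapes plus low-level broadband noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
t = np.linspace(0.0, 2.0 * np.pi, 200)
mode_inphase = np.sin(np.pi * x)          # cf. the in-phase mode of Fig. 6(b)
mode_outphase = np.sin(2.0 * np.pi * x)   # cf. the out-of-phase mode of Fig. 6(c)
motion = (np.outer(np.cos(t), mode_inphase)
          + 0.5 * np.outer(np.sin(t), mode_outphase)
          + 0.01 * rng.standard_normal((t.size, x.size)))

# Empirical eigenmodes via SVD (proper orthogonal decomposition):
# the first two modes capture nearly all of the vibration energy,
# so a two-mode model reconstructs the motion adequately.
_, s, _ = np.linalg.svd(motion, full_matrices=False)
energy_two_modes = (s[:2] ** 2).sum() / (s ** 2).sum()
print(energy_two_modes > 0.99)  # True
```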

D. Biomechanical requirements of glottal closure during phonation

An important feature of normal phonation is the complete closure of the membranous glottis during vibration, which is essential to the production of high-frequency harmonics. Incomplete closure of the membranous glottis, as often observed in pathological conditions, often leads to voice production of a weak and/or breathy quality.

It is generally assumed that approximation of the vocal folds through arytenoid adduction is sufficient to achieve glottal closure during phonation, with the duration of glottal closure or the closed quotient increasing with increasing degree of vocal fold approximation. While a certain degree of vocal fold approximation is obviously required for glottal closure, there is evidence suggesting that other factors are also at play. For example, excised larynx experiments have shown that some larynges would vibrate with incomplete glottal closure even when the arytenoids were tightly sutured together ( Isshiki, 1989 ; Zhang, 2011 ). Similar incomplete glottal closure is also observed in experiments using physical vocal fold models with isotropic material properties ( Thomson et al. , 2005 ; Zhang et al. , 2006a ). In these experiments, increasing the subglottal pressure increased the vocal fold vibration amplitude but often did not improve the glottal closure pattern ( Xuan and Zhang, 2014 ). These studies show that additional stiffness or geometric conditions are required to achieve complete membranous glottal closure.

Recent studies have started to provide some insight into these additional biomechanical conditions. Xuan and Zhang (2014) showed that embedding fibers along the anterior-posterior direction in otherwise isotropic models improves glottal closure. With an additional thin, stiffer outermost layer simulating the epithelium, these physical models are able to vibrate with a considerably long closed period. Interestingly, this improvement in the glottal closure pattern occurred only when the fibers were embedded at a location close to the vocal fold surface in the cover layer. Embedding fibers in the body layer did not improve the closure pattern at all. This suggests a possible functional role of collagen and elastin fibers in the intermediate and deep layers of the lamina propria in facilitating glottal closure during vibration.

The difference in the glottal closure pattern between isotropic and anisotropic vocal folds may have several causes. Compared to isotropic vocal folds, anisotropic vocal folds (or fiber-embedded models) are better able to maintain their adductory position against the subglottal pressure and are less likely to be pushed apart by air pressure ( Zhang, 2011 ). In addition, embedding fibers along the AP direction may also enhance the medial-lateral motion, further facilitating glottal closure. Zhang (2014) showed that the first few in vacuo eigenmodes of isotropic vocal folds exhibit similar in-phase, up-and-down, swing-like motion, with the medial-lateral and superior-inferior motions locked in a similar phase relationship. Synchronization of modes with similar vibration patterns necessarily leads to qualitatively the same vibration pattern, in this case an up-and-down swing-like motion dominantly along the superior-inferior direction, as observed in recent physical model experiments ( Thomson et al. , 2005 ; Zhang et al. , 2006a ). In contrast, for vocal folds with AP stiffness much higher than transverse stiffness, the first few in vacuo modes exhibit qualitatively distinct vibration patterns, and the medial-lateral and superior-inferior motions are no longer locked in a similar phase. This makes it possible to strongly excite a large medial-lateral motion without proportional excitation of the superior-inferior motion. As a result, anisotropic models exhibit large medial-lateral motion with a vertical phase difference along the medial surface. This improved capability to maintain the adductory position against the subglottal pressure and to vibrate with large medial-lateral motion may explain the improved glottal closure observed in the experiment of Xuan and Zhang (2014) .

Geometrically, a thin vocal fold has been shown to be easily pushed apart by the subglottal pressure ( Zhang, 2016a ). Although a thin anisotropic vocal fold vibrates with a dominantly medial-lateral motion, this is insufficient to overcome its inability to maintain position against the subglottal pressure. As a result, the glottis never completely closes during vibration, which leads to a relatively smooth glottal flow waveform and weak excitation of higher-order harmonics in the radiated output voice spectrum ( van den Berg, 1968 ; Zhang, 2016a ). Increasing vertical thickness of the medial surface allows the vocal fold to better resist the glottis-opening effect of the subglottal pressure, thus maintaining the adductory position and achieving complete glottal closure.

Once these additional stiffness and geometric conditions (i.e., a certain degree of stiffness anisotropy and a not-too-small vertical vocal fold thickness) are met, the duration of glottal closure can be regulated by varying the vertical phase difference in vocal fold motion along the medial surface. A non-zero vertical phase difference means that, when the lower margins of the medial surfaces start to open, the glottis continues to remain closed until the upper margins start to open. One important parameter affecting the vertical phase difference is the vertical thickness of the medial surface, or the degree of medial bulging in the inferior portion of the medial surface. For the same vocal fold stiffness and approximation conditions, the vertical phase difference during vocal fold vibration increases with increasing vertical medial surface thickness (Fig. 8). Thus, the thicker the medial surface, the larger the vertical phase difference, and the longer the closed phase (Fig. 8; van den Berg, 1968 ; Alipour and Scherer, 2000 ; Zhang, 2016a ). Similarly, the vertical phase difference, and thus the duration of glottal closure, can also be increased by reducing the elastic surface wave speed in the superior-inferior direction ( Ishizaka and Flanagan, 1972 ; Story and Titze, 1995 ), which depends primarily on the stiffness in the transverse plane and to a lesser degree on the AP stiffness, or by increasing the body-cover stiffness ratio ( Story and Titze, 1995 ; Zhang, 2009 ).


(Color online) The closed quotient CQ and vertical phase difference VPD as a function of the medial surface thickness, the AP stiffness (G ap ), and the resting glottal angle ( α ). Reprinted with permission of ASA from Zhang (2016a) .

Theoretically, the duration of glottal closure can be controlled by changing the ratio between the vocal fold equilibrium position (or the mean glottal opening) and the vocal fold vibration amplitude. Both stiffening the vocal folds and tightening vocal fold approximation are able to move the vocal fold equilibrium position toward glottal midline. However, such manipulations often simultaneously reduce the vibration amplitude. As a result, the overall effect on the duration of glottal closure is unclear. Zhang (2016a) showed that stiffening the vocal folds or increasing vocal fold approximation did not have much effect on the duration of glottal closure except around onset when these manipulations led to significant improvement in vocal fold contact.
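The ratio idea in this paragraph can be illustrated with a toy calculation. The purely sinusoidal glottal half-width below is an assumption for illustration only, and it ignores the vertical phase difference emphasized earlier; it shows only how the equilibrium-position-to-amplitude ratio sets the closed quotient:

```python
import math

def closed_quotient(mean_opening, amplitude):
    """Closed quotient of a toy sinusoidal glottal half-width
    g(t) = mean_opening + amplitude * sin(t): the fraction of the cycle
    with g(t) <= 0, i.e., with the glottis closed."""
    r = mean_opening / amplitude
    if r >= 1.0:
        return 0.0               # equilibrium too far from midline: no closure
    return (math.pi - 2.0 * math.asin(r)) / (2.0 * math.pi)

print(round(closed_quotient(0.0, 1.0), 3))  # 0.5: closed half the cycle
print(round(closed_quotient(0.5, 1.0), 3))  # 0.333: less closure
print(closed_quotient(1.2, 1.0))            # 0.0: glottis never fully closes
```

Stiffening or adducting the folds moves the equilibrium toward the midline (smaller `mean_opening`) but may also reduce `amplitude`, which is why the net effect on the closed quotient can be small, as the text notes.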

E. Role of flow instabilities

Although a Bernoulli-based flow description is often used in phonation models, the realistic glottal flow is highly three-dimensional and much more complex. The intraglottal pressure distribution is known to be affected by the three-dimensionality of the glottal channel geometry ( Scherer et al. , 2001 ; Scherer et al. , 2010 ; Mihaescu et al. , 2010 ; Li et al. , 2012 ). As the airflow exits the glottis, it separates from the glottal wall and a jet forms downstream of the flow separation point, leading to the development of shear layer instabilities, vortex roll-up, and eventually vortex shedding from the jet and transition into turbulence. The vortical structures in turn induce disturbances upstream, which may lead to an oscillating flow separation point, jet attachment to one side of the glottal wall instead of going straight, and possibly alternating jet flapping ( Pelorson et al. , 1994 ; Shinwari et al. , 2003 ; Triep et al. , 2005 ; Kucinschi et al. , 2006 ; Erath and Plesniak, 2006 ; Neubauer et al. , 2007 ; Zheng et al. , 2009 ). Recent experiments and simulations have also shown that for a highly divergent glottis, airflow may separate inside the glottis, leading to the formation and convection of intraglottal vortices ( Mihaescu et al. , 2010 ; Khosla et al. , 2014 ; Oren et al. , 2014 ).

Some of these flow features have been incorporated in phonation models (e.g., Liljencrants, 1991 ; Pelorson et al. , 1994 ; Kaburagi and Tanabe, 2009 ; Erath et al. , 2011 ; Howe and McGowan, 2013 ). Resolving other features, particularly the jet instability, vortices, and turbulence downstream of the glottis, demands significantly increased computational costs so that simulation of a few cycles of vocal fold vibration often takes days or months. On the other hand, the acoustic and perceptual relevance of these intraglottal and supraglottal flow structures has not been established. From the sound production point of view, these complex flow structures in the downstream glottal flow field are sound sources of quadrupole type (dipole type when obstacles are present in the pathway of airflow, e.g., tightly adducted false vocal folds). Due to the small length scales associated with the flow structures, these sound sources are broadband in nature and mostly at high frequencies (generally above 2 kHz), with an amplitude much smaller than the harmonic component of the voice source. Therefore, if the high-frequency component of voice is of interest, these flow features have to be accurately modeled, although the degree of accuracy required to achieve perceptual sufficiency has yet to be determined.

It has been postulated that the vortical structures may directly affect the near-field glottal fluid-structure interaction and thus vocal fold vibration and the harmonic component of the voice source. Once separated from the vocal fold walls, the glottal jet starts to develop jet instabilities and is therefore susceptible to downstream disturbances, especially when the glottis takes on a divergent shape. In this way, the unsteady supraglottal flow structures may interact with the boundary layer at the glottal exit and affect the flow separation point within the glottal channel ( Hirschberg et al. , 1996 ). Similarly, it has been hypothesized that intraglottal vortices can induce a local negative pressure on the medial surface of the vocal folds as the intraglottal vortices are convected downstream and thus may facilitate rapid glottal closure during voice production ( Khosla et al. , 2014 ; Oren et al. , 2014 ).

While there is no doubt that these complex flow features affect vocal fold vibration, the question remains how large an influence these vortical structures have on vocal fold vibration and the produced acoustics. For the flow conditions typical of voice production, many of the flow features or instabilities have time scales much different from that of vocal fold vibration. For example, vortex shedding at typical voice conditions occurs generally at frequencies above 1000 Hz ( Zhang et al. , 2004 ; Kucinschi et al. , 2006 ). Considering that phonation is essentially a resonance phenomenon of the vocal folds (Sec. III B ) and the mismatch between vocal fold resonance and the typical frequency scales of the vortical structures, it is questionable whether, compared to vocal fold inertia and elastic recoil, the pressure perturbations on the vocal fold surface due to intraglottal or supraglottal vortical structures are strong enough, or last long enough, to have a significant effect on voice production. Given a longitudinal shear modulus of the vocal fold of about 10 kPa and a shear strain of 0.2, the elastic recoil stress of the vocal fold is approximately 2000 Pa. The pressure perturbations induced by intraglottal or supraglottal vortices are expected to be much smaller than the subglottal pressure. Assuming an upper limit of about 20% of the subglottal pressure for the pressure perturbations (as induced by intraglottal vortices, Oren et al. , 2014 ; in reality this number is expected to be much smaller at normal loudness conditions, and smaller still for supraglottal vortices) and a subglottal pressure of 800 Pa (typical of normal speech production), the pressure perturbation on the vocal fold surface is about 160 Pa, much smaller than the elastic recoil stress.
As for the intraglottal vortices specifically, while a highly divergent glottal geometry is required to create them, their presence induces a negative suction force applied mainly on the superior portion of the medial surface and, if the vortices are strong enough, would reduce the divergence of the glottal channel. In other words, while intraglottal vortices are unable to create the divergence conditions required for their own formation, their existence tends to eliminate those conditions.
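The order-of-magnitude estimate in this discussion can be written out explicitly, using the values quoted in the text:

```python
# Order-of-magnitude comparison: elastic recoil stress of the fold
# versus vortex-induced pressure perturbation on its surface.
shear_modulus_pa = 10e3          # longitudinal shear modulus, ~10 kPa
shear_strain = 0.2
elastic_recoil_pa = shear_modulus_pa * shear_strain          # ~2000 Pa

subglottal_pressure_pa = 800.0   # typical of normal speech
perturbation_fraction = 0.2      # generous upper bound from the text
vortex_perturbation_pa = perturbation_fraction * subglottal_pressure_pa  # ~160 Pa

# The vortex-induced perturbation is more than an order of magnitude
# smaller than the elastic recoil stress.
print(vortex_perturbation_pa < 0.1 * elastic_recoil_pa)  # True
```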

There have been some recent studies toward quantifying the degree of influence of the vortical structures on phonation. In an excised larynx experiment without a vocal tract, it has been observed that the produced sound does not change much when a finger is placed very close to the glottal exit, which presumably would have significantly disturbed the supraglottal flow field. A more rigorous experiment was designed by Zhang and Neubauer (2010) , who placed an anterior-posteriorly aligned cylinder in the supraglottal flow field, traversed it in the flow direction at different left-right locations, and observed the acoustic consequences. The hypothesis was that, if these supraglottal flow structures had a significant effect on vocal fold vibration and acoustics, disturbing them would lead to noticeable changes in the produced sound. However, the experiment found no significant changes in the sound except when the cylinder was positioned within the glottal channel.

The potential impact of intraglottal vortices on phonation has also been numerically investigated ( Farahani and Zhang, 2014 ; Kettlewell, 2015 ). Because of the difficulty in removing intraglottal vortices without affecting other aspects of the glottal flow, the effect of the intraglottal vortices was modeled as a negative pressure superimposed on the flow pressure predicted by a base glottal flow model. In this way, the effect of the intraglottal vortices could be selectively activated or deactivated independently of the base flow so that its contribution to phonation could be investigated. These studies showed that intraglottal vortices have only small effects on vocal fold vibration and the glottal flow. Kettlewell (2015) further showed that the vortices are either not strong enough to induce significant pressure perturbations on the vocal fold surface or, if they are strong enough, they advect rapidly into the supraglottal region and the induced pressure perturbations are too brief to overcome the inertia of the vocal fold tissue.

Although phonation models using simplified flow descriptions that neglect vortical flow structures are widely used and appear to compare well qualitatively with experiments ( Pelorson et al. , 1994 ; Zhang et al. , 2002a ; Ruty et al. , 2007 ; Kaburagi and Tanabe, 2009 ), more systematic investigations are required to reach a definite conclusion regarding the relative importance of these flow structures to phonation and voice perception. This may be achieved by conducting parametric studies over a large range of conditions in which the relative strength of these vortical structures is known to vary significantly and observing the consequences for voice production. Such an improved understanding would facilitate the development of computationally efficient reduced-order models of phonation.

IV. BIOMECHANICS OF VOICE CONTROL

A. Fundamental frequency

In the discussion of F0 control, an analogy is often made in the voice literature between phonation and the vibration of strings (e.g., Colton et al. , 2011 ). The vibration frequency of a string is determined by its length, tension, and mass. By analogy, the F0 of voice production is also determined by the length, tension, and mass of the vocal folds, with the mass interpreted as the portion of vocal fold mass that is set into vibration. Specifically, F0 increases with increasing tension, decreasing mass, and decreasing vocal fold length. While the string analogy is conceptually simple and heuristically useful, it misses some important features of the vocal folds. Other than the vague definition of an effective mass, the string model, which implicitly assumes cross-sectional dimensions much smaller than length, completely neglects the contribution of vocal fold stiffness to F0 control. Although stiffness and tension are often not differentiated in the voice literature, they have different physical meanings and represent two different mechanisms that resist deformation (Fig. 2). Stiffness is a property of the vocal fold and represents the elastic restoring force in response to deformation, whereas tension or stress describes the mechanical state of the vocal folds. The string analogy also neglects the effect of vocal fold contact, which introduces an additional stiffening effect.

Because phonation is essentially a resonance phenomenon of the vocal folds, the F0 is primarily determined by the frequencies of the vocal fold eigenmodes that are excited. In general, vocal fold eigenfrequencies depend on both vocal fold geometry, including length, depth, and thickness, and the stiffness and stress conditions of the vocal folds. Shorter vocal folds tend to have higher eigenfrequencies. Thus, because of their small vocal fold size, children tend to have the highest F0, followed by females and then males. Vocal fold eigenfrequencies also increase with increasing stiffness or stress (tension), both of which provide a restoring force resisting vocal fold deformation. Thus, stiffening or tensioning the vocal folds increases the F0 of the voice. In general, the effect of stiffness on vocal fold eigenfrequencies dominates that of tension when the vocal fold is only slightly elongated or shortened, in which case the tension is small or even negative and the string model would underestimate F0 or fail to provide a prediction. As the vocal fold is further elongated and tension increases, stiffness and tension become equally important in affecting vocal fold eigenfrequencies ( Titze and Hunter, 2004 ; Yin and Zhang, 2013 ).
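For reference, the ideal-string formula underlying the analogy can be sketched as follows; the numerical values are hypothetical, not measured vocal fold data, and serve only to show the trends the analogy predicts:

```python
import math

def string_f0(length_m, tension_n, linear_density_kg_per_m):
    """Fundamental frequency of an ideal string: f0 = (1/2L) * sqrt(T/mu)."""
    return math.sqrt(tension_n / linear_density_kg_per_m) / (2.0 * length_m)

f_ref = string_f0(0.016, 0.5, 0.001)          # a 16 mm "fold", arbitrary T and mu
print(string_f0(0.016, 1.0, 0.001) > f_ref)   # True: more tension -> higher F0
print(string_f0(0.020, 0.5, 0.001) < f_ref)   # True: longer -> lower F0
print(string_f0(0.016, 0.5, 0.002) < f_ref)   # True: more mass -> lower F0
# For tension near zero or negative the formula breaks down, which is
# precisely where vocal fold stiffness (neglected by the string analogy)
# dominates, as discussed in the text.
```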

When vocal fold contact occurs during vibration, the vocal fold collision force appears as an additional restoring force (Ishizaka and Flanagan, 1972). Depending on the extent, depth of influence, and duration of vocal fold collision, this additional force can significantly increase the effective stiffness of the vocal folds and thus F0. Because the vocal fold contact pattern depends on the degree of vocal fold approximation, the subglottal pressure, and vocal fold stiffness and geometry, changes in any of these parameters may affect F0 by modifying vocal fold contact (van den Berg and Tan, 1959; Zhang, 2016a).
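The F0-raising effect of collision can be caricatured with a piecewise-linear ("bilinear") oscillator: stiffness k while the folds are apart, k plus a collision stiffness while in contact. For contact engaging at equilibrium, the period is exactly the sum of a half-period in each regime. This is a textbook toy, not a vocal fold model; all values are illustrative.

```python
import math

def bilinear_f0(m, k, k_contact):
    """Bilinear oscillator: stiffness k on the open half of the cycle,
    k + k_contact on the 'contact' half. The period is the sum of the
    two half-periods, so F0 rises with the collision stiffness."""
    half_free = math.pi * math.sqrt(m / k)
    half_contact = math.pi * math.sqrt(m / (k + k_contact))
    return 1.0 / (half_free + half_contact)

m, k = 1e-3, 40.0  # 1 g effective mass, 40 N/m (illustrative values)

f_no_contact = math.sqrt(k / m) / (2.0 * math.pi)  # plain oscillator
f_contact = bilinear_f0(m, k, 3.0 * k)             # stiff collision spring
```

With a collision spring three times the free stiffness, F0 rises by roughly a third; as the collision stiffness goes to zero, the plain oscillator frequency is recovered.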

In humans, F0 can be increased by increasing either vocal fold eigenfrequencies or the extent and duration of vocal fold contact. Control of vocal fold eigenfrequencies is largely achieved by varying the stiffness and tension along the AP direction. Due to the nonlinear material properties of the vocal folds, both the AP stiffness and tension can be controlled by elongating or shortening the vocal folds through activation of the CT muscle. Although elongation also increases vocal fold length, which by itself lowers F0, the effect of the increased stiffness and tension on F0 appears to dominate that of the increased length.

The effect of TA muscle activation on F0 control is more complex. In addition to shortening the vocal folds, TA activation tensions and stiffens the body layer and decreases tension in the cover layer, but may either decrease or increase the cover stiffness (Yin and Zhang, 2013). Titze et al. (1988) showed that, depending on the depth of the body layer involved in vibration, increasing TA activation can either increase or decrease vocal fold eigenfrequencies. On the other hand, Yin and Zhang (2013) showed that for an elongated vocal fold, as is often the case in phonation, the overall effect of TA activation is to reduce vocal fold eigenfrequencies; only when the vocal folds are slightly elongated or shortened may TA activation increase vocal fold eigenfrequencies. In addition to its effect on vocal fold eigenfrequencies, TA activation increases the vertical thickness of the vocal folds and produces medial compression between the two folds, both of which increase the extent and duration of vocal fold contact and would lead to an increased F0 (Hirano et al., 1969). Because of these opposing effects on vocal fold eigenfrequencies and vocal fold contact, the overall effect of TA activation on F0 varies depending on the specific vocal fold conditions.

Increasing the subglottal pressure or activating the LCA/IA muscles by itself has little effect on vocal fold eigenfrequencies (Hirano and Kakita, 1985; Chhetri et al., 2009; Yin and Zhang, 2014). However, these changes often increase the extent and duration of vocal fold contact during vibration, particularly with increasing subglottal pressure, and thus lead to an increased F0 (Hirano et al., 1969; Ishizaka and Flanagan, 1972; Zhang, 2016a). Due to nonlinearity in vocal fold material properties, the increased vibration amplitude at high subglottal pressures may also increase the effective stiffness and tension, which may further raise F0 (van den Berg and Tan, 1959; Ishizaka and Flanagan, 1972; Titze, 1989). Ishizaka and Flanagan (1972) showed in their two-mass model that vocal fold contact and material nonlinearity combined can increase F0 by about 40 Hz when the subglottal pressure is increased from about 200 to 800 Pa. In the continuum model of Zhang (2016a), which includes the effect of vocal fold contact but not vocal fold material nonlinearity, increasing the subglottal pressure alone can raise F0 at a rate as high as 20 Hz/kPa.

B. Vocal intensity

Because voice is produced at the glottis, filtered by the vocal tract, and radiated from the mouth, an increase in vocal intensity can be achieved by either increasing the source intensity or enhancing the radiation efficiency. The source intensity is controlled primarily by the subglottal pressure, which increases the vibration amplitude and the negative peak or MFDR of the time derivative of the glottal flow. The subglottal pressure depends primarily on the alveolar pressure in the lungs, which is controlled by the respiratory muscles and the lung volume. In general, conditions of the laryngeal system have little effect on the establishment of the alveolar pressure (Hixon, 1987; Finnegan et al., 2000). However, an open glottis results in a low glottal resistance and thus a considerable pressure drop along the lower airway and a reduced subglottal pressure. An open glottis also leads to a large glottal flow rate and a rapid decline in lung volume, reducing the duration of speech between breaths and increasing the respiratory effort required to maintain a target subglottal pressure (Zhang, 2016b).
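The pressure drop along the lower airway can be sketched as a simple series resistive divider for steady flow. This is a deliberately crude caricature of the paragraph's point, not an airway model; the resistance and pressure values are hypothetical round numbers.

```python
def lower_airway_divider(P_alv, R_airway, R_glottis):
    """Series-resistance sketch: the glottal flow
    Q = P_alv / (R_airway + R_glottis) drops Q * R_airway across the
    lower airway, so less pressure remains just below the glottis when
    the glottis is open (small R_glottis). Values are illustrative."""
    Q = P_alv / (R_airway + R_glottis)
    P_sub = P_alv - Q * R_airway
    return P_sub, Q

P_alv, R_aw = 800.0, 1.0e5  # Pa and Pa*s/m^3 (hypothetical)

P_adducted, Q_adducted = lower_airway_divider(P_alv, R_aw, 1.0e6)
P_open, Q_open = lower_airway_divider(P_alv, R_aw, 2.0e5)
# Open glottis: larger flow (faster lung volume decline), lower P_sub.
```

The open-glottis case loses a third of the alveolar pressure to the lower airway in this toy, while the adducted case loses under a tenth, mirroring the trade-off described in the text.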

In the absence of a vocal tract, laryngeal adjustments, which control vocal fold stiffness, geometry, and position, have little effect on the source intensity, as shown in many studies using laryngeal, physical, or computational models of phonation (Tanaka and Tanabe, 1986; Titze, 1988b; Zhang, 2016a). In the experiment by Tanaka and Tanabe (1986), for a constant subglottal pressure, stimulation of the CT and LCA muscles had almost no effect on vocal intensity, whereas stimulation of the TA muscle slightly decreased it. In an excised larynx experiment, Titze (1988b) found no dependence of vocal intensity on the glottal width. Similar secondary effects of laryngeal adjustments have also been observed in a recent computational study (Zhang, 2016a), which further showed that laryngeal adjustments may be important at subglottal pressures slightly above onset, where an increase in either AP stiffness or vocal fold approximation may improve vocal fold contact and glottal closure, significantly increasing the MFDR and thus vocal intensity. These effects, however, become less efficient with increasing vocal intensity.

The effect of laryngeal adjustments on vocal intensity becomes a little more complicated in the presence of the vocal tract. Changing vocal tract shape by itself does not amplify the produced sound intensity because sound propagation in the vocal tract is a passive process. However, changes in vocal tract shape may provide a better impedance match between the glottis and the free space outside the mouth and thus improve efficiency of sound radiation from the mouth ( Titze and Sundberg, 1992 ). This is particularly the case for harmonics close to a formant, which are often amplified more than the first harmonic and may become the most energetic harmonic in the spectrum of the output voice. Thus, vocal intensity can be increased through laryngeal adjustments that increase excitation of harmonics close to the first formant of the vocal tract ( Fant, 1982 ; Sundberg, 1987 ) or by adjusting vocal tract shape to match one of the formants with one of the dominant harmonics in the source spectrum.
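The formant-boosting effect described above can be illustrated with a single second-order resonance acting on an idealized harmonic source. This is a single-formant toy, not a vocal tract model: the 1/n² (about -12 dB/octave) source rolloff, the Q of 20 (a 44 Hz bandwidth at 880 Hz), and the choice of F0 are all illustrative assumptions.

```python
import math

def resonator_gain(f, F, Q):
    """Magnitude response of one second-order resonance (one 'formant')
    with center frequency F and quality factor Q."""
    r = f / F
    return 1.0 / math.sqrt((1.0 - r * r) ** 2 + (r / Q) ** 2)

F0, F1, Q = 220.0, 880.0, 20.0  # harmonic 4 falls exactly on the formant

source = {n: 1.0 / n**2 for n in range(1, 11)}  # idealized source rolloff
radiated = {n: a * resonator_gain(n * F0, F1, Q) for n, a in source.items()}

strongest = max(radiated, key=radiated.get)
# Harmonic 4 is 16x weaker than H1 at the source, yet dominates the output.
```

This mirrors the claim in the text: a harmonic near a formant can be amplified past the first harmonic and become the most energetic component of the radiated spectrum.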

In humans, all three strategies (respiratory, laryngeal, and articulatory) are used to increase vocal intensity. When asked to produce an intensity sweep from soft to loud voice, one generally starts with a slightly breathy voice with a relatively open glottis, which requires the least laryngeal effort but is inefficient in voice production. From this starting position, vocal intensity can be increased by increasing either the subglottal pressure, which increases vibration amplitude, or vocal fold adduction (approximation and/or thickening). For a soft voice with minimal vocal fold contact and minimal higher-order harmonic excitation, increasing vocal fold adduction is particularly efficient because it may significantly improve vocal fold contact, in both spatial extent and duration, thus significantly boosting the excitation of harmonics close to the first formant. In humans, for low to medium vocal intensity conditions, vocal intensity increase is often accompanied by simultaneous increases in the subglottal pressure and the glottal resistance ( Isshiki, 1964 ; Holmberg et al. , 1988 ; Stathopoulos and Sapienza, 1993 ). Because the pitch level did not change much in these experiments, the increase in glottal resistance was most likely due to tighter vocal fold approximation through LCA/IA activation. The duration of the closed phase is often observed to increase with increasing vocal intensity ( Henrich et al. , 2005 ), indicating increased vocal fold thickening or medial compression, which are primarily controlled by the TA muscle. Thus, it seems that both the LCA/IA/TA muscles and subglottal pressure increase play a role in vocal intensity increase at low to medium intensity conditions. For high vocal intensity conditions, when further increase in vocal fold adduction becomes less effective ( Hirano et al. , 1969 ), vocal intensity increase appears to rely dominantly on the subglottal pressure increase.

On the vocal tract side, Titze (2002) showed that vocal intensity can be increased by matching a wide epilarynx with a low glottal resistance or a narrow epilarynx with a high glottal resistance. Tuning the first formant to match the F0 (e.g., by opening the mouth wider) is often used in soprano singing to maximize the vocal output (Joliveau et al., 2004). Because radiation efficiency can be improved through adjustments of either the vocal folds or the vocal tract, it is possible to improve radiation efficiency while still maintaining a desired pitch or articulation.

C. Voice quality

Voice quality generally refers to aspects of the voice other than pitch and loudness. Due to the subjective nature of voice quality perception, many different descriptors are used, and authors often disagree about their meanings (Gerratt and Kreiman, 2001; Kreiman and Sidtis, 2011). This lack of a clear and consistent definition makes it difficult to study voice quality and to identify its physiologic correlates and controls. Acoustically, voice quality is associated with the spectral amplitudes and shapes of the harmonic and noise components of the voice source, and their temporal variations. In the following we focus on physiologic factors that are known to have an impact on the voice spectrum and thus are potentially perceptually important.

One of the first systematic investigations of the physiologic controls of voice quality was conducted by Isshiki (1989, 1998) using excised larynges, in which regions of normal, breathy, and rough voice qualities were mapped out in the three-dimensional parameter space of the subglottal pressure, vocal fold stiffness, and prephonatory glottal opening area (Fig. 9). He showed that for a given vocal fold stiffness and prephonatory glottal opening area, increasing the subglottal pressure led to voice production of a rough quality. This effect of the subglottal pressure can be counterbalanced by increasing vocal fold stiffness, which enlarged the region of normal voice in the parameter space of Fig. 9. Unfortunately, the details of this study, including the definition and manipulation of vocal fold stiffness and the perceptual evaluation of the different voice qualities, are not fully available. The importance of the coordination between the subglottal pressure and laryngeal conditions was also demonstrated by van den Berg and Tan (1959), who showed that although different vocal registers were observed, each register occurred only in a certain range of laryngeal conditions and subglottal pressures. For example, for conditions of low longitudinal tension, a chest-like phonation was possible only for small airflow rates. At large values of the subglottal pressure, “it was impossible to obtain good sound production. The vocal folds were blown too wide apart…. The shape of the glottis became irregularly curved and this curving was propagated along the glottis.” Good voice production at large flow rates was possible only with thyroid cartilage compression, which imitates the effect of TA muscle activation. Irregular vocal fold vibration at high subglottal pressures has also been observed in physical model experiments (e.g., Xuan and Zhang, 2014).
Irregular or chaotic vocal fold vibration at conditions of pressure-stiffness mismatch has also been reported in the numerical simulation of Berry et al. (1994) , which showed that while regular vocal fold vibration was observed for typical vocal fold stiffness conditions, irregular vocal fold vibration (e.g., subharmonic or chaotic vibration) was observed when the cover layer stiffness was significantly reduced while maintaining the same subglottal pressure.

FIG. 9. A three-dimensional map of normal (N), breathy (B), and rough (R) phonation in the parameter space of the prephonatory glottal area (Ag0), subglottal pressure (Ps), and vocal fold stiffness (k). Reprinted with permission of Springer from Isshiki (1989).

The experiments of van den Berg and Tan (1959) and Isshiki (1989) also showed that weakly adducted vocal folds (weak LCA/IA/TA activation) often lead to vocal fold vibration with incomplete glottal closure during phonation. When the airflow is sufficiently high, the persistent glottal gap leads to increased turbulent noise production and thus phonation of a breathy quality (Fig. 9). The incomplete glottal closure may occur in the membranous or the cartilaginous portion of the glottis. When it is limited to the cartilaginous glottis, the resulting voice is breathy but may still have strong harmonics at high frequencies. When it occurs in the membranous glottis, the reduced or slowed vocal fold contact also reduces the excitation of higher-order harmonics, resulting in a breathy and weak voice. When the vocal folds are sufficiently separated, the coupling between them may be weakened enough that each fold vibrates at a different F0. This leads to biphonation, a voice containing two distinct fundamental frequencies, with a percept similar to the beat frequency phenomenon.

Compared to a breathy voice, a pressed voice is presumably produced with tight vocal fold approximation, or even some degree of medial compression between the membranous portions of the two folds. A pressed voice is often characterized by a second harmonic that is stronger than the first harmonic (i.e., a negative H1-H2) and a long period of glottal closure during vibration. Although a certain degree of vocal fold approximation and stiffness anisotropy is required to achieve vocal fold contact during phonation, the duration of glottal closure has been shown to be primarily determined by the vertical thickness of the vocal fold medial surface (van den Berg, 1968; Zhang, 2016a). Thus, although it is generally assumed that a pressed voice can be produced with tight arytenoid adduction through LCA/IA muscle activation, activation of the LCA/IA muscles alone can neither achieve prephonatory medial compression in the membranous glottis nor change the vertical thickness of the medial surface. Activation of the TA muscle appears to be essential in changing the voice from a breathy to a pressed quality. A weakened TA muscle, as in aging or muscle atrophy, would lead to difficulty in producing a pressed voice or even sufficient glottal closure during phonation. On the other hand, strong TA muscle activation, as in, for example, spasmodic dysphonia, may lead to too tight a glottal closure and a rough voice quality (Isshiki, 1989).
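The H1-H2 measure mentioned above can be computed directly from a glottal flow waveform. The sketch below uses a half-sine pulse train as a stand-in source (a toy, not a physiologic model): shortening the open quotient, as in progressively more pressed phonation, shrinks H1-H2. The F0, sampling rate, and open-quotient values are illustrative.

```python
import numpy as np

def h1_h2_db(open_quotient, f0=200.0, fs=16000, periods=64):
    """H1-H2 (dB) of a toy glottal source: a half-sine flow pulse during
    the open phase, zero flow during the closed phase. The signal length
    is an exact number of periods, so harmonics fall on FFT bins."""
    T = 1.0 / f0
    n = int(round(periods * fs / f0))
    t = np.arange(n) / fs
    phase = (t % T) / T
    flow = np.where(phase < open_quotient,
                    np.sin(np.pi * phase / open_quotient), 0.0)
    spec = np.abs(np.fft.rfft(flow))
    h1, h2 = spec[periods], spec[2 * periods]  # bins at F0 and 2*F0
    return 20.0 * np.log10(h1 / h2)

breathy_like = h1_h2_db(0.7)  # long open phase: H1 dominates
pressed_like = h1_h2_db(0.3)  # short open phase: H1-H2 shrinks
```

This simple pulse shape never drives H1-H2 negative; reproducing the negative H1-H2 of strongly pressed voices requires more realistic waveform skewing, but the monotonic trend with open quotient is already visible.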

In humans, vocal fold stiffness, vocal fold approximation, and vocal fold geometry are regulated by the same set of laryngeal muscles and thus often co-vary, which has long been considered one possible origin of vocal registers and their transitions (van den Berg, 1968). Specifically, it has been hypothesized that changes in F0 are often accompanied by changes in the vertical thickness of the vocal fold medial surface, which in turn change the spectral characteristics of the produced voice. The medial surface thickness is primarily controlled by the CT and TA muscles, which also regulate vocal fold stiffness and approximation. Activation of the CT muscle reduces the medial surface thickness, but also increases vocal fold stiffness and tension, and in some conditions increases the resting glottal opening (van den Berg and Tan, 1959; van den Berg, 1968; Hirano and Kakita, 1985). Because the LCA/IA/TA muscles are innervated by the same nerve and often activated together, an increase in the medial surface thickness through TA muscle activation is often accompanied by increased vocal fold approximation (Hirano and Kakita, 1985) and contact. Thus, if one attempts to increase F0 primarily by activating the LCA/IA/TA muscles, the vocal folds are likely to have a large medial surface thickness and probably a low AP stiffness, leading to a chest-like voice production with a large vertical phase difference along the medial surface, long glottal closure, a small flow rate, and strong harmonic excitation. In the extreme case of strong TA activation, minimal CT activation, and a very low subglottal pressure, the glottis can remain closed for most of the cycle, leading to a vocal fry-like voice production.
In contrast, if one attempts to increase F0 by increasing CT activation alone, the vocal folds, with a small medial surface thickness, are likely to produce a falsetto-like voice production, with incomplete glottal closure and a nearly sinusoidal flow waveform, very high F0, and a limited number of harmonics.

V. MECHANICAL AND COMPUTER MODELS FOR VOICE APPLICATIONS

Voice applications generally fall into two major categories. In the clinic, simulation of voice production has the potential to predict the outcomes of clinical management of voice disorders, including surgery and voice therapy. Such applications call for a representation of vocal fold geometry and material properties accurate enough to match actual clinical treatment, and for this reason continuum models of the vocal folds are preferred over lumped-element models. Computational cost is not necessarily a concern in such applications, but it still has to be practical. In contrast, for some other applications, particularly in speech technology, the primary goal is to reproduce speech acoustics or at least the perceptually relevant features of speech acoustics. Real-time capability is desired in these applications, whereas realistic representation of the underlying physics is often unnecessary. In fact, most current speech synthesis systems treat speech purely as an acoustic signal and do not model the physics of speech production at all. However, models that take the underlying physics into consideration, at least to some degree, may hold the most promise for natural-sounding, speaker-specific speech synthesis.

A. Mechanical vocal fold models

Early efforts in artificial speech production, dating back to as early as the 18th century, focused on mechanically reproducing the speech production system; a detailed review can be found in Flanagan (1972). These early efforts generally focused on articulation in the vocal tract rather than the voice source, which is understandable considering that meaning is conveyed primarily through changes in articulation and that the voice production process was then poorly understood. The vibrating element in these mechanical models, either a vibrating reed or a slotted rubber sheet stretched over an opening, is only a rough approximation of the human vocal folds.

More sophisticated mechanical models have been developed more recently to better reproduce the three-dimensional layered structure of the vocal folds. A membrane (cover)-cushion (body) two-layer rubber vocal fold model was first developed by Smith (1956). Similar mechanical models were later developed and used in voice production research (e.g., Isogai et al., 1988; Kakita, 1988; Titze et al., 1995; Thomson et al., 2005; Ruty et al., 2007; Drechsel and Thomson, 2008), using silicone or rubber materials or liquid-filled membranes. Recent studies (Murray and Thomson, 2012; Xuan and Zhang, 2014) have also begun to embed fibers in these models to simulate the anisotropic material properties due to the collagen and elastin fibers in the vocal folds. A similar layered vocal fold model has been incorporated into a mechanical talking robot system (Fukui et al., 2005; Fukui et al., 2007; Fukui et al., 2008). The most recent version of the talking robot, the Waseda Talker, includes mechanisms for the control of pitch and resting glottal opening, and is able to produce voice of modal, creaky, or breathy quality. Nevertheless, although mechanical voice production systems may find application in voice prostheses or humanoid robots in the future, current mechanical models are still a long way from approaching, let alone reproducing, the human capability and flexibility in producing and controlling voice.

B. Formant synthesis and parametric voice source models

Compared to mechanically reproducing the physical process involved in speech production, it is easier to reproduce speech as an acoustic signal. This is particularly the case for speech synthesis. One approach, adopted in most current speech synthesis systems, is to concatenate segments of pre-recorded natural voice into new phrases or sentences. While relatively easy to implement, this approach requires a large database of words spoken in different contexts to achieve natural-sounding speech, which makes it difficult to extend to personalized speech synthesis or to synthesis with varying emotional percepts.

Another approach is to reproduce only perceptually relevant acoustic features of speech, as in formant synthesis. The target acoustic features generally include the F0, sound amplitude, and formant frequencies and bandwidths. This approach gained popularity with the development of electrical synthesizers and later computer simulation, which allow flexible and accurate control of these features. Early formant synthesizers used simple sound sources, often a filtered impulse train for voiced sounds and white noise for unvoiced sounds. Research on the voice source (e.g., Rothenberg et al., 1971; Fant, 1979; Titze and Talkin, 1979; Fant et al., 1985) has led to the development of parametric time-domain voice source models, which are capable of producing voice source waveforms of varying F0, amplitude, open quotient, and abruptness of glottal flow shutoff, and thus of synthesizing different voice qualities.
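The parameter set named above (F0, amplitude, open quotient, abruptness of closure) can be illustrated with a Rosenberg-type pulse, one of the simpler classic time-domain parametric models (the LF model of Fant et al., 1985 is the better-known but more involved relative). The parameter names and default values here are our own illustrative choices.

```python
import numpy as np

def rosenberg_pulse(fs, f0, open_quotient=0.6, speed_quotient=2.0):
    """One period of a Rosenberg-type glottal flow pulse: a raised-cosine
    opening phase, a quarter-cosine closing phase, and a closed phase of
    zero flow. speed_quotient is the opening/closing duration ratio; a
    larger value gives a more abrupt flow shutoff."""
    n = int(fs / f0)
    t = np.arange(n) / fs
    T = 1.0 / f0
    To = open_quotient * T                            # open phase
    Tp = To * speed_quotient / (1 + speed_quotient)   # opening portion
    Tn = To - Tp                                      # closing portion
    g = np.zeros(n)
    rise = t < Tp
    fall = (t >= Tp) & (t < To)
    g[rise] = 0.5 * (1.0 - np.cos(np.pi * t[rise] / Tp))
    g[fall] = np.cos(0.5 * np.pi * (t[fall] - Tp) / Tn)
    return g

pulse = rosenberg_pulse(16000, 200.0)
voiced = np.tile(pulse, 50)  # a 200 Hz pulse train, 0.25 s long
```

Scaling the pulse sets the amplitude, the period sets F0, and the two quotients reshape the waveform, which is exactly the kind of control the text attributes to parametric source models.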

While parametric voice source models provide flexibility in source variations, synthetic speech generated by formant synthesis still suffers from limited naturalness. This limited naturalness may result from the primitive rules used to specify the dynamic control of the voice source models (Klatt, 1987). Moreover, the source model control parameters are not independent of each other and often co-vary during phonation. A challenge in formant synthesis is thus to specify voice source parameter combinations, and their patterns of variation in time, that occur in realistic voice production of different voice qualities by different speakers. It is also possible that some perceptually important features are missing from time-domain voice source models (Klatt, 1987). Human perception of voice characteristics is better described in the frequency domain, as the auditory system performs an approximate Fourier analysis of the voice and of sound in general. While time-domain models correspond better to the physical events occurring during phonation (e.g., glottal opening and closing, and the closed phase), some spectral details of perceptual importance may not be captured by simple time-domain voice source models. For example, spectral details in the low and middle frequencies have been shown to be of considerable importance to naturalness judgments, but are difficult to represent in a time-domain source model (Klatt, 1987). A recent study (Kreiman et al., 2015) showed that spectral-domain voice source models are able to create significantly better matches to natural voices than time-domain models. Furthermore, because the voice source and the sub- and supra-glottal systems are independent in formant synthesis, interactions and co-variations between the vocal folds and the sub- and supra-glottal systems are by design not accounted for.
All these factors may contribute to the limited naturalness of the formant synthesized speech.

C. Physically based computer models

An alternative approach to natural speech synthesis is to computationally model the voice production process based on physical principles. The control parameters would be the geometry and material properties of the vocal system or, more realistically, respiratory and laryngeal muscle activation levels. This approach avoids the need to specify consistent characteristics of either the voice source or the formants, thus allowing synthesis and modification of natural voice in a way intuitively similar to human voice production and control.

The first such computer model of voice production was the one-mass model of Flanagan and Landgraf (1968), in which the vocal fold is modeled as a horizontally moving, single-degree-of-freedom mass-spring-damper system. This model is able to vibrate only in a restricted range of conditions, when the natural frequency of the mass-spring system is close to one of the acoustic resonances of the subglottal or supraglottal tract. Ishizaka and Flanagan (1972) extended this model to a two-mass model in which the upper and lower parts of the vocal fold are modeled as two separate masses connected by an additional spring along the vertical direction. The two-mass model can vibrate with a vertical phase difference between the two masses, and thus independently of the acoustics of the sub- and supra-glottal tracts. Many variants of the two-mass model have since been developed. Titze (1973) developed a 16-mass model to better represent vocal fold motion along the anterior-posterior direction. To better represent the body-cover layered structure of the vocal folds, Story and Titze (1995) extended the two-mass model to a three-mass model, adding a lateral mass representing the inner muscular layer. Empirical rules have also been developed to relate the control parameters of the three-mass model to laryngeal muscle activation levels (Titze and Story, 2002), so that voice production can be simulated with laryngeal muscle activity as input. Designed originally for speech synthesis purposes, these lumped-element models of voice production are generally computationally fast and ideal for real-time synthesis.
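The flavor of a lumped-element model can be conveyed with a deliberately reduced caricature. The sketch below is not Flanagan and Landgraf's actual system of equations: it lumps the energy the airflow feeds into the oscillation as a negative damping term proportional to the subglottal pressure, capped by a van-der-Pol-style cubic term, so the oscillator dies out below a phonation threshold pressure and settles into a limit cycle above it. All parameter values are illustrative.

```python
def one_mass_peak(P_sub, m=1e-4, k=100.0, b=0.02, c=4e-5, gamma=1e4,
                  dt=1e-5, steps=150000):
    """Van-der-Pol-style one-mass caricature. Net damping is
    b - c*P_sub + gamma*x**2: negative (energy input) above the
    threshold pressure b/c = 500 Pa, with the cubic term limiting the
    amplitude. Integrated with semi-implicit Euler; returns the peak
    displacement over the second half of the run (steady state)."""
    x, v = 1e-4, 0.0  # small initial displacement (m)
    peak = 0.0
    for i in range(steps):
        damping = b - c * P_sub + gamma * x * x
        a = (-damping * v - k * x) / m
        v += a * dt
        x += v * dt
        if i > steps // 2:
            peak = max(peak, abs(x))
    return peak

quiet = one_mass_peak(P_sub=200.0)  # below threshold: vibration dies out
loud = one_mass_peak(P_sub=800.0)   # above threshold: millimeter-scale limit cycle
```

The natural frequency here is sqrt(k/m)/2π, about 159 Hz; real lumped-element models replace the negative-damping shortcut with explicit glottal aerodynamics, which is what lets them capture onset, contact, and source-tract interaction.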

A drawback of the lumped-element models of phonation is that the model control parameters cannot be directly measured or easily related to the anatomical structure or material properties of the vocal folds. Thus, these models are not as useful in applications in which a realistic representation of voice physiology is required, as, for example, in the clinical management of voice disorders. To better understand the voice source and its control under different voicing conditions, more sophisticated computational models of the vocal folds based on continuum mechanics have been developed to understand laryngeal muscle control of vocal fold geometry, stiffness, and tension, and how changes in these vocal fold properties affect the glottal fluid-structure interaction and the produced voice. One of the first such models is the finite-difference model by Titze and Talkin (1979) , which coupled a three-dimensional vocal fold model of linear elasticity with the one-dimensional glottal flow model of Ishizaka and Flanagan (1972) . In the past two decades more refined phonation models using a two-dimensional or three-dimensional Navier-Stokes description of the glottal flow have been developed (e.g., Alipour et al. , 2000 ; Zhao et al. , 2002 ; Tao et al. , 2007 ; Luo et al. , 2009 ; Zheng et al. , 2009 ; Bhattacharya and Siegmund, 2013 ; Xue et al. , 2012 , 2014 ). Continuum models of laryngeal muscle activation have also been developed to model vocal fold posturing ( Hunter et al. , 2004 ; Gommel et al. , 2007 ; Yin and Zhang, 2013 , 2014 ). By directly modeling the voice production process, continuum models with realistic geometry and material properties ideally hold the most promise in reproducing natural human voice production. 
However, because the phonation process is highly nonlinear and involves large displacement and deformation of the vocal folds and complex glottal flow patterns, modeling this process in three dimensions is computationally very challenging and time-consuming. As a result, these computational studies are often limited to one or two specific aspects instead of the entire voice production process, and the acoustics of the produced voice, other than F0 and vocal intensity, are often not investigated. For practical applications, real-time or not, reduced-order models with significantly improved computational efficiency are required. Some reduced-order continuum models, with simplifications in both the glottal flow and vocal fold dynamics, have been developed and used in large-scale parametric studies of voice production (e.g., Titze and Talkin, 1979 ; Zhang, 2016a ), which appear to produce qualitatively reasonable predictions. However, these simplifications have yet to be rigorously validated by experiment.

VI. FUTURE CHALLENGES

We currently have a general understanding of the physical principles of voice production. Toward establishing a cause-effect theory of voice production, however, much remains to be learned about voice physiology and biomechanics. This includes the geometry and mechanical properties of the vocal folds, their variability across subjects, sexes, and ages, and how they vary across voicing conditions under laryngeal muscle activation. Even less is known about changes in vocal fold geometry and material properties in pathologic conditions. The surface conditions of the vocal folds and their mechanical properties have been shown to affect vocal fold vibration (Dollinger et al., 2014; Bhattacharya and Siegmund, 2015; Tse et al., 2015), and thus also need to be better quantified. While in vivo animal or human larynx models (Moore and Berke, 1988; Chhetri et al., 2012; Berke et al., 2013) could provide such information, more reliable measurement methods are required to better quantify the viscoelastic properties of the vocal folds, vocal fold tension, and the geometry and movement of the inner vocal fold layers. Beyond macro-mechanical properties, the development of vocal fold constitutive laws based on the ECM distribution and interstitial fluids within the vocal folds would allow us to better understand how vocal fold mechanical properties change with prolonged vocal use, vocal fold injury, and wound healing, which are otherwise difficult to quantify.

While oversimplification of the vocal folds to mass and tension is of limited practical use, the other extreme is not appealing, either. With improved characterization and understanding of vocal fold properties, establishing a cause-effect relationship between voice physiology and production thus requires identifying which of these physiologic features are actually perceptually relevant and under what conditions, through systematic parametric investigations. Such investigations will also facilitate the development of reduced-order computational models of phonation in which perceptually relevant physiologic features are sufficiently represented and features of minimum perceptual relevance are simplified. We discussed earlier that many of the complex supraglottal flow phenomena have questionable perceptual relevance. Similar relevance questions can be asked with regard to the geometry and mechanical properties of the vocal folds. For example, while the vocal folds exhibit complex viscoelastic properties, what are the main material properties that are definitely required in order to reasonably predict vocal fold vibration and voice quality? Does each of the vocal fold layers, in particular, the different layers of the lamina propria, have a functional role in determining the voice output or preventing vocal injury? Current vocal fold models often use a simplified vocal fold geometry. Could some geometric features of a realistic vocal fold that are not included in current models have an important role in affecting voice efficiency and voice quality? Because voice communication spans a large range of voice conditions (e.g., pitch, loudness, and voice quality), the perceptual relevance and adequacy of specific features (i.e., do changes in specific features lead to perceivable changes in voice?) should be investigated across a large number of voice conditions rather than a few selected conditions. 
While physiologic models of phonation allow better reproduction of realistic vocal fold conditions, computational models are more suitable for such systematic parametric investigations. Unfortunately, because of their high computational cost, current studies using continuum models are often limited to a few conditions. The establishment of a cause-effect relationship and the development of reduced-order models are therefore likely to be iterative processes, in which models are gradually refined to include additional physiologic detail as its relevance to the cause-effect relationship becomes clear.

A causal theory of voice production would allow us to map out regions in the physiologic parameter space that produce distinct vocal fold vibration patterns and voice qualities of interest (e.g., normal, breathy, and rough voices for clinical applications; different vocal registers for singing training), similar to the map described by Isshiki (1989; see also Fig. 9). Although the voice production system is quite complex, the control of voice should be both stable and simple, a prerequisite for voice to be a robust and easily controlled means of communication. Understanding voice production in the framework of nonlinear dynamics and eigenmode interactions, and relating it to voice quality, may facilitate progress toward this goal. Toward practical clinical applications, such a voice map would help us understand what physiologic alteration caused a given voice change (the inverse problem) and what can be done to restore the voice to normal. The development of efficient and reliable tools addressing the inverse problem has important applications in the clinical diagnosis of voice disorders. Some methods already exist that solve the inverse problem in lumped-element models (e.g., Dollinger et al., 2002; Hadwin et al., 2016), and these can be extended to physiologically more realistic continuum models.
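As a minimal sketch of how such an inverse problem can be framed, the code below recovers control parameters by minimizing the mismatch between a forward model's predicted voice output and a measurement. The forward model here is a made-up linear two-parameter map standing in for a real lumped-element vocal fold model; the parameter names, coefficients, and units are hypothetical.

```python
# Toy inverse problem: given an observed voice output (F0, SPL), find the
# control parameters that a hypothetical forward model maps onto it.

def forward_model(stiffness, subglottal_pressure):
    """Hypothetical map from normalized (stiffness, pressure) in [0, 1]
    to a voice output (F0 in Hz, SPL in dB). Coefficients are invented
    for illustration only."""
    f0 = 80.0 + 60.0 * stiffness + 5.0 * subglottal_pressure
    spl = 55.0 + 2.0 * stiffness + 20.0 * subglottal_pressure
    return f0, spl

def invert(measured_f0, measured_spl, grid_steps=101):
    """Brute-force grid search over the parameter space, returning the
    parameter pair whose predicted output best matches the measurement."""
    best, best_err = None, float("inf")
    for i in range(grid_steps):
        for j in range(grid_steps):
            k = i / (grid_steps - 1)   # normalized stiffness
            p = j / (grid_steps - 1)   # normalized subglottal pressure
            f0, spl = forward_model(k, p)
            err = (f0 - measured_f0) ** 2 + (spl - measured_spl) ** 2
            if err < best_err:
                best, best_err = (k, p), err
    return best

# Generate a "measurement" from known parameters, then recover them.
f0, spl = forward_model(0.5, 0.8)
print(invert(f0, spl))  # (0.5, 0.8)
```

More sophisticated approaches replace the exhaustive search with optimization or statistical estimation over a physiologically realistic model, but the structure of the problem is the same: a forward model plus an observed voice output.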

Solving the inverse problem would also provide an indirect approach toward understanding the physiologic states that lead to percepts of different emotional states or communication of other personal traits, which are otherwise difficult to measure directly in live human beings. When extended to continuous speech production, this approach may also provide insights into the dynamic physiologic control of voice in running speech (e.g., time contours of the respiratory and laryngeal adjustments). Such information would facilitate the development of computer programs capable of natural-sounding, conversational speech synthesis, in which the time contours of control parameters may change with context, speaking style, or emotional state of the speaker.

ACKNOWLEDGMENTS

This study was supported by research Grant Nos. R01 DC011299 and R01 DC009229 from the National Institute on Deafness and Other Communication Disorders, the National Institutes of Health. The author would like to thank Dr. Liang Wu for assistance in preparing the MRI images in Fig. 1, Dr. Jennifer Long for providing the image in Fig. 1(b), Dr. Gerald Berke for providing the stroboscopic recording from which Fig. 3 was generated, and Dr. Jody Kreiman, Dr. Bruce Gerratt, Dr. Ronald Scherer, and an anonymous reviewer for their helpful comments on an earlier version of this paper.


The Oxford Handbook of Cognitive Psychology


26 Speech Perception

Sven L. Mattys, Department of Psychology, University of York, York, UK

  • Published: 03 June 2013

Speech perception is conventionally defined as the perceptual and cognitive processes leading to the discrimination, identification, and interpretation of speech sounds. However, to gain a broader understanding of the concept, such processes must be investigated relative to their interaction with long-term knowledge, lexical information in particular. This chapter starts with a review of some of the fundamental characteristics of the speech signal and an evaluation of the constraints that these characteristics impose on modeling speech perception. Long-standing questions are then discussed in the context of classic and more recent theories. Recurrent themes include the following: (1) the involvement of articulatory knowledge in speech perception, (2) the existence of a speech-specific mode of auditory processing, (3) the multimodal nature of speech perception, (4) the relative contribution of bottom-up and top-down flows of information to sound categorization, (5) the impact of the auditory environment on speech perception in infancy, and (6) the flexibility of the speech system in the face of novel or atypical input.

The complexity, variability, and fine temporal properties of the acoustic signal of speech have puzzled psycholinguists and speech engineers for decades. How can a signal seemingly devoid of regularity be decoded and recognized almost instantly, without any formal training, and despite being often experienced in suboptimal conditions? Without any real effort, we identify over a dozen speech sounds (phonemes) per second, recognize the words they constitute, almost immediately understand the message generated by the sentences they form, and often elaborate appropriate verbal and nonverbal responses before the utterance ends.

Unlike theories of letter perception and written-word recognition, theories of speech perception and spoken-word recognition have devoted a great deal of their investigation to a description of the signal itself, most of it carried out within the field of phonetics. In particular, the fact that speech is conveyed in the auditory modality has dramatic implications for the perceptual and cognitive operations underpinning its recognition. Research in speech perception has focused on the constraining effects of three main properties of the auditory signal: sequentiality, variability, and continuity.

Nature of the Speech Signal

Sequentiality.

One of the most obvious disadvantages of the auditory system compared to its visual counterpart is that the distribution of the auditory information is time bound, transient, and solely under the speaker’s control. Moreover, the auditory signal conveys its acoustic content in a relatively serial fashion, one bit of information at a time. The extreme spreading of information over time in the speech domain has important consequences for the mechanisms involved in perceiving and interpreting the input.

Figure 26.1 Illustration of the sequential nature of speech processing. (A) Waveform of a complete sentence, that is, air pressure changes (Y axis) over time (X axis). (B–D) Illustration of a listener's progressive processing of the sentence at three successive points in time. The visible waveform represents the portion of signal that is available for processing at time t1 (B), t2 (C), and t3 (D).

In particular, given that relatively little information is conveyed per unit of time, the extraction of meaning can only be done within a window of time that far exceeds the amount of information that can be held in echoic memory (Huggins, 1975; Nooteboom, 1979). Likewise, given that there are no such things as "auditory saccades," in which listeners would be able to skip ahead of the signal or replay the words or sentences they just heard, speech perception and lexical-sentential integration must take place sequentially, in real time (Fig. 26.1).

For the most part, listeners are extremely good at keeping up with the rapid flow of speech sounds. Marslen-Wilson (1987) showed that words in sentences are often recognized well before their offset, sometimes as early as 200 ms after their onset, the average duration of one or two syllables. Other words, however, can only be disentangled from competitors later on, especially when they are short and phonetically reduced, for example, "you are" pronounced as "you're" (Bard, Shillcock, & Altmann, 1988). Yet, in general, there is a consensus that speech perception and lexical access closely shadow the unfolding of the signal (e.g., the Cohort Model; Marslen-Wilson, 1987), even though "right-to-left" effects can sometimes be observed as well (Dahan, 2010).

Given the inevitable sequentiality of speech perception and the limited amount of information that humans can hold in their auditory short-term memory, an obvious question is whether fast speech, which allows more information to be packed into the same amount of time, helps listeners handle the transient nature of speech and, specifically, whether it affects the mechanisms leading to speech recognition. A problem, however, is that fast speech tends to be less clearly articulated (hypoarticulated) and hence less intelligible. Thus, any processing gain due to denser information packing might be offset by diminished intelligibility. This confound can be avoided experimentally, however: speech rate can be accelerated with minimal loss of intrinsic intelligibility via computer-assisted signal compression (e.g., Foulke & Sticht, 1969; van Buuren, Festen, & Houtgast, 1999). Time compression experiments have led to mixed results. Dupoux and Mehler (1990), for instance, found no effect of speech rate on how phonemes are perceived in monosyllabic versus disyllabic words. They started from the observation that the initial consonant of a monosyllabic word is detected faster if the word is high frequency than if it is low frequency, whereas frequency has no effect in multisyllabic words. This difference can be attributed to the use of a lexical route with short words and of a phonemic route with longer words. That is, short words are mapped directly onto lexical representations, whereas longer words first undergo a process of decomposition into phonemes. Critically, Dupoux and Mehler reported that the frequency effect did not appear when the duration of the disyllabic words was compressed to that of the monosyllabic words, suggesting that whether listeners use a lexical or phonemic route to identify phonemes depends on structural factors (number of phonemes or syllables) rather than time. Thus, on this account, the transient nature of speech has only a limited effect on the mechanisms underlying speech recognition.

In contrast, others have found significant effects of speech rate on lexical access. For example, both Pitt and Samuel (1995) and Radeau, Morais, Mousty, and Bertelson (2000) observed that the uniqueness point of a word, that is, the sequential point at which it can be uniquely specified (e.g., "spag" for "spaghetti"), could be dramatically altered when speech rate was manipulated. However, most changes were observed at slower rates, not at faster rates. Thus, changes in speech rate can affect recognition mechanisms, but these effects are observed mainly with time expansion, not with time compression. In sum, although the studies by Dupoux and Mehler (1990), Pitt and Samuel (1995), and Radeau et al. (2000) highlight different effects of time manipulation on speech processing, they all agree that packing more information per unit of time by accelerating speech rate does not compensate for the transient nature of the speech signal or for memory limitations. This is probably due to intrinsic perceptual and mnemonic limitations on how fast information can be processed by the speech system, at any rate.
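The notion of a uniqueness point lends itself to a simple computational definition: the shortest prefix of a word that matches no other entry in the lexicon. The sketch below illustrates this with a small, hypothetical lexicon; actual psycholinguistic work uses large lexical databases and phonemic rather than orthographic transcriptions.

```python
# Compute a word's uniqueness point against a toy lexicon.

def uniqueness_point(word, lexicon):
    """Return the 1-based length of the shortest prefix of `word` that
    no other lexicon entry shares, or None if the word is never uniquely
    specified before its offset (e.g., it is a prefix of another word)."""
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        competitors = [w for w in lexicon if w.startswith(prefix)]
        if competitors == [word]:
            return i
    return None

lexicon = ["spaghetti", "spasm", "spark", "sparkle", "speech"]

print(uniqueness_point("spaghetti", lexicon))  # 4, i.e., "spag"
print(uniqueness_point("spark", lexicon))      # None ("sparkle" still competes)
```

Note that "spark" has no uniqueness point before its offset because "sparkle" remains a competitor, mirroring the late disambiguation of short, embedded words discussed above.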

In general, the sequential nature of speech processing is a feature that many models have struggled to implement not only because it requires taking into account echoic and short-term memory mechanisms (Mattys, 1997 ) but also because the sequentiality problem is compounded by a lack of clear boundaries between phonemes and between words, as described later.

The inspection of a speech waveform does not reveal clear acoustic correlates of what the human ear perceives as phoneme and word boundaries. The lack of boundaries is due to coarticulation between phonemes (the blending of articulatory gestures between adjacent phonemes) within and across words. Even though the degree of coarticulation between phonemes is somewhat less pronounced across than within words (Fougeron & Keating, 1997 ), the lack of clear and reliable gaps between words, along with the sequential nature of speech delivery, makes speech continuity one of the most challenging obstacles for both psycholinguistic theory and automatic speech recognition applications. Yet the absence of phoneme and word boundary markers hardly seems to pose a problem for everyday listening, as the subjective experience of speech is not one of continuity but, rather, of discreteness—that is, a string of sounds making up a string of words.

A great deal of the segmentation problem can be solved, at least in theory, based on lexical knowledge and contextual information. Key notions here are lexical competition and segmentation by lexical subtraction. In this view, lexical candidates are activated in multiple locations in the speech signal—that is, multiple alignment—and they compete for a segmentation solution that does not leave any fragments lexically unaccounted for (e.g., “great wall” is favored over “gray twall,” because “twall” is not an English word). Importantly, this knowledge-driven approach does not assign a specific computational status to segmentation, other than being the mere consequence of mechanisms associated with lexical competition (e.g., McClelland & Elman, 1986; Norris, 1994).
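
The competition logic can be made concrete with a small recursive parser that keeps only segmentations leaving no lexical residue. This is a deliberately simplified sketch, not an implementation of any specific model; the rough ARPAbet-style transcriptions and the three-word lexicon are invented for illustration:

```python
def parses(phones, lexicon):
    """All segmentations of a phoneme sequence that leave no fragment
    lexically unaccounted for (segmentation by lexical subtraction)."""
    if not phones:
        return [[]]
    out = []
    for i in range(1, len(phones) + 1):
        word = lexicon.get(tuple(phones[:i]))  # candidate aligned at this onset
        if word:
            for rest in parses(phones[i:], lexicon):
                out.append([word] + rest)
    return out

# Rough ARPAbet-style transcriptions (an illustrative assumption).
lexicon = {
    ("g", "r", "ey"): "gray",
    ("g", "r", "ey", "t"): "great",
    ("w", "ao", "l"): "wall",
}
print(parses(["g", "r", "ey", "t", "w", "ao", "l"], lexicon))
# "gray" is activated at the onset but loses: "twall" leaves a residue.
```

Only the parse ["great", "wall"] survives, mirroring the "great wall" versus "gray twall" example: the "gray" candidate is considered but discarded because it strands a nonword fragment.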

Another source of information for word segmentation draws upon broad prosodic and segmental regularities in the signal, which listeners use as heuristics for locating word boundaries. For example, languages whose words have a predominant rhythmic pattern (e.g., word-initial stress is predominant in English; word-final lengthening is predominant in French) provide a relatively straightforward—though probabilistic—segmentation strategy to their listeners (Cutler, 1994). The heuristic for English would go as follows: every time a strong syllable is encountered, a boundary is posited before that syllable. For French, it would be: every time a lengthened syllable is encountered, a boundary is posited after that syllable. Another documented heuristic is based on phonotactic probability, that is, the likelihood that specific phonemes follow each other in the words of a language (McQueen, 1998). Specifically, phonemes that are rarely found next to each other in words (e.g., very few English words contain the /fh/ diphone) would be probabilistically interpreted as having occurred across a word boundary (e.g., “tough hero”). Finally, a wide array of acoustic-phonetic cues can also give away the position of a word boundary (Umeda & Coker, 1974). Indeed, phonemes tend to be realized differently depending on their position relative to a word or a syllable boundary. For example, in English, word-initial vowels are frequently glottalized (brief closure of the glottis, e.g., /e/ in “isle end,” compared to no closure in “I lend”), and word-initial stop consonants are often aspirated (burst of air accompanying the release of a consonant, e.g., /t/ in “gray tanker” compared to no aspiration in “great anchor”).
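
The English stress-based heuristic, at least, is simple enough to state as code. The sketch below posits a boundary before every strong syllable; the syllabified, stress-marked input is assumed to be given (deriving it is, of course, part of the real problem), and the example words are chosen to show that the heuristic is only probabilistic:

```python
def metrical_boundaries(syllables):
    """Group syllables into putative words by positing a boundary before
    every strong (stressed) syllable, per the English heuristic above."""
    groups, current = [], []
    for syl, strong in syllables:
        if strong and current:   # a strong syllable opens a new word
            groups.append(current)
            current = []
        current.append(syl)
    if current:
        groups.append(current)
    return groups

# "CONduct LANguage reSEARCH": strong syllables flagged True.
sylls = [("con", True), ("duct", False), ("lan", True),
         ("guage", False), ("re", False), ("search", True)]
print(metrical_boundaries(sylls))
```

On this input the heuristic correctly carves out "conduct" but mis-attaches the weak "re-" of "research" to the preceding word, exactly the kind of error a strong-onset strategy predicts for weak-initial words.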

It is important to note that, in everyday speech, lexically and sublexically driven segmentation cues usually coincide and reinforce each other. However, in suboptimal listening conditions (e.g., noise) or in rare cases where a conflict arises between those two sources of information, listeners have been shown to downplay sublexical discrepancies and give more heed to lexical plausibility (Mattys, White, & Melhorn, 2005 ; Fig. 26.2 ).

Variability

Perhaps the most defining challenge for the field of speech perception is the enormous variability of

Sketch of Mattys, White, and Melhorn’s (2005) hierarchical approach to speech segmentation. The relative weights of speech segmentation cues are illustrated by the width of the gray triangle. In optimal listening conditions, the cues in Tier I dominate. When lexical access is compromised or ambiguous, the cues in Tier II take over. Cues from Tier III are recruited when both lexical and segmental cues are compromised (e.g., background of severe noise). (Reprinted from Mattys, S. L., White, L., & Melhorn, J. F. [2005]. Integration of multiple speech segmentation cues: A hierarchical framework. Journal of Experimental Psychology: General, 134, 477–500 [Figure 7], by permission of the American Psychological Association.)

the signal relative to the stored representations onto which it must be matched. Variability can be found at the word level, where there are infinite ways a given word can be pronounced depending on accents, voice quality, speech rate, and so on, leading to a multitude of surface realizations for a unique target representation. But this many-to-one mapping problem is not different from the one encountered with written words in different handwritings or object recognition in general. In those cases, signal normalization can be effectively achieved by defining a set of core features unique to each word or object stored in memory and by reducing the mapping process to those features only.

The real issue with speech variability happens at a lower level, namely, phoneme categorization. Unlike letters whose realizations have at least some commonality from one instance to another, phonemes can vary widely in their acoustic manifestation—even within the same speaker. For example, as shown in Figure 26.3A , the realization of the phoneme /d/ has no immediately apparent acoustic commonality in /di/ and /du/ (Delattre, Liberman, & Cooper, 1955 ). This lack of acoustic invariance is the consequence of coarticulation: The articulation of /d/ in /di/ is partly determined by the articulatory preparation for /i/, and likewise for /d/ in /du/. The power of coarticulation is easily demonstrated by observing a speaker’s mouth prior to saying /di/ compared to /du/. The mode of articulation of /i/ (unrounded) versus /u/ (rounded) is visible on the speaker’s lips even before /d/ has been uttered. The resulting acoustics of /d/ preceding each vowel have therefore little in common.

The success of the search for acoustic cues, or invariants, capable of uniquely identifying phonemes or phonetic categories has been highly feature specific. For example, as illustrated in Figure 26.3A , the place of articulation of phonemes (i.e., the place in the vocal tract where the airstream is most constricted, which distinguishes, e.g., /b/, /d/, /g/) has been difficult to map onto specific acoustic cues. However, the difference between voiced and unvoiced stop consonants (/b/, /d/, /g/ vs. /p/, /t/, /k/) can be traced back fairly reliably to the duration between the release of the consonant and the moment when the vocal folds start vibrating, that is, the voice onset time (VOT; Liberman, Delattre, & Cooper, 1958 ). In English, the VOT of voiced stop consonants is typically around 0 ms (or at least shorter than 20 ms), whereas it is generally over 25 ms for voiceless consonants. Although this contrast has been shown to be somewhat influenced by consonant type and vocalic context (e.g., Lisker & Abramson, 1970 ), VOT is a fairly robust cue for the voiced-voiceless distinction.
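
As a toy illustration of how diagnostic VOT can be, the rough values cited above can be turned into a threshold rule. The cutoffs, and the very idea of a hard threshold, are simplifications: real category boundaries move with place of articulation, vocalic context, and speech rate, as discussed below.

```python
def classify_voicing(vot_ms, voiced_max=20.0, voiceless_min=25.0):
    """Toy VOT-based voicing decision for English stop consonants,
    using the approximate values from the text (not measured data)."""
    if vot_ms < voiced_max:
        return "voiced"      # /b/, /d/, /g/: VOT near 0 ms
    if vot_ms >= voiceless_min:
        return "voiceless"   # /p/, /t/, /k/: longer voicing lag
    return "ambiguous"       # the zone where categories can blur

print(classify_voicing(5))    # voiced
print(classify_voicing(60))   # voiceless
```

The "ambiguous" band between the two cutoffs is precisely where rate-induced compression of VOT, mentioned below, makes the voiced-voiceless contrast hardest to hear.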

(A) Stylized spectrograms of /di/ and /du/. The dark bars, or formants, represent areas of peak energy on the frequency scale (Y axis), which correlate with zones of high resonance in the vocal tract. The curved leads into the formants are formant transitions. They show coarticulation between the consonant and the following vowel. Note the dissimilarity between the second formant transitions in /di/ (rising) and /du/ (falling). However, as shown in (B), the extrapolation back in time of the two second-formant transitions points to a common frequency locus.

Vowels are subject to coarticulatory influences, too, but the spectral structure of their middle portion is usually relatively stable, and hence, a taxonomy of vowels based on their unique distribution of energy bands along the frequency spectrum, or formants, can be attempted. However, such distribution is influenced by speaking rate, with fast speech typically leading to the target frequency of the formants being missed or leading to an asymmetric shortening of stressed versus unstressed vowels (Lindblom, 1963 ; Port, 1977 ). In general, speech rate variation is particularly problematic for acoustic cues involving time. Even stable cues such as VOT can lose their discriminability power when speech rate is altered. For example, at fast speech rates, the VOT difference between voiced and voiceless stop consonants decreases, making the two types of phonemes more difficult to distinguish (Summerfield, 1981 ). The same problem has been noted for the difference between /b/ and /w/, with /b/ having rapid formant transitions into the vowel and /w/ less rapid ones. This difference is less pronounced at fast speech rates (Miller & Liberman, 1979 ).

Yet, except for those conditions in which subtle differences are manipulated in the laboratory, listeners are surprisingly good at compensating for the acoustic distortions introduced by coarticulation and changes in speech rate. Thus, input variability, phonetic-context effects, and the lack of invariance do not appear to pose a serious problem for everyday speech perception. As reviewed later, however, theoretical accounts aiming to reconcile the complexity of the signal with the effortlessness of perception vary greatly.

Basic Phenomena and Questions in Speech Perception

Following are some of the observations that have shaped theoretical thinking in speech perception over the past 60 years. Most of them concern, in one way or another, the extent to which speech perception is carried out by a part of the auditory system dedicated to speech and involving speech-specific mechanisms not recruited for nonspeech sounds.

Categorical Perception

Categorical perception is a sensory phenomenon whereby a physically continuous dimension is perceived as discrete categories, with abrupt perceptual boundaries between categories and poor discrimination within categories (e.g., perception of the visible electromagnetic radiation spectrum as discrete colors). Early on, categorical perception was found to apply to phonemes—or at least some of them. For example, Liberman, Harris, Hoffman, and Griffith (1957) showed that syllables synthesized to range from /ba/ to /da/ to /ga/ by gradually adjusting the transition between the consonant and the vowel’s formants (i.e., the formant transitions) were perceived as falling into coarse /b/, /d/, and /g/ categories, with poor discrimination between syllables belonging to a perceptual category and high discrimination between syllables straddling a perceptual boundary (Fig. 26.4). Importantly, categorical perception was not observed for matched auditory stimuli devoid of phonemic significance (Liberman, Harris, Eimas, Lisker, & Bastian, 1961). Moreover, since categorical perception meant that easy-to-identify syllables (spectrum endpoints) were also easy to pronounce, whereas less-easy-to-identify syllables (spectrum midpoints) were generally less easy to pronounce, categorical perception was seen as a highly adaptive property of the speech system, and hence, evidence for a dedicated speech mode of the auditory system. This claim was later weakened by reports of categorical perception for nonspeech sounds (e.g., Miller, Wier, Pastore, Kelly, & Dooling, 1976) and for speech sounds by nonhuman species (e.g., Kluender, Diehl, & Killeen, 1987; Kuhl, 1981).
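
The idealized pattern in Figure 26.4 can be mimicked in a few lines: identification as a steep logistic over the continuum, and discrimination predicted from identification labels alone. The logistic parameters and the simple prediction rule below are illustrative assumptions, not a model taken from the literature:

```python
import math

def identify_ba(step, boundary=4.0, slope=2.5):
    """P(respond 'ba') along an 8-step /ba/-/da/ continuum:
    a logistic with a sharp boundary (idealized, as in the figure)."""
    return 1.0 / (1.0 + math.exp(slope * (step - boundary)))

def discriminate(step_i, step_j):
    """Crude prediction of pairwise discriminability from identification
    alone: pairs straddling the boundary get different labels, so they
    are easy; within-category pairs sit near chance (0.5)."""
    return 0.5 + 0.5 * abs(identify_ba(step_i) - identify_ba(step_j))

for i in range(1, 8):
    print(i, round(identify_ba(i), 2), round(discriminate(i, i + 1), 2))
```

Adjacent pairs straddling the boundary (steps 3–4 and 4–5) come out far more discriminable than within-category pairs such as 1–2, reproducing the peaked dashed discrimination curve described in the figure below.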

Idealized identification pattern (solid line, left Y axis) and discrimination pattern (dashed line, right Y axis) for categorical perception. Illustration with a /ba/ to /da/ continuum. Identification shows a sharp perceptual boundary between categories. Discrimination is finer around the boundary than inside the categories.

Effects of Phonetic Context

The effect of adjacent phonemes on the acoustic realization of a target phoneme (e.g., /d/ in /di/ vs. /du/) was mentioned earlier as a core element of the variability challenge. This challenge, that is, achieving perceptual constancy despite input variability, is perhaps most directly illustrated by the converse phenomenon, namely, the varying perception of a constant acoustic input as a function of its changing phonetic environment. Mann (1980) showed that the perception of a /da/-/ga/ continuum was shifted in the direction of reporting more /ga/ when it was preceded by /al/ and more /da/ when it was preceded by /ar/. Since these shifts are in the opposite direction of coarticulation between adjacent phonemes, listeners appear to compensate for the expected consequences of coarticulation. Whether compensation for coarticulation is evidence for a highly sophisticated mechanism whereby listeners use their implicit knowledge of how phonemes are produced—that is, coarticulated—to guide perception (e.g., Fowler, 2006) or simply a consequence of long-term association between the signal and the percept (e.g., Diehl, Lotto, & Holt, 2004; Lotto & Holt, 2006) has been a question of fundamental importance for theories of speech perception, as discussed later.

Integration of Acoustic and Optic Cues

The chief outcome of speech production is the emission of an acoustic signal. However, visual correlates, such as facial and lip movements, are often available to the listener as well. The effect of visual information on speech perception has been extensively studied, especially in the context of the benefit provided by visual cues for listeners with hearing impairments (e.g., Lachs, Pisoni, & Kirk, 2001) and for speech perception in noise (e.g., Sumby & Pollack, 1954). Visual enhancement is also observed for undegraded speech with semantically complicated content or for foreign-accented speech (Reisberg, McLean, & Goldfield, 1987). In the laboratory, audiovisual integration is strikingly illustrated by the well-known McGurk effect. McGurk and MacDonald (1976) showed that listeners presented with an acoustic /ba/ dubbed over a face saying /ga/ tended to report hearing /da/, a syllable whose place of articulation is intermediate between /ba/ and /ga/. The robustness and automaticity of the effect suggest that the acoustic and (visual) articulatory cues of speech are integrated at an early stage of processing. Whether early integration indicates that the primitives of speech perception are articulatory in nature or whether it simply highlights a learned association between acoustic and optic information has been a theoretically divisive debate (see Rosenblum, 2005, for a review).

Lexical and Sentential Effects on Speech Perception

Although traditional approaches to speech perception often stop where word recognition begins (in the same way that approaches to word recognition often stop where sentence comprehension begins), speech perception has been profoundly influenced by the debate on how higher order knowledge affects the identification and categorization of phonemes and phonetic features. A key observation is that lexical knowledge and sentential context can aid phoneme identification, especially when the signal is ambiguous or degraded. For example, Warren and Obusek ( 1971 ) showed that a word can be heard as intact even when a component phoneme is missing and replaced with noise, for example, “legi*lature,” where the asterisk denotes the replaced phoneme. In this case, lexical knowledge dictates what the listener should have heard rather than what was actually there, a phenomenon referred to as phoneme restoration. Likewise, Warren and Warren ( 1970 ) showed that a word whose initial phoneme is degraded, for example, “*eel,” tends to be heard as “wheel” in “It was found that the *eel was on the axle” and as “peel” in “It was found that the *eel was on the orange.” Thus, phoneme identification can be strongly influenced by lexical and sentential knowledge even when the disambiguating context appears later than the degraded phoneme.

But is this truly of interest for speech perception? In other words, could phoneme restoration (and other similar speech illusions) simply result from postperceptual, strategic biases? In this case, “*eel” would be interpreted as “wheel” simply because it makes pragmatic sense to do so in a particular sentential context, not because our perceptual system is genuinely tricked by high-level expectations. If so, contextual effects are of interest to speech-perception scientists only insofar as they suggest that speech perception happens in a system that is impenetrable by higher order knowledge—an unfortunately convenient way of indirectly perpetuating the confinement of speech perception to the study of phoneme identification. The evidence for a postperceptual explanation is mixed. While Norris, McQueen, and Cutler (2000), Massaro (1989), and Oden and Massaro (1978), among others, found no evidence for online top-down feedback to the perceptual system and no logical reasons why such feedback should exist, Samuel (1981, 1997, 2001), Connine and Clifton (1987), and Magnuson, McMurray, Tanenhaus, and Aslin (2003), among others, have reported lexical effects on perception that challenge feedforward models—for example, evidence that lexical information truly alters low-level perceptual discrimination (Samuel, 1981). This debate has fostered extreme empirical ingenuity over the past decades but comparatively little change to theory. One exception, however, is that the debate has now spread to the long-term effects of higher order knowledge on speech perception. For example, while Norris, McQueen, and Cutler (2000) argue against online top-down feedback, the same group (2003) recognizes that perceptual (re-)tuning can happen over time, in the context of repeated exposure and learning.
Placing the feedforward/feedback debate in the time domain provides an opportunity to examine the speech system at the interface with cognition, and memory functions in particular. It also allows more applied considerations to be introduced, such as the role of perceptual recalibration for second-language learning and speech perception in difficult listening conditions (Samuel & Kraljic, 2009 ), as discussed later.

Theories of Speech Perception (Narrowly and Broadly Construed)

Motor and articulatory-gesture theories.

The Motor Theory of speech perception, reported in a series of articles in the early 1950s by Liberman, Delattre, Cooper, and other researchers from the Haskins Laboratories, was the first to offer a conceptual solution to the lack-of-invariance problem. As mentioned earlier, the main stumbling block for speech-perception theories was the observation that many phonemes cannot uniquely be identified by a set of stable and reliable acoustic cues. For example, the formant transitions of /d/, especially the second formant, differ as a function of the following vowel. However, Delattre et al. ( 1955 ) found commonality between different /d/s by extrapolating the formant transitions back in time to their convergence point, or locus (or hub ; Potter, Kopp, & Green, 1947 ), as shown in Figure 26.3B . Thus, what is common to the formants of all /d/s is the frequency at their origin, that is, the frequency that would best reflect the position of the articulators prior to the release of the consonant. This led to one of the key arguments in support of the motor theory, namely that a one-to-one relationship between acoustics and phonemes can be established if the speech system includes a mechanism that allows the listener to work backward through the rules of production in order to identify the speaker’s intended phonemes. In other words, the lack-of-invariance problem can be solved if it can be demonstrated that listeners perceive speech by identifying the speaker’s intended speech gestures rather than (or in addition to) relying solely on the acoustic manifestation of such gestures. The McGurk effect, whereby auditory perception is dramatically altered by seeing the speaker’s moving lips (articulatory gestures), was an important contributor to the view that the perceptual primitives of speech are gestural in nature.
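
The locus computation itself reduces to intersecting two straight lines. The sketch below treats each second-formant transition as a line through vowel onset; the slopes and intercepts are invented numbers chosen only to land near the roughly 1,800 Hz locus classically reported for alveolars, not measurements from any study:

```python
def locus(line1, line2):
    """Intersection of two straight-line formant transitions, each given
    as (slope_hz_per_ms, freq_at_vowel_onset_hz). Returns (t_ms, freq_hz),
    with t < 0 meaning the locus lies before vowel onset."""
    (a1, b1), (a2, b2) = line1, line2
    t = (b2 - b1) / (a1 - a2)          # time at which the lines meet
    return t, a1 * t + b1

# Hypothetical F2 transitions: rising toward /i/, falling toward /u/.
t, f = locus((20.0, 2000.0), (-15.0, 1650.0))
print(round(t, 1), round(f, 1))  # → -10.0 1800.0
```

Extrapolated backward (negative time), the two acoustically dissimilar transitions converge on a single frequency, which is the sense in which the locus restores a one-to-one acoustics-to-phoneme mapping.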

In addition to claiming that the motor system is recruited for perceiving speech (and partly because of this claim), the Motor Theory also posits that speech perception takes place in a highly specialized and speech-specific module that is neurally isolated and is most likely a unique and innate human endowment (Liberman, 1996 ; Liberman & Mattingly, 1985 ). However, even among supporters of a motor basis for speech perception, agreeing upon an operational definition of intended speech gestures and providing empirical evidence for the contribution of such intended gestures to perception proved difficult. This led Fowler and her colleagues to propose that the objects of speech perception are not intended articulatory gestures but real gestures, that is, actual vocal tract movements that are inferable from the acoustic signal itself (e.g., Fowler, 1986 , 1996 ). Thus, although Fowler’s Direct Realism approach aligns with the Motor Theory in that it claims that perceiving speech is perceiving gestures, it asserts that the acoustic signal itself is rich enough in articulatory information to provide a stable (i.e., invariant) signal-to-phoneme mapping algorithm. In doing so, Direct Realism can do away with claims about specialized and/or innate structures for speech perception.

Although the popularity of the original tenets of the Motor Theory—and, to some extent, associated gesture theories—has waned over the years, the theory has brought forward essential questions about the specificity of speech, the specialization of speech perception, and, more recently, the neuroanatomical substrate of a possible motor component of the speech apparatus (e.g., Gow & Segawa, 2009 ; Pulvermüller et al., 2006 ; Sussman, 1989 ; Whalen et al., 2006 ), a topic that regained interest following the discovery of mirror neurons in the premotor cortex (e.g., Rizzolatti & Craighero, 2004 ; but see Lotto, Hickok, & Holt, 2009 ). The debate has also shifted to a discussion of the extent to which the involvement of articulation during speech perception might in fact be under the listener’s control and its manifestation partly task specific (Yuen, Davis, Brysbaert, & Rastle, 2010 , Fig. 26.5 ; see comments by McGettigan, Agnew, & Scott, 2010 ; Rastle, Davis, & Brysbaert, 2010 ). The Motor Theory has also been extensively reviewed—and revisited—in an attempt to address problems highlighted by auditory-based models, as described later (e.g., Fowler, 2006 , 2008 ; Galantucci, Fowler, & Turvey, 2006 ; Lotto & Holt, 2006 ; Massaro & Chen, 2008 ).

Electropalatographic data showing the proportion of tongue contact on alveolar electrodes during the initial and final portions of /k/-initial (e.g., kib) or /s/-initial (e.g., sib) syllables (collapsed) while a congruent or incongruent distractor is presented (Yuen et al., 2010). The distractor was presented auditorily in conditions A and B and visually in condition C. With the target kib as an example, the congruent distractor in the A condition was kib and the incongruent distractor started with a phoneme involving a different place of articulation (e.g., tib). In condition B, the incongruent distractor started with a phoneme that differed from the target only by its voicing status, not by its place of articulation (e.g., gib). Condition C was the same as condition A, except that the distractor was presented visually. The results show “traces” of the incongruent distractors in target production when the distractor is in articulatory competition with the target, particularly in the early portion of the phoneme (condition A), but not when it involves the same place of articulation (condition B), or when it is presented visually (condition C). The results suggest a close relationship between speech perception and speech production. (Reprinted from Yuen, I., Davis, M. H., Brysbaert, M., & Rastle, K. [2010]. Activation of articulatory information in speech perception. Proceedings of the National Academy of Sciences USA, 107, 592–597 [Figure 2], by permission of the National Academy of Sciences.)

Auditory Theory(ies)

The role of articulatory gestures in perceiving speech and the special status of the speech-perception system progressively came under attack largely because of insufficient hard evidence and lack of computational parsimony. Recall that recourse to articulatory gestures was originally posited as a way to solve the lack-of-invariance problem and turn a many(acoustic traces)-to-one(phoneme) mapping problem into a one(gesture)-to-one(phoneme) mapping solution. However, the lack of invariance problem turned out to be less prevalent and, at the same time, more complicated than originally claimed. Indeed, as mentioned earlier, many phonemes were found to preserve distinctive features across contexts (e.g., Blumstein & Stevens, 1981 ; Stevens & Blumstein, 1981 ). At the same time, lack of invariance was found in domains for which a gestural explanation was only of limited use, for example, voice quality, loudness, and speech rate.

Perhaps most problematic for gesture-based accounts was the finding by Kluender, Diehl, and Killeen (1987) that phonemic categorization, which was viewed by such accounts as necessitating access to gestural primitives, could be observed in species lacking the anatomical prerequisites for articulatory knowledge and practice (Japanese quail; Fig. 26.6). This result was seen by many as undermining both the motor component of speech perception and its human-specific nature. Parsimony became the new driving force. As Kluender et al. put it, “A theory of human phonetic categorization may need to be no more (and no less) complex than that required to explain the behavior of these quail” (p. 1197). The gestural explanation for compensation for coarticulation effects (Mann, 1980) was challenged by a general auditory mechanism as well. In Mann’s experiment, the perceptual shift on the /da/-/ga/ continuum induced by the preceding /al/ versus /ar/ context was explained by reference to articulatory gestures. However, Lotto and Kluender (1998) found a similar shift when the preceding context consisted of nonspeech sounds mimicking the spectral characteristics of the actual syllables (e.g., tone glides). Thus, the acoustic composition of the context, and in particular its spectral contrast with the following syllable, rather than an underlying reference to abstract articulatory gestures, was able to account for Mann’s context effect (but see Fowler, Brown, & Mann, 2000, for a subsequent multimodal challenge to the auditory account).

However, auditory theories have been criticized for lacking in theoretical content. Auditory accounts are indeed largely based on counterarguments (and counterevidence) to the motor and gestural theories, rather than resting on a clear set of falsifiable principles (Diehl et al., 2004 ). While it is clear that a great deal of phenomena previously believed to require a gestural account can be explained within an arguably simpler auditory framework, it remains to be seen whether auditory theories can provide a satisfactory explanation for the entire class of phenomena in which the many-to-one puzzle has been observed (e.g., Pardo & Remez, 2006 ).

Pecking rates at test for positive stimuli (/dVs/) and negative stimuli (all others) for one of the quail in Kluender et al.’s ( 1987 ) study in eight vowel contexts. The test session was preceded by a learning phase in which the quail learned to discriminate /dVs/ syllables (i.e., syllables starting with /d/ and ending with /s/, with a varying intervocalic vowel) from /bVs/ and /gVs/ syllables, with four different intervocalic vowels not used in the test phase. During learning, the quail was rewarded for pecking in response to /d/-initial syllables (positive trials) but not to /b/- and /g/-initial syllables (negative trials). The figure shows that, at test, the quail pecked substantially more to positive than negative syllables, even though these syllables contained entirely new vowels, that is, vowels leading to different formant transitions with the initial consonant than those experienced during the learning phase. (Reprinted from Kluender, K. R., Diehl, R. L., & Killeen, P. R. [1987]. Japanese Quail can form phonetic categories. Science , 237 , 1195–1197 [Figure 1], by permission of the National Academy of Sciences.)

Top-Down Theories

This rubric and the following one (bottom-up theories) review theories of speech perception broadly construed. They are broadly construed in that they consider phonemic categorization, the scope of the narrowly construed theories, in the context of its interface with lexical knowledge. Although the traditional separation between narrowly and broadly construed theories originates from the respective historical goals of speech perception and spoken-word recognition research (Pisoni & Luce, 1987), an understanding of speech perception cannot be complete without an analysis of the impact of long-term knowledge on early sensory processes (see useful reviews in Goldinger, Pisoni, & Luce, 1996; Jusczyk & Luce, 2002).

The hallmark of top-down approaches to speech perception is that phonetic analysis and categorization can be influenced by knowledge stored in long-term memory, lexical knowledge in particular. As mentioned earlier, phoneme restoration studies (e.g., Warren & Obusek, 1971; Warren & Warren, 1970) showed that word knowledge could affect listeners’ interpretation of what they heard, but they did not provide direct evidence that phonetic categorization per se (i.e., perception, as it was referred to in that literature) was modified by lexical expectations. However, Samuel (1981) demonstrated that auditory acuity was indeed altered when lexical information was available (e.g., “pr*gress” [from “progress”], with * indicating the portion on which auditory acuity was measured) compared to when it was not (e.g., “cr*gress” [from the nonword “crogress”]).

This kind of result (see also, e.g., Ganong, 1980 ; Marslen-Wilson & Tyler, 1980 ; and, more recently, Gow, Segawa, Ahlfors, & Lin, 2008 ) led to conceptualizing the speech system as being deeply interactive, with information flowing not only from bottom to top but also from top down. For example, the TRACE model (more specifically, TRACE II; McClelland & Elman, 1986 ) is an interactive-activation model made of a large number of units organized into three levels: features, phonemes, and words (Fig. 26.7 A). The model includes bottom-up excitatory connections (from features to phonemes and from phonemes to words), inhibitory lateral connections (within each level), and, critically, top-down excitatory connections (from words to phonemes and from phonemes to features). Thus, the activation levels of features, for example, voicing, nasality, and burst, are partly determined by the activation levels of phonemes, and these are partly determined by the activation levels of words. In essence, this architecture places speech perception within a system that allows a given sensory input to yield a different perceptual experience (as opposed to interpretive experience) when it occurs in a word versus a nonword or next to phoneme x versus phoneme y, and so on. TRACE has been shown to simulate a large range of perceptual and psycholinguistic phenomena, for example, categorical perception, cue trading relations, phonetic context effects, compensation for coarticulation, lexical effects on phoneme detection/categorization, segmentation of embedded words, and so on. All this takes place within an architecture that is neither domain nor species specific. Later instantiations of TRACE have been proposed by McClelland ( 1991 ) and Movellan and McClelland ( 2001 ), but all of them preserve the core interactive architecture described in the original model.
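
The flavor of this architecture can be conveyed by a deliberately tiny interactive-activation loop. Everything here (the two-word lexicon, the weights, the number of cycles) is invented for illustration; TRACE proper uses feature-level input, time-aligned copies of units, and different dynamics:

```python
# Toy interactive-activation update in the spirit of TRACE II:
# bottom-up and top-down excitatory links, lateral inhibition within
# a level. Input slightly favors /b/ over /p/ in an ambiguous "b/pa".
phoneme_input = {"b": 0.8, "p": 0.2, "a": 0.9}
lexicon = {"ba": ["b", "a"], "pa": ["p", "a"]}

phon_act = dict(phoneme_input)
word_act = {w: 0.0 for w in lexicon}

for cycle in range(10):
    # Bottom-up: words gather support from their component phonemes.
    for w, phones in lexicon.items():
        word_act[w] += 0.1 * sum(phon_act[p] for p in phones)
    # Lateral inhibition between words (updates in place; a real model
    # would update all units synchronously).
    for w in lexicon:
        rivals = sum(a for v, a in word_act.items() if v != w)
        word_act[w] = max(0.0, word_act[w] - 0.05 * rivals)
    # Top-down: active words feed excitation back to their phonemes.
    for w, phones in lexicon.items():
        for p in phones:
            phon_act[p] += 0.05 * word_act[w]

print({w: round(a, 2) for w, a in word_act.items()})
print({p: round(a, 2) for p, a in phon_act.items()})
```

After a few cycles "ba" dominates "pa" and, crucially, top-down feedback has pushed the activation of /b/ above its bottom-up level: the word layer is reshaping activation at the phoneme layer, which is the signature interactive effect the text describes.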

Like TRACE, Grossberg’s Adaptive Resonance Theory (ART; e.g., Grossberg, 1986 ; Grossberg & Myers, 1999 ) suggests that perception emerges from a compromise, or stable state, between sensory information and stored lexical knowledge (Fig. 26.7B ). ART includes items (akin to subphonemic features or feature clusters) and list chunks (combinations of items whose composition is the result of prior learning; e.g., phonemes, syllables, or words). In ART, a sensory input activates items that, in turn, activate list chunks. List chunks feed back to component items, and items back to list chunks again in a bottom-up/top-down cyclic manner that extends over time, ultimately creating stable resonance between a set of items and a list chunk. Both TRACE and ART posit that connections between levels are only excitatory and connections within levels are only inhibitory. In ART, in typical circumstances, attention is directed to large chunks (e.g., words), and hence the content of smaller chunks is generally less readily available. Small mismatches between large chunks and small chunks do not prevent resonance, but large mismatches do. In other words, unlike TRACE, ART does not allow the speech system to “hallucinate” information that is not already there (however, for circumstances in which it could, see Grossberg, 2000a ). Large mismatches lead to the establishment of new chunks, and these gain resonance via subsequent exposure. In doing so, ART provides a solution to the stability-plasticity dilemma, that is, the unwanted erasure of prior learning by more recent learning (Grossberg, 1987 ), also referred to as catastrophic interference (e.g., McCloskey & Cohen, 1989 ).

Thus, like TRACE, ART posits that speech perception results from an online interaction between prelexical and lexical processes. However, ART is more deeply grounded in, and motivated by, biologically plausible neural dynamics, where reciprocal connectivity and resonance states have been observed (e.g., Felleman & Van Essen, 1991). In addition, ART replaces the hierarchical structure of TRACE with a more flexible one, in which tiers self-organize over time through competitive dynamics, as opposed to being predefined. Although sometimes accused of placing too few constraints on empirical expectations (Norris et al., 2000), the functional architecture of ART is thought to be more computationally economical than that of TRACE and more amenable to modeling both real-time and long-term temporal aspects of speech processing (Grossberg, Boardman, & Cohen, 1997).

Bottom-Up Theories

Bottom-up theories describe effects of lexical and sentential knowledge on phoneme categorization as a consequence of postperceptual biases. In this conceptualization, reporting “progress” when presented with “pr*gress” simply reflects a strategic decision to do so and the functionality of a system that is geared toward meaningful communication—we generally hear words rather than nonwords. Here, phonetic analysis itself is incorruptible by lexical or sentential knowledge. It takes place within an autonomous module that receives no feedback from lexical and postlexical layers. In Cutler and Norris’s ( 1979 ) Race model, phoneme identification is the result of a time race between a sublexical route and a lexical route activated in parallel in an entirely bottom-up fashion (Fig. 26.7C ). In normal circumstances, the lexical route is faster, which means that a sensory input that has a match in the lexicon (e.g., “progress”) is usually read out from that route. A nonlexical sensory input (e.g., “crogress”) is read out from the prelexical route. In this model, “pr*gress” is reported as containing the phoneme /o/ because the lexical route receives enough evidence to activate the word “progress” and, being faster, this route determines the response. In contrast, “cr*gress” does not lead to an acceptable match in the lexicon, and hence, readout is performed from the sublexical route, with the degraded phoneme being faithfully reported as degraded.
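The race between the two routes can be caricatured in a few lines of code; the lexicon, the matching rule, and the assumption that the lexical route always wins when a match exists are illustrative simplifications:

```python
# Caricature of the Race model's readout: a lexical and a sublexical route run
# in parallel; whenever the input matches a lexical entry, the (faster) lexical
# route determines the response. Lexicon and matching rule are illustrative.

LEXICON = {"progress"}

def race_readout(heard):
    # Lexical route: a degraded segment (*) matches any phoneme of a known word.
    for word in LEXICON:
        if len(word) == len(heard) and all(h in ("*", w) for h, w in zip(heard, word)):
            return word               # lexical readout: the phoneme is "restored"
    # Sublexical route: no lexical match, so the input is reported faithfully.
    return heard

race_readout("pr*gress")   # lexical route wins: "progress"
race_readout("cr*gress")   # sublexical route: "cr*gress", * stays degraded
```

The key property of this architecture is that the degraded segment itself is never repaired; the apparent restoration is a matter of which route supplies the response.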

Simplified architecture of ( A ) TRACE, ( B ) ART, ( C ) Race, ( D ) FLMP, and ( E ) Merge. Layers are labeled consistently across models for comparability. Excitatory connections are denoted by arrows. Inhibitory connections are denoted by closed black circles.

Massaro’s Fuzzy Logical Model of Perception (FLMP; Massaro, 1987 , 1996 ; Oden & Massaro, 1978 ) also exhibits a bottom-up architecture, in which various sources of sensory input—for example, auditory, visual—contribute to speech perception without any feedback from the lexical level (Fig. 26.7D ). In FLMP, acoustic-phonetic features are activated multimodally and each feature accumulates a certain level of activation (on a continuous 0-to-1 scale), reflecting the degree of certainty that the feature has appeared in the signal. The profile of features’ activation levels is then compared against a prototypical profile of activation for phonemes stored in memory. Phoneme identification occurs when the match between the actual and prototypical profiles reaches a level determined by goodness-of-fit algorithms. Critically, the match does not need to be perfect to lead to identification; thus, there is no need for lexical top-down feedback. Prelexical and lexical sources of information are then integrated into a conscious percept. Although the extent to which the integration stage can be considered a true instantiation of bottom-up processes is a matter for debate (Massaro, 1996 ), FLMP also predicts that auditory acuity of * is fundamentally comparable in “pr*gress” and “cr*gress”—like the Race model and unlike top-down theories.
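FLMP's multiplicative integration and relative goodness rule can be sketched as follows; the response alternatives, the two sources (auditory, visual), and the support values are hypothetical:

```python
# FLMP-style integration sketch: each response alternative gets independent
# support from each source (truth values in [0, 1]); supports are multiplied
# and then normalized across alternatives (the relative goodness rule).
# The sources and numbers are hypothetical.

def product(values):
    out = 1.0
    for v in values:
        out *= v
    return out

def flmp(support):
    """support: {alternative: [auditory_support, visual_support, ...]}"""
    combined = {alt: product(vals) for alt, vals in support.items()}
    total = sum(combined.values())
    return {alt: v / total for alt, v in combined.items()}

# Mildly /ba/-like audio paired with clearly /da/-like visual information:
probs = flmp({"ba": [0.6, 0.1], "da": [0.4, 0.9]})
# The clear visual source dominates the ambiguous auditory one.
```

Because the sources enter the computation independently and multiplicatively, a highly informative source can outweigh an ambiguous one without any feedback between them, which is the model's signature behavior.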

From an architectural point of view, integration between sublexical and lexical information is handled differently by Norris et al.’s ( 2000 ) Merge model. In Merge, the phoneme layer is duplicated into an input layer and a decision layer (Fig. 26.7E ). The phoneme input layer feeds forward to the lexical layer (with no top-down connections) and the phoneme decision layer receives input from both the phoneme input layer and the lexical layer. The phoneme decision layer is the place where phonemic and lexical inputs are integrated and where standard lexical phenomena arise (e.g., Ganong, 1980 ; Samuel, 1981 ). While both FLMP and Merge produce a decision by integrating unaltered lexical and sublexical information, the input received from the lexical level differs in the two models. In FLMP, lexical activation is relatively independent from the degree of activation of its component phonemes, whereas, in Merge, lexical activation is directly influenced by the pattern of activation sent upward by the phoneme input layer. While Merge has been criticized for exhibiting a contorted architecture (Gow, 2000 ; Samuel, 2000 ), being ecologically improbable (e.g., Grossberg, 2000b ; Montant, 2000 ; Stevens, 2000 ), and being simply a late instantiation of FLMP (Massaro, 2000 ; Oden, 2000 ), it has gathered the attention of both speech-perception and spoken-word-recognition scientists around a question that is as yet unanswered.
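The contrast with top-down models can be made concrete with a toy feedforward sketch in the spirit of Merge: lexical activation is computed from the phoneme input layer and merged at a separate decision layer, leaving the input layer untouched. Units and weights are hypothetical:

```python
# Feedforward integration sketch in the spirit of Merge: lexical activation is
# driven bottom-up by the phoneme input layer, and a separate phoneme decision
# layer merges both sources; nothing feeds back to the input layer.
# Units and weights are hypothetical.

def merge_decision(phoneme_input, lexical_weights):
    """phoneme_input: {phoneme: bottom-up evidence};
    lexical_weights: {word: {phoneme: connection weight}}"""
    # Lexical layer: purely bottom-up activation from the phoneme input layer.
    lexical = {w: sum(phoneme_input.get(p, 0.0) * wt for p, wt in ph.items())
               for w, ph in lexical_weights.items()}
    # Decision layer: integrates untouched phoneme input with lexical support.
    decision = {}
    for p, evidence in phoneme_input.items():
        lex = sum(act * lexical_weights[w].get(p, 0.0) for w, act in lexical.items())
        decision[p] = evidence + 0.5 * lex
    return decision

# Ambiguous /o/ vs /a/: "progress" boosts /o/ at the decision layer while the
# input-layer evidence itself is unchanged.
out = merge_decision({"o": 0.5, "a": 0.5}, {"progress": {"o": 1.0}})
```

Lexical effects thus arise at the decision stage alone, which is how the model accounts for results like Ganong (1980) without granting the lexicon any influence over phonetic analysis itself.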

Bayesian Theories

Despite important differences in functional architecture between top-down and bottom-up models, both classes of models agree that speech perception involves distinct levels of representations (e.g., features, phonemes, words), multiple lexical activation, lexical competition, integration (of some sort) between actual sensory input and lexical expectations, and corrective mechanisms (of some sort) to handle incompleteness or uncertainty in the input. A radically different class of models based on optimal Bayesian inference has recently emerged as an alternative to the ones mentioned earlier—recently in psycholinguistics at least. These models eschew the concept of lexical activation altogether, sometimes doing away with the bottom-up/top-down debate itself—or at a minimum blurring the boundaries between the two mechanisms. For instance, in their Shortlist B model, Norris and McQueen ( 2008 ) have replaced activation with the concepts of likelihood and probability, which are seen as better approximations of actual (i.e., imperfect) human performance in the face of actual (i.e., complex and variable) speech input. The appeal of Bayesian computations is substantial because output (or posterior) probabilities, for example, probability that a word will be recognized, are estimated by tabulating both confirmatory and disconfirmatory evidence accumulated over past instances, as opposed to being tied to fixed activation thresholds (Fig. 26.8 ). In particular, Shortlist B has replaced discrete input categories such as features and phonemes with phoneme likelihoods calculated from actual speech data. Because they are derived from real speech, the phoneme likelihoods vary from instance to instance and as a function of the quality of the input and the phonetic context. Thus, while noisier, these probabilities are a better reflection of the type of challenge faced by the speech system in everyday conditions. 
They also allow the model to provide a single account for speech phenomena that usually require distinct ad-hoc mechanisms in other models. A general criticism leveled against Bayesian models, however, concerns the legitimacy of their priors , that is, the set of assumptions used to determine initial probabilities before any evidence has been gathered (e.g., how expected is a word or a phoneme a priori). Because priors can be difficult to establish, their arbitrariness or the modeler’s own biases can have substantial effects on the model’s outcome. Likewise, compared to the models reviewed earlier, models based on Bayesian inference often lead to less straightforward hypotheses, which makes their testability somewhat limited—even though their performance level in terms of replicating known patterns of data is usually high.

Main Bayesian equation in Shortlist B (Norris & McQueen, 2008 ). P(word i |evidence) is the conditional probability of a specific word ( word i ) having been heard given the available (intact or degraded) input ( evidence ). P(word i ) represents the listener’s prior belief, before any perceptual evidence has been accumulated, that word i will be present in the input. P(word i ) can be approximated from lexical frequencies and contextual variables. The critical term of the equation is P(evidence|word i ) , which is the likelihood of the evidence given word i , that is, the product of the probabilities of the sublexical units (e.g., phonemes) making up word i . This term is important because it acknowledges and takes into account the variability of the input (noise, ambiguity, idiosyncratic realization, etc.) in the input-to-representation mapping process. The probability of word i so calculated is then compared to that of all other words in the lexicon ( n ). Thus, Bayesian inference provides an index of word recognition that considers both lexical and sublexical factors as well as the complexity of a real and variable input.
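The equation described in the caption can be sketched numerically; the candidate words, priors, and per-phoneme likelihoods below are hypothetical toy values:

```python
# Toy Bayesian word recognition in the spirit of Shortlist B: the posterior of
# each candidate word is its prior times the product of per-phoneme evidence
# likelihoods, normalized over all candidates. All numbers are hypothetical.

def posterior(priors, likelihoods):
    """priors: {word: P(word)}; likelihoods: {word: [P(evidence | unit), ...]}"""
    unnorm = {}
    for word, prior in priors.items():
        like = 1.0
        for p in likelihoods[word]:
            like *= p                  # P(evidence | word) as a product over units
        unnorm[word] = prior * like
    z = sum(unnorm.values())           # normalization over the candidate set
    return {w: v / z for w, v in unnorm.items()}

# A degraded /o/ lowers the evidence for both candidates equally, so the
# posterior is decided by the prior: the real word dominates the nonword.
post = posterior(priors={"progress": 1e-4, "crogress": 1e-9},
                 likelihoods={"progress": [0.9, 0.9, 0.4, 0.9],
                              "crogress": [0.9, 0.9, 0.4, 0.9]})
```

Note how degraded or ambiguous input is handled without any corrective mechanism: uncertainty simply lowers the likelihood term, and the prior does the rest.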

Tailoring Speech Perception: Learning and Relearning

The literature reviewed so far suggests that perceiving speech involves a set of highly sophisticated processing skills and structures. To what extent are these skills and structures in place at birth? Of particular interest in the context of early theories of speech perception is the way in which speech perception and speech production develop relative to each other and the degree to which perceptual capacities responsible for subtle phonetic discrimination (e.g., voicing distinction) are present in prelinguistic infants. Eimas, Siqueland, Jusczyk, and Vigorito ( 1971 ) showed that 1-month-old infants perceive a voicing-based /ba/-/pa/ continuum categorically, just as adults do. Similarly, like adults (Mattingly, Liberman, Syrdal, & Halwes, 1971 ), young infants show a dissociation between categorical perception with speech and continuous perception with matched nonspeech (Eimas, 1974 ). Infants also seem to start off with an open-ended perceptual system, allowing them to discriminate a wide range of subtle phonetic contrasts—far more contrasts than they will be able to discriminate in adulthood (e.g., Aslin, Werker, & Morgan, 2002 ; Trehub, 1976 ). There is therefore strong evidence that fine speech-perception skills are in place early in life—at least well before the onset of speech production—and operational with minimal, if any, exposure to ambient speech. These findings have led to the idea that speech-specific mechanisms are part of the human biological endowment and have been taken as evidence for the innateness of language, or at least some of its perceptual aspects (Eimas et al., 1971 ). In that sense, an infant has very little to learn about speech perception. If anything, attuning to one’s native language is rather a matter of losing sensitivity to (or unlearning ) phonetic contrasts that have little communicative value for that particular language, for example, the /r/-/l/ distinction for Japanese listeners.

However, the idea that infants are born with a universal discrimination device operating according to a use-it-or-lose-it principle has not gone unchallenged. For instance, on closer examination, discrimination capacities at the end of the first year appear far less acute and far less universal than expected (e.g., Lacerda & Sundberg, 2001). Likewise, discrimination of irrelevant contrasts does not wane as systematically and as fully as the theory would have it (e.g., Polka, Colantonio, & Sundara, 2001). For example, Bowers, Mattys, and Gage (2009) showed that language-specific phonemes learned in early childhood but never heard or produced subsequently, as would be the case for young children of temporarily expatriate parents, can be relearned relatively easily even decades later (Fig. 26.9A). Thus, discriminatory attrition is not as widespread and severe as previously believed, suggesting that the representations of phonemes from “forgotten” languages, that is, those we stop practicing early in life, may be more deeply engraved in our long-term memory than we think.

By and large, however, the literature on early speech perception indicates that infants possess fine language-oriented auditory skills from birth as well as impressive capacities to learn from the ambient auditory scene during the first year of life (Fig. 26.10). Auditory deprivation during that period (e.g., otitis media; delay prior to cochlear implantation) can have severe consequences on speech perception and later language development (e.g., Clarkson, Eimas, & Marean, 1989; Mody, Schwartz, Gravel, & Ruben, 1999), possibly due to a general decrease of attention to sounds (e.g., Houston, Pisoni, Kirk, Ying, & Miyamoto, 2003). However, even in such circumstances, partial sensory information is often available through the visual channel (facial and lip information), which might explain the relative resilience of basic speech-perception skills to auditory deprivation. Indeed, Kuhl and Meltzoff (1982) showed that, as early as 4 months of age, infants show a preference for matched audiovisual inputs (e.g., audio /a/ with visual /a/) over mismatched inputs (e.g., audio /a/ with visual /i/). Even more strikingly, infants around that age seem to integrate discrepant audiovisual information following the typical McGurk pattern observed in adults (Rosenblum, Schmuckler, & Johnson, 1997). These results suggest that the multimodal (or amodal) nature of speech perception, a central tenet of Massaro’s Fuzzy Logical Model of Perception (FLMP; cf. Massaro, 1987), is present early in life and operates without much prior experience with sound-gesture association. Although the strength of the McGurk effect is lower in infants than in adults (e.g., Massaro, Thompson, Barron, & Laren, 1986; McGurk & MacDonald, 1976), early cross-modal integration is often taken as evidence for gestural theories of speech perception and as a challenge to auditory theories.

A question of growing interest concerns the flexibility of the speech-perception system when it is faced with an unstable or changing input. Can the perceptual categories learned during early infancy be undone or retuned to reflect a new environment? The issue of perceptual (re)learning is central to research on second-language (L2) perception and speech perception in degraded conditions. Evidence for a speech-perception-sensitive period during the first year of life (Trehub, 1976 ) suggests that attuning to new perceptual categories later on should be difficult and perhaps not as complete as it is for categories learned earlier. Late learning of L2 phonetic contrasts (e.g., /r/-/l/ distinction for Japanese L1 speakers) has indeed been shown to be slow, effortful, and imperfect (e.g., Logan, Lively, & Pisoni, 1991 ). However, even in those conditions, learning appears to transfer to tokens produced by new talkers (Logan et al., 1991 ) and, to some degree, to production (Bradlow, Pisoni, Akahane-Yamada, & Tohkura, 1997 ). Successful learning of L2 contrasts is not systematically observed, however. For example, Bowers et al. ( 2009 ) found no evidence that English L1 speakers could learn to discriminate Zulu contrasts (e.g., /b/-//) or Hindi contrasts (e.g., /t/ vs. /˛/) even after 30 days of daily training (Fig. 26.9 B ). Thus, although possible, perceptual learning of L2 contrasts is greatly constrained by the age of L2 exposure, the nature and duration of training, and the phonetic overlap between the L1 and L2 phonetic inventories (e.g., Best, 1994 ; Kuhl, 2000 ).

Perceptual learning of accented L1 and noncanonical speech follows the same general patterns as L2 learning, but it usually leads to faster and more complete retuning (e.g., Bradlow & Bent, 2008; Clarke & Garrett, 2004). A reason for this difference is that, while L2 contrast learning involves the formation of new perceptual categories, whose boundaries are sometimes in direct conflict with L1 categories, accented L1 learning “simply” involves retuning existing perceptual categories, often by broadening their mapping range. This latter feature makes perceptual learning of accented speech a special instance of the more general debate on the episodic versus abstract nature of phonemic and lexical representations. At issue here is whether phonemic and lexical representations consist of a collection of episodic instances in which surface details are preserved (voice, accent, speech rate) or, alternatively, single, abstract representations (i.e., one for each phoneme and one for each word). That at least some surface details of words are preserved in long-term memory is undeniable (e.g., Goldinger, 1998). The current debate focuses on (1) whether lexical representations include both indexical (e.g., voice quality) and allophonic (e.g., phonological variants) details (Luce, McLennan, & Charles-Luce, 2003); (2) whether such details are of a lexical nature (i.e., stored within the lexicon), rather than sublexical (i.e., stored at the subphonemic, phonemic, or syllabic level; McQueen, Cutler, & Norris, 2006); (3) the online time course of episodic trace activation (e.g., Luce et al., 2003; McLennan, Luce, & Charles-Luce, 2005); (4) the mechanisms responsible for consolidating newly learned instances or new perceptual categories (e.g., Fenn, Nusbaum, & Margoliash, 2003); and (5) the possible generalization to other types of noncanonical speech, such as disordered speech (e.g., Lee, Whitehall, & Coccia, 2009; Mattys & Liss, 2008).

( A ) AX discrimination scores over 30 consecutive days (50% chance level; feedback provided) for Zulu contrasts (e.g., /b/-//) and Hindi contrasts (e.g., /t/ vs. /˛/) by DM, a 20-year-old, male, native English speaker who was exposed to Zulu from 4 to 8 years of age but never heard Zulu subsequently. Note DM’s improvement with the Zulu contrasts over the 30 days, in sharp contrast with his inability to learn the Hindi contrasts. ( B ) Performance on the same task by native English speakers with no prior exposure to Zulu or Hindi. (Adapted with permission from Bowers, J. S., Mattys, S. L., & Gage, S. H., [2009]. Preserved implicit knowledge of a forgotten childhood language. Psychological Science , 20 , 1064–1069 [partial Figure 1].)

Summary of key developmental landmarks for speech perception and speech production in the first year of life. (Reprinted from Kuhl, P. K. [2004]. Early language acquisition: Cracking the speech code. Nature Reviews Neuroscience , 5 , 831–843 [Figure 1], by permission of the Nature Publishing Group.)

According to Samuel and Kraljic (2009), the aforementioned literature should be distinguished from a more recent strand of research that focuses on the specific variables affecting perceptual learning and the mechanisms linking such variables to perception. In particular, Norris, McQueen, and Cutler (2003) found that lexical information is a powerful source of perceptual recalibration. For example, Dutch listeners repeatedly exposed to a word containing a sound halfway between two existing phonemes (e.g., witlo*, where * is ambiguous between /f/ and /s/, with witlof a Dutch word, “chicory,” and witlos a nonword) subsequently perceived a /f/-/s/ continuum as biased in the direction of the lexically induced percept (more /f/ than /s/ in the witlo* case). Likewise, Bertelson, Vroomen, and de Gelder (2003) found that repeated exposure to McGurk audiovisual stimuli (e.g., audio /a*a/ and visual /aba/ leading to the auditory perception of /aba/) biased the subsequent perception of an audio-only /aba/-/ada/ continuum in the direction of the visually induced percept. Although visually induced perceptual learning seems to be less long-lasting than its lexically induced counterpart (Vroomen, van Linden, Keetels, de Gelder, & Bertelson, 2004), the Norris et al. and Bertelson et al. studies demonstrate that even the mature perceptual system can show a certain degree of flexibility when it is faced with a changing auditory environment.

Comparison of speech recognition error rate by machines (ASR) and humans. The logarithmic scale on the Y axis shows that ASR performance is approximately one order of magnitude behind human performance across various speech materials (ASR error rate for telephone conversation: 43%). The data were collated by Lippmann ( 1997 ). (Reprinted from Moore, R. K. [ 2007 ]. Spoken language processing by machine. In G. Gaskell [Ed.], Oxford handbook of psycholinguistics (pp. 723–738). Oxford, UK: Oxford University Press [Figure 44.6], by permission of Oxford University Press.)

Speech Recognition by Machines

This chapter was mainly concerned with human speech recognition (HSR), but technological advances in the past decades have allowed the topic of speech perception and recognition to become an economically profitable challenge for engineers and applied computer scientists. A complete review of Automatic Speech Recognition’s (ASR) historical background, issues, and state of the art is beyond the scope of this chapter. However, a brief analysis of ASR in the context of the key topics in HSR reviewed earlier reveals interesting commonalities as well as divergences among the preoccupations and goals of the two fields.

Perhaps the most notable difference between HSR and ASR is their ultimate aim. Whereas HSR aims to provide a description of how the speech system works (processes, representations, functional architecture, biological plausibility), ASR aims to deliver speech transcriptions as error-free as possible, regardless of the biological and cognitive validity of the underlying algorithms. The success of ASR is typically measured by the percentage of words correctly identified from speech samples varying in their acoustic and lexical complexity. While increasing computer capacity and speed have allowed ASR performance to improve substantially since the early systems of the 1970s (e.g., Jelinek, 1976 ; Klatt, 1977 ), ASR accuracy is still about an order of magnitude behind its HSR counterpart (Moore, 2007 ; see Fig. 26.11 ).

What is the cause of the enduring performance gap between ASR and HSR? Given that the basic constraints imposed by the signal (sequentiality, continuity, variability) are the same for humans and machines, it is tempting to conclude that the gap between ASR and HSR will not be bridged until the algorithms of the former resemble those of the latter. And currently, they do not. The architecture of most ASR systems is almost entirely data driven: Its structure is expressed in terms of a network of sequence probabilities calculated over large corpora of natural speech (and their supervised transcription). The ultimate goal of the corpora, or training data, is to provide a database of acoustic-phonetic information sufficiently large that an appropriate match can be found for any input sound sequence. The larger the corpora, the tighter the fit between the input and the acoustic model (e.g., triphones instantiated in hidden Markov models, HMM, cf. Rabiner & Juang, 1993 ), and the lower the ASR system’s error rate (Lamel, Gauvain, & Adda, 2000 ). By that logic, hours of training corpora, not human-machine avatars, are the solution for increased accuracy, giving support to the controversial assertion that human models have so far hindered rather than promoted ASR progress (Jelinek, 1985 ). However, Moore and Cutler ( 2001 ) estimated that increasing corpus sizes from their current average capacity (1,000 hours or less, which is the equivalent of the average hearing time of a 2-year-old) to 10,000 hours (average hearing time of a 10-year-old) would only drop the ASR error rate to 12%.

Thus, a data-driven approach to speech recognition is constrained by more than just the size of the training data set. For example, the lexical and syntactic content of the training data often determines the application for which the ASR system is likely to perform best. Domain-specific systems (e.g., banking transactions by phone) generally reach high recognition accuracy levels even when they are fed continuous speech produced by various speakers, whereas domain-general systems (e.g., speech-recognition packages on personal computers) often have to compromise on the number of speakers they can recognize and/or training time in order to be effective (Evermann et al., 2005). Therefore, one of the current stumbling blocks of ASR systems is language modeling (as opposed to acoustic modeling), that is, the extent to which the systems include higher order knowledge (syntax, semantics, pragmatics) from which inferences can be made to refine the mapping between the signal and the acoustic model. Existing ASR language models are fairly simple, drawing upon the distributional methods of acoustic models in that they simply provide the probability of all possible word sequences based on their occurrences in the training corpora. In that sense, an ASR system can predict that “necklace” is a possible completion of “The burglar stole the…” because of its relatively high transitional probability in the corpora, not because of the semantic knowledge that burglars tend to steal valuable items, and not because of the syntactic knowledge that a noun phrase typically follows a transitive verb. Likewise, ASR systems rarely include the kind of lexical feedback hypothesized in HSR models like TRACE (McClelland & Elman, 1986) and ART (Grossberg, 1986). Like Merge (Norris et al., 2000), ASR systems only allow lexical information and the language model to influence the relative weights of activated candidates, but not the fit between the signal and the acoustic model (Scharenborg, Norris, ten Bosch, & McQueen, 2005).
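The kind of purely distributional word prediction described above can be sketched with a maximum-likelihood bigram model; the toy corpus and the resulting probabilities are hypothetical:

```python
# Sketch of the kind of n-gram language model used in ASR: word-sequence
# probabilities come purely from counts in a training corpus, not from any
# syntactic or semantic knowledge. Corpus and counts are hypothetical.
from collections import Counter

corpus = ("the burglar stole the necklace . "
          "the burglar stole the car . "
          "the thief stole the necklace .").split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_next(word, prev):
    # Maximum-likelihood bigram estimate: count(prev, word) / count(prev).
    return bigrams[(prev, word)] / unigrams[prev]

# "necklace" is predicted after "the" only because of its corpus frequency.
p_next("necklace", "the")
```

Real systems smooth these estimates and use longer histories, but the principle is the same: the model knows nothing about burglars or noun phrases, only about co-occurrence counts.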

While the remaining performance gap between ASR and HSR is widely recognized in the ASR literature, there seems to be no clear consensus on the direction to take in order to reduce it (Moore, 2007 ). Given today’s ever-expanding computer power, increasing the size of training corpora is probably the easiest way of gaining a few percentage points in accuracy, at least in the short term. More radical solutions are also being envisaged, however. For example, attempts are being made to build more linguistically plausible acoustic models by using phonemes (as opposed to di/triphone HMMs) as basic segmentation units (Ostendorf, Digilakis, & Kimball, 1996 ; Russell, 1993 ) or by preserving and exploiting fine acoustic detail in the signal instead of treating it as noise (Carlson & Hawkins, 2007 ; Moore & Maier, 2007 ).

The scientific study of speech perception started in the early 1950s under the impetus of research carried out at the Haskins Laboratories, following the development of the Pattern Playback device. This machine allowed Franklin S. Cooper and his colleagues to visualize speech in the form of a decomposable spectrogram and, reciprocally, to create artificial speech by sounding out the spectrogram. Contemporary speech perception research is both a continuation of its earlier preoccupations with the building blocks of speech perception and a departure from them. On the one hand, the quest for universal units of speech perception and attempts to crack the many-to-one mapping code are still going strong. Still alive, too, is the debate about the involvement of gestural knowledge in speech perception, reignited recently by neuroimaging techniques and the discovery of mirror neurons. On the decline are the ideas that speech is special with respect to audition and that infants are born with speech- and species-specific perceptual capacities. On the other hand, questions have necessarily spread beyond the sublexical level, following the assumption that decoding the sensory input must be investigated in the context of the entirety of the language system—or, at the very least, some of its phonologically related components. Indeed, lexical feedback, online or learning related, has been shown to modulate the perceptual experience of an otherwise unchanged input. Likewise, what used to be treated as speech surface details (e.g., indexical variations), and commonly filtered out for the sake of modeling simplicity, are now more fully acknowledged as being preserved during encoding, embedded in long-term representations, and used during retrieval. 
Speech-perception research in the coming decades is likely to expand its interest not only to the rest of the language system but also to domain-general cognitive functions such as attention and memory as well as practical applications (e.g., ASR) in the field of artificial intelligence. At the same time, researchers have become increasingly concerned with the external validity of their models. Attempts to enhance the ecological contribution of speech research are manifest in a sharp increase in studies using natural speech (conversational, accented, disordered) as the front end of their models.




Freedom of Speech

[ Editor’s Note: The following new entry by Jeffrey W. Howard replaces the former entry on this topic by the previous author. ]

Human beings have significant interests in communicating what they think to others, and in listening to what others have to say. These interests make it difficult to justify coercive restrictions on people’s communications, plausibly grounding a moral right to speak (and listen) to others that is properly protected by law. That there ought to be such legal protections for speech is uncontroversial among political and legal philosophers. But disagreement arises when we turn to the details. What are the interests or values that justify this presumption against restricting speech? And what, if anything, counts as an adequate justification for overcoming the presumption? This entry is chiefly concerned with exploring the philosophical literature on these questions.

The entry begins by distinguishing different ideas to which the term “freedom of speech” can refer. It then reviews the variety of concerns taken to justify freedom of speech. Next, the entry considers the proper limits of freedom of speech, cataloging different views on when and why restrictions on communication can be morally justified, and what considerations are relevant when evaluating restrictions. Finally, it considers the role of speech intermediaries in a philosophical analysis of freedom of speech, with special attention to internet platforms.

1. What is Freedom of Speech?


In the philosophical literature, the terms “freedom of speech”, “free speech”, “freedom of expression”, and “freedom of communication” are mostly used equivalently. This entry will follow that convention, notwithstanding the fact that these formulations evoke subtly different phenomena. For example, it is widely understood that artistic expressions, such as dancing and painting, fall within the ambit of this freedom, even though they don’t straightforwardly seem to qualify as speech, which intuitively connotes some kind of linguistic utterance (see Tushnet, Chen, & Blocher 2017 for discussion). Still, they plainly qualify as communicative activity, conveying some kind of message, however vague or open to interpretation it may be.

Yet the extension of “free speech” is not fruitfully specified through conceptual analysis alone. The quest to distinguish speech from conduct, for the purpose of excluding the latter from protection, is notoriously thorny (Fish 1994: 106), despite some notable attempts (such as Greenawalt 1989: 58ff). As John Hart Ely writes concerning Vietnam War protesters who incinerated their draft cards, such activity is “100% action and 100% expression” (1975: 1495). It is only once we understand why we should care about free speech in the first place—the values it instantiates or serves—that we can evaluate whether a law banning the burning of draft cards (or whatever else) violates free speech. It is the task of a normative conception of free speech to offer an account of the values at stake, which in turn can illuminate the kinds of activities wherein those values are realized, and the kinds of restrictions that manifest hostility to those values. For example, if free speech is justified by the value of respecting citizens’ prerogative to hear many points of view and to make up their own minds, then banning the burning of draft cards to limit the views to which citizens will be exposed is manifestly incompatible with that purpose. If, in contrast, such activity is banned as part of a generally applied ordinance restricting fires in public, it would likely raise no free-speech concerns. (For a recent analysis of this issue, see Kramer 2021: 25ff).

Accordingly, the next section discusses different conceptions of free speech that arise in the philosophical literature, each oriented to some underlying moral or political value. Before turning to the discussion of those conceptions, some further preliminary distinctions will be useful.

First, we can distinguish between the morality of free speech and the law of free speech. In political philosophy, one standard approach is to theorize free speech as a requirement of morality, tracing the implications of such a theory for law and policy. Note that while this is the order of justification, it need not be the order of investigation; it is perfectly sensible to begin by studying an existing legal protection for speech (such as the First Amendment in the U.S.) and then asking what could justify such a protection (or something like it).

But of course morality and law can diverge. The most obvious way they can diverge is when the law is unjust. Existing legal protections for speech, embodied in the positive law of particular jurisdictions, may be misguided in various ways. In other words, a justified legal right to free speech, and the actual legal right to free speech in the positive law of a particular jurisdiction, can come apart. In some cases, positive legal rights might protect too little speech. For example, some jurisdictions’ speech laws make exceptions for blasphemy, such that criminalizing blasphemy does not breach the legal right to free speech within that legal system. But clearly one could argue that a justified legal right to free speech would not include any such exception. In other cases, positive legal rights might perhaps protect too much speech. Consider the fact that, as a matter of U.S. constitutional precedent, the First Amendment broadly protects speech that expresses or incites racial or religious hatred. Plainly we could agree that this is so as a matter of positive law while disagreeing about whether it ought to be so. (This is most straightforwardly true if we are legal positivists. These distinctions are muddied by moralistic theories of constitutional interpretation, which enjoin us to interpret positive legal rights in a constitutional text partly through the prism of our favorite normative political theory; see Dworkin 1996.)

Second, we can distinguish rights-based theories of free speech from non-rights-based theories. For many liberals, the legal right to free speech is justified by appealing to an underlying moral right to free speech, understood as a natural right held by all persons. (Some use the term human right equivalently—e.g., Alexander 2005—though the appropriate usage of that term is contested.) The operative notion of a moral right here is that of a claim-right (to invoke the influential analysis of Hohfeld 1917); it thereby correlates to moral duties held by others (paradigmatically, the state) to respect or protect the right. Such a right is natural in that it exerts normative force independently of whether anyone thinks it does, and regardless of whether it is codified into the law. A tyrannical state that imprisons dissidents acts unjustly, violating moral rights, even if there is no legal right to freedom of expression in its legal system.

For others, the underlying moral justification for free speech law need not come in the form of a natural moral right. For example, consequentialists might favor a legal right to free speech (on, e.g., welfare-maximizing grounds) without thinking that it tracks any underlying natural right. Or consider democratic theorists who have defended legal protections for free speech as central to democracy. Such theorists may think there is an underlying natural moral right to free speech, but they need not (especially if they hold an instrumental justification for democracy). Or consider deontologists who have argued that free speech functions as a kind of side-constraint on legitimate state action, requiring that the state always justify its decisions in a manner that respects citizens’ autonomy (Scanlon 1972). This theory does not cast free speech as a right, but rather as a principle that forbids the creation of laws that restrict speech on certain grounds. In the Hohfeldian analysis (Hohfeld 1917), such a principle may be understood as an immunity rather than a claim-right (Scanlon 2013: 402). Finally, some “minimalists” (to use a designation in Cohen 1993) favor legal protection for speech principally in response to government malice, corruption, and incompetence (see Schauer 1982; Epstein 1992; Leiter 2016). Such theorists need not recognize any fundamental moral right, either.

Third, among those who do ground free speech in a natural moral right, there is scope for disagreement about how tightly the law should mirror that right (as with any right; see Buchanan 2013). It is an open question what the precise legal codification of the moral right to free speech should involve. A justified legal right to freedom of speech may not mirror the precise contours of the natural moral right to freedom of speech. A raft of instrumental concerns enters the downstream analysis of what any justified legal right should look like; hence a defensible legal right to free speech may protect more speech (or indeed less speech) than the underlying moral right that justifies it. For example, even if the moral right to free speech does not protect so-called hate speech, such speech may still merit legal protection in the final analysis (say, because it would be too risky to entrust states with the power to limit those communications).

2. Justifying Free Speech

I will now examine several of the morally significant considerations taken to justify freedom of expression. Note that while many theorists have built whole conceptions of free speech out of a single interest or value alone, pluralism in this domain remains an option. It may well be that a plurality of interests serves to justify freedom of expression, properly understood (see, influentially, Emerson 1970 and Cohen 1993).

Suppose a state bans certain books on the grounds that it does not want us to hear the messages or arguments contained within them. Such censorship seems to involve some kind of insult or disrespect to citizens—treating us like children instead of adults who have a right to make up our own minds. This insight is fundamental in the free speech tradition. On this view, the state wrongs citizens by arrogating to itself the authority to decide what messages they ought to hear. That is so even if the state thinks that the speech will cause harm. As one author puts it,

the government may not suppress speech on the ground that the speech is likely to persuade people to do something that the government considers harmful. (Strauss 1991: 335)

Why are restrictions on persuasive speech objectionable? For some scholars, the relevant wrong here is a form of disrespect for citizens’ basic capacities (Dworkin 1996: 200; Nagel 2002: 44). For others, the wrong here inheres in a violation of the kind of relationship the state should have with its people: namely, that it should always act from a view of them as autonomous, and so entitled to make up their own minds (Scanlon 1972). It would simply be incompatible with a view of ourselves as autonomous—as authors of our own lives and choices—to grant the state the authority to pre-screen which opinions, arguments, and perspectives we should be allowed to think through, allowing us access only to those of which it approves.

This position is especially well-suited to justify some central doctrines of First Amendment jurisprudence. First, it justifies the claim that freedom of expression especially implicates the purposes with which the state acts. There are all sorts of legitimate reasons why the state might restrict speech (so-called “time, place, and manner” restrictions)—for example, noise curfews in residential neighborhoods, which do not raise serious free speech concerns. Yet when the state restricts speech with the purpose of manipulating the communicative environment and controlling the views to which citizens are exposed, free speech is directly affronted (Rubenfeld 2001; Alexander 2005; Kramer 2021). To be sure, purposes are not all that matter for free speech theory. For example, the chilling effects of otherwise justified speech regulations (discussed below) are seldom intended. But they undoubtedly matter.

Second, this view justifies the related doctrines of content neutrality and viewpoint neutrality (see G. Stone 1983 and 1987). Content neutrality is violated when the state bans discussion of certain topics (“no discussion of abortion”), whereas viewpoint neutrality is violated when the state bans advocacy of certain views (“no pro-choice views may be expressed”). Both affront free speech, though viewpoint discrimination is especially egregious and so even harder to justify. While listener autonomy theories are not the only theories that can ground these commitments, they are in a strong position to account for their plausibility. Note that while these doctrines are central to the American approach to free speech, they are less central to other states’ jurisprudence (see A. Stone 2017).

Third, this approach helps us see that free speech is potentially implicated whenever the state seeks to control our thoughts and the processes through which we form beliefs. Consider an attempt to ban Marx’s Capital. As Marx is deceased, he is probably not wronged through such censorship. But even if one held idiosyncratic views about posthumous rights, such that Marx were wronged, it would be curious to think this was the central objection to such censorship. Those with the gravest complaint would be the living adults who have the prerogative to read the book and make up their own minds about it. Indeed, free speech may even be implicated if the state banned watching sunsets or playing video games on the grounds that it disapproved of the thoughts to which such experiences might give rise (Alexander 2005: 8–9; Kramer 2021: 22).

These arguments emphasize the noninstrumental imperative of respecting listener autonomy. But there is an instrumental version of the view. Our autonomy interests are not merely respected by free speech; they are promoted by an environment in which we learn what others have to say. Our interest in access to information is served by exposure to a wide range of viewpoints about both empirical and normative issues (Cohen 1993: 229), which help us reflect on what goals to choose and how best to pursue them. These informational interests are monumental. As Raz suggests, if we had to choose whether to express our own views on some question, or listen to the rest of humanity’s views on that question, we would choose the latter; it is our interest as listeners in the public good of a vibrant public discourse that, he thinks, centrally justifies free speech (1991).

Such an interest in acquiring justified beliefs, or in accessing truth, can be defended as part of a fully consequentialist political philosophy. J.S. Mill famously defends free speech instrumentally, appealing to its epistemic benefits in On Liberty. Mill believes that, given our fallibility, we should routinely keep an open mind as to whether a seemingly false view may actually be true, or at least contain some valuable grain of truth. And even where a proposition is manifestly false, there is value in allowing its expression so that we can better apprehend why we take it to be false (1859: chapter 2), enabled through discursive conflict (cf. Simpson 2021). Mill’s argument focuses especially on the benefits to audiences:

It is not on the impassioned partisan, it is on the calmer and more disinterested bystander, that this collision of opinions works its salutary effect. (1859: chapter 2, p. 94)

These views are sometimes associated with the idea of a “marketplace of ideas”, whereby the open clash of views inevitably leads to the correct ones winning out in debate. Few in the contemporary literature hold such a strong teleological thesis about the consequences of unrestricted debate (e.g., see Brietzke 1997; cf. Volokh 2011). Much evidence from behavioral economics and social psychology, as well as insights about epistemic injustice from feminist epistemology, strongly suggests that human beings’ rational powers are seriously limited. Smug confidence in the marketplace of ideas belies this. Yet it is doubtful that Mill held such a strong teleological thesis (Gordon 1997). Mill’s point was not that unrestricted discussion necessarily leads people to acquire the truth. Rather, it was that unrestricted discussion is simply the best mechanism available for ascertaining the truth, relative to alternatives in which some arbiter declares what he sees as true and suppresses what he sees as false (see also Leiter 2016).

Note that Mill’s views on free speech in chapter 2 of On Liberty are not simply the application of the general liberty principle defended in chapter 1 of that work; his view is not that speech is anodyne and therefore seldom runs afoul of the harm principle. The reason a separate argument is necessary in chapter 2 is precisely that he is carving out a partial qualification of the harm principle for speech (on this issue see Jacobson 2000, Schauer 2011b, and Turner 2014). On Mill’s view, plenty of harmful speech should still be allowed. Imminently dangerous speech, where there is no time for discussion before harm eventuates, may be restricted; but where there is time for discussion, it must be allowed. Hence Mill’s famous example that vociferous criticism of corn dealers as

starvers of the poor…ought to be unmolested when simply circulated through the press, but may justly incur punishment when delivered orally to an excited mob assembled before the house of a corn dealer. (1859: chapter 3, p. 100)

The point is not that such speech is harmless; it’s that the instrumental benefits of permitting its expression—and exposing its falsehood through public argument—justify the (remaining) costs.

Many authors have unsurprisingly argued that free speech is justified by our interests as speakers. This family of arguments emphasizes the role of speech in the development and exercise of our personal autonomy—our capacity to be the reflective authors of our own lives (Baker 1989; Redish 1982; Rawls 2005). Here an emphasis on freedom of expression is apt; we have an “expressive interest” (Cohen 1993: 224) in declaring our views—about the good life, about justice, about our identity, and about other aspects of the truth as we see it.

Our interests in self-expression may not always depend on the availability of a willing audience; we may have interests simply in shouting from the rooftops to declare who we are and what we believe, regardless of who else hears us. Hence communications to oneself—for example, in a diary or journal—are plausibly protected from interference (Redish 1992: 30–1; Shiffrin 2014: 83, 93; Kramer 2021: 23).

Yet we also have distinctive interests in sharing what we think with others. Part of how we develop our conceptions of the good life, forming judgments about how to live, is precisely through talking through the matter with others. This “deliberative interest” is directly served through opportunities to tell others what we think, so that we can learn from their feedback (Cohen 1993). Such encounters also offer opportunities to persuade others to adopt our views, and indeed to learn through such discussions who else already shares our views (Raz 1991).

Speech also seems like a central way in which we develop our capacities. This, too, is central to J.S. Mill’s defense of free speech: free speech enables people to explore different perspectives and points of view (1859). Hence it seems that when children engage in speech, to figure out what they think and to use their imagination to try out different ways of being in the world, they are directly engaging this interest. That explains the intuition that children, and not just adults, merit at least some protection under a principle of freedom of speech.

Note that while it is common to refer to speaker autonomy , we could simply refer to speakers’ capacities. Some political liberals hold that an emphasis on autonomy is objectionably Kantian or otherwise perfectionist, valorizing autonomy as a comprehensive moral ideal in a manner that is inappropriate for a liberal state (Cohen 1993: 229; Quong 2011). For such theorists, an undue emphasis on autonomy is incompatible with ideals of liberal neutrality toward different comprehensive conceptions of the good life (though cf. Shiffrin 2014: 81).

If free speech is justified by the importance of our interests in expressing ourselves, this justifies negative duties to refrain from interfering with speakers without adequate justification. Just as with listener theories, a strong presumption against content-based restrictions, and especially against viewpoint discrimination, is a clear requirement of the view. For the state to restrict citizens’ speech on the grounds that it disfavors what they have to say would affront the equal freedom of citizens. Imagine the state were to disallow the expression of Muslim or Jewish views, but allow the expression of Christian views. This would plainly transgress the right to freedom of expression, by valuing certain speakers’ interests in expressing themselves over others.

Many arguments for the right to free speech center on its special significance for democracy (Cohen 1993; Heinze 2016; Heyman 2009; Sunstein 1993; Weinstein 2011; Post 1991, 2009, 2011). It is possible to defend free speech on the noninstrumental ground that it is necessary to respect agents as democratic citizens. To restrict citizens’ speech is to disrespect their status as free and equal moral agents, who have a moral right to debate and decide the law for themselves (Rawls 2005).

Alternatively (or additionally), one can defend free speech on the instrumental ground that free speech promotes democracy, or whatever values democracy is meant to serve. So, for example, suppose the purpose of democracy is the republican one of establishing a state of non-domination between relationally egalitarian citizens; free speech can then be defended as promoting that relation (Whitten 2022; Bonotti & Seglow 2022). Or suppose that democracy is valuable because of its role in promoting just outcomes (Arneson 2009), or because it tends to track those outcomes in a manner that is publicly justifiable (Estlund 2008), or because it is otherwise epistemically valuable (Landemore 2013).

Perhaps free speech doesn’t merely respect or promote democracy; another framing is that it is constitutive of it (Meiklejohn 1948, 1960; Heinze 2016). As Rawls says: “to restrict or suppress free political speech…always implies at least a partial suspension of democracy” (2005: 254). On this view, to be committed to democracy just is, in part, to be committed to free speech. Deliberative democrats famously contend that voting merely punctuates a larger process defined by a commitment to open deliberation among free and equal citizens (Gutmann & Thompson 2008). Such unrestricted discussion is marked not by considerations of instrumental rationality and market forces, but rather, as Habermas puts it, by “the unforced force of the better argument” (1992 [1996: 37]). One crucial way in which free speech might be constitutive of democracy is if it serves as a legitimation condition. On this view, without a process of open public discourse, the outcomes of the democratic decision-making process lack legitimacy (Dworkin 2009; Brettschneider 2012: 75–78; Cohen 1997; Heinze 2016).

Those who justify free speech on democratic grounds may view this as a special application of a more general insight. For example, Scanlon’s listener theory (discussed above) contends that the state must always respect its citizens as capable of making up their own minds (1972)—a position with clear democratic implications. Likewise, Baker is adamant that both free speech and democracy are justified by the same underlying value of autonomy (2009). And while Rawls sees the democratic role of free speech as worthy of emphasis, he is clear that free speech is one of several basic liberties that enable the development and exercise of our moral powers: our capacities for a sense of justice and for the rational pursuit of a life plan (2005). In this way, many theorists see the continuity between free speech and our broader interests as moral agents as a virtue, not a drawback (e.g., Kendrick 2017).

Even so, some democracy theorists hold that democracy has a special role in a theory of free speech, such that political speech in particular merits special protection (for an overview, see Barendt 2005: 154ff). One consequence of such views is that contributions to public discourse on political questions merit greater protection under the law (Sunstein 1993; cf. Cohen 1993: 227; Alexander 2005: 137–8). For some scholars, this may reflect instrumental anxieties about the special danger that the state will restrict the political speech of opponents and dissenters. But for others, an emphasis on political speech seems to reflect a normative claim that such speech is genuinely of greater significance, meriting greater protection, than other kinds of speech.

While conventional in the free speech literature, it is artificial to separate out our interests as speakers, listeners, and democratic citizens. Communication, and the thinking that feeds into it and that it enables, invariably engages our interests and activities across all these capacities. This insight is central to Seana Shiffrin’s groundbreaking thinker-based theory of freedom of speech, which seeks to unify the range of considerations that have informed the traditional theories (2014). Like other theories (e.g., Scanlon 1978, Cohen 1993), Shiffrin’s theory is pluralist in the range of interests it appeals to. But it offers a unifying framework that explains why this range of interests merits protection together.

On Shiffrin’s view, freedom of speech is best understood as encompassing both freedom of communication and freedom of thought, which while logically distinct are mutually reinforcing and interdependent (Shiffrin 2014: 79). Shiffrin’s account involves several profound claims about the relation between communication and thought. A central contention is that “free speech is essential to the development, functioning, and operation of thinkers” (2014: 91). This is, in part, because we must often externalize our ideas to articulate them precisely and hold them at a distance where we can evaluate them (p. 89). It is also because we work out what we think largely by talking it through with others. Such communicative processes may be monological, but they are typically dialogical; speaker and listener interests are thereby mutually engaged in an ongoing manner that cannot be neatly disentangled, as ideas are ping-ponged back and forth. Moreover, such discussions may concern democratic politics—engaging our interests as democratic citizens—but of course they need not. Aesthetics, music, local sports, the existence of God—these all are encompassed (2014: 92–93). Pace prevailing democratic theories,

One’s thoughts about political affairs are intrinsically and ex ante no more and no less central to the human self than thoughts about one’s mortality or one’s friends. (Shiffrin 2014: 93)

The other central aspect of Shiffrin’s view appeals to the necessity of communication for successfully exercising our moral agency. Sincere communication enables us

to share needs, emotions, intentions, convictions, ambitions, desires, fantasies, disappointments, and judgments. Thereby, we are enabled to form and execute complex cooperative plans, to understand one another, to appreciate and negotiate around our differences. (2014: 1)

Without clear and precise communication of the sort that only speech can provide, we cannot cooperate to discharge our collective obligations. Nor can we exercise our normative powers (such as consenting, waiving, or promising). Our moral agency thus depends upon protected channels through which we can relay our sincere thoughts to one another. The central role of free speech is to protect those channels, by ensuring agents are free to share what they are thinking without fear of sanction.

The thinker-based view has wide-ranging normative implications. For example, by emphasizing the continuity of speech and thought (a connection also noted in Macklem 2006 and Gilmore 2011), Shiffrin’s view powerfully explains the First Amendment doctrine that compelled speech also constitutes a violation of freedom of expression. Traditional listener- and speaker-focused theories seemingly cannot explain what is fundamentally objectionable about forcing someone to declare a commitment to something, as with children compelled to pledge allegiance to the American flag (West Virginia State Board of Education v. Barnette 1943). “What seems most troubling about the compelled pledge”, Shiffrin writes,

is that the motive behind the regulation, and its possible effect, is to interfere with the autonomous thought processes of the compelled speaker. (2014: 94)

Further, Shiffrin’s view explains why a concern for free speech does not merely correlate to negative duties not to interfere with expression; it also supports positive responsibilities on the part of the state to educate citizens, encouraging and supporting their development and exercise as thinking beings (2014: 107).

Consider briefly one further family of free speech theories, which appeal to the role of toleration or self-restraint. On one argument, freedom of speech is important because it develops our character as liberal citizens, helping us tame our illiberal impulses. The underlying idea of Lee Bollinger’s view is that liberalism is difficult; we recurrently face temptation to punish those who hold contrary views. Freedom of speech helps us to practice the general ethos of toleration in a manner that fortifies our liberal convictions (1986). Deeply offensive speech, like pro-Nazi speech, is protected precisely because toleration in these enormously difficult cases promotes “a general social ethic” of toleration more generally (1986: 248), thereby restraining unjust exercises of state power overall. This consequentialist argument treats the protection of offensive speech not as a tricky borderline case, but as “integral to the central functions of the principle of free speech” (1986: 133). It is precisely because tolerating evil speech involves “extraordinary self-restraint” (1986: 10) that it works its salutary effects on society generally.

The idea of self-restraint arises, too, in Matthew Kramer’s recent defense of free speech. Like listener theories, Kramer’s strongly deontological theory condemns censorship aimed at protecting audiences from exposure to misguided views. At the core of his theory is the thesis that the state’s paramount moral responsibility is to furnish the social conditions that serve the development and maintenance of citizens’ self-respect and respect for others. The achievement of such an ethically resilient citizenry, on Kramer’s view, has the effect of neutering the harmfulness of countless harmful communications. “Securely in a position of ethical strength”, the state “can treat the wares of pornographers and the maunderings of bigots as execrable chirps that are to be endured with contempt” (Kramer 2021: 147). In contrast, in a society where the state has failed to do its duty of inculcating a robust liberal-egalitarian ethos, the communication of illiberal creeds may well pose a substantial threat. Yet for the state then to react by banning such speech is

overweening because with them the system’s officials take control of communications that should have been defused (through the system’s fulfillment of its moral obligations) without prohibitory or preventative impositions. (2021: 147)

(One might agree with Kramer that this is so, but diverge by arguing that the state—having failed in its initial duty—ought to take measures to prevent the harms that flow from that failure.)

These theories are striking in that they assume that a chief task of free speech theory is to explain why harmful speech ought to be protected. This is in contrast to those who think that the chief task of free speech theory is to explain our interests in communicating with others, treating the further issue of whether (wrongfully) harmful communications should be protected as an open question, with different reasonable answers available (Kendrick 2017). In this way, toleration theories—alongside a lot of philosophical work on free speech—seem designed to vindicate the demanding American legal position on free speech, one unshared by virtually all other liberal democracies.

One final family of arguments for free speech appeals to the danger of granting the state powers it may abuse. On this view, we protect free speech chiefly because if we didn’t, it would be far easier for the state to silence its political opponents and enact unjust policies: a state with censorial powers is likely to abuse them. As Richard Epstein notes, focusing on the American case,

the entire structure of federalism, divided government, and the system of checks and balances at the federal level shows that the theme of distrust has worked itself into the warp and woof of our constitutional structure.

“The protection of speech”, he writes, “…should be read in light of these political concerns” (Epstein 1992: 49).

This view is not merely a restatement of the democracy theory; it does not affirm free speech as an element of valuable self-governance. Nor does it reduce to the uncontroversial thought that citizens need freedom of speech to check the behavior of fallible government agents (Blasi 1977). One need not imagine human beings to be particularly sinister to insist (as democracy theorists do) that the decisions of those entrusted with great power be subject to public discussion and scrutiny. The argument under consideration here is more pessimistic about human nature. It is an argument about the slippery slope that we create even when enacting (otherwise justified) speech restrictions; we set an unacceptable precedent for future conduct by the state (see Schauer 1985). While this argument is theoretical, there is clearly historical evidence for it, as in the manifold cases in which bans on dangerous sedition were used to suppress legitimate war protest. (For a sweeping canonical study of the uses and abuses of speech regulations during wartime, with a focus on U.S. history, see G. Stone 2004.)

These instrumental concerns could potentially justify legal protection for free speech. But they do not attempt to justify why we should care about free speech as a positive moral ideal (Shiffrin 2014: 83n); they are, in Cohen’s helpful terminology, “minimalist” rather than “maximalist” (Cohen 1993: 210). Accordingly, they cannot explain why free speech is something that even the most trustworthy, morally competent administrations, with little risk of corruption or degeneration, ought to respect. Of course, minimalists will deny that accounting for speech’s positive value is a requirement of a theory of free speech, and will insist that critiquing them for this omission begs the question.

Pluralists may see instrumental concerns as valuably supplementing or qualifying noninstrumental views. For example, instrumental concerns may play a role in justifying deviations between the moral right to free communication, on the one hand, and a properly specified legal right to free communication, on the other. Suppose that there is no moral right to engage in certain forms of harmful expression (such as hate speech), and that there is in fact a moral duty to refrain from such expression. Even so, it does not follow automatically that such a duty ought to be legally enforced. Concerns about the dangers of granting the state such power plausibly militate against the enforcement of at least some of our communicative duties—at least in those jurisdictions that lack robust and competently administered liberal-democratic safeguards.

This entry has canvassed a range of views about what justifies freedom of expression, with particular attention to theories that conceive free speech as a natural moral right. Clearly, the proponents of such views believe that they succeed in this justificatory effort. But others dissent, doubting that the case for a bona fide moral right to free speech comes through. Let us briefly note the nature of this challenge from free speech skeptics , exploring a prominent line of reply.

The challenge from skeptics is generally understood as that of showing that free speech is a special right. As Leslie Kendrick notes,

the term “special right” generally requires that a special right be entirely distinct from other rights and activities and that it receive a very high degree of protection. (2017: 90)

(Note that this usage is not to be confused with the alternative usage of “special right”, referring to conditional rights arising out of particular relationships; see Hart 1955.)

Take each aspect in turn. First, to vindicate free speech as a special right, it must serve some distinctive value or interest (Schauer 2015). Suppose free speech were just an implication of a general principle not to interfere in people’s liberty without justification. As Joel Feinberg puts it, “Liberty should be the norm; coercion always needs some special justification” (1984: 9). In that case, while there still might be contingent, historical reasons to single speech out in law as worthy of protection (Alexander 2005: 186), such reasons would not track anything especially distinctive about speech as an underlying moral matter. Second, to count as a special right, free speech must be robust in what it protects, such that only a compelling justification can override it (Dworkin 2013: 131). This captures the conviction, prominent among American constitutional theorists, that “any robust free speech principle must protect at least some harmful speech despite the harm it may cause” (Schauer 2011b: 81; see also Schauer 1982).

If the task of justifying a moral right to free speech requires surmounting both hurdles, it is a tall order. Skeptics about a special right to free speech doubt that the order can be met, and so deny that a natural moral right to freedom of expression can be justified (Schauer 2015; Alexander & Horton 1983; Alexander 2005; Husak 1985). But these theorists may be demanding too much (Kendrick 2017). Start with the claim that free speech must be distinctive. We can accept that free speech must be more than simply one implication of a general presumption of liberty. But need it be wholly distinctive? Consider the thesis that free speech is justified by our autonomy interests—interests that also justify other rights, such as freedom of religion and association. Is it a problem if free speech is justified by interests that are continuous with, or overlap with, interests that justify other rights? Pace the free speech skeptics, maybe not. So long as such claims deserve special recognition, and are worth distinguishing by name, this may be enough (Kendrick 2017: 101). Many of the views canvassed above share normative bases with other important rights. For example, Rawls is clear that he thinks all the basic liberties constitute

essential social conditions for the adequate development and full exercise of the two powers of moral personality over a complete life. (Rawls 2005: 293)

The debate, then, is whether such a shared basis is a theoretical virtue (or at least theoretically unproblematic) or whether it is a theoretical vice, as the skeptics avow.

As for the claim that free speech must be robust, protecting harmful speech, “it is not necessary for a free speech right to protect harmful speech in order for it to be called a free speech right” (Kendrick 2017: 102). We do not tend to think that religious liberty must protect harmful religious activities for it to count as a special right. So it would be strange to insist that the right to free speech must meet this burden to count as a special right. Most of the theorists mentioned above take themselves to be offering views that protect quite a lot of harmful speech. Yet we can question whether this feature is a necessary component of their views, or whether we could imagine variations without this result.

3. Justifying Speech Restrictions

When, and why, can restrictions on speech be justified? It is common in public debate on free speech to hear the provocative claim that free speech is absolute. But the plausibility of such a claim depends on exactly what is meant by it. If it is understood to mean that no communications between humans can ever be restricted, no one in the philosophical debate holds such a view. When I threaten to kill you unless you hand me your money; when I offer to bribe the security guard to let me access the bank vault; when I disclose insider information that the company in which you’re heavily invested is about to go bust; when I defame you by falsely posting online that you’re a child abuser; when I endanger you by labeling a drug as safe despite its potentially fatal side-effects; when I reveal your whereabouts to assist a murderer intent on killing you—across all these cases, communications may be uncontroversially restricted. But there are different views as to why.

To help organize such views, consider a set of distinctions influentially defended by Schauer (from 1982 onward). The first category involves uncovered speech: speech that does not even presumptively fall within the scope of a principle of free expression. Many of the speech-acts just canvassed, such as the speech involved in making a threat or engaging in insider trading, plausibly count as uncovered speech. As the U.S. Supreme Court has said of fighting words (e.g., insults calculated to provoke a street fight),

such utterances are no essential part of any exposition of ideas, and are of such slight social value as a step to truth that any benefit that may be derived from them is clearly outweighed by the social interest in order and morality. ( Chaplinsky v. New Hampshire 1942)

The general idea here is that some speech simply has negligible—and often no—value as free speech, in light of its utter disconnection from the values that justify free speech in the first place. (For discussion of so-called “low-value speech” in the U.S. context, see Sunstein 1989 and Lakier 2015.) Accordingly, when such low-value speech is harmful, it is particularly easy to justify its curtailment. Hence the Court’s view that “the prevention and punishment of [this speech] have never been thought to raise any Constitutional problem”. For legislation restricting such speech, the U.S. Supreme Court applies a “rational basis” test, which is very easy to meet, as it simply asks whether the law is rationally related to a legitimate state interest. (Note that it is widely held that it would still be impermissible to selectively ban low-value speech on a viewpoint-discriminatory basis—e.g., if a state only banned fighting words from left-wing activists while allowing them from right-wing activists.)

Schauer’s next category concerns speech that is covered but unprotected. This is speech that engages the values that underpin free speech; yet the countervailing harm of the speech justifies its restriction. In such cases, while there is real value in such expression as free speech, that value is outweighed by competing normative concerns (or even, as we will see below, by the very values that underpin free speech). In U.S. constitutional jurisprudence, this category encompasses those extremely rare cases in which restrictions on political speech pass the “strict scrutiny” test, whereby narrow restrictions on high-value speech can be justified due to the compelling state interests thereby served. Consider Holder v. Humanitarian Law Project 2010, in which the Court held that an NGO’s legal advice to a terrorist organization on how to pursue peaceful legal channels was legitimately criminalized under a counter-terrorism statute. While such speech had value as free speech (at least on one interpretation of this contested ruling), the imperative of counter-terrorism justified its restriction. (Arguably, commercial speech, while sometimes called low-value speech by scholars, falls into the covered but unprotected category. Under U.S. law, legislation restricting it receives “intermediate scrutiny” by courts—requiring restrictions to be narrowly drawn to advance a substantial government interest. Such a test suggests that commercial speech has bona fide free-speech value, making it harder to justify regulations on it than regulations on genuinely low-value speech like fighting words. It simply doesn’t have as much free-speech value as categories like political speech, religious speech, or press speech, all of which trigger the strict scrutiny test when restricted.)

As a philosophical matter, we can reasonably disagree about what speech qualifies as covered but unprotected (and need not treat the verdicts of the U.S. Supreme Court as philosophically decisive). For example, consider politically-inflected hate speech, which advances repugnant ideas about the inferior status of certain groups. One could concur that there is substantial free-speech value in such expression, precisely because it involves the sincere expression of views about central questions of politics and justice (however misguided those views doubtless are). Yet one could nevertheless hold that such speech should not be protected in virtue of the substantial harms to which it can lead. In such cases, the free-speech value is outweighed. Many scholars who defend the permissibility of legal restrictions on hate speech hold such a view (e.g., Parekh 2012; Waldron 2012). (More radically, one could hold that such speech’s value is corrupted by its evil, such that it qualifies as genuinely low-value; Howard 2019a.)

The final category of speech encompasses expression that is covered and protected. To declare that speech is protected just is to conclude that it is immune from restriction. A preponderance of human communications fall into this category. This does not mean that such speech can never be regulated; content-neutral time, place, and manner regulations (e.g., prohibiting loud nighttime protests) can certainly be justified (G. Stone 1987). But such regulations must not be viewpoint discriminatory; they must apply even-handedly across all forms of protected speech.

Schauer’s taxonomy offers a useful organizing framework for how we should think about different forms of speech. Where does it leave the claim that free speech is absolute? The possibility of speech that is covered but unprotected suggests that free speech should sometimes be restricted on account of rival normative concerns. Of course, one could contend that this category, while logically possible, is substantively an empty set. Such a position would be absolutist in a certain sense (holding that where free-speech values are engaged by expression, no countervailing values can ever be weighty enough to override them), while still granting the permissibility of restrictions on speech that does not engage the free-speech values. (For a recent critique of Schauer’s framework, arguing that governmental designation of some speech as low-value is incompatible with the very ideal of free speech, see Kramer 2021: 31.)

In what follows, this entry will focus on Schauer’s second category: speech that is covered by a free speech principle, but is nevertheless unprotected because of the harms it causes. How do we determine what speech falls into this category? How, in other words, do we determine the limits of free speech? Unsurprisingly, this is where most of the controversy lies.

Most legal systems that protect free speech recognize that the right has limits. Consider, for example, international human rights law, which emphatically protects the freedom of speech as a fundamental human right while also affirming specific restrictions on certain seriously harmful speech. Article 19 of the International Covenant on Civil and Political Rights declares that “[e]veryone shall have the right to freedom of expression; this right shall include freedom to seek, receive and impart information and ideas of all kinds”—but then immediately notes that this right “carries with it special duties and responsibilities”. The subsequent ICCPR article proceeds to endorse legal restrictions on “advocacy of national, racial or religious hatred that constitutes incitement to discrimination, hostility or violence”, as well as speech constituting “propaganda for war” (ICCPR). While such restrictions would plainly be struck down as unconstitutional affronts to free speech in the U.S., this more restrictive approach prevails in most liberal democracies’ treatment of harmful speech.

Set aside the legal issue for now. How should we think about determining the limits of the moral right to free speech? Those seeking to justify limits on speech tend to appeal to one of two strategies (Howard and Simpson forthcoming). The first strategy appeals to the importance of balancing free speech against other moral values when they come into conflict. This strategy involves external limits on free speech. (The next strategy, discussed below, invokes free speech itself, or the values that justify it, as limit-setting rationales; it thus involves internal limits on free speech.)

A balancing approach recognizes a moral conflict between unfettered communication and external values. Consider again the case of hate speech, understood as expression that attacks members of socially vulnerable groups as inferior or dangerous. On all of the theories canvassed above, there are grounds for thinking that restrictions on hate speech are prima facie in violation of the moral right to free speech. Banning hate speech to prevent people from hearing ideas that might incline them to bigotry plainly seems to disrespect listener autonomy. Further, even when speakers are expressing prejudiced views, they are still engaging their autonomous faculties. Certainly, they are expressing views on questions of public political concern, even false ones. And as thinkers they are engaged in the communication of sincere testimony to others. On many of the leading theories, then, the values underpinning free speech seem to militate against bans on hate speech.

Even so, other values matter. Consider, for example, the value of upholding the equal dignity of all citizens. A central insight of critical race theory is that public expressions of white supremacy, for example, attack and undermine that equal dignity (Matsuda, Lawrence, Delgado, & Crenshaw 1993). On Jeremy Waldron’s view (2012), hate speech is best understood as a form of group defamation, launching spurious attacks on others’ reputations and thereby undermining their standing as respected equals in their own community (relatedly, see Beauharnais v. Illinois 1952).

Countries that ban hate speech, accordingly, are plausibly understood not as opposed to free speech, but as recognizing the importance of balancing it against other values with which it conflicts. Such balancing can be understood in different ways. In European human rights law, for example, the relevant idea is that the right to free speech is balanced against other rights; the relevant task, accordingly, is to specify what counts as a proportionate balance between these rights (see Alexy 2003; J. Greene 2021).

For others, the very idea of balancing rights undermines their deontic character. This alternative framing holds that the balancing occurs before we specify what rights are; on this view, we balance interests against each other, and only once we’ve undertaken that balancing do we proceed to define what our rights protect. As Scanlon puts it,

The only balancing is balancing of interests. Rights are not balanced, but are defined, or redefined, in the light of the balance of interests and of empirical facts about how these interests can best be protected. (2008: 78)

This balancing need not come in the form of some crude consequentialism; otherwise it would be acceptable to limit the rights of the few to secure trivial benefits for the many. On a contractualist moral theory such as Scanlon’s, the test is to assess the strength of any given individual’s reason to engage in (or access) the speech, against the strength of any given individual’s reason to oppose it.

Note that those who engage in balancing need not give up on the idea of viewpoint neutrality; they can accept that, as a general principle, the state should not restrict speech on the grounds that it disapproves of its message and dislikes that others will hear it. The point, instead, is that this commitment is defeasible; it can be overridden.

One final comment is apt. Those who are keen to balance free speech against other values tend to be motivated by the concern that speech can cause harm, either directly or indirectly (on this distinction, see Schauer 1993). But to justify restrictions on speech, it is not sufficient (and perhaps not even necessary) to show that such speech imposes or risks imposing harm. The crucial point is that the speech is wrongful (or, perhaps, wrongfully harmful or risky), breaching a moral duty that speakers owe to others. Mere offensiveness, by contrast, is widely regarded as insufficient: very few in the free speech literature think that the offensiveness of speech alone can justify restricting it. Even Joel Feinberg, who thinks offensiveness can sometimes be grounds for restricting conduct, makes a sweeping exception for

[e]xpressions of opinion, especially about matters of public policy, but also about matters of empirical fact, and about historical, scientific, theological, philosophical, political, and moral questions. (1985: 44)

And in many cases, offensive speech may be actively salutary, as when racists are offended by defenses of racial equality (Waldron 1987). Accordingly, despite how large it looms in public debate, discussion of offensive speech will not play a major role in the discussion here.

We saw that one way to justify limits on free speech is to balance it against other values. On that approach, free speech is externally constrained. A second approach, in contrast, appeals to internal constraints: the very values that justify free speech themselves determine its own limits. This is a revisionist approach, since it contends, against orthodox thinking, that a commitment to free speech values can itself support the restriction of speech (see Howard and Simpson forthcoming). This move—justifying restrictions on speech by appealing to the values that underpin free speech—is now prevalent in the philosophical literature (for an overview, see Barendt 2005: 1ff).

Consider, for example, the claim that free speech is justified by concerns of listener autonomy. On such a view, as we saw above, autonomous citizens have interests in exposure to a wide range of viewpoints, so that they can decide for themselves what to believe. But many have pointed out that this is not autonomous citizens’ only interest; they also have interests in not getting murdered by those incited by incendiary speakers (Amdur 1980). Likewise, insofar as being targeted by hate speech undermines the exercise of one’s autonomous capacities, appeal to the underlying value of autonomy could well support restrictions on such speech (Brison 1998; see also Brink 2001). What’s more, if our interests as listeners in acquiring accurate information are undermined by fraudulent information, then restrictions on such information could well be compatible with our status as autonomous; this was one of the insights that led Scanlon to complicate his theory of free speech (1978).

Or consider the theory that free speech is justified because of its role in enabling autonomous speakers to express themselves. But as Japa Pallikkathayil has argued, some speech can intimidate its audiences into staying silent (as with some hate speech), out of fear for what will happen if they speak up (Pallikkathayil 2020). In principle, then, restrictions on hate speech may serve to support the value of speaker expression, rather than undermine it (see also Langton 2018; Maitra 2009; Maitra & McGowan 2007; and Matsuda 1989: 2337). Indeed, among the most prominent claims in feminist critiques of pornography is precisely that it silences women—not merely through its (perlocutionary) effects in inspiring rape, but more insidiously through its (illocutionary) effects in altering the force of the word “no” (see MacKinnon 1984; Langton 1993; West 2004 [2022]; McGowan 2003 and 2019; cf. Kramer 2021: 160ff).

Now consider democracy theories. On the one hand, democracy theorists are adamant that citizens should be free to discuss any proposals, even the destruction of democracy itself (e.g., Meiklejohn 1948: 65–66). On the other hand, it isn’t obvious why citizens’ duties as democratic citizens could not set a limit to their democratic speech rights (Howard 2019a). The Nazi propagandist Goebbels is said to have remarked:

This will always remain one of the best jokes of democracy, that it gave its deadly enemies the means by which it was destroyed. (as quoted in Fox & Nolte 1995: 1)

But it is not clear why this is necessarily so. Why should we insist on a conception of democracy that contains a self-destruct mechanism? Merely stipulating that democracy requires this is not enough (see A. Greene and Simpson 2017).

Finally, consider Shiffrin’s thinker-based theory. Shiffrin’s view is especially well-placed to explain why varieties of harmful communications are protected speech; what the theory values is the sincere transmission of veridical testimony, whereby speakers disclose what they genuinely believe to others, even if what they believe is wrongheaded and dangerous. Yet because the sincere testimony of thinkers is what qualifies some communication for protection, Shiffrin is adamant that lying falls outside the protective ambit of freedom of expression (2014). This, then, sets an internal limit on her own theory (even if she herself opposes the outright prohibition of all lies for reasons of tolerance). The claim that lying falls outside the protective ambit of free speech is itself a recurrent suggestion in the literature (Strauss 1991: 355; Brown 2023). In an era of rampant disinformation, this internal limit is of substantial practical significance.

Suppose the moral right (or principle) of free speech is limited, as most think, such that not all communications fall within its protective ambit (either for external reasons, internal reasons, or both). Even so, it does not follow that laws banning such unprotected speech can be justified all-things-considered. Further moral tests must be passed before any particular policy restricting speech can be justified. This sub-section focuses on the requirement that speech restrictions be proportionate.

The idea that laws implicating fundamental rights must be proportionate is central in many jurisdictions’ constitutional law, as well as in the international law of human rights. As a representative example, consider the specification of proportionality offered by the Supreme Court of Canada:

First, the measures adopted must be carefully designed to achieve the objective in question. They must not be arbitrary, unfair, or based on irrational considerations. In short, they must be rationally connected to the objective. Second, the means, even if rationally connected to the objective in this first sense, should impair “as little as possible” the right or freedom in question[…] Third, there must be a proportionality between the effects of the measures which are responsible for limiting the Charter right or freedom, and the objective which has been identified as of “sufficient importance” ( R v. Oakes 1986).

It is this third element (often called “proportionality stricto sensu”) on which we will concentrate here; this is the focused sense of proportionality that roughly tracks how the term is used in the philosophical literatures on defensive harm and war, as well as (with some relevant differences) criminal punishment. (The strict scrutiny and intermediate scrutiny tests of U.S. constitutional law are arguably variations of the proportionality test; but set aside this complication here. For relevant legal discussion, see Tsesis 2020.)

Proportionality, in the strict sense, concerns the relation between the costs or harms imposed by some measure and the benefits that the measure is designed to secure. The organizing distinction in recent philosophical literature (albeit largely missing in the literature on free speech) is one between narrow proportionality and wide proportionality. While there are different ways to cut up the terrain between these terms, let us stipulatively define them as follows. An interference is narrowly proportionate just in case the intended target of the interference is liable to bear the costs of that interference. An interference is widely proportionate just in case the collateral costs that the interference unintentionally imposes on others can be justified. (This distinction largely follows the literature in just war theory and the ethics of defensive force; see McMahan 2009.) While the distinction is historically absent from free speech theory, it has powerful payoffs in helping to structure this chaotic debate (as argued in Howard 2019a).

So start with the idea that restrictions on communication must be narrowly proportionate. For a restriction to be narrowly proportionate, those whose communications are restricted must be liable to bear their costs, such that they are not wronged by their imposition. One standard way to be liable to bear certain costs is to have a moral duty to bear them (Tadros 2012). So, for example, if speakers have a moral duty to refrain from libel, hate speech, or some other form of harmful speech, they are liable to bear at least some costs involved in the enforcement of that duty. Those costs cannot be unlimited; a policy of executing hate speakers could not plausibly be justified. Typically, in both defensive and punitive contexts, wrongdoers’ liability is determined by their culpability, the severity of their wrong, or some combination of the two. While it is difficult to say in the abstract what the precise maximal cost ceiling is for any given restriction, as it depends hugely on the details, the point is simply that there is some ceiling above which a speech restriction (like any restriction) imposes unacceptably high costs, even on wrongdoers.

Second, for a speech restriction to be justified, we must also show that it would be widely proportionate. Suppose a speaker is liable to bear the costs of some policy restricting her communication, such that she is not wronged by its imposition. It may be that the collateral costs of such a policy would render it unacceptable. One set of costs is chilling effects, the “overdeterrence of benign conduct that occurs incidentally to a law’s legitimate purpose or scope” (Kendrick 2013: 1649). The core idea is that laws targeting unprotected, legitimately proscribed expression may nevertheless end up having a deleterious impact on protected expression. This is because laws are often vague, overbroad, and in any case are likely to be misapplied by fallible officials (Schauer 1978: 699).

Note that if a speech restriction produces chilling effects, it does not follow that the restriction should not exist at all. Rather, concern about chilling effects instead suggests that speech restrictions should be under-inclusive—restricting less speech than is actually harmful—in order to create “breathing space”, or “a buffer zone of strategic protection” (Schauer 1978: 710) for legitimate expression and so reduce unwanted self-censorship. For example, some have argued that even though speech can cause harm recklessly or negligently, we should insist on specific intent as the mens rea of speech crimes in order to reduce any chilling effects that could follow (Alexander 1995: 21–128; Schauer 1978: 707; cf. Kendrick 2013).

But chilling effects are not the only sort of collateral effects to which speech restrictions could lead. Earlier we noted the risk that states might abuse their censorial powers. This, too, could militate in favor of underinclusive speech restrictions. Or the implication could be more radical. Consider the problem that it is difficult to author restrictions on hate speech in a tightly specified way; the language involved is open-ended in a manner that enables states to exercise considerable judgment in deciding what speech-acts, in fact, count as violations (see Strossen 2018). Given the danger that the state will misuse or abuse these laws to punish legitimate speech, some might think this renders their enactment widely disproportionate. Indeed, even if the law were well-crafted and would be judiciously applied by current officials, the point is that those in the future may not be so trustworthy.

Those inclined to accept such a position might simply draw the conclusion that legislatures ought to refrain from enacting laws against hate speech. A more radical conclusion is that the legal right to free speech ought to be specified so that hate speech is constitutionally protected. In other words, we ought to give speakers a legal right to violate their moral duties, since enforcing those moral duties through law is simply too risky. By appealing to this logic, it is conceivable that the First Amendment position on hate speech could be justified all-things-considered—not because the underlying moral right to free speech protects hate speech, but because hate speech must be protected for instrumental reasons of preventing future abuses of power (Howard 2019a).

Suppose certain restrictions on harmful speech can be justified as proportionate, in both the narrow and wide senses. This is still not sufficient to justify them all-things-considered. Additionally, they must be justified as necessary. (Note that some conceptions of proportionality in human rights law encompass the necessity requirement, but this entry follows the prevailing philosophical convention by treating them as distinct.)

Why might restrictions on harmful speech be unnecessary? One of the standard claims in the free speech literature is that we should respond to harmful speech not by banning it, but by arguing back against it. Counter-speech—not censorship—is the appropriate solution. This line of reasoning is old. As John Milton put it in 1644: “Let [Truth] and Falsehood grapple; who ever knew Truth put to the worse in a free and open encounter?” The insistence on counter-speech as the remedy for harmful speech is similarly found, as noted above, throughout chapter 2 of Mill’s On Liberty .

For many scholars, this line of reply is justified by the fact that they think the harmful speech in question is protected by the moral right to free speech. For such scholars, counter-speech is the right response because censorship is morally off the table. For other scholars, the recourse to counter-speech has a plausible and distinct rationale (although it is seldom articulated): its possibility renders legal restrictions unnecessary. And because it is objectionable to use gratuitous coercion, legal restrictions are therefore impermissible (Howard 2019a). Such a view could plausibly justify Mill’s aforementioned analysis in the corn dealer example, whereby censorship is permissible only when there is no time for counter-speech—a view that is also endorsed by the U.S. Supreme Court in Brandenburg v. Ohio (1969).

Whether this argument succeeds depends upon a wide range of further assumptions—about the comparative effectiveness of counter-speech relative to law, and about the burdens that counter-speech imposes on prospective counter-speakers. Supposing that the argument succeeds, it invites a range of further normative questions about the ethics of counter-speech. For example, it is important to ask who has the duty to engage in counter-speech, who its intended audience is, and what specific forms the counter-speech ought to take—especially in order to maximize its persuasive effectiveness (Brettschneider 2012; Cepollaro, Lepoutre, & Simpson 2023; Howard 2021b; Lepoutre 2021; Badano & Nuti 2017). It is also important to ask questions about the moral limits of counter-speech. For example, insofar as publicly shaming wrongful speakers has become a prominent form of counter-speech, it is crucial to interrogate its permissibility (e.g., Billingham and Parr 2020).

This final section canvasses the young philosophical debate concerning freedom of speech on the internet. With some important exceptions (e.g., Barendt 2005: 451ff), philosophical attention to this issue has only recently accelerated (for an excellent edited collection, see Brison & Gelber 2019). There are many normative questions to be asked about the moral rights and obligations of internet platforms. Here are three. First, do internet platforms have moral duties to respect the free speech of their users? Second, do internet platforms have moral duties to restrict (or at least refrain from amplifying) harmful speech posted by their users? And finally, if platforms do indeed have moral duties to restrict harmful speech, should those duties be legally enforced?

The reference to internet platforms is deliberately focused on large-scale social media platforms, through which people can discover and publicly share user-generated content. We set aside other entities such as search engines (Whitney & Simpson 2019), important though they are, simply because the central political controversies, on which philosophical input is most urgent, concern the large social-media platforms.

Consider the question of whether internet platforms have moral duties to respect the free speech of their users. One dominant view in the public discourse holds that the answer is no. On this view, platforms are private entities, and as such enjoy the prerogative to host whatever speech they like. This would arguably be a function of their having free speech rights themselves. Just as the free speech rights of the New York Times give it the authority to publish whatever op-eds it sees fit, the free speech rights of platforms give them the authority to exercise editorial or curatorial judgment about what speech to allow. On this view, if Facebook were to decide to become a Buddhist forum, amplifying the speech of Buddhist users and promoting Buddhist perspectives and ideas, and banning speech promoting other religions, it would be entirely within its moral (and thus proper legal) rights to do so. So, too, if it were to decide to become an atheist forum.

A radical alternative view holds that internet platforms constitute a public forum, a term of art from U.S. free speech jurisprudence used to designate spaces “designed for and dedicated to expressive activities” ( Southeastern Promotions, Ltd. v. Conrad 1975). As Kramer has argued:

social-media platforms such as Facebook and Twitter and YouTube have become public fora. Although the companies that create and run those platforms are not morally obligated to sustain them in existence at all, the role of controlling a public forum morally obligates each such company to comply with the principle of freedom of expression while performing that role. No constraints that deviate from the kinds of neutrality required under that principle are morally legitimate. (Kramer 2021: 58–59)

On this demanding view, platforms’ duties to respect speech are (roughly) identical to the duties of states. Accordingly, if efforts by the state to restrict hate speech, pornography, and public health misinformation (for example) are objectionable affronts to free speech, so too are platforms’ content moderation rules for such content. A more moderate view does not hold that platforms are public forums as such, but holds that government channels or pages qualify as public forums (the claim at issue in Knight First Amendment Institute v. Trump 2019).

Even if we deny that platforms constitute public forums, it is plausible that they perform a governance function of some kind (Klonick 2018). As Jack Balkin has argued, the traditional dyadic model of free speech, which sees it as a relation between speakers and the state, has today plausibly been supplanted by a triadic model, involving a more complex relation between speakers, governments, and intermediaries (2004, 2009, 2018, 2021). If platforms do indeed perform some kind of governance function, it may well trigger responsibilities for transparency and accountability (as with new legislation such as the EU’s Digital Services Act and the UK’s Online Safety Act).

Second, consider the question of whether platforms have a duty to remove harmful content posted by users. Even those who regard them as public forums could agree that platforms may have a moral responsibility to remove illegal, unprotected speech. Yet a dominant view in the public debate has historically defended platforms’ status as mere conduits for others’ speech. This is the current position under U.S. law (as with 47 U.S. Code §230), which broadly exempts platforms from liability for much illegal speech, such as defamation. On this view, platforms are akin to bulletin boards: blame whoever posts wrongful content, but don’t hold the owner of the board responsible.

This view is under strain. Even under current U.S. law, platforms face liability if they fail to remove certain content, such as child sexual abuse material and copyright infringements, suggesting that it is appropriate to demand some accountability for the wrongful content posted by others. An increasing body of philosophical work explores the idea that platforms are indeed morally responsible for removing extreme content. For example, some have argued that platforms have a special responsibility to prevent the radicalization that occurs on their networks, given the ways in which extreme content is amplified to susceptible users (Barnes 2022). Without engaging in moderation (i.e., removal) of harmful content, platforms are plausibly complicit in the wrongful harms perpetrated by users (Howard forthcoming a).

Yet it remains an open question what a responsible content moderation policy ought to involve. Many are tempted by a juridical model, whereby platforms remove speech in accordance with clearly announced rules, with user appeals mechanisms in place to ensure that individual speech decisions are correctly made (critiqued in Douek 2022b). Yet platforms have billions of users and remove millions of pieces of content per week. Accordingly, perfection is not possible. Moving quickly to remove harmful content during a crisis (e.g., Covid misinformation) will inevitably increase the number of false positives (i.e., legitimate speech taken down as collateral damage). The individualistic model of speech decisions adopted by courts is arguably ill-suited to governing online content moderation; as Douek (2021, 2022a) argues, what is needed is an analysis of how the overall system should operate at scale, with a focus on achieving proportionality between benefits and costs. Alternatively, one might double down and insist that the juridical model is appropriate, given the normative significance of speech. And if it is infeasible for social-media companies to meet its demands given their size, then all the worse for social-media companies. On this view, it is they who must bend to meet the moral demands of free speech theory, not the other way around.

Substantial philosophical work needs to be done to deliver on this goal. The work is complicated by the fact that artificial intelligence (AI) is central to the processes of content moderation; human moderators, themselves often subjected to terrible working conditions and long hours, work in conjunction with machine-learning tools to identify and remove content that platforms have restricted. Yet AI systems are notoriously as biased as the data on which they are trained. Further, their “black box” decisions are opaque and cannot easily be understood. Given that countless speech decisions will necessarily be made without human involvement, it is right to ask whether it is reasonable to expect users to accept the deliverances of machines (e.g., see Vredenburgh 2022; Lazar forthcoming a). Note that machine intelligence is used not merely for content moderation, narrowly understood as the enforcement of rules about what speech is allowed. It is also deployed for the broader practice of content curation, determining what speech gets amplified, which raises the question of what normative principles should govern such amplification (see Lazar forthcoming b).

Finally, there is the question of legal enforcement. Showing that platforms have a moral responsibility to engage in content moderation is necessary for justifying its codification into a legal responsibility. Yet it is not sufficient; one could accept that platforms have moral duties to moderate (some) harmful speech while also denying that those moral duties ought to be legally enforced. A strong, noninstrumental version of such a view would hold that while speakers have moral duties to refrain from wrongful speech, and platforms have duties not to host or amplify it, the coercive enforcement of such duties would violate the moral right to freedom of expression. A more contingent, instrumental version of the view would hold that legal enforcement is not impermissible in principle, but that in practice it is simply too risky to grant the state the authority to enforce platforms’ and speakers’ moral duties, given the potential for abuse and overreach.

Liberals who champion the orthodox interpretation of the First Amendment, yet insist on robust content moderation, likely hold one or both of these views. Globally, however, such views seem to be in the minority. Serious legislation now subjects social-media companies to demanding regulation, in the form of such laws as the Digital Services Act in the European Union and the Online Safety Act in the UK. Normatively evaluating such legislation is a pressing task. So, too, is the task of designing normative theories to guide the design of content moderation systems, and the wider governance of the digital public sphere. On both fronts, political philosophers should get back to work.

  • Alexander, Larry [Lawrence], 1995, “Free Speech and Speaker’s Intent”, Constitutional Commentary , 12(1): 21–28.
  • –––, 2005, Is There a Right of Freedom of Expression? , (Cambridge Studies in Philosophy and Law), Cambridge/New York: Cambridge University Press.
  • Alexander, Lawrence and Paul Horton, 1983, “The Impossibility of a Free Speech Principle”, Northwestern University Law Review , 78(5): 1319–1358.
  • Alexy, Robert, 2003, “Constitutional Rights, Balancing, and Rationality”, Ratio Juris , 16(2): 131–140. doi:10.1111/1467-9337.00228
  • Amdur, Robert, 1980, “Scanlon on Freedom of Expression”, Philosophy & Public Affairs , 9(3): 287–300.
  • Arneson, Richard, 2009, “Democracy is Not Intrinsically Just”, in Justice and Democracy , Keith Dowding, Robert E. Goodin, and Carole Pateman (eds.), Cambridge: Cambridge University Press, 40–58.
  • Baker, C. Edwin, 1989, Human Liberty and Freedom of Speech , New York: Oxford University Press.
  • –––, 2009, “Autonomy and Hate Speech”, in Hare and Weinstein 2009: 139–157 (ch. 8). doi:10.1093/acprof:oso/9780199548781.003.0009
  • Balkin, Jack M., 2004, “Digital Speech and Democratic Culture: A Theory of Freedom of Expression for the Information Society”, New York University Law Review , 79(1): 1–55.
  • –––, 2009, “The Future of Free Expression in a Digital Age”, Pepperdine Law Review , 36(2): 427–444.
  • –––, 2018, “Free Speech Is a Triangle”, Columbia Law Review , 118(7): 2011–2056.
  • –––, 2021, “How to Regulate (and Not Regulate) Social Media”, Journal of Free Speech Law , 1(1): 71–96. [ Balkin 2021 available online (pdf) ]
  • Barendt, Eric M., 2005, Freedom of Speech , second edition, Oxford/New York: Oxford University Press. doi:10.1093/acprof:oso/9780199225811.001.0001
  • Barnes, Michael Randall, 2022, “Online Extremism, AI, and (Human) Content Moderation”, Feminist Philosophy Quarterly , 8(3/4): article 6. [ Barnes 2022 available online ]
  • Beauharnais v. Illinois 343 U.S. 250 (1952).
  • Billingham, Paul and Tom Parr, 2020, “Enforcing Social Norms: The Morality of Public Shaming”, European Journal of Philosophy , 28(4): 997–1016. doi:10.1111/ejop.12543
  • Blasi, Vincent, 1977, “The Checking Value in First Amendment Theory”, American Bar Foundation Research Journal 3: 521–649.
  • –––, 2004, “Holmes and the Marketplace of Ideas”, The Supreme Court Review , 2004: 1–46.
  • Brettschneider, Corey Lang, 2012, When the State Speaks, What Should It Say? How Democracies Can Protect Expression and Promote Equality , Princeton, NJ: Princeton University Press.
  • Brietzke, Paul H., 1997, “How and Why the Marketplace of Ideas Fails”, Valparaiso University Law Review , 31(3): 951–970.
  • Bollinger, Lee C., 1986, The Tolerant Society: Free Speech and Extremist Speech in America , New York: Oxford University Press.
  • Bonotti, Matteo and Jonathan Seglow, 2022, “Freedom of Speech: A Relational Defence”, Philosophy & Social Criticism , 48(4): 515–529.
  • Brandenburg v. Ohio 395 U.S. 444 (1969).
  • Brink, David O., 2001, “Millian Principles, Freedom of Expression, and Hate Speech”, Legal Theory , 7(2): 119–157. doi:10.1017/S1352325201072019
  • Brison, Susan J., 1998, “The Autonomy Defense of Free Speech”, Ethics , 108(2): 312–339. doi:10.1086/233807
  • Brison, Susan J. and Katharine Gelber (eds), 2019, Free Speech in the Digital Age , Oxford: Oxford University Press. doi:10.1093/oso/9780190883591.001.0001
  • Brown, Étienne, 2023, “Free Speech and the Legal Prohibition of Fake News”, Social Theory and Practice , 49(1): 29–55. doi:10.5840/soctheorpract202333179
  • Buchanan, Allen E., 2013, The Heart of Human Rights , Oxford: Oxford University Press. doi:10.1093/acprof:oso/9780199325382.001.0001
  • Cepollaro, Bianca, Maxime Lepoutre, and Robert Mark Simpson, 2023, “Counterspeech”, Philosophy Compass , 18(1): e12890. doi:10.1111/phc3.12890
  • Chaplinsky v. New Hampshire 315 U.S. 568 (1942).
  • Cohen, Joshua, 1993, “Freedom of Expression”, Philosophy & Public Affairs , 22(3): 207–263.
  • –––, 1997, “Deliberation and Democratic Legitimacy”, in Deliberative Democracy: Essays on Reason and Politics , James Bohman and William Rehg (eds), Cambridge, MA: MIT Press, 67–92.
  • Dworkin, Ronald, 1981, “Is There a Right to Pornography?”, Oxford Journal of Legal Studies , 1(2): 177–212. doi:10.1093/ojls/1.2.177
  • –––, 1996, Freedom’s Law: The Moral Reading of the American Constitution , Cambridge, MA: Harvard University Press.
  • –––, 2006, “A New Map of Censorship”, Index on Censorship , 35(1): 130–133. doi:10.1080/03064220500532412
  • –––, 2009, “Foreword”, in Hare and Weinstein 2009: v–ix.
  • –––, 2013, Religion without God , Cambridge, MA: Harvard University Press.
  • Douek, Evelyn, 2021, “Governing Online Speech: From ‘Posts-as-Trumps’ to Proportionality and Probability”, Columbia Law Review , 121(3): 759–834.
  • –––, 2022a, “Content Moderation as Systems Thinking”, Harvard Law Review , 136(2): 526–607.
  • –––, 2022b, “The Siren Call of Content Moderation Formalism”, in Social Media, Freedom of Speech, and the Future of Our Democracy , Lee C. Bollinger and Geoffrey R. Stone (eds.), New York: Oxford University Press, 139–156 (ch. 9). doi:10.1093/oso/9780197621080.003.0009
  • Ely, John Hart, 1975, “Flag Desecration: A Case Study in the Roles of Categorization and Balancing in First Amendment Analysis”, Harvard Law Review , 88: 1482–1508.
  • Emerson, Thomas I., 1970, The System of Freedom of Expression , New York: Random House.
  • Epstein, Richard A., 1992, “Property, Speech, and the Politics of Distrust”, University of Chicago Law Review , 59(1): 41–90.
  • Estlund, David, 2008, Democratic Authority: A Philosophical Framework , Princeton: Princeton University Press.
  • Feinberg, Joel, 1984, The Moral Limits of the Criminal Law Volume 1: Harm to Others , New York: Oxford University Press. doi:10.1093/0195046641.001.0001
  • –––, 1985, The Moral Limits of the Criminal Law: Volume 2: Offense to Others , New York: Oxford University Press. doi:10.1093/0195052153.001.0001
  • Fish, Stanley Eugene, 1994, There’s No Such Thing as Free Speech, and It’s a Good Thing, Too , New York: Oxford University Press.
  • Fox, Gregory H. and Georg Nolte, 1995, “Intolerant Democracies”, Harvard International Law Journal , 36(1): 1–70.
  • Gelber, Katharine, 2010, “Freedom of Political Speech, Hate Speech and the Argument from Democracy: The Transformative Contribution of Capabilities Theory”, Contemporary Political Theory , 9(3): 304–324. doi:10.1057/cpt.2009.8
  • Gilmore, Jonathan, 2011, “Expression as Realization: Speakers’ Interests in Freedom of Speech”, Law and Philosophy , 30(5): 517–539. doi:10.1007/s10982-011-9096-z
  • Gordon, Jill, 1997, “John Stuart Mill and the ‘Marketplace of Ideas’”, Social Theory and Practice , 23(2): 235–249. doi:10.5840/soctheorpract199723210
  • Greenawalt, Kent, 1989, Speech, Crime, and the Uses of Language , New York: Oxford University Press.
  • Greene, Amanda R. and Robert Mark Simpson, 2017, “Tolerating Hate in the Name of Democracy”, The Modern Law Review , 80(4): 746–765. doi:10.1111/1468-2230.12283
  • Greene, Jamal, 2021, How Rights Went Wrong: Why Our Obsession with Rights Is Tearing America Apart , Boston: Houghton Mifflin Harcourt.
  • Gutmann, Amy and Dennis Thompson, 2008, Why Deliberative Democracy? Princeton: Princeton University Press.
  • Habermas, Jürgen, 1992 [1996], Faktizität und Geltung: Beiträge zur Diskurstheorie des Rechts und des demokratischen Rechtsstaats , Frankfurt am Main: Suhrkamp. Translated as Between Facts and Norms: Contributions to a Discourse Theory of Law and Democracy , William Rehg (trans.), (Studies in Contemporary German Social Thought), Cambridge, MA: MIT Press, 1996.
  • Hare, Ivan and James Weinstein (eds), 2009, Extreme Speech and Democracy , Oxford/New York: Oxford University Press. doi:10.1093/acprof:oso/9780199548781.001.0001
  • Hart, H. L. A., 1955, “Are There Any Natural Rights?”, The Philosophical Review , 64(2): 175–191. doi:10.2307/2182586
  • Heinze, Eric, 2016, Hate Speech and Democratic Citizenship , Oxford: Oxford University Press. doi:10.1093/acprof:oso/9780198759027.001.0001
  • Heyman, Steven J., 2009, “Hate Speech, Public Discourse, and the First Amendment”, in Hare and Weinstein 2009: 158–181 (ch. 9). doi:10.1093/acprof:oso/9780199548781.003.0010
  • Hohfeld, Wesley, 1917, “Fundamental Legal Conceptions as Applied in Judicial Reasoning,” Yale Law Journal 26(8): 710–770.
  • Holder v. Humanitarian Law Project 561 U.S. 1 (2010).
  • Hornsby, Jennifer, 1995, “Disempowered Speech”, Philosophical Topics , 23(2): 127–147. doi:10.5840/philtopics199523211
  • Howard, Jeffrey W., 2019a, “Dangerous Speech”, Philosophy & Public Affairs , 47(2): 208–254. doi:10.1111/papa.12145
  • –––, 2019b, “Free Speech and Hate Speech”, Annual Review of Political Science , 22: 93–109. doi:10.1146/annurev-polisci-051517-012343
  • –––, 2021, “Terror, Hate and the Demands of Counter-Speech”, British Journal of Political Science , 51(3): 924–939. doi:10.1017/S000712341900053X
  • –––, forthcoming a, “The Ethics of Social Media: Why Content Moderation is a Moral Duty”, Journal of Practical Ethics .
  • Howard, Jeffrey W. and Robert Simpson, forthcoming b, “Freedom of Speech”, in Issues in Political Theory , fifth edition, Catriona McKinnon, Patrick Tomlin, and Robert Jubb (eds), Oxford: Oxford University Press.
  • Husak, Douglas N., 1985, “What Is so Special about [Free] Speech?”, Law and Philosophy , 4(1): 1–15. doi:10.1007/BF00208258
  • Jacobson, Daniel, 2000, “Mill on Liberty, Speech, and the Free Society”, Philosophy & Public Affairs , 29(3): 276–309. doi:10.1111/j.1088-4963.2000.00276.x
  • Kendrick, Leslie, 2013, “Speech, Intent, and the Chilling Effect”, William & Mary Law Review , 54(5): 1633–1692.
  • –––, 2017, “Free Speech as a Special Right”, Philosophy & Public Affairs , 45(2): 87–117. doi:10.1111/papa.12087
  • Klonick, Kate, 2018, “The New Governors”, Harvard Law Review 131: 1589–1670.
  • Knight First Amendment Institute v. Trump 928 F.3d 226 (2019).
  • Kramer, Matthew H., 2021, Freedom of Expression as Self-Restraint , Oxford: Oxford University Press.
  • Lakier, Genevieve, 2015, “The Invention of Low-Value Speech”, Harvard Law Review , 128(8): 2166–2233.
  • Landemore, Hélène, 2013, Democratic Reason: Politics, Collective Intelligence, and the Rule of the Many , Princeton/Oxford: Princeton University Press.
  • Langton, Rae, 1993, “Speech Acts and Unspeakable Acts”, Philosophy & Public Affairs , 22(4): 293–330.
  • –––, 2018, “The Authority of Hate Speech”, in Oxford Studies in Philosophy of Law (Volume 3), John Gardner, Leslie Green, and Brian Leiter (eds.), Oxford: Oxford University Press: ch. 4. doi:10.1093/oso/9780198828174.003.0004
  • Lazar, Seth, forthcoming a, “Legitimacy, Authority, and the Public Value of Explanations”, in Oxford Studies in Political Philosophy (Volume 10), Steven Wall (ed.), Oxford: Oxford University Press.
  • –––, forthcoming b, Connected by Code: Algorithmic Intermediaries and Political Philosophy , Oxford: Oxford University Press.
  • Leiter, Brian, 2016, “The Case against Free Speech”, Sydney Law Review , 38(4): 407–439.
  • Lepoutre, Maxime, 2021, Democratic Speech in Divided Times , Oxford/New York: Oxford University Press.
  • MacKinnon, Catharine A., 1984 [1987], “Not a Moral Issue”, Yale Law & Policy Review , 2(2): 321–345. Reprinted in her Feminism Unmodified: Discourses on Life and Law , Cambridge, MA: Harvard University Press, 1987, 146–162 (ch. 13).
  • Macklem, Timothy, 2006, Independence of Mind , Oxford/New York: Oxford University Press. doi:10.1093/acprof:oso/9780199535446.001.0001
  • Maitra, Ishani, 2009, “Silencing Speech”, Canadian Journal of Philosophy , 39(2): 309–338. doi:10.1353/cjp.0.0050
  • Maitra, Ishani and Mary Kate McGowan, 2007, “The Limits of Free Speech: Pornography and the Question of Coverage”, Legal Theory , 13(1): 41–68. doi:10.1017/S1352325207070024
  • Matsuda, Mari J., 1989, “Public Response to Racist Speech: Considering the Victim’s Story”, Michigan Law Review , 87(8): 2320–2381.
  • Matsuda, Mari J., Charles R. Lawrence, Richard Delgado, and Kimberlè Williams Crenshaw, 1993, Words That Wound: Critical Race Theory, Assaultive Speech, and the First Amendment (New Perspectives on Law, Culture, and Society), Boulder, CO: Westview Press. Reprinted 2018, Abingdon: Routledge. doi:10.4324/9780429502941
  • McGowan, Mary Kate, 2003, “Conversational Exercitives and the Force of Pornography”, Philosophy & Public Affairs , 31(2): 155–189. doi:10.1111/j.1088-4963.2003.00155.x
  • –––, 2019, Just Words: On Speech and Hidden Harm , Oxford: Oxford University Press. doi:10.1093/oso/9780198829706.001.0001
  • McMahan, Jeff, 2009, Killing in War , (Uehiro Series in Practical Ethics), Oxford: Clarendon Press. doi:10.1093/acprof:oso/9780199548668.001.0001
  • Milton, John, 1644, “Areopagitica”, London. [ Milton 1644 available online ]
  • Meiklejohn, Alexander, 1948, Free Speech and Its Relation to Self-Government , New York: Harper.
  • –––, 1960, Political Freedom: The Constitutional Powers of the People , New York: Harper.
  • Mill, John Stuart, 1859, On Liberty , London: John W. Parker and Son. [ Mill 1859 available online ]
  • Nagel, Thomas, 2002, Concealment and Exposure , New York: Oxford University Press.
  • Pallikkathayil, Japa, 2020, “Free Speech and the Embodied Self”, in Oxford Studies in Political Philosophy (Volume 6), David Sobel, Peter Vallentyne, and Steven Wall (eds.), Oxford: Oxford University Press, 61–84 (ch. 3). doi:10.1093/oso/9780198852636.003.0003
  • Parekh, Bhikhu, 2012, “Is There a Case for Banning Hate Speech?”, in The Content and Context of Hate Speech: Rethinking Regulation and Responses , Michael Herz and Peter Molnar (eds.), Cambridge/New York: Cambridge University Press, 37–56. doi:10.1017/CBO9781139042871.006
  • Post, Robert C., 1991, “Racist Speech, Democracy, and the First Amendment”, William and Mary Law Review , 32(2): 267–328.
  • –––, 2000, “Reconciling Theory and Doctrine in First Amendment Jurisprudence”, California Law Review , 88(6): 2353–2374.
  • –––, 2009, “Hate Speech”, in Hare and Weinstein 2009: 123–138 (ch. 7). doi:10.1093/acprof:oso/9780199548781.003.0008
  • –––, 2011, “Participatory Democracy as a Theory of Free Speech: A Reply”, Virginia Law Review , 97(3): 617–632.
  • Quong, Jonathan, 2011, Liberalism without Perfection , Oxford/New York: Oxford University Press. doi:10.1093/acprof:oso/9780199594870.001.0001
  • R v. Oakes , 1 SCR 103 (1986).
  • Rawls, John, 2005, Political Liberalism , expanded edition, (Columbia Classics in Philosophy), New York: Columbia University Press.
  • Raz, Joseph, 1991 [1994], “Free Expression and Personal Identification”, Oxford Journal of Legal Studies , 11(3): 303–324. Collected in his Ethics in the Public Domain: Essays in the Morality of Law and Politics , Oxford: Clarendon Press, 146–169 (ch. 7).
  • Redish, Martin H., 1982, “Value of Free Speech”, University of Pennsylvania Law Review , 130(3): 591–645.
  • Rubenfeld, Jed, 2001, “The First Amendment’s Purpose”, Stanford Law Review , 53(4): 767–832.
  • Scanlon, Thomas, 1972, “A Theory of Freedom of Expression”, Philosophy & Public Affairs , 1(2): 204–226.
  • –––, 1978, “Freedom of Expression and Categories of Expression”, University of Pittsburgh Law Review , 40(4): 519–550.
  • –––, 2008, “Rights and Interests”, in Arguments for a Better World: Essays in Honor of Amartya Sen , Kaushik Basu and Ravi Kanbur (eds), Oxford: Oxford University Press, 68–79 (ch. 5). doi:10.1093/acprof:oso/9780199239115.003.0006
  • –––, 2013, “Reply to Wenar”, Journal of Moral Philosophy , 10: 400–406.
  • Schauer, Frederick, 1978, “Fear, Risk and the First Amendment: Unraveling the Chilling Effect”, Boston University Law Review , 58(5): 685–732.
  • –––, 1982, Free Speech: A Philosophical Enquiry , Cambridge/New York: Cambridge University Press.
  • –––, 1985, “Slippery Slopes”, Harvard Law Review , 99(2): 361–383.
  • –––, 1993, “The Phenomenology of Speech and Harm”, Ethics , 103(4): 635–653. doi:10.1086/293546
  • –––, 2004, “The Boundaries of the First Amendment: A Preliminary Exploration of Constitutional Salience”, Harvard Law Review , 117(6): 1765–1809.
  • –––, 2009, “Is It Better to Be Safe than Sorry? Free Speech and the Precautionary Principle”, Pepperdine Law Review , 36(2): 301–316.
  • –––, 2010, “Facts and the First Amendment”, UCLA Law Review , 57(4): 897–920.
  • –––, 2011a, “On the Relation between Chapters One and Two of John Stuart Mill’s On Liberty ”, Capital University Law Review , 39(3): 571–592.
  • –––, 2011b, “Harm(s) and the First Amendment”, The Supreme Court Review , 2011: 81–111. doi:10.1086/665583
  • –––, 2015, “Free Speech on Tuesdays”, Law and Philosophy , 34(2): 119–140. doi:10.1007/s10982-014-9220-y
  • Shiffrin, Seana Valentine, 2014, Speech Matters: On Lying, Morality, and the Law (Carl G. Hempel Lecture Series), Princeton, NJ: Princeton University Press.
  • Simpson, Robert Mark, 2016, “Defining ‘Speech’: Subtraction, Addition, and Division”, Canadian Journal of Law & Jurisprudence , 29(2): 457–494. doi:10.1017/cjlj.2016.20
  • –––, 2021, “‘Lost, Enfeebled, and Deprived of Its Vital Effect’: Mill’s Exaggerated View of the Relation Between Conflict and Vitality”, Aristotelian Society Supplementary Volume , 95: 97–114. doi:10.1093/arisup/akab006
  • Southeastern Promotions, Ltd. v. Conrad , 420 U.S. 546 (1975).
  • Sparrow, Robert and Robert E. Goodin, 2001, “The Competition of Ideas: Market or Garden?”, Critical Review of International Social and Political Philosophy , 4(2): 45–58. doi:10.1080/13698230108403349
  • Stone, Adrienne, 2017, “Viewpoint Discrimination, Hate Speech Laws, and the Double-Sided Nature of Freedom of Speech”, Constitutional Commentary , 32(3): 687–696.
  • Stone, Geoffrey R., 1983, “Content Regulation and the First Amendment”, William and Mary Law Review , 25(2): 189–252.
  • –––, 1987, “Content-Neutral Restrictions”, University of Chicago Law Review , 54(1): 46–118.
  • –––, 2004, Perilous Times: Free Speech in Wartime from the Sedition Act of 1798 to the War on Terrorism , New York: W.W. Norton & Company.
  • Strauss, David A., 1991, “Persuasion, Autonomy, and Freedom of Expression”, Columbia Law Review , 91(2): 334–371.
  • Strossen, Nadine, 2018, Hate: Why We Should Resist It With Free Speech, Not Censorship , New York: Oxford University Press.
  • Sunstein, Cass R., 1986, “Pornography and the First Amendment”, Duke Law Journal , 1986(4): 589–627.
  • –––, 1989, “Low Value Speech Revisited”, Northwestern University Law Review , 83(3): 555–561.
  • –––, 1993, Democracy and the Problem of Free Speech , New York: The Free Press.
  • –––, 2017, #Republic: Divided Democracy in the Age of Social Media , Princeton, NJ: Princeton University Press.
  • Tadros, Victor, 2012, “Duty and Liability”, Utilitas , 24(2): 259–277.
  • Turner, Piers Norris, 2014, “‘Harm’ and Mill’s Harm Principle”, Ethics , 124(2): 299–326. doi:10.1086/673436
  • Tushnet, Mark, Alan Chen, and Joseph Blocher, 2017, Free Speech beyond Words: The Surprising Reach of the First Amendment , New York: New York University Press.
  • Volokh, Eugene, 2011, “In Defense of the Marketplace of Ideas/Search for Truth as a Theory of Free Speech Protection”, Virginia Law Review , 97(3): 595–602.
  • Vredenburgh, Kate, 2022, “The Right to Explanation”, Journal of Political Philosophy , 30(2): 209–229. doi:10.1111/jopp.12262
  • Waldron, Jeremy, 1987, “Mill and the Value of Moral Distress”, Political Studies , 35(3): 410–423. doi:10.1111/j.1467-9248.1987.tb00197.x
  • –––, 2012, The Harm in Hate Speech (The Oliver Wendell Holmes Lectures, 2009), Cambridge, MA: Harvard University Press.
  • Weinstein, James, 2011, “Participatory Democracy as the Central Value of American Free Speech Doctrine”, Virginia Law Review , 97(3): 491–514.
  • West Virginia State Board of Education v. Barnette 319 U.S. 624 (1943).
  • Whitten, Suzanne, 2022, A Republican Theory of Free Speech: Critical Civility , Cham: Palgrave Macmillan. doi:10.1007/978-3-030-78631-1
  • Whitney, Heather M. and Robert Mark Simpson, 2019, “Search Engines and Free Speech Coverage”, in Free Speech in the Digital Age , Susan J. Brison and Katharine Gelber (eds), Oxford: Oxford University Press, 33–51 (ch. 2). doi:10.1093/oso/9780190883591.003.0003
  • West, Caroline, 2004 [2022], “Pornography and Censorship”, The Stanford Encyclopedia of Philosophy (Winter 2022 edition), Edward N. Zalta and Uri Nodelman (eds.), URL = < https://plato.stanford.edu/archives/win2022/entries/pornography-censorship/ >.
  • International Covenant on Civil and Political Rights (ICCPR) , adopted: 16 December 1966; Entry into force: 23 March 1976.
  • Free Speech Debate
  • Knight First Amendment Institute at Columbia University
  • van Mill, David, “Freedom of Speech”, Stanford Encyclopedia of Philosophy (Winter 2023 Edition), Edward N. Zalta & Uri Nodelman (eds.), URL = < https://plato.stanford.edu/archives/win2023/entries/freedom-speech/ >. [This was the previous entry on this topic in the Stanford Encyclopedia of Philosophy – see the version history .]

ethics: search engines and | hate speech | legal rights | liberalism | Mill, John Stuart | Mill, John Stuart: moral and political philosophy | pornography: and censorship | rights | social networking and ethics | toleration

Acknowledgments

I am grateful to the editors and anonymous referees of this Encyclopedia for helpful feedback. I am greatly indebted to Robert Mark Simpson for many incisive suggestions, which substantially improved the entry. This entry was written while on a fellowship funded by UK Research & Innovation (grant reference MR/V025600/1); I am thankful to UKRI for the support.

Copyright © 2024 by Jeffrey W. Howard < jeffrey . howard @ ucl . ac . uk >

The Stanford Encyclopedia of Philosophy is copyright © 2024 by The Metaphysics Research Lab , Department of Philosophy, Stanford University

Library of Congress Catalog Data: ISSN 1095-5054

  • Dictionaries home
  • American English
  • Collocations
  • German-English
  • Grammar home
  • Practical English Usage
  • Learn & Practise Grammar (Beta)
  • Word Lists home
  • My Word Lists
  • Recent additions
  • Resources home
  • Text Checker

Definition of speech noun from the Oxford Advanced Learner's Dictionary

  • speaker noun
  • speech noun
  • spoken adjective (≠ unspoken)
  • Several people made speeches at the wedding.
  • She gave a rousing speech to the crowd.
  • speech on something to deliver a speech on human rights
  • speech about something He inspired everyone with a moving speech about tolerance and respect.
  • in a speech In his acceptance speech , the actor thanked his family.
  • a lecture on the Roman army
  • a course/​series of lectures
  • a televised presidential address
  • She gave an interesting talk on her visit to China.
  • to preach a sermon
  • a long/​short speech/​lecture/​address/​talk/​sermon
  • a keynote speech/​lecture/​address
  • to write/​prepare/​give/​deliver/​hear a(n) speech/​lecture/​address/​talk/​sermon
  • to attend/​go to a lecture/​talk
  • George Washington's inaugural speech
  • He made a speech about workers of the world uniting.
  • In a speech given last month, she hinted she would run for office.
  • She delivered the keynote speech (= main general speech) at the conference.
  • He wrote her party conference speech.
  • His 20-minute speech was interrupted several times by booing.
  • Her comments came ahead of a speech she will deliver on Thursday to business leaders.
  • She concluded her speech by thanking the audience.
  • He gave an impassioned speech broadcast nationwide.
  • We heard a speech by the author.
  • This is very unexpected—I haven't prepared a speech.
  • The guest speaker is ill so I have to do the opening speech.
  • He read his speech from a prompter.
  • the farewell speech given by George Washington
  • He made the comments in a nationally televised speech.
  • During his victory speech the President paid tribute to his defeated opponent.
  • In his concession speech, he urged his supporters to try to work with Republicans.
  • The Prime Minister addressed the nation in a televised speech.
  • He delivered his final speech to Congress.
  • He delivered the commencement speech at Notre Dame University.
  • His speech was broadcast on national radio.
  • In her speech to the House of Commons, she outlined her vision of Britain in the 21st century.
  • President Bush delivered his 2004 State of the Union speech.
  • She gave a speech on the economy.
  • She made a stirring campaign speech on improving the lot of the unemployed.
  • The President will deliver a major foreign-policy speech to the United Nations.
  • The candidates gave their standard stump speeches (= political campaign speeches) .
  • The prizewinner gave an emotional acceptance speech.
  • a Senate floor speech
  • her maiden speech (= her first) in the House of Commons
  • the Chancellor's Budget speech
  • the Prime Minister's speech-writers
  • She's been asked to give the after-dinner speech.
  • You will need to prepare an acceptance speech.
  • a political speech writer
  • in a/​the speech
  • speech about





Personification

Definition of Personification

Personification is a figure of speech in which an idea or thing is given human attributes and/or feelings or is spoken of as if it were human. Personification is a common form of metaphor in that human characteristics are attributed to nonhuman things. This allows writers to create life and motion within inanimate objects, animals, and even abstract ideas by assigning them recognizable human behaviors and emotions.

Personification is a literary device found often in children’s literature. This is an effective use of figurative language because personification relies on imagination for understanding. Of course, readers know at a logical level that nonhuman things cannot feel, behave, or think like humans. However, personifying nonhuman things can be an interesting, creative, and effective way for a writer to illustrate a concept or make a point.

For example, in his picture book, “The Day the Crayons Quit,” Drew Daywalt uses personification to allow the crayons to express their frustration at how they are (or are not) being used. This literary device is effective in creating an imaginary world for children in which crayons can communicate like humans.

Common Examples of Personification

Here are some examples of personification that may be found in everyday expression:

  • My alarm yelled at me this morning.
  • I like onions, but they don’t like me.
  • The sign on the door insulted my intelligence.
  • My phone is not cooperating with me today.
  • That bus is driving too fast.
  • My computer works very hard.
  • The mail is running unusually slow this week.
  • I wanted to get money, but the ATM died.
  • This article says that spinach is good for you.
  • Unfortunately, when she stepped on the Lego, her foot cried.
  • The sunflowers hung their heads.
  • That door jumped in my way.
  • The school bell called us from outside.
  • The storm trampled the town.
  • I can’t get my calendar to work for me.
  • This advertisement speaks to me.
  • Fear gripped the patient waiting for a diagnosis.
  • The cupboard groans when you open it.
  • Can you see that star winking at you?
  • Books reach out to kids.

Examples of Personification in Speech or Writing

Here are some examples of personification that may be found in everyday writing or conversation:

  • My heart danced when he walked in the room.
  • The hair on my arms stood after the performance.
  • Why is your plant pouting in the corner?
  • The wind is whispering outside.
  • That picture says a lot.
  • Her eyes are not smiling at us.
  • My brain is not working fast enough today.
  • Those windows are watching us.
  • Our coffee maker wishes us good morning.
  • The sun kissed my cheeks when I went outside.

Famous Personification Examples

Think you haven’t heard of any famous personification examples? Here are some well-known and recognizable titles and quotes featuring this figure of speech:

  • “The Brave Little Toaster” ( novel by Thomas M. Disch and adapted animated film series)
  • “This Tornado Loves You” (song by Neko Case)
  • “Happy Feet” (animated musical film)
  • “Time Waits for No One” (song by The Rolling Stones)
  • “The Little Engine that Could” (children’s book by Watty Piper)
  • “The sea was angry that day, my friends – like an old man trying to send back soup in a deli.” (Seinfeld television series)
  • “Life moves pretty fast.” (movie “Ferris Bueller’s Day Off”)
  • “The dish ran away with the spoon.” (“Hey, diddle, diddle” by Mother Goose)
  • “The Heart wants what it wants – or else it does not care” (Emily Dickinson)
  • “Once there was a tree, and she loved a little boy.” (“The Giving Tree” by Shel Silverstein)

Difference Between Personification and Anthropomorphism

Personification is often confused with the literary term anthropomorphism due to fundamental similarities. However, there is a difference between these two literary devices . Anthropomorphism is when human characteristics or qualities are applied to animals or deities, not inanimate objects or abstract ideas. As a literary device, anthropomorphism allows an animal or deity to behave as a human. This is reflected in Greek dramas in which gods would appear and involve themselves in human actions and relationships.

In addition to gods, writers use anthropomorphism to create animals that display human traits or likenesses such as wearing clothes or speaking. There are several examples of this literary device in popular culture and literature. For example, Mickey Mouse is a character that illustrates anthropomorphism in that he wears clothes and talks like a human, though he is technically an animal. Other such examples are Winnie the Pooh, Paddington Bear, and Thomas the Tank Engine.

Therefore, while anthropomorphism is limited to animals and deities, personification can be more widely applied as a literary device by including inanimate objects and abstract ideas. Personification allows writers to attribute human characteristics to nonhuman things without turning those things into human-like characters, as is done with anthropomorphism.

Writing Personification

Overall, as a literary device, personification functions as a means of creating imagery and connections between the animate and inanimate for readers. Therefore, personification allows writers to convey meaning in a creative and poetic way. These figures of speech enhance a reader’s understanding of concepts and comparisons, interpretations of symbols and themes, and enjoyment of language.

Here are instances in which it’s effective to use personification in writing:

Demonstrate Creativity

Personification demonstrates a high level of creativity. To be valuable as a figure of speech, the human attributes assigned to a nonhuman thing through personification must make sense in some way. In other words, human characteristics can’t just be assigned to any inanimate object as a literary device. There must be some connection between them that resonates with the reader, demanding creativity on the part of the writer to find that connection and develop successful personification.

Exercise Poetic Skill

Many poets rely on personification to create vivid imagery and memorable symbolism. For example, in Edgar Allan Poe’s poem “The Raven,” the poet skillfully personifies the raven by allowing it to speak one word, “nevermore,” in response to the narrator’s questions. This is a powerful use of personification: as the poem continues, the narrator projects ever more complex and intricate human characteristics onto the bird, even though the raven only ever speaks that same word.

Create Humor

Personification can be an excellent tool in creating humor for a reader. This is especially true among young readers, who tend to appreciate the comedic incongruity of a nonhuman thing being portrayed as possessing human characteristics. Personification allows for humor rooted in incongruity and even absurdity.

Enhance Imagination

Overall, personification is a literary device that allows readers to enhance their imagination by “believing” that something inanimate or nonhuman can behave, think, or feel as a human. In fact, people tend to personify things in their daily lives by assigning human behavior or feelings to pets and even objects. For example, a child may assign emotions to a favorite stuffed animal to match their own feelings. In addition, a cat owner may pretend their pet is speaking to them and answer back. This allows writers and readers to see a reflection of humanity through imagination. Readers may also develop a deeper understanding of human behavior and emotion.

Examples of Personification in Literature

Example #1: The House on Mango Street (Sandra Cisneros)

But the house on Mango Street is not the way they told it at all. It’s small and red with tight steps in front and windows so small you’d think they were holding their breath.

In the first chapter of Cisneros’s book, the narrator Esperanza is describing the house into which she and her family are moving. Her parents have promised her that they would find a spacious and welcoming home for their family, similar to what Esperanza has seen on television. However, their economic insecurity has prevented them from getting a home that represents the American dream.

Cisneros uses personification to emphasize the restrictive circumstances of Esperanza’s family. To Esperanza, the windows of the house appear to be “holding their breath” due to their small size, creating an image of suffocation. This personification not only enhances the description of the house on Mango Street for the reader, but it also reflects Esperanza’s feelings about the house, her family, and her life. Like the windows, Esperanza is holding her breath as well, with the hope of a better future and the fear of her dreams not becoming reality.

Example #2: Ex-Basketball Player (John Updike)

Off work, he hangs around Mae’s Luncheonette.
Grease-gray and kind of coiled, he plays pinball,
Smokes those thin cigars, nurses lemon phosphates.
Flick seldom says a word to Mae, just nods
Beyond her face toward bright applauding tiers
Of Necco Wafers, Nibs, and Juju Beads.

In his poem about a former basketball player named Flick, Updike recreates an arena crowd watching Flick play pinball by personifying the candy boxes in the luncheonette. The snack containers “applaud” Flick as he spends his free time playing a game that is isolating and requires no athletic skill. However, the personification in Updike’s poem is a reflection of how Flick’s life has changed since he played and set records for his basketball team in high school.

Flick’s fans have been replaced by packages of sugary snacks with little substance rather than real people appreciating his skills and cheering him on. Like the value of his audience , Flick’s own value as a person has diminished into obscurity and the mundane now that he is an ex-basketball player.

Example #3: How Cruel Is the Story of Eve (Stevie Smith)

It is only a legend,
You say? But what
Is the meaning of the legend
If not
To give blame to women most
And most punishment?
This is the meaning of a legend that colours
All human thought; it is not found among animals.
How cruel is the story of Eve,
What responsibility it has
In history
For misery.

In her poem, Smith personifies the story of Eve as it is relayed in the first book of the Bible, Genesis. Smith attributes several human characteristics to this story, such as cruelty and responsibility. This enhances the deeper meaning of the poem: that Eve is not to blame for her actions, which essentially led to the “fall” of man and the expulsion from Paradise. In addition, she is not to blame for the subjugation and inequality that women have faced throughout history, tracing back to Eve.

Eve’s “story” or “legend” in the poem is accused by the poet of coloring “all human thought.” In other words, Smith is holding the story responsible for the legacy of punishment towards women throughout history by its portrayal of Eve, the first woman, as a temptress and sinner. The use of this literary device is effective in separating Eve’s character as a woman from the manner in which her story is told.


California university cancels Muslim valedictorian's speech, citing safety concerns


Reporting by Steve Gorman in Los Angeles, Julia Harte in New York and Kanishka Singh in Washington; Editing by Jonathan Oatis and Christopher Cushing


What USC Got Wrong When It Canceled Its Valedictorian’s Speech

I’ve seen this kind of mistake before.

On Monday, Andrew Guzman, the provost of the University of Southern California, sent a letter  to the campus community announcing the cancellation of the speech by the student valedictorian. Concerned with the “intensity of feelings” around the Middle East and accompanying risks to security, he wrote, “tradition must give way to safety.”

There is no question that universities have a duty to maintain campus safety during graduation ceremonies. Campus administrators are responsible for the safety of tens of thousands of students and their friends and families at a very public venue during this period. They want everyone to share a memorable moment of recognition of accomplishment, and to be safe while doing so.

Yet the provost’s letter sounded all too familiar to me. For six years, I served as the United Nations’ principal monitor of freedom of expression worldwide. In that role, I repeatedly saw governments shut down public speech, prioritizing vague assertions of national security or public order over the rights of their citizens.

This context helps us understand why USC’s decision is so troubling. For as much as Guzman asserted that “the decision has nothing to do with freedom of speech,” he failed to demonstrate the necessity of this draconian measure. As such, the action is clearly an interference with free speech—the question is whether it was justified.

The student selected as valedictorian, Asna Tabassum, earned the honor, the result of a faculty recommendation that Guzman himself approved. With nearly perfect grades, a major in biomedical engineering, and a minor in genocide studies, Tabassum presents the kind of profile that any university would be thrilled to celebrate—hardworking, successful, committed to science and society, engaged in the life of her campus.

And like so many young people today, she has thoughts about justice and the wider world of which she is a part. Specifically, she supports the pro-Palestine activism that has grown across the world, especially on college campuses. This is obvious because she linked to a pro-Palestine website on her Instagram page and liked posts from a campus organization favoring Palestinian rights.

Many find those websites, and those views, objectionable. That’s fine: Everyone enjoys the right to disagree and object. According to reporting in the Los Angeles Times and elsewhere, those associations and views caused pro-Israel groups to launch a campaign against her, and some unnamed individuals to issue threats.

The USC leadership caved to these efforts. Asserting that Tabassum had no “entitlement” to speak, as the provost’s letter emphasized, is beside the point. USC pulled her from the podium because, it appears, it concluded that the reactions to her views and associations—perhaps her valedictory speech—could somehow threaten public safety or disrupt commencement. Even Guzman’s letter makes this plain: He made sure to note that the criteria for selection did not include candidates’ “social media presence,” implying that he would not have approved her as valedictorian if he had known her views.

The question is not whether the university has a significant interest in a safe celebration—it obviously does. The question is whether it has shown that the steps it took were necessary and proportionate to ensure that kind of environment. And here is where USC administrators have failed. They did not demonstrate that it was necessary to cancel Tabassum’s speech. They did not show, or even allege, that Tabassum would use the moment to incite any kind of disruption. There is no evidence that the university considered what a security arrangement might look like to protect Tabassum and all participants at graduation. There is no evidence that it considered or offered alternatives to canceling her speech altogether.

In short, Tabassum has been penalized while those making the threats have secured a victory. USC’s choice came with obvious costs, depriving Tabassum of a speaking role and her classmates of hearing one of their most academically successful members.

USC gave opponents of Tabassum’s views the “heckler’s veto.” The lesson seems to be: If you don’t like a speaker, complain and threaten disruption to get your way. The risks to campus free speech are obvious. Once a school starts down this path, there is no end to political tests in which university administrators bless certain views—those that do not stir up intense feelings—and reject others. That is the path of campus authoritarianism, something American students have been fighting against since at least 1964.

Schools like USC will forever face pressure to pick students without a political backstory, without convictions or passions that spark dissent or make some uncomfortable. Universities face increasingly strident calls, inside and outside their campuses, for them to limit speech on grounds that have nothing to do with their academic missions. Now, more than ever, members of campus leadership must stand up for their students, their faculties, and their communities in the face of threats—and not only teach but practice the centrality of freedom of expression in democratic societies like ours.


A knife attack in Australia against a bishop and a priest is being treated as terrorism, police say



SYDNEY (AP) — Australian police say a knife attack that wounded a bishop and a priest during a Sydney church service, as horrified worshippers watched online and in person, and that sparked a riot was an act of terrorism.

Police arrested a 16-year-old boy Tuesday after the stabbing at Christ the Good Shepherd Church that injured Bishop Mar Mari Emmanuel and a priest. Both are expected to survive.


New South Wales Police Commissioner Karen Webb said the suspect’s comments pointed to a religious motive for the attack.

“We’ll allege there’s a degree of premeditation on the basis that this person has travelled to that location, which is not near his residential address, he has travelled with a knife and subsequently the bishop and the priest have been stabbed,” Webb said. “They’re lucky to be alive.”

The teenager was known to police but was not on a terror watch list, Webb said.


The Australian Security Intelligence Organization, the nation’s main domestic spy agency, and the Australian Federal Police had joined state police in a counter-terrorism task force to investigate who else was potentially involved.

ASIO director-general Mike Burgess said the investigation had yet to uncover any associated threats.


“It does appear to be religiously motivated, but we continue our lines of investigation,” Burgess said.

“Our job is to look at individuals connected with the attacker to assure ourselves that there is no-one else in the community with similar intent. At this stage, we have no indications of that,” Burgess added.

On ASIO’s advice, the risk of a terrorist attack in Australia is rated at “possible.” That is the second lowest level after “not expected” on the five-tier National Terrorism Threat Advisory System.

The boy had been convicted in January of a range of offenses including possession of a switchblade knife, being armed with a weapon with an intention to commit an indictable offence, stalking, intimidation and damaging property, Australian Broadcasting Corp. reported.

A Sydney court released him on a good behavior bond, the ABC reported.

The boy had also used a switchblade, which is an illegal weapon in Australia, in Monday’s attack, the ABC reported.


Juvenile offenders cannot be publicly identified in New South Wales state.

In response to the attack, Prime Minister Anthony Albanese said “there is no place for violence in our community. There’s no place for violent extremism.”

Christ the Good Shepherd Church in suburban Wakeley streams its services online, and worshippers watched as a person in black clothes approached the altar and stabbed the bishop and the priest, Isaac Royel, during a service Monday evening before the congregation overpowered him, police said.

A crowd of hundreds seeking revenge gathered outside the Orthodox Assyrian church, hurling bricks and bottles, injuring police officers and preventing police from taking the teen outside, officials said.

The teen suspect and at least two police officers were also hospitalized, Acting Assistant Police Commissioner Andrew Holland told journalists.

Paramedics treated 30 patients, with seven taken to hospitals, NSW Ambulance commissioner Dominic Morgan said.

“This was a rapidly evolving situation where the crowds went from 50 to a number of hundreds of people in a very rapid period of time,” Morgan said.

“Our paramedics became directly under threat ... and had to retreat into the church,” Morgan added.

The church in a message on social media said the bishop and priest were in stable condition and asked for people’s prayers. “It is the bishop’s and father’s wishes that you also pray for the perpetrator,” the statement said.

Holland commended the congregation for subduing the teen before calling police. When asked if the teen’s fingers had been severed, he said the hand injuries were “severe.”

More than 100 police reinforcements arrived before the teen was taken from the church in the hours-long incident. Several police vehicles were damaged, Holland said.

“A number of houses have been damaged. They’ve broken into a number of houses to gain weapons to throw at the police. They’ve thrown weapons and items at the church itself. There were obviously people who wanted to get access to the young person who caused the injuries to the clergy people,” he said.

Australians were still in shock after a lone assailant stabbed six people to death in a Sydney shopping mall on Saturday and injured more than a dozen others.

Holland suggested the weekend attack heightened the community’s response to the church stabbing.

“Given that there has been incidents in Sydney the last few days with knives involved, obviously there’s concerns,” he said. “We’ve asked for everyone to think rationally at this stage.”

The church said in a statement on Tuesday the 53-year-old Iraq-born bishop’s condition was “improving.”

Emmanuel has a strong social media following and is outspoken on a range of issues. He proselytizes to both Jews and Muslims and is critical of liberal Christian denominations.

He also speaks out on global political issues and laments the plight of Palestinians in Gaza.

The bishop, described in local media as a sometimes divisive figure on issues such as COVID-19 restrictions, made national news last year with comments about gender.

A video posted in May 2023 by the ABC about a campaign targeting the LGBTQ+ community showed the bishop in a sermon saying that “when a man calls himself a woman, he is neither a man nor a woman, you are not a human, then you are an it. Now, since you are an it, I will not address you as a human anymore because it is not my choosing, it your choosing.”

McGuirk reported from Melbourne, Australia.

