The power of ‘voice,’ and empowering the voiceless

Many people use their voices every day to talk to people and to communicate their needs and wants, but the idea of ‘voice’ goes much deeper. Having a voice gives an individual agency and power, and a way to express his or her beliefs. But what happens when that voice is expressed differently from the norm? What happens when that voice is in some way silenced?

Meryl Alper, assistant professor of communication studies at Northeastern, explored this idea of “voice” in children and young teenagers who used an iPad app that converted symbols to audible words to help them communicate.

While it may seem like the app helped to return voice to those who used it, Alper found that the technology was subject to economic structures and defined through the lens of ableism.

“People with disabilities are not passively given voices by the able-bodied; disabled individuals, rather, are actively taking and making them,” she said.

Her book on the subject, Giving Voice: Mobile Communication, Disability, and Inequality, was recently recognized by the Association of American Publishers’ PROSE Awards, which honor “the very best in professional and scholarly publishing.”

We often hear about technology giving voice to the voiceless. What does ‘voice’ represent in your research? And what sorts of ‘voices’ are left out of technological advances?

“Giving voice to the voiceless” regularly signifies that the historically underrepresented, disadvantaged, or vulnerable gain opportunities to organize, increase visibility, and express themselves by leveraging the strengths of information, media, and communication technologies. A long list of tools and platforms—including the internet, Facebook, Twitter, community radio, and free and open software—have all been said to “give voice.”

In the book, I critically reflect on how “giving voice to the voiceless” becomes a powerful, and potentially harmful, trope in our society that masks structural inequalities. I do this by considering the separate meanings of “giving,” “voice,” and “the voiceless.” The notion of “the voiceless” suggests a static and clearly defined group. Discussions about “giving” them voice can reinforce and naturalize not “having” a voice, without also questioning the complex dynamics between having and giving, as well as speaking and listening. Additionally, “giving voice” does not challenge the means and methods by which voice may have been obtained, taken, or even stolen in the first place, and how technology and technological infrastructure can and does uphold the status quo.

What were the biggest takeaways from your research?

I studied how non- and minimally-speaking youth with developmental disabilities impacting their speech used voice output communication technologies that take the form of mobile tablets and apps—think of the technology used by the late Stephen Hawking, but simplified on an iPad. The impact of these technologies on the lives of these children and their families was at once positive, negative, and sometimes of little impact at all. We are collectively responsible for how overly simplistic narratives about technology metaphorically and materially “giving voice” to those with disabilities circulate, particularly as social media platforms monetize and incentivize clicks and retweets of stories. These kinds of news and media portrayals are derided among many in the disability community as “inspiration porn.” In economically, politically, and socially uncertain times, certainty in technology as a fix, certainty in disability as something in need of fixing, and the relationship between these certain fixations is something to think very critically about.

We also need to stay vigilant about protecting disability rights and improving disability policy, as well as the policies that acutely impact people with disabilities, such as education, healthcare, and internet access. Having a voice in general, and the role of technology in exploiting that voice, must be understood in relation to other forms of exploitation. People with disabilities are not passively given voices by the able-bodied; individuals with disabilities, rather, are actively taking and making them. Considering all the ways in which our media ecology and political environment are rapidly changing, at stake in these matters is not only which voices get to speak, but who is thought to have agency to speak in the first place.

Giving Voice received an honorable mention from the PROSE Awards. What does this honor mean to you and for your work?

It is a great privilege for my book to be counted among the 2018 honorees and as one of two winners in the Media and Cultural Studies category, as hundreds of exceptional books were published in the discipline in 2017. Media, communication, and cultural studies is a wide and vibrant field, encompassing two different departments at Northeastern alone (communication studies, and media and screen studies). As an assistant professor, it is immensely rewarding and affirming for my work to be considered of a similar caliber to past category winners, including acclaimed senior scholars in my field.

The award also makes a clear statement about the future of the discipline. Giving Voice is broadly about what it means to have a voice in a technologized world and is based on qualitative research among children, families, and people with disabilities. Those populations, and their concerns, are more often than not treated as niche or specialty within the academy. Qualitative research is also regularly undervalued compared to quantitative research. The honor motivates me to keep following my instincts, centering marginalized groups in empirical and theoretical work on technology and society, and posing research questions that excite me.

Don’t Underestimate the Power of Your Voice

  • Dan Bullock
  • Raúl Sánchez

It’s not just what you say, it’s how you say it.

Our voices matter as much as our words matter. They have the power to awaken the senses and lead others to act, close deals, or land us successful job interviews. Through our voices, we create nuances of meaning, convey our emotions, and find the secret to communicating our executive presence. So, how do we train our voices to be more visceral, effective, and command attention?

  • The key lies in harnessing our voices using the principles of vocalics. Vocalics primarily consists of three linguistic elements: stress (volume), intonation (rising and falling tone), and rhythm (pacing). By combining vocalics with public speaking skills, we can color our words with the meaning and emotion that motivate others to act.
  • Crank up your volume: No, we don’t mean shout. The effective use of volume goes beyond trying to be the loudest person in the room. To direct the flow of any conversation, you must overtly stress what linguists call focus words. When you intentionally place volume on certain words, you emphasize parts of a message and shift the direction of a conversation toward your preferred outcome.
  • Use a powerful speech style: The key to achieving a powerful speech style, particularly during job interviews and hiring decisions, is to first concentrate on the “melody” of your voice, also called intonation. This rise or fall of our voice conveys grammatical meaning (questions or statements) or even attitude (surprise, joy, sarcasm).
  • Calibrate your vocal rhythm with the right melody: Our messages are perceived differently depending on the way we use rhythm in our voices. Deliberately varying our pacing with compelling pauses creates “voiced” punctuation, a powerful way to hold the pulse of the moment.
  • Dan Bullock is a language and communications specialist/trainer at the United Nations Secretariat, training diplomats and global UN staff. Dan is the co-author of How to Communicate Effectively with Anyone, Anywhere (Career Press, 2021). He also serves as faculty teaching business communication, linguistics, and public relations within the Division of Programs in Business at New York University’s School of Professional Studies. Dan was the director of corporate communications at a leading NYC public relations firm, and his corporate clients have included TD Bank and Pfizer.
  • Raúl Sánchez is an award-winning clinical assistant professor and the corporate program coordinator at New York University’s School of Professional Studies. Raúl is the co-author of How to Communicate Effectively with Anyone, Anywhere (Career Press, 2021). He has designed and delivered corporate trainings for Deloitte and the United Nations, as well as been a writing consultant for Barnes & Noble Press and PBS. Raúl was awarded the NYU School of Professional Studies Teaching Excellence Award and specializes in linguistics and business communication.

NUHA Foundation

The Power of the Human Voice

Posted on August 8, 2014 by the Editor

It takes the human voice to infuse words with shades of deeper meaning. The role of the human voice in giving deeper meaning to words is crucial when one looks at the significance of the denotative and connotative meanings of expressions. For example, a person can utter the words: I am thirsty. The surface or general meaning is that the person needs some water. However, depending on the context of the utterance, that is, the reason for the expression and the role and position of the speaker, the same words could mean, on a deeper or connotative level: Give me some water now! In that case, I am thirsty would galvanise the person receiving the order to fetch water as quickly as humanly possible.

The human voice is able to infuse words with shades of deeper meaning because the power of speech can unearth the real intentions, mood, character, identity and culture of the speaker in question. It is easy for a person to write something down and mislead his or her audience or the entire world. However, once one has an opportunity to physically interact with the person and listen to his or her voice, the real emotional, physical and cultural elements of the speaker can be easily picked up and placed in their right perspective. By the same token, actors, educators, editors, politicians, religious leaders, advertisers, insurance agents, singers, writers and inspirational speakers suffuse their voices with certain words to successfully appeal to their audiences.

Verbal communication is unique to humans. Human beings are emotional creatures. The human voice is thought to convey emotional valence, arousal and intensity. Music is a powerful medium capable of eliciting a broad range of emotions. The ability to detect emotion in speech and music is an important task in our daily lives. Studies have been conducted to determine why and how music is able to influence its listeners’ moods and emotions. Results showed that melodies with the voice were better recognised than all other instrumental melodies. The authors suggest that the biological significance of the human voice provides a greater depth of processing and enhanced memory.

Think about a normal day in one’s life. How many words does a person speak? How many words do you hear? According to Caleb Lott in the article titled The Power of the Human Voice, while there are several different numbers floating around, an average human speaks a minimum of 7000 words every day. The same writer goes on to say that the human voice is a tremendous asset which can be used to make the ordinary extraordinary. For example, the games Thomas Was Alone and Bastion use the human voice in a unique way that dynamically affects the players’ experiences of the games. This is so because a narrative-focused game is not only a powerful and amazing way to tell the story but also does so in a way that the visuals cannot convey. The writing is amazing, but without the awe-inspiring narration, the impact of the writing would be lessened.

The human voice is an amazing tool that can have a profound effect on video games. Using a narrator affects the gameplay and the experience the player remembers after walking away from the game. Think of being held in awe, listening to the radio where the mellifluous voices of one’s favourite program’s hosts awaken, mesmerise, excite or soothe one. This boils down to the fact that our visceral reactions to the ways people play form an integral part of our interactions and communication. Annie Tucker Morgan, in Talk to Me: The Powerful Effects of the Human Voice, says there is a reason why many people’s first instinct when they are upset is to call their mother. A mother’s love is not only enduring but also something strong that a person finds echoing instinctively and emotionally. She goes on to explain how a University of Wisconsin-Madison study has identified a concrete link between the sound of Mom’s voice and the soothing of jangled nerves through the release in the brain of stress-relieving oxytocin, also known as the “love hormone.” Researchers say that women prefer deep male voices on the condition that those voices are saying complimentary things, but also that a woman’s particular preference for the pitch of a male voice depends on the pitch of her own. Jeffrey Jacob, founder and president of Persuasive Speaking, highlighted the correlation between people’s voices and their professional and personal successes. One study showed that if the other person does not like the sound of one’s voice, one might have a hard time securing his or her approval.

If we do not verbalise we write down things. Is writing not something of great magnificence? If so, why can we not make a difference?

The world has never been static, and neither has writing. It is dynamic. It makes the world revel and reveal itself. Out went the traditional writing feather or pen, and in surged the typewriter, then the “wise” computer. Kudos, the world crooned in celebration of what is probably one of civilization’s most amazing conquests.

However, this does not mean that the pen is down and out. Not at all. Neither does it mean that the pen has ceased to be mightier than the sword. Writing is writing, whether by virtue of the might of the pen or the wizardry of the computer. In verbal communication one can detect the power of the human voice and the mood of the speaker through such elements of speech as intonation, speed, pause, pitch and emphasis. In the written text, register and paragraphing (for example through the use of exclamations) can help detect the speaker’s intentions and emotions.

Different words mean different things to different people. How do writers hold the attention of readers? Through the beauty of words, story-telling helps us derive entertainment from reading, escape from an onerous or anxious life, and of course, understand more about the world. Through words writers create plots that are not devoid of suspense and mystery. Watts in Writing A Novel says, “A plot is like a knitted sweater - only as good as the stitches. Without the links we have a tangle of wool, chaotic and uninteresting. We get immersed in reading because of the power of causality, the power of words. Words play a crucial role in creating a work of art like a novel. Watts in Writing A Novel says a good answer to a narrative question is as satisfying as scratching an itch.

Through writing we find courage, ammunition and inspiration to go on in spite of all the odds; we find vision to define and refine our identities and destinies. Yes, through writing we find ourselves, our voice and verve.

J.D. Salinger came up with an interesting observation. He said “What really knocks me out is a book that, when you’re all done reading it, you wish the author that wrote it was a terrific friend of yours and you could call him up on the phone whenever you felt like it. That doesn’t happen much, though.” Are you not ready to knock many a reader out? Are you not ready to unleash your greatness? How many writers are sitting on their works of art?

Writers and words are good bedfellows. Pass that word. Maya Angelou, the famous author of I Know Why the Caged Bird Sings says “Words mean more than what is set down on paper. It takes the human voice to infuse them with shades of deeper meaning.” A word is a unit of expression which is intertwined with sight, sound, smell, touch, and body movement. I think it is memorable (and obviously powerful) because it appeals to our physical, emotional and intellectual processes. As language practitioners, this knowledge (of the mental schema) is crucial.

What is in a word? For me, words illuminate, revel and reveal the world. Literature is literature because of the words that constitute it. Patrick Rothfuss says, “Words are pale shadows of forgotten names. As names have power, words have power. Words can light fires in the minds of men. Words can wring tears from the hardest hearts.” Yet, Rudyard Kipling claims, “Words are, of course, the most powerful drug used by mankind.” I think this is a very interesting observation.

The beauty of literature is in seeking and gaining an insight into the complexity and diversity of life through the analysis of how the human voice infuses words with shades of deeper meaning. For indeed the dynamic human voice can roar, soar and breathe life into different pregnant clouds of words and meanings.

14 comments on “ The Power of the Human Voice ”

Powerful essay, indeed the human voice has power to articulate emotions, ideas, perception, convictions and so much more and by so doing, breathing life into words.

Henry, thank you for your great words of encouragement.

Wonderful! Spoken words externalise how the speaker perceive the world, how the speaker feels inside…..

Francisco, thank you for stopping by!

Indeed what a wonderful piece of literature,It reminds me of my secondary education days back in the early 1980s when I did “ANIMAL FARM ” by Charles Dickens.

Mr. Mlotshwa, thank you for stopping by. Much appreciated.

Speechless! the language in this piece is just amazing.Well done Mr Ndaba

Khalaz, thank you!

this is a very nice and awesome essay. Great job! 😀

Musa, many thanks!

Ndaba is a compelling writer. An informative piec

Claire, thank you. Humbled.

Wow. This is very excellent, well-written,powerful and informative. You are a great writer. Keep writing.

Tshego, thank you for your kind words!

Anatomy and Physiology of Voice Production: Understanding How Voice is Produced

Larynx Highly specialized structure atop the windpipe responsible for sound production, air passage during breathing and protecting the airway during swallowing

Vocal Folds (also called Vocal Cords) “Fold-like” soft tissue that is the main vibratory component of the voice box; comprised of a cover (epithelium and superficial lamina propria), vocal ligament (intermediate and deep laminae propria), and body (thyroarytenoid muscle)

Glottis (also called Rima Glottidis) Opening between the two vocal folds; the glottis opens during breathing and closes during swallowing and sound production

Voice as We Know It = Voiced Sound + Resonance + Articulation

The “spoken word” results from three components of voice production: voiced sound, resonance, and articulation.

Voiced sound: The basic sound produced by vocal fold vibration is called “voiced sound.” This is frequently described as a “buzzy” sound. Voiced sound for singing differs significantly from voiced sound for speech.

Resonance: Voiced sound is amplified and modified by the vocal tract resonators (the throat, mouth cavity, and nasal passages). The resonators produce a person’s recognizable voice.

Articulation: The vocal tract articulators (the tongue, soft palate, and lips) modify the voiced sound. The articulators produce recognizable words.
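The first two components can be illustrated with a deliberately simplified sketch, not a physiological model. It assumes the "buzzy" voiced sound can be approximated by a train of pulses (one per vibratory cycle) and a single vocal tract resonance by a two-pole digital resonator. The function names `pulse_train` and `resonator`, the 8000 Hz sample rate, and the 500 Hz resonance are all illustrative assumptions, and articulation is not modeled.

```python
import math


def pulse_train(f0_hz, sample_rate, duration_s):
    """Crude stand-in for voiced sound: one unit pulse per vibratory cycle."""
    n = round(sample_rate * duration_s)
    period = sample_rate / f0_hz  # samples per cycle
    out = [0.0] * n
    t = 0.0
    while t < n:
        out[int(t)] = 1.0
        t += period
    return out


def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator: emphasizes energy near center_hz (one 'resonance')."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, below 1
    theta = 2.0 * math.pi * center_hz / sample_rate      # pole angle
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


# Assumed example values: a 100 Hz source shaped by a 500 Hz resonance.
source = pulse_train(f0_hz=100, sample_rate=8000, duration_s=0.1)
voiced = resonator(source, center_hz=500, bandwidth_hz=80, sample_rate=8000)
```

The source sets how often pulses arrive, while the resonator changes only how each pulse rings afterwards. That mirrors the separation described above between vocal fold vibration and the resonators that shape it.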

Voice Depends on Vocal Fold Vibration and Resonance

Sound is produced when aerodynamic phenomena cause vocal folds to vibrate rapidly in a sequence of vibratory cycles with a speed of about:

  • 110 cycles per second or Hz (men) = lower pitch
  • 180 to 220 cycles per second (women) = medium pitch
  • 300 cycles per second (children) = higher pitch

Higher voice: increase in frequency of vocal fold vibration

Louder voice: increase in amplitude of vocal fold vibration
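As a rough illustration of these figures, the sketch below converts a vibration frequency into the duration of one vibratory cycle and generates a plain sine tone whose frequency stands in for pitch and whose amplitude stands in for loudness. A sine wave is an assumption made for simplicity and is not the actual waveform of the vocal folds. The helper names `cycle_period_ms` and `tone` are hypothetical.

```python
import math


def cycle_period_ms(frequency_hz):
    """Duration of one vibratory cycle in milliseconds (period = 1 / frequency)."""
    return 1000.0 / frequency_hz


def tone(frequency_hz, amplitude, sample_rate=8000, duration_s=0.01):
    """Sine tone: higher frequency_hz = higher pitch, larger amplitude = louder."""
    n = round(sample_rate * duration_s)
    return [
        amplitude * math.sin(2.0 * math.pi * frequency_hz * i / sample_rate)
        for i in range(n)
    ]


# Approximate frequencies quoted above (women: 180 to 220 Hz, both ends shown).
for label, hz in [("men", 110), ("women, low", 180), ("women, high", 220), ("children", 300)]:
    print(f"{label}: {hz} Hz, one cycle lasts about {cycle_period_ms(hz):.2f} ms")
```

At 110 cycles per second, each open-and-close cycle lasts roughly 9 milliseconds. At 300 cycles per second it lasts roughly 3 milliseconds.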

Vibratory Cycle = Open + Close Phase

The vocal fold vibratory cycle has phases that include an orderly sequence of opening and closing the top and bottom of the vocal folds, letting short puffs of air through at high speed. Air pressure is converted into sound waves.

Not Like a Guitar String

Vocal folds vibrate when excited by aerodynamic phenomena; they are not plucked like a guitar string. Air pressure from the lungs controls the open phase. The passing air column creates a trailing “Bernoulli effect,” which controls the close phase.

Voice production involves a three-step process.

  • A column of air pressure is moved towards the vocal folds
  • Air is moved out of the lungs and towards the vocal folds by coordinated action of the diaphragm, abdominal muscles, chest muscles, and rib cage
  • Vocal folds are moved to midline by voice box muscles, nerves, and cartilages
  • Column of air pressure opens bottom of vocal folds
  • Column of air continues to move upwards, now towards the top of vocal folds, and opens the top
  • The low pressure created behind the fast-moving air column produces a “Bernoulli effect” which causes the bottom to close, followed by the top
  • Closure of the vocal folds cuts off the air column and releases a pulse of air
  • New cycle repeats
  • Loudness:  Increase in air flow “blows” vocal folds wider apart, which stay apart longer during a vibratory cycle – thus increasing amplitude of the sound pressure wave
  • Pitch:  Increase in frequency of vocal fold vibration raises pitch

Figure: one vibratory cycle (steps 1–10, then repeat). Held in the closed position by muscle, the vocal folds open and close in a cyclical, ordered, and even manner (1–10) as a column of air pressure from the lungs below flows through. This very rapid, ordered closing and opening produced by the column of air is referred to as the mucosal wave. The lower edge opens first (2–3), followed by the upper edge, thus letting air flow through (4–6). The air column that flows through creates a “Bernoulli effect,” which causes the lower edge to close (7–9) as it escapes upwards. The escaping “puffs of air” (10) are converted to sound, which is then transformed into voice by vocal tract resonators. Any change that affects this mucosal wave – stiffness of vocal fold layers, weakness or failure of closure, imbalance between right and left vocal folds from a lesion on one vocal fold – causes voice problems. (For more information, see Anatomy: How Breakdowns Result in Voice Disorders.)

  • Vocal tract – resonators and articulators: The nose, pharynx, and mouth amplify and modify sound, allowing it to take on the distinctive qualities of voice

The way that voice is produced is analogous to the way that sound is produced by a trombone. The trombone player produces sound at the mouthpiece of the instrument with his lips vibrating from air that passes from the mouth. The vibration within the mouthpiece produces sound, which is then altered or “shaped” as it passes throughout the instrument. As the slide of the trombone is changed, the sound of the musical instrument is similarly changed.

Amazing Outcomes of Human Voice

The human voice can be modified in many ways. Consider the spectrum of sounds – whispering, speaking, orating, shouting – as well as the different sounds that are possible in different forms of vocal music, such as rock singing, gospel singing, and opera singing.

Key Factors for Normal Vocal Fold Vibration

To vibrate efficiently vocal folds need to be:

At the midline or “closed”: Failure to move vocal folds to the midline, or any lesion which prevents the vocal fold edges from meeting, allows air to escape and results in breathy voice. Key players: muscles, cartilages, nerves

Pliable: The natural “built-in” elasticity of vocal folds makes them pliable. The top, edge, and bottom of the vocal folds that meet in the midline and vibrate need to be pliable. Changes in vocal fold pliability, even if limited to just one region or “spot,” can cause voice disorders, as seen in vocal fold scarring. Key players: epithelium, superficial lamina propria

“Just right” tension: Inability to adjust tension during singing can cause a failure to reach high notes or breaks in voice. Key players: muscle, nerve, cartilages

“Just right” mass: Changes in the soft tissue bulk of the vocal folds – such as decrease or thinning, as in scarring, or increase or swelling, as in Reinke’s edema – produce many voice symptoms: hoarseness, altered voice pitch, effortful phonation, etc. (For more information, see Vocal Fold Scarring and Reinke’s Edema.) Key players: muscles, nerves, epithelium, superficial lamina propria

June 19, 2017

Human Voices Are Unique but We're Not That Good at Recognizing Them

People are good at picking out familiar people’s voices from their speech, but ear-witness testimonies of strangers’ voices are notoriously unreliable and inaccurate

By Carolyn McGettigan, Nadine Lavan & The Conversation Global

The following essay is reprinted with permission from The Conversation, an online publication covering the latest research.

“Alexa, who am I?” Amazon Echo’s voice-controlled virtual assistant, Alexa, doesn’t have an answer to that, yet. However, for other applications of speech technology, computer algorithms are increasingly able to discriminate, recognise and identify individuals from voice recordings.

Of course, these algorithms are far from perfect, as was recently shown when a BBC journalist broke into his own voice-controlled bank account using his twin brother’s voice. Is this a case of computers just failing at something humans can do perfectly? We decided to find out.

Each human being has a voice that is distinct and different from everyone else’s. So it seems intuitive that we’d be able to identify someone from their voice fairly easily. But how well can you actually do this? When it comes to recognising your closest family and friends, you’re probably quite good. But would you be able to recognise the voice of your first primary school teacher if you heard them again today? How about the guy on the train this morning who was shouting into his phone? What if you had to pick him out, not from his talking voice, but from samples of his laughter, or singing?

To date, research has only explored voice identity perception using a limited set of vocalisations, for example sentences that have been read aloud or snippets of conversational speech. These studies have found that we can actually recognise the voices of familiar people from their speech quite well. But they have also shown that there are problems: ear-witness testimonies are notoriously unreliable and inaccurate.

It’s important to keep in mind that these studies have not captured much of the flexibility of the sounds we can make with our voices. This is bound to have an effect on how we process the identity of the person behind the voice we are listening to. Therefore, we are currently missing a very large and important piece of the puzzle.

Recognising voices requires two broad processes to operate together: we need to distinguish between the voices of different people (telling people apart) and we need to be able to attribute a single identity to all the different sounds (talking, laughing, shouting) that can come from the same person (“telling people together”). We set out to investigate the limits of these abilities in humans.

Voice experiment

Our recent study, published in the Journal of Experimental Psychology: General, confirms that voice identity perception can be extremely challenging. Capitalising on how variable a single person’s voice can be, we presented 46 listeners with laughter and vowels produced by five people. Listeners were asked to make a very simple judgement about pairs of sounds: were they made by the same person, or by two different people? As long as they could compare vowels to vowels or laughter to laughter respectively, discriminating between speakers was relatively successful.

But when we asked our listeners to make this judgement based on a mixed pair of sounds, such as directly comparing vowels to laughter in a pair, they couldn’t discriminate between speakers at all—especially if they were not familiar with the speaker. However, even though a sub-group of people who knew the speakers performed better overall, they still struggled significantly with the challenge of “telling people together”.

Similar effects have been reported by studies showing, for example, that it is difficult to recognise a bilingual speaker across their two languages. What’s surprising about these findings is how bad voice perception can be once listeners are exposed to natural variation in the sounds that a voice can produce. So, it’s intriguing to consider that while we each have a unique voice, we don’t yet know how useful that uniqueness is.

But why have we evolved to have unique voices if we can’t even recognise them? That’s really an open question so far. We don’t actually know whether we have evolved to have unique voices. We also all have different and largely unique fingerprints, but there’s no evolutionary advantage to that as far as we can tell. It just so happens that, because of differences in anatomy and, probably most importantly, in how we use our voices, we all sound different from each other.

Luckily computer algorithms are still able to make the most of the individuality of the human voice. They have probably already outdone humans in some cases—and they will keep on improving. The way these machine-learning algorithms recognise speakers is based on mathematical solutions to create “voice prints”—unique representations picking up the specific acoustic features of each individual voice.

In contrast to computers, humans might not know what they are listening out for, or how to separate out these acoustic features. So, the way that voice prints are created for the algorithms is not closely modelled on what human listeners appear to do, and we’re still working on this. In the long term, it will be interesting to see if there is any overlap in the way human listeners and machine-learning algorithms recognise voices. While human listeners are unlikely to glean any insights from how computers solve this problem, conversely we might be able to build machines that emulate effective aspects of human performance.

It is rumoured that Amazon is currently working on teaching Alexa how to identify specific users by their voice. If this works, it will be a truly impressive feat and may put a stop to further unwanted orders of dollhouses. But do be patient if Alexa makes mistakes: you may not be able to do it any better yourself.

This article was originally published on The Conversation.

J Acoust Soc Am

Mechanics of human voice production and control

As the primary means of communication, voice plays an important role in daily life. Voice also conveys personal information such as social status, personal traits, and the emotional state of the speaker. Mechanically, voice production involves complex fluid-structure interaction within the glottis and its control by laryngeal muscle activation. An important goal of voice research is to establish a causal theory linking voice physiology and biomechanics to how speakers use and control voice to communicate meaning and personal information. Establishing such a causal theory has important implications for clinical voice management, voice training, and many speech technology applications. This paper provides a review of voice physiology and biomechanics, the physics of vocal fold vibration and sound production, and laryngeal muscular control of the fundamental frequency of voice, vocal intensity, and voice quality. Current efforts to develop mechanical and computational models of voice production are also critically reviewed. Finally, issues and future challenges in developing a causal theory of voice production and perception are discussed.

I. INTRODUCTION

In the broad sense, voice refers to the sound we produce to communicate meaning, ideas, opinions, etc. In the narrow sense, voice, as in this review, refers to sounds produced by vocal fold vibration, or voiced sounds. This is in contrast to unvoiced sounds which are produced without vocal fold vibration, e.g., fricatives which are produced by airflow through constrictions in the vocal tract, plosives produced by sudden release of a complete closure of the vocal tract, or other sound producing mechanisms such as whispering. For voiced sound production, vocal fold vibration modulates airflow through the glottis and produces sound (the voice source), which propagates through the vocal tract and is selectively amplified or attenuated at different frequencies. This selective modification of the voice source spectrum produces perceptible contrasts, which are used to convey different linguistic sounds and meaning. Although this selective modification is an important component of voice production, this review focuses on the voice source and its control within the larynx.

For effective communication of meaning, the voice source, as a carrier for the selective spectral modification by the vocal tract, contains harmonic energy across a large range of frequencies that spans at least the first few acoustic resonances of the vocal tract. In order to be heard over noise, such harmonic energy also has to be reasonably above the noise level within this frequency range, unless a breathy voice quality is desired. The voice source also contains important information about pitch, loudness, prosody, and voice quality, which convey meaning (see Kreiman and Sidtis, 2011, Chap. 8 for a review), biological information (e.g., size), and paralinguistic information (e.g., the speaker's social status, personal traits, and emotional state; Sundberg, 1987; Kreiman and Sidtis, 2011). For example, the same vowel may sound different when spoken by different people. Sometimes a simple “hello” is all it takes to recognize a familiar voice on the phone. People tend to use different voices with different speakers on different occasions, and it is often possible to tell if someone is happy or sad from the tone of their voice.

One of the important goals of voice research is to understand how the vocal system produces voice of different source characteristics and how people associate percepts with these characteristics. Establishing a cause-effect relationship between voice physiology and voice acoustics and perception will allow us to answer two essential questions in voice science and effective clinical care (Kreiman et al., 2014): when the output voice changes, what physiological alteration caused this change; and if a change to voice physiology occurs, what change in perceived voice quality can be expected? Clinically, such knowledge would lead to the development of a physically based theory of voice production that is capable of better predicting voice outcomes of clinical management of voice disorders, thus improving both diagnosis and treatment. More generally, an understanding of this relationship could lead to a better understanding of the laryngeal adjustments that we use to change voice quality, adopt different speaking or singing styles, or convey personal information such as social status and emotion. Such understanding may also lead to the development of improved computer programs for synthesis of natural-sounding, speaker-specific speech of varying emotional percepts.

Understanding such a cause-effect relationship between voice physiology and production necessarily requires a multi-disciplinary effort. While voice production results from a complex fluid-structure-acoustic interaction process, which in turn depends on the geometry and material properties of the lungs, larynx, and the vocal tract, the end interest of voice is its acoustics and perception. Changes in voice physiology or physics that cannot be heard are of limited interest. On the other hand, the physiology and physics may impose constraints on the co-variations among fundamental frequency (F0), vocal intensity, and voice quality, and thus the way we use and control our voice. Thus, understanding voice production and voice control requires an integrated approach, in which physiology, vocal fold vibration, and acoustics are considered as a whole instead of disconnected components. Traditionally, the multi-disciplinary nature of voice production has led to a clear divide between research activities in voice production, voice perception, and their clinical or speech applications, with few studies attempting to link them together. Although much advancement has been made in understanding the physics of phonation, some misconceptions still exist in textbooks in otolaryngology and speech pathology. For example, the Bernoulli effect, which has been shown to play a minor role in phonation, is still considered an important factor in initiating and sustaining phonation in many textbooks and reviews. Tension and stiffness are often used interchangeably even though they have different physical meanings. The role of the thyroarytenoid muscle in regulating medial compression of the membranous vocal folds is often understated. On the other hand, research on voice production often focuses on the glottal flow and vocal fold vibration, but can benefit from a broader consideration of the acoustics of the produced voice and their implications for voice communication.

This paper provides a review of our current understanding of the cause-effect relation between voice physiology, voice production, and voice perception, with the hope that it will help better bridge research efforts in different aspects of voice studies. An overview of vocal fold physiology is presented in Sec. II, with an emphasis on laryngeal regulation of the geometry, mechanical properties, and position of the vocal folds. The physical mechanisms of self-sustained vocal fold vibration and sound generation are discussed in Sec. III, with a focus on the roles of various physical components and features in initiating phonation and affecting the produced acoustics. Some misconceptions of the voice production physics are also clarified. Section IV discusses the physiologic control of F0, vocal intensity, and voice quality. Section V reviews past and current efforts in developing mechanical and computational models of voice production. Issues and future challenges in establishing a causal theory of voice production and perception are discussed in Sec. VI.

II. VOCAL FOLD PHYSIOLOGY AND BIOMECHANICS

A. Vocal fold anatomy and biomechanics

The human vocal system includes the lungs and the lower airway that function to supply air pressure and airflow (a review of the mechanics of the subglottal system can be found in Hixon, 1987), the vocal folds whose vibration modulates the airflow and produces the voice source, and the vocal tract that modifies the voice source and thus creates specific output sounds. The vocal folds are located in the larynx and form a constriction in the airway [Fig. 1(a)]. Each vocal fold is about 11–15 mm long in adult women and 17–21 mm in men, and stretches across the larynx along the anterior-posterior direction, attaching anteriorly to the thyroid cartilage and posteriorly to the anterolateral surface of the arytenoid cartilages [Fig. 1(c)]. Both the arytenoid [Fig. 1(d)] and thyroid [Fig. 1(e)] cartilages sit on top of the cricoid cartilage and interact with it through the cricoarytenoid joint and cricothyroid joint, respectively. The relative movement of these cartilages thus provides a means to adjust the geometry, mechanical properties, and position of the vocal folds, as further discussed below. The three-dimensional airspace between the two opposing vocal folds is the glottis. The glottis can be divided into a membranous portion, which includes the anterior portion of the glottis and extends from the anterior commissure to the vocal process of the arytenoid, and a cartilaginous portion, which is the posterior space between the arytenoid cartilages.

Fig. 1. (Color online) (a) Coronal view of the vocal folds and the airway; (b) histological structure of the vocal fold lamina propria in the coronal plane (image provided by Dr. Jennifer Long of UCLA); (c) superior view of the vocal folds, cartilaginous framework, and laryngeal muscles; (d) medial view of the cricoarytenoid joint formed between the arytenoid and cricoid cartilages; (e) posterolateral view of the cricothyroid joint formed by the thyroid and the cricoid cartilages. The arrows in (d) and (e) indicate the directions of possible motion of the arytenoid and cricoid cartilages due to LCA and CT muscle activation, respectively.

The vocal folds are layered structures, consisting of an inner muscular layer (the thyroarytenoid muscle) with muscle fibers aligned primarily along the anterior-posterior direction, a soft tissue layer of the lamina propria, and an outermost epithelium layer [Figs. 1(a) and 1(b)]. The thyroarytenoid (TA) muscle is sometimes divided into a medial and a lateral bundle, with each bundle responsible for a certain vocal fold posturing function. However, such functional division is still a topic of debate (Zemlin, 1997). The lamina propria consists of the extracellular matrix (ECM) and interstitial substances. The two primary ECM proteins are the collagen and elastin fibers, which are aligned mostly along the length of the vocal folds in the anterior-posterior direction (Gray et al., 2000). Based on the density of the collagen and elastin fibers [Fig. 1(b)], the lamina propria can be divided into a superficial layer with limited and loose elastin and collagen fibers, an intermediate layer of dominantly elastin fibers, and a deep layer of mostly dense collagen fibers (Hirano and Kakita, 1985; Kutty and Webb, 2009). In comparison, the lamina propria (about 1 mm thick) is much thinner than the TA muscle.

Conceptually, the vocal fold is often simplified into a two-layer body-cover structure (Hirano, 1974; Hirano and Kakita, 1985). The body layer includes the muscular layer and the deep layer of the lamina propria, and the cover layer includes the intermediate and superficial lamina propria and the epithelium layer. This body-cover concept of vocal fold structure will be adopted in the discussions below. Another grouping scheme divides the vocal fold into three layers. In addition to a body and a cover layer, the intermediate and deep layers of the lamina propria are grouped into a vocal ligament layer (Hirano, 1975). It is hypothesized that this layered structure plays a functional role in phonation, with different combinations of mechanical properties in different layers leading to production of different voice source characteristics (Hirano, 1974). However, because of the lack of data on the mechanical properties in each vocal fold layer and how they vary at different conditions of laryngeal muscle activation, a definite understanding of the functional roles of each vocal fold layer is still missing.

The mechanical properties of the vocal folds have been quantified using various methods, including tensile tests (Hirano and Kakita, 1985; Zhang et al., 2006b; Kelleher et al., 2013a), shear rheometry (Chan and Titze, 1999; Chan and Rodriguez, 2008; Miri et al., 2012), indentation (Haji et al., 1992a,b; Tran et al., 1993; Chhetri et al., 2011), and a surface wave method (Kazemirad et al., 2014). These studies showed that the vocal folds exhibit a nonlinear, anisotropic, viscoelastic behavior. A typical stress-strain curve of the vocal folds under an anterior-posterior (AP) tensile test is shown in Fig. 2. The slope of the curve, or stiffness, quantifies the extent to which the vocal folds resist deformation in response to an applied force. In general, after an initial linear range, the slope of the stress-strain curve (stiffness) increases gradually with further increase in the strain (Fig. 2), presumably due to the gradual engagement of the collagen fibers. Such nonlinear mechanical behavior provides a means to regulate vocal fold stiffness and tension through vocal fold elongation or shortening, which plays an important role in the control of the F0 or pitch of voice production. Typically, the stress is higher during loading than unloading, indicating a viscous behavior of the vocal folds. Due to the presence of the AP-aligned collagen, elastin, and muscle fibers, the vocal folds also exhibit anisotropic mechanical properties, stiffer along the AP direction than in the transverse plane. Experiments (Hirano and Kakita, 1985; Alipour and Vigmostad, 2012; Miri et al., 2012; Kelleher et al., 2013a) showed that the Young's modulus along the AP direction in the cover layer is more than 10 times (as high as 80 times in Kelleher et al., 2013a) larger than in the transverse plane. Stiffness anisotropy has been shown to facilitate medial-lateral motion of the vocal folds (Zhang, 2014) and complete glottal closure during phonation (Xuan and Zhang, 2014).

Fig. 2. Typical tensile stress-strain curve of the vocal fold along the anterior-posterior direction during loading and unloading at 1 Hz. The slope of the tangent line (dashed lines) to the stress-strain curve quantifies the tangent stiffness. The stress is typically higher during loading than unloading due to the viscous behavior of the vocal folds. The curve was obtained by averaging data over 30 cycles after a 10-cycle preconditioning.
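As a rough numerical illustration of tangent stiffness, the following Python sketch assumes a hypothetical exponential stress-strain law, sigma = a (exp(b * strain) - 1), with made-up parameters a and b. Neither the law nor the values are taken from the measurements cited above; the sketch only shows how the slope of a nonlinear curve grows with strain, whereas a linear material would have the same slope at every strain.

```python
import math

def stress(strain, a=1.0, b=10.0):
    """Hypothetical exponential stress-strain law (kPa); a and b are illustrative only."""
    return a * (math.exp(b * strain) - 1.0)

def tangent_stiffness(strain, h=1e-6, a=1.0, b=10.0):
    """Central-difference estimate of d(stress)/d(strain) at the given strain (kPa)."""
    return (stress(strain + h, a, b) - stress(strain - h, a, b)) / (2.0 * h)

# Tangent stiffness rises with elongation for this assumed nonlinear law.
for eps in (0.0, 0.1, 0.2, 0.3):
    print(f"strain={eps:.1f}  stress={stress(eps):6.2f} kPa  "
          f"tangent stiffness={tangent_stiffness(eps):7.1f} kPa")
```

Under these assumptions, elongating the tissue moves the operating point up the curve, so both the stress (tension) and the tangent stiffness increase together.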

Accurate measurement of vocal fold mechanical properties at typical phonation conditions is challenging, due to both the small size of the vocal folds and the relatively high frequency of phonation. Although tensile tests and shear rheometry allow direct measurement of material moduli, the small sample size often leads to difficulties in mounting tissue samples to the testing equipment, thus creating concerns about accuracy. These two methods also require dissecting tissue samples from the vocal folds and the laryngeal framework, making in vivo measurement impossible. The indentation method is ideal for in vivo measurement and, because of the small size of indenters used, allows characterization of the spatial variation of mechanical properties of the vocal folds. However, it is limited to measurement of mechanical properties under conditions of small deformation. Although large indentation depths can be used, data interpretation becomes difficult and thus the method is not suitable for assessment of the nonlinear mechanical properties of the vocal folds.

There has been some recent work toward understanding the contribution of individual ECM components to the macro-mechanical properties of the vocal folds and developing a structurally based constitutive model of the vocal folds (e.g., Chan et al., 2001; Kelleher et al., 2013b; Miri et al., 2013). The contribution of interstitial fluid to the viscoelastic properties of the vocal folds and vocal fold stress during vocal fold vibration and collision has also been investigated using a biphasic model of the vocal folds in which the vocal fold was modeled as a solid phase interacting with an interstitial fluid phase (Zhang et al., 2008; Tao et al., 2009; Tao et al., 2010; Bhattacharya and Siegmund, 2013). This structurally based approach has the potential to predict vocal fold mechanical properties from the distribution of collagen and elastin fibers and interstitial fluids, which may provide new insights toward the differential mechanical properties between different vocal fold layers at different physiologic conditions.

B. Vocal fold posturing

Voice communication requires fine control and adjustment of pitch, loudness, and voice quality. Physiologically, such adjustments are made through laryngeal muscle activation, which stiffens, deforms, or repositions the vocal folds, thus controlling the geometry and mechanical properties of the vocal folds and glottal configuration.

One important posturing is adduction/abduction of the vocal folds, which is primarily achieved through motion of the arytenoid cartilages. Anatomical analysis and numerical simulations have shown that the cricoarytenoid joint allows the arytenoid cartilages to slide along and rotate about the long axis of the cricoid cartilage, but constrains arytenoid rotation about the short axis of the cricoid cartilage (Selbie et al., 1998; Hunter et al., 2004; Yin and Zhang, 2014). Activation of the lateral cricoarytenoid (LCA) muscles, which attach anteriorly to the cricoid cartilage and posteriorly to the arytenoid cartilages, induces mainly an inward rotation of the arytenoid about the cricoid cartilage in the coronal plane, and moves the posterior portion of the vocal folds toward the glottal midline. Activation of the interarytenoid (IA) muscles, which connect the posterior surfaces of the two arytenoids, slides and approximates the arytenoid cartilages [Fig. 1(c)], thus closing the cartilaginous glottis. Because both muscles act on the posterior portion of the vocal folds, combined action of the two muscles is able to completely close the posterior portion of the glottis, but is less effective in closing the mid-membranous glottis (Fig. 3; Choi et al., 1993; Chhetri et al., 2012; Yin and Zhang, 2014). Because of this inefficiency in mid-membranous approximation, LCA/IA muscle activation is unable to produce medial compression between the two vocal folds in the membranous portion, contrary to current understandings (Klatt and Klatt, 1990; Hixon et al., 2008). Complete closure and medial compression of the mid-membranous glottis requires the activation of the TA muscle (Choi et al., 1993; Chhetri et al., 2012). The TA muscle forms the bulk of the vocal folds and stretches from the thyroid prominence to the anterolateral surface of the arytenoid cartilages (Fig. 1). Activation of the TA muscle produces a whole-body rotation of the vocal folds in the horizontal plane about the point of its anterior attachment to the thyroid cartilage toward the glottal midline (Yin and Zhang, 2014). This rotational motion is able to completely close the membranous glottis but often leaves a gap posteriorly (Fig. 3). Complete closure of both the membranous and cartilaginous glottis thus requires combined activation of the LCA/IA and TA muscles. The posterior cricoarytenoid (PCA) muscles are primarily responsible for opening the glottis but may also play a role in voice production of very high pitches, as discussed below.

Fig. 3. Activation of the LCA/IA muscles completely closes the posterior glottis but leaves a small gap in the membranous glottis, whereas TA activation completely closes the anterior glottis but leaves a gap at the posterior glottis. From unpublished stroboscopic recordings from the in vivo canine larynx experiments in Choi et al. (1993).

Vocal fold tension is regulated by elongating or shortening the vocal folds. Because of the nonlinear material properties of the vocal folds, changing vocal fold length also leads to changes in vocal fold stiffness, which otherwise would stay constant for linear materials. The two laryngeal muscles involved in regulating vocal fold length are the cricothyroid (CT) muscle and the TA muscle. The CT muscle consists of two bundles. The vertically oriented bundle, the pars recta, connects the anterior surface of the cricoid cartilage and the lower border of the thyroid lamina. Its contraction approximates the thyroid and cricoid cartilages anteriorly through a rotation about the cricothyroid joint. The other bundle, the pars oblique, is oriented upward and backward, connecting the anterior surface of the cricoid cartilage to the inferior cornu of the thyroid cartilage. Its contraction displaces the cricoid and arytenoid cartilages backwards (Stone and Nuttall, 1974), although the thyroid cartilage may also move forward slightly. Contraction of both bundles thus elongates the vocal folds and increases the stiffness and tension in both the body and cover layers of the vocal folds. In contrast, activation of the TA muscle, which forms the body layer of the vocal folds, increases the stiffness and tension in the body layer. Activation of the TA muscle, in addition to an initial effect of mid-membranous vocal fold approximation, also shortens the vocal folds, which decreases both the stiffness and tension in the cover layer (Hirano and Kakita, 1985; Yin and Zhang, 2013). One exception is when the tension in the vocal fold cover is already negative (i.e., under compression), in which case shortening the vocal folds further through TA activation decreases tension (i.e., increases the compressive force) but may increase stiffness in the cover layer. Activation of the LCA/IA muscles generally does not change the vocal fold length much and thus has only a slight effect on vocal fold stiffness and tension (Chhetri et al., 2009; Yin and Zhang, 2014). However, activation of the LCA/IA muscles (and also the PCA muscles) does stabilize the arytenoid cartilage and prevent it from moving forward when the cricoid cartilage is pulled backward due to the effect of CT muscle activation, thus facilitating extreme vocal fold elongation, particularly for high-pitch voice production. As noted above, due to the lack of reliable measurement methods, our understanding of how vocal fold stiffness and tension vary at different muscular activation conditions is limited.

Activation of the CT and TA muscles also changes the medial surface shape of the vocal folds and the glottal channel geometry. Specifically, TA muscle activation causes the inferior part of the medial surface to bulge out toward the glottal midline (Hirano and Kakita, 1985; Hirano, 1988; Vahabzadeh-Hagh et al., 2016), thus increasing the vertical thickness of the medial surface. In contrast, CT activation reduces this vertical thickness of the medial surface. Although many studies have investigated the effect of the prephonatory glottal shape (convergent, straight, or divergent) on phonation (Titze, 1988a; Titze et al., 1995), a recent study showed that the glottal channel geometry remains largely straight under most conditions of laryngeal muscle activation (Vahabzadeh-Hagh et al., 2016).

III. PHYSICS OF VOICE PRODUCTION

A. Sound sources of voice production

The phonation process starts from the adduction of the vocal folds, which approximates the vocal folds to reduce or close the glottis. Contraction of the lungs initiates airflow and establishes pressure buildup below the glottis. When the subglottal pressure exceeds a certain threshold pressure, the vocal folds are excited into a self-sustained vibration. Vocal fold vibration in turn modulates the glottal airflow into a pulsating jet flow, which eventually develops into turbulent flow into the vocal tract.

In general, three major sound production mechanisms are involved in this process (McGowan, 1988; Hofmans, 1998; Zhao et al., 2002; Zhang et al., 2002a): a monopole sound source due to the volume of air displaced by vocal fold vibration, a dipole sound source due to the fluctuating force applied by the vocal folds to the airflow, and a quadrupole sound source due to turbulence developed immediately downstream of the glottal exit. When the false vocal folds are tightly adducted, an additional dipole source may arise as the glottal jet impinges onto the false vocal folds (Zhang et al., 2002b). The monopole sound source is generally small considering that the vocal folds are nearly incompressible and thus the net volume flow displacement is small. The dipole source is generally considered as the dominant sound source and is responsible for the harmonic component of the produced sound. The quadrupole sound source is generally much weaker than the dipole source in magnitude, but it is responsible for broadband sound production at high frequencies.

For the harmonic component of the voice source, an equivalent monopole sound source can be defined at a plane just downstream of the region of major sound sources, with the source strength equal to the instantaneous pulsating glottal volume flow rate. In the source-filter theory of phonation (Fant, 1970), this monopole sound source is the input signal to the vocal tract, which acts as a filter and shapes the sound source spectrum into different sounds before they are radiated from the mouth into the open air as the voice we hear. Because of radiation from the mouth, the sound source is proportional to the time derivative of the glottal flow. Thus, in the voice literature, the time derivative of the glottal flow, instead of the glottal flow, is considered as the voice source.

The phonation cycle is often divided into an open phase, in which the glottis opens (the opening phase) and closes (the closing phase), and a closed phase, in which the glottis is closed or remains at a minimum opening area when the glottal closure is incomplete. The glottal flow increases and decreases in the open phase, and remains zero during the closed phase or minimum for incomplete glottal closure (Fig. 4). Compared to the glottal area waveform, the glottal flow waveform reaches its peak at a later time in the cycle so that the glottal flow waveform is more skewed to the right. This skewing in the glottal flow waveform to the right is due to the acoustic mass in the glottis and the vocal tract (when the F0 is lower than a nearby vocal tract resonance frequency), which causes a delay in the increase in the glottal flow during the opening phase, and a faster decay in the glottal flow during the closing phase (Rothenberg, 1981; Fant, 1982). Because of this waveform skewing to the right, the negative peak of the time derivative of the glottal flow in the closing phase is often much more dominant than the positive peak in the opening phase. The instant of the most negative peak is thus considered the point of main excitation of the vocal tract and the corresponding negative peak, also referred to as the maximum flow declination rate (MFDR), is a major determinant of the peak amplitude of the produced voice. After the negative peak, the time derivative of the glottal flow waveform returns to zero as phonation enters the closed phase.

Fig. 4. (Color online) Typical glottal flow waveform and its time derivative (left) and their correspondence to the spectral slopes of the low-frequency and high-frequency portions of the voice source spectrum (right).
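The relation between waveform skewing and the MFDR can be sketched numerically. The Python example below assumes a simple piecewise-cosine glottal flow pulse with a slow opening phase (Tp) and a faster closing phase (Tn), followed by a closed phase; the pulse shape and all parameter values are hypothetical and chosen only for illustration. It differentiates the flow numerically and compares the positive peak of the derivative with the magnitude of the negative peak (the MFDR).

```python
import math

def glottal_flow(t, T=0.010, Tp=0.004, Tn=0.002, peak=1.0):
    """Assumed right-skewed flow pulse: slow rise over Tp, faster fall over Tn, then closed."""
    t = t % T
    if t < Tp:                      # opening phase
        return 0.5 * peak * (1.0 - math.cos(math.pi * t / Tp))
    if t < Tp + Tn:                 # closing phase
        return peak * math.cos(0.5 * math.pi * (t - Tp) / Tn)
    return 0.0                      # closed phase

fs = 100_000                        # sampling rate in Hz (assumption)
n = 1000                            # samples in one 10 ms period
flow = [glottal_flow(i / fs) for i in range(n + 1)]
dflow = [(flow[i + 1] - flow[i]) * fs for i in range(n)]  # forward difference

pos_peak = max(dflow)               # largest rate of flow increase (opening phase)
mfdr = -min(dflow)                  # maximum flow declination rate (closing phase)
print(f"positive peak = {pos_peak:.1f}, MFDR = {mfdr:.1f}, ratio = {mfdr / pos_peak:.2f}")
```

For this assumed pulse the closing phase lasts half as long as the opening phase, so the negative peak of the derivative is roughly twice the positive peak, in line with the qualitative description above.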

Much work has been done to directly link features of the glottal flow waveform to voice acoustics and potentially voice quality (e.g., Fant, 1979, 1982; Fant et al., 1985; Gobl and Chasaide, 2010). These studies showed that the low-frequency spectral shape (the first few harmonics) of the voice source is primarily determined by the relative duration of the open phase with respect to the oscillation period (To/T in Fig. 4, also referred to as the open quotient). A longer open phase often leads to a more dominant first harmonic (H1) in the low-frequency portion of the resulting voice source spectrum. For a given oscillation period, shortening the open phase causes most of the glottal flow change to occur within a duration (To) that is increasingly shorter than the period T. This leads to an energy boost in the low-frequency portion of the source spectrum that peaks around a frequency of 1/To. For a glottal flow waveform of a very short open phase, the second harmonic (H2) or even the fourth harmonic (H4) may become the most dominant harmonic. Voice source with a weak H1 relative to H2 or H4 is often associated with a pressed voice quality.
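The dependence of H1 relative to H2 on the open quotient can be checked with a small numerical sketch. The Python code below assumes a symmetric raised-cosine flow pulse that occupies a fraction OQ = To/T of the period, which deliberately ignores waveform skewing in order to isolate the effect of the open quotient; the pulse shape and the two OQ values are illustrative assumptions, not fitted to any data. It computes the levels of the first two harmonics of the flow derivative by direct Fourier summation over one period.

```python
import cmath
import math

def flow_derivative_period(open_quotient, n=1000):
    """One period (time in units of T) of the derivative of an assumed raised-cosine pulse."""
    samples = []
    for i in range(n):
        t = i / n
        if t < open_quotient:
            # d/dt of 0.5 * (1 - cos(2*pi*t/OQ)) during the open phase
            samples.append(math.pi / open_quotient
                           * math.sin(2.0 * math.pi * t / open_quotient))
        else:
            samples.append(0.0)      # closed phase: flow is constant (zero)
    return samples

def harmonic_level_db(samples, k):
    """Level in dB of the k-th harmonic via a direct discrete Fourier sum."""
    n = len(samples)
    coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(samples)) / n
    return 20.0 * math.log10(abs(coeff))

def h1_minus_h2(open_quotient):
    s = flow_derivative_period(open_quotient)
    return harmonic_level_db(s, 1) - harmonic_level_db(s, 2)

for oq in (0.8, 0.4):
    print(f"open quotient {oq:.1f}: H1 - H2 = {h1_minus_h2(oq):+.1f} dB")
```

With this assumed pulse, H1 exceeds H2 for the long open phase (OQ = 0.8), while H2 becomes the stronger harmonic for the short open phase (OQ = 0.4), matching the trend described above.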

The spectral slope in the high-frequency range is primarily related to the degree of discontinuity in the time derivative of the glottal flow waveform. Due to the waveform skewing discussed earlier, the most dominant source of discontinuity often occurs around the instant of main excitation when the time derivative of the glottal flow waveform returns from the negative peak to zero within a time scale of Ta (Fig. 4). For an abrupt glottal flow cutoff (Ta = 0), the time derivative of the glottal flow waveform has a strong discontinuity at the point of main excitation, which causes the voice source spectrum to decay asymptotically at a roll-off rate of −6 dB per octave toward high frequencies. Increasing Ta from zero leads to a gradual return from the negative peak to zero. When approximated by an exponential function, this gradual return functions as a low-pass filter, with a cutoff frequency around 1/Ta, and reduces the excitation of harmonics above the cutoff frequency 1/Ta. Thus, in the frequency range concerning voice perception, increasing Ta often leads to reduced higher-order harmonic excitation. In the extreme case when there is minimal vocal fold contact, the time derivative of the glottal flow waveform is so smooth that the voice source spectrum only has a few lower-order harmonics. Perceptually, strong excitation of higher-order harmonics is often associated with a bright output sound quality, whereas voice source with limited excitation of higher-order harmonics is often perceived to be weak.
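The low-pass effect of the return phase can be illustrated under an idealizing assumption: if the return to zero is a pure exponential with time constant Ta, its effect on the spectrum resembles that of a first-order low-pass filter with magnitude response 1 / sqrt(1 + (2*pi*f*Ta)^2). The corner of such a filter lies at an angular frequency of 1/Ta, that is, at f = 1/(2*pi*Ta) in Hz. The Python sketch below evaluates this gain for a hypothetical Ta of 0.2 ms; the value and the first-order idealization are assumptions made for illustration only.

```python
import math

def return_phase_gain_db(freq_hz, ta_s):
    """Gain in dB of a first-order low-pass filter with time constant ta_s (idealization)."""
    return -10.0 * math.log10(1.0 + (2.0 * math.pi * freq_hz * ta_s) ** 2)

ta = 0.0002                                  # hypothetical return time of 0.2 ms
corner = 1.0 / (2.0 * math.pi * ta)          # corner frequency in Hz

for f in (200.0, corner, 4000.0, 8000.0):
    print(f"{f:7.0f} Hz: {return_phase_gain_db(f, ta):6.2f} dB")

# Extra attenuation per octave well above the corner approaches 6 dB,
# on top of the roll-off already present for an abrupt cutoff.
extra_slope = return_phase_gain_db(8000.0, ta) - return_phase_gain_db(4000.0, ta)
print(f"additional slope between 4 and 8 kHz: {extra_slope:.2f} dB per octave")
```

Under this idealization, harmonics well below the corner are almost unaffected, while harmonics well above it are attenuated by close to an additional 6 dB per octave, so a larger Ta lowers the corner and weakens the higher-order harmonics.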

Also of perceptual importance is the turbulence noise produced immediately downstream of the glottis. Although small in amplitude, the noise component plays an important role in voice quality perception, particularly for female voice in which aspiration noise is more persistent than in male voice. While the noise component of voice is often modeled as white noise, its spectrum often is not flat and may exhibit different spectral shapes, depending on the glottal opening and flow rate as well as the vocal tract shape. Interaction between the spectral shape and relative levels of harmonic and noise energy in the voice source has been shown to influence the perception of voice quality ( Kreiman and Gerratt, 2012 ).

It is worth noting that many of the source parameters are not independent of each other and often co-vary. How they co-vary under different voicing conditions, which is essential to natural speech synthesis, remains the focus of many studies (e.g., Sundberg and Hogset, 2001; Gobl and Chasaide, 2003; Patel et al., 2011).

B. Mechanisms of self-sustained vocal fold vibration

That vocal fold vibration results from a complex airflow-vocal fold interaction within the glottis rather than repetitive nerve stimulation of the larynx was first recognized by van den Berg (1958) . According to his myoelastic-aerodynamic theory of voice production, phonation starts from complete adduction of the vocal folds to close the glottis, which allows a buildup of the subglottal pressure. The vocal folds remain closed until the subglottal pressure is sufficiently high to push them apart, allowing air to escape and producing a negative (with respect to atmospheric pressure) intraglottal pressure due to the Bernoulli effect. This negative Bernoulli pressure and the elastic recoil pull the vocal folds back and close the glottis. The cycle then repeats, which leads to sustained vibration of the vocal folds.

While the myoelastic-aerodynamic theory correctly identifies the interaction between the vocal folds and airflow as the underlying mechanism of self-sustained vocal fold vibration, it does not explain how energy is transferred from airflow into the vocal folds to sustain this vibration. Traditionally, the negative intraglottal pressure is considered to play an important role in closing the glottis and sustaining vocal fold vibration. However, it is now understood that a negative intraglottal pressure is not a critical requirement for achieving self-sustained vocal fold vibration. Similarly, an alternatingly convergent-divergent glottal channel geometry during phonation has been considered a necessary condition that leads to net energy transfer from airflow into the vocal folds. We will show below that an alternatingly convergent-divergent glottal channel geometry does not always guarantee energy transfer or self-sustained vocal fold vibration.

For flow conditions typical of human phonation, the glottal flow can be reasonably described by Bernoulli's equation up to the point when airflow separates from the glottal wall, often at the glottal exit at which the airway suddenly expands. According to Bernoulli's equation, the flow pressure p at a location within the glottal channel with a time-varying cross-sectional area A is

$$p = P_{sub} - (P_{sub} - P_{sup})\,\frac{A_{sep}^{2}}{A^{2}}, \qquad (1)$$

where $P_{sub}$ and $P_{sup}$ are the subglottal and supraglottal pressure, respectively, and $A_{sep}$ is the time-varying glottal area at the flow separation location. For simplicity, we assume that the flow separates at the upper margin of the medial surface. To achieve a net energy transfer from airflow to the vocal folds over one cycle, the air pressure on the vocal fold surface has to be at least partially in-phase with vocal fold velocity. Specifically, the intraglottal pressure needs to be higher in the opening phase than in the closing phase of vocal fold vibration so that the airflow does more work on the vocal folds in the opening phase than the work the vocal folds do back to the airflow in the closing phase.
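As a small worked example, the Bernoulli relation can be evaluated for convergent and divergent channels. The sketch below assumes the form p = P_sub − (P_sub − P_sup)(A_sep/A)², which follows from applying Bernoulli's equation between the subglottal region (negligible velocity) and the separation point (pressure equal to P_sup). The function name and the numerical values are illustrative assumptions.

```python
def intraglottal_pressure(p_sub, p_sup, area, area_sep):
    """Bernoulli pressure at a section of area `area`, with flow separating at `area_sep`."""
    if area <= 0.0 or area_sep <= 0.0:
        raise ValueError("areas must be positive (open glottis)")
    return p_sub - (p_sub - p_sup) * (area_sep / area) ** 2


P_SUB, P_SUP = 800.0, 0.0  # assumed pressures in Pa

# Pressure at the lower margin for three channel shapes (areas in cm^2, assumed).
print("convergent:", intraglottal_pressure(P_SUB, P_SUP, area=0.2, area_sep=0.1))
print("straight:  ", intraglottal_pressure(P_SUB, P_SUP, area=0.1, area_sep=0.1))
print("divergent: ", intraglottal_pressure(P_SUB, P_SUP, area=0.1, area_sep=0.2))
```

A smaller ratio of separation area to local area (a more convergent channel) gives a higher pressure on the vocal fold surface, while a divergent channel gives a pressure below the supraglottal pressure.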

Theoretical analysis of the energy transfer between airflow and vocal folds (Ishizaka and Matsudaira, 1972; Titze, 1988a) showed that this pressure asymmetry can be achieved by a vertical phase difference in vocal fold surface motion (also referred to as a mucosal wave), i.e., different portions of the vocal fold surface do not necessarily move inward and outward together as a whole. This mechanism is illustrated in Fig. 5, the upper left of which shows vocal fold surface shape in the coronal plane for six consecutive, equally spaced instants during one vibration cycle in the presence of a vertical phase difference. Instants 2 and 3 in solid lines are in the closing phase whereas 5 and 6 in dashed lines are in the opening phase. Consider, for example, energy transfer at the lower margin of the medial surface. Because of the vertical phase difference, the glottal channel has a different shape in the opening phase (dashed lines 5 and 6) from that in the closing phase (solid lines 3 and 2) when the lower margin of the medial surface crosses the same locations. In particular, when the lower margin of the medial surface leads the upper margin in phase, the glottal channel during opening (e.g., instant 6) is always more convergent [thus a smaller $A_{sep}/A$ in Eq. (1)] or less divergent than that during closing (e.g., instant 2) for the same location of the lower margin, resulting in an air pressure [Eq. (1)] that is higher in the opening phase than in the closing phase (Fig. 5, top row). As a result, energy is transferred from airflow into the vocal folds over one cycle, as indicated by a non-zero area enclosed by the aerodynamic force-vocal fold displacement curve in Fig. 5 (top right). The existence of a vertical phase difference in vocal fold surface motion is generally considered the primary mechanism of phonation onset.

Fig. 5. Two energy transfer mechanisms. Top row: the presence of a vertical phase difference leads to different medial surface shapes between glottal opening (dashed lines 5 and 6; upper left panel) and closing (solid lines 2 and 3) when the lower margin of the medial surface crosses the same locations, which leads to higher air pressure during glottal opening than closing and net energy transfer from airflow into vocal folds at the lower margin of the medial surface. Middle row: without a vertical phase difference, vocal fold vibration produces an alternatingly convergent-divergent but identical glottal channel geometry between glottal opening and closing (bottom left panel), thus zero energy transfer (middle row). Bottom row: without a vertical phase difference, air pressure asymmetry can be imposed by a negative damping mechanism.

In contrast, without a vertical phase difference, the vocal fold surface during opening (Fig. 5, bottom left; dashed lines 5 and 6) and closing (solid lines 3 and 2) would be identical when the lower margin crosses the same positions, for which Bernoulli's equation would predict symmetric flow pressure between the opening and closing phases, and zero net energy transfer over one cycle (Fig. 5, middle row). Under this condition, the pressure asymmetry between the opening and closing phases has to be provided by an external mechanism that directly imposes a phase difference between the intraglottal pressure and vocal fold movement. In the presence of such an external mechanism, the intraglottal pressure is no longer the same between opening and closing even when the glottal channel has the same shape as the vocal fold crosses the same locations, resulting in a net energy transfer over one cycle from airflow to the vocal folds (Fig. 5, bottom row). This energy transfer mechanism is often referred to as negative damping, because the intraglottal pressure depends on vocal fold velocity and appears in the system equations of vocal fold motion in a form similar to a damping force, except that energy is transferred to the vocal folds instead of being dissipated. Negative damping is the only energy transfer mechanism in a single degree-of-freedom system or when the entire medial surface moves in phase as a whole.
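The contrast between the two cases can be checked with a simple work-loop calculation. The sketch below is a kinematic illustration under stated assumptions, not a phonation model: the lower and upper margins follow prescribed sinusoidal motions (the glottis never closes), flow separates at the upper margin, the supraglottal pressure is zero, and the pressure at the lower margin follows the Bernoulli relation p = P_sub[1 − (A_sep/A)²]. All names and numerical values are hypothetical.

```python
import math


def net_work_per_cycle(phase_lag, p_sub=800.0, x0=1.0,
                       a_lower=0.3, a_upper=0.2, n=2000):
    """Work per unit area done by airflow on the lower margin over one cycle.

    phase_lag: phase (radians) by which the upper margin lags the lower margin.
    x0, a_lower, a_upper: mean glottal half-width and margin amplitudes (mm).
    """
    d_theta = 2.0 * math.pi / n
    work = 0.0
    for i in range(n):
        theta = i * d_theta
        x_lower = x0 + a_lower * math.sin(theta)
        x_upper = x0 + a_upper * math.sin(theta - phase_lag)
        # Bernoulli pressure at the lower margin; area ratio equals width ratio.
        pressure = p_sub * (1.0 - (x_upper / x_lower) ** 2)
        # Lower-margin displacement increment over d_theta.
        dx_lower = a_lower * math.cos(theta) * d_theta
        work += pressure * dx_lower
    return work


print("in-phase motion:      ", net_work_per_cycle(0.0))
print("lower margin leads:   ", net_work_per_cycle(math.pi / 3))
print("upper margin leads:   ", net_work_per_cycle(-math.pi / 3))
```

With in-phase motion the channel still alternates between convergent and divergent shapes (the two amplitudes differ), yet the net work vanishes because the pressure is a single-valued function of the lower-margin position. A lower margin that leads in phase yields positive net work on the vocal fold, and reversing the phase relationship reverses the sign.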

In humans, negative damping can be provided by an inertive vocal tract (Flanagan and Landgraf, 1968; Ishizaka and Matsudaira, 1972; Ishizaka and Flanagan, 1972) or a compliant subglottal system (Zhang et al., 2006a). Because the negative damping associated with acoustic loading is significant only for frequencies close to an acoustic resonance, phonation sustained by such negative damping alone always occurs at a frequency close to that acoustic resonance (Flanagan and Landgraf, 1968; Zhang et al., 2006a). Although there is no direct evidence of phonation sustained dominantly by acoustic loading in humans, instabilities in voice production (or voice breaks) have been reported when the fundamental frequency of vocal fold vibration approaches one of the vocal tract resonances (e.g., Titze et al., 2008). However, this entrainment of phonation frequency to the acoustic resonance limits the degree of independent control of the voice source and the spectral modification by the vocal tract, and is less desirable for effective speech communication. Considering that humans are capable of producing a large variety of voice types independent of vocal tract shapes, negative damping due to acoustic coupling to the sub- or supra-glottal acoustics is unlikely to be the primary mechanism of energy transfer in voice production. Indeed, excised larynges are able to vibrate without a vocal tract. Moreover, experiments have shown that in humans the vocal folds vibrate at a frequency close to an in vacuo vocal fold resonance (Kaneko et al., 1986; Ishizaka, 1988; Svec et al., 2000) instead of the acoustic resonances of the sub- and supra-glottal tracts, suggesting that phonation is essentially a resonance phenomenon of the vocal folds.

Negative damping can also be provided by glottal aerodynamics. For example, glottal flow acceleration and deceleration may cause the flow to separate at different locations between opening and closing even when the glottis has identical geometry. This is particularly the case for a divergent glottal channel geometry, which often results in asymmetric flow separation and pressure asymmetry between the glottal opening and closing phases (Park and Mongeau, 2007; Alipour and Scherer, 2004). The effect of this negative damping mechanism is expected to be small at phonation onset, at which the vocal fold vibration amplitude and thus flow unsteadiness are small and the glottal channel is less likely to be divergent. However, its contribution to energy transfer may increase with increasing vocal fold vibration amplitude and flow unsteadiness (Howe and McGowan, 2010). It is important to differentiate this asymmetric flow separation between glottal opening and closing due to unsteady flow effects from a quasi-steady asymmetric flow separation that is caused by asymmetry in the glottal channel geometry between opening and closing. In the latter case, because flow separation may occur at a more upstream location for a divergent glottal channel than a convergent glottal channel, an asymmetric glottal channel geometry (e.g., a glottis opening convergent and closing divergent) may lead to asymmetric flow separation between glottal opening and closing. Compared to conditions of a fixed flow separation (i.e., flow separates at the same location during the entire cycle, as in Fig. 5), such geometry-induced asymmetric flow separation actually reduces pressure asymmetry between glottal opening and closing [this can be shown using Eq. (1)] and thus weakens net energy transfer. In reality, these two types of asymmetric flow separation mechanisms (due to unsteady effects or changes in glottal channel geometry) interact and can result in very complex flow separation patterns (Alipour and Scherer, 2004; Sciamarella and Le Quere, 2008; Sidlof et al., 2011), which may or may not enhance energy transfer.

From the discussion above it is clear that a negative Bernoulli pressure is not a critical requirement in either one of the two mechanisms. Being proportional to vocal fold displacement, the negative Bernoulli pressure is not a negative damping and does not directly provide the required pressure asymmetry between glottal opening and closing. On the other hand, the existence of a vertical phase difference in vocal fold vibration is determined primarily by vocal fold properties (as discussed below), rather than whether the intraglottal pressure is positive or negative during a certain phase of the oscillation cycle.

Although a vertical phase difference in vocal fold vibration leads to a time-varying glottal channel geometry, an alternatingly convergent-divergent glottal channel geometry does not guarantee self-sustained vocal fold vibration. For example, although the in-phase vocal fold motion in the bottom left of Fig. 5 (the entire medial surface moves in and out together) leads to an alternatingly convergent-divergent glottal geometry, the glottal geometry is identical between glottal opening and closing and thus this motion is unable to produce net energy transfer into the vocal folds without a negative damping mechanism (Fig. 5, middle row). In other words, an alternatingly convergent-divergent glottal geometry is an effect, not a cause, of self-sustained vocal fold vibration. Theoretically, the glottis can maintain a convergent or divergent shape during the entire oscillation cycle and yet still self-oscillate, as observed in experiments using physical vocal fold models which had a divergent shape during most portions of the oscillation cycle (Zhang et al., 2006a).

C. Eigenmode synchronization and nonlinear dynamics

The above shows that net energy transfer from airflow into the vocal folds is possible in the presence of a vertical phase difference. But how is this vertical phase difference established, and what determines the vertical phase difference and the vocal fold vibration pattern? In voice production, vocal fold vibration with a vertical phase difference results from a process of eigenmode synchronization, in which two or more in vacuo eigenmodes of the vocal folds are synchronized to vibrate at the same frequency but with a phase difference (Ishizaka and Matsudaira, 1972; Ishizaka, 1981; Horacek and Svec, 2002; Zhang et al., 2007), in the same way as a travelling wave formed by superposition of two standing waves. An eigenmode or resonance is a pattern of motion of the system that is allowed by physical laws and boundary constraints to the system. In general, for each mode, the vibration pattern is such that all parts of the system move either in-phase or 180° out of phase, similar to a standing wave. Each eigenmode has an inherently distinct eigenfrequency (or resonance frequency) at which the eigenmode can be maximally excited. An example of eigenmodes that is often encountered in speech science is formants, which are peaks in the output voice spectra due to excitation of acoustic resonances of the vocal tract, with the formant frequency dependent on vocal tract geometry. Figure 6 shows three typical eigenmodes of the vocal fold in the coronal plane. In Fig. 6, the thin line indicates the resting vocal fold surface shape, whereas the solid and dashed lines indicate extreme positions of the vocal fold when vibrating at the corresponding eigenmode, spaced 180° apart in a vibratory cycle. The first eigenmode shows an up and down motion in the vertical direction, which does not modulate glottal airflow much. The second eigenmode has a dominantly in-phase medial-lateral motion along the medial surface, which does modulate airflow. The third eigenmode also exhibits dominantly medial-lateral motion, but the upper portion of the medial surface vibrates 180° out of phase with the lower portion of the medial surface. Such out-of-phase motion as in the third eigenmode is essential to achieving vocal fold vibration with a large vertical phase difference, e.g., when synchronized with an in-phase eigenmode as in Fig. 6(b).
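How two standing-wave-like modes combine into motion with a vertical phase difference can be shown with a two-point sketch. It assumes idealized mode shapes sampled only at the lower and upper margins: an in-phase mode with shape (+1, +1) and an out-of-phase mode with shape (−1, +1). The amplitude ratio and the temporal phase lag between the modes are hypothetical parameters.

```python
import cmath
import math


def vertical_phase_difference(mode_phase_lag, amplitude_ratio=0.5):
    """Phase lag (radians) of the upper margin relative to the lower margin.

    Displacement at a margin is Re[(1 + r * shape * exp(-1j * lag)) * exp(1j * w * t)],
    where shape is -1 (lower) or +1 (upper) for the out-of-phase mode.
    """
    def margin_lag(shape):
        complex_amplitude = 1.0 + amplitude_ratio * shape * cmath.exp(-1j * mode_phase_lag)
        return -cmath.phase(complex_amplitude)

    return margin_lag(+1.0) - margin_lag(-1.0)


for lag_deg in (0, 45, 90):
    vpd = vertical_phase_difference(math.radians(lag_deg))
    print(f"mode phase lag {lag_deg:3d} deg -> vertical phase difference "
          f"{math.degrees(vpd):5.1f} deg")
```

When the two modes vibrate with no temporal phase lag, both margins move in phase and the superposition is still a standing wave. A non-zero lag between the modes makes the lower margin lead the upper margin, which is the wave-like pattern described above.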

Fig. 6. Typical vocal fold eigenmodes exhibiting (a) a dominantly superior-inferior motion, (b) a medial-lateral in-phase motion, and (c) a medial-lateral out-of-phase motion along the medial surface.

In the absence of airflow, the vocal fold in vacuo eigenmodes are generally neutral or damped, meaning that when excited they will gradually decay in amplitude with time. When the vocal folds are subject to airflow, however, the vocal fold-airflow coupling modifies the eigenmodes and, in some conditions, synchronizes two eigenmodes to the same frequency (Fig. 7). Although vibration in each eigenmode by itself does not produce net energy transfer (Fig. 5, middle row), when two modes are synchronized at the same frequency but with a phase difference in time, the vibration velocity associated with one eigenmode [e.g., the eigenmode in Fig. 6(b)] will be at least partially in-phase with the pressure induced by the other eigenmode [e.g., the eigenmode in Fig. 6(c)], and this cross-mode pressure-velocity interaction will produce net energy transfer into the vocal folds (Ishizaka and Matsudaira, 1972; Zhang et al., 2007).

Fig. 7. A typical eigenmode synchronization pattern. The evolution of the first three eigenmodes is shown as a function of the subglottal pressure. As the subglottal pressure increases, the frequencies (top) of the second and third vocal fold eigenmodes gradually approach each other and, at a threshold subglottal pressure, synchronize to the same frequency. At the same time, the growth rate (bottom) of the second mode becomes positive, indicating the coupled airflow-vocal fold system becomes linearly unstable and phonation starts.

The minimum subglottal pressure required to synchronize two eigenmodes and initiate net energy transfer, or the phonation threshold pressure $P_{th}$, is proportional to the frequency spacing between the two eigenmodes being synchronized and inversely proportional to the coupling strength between the two eigenmodes (Zhang, 2010):

$$P_{th} = \frac{\omega_{0,2}^{2} - \omega_{0,1}^{2}}{\beta}, \qquad (2)$$

where $\omega_{0,1}$ and $\omega_{0,2}$ are the eigenfrequencies of the two in vacuo eigenmodes participating in the synchronization process and β is the coupling strength between the two eigenmodes. Thus, the closer the two eigenmodes are to each other in frequency or the more strongly they are coupled, the less pressure is required to synchronize them. This is particularly the case in an anisotropic material such as the vocal folds in which the AP stiffness is much larger than the stiffness in the transverse plane. Under such anisotropic stiffness conditions, the first few in vacuo vocal fold eigenfrequencies tend to cluster together and are much closer to each other compared to isotropic stiffness conditions (Titze and Strong, 1975; Berry, 2001). Such clustering of eigenmodes makes it possible to initiate vocal fold vibration at very low subglottal pressures.
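The synchronization process can be mimicked with a two-mode toy system. The sketch below is an assumption-laden illustration rather than the analysis of Zhang (2010): it takes two undamped unit-mass modes whose stiffness matrix diag(ω1², ω2²) is perturbed by a flow-induced, non-symmetric coupling matrix (β·P_sub/2)·[[0, 1], [−1, 0]]. In this toy system the two frequencies merge at P_sub = (ω2² − ω1²)/β, and one mode acquires a positive growth rate above that pressure. All names and numbers are hypothetical.

```python
import cmath
import math


def coupled_modes(omega1, omega2, beta, p_sub):
    """Return [(frequency_hz, growth_rate), ...] for the two coupled modes."""
    mean = 0.5 * (omega1 ** 2 + omega2 ** 2)
    half_gap = 0.5 * (omega2 ** 2 - omega1 ** 2)
    # Eigenvalues of the coupled stiffness matrix; complex once the modes merge.
    root = cmath.sqrt(half_gap ** 2 - (0.5 * beta * p_sub) ** 2)
    modes = []
    for lam in (mean - root, mean + root):
        s = cmath.sqrt(-lam)  # motion ~ exp(s * t), with s**2 = -lam
        if s.imag < 0.0:
            s = -s            # keep the positive-frequency member of each pair
        modes.append((s.imag / (2.0 * math.pi), s.real))
    return modes


def merging_pressure(omega1, omega2, beta):
    """Pressure at which the two frequencies of this toy system coincide."""
    return (omega2 ** 2 - omega1 ** 2) / beta


w1, w2 = 2.0 * math.pi * 100.0, 2.0 * math.pi * 120.0  # assumed in vacuo eigenfrequencies
beta = 350.0                                           # assumed coupling strength
p_merge = merging_pressure(w1, w2, beta)
for fraction in (0.0, 0.5, 0.9, 1.5):
    (f_a, g_a), (f_b, g_b) = coupled_modes(w1, w2, beta, fraction * p_merge)
    print(f"P_sub = {fraction:.1f} x merging pressure: "
          f"f = {f_a:6.1f}, {f_b:6.1f} Hz; growth = {g_a:+7.1f}, {g_b:+7.1f} 1/s")
```

Below the merging pressure both modes are neutral and their frequencies approach each other. Above it they share one frequency, with one mode growing and the other decaying, which qualitatively resembles the pattern in Fig. 7. Structural damping, which would raise the actual onset pressure, is left out of this sketch.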

The coupling strength β between the two eigenmodes in Eq. (2) depends on the prephonatory glottal opening, with the coupling strength increasing with decreasing glottal opening (thus lowered phonation threshold pressure). In addition, the coupling strength also depends on the spatial similarity between the air pressure distribution over the vocal fold surface induced by one eigenmode and vocal fold surface velocity of the other eigenmode ( Zhang, 2010 ). In other words, the coupling strength β quantifies the cross-mode energy transfer efficiency between the eigenmodes that are being synchronized. The higher the degree of cross-mode pressure-velocity similarity, the better the two eigenmodes are coupled, and the less subglottal pressure is required to synchronize them.

In reality, the vocal folds have an infinite number of eigenmodes. Which eigenmodes are synchronized and eventually excited depends on the frequency spacing and relative coupling strength among different eigenmodes. Because vocal fold vibration depends on the eigenmodes that are eventually excited, changes in the eigenmode synchronization pattern often lead to changes in the F0, vocal fold vibration pattern, and the resulting voice quality. Previous studies have shown that a slight change in vocal fold properties such as stiffness or medial surface shape may cause phonation to occur at a different eigenmode, leading to a qualitatively different vocal fold vibration pattern and abrupt changes in F0 ( Tokuda et al. , 2007 ; Zhang, 2009 ). Eigenmode synchronization is not limited to two vocal fold eigenmodes, either. It may also occur between a vocal fold eigenmode and an eigenmode of the subglottal or supraglottal system. In this sense, the negative damping due to subglottal or supraglottal acoustic loading can be viewed as the result of synchronization between one of the vocal fold modes and one of the acoustic resonances.

Eigenmode synchronization discussed above corresponds to a 1:1 temporal synchronization of two eigenmodes. For a certain range of vocal fold conditions, e.g., when asymmetry (left-right or anterior-posterior) exists in the vocal system or when the vocal folds are strongly coupled with the sub- or supra-glottal acoustics, synchronization may occur so that the two eigenmodes are synchronized not toward the same frequency, but at a frequency ratio of 1:2, 1:3, etc., leading to subharmonics or biphonation ( Ishizaka and Isshiki, 1976 ; Herzel, 1993 ; Herzel et al. , 1994 ; Neubauer et al. , 2001 ; Berry et al. , 1994 ; Berry et al. , 2006 ; Titze, 2008 ; Lucero et al. , 2015 ). Temporal desynchronization of eigenmodes often leads to irregular or chaotic vocal fold vibration ( Herzel et al. , 1991 ; Berry et al. , 1994 ; Berry et al. , 2006 ; Steinecke and Herzel, 1995 ). Transition between different synchronization patterns, or bifurcation, often leads to a sudden change in the vocal fold vibration pattern and voice quality.

These studies show that the nonlinear interaction between vocal fold eigenmodes is a central feature of the phonation process, with different synchronization or desynchronization patterns producing a large variety of voice types. Thus, by changing the geometrical and biomechanical properties of the vocal folds, either through laryngeal muscle activation or mechanical modification as in phonosurgery, we can select eigenmodes and eigenmode synchronization pattern to control or modify our voice, in the same way as we control speech formants by moving articulators in the vocal tract to modify vocal tract acoustic resonances.

The concept of eigenmode and eigenmode synchronization is also useful for phonation modeling, because eigenmodes can be used as building blocks to construct more complex motion of the system. Often, only the first few eigenmodes are required for adequate reconstruction of complex vocal fold vibrations (both regular and irregular; Herzel et al. , 1994 ; Berry et al. , 1994 ; Berry et al. , 2006 ), which would significantly reduce the degrees of freedom required in computational models of phonation.

D. Biomechanical requirements of glottal closure during phonation

An important feature of normal phonation is the complete closure of the membranous glottis during vibration, which is essential to the production of high-frequency harmonics. Incomplete closure of the membranous glottis, as often observed in pathological conditions, often leads to voice production of a weak and/or breathy quality.

It is generally assumed that approximation of the vocal folds through arytenoid adduction is sufficient to achieve glottal closure during phonation, with the duration of glottal closure or the closed quotient increasing with increasing degree of vocal fold approximation. While a certain degree of vocal fold approximation is obviously required for glottal closure, there is evidence suggesting that other factors are also in play. For example, excised larynx experiments have shown that some larynges would vibrate with incomplete glottal closure even though the arytenoids are tightly sutured together (Isshiki, 1989; Zhang, 2011). Similar incomplete glottal closure is also observed in experiments using physical vocal fold models with isotropic material properties (Thomson et al., 2005; Zhang et al., 2006a). In these experiments, increasing the subglottal pressure increased the vocal fold vibration amplitude but often did not lead to improvement in the glottal closure pattern (Xuan and Zhang, 2014). These studies show that additional stiffness or geometry conditions are required to achieve complete membranous glottal closure.

Recent studies have started to provide some insight into these additional biomechanical conditions. Xuan and Zhang (2014) showed that embedding fibers along the anterior-posterior direction in otherwise isotropic models can improve glottal closure. With an additional thin, stiffer outermost layer simulating the epithelium, these physical models are able to vibrate with a considerably long closed period. It is interesting that this improvement in the glottal closure pattern occurred only when the fibers were embedded at a location close to the vocal fold surface in the cover layer. Embedding fibers in the body layer did not improve the closure pattern at all. This suggests a possible functional role of collagen and elastin fibers in the intermediate and deep layers of the lamina propria in facilitating glottal closure during vibration.

The difference in the glottal closure pattern between isotropic and anisotropic vocal folds could be due to many reasons. Compared to isotropic vocal folds, anisotropic vocal folds (or fiber-embedded models) are better able to maintain their adductory position against the subglottal pressure and are less likely to be pushed apart by air pressure ( Zhang, 2011 ). In addition, embedding fibers along the AP direction may also enhance the medial-lateral motion, further facilitating glottal closure. Zhang (2014) showed that the first few in vacuo eigenmodes of isotropic vocal folds exhibit similar in-phase, up-and-down swing-like motion, with the medial-lateral and superior-inferior motions locked in a similar phase relationship. Synchronization of modes of similar vibration patterns necessarily leads to qualitatively the same vibration patterns, in this case an up-and-down swing-like motion, with vocal fold vibration dominantly along the superior-inferior direction, as observed in recent physical model experiments ( Thomson et al. , 2005 ; Zhang et al. , 2006a ). In contrast, for vocal folds with the AP stiffness much higher than the transverse stiffness, the first few in vacuo modes exhibit qualitatively distinct vibration patterns, and the medial-lateral motion and the superior-inferior motion are no longer locked in a similar phase in the first few in vacuo eigenmodes. This makes it possible to strongly excite large medial-lateral motion without proportional excitation of the superior-inferior motion. As a result, anisotropic models exhibit large medial-lateral motion with a vertical phase difference along the medial surface. The improved capability to maintain adductory position against the subglottal pressure and to vibrate with large medial-lateral motion may contribute to the improved glottal closure pattern observed in the experiment of Xuan and Zhang (2014) .

Geometrically, a thin vocal fold has been shown to be easily pushed apart by the subglottal pressure ( Zhang, 2016a ). Although a thin anisotropic vocal fold vibrates with a dominantly medial-lateral motion, this is insufficient to overcome its inability to maintain position against the subglottal pressure. As a result, the glottis never completely closes during vibration, which leads to a relatively smooth glottal flow waveform and weak excitation of higher-order harmonics in the radiated output voice spectrum ( van den Berg, 1968 ; Zhang, 2016a ). Increasing vertical thickness of the medial surface allows the vocal fold to better resist the glottis-opening effect of the subglottal pressure, thus maintaining the adductory position and achieving complete glottal closure.

Once these additional stiffness and geometric conditions (i.e., a certain degree of stiffness anisotropy and a not-too-small vertical vocal fold thickness) are met, the duration of glottal closure can be regulated by varying the vertical phase difference in vocal fold motion along the medial surface. A non-zero vertical phase difference means that, when the lower margins of the medial surfaces start to open, the glottis continues to remain closed until the upper margins start to open. One important parameter affecting the vertical phase difference is the vertical thickness of the medial surface or the degree of medial bulging in the inferior portion of the medial surface. Given the same condition of vocal fold stiffness and vocal fold approximation, the vertical phase difference during vocal fold vibration increases with increasing vertical medial surface thickness (Fig. 8). Thus, the thicker the medial surface, the larger the vertical phase difference, and the longer the closed phase (Fig. 8; van den Berg, 1968; Alipour and Scherer, 2000; Zhang, 2016a). Similarly, the vertical phase difference and thus the duration of glottal closure can also be increased by reducing the elastic surface wave speed in the superior-inferior direction (Ishizaka and Flanagan, 1972; Story and Titze, 1995), which depends primarily on the stiffness in the transverse plane and to a lesser degree on the AP stiffness, or by increasing the body-cover stiffness ratio (Story and Titze, 1995; Zhang, 2009).
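The link between vertical phase difference and closed phase can be illustrated with a purely kinematic sketch. It assumes prescribed sinusoidal motion of the lower and upper margins about a common mean half-width, ignores collision mechanics, and counts the glottis as closed whenever either margin has reached the midline. The function name and parameter values are illustrative.

```python
import math


def closed_quotient(vertical_phase_diff, mean_opening=0.2, amplitude=1.0, n=3600):
    """Fraction of the cycle during which the narrowest glottal section is closed.

    vertical_phase_diff: phase (radians) by which the upper margin lags the lower.
    mean_opening, amplitude: mean half-width and vibration amplitude (same units).
    """
    closed = 0
    for i in range(n):
        theta = 2.0 * math.pi * i / n
        lower = mean_opening + amplitude * math.sin(theta)
        upper = mean_opening + amplitude * math.sin(theta - vertical_phase_diff)
        # Flow is blocked when either margin reaches or crosses the midline.
        if min(lower, upper) <= 0.0:
            closed += 1
    return closed / n


for vpd_deg in (0, 30, 60, 90):
    cq = closed_quotient(math.radians(vpd_deg))
    print(f"vertical phase difference {vpd_deg:2d} deg -> closed quotient {cq:.2f}")
```

In this sketch the closed quotient grows linearly with the vertical phase difference, because the upper margin keeps the glottis closed after the lower margin has started to open. Raising `mean_opening` relative to `amplitude` shortens the closed phase.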

Fig. 8. (Color online) The closed quotient CQ and vertical phase difference VPD as a function of the medial surface thickness, the AP stiffness ($G_{ap}$), and the resting glottal angle (α). Reprinted with permission of ASA from Zhang (2016a).

Theoretically, the duration of glottal closure can be controlled by changing the ratio between the vocal fold equilibrium position (or the mean glottal opening) and the vocal fold vibration amplitude. Both stiffening the vocal folds and tightening vocal fold approximation are able to move the vocal fold equilibrium position toward glottal midline. However, such manipulations often simultaneously reduce the vibration amplitude. As a result, the overall effect on the duration of glottal closure is unclear. Zhang (2016a) showed that stiffening the vocal folds or increasing vocal fold approximation did not have much effect on the duration of glottal closure except around onset when these manipulations led to significant improvement in vocal fold contact.

E. Role of flow instabilities

Although a Bernoulli-based flow description is often used for phonation models, the realistic glottal flow is highly three-dimensional and much more complex. The intraglottal pressure distribution is shown to be affected by the three-dimensionality of the glottal channel geometry ( Scherer et al. , 2001 ; Scherer et al. , 2010 ; Mihaescu et al. , 2010 ; Li et al. , 2012 ). As the airflow separates from the glottal wall as it exits the glottis, a jet forms downstream of the flow separation point, which leads to the development of shear layer instabilities, vortex roll-up, and eventually vortex shedding from the jet and transition into turbulence. The vortical structures would in turn induce disturbances upstream, which may lead to oscillating flow separation point, jet attachment to one side of the glottal wall instead of going straight, and possibly alternating jet flapping ( Pelorson et al. , 1994 ; Shinwari et al. , 2003 ; Triep et al. , 2005 ; Kucinschi et al. , 2006 ; Erath and Plesniak, 2006 ; Neubauer et al. , 2007 ; Zheng et al. , 2009 ). Recent experiments and simulations also showed that for a highly divergent glottis, airflow may separate inside the glottis, which leads to the formation and convection of intraglottal vortices ( Mihaescu et al. , 2010 ; Khosla et al. , 2014 ; Oren et al. , 2014 ).

Some of these flow features have been incorporated in phonation models (e.g., Liljencrants, 1991 ; Pelorson et al. , 1994 ; Kaburagi and Tanabe, 2009 ; Erath et al. , 2011 ; Howe and McGowan, 2013 ). Resolving other features, particularly the jet instability, vortices, and turbulence downstream of the glottis, demands significantly increased computational costs so that simulation of a few cycles of vocal fold vibration often takes days or months. On the other hand, the acoustic and perceptual relevance of these intraglottal and supraglottal flow structures has not been established. From the sound production point of view, these complex flow structures in the downstream glottal flow field are sound sources of quadrupole type (dipole type when obstacles are present in the pathway of airflow, e.g., tightly adducted false vocal folds). Due to the small length scales associated with the flow structures, these sound sources are broadband in nature and mostly at high frequencies (generally above 2 kHz), with an amplitude much smaller than the harmonic component of the voice source. Therefore, if the high-frequency component of voice is of interest, these flow features have to be accurately modeled, although the degree of accuracy required to achieve perceptual sufficiency has yet to be determined.

It has been postulated that the vortical structures may directly affect the near-field glottal fluid-structure interaction and thus vocal fold vibration and the harmonic component of the voice source. Once separated from the vocal fold walls, the glottal jet starts to develop jet instabilities and is therefore susceptible to downstream disturbances, especially when the glottis takes on a divergent shape. In this way, the unsteady supraglottal flow structures may interact with the boundary layer at the glottal exit and affect the flow separation point within the glottal channel ( Hirschberg et al. , 1996 ). Similarly, it has been hypothesized that intraglottal vortices can induce a local negative pressure on the medial surface of the vocal folds as the intraglottal vortices are convected downstream and thus may facilitate rapid glottal closure during voice production ( Khosla et al. , 2014 ; Oren et al. , 2014 ).

While there is no doubt that these complex flow features affect vocal fold vibration, the question remains how large an influence these vortical structures have on vocal fold vibration and the produced acoustics. For the flow conditions typical of voice production, many of the flow features or instabilities have time scales much different from that of vocal fold vibration. For example, vortex shedding at typical voice conditions occurs generally at frequencies above 1000 Hz ( Zhang et al. , 2004 ; Kucinschi et al. , 2006 ). Considering that phonation is essentially a resonance phenomenon of the vocal folds (Sec. III B ) and the mismatch between vocal fold resonance and typical frequency scales of the vortical structures, it is questionable whether the pressure perturbations on the vocal fold surface due to intraglottal or supraglottal vortical structures are strong enough, or last long enough, relative to vocal fold inertia and elastic recoil, to have a significant effect on voice production. Given a longitudinal shear modulus of the vocal fold of about 10 kPa and a shear strain of 0.2, the elastic recoil stress of the vocal fold is approximately 2000 Pa. The pressure perturbations induced by intraglottal or supraglottal vortices are expected to be much smaller than the subglottal pressure. Assuming an upper limit of about 20% of the subglottal pressure for the pressure perturbations (as induced by intraglottal vortices, Oren et al. , 2014 ; in reality this number is expected to be much smaller at normal loudness conditions and even smaller for supraglottal vortices) and a subglottal pressure of 800 Pa (typical of normal speech production), the pressure perturbation on the vocal fold surface is about 160 Pa, which is much smaller than the elastic recoil stress.
Specifically for the intraglottal vortices, while a highly divergent glottal geometry is required to create intraglottal vortices, the presence of intraglottal vortices induces a suction force applied mainly on the superior portion of the medial surface and, if the vortices are strong enough, would reduce the divergence of the glottal channel. In other words, while intraglottal vortices are unable to create the necessary divergence conditions required for their creation, their existence tends to eliminate such conditions.
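The order-of-magnitude comparison in this paragraph can be reproduced in a few lines. The numerical inputs are the ones quoted in the text (10 kPa shear modulus, 0.2 shear strain, 800 Pa subglottal pressure, 20% upper limit); the function and variable names are hypothetical and introduced only for this sketch.

```python
def elastic_recoil_stress(shear_modulus_pa: float, shear_strain: float) -> float:
    """Linear estimate of the recoil stress: modulus times strain."""
    return shear_modulus_pa * shear_strain


def vortex_pressure_perturbation(subglottal_pressure_pa: float, fraction: float) -> float:
    """Upper-limit perturbation as a fixed fraction of the subglottal pressure."""
    return fraction * subglottal_pressure_pa


recoil_pa = elastic_recoil_stress(10_000.0, 0.2)
perturbation_pa = vortex_pressure_perturbation(800.0, 0.20)
ratio = perturbation_pa / recoil_pa

print(f"elastic recoil stress: {recoil_pa:.0f} Pa")
print(f"vortex-induced perturbation (upper limit): {perturbation_pa:.0f} Pa")
print(f"perturbation / recoil: {ratio:.2f}")
```

Even at this assumed upper limit, the perturbation is below one tenth of the recoil stress.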

There have been some recent studies toward quantifying the degree of the influence of the vortical structures on phonation. In an excised larynx experiment without a vocal tract, it has been observed that the produced sound does not change much when a finger is placed very close to the glottal exit, which presumably would have significantly disturbed the supraglottal flow field. A more rigorous experiment was designed in Zhang and Neubauer (2010) in which they placed an anterior-posteriorly aligned cylinder in the supraglottal flow field and traversed it in the flow direction at different left-right locations and observed the acoustic consequences. The hypothesis was that, if these supraglottal flow structures had a significant effect on vocal fold vibration and acoustics, disturbing these flow structures would lead to noticeable changes in the produced sound. However, their experiment found no significant changes in the sound except when the cylinder was positioned within the glottal channel.

The potential impact of intraglottal vortices on phonation has also been numerically investigated ( Farahani and Zhang, 2014 ; Kettlewell, 2015 ). Because of the difficulty in removing intraglottal vortices without affecting other aspects of the glottal flow, the effect of the intraglottal vortices was modeled as a negative pressure superimposed on the flow pressure predicted by a base glottal flow model. In this way, the effect of the intraglottal vortices can be selectively activated or deactivated independently of the base flow so that its contribution to phonation can be investigated. These studies showed that intraglottal vortices only have small effects on vocal fold vibration and the glottal flow. Kettlewell (2015) further showed that the vortices are either not strong enough to induce significant pressure perturbation on vocal fold surfaces or, if they are strong enough, the vortices advect rapidly into the supraglottal region and the induced pressure perturbations would be too brief to have any impact to overcome the inertia of the vocal fold tissue.

Although phonation models using simplified flow models neglecting flow vortical structures are widely used and appear to qualitatively compare well with experiments ( Pelorson et al. , 1994 ; Zhang et al. , 2002a ; Ruty et al. , 2007 ; Kaburagi and Tanabe, 2009 ), more systematic investigations are required to reach a definite conclusion regarding the relative importance of these flow structures to phonation and voice perception. This may be achieved by conducting parametric studies over a large range of conditions in which the relative strength of these vortical structures is known to vary significantly and observing their consequences on voice production. Such an improved understanding would facilitate the development of computationally efficient reduced-order models of phonation.

IV. BIOMECHANICS OF VOICE CONTROL

A. Fundamental frequency

In the discussion of F0 control, an analogy is often made between phonation and vibration in strings in the voice literature (e.g., Colton et al. , 2011 ). The vibration frequency of a string is determined by its length, tension, and mass. By analogy, the F0 of voice production is also determined by its length, tension, and mass, with the mass interpreted as the mass of the vocal folds that is set into vibration. Specifically, F0 increases with increasing tension, decreasing mass, and decreasing vocal fold length. While the string analogy is conceptually simple and heuristically useful, some important features of the vocal folds are missing. Other than the vague definition of an effective mass, the string model, which implicitly assumes a cross-sectional dimension much smaller than the length, completely neglects the contribution of vocal fold stiffness in F0 control. Although stiffness and tension are often not differentiated in the voice literature, they have different physical meanings and represent two different mechanisms that resist deformation (Fig. 2). Stiffness is a property of the vocal fold and represents the elastic restoring force in response to deformation, whereas tension or stress describes the mechanical state of the vocal folds. The string analogy also neglects the effect of vocal fold contact, which introduces an additional stiffening effect.
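For reference, the string analogy can be written down explicitly. The sketch below uses the textbook fundamental frequency of an ideal string fixed at both ends, expressed in terms of longitudinal stress and tissue density. The numerical values (16 mm length, 10 kPa stress, 1040 kg/m^3 density) are illustrative assumptions and are not taken from the studies cited here.

```python
import math


def string_f0(length_m: float, stress_pa: float, density_kg_m3: float) -> float:
    """Fundamental frequency of an ideal string fixed at both ends.

    f = (1 / (2 L)) * sqrt(stress / density). Bending and shear stiffness,
    the layered structure, and contact are all absent from this model.
    """
    if stress_pa <= 0:
        # The ideal string offers no prediction for zero or negative tension.
        raise ValueError("ideal string model requires positive tensile stress")
    return math.sqrt(stress_pa / density_kg_m3) / (2.0 * length_m)


# Illustrative (assumed) values.
base = string_f0(0.016, 10_000.0, 1040.0)
print(f"baseline F0: {base:.1f} Hz")
print(f"stress x4:   {string_f0(0.016, 40_000.0, 1040.0):.1f} Hz")  # doubles F0
print(f"length x2:   {string_f0(0.032, 10_000.0, 1040.0):.1f} Hz")  # halves F0
```

The only restoring mechanism in this formula is tension, so it says nothing about the contribution of stiffness or contact.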

Because phonation is essentially a resonance phenomenon of the vocal folds, the F0 is primarily determined by the frequency of the vocal fold eigenmodes that are excited. In general, vocal fold eigenfrequencies depend on both vocal fold geometry, including length, depth, and thickness, and the stiffness and stress conditions of the vocal folds. Shorter vocal folds tend to have higher eigenfrequencies. Thus, because of their small vocal fold size, children tend to have the highest F0, followed by females and then males. Vocal fold eigenfrequencies also increase with increasing stiffness or stress (tension), both of which provide a restoring force to resist vocal fold deformation. Thus, stiffening or tensioning the vocal folds would increase the F0 of the voice. In general, the effect of stiffness on vocal fold eigenfrequencies is more dominant than that of tension when the vocal fold is slightly elongated or shortened, in which case the tension is small or even negative and the string model would underestimate F0 or fail to provide a prediction. As the vocal fold gets further elongated and tension increases, the stiffness and tension become equally important in affecting vocal fold eigenfrequencies ( Titze and Hunter, 2004 ; Yin and Zhang, 2013 ).

When vocal fold contact occurs during vibration, the vocal fold collision force appears as an additional restoring force ( Ishizaka and Flanagan, 1972 ). Depending on the extent, depth of influence, and duration of vocal fold collision, this additional force can significantly increase the effective stiffness of the vocal folds and thus F0. Because the vocal fold contact pattern depends on the degree of vocal fold approximation, subglottal pressure, and vocal fold stiffness and geometry, changes in any of these parameters may have an effect on F0 by affecting vocal fold contact ( van den Berg and Tan, 1959 ; Zhang, 2016a ).

In humans, F0 can be increased by increasing either vocal fold eigenfrequencies or the extent and duration of vocal fold contact. Control of vocal fold eigenfrequencies is largely achieved by varying the stiffness and tension along the AP direction. Due to the nonlinear material properties of the vocal folds, both the AP stiffness and tension can be controlled by elongating or shortening the vocal folds, through activation of the CT muscle. Although elongation also increases vocal fold length which lowers F0, the effect of the increase in stiffness and tension on F0 appears to dominate that of increasing length.

The effect of TA muscle activation on F0 control is a little more complex. In addition to shortening vocal fold length, TA activation tensions and stiffens the body layer, decreases tension in the cover layer, but may decrease or increase the cover stiffness ( Yin and Zhang, 2013 ). Titze et al. (1988) showed that depending on the depth of the body layer involved in vibration, increasing TA activation can either increase or decrease vocal fold eigenfrequencies. On the other hand, Yin and Zhang (2013) showed that for an elongated vocal fold, as is often the case in phonation, the overall effect of TA activation is to reduce vocal fold eigenfrequencies. Only for slightly elongated or shortened vocal folds may TA activation increase vocal fold eigenfrequencies. In addition to the effect on vocal fold eigenfrequencies, TA activation increases vertical thickness of the vocal folds and produces medial compression between the two folds, both of which increase the extent and duration of vocal fold contact and would lead to an increased F0 ( Hirano et al. , 1969 ). Because of these opposite effects on vocal fold eigenfrequencies and vocal fold contact, the overall effect of TA activation on F0 would vary depending on the specific vocal fold conditions.

Increasing subglottal pressure or activating the LCA/IA muscles alone does not have much effect on vocal fold eigenfrequencies ( Hirano and Kakita, 1985 ; Chhetri et al. , 2009 ; Yin and Zhang, 2014 ). However, they often increase the extent and duration of vocal fold contact during vibration, particularly with increasing subglottal pressure, and thus lead to increased F0 ( Hirano et al. , 1969 ; Ishizaka and Flanagan, 1972 ; Zhang, 2016a ). Due to nonlinearity in vocal fold material properties, increased vibration amplitude at high subglottal pressures may lead to increased effective stiffness and tension, which may also increase F0 ( van den Berg and Tan, 1959 ; Ishizaka and Flanagan, 1972 ; Titze, 1989 ). Ishizaka and Flanagan (1972) showed in their two-mass model that vocal fold contact and material nonlinearity combined can lead to an increase of about 40 Hz in F0 when the subglottal pressure is increased from about 200 to 800 Pa. In the continuum model of Zhang (2016a) , which includes the effect of vocal fold contact but not vocal fold material nonlinearity, increasing subglottal pressure alone can increase the F0 by as much as 20 Hz/kPa.
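To put the two figures on a common footing, the two-mass result can be converted to an average slope in Hz/kPa. This is simple arithmetic on the numbers quoted above; because the two models differ in many respects, the comparison is only indicative. The function name is hypothetical.

```python
def average_f0_slope_hz_per_kpa(delta_f0_hz: float, p_start_pa: float, p_end_pa: float) -> float:
    """Average F0 change per kPa of subglottal pressure over a pressure interval."""
    delta_p_kpa = (p_end_pa - p_start_pa) / 1000.0
    if delta_p_kpa == 0:
        raise ValueError("pressure interval must be nonzero")
    return delta_f0_hz / delta_p_kpa


# About 40 Hz over roughly 200 to 800 Pa (two-mass model, contact plus nonlinearity).
two_mass_slope = average_f0_slope_hz_per_kpa(40.0, 200.0, 800.0)
continuum_upper_slope = 20.0  # Hz/kPa, contact only, upper value quoted in the text

print(f"two-mass model average slope: {two_mass_slope:.1f} Hz/kPa")
print(f"continuum model upper value:  {continuum_upper_slope:.1f} Hz/kPa")
```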

B. Vocal intensity

Because voice is produced at the glottis, filtered by the vocal tract, and radiated from the mouth, an increase in vocal intensity can be achieved by either increasing the source intensity or enhancing the radiation efficiency. The source intensity is controlled primarily by the subglottal pressure, which increases the vibration amplitude and the negative peak or MFDR of the time derivative of the glottal flow. The subglottal pressure depends primarily on the alveolar pressure in the lungs, which is controlled by the respiratory muscles and the lung volume. In general, conditions of the laryngeal system have little effect on the establishment of the alveolar pressure and subglottal pressure ( Hixon, 1987 ; Finnegan et al. , 2000 ). However, an open glottis often results in a small glottal resistance and thus a considerable pressure drop in the lower airway and a reduced subglottal pressure. An open glottis also leads to a large glottal flow rate and a rapid decline in the lung volume, thus reducing the duration of speech between breaths and increasing the respiratory effort required in order to maintain a target subglottal pressure ( Zhang, 2016b ).
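As a concrete reading of the MFDR definition above, the following sketch estimates it from a sampled glottal flow waveform by finite differences. The half-sine pulse is a deliberately crude stand-in for a real glottal flow pulse, and the function names, sampling rate, and amplitudes are assumptions made for illustration.

```python
import math


def mfdr(flow, sample_rate_hz: float) -> float:
    """Maximum flow declination rate: magnitude of the most negative slope.

    The time derivative is approximated by first differences of the samples.
    """
    derivative = [(b - a) * sample_rate_hz for a, b in zip(flow, flow[1:])]
    return max(0.0, -min(derivative))


def half_sine_pulse(peak_flow: float, open_samples: int, closed_samples: int):
    """One cycle: half-sine open phase followed by a zero-flow closed phase."""
    opened = [peak_flow * math.sin(math.pi * n / open_samples)
              for n in range(open_samples + 1)]
    return opened + [0.0] * closed_samples


fs = 10_000.0
soft = half_sine_pulse(1.0, 100, 100)     # arbitrary flow units
loud = half_sine_pulse(2.0, 100, 100)     # larger amplitude, same timing
abrupt = half_sine_pulse(1.0, 50, 150)    # same amplitude, shorter open phase

for name, pulse in [("soft", soft), ("loud", loud), ("abrupt", abrupt)]:
    print(f"{name}: MFDR = {mfdr(pulse, fs):.1f} flow units per second")
```

In this toy pulse the MFDR grows with either a larger flow amplitude or a faster flow shutoff.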

In the absence of a vocal tract, laryngeal adjustments, which control vocal fold stiffness, geometry, and position, do not have much effect on the source intensity, as shown in many studies using laryngeal, physical, or computational models of phonation ( Tanaka and Tanabe, 1986 ; Titze, 1988b ; Zhang, 2016a ). In the experiment by Tanaka and Tanabe (1986) , for a constant subglottal pressure, stimulation of the CT and LCA muscles had almost no effects on vocal intensity whereas stimulation of the TA muscle slightly decreased vocal intensity. In an excised larynx experiment, Titze (1988b) found no dependence of vocal intensity on the glottal width. Similar secondary effects of laryngeal adjustments have also been observed in a recent computational study ( Zhang, 2016a ). Zhang (2016a) also showed that the effect of laryngeal adjustments may be important at subglottal pressures slightly above onset, in which case an increase in either AP stiffness or vocal fold approximation may lead to improved vocal fold contact and glottal closure, which significantly increased the MFDR and thus vocal intensity. However, these effects became less efficient with increasing vocal intensity.

The effect of laryngeal adjustments on vocal intensity becomes a little more complicated in the presence of the vocal tract. Changing vocal tract shape by itself does not amplify the produced sound intensity because sound propagation in the vocal tract is a passive process. However, changes in vocal tract shape may provide a better impedance match between the glottis and the free space outside the mouth and thus improve efficiency of sound radiation from the mouth ( Titze and Sundberg, 1992 ). This is particularly the case for harmonics close to a formant, which are often amplified more than the first harmonic and may become the most energetic harmonic in the spectrum of the output voice. Thus, vocal intensity can be increased through laryngeal adjustments that increase excitation of harmonics close to the first formant of the vocal tract ( Fant, 1982 ; Sundberg, 1987 ) or by adjusting vocal tract shape to match one of the formants with one of the dominant harmonics in the source spectrum.
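Which harmonic benefits most from a given formant follows directly from the harmonic series. The sketch below finds the harmonic of the source nearest to a formant frequency; the F0 and formant values are hypothetical examples, not measurements from the cited studies.

```python
def nearest_harmonic(f0_hz: float, formant_hz: float):
    """Return (harmonic number, harmonic frequency) closest to the formant."""
    if f0_hz <= 0:
        raise ValueError("F0 must be positive")
    number = max(1, round(formant_hz / f0_hz))
    return number, number * f0_hz


# Hypothetical values: a speaking-range F0 and a high sung F0, same first formant.
for f0 in (220.0, 880.0):
    n, freq = nearest_harmonic(f0, 700.0)
    print(f"F0 = {f0:.0f} Hz: harmonic {n} at {freq:.0f} Hz, "
          f"{abs(freq - 700.0):.0f} Hz from the formant")
```

When F0 exceeds the first formant, the nearest harmonic is the fundamental itself, so the mismatch can only be reduced by moving the formant or changing the pitch.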

In humans, all three strategies (respiratory, laryngeal, and articulatory) are used to increase vocal intensity. When asked to produce an intensity sweep from soft to loud voice, one generally starts with a slightly breathy voice with a relatively open glottis, which requires the least laryngeal effort but is inefficient in voice production. From this starting position, vocal intensity can be increased by increasing either the subglottal pressure, which increases vibration amplitude, or vocal fold adduction (approximation and/or thickening). For a soft voice with minimal vocal fold contact and minimal higher-order harmonic excitation, increasing vocal fold adduction is particularly efficient because it may significantly improve vocal fold contact, in both spatial extent and duration, thus significantly boosting the excitation of harmonics close to the first formant. In humans, for low to medium vocal intensity conditions, vocal intensity increase is often accompanied by simultaneous increases in the subglottal pressure and the glottal resistance ( Isshiki, 1964 ; Holmberg et al. , 1988 ; Stathopoulos and Sapienza, 1993 ). Because the pitch level did not change much in these experiments, the increase in glottal resistance was most likely due to tighter vocal fold approximation through LCA/IA activation. The duration of the closed phase is often observed to increase with increasing vocal intensity ( Henrich et al. , 2005 ), indicating increased vocal fold thickening or medial compression, which are primarily controlled by the TA muscle. Thus, it seems that both the LCA/IA/TA muscles and subglottal pressure increase play a role in vocal intensity increase at low to medium intensity conditions. For high vocal intensity conditions, when further increase in vocal fold adduction becomes less effective ( Hirano et al. , 1969 ), vocal intensity increase appears to rely dominantly on the subglottal pressure increase.

On the vocal tract side, Titze (2002) showed that the vocal intensity can be increased by matching a wide epilarynx with lower glottal resistance or a narrow epilarynx with higher glottal resistance. Tuning the first formant (e.g., by opening mouth wider) to match the F0 is often used in soprano singing to maximize vocal output ( Joliveau et al. , 2004 ). Because radiation efficiency can be improved through adjustments in either the vocal folds or the vocal tract, this makes it possible to improve radiation efficiency yet still maintain desired pitch or articulation, whichever one wishes to achieve.

C. Voice quality

Voice quality generally refers to aspects of the voice other than pitch and loudness. Due to the subjective nature of voice quality perception, many different descriptions are used and authors often disagree on the meanings of these descriptions ( Gerratt and Kreiman, 2001 ; Kreiman and Sidtis, 2011 ). This lack of a clear and consistent definition of voice quality makes it difficult to study voice quality and to identify its physiological correlates and controls. Acoustically, voice quality is associated with the spectral amplitude and shape of the harmonic and noise components of the voice source, and their temporal variations. In the following we focus on physiological factors that are known to have an impact on the voice spectra and thus are potentially perceptually important.

One of the first systematic investigations of the physiological controls of voice quality was conducted by Isshiki (1989 , 1998) using excised larynges, in which regions of normal, breathy, and rough voice qualities were mapped out in the three-dimensional parameter space of the subglottal pressure, vocal fold stiffness, and prephonatory glottal opening area (Fig. 9). He showed that for a given vocal fold stiffness and prephonatory glottal opening area, increasing subglottal pressure led to voice production of a rough quality. This effect of the subglottal pressure can be counterbalanced by increasing vocal fold stiffness, which increased the region of normal voice in the parameter space of Fig. 9. Unfortunately, the details of this study, including the definition and manipulation of vocal fold stiffness and perceptual evaluation of different voice qualities, are not fully available. The importance of the coordination between the subglottal pressure and laryngeal conditions was also demonstrated in van den Berg and Tan (1959) , which showed that although different vocal registers were observed, each register occurred in a certain range of laryngeal conditions and subglottal pressures. For example, for conditions of low longitudinal tension, a chest-like phonation was possible only for small airflow rates. At large values of the subglottal pressure, “it was impossible to obtain good sound production. The vocal folds were blown too wide apart…. The shape of the glottis became irregularly curved and this curving was propagated along the glottis.” Good voice production at large flow rates was possible only with thyroid cartilage compression which imitates the effect of TA muscle activation. Irregular vocal fold vibration at high subglottal pressures has also been observed in physical model experiments (e.g., Xuan and Zhang, 2014 ).
Irregular or chaotic vocal fold vibration at conditions of pressure-stiffness mismatch has also been reported in the numerical simulation of Berry et al. (1994) , which showed that while regular vocal fold vibration was observed for typical vocal fold stiffness conditions, irregular vocal fold vibration (e.g., subharmonic or chaotic vibration) was observed when the cover layer stiffness was significantly reduced while maintaining the same subglottal pressure.

Fig. 9. A three-dimensional map of normal (N), breathy (B), and rough (R) phonation in the parameter space of the prephonatory glottal area (Ag0), subglottal pressure (Ps), and vocal fold stiffness (k). Reprinted with permission of Springer from Isshiki (1989) .

The experiments of van den Berg and Tan (1959) and Isshiki (1989) also showed that weakly adducted vocal folds (weak LCA/IA/TA activation) often lead to vocal fold vibration with incomplete glottal closure during phonation. When the airflow is sufficiently high, the persistent glottal gap would lead to increased turbulent noise production and thus phonation of a breathy quality (Fig. 9). The incomplete glottal closure may occur in the membranous or the cartilaginous portion of the glottis. When the incomplete glottal closure is limited to the cartilaginous glottis, the resulting voice is breathy but may still have strong harmonics at high frequencies. When the incomplete glottal closure occurs in the membranous glottis, the reduced or slowed vocal fold contact would also reduce excitation of higher-order harmonics, resulting in a breathy and weak quality of the produced voice. When the vocal folds are sufficiently separated, the coupling between the two vocal folds may be weakened enough so that each vocal fold can vibrate at a different F0. This would lead to biphonation or voice containing two distinct fundamental frequencies, resulting in a perception similar to that of the beat frequency phenomenon.
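The beat phenomenon referred to here can be made explicit for the idealized case of two equal-amplitude sinusoids, which is an assumption for illustration, since real biphonic voices contain two full harmonic series. Their sum equals a carrier at the mean frequency multiplied by a slowly varying envelope whose magnitude repeats at the difference frequency. The function names and frequencies below are hypothetical.

```python
import math


def two_tone(t: float, f1_hz: float, f2_hz: float) -> float:
    """Sum of two unit-amplitude sinusoids."""
    return math.sin(2 * math.pi * f1_hz * t) + math.sin(2 * math.pi * f2_hz * t)


def carrier_times_envelope(t: float, f1_hz: float, f2_hz: float) -> float:
    """Same signal written as 2 sin(2 pi f_mean t) cos(2 pi f_halfdiff t)."""
    mean = 0.5 * (f1_hz + f2_hz)
    half_diff = 0.5 * (f1_hz - f2_hz)
    return 2 * math.sin(2 * math.pi * mean * t) * math.cos(2 * math.pi * half_diff * t)


def beat_frequency(f1_hz: float, f2_hz: float) -> float:
    """Rate at which the envelope magnitude repeats."""
    return abs(f1_hz - f2_hz)


# Hypothetical left and right fold frequencies.
left_f0, right_f0 = 100.0, 110.0
print(f"beat frequency: {beat_frequency(left_f0, right_f0):.0f} Hz")
print(two_tone(0.0123, left_f0, right_f0), carrier_times_envelope(0.0123, left_f0, right_f0))
```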

Compared to a breathy voice, a pressed voice is presumably produced with tight vocal fold approximation or even some degree of medial compression in the membranous portion between the two folds. A pressed voice is often characterized by a second harmonic that is stronger than the first harmonic, or a negative H1-H2, with a long period of glottal closure during vibration. Although a certain degree of vocal fold approximation and stiffness anisotropy is required to achieve vocal fold contact during phonation, the duration of glottal closure has been shown to be primarily determined by the vertical thickness of the vocal fold medial surface ( van den Berg, 1968 ; Zhang, 2016a ). Thus, although it is generally assumed that a pressed voice can be produced with tight arytenoid adduction through LCA/IA muscle activation, activation of the LCA/IA muscles alone is unable to achieve prephonatory medial compression in the membranous glottis or change the vertical thickness of the medial surface. Activation of the TA muscle appears to be essential in producing a voice change from a breathy to a pressed voice quality. A weakened TA muscle, as in aging or muscle atrophy, would lead to difficulties in producing a pressed voice or even sufficient glottal closure during phonation. On the other hand, strong TA muscle activation, as in, for example, spasmodic dysphonia, may lead to too tight a closure of the glottis and a rough voice quality ( Isshiki, 1989 ).
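H1-H2 is the level difference, in decibels, between the first and second harmonics. A minimal sketch of the calculation from linear harmonic amplitudes follows; the amplitude values are invented for illustration and the function name is hypothetical.

```python
import math


def h1_minus_h2_db(h1_amplitude: float, h2_amplitude: float) -> float:
    """Level difference between the first and second harmonics in dB."""
    if h1_amplitude <= 0 or h2_amplitude <= 0:
        raise ValueError("amplitudes must be positive")
    return 20.0 * math.log10(h1_amplitude / h2_amplitude)


# Invented linear amplitudes for two source spectra.
print(f"strong fundamental:      {h1_minus_h2_db(2.0, 1.0):+.1f} dB")
print(f"strong second harmonic:  {h1_minus_h2_db(1.0, 2.0):+.1f} dB")
```

A negative value, as in the second case, corresponds to the spectral signature of a pressed voice described above.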

In humans, vocal fold stiffness, vocal fold approximation, and geometry are regulated by the same set of laryngeal muscles and thus often co-vary, which has long been considered as one possible origin of vocal registers and their transitions ( van den Berg, 1968 ). Specifically, it has been hypothesized that changes in F0 are often accompanied by changes in the vertical thickness of the vocal fold medial surface, which lead to changes in the spectral characteristics of the produced voice. The medial surface thickness is primarily controlled by the CT and TA muscles, which also regulate vocal fold stiffness and vocal fold approximation. Activation of the CT muscle reduces the medial surface thickness, but also increases vocal fold stiffness and tension, and in some conditions increases the resting glottal opening ( van den Berg and Tan, 1959 ; van den Berg, 1968 ; Hirano and Kakita, 1985 ). Because the LCA/IA/TA muscles are innervated by the same nerve and often activated together, an increase in the medial surface thickness through TA muscle activation is often accompanied by increased vocal fold approximation ( Hirano and Kakita, 1985 ) and contact. Thus, if one attempts to increase F0 primarily by activation of the LCA/IA/TA muscles, the vocal folds are likely to have a large medial surface thickness and probably low AP stiffness, which will lead to a chest-like voice production, with large vertical phase difference along the medial surface, long closure of the glottis, small flow rate, and strong harmonic excitation. In the extreme case of strong TA activation and minimum CT activation and very low subglottal pressure, the glottis can remain closed for most of the cycle, leading to a vocal fry-like voice production. 
In contrast, if one attempts to increase F0 by increasing CT activation alone, the vocal folds, with a small medial surface thickness, are likely to produce a falsetto-like voice production, with incomplete glottal closure and a nearly sinusoidal flow waveform, very high F0, and a limited number of harmonics.

V. MECHANICAL AND COMPUTER MODELS FOR VOICE APPLICATIONS

Voice applications generally fall into two major categories. In the clinic, simulation of voice production has the potential to predict outcomes of clinical management of voice disorders, including surgery and voice therapy. For such applications, accurate representation of vocal fold geometry and material properties to the degree that matches actual clinical treatment is desired, and for this reason continuum models of the vocal folds are preferred over lumped-element models. Computational cost is not necessarily a concern in such applications but still has to be practical. In contrast, for some other applications, particularly in speech technology applications, the primary goal is to reproduce speech acoustics or at least perceptually relevant features of speech acoustics. Real-time capability is desired in these applications, whereas realistic representation of the underlying physics involved is often not necessary. In fact, most of the current speech synthesis systems consider speech purely as an acoustic signal and do not model the physics of speech production at all. However, models that take into consideration the underlying physics, at least to some degree, may hold the most promise in speech synthesis of natural-sounding, speaker-specific quality.

A. Mechanical vocal fold models

Early efforts on artificial speech production, dating back to as early as the 18th century, focused on mechanically reproducing the speech production system. A detailed review can be found in Flanagan (1972) . The focus of these early efforts was generally on articulation in the vocal tract rather than the voice source, which is understandable considering that meaning is primarily conveyed through changes in articulation and the lack of understanding of the voice production process. The vibrating element in these mechanical models, either a vibrating reed or a slotted rubber sheet stretched over an opening, is only a rough approximation of the human vocal folds.

More sophisticated mechanical models have been developed more recently to better reproduce the three-dimensional layered structure of the vocal folds. A membrane (cover)-cushion (body) two-layer rubber vocal fold model was first developed by Smith (1956) . Similar mechanical models were later developed and used in voice production research (e.g., Isogai et al. , 1988 ; Kakita, 1988 ; Titze et al. , 1995 ; Thomson et al. , 2005 ; Ruty et al. , 2007 ; Drechsel and Thomson, 2008 ), using silicone or rubber materials or liquid-filled membranes. Recent studies ( Murray and Thomson, 2012 ; Xuan and Zhang, 2014 ) have also started to embed fibers into these models to simulate the anisotropic material properties due to the presence of collagen and elastin fibers in the vocal folds. A similar layered vocal fold model has been incorporated into a mechanical talking robot system ( Fukui et al. , 2005 ; Fukui et al. , 2007 ; Fukui et al. , 2008 ). The most recent version of the talking robot, Waseda Talker, includes mechanisms for the control of pitch and resting glottal opening, and is able to produce voice of modal, creaky, or breathy quality. Nevertheless, although a mechanical voice production system may find application in voice prosthesis or humanoid robotic systems in the future, current mechanical models are still a long way from reproducing or even approaching humans' capability and flexibility in producing and controlling voice.

B. Formant synthesis and parametric voice source models

Compared to mechanically reproducing the physical process involved in speech production, it is easier to reproduce speech as an acoustic signal. This is particularly the case for speech synthesis. One approach adopted in most of the current speech synthesis systems is to concatenate segments of pre-recorded natural voice into new speech phrases or sentences. While relatively easy to implement, in order to achieve natural-sounding speech, this approach requires a large database of words spoken in different contexts, which makes it difficult to apply to personalized speech synthesis of varying emotional percepts.

Another approach is to reproduce only perceptually relevant acoustic features of speech, as in formant synthesis. The target acoustic features to be reproduced generally include the F0, sound amplitude, and formant frequencies and bandwidths. This approach gained popularity with the development of electrical synthesizers and later computer simulations which allow flexible and accurate control of these acoustic features. Early formant-based synthesizers used simple sound sources, often a filtered impulse train as the sound source for voiced sounds and white noise for unvoiced sounds. Research on the voice sources (e.g., Fant, 1979 ; Fant et al. , 1985 ; Rothenberg et al. , 1971 ; Titze and Talkin, 1979 ) has led to the development of parametric voice source models in the time domain, which are capable of producing voice source waveforms of varying F0, amplitude, open quotient, and degree of abruptness of the glottal flow shutoff, and thus synthesis of different voice qualities.
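As a rough illustration of the formant-synthesis idea described above, the sketch below passes an impulse-train source through a cascade of two-pole digital resonators, one per formant. The resonator uses the standard second-order difference equation common in Klatt-style synthesizers; the function names, the sampling rate, and the formant frequencies and bandwidths are illustrative assumptions, not values taken from the studies cited here.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2] for one formant."""
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at 0 Hz
    return a, b, c

def resonate(signal, freq_hz, bw_hz, fs):
    """Filter a list of samples through a single two-pole resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(f0_hz, n_samples, fs):
    """Simple voiced source: one unit impulse per fundamental period."""
    period = int(round(fs / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def synthesize_vowel(f0_hz, formants, n_samples=3200, fs=16000):
    """Cascade the source through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(f0_hz, n_samples, fs)
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, fs)
    return signal

# Assumed, illustrative formant values for an /a/-like vowel.
vowel = synthesize_vowel(100.0, [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)])
```

Replacing `impulse_train` with a parametric glottal pulse (for example, one with an adjustable open quotient) is where the voice source models discussed above would enter such a synthesizer.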

While parametric voice source models provide flexibility in source variations, synthetic speech generated by formant synthesis still suffers from limited naturalness. This limited naturalness may result from the primitive rules used in specifying dynamic controls of the voice source models (Klatt, 1987). Also, the source model control parameters are not independent of each other and often co-vary during phonation. A challenge in formant synthesis is thus to specify voice source parameter combinations and their time variation patterns that may occur in realistic voice production of different voice qualities by different speakers. It is also possible that some perceptually important features are missing from time-domain voice source models (Klatt, 1987). Human perception of voice characteristics is better described in the frequency domain, as the auditory system performs an approximation to Fourier analysis of the voice and of sound in general. While time-domain models correspond better to the physical events occurring during phonation (e.g., glottal opening and closing, and the closed phase), it is possible that some spectral details of perceptual importance are not captured in the simple time-domain voice source models. For example, spectral details in the low and middle frequencies have been shown to be of considerable importance to naturalness judgment, but are difficult to represent in a time-domain source model (Klatt, 1987). A recent study (Kreiman et al., 2015) showed that spectral-domain voice source models are able to create significantly better matches to natural voices than time-domain voice source models. Furthermore, because of the independence between the voice source and the sub- and supra-glottal systems in formant synthesis, interactions and co-variations between vocal folds and the sub- and supra-glottal systems are by design not accounted for. All these factors may contribute to the limited naturalness of formant-synthesized speech.

C. Physically based computer models

An alternative approach to natural speech synthesis is to computationally model the voice production process based on physical principles. The control parameters would be geometry and material properties of the vocal system or, in a more realistic way, respiratory and laryngeal muscle activation. This approach avoids the need to specify consistent characteristics of either the voice source or the formants, thus allowing synthesis and modification of natural voice in a way intuitively similar to human voice production and control.

The first such computer model of voice production is the one-mass model by Flanagan and Landgraf (1968), in which the vocal fold is modeled as a horizontally moving single-degree-of-freedom mass-spring-damper system. This model is able to vibrate in a restricted range of conditions when the natural frequency of the mass-spring system is close to one of the acoustic resonances of the subglottal or supraglottal tracts. Ishizaka and Flanagan (1972) extended this model to a two-mass model in which the upper and lower parts of the vocal fold are modeled as two separate masses connected by an additional spring along the vertical direction. The two-mass model is able to vibrate with a vertical phase difference between the two masses, and thus able to vibrate independently of the acoustics of the sub- and supra-glottal tracts. Many variants of the two-mass model have since been developed. Titze (1973) developed a 16-mass model to better represent vocal fold motion along the anterior-posterior direction. To better represent the body-cover layered structure of the vocal folds, Story and Titze (1995) extended the two-mass model to a three-mass model, adding an additional lateral mass representing the inner muscular layer. Empirical rules have also been developed to relate control parameters of the three-mass model to laryngeal muscle activation levels (Titze and Story, 2002) so that voice production can be simulated with laryngeal muscle activity as input. Designed originally for speech synthesis purposes, these lumped-element models of voice production are generally fast in computational time and ideal for real-time speech synthesis.
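To make the lumped-element idea concrete, the following minimal sketch simulates only the mechanical part of a single-degree-of-freedom mass-spring-damper system and estimates its oscillation frequency. It is not the Flanagan and Landgraf model itself: the aerodynamic driving force and the acoustic coupling that sustain vibration are omitted, so the motion simply decays. The mass, stiffness, and damping values are assumed for illustration only.

```python
import math

def natural_frequency_hz(mass_kg, stiffness_n_per_m):
    """Undamped natural frequency of a mass-spring system."""
    return math.sqrt(stiffness_n_per_m / mass_kg) / (2.0 * math.pi)

def simulate_free_vibration(mass_kg, stiffness, damping, x0, n_steps, dt):
    """Integrate m*x'' + b*x' + k*x = 0 with semi-implicit Euler steps."""
    x, v = x0, 0.0
    xs = []
    for _ in range(n_steps):
        accel = -(stiffness * x + damping * v) / mass_kg
        v += accel * dt  # update velocity first
        x += v * dt      # then position, using the new velocity
        xs.append(x)
    return xs

def estimate_frequency_hz(xs, dt):
    """Estimate frequency from the spacing of upward zero crossings."""
    crossings = [i for i in range(1, len(xs)) if xs[i - 1] < 0.0 <= xs[i]]
    if len(crossings) < 2:
        return 0.0
    return (len(crossings) - 1) / ((crossings[-1] - crossings[0]) * dt)

# Assumed, illustrative values: 0.125 g mass, 80 N/m spring, damping ratio 0.1.
MASS, STIFFNESS, DAMPING = 1.25e-4, 80.0, 0.02
trace = simulate_free_vibration(MASS, STIFFNESS, DAMPING, 1e-3, 10000, 1e-5)
measured_hz = estimate_frequency_hz(trace, 1e-5)
```

With these values the undamped natural frequency is about 127 Hz, and the simulated damped oscillation comes out slightly lower. A two-mass model would add a second mass and a coupling spring to the same integration loop.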

A drawback of the lumped-element models of phonation is that the model control parameters cannot be directly measured or easily related to the anatomical structure or material properties of the vocal folds. Thus, these models are not as useful in applications in which a realistic representation of voice physiology is required, as, for example, in the clinical management of voice disorders. To better understand the voice source and its control under different voicing conditions, more sophisticated computational models of the vocal folds based on continuum mechanics have been developed to understand laryngeal muscle control of vocal fold geometry, stiffness, and tension, and how changes in these vocal fold properties affect the glottal fluid-structure interaction and the produced voice. One of the first such models is the finite-difference model by Titze and Talkin (1979) , which coupled a three-dimensional vocal fold model of linear elasticity with the one-dimensional glottal flow model of Ishizaka and Flanagan (1972) . In the past two decades more refined phonation models using a two-dimensional or three-dimensional Navier-Stokes description of the glottal flow have been developed (e.g., Alipour et al. , 2000 ; Zhao et al. , 2002 ; Tao et al. , 2007 ; Luo et al. , 2009 ; Zheng et al. , 2009 ; Bhattacharya and Siegmund, 2013 ; Xue et al. , 2012 , 2014 ). Continuum models of laryngeal muscle activation have also been developed to model vocal fold posturing ( Hunter et al. , 2004 ; Gommel et al. , 2007 ; Yin and Zhang, 2013 , 2014 ). By directly modeling the voice production process, continuum models with realistic geometry and material properties ideally hold the most promise in reproducing natural human voice production. 
However, because the phonation process is highly nonlinear and involves large displacement and deformation of the vocal folds and complex glottal flow patterns, modeling this process in three dimensions is computationally very challenging and time-consuming. As a result, these computational studies are often limited to one or two specific aspects instead of the entire voice production process, and the acoustics of the produced voice, other than F0 and vocal intensity, are often not investigated. For practical applications, real-time or not, reduced-order models with significantly improved computational efficiency are required. Some reduced-order continuum models, with simplifications in both the glottal flow and vocal fold dynamics, have been developed and used in large-scale parametric studies of voice production (e.g., Titze and Talkin, 1979 ; Zhang, 2016a ), which appear to produce qualitatively reasonable predictions. However, these simplifications have yet to be rigorously validated by experiment.

VI. FUTURE CHALLENGES

We currently have a general understanding of the physical principles of voice production. Toward establishing a cause-effect theory of voice production, much remains to be learned about voice physiology and biomechanics. This includes the geometry and mechanical properties of the vocal folds and their variability across subject, sex, and age, and how they vary across different voicing conditions under laryngeal muscle activation. Even less is known about changes in vocal fold geometry and material properties in pathologic conditions. The surface conditions of the vocal folds and their mechanical properties have been shown to affect vocal fold vibration (Dollinger et al., 2014; Bhattacharya and Siegmund, 2015; Tse et al., 2015), and thus need to be better quantified. While in vivo animal or human larynx models (Moore and Berke, 1988; Chhetri et al., 2012; Berke et al., 2013) could provide such information, more reliable measurement methods are required to better quantify the viscoelastic properties of the vocal fold, vocal fold tension, and the geometry and movement of the inner vocal fold layers. While macro-mechanical properties are of interest, development of vocal fold constitutive laws based on ECM distribution and interstitial fluids within the vocal folds would allow us to better understand how vocal fold mechanical properties change with prolonged vocal use, vocal fold injury, and wound healing, changes that are otherwise difficult to quantify.

While oversimplification of the vocal folds to mass and tension is of limited practical use, the other extreme is not appealing, either. With improved characterization and understanding of vocal fold properties, establishing a cause-effect relationship between voice physiology and production thus requires identifying which of these physiologic features are actually perceptually relevant and under what conditions, through systematic parametric investigations. Such investigations will also facilitate the development of reduced-order computational models of phonation in which perceptually relevant physiologic features are sufficiently represented and features of minimum perceptual relevance are simplified. We discussed earlier that many of the complex supraglottal flow phenomena have questionable perceptual relevance. Similar relevance questions can be asked with regard to the geometry and mechanical properties of the vocal folds. For example, while the vocal folds exhibit complex viscoelastic properties, what are the main material properties that are definitely required in order to reasonably predict vocal fold vibration and voice quality? Does each of the vocal fold layers, in particular, the different layers of the lamina propria, have a functional role in determining the voice output or preventing vocal injury? Current vocal fold models often use a simplified vocal fold geometry. Could some geometric features of a realistic vocal fold that are not included in current models have an important role in affecting voice efficiency and voice quality? Because voice communication spans a large range of voice conditions (e.g., pitch, loudness, and voice quality), the perceptual relevance and adequacy of specific features (i.e., do changes in specific features lead to perceivable changes in voice?) should be investigated across a large number of voice conditions rather than a few selected conditions. 
While physiologic models of phonation allow better reproduction of realistic vocal fold conditions, computational models are more suitable for such systematic parametric investigations. Unfortunately, due to the high computational cost, current studies using continuum models are often limited to a few conditions. Thus, the establishment of a cause-effect relationship and the development of reduced-order models are likely to be iterative processes, in which the models are gradually refined to include more physiologic details to be considered in the cause-effect relationship.

A causal theory of voice production would allow us to map out regions in the physiological parameter space that produce distinct vocal fold vibration patterns and voice qualities of interest (e.g., normal, breathy, rough voices for clinical applications; different vocal registers for singing training), similar to that described by Isshiki (1989; also Fig. 9). Although the voice production system is quite complex, control of voice should be both stable and simple, which is required for voice to be a robust and easily controlled means of communication. Understanding voice production in the framework of nonlinear dynamics and eigenmode interactions and relating it to voice quality may facilitate progress toward this goal. Toward practical clinical applications, such a voice map would help us understand what physiologic alteration caused a given voice change (the inverse problem), and what can be done to restore the voice to normal. Development of efficient and reliable tools addressing the inverse problem has important applications in the clinical diagnosis of voice disorders. Some methods already exist that solve the inverse problem in lumped-element models (e.g., Dollinger et al., 2002; Hadwin et al., 2016), and these can be extended to physiologically more realistic continuum models.
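The inverse problem can be illustrated with a deliberately simple toy: given an observed frequency, search for the model parameter that reproduces it. The sketch below assumes a one-parameter forward model (the natural frequency of a mass-spring system with a fixed, assumed mass) and inverts it by bisection. The published methods cited above fit several parameters of lumped-element models to measured vibration data and are considerably more involved; the function names and numerical values here are hypothetical.

```python
import math

def forward_f0_hz(stiffness, mass_kg=1.25e-4):
    """Toy forward model: stiffness (N/m) -> oscillation frequency (Hz)."""
    return math.sqrt(stiffness / mass_kg) / (2.0 * math.pi)

def invert_stiffness(target_f0_hz, lo=1.0, hi=1000.0, tol=1e-9,
                     forward=forward_f0_hz):
    """Find the stiffness whose predicted frequency matches the target.

    Relies on the forward model increasing monotonically with stiffness.
    """
    if not (forward(lo) <= target_f0_hz <= forward(hi)):
        raise ValueError("target frequency lies outside the searched range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if forward(mid) < target_f0_hz:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Round trip: generate an "observation" from a known stiffness, then recover it.
observed_hz = forward_f0_hz(80.0)
recovered_stiffness = invert_stiffness(observed_hz)
```

With more than one unknown parameter the mapping is generally not one-to-one, which is part of why the inverse problem is hard in realistic models.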

Solving the inverse problem would also provide an indirect approach toward understanding the physiologic states that lead to percepts of different emotional states or communication of other personal traits, which are otherwise difficult to measure directly in live human beings. When extended to continuous speech production, this approach may also provide insights into the dynamic physiologic control of voice in running speech (e.g., time contours of the respiratory and laryngeal adjustments). Such information would facilitate the development of computer programs capable of natural-sounding, conversational speech synthesis, in which the time contours of control parameters may change with context, speaking style, or emotional state of the speaker.

ACKNOWLEDGMENTS

This study was supported by research Grant Nos. R01 DC011299 and R01 DC009229 from the National Institute on Deafness and Other Communication Disorders, the National Institutes of Health. The author would like to thank Dr. Liang Wu for assistance in preparing the MRI images in Fig. 1, Dr. Jennifer Long for providing the image in Fig. 1(b), Dr. Gerald Berke for providing the stroboscopic recording from which Fig. 3 was generated, and Dr. Jody Kreiman, Dr. Bruce Gerratt, Dr. Ronald Scherer, and an anonymous reviewer for the helpful comments on an earlier version of this paper.

The Human Voice in Speech and Singing

  • Björn Lindblom
  • Johan Sundberg

Part of the book series: Springer Handbooks (SHB)

This chapter describes various aspects of the human voice as a means of communication in speech and singing. From the point of view of function, vocal sounds can be regarded as the end result of a three-stage process: (1) the compression of air in the respiratory system, which produces an exhalatory airstream, (2) the vibrating vocal folds' transformation of this airstream into an intermittent or pulsating airstream, which is a complex tone, referred to as the voice source, and (3) the filtering of this complex tone in the vocal tract resonator. The main function of the respiratory system is to generate an overpressure of air under the glottis, or a subglottal pressure. Section 16.1 describes different aspects of the respiratory system of significance to speech and singing, including lung volume ranges, subglottal pressures, and how this pressure is affected by the ever-varying recoil forces. The complex tone generated when the airstream from the lungs passes the vibrating vocal folds can be varied in at least three dimensions: fundamental frequency, amplitude, and spectrum. Section 16.2 describes how these properties of the voice source are affected by the subglottal pressure, the length and stiffness of the vocal folds, and how firmly the vocal folds are adducted. Section 16.3 gives an account of the vocal tract filter and how its form determines the frequencies of its resonances, and Sect. 16.4 gives an account of how these resonance frequencies or formants shape the vocal sounds by imposing spectrum peaks separated by spectrum valleys, and how the frequencies of these peaks determine vowel and voice qualities. The remaining sections of the chapter describe various aspects of the acoustic signals used for vocal communication in speech and singing. The syllable structure is discussed in Sect. 16.5, the closely related aspects of rhythmicity and timing in speech and singing are described in Sect. 16.6, and pitch and rhythm aspects in Sect. 16.7. The impressive control of all these acoustic characteristics of vocal signals is discussed in Sect. 16.8, while Sect. 16.9 considers expressive aspects of vocal communication.

Abbreviations.

articulation class

long-term-average spectra

maximum flow declination rate

magnetic resonance imaging

resting expiratory level

sound pressure level

speech transmission index

total lung capacity

vital capacity

T.J. Hixon: Respiratory Function in Speech and Song (Singular, San Diego 1991) pp. 1–54

A. L. Winkworth, P. J. Davis, E. Ellis, R. D. Adams: Variability and consistency in speech breathing during reading: Lung volumes, speech intensity, and linguistic factors, JSHR 37 , 535–556 (1994)

M. Thomasson: From Air to Aria , Ph.D. Thesis (Music Acoustics, KTH 2003)

B. Conrad, P. Schönle: Speech and respiration, Arch. Psychiat. Nervenkr. 226 , 251–268 (1979)

M.H. Draper, P. Ladefoged, D. Whitteridge: Respiratory muscles in speech, J. Speech Hear. Disord. 2 , 16–27 (1959)

R. Netsell: Speech physiology. In: Normal aspects of speech, hearing, and language , ed. by P.D. Minifie, T.J. Hixon, P. Hixon, P. Williams (Prentice-Hall, Englewood Cliffs 1973) pp. 211–234

P. Ladefoged, M.H. Draper, D. Whitteridge: Syllables and stress, Misc. Phonetica 3 , 1–14 (1958)

P. Ladefoged: Speculations on the control of speech. In: A Figure of Speech: A Festschrift for John Laver , ed. by W.J. Hardcastle, J. Mackenzie Beck (Lawrence Erlbaum, Mahwah 2005) pp. 3–21

T.J. Hixon, G. Weismer: Perspectives on the Edinburgh study of speech breathing, J. Speech Hear. Disord. 38 , 42–60 (1995)

S. Nooteboom: The prosody of speech: melody and rhythm. In: The Handbook of Phonetic Sciences , ed. by W.J. Hardcastle, J. Laver (Blackwell, Oxford 1997) pp. 640–673

M. Rothenberg: The breath-stream dynamics of simple-released-plosive production Bibliotheca Phonetica 6 (Karger, Basel 1968)

D.H. Klatt, K.N. Stevens, J. Mead: Studies of articulatory activity and airflow during speech, Ann. NY Acad. Sci. 155 , 42–55 (1968)

J.J. Ohala: Respiratory activity in speech. In: Speech production and speech modeling , ed. by W.J. Hardcastle, A. Marchal (Dordrecht, Kluwer 1990) pp. 23–53

R.H. Stetson: Motor Phonetics: A Study of Movements in Action (North Holland, Amsterdam 1951)

P. Ladefoged: Linguistic aspects of respiratory phenomena, Ann. NY Acad. Sci. 155 , 141–151 (1968)

L.H. Kunze: Evaluation of methods of estimating sub-glottal air pressure muscles, J. Speech Hear. Disord. 7 , 151–164 (1964)

R. Leanderson, J. Sundberg, C. von Euler: Role of the diaphragmatic activity during singing: a study of transdiaphragmatic pressures, J. Appl. Physiol. 62 , 259–270 (1987)

J. Sundberg, N. Elliot, P. Gramming, L. Nord: Short-term variation of subglottal pressure for expressive purposes in singing and stage speech. A preliminary investigation, J. Voice 7 , 227–234 (1993)

J. Sundberg: Synthesis of singing, in Musica e Technologia: Industria e Cultura per lo Sviluppo del Mezzogiorno. In: Proceedings of a symposium in Venice , ed. by R. Favaro (Unicopli, Milan 1987) pp. 145–162

J. Sundberg: Synthesis of singing by rule. In: Current Directions in Computer Music Research, System Development Foundation Benchmark Series , ed. by M. Mathews, J. Pierce (MIT, Cambridge 1989), 45-55 & 401-403

J. Molinder: Akustiska och perceptuella skillnader mellan röstfacken lyrisk och dramatisk sopran, unpublished thesis work (Lund Univ. Hospital, Dept of Logopedics, Lund 1997)

T. Baer: Reflex activation of laryngeal muscles by sudden induced subglottal pressure changes, J. Acoust. Soc. Am. 65 , 1271–1275 (1979)

T. Cleveland, J. Sundberg: Acoustic analyses of three male voices of different quality. In: SMAC 83. Proceedings of the Stockholm Internat Music Acoustics Conf , Vol. 1, ed. by A. Askenfelt, S. Felicetti, E. Jansson, J. Sundberg (Roy. Sw. Acad. Music, Stockholm 1985) pp. 143–156, No. 46:1

J. Sundberg, C. Johansson, H. Willbrand, C. Ytterbergh: From sagittal distance to area, Phonetica 44 , 76–90 (1987)

I.R. Titze: Phonation threshold pressure: A missing link in glottal aerodynamics, J. Acoust. Soc. Am. 91 , 2926–2935 (1992)

I.R. Titze: Principles of Voice Production (Prentice-Hall, Englewood Cliffs 1994)

G. Fant: Acoustic theory of speech production (Mouton, The Hague 1960)

K.N. Stevens: Acoustic Phonetics (MIT, Cambridge 1998)

M. Hirano: Clinical Examination of Voice (Springer, New York 1981)

M. Rothenberg: A new inverse-filtering technique for deriving the glottal air flow waveform during voicing, J. Acoust. Soc. Am. 53 , 1632–1645 (1973)

G. Fant: The voice source – Acoustic modeling. In: STL/Quart. Prog. Status Rep. 4 (Royal Inst. of Technology, Stockholm 1982) pp. 28–48

C. Gobl: The voice source in speech communication production and perception experiments involving inverse filtering and synthesis. D.Sc. thesis (Royal Inst. of Technology (KTH), Stockholm 2003)

G. Fant, J. Liljencrants, Q. Lin: A four-parameter model of glottal flow. In: STL/Quart. Prog. Status Rep. 4, Speech, Music and Hearing (Royal Inst. of Technology, Stockholm 1985) pp. 1–13

D.H. Klatt, L.C. Klatt: Analysis, synthesis and perception of voice quality variations among female and male talkers, J. Acoust. Soc. Am. 87 (2), 820–857 (1990)

M. Ljungqvist, H. Fujisaki: A comparative study of glottal waveform models. In: Technical Report of the Institute of Electronics and Communications Engineers , Vol. EA85-58 (Institute of Electronics and Communications Engineers, Tokyo 1985) pp. 23–29

A.E. Rosenberg: Effect of glottal pulse shape on the quality of natural vowels, J. Acoust. Soc. Am. 49 , 583–598 (1971)

M. Rothenberg, R. Carlson, B. Granström, J. Lindqvist-Gauffin: A three-parameter voice source for speech synthesis. In: Proceedings of the Speech Communication Seminar 2 , ed. by G. Fant (Almqvist & Wiksell, Stockholm 1975) pp. 235–243

K. Ishizaka, J.L. Flanagan: Synthesis of voiced sounds from a two-mass model of the vocal cords, The Bell Syst. Tech. J. 52 , 1233–1268 (1972)

J. Liljencrants: A translating and rotating mass model of the vocal folds. In: STL/Quart. Prog. Status Rep. 1, Speech, Music and Hearing (Royal Inst. of Technology, Stockholm 1991) pp. 1–18

A. Ní Chasaide, C. Gobl: Voice source variation. In: The Handbook of Phonetic Sciences , ed. by W.J. Hardcastle, J. Laver (Blackwell, Oxford 1997) pp. 427–462

E.B. Holmberg, R.E. Hillman, J.S. Perkell: Glottal air flow and pressure measurements for loudness variation by male and female speakers, J. Acoust. Soc. Am. 84 , 511–529 (1988)

J.S. Perkell, R.E. Hillman, E.B. Holmberg: Group differences in measures of voice production and revised values of maximum airflow declination rate, J. Acoust. Soc. Am. 96 , 695–698 (1994)

J. Gauffin, J. Sundberg: Spectral correlates of glottal voice source waveform characteristics, J. Speech Hear. Res. 32 , 556–565 (1989)

J. Svec, H. Schutte, D. Miller: On pitch jumps between chest and falsetto registers in voice: Data on living and excised human larynges, J. Acoust. Soc. Am. 106 , 1523–1531 (1999)

J. Sundberg, M. Andersson, C. Hultqvist: Effects of subglottal pressure variation on professional baritone singers' voice sources, J. Acoust. Soc. Am. 105 , 1965–1971 (1999)

J. Sundberg, E. Fahlstedt, A. Morell: Effects on the glottal voice source of vocal loudness variation in untrained female and male subjects, J. Acoust. Soc. Am. 117 , 879–885 (2005)

P. Sjölander, J. Sundberg: Spectrum effects of subglottal pressure variation in professional baritone singers, J. Acoust. Soc. Am. 115 , 1270–1273 (2004)

P. Branderud, H. Lundberg, J. Lander, H. Djamshidpey, I. Wäneland, D. Krull, B. Lindblom: X-ray analyses of speech: Methodological aspects, Proc. of 11th Swedish Phonetics Conference (Stockholm Univ., Stockholm 1996) pp. 168–171

B. Lindblom: A numerical model of coarticulation based on a Principal Components analysis of tongue shapes. In: Proc. 15th Int. Congress of the Phonetic Sciences , ed. by D. Recasens, M. Josep Solé, J. Romero (Universitat Autònoma de Barcelona, Barcelona 2003), CD-ROM

G.E. Peterson, H. Barney: Control methods used in a study of the vowels, J. Acoust. Soc. Am. 24 , 175–184 (1952)

Hillenbrand et al.: Acoustic characteristics of American English vowels, J. Acoust. Soc. Am. 97 (5), 3099–3111 (1995)

G. Fant: Analysis and synthesis of speech processes. In: Manual of Phonetics , ed. by B. Malmberg (North-Holland, Amsterdam 1968) pp. 173–277

G. Fant: Formant bandwidth data. In: STL/Quart. Prog. Status Rep. 7 (Royal Inst. of Technology, Stockholm 1962) pp. 1–3

G. Fant: Vocal tract wall effects, losses, and resonance bandwidths. In: STL/Quart. Prog. Status Rep. 2-3 (Royal Inst. of Technology, Stockholm 1972) pp. 173–277

A.S. House, K.N. Stevens: Estimation of formant bandwidths from measurements of transient response of the vocal tract, J. Speech Hear. Disord. 1 , 309–315 (1958)

O. Fujimura, J. Lindqvist: Sweep-tone measurements of vocal-tract characteristics, J. Acoust. Soc. Am. 49 , 541–558 (1971)

I. Lehiste, G.E. Peterson: Vowel amplitude and phonemic stress in American English, J. Acoust. Soc. Am. 3 , 428–435 (1959)

I. Lehiste: Suprasegmentals (MIT Press, Cambridge 1970)

O. Jespersen: Lehrbuch der Phonetik (Teubner, Leipzig 1926)

T. Bartholomew: A physical definition of good voice quality in the male voice, J. Acoust. Soc. Am. 6 , 25–33 (1934)

J. Sundberg: Production and function of the singing formant. In: Report of the eleventh congress Copenhagen 1972 (Proceedings of the 11th international congress of musicology) , ed. by H. Glahn, S. Sörensen, P. Ryom (Wilhelm Hansen, Copenhagen 1972) pp. 679–686

J. Sundberg: Articulatory interpretation of the ʼsinging formantʼ, J. Acoust. Soc. Am. 55 , 838–844 (1974)

J. Sundberg: Level and center frequency of the singer's formant, J. Voice 15 , 176–186 (2001)

G. Berndtsson, J. Sundberg: Perceptual significance of the center frequency of the singer's formant, Scand. J. Logopedics Phoniatrics 20 , 35–41 (1995)

L. Dmitriev, A. Kiselev: Relationship between the formant structure of different types of singing voices and the dimension of supraglottal cavities, Fol. Phoniat. 31 , 238–41 (1979)

P. Ladefoged: Three areas of experimental phonetics (Oxford Univ. Press, London 1967)

J. Barnes, P. Davis, J. Oates, J. Chapman: The relationship between professional operatic soprano voice and high range spectral energy, J. Acoust. Soc. Am. 116 , 530–538 (2004)

M. Nordenberg, J. Sundberg: Effect on LTAS of vocal loudness variation, Logopedics Phoniatrics Vocology 29 , 183–191 (2004)

R. Weiss, W.S. Brown, J. Morris: Singerʼs formant in sopranos: Fact or fiction, J. Voice 15 , 457–468 (2001)

J.M. Heinz, K.N. Stevens: On the relations between lateral cineradiographs, area functions, and acoustics of speech. In: Proc. Fourth Int. Congress on Acoustics , Vol. 1a (1965), paper A44

C. Johansson, J. Sundberg, H. Willbrand: X-ray study of articulation and formant frequencies in two female singers. In: Proc. of Stockholm Music Acoustics Conference 1983 (SMAC 83) , Vol. 46(1), ed. by A. Askenfelt, S. Felicetti, E. Jansson, J Sundberg (Kgl. Musikaliska Akad., Stockholm 1985) pp. 203–218

T. Baer, J.C. Gore, L.C. Gracco, P. Nye: Analysis of vocal tract shape and dimensions using magnetic resonance imaging: Vowels, J. Acoust. Soc. Am. 90 (2), 799–828 (1991)

D. Demolin, M. George, V. Lecuit, T. Metens, A. Soquet: Détermination par IRM de lʼouverture du velum des voyelles nasales du français. In: Actes des XXièmes Journées dʼÉtudes sur la Parole (1996)

A. Foldvik, K. Kristiansen, J. Kvaerness, A. Torp, H. Torp: Three-dimensional ultrasound and magnetic resonance imaging: a new dimension in phonetic research (Proc. Fut. Congress Phonetic Science Stockholm Univ., Stockholm 1995), Vol. 4, 46-49

B.H. Story, I.R. Titze, E.A. Hoffman: Vocal tract area functions from magnetic resonance imaging, J. Acoust. Soc. Am. 100 , 537–554 (1996)

O. Engwall: Talking tongues, D.Sc. thesis (Royal Institute of Technology (KTH), Stockholm 2002)

B. Lindblom, J. Sundberg: Acoustical consequences of lip, tongue, jaw and larynx movement, J. Acoust. Soc. Am. 50 , 1166–1179 (1971), also in Papers in Speech Communication: Speech Production, ed. by R.D. Kent, B.S. Atal, J.L. Miller (Acoust. Soc. Am., New York 1991) pp.329-342

J. Stark, B. Lindblom, J. Sundberg: APEX - an articulatory synthesis model for experimental and computational studies of speech production. In: Fonetik 96: Papers presented at the Swedish Phonetics Conference TMH-QPSR 2/1996 (Royal Institute of Technology, Stockholm 1996) pp. 45–48

J. Stark, C. Ericsdotter, B. Lindblom, J. Sundberg: Using X-ray data to calibrate the APEX synthesis. In: Fonetik 98: Papers presented at the Swedish Phonetics Conference (Stockholm Univ., Stockholm 1998)

J. Stark, C. Ericsdotter, P. Branderud, J. Sundberg, H.-J. Lundberg, J. Lander: The APEX model as a tool in the specification of speaker-specific articulatory behavior. In: Proc. 14th Int. Congress of the Phonetic Sciences , ed. by J.J. Ohala (1999)

C. Ericsdotter: Articulatory copy synthesis: Acoustic performance of an MRI and X-ray-based framework. In: Proc. 15th Int. Congress of the Phonetic Sciences , ed. by D. Recasens, M. Josep Solé, J. Romero (Universitat Autònoma de Barcelona, Barcelona 2003), CD-ROM

C. Ericsdotter: Articulatory-acoustic relationships in Swedish vowel sounds, PhD thesis (Stockholm University, Stockholm 2005)

K.N. Stevens, A.S. House: Development of a quantitative description of vowel articulation, J. Acoust. Soc. Am. 27 , 484–493 (1955)

S. Maeda: Compensatory articulation during speech: Evidence from the analysis and synthesis of vocal-tract shapes using an articulatory model. In: Speech Production and Speech Modeling , ed. by W.J. Hardcastle, A. Marchal (Dordrecht, Kluwer 1990) pp. 131–150

P. Branderud, H. Lundberg, J. Lander, H. Djamshidpey, I. Wäneland, D. Krull, B. Lindblom: X-ray analyses of speech: methodological aspects. In: Proc. XIIIth Swedish Phonetics Conf. (FONETIK 1998) (KTH, Stockholm 1998)

C.Y. Espy-Wilson: Articulatory strategies, speech acoustics and variability. In: From sound to Sense: 50+ Years of Discoveries in Speech Communication , ed. by J. Slifka, S. Manuel, M. Mathies (MIT, Cambridge 2004)

J. Sundberg: Formant technique in a professional female singer, Acustica 32 , 89–96 (1975)

J. Sundberg, J. Skoog: Dependence of jaw opening on pitch and vowel in singers, J. Voice 11 , 301–306 (1997)

G. Fant: Glottal flow, models and interaction, J. Phon. 14 , 393–399 (1986)

E. Joliveau, J. Smith, J. Wolfe: Vocal tract resonances in singing: The soprano voice, J. Acoust. Soc. Am. 116 , 2434–2439 (2004)

R.K. Potter, A.G. Kopp, H.C. Green: Visible Speech (Van Nostrand, New York 1947)

M. Joos: Acoustic phonetics, Language 24 , 447–460 (1948), supplement 2

C.F. Hockett: A Manual of Phonology (Indiana Univ. Publ., Bloomington 1955)

F.H. Guenther: Speech sound acquisition, coarticulation, and rate effects in a neural network model of speech production, Psychol. Rev. 102 , 594–621 (1995)

R.D. Kent, B.S. Atal, J.L. Miller: Papers in Speech Communication: Speech Perception (Acoust. Soc. Am., New York 1991)

S.D. Goldinger, D.B. Pisoni, P. Luce: Speech perception and spoken word recognition. In: Principles of experimental phonetics , ed. by N.J. Lass (Mosby, St Louis 1996) pp. 277–327

H.M. Sussman, D. Fruchter, J. Hilbert, J. Sirosh: Linear correlates in the speech signal: The orderly output constraint, Behav. Brain Sci. 21 , 241–299 (1998)

B. Lindblom: Economy of speech gestures. In: The Production of Speech , ed. by P.F. MacNeilage (Springer, New York 1983) pp. 217–245

P.A. Keating, B. Lindblom, J. Lubker, J. Kreiman: Variability in jaw height for segments in English and Swedish VCVs, J. Phonetics 22 , 407–422 (1994)

K. Rapp: A study of syllable timing. In: STL/Quart. Prog. Status Rep. 1 (Royal Inst. of Technology, Stockholm 1971) pp. 14–19

F. Koopmans-van Beinum, J. Van der Stelt (Eds.): Early stages in the development of speech movements (Stockton, New York 1986)

K. Oller: Metaphonology and infant vocalizations. In: Precursors of early speech , ed. by B. Lindblom, R. Zetterström (Stockton, New York 1986) pp. 21–36

L. Roug, L. Landberg, L. Lundberg: Phonetic development in early infancy, J. Child Language 16 , 19–40 (1989)

R. Stark: Stages of speech development in the first year of life. In: Child Phonology: Volume 1: Production , ed. by G. Yeni-Komshian, J. Kavanagh, C. Ferguson (Academic, New York 1980) pp. 73–90

R. Stark: Prespeech segmental feature development. In: Language Acquisition , ed. by P. Fletcher, M. Garman (Cambridge UP, New York 1986) pp. 149–173

D.K. Oller, R.E. Eilers: The role of audition in infant babbling, Child Devel. 59 (2), 441–449 (1988)

C. Stoel-Gammon, D. Otomo: Babbling development of hearing-impaired and normally hearing subjects, J. Speech Hear. Dis. 51 , 33–41 (1986)

R.E. Eilers, D.K. Oller: Infant vocalizations and the early diagnosis of severe hearing impairment, J. Pediatr. 124 (2), 199–203 (1994)

D. Ertmer, J. Mellon: Beginning to talk at 20 months: Early vocal development in a young cochlear implant recipient, J. Speech Lang. Hear. Res. 44 , 192–206 (2001)

R.D. Kent, M.J. Osberger, R. Netsell, C.G. Hustedde: Phonetic development in identical twins who differ in auditory function, J. Speech Hear. Dis. 52 , 64–75 (1991)

M. Lynch, D. Oller, M. Steffens: Development of speech-like vocalizations in a child with congenital absence of cochleas: The case of total deafness, Appl. Psychol. 10 , 315–333 (1989)

C. Stoel-Gammon: Prelinguistic vocalizations of hearing-impaired and normally hearing subjects: A comparison of consonantal inventories, J. Speech Hear. Dis. 53 , 302–315 (1988)

P.F. MacNeilage, B.L. Davis: Acquisition of speech production: The achievement of segmental independence. In: Speech production and speech modeling , ed. by W.J. Hardcastle, A. Marchal (Dordrecht, Kluwer 1990) pp. 55–68

T. Houtgast, H.J.M. Steeneken: A review of the MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria, J. Acoust. Soc. Am. 77 , 1069–1077 (1985)

T. Houtgast, H.J.M. Steeneken: Past, Present and Future of the Speech Transmission Index , ed. by S.J. van Wijngaarden (NTO Human Factors, Soesterberg 2002)

R. Drullman, J.M. Festen, R. Plomp: Effect of temporal envelope smearing on speech reception, J. Acoust. Soc. Am. 95 , 1053–1064 (1994)

R. Drullman, J.M. Festen, R. Plomp: Effect of reducing slow temporal modulations on speech reception, J. Acoust. Soc. Am. 95 , 2670–2680 (1994)

J. Morton, S. Marcus, C. Frankish: Perceptual centers (P-centers), Psychol. Rev. 83 , 405–408 (1976)

S. Marcus: Acoustic determinants of perceptual center (P-center) location, Perception & Psychophysics 30 , 247–256 (1981)

G. Allen: The place of rhythm in a theory of language, UCLA Working Papers 10 , 60–84 (1968)

G. Allen: The location of rhythmic stress beats in English: An experimental study, UCLA Working Papers 14 , 80–132 (1970)

J. Eggermont: Location of the syllable beat in routine scansion recitations of a Dutch poem, IPO Annu. Prog. Rep. 4 , 60–69 (1969)

V.A. Kozhevnikov, L.A. Chistovich: Speech Articulation and Perception, JPRS 30 , 543 (1965)

C.E. Hoequist: The perceptual center and rhythm categories, Lang. Speech 26 , 367–376 (1983)

K.J. deJong: The correlation of P-center adjustments with articulatory and acoustic events, Perception Psychophys. 56 , 447–460 (1994)

A.D. Patel, A. Löfqvist, W. Naito: The acoustics and kinematics of regularly timed speech: a database and method for the study of the P-center problem. In: Proc. 14th Int. Congress of the Phonetic Sciences , ed. by J.J. Ohala (1999)

P. Howell: Prediction of P-centre location from the distribution of energy in the amplitude envelope: I & II, Perception Psychophys. 43 , 90–93, 99 (1988)

B. Pompino-Marschall: On the psychoacoustic nature of the P-center phenomenon, J. Phonetics 17 , 175–192 (1989)

C.A. Harsin: Perceptual-center modeling is affected by including acoustic rate-of-change modulations, Perception Psychophys. 59 , 243–251 (1997)

C.A. Fowler: Converging sources of evidence on spoken and perceived rhythms of speech: Cyclic production of vowels in monosyllabic stress feet, J. Exp. Psychol. Gen. 112 , 386–412 (1983)

H. Fujisaki: Dynamic characteristics of voice fundamental frequency in speech and singing. In: The Production of Speech , ed. by P.F. MacNeilage (Springer, New York 1983) pp. 39–55

J. Frid: Lexical and acoustic modelling of Swedish prosody, Dissertation (Lund University, Lund 2003)

S. Öhman: Numerical model of coarticulation, J. Acoust. Soc. Am. 41 , 310–320 (1967)

J. 't Hart: F0 stylization in speech: Straight lines versus parabolas, J. Acoust. Soc. Am. 90 (6), 3368–3370 (1991)

D. Abercrombie: Elements of General Phonetics (Edinburgh Univ. Press, Edinburgh 1967)

K.L. Pike: The Intonation of American English (Univ. of Michigan Press, Ann Arbor 1945)

G. Fant, A. Kruckenberg: Notes on stress and word accent in Swedish. In: STL/Quart. Prog. Status Rep. 2-3 (Royal Inst. of Technology, Stockholm 1994) pp. 125–144

R. Dauer: Stress timing and syllable-timing reanalyzed, J. Phonetics 11 , 51–62 (1983)

A. Eriksson: Aspects of Swedish rhythm, PhD thesis, Gothenburg Monographs in Linguistics (Gothenburg University, Gothenburg 1991)

O. Engstrand, D. Krull: Duration of syllable-sized units in casual and elaborated speech: cross-language observations on Swedish and Spanish, TMH-QPSR 44 , 69–72 (2002)

A.D. Patel, J.R. Daniele: An empirical comparison of rhythm in language and music, Cognition 87 , B35–B45 (2003)

D. Huron, J. Ollen: Agogic contrast in French and English themes: Further support for Patel and Daniele (2003), Music Perception 21 , 267–271 (2003)

D.H. Klatt: Synthesis by rule of segmental durations in English sentences. In: Frontiers of speech communication research , ed. by B. Lindblom, S. Öhman (Academic, London 1979) pp. 287–299

B. Lindblom: Final lengthening in speech and music. In: Nordic Prosody , ed. by E. Gårding, R. Bannert (Department of Linguistics Lund University, Lund 1978) pp. 85–102

A. Friberg, U. Battel: Structural communication. In: The Science and Psychology of Music Performance , ed. by R. Parncutt, G.E. McPherson (Oxford Univ., Oxford 2001) pp. 199–218

J. Sundberg: Emotive transforms, Phonetica 57 , 95–112 (2000)

Brownlee: The role of sentence stress in vowel reduction and formant undershoot: A study of lab speech and informal spontaneous speech, PhD thesis (University of Texas, Austin 1996)

S.-J. Moon: An acoustic and perceptual study of undershoot in clear and citation-form speech, PhD dissertation (Univ. of Texas, Austin 1991)

K.N. Stevens, A.S. House: Perturbation of vowel articulations by consonantal context. An acoustical study, JSHR 6 , 111–128 (1963)

B. Lindblom: Spectrographic study of vowel reduction, J. Acoust. Soc. Am. 35 , 1773–1781 (1963)

P. Delattre: An acoustic and articulatory study of vowel reduction in four languages, IRAL-Int. Ref. Appl. VII/ 4 , 295–325 (1969)

D.P. Kuehn, K.L. Moll: A cineradiographic study of VC and CV articulatory velocities, J. Phonetics 4 , 303–320 (1976)

J.E. Flege: Effects of speaking rate on tongue position and velocity of movement in vowel production, J. Acoust. Soc. Am. 84 (3), 901–916 (1988)

R.J.J.H. van Son, L.C.W. Pols: Formant movements of Dutch vowels in a text, read at normal and fast rate, J. Acoust. Soc. Am. 92 (1), 121–127 (1992)

D. van Bergem: Acoustic and Lexical Vowel Reduction, Doctoral Dissertation (University of Amsterdam, Amsterdam 1995)

W.L. Nelson, J.S. Perkell, J.R. Westbury: Mandible movements during increasingly rapid articulations of single syllables: Preliminary observations, J. Acoust. Soc. Am. 75 (3), 945–951 (1984)

S.-J. Moon, B. Lindblom: Interaction between duration, context and speaking style in English stressed vowels, J. Acoust. Soc. Am. 96 (1), 40–55 (1994)

C.S. Sherrington: Man on his nature (MacMillan, London 1986)

R. Granit: The Purposive Brain (MIT, Cambridge 1979)

N. Bernstein: The coordination and regulation of movements (Pergamon, Oxford 1967)

P.F. MacNeilage: Motor control of serial ordering of speech, Psychol. Rev. 77 , 182–196 (1970)

A. Löfqvist: Theories and Models of Speech Production. In: The Handbook of Phonetic Sciences , ed. by W.J. Hardcastle, J. Laver (Blackwell, Oxford 1997) pp. 405–426

J.S. Perkell: Articulatory processes. In: The Handbook of Phonetic Sciences , ed. by W.J. Hardcastle, J. Laver (Blackwell, Oxford 1997) pp. 333–370

J. Sundberg, R. Leandersson, C. von Euler, E. Knutsson: Influence of body posture and lung volume on subglottal pressure control during singing, J. Voice 5 , 283–291 (1991)

T. Sears, J. Newsom Davis: The control of respiratory muscles during voluntary breathing. In: Sound production in man , ed. by A. Bouhuys et al. (Annals of the New York Academy of Science, New York 1968) pp. 183–190

B. Lindblom, J. Lubker, T. Gay: Formant frequencies of some fixed-mandible vowels and a model of motor programming by predictive simulation, J. Phonetics 7 , 147–161 (1979)

T. Gay, B. Lindblom, J. Lubker: Production of bite-block vowels: Acoustic equivalence by selective compensation, J. Acoust. Soc. Am. 69 (3), 802–810 (1981)

W.J. Hardcastle, J. Laver (Eds.): The Handbook of Phonetic Sciences (Blackwell, Oxford 1997)

J.S. Perkell, D.H. Klatt: Invariance and variability in speech processes (LEA, Hillsdale 1986)

A. Liberman, I. Mattingly: The motor theory of speech perception revised, Cognition 21 , 1–36 (1985)

C.A. Fowler: An event approach to the study of speech perception from a direct-realist perspective, J. Phon. 14 (1), 3–28 (1986)

E.L. Saltzman, K.G. Munhall: A dynamical approach to gestural patterning in speech production, Ecol. Psychol. 1 , 91–163 (1989)

M. Studdert-Kennedy: How did language go discrete? In: Evolutionary Prerequisites of Language , ed. by M. Tallerman (Oxford Univ., Oxford 2005) pp. 47–68

R. Jakobson, G. Fant, M. Halle: Preliminaries to Speech Analysis, Acoustics Laboratory, MIT Tech. Rep. No. 13 (MIT, Cambridge 1952)

B. Lindblom: Explaining phonetic variation: A sketch of the H&H theory. In: Speech Production and Speech Modeling , ed. by W.J. Hardcastle, A. Marchal (Dordrecht, Kluwer 1990) pp. 403–439

B. Lindblom: Role of articulation in speech perception: Clues from production, J. Acoust. Soc. Am. 99 (3), 1683–1692 (1996)

E. Rapoport: Emotional expression code in opera and lied singing, J. New Music Res. 25 , 109–149 (1996)

J. Sundberg, E. Prame, J. Iwarsson: Replicability and accuracy of pitch patterns in professional singers. In: Vocal Fold Physiology, Controlling Complexity and Chaos , ed. by P. Davis, N. Fletcher (Singular, San Diego 1996) pp. 291–306, Chap. 20

J.J. Ohala: An ethological perspective on common cross-language utilization of F0 of voice, Phonetica 41 , 1–16 (1984)

I. Fónagy: Hörbare Mimik, Phonetica 1 , 25–35 (1967)

K. Scherer: Expression of emotion in voice and music, J. Voice 9 , 235–248 (1995)

P. Juslin, P. Laukka: Communication of emotions in vocal expression and music performance: Different channels, same code?, Psychol. Bull. 129 , 770–814 (2003)

J. Sundberg, J. Iwarsson, H. Hagegård: A singer's expression of emotions in sung performance. In: Vocal Fold Physiology: Voice Quality Control , ed. by O. Fujimura, M. Hirano (Singular, San Diego 1995) pp. 217–229

Author information

Authors and affiliations.

Department of Linguistics, Stockholm University, 10691, Stockholm, Sweden

Björn Lindblom Prof.

Department of Speech, Music, and Hearing, KTH–Royal Institute of Technology, SE-10044, Stockholm, Sweden

Johan Sundberg

Corresponding authors

Correspondence to Björn Lindblom Prof. or Johan Sundberg .

Editor information

Editors and affiliations.

Center for Computer Research in Music and Acoustics, Stanford University, 94305, Stanford, CA, USA

Thomas D. Rossing Prof.

Copyright information

© 2007 Springer Science+Business Media, LLC New York

About this entry

Cite this entry.

Lindblom, B., Sundberg, J. (2007). The Human Voice in Speech and Singing. In: Rossing, T. (eds) Springer Handbook of Acoustics. Springer Handbooks. Springer, New York, NY. https://doi.org/10.1007/978-0-387-30425-0_16

DOI : https://doi.org/10.1007/978-0-387-30425-0_16

Publisher Name : Springer, New York, NY

Print ISBN : 978-0-387-30446-5

Online ISBN : 978-0-387-30425-0

eBook Packages : Physics and Astronomy; Reference Module Physical and Materials Science; Reference Module Chemistry, Materials and Physics

COMMENTS

  1. The power of ‘voice,’ and empowering the voiceless

    Many people use their voices everyday—to talk to people, to communicate their needs and wants—but the idea of ‘voice’ goes much deeper. Having a voice gives an individual agency and power, and a way to express his or her beliefs. But what happens when that voice is in some way silenced?

  2. Don’t Underestimate the Power of Your Voice

    Summary. Our voices matter as much as our words matter. They have the power to awaken the senses and lead others to act, close deals, or land us successful job interviews. Through our voices, we...

  3. The Power of the Human Voice - NUHA Foundation

    The human voice is able to infuse words with shades of deeper meaning because that power of speech can unearth the real intentions, mood, character, identity and culture of the speaker in question. It is easy for a person to write down something and mislead his or her audience or the entire world.

  4. The Power of Using your voice - Voices of Youth

    A voice is a tool that can be used for standing up for what is right, rather than what is easy. A voice gives your opinions a platform, and gifts you with the opportunity to have perspective and knowledge on things that matter. No two voices are the same, each voice has something different to say.

  5. Understanding Voice Production - THE VOICE FOUNDATION

    The human voice can be modified in many ways. Consider the spectrum of sounds – whispering, speaking, orating, shouting – as well as the different sounds that are possible in different forms of vocal music, such as rock singing, gospel singing, and opera singing.

  6. Human Voices Are Unique but We're Not That Good at ...

    Human Voices Are Unique but We're Not That Good at Recognizing Them. People are good at picking out voices of familiar people’s speech but ear-witness testimonies of strangers’ voices...

  7. Mechanics of human voice production and control - PMC

    This paper provides a review of voice physiology and biomechanics, the physics of vocal fold vibration and sound production, and laryngeal muscular control of the fundamental frequency of voice, vocal intensity, and voice quality.

  8. The Human Voice in Speech and Singing | SpringerLink

    This chapter describes various aspects of the human voice as a means of communication in speech and singing. From the point of view of function, vocal sounds can be regarded as the...

  9. Beyond speech: Exploring diversity in the human voice

    Our systematic and large-scale acoustic comparison confirmed what is intuitively familiar: speech, singing, and nonverbal vocalizations are three very different ways of using the human voice.

  10. The human voice, in all its idiosyncratic glory | Science - AAAS

    In his new book, This Is the Voice, journalist John Colapinto creates a compelling narrative surrounding this vast and complex topic and investigates what makes the voice uniquely human. Colapinto creatively describes the structure of the book as “a little like the vocal signal itself.”