
Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text, is a capability that enables a program to process human speech into a written format.

While speech recognition is commonly confused with voice recognition, the two are distinct: speech recognition translates speech from a verbal format into text, whereas voice recognition seeks only to identify an individual user’s voice.

IBM has had a prominent role within speech recognition since its inception, releasing “Shoebox” in 1962. This machine could recognize 16 different words, advancing the initial work from Bell Labs in the 1950s. IBM didn’t stop there; it continued to innovate over the years, launching the VoiceType Simply Speaking application in 1996. This speech recognition software had a 42,000-word vocabulary, supported English and Spanish, and included a spelling dictionary of 100,000 words.

While speech technology had a limited vocabulary in the early days, it is utilized in a wide range of industries today, such as automotive, technology, and healthcare. Its adoption has only continued to accelerate in recent years due to advancements in deep learning and big data. Research (link resides outside ibm.com) shows that this market is expected to be worth USD 24.9 billion by 2025.



Many speech recognition applications and devices are available, but the more advanced solutions use AI and machine learning. They integrate grammar, syntax, structure, and composition of audio and voice signals to understand and process human speech. Ideally, they learn as they go — evolving responses with each interaction.

The best systems also allow organizations to customize and adapt the technology to their specific requirements — everything from language and nuances of speech to brand recognition. For example:

  • Language weighting: Improve precision by weighting specific words that are spoken frequently (such as product names or industry jargon), beyond terms already in the base vocabulary.
  • Speaker labeling: Output a transcription that cites or tags each speaker’s contributions to a multi-participant conversation.
  • Acoustics training: Attend to the acoustical side of the business. Train the system to adapt to an acoustic environment (like the ambient noise in a call center) and speaker styles (like voice pitch, volume and pace).
  • Profanity filtering: Use filters to identify certain words or phrases and sanitize speech output.

Meanwhile, speech recognition continues to advance. Companies like IBM are making inroads in several areas to improve human and machine interaction.

The vagaries of human speech have made development challenging. It’s considered to be one of the most complex areas of computer science – involving linguistics, mathematics and statistics. Speech recognizers are made up of a few components, such as the speech input, feature extraction, feature vectors, a decoder, and a word output. The decoder leverages acoustic models, a pronunciation dictionary, and language models to determine the appropriate output.

Speech recognition technology is evaluated on its accuracy, typically measured as word error rate (WER), and its speed. A number of factors can impact word error rate, such as pronunciation, accent, pitch, volume, and background noise. Reaching human parity – meaning an error rate on par with that of two humans speaking – has long been the goal of speech recognition systems. Research from Lippmann (link resides outside ibm.com) estimates the human word error rate to be around 4 percent, but it’s been difficult to replicate the results from this paper.

Various algorithms and computation techniques are used to convert speech into text and improve the accuracy of transcription. Below are brief explanations of some of the most commonly used methods:

  • Natural language processing (NLP): While NLP isn’t necessarily a specific algorithm used in speech recognition, it is the area of artificial intelligence that focuses on the interaction between humans and machines through language, both spoken and written. Many mobile devices incorporate speech recognition into their systems to conduct voice search—e.g. Siri—or provide more accessibility around texting.
  • Hidden Markov models (HMM): Hidden Markov models build on the Markov chain model, which stipulates that the probability of a given state depends only on the current state, not on prior states. While a Markov chain model is useful for observable events, such as text inputs, hidden Markov models allow us to incorporate hidden events, such as part-of-speech tags, into a probabilistic model. They are used as sequence models within speech recognition, assigning labels to each unit—i.e. words, syllables, sentences, etc.—in the sequence. These labels create a mapping with the provided input, allowing the model to determine the most appropriate label sequence.
  • N-grams: This is the simplest type of language model (LM), which assigns probabilities to sentences or phrases. An N-gram is a sequence of N words. For example, “order the pizza” is a trigram or 3-gram and “please order the pizza” is a 4-gram. Grammar and the probability of certain word sequences are used to improve recognition accuracy; a minimal bigram example appears after this list.
  • Neural networks: Primarily leveraged for deep learning algorithms, neural networks process training data by mimicking the interconnectivity of the human brain through layers of nodes. Each node is made up of inputs, weights, a bias (or threshold) and an output. If that output value exceeds a given threshold, it “fires” or activates the node, passing data to the next layer in the network. Neural networks learn this mapping function through supervised learning, adjusting based on the loss function through the process of gradient descent.  While neural networks tend to be more accurate and can accept more data, this comes at a performance efficiency cost as they tend to be slower to train compared to traditional language models.
  • Speaker diarization (SD): Speaker diarization algorithms identify and segment speech by speaker identity. This helps programs better distinguish individuals in a conversation and is frequently applied in call centers to distinguish customers from sales agents.
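To make the N-gram idea concrete, here is a minimal, self-contained sketch of a bigram (2-gram) language model built from a toy corpus; the corpus and sentences are illustrative assumptions, not data from any real system.

```python
from collections import Counter, defaultdict

# Toy corpus; a real language model would be estimated from millions of sentences.
corpus = [
    "please order the pizza",
    "please order the salad",
    "order the pizza now",
]

bigram_counts = defaultdict(Counter)
context_counts = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()          # <s> marks the start of a sentence
    for prev, word in zip(words, words[1:]):
        bigram_counts[prev][word] += 1
        context_counts[prev] += 1

def bigram_probability(sentence: str) -> float:
    """P(sentence) under a maximum-likelihood bigram model (no smoothing)."""
    words = ["<s>"] + sentence.split()
    probability = 1.0
    for prev, word in zip(words, words[1:]):
        probability *= bigram_counts[prev][word] / context_counts[prev]
    return probability

print(bigram_probability("please order the pizza"))  # seen word sequence -> higher probability
print(bigram_probability("please order the two"))    # unseen bigram "the two" -> probability 0.0
```

When the acoustics are ambiguous, a recognizer would prefer the hypothesis to which the language model assigns the higher probability.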

A wide range of industries are utilizing different applications of speech technology today, helping businesses and consumers save time and even lives. Some examples include:

Automotive: Speech recognizers improve driver safety by enabling voice-activated navigation systems and search capabilities in car radios.

Technology: Virtual agents are increasingly becoming integrated within our daily lives, particularly on our mobile devices. We use voice commands to access them through our smartphones, such as through Google Assistant or Apple’s Siri, for tasks, such as voice search, or through our speakers, via Amazon’s Alexa or Microsoft’s Cortana, to play music. They’ll only continue to integrate into the everyday products that we use, fueling the “Internet of Things” movement.

Healthcare: Doctors and nurses leverage dictation applications to capture and log patient diagnoses and treatment notes.

Sales: Speech recognition technology has a couple of applications in sales. It can help a call center transcribe thousands of phone calls between customers and agents to identify common call patterns and issues. AI chatbots can also talk to people via a webpage, answering common queries and solving basic requests without needing to wait for a contact center agent to be available. In both instances, speech recognition systems help reduce time to resolution for consumer issues.

Security: As technology integrates into our daily lives, security protocols are an increasing priority. Voice-based authentication adds a viable level of security.



Speech Recognition: Everything You Need to Know in 2024


Speech recognition, also known as automatic speech recognition (ASR), enables seamless communication between humans and machines. This technology empowers organizations to transform human speech into written text. Speech recognition technology can revolutionize many business applications, including customer service, healthcare, finance and sales.

In this comprehensive guide, we will explain speech recognition, exploring how it works, the algorithms involved, and its use cases across various industries.

If you require training data for your speech recognition system, here is a guide to finding the right speech data collection services.

What is speech recognition?

Speech recognition, also known as automatic speech recognition (ASR), speech-to-text (STT), and computer speech recognition, is a technology that enables a computer to recognize and convert spoken language into text.

Speech recognition technology uses AI and machine learning models to accurately identify and transcribe different accents, dialects, and speech patterns.

What are the features of speech recognition systems?

Speech recognition systems have several components that work together to understand and process human speech. Key features of effective speech recognition are:

  • Audio preprocessing: After you have obtained the raw audio signal from an input device, you need to preprocess it to improve the quality of the speech input. The main goal of audio preprocessing is to capture relevant speech data by removing any unwanted artifacts and reducing noise.
  • Feature extraction: This stage converts the preprocessed audio signal into a more informative representation, which makes raw audio data more manageable for machine learning models in speech recognition systems (a minimal example follows this list).
  • Language model weighting: Language weighting gives more weight to certain words and phrases, such as product references, in audio and voice signals. This makes those keywords more likely to be recognized correctly in subsequent speech.
  • Acoustic modeling: It enables speech recognizers to capture and distinguish phonetic units within a speech signal. Acoustic models are trained on large datasets containing speech samples from a diverse set of speakers with different accents, speaking styles, and backgrounds.
  • Speaker labeling: It enables speech recognition applications to determine the identities of multiple speakers in an audio recording. It assigns unique labels to each speaker in an audio recording, allowing the identification of which speaker was speaking at any given time.
  • Profanity filtering: The process of removing offensive, inappropriate, or explicit words or phrases from audio data.
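As a concrete illustration of the audio preprocessing and feature extraction stages above, here is a minimal sketch using the open-source librosa library; the file name and parameter choices are illustrative assumptions rather than part of any particular product.

```python
# pip install librosa
import librosa

# Load a (hypothetical) preprocessed recording and resample it to 16 kHz mono.
signal, sample_rate = librosa.load("cleaned_call.wav", sr=16_000)

# Mel-frequency cepstral coefficients (MFCCs) are a classic speech feature:
# a compact description of the short-term spectral envelope of the signal.
mfccs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13)

print(mfccs.shape)  # (13 coefficients, number of frames)
```

Feature vectors like these, rather than the raw waveform, are what the acoustic model actually consumes.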

What are the different speech recognition algorithms?

Speech recognition uses various algorithms and computation techniques to convert spoken language into written language. The following are some of the most commonly used speech recognition methods:

  • Hidden Markov Models (HMMs): A hidden Markov model is a statistical Markov model commonly used in traditional speech recognition systems. HMMs capture the relationship between acoustic features and model the temporal dynamics of speech signals.
  • Language modeling: Working alongside the acoustic model and pronunciation dictionary, language models are used to:
      • Estimate the probability of word sequences in the recognized text
      • Convert colloquial expressions and abbreviations in spoken language into a standard written form
      • Map phonetic units obtained from acoustic models to their corresponding words in the target language.
  • Speaker Diarization (SD): Speaker diarization, or speaker labeling, is the process of identifying and attributing speech segments to their respective speakers (Figure 1). It allows for speaker-specific voice recognition and the identification of individuals in a conversation.

Figure 1: A flowchart illustrating the speaker diarization process

The image describes the process of speaker diarization, where multiple speakers in an audio recording are segmented and identified.

  • Dynamic Time Warping (DTW): Speech recognition algorithms use the Dynamic Time Warping (DTW) algorithm to find an optimal alignment between two sequences, even when they differ in speed or length (Figure 2).

Figure 2: A speech recognizer using dynamic time warping to determine the optimal distance between elements

Dynamic time warping is a technique used in speech recognition to determine the optimum distance between the elements.
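The following is a minimal sketch of the dynamic-programming recurrence behind DTW, applied to two short one-dimensional sequences; the example sequences are made up for illustration.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Dynamic Time Warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])              # distance between elements
            cost[i, j] = local + min(cost[i - 1, j],      # insertion
                                     cost[i, j - 1],      # deletion
                                     cost[i - 1, j - 1])  # match
    return float(cost[n, m])

# The "same word" spoken slowly and quickly still aligns with zero DTW distance.
slow = np.array([1.0, 1.0, 2.0, 3.0, 3.0, 2.0])
fast = np.array([1.0, 2.0, 3.0, 2.0])
print(dtw_distance(slow, fast))  # 0.0: the sequences differ only in timing
```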

  • Deep neural networks: Neural networks process and transform input data by simulating the non-linear frequency perception of the human auditory system.

  • Connectionist Temporal Classification (CTC): CTC is a training objective introduced by Alex Graves in 2006. It is especially useful for sequence labeling tasks and end-to-end speech recognition systems because it allows the neural network to discover the relationship between input frames and align them with output labels.

Speech recognition vs voice recognition

Speech recognition is commonly confused with voice recognition, yet they refer to distinct concepts. Speech recognition converts spoken words into written text, focusing on identifying the words and sentences spoken by a user, regardless of the speaker’s identity.

On the other hand, voice recognition is concerned with recognizing or verifying a speaker’s voice, aiming to determine the identity of an unknown speaker rather than focusing on understanding the content of the speech.

What are the challenges of speech recognition with solutions?

While speech recognition technology offers many benefits, it still faces a number of challenges that need to be addressed. Some of the main limitations of speech recognition include:

Acoustic Challenges:

  • Assume a speech recognition model has been primarily trained on American English accents. If a speaker with a strong Scottish accent uses the system, they may encounter difficulties due to pronunciation differences. For example, the word “water” is pronounced differently in both accents. If the system is not familiar with this pronunciation, it may struggle to recognize the word “water.”

Solution: Addressing these challenges is crucial to enhancing  speech recognition applications’ accuracy. To overcome pronunciation variations, it is essential to expand the training data to include samples from speakers with diverse accents. This approach helps the system recognize and understand a broader range of speech patterns.

  • For instance, you can use data augmentation techniques to reduce the impact of noise on audio data. Data augmentation helps train speech recognition models with noisy data to improve model accuracy in real-world environments.

Figure 3: Examples of a target sentence (“The clown had a funny face”) in the background noise of babble, car and rain.

Background noise makes it difficult for speech recognition software to distinguish speech from other sounds.

Linguistic Challenges:

  • Out-of-vocabulary (OOV) words: Since the speech recognition model has not been trained on OOV words, it may misrecognize them as different words or fail to transcribe them when encountering them.

Figure 4: An example of detecting an OOV word


Solution: Word Error Rate (WER) is a common metric used to measure the accuracy of a speech recognition or machine translation system. It is computed as the number of substitutions (S), deletions (D), and insertions (I) needed to turn the system output into the reference transcript, divided by the number of words (N) in the reference: WER = (S + D + I) / N.

Figure 5: Demonstrating how to calculate word error rate (WER)

Word Error Rate (WER) is a metric used to evaluate the performance and accuracy of speech recognition systems.
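For illustration, here is a minimal sketch of computing WER with a word-level edit distance; the reference and hypothesis sentences reuse the example sentence from Figure 3 and are otherwise arbitrary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                 # deletion
                          d[i][j - 1] + 1,                 # insertion
                          d[i - 1][j - 1] + substitution)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the clown had a funny face",
                      "the clown had a sunny face"))  # 1 error / 6 words ≈ 0.17
```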

  • Homophones: Homophones are words that are pronounced identically but have different meanings, such as “to,” “too,” and “two”. Solution: Semantic analysis allows speech recognition programs to select the appropriate homophone based on its intended meaning in a given context. Addressing homophones improves the ability of the speech recognition process to understand and transcribe spoken words accurately.

Technical/System Challenges:

  • Data privacy and security: Speech recognition systems involve processing and storing sensitive and personal information, such as financial information. An unauthorized party could use the captured information, leading to privacy breaches.

Solution: You can encrypt sensitive and personal audio information transmitted between the user’s device and the speech recognition software. Another technique for addressing data privacy and security in speech recognition systems is data masking. Data masking algorithms mask and replace sensitive speech data with structurally identical but acoustically different data.

Figure 6: An example of how data masking works

Data masking protects sensitive or confidential audio information in speech recognition applications by replacing or encrypting the original audio data.

  • Limited training data: Limited training data directly impacts the performance of speech recognition software. With insufficient training data, the speech recognition model may struggle to generalize across different accents or recognize less common words.

Solution: To improve the quality and quantity of training data, you can expand the existing dataset using data augmentation and synthetic data generation technologies.

13 speech recognition use cases and applications

In this section, we will explain how speech recognition revolutionizes the communication landscape across industries and changes the way businesses interact with machines.

Customer Service and Support

  • Interactive Voice Response (IVR) systems: Interactive voice response (IVR) is a technology that automates the process of routing callers to the appropriate department. It understands customer queries and routes calls to the relevant departments. This reduces the call volume for contact centers and minimizes wait times. IVR systems address simple customer questions without human intervention by employing pre-recorded messages or text-to-speech technology. Automatic Speech Recognition (ASR) allows IVR systems to comprehend and respond to customer inquiries and complaints in real time.
  • Customer support automation and chatbots: According to a survey, 78% of consumers interacted with a chatbot in 2022, but 80% of respondents said using chatbots increased their frustration level.
  • Sentiment analysis and call monitoring: Speech recognition technology converts spoken content from a call into text. After  speech-to-text processing, natural language processing (NLP) techniques analyze the text and assign a sentiment score to the conversation, such as positive, negative, or neutral. By integrating speech recognition with sentiment analysis, organizations can address issues early on and gain valuable insights into customer preferences.
  • Multilingual support: Speech recognition software can be trained in various languages to recognize and transcribe the language spoken by a user accurately. By integrating speech recognition technology into chatbots and Interactive Voice Response (IVR) systems, organizations can overcome language barriers and reach a global audience (Figure 7). Multilingual chatbots and IVR automatically detect the language spoken by a user and switch to the appropriate language model.

Figure 7: Showing how a multilingual chatbot recognizes words in another language


  • Customer authentication with voice biometrics: Voice biometrics use speech recognition technologies to analyze a speaker’s voice and extract features such as accent and speed to verify their identity.

Sales and Marketing:

  • Virtual sales assistants: Virtual sales assistants are AI-powered chatbots that assist customers with purchasing and communicate with them through voice interactions. Speech recognition allows virtual sales assistants to understand the intent behind spoken language and tailor their responses based on customer preferences.
  • Transcription services: Speech recognition software records audio from sales calls and meetings and then converts the spoken words into written text using speech-to-text algorithms.

Automotive:

  • Voice-activated controls: Voice-activated controls allow users to interact with devices and applications using voice commands. Drivers can operate features like climate control, phone calls, or navigation systems.
  • Voice-assisted navigation: Voice-assisted navigation provides real-time voice-guided directions by utilizing the driver’s voice input for the destination. Drivers can request real-time traffic updates or search for nearby points of interest using voice commands without physical controls.

Healthcare:

  • Medical dictation and transcription: Speech recognition streamlines clinical documentation. The typical workflow involves:
      • Recording the physician’s dictation
      • Transcribing the audio recording into written text using speech recognition technology
      • Editing the transcribed text for better accuracy and correcting errors as needed
      • Formatting the document in accordance with legal and medical requirements.
  • Virtual medical assistants: Virtual medical assistants (VMAs) use speech recognition, natural language processing, and machine learning algorithms to communicate with patients through voice or text. Speech recognition software allows VMAs to respond to voice commands, retrieve information from electronic health records (EHRs) and automate the medical transcription process.
  • Electronic Health Records (EHR) integration: Healthcare professionals can use voice commands to navigate the EHR system, access patient data, and enter data into specific fields.

Technology:

  • Virtual agents: Virtual agents utilize natural language processing (NLP) and speech recognition technologies to understand spoken language and convert it into text. Speech recognition enables virtual agents to process spoken language in real-time and respond promptly and accurately to user voice commands.

Further reading

  • Top 5 Speech Recognition Data Collection Methods in 2023
  • Top 11 Speech Recognition Applications in 2023



Speech Recognition (Papers With Code)

1085 papers with code • 315 benchmarks • 87 datasets

Speech Recognition is the task of converting spoken language into text. It involves recognizing the words spoken in an audio recording and transcribing them into a written format. The goal is to accurately transcribe the speech in real-time or from recorded audio, taking into account factors such as accents, speaking speed, and background noise.

(Image credit: SpecAugment)



Most implemented papers

Listen, Attend and Spell


Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly.

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin


We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages.

Communication-Efficient Learning of Deep Networks from Decentralized Data

Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device.

Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition

Describes an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model.

Deep Speech: Scaling up end-to-end speech recognition

We present a state-of-the-art speech recognition system developed using end-to-end deep learning.

Conformer: Convolution-augmented Transformer for Speech Recognition

Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs).

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.

Recurrent Neural Network Regularization


We present a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units.

Split Computing and Early Exiting for Deep Learning Applications: Survey and Research Challenges

Mobile devices such as smartphones and autonomous vehicles increasingly rely on deep neural networks (DNNs) to execute complex inference tasks such as image classification and speech recognition, among others.


The SpeakWrite Blog

Ultimate guide to speech recognition technology (2023).

  • April 12, 2023

Learn about speech recognition technology—how speech to text software works, benefits, limitations, transcriptions, and other real world applications.


Whether you’re a professional in need of more efficient transcription solutions or simply want your voice-enabled device to work smarter for you, this guide to speech recognition technology is here with all the answers.

Few technologies have evolved as rapidly in recent years as speech recognition. In just the last decade, speech recognition has become something we rely on daily. From voice texting to Amazon Alexa understanding natural language queries, it’s hard to imagine life without speech recognition software.

But before deep learning was ever a word people knew, mid-century engineers were paving the path for today’s rapidly advancing world of automatic speech recognition. So let’s take a look at how speech recognition technologies evolved and how speech-to-text became king.

What Is Speech Recognition Technology?

With machine intelligence and deep learning advances, speech recognition technology has become increasingly popular. Simply put, speech recognition technology (otherwise known as speech-to-text or automatic speech recognition) is software that can convert the sound waves of spoken human language into readable text. These programs match sounds to word sequences through a series of steps that include the following (a short code example follows the list):

  • Pre-processing: improves the quality of the speech input by reducing and filtering noise, which lowers the error rate.
  • Feature extraction: sound waves and acoustic signals are transformed into digital features suitable for processing by specialized speech technologies.
  • Classification: the extracted features are used to find the spoken text; machine learning techniques can refine this process.
  • Language modeling: considers important semantic and grammatical rules of a language while creating text.
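One way to see the first three steps working end to end is to run an off-the-shelf pretrained model. The sketch below uses the open-source wav2vec 2.0 model (referenced elsewhere in this document) through the Hugging Face transformers library; the audio file name is a placeholder, and this is only an illustration, not how any particular product implements its pipeline. A separate language model could rescore the output in a production system.

```python
# pip install transformers torch librosa
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Pre-processing: load the recording as 16 kHz mono audio.
speech, _ = librosa.load("sample.wav", sr=16_000)

# Feature extraction: convert the waveform into model-ready input tensors.
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

# Classification: the network scores every character for every audio frame.
with torch.no_grad():
    logits = model(**inputs).logits

# Decoding: pick the best character per frame and collapse repeats/blanks into text.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```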

How Does Speech Recognition Technology Work?

Speech recognition technology combines complex algorithms and language models to produce word output humans can understand. Features such as frequency, pitch, and loudness can then be used to recognize spoken words and phrases.

Here are some of the most common models used in speech recognition, including acoustic models and language models. Sometimes, several of these are interconnected and work together to create higher-quality speech recognition software and applications.

Natural Language Processing (NLP)

“Hey, Siri, how does speech-to-text work?”

Try it—you’ll likely hear your digital assistant read a sentence or two from a relevant article she finds online, all thanks to the magic of natural language processing.

Natural language processing is the artificial intelligence that gives machines like Siri the ability to understand and answer human questions. These AI systems enable devices to understand what humans are saying, including everything from intent to parts of speech.

But NLP is used by more than just digital assistants like Siri or Alexa—it’s how your inbox knows which spam messages to filter, how search engines know which websites to offer in response to a query, and how your phone knows which words to autocomplete.

Neural Networks

Neural networks are one of the most powerful AI applications in speech recognition. They’re used to recognize patterns and process large amounts of data quickly.

For example, neural networks can learn from past input to better understand what words or phrases you might use in a conversation. They use those patterns to more accurately detect the words you’re saying.

Leveraging cutting-edge deep learning algorithms, neural networks are revolutionizing how machines recognize speech commands. By imitating neurons in our brains and creating intricate webs of electrochemical connections between them, these robust architectures can process data with unparalleled accuracy for various applications such as automatic speech recognition.

Hidden Markov Models (HMM)

The Hidden Markov Model is a powerful tool for acoustic modeling, providing strong analytical capabilities to accurately detect natural speech. Its application in the field of Natural Language Processing has allowed researchers to efficiently train machines on word generation tasks, acoustics, and syntax to create unified probabilistic models.

Speaker Diarization

Speaker diarization is an innovative process that segments audio streams into distinguishable speakers, allowing the automatic speech recognition transcript to organize each speaker’s contributions separately. Using unique sound qualities and word patterns, this technique pinpoints conversations accurately so every voice can be heard.

The History of Speech Recognition Technology

It’s hard to believe that just a few short decades ago, the idea of having a computer respond to speech felt like something straight out of science fiction. Yet fast-forward to today, and voice-recognition technology has gone from an obscure concept to something so commonplace that you can find it in your smartphone.

But where did this all start? First, let’s take a look at the history of speech recognition technology – from its uncertain early days through its evolution into today’s easy-to-use technology.

Speech recognition technology has existed since the 1950s when Bell Laboratory researchers first developed systems to recognize simple commands. However, early speech recognition systems were limited in their capabilities and could not identify more complex phrases or sentences.

In the 1980s, advances in computing power enabled the development of better speech recognition systems that could understand entire sentences. Today, speech recognition technology has become much more advanced, with some systems able to recognize multiple languages and dialects with high accuracy.

Timeline of Speech Recognition Programs

  • 1952 – Bell Labs researchers created “Audrey,” an innovative system for recognizing individual digits.
  • 1962 – IBM shook the tech sphere in 1962 at The World’s Fair, showcasing a remarkable 16-word speech recognition capability – nicknamed “Shoebox” —that left onlookers awestruck.
  • 1980s – IBM revolutionized the typewriting industry with Tangora, a voice-activated system that could understand up to 20,000 words.
  • 1996 – IBM’s VoiceType Simply Speaking application recognized 42,000 English and Spanish words.
  • 2007 – Google launched GOOG-411 as a telephone directory service, an endeavor that provided immense amounts of data for improving speech recognition systems over time. Now, this technology is available across 30 languages through Google Voice Search .
  • 2017 – Microsoft made history when its research team achieved the remarkable goal of transcribing phone conversations utilizing various deep-learning models.

How is Speech Recognition Used Today?

Speech recognition technology has come a long way since its inception at Bell Laboratories.

Today, speech recognition technology has become much more advanced, with some systems able to recognize multiple languages and dialects with high accuracy and low error rates.

Speech recognition technology is used in a wide range of applications in our daily lives, including:

  • Voice Texting: Voice texting is a popular feature on many smartphones that allows users to compose text messages without typing.
  • Smart Home Automation: Smart home systems use voice commands technology to control lights, thermostats, and other household appliances with simple commands.
  • Voice Search: Voice search is one of the most popular applications of speech recognition, as it allows users to quickly find information by speaking a query instead of typing it.
  • Transcription: Speech recognition technology can transcribe spoken words into text fast.
  • Military and Civilian Vehicle Systems: Speech recognition technology can be used to control unmanned aerial vehicles, military drones, and other autonomous vehicles.
  • Medical Documentation: Speech recognition technology is used to quickly and accurately transcribe medical notes, making it easier for doctors to document patient visits.

Key Features of Advanced Speech Recognition Programs

If you’re looking for speech recognition technology with exceptional accuracy that can do more than transcribe phonetic sounds, be sure it includes these features.

Acoustic training

Advanced speech recognition programs use acoustic training models to detect natural language patterns and better understand the speaker’s intent. In addition, acoustic training can teach AI systems to tune out ambient noise, such as the background noise of other voices.

Speaker labeling

Speaker labeling is a feature that allows speech recognition systems to differentiate between multiple speakers, even if they are speaking in the same language. This technology can help keep track of who said what during meetings and conferences, eliminating the need for manual transcription.

Dictionary customization

Advanced speech recognition programs allow users to customize their own dictionaries and include specialized terminology to improve accuracy. This can be especially useful for medical professionals who need accurate documentation of patient visits.

Profanity filtering

If you don’t want your transcript to include any naughty words, then you’ll want to make sure your speech recognition system includes a filtering feature. Filtering allows users to specify which words should be filtered out of their transcripts, ensuring that they are clean and professional.

Language weighting

Language weighting is a feature used by advanced speech recognition systems to prioritize certain commonly used words over others. For example, this feature can be helpful when there are two similar words, such as “form” and “from,” so the system knows which one is being spoken.

The Benefits of Speech Recognition Technology

Human speech recognition technology has revolutionized how people navigate, purchase, and communicate. Additionally, speech-to-text technology provides a vital bridge to communication for individuals with sight and auditory disabilities. Innovations like screen readers, text-to-speech dictation systems, and audio transcriptions help make the world more accessible to those who need it most.

Limits of Speech Recognition Programs

Despite its advantages, speech recognition technology still needs to be improved.

  • Accuracy rate and reliability – the quality of the audio signal and the complexity of the language being spoken can significantly impact the system’s ability to accurately interpret spoken words. For now, speech-to-text technology has a higher average error rate than humans.
  • Formatting – Exporting speech recognition results into a readable format, such as Word or Excel, can be difficult and time-consuming—especially if you must adhere to professional formatting standards.
  • Ambient noise – Speech recognition systems are still incapable of reliably recognizing speech in noisy environments. If you plan on recording yourself and turning it into a transcript later, make sure the environment is quiet and free from distractions.
  • Translation – Human speech and language are difficult to translate word for word, as things like syntax, context, and cultural differences can lead to subtle meanings that are lost in direct speech-to-text translations.
  • Security – While speech recognition systems are great for controlling devices, you don’t always have control over how your data is stored and used once recorded.

Using Speech Recognition for Transcriptions

Speech recognition technology is commonly used to transcribe audio recordings into text documents and has become a standard tool in business and law enforcement. There are handy apps like Otter.ai that can help you quickly and accurately transcribe and summarize meetings, as well as speech-to-text features embedded in document processors like Word.

However, you should use speech recognition technology for transcriptions with caution because there are a number of limitations that could lead to costly mistakes.

If you’re creating an important legal document or professional transcription , relying on speech recognition technology or any artificial intelligence to provide accurate results is not recommended. Instead, it’s best to employ a professional transcription service or hire an experienced typist to accurately transcribe audio recordings.

Human typists have an accuracy level of 99% – 100%, can follow dictation instructions, and can format your transcript appropriately depending on your instructions. As a result, there is no need for additional editing once your document is delivered (usually in 3 hours or less), and you can put your document to use immediately.

Unfortunately, speech recognition technology can’t achieve these things yet. You can expect an accuracy of up to 80% and little to no professional formatting. Additionally, your dictation instructions will fall on deaf “ears.” Frustratingly, they’ll just be included in the transcription rather than followed to a T. You’ll wind up spending extra time editing your transcript for readability, accuracy, and professionalism.

So if you’re looking for dependable, accurate, fast transcriptions, consider human transcription services instead.

Is Speech Recognition Technology Accurate?

The accuracy of speech recognition technology depends on several factors, including the quality of the audio signal, the complexity of the language being spoken, and the specific algorithms used by the system.

Some speech recognition software can withstand poor acoustic quality, identify multiple speakers, understand accents, and even learn industry jargon. Others are more rudimentary and may have limited vocabulary or may only be able to work with pristine audio quality.

Speaker identification vs. speech recognition: what’s the difference?

The two are often used interchangeably. However, there is a distinction. Speech recognition technology shouldn’t be confused with speaker identification technology, which identifies who is speaking rather than what the speaker has to say.

What type of technology is speech recognition?

Speech recognition is a type of technology that allows computers to understand and interpret spoken words. It is a form of artificial intelligence (AI) that uses algorithms to recognize patterns in audio signals, such as the sound of speech. Speech recognition technology has been around for decades.

Is speech recognition AI technology?

Yes, speech recognition is a form of artificial intelligence (AI) that uses algorithms to recognize patterns in audio signals, such as the sound of speech. Speech recognition technology has been around for decades, but it wasn’t until recently that systems became sophisticated enough to accurately understand and interpret spoken words.

What are examples of speech recognition devices?

Examples of speech recognition devices include virtual assistants such as Amazon Alexa, Google Assistant, and Apple Siri. Additionally, many mobile phones and computers now come with built-in voice recognition software that can be used to control the device or issue commands. Speech recognition technology is also used in various other applications, such as automated customer service systems, medical transcription software, and real-time language translation systems.


Introduction to speech recognition with TensorFlow

Master the basics of speech recognition with TensorFlow: learn how to build and train models, implement real-time audio recognition, and develop practical applications.

In the previous tutorial, I showed you how to build handwritten sentence recognition. Now it’s time for speech recognition! Speech recognition is an essential field of Artificial Intelligence (AI) that is used to recognize a person’s speech and convert it into machine-readable text. It has many applications in industries such as customer service, healthcare, automotive, education, and entertainment. With the advancement of deep learning and natural language processing, speech recognition has become more accurate and efficient. This tutorial will discuss the basics of speech recognition and how to build a basic speech recognition model using TensorFlow.

History of Speech Recognition


The development of speech recognition technology dates back to the 1940s when it was used for military communication and air traffic control systems. 

In the 1950s, researchers developed the first commercial speech recognition system to recognize digits spoken into a telephone. This system was limited to identifying only numbers, not full words or sentences.

In the 1960s, researchers developed more advanced speech recognition systems to recognize isolated words and short phrases. This marked a significant advancement in speech recognition, enabling machines to understand basic speech commands.

In the 1970s, researchers developed the first large-vocabulary speech recognition systems. These systems were capable of recognizing connected speech and could handle large vocabularies. At the same time, the development of artificial neural networks and deep learning began to revolutionize the field of speech recognition. This led to the development of more accurate speech recognition systems in the 1980s.

In the 1990s, the first commercial speech recognition products were released. These products were based on statistical models such as hidden Markov models and could be used in various applications, such as dictation software, voice-controlled user interfaces, and speech-to-text transcription services.

In the 2000s, the development of speech recognition technology continued to advance with the development of more accurate models and the incorporation of acoustic models. This led to the development of virtual assistant devices such as Google Home and Amazon Alexa.

In the 2010s, the development of deep learning algorithms further improved the accuracy of speech recognition models. By 2020, people were using speech recognition technology for a wide range of purposes, from customer service to healthcare and entertainment.

What are the Problems and Challenges


One of speech recognition's main challenges is dealing with human speech variability. People may have different accents, pronunciations, and speech patterns, and the model must recognize this variability accurately. Additionally, background noises and other environmental factors can interfere with the model's accuracy. 

Another challenge is dealing with words that sound similar. For example, the words "to" and "too" may sound similar but have different meanings. The model must distinguish between these words to generate an accurate output. Similarly, words with multiple meanings can be difficult for the model to interpret accurately. 

The quality of the audio signal also affects the accuracy of speech recognition. Poorly recorded audio signals can make it difficult for the model to recognize the speech. Additionally, the model must be trained on a large dataset of audio samples to achieve a high level of accuracy.

Finally, the model must be able to recognize and process multiple languages. Different languages have different phonetic and grammatical rules, and the model must recognize these differences to generate an accurate output. 

All of these challenges make speech recognition a difficult task. However, with the advancement of deep learning and natural language processing, speech recognition has become more accurate and efficient. With the right techniques and data, it is possible to create a high-quality speech recognition model.

Techniques in speech recognition:


The development of machine learning has significantly improved the accuracy of speech recognition. Machine learning algorithms are used to recognize complex speech patterns, understand natural language, and distinguish between different languages.

Deep learning is one of the most popular machine learning techniques in speech recognition. Deep learning uses artificial neural networks to learn from large datasets and can be used to recognize complex patterns. It has been used to develop virtual assistant devices such as Google Home and Amazon Alexa and speech-to-text transcription services.

Other machine learning techniques used in speech recognition include Hidden Markov Models (HMM), Dynamic Time Warping (DTW), and phonetic-based approaches. HMM is a statistical approach for modeling time series data and is used for recognizing speech patterns. DTW is a technique for comparing two temporal sequences and is used for recognizing similar speech patterns. Phonetic-based approaches recognize speech based on phonetic similarity.

Additionally, there are techniques for improving the accuracy of speech recognition models. Beamforming is a technique that reduces background noise by focusing on the sound source. Noise cancellation is a technique used to reduce background noise by subtracting it from the audio signal. Both of these techniques can improve the accuracy of speech recognition models.

Until 2018, the most common technique in speech recognition was deep neural networks with LSTMs, and everything changed when Transformers were released. Transformers significantly impacted the field of speech recognition. They are a type of neural network used for natural language processing tasks and for recognizing complex patterns in input audio. They are beneficial for tasks such as speech recognition because they can model long-term dependencies in the data.

The introduction of Transformers has allowed for more accurate speech recognition models. You can use them to recognize different languages, understand natural language, and distinguish between similar words. The increased accuracy of these models has enabled the development of virtual assistant devices, voice-controlled user interfaces, and speech-to-text transcription services.

Implementation:

In this tutorial, I will demonstrate how to combine a 2D convolutional neural network (CNN), recurrent neural network (RNN), and a Connectionist Temporal Classification (CTC) loss to build an automatic speech recognition (ASR) model. 

This tutorial will utilize the LJSpeech dataset, which features brief audio recordings of a solitary speaker reciting passages from seven non-fiction books. 

To gauge the effectiveness of our model, we'll employ the Word Error Rate (WER) and Character Error Rate (CER) evaluation metrics. These metrics calculate the discrepancy between the recognized words/characters and the original spoken words/characters. WER is determined by summing up the number of substitutions, insertions, and deletions that occur in the sequence of recognized words and dividing the result by the total number of initially spoken words. CER follows the same principle but on a character level.

Prerequisites:

Before we begin, you will need to have the following software installed:

  • TensorFlow (We will be using version 2.10 in this tutorial);
  • mltu==0.1.7

The LJSpeech Dataset:

We'll begin by downloading the LJSpeech Dataset. This dataset contains 13000 audio files in a ".wav" format. All the actual labels are also given to us in the annotation file.

To simplify this for us a little, I wrote a short script that we'll use to download this dataset:
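The original download script is not reproduced in this copy; a minimal stand-in using only the Python standard library might look like this (the URL is the standard public LJSpeech mirror).

```python
import tarfile
import urllib.request
from pathlib import Path

URL = "https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2"  # ~2.6 GB compressed
archive = Path("LJSpeech-1.1.tar.bz2")

if not archive.exists():
    urllib.request.urlretrieve(URL, archive)   # download the archive

if not Path("LJSpeech-1.1").exists():
    with tarfile.open(archive, "r:bz2") as tar:
        tar.extractall(".")                    # extracts wavs/ and metadata.csv
```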

This is likely a slower way to download the dataset, but you don’t need to do anything manually.

Now that we have our dataset downloaded, we need to preprocess it before moving on to the next step. Preprocessing consists of several steps, all of which we can do with code along the following lines:
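The tutorial’s exact preprocessing code is not shown in this copy; a minimal sketch of the idea, assuming the standard LJSpeech metadata.csv layout (pipe-separated id | transcription | normalized transcription), could look like this.

```python
from pathlib import Path

import librosa

dataset_path = Path("LJSpeech-1.1")
dataset = []

# metadata.csv is pipe-separated: file id | raw transcription | normalized transcription.
with open(dataset_path / "metadata.csv", encoding="utf-8") as metadata_file:
    for line in metadata_file:
        if not line.strip():
            continue
        file_id, _, normalized_text = line.strip().split("|", 2)
        wav_path = dataset_path / "wavs" / f"{file_id}.wav"
        dataset.append([str(wav_path), normalized_text.lower()])

# Inspect one example: the raw audio signal, its sample rate, and its transcription.
audio_path, transcription = dataset[0]
audio, sample_rate = librosa.load(audio_path)
print(audio.shape, sample_rate, transcription)
```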

First, we load the raw audio data with librosa.load(audio_path) from the Python library librosa. It loads an audio file specified by the audio_path parameter and returns a tuple of two objects: the raw audio signal as a NumPy array of samples, and the sample rate (number of samples per second) of the audio signal as an integer. So we iterate through the dataset metadata and pair each 'wav' audio file with its actual transcription.


Then we preprocess this raw audio data further to obtain a spectrogram that we will use to train our model.

An audio spectrogram is a visual representation of the frequency content of an audio signal over time. It displays the signal's power spectral density (PSD), which gives a measure of the strength of different frequency components of the signal.

The spectrogram is usually represented as an image. The X-axis represents time, the Y-axis represents frequency, and the color or brightness represents the magnitude of the frequency components at each time frame. The brighter the color or higher the brightness, the higher the magnitude of the corresponding frequency component.

Spectrograms are commonly used in audio analysis and processing, as they provide a clear representation of the frequency content of a signal and can reveal important information such as pitch, harmonics, and transient events. They are also useful for identifying different types of sounds and for performing tasks such as noise reduction, pitch correction, and audio compression.
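A minimal sketch of turning the loaded audio into a log-magnitude spectrogram with TensorFlow’s STFT is shown below; the frame and FFT sizes are example values, not necessarily the ones the tutorial passes to its WavReader.

```python
import numpy as np
import tensorflow as tf

frame_length, frame_step, fft_length = 256, 160, 384  # example STFT settings

def audio_to_spectrogram(audio: np.ndarray) -> np.ndarray:
    """Convert a 1-D audio signal into a log-magnitude spectrogram (frames x frequency bins)."""
    stft = tf.signal.stft(audio.astype(np.float32),
                          frame_length=frame_length,
                          frame_step=frame_step,
                          fft_length=fft_length)
    spectrogram = tf.abs(stft)
    # Log compression keeps quiet and loud frequency components on a comparable scale.
    return tf.math.log(spectrogram + 1e-10).numpy()

spectrogram = audio_to_spectrogram(audio)  # `audio` comes from the loading sketch above
print(spectrogram.shape)                   # (number of frames, fft_length // 2 + 1)
```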

An example of an audio spectrogram would look as follows:


Keep in mind that while iterating through the transcriptions, we converted all capital letters to lowercase and removed any unusual characters.

Once we have prepared our dataset, we can create our TensorFlow data provider, which will feed us data throughout the training process, so we won’t need to hold the whole dataset in RAM.

As before, I will use my "mltu" package:

You may notice that I am using the WavReader as the data preprocessor and SpectrogramPadding, LabelIndexer, and LabelPadding as transformers. In the given code, the purpose of each component is as follows:

  • WavReader : This class reads audio files (in WAV format) and converts them into spectrograms. It uses the parameters frame_length, frame_step, and fft_length to determine how the audio signals should be split into frames and transformed into spectrograms.
  • SpectrogramPadding : This class is used to pad spectrograms to a consistent length so that all spectrograms in a batch have the same shape. It uses the parameter max_spectrogram_length to determine the length to which the spectrograms should be padded and the padding_value to determine the value used for padding.
  • LabelIndexer : This class converts text labels into numerical representations, for example, transforming words into integers. It uses the vocab parameter, a dictionary of all the words in the vocabulary, to determine how to map words to integers.
  • LabelPadding : This class is used to pad text labels to a consistent length so that all text labels in a batch have the same length. It uses the parameter max_word_length to determine the length to which the text labels should be padded and the padding_value to determine the value used for padding.

These components are used together to preprocess the data before it is fed into a machine-learning model. By preprocessing the data this way, it becomes easier to train a model on the data and ensure that the model receives a consistent input format.

When training the model, we can’t rely on training loss alone. For this purpose, we’ll split the dataset into 90% for training and 10% for validation:
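A minimal way to do such a split, assuming `dataset` is the list of (wav path, transcription) pairs built in the earlier sketch, is a shuffled 90/10 slice:

```python
import random

random.seed(42)          # reproducible shuffle
random.shuffle(dataset)

split_index = int(len(dataset) * 0.9)
train_dataset = dataset[:split_index]  # 90% used for training
val_dataset = dataset[split_index:]    # 10% held out for validation
```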

The model architecture:

CNNs for speech recognition : Convolutional Neural Networks are a type of machine learning architecture that is mostly used for analyzing visual datasets. They are good at analyzing images because they can pick up on the spatial and temporal relationships between the pixels in the image. 

The convolution layer examines the essential features of the input data, and the subsampling layer compresses these features into a more straightforward form.

For speech recognition, CNNs take a spectrogram of the speech signal, which is represented as an image, and use these features to recognize speech.

RNNs for speech recognition : Recurrent Neural Networks are a type of deep learning architecture that can handle large sequential inputs. The key idea behind RNNs is that they use the current information and previous inputs to produce the current output. This makes them well-suited for sequential data tasks, such as natural language processing and speech recognition.

RNNs are the preferred deep learning architecture for speech recognition because they are good at modeling sequential data. They can capture the long-term dependencies between the features in the input dataset and produce outputs based on past observations. This is particularly useful for speech recognition tasks because the output of a speech frame depends on previous frames of observations. RNNs and their improved version, Long-Short Term Memory (LSTM) RNNs, have the best performance (Except Transformers) for speech recognition tasks among all deep learning architectures and are the preferred choice.

So we'll define our model:

For a deeper understanding of the architecture, we can open our model.py file, where we create the TensorFlow sequential model step-by-step:
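The tutorial’s model.py is not reproduced in this copy; the sketch below is a simplified CNN + bidirectional LSTM architecture in the same spirit, with layer sizes chosen for illustration. The character vocabulary and the spectrogram feature size are assumed from the earlier preprocessing sketches.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_asr_model(input_dim: int, vocab_size: int) -> tf.keras.Model:
    """Simplified CNN + BiLSTM model producing per-frame character probabilities for CTC."""
    # Spectrogram input: (time, frequency bins); the time dimension stays variable.
    inputs = layers.Input(shape=(None, input_dim), name="spectrogram")

    # Add a channel axis so 2D convolutions can operate on the time-frequency "image".
    x = layers.Reshape((-1, input_dim, 1))(inputs)
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(x)
    x = layers.Conv2D(32, (3, 3), strides=(1, 2), padding="same", activation="relu")(x)

    # Collapse the frequency and channel axes back into one feature vector per time step.
    x = layers.Reshape((-1, x.shape[-2] * x.shape[-1]))(x)

    # Bidirectional LSTMs model temporal context in both directions.
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

    # One extra output class is reserved for the CTC "blank" symbol.
    outputs = layers.Dense(vocab_size + 1, activation="softmax", name="output")(x)
    return tf.keras.Model(inputs=inputs, outputs=outputs)

# Character set built from the transcriptions (dataset from the earlier sketch).
vocab = sorted(set("".join(text for _, text in dataset)))

model = build_asr_model(input_dim=193, vocab_size=len(vocab))  # 193 = 384 // 2 + 1 frequency bins
model.summary()
```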

Great, now we have the model that we need to compile. Let's do it:

As you might notice, I am using CTC loss and custom CER and WER metrics. I introduced CER and WER in my previous tutorials; they are among the most common metrics for measuring how far our predictions are from the actual transcription. CTC is the most common loss function when training models for language recognition and extraction.

Now, we can define our callbacks (introduced in previous tutorials) and start the training process:

Training process:

To track the training process, we added the TensorBoard metric, there we can check what our curves of loss, CER, and WER metrics were. Here is the loss curve:

speech recognition tasks

We can see that while training, our loss was constantly decreasing; that's what we expected to see. But we might see that validation loss was falling since the 48 training step, and then it increased. This means our mode might be overfitting. We may see similar scenarios in our CER and WER curves. Let's take a look at them. Here is the CER curve:

speech recognition tasks

I was wrong; the CER of validation was constantly improving until step 100. But we can see here a huge gap between training and validation CERs. This is because we are not using any augmentation techniques for our audio data. Let's look at the WER curve:

speech recognition tasks

It looks very similar to CER; that's what I was expecting. But overall, our CER is only 1.7%, and our WER is 7%. This means that our model performs well on this dataset!

Test model inference:

Our model is trained, and it gave us pretty satisfying results. How can we test it out on single inference? I wrote a script that iterates through validation data from our training data:

If you want to test this on your recording, remove the iterative loop and link our audio 'wav' recording, it should handle it! The trained model can be downloaded from this link. 

Conclusion:

Speech recognition is a field of AI with a rich history of advancements dating back to the 1940s. With deep learning and natural language processing integration, speech recognition has become more accurate and efficient. The main challenges in speech recognition include the following:

  • Dealing with human speech variability;
  • Recognizing similar words;
  • The quality of the audio signal.

Several techniques are used in speech recognition, including Deep learning, Hidden Markov Models, Dynamic Time Warping, and phonetic-based approaches. Additionally, beamforming and noise cancellation techniques can be used to improve the accuracy of speech recognition models. 

The introduction of transformers has significantly impacted speech recognition, enabling more accurate models for tasks such as speech recognition, natural language processing, and virtual assistant devices. 

This tutorial demonstrated how to build a basic speech recognition model using TensorFlow by combining a 2D CNN, RNN, and CTC loss. With the right techniques and data, speech recognition can be a powerful tool for many industries.

The trained model used in this tutorial can be downloaded from  this link .

Complete tutorial code on  GitHub .

speech recognition tasks

Home > What is ASR

What is ASR: Understanding Automatic Speech Recognition

12 min read

Anna Maricheva

Roman Kyrychenko

Table of Contents

What is asr, how does asr work, speech recognition models, key components of automatic speech recognition, challenges and limitations, recent advancements in asp technologies, expert opinion, applications of asr, ethical and privacy considerations, final thoughts.

The ASR technology is significantly transforming our daily lives and the way we engage with devices. The increasing demand for speech recognition technology across diverse industries, including healthcare, banking, and retail, serves as a significant driver for the expansion of the market, which is projected to reach US$8.53 billion in 2024, according to Statista . 

But what is ASR , exactly? In this article, we will consider the key components of the technology and the diverse applications that leverage its capabilities.

What is ASR: Understanding Automatic Speech Recognition

Automatic speech recognition (ASR) is a technology that converts spoken language into written text. It uses complex algorithms and machine learning techniques to analyze audio signals and transcribe them into text. ASR systems are designed to recognize and interpret human speech, making it possible to convert spoken words into a format that computers can process and understand.

The development of ASR can be traced back to the mid-20th century, when researchers began exploring methods to automate speech recognition. Early systems relied on pattern matching and acoustic modeling techniques, exhibiting limited accuracy and vocabulary coverage. However, with the advent of machine learning and computational advancements, the automatic speech recognition model underwent a significant evolution. The introduction of Hidden Markov Models (HMMs) in the 1970s marked an important moment for ASR, enabling more robust speech recognition capabilities.

Recently, speech recognition has changed a lot with the use of deep learning and neural network-based models. Not too long ago, the prevailing model for speech recognition was wave2text. This technology allowed for the conversion of spoken words into text, enabling a range of applications from virtual assistants to transcribing voice recordings. However, as technology continues to advance at a rapid pace, new models and approaches have emerged. 

ASR operates through a multi-step process that involves capturing, processing, and interpreting spoken language. The following steps outline the fundamentals of the ASR process:

  • Audio input. The process begins with the capture of human speech through a microphone or any audio recording device. The audio input is then digitized to create a digital representation of the spoken words.
  • Preprocessing. The digitized audio undergoes preprocessing, which involves filtering out background noise, normalizing the audio levels, and segmenting the speech into smaller units for analysis.
  • Feature extraction. During this phase, the system extracts acoustic features from the preprocessed audio, such as mel-frequency cepstral coefficients (MFCCs) or spectrograms. These features serve as the input for the subsequent recognition stage.
  • Speech recognition . The extracted features are fed into a speech recognition model that employs machine learning algorithms, such as deep neural networks and Hidden Markov Models (HMMs), to match the acoustic features with linguistic units and generate the corresponding textual output.
  • Language modeling. In this step, language models are utilized to enhance the accuracy of recognizing spoken words by considering the context and grammar of the language being spoken.
  • Output generation. Finally, the recognized speech is transcribed into written text, which can be further processed for various applications, such as generating subtitles, transcribing meetings, or enabling voice commands for devices.

For a better understanding, let’s explore some of the most influential models for speech recognition.

Hidden Markov Models (HMM)

One of the earliest and most widely used models for speech recognition is the Hidden Markov Model. HMMs are statistical models that represent the probability distribution over sequences of observations. In the context of speech recognition, HMMs are used to model the acoustic features of speech, such as phonemes and words. While HMMs have been instrumental in laying the foundation for ASR, they have limitations, such as their inability to capture long-range dependencies in speech.

Wave2text, also known as the traditional model for speech recognition, is operated by converting audio waveforms into text through a series of complex algorithms. This model had its strengths, particularly in handling clear and distinct speech, but it faced challenges with understanding accents, dialects, and background noise. Furthermore, wave2text was computationally intensive and often required significant processing power to deliver accurate results.

Recurrent neural networks (RNN)

When it comes to modeling sequential data, Recurrent Neural Networks (RNNs) have proven to be invaluable for speech recognition. RNNs, with their ability to capture temporal dependencies, are well-suited for modeling the sequential nature of speech. Through architectures such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), RNNs have demonstrated remarkable success in understanding the context and dynamics of speech, leading to improved accuracy and fluency in ASR systems.

Transformer-based architectures

Transformers, originally developed for natural language processing, have also made a remarkable impact on speech recognition. These models are good at capturing extensive relationships and contextual details, which improves the understanding of spoken languages. Through self-attention mechanisms, transformers can effectively process audio input and generate accurate transcriptions.

What is ASR: Understanding Automatic Speech Recognition

The development of ASR involves several key components that work together to accurately process and interpret spoken language. 

Speech signal processing

Speech signal processing is fundamental to ASR systems. It converts analog audio signals into digital form for computer analysis. Techniques like digitization, signal filtering, and feature extraction capture speech characteristics like phonemes and intonation. This conversion into a computationally analyzable format sets the stage for further ASR steps like acoustic modeling and language processing.

Acoustic modeling techniques

Acoustic modeling plays an important role in ASR by capturing the relationship between speech signals and the corresponding phonetic units. Techniques such as Hidden Markov Models (HMMs) and neural networks are commonly used for acoustic modeling. HMMs are particularly effective in representing the temporal and sequential nature of speech, while neural networks, including deep learning architectures, have shown remarkable capability in learning complex patterns and variations in speech signals. These techniques enable ASR systems to recognize and differentiate between different phonemes and spoken sounds, contributing to the accuracy and robustness of speech recognition.

Language modeling

Language modeling is essential for understanding the structure and context of spoken language. It involves the use of statistical models, such as n-grams, as well as more advanced approaches, like neural language models, to predict the likelihood of word sequences and phrases in a given language. By incorporating linguistic knowledge and contextual information, language models enhance the accuracy of ASR systems in deciphering spoken sentences and inferring the intended meaning behind the words.

Decoding algorithms 

Decoding algorithms are responsible for determining the most likely sequence of words that correspond to a given input speech signal. Techniques such as Viterbi decoding and beam search are commonly used to efficiently search through the space of possible word sequences and identify the most probable interpretation of the input speech. These algorithms are crucial for aligning the acoustic and language models, resolving ambiguities, and producing accurate transcriptions or commands based on the input speech.

Even though the ASR system is highly beneficial, it still faces several challenges and limitations:

Background noise

A major challenge for ASR is dealing with background noise. In noisy places like crowded areas, industrial settings, or vehicles, ASR accuracy can decline. Background noise, echoes, and other acoustic interferences make it difficult for ASR to accurately recognize and transcribe speech in such environments.

Speaker variations and accents

Another significant challenge is the diversity of speakers and accents encountered in everyday communication. ASR systems may struggle to accurately transcribe speech from individuals with different dialects, accents, or speech impediments. The variations in pitch, intonation, and pronunciation among speakers also pose a considerable challenge for ASR technology. It must adapt to accurately interpret a wide range of vocal characteristics.

Contextual ambiguity

Contextual ambiguity presents a substantial challenge for ASR, particularly in understanding and interpreting natural language. Human speech often relies on context, tone, and non-verbal cues to convey meaning, which can be challenging for ASR systems to accurately interpret. Ambiguous phrases, homophones, and colloquial expressions further complicate the task of accurate speech recognition and understanding.

Handling different languages

Recognizing and understanding different languages and dialects is not easy. Multilingual ASR faces difficulties because each language has its own sounds, sentence structures, and word meanings. To work well, ASR needs special models and algorithms designed for each language, making it accurate in recognizing and writing down spoken words.

Limitations in real-time applications

ASR encounters challenges in how quickly it processes information and responds in real-time applications like live transcription, virtual assistants, and voice-controlled devices. To be useful, ASR needs to quickly and accurately change spoken words into text. But it is difficult to keep up 

with the fast pace of real-time speech. Although ASR has gotten better over the years, there is still room for improvements.

Despite these challenges, the ongoing research and development efforts aim to improve the technology and enhance its robustness and effectiveness. Nowadays, advancements in machine learning, neural networks, and data collection methodologies are contributing to overcoming some of these challenges. Several recent models have already overcome certain challenges.

As we can see, modern ASR systems leverage large-scale datasets, such as speech corpora and transcribed audio, to train models that can accurately comprehend diverse speech patterns and accents. Additionally, the integration of language models and contextual information has further improved the accuracy and naturalness of ASR outputs.

In 2023, ASR systems have made significant strides in contextual understanding and natural language processing, enabling them to comprehend the nuances of human speech more effectively. 

The emergence of transformers-based speech recognition models has substantially changed the standard techniques for speech recognition tasks. These advanced models have made it possible to efficiently process over 100 languages within a single model, an accomplishment that was previously unimaginable.

With the development of various models aimed at enhancing accuracy, efficiency, and adaptability, two key models that have garnered attention in this domain are Whisper and SeamlessM4T. 

This model was introduced in the paper titled “Robust Speech Recognition via Large-Scale Weak Supervision” by Alec Radford and his team at OpenAI. It leverages electromyography (EMG) signals to capture subtle muscle movements associated with speech production, even when the speaker is whispering or mouthing words without vocalizing. By decoding these EMG signals, Whisper can accurately transcribe silent speech, offering a discreet and efficient mode of communication.

The applications of Whisper are diverse and impactful. In healthcare settings, it can facilitate communication for individuals with speech impairments or those in environments where vocalization is impractical. Moreover, Whisper holds promise in enhancing privacy and convenience for users in public spaces or during virtual interactions. Additionally, this model has the potential to augment human-computer interaction, enabling seamless control of devices through silent commands.

SeamlessM4T 

This model represents a breakthrough in multilingual and multi-speaker speech recognition, catering to the diverse linguistic landscape and communication patterns worldwide. This model is designed to accurately transcribe speech in various languages and dialects, as well as distinguish and transcribe multiple speakers within the same audio input. By harnessing advanced machine learning algorithms, SeamlessM4T has demonstrated remarkable proficiency in understanding and transcribing complex speech patterns across different languages and speakers.

The implications of SeamlessM4T extend across numerous domains, from multilingual customer service and transcription services to facilitating cross-cultural communication and language learning. Businesses can leverage this model to analyze customer interactions in different languages, while educational institutions can utilize it to develop more inclusive and effective language learning tools. Furthermore, SeamlessM4T has the potential to bridge linguistic barriers, fostering enhanced communication and collaboration on a global scale.

The advancements in speech recognition technology this year have laid the groundwork for a more interconnected and multilingual future. The potential applications of these cutting-edge models are vast, ranging from real-time translation services to voice-activated systems and beyond. The future of speech recognition technology looks brighter and more promising than ever before.

Data Scientist

Roman Kyrychenko

The versatility of ASR has led to its integration into a wide range of applications. Some notable use cases include:

  • Virtual assistants. ASR powers virtual assistants like Siri, Alexa, and Google Assistant, enabling users to issue voice commands for tasks such as setting reminders, playing music, or fetching information.
  • ASR transcription services. ASR technology is extensively used for transcribing interviews, meetings, lectures, and other spoken content, providing a convenient and efficient alternative to manual transcription.
  • Customer service. Many businesses leverage ASR for interactive voice response (IVR) systems and customer service applications, allowing customers to communicate with automated systems for inquiries, reservations, and support.
  • Accessibility. ASR plays a pivotal role in enhancing accessibility for individuals with disabilities, facilitating real-time captioning, speech-to-text conversion, and communication aids.
  • Healthcare. In the healthcare industry, ASR is utilized for medical dictation, allowing healthcare professionals to quickly and accurately document patient information, saving time and enhancing the quality of care.
  • Automotive industry. ASR is integrated into vehicles to enable hands-free communication, navigation, and entertainment systems, enhancing both convenience and safety for drivers.

What is ASR: Understanding Automatic Speech Recognition

The widespread applications of ASR continue to expand, offering innovative solutions across various sectors. However, as with any form of advanced technology, ASR raises significant ethical and privacy considerations that need to be carefully addressed to ensure the fair and responsible use of this powerful tool.

Many ASR systems rely on recording and analyzing users’ speech to improve accuracy and functionality. While this data collection is often necessary for the proper functioning of ASR systems, it raises important privacy considerations.

First and foremost, users must be fully informed about the extent of data collection and how their speech data will be used. Transparency regarding data collection practices, including obtaining explicit consent from users, is crucial in upholding privacy rights. Additionally, measures should be in place to secure and protect the collected speech data from unauthorized access or misuse.

Furthermore, the anonymization of speech data is essential to prevent the identification of individuals based on their speech patterns. Robust data anonymization techniques can help mitigate the risk of privacy breaches and ensure that users’ identities remain protected.

ASR is essential in modern technology, transforming how we interact with devices and enhancing communication between humans and machines. It is crucial in healthcare, education, customer service, and accessibility, making operations more efficient for everyone. ASR enables hands-free device usage, supports multilingual communication, and powers virtual assistants and transcription apps. Ongoing advancements in machine learning promise improved accuracy, expanding ASR’s impact on real-time translation, voice-enabled environments, and personalized healthcare. As ASR progresses, its significance in daily life grows, shaping how we communicate, work, and access information.

How accurate is ASR technology?

How does asr benefit individuals and businesses, what advancements are being made in asr research and development, what are the future prospects for asr technology, we will be happy to hear your thoughts cancel reply.

I agree with Privacy Policy and Terms of Services

Read More in Our Blog

Minimum viable product (mvp) development 101: the main do’s and don’ts.

Minimum Viable Product (MVP) Development 101: The Main Do’s and Don’Ts

1The MVP is not dead and here is why2The main steps of MVP development3Best practices for creating an MVP4Summing up Say, you have this amazing idea for a software product but you are not too sure about whether it’s going to be a success or not. How do you test your idea with real users without ...

How to Decide Which Features Are Crucial In Your MVP App

How to Decide Which Features Are Crucial In Your MVP App

Choosing the essential features for your minimum viable product For many entrepreneurs, having a great idea and a solid development team automatically equals the success of a future project.  But things are not so simple. The users may not like the product, it may lack certain ...

Want to stay updated on the latest tech news?

Sign up for our monthly blog newsletter in the form below.

Headquarters

82 Laisves al., Kaunas, 44250, Lithuania

  • IT Consulting
  • Software Development
  • Web Development
  • Mobile App Development
  • UI/UX Design
  • Quality Assurance
  • Machine Learning, AI, Data Science, Big Data
  • Salesforce Development
  • DevOps Services
  • Support and Maintenance
  • Technologies

Reviews on other sites

Copyright © 2008-2024 SoftTeco ®

  • Privacy Policy
  • Terms of Services
  • Cookie Settings

Softteco Logo Footer

Audio Course documentation

Pre-trained models for automatic speech recognition

Audio course.

and get access to the augmented documentation experience

to get started

In this section, we’ll cover how to use the pipeline() to leverage pre-trained models for speech recognition. In Unit 2 , we introduced the pipeline() as an easy way of running speech recognition tasks, with all pre- and post-processing handled under-the-hood and the flexibility to quickly experiment with any pre-trained checkpoint on the Hugging Face Hub. In this Unit, we’ll go a level deeper and explore the different attributes of speech recognition models and how we can use them to tackle a range of different tasks.

As detailed in Unit 3, speech recognition model broadly fall into one of two categories:

  • Connectionist Temporal Classification (CTC): encoder-only models with a linear classification (CTC) head on top
  • Sequence-to-sequence (Seq2Seq): encoder-decoder models, with a cross-attention mechanism between the encoder and decoder

Prior to 2022, CTC was the more popular of the two architectures, with encoder-only models such as Wav2Vec2, HuBERT and XLSR achieving breakthoughs in the pre-training / fine-tuning paradigm for speech. Big corporations, such as Meta and Microsoft, pre-trained the encoder on vast amounts of unlabelled audio data for many days or weeks. Users could then take a pre-trained checkpoint, and fine-tune it with a CTC head on as little as 10 minutes of labelled speech data to achieve strong performance on a downstream speech recognition task.

However, CTC models have their shortcomings. Appending a simple linear layer to an encoder gives a small, fast overall model, but can be prone to phonetic spelling errors. We’ll demonstrate this for the Wav2Vec2 model below.

Probing CTC Models

Let’s load a small excerpt of the LibriSpeech ASR dataset to demonstrate Wav2Vec2’s speech transcription capabilities:

We can pick one of the 73 audio samples and inspect the audio sample as well as the transcription:

Alright! Christmas and roast beef, sounds great! 🎄 Having chosen a data sample, we now load a fine-tuned checkpoint into the pipeline() . For this, we’ll use the official Wav2Vec2 base checkpoint fine-tuned on 100 hours of LibriSpeech data:

Next, we’ll take an example from the dataset and pass its raw data to the pipeline. Since the pipeline consumes any dictionary that we pass it (meaning it cannot be re-used), we’ll pass a copy of the data. This way, we can safely re-use the same audio sample in the following examples:

We can see that the Wav2Vec2 model does a pretty good job at transcribing this sample - at a first glance it looks generally correct. Let’s put the target and prediction side-by-side and highlight the differences:

Comparing the target text to the predicted transcription, we can see that all words sound correct, but some are not spelled accurately. For example:

  • CHRISTMAUS vs. CHRISTMAS
  • ROSE vs. ROAST
  • SIMALYIS vs. SIMILES

This highlights the shortcoming of a CTC model. A CTC model is essentially an ‘acoustic-only’ model: it consists of an encoder which forms hidden-state representations from the audio inputs, and a linear layer which maps the hidden-states to characters:

This means that the system almost entirely bases its prediction on the acoustic input it was given (the phonetic sounds of the audio), and so has a tendency to transcribe the audio in a phonetic way (e.g. CHRISTMAUS ). It gives less importance to the language modelling context of previous and successive letters, and so is prone to phonetic spelling errors. A more intelligent model would identify that CHRISTMAUS is not a valid word in the English vocabulary, and correct it to CHRISTMAS when making its predictions. We’re also missing two big features in our prediction - casing and punctuation - which limits the usefulness of the model’s transcriptions to real-world applications.

Graduation to Seq2Seq

Cue Seq2Seq models! As outlined in Unit 3, Seq2Seq models are formed of an encoder and decoder linked via a cross-attention mechanism. The encoder plays the same role as before, computing hidden-state representations of the audio inputs, while the decoder plays the role of a language model . The decoder processes the entire sequence of hidden-state representations from the encoder and generates the corresponding text transcriptions. With global context of the audio input, the decoder is able to use language modelling context as it makes its predictions, correcting for spelling mistakes on-the-fly and thus circumventing the issue of phonetic predictions.

There are two downsides to Seq2Seq models:

  • They are inherently slower at decoding, since the decoding process happens one step at a time, rather than all at once
  • They are more data hungry, requiring significantly more training data to reach convergence

In particular, the need for large amounts of training data has been a bottleneck in the advancement of Seq2Seq architectures for speech. Labelled speech data is difficult to come by, with the largest annotated datasets at the time clocking in at just 10,000 hours. This all changed in 2022 upon the release of Whisper . Whisper is a pre-trained model for speech recognition published in September 2022 by the authors Alec Radford et al. from OpenAI. Unlike its CTC predecessors, which were pre-trained entirely on un-labelled audio data, Whisper is pre-trained on a vast quantity of labelled audio-transcription data, 680,000 hours to be precise.

This is an order of magnitude more data than the un-labelled audio data used to train Wav2Vec 2.0 (60,000 hours). What is more, 117,000 hours of this pre-training data is multilingual (or “non-English”) data. This results in checkpoints that can be applied to over 96 languages, many of which are considered low-resource , meaning the language lacks a large corpus of data suitable for training.

When scaled to 680,000 hours of labelled pre-training data, Whisper models demonstrate a strong ability to generalise to many datasets and domains. The pre-trained checkpoints achieve competitive results to state-of-the-art pipe systems, with near 3% word error rate (WER) on the test-clean subset of LibriSpeech pipe and a new state-of-the-art on TED-LIUM with 4.7% WER ( c.f. Table 8 of the Whisper paper ).

Of particular importance is Whisper’s ability to handle long-form audio samples, its robustness to input noise and ability to predict cased and punctuated transcriptions. This makes it a viable candidate for real-world speech recognition systems.

The remainder of this section will show you how to use the pre-trained Whisper models for speech recognition using 🤗 Transformers. In many situations, the pre-trained Whisper checkpoints are extremely performant and give great results, thus we encourage you to try using the pre-trained checkpoints as a first step to solving any speech recognition problem. Through fine-tuning, the pre-trained checkpoints can be adapted for specific datasets and languages to further improve upon these results. We’ll demonstrate how to do this in the upcoming subsection on fine-tuning .

The Whisper checkpoints come in five configurations of varying model sizes. The smallest four are trained on either English-only or multilingual data. The largest checkpoint is multilingual only. All nine of the pre-trained checkpoints are available on the Hugging Face Hub . The checkpoints are summarised in the following table with links to the models on the Hub. “VRAM” denotes the required GPU memory to run the model with the minimum batch size of 1. “Rel Speed” is the relative speed of a checkpoint compared to the largest model. Based on this information, you can select a checkpoint that is best suited to your hardware.

Let’s load the Whisper Base checkpoint, which is of comparable size to the Wav2Vec2 checkpoint we used previously. Preempting our move to multilingual speech recognition, we’ll load the multilingual variant of the base checkpoint. We’ll also load the model on the GPU if available, or CPU otherwise. The pipeline() will subsequently take care of moving all inputs / outputs from the CPU to the GPU as required:

Great! Now let’s transcribe the audio as before. The only change we make is passing an extra argument, max_new_tokens , which tells the model the maximum number of tokens to generate when making its prediction:

Easy enough! The first thing you’ll notice is the presence of both casing and punctuation. Immediately this makes the transcription easier to read compared to the un-cased and un-punctuated transcription from Wav2Vec2. Let’s put the transcription side-by-side with the target:

Whisper has done a great job at correcting the phonetic errors we saw from Wav2Vec2 - both Christmas and roast are spelled correctly. We see that the model still struggles with SIMILES , being incorrectly transcribed as similarly , but this time the prediction is a valid word from the English vocabulary. Using a larger Whisper checkpoint can help further reduce transcription errors, at the expense of requiring more compute and a longer transcription time.

We’ve been promised a model that can handle 96 languages, so lets leave English speech recognition for now and go global 🌎! The Multilingual LibriSpeech (MLS) dataset is the multilingual equivalent of the LibriSpeech dataset, with labelled audio data in six languages. We’ll load one sample from the Spanish split of the MLS dataset, making use of streaming mode so that we don’t have to download the entire dataset:

Again, we’ll inspect the text transcription and take a listen to the audio segment:

This is the target text that we’re aiming for with our Whisper transcription. Although we now know that we can probably do better this, since our model is also going to predict punctuation and casing, neither of which are present in the reference. Let’s forward the audio sample to the pipeline to get our text prediction. One thing to note is that the pipeline consumes the dictionary of audio inputs that we input, meaning the dictionary can’t be re-used. To circumvent this, we’ll pass a copy of the audio sample, so that we can re-use the same audio sample in the proceeding code examples:

Great - this looks very similar to our reference text (arguably better since it has punctuation and casing!). You’ll notice that we forwarded the "task" as a generate key-word argument (generate kwarg). Setting the "task" to "transcribe" forces Whisper to perform the task of speech recognition , where the audio is transcribed in the same language that the speech was spoken in. Whisper is also capable of performing the closely related task of speech translation , where the audio in Spanish can be translated to text in English. To achieve this, we set the "task" to "translate" :

Now that we know we can toggle between speech recognition and speech translation, we can pick our task depending on our needs. Either we recognise from audio in language X to text in the same language X (e.g. Spanish audio to Spanish text), or we translate from audio in any language X to text in English (e.g. Spanish audio to English text).

To read more about how the "task" argument is used to control the properties of the generated text, refer to the model card for the Whisper base model.

Long-Form Transcription and Timestamps

So far, we’ve focussed on transcribing short audio samples of less than 30 seconds. We mentioned that one of the appeals of Whisper was its ability to work on long audio samples. We’ll tackle this task here!

Let’s create a long audio file by concatenating sequential samples from the MLS dataset. Since the MLS dataset is curated by splitting long audiobook recordings into shorter segments, concatenating samples is one way of reconstructing longer audiobook passages. Consequently, the resulting audio should be coherent across the entire sample.

We’ll set our target audio length to 5 minutes, and stop concatenating samples once we hit this value:

Alright! 5 minutes and 17 seconds of audio to transcribe. There are two problems with forwarding this long audio sample directly to the model:

  • Whisper is inherently designed to work with 30 second samples: anything shorter than 30s is padded to 30s with silence, anything longer than 30s is truncated to 30s by cutting of the extra audio, so if we pass our audio directly we’ll only get the transcription for the first 30s
  • Memory in a transformer network scales with the sequence length squared: doubling the input length quadruples the memory requirement, so passing super long audio files is bound to lead to an out-of-memory (OOM) error

The way long-form transcription works in 🤗 Transformers is by chunking the input audio into smaller, more manageable segments. Each segment has a small amount of overlap with the previous one. This allows us to accurately stitch the segments back together at the boundaries, since we can find the overlap between segments and merge the transcriptions accordingly:

🤗 Transformers chunking algorithm. Source: https://huggingface.co/blog/asr-chunking.

The advantage of chunking the samples is that we don’t need the result of chunk i i i to transcribe the subsequent chunk i + 1 i + 1 i + 1 . The stitching is done after we have transcribed all the chunks at the chunk boundaries, so it doesn’t matter which order we transcribe chunks in. The algorithm is entirely stateless , so we can even do chunk i + 1 i + 1 i + 1 at the same time as chunk i i i ! This allows us to batch the chunks and run them through the model in parallel, providing a large computational speed-up compared to transcribing them sequentially. To read more about chunking in 🤗 Transformers, you can refer to this blog post .

To activate long-form transcriptions, we have to add one additional argument when we call the pipeline. This argument, chunk_length_s , controls the length of the chunked segments in seconds. For Whisper, 30 second chunks are optimal, since this matches the input length Whisper expects.

To activate batching, we need to pass the argument batch_size to the pipeline. Putting it all together, we can transcribe the long audio sample with chunking and batching as follows:

We won’t print the entire output here since it’s pretty long (312 words total)! On a 16GB V100 GPU, you can expect the above line to take approximately 3.45 seconds to run, which is pretty good for a 317 second audio sample. On a CPU, expect closer to 30 seconds.

Whisper is also able to predict segment-level timestamps for the audio data. These timestamps indicate the start and end time for a short passage of audio, and are particularly useful for aligning a transcription with the input audio. Suppose we want to provide closed captions for a video - we need these timestamps to know which part of the transcription corresponds to a certain segment of video, in order to display the correct transcription for that time.

Activating timestamp prediction is straightforward, we just need to set the argument return_timestamps=True . Timestamps are compatible with both the chunking and batching methods we used previously, so we can simply append the timestamp argument to our previous call:

And voila! We have our predicted text as well as corresponding timestamps.

Whisper is a strong pre-trained model for speech recognition and translation. Compared to Wav2Vec2, it has higher transcription accuracy, with outputs that contain punctuation and casing. It can be used to transcribe speech in English as well as 96 other languages, both on short audio segments and longer ones through chunking . These attributes make it a viable model for many speech recognition and translation tasks without the need for fine-tuning. The pipeline() method provides an easy way of running inference in one-line API calls with control over the generated predictions.

While the Whisper model performs extremely well on many high-resource languages, it has lower transcription and translation accuracy on low-resource languages, i.e. those with less readily available training data. There is also varying performance across different accents and dialects of certain languages, including lower accuracy for speakers of different genders, races, ages or other demographic criteria ( c.f. Whisper paper ).

To boost the performance on low-resource languages, accents or dialects, we can take the pre-trained Whisper model and train it on a small corpus of appropriately selected data, in a process called fine-tuning . We’ll show that with as little as ten hours of additional data, we can improve the performance of the Whisper model by over 100% on a low-resource language. In the next section, we’ll cover the process behind selecting a dataset for fine-tuning.

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings

Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .

  • Advanced Search
  • Journal List
  • v.30(2); 2010 Jan 13

How the Human Brain Recognizes Speech in the Context of Changing Speakers

Katharina von kriegstein.

1 Wellcome Trust Centre for Neuroimaging, University College London, London WC1N 3BG, United Kingdom,

2 Auditory Group, Medical School, University of Newcastle-upon-Tyne, Newcastle-upon-Tyne NE2 4HH, United Kingdom,

3 Max Planck Institute for Cognitive and Brain Science, 04103 Leipzig, Germany,

David R. R. Smith

4 Department of Psychology, University of Hull, Hull HU6 7RX, United Kingdom, and

5 Centre for the Neural Basis of Hearing, University of Cambridge, Cambridge CB2 3EG, United Kingdom

Roy D. Patterson

Stefan j. kiebel, timothy d. griffiths.

We understand speech from different speakers with ease, whereas artificial speech recognition systems struggle with this task. It is unclear how the human brain solves this problem. The conventional view is that speech message recognition and speaker identification are two separate functions and that message processing takes place predominantly in the left hemisphere, whereas processing of speaker-specific information is located in the right hemisphere. Here, we distinguish the contribution of specific cortical regions, to speech recognition and speaker information processing, by controlled manipulation of task and resynthesized speaker parameters. Two functional magnetic resonance imaging studies provide evidence for a dynamic speech-processing network that questions the conventional view. We found that speech recognition regions in left posterior superior temporal gyrus/superior temporal sulcus (STG/STS) also encode speaker-related vocal tract parameters, which are reflected in the amplitude peaks of the speech spectrum, along with the speech message. Right posterior STG/STS activated specifically more to a speaker-related vocal tract parameter change during a speech recognition task compared with a voice recognition task. Left and right posterior STG/STS were functionally connected. Additionally, we found that speaker-related glottal fold parameters (e.g., pitch), which are not reflected in the amplitude peaks of the speech spectrum, are processed in areas immediately adjacent to primary auditory cortex, i.e., in areas in the auditory hierarchy earlier than STG/STS. Our results point to a network account of speech recognition, in which information about the speech message and the speaker's vocal tract are combined to solve the difficult task of understanding speech from different speakers.

Introduction

The same sentence, spoken by different speakers, can sound very different, and the acoustic differences between speakers enable us to relate speech to a specific person and recognize each other by voice. Also, changing voice properties can be useful in adapting to a specific context, such as whispering in quiet surroundings. However, to extract meaning and understand what has been said, the variability within and between speakers must be resolved. This is a nontrivial problem, and the sophisticated algorithms used by speech recognition machines still perform well below humans ( Deng et al., 2006 ; O'Shaughnessy, 2008 ). Currently, it is unclear why human speech recognition is so robust to variation in speaker characteristics ( Friederici, 2002 ; Wong et al., 2004 ; Hickok and Poeppel, 2007 ; Obleser and Eisner, 2009 ).

One of the main obstacles to understanding speech from different speakers is that the formant frequencies (i.e., the amplitude peaks in the frequency spectrum of speech sounds) contain information about both the type of speech sound (/a/, /i/, /n/, etc.) and speaker-related vocal tract parameters. This information is fundamentally intermingled and difficult to separate ( Joos, 1948 ; Ladefoged and Broadbent, 1957 ; Sussman, 1986 ; Nearey, 1989 ; Welling et al., 2002 ; Adank et al., 2004 ; Johnson, 2005 ; Ames and Grossberg, 2008 ; Turner et al., 2009 ) (see Fig. 1 ). In contrast, glottal fold parameters do not affect the formant position or timbre. Rather, they determine voice pitch or whether speech is voiced or whispered.

An external file that holds a picture, illustration, etc.
Object name is zns9990975390001.jpg

The contribution of glottal fold and vocal tract parameters to the speech output. A , Shown is a sagittal section through a human head and neck. Green circle, Glottal folds; blue lines, extension of the vocal tract from glottal folds to tip of the nose and lips. B , The three plots show three different sounds determined by glottal fold parameters. In voiced speech, the vibration of the glottal folds results in lower voices (120 Hz GPR; top) or higher voices (200 Hz GPR; middle). If glottal folds are constricted, they produce a noise-like sound that is heard as whispered speech (0 Hz GPR; bottom). C , The vocal tract filters the sound wave coming from the glottal folds, which introduces amplitude peaks at certain frequencies (“formants”; blue lines). Note that the different glottal fold parameters do not influence the formant position. D , Both speech- and speaker-related vocal tract parameters influence the position of the formants. Here we show as an example the formant shifts associated with the speech sounds /u/ and /a/ (first and second plot) and an /a/ with a shorter and longer vocal tract length (second and third plot).

With a recently described approach ( Kawahara et al., 2004 ), natural speech sounds can be modified in a major speaker-related vocal tract parameter, i.e., vocal tract length (VTL). Previous functional magnetic resonance imaging (fMRI) studies using these resynthesized sounds revealed that regions in posterior superior temporal gyrus/superior temporal sulcus (STG/STS) respond specifically to VTL information in human speech in contrast to similar information in, for example, animal calls ( von Kriegstein et al., 2007 ). The reason for this specificity is unclear. VTL information is an important cue for speaker recognition ( Lavner et al., 2000 ), and regions responding to this information might be involved in recognizing other humans by voice ( Belin et al., 2004 ; von Kriegstein and Giraud, 2004 ). Another reason might be that posterior STG/STS is responsive to VTL changes, because this area contributes to speech recognition by processing information about speaker-specific vocal tract dynamics. Using two fMRI studies, we focus on testing this latter hypothesis. We show that VTL-sensitive regions in posterior STG/STS are involved in speech recognition and that left and right posterior STG/STS are functionally connected when recognizing speech in the context of changing speakers. In addition, we harnessed the distinction between vocal tract and glottal fold parameters to show that (1) posterior STG/STS is involved in processing both speaker-specific formant and speech information and that (2) speaker-related glottal fold and vocal tract parameters are processed in separate brain regions. We present a hypothesis of how speaker-related acoustic variability is dealt with by the human brain. In addition, we discuss the implications of our findings for two influential but opposing theoretical accounts (abstractionist vs exemplar models) of speech processing ( Goldinger, 1996 ; Pisoni, 1997 ).

Materials and Methods

The stimuli were based on syllables recorded from a single speaker (16 bit resolution, 48 kHz sample rate) that were preprocessed with level balancing to minimize loudness differences, and perceptual centering to reduce rhythmic distractions as described previously ( von Kriegstein et al., 2006 ). Experiment 1 contained 96 syllables (48 consonant–vowel, 48 vowel–consonant). Experiment 2 contained 150 vowel–consonant–vowel syllables. Versions of the original syllables were synthesized to simulate speakers with different glottal pulse rate (GPR) and VTL using a sophisticated vocoder referred to as STRAIGHT ( Kawahara et al., 1999 , 2004 ). In addition, whispered syllables were produced by resynthesizing the recorded speech sounds with a broadband noise and lifting the spectrum 6 dB per octave to match the spectral slope of whispered speech ( Fujimura and Lindqvist, 1971 ). For both experiments, spoken syllables were concatenated to form syllable sequences. Example syllable sequences for both experiments are available online as supplemental data (available at www.jneurosci.org as supplemental material). In experiment 1 (supplemental Fig. S1, available at www.jneurosci.org as supplemental material), sequences lasted 9.44 s and contained eight syllabic events (680 ms stimulus, 500 ms pause). In experiment 2 (supplemental Fig. S1, available at www.jneurosci.org as supplemental material), all syllable sequences lasted 8.4 s and contained six syllabic events (1100 ms stimulus, 300 ms pause). Before each sequence, participants received a visual instruction to perform either a speech recognition task (“speech task”) or a control task (which was a “loudness task” in experiment 1 and a “speaker task” in experiment 2) (see below).

Experimental design

Experiment 1.

Experiment 1 was a 2 × 2 × 2 factorial design with the factors VTL (VTL varies/VTL fixed), task (speech task/loudness task), and glottal fold parameters (voiced/whispered) (supplemental Fig. S1, available at www.jneurosci.org as supplemental material). It was used to address three questions that we will detail in the following.

Do VTL-sensitive regions in posterior STG/STS also participate in speech recognition tasks?

To locate VTL-sensitive regions, half of the syllable sequences contained syllable events that differed in vocal tract length (VTL varies); during the other half, the VTL of the speaker was fixed (VTL fixed). VTL values were resynthesized to range from 10.6 to 21.7 cm in eight equal logarithmic steps. To investigate responses to speech recognition, we included a speech task (speech task) and a control task (loudness task) in the design. In the speech task, subjects indicated via button press whether the current syllable was different from the previous one. In the loudness task, subjects indicated via button press whether the level of the current syllable was different from the previous one. Within each syllable sequence, there were three different syllables (e.g., /ga/, /ke/, /la/; /mu/, /mi/, /ka/; etc.) and three different values of sound level [values differed by 9–12 dB sound pressure level (SPL)]. The changes in syllable and sound level were independent. Each sequence (with a specific stimulus combination) always occurred twice, once in the speech task and once in the loudness task. To address the question whether posterior STG/STS responds to VTL as well as to the speech task, we tested regions responsive to VTL (“main effect of VTL”), to the speech task (“main effect of task”), as well as the interaction between the two (“VTL × task”). In this interaction, we were specifically interested in regions responding more to a speech task when VTL varied while controlling for stimulus as well as for task effects: (VTL varies/speech task > VTL varies/loudness task) > (VTL fixed/speech task > VTL fixed/loudness task).

Do VTL-sensitive regions in posterior STG/STS respond differently to different glottal fold parameters, i.e., voiced and whispered speech?

Glottal fold parameters do not influence the formant position in the speech spectrum ( Fig. 1 ). Therefore, if posterior STG/STS contains a mechanism for formant processing, responses to voiced and whispered speech should be similar in this region. Half of the syllable sequences in experiment 1 were voiced (fundamental frequency set at 160Hz) and half of them were whispered. To check whether VTL-sensitive regions in posterior STG/STS respond similarly to voiced and whispered speech, we used two approaches. First, we performed contrasts for the main effect of VTL, the main effect of task, and the VTL × task interaction separately for voiced and whispered speech and entered these contrasts in a second-level t statistic for a conjunction analysis (e.g., conjunction of “main effect of VTL voiced” and “main effect of VTL whispered”). Second, we tested the interaction between the contrasts of interest with the factor “glottal fold parameter” at a relatively low statistical threshold ( p = 0.01 uncorrected).

Where are glottal fold parameters processed in the human brain?

The inclusion of voiced and whispered speech permits a test of where these two glottal fold parameters are processed differentially in the human brain. We probed the “main effect of glottal fold parameter” in both directions, i.e., voiced > whispered and whispered > voiced.

In summary, experiment 1 had a 2 × 2 × 2 factorial design with eight experimental conditions: (1) speech task, VTL varies, whispered; (2) speech task, VTL varies, voiced; (3) speech task, VTL fixed, whispered; (4) speech task, VTL fixed, voiced; (5) loudness task, VTL varies, whispered; (6) loudness task, VTL varies, voiced; (7) loudness task, VTL fixed, whispered; (8) loudness task, VTL fixed, voiced. The experiment also included a silence condition. The order of conditions was randomized.

Experiment 2

Experiment 2 was a 2 × 2 factorial design with the factors VTL (VTL varies/GPR varies) and task (speech task/speaker task) (supplemental Fig. S1, available at www.jneurosci.org as supplemental material). It was designed to complement experiment 1 by addressing the following two questions.

Is VTL-sensitive posterior STG/STS specifically processing formant information?

When listening to sequences with varying VTL, subjects usually have the impression that the speech sounds are produced by different synthetic speakers (E. Gaudrain, S. Li, V. S. Ban, R. D. Patterson, unpublished observations). In contrast, the sequences with fixed VTL are perceived as spoken by the same speaker. The high-level percept of different speakers is a confound when investigating the acoustic effect of vocal tract length. By acoustic effect, we mean the speaker-related shift in formant positions. In experiment 2, half of the syllable sequences were spoken by speakers that differed in VTL, and the other half was spoken by speakers that differed in the vibration rate of the glottal folds (GPR). The GPR and VTL values (GPR: 95, 147, 220 Hz; VTL: 9.1, 13.6, 20.3 cm) were chosen because preliminary behavioral studies indicated that subjects perceive these values as a change of speaker rather than a change of the voice characteristics within the speaker. GPR changes affect the pitch of the syllable but do not alter the formant positions. In contrast, VTL changes shift the formant frequencies but not the pitch. Thus, only VTL information is intermingled with the formant information determining the speech message (e.g., /a/), whereas GPR information is independent ( Fig. 1 ). To test whether posterior STG/STS is processing speaker-related formant information, we probed the main effect of VTL (i.e., VTL varies > GPR varies).

Is VTL-sensitive posterior STG/STS specifically modulated by the speech task?

During a speech task, subjects might automatically process the speaker characteristics of the stimulus significantly more than they do in a comparable loudness task. This could potentially explain differential responses in experiment 1 for the main effect of task (speech task > loudness task) and the VTL × task interaction [(VTL varies/speech task > VTL varies/loudness task) > (VTL fixed/speech task > VTL fixed/loudness task)]. Such an explanation would counter our hypothesis that posterior STG/STS is responding to speaker-specific formant information to use this information for speech recognition. Accordingly, in experiment 2, we included not only a speech task but also a speaker task. In the speech task, subjects indicated via button press whether the current syllable was different from the previous one. In the speaker task, subjects indicated via button press whether the current speaker was different from the previous one. Subjects were asked to only score two consecutive syllable events as different if they clearly perceived a change of speaker rather than a change of the voice of one speaker. Within each sequence, there were three different syllables (e.g., /aga/, /ake/, /ala/; or /esi/, /elu/, /ero/; etc.) and three different speakers (i.e., different VTLs or different GPRs). Changes in syllable and speaker were independent. Each sequence (with a specific stimulus combination) always occurred twice, once in the speech task and once in the speaker task.

To test whether VTL-sensitive posterior STG/STS is specifically involved in speech recognition, we tested the blood oxygen level-dependent (BOLD) signal changes for the contrast main effect of task (i.e., in the direction speech > speaker task) and for the task × VTL interaction, i.e., (VTL varies/speech task > VTL varies/speaker task) > (GPR varies/speech task > GPR varies/speaker task).

In summary, experiment 2 was a 2 × 2 factorial design with four conditions: (1) speech task, VTL varies; (2) speech task, GPR varies; (3) speaker task, VTL varies; (4) speaker task, GPR varies. The experiment additionally included a silence condition. The order of conditions was randomized.

Participants

Eighteen subjects participated in experiment 1 (all right handed; 10 female, 8 male; aged 19–40 years; mean age of 26 years; native language: 15 English, 2 German, 1 Spanish). In experiment 2, 14 subjects were included in the analysis (all right handed; 5 female, 9 male; aged 20–37 years; mean age of 26 years; native language: 11 English, 2 German, 1 Spanish). All subjects were proficient in English and had been living in the United Kingdom for at least 3 years at time of testing. None of the subjects was trained in a tone language. Five additional subjects were excluded from experiment 2 to match behavioral performance for the different conditions across the group (supplemental Table S1, available at www.jneurosci.org as supplemental material). All subjects gave informed consent, and the experiment was performed with the approval of the Institute of Neurology Ethics Committee (London, UK). None of the subjects had any history of neurological or psychiatric disorder. All subjects reported having normal hearing, and they all had normal structural MRI brain scans.

Scanner setup

The stimuli were delivered using a custom electrostatic system at 70 dB SPL. After each syllable sequence, functional gradient-echo planar images (EPIs) were acquired [sparse imaging ( Hall et al., 1999 )] on a 3 T scanner (42 slices; −5° tilt; slice thickness of 2 mm, interslice distance of 1 mm; cardiac triggering; Siemens). Because of the cardiac triggering, there was a variable scan repetition time (time to repeat, 2.73 s + length of stimulus presentation + time to next pulse; time to echo, 65 ms). The 42 transverse slices of each brain volume covered the entire brain. The task instruction was presented during the last 10 slice acquisitions of each volume. It was followed by a fixation cross displayed during the subsequent stimulus sequence. Experiment 1 included 222 brain volumes for each subject (3 runs of 74 volumes each). Experiment 2 included 210 brain volumes for each subject (5 runs of 42 volumes each). Subjects were allowed to rest for several minutes between runs. The first two volumes were discarded from each run. Thus, there were 24 volumes for each of the eight experimental conditions plus 24 volumes for the silence condition in experiment 1. In experiment 2, there were 40 volumes for each of the four experimental conditions plus 40 volumes for the silence condition.

Data analysis

The behavioral data were analyzed using SPSS 12.02.

Imaging data were analyzed using the statistical parametric mapping package (SPM5; http://www.fil.ion.ucl.ac.uk/spm ). Scans were realigned, unwarped, and spatially normalized ( Friston et al., 1995a ) to Montreal Neurological Institute (MNI) standard stereotactic space ( Evans et al., 1993 ) and spatially smoothed with an isotropic Gaussian kernel of 8 mm full-width at half-maximum.
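As an illustration of the smoothing step only (not the authors' SPM5 pipeline), the sketch below applies an 8 mm FWHM Gaussian kernel to a single normalized 3-D volume; the file name and the 2 mm isotropic voxel size are assumptions for the example.

```python
import numpy as np
import nibabel as nib
from scipy.ndimage import gaussian_filter

FWHM_MM = 8.0                          # kernel width used in the paper
VOXEL_MM = np.array([2.0, 2.0, 2.0])   # assumed isotropic voxel size (illustrative)

# A Gaussian's FWHM relates to its standard deviation by FWHM = 2*sqrt(2*ln 2)*sigma.
sigma_vox = (FWHM_MM / (2.0 * np.sqrt(2.0 * np.log(2.0)))) / VOXEL_MM

img = nib.load("epi_normalized.nii")           # hypothetical normalized EPI volume (3-D)
smoothed = gaussian_filter(img.get_fdata(), sigma=sigma_vox)
nib.save(nib.Nifti1Image(smoothed, img.affine, img.header), "epi_smoothed.nii")
```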

Activity analyses

Statistical parametric maps were generated by modeling the evoked hemodynamic response for the different stimuli as boxcars convolved with a synthetic hemodynamic response function in the context of the general linear model ( Friston et al., 1995b ). Population-level inferences concerning BOLD signal changes between conditions of interest were based on a random-effects model that estimated the second-level t statistic at each voxel. To display common activations for similar contrasts in both experiments (see Figs. 2, 4), we entered the contrasts of interest (e.g., “speech task > loudness task” and “speech task > speaker task”) for each subject in a second-level t statistic and performed a conjunction across the two contrasts.
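For readers unfamiliar with this step, the following is a minimal sketch of how one column of such a design matrix can be built: a condition boxcar convolved with a canonical double-gamma HRF. The HRF parameters, onsets, and durations are illustrative assumptions, not the study's actual design (which used sparse, cardiac-triggered acquisition with a variable TR).

```python
import numpy as np
from scipy.stats import gamma

def canonical_hrf(t, peak=6.0, undershoot=16.0, ratio=1.0 / 6.0):
    """Double-gamma haemodynamic response function (common default parameters)."""
    h = gamma.pdf(t, peak) - ratio * gamma.pdf(t, undershoot)
    return h / h.sum()

tr, n_scans = 2.73, 200                      # nominal scan spacing and count (toy values)
frame_times = np.arange(n_scans) * tr

# Boxcar: 1 while a given condition's syllable sequence is on, 0 otherwise (toy onsets).
boxcar = np.zeros(n_scans)
for onset in np.arange(0.0, n_scans * tr, 60.0):
    boxcar[(frame_times >= onset) & (frame_times < onset + 8.0)] = 1.0

hrf = canonical_hrf(np.arange(0.0, 32.0, tr))
regressor = np.convolve(boxcar, hrf)[:n_scans]   # one column of the GLM design matrix
```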

Figure 2. BOLD responses associated with the main effect of VTL (red) and main effect of task (green) as revealed by the conjunction analysis of experiment 1 and experiment 2. The group mean structural image is overlaid with the statistical parametric maps for the respective contrasts. “Control task” refers to loudness task in experiment 1 and to speaker task in experiment 2. L, Left hemisphere; VTL, acoustic effect of vocal tract length. The dotted lines on the sagittal section indicate the slices displayed as horizontal and coronal sections. The plots show the parameter estimates for experiments 1 and 2 separately. The small bar graphs on top of the plots display the main effects and their significance threshold in a repeated-measures ANOVA. Results of post hoc t tests are indicated by the brackets within the plot. *p < 0.05, ***p < 0.001. ns, Nonsignificant. Error bars represent ±1 SEM.

Figure 4. Overview of BOLD responses in the right and left hemisphere. This figure also includes the BOLD responses reported in a previous study ( von Kriegstein et al., 2007 ). The right-sided activation for the previous study is shown at a threshold of p < 0.003 for display purposes. The voxel with the maximum statistic for that previous study is at (60, −42, −2), Z = 3.12.

Connectivity analyses (psychophysiological interactions)

Based on our activity results (see Results), we hypothesized that left and right STG/STS are functionally connected when recognizing speech in the context of changing speakers. To test this hypothesis, we performed psychophysiological interaction (PPI) analyses ( Friston et al., 1997 ). We selected the left posterior STG/STS as the seed region of interest, identified at the individual subject level using the main effect of VTL (supplemental Table S2, available at www.jneurosci.org as supplemental material). Subjects for whom a cluster of the main effect could be localized in left posterior STG/STS were included in the PPI analyses (experiment 1, n = 17; experiment 2, n = 11; Z-score > 1.5). We extracted the first eigenvariate from these clusters (PPI-seed regions). PPI regressors were created using routines implemented in SPM5. The psychological variables were the interaction contrasts: (syllable task/VTL varies > syllable task/VTL fixed) > (loudness task/VTL varies > loudness task/VTL fixed) in experiment 1 and (syllable task/VTL varies > syllable task/GPR varies) > (speaker task/VTL varies > speaker task/GPR varies) in experiment 2. PPI regressor, psychological variable, and first eigenvariate were entered into a design matrix at the single-subject level. Population-level inferences about BOLD signal changes were based on a random-effects model that estimated the second-level statistic at each voxel using a one-sample t test.
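The sketch below shows, in simplified form, how the three PPI regressors are assembled. It is an approximation for illustration: SPM's own routines deconvolve the seed time course to the neuronal level before forming the product, a step omitted here, and the variable names are placeholders.

```python
import numpy as np

def ppi_regressors(seed_ts, psych):
    """Return [interaction, psychological, physiological] regressors for a basic PPI model.

    seed_ts : first eigenvariate of the seed region (one value per scan)
    psych   : psychological variable, e.g., +1/-1 coding of the task-by-VTL interaction contrast
    """
    seed = seed_ts - seed_ts.mean()
    psych = psych - psych.mean()
    ppi = seed * psych                       # psychophysiological interaction term
    return np.column_stack([ppi, psych, seed])

# Toy example: 200 scans split between the two interaction cells.
rng = np.random.default_rng(0)
design = ppi_regressors(rng.standard_normal(200), np.repeat([1.0, -1.0], 100))
```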

Significance thresholds for fMRI data and anatomical hypotheses

For each contrast, responses were considered significant at p < 0.001, uncorrected, if the localization of activity was in accordance with a priori anatomical hypotheses. Anatomical hypotheses for the main effect of VTL, the main effect of task, and the interaction between the two were restricted to STG/STS based on previous studies on VTL processing ( von Kriegstein et al., 2006 , 2007 ). Hypotheses for the PPI interactions were restricted to right STG/STS. Based on studies investigating pitch processing with complex artificial nonspeech sounds, the hypotheses for BOLD responses related to pitch in the present experiment (GPR) were restricted to anterolateral Heschl's gyrus and planum polare ( Griffiths et al., 2001 ; Patterson et al., 2002 ; Penagos et al., 2004 ; Bendor and Wang, 2005 ). Otherwise, responses were considered significant at p < 0.05 familywise error (fwe) corrected.

Regions are claimed to overlap if they adhere to both of the following criteria: (1) overlap on visual inspection given the above significance criteria and (2) activation by one contrast (i.e., speech task > control task) is significant in a region of interest defined by another contrast (i.e., VTL varies > VTL fixed) at p < 0.05 fwe corrected.

In the text, we only refer to activations that conform to these significance criteria. All other regions at p < 0.001 uncorrected are listed in the tables for the respective contrast.

To plot percentage signal changes for significant activations, we extracted the parameter estimates from the region of interest at the voxel in which we found the maximum value of the statistic. These values were then plotted using SPSS 12.02.

We begin with the contrasts relevant to the hypothesis that posterior STG/STS is responsive to speaker-specific formant information (VTL) and at the same time contributes to speech recognition. These contrasts are partly also relevant to the second hypothesis, namely that VTL is processed in different regions from glottal fold parameters. This is tested by the contrasts VTL varies > GPR varies and GPR varies > VTL varies in experiment 2 and by testing for differences in BOLD activation between conditions with the two glottal fold settings, voiced and whispered speech, in experiment 1.

Speaker-related vocal tract changes are processed in posterior STG/STS: VTL varies > VTL fixed (experiment 1) and VTL varies > GPR varies (experiment 2)

In experiment 1, the contrast VTL varies > VTL fixed revealed BOLD responses along bilateral STG and STS (supplemental Table S3, Fig. S2, red, available at www.jneurosci.org as supplemental material). In experiment 2, the contrast VTL varies > GPR varies revealed responses in left posterior STG/STS (supplemental Table S3, Fig. S3, red, available at www.jneurosci.org as supplemental material). Activation in the right posterior STG/STS showed only a trend to significance ( Z = 2.9) (supplemental Table S3, available at www.jneurosci.org as supplemental material). A conjunction analysis involving these two contrasts is displayed in Figure 2 (red), showing the common activation in left posterior STG/STS [MNI coordinates: (−58, −50, 10) and (−60, −36, 10)].

In experiment 1, behavioral performance was better in conditions with fixed VTL compared with varying VTL (main effect of VTL, F (1,17) = 41, p < 0.0001). In experiment 2, the behavioral performance was matched for the two conditions ( F (1,13) = 2.408, p = 0.15) ( Table 1 ).

Table 1. Behavioral results

The table presents the percentage correct responses over the group for each condition of the experiments. Although there were performance differences in experiment 1 between the different conditions (see description in Results), experiment 2 was matched for behavioral performance across all conditions.

Left posterior VTL-sensitive STG/STS is modulated by speech recognition: Speech task > loudness task (experiment 1) and speech task > speaker task (experiment 2)

There is more activation in STG/STS during the speech task than in the loudness task (experiment 1) (supplemental Fig. S2, green, available at www.jneurosci.org as supplemental material) and also in contrast to the speaker task (experiment 2) (supplemental Fig. S3, green, available at www.jneurosci.org as supplemental material). All activated regions are listed in supplemental Table S4 (available at www.jneurosci.org as supplemental material). A conjunction analysis of the two contrasts is displayed in Figure 2 (green) showing the common activation in left STG/STS. In both experiments, activation for the speech task, in contrast to the control task, overlaps with the main effect for VTL in left posterior STG/STS ( Fig. 2 ) (supplemental Figs. S2, S3, green and red overlap, available at www.jneurosci.org as supplemental material).

At the behavioral level, in experiment 1, the speech task was easier than the control task, i.e., the loudness task ( F (1,17) = 157, p < 0.001) ( Table 1 ). Behavioral performance in experiment 2 was the same for the syllable and the control task, i.e., the speaker task ( F (1,13) = 0.246, p = 0.63) ( Table 1 ).

Right posterior STG/STS is modulated by speech recognition when speaker-related vocal tract parameters vary

Task × VTL interaction (experiment 1).

We tested the following interaction: (speech task/VTL varies > speech task/VTL fixed) > (loudness task/VTL varies > loudness task/VTL fixed). For this contrast, we found enhanced BOLD responses in right posterior STG/STS ( Fig. 3 , magenta) (supplemental Table S5, available at www.jneurosci.org as supplemental material).

Figure 3. BOLD responses associated with the interaction between task and VTL. The contrast for experiment 1 is rendered in magenta and for experiment 2 in cyan. The plots show the parameter estimates for experiments 1 and 2 separately [MNI coordinates: experiment 1, (52, −22, 0); experiment 2, (68, −42, 16)]. The small bar graphs on top of the plots show the significant interaction and main effects and their significance threshold in a repeated-measures ANOVA. Results of post hoc t tests are indicated by the brackets within the plot. *p < 0.05. ns, Nonsignificant. Error bars represent ±1 SEM.

At the behavioral level, there was no interaction between task and VTL ( F (1,17) = 17, p = 0.73).

In contrast to results of a previous fMRI study on speaker normalization ( Wong et al., 2004 ), the right hemispheric activation in our study can be explained by neither increased task difficulty nor activity attributable to processing voice information per se (the stimulus input was exactly the same during the speech and loudness tasks). One potential reason, however, for differential responses in right STG/STS is an implicit processing of voice characteristics, i.e., subjects might automatically process the speaker characteristics of the stimulus significantly more during a speech task compared with a loudness task. In experiment 2, we control additionally for this potential confound by using a speaker task as control task.

Task × VTL interaction (experiment 2)

The speaker task in experiment 2 focuses attention explicitly on voice characteristics, which controls for implicit processing of voice characteristics during the speech task. We examined the interaction (speech task/VTL varies > speech task/GPR varies) > (speaker task/VTL varies > speaker task/GPR varies). For this contrast, there are enhanced BOLD responses in right posterior STG/STS in a similar location to the responses in experiment 1 ( Fig. 3 , cyan) (supplemental Table S5, available at www.jneurosci.org as supplemental material). At the behavioral level, there was no VTL × task interaction ( F (1,13) = 1.4, p < 0.3).

The right posterior STG/STS is commonly activated for the interactions in experiments 1 and 2 (supplemental Table S5, available at www.jneurosci.org as supplemental material). The conjunction analysis for the two experiments is displayed in Figure 4 (right panel, red). In addition, Figure 4 relates the current findings to findings of a previous study ( von Kriegstein et al., 2007 ) ( Fig. 4 , blue). The contrast shows activations that are specific to VTL information in speech in contrast to similar acoustic changes in another communication sound, e.g., frog calls. VTL responsive regions in that study were located in right posterior STG/STS [(60, −42, −2), Z = 3.12; (60, −34, 4), Z = 2.99] ( Fig. 4 , right panel, blue) and in left posterior STG/STS [(−60, −32, 6), Z = 3.74; (−60, −48, 14), Z = 3.23] ( Fig. 4 , left panel, blue).

Responses in posterior STG/STS are similar for voiced and whispered speech

The percentage signal change in left posterior STG/STS is similar for the main effect VTL (VTL varies > VTL fixed) in voiced speech and in whispered speech (supplemental Fig. S4, red bars, available at www.jneurosci.org as supplemental material), and also for the main effect of task (speech task > loudness task) (supplemental Fig. S4, green bars, available at www.jneurosci.org as supplemental material). The percentage signal change for the interaction in right STG/STS is also similar for voiced and whispered speech (supplemental Fig. S4, magenta bars, available at www.jneurosci.org as supplemental material).

Functional connectivity between left and right posterior STG/STS is increased during speech recognition when speaker-related vocal tract parameters vary: PPI analyses: task × VTL interaction (experiments 1 and 2)

We tested the functional connectivity of left posterior STG/STS for the following psychological variables: (speech task/VTL varies > speech task/VTL fixed) > (loudness task/VTL varies > loudness task/VTL fixed) in experiment 1 and (speech task/VTL varies > speech task/GPR varies) > (speaker task/VTL varies > speaker task/GPR varies) in experiment 2. The analyses reveal that activity in VTL-sensitive left posterior STG/STS (seed region) ( Fig. 5 red) has a stronger correlation to activity in right posterior STG/STS (target region) ( Fig. 5 , green) when recognizing speech from varying speakers than when recognizing speech from the same speaker. Importantly, this connectivity increase is specific to speech recognition in the context of changing speakers, because we use the task × VTL interaction as psychological variable in the PPI. Experiment 2 additionally shows that enhanced connectivity between left and right posterior STG/STS during speech recognition is attributable to speaker VTL changes rather than speaker GPR changes. In both experiments, the PPI target region is located in consistently close proximity posterior to regions showing enhanced activity in the task × VTL interactions [ Fig. 5 , magenta; the same contrast is also displayed in Fig. 3 (magenta, experiment 1; cyan, experiment 2)].

Figure 5. Functional connectivity (PPI) between left and right posterior STG/STS. Seed regions were taken from individual subject clusters; here the group mean is shown (red). Target regions identified by the PPI analysis (VTL × task, connectivity) are shown in green [MNI coordinates: experiment 1, (58, −46, 20), Z = 3.03; experiment 2, (60, −52, 20), Z = 3.26]. BOLD responses associated with the interaction between task and VTL (VTL × task, activity) are displayed to demonstrate their consistently close proximity to PPI target regions in right posterior STG/STS.

Glottal fold parameters are processed along Heschl's gyrus

Voiced > whispered and whispered > voiced (experiment 1).

Contrasting all conditions containing voiced sounds with all conditions containing whispered sounds reveals an enhanced BOLD response adjacent to primary auditory cortex in anterolateral Heschl's gyrus (auditory cortex area Te1.2) ( Morosan et al., 2001 ) of both hemispheres ( Fig. 6 , red) (supplemental Table S6, available at www.jneurosci.org as supplemental material). The reverse contrast (whispered > voiced) reveals responses in and around the posteromedial end of Heschl's gyrus (Te1.1) in both hemispheres (familywise error corrected, p < 0.05) ( Fig. 6 , yellow) (supplemental Table S6, available at www.jneurosci.org as supplemental material). There was a significant location (Te1.1, Te1.2) × glottal fold parameter (voiced/whispered) interaction ( F (1,17) = 123, p < 0.001) ( Fig. 6 , plot).

Figure 6. BOLD responses for voiced and whispered speech. The group mean structural image is overlaid with the statistical parametric maps for the contrasts between (1) voiced > whispered speech (red), (2) whispered > voiced speech (yellow), and (3) pitch varies > VTL varies (cyan). The plot shows parameter estimates for voiced and whispered speech in Te1.2 and Te1.1 (volume of interest). Error bars represent ±1 SEM. A repeated-measures ANOVA with the factors location (Te1.1, Te1.2) and sound quality (voiced, whispered) revealed a significant interaction of sound quality × location ( F (1,17) = 28, p < 0.0001), indicating differential responsiveness to whispered sounds in Te1.1 and to voiced sounds in Te1.2. ***p < 0.001.

Behavioral performance was better, on average, for voiced than whispered speech ( F (1,17) = 38, p < 0.001) ( Table 1 ). This difference was attributable to performance differences in the loudness task [task × glottal fold parameter interaction, F (1,17) = 53, p < 0.0001; post hoc paired t tests: whispered > voiced in the loudness task (size fixed, t = 5.6, p < 0.0001; size variable, t = 4.8, p < 0.001); whispered > voiced in the speech task (size fixed, t = −1.9, p < 0.08; size variable, t = −0.7, p < 0.5)]. We probed a task × glottal fold parameter interaction to check whether the pattern of BOLD responses in the main effect of glottal fold parameter is attributable to these differences in behavioral performance. There was no such interaction in Heschl's gyrus even at a low statistical threshold ( p = 0.05 uncorrected). Furthermore, we found that contrasts, for which the behavioral performance is similar (speech task whispered > speech task voiced; speech task voiced > speech task whispered), reveal the same pattern of responses as the main effect in Te1.1 and Te1.2.

GPR varies > VTL varies (experiment 2)

Differences in the rate of glottal fold vibration (i.e., GPR) result in speech with different fundamental frequencies, which is heard as voices with different pitch. BOLD responses for the contrast of all conditions in which GPR varies with all conditions in which VTL varies (GPR varies > VTL varies) partly overlap with those for the contrast voiced > whispered but extend farther along the superior temporal plane ( Fig. 6 , cyan). The behavioral performance is matched for this contrast ( Table 1 ).

Our results show that speaker-related vocal tract parameters, which influence the formant position of the speech signal, are processed in posterior STG/STS. In contrast, speaker-related glottal fold parameters, which do not influence the formant position, are processed in areas immediately adjacent to primary auditory cortex, i.e., Te1.0 ( Kaas and Hackett, 2000 ; Morosan et al., 2001 ). Vocal tract parameter-sensitive areas in posterior STG/STS are also involved in speech recognition. Left posterior STG/STS is (1) responsive to changes in vocal tract parameters (main effect of VTL) and (2) modulated by a speech recognition task (main effect of task). Right posterior STG/STS is modulated by the speech task only if vocal tract length varies but not if glottal fold parameters vary (VTL × task interaction). Functional connectivity between left and right posterior STG/STS is increased when recognizing speech from different speakers.

Representation of speaker-related acoustic variability

Vocal tract parameters.

The experiments reported here are in accordance with previous studies investigating speaker-related vocal tract parameters (i.e., VTL) ( von Kriegstein et al., 2006 , 2007 ). For all studies, the maximum of BOLD responses to VTL changes occurs in left posterior STG/STS. In all studies, there is also activation in similar STG/STS areas of the right hemisphere at a relatively lower, sometimes nonsignificant, statistical threshold (for experiments 1 and 2, see Table S3, available at www.jneurosci.org as supplemental material) (for a previous experiment, see Fig. 4 ).

Glottal fold parameters

In voiced speech, the glottal pulse rate is perceived as voice pitch. Studies that contrast pitch-producing nonspeech sounds with spectrally matched noises reveal differential activation in anterolateral Heschl's gyrus (Te1.2) adjacent to primary auditory cortex ( Griffiths et al., 2001 ; Patterson et al., 2002 ; Penagos et al., 2004 ). The differential activation to voiced over whispered speech overlaps with this putative pitch processing area ( Fig. 6 , red). Furthermore, differential activation in a region adjacent to lateral Heschl's gyrus (Te1.2) for GPR (pitch)-varying versus VTL-varying syllable sequences ( Fig. 6 , cyan) complements similar findings for artificial sounds ( Patterson et al., 2002 ). These findings imply that voice pitch is processed in similar areas as the pitch of nonspeech sounds. The results are in accordance with the assumption that there are increasingly independent representations of pitch (here elicited by GPR) and timbre (here elicited by VTL) beyond primary auditory cortex ( Nelken and Bar-Yosef, 2008 ; Bizley et al., 2009 ).

The inclusion of whispered conditions reveals a surprising result: whispered speech produces differential activation, not in primary auditory cortex but in regions immediately adjacent to it: posteromedial Heschl's gyrus (Te1.1) ( Fig. 6 , yellow). In whispered speech, the constriction of the glottal folds produces noise that is, after passing through the vocal tract, perceived as whispered speech ( Abercrombie, 1967 ). Previous studies involving noise bursts never found comparable effects ( Griffiths et al., 2001 ; Patterson et al., 2002 ; von Kriegstein et al., 2006 ). Whispered speech in the current study is technically noise, but its spectrum contains formants. Because the voiced and whispered stimuli are resynthesized from the same recordings, the conditions are precisely matched with regard to spectral characteristics. We speculate that the activity in Te1.1 for whispered speech is not noise processing per se but noise that is specifically being processed as a communication signal.

Vocal tract information in speech recognition

One of the present key findings is that regions responding to changes in vocal tract length in posterior STG/STS are also involved in speech recognition.

There is neurophysiological evidence that relatively fast changing aspects of auditory input, which are relevant for speech perception, are processed in left-hemispheric temporal lobe areas, whereas slower changing information, e.g., identity, is predominantly processed in the right temporal lobe ( Poeppel, 2003 ; von Kriegstein et al., 2003 ; Belin et al., 2004 ; Boemio et al., 2005 ; Giraud et al., 2007 ; Overath et al., 2007 ; Abrams et al., 2008 ; Lewis et al., 2009 ). In view of this dichotomy, an involvement of left-hemispheric areas in speech recognition, as found in the current study, is expected, but left-hemispheric processing of speaker-related vocal tract parameters is unexpected. Conversely, processing of speaker-related parameters in right STG/STS is expected, but involvement of right-hemispheric areas in speech recognition (compared with a high-level control condition) is surprising ( Scott, 2005 ; Vigneau et al., 2006 ; Leff et al., 2008 ).

Why are regions in bilateral STG/STS responsive to changes in speaker-related vocal tract parameters and to speech recognition? It has been suggested that speech recognition involves both hemispheres but with computational differences between the hemispheres ( Boatman et al., 1998 ; Hickok and Poeppel, 2000 ; Boatman, 2004 ; Boatman et al., 2006 ). The exact nature of these differences is unclear. In the following, we provide a speculative theoretical account that explains our results in terms of distinct but coupled mechanisms in the left and right hemisphere.

A potential mechanism for dealing with speaker-related vocal tract variability in speech recognition

In speech recognition, the brain needs to decode a fast-varying, information-rich, auditory input stream online. Theoretical accounts of brain function suggest that online recognition can be accomplished using dynamic models of the environment that predict sensory input ( Knill et al., 1998 ; Friston, 2005 ; Kiebel et al., 2008 ). There is increasing evidence that the brain uses such a mechanism ( Wolpert et al., 1995 ; Rao and Ballard, 1999 ; Bonte et al., 2006 ; Summerfield et al., 2006 ; von Kriegstein and Giraud, 2006 ; Overath et al., 2007 ). This scheme might be especially powerful if information changing at slower time scales predicts information at faster time scales ( Kiebel et al., 2008 ; Balaguer-Ballester et al., 2009 ). For example, knowledge of the relatively constant vocal tract length of a speaker helps, among other constraints, to identify possible formant positions determining the phonemes of that speaker. Prediction of speech trajectories also implies that the dynamic uncertainty about speaker- and speech-related parameters is encoded. Dynamic uncertainty measures are valuable for online recognition because they preclude premature interpretations of speech input.

A prediction mechanism, which is based on knowledge about speaker characteristics, would prove useful in everyday conversational situations in which the speaker does not change rapidly. Such a scheme, which exploits the temporal stability of speaker parameters, would explain findings that speech from the same speaker is more intelligible than speech from changing speakers ( Creelman, 1957 ; Mullennix et al., 1989 ; Pisoni, 1997 ). In this view, brain regions that encode speaker-specific parameters and dynamic uncertainty about these are critical in a speech recognition network ( von Kriegstein et al., 2008 ). This would explain our findings (1) that bilateral posterior STG/STS is involved in both processing changes in vocal tract parameters and speech recognition and (2) that functional connectivity between left and right posterior STG/STS increases during speech recognition in the context of changing speakers.

We hypothesize that the right posterior STG/STS activation reflects the extraction of speaker parameters (here VTL), which is used by an internal model to effectively recognize the speech message. A change in VTL, during speech recognition, would prompt an adjustment of the vocal tract parameters of the internal model. This additional processing would not be necessary when VTL is fixed. We assume that VTL is just one of many speaker-related parameters that can be used to adjust an internal model. Other relevant parameters may include speaking rate ( Adank and Devlin, 2010 ), visual information (e.g., face), and social information about the speaker (e.g., accent). Furthermore, in tone languages GPR changes are used to mark both message and speaker changes ( Wong and Diehl, 2003 ). We speculate that, especially in these languages, GPR-sensitive regions in the right hemisphere provide information about the speaker-related variation of pitch to the left hemisphere.

In line with the assumption that left- and right-hemispheric function is specialized for distinct time scales in speech ( Poeppel, 2003 ; Boemio et al., 2005 ; Giraud et al., 2007 ; Overath et al., 2008 ), we speculate that the left posterior STG/STS deals with vocal tract dynamics at a short time scale, e.g., at the length of one syllable or shorter. In this view, the main function of this area is not to determine the vocal tract parameters. Rather, left posterior STG/STS uses speaker-related vocal tract parameters, probably in part provided by right STG/STS, to represent fast vocal tract dynamics for speech recognition. Because certainty about VTL will lend more certainty to the representation of fast vocal tract dynamics, a sudden speaker change will lead to increased uncertainty about the fast speech dynamics. We assume that this burst in uncertainty triggers adjustment processes in the representation of fast speech dynamics, which explains why the speech-processing left STG/STS is also sensitive to speaker changes.

From the viewpoint of theoretical accounts of human and artificial speech processing, the proposed mechanism is a hybrid between abstract and exemplar models and captures the advantages of both (1) the ability to extract abstract features from the input and (2) the representation and use of speaker-related details during speech recognition. Such a scheme, which integrates abstract and exemplar approaches, may be implemented computationally using techniques as described previously ( Kiebel et al., 2009 ). The location of such mechanisms in posterior STG/STS is in line with the implication of this region in phonological representation of speech sounds ( Hickok and Poeppel, 2007 ; Desai et al., 2008 ; Leff et al., 2008 ), as well as the processing of visual vocal tract movements ( Puce et al., 1998 ; O'Toole et al., 2002 ). We hypothesize that right and left posterior STG/STS encode speech features at various time scales and serve recognition by using a comprehensive, internal speech model that is updated during a speaker change.

This work was supported by Volkswagen Stiftung Grant I/79 783, the Wellcome Trust, and the United Kingdom Medical Research Council Grants G9900362 and G0500221. We thank Tom C. Walters for helping with preparation of Figure 1 .

  • Abercrombie D. Edinburgh: Edinburgh UP; 1967. Elements of general phonetics. [ Google Scholar ]
  • Abrams DA, Nicol T, Zecker S, Kraus N. Right-hemisphere auditory cortex is dominant for coding syllable patterns in speech. J Neurosci. 2008; 28 :3958–3965. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Adank P, Devlin JT. On-line plasticity in spoken sentence comprehension: adapting to time-compressed speech. Neuroimage. 2010; 49 :1124–1132. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Adank P, van Hout R, Smits R. An acoustic description of the vowels of Northern and Southern Standard Dutch. J Acoust Soc Am. 2004; 116 :1729–1738. [ PubMed ] [ Google Scholar ]
  • Ames H, Grossberg S. Speaker normalization using cortical strip maps: a neural model for steady-state vowel categorization. J Acoust Soc Am. 2008; 124 :3918–3936. [ PubMed ] [ Google Scholar ]
  • Balaguer-Ballester E, Clark NR, Coath M, Krumbholz K, Denham SL. Understanding pitch perception as a hierarchical process with top-down modulation. PLoS Comput Biol. 2009; 5 :e1000301. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Belin P, Fecteau S, Bédard C. Thinking the voice: neural correlates of voice perception. Trends Cogn Sci. 2004; 8 :129–135. [ PubMed ] [ Google Scholar ]
  • Bendor D, Wang X. The neuronal representation of pitch in primate auditory cortex. Nature. 2005; 436 :1161–1165. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Bizley JK, Walker KM, Silverman BW, King AJ, Schnupp JW. Interdependent encoding of pitch, timbre, and spatial location in auditory cortex. J Neurosci. 2009; 29 :2064–2075. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Boatman D. Cortical bases of speech perception: evidence from functional lesion studies. Cognition. 2004; 92 :47–65. [ PubMed ] [ Google Scholar ]
  • Boatman D, Hart J, Jr, Lesser RP, Honeycutt N, Anderson NB, Miglioretti D, Gordon B. Right hemisphere speech perception revealed by amobarbital injection and electrical interference. Neurology. 1998; 51 :458–464. [ PubMed ] [ Google Scholar ]
  • Boatman DF, Lesser RP, Crone NE, Krauss G, Lenz FA, Miglioretti DL. Speech recognition impairments in patients with intractable right temporal lobe epilepsy. Epilepsia. 2006; 47 :1397–1401. [ PubMed ] [ Google Scholar ]
  • Boemio A, Fromm S, Braun A, Poeppel D. Hierarchical and asymmetric temporal sensitivity in human auditory cortices. Nat Neurosci. 2005; 8 :389–395. [ PubMed ] [ Google Scholar ]
  • Bonte M, Parviainen T, Hytönen K, Salmelin R. Time course of top-down and bottom-up influences on syllable processing in the auditory cortex. Cereb Cortex. 2006; 16 :115–123. [ PubMed ] [ Google Scholar ]
  • Creelman CD. Case of the unknown talker. J Acoust Soc Am. 1957; 29 :655–655. [ Google Scholar ]
  • Deng L, Dong Y, Acero A. Structured speech modeling. IEEE Trans Audio Speech Lang Processing. 2006; 14 :1492–1504. [ Google Scholar ]
  • Desai R, Liebenthal E, Waldron E, Binder JR. Left posterior temporal regions are sensitive to auditory categorization. J Cogn Neurosci. 2008; 20 :1174–1188. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Evans AC, Collins DL, Mills SR, Brown ED, Kelly RL, Phinney RE. 3D statistical neuroanatomical models from 305 MRI volumes. Proc IEEE Nucl Sci Symp Med Imag Conf. 1993; 3 :1813–1817. [ Google Scholar ]
  • Friederici AD. Towards a neural basis of auditory sentence processing. Trends Cogn Sci. 2002; 6 :78–84. [ PubMed ] [ Google Scholar ]
  • Friston K. A theory of cortical responses. Philos Trans R Soc Lond B Biol Sci. 2005; 360 :815–836. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Friston KJ, Ashburner J, Frith CD, Poline JB, Heather JD, Frackowiak RSJ. Spatial registration and normalisation of images. Hum Brain Mapp. 1995a; 2 :165–189. [ Google Scholar ]
  • Friston KJ, Holmes AP, Worsley KJ, Poline JP, Frith CD, Frackowiak RSJ. Statistical parametric maps in functional imaging: a general linear approach. Hum Brain Mapp. 1995b; 2 :189–210. [ Google Scholar ]
  • Friston KJ, Buechel C, Fink GR, Morris J, Rolls E, Dolan RJ. Psychophysiological and modulatory interactions in neuroimaging. Neuroimage. 1997; 6 :218–229. [ PubMed ] [ Google Scholar ]
  • Fujimura O, Lindqvist J. Sweep-tone measurements of vocal-tract characteristics. J Acoust Soc Am. 1971; 49 (Suppl 2):541+. [ PubMed ] [ Google Scholar ]
  • Giraud AL, Kleinschmidt A, Poeppel D, Lund TE, Frackowiak RS, Laufs H. Endogenous cortical rhythms determine cerebral specialization for speech perception and production. Neuron. 2007; 56 :1127–1134. [ PubMed ] [ Google Scholar ]
  • Goldinger SD. Words and voices: episodic traces in spoken word identification and recognition memory. J Exp Psychol Learn Mem Cogn. 1996; 22 :1166–1183. [ PubMed ] [ Google Scholar ]
  • Griffiths TD, Uppenkamp S, Johnsrude I, Josephs O, Patterson RD. Encoding of the temporal regularity of sound in the human brainstem. Nat Neurosci. 2001; 4 :633–637. [ PubMed ] [ Google Scholar ]
  • Hall DA, Haggard MP, Akeroyd MA, Palmer AR, Summerfield AQ, Elliott MR, Gurney EM, Bowtell RW. “Sparse” temporal sampling in auditory fMRI. Hum Brain Mapp. 1999; 7 :213–223. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Hickok G, Poeppel D. Towards a functional neuroanatomy of speech perception. Trends Cogn Sci. 2000; 4 :131–138. [ PubMed ] [ Google Scholar ]
  • Hickok G, Poeppel D. The cortical organization of speech processing. Nat Rev Neurosci. 2007; 8 :393–402. [ PubMed ] [ Google Scholar ]
  • Johnson K. Speaker normalization in speech perception. In: Pisoni DB, Remez RE, editors. The handbook of speech perception. Oxford: Blackwell Publishing; 2005. pp. 363–389. [ Google Scholar ]
  • Joos M. Acoustic phonetics. Language. 1948; 24 :1–136. [ Google Scholar ]
  • Kaas JH, Hackett TA. Subdivisions of auditory cortex and processing streams in primates. Proc Natl Acad Sci U S A. 2000; 97 :11793–11799. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Kawahara H, Masuda-Kasuse I, de Cheveigne A. Restructuring speech representations using pitch-adaptive time-frequency smoothing and instantaneous-frequency-based F0 extraction: possible role of repetitive structure in sounds. Speech Commun. 1999; 27 :187–207. [ Google Scholar ]
  • Kawahara H, Irino T, Divenyi P. Speech separation by humans and machines. Norwell, MA: Kluwer Academic; 2004. Underlying principles of a high-quality speech manipulation system STRAIGHT and its application to speech segregation; pp. 167–180. [ Google Scholar ]
  • Kiebel SJ, Daunizeau J, Friston KJ. A hierarchy of time-scales and the brain. PLoS Comput Biol. 2008; 4 :e1000209. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Kiebel SJ, von Kriegstein K, Daunizeau J, Friston KJ. Recognizing sequences of sequences. PLoS Comput Biol. 2009; 5 :e1000464. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Knill D, Kersten D, Yuille A, Richards W. Perception as Bayesian inference. Cambridge, UK: Cambridge UP; 1998. Introduction: a Bayesian formulation of visual perception; pp. 1–21. [ Google Scholar ]
  • Ladefoged P, Broadbent DE. Information conveyed by vowels. J Acoust Soc Am. 1957; 29 :98–104. [ PubMed ] [ Google Scholar ]
  • Lavner Y, Gath I, Rosenhouse J. The effects of acoustic modifications on the identification of familiar voices speaking isolated vowels. Speech Commun. 2000; 30 :9–26. [ Google Scholar ]
  • Leff AP, Schofield TM, Stephan KE, Crinion JT, Friston KJ, Price CJ. The cortical dynamics of intelligible speech. J Neurosci. 2008; 28 :13209–13215. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Lewis JW, Talkington WJ, Walker NA, Spirou GA, Jajosky A, Frum C, Brefczynski-Lewis JA. Human cortical organization for processing vocalizations indicates representation of harmonic structure as a signal attribute. J Neurosci. 2009; 29 :2283–2296. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Morosan P, Rademacher J, Schleicher A, Amunts K, Schormann T, Zilles K. Human primary auditory cortex: cytoarchitectonic subdivisions and mapping into a spatial reference system. Neuroimage. 2001; 13 :684–701. [ PubMed ] [ Google Scholar ]
  • Mullennix JW, Pisoni DB, Martin CS. Some effects of talker variability on spoken word recognition. J Acoust Soc Am. 1989; 85 :365–378. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Nearey TM. Static, dynamic, and relational properties in vowel perception. J Acoust Soc Am. 1989; 85 :2088–2113. [ PubMed ] [ Google Scholar ]
  • Nelken I, Bar-Yosef O. Neurons and objects: the case of auditory cortex. Front Neurosci. 2008; 2 :107–113. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Obleser J, Eisner F. Pre-lexical abstraction of speech in the auditory cortex. Trends Cogn Sci. 2009; 13 :14–19. [ PubMed ] [ Google Scholar ]
  • O'Shaughnessy D. Invited paper: automatic speech recognition: history, methods and challenges. Pattern Recognit. 2008; 41 :2965–2979. [ Google Scholar ]
  • O'Toole AJ, Roark DA, Abdi H. Recognizing moving faces: a psychological and neural synthesis. Trends Cogn Sci. 2002; 6 :261–266. [ PubMed ] [ Google Scholar ]
  • Overath T, Cusack R, Kumar S, von Kriegstein K, Warren JD, Grube M, Carlyon RP, Griffiths TD. An information theoretic characterisation of auditory encoding. PLoS Biol. 2007; 5 :e288. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Overath T, Kumar S, von Kriegstein K, Griffiths TD. Encoding of spectral correlation over time in auditory cortex. J Neurosci. 2008; 28 :13268–13273. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Patterson RD, Uppenkamp S, Johnsrude IS, Griffiths TD. The processing of temporal pitch and melody information in auditory cortex. Neuron. 2002; 36 :767–776. [ PubMed ] [ Google Scholar ]
  • Penagos H, Melcher JR, Oxenham AJ. A neural representation of pitch salience in nonprimary human auditory cortex revealed with functional magnetic resonance imaging. J Neurosci. 2004; 24 :6810–6815. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Pisoni DB. Some thoughts on “normalization” in speech perception. In: Johnson K, Mullenix JW, editors. Talker variability in speech processing. San Diego: Academic; 1997. pp. 9–32. [ Google Scholar ]
  • Poeppel D. The analysis of speech in different temporal integration windows: cerebral lateralization as “asymmetric sampling in time.” Speech Commun. 2003; 41 :245–255. [ Google Scholar ]
  • Puce A, Allison T, Bentin S, Gore JC, McCarthy G. Temporal cortex activation in humans viewing eye and mouth movements. J Neurosci. 1998; 18 :2188–2199. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Rao RP, Ballard DH. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nat Neurosci. 1999; 2 :79–87. [ PubMed ] [ Google Scholar ]
  • Scott SK. Auditory processing: speech, space and auditory objects. Curr Opin Neurobiol. 2005; 15 :197–201. [ PubMed ] [ Google Scholar ]
  • Summerfield C, Egner T, Greene M, Koechlin E, Mangels J, Hirsch J. Predictive codes for forthcoming perception in the frontal cortex. Science. 2006; 314 :1311–1314. [ PubMed ] [ Google Scholar ]
  • Sussman HM. A neuronal model of vowel normalization and representation. Brain Lang. 1986; 28 :12–23. [ PubMed ] [ Google Scholar ]
  • Turner RE, Walters TC, Monaghan JJ, Patterson RD. A statistical formant-pattern model for estimating vocal-tract length from formant frequency data. J Acoust Soc Am. 2009; 125 :2374–2386. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Vigneau M, Beaucousin V, Hervé PY, Duffau H, Crivello F, Houdé O, Mazoyer B, Tzourio-Mazoyer N. Meta-analyzing left hemisphere language areas: phonology, semantics, and sentence processing. Neuroimage. 2006; 30 :1414–1432. [ PubMed ] [ Google Scholar ]
  • von Kriegstein K, Giraud AL. Distinct functional substrates along the right superior temporal sulcus for the processing of voices. Neuroimage. 2004; 22 :948–955. [ PubMed ] [ Google Scholar ]
  • von Kriegstein K, Giraud AL. Implicit multisensory associations influence voice recognition. PLoS Biol. 2006; 4 :e326. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • von Kriegstein K, Eger E, Kleinschmidt A, Giraud AL. Modulation of neural responses to speech by directing attention to voices or verbal content. Brain Res Cogn Brain Res. 2003; 17 :48–55. [ PubMed ] [ Google Scholar ]
  • von Kriegstein K, Warren JD, Ives DT, Patterson RD, Griffiths TD. Processing the acoustic effect of size in speech sounds. Neuroimage. 2006; 32 :368–375. [ PubMed ] [ Google Scholar ]
  • von Kriegstein K, Smith DR, Patterson RD, Ives DT, Griffiths TD. Neural representation of auditory size in the human voice and in sounds from other resonant sources. Curr Biol. 2007; 17 :1123–1128. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • von Kriegstein K, Dogan O, Grüter M, Giraud AL, Kell CA, Grüter T, Kleinschmidt A, Kiebel SJ. Simulation of talking faces in the human brain improves auditory speech recognition. Proc Natl Acad Sci U S A. 2008; 105 :6747–6752. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Welling L, Ney H, Kanthak S. Speaker adaptive modeling by vocal tract normalization. IEEE Trans Speech Audio Process. 2002; 10 :415–426. [ Google Scholar ]
  • Wolpert DM, Ghahramani Z, Jordan MI. An internal model for sensorimotor integration. Science. 1995; 269 :1880–1882. [ PubMed ] [ Google Scholar ]
  • Wong PC, Diehl RL. Perceptual normalization for inter- and intratalker variation in Cantonese level tones. J Speech Lang Hear Res. 2003; 46 :413–421. [ PubMed ] [ Google Scholar ]
  • Wong PC, Nusbaum HC, Small SL. Neural bases of talker normalization. J Cogn Neurosci. 2004; 16 :1173–1184. [ PubMed ] [ Google Scholar ]


  • Open access
  • Published: 18 May 2022

A study of transformer-based end-to-end speech recognition system for Kazakh language

  • Mamyrbayev Orken 1 ,
  • Oralbekova Dina 1 , 2 ,
  • Alimhan Keylan 1 , 3 ,
  • Turdalykyzy Tolganay 1 &
  • Othman Mohamed 4  

Scientific Reports volume 12, Article number: 8337 (2022)


  • Computer science
  • Information technology
  • Scientific data

Today, the Transformer model, which allows parallelization and has its own internal attention mechanism, is widely used in the field of speech recognition. The great advantages of this architecture are its fast training speed and the absence of sequential operation, in contrast to recurrent neural networks. In this work, Transformer models and an end-to-end model based on connectionist temporal classification were used to build a system for automatic recognition of Kazakh speech. Kazakh belongs to the group of agglutinative languages and has limited data for implementing speech recognition systems. Some studies have shown that the Transformer model improves system performance for low-resource languages. Our experiments revealed that the joint use of Transformer and connectionist temporal classification models improved the performance of the Kazakh speech recognition system; with an integrated language model, it achieved its best character error rate of 3.7% on a clean dataset.


Introduction

Innovative information and digital technologies are increasingly entering everyday life; this applies to deep learning systems such as voice recognition, image recognition, and speech recognition and synthesis. Speech technologies in particular are widely used in communications, robotics, and other areas of professional activity. Speech recognition is a way to interact with technology: it recognizes individual words or continuous text and converts them into a sequence of words or commands. Traditional speech recognition systems are based on an acoustic model, a language model, and a lexicon. The acoustic model (AM) was built from hidden Markov models (HMM) with Gaussian mixture models (GMM), and the language model (LM) was based on n-gram models. The components of these systems were trained separately, which made them difficult to manage and configure and reduced their efficiency. With the advent of deep learning, the performance of speech-to-text systems improved: artificial neural networks began to replace GMMs for acoustic modeling, which led to the improved results reported in many research works 1, 2, 3. Thus, the HMM-DNN architecture became one of the most common models for continuous speech recognition.

Currently, the end-to-end (E2E) approach has become widespread. An E2E structure presents the system as a single neural network, unlike the traditional approach, which has several independent components 4, 5. An E2E system maps acoustic signals directly to label sequences without intermediate states and without the need for post-processing of the output, which makes it easy to implement. To increase the performance of E2E systems, the main tasks to be solved concern the choice of model architecture, the collection of a sufficiently large speech corpus with appropriate transcriptions, and the availability of high-performance hardware. Solving these issues ensures the successful implementation not only of speech recognition systems but also of other deep learning systems. In addition, E2E systems can significantly improve recognition quality by learning from large amounts of training data.

Models based on connectionist temporal classification (CTC) 6 and models based on the attention mechanism 7 are illustrative examples of end-to-end systems. In a CTC-based model, there is no need for frame-level alignment between the acoustics and the transcription, since a special "blank" token marks the beginning and end of each phoneme 8. In attention-based encoder-decoder models, the encoder acts as an acoustic model, converting the input speech into a high-level representation; the attention mechanism is an alignment model that determines which encoded frames are relevant for producing the current output; and the decoder, much like a language model, operates autoregressively, predicting each output token conditioned on the previous predictions 9. The E2E models above are based on convolutional and modified recurrent neural networks (RNNs). Models implemented with RNNs perform computations over the symbol positions of the input and output data, generating a sequence of hidden states in which each state depends on the previous hidden state of the network. This sequential process prevents parallelization across training examples, which becomes a problem for longer input sequences and makes network training take much longer. In 10, a Transformer-based model was proposed that allows the learning process to be parallelized; this model also removes recurrence and uses self-attention to find dependencies between the input and output data. The big advantages of this architecture are its fast training speed and the absence of sequential operation, in contrast to RNNs. Previous studies 11, 12 revealed that the combined use of Transformer models and an E2E model such as CTC improved the quality of English and Chinese speech recognition systems.
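As a hedged illustration of the CTC criterion referred to above (not code from the cited works), the PyTorch snippet below computes the CTC loss for a toy batch; all dimensions and the blank index are assumptions for the example.

```python
import torch
import torch.nn as nn

# Toy dimensions: T encoder frames, N utterances per batch, C output symbols (index 0 = blank).
T, N, C = 50, 4, 30
logits = torch.randn(T, N, C, requires_grad=True)          # stand-in for encoder outputs
log_probs = logits.log_softmax(dim=2)
targets = torch.randint(1, C, (N, 12), dtype=torch.long)   # reference label sequences (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)  # sums over all valid alignments
loss.backward()
```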

It should be noted that the attention mechanism is a general method that greatly improves system quality in machine translation and speech recognition, and the Transformer model uses this mechanism to speed up training. Its self-attention relates all positions of the input sequence to one another to compute a representation of the sequence, without requiring explicit alignments. In addition, the Transformer does not need to finish processing the beginning of a sequence before it can process its end; all positions are handled in parallel.
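A minimal single-head sketch of the scaled dot-product self-attention described here (masking, multiple heads, and positional encoding omitted); matrix shapes are illustrative.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X of shape (T, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # (T, T): every position attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over key positions
    return weights @ V                                 # context-mixed representations

rng = np.random.default_rng(0)
T, d_model, d_k = 20, 64, 32
X = rng.standard_normal((T, d_model))
out = self_attention(X, *(rng.standard_normal((d_model, d_k)) for _ in range(3)))
```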

To implement such models, a large amount of speech data is required for training, which is problematic for languages with limited training data, including the Kazakh language, which belongs to the group of agglutinative languages. To date, systems based on the CTC model 13, 14 have been developed for recognizing Kazakh speech with different sets of training data. The use of other methods and models to improve the accuracy of Kazakh speech recognition is a promising direction and can improve the performance of a recognition system trained on a small sample.

The main goal of our study is to improve the accuracy of automatic recognition of continuous Kazakh speech by increasing the amount of training data and by using models based on the Transformer and CTC.

The remainder of the work is organized as follows: Sect. 2 presents traditional methods of speech recognition, and Sect. 3 provides an analytical review of the research area. Section 4 describes the principles of operation of the Transformer-based model and the model we propose. Sections 5 and 6 describe our experimental data, the speech corpus, and the equipment used for the experiments, and analyze the results obtained. The conclusions are given in the final section.

Traditional speech recognition methods

Traditional sequence recognition is focused on estimating the maximum a posteriori probability. Formally, this approach transforms a sequence of acoustic speech features X into a sequence of words W. The acoustic features are a sequence of feature vectors of length T, $X = \{x_t \in \mathbb{R}^D \mid t = 1, \dots, T\}$, and the word sequence is $W = \{w_n \in V \mid n = 1, \dots, N\}$ of length N, where V is the vocabulary. The most probable word sequence $W^{*}$ is estimated by maximizing $P(W \mid X)$ over all possible word sequences $V^{*}$ (Eq. 1) 15:

$$W^{*} = \arg\max_{W \in V^{*}} P(W \mid X) \quad (1)$$

Therefore, the main goal of automatic speech recognition (ASR) is to find a suitable model that accurately determines the posterior distribution $P(W \mid X)$.
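For context, traditional systems factor this posterior with Bayes' rule into an acoustic and a language model (a standard decomposition stated here for completeness, consistent with Eqs. (4)–(6) below rather than quoted from the paper):

$$W^{*} = \arg\max_{W \in V^{*}} P(W \mid X) = \arg\max_{W \in V^{*}} \frac{p(X \mid W)\,P(W)}{p(X)} = \arg\max_{W \in V^{*}} p(X \mid W)\,P(W),$$

since $p(X)$ does not depend on $W$; $p(X \mid W)$ is supplied by the acoustic (and pronunciation) model and $P(W)$ by the language model.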

The process of automatic speech recognition consists of the following sequence of steps:

1. Extraction of features from the input signal.
2. Acoustic modeling (determining which phones were pronounced, for subsequent recognition).
3. Language modeling (checking that the spoken words correspond to the most likely word sequences).
4. Decoding the sequence of words spoken by the person.

The most important parts of a speech recognition system are the feature extraction and recognition methods. Feature extraction selects a small amount of data that is essential for solving the problem. To extract features, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) algorithms are commonly used 16, 17, 18, with MFCC being the most popular.

In the speech recognition task, the original signal is converted into feature vectors, on the basis of which classification will then be performed.
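A hedged example of this feature extraction step using librosa (not necessarily the toolchain used in this work); the file name, sampling rate, and window settings are placeholders.

```python
import librosa

# Hypothetical 16 kHz recording; 25 ms windows with a 10 ms hop are common choices.
signal, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(
    y=signal, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
print(mfcc.shape)   # (13, number_of_frames): one feature vector per frame
```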

Acoustic model

The acoustic model (AM) uses deep neural networks together with hidden Markov models. A deep neural network, a convolutional neural network (CNN), or a long short-term memory network (a variant of the recurrent neural network) is used to map the acoustic frame $x_t$ to the corresponding phonetic state $f_t$ at each input time $t$ (Eq. 2), i.e., the network estimates the frame-wise posterior $P(f_t \mid x_t)$.

Before this acoustic modeling procedure, the output targets of the neural network, a sequence of frame-level phonetic states $f_{1:T}$, are generated by the HMM and GMM with dedicated training procedures: the GMM models the frame-level acoustics $x_{1:T}$, and the HMM estimates the most probable sequence of phonetic states $f_{1:T}$.

The acoustic model is optimized for the cross-entropy error, which is the phonetic classification error per frame.

Language model

The language model P(W) models the most probable word sequences independently of the acoustics (3):
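In the usual autoregressive factorization, Eq. (3) reads

$$P\left( W \right) = \prod_{u=1}^{U} p\left( w_{u} \mid w_{<u} \right), \quad (3)$$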

where w_{<u} denotes the previously recognized words.

Currently, RNNs and LSTMs are widely used as language model architectures because they can capture long-term dependencies, unlike traditional n-gram models, which rely on the Markov assumption and are limited to a fixed n-word history.

Hidden Markov models

For a long time, systems based on hidden Markov models (HMMs) were the main approach to continuous speech recognition. The HMM mechanism can be used not only in acoustic modeling but also in the language model; in general, however, HMMs provide the greatest benefit when modeling the acoustic component.

In this HMM, the acoustic feature is the observation and the phonetic state is the latent variable. For an HMM with state set {1, …, J}, the HMM-based model applies Bayes' theorem and introduces the HMM state sequence S = {s_t ∈ {1, …, J} | t = 1, …, T} into P(L|X) (4).
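A form of Eq. (4) consistent with the decomposition described below is

$$P\left( L \mid X \right) = \frac{\sum_{S} P\left( X \mid S \right) P\left( S \mid L \right) P\left( L \right)}{P\left( X \right)} \propto \sum_{S} P\left( X \mid S \right) P\left( S \mid L \right) P\left( L \right). \quad (4)$$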

p(X|S), p(S|L), and p(L) in Eq. ( 4 ) correspond to the acoustic model, the pronunciation model and the language model, respectively.

The acoustic model P (X|S) indicates the probability of observing X from the hidden sequence S. According to the probability chain rule and the observation independence hypothesis in the HMM (observations at any time depend only on the hidden state at that time), P(X|S) can be decomposed into the following form (5):
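Under these assumptions the decomposition takes the usual form

$$P\left( X \mid S \right) = \prod_{t=1}^{T} p\left( x_{t} \mid s_{t} \right). \quad (5)$$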

In the acoustic model, p(x_t | s_t) is the observation probability, which is usually represented by mixtures of Gaussian distributions. The posterior distribution of the hidden state, p(s_t | x_t), can be computed with deep neural networks.

Two approaches, HMM-GMM and HMM-DNN, can be used to calculate P(X|S) in Eq. (5). HMM-GMM was for a long time the main method for building speech-to-text systems. With the development of deep learning, DNNs were introduced into speech recognition for acoustic modeling: the DNN computes the posterior probability of the HMM state, which can be converted into an observation likelihood that replaces the usual GMM observation probability. The transition from HMM-GMM to the hybrid HMM-DNN model has yielded excellent recognition results, and HMM-DNN has become a popular ASR architecture.

Hybrid models have some important limitations. For example, ANNs with more than two hidden layers were rarely used because of computational limitations, and the context-dependent model described above relies on numerous techniques originally developed for GMM-HMM systems.

The training process is also complex and difficult to optimize globally, since the components of traditional models are usually trained separately, on different datasets and with different methods.

Hybrid models based on DNN-HMM

To calculate P(x_t | s_t) directly, GMMs were used, because they can model the distribution for each state and thus provide probability values for input sequences. In practice, however, these assumptions cannot always be captured by a GMM. DNNs have shown significant improvements over GMMs owing to their ability to learn nonlinear functions, but a DNN cannot directly provide this conditional probability. Instead, the frame-by-frame posterior distribution is used to turn the likelihood P(x_t | s_t) into a classification problem P(s_t | x_t) via a pseudo-likelihood trick that serves as an approximation of the joint probability (6) 15 . A system that applies this probability is referred to as a "hybrid architecture".
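In its usual form this pseudo-likelihood reads

$$p\left( x_{t} \mid s_{t} \right) \approx \frac{p\left( s_{t} \mid x_{t} \right) p\left( x_{t} \right)}{p\left( s_{t} \right)} \propto \frac{p\left( s_{t} \mid x_{t} \right)}{p\left( s_{t} \right)}. \quad (6)$$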

The numerator is a DNN classifier trained with the input features x_t as input and the state s_t as target. The denominator P(s_t) is the prior probability of the state s_t. Frame-by-frame training requires a frame-level alignment with x_t as input and s_t as target; this alignment is usually obtained from a weaker HMM/GMM alignment system or from hand-crafted dictionaries. The quality and quantity of the alignment labels are usually the most significant limitations of the hybrid approach.

End-to-end speech recognition models

E2E automatic speech recognition is a newer, neural-network-based approach to ASR that offers many advantages. E2E ASR is a single integrated approach with a much simpler training pipeline, using models that operate at a low audio frame rate. This reduces training time and decoding time, and it allows joint optimization with downstream processing such as natural language understanding.

To compute P(W | X) globally with E2E speech recognition models, the input is represented as a sequence of acoustic features X = (x_1, …, x_T), the sequence of target labels as y = (y_1, …, y_T), and the word sequence as W = (w_1, …, w_M).

Thus, the ANN estimates the probabilities P(·|x_1), …, P(·|x_T), whose outputs range over representations of the word sequence, that is, the labels.

Modern E2E models are, by design, trained on large amounts of data. This exposes the main problem addressed here: recognizing languages with limited training data, such as Kazakh, Kyrgyz, or Turkish. For such low-resource languages, no large training corpora are available.

Related work/literature review

The Transformer model was first introduced in 8 in order to reduce sequential computation and the number of operations needed to relate input and output positions. Experiments were conducted on machine translation tasks, from English to German and from English to French, and the model achieved good performance compared with existing results. Moreover, the Transformer also works well for other tasks with both large and limited training data and has proved fruitful for all kinds of seq2seq tasks.

The use of Transformer for speech-to-text conversion also showed good results and was reflected in the following research papers:

To implement a faster and more accurate ASR system, Karita et al. 11 combined the Transformer with achievements of RNN-based ASR. To build the model, connectionist temporal classification (CTC) was combined with the Transformer for joint training and decoding in an E2E fashion. This approach speeds up training and facilitates LM integration. The proposed ASR system achieves significant improvements on various ASR tasks: for example, introducing CTC and LM integration into the Transformer baseline lowered the WER from 11.1% to 4.5% on the Wall Street Journal and from 16.1% to 11.6% on TED-LIUM.

Moritz et al. 19 proposed a Transformer-based model for streaming speech recognition, whereas the standard Transformer requires an entire speech utterance as input. Time-restricted self-attention in the encoder and triggered attention for the encoder-decoder attention mechanism were applied so that output can be generated as soon as a word has been spoken. The architecture achieved the best published E2E streaming results at the time: 2.8% and 7.3% WER on the LibriSpeech "clean" and "other" test data.

The Weak-Attention Suppression (WAS) method was proposed by Yangyang Shi and colleagues 20 ; it dynamically induces sparse attention probabilities. The method suppresses attention to uncritical and redundant consecutive acoustic frames and is more likely to suppress past frames than future ones. The proposed method was shown to reduce WER compared with baseline Transformers. On LibriSpeech, WAS reduced WER by 10% on the test-clean set and by 5% on the other test set for streaming Transformers, establishing a new state of the art among streaming models.

Dong Linhao and co-authors 21 presented a Speech-Transformer that uses a 2D attention mechanism to jointly attend over the time and frequency axes of the 2D speech inputs, thereby providing more expressive representations. The Wall Street Journal (WSJ) corpus was used as training data. The experimental results showed that the model reduces training time while providing a competitive WER.

Gangi et al. 22 proposed a Transformer adapted for spoken language translation (SLT), an architecture for processing long input sequences with low information density that also addresses ASR problems. The adaptation downsamples the input with convolutional neural networks and models the two-dimensional nature of the audio spectrogram with 2D components. Experiments show that the SLT-adapted Transformer outperforms an RNN-based baseline in both translation quality and training time, providing high performance in six language directions.

Takaaki Hori et al. 23 extended the Transformer architecture with a context window spanning neighboring utterances, trained in monologue and dialogue scenarios. Monologue tests on CSJ and TED-LIUM3 and dialogue tests on SWITCHBOARD and HKUST produced results that surpass utterance-level baseline E2E ASR, with or without speaker i-vectors.

Chang et al. 24 replaced the RNN-based encoder-decoder of an E2E system with the Transformer architecture. To use this model in the masking network of a neural beamformer in the multi-channel case, the self-attention component was restricted to a segment rather than the entire sequence, reducing the amount of computation. In addition to the architectural improvements, external dereverberation preprocessing based on the weighted prediction error (WPE) was included, which allows the model to process reverberated signals. Experiments with the extended wsj1-2mix corpus show that the Transformer-based models achieve better results in anechoic conditions in both single-channel and multi-channel modes.

Transformer architecture

The Transformer model was originally created for machine translation, replacing recurrent neural networks (RNNs) in natural language processing (NLP) tasks. In this model, recurrence is eliminated entirely; instead, for each utterance, an internal attention mechanism (self-attention) builds features that identify how significant the other positions of the sequence are for the current one. The features generated for a given position are therefore linear transformations of the features of the significant positions.

The Transformer model consists of one large block, which in turn consists of encoder and decoder blocks (Fig. 1). The encoder takes as input the feature vectors of the audio signal X = (x_1, …, x_T) and outputs a sequence of intermediate representations. Based on these representations, the decoder then produces the output sequence W = (w_1, …, w_M). At each step the model uses the previously emitted symbols to output the next one, because it is autoregressive. The Transformer architecture uses several interconnected layers of self-attention in the encoder and decoder blocks. We consider each block individually below.

Figure 1. General scheme of the model.

Encoder and decoder networks

Conventional E2E encoder/decoder models for speech recognition consist of a single encoder, a decoder, and an attention mechanism. The encoder converts the vector of acoustic features into an alternative representation, the decoder predicts a sequence of labels from the information provided by the encoder, and the attention mechanism highlights the parts of the frames that are significant for predicting the output. In contrast, the Transformer model can have several encoders and decoders, each containing its own internal attention mechanism.

The encoder block consists of a stack of encoders; six encoders, placed one above the other, are usually used, but the number is not fixed and one can experiment with an arbitrary number of encoders in the block. All encoders have the same structure but different weights. The encoder input receives feature vectors extracted from the audio signal, obtained with Mel-frequency cepstral coefficients or convolutional neural networks. The first encoder transforms these data with self-attention into a set of vectors and passes the outputs through a feed-forward ANN to the next encoder. The last encoder processes the vectors and transfers the encoded features to the decoder block.

The decoder block is a set of decoders whose number is usually identical to the number of encoders. Each encoder can be divided into two sublayers: the input first passes through a multi-head self-attention layer, which helps the encoder look at other positions in the incoming sequence while encoding a particular one; the output of this layer is then sent to a feed-forward neural network, and exactly the same network is applied independently at each position.

The decoder also contains these two layers, but with an additional attention layer between them that helps the decoder focus on the significant parts of the input, similar to the usual attention mechanism in seq2seq models. This component takes the previously emitted characters or words into account and, based on them, outputs the posterior probabilities of the next character or word.
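For illustration only, the following PyTorch sketch wires up an encoder-decoder Transformer of the kind described above; the layer count, model dimension, head count, and vocabulary size are assumed values, not the configuration used in this work.

```python
# A minimal encoder-decoder Transformer sketch (assumed sizes, not the paper's configuration).
import torch
import torch.nn as nn

d_model, n_heads, n_layers, vocab_size = 256, 4, 6, 64   # illustrative hyperparameters

transformer = nn.Transformer(d_model=d_model, nhead=n_heads,
                             num_encoder_layers=n_layers, num_decoder_layers=n_layers,
                             dim_feedforward=1024, batch_first=True)
output_proj = nn.Linear(d_model, vocab_size)              # posterior over output symbols

acoustic_frames = torch.randn(1, 100, d_model)            # (batch, T, d_model) from the front-end
prev_symbols = torch.randn(1, 20, d_model)                # embedded, already-emitted symbols
decoder_states = transformer(acoustic_frames, prev_symbols)
logits = output_proj(decoder_states)                      # (1, 20, vocab_size)
```

In practice the decoder input would be token embeddings with positional encoding and a causal mask; those details are omitted from this sketch.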

Self-attention mechanism

The Transformer model includes scaled dot-product attention 10 . The advantages of self-attention are fast computation, a shorter path between positions in the sequence, and potential interpretability. This attention operates on three vectors, queries, keys, and values, together with a scaling factor (7):
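In its standard form, Eq. (7) reads

$$\mathrm{Attention}\left( Q,K,V \right) = \mathrm{softmax}\left( \frac{QK^{T}}{\sqrt{d_{k}}} \right)V, \quad (7)$$

where \(d_k\) is the dimensionality of the keys.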

These are the parameters from which attention is computed. Multi-head attention combines several self-attention maps into one overall matrix computation (8):
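A standard form of Eq. (8), with \(W^{O}\) denoting the output projection matrix (our notation), is

$$\mathrm{MultiHead}\left( Q,K,V \right) = \mathrm{Concat}\left( s_{1}, \ldots, s_{h} \right)W^{O}. \quad (8)$$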

Here \({s}_{h}=Attention(Q{W}_{h}^{Q}, K{W}_{h}^{K}, V{W}_{h}^{V})\) is the output of head h, h is the number of attention heads in the layer, and \({W}_{h}^{Q}, {W}_{h}^{K}, {W}_{h}^{V}\) are trained weight matrices.

The multi-head attention mechanism also helps with optimization: it can mitigate problems associated with an unlucky initialization and improve training speed. In addition, after training, some attention heads can be removed without noticeably affecting decoding quality. The number of heads in the model regulates the attention mechanism, and the mechanism gives the network easy access to information at any position, regardless of the length of the sequence.
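The following NumPy sketch implements Eqs. (7) and (8) directly; the sequence length, model dimension, and number of heads are illustrative assumptions.

```python
# Scaled dot-product attention (Eq. 7) and multi-head attention (Eq. 8) in NumPy.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Run h heads in parallel, concatenate their outputs s_1..s_h, and project with W_o."""
    heads = [scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
             for Wq, Wk, Wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy self-attention over T = 5 frames with d_model = 16 and h = 4 heads.
rng = np.random.default_rng(0)
T, d_model, h = 5, 16, 4
d_k = d_model // h
X = rng.standard_normal((T, d_model))
W_q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_k = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_v = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_o = rng.standard_normal((d_model, d_model))
print(multi_head_attention(X, W_q, W_k, W_v, W_o).shape)   # (5, 16)
```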

The Transformer architecture also contains a Normalize element, which is needed because feature values can have very different magnitudes after the attention mechanism. Layer Normalization is usually used for this purpose (Fig. 2).

Figure 2. Transformer model.

The outputs of the individual heads can also differ, so the spread of values in the final vector can be large. To prevent this, an approach has been proposed 11 in which the values at each position are transformed by a two-layer perceptron: after the attention mechanism, the values are projected to a larger dimension with trained weights, passed through the nonlinear ReLU activation function, projected back to the original dimension, and then normalized again.

Proposed model

Typically, connectionist temporal classification (CTC) is used as a loss function to train recurrent neural networks to recognize input speech without pre-aligning the input and output data 11 . To achieve high performance from a CTC model, an external language model must be used, since direct decoding does not work well. In addition, the Kazakh language has a rather rich word-formation mechanism, so using a language model contributes to an increase in the quality of Kazakh speech recognition.

In this work, we jointly use the Transformer and CTC models with a language model (LM). Using CTC with an LM during decoding results in rapid model convergence, which reduces decoding time and improves system performance. After receiving the output from the encoder, the CTC function computes, according to formula (9), the probability over arbitrary alignments between the encoder output and the output symbol sequence.
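In the standard CTC formulation, and in the notation explained below, formula (9) sums the frame-level symbol probabilities over all alignments \(\pi\) that collapse to \(\gamma\) under R:

$$P_{\mathrm{CTC}}\left( \gamma \mid x \right) = \sum_{\pi \in R^{-1}\left( \gamma \right)} \; \prod_{t} p\left( \pi_{t} \mid x \right). \quad (9)$$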

Here \(x\) is the output vector of the encoder, R is an additional operator that removes blanks and repeated symbols, and \(\gamma\) is the series of predicted symbols. The sum over all alignments in this equation is computed with dynamic programming, which makes it possible to train the neural network on data without frame-level alignments.

The general structure of the resulting model is shown in Fig.  3 .

Figure 3. The structure of our model.

During training, a multi-task loss was used, combining the two probabilities through their negative logarithms, as presented in 10 .

Thus, the resulting model can be represented by the following expression ( 10 ):
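A standard form of this combined objective, consistent with the description above, is

$$\mathcal{L} = -\lambda \log P_{\mathrm{CTC}}\left( W \mid X \right) - \left( 1 - \lambda \right) \log P_{\mathrm{att}}\left( W \mid X \right), \quad (10)$$

where \(P_{\mathrm{CTC}}\) and \(P_{\mathrm{att}}\) are the CTC and attention-decoder probabilities.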

where \(\lambda\) is a configurable parameter satisfying \(0\le \lambda \le 1\) .
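As a hedged illustration of such a multi-task objective (not the authors' implementation), the PyTorch sketch below combines a CTC loss on the encoder outputs with a cross-entropy loss on the decoder outputs; the tensor shapes, blank index, and value of \(\lambda\) are assumptions, and padding and start/end-of-sequence handling are omitted.

```python
# Joint CTC/attention training objective: L = lambda * L_CTC + (1 - lambda) * L_attention.
import torch
import torch.nn.functional as F

lam = 0.3   # assumed weight, 0 <= lambda <= 1

def joint_ctc_attention_loss(encoder_logits, decoder_logits, targets,
                             input_lengths, target_lengths):
    # CTC branch: log-probabilities shaped (T, batch, vocab); blank index 0 assumed.
    log_probs = F.log_softmax(encoder_logits, dim=-1).transpose(0, 1)
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)

    # Attention branch: cross-entropy between decoder outputs and the target symbols.
    att = F.cross_entropy(decoder_logits.reshape(-1, decoder_logits.size(-1)),
                          targets.reshape(-1))

    return lam * ctc + (1.0 - lam) * att
```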

The following additions have been included to improve model performance:

(1) Using a character-level language model in feature extraction. Convolutional neural networks were used to extract the features. To obtain high-dimensional features from the audio data, we take the network outputs below the last hidden CNN layer; softmax was used as the activation function. A max-pooling layer was then added to suppress noisy signals and to reduce dimensionality: it shrinks the convolved feature maps into a compact vector and lowers the processing power required for the data. Adapting training with a character-level language model, without disturbing the structure of the neural network during training, allows us to preserve maximum non-linearity for subsequent processing. Thus, the extracted features are already high-level, and there is no need to map the raw data to phonemes (an illustrative sketch of such a convolutional front-end is given after this list).

(2) Applying a word- and phrase-level language model during decoding together with CTC.
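For illustration only, the sketch below shows a convolutional front-end of the general kind described in point (1); the channel counts, kernel sizes, number of mel bins, and the use of ReLU inside the stack are assumptions rather than the authors' exact configuration.

```python
# Assumed convolutional front-end: 2-D convolutions over a spectrogram plus max-pooling
# for noise suppression and dimensionality reduction (illustrative sizes only).
import torch
import torch.nn as nn

class ConvFrontEnd(nn.Module):
    def __init__(self, n_mels: int = 80, out_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                       # halves time and frequency resolution
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.proj = nn.Linear(32 * (n_mels // 4), out_dim)

    def forward(self, spec):                       # spec: (batch, 1, T, n_mels)
        h = self.conv(spec)                        # (batch, 32, T // 4, n_mels // 4)
        b, c, t, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)
        return self.proj(h)                        # (batch, T // 4, out_dim) encoder inputs

frames = ConvFrontEnd()(torch.randn(1, 1, 100, 80))
print(frames.shape)                                # torch.Size([1, 25, 256])
```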

To measure the quality of the Kazakh speech recognition system, two metrics were used: the character error rate (CER), the proportion of incorrectly recognized characters, since characters are the most common and simplest output units for generating text, and the word error rate (WER) 25 .
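For reference (this is not the authors' evaluation code), the sketch below computes WER and CER from the Levenshtein edit distance 25 ; the sample strings are purely illustrative.

```python
# WER and CER computed from the Levenshtein edit distance (illustrative reference code).
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions turning ref into hyp."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            prev, d[j] = d[j], min(d[j] + 1,            # deletion
                                   d[j - 1] + 1,        # insertion
                                   prev + (r != h))     # substitution
    return d[-1]

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

print(wer("бұл мысал сөйлем", "бұл мысал"))   # one deleted word out of three -> 0.33
```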

Data availability

Not applicable.

Seide, F., Li, G. & Yu, D. Conversational speech transcription using context-dependent deep neural networks. Interspeech (2011).

Bourlard, H., & Morgan, N. Connectionist speech recognition: A hybrid approach. p. 352 (1993) https://doi.org/10.1007/978-1-4615-3210-1 .

Smit, P., Virpioja, S. & Kurimo, M. Advances in subword-based HMM-DNN speech recognition across languages. Comput. Speech Lang. 66 , 1. https://doi.org/10.1016/j.csl.2020.101158 (2021).


Wang, D., Wang, X. & Lv, S. An overview of end-to-end automatic speech recognition. Symmetry 11 , 1018. https://doi.org/10.3390/sym11081018 (2019).

Mamyrbayev, O. & Oralbekova, D. Modern trends in the development of speech recognition systems. News of the National academy of sciences of the republic of Kazakhstan 4 (332), 42–51 (2020).


Graves, A., Fernandez, S., Gomez, F., & Schmidhuber, J. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In ICML, Pittsburgh, USA, 2006

Chan, W., Jaitly, N., Le, Q. V., & Vinyals, O. Listen attend and spell. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2016

Cui, X., & Gong, Y. Variable parameter Gaussian mixture hidden Markov modeling for speech recognition. 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)., 2003, pp. I-I. https://doi.org/10.1109/ICASSP.2003.1198704 .

Yan, Y., Qi, W., Gong, Y., Liu, D., Duan, N., Chen, J., Zhang, R., & Zhou, M. ProphetNet: Predicting future N-gram for sequence-to-sequence pre-training. arXiv - CS - Computation and Language, 2020. arxiv-2001.04063.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010 (2017).

Karita, S., Soplin, N. E. Y., Watanabe, S., Delcroix, M., Ogawa, A., & Nakatani, T. (2019). Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2019-September, 1408–1412. https://doi.org/10.21437/Interspeech.2019-1938 .

Miao, H., Cheng, G., Gao, C., Zhang, P., & Yan, Y. Transformer-Based Online CTC/Attention End-To-End Speech Recognition Architecture. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6084–6088 (2020). https://doi.org/10.1109/ICASSP40776.2020.9053165 .

Mamyrbayev, O., Oralbekova, D., Kydyrbekova, A., Turdalykyzy, T., & Bekarystankyzy, A. End-to-End Model Based on RNN-T for Kazakh Speech Recognition. In 2021 3rd International Conference on Computer Communication and the Internet (ICCCI), 2021, pp. 163–167. https://doi.org/10.1109/ICCCI51764.2021.9486811 .

Mamyrbayev, O., Alimhan, K., Oralbekova, D., Bekarystankyzy, A. & Zhumazhanov, B. Identifying the influence of transfer learning method in developing an end-to-end automatic speech recognition system with a low data level. Eastern-Eur. J. Enterpris. Technol. 19 (115), 84–92 (2022).

Kamath, U., Liu, J. & Whitaker, J. Deep Learning for NLP and Speech Recognition (Springer, 2019).


El-Henawy, I. M., Khedr, W. I., Elkomy, O. M. & Abdalla, A.-Z. M. I. Recognition of phonetic Arabic figures via wavelet-based Mel Frequency Cepstrum using HMMs. HBRC J. 10(1), 49–54 (2014).

Mohan, B. J. & Ramesh Babu, N. Speech recognition using MFCC and DTW. International Conference on Advances in Electrical Engineering (ICAEE) 1 , 1–4. https://doi.org/10.1109/ICAEE.2014.6838564 (2014).

Dave, N. Feature extraction methods LPC, PLP and MFCC in speech recognition. Int. J. Adv. Res. Eng. Technol. 1 , 1 (2013).

Moritz, N., Hori, T., & Le, J. Streaming Automatic Speech Recognition with the Transformer Model. In ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6074–6078. https://doi.org/10.1109/ICASSP40776.2020.9054476 .

Shi, Y., Wang, Y., Wu, C., Fuegen, C., Zhang, F., Le, D., Yeh, C., & Seltzer, M. Weak-attention suppression for transformer-based speech recognition. ArXiv abs/2005.09137 (2020).

Dong, L., Xu, S., & Xu, B. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 5884–5888 (2018).

Gangi, M. A. D., Negri, M., Cattoni, R., Dessì, R., & Turchi, M. Enhancing Transformer for End-to-end Speech-to-Text Translation. MTSummit (2019).

Hori, T., Moritz, N., Hori, C., & Roux, J. L. Transformer-based Long-context End-to-end Speech Recognition. INTERSPEECH 2020, Shanghai, China (2020).

Chang, X., Zhang, W., Qian, Y., Le Roux, J., & Watanabe, S. End-to-end multi-speaker speech recognition with transformer. In ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Levenshtein, V. I. Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Doklady 10, 707–710 (1966).


Mamyrbayev, O. et al. Development of security systems using DNN and i & x-vector classifiers. East.-Eur. J. Enterpris. Technol. 49 (112), 32–45 (2021).

Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv (2014). http://arxiv.org/abs/1412.6980 (accessed 18.04.2021).

LeCun, Y., Bottou, L., Orr, G. B. & Müller, K.-R. Efficient backprop. In Neural Networks: Tricks of the Trade (an outgrowth of a 1996 NIPS workshop), pp. 9–50 (1998).


Acknowledgements

This research has been funded by the Science Committee of the Ministry of Education and Science of the Republic of Kazakhstan (Grant No. AP08855743).

Author information

Authors and affiliations.

Institute of Information and Computational Technologies CS MES RK, Almaty, Kazakhstan

Mamyrbayev Orken, Oralbekova Dina, Alimhan Keylan & Turdalykyzy Tolganay

Satbayev University, Almaty, Kazakhstan

Oralbekova Dina

L.N. Gumilyov Eurasian National University, Nur-Sultan, Kazakhstan

Alimhan Keylan

Universiti Putra Malaysia, Kuala Lumpur, Malaysia

Othman Mohamed


Contributions

O.M. built a model, applied transfer learning to the realized recognition model, and participated in the preparation of the manuscript; K.A. and M.O. carried out the analysis of the literature on the topic under study; D.O. built an end-to-end model based on the Transformer, participated in the research, and prepared the manuscript; T.T. prepared the data for training; D.O. helped in drawing up the program. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Oralbekova Dina .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Orken, M., Dina, O., Keylan, A. et al. A study of transformer-based end-to-end speech recognition system for Kazakh language. Sci Rep 12 , 8337 (2022). https://doi.org/10.1038/s41598-022-12260-y


Received : 28 December 2021

Accepted : 05 May 2022

Published : 18 May 2022

DOI : https://doi.org/10.1038/s41598-022-12260-y


This article is cited by

A comprehensive survey on automatic speech recognition using neural networks.

  • Amandeep Singh Dhanjal
  • Williamjeet Singh

Multimedia Tools and Applications (2023)


Transfer Learning Using Whisper for Dysarthric Automatic Speech Recognition

  • Conference paper
  • First Online: 22 November 2023
  • Cite this conference paper


  • Siddharth Rathod   ORCID: orcid.org/0009-0003-2176-4413 13 ,
  • Monil Charola   ORCID: orcid.org/0009-0000-5145-8378 13 &
  • Hemant A. Patil   ORCID: orcid.org/0000-0002-4068-2005 13  

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14338)

Included in the following conference series:

  • International Conference on Speech and Computer


Dysarthria is a motor speech disorder that affects an individual's ability to articulate words, making speech recognition a challenging task. Automatic Speech Recognition (ASR) technologies have the potential to greatly benefit individuals with dysarthria by providing them with a means of communication through computing and portable digital devices. These technologies can serve as an interaction medium, enabling dysarthric patients to communicate with others and with computers. In this paper, we propose a transfer learning approach using the Whisper model to develop a dysarthric ASR system. Whisper (Web-scale Supervised Pretraining for Speech Recognition) is a multi-tasking model trained on various speech-related tasks, such as speech transcription in various languages, speech translation, voice activity detection, and language identification, on 680,000 h of labeled audio data. Using the proposed Whisper-based approach, we obtained an average word recognition accuracy of \(59.78\%\) on 155 words of the UA-Speech corpus with a Bi-LSTM classifier model.

  • Encoder-decoder transformer
  • WSPSR (Whisper)
  • Automatic speech recognition (ASR)




Acknowledgments

The authors would like to express their sincere appreciation to the Ministry of Electronics and Information Technology (MeitY), New Delhi, Govt. of India, for the project ‘Speech Technologies in Indian Languages BHASHINI’, (Grant ID: 11(1)2022-HCC (TDIL)) for their support.

Author information

Authors and affiliations.

Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, India

Siddharth Rathod, Monil Charola & Hemant A. Patil


Corresponding author

Correspondence to Siddharth Rathod .

Editor information

Editors and affiliations.

St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russia

Alexey Karpov

Koneru Lakshmaiah Education Foundation, Vaddeswaram, India

K. Samudravijaya

Indian Institute of Information Technology Dharwad, Dharwad, India

K. T. Deepak

Indian Institute of Technology Dharwad, Dharwad, India

Rajesh M. Hegde

KIIT Group of Colleges, Gurugram, India

Shyam S. Agrawal

S. R. Mahadeva Prasanna


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper.

Rathod, S., Charola, M., Patil, H.A. (2023). Transfer Learning Using Whisper for Dysarthric Automatic Speech Recognition. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science(), vol 14338. Springer, Cham. https://doi.org/10.1007/978-3-031-48309-7_46


DOI : https://doi.org/10.1007/978-3-031-48309-7_46

Published : 22 November 2023

Publisher Name : Springer, Cham

Print ISBN : 978-3-031-48308-0

Online ISBN : 978-3-031-48309-7

eBook Packages : Computer Science Computer Science (R0)

