Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text, is a capability that enables a program to process human speech into a written format.

While speech recognition is commonly confused with voice recognition, speech recognition focuses on the translation of speech from a verbal format to a text one whereas voice recognition just seeks to identify an individual user’s voice.

IBM has had a prominent role within speech recognition since its inception, releasing “Shoebox” in 1962. This machine had the ability to recognize 16 different words, advancing the initial work from Bell Labs from the 1950s. However, IBM didn’t stop there, but continued to innovate over the years, launching the VoiceType Simply Speaking application in 1996. This speech recognition software had a 42,000-word vocabulary, supported English and Spanish, and included a spelling dictionary of 100,000 words.

While speech technology had a limited vocabulary in the early days, it is utilized in a wide number of industries today, such as automotive, technology, and healthcare. Its adoption has only continued to accelerate in recent years due to advancements in deep learning and big data. Research (link resides outside ibm.com) shows that this market is expected to be worth USD 24.9 billion by 2025.

Many speech recognition applications and devices are available, but the more advanced solutions use AI and machine learning. They integrate grammar, syntax, structure, and composition of audio and voice signals to understand and process human speech. Ideally, they learn as they go — evolving responses with each interaction.

The best systems also allow organizations to customize and adapt the technology to their specific requirements — everything from language and nuances of speech to brand recognition. For example:

  • Language weighting: Improve precision by weighting specific words that are spoken frequently (such as product names or industry jargon), beyond terms already in the base vocabulary.
  • Speaker labeling: Output a transcription that cites or tags each speaker’s contributions to a multi-participant conversation.
  • Acoustics training: Attend to the acoustical side of the business. Train the system to adapt to an acoustic environment (like the ambient noise in a call center) and speaker styles (like voice pitch, volume and pace).
  • Profanity filtering: Use filters to identify certain words or phrases and sanitize speech output.

Meanwhile, speech recognition continues to advance. Companies like IBM are making inroads in several areas to improve human and machine interaction.

The vagaries of human speech have made development challenging. It’s considered to be one of the most complex areas of computer science – involving linguistics, mathematics and statistics. Speech recognizers are made up of a few components, such as the speech input, feature extraction, feature vectors, a decoder, and a word output. The decoder leverages acoustic models, a pronunciation dictionary, and language models to determine the appropriate output.

Speech recognition technology is evaluated on its accuracy rate, i.e. word error rate (WER), and speed. A number of factors can impact word error rate, such as pronunciation, accent, pitch, volume, and background noise. Reaching human parity – meaning an error rate on par with that of two humans speaking – has long been the goal of speech recognition systems. Research from Lippmann (link resides outside ibm.com) estimates the human word error rate to be around 4 percent, but it’s been difficult to replicate the results from this paper.

Various algorithms and computation techniques are used to convert speech into text and improve the accuracy of transcription. Below are brief explanations of some of the most commonly used methods:

  • Natural language processing (NLP): While NLP isn’t necessarily a specific algorithm used in speech recognition, it is the area of artificial intelligence that focuses on the interaction between humans and machines through language, both spoken and written. Many mobile devices incorporate speech recognition into their systems to conduct voice search—e.g. Siri—or provide more accessibility around texting.
  • Hidden Markov models (HMM): Hidden Markov models build on the Markov chain model, which stipulates that the probability of a given state hinges on the current state, not its prior states. While a Markov chain model is useful for observable events, such as text inputs, hidden Markov models allow us to incorporate hidden events, such as part-of-speech tags, into a probabilistic model. They are utilized as sequence models within speech recognition, assigning labels to each unit—i.e. words, syllables, sentences, etc.—in the sequence. These labels create a mapping with the provided input, allowing it to determine the most appropriate label sequence.
  • N-grams: This is the simplest type of language model (LM), which assigns probabilities to sentences or phrases. An N-gram is a sequence of N words. For example, “order the pizza” is a trigram or 3-gram and “please order the pizza” is a 4-gram. Grammar and the probability of certain word sequences are used to improve recognition and accuracy (a small bigram sketch follows this list).
  • Neural networks: Primarily leveraged for deep learning algorithms, neural networks process training data by mimicking the interconnectivity of the human brain through layers of nodes. Each node is made up of inputs, weights, a bias (or threshold) and an output. If that output value exceeds a given threshold, it “fires” or activates the node, passing data to the next layer in the network. Neural networks learn this mapping function through supervised learning, adjusting based on the loss function through the process of gradient descent.  While neural networks tend to be more accurate and can accept more data, this comes at a performance efficiency cost as they tend to be slower to train compared to traditional language models.
  • Speaker Diarization (SD): Speaker diarization algorithms identify and segment speech by speaker identity. This helps programs better distinguish individuals in a conversation and is frequently applied at call centers to distinguish customers from sales agents.
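To make the N-gram idea concrete, here is a minimal bigram language model sketch in JavaScript. The toy corpus, function name, and counting scheme are purely illustrative and are not taken from any particular speech recognition toolkit.

```js
// Minimal bigram language model: estimate P(next word | previous word)
// from a toy corpus. Real systems add smoothing and use far larger corpora.
const corpus = [
  "please order the pizza",
  "please order the salad",
  "order the pizza now",
];

const bigramCounts = new Map(); // "prev next" -> count
const prevCounts = new Map();   // "prev" -> how often it precedes another word

for (const sentence of corpus) {
  const words = sentence.split(" ");
  for (let i = 0; i < words.length - 1; i++) {
    const pair = `${words[i]} ${words[i + 1]}`;
    bigramCounts.set(pair, (bigramCounts.get(pair) || 0) + 1);
    prevCounts.set(words[i], (prevCounts.get(words[i]) || 0) + 1);
  }
}

// P("pizza" | "the") = count("the pizza") / count("the" followed by anything)
function bigramProbability(prev, next) {
  const pairCount = bigramCounts.get(`${prev} ${next}`) || 0;
  return pairCount / (prevCounts.get(prev) || 1);
}

console.log(bigramProbability("the", "pizza")); // ≈ 0.67 in this toy corpus
```

A decoder can use probabilities like these to prefer "order the pizza" over an acoustically similar but far less likely word sequence.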

A wide number of industries are utilizing different applications of speech technology today, helping businesses and consumers save time and even lives. Some examples include:

Automotive: Speech recognizers improve driver safety by enabling voice-activated navigation systems and search capabilities in car radios.

Technology: Virtual agents are increasingly becoming integrated within our daily lives, particularly on our mobile devices. We use voice commands to access them through our smartphones, such as through Google Assistant or Apple’s Siri, for tasks, such as voice search, or through our speakers, via Amazon’s Alexa or Microsoft’s Cortana, to play music. They’ll only continue to integrate into the everyday products that we use, fueling the “Internet of Things” movement.

Healthcare: Doctors and nurses leverage dictation applications to capture and log patient diagnoses and treatment notes.

Sales: Speech recognition technology has a couple of applications in sales. It can help a call center transcribe thousands of phone calls between customers and agents to identify common call patterns and issues. AI chatbots can also talk to people via a webpage, answering common queries and solving basic requests without needing to wait for a contact center agent to be available. In both instances speech recognition systems help reduce time to resolution for consumer issues.

Security: As technology integrates into our daily lives, security protocols are an increasing priority. Voice-based authentication adds a viable level of security.


Speech Recognition: Everything You Need to Know in 2024

Speech recognition, also known as automatic speech recognition (ASR), enables seamless communication between humans and machines. This technology empowers organizations to transform human speech into written text. Speech recognition technology can revolutionize many business applications, including customer service, healthcare, finance and sales.

In this comprehensive guide, we will explain speech recognition, exploring how it works, the algorithms involved, and use cases across various industries.

If you require training data for your speech recognition system, here is a guide to finding the right speech data collection services.

What is speech recognition?

Speech recognition, also known as automatic speech recognition (ASR), speech-to-text (STT), and computer speech recognition, is a technology that enables a computer to recognize and convert spoken language into text.

Speech recognition technology uses AI and machine learning models to accurately identify and transcribe different accents, dialects, and speech patterns.

What are the features of speech recognition systems?

Speech recognition systems have several components that work together to understand and process human speech. Key features of effective speech recognition are:

  • Audio preprocessing: After you have obtained the raw audio signal from an input device, you need to preprocess it to improve the quality of the speech input. The main goal of audio preprocessing is to capture relevant speech data by removing any unwanted artifacts and reducing noise.
  • Feature extraction: This stage converts the preprocessed audio signal into a more informative representation. This makes raw audio data more manageable for machine learning models in speech recognition systems (a small sketch after this list illustrates these two stages).
  • Language model weighting: Language weighting gives more weight to certain words and phrases, such as product references, in audio and voice signals. This makes those keywords more likely to be recognized in subsequent speech by speech recognition systems.
  • Acoustic modeling: It enables speech recognizers to capture and distinguish phonetic units within a speech signal. Acoustic models are trained on large datasets containing speech samples from a diverse set of speakers with different accents, speaking styles, and backgrounds.
  • Speaker labeling: It enables speech recognition applications to determine the identities of multiple speakers in an audio recording. It assigns unique labels to each speaker in an audio recording, allowing the identification of which speaker was speaking at any given time.
  • Profanity filtering: The process of removing offensive, inappropriate, or explicit words or phrases from audio data.
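As a rough illustration of the audio preprocessing and feature extraction stages above, the sketch below applies a standard pre-emphasis filter and slices the signal into overlapping frames. The coefficient, frame length, and hop length are common defaults assumed for this example, not values mandated by any particular system.

```js
// Toy preprocessing for a mono PCM signal held in a Float32Array.
// Pre-emphasis boosts high frequencies; framing produces the short,
// overlapping windows that feature extraction (e.g. MFCCs) operates on.
function preEmphasis(samples, coeff = 0.97) {
  const out = new Float32Array(samples.length);
  out[0] = samples[0];
  for (let i = 1; i < samples.length; i++) {
    out[i] = samples[i] - coeff * samples[i - 1];
  }
  return out;
}

function frameSignal(samples, frameLength = 400, hopLength = 160) {
  // At a 16 kHz sample rate, 400 samples ≈ 25 ms and 160 samples ≈ 10 ms.
  const frames = [];
  for (let start = 0; start + frameLength <= samples.length; start += hopLength) {
    frames.push(samples.subarray(start, start + frameLength));
  }
  return frames;
}

// Usage: const frames = frameSignal(preEmphasis(rawSamples));
```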

What are the different speech recognition algorithms?

Speech recognition uses various algorithms and computation techniques to convert spoken language into written language. The following are some of the most commonly used speech recognition methods:

  • Hidden Markov Models (HMMs): Hidden Markov model is a statistical Markov model commonly used in traditional speech recognition systems. HMMs capture the relationship between the acoustic features and model the temporal dynamics of speech signals.
  • Language models (LMs): Working together with the acoustic model and pronunciation lexicon, language models are used to:
      • Estimate the probability of word sequences in the recognized text
      • Convert colloquial expressions and abbreviations in a spoken language into a standard written form
      • Map phonetic units obtained from acoustic models to their corresponding words in the target language.
  • Speaker Diarization (SD): Speaker diarization, or speaker labeling, is the process of identifying and attributing speech segments to their respective speakers (Figure 1). It allows for speaker-specific voice recognition and the identification of individuals in a conversation.

Figure 1: A flowchart illustrating the speaker diarization process, in which multiple speakers in an audio recording are segmented and identified.

  • Dynamic Time Warping (DTW): Speech recognition algorithms use the Dynamic Time Warping (DTW) algorithm to find an optimal alignment between two sequences (Figure 2); a small sketch follows this list.

Figure 2: A speech recognizer using dynamic time warping to determine the optimal distance between elements of two sequences.

  • Deep neural networks: Neural networks process and transform input data by simulating the non-linear frequency perception of the human auditory system.
  • Connectionist Temporal Classification (CTC): It is a training objective introduced by Alex Graves in 2006. CTC is especially useful for sequence labeling tasks and end-to-end speech recognition systems. It allows the neural network to discover the relationship between input frames and align input frames with output labels.
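To illustrate the DTW alignment mentioned above, here is a minimal JavaScript sketch that computes the DTW distance between two one-dimensional feature sequences. It is only a toy example of the alignment idea, not code from any production recognizer.

```js
// Dynamic Time Warping distance between two 1-D sequences.
// dtw[i][j] holds the cost of the best alignment of a[0..i-1] with b[0..j-1].
function dtwDistance(a, b) {
  const n = a.length;
  const m = b.length;
  const dtw = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(Infinity));
  dtw[0][0] = 0;

  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const cost = Math.abs(a[i - 1] - b[j - 1]);
      // Extend the cheapest neighboring alignment (insertion, deletion, or match).
      dtw[i][j] = cost + Math.min(dtw[i - 1][j], dtw[i][j - 1], dtw[i - 1][j - 1]);
    }
  }
  return dtw[n][m];
}

// Two renditions of the "same word" at different speaking rates still align closely:
console.log(dtwDistance([1, 2, 3, 4], [1, 1, 2, 3, 3, 4])); // 0
```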

Speech recognition vs voice recognition

Speech recognition is commonly confused with voice recognition, yet they refer to distinct concepts. Speech recognition converts spoken words into written text, focusing on identifying the words and sentences spoken by a user, regardless of the speaker’s identity.

On the other hand, voice recognition is concerned with recognizing or verifying a speaker’s voice, aiming to determine the identity of an unknown speaker rather than focusing on understanding the content of the speech.

What are the challenges of speech recognition with solutions?

While speech recognition technology offers many benefits, it still faces a number of challenges that need to be addressed. Some of the main limitations of speech recognition include:

Acoustic Challenges:

  • Accent and pronunciation variations: Assume a speech recognition model has been primarily trained on American English accents. If a speaker with a strong Scottish accent uses the system, they may encounter difficulties due to pronunciation differences. For example, the word “water” is pronounced differently in the two accents, and if the system is not familiar with a given pronunciation, it may struggle to recognize the word.

Solution: Addressing these challenges is crucial to enhancing speech recognition applications’ accuracy. To overcome pronunciation variations, it is essential to expand the training data to include samples from speakers with diverse accents. This approach helps the system recognize and understand a broader range of speech patterns.

  • Background noise: Background noise, such as babble, car, or rain noise, makes it difficult for speech recognition software to distinguish speech from other sounds (Figure 3). Solution: You can use data augmentation techniques to reduce the impact of noise on audio data. Data augmentation helps train speech recognition models with noisy data to improve model accuracy in real-world environments.

Figure 3: Examples of a target sentence (“The clown had a funny face”) in the background noise of babble, car and rain.

Linguistic Challenges:

  • Out-of-vocabulary (OOV) words: Since the speech recognition model has not been trained on OOV words, it may recognize them incorrectly as different words or fail to transcribe them when it encounters them.

Figure 4: An example of detecting an OOV word.

Solution: Word Error Rate (WER) is a common metric that is used to measure the accuracy of a speech recognition or machine translation system. The word error rate can be computed as:

Figure 5: How word error rate (WER) is calculated. WER is a metric used to evaluate the performance and accuracy of speech recognition systems.
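The figure itself is not reproduced in this extract, but the standard definition of word error rate that it illustrates is:

```latex
\mathrm{WER} = \frac{S + D + I}{N}
```

where S is the number of substituted words, D the number of deleted words, I the number of inserted words, and N the number of words in the reference transcript. For example, a 10-word reference transcribed with one substitution, one deletion, and no insertions gives WER = (1 + 1 + 0) / 10 = 20%.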

  • Homophones: Homophones are words that are pronounced identically but have different meanings, such as “to,” “too,” and “two”. Solution: Semantic analysis allows speech recognition programs to select the appropriate homophone based on its intended meaning in a given context. Addressing homophones improves the ability of the speech recognition process to understand and transcribe spoken words accurately.

Technical/System Challenges:

  • Data privacy and security: Speech recognition systems involve processing and storing sensitive and personal information, such as financial information. An unauthorized party could use the captured information, leading to privacy breaches.

Solution: You can encrypt sensitive and personal audio information transmitted between the user’s device and the speech recognition software. Another technique for addressing data privacy and security in speech recognition systems is data masking. Data masking algorithms mask and replace sensitive speech data with structurally identical but acoustically different data.

Figure 6: An example of how data masking protects sensitive or confidential audio information in speech recognition applications by replacing or encrypting the original audio data.

  • Limited training data: Limited training data directly impacts the performance of speech recognition software. With insufficient training data, the speech recognition model may struggle to generalize to different accents or recognize less common words.

Solution: To improve the quality and quantity of training data, you can expand the existing dataset using data augmentation and synthetic data generation technologies.

13 speech recognition use cases and applications

In this section, we will explain how speech recognition revolutionizes the communication landscape across industries and changes the way businesses interact with machines.

Customer Service and Support

  • Interactive Voice Response (IVR) systems: Interactive voice response (IVR) is a technology that automates the process of routing callers to the appropriate department. It understands customer queries and routes calls to the relevant departments. This reduces the call volume for contact centers and minimizes wait times. IVR systems address simple customer questions without human intervention by employing pre-recorded messages or text-to-speech technology. Automatic Speech Recognition (ASR) allows IVR systems to comprehend and respond to customer inquiries and complaints in real time.
  • Customer support automation and chatbots: According to a survey, 78% of consumers interacted with a chatbot in 2022, but 80% of respondents said using chatbots increased their frustration level.
  • Sentiment analysis and call monitoring: Speech recognition technology converts spoken content from a call into text. After speech-to-text processing, natural language processing (NLP) techniques analyze the text and assign a sentiment score to the conversation, such as positive, negative, or neutral. By integrating speech recognition with sentiment analysis, organizations can address issues early on and gain valuable insights into customer preferences.
  • Multilingual support: Speech recognition software can be trained in various languages to recognize and transcribe the language spoken by a user accurately. By integrating speech recognition technology into chatbots and Interactive Voice Response (IVR) systems, organizations can overcome language barriers and reach a global audience (Figure 7). Multilingual chatbots and IVR automatically detect the language spoken by a user and switch to the appropriate language model.

Figure 7: A multilingual chatbot recognizing words in another language.

  • Customer authentication with voice biometrics: Voice biometrics use speech recognition technologies to analyze a speaker’s voice and extract features such as accent and speed to verify their identity.

Sales and Marketing:

  • Virtual sales assistants: Virtual sales assistants are AI-powered chatbots that assist customers with purchasing and communicate with them through voice interactions. Speech recognition allows virtual sales assistants to understand the intent behind spoken language and tailor their responses based on customer preferences.
  • Transcription services: Speech recognition software records audio from sales calls and meetings and then converts the spoken words into written text using speech-to-text algorithms.

Automotive:

  • Voice-activated controls: Voice-activated controls allow users to interact with devices and applications using voice commands. Drivers can operate features like climate control, phone calls, or navigation systems.
  • Voice-assisted navigation: Voice-assisted navigation provides real-time voice-guided directions by utilizing the driver’s voice input for the destination. Drivers can request real-time traffic updates or search for nearby points of interest using voice commands without physical controls.

Healthcare:

  • Medical dictation and transcription: Speech recognition streamlines clinical documentation, which typically involves:
      • Recording the physician’s dictation
      • Transcribing the audio recording into written text using speech recognition technology
      • Editing the transcribed text for better accuracy and correcting errors as needed
      • Formatting the document in accordance with legal and medical requirements.
  • Virtual medical assistants: Virtual medical assistants (VMAs) use speech recognition, natural language processing, and machine learning algorithms to communicate with patients through voice or text. Speech recognition software allows VMAs to respond to voice commands, retrieve information from electronic health records (EHRs) and automate the medical transcription process.
  • Electronic Health Records (EHR) integration: Healthcare professionals can use voice commands to navigate the EHR system, access patient data, and enter data into specific fields.

Technology:

  • Virtual agents: Virtual agents utilize natural language processing (NLP) and speech recognition technologies to understand spoken language and convert it into text. Speech recognition enables virtual agents to process spoken language in real-time and respond promptly and accurately to user voice commands.


How to set up and use Windows 10 Speech Recognition

Windows 10 has a hands-free Speech Recognition feature, and in this guide, we show you how to set up the experience and perform common tasks.

On Windows 10, Speech Recognition is an easy-to-use experience that allows you to control your computer entirely with voice commands.

Anyone can set up and use this feature to navigate, launch applications, dictate text, and perform a slew of other tasks. However, Speech Recognition was primarily designed to help people with disabilities who can't use a mouse or keyboard.

In this Windows 10 guide, we walk you through the steps to configure and start using Speech Recognition to control your computer only with voice.

How to configure Speech Recognition on Windows 10

To set up Speech Recognition on your device, use these steps:

  • Open Control Panel .
  • Click on Ease of Access .
  • Click on Speech Recognition .

  • Click the Start Speech Recognition link.

  • In the "Set up Speech Recognition" page, click Next .
  • Select the type of microphone you'll be using. Note: Desktop microphones are not ideal, and Microsoft recommends headset microphones or microphone arrays.

  • Click Next .
  • Click Next again.

  • Read the text aloud to ensure the feature can hear you.

  • Speech Recognition can access your documents and emails to improve its accuracy based on the words you use. Select the Enable document review option, or select Disable document review if you have privacy concerns.

  • Select the activation mode you want to use:
      • Use manual activation mode — Speech Recognition turns off when you use the "Stop Listening" command. To turn it back on, you'll need to click the microphone button or use the Ctrl + Windows key shortcut.
      • Use voice activation mode — Speech Recognition goes into sleep mode when not in use, and you'll need to invoke the "Start Listening" voice command to turn it back on.

  • If you're not familiar with the commands, click the View Reference Sheet button to learn more about the voice commands you can use.

  • Select whether you want this feature to start automatically at startup.

  • Click the Start tutorial button to access the Microsoft video tutorial about this feature, or click the Skip tutorial button to complete the setup.

Once you complete these steps, you can start using the feature with voice commands, and the controls will appear at the top of the screen.

Quick Tip: You can drag and dock the Speech Recognition interface anywhere on the screen.

After the initial setup, we recommend training Speech Recognition to improve its accuracy and to prevent the "What was that?" message as much as possible.

  • Click the Train your computer to better understand you link.

  • Click Next to continue with the training as directed by the application.

After completing the training, Speech Recognition should have a better understanding of your voice to provide an improved experience.

If you need to change the Speech Recognition settings, use these steps:

  • Click the Advanced speech options link in the left pane.

Inside "Speech Properties," in the Speech Recognition tab, you can customize various aspects of the experience, including:

  • Recognition profiles.
  • User settings.
  • Microphone.

In the Text to Speech tab, you can control voice settings, including:

  • Voice selection.
  • Voice speed.

Additionally, you can always right-click the experience interface to open a context menu to access all the different features and settings you can use with Speech Recognition.

While there is a small learning curve, Speech Recognition uses clear and easy-to-remember commands. For example, using the "Start" command opens the Start menu, while saying "Show Desktop" will minimize everything on the screen.

If Speech Recognition is having difficulties understanding your voice, you can always use the Show numbers command, as everything on the screen has a number. Then say the number and say "OK" to execute the command.

Here are some common tasks that will get you started with Speech Recognition:

Starting Speech Recognition

To launch the experience, just open the Start menu , search for Windows Speech Recognition , and select the top result.

Turning on and off

To start using the feature, click the microphone button or say Start listening depending on your configuration.

In the same way, you can turn it off by saying Stop listening or clicking the microphone button.

Using commands

Some of the most frequent commands you'll use include:

  • Open — Launches an app when saying "Open" followed by the name of the app. For example, "Open Mail," or "Open Firefox."
  • Switch to — Jumps to another running app when saying "Switch to" followed by the name of the app. For example, "Switch to Microsoft Edge."
  • Control window in focus — You can use the commands "Minimize," "Maximize," and "Restore" to control an active window.
  • Scroll — Allows you to scroll in a page. Simply use the command "Scroll down" or "Scroll up," "Scroll left" or "Scroll right." It's also possible to specify long scrolls. For example, you can try: "Scroll down two pages."
  • Close app — Terminates an application by saying "Close" followed by the name of the running application. For example, "Close Word."
  • Clicks — Inside an application, you can use the "Click" command followed by the name of the element to perform a click. For example, in Word, you can say "Click Layout," and Speech Recognition will open the Layout tab. In the same way, you can use "Double-click" or "Right-click" commands to perform those actions.
  • Press — This command lets you execute shortcuts. For example, you can say "Press Windows A" to open Action Center.

Using dictation

Speech Recognition also includes the ability to convert voice into text using the dictation functionality, and it works automatically.

If you need to dictate text, open the application (making sure the feature is in listening mode) and start dictating. However, remember that you'll have to say each punctuation mark and special character.

For example, if you want to insert the "Good morning, where do you like to go today?" sentence, you'll need to speak, "Open quote good morning comma where do you like to go today question mark close quote."

In the case that you need to correct some text that wasn't recognized accurately, use the "Correct" command followed by the text you want to change. For example, if you meant to write "suite" and the feature recognized it as "suit," you can say "Correct suit," select the suggestion using the correction panel or say "Spell it" to speak the correct text, and then say "OK".

Wrapping things up

Although Speech Recognition doesn't offer a conversational experience like a personal assistant, it's still a powerful tool for anyone who needs to control their device entirely using only voice.

Cortana also provides the ability to control a device with voice, but it's limited to a specific set of input commands, and it's not possible to control everything that appears on the screen.

However, that doesn't mean that you can't get the best of both worlds. Speech Recognition runs independently of Cortana, which means that you can use Microsoft's digital assistant for certain tasks and Speech Recognition to navigate and execute other commands.

It's worth noting that this speech recognition isn't available in every language. Supported languages include English (U.S. and UK), French, German, Japanese, Mandarin (Chinese Simplified and Chinese Traditional), and Spanish.

While this guide is focused on Windows 10, Speech Recognition has been around for a long time, so you can refer to it even if you're using Windows 8.1 or Windows 7.

Speak Up: How to Use Speech Recognition and Dictate Text in Windows

You can talk to Windows using the built-in speech recognition or text dictation features.

Lance Whitney

Did you know you can issue commands to Windows? You can tell the operating system to open applications, dictate text, and perform many other tasks. This can be done through Cortana, or you can use the speech recognition feature built directly into Windows 10 and 11.

Once you teach the operating system to understand the sound of your voice, it will respond to your commands. This is a feature that is especially useful to users with disabilities who cannot use the mouse and keyboard, but it is also available for anyone to use. There is even a built-in reference guide to show you what commands you can use.

Windows also offers a dictation feature that you can use to create documents, emails, and other files using the sound of your voice. Once the dictation is active, you’re able to dictate text as well as punctuation marks, special characters, and cursor movements.

Both features work similarly in Windows 10 and 11; however, there are some differences in the look and layout of the dictation window. Let's check out how to use speech recognition and dictation in Windows.

Activate Online Speech Recognition

In order to use speech recognition in Windows 10, you will first need to enable online speech recognition. To do this, open Settings > Privacy > Speech and enable Online speech recognition.

While this is required in Windows 10, it is optional in Windows 11. If you want to enable this feature, head to Settings > Privacy & security > Speech and turn on Online speech recognition.

To address any privacy concerns you may have about this feature, read the Microsoft Privacy Statement, which describes how it works.

How to Use Dictation

Open an application in which you want to dictate text, such as Notepad, WordPad, Microsoft Word, or Mail. To trigger the dictation, press the Windows key + H .

If you're using Windows 10, you will see the rectangular dictation window appear at the top of the screen with a message indicating that it is listening.

For Windows 11 users, the square dictation window appears at the bottom of the screen, also with a message to tell you that it’s listening.

Dictate punctuation and formatting

When you start speaking, Windows is smart enough to handle certain tasks automatically, such as capitalizing the first word of a sentence. You can then dictate punctuation and start a new paragraph by saying "period," "comma," "new line," "new paragraph," or whatever other action you need Windows to take. Here are the punctuation characters and symbols you can dictate, according to Microsoft:

If you make a mistake, simply undo it by saying "Undo that." Your recent word, phrase, or sentence will then be removed. If you stop speaking for a few seconds, the dictation will stop listening. You can also pause the dictation on your own by saying "Stop dictation," or by clicking the microphone icon. Click it again to start the dictation again.

Edit through dictation

Now, let's say you finished writing and need to edit the text to correct mistakes or change certain words. You can edit by voice, though the process is more cumbersome than using your mouse and keyboard. But if you know the right phrases, you may want to try it out. Here are the editing commands you can dictate, according to Microsoft:

How to Use Speech Recognition

Speech Recognition is another option if you want to control Windows 8.1, 10, or 11 with your voice. To set this up, open Control Panel in icon view and click the Speech Recognition applet. Choose the Start Speech Recognition link to set up the feature.

The first screen for setting up speech recognition explains what the feature does and how it works. Click Next , then choose whether you are using a headset, desktop, or standalone microphone. Click Next to see information on how to properly place your microphone.

Click Next again and read the sentence aloud to make sure the speech recognition feature picked up the sound and volume of your voice. Click Next , and if your voice was properly detected, the screen will tell you the microphone is set up and ready to use.

Click Next and decide if you want the speech recognition feature to examine the documents and email messages in your Windows search index. This helps the feature better understand the words you typically use. If you're OK with this, click Enable document review . If you're concerned about privacy issues, click Disable document review . Click Next.

Click Next , then make a decision on Activation Mode. Select Use manual activation mode if you want to turn on speech recognition by clicking the microphone button. Choose Use voice activation mode to start speech recognition by saying "Start listening."

Click Next to view a Reference Sheet listing all the commands you’re able to issue with your voice. Click the View Reference Sheet button to open and read a web page with all the voice commands that work in Windows 10 and 11.

Click Next , then choose whether you want Speech Recognition to automatically load each time you start Windows. Click Next to get a chance to learn how to use the feature. Click Start tutorial to get a built-in lesson or click Skip tutorial to bypass this part.

If you chose to run the tutorial, an interactive web page pops up with videos and instructions on how to use speech recognition in Windows. The Speech Recognition control panel also appears at the top of the screen.

Customize and Control

You can now start talking to your computer or customize the speech recognition tool. Return to the Control Panel and open Speech Recognition. Click the Advanced speech options link to tweak the Speech Recognition and text-to-speech features.

Features and Options

If you right-click on the microphone button on the Speech Recognition panel at the top of the screen, a pop-up screen will appear. From this menu, you can access different features and configure various options.


Using the Web Speech API

Speech recognition

Speech recognition involves receiving speech through a device's microphone, which is then checked by a speech recognition service against a list of grammar (basically, the vocabulary you want to have recognized in a particular app.) When a word or phrase is successfully recognized, it is returned as a result (or list of results) as a text string, and further actions can be initiated as a result.

The Web Speech API has a main controller interface for this — SpeechRecognition — plus a number of closely-related interfaces for representing grammar, results, etc. Generally, the default speech recognition system available on the device will be used for the speech recognition — most modern OSes have a speech recognition system for issuing voice commands. Think about Dictation on macOS, Siri on iOS, Cortana on Windows 10, Android Speech, etc.

Note: On some browsers, such as Chrome, using Speech Recognition on a web page involves a server-based recognition engine. Your audio is sent to a web service for recognition processing, so it won't work offline.

To show simple usage of Web speech recognition, we've written a demo called Speech color changer . When the screen is tapped/clicked, you can say an HTML color keyword, and the app's background color will change to that color.

The UI of an app titled Speech Color changer. It invites the user to tap the screen and say a color, and then it turns the background of the app that color. In this case it has turned the background red.

To run the demo, navigate to the live demo URL in a supporting mobile browser (such as Chrome).

HTML and CSS

The HTML and CSS for the app is really trivial. We have a title, instructions paragraph, and a div into which we output diagnostic messages.

The CSS provides a very simple responsive styling so that it looks OK across devices.

Let's look at the JavaScript in a bit more detail.

Prefixed properties

Browsers currently support speech recognition with prefixed properties. Therefore at the start of our code we include these lines to allow for both prefixed properties and unprefixed versions that may be supported in future:
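The demo's code is not included in this extract; the lines in question look roughly like the following (the webkit-prefixed fallbacks reflect what Chromium-based browsers expose):

```js
// Fall back to the webkit-prefixed constructors where the unprefixed ones are absent.
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const SpeechGrammarList = window.SpeechGrammarList || window.webkitSpeechGrammarList;
const SpeechRecognitionEvent =
  window.SpeechRecognitionEvent || window.webkitSpeechRecognitionEvent;
```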

The grammar

The next part of our code defines the grammar we want our app to recognize. The following variable is defined to hold our grammar:
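The variable itself is not shown in this extract; a representative version, assuming an array of HTML color keywords named colors, looks like this:

```js
// A JSGF grammar with a single public rule, <color>, whose alternatives are
// the color keywords we want the recognizer to accept.
const colors = ["aqua", "azure", "beige", "bisque", "black", "blue" /* … */];
const grammar = `#JSGF V1.0; grammar colors; public <color> = ${colors.join(" | ")};`;
```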

The grammar format used is JSpeech Grammar Format ( JSGF ) — you can find a lot more about it at the previous link to its spec. However, for now let's just run through it quickly:

  • The lines are separated by semicolons, just like in JavaScript.
  • The first line — #JSGF V1.0; — states the format and version used. This always needs to be included first.
  • The second line indicates a type of term that we want to recognize. public declares that it is a public rule, the string in angle brackets defines the recognized name for this term ( color ), and the list of items that follow the equals sign are the alternative values that will be recognized and accepted as appropriate values for the term. Note how each is separated by a pipe character.
  • You can have as many terms defined as you want on separate lines following the above structure, and include fairly complex grammar definitions. For this basic demo, we are just keeping things simple.

Plugging the grammar into our speech recognition

The next thing to do is define a speech recognition instance to control the recognition for our application. This is done using the SpeechRecognition() constructor. We also create a new speech grammar list to contain our grammar, using the SpeechGrammarList() constructor.

We add our grammar to the list using the SpeechGrammarList.addFromString() method. This accepts as parameters the string we want to add, plus optionally a weight value that specifies the importance of this grammar in relation to other grammars available in the list (it can be from 0 to 1 inclusive). The added grammar is available in the list as a SpeechGrammar object instance.

We then add the SpeechGrammarList to the speech recognition instance by setting it to the value of the SpeechRecognition.grammars property. We also set a few other properties of the recognition instance before we move on (a combined sketch of this setup follows the list below):

  • SpeechRecognition.continuous: Controls whether continuous results are captured (true), or just a single result each time recognition is started (false).
  • SpeechRecognition.lang: Sets the language of the recognition. Setting this is good practice, and therefore recommended.
  • SpeechRecognition.interimResults: Defines whether the speech recognition system should return interim results, or just final results. Final results are good enough for this simple demo.
  • SpeechRecognition.maxAlternatives: Sets the number of alternative potential matches that should be returned per result. This can sometimes be useful, say if a result is not completely clear and you want to display a list of alternatives for the user to choose the correct one from. But it is not needed for this simple demo, so we are just specifying one (which is actually the default anyway).
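Putting those steps together, the setup described above looks roughly like this (the locale is an assumption for the sketch):

```js
// Create the recognition object and grammar list, register the grammar,
// and set the options discussed in the list above.
const recognition = new SpeechRecognition();
const speechRecognitionList = new SpeechGrammarList();
speechRecognitionList.addFromString(grammar, 1); // weight 1 = most important

recognition.grammars = speechRecognitionList;
recognition.continuous = false;     // a single result per recognition start
recognition.lang = "en-US";         // assumed locale
recognition.interimResults = false; // final results only
recognition.maxAlternatives = 1;    // one candidate per result
```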

Starting the speech recognition

After grabbing references to the output <div> and the HTML element (so we can output diagnostic messages and update the app background color later on), we implement an onclick handler so that when the screen is tapped/clicked, the speech recognition service will start. This is achieved by calling SpeechRecognition.start() . The forEach() method is used to output colored indicators showing what colors to try saying.
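A sketch of that handler is shown below; the .output selector, the diagnostic and bg variable names, and the omission of the color-indicator loop are simplifications for this extract:

```js
// Grab the output <div> and the <html> element, then start recognition on click/tap.
const diagnostic = document.querySelector(".output"); // assumed selector
const bg = document.querySelector("html");

document.body.onclick = () => {
  recognition.start();
  console.log("Ready to receive a color command.");
};
```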

Receiving and handling results

Once the speech recognition is started, there are many event handlers that can be used to retrieve results, and other pieces of surrounding information (see the SpeechRecognition events .) The most common one you'll probably use is the result event, which is fired once a successful result is received:
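A handler along these lines (a sketch reusing the diagnostic and bg names assumed earlier) is what the following paragraph walks through:

```js
recognition.onresult = (event) => {
  const color = event.results[0][0].transcript;
  diagnostic.textContent = `Result received: ${color}.`;
  bg.style.backgroundColor = color;
  console.log(`Confidence: ${event.results[0][0].confidence}`);
};
```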

The second line here is a bit complex-looking, so let's explain it step by step. The SpeechRecognitionEvent.results property returns a SpeechRecognitionResultList object containing SpeechRecognitionResult objects. It has a getter so it can be accessed like an array — so the first [0] returns the SpeechRecognitionResult at position 0. Each SpeechRecognitionResult object contains SpeechRecognitionAlternative objects that contain individual recognized words. These also have getters so they can be accessed like arrays — the second [0] therefore returns the SpeechRecognitionAlternative at position 0. We then return its transcript property to get a string containing the individual recognized result as a string, set the background color to that color, and report the color recognized as a diagnostic message in the UI.

We also use the speechend event to stop the speech recognition service from running (using SpeechRecognition.stop() ) once a single word has been recognized and it has finished being spoken:
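A minimal version of that handler:

```js
// Stop the recognition service once the user has finished speaking.
recognition.onspeechend = () => {
  recognition.stop();
};
```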

Handling errors and unrecognized speech

The last two handlers are there to handle cases where speech was recognized that wasn't in the defined grammar, or an error occurred. The nomatch event seems to be supposed to handle the first case mentioned, although note that at the moment it doesn't seem to fire correctly; it just returns whatever was recognized anyway:
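A sketch of the nomatch handler (the message text is illustrative):

```js
recognition.onnomatch = (event) => {
  diagnostic.textContent = "I didn't recognize that color.";
};
```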

The error event handles cases where there is an actual error with the recognition — the SpeechRecognitionErrorEvent.error property contains the actual error returned:
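A sketch of the error handler:

```js
recognition.onerror = (event) => {
  diagnostic.textContent = `Error occurred in recognition: ${event.error}`;
};
```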

Speech synthesis

Speech synthesis (aka text-to-speech, or TTS) involves synthesizing text contained within an app into speech, and playing it out of a device's speaker or audio output connection.

The Web Speech API has a main controller interface for this — SpeechSynthesis — plus a number of closely-related interfaces for representing text to be synthesized (known as utterances), voices to be used for the utterance, etc. Again, most OSes have some kind of speech synthesis system, which will be used by the API for this task as available.

To show simple usage of Web speech synthesis, we've provided a demo called Speak easy synthesis . This includes a set of form controls for entering text to be synthesized, and setting the pitch, rate, and voice to use when the text is uttered. After you have entered your text, you can press Enter / Return to hear it spoken.

UI of an app called speak easy synthesis. It has an input field in which to input text to be synthesized, slider controls to change the rate and pitch of the speech, and a drop down menu to choose between different voices.

To run the demo, navigate to the live demo URL in a supporting mobile browser.

The HTML and CSS are again pretty trivial, containing a title, some instructions for use, and a form with some simple controls. The <select> element is initially empty, but is populated with <option> s via JavaScript (see later on.)

Let's investigate the JavaScript that powers this app.

Setting variables

First of all, we capture references to all the DOM elements involved in the UI, but more interestingly, we capture a reference to Window.speechSynthesis. This is the API's entry point — it returns an instance of SpeechSynthesis, the controller interface for web speech synthesis.
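The variable declarations are not included in this extract; a representative version, with element selectors assumed for this sketch, is:

```js
// Entry point for speech synthesis, plus references to the demo's form controls.
const synth = window.speechSynthesis;

const inputForm = document.querySelector("form");
const inputTxt = document.querySelector(".txt");           // assumed selector
const voiceSelect = document.querySelector("select");
const pitch = document.querySelector("#pitch");            // assumed selector
const pitchValue = document.querySelector(".pitch-value"); // assumed selector
const rate = document.querySelector("#rate");              // assumed selector
const rateValue = document.querySelector(".rate-value");   // assumed selector

let voices = [];
```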

Populating the select element

To populate the <select> element with the different voice options the device has available, we've written a populateVoiceList() function. We first invoke SpeechSynthesis.getVoices(), which returns a list of all the available voices, represented by SpeechSynthesisVoice objects. We then loop through this list — for each voice we create an <option> element, set its text content to display the name of the voice (grabbed from SpeechSynthesisVoice.name) and the language of the voice (grabbed from SpeechSynthesisVoice.lang), and append " -- DEFAULT" if the voice is the default voice for the synthesis engine (checked by seeing if SpeechSynthesisVoice.default returns true).

We also create data- attributes for each option, containing the name and language of the associated voice, so we can grab them easily later on, and then append the options as children of the select.
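A sketch of populateVoiceList(), assuming the voiceSelect and voices variables from the previous sketch:

```js
function populateVoiceList() {
  voices = synth.getVoices();
  for (const voice of voices) {
    const option = document.createElement("option");
    option.textContent = `${voice.name} (${voice.lang})`;
    if (voice.default) {
      option.textContent += " -- DEFAULT";
    }
    // Stash the name and language so the submit handler can look the voice up later.
    option.setAttribute("data-lang", voice.lang);
    option.setAttribute("data-name", voice.name);
    voiceSelect.appendChild(option);
  }
}
```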

Older browsers don't support the voiceschanged event and just return a list of voices when SpeechSynthesis.getVoices() is called, while on others, such as Chrome, you have to wait for the event to fire before populating the list. To allow for both cases, we run the function as shown below:
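That conditional call looks roughly like this:

```js
// Populate immediately, and again when the voices list becomes available.
populateVoiceList();
if (speechSynthesis.onvoiceschanged !== undefined) {
  speechSynthesis.onvoiceschanged = populateVoiceList;
}
```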

Speaking the entered text

Next, we create an event handler to start speaking the text entered into the text field. We are using an onsubmit handler on the form so that the action happens when Enter / Return is pressed. We first create a new SpeechSynthesisUtterance() instance using its constructor — this is passed the text input's value as a parameter.

Next, we need to figure out which voice to use. We use the HTMLSelectElement selectedOptions property to return the currently selected <option> element. We then use this element's data-name attribute, finding the SpeechSynthesisVoice object whose name matches this attribute's value. We set the matching voice object to be the value of the SpeechSynthesisUtterance.voice property.

Finally, we set the SpeechSynthesisUtterance.pitch and SpeechSynthesisUtterance.rate to the values of the relevant range form elements. Then, with all necessary preparations made, we start the utterance being spoken by invoking SpeechSynthesis.speak() , passing it the SpeechSynthesisUtterance instance as a parameter.

In the final part of the handler, we include a pause event to demonstrate how SpeechSynthesisEvent can be put to good use. When SpeechSynthesis.pause() is invoked, this returns a message reporting the character number and name that the speech was paused at.

Finally, we call blur() on the text input. This is mainly to hide the keyboard on Firefox OS.
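A sketch of the whole submit handler described in the last three paragraphs, reusing the element references assumed earlier:

```js
inputForm.onsubmit = (event) => {
  event.preventDefault();

  // Build the utterance from the text field and pick the selected voice.
  const utterThis = new SpeechSynthesisUtterance(inputTxt.value);
  const selectedName = voiceSelect.selectedOptions[0].getAttribute("data-name");
  for (const voice of voices) {
    if (voice.name === selectedName) {
      utterThis.voice = voice;
    }
  }

  // Apply the slider values and speak the utterance.
  utterThis.pitch = pitch.value;
  utterThis.rate = rate.value;
  synth.speak(utterThis);

  // Report the character index (and character) at which speech was paused.
  utterThis.onpause = (event) => {
    const char = event.utterance.text.charAt(event.charIndex);
    console.log(
      `Speech paused at character ${event.charIndex} of "${event.utterance.text}", which is "${char}".`
    );
  };

  inputTxt.blur(); // mainly to hide the on-screen keyboard
};
```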

Updating the displayed pitch and rate values

The last part of the code updates the pitch / rate values displayed in the UI, each time the slider positions are moved.
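That code is a pair of small change handlers, roughly:

```js
// Keep the displayed numbers in sync with the sliders.
pitch.onchange = () => {
  pitchValue.textContent = pitch.value;
};
rate.onchange = () => {
  rateValue.textContent = rate.value;
};
```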

What is Voice Recognition?

Voice recognition is a technology that enables devices to understand and respond to spoken words. It turns what you say into text and lets you control devices just by talking to them. This technology is key in many modern tools like smartphones, smart speakers, and car systems, helping with tasks like sending messages, playing music, and finding information online. It’s especially useful for hands-free control and assists people with disabilities in interacting more easily with technology.

How Voice Recognition Works?

Voice recognition works through several steps to convert spoken language into text or commands that a computer can understand. Here is its working:

Sound Capture: The process begins when a microphone captures your voice.

Digital Conversion: The analog signal, which is the sound wave captured by the microphone, is converted into a digital signal. This is done through a process called analog-to-digital conversion (ADC). The digital signal represents the audio in a format that computers can understand and process, making it possible to analyze the sound wave precisely.

Noise Reduction: Background noises are filtered out so the system can focus on a clear digital voice signal.

Pattern Matching: Once the voice is clear, the system breaks the speech into small units called phonemes, which are the smallest units of sound in a language. The voice recognition software uses algorithms to compare these phonemes against a database of known phoneme patterns. This process helps the system identify which words are being spoken by matching the sequences of phonemes to its library of word patterns.

Contextual Understanding: The system analyzes the context and syntax of the sentence to better understand the meaning and to distinguish between words that sound similar.

Conversion to Text or Commands: Once the words are identified, they are either converted into text or interpreted as commands based on the user’s intent.

Feedback and Execution: If the voice input is a command, the device performs the action (like opening an app or adjusting settings). If it is dictation, it displays the text on the screen.

Throughout this process, advanced algorithms and machine learning help improve accuracy by learning from new inputs and adapting to the user’s voice characteristics over time.

Types of Voice Recognition System

Voice recognition systems can be categorized based on their functionality, application, and the technologies they use. Here are some common types of voice recognition systems:

1. Speaker-Dependent Systems

These systems are trained to recognize the voice of a specific user. They require an initial training period where the user reads out specific texts so the system can learn to recognize their speech patterns and accents.

Use Case : Personalized applications, like user-specific voice commands in vehicles or personalized virtual assistants.

2. Speaker-Independent Systems

These systems are designed to understand speech inputs from any speaker without needing prior training on the speaker’s voice. They are generally less accurate at recognizing individual voice nuances but more versatile.

Use Case : General use applications, such as interactive voice response (IVR) systems in customer service.

3. Continuous Speech Recognition

These systems can handle natural speech flow without the user having to pause between words. They are sophisticated and require more processing power.

Use Case : Dictation software that converts speech to text for documents or emails.

4. Isolated Word Recognition

These systems require each word to be spoken separately with pauses in between. They are simpler and less prone to errors but less convenient for the user.

Use Case : Command-and-control systems where simple commands trigger actions, such as home automation devices.

5. Large Vocabulary Continuous Speech Recognition (LVCSR)

These systems have a very large database of words and can handle complex vocabularies and sentence structures.

Use Case : Advanced dictation and transcription services, like those used in legal and medical fields.

6. Multilingual Voice Recognition

These systems can recognize and process speech in multiple languages.

Use Case : Applications serving users from different linguistic backgrounds, such as multilingual virtual assistants and translation services.

7. Natural Language Processing (NLP)

These systems incorporate an understanding of the meaning behind the words and contextual cues, not just raw transcription.

Use Case : Advanced virtual assistants that can perform tasks based on conversational language, such as Siri, Google Assistant, and Alexa.

Advantages of Voice Recognition

Here are a few advantages of voice recognition:

  • Convenience : Voice recognition allows users to perform tasks hands-free, which is especially useful when driving, cooking, or when one’s hands are otherwise occupied. It simplifies tasks such as sending texts, making phone calls, or setting GPS routes.
  • Accessibility : This technology provides essential assistance to people with disabilities, especially those who have difficulty using their hands. It enables them to control devices, interact with technology, and communicate more independently.
  • Speed : Speaking is generally faster than typing, so voice recognition can save time in data entry and command execution. This is particularly beneficial in work settings where efficiency is crucial, such as in medical dictation or issuing commands in fast-paced environments.
  • Improved Productivity : Voice recognition can streamline workflows by allowing for quicker data entry, facilitating multitasking, and reducing the need for physical interaction with devices.
  • Enhanced User Experience : Voice-activated assistants like Siri, Alexa, and Google Assistant offer a more intuitive way for users to interact with technology, making devices smarter and more responsive to human language.
  • Language Support : Modern voice recognition systems support multiple languages, making them versatile tools for global interaction and accessibility across different linguistic backgrounds.

In conclusion, voice recognition is a powerful technology that transforms how we interact with our devices, making everyday tasks simpler and more efficient. It helps everyone from busy professionals to individuals with physical limitations, enhancing accessibility and convenience across various applications. As this technology continues to evolve, it promises even greater integration into our daily lives, ensuring that voice-controlled devices are an essential part of our future.

What is Voice Recognition? – FAQs

What do you mean by voice recognition?

Voice recognition is a deep learning technique used to identify, distinguish, and authenticate a particular person’s voice. It evaluates an individual’s unique voice biometrics, including frequency and flow of pitch, and natural accent.

What is an example of voice recognition?

Virtual assistants. Siri, Alexa, and Google’s virtual assistant all implement voice recognition software to interact with users. The way consumers use voice recognition technology varies depending on the product.

Who invented voice recognition?

In 1952,  Bell Laboratories  designed the “Audrey” system which could recognize a single voice speaking digits aloud. Ten years later, IBM introduced “Shoebox” which understood and responded to 16 words in English. Across the globe other nations developed hardware that could recognize sound and speech.

What is one use of voice recognition?

You can use voice recognition to  control a smart home , instruct a smart speaker, and command phones and tablets. In addition, you can set reminders and interact hands-free with personal technologies. The most significant use is for the entry of text without using an on-screen or physical keyboard.

Why is voice recognition useful?

The benefits of voice recognition software are that it  provides a faster method of writing on a computer, tablet, or smartphone, without typing . You can speak into an external microphone, headset, or built-in microphone, and your words appear as text on the screen.


SpeechRecognition 3.10.4

pip install SpeechRecognition

Released: May 5, 2024

Library for performing speech recognition, with support for several engines and APIs, online and offline.


License: BSD License (BSD)

Author: Anthony Zhang (Uberi)

Tags speech, recognition, voice, sphinx, google, wit, bing, api, houndify, ibm, snowboy

Requires: Python >=3.8

Classifiers

  • 5 - Production/Stable
  • OSI Approved :: BSD License
  • MacOS :: MacOS X
  • Microsoft :: Windows
  • POSIX :: Linux
  • Python :: 3
  • Python :: 3.8
  • Python :: 3.9
  • Python :: 3.10
  • Python :: 3.11
  • Multimedia :: Sound/Audio :: Speech
  • Software Development :: Libraries :: Python Modules

Project description


UPDATE 2022-02-09 : Hey everyone! This project started as a tech demo, but these days it needs more time than I have to keep up with all the PRs and issues. Therefore, I’d like to put out an open invite for collaborators - just reach out at me @ anthonyz . ca if you’re interested!

Speech recognition engine/API support:

Quickstart: pip install SpeechRecognition . See the “Installing” section for more details.

To quickly try it out, run python -m speech_recognition after installing.
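As a slightly fuller first test, the sketch below (an illustration, not part of the official quickstart) records one phrase from the default microphone and sends it to the free Google Web Speech API. It assumes PyAudio is installed for microphone support, as described under “Requirements”.

    # Quick first test: transcribe one phrase from the default microphone.
    # Requires: pip install SpeechRecognition pyaudio
    import speech_recognition as sr

    recognizer = sr.Recognizer()

    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)  # calibrate for background noise
        print("Say something...")
        audio = recognizer.listen(source)            # record until a pause is detected

    try:
        print("You said:", recognizer.recognize_google(audio))
    except sr.UnknownValueError:
        print("Could not understand the audio.")
    except sr.RequestError as e:
        print("Could not reach the recognition service:", e)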


Library Reference

The library reference documents every publicly accessible object in the library. This document is also included under reference/library-reference.rst .

See Notes on using PocketSphinx for information about installing languages, compiling PocketSphinx, and building language packs from online resources. This document is also included under reference/pocketsphinx.rst .

You have to install Vosk models to use Vosk. Models are available for download; place them in the models folder of your project, for example “your-project-folder/models/your-vosk-model”.

See the examples/ directory in the repository root for usage examples.

First, make sure you have all the requirements listed in the “Requirements” section.

The easiest way to install this is using pip install SpeechRecognition .

Otherwise, download the source distribution from PyPI , and extract the archive.

In the folder, run python setup.py install .

Requirements

To use all of the functionality of the library, you should have:

The following requirements are optional, but can improve or extend functionality in some situations:

The following sections go over the details of each requirement.

The first software requirement is Python 3.8+ . This is required to use the library.

PyAudio (for microphone users)

PyAudio is required if and only if you want to use microphone input ( Microphone ). PyAudio version 0.2.11+ is required, as earlier versions have known memory management bugs when recording from microphones in certain situations.

If not installed, everything in the library will still work, except attempting to instantiate a Microphone object will raise an AttributeError .

The installation instructions on the PyAudio website are quite good - for convenience, they are summarized below:

PyAudio wheel packages for common 64-bit Python versions on Windows and Linux are included for convenience, under the third-party/ directory in the repository root. To install, simply run pip install wheel followed by pip install ./third-party/WHEEL_FILENAME (replace pip with pip3 if using Python 3) in the repository root directory .

PocketSphinx-Python (for Sphinx users)

PocketSphinx-Python is required if and only if you want to use the Sphinx recognizer ( recognizer_instance.recognize_sphinx ).

PocketSphinx-Python wheel packages for 64-bit Python 3.4 and 3.5 on Windows are included for convenience, under the third-party/ directory. To install, simply run pip install wheel followed by pip install ./third-party/WHEEL_FILENAME (replace pip with pip3 if using Python 3) in the SpeechRecognition folder.

On Linux and other POSIX systems (such as OS X), follow the instructions under “Building PocketSphinx-Python from source” in Notes on using PocketSphinx for installation instructions.

Note that the versions available in most package repositories are outdated and will not work with the bundled language data. Using the bundled wheel packages or building from source is recommended.
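Once PocketSphinx-Python is installed, a minimal offline transcription sketch looks like the following; the file name "sample.wav" is a placeholder for any audio file you want to transcribe.

    # Offline transcription with CMU PocketSphinx (English model bundled with the library).
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("sample.wav") as source:   # "sample.wav" is a placeholder file name
        audio = recognizer.record(source)

    try:
        print(recognizer.recognize_sphinx(audio))  # works fully offline
    except sr.UnknownValueError:
        print("Sphinx could not understand the audio.")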

Vosk (for Vosk users)

Vosk API is required if and only if you want to use Vosk recognizer ( recognizer_instance.recognize_vosk ).

You can install it with python3 -m pip install vosk .

You also have to install Vosk Models:

Models are available for download; place them in the models folder of your project, for example “your-project-folder/models/your-vosk-model”.
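Assuming the vosk package is installed and a model has been placed as described above, usage mirrors the other recognizers. This is a sketch only, with a placeholder file name:

    # Offline transcription with Vosk (model downloaded and placed as described above).
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("sample.wav") as source:   # placeholder file name
        audio = recognizer.record(source)

    print(recognizer.recognize_vosk(audio))      # requires `pip install vosk` and a model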

Google Cloud Speech Library for Python (for Google Cloud Speech API users)

Google Cloud Speech library for Python is required if and only if you want to use the Google Cloud Speech API ( recognizer_instance.recognize_google_cloud ).

If not installed, everything in the library will still work, except calling recognizer_instance.recognize_google_cloud will raise a RequestError .

According to the official installation instructions , the recommended way to install this is using Pip : execute pip install google-cloud-speech (replace pip with pip3 if using Python 3).

FLAC (for some systems)

A FLAC encoder is required to encode the audio data to send to the API. If using Windows (x86 or x86-64), OS X (Intel Macs only, OS X 10.6 or higher), or Linux (x86 or x86-64), this is already bundled with this library - you do not need to install anything .

Otherwise, ensure that you have the flac command line tool, which is often available through the system package manager. For example, this would usually be sudo apt-get install flac on Debian-derivatives, or brew install flac on OS X with Homebrew.

Whisper (for Whisper users)

Whisper is required if and only if you want to use whisper ( recognizer_instance.recognize_whisper ).

You can install it with python3 -m pip install SpeechRecognition[whisper-local] .
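With the whisper-local extra installed, a sketch of local Whisper transcription looks like this; "base" is one of the standard Whisper model sizes, and the file name is a placeholder.

    # Local transcription with OpenAI Whisper (offline after the first model download).
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("sample.wav") as source:   # placeholder file name
        audio = recognizer.record(source)

    # Larger Whisper models are more accurate but slower; "base" is a reasonable default.
    print(recognizer.recognize_whisper(audio, model="base"))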

Whisper API (for Whisper API users)

The library openai is required if and only if you want to use Whisper API ( recognizer_instance.recognize_whisper_api ).

If not installed, everything in the library will still work, except calling recognizer_instance.recognize_whisper_api will raise a RequestError .

You can install it with python3 -m pip install SpeechRecognition[whisper-api] .

Troubleshooting

The recognizer tries to recognize speech even when I’m not speaking, or after I’m done speaking.

Try increasing the recognizer_instance.energy_threshold property. This is basically how sensitive the recognizer is to when recognition should start. Higher values mean that it will be less sensitive, which is useful if you are in a loud room.

This value depends entirely on your microphone or audio data. There is no one-size-fits-all value, but good values typically range from 50 to 4000.

Also, check on your microphone volume settings. If it is too sensitive, the microphone may be picking up a lot of ambient noise. If it is too insensitive, the microphone may be rejecting speech as just noise.

The recognizer can’t recognize speech right after it starts listening for the first time.

The recognizer_instance.energy_threshold property is probably set to a value that is too high to start off with, and then being adjusted lower automatically by dynamic energy threshold adjustment. Before it is at a good level, the energy threshold is so high that speech is just considered ambient noise.

The solution is to decrease this threshold, or call recognizer_instance.adjust_for_ambient_noise beforehand, which will set the threshold to a good value automatically.
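Both adjustments look like this in code; the threshold value shown is just an illustrative starting point, and the microphone block assumes PyAudio is installed.

    # Two ways to keep the recognizer from triggering on background noise.
    import speech_recognition as sr

    recognizer = sr.Recognizer()

    # Option 1: set the threshold manually (good values typically range from 50 to 4000).
    recognizer.energy_threshold = 4000

    # Option 2: let the library measure the ambient noise level for you.
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=1)  # sample 1 second of background noise
        audio = recognizer.listen(source)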

The recognizer doesn’t understand my particular language/dialect.

Try setting the recognition language to your language/dialect. To do this, see the documentation for recognizer_instance.recognize_sphinx , recognizer_instance.recognize_google , recognizer_instance.recognize_wit , recognizer_instance.recognize_bing , recognizer_instance.recognize_api , recognizer_instance.recognize_houndify , and recognizer_instance.recognize_ibm .

For example, if your language/dialect is British English, it is better to use "en-GB" as the language rather than "en-US" .

The recognizer hangs on recognizer_instance.listen ; specifically, when it’s calling Microphone.MicrophoneStream.read .

This usually happens when you’re using a Raspberry Pi board, which doesn’t have audio input capabilities by itself. This causes the default microphone used by PyAudio to simply block when we try to read it. If you happen to be using a Raspberry Pi, you’ll need a USB sound card (or USB microphone).

Once you do this, change all instances of Microphone() to Microphone(device_index=MICROPHONE_INDEX) , where MICROPHONE_INDEX is the hardware-specific index of the microphone.

To figure out what the value of MICROPHONE_INDEX should be, run the following code:
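A minimal version of that listing code, using the library’s Microphone.list_microphone_names() helper, looks like this:

    # Print the device index and name of every microphone PyAudio can see.
    import speech_recognition as sr

    for index, name in enumerate(sr.Microphone.list_microphone_names()):
        print(f'Microphone "{name}" found for Microphone(device_index={index})')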

This will print one line per detected audio device, showing its name and device index. If, for example, the listing shows a Snowball microphone at index 3, you would change Microphone() to Microphone(device_index=3) .

Calling Microphone() gives the error IOError: No Default Input Device Available .

As the error says, the program doesn’t know which microphone to use.

To proceed, either use Microphone(device_index=MICROPHONE_INDEX, ...) instead of Microphone(...) , or set a default microphone in your OS. You can obtain possible values of MICROPHONE_INDEX using the code in the troubleshooting entry right above this one.

The program doesn’t run when compiled with PyInstaller .

As of PyInstaller version 3.0, SpeechRecognition is supported out of the box. If you’re getting weird issues when compiling your program using PyInstaller, simply update PyInstaller.

You can easily do this by running pip install --upgrade pyinstaller .

On Ubuntu/Debian, I get annoying output in the terminal saying things like “bt_audio_service_open: […] Connection refused” and various others.

The “bt_audio_service_open” error means that you have a Bluetooth audio device, but as a physical device is not currently connected, we can’t actually use it - if you’re not using a Bluetooth microphone, then this can be safely ignored. If you are, and audio isn’t working, then double check to make sure your microphone is actually connected. There does not seem to be a simple way to disable these messages.

For errors of the form “ALSA lib […] Unknown PCM”, see this StackOverflow answer . Basically, to get rid of an error of the form “Unknown PCM cards.pcm.rear”, simply comment out pcm.rear cards.pcm.rear in /usr/share/alsa/alsa.conf , ~/.asoundrc , and /etc/asound.conf .

For “jack server is not running or cannot be started” or “connect(2) call to /dev/shm/jack-1000/default/jack_0 failed (err=No such file or directory)” or “attempt to connect to server failed”, these are caused by ALSA trying to connect to JACK, and can be safely ignored. I’m not aware of any simple way to turn those messages off at this time, besides entirely disabling printing while starting the microphone .

On OS X, I get a ChildProcessError saying that it couldn’t find the system FLAC converter, even though it’s installed.

Installing FLAC for OS X directly from the source code will not work, since it doesn’t correctly add the executables to the search path.

Installing FLAC using Homebrew ensures that the search path is correctly updated. First, ensure you have Homebrew, then run brew install flac to install the necessary files.

To hack on this library, first make sure you have all the requirements listed in the “Requirements” section.

To install/reinstall the library locally, run python -m pip install -e .[dev] in the project root directory .

Before a release, the version number is bumped in README.rst and speech_recognition/__init__.py . Version tags are then created using git config gpg.program gpg2 && git config user.signingkey DB45F6C431DE7C2DCD99FF7904882258A4063489 && git tag -s VERSION_GOES_HERE -m "Version VERSION_GOES_HERE" .

Releases are done by running make-release.sh VERSION_GOES_HERE to build the Python source packages, sign them, and upload them to PyPI.

To run all the tests:

To run static analysis:

To ensure RST is well-formed:

Testing is also done automatically by GitHub Actions, upon every push.

FLAC Executables

The included flac-win32 executable is the official FLAC 1.3.2 32-bit Windows binary .

The included flac-linux-x86 and flac-linux-x86_64 executables are built from the FLAC 1.3.2 source code with Manylinux to ensure that it’s compatible with a wide variety of distributions.

The built FLAC executables should be bit-for-bit reproducible. To rebuild them, run the following inside the project directory on a Debian-like system:

The included flac-mac executable is extracted from xACT 2.39 , which is a frontend for FLAC 1.3.2 that conveniently includes binaries for all of its encoders. Specifically, it is a copy of xACT 2.39/xACT.app/Contents/Resources/flac in xACT2.39.zip .

Please report bugs and suggestions at the issue tracker !

How to cite this library (APA style):

Zhang, A. (2017). Speech Recognition (Version 3.8) [Software]. Available from https://github.com/Uberi/speech_recognition#readme .

How to cite this library (Chicago style):

Zhang, Anthony. 2017. Speech Recognition (version 3.8).

Also check out the Python Baidu Yuyin API , which is based on an older version of this project, and adds support for Baidu Yuyin . Note that Baidu Yuyin is only available inside China.

Copyright 2014-2017 Anthony Zhang (Uberi) . The source code for this library is available online at GitHub .

SpeechRecognition is made available under the 3-clause BSD license. See LICENSE.txt in the project’s root directory for more information.

For convenience, all the official distributions of SpeechRecognition already include a copy of the necessary copyright notices and licenses. In your project, you can simply say that licensing information for SpeechRecognition can be found within the SpeechRecognition README, and make sure SpeechRecognition is visible to users if they wish to see it .

SpeechRecognition distributes source code, binaries, and language files from CMU Sphinx . These files are BSD-licensed and redistributable as long as copyright notices are correctly retained. See speech_recognition/pocketsphinx-data/*/LICENSE*.txt and third-party/LICENSE-Sphinx.txt for license details for individual parts.

SpeechRecognition distributes source code and binaries from PyAudio . These files are MIT-licensed and redistributable as long as copyright notices are correctly retained. See third-party/LICENSE-PyAudio.txt for license details.

SpeechRecognition distributes binaries from FLAC - speech_recognition/flac-win32.exe , speech_recognition/flac-linux-x86 , and speech_recognition/flac-mac . These files are GPLv2-licensed and redistributable, as long as the terms of the GPL are satisfied. The FLAC binaries are an aggregate of separate programs , so these GPL restrictions do not apply to the library or your programs that use the library, only to FLAC itself. See LICENSE-FLAC.txt for license details.


Use voice recognition in Windows

On Windows 11 22H2 and later, Windows Speech Recognition (WSR) will be replaced by voice access starting in September 2024. Older versions of Windows will continue to have WSR available. To learn more about voice access, go to Use voice access to control your PC & author text with your voice .

Set up a microphone

Before you set up speech recognition, make sure you have a microphone set up.

Select Start > Settings > Time & language > Speech .

The speech settings menu in Windows 11

The Speech wizard window opens, and the setup starts automatically. If the wizard detects issues with your microphone, they will be listed in the wizard dialog box. You can select options in the dialog box to specify an issue and help the wizard solve it.

Help your PC recognize your voice

You can teach Windows 11 to recognize your voice. Here's how to set it up:

Press Windows logo key+Ctrl+S. The Set up Speech Recognition wizard window opens with an introduction on the Welcome to Speech Recognition page.

Tip:  If you've already set up speech recognition, pressing Windows logo key+Ctrl+S opens speech recognition and you're ready to use it. If you want to retrain your computer to recognize your voice, press the Windows logo key, type Control Panel , and select Control Panel in the list of results. In Control Panel , select Ease of Access > Speech Recognition > Train your computer to better understand you .

Select Next . Follow the instructions on your screen to set up speech recognition. The wizard will guide you through the setup steps.

After the setup is complete, you can choose to take a tutorial to learn more about speech recognition. To take the tutorial, select Start Tutorial in the wizard window. To skip the tutorial, select Skip Tutorial . You can now start using speech recognition.

Windows Speech Recognition commands

Before you set up voice recognition, make sure you have a microphone set up.

Select the Start button, then select Settings > Time & Language > Speech .


You can teach Windows 10 to recognize your voice. Here's how to set it up:

In the search box on the taskbar, type Windows Speech Recognition , and then select Windows Speech Recognition  in the list of results.

If you don't see a dialog box that says "Welcome to Speech Recognition Voice Training," then in the search box on the taskbar, type Control Panel , and select Control Panel in the list of results. Then select Ease of Access > Speech Recognition > Train your computer to understand you better .

Follow the instructions to set up speech recognition.



Speech Accessibility Project

Beckman Institute for Advanced Science and Technology

Coming together to expand voice recognition

The University of Illinois Urbana-Champaign has announced the Speech Accessibility Project, a new research initiative to make voice recognition technology more useful for people with a range of diverse speech patterns and disabilities. 


Now recruiting!

The Speech Accessibility Project is now recruiting U.S. and Puerto Rican adults:

  • who have Parkinson's and related neurological conditions like MSA, PSP, post-DBS, and LBD.
  • who have Down syndrome
  • who have cerebral palsy
  • who have amyotrophic lateral sclerosis
  • who have had a stroke

People over the age of 18 are eligible.  Unfortunately, we cannot recruit participants from Illinois, Texas, or Washington at this time because of their state privacy laws.

To get started, please visit the Speech Accessibility App .

Join the study


Our progress

As of the end of April 2024, we've shared 185,000 speech samples with the companies that fund us: Amazon, Apple, Google, Meta and Microsoft.

Here at Illinois, researchers have trained an automatic speech recognition tool using the project's recordings. Before using recordings from the Speech Accessibility Project, the tool misunderstood speech 20% of the time. With data from the Speech Accessibility Project, this decreased to 12%.


Submit a proposal for using our data

We are now accepting proposals from nonprofits and companies who want to use our data to improve their own speech recognition tools.

About the project

The project has unprecedented cross-industry support from Amazon, Apple, Google, Meta, and Microsoft, as well as nonprofit organizations whose communities will benefit from this accessibility initiative, to make speech recognition more inclusive of diverse speech patterns.

Today’s speech recognition systems, such as voice assistants and translation tools, don’t always recognize people with a diversity of speech patterns often associated with disabilities. This includes speech affected by Lou Gehrig’s disease or Amyotrophic Lateral Sclerosis, Parkinson’s disease, cerebral palsy, and Down syndrome. In effect, many individuals in these and other communities may be unable to benefit from the latest speech recognition tools.

Learn more about the project .

Sign up to receive email updates


speech-emotion-recognition

Here are 178 public repositories matching this topic.

miteshputhran / speech-emotion-analyzer

The neural network model is capable of detecting five different male/female emotions from audio speeches. (Deep Learning, NLP, Python)

  • Updated Feb 7, 2023
  • Jupyter Notebook

coqui-ai / open-speech-corpora

💎 A list of accessible speech corpora for ASR, TTS, and other Speech Technologies

  • Updated Jul 27, 2022

Renovamen / Speech-Emotion-Recognition

Speech emotion recognition implemented in Keras (LSTM, CNN, SVM, MLP) | 语音情感识别

  • Updated Mar 25, 2023

x4nth055 / emotion-recognition-using-speech

Building and training Speech Emotion Recognizer that predicts human emotions using Python, Sci-kit learn and Keras

  • Updated Nov 3, 2023

ddlBoJack / emotion2vec

Official PyTorch code for extracting features and training downstream models with emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation

  • Updated May 10, 2024

audeering / w2v2-how-to

How to use our public wav2vec2 dimensional emotion model

  • Updated May 22, 2023

xuanjihe / speech-emotion-recognition

speech emotion recognition using a convolutional recurrent networks based on IEMOCAP

  • Updated Jul 8, 2019

Demfier / multimodal-speech-emotion-recognition

Lightweight and Interpretable ML Model for Speech Emotion Recognition and Ambiguity Resolution (trained on IEMOCAP dataset)

  • Updated Dec 21, 2023

speechbrain / speechbrain.github.io

The SpeechBrain project aims to build a novel speech toolkit fully based on PyTorch. With SpeechBrain users can easily create speech processing systems, ranging from speech recognition (both HMM/DNN and end-to-end), speaker recognition, speech enhancement, speech separation, multi-microphone speech processing, and many others.

  • Updated Apr 28, 2024

hkveeranki / speech-emotion-recognition

Speaker independent emotion recognition

  • Updated Apr 17, 2023

RayanWang / Speech_emotion_recognition_BLSTM

Bidirectional LSTM network for speech emotion recognition.

  • Updated Mar 31, 2019

SuperKogito / SER-datasets

A collection of datasets for the purpose of emotion recognition/detection in speech.

  • Updated May 7, 2024

david-yoon / multimodal-speech-emotion

TensorFlow implementation of "Multimodal Speech Emotion Recognition using Audio and Text," IEEE SLT-18

  • Updated Mar 25, 2024

m3hrdadfi / soxan

Wav2Vec for speech recognition, classification, and audio classification

  • Updated Apr 2, 2022

Data-Science-kosta / Speech-Emotion-Classification-with-PyTorch

This repository contains PyTorch implementation of 4 different models for classification of emotions of the speech.

  • Updated Nov 10, 2022

Jiaxin-Ye / TIM-Net_SER

[ICASSP 2023] Official Tensorflow implementation of "Temporal Modeling Matters: A Novel Temporal Emotional Modeling Approach for Speech Emotion Recognition".

  • Updated Nov 9, 2023

mkosaka1 / Speech_Emotion_Recognition

Using Convolutional Neural Networks in speech emotion recognition on the RAVDESS Audio Dataset.

  • Updated Apr 12, 2021

habla-liaa / ser-with-w2v2

Official implementation of INTERSPEECH 2021 paper 'Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings'

  • Updated Dec 23, 2021

shamanez / BERT-like-is-All-You-Need

The code for our INTERSPEECH 2020 paper - Jointly Fine-Tuning "BERT-like'" Self Supervised Models to Improve Multimodal Speech Emotion Recognition

  • Updated Feb 26, 2021

Vincent-ZHQ / CA-MSER

Code for Speech Emotion Recognition with Co-Attention based Multi-level Acoustic Information

  • Updated Nov 27, 2023


Multi-language: ensemble learning-based speech emotion recognition

  • Regular Paper
  • Published: 07 May 2024


  • Anumula Sruthi 1 ,
  • Anumula Kalyan Kumar 2 ,
  • Kishore Dasari 1 ,
  • Yenugu Sivaramaiah 1 ,
  • Garikapati Divya 3 &
  • Gunupudi Sai Chaitanya Kumar 2  


Inaccurate emotional reactions from robots have been a problem for authors in previous years. Since technology has advanced, robots such as service robots can communicate with people in many different languages. The traditional Speech Emotion Recognition (SER) method utilizes the same corpus for classifier testing and training to accurately identify emotions. However, this method could be more flexible for multi-lingual (multi-language) contexts, which is essential for robots that people use worldwide. This research proposes an ensemble learning method (HMLSTM and CapsNet) that uses a voting majority for a cross-corpus, multi-lingual SER system. This work utilizes three corpora (EMO-DB, URDU, and SAVEE) that offer a variety of languages (German, Urdu, and English) to test multi-language SER. We first use the Refined Attention Pyramid Network (RAPNet) for speech and emotion recognition to extract the features. Following that, the data is pre-processed with Min–max normalization, and IGAN is used to address data imbalance. The HMLSTM and CapsNet ensemble learning algorithms are then used to classify the emotions in speech into the appropriate group. With reasonable accuracy, the proposed ensemble learning approach enhances emotion recognition. The study compares the effectiveness of the proposed ensemble learning method with existing traditional learning methods. It also tests the performance of a classifier trained on one corpus and evaluated on a different corpus for multi-lingual emotion identification. In this experiment, distinct classifiers offer excellent accuracy for diverse corpora.
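As a rough illustration of the ensemble idea only, and not the paper's implementation (which combines HMLSTM and CapsNet models), a generic ensemble over two emotion classifiers can be sketched by averaging their class probabilities, a common alternative to hard majority voting when only two models are combined. The emotion labels and the probability arrays below are placeholders.

    # Generic sketch of combining two emotion classifiers by averaging their
    # class probabilities ("soft voting"); labels and model outputs are made up.
    import numpy as np

    EMOTIONS = ["anger", "happiness", "sadness", "neutral"]  # hypothetical label set

    def soft_vote(probs_model_a, probs_model_b):
        # Average per-class probabilities from two models and pick the top class per sample.
        avg = (probs_model_a + probs_model_b) / 2.0          # shape: (n_samples, n_classes)
        return [EMOTIONS[i] for i in avg.argmax(axis=1)]

    # Example with made-up probability outputs for two utterances.
    p_a = np.array([[0.6, 0.2, 0.1, 0.1],
                    [0.1, 0.7, 0.1, 0.1]])
    p_b = np.array([[0.3, 0.4, 0.2, 0.1],
                    [0.2, 0.6, 0.1, 0.1]])
    print(soft_vote(p_a, p_b))   # -> ['anger', 'happiness']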


Data availability

Data will be available when requested.


No funding was received to assist with the preparation of this manuscript.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Koneru Lakshmaiah Educational Foundation, Vaddeswaram, Andhra Pradesh, India

Anumula Sruthi, Kishore Dasari & Yenugu Sivaramaiah

Department of Artificial Intelligence, DVR & Dr HS MIC College of Technology, Kanchikcherla, Andhra Pradesh, India

Anumula Kalyan Kumar & Gunupudi Sai Chaitanya Kumar

Department of Artificial Intelligence and Data Science, Laki Reddy Bali Reddy College of Engineering (Autonomous), Mylavaram, India

Garikapati Divya


Contributions

The contributions of authors are as follows: Anumula Sruthi, Anumula Kalyan Kumar, Kishore Dasari, Yenugu Sivaramaiah contributed to conceptualization, methodology, software, formal analysis, investigation, resources, writing—original draft, review & editing, and visualization. Garikapati Divya, Dr. G. Sai Chaitanya Kumar contributed to conceptualization, writing—review & editing.

Corresponding author

Correspondence to Gunupudi Sai Chaitanya Kumar .

Ethics declarations

Conflict of interest.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Sruthi, A., Kumar, A.K., Dasari, K. et al. Multi-language: ensemble learning-based speech emotion recognition. Int J Data Sci Anal (2024). https://doi.org/10.1007/s41060-024-00553-6

Download citation

Received : 19 June 2023

Accepted : 11 April 2024

Published : 07 May 2024

DOI : https://doi.org/10.1007/s41060-024-00553-6


  • Speech emotion recognition (SER)
  • Multi-lingual
  • Ensemble learning
  • Capsule Neural Network
