Text to speech REST API

The Speech service allows you to convert text into synthesized speech and get a list of supported voices for a region by using a REST API. In this article, you learn about authorization options, query options, how to structure a request, and how to interpret a response.

Use cases for the text to speech REST API are limited. Use it only in cases where you can't use the Speech SDK. For example, with the Speech SDK you can subscribe to events for more insights about the text to speech processing and results.

The text to speech REST API supports neural text to speech voices in many locales. Each available endpoint is associated with a region. A Speech resource key for the endpoint or region that you plan to use is required. Here are links to more information:

  • For a complete list of voices, see Language and voice support for the Speech service.
  • For information about regional availability, see Speech service supported regions.
  • For Azure Government and Microsoft Azure operated by 21Vianet endpoints, see this article about sovereign clouds.

Costs vary for prebuilt neural voices (called Neural on the pricing page) and custom neural voices (called Custom Neural on the pricing page). For more information, see Speech service pricing.

Before you use the text to speech REST API, understand that you need to complete a token exchange as part of authentication to access the service. For more information, see Authentication.

Get a list of voices

You can use the tts.speech.microsoft.com/cognitiveservices/voices/list endpoint to get a full list of voices for a specific region or endpoint. Prefix the voices list endpoint with a region to get a list of voices for that region. For example, to get a list of voices for the westus region, use the https://westus.tts.speech.microsoft.com/cognitiveservices/voices/list endpoint. For a list of all supported regions, see the regions documentation.

Voices and styles in preview are only available in three service regions: East US, West Europe, and Southeast Asia.

Request headers

These required and optional headers apply to text to speech requests:

  • Ocp-Apim-Subscription-Key: your Speech resource key. Either this header or Authorization is required.
  • Authorization: an authorization token preceded by the word Bearer. Either this header or Ocp-Apim-Subscription-Key is required.

Request body

A body isn't required for GET requests to this endpoint.

Sample request

This request requires only an authorization header:

    GET /cognitiveservices/voices/list HTTP/1.1
    Host: westus.tts.speech.microsoft.com
    Ocp-Apim-Subscription-Key: YOUR_SUBSCRIPTION_KEY

Here's an example curl command:

    curl --location --request GET 'https://westus.tts.speech.microsoft.com/cognitiveservices/voices/list' \
    --header 'Ocp-Apim-Subscription-Key: YOUR_SUBSCRIPTION_KEY'
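The same request can also be made from Python; here's a minimal sketch, assuming the requests library, a placeholder key, and the westus region:

    import requests

    # Placeholder key and assumed region; match these to your Speech resource.
    region = "westus"
    subscription_key = "YOUR_SUBSCRIPTION_KEY"

    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/voices/list"
    headers = {"Ocp-Apim-Subscription-Key": subscription_key}

    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    voices = response.json()
    print(f"{len(voices)} voices available in {region}")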

Sample response

You should receive a response with a JSON body that includes all supported locales, voices, gender, styles, and other details. The WordsPerMinute property for each voice can be used to estimate the length of the output speech. This JSON example shows partial results to illustrate the structure of a response:
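As an abridged illustration of that structure (field values vary by voice):

    [
      {
        "Name": "Microsoft Server Speech Text to Speech Voice (en-US, JennyNeural)",
        "DisplayName": "Jenny",
        "ShortName": "en-US-JennyNeural",
        "Gender": "Female",
        "Locale": "en-US",
        "SampleRateHertz": "24000",
        "VoiceType": "Neural",
        "Status": "GA",
        "WordsPerMinute": "152"
      }
    ]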

HTTP status codes

The HTTP status code for each response indicates success or common errors.

Convert text to speech

The cognitiveservices/v1 endpoint allows you to convert text to speech by using Speech Synthesis Markup Language (SSML).

Regions and endpoints

These regions are supported for text to speech through the REST API. Be sure to select the endpoint that matches your Speech resource region.

Prebuilt neural voices

Use this table to determine availability of neural voices by region or endpoint:

Voices in preview are available in only these three regions: East US, West Europe, and Southeast Asia.

Custom neural voices

If you've created a custom neural voice font, use the endpoint that you've created. You can also use the following endpoints. Replace {deploymentId} with the deployment ID for your neural voice model.

The preceding regions are available for neural voice model hosting and real-time synthesis. Custom neural voice training is only available in some regions. But users can easily copy a neural voice model from these regions to other regions in the preceding list.

Long Audio API

The Long Audio API is available in multiple regions with unique endpoints:

If you're using a custom neural voice, the body of a request can be sent as plain text (ASCII or UTF-8). Otherwise, the body of each POST request is sent as SSML. SSML allows you to choose the voice and language of the synthesized speech that the text to speech feature returns. For a complete list of supported voices, see Language and voice support for the Speech service.

This HTTP request uses SSML to specify the voice and language. If the body length is long, and the resulting audio exceeds 10 minutes, it's truncated to 10 minutes. In other words, the audio length can't exceed 10 minutes.

* For the Content-Length, you should use your own content length. In most cases, this value is calculated automatically.
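As a minimal Python sketch of such a request (the key is a placeholder; the region, voice, and output format are assumptions to adapt):

    import requests

    region = "westus"
    subscription_key = "YOUR_SUBSCRIPTION_KEY"

    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "audio-24khz-160kbitrate-mono-mp3",
        "User-Agent": "tts-rest-example",
    }
    # requests computes Content-Length automatically, as noted above.
    ssml = (
        "<speak version='1.0' xml:lang='en-US'>"
        "<voice xml:lang='en-US' name='en-US-JennyNeural'>"
        "Hello! This text was converted to speech."
        "</voice></speak>"
    )

    response = requests.post(url, headers=headers, data=ssml.encode("utf-8"), timeout=30)
    response.raise_for_status()
    with open("output.mp3", "wb") as f:
        f.write(response.content)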

The HTTP status code for each response indicates success or common errors:

If the HTTP status is 200 OK , the body of the response contains an audio file in the requested format. This file can be played as it's transferred, saved to a buffer, or saved to a file.

Audio outputs

The supported streaming and nonstreaming audio formats are sent in each request as the X-Microsoft-OutputFormat header. Each format incorporates a bit rate and encoding type. The Speech service supports 48 kHz, 24 kHz, 16 kHz, and 8 kHz audio outputs. Each prebuilt neural voice model is available at 24 kHz and high-fidelity 48 kHz.

If you select the 48 kHz output format, the high-fidelity 48 kHz voice model is invoked accordingly. Sample rates other than 24 kHz and 48 kHz are obtained through upsampling or downsampling during synthesis; for example, 44.1 kHz is downsampled from 48 kHz.

If your selected voice and output format have different bit rates, the audio is resampled as necessary. You can decode the ogg-24khz-16bit-mono-opus format by using the Opus codec.

Authentication

Each request requires an authorization header. This table illustrates which headers are supported for each feature:

When you're using the Ocp-Apim-Subscription-Key header, you're only required to provide your resource key. For example:

    Ocp-Apim-Subscription-Key: YOUR_SUBSCRIPTION_KEY

When you're using the Authorization: Bearer header, you're required to make a request to the issueToken endpoint. In this request, you exchange your resource key for an access token that's valid for 10 minutes.

How to get an access token

To get an access token, you need to make a request to the issueToken endpoint by using Ocp-Apim-Subscription-Key and your resource key.

The issueToken endpoint has this format:

    https://<REGION_IDENTIFIER>.api.cognitive.microsoft.com/sts/v1.0/issueToken

Replace <REGION_IDENTIFIER> with the identifier that matches the region of your subscription.

Use the following samples to create your access token request.

HTTP sample

This example is a simple HTTP request to get a token. Replace YOUR_SUBSCRIPTION_KEY with your resource key for the Speech service. If your subscription isn't in the West US region, replace the Host header with your region's host name.

    POST /sts/v1.0/issueToken HTTP/1.1
    Ocp-Apim-Subscription-Key: YOUR_SUBSCRIPTION_KEY
    Host: westus.api.cognitive.microsoft.com
    Content-type: application/x-www-form-urlencoded
    Content-Length: 0

The body of the response contains the access token in JSON Web Token (JWT) format.

PowerShell sample

This example is a simple PowerShell script to get an access token. Replace YOUR_SUBSCRIPTION_KEY with your resource key for the Speech service. Make sure to use the correct endpoint for the region that matches your subscription. This example is currently set to West US.

cURL sample

cURL is a command-line tool available in Linux (and in the Windows Subsystem for Linux). This cURL command illustrates how to get an access token. Replace YOUR_SUBSCRIPTION_KEY with your resource key for the Speech service. Make sure to use the correct endpoint for the region that matches your subscription. This example is currently set to West US.

C# sample

This C# class illustrates how to get an access token. Pass your resource key for the Speech service when you instantiate the class. If your subscription isn't in the West US region, change the value of FetchTokenUri to match the region for your subscription.

Python sample
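A minimal sketch with the requests library, assuming the same West US region and placeholder key as the samples above:

    import requests

    subscription_key = "YOUR_SUBSCRIPTION_KEY"
    fetch_token_url = "https://westus.api.cognitive.microsoft.com/sts/v1.0/issueToken"

    headers = {"Ocp-Apim-Subscription-Key": subscription_key}
    response = requests.post(fetch_token_url, headers=headers, timeout=30)
    response.raise_for_status()

    access_token = response.text  # a JWT, valid for 10 minutes
    # Send it on later requests as: {"Authorization": f"Bearer {access_token}"}
    print(access_token)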

How to use an access token

The access token should be sent to the service as the Authorization: Bearer <TOKEN> header. Each access token is valid for 10 minutes. You can get a new token at any time, but to minimize network traffic and latency, we recommend using the same token for nine minutes.

Here's a sample HTTP request to the Speech to text REST API for short audio:

  • Create a free Azure account
  • Get started with custom neural voice
  • Batch synthesis

Additional resources

Text to speech

An AI Speech feature that converts text to lifelike speech.

Bring your apps to life with natural-sounding voices

Build apps and services that speak naturally. Differentiate your brand with a customized, realistic voice generator, and access voices with different speaking styles and emotional tones to fit your use case—from text readers and talkers to customer support chatbots.

Lifelike synthesized speech

Enable fluid, natural-sounding text to speech that matches the intonation and emotion of human voices.

Customizable text-talker voices

Create a unique AI voice generator that reflects your brand's identity.

Fine-grained text-to-talk audio controls

Tune voice output for your scenarios by easily adjusting rate, pitch, pronunciation, pauses, and more.

Flexible deployment

Run Text to Speech anywhere—in the cloud, on-premises, or at the edge in containers.

Tailor your speech output

Fine-tune synthesized speech audio to fit your scenario. Define lexicons and control speech parameters such as pronunciation, pitch, rate, pauses, and intonation with Speech Synthesis Markup Language (SSML) or with the audio content creation tool.

Deploy Text to Speech anywhere, from the cloud to the edge

Run Text to Speech wherever your data resides. Build lifelike speech synthesis into applications optimized for both robust cloud capabilities and edge locality using containers.

Build a custom voice for your brand

Differentiate your brand with a unique custom voice. Develop a highly realistic voice for more natural conversational interfaces using the Custom Neural Voice capability, starting with 30 minutes of audio.

Fuel App Innovation with Cloud AI Services

Learn five key ways your organization can get started with AI to realize value quickly.

Comprehensive privacy and security

AI Speech, part of Azure AI Services, is certified by SOC, FedRAMP, PCI DSS, HIPAA, HITECH, and ISO.

View and delete your custom voice data and synthesized speech models at any time. Your data is encrypted while it’s in storage.

Your data remains yours. Your text data isn't stored during data processing or audio voice generation.

Backed by Azure infrastructure, AI Speech offers enterprise-grade security, availability, compliance, and manageability.

Comprehensive security and compliance, built in

Microsoft invests more than $1 billion annually in cybersecurity research and development.

We employ more than 3,500 security experts who are dedicated to data security and privacy.

Azure has more certifications than any other cloud provider. View the comprehensive list.

Flexible pricing gives you the power and control you need

Pay only for what you use, with no upfront costs. With Text to Speech, you pay as you go based on the number of characters you convert to audio.

Get started with an Azure free account

After your credit, move to pay as you go to keep building with the same free services. Pay only if you use more than your free monthly amounts.

Guidelines for building responsible synthetic voices

Learn about responsible deployment

Synthetic voices must be designed to earn the trust of others. Learn the principles of building synthesized voices that create confidence in your company and services.

Obtain consent from voice talent

Help voice talent understand how neural text-to-speech (TTS) works and get information on recommended use cases.

Be transparent

Transparency is foundational to responsible use of computer voice generators and synthetic voices. Help ensure that users understand when they’re hearing a synthetic voice and that voice talent is aware of how their voice will be used. Learn more with our disclosure design guidelines.

Documentation and resources

Get started

Read the documentation

Take the Microsoft Learn course

Get started with a 30-day learning journey

Explore code samples

Check out the  sample code

See customization resources

Customize your speech solution with Speech Studio. No code required.

Start building with AI Services

Using the Web Speech API

Speech recognition

Speech recognition involves receiving speech through a device's microphone, which is then checked by a speech recognition service against a list of grammar (basically, the vocabulary you want to have recognized in a particular app.) When a word or phrase is successfully recognized, it is returned as a result (or list of results) as a text string, and further actions can be initiated as a result.

The Web Speech API has a main controller interface for this — SpeechRecognition — plus a number of closely-related interfaces for representing grammar, results, etc. Generally, the default speech recognition system available on the device will be used for the speech recognition — most modern OSes have a speech recognition system for issuing voice commands. Think about Dictation on macOS, Siri on iOS, Cortana on Windows 10, Android Speech, etc.

Note: On some browsers, such as Chrome, using Speech Recognition on a web page involves a server-based recognition engine. Your audio is sent to a web service for recognition processing, so it won't work offline.

To show simple usage of Web speech recognition, we've written a demo called Speech color changer. When the screen is tapped/clicked, you can say an HTML color keyword, and the app's background color will change to that color.

The UI of an app titled Speech Color changer. It invites the user to tap the screen and say a color, and then it turns the background of the app that color. In this case it has turned the background red.

To run the demo, navigate to the live demo URL in a supporting mobile browser (such as Chrome).

HTML and CSS

The HTML and CSS for the app are really trivial. We have a title, an instructions paragraph, and a div into which we output diagnostic messages.

The CSS provides a very simple responsive styling so that it looks OK across devices.

Let's look at the JavaScript in a bit more detail.

Prefixed properties

Browsers currently support speech recognition with prefixed properties. Therefore at the start of our code we include these lines to allow for both prefixed properties and unprefixed versions that may be supported in future:

The grammar

The next part of our code defines the grammar we want our app to recognize. The following variable is defined to hold our grammar:
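For reference, an abridged grammar in that format might look like this (the demo's actual rule lists many more HTML color keywords):

    #JSGF V1.0;
    public <color> = aqua | azure | beige | black | blue | brown | crimson | cyan | gold | gray | green | magenta | navy | orange | purple | red | teal | yellow ;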

The grammar format used is JSpeech Grammar Format (JSGF) — you can find a lot more about it at the previous link to its spec. However, for now let's just run through it quickly:

  • The lines are separated by semicolons, just like in JavaScript.
  • The first line — #JSGF V1.0; — states the format and version used. This always needs to be included first.
  • The second line indicates a type of term that we want to recognize. public declares that it is a public rule, the string in angle brackets defines the recognized name for this term ( color ), and the list of items that follow the equals sign are the alternative values that will be recognized and accepted as appropriate values for the term. Note how each is separated by a pipe character.
  • You can have as many terms defined as you want on separate lines following the above structure, and include fairly complex grammar definitions. For this basic demo, we are just keeping things simple.

Plugging the grammar into our speech recognition

The next thing to do is define a speech recognition instance to control the recognition for our application. This is done using the SpeechRecognition() constructor. We also create a new speech grammar list to contain our grammar, using the SpeechGrammarList() constructor.

We add our grammar to the list using the SpeechGrammarList.addFromString() method. This accepts as parameters the string we want to add, plus optionally a weight value that specifies the importance of this grammar in relation to other grammars available in the list (can be from 0 to 1 inclusive.) The added grammar is available in the list as a SpeechGrammar object instance.

We then add the SpeechGrammarList to the speech recognition instance by setting it to the value of the SpeechRecognition.grammars property. We also set a few other properties of the recognition instance before we move on:

  • SpeechRecognition.continuous : Controls whether continuous results are captured ( true ), or just a single result each time recognition is started ( false ).
  • SpeechRecognition.lang : Sets the language of the recognition. Setting this is good practice, and therefore recommended.
  • SpeechRecognition.interimResults : Defines whether the speech recognition system should return interim results, or just final results. Final results are good enough for this simple demo.
  • SpeechRecognition.maxAlternatives : Sets the number of alternative potential matches that should be returned per result. This can sometimes be useful, say if a result is not completely clear and you want to display a list of alternatives for the user to choose the correct one from. But it is not needed for this simple demo, so we are just specifying one (which is actually the default anyway.)

Starting the speech recognition

After grabbing references to the output <div> and the HTML element (so we can output diagnostic messages and update the app background color later on), we implement an onclick handler so that when the screen is tapped/clicked, the speech recognition service will start. This is achieved by calling SpeechRecognition.start() . The forEach() method is used to output colored indicators showing what colors to try saying.

Receiving and handling results

Once the speech recognition is started, there are many event handlers that can be used to retrieve results and other pieces of surrounding information (see the SpeechRecognition events). The most common one you'll probably use is the result event, which is fired once a successful result is received:

The second line here is a bit complex-looking, so let's explain it step by step. The SpeechRecognitionEvent.results property returns a SpeechRecognitionResultList object containing SpeechRecognitionResult objects. It has a getter so it can be accessed like an array — so the first [0] returns the SpeechRecognitionResult at position 0. Each SpeechRecognitionResult object contains SpeechRecognitionAlternative objects that contain individual recognized words. These also have getters so they can be accessed like arrays — the second [0] therefore returns the SpeechRecognitionAlternative at position 0. We then return its transcript property to get a string containing the individual recognized result as a string, set the background color to that color, and report the color recognized as a diagnostic message in the UI.

We also use the speechend event to stop the speech recognition service from running (using SpeechRecognition.stop() ) once a single word has been recognized and it has finished being spoken:

Handling errors and unrecognized speech

The last two handlers are there to handle cases where speech was recognized that wasn't in the defined grammar, or an error occurred. The nomatch event seems to be supposed to handle the first case mentioned, although note that at the moment it doesn't seem to fire correctly; it just returns whatever was recognized anyway:

The error event handles cases where there is an actual error with the recognition — the SpeechRecognitionErrorEvent.error property contains the actual error returned:

Speech synthesis

Speech synthesis (aka text-to-speech, or TTS) involves taking text contained within an app, synthesizing it to speech, and playing it out of a device's speaker or audio output connection.

The Web Speech API has a main controller interface for this — SpeechSynthesis — plus a number of closely-related interfaces for representing text to be synthesized (known as utterances), voices to be used for the utterance, etc. Again, most OSes have some kind of speech synthesis system, which will be used by the API for this task as available.

To show simple usage of Web speech synthesis, we've provided a demo called Speak easy synthesis . This includes a set of form controls for entering text to be synthesized, and setting the pitch, rate, and voice to use when the text is uttered. After you have entered your text, you can press Enter / Return to hear it spoken.

UI of an app called speak easy synthesis. It has an input field in which to input text to be synthesized, slider controls to change the rate and pitch of the speech, and a drop down menu to choose between different voices.

To run the demo, navigate to the live demo URL in a supporting mobile browser.

The HTML and CSS are again pretty trivial, containing a title, some instructions for use, and a form with some simple controls. The <select> element is initially empty, but is populated with <option>s via JavaScript (see later on.)

Let's investigate the JavaScript that powers this app.

Setting variables

First of all, we capture references to all the DOM elements involved in the UI, but more interestingly, we capture a reference to Window.speechSynthesis. This is the API's entry point — it returns an instance of SpeechSynthesis, the controller interface for web speech synthesis.

Populating the select element

To populate the <select> element with the different voice options the device has available, we've written a populateVoiceList() function. We first invoke SpeechSynthesis.getVoices(), which returns a list of all the available voices, represented by SpeechSynthesisVoice objects. We then loop through this list — for each voice we create an <option> element, set its text content to display the name of the voice (grabbed from SpeechSynthesisVoice.name) and the language of the voice (grabbed from SpeechSynthesisVoice.lang), and append -- DEFAULT if the voice is the default voice for the synthesis engine (checked by seeing if SpeechSynthesisVoice.default returns true.)

We also create data- attributes for each option, containing the name and language of the associated voice, so we can grab them easily later on, and then append the options as children of the select.

Older browsers don't support the voiceschanged event, and just return a list of voices when SpeechSynthesis.getVoices() is called. On others, such as Chrome, you have to wait for the event to fire before populating the list. To allow for both cases, we run the function as shown below:

Speaking the entered text

Next, we create an event handler to start speaking the text entered into the text field. We are using an onsubmit handler on the form so that the action happens when Enter / Return is pressed. We first create a new SpeechSynthesisUtterance() instance using its constructor — this is passed the text input's value as a parameter.

Next, we need to figure out which voice to use. We use the HTMLSelectElement selectedOptions property to return the currently selected <option> element. We then use this element's data-name attribute, finding the SpeechSynthesisVoice object whose name matches this attribute's value. We set the matching voice object to be the value of the SpeechSynthesisUtterance.voice property.

Finally, we set the SpeechSynthesisUtterance.pitch and SpeechSynthesisUtterance.rate to the values of the relevant range form elements. Then, with all necessary preparations made, we start the utterance being spoken by invoking SpeechSynthesis.speak() , passing it the SpeechSynthesisUtterance instance as a parameter.

In the final part of the handler, we include a pause event handler to demonstrate how SpeechSynthesisEvent can be put to good use. When SpeechSynthesis.pause() is invoked, this returns a message reporting the character number and name that the speech was paused at.

Finally, we call blur() on the text input. This is mainly to hide the keyboard on Firefox OS.

Updating the displayed pitch and rate values

The last part of the code updates the pitch / rate values displayed in the UI, each time the slider positions are moved.

Advanced Text to Speech API

  • ~400ms latency
  • High quality at speed

Highest Quality Audio Output

Low latency turbo model.

Build Faster Than Ever

ElevenLabs Grants

3 Months Free

11M characters, API features, 1000s of HQ voices.

Create custom voices by cloning your own voice, creating a new one from scratch, or exploring our library.

Real-time Latency

Get the fastest response time in the industry with our real-time API. Achieve ~400ms audio generation times at 128kbps.

Contextual awareness

Our text to speech model understands the context of the text to deliver the most natural sounding voices.

Enterprise-ready Security

Trusted security and data controls, SOC 2 and GDPR.

Compliant with the highest security and data handling standards

Full Privacy Mode

Optional Full Privacy mode that enables zero content and data retention on ElevenLabs servers. Exclusively for Enterprise.

End-To-End Encryption

Content and data sent to and from our models are always protected

Explore our resources

  • Python library
  • React text to speech guide
  • Gaming AI voice guide
  • Multilingual text to speech API in 29 languages
  • Developer API
  • Enterprise scale

Frequently asked questions

What makes ElevenLabs API the best TTS API?

It offers unparalleled quality, multilingual capabilities, and low latency (<500ms), ensuring optimal user experience. It also provides a comprehensive library of voices and a variety of voice settings to suit any use-case.

What is a text to speech & AI voice API?

It is an application programming interface that allows developers to integrate text-to-speech and voice cloning capabilities into their applications. It works by leveraging deep learning to convert text into speech, and speech into a different voice. The technology has had significant growth in recent months due to its ability to create a more immersive user experience. It is used to create audiobooks, podcasts, voice assistants, and more. It can also be used to create custom voices for gaming, movies, and other media.

How do I get started with the text to speech API?

You can get started by signing up for a free account. Once you have an account, find your xi-api-key in your profile settings after registration. This key is required for authentication in API requests. You can then generate audio from text in a variety of languages by sending a POST request to the API with the desired text and voice settings. The API returns an audio file in response. You can use a programming language like Python for these requests, as in the sketch below.
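As a minimal sketch of such a request in Python (the requests library is assumed; the voice ID and key are placeholders from your account, and ElevenLabs documents further payload options):

    import requests

    voice_id = "YOUR_VOICE_ID"
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

    headers = {"xi-api-key": "YOUR_XI_API_KEY"}
    payload = {"text": "Hello from the text to speech API!"}

    response = requests.post(url, headers=headers, json=payload, timeout=60)
    response.raise_for_status()
    with open("speech.mp3", "wb") as f:
        f.write(response.content)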

How does the API ensure high-quality output?

It delivers audio at 128 kbps, allowing for a premium listening experience. It also offers a variety of voice settings to suit any use-case, including emotional range, contextual awareness, and voice variety.

Can I get support during the integration process?

Yes, extensive resources, an active developer community, and a responsive support team are available to assist you.

How many languages does the API support?

Our text to speech API supports 29 languages including Hindi, Spanish, German, Arabic & Chinese. Each voice maintains its unique characteristics across all languages.

What is the latency of the text to speech API?

The API boasts ultra-low latency, achieving approximately 400ms audio generation times with its Turbo model. This ensures a quick turnaround from text input to audio output. Multiple latency optimization modes are available, enabling significant improvements and responsiveness.

What are the use cases for the ElevenLabs TTS API?

The API can be used to create audiobooks, podcasts, voice assistants, and more. It can also be used to create custom voices for gaming, movies, and other media.

What is an AI voice API and how does it work?

An AI voice API is an application programming interface that allows developers to integrate text-to-speech and voice cloning capabilities into their applications. It works by leveraging deep learning to convert text into speech, and speech into a different voice.

What is the best text to speech (TTS) API?

The best text to speech API is one that offers high-quality output, multilingual capabilities, and low latency. It should also provide a comprehensive library of voices and a variety of voice settings to suit any use-case. You can find all of these features and more with ElevenLabs.

Using the Text-to-Speech API with Python

1. Overview

The Text-to-Speech API enables developers to generate human-like speech. The API converts text into audio formats such as WAV, MP3, or Ogg Opus. It also supports Speech Synthesis Markup Language (SSML) inputs to specify pauses, numbers, date and time formatting, and other pronunciation instructions.

In this tutorial, you will focus on using the Text-to-Speech API with Python.

What you'll learn

  • How to set up your environment
  • How to list supported languages
  • How to list available voices
  • How to synthesize audio from text

What you'll need

  • A Google Cloud project
  • A browser, such as Chrome or Firefox
  • Familiarity with Python

2. Setup and requirements

Self-paced environment setup

  • Sign in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.

  • The Project name is the display name for this project's participants. It is a character string not used by Google APIs. You can always update it.
  • The Project ID is unique across all Google Cloud projects and is immutable (cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't care what it is. In most codelabs, you'll need to reference your Project ID (typically identified as PROJECT_ID ). If you don't like the generated ID, you might generate another random one. Alternatively, you can try your own, and see if it's available. It can't be changed after this step and remains for the duration of the project.
  • For your information, there is a third value, a Project Number , which some APIs use. Learn more about all three of these values in the documentation .
  • Next, you'll need to enable billing in the Cloud Console to use Cloud resources/APIs. Running through this codelab won't cost much, if anything at all. To shut down resources to avoid incurring billing beyond this tutorial, you can delete the resources you created or delete the project. New Google Cloud users are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this codelab you will be using Cloud Shell, a command-line environment running in the cloud.

Activate Cloud Shell

If this is your first time starting Cloud Shell, you're presented with an intermediate screen describing what it is. If you were presented with an intermediate screen, click Continue .

It should only take a few moments to provision and connect to Cloud Shell.

This virtual machine is loaded with all the development tools needed. It offers a persistent 5 GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with a browser.

Once connected to Cloud Shell, you should see that you are authenticated and that the project is set to your project ID.

  • Run the following command in Cloud Shell to confirm that you are authenticated:

    gcloud auth list

Command output

  • Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:

    gcloud config list project

If it is not, you can set it with this command:

    gcloud config set project <YOUR_PROJECT_ID>

3. Environment setup

Before you can begin using the Text-to-Speech API, run the following command in Cloud Shell to enable the API:

    gcloud services enable texttospeech.googleapis.com

You should see something like this:

Now, you can use the Text-to-Speech API!

Navigate to your home directory:

    cd ~

Create a Python virtual environment to isolate the dependencies:

    virtualenv venv-texttospeech

Activate the virtual environment:

    source venv-texttospeech/bin/activate

Install IPython and the Text-to-Speech API client library:

    pip install ipython google-cloud-texttospeech

Now, you're ready to use the Text-to-Speech API client library!

In the next steps, you'll use an interactive Python interpreter called IPython, which you installed in the previous step. Start a session by running ipython in Cloud Shell:

    ipython

You're ready to make your first request and list the supported languages...

4. List supported languages

In this section, you will get the list of all supported languages.

Copy the following code into your IPython session:
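A sketch of that code, using the google-cloud-texttospeech client library (the helper name list_languages matches the call below; the output formatting is this sketch's own):

    from google.cloud import texttospeech

    def list_languages():
        client = texttospeech.TextToSpeechClient()
        response = client.list_voices()
        # Each voice may support several language codes; collect the unique set.
        languages = sorted({
            code for voice in response.voices for code in voice.language_codes
        })
        print(f"{len(languages)} supported languages:")
        for language in languages:
            print(f"  {language}")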

Take a moment to study the code and see how it uses the list_voices client library method to build the list of supported languages.

Call the function:

    list_languages()

You should get the following (or a larger) list:

The list shows 58 languages and variants such as:

  • Chinese and Taiwanese Mandarin,
  • Australian, British, Indian, and American English,
  • French from Canada and France,
  • Portuguese from Brazil and Portugal.

This list is not fixed and grows as new voices are available.

This step allowed you to list the supported languages.

5. List available voices

In this section, you will get the list of voices available in different languages.
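The corresponding helper can be sketched like this (the list_voices name and language_code parameter follow the description below; the printed format is this sketch's own):

    from google.cloud import texttospeech

    def list_voices(language_code=None):
        client = texttospeech.TextToSpeechClient()
        response = client.list_voices(language_code=language_code)
        voices = sorted(response.voices, key=lambda voice: voice.name)
        print(f"{len(voices)} voices:")
        for voice in voices:
            languages = ", ".join(voice.language_codes)
            gender = voice.ssml_gender.name
            rate = voice.natural_sample_rate_hertz
            print(f"  {voice.name} ({languages}; {gender}; {rate:,} Hz)")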

Take a moment to study the code and see how it uses the client library method list_voices(language_code) to list voices available for a given language.

Now, get the list of available German voices:

Multiple female and male voices are available, as well as standard, WaveNet, Neural2, and Studio voices:

  • Standard voices are generated by signal processing algorithms.
  • WaveNet, Neural2, and Studio voices are higher quality voices synthesized by machine learning models and sounding more natural.

Now, get the list of available English voices:

    list_voices("en")

You should get something like this:

In addition to a selection of multiple voices in different genders and qualities, multiple accents are available: Australian, British, Indian, and American English.

Take a moment to list the voices available for your preferred languages and variants (or even all of them):

    list_voices()

This step allowed you to list the available voices. You can read more about the supported voices and languages.

6. Synthesize audio from text

You can use the Text-to-Speech API to convert a string into audio data. You can configure the output of speech synthesis in a variety of ways, including selecting a unique voice or modulating the output in pitch, volume, speaking rate, and sample rate.
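A sketch of such a function (the text_to_wav name and the LINEAR16/WAV output are assumptions consistent with the description below):

    from google.cloud import texttospeech

    def text_to_wav(voice_name, text):
        # Voice names look like "en-GB-Neural2-F"; the first two parts
        # form the language code.
        language_code = "-".join(voice_name.split("-")[:2])
        client = texttospeech.TextToSpeechClient()
        response = client.synthesize_speech(
            input=texttospeech.SynthesisInput(text=text),
            voice=texttospeech.VoiceSelectionParams(
                language_code=language_code, name=voice_name
            ),
            audio_config=texttospeech.AudioConfig(
                audio_encoding=texttospeech.AudioEncoding.LINEAR16
            ),
        )
        filename = f"{language_code}.wav"
        with open(filename, "wb") as out:
            out.write(response.audio_content)
        print(f'Audio content written to "{filename}"')

You can then call it with any voice name returned in the previous step, assuming that voice appears in your list.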

Take a moment to study the code and see how it uses the synthesize_speech client library method to generate the audio data and save it as a wav file.

Now, generate sentences in a few different accents:

To download all generated files at once, you can use this Cloud Shell command from your Python environment:

    !cloudshell download *.wav

Validate and your browser will download the files:

Open each file and hear the result.

In this step, you were able to use the Text-to-Speech API to convert sentences into audio wav files. Read more about creating voice audio files.

7. Congratulations!

You learned how to use the Text-to-Speech API using Python to generate human-like speech!

To clean up your development environment, from Cloud Shell:

  • If you're still in your IPython session, go back to the shell: exit
  • Stop using the Python virtual environment: deactivate
  • Delete your virtual environment folder: cd ~ ; rm -rf ./venv-texttospeech

To delete your Google Cloud project, from Cloud Shell:

  • Retrieve your current project ID: PROJECT_ID=$(gcloud config get-value core/project)
  • Make sure this is the project you want to delete: echo $PROJECT_ID
  • Delete the project: gcloud projects delete $PROJECT_ID
Learn more

  • Test the demo in your browser: https://cloud.google.com/text-to-speech
  • Text-to-Speech documentation: https://cloud.google.com/text-to-speech/docs
  • Python on Google Cloud: https://cloud.google.com/python
  • Cloud Client Libraries for Python: https://github.com/googleapis/google-cloud-python

This work is licensed under a Creative Commons Attribution 2.0 Generic License.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License , and code samples are licensed under the Apache 2.0 License . For details, see the Google Developers Site Policies . Java is a registered trademark of Oracle and/or its affiliates.

Best Text to Speech APIs

List of the Top Text-to-Speech APIs (also known as TTS APIs) available on RapidAPI.

About this Collection:

Text to Speech APIs

About TTS APIs

TTS APIs (text to speech APIs) can be used to enable speech-based text output in an app or program in addition to providing text on a screen.

What is text to speech?

Text to speech (TTS), also known as speech synthesis, is the process of converting written text to spoken audio. In most cases, text to speech refers specifically to text on a computer or other device.

How does a text-to-speech API work?

First, a program sends text to the API as a request, typically in JSON format. Optionally, text can often be formatted using SSML, a type of markup language created to improve the efficiency of speech synthesis programs.

Once the API receives the request, it will return the equivalent audio object. This object can then be integrated into the program which made the request and played for the user.
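As a minimal sketch of that flow in Python (the endpoint, fields, and key below are entirely hypothetical, for illustration only):

    import requests

    # Hypothetical TTS endpoint and request fields.
    response = requests.post(
        "https://api.example-tts.com/v1/synthesize",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={"text": "Hello, world!", "voice": "en-US-1", "format": "mp3"},
        timeout=60,
    )
    response.raise_for_status()
    with open("hello.mp3", "wb") as f:
        f.write(response.content)  # the returned audio object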

The best text to speech APIs also allow selection of accent and gender, as well as other options.

Who is text to speech for?

Text to speech is crucial for some users with disabilities. Users with vision problems may be unable to read text and interpret figures that rely on sight alone, so the ability to have content spoken to them instead of reading can mean the difference between an unusable program and a usable one.

While screen readers and other types of adaptive hardware and software exist to allow users with disabilities to use inaccessible programs, these can be complicated and expensive. It’s almost always better to provide a native text-to-speech solution within your program or app.

Text-to-speech APIs can also help nondisabled users, however. There are many use cases for text to speech, including safer use of an app or program in situations where looking at a screen might be dangerous, distracting or just inconvenient. For example, a sighted user following a recipe on their phone could have it read aloud to them instead of constantly having to clean their hands to check the next step.

Why is a text-to-speech API important?

Using an API for text to speech can make programs much more effective.

Especially because speech synthesis is such a specialized and complex field, an API can free up developers to focus on the unique strengths of their own program.

Users with disabilities also have higher expectations than in the past, and developers are better off meeting their needs with a robust, established text to speech API rather than using a homegrown solution.

What can you expect from the best text to speech APIs?

Any text to speech API will return an audio file.

The best produce seamless audio that sounds like it was spoken by a real human being. In some cases, APIs even allow developers to create their own voice model for the audio output they request.

High-quality APIs of any sort should also include support and extensive documentation.

Are there examples of the best free TTS APIs?

  • Text to Speech
  • IBM Watson TTS
  • Robomatic.ai
  • Text to Speech - TTS
  • Microsoft Text Translator
  • Text-to-Speech

Text to Speech API SDKs

All text to speech APIs are supported and made available in multiple developer programming languages and SDKs including:

  • Objective-C
  • Java (Android)

Just select your preference from any API endpoints page.

Sign up today for free on RapidAPI to begin using Text to Speech APIs!

Top 10 Text to Speech APIs

Imagine a world where every written word has a voice, where websites, software, and applications effortlessly speak the language of their users. This is where text to speech (TTS) APIs reign supreme.

By seamlessly transforming text inputs into rich, natural-sounding audio files, TTS APIs bridge the gap between applications and users for a rich and immersive experience. They capture the subtle nuances of intonation, cadence, accent, and pronunciation to reel every listener in.

This post will discuss the top 10 text to speech APIs available in 2024. Whether you are a developer looking to add voice capabilities to your application or interested in the latest advancements in speech technology, these APIs can meet your voiceover needs. Let’s start!

Table of Contents

  • Key features
  • Identify your requirements
  • Natural-sounding voice
  • Language support
  • Integration capabilities
  • Trial options
  • Customer support
  • Documentation and resources
  • Customization and configuration
  • 10 best text to speech APIs

Given the benefits of text to speech APIs, such as increased accessibility to digital content, enhanced user experience, multilingual support, scalability, and more, they are among the most sought-after technological innovations. However, with the abundance of text to speech APIs, it’s easy to get lost in the sea of options.

To streamline your exploration, we have come up with a detailed list of the best text to speech APIs to explore in 2024:

Murf’s  text to speech API  helps businesses deploy high-quality, natural-sounding voices to their website, software, and applications at scale.

With a wide array of 100% natural-sounding AI voices available in 20+ languages , Murf enables the creation of professional voiceovers for videos and presentations, enhancing the overall user experience.

Powerful voice customization features for control over pitch, speed, pronunciation , and pause

Multiple export formats, including MP3 , WAV, and FLAC files

Access to 40+  high-fidelity English voices across accents like British, American, Scottish, and Indian for generating natural-sounding voiceovers 

Customizable sampling rates at 8kHz, 24kHz, and 48kHz

Amazon Polly

Amazon Polly’s cloud-based TTS API uses speech synthesis markup language (SSML) to generate realistic speech from text. It enables users to seamlessly integrate speech synthesis into an application to enhance accessibility and engagement. Users can get Amazon Polly as a free text to speech API in the AWS free tier plan but with limitations in voice generation.

  • Supports Standard and Neural text to speech in over 20 languages and language variants
  • SSML-based voice customizations for pitch, volume, rate, and pronunciation
  • Audio files are available in MP3 and OGG formats
  • Sampling rates of 8 kHz, 16 kHz, 22.05 kHz, and 24 kHz
  • Custom lexicons to add unique words and pronunciations

Microsoft Azure

Microsoft Azure’s text to speech API  follows a RESTful architecture for its text to speech interface. The cloud-based service allows flexible deployment, allowing users to run TTS at data sources. Plus, it uses SSML to exercise granular control over the synthetic speech’s rate, pitch, pause, pronunciation, and other parameters.

  • Supports 80+ languages and language variants for different locales
  • Operates on neural text to speech with SSML-based audio control
  • Custom Neural Voice allows training AI models on actual voice samples for a personalized synthetic voice
  • Certified by PCI DSS, SOC, HIPAA, HITECH, FedRAMP, and ISO

Google Cloud Text to Speech

Google Cloud’s TTS API is built on the company’s proprietary DeepMind neural network, which is trained with large volumes of speech samples. As a result, Google text to speech AI API offers the widest selection of human-quality voices. 

  • Available in 50+ languages with localization features and 380+ voices
  • Voice internationalization using Neural2, Standard, WaveNet, and Studio voices
  • Custom voice training for a tailored brand voice
  • Voice tuning with 20 built-in semitones of pitch adjustment and a configurable speaking rate up to 4x

IBM Watson

The IBM Watson text to speech API exposes IBM’s speech synthesis capabilities over HTTP and WebSocket interfaces. It uses SSML and offers two main voice families: expressive neural voices and enhanced neural voices for natural-sounding conversations. Premium users can also create custom voices.

  • Leverages deep neural networks (DNNs) to predict pitch, spectral structure, and waveform
  • Works with 14+ languages and language variations
  • Generated speech is available in Ogg, MP3, WAV, FLAC, PCM, A-law, Mu-law, G.729, and basic audio
  • The Tune by Example feature allows speech synthesis modifications without SSML knowledge

Lovo’s AI-powered voice generator and text to speech platform, Genny, effectively translates written text into hyper-realistic speech within seconds. 

Genny’s TTS API can analyze linguistic patterns and customize speech parameters like voice and accent to match specific requirements. 

  • Available in 100+ languages and 400+ voices of varying styles
  • Emotional Voices allow the incorporation of 25 emotions into speech
  • Upload subtitles or SRT files to automatically align voiceovers to videos
  • Voice cloning to generate branded voices

Play.ht

Play.ht offers conversational synthetic TTS voices that can match diverse applications. Users can pick from a variety of options in conversations, narrations, emotions, accents, and more to generate unique audio. Play.ht claims that its text to speech API can generate speech in less than 300 ms, which is impressive!

  • A library of 142 languages and accents across 829 AI voices
  • Automatic syncs for real-time updates of the latest voices
  • Audio files are downloadable in MP3 and WAV formats
  • Text and SSML support to manipulate speech

Resemble AI

Resemble’s RESTful TTS API allows users to create a voice in as little as five lines! As for the rest, users can programmatically access web-generated content. Alternatively, they can browse the Resemble AI marketplace and pick their favorite or record their voice. Either way, Resemble rapidly and scalably supports production-ready integrations for voice generation.

  • The Core Cloning engine supports the building and control of unique voices
  • One-click upload to customize voices from audio inputs (with due consent)
  • Hosts a thriving AI Voice Marketplace
  • Supports 35 languages with 100+ localization variables

Speechify’s voice API centers around the accessibility of websites and applications in publication, blogging, content marketing, and resource database management. It also helps businesses increase engagement and retain customers. Speechify is also available as a Chrome extension to read out textual content.

  • Inline player that seamlessly fits different layouts and designs of existing websites
  • Live text highlighting of the active sentences or words that Speechify is reading out
  • Floating widget that allows speech control even while scrolling
  • Available for web and iOS

ReadSpeaker

The ReadSpeaker cloud-based text to speech API is straightforward, easy to integrate, and streams over multiple channels (desktop, web, mobile). The high-capacity TTS API is a part of the ReadSpeaker Web Application Service Platform and comes with SSML control to customize playback.

  • Built-in customizable dictionary to save specific terms
  • Offers 200+ voices in 50+ languages
  • Timing information allows synced active highlighting within the API
  • Produces audio files in multiple formats: PCM, A-law, u-law, Ogg, MP3, and WAV

Choosing the Best Text to Speech API for Your Needs

Choosing the best text to speech API is no child’s play. However, here’s a cheat sheet to simplify the selection:

Identify your requirements

Take stock of factors like text volume, voice characteristics, and intended application to narrow down your TTS API options depending on project goals and user expectations.

Natural-sounding voice

Opt for software that offers a library of diverse voices with control over tones, accents, emotions, and other expressive qualities to make the speech sound more natural.

Language support

Multilingual support allows businesses to connect with their target audience in a local language. Language localization can also help them enter a new market segment.

Integration capabilities

Test for compatibility with programming languages, frameworks, and platforms to assess integration capabilities with the development environment.

Trial options

TTS APIs offering free trials allow users to experience the product in real-world scenarios and evaluate industry-specific performance and service quality before committing to a paid plan.

Customer support

Although the API documentation and forums offer sufficient aid during implementation and customization, a TTS API provider with robust customer support can also help address integration issues and formulate specific use cases.

Documentation and resources

Go for a TTS API that transparently maintains comprehensive documentation and resources. It will improve the development and integration experience and help with support and troubleshooting.

Customization and configuration

The TTS API should be customizable and configurable to accommodate business-specific project requirements. It should also grant flexibility in adjusting audio output, such as voice modulation, pronunciation, and language, for an on-brand experience.

Choosing Murf: The Ideal Text to Speech API for Your Needs

TTS APIs offer the opportunity to integrate natural-sounding speech into business applications. With such capabilities, organizations can comfortably meet their goals surrounding accessibility, multilingual communication, and rich user experiences. The resulting innovations can also grant digital solutions a competitive edge in making modern applications more interactive and engaging.

If you are looking for a text to speech API that excels in versatility, quality, and ease of integration, Murf's unique AI voice generator could be your ideal choice. Just reach out to the Murf team to get your API key, generate an authentication token, and access a variety of natural-sounding voices in different languages.

What is a text to speech API?

A text to speech API is a software interface that converts written text into spoken words. Businesses can integrate these APIs with their applications, websites, and services to deliver information in natural-sounding, human-like speech, enhancing user experience and accessibility.

What are the benefits of a TTS API?

The best text to speech API presents the following benefits:

  • TTS APIs are highly versatile and can be used in various domains like virtual assistants, customer service, accessibility tools, and navigation.
  • They improve access to content and information, especially for visually impaired users.
  • Natural-sounding speech makes the user experience richer, more engaging, interactive, and immersive.
  • They support multiple languages, which increases an app's, website's, or service's global reach.

What is the best TTS API?

Determining the best TTS API depends largely on the user’s unique requirements and objectives. Refer to the handy guide above to help you identify the best text to speech API.

Is there a text to speech API?

Yes, there are several TTS API service providers like:

  • ReadSpeaker

What is the most human-like text to speech API?

The following TTS APIs have the most natural-sounding, human-like audio outputs:

  • Google Cloud TTS

How do I enable text to speech API?

To enable a TTS API, you need to register with the chosen API service provider. Once you have selected the plan that meets your business goals, obtain the API keys and integrate them into your website or application.

Follow the API documentation for any specific use cases, implementation support, and customizations.

You should also read:

How to create engaging videos using TikTok text to speech

An in-depth guide on how to use Text to Speech on Discord

Medical Text to Speech: Changing Healthcare for the Better


10 Best Text-to-Speech APIs for Software Developers in 2023

by Oliver Goodwin | May 26, 2023

Reading Time: 8 minutes

Text-to-Speech technology has revolutionized how we interact with written content, offering a seamless auditory experience. This article delves into text-to-speech APIs, exploring the top options available and guiding you toward the best choices based on your needs.

The demand for natural-sounding speech solutions that can effortlessly convert written text into lifelike audio content is rising in today’s fast-paced digital world. However, finding the right text-to-speech solution that meets your requirements can be daunting. The lack of clear information, overwhelming options, and varying quality levels make this choice complex.

Have you ever struggled to find a text-to-speech API that delivers accurate transcriptions, lifelike voices, and supports multiple languages? Are you tired of spending hours researching, testing, and comparing different text-to-speech providers, only to end up with subpar results or complex integrations? 

Fret not, as we have done the groundwork for you. Whether you are a developer, content creator, or business owner, we will provide the necessary insights to make an informed decision and enhance your text-to-speech experience.

In this article, we will explore what exactly a text-to-speech API is and how it functions. We will also dive into a detailed review of the top ten text-to-speech APIs, examining their features, benefits, and pricing. Additionally, we have compiled a list of frequently asked questions (FAQs) to address the common queries and provide further clarity.

What Is a Text-to-Speech API, and How Does It Work?

A text-to-speech application programming interface (API) is a powerful technology that enables developers to convert written text into lifelike speech using artificial intelligence and machine learning algorithms . It allows applications, websites, and other digital platforms to generate natural-sounding audio output from textual content.

Below are the steps that explain how a text-to-speech API works:

  • Text Analysis and Processing: The text-to-speech API analyses text, breaking it into smaller units such as sentences, phrases, or individual words. It considers punctuation, capitalization, and formatting to ensure accurate and natural speech output. This process involves Natural Language Processing (NLP) techniques and machine learning models to interpret the text effectively.
  • Linguistic Processing and Voice Generation: Using advanced linguistic rules and algorithms, the text-to-speech API interprets the text and determines the appropriate pronunciation, intonation, and emphasis. It applies Speech Synthesis Markup Language (SSML) and machine learning technology to generate natural and human-like speech. The API leverages a wide range of high-quality voices, including multiple languages and various speaking styles, to offer diverse options for audio output.
  • Audio Playback and Integration: Once the speech synthesis process is complete, the text-to-speech API delivers the synthesized audio in a suitable format, such as WAV or MP3. Developers can seamlessly integrate this audio playback into their applications, websites, or services. The API provides easy-to-use interfaces, allowing developers to incorporate text-to-speech capabilities effortlessly.

Using a text-to-speech API, developers can create applications with realistic voices, customizing the speech output to suit specific needs. In addition, these text-to-speech APIs enable the conversion of written text into spoken words, making it ideal for applications such as e-learning platforms, voice-based apps, video editing, and more.

Lastly, text-to-speech APIs support multiple languages, allowing users to experience lifelike speech in their preferred language.
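To make the flow concrete, here is a minimal, hypothetical sketch in Python. The endpoint, parameter names, voice ID, and key below are placeholders for illustration, not any specific vendor's API:

```python
import requests

API_KEY = "your-api-key"  # placeholder: issued by the provider when you register
ENDPOINT = "https://api.example-tts.com/v1/synthesize"  # hypothetical endpoint

payload = {
    "text": "Hello, world!",    # the text to analyze and synthesize
    "voice": "en-US-female-1",  # hypothetical voice identifier
    "format": "mp3",            # requested output audio format
}

# Send the text and configuration; the service returns synthesized audio.
response = requests.post(
    ENDPOINT,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()

# Save the binary audio for playback or embedding in an application.
with open("hello.mp3", "wb") as f:
    f.write(response.content)
```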

The Best 10 Text-to-Speech APIs You Should Know

1. Synthesys API


Synthesys is an AI voice generator with a leading text-to-speech API that offers natural-sounding voices with lifelike intonations and high-quality audio. With its extensive language support and customizable speech styles, Synthesys provides an excellent choice for applications requiring human-like voices and accurate speech synthesis. 

Its vast library of languages also proves that it is versatile for various global applications. Let us take a look at the features and benefits that developers, content creators, and business owners stand to enjoy by opting for this API.

Key Features and Benefits:

  • The Synthesys text-to-speech API supports 140 different languages across every continent in the world. This makes the API cosmopolitan.
  • It contains a library of 374 unique voices.
  • It is super user-friendly and does not take much effort or involve any convoluted process.
  • The API has 31 programming language variations, which means it accommodates virtually any kind of developer. Good news for programmers.
  • It offers 25 requests per minute, two hours of audio per day, and a 300-character limit per request for the Lifelike plan and 4,000 for the premium plan.

How It Works:

The first step to using the Synthesys API is purchasing a plan. Then, you generate an API secret key and get your API authentication key. Once the setup is complete, follow the guide to create your audio.

To access Synthesys’ API, two packages are available: the lifelike and the premium packages. The lifelike package costs $199 per month. To find out how much the premium plan costs, contact Synthesys Sales.

2. Google Cloud Text-to-Speech API


Google Cloud Text-to-Speech API empowers developers to integrate natural-sounding human speech into their applications. It can convert text or Speech Synthesis Markup Language (SSML) input into various audio formats like MP3 or LINEAR16.

  • 380+ voices and 50+ languages.
  • Developed based on DeepMind’s speech synthesis.
  • Offers the opportunity to create your unique voice.
  • Gives you audio format flexibility: MP3, Linear16, OGG, Opus, etc.
  • Pitch and volume flexibility.

Google Cloud Text-to-Speech API offers four pricing options: Neural2 voices at $16 per 1 million bytes, Studio voices at $160 per 1 million bytes, Standard voices at $4 per 1 million characters, and WaveNet voices at $16 per 1 million characters.
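For a sense of what integration looks like, here is a minimal sketch using Google's official google-cloud-texttospeech Python client; it assumes a Google Cloud project with the API enabled and application default credentials already configured:

```python
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# Assemble the three request parts: input text, voice selection, audio config.
response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Hello from Google Cloud!"),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
    ),
)

# The response carries the synthesized audio as binary MP3 data.
with open("output.mp3", "wb") as out:
    out.write(response.audio_content)
```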

3. Amazon Polly API


Amazon Polly API gives you the experience of transformative capabilities. It is a cutting-edge service that effortlessly converts text into lifelike speech. It empowers your applications to engage users and venture into innovative realms of speech-enabled products.

  • Free 5 million characters per month for 12 months.
  • Freedom to customize speech using lexicons and Speech Synthesis Markup Language (SSML).
  • Storage and redistribution of speech in standard formats, such as MP3 and OGG.

With Amazon Polly API, you enjoy two plans: the Standard voices priced at $4 per 1 million characters and Neural voices priced at $16 per 1 million characters.
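For comparison, a minimal Amazon Polly sketch using the boto3 SDK; it assumes AWS credentials are already configured, and the region is a placeholder:

```python
import boto3

polly = boto3.client("polly", region_name="us-east-1")  # placeholder region

response = polly.synthesize_speech(
    Text="Hello from Amazon Polly!",
    VoiceId="Joanna",     # one of Polly's built-in voices
    OutputFormat="mp3",
    Engine="neural",      # "standard" selects the cheaper tier
)

# AudioStream is a streaming body containing the MP3 bytes.
with open("polly-output.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```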

4. Synthesia API


Synthesia API provides accurate and customizable text-to-speech synthesis with lifelike voices, offering natural-sounding audio output for various applications requiring high-quality speech synthesis and enhanced user experiences. What makes Synthesia special is that it is a text-to-speech API for videos.

  • You can integrate your videos into SaaS apps.
  • You can create personalized videos.
  • Create cinematic content.

How It Works:

  • Get a paid Synthesia studio account.
  • Generate your API key.
  • Create and download your content.

Synthesia offers a $30 per month personal plan. However, this plan lacks API features. To unlock API access, you must go for the enterprise plan, for which you must book a demo.

5. Murf AI API


Murf is one of the most popular text-to-speech tools in the market today. It is one of the few APIs that allow you to clone your voice or create custom voice models. Let us run through its features and benefits.

  • Support for over 40 languages, though some capabilities are available only in English.
  • 15-day trial for new registrants.
  • Ability to clone any voice you want.
  • Audio format flexibility.

How It Works:

  • Fill out the API access form.
  • Submit your exact requirements.
  • While your API access is being readied, go through the studio’s API documentation.
  • Integrate the API into your websites and begin to create your content.

At Murf, access to API starts at $750 for three months.

6. HeyGen API


HeyGen is another text-to-speech tool, just like Synthesys , that can incorporate text-to-speech technology into videos.

  • Over 300 voices across 40+ languages.
  • Natural-sounding voices.
  • Multi-gender voices.
  • Ability to add webhooks to your programs.

How It Works:

  • Create a pro or enterprise HeyGen account.
  • Generate an API key.
  • Incorporate webhooks if you wish.
  • Start creating your content.

For access to API support, you need to create a pro account, which is priced at $2 per minute with a cap of 120 minutes per month.

7. Microsoft Azure Text-to-Speech API


Azure text-to-speech API lets you build apps and services that deploy text-to-speech solutions that are human-like and interoperable across various platforms and devices.

  • It is cloud-based, which means you can access your data and build your services or apps anytime, anywhere.
  • Customizable voices.
  • Voice flexibility—you can adjust your speech parameters, such as pitch, pronunciation, intonation, pauses, etc., using speech synthesis markup language.
  • Guaranteed data privacy and security with access to delete anything anytime.

Azure comes with four API plans: developer at $48.04, basic at $147.17 per month, standard at $686.72 per month, and premium at $2,795.17 per month.
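A minimal sketch of calling the Azure text to speech REST endpoint from Python; the key, region, and voice name are placeholders to replace with your own values:

```python
import requests

SPEECH_KEY = "<your-speech-resource-key>"  # placeholder
REGION = "westus"                          # placeholder: your resource's region

url = f"https://{REGION}.tts.speech.microsoft.com/cognitiveservices/v1"
headers = {
    "Ocp-Apim-Subscription-Key": SPEECH_KEY,
    "Content-Type": "application/ssml+xml",
    "X-Microsoft-OutputFormat": "audio-16khz-128kbitrate-mono-mp3",
}

# Requests are expressed as SSML; the voice element picks a neural voice.
ssml = (
    "<speak version='1.0' xml:lang='en-US'>"
    "<voice name='en-US-JennyNeural'>Hello from Azure text to speech.</voice>"
    "</speak>"
)

resp = requests.post(url, headers=headers, data=ssml.encode("utf-8"))
resp.raise_for_status()
with open("azure-tts.mp3", "wb") as f:
    f.write(resp.content)
```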

8. Wellsaid text-to-speech API


Wellsaid text-to-speech API is another platform that typifies convenience for developers. Let us see how.

  • You do not have to worry about hosting, scaling, and upgrading your voice architecture, as Wellsaid Lab handles all these.
  • Lifelike synthetic voices.
  • Restricted to MP3 audio format only.
  • Scalable up to billions of characters per month.

The cost of accessing Wellsaid text-to-speech API is not expressly stated, so you might have to book a call to find out.

9. AI Studios text-to-speech API


This API helps developers and producers streamline synthesis production by automating repetitive processes, minimizing editing, saving time, and ensuring efficiency.

  • Over 100 voices in more than 80 languages.
  • 99% of its avatars are reality avatars.
  • There is a library of templates that developers can choose from.

How It Works:

  • Subscribe to the API pro plan.
  • Make your API content.

Using AI Studio’s API library requires you to subscribe to the pro or enterprise plans. The pro plan costs $225 per month.

10. Hour One API


Hour One prioritizes simultaneity and multitasking, which is why this particular API empowers developers to create hundreds of pieces of text-to-speech content at once.

  • Seamless audio file sharing.
  • Adjustable volume and speed.

To use Hour One’s text-to-speech API, you must be subscribed to the Enterprise plan. To know more about the Enterprise plan, contact their support.

Frequently Asked Questions (FAQ)

What is a text-to-speech API, and how does it work?

A text-to-speech API is a software interface that utilizes machine learning and natural language processing to convert written text into lifelike speech. By analyzing the text input, the API generates an audio output that mimics human speech patterns and intonations, providing a natural and immersive listening experience.

Text-to-speech APIs enable smooth integration of text-to-speech functionality into various applications and platforms, enhancing accessibility, user engagement, and content personalization.

What should I look out for when choosing a text-to-speech API?

Look for user-friendliness, cost-effectiveness, human-like voice quality, extensive language support, flexible customization options, platform compatibility, generous usage limits, and solid support and documentation.

Is Synthesys a good option?

Yes, Synthesys API is one of the most viable options out there. It caters to any consumer wanting to employ it by easing the process of achieving the desired results.

Moreover, it offers as many languages, voices, programming language variations, and usage allowances as most projects need.

In Conclusion

The voiceover industry is witnessing a significant shift towards automation, and text-to-speech APIs are becoming increasingly popular.

However, the quality of text-to-speech APIs varies, and businesses must choose the best options for their needs. This article discussed the best ten text-to-speech APIs, exploring their features, benefits, pricing plans, etc.

It is important to note the criteria to consider when choosing a text-to-speech API: voice quality, language support, customization options, developer-friendliness, pricing and usage limits, platform compatibility, and support and documentation.

Among the options highlighted above, Synthesys API edges the others out slightly. While it is similar to many others in terms of pricing and voice flexibility, it possesses two standout features that make the job easy for developers and producers alike: programming language variations and usage limits.

Besides supporting 140 languages and 374 voices, it has a collection of 31 different programming language variations to aid in development diversity. Fortunately, it also comes with lax usage limits—a feature grossly lacking in the other options discussed.


Cloud Text-to-Speech basics

Text-to-Speech allows developers to create natural-sounding, synthetic human speech as playable audio. You can use the audio data files you create using Text-to-Speech to power your applications or augment media like videos or audio recordings (in compliance with the Google Cloud Platform Terms of Service including compliance with all applicable law).

Text-to-Speech converts text or Speech Synthesis Markup Language (SSML) input into audio data like MP3 or LINEAR16 (the encoding used in WAV files).

This document is a guide to the fundamental concepts of using Text-to-Speech. Before diving into the API itself, review the quickstarts.

Basic example

Text-to-Speech is ideal for any application that plays audio of human speech to users. It allows you to convert arbitrary strings, words, and sentences into the sound of a person speaking the same things.

Imagine that you have a voice assistant app that provides natural language feedback to your users as playable audio files. Your app might take an action and then provide human speech as feedback to the user.

For example, your app may want to report that it successfully added an event to the user's calendar. Your app constructs a response string to report the success to the user, something like "I've added the event to your calendar."

With Text-to-Speech, you can convert that response string to actual human speech to play back to the user, similar to the example provided below.

Example 1. Audio file generated from Text-to-Speech

To create an audio file like example 1, you send a request to Text-to-Speech like the following code snippet.
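In the official guide, the request body is JSON along these lines (the voice name and text are illustrative):

```json
{
  "input": {
    "text": "I've added the event to your calendar."
  },
  "voice": {
    "languageCode": "en-GB",
    "name": "en-GB-Standard-A",
    "ssmlGender": "FEMALE"
  },
  "audioConfig": {
    "audioEncoding": "MP3"
  }
}
```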

Speech synthesis

The process of translating text input into audio data is called synthesis and the output of synthesis is called synthetic speech . Text-to-Speech takes two types of input: raw text or SSML-formatted data (discussed below). To create a new audio file, you call the synthesize endpoint of the API.

The speech synthesis process generates raw audio data as a base64-encoded string. You must decode the base64-encoded string into an audio file before an application can play it. Most platforms and operating systems have tools for decoding base64 text into playable media files.

To learn more about synthesis, review the quickstarts or the Creating Voice Audio Files page.

Text-to-Speech creates raw audio data of natural, human speech. That is, it creates audio that sounds like a person talking. When you send a synthesis request to Text-to-Speech, you must specify a voice that 'speaks' the words.

Text-to-Speech has a wide selection of custom voices available for you to use. The voices differ by language, gender, and accent (for some languages). For example, you can create audio that mimics the sound of a female English speaker with a British accent like example 1, above. You can also convert the same text into a different voice, say a male English speaker with an Australian accent.

Example 2. Audio file generated with en-AU speaker

To see the complete list of the available voices, see Supported Voices .

WaveNet voices

Along with other, traditional synthetic voices, Text-to-Speech also provides premium, WaveNet-generated voices. Users find the WaveNet-generated voices warmer and more human-like than other synthetic voices.

The key difference with a WaveNet voice is the WaveNet model used to generate it. WaveNet models have been trained using raw audio samples of actual humans speaking. As a result, these models generate synthetic speech with more human-like emphasis and inflection on syllables, phonemes, and words.

Compare the following two samples of synthetic speech.

Example 3. Audio file generated with a standard voice

Example 4. Audio file generated with a WaveNet voice

To learn more about the benefits of WaveNet-generated voices, see Types of voices .

Other audio output settings

Besides the voice, you can also configure other aspects of the audio data output created by speech synthesis. Text-to-Speech supports configuring the speaking rate, pitch, volume, and sample rate hertz.

Review the AudioConfig reference for more information.

Speech Synthesis Markup Language (SSML) support

You can enhance the synthetic speech produced by Text-to-Speech by marking up the text using Speech Synthesis Markup Language (SSML) . SSML enables you to insert pauses, acronym pronunciations, or other additional details into the audio data created by Text-to-Speech. Text-to-Speech supports a subset of the available SSML elements .

For example, you can ensure that the synthetic speech correctly pronounces ordinal numbers by providing Text-to-Speech with SSML input that marks ordinal numbers as such.

Example 5. Audio file generated from plain text input

Example 6. Audio file generated from SSML input

To learn more about how to synthesize speech from SSML, see Creating Voice Audio Files.

Try it for yourself

If you're new to Google Cloud, create an account to evaluate how Text-to-Speech performs in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.


It Speaks! Create Synthetic Speech Using Text-to-Speech

Checkpoints

Enable the Text-to-Speech API

Create a service account

  • Setup and requirements
  • Task 1. Enable the Text-to-Speech API
  • Task 2. Create a virtual environment
  • Task 3. Create a service account
  • Task 4. Get a list of available voices
  • Task 5. Create synthetic speech from text
  • Task 6. Create synthetic speech from SSML
  • Task 7. Configure audio output and device profiles
  • Congratulations!


The Text-to-Speech API lets you create audio files of machine-generated, or synthetic , human speech. You provide the content as text or Speech Synthesis Markup Language (SSML) , specify a voice (a unique 'speaker' of a language with a distinctive tone and accent), and configure the output; the Text-to-Speech API returns to you the content that you sent as spoken word, audio data, delivered by the voice that you specified.

In this lab you will create a series of audio files using the Text-to-Speech API, then listen to them to compare the differences.

What you'll learn

In this lab you use the Text-to-Speech API to do the following:

  • Create a series of audio files
  • Listen and compare audio files
  • Configure audio output

Before you click the Start Lab button

Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab , shows how long Google Cloud resources will be made available to you.

This hands-on lab lets you do the lab activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.

To complete this lab, you need:

  • Access to a standard internet browser (Chrome browser recommended).
  • Time to complete the lab. Remember: once you start, you cannot pause a lab.

How to start your lab and sign in to the Google Cloud console

Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method. On the left is the Lab Details panel with the following:

  • The Open Google Cloud console button
  • Time remaining
  • The temporary credentials that you must use for this lab
  • Other information, if needed, to step through this lab

Click Open Google Cloud console (or right-click and select Open Link in Incognito Window if you are running the Chrome browser).

The lab spins up resources, and then opens another tab that shows the Sign in page.

Tip: Arrange the tabs in separate windows, side-by-side.

If necessary, copy the Username below and paste it into the Sign in dialog.

You can also find the Username in the Lab Details panel.

Click Next .

Copy the Password below and paste it into the Welcome dialog.

You can also find the Password in the Lab Details panel.

Click through the subsequent pages:

  • Accept the terms and conditions.
  • Do not add recovery options or two-factor authentication (because this is a temporary account).
  • Do not sign up for free trials.

After a few moments, the Google Cloud console opens in this tab.


Activate Cloud Shell

Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5GB home directory and runs on the Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources.

Click the Activate Cloud Shell icon at the top of the Google Cloud console.

When you are connected, you are already authenticated, and the project is set to your Project ID. The output contains a line that declares the Project ID for this session.

gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.

  • (Optional) You can list the active account name with the first command shown below.
  • Click Authorize.
  • (Optional) You can list the project ID with the second command shown below.
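These are presumably the standard gcloud commands:

```
gcloud auth list
gcloud config list project
```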

Set the region for your project

In Cloud Shell, enter the following command to set the region to run your project in this lab:
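The command presumably takes this form, with the lab supplying the actual region value:

```
gcloud config set compute/region "REGION"
```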

From the Navigation menu, select APIs & Services.

On the top of the Dashboard, click +Enable APIs and Services .

Enter "text-to-speech" in the search box.

Click Cloud Text-to-Speech API .

Click Enable to enable the Cloud Text-to-Speech API.

Wait for a few seconds for the API to be enabled for the project. Once enabled, the Cloud Text-to-Speech API page shows details, metrics and more.

Click Check my progress to verify the objective. Enable the Text-to-Speech API

Python virtual environments are used to isolate package installation from the system.

  • Install the virtualenv environment (first command below).
  • Build the virtual environment (second command below).
  • Activate the virtual environment (third command below).
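A typical sequence for these three steps in Cloud Shell is:

```
sudo apt-get install -y virtualenv
python3 -m venv venv
source venv/bin/activate
```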

You should use a service account to authenticate your calls to the Text-to-Speech API.

  • To create a service account, run the first command shown below in Cloud Shell.
  • Now generate a key to use that service account (second command below).
  • Finally, set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the location of your key file (third command below).
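The commands presumably follow the usual pattern (the service account name tts-qwiklab is an assumption):

```
gcloud iam service-accounts create tts-qwiklab

gcloud iam service-accounts keys create tts-qwiklab.json \
  --iam-account tts-qwiklab@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com

export GOOGLE_APPLICATION_CREDENTIALS=tts-qwiklab.json
```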

Click Check my progress to verify the objective. Create a service account

As mentioned previously, the Text-to-Speech API provides many different voices and languages that you can use to create audio files. You can use any of the available voices as the speaker for your content.

  • The following curl command gets the list of all the voices you can select from when creating synthetic speech using the Text-to-Speech API:
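The command presumably resembles:

```
curl -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  "https://texttospeech.googleapis.com/v1/voices"
```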

The Text-to-Speech API returns a JSON-formatted result that looks similar to the following:
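A trimmed, illustrative result looks like this (real output lists many voices):

```json
{
  "voices": [
    {
      "languageCodes": ["en-US"],
      "name": "en-US-Standard-C",
      "ssmlGender": "FEMALE",
      "naturalSampleRateHertz": 24000
    }
  ]
}
```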

Looking at the results from the curl command, notice that each voice has four fields:

  • name : The ID of the voice that you provide when you request that voice.
  • ssmlGender : The gender of the voice to speak the text, as defined in the SSML W3 Recommendation .
  • naturalSampleRateHertz : The sampling rate of the voice.
  • languageCodes : The list of language codes associated with that voice.

Also notice that some languages have several voices to choose from.

  • To scope the results returned from the API to just a single language code, run:
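Filtering is done with the languageCode query parameter, presumably:

```
curl -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  "https://texttospeech.googleapis.com/v1/voices?languageCode=en"
```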

Now that you've seen how to get the names of voices to speak your text, it's time to create some synthetic speech!

For this, you build your request to the Text-to-Speech API in a text file titled synthesize-text.json .

  • Create this file in Cloud Shell by running the following command:
  • Using a line editor (for example nano , vim , or emacs ) or the Cloud Shell code editor, add the following code to synthesize-text.json :
  • Save the file and exit the line editor.
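The creation command was presumably touch synthesize-text.json; a minimal body consistent with the description below is (the text is illustrative):

```json
{
  "input": {
    "text": "Cloud Text-to-Speech API allows developers to include natural-sounding, synthetic human speech as playable audio in their applications."
  },
  "voice": {
    "languageCode": "en-US",
    "name": "en-US-Standard-C",
    "ssmlGender": "FEMALE"
  },
  "audioConfig": {
    "audioEncoding": "MP3"
  }
}
```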

The JSON-formatted request body provides three objects:

  • The input object provides the text to translate into synthetic speech.
  • The voice object specifies the voice to use for the synthetic speech.
  • The audioConfig object tells the Text-to-Speech API what kind of audio encoding to send back.
  • Use the following code to call the Text-to-Speech API using the curl command:
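The curl invocation presumably follows this pattern:

```
curl -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @synthesize-text.json \
  "https://texttospeech.googleapis.com/v1/text:synthesize" \
  > synthesize-text.txt
```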

The output of this call is saved to a file called synthesize-text.txt .

  • Open the synthesize-text.txt file. Notice that the Text-to-Speech API provides the audio output in base64-encoded text assigned to the audioContent field, similar to what's shown below:

To translate the response into audio, you need to select the audio data it contains and decode it into an audio file - for this lab, MP3. Although there are many ways that you can do this, in this lab you'll use some simple Python code. Don't worry if you're not a Python expert; you need only create the file and invoke it from the command line.

  • Create a file named tts_decode.py :
  • Using a line editor (for example nano , vim , or emacs ) or the Cloud Shell code editor, add the following code into tts_decode.py :
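A minimal sketch that matches how the script is invoked later in the lab (the --input and --output flags are assumptions based on that usage):

```python
import argparse
import json
from base64 import decodebytes

def decode_tts_output(input_file, output_file):
    """Decode the base64 audioContent field of a Text-to-Speech
    API response into a playable audio file."""
    with open(input_file) as f:
        response = json.load(f)
    audio_data = response["audioContent"]
    with open(output_file, "wb") as out:
        out.write(decodebytes(audio_data.encode("utf-8")))

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Decode TTS API output")
    parser.add_argument("--input", required=True, help="JSON response file")
    parser.add_argument("--output", required=True, help="audio file to write")
    args = parser.parse_args()
    decode_tts_output(args.input, args.output)
```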

Save tts_decode.py and exit the line editor.

Now, to create an audio file from the response you received from the Text-to-Speech API, run the following command from Cloud Shell:
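Given the script sketched above, the command is presumably:

```
python tts_decode.py --input "synthesize-text.txt" --output "synthesize-text-audio.mp3"
```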

This creates a new MP3 file named synthesize-text-audio.mp3 .

Of course, since synthesize-text-audio.mp3 lives in the cloud, you can't just play it directly from Cloud Shell! To listen to the file, you create a web server hosting a simple web page that embeds the file as playable audio (in an HTML <audio> control).

  • Create a new file called index.html :
  • Using a line editor (for example nano , vim , or emacs ) or the Cloud Shell code editor, add the following code into index.html :
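A minimal page consistent with the demo described below would be:

```html
<html>
  <body>
    <h1>Cloud Text-to-Speech Demo</h1>
    <audio controls>
      <source src="synthesize-text-audio.mp3" />
    </audio>
  </body>
</html>
```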

Back in Cloud Shell, start a simple Python HTTP server from the command prompt:
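The standard way to serve the current directory on port 8080 is:

```
python -m http.server 8080
```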

Click the Web preview icon in Cloud Shell, then select Preview on port 8080 from the displayed menu.

In the new browser window, you should see something like the following:

The Cloud Text-to-Speech Demo audio of the output from synthesizing text

Play the audio embedded on the page. You'll hear the synthetic voice speak the text that you provided to it!

When you're done listening to the audio files, you can shut down the HTTP server by pressing CTRL + C in Cloud Shell.

In addition to using text, you can also provide input to the Text-to-Speech API in the form of Speech Synthesis Markup Language (SSML) . SSML defines an XML format for representing synthetic speech. Using SSML input, you can more precisely control pauses, emphasis, pronunciation, pitch, speed, and other qualities in the synthetic speech output.

  • First, build your request to the Text-to-Speech API in a text file titled synthesize-ssml.json . Create this file in Cloud Shell by running the following command:
  • Using a line editor (for example nano , vim , or emacs ) or the Cloud Shell code editor, paste the following JSON into synthesize-ssml.json :
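A small example that exercises the SSML elements described below would be:

```json
{
  "input": {
    "ssml": "<speak><s>I was born on <say-as interpret-as=\"date\" format=\"yyyymmdd\">19960610</say-as>.</s><break time=\"1s\"/><s><emphasis level=\"moderate\">Hello</emphasis>, and welcome to <sub alias=\"the World Wide Web\">WWW</sub>.</s><s><prosody rate=\"slow\" pitch=\"-2st\">This sentence is spoken slowly and lower.</prosody></s></speak>"
  },
  "voice": {
    "languageCode": "en-US"
  },
  "audioConfig": {
    "audioEncoding": "MP3"
  }
}
```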

Notice that the input object of the JSON payload looks a bit different this time around. Rather than a text field, the input object has an ssml field instead. The ssml field contains XML-formatted content with the <speak> element as its root. Each of the elements present in this XML representation of the input affects the output of the synthetic speech.

Specifically, the elements in this sample have the following effects:

  • <s> contains a sentence.
  • <emphasis> adds stress on the enclosed word or phrase.
  • <break> inserts a pause in the speech.
  • <prosody> customizes the pitch, speaking rate, or volume of the enclosed text, as specified by the rate , pitch , or volume attributes.
  • <say-as> provides more guidance about how to interpret and then say the enclosed text, for example, whether to speak a sequence of numbers as ordinal or cardinal.
  • <sub> specifies a substitution value to speak for the enclosed text.
  • In Cloud Shell use the following code to call the Text-to-Speech API, which saves the output to a file called synthesize-ssml.txt :
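The call presumably mirrors the earlier synthesis request, pointing at the SSML file and a new output name:

```
curl -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @synthesize-ssml.json \
  "https://texttospeech.googleapis.com/v1/text:synthesize" \
  > synthesize-ssml.txt
```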

Again, you need to decode the output from the Text-to-Speech API before you can hear the audio.

  • Run the first command shown below to generate an audio file named synthesize-ssml-audio.mp3 using the tts_decode.py utility that you created previously.
  • Next, open the index.html file that you created earlier and replace its contents with HTML that embeds both audio files.
  • Then, start a simple Python HTTP server from the Cloud Shell command prompt, as shown below.
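The commands presumably mirror the earlier steps, with the replacement index.html adding a second audio element for synthesize-ssml-audio.mp3:

```
python tts_decode.py --input "synthesize-ssml.txt" --output "synthesize-ssml-audio.mp3"
python -m http.server 8080
```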

Click the Web preview icon and select Preview on port 8080 again.

  • Play the two embedded audio files. Notice the differences in the SSML output: although both audio files say the same words, the SSML output speaks them a bit differently, adding pauses and different pronunciations for abbreviations.

Going beyond SSML, you can provide even more customization to your synthetic speech output created by the Text-to-Speech API. You can specify other audio encodings, change the pitch of the audio output, and even request that the output be optimized for a specific type of hardware.

Build your request to the Text-to-Speech API in a text file titled synthesize-with-settings.json :

  • Using a line editor (for example nano , vim , or emacs ) or the Cloud Shell code editor, paste the following JSON into synthesize-with-settings.json :
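The creation command was presumably touch synthesize-with-settings.json; a body consistent with the field walkthrough below and with the "faster and lower" result noted at the end would be (exact values are assumptions):

```json
{
  "input": {
    "text": "The Text-to-Speech API lets you configure speed, pitch, encoding, and a device profile."
  },
  "voice": {
    "languageCode": "en-US",
    "name": "en-US-Wavenet-B"
  },
  "audioConfig": {
    "audioEncoding": "MP3",
    "speakingRate": 1.25,
    "pitch": -2.0,
    "effectsProfileId": ["headphone-class-device"]
  }
}
```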

Looking at this JSON payload, you notice that the audioConfig object contains some additional fields now:

  • The speakingRate field specifies the speed at which the voice speaks. A value of 1.0 is the normal speed for the voice, 0.5 is half as fast, and 2.0 is twice as fast.
  • The pitch field specifies a difference in tone to speak the words. The value specifies a number of semitones lower (negative) or higher (positive) to speak the words.
  • The audioEncoding field specifies the audio encoding to use for the data. The accepted values for this field are LINEAR16, MP3, and OGG_OPUS.
  • The effectsProfileId field requests that the Text-to-Speech API optimize the audio output for a specific playback device. The API applies a predefined audio profile to the output that enhances the audio quality on the specified class of devices.
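The call referenced in the next line presumably looks like:

```
curl -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @synthesize-with-settings.json \
  "https://texttospeech.googleapis.com/v1/text:synthesize" \
  > synthesize-with-settings.txt
```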

The output of this call is saved to a file called synthesize-with-settings.txt .

  • Run the first command shown below to generate an audio file named synthesize-with-settings-audio.mp3 from the output received from the Text-to-Speech API.
  • Next, open the index.html file that you created earlier and replace its contents with HTML that embeds all three audio files.
  • Now, restart the Python HTTP server from the Cloud Shell command prompt.
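The commands again mirror the earlier pattern, with the page now embedding all three audio files:

```
python tts_decode.py --input "synthesize-with-settings.txt" --output "synthesize-with-settings-audio.mp3"
python -m http.server 8080
```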

The Cloud Text-to-Speech Demo audio files of the output from synthesizing text, output from synthesizing SSML, and output with audio settings

  • Play the third embedded audio file. Notice that the voice on the audio speaks a bit faster and lower than the previous examples.

You have learned how to create synthetic speech using the Cloud Text-to-Speech API. You learned about:

  • Listing all of the synthetic voices available through the Text-to-Speech API
  • Creating a Text-to-Speech API request and calling the API with curl, providing both text and SSML
  • Configuring the setting for audio output, including specifying a device profile for audio playback

Finish your quest

This self-paced lab is part of the Language, Speech, Text & Translation with Google Cloud APIs quest. A quest is a series of related labs that form a learning path. Completing this quest earns you a badge to recognize your achievement. You can make your badge or badges public and link to them in your online resume or social media account. Enroll in this quest and get immediate completion credit. Refer to the Google Cloud Skills Boost catalog for all available quests.

Take your next lab

Continue your quest with Translate Text with the Cloud Translation API or try one of these:

  • Measuring and Improving Speech Accuracy
  • Entity and Sentiment Analysis with the Natural Language API

Next steps / Learn more

  • Check out the detailed documentation for the Text-to-Speech API on cloud.google.com.
  • Learn how to create synthetic speech using the client libraries for the Text-to-Speech API .

Google Cloud training and certification

...helps you make the most of Google Cloud technologies. Our classes include technical skills and best practices to help you get up to speed quickly and continue your learning journey. We offer fundamental to advanced level training, with on-demand, live, and virtual options to suit your busy schedule. Certifications help you validate and prove your skill and expertise in Google Cloud technologies.

Manual Last Updated August 25, 2023

Lab Last Tested August 25, 2023

Copyright 2024 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.


Google Text to Speech API: A Beginners’ Guide

Most comprehensive intro to Google Text to Speech API. Written for beginners.

By Hammad Syed in API


As developers, you’ve probably explored a lot of text to speech APIs, including Google Text to Speech API. Are you looking for an overview and review of Google TTS? If so, you’ve come to the right place.

Today, we’re going to explore everything you need to know about Google’s Text to Speech API. I’m also going to reveal what I believe is the best text to speech API platform. Is it Google? Stay tuned to find out.

What is Google Text to Speech API?

So, first things first, let’s explore the nitty gritty of Google Text to Speech API. The Google Cloud Text to Speech API is a part of the comprehensive suite of cloud services offered by the Google Cloud platform. It allows developers to easily integrate speech synthesis capabilities into their applications, enabling them to convert text input into high-quality AI voices.

This technology finds its application across a wide range of domains, from enhancing accessibility for visually impaired users to providing voice responses in virtual assistants.

How Google Text to Speech API works

If you’re anything like me, you’re probably wondering something along the lines of “How does Google TTS API synthesize voices?”

At its core, Google Text to Speech API works by taking input text, processing it using machine learning and neural network models, which have been trained on large datasets to know how to replicate language, and then transforming it into lifelike speech in the form of audio files which can be integrated into websites, apps, and more.

Developers can then specify parameters such as language code, audio encoding, and voice selection to customize the output. For example, they can change the language, voice, speaking rate, volume, and more according to their needs.

How to use Google Text to Speech API

Ready to make your computer talk? Using Google Text to Speech API is a breeze.

To use Google Text to Speech API, developers need to have a Google Cloud service account. After enabling the Text to Speech API through the Google Cloud Console, they can authenticate their application and start making API requests. Google provides tutorials, docs, SDKs, QuickStart guides, and client libraries, such as TextToSpeechClient, via GitHub in several programming languages, including Python and Node.js, making it easier to integrate the API into existing projects.

Developers can also interact with it via gcloud’s command line.

To convert text to speech, developers need to send a request to the API endpoint texttospeech.googleapis.com with the desired text and configuration parameters. The API responds with an audio file containing the synthesized speech, which can then be used in applications or saved for later use.

The API supports various audio formats, including MP3 and LINEAR16, allowing for flexibility in application development.

Understanding key TTS concepts

To effectively utilize the Google Text to Speech API, it’s essential to grasp some key concepts:

  • AudioConfig: This parameter allows developers to specify various audio settings such as audio encoding, sample rate, and speaking rate.
  • SynthesisInput: It represents the text input that needs to be converted into speech.
  • VoiceSelectionParams: Developers can use this parameter to select the desired voice for the synthesized speech based on language and gender preferences.
  • SSMLVoiceGender: This parameter enables fine-grained control over the gender of the selected voice when using Speech Synthesis Markup Language (SSML).
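A short sketch showing how these pieces fit together in the official Python client (the class and field names are the client library's; the values are illustrative):

```python
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Tuning the output."),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,  # SSMLVoiceGender
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
        speaking_rate=1.2,  # 20% faster than the default
        pitch=-2.0,         # two semitones lower
    ),
)
```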

Google Text to Speech API pricing

In the course of my research, I also dug into Google Text to Speech API’s pricing. Its pricing model is based on the number of characters used. To use Google TTS, you must enable billing, and you will be automatically charged if your usage exceeds the free character limit. Spaces are also counted as characters. All Speech Synthesis Markup Language (SSML) tags except mark are also included in the character count.

So, what’s the free character limit? Well, it depends on the type of voice you’d like – the higher the quality, the higher the price.

For example, you get up to 1 million bytes of premium voices per month. After that threshold, you’ll pay $0.000016 per byte ($16 per 1 million bytes). For studio voices, you get up to 100 thousand bytes and pay $0.00016 per byte ($160 per 1 million bytes) after you hit the free limit. And lastly, you get 1 million characters free when it comes to standard voices, after which you’ll pay $0.000004 per character ($4 per 1 million characters).

As you can see, with Google Cloud’s pay-as-you-go pricing, you only pay for the amount of audio content you create.

Google Text to Speech API features

I did find that for that flexible pricing, Google Text to Speech API does offer a plethora of features. Let me walk you through some of its key offerings:

Large voice and language selection

Google TTS API offers a selection of over 380 voices across 50+ languages and variants, including 90 WaveNet voices. From Mandarin, Hindi, and Spanish to English, Russian, and many more, the options are diverse and cater to various linguistic needs. Plus, with high-fidelity voices available, the audio quality is top-notch.

Custom voices

With Google Text to Speech API’s voice cloning feature, users can create custom voices that resonate uniquely with users, ensuring a personalized and engaging experience. This helps craft synthetic voices that match the tone and style of your brand or application.

Long audio synthesis

Google Text to Speech API also supports long audio synthesis with support of up to 1 million bytes in a single session. This allows users to confidently tackle larger projects without worrying about compatibility issues, whether they’re working on extensive narrations or complex dialogue sequences.

SSML support

Developers, like myself, can take advantage of the API’s SSML support for fine-grained control over speech synthesis, including pauses, pronunciation, pitch, speaking rate, and volume. For example, Google TTS allows users to personalize the pitch of a voice by up to 20 semitones up or down, adjust the speaking rate to be up to 4x faster or slower than the normal rate, and increase the volume by up to 16 dB or decrease it by up to 96 dB.

Integration

Whether you’re developing web applications using Chrome or building native applications, integration is seamless thanks to Google Text to Speech API’s support for both REST and gRPC APIs, making it easy to integrate with various applications and devices, from phones and PCs to IoT devices like cars and speakers.

Format flexibility

Audio format flexibility is another highlight, with the ability to convert text to various formats including MP3, Linear16, and OGG Opus. This versatility ensures that synthesized speech can be seamlessly integrated into various applications and platforms.

Google Text to Speech API use cases

Now that we covered how to use Google Text to Speech API as well as its features, I want to touch on why someone would want to use it in the first place. Let’s delve into just some of the ways I use TTS APIs:

  • Accessibility solutions: I can use TTS APIs to create accessibility solutions for individuals with visual impairments, dyslexia, or other reading difficulties. Incorporating TTS helps people access information from digital platforms, including websites, applications, and ebooks.
  • Language learning platforms: I can integrate the API into language learning platforms to enhance learning experiences. When language apps offer audio support, learners can learn proper pronunciation faster and improve their listening and speaking skills.
  • Interactive voice response (IVR) systems: I’ve also used TTS APIs to deliver automated voice chat responses to customer queries and requests. This streamlines customer interactions, reduces wait times, and enhances overall service efficiency, benefiting both my business and my customers.
  • E-learning and educational resources: I can utilize a TTS API to create audio versions of educational materials such as lectures, textbooks, and study guides to help facilitate auditory learning for my students and accommodate diverse learning preferences.
  • Voice-enabled applications and devices: In my development projects, I integrate TTS APIs into voice-enabled applications and devices, such as virtual assistants, smart speakers, and IoT devices.
  • Content creation: I use TTS APIs to generate synthetic voices for multimedia projects, including podcasts, videos, and audiobooks. This saves me a ton of time when it comes to creating voice overs as well as money because I don’t have to hire voice actors.

Google Text to Speech API pros and cons

Since I’m always in pursuit of the best text to speech API features, I tried Google Text to Speech API so you don’t have to. Here are Google Text to Speech API’s top pros and cons based on my user experience:

Google Text to Speech API pros

Some areas where Google Text to Speech API shines, include:

  • Natural-sounding speech: I’ve tried a lot of TTS APIs and I do have to admit Google Text to Speech API generates speech that sounds remarkably human across a variety of languages.
  • Reliability and scalability: Being backed by Google Cloud Platform means I can rely on the infrastructure’s robustness, scalability, security measures, and automatic updates. This is crucial, especially for applications requiring consistent performance under varying loads.
  • Extensive language support: With support for a wide range of languages, the API allows me to create applications for global audiences and diverse user bases.
  • Flexible pricing: The pricing model is based on usage so I can pay for what I use, making it suitable for both small-scale projects and large-scale applications.
  • Low latency: With a latency of around 200ms (time to first audio byte), the API offers swift response times, enhancing user experience by minimizing delays.

Google Text to Speech API cons

Limitations and drawbacks of Google Text to Speech API include:

  • Dependency on internet connectivity: One significant limitation is the need for an internet connection to access the API. This could be problematic in scenarios where internet access is limited or unreliable.
  • Limited language support: While the API supports many languages, including English (en-US), it does not cover all languages or accents. This could be a drawback if I were trying to create applications for certain communities.
  • Complex integration: Integrating the API into applications requires a certain level of familiarity with cloud services and APIs. While this wasn’t difficult for me, this could pose a challenge for developers who are new to APIs.
  • Streaming limitations: Compared to other TTS APIs I’ve used, Google Text to Speech API is not the best choice for real-time streaming applications due to limitations in streaming capabilities.

PlayHT API – The #1 Google Text to Speech API alternative

PlayHT stands out as the premier text to speech API for seamlessly integrating real-time AI-generated voices into applications and projects. Boasting one of the fastest latencies available, PlayHT is the ideal choice for those prioritizing instant speech synthesis.

Whether you require an on-premise setup or prefer a cloud-based solution, PlayHT has you covered. PlayHT also offers a vast selection of over 800 unique voices, with an additional 20,000 text to speech voices options available through the community voice library and options to create instant or high-fidelity voice clones.

Take advantage of PlayHT’s API today and equip your applications with AI-generated speech that rivals the natural cadence and tone of human voices.

Frequently Asked Questions

How does Google TTS use JSON?

The Google Text to Speech API utilizes JSON for structuring requests and responses exchanged between client applications and the API.

Is Google Text to Speech API free?

Google TTS API pricing is based on usage. While it does offer a certain character limit for free each month, it’s not free once the limit is reached. For more information, see the pricing section above.

How good is Google Speech to Text API?

Google Speech to Text’s transcription is very accurate.

Can the Google Cloud Text to Speech API handle multiple languages?

Yes, the Google Cloud Text to Speech API supports 50+ languages and variants.


Hammad Syed

Hammad Syed holds a Bachelor of Engineering - BE, Electrical, Electronics and Communications and is one of the leading voices in the AI voice revolution. He is the co-founder and CEO of PlayHT, now known as PlayAI.


Introduction

Welcome to the iSpeech Inc. Application Programming Interface (API) Developer Guide. This guide describes the available variables, commands, and interfaces that make up the iSpeech API.

The iSpeech API allows developers to implement Text-To-Speech (TTS) and Automated Voice Recognition (ASR) in any Internet-enabled application.

The APIs are platform agnostic, which means any device that can record or play audio and is connected to the Internet can use the iSpeech API.

Minimum Requirements

Below are the minimum requirements needed to use the iSpeech API. The API can be used with and without a software development kit (SDK).

Internet Connection

iSpeech services require an Internet connection.

HTTP Protocol

The iSpeech API follows the HTTP standard by using GET and POST. Some web browsers limit the length of GET requests to a few thousand characters.

Request/Responses

Requests can be in URL-encoded, JSON, or XML data formats. You can specify the output data format of responses. For TTS, binary data is usually returned if the request is successful. For speech recognition, URL-encoded text, JSON, or XML can be returned by setting the output variable.

An API key is a password that is required for access. To obtain an API key, please visit http://www.ispeech.org/developers and register for a developer account.

API Features

You can retrieve the properties of your API keys. Key information includes a voice list, amount of credits, locales, and many other parameters.

Text to Speech

You can synthesize spoken audio through iSpeech TTS in a variety of voices, formats, bitrates, frequencies, and playback speeds. Math Markup Language (MathML) and Speech Synthesis Markup Language (SSML) are also supported.

Automated Speech Recognition

You can convert spoken audio to text using a variety of languages and recognition models. We can create custom recognition models to improve recognition quality.

Position Markers

You can get the position in time when words are spoken in TTS audio.

You can also get the timing of mouth positions (visemes) when words are spoken in TTS audio.

Developer Support

Automated purchasing system: https://www.ispeech.org/developer/purchase/ iSpeech sales can be contacted at the following phone number: +1-917-338-7723 from 10 AM to 6 PM Eastern Time, Monday to Friday. You can also email [email protected].

Support / Troubleshooting

Please contact our support team at [email protected] .

Software Development Kits

iSpeech SDKs simplify the iSpeech API. You should use iSpeech SDKs if the option is available. Only mobile SDKs made by iSpeech allow you to use the iSpeech API for free.

Availability

iPhone, Android, BlackBerry, .NET, Java (Server), PHP, Flash, Javascript/Flash, Ruby, Python, Perl

API Access Pricing

Authentication

API Key Information Retrieval
HTTP Response

View/Edit Keys

Manage your API keys by using the iSpeech developer website . You can request additional features for your API keys on that website.

Request Parameter Reference

Transaction types

The iSpeech API supports URL Encoded, XML, and JSON formats.

Supported transaction types

Request parameters

Example HTTP GET Request (Using most variables)

The iSpeech Text-To-Speech API allows you to synthesize high-quality spoken audio in multiple formats. The iSpeech API doesn’t use callbacks because it’s fast and synchronous. You’ll always receive audio data or an error message in the same HTTP transaction.

Voices - Standard

HTTP GET Request (Setting voice to European French Female)
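Based on the parameters documented in this guide, the request presumably looks like the following (the voice alias is an assumption):

```
http://api.ispeech.org/api/rest?apikey=YOURAPIKEY&action=convert&text=Bonjour%20tout%20le%20monde&voice=eurfrenchfemale
```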

Voices - Custom

Custom voices may be enabled for your account. They can be found in the developer portal -> API key properties -> custom voices. You can use them by setting the variable voice to the custom alias.

Voices - List Retrieval

HTTP GET Network Transaction to get XML voice list.
XML Response
JSON Response
REST / URL Encoded Response

A current list of voices that are enabled for an API key can be retrieved in REST, JSON, and XML format by using the following service. HTTP GET and POST are supported. A web browser or a REST client can be used to make these HTTP requests.

HTTP GET Request (Setting speed to 5)

Most voices support speed controls.

HTTP GET Request (Setting bitrate to 16 kilobits per second)

Note: Bitrates can only be selected for MP3s.

Valid values are 16, 24, 32, 48 (default), 56, 64, 80, 96, 112, 128, 144, 160, 192, 224, 256, or 320. Bitrates are listed in kilobits per second.

Example HTTP GET Request (Setting format to wav)

Frequencies

Example HTTP GET Request (Setting frequency to 16000 Hz)

Possible values: 8000, 11025, 16000 (default), 22050, 24000, 32000, 44100, 48000 cycles per second (Hertz)

Padding adds silence to a section of the audio file.

Start Padding

Example HTTP GET Request (Setting start padding to 3 seconds)

Adds a period of silence to the beginning of the audio file.

End Padding

Example HTTP GET Request (Setting end padding to 3 seconds)
Example HTTP GET Request (Setting pitch to 50)

Possible values: 0 to 200 (integer), 0 is lowest pitch, 100 is default, 200 is highest pitch. Pitch is enabled only on some voices.

Example HTTP GET Request (Setting bit depth to 8)

The bit depth is the amount of audio detail for each audio sample.

Possible values are 8 and 16 (default) bits/sample on AIFF, FLAC, and WAVE file formats.

Example HTTP GET Request (Setting filename of audio)

The filename is the name of the audio file that will download. Specifying the extension is optional. If the extension is missing, the correct extension will be added automatically. The default is rest.[extension], for example: rest.mp3.

Speech Synthesis Markup Language (SSML)

Example HTTP GET Request (Emphasis added on the word big)
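Per the description below, the request would set action=ssml and pass a URL-encoded SSML document that, decoded, reads something like:

```
<speak version="1.0" xml:lang="en-US">
  This is a <emphasis>big</emphasis> deal.
</speak>
```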

SSML tags are used to customize the way a text-to-speech engine creates audio. The tags can be used to add pauses, change emphasis, and change pronunciation. This option is disabled by default but can be requested by emailing [email protected]. The parameter “action” must be set to “ssml” and the parameter “ssml” must be set to a complete SSML XML statement.

The parameter “text” is not used and the parameters voice and speed should be represented using the “voice” and “prosody” SSML tags instead of request parameters.

More information on SSML can be found at: http://www.w3.org/TR/speech-synthesis/.

Math Markup Language (MathML)

MathML tags are used to display and represent mathematical statements. This option is disabled by default but can be requested by emailing [email protected].

Remember to set “library” to “libmath” so that the MathML processor loads your text as MathML.

More information on MathML can be found at: https://developer.mozilla.org/en-US/docs/MathML

The following table lists MathML tags supported by the iSpeech API.

Example Transactions

HTTP POST URL encoded request for Text to Speech
HTTP POST JSON request for Text to Speech
HTTP POST XML request for Text to Speech
Example of a text-to-speech network transaction with an error. Responses with an error message return HTTP status response code “HTTP/1.0 202 Accepted”.

The following examples are packet captures from TCP connections that used the HTTP protocol. You can compare your network traffic to these transactions to debug code. Wireshark can be used to analyze network connections. A REST client can be used to make these HTTP requests.

Example network transactions containing MathML

HTTP GET, URL Encoded Request and Reply, +7 (says positive 7)
HTTP POST JSON Text-to-Speech request containing MathML
HTTP POST XML Text-to-Speech request containing MathML (The text: “+7” gets spoken as “positive seven”)

More information on MathML is available on http://www.w3.org/TR/MathML2/ and https://developer.mozilla.org/en-US/docs/MathML

HTTP GET URL encoded Text-to-Speech request containing MathML

Standard locales, custom locales.

Contact [email protected] for details.

Speech Recognition Models

Statistical speech recognition models are used to increase the probability of a correct result. Models with fewer word choices are faster and more accurate than the freeform models. For example, in the food model, the words “7 up” would be recognized as “7up”. Similarly, a food model would recognize the audio for “ice cream” as “ice cream” instead of “I scream”.

Standard Freeform Models

Standard non-freeform models, custom models, speex modes.

The speexmode variable tells the server which format your Speex data is encoded in for improved speech recognition quality. It is highly recommended you include this parameter when using Speex encoding.

Example Transactions for Freeform Speech

Format of examples.

HTTP REST Request for Speech Recognition
HTTP JSON Request for Speech Recognition
HTTP XML network request for Speech Recognition

The following examples are packet captures from TCP connections that used the HTTP protocol. You can compare your network traffic to these transactions to debug code. Wireshark can be used to analyze network connections.

Command Lists

Command lists are used to limit the possible values returned during speech recognition. For example, if the command list contains only “yes” and “no”, the result will be either “yes” or “no”.

Example Transactions for Command Lists

Formatting of examples.

HTTP XML network request to detect commands from a list
HTTP REST network request to detect commands from a list
HTTP POST JSON request to detect commands from a list
Advanced Example, HTTP POST XML request to detect multiple audio commands from multiple lists

The following examples are packet captures of TCP connections that use the HTTP protocol. You can compare your network traffic with these transactions to debug code. Wireshark can be used to analyze network connections. A REST client can be used to make these HTTP requests.

Position markers provide information regarding word boundaries to allow applications to visually display the current location in spoken audio. It is similar to how a karaoke system would display lyrics.

This is accomplished by first retrieving audio from the iSpeech API (see section 2 for more details), then making a second request for an XML document which contains word boundary information.

Example Transactions for Position Markers

HTTP GET network transaction to retrieve position markers

Note: Marker data is currently only presented in XML form.

To obtain marker information from the iSpeech API, you query the server in the same manner as a normal text-to-speech request. The only difference between a TTS request and a marker request is the “action” parameter, which is set to “convert” for audio, and “markers” for marker information.

Marker Information Usage Technique

Once you have obtained an audio file and the respective marker information XML document, you are ready to highlight text.

There are many methods to processing iSpeech marker information; the following outlines the most basic of those methods. Use the following steps as a baseline implementation.

Media Player Considerations

Your media player must support "location" or "position" queries, or must notify you of its current progress periodically. For example, in Flash, we set a timer to poll for the audio position every 250 milliseconds. Highlighting is more accurate with a shorter interval.

Implementation

If your media player supports retrieval of "current position" or similar, you can follow these basic steps (a code sketch follows the list):

  1. Retrieve the audio.
  2. Retrieve the marker information XML.
  3. Parse the XML into an enumerable container or object.
  4. Load the audio into the media player and start playing.
  5. Create a timer and set its interval to 250 milliseconds.
  6. Inside the newly created timer, query the media player's current position at every interval.
  7. Convert the position to milliseconds (if you have a number such as 1.343, simply multiply by 1,000).
  8. Move to the first (or next) "word" node inside the marker information XML document.
  9. If the current position is greater than or equal to the "word" node's "start" value and less than or equal to its "end" value, highlight the specified "text".
  10. If the current position is greater than the "word" node's "end" value, go back to step 8.

You can follow the above steps until the audio file is exhausted.
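The following Python sketch is a minimal rendering of those steps. A monotonic clock stands in for a real media player's position query, and the marker XML shown is an assumed shape for illustration, not the API's documented schema.

```python
import time
import xml.etree.ElementTree as ET

# Assumed marker layout: word elements with millisecond start/end values.
# Adapt the parsing to the real document returned by the markers request.
markers_xml = """
<markers>
  <word start="0" end="400" text="Hello"/>
  <word start="400" end="900" text="world"/>
</markers>
"""

words = [(int(w.get("start")), int(w.get("end")), w.get("text"))
         for w in ET.fromstring(markers_xml).iter("word")]

clock_start = time.monotonic()

def position_ms() -> int:
    # Stand-in for querying the media player's current position, converted
    # to milliseconds (a value such as 1.343 s becomes 1343 ms).
    return int((time.monotonic() - clock_start) * 1000)

index = 0
while index < len(words):
    pos = position_ms()
    start, end, text = words[index]
    if start <= pos <= end:
        print("highlight:", text)   # stand-in for your UI highlight callback
    elif pos > end:
        index += 1                  # advance to the next word node
    time.sleep(0.25)                # 250 ms poll interval
```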

The same parameters must be sent in the markers request as the original TTS audio request. For example, if you pass a “speed” parameter during audio conversion, you must also send this parameter in your marker information request. If you fail to do so, the marker information will not line up correctly.

File type affects audio length. An MP3 file is always longer than a WAV file due to compression padding. The API will modify the file length accordingly.

Visemes

Visemes provide information about the mouth position and time interval of spoken audio, which allows applications to animate mouth positions in time with the speech.

This is accomplished by first retrieving audio from the iSpeech API (see section 2 for more details), then making a second request for an XML document which contains viseme information.

Example Transaction for Viseme Retrieval

HTTP GET network transaction to retrieve viseme positions

To obtain viseme information from the iSpeech API, you query the server in the same manner as a normal text-to-speech request. The only difference between a TTS request and a viseme request is the "action" parameter, which is set to "convert" for audio and "viseme" for viseme information.

Viseme Chart

Viseme usage technique.

Once you have obtained an audio file and the respective viseme information XML document, you are ready to animate mouth positions to simulate speaking.

There are many methods for processing iSpeech viseme information; the following outlines the most basic of those methods. Use the following steps as a baseline implementation.

Your media player must support "location" or "position" queries, or must notify you of its current progress periodically. For example, in Flash, we set a timer to poll for the audio position every 250 milliseconds. Mouth positioning is more accurate with a shorter interval.

  • Retrieve the viseme information XML, then follow the same timer-based steps described for position markers, substituting viseme nodes for word nodes.

The same parameters must be sent in the viseme request as in the original TTS audio request. For example, if you pass a "speed" parameter during audio conversion, you must also send this parameter in your viseme information request. If you fail to do so, the visemes will not line up correctly.

iSpeech Inc. (“iSpeech”) has made efforts to ensure the accuracy and completeness of the information in this document. However, iSpeech Inc. disclaims all representations, warranties and conditions, whether express or implied, arising by statute, operation of law, usage of trade, course of dealing or otherwise, with respect to the information contained herein. iSpeech Inc. assumes no liability to any party for any loss or damage, whether direct, indirect, incidental, consequential, special or exemplary, with respect to (a) the information; and/or (b) the evaluation, application or use of any product or service described herein.

iSpeech Inc. disclaims any and all representation that its products or services infringe upon any existing or future intellectual property rights. iSpeech Inc. owns and retains all right, title and interest in and to the iSpeech Inc. intellectual property, including without limitation, its patents, marks, copyrights and technology associated with the iSpeech Inc. services. No title or ownership of any of the foregoing is granted or otherwise transferred hereunder. iSpeech Inc. reserves the right to make changes to any information herein without further notice.


The top free Speech-to-Text APIs, AI Models, and Open Source Engines


Choosing the best Speech-to-Text API, AI model, or open-source engine to build with can be challenging. You need to compare accuracy, model design, features, support options, documentation, security, and more.

This post examines the best free Speech-to-Text APIs and AI models on the market today, including ones that have a free tier, to help you make an informed decision. We’ll also look at several free open-source Speech-to-Text engines and explore why you might choose an API or AI model vs. an open-source library, or vice versa.


Free Speech-to-Text APIs and AI Models

APIs and AI models are more accurate, easier to integrate, and come with more out-of-the-box features than open-source options. However, large-scale use of APIs and AI models can cost more than open-source alternatives.

If you’re looking to use an API or AI model for a small project or a trial run, many of today’s Speech-to-Text APIs and AI models have a free tier. This means that the API or model is free for anyone to use up to a certain volume per day, per month, or per year.

Let’s compare three of the most popular Speech-to-Text APIs and AI models with a free tier: AssemblyAI, Google, and AWS Transcribe.

AssemblyAI is an API platform that offers AI models that accurately transcribe and understand speech, and enable users to extract insights from voice data. AssemblyAI offers cutting-edge AI models such as Speaker Diarization, Topic Detection, Entity Detection, Automated Punctuation and Casing, Content Moderation, Sentiment Analysis, Text Summarization, and more. These AI models help users get more out of voice data, with continuous improvements being made to accuracy.

AssemblyAI also offers LeMUR , which enables users to leverage Large Language Models (LLMs) to pull valuable information from their voice data—including answering questions, generating summaries and action items, and more. 

The company offers up to 100 free transcription hours for audio files or video streams, with a concurrency limit of 5, before transitioning to an affordable paid tier.

Its high accuracy and diverse collection of AI models built by AI experts make AssemblyAI a sound option for developers looking for a free Speech-to-Text API. The API also supports virtually every audio and video file format out-of-the-box for easier transcription.

AssemblyAI has expanded the languages it supports to include English, Spanish, French, German, Japanese, Korean, and many more, with additional languages being released monthly. See the full list here.

AssemblyAI’s easy-to-use models also allow for quick set-up and transcription in any programming language. You can copy/paste code examples in your preferred language directly from the AssemblyAI Docs or use the AssemblyAI Python SDK or another one of its ready-to-use integrations .
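For instance, a minimal transcription using the AssemblyAI Python SDK looks roughly like this (the audio URL is a placeholder):

```python
import assemblyai as aai  # pip install assemblyai

aai.settings.api_key = "YOUR_API_KEY"

transcriber = aai.Transcriber()
# Accepts a local file path or a publicly accessible URL.
transcript = transcriber.transcribe("https://example.com/audio.mp3")

print(transcript.text)
```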

  • Free to test in the AI playground, plus 100 free hours of asynchronous transcription with an API sign-up
  • Speech-to-Text – $0.37 per hour
  • Real-time Transcription – $0.47 per hour
  • Audio Intelligence – varies, $0.01 to $0.15 per hour
  • LeMUR – varies
  • Enterprise pricing is also available

See the full pricing list here.

Pros:
  • High accuracy
  • Breadth of AI models available, built by AI experts
  • Continuous model iteration and improvement
  • Developer-friendly documentation and SDKs
  • Enterprise-grade support and security

Cons:
  • Models are not open-source

Google Speech-to-Text is a well-known speech transcription API. Google gives users 60 minutes of free transcription, with $300 in free credits for Google Cloud hosting.

Google only supports transcribing files already stored in a Google Cloud Bucket, so the free credits won't get you very far. Google also requires you to sign up for a GCP account and project, whether you're using the free tier or a paid plan.

With good accuracy and 125+ languages supported, Google is a decent choice if you’re willing to put in some initial work.

Free tier:
  • 60 minutes of free transcription
  • $300 in free credits for Google Cloud hosting

Pros:
  • Decent accuracy
  • Multi-language support

Cons:
  • Only supports transcription of files in a Google Cloud Bucket
  • Difficult to get started
  • Lower accuracy than other similarly-priced APIs

AWS Transcribe

AWS Transcribe offers one hour free per month for the first 12 months of use.

Like Google, you must create an AWS account first if you don’t already have one. AWS also has lower accuracy compared to alternative APIs and only supports transcribing files already in an Amazon S3 bucket.

However, if you’re looking for a specific feature, like medical transcription, AWS has some options. Its Transcribe Medical API is a medical-focused ASR option that is available today.

Free tier and pricing:
  • One hour free per month for the first 12 months of use
  • Tiered pricing based on usage, ranging from $0.02400 down to $0.00780 per minute

Pros:
  • Integrates into the existing AWS ecosystem
  • Medical language transcription

Cons:
  • Difficult to get started from scratch
  • Only supports transcribing files already in an Amazon S3 bucket

Open-Source Speech Transcription Engines

An alternative to APIs and AI models, open-source Speech-to-Text libraries are completely free, with no limits on use. Some developers also see data security as a plus, since your data doesn't have to be sent to a third party or the cloud.

There is work involved with open-source engines, so you must be comfortable putting in a lot of time and effort to get the results you want, especially if you are trying to use these libraries at scale. Open-source Speech-to-Text engines are typically less accurate than the APIs discussed above.

If you want to go the open-source route, here are some options worth exploring:

DeepSpeech is an open-source embedded Speech-to-Text engine designed to run in real-time on a range of devices, from high-powered GPUs to a Raspberry Pi 4. The DeepSpeech library uses an end-to-end model architecture pioneered by Baidu.

DeepSpeech also has decent out-of-the-box accuracy for an open-source option and is easy to fine-tune and train on your own data.
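A short inference sketch with the deepspeech Python package might look like this (the model and scorer filenames are placeholders for the released 0.9.x artifacts):

```python
import wave

import numpy as np
from deepspeech import Model  # pip install deepspeech

# Load the acoustic model and (optionally) the external scorer.
model = Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

# DeepSpeech expects 16-bit, 16 kHz mono PCM audio.
with wave.open("audio_16k_mono.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(model.stt(audio))
```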

Pros:
  • Easy to customize
  • Can be used to train your own model
  • Can be used on a wide range of devices

Cons:
  • Lack of support
  • No model improvement outside of individual custom training
  • Heavy lift to integrate into production-ready applications

Kaldi is a speech recognition toolkit that has been widely popular in the research community for many years.

Like DeepSpeech, Kaldi has good out-of-the-box accuracy and supports the ability to train your own models. It has also been thoroughly tested: many companies use Kaldi in production and have done so for a while, which makes developers more confident in its application.

Pros:
  • Can be used to train your own models
  • Active user base

Cons:
  • Can be complex and expensive to use
  • Uses a command-line interface

Flashlight ASR (formerly Wav2Letter)

Flashlight ASR, formerly Wav2Letter, is Facebook AI Research's Automatic Speech Recognition (ASR) toolkit. It is written in C++ and uses the ArrayFire tensor library.

Like DeepSpeech, Flashlight ASR is decently accurate for an open-source library and is easy to work with on a small project.

Pros:
  • Customizable
  • Easier to modify than other open-source options
  • Processing speed

Cons:
  • Very complex to use
  • No pre-trained libraries available
  • Need to continuously source datasets for training and model updates, which can be difficult and costly

SpeechBrain

SpeechBrain is a PyTorch-based transcription toolkit. The platform releases open implementations of popular research works and offers tight integration with Hugging Face for easy access.

Overall, the platform is well-defined and constantly updated, making it a straightforward tool for training and fine-tuning.
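As a sketch, transcribing a file with one of SpeechBrain's published pretrained models might look like this (the model identifier below is one public example; substitute any ASR model card you prefer):

```python
from speechbrain.pretrained import EncoderDecoderASR  # pip install speechbrain

# Download and cache a pretrained ASR model from Hugging Face.
asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)

print(asr.transcribe_file("audio.wav"))
```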

Pros:
  • Integration with PyTorch and Hugging Face
  • Pre-trained models are available
  • Supports a variety of tasks

Cons:
  • Even its pre-trained models take a lot of customization to make them usable
  • Lack of extensive documentation makes it less approachable for anyone without extensive experience

Coqui is another deep learning toolkit for Speech-to-Text transcription. Coqui is used for projects in over twenty languages and also offers a variety of essential inference and productionization features.

The platform also releases custom-trained models and has bindings for various programming languages for easier deployment.

Pros:
  • Generates confidence scores for transcripts
  • Large support community

Cons:
  • No longer updated and maintained by Coqui

Whisper by OpenAI, released in September 2022, is comparable to other current state-of-the-art open-source options.

Whisper can be used either in Python or from the command line and can also be used for multilingual translation.

Whisper has five different models of varying sizes and capabilities, depending on the use case, including v3, released in November 2023.
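Local usage is compact; here is a minimal sketch with the openai-whisper package:

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")      # one of the available model sizes
result = model.transcribe("audio.mp3")  # language is auto-detected by default

print(result["text"])
```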

However, you'll need fairly substantial computing power and access to an in-house team to maintain, scale, update, and monitor the model to run Whisper at a large scale, making the total cost of ownership higher compared to other options.

As of March 2023, Whisper is also available via API. On-demand pricing starts at $0.006/minute.
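A sketch of the hosted endpoint via the OpenAI Python SDK, assuming openai>=1.0 and an OPENAI_API_KEY set in the environment:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

print(transcript.text)
```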

Pros:
  • Multilingual transcription
  • Can be used in Python
  • Five models are available, each with different sizes and capabilities

Cons:
  • Need an in-house research team to maintain and update
  • Costly to run

Which free Speech-to-Text API, AI model, or Open Source engine is right for your project?

The best free Speech-to-Text API, AI model, or open-source engine will depend on your project. Do you want something that is easy to use, has high accuracy, and offers additional out-of-the-box features? If so, one of these APIs might be right for you:

Alternatively, you might want a completely free option with no data limits—if you don’t mind the extra work it will take to tailor a toolkit to your needs. If so, you might choose one of these open-source libraries:

Whichever you choose, make sure you find a product that can continually meet the needs of your project, both now and as your project develops in the future.

Want to get started with an API?

Get a free API key for AssemblyAI.


Google Text-to-Speech API - Python Integration Guide

Unreal Speech

Python and Google's TTS API: A Simplified Approach

When integrating Google text to speech API Python, the process is streamlined and efficient. The Google Translate text to speech API, a key component of this integration, allows for the conversion of text into natural-sounding speech. This feature is advantageous for businesses seeking to enhance user experience through interactive voice response systems or audio-based content. The Google Translate text to speech API, with its multilingual support, offers a global reach, making it a valuable tool for businesses operating in diverse markets.

The Google text to speech API Python library, a comprehensive resource for developers, provides a simplified approach to implementing text to speech technology. This library, with its well-documented functions and methods, offers a clear path to integrating Google's TTS API into Python-based applications. The advantage lies in its ease of use, reducing the complexity often associated with such integrations. The benefit is a faster, more efficient development process, enabling businesses to quickly deploy voice-enabled services and improve customer engagement.


Understanding Text to Speech Technology: A Comprehensive Glossary of Terms

API (Application Programming Interface): An API is a set of rules and protocols for building and interacting with software applications. It defines the methods and data formats that a program can use to communicate with other software or hardware.

Python: Python is a high-level, interpreted programming language known for its simplicity and readability. It supports multiple programming paradigms, including procedural, object-oriented, and functional programming.

TTS (Text-to-Speech): TTS is a type of assistive technology that reads digital text aloud. It's used in various applications, including voice-enabled email and spoken directions for navigation apps.

Google's TTS API: Google's TTS API is a cloud-based service that converts text into human-like speech. It leverages deep learning technologies to deliver high-quality voices and supports multiple languages.

JSON (JavaScript Object Notation): JSON is a lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. It's often used when data is sent from a server to a web page.

HTTP (Hypertext Transfer Protocol): HTTP is the protocol used for transferring data over the internet. It defines how messages are formatted and transmitted, and what actions web servers and browsers should take in response to various commands.

SSML (Speech Synthesis Markup Language): SSML is a standardized markup language that provides a rich, XML-based language for assisting the generation of synthetic speech in web and other applications.

OAuth 2.0: OAuth 2.0 is an authorization framework that enables applications to obtain limited access to user accounts on an HTTP service. It's used by Google APIs to authenticate and authorize requests.

REST (Representational State Transfer): REST is an architectural style for designing networked applications. A RESTful web service, like Google's TTS API, uses HTTP methods to implement the concept of REST architecture.

What Is Google Text to Speech API Python: An In-Depth Exploration

Google's Text to Speech API Python—a feature-rich tool—offers a myriad of advantages for developers and businesses alike. Its core feature, the conversion of text into human-like speech, leverages Google's advanced deep learning technologies. This advantage enables the creation of applications with enhanced accessibility features, improving user experience. Consequently, businesses benefit from increased user engagement and potential growth in customer base—demonstrating the API's practical value in today's digital landscape.

Unveiling the Benefits and Advantages of Google Text to Speech API Key

Unveiling Google's Text to Speech API Key, one discovers a feature set that is both robust and innovative. This tool, powered by Google's cutting-edge deep learning algorithms, transforms text into speech that mirrors human intonation and rhythm—an advantage that opens new avenues for application development. Enhanced accessibility options, a direct result of this feature, enrich the user interface, fostering a more engaging user experience. This, in turn, can catalyze business growth by expanding the customer base—a testament to the API Key's tangible benefits in the evolving digital ecosystem.

Enhancing finance and corporate management with Google text to speech API Python benefits

Google's Text to Speech API Python, a feature-rich tool, harnesses the power of advanced machine learning to convert text into lifelike speech—providing a distinct advantage in the realm of application development. This technology, with its high perplexity and burstiness, offers a unique benefit to finance and corporate management sectors by enabling the creation of interactive voice response (IVR) systems, automated customer service, and real-time multilingual communication. Consequently, it fosters an enriched user experience, broadens customer reach, and propels business growth—underscoring its pivotal role in the digital transformation journey.

Government utilization of Google text to speech API Python for efficient public service

Recognizing the potential of Google's Text to Speech API Python, governments worldwide are leveraging its high perplexity and burstiness for efficient public service. This advanced machine learning tool, known for its ability to transform text into lifelike speech, is being utilized to streamline public communication—enhancing citizen engagement, improving service delivery, and fostering transparency. By integrating this technology into their digital infrastructure, governments can create interactive voice response systems, automate customer service, and facilitate real-time multilingual communication—thereby revolutionizing public service delivery and reinforcing their commitment to digital transformation.

Boosting business and ecommerce operations through Google text to speech API Python advantages

Amid the digital revolution, businesses and ecommerce platforms are harnessing the power of Google's Text to Speech API Python. This sophisticated tool, renowned for its high perplexity and burstiness, offers a myriad of advantages—transforming text into natural-sounding speech, automating customer interactions, and enabling real-time multilingual communication. By integrating this technology, organizations can elevate their operations, enhance customer engagement, and drive business growth. Thus, Google's Text to Speech API Python emerges as a game-changer in the realm of business and ecommerce operations.

Empowering education and training with Google text to speech API Python capabilities

As the digital landscape evolves, the significance of Google's Text to Speech API Python in education and training sectors is becoming increasingly apparent. This advanced tool—characterized by its high perplexity and burstiness—provides a unique solution to the challenges of remote learning and training. It converts text into lifelike speech, facilitating interactive learning experiences and overcoming language barriers. By leveraging this technology, educational institutions and training organizations can revolutionize their pedagogical approaches, foster student engagement, and enhance learning outcomes. Hence, Google's Text to Speech API Python is poised to redefine the future of education and training.

Industrial manufacturing and supply chains transformation with Google text to speech API Python

Industrial manufacturing and supply chains are undergoing a transformative shift, driven by Google's Text to Speech API Python. This tool, marked by high perplexity and burstiness, offers a novel solution to the complexities of real-time communication and data interpretation in these sectors. By converting text into natural-sounding speech, it enables seamless interaction between systems and users, thereby enhancing operational efficiency. Furthermore, it breaks down language barriers, facilitating global collaboration and coordination. Thus, Google's Text to Speech API Python is set to revolutionize industrial manufacturing and supply chains.

Medical research and healthcare innovation via Google text to speech API Python

As awareness of the potential for technological innovation in healthcare grows, a significant problem emerges—how to effectively leverage these advancements for medical research and patient care. Google's Text to Speech API Python, characterized by high perplexity and burstiness, positions itself as a solution. This tool transforms text into natural, comprehensible speech, enabling a more intuitive interaction between healthcare professionals and complex data systems. It simplifies the interpretation of intricate medical data, thereby accelerating research and improving patient outcomes. Moreover, it transcends language barriers, fostering international collaboration in medical research. Thus, Google's Text to Speech API Python is poised to drive healthcare innovation and medical research forward.

Google text to speech API Python's role in advancing social development

With the rising awareness of social development's technological needs, a critical issue surfaces—how to harness these advancements for societal betterment. Google's Text to Speech API Python, marked by its high perplexity and burstiness, offers a compelling solution. This tool converts text into understandable speech, facilitating seamless interaction between social workers and intricate data systems. It demystifies the analysis of complex social data, thus expediting research and enhancing community outcomes. Furthermore, it breaks down language barriers, promoting global cooperation in social research. Consequently, Google's Text to Speech API Python is set to propel social development and research forward.

Scientific research and engineering progress with Google text to speech API Python

Recognizing the escalating need for advanced tools in scientific research and engineering, a significant challenge emerges—leveraging these innovations for optimal results. Google's Text to Speech API Python, characterized by its elevated perplexity and burstiness, provides an intriguing answer. This technology transforms text into comprehensible speech, enabling effortless communication between researchers and complex data systems. It simplifies the interpretation of intricate scientific data, thereby accelerating research and improving engineering solutions. Moreover, it eliminates linguistic obstacles, fostering international collaboration in scientific research. As a result, Google's Text to Speech API Python is poised to drive scientific research and engineering progress.

Law and paralegal sectors' transformation using Google text to speech API Python

Amid the rapidly evolving legal landscape, Google's Text to Speech API Python emerges as a transformative tool for the law and paralegal sectors. This technology, marked by high perplexity and burstiness, converts intricate legal text into audible speech—facilitating seamless interaction between legal professionals and complex legal databases. It streamlines the interpretation of dense legal documents, expediting case research and enhancing legal strategies. Furthermore, it eradicates language barriers, promoting global collaboration in legal research. Consequently, Google's Text to Speech API Python is set to revolutionize the law and paralegal sectors.

Feature Highlights: Exploring the Capabilities of Google Text to Speech API Python

Google's Text to Speech API Python, a feature-rich tool, offers a myriad of capabilities. Its primary feature—TTS conversion—provides the advantage of transforming complex textual data into comprehensible speech. This capability benefits various sectors, particularly those dealing with intricate data, such as the legal and paralegal fields. By converting dense legal text into audible speech, it simplifies interaction with complex databases, accelerates research, and enhances strategic planning. Moreover, it eliminates language obstacles, fostering international cooperation in research endeavors. Thus, Google's Text to Speech API Python stands as a game-changer in data-intensive industries.

Unveiling cost-effectiveness in Google text to speech API Python's robust features

Despite the evident prowess of Google's Text to Speech API Python, businesses often grapple with cost-effectiveness—especially when dealing with voluminous, complex data. This concern escalates when the need for seamless, international collaboration arises, necessitating the elimination of language barriers. However, the robust features of this API offer a compelling solution. Its TTS conversion capability not only simplifies interaction with intricate databases but also accelerates research and strategic planning—thereby enhancing productivity. Furthermore, its language versatility fosters global cooperation, making it a cost-effective tool for data-intensive industries.

Legal regulations compliance made seamless with Google text to speech API Python

Legal regulations compliance presents a significant challenge for businesses—particularly when dealing with complex, multilingual data. This problem intensifies when one considers the need for efficient, global collaboration, which necessitates the removal of language barriers. Google's Text to Speech API Python, however, offers a potent solution. Its advanced TTS conversion feature not only streamlines interaction with complex databases but also expedites research and strategic planning—thus boosting productivity. Moreover, its language versatility promotes international cooperation, making it a cost-effective tool for data-heavy industries. Therefore, this API serves as a powerful ally in ensuring seamless compliance with legal regulations.

Sustainability-focused features of Google text to speech API Python

Recognizing the escalating demand for sustainable solutions in the tech industry, Google's Text to Speech API Python emerges as a frontrunner—equipped with features that prioritize environmental responsibility. Its energy-efficient design minimizes power consumption, thereby reducing the carbon footprint of businesses that utilize it. Furthermore, its cloud-based nature eliminates the need for physical servers, contributing to a reduction in e-waste. This API's sustainability-focused features, coupled with its robust language versatility and advanced TTS conversion capabilities, position it as an indispensable tool for businesses striving for eco-friendly operations.

Scalability potential in Google text to speech API Python's advanced features

Google's Text to Speech API Python—known for its scalability potential—offers advanced features that cater to the evolving needs of businesses. Its cloud-based architecture allows for seamless expansion, accommodating increasing user demands without the need for additional hardware. This scalability is further enhanced by its language versatility, supporting a multitude of languages and dialects, thus broadening its applicability. Moreover, its advanced TTS conversion capabilities ensure high-quality audio output, regardless of the scale of operations. These features, combined with its energy-efficient design, make Google's Text to Speech API Python a scalable, eco-friendly solution for businesses.

User-friendliness in Google text to speech API Python's feature exploration

Attention is drawn to the user-friendly nature of Google's Text to Speech API Python, a feature that sets it apart in the realm of TTS technologies. Its intuitive interface, coupled with comprehensive documentation, simplifies the process of feature exploration for developers—making it an accessible tool for businesses of all sizes. Interest is piqued by its ability to deliver high-quality audio output, a testament to its advanced TTS conversion capabilities. The desire for scalability and language versatility is met, as it supports a multitude of languages and dialects, and its cloud-based architecture allows for seamless expansion. Action is encouraged by its energy-efficient design, an eco-friendly solution that aligns with modern sustainability goals.

Wider market reach through feature-rich Google text to speech API Python

One encounters a challenge in reaching a broader market due to language barriers and scalability issues. This problem intensifies when the business expands, causing agitation among stakeholders. Google's Text to Speech API Python emerges as a solution—offering a feature-rich platform that not only supports a wide array of languages and dialects but also ensures scalability through its cloud-based architecture. Its high-quality audio output and energy-efficient design further enhance its appeal, making it a reliable tool for businesses aiming for global reach and sustainability.

Deployment simplicity: A key feature of Google text to speech API Python

Google's Text to Speech API Python showcases deployment simplicity—a feature that stands out in the realm of TTS technology. This advantage is realized through its user-friendly interface and straightforward integration process, which eliminates the need for extensive technical knowledge. Consequently, businesses benefit from a streamlined workflow, reduced setup time, and increased productivity. This API, with its cloud-based architecture, supports a multitude of languages and dialects, ensuring scalability and global reach. Furthermore, its high-quality audio output and energy-efficient design underscore its reliability and sustainability—essential attributes for businesses aiming for growth and longevity.

Exploring Use Cases for the Google Text to Speech API Key

As awareness of Google's Text to Speech API Key grows, it's crucial to understand its potential applications. One notable problem it addresses is the challenge of creating multilingual content—its support for numerous languages and dialects makes it a versatile tool for global businesses. Moreover, it positions itself as a reliable solution for producing high-quality audio content, thanks to its cloud-based architecture and energy-efficient design. This API Key's deployment simplicity, coupled with its user-friendly interface, further enhances its appeal, offering a streamlined workflow and reduced setup time. Thus, it emerges as a robust tool for businesses seeking to enhance productivity and reach a wider audience.

Scientific research and technology development groups leveraging Google text to speech API Python

Scientific research and technology development groups are increasingly cognizant of Google's Text to Speech API Python's potential. This awareness stems from the API's ability to tackle the complex issue of generating multilingual content—its extensive language and dialect support positions it as an invaluable asset for global operations. Furthermore, its cloud-based architecture and energy-efficient design ensure the production of superior audio content. The simplicity of deployment and user-friendly interface of this API Python enhance its appeal, offering a streamlined workflow and minimized setup time. Consequently, it stands as a powerful resource for organizations aiming to boost productivity and extend their reach.

Public offices and government contractors' integration of Google text to speech API Python

Public offices and government contractors face a significant challenge—efficiently generating multilingual content. This issue is further aggravated by the need for high-quality audio content, a streamlined workflow, and minimal setup time. Google's Text to Speech API Python emerges as a potent solution to these problems. Its extensive language support, cloud-based architecture, and energy-efficient design make it an ideal tool for these entities. Moreover, its user-friendly interface simplifies deployment, thereby enhancing productivity and global reach.

Google text to speech API Python in hospitals and healthcare facilities: A closer look

Within the healthcare sector, Google's Text to Speech API Python presents a transformative feature—its ability to convert text into natural-sounding speech. This advantage is particularly beneficial in hospitals and healthcare facilities, where clear, accurate communication is paramount. The benefit is twofold: it not only enhances patient care by providing comprehensible health information, but also streamlines administrative tasks, such as appointment reminders and medication instructions. This cloud-based solution, with its extensive language support and user-friendly interface, thus emerges as a powerful tool for improving healthcare efficiency and patient engagement.

Google text to speech API Python: A tool for banks and financial agencies

Google's Text to Speech API Python emerges as a potent tool in the banking and financial sector—its capacity to transform text into natural, human-like speech is a game-changer. This feature is particularly advantageous for banks and financial agencies, where precise, clear communication is crucial. It not only enhances customer service by delivering understandable financial information, but also optimizes administrative tasks, such as transaction alerts and loan reminders. This cloud-based solution, with its broad language support and intuitive interface, thus positions itself as an essential instrument for boosting financial service efficiency and customer engagement.

Google text to speech API Python: A strategic asset for businesses and ecommerce operators

Google's Text to Speech API Python—unveiling a new dimension in the realm of business and ecommerce operations—offers a unique feature: the conversion of text into lifelike speech. This advantage, pivotal in sectors demanding precise communication, elevates customer interactions by delivering comprehensible information, while streamlining administrative tasks such as notifications and reminders. Consequently, this cloud-based solution, with its extensive language support and user-friendly interface, manifests as a strategic asset, enhancing operational efficiency and customer engagement.

Social welfare organizations' innovative applications of Google text to speech API Python

Google's Text to Speech API Python—revolutionizing the landscape of social welfare organizations—introduces an innovative feature: the transformation of written content into natural-sounding speech. This advantage, crucial in areas requiring clear and concise communication, enhances user experience by providing easily understandable information, while simplifying administrative tasks such as alerts and reminders. As a result, this cloud-based tool, with its wide-ranging language support and intuitive interface, emerges as a tactical resource, boosting operational productivity and user engagement.

Google text to speech API Python's impact on educational institutions and training centers

Google's Text to Speech API Python—pioneering a new era for educational institutions and training centers—offers a distinctive feature: the conversion of text into lifelike speech. This advantage, pivotal in environments demanding precise and understandable communication, elevates user interaction by delivering comprehensible content, while streamlining managerial duties such as notifications and reminders. Consequently, this cloud-based solution, with its extensive language compatibility and user-friendly interface, emerges as a strategic asset, enhancing operational efficiency and learner engagement.

Industrial manufacturers and distributors: Streamlining operations with Google text to speech API Python

Google's Text to Speech API Python—revolutionizing industrial manufacturing and distribution sectors—introduces a unique feature: the transformation of text into natural-sounding speech. This advantage, crucial in settings requiring clear and accurate communication, enhances user engagement by providing intelligible content, while simplifying administrative tasks such as alerts and reminders. As a result, this cloud-based tool, with its broad language support and intuitive interface, becomes a tactical resource, boosting operational productivity and user interaction.

Law firms and paralegal service providers' innovative use of Google text to speech API Python

Law firms and paralegal service providers face a significant challenge—efficiently managing vast amounts of textual data. This issue, often leading to time-consuming manual processes, hampers productivity and client service. Google's Text to Speech API Python, however, offers an innovative solution. By converting text into natural-sounding speech, it enables these organizations to streamline data management, enhance client communication, and improve service delivery. This cloud-based tool, with its extensive language support and user-friendly interface, emerges as a strategic asset, elevating operational efficiency and client engagement.

Latest Research Insights on Advancements in Text-to-Speech Tech

As awareness of TTS synthesis grows, so does recognition of its potential. Problems in accessibility, language learning, and user engagement can be addressed by this technology. Recent research and engineering case studies reveal significant advancements—improved naturalness of speech, enhanced prosody, and better language models. These benefits position businesses, educational institutions, and social platforms to deliver superior user experiences, foster inclusivity, and drive engagement.

  • Text-to-speech Synthesis System based on Wavenet (2017) - This research paper, authored by Yuan Li, Xiaoshi Wang, and Shutong Zhang from Stanford University's Department of Computer Science, explores the development of a parametric TTS system based on WaveNet. WaveNet is a deep neural network introduced by DeepMind in 2016 for generating raw audio waveforms. The paper discusses the integration of convolutional layers into the TTS task to extract valuable information from the input data. It also addresses the limitations and challenges faced by the system.
  • Speech Synthesis: A Review - Archana Balyan, S. S. Agrawal, and Amita Dev authored this research paper, which provides an overview of recent advancements in speech synthesis. The focus is on the statistical parametric approach to speech synthesis based on Hidden Markov Models (HMMs). The paper discusses the simultaneous modeling of spectrum, excitation, and duration of speech using context-dependent HMMs. It aims to summarize and compare various synthesis techniques used in the field, contributing to the identification of research topics and applications in speech synthesis.

Wrapping Up: A Closer Look at Google Text to Speech API Python

Text to Speech technology, often abbreviated as TTS, is a rapidly evolving field with a plethora of terms that can be overwhelming for newcomers. Understanding these terms is crucial for anyone looking to leverage this technology. For instance, 'phoneme' refers to the smallest unit of sound, while 'prosody' pertains to the rhythm, stress, and intonation of speech. 'Voice synthesis', another key term, is the process of artificially producing human speech. These terms, among others, form the backbone of TTS technology, enabling developers to create applications that can convert written text into spoken words.

Google Text to Speech API Python is a powerful tool that allows developers to convert text into speech. An API (Application Programming Interface) is a set of rules and protocols for building software and applications. With Google Text to Speech API Python, developers can create applications that read text aloud in a variety of languages and voices. This API is particularly useful for creating applications for visually impaired users, language learners, or anyone who benefits from auditory learning.

Google Text to Speech API Key offers numerous benefits and advantages. It provides access to a wide range of voices and languages, allowing developers to create applications that cater to a global audience. The API also supports SSML tags, which enable developers to control aspects of speech such as pronunciation, volume, and pitch. Furthermore, Google Text to Speech API Key is easy to integrate with existing applications, making it a versatile tool for developers.

Google Text To Speech Api Python: Quick Python Example

This Python example demonstrates how to use the pyttsx3 module to convert text to speech. The 'init' function initializes the speech engine, and the 'setProperty' function is used to adjust the speech rate and volume. The 'say' function is then used to input the text that will be converted to speech, and 'runAndWait' is called to process the speech.
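Here is a short, runnable sketch matching that description. Note that pyttsx3 is an offline library that uses local system voices rather than Google's cloud service.

```python
import pyttsx3  # pip install pyttsx3

engine = pyttsx3.init()                  # initialize the speech engine
engine.setProperty("rate", 150)          # speech rate (words per minute)
engine.setProperty("volume", 0.9)        # volume, 0.0 to 1.0
engine.say("Hello from text to speech")  # queue the text to be spoken
engine.runAndWait()                      # process and play the speech
```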

Google Text To Speech Api Python: Quick Javascript Example

This JavaScript example demonstrates how to use the 'say' module to convert text to speech. The 'require' function is used to import the 'say' module, and the 'speak' function is used to input the text that will be converted to speech.

Unique Unreal Speech Benefits Over Google Text to Speech API Python

Unreal Speech is revolutionizing the TTS technology landscape with its cost-effective solutions. It significantly reduces TTS costs by up to 95%, making it up to 20 times cheaper than competitors like Eleven Labs and Play.ht, and up to 4 times cheaper than tech giants such as Amazon, Microsoft, IBM, and Google. This cost efficiency is a game-changer for a wide range of organizations, from small to medium businesses, call centers, and telesales agencies, to podcast authors, content publishers, video marketers, and even enterprise-level organizations like hospitals, banks, and educational institutions. The pricing structure of Unreal Speech is designed to scale with the needs of these diverse users, offering a free tier for up to 1 million characters, and volume discounts for higher usage.

But cost efficiency is not the only advantage Unreal Speech brings to the table. It also offers the Unreal Speech Studio, a tool that enables users to create studio-quality voice overs for podcasts, videos, and more. Users can customize playback speed and pitch to generate the desired intonation and style, and choose from a wide variety of professional-sounding, human-like voices. The output can be downloaded in MP3 or PCM µ-law-encoded WAV formats in various bitrate quality settings. For those who want to experience the technology firsthand, a simple to use live Unreal Speech demo is available for generating random text and listening to the human-like voices of Unreal Speech.

Unreal Speech's robust infrastructure supports up to 3 billion characters per month for each client, with a latency of just 0.3 seconds and a 99.9% uptime guarantee. This high capacity and reliability have earned it rave reviews from users. Derek Pankaew, CEO of Listening.io, attests to the quality and cost-effectiveness of Unreal Speech, stating, "Unreal Speech saved us 75% on our TTS cost. It sounds better than Amazon Polly, and is much cheaper. We switched over at high volumes, and often processing 10,000+ pages per hour. Unreal Speech was able to handle the volume, while delivering high quality listening experience." Developed with love in San Francisco, U.S., Unreal Speech is a testament to the power of innovation in the field of TTS technology.

FAQs: Navigating the Intricacies of Google Text to Speech API Python

Grasping Google's TTS API usage in Python—free of charge—unlocks a plethora of benefits. It empowers developers to create robust, voice-enabled applications, enhancing user engagement. Understanding the setup process and obtaining the API key are crucial steps, enabling seamless integration and access to Google's advanced speech-to-text technology. This knowledge not only boosts technical proficiency but also catalyzes innovation in AI development.

How to use Google TTS API in Python?

Utilizing Google's TTS API in Python necessitates the installation of the Google Cloud SDK and setting up authentication via a JSON key file. Once these prerequisites are met, the user can import the texttospeech module from the google.cloud library. The synthesis_input object is then created, which contains the text to be converted. The voice object is defined next, specifying the language_code, ssml_gender, and name. The audio_config object is then set up, determining the audio format. The synthesize_speech method is finally called on the texttospeech client, passing the synthesis_input, voice, and audio_config objects as arguments. The resulting audio_content can be saved to a file for playback.
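A minimal sketch of that flow, assuming the google-cloud-texttospeech package and a service-account JSON key exposed via GOOGLE_APPLICATION_CREDENTIALS (the voice name below is just an example):

```python
from google.cloud import texttospeech  # pip install google-cloud-texttospeech

# Authentication assumes GOOGLE_APPLICATION_CREDENTIALS points at a JSON key.
client = texttospeech.TextToSpeechClient()

synthesis_input = texttospeech.SynthesisInput(text="Hello, world!")
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Standard-C",  # example voice name
    ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)

response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

with open("output.mp3", "wb") as out:
    out.write(response.audio_content)  # save the audio for playback
```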

Is Google TTS API free?

Google's TTS API, while offering a robust set of features, is not entirely free. It operates on a pay-as-you-go model, with pricing tiers based on usage. For instance, the first million characters processed in a month are free, but subsequent usage incurs a cost. This pricing model allows businesses to scale their usage according to their needs, ensuring they only pay for what they use. It's important to note that the API supports multiple languages and voices, and integrates with SSML for enhanced control over speech output.

How to use Speech-to-Text API in Python?

To leverage the Speech-to-Text API in Python, one must first install the requisite Python SDK, followed by the importation of the speech module from the google.cloud library. The recognition process begins with the instantiation of a speech client. Subsequently, an audio object is created from a local audio file, and a configuration object is defined, specifying the language code and sample rate hertz. The recognize method is then invoked on the speech client, passing the audio and config objects. The transcriptions are extracted from the response object, providing the desired text output.
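A minimal sketch of that flow, assuming the google-cloud-speech package, the same JSON-key authentication, and a 16 kHz LINEAR16 WAV file:

```python
from google.cloud import speech  # pip install google-cloud-speech

client = speech.SpeechClient()  # same JSON-key authentication as above

with open("audio.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```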

How do I set up Google text to speech API?

Setting up Google's TTS API involves a series of technical steps. Initially, the Google Cloud SDK must be installed, followed by the creation of a JSON key file for authentication. The texttospeech module from the google.cloud library is then imported. Subsequently, the synthesis_input object, containing the text for conversion, is established. The voice object is defined, specifying language_code, ssml_gender, and name. The audio_config object is set up to determine the audio format. Finally, the synthesize_speech method is invoked on the texttospeech client, passing synthesis_input, voice, and audio_config objects. The resulting audio_content can be saved for later use.

How do I get Google speech-to-text API key?

Obtaining a Google Speech-to-Text API key necessitates a series of technical steps. Initially, one must create a project in the Google Cloud Console, then enable the Speech-to-Text API for that project. Following this, the user must navigate to the 'Credentials' page in the console, click 'Create Credentials', and select 'API key'. The generated key, which serves as the unique identifier for the project, can then be used to authenticate requests to the API. It's crucial to secure this key, as it can be used to incur charges to the Google Cloud account.

Additional Resources for Mastering Google Text to Speech API Python

Attention is drawn to the resource titled "Using the Text-to-Speech API with Python", a comprehensive guide published on Apr 20, 2023. This guide offers developers and software engineers an in-depth understanding of Google's Text-to-Speech API, enabling them to create more efficient, user-friendly applications.

Businesses and companies can benefit from the "Python Client for Google Cloud Text-to-Speech API", published on Mar 30, 2023. This resource provides a detailed overview of the Python client, which can be instrumental in developing robust, scalable solutions that enhance customer engagement and satisfaction.

For educational institutions, healthcare facilities, government offices, and social organizations, "Text-to-Speech client libraries" is an invaluable resource. It provides code examples in multiple languages, including C++, Python, Java, Node.js, Go, Ruby, C#, and PHP, fostering a more inclusive, accessible environment for all users.

The Best Speech-to-Text APIs in 2024

Josh Fox, Jose Nicholas Francisco


If you've been shopping for a speech-to-text (STT) solution for your business, you're not alone. In our recent State of Voice Technology report, 82% of respondents confirmed their current utilization of voice-enabled technology, a 6% increase from last year.

The vast number of options for speech transcription can be overwhelming, especially if you're unfamiliar with the space. From Big Tech to open source options, there are many choices, each with different price points and feature sets. While this diversity is great, it can also be confusing when you're trying to compare options and pick the right solution.

This article breaks down the leading speech-to-text APIs available today, outlining their pros and cons and providing a ranking that accurately represents the current STT landscape. Before getting to the ranking, we explain exactly what an STT API is, the core features you can expect an STT API to have, and some key use cases for speech-to-text APIs.

What is a speech-to-text API?

At its core, a speech-to-text (also known as automatic speech recognition, or ASR) application programming interface (API) is simply the ability to call a service to transcribe audio containing speech into written text. The STT service will take the provided audio data, process it using either machine learning or legacy techniques (e.g. Hidden Markov Models), and then provide a transcript of what it has inferred was said.

What are the most important things to consider when choosing a speech-to-text API?

What makes the best speech-to-text API? Is the fastest speech-to-text API the best? Is the most accurate speech-to-text API the best? Is the most affordable speech-to-text API the best? The answers to these questions depend on your specific project and are thus certainly different for everybody. There are a number of aspects to carefully consider in the evaluation and selection of a transcription service and the order of importance is dependent on your target use case and end user needs.

Accuracy - A speech-to-text API should produce highly accurate transcripts, even while dealing with varying levels of speaking conditions (e.g. background noise, dialects, accents, etc.). “Garbage in, garbage out,” as the saying goes. The vast majority of voice applications require highly accurate results from their transcription service to deliver value and a good customer experience to their users.

Speed - Many applications require quick turnaround times and high throughput. A responsive STT solution will deliver value with low latency and fast processing speeds.

Cost - Speech-to-text is a foundational capability in the application stack, and cost efficiency is essential. Solutions that fail to deliver adequate ROI and a good price-to-performance ratio will be a barrier to the overall utility of the end user application.

Modality - Important input modes include support for pre-recorded or real-time audio:

Batch or pre-recorded transcription capabilities - Batch transcription won't be needed by everyone, but for many use cases, you'll want a service that lets you send batches of files to be transcribed, rather than having to process them one-by-one on your end.

Real-time streaming - Again, not everyone will need real-time streaming. However, if you want to use STT to create, for example, truly conversational AI that can respond to customer inquiries in real time, you'll need an STT API that returns its results as quickly as possible.

Features & Capabilities - Developers and companies seeking speech processing solutions require more than a bare transcript. They also need rich features that help them build scalable products with their voice data, including sophisticated formatting and speech understanding capabilities to improve readability and utility by downstream tasks.

Scalability and Reliability - A good speech-to-text solution will accommodate varying throughput needs, adequately handling a range of audio data volumes from small startups to large enterprises. Similarly, ensuring reliable, operational integrity is a hard requirement for many applications where the effects from frequent or lengthy service interruption could result in revenue impacts and damage to brand reputation. 

Customization, Flexibility, and Adaptability - One size, fits few. The ability to customize STT models for specific vocabulary or jargon as well as flexible deployment options to meet project-specific privacy, security, and compliance needs are important, often overlooked considerations in the selection process.

Ease of Adoption and Use - A speech-to-text API only has value if it can be integrated into an application. Flexible pricing and packaging options are critical, including usage-based pricing with volume discounts. Some vendors do a better job than others of providing a good developer experience, offering frictionless self-onboarding and even free tiers with enough credits to let developers test the API and prototype their applications before committing to a subscription.

Support and Subject Matter Expertise - Domain experts in AI, machine learning, and spoken language understanding are an invaluable resource when issues arise. Many solution providers outsource their model development or offer STT as a value-add to their core offering. Vendors for whom speech AI is the core focus are better equipped to diagnose and resolve challenging issues in a timely fashion. They are also more inclined to make continuous improvements to their STT service and avoid stagnating performance over time.

What are the most important features of a speech-to-text API?

In this section, we'll survey some of the most common features that STT APIs offer. The key features that are offered by each API differ, and your use cases will dictate your priorities and needs in terms of which features to focus on.

Multi-language support - If you're planning to handle multiple languages or dialects, this should be a key concern. And even if you aren't planning on multilingual support now, if there's any chance you'll need it in the future, you're best off starting with a service that offers many languages and is always expanding to more.

Formatting - Formatting options like punctuation, numeral formatting, paragraphing, speaker labeling (or speaker diarization), word-level timestamps, and profanity filtering all improve readability and utility for downstream data science.

Automatic punctuation & capitalization - Depending on what you're planning to do with your transcripts, you might not care if they're formatted nicely. But if you're planning on surfacing them publicly, having this included in what the STT API provides can save you time.

Profanity filtering or redaction - If you're using STT as part of an effort for community moderation, you're going to want a tool that can automatically detect profanity in its output and censor it or flag it for review.

Understanding - A primary motivation for employing a speech-to-text API is to gain understanding of who said what and why they said it. Many applications employ natural language and spoken language understanding tasks to accurately identify, extract, and summarize conversational audio to deliver amazing customer experiences. 

Topic detection - Automatically identify the main topics and themes in your audio to improve categorization, organization, and understanding of large volumes of spoken language content.

Intent detection - Similarly, intent detection determines the purpose or intention behind the interactions between speakers, enabling downstream agents or tasks in a system to choose the next best action or response more efficiently.

Sentiment analysis - Understand the interactions, attitudes, views, and emotions in conversational audio by quantitatively scoring the conversation as a whole and its component sections as positive, neutral, or negative.

Summarization - Deliver a concise summary of the content in your audio, retaining the most relevant and important information and overall meaning, for responsive understanding, analysis, and efficient archival.

Keywords (a.k.a. Keyword Boosting) - Being able to include an extended, custom vocabulary is helpful if your audio has lots of specialized terminology, uncommon proper nouns, abbreviations, and acronyms that an off-the-shelf model wouldn't have been exposed to. This allows the model to incorporate these custom terms as possible predictions.
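
As a rough illustration, keyword boosting is often exposed as a simple request parameter. The parameter name and the term:boost syntax below are hypothetical; vendors expose this feature under different names and formats:

```python
# A hedged sketch of passing custom keywords with an STT request.
# The "keywords" parameter and term[:boost] syntax are illustrative only.
import requests

API_URL = "https://api.example-stt.com/v1/transcribe"  # hypothetical endpoint

# Higher boost values make a term more likely to appear in predictions.
params = {"keywords": ["EBITDA:2", "intubation:1.5", "Kubernetes"]}

with open("earnings_call.wav", "rb") as f:
    resp = requests.post(
        API_URL,
        headers={"Authorization": "Token YOUR_API_KEY",
                 "Content-Type": "audio/wav"},
        params=params,
        data=f,  # raw audio bytes
    )
resp.raise_for_status()
print(resp.json()["transcript"])  # assumed response field
```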

Custom models - While keywords provide inclusion of a small set of specialized, out-of-vocabulary words, a custom model trained on representative data will generally give the best performance. Vendors that allow you to tailor a model for your specific needs, fine-tuned on your own data, give you the ability to boost accuracy beyond what an out-of-the-box solution alone provides.

Accepts multiple audio formats - Another concern that won't be present for everyone is whether or not the STT API can process audio in different formats. If you have audio coming from multiple sources that aren't encoded in the same format, having an STT API that removes the need to convert between audio formats can save you time and money.
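
If your chosen API doesn't accept a format you have, client-side conversion is usually straightforward. Here's a small sketch using the pydub library (which requires ffmpeg to be installed) to convert a recording to 16 kHz mono WAV, a widely accepted input format:

```python
# Convert an M4A recording to 16 kHz mono WAV before uploading.
# The filenames are placeholders; pydub needs ffmpeg on the system path.
from pydub import AudioSegment  # pip install pydub

audio = AudioSegment.from_file("voicemail.m4a")
audio = audio.set_frame_rate(16000).set_channels(1)
audio.export("voicemail.wav", format="wav")
```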

What are the top speech-to-text use cases?

As noted at the outset, voice technology that's built on the back of STT APIs is a critical part of the future of business. So what are some of the most common use cases for speech-to-text APIs? Let's take a look.

Smart assistants - Smart assistants like Siri and Alexa are perhaps the most frequently encountered use case for speech-to-text, taking spoken commands, converting them to text, and then acting on them.

Conversational AI - Voicebots let humans speak and, in real time, get answers from an AI. Converting speech to text is the first step in this process, and it has to happen quickly for the interaction to truly feel like a conversation.

Sales and support enablement - Sales and support digital assistants provide tips, hints, and solutions to agents by transcribing, analyzing, and pulling up information in real time. STT can also be used to evaluate sales pitches or sales calls with customers.

Contact centers - Contact centers can use STT to create transcripts of their calls, providing more ways to evaluate their agents, understand what customers are asking about, and gain insight into aspects of their business that are typically hard to assess.

Speech analytics - Broadly speaking, speech analytics is any attempt to process spoken audio to extract insights. This might be done in a call center, as above, but it could also be done in other environments, like meetings or even speeches and talks.

Accessibility - Transcribing speech can be a huge win for accessibility, whether that's providing captions for classroom lectures or creating badges that transcribe speech on the fly.

How do you evaluate performance of a speech-to-text API?

All speech-to-text solutions aim to produce highly accurate transcripts in a user-friendly format. We advise performing side-by-side accuracy testing using files that resemble the audio you will be processing in production to determine the best speech solution for your needs. The best evaluation regimes employ a holistic approach that includes a mix of quantitative benchmarking and qualitative human preference evaluation across the most important dimensions of quality and performance, including accuracy and speed.

The generally accepted industry metric for measuring transcription quality is Word Error Rate (WER). Consider WER in relation to the following equation:

WER + Accuracy Rate = 100%

Thus, an 80% accurate transcript corresponds to a WER of 20%.

WER is an industry standard that focuses on error rate rather than accuracy because the error rate can be subdivided into distinct error categories: insertions, deletions, and substitutions. These categories provide valuable insights into the nature of the errors present in a transcript. Consequently, WER can also be defined using the formula:

WER = (# of words inserted + # of words deleted + # of words substituted) / total # of words in the reference transcript.
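
In code, WER is just a word-level edit distance divided by the length of the reference transcript. Here's a minimal Python implementation you can use for your own side-by-side testing:

```python
# Minimal WER: word-level Levenshtein distance over the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# "the cat sat" vs "the cat sat down": one insertion over three
# reference words gives a WER of 1/3 (about 33%).
print(wer("the cat sat", "the cat sat down"))
```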

We suggest a degree of skepticism towards vendor claims about accuracy. This includes the qualitative claim that OpenAI’s model “approaches human level robustness and accuracy on English speech recognition,” and the WER statistics published in Whisper’s documentation.
