tutorial // Jan 21, 2022
How to Add Text-to-Speech With the HTML5 Speech Synthesis API
How to use the HTML5 speech synthesis API to add text to speech to your app with multiple voice options.

Getting Started
For this tutorial, we're going to use CheatCode's full-stack JavaScript framework, Joystick. Joystick brings together a front-end UI framework with a Node.js back-end for building apps.
To begin, we'll want to install Joystick via NPM. Make sure you're using Node.js 16+ before installing to ensure compatibility (give this tutorial a read first if you need to learn how to install Node.js or run multiple versions on your computer):
Terminal
npm i -g @joystick.js/cli
This will install Joystick globally on your computer. Once installed, next, let's create a fresh project:
Terminal
joystick create app
After a few seconds, you will see a message logged out to cd into your new project and run joystick start:
Terminal
cd app && joystick start
After this, your app should be running and we're ready to get started.
Adding Bootstrap
Digging into the code, first, we want to add the Bootstrap CSS framework to our app. While you don't have to do this, it will make our app look a bit prettier and avoid us having to scramble together CSS for our UI. To do it, we're going to add the Bootstrap CDN link to the /index.html file at the root of our project:
/index.html
<!doctype html>
<html class="no-js" lang="en">
  <head>
    <meta charset="utf-8">
    <title>Joystick</title>
    <meta name="description" content="An awesome JavaScript app that's under development.">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta name="theme-color" content="#FFCC00">
    <link rel="apple-touch-icon" href="/apple-touch-icon-152x152.png">
    <link rel="stylesheet" href="/_joystick/index.css">
    <link rel="manifest" href="/manifest.json">
    <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-1BmE4kWBq78iYhFldvKuhfTAU6auU8tT94WrHftjDbrCEXSU1oBoqyl2QvZ6jIW3" crossorigin="anonymous">
    ${css}
  </head>
  <body>
    ...
  </body>
</html>
Here, just above the ${css} part in the file, we've pasted in the <link> tag from the Bootstrap documentation that gives us access to the CSS portion of the framework.
That's it. Joystick will automatically restart and load this into the browser so we can start using it.
Wiring up a Joystick component with text to speech
In a Joystick app, our UI is built using the framework's built-in UI library @joystick.js/ui. When we ran joystick create app above, we were given some example components to work with. We're going to overwrite the /ui/pages/index/index.js file with some HTML that will serve as the UI for our translator.
/ui/pages/index/index.js
import ui from '@joystick.js/ui';

const Index = ui.component({
  css: `
    h4 {
      border-bottom: 1px solid #eee;
      padding-bottom: 20px;
      margin-bottom: 40px;
    }

    textarea {
      margin-bottom: 40px;
    }
  `,
  render: () => {
    return `
      <div>
        <h4>Text to Speech Translator</h4>
        <form>
          <textarea class="form-control" name="textToTranslate" placeholder="Type the text to speak here and then press Speak below."></textarea>
          <button class="btn btn-primary">Speak</button>
        </form>
        <div class="players"></div>
      </div>
    `;
  },
});

export default Index;
To start, we want to replace the component that was in this file with what we see above. Here, we're defining a simple component with two things: a render function which returns a string of HTML that we want to show in the browser and, above it, a string of css that we want to apply to the HTML we're rendering (Joystick automatically scopes the CSS we pass here to the HTML returned by our render function).
If we load up http://localhost:2600 in a browser (port 2600 is where Joystick starts by default when we run joystick start), we should see the Bootstrap-styled version of the HTML above.
/ui/pages/index/index.js
import ui from '@joystick.js/ui';

const Index = ui.component({
  events: {
    'submit form': (event, component) => {
      event.preventDefault();

      const text = event?.target?.textToTranslate?.value;
      const hasText = text.trim() !== '';

      if (!hasText) {
        return component.methods.speak('Well you have to say something!');
      }

      component.methods.speak(text);
    },
  },
  css: `...`,
  render: () => {
    return `
      <div>
        <h4>Text to Speech Translator</h4>
        <form>
          <textarea class="form-control" name="textToTranslate" placeholder="Type the text to speak here and then press Speak below."></textarea>
          <button class="btn btn-primary">Speak</button>
        </form>
        <div class="players"></div>
      </div>
    `;
  },
});

export default Index;
Next, we want to add an events object to our component. Like the name implies, this is where we define event listeners for our component. Here, we're defining a listener for the submit event on the <form></form> element being rendered by our component. Just like our CSS, Joystick automatically scopes our events to the HTML being rendered.
Assigned to that submit form property on our events object is a function that will be called whenever the submit event is detected on our <form></form>.
Inside of that function, first, we take in the event (this is the browser DOM event) as the first argument and immediately call event.preventDefault() on it. This prevents the browser from performing its default behavior for a form: submitting it with an HTTP request to the URL in the form's action attribute (or, when there is no action attribute, like on our form, to the current URL, which reloads the page). We skip the default because we want to control the submission via JavaScript.
Next, once this is halted, we want to get the value typed into our <textarea></textarea>. To do it, we can reference the textToTranslate property on the event.target object. Here, event.target refers to the <form></form> element as it's rendered in the browser (its in-memory representation).
We can access textToTranslate because the browser automatically assigns all fields within a form to it in memory using the field's name attribute as the property name. If we look closely at our <textarea></textarea>, we can see that it has the name attribute textToTranslate. If we changed this to pizza, we'd write event?.target?.pizza?.value instead.
With that value stored in the text variable, next, we create another variable hasText which contains a check to make sure that our text variable isn't an empty string (the .trim() part here "trims off" any whitespace characters in case the user just hit the space bar over and over).
If we don't have any text in the input, we want to "speak" the phrase "Well you have to say something!" Assuming we did get some text, we just want to "speak" that text value.
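One edge worth noting: if the textarea can't be found for some reason, the optional chaining in the handler leaves text as undefined and calling .trim() on it would throw. Below is a minimal sketch of a guard, assuming a hypothetical helper named getTextToSpeak (it is not part of Joystick or the component above). Unlike the handler, it also returns the trimmed text.

```javascript
// Hypothetical helper (not part of Joystick): normalize the textarea value
// and fall back to a default phrase when nothing usable was typed.
function getTextToSpeak(rawValue, fallback = 'Well you have to say something!') {
  // Treat undefined, null, and non-string values as an empty string
  // so that .trim() can never throw.
  const text = typeof rawValue === 'string' ? rawValue.trim() : '';

  return text !== '' ? text : fallback;
}

console.log(getTextToSpeak('  Hello there  ')); // Hello there
console.log(getTextToSpeak('   ')); // Well you have to say something!
console.log(getTextToSpeak(undefined)); // Well you have to say something!
```

Inside the submit handler, this could replace the hasText check with a single call such as component.methods.speak(getTextToSpeak(text)).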
Notice that here we're calling to component.methods.speak which we haven't defined yet. To define it, we'll tap into Joystick's methods feature (where we can define miscellaneous functions on our component).
/ui/pages/index/index.js
import ui from '@joystick.js/ui';

const Index = ui.component({
  methods: {
    speak: (text = '') => {
      window.speechSynthesis.cancel();
      const message = new SpeechSynthesisUtterance(text);
      speechSynthesis.speak(message);
    },
  },
  events: {
    'submit form': (event, component) => {
      event.preventDefault();

      const text = event?.target?.textToTranslate?.value;
      const hasText = text.trim() !== '';

      if (!hasText) {
        return component.methods.speak('Well you have to say something!');
      }

      component.methods.speak(text);
    },
  },
  css: `...`,
  render: () => {
    return `
      <div>
        <h4>Text to Speech Translator</h4>
        <form>
          <textarea class="form-control" name="textToTranslate" placeholder="Type the text to speak here and then press Speak below."></textarea>
          <button class="btn btn-primary">Speak</button>
        </form>
        <div class="players"></div>
      </div>
    `;
  },
});

export default Index;
Now for the fun part. Because the Speech Synthesis API is implemented in browsers (see compatibility here—it's quite good), we don't have to install or import anything; the entire API is accessible globally in the browser.
Adding a methods object just above our events, we're assigning the speak method that we called to from our submit form event handler.
Inside, there's not much to do:
- In case we change the text we've typed in and click the "Speak" button mid-playback, we want to call the window.speechSynthesis.cancel() method to tell the API to clear its playback queue. If we don't do this, it will just append playback to its queue and continue to play what we passed it (even past a browser refresh).
- Create an instance of SpeechSynthesisUtterance(), which is a class that takes in the text we want to speak.
- Pass that instance to the speechSynthesis.speak() method.
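Since not every environment exposes the API, the three steps above can also be wrapped in a feature check. This is a minimal sketch, not the tutorial's component code: the helper name speakIfSupported and its env parameter are assumptions added for illustration (env defaults to the global object, which is window in a browser).

```javascript
// Hypothetical helper: run the cancel/create/speak steps only when the
// Speech Synthesis API is available on the given global object.
function speakIfSupported(text = '', env = globalThis) {
  const synth = env.speechSynthesis;
  const Utterance = env.SpeechSynthesisUtterance;

  // Bail out quietly where the API is missing (older browsers, Node.js).
  if (!synth || typeof Utterance !== 'function') {
    return false;
  }

  synth.cancel(); // 1. Clear anything still queued or playing.
  const message = new Utterance(text); // 2. Wrap the text in an utterance.
  synth.speak(message); // 3. Queue the utterance for playback.

  return true;
}

// In a supporting browser this speaks the phrase and logs true;
// elsewhere it does nothing and logs false.
console.log(speakIfSupported('Hello from the browser'));
```

Returning a boolean gives the caller a chance to show a "not supported" message instead of failing silently.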
That's it. As soon as we type some text in the box and hit "Speak," your browser (assuming it supports the API) should start blabbing.
Awesome. But we're not quite done. Believe it or not, the Speech Synthesis API also includes the option to use different voices. Next, we're going to update the HTML returned by our render function to include a list of voices to choose from and update methods.speak to accept different voices.
/ui/pages/index/index.js
import ui from '@joystick.js/ui';

const Index = ui.component({
  state: {
    voices: [],
  },
  lifecycle: {
    onMount: (component) => {
      window.speechSynthesis.onvoiceschanged = () => {
        const voices = window.speechSynthesis.getVoices();
        component.setState({ voices });
      };
    },
  },
  methods: {
    getLanguageName: (language = '') => {
      if (language) {
        const regionNamesInEnglish = new Intl.DisplayNames(['en'], { type: 'region' });
        return regionNamesInEnglish.of(language?.split('-').pop());
      }

      return 'Unknown';
    },
    speak: (text = '', voice = '', component) => {
      window.speechSynthesis.cancel();

      const message = new SpeechSynthesisUtterance(text);

      if (voice) {
        const selectedVoice = component?.state?.voices?.find((voiceOption) => voiceOption?.voiceURI === voice);
        message.voice = selectedVoice;
      }

      speechSynthesis.speak(message);
    },
  },
  events: {
    'submit form': (event, component) => {
      event.preventDefault();

      const text = event?.target?.textToTranslate?.value;
      const voice = event?.target?.voice?.value;
      const hasText = text.trim() !== '';

      if (!hasText) {
        return component.methods.speak('Well you have to say something!', voice);
      }

      component.methods.speak(text, voice);
    },
  },
  css: `
    h4 {
      border-bottom: 1px solid #eee;
      padding-bottom: 20px;
      margin-bottom: 40px;
    }

    select {
      margin-bottom: 20px;
    }

    textarea {
      margin-bottom: 40px;
    }
  `,
  render: ({ state, each, methods }) => {
    return `
      <div>
        <h4>Text to Speech Translator</h4>
        <form>
          <label class="form-label">Voice</label>
          <select class="form-control" name="voice">
            ${each(state?.voices, (voice) => {
              return `
                <option value="${voice.voiceURI}">${voice.name} (${methods.getLanguageName(voice.lang)})</option>
              `;
            })}
          </select>
          <textarea class="form-control" name="textToTranslate" placeholder="Type the text to speak here and then press Speak below."></textarea>
          <button class="btn btn-primary">Speak</button>
        </form>
        <div class="players"></div>
      </div>
    `;
  },
});

export default Index;
To speed us up, we've output the remainder of the code we'll need above—let's step through it.
First, in order to get access to the available voices offered by the API, we need to wait for them to load in the browser. Above our methods option, we've added another option to our component, lifecycle, and to it, we've assigned an onMount() function.
This function is called by Joystick immediately after our component is mounted to the DOM. It's a good way to run code that's dependent on the UI, or, like in this case, a way to listen for and handle global or browser-level events (as opposed to events generated by the HTML rendered by our component).
Before we can get the voices, though, we need to listen for the voiceschanged event, which we do by assigning a handler to window.speechSynthesis.onvoiceschanged. This event fires when the list of voices returned by getVoices() changes, for example once the browser has finished loading them (we're talking about fractions of a second, but just slow enough that we want to wait at the code level).
Inside of onMount, we assign that property to a function that will be called when the event fires on window.speechSynthesis. Inside of that function, we call the window.speechSynthesis.getVoices() function, which returns a list of objects describing all of the voices available. So we can use this in our UI, we take the component argument passed to the onMount function and call its setState() function, passing an object with the property voices.
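Depending on the browser, the voices may already be available by the time the component mounts, in which case waiting for the event alone could leave the list empty. Here is a minimal sketch that reads the list once up front and again whenever it changes. The helper name watchVoices is hypothetical, and the sketch assumes the browser's speechSynthesis object supports addEventListener (it is specified as an event target, but older browsers may differ).

```javascript
// Hypothetical helper: report the voice list now (if loaded) and on every change.
// `synth` is expected to be window.speechSynthesis or a compatible object.
function watchVoices(synth, onVoices) {
  const publish = () => {
    const voices = synth.getVoices();

    // An empty list usually means "not loaded yet", so skip it.
    if (voices.length > 0) {
      onVoices(voices);
    }
  };

  publish(); // Covers browsers where the voices are ready immediately.
  synth.addEventListener('voiceschanged', publish);

  // Return a cleanup function so the listener can be removed later.
  return () => synth.removeEventListener('voiceschanged', publish);
}

// Possible use inside onMount (browser only):
// watchVoices(window.speechSynthesis, (voices) => component.setState({ voices }));
```

Using addEventListener instead of assigning onvoiceschanged also avoids overwriting a handler that other code may have set.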
Because we want to assign a state value voices to the contents of the variable const voices here, we can skip writing component.setState({ voices: voices }) and just use the short-hand version.
Important: up above the lifecycle option, notice that we've added another option, state, set to an object and on that object, a property voices set to an empty array. This is the default value for our voices array, which will come into play next down in our render function.
There, we can see that we've updated our render function to use JavaScript destructuring so that we can "pluck off" properties from the argument it's passed (the component instance) for use in the HTML we return from the function.
Here, we're pulling in state, each, and methods. state and methods are the values we set above in the component. each is what's known as a "render function" (not to be confused with the function assigned to the render option on our component).
Like the name suggests, each() is used for looping or iterating over a list and returning some HTML for each item in that list.
Here, we can see the use of JavaScript string interpolation (denoted by the ${} in between the opening and closing of the <select></select> tag) to pass our call to each(). To each(), we pass the list or array (in this case, state.voices) as the first argument and for the second, a function that will be called, receiving the current value being iterated over.
Inside of this function, we want to return some HTML that will be output for each item in the state.voices array.
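To make the idea concrete, here is a rough plain-JavaScript stand-in for this kind of list rendering. It is only an approximation for illustration and not Joystick's implementation of each(); the name renderEach and the sample voice objects are made up.

```javascript
// Rough stand-in for a list-rendering helper. NOT Joystick's implementation.
function renderEach(items = [], renderItem) {
  // Render every item to an HTML string, then join the pieces with no separator.
  return items.map((item, index) => renderItem(item, index)).join('');
}

// Made-up sample data shaped like the objects returned by getVoices().
const sampleVoices = [
  { voiceURI: 'voice-a', name: 'Voice A' },
  { voiceURI: 'voice-b', name: 'Voice B' },
];

const options = renderEach(sampleVoices, (voice) => {
  return `<option value="${voice.voiceURI}">${voice.name}</option>`;
});

console.log(options);
// <option value="voice-a">Voice A</option><option value="voice-b">Voice B</option>
```

With an empty array (our default state), the result is an empty string, which is why the select renders without options until the voices load.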
Because we're inside of a <select></select> tag, we want to render a select option for each of the voices that we got from the Speech Synthesis API. Like we mentioned above, each voice is just a JavaScript object with some properties on it. The ones we care about here are the voice.voiceURI (the unique ID/name of the voice) and voice.name (the literal name of the speaker).
Finally, we also care about the language being spoken. This is passed as lang on each voice object in the form of a standard language tag: a language code usually followed by a region code (e.g., fr-FR or de-DE). In order to get the "friendly" representation of the region (e.g., France or Germany), we need to convert the code. Here, we're calling to a method getLanguageName() defined in our methods object which takes in the voice.lang value and converts it to a human-friendly string.
Looking at that function up top, we take language in as an argument (the string we passed from inside our each()) and if it's not an empty value, create an instance of the Intl.DisplayNames() class (Intl is another global available in the browser), passing it an array of the locales we want the names displayed in (since the author is a yank, just en) and, in the options for the second argument, setting the name type to "region."
With the result of this stored in regionNamesInEnglish, we call that variable's .of() method, passing in the region portion of the language argument passed to our function. To get it, we call the .split('-') method on language to say "split this string in two at the - character" (meaning if we pass en-US we'd get an array like ['en', 'US']) and then, on the resulting array, call the .pop() method to say "pop off the last item and return it to us." In this case, the last item is US as a string, which is the format anticipated by the .of() method.
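One caveat with popping the last piece: if a voice's lang has no region (e.g., a bare language code) or uses a different separator, .of() may receive a value it rejects and throw a RangeError. The following is a defensive sketch under those assumptions. The function name getRegionName is hypothetical, and the underscore handling is a guess at what some platforms might report rather than documented behavior.

```javascript
// Shared formatter: region names displayed in English.
const regionNames = new Intl.DisplayNames(['en'], { type: 'region' });

// Hypothetical, more defensive variant of getLanguageName().
function getRegionName(languageTag = '') {
  // Accept both en-US and en_US style tags (the underscore form is an assumption).
  const subtags = String(languageTag).split(/[-_]/);

  // Skip the first subtag (the language) and look for something shaped like
  // a region: two letters (US) or three digits (419).
  const region = subtags.slice(1).find((subtag) => /^([A-Za-z]{2}|\d{3})$/.test(subtag));

  if (!region) {
    return 'Unknown';
  }

  try {
    return regionNames.of(region.toUpperCase()) || 'Unknown';
  } catch (error) {
    // Intl.DisplayNames throws a RangeError for codes it cannot parse.
    return 'Unknown';
  }
}

console.log(getRegionName('en-US')); // United States
console.log(getRegionName('fr_FR')); // France
console.log(getRegionName('fil')); // Unknown
```

Creating the Intl.DisplayNames instance once, outside the function, also avoids rebuilding it for every option rendered in the list.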
Just one more step. Notice that down in our submit form event handler, we've added a variable for the voice option (using the same technique to retrieve its value as we did for textToTranslate) and then pass that as the second argument to our methods.speak() function.
Back in that function, we add voice as the second argument along with component as the third (Joystick automatically passes component as the last argument to our methods: it would be first if no arguments were passed, or, in this example, third if two arguments are passed).
Inside of our function, we've added an if (voice) check and inside of that, we run a .find() on the state.voices array to say "find us the object with a .voiceURI value equal to the voice argument we passed to the speak function" (this is the voice.voiceURI string we set as the value of the selected <option>, not the voice.lang code). With that, we just set .voice on our message (the SpeechSynthesisUtterance class instance) and the API takes over from there.
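The lookup itself is ordinary array searching, so it can be pulled out and tried without a browser. A minimal sketch, assuming a hypothetical helper named findVoice and made-up voice objects; the value it compares against is whatever the selected <option> carries in its value attribute, which in our render function is voice.voiceURI.

```javascript
// Hypothetical helper: find the voice whose voiceURI matches the selected
// <option> value, or return null when there is no match.
function findVoice(voices = [], voiceURI = '') {
  return voices.find((voiceOption) => voiceOption?.voiceURI === voiceURI) || null;
}

// Made-up data shaped like the objects returned by getVoices().
const availableVoices = [
  { voiceURI: 'voice-a', name: 'Voice A', lang: 'en-US' },
  { voiceURI: 'voice-b', name: 'Voice B', lang: 'fr-FR' },
];

console.log(findVoice(availableVoices, 'voice-b')?.name); // Voice B
console.log(findVoice(availableVoices, 'missing')); // null
```

Returning null for a missing voice lines up with the utterance's default: when .voice is null, the browser falls back to its default voice.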
Done! If everything is in its right place, we should have a working text-to-speech translator.
Wrapping Up
In this tutorial, we learned how to write a component using the @joystick.js/ui framework to help us build a text-to-speech translator. We learned how to listen for DOM events and how to tap into the Speech Synthesis API in the browser to speak for us. We also learned about the Intl library built into the browser to help us convert the region code in a voice's language tag into a human-friendly name. Finally, we learned how to dynamically switch voices via the Speech Synthesis API to support different tones and languages.