tutorial // Jan 21, 2022

How to Add Text-to-Speech With the HTML5 Speech Synthesis API

How to use the HTML5 speech synthesis API to add text to speech to your app with multiple voice options.

Getting Started

For this tutorial, we're going to use CheatCode's full-stack JavaScript framework, Joystick. Joystick brings together a front-end UI framework with a Node.js back-end for building apps.

To begin, we'll want to install Joystick via NPM. Make sure you're using Node.js 16+ before installing to ensure compatibility (give this tutorial a read first if you need to learn how to install Node.js or run multiple versions on your computer):

Terminal

npm i -g @joystick.js/cli

This will install Joystick globally on your computer. Once it's installed, let's create a fresh project:

Terminal

joystick create app

After a few seconds, you will see a message logged out to cd into your new project and run joystick start:

Terminal

cd app && joystick start

After this, your app should be running and we're ready to get started.

Adding Bootstrap

Digging into the code, first, we want to add the Bootstrap CSS framework to our app. While you don't have to do this, it will make our app look a bit prettier and avoid us having to scramble together CSS for our UI. To do it, we're going to add the Bootstrap CDN link to the /index.html file at the root of our project:

/index.html

<!doctype html>
<html class="no-js" lang="en">
  <head>
    <meta charset="utf-8">
    <title>Joystick</title>
    <meta name="description" content="An awesome JavaScript app that's under development.">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta name="theme-color" content="#FFCC00">
    <link rel="apple-touch-icon" href="/apple-touch-icon-152x152.png">
    <link rel="stylesheet" href="/_joystick/index.css">
    <link rel="manifest" href="/manifest.json">
    <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.1.3/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-1BmE4kWBq78iYhFldvKuhfTAU6auU8tT94WrHftjDbrCEXSU1oBoqyl2QvZ6jIW3" crossorigin="anonymous">
    ${css}
  </head>
  <body>
    ...
  </body>
</html>

Here, just above the ${css} part in the file, we've pasted in the <link> tag from the Bootstrap documentation that gives us access to the CSS portion of the framework.

That's it. Joystick will automatically restart and load this into the browser so we can start using it.

Wiring up a Joystick component with text to speech

In a Joystick app, our UI is built using the framework's built-in UI library @joystick.js/ui. When we ran joystick create app above, we were given some example components to work with. We're going to overwrite the /ui/pages/index/index.js file with some HTML that will serve as the UI for our translator.

/ui/pages/index/index.js

import ui from '@joystick.js/ui';

const Index = ui.component({
  css: `
    h4 {
      border-bottom: 1px solid #eee;
      padding-bottom: 20px;
      margin-bottom: 40px;
    }

    textarea {
      margin-bottom: 40px;
    }
  `,
  render: () => {
    return `
      <div>
        <h4>Text to Speech Translator</h4>
        <form>
          <textarea class="form-control" name="textToTranslate" placeholder="Type the text to speak here and then press Speak below."></textarea>
          <button class="btn btn-primary">Speak</button>
        </form>
        <div class="players"></div>
      </div>
    `;
  },
});

export default Index;

To start, we want to replace the component that was in this file with what we see above. Here, we're defining a simple component with two things: a render function which returns a string of HTML that we want to show in the browser and above it, a string of css that we want to apply to the HTML we're rendering (Joystick automatically scopes the CSS we pass here to the HTML returned by our render function).

If we load up http://localhost:2600 in a browser (port 2600 is where Joystick starts by default when we run joystick start), we should see the Bootstrap-styled version of the HTML above.

/ui/pages/index/index.js

import ui from '@joystick.js/ui';

const Index = ui.component({
  events: {
    'submit form': (event, component) => {
      event.preventDefault();

      const text = event?.target?.textToTranslate?.value || '';
      const hasText = text.trim() !== '';

      if (!hasText) {
        return component.methods.speak('Well you have to say something!');
      }

      component.methods.speak(text);
    },
  },
  css: `...`,
  render: () => {
    return `
      <div>
        <h4>Text to Speech Translator</h4>
        <form>
          <textarea class="form-control" name="textToTranslate" placeholder="Type the text to speak here and then press Speak below."></textarea>
          <button class="btn btn-primary">Speak</button>
        </form>
        <div class="players"></div>
      </div>
    `;
  },
});

export default Index;

Next, we want to add an events object to our component. Like the name implies, this is where we define event listeners for our component. Here, we're defining a listener for the submit event on the <form></form> element being rendered by our component. Just like our CSS, Joystick automatically scopes our events to the HTML being rendered.

Assigned to that submit form property on our events object is a function that will be called whenever the submit event is detected on our <form></form>.

Inside of that function, first, we take in the event (this is the browser DOM event) as the first argument and immediately call event.preventDefault() on it. This prevents the browser from attempting to submit the form over HTTP to the URL in the form's action attribute, which would reload the page. Like the name suggests, this is the default behavior for browsers (we don't have an action attribute on our form as we want to control the submission via JavaScript).

Next, once this is halted, we want to get the value typed into our <textarea></textarea>. To do it, we can reference the textToTranslate property on the event.target object. Here, event.target refers to the <form></form> element as it's rendered in the browser (its in-memory representation).

We can access textToTranslate because the browser automatically assigns all fields within a form to it in memory using the field's name attribute as the property name. If we look closely at our <textarea></textarea>, we can see that it has the name attribute textToTranslate. If we changed this to pizza, we'd write event?.target?.pizza?.value instead.
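As a rough sketch of the same idea outside of Joystick, the helper below reads a field's value by name through the form's elements collection, which is a more explicit equivalent of event.target.textToTranslate. The readField name and the plain object standing in for a form are assumptions for illustration; in the browser we'd pass the real <form></form> element.

```javascript
// Hypothetical helper (not part of Joystick): read a named field's value
// from a form, falling back to an empty string if the field is missing.
const readField = (form, name) => {
  // form.elements is the collection of form controls, keyed by name (or id).
  const field = form?.elements?.[name];
  return typeof field?.value === 'string' ? field.value : '';
};

// Usage with a stand-in object shaped like a <form> (for illustration only).
const fakeForm = { elements: { textToTranslate: { value: 'Hello there' } } };

console.log(readField(fakeForm, 'textToTranslate')); // "Hello there"
console.log(readField(fakeForm, 'pizza') === ''); // true
```

Returning an empty string for a missing field means the caller can always call string methods like .trim() on the result.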

With that value stored in the text variable, next, we create another variable hasText which contains a check to make sure that our text variable isn't an empty string (the .trim() part here "trims off" any whitespace characters in case the user just hit the space bar over and over).

If we don't have any text in the input, we want to "speak" the phrase "Well you have to say something!" Assuming we did get some text, we just want to "speak" that text value.
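The check described above can be pulled into a small pure function, sketched here. The getTextToSpeak and FALLBACK_MESSAGE names are assumptions for illustration, not something the component requires.

```javascript
// Hypothetical helper: decide what to speak for a given raw input value.
const FALLBACK_MESSAGE = 'Well you have to say something!';

const getTextToSpeak = (rawValue, fallback = FALLBACK_MESSAGE) => {
  // Coerce missing values to a string so .trim() is always safe to call.
  const text = typeof rawValue === 'string' ? rawValue : '';

  // Whitespace-only input counts as "no text".
  return text.trim() !== '' ? text : fallback;
};

console.log(getTextToSpeak('Hello')); // "Hello"
console.log(getTextToSpeak('   ')); // "Well you have to say something!"
```

Keeping this logic separate from the event handler makes it easy to test without a browser.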

Notice that here we're calling to component.methods.speak which we haven't defined yet. We'll tap into Joystick's methods feature (where we can define miscellaneous functions on our component).

/ui/pages/index/index.js

import ui from '@joystick.js/ui';

const Index = ui.component({
  methods: {
    speak: (text = '') => {  
      window.speechSynthesis.cancel();

      const message = new SpeechSynthesisUtterance(text);

      speechSynthesis.speak(message);
    },
  },
  events: {
    'submit form': (event, component) => {
      event.preventDefault();

      const text = event?.target?.textToTranslate?.value || '';
      const hasText = text.trim() !== '';

      if (!hasText) {
        return component.methods.speak('Well you have to say something!');
      }

      component.methods.speak(text);
    },
  },
  css: `...`,
  render: () => {
    return `
      <div>
        <h4>Text to Speech Translator</h4>
        <form>
          <textarea class="form-control" name="textToTranslate" placeholder="Type the text to speak here and then press Speak below."></textarea>
          <button class="btn btn-primary">Speak</button>
        </form>
        <div class="players"></div>
      </div>
    `;
  },
});

export default Index;

Now for the fun part. Because the Speech Synthesis API is implemented in browsers (see compatibility here—it's quite good), we don't have to install or import anything; the entire API is accessible globally in the browser.

Adding a methods object just above our events, we're assigning the speak method that we called to from our submit form event handler.

Inside, there's not much to do:

  1. In case we change the text we've typed in and click the "Speak" button mid-playback, we want to call the window.speechSynthesis.cancel() method to tell the API to clear its playback queue. If we don't do this, it will just append playback to its queue and continue to play what we passed it (even past a browser refresh).
  2. Create an instance of SpeechSynthesisUtterance() which is a class that takes in the text we want to speak.
  3. Pass that instance to the speechSynthesis.speak() method.

That's it. As soon as we type some text in the box and hit "Speak," your browser (assuming it supports the API) should start blabbing.
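For browsers that don't support the API, the three steps above can be wrapped in a feature check. This is a minimal sketch, not the version used in our component: the synth and Utterance parameters are assumptions added so the function can be exercised outside of a browser, and by default they resolve to the browser globals.

```javascript
// Sketch of speak() with feature detection. Returns true if the text was
// queued for playback and false if the API isn't available.
const speak = (
  text = '',
  synth = globalThis.speechSynthesis,
  Utterance = globalThis.SpeechSynthesisUtterance,
) => {
  // Bail out quietly if the environment doesn't expose the API.
  if (!synth || typeof Utterance !== 'function') {
    return false;
  }

  synth.cancel(); // 1. Clear anything queued or currently playing.
  const message = new Utterance(text); // 2. Wrap the text in an utterance.
  synth.speak(message); // 3. Queue the utterance for playback.

  return true;
};

// In a browser: speak('Hello from the Speech Synthesis API');
```

The boolean return value gives the UI a way to show a message when speech isn't supported instead of failing silently.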

Awesome. But we're not quite done. Believe it or not, the Speech Synthesis API also includes the option to use different voices. Next, we're going to update the HTML returned by our render function to include a list of voices to choose from and update methods.speak to accept different voices.

/ui/pages/index/index.js

import ui from '@joystick.js/ui';

const Index = ui.component({
  state: {
    voices: [],
  },
  lifecycle: {
    onMount: (component) => {
      window.speechSynthesis.onvoiceschanged = () => {
        const voices = window.speechSynthesis.getVoices();
        component.setState({ voices });
      };
    },
  },
  methods: {
    getLanguageName: (language = '') => {
      if (language) {
        const regionNamesInEnglish = new Intl.DisplayNames(['en'], { type: 'region' });
        return regionNamesInEnglish.of(language?.split('-').pop());
      }

      return 'Unknown';
    },
    speak: (text = '', voice = '', component) => {  
      window.speechSynthesis.cancel();

      const message = new SpeechSynthesisUtterance(text);

      if (voice) {
        const selectedVoice = component?.state?.voices?.find((voiceOption) => voiceOption?.voiceURI === voice);
        message.voice = selectedVoice;
      }

      speechSynthesis.speak(message);
    },
  },
  events: {
    'submit form': (event, component) => {
      event.preventDefault();
      const text = event?.target?.textToTranslate?.value || '';
      const voice = event?.target?.voice?.value;
      const hasText = text.trim() !== '';

      if (!hasText) {
        return component.methods.speak('Well you have to say something!', voice);
      }

      component.methods.speak(text, voice);
    },
  },
  css: `
    h4 {
      border-bottom: 1px solid #eee;
      padding-bottom: 20px;
      margin-bottom: 40px;
    }

    select {
      margin-bottom: 20px;
    }

    textarea {
      margin-bottom: 40px;
    }
  `,
  render: ({ state, each, methods }) => {
    return `
      <div>
        <h4>Text to Speech Translator</h4>
        <form>
          <label class="form-label">Voice</label>
          <select class="form-control" name="voice">
            ${each(state?.voices, (voice) => {
              return `
                <option value="${voice.voiceURI}">${voice.name} (${methods.getLanguageName(voice.lang)})</option>
              `;
            })}
          </select>
          <textarea class="form-control" name="textToTranslate" placeholder="Type the text to speak here and then press Speak below."></textarea>
          <button class="btn btn-primary">Speak</button>
        </form>
        <div class="players"></div>
      </div>
    `;
  },
});

export default Index;

To speed us up, we've output the remainder of the code we'll need above—let's step through it.

First, in order to get access to the available voices offered by the API, we need to wait for them to load in the browser. Above our methods option, we've added another option to our component, lifecycle, and to it, we've assigned an onMount() function.

This function is called by Joystick immediately after our component is mounted to the DOM. It's a good way to run code that's dependent on the UI, or, like in this case, a way to listen for and handle global or browser-level events (as opposed to events generated by the HTML rendered by our component).

Before we can get the voices, though, we need to listen for the voiceschanged event by assigning a handler to window.speechSynthesis.onvoiceschanged. This event is fired as soon as the voices are loaded (we're talking about fractions of a second, but just slow enough that we want to wait at the code level).

Inside of onMount, we assign a function to that property that will be called when the event fires. Inside of that function, we call the window.speechSynthesis.getVoices() function which returns a list of objects describing all of the voices available. So that we can use this list in our UI, we take the component argument passed to the onMount function and call its setState() function, passing an object with the property voices.
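Depending on the browser, the list of voices may already be available by the time our component mounts, in which case waiting only for the event could leave the list empty. A more defensive sketch reads the list once right away and also listens for later changes. The loadVoices name and the synth parameter are assumptions for illustration.

```javascript
// Sketch: load voices whether or not they're ready yet.
const loadVoices = (onVoices, synth = globalThis.speechSynthesis) => {
  // Do nothing if the environment doesn't expose the API.
  if (!synth) {
    return;
  }

  const publish = () => {
    const voices = synth.getVoices();

    // Only report non-empty lists so the UI isn't reset to "no voices".
    if (voices.length > 0) {
      onVoices(voices);
    }
  };

  publish(); // Covers browsers where the list is already available.
  synth.onvoiceschanged = publish; // Covers browsers that load it later.
};

// Inside of onMount, this might be used as:
// loadVoices((voices) => component.setState({ voices }));
```

Calling publish() in both places keeps a single code path for reading the voices, no matter which one ends up delivering them.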

Because the state value we want to set, voices, has the same name as the variable const voices here, we can skip writing component.setState({ voices: voices }) and just use the short-hand version.

Important: up above the lifecycle option, notice that we've added another option state set to an object and on that object, a property voices set to an empty array. This is the default value for our voices array, which will come into play next down in our render function.

There, we can see that we've updated our render function to use JavaScript destructuring so that we can "pluck off" properties from the argument it's passed—the component instance—for use in the HTML we return from the function.

Here, we're pulling in state, each, and methods. state and methods are the values we set above in the component. each is what's known as a "render function" (not to be confused with the function assigned to the render option on our component).

Like the name suggests, each() is used for looping or iterating over a list and returning some HTML for each item in that list.

Here, we can see the use of JavaScript string interpolation (denoted by the ${} in between the opening and closing of the <select></select> tag) to pass our call to each(). To each(), we pass the list or array (in this case, state.voices) as the first argument and for the second, a function that will be called, receiving the current value being iterated over.

Inside of this function, we want to return some HTML that will be output for each item in the state.voices array.

Because we're inside of a <select></select> tag, we want to render a select option for each of the voices that we got from the Speech Synthesis API. Like we mentioned above, each voice is just a JavaScript object with some properties on it. The ones we care about here are the voice.voiceURI (the unique ID/name of the voice) and voice.name (the literal name of the speaker).
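To make the result of that loop concrete, here's a plain JavaScript sketch that builds the same kind of <option> strings with .map() and .join(). It only mirrors the visible output and isn't how each() is implemented. The escapeHTML helper is an extra precaution we've assumed here, and the sample voices are made up.

```javascript
// Escape characters that would otherwise break out of the HTML we're building.
const escapeHTML = (value = '') =>
  String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');

// Build one <option> string per voice and join them into a single block of HTML.
const renderVoiceOptions = (voices = []) =>
  voices
    .map((voice) => `<option value="${escapeHTML(voice.voiceURI)}">${escapeHTML(voice.name)}</option>`)
    .join('');

// Stand-in data shaped like the voice objects from the API (names are made up).
const sampleVoices = [
  { voiceURI: 'Example Voice A', name: 'Example Voice A', lang: 'en-US' },
  { voiceURI: 'Example Voice B', name: 'Example Voice B', lang: 'fr-FR' },
];

console.log(renderVoiceOptions(sampleVoices));
```

Using voiceURI as the option's value matters later: it's the string our submit handler reads from the form and passes along to speak().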

Finally, we also care about the language being spoken. This is passed as lang on each voice object in the form of a BCP 47 language tag such as en-US or fr-FR (an ISO language code followed by an ISO region code). In order to get the "friendly" representation of the region (e.g., France or Germany), we need to convert the code. Here, we're calling to a method getLanguageName() defined in our methods object which takes in the voice.lang value and converts it to a human-friendly string.

Looking at that function up top, we take language in as an argument (the string we passed from inside our each()) and if it's not an empty value, create an instance of the Intl.DisplayNames() class (Intl is another global available in the browser), passing it an array of locales we want the names displayed in (since the author is a yank, just en) and in the options for the second argument, setting the name type to "region."

With the result of this stored in regionNamesInEnglish, we call that variable's .of() method, passing in the region portion of the language argument passed to our function. To get it, we call the .split('-') method on language to say "split this string at each - character" (meaning if we pass en-US we'd get an array like ['en', 'US']) and then, on the resulting array, call the .pop() method to say "pop off the last item and return it to us." In this case, the last item is US as a string, which is the format anticipated by the .of() method.
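Splitting on - works for tags like en-US, but the last piece isn't always a region (for example, a tag may end in a script like Latn, or have no region at all). As a hedged alternative, the sketch below asks Intl.Locale for the region instead of guessing. The getRegionName name is an assumption for illustration and isn't part of our component.

```javascript
// Reuse one formatter for every lookup.
const regionNamesInEnglish = new Intl.DisplayNames(['en'], { type: 'region' });

// Sketch of a more defensive version of getLanguageName().
const getRegionName = (languageTag = '') => {
  if (!languageTag) {
    return 'Unknown';
  }

  try {
    // Normalize underscore-separated tags (e.g., "en_US") before parsing.
    const { region } = new Intl.Locale(languageTag.replace(/_/g, '-'));
    return region ? regionNamesInEnglish.of(region) : 'Unknown';
  } catch (error) {
    // Intl.Locale throws a RangeError for malformed tags.
    return 'Unknown';
  }
};

console.log(getRegionName('en-US')); // "United States"
console.log(getRegionName('fr-FR')); // "France"
console.log(getRegionName('sr-Latn')); // "Unknown"
```

The try/catch keeps one odd voice from breaking the whole <select></select>; anything that can't be parsed is just labeled "Unknown."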

Just one more step. Notice that down in our submit form event handler, we've added a variable for the voice option (using the same technique to retrieve its value as we did for textToTranslate) and then pass that as the second argument to our methods.speak() function.

Back in that function, we add voice as the second argument along with component as the third (Joystick automatically passes component as the last argument to our methods: it would be first if no arguments were passed, or, in this example, third if two arguments are passed).
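To picture the "last argument" behavior, here's a small illustration. This is not Joystick's source code, just one hypothetical way a framework could append a component instance to whatever arguments a method is called with. The bindComponent, describeArgs, and fakeComponent names are all made up for this example.

```javascript
// Illustration only: wrap a method so that the component is always appended
// as the final argument, after whatever the caller passed.
const bindComponent = (method, component) => (...args) => method(...args, component);

// Stand-in component and a method that reports how many arguments it received.
const fakeComponent = { state: { voices: [] } };
const describeArgs = (...args) => args.length;

const bound = bindComponent(describeArgs, fakeComponent);

console.log(bound()); // 1 (component is the first and only argument)
console.log(bound('Hello', 'voice-id')); // 3 (component is the third argument)
```

If the behavior works as described, the position of component depends on how many arguments the caller passes, which is why our event handler always passes both text and voice to speak().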

Inside of our function, we've added an if (voice) check and inside of that, we run a .find() on the state.voices array to say "find us the object with a .voiceURI value equal to the voice argument we passed to the speak function" (this is the voice.voiceURI string we set as the value of the selected <option>). With that, we just set .voice on our message (the SpeechSynthesisUtterance class instance) and the API takes over from there.
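The lookup can also be written as a standalone function with a fallback, sketched below. The findVoice name and the sample data are assumptions for illustration; the default property is the flag the API sets on the voice the browser considers its default.

```javascript
// Sketch: resolve the voice to use from the value of the <select>.
// Falls back to the voice flagged as default, then to null (which leaves
// the choice of voice up to the browser).
const findVoice = (voices = [], voiceURI = '') => {
  const selected = voices.find((voiceOption) => voiceOption?.voiceURI === voiceURI);
  const fallback = voices.find((voiceOption) => voiceOption?.default);
  return selected || fallback || null;
};

// Stand-in data shaped like the voice objects from the API (values are made up).
const sampleVoices = [
  { voiceURI: 'voice-a', name: 'Voice A', lang: 'en-US', default: true },
  { voiceURI: 'voice-b', name: 'Voice B', lang: 'fr-FR', default: false },
];

console.log(findVoice(sampleVoices, 'voice-b').name); // "Voice B"
console.log(findVoice(sampleVoices, 'missing').name); // "Voice A"
console.log(findVoice([], 'voice-b')); // null
```

In speak(), the result could be assigned straight to message.voice without the surrounding if check.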

Done! If everything is in its right place, we should have a working text-to-speech translator.

Wrapping Up

In this tutorial, we learned how to write a component using the @joystick.js/ui framework to help us build a text-to-speech translator. We learned how to listen for DOM events and how to tap into the Speech Synthesis API in the browser to speak for us. We also learned about the Intl library built into the browser to help us convert the region code in a voice's language tag into a human-friendly name. Finally, we learned how to dynamically switch voices via the Speech Synthesis API to support different tones and languages.

Written By
Ryan Glover

CEO/CTO @ CheatCode