Using Azure Cognitive Services Text to Speech on HoloLens 2

Introduction

I wanted to use Text to Speech from Azure Cognitive Services to provide voice guidance on HoloLens 2, so I looked into how to implement it.
This article walks through the steps.

Development environment

Unity 2020.3.26f1
MRTK 2.7.3
Speech SDK for Unity 1.20.0

Implementation steps

Creating the Unity project

Create the project by following the Microsoft Learn module below.
(You only need to go as far as "Exercise - Import and configure resources".)
Introduction to Mixed Reality Toolkit - Create a Mars Curiosity Rover hologram - Learn | Microsoft Docs

Importing the Speech SDK

Install the Speech SDK into the Unity project. For the procedure, see the official documentation below.
Quickstart: Set up the development environment - Azure Cognitive Services | Microsoft Docs

Creating the script

If you only need audio to come out of a PC's speakers, the following code works, as described in the documentation.

var config = SpeechConfig.FromSubscription("<paste-your-speech-key-here>", "<paste-your-speech-location/region-here>");
using var synthesizer = new SpeechSynthesizer(config);
await synthesizer.SpeakTextAsync("こんにちは");

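The code above works in the Unity Editor, but when I ran it on HoloLens 2, no sound came out of the speakers.
To get sound out of the HoloLens 2 speakers, the audio data returned by SpeakTextAsync has to be played through Unity's "Audio Source" component.
For this part of the implementation I referred to the following sample script.
cognitive-services-speech-sdk/HelloWorld.cs at master · Azure-Samples/cognitive-services-speech-sdk (github.com)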
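
When no sound is heard on the device, it can help to first confirm that synthesis itself succeeded before looking at playback. The following is a minimal sketch of such a check, assuming the Speech SDK's result.Reason property and SpeechSynthesisCancellationDetails.FromResult; the class name SynthesisDiagnostics and the method Describe are hypothetical names introduced only for illustration.

```csharp
using Microsoft.CognitiveServices.Speech;

// Hypothetical helper (not part of the Speech SDK) that summarizes a synthesis result.
public static class SynthesisDiagnostics
{
    public static string Describe(SpeechSynthesisResult result)
    {
        // Synthesis succeeded: AudioData holds the synthesized bytes.
        if (result.Reason == ResultReason.SynthesizingAudioCompleted)
        {
            return $"Synthesis completed: {result.AudioData.Length} bytes of audio";
        }

        // Synthesis was canceled, for example because of a wrong key or region.
        if (result.Reason == ResultReason.Canceled)
        {
            var details = SpeechSynthesisCancellationDetails.FromResult(result);
            return $"Canceled: {details.Reason}, {details.ErrorCode}, {details.ErrorDetails}";
        }

        return $"Unexpected result: {result.Reason}";
    }
}
```

Passing the value returned by SpeakTextAsync to Describe and logging the string with Debug.Log should show whether the problem lies in synthesis or in playback.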

This is the script I ended up with.

using Microsoft.CognitiveServices.Speech;
using System;
using UnityEngine;

public class TextToSpeech : MonoBehaviour
{
    SpeechConfig config;
    AudioSource audioSource;
    string speechSynthesisLanguage = "ja-JP";
    string speakText = "こんにちは";
    string subscriptionKey = "[insert your subscription key]";
    string region = "[insert your region]";

    void Start()
    {
        // Configure Text to Speech
        config = SpeechConfig.FromSubscription(subscriptionKey, region);
        config.SpeechSynthesisLanguage = speechSynthesisLanguage; // Language to speak
        // Request raw PCM. Without this the audio contained noise, presumably because
        // the default RIFF output carries a header that would be played back as samples.
        config.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Raw16Khz16BitMonoPcm);

        // Add a single Audio Source and reuse it, instead of adding one per call
        audioSource = gameObject.AddComponent<AudioSource>();
    }

    public async void SynthesizeAudioAsync()
    {
        try
        {
            // Convert the text to speech and get the result
            using var synthesizer = new SpeechSynthesizer(config, null);
            var result = await synthesizer.SpeakTextAsync(speakText);
            if (result.Reason != ResultReason.SynthesizingAudioCompleted)
            {
                Debug.LogError($"Speech synthesis failed: {result.Reason}");
                return;
            }

            // Convert 16-bit little-endian PCM bytes to floats in [-1, 1)
            var sampleCount = result.AudioData.Length / 2;
            var audioData = new float[sampleCount];
            for (var i = 0; i < sampleCount; ++i)
            {
                audioData[i] = (short)(result.AudioData[i * 2 + 1] << 8 | result.AudioData[i * 2]) / 32768.0F;
            }

            // Play the audio through the Audio Source component.
            // The 16000 Hz mono clip must match the output format set in Start().
            var audioClip = AudioClip.Create("SynthesizedAudio", sampleCount, 1, 16000, false);
            audioClip.SetData(audioData, 0);
            audioSource.clip = audioClip;
            audioSource.Play();

            Debug.Log("Success");
        }
        catch (Exception ex)
        {
            Debug.LogError(ex.Message);
        }
    }
}
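
The byte-to-float conversion in the script is the step most prone to mistakes, so it may be worth isolating. The following is a minimal sketch of that conversion as a standalone function with no Unity dependency, assuming 16-bit little-endian mono PCM as requested via Raw16Khz16BitMonoPcm; the names PcmConverter and Pcm16ToFloat are hypothetical and do not come from the SDK or the sample.

```csharp
using System;

// Hypothetical helper: converts raw 16-bit little-endian PCM bytes
// into float samples in the range [-1, 1), the range Unity's AudioClip.SetData expects.
public static class PcmConverter
{
    public static float[] Pcm16ToFloat(byte[] pcm)
    {
        if (pcm == null) throw new ArgumentNullException(nameof(pcm));

        // Two bytes per sample; a trailing odd byte, if any, is ignored.
        int sampleCount = pcm.Length / 2;
        var samples = new float[sampleCount];

        for (int i = 0; i < sampleCount; i++)
        {
            // Low byte comes first, high byte second (little-endian).
            short value = (short)(pcm[i * 2] | (pcm[i * 2 + 1] << 8));
            samples[i] = value / 32768f;
        }
        return samples;
    }
}
```

For example, the byte pair 0x00 0x80 is the most negative 16-bit value and maps to -1.0, while 0x00 0x00 maps to 0. In the script, the result of PcmConverter.Pcm16ToFloat(result.AudioData) could be passed to audioClip.SetData in place of the inline loop.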

Settings in the Unity Editor

First, right-click in the Hierarchy and select "Create Empty" to create an empty object, then attach the script created above to it.

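Next, create a button that triggers the speech.
Search for "PressableButtonHoloLens2" in the Project window and drag and drop it into the Hierarchy.
Select the button, and under Events → OnClick() in the "Interactable" section of the Inspector, configure it as shown in the image below.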
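
As an alternative to wiring OnClick() in the Inspector, the same connection could probably be made from a script. The following is a minimal sketch, assuming MRTK's Interactable component exposes its OnClick UnityEvent and that the TextToSpeech script above sits on the object assigned in the Inspector; the class name SpeakButtonBinder and its two fields are hypothetical.

```csharp
using Microsoft.MixedReality.Toolkit.UI;
using UnityEngine;

// Hypothetical component: connects the button's OnClick event to the TextToSpeech script.
public class SpeakButtonBinder : MonoBehaviour
{
    // Assign the Interactable on PressableButtonHoloLens2 in the Inspector.
    [SerializeField] Interactable button;

    // Assign the object that has the TextToSpeech script attached.
    [SerializeField] TextToSpeech textToSpeech;

    void OnEnable()
    {
        if (button == null || textToSpeech == null)
        {
            Debug.LogWarning("SpeakButtonBinder: button or textToSpeech is not assigned.");
            return;
        }
        // Call SynthesizeAudioAsync each time the button is clicked.
        button.OnClick.AddListener(textToSpeech.SynthesizeAudioAsync);
    }

    void OnDisable()
    {
        if (button != null && textToSpeech != null)
        {
            // Remove the listener so it is not registered twice when re-enabled.
            button.OnClick.RemoveListener(textToSpeech.SynthesizeAudioAsync);
        }
    }
}
```

Wiring the event in the Inspector keeps the setup visible in the scene, whereas a script like this keeps it under version control alongside the code; either approach should end up calling the same public method.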