Study notes
Speech service
- Speech-to-Text (core API): enables speech recognition, so your application can accept spoken input.
- Text-to-Speech (core API): enables speech synthesis, so your application can provide spoken output.
- Speech Translation
- Speaker Recognition
- Intent Recognition
- Create a resource (dedicated Speech service or Cognitive Services)
- Get the resource location and one key (Resource Keys/Endpoint)
1. Speech-to-Text
Processing: interactive (real-time) or batch.
In practice, most interactive speech-enabled applications use the Speech service through a (programming) language-specific SDK.
Speech service supports speech recognition via:
- Speech-to-text API, which is the primary way to perform speech recognition.
- Speech-to-text Short Audio API, which is optimized for short streams of audio (up to 60 seconds).
- SpeechConfig object to encapsulate the information required to connect to your Speech resource (location & key)
- AudioConfig object (optional) to define the input source for the audio to be transcribed (microphone or audio file)
- Result Reason property: RecognizedSpeech (success), NoMatch, or Canceled
- Text property contains the transcript
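A minimal sketch of these steps with the C# Speech SDK; the key, region, and input source below are placeholder assumptions, not values from these notes:

```csharp
using System;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

// Connection details are placeholders for your own resource.
var speechConfig = SpeechConfig.FromSubscription("<your-key>", "<your-region>");

// Input source: default microphone (use AudioConfig.FromWavFileInput for a file).
using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

SpeechRecognitionResult result = await recognizer.RecognizeOnceAsync();
if (result.Reason == ResultReason.RecognizedSpeech)
{
    Console.WriteLine($"Transcript: {result.Text}");
}
else
{
    // NoMatch (no recognizable speech) or Canceled (e.g. bad key/region).
    Console.WriteLine($"Recognition failed: {result.Reason}");
}
```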
2. Text-to-Speech
Speech service offers two APIs for speech synthesis (spoken output from text):
- Text-to-speech API, which is the primary way to perform speech synthesis.
- Text-to-speech Long Audio API, which is designed to support batch operations that convert large volumes of text to audio.
- SpeechConfig object to encapsulate the information required to connect to your Speech resource (location & key)
- AudioConfig object (optional) to define the output device for the speech to be synthesized (default system speaker; a null value returns the audio as an in-memory stream instead)
- Reason property is set to the SynthesizingAudioCompleted enumeration.
- AudioData property contains the audio stream.
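A matching synthesis sketch, again with placeholder key/region; passing a null AudioConfig instead would return the audio in AudioData rather than playing it:

```csharp
using System;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

var speechConfig = SpeechConfig.FromSubscription("<your-key>", "<your-region>");

// Output device: the default system speaker.
using var audioConfig = AudioConfig.FromDefaultSpeakerOutput();
using var synthesizer = new SpeechSynthesizer(speechConfig, audioConfig);

SpeechSynthesisResult result = await synthesizer.SpeakTextAsync("Hello, world!");
if (result.Reason == ResultReason.SynthesizingAudioCompleted)
{
    // AudioData holds the synthesized audio stream as bytes.
    Console.WriteLine($"Synthesized {result.AudioData.Length} bytes of audio.");
}
```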
Audio output format (configured via SpeechConfig)
Speech service supports multiple output formats for the audio stream generated by speech synthesis.
Depending on your specific needs, you can choose a format based on the required:
- Audio file type
- Sample-rate
- Bit-depth
Speech service provides multiple voices that you can use to personalize your speech-enabled applications:
- Standard voices - synthetic voices created from audio samples.
- Neural voices - more natural sounding voices created using deep neural networks.
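For example (the format enum value and voice name here are illustrative choices, not requirements):

```csharp
using Microsoft.CognitiveServices.Speech;

var speechConfig = SpeechConfig.FromSubscription("<your-key>", "<your-region>");

// File type, sample rate, and bit depth are all encoded in the enum value
// (see the SpeechSynthesisOutputFormat reference under Resources).
speechConfig.SetSpeechSynthesisOutputFormat(
    SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm);

// A neural voice; standard voices are selected the same way.
speechConfig.SpeechSynthesisVoiceName = "en-US-AriaNeural";
```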
Speech Synthesis Markup Language
Speech service usage:
- The Speech SDK enables you to submit plain text to be synthesized into speech (via the SpeakTextAsync() method)
- Speech Synthesis Markup Language (SSML) - XML-based syntax for describing characteristics of the speech you want to generate:
- Specify a speaking style (excited, cheerful...)
- Insert pauses or silence.
- Specify phonemes (phonetic pronunciations)
- Adjust the prosody of the voice (affecting the pitch, timbre, and speaking rate).
- Use common "say-as" rules (phone no, date...)
- Insert recorded speech or audio (include a standard recorded message)
- SpeakSsmlAsync() - submit the SSML description to the Speech service.
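A short SSML sketch (the voice name and speaking style are illustrative; styles are only supported by certain neural voices):

```csharp
using Microsoft.CognitiveServices.Speech;

var speechConfig = SpeechConfig.FromSubscription("<your-key>", "<your-region>");
using var synthesizer = new SpeechSynthesizer(speechConfig); // default speaker output

// A speaking style, a pause, and a "say-as" rule in one document.
string ssml = @"
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis'
       xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'>
  <voice name='en-US-AriaNeural'>
    <mstts:express-as style='cheerful'>Hello!</mstts:express-as>
    <break strength='weak'/>
    Call <say-as interpret-as='telephone'>555-0100</say-as> today.
  </voice>
</speak>";

SpeechSynthesisResult result = await synthesizer.SpeakSsmlAsync(ssml);
```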
3. Translate speech
Built on speech recognition:
- Recognize and transcribe spoken input in a specified language
- Return translations of the transcription in one or more other languages
- A Speech or Cognitive Services resource must already be created.
- Have its location and one key (as above).
SpeechConfig object - information required to connect to your Speech resource (location, key)
SpeechTranslationConfig object (input language, target languages)
Returned if successful:
- Reason property has the enumerated value RecognizedSpeech
- Text property contains the transcription in the original language
- Translations property contains a dictionary of the translations (using the two-character ISO language code, such as "en" for English, as a key).
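Putting the translation steps together in the C# SDK (key, region, and languages are placeholders; note the C# enum for a successful translation is ResultReason.TranslatedSpeech):

```csharp
using System;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Translation;

var translationConfig =
    SpeechTranslationConfig.FromSubscription("<your-key>", "<your-region>");
translationConfig.SpeechRecognitionLanguage = "en-US"; // input language
translationConfig.AddTargetLanguage("fr");             // target language(s)
translationConfig.AddTargetLanguage("es");

using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
using var recognizer = new TranslationRecognizer(translationConfig, audioConfig);

TranslationRecognitionResult result = await recognizer.RecognizeOnceAsync();
if (result.Reason == ResultReason.TranslatedSpeech)
{
    Console.WriteLine($"Original: {result.Text}");
    foreach (var pair in result.Translations)
        Console.WriteLine($"{pair.Key}: {pair.Value}"); // keyed by ISO code, e.g. "fr"
}
```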
Event-based synthesis
For 1:1 translation (a single target language), you can use event-based synthesis to capture the translation as an audio stream:
- Specify the desired voice for the translated speech in the TranslationConfig.
- Create an event handler for the TranslationRecognizer object's Synthesizing event.
- In the event handler, use the GetAudio() method of the Result parameter to retrieve the audio.
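A sketch of those three steps (the voice name is an illustrative assumption):

```csharp
using System;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Microsoft.CognitiveServices.Speech.Translation;

var translationConfig =
    SpeechTranslationConfig.FromSubscription("<your-key>", "<your-region>");
translationConfig.SpeechRecognitionLanguage = "en-US";
translationConfig.AddTargetLanguage("fr");          // exactly one target for 1:1
translationConfig.VoiceName = "fr-FR-DeniseNeural"; // desired voice (illustrative)

using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
using var recognizer = new TranslationRecognizer(translationConfig, audioConfig);

// Synthesizing fires as chunks of translated audio become available.
recognizer.Synthesizing += (sender, e) =>
{
    byte[] audio = e.Result.GetAudio(); // retrieve the audio from the Result
    if (audio.Length > 0)
        Console.WriteLine($"Received {audio.Length} bytes of translated audio.");
};

await recognizer.RecognizeOnceAsync();
```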
Manual synthesis
Manual synthesis doesn't require you to implement an event handler. You can use it to generate audio translations for one or more target languages:
- Use a TranslationRecognizer to translate spoken input into text transcriptions in one or more target languages.
- Iterate through the Translations dictionary in the result of the translation operation, using a SpeechSynthesizer to synthesize an audio stream for each language.
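A sketch of the manual approach, continuing from the translation example above ('recognizer' already has targets "fr" and "es"; the per-language voice mapping is illustrative):

```csharp
using System;
using System.Collections.Generic;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Translation;

TranslationRecognitionResult result = await recognizer.RecognizeOnceAsync();

// Illustrative mapping from target language code to a matching voice.
var voices = new Dictionary<string, string>
{
    ["fr"] = "fr-FR-DeniseNeural",
    ["es"] = "es-ES-ElviraNeural",
};

foreach (var pair in result.Translations)
{
    var synthConfig = SpeechConfig.FromSubscription("<your-key>", "<your-region>");
    synthConfig.SpeechSynthesisVoiceName = voices[pair.Key];

    using var synthesizer = new SpeechSynthesizer(synthConfig); // default speaker
    await synthesizer.SpeakTextAsync(pair.Value); // speak each translation
}
```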
Resources
Process and Translate Speech with Azure Cognitive Speech Services - Training | Microsoft Learn
Speech-to-text REST API - Speech service - Azure Cognitive Services | Microsoft Learn
Use the text-to-speech API - Training | Microsoft Learn
SpeechSynthesisOutputFormat Enum (Microsoft.CognitiveServices.Speech) - Azure for .NET Developers | Microsoft Learn
Language support - Speech service - Azure Cognitive Services | Microsoft Learn