VoxSDK is a comprehensive toolkit designed to facilitate easy integration of AI-driven speech recognition and synthesis into your applications. With a focus on simplicity and efficiency, VoxSDK offers a set of React hooks and utilities to seamlessly connect with AI services for voice interactions.
- `VoxProvider`: A context provider to encapsulate the SDK's functionalities and make them accessible throughout your React application.
- `useListen`: A hook to capture and transcribe user speech in real time.
- `useSpeak`: A hook for text-to-speech functionality, converting text responses into natural-sounding speech.
Install VoxSDK using npm:

```bash
npm install vox-sdk
```

Or using yarn:

```bash
yarn add vox-sdk
```

Install `tslib`.

Using npm:

```bash
npm install tslib --save-dev
```

Using yarn:

```bash
yarn add tslib -D
```
- To set up VoxSDK, you will need to generate a `speech_key` and `region` from the Azure Portal.
- Visit the Microsoft Azure Portal and create a Speech resource to obtain your `speech_key` and `region`.
- Learn more about text-to-speech and speech-to-text with Microsoft Cognitive Speech Services.
You will need to set up both the server and the client.

- On your server, you will need to create a `GET` endpoint at `/token`.
- Using the `speech_key` and `region`, you will generate an authorization token from Microsoft's APIs.
- Set these values in the `.env` file as `SPEECH_KEY` and `SPEECH_REGION` (an example `.env` appears after the server code below).
- The `/token` endpoint should return the following response: `{ token: string, region: string }`
- Here's a sample implementation of the `/token` endpoint:
```js
import express from "express";
import cors from "cors";
import "dotenv/config";
import axios from "axios";

const app = express();

app.use(
  cors({
    origin: process.env.FRONTEND_URL,
  })
);

let token = null;
const speechKey = process.env.SPEECH_KEY;
const speechRegion = process.env.SPEECH_REGION;

// Fetches a fresh authorization token from Microsoft's issueToken endpoint.
const getToken = async () => {
  try {
    const headers = {
      headers: {
        "Ocp-Apim-Subscription-Key": speechKey,
        "Content-Type": "application/x-www-form-urlencoded",
      },
    };
    const tokenResponse = await axios.post(`https://${speechRegion}.api.cognitive.microsoft.com/sts/v1.0/issueToken`, null, headers);
    token = tokenResponse.data;
  } catch (error) {
    console.error("Error while getting token:", error);
  }
};

app.get("/token", async (req, res) => {
  try {
    res.setHeader("Content-Type", "application/json");
    // When the client asks for a token refresh
    const refreshTheToken = req.query?.refresh;
    if (!token || refreshTheToken) {
      await getToken();
    }
    res.send({
      token: token,
      region: speechRegion,
    });
  } catch (error) {
    console.error("Error while handling /token request:", error);
    res.status(500).send({ error: "An error occurred while processing your request." });
  }
});

app.listen(8080, () => console.log("Server running on port 8080"));
```
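The server above reads its credentials and allowed origin from the `.env` file. A minimal example with placeholder values (substitute the key and region from your own Azure Speech resource):

```env
SPEECH_KEY=your-azure-speech-key
SPEECH_REGION=your-azure-region
FRONTEND_URL=http://localhost:3000
```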
- For detailed documentation, you can visit the sample app here.
- Wrap your application with `VoxProvider` to make the SDK available throughout your app:

```jsx
import { VoxProvider } from "vox-sdk";

function App() {
  return <VoxProvider>{/* Your app components go here */}</VoxProvider>;
}

export default App;
```
- `VoxProvider` expects a `config` object which includes:
  - `baseUrl`: URL to your backend, e.g. `https://exampleapp.com`. Ensure that the `/token` route serves the token and region.
  - `onAuthRefresh`: A callback function that is invoked when any authentication error occurs or the token expires.
  - `headersForBaseUrl`: Option to pass request headers (e.g. a Bearer authentication token) for calls made to the `baseUrl`.
- Here's the implementation of the above two steps:

```jsx
<VoxProvider
  config={{
    baseUrl: "https://exampleapp.com",
    onAuthRefresh: async () => {
      const { data } = await axios.get("https://exampleapp.com/token?refresh=true");
      return { token: data.token, region: data.region };
    },
    headersForBaseUrl: {
      // ... Bearer authentication token or other config
    },
  }}
>
  <App />
</VoxProvider>
```
- The `onAuthRefresh` callback will refresh the token and return it with the region.
- For more details, you can visit the sample app implementation here.

After setting up the server and `VoxProvider`, we are ready to use `useListen` and `useSpeak`.
Integrate speech-to-text functionality in your components:
```jsx
import { useListen } from "vox-sdk";
import React from "react";

const SpeechToText = () => {
  const { answers, loading, startSpeechRecognition, stopSpeechRecognition } = useListen({
    onEndOfSpeech: () => {
      console.log(answers);
    },
    automatedEnd: true,
    delay: 1000,
  });

  return (
    <>
      <button disabled={loading} onClick={startSpeechRecognition}>
        Start Listening
      </button>
      <button onClick={stopSpeechRecognition}>Stop Listening</button>
    </>
  );
};

export default SpeechToText;
```
- `automatedEnd`:
  - Expects a boolean value; the default is `true`.
  - When the user finishes speaking, the hook will automatically start the speech-to-text conversion.
  - To listen continuously until the user clicks `stopSpeechRecognition`, pass `false` (see the sketch after this list).
- `delay`:
  - Expects a value in milliseconds.
  - This is the debounce duration for listening to the user.
  - The default is set to 2000ms.
- `onEndOfSpeech`:
  - Expects a callback function that is invoked when speech ends.
- `startSpeechRecognition`: Function to start speech recognition.
- `stopSpeechRecognition`: Function to stop speech recognition.
- `answers`: Returns an array of strings containing all the transcribed text.
- `answer`: The last transcribed text.
- `recognizerRef`: A recognizer instance from `microsoft-cognitiveservices-speech-sdk`.
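For example, to keep listening until the user explicitly stops, `automatedEnd` can be set to `false`. Below is a minimal sketch built only from the options and return values listed above (the component name and logging are illustrative, not part of the SDK):

```jsx
import React from "react";
import { useListen } from "vox-sdk";

// Continuous listening: automatedEnd is false, so recognition keeps running
// until stopSpeechRecognition is called explicitly.
const ContinuousListening = () => {
  const { answer, answers, loading, startSpeechRecognition, stopSpeechRecognition } = useListen({
    automatedEnd: false,
    onEndOfSpeech: () => {
      console.log("Transcript so far:", answers);
    },
  });

  return (
    <>
      <button disabled={loading} onClick={startSpeechRecognition}>
        Start Listening
      </button>
      <button onClick={stopSpeechRecognition}>Stop Listening</button>
      <p>Last transcribed text: {answer}</p>
    </>
  );
};

export default ContinuousListening;
```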
Implement text-to-speech in your application:
```jsx
import React, { useState } from "react";
import { useSpeak, SpeechVoices } from "vox-sdk";

const TextToSpeech = () => {
  const [text, setText] = useState("");
  const { interruptSpeech, speak, isSpeaking } = useSpeak({
    onEnd: () => {
      console.log("Speech ended");
    },
    shouldCallOnEnd: true,
    throttleDelay: 1000,
    voice: SpeechVoices.enUSAIGenerate1Neural, // AI voices
  });

  return (
    <>
      <h3>Text To Speech</h3>
      <input type="text" onChange={(e) => setText(e.target.value)} value={text} />
      <button
        onClick={() => {
          speak(text);
        }}
        disabled={isSpeaking}
      >
        Start Speaking
      </button>
      <button
        disabled={!isSpeaking}
        onClick={() => {
          interruptSpeech();
        }}
      >
        Stop Speaking
      </button>
    </>
  );
};

export default TextToSpeech;
```
- `voice`:
  - Expects a string value.
  - Choose your preferred AI voice from Microsoft Azure.
  - Here's the list of available voices:

```ts
export enum SpeechVoices {
  // Arabic
  arAEFatimaNeural = "ar-AE-FatimaNeural",
  arBHAliNeural = "ar-BH-AliNeural",
  arEGSalmaNeural = "ar-EG-SalmaNeural",
  arJOTaimNeural = "ar-JO-TaimNeural",
  arKWFahedNeural = "ar-KW-FahedNeural",
  arLYImanNeural = "ar-LY-ImanNeural",
  arQAAmalNeural = "ar-QA-AmalNeural",
  arSAHamedNeural = "ar-SA-HamedNeural",
  arSYAmanyNeural = "ar-SY-AmanyNeural",
  arTNHediNeural = "ar-TN-HediNeural",
  arYEMaryamNeural = "ar-YE-MaryamNeural",
  // Chinese
  zhCNXiaoxiaoNeural = "zh-CN-XiaoxiaoNeural",
  zhCNYunxiNeural = "zh-CN-YunxiNeural",
  zhCNYunyeNeural = "zh-CN-YunyeNeural",
  zhHKHiuGaaiNeural = "zh-HK-HiuGaaiNeural",
  zhHKHiuMaanNeural = "zh-HK-HiuMaanNeural",
  zhTWHsiaoChenNeural = "zh-TW-HsiaoChenNeural",
  zhTWHsiaoYuNeural = "zh-TW-HsiaoYuNeural",
  // Danish
  daDKChristelNeural = "da-DK-ChristelNeural",
  daDKJeppeNeural = "da-DK-JeppeNeural",
  // Dutch
  nlBEArnaudNeural = "nl-BE-ArnaudNeural",
  nlBEDenaNeural = "nl-BE-DenaNeural",
  nlNLColetteNeural = "nl-NL-ColetteNeural",
  nlNLFennaNeural = "nl-NL-FennaNeural",
  // English (Australia)
  enAUNatashaNeural = "en-AU-NatashaNeural",
  enAUWilliamNeural = "en-AU-WilliamNeural",
  // English (Canada)
  enCAClaraNeural = "en-CA-ClaraNeural",
  enCALiamNeural = "en-CA-LiamNeural",
  // English (India)
  enINNeerjaNeural = "en-IN-NeerjaNeural",
  enINPrabhatNeural = "en-IN-PrabhatNeural",
  // English (UK)
  enGBLibbyNeural = "en-GB-LibbyNeural",
  enGBRyanNeural = "en-GB-RyanNeural",
  // English (US)
  enUSAIGenerate1Neural = "en-US-AIGenerate1Neural",
  enUSAmberNeural = "en-US-AmberNeural",
  enUSAriaNeural = "en-US-AriaNeural",
  enUSAshleyNeural = "en-US-AshleyNeural",
  enUSBrandonNeural = "en-US-BrandonNeural",
  enUSChristopherNeural = "en-US-ChristopherNeural",
  enUSCoraNeural = "en-US-CoraNeural",
  enUSDavisNeural = "en-US-DavisNeural",
  enUSElizabethNeural = "en-US-ElizabethNeural",
  enUSEricNeural = "en-US-EricNeural",
  enUSGuyNeural = "en-US-GuyNeural",
  enUSJacobNeural = "en-US-JacobNeural",
  enUSJasonNeural = "en-US-JasonNeural",
  enUSJennyNeural = "en-US-JennyNeural",
  enUSMichelleNeural = "en-US-MichelleNeural",
  enUSMonicaNeural = "en-US-MonicaNeural",
  enUSSaraNeural = "en-US-SaraNeural",
  enUSTonyNeural = "en-US-TonyNeural",
  // Finnish
  fiFINooraNeural = "fi-FI-NooraNeural",
  fiFISelmaNeural = "fi-FI-SelmaNeural",
  // French (Canada)
  frCADiegoNeural = "fr-CA-DiegoNeural",
  frCAFelixNeural = "fr-CA-FelixNeural",
  frCAJeanNeural = "fr-CA-JeanNeural",
  frCASylvieNeural = "fr-CA-SylvieNeural",
  // French (France)
  frFRDeniseNeural = "fr-FR-DeniseNeural",
  frFREloiseNeural = "fr-FR-EloiseNeural",
  frFRHenriNeural = "fr-FR-HenriNeural",
  // German
  deDEKatjaNeural = "de-DE-KatjaNeural",
  deDEKillianNeural = "de-DE-KillianNeural",
  // Greek
  elGRAthinaNeural = "el-GR-AthinaNeural",
  elGRNestorasNeural = "el-GR-NestorasNeural",
  // Hindi
  hiINMadhurNeural = "hi-IN-MadhurNeural",
  hiINSwaraNeural = "hi-IN-SwaraNeural",
  // Italian
  itITDiegoNeural = "it-IT-DiegoNeural",
  itITElsaNeural = "it-IT-ElsaNeural",
  // Japanese
  jaJPAoiNeural = "ja-JP-AoiNeural",
  jaJPNanamiNeural = "ja-JP-NanamiNeural",
  // Korean
  koKRInJoonNeural = "ko-KR-InJoonNeural",
  koKRSunHiNeural = "ko-KR-SunHiNeural",
  // Portuguese (Brazil)
  ptBRFranciscaNeural = "pt-BR-FranciscaNeural",
  ptBRAntonioNeural = "pt-BR-AntonioNeural",
  // Russian
  ruRUDmitryNeural = "ru-RU-DmitryNeural",
  ruRUSvetlanaNeural = "ru-RU-SvetlanaNeural",
  // Spanish (Mexico)
  esMXJorgeNeural = "es-MX-JorgeNeural",
  esMXDaliaNeural = "es-MX-DaliaNeural",
  // Spanish (Spain)
  esESElviraNeural = "es-ES-ElviraNeural",
  esESAlvaroNeural = "es-ES-AlvaroNeural",
  // Swedish
  svSESofieNeural = "sv-SE-SofieNeural",
  svSEMattiasNeural = "sv-SE-MattiasNeural",
}
```
- `throttleDelay`:
  - Expects a value in milliseconds.
  - This is the throttle duration applied to speech requests.
  - The default is set to 2000ms.
- `onEnd`:
  - Expects a callback function that is invoked when the AI speech ends.
  - To invoke this, set `shouldCallOnEnd` to `true`.
- `speak`:
  - Function to start text-to-speech.
  - Expects a string argument to be converted to speech.
- `interruptSpeech`:
  - Function to stop the AI speech.
- `hasAllSentencesBeenSpoken`:
  - Returns a boolean value indicating whether all sentences have been spoken.
- `isSpeaking`:
  - Returns a boolean value indicating if the AI is speaking.
- `streamedSentences`:
  - Returns an array of strings with all streamed sentences (see the sketch after this list).
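As a rough illustration of the remaining return values, the sketch below (the component name and UI are illustrative, not part of the SDK) uses `isSpeaking`, `hasAllSentencesBeenSpoken`, and `streamedSentences` to report playback status:

```jsx
import React from "react";
import { useSpeak, SpeechVoices } from "vox-sdk";

// Speaks a fixed sentence and reports progress using the hook's return values.
const SpeakStatus = () => {
  const { speak, interruptSpeech, isSpeaking, hasAllSentencesBeenSpoken, streamedSentences } = useSpeak({
    voice: SpeechVoices.enGBLibbyNeural,
    shouldCallOnEnd: true,
    onEnd: () => console.log("Finished speaking"),
  });

  return (
    <>
      <button disabled={isSpeaking} onClick={() => speak("Hello from VoxSDK!")}>
        Speak
      </button>
      <button disabled={!isSpeaking} onClick={interruptSpeech}>
        Stop
      </button>
      <p>Status: {hasAllSentencesBeenSpoken ? "done" : isSpeaking ? "speaking" : "idle"}</p>
      <ul>
        {streamedSentences.map((sentence, i) => (
          <li key={i}>{sentence}</li>
        ))}
      </ul>
    </>
  );
};

export default SpeakStatus;
```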
Contributions are welcome! Please read our Contributing Guide for more information.
This project is licensed under the MIT License.