Whisper - Speech Recognition on the TypeScript SDK
For examples of the types of outputs to expect, you can visit the Whisper Demo at OctoAI.
This guide will cover the basics of running an inference to convert speech to text using the TypeScript SDK, including using our healthCheck method to ensure the endpoint is healthy before sending it any requests.
As a next step, Whisper also works well with Asynchronous Inference Using the TypeScript SDK; a minimal sketch of that flow appears after the transcription example below.
Requirements
- Please follow How to create an OctoAI API token if you don't have one already.
- Please follow the TypeScript SDK Installation & Setup guide to install the TypeScript SDK.
Whisper: Speech Recognition
Let's use another QuickStart pre-accelerated template example. Whisper is a speech recognition model that converts audio to text. As with Stable Diffusion, we'll use base64 encoding, converting an .mp3 or .wav file into a base64 string. For more information on all the ways you can customize your Whisper inferences, please see the Whisper OpenAPI documentation.
Please reference the QuickStart Templates on the TypeScript SDK for details on finding endpoint URLs for QuickStart and cloned templates and the TypeScript SDK Reference for more information on specific methods.
import { Client } from "@octoai/client";
import { readFileSync } from "fs";

const whisperEndpoint = "https://whisper-demo-kk0powt97tmb.octoai.run/predict";
const whisperHealthCheck = "https://whisper-demo-kk0powt97tmb.octoai.run/healthcheck";

// This instantiation approach takes your OCTOAI_TOKEN as an environment variable
// If you have not set it as an envvar, you can use the below instead
// const OCTOAI_TOKEN = "API token here from following the token creation guide";
// const client = new Client(OCTOAI_TOKEN);
const client = new Client();

// First, we need to convert an audio file to a base64 string.
const audio = readFileSync("./octo_poem.wav", {
  encoding: "base64",
});

// These are the inputs we will send to the endpoint, including the audio base64 string.
const inputs = {
  language: "en",
  task: "transcribe",
  audio: audio,
};

async function wavToText() {
  // Only run the inference if the endpoint reports healthy (HTTP 200)
  if ((await client.healthCheck(whisperHealthCheck)) === 200) {
    const outputs: any = await client.infer(whisperEndpoint, inputs);
    console.log(outputs.transcription);
  }
}

wavToText().then();
With this particular test file, the following is printed to the console:
Once upon a time, an AI system was asked to come up with a poem for an octopus.
It said something along the lines of, Octopus, you are very wise.
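For longer audio files you may not want to hold a connection open while the model runs. Below is a minimal sketch of the asynchronous flow mentioned above; it assumes the inferAsync, isFutureReady, and getFutureResult methods covered in the Asynchronous Inference Using the TypeScript SDK guide, which remains the authoritative reference for that API.

async function wavToTextAsync() {
  // inferAsync returns a future immediately rather than waiting for the result
  const future = await client.inferAsync(whisperEndpoint, inputs);
  // Poll until the endpoint reports the future as ready, then fetch the result
  while (!(await client.isFutureReady(future))) {
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
  const outputs: any = await client.getFutureResult(future);
  console.log(outputs.transcription);
}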
Whisper Outputs
The above outputs variable returns JSON in something like the following format:
{
  prediction_time_ms: 626.42526,
  response: {
    segments: [ [Object] ],
    word_segments: [ [Object] ]
  },
  transcription: ' Once upon a time, an AI system was asked to come up with a poem for an octopus. It said something along the lines of, Octopus, you are very wise.'
}
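If you prefer type safety over any, you can describe this shape with a few small interfaces. The field names below mirror the output above and the segment and word_segment examples that follow; this is an illustrative typing, not an official SDK type.

// Illustrative types for the outputs shape shown above
interface WhisperWord {
  word: string;
  start: number;
  end: number;
  score: number;
  speaker: string | null;
}

interface WhisperSegment {
  start: number;
  end: number;
  text: string;
  words: WhisperWord[];
  speaker: string | null;
}

interface WhisperOutputs {
  prediction_time_ms: number;
  response: {
    segments: WhisperSegment[];
    word_segments: WhisperWord[];
  };
  transcription: string;
}

// In wavToText, you could then cast instead of using `any`:
// const outputs = (await client.infer(whisperEndpoint, inputs)) as WhisperOutputs;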
Each segment is an object that looks something like:
{
  start: 5.553,
  end: 8.66,
  text: 'It said something along the lines of, Octopus, you are very wise.',
  words: [
    {
      word: 'It',
      start: 5.553,
      end: 5.633,
      score: 0.945,
      speaker: null
    },
    {
      word: 'said',
      start: 5.653,
      end: 5.814,
      score: 0.328,
      speaker: null
    },
    // etc...
  ],
  speaker: null
}
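Since each segment carries its own start and end times, you can, for example, print a simple timestamped transcript. A short sketch, assuming the illustrative WhisperOutputs typing above:

// Print each segment with its start/end times,
// e.g. "[5.55-8.66] It said something along the lines of, Octopus, you are very wise."
function printTimestamps(outputs: WhisperOutputs) {
  for (const segment of outputs.response.segments) {
    console.log(`[${segment.start.toFixed(2)}-${segment.end.toFixed(2)}] ${segment.text}`);
  }
}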
Each word_segment is an object that looks something like:
{ word: 'Once', start: 0.783, end: 0.903, score: 0.883, speaker: null }
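Because each word_segment includes a confidence score, one practical use is flagging low-confidence words for review. A small sketch, again assuming the illustrative types above:

// Collect word_segments whose recognition score falls below a threshold
function lowConfidenceWords(outputs: WhisperOutputs, threshold = 0.5): WhisperWord[] {
  return outputs.response.word_segments.filter((w) => w.score < threshold);
}
// With the sample transcription above, 'said' (score: 0.328) would be flagged.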