LLM - TextToText TypeScript SDK Inferences

Infer and InferStreams for LLMs

For examples of the types of outputs to expect, you can visit the LLaMA2-7b demo at OctoAI. The guide below will work equally well for our LLaMA2-13b and LLaMA2-70b endpoints, although the endpoint URLs will need to be updated.

This guide will cover running an inferStream and infer on LLMs, including how to use the healthCheck method and what you can expect as outputs.


LLaMA2-7B: TextToText

Let’s start with a simple example using a pre-accelerated QuickStart template from OctoAI.

Please reference the QuickStart Templates on the TypeScript SDK for details on finding endpoint URLs for QuickStart and cloned templates and the TypeScript SDK Reference for more information on classes.

To align with common LLM conventions, OctoAI supports OpenAI-style inference arguments via both the v1/completions and v1/chat/completions routes. For more information on OpenAI patterns, please see this helpful discussion.
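As an illustration, the two routes differ mainly in how the prompt is supplied. The payloads below are a sketch based on OpenAI conventions, not taken from the OctoAI reference:

```typescript
// Hypothetical request body for v1/completions: a single prompt string.
const completionsRequest = {
  model: "llama-2-7b-chat",
  prompt: "Write a blog about Seattle",
  max_tokens: 256,
};

// Hypothetical request body for v1/chat/completions: role-tagged messages.
const chatCompletionsRequest = {
  model: "llama-2-7b-chat",
  messages: [{ role: "user", content: "Write a blog about Seattle" }],
  max_tokens: 256,
};
```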

When using OpenAI-style APIs with TypeScript, it's recommended to build your response handling around interfaces that model the different response types you can expect, such as ChatCompletion, ChatCompletionChunk, and ChatCompletionChunk.Choice; however, that is outside the scope of this quick demo.
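A minimal sketch of what such interfaces might look like, modeled on the response objects shown later in this guide. The names and a narrowing helper here are illustrative, not part of the SDK:

```typescript
// Hypothetical interfaces modeling the OpenAI-style response shapes used below.
interface ChatMessage {
  role: string;
  content: string;
}

interface ChatCompletion {
  id: string;
  object: "chat.completion";
  created: number;
  model: string;
  choices: { index: number; message: ChatMessage; finish_reason: string | null }[];
}

interface ChatCompletionChunk {
  id: string;
  object: "chat.completion.chunk";
  created: number;
  model: string;
  choices: {
    index: number;
    delta: { role: string | null; content: string | null };
    finish_reason: string | null;
  }[];
}

// Narrowing helper: distinguishes streaming chunks from full completions.
function isChunk(
  response: ChatCompletion | ChatCompletionChunk
): response is ChatCompletionChunk {
  return response.object === "chat.completion.chunk";
}
```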

Client Instantiation for LLaMA2-7B Inferences

The setup for LLaMA2-7B is very similar whether you are streaming tokens or waiting for the full response. Generally, streaming is the preferred experience for LLMs, but there are a number of applications, including prototyping, where not streaming is ideal.

import { Client } from "@octoai/client";

// The client will identify if OCTOAI_TOKEN is set as an environment variable.
// If you choose not to set it, you can also pass the token to the Client constructor:
// const OCTOAI_TOKEN = "API token goes here from the guide on creating an OctoAI API token";
// const client = new Client(OCTOAI_TOKEN);
const client = new Client();

const llamaEndpoint =
    "https://llama-2-7b-chat-demo-kk0powt97tmb.octoai.run/v1/chat/completions";

const llamaHealthEndpoint =
    "https://llama-2-7b-chat-demo-kk0powt97tmb.octoai.run/v1/models";

// Inputs will also contain a "stream" value, set to true for streaming and false for non-streaming.
const inputs = {
    model: "llama-2-7b-chat",
    messages: [
        {
            role: "assistant",
            content:
                "Below is an instruction that describes a task. Write a response that appropriately completes the request.",
        },
        { role: "user", content: "Write a blog about Seattle" },
    ],
    max_tokens: 256,
};

LLaMA2-7B: TextToText Non-Streaming Inference

Using the above instantiation, we can run a non-streaming inference fairly simply.

const noStreamInputs = { ...inputs, stream: false };

async function logNoStreamLlamaContent() {
    if ((await client.healthCheck(llamaHealthEndpoint)) === 200) {
        const outputs: any = await client.infer(llamaEndpoint, noStreamInputs);
        console.log(outputs.choices[0].message.content);
    }
}

logNoStreamLlamaContent();

Your output should be something like:

 Title: Discovering the Hidden Gems of Seattle: A Blogger's Guide

Hey there, fellow travel enthusiasts! As a blogger, I'm always on the lookout for new and 
exciting destinations to explore. And let me tell you, Seattle has got it going on! From its vibrant 
culture to its stunning natural beauty, this Pacific Northwest city has something for everyone. 
In this blog, I'll share my top picks for things to do, see, and eat in Seattle, so you can plan 
your next adventure.
1. Pike Place Market: This historic market is a must-visit for any foodie. With over 500 farmers, 
  producers, and craftspeople, you'll find everything from fresh seafood to artisanal cheeses. 
And don't forget to check out the famous fish throwing!
2. Space Needle: For a panoramic view of the city, head to the top of the iconic Space Needle. 
On a clear day, you can see Mount Rainier, Puget Sound, and the Olympic Mountains. Plus, the museum
inside is filled with fun exhibits and

Non-Streaming Object

If you were to log outputs instead, you'll have additional information that may be useful, including the id of the completion, the object type, and the reason it finished. Interfaces, which are outside the scope of this example for simplicity's sake, are very useful for handling outputs.

{
  id: 'cmpl-c63a4f365c3e4001b6e51ed0d05e1854',
  object: 'chat.completion',
  created: 1694723087,
  model: 'llama-2-7b-chat',
  choices: [ { index: 0, message: [Object], finish_reason: 'length' } ],
  usage: { prompt_tokens: 39, total_tokens: 295, completion_tokens: 256 }
}

outputs.choices[0].message contains an object that looks something like:

{
  role: 'assistant',
  content: " Title: Discovering the Emerald City: Exploring Seattle's Hidden Gems\n" +
    '\n' +
    'As the largest city in the Pacific Northwest, Seattle is a bustling metropolis that offers a unique blend of culture, history, and natural beauty. From the iconic Space Needle to the bustling Pike Place Market, there are plenty of well-known attractions that draw visitors to this vibrant city. However, there are also many hidden gems waiting to be discovered, waiting beyond the surface-level experiences that make Seattle a truly special place. In this blog, we will delve into some of the lesser-known aspects of Seattle and explore the hidden corners that make this city so unique.\n' +
    '1. The Chihuly Garden and Glass Exhibit:\n' +
    "Located in the heart of Seattle's waterfront area, the Chihuly Garden and Glass Exhibit is a must-visit for art lovers. This stunning exhibit features the works of renowned glass artist Dale Chihuly in an indoor-outdoor setting that showcases his breathtaking creations. Visitors can explore the gardens, which are filled with colorful glass sculpt"
}
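One small check you can make on the usage block shown earlier: total_tokens should equal prompt_tokens plus completion_tokens. A sketch, where the Usage interface is hypothetical and simply mirrors the logged object:

```typescript
// Hypothetical interface mirroring the usage object in the logged output.
interface Usage {
  prompt_tokens: number;
  total_tokens: number;
  completion_tokens: number;
}

// Returns true when the reported token counts are internally consistent.
function usageIsConsistent(usage: Usage): boolean {
  return usage.prompt_tokens + usage.completion_tokens === usage.total_tokens;
}
```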

Inference Streams for LLMs Using the TypeScript SDK

Streaming inferences in the SDK, with endpoints that support this feature, let you create LLM experiences where you can watch output appear as additional tokens are produced. You can visit the LLaMA-2-7b chat demo page to get an idea of the kind of experience this enables. This feature is currently only available for supported LLMs such as LLaMA-2-7b and Vicuna-7b.

For a web app, depending on your framework, you'll likely manage this parsing through an event of some sort and set the text to be updated as new messages come in. For simplicity's sake, this example simply logs the contents to the console as additional token contents are concatenated.

// Use the same initial snippet to instantiate the client and setup inputs and endpoints.
// After, you can continue from here.
const streamInputs = { ...inputs, stream: true };

let text = ``;

async function logLlamaTokenContents() {
    const readableStream = await client.inferStream(llamaEndpoint, streamInputs);
    const streamReader = readableStream.getReader();
    for (
        let { value, done } = await streamReader.read();
        ;
        ({ value, done } = await streamReader.read())
    ) {
        // Termination of HuggingFace style APIs
        if (done) break;
        const decoded = new TextDecoder().decode(value);
        // Termination of OpenAI style APIs
        if (
            decoded === "data: [DONE]\n" ||
            decoded.includes('"finish_reason": "')
        ) {
            break;
        }
        // Because we receive a text stream, parse each chunk into a usable object.
        const token = JSON.parse(decoded.substring(5));
        if (token.object === "chat.completion.chunk") {
            text += token.choices[0].delta.content;
            console.log(text);
        }
    }
}

logLlamaTokenContents();

Stream contents

The initial call will return something similar to Promise { <pending> } to let you know the inference is running.

With the exception of the initial and final objects in the stream, each token is an object similar to:

{
  id: 'cmpl-ed00000d00000b54b5449936e0000000',
  object: 'chat.completion.chunk',
  created: 100000000,
  model: 'llama-2-7b-chat',
  choices: [ { index: 0, delta: { role: null, content: 'scene' }, finish_reason: null } ]
}

Each token object with object: 'chat.completion.chunk' carries content you can use, and its id references this specific stream. In the above example, we simply check that the object type matches, then append the token's content to our text string and log it.
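The per-chunk handling can be factored into a small helper. This is a sketch against the chunk shape shown above, not SDK code:

```typescript
// Sketch: pull usable text out of one parsed stream token, if any.
// Returns an empty string for non-chunk objects or empty deltas.
function chunkContent(token: { object: string; choices?: any[] }): string {
  if (token.object === "chat.completion.chunk" && token.choices?.length) {
    return token.choices[0].delta?.content ?? "";
  }
  return "";
}
```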

Stream end

For HuggingFace style APIs, the termination of the stream is indicated by the done value returned by streamReader.read().

For OpenAI style APIs, we first need to decode the contents using TextDecoder, then check whether the decoded value is "data: [DONE]\n" or whether the stream terminated for a stated reason, which is the case when the text stream includes '"finish_reason": "'. The possible reasons are stop, length, or function_call. If "finish_reason" is null, we do not need to terminate the loop. In this example, we take any string value in finish_reason as a reason to terminate parsing.
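The termination logic described above can be expressed as a small predicate on the decoded chunk (a sketch, assuming the OpenAI-style text-stream format shown earlier):

```typescript
// Sketch: decide whether a decoded OpenAI-style stream chunk ends the stream.
// True for the explicit sentinel, or for any non-null finish_reason string.
function isStreamDone(decoded: string): boolean {
  return (
    decoded === "data: [DONE]\n" ||
    decoded.includes('"finish_reason": "')
  );
}
```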