LLM - TextToText Python SDK Inferences and Streaming

with LLaMA2-7B

For examples of the types of outputs to expect, you can visit the LLaMA2-7B demo at OctoAI. The guide below works equally well for our LLaMA2-13B and LLaMA2-70B endpoints, which can be found here (although the endpoint URLs will need to be updated).

Requirements
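This guide assumes the OctoAI Python SDK is installed and that you have an OctoAI API token. A typical setup looks like the following (the PyPI package name `octoai-sdk` is an assumption; check the SDK installation guide for the current distribution name):

```shell
# Install the OctoAI Python SDK (package name assumed: octoai-sdk)
pip install octoai-sdk

# Optionally export your token so Client() can pick it up automatically
export OCTOAI_TOKEN=<your API token>
```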

LLaMA2-7B: TextToText

Let’s start with a simple example using a pre-accelerated QuickStart template from OctoAI.

Please reference the Overview: QuickStart Templates on SDK to Run Inferences for details on finding endpoint URLs for QuickStart and cloned templates and the Python SDK Reference for more information on classes.

To align with the standard established by LLMs, OctoAI supports OpenAI-style inference arguments via both the v1/completions and v1/chat/completions routes. For more information on OpenAI patterns, please see this helpful discussion.
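As a rough sketch of the difference between the two routes (field names follow the OpenAI convention; the exact payloads accepted by an endpoint may vary), v1/completions takes a single prompt string while v1/chat/completions takes a list of role-tagged messages:

```python
# Hedged sketch: payload shapes for the two OpenAI-style routes.

# v1/completions takes a single prompt string.
completions_inputs = {
    "model": "llama-2-7b-chat",
    "prompt": "Write a blog about Seattle",
    "max_tokens": 256,
}

# v1/chat/completions takes a list of role-tagged messages.
chat_completions_inputs = {
    "model": "llama-2-7b-chat",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a blog about Seattle"},
    ],
    "max_tokens": 256,
}
```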

Llama2-7B: TextToText

from octoai.client import Client

OCTOAI_TOKEN = "API Token goes here from guide on creating OctoAI API token"
# The client will also identify if OCTOAI_TOKEN is set as an environment variable
client = Client(token=OCTOAI_TOKEN)

llama2_7b_url = "https://llama-2-7b-chat-demo-kk0powt97tmb.octoai.run/v1/chat/completions"
llama2_7b_health_url = "https://llama-2-7b-chat-demo-kk0powt97tmb.octoai.run/v1/models"

inputs = {
  "model": "llama-2-7b-chat",
  "messages": [
    {
      "role": "system",
      "content": "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    },
    {
      "role": "user",
      "content": "Write a blog about Seattle"
    }
  ],
  "stream": False,
  "max_tokens": 256
}

# For LLaMA2, replace the QuickStart template endpoint URL with your own.
if client.health_check(llama2_7b_health_url) == 200:
  outputs = client.infer(endpoint_url=llama2_7b_url, inputs=inputs)

  # Parse the LLaMA2 outputs and print the generated text
  text = outputs.get('choices')[0].get("message").get('content')
  print(text)

Your output should be something like:

The city of Seattle is the largest in the state of Washington.
It’s located in Western Washington and is known as the birthplace of Starbucks, Microsoft, and
Boeing. It’s a very diverse city that has an interesting blend of culture and nature.
Seattle has the highest number of microbreweries, coffee shops, and bookstores per capita,
so you’ll find plenty of opportunities to indulge yourself, or simply take a stroll through
Pike’s Place Market, which is one of the city’s main attractions.
There are also lots of great restaurants, especially if seafood and fresh ingredients are your
thing. Seattle was ranked among the top 3 best places in the U.S to live.
The city also boasts of a beautiful skyline with views of snow-capped mountains and Puget Sound.
The city is also a major transportation hub so traveling to and from Seattle is easy.
Seattle’s public transportation includes a light rail system that connects most neighborhoods
to the downtown areas, an extensive bus network and a subway system. Seattle’s roadways consist
of the Interstate System and State Highways.

Non-Streaming Object

If you were to log outputs instead, you'll see additional information that may be useful, including the id of the completion, the object type, and the reason it finished.

{
  id: 'cmpl-c63a4f365c3e4001b6e51ed0d05e1854',
  object: 'chat.completion',
  created: 1694723087,
  model: 'llama-2-7b-chat',
  choices: [ { index: 0, message: [Object], finish_reason: 'length' } ],
  usage: { prompt_tokens: 39, total_tokens: 295, completion_tokens: 256 }
}

The message under outputs['choices'][0] contains an object that looks something like:

{
  role: 'assistant',
  content: " Title: Discovering the Emerald City: Exploring Seattle's Hidden Gems\n" +
    '\n' +
    'As the largest city in the Pacific Northwest, Seattle is a bustling metropolis that offers a unique blend of culture, history, and natural beauty. From the iconic Space Needle to the bustling Pike Place Market, there are plenty of well-known attractions that draw visitors to this vibrant city. However, there are also many hidden gems waiting to be discovered, waiting beyond the surface-level experiences that make Seattle a truly special place. In this blog, we will delve into some of the lesser-known aspects of Seattle and explore the hidden corners that make this city so unique.\n' +
    '1. The Chihuly Garden and Glass Exhibit:\n' +
    "Located in the heart of Seattle's waterfront area, the Chihuly Garden and Glass Exhibit is a must-visit for art lovers. This stunning exhibit features the works of renowned glass artist Dale Chihuly in an indoor-outdoor setting that showcases his breathtaking creations. Visitors can explore the gardens, which are filled with colorful glass sculpt"
}
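The fields above can be pulled apart in plain Python. As a sketch, `outputs` below is a hand-built dict shaped like the object shown above (with a live endpoint it would come from client.infer, and the id, timestamps, and content would differ):

```python
# Hand-built response dict mirroring the non-streaming object above.
outputs = {
    "id": "cmpl-c63a4f365c3e4001b6e51ed0d05e1854",
    "object": "chat.completion",
    "created": 1694723087,
    "model": "llama-2-7b-chat",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Title: Discovering..."},
            "finish_reason": "length",
        }
    ],
    "usage": {"prompt_tokens": 39, "total_tokens": 295, "completion_tokens": 256},
}

choice = outputs["choices"][0]
text = choice["message"]["content"]

# finish_reason == "length" means generation stopped at max_tokens;
# "stop" would mean the model finished on its own.
truncated = choice["finish_reason"] == "length"
tokens_used = outputs["usage"]["total_tokens"]
```

Checking finish_reason is a cheap way to detect truncated completions and retry with a larger max_tokens.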

Inference Streams for LLMs Using the Python SDK

Streaming inference with endpoints that support this feature lets you build experiences where users see tokens as they are produced. You can view the LLaMA2-7B chat demo page to get an idea of the kind of experience this enables. This feature is currently available only for supported LLMs such as LLaMA2-7B.

For more specific information about methods in the SDK, please check out the Python SDK Reference.

Requirements

from octoai.client import Client

OCTOAI_TOKEN = "API Token goes here from guide on creating OctoAI API token"
# The client will also identify if OCTOAI_TOKEN is set as an environment variable
client = Client(token=OCTOAI_TOKEN)

llama2_7b_url = "https://llama-2-7b-chat-demo-kk0powt97tmb.octoai.run/v1/chat/completions"

inputs = {
  "model": "llama-2-7b-chat",
  "messages": [
    {
      "role": "system",
      "content": "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    },
    {
      "role": "user",
      "content": "Write a blog about Seattle"
    }
  ],
  "stream": True,
  "max_tokens": 256
}
  
text = ""

# infer_stream yields dict tokens; chunks with object == "chat.completion.chunk"
# carry the newly generated content.
for token in client.infer_stream(llama2_7b_url, inputs):
  if token.get("object") == "chat.completion.chunk":
    content = token["choices"][0]["delta"].get("content")
    if content:
      text += content

# The public versions of LLMs limit tokens generated.
# Please clone the QuickStart template to remove this limitation.
assert len(text.split(" ")) >= 60
print(text)

You should receive a lovely paragraph about Seattle, capped in length by the max_tokens parameter and by similar limits on the QuickStart templates we host publicly. You'll also find that you are rate limited if you make too many requests against our public endpoint. Please follow the QuickStart Template Endpoints guide to learn how to clone templates and remove this limit.

Title: Discovering the Hidden Gems of Seattle: A Blog

Introduction:
Seattle, the Emerald City, is a hub of creativity, innovation, and natural beauty.
From its stunning skyline to its vibrant culture, Seattle has something to offer for every kind
of traveler. In this blog, we will explore some of the lesser-known gems of Seattle that are
worth discovering. So, pack your bags and get ready to fall in love with this amazing city!

Section 1: Arts and Culture
Seattle is home to a thriving arts and culture scene. Here are a few hidden gems that you won't
want to miss:
1. Fremont Arts Center: Located in the historic Fremont district, this arts center features a
gallery space, a theater, and a gift shop. The center showcases local and regional artists,
and offers a variety of classes and workshops for both adults and children.
2. Center on Contemporary Art: This small but mighty art space in Pioneer Square features a
variety of contemporary art exhibitions, including painting, sculpture, and installation.
The center also offers a

Stream contents

The initial token in the stream is a chat.completion object letting you know the inference is running. On stream end, OpenAI-style APIs return the string data: [DONE]. By paying attention only to the chat.completion.chunk objects, we do not need to handle these sentinel tokens in this example.

With the exception of the initial and final objects in the stream, each token is an object similar to:

{
  id: 'cmpl-ed00000d00000b54b5449936e0000000',
  object: 'chat.completion.chunk',
  created: 100000000,
  model: 'llama-2-7b-chat',
  choices: [ { index: 0, delta: { role: null, content: 'scene' }, finish_reason: null } ]
}

Each token whose object field is 'chat.completion.chunk' carries a delta with content you can use. The id field identifies this specific stream. In the example above, we simply check that the object matches, append the token's content to our text string, and print it.
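A slightly more defensive consumer can be sketched as follows. The chunk shapes are hypothetical stand-ins for what the stream yields: the first delta in a stream may carry a role with null content, so we guard before concatenating, and non-chunk items (the initial status object and the final "[DONE]" sentinel) are skipped by the object check:

```python
def collect_stream(tokens):
    """Accumulate content from a stream of chat-completion tokens."""
    text = ""
    for token in tokens:
        if not isinstance(token, dict):
            continue  # e.g. the trailing "data: [DONE]" string
        if token.get("object") != "chat.completion.chunk":
            continue  # skip the initial chat.completion status object
        content = token["choices"][0]["delta"].get("content")
        if content:  # first chunk may have role set and content None
            text += content
    return text

# Hypothetical tokens illustrating the shapes described above.
chunks = [
    {"object": "chat.completion", "id": "cmpl-x"},
    {"object": "chat.completion.chunk",
     "choices": [{"index": 0,
                  "delta": {"role": "assistant", "content": None},
                  "finish_reason": None}]},
    {"object": "chat.completion.chunk",
     "choices": [{"index": 0,
                  "delta": {"role": None, "content": "scene"},
                  "finish_reason": None}]},
    "data: [DONE]",
]

# collect_stream(chunks) returns "scene"
```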