LLM - TextToText Python SDK Inferences and Streaming with LLaMA2-7B
For examples of the types of outputs to expect, you can visit the LLaMA2-7b demo at OctoAI. The guide below works equally well for our LLaMA2-70b and LLaMA2-13b endpoints, which can be found here (although the endpoint URLs will need to be updated).
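Only the host portion of the endpoint URL changes between model sizes; the route stays the same. A sketch using placeholder hostnames (these are not real endpoints; copy yours from the endpoint page):

```python
# Placeholder hostnames -- substitute the real ones from your OctoAI endpoint pages
llama2_13b_url = "https://<your-llama-2-13b-endpoint>.octoai.run/v1/chat/completions"
llama2_70b_url = "https://<your-llama-2-70b-endpoint>.octoai.run/v1/chat/completions"
```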
Requirements
- Please follow How to create an OctoAI API token if you don't have one already.
- Please also verify you've completed Python SDK Installation & Setup.
- If you use the `OCTOAI_TOKEN` envvar for your token, you can instantiate the client with `client = Client()`.
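For example, with the envvar exported in your shell, the token argument can be omitted entirely. A minimal sketch, assuming `OCTOAI_TOKEN` is set in the environment:

```python
import os

from octoai.client import Client

# Assumes you have run `export OCTOAI_TOKEN=<your token>` in your shell
assert os.environ.get("OCTOAI_TOKEN"), "Set OCTOAI_TOKEN before running"
client = Client()  # the client picks the token up from the environment
```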
LLaMA2-7B: TextToText
Let’s start with a simple example using a pre-accelerated QuickStart template from OctoAI.
Please reference the Overview: QuickStart Templates on SDK to Run Inferences for details on finding endpoint URLs for QuickStart and cloned templates, and the Python SDK Reference for more information on classes.
To align with established LLM conventions, OctoAI supports OpenAI-style inference arguments via both the `v1/completions` and `v1/chat/completions` routes. For more information on OpenAI patterns, please see this helpful discussion.
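For reference, the non-chat `v1/completions` route takes a single `prompt` string where `v1/chat/completions` takes a `messages` list. A sketch of the two payload shapes (field names follow the OpenAI convention; values are illustrative):

```python
# POST <endpoint>/v1/chat/completions
chat_inputs = {
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Write a blog about Seattle"}],
    "max_tokens": 256,
}

# POST <endpoint>/v1/completions
completion_inputs = {
    "model": "llama-2-7b-chat",
    "prompt": "Write a blog about Seattle",
    "max_tokens": 256,
}
```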
Llama2-7B: TextToText
```python
from octoai.client import Client

OCTOAI_TOKEN = "API Token goes here from guide on creating OctoAI API token"

# The client will also identify if OCTOAI_TOKEN is set as an environment variable
client = Client(token=OCTOAI_TOKEN)

llama2_7b_url = "https://llama-2-7b-chat-demo-kk0powt97tmb.octoai.run/v1/chat/completions"
llama2_7b_health_url = "https://llama-2-7b-chat-demo-kk0powt97tmb.octoai.run/v1/models"

inputs = {
    "model": "llama-2-7b-chat",
    "messages": [
        {
            "role": "system",
            "content": "Below is an instruction that describes a task. Write a response that appropriately completes the request."
        },
        {
            "role": "user",
            "content": "Write a blog about Seattle"
        }
    ],
    "stream": False,
    "max_tokens": 256
}

# For llama2, you'll replace the quickstart template endpoint URL.
if client.health_check(llama2_7b_health_url) == 200:
    outputs = client.infer(endpoint_url=llama2_7b_url, inputs=inputs)

    # Parse Llama2 outputs and print
    text = outputs.get('choices')[0].get("message").get('content')
    print(text)
```
Your output should be something like:
The city of Seattle is the largest in the state of Washington.
It’s located in Western Washington and is known as the birthplace of Starbucks, Microsoft, and
Boeing. It’s a very diverse city that has an interesting blend of culture and nature.
Seattle has the highest number of microbreweries, coffee shops, and bookstores per capita,
so you’ll find plenty of opportunities to indulge yourself, or simply take a stroll through
Pike’s Place Market, which is one of the city’s main attractions.
There are also lots of great restaurants, especially if seafood and fresh ingredients are your
thing. Seattle was ranked among the top 3 best places in the U.S to live.
The city also boasts of a beautiful skyline with views of snow-capped mountains and Puget Sound.
The city is also a major transportation hub so traveling to and from Seattle is easy.
Seattle’s public transportation includes a light rail system that connects most neighborhoods
to the downtown areas, an extensive bus network and a subway system. Seattle’s roadways consist
of the Interstate System and State Highways.
Non-Streaming Object
If you were to log `outputs` instead, you'll see additional information that may be useful, including the `id` of the completion, the object type, and the reason it finished.
```python
{
    'id': 'cmpl-c63a4f365c3e4001b6e51ed0d05e1854',
    'object': 'chat.completion',
    'created': 1694723087,
    'model': 'llama-2-7b-chat',
    'choices': [{'index': 0, 'message': {...}, 'finish_reason': 'length'}],
    'usage': {'prompt_tokens': 39, 'total_tokens': 295, 'completion_tokens': 256}
}
```
`outputs["choices"][0]["message"]` contains an object that looks something like:
```python
{
    'role': 'assistant',
    'content': " Title: Discovering the Emerald City: Exploring Seattle's Hidden Gems\n\nAs the largest city in the Pacific Northwest, Seattle is a bustling metropolis that offers a unique blend of culture, history, and natural beauty. From the iconic Space Needle to the bustling Pike Place Market, there are plenty of well-known attractions that draw visitors to this vibrant city. However, there are also many hidden gems waiting to be discovered, waiting beyond the surface-level experiences that make Seattle a truly special place. In this blog, we will delve into some of the lesser-known aspects of Seattle and explore the hidden corners that make this city so unique.\n1. The Chihuly Garden and Glass Exhibit:\nLocated in the heart of Seattle's waterfront area, the Chihuly Garden and Glass Exhibit is a must-visit for art lovers. This stunning exhibit features the works of renowned glass artist Dale Chihuly in an indoor-outdoor setting that showcases his breathtaking creations. Visitors can explore the gardens, which are filled with colorful glass sculpt"
}
```
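If you want those extra fields programmatically, you can read them straight off the returned dict. A quick sketch, assuming `outputs` is the result of the `client.infer` call above:

```python
# Inspect metadata on the non-streaming response
completion_id = outputs["id"]                            # e.g. 'cmpl-...'
finish_reason = outputs["choices"][0]["finish_reason"]   # 'length' when max_tokens is hit
usage = outputs["usage"]
print(completion_id, finish_reason, usage["completion_tokens"])
```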
Inference Streams for LLMs Using the Python SDK
Streaming inference with endpoints that support this feature lets you build LLM experiences in which text appears as additional tokens are produced. You can view the Llama-2-7b chat demo page to get an idea of the kind of experience this enables. This feature is currently only available for supporting LLMs such as LLaMA-2-7b.
For more specific information about methods in the SDK, please check out the Python SDK Reference.
Requirements
```python
from octoai.client import Client

OCTOAI_TOKEN = "API Token goes here from guide on creating OctoAI API token"

# The client will also identify if OCTOAI_TOKEN is set as an environment variable
client = Client(token=OCTOAI_TOKEN)

llama2_7b_url = "https://llama-2-7b-chat-demo-kk0powt97tmb.octoai.run/v1/chat/completions"

inputs = {
    "model": "llama-2-7b-chat",
    "messages": [
        {
            "role": "system",
            "content": "Below is an instruction that describes a task. Write a response that appropriately completes the request."
        },
        {
            "role": "user",
            "content": "Write a blog about Seattle"
        }
    ],
    "stream": True,
    "max_tokens": 256
}

text = ""
# The inference stream yields token objects containing completion chunks.
for token in client.infer_stream(llama2_7b_url, inputs):
    if token.get("object") == "chat.completion.chunk":
        text += token["choices"][0]["delta"]["content"]

# The public versions of LLMs limit tokens generated.
# Please clone the QuickStart template to remove this limitation.
assert len(text.split(" ")) >= 60
print(text)
```
You should receive a lovely paragraph about Seattle, truncated by the `max_tokens` parameter and by similar limits on the QuickStart templates we host publicly. You'll also find that you are rate limited if you make too many requests to our public endpoint. Please follow the QuickStart Template Endpoints guide to learn how to clone templates and remove this limit.
Title: Discovering the Hidden Gems of Seattle: A Blog
Introduction:
Seattle, the Emerald City, is a hub of creativity, innovation, and natural beauty.
From its stunning skyline to its vibrant culture, Seattle has something to offer for every kind
of traveler. In this blog, we will explore some of the lesser-known gems of Seattle that are
worth discovering. So, pack your bags and get ready to fall in love with this amazing city!
Section 1: Arts and Culture
Seattle is home to a thriving arts and culture scene. Here are a few hidden gems that you won't
want to miss:
1. Fremont Arts Center: Located in the historic Fremont district, this arts center features a
gallery space, a theater, and a gift shop. The center showcases local and regional artists,
and offers a variety of classes and workshops for both adults and children.
2. Center on Contemporary Art: This small but mighty art space in Pioneer Square features a
variety of contemporary art exhibitions, including painting, sculpture, and installation.
The center also offers a
Stream contents
The initial token in the stream returns a `chat.completion` object to let you know the inference is running. On stream end, OpenAI-style APIs return the string `data: [DONE]`. Because we only pay attention to the chunks, we do not need to worry about these particular tokens in this example.
With the exception of the initial and final objects in the stream, each `token` is an object similar to:
```python
{
    'id': 'cmpl-ed00000d00000b54b5449936e0000000',
    'object': 'chat.completion.chunk',
    'created': 100000000,
    'model': 'llama-2-7b-chat',
    'choices': [{'index': 0, 'delta': {'role': None, 'content': 'scene'}, 'finish_reason': None}]
}
```
Each token whose `object` field is `'chat.completion.chunk'` carries content you can use, and each `id` references this specific stream. In the example above, we simply check that the object type matches, append the chunk's content to our `text` string, and print it.
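If you want to be more defensive, you can also skip chunks whose `delta` carries no content and stop as soon as a `finish_reason` appears. A sketch with those precautionary guards (they are not required by the example above):

```python
text = ""
for token in client.infer_stream(llama2_7b_url, inputs):
    # Ignore the initial 'chat.completion' object and any non-chunk tokens
    if token.get("object") != "chat.completion.chunk":
        continue
    choice = token["choices"][0]
    content = choice["delta"].get("content")
    if content:  # role-only or final chunks may carry no content
        text += content
    if choice.get("finish_reason") is not None:
        break  # e.g. 'length' once max_tokens is reached
print(text)
```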